* [PATCH v4 0/6] fuse: remove temp page copies in writeback
@ 2024-11-07 23:56 Joanne Koong
2024-11-07 23:56 ` [PATCH v4 1/6] mm: add AS_WRITEBACK_MAY_BLOCK mapping flag Joanne Koong
` (5 more replies)
0 siblings, 6 replies; 32+ messages in thread
From: Joanne Koong @ 2024-11-07 23:56 UTC (permalink / raw)
To: miklos, linux-fsdevel
Cc: shakeel.butt, jefflexu, josef, linux-mm, bernd.schubert, kernel-team
The purpose of this patchset is to help make writeback-cache write
performance in FUSE filesystems as fast as possible.
In the current FUSE writeback design (see commit 3be5a52b30aa
("fuse: support writable mmap"))), a temp page is allocated for every dirty
page to be written back, the contents of the dirty page are copied over to the
temp page, and the temp page gets handed to the server to write back. This is
done so that writeback may be immediately cleared on the dirty page, which is
done in order to mitigate the following deadlock scenario that may arise if
reclaim waits on writeback on the dirty page to complete (more details can be
found in this thread [1]):
* single-threaded FUSE server is in the middle of handling a request
that needs a memory allocation
* memory allocation triggers direct reclaim
* direct reclaim waits on a folio under writeback
* the FUSE server can't write back the folio since it's stuck in
direct reclaim
Allocating and copying dirty pages to temp pages is the biggest performance
bottleneck for FUSE writeback. This patchset aims to get rid of the temp page
altogether (which will also allow us to get rid of the internal FUSE rb tree
that is needed to keep track of writeback status on the temp pages).
Benchmarks show approximately a 20% improvement in throughput for 4k
block-size writes and a 45% improvement for 1M block-size writes.
With the temp page removed, writeback state is now only cleared on the dirty
page after the server has written it back to disk. This may take an
indeterminate amount of time. There is also the possibility of malicious, or
well-intentioned but buggy, servers where writeback may, in the worst case,
never complete. This means that any folio_wait_writeback() on a dirty page
belonging to a FUSE filesystem needs to be carefully audited.
In particular, these are the cases that need to be accounted for:
* potentially deadlocking in reclaim, as mentioned above
* potentially stalling sync(2)
* potentially stalling page migration / compaction
This patchset adds a new mapping flag, AS_WRITEBACK_MAY_BLOCK, which
filesystems may set on their inode mappings to indicate that writeback
operations may take an indeterminate amount of time to complete. FUSE will
set this flag on its mappings. This patchset adds checks to the critical
parts of reclaim, sync, and page migration logic where writeback may be
waited on.
Please note the following:
* For sync(2), waiting on writeback will be skipped for FUSE, but this has no
effect on existing behavior. Dirty FUSE pages are already not guaranteed to
be written to disk by the time sync(2) returns (eg writeback is cleared on
the dirty page but the server may not have written out the temp page to disk
yet). If the caller wishes to ensure the data has actually been synced to
disk, they should use fsync(2)/fdatasync(2) instead.
* AS_WRITEBACK_MAY_BLOCK does not indicate that the folios should never be
waited on when in writeback. There are some cases where the wait is
desirable. For example, for the sync_file_range() syscall, it is fine to
wait on the writeback since the caller passes in an fd for the operation.
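As a rough sketch of the overall pattern (hedged pseudocode; only the
mapping_writeback_may_block() helper below comes from this series, the
surrounding calls are existing kernel APIs and the variable names are
illustrative):

	/* a wait site that must not stall indefinitely skips flagged mappings */
	if (folio_test_writeback(folio) &&
	    mapping_writeback_may_block(folio->mapping))
		return;		/* skip the folio instead of waiting */
	folio_wait_writeback(folio);

	/* a wait site where blocking is acceptable (eg the fsync(2) or
	 * sync_file_range(2) paths) still waits unconditionally */
	filemap_fdatawait_range(mapping, start, end);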
[1]
https://lore.kernel.org/linux-kernel/495d2400-1d96-4924-99d3-8b2952e05fc3@linux.alibaba.com/
Changelog
---------
v3:
https://lore.kernel.org/linux-fsdevel/20241107191618.2011146-1-joannelkoong@gmail.com/
Changes from v3 -> v4:
* Use filemap_fdatawait_range() instead of filemap_range_has_writeback() in
readahead
v2:
https://lore.kernel.org/linux-fsdevel/20241014182228.1941246-1-joannelkoong@gmail.com/
Changes from v2 -> v3:
* Account for sync and page migration cases as well (Miklos)
* Change AS_NO_WRITEBACK_RECLAIM to the more generic AS_WRITEBACK_MAY_BLOCK
* For fuse inodes, set mapping_writeback_may_block only if fc->writeback_cache
is enabled
v1:
https://lore.kernel.org/linux-fsdevel/20241011223434.1307300-1-joannelkoong@gmail.com/T/#t
Changes from v1 -> v2:
* Have flag in "enum mapping_flags" instead of creating asop_flags (Shakeel)
* Set fuse inodes to use AS_NO_WRITEBACK_RECLAIM (Shakeel)
Joanne Koong (6):
mm: add AS_WRITEBACK_MAY_BLOCK mapping flag
mm: skip reclaiming folios in legacy memcg writeback contexts that may
block
fs/writeback: in wait_sb_inodes(), skip wait for
AS_WRITEBACK_MAY_BLOCK mappings
mm/memory-hotplug: add finite retries in offline_pages() if migration
fails
mm/migrate: skip migrating folios under writeback with
AS_WRITEBACK_MAY_BLOCK mappings
fuse: remove tmp folio for writebacks and internal rb tree
fs/fs-writeback.c | 3 +
fs/fuse/file.c | 339 ++++------------------------------------
include/linux/pagemap.h | 11 ++
mm/memory_hotplug.c | 13 +-
mm/migrate.c | 5 +-
mm/vmscan.c | 10 +-
6 files changed, 60 insertions(+), 321 deletions(-)
--
2.43.5
* [PATCH v4 1/6] mm: add AS_WRITEBACK_MAY_BLOCK mapping flag
2024-11-07 23:56 [PATCH v4 0/6] fuse: remove temp page copies in writeback Joanne Koong
@ 2024-11-07 23:56 ` Joanne Koong
2024-11-09 0:10 ` Shakeel Butt
2024-11-07 23:56 ` [PATCH v4 2/6] mm: skip reclaiming folios in legacy memcg writeback contexts that may block Joanne Koong
` (4 subsequent siblings)
5 siblings, 1 reply; 32+ messages in thread
From: Joanne Koong @ 2024-11-07 23:56 UTC (permalink / raw)
To: miklos, linux-fsdevel
Cc: shakeel.butt, jefflexu, josef, linux-mm, bernd.schubert, kernel-team
Add a new mapping flag AS_WRITEBACK_MAY_BLOCK which filesystems may set
to indicate that writeback operations may block or take an indeterminate
amount of time to complete. Extra caution should be taken when waiting
on writeback for folios belonging to mappings where this flag is set.
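For illustration, the FUSE patch later in this series opts its mappings in
at inode init time (excerpted from patch 6, gated on writeback cache mode):

	if (fc->writeback_cache)
		mapping_set_writeback_may_block(&inode->i_data);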
Signed-off-by: Joanne Koong <joannelkoong@gmail.com>
---
include/linux/pagemap.h | 11 +++++++++++
1 file changed, 11 insertions(+)
diff --git a/include/linux/pagemap.h b/include/linux/pagemap.h
index 68a5f1ff3301..eb5a7837e142 100644
--- a/include/linux/pagemap.h
+++ b/include/linux/pagemap.h
@@ -210,6 +210,7 @@ enum mapping_flags {
AS_STABLE_WRITES = 7, /* must wait for writeback before modifying
folio contents */
AS_INACCESSIBLE = 8, /* Do not attempt direct R/W access to the mapping */
+ AS_WRITEBACK_MAY_BLOCK = 9, /* Use caution when waiting on writeback */
/* Bits 16-25 are used for FOLIO_ORDER */
AS_FOLIO_ORDER_BITS = 5,
AS_FOLIO_ORDER_MIN = 16,
@@ -335,6 +336,16 @@ static inline bool mapping_inaccessible(struct address_space *mapping)
return test_bit(AS_INACCESSIBLE, &mapping->flags);
}
+static inline void mapping_set_writeback_may_block(struct address_space *mapping)
+{
+ set_bit(AS_WRITEBACK_MAY_BLOCK, &mapping->flags);
+}
+
+static inline bool mapping_writeback_may_block(struct address_space *mapping)
+{
+ return test_bit(AS_WRITEBACK_MAY_BLOCK, &mapping->flags);
+}
+
static inline gfp_t mapping_gfp_mask(struct address_space * mapping)
{
return mapping->gfp_mask;
--
2.43.5
* [PATCH v4 2/6] mm: skip reclaiming folios in legacy memcg writeback contexts that may block
2024-11-07 23:56 [PATCH v4 0/6] fuse: remove temp page copies in writeback Joanne Koong
2024-11-07 23:56 ` [PATCH v4 1/6] mm: add AS_WRITEBACK_MAY_BLOCK mapping flag Joanne Koong
@ 2024-11-07 23:56 ` Joanne Koong
2024-11-09 0:16 ` Shakeel Butt
2024-11-07 23:56 ` [PATCH v4 3/6] fs/writeback: in wait_sb_inodes(), skip wait for AS_WRITEBACK_MAY_BLOCK mappings Joanne Koong
` (3 subsequent siblings)
5 siblings, 1 reply; 32+ messages in thread
From: Joanne Koong @ 2024-11-07 23:56 UTC (permalink / raw)
To: miklos, linux-fsdevel
Cc: shakeel.butt, jefflexu, josef, linux-mm, bernd.schubert, kernel-team
Currently in shrink_folio_list(), reclaim for folios under writeback
falls into 3 different cases:
1) Reclaim is encountering an excessive number of folios under
writeback and this folio has both the writeback and reclaim flags
set
2) Dirty throttling is enabled (this happens when reclaim is not going
through a cgroup, when reclaim is going through a cgroupv2 memcg, or
when reclaim is on the root cgroup), or the folio is not marked for
immediate reclaim, or the caller does not have __GFP_FS (or
__GFP_IO if it's going to swap) set
3) Legacy cgroupv1 encounters a folio that already has the reclaim flag
set and the caller did not have __GFP_FS (or __GFP_IO if swap) set
In cases 1) and 2), we activate the folio and skip reclaiming it, while
in case 3), we wait for writeback to finish on the folio and then try
to reclaim the folio again. In case 3, we wait on writeback because
cgroupv1 does not have dirty folio throttling; as such, this is a
mitigation for the case where there are too many folios in writeback
and nothing else to reclaim.
The issue is that for filesystems where writeback may block, sub-optimal
workarounds may need to be put in place to avoid a potential deadlock
that may arise from reclaim waiting on writeback. (Even though case 3
above is rare given that legacy cgroupv1 is on its way to being
deprecated, this case still needs to be accounted for). For example, for
FUSE filesystems, a temp page gets allocated per dirty page and the
contents of the dirty page are copied over to the temp page so that
writeback can be immediately cleared on the dirty page in order to avoid
the following deadlock:
* single-threaded FUSE server is in the middle of handling a request that
needs a memory allocation
* memory allocation triggers direct reclaim
* direct reclaim waits on a folio under writeback (eg falls into case 3
above) that needs to be written back to the FUSE server
* the FUSE server can't write back the folio since it's stuck in direct
reclaim
In this commit, if legacy memcg encounters a folio with the reclaim flag
set (i.e. case 3) and the folio belongs to a mapping that has the
AS_WRITEBACK_MAY_BLOCK flag set, the folio will be activated and skipped
for reclaim (i.e. defaulting to the behavior in case 2) instead.
This allows the suboptimal workarounds that were added to address the
"reclaim waits on writeback" deadlock scenario to be removed.
Signed-off-by: Joanne Koong <joannelkoong@gmail.com>
---
mm/vmscan.c | 10 +++++++---
1 file changed, 7 insertions(+), 3 deletions(-)
diff --git a/mm/vmscan.c b/mm/vmscan.c
index 749cdc110c74..e9755cb7211b 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -1110,6 +1110,8 @@ static unsigned int shrink_folio_list(struct list_head *folio_list,
if (writeback && folio_test_reclaim(folio))
stat->nr_congested += nr_pages;
+ mapping = folio_mapping(folio);
+
/*
* If a folio at the tail of the LRU is under writeback, there
* are three cases to consider.
@@ -1129,8 +1131,9 @@ static unsigned int shrink_folio_list(struct list_head *folio_list,
* 2) Global or new memcg reclaim encounters a folio that is
* not marked for immediate reclaim, or the caller does not
* have __GFP_FS (or __GFP_IO if it's simply going to swap,
- * not to fs). In this case mark the folio for immediate
- * reclaim and continue scanning.
+ * not to fs), or writebacks in the mapping may block.
+ * In this case mark the folio for immediate reclaim and
+ * continue scanning.
*
* Require may_enter_fs() because we would wait on fs, which
* may not have submitted I/O yet. And the loop driver might
@@ -1165,7 +1168,8 @@ static unsigned int shrink_folio_list(struct list_head *folio_list,
/* Case 2 above */
} else if (writeback_throttling_sane(sc) ||
!folio_test_reclaim(folio) ||
- !may_enter_fs(folio, sc->gfp_mask)) {
+ !may_enter_fs(folio, sc->gfp_mask) ||
+ (mapping && mapping_writeback_may_block(mapping))) {
/*
* This is slightly racy -
* folio_end_writeback() might have
--
2.43.5
* [PATCH v4 3/6] fs/writeback: in wait_sb_inodes(), skip wait for AS_WRITEBACK_MAY_BLOCK mappings
2024-11-07 23:56 [PATCH v4 0/6] fuse: remove temp page copies in writeback Joanne Koong
2024-11-07 23:56 ` [PATCH v4 1/6] mm: add AS_WRITEBACK_MAY_BLOCK mapping flag Joanne Koong
2024-11-07 23:56 ` [PATCH v4 2/6] mm: skip reclaiming folios in legacy memcg writeback contexts that may block Joanne Koong
@ 2024-11-07 23:56 ` Joanne Koong
2024-11-07 23:56 ` [PATCH v4 4/6] mm/memory-hotplug: add finite retries in offline_pages() if migration fails Joanne Koong
` (2 subsequent siblings)
5 siblings, 0 replies; 32+ messages in thread
From: Joanne Koong @ 2024-11-07 23:56 UTC (permalink / raw)
To: miklos, linux-fsdevel
Cc: shakeel.butt, jefflexu, josef, linux-mm, bernd.schubert, kernel-team
For filesystems with the AS_WRITEBACK_MAY_BLOCK flag set, writeback
operations may block or take an indeterminate time to complete. For
example, writing data back to disk in FUSE filesystems depends on the
userspace server successfully completing writeback.
In this commit, wait_sb_inodes() skips waiting on writeback if the
inode's mapping has AS_WRITEBACK_MAY_BLOCK set; otherwise, sync(2) may
take an indeterminate amount of time to complete.
If the caller wishes to ensure the data for a mapping with the
AS_WRITEBACK_MAY_BLOCK flag set has actually been written back to disk,
they should use fsync(2)/fdatasync(2) instead.
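A hedged userspace illustration of the guidance above (hypothetical fd):

	/* sync(2) gives no durability guarantee for FUSE data; it did not
	 * before this patch either, since writeback was cleared as soon as
	 * the temp page copy was made */
	sync();

	/* for durability on a specific file, wait via its fd */
	if (fsync(fd) < 0)	/* or fdatasync(fd) */
		perror("fsync");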
Signed-off-by: Joanne Koong <joannelkoong@gmail.com>
---
fs/fs-writeback.c | 3 +++
1 file changed, 3 insertions(+)
diff --git a/fs/fs-writeback.c b/fs/fs-writeback.c
index d8bec3c1bb1f..c80c45972162 100644
--- a/fs/fs-writeback.c
+++ b/fs/fs-writeback.c
@@ -2659,6 +2659,9 @@ static void wait_sb_inodes(struct super_block *sb)
if (!mapping_tagged(mapping, PAGECACHE_TAG_WRITEBACK))
continue;
+ if (mapping_writeback_may_block(mapping))
+ continue;
+
spin_unlock_irq(&sb->s_inode_wblist_lock);
spin_lock(&inode->i_lock);
--
2.43.5
* [PATCH v4 4/6] mm/memory-hotplug: add finite retries in offline_pages() if migration fails
2024-11-07 23:56 [PATCH v4 0/6] fuse: remove temp page copies in writeback Joanne Koong
` (2 preceding siblings ...)
2024-11-07 23:56 ` [PATCH v4 3/6] fs/writeback: in wait_sb_inodes(), skip wait for AS_WRITEBACK_MAY_BLOCK mappings Joanne Koong
@ 2024-11-07 23:56 ` Joanne Koong
2024-11-08 17:33 ` SeongJae Park
2024-11-07 23:56 ` [PATCH v4 5/6] mm/migrate: skip migrating folios under writeback with AS_WRITEBACK_MAY_BLOCK mappings Joanne Koong
2024-11-07 23:56 ` [PATCH v4 6/6] fuse: remove tmp folio for writebacks and internal rb tree Joanne Koong
5 siblings, 1 reply; 32+ messages in thread
From: Joanne Koong @ 2024-11-07 23:56 UTC (permalink / raw)
To: miklos, linux-fsdevel
Cc: shakeel.butt, jefflexu, josef, linux-mm, bernd.schubert, kernel-team
In offline_pages(), do_migrate_range() may potentially retry forever if
the migration fails. Add a return value for do_migrate_range(), and
allow offline_pages() to try migrating pages 5 times before erroring
out, similar to how migration failures in __alloc_contig_migrate_range()
are handled.
Signed-off-by: Joanne Koong <joannelkoong@gmail.com>
---
mm/memory_hotplug.c | 13 ++++++-------
1 file changed, 6 insertions(+), 7 deletions(-)
diff --git a/mm/memory_hotplug.c b/mm/memory_hotplug.c
index 621ae1015106..49402442ea3b 100644
--- a/mm/memory_hotplug.c
+++ b/mm/memory_hotplug.c
@@ -1770,13 +1770,14 @@ static int scan_movable_pages(unsigned long start, unsigned long end,
return 0;
}
-static void do_migrate_range(unsigned long start_pfn, unsigned long end_pfn)
+static int do_migrate_range(unsigned long start_pfn, unsigned long end_pfn)
{
struct folio *folio;
unsigned long pfn;
LIST_HEAD(source);
static DEFINE_RATELIMIT_STATE(migrate_rs, DEFAULT_RATELIMIT_INTERVAL,
DEFAULT_RATELIMIT_BURST);
+ int ret = 0;
for (pfn = start_pfn; pfn < end_pfn; pfn++) {
struct page *page;
@@ -1833,7 +1834,6 @@ static void do_migrate_range(unsigned long start_pfn, unsigned long end_pfn)
.gfp_mask = GFP_USER | __GFP_MOVABLE | __GFP_RETRY_MAYFAIL,
.reason = MR_MEMORY_HOTPLUG,
};
- int ret;
/*
* We have checked that migration range is on a single zone so
@@ -1863,6 +1863,7 @@ static void do_migrate_range(unsigned long start_pfn, unsigned long end_pfn)
putback_movable_pages(&source);
}
}
+ return ret;
}
static int __init cmdline_parse_movable_node(char *p)
@@ -1940,6 +1941,7 @@ int offline_pages(unsigned long start_pfn, unsigned long nr_pages,
const int node = zone_to_nid(zone);
unsigned long flags;
struct memory_notify arg;
+ unsigned int tries = 0;
char *reason;
int ret;
@@ -2028,11 +2030,8 @@ int offline_pages(unsigned long start_pfn, unsigned long nr_pages,
ret = scan_movable_pages(pfn, end_pfn, &pfn);
if (!ret) {
- /*
- * TODO: fatal migration failures should bail
- * out
- */
- do_migrate_range(pfn, end_pfn);
+ if (do_migrate_range(pfn, end_pfn) && ++tries == 5)
+ ret = -EBUSY;
}
} while (!ret);
--
2.43.5
* [PATCH v4 5/6] mm/migrate: skip migrating folios under writeback with AS_WRITEBACK_MAY_BLOCK mappings
2024-11-07 23:56 [PATCH v4 0/6] fuse: remove temp page copies in writeback Joanne Koong
` (3 preceding siblings ...)
2024-11-07 23:56 ` [PATCH v4 4/6] mm/memory-hotplug: add finite retries in offline_pages() if migration fails Joanne Koong
@ 2024-11-07 23:56 ` Joanne Koong
2024-11-07 23:56 ` [PATCH v4 6/6] fuse: remove tmp folio for writebacks and internal rb tree Joanne Koong
5 siblings, 0 replies; 32+ messages in thread
From: Joanne Koong @ 2024-11-07 23:56 UTC (permalink / raw)
To: miklos, linux-fsdevel
Cc: shakeel.butt, jefflexu, josef, linux-mm, bernd.schubert, kernel-team
For migrations called in MIGRATE_SYNC mode, skip migrating the folio if
it is under writeback and has the AS_WRITEBACK_MAY_BLOCK flag set on its
mapping. If the AS_WRITEBACK_MAY_BLOCK flag is set on the mapping, the
writeback may take an indeterminate amount of time to complete, so we
should not wait.
Signed-off-by: Joanne Koong <joannelkoong@gmail.com>
---
mm/migrate.c | 5 ++++-
1 file changed, 4 insertions(+), 1 deletion(-)
diff --git a/mm/migrate.c b/mm/migrate.c
index df91248755e4..1d038a4202ae 100644
--- a/mm/migrate.c
+++ b/mm/migrate.c
@@ -1260,7 +1260,10 @@ static int migrate_folio_unmap(new_folio_t get_new_folio,
*/
switch (mode) {
case MIGRATE_SYNC:
- break;
+ if (!src->mapping ||
+ !mapping_writeback_may_block(src->mapping))
+ break;
+ fallthrough;
default:
rc = -EBUSY;
goto out;
--
2.43.5
* [PATCH v4 6/6] fuse: remove tmp folio for writebacks and internal rb tree
2024-11-07 23:56 [PATCH v4 0/6] fuse: remove temp page copies in writeback Joanne Koong
` (4 preceding siblings ...)
2024-11-07 23:56 ` [PATCH v4 5/6] mm/migrate: skip migrating folios under writeback with AS_WRITEBACK_MAY_BLOCK mappings Joanne Koong
@ 2024-11-07 23:56 ` Joanne Koong
2024-11-08 8:48 ` Jingbo Xu
` (2 more replies)
5 siblings, 3 replies; 32+ messages in thread
From: Joanne Koong @ 2024-11-07 23:56 UTC (permalink / raw)
To: miklos, linux-fsdevel
Cc: shakeel.butt, jefflexu, josef, linux-mm, bernd.schubert, kernel-team
Currently, we allocate and copy data to a temporary folio when
handling writeback in order to mitigate the following deadlock scenario
that may arise if reclaim waits on writeback to complete:
* single-threaded FUSE server is in the middle of handling a request
that needs a memory allocation
* memory allocation triggers direct reclaim
* direct reclaim waits on a folio under writeback
* the FUSE server can't write back the folio since it's stuck in
direct reclaim
To work around this, we allocate a temporary folio and copy over the
original folio to the temporary folio so that writeback can be
immediately cleared on the original folio. This additionally requires us
to maintain an internal rb tree to keep track of writeback state on the
temporary folios.
A recent change prevents reclaim logic from waiting on writeback for
folios whose mappings have the AS_WRITEBACK_MAY_BLOCK flag set.
This commit sets AS_WRITEBACK_MAY_BLOCK on FUSE inode mappings (which
will prevent FUSE folios from running into the reclaim deadlock described
above) and removes the temporary folio + extra copying and the internal
rb tree.
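In rough terms, the per-folio fill path changes as follows (paraphrasing
the hunks below, not exact excerpts):

	/* before: copy into a freshly allocated temp folio, hand the copy
	 * to the server, and end writeback on the original folio right away */
	folio_copy(tmp_folio, folio);
	ap->folios[i] = tmp_folio;
	...
	folio_end_writeback(folio);

	/* after: hand the original folio itself to the server; writeback now
	 * ends in fuse_writepage_free(), once the server has replied */
	folio_get(folio);
	ap->folios[i] = folio;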
fio benchmarks --
(using averages observed from 10 runs, throwing away outliers)
Setup:
sudo mount -t tmpfs -o size=30G tmpfs ~/tmp_mount
./libfuse/build/example/passthrough_ll -o writeback -o max_threads=4 -o source=~/tmp_mount ~/fuse_mount
fio --name=writeback --ioengine=sync --rw=write --bs={1k,4k,1M} --size=2G
--numjobs=2 --ramp_time=30 --group_reporting=1 --directory=/root/fuse_mount
bs =        1k            4k            1M
Before      351 MiB/s     1818 MiB/s    1851 MiB/s
After       341 MiB/s     2246 MiB/s    2685 MiB/s
% diff      -3%           23%           45%
Signed-off-by: Joanne Koong <joannelkoong@gmail.com>
---
fs/fuse/file.c | 339 +++++--------------------------------------------
1 file changed, 29 insertions(+), 310 deletions(-)
diff --git a/fs/fuse/file.c b/fs/fuse/file.c
index 88d0946b5bc9..f8719d8c56ca 100644
--- a/fs/fuse/file.c
+++ b/fs/fuse/file.c
@@ -415,89 +415,11 @@ u64 fuse_lock_owner_id(struct fuse_conn *fc, fl_owner_t id)
struct fuse_writepage_args {
struct fuse_io_args ia;
- struct rb_node writepages_entry;
struct list_head queue_entry;
- struct fuse_writepage_args *next;
struct inode *inode;
struct fuse_sync_bucket *bucket;
};
-static struct fuse_writepage_args *fuse_find_writeback(struct fuse_inode *fi,
- pgoff_t idx_from, pgoff_t idx_to)
-{
- struct rb_node *n;
-
- n = fi->writepages.rb_node;
-
- while (n) {
- struct fuse_writepage_args *wpa;
- pgoff_t curr_index;
-
- wpa = rb_entry(n, struct fuse_writepage_args, writepages_entry);
- WARN_ON(get_fuse_inode(wpa->inode) != fi);
- curr_index = wpa->ia.write.in.offset >> PAGE_SHIFT;
- if (idx_from >= curr_index + wpa->ia.ap.num_folios)
- n = n->rb_right;
- else if (idx_to < curr_index)
- n = n->rb_left;
- else
- return wpa;
- }
- return NULL;
-}
-
-/*
- * Check if any page in a range is under writeback
- */
-static bool fuse_range_is_writeback(struct inode *inode, pgoff_t idx_from,
- pgoff_t idx_to)
-{
- struct fuse_inode *fi = get_fuse_inode(inode);
- bool found;
-
- if (RB_EMPTY_ROOT(&fi->writepages))
- return false;
-
- spin_lock(&fi->lock);
- found = fuse_find_writeback(fi, idx_from, idx_to);
- spin_unlock(&fi->lock);
-
- return found;
-}
-
-static inline bool fuse_page_is_writeback(struct inode *inode, pgoff_t index)
-{
- return fuse_range_is_writeback(inode, index, index);
-}
-
-/*
- * Wait for page writeback to be completed.
- *
- * Since fuse doesn't rely on the VM writeback tracking, this has to
- * use some other means.
- */
-static void fuse_wait_on_page_writeback(struct inode *inode, pgoff_t index)
-{
- struct fuse_inode *fi = get_fuse_inode(inode);
-
- wait_event(fi->page_waitq, !fuse_page_is_writeback(inode, index));
-}
-
-static inline bool fuse_folio_is_writeback(struct inode *inode,
- struct folio *folio)
-{
- pgoff_t last = folio_next_index(folio) - 1;
- return fuse_range_is_writeback(inode, folio_index(folio), last);
-}
-
-static void fuse_wait_on_folio_writeback(struct inode *inode,
- struct folio *folio)
-{
- struct fuse_inode *fi = get_fuse_inode(inode);
-
- wait_event(fi->page_waitq, !fuse_folio_is_writeback(inode, folio));
-}
-
/*
* Wait for all pending writepages on the inode to finish.
*
@@ -891,7 +813,7 @@ static int fuse_do_readfolio(struct file *file, struct folio *folio)
* have writeback that extends beyond the lifetime of the folio. So
* make sure we read a properly synced folio.
*/
- fuse_wait_on_folio_writeback(inode, folio);
+ folio_wait_writeback(folio);
attr_ver = fuse_get_attr_version(fm->fc);
@@ -1003,16 +925,15 @@ static void fuse_send_readpages(struct fuse_io_args *ia, struct file *file)
static void fuse_readahead(struct readahead_control *rac)
{
struct inode *inode = rac->mapping->host;
- struct fuse_inode *fi = get_fuse_inode(inode);
struct fuse_conn *fc = get_fuse_conn(inode);
unsigned int max_pages, nr_pages;
- pgoff_t first = readahead_index(rac);
- pgoff_t last = first + readahead_count(rac) - 1;
+ loff_t first = readahead_pos(rac);
+ loff_t last = first + readahead_length(rac) - 1;
if (fuse_is_bad(inode))
return;
- wait_event(fi->page_waitq, !fuse_range_is_writeback(inode, first, last));
+ filemap_fdatawait_range(inode->i_mapping, first, last);
max_pages = min_t(unsigned int, fc->max_pages,
fc->max_read / PAGE_SIZE);
@@ -1172,7 +1093,7 @@ static ssize_t fuse_send_write_pages(struct fuse_io_args *ia,
int err;
for (i = 0; i < ap->num_folios; i++)
- fuse_wait_on_folio_writeback(inode, ap->folios[i]);
+ folio_wait_writeback(ap->folios[i]);
fuse_write_args_fill(ia, ff, pos, count);
ia->write.in.flags = fuse_write_flags(iocb);
@@ -1622,7 +1543,7 @@ ssize_t fuse_direct_io(struct fuse_io_priv *io, struct iov_iter *iter,
return res;
}
}
- if (!cuse && fuse_range_is_writeback(inode, idx_from, idx_to)) {
+ if (!cuse && filemap_range_has_writeback(mapping, pos, (pos + count - 1))) {
if (!write)
inode_lock(inode);
fuse_sync_writes(inode);
@@ -1824,8 +1745,10 @@ static void fuse_writepage_free(struct fuse_writepage_args *wpa)
if (wpa->bucket)
fuse_sync_bucket_dec(wpa->bucket);
- for (i = 0; i < ap->num_folios; i++)
+ for (i = 0; i < ap->num_folios; i++) {
+ folio_end_writeback(ap->folios[i]);
folio_put(ap->folios[i]);
+ }
fuse_file_put(wpa->ia.ff, false);
@@ -1838,7 +1761,7 @@ static void fuse_writepage_finish_stat(struct inode *inode, struct folio *folio)
struct backing_dev_info *bdi = inode_to_bdi(inode);
dec_wb_stat(&bdi->wb, WB_WRITEBACK);
- node_stat_sub_folio(folio, NR_WRITEBACK_TEMP);
+ node_stat_sub_folio(folio, NR_WRITEBACK);
wb_writeout_inc(&bdi->wb);
}
@@ -1861,7 +1784,6 @@ static void fuse_send_writepage(struct fuse_mount *fm,
__releases(fi->lock)
__acquires(fi->lock)
{
- struct fuse_writepage_args *aux, *next;
struct fuse_inode *fi = get_fuse_inode(wpa->inode);
struct fuse_write_in *inarg = &wpa->ia.write.in;
struct fuse_args *args = &wpa->ia.ap.args;
@@ -1898,19 +1820,8 @@ __acquires(fi->lock)
out_free:
fi->writectr--;
- rb_erase(&wpa->writepages_entry, &fi->writepages);
fuse_writepage_finish(wpa);
spin_unlock(&fi->lock);
-
- /* After rb_erase() aux request list is private */
- for (aux = wpa->next; aux; aux = next) {
- next = aux->next;
- aux->next = NULL;
- fuse_writepage_finish_stat(aux->inode,
- aux->ia.ap.folios[0]);
- fuse_writepage_free(aux);
- }
-
fuse_writepage_free(wpa);
spin_lock(&fi->lock);
}
@@ -1938,43 +1849,6 @@ __acquires(fi->lock)
}
}
-static struct fuse_writepage_args *fuse_insert_writeback(struct rb_root *root,
- struct fuse_writepage_args *wpa)
-{
- pgoff_t idx_from = wpa->ia.write.in.offset >> PAGE_SHIFT;
- pgoff_t idx_to = idx_from + wpa->ia.ap.num_folios - 1;
- struct rb_node **p = &root->rb_node;
- struct rb_node *parent = NULL;
-
- WARN_ON(!wpa->ia.ap.num_folios);
- while (*p) {
- struct fuse_writepage_args *curr;
- pgoff_t curr_index;
-
- parent = *p;
- curr = rb_entry(parent, struct fuse_writepage_args,
- writepages_entry);
- WARN_ON(curr->inode != wpa->inode);
- curr_index = curr->ia.write.in.offset >> PAGE_SHIFT;
-
- if (idx_from >= curr_index + curr->ia.ap.num_folios)
- p = &(*p)->rb_right;
- else if (idx_to < curr_index)
- p = &(*p)->rb_left;
- else
- return curr;
- }
-
- rb_link_node(&wpa->writepages_entry, parent, p);
- rb_insert_color(&wpa->writepages_entry, root);
- return NULL;
-}
-
-static void tree_insert(struct rb_root *root, struct fuse_writepage_args *wpa)
-{
- WARN_ON(fuse_insert_writeback(root, wpa));
-}
-
static void fuse_writepage_end(struct fuse_mount *fm, struct fuse_args *args,
int error)
{
@@ -1994,41 +1868,6 @@ static void fuse_writepage_end(struct fuse_mount *fm, struct fuse_args *args,
if (!fc->writeback_cache)
fuse_invalidate_attr_mask(inode, FUSE_STATX_MODIFY);
spin_lock(&fi->lock);
- rb_erase(&wpa->writepages_entry, &fi->writepages);
- while (wpa->next) {
- struct fuse_mount *fm = get_fuse_mount(inode);
- struct fuse_write_in *inarg = &wpa->ia.write.in;
- struct fuse_writepage_args *next = wpa->next;
-
- wpa->next = next->next;
- next->next = NULL;
- tree_insert(&fi->writepages, next);
-
- /*
- * Skip fuse_flush_writepages() to make it easy to crop requests
- * based on primary request size.
- *
- * 1st case (trivial): there are no concurrent activities using
- * fuse_set/release_nowrite. Then we're on safe side because
- * fuse_flush_writepages() would call fuse_send_writepage()
- * anyway.
- *
- * 2nd case: someone called fuse_set_nowrite and it is waiting
- * now for completion of all in-flight requests. This happens
- * rarely and no more than once per page, so this should be
- * okay.
- *
- * 3rd case: someone (e.g. fuse_do_setattr()) is in the middle
- * of fuse_set_nowrite..fuse_release_nowrite section. The fact
- * that fuse_set_nowrite returned implies that all in-flight
- * requests were completed along with all of their secondary
- * requests. Further primary requests are blocked by negative
- * writectr. Hence there cannot be any in-flight requests and
- * no invocations of fuse_writepage_end() while we're in
- * fuse_set_nowrite..fuse_release_nowrite section.
- */
- fuse_send_writepage(fm, next, inarg->offset + inarg->size);
- }
fi->writectr--;
fuse_writepage_finish(wpa);
spin_unlock(&fi->lock);
@@ -2115,19 +1954,18 @@ static void fuse_writepage_add_to_bucket(struct fuse_conn *fc,
}
static void fuse_writepage_args_page_fill(struct fuse_writepage_args *wpa, struct folio *folio,
- struct folio *tmp_folio, uint32_t folio_index)
+ uint32_t folio_index)
{
struct inode *inode = folio->mapping->host;
struct fuse_args_pages *ap = &wpa->ia.ap;
- folio_copy(tmp_folio, folio);
-
- ap->folios[folio_index] = tmp_folio;
+ folio_get(folio);
+ ap->folios[folio_index] = folio;
ap->descs[folio_index].offset = 0;
ap->descs[folio_index].length = PAGE_SIZE;
inc_wb_stat(&inode_to_bdi(inode)->wb, WB_WRITEBACK);
- node_stat_add_folio(tmp_folio, NR_WRITEBACK_TEMP);
+ node_stat_add_folio(folio, NR_WRITEBACK);
}
static struct fuse_writepage_args *fuse_writepage_args_setup(struct folio *folio,
@@ -2162,18 +2000,12 @@ static int fuse_writepage_locked(struct folio *folio)
struct fuse_inode *fi = get_fuse_inode(inode);
struct fuse_writepage_args *wpa;
struct fuse_args_pages *ap;
- struct folio *tmp_folio;
struct fuse_file *ff;
- int error = -ENOMEM;
-
- tmp_folio = folio_alloc(GFP_NOFS | __GFP_HIGHMEM, 0);
- if (!tmp_folio)
- goto err;
+ int error = -EIO;
- error = -EIO;
ff = fuse_write_file_get(fi);
if (!ff)
- goto err_nofile;
+ goto err;
wpa = fuse_writepage_args_setup(folio, ff);
error = -ENOMEM;
@@ -2184,22 +2016,17 @@ static int fuse_writepage_locked(struct folio *folio)
ap->num_folios = 1;
folio_start_writeback(folio);
- fuse_writepage_args_page_fill(wpa, folio, tmp_folio, 0);
+ fuse_writepage_args_page_fill(wpa, folio, 0);
spin_lock(&fi->lock);
- tree_insert(&fi->writepages, wpa);
list_add_tail(&wpa->queue_entry, &fi->queued_writes);
fuse_flush_writepages(inode);
spin_unlock(&fi->lock);
- folio_end_writeback(folio);
-
return 0;
err_writepage_args:
fuse_file_put(ff, false);
-err_nofile:
- folio_put(tmp_folio);
err:
mapping_set_error(folio->mapping, error);
return error;
@@ -2209,7 +2036,6 @@ struct fuse_fill_wb_data {
struct fuse_writepage_args *wpa;
struct fuse_file *ff;
struct inode *inode;
- struct folio **orig_folios;
unsigned int max_folios;
};
@@ -2244,69 +2070,11 @@ static void fuse_writepages_send(struct fuse_fill_wb_data *data)
struct fuse_writepage_args *wpa = data->wpa;
struct inode *inode = data->inode;
struct fuse_inode *fi = get_fuse_inode(inode);
- int num_folios = wpa->ia.ap.num_folios;
- int i;
spin_lock(&fi->lock);
list_add_tail(&wpa->queue_entry, &fi->queued_writes);
fuse_flush_writepages(inode);
spin_unlock(&fi->lock);
-
- for (i = 0; i < num_folios; i++)
- folio_end_writeback(data->orig_folios[i]);
-}
-
-/*
- * Check under fi->lock if the page is under writeback, and insert it onto the
- * rb_tree if not. Otherwise iterate auxiliary write requests, to see if there's
- * one already added for a page at this offset. If there's none, then insert
- * this new request onto the auxiliary list, otherwise reuse the existing one by
- * swapping the new temp page with the old one.
- */
-static bool fuse_writepage_add(struct fuse_writepage_args *new_wpa,
- struct folio *folio)
-{
- struct fuse_inode *fi = get_fuse_inode(new_wpa->inode);
- struct fuse_writepage_args *tmp;
- struct fuse_writepage_args *old_wpa;
- struct fuse_args_pages *new_ap = &new_wpa->ia.ap;
-
- WARN_ON(new_ap->num_folios != 0);
- new_ap->num_folios = 1;
-
- spin_lock(&fi->lock);
- old_wpa = fuse_insert_writeback(&fi->writepages, new_wpa);
- if (!old_wpa) {
- spin_unlock(&fi->lock);
- return true;
- }
-
- for (tmp = old_wpa->next; tmp; tmp = tmp->next) {
- pgoff_t curr_index;
-
- WARN_ON(tmp->inode != new_wpa->inode);
- curr_index = tmp->ia.write.in.offset >> PAGE_SHIFT;
- if (curr_index == folio->index) {
- WARN_ON(tmp->ia.ap.num_folios != 1);
- swap(tmp->ia.ap.folios[0], new_ap->folios[0]);
- break;
- }
- }
-
- if (!tmp) {
- new_wpa->next = old_wpa->next;
- old_wpa->next = new_wpa;
- }
-
- spin_unlock(&fi->lock);
-
- if (tmp) {
- fuse_writepage_finish_stat(new_wpa->inode,
- folio);
- fuse_writepage_free(new_wpa);
- }
-
- return false;
}
static bool fuse_writepage_need_send(struct fuse_conn *fc, struct folio *folio,
@@ -2315,15 +2083,6 @@ static bool fuse_writepage_need_send(struct fuse_conn *fc, struct folio *folio,
{
WARN_ON(!ap->num_folios);
- /*
- * Being under writeback is unlikely but possible. For example direct
- * read to an mmaped fuse file will set the page dirty twice; once when
- * the pages are faulted with get_user_pages(), and then after the read
- * completed.
- */
- if (fuse_folio_is_writeback(data->inode, folio))
- return true;
-
/* Reached max pages */
if (ap->num_folios == fc->max_pages)
return true;
@@ -2333,7 +2092,7 @@ static bool fuse_writepage_need_send(struct fuse_conn *fc, struct folio *folio,
return true;
/* Discontinuity */
- if (data->orig_folios[ap->num_folios - 1]->index + 1 != folio_index(folio))
+ if (ap->folios[ap->num_folios - 1]->index + 1 != folio_index(folio))
return true;
/* Need to grow the pages array? If so, did the expansion fail? */
@@ -2352,7 +2111,6 @@ static int fuse_writepages_fill(struct folio *folio,
struct inode *inode = data->inode;
struct fuse_inode *fi = get_fuse_inode(inode);
struct fuse_conn *fc = get_fuse_conn(inode);
- struct folio *tmp_folio;
int err;
if (!data->ff) {
@@ -2367,54 +2125,23 @@ static int fuse_writepages_fill(struct folio *folio,
data->wpa = NULL;
}
- err = -ENOMEM;
- tmp_folio = folio_alloc(GFP_NOFS | __GFP_HIGHMEM, 0);
- if (!tmp_folio)
- goto out_unlock;
-
- /*
- * The page must not be redirtied until the writeout is completed
- * (i.e. userspace has sent a reply to the write request). Otherwise
- * there could be more than one temporary page instance for each real
- * page.
- *
- * This is ensured by holding the page lock in page_mkwrite() while
- * checking fuse_page_is_writeback(). We already hold the page lock
- * since clear_page_dirty_for_io() and keep it held until we add the
- * request to the fi->writepages list and increment ap->num_folios.
- * After this fuse_page_is_writeback() will indicate that the page is
- * under writeback, so we can release the page lock.
- */
if (data->wpa == NULL) {
err = -ENOMEM;
wpa = fuse_writepage_args_setup(folio, data->ff);
- if (!wpa) {
- folio_put(tmp_folio);
+ if (!wpa)
goto out_unlock;
- }
fuse_file_get(wpa->ia.ff);
data->max_folios = 1;
ap = &wpa->ia.ap;
}
folio_start_writeback(folio);
- fuse_writepage_args_page_fill(wpa, folio, tmp_folio, ap->num_folios);
- data->orig_folios[ap->num_folios] = folio;
+ fuse_writepage_args_page_fill(wpa, folio, ap->num_folios);
err = 0;
- if (data->wpa) {
- /*
- * Protected by fi->lock against concurrent access by
- * fuse_page_is_writeback().
- */
- spin_lock(&fi->lock);
- ap->num_folios++;
- spin_unlock(&fi->lock);
- } else if (fuse_writepage_add(wpa, folio)) {
+ ap->num_folios++;
+ if (!data->wpa)
data->wpa = wpa;
- } else {
- folio_end_writeback(folio);
- }
out_unlock:
folio_unlock(folio);
@@ -2441,13 +2168,6 @@ static int fuse_writepages(struct address_space *mapping,
data.wpa = NULL;
data.ff = NULL;
- err = -ENOMEM;
- data.orig_folios = kcalloc(fc->max_pages,
- sizeof(struct folio *),
- GFP_NOFS);
- if (!data.orig_folios)
- goto out;
-
err = write_cache_pages(mapping, wbc, fuse_writepages_fill, &data);
if (data.wpa) {
WARN_ON(!data.wpa->ia.ap.num_folios);
@@ -2456,7 +2176,6 @@ static int fuse_writepages(struct address_space *mapping,
if (data.ff)
fuse_file_put(data.ff, false);
- kfree(data.orig_folios);
out:
return err;
}
@@ -2481,7 +2200,7 @@ static int fuse_write_begin(struct file *file, struct address_space *mapping,
if (IS_ERR(folio))
goto error;
- fuse_wait_on_page_writeback(mapping->host, folio->index);
+ folio_wait_writeback(folio);
if (folio_test_uptodate(folio) || len >= folio_size(folio))
goto success;
@@ -2545,13 +2264,11 @@ static int fuse_launder_folio(struct folio *folio)
{
int err = 0;
if (folio_clear_dirty_for_io(folio)) {
- struct inode *inode = folio->mapping->host;
-
/* Serialize with pending writeback for the same page */
- fuse_wait_on_page_writeback(inode, folio->index);
+ folio_wait_writeback(folio);
err = fuse_writepage_locked(folio);
if (!err)
- fuse_wait_on_page_writeback(inode, folio->index);
+ folio_wait_writeback(folio);
}
return err;
}
@@ -2595,7 +2312,7 @@ static vm_fault_t fuse_page_mkwrite(struct vm_fault *vmf)
return VM_FAULT_NOPAGE;
}
- fuse_wait_on_folio_writeback(inode, folio);
+ folio_wait_writeback(folio);
return VM_FAULT_LOCKED;
}
@@ -3413,9 +3130,12 @@ static const struct address_space_operations fuse_file_aops = {
void fuse_init_file_inode(struct inode *inode, unsigned int flags)
{
struct fuse_inode *fi = get_fuse_inode(inode);
+ struct fuse_conn *fc = get_fuse_conn(inode);
inode->i_fop = &fuse_file_operations;
inode->i_data.a_ops = &fuse_file_aops;
+ if (fc->writeback_cache)
+ mapping_set_writeback_may_block(&inode->i_data);
INIT_LIST_HEAD(&fi->write_files);
INIT_LIST_HEAD(&fi->queued_writes);
@@ -3423,7 +3143,6 @@ void fuse_init_file_inode(struct inode *inode, unsigned int flags)
fi->iocachectr = 0;
init_waitqueue_head(&fi->page_waitq);
init_waitqueue_head(&fi->direct_io_waitq);
- fi->writepages = RB_ROOT;
if (IS_ENABLED(CONFIG_FUSE_DAX))
fuse_dax_inode_init(inode, flags);
--
2.43.5
* Re: [PATCH v4 6/6] fuse: remove tmp folio for writebacks and internal rb tree
2024-11-07 23:56 ` [PATCH v4 6/6] fuse: remove tmp folio for writebacks and internal rb tree Joanne Koong
@ 2024-11-08 8:48 ` Jingbo Xu
2024-11-08 22:33 ` Joanne Koong
2024-11-11 8:32 ` Jingbo Xu
2024-11-12 9:25 ` Jingbo Xu
2 siblings, 1 reply; 32+ messages in thread
From: Jingbo Xu @ 2024-11-08 8:48 UTC (permalink / raw)
To: Joanne Koong, miklos, linux-fsdevel
Cc: shakeel.butt, josef, linux-mm, bernd.schubert, kernel-team
Hi, Joanne,
Thanks for the continuing work!
On 11/8/24 7:56 AM, Joanne Koong wrote:
> Currently, we allocate and copy data to a temporary folio when
> handling writeback in order to mitigate the following deadlock scenario
> that may arise if reclaim waits on writeback to complete:
> * single-threaded FUSE server is in the middle of handling a request
> that needs a memory allocation
> * memory allocation triggers direct reclaim
> * direct reclaim waits on a folio under writeback
> * the FUSE server can't write back the folio since it's stuck in
> direct reclaim
>
> To work around this, we allocate a temporary folio and copy over the
> original folio to the temporary folio so that writeback can be
> immediately cleared on the original folio. This additionally requires us
> to maintain an internal rb tree to keep track of writeback state on the
> temporary folios.
>
> A recent change prevents reclaim logic from waiting on writeback for
> folios whose mappings have the AS_WRITEBACK_MAY_BLOCK flag set in it.
> This commit sets AS_WRITEBACK_MAY_BLOCK on FUSE inode mappings (which
> will prevent FUSE folios from running into the reclaim deadlock described
> above) and removes the temporary folio + extra copying and the internal
> rb tree.
>
> fio benchmarks --
> (using averages observed from 10 runs, throwing away outliers)
>
> Setup:
> sudo mount -t tmpfs -o size=30G tmpfs ~/tmp_mount
> ./libfuse/build/example/passthrough_ll -o writeback -o max_threads=4 -o source=~/tmp_mount ~/fuse_mount
>
> fio --name=writeback --ioengine=sync --rw=write --bs={1k,4k,1M} --size=2G
> --numjobs=2 --ramp_time=30 --group_reporting=1 --directory=/root/fuse_mount
>
> bs =        1k            4k            1M
> Before      351 MiB/s     1818 MiB/s    1851 MiB/s
> After       341 MiB/s     2246 MiB/s    2685 MiB/s
> % diff      -3%           23%           45%
>
> Signed-off-by: Joanne Koong <joannelkoong@gmail.com>
> @@ -1622,7 +1543,7 @@ ssize_t fuse_direct_io(struct fuse_io_priv *io, struct iov_iter *iter,
> return res;
> }
> }
> - if (!cuse && fuse_range_is_writeback(inode, idx_from, idx_to)) {
> + if (!cuse && filemap_range_has_writeback(mapping, pos, (pos + count - 1))) {
filemap_range_has_writeback() is not equivalent to
fuse_range_is_writeback(), as it will return true as long as there's any
locked or dirty page? I can't find an equivalent helper function at
hand though.
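For reference, the core of that helper checks for more than just writeback
(paraphrased from mm/filemap.c):

	xas_for_each(&xas, folio, max) {
		if (xas_retry(&xas, folio))
			continue;
		if (xa_is_value(folio))
			continue;
		if (folio_test_dirty(folio) || folio_test_locked(folio) ||
		    folio_test_writeback(folio))
			break;	/* range counts as "has writeback" */
	}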
> @@ -3423,7 +3143,6 @@ void fuse_init_file_inode(struct inode *inode, unsigned int flags)
> fi->iocachectr = 0;
> init_waitqueue_head(&fi->page_waitq);
> init_waitqueue_head(&fi->direct_io_waitq);
> - fi->writepages = RB_ROOT;
It seems that 'struct rb_root writepages' is not removed from the
fuse_inode structure.
Besides, I also looked through the first 5 patches and couldn't find any
obvious errors at first glance. Hopefully the MM guys can offer more
professional reviews.
--
Thanks,
Jingbo
* Re: [PATCH v4 4/6] mm/memory-hotplug: add finite retries in offline_pages() if migration fails
2024-11-07 23:56 ` [PATCH v4 4/6] mm/memory-hotplug: add finite retries in offline_pages() if migration fails Joanne Koong
@ 2024-11-08 17:33 ` SeongJae Park
2024-11-08 18:56 ` David Hildenbrand
2024-11-08 21:59 ` Joanne Koong
0 siblings, 2 replies; 32+ messages in thread
From: SeongJae Park @ 2024-11-08 17:33 UTC (permalink / raw)
To: Joanne Koong
Cc: SeongJae Park, miklos, linux-fsdevel, shakeel.butt, jefflexu,
josef, linux-mm, bernd.schubert, kernel-team, David Hildenbrand
+ David Hildenbrand
On Thu, 7 Nov 2024 15:56:12 -0800 Joanne Koong <joannelkoong@gmail.com> wrote:
> In offline_pages(), do_migrate_range() may potentially retry forever if
> the migration fails. Add a return value for do_migrate_range(), and
> allow offline_page() to try migrating pages 5 times before erroring
> out, similar to how migration failures in __alloc_contig_migrate_range()
> is handled.
I'm curious if this could cause unexpected behavioral differences to memory
hotplugging users, and how '5' is chosen. Could you please enlighten me?
>
> Signed-off-by: Joanne Koong <joannelkoong@gmail.com>
> ---
> mm/memory_hotplug.c | 13 ++++++-------
> 1 file changed, 6 insertions(+), 7 deletions(-)
>
> diff --git a/mm/memory_hotplug.c b/mm/memory_hotplug.c
> index 621ae1015106..49402442ea3b 100644
> --- a/mm/memory_hotplug.c
> +++ b/mm/memory_hotplug.c
> @@ -1770,13 +1770,14 @@ static int scan_movable_pages(unsigned long start, unsigned long end,
> return 0;
> }
>
> -static void do_migrate_range(unsigned long start_pfn, unsigned long end_pfn)
> +static int do_migrate_range(unsigned long start_pfn, unsigned long end_pfn)
Seems the return value is only used to know whether it failed or not. If
there is no plan to use the error code in the future, what about using a
bool return type?
> {
> struct folio *folio;
> unsigned long pfn;
> LIST_HEAD(source);
> static DEFINE_RATELIMIT_STATE(migrate_rs, DEFAULT_RATELIMIT_INTERVAL,
> DEFAULT_RATELIMIT_BURST);
> + int ret = 0;
>
> for (pfn = start_pfn; pfn < end_pfn; pfn++) {
> struct page *page;
> @@ -1833,7 +1834,6 @@ static void do_migrate_range(unsigned long start_pfn, unsigned long end_pfn)
> .gfp_mask = GFP_USER | __GFP_MOVABLE | __GFP_RETRY_MAYFAIL,
> .reason = MR_MEMORY_HOTPLUG,
> };
> - int ret;
>
> /*
> * We have checked that migration range is on a single zone so
> @@ -1863,6 +1863,7 @@ static void do_migrate_range(unsigned long start_pfn, unsigned long end_pfn)
> putback_movable_pages(&source);
> }
> }
> + return ret;
> }
>
> static int __init cmdline_parse_movable_node(char *p)
> @@ -1940,6 +1941,7 @@ int offline_pages(unsigned long start_pfn, unsigned long nr_pages,
> const int node = zone_to_nid(zone);
> unsigned long flags;
> struct memory_notify arg;
> + unsigned int tries = 0;
> char *reason;
> int ret;
>
> @@ -2028,11 +2030,8 @@ int offline_pages(unsigned long start_pfn, unsigned long nr_pages,
>
> ret = scan_movable_pages(pfn, end_pfn, &pfn);
> if (!ret) {
> - /*
> - * TODO: fatal migration failures should bail
> - * out
> - */
> - do_migrate_range(pfn, end_pfn);
> + if (do_migrate_range(pfn, end_pfn) && ++tries == 5)
> + ret = -EBUSY;
> }
In the '++tries == 5' case, users will see the failure reason reported as
"unmovable page" in the debug log. What about setting 'reason' here to
something more specific, e.g., "multiple migration failures"?
Also, my humble understanding of the intention of this change is as
follows. If there are 'AS_WRITEBACK_MAY_BLOCK' pages in the migration
target range, do_migrate_range() will continuously fail, and hence this
could become an infinite loop. Please let me know if I'm misunderstanding.
But if I'm not wrong... there is a check for expected failures above
(scan_movable_pages()). What about adding an 'AS_WRITEBACK_MAY_BLOCK'
pages existence check there?
> } while (!ret);
>
> --
> 2.43.5
Thanks,
SJ
* Re: [PATCH v4 4/6] mm/memory-hotplug: add finite retries in offline_pages() if migration fails
2024-11-08 17:33 ` SeongJae Park
@ 2024-11-08 18:56 ` David Hildenbrand
2024-11-08 19:00 ` David Hildenbrand
2024-11-08 21:59 ` Joanne Koong
1 sibling, 1 reply; 32+ messages in thread
From: David Hildenbrand @ 2024-11-08 18:56 UTC (permalink / raw)
To: SeongJae Park, Joanne Koong
Cc: miklos, linux-fsdevel, shakeel.butt, jefflexu, josef, linux-mm,
bernd.schubert, kernel-team
On 08.11.24 18:33, SeongJae Park wrote:
> + David Hildenbrand
>
> On Thu, 7 Nov 2024 15:56:12 -0800 Joanne Koong <joannelkoong@gmail.com> wrote:
>
>> In offline_pages(), do_migrate_range() may potentially retry forever if
>> the migration fails. Add a return value for do_migrate_range(), and
>> allow offline_page() to try migrating pages 5 times before erroring
>> out, similar to how migration failures in __alloc_contig_migrate_range()
>> is handled.
>
> I'm curious if this could cause unexpected behavioral differences to memory
> hotplugging users, and how '5' is chosen. Could you please enlighten me?
>
I'm wondering how much more often I'll have to nack such a patch. :)
On a related note: MAINTAINERS file exists for a reason.
--
Cheers,
David / dhildenb
* Re: [PATCH v4 4/6] mm/memory-hotplug: add finite retries in offline_pages() if migration fails
2024-11-08 18:56 ` David Hildenbrand
@ 2024-11-08 19:00 ` David Hildenbrand
2024-11-08 21:27 ` Shakeel Butt
0 siblings, 1 reply; 32+ messages in thread
From: David Hildenbrand @ 2024-11-08 19:00 UTC (permalink / raw)
To: SeongJae Park, Joanne Koong
Cc: miklos, linux-fsdevel, shakeel.butt, jefflexu, josef, linux-mm,
bernd.schubert, kernel-team
On 08.11.24 19:56, David Hildenbrand wrote:
> On 08.11.24 18:33, SeongJae Park wrote:
>> + David Hildenbrand
>>
>> On Thu, 7 Nov 2024 15:56:12 -0800 Joanne Koong <joannelkoong@gmail.com> wrote:
>>
>>> In offline_pages(), do_migrate_range() may potentially retry forever if
>>> the migration fails. Add a return value for do_migrate_range(), and
>>> allow offline_page() to try migrating pages 5 times before erroring
>>> out, similar to how migration failures in __alloc_contig_migrate_range()
>>> is handled.
>>
>> I'm curious if this could cause unexpected behavioral differences to memory
>> hotplugging users, and how '5' is chosen. Could you please enlighten me?
>>
>
> I'm wondering how much more often I'll have to nack such a patch. :)
A more recent discussion:
https://lore.kernel.org/linux-mm/52161997-15aa-4093-a573-3bfd8da14ff1@fujitsu.com/T/#mdda39b2956a11c46f8da8796f9612ac007edbdab
Long story short: this is expected and documented
--
Cheers,
David / dhildenb
* Re: [PATCH v4 4/6] mm/memory-hotplug: add finite retries in offline_pages() if migration fails
2024-11-08 19:00 ` David Hildenbrand
@ 2024-11-08 21:27 ` Shakeel Butt
2024-11-08 21:42 ` Joanne Koong
0 siblings, 1 reply; 32+ messages in thread
From: Shakeel Butt @ 2024-11-08 21:27 UTC (permalink / raw)
To: David Hildenbrand
Cc: SeongJae Park, Joanne Koong, miklos, linux-fsdevel, jefflexu,
josef, linux-mm, bernd.schubert, kernel-team
On Fri, Nov 08, 2024 at 08:00:25PM +0100, David Hildenbrand wrote:
> On 08.11.24 19:56, David Hildenbrand wrote:
> > On 08.11.24 18:33, SeongJae Park wrote:
> > > + David Hildenbrand
> > >
> > > On Thu, 7 Nov 2024 15:56:12 -0800 Joanne Koong <joannelkoong@gmail.com> wrote:
> > >
> > > > In offline_pages(), do_migrate_range() may potentially retry forever if
> > > > the migration fails. Add a return value for do_migrate_range(), and
> > > > allow offline_page() to try migrating pages 5 times before erroring
> > > > out, similar to how migration failures in __alloc_contig_migrate_range()
> > > > is handled.
> > >
> > > I'm curious if this could cause unexpected behavioral differences to memory
> > > hotplugging users, and how '5' is chosen. Could you please enlighten me?
> > >
> >
> > I'm wondering how much more often I'll have to nack such a patch. :)
>
> A more recent discussion: https://lore.kernel.org/linux-mm/52161997-15aa-4093-a573-3bfd8da14ff1@fujitsu.com/T/#mdda39b2956a11c46f8da8796f9612ac007edbdab
>
> Long story short: this is expected and documented
Thanks David for the background.
Joanne, simply drop this patch. It is not required for your series.
* Re: [PATCH v4 4/6] mm/memory-hotplug: add finite retries in offline_pages() if migration fails
2024-11-08 21:27 ` Shakeel Butt
@ 2024-11-08 21:42 ` Joanne Koong
2024-11-08 22:16 ` Shakeel Butt
0 siblings, 1 reply; 32+ messages in thread
From: Joanne Koong @ 2024-11-08 21:42 UTC (permalink / raw)
To: Shakeel Butt
Cc: David Hildenbrand, SeongJae Park, miklos, linux-fsdevel,
jefflexu, josef, linux-mm, bernd.schubert, kernel-team
On Fri, Nov 8, 2024 at 1:27 PM Shakeel Butt <shakeel.butt@linux.dev> wrote:
>
> On Fri, Nov 08, 2024 at 08:00:25PM +0100, David Hildenbrand wrote:
> > On 08.11.24 19:56, David Hildenbrand wrote:
> > > On 08.11.24 18:33, SeongJae Park wrote:
> > > > + David Hildenbrand
> > > >
> > > > On Thu, 7 Nov 2024 15:56:12 -0800 Joanne Koong <joannelkoong@gmail.com> wrote:
> > > >
> > > > > In offline_pages(), do_migrate_range() may potentially retry forever if
> > > > > the migration fails. Add a return value for do_migrate_range(), and
> > > > > allow offline_page() to try migrating pages 5 times before erroring
> > > > > out, similar to how migration failures in __alloc_contig_migrate_range()
> > > > > is handled.
> > > >
> > > > I'm curious if this could cause unexpected behavioral differences to memory
> > > > hotplugging users, and how '5' is chosen. Could you please enlighten me?
> > > >
> > >
> > > I'm wondering how much more often I'll have to nack such a patch. :)
> >
> > A more recent discussion: https://lore.kernel.org/linux-mm/52161997-15aa-4093-a573-3bfd8da14ff1@fujitsu.com/T/#mdda39b2956a11c46f8da8796f9612ac007edbdab
> >
> > Long story short: this is expected and documented
>
> Thanks David for the background.
>
> Joanne, simply drop this patch. It is not required for your series.
Awesome, I'm happy to drop this patch.
Just curious though - don't we need this to ensure that if an
unprivileged fuse server never completes writeback, we don't run into
this infinite loop? Or is it that memory hotplugging is always initiated
from userspace, so if it does run into an infinite loop (like in the
thread David linked), userspace is responsible for sending a signal to
terminate it?
Thanks,
Joanne
* Re: [PATCH v4 4/6] mm/memory-hotplug: add finite retries in offline_pages() if migration fails
2024-11-08 17:33 ` SeongJae Park
2024-11-08 18:56 ` David Hildenbrand
@ 2024-11-08 21:59 ` Joanne Koong
1 sibling, 0 replies; 32+ messages in thread
From: Joanne Koong @ 2024-11-08 21:59 UTC (permalink / raw)
To: SeongJae Park
Cc: miklos, linux-fsdevel, shakeel.butt, jefflexu, josef, linux-mm,
bernd.schubert, kernel-team, David Hildenbrand
On Fri, Nov 8, 2024 at 9:33 AM SeongJae Park <sj@kernel.org> wrote:
>
> + David Hildenbrand
>
> On Thu, 7 Nov 2024 15:56:12 -0800 Joanne Koong <joannelkoong@gmail.com> wrote:
>
> > In offline_pages(), do_migrate_range() may potentially retry forever if
> > the migration fails. Add a return value for do_migrate_range(), and
> > allow offline_page() to try migrating pages 5 times before erroring
> > out, similar to how migration failures in __alloc_contig_migrate_range()
> > is handled.
>
Hi SeongJae,
Thanks for taking a look. I'm going to drop this patch per the
conversation with David and Shakeel below, but wanted to reply back to
some of the questions here for completion's sake.
> I'm curious if this could cause unexpected behavioral differences to memory
> hotplugging users, and how '5' is chosen. Could you please enlighten me?
>
Most of this logic was copied from __alloc_contig_migrate_range() -
in that function, '5' is hard-coded as the number of times to retry
for migration failures. No other reason for '5' other than that.
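For reference, the pattern there looks roughly like this (paraphrased from
__alloc_contig_migrate_range() in mm/page_alloc.c):

	} else if (++tries == 5) {
		ret = -EBUSY;
		break;
	}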
> >
> > Signed-off-by: Joanne Koong <joannelkoong@gmail.com>
> > ---
> > mm/memory_hotplug.c | 13 ++++++-------
> > 1 file changed, 6 insertions(+), 7 deletions(-)
> >
> > diff --git a/mm/memory_hotplug.c b/mm/memory_hotplug.c
> > index 621ae1015106..49402442ea3b 100644
> > --- a/mm/memory_hotplug.c
> > +++ b/mm/memory_hotplug.c
> > @@ -1770,13 +1770,14 @@ static int scan_movable_pages(unsigned long start, unsigned long end,
> > return 0;
> > }
> >
> > -static void do_migrate_range(unsigned long start_pfn, unsigned long end_pfn)
> > +static int do_migrate_range(unsigned long start_pfn, unsigned long end_pfn)
>
> Seems the return value is used for only knowing if it is failed or not. If
> there is no plan to use the error code in future, what about using bool return
> type?
>
> > {
> > struct folio *folio;
> > unsigned long pfn;
> > LIST_HEAD(source);
> > static DEFINE_RATELIMIT_STATE(migrate_rs, DEFAULT_RATELIMIT_INTERVAL,
> > DEFAULT_RATELIMIT_BURST);
> > + int ret = 0;
> >
> > for (pfn = start_pfn; pfn < end_pfn; pfn++) {
> > struct page *page;
> > @@ -1833,7 +1834,6 @@ static void do_migrate_range(unsigned long start_pfn, unsigned long end_pfn)
> > .gfp_mask = GFP_USER | __GFP_MOVABLE | __GFP_RETRY_MAYFAIL,
> > .reason = MR_MEMORY_HOTPLUG,
> > };
> > - int ret;
> >
> > /*
> > * We have checked that migration range is on a single zone so
> > @@ -1863,6 +1863,7 @@ static void do_migrate_range(unsigned long start_pfn, unsigned long end_pfn)
> > putback_movable_pages(&source);
> > }
> > }
> > + return ret;
> > }
> >
> > static int __init cmdline_parse_movable_node(char *p)
> > @@ -1940,6 +1941,7 @@ int offline_pages(unsigned long start_pfn, unsigned long nr_pages,
> > const int node = zone_to_nid(zone);
> > unsigned long flags;
> > struct memory_notify arg;
> > + unsigned int tries = 0;
> > char *reason;
> > int ret;
> >
> > @@ -2028,11 +2030,8 @@ int offline_pages(unsigned long start_pfn, unsigned long nr_pages,
> >
> > ret = scan_movable_pages(pfn, end_pfn, &pfn);
> > if (!ret) {
> > - /*
> > - * TODO: fatal migration failures should bail
> > - * out
> > - */
> > - do_migrate_range(pfn, end_pfn);
> > + if (do_migrate_range(pfn, end_pfn) && ++tries == 5)
> > + ret = -EBUSY;
> > }
>
> In the '++tries == 5' case, users will show the failure reason as "unmovable
> page" from the debug log. What about setting 'reason' here to be more
> specific, e.g., "multiple migration failures"?
>
> Also, my humble understanding of the intention of this change is as follow. If
> there are 'AS_WRITEBACK_MAY_BLOCK' pages in the migration target range,
> do_migrate_range() will continuously fail. And hence this could become
> infinite loop. Pleae let me know if I'm misunderstanding.
>
> But if I'm not wrong... There is a check for expected failures above
> (scan_movable_pages()). What about adding 'AS_WRITEBACK_MAY_BLOCK' pages
> existence check there?
The main difference between adding migrate_pages() retries (this
patch) and adding an 'AS_WRITEBACK_MAY_BLOCK' check in
scan_movable_pages() is that in the latter, all pages in an
'AS_WRITEBACK_MAY_BLOCK' mapping will be skipped for migration, whereas
in the former, only pages under writeback will be skipped. I think the
latter is probably fine too for this case, but the former seemed a bit
more precise to me.
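For illustration, a hypothetical sketch (not part of this series) of
what the scan_movable_pages() variant would look like:
```
/* Hypothetical sketch only: inside the pfn scan loop of
 * scan_movable_pages(), treat every page of an AS_WRITEBACK_MAY_BLOCK
 * mapping as unmovable, so none of its pages - even ones not under
 * writeback - get handed to migration. */
if (PageLRU(page)) {
	struct address_space *mapping = folio_mapping(page_folio(page));

	if (mapping && mapping_writeback_may_block(mapping))
		continue;	/* skip: don't report as movable */
	goto found;
}
```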
Thanks,
Joanne
>
> > } while (!ret);
> >
> > --
> > 2.43.5
>
>
> Thanks,
> SJ
* Re: [PATCH v4 4/6] mm/memory-hotplug: add finite retries in offline_pages() if migration fails
2024-11-08 21:42 ` Joanne Koong
@ 2024-11-08 22:16 ` Shakeel Butt
2024-11-08 22:20 ` Joanne Koong
0 siblings, 1 reply; 32+ messages in thread
From: Shakeel Butt @ 2024-11-08 22:16 UTC (permalink / raw)
To: Joanne Koong
Cc: David Hildenbrand, SeongJae Park, miklos, linux-fsdevel,
jefflexu, josef, linux-mm, bernd.schubert, kernel-team
On Fri, Nov 08, 2024 at 01:42:15PM -0800, Joanne Koong wrote:
> On Fri, Nov 8, 2024 at 1:27 PM Shakeel Butt <shakeel.butt@linux.dev> wrote:
> >
> > On Fri, Nov 08, 2024 at 08:00:25PM +0100, David Hildenbrand wrote:
> > > On 08.11.24 19:56, David Hildenbrand wrote:
> > > > On 08.11.24 18:33, SeongJae Park wrote:
> > > > > + David Hildenbrand
> > > > >
> > > > > On Thu, 7 Nov 2024 15:56:12 -0800 Joanne Koong <joannelkoong@gmail.com> wrote:
> > > > >
> > > > > > In offline_pages(), do_migrate_range() may potentially retry forever if
> > > > > > the migration fails. Add a return value for do_migrate_range(), and
> > > > > > allow offline_pages() to try migrating pages 5 times before erroring
> > > > > > out, similar to how migration failures in __alloc_contig_migrate_range()
> > > > > > are handled.
> > > > >
> > > > > I'm curious if this could cause unexpected behavioral differences to memory
> > > > > hotplugging users, and how '5' is chosen. Could you please enlighten me?
> > > > >
> > > >
> > > > I'm wondering how much more often I'll have to nack such a patch. :)
> > >
> > > A more recent discussion: https://lore.kernel.org/linux-mm/52161997-15aa-4093-a573-3bfd8da14ff1@fujitsu.com/T/#mdda39b2956a11c46f8da8796f9612ac007edbdab
> > >
> > > Long story short: this is expected and documented
> >
> > Thanks David for the background.
> >
> > Joanne, simply drop this patch. It is not required for your series.
>
> Awesome, I'm happy to drop this patch.
>
> Just curious though - don't we need this to ensure that if an
> unprivileged fuse server never completes writeback, we don't run into
> this infinite loop? Or is it that memory hotplugging is always
> initiated from userspace, so if it does run into an infinite loop
> (like in that thread David linked), userspace is responsible for
> sending a signal to terminate it?
I think irrespective of the source of the hotplug, the current behavior
of infinite retries in some cases is documented and kind of expected, so
no need to fix it. (Though I don't know all the sources of hotplug.)
* Re: [PATCH v4 4/6] mm/memory-hotplug: add finite retries in offline_pages() if migration fails
2024-11-08 22:16 ` Shakeel Butt
@ 2024-11-08 22:20 ` Joanne Koong
0 siblings, 0 replies; 32+ messages in thread
From: Joanne Koong @ 2024-11-08 22:20 UTC (permalink / raw)
To: Shakeel Butt
Cc: David Hildenbrand, SeongJae Park, miklos, linux-fsdevel,
jefflexu, josef, linux-mm, bernd.schubert, kernel-team
On Fri, Nov 8, 2024 at 2:16 PM Shakeel Butt <shakeel.butt@linux.dev> wrote:
>
> On Fri, Nov 08, 2024 at 01:42:15PM -0800, Joanne Koong wrote:
> > On Fri, Nov 8, 2024 at 1:27 PM Shakeel Butt <shakeel.butt@linux.dev> wrote:
> > >
> > > On Fri, Nov 08, 2024 at 08:00:25PM +0100, David Hildenbrand wrote:
> > > > On 08.11.24 19:56, David Hildenbrand wrote:
> > > > > On 08.11.24 18:33, SeongJae Park wrote:
> > > > > > + David Hildenbrand
> > > > > >
> > > > > > On Thu, 7 Nov 2024 15:56:12 -0800 Joanne Koong <joannelkoong@gmail.com> wrote:
> > > > > >
> > > > > > > In offline_pages(), do_migrate_range() may potentially retry forever if
> > > > > > > the migration fails. Add a return value for do_migrate_range(), and
> > > > > > > allow offline_pages() to try migrating pages 5 times before erroring
> > > > > > > out, similar to how migration failures in __alloc_contig_migrate_range()
> > > > > > > are handled.
> > > > > >
> > > > > > I'm curious if this could cause unexpected behavioral differences to memory
> > > > > > hotplugging users, and how '5' is chosen. Could you please enlighten me?
> > > > > >
> > > > >
> > > > > I'm wondering how much more often I'll have to nack such a patch. :)
> > > >
> > > > A more recent discussion: https://lore.kernel.org/linux-mm/52161997-15aa-4093-a573-3bfd8da14ff1@fujitsu.com/T/#mdda39b2956a11c46f8da8796f9612ac007edbdab
> > > >
> > > > Long story short: this is expected and documented
> > >
> > > Thanks David for the background.
> > >
> > > Joanne, simply drop this patch. It is not required for your series.
> >
> > Awesome, I'm happy to drop this patch.
> >
> > Just curious though - don't we need this to ensure that if an
> > unprivileged fuse server never completes writeback, we don't run into
> > this infinite loop? Or is it that memory hotplugging is always
> > initiated from userspace, so if it does run into an infinite loop
> > (like in that thread David linked), userspace is responsible for
> > sending a signal to terminate it?
>
> I think irrespective of the source of the hotplug, the current behavior
> of infinite retries in some cases is documented and kind of expected, so
> no need to fix it. (Though I don't know all the sources of hotplug.)
>
Awesome, this sounds great. Thanks!
>
* Re: [PATCH v4 6/6] fuse: remove tmp folio for writebacks and internal rb tree
2024-11-08 8:48 ` Jingbo Xu
@ 2024-11-08 22:33 ` Joanne Koong
0 siblings, 0 replies; 32+ messages in thread
From: Joanne Koong @ 2024-11-08 22:33 UTC (permalink / raw)
To: Jingbo Xu
Cc: miklos, linux-fsdevel, shakeel.butt, josef, linux-mm,
bernd.schubert, kernel-team
On Fri, Nov 8, 2024 at 12:49 AM Jingbo Xu <jefflexu@linux.alibaba.com> wrote:
>
> Hi, Joanne,
>
> Thanks for the continuing work!
>
> On 11/8/24 7:56 AM, Joanne Koong wrote:
> > Currently, we allocate and copy data to a temporary folio when
> > handling writeback in order to mitigate the following deadlock scenario
> > that may arise if reclaim waits on writeback to complete:
> > * single-threaded FUSE server is in the middle of handling a request
> > that needs a memory allocation
> > * memory allocation triggers direct reclaim
> > * direct reclaim waits on a folio under writeback
> > * the FUSE server can't write back the folio since it's stuck in
> > direct reclaim
> >
> > To work around this, we allocate a temporary folio and copy over the
> > original folio to the temporary folio so that writeback can be
> > immediately cleared on the original folio. This additionally requires us
> > to maintain an internal rb tree to keep track of writeback state on the
> > temporary folios.
> >
> > A recent change prevents reclaim logic from waiting on writeback for
> > folios whose mappings have the AS_WRITEBACK_MAY_BLOCK flag set in it.
> > This commit sets AS_WRITEBACK_MAY_BLOCK on FUSE inode mappings (which
> > will prevent FUSE folios from running into the reclaim deadlock described
> > above) and removes the temporary folio + extra copying and the internal
> > rb tree.
> >
> > fio benchmarks --
> > (using averages observed from 10 runs, throwing away outliers)
> >
> > Setup:
> > sudo mount -t tmpfs -o size=30G tmpfs ~/tmp_mount
> > ./libfuse/build/example/passthrough_ll -o writeback -o max_threads=4 -o source=~/tmp_mount ~/fuse_mount
> >
> > fio --name=writeback --ioengine=sync --rw=write --bs={1k,4k,1M} --size=2G
> > --numjobs=2 --ramp_time=30 --group_reporting=1 --directory=/root/fuse_mount
> >
> > bs = 1k 4k 1M
> > Before 351 MiB/s 1818 MiB/s 1851 MiB/s
> > After 341 MiB/s 2246 MiB/s 2685 MiB/s
> > % diff -3% 23% 45%
> >
> > Signed-off-by: Joanne Koong <joannelkoong@gmail.com>
>
>
> > @@ -1622,7 +1543,7 @@ ssize_t fuse_direct_io(struct fuse_io_priv *io, struct iov_iter *iter,
> > return res;
> > }
> > }
> > - if (!cuse && fuse_range_is_writeback(inode, idx_from, idx_to)) {
> > + if (!cuse && filemap_range_has_writeback(mapping, pos, (pos + count - 1))) {
>
> filemap_range_has_writeback() is not equivalent to
> fuse_range_is_writeback(), as it will return true as long as there's any
> locked or dirty page? I can't find an equivalent helper function at
> hand though.
>
Hi Jingbo,
I couldn't find an equivalent helper function either. My guess is that
filemap_range_has_writeback() returns true for locked pages because it
has no reliable way of determining whether a locked page is dirty or
under writeback, so it errs on the side of assuming yes.
For this case, it seems okay to me to use
filemap_range_has_writeback() because if we get back a false positive
(eg filemap_range_has_writeback() returns true when there is actually
no writeback), the only cost is the overhead of an additional
fuse_sync_writes() call, and fuse_sync_writes() will return immediately
from its wait().
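For reference, here is the core of the check, trimmed from
filemap_range_has_writeback() in mm/filemap.c (a paraphrase of the
current code, with the setup omitted):
```
/* Trimmed sketch: any dirty, locked, or writeback folio in the byte
 * range makes the helper return true. A locked folio's writeback
 * state can't be trusted, so it conservatively counts as a hit. */
rcu_read_lock();
xas_for_each(&xas, folio, max) {
	if (xas_retry(&xas, folio))
		continue;
	if (xa_is_value(folio))
		continue;
	if (folio_test_dirty(folio) || folio_test_locked(folio) ||
	    folio_test_writeback(folio))
		break;
}
rcu_read_unlock();
return folio != NULL;
```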
>
>
> > @@ -3423,7 +3143,6 @@ void fuse_init_file_inode(struct inode *inode, unsigned int flags)
> > fi->iocachectr = 0;
> > init_waitqueue_head(&fi->page_waitq);
> > init_waitqueue_head(&fi->direct_io_waitq);
> > - fi->writepages = RB_ROOT;
>
> It seems that 'struct rb_root writepages' is not removed from fuse_inode
> structure.
>
Nice catch! I'll remove this from the fuse_inode struct in v5.
>
> Besides, I also looked through the first 5 patches and couldn't find
> any obvious errors at first glance. Hopefully the MM guys can offer
> more professional reviews.
>
Thanks for looking through this code in this version and the past
versions of this patchset too. It's much appreciated!
> --
> Thanks,
> Jingbo
* Re: [PATCH v4 1/6] mm: add AS_WRITEBACK_MAY_BLOCK mapping flag
2024-11-07 23:56 ` [PATCH v4 1/6] mm: add AS_WRITEBACK_MAY_BLOCK mapping flag Joanne Koong
@ 2024-11-09 0:10 ` Shakeel Butt
2024-11-11 21:11 ` Joanne Koong
0 siblings, 1 reply; 32+ messages in thread
From: Shakeel Butt @ 2024-11-09 0:10 UTC (permalink / raw)
To: Joanne Koong
Cc: miklos, linux-fsdevel, jefflexu, josef, linux-mm, bernd.schubert,
kernel-team
On Thu, Nov 07, 2024 at 03:56:09PM -0800, Joanne Koong wrote:
> Add a new mapping flag AS_WRITEBACK_MAY_BLOCK which filesystems may set
> to indicate that writeback operations may block or take an indeterminate
> amount of time to complete. Extra caution should be taken when waiting
> on writeback for folios belonging to mappings where this flag is set.
>
> Signed-off-by: Joanne Koong <joannelkoong@gmail.com>
> ---
> include/linux/pagemap.h | 11 +++++++++++
> 1 file changed, 11 insertions(+)
>
> diff --git a/include/linux/pagemap.h b/include/linux/pagemap.h
> index 68a5f1ff3301..eb5a7837e142 100644
> --- a/include/linux/pagemap.h
> +++ b/include/linux/pagemap.h
> @@ -210,6 +210,7 @@ enum mapping_flags {
> AS_STABLE_WRITES = 7, /* must wait for writeback before modifying
> folio contents */
> AS_INACCESSIBLE = 8, /* Do not attempt direct R/W access to the mapping */
> + AS_WRITEBACK_MAY_BLOCK = 9, /* Use caution when waiting on writeback */
To me 'may block' does not feel right. For example, in the reclaim code,
folio_wait_writeback() can get blocked and that is fine. However, with
non-privileged fuse involved, there are security concerns. Somehow 'may
block' does not convey that. Anyway, I am not really pushing back, but
I think there is a need for a better name here.
> /* Bits 16-25 are used for FOLIO_ORDER */
> AS_FOLIO_ORDER_BITS = 5,
> AS_FOLIO_ORDER_MIN = 16,
> @@ -335,6 +336,16 @@ static inline bool mapping_inaccessible(struct address_space *mapping)
> return test_bit(AS_INACCESSIBLE, &mapping->flags);
> }
>
> +static inline void mapping_set_writeback_may_block(struct address_space *mapping)
> +{
> + set_bit(AS_WRITEBACK_MAY_BLOCK, &mapping->flags);
> +}
> +
> +static inline bool mapping_writeback_may_block(struct address_space *mapping)
> +{
> + return test_bit(AS_WRITEBACK_MAY_BLOCK, &mapping->flags);
> +}
> +
> static inline gfp_t mapping_gfp_mask(struct address_space * mapping)
> {
> return mapping->gfp_mask;
> --
> 2.43.5
>
* Re: [PATCH v4 2/6] mm: skip reclaiming folios in legacy memcg writeback contexts that may block
2024-11-07 23:56 ` [PATCH v4 2/6] mm: skip reclaiming folios in legacy memcg writeback contexts that may block Joanne Koong
@ 2024-11-09 0:16 ` Shakeel Butt
0 siblings, 0 replies; 32+ messages in thread
From: Shakeel Butt @ 2024-11-09 0:16 UTC (permalink / raw)
To: Joanne Koong
Cc: miklos, linux-fsdevel, jefflexu, josef, linux-mm, bernd.schubert,
kernel-team
On Thu, Nov 07, 2024 at 03:56:10PM -0800, Joanne Koong wrote:
> Currently in shrink_folio_list(), reclaim for folios under writeback
> falls into 3 different cases:
> 1) Reclaim is encountering an excessive number of folios under
> writeback and this folio has both the writeback and reclaim flags
> set
> 2) Dirty throttling is enabled (this happens if reclaim through cgroup
> is not enabled, if reclaim through cgroupv2 memcg is enabled, or
> if reclaim is on the root cgroup), or if the folio is not marked for
> immediate reclaim, or if the caller does not have __GFP_FS (or
> __GFP_IO if it's going to swap) set
> 3) Legacy cgroupv1 encounters a folio that already has the reclaim flag
> set and the caller did not have __GFP_FS (or __GFP_IO if swap) set
>
> In cases 1) and 2), we activate the folio and skip reclaiming it while
> in case 3), we wait for writeback to finish on the folio and then try
> to reclaim the folio again. In case 3, we wait on writeback because
> cgroupv1 does not have dirty folio throttling; as such, this is a
> mitigation against the case where there are too many folios under
> writeback with nothing else to reclaim.
>
> The issue is that for filesystems where writeback may block, sub-optimal
> workarounds may need to be put in place to avoid a potential deadlock
> that may arise from reclaim waiting on writeback. (Even though case 3
> above is rare given that legacy cgroupv1 is on its way to being
> deprecated, this case still needs to be accounted for). For example, for
> FUSE filesystems, a temp page gets allocated per dirty page and the
> contents of the dirty page are copied over to the temp page so that
> writeback can be immediately cleared on the dirty page in order to avoid
> the following deadlock:
> * single-threaded FUSE server is in the middle of handling a request that
> needs a memory allocation
> * memory allocation triggers direct reclaim
> * direct reclaim waits on a folio under writeback (eg falls into case 3
> above) that needs to be written back to the FUSE server
> * the FUSE server can't write back the folio since it's stuck in direct
> reclaim
>
> In this commit, if legacy memcg encounters a folio with the reclaim flag
> set (eg case 3) and the folio belongs to a mapping that has the
> AS_WRITEBACK_MAY_BLOCK flag set, the folio will be activated and skip
> reclaim (eg default to behavior in case 2) instead.
>
> This allows for the suboptimal workarounds added to address the
> "reclaim wait on writeback" deadlock scenario to be removed.
>
> Signed-off-by: Joanne Koong <joannelkoong@gmail.com>
This looks good, just one nit below.
Reviewed-by: Shakeel Butt <shakeel.butt@linux.dev>
> ---
> mm/vmscan.c | 10 +++++++---
> 1 file changed, 7 insertions(+), 3 deletions(-)
>
> diff --git a/mm/vmscan.c b/mm/vmscan.c
> index 749cdc110c74..e9755cb7211b 100644
> --- a/mm/vmscan.c
> +++ b/mm/vmscan.c
> @@ -1110,6 +1110,8 @@ static unsigned int shrink_folio_list(struct list_head *folio_list,
> if (writeback && folio_test_reclaim(folio))
> stat->nr_congested += nr_pages;
>
> + mapping = folio_mapping(folio);
Move the above line into the folio_test_writeback() check block.
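Something like this (a sketch of the suggested placement, using the
variables from the patch):
```
/* The mapping is only consulted for folios under writeback, so look
 * it up only inside that branch. */
if (folio_test_writeback(folio)) {
	mapping = folio_mapping(folio);
	/* ... cases 1-3 handled below ... */
}
```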
> +
> /*
> * If a folio at the tail of the LRU is under writeback, there
> * are three cases to consider.
> @@ -1129,8 +1131,9 @@ static unsigned int shrink_folio_list(struct list_head *folio_list,
> * 2) Global or new memcg reclaim encounters a folio that is
> * not marked for immediate reclaim, or the caller does not
> * have __GFP_FS (or __GFP_IO if it's simply going to swap,
> - * not to fs). In this case mark the folio for immediate
> - * reclaim and continue scanning.
> + * not to fs), or writebacks in the mapping may block.
> + * In this case mark the folio for immediate reclaim and
> + * continue scanning.
> *
> * Require may_enter_fs() because we would wait on fs, which
> * may not have submitted I/O yet. And the loop driver might
> @@ -1165,7 +1168,8 @@ static unsigned int shrink_folio_list(struct list_head *folio_list,
> /* Case 2 above */
> } else if (writeback_throttling_sane(sc) ||
> !folio_test_reclaim(folio) ||
> - !may_enter_fs(folio, sc->gfp_mask)) {
> + !may_enter_fs(folio, sc->gfp_mask) ||
> + (mapping && mapping_writeback_may_block(mapping))) {
> /*
> * This is slightly racy -
> * folio_end_writeback() might have
> --
> 2.43.5
>
* Re: [PATCH v4 6/6] fuse: remove tmp folio for writebacks and internal rb tree
2024-11-07 23:56 ` [PATCH v4 6/6] fuse: remove tmp folio for writebacks and internal rb tree Joanne Koong
2024-11-08 8:48 ` Jingbo Xu
@ 2024-11-11 8:32 ` Jingbo Xu
2024-11-11 21:30 ` Joanne Koong
2024-11-12 9:25 ` Jingbo Xu
2 siblings, 1 reply; 32+ messages in thread
From: Jingbo Xu @ 2024-11-11 8:32 UTC (permalink / raw)
To: Joanne Koong, miklos, linux-fsdevel
Cc: shakeel.butt, josef, linux-mm, bernd.schubert, kernel-team
Hi, Joanne and Miklos,
On 11/8/24 7:56 AM, Joanne Koong wrote:
> Currently, we allocate and copy data to a temporary folio when
> handling writeback in order to mitigate the following deadlock scenario
> that may arise if reclaim waits on writeback to complete:
> * single-threaded FUSE server is in the middle of handling a request
> that needs a memory allocation
> * memory allocation triggers direct reclaim
> * direct reclaim waits on a folio under writeback
> * the FUSE server can't write back the folio since it's stuck in
> direct reclaim
>
> To work around this, we allocate a temporary folio and copy over the
> original folio to the temporary folio so that writeback can be
> immediately cleared on the original folio. This additionally requires us
> to maintain an internal rb tree to keep track of writeback state on the
> temporary folios.
>
> A recent change prevents reclaim logic from waiting on writeback for
> folios whose mappings have the AS_WRITEBACK_MAY_BLOCK flag set in it.
> This commit sets AS_WRITEBACK_MAY_BLOCK on FUSE inode mappings (which
> will prevent FUSE folios from running into the reclaim deadlock described
> above) and removes the temporary folio + extra copying and the internal
> rb tree.
>
> fio benchmarks --
> (using averages observed from 10 runs, throwing away outliers)
>
> Setup:
> sudo mount -t tmpfs -o size=30G tmpfs ~/tmp_mount
> ./libfuse/build/example/passthrough_ll -o writeback -o max_threads=4 -o source=~/tmp_mount ~/fuse_mount
>
> fio --name=writeback --ioengine=sync --rw=write --bs={1k,4k,1M} --size=2G
> --numjobs=2 --ramp_time=30 --group_reporting=1 --directory=/root/fuse_mount
>
> bs = 1k 4k 1M
> Before 351 MiB/s 1818 MiB/s 1851 MiB/s
> After 341 MiB/s 2246 MiB/s 2685 MiB/s
> % diff -3% 23% 45%
>
> Signed-off-by: Joanne Koong <joannelkoong@gmail.com>
IIUC this patch seems to break commit
8b284dc47291daf72fe300e1138a2e7ed56f38ab ("fuse: writepages: handle same
page rewrites").
> - /*
> - * Being under writeback is unlikely but possible. For example direct
> - * read to an mmaped fuse file will set the page dirty twice; once when
> - * the pages are faulted with get_user_pages(), and then after the read
> - * completed.
> - */
In short, the target scenario is like:
```
# open a fuse file and mmap
fd1 = open("fuse-file-path", ...)
uaddr = mmap(fd1, ...)
# DIRECT read to the mmaped fuse file
fd2 = open("ext4-file-path", O_DIRECT, ...)
read(fd2, uaddr, ...)
# get_user_pages() of uaddr, and triggers faultin
# a_ops->dirty_folio() <--- mark PG_dirty
# when DIRECT IO completed:
# a_ops->dirty_folio() <--- mark PG_dirty
```
The auxiliary write request list was introduced to fix this.
I'm not sure if there's an alternative other than the auxiliary list to
fix it, e.g. calling folio_wait_writeback() in a_ops->dirty_folio() so
that the same folio won't get dirtied when the writeback has not
completed yet?
--
Thanks,
Jingbo
* Re: [PATCH v4 1/6] mm: add AS_WRITEBACK_MAY_BLOCK mapping flag
2024-11-09 0:10 ` Shakeel Butt
@ 2024-11-11 21:11 ` Joanne Koong
2024-11-15 19:33 ` Joanne Koong
0 siblings, 1 reply; 32+ messages in thread
From: Joanne Koong @ 2024-11-11 21:11 UTC (permalink / raw)
To: Shakeel Butt
Cc: miklos, linux-fsdevel, jefflexu, josef, linux-mm, bernd.schubert,
kernel-team
On Fri, Nov 8, 2024 at 4:10 PM Shakeel Butt <shakeel.butt@linux.dev> wrote:
>
> On Thu, Nov 07, 2024 at 03:56:09PM -0800, Joanne Koong wrote:
> > Add a new mapping flag AS_WRITEBACK_MAY_BLOCK which filesystems may set
> > to indicate that writeback operations may block or take an indeterminate
> > amount of time to complete. Extra caution should be taken when waiting
> > on writeback for folios belonging to mappings where this flag is set.
> >
> > Signed-off-by: Joanne Koong <joannelkoong@gmail.com>
> > ---
> > include/linux/pagemap.h | 11 +++++++++++
> > 1 file changed, 11 insertions(+)
> >
> > diff --git a/include/linux/pagemap.h b/include/linux/pagemap.h
> > index 68a5f1ff3301..eb5a7837e142 100644
> > --- a/include/linux/pagemap.h
> > +++ b/include/linux/pagemap.h
> > @@ -210,6 +210,7 @@ enum mapping_flags {
> > AS_STABLE_WRITES = 7, /* must wait for writeback before modifying
> > folio contents */
> > AS_INACCESSIBLE = 8, /* Do not attempt direct R/W access to the mapping */
> > + AS_WRITEBACK_MAY_BLOCK = 9, /* Use caution when waiting on writeback */
>
> To me 'may block' does not feel right. For example, in the reclaim code,
> folio_wait_writeback() can get blocked and that is fine. However, with
> non-privileged fuse involved, there are security concerns. Somehow 'may
> block' does not convey that. Anyway, I am not really pushing back, but
> I think there is a need for a better name here.
Ahh I see where this naming causes confusion - the "MAY_BLOCK" part
could be interpreted in two ways: a) may block as in it's possible for
the writeback to block and b) may block as in it's permissible/ok for
the writeback to block. I intended "may block" to signify a) but
you're right, it could be easily interpreted as b).
I'll change this to AS_WRITEBACK_BLOCKING.
Thanks,
Joanne
>
> > /* Bits 16-25 are used for FOLIO_ORDER */
> > AS_FOLIO_ORDER_BITS = 5,
> > AS_FOLIO_ORDER_MIN = 16,
> > @@ -335,6 +336,16 @@ static inline bool mapping_inaccessible(struct address_space *mapping)
> > return test_bit(AS_INACCESSIBLE, &mapping->flags);
> > }
> >
> > +static inline void mapping_set_writeback_may_block(struct address_space *mapping)
> > +{
> > + set_bit(AS_WRITEBACK_MAY_BLOCK, &mapping->flags);
> > +}
> > +
> > +static inline bool mapping_writeback_may_block(struct address_space *mapping)
> > +{
> > + return test_bit(AS_WRITEBACK_MAY_BLOCK, &mapping->flags);
> > +}
> > +
> > static inline gfp_t mapping_gfp_mask(struct address_space * mapping)
> > {
> > return mapping->gfp_mask;
> > --
> > 2.43.5
> >
* Re: [PATCH v4 6/6] fuse: remove tmp folio for writebacks and internal rb tree
2024-11-11 8:32 ` Jingbo Xu
@ 2024-11-11 21:30 ` Joanne Koong
2024-11-12 2:31 ` Jingbo Xu
0 siblings, 1 reply; 32+ messages in thread
From: Joanne Koong @ 2024-11-11 21:30 UTC (permalink / raw)
To: Jingbo Xu
Cc: miklos, linux-fsdevel, shakeel.butt, josef, linux-mm,
bernd.schubert, kernel-team
On Mon, Nov 11, 2024 at 12:32 AM Jingbo Xu <jefflexu@linux.alibaba.com> wrote:
>
> Hi, Joanne and Miklos,
>
> On 11/8/24 7:56 AM, Joanne Koong wrote:
> > Currently, we allocate and copy data to a temporary folio when
> > handling writeback in order to mitigate the following deadlock scenario
> > that may arise if reclaim waits on writeback to complete:
> > * single-threaded FUSE server is in the middle of handling a request
> > that needs a memory allocation
> > * memory allocation triggers direct reclaim
> > * direct reclaim waits on a folio under writeback
> > * the FUSE server can't write back the folio since it's stuck in
> > direct reclaim
> >
> > To work around this, we allocate a temporary folio and copy over the
> > original folio to the temporary folio so that writeback can be
> > immediately cleared on the original folio. This additionally requires us
> > to maintain an internal rb tree to keep track of writeback state on the
> > temporary folios.
> >
> > A recent change prevents reclaim logic from waiting on writeback for
> > folios whose mappings have the AS_WRITEBACK_MAY_BLOCK flag set in it.
> > This commit sets AS_WRITEBACK_MAY_BLOCK on FUSE inode mappings (which
> > will prevent FUSE folios from running into the reclaim deadlock described
> > above) and removes the temporary folio + extra copying and the internal
> > rb tree.
> >
> > fio benchmarks --
> > (using averages observed from 10 runs, throwing away outliers)
> >
> > Setup:
> > sudo mount -t tmpfs -o size=30G tmpfs ~/tmp_mount
> > ./libfuse/build/example/passthrough_ll -o writeback -o max_threads=4 -o source=~/tmp_mount ~/fuse_mount
> >
> > fio --name=writeback --ioengine=sync --rw=write --bs={1k,4k,1M} --size=2G
> > --numjobs=2 --ramp_time=30 --group_reporting=1 --directory=/root/fuse_mount
> >
> > bs = 1k 4k 1M
> > Before 351 MiB/s 1818 MiB/s 1851 MiB/s
> > After 341 MiB/s 2246 MiB/s 2685 MiB/s
> > % diff -3% 23% 45%
> >
> > Signed-off-by: Joanne Koong <joannelkoong@gmail.com>
>
>
> IIUC this patch seems to break commit
> 8b284dc47291daf72fe300e1138a2e7ed56f38ab ("fuse: writepages: handle same
> page rewrites").
>
Interesting! My understanding was that we only needed that commit
because we were clearing writeback on the original folio before
writeback had actually finished.
Now that folio writeback state is accounted for normally (eg through
writeback being set/cleared on the original folio), does the
folio_wait_writeback() call we do in fuse_page_mkwrite() not mitigate
this?
> > - /*
> > - * Being under writeback is unlikely but possible. For example direct
> > - * read to an mmaped fuse file will set the page dirty twice; once when
> > - * the pages are faulted with get_user_pages(), and then after the read
> > - * completed.
> > - */
>
> In short, the target scenario is like:
>
> ```
> # open a fuse file and mmap
> fd1 = open("fuse-file-path", ...)
> uaddr = mmap(fd1, ...)
>
> # DIRECT read to the mmaped fuse file
> fd2 = open("ext4-file-path", O_DIRECT, ...)
> read(fd2, uaddr, ...)
> # get_user_pages() of uaddr, and triggers faultin
> # a_ops->dirty_folio() <--- mark PG_dirty
>
> # when DIRECT IO completed:
> # a_ops->dirty_folio() <--- mark PG_dirty
If you have the direct io function call stack at hand, could you point
me to the function where the direct io completion marks this folio as
dirty?
> ```
>
> The auxiliary write request list was introduced to fix this.
>
> I'm not sure if there's an alternative other than the auxiliary list to
> fix it, e.g. calling folio_wait_writeback() in a_ops->dirty_folio() so
> that the same folio won't get dirtied when the writeback has not
> completed yet?
>
I'm curious how other filesystems solve for this - this seems like a
generic situation other filesystems would run into as well.
Thanks,
Joanne
>
>
> --
> Thanks,
> Jingbo
* Re: [PATCH v4 6/6] fuse: remove tmp folio for writebacks and internal rb tree
2024-11-11 21:30 ` Joanne Koong
@ 2024-11-12 2:31 ` Jingbo Xu
2024-11-13 19:11 ` Joanne Koong
0 siblings, 1 reply; 32+ messages in thread
From: Jingbo Xu @ 2024-11-12 2:31 UTC (permalink / raw)
To: Joanne Koong
Cc: miklos, linux-fsdevel, shakeel.butt, josef, linux-mm,
bernd.schubert, kernel-team
On 11/12/24 5:30 AM, Joanne Koong wrote:
> On Mon, Nov 11, 2024 at 12:32 AM Jingbo Xu <jefflexu@linux.alibaba.com> wrote:
>>
>> Hi, Joanne and Miklos,
>>
>> On 11/8/24 7:56 AM, Joanne Koong wrote:
>>> Currently, we allocate and copy data to a temporary folio when
>>> handling writeback in order to mitigate the following deadlock scenario
>>> that may arise if reclaim waits on writeback to complete:
>>> * single-threaded FUSE server is in the middle of handling a request
>>> that needs a memory allocation
>>> * memory allocation triggers direct reclaim
>>> * direct reclaim waits on a folio under writeback
>>> * the FUSE server can't write back the folio since it's stuck in
>>> direct reclaim
>>>
>>> To work around this, we allocate a temporary folio and copy over the
>>> original folio to the temporary folio so that writeback can be
>>> immediately cleared on the original folio. This additionally requires us
>>> to maintain an internal rb tree to keep track of writeback state on the
>>> temporary folios.
>>>
>>> A recent change prevents reclaim logic from waiting on writeback for
>>> folios whose mappings have the AS_WRITEBACK_MAY_BLOCK flag set in it.
>>> This commit sets AS_WRITEBACK_MAY_BLOCK on FUSE inode mappings (which
>>> will prevent FUSE folios from running into the reclaim deadlock described
>>> above) and removes the temporary folio + extra copying and the internal
>>> rb tree.
>>>
>>> fio benchmarks --
>>> (using averages observed from 10 runs, throwing away outliers)
>>>
>>> Setup:
>>> sudo mount -t tmpfs -o size=30G tmpfs ~/tmp_mount
>>> ./libfuse/build/example/passthrough_ll -o writeback -o max_threads=4 -o source=~/tmp_mount ~/fuse_mount
>>>
>>> fio --name=writeback --ioengine=sync --rw=write --bs={1k,4k,1M} --size=2G
>>> --numjobs=2 --ramp_time=30 --group_reporting=1 --directory=/root/fuse_mount
>>>
>>> bs = 1k 4k 1M
>>> Before 351 MiB/s 1818 MiB/s 1851 MiB/s
>>> After 341 MiB/s 2246 MiB/s 2685 MiB/s
>>> % diff -3% 23% 45%
>>>
>>> Signed-off-by: Joanne Koong <joannelkoong@gmail.com>
>>
>>
>> IIUC this patch seems to break commit
>> 8b284dc47291daf72fe300e1138a2e7ed56f38ab ("fuse: writepages: handle same
>> page rewrites").
>>
>
> Interesting! My understanding was that we only needed that commit
> because we were clearing writeback on the original folio before
> writeback had actually finished.
>
> Now that folio writeback state is accounted for normally (eg through
> writeback being set/cleared on the original folio), does the
> folio_wait_writeback() call we do in fuse_page_mkwrite() not mitigate
> this?
Yes, after inspecting the writeback logic more, it seems that the second
writeback won't be initiated if the first one has not completed yet, see
```
a_ops->writepages
write_cache_pages
writeback_iter
writeback_get_folio
folio_prepare_writeback
if folio_test_writeback(folio):
folio_wait_writeback(folio)
```
and thus it won't be an issue to remove the auxiliary list ;)
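For reference, the check in question, trimmed from
folio_prepare_writeback() in mm/page-writeback.c (a paraphrase):
```
/* Trimmed sketch: a folio still under writeback is either skipped for
 * this round (WB_SYNC_NONE) or waited on (WB_SYNC_ALL) before a new
 * writeback starts, so two writebacks of the same folio can't overlap. */
if (folio_test_writeback(folio)) {
	if (wbc->sync_mode == WB_SYNC_NONE)
		goto unlock;	/* skip it for now */
	folio_wait_writeback(folio);
}
BUG_ON(folio_test_writeback(folio));
```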
>
>>> - /*
>>> - * Being under writeback is unlikely but possible. For example direct
>>> - * read to an mmaped fuse file will set the page dirty twice; once when
>>> - * the pages are faulted with get_user_pages(), and then after the read
>>> - * completed.
>>> - */
>>
>> In short, the target scenario is like:
>>
>> ```
>> # open a fuse file and mmap
>> fd1 = open("fuse-file-path", ...)
>> uaddr = mmap(fd1, ...)
>>
>> # DIRECT read to the mmaped fuse file
>> fd2 = open("ext4-file-path", O_DIRECT, ...)
>> read(fd2, uaddr, ...)
>> # get_user_pages() of uaddr, and triggers faultin
>> # a_ops->dirty_folio() <--- mark PG_dirty
>>
>> # when DIRECT IO completed:
>> # a_ops->dirty_folio() <--- mark PG_dirty
>
> If you have the direct io function call stack at hand, could you point
> me to the function where the direct io completion marks this folio as
> dirty?
FYI The full call stack is like:
```
# DIRECT read(2) to the mmaped fuse file
read(fd2, uaddr1, ...)
f_ops->read_iter()
(iomap-based ) iomap_dio_rw
# for READ && user_backed_iter(iter):
dio->flags |= IOMAP_DIO_DIRTY
iomap_dio_iter
iomap_dio_bio_iter
# add user or kernel pages to a bio
bio_iov_iter_get_pages
...
pin_user_pages_fast(..., FOLL_WRITE, ...)
# find corresponding vma of dest buffer (fuse page cache)
# search page table (pte) to find corresponding page
# if not fault yet, trigger explicit faultin:
faultin_page(..., FOLL_WRITE, ...)
handle_mm_fault(..., FAULT_FLAG_WRITE)
handle_pte_fault
do_wp_page
(vma->vm_flags & VM_SHARED) wp_page_shared
...
fault_dirty_shared_page
folio_mark_dirty
a_ops->dirty_folio(), i.e.,
filemap_dirty_folio()
# set PG_dirty
folio_test_set_dirty(folio)
# set PAGECACHE_TAG_DIRTY
__folio_mark_dirty
# if dio->flags & IOMAP_DIO_DIRTY:
bio_set_pages_dirty
(for each dest page) folio_mark_dirty
a_ops->dirty_folio(), i.e., filemap_dirty_folio()
# set PG_dirty
folio_test_set_dirty(folio)
# set PAGECACHE_TAG_DIRTY
__folio_mark_dirty
```
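As a concrete userspace sketch of this access pattern (paths are
illustrative and error handling is omitted):
```
#define _GNU_SOURCE	/* for O_DIRECT */
#include <fcntl.h>
#include <sys/mman.h>
#include <unistd.h>

int main(void)
{
	/* mmap a shared writable page of a FUSE file */
	int fd1 = open("/mnt/fuse/file", O_RDWR);
	char *uaddr = mmap(NULL, 4096, PROT_READ | PROT_WRITE,
			   MAP_SHARED, fd1, 0);

	/* O_DIRECT read from another filesystem into the mmapped
	 * buffer: the fault path dirties the FUSE folio once via
	 * fault_dirty_shared_page(), and IO completion dirties it
	 * again via bio_set_pages_dirty(), per the stack above. */
	int fd2 = open("/mnt/ext4/file", O_RDONLY | O_DIRECT);
	read(fd2, uaddr, 4096);

	close(fd2);
	close(fd1);
	return 0;
}
```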
>
>> ```
>>
>> The auxiliary write request list was introduced to fix this.
>>
>> I'm not sure if there's an alternative other than the auxiliary list to
>> fix it, e.g. calling folio_wait_writeback() in a_ops->dirty_folio() so
>> that the same folio won't get dirtied when the writeback has not
>> completed yet?
>>
>
> I'm curious how other filesystems solve for this - this seems like a
> generic situation other filesystems would run into as well.
>
As mentioned above, the writeback path will prevent a duplicate
writeback request for the same page while the first writeback IO has
not completed yet.
Sorry for the noise...
--
Thanks,
Jingbo
* Re: [PATCH v4 6/6] fuse: remove tmp folio for writebacks and internal rb tree
2024-11-07 23:56 ` [PATCH v4 6/6] fuse: remove tmp folio for writebacks and internal rb tree Joanne Koong
2024-11-08 8:48 ` Jingbo Xu
2024-11-11 8:32 ` Jingbo Xu
@ 2024-11-12 9:25 ` Jingbo Xu
2024-11-14 0:39 ` Joanne Koong
2 siblings, 1 reply; 32+ messages in thread
From: Jingbo Xu @ 2024-11-12 9:25 UTC (permalink / raw)
To: Joanne Koong, miklos, linux-fsdevel
Cc: shakeel.butt, josef, linux-mm, bernd.schubert, kernel-team
Hi Joanne,
On 11/8/24 7:56 AM, Joanne Koong wrote:
> Currently, we allocate and copy data to a temporary folio when
> handling writeback in order to mitigate the following deadlock scenario
> that may arise if reclaim waits on writeback to complete:
> * single-threaded FUSE server is in the middle of handling a request
> that needs a memory allocation
> * memory allocation triggers direct reclaim
> * direct reclaim waits on a folio under writeback
> * the FUSE server can't write back the folio since it's stuck in
> direct reclaim
>
> To work around this, we allocate a temporary folio and copy over the
> original folio to the temporary folio so that writeback can be
> immediately cleared on the original folio. This additionally requires us
> to maintain an internal rb tree to keep track of writeback state on the
> temporary folios.
>
> A recent change prevents reclaim logic from waiting on writeback for
> folios whose mappings have the AS_WRITEBACK_MAY_BLOCK flag set in it.
> This commit sets AS_WRITEBACK_MAY_BLOCK on FUSE inode mappings (which
> will prevent FUSE folios from running into the reclaim deadlock described
> above) and removes the temporary folio + extra copying and the internal
> rb tree.
>
> fio benchmarks --
> (using averages observed from 10 runs, throwing away outliers)
>
> Setup:
> sudo mount -t tmpfs -o size=30G tmpfs ~/tmp_mount
> ./libfuse/build/example/passthrough_ll -o writeback -o max_threads=4 -o source=~/tmp_mount ~/fuse_mount
>
> fio --name=writeback --ioengine=sync --rw=write --bs={1k,4k,1M} --size=2G
> --numjobs=2 --ramp_time=30 --group_reporting=1 --directory=/root/fuse_mount
>
> bs = 1k 4k 1M
> Before 351 MiB/s 1818 MiB/s 1851 MiB/s
> After 341 MiB/s 2246 MiB/s 2685 MiB/s
> % diff -3% 23% 45%
>
> Signed-off-by: Joanne Koong <joannelkoong@gmail.com>
I think there are some places that check or wait on writeback which
could be reconsidered as to whether they are still needed after we drop
the temp page design.
As they are inherited from the original implementation, at least they
are harmless. I think they could be retained in this patch and cleaned
up later if really needed.
> @@ -891,7 +813,7 @@ static int fuse_do_readfolio(struct file *file, struct folio *folio)
> * have writeback that extends beyond the lifetime of the folio. So
> * make sure we read a properly synced folio.
> */
> - fuse_wait_on_folio_writeback(inode, folio);
> + folio_wait_writeback(folio);
I doubt that waiting on writeback is needed here, as page cache now
won't be freed until the writeback IO completes.
The routines that attempt to free page cache, e.g. invalidate_inode_pages2()
(generally called by distributed filesystems when the file content has
been modified remotely) or truncate_inode_pages() (called from
truncate(2) or inode eviction), will wait for writeback completion (if
any) before freeing the page.
Thus I don't think there's any possibility that .read_folio() or
.readahead() will be called while the writeback has not completed.
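A minimal sketch of the ordering being relied on here (paraphrased;
not exact kernel code):
```
/* Paraphrased pattern from the page-cache freeing paths
 * (invalidate_inode_pages2_range(), truncate_inode_pages_range()):
 * writeback is awaited under the folio lock before the folio may be
 * dropped from the mapping, so .read_folio()/.readahead() can't race
 * with an inflight writeback of a freed folio. */
folio_lock(folio);
if (folio->mapping == mapping) {
	folio_wait_writeback(folio);
	/* ... only now may the folio be removed from the page cache ... */
}
folio_unlock(folio);
```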
> @@ -1172,7 +1093,7 @@ static ssize_t fuse_send_write_pages(struct fuse_io_args *ia,
> int err;
>
> for (i = 0; i < ap->num_folios; i++)
> - fuse_wait_on_folio_writeback(inode, ap->folios[i]);
> + folio_wait_writeback(ap->folios[i]);
Ditto.
> static void fuse_writepage_args_page_fill(struct fuse_writepage_args *wpa, struct folio *folio,
> - struct folio *tmp_folio, uint32_t folio_index)
> + uint32_t folio_index)
> {
> struct inode *inode = folio->mapping->host;
> struct fuse_args_pages *ap = &wpa->ia.ap;
>
> - folio_copy(tmp_folio, folio);
> -
> - ap->folios[folio_index] = tmp_folio;
> + folio_get(folio);
I still think this folio_get() here is harmless but redundant.
Ditto: page cache won't be freed before writeback completes.
Besides, other .writepages() implementations, e.g. iomap_writepages(),
also don't take a refcount when constructing the writeback IO.
> @@ -2481,7 +2200,7 @@ static int fuse_write_begin(struct file *file, struct address_space *mapping,
> if (IS_ERR(folio))
> goto error;
>
> - fuse_wait_on_page_writeback(mapping->host, folio->index);
> + folio_wait_writeback(folio);
I also doubt if wait_on_writeback() is needed here, as now there won't
be duplicate writeback IOs for the same page.
> @@ -2545,13 +2264,11 @@ static int fuse_launder_folio(struct folio *folio)
> {
> int err = 0;
> if (folio_clear_dirty_for_io(folio)) {
> - struct inode *inode = folio->mapping->host;
> -
> /* Serialize with pending writeback for the same page */
> - fuse_wait_on_page_writeback(inode, folio->index);
> + folio_wait_writeback(folio);
I think folio_wait_writeback() is unneeded after dropping the temp page
copying. It was introduced in commit
3993382bb3198cc5e263c3519418e716bd57b056 ("fuse: launder page should
wait for page writeback") because .launder_page() could be called while
the previous writeback of the same page had not completed yet. Since we
now won't clear PG_writeback until the writeback completes,
.launder_page() won't be called on the same page while the corresponding
writeback IO is still in flight.
--
Thanks,
Jingbo
* Re: [PATCH v4 6/6] fuse: remove tmp folio for writebacks and internal rb tree
2024-11-12 2:31 ` Jingbo Xu
@ 2024-11-13 19:11 ` Joanne Koong
0 siblings, 0 replies; 32+ messages in thread
From: Joanne Koong @ 2024-11-13 19:11 UTC (permalink / raw)
To: Jingbo Xu
Cc: miklos, linux-fsdevel, shakeel.butt, josef, linux-mm,
bernd.schubert, kernel-team
On Mon, Nov 11, 2024 at 6:31 PM Jingbo Xu <jefflexu@linux.alibaba.com> wrote:
>
>
>
> On 11/12/24 5:30 AM, Joanne Koong wrote:
> > On Mon, Nov 11, 2024 at 12:32 AM Jingbo Xu <jefflexu@linux.alibaba.com> wrote:
> >>
> >> Hi, Joanne and Miklos,
> >>
> >> On 11/8/24 7:56 AM, Joanne Koong wrote:
> >>> Currently, we allocate and copy data to a temporary folio when
> >>> handling writeback in order to mitigate the following deadlock scenario
> >>> that may arise if reclaim waits on writeback to complete:
> >>> * single-threaded FUSE server is in the middle of handling a request
> >>> that needs a memory allocation
> >>> * memory allocation triggers direct reclaim
> >>> * direct reclaim waits on a folio under writeback
> >>> * the FUSE server can't write back the folio since it's stuck in
> >>> direct reclaim
> >>>
> >>> To work around this, we allocate a temporary folio and copy over the
> >>> original folio to the temporary folio so that writeback can be
> >>> immediately cleared on the original folio. This additionally requires us
> >>> to maintain an internal rb tree to keep track of writeback state on the
> >>> temporary folios.
> >>>
> >>> A recent change prevents reclaim logic from waiting on writeback for
> >>> folios whose mappings have the AS_WRITEBACK_MAY_BLOCK flag set in it.
> >>> This commit sets AS_WRITEBACK_MAY_BLOCK on FUSE inode mappings (which
> >>> will prevent FUSE folios from running into the reclaim deadlock described
> >>> above) and removes the temporary folio + extra copying and the internal
> >>> rb tree.
> >>>
> >>> fio benchmarks --
> >>> (using averages observed from 10 runs, throwing away outliers)
> >>>
> >>> Setup:
> >>> sudo mount -t tmpfs -o size=30G tmpfs ~/tmp_mount
> >>> ./libfuse/build/example/passthrough_ll -o writeback -o max_threads=4 -o source=~/tmp_mount ~/fuse_mount
> >>>
> >>> fio --name=writeback --ioengine=sync --rw=write --bs={1k,4k,1M} --size=2G
> >>> --numjobs=2 --ramp_time=30 --group_reporting=1 --directory=/root/fuse_mount
> >>>
> >>> bs = 1k 4k 1M
> >>> Before 351 MiB/s 1818 MiB/s 1851 MiB/s
> >>> After 341 MiB/s 2246 MiB/s 2685 MiB/s
> >>> % diff -3% 23% 45%
> >>>
> >>> Signed-off-by: Joanne Koong <joannelkoong@gmail.com>
> >>
> >>
> >> IIUC this patch seems to break commit
> >> 8b284dc47291daf72fe300e1138a2e7ed56f38ab ("fuse: writepages: handle same
> >> page rewrites").
> >>
> >
> > Interesting! My understanding was that we only needed that commit
> > because we were clearing writeback on the original folio before
> > writeback had actually finished.
> >
> > Now that folio writeback state is accounted for normally (eg through
> > writeback being set/cleared on the original folio), does the
> > folio_wait_writeback() call we do in fuse_page_mkwrite() not mitigate
> > this?
>
> Yes, after inspecting the writeback logic more, it seems that the second
> writeback won't be initiated if the first one has not completed yet, see
>
> ```
> a_ops->writepages
> write_cache_pages
> writeback_iter
> writeback_get_folio
> folio_prepare_writeback
> if folio_test_writeback(folio):
> folio_wait_writeback(folio)
> ```
>
> and thus it won't be an issue to remove the auxiliary list ;)
>
Awesome, thanks for double-checking!
> >
> >>> - /*
> >>> - * Being under writeback is unlikely but possible. For example direct
> >>> - * read to an mmaped fuse file will set the page dirty twice; once when
> >>> - * the pages are faulted with get_user_pages(), and then after the read
> >>> - * completed.
> >>> - */
> >>
> >> In short, the target scenario is like:
> >>
> >> ```
> >> # open a fuse file and mmap
> >> fd1 = open("fuse-file-path", ...)
> >> uaddr = mmap(fd1, ...)
> >>
> >> # DIRECT read to the mmaped fuse file
> >> fd2 = open("ext4-file-path", O_DIRECT, ...)
> >> read(fd2, uaddr, ...)
> >> # get_user_pages() of uaddr, and triggers faultin
> >> # a_ops->dirty_folio() <--- mark PG_dirty
> >>
> >> # when DIRECT IO completed:
> >> # a_ops->dirty_folio() <--- mark PG_dirty
> >
> > If you have the direct io function call stack at hand, could you point
> > me to the function where the direct io completion marks this folio as
> > dirty?
>
>
> FYI The full call stack is like:
>
> ```
> # DIRECT read(2) to the mmaped fuse file
> read(fd2, uaddr1, ...)
> f_ops->read_iter()
> (iomap-based ) iomap_dio_rw
> # for READ && user_backed_iter(iter):
> dio->flags |= IOMAP_DIO_DIRTY
> iomap_dio_iter
> iomap_dio_bio_iter
> # add user or kernel pages to a bio
> bio_iov_iter_get_pages
> ...
> pin_user_pages_fast(..., FOLL_WRITE, ...)
> # find corresponding vma of dest buffer (fuse page cache)
> # search page table (pte) to find corresponding page
> # if not fault yet, trigger explicit faultin:
> faultin_page(..., FOLL_WRITE, ...)
> handle_mm_fault(..., FAULT_FLAG_WRITE)
> handle_pte_fault
> do_wp_page
> (vma->vm_flags & VM_SHARED) wp_page_shared
> ...
> fault_dirty_shared_page
> folio_mark_dirty
> a_ops->dirty_folio(), i.e.,
> filemap_dirty_folio()
> # set PG_dirty
> folio_test_set_dirty(folio)
> # set PAGECACHE_TAG_DIRTY
> __folio_mark_dirty
>
>
> # if dio->flags & IOMAP_DIO_DIRTY:
> bio_set_pages_dirty
> (for each dest page) folio_mark_dirty
> a_ops->dirty_folio(), i.e., filemap_dirty_folio()
> # set PG_dirty
> folio_test_set_dirty(folio)
> # set PAGECACHE_TAG_DIRTY
> __folio_mark_dirty
> ```
>
Thanks for this info, Jingbo.
>
> >
> >> ```
> >>
> >> The auxiliary write request list was introduced to fix this.
> >>
> >> I'm not sure if there's an alternative other than the auxiliary list to
> >> fix it, e.g. calling folio_wait_writeback() in a_ops->dirty_folio() so
> >> that the same folio won't get dirtied when the writeback has not
> >> completed yet?
> >>
> >
> > I'm curious how other filesystems solve for this - this seems like a
> > generic situation other filesystems would run into as well.
> >
>
> As mentioned above, the writeback path will prevent the duplicate
> writeback request on the same page when the first writeback IO has not
> completed yet.
>
> Sorry for the noise...
>
> --
> Thanks,
> Jingbo
* Re: [PATCH v4 6/6] fuse: remove tmp folio for writebacks and internal rb tree
2024-11-12 9:25 ` Jingbo Xu
@ 2024-11-14 0:39 ` Joanne Koong
2024-11-14 1:46 ` Jingbo Xu
0 siblings, 1 reply; 32+ messages in thread
From: Joanne Koong @ 2024-11-14 0:39 UTC (permalink / raw)
To: Jingbo Xu
Cc: miklos, linux-fsdevel, shakeel.butt, josef, linux-mm,
bernd.schubert, kernel-team
On Tue, Nov 12, 2024 at 1:25 AM Jingbo Xu <jefflexu@linux.alibaba.com> wrote:
>
> Hi Joanne,
>
> On 11/8/24 7:56 AM, Joanne Koong wrote:
> > Currently, we allocate and copy data to a temporary folio when
> > handling writeback in order to mitigate the following deadlock scenario
> > that may arise if reclaim waits on writeback to complete:
> > * single-threaded FUSE server is in the middle of handling a request
> > that needs a memory allocation
> > * memory allocation triggers direct reclaim
> > * direct reclaim waits on a folio under writeback
> > * the FUSE server can't write back the folio since it's stuck in
> > direct reclaim
> >
> > To work around this, we allocate a temporary folio and copy over the
> > original folio to the temporary folio so that writeback can be
> > immediately cleared on the original folio. This additionally requires us
> > to maintain an internal rb tree to keep track of writeback state on the
> > temporary folios.
> >
> > A recent change prevents reclaim logic from waiting on writeback for
> > folios whose mappings have the AS_WRITEBACK_MAY_BLOCK flag set in it.
> > This commit sets AS_WRITEBACK_MAY_BLOCK on FUSE inode mappings (which
> > will prevent FUSE folios from running into the reclaim deadlock described
> > above) and removes the temporary folio + extra copying and the internal
> > rb tree.
> >
> > fio benchmarks --
> > (using averages observed from 10 runs, throwing away outliers)
> >
> > Setup:
> > sudo mount -t tmpfs -o size=30G tmpfs ~/tmp_mount
> > ./libfuse/build/example/passthrough_ll -o writeback -o max_threads=4 -o source=~/tmp_mount ~/fuse_mount
> >
> > fio --name=writeback --ioengine=sync --rw=write --bs={1k,4k,1M} --size=2G
> > --numjobs=2 --ramp_time=30 --group_reporting=1 --directory=/root/fuse_mount
> >
> > bs = 1k 4k 1M
> > Before 351 MiB/s 1818 MiB/s 1851 MiB/s
> > After 341 MiB/s 2246 MiB/s 2685 MiB/s
> > % diff -3% 23% 45%
> >
> > Signed-off-by: Joanne Koong <joannelkoong@gmail.com>
>
> I think there are some places checking or waiting for writeback could be
> reconsidered if they are still needed or not after we dropping the temp
> page design.
>
> As they are inherited from the original implementation, at least they
> are harmless. I think they could be remained in this patch, and could
> be cleaned up later if really needed.
>
Thank you for the thorough inspection!
>
> > @@ -891,7 +813,7 @@ static int fuse_do_readfolio(struct file *file, struct folio *folio)
> > * have writeback that extends beyond the lifetime of the folio. So
> > * make sure we read a properly synced folio.
> > */
> > - fuse_wait_on_folio_writeback(inode, folio);
> > + folio_wait_writeback(folio);
>
> I doubt if wait-on-writeback is needed here, as now page cache won't be
> freed until the writeback IO completes.
>
> The routine attempts to free page cache, e.g. invalidate_inode_pages2()
> (generally called by distributed filesystems when the file content has
> been modified from remote) or truncate_inode_pages() (called from
> truncate(2) or inode eviction) will wait for writeback completion (if
> any) before freeing page.
>
> Thus I don't think there's any possibility that .read_folio() or
> .readahead() will be called when the writeback has not completed.
>
Great point. I'll remove this line and the comment above it too.
>
> > @@ -1172,7 +1093,7 @@ static ssize_t fuse_send_write_pages(struct fuse_io_args *ia,
> > int err;
> >
> > for (i = 0; i < ap->num_folios; i++)
> > - fuse_wait_on_folio_writeback(inode, ap->folios[i]);
> > + folio_wait_writeback(ap->folios[i]);
>
> Ditto.
>
Why did we need this fuse_wait_on_folio_writeback() even when we had
the temp pages? If I'm understanding it correctly,
fuse_send_write_pages() is only called for the writethrough case (by
fuse_perform_write()), so these folios would never even be under
writeback, no?
>
>
> > static void fuse_writepage_args_page_fill(struct fuse_writepage_args *wpa, struct folio *folio,
> > - struct folio *tmp_folio, uint32_t folio_index)
> > + uint32_t folio_index)
> > {
> > struct inode *inode = folio->mapping->host;
> > struct fuse_args_pages *ap = &wpa->ia.ap;
> >
> > - folio_copy(tmp_folio, folio);
> > -
> > - ap->folios[folio_index] = tmp_folio;
> > + folio_get(folio);
>
> I still think this folio_get() here is harmless but redundant.
>
> Ditto page cache won't be freed before writeback completes.
>
> Besides, other .writepages() implementaions e.g. iomap_writepages() also
> doen't get the refcount when constructing the writeback IO.
>
Point taken. I'll remove this then, since other .writepages() also
don't obtain a refcount.
>
> > @@ -2481,7 +2200,7 @@ static int fuse_write_begin(struct file *file, struct address_space *mapping,
> > if (IS_ERR(folio))
> > goto error;
> >
> > - fuse_wait_on_page_writeback(mapping->host, folio->index);
> > + folio_wait_writeback(folio);
>
> I also doubt if wait_on_writeback() is needed here, as now there won't
> be duplicate writeback IOs for the same page.
What prevents there from being a duplicate writeback write for the
same page? This is the path I'm looking at:
ksys_write()
vfs_write()
new_sync_write()
op->write_iter()
fuse_file_write_iter()
fuse_cache_write_iter()
generic_file_write_iter()
__generic_file_write_iter()
generic_perform_write()
op->write_begin()
fuse_write_begin()
but I'm not seeing where there's anything that prevents a duplicate
write from happening.
>
>
> > @@ -2545,13 +2264,11 @@ static int fuse_launder_folio(struct folio *folio)
> > {
> > int err = 0;
> > if (folio_clear_dirty_for_io(folio)) {
> > - struct inode *inode = folio->mapping->host;
> > -
> > /* Serialize with pending writeback for the same page */
> > - fuse_wait_on_page_writeback(inode, folio->index);
> > + folio_wait_writeback(folio);
>
> I think folio_wait_writeback() is unneeded after dropping the temp page
> copying. It was introduced in commit
> 3993382bb3198cc5e263c3519418e716bd57b056 ("fuse: launder page should
> wait for page writeback") because .launder_page() could be called while
> the previous writeback of the same page had not completed yet. Since we
> now won't clear PG_writeback until the writeback completes,
> .launder_page() won't be called on the same page while the corresponding
> writeback IO is still in flight.
>
Nice catch, I'll remove this in v5.
Thanks,
Joanne
>
> --
> Thanks,
> Jingbo
* Re: [PATCH v4 6/6] fuse: remove tmp folio for writebacks and internal rb tree
2024-11-14 0:39 ` Joanne Koong
@ 2024-11-14 1:46 ` Jingbo Xu
2024-11-14 18:19 ` Joanne Koong
0 siblings, 1 reply; 32+ messages in thread
From: Jingbo Xu @ 2024-11-14 1:46 UTC (permalink / raw)
To: Joanne Koong
Cc: miklos, linux-fsdevel, shakeel.butt, josef, linux-mm,
bernd.schubert, kernel-team
On 11/14/24 8:39 AM, Joanne Koong wrote:
> On Tue, Nov 12, 2024 at 1:25 AM Jingbo Xu <jefflexu@linux.alibaba.com> wrote:
>>
>> Hi Joanne,
>>
>> On 11/8/24 7:56 AM, Joanne Koong wrote:
>>> Currently, we allocate and copy data to a temporary folio when
>>> handling writeback in order to mitigate the following deadlock scenario
>>> that may arise if reclaim waits on writeback to complete:
>>> * single-threaded FUSE server is in the middle of handling a request
>>> that needs a memory allocation
>>> * memory allocation triggers direct reclaim
>>> * direct reclaim waits on a folio under writeback
>>> * the FUSE server can't write back the folio since it's stuck in
>>> direct reclaim
>>>
>>> To work around this, we allocate a temporary folio and copy over the
>>> original folio to the temporary folio so that writeback can be
>>> immediately cleared on the original folio. This additionally requires us
>>> to maintain an internal rb tree to keep track of writeback state on the
>>> temporary folios.
>>>
>>> A recent change prevents reclaim logic from waiting on writeback for
>>> folios whose mappings have the AS_WRITEBACK_MAY_BLOCK flag set in it.
>>> This commit sets AS_WRITEBACK_MAY_BLOCK on FUSE inode mappings (which
>>> will prevent FUSE folios from running into the reclaim deadlock described
>>> above) and removes the temporary folio + extra copying and the internal
>>> rb tree.
>>>
>>> fio benchmarks --
>>> (using averages observed from 10 runs, throwing away outliers)
>>>
>>> Setup:
>>> sudo mount -t tmpfs -o size=30G tmpfs ~/tmp_mount
>>> ./libfuse/build/example/passthrough_ll -o writeback -o max_threads=4 -o source=~/tmp_mount ~/fuse_mount
>>>
>>> fio --name=writeback --ioengine=sync --rw=write --bs={1k,4k,1M} --size=2G
>>> --numjobs=2 --ramp_time=30 --group_reporting=1 --directory=/root/fuse_mount
>>>
>>> bs = 1k 4k 1M
>>> Before 351 MiB/s 1818 MiB/s 1851 MiB/s
>>> After 341 MiB/s 2246 MiB/s 2685 MiB/s
>>> % diff -3% 23% 45%
>>>
>>> Signed-off-by: Joanne Koong <joannelkoong@gmail.com>
>>
>> I think there are some places that check or wait on writeback which
>> could be reconsidered as to whether they are still needed after we drop
>> the temp page design.
>>
>> As they are inherited from the original implementation, at least they
>> are harmless. I think they could be retained in this patch and cleaned
>> up later if really needed.
>>
>
> Thank you for the thorough inspection!
>
>>
>>> @@ -891,7 +813,7 @@ static int fuse_do_readfolio(struct file *file, struct folio *folio)
>>> * have writeback that extends beyond the lifetime of the folio. So
>>> * make sure we read a properly synced folio.
>>> */
>>> - fuse_wait_on_folio_writeback(inode, folio);
>>> + folio_wait_writeback(folio);
>>
>> I doubt that waiting on writeback is needed here, as page cache now
>> won't be freed until the writeback IO completes.
>>
>> The routines that attempt to free page cache, e.g. invalidate_inode_pages2()
>> (generally called by distributed filesystems when the file content has
>> been modified remotely) or truncate_inode_pages() (called from
>> truncate(2) or inode eviction), will wait for writeback completion (if
>> any) before freeing the page.
>>
>> Thus I don't think there's any possibility that .read_folio() or
>> .readahead() will be called while the writeback has not completed.
>>
>
> Great point. I'll remove this line and the comment above it too.
>
>>
>>> @@ -1172,7 +1093,7 @@ static ssize_t fuse_send_write_pages(struct fuse_io_args *ia,
>>> int err;
>>>
>>> for (i = 0; i < ap->num_folios; i++)
>>> - fuse_wait_on_folio_writeback(inode, ap->folios[i]);
>>> + folio_wait_writeback(ap->folios[i]);
>>
>> Ditto.
Actually the "Ditto" was a typo; I originally meant that the wait on
writeback in fuse_send_readpages() is unneeded, as the page cache won't
be freed until the writeback IO completes.
> - wait_event(fi->page_waitq, !fuse_range_is_writeback(inode, first, last));
> + filemap_fdatawait_range(inode->i_mapping, first, last);
In fact the above waiting for writeback in fuse_send_write_pages() is
needed. The reason is as follows:
>>
>
> Why did we need this fuse_wait_on_folio_writeback() even when we had
> the temp pages? If I'm understanding it correctly,
> fuse_send_write_pages() is only called for the writethrough case (by
> fuse_perform_write()), so these folios would never even be under
> writeback, no?
I think an mmap write could dirty the page, and writeback could then be
triggered on it.
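For concreteness, the loop in question after this patch, paraphrased
from the hunk quoted above with the rationale added as comments (a
sketch, not the verbatim code):

        /* Writethrough path, but an earlier mmap write may have dirtied
         * these folios and async writeback may already be in flight, so
         * wait for it before issuing the synchronous write. */
        for (i = 0; i < ap->num_folios; i++)
                folio_wait_writeback(ap->folios[i]);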
>
>>
>>
>>> static void fuse_writepage_args_page_fill(struct fuse_writepage_args *wpa, struct folio *folio,
>>> - struct folio *tmp_folio, uint32_t folio_index)
>>> + uint32_t folio_index)
>>> {
>>> struct inode *inode = folio->mapping->host;
>>> struct fuse_args_pages *ap = &wpa->ia.ap;
>>>
>>> - folio_copy(tmp_folio, folio);
>>> -
>>> - ap->folios[folio_index] = tmp_folio;
>>> + folio_get(folio);
>>
>> I still think this folio_get() here is harmless but redundant.
>>
>> Ditto page cache won't be freed before writeback completes.
>>
>> Besides, other .writepages() implementations, e.g. iomap_writepages(),
>> also don't take a refcount when constructing the writeback IO.
>>
>
> Point taken. I'll remove this then, since other .writepages() also
> don't obtain a refcount.
>
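With the folio_get() dropped, the fill helper would then reduce to
something like the below; this is a sketch of the direction, not the
actual v5 code, and the descriptor setup around it is elided:

        static void fuse_writepage_args_page_fill(struct fuse_writepage_args *wpa,
                                                  struct folio *folio,
                                                  uint32_t folio_index)
        {
                struct fuse_args_pages *ap = &wpa->ia.ap;

                /* No temp folio copy and no extra reference: the page
                 * cache keeps the folio alive until writeback completes. */
                ap->folios[folio_index] = folio;
        }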
>>
>>> @@ -2481,7 +2200,7 @@ static int fuse_write_begin(struct file *file, struct address_space *mapping,
>>> if (IS_ERR(folio))
>>> goto error;
>>>
>>> - fuse_wait_on_page_writeback(mapping->host, folio->index);
>>> + folio_wait_writeback(folio);
>>
>> I also doubt if wait_on_writeback() is needed here, as now there won't
>> be duplicate writeback IOs for the same page.
>
> What prevents there from being a duplicate writeback write for the
> same page? This is the path I'm looking at:
>
> ksys_write()
> vfs_write()
> new_sync_write()
> op->write_iter()
> fuse_file_write_iter()
> fuse_cache_write_iter()
> generic_file_write_iter()
> __generic_file_write_iter()
> generic_perform_write()
> op->write_begin()
> fuse_write_begin()
>
> but I'm not seeing where there's anything that prevents a duplicate
> write from happening.
I mean there won't be a duplicate *writeback*, as opposed to a duplicate
*write*, for the same page. You can still write to the page cache and
dirty it while writeback for the same page is ongoing; the guarantee is
that even when the page is dirtied again, no duplicate writeback IO is
issued for that page while the previous writeback IO has not yet
completed.
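That guarantee comes from the generic writeback loop itself. Roughly,
paraphrasing the mm/page-writeback.c logic (not the verbatim code):

        /* A folio still under writeback is either skipped or waited
         * on, so a second writeback IO is never issued for it while
         * the first one is still in flight. */
        if (folio_test_writeback(folio)) {
                if (wbc->sync_mode != WB_SYNC_ALL)
                        continue;       /* background writeback: skip */
                folio_wait_writeback(folio);    /* integrity sync: wait */
        }
        if (!folio_clear_dirty_for_io(folio))
                continue;       /* raced and no longer dirty: skip */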
--
Thanks,
Jingbo
* Re: [PATCH v4 6/6] fuse: remove tmp folio for writebacks and internal rb tree
2024-11-14 1:46 ` Jingbo Xu
@ 2024-11-14 18:19 ` Joanne Koong
2024-11-15 2:18 ` Jingbo Xu
0 siblings, 1 reply; 32+ messages in thread
From: Joanne Koong @ 2024-11-14 18:19 UTC (permalink / raw)
To: Jingbo Xu
Cc: miklos, linux-fsdevel, shakeel.butt, josef, linux-mm,
bernd.schubert, kernel-team
On Wed, Nov 13, 2024 at 5:47 PM Jingbo Xu <jefflexu@linux.alibaba.com> wrote:
>
>
>
> On 11/14/24 8:39 AM, Joanne Koong wrote:
> > On Tue, Nov 12, 2024 at 1:25 AM Jingbo Xu <jefflexu@linux.alibaba.com> wrote:
> >>
> >> Hi Joanne,
> >>
> >> On 11/8/24 7:56 AM, Joanne Koong wrote:
> >>> Currently, we allocate and copy data to a temporary folio when
> >>> handling writeback in order to mitigate the following deadlock scenario
> >>> that may arise if reclaim waits on writeback to complete:
> >>> * single-threaded FUSE server is in the middle of handling a request
> >>> that needs a memory allocation
> >>> * memory allocation triggers direct reclaim
> >>> * direct reclaim waits on a folio under writeback
> >>> * the FUSE server can't write back the folio since it's stuck in
> >>> direct reclaim
> >>>
> >>> To work around this, we allocate a temporary folio and copy over the
> >>> original folio to the temporary folio so that writeback can be
> >>> immediately cleared on the original folio. This additionally requires us
> >>> to maintain an internal rb tree to keep track of writeback state on the
> >>> temporary folios.
> >>>
> >>> A recent change prevents reclaim logic from waiting on writeback for
> >>> folios whose mappings have the AS_WRITEBACK_MAY_BLOCK flag set.
> >>> This commit sets AS_WRITEBACK_MAY_BLOCK on FUSE inode mappings (which
> >>> will prevent FUSE folios from running into the reclaim deadlock described
> >>> above) and removes the temporary folio + extra copying and the internal
> >>> rb tree.
> >>>
> >>> fio benchmarks --
> >>> (using averages observed from 10 runs, throwing away outliers)
> >>>
> >>> Setup:
> >>> sudo mount -t tmpfs -o size=30G tmpfs ~/tmp_mount
> >>> ./libfuse/build/example/passthrough_ll -o writeback -o max_threads=4 -o source=~/tmp_mount ~/fuse_mount
> >>>
> >>> fio --name=writeback --ioengine=sync --rw=write --bs={1k,4k,1M} --size=2G
> >>> --numjobs=2 --ramp_time=30 --group_reporting=1 --directory=/root/fuse_mount
> >>>
> >>> bs = 1k 4k 1M
> >>> Before 351 MiB/s 1818 MiB/s 1851 MiB/s
> >>> After 341 MiB/s 2246 MiB/s 2685 MiB/s
> >>> % diff -3% 23% 45%
> >>>
> >>> Signed-off-by: Joanne Koong <joannelkoong@gmail.com>
> >>
> >> I think some of the places that check or wait for writeback should be
> >> reconsidered, as they may no longer be needed after we drop the temp
> >> page design.
> >>
> >> As they are inherited from the original implementation, they are at
> >> least harmless. I think they could remain in this patch and be cleaned
> >> up later if really needed.
> >>
> >
> > Thank you for the thorough inspection!
> >
> >>
> >>> @@ -891,7 +813,7 @@ static int fuse_do_readfolio(struct file *file, struct folio *folio)
> >>> * have writeback that extends beyond the lifetime of the folio. So
> >>> * make sure we read a properly synced folio.
> >>> */
> >>> - fuse_wait_on_folio_writeback(inode, folio);
> >>> + folio_wait_writeback(folio);
> >>
> >> I doubt wait-on-writeback is needed here, as the page cache now won't
> >> be freed until the writeback IO completes.
> >>
> >> The routines that attempt to free page cache, e.g. invalidate_inode_pages2()
> >> (generally called by distributed filesystems when the file content has
> >> been modified remotely) or truncate_inode_pages() (called from
> >> truncate(2) or inode eviction), will wait for writeback completion (if
> >> any) before freeing the page.
> >>
> >> Thus I don't think there's any possibility that .read_folio() or
> >> .readahead() will be called when the writeback has not completed.
> >>
> >
> > Great point. I'll remove this line and the comment above it too.
> >
> >>
> >>> @@ -1172,7 +1093,7 @@ static ssize_t fuse_send_write_pages(struct fuse_io_args *ia,
> >>> int err;
> >>>
> >>> for (i = 0; i < ap->num_folios; i++)
> >>> - fuse_wait_on_folio_writeback(inode, ap->folios[i]);
> >>> + folio_wait_writeback(ap->folios[i]);
> >>
> >> Ditto.
>
> Actually the "Ditto" was a typo; I originally meant that the wait on
> writeback in fuse_send_readpages() is unneeded, as the page cache won't
> be freed until the writeback IO completes.
>
> > - wait_event(fi->page_waitq, !fuse_range_is_writeback(inode, first, last));
> > + filemap_fdatawait_range(inode->i_mapping, first, last);
>
Gotcha and agreed. I'll remove this wait from readahead().
>
> In fact the above waiting for writeback in fuse_send_write_pages() is
> needed. The reason is as follows:
>
> >>
> >
> > Why did we need this fuse_wait_on_folio_writeback() even when we had
> > the temp pages? If I'm understanding it correctly,
> > fuse_send_write_pages() is only called for the writethrough case (by
> > fuse_perform_write()), so these folios would never even be under
> > writeback, no?
>
> I think an mmap write could dirty the page, and writeback could then be
> triggered on it.
>
Ohhh I see, thanks for the explanation.
> >
> >>
> >>
> >>> static void fuse_writepage_args_page_fill(struct fuse_writepage_args *wpa, struct folio *folio,
> >>> - struct folio *tmp_folio, uint32_t folio_index)
> >>> + uint32_t folio_index)
> >>> {
> >>> struct inode *inode = folio->mapping->host;
> >>> struct fuse_args_pages *ap = &wpa->ia.ap;
> >>>
> >>> - folio_copy(tmp_folio, folio);
> >>> -
> >>> - ap->folios[folio_index] = tmp_folio;
> >>> + folio_get(folio);
> >>
> >> I still think this folio_get() here is harmless but redundant.
> >>
> >> Ditto page cache won't be freed before writeback completes.
> >>
> >> Besides, other .writepages() implementations, e.g. iomap_writepages(),
> >> also don't take a refcount when constructing the writeback IO.
> >>
> >
> > Point taken. I'll remove this then, since other .writepages() also
> > don't obtain a refcount.
> >
> >>
> >>> @@ -2481,7 +2200,7 @@ static int fuse_write_begin(struct file *file, struct address_space *mapping,
> >>> if (IS_ERR(folio))
> >>> goto error;
> >>>
> >>> - fuse_wait_on_page_writeback(mapping->host, folio->index);
> >>> + folio_wait_writeback(folio);
> >>
> >> I also doubt if wait_on_writeback() is needed here, as now there won't
> >> be duplicate writeback IOs for the same page.
> >
> > What prevents there from being a duplicate writeback write for the
> > same page? This is the path I'm looking at:
> >
> > ksys_write()
> > vfs_write()
> > new_sync_write()
> > op->write_iter()
> > fuse_file_write_iter()
> > fuse_cache_write_iter()
> > generic_file_write_iter()
> > __generic_file_write_iter()
> > generic_perform_write()
> > op->write_begin()
> > fuse_write_begin()
> >
> > but I'm not seeing where there's anything that prevents a duplicate
> > write from happening.
>
> I mean there won't be a duplicate *writeback*, as opposed to a duplicate
> *write*, for the same page. You can still write to the page cache and
> dirty it while writeback for the same page is ongoing; the guarantee is
> that even when the page is dirtied again, no duplicate writeback IO is
> issued for that page while the previous writeback IO has not yet
> completed.
>
I think we still need this folio_wait_writeback(), since we're calling
fuse_do_readfolio() here and removing the folio_wait_writeback() from
fuse_do_readfolio(); otherwise we could read back stale data if the
writeback hasn't gone through yet.
I think we could probably move the folio_wait_writeback() here in
fuse_write_begin() to be right before the fuse_do_readfolio() call, and
skip waiting on writeback if we hit the "success" gotos.
Thanks,
Joanne
>
>
> --
> Thanks,
> Jingbo
* Re: [PATCH v4 6/6] fuse: remove tmp folio for writebacks and internal rb tree
2024-11-14 18:19 ` Joanne Koong
@ 2024-11-15 2:18 ` Jingbo Xu
2024-11-15 18:29 ` Joanne Koong
0 siblings, 1 reply; 32+ messages in thread
From: Jingbo Xu @ 2024-11-15 2:18 UTC (permalink / raw)
To: Joanne Koong
Cc: miklos, linux-fsdevel, shakeel.butt, josef, linux-mm,
bernd.schubert, kernel-team
On 11/15/24 2:19 AM, Joanne Koong wrote:
> On Wed, Nov 13, 2024 at 5:47 PM Jingbo Xu <jefflexu@linux.alibaba.com> wrote:
>>
>>
>>
>> On 11/14/24 8:39 AM, Joanne Koong wrote:
>>> On Tue, Nov 12, 2024 at 1:25 AM Jingbo Xu <jefflexu@linux.alibaba.com> wrote:
>>>>
>>>> Hi Joanne,
>>>>
>>>> On 11/8/24 7:56 AM, Joanne Koong wrote:
>>>>> Currently, we allocate and copy data to a temporary folio when
>>>>> handling writeback in order to mitigate the following deadlock scenario
>>>>> that may arise if reclaim waits on writeback to complete:
>>>>> * single-threaded FUSE server is in the middle of handling a request
>>>>> that needs a memory allocation
>>>>> * memory allocation triggers direct reclaim
>>>>> * direct reclaim waits on a folio under writeback
>>>>> * the FUSE server can't write back the folio since it's stuck in
>>>>> direct reclaim
>>>>>
>>>>> To work around this, we allocate a temporary folio and copy over the
>>>>> original folio to the temporary folio so that writeback can be
>>>>> immediately cleared on the original folio. This additionally requires us
>>>>> to maintain an internal rb tree to keep track of writeback state on the
>>>>> temporary folios.
>>>>>
>>>>> A recent change prevents reclaim logic from waiting on writeback for
> >>>>> folios whose mappings have the AS_WRITEBACK_MAY_BLOCK flag set.
>>>>> This commit sets AS_WRITEBACK_MAY_BLOCK on FUSE inode mappings (which
>>>>> will prevent FUSE folios from running into the reclaim deadlock described
>>>>> above) and removes the temporary folio + extra copying and the internal
>>>>> rb tree.
>>>>>
>>>>> fio benchmarks --
>>>>> (using averages observed from 10 runs, throwing away outliers)
>>>>>
>>>>> Setup:
>>>>> sudo mount -t tmpfs -o size=30G tmpfs ~/tmp_mount
>>>>> ./libfuse/build/example/passthrough_ll -o writeback -o max_threads=4 -o source=~/tmp_mount ~/fuse_mount
>>>>>
>>>>> fio --name=writeback --ioengine=sync --rw=write --bs={1k,4k,1M} --size=2G
>>>>> --numjobs=2 --ramp_time=30 --group_reporting=1 --directory=/root/fuse_mount
>>>>>
>>>>> bs = 1k 4k 1M
>>>>> Before 351 MiB/s 1818 MiB/s 1851 MiB/s
>>>>> After 341 MiB/s 2246 MiB/s 2685 MiB/s
>>>>> % diff -3% 23% 45%
>>>>>
>>>>> Signed-off-by: Joanne Koong <joannelkoong@gmail.com>
>>>>
> >>>> I think some of the places that check or wait for writeback should be
> >>>> reconsidered, as they may no longer be needed after we drop the temp
> >>>> page design.
> >>>>
> >>>> As they are inherited from the original implementation, they are at
> >>>> least harmless. I think they could remain in this patch and be cleaned
> >>>> up later if really needed.
>>>>
>>>
>>> Thank you for the thorough inspection!
>>>
>>>>
>>>>> @@ -891,7 +813,7 @@ static int fuse_do_readfolio(struct file *file, struct folio *folio)
>>>>> * have writeback that extends beyond the lifetime of the folio. So
>>>>> * make sure we read a properly synced folio.
>>>>> */
>>>>> - fuse_wait_on_folio_writeback(inode, folio);
>>>>> + folio_wait_writeback(folio);
>>>>
> >>>> I doubt wait-on-writeback is needed here, as the page cache now won't
> >>>> be freed until the writeback IO completes.
> >>>>
> >>>> The routines that attempt to free page cache, e.g. invalidate_inode_pages2()
> >>>> (generally called by distributed filesystems when the file content has
> >>>> been modified remotely) or truncate_inode_pages() (called from
> >>>> truncate(2) or inode eviction), will wait for writeback completion (if
> >>>> any) before freeing the page.
>>>>
>>>> Thus I don't think there's any possibility that .read_folio() or
>>>> .readahead() will be called when the writeback has not completed.
>>>>
>>>
>>> Great point. I'll remove this line and the comment above it too.
>>>
>>>>
>>>>> @@ -1172,7 +1093,7 @@ static ssize_t fuse_send_write_pages(struct fuse_io_args *ia,
>>>>> int err;
>>>>>
>>>>> for (i = 0; i < ap->num_folios; i++)
>>>>> - fuse_wait_on_folio_writeback(inode, ap->folios[i]);
>>>>> + folio_wait_writeback(ap->folios[i]);
>>>>
>>>> Ditto.
>>
>> Actually the "Ditto" was a typo; I originally meant that the wait on
>> writeback in fuse_send_readpages() is unneeded, as the page cache won't
>> be freed until the writeback IO completes.
>>
>>> - wait_event(fi->page_waitq, !fuse_range_is_writeback(inode, first, last));
>>> + filemap_fdatawait_range(inode->i_mapping, first, last);
>>
>
> Gotcha and agreed. I'll remove this wait from readahead().
>
>>
>> In fact the above waiting for writeback in fuse_send_write_pages() is
>> needed. The reason is as follows:
>>
>>>>
>>>
>>> Why did we need this fuse_wait_on_folio_writeback() even when we had
>>> the temp pages? If I'm understanding it correctly,
>>> fuse_send_write_pages() is only called for the writethrough case (by
>>> fuse_perform_write()), so these folios would never even be under
>>> writeback, no?
>>
>> I think an mmap write could dirty the page, and writeback could then be
>> triggered on it.
>>
>
> Ohhh I see, thanks for the explanation.
>
>>>
>>>>
>>>>
>>>>> static void fuse_writepage_args_page_fill(struct fuse_writepage_args *wpa, struct folio *folio,
>>>>> - struct folio *tmp_folio, uint32_t folio_index)
>>>>> + uint32_t folio_index)
>>>>> {
>>>>> struct inode *inode = folio->mapping->host;
>>>>> struct fuse_args_pages *ap = &wpa->ia.ap;
>>>>>
>>>>> - folio_copy(tmp_folio, folio);
>>>>> -
>>>>> - ap->folios[folio_index] = tmp_folio;
>>>>> + folio_get(folio);
>>>>
>>>> I still think this folio_get() here is harmless but redundant.
>>>>
>>>> Ditto page cache won't be freed before writeback completes.
>>>>
> >>>> Besides, other .writepages() implementations, e.g. iomap_writepages(),
> >>>> also don't take a refcount when constructing the writeback IO.
>>>>
>>>
>>> Point taken. I'll remove this then, since other .writepages() also
>>> don't obtain a refcount.
>>>
>>>>
>>>>> @@ -2481,7 +2200,7 @@ static int fuse_write_begin(struct file *file, struct address_space *mapping,
>>>>> if (IS_ERR(folio))
>>>>> goto error;
>>>>>
>>>>> - fuse_wait_on_page_writeback(mapping->host, folio->index);
>>>>> + folio_wait_writeback(folio);
>>>>
>>>> I also doubt if wait_on_writeback() is needed here, as now there won't
>>>> be duplicate writeback IOs for the same page.
>>>
>>> What prevents there from being a duplicate writeback write for the
>>> same page? This is the path I'm looking at:
>>>
>>> ksys_write()
>>> vfs_write()
>>> new_sync_write()
>>> op->write_iter()
>>> fuse_file_write_iter()
>>> fuse_cache_write_iter()
>>> generic_file_write_iter()
>>> __generic_file_write_iter()
>>> generic_perform_write()
>>> op->write_begin()
>>> fuse_write_begin()
>>>
>>> but I'm not seeing where there's anything that prevents a duplicate
>>> write from happening.
>>
>> I mean there won't be a duplicate *writeback*, as opposed to a duplicate
>> *write*, for the same page. You can still write to the page cache and
>> dirty it while writeback for the same page is ongoing; the guarantee is
>> that even when the page is dirtied again, no duplicate writeback IO is
>> issued for that page while the previous writeback IO has not yet
>> completed.
>>
>
> > I think we still need this folio_wait_writeback(), since we're calling
> > fuse_do_readfolio() here and removing the folio_wait_writeback() from
> > fuse_do_readfolio(); otherwise we could read back stale data if the
> > writeback hasn't gone through yet.
> > I think we could probably move the folio_wait_writeback() here in
> > fuse_write_begin() to be right before the fuse_do_readfolio() call, and
> > skip waiting on writeback if we hit the "success" gotos.
IIUC, on a cache hit when fuse_write_begin() is called, the page is
marked with PG_uptodate, which indicates that the read IO request
(.read_folio()) bringing the data from disk into the page has already
completed. Thus fuse_write_begin() will do nothing and return directly.
> if (folio_test_uptodate(folio) || len >= folio_size(folio))
> goto success;
Otherwise, fuse_do_readfolio() is called to read the data on a cache
miss, i.e. when the page didn't exist in the page cache before or was
previously invalidated. Since no page can be invalidated until its
writeback IO completes, we can be sure there's no in-flight writeback
IO when fuse_do_readfolio() is called.
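To spell that out, here is the relevant control flow, paraphrased from
the patched fuse_write_begin() (a sketch, not the verbatim function):

        folio = __filemap_get_folio(mapping, pos >> PAGE_SHIFT,
                                    FGP_WRITEBEGIN, mapping_gfp_mask(mapping));
        if (IS_ERR(folio))
                goto error;

        if (folio_test_uptodate(folio) || len >= folio_size(folio))
                goto success;   /* cache hit: read IO already completed */

        /* Cache miss: the folio was just created or was previously
         * invalidated, and invalidation waits for writeback completion,
         * so no writeback IO can be in flight here and no
         * folio_wait_writeback() is needed before the read. */
        err = fuse_do_readfolio(file, folio);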
Other folks can correct me if I've got this wrong, since I have also
only recently been trying to work out all these details of page cache
management.
--
Thanks,
Jingbo
* Re: [PATCH v4 6/6] fuse: remove tmp folio for writebacks and internal rb tree
2024-11-15 2:18 ` Jingbo Xu
@ 2024-11-15 18:29 ` Joanne Koong
0 siblings, 0 replies; 32+ messages in thread
From: Joanne Koong @ 2024-11-15 18:29 UTC (permalink / raw)
To: Jingbo Xu
Cc: miklos, linux-fsdevel, shakeel.butt, josef, linux-mm,
bernd.schubert, kernel-team
On Thu, Nov 14, 2024 at 6:18 PM Jingbo Xu <jefflexu@linux.alibaba.com> wrote:
>
>
>
> On 11/15/24 2:19 AM, Joanne Koong wrote:
> > On Wed, Nov 13, 2024 at 5:47 PM Jingbo Xu <jefflexu@linux.alibaba.com> wrote:
> >>
> >>
> >>
> >> On 11/14/24 8:39 AM, Joanne Koong wrote:
> >>> On Tue, Nov 12, 2024 at 1:25 AM Jingbo Xu <jefflexu@linux.alibaba.com> wrote:
> >>>>
> >>>> Hi Joanne,
> >>>>
> >>>> On 11/8/24 7:56 AM, Joanne Koong wrote:
> >>>>> Currently, we allocate and copy data to a temporary folio when
> >>>>> handling writeback in order to mitigate the following deadlock scenario
> >>>>> that may arise if reclaim waits on writeback to complete:
> >>>>> * single-threaded FUSE server is in the middle of handling a request
> >>>>> that needs a memory allocation
> >>>>> * memory allocation triggers direct reclaim
> >>>>> * direct reclaim waits on a folio under writeback
> >>>>> * the FUSE server can't write back the folio since it's stuck in
> >>>>> direct reclaim
> >>>>>
> >>>>> To work around this, we allocate a temporary folio and copy over the
> >>>>> original folio to the temporary folio so that writeback can be
> >>>>> immediately cleared on the original folio. This additionally requires us
> >>>>> to maintain an internal rb tree to keep track of writeback state on the
> >>>>> temporary folios.
> >>>>>
> >>>>> A recent change prevents reclaim logic from waiting on writeback for
> >>>>> folios whose mappings have the AS_WRITEBACK_MAY_BLOCK flag set.
> >>>>> This commit sets AS_WRITEBACK_MAY_BLOCK on FUSE inode mappings (which
> >>>>> will prevent FUSE folios from running into the reclaim deadlock described
> >>>>> above) and removes the temporary folio + extra copying and the internal
> >>>>> rb tree.
> >>>>>
> >>>>> fio benchmarks --
> >>>>> (using averages observed from 10 runs, throwing away outliers)
> >>>>>
> >>>>> Setup:
> >>>>> sudo mount -t tmpfs -o size=30G tmpfs ~/tmp_mount
> >>>>> ./libfuse/build/example/passthrough_ll -o writeback -o max_threads=4 -o source=~/tmp_mount ~/fuse_mount
> >>>>>
> >>>>> fio --name=writeback --ioengine=sync --rw=write --bs={1k,4k,1M} --size=2G
> >>>>> --numjobs=2 --ramp_time=30 --group_reporting=1 --directory=/root/fuse_mount
> >>>>>
> >>>>> bs = 1k 4k 1M
> >>>>> Before 351 MiB/s 1818 MiB/s 1851 MiB/s
> >>>>> After 341 MiB/s 2246 MiB/s 2685 MiB/s
> >>>>> % diff -3% 23% 45%
> >>>>>
> >>>>> Signed-off-by: Joanne Koong <joannelkoong@gmail.com>
> >>>>
> >>>> I think some of the places that check or wait for writeback should be
> >>>> reconsidered, as they may no longer be needed after we drop the temp
> >>>> page design.
> >>>>
> >>>> As they are inherited from the original implementation, they are at
> >>>> least harmless. I think they could remain in this patch and be cleaned
> >>>> up later if really needed.
> >>>>
> >>>
> >>> Thank you for the thorough inspection!
> >>>
> >>>>
> >>>>> @@ -891,7 +813,7 @@ static int fuse_do_readfolio(struct file *file, struct folio *folio)
> >>>>> * have writeback that extends beyond the lifetime of the folio. So
> >>>>> * make sure we read a properly synced folio.
> >>>>> */
> >>>>> - fuse_wait_on_folio_writeback(inode, folio);
> >>>>> + folio_wait_writeback(folio);
> >>>>
> >>>> I doubt wait-on-writeback is needed here, as the page cache now won't
> >>>> be freed until the writeback IO completes.
> >>>>
> >>>> The routines that attempt to free page cache, e.g. invalidate_inode_pages2()
> >>>> (generally called by distributed filesystems when the file content has
> >>>> been modified remotely) or truncate_inode_pages() (called from
> >>>> truncate(2) or inode eviction), will wait for writeback completion (if
> >>>> any) before freeing the page.
> >>>>
> >>>> Thus I don't think there's any possibility that .read_folio() or
> >>>> .readahead() will be called when the writeback has not completed.
> >>>>
> >>>
> >>> Great point. I'll remove this line and the comment above it too.
> >>>
> >>>>
> >>>>> @@ -1172,7 +1093,7 @@ static ssize_t fuse_send_write_pages(struct fuse_io_args *ia,
> >>>>> int err;
> >>>>>
> >>>>> for (i = 0; i < ap->num_folios; i++)
> >>>>> - fuse_wait_on_folio_writeback(inode, ap->folios[i]);
> >>>>> + folio_wait_writeback(ap->folios[i]);
> >>>>
> >>>> Ditto.
> >>
> >> Actually the "Ditto" was a typo; I originally meant that the wait on
> >> writeback in fuse_send_readpages() is unneeded, as the page cache won't
> >> be freed until the writeback IO completes.
> >>
> >>> - wait_event(fi->page_waitq, !fuse_range_is_writeback(inode, first, last));
> >>> + filemap_fdatawait_range(inode->i_mapping, first, last);
> >>
> >
> > Gotcha and agreed. I'll remove this wait from readahead().
> >
> >>
> >> In fact the above waiting for writeback in fuse_send_write_pages() is
> >> needed. The reason is as follows:
> >>
> >>>>
> >>>
> >>> Why did we need this fuse_wait_on_folio_writeback() even when we had
> >>> the temp pages? If I'm understanding it correctly,
> >>> fuse_send_write_pages() is only called for the writethrough case (by
> >>> fuse_perform_write()), so these folios would never even be under
> >>> writeback, no?
> >>
> >> I think an mmap write could dirty the page, and writeback could then be
> >> triggered on it.
> >>
> >
> > Ohhh I see, thanks for the explanation.
> >
> >>>
> >>>>
> >>>>
> >>>>> static void fuse_writepage_args_page_fill(struct fuse_writepage_args *wpa, struct folio *folio,
> >>>>> - struct folio *tmp_folio, uint32_t folio_index)
> >>>>> + uint32_t folio_index)
> >>>>> {
> >>>>> struct inode *inode = folio->mapping->host;
> >>>>> struct fuse_args_pages *ap = &wpa->ia.ap;
> >>>>>
> >>>>> - folio_copy(tmp_folio, folio);
> >>>>> -
> >>>>> - ap->folios[folio_index] = tmp_folio;
> >>>>> + folio_get(folio);
> >>>>
> >>>> I still think this folio_get() here is harmless but redundant.
> >>>>
> >>>> Ditto page cache won't be freed before writeback completes.
> >>>>
> >>>> Besides, other .writepages() implementations, e.g. iomap_writepages(),
> >>>> also don't take a refcount when constructing the writeback IO.
> >>>>
> >>>
> >>> Point taken. I'll remove this then, since other .writepages() also
> >>> don't obtain a refcount.
> >>>
> >>>>
> >>>>> @@ -2481,7 +2200,7 @@ static int fuse_write_begin(struct file *file, struct address_space *mapping,
> >>>>> if (IS_ERR(folio))
> >>>>> goto error;
> >>>>>
> >>>>> - fuse_wait_on_page_writeback(mapping->host, folio->index);
> >>>>> + folio_wait_writeback(folio);
> >>>>
> >>>> I also doubt if wait_on_writeback() is needed here, as now there won't
> >>>> be duplicate writeback IOs for the same page.
> >>>
> >>> What prevents there from being a duplicate writeback write for the
> >>> same page? This is the path I'm looking at:
> >>>
> >>> ksys_write()
> >>> vfs_write()
> >>> new_sync_write()
> >>> op->write_iter()
> >>> fuse_file_write_iter()
> >>> fuse_cache_write_iter()
> >>> generic_file_write_iter()
> >>> __generic_file_write_iter()
> >>> generic_perform_write()
> >>> op->write_begin()
> >>> fuse_write_begin()
> >>>
> >>> but I'm not seeing where there's anything that prevents a duplicate
> >>> write from happening.
> >>
> >> I mean there won't be a duplicate *writeback*, as opposed to a duplicate
> >> *write*, for the same page. You can still write to the page cache and
> >> dirty it while writeback for the same page is ongoing; the guarantee is
> >> that even when the page is dirtied again, no duplicate writeback IO is
> >> issued for that page while the previous writeback IO has not yet
> >> completed.
> >>
> >
> > I think we still need this folio_wait_writeback(), since we're calling
> > fuse_do_readfolio() here and removing the folio_wait_writeback() from
> > fuse_do_readfolio(); otherwise we could read back stale data if the
> > writeback hasn't gone through yet.
> > I think we could probably move the folio_wait_writeback() here in
> > fuse_write_begin() to be right before the fuse_do_readfolio() call, and
> > skip waiting on writeback if we hit the "success" gotos.
>
> IIUC, on a cache hit when fuse_write_begin() is called, the page is
> marked with PG_uptodate, which indicates that the read IO request
> (.read_folio()) bringing the data from disk into the page has already
> completed. Thus fuse_write_begin() will do nothing and return directly.
>
> > if (folio_test_uptodate(folio) || len >= folio_size(folio))
> > goto success;
>
> Otherwise, fuse_do_readfolio() is called to read the data on a cache
> miss, i.e. when the page didn't exist in the page cache before or was
> previously invalidated. Since no page can be invalidated until its
> writeback IO completes, we can be sure there's no in-flight writeback
> IO when fuse_do_readfolio() is called.
>
> Other folks can correct me if I've got this wrong, since I have also
> only recently been trying to work out all these details of page cache
> management.
>
You're right, this writeback wait is unneeded. I'll remove this for v5.
Thanks,
Joanne
>
> --
> Thanks,
> Jingbo
* Re: [PATCH v4 1/6] mm: add AS_WRITEBACK_MAY_BLOCK mapping flag
2024-11-11 21:11 ` Joanne Koong
@ 2024-11-15 19:33 ` Joanne Koong
2024-11-15 20:17 ` Joanne Koong
0 siblings, 1 reply; 32+ messages in thread
From: Joanne Koong @ 2024-11-15 19:33 UTC (permalink / raw)
To: Shakeel Butt
Cc: miklos, linux-fsdevel, jefflexu, josef, linux-mm, bernd.schubert,
kernel-team
On Mon, Nov 11, 2024 at 1:11 PM Joanne Koong <joannelkoong@gmail.com> wrote:
>
> On Fri, Nov 8, 2024 at 4:10 PM Shakeel Butt <shakeel.butt@linux.dev> wrote:
> >
> > On Thu, Nov 07, 2024 at 03:56:09PM -0800, Joanne Koong wrote:
> > > Add a new mapping flag AS_WRITEBACK_MAY_BLOCK which filesystems may set
> > > to indicate that writeback operations may block or take an indeterminate
> > > amount of time to complete. Extra caution should be taken when waiting
> > > on writeback for folios belonging to mappings where this flag is set.
> > >
> > > Signed-off-by: Joanne Koong <joannelkoong@gmail.com>
> > > ---
> > > include/linux/pagemap.h | 11 +++++++++++
> > > 1 file changed, 11 insertions(+)
> > >
> > > diff --git a/include/linux/pagemap.h b/include/linux/pagemap.h
> > > index 68a5f1ff3301..eb5a7837e142 100644
> > > --- a/include/linux/pagemap.h
> > > +++ b/include/linux/pagemap.h
> > > @@ -210,6 +210,7 @@ enum mapping_flags {
> > > AS_STABLE_WRITES = 7, /* must wait for writeback before modifying
> > > folio contents */
> > > AS_INACCESSIBLE = 8, /* Do not attempt direct R/W access to the mapping */
> > > + AS_WRITEBACK_MAY_BLOCK = 9, /* Use caution when waiting on writeback */
> >
> > To me 'may block' does not feel right. For example, in the reclaim
> > code, folio_wait_writeback() can get blocked and that is fine. However,
> > with non-privileged fuse involved, there are security concerns. Somehow
> > 'may block' does not convey that. Anyway, I am not really pushing back,
> > but I think there is a need for a better name here.
>
> Ahh I see where this naming causes confusion - the "MAY_BLOCK" part
> could be interpreted in two ways: a) may block as in it's possible for
> the writeback to block and b) may block as in it's permissible/ok for
> the writeback to block. I intended "may block" to signify a) but
> you're right, it could be easily interpreted as b).
>
> I'll change this to AS_WRITEBACK_BLOCKING.
Thinking about this some more, I think AS_WRITEBACK_ASYNC would be a
better name. (AS_WRITEBACK_BLOCKING might imply that the writeback
->writepages() operation itself is blocking).
I'll make this change for v5.
Thanks,
Joanne
>
> Thanks,
> Joanne
>
> >
> > > /* Bits 16-25 are used for FOLIO_ORDER */
> > > AS_FOLIO_ORDER_BITS = 5,
> > > AS_FOLIO_ORDER_MIN = 16,
> > > @@ -335,6 +336,16 @@ static inline bool mapping_inaccessible(struct address_space *mapping)
> > > return test_bit(AS_INACCESSIBLE, &mapping->flags);
> > > }
> > >
> > > +static inline void mapping_set_writeback_may_block(struct address_space *mapping)
> > > +{
> > > + set_bit(AS_WRITEBACK_MAY_BLOCK, &mapping->flags);
> > > +}
> > > +
> > > +static inline bool mapping_writeback_may_block(struct address_space *mapping)
> > > +{
> > > + return test_bit(AS_WRITEBACK_MAY_BLOCK, &mapping->flags);
> > > +}
> > > +
> > > static inline gfp_t mapping_gfp_mask(struct address_space * mapping)
> > > {
> > > return mapping->gfp_mask;
> > > --
> > > 2.43.5
> > >
* Re: [PATCH v4 1/6] mm: add AS_WRITEBACK_MAY_BLOCK mapping flag
2024-11-15 19:33 ` Joanne Koong
@ 2024-11-15 20:17 ` Joanne Koong
0 siblings, 0 replies; 32+ messages in thread
From: Joanne Koong @ 2024-11-15 20:17 UTC (permalink / raw)
To: Shakeel Butt
Cc: miklos, linux-fsdevel, jefflexu, josef, linux-mm, bernd.schubert,
kernel-team
On Fri, Nov 15, 2024 at 11:33 AM Joanne Koong <joannelkoong@gmail.com> wrote:
>
> On Mon, Nov 11, 2024 at 1:11 PM Joanne Koong <joannelkoong@gmail.com> wrote:
> >
> > On Fri, Nov 8, 2024 at 4:10 PM Shakeel Butt <shakeel.butt@linux.dev> wrote:
> > >
> > > On Thu, Nov 07, 2024 at 03:56:09PM -0800, Joanne Koong wrote:
> > > > Add a new mapping flag AS_WRITEBACK_MAY_BLOCK which filesystems may set
> > > > to indicate that writeback operations may block or take an indeterminate
> > > > amount of time to complete. Extra caution should be taken when waiting
> > > > on writeback for folios belonging to mappings where this flag is set.
> > > >
> > > > Signed-off-by: Joanne Koong <joannelkoong@gmail.com>
> > > > ---
> > > > include/linux/pagemap.h | 11 +++++++++++
> > > > 1 file changed, 11 insertions(+)
> > > >
> > > > diff --git a/include/linux/pagemap.h b/include/linux/pagemap.h
> > > > index 68a5f1ff3301..eb5a7837e142 100644
> > > > --- a/include/linux/pagemap.h
> > > > +++ b/include/linux/pagemap.h
> > > > @@ -210,6 +210,7 @@ enum mapping_flags {
> > > > AS_STABLE_WRITES = 7, /* must wait for writeback before modifying
> > > > folio contents */
> > > > AS_INACCESSIBLE = 8, /* Do not attempt direct R/W access to the mapping */
> > > > + AS_WRITEBACK_MAY_BLOCK = 9, /* Use caution when waiting on writeback */
> > >
> > > To me 'may block' does not feel right. For example, in the reclaim
> > > code, folio_wait_writeback() can get blocked and that is fine. However,
> > > with non-privileged fuse involved, there are security concerns. Somehow
> > > 'may block' does not convey that. Anyway, I am not really pushing back,
> > > but I think there is a need for a better name here.
> >
> > Ahh I see where this naming causes confusion - the "MAY_BLOCK" part
> > could be interpreted in two ways: a) may block as in it's possible for
> > the writeback to block and b) may block as in it's permissible/ok for
> > the writeback to block. I intended "may block" to signify a) but
> > you're right, it could be easily interpreted as b).
> >
> > I'll change this to AS_WRITEBACK_BLOCKING.
>
> Thinking about this some more, I think AS_WRITEBACK_ASYNC would be a
> better name. (AS_WRITEBACK_BLOCKING might imply that the writeback
> ->writepages() operation itself is blocking).
Ugh, AS_WRITEBACK_ASYNC probably doesn't work either, since NFS is also
async. Okay, maybe "AS_WRITEBACK_INDETERMINATE" then? We can keep
riffing on this; for v5 I'll submit it using
AS_WRITEBACK_INDETERMINATE.
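For v5 the patch 1/6 helpers would presumably be renamed to match; a
sketch under that assumption, not the actual v5 hunk:

        static inline void mapping_set_writeback_indeterminate(struct address_space *mapping)
        {
                set_bit(AS_WRITEBACK_INDETERMINATE, &mapping->flags);
        }

        static inline bool mapping_writeback_indeterminate(struct address_space *mapping)
        {
                return test_bit(AS_WRITEBACK_INDETERMINATE, &mapping->flags);
        }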
>
> I'll make this change for v5.
>
> Thanks,
> Joanne
>
> >
> > Thanks,
> > Joanne
> >
> > >
> > > > /* Bits 16-25 are used for FOLIO_ORDER */
> > > > AS_FOLIO_ORDER_BITS = 5,
> > > > AS_FOLIO_ORDER_MIN = 16,
> > > > @@ -335,6 +336,16 @@ static inline bool mapping_inaccessible(struct address_space *mapping)
> > > > return test_bit(AS_INACCESSIBLE, &mapping->flags);
> > > > }
> > > >
> > > > +static inline void mapping_set_writeback_may_block(struct address_space *mapping)
> > > > +{
> > > > + set_bit(AS_WRITEBACK_MAY_BLOCK, &mapping->flags);
> > > > +}
> > > > +
> > > > +static inline bool mapping_writeback_may_block(struct address_space *mapping)
> > > > +{
> > > > + return test_bit(AS_WRITEBACK_MAY_BLOCK, &mapping->flags);
> > > > +}
> > > > +
> > > > static inline gfp_t mapping_gfp_mask(struct address_space * mapping)
> > > > {
> > > > return mapping->gfp_mask;
> > > > --
> > > > 2.43.5
> > > >
Thread overview: 32+ messages
2024-11-07 23:56 [PATCH v4 0/6] fuse: remove temp page copies in writeback Joanne Koong
2024-11-07 23:56 ` [PATCH v4 1/6] mm: add AS_WRITEBACK_MAY_BLOCK mapping flag Joanne Koong
2024-11-09 0:10 ` Shakeel Butt
2024-11-11 21:11 ` Joanne Koong
2024-11-15 19:33 ` Joanne Koong
2024-11-15 20:17 ` Joanne Koong
2024-11-07 23:56 ` [PATCH v4 2/6] mm: skip reclaiming folios in legacy memcg writeback contexts that may block Joanne Koong
2024-11-09 0:16 ` Shakeel Butt
2024-11-07 23:56 ` [PATCH v4 3/6] fs/writeback: in wait_sb_inodes(), skip wait for AS_WRITEBACK_MAY_BLOCK mappings Joanne Koong
2024-11-07 23:56 ` [PATCH v4 4/6] mm/memory-hotplug: add finite retries in offline_pages() if migration fails Joanne Koong
2024-11-08 17:33 ` SeongJae Park
2024-11-08 18:56 ` David Hildenbrand
2024-11-08 19:00 ` David Hildenbrand
2024-11-08 21:27 ` Shakeel Butt
2024-11-08 21:42 ` Joanne Koong
2024-11-08 22:16 ` Shakeel Butt
2024-11-08 22:20 ` Joanne Koong
2024-11-08 21:59 ` Joanne Koong
2024-11-07 23:56 ` [PATCH v4 5/6] mm/migrate: skip migrating folios under writeback with AS_WRITEBACK_MAY_BLOCK mappings Joanne Koong
2024-11-07 23:56 ` [PATCH v4 6/6] fuse: remove tmp folio for writebacks and internal rb tree Joanne Koong
2024-11-08 8:48 ` Jingbo Xu
2024-11-08 22:33 ` Joanne Koong
2024-11-11 8:32 ` Jingbo Xu
2024-11-11 21:30 ` Joanne Koong
2024-11-12 2:31 ` Jingbo Xu
2024-11-13 19:11 ` Joanne Koong
2024-11-12 9:25 ` Jingbo Xu
2024-11-14 0:39 ` Joanne Koong
2024-11-14 1:46 ` Jingbo Xu
2024-11-14 18:19 ` Joanne Koong
2024-11-15 2:18 ` Jingbo Xu
2024-11-15 18:29 ` Joanne Koong