* [PATCHSET v5 0/17] Uncached buffered IO
From: Jens Axboe @ 2024-11-14 15:25 UTC
To: linux-mm, linux-fsdevel
Cc: hannes, clm, linux-kernel, willy, kirill, linux-btrfs,
linux-ext4, linux-xfs, bfoster
Hi,
5 years ago I posted patches adding support for RWF_UNCACHED, as a way
to do buffered IO that isn't page cache persistent. The approach back
then was to have private pages for IO, and then get rid of them once IO
was done. But that then runs into all the issues that O_DIRECT has, in
terms of synchronizing with the page cache.
So here's a new approach to the same concept, but using the page cache
as synchronization. That makes RWF_UNCACHED less special, in that it's
just page cache IO, except it prunes the ranges once IO is completed.
Why do this, you may ask? The tldr is that device speeds are only
getting faster, while reclaim is not. Doing normal buffered IO can be
very unpredictable, and suck up a lot of resources on the reclaim side.
This leads people to use O_DIRECT as a work-around, which has its own
set of restrictions in terms of size, offset, and length of IO. It's
also inherently synchronous, and now you need async IO as well. While
the latter isn't necessarily a big problem as we have good options
available there, it also should not be a requirement when all you want
to do is read or write some data without caching.
Even on desktop type systems, a normal NVMe device can fill the entire
page cache in seconds. On the big system I used for testing, there's a
lot more RAM, but also a lot more devices. As can be seen in some of the
results in the following patches, you can still fill RAM in seconds even
when there's 1TB of it. Hence this problem isn't solely a "big
hyperscaler system" issue, it's common across the board.
Common for both reads and writes with RWF_UNCACHED is that they use the
page cache for IO. Reads work just like a normal buffered read would,
with the only exception being that the touched ranges will get pruned
after data has been copied. For writes, the ranges will get writeback
kicked off before the syscall returns, and then writeback completion
will prune the range. Hence writes aren't synchronous, and it's easy to
pipeline writes using RWF_UNCACHED. Folios that aren't instantiated by
RWF_UNCACHED IO are left untouched. This means that uncached IO
will take advantage of the page cache for uptodate data, but not leave
anything it instantiated/created in cache.
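As a concrete sketch, pipelining uncached writes from an application
looks something like the below. The RWF_UNCACHED value is hardcoded for
kernels whose headers don't have it yet, and the helper name is just
illustrative:

#define _GNU_SOURCE
#include <sys/uio.h>

#ifndef RWF_UNCACHED
#define RWF_UNCACHED	0x00000080	/* added in this series */
#endif

/*
 * Buffered write that kicks off writeback before returning. Writeback
 * completion prunes the written range from the page cache, so a loop
 * issuing these back-to-back never piles up dirty pages for reclaim.
 */
static ssize_t uncached_write(int fd, void *buf, size_t len, off_t off)
{
	struct iovec iov = { .iov_base = buf, .iov_len = len };

	return pwritev2(fd, &iov, 1, off, RWF_UNCACHED);
}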
File systems need to support this. The patches add support for the
generic filemap helpers, and for iomap. Then ext4 and XFS are marked as
supporting it. The last patch adds support for btrfs as well, lightly
tested. The read side is already done by filemap; only the write side
needs a bit of help. The amount of code here is really trivial, and the
only reason the fs opt-in is necessary is to have an RWF_UNCACHED IO
return -EOPNOTSUPP just in case the fs doesn't use either the generic
paths or iomap. Adding "support" to other file systems should be
trivial, most of the time just a one-liner adding FOP_UNCACHED to the
fop_flags in the file_operations struct.
Performance results are in patch 8 for reads and patch 10 for writes,
with the tldr being that I see about a 65% improvement in performance
for both, with fully predictable IO times. CPU reduction is substantial
as well, with no kswapd activity at all for reclaim when using uncached
IO.
Using it from applications is trivial - just set RWF_UNCACHED for the
read or write, using pwritev2(2) or preadv2(2). For io_uring, same
thing, just set RWF_UNCACHED in sqe->rw_flags for a buffered read/write
operation. And that's it.
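For example, a complete minimal uncached read, again hardcoding the
flag value for older headers:

#define _GNU_SOURCE
#include <fcntl.h>
#include <stdio.h>
#include <sys/uio.h>

#ifndef RWF_UNCACHED
#define RWF_UNCACHED	0x00000080
#endif

int main(int argc, char *argv[])
{
	char buf[65536];
	struct iovec iov = { .iov_base = buf, .iov_len = sizeof(buf) };
	ssize_t ret;
	int fd;

	if (argc < 2) {
		fprintf(stderr, "usage: %s <file>\n", argv[0]);
		return 1;
	}
	fd = open(argv[1], O_RDONLY);
	if (fd < 0) {
		perror("open");
		return 1;
	}
	/* reads like normal buffered IO, but folios instantiated by
	 * this call are pruned once the data has been copied */
	ret = preadv2(fd, &iov, 1, 0, RWF_UNCACHED);
	if (ret < 0)
		perror("preadv2");
	else
		printf("read %zd bytes uncached\n", ret);
	return 0;
}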
Patches 1..7 are just prep patches, and should have no functional
changes at all. Patch 8 adds support for the filemap path for
RWF_UNCACHED reads, patch 10 adds support for filemap RWF_UNCACHED
writes, and patches 13..17 add ext4, xfs/iomap, and btrfs support.
Passes full xfstests and fsx overnight runs, no issues observed. That
includes the vm running the testing also using RWF_UNCACHED on the host.
I'll post fsstress and fsx patches for RWF_UNCACHED separately. As far
as I'm concerned, no further work needs doing here. Once we're into
the 6.13 merge window, I'll split up this series and aim to get it
landed that way. There are really 4 parts to this - generic mm bits,
ext4 bits, xfs bits, and btrfs bits.
A git tree for the patches is here:
https://git.kernel.dk/cgit/linux/log/?h=buffered-uncached.7
fs/btrfs/bio.c | 4 +-
fs/btrfs/bio.h | 2 +
fs/btrfs/extent_io.c | 8 ++-
fs/btrfs/file.c | 9 ++-
fs/ext4/ext4.h | 1 +
fs/ext4/file.c | 2 +-
fs/ext4/inline.c | 7 +-
fs/ext4/inode.c | 18 +++++-
fs/ext4/page-io.c | 28 ++++----
fs/iomap/buffered-io.c | 15 ++++-
fs/xfs/xfs_aops.c | 7 +-
fs/xfs/xfs_file.c | 3 +-
include/linux/fs.h | 21 +++++-
include/linux/iomap.h | 8 ++-
include/linux/page-flags.h | 5 ++
include/linux/pagemap.h | 14 ++++
include/trace/events/mmflags.h | 3 +-
include/uapi/linux/fs.h | 6 +-
mm/filemap.c | 114 +++++++++++++++++++++++++++++----
mm/readahead.c | 22 +++++--
mm/swap.c | 2 +
mm/truncate.c | 35 ++++++----
22 files changed, 271 insertions(+), 63 deletions(-)
Since v4
- Use foliop_is_uncached() in ext4 rather than do manual compares with
foliop_uncached.
- Add filemap_fdatawrite_range_kick() helper and use that in
generic_write_sync() to kick off uncached writeback, rather than
needing every fs to add a call to generic_uncached_write().
- Drop generic_uncached_write() helper, not needed anymore.
- Skip folio_unmap_invalidate() if the folio is dirty.
- Move IOMAP_F_UNCACHED to the internal iomap flags section, and add
comment from Darrick to it as well.
- Only kick uncached writeback in generic_write_sync() if
iocb_is_dsync() isn't true.
- Disable RWF_UNCACHED on dax mappings. They require more extensive
invalidation, and as it isn't a likely use case, just disable it
for now.
- Update a few commit messages
--
Jens Axboe
* [PATCH 01/17] mm/filemap: change filemap_create_folio() to take a struct kiocb
From: Jens Axboe @ 2024-11-14 15:25 UTC
To: linux-mm, linux-fsdevel
Cc: hannes, clm, linux-kernel, willy, kirill, linux-btrfs,
linux-ext4, linux-xfs, bfoster, Jens Axboe
Rather than pass in both the file and position directly, just take a
struct kiocb instead. While doing so, move the ki_flags checking into
filemap_create_folio() as well, in preparation for actually needing
the kiocb in the function.
No functional changes in this patch.
Signed-off-by: Jens Axboe <axboe@kernel.dk>
---
mm/filemap.c | 17 +++++++++--------
1 file changed, 9 insertions(+), 8 deletions(-)
diff --git a/mm/filemap.c b/mm/filemap.c
index 56fa431c52af..91974308e9bf 100644
--- a/mm/filemap.c
+++ b/mm/filemap.c
@@ -2460,15 +2460,17 @@ static int filemap_update_page(struct kiocb *iocb,
return error;
}
-static int filemap_create_folio(struct file *file,
- struct address_space *mapping, loff_t pos,
- struct folio_batch *fbatch)
+static int filemap_create_folio(struct kiocb *iocb,
+ struct address_space *mapping, struct folio_batch *fbatch)
{
struct folio *folio;
int error;
unsigned int min_order = mapping_min_folio_order(mapping);
pgoff_t index;
+ if (iocb->ki_flags & (IOCB_NOWAIT | IOCB_WAITQ))
+ return -EAGAIN;
+
folio = filemap_alloc_folio(mapping_gfp_mask(mapping), min_order);
if (!folio)
return -ENOMEM;
@@ -2487,7 +2489,7 @@ static int filemap_create_folio(struct file *file,
* well to keep locking rules simple.
*/
filemap_invalidate_lock_shared(mapping);
- index = (pos >> (PAGE_SHIFT + min_order)) << min_order;
+ index = (iocb->ki_pos >> (PAGE_SHIFT + min_order)) << min_order;
error = filemap_add_folio(mapping, folio, index,
mapping_gfp_constraint(mapping, GFP_KERNEL));
if (error == -EEXIST)
@@ -2495,7 +2497,8 @@ static int filemap_create_folio(struct file *file,
if (error)
goto error;
- error = filemap_read_folio(file, mapping->a_ops->read_folio, folio);
+ error = filemap_read_folio(iocb->ki_filp, mapping->a_ops->read_folio,
+ folio);
if (error)
goto error;
@@ -2551,9 +2554,7 @@ static int filemap_get_pages(struct kiocb *iocb, size_t count,
filemap_get_read_batch(mapping, index, last_index - 1, fbatch);
}
if (!folio_batch_count(fbatch)) {
- if (iocb->ki_flags & (IOCB_NOWAIT | IOCB_WAITQ))
- return -EAGAIN;
- err = filemap_create_folio(filp, mapping, iocb->ki_pos, fbatch);
+ err = filemap_create_folio(iocb, mapping, fbatch);
if (err == AOP_TRUNCATED_PAGE)
goto retry;
return err;
--
2.45.2
* [PATCH 02/17] mm/readahead: add folio allocation helper
From: Jens Axboe @ 2024-11-14 15:25 UTC
To: linux-mm, linux-fsdevel
Cc: hannes, clm, linux-kernel, willy, kirill, linux-btrfs,
linux-ext4, linux-xfs, bfoster, Jens Axboe
Just a wrapper around filemap_alloc_folio() for now, but add it in
preparation for modifying the folio based on the 'ractl' being passed
in.
No functional changes in this patch.
Signed-off-by: Jens Axboe <axboe@kernel.dk>
---
mm/readahead.c | 16 +++++++++++-----
1 file changed, 11 insertions(+), 5 deletions(-)
diff --git a/mm/readahead.c b/mm/readahead.c
index 3dc6c7a128dd..003cfe79880d 100644
--- a/mm/readahead.c
+++ b/mm/readahead.c
@@ -188,6 +188,12 @@ static void read_pages(struct readahead_control *rac)
BUG_ON(readahead_count(rac));
}
+static struct folio *ractl_alloc_folio(struct readahead_control *ractl,
+ gfp_t gfp_mask, unsigned int order)
+{
+ return filemap_alloc_folio(gfp_mask, order);
+}
+
/**
* page_cache_ra_unbounded - Start unchecked readahead.
* @ractl: Readahead control.
@@ -260,8 +266,8 @@ void page_cache_ra_unbounded(struct readahead_control *ractl,
continue;
}
- folio = filemap_alloc_folio(gfp_mask,
- mapping_min_folio_order(mapping));
+ folio = ractl_alloc_folio(ractl, gfp_mask,
+ mapping_min_folio_order(mapping));
if (!folio)
break;
@@ -431,7 +437,7 @@ static inline int ra_alloc_folio(struct readahead_control *ractl, pgoff_t index,
pgoff_t mark, unsigned int order, gfp_t gfp)
{
int err;
- struct folio *folio = filemap_alloc_folio(gfp, order);
+ struct folio *folio = ractl_alloc_folio(ractl, gfp, order);
if (!folio)
return -ENOMEM;
@@ -753,7 +759,7 @@ void readahead_expand(struct readahead_control *ractl,
if (folio && !xa_is_value(folio))
return; /* Folio apparently present */
- folio = filemap_alloc_folio(gfp_mask, min_order);
+ folio = ractl_alloc_folio(ractl, gfp_mask, min_order);
if (!folio)
return;
@@ -782,7 +788,7 @@ void readahead_expand(struct readahead_control *ractl,
if (folio && !xa_is_value(folio))
return; /* Folio apparently present */
- folio = filemap_alloc_folio(gfp_mask, min_order);
+ folio = ractl_alloc_folio(ractl, gfp_mask, min_order);
if (!folio)
return;
--
2.45.2
* [PATCH 03/17] mm: add PG_uncached page flag
From: Jens Axboe @ 2024-11-14 15:25 UTC
To: linux-mm, linux-fsdevel
Cc: hannes, clm, linux-kernel, willy, kirill, linux-btrfs,
linux-ext4, linux-xfs, bfoster, Jens Axboe
Add a page flag that file IO can use to indicate that the IO being done
is uncached, as in it should not persist in the page cache after the IO
has been completed.
Signed-off-by: Jens Axboe <axboe@kernel.dk>
---
include/linux/page-flags.h | 5 +++++
include/trace/events/mmflags.h | 3 ++-
2 files changed, 7 insertions(+), 1 deletion(-)
diff --git a/include/linux/page-flags.h b/include/linux/page-flags.h
index cc839e4365c1..3c4003495929 100644
--- a/include/linux/page-flags.h
+++ b/include/linux/page-flags.h
@@ -110,6 +110,7 @@ enum pageflags {
PG_reclaim, /* To be reclaimed asap */
PG_swapbacked, /* Page is backed by RAM/swap */
PG_unevictable, /* Page is "unevictable" */
+ PG_uncached, /* uncached read/write IO */
#ifdef CONFIG_MMU
PG_mlocked, /* Page is vma mlocked */
#endif
@@ -562,6 +563,10 @@ PAGEFLAG(Reclaim, reclaim, PF_NO_TAIL)
FOLIO_FLAG(readahead, FOLIO_HEAD_PAGE)
FOLIO_TEST_CLEAR_FLAG(readahead, FOLIO_HEAD_PAGE)
+FOLIO_FLAG(uncached, FOLIO_HEAD_PAGE)
+ FOLIO_TEST_CLEAR_FLAG(uncached, FOLIO_HEAD_PAGE)
+ __FOLIO_SET_FLAG(uncached, FOLIO_HEAD_PAGE)
+
#ifdef CONFIG_HIGHMEM
/*
* Must use a macro here due to header dependency issues. page_zone() is not
diff --git a/include/trace/events/mmflags.h b/include/trace/events/mmflags.h
index bb8a59c6caa2..b60057284102 100644
--- a/include/trace/events/mmflags.h
+++ b/include/trace/events/mmflags.h
@@ -116,7 +116,8 @@
DEF_PAGEFLAG_NAME(head), \
DEF_PAGEFLAG_NAME(reclaim), \
DEF_PAGEFLAG_NAME(swapbacked), \
- DEF_PAGEFLAG_NAME(unevictable) \
+ DEF_PAGEFLAG_NAME(unevictable), \
+ DEF_PAGEFLAG_NAME(uncached) \
IF_HAVE_PG_MLOCK(mlocked) \
IF_HAVE_PG_HWPOISON(hwpoison) \
IF_HAVE_PG_IDLE(idle) \
--
2.45.2
* [PATCH 04/17] mm/readahead: add readahead_control->uncached member
From: Jens Axboe @ 2024-11-14 15:25 UTC
To: linux-mm, linux-fsdevel
Cc: hannes, clm, linux-kernel, willy, kirill, linux-btrfs,
linux-ext4, linux-xfs, bfoster, Jens Axboe
If ractl->uncached is set to true, then folios created are marked as
uncached as well.
Signed-off-by: Jens Axboe <axboe@kernel.dk>
---
include/linux/pagemap.h | 1 +
mm/readahead.c | 8 +++++++-
2 files changed, 8 insertions(+), 1 deletion(-)
diff --git a/include/linux/pagemap.h b/include/linux/pagemap.h
index 68a5f1ff3301..8afacb7520d4 100644
--- a/include/linux/pagemap.h
+++ b/include/linux/pagemap.h
@@ -1350,6 +1350,7 @@ struct readahead_control {
pgoff_t _index;
unsigned int _nr_pages;
unsigned int _batch_count;
+ bool uncached;
bool _workingset;
unsigned long _pflags;
};
diff --git a/mm/readahead.c b/mm/readahead.c
index 003cfe79880d..8dbeab9bc1f0 100644
--- a/mm/readahead.c
+++ b/mm/readahead.c
@@ -191,7 +191,13 @@ static void read_pages(struct readahead_control *rac)
static struct folio *ractl_alloc_folio(struct readahead_control *ractl,
gfp_t gfp_mask, unsigned int order)
{
- return filemap_alloc_folio(gfp_mask, order);
+ struct folio *folio;
+
+ folio = filemap_alloc_folio(gfp_mask, order);
+ if (folio && ractl->uncached)
+ __folio_set_uncached(folio);
+
+ return folio;
}
/**
--
2.45.2
* [PATCH 05/17] mm/filemap: use page_cache_sync_ra() to kick off read-ahead
From: Jens Axboe @ 2024-11-14 15:25 UTC
To: linux-mm, linux-fsdevel
Cc: hannes, clm, linux-kernel, willy, kirill, linux-btrfs,
linux-ext4, linux-xfs, bfoster, Jens Axboe
Rather than use the page_cache_sync_readahead() helper, define our own
ractl and use page_cache_sync_ra() directly. In preparation for needing
to modify ractl inside filemap_get_pages().
No functional changes in this patch.
Signed-off-by: Jens Axboe <axboe@kernel.dk>
---
mm/filemap.c | 6 +++---
1 file changed, 3 insertions(+), 3 deletions(-)
diff --git a/mm/filemap.c b/mm/filemap.c
index 91974308e9bf..02d9cb585195 100644
--- a/mm/filemap.c
+++ b/mm/filemap.c
@@ -2528,7 +2528,6 @@ static int filemap_get_pages(struct kiocb *iocb, size_t count,
{
struct file *filp = iocb->ki_filp;
struct address_space *mapping = filp->f_mapping;
- struct file_ra_state *ra = &filp->f_ra;
pgoff_t index = iocb->ki_pos >> PAGE_SHIFT;
pgoff_t last_index;
struct folio *folio;
@@ -2543,12 +2542,13 @@ static int filemap_get_pages(struct kiocb *iocb, size_t count,
filemap_get_read_batch(mapping, index, last_index - 1, fbatch);
if (!folio_batch_count(fbatch)) {
+ DEFINE_READAHEAD(ractl, filp, &filp->f_ra, mapping, index);
+
if (iocb->ki_flags & IOCB_NOIO)
return -EAGAIN;
if (iocb->ki_flags & IOCB_NOWAIT)
flags = memalloc_noio_save();
- page_cache_sync_readahead(mapping, ra, filp, index,
- last_index - index);
+ page_cache_sync_ra(&ractl, last_index - index);
if (iocb->ki_flags & IOCB_NOWAIT)
memalloc_noio_restore(flags);
filemap_get_read_batch(mapping, index, last_index - 1, fbatch);
--
2.45.2
* [PATCH 06/17] mm/truncate: add folio_unmap_invalidate() helper
From: Jens Axboe @ 2024-11-14 15:25 UTC
To: linux-mm, linux-fsdevel
Cc: hannes, clm, linux-kernel, willy, kirill, linux-btrfs,
linux-ext4, linux-xfs, bfoster, Jens Axboe
Add a folio_unmap_invalidate() helper, which unmaps and invalidates a
given folio. The caller must already have locked the folio. Use this
new helper in invalidate_inode_pages2_range(), rather than duplicate
the code there.
In preparation for using this elsewhere as well, have it take a gfp_t
mask rather than assume GFP_KERNEL is the right choice. This bubbles
back to invalidate_complete_folio2() as well.
Signed-off-by: Jens Axboe <axboe@kernel.dk>
---
include/linux/pagemap.h | 2 ++
mm/truncate.c | 35 ++++++++++++++++++++++-------------
2 files changed, 24 insertions(+), 13 deletions(-)
diff --git a/include/linux/pagemap.h b/include/linux/pagemap.h
index 8afacb7520d4..d55bf995bd9e 100644
--- a/include/linux/pagemap.h
+++ b/include/linux/pagemap.h
@@ -34,6 +34,8 @@ int kiocb_invalidate_pages(struct kiocb *iocb, size_t count);
void kiocb_invalidate_post_direct_write(struct kiocb *iocb, size_t count);
int filemap_invalidate_pages(struct address_space *mapping,
loff_t pos, loff_t end, bool nowait);
+int folio_unmap_invalidate(struct address_space *mapping, struct folio *folio,
+ gfp_t gfp);
int write_inode_now(struct inode *, int sync);
int filemap_fdatawrite(struct address_space *);
diff --git a/mm/truncate.c b/mm/truncate.c
index 0668cd340a46..6ea16c537534 100644
--- a/mm/truncate.c
+++ b/mm/truncate.c
@@ -547,12 +547,12 @@ EXPORT_SYMBOL(invalidate_mapping_pages);
* sitting in the folio_add_lru() caches.
*/
static int invalidate_complete_folio2(struct address_space *mapping,
- struct folio *folio)
+ struct folio *folio, gfp_t gfp_mask)
{
if (folio->mapping != mapping)
return 0;
- if (!filemap_release_folio(folio, GFP_KERNEL))
+ if (!filemap_release_folio(folio, gfp_mask))
return 0;
spin_lock(&mapping->host->i_lock);
@@ -584,6 +584,25 @@ static int folio_launder(struct address_space *mapping, struct folio *folio)
return mapping->a_ops->launder_folio(folio);
}
+int folio_unmap_invalidate(struct address_space *mapping, struct folio *folio,
+ gfp_t gfp)
+{
+ int ret;
+
+ VM_BUG_ON_FOLIO(!folio_test_locked(folio), folio);
+
+ if (folio_test_dirty(folio))
+ return 0;
+ if (folio_mapped(folio))
+ unmap_mapping_folio(folio);
+ BUG_ON(folio_mapped(folio));
+
+ ret = folio_launder(mapping, folio);
+ if (!ret && !invalidate_complete_folio2(mapping, folio, gfp))
+ return -EBUSY;
+ return ret;
+}
+
/**
* invalidate_inode_pages2_range - remove range of pages from an address_space
* @mapping: the address_space
@@ -641,18 +660,8 @@ int invalidate_inode_pages2_range(struct address_space *mapping,
folio_unlock(folio);
continue;
}
- VM_BUG_ON_FOLIO(!folio_contains(folio, indices[i]), folio);
folio_wait_writeback(folio);
-
- if (folio_mapped(folio))
- unmap_mapping_folio(folio);
- BUG_ON(folio_mapped(folio));
-
- ret2 = folio_launder(mapping, folio);
- if (ret2 == 0) {
- if (!invalidate_complete_folio2(mapping, folio))
- ret2 = -EBUSY;
- }
+ ret2 = folio_unmap_invalidate(mapping, folio, GFP_KERNEL);
if (ret2 < 0)
ret = ret2;
folio_unlock(folio);
--
2.45.2
* [PATCH 07/17] fs: add RWF_UNCACHED iocb and FOP_UNCACHED file_operations flag
From: Jens Axboe @ 2024-11-14 15:25 UTC
To: linux-mm, linux-fsdevel
Cc: hannes, clm, linux-kernel, willy, kirill, linux-btrfs,
linux-ext4, linux-xfs, bfoster, Jens Axboe
If a file system supports uncached buffered IO, it may set FOP_UNCACHED
and enable RWF_UNCACHED. If RWF_UNCACHED is attempted without the file
system supporting it, it'll fail with -EOPNOTSUPP.
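Applications that want to use the flag opportunistically can fall back
on that error, e.g. something like this sketch (helper name is just
illustrative):

#define _GNU_SOURCE
#include <errno.h>
#include <sys/uio.h>

#ifndef RWF_UNCACHED
#define RWF_UNCACHED	0x00000080
#endif

static ssize_t read_maybe_uncached(int fd, struct iovec *iov, off_t off)
{
	ssize_t ret;

	ret = preadv2(fd, iov, 1, off, RWF_UNCACHED);
	/* fs without FOP_UNCACHED, or a pre-RWF_UNCACHED kernel */
	if (ret < 0 && errno == EOPNOTSUPP)
		ret = preadv2(fd, iov, 1, off, 0);
	return ret;
}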
Signed-off-by: Jens Axboe <axboe@kernel.dk>
---
include/linux/fs.h | 14 +++++++++++++-
include/uapi/linux/fs.h | 6 +++++-
2 files changed, 18 insertions(+), 2 deletions(-)
diff --git a/include/linux/fs.h b/include/linux/fs.h
index 3559446279c1..45510d0b8de0 100644
--- a/include/linux/fs.h
+++ b/include/linux/fs.h
@@ -320,6 +320,7 @@ struct readahead_control;
#define IOCB_NOWAIT (__force int) RWF_NOWAIT
#define IOCB_APPEND (__force int) RWF_APPEND
#define IOCB_ATOMIC (__force int) RWF_ATOMIC
+#define IOCB_UNCACHED (__force int) RWF_UNCACHED
/* non-RWF related bits - start at 16 */
#define IOCB_EVENTFD (1 << 16)
@@ -354,7 +355,8 @@ struct readahead_control;
{ IOCB_SYNC, "SYNC" }, \
{ IOCB_NOWAIT, "NOWAIT" }, \
{ IOCB_APPEND, "APPEND" }, \
- { IOCB_ATOMIC, "ATOMIC"}, \
+ { IOCB_ATOMIC, "ATOMIC" }, \
+ { IOCB_UNCACHED, "UNCACHED" }, \
{ IOCB_EVENTFD, "EVENTFD"}, \
{ IOCB_DIRECT, "DIRECT" }, \
{ IOCB_WRITE, "WRITE" }, \
@@ -2116,6 +2118,8 @@ struct file_operations {
#define FOP_HUGE_PAGES ((__force fop_flags_t)(1 << 4))
/* Treat loff_t as unsigned (e.g., /dev/mem) */
#define FOP_UNSIGNED_OFFSET ((__force fop_flags_t)(1 << 5))
+/* File system supports uncached read/write buffered IO */
+#define FOP_UNCACHED ((__force fop_flags_t)(1 << 6))
/* Wrap a directory iterator that needs exclusive inode access */
int wrap_directory_iterator(struct file *, struct dir_context *,
@@ -3532,6 +3536,14 @@ static inline int kiocb_set_rw_flags(struct kiocb *ki, rwf_t flags,
if (!(ki->ki_filp->f_mode & FMODE_CAN_ATOMIC_WRITE))
return -EOPNOTSUPP;
}
+ if (flags & RWF_UNCACHED) {
+ /* file system must support it */
+ if (!(ki->ki_filp->f_op->fop_flags & FOP_UNCACHED))
+ return -EOPNOTSUPP;
+ /* DAX mappings not supported */
+ if (IS_DAX(ki->ki_filp->f_mapping->host))
+ return -EOPNOTSUPP;
+ }
kiocb_flags |= (__force int) (flags & RWF_SUPPORTED);
if (flags & RWF_SYNC)
kiocb_flags |= IOCB_DSYNC;
diff --git a/include/uapi/linux/fs.h b/include/uapi/linux/fs.h
index 753971770733..dc77cd8ae1a3 100644
--- a/include/uapi/linux/fs.h
+++ b/include/uapi/linux/fs.h
@@ -332,9 +332,13 @@ typedef int __bitwise __kernel_rwf_t;
/* Atomic Write */
#define RWF_ATOMIC ((__force __kernel_rwf_t)0x00000040)
+/* buffered IO that drops the cache after reading or writing data */
+#define RWF_UNCACHED ((__force __kernel_rwf_t)0x00000080)
+
/* mask of flags supported by the kernel */
#define RWF_SUPPORTED (RWF_HIPRI | RWF_DSYNC | RWF_SYNC | RWF_NOWAIT |\
- RWF_APPEND | RWF_NOAPPEND | RWF_ATOMIC)
+ RWF_APPEND | RWF_NOAPPEND | RWF_ATOMIC |\
+ RWF_UNCACHED)
#define PROCFS_IOCTL_MAGIC 'f'
--
2.45.2
* [PATCH 08/17] mm/filemap: add read support for RWF_UNCACHED
From: Jens Axboe @ 2024-11-14 15:25 UTC
To: linux-mm, linux-fsdevel
Cc: hannes, clm, linux-kernel, willy, kirill, linux-btrfs,
linux-ext4, linux-xfs, bfoster, Jens Axboe
Add RWF_UNCACHED as a read operation flag, which means that any data
read will be removed from the page cache upon completion. Uses the page
cache to synchronize, and simply prunes folios that were instantiated
when the operation completes. While it would be possible to use private
pages for this, using the page cache as synchronization is handy for a
variety of reasons:
1) No special truncate magic is needed
2) Async buffered reads need some place to serialize, using the page
cache is a lot easier than writing extra code for this
3) The pruning cost is pretty reasonable
and the code to support this is much simpler as a result.
You can think of uncached buffered IO as being the much more attractive
cousin of O_DIRECT - it has none of the restrictions of O_DIRECT. Yes,
it will copy the data, but unlike regular buffered IO, it doesn't run
into the unpredictability of the page cache in terms of reclaim. As an
example, on a test box with 32 drives, reading them with buffered IO
looks as follows:
Reading bs 65536, uncached 0
1s: 145945MB/sec
2s: 158067MB/sec
3s: 157007MB/sec
4s: 148622MB/sec
5s: 118824MB/sec
6s: 70494MB/sec
7s: 41754MB/sec
8s: 90811MB/sec
9s: 92204MB/sec
10s: 95178MB/sec
11s: 95488MB/sec
12s: 95552MB/sec
13s: 96275MB/sec
where it's quite easy to see where the page cache filled up, and
performance went from good to erratic, and finally settled at a much
lower rate. Looking at top while this is ongoing, we see:
PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND
7535 root 20 0 267004 0 0 S 3199 0.0 8:40.65 uncached
3326 root 20 0 0 0 0 R 100.0 0.0 0:16.40 kswapd4
3327 root 20 0 0 0 0 R 100.0 0.0 0:17.22 kswapd5
3328 root 20 0 0 0 0 R 100.0 0.0 0:13.29 kswapd6
3332 root 20 0 0 0 0 R 100.0 0.0 0:11.11 kswapd10
3339 root 20 0 0 0 0 R 100.0 0.0 0:16.25 kswapd17
3348 root 20 0 0 0 0 R 100.0 0.0 0:16.40 kswapd26
3343 root 20 0 0 0 0 R 100.0 0.0 0:16.30 kswapd21
3344 root 20 0 0 0 0 R 100.0 0.0 0:11.92 kswapd22
3349 root 20 0 0 0 0 R 100.0 0.0 0:16.28 kswapd27
3352 root 20 0 0 0 0 R 99.7 0.0 0:11.89 kswapd30
3353 root 20 0 0 0 0 R 96.7 0.0 0:16.04 kswapd31
3329 root 20 0 0 0 0 R 96.4 0.0 0:11.41 kswapd7
3345 root 20 0 0 0 0 R 96.4 0.0 0:13.40 kswapd23
3330 root 20 0 0 0 0 S 91.1 0.0 0:08.28 kswapd8
3350 root 20 0 0 0 0 S 86.8 0.0 0:11.13 kswapd28
3325 root 20 0 0 0 0 S 76.3 0.0 0:07.43 kswapd3
3341 root 20 0 0 0 0 S 74.7 0.0 0:08.85 kswapd19
3334 root 20 0 0 0 0 S 71.7 0.0 0:10.04 kswapd12
3351 root 20 0 0 0 0 R 60.5 0.0 0:09.59 kswapd29
3323 root 20 0 0 0 0 R 57.6 0.0 0:11.50 kswapd1
[...]
which is just showing a partial list of the 32 kswapd threads that are
running mostly full tilt, burning ~28 full CPU cores.
If the same test case is run with RWF_UNCACHED set for the buffered read,
the output looks as follows:
Reading bs 65536, uncached 1
1s: 153144MB/sec
2s: 156760MB/sec
3s: 158110MB/sec
4s: 158009MB/sec
5s: 158043MB/sec
6s: 157638MB/sec
7s: 157999MB/sec
8s: 158024MB/sec
9s: 157764MB/sec
10s: 157477MB/sec
11s: 157417MB/sec
12s: 157455MB/sec
13s: 157233MB/sec
14s: 156692MB/sec
which is just chugging along at ~155GB/sec of read performance. Looking
at top, we see:
PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND
7961 root 20 0 267004 0 0 S 3180 0.0 5:37.95 uncached
8024 axboe 20 0 14292 4096 0 R 1.0 0.0 0:00.13 top
where just the test app is using CPU, and no reclaim is taking place outside
of the main thread. Not only is performance 65% better, it's also using
half the CPU to do it.
Signed-off-by: Jens Axboe <axboe@kernel.dk>
---
mm/filemap.c | 28 ++++++++++++++++++++++++++--
mm/swap.c | 2 ++
2 files changed, 28 insertions(+), 2 deletions(-)
diff --git a/mm/filemap.c b/mm/filemap.c
index 02d9cb585195..3d0614ea5f59 100644
--- a/mm/filemap.c
+++ b/mm/filemap.c
@@ -2474,6 +2474,8 @@ static int filemap_create_folio(struct kiocb *iocb,
folio = filemap_alloc_folio(mapping_gfp_mask(mapping), min_order);
if (!folio)
return -ENOMEM;
+ if (iocb->ki_flags & IOCB_UNCACHED)
+ __folio_set_uncached(folio);
/*
* Protect against truncate / hole punch. Grabbing invalidate_lock
@@ -2519,6 +2521,8 @@ static int filemap_readahead(struct kiocb *iocb, struct file *file,
if (iocb->ki_flags & IOCB_NOIO)
return -EAGAIN;
+ if (iocb->ki_flags & IOCB_UNCACHED)
+ ractl.uncached = 1;
page_cache_async_ra(&ractl, folio, last_index - folio->index);
return 0;
}
@@ -2548,6 +2552,8 @@ static int filemap_get_pages(struct kiocb *iocb, size_t count,
return -EAGAIN;
if (iocb->ki_flags & IOCB_NOWAIT)
flags = memalloc_noio_save();
+ if (iocb->ki_flags & IOCB_UNCACHED)
+ ractl.uncached = 1;
page_cache_sync_ra(&ractl, last_index - index);
if (iocb->ki_flags & IOCB_NOWAIT)
memalloc_noio_restore(flags);
@@ -2595,6 +2601,20 @@ static inline bool pos_same_folio(loff_t pos1, loff_t pos2, struct folio *folio)
return (pos1 >> shift == pos2 >> shift);
}
+static void filemap_uncached_read(struct address_space *mapping,
+ struct folio *folio)
+{
+ if (!folio_test_uncached(folio))
+ return;
+ if (folio_test_writeback(folio))
+ return;
+ if (folio_test_clear_uncached(folio)) {
+ folio_lock(folio);
+ folio_unmap_invalidate(mapping, folio, 0);
+ folio_unlock(folio);
+ }
+}
+
/**
* filemap_read - Read data from the page cache.
* @iocb: The iocb to read.
@@ -2706,8 +2726,12 @@ ssize_t filemap_read(struct kiocb *iocb, struct iov_iter *iter,
}
}
put_folios:
- for (i = 0; i < folio_batch_count(&fbatch); i++)
- folio_put(fbatch.folios[i]);
+ for (i = 0; i < folio_batch_count(&fbatch); i++) {
+ struct folio *folio = fbatch.folios[i];
+
+ filemap_uncached_read(mapping, folio);
+ folio_put(folio);
+ }
folio_batch_init(&fbatch);
} while (iov_iter_count(iter) && iocb->ki_pos < isize && !error);
diff --git a/mm/swap.c b/mm/swap.c
index b8e3259ea2c4..542f298d3dcd 100644
--- a/mm/swap.c
+++ b/mm/swap.c
@@ -472,6 +472,8 @@ static void folio_inc_refs(struct folio *folio)
*/
void folio_mark_accessed(struct folio *folio)
{
+ if (folio_test_uncached(folio))
+ return;
if (lru_gen_enabled()) {
folio_inc_refs(folio);
return;
--
2.45.2
* [PATCH 09/17] mm/filemap: drop uncached pages when writeback completes
From: Jens Axboe @ 2024-11-14 15:25 UTC
To: linux-mm, linux-fsdevel
Cc: hannes, clm, linux-kernel, willy, kirill, linux-btrfs,
linux-ext4, linux-xfs, bfoster, Jens Axboe
If the folio is marked as uncached, drop pages when writeback completes.
Intended to be used with RWF_UNCACHED, to avoid needing sync writes for
uncached IO.
Signed-off-by: Jens Axboe <axboe@kernel.dk>
---
mm/filemap.c | 28 ++++++++++++++++++++++++++++
1 file changed, 28 insertions(+)
diff --git a/mm/filemap.c b/mm/filemap.c
index 3d0614ea5f59..13815194ed8a 100644
--- a/mm/filemap.c
+++ b/mm/filemap.c
@@ -1600,6 +1600,27 @@ int folio_wait_private_2_killable(struct folio *folio)
}
EXPORT_SYMBOL(folio_wait_private_2_killable);
+/*
+ * If folio was marked as uncached, then pages should be dropped when writeback
+ * completes. Do that now. If we fail, it's likely because of a big folio -
+ * just reset uncached for that case and later completions should invalidate.
+ */
+static void folio_end_uncached(struct folio *folio)
+{
+ /*
+ * Hitting !in_task() should not happen off RWF_UNCACHED writeback, but
+ * can happen if normal writeback just happens to find dirty folios
+ * that were created as part of uncached writeback, and that writeback
+ * would otherwise not need non-IRQ handling. Just skip the
+ * invalidation in that case.
+ */
+ if (in_task() && folio_trylock(folio)) {
+ if (folio->mapping)
+ folio_unmap_invalidate(folio->mapping, folio, 0);
+ folio_unlock(folio);
+ }
+}
+
/**
* folio_end_writeback - End writeback against a folio.
* @folio: The folio.
@@ -1610,6 +1631,8 @@ EXPORT_SYMBOL(folio_wait_private_2_killable);
*/
void folio_end_writeback(struct folio *folio)
{
+ bool folio_uncached = false;
+
VM_BUG_ON_FOLIO(!folio_test_writeback(folio), folio);
/*
@@ -1631,9 +1654,14 @@ void folio_end_writeback(struct folio *folio)
* reused before the folio_wake_bit().
*/
folio_get(folio);
+ if (!folio_test_dirty(folio))
+ folio_uncached = folio_test_clear_uncached(folio);
if (__folio_end_writeback(folio))
folio_wake_bit(folio, PG_writeback);
acct_reclaim_writeback(folio);
+
+ if (folio_uncached)
+ folio_end_uncached(folio);
folio_put(folio);
}
EXPORT_SYMBOL(folio_end_writeback);
--
2.45.2
* [PATCH 10/17] mm/filemap: make buffered writes work with RWF_UNCACHED
From: Jens Axboe @ 2024-11-14 15:25 UTC
To: linux-mm, linux-fsdevel
Cc: hannes, clm, linux-kernel, willy, kirill, linux-btrfs,
linux-ext4, linux-xfs, bfoster, Jens Axboe
If RWF_UNCACHED is set for a write, mark new folios being written as
uncached. This is done by passing in the fact that it's an uncached write
through the folio pointer. We can only get there when IOCB_UNCACHED was
allowed, which can only happen if the file system opts in. Opting in means
it needs to compare the folio pointer against the magic uncached value
to know if it's an uncached write or not. If it is, then FGP_UNCACHED
should be used when creating new folios.
Uncached writes will drop any folios they create upon writeback
completion, but leave folios that may exist in that range alone. Since
->write_begin() doesn't currently take any flags, and to avoid needing
to change the callback kernel wide, use the foliop being passed in to
->write_begin() to signal if this is an uncached write or not. File
systems can then use that to mark newly created folios as uncached.
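As a sketch, an opted-in file system's ->write_begin() ends up doing
something like the below (myfs_write_begin is hypothetical; the real
conversions are in the ext4 and iomap patches later in the series):

static int myfs_write_begin(struct file *file,
			    struct address_space *mapping, loff_t pos,
			    unsigned len, struct folio **foliop,
			    void **fsdata)
{
	fgf_t fgp_flags = FGP_WRITEBEGIN;
	struct folio *folio;

	/* generic_perform_write() preloaded *foliop with the magic
	 * uncached value for IOCB_UNCACHED writes */
	if (foliop_is_uncached(foliop))
		fgp_flags |= FGP_UNCACHED;

	folio = __filemap_get_folio(mapping, pos >> PAGE_SHIFT,
				    fgp_flags, mapping_gfp_mask(mapping));
	if (IS_ERR(folio))
		return PTR_ERR(folio);
	*foliop = folio;
	return 0;
}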
This provides similar benefits to using RWF_UNCACHED with reads. Testing
buffered writes on 32 files:
writing bs 65536, uncached 0
1s: 196035MB/sec
2s: 132308MB/sec
3s: 132438MB/sec
4s: 116528MB/sec
5s: 103898MB/sec
6s: 108893MB/sec
7s: 99678MB/sec
8s: 106545MB/sec
9s: 106826MB/sec
10s: 101544MB/sec
11s: 111044MB/sec
12s: 124257MB/sec
13s: 116031MB/sec
14s: 114540MB/sec
15s: 115011MB/sec
16s: 115260MB/sec
17s: 116068MB/sec
18s: 116096MB/sec
where it's quite obvious where the page cache filled, and performance
dropped to about half of where it started, settling in at around
115GB/sec. Meanwhile, 32 kswapds were running full steam trying to
reclaim pages.
Running the same test with uncached buffered writes:
writing bs 65536, uncached 1
1s: 198974MB/sec
2s: 189618MB/sec
3s: 193601MB/sec
4s: 188582MB/sec
5s: 193487MB/sec
6s: 188341MB/sec
7s: 194325MB/sec
8s: 188114MB/sec
9s: 192740MB/sec
10s: 189206MB/sec
11s: 193442MB/sec
12s: 189659MB/sec
13s: 191732MB/sec
14s: 190701MB/sec
15s: 191789MB/sec
16s: 191259MB/sec
17s: 190613MB/sec
18s: 191951MB/sec
and the behavior is fully predictable, performing the same throughout
even after the page cache would otherwise have fully filled with dirty
data. It's also about 65% faster, and using half the CPU of the system
compared to the normal buffered write.
Signed-off-by: Jens Axboe <axboe@kernel.dk>
---
include/linux/fs.h | 5 +++++
include/linux/pagemap.h | 9 +++++++++
mm/filemap.c | 12 +++++++++++-
3 files changed, 25 insertions(+), 1 deletion(-)
diff --git a/include/linux/fs.h b/include/linux/fs.h
index 45510d0b8de0..122ae821989f 100644
--- a/include/linux/fs.h
+++ b/include/linux/fs.h
@@ -2877,6 +2877,11 @@ static inline ssize_t generic_write_sync(struct kiocb *iocb, ssize_t count)
(iocb->ki_flags & IOCB_SYNC) ? 0 : 1);
if (ret)
return ret;
+ } else if (iocb->ki_flags & IOCB_UNCACHED) {
+ struct address_space *mapping = iocb->ki_filp->f_mapping;
+
+ filemap_fdatawrite_range_kick(mapping, iocb->ki_pos,
+ iocb->ki_pos + count);
}
return count;
diff --git a/include/linux/pagemap.h b/include/linux/pagemap.h
index d55bf995bd9e..cc02518d338d 100644
--- a/include/linux/pagemap.h
+++ b/include/linux/pagemap.h
@@ -14,6 +14,7 @@
#include <linux/gfp.h>
#include <linux/bitops.h>
#include <linux/hardirq.h> /* for in_interrupt() */
+#include <linux/writeback.h>
#include <linux/hugetlb_inline.h>
struct folio_batch;
@@ -70,6 +71,14 @@ static inline int filemap_write_and_wait(struct address_space *mapping)
return filemap_write_and_wait_range(mapping, 0, LLONG_MAX);
}
+/*
+ * Value passed in to ->write_begin() if IOCB_UNCACHED is set for the write,
+ * and the ->write_begin() handler on a file system supporting FOP_UNCACHED
+ * must check for this and pass FGP_UNCACHED for folio creation.
+ */
+#define foliop_uncached ((struct folio *) 0xfee1c001)
+#define foliop_is_uncached(foliop) (*(foliop) == foliop_uncached)
+
/**
* filemap_set_wb_err - set a writeback error on an address_space
* @mapping: mapping in which to set writeback error
diff --git a/mm/filemap.c b/mm/filemap.c
index 13815194ed8a..297cb53332ff 100644
--- a/mm/filemap.c
+++ b/mm/filemap.c
@@ -4076,7 +4076,7 @@ ssize_t generic_perform_write(struct kiocb *iocb, struct iov_iter *i)
ssize_t written = 0;
do {
- struct folio *folio;
+ struct folio *folio = NULL;
size_t offset; /* Offset into folio */
size_t bytes; /* Bytes to write to folio */
size_t copied; /* Bytes copied from user */
@@ -4104,6 +4104,16 @@ ssize_t generic_perform_write(struct kiocb *iocb, struct iov_iter *i)
break;
}
+ /*
+ * If IOCB_UNCACHED is set here, we know the file system
+ * supports it. And hence it'll know to check foliop for being
+ * set to this magic value. If so, it's an uncached write.
+ * Whenever ->write_begin() changes prototypes again, this
+ * can go away and just pass iocb or iocb flags.
+ */
+ if (iocb->ki_flags & IOCB_UNCACHED)
+ folio = foliop_uncached;
+
status = a_ops->write_begin(file, mapping, pos, bytes,
&folio, &fsdata);
if (unlikely(status < 0))
--
2.45.2
* [PATCH 11/17] mm/filemap: add filemap_fdatawrite_range_kick() helper
From: Jens Axboe @ 2024-11-14 15:25 UTC
To: linux-mm, linux-fsdevel
Cc: hannes, clm, linux-kernel, willy, kirill, linux-btrfs,
linux-ext4, linux-xfs, bfoster, Jens Axboe
Works like filemap_fdatawrite_range(), except it's a non-integrity data
writeback and hence only starts writeback on the specified range. Will
help facilitate generically starting uncached writeback from
generic_write_sync(), as header dependencies preclude doing this inline
from fs.h.
Signed-off-by: Jens Axboe <axboe@kernel.dk>
---
include/linux/fs.h | 2 ++
mm/filemap.c | 18 ++++++++++++++++++
2 files changed, 20 insertions(+)
diff --git a/include/linux/fs.h b/include/linux/fs.h
index 122ae821989f..560d3ee1bb8a 100644
--- a/include/linux/fs.h
+++ b/include/linux/fs.h
@@ -2845,6 +2845,8 @@ extern int __must_check file_fdatawait_range(struct file *file, loff_t lstart,
extern int __must_check file_check_and_advance_wb_err(struct file *file);
extern int __must_check file_write_and_wait_range(struct file *file,
loff_t start, loff_t end);
+int filemap_fdatawrite_range_kick(struct address_space *mapping, loff_t start,
+ loff_t end);
static inline int file_write_and_wait(struct file *file)
{
diff --git a/mm/filemap.c b/mm/filemap.c
index 297cb53332ff..a8a9fb986d2d 100644
--- a/mm/filemap.c
+++ b/mm/filemap.c
@@ -450,6 +450,24 @@ int filemap_fdatawrite_range(struct address_space *mapping, loff_t start,
}
EXPORT_SYMBOL(filemap_fdatawrite_range);
+/**
+ * filemap_fdatawrite_range_kick - start writeback on a range
+ * @mapping: target address_space
+ * @start: index to start writeback on
+ * @end: last (non-inclusive) index for writeback
+ *
+ * This is a non-integrity writeback helper, to start writing back folios
+ * for the indicated range.
+ *
+ * Return: %0 on success, negative error code otherwise.
+ */
+int filemap_fdatawrite_range_kick(struct address_space *mapping, loff_t start,
+ loff_t end)
+{
+ return __filemap_fdatawrite_range(mapping, start, end, WB_SYNC_NONE);
+}
+EXPORT_SYMBOL_GPL(filemap_fdatawrite_range_kick);
+
/**
* filemap_flush - mostly a non-blocking flush
* @mapping: target address_space
--
2.45.2
* [PATCH 12/17] mm: add FGP_UNCACHED folio creation flag
From: Jens Axboe @ 2024-11-14 15:25 UTC
To: linux-mm, linux-fsdevel
Cc: hannes, clm, linux-kernel, willy, kirill, linux-btrfs,
linux-ext4, linux-xfs, bfoster, Jens Axboe
Callers can pass this in for uncached folio creation, in which case if
a folio is newly created it gets marked as uncached. If a folio exists
for this index and lookup succeeds, then it will not get marked as
uncached. If an !uncached lookup finds a cached folio, clear the flag.
For that case, there are competing uncached and cached users of the
folio, and it should not get pruned.
Signed-off-by: Jens Axboe <axboe@kernel.dk>
---
include/linux/pagemap.h | 2 ++
mm/filemap.c | 5 +++++
2 files changed, 7 insertions(+)
diff --git a/include/linux/pagemap.h b/include/linux/pagemap.h
index cc02518d338d..860807e34b8c 100644
--- a/include/linux/pagemap.h
+++ b/include/linux/pagemap.h
@@ -721,6 +721,7 @@ pgoff_t page_cache_prev_miss(struct address_space *mapping,
* * %FGP_NOFS - __GFP_FS will get cleared in gfp.
* * %FGP_NOWAIT - Don't block on the folio lock.
* * %FGP_STABLE - Wait for the folio to be stable (finished writeback)
+ * * %FGP_UNCACHED - Uncached buffered IO
* * %FGP_WRITEBEGIN - The flags to use in a filesystem write_begin()
* implementation.
*/
@@ -734,6 +735,7 @@ typedef unsigned int __bitwise fgf_t;
#define FGP_NOWAIT ((__force fgf_t)0x00000020)
#define FGP_FOR_MMAP ((__force fgf_t)0x00000040)
#define FGP_STABLE ((__force fgf_t)0x00000080)
+#define FGP_UNCACHED ((__force fgf_t)0x00000100)
#define FGF_GET_ORDER(fgf) (((__force unsigned)fgf) >> 26) /* top 6 bits */
#define FGP_WRITEBEGIN (FGP_LOCK | FGP_WRITE | FGP_CREAT | FGP_STABLE)
diff --git a/mm/filemap.c b/mm/filemap.c
index a8a9fb986d2d..dbc3fa975ad1 100644
--- a/mm/filemap.c
+++ b/mm/filemap.c
@@ -2002,6 +2002,8 @@ struct folio *__filemap_get_folio(struct address_space *mapping, pgoff_t index,
/* Init accessed so avoid atomic mark_page_accessed later */
if (fgp_flags & FGP_ACCESSED)
__folio_set_referenced(folio);
+ if (fgp_flags & FGP_UNCACHED)
+ __folio_set_uncached(folio);
err = filemap_add_folio(mapping, folio, index, gfp);
if (!err)
@@ -2024,6 +2026,9 @@ struct folio *__filemap_get_folio(struct address_space *mapping, pgoff_t index,
if (!folio)
return ERR_PTR(-ENOENT);
+ /* not an uncached lookup, clear uncached if set */
+ if (folio_test_uncached(folio) && !(fgp_flags & FGP_UNCACHED))
+ folio_clear_uncached(folio);
return folio;
}
EXPORT_SYMBOL(__filemap_get_folio);
--
2.45.2
* [PATCH 13/17] ext4: add RWF_UNCACHED write support
From: Jens Axboe @ 2024-11-14 15:25 UTC
To: linux-mm, linux-fsdevel
Cc: hannes, clm, linux-kernel, willy, kirill, linux-btrfs,
linux-ext4, linux-xfs, bfoster, Jens Axboe
IOCB_UNCACHED IO needs to prune writeback regions on IO completion,
and hence needs the worker punt that ext4 also does for unwritten
extents. Add an io_end flag to manage that.
If foliop is set to foliop_uncached in ext4_write_begin(), then set
FGP_UNCACHED so that __filemap_get_folio() will mark newly created
folios as uncached. That in turn will make writeback completion drop
these ranges from the page cache.
Now that ext4 supports both uncached reads and writes, add the fop_flag
FOP_UNCACHED to enable it.
Signed-off-by: Jens Axboe <axboe@kernel.dk>
---
fs/ext4/ext4.h | 1 +
fs/ext4/file.c | 2 +-
fs/ext4/inline.c | 7 ++++++-
fs/ext4/inode.c | 18 ++++++++++++++++--
fs/ext4/page-io.c | 28 ++++++++++++++++------------
5 files changed, 40 insertions(+), 16 deletions(-)
diff --git a/fs/ext4/ext4.h b/fs/ext4/ext4.h
index 44b0d418143c..60dc9ffae076 100644
--- a/fs/ext4/ext4.h
+++ b/fs/ext4/ext4.h
@@ -279,6 +279,7 @@ struct ext4_system_blocks {
* Flags for ext4_io_end->flags
*/
#define EXT4_IO_END_UNWRITTEN 0x0001
+#define EXT4_IO_UNCACHED 0x0002
struct ext4_io_end_vec {
struct list_head list; /* list of io_end_vec */
diff --git a/fs/ext4/file.c b/fs/ext4/file.c
index f14aed14b9cf..0ef39d738598 100644
--- a/fs/ext4/file.c
+++ b/fs/ext4/file.c
@@ -944,7 +944,7 @@ const struct file_operations ext4_file_operations = {
.splice_write = iter_file_splice_write,
.fallocate = ext4_fallocate,
.fop_flags = FOP_MMAP_SYNC | FOP_BUFFER_RASYNC |
- FOP_DIO_PARALLEL_WRITE,
+ FOP_DIO_PARALLEL_WRITE | FOP_UNCACHED,
};
const struct inode_operations ext4_file_inode_operations = {
diff --git a/fs/ext4/inline.c b/fs/ext4/inline.c
index 3536ca7e4fcc..500bfb6d4860 100644
--- a/fs/ext4/inline.c
+++ b/fs/ext4/inline.c
@@ -667,6 +667,7 @@ int ext4_try_to_write_inline_data(struct address_space *mapping,
handle_t *handle;
struct folio *folio;
struct ext4_iloc iloc;
+ fgf_t fgp_flags;
if (pos + len > ext4_get_max_inline_size(inode))
goto convert;
@@ -702,7 +703,11 @@ int ext4_try_to_write_inline_data(struct address_space *mapping,
if (ret)
goto out;
- folio = __filemap_get_folio(mapping, 0, FGP_WRITEBEGIN | FGP_NOFS,
+ fgp_flags = FGP_WRITEBEGIN | FGP_NOFS;
+ if (foliop_is_uncached(foliop))
+ fgp_flags |= FGP_UNCACHED;
+
+ folio = __filemap_get_folio(mapping, 0, fgp_flags,
mapping_gfp_mask(mapping));
if (IS_ERR(folio)) {
ret = PTR_ERR(folio);
diff --git a/fs/ext4/inode.c b/fs/ext4/inode.c
index 54bdd4884fe6..9b815137fb2c 100644
--- a/fs/ext4/inode.c
+++ b/fs/ext4/inode.c
@@ -1138,6 +1138,7 @@ static int ext4_write_begin(struct file *file, struct address_space *mapping,
int ret, needed_blocks;
handle_t *handle;
int retries = 0;
+ fgf_t fgp_flags;
struct folio *folio;
pgoff_t index;
unsigned from, to;
@@ -1164,6 +1165,15 @@ static int ext4_write_begin(struct file *file, struct address_space *mapping,
return 0;
}
+ /*
+ * Set FGP_WRITEBEGIN, and FGP_UNCACHED if foliop is marked as
+ * uncached. That's how generic_perform_write() informs us that this
+ * is an uncached write.
+ */
+ fgp_flags = FGP_WRITEBEGIN;
+ if (foliop_is_uncached(foliop))
+ fgp_flags |= FGP_UNCACHED;
+
/*
* __filemap_get_folio() can take a long time if the
* system is thrashing due to memory pressure, or if the folio
@@ -1172,7 +1182,7 @@ static int ext4_write_begin(struct file *file, struct address_space *mapping,
* the folio (if needed) without using GFP_NOFS.
*/
retry_grab:
- folio = __filemap_get_folio(mapping, index, FGP_WRITEBEGIN,
+ folio = __filemap_get_folio(mapping, index, fgp_flags,
mapping_gfp_mask(mapping));
if (IS_ERR(folio))
return PTR_ERR(folio);
@@ -2903,6 +2913,7 @@ static int ext4_da_write_begin(struct file *file, struct address_space *mapping,
struct folio *folio;
pgoff_t index;
struct inode *inode = mapping->host;
+ fgf_t fgp_flags;
if (unlikely(ext4_forced_shutdown(inode->i_sb)))
return -EIO;
@@ -2926,8 +2937,11 @@ static int ext4_da_write_begin(struct file *file, struct address_space *mapping,
return 0;
}
+ fgp_flags = FGP_WRITEBEGIN;
+ if (foliop_is_uncached(foliop))
+ fgp_flags |= FGP_UNCACHED;
retry:
- folio = __filemap_get_folio(mapping, index, FGP_WRITEBEGIN,
+ folio = __filemap_get_folio(mapping, index, fgp_flags,
mapping_gfp_mask(mapping));
if (IS_ERR(folio))
return PTR_ERR(folio);
diff --git a/fs/ext4/page-io.c b/fs/ext4/page-io.c
index ad5543866d21..10447c3c4ff1 100644
--- a/fs/ext4/page-io.c
+++ b/fs/ext4/page-io.c
@@ -226,8 +226,6 @@ static void ext4_add_complete_io(ext4_io_end_t *io_end)
unsigned long flags;
/* Only reserved conversions from writeback should enter here */
- WARN_ON(!(io_end->flag & EXT4_IO_END_UNWRITTEN));
- WARN_ON(!io_end->handle && sbi->s_journal);
spin_lock_irqsave(&ei->i_completed_io_lock, flags);
wq = sbi->rsv_conversion_wq;
if (list_empty(&ei->i_rsv_conversion_list))
@@ -252,7 +250,7 @@ static int ext4_do_flush_completed_IO(struct inode *inode,
while (!list_empty(&unwritten)) {
io_end = list_entry(unwritten.next, ext4_io_end_t, list);
- BUG_ON(!(io_end->flag & EXT4_IO_END_UNWRITTEN));
+ BUG_ON(!(io_end->flag & (EXT4_IO_END_UNWRITTEN|EXT4_IO_UNCACHED)));
list_del_init(&io_end->list);
err = ext4_end_io_end(io_end);
@@ -287,14 +285,15 @@ ext4_io_end_t *ext4_init_io_end(struct inode *inode, gfp_t flags)
void ext4_put_io_end_defer(ext4_io_end_t *io_end)
{
- if (refcount_dec_and_test(&io_end->count)) {
- if (!(io_end->flag & EXT4_IO_END_UNWRITTEN) ||
- list_empty(&io_end->list_vec)) {
- ext4_release_io_end(io_end);
- return;
- }
- ext4_add_complete_io(io_end);
+ if (!refcount_dec_and_test(&io_end->count))
+ return;
+ if ((!(io_end->flag & EXT4_IO_END_UNWRITTEN) ||
+ list_empty(&io_end->list_vec)) &&
+ !(io_end->flag & EXT4_IO_UNCACHED)) {
+ ext4_release_io_end(io_end);
+ return;
}
+ ext4_add_complete_io(io_end);
}
int ext4_put_io_end(ext4_io_end_t *io_end)
@@ -348,7 +347,7 @@ static void ext4_end_bio(struct bio *bio)
blk_status_to_errno(bio->bi_status));
}
- if (io_end->flag & EXT4_IO_END_UNWRITTEN) {
+ if (io_end->flag & (EXT4_IO_END_UNWRITTEN|EXT4_IO_UNCACHED)) {
/*
* Link bio into list hanging from io_end. We have to do it
* atomically as bio completions can be racing against each
@@ -417,8 +416,13 @@ static void io_submit_add_bh(struct ext4_io_submit *io,
submit_and_retry:
ext4_io_submit(io);
}
- if (io->io_bio == NULL)
+ if (io->io_bio == NULL) {
io_submit_init_bio(io, bh);
+ if (folio_test_uncached(folio)) {
+ ext4_io_end_t *io_end = io->io_bio->bi_private;
+ io_end->flag |= EXT4_IO_UNCACHED;
+ }
+ }
if (!bio_add_folio(io->io_bio, io_folio, bh->b_size, bh_offset(bh)))
goto submit_and_retry;
wbc_account_cgroup_owner(io->io_wbc, &folio->page, bh->b_size);
--
2.45.2
* [PATCH 14/17] iomap: make buffered writes work with RWF_UNCACHED
From: Jens Axboe @ 2024-11-14 15:25 UTC
To: linux-mm, linux-fsdevel
Cc: hannes, clm, linux-kernel, willy, kirill, linux-btrfs,
linux-ext4, linux-xfs, bfoster, Jens Axboe
Add iomap buffered write support for RWF_UNCACHED. If RWF_UNCACHED is
set for a write, mark the folios being written as uncached. Then
writeback completion will drop the pages. The write_iter handler simply
kicks off writeback for the pages, and writeback completion will take
care of the rest.
This still needs the caller of the iomap buffered write helpers to
kick off writeback of the written ranges, which generic_write_sync()
now does for IOCB_UNCACHED writes.
Signed-off-by: Jens Axboe <axboe@kernel.dk>
---
fs/iomap/buffered-io.c | 15 +++++++++++++--
include/linux/iomap.h | 8 +++++++-
2 files changed, 20 insertions(+), 3 deletions(-)
diff --git a/fs/iomap/buffered-io.c b/fs/iomap/buffered-io.c
index ef0b68bccbb6..2f2a5db04a68 100644
--- a/fs/iomap/buffered-io.c
+++ b/fs/iomap/buffered-io.c
@@ -603,6 +603,8 @@ struct folio *iomap_get_folio(struct iomap_iter *iter, loff_t pos, size_t len)
if (iter->flags & IOMAP_NOWAIT)
fgp |= FGP_NOWAIT;
+ if (iter->flags & IOMAP_UNCACHED)
+ fgp |= FGP_UNCACHED;
fgp |= fgf_set_order(len);
return __filemap_get_folio(iter->inode->i_mapping, pos >> PAGE_SHIFT,
@@ -1023,8 +1025,9 @@ ssize_t
iomap_file_buffered_write(struct kiocb *iocb, struct iov_iter *i,
const struct iomap_ops *ops, void *private)
{
+ struct address_space *mapping = iocb->ki_filp->f_mapping;
struct iomap_iter iter = {
- .inode = iocb->ki_filp->f_mapping->host,
+ .inode = mapping->host,
.pos = iocb->ki_pos,
.len = iov_iter_count(i),
.flags = IOMAP_WRITE,
@@ -1034,9 +1037,14 @@ iomap_file_buffered_write(struct kiocb *iocb, struct iov_iter *i,
if (iocb->ki_flags & IOCB_NOWAIT)
iter.flags |= IOMAP_NOWAIT;
+ if (iocb->ki_flags & IOCB_UNCACHED)
+ iter.flags |= IOMAP_UNCACHED;
- while ((ret = iomap_iter(&iter, ops)) > 0)
+ while ((ret = iomap_iter(&iter, ops)) > 0) {
+ if (iocb->ki_flags & IOCB_UNCACHED)
+ iter.iomap.flags |= IOMAP_F_UNCACHED;
iter.processed = iomap_write_iter(&iter, i);
+ }
if (unlikely(iter.pos == iocb->ki_pos))
return ret;
@@ -1770,6 +1778,9 @@ static int iomap_add_to_ioend(struct iomap_writepage_ctx *wpc,
size_t poff = offset_in_folio(folio, pos);
int error;
+ if (folio_test_uncached(folio))
+ wpc->iomap.flags |= IOMAP_F_UNCACHED;
+
if (!wpc->ioend || !iomap_can_add_to_ioend(wpc, pos)) {
new_ioend:
error = iomap_submit_ioend(wpc, 0);
diff --git a/include/linux/iomap.h b/include/linux/iomap.h
index f61407e3b121..0a88043676f2 100644
--- a/include/linux/iomap.h
+++ b/include/linux/iomap.h
@@ -74,9 +74,14 @@ struct vm_fault;
* IOMAP_F_STALE indicates that the iomap is not valid any longer and the file
* range it covers needs to be remapped by the high level before the operation
* can proceed.
+ *
+ * IOMAP_F_UNCACHED is set to indicate that writes to the page cache (and
+ * hence writeback) will result in folios being evicted as soon as the
+ * updated bytes are written back to the storage.
*/
#define IOMAP_F_SIZE_CHANGED (1U << 8)
#define IOMAP_F_STALE (1U << 9)
+#define IOMAP_F_UNCACHED (1U << 10)
/*
* Flags from 0x1000 up are for file system specific usage:
@@ -173,8 +178,9 @@ struct iomap_folio_ops {
#define IOMAP_NOWAIT (1 << 5) /* do not block */
#define IOMAP_OVERWRITE_ONLY (1 << 6) /* only pure overwrites allowed */
#define IOMAP_UNSHARE (1 << 7) /* unshare_file_range */
+#define IOMAP_UNCACHED (1 << 8) /* uncached IO */
#ifdef CONFIG_FS_DAX
-#define IOMAP_DAX (1 << 8) /* DAX mapping */
+#define IOMAP_DAX (1 << 9) /* DAX mapping */
#else
#define IOMAP_DAX 0
#endif /* CONFIG_FS_DAX */
--
2.45.2
^ permalink raw reply [flat|nested] 27+ messages in thread
* [PATCH 15/17] xfs: punt uncached write completions to the completion wq
2024-11-14 15:25 [PATCHSET v5 0/17] Uncached buffered IO Jens Axboe
` (13 preceding siblings ...)
2024-11-14 15:25 ` [PATCH 14/17] iomap: make buffered writes work with RWF_UNCACHED Jens Axboe
@ 2024-11-14 15:25 ` Jens Axboe
2024-11-14 15:25 ` [PATCH 16/17] xfs: flag as supporting FOP_UNCACHED Jens Axboe
` (2 subsequent siblings)
17 siblings, 0 replies; 27+ messages in thread
From: Jens Axboe @ 2024-11-14 15:25 UTC (permalink / raw)
To: linux-mm, linux-fsdevel
Cc: hannes, clm, linux-kernel, willy, kirill, linux-btrfs,
linux-ext4, linux-xfs, bfoster, Jens Axboe
Uncached write completions need guaranteed non-irq context to be able
to prune ranges from the page cache. Treat them like unwritten extents
and punt them to the completion workqueue.
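The punt follows the usual shape for pushing bio completion out of irq
context; roughly, as a sketch with all myfs_* names purely illustrative
(the actual xfs change below just reuses xfs_end_bio):

	struct myfs_ioend {
		struct work_struct io_work;
		/* ... file system private completion state ... */
	};

	static void myfs_end_io_work(struct work_struct *work)
	{
		struct myfs_ioend *ioend =
			container_of(work, struct myfs_ioend, io_work);

		/* process context: safe to take locks and prune the cache */
		myfs_finish_ioend(ioend);
	}

	static void myfs_end_bio(struct bio *bio)
	{
		struct myfs_ioend *ioend = bio->bi_private;

		INIT_WORK(&ioend->io_work, myfs_end_io_work);
		queue_work(myfs_completion_wq, &ioend->io_work);
	}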
Signed-off-by: Jens Axboe <axboe@kernel.dk>
---
fs/xfs/xfs_aops.c | 7 +++++--
1 file changed, 5 insertions(+), 2 deletions(-)
diff --git a/fs/xfs/xfs_aops.c b/fs/xfs/xfs_aops.c
index 559a3a577097..c86fc2b8f344 100644
--- a/fs/xfs/xfs_aops.c
+++ b/fs/xfs/xfs_aops.c
@@ -416,9 +416,12 @@ xfs_prepare_ioend(
memalloc_nofs_restore(nofs_flag);
- /* send ioends that might require a transaction to the completion wq */
+ /*
+ * Send ioends that might require a transaction or need blocking
+ * context to the completion wq
+ */
if (xfs_ioend_is_append(ioend) || ioend->io_type == IOMAP_UNWRITTEN ||
- (ioend->io_flags & IOMAP_F_SHARED))
+ (ioend->io_flags & (IOMAP_F_SHARED|IOMAP_F_UNCACHED)))
ioend->io_bio.bi_end_io = xfs_end_bio;
return status;
}
--
2.45.2
^ permalink raw reply [flat|nested] 27+ messages in thread
* [PATCH 16/17] xfs: flag as supporting FOP_UNCACHED
2024-11-14 15:25 [PATCHSET v5 0/17] Uncached buffered IO Jens Axboe
` (14 preceding siblings ...)
2024-11-14 15:25 ` [PATCH 15/17] xfs: punt uncached write completions to the completion wq Jens Axboe
@ 2024-11-14 15:25 ` Jens Axboe
2024-11-14 15:25 ` [PATCH 17/17] btrfs: add support for uncached writes Jens Axboe
2024-11-15 4:01 ` [PATCHSET v5 0/17] Uncached buffered IO Julian Sun
17 siblings, 0 replies; 27+ messages in thread
From: Jens Axboe @ 2024-11-14 15:25 UTC (permalink / raw)
To: linux-mm, linux-fsdevel
Cc: hannes, clm, linux-kernel, willy, kirill, linux-btrfs,
linux-ext4, linux-xfs, bfoster, Jens Axboe
The read side was already fully supported, and with write completions
appropriately punted to the completion workqueue, all that's needed now
is setting FOP_UNCACHED in the file_operations structure to enable full
support for uncached reads and writes.
Signed-off-by: Jens Axboe <axboe@kernel.dk>
---
fs/xfs/xfs_file.c | 3 ++-
1 file changed, 2 insertions(+), 1 deletion(-)
diff --git a/fs/xfs/xfs_file.c b/fs/xfs/xfs_file.c
index b19916b11fd5..4fe593896bc5 100644
--- a/fs/xfs/xfs_file.c
+++ b/fs/xfs/xfs_file.c
@@ -1595,7 +1595,8 @@ const struct file_operations xfs_file_operations = {
.fadvise = xfs_file_fadvise,
.remap_file_range = xfs_file_remap_range,
.fop_flags = FOP_MMAP_SYNC | FOP_BUFFER_RASYNC |
- FOP_BUFFER_WASYNC | FOP_DIO_PARALLEL_WRITE,
+ FOP_BUFFER_WASYNC | FOP_DIO_PARALLEL_WRITE |
+ FOP_UNCACHED,
};
const struct file_operations xfs_dir_file_operations = {
--
2.45.2
^ permalink raw reply [flat|nested] 27+ messages in thread
* [PATCH 17/17] btrfs: add support for uncached writes
2024-11-14 15:25 [PATCHSET v5 0/17] Uncached buffered IO Jens Axboe
` (15 preceding siblings ...)
2024-11-14 15:25 ` [PATCH 16/17] xfs: flag as supporting FOP_UNCACHED Jens Axboe
@ 2024-11-14 15:25 ` Jens Axboe
2024-11-15 4:01 ` [PATCHSET v5 0/17] Uncached buffered IO Julian Sun
17 siblings, 0 replies; 27+ messages in thread
From: Jens Axboe @ 2024-11-14 15:25 UTC (permalink / raw)
To: linux-mm, linux-fsdevel
Cc: hannes, clm, linux-kernel, willy, kirill, linux-btrfs,
linux-ext4, linux-xfs, bfoster, Jens Axboe
The read side is already covered as btrfs uses the generic filemap
helpers. For writes, just pass in FGP_UNCACHED if uncached IO is being
done, so that the folios created are marked appropriately.
For IO completion, ensure that writeback of uncached folios is punted
to one of the btrfs workers, as task context is needed for that. Add an
'uncached_io' member to struct btrfs_bio to manage that.
With that, add FOP_UNCACHED to the btrfs file_operations fop_flags
structure, enabling use of RWF_UNCACHED.
Signed-off-by: Jens Axboe <axboe@kernel.dk>
---
fs/btrfs/bio.c | 4 ++--
fs/btrfs/bio.h | 2 ++
fs/btrfs/extent_io.c | 8 +++++++-
fs/btrfs/file.c | 9 ++++++---
4 files changed, 17 insertions(+), 6 deletions(-)
diff --git a/fs/btrfs/bio.c b/fs/btrfs/bio.c
index 7e0f9600b80c..253e1a656934 100644
--- a/fs/btrfs/bio.c
+++ b/fs/btrfs/bio.c
@@ -334,7 +334,7 @@ static void btrfs_end_bio_work(struct work_struct *work)
struct btrfs_bio *bbio = container_of(work, struct btrfs_bio, end_io_work);
/* Metadata reads are checked and repaired by the submitter. */
- if (is_data_bbio(bbio))
+ if (bio_op(&bbio->bio) == REQ_OP_READ && is_data_bbio(bbio))
btrfs_check_read_bio(bbio, bbio->bio.bi_private);
else
btrfs_bio_end_io(bbio, bbio->bio.bi_status);
@@ -351,7 +351,7 @@ static void btrfs_simple_end_io(struct bio *bio)
if (bio->bi_status)
btrfs_log_dev_io_error(bio, dev);
- if (bio_op(bio) == REQ_OP_READ) {
+ if (bio_op(bio) == REQ_OP_READ || bbio->uncached_io) {
INIT_WORK(&bbio->end_io_work, btrfs_end_bio_work);
queue_work(btrfs_end_io_wq(fs_info, bio), &bbio->end_io_work);
} else {
diff --git a/fs/btrfs/bio.h b/fs/btrfs/bio.h
index e2fe16074ad6..39b98326c98f 100644
--- a/fs/btrfs/bio.h
+++ b/fs/btrfs/bio.h
@@ -82,6 +82,8 @@ struct btrfs_bio {
/* Save the first error status of split bio. */
blk_status_t status;
+ bool uncached_io;
+
/*
* This member must come last, bio_alloc_bioset will allocate enough
* bytes for entire btrfs_bio but relies on bio being last.
diff --git a/fs/btrfs/extent_io.c b/fs/btrfs/extent_io.c
index 872cca54cc6c..b97b21178ed7 100644
--- a/fs/btrfs/extent_io.c
+++ b/fs/btrfs/extent_io.c
@@ -760,8 +760,11 @@ static void submit_extent_folio(struct btrfs_bio_ctrl *bio_ctrl,
ASSERT(bio_ctrl->end_io_func);
if (bio_ctrl->bbio &&
- !btrfs_bio_is_contig(bio_ctrl, folio, disk_bytenr, pg_offset))
+ !btrfs_bio_is_contig(bio_ctrl, folio, disk_bytenr, pg_offset)) {
+ if (folio_test_uncached(folio))
+ bio_ctrl->bbio->uncached_io = true;
submit_one_bio(bio_ctrl);
+ }
do {
u32 len = size;
@@ -779,6 +782,9 @@ static void submit_extent_folio(struct btrfs_bio_ctrl *bio_ctrl,
len = bio_ctrl->len_to_oe_boundary;
}
+ if (folio_test_uncached(folio))
+ bio_ctrl->bbio->uncached_io = true;
+
if (!bio_add_folio(&bio_ctrl->bbio->bio, folio, len, pg_offset)) {
/* bio full: move on to a new one */
submit_one_bio(bio_ctrl);
diff --git a/fs/btrfs/file.c b/fs/btrfs/file.c
index 4fb521d91b06..cfee783f4c4d 100644
--- a/fs/btrfs/file.c
+++ b/fs/btrfs/file.c
@@ -919,7 +919,7 @@ static gfp_t get_prepare_gfp_flags(struct inode *inode, bool nowait)
static noinline int prepare_pages(struct inode *inode, struct page **pages,
size_t num_pages, loff_t pos,
size_t write_bytes, bool force_uptodate,
- bool nowait)
+ bool nowait, bool uncached)
{
int i;
unsigned long index = pos >> PAGE_SHIFT;
@@ -928,6 +928,8 @@ static noinline int prepare_pages(struct inode *inode, struct page **pages,
int ret = 0;
int faili;
+ if (uncached)
+ fgp_flags |= FGP_UNCACHED;
for (i = 0; i < num_pages; i++) {
again:
pages[i] = pagecache_get_page(inode->i_mapping, index + i,
@@ -1323,7 +1325,8 @@ ssize_t btrfs_buffered_write(struct kiocb *iocb, struct iov_iter *i)
* contents of pages from loop to loop
*/
ret = prepare_pages(inode, pages, num_pages,
- pos, write_bytes, force_page_uptodate, false);
+ pos, write_bytes, force_page_uptodate,
+ false, iocb->ki_flags & IOCB_UNCACHED);
if (ret) {
btrfs_delalloc_release_extents(BTRFS_I(inode),
reserve_bytes);
@@ -3802,7 +3805,7 @@ const struct file_operations btrfs_file_operations = {
.compat_ioctl = btrfs_compat_ioctl,
#endif
.remap_file_range = btrfs_remap_file_range,
- .fop_flags = FOP_BUFFER_RASYNC | FOP_BUFFER_WASYNC,
+ .fop_flags = FOP_BUFFER_RASYNC | FOP_BUFFER_WASYNC | FOP_UNCACHED,
};
int btrfs_fdatawrite_range(struct btrfs_inode *inode, loff_t start, loff_t end)
--
2.45.2
^ permalink raw reply [flat|nested] 27+ messages in thread
* Re: [PATCHSET v5 0/17] Uncached buffered IO
2024-11-14 15:25 [PATCHSET v5 0/17] Uncached buffered IO Jens Axboe
` (16 preceding siblings ...)
2024-11-14 15:25 ` [PATCH 17/17] btrfs: add support for uncached writes Jens Axboe
@ 2024-11-15 4:01 ` Julian Sun
2024-11-15 15:06 ` Jens Axboe
17 siblings, 1 reply; 27+ messages in thread
From: Julian Sun @ 2024-11-15 4:01 UTC (permalink / raw)
To: Jens Axboe, linux-mm, linux-fsdevel
Cc: hannes, clm, linux-kernel, willy, kirill, linux-btrfs,
linux-ext4, linux-xfs, bfoster
On Thu, 2024-11-14 at 08:25 -0700, Jens Axboe wrote:
> [quoted cover letter snipped]
Hi,
The simplicity and performance improvement of this patch series are
really impressive, and I have no comments on it.
I'm just curious about its use cases—under which scenarios should it be
used, and under which scenarios should it be avoided? I noticed that the
backing device you used for testing can provide at least 92GB/s read
performance and 115GB/s write performance. Does this mean that the higher
the performance of the backing device, the more noticeable the
optimization? How does this patch series perform on low-speed devices?
My understanding is that the performance issue this patch is trying to
address originates from the page cache being filled up, causing the current
IO to wait for write-back or reclamation, correct? From this perspective,
it seems that this would be suitable for applications that issue a large
amount of IO in a short period of time, and it might not be dependent on
the speed of the backing device?
Thanks,
--
Julian Sun <sunjunchao2870@gmail.com>
^ permalink raw reply [flat|nested] 27+ messages in thread
* Re: [PATCH 08/17] mm/filemap: add read support for RWF_UNCACHED
2024-11-14 15:25 ` [PATCH 08/17] mm/filemap: add read support for RWF_UNCACHED Jens Axboe
@ 2024-11-15 8:49 ` Kirill A. Shutemov
2024-11-15 15:01 ` Jens Axboe
0 siblings, 1 reply; 27+ messages in thread
From: Kirill A. Shutemov @ 2024-11-15 8:49 UTC (permalink / raw)
To: Jens Axboe
Cc: linux-mm, linux-fsdevel, hannes, clm, linux-kernel, willy,
linux-btrfs, linux-ext4, linux-xfs, bfoster
On Thu, Nov 14, 2024 at 08:25:12AM -0700, Jens Axboe wrote:
> @@ -2595,6 +2601,20 @@ static inline bool pos_same_folio(loff_t pos1, loff_t pos2, struct folio *folio)
> return (pos1 >> shift == pos2 >> shift);
> }
>
> +static void filemap_uncached_read(struct address_space *mapping,
> + struct folio *folio)
> +{
> + if (!folio_test_uncached(folio))
> + return;
> + if (folio_test_writeback(folio))
> + return;
Do we want to drop out here if the folio is dirty, but not yet under
writeback?
It is checked inside folio_unmap_invalidate(), but we will lose
PG_uncached if we get there.
> + if (folio_test_clear_uncached(folio)) {
> + folio_lock(folio);
> + folio_unmap_invalidate(mapping, folio, 0);
> + folio_unlock(folio);
> + }
> +}
> +
> /**
> * filemap_read - Read data from the page cache.
> * @iocb: The iocb to read.
--
Kiryl Shutsemau / Kirill A. Shutemov
^ permalink raw reply [flat|nested] 27+ messages in thread
* Re: [PATCH 08/17] mm/filemap: add read support for RWF_UNCACHED
2024-11-15 8:49 ` Kirill A. Shutemov
@ 2024-11-15 15:01 ` Jens Axboe
0 siblings, 0 replies; 27+ messages in thread
From: Jens Axboe @ 2024-11-15 15:01 UTC (permalink / raw)
To: Kirill A. Shutemov
Cc: linux-mm, linux-fsdevel, hannes, clm, linux-kernel, willy,
linux-btrfs, linux-ext4, linux-xfs, bfoster
On 11/15/24 1:49 AM, Kirill A. Shutemov wrote:
> On Thu, Nov 14, 2024 at 08:25:12AM -0700, Jens Axboe wrote:
>> @@ -2595,6 +2601,20 @@ static inline bool pos_same_folio(loff_t pos1, loff_t pos2, struct folio *folio)
>> return (pos1 >> shift == pos2 >> shift);
>> }
>>
>> +static void filemap_uncached_read(struct address_space *mapping,
>> + struct folio *folio)
>> +{
>> + if (!folio_test_uncached(folio))
>> + return;
>> + if (folio_test_writeback(folio))
>> + return;
>
> Do we want to drop out here if the folio is dirty, but not yet under
> writeback?
>
> It is checked inside folio_unmap_invalidate(), but we will lose
> PG_uncached if we get there.
True, seems prudent to skip if it's dirty as well, if only to avoid
losing uncached for that particular case. I'll add the tweak, thanks.
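Roughly this, as a sketch of the tweak (not the final committed
version):

	static void filemap_uncached_read(struct address_space *mapping,
					  struct folio *folio)
	{
		if (!folio_test_uncached(folio))
			return;
		/* leave dirty or in-writeback folios alone, keeping
		 * PG_uncached so writeback completion can prune them */
		if (folio_test_writeback(folio) || folio_test_dirty(folio))
			return;
		if (folio_test_clear_uncached(folio)) {
			folio_lock(folio);
			folio_unmap_invalidate(mapping, folio, 0);
			folio_unlock(folio);
		}
	}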
--
Jens Axboe
^ permalink raw reply [flat|nested] 27+ messages in thread
* Re: [PATCHSET v5 0/17] Uncached buffered IO
2024-11-15 4:01 ` [PATCHSET v5 0/17] Uncached buffered IO Julian Sun
@ 2024-11-15 15:06 ` Jens Axboe
0 siblings, 0 replies; 27+ messages in thread
From: Jens Axboe @ 2024-11-15 15:06 UTC (permalink / raw)
To: Julian Sun, linux-mm, linux-fsdevel
Cc: hannes, clm, linux-kernel, willy, kirill, linux-btrfs,
linux-ext4, linux-xfs, bfoster
On 11/14/24 9:01 PM, Julian Sun wrote:
> On Thu, 2024-11-14 at 08:25 -0700, Jens Axboe wrote:
>> [quoted cover letter snipped]
>
> Hi,
>
> The simplicity and performance improvement of this patch series are
> really impressive, and I have no comments on it.
>
> I'm just curious about its use cases—under which scenarios should it be
> used, and under which scenarios should it be avoided? I noticed that the
> backing device you used for testing can provide at least 92GB/s read
> performance and 115GB/s write performance. Does this mean that the higher
> the performance of the backing device, the more noticeable the
> optimization? How does this patch series perform on low-speed devices?
It's really more about the ratio of device speed to size of RAM. Yes,
the box I tested on has a lot of drives, but it also has a lot of
memory. Hence the ratio of device speed to memory size isn't that
different from a normal desktop box with e.g. 32G of memory and a flash
drive that does 6GB/sec. Obviously reclaim for that smaller box will
not be as bad as on the big one, but still.
It's really two-fold:
- You want to kick off writeback sooner rather than later. On devices
these days, it's pretty pointless to let a lot of dirty data build up
before starting to clean it. Uncached writeback starts when the copy
is done, rather than many seconds later when some writeback thread
decides the pressure is either too high, or it's been dirty too long.
- Don't leave things in cache that aren't going to get reused, only to
get pruned later at the point where you need more memory for the cache
anyway.
> My understanding is that the performance issue this patch is trying to
> address originates from the page cache being filled up, causing the current
> IO to wait for write-back or reclamation, correct? From this perspective,
> it seems that this would be suitable for applications that issue a large
> amount of IO in a short period of time, and it might not be dependent on
> the speed of the backing device?
On the read side, if you're not going to be reusing the data you read,
uncached is appropriate. Ditto on the write side, if you're just
flushing out a bunch of data with limited reuse, may as well prune the
cache regions as soon as the write is done, rather than let some kind of
background activity do that when memory becomes scarce.
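For completeness, from userspace it's just a flag on the usual vectored
read/write syscalls. A hedged sketch (RWF_UNCACHED comes from the
patched uapi headers; the fallback define below is an assumption
matching this series' uapi addition):

	#define _GNU_SOURCE
	#include <sys/uio.h>
	#include <fcntl.h>
	#include <stdio.h>
	#include <unistd.h>

	#ifndef RWF_UNCACHED
	#define RWF_UNCACHED	0x00000080 /* assumed value from this series */
	#endif

	int main(int argc, char *argv[])
	{
		char buf[65536];
		struct iovec iov = { .iov_base = buf, .iov_len = sizeof(buf) };
		ssize_t ret;
		int fd;

		fd = open(argc > 1 ? argv[1] : "testfile", O_RDONLY);
		if (fd < 0)
			return 1;
		/* buffered read; touched ranges get pruned after the copy */
		ret = preadv2(fd, &iov, 1, 0, RWF_UNCACHED);
		if (ret < 0)
			perror("preadv2");
		close(fd);
		return 0;
	}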
--
Jens Axboe
^ permalink raw reply [flat|nested] 27+ messages in thread
* Re: [PATCH 10/17] mm/filemap: make buffered writes work with RWF_UNCACHED
2024-11-14 15:25 ` [PATCH 10/17] mm/filemap: make buffered writes work with RWF_UNCACHED Jens Axboe
@ 2024-11-18 8:42 ` Baokun Li
2024-11-18 14:49 ` Jens Axboe
0 siblings, 1 reply; 27+ messages in thread
From: Baokun Li @ 2024-11-18 8:42 UTC (permalink / raw)
To: Jens Axboe
Cc: linux-fsdevel, linux-mm, hannes, clm, linux-kernel, willy,
kirill, linux-btrfs, linux-ext4, linux-xfs, bfoster, Yang Erkun
On 2024/11/14 23:25, Jens Axboe wrote:
> If RWF_UNCACHED is set for a write, mark new folios being written with
> uncached. This is done by passing in the fact that it's an uncached write
> through the folio pointer. We can only get there when IOCB_UNCACHED was
> allowed, which can only happen if the file system opts in. Opting in means
> they need to check for the LSB in the folio pointer to know if it's an
> uncached write or not. If it is, then FGP_UNCACHED should be used if
> creating new folios is necessary.
>
> Uncached writes will drop any folios they create upon writeback
> completion, but leave folios that may exist in that range alone. Since
> ->write_begin() doesn't currently take any flags, and to avoid needing
> to change the callback kernel wide, use the foliop being passed in to
> ->write_begin() to signal if this is an uncached write or not. File
> systems can then use that to mark newly created folios as uncached.
>
> This provides similar benefits to using RWF_UNCACHED with reads. Testing
> buffered writes on 32 files:
>
> writing bs 65536, uncached 0
> 1s: 196035MB/sec
> 2s: 132308MB/sec
> 3s: 132438MB/sec
> 4s: 116528MB/sec
> 5s: 103898MB/sec
> 6s: 108893MB/sec
> 7s: 99678MB/sec
> 8s: 106545MB/sec
> 9s: 106826MB/sec
> 10s: 101544MB/sec
> 11s: 111044MB/sec
> 12s: 124257MB/sec
> 13s: 116031MB/sec
> 14s: 114540MB/sec
> 15s: 115011MB/sec
> 16s: 115260MB/sec
> 17s: 116068MB/sec
> 18s: 116096MB/sec
>
> where it's quite obvious where the page cache filled, and performance
> dropped from to about half of where it started, settling in at around
> 115GB/sec. Meanwhile, 32 kswapds were running full steam trying to
> reclaim pages.
>
> Running the same test with uncached buffered writes:
>
> writing bs 65536, uncached 1
> 1s: 198974MB/sec
> 2s: 189618MB/sec
> 3s: 193601MB/sec
> 4s: 188582MB/sec
> 5s: 193487MB/sec
> 6s: 188341MB/sec
> 7s: 194325MB/sec
> 8s: 188114MB/sec
> 9s: 192740MB/sec
> 10s: 189206MB/sec
> 11s: 193442MB/sec
> 12s: 189659MB/sec
> 13s: 191732MB/sec
> 14s: 190701MB/sec
> 15s: 191789MB/sec
> 16s: 191259MB/sec
> 17s: 190613MB/sec
> 18s: 191951MB/sec
>
> and the behavior is fully predictable, performing the same throughout
> even after the page cache would otherwise have fully filled with dirty
> data. It's also about 65% faster, and using half the CPU of the system
> compared to the normal buffered write.
>
> Signed-off-by: Jens Axboe <axboe@kernel.dk>
> ---
> include/linux/fs.h | 5 +++++
> include/linux/pagemap.h | 9 +++++++++
> mm/filemap.c | 12 +++++++++++-
> 3 files changed, 25 insertions(+), 1 deletion(-)
>
> diff --git a/include/linux/fs.h b/include/linux/fs.h
> index 45510d0b8de0..122ae821989f 100644
> --- a/include/linux/fs.h
> +++ b/include/linux/fs.h
> @@ -2877,6 +2877,11 @@ static inline ssize_t generic_write_sync(struct kiocb *iocb, ssize_t count)
> (iocb->ki_flags & IOCB_SYNC) ? 0 : 1);
> if (ret)
> return ret;
> + } else if (iocb->ki_flags & IOCB_UNCACHED) {
> + struct address_space *mapping = iocb->ki_filp->f_mapping;
> +
> + filemap_fdatawrite_range_kick(mapping, iocb->ki_pos,
> + iocb->ki_pos + count);
> }
>
Hi Jens,
The filemap_fdatawrite_range_kick() helper function is not added until
the next patch, so you should swap the order of patch 10 and patch 11.
Regards,
Baokun
> return count;
> diff --git a/include/linux/pagemap.h b/include/linux/pagemap.h
> index d55bf995bd9e..cc02518d338d 100644
> --- a/include/linux/pagemap.h
> +++ b/include/linux/pagemap.h
> @@ -14,6 +14,7 @@
> #include <linux/gfp.h>
> #include <linux/bitops.h>
> #include <linux/hardirq.h> /* for in_interrupt() */
> +#include <linux/writeback.h>
> #include <linux/hugetlb_inline.h>
>
> struct folio_batch;
> @@ -70,6 +71,14 @@ static inline int filemap_write_and_wait(struct address_space *mapping)
> return filemap_write_and_wait_range(mapping, 0, LLONG_MAX);
> }
>
> +/*
> + * Value passed in to ->write_begin() if IOCB_UNCACHED is set for the write,
> + * and the ->write_begin() handler on a file system supporting FOP_UNCACHED
> + * must check for this and pass FGP_UNCACHED for folio creation.
> + */
> +#define foliop_uncached ((struct folio *) 0xfee1c001)
> +#define foliop_is_uncached(foliop) (*(foliop) == foliop_uncached)
> +
> /**
> * filemap_set_wb_err - set a writeback error on an address_space
> * @mapping: mapping in which to set writeback error
> diff --git a/mm/filemap.c b/mm/filemap.c
> index 13815194ed8a..297cb53332ff 100644
> --- a/mm/filemap.c
> +++ b/mm/filemap.c
> @@ -4076,7 +4076,7 @@ ssize_t generic_perform_write(struct kiocb *iocb, struct iov_iter *i)
> ssize_t written = 0;
>
> do {
> - struct folio *folio;
> + struct folio *folio = NULL;
> size_t offset; /* Offset into folio */
> size_t bytes; /* Bytes to write to folio */
> size_t copied; /* Bytes copied from user */
> @@ -4104,6 +4104,16 @@ ssize_t generic_perform_write(struct kiocb *iocb, struct iov_iter *i)
> break;
> }
>
> + /*
> +	 * If IOCB_UNCACHED is set here, we know the file system
> +	 * supports it. And hence it'll know to check foliop for being
> + * set to this magic value. If so, it's an uncached write.
> + * Whenever ->write_begin() changes prototypes again, this
> + * can go away and just pass iocb or iocb flags.
> + */
> + if (iocb->ki_flags & IOCB_UNCACHED)
> + folio = foliop_uncached;
> +
> status = a_ops->write_begin(file, mapping, pos, bytes,
> &folio, &fsdata);
> if (unlikely(status < 0))
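As an aside on the foliop_uncached handshake quoted above: the file
system side would look roughly like this sketch, with the myfs naming
purely illustrative (only foliop_is_uncached() and FGP_UNCACHED come
from the series):

	static int myfs_write_begin(struct file *file,
				    struct address_space *mapping,
				    loff_t pos, unsigned len,
				    struct folio **foliop, void **fsdata)
	{
		fgf_t fgp = FGP_WRITEBEGIN | fgf_set_order(len);
		struct folio *folio;

		/* opt-in: magic foliop value means an uncached write */
		if (foliop_is_uncached(foliop))
			fgp |= FGP_UNCACHED;
		folio = __filemap_get_folio(mapping, pos >> PAGE_SHIFT, fgp,
					    mapping_gfp_mask(mapping));
		if (IS_ERR(folio))
			return PTR_ERR(folio);
		*foliop = folio;
		return 0;
	}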
^ permalink raw reply [flat|nested] 27+ messages in thread
* Re: [PATCH 10/17] mm/filemap: make buffered writes work with RWF_UNCACHED
2024-11-18 8:42 ` Baokun Li
@ 2024-11-18 14:49 ` Jens Axboe
0 siblings, 0 replies; 27+ messages in thread
From: Jens Axboe @ 2024-11-18 14:49 UTC (permalink / raw)
To: Baokun Li
Cc: linux-fsdevel, linux-mm, hannes, clm, linux-kernel, willy,
kirill, linux-btrfs, linux-ext4, linux-xfs, bfoster, Yang Erkun
On 11/18/24 1:42 AM, Baokun Li wrote:
>> diff --git a/include/linux/fs.h b/include/linux/fs.h
>> index 45510d0b8de0..122ae821989f 100644
>> --- a/include/linux/fs.h
>> +++ b/include/linux/fs.h
>> @@ -2877,6 +2877,11 @@ static inline ssize_t generic_write_sync(struct kiocb *iocb, ssize_t count)
>> (iocb->ki_flags & IOCB_SYNC) ? 0 : 1);
>> if (ret)
>> return ret;
>> + } else if (iocb->ki_flags & IOCB_UNCACHED) {
>> + struct address_space *mapping = iocb->ki_filp->f_mapping;
>> +
>> + filemap_fdatawrite_range_kick(mapping, iocb->ki_pos,
>> + iocb->ki_pos + count);
>> }
>>
>
> Hi Jens,
>
> The filemap_fdatawrite_range_kick() helper function is not added until
> the next patch, so you should swap the order of patch 10 and patch 11.
Ah thanks, not sure how I missed that. I'll swap them for the next
posting, and also do a basic bisection test just to ensure I didn't do
more of those...
--
Jens Axboe
^ permalink raw reply [flat|nested] 27+ messages in thread
* Re: [PATCH 14/17] iomap: make buffered writes work with RWF_UNCACHED
2024-11-14 15:25 ` [PATCH 14/17] iomap: make buffered writes work with RWF_UNCACHED Jens Axboe
@ 2024-12-12 5:50 ` Christoph Hellwig
2024-12-12 6:26 ` Darrick J. Wong
0 siblings, 1 reply; 27+ messages in thread
From: Christoph Hellwig @ 2024-12-12 5:50 UTC (permalink / raw)
To: Jens Axboe
Cc: linux-mm, linux-fsdevel, hannes, clm, linux-kernel, willy,
kirill, linux-btrfs, linux-ext4, linux-xfs, bfoster
On Thu, Nov 14, 2024 at 08:25:18AM -0700, Jens Axboe wrote:
> + if (iocb->ki_flags & IOCB_UNCACHED)
> + iter.flags |= IOMAP_UNCACHED;
>
> - while ((ret = iomap_iter(&iter, ops)) > 0)
> + while ((ret = iomap_iter(&iter, ops)) > 0) {
> + if (iocb->ki_flags & IOCB_UNCACHED)
> + iter.iomap.flags |= IOMAP_F_UNCACHED;
iomap.flags and the IOMAP_F_* namespace is used to communicate flags
from the file system to the iomap core, so this looks wrong.
> size_t poff = offset_in_folio(folio, pos);
> int error;
>
> + if (folio_test_uncached(folio))
> + wpc->iomap.flags |= IOMAP_F_UNCACHED;
I guess this is what actually makes it work. Note that with the iomap
zoned series I posted yesterday things change a bit here in that the
flags in the wpc are decoupled from the iomap flags, and this would now
become a wpc-only flag as it isn't really a fs to iomap communication,
but based on iomap / page cache state.
^ permalink raw reply [flat|nested] 27+ messages in thread
* Re: [PATCH 14/17] iomap: make buffered writes work with RWF_UNCACHED
2024-12-12 5:50 ` Christoph Hellwig
@ 2024-12-12 6:26 ` Darrick J. Wong
2024-12-12 6:31 ` Christoph Hellwig
0 siblings, 1 reply; 27+ messages in thread
From: Darrick J. Wong @ 2024-12-12 6:26 UTC (permalink / raw)
To: Christoph Hellwig
Cc: Jens Axboe, linux-mm, linux-fsdevel, hannes, clm, linux-kernel,
willy, kirill, linux-btrfs, linux-ext4, linux-xfs, bfoster
On Wed, Dec 11, 2024 at 09:50:19PM -0800, Christoph Hellwig wrote:
> On Thu, Nov 14, 2024 at 08:25:18AM -0700, Jens Axboe wrote:
> > + if (iocb->ki_flags & IOCB_UNCACHED)
> > + iter.flags |= IOMAP_UNCACHED;
> >
> > - while ((ret = iomap_iter(&iter, ops)) > 0)
> > + while ((ret = iomap_iter(&iter, ops)) > 0) {
> > + if (iocb->ki_flags & IOCB_UNCACHED)
> > + iter.iomap.flags |= IOMAP_F_UNCACHED;
>
> iomap.flags and the IOMAP_F_* namespace is used to communicate flags
> from the file system to the iomap core, so this looks wrong.
Not entirely true -- IOMAP_F_SIZE_CHANGED is used to communicate state
from iomap to gfs2, and IOMAP_F_STALE is set/checked only by the iomap
core. iomap.h even says as much.
Though given that there's a 4-byte gap in struct iomap between flags and
the bdev pointer (at least on 64-bit) maybe we should make a separate
field for these iomap state bits?
> > size_t poff = offset_in_folio(folio, pos);
> > int error;
> >
> > + if (folio_test_uncached(folio))
> > + wpc->iomap.flags |= IOMAP_F_UNCACHED;
>
> I guess this is what actually makes it work. Note that with the iomap
> zoned series I posted yesteday things change a bit here in that the flags
> in the wpc are decouple from the iomap flags, and this would now become
> a wpc only flag as it isn't really a fs to iomap cummunication, but
> based on iomap / page cache state.
Hrmmm I'll go take a second look at that patch in the morning in case a
better idea comes along.
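Something like this, purely as a sketch of the split (the state field
and its contents are invented here, not a proposal):

	struct iomap {
		...
		u16			type;	/* type of mapping */
		u16			flags;	/* IOMAP_F_*: fs -> core */
		u16			state;	/* core-owned: stale,
						 * size-changed, uncached */
		struct block_device	*bdev;	/* fits in the existing gap */
		...
	};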
--D
^ permalink raw reply [flat|nested] 27+ messages in thread
* Re: [PATCH 14/17] iomap: make buffered writes work with RWF_UNCACHED
2024-12-12 6:26 ` Darrick J. Wong
@ 2024-12-12 6:31 ` Christoph Hellwig
0 siblings, 0 replies; 27+ messages in thread
From: Christoph Hellwig @ 2024-12-12 6:31 UTC (permalink / raw)
To: Darrick J. Wong
Cc: Christoph Hellwig, Jens Axboe, linux-mm, linux-fsdevel, hannes,
clm, linux-kernel, willy, kirill, linux-btrfs, linux-ext4,
linux-xfs, bfoster
On Wed, Dec 11, 2024 at 10:26:41PM -0800, Darrick J. Wong wrote:
> > iomap.flags and the IOMAP_F_* namespace is used to communicate flags
> > from the file system to the iomap core, so this looks wrong.
>
> Not entirely true -- IOMAP_F_SIZE_CHANGED is used to communicate state
> from iomap to gfs2, and IOMAP_F_STALE is set/checked only by the iomap
> core. iomap.h even says as much.
Indeed, some of the non-initial additions already broke this. And now
that you mentioned it I ran into that before because it was in the way
of some further constification I attempted in fs/iomap/.
> Though given that there's a 4-byte gap in struct iomap between flags and
> the bdev pointer (at least on 64-bit) maybe we should make a separate
> field for these iomap state bits?
Probably. Preferably in a way that isn't too painful for Jens, though.
^ permalink raw reply [flat|nested] 27+ messages in thread
end of thread, other threads:[~2024-12-12 6:31 UTC | newest]
Thread overview: 27+ messages
2024-11-14 15:25 [PATCHSET v5 0/17] Uncached buffered IO Jens Axboe
2024-11-14 15:25 ` [PATCH 01/17] mm/filemap: change filemap_create_folio() to take a struct kiocb Jens Axboe
2024-11-14 15:25 ` [PATCH 02/17] mm/readahead: add folio allocation helper Jens Axboe
2024-11-14 15:25 ` [PATCH 03/17] mm: add PG_uncached page flag Jens Axboe
2024-11-14 15:25 ` [PATCH 04/17] mm/readahead: add readahead_control->uncached member Jens Axboe
2024-11-14 15:25 ` [PATCH 05/17] mm/filemap: use page_cache_sync_ra() to kick off read-ahead Jens Axboe
2024-11-14 15:25 ` [PATCH 06/17] mm/truncate: add folio_unmap_invalidate() helper Jens Axboe
2024-11-14 15:25 ` [PATCH 07/17] fs: add RWF_UNCACHED iocb and FOP_UNCACHED file_operations flag Jens Axboe
2024-11-14 15:25 ` [PATCH 08/17] mm/filemap: add read support for RWF_UNCACHED Jens Axboe
2024-11-15 8:49 ` Kirill A. Shutemov
2024-11-15 15:01 ` Jens Axboe
2024-11-14 15:25 ` [PATCH 09/17] mm/filemap: drop uncached pages when writeback completes Jens Axboe
2024-11-14 15:25 ` [PATCH 10/17] mm/filemap: make buffered writes work with RWF_UNCACHED Jens Axboe
2024-11-18 8:42 ` Baokun Li
2024-11-18 14:49 ` Jens Axboe
2024-11-14 15:25 ` [PATCH 11/17] mm/filemap: add filemap_fdatawrite_range_kick() helper Jens Axboe
2024-11-14 15:25 ` [PATCH 12/17] mm: add FGP_UNCACHED folio creation flag Jens Axboe
2024-11-14 15:25 ` [PATCH 13/17] ext4: add RWF_UNCACHED write support Jens Axboe
2024-11-14 15:25 ` [PATCH 14/17] iomap: make buffered writes work with RWF_UNCACHED Jens Axboe
2024-12-12 5:50 ` Christoph Hellwig
2024-12-12 6:26 ` Darrick J. Wong
2024-12-12 6:31 ` Christoph Hellwig
2024-11-14 15:25 ` [PATCH 15/17] xfs: punt uncached write completions to the completion wq Jens Axboe
2024-11-14 15:25 ` [PATCH 16/17] xfs: flag as supporting FOP_UNCACHED Jens Axboe
2024-11-14 15:25 ` [PATCH 17/17] btrfs: add support for uncached writes Jens Axboe
2024-11-15 4:01 ` [PATCHSET v5 0/17] Uncached buffered IO Julian Sun
2024-11-15 15:06 ` Jens Axboe