* [PATCHSET v6 0/12] Uncached buffered IO
@ 2024-12-03 15:31 Jens Axboe
  2024-12-03 15:31 ` [PATCH 01/12] mm/filemap: change filemap_create_folio() to take a struct kiocb Jens Axboe
                   ` (14 more replies)
  0 siblings, 15 replies; 43+ messages in thread
From: Jens Axboe @ 2024-12-03 15:31 UTC (permalink / raw)
  To: linux-mm, linux-fsdevel; +Cc: hannes, clm, linux-kernel, willy, kirill, bfoster

Hi,

5 years ago I posted patches adding support for RWF_UNCACHED, as a way
to do buffered IO that isn't page cache persistent. The approach back
then was to have private pages for IO, and then get rid of them once IO
was done. But that then runs into all the issues that O_DIRECT has, in
terms of synchronizing with the page cache.

So here's a new approach to the same concept, but using the page cache
as synchronization. That makes RWF_UNCACHED less special, in that it's
just page cache IO, except it prunes the ranges once IO is completed.

Why do this, you may ask? The tldr is that device speeds are only
getting faster, while reclaim is not. Doing normal buffered IO can be
very unpredictable, and suck up a lot of resources on the reclaim side.
This leads people to use O_DIRECT as a work-around, which has its own
set of restrictions in terms of size, offset, and length of IO. It's
also inherently synchronous, and now you need async IO as well. While
the latter isn't necessarily a big problem as we have good options
available there, it also should not be a requirement when all you want
to do is read or write some data without caching.

Even on desktop type systems, a normal NVMe device can fill the entire
page cache in seconds. On the big system I used for testing, there's a
lot more RAM, but also a lot more devices. As can be seen in some of the
results in the following patches, you can still fill RAM in seconds even
when there's 1TB of it. Hence this problem isn't solely a "big
hyperscaler system" issue, it's common across the board.

Common for both reads and writes with RWF_UNCACHED is that they use the
page cache for IO. Reads work just like a normal buffered read would,
with the only exception being that the touched ranges will get pruned
after data has been copied. For writes, the ranges will get writeback
kicked off before the syscall returns, and then writeback completion
will prune the range. Hence writes aren't synchronous, and it's easy to
pipeline writes using RWF_UNCACHED. Folios that aren't instantiated by
RWF_UNCACHED IO are left untouched. This means that uncached IO
will take advantage of the page cache for uptodate data, but not leave
anything it instantiated/created in cache.

File systems need to support this. The patches add support for the
generic filemap helpers, and for iomap. On top of that, ext4 and XFS are
marked as supporting it, and btrfs as well, lightly tested - for this
v6 posting, those file system patches live in the branch mentioned
below. The read side is already done by filemap, only the write side
needs a bit of help. The amount of code here is really trivial, and the
only reason the fs opt-in is necessary is to have an RWF_UNCACHED IO
return -EOPNOTSUPP just in case the fs doesn't use either the generic
paths or iomap. Adding "support" to other file systems should be
trivial, most of the time just a one-liner adding FOP_UNCACHED to the
fop_flags in the file_operations struct.
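
As a reference point, here's a minimal sketch of what that opt-in looks
like for a made-up "examplefs" that just uses the generic file read and
write iterators (the names are purely illustrative):

static const struct file_operations examplefs_file_operations = {
	.read_iter	= generic_file_read_iter,
	.write_iter	= generic_file_write_iter,
	.mmap		= generic_file_mmap,
	/* opt in to RWF_UNCACHED buffered reads and writes */
	.fop_flags	= FOP_UNCACHED,
};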

Performance results are in patch 8 for reads and patch 10 for writes,
with the tldr being that I see about a 65% improvement in performance
for both, with fully predictable IO times. CPU reduction is substantial
as well, with no kswapd activity at all for reclaim when using uncached
IO.

Using it from applications is trivial - just set RWF_UNCACHED for the
read or write, using pwritev2(2) or preadv2(2). For io_uring, same
thing, just set RWF_UNCACHED in sqe->rw_flags for a buffered read/write
operation. And that's it.
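
For illustration, here's a rough userspace sketch. The RWF_UNCACHED
define is a fallback for uapi headers that don't have it yet (using the
value from this series), and the helper name is made up:

#define _GNU_SOURCE
#include <sys/uio.h>
#include <sys/types.h>
#include <errno.h>

#ifndef RWF_UNCACHED
#define RWF_UNCACHED	0x00000080	/* value added in this series */
#endif

/*
 * Buffered write that doesn't persist in the page cache: data still goes
 * through the cache, but folios instantiated by this write are dropped
 * once writeback completes. If the file system doesn't advertise
 * FOP_UNCACHED, the kernel fails the RWF_UNCACHED attempt with
 * EOPNOTSUPP and we fall back to a normal buffered write.
 */
static ssize_t write_uncached(int fd, const void *buf, size_t len, off_t off)
{
	struct iovec iov = { .iov_base = (void *) buf, .iov_len = len };
	ssize_t ret;

	ret = pwritev2(fd, &iov, 1, off, RWF_UNCACHED);
	if (ret < 0 && errno == EOPNOTSUPP)
		ret = pwritev2(fd, &iov, 1, off, 0);
	return ret;
}

With liburing it's the same idea: prep the buffered read/write as usual,
then set sqe->rw_flags = RWF_UNCACHED before submitting.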

Patches 1..7 are just prep patches, and should have no functional
changes at all. Patch 8 adds support for the filemap path for
RWF_UNCACHED reads, patch 11 adds support for filemap RWF_UNCACHED
writes. In the branch mentioned below, there are then patches to adopt
uncached reads and writes for ext4, xfs, and btrfs.

Passes full xfstests and fsx overnight runs, with no issues observed.
That includes the VM running the tests itself using RWF_UNCACHED on the
host.
I'll post fsstress and fsx patches for RWF_UNCACHED separately. As far
as I'm concerned, no further work needs doing here.

And git tree for the patches is here:

https://git.kernel.dk/cgit/linux/log/?h=buffered-uncached.8

 include/linux/fs.h             |  21 +++++-
 include/linux/page-flags.h     |   5 ++
 include/linux/pagemap.h        |  14 ++++
 include/trace/events/mmflags.h |   3 +-
 include/uapi/linux/fs.h        |   6 +-
 mm/filemap.c                   | 114 +++++++++++++++++++++++++++++----
 mm/readahead.c                 |  22 +++++--
 mm/swap.c                      |   2 +
 mm/truncate.c                  |  35 ++++++----
 9 files changed, 187 insertions(+), 35 deletions(-)

Since v5
- Skip invalidation in filemap_uncached_read() if the folio is dirty
  as well, retaining the uncached setting for later cleaning to do
  the actual invalidation.
- Use the same trylock approach in read invalidation as the writeback
  invalidation does.
- Swap order of patches 10 and 11 to fix a bisection issue.
- Split core mm changes and fs series patches. Once the generic side
  has been approved, I'll send out the fs series separately.
- Rebase on 6.13-rc1

-- 
Jens Axboe




* [PATCH 01/12] mm/filemap: change filemap_create_folio() to take a struct kiocb
  2024-12-03 15:31 [PATCHSET v6 0/12] Uncached buffered IO Jens Axboe
@ 2024-12-03 15:31 ` Jens Axboe
  2024-12-10 11:13   ` Christoph Hellwig
  2024-12-03 15:31 ` [PATCH 02/12] mm/readahead: add folio allocation helper Jens Axboe
                   ` (13 subsequent siblings)
  14 siblings, 1 reply; 43+ messages in thread
From: Jens Axboe @ 2024-12-03 15:31 UTC (permalink / raw)
  To: linux-mm, linux-fsdevel
  Cc: hannes, clm, linux-kernel, willy, kirill, bfoster, Jens Axboe

Rather than pass in both the file and position directly from the kiocb,
just take a struct kiocb instead, in preparation for actually needing
the kiocb in the function. While doing so, move the ki_flags checking
into filemap_create_folio() as well.

No functional changes in this patch.

Signed-off-by: Jens Axboe <axboe@kernel.dk>
---
 mm/filemap.c | 17 +++++++++--------
 1 file changed, 9 insertions(+), 8 deletions(-)

diff --git a/mm/filemap.c b/mm/filemap.c
index 7c76a123ba18..898e992039e8 100644
--- a/mm/filemap.c
+++ b/mm/filemap.c
@@ -2459,15 +2459,17 @@ static int filemap_update_page(struct kiocb *iocb,
 	return error;
 }
 
-static int filemap_create_folio(struct file *file,
-		struct address_space *mapping, loff_t pos,
-		struct folio_batch *fbatch)
+static int filemap_create_folio(struct kiocb *iocb,
+		struct address_space *mapping, struct folio_batch *fbatch)
 {
 	struct folio *folio;
 	int error;
 	unsigned int min_order = mapping_min_folio_order(mapping);
 	pgoff_t index;
 
+	if (iocb->ki_flags & (IOCB_NOWAIT | IOCB_WAITQ))
+		return -EAGAIN;
+
 	folio = filemap_alloc_folio(mapping_gfp_mask(mapping), min_order);
 	if (!folio)
 		return -ENOMEM;
@@ -2486,7 +2488,7 @@ static int filemap_create_folio(struct file *file,
 	 * well to keep locking rules simple.
 	 */
 	filemap_invalidate_lock_shared(mapping);
-	index = (pos >> (PAGE_SHIFT + min_order)) << min_order;
+	index = (iocb->ki_pos >> (PAGE_SHIFT + min_order)) << min_order;
 	error = filemap_add_folio(mapping, folio, index,
 			mapping_gfp_constraint(mapping, GFP_KERNEL));
 	if (error == -EEXIST)
@@ -2494,7 +2496,8 @@ static int filemap_create_folio(struct file *file,
 	if (error)
 		goto error;
 
-	error = filemap_read_folio(file, mapping->a_ops->read_folio, folio);
+	error = filemap_read_folio(iocb->ki_filp, mapping->a_ops->read_folio,
+					folio);
 	if (error)
 		goto error;
 
@@ -2550,9 +2553,7 @@ static int filemap_get_pages(struct kiocb *iocb, size_t count,
 		filemap_get_read_batch(mapping, index, last_index - 1, fbatch);
 	}
 	if (!folio_batch_count(fbatch)) {
-		if (iocb->ki_flags & (IOCB_NOWAIT | IOCB_WAITQ))
-			return -EAGAIN;
-		err = filemap_create_folio(filp, mapping, iocb->ki_pos, fbatch);
+		err = filemap_create_folio(iocb, mapping, fbatch);
 		if (err == AOP_TRUNCATED_PAGE)
 			goto retry;
 		return err;
-- 
2.45.2




* [PATCH 02/12] mm/readahead: add folio allocation helper
  2024-12-03 15:31 [PATCHSET v6 0/12] Uncached buffered IO Jens Axboe
  2024-12-03 15:31 ` [PATCH 01/12] mm/filemap: change filemap_create_folio() to take a struct kiocb Jens Axboe
@ 2024-12-03 15:31 ` Jens Axboe
  2024-12-03 15:31 ` [PATCH 03/12] mm: add PG_uncached page flag Jens Axboe
                   ` (12 subsequent siblings)
  14 siblings, 0 replies; 43+ messages in thread
From: Jens Axboe @ 2024-12-03 15:31 UTC (permalink / raw)
  To: linux-mm, linux-fsdevel
  Cc: hannes, clm, linux-kernel, willy, kirill, bfoster, Jens Axboe

Just a wrapper around filemap_alloc_folio() for now, but add it in
preparation for modifying the folio based on the 'ractl' being passed
in.

No functional changes in this patch.

Signed-off-by: Jens Axboe <axboe@kernel.dk>
---
 mm/readahead.c | 16 +++++++++++-----
 1 file changed, 11 insertions(+), 5 deletions(-)

diff --git a/mm/readahead.c b/mm/readahead.c
index 8f1cf599b572..8424bf1b67e2 100644
--- a/mm/readahead.c
+++ b/mm/readahead.c
@@ -188,6 +188,12 @@ static void read_pages(struct readahead_control *rac)
 	BUG_ON(readahead_count(rac));
 }
 
+static struct folio *ractl_alloc_folio(struct readahead_control *ractl,
+				       gfp_t gfp_mask, unsigned int order)
+{
+	return filemap_alloc_folio(gfp_mask, order);
+}
+
 /**
  * page_cache_ra_unbounded - Start unchecked readahead.
  * @ractl: Readahead control.
@@ -265,8 +271,8 @@ void page_cache_ra_unbounded(struct readahead_control *ractl,
 			continue;
 		}
 
-		folio = filemap_alloc_folio(gfp_mask,
-					    mapping_min_folio_order(mapping));
+		folio = ractl_alloc_folio(ractl, gfp_mask,
+					mapping_min_folio_order(mapping));
 		if (!folio)
 			break;
 
@@ -436,7 +442,7 @@ static inline int ra_alloc_folio(struct readahead_control *ractl, pgoff_t index,
 		pgoff_t mark, unsigned int order, gfp_t gfp)
 {
 	int err;
-	struct folio *folio = filemap_alloc_folio(gfp, order);
+	struct folio *folio = ractl_alloc_folio(ractl, gfp, order);
 
 	if (!folio)
 		return -ENOMEM;
@@ -751,7 +757,7 @@ void readahead_expand(struct readahead_control *ractl,
 		if (folio && !xa_is_value(folio))
 			return; /* Folio apparently present */
 
-		folio = filemap_alloc_folio(gfp_mask, min_order);
+		folio = ractl_alloc_folio(ractl, gfp_mask, min_order);
 		if (!folio)
 			return;
 
@@ -780,7 +786,7 @@ void readahead_expand(struct readahead_control *ractl,
 		if (folio && !xa_is_value(folio))
 			return; /* Folio apparently present */
 
-		folio = filemap_alloc_folio(gfp_mask, min_order);
+		folio = ractl_alloc_folio(ractl, gfp_mask, min_order);
 		if (!folio)
 			return;
 
-- 
2.45.2




* [PATCH 03/12] mm: add PG_uncached page flag
  2024-12-03 15:31 [PATCHSET v6 0/12] Uncached buffered IO Jens Axboe
  2024-12-03 15:31 ` [PATCH 01/12] mm/filemap: change filemap_create_folio() to take a struct kiocb Jens Axboe
  2024-12-03 15:31 ` [PATCH 02/12] mm/readahead: add folio allocation helper Jens Axboe
@ 2024-12-03 15:31 ` Jens Axboe
  2024-12-03 15:31 ` [PATCH 04/12] mm/readahead: add readahead_control->uncached member Jens Axboe
                   ` (11 subsequent siblings)
  14 siblings, 0 replies; 43+ messages in thread
From: Jens Axboe @ 2024-12-03 15:31 UTC (permalink / raw)
  To: linux-mm, linux-fsdevel
  Cc: hannes, clm, linux-kernel, willy, kirill, bfoster, Jens Axboe

Add a page flag that file IO can use to indicate that the IO being done
is uncached, meaning it should not persist in the page cache after the
IO has completed.

Signed-off-by: Jens Axboe <axboe@kernel.dk>
---
 include/linux/page-flags.h     | 5 +++++
 include/trace/events/mmflags.h | 3 ++-
 2 files changed, 7 insertions(+), 1 deletion(-)

diff --git a/include/linux/page-flags.h b/include/linux/page-flags.h
index 2220bfec278e..14346fa2470f 100644
--- a/include/linux/page-flags.h
+++ b/include/linux/page-flags.h
@@ -110,6 +110,7 @@ enum pageflags {
 	PG_reclaim,		/* To be reclaimed asap */
 	PG_swapbacked,		/* Page is backed by RAM/swap */
 	PG_unevictable,		/* Page is "unevictable"  */
+	PG_uncached,		/* uncached read/write IO */
 #ifdef CONFIG_MMU
 	PG_mlocked,		/* Page is vma mlocked */
 #endif
@@ -562,6 +563,10 @@ PAGEFLAG(Reclaim, reclaim, PF_NO_TAIL)
 FOLIO_FLAG(readahead, FOLIO_HEAD_PAGE)
 	FOLIO_TEST_CLEAR_FLAG(readahead, FOLIO_HEAD_PAGE)
 
+FOLIO_FLAG(uncached, FOLIO_HEAD_PAGE)
+	FOLIO_TEST_CLEAR_FLAG(uncached, FOLIO_HEAD_PAGE)
+	__FOLIO_SET_FLAG(uncached, FOLIO_HEAD_PAGE)
+
 #ifdef CONFIG_HIGHMEM
 /*
  * Must use a macro here due to header dependency issues. page_zone() is not
diff --git a/include/trace/events/mmflags.h b/include/trace/events/mmflags.h
index bb8a59c6caa2..b60057284102 100644
--- a/include/trace/events/mmflags.h
+++ b/include/trace/events/mmflags.h
@@ -116,7 +116,8 @@
 	DEF_PAGEFLAG_NAME(head),					\
 	DEF_PAGEFLAG_NAME(reclaim),					\
 	DEF_PAGEFLAG_NAME(swapbacked),					\
-	DEF_PAGEFLAG_NAME(unevictable)					\
+	DEF_PAGEFLAG_NAME(unevictable),					\
+	DEF_PAGEFLAG_NAME(uncached)					\
 IF_HAVE_PG_MLOCK(mlocked)						\
 IF_HAVE_PG_HWPOISON(hwpoison)						\
 IF_HAVE_PG_IDLE(idle)							\
-- 
2.45.2




* [PATCH 04/12] mm/readahead: add readahead_control->uncached member
  2024-12-03 15:31 [PATCHSET v6 0/12] Uncached buffered IO Jens Axboe
                   ` (2 preceding siblings ...)
  2024-12-03 15:31 ` [PATCH 03/12] mm: add PG_uncached page flag Jens Axboe
@ 2024-12-03 15:31 ` Jens Axboe
  2024-12-03 15:31 ` [PATCH 05/12] mm/filemap: use page_cache_sync_ra() to kick off read-ahead Jens Axboe
                   ` (10 subsequent siblings)
  14 siblings, 0 replies; 43+ messages in thread
From: Jens Axboe @ 2024-12-03 15:31 UTC (permalink / raw)
  To: linux-mm, linux-fsdevel
  Cc: hannes, clm, linux-kernel, willy, kirill, bfoster, Jens Axboe

If ractl->uncached is set to true, then folios created are marked as
uncached as well.

Signed-off-by: Jens Axboe <axboe@kernel.dk>
---
 include/linux/pagemap.h | 1 +
 mm/readahead.c          | 8 +++++++-
 2 files changed, 8 insertions(+), 1 deletion(-)

diff --git a/include/linux/pagemap.h b/include/linux/pagemap.h
index bcf0865a38ae..72b03b37c265 100644
--- a/include/linux/pagemap.h
+++ b/include/linux/pagemap.h
@@ -1353,6 +1353,7 @@ struct readahead_control {
 	pgoff_t _index;
 	unsigned int _nr_pages;
 	unsigned int _batch_count;
+	bool uncached;
 	bool _workingset;
 	unsigned long _pflags;
 };
diff --git a/mm/readahead.c b/mm/readahead.c
index 8424bf1b67e2..33a2d0feae14 100644
--- a/mm/readahead.c
+++ b/mm/readahead.c
@@ -191,7 +191,13 @@ static void read_pages(struct readahead_control *rac)
 static struct folio *ractl_alloc_folio(struct readahead_control *ractl,
 				       gfp_t gfp_mask, unsigned int order)
 {
-	return filemap_alloc_folio(gfp_mask, order);
+	struct folio *folio;
+
+	folio = filemap_alloc_folio(gfp_mask, order);
+	if (folio && ractl->uncached)
+		__folio_set_uncached(folio);
+
+	return folio;
 }
 
 /**
-- 
2.45.2




* [PATCH 05/12] mm/filemap: use page_cache_sync_ra() to kick off read-ahead
  2024-12-03 15:31 [PATCHSET v6 0/12] Uncached buffered IO Jens Axboe
                   ` (3 preceding siblings ...)
  2024-12-03 15:31 ` [PATCH 04/12] mm/readahead: add readahead_control->uncached member Jens Axboe
@ 2024-12-03 15:31 ` Jens Axboe
  2024-12-10 11:15   ` Christoph Hellwig
  2024-12-03 15:31 ` [PATCH 06/12] mm/truncate: add folio_unmap_invalidate() helper Jens Axboe
                   ` (9 subsequent siblings)
  14 siblings, 1 reply; 43+ messages in thread
From: Jens Axboe @ 2024-12-03 15:31 UTC (permalink / raw)
  To: linux-mm, linux-fsdevel
  Cc: hannes, clm, linux-kernel, willy, kirill, bfoster, Jens Axboe

Rather than use the page_cache_sync_readahead() helper, define our own
ractl and use page_cache_sync_ra() directly, in preparation for needing
to modify the ractl inside filemap_get_pages().

No functional changes in this patch.

Signed-off-by: Jens Axboe <axboe@kernel.dk>
---
 mm/filemap.c | 6 +++---
 1 file changed, 3 insertions(+), 3 deletions(-)

diff --git a/mm/filemap.c b/mm/filemap.c
index 898e992039e8..dd3042de8038 100644
--- a/mm/filemap.c
+++ b/mm/filemap.c
@@ -2527,7 +2527,6 @@ static int filemap_get_pages(struct kiocb *iocb, size_t count,
 {
 	struct file *filp = iocb->ki_filp;
 	struct address_space *mapping = filp->f_mapping;
-	struct file_ra_state *ra = &filp->f_ra;
 	pgoff_t index = iocb->ki_pos >> PAGE_SHIFT;
 	pgoff_t last_index;
 	struct folio *folio;
@@ -2542,12 +2541,13 @@ static int filemap_get_pages(struct kiocb *iocb, size_t count,
 
 	filemap_get_read_batch(mapping, index, last_index - 1, fbatch);
 	if (!folio_batch_count(fbatch)) {
+		DEFINE_READAHEAD(ractl, filp, &filp->f_ra, mapping, index);
+
 		if (iocb->ki_flags & IOCB_NOIO)
 			return -EAGAIN;
 		if (iocb->ki_flags & IOCB_NOWAIT)
 			flags = memalloc_noio_save();
-		page_cache_sync_readahead(mapping, ra, filp, index,
-				last_index - index);
+		page_cache_sync_ra(&ractl, last_index - index);
 		if (iocb->ki_flags & IOCB_NOWAIT)
 			memalloc_noio_restore(flags);
 		filemap_get_read_batch(mapping, index, last_index - 1, fbatch);
-- 
2.45.2




* [PATCH 06/12] mm/truncate: add folio_unmap_invalidate() helper
  2024-12-03 15:31 [PATCHSET v6 0/12] Uncached buffered IO Jens Axboe
                   ` (4 preceding siblings ...)
  2024-12-03 15:31 ` [PATCH 05/12] mm/filemap: use page_cache_sync_ra() to kick off read-ahead Jens Axboe
@ 2024-12-03 15:31 ` Jens Axboe
  2024-12-10 11:21   ` Christoph Hellwig
  2024-12-03 15:31 ` [PATCH 07/12] fs: add RWF_UNCACHED iocb and FOP_UNCACHED file_operations flag Jens Axboe
                   ` (8 subsequent siblings)
  14 siblings, 1 reply; 43+ messages in thread
From: Jens Axboe @ 2024-12-03 15:31 UTC (permalink / raw)
  To: linux-mm, linux-fsdevel
  Cc: hannes, clm, linux-kernel, willy, kirill, bfoster, Jens Axboe

Add a folio_unmap_invalidate() helper, which unmaps and invalidates a
given folio. The caller must already have locked the folio. Use this
new helper in invalidate_inode_pages2_range(), rather than duplicate
the code there.

In preparation for using this elsewhere as well, have it take a gfp_t
mask rather than assume GFP_KERNEL is the right choice. This bubbles
back to invalidate_complete_folio2() as well.

Signed-off-by: Jens Axboe <axboe@kernel.dk>
---
 include/linux/pagemap.h |  2 ++
 mm/truncate.c           | 35 ++++++++++++++++++++++-------------
 2 files changed, 24 insertions(+), 13 deletions(-)

diff --git a/include/linux/pagemap.h b/include/linux/pagemap.h
index 72b03b37c265..f2d49dccb7c1 100644
--- a/include/linux/pagemap.h
+++ b/include/linux/pagemap.h
@@ -34,6 +34,8 @@ int kiocb_invalidate_pages(struct kiocb *iocb, size_t count);
 void kiocb_invalidate_post_direct_write(struct kiocb *iocb, size_t count);
 int filemap_invalidate_pages(struct address_space *mapping,
 			     loff_t pos, loff_t end, bool nowait);
+int folio_unmap_invalidate(struct address_space *mapping, struct folio *folio,
+			   gfp_t gfp);
 
 int write_inode_now(struct inode *, int sync);
 int filemap_fdatawrite(struct address_space *);
diff --git a/mm/truncate.c b/mm/truncate.c
index 7c304d2f0052..c1dfddb1122a 100644
--- a/mm/truncate.c
+++ b/mm/truncate.c
@@ -533,12 +533,12 @@ EXPORT_SYMBOL(invalidate_mapping_pages);
  * sitting in the folio_add_lru() caches.
  */
 static int invalidate_complete_folio2(struct address_space *mapping,
-					struct folio *folio)
+				      struct folio *folio, gfp_t gfp_mask)
 {
 	if (folio->mapping != mapping)
 		return 0;
 
-	if (!filemap_release_folio(folio, GFP_KERNEL))
+	if (!filemap_release_folio(folio, gfp_mask))
 		return 0;
 
 	spin_lock(&mapping->host->i_lock);
@@ -570,6 +570,25 @@ static int folio_launder(struct address_space *mapping, struct folio *folio)
 	return mapping->a_ops->launder_folio(folio);
 }
 
+int folio_unmap_invalidate(struct address_space *mapping, struct folio *folio,
+			   gfp_t gfp)
+{
+	int ret;
+
+	VM_BUG_ON_FOLIO(!folio_test_locked(folio), folio);
+
+	if (folio_test_dirty(folio))
+		return 0;
+	if (folio_mapped(folio))
+		unmap_mapping_folio(folio);
+	BUG_ON(folio_mapped(folio));
+
+	ret = folio_launder(mapping, folio);
+	if (!ret && !invalidate_complete_folio2(mapping, folio, gfp))
+		return -EBUSY;
+	return ret;
+}
+
 /**
  * invalidate_inode_pages2_range - remove range of pages from an address_space
  * @mapping: the address_space
@@ -629,18 +648,8 @@ int invalidate_inode_pages2_range(struct address_space *mapping,
 				folio_unlock(folio);
 				continue;
 			}
-			VM_BUG_ON_FOLIO(!folio_contains(folio, indices[i]), folio);
 			folio_wait_writeback(folio);
-
-			if (folio_mapped(folio))
-				unmap_mapping_folio(folio);
-			BUG_ON(folio_mapped(folio));
-
-			ret2 = folio_launder(mapping, folio);
-			if (ret2 == 0) {
-				if (!invalidate_complete_folio2(mapping, folio))
-					ret2 = -EBUSY;
-			}
+			ret2 = folio_unmap_invalidate(mapping, folio, GFP_KERNEL);
 			if (ret2 < 0)
 				ret = ret2;
 			folio_unlock(folio);
-- 
2.45.2




* [PATCH 07/12] fs: add RWF_UNCACHED iocb and FOP_UNCACHED file_operations flag
  2024-12-03 15:31 [PATCHSET v6 0/12] Uncached buffered IO Jens Axboe
                   ` (5 preceding siblings ...)
  2024-12-03 15:31 ` [PATCH 06/12] mm/truncate: add folio_unmap_invalidate() helper Jens Axboe
@ 2024-12-03 15:31 ` Jens Axboe
  2024-12-06 17:35   ` Darrick J. Wong
  2024-12-10 11:22   ` Christoph Hellwig
  2024-12-03 15:31 ` [PATCH 08/12] mm/filemap: add read support for RWF_UNCACHED Jens Axboe
                   ` (7 subsequent siblings)
  14 siblings, 2 replies; 43+ messages in thread
From: Jens Axboe @ 2024-12-03 15:31 UTC (permalink / raw)
  To: linux-mm, linux-fsdevel
  Cc: hannes, clm, linux-kernel, willy, kirill, bfoster, Jens Axboe

If a file system supports uncached buffered IO, it may set FOP_UNCACHED
to enable RWF_UNCACHED. If RWF_UNCACHED is attempted on a file system
that doesn't support it, the IO will fail with -EOPNOTSUPP.

Signed-off-by: Jens Axboe <axboe@kernel.dk>
---
 include/linux/fs.h      | 14 +++++++++++++-
 include/uapi/linux/fs.h |  6 +++++-
 2 files changed, 18 insertions(+), 2 deletions(-)

diff --git a/include/linux/fs.h b/include/linux/fs.h
index 7e29433c5ecc..b64a78582f06 100644
--- a/include/linux/fs.h
+++ b/include/linux/fs.h
@@ -322,6 +322,7 @@ struct readahead_control;
 #define IOCB_NOWAIT		(__force int) RWF_NOWAIT
 #define IOCB_APPEND		(__force int) RWF_APPEND
 #define IOCB_ATOMIC		(__force int) RWF_ATOMIC
+#define IOCB_UNCACHED		(__force int) RWF_UNCACHED
 
 /* non-RWF related bits - start at 16 */
 #define IOCB_EVENTFD		(1 << 16)
@@ -356,7 +357,8 @@ struct readahead_control;
 	{ IOCB_SYNC,		"SYNC" }, \
 	{ IOCB_NOWAIT,		"NOWAIT" }, \
 	{ IOCB_APPEND,		"APPEND" }, \
-	{ IOCB_ATOMIC,		"ATOMIC"}, \
+	{ IOCB_ATOMIC,		"ATOMIC" }, \
+	{ IOCB_UNCACHED,	"UNCACHED" }, \
 	{ IOCB_EVENTFD,		"EVENTFD"}, \
 	{ IOCB_DIRECT,		"DIRECT" }, \
 	{ IOCB_WRITE,		"WRITE" }, \
@@ -2127,6 +2129,8 @@ struct file_operations {
 #define FOP_UNSIGNED_OFFSET	((__force fop_flags_t)(1 << 5))
 /* Supports asynchronous lock callbacks */
 #define FOP_ASYNC_LOCK		((__force fop_flags_t)(1 << 6))
+/* File system supports uncached read/write buffered IO */
+#define FOP_UNCACHED		((__force fop_flags_t)(1 << 7))
 
 /* Wrap a directory iterator that needs exclusive inode access */
 int wrap_directory_iterator(struct file *, struct dir_context *,
@@ -3614,6 +3618,14 @@ static inline int kiocb_set_rw_flags(struct kiocb *ki, rwf_t flags,
 		if (!(ki->ki_filp->f_mode & FMODE_CAN_ATOMIC_WRITE))
 			return -EOPNOTSUPP;
 	}
+	if (flags & RWF_UNCACHED) {
+		/* file system must support it */
+		if (!(ki->ki_filp->f_op->fop_flags & FOP_UNCACHED))
+			return -EOPNOTSUPP;
+		/* DAX mappings not supported */
+		if (IS_DAX(ki->ki_filp->f_mapping->host))
+			return -EOPNOTSUPP;
+	}
 	kiocb_flags |= (__force int) (flags & RWF_SUPPORTED);
 	if (flags & RWF_SYNC)
 		kiocb_flags |= IOCB_DSYNC;
diff --git a/include/uapi/linux/fs.h b/include/uapi/linux/fs.h
index 753971770733..dc77cd8ae1a3 100644
--- a/include/uapi/linux/fs.h
+++ b/include/uapi/linux/fs.h
@@ -332,9 +332,13 @@ typedef int __bitwise __kernel_rwf_t;
 /* Atomic Write */
 #define RWF_ATOMIC	((__force __kernel_rwf_t)0x00000040)
 
+/* buffered IO that drops the cache after reading or writing data */
+#define RWF_UNCACHED	((__force __kernel_rwf_t)0x00000080)
+
 /* mask of flags supported by the kernel */
 #define RWF_SUPPORTED	(RWF_HIPRI | RWF_DSYNC | RWF_SYNC | RWF_NOWAIT |\
-			 RWF_APPEND | RWF_NOAPPEND | RWF_ATOMIC)
+			 RWF_APPEND | RWF_NOAPPEND | RWF_ATOMIC |\
+			 RWF_UNCACHED)
 
 #define PROCFS_IOCTL_MAGIC 'f'
 
-- 
2.45.2




* [PATCH 08/12] mm/filemap: add read support for RWF_UNCACHED
  2024-12-03 15:31 [PATCHSET v6 0/12] Uncached buffered IO Jens Axboe
                   ` (6 preceding siblings ...)
  2024-12-03 15:31 ` [PATCH 07/12] fs: add RWF_UNCACHED iocb and FOP_UNCACHED file_operations flag Jens Axboe
@ 2024-12-03 15:31 ` Jens Axboe
  2024-12-03 15:31 ` [PATCH 09/12] mm/filemap: drop uncached pages when writeback completes Jens Axboe
                   ` (6 subsequent siblings)
  14 siblings, 0 replies; 43+ messages in thread
From: Jens Axboe @ 2024-12-03 15:31 UTC (permalink / raw)
  To: linux-mm, linux-fsdevel
  Cc: hannes, clm, linux-kernel, willy, kirill, bfoster, Jens Axboe

Add RWF_UNCACHED as a read operation flag, which means that any data
read will be removed from the page cache upon completion. Uses the page
cache to synchronize, and simply prunes folios that were instantiated
when the operation completes. While it would be possible to use private
pages for this, using the page cache as synchronization is handy for a
variety of reasons:

1) No special truncate magic is needed
2) Async buffered reads need some place to serialize, using the page
   cache is a lot easier than writing extra code for this
3) The pruning cost is pretty reasonable

and the code to support this is much simpler as a result.

You can think of uncached buffered IO as being the much more attractive
cousin of O_DIRECT - it has none of the restrictions of O_DIRECT. Yes,
it will copy the data, but unlike regular buffered IO, it doesn't run
into the unpredictability of the page cache in terms of reclaim. As an
example, on a test box with 32 drives, reading them with buffered IO
looks as follows:

Reading bs 65536, uncached 0
  1s: 145945MB/sec
  2s: 158067MB/sec
  3s: 157007MB/sec
  4s: 148622MB/sec
  5s: 118824MB/sec
  6s: 70494MB/sec
  7s: 41754MB/sec
  8s: 90811MB/sec
  9s: 92204MB/sec
 10s: 95178MB/sec
 11s: 95488MB/sec
 12s: 95552MB/sec
 13s: 96275MB/sec

where it's quite easy to see where the page cache filled up, and
performance went from good to erratic before finally settling at a much
lower rate. Looking at top while this is ongoing, we see:

 PID USER      PR  NI    VIRT    RES    SHR S  %CPU  %MEM     TIME+ COMMAND
7535 root      20   0  267004      0      0 S  3199   0.0   8:40.65 uncached
3326 root      20   0       0      0      0 R 100.0   0.0   0:16.40 kswapd4
3327 root      20   0       0      0      0 R 100.0   0.0   0:17.22 kswapd5
3328 root      20   0       0      0      0 R 100.0   0.0   0:13.29 kswapd6
3332 root      20   0       0      0      0 R 100.0   0.0   0:11.11 kswapd10
3339 root      20   0       0      0      0 R 100.0   0.0   0:16.25 kswapd17
3348 root      20   0       0      0      0 R 100.0   0.0   0:16.40 kswapd26
3343 root      20   0       0      0      0 R 100.0   0.0   0:16.30 kswapd21
3344 root      20   0       0      0      0 R 100.0   0.0   0:11.92 kswapd22
3349 root      20   0       0      0      0 R 100.0   0.0   0:16.28 kswapd27
3352 root      20   0       0      0      0 R  99.7   0.0   0:11.89 kswapd30
3353 root      20   0       0      0      0 R  96.7   0.0   0:16.04 kswapd31
3329 root      20   0       0      0      0 R  96.4   0.0   0:11.41 kswapd7
3345 root      20   0       0      0      0 R  96.4   0.0   0:13.40 kswapd23
3330 root      20   0       0      0      0 S  91.1   0.0   0:08.28 kswapd8
3350 root      20   0       0      0      0 S  86.8   0.0   0:11.13 kswapd28
3325 root      20   0       0      0      0 S  76.3   0.0   0:07.43 kswapd3
3341 root      20   0       0      0      0 S  74.7   0.0   0:08.85 kswapd19
3334 root      20   0       0      0      0 S  71.7   0.0   0:10.04 kswapd12
3351 root      20   0       0      0      0 R  60.5   0.0   0:09.59 kswapd29
3323 root      20   0       0      0      0 R  57.6   0.0   0:11.50 kswapd1
[...]

which is just showing a partial list of the 32 kswapd threads that are
running mostly full tilt, burning ~28 full CPU cores.

If the same test case is run with RWF_UNCACHED set for the buffered read,
the output looks as follows:

Reading bs 65536, uncached 1
  1s: 153144MB/sec
  2s: 156760MB/sec
  3s: 158110MB/sec
  4s: 158009MB/sec
  5s: 158043MB/sec
  6s: 157638MB/sec
  7s: 157999MB/sec
  8s: 158024MB/sec
  9s: 157764MB/sec
 10s: 157477MB/sec
 11s: 157417MB/sec
 12s: 157455MB/sec
 13s: 157233MB/sec
 14s: 156692MB/sec

which is just chugging along at ~155GB/sec of read performance. Looking
at top, we see:

 PID USER      PR  NI    VIRT    RES    SHR S  %CPU  %MEM     TIME+ COMMAND
7961 root      20   0  267004      0      0 S  3180   0.0   5:37.95 uncached
8024 axboe     20   0   14292   4096      0 R   1.0   0.0   0:00.13 top

where just the test app is using CPU, no reclaim is taking place outside
of the main thread. Not only is performance 65% better, it's also using
half the CPU to do it.

Signed-off-by: Jens Axboe <axboe@kernel.dk>
---
 mm/filemap.c | 28 ++++++++++++++++++++++++++--
 mm/swap.c    |  2 ++
 2 files changed, 28 insertions(+), 2 deletions(-)

diff --git a/mm/filemap.c b/mm/filemap.c
index dd3042de8038..139d1db79ff8 100644
--- a/mm/filemap.c
+++ b/mm/filemap.c
@@ -2473,6 +2473,8 @@ static int filemap_create_folio(struct kiocb *iocb,
 	folio = filemap_alloc_folio(mapping_gfp_mask(mapping), min_order);
 	if (!folio)
 		return -ENOMEM;
+	if (iocb->ki_flags & IOCB_UNCACHED)
+		__folio_set_uncached(folio);
 
 	/*
 	 * Protect against truncate / hole punch. Grabbing invalidate_lock
@@ -2518,6 +2520,8 @@ static int filemap_readahead(struct kiocb *iocb, struct file *file,
 
 	if (iocb->ki_flags & IOCB_NOIO)
 		return -EAGAIN;
+	if (iocb->ki_flags & IOCB_UNCACHED)
+		ractl.uncached = 1;
 	page_cache_async_ra(&ractl, folio, last_index - folio->index);
 	return 0;
 }
@@ -2547,6 +2551,8 @@ static int filemap_get_pages(struct kiocb *iocb, size_t count,
 			return -EAGAIN;
 		if (iocb->ki_flags & IOCB_NOWAIT)
 			flags = memalloc_noio_save();
+		if (iocb->ki_flags & IOCB_UNCACHED)
+			ractl.uncached = 1;
 		page_cache_sync_ra(&ractl, last_index - index);
 		if (iocb->ki_flags & IOCB_NOWAIT)
 			memalloc_noio_restore(flags);
@@ -2594,6 +2600,20 @@ static inline bool pos_same_folio(loff_t pos1, loff_t pos2, struct folio *folio)
 	return (pos1 >> shift == pos2 >> shift);
 }
 
+static void filemap_uncached_read(struct address_space *mapping,
+				  struct folio *folio)
+{
+	if (!folio_test_uncached(folio))
+		return;
+	if (folio_test_writeback(folio) || folio_test_dirty(folio))
+		return;
+	if (folio_trylock(folio)) {
+		if (folio_test_clear_uncached(folio))
+			folio_unmap_invalidate(mapping, folio, 0);
+		folio_unlock(folio);
+	}
+}
+
 /**
  * filemap_read - Read data from the page cache.
  * @iocb: The iocb to read.
@@ -2707,8 +2727,12 @@ ssize_t filemap_read(struct kiocb *iocb, struct iov_iter *iter,
 			}
 		}
 put_folios:
-		for (i = 0; i < folio_batch_count(&fbatch); i++)
-			folio_put(fbatch.folios[i]);
+		for (i = 0; i < folio_batch_count(&fbatch); i++) {
+			struct folio *folio = fbatch.folios[i];
+
+			filemap_uncached_read(mapping, folio);
+			folio_put(folio);
+		}
 		folio_batch_init(&fbatch);
 	} while (iov_iter_count(iter) && iocb->ki_pos < isize && !error);
 
diff --git a/mm/swap.c b/mm/swap.c
index 10decd9dffa1..4019ab371759 100644
--- a/mm/swap.c
+++ b/mm/swap.c
@@ -427,6 +427,8 @@ static void folio_inc_refs(struct folio *folio)
  */
 void folio_mark_accessed(struct folio *folio)
 {
+	if (folio_test_uncached(folio))
+		return;
 	if (lru_gen_enabled()) {
 		folio_inc_refs(folio);
 		return;
-- 
2.45.2




* [PATCH 09/12] mm/filemap: drop uncached pages when writeback completes
  2024-12-03 15:31 [PATCHSET v6 0/12] Uncached buffered IO Jens Axboe
                   ` (7 preceding siblings ...)
  2024-12-03 15:31 ` [PATCH 08/12] mm/filemap: add read support for RWF_UNCACHED Jens Axboe
@ 2024-12-03 15:31 ` Jens Axboe
  2024-12-03 15:31 ` [PATCH 10/12] mm/filemap: add filemap_fdatawrite_range_kick() helper Jens Axboe
                   ` (5 subsequent siblings)
  14 siblings, 0 replies; 43+ messages in thread
From: Jens Axboe @ 2024-12-03 15:31 UTC (permalink / raw)
  To: linux-mm, linux-fsdevel
  Cc: hannes, clm, linux-kernel, willy, kirill, bfoster, Jens Axboe

If the folio is marked as uncached, drop pages when writeback completes.
Intended to be used with RWF_UNCACHED, to avoid needing sync writes for
uncached IO.

Signed-off-by: Jens Axboe <axboe@kernel.dk>
---
 mm/filemap.c | 28 ++++++++++++++++++++++++++++
 1 file changed, 28 insertions(+)

diff --git a/mm/filemap.c b/mm/filemap.c
index 139d1db79ff8..eb6a8d39f9d0 100644
--- a/mm/filemap.c
+++ b/mm/filemap.c
@@ -1599,6 +1599,27 @@ int folio_wait_private_2_killable(struct folio *folio)
 }
 EXPORT_SYMBOL(folio_wait_private_2_killable);
 
+/*
+ * If folio was marked as uncached, then pages should be dropped when writeback
+ * completes. Do that now. If we fail, it's likely because of a big folio -
+ * just reset uncached for that case and later completions should invalidate.
+ */
+static void folio_end_uncached(struct folio *folio)
+{
+	/*
+	 * Hitting !in_task() should not happen off RWF_UNCACHED writeback, but
+	 * can happen if normal writeback just happens to find dirty folios
+	 * that were created as part of uncached writeback, and that writeback
+	 * would otherwise not need non-IRQ handling. Just skip the
+	 * invalidation in that case.
+	 */
+	if (in_task() && folio_trylock(folio)) {
+		if (folio->mapping)
+			folio_unmap_invalidate(folio->mapping, folio, 0);
+		folio_unlock(folio);
+	}
+}
+
 /**
  * folio_end_writeback - End writeback against a folio.
  * @folio: The folio.
@@ -1609,6 +1630,8 @@ EXPORT_SYMBOL(folio_wait_private_2_killable);
  */
 void folio_end_writeback(struct folio *folio)
 {
+	bool folio_uncached = false;
+
 	VM_BUG_ON_FOLIO(!folio_test_writeback(folio), folio);
 
 	/*
@@ -1630,9 +1653,14 @@ void folio_end_writeback(struct folio *folio)
 	 * reused before the folio_wake_bit().
 	 */
 	folio_get(folio);
+	if (!folio_test_dirty(folio))
+		folio_uncached = folio_test_clear_uncached(folio);
 	if (__folio_end_writeback(folio))
 		folio_wake_bit(folio, PG_writeback);
 	acct_reclaim_writeback(folio);
+
+	if (folio_uncached)
+		folio_end_uncached(folio);
 	folio_put(folio);
 }
 EXPORT_SYMBOL(folio_end_writeback);
-- 
2.45.2




* [PATCH 10/12] mm/filemap: add filemap_fdatawrite_range_kick() helper
  2024-12-03 15:31 [PATCHSET v6 0/12] Uncached buffered IO Jens Axboe
                   ` (8 preceding siblings ...)
  2024-12-03 15:31 ` [PATCH 09/12] mm/filemap: drop uncached pages when writeback completes Jens Axboe
@ 2024-12-03 15:31 ` Jens Axboe
  2024-12-03 15:31 ` [PATCH 11/12] mm/filemap: make buffered writes work with RWF_UNCACHED Jens Axboe
                   ` (4 subsequent siblings)
  14 siblings, 0 replies; 43+ messages in thread
From: Jens Axboe @ 2024-12-03 15:31 UTC (permalink / raw)
  To: linux-mm, linux-fsdevel
  Cc: hannes, clm, linux-kernel, willy, kirill, bfoster, Jens Axboe

Works like filemap_fdatawrite_range(), except it's a non-integrity data
writeback and hence only starts writeback on the specified range. Will
help facilitate generically starting uncached writeback from
generic_write_sync(), as header dependencies preclude doing this inline
from fs.h.

Signed-off-by: Jens Axboe <axboe@kernel.dk>
---
 include/linux/fs.h |  2 ++
 mm/filemap.c       | 18 ++++++++++++++++++
 2 files changed, 20 insertions(+)

diff --git a/include/linux/fs.h b/include/linux/fs.h
index b64a78582f06..40383f5cc6a2 100644
--- a/include/linux/fs.h
+++ b/include/linux/fs.h
@@ -2878,6 +2878,8 @@ extern int __must_check file_fdatawait_range(struct file *file, loff_t lstart,
 extern int __must_check file_check_and_advance_wb_err(struct file *file);
 extern int __must_check file_write_and_wait_range(struct file *file,
 						loff_t start, loff_t end);
+int filemap_fdatawrite_range_kick(struct address_space *mapping, loff_t start,
+		loff_t end);
 
 static inline int file_write_and_wait(struct file *file)
 {
diff --git a/mm/filemap.c b/mm/filemap.c
index eb6a8d39f9d0..826df99e294f 100644
--- a/mm/filemap.c
+++ b/mm/filemap.c
@@ -449,6 +449,24 @@ int filemap_fdatawrite_range(struct address_space *mapping, loff_t start,
 }
 EXPORT_SYMBOL(filemap_fdatawrite_range);
 
+/**
+ * filemap_fdatawrite_range_kick - start writeback on a range
+ * @mapping:	target address_space
+ * @start:	index to start writeback on
+ * @end:	last (non-inclusive) index for writeback
+ *
+ * This is a non-integrity writeback helper, to start writing back folios
+ * for the indicated range.
+ *
+ * Return: %0 on success, negative error code otherwise.
+ */
+int filemap_fdatawrite_range_kick(struct address_space *mapping, loff_t start,
+				  loff_t end)
+{
+	return __filemap_fdatawrite_range(mapping, start, end, WB_SYNC_NONE);
+}
+EXPORT_SYMBOL_GPL(filemap_fdatawrite_range_kick);
+
 /**
  * filemap_flush - mostly a non-blocking flush
  * @mapping:	target address_space
-- 
2.45.2




* [PATCH 11/12] mm/filemap: make buffered writes work with RWF_UNCACHED
  2024-12-03 15:31 [PATCHSET v6 0/12] Uncached buffered IO Jens Axboe
                   ` (9 preceding siblings ...)
  2024-12-03 15:31 ` [PATCH 10/12] mm/filemap: add filemap_fdatawrite_range_kick() helper Jens Axboe
@ 2024-12-03 15:31 ` Jens Axboe
  2024-12-06 17:17   ` Darrick J. Wong
  2024-12-03 15:31 ` [PATCH 12/12] mm: add FGP_UNCACHED folio creation flag Jens Axboe
                   ` (3 subsequent siblings)
  14 siblings, 1 reply; 43+ messages in thread
From: Jens Axboe @ 2024-12-03 15:31 UTC (permalink / raw)
  To: linux-mm, linux-fsdevel
  Cc: hannes, clm, linux-kernel, willy, kirill, bfoster, Jens Axboe

If RWF_UNCACHED is set for a write, mark new folios being written with
uncached. This is done by passing in the fact that it's an uncached write
through the folio pointer. We can only get there when IOCB_UNCACHED was
allowed, which can only happen if the file system opts in. Opting in means
they need to check for the LSB in the folio pointer to know if it's an
uncached write or not. If it is, then FGP_UNCACHED should be used if
creating new folios is necessary.

Uncached writes will drop any folios they create upon writeback
completion, but leave folios that may exist in that range alone. Since
->write_begin() doesn't currently take any flags, and to avoid needing
to change the callback kernel wide, use the foliop being passed in to
->write_begin() to signal if this is an uncached write or not. File
systems can then use that to mark newly created folios as uncached.
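
As a rough sketch of the file system side (not part of this series, and
"examplefs" is made up; it assumes the generic folio lookup plus the
FGP_UNCACHED flag added in the next patch), a ->write_begin() that
honors this marker could look like:

static int examplefs_write_begin(struct file *file,
		struct address_space *mapping, loff_t pos, unsigned len,
		struct folio **foliop, void **fsdata)
{
	fgf_t fgp = FGP_WRITEBEGIN;
	struct folio *folio;

	/* generic_perform_write() passes this marker for uncached writes */
	if (foliop_is_uncached(foliop))
		fgp |= FGP_UNCACHED;

	folio = __filemap_get_folio(mapping, pos >> PAGE_SHIFT, fgp,
				    mapping_gfp_mask(mapping));
	if (IS_ERR(folio))
		return PTR_ERR(folio);

	*foliop = folio;
	return 0;
}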

This provides similar benefits to using RWF_UNCACHED with reads. Testing
buffered writes on 32 files:

writing bs 65536, uncached 0
  1s: 196035MB/sec
  2s: 132308MB/sec
  3s: 132438MB/sec
  4s: 116528MB/sec
  5s: 103898MB/sec
  6s: 108893MB/sec
  7s: 99678MB/sec
  8s: 106545MB/sec
  9s: 106826MB/sec
 10s: 101544MB/sec
 11s: 111044MB/sec
 12s: 124257MB/sec
 13s: 116031MB/sec
 14s: 114540MB/sec
 15s: 115011MB/sec
 16s: 115260MB/sec
 17s: 116068MB/sec
 18s: 116096MB/sec

where it's quite obvious where the page cache filled, and performance
dropped to about half of where it started, settling in at around
115GB/sec. Meanwhile, 32 kswapds were running full steam trying to
reclaim pages.

Running the same test with uncached buffered writes:

writing bs 65536, uncached 1
  1s: 198974MB/sec
  2s: 189618MB/sec
  3s: 193601MB/sec
  4s: 188582MB/sec
  5s: 193487MB/sec
  6s: 188341MB/sec
  7s: 194325MB/sec
  8s: 188114MB/sec
  9s: 192740MB/sec
 10s: 189206MB/sec
 11s: 193442MB/sec
 12s: 189659MB/sec
 13s: 191732MB/sec
 14s: 190701MB/sec
 15s: 191789MB/sec
 16s: 191259MB/sec
 17s: 190613MB/sec
 18s: 191951MB/sec

and the behavior is fully predictable, performing the same throughout
even after the page cache would otherwise have fully filled with dirty
data. It's also about 65% faster, and uses half the CPU of the system
compared to the normal buffered write.

Signed-off-by: Jens Axboe <axboe@kernel.dk>
---
 include/linux/fs.h      |  5 +++++
 include/linux/pagemap.h |  9 +++++++++
 mm/filemap.c            | 12 +++++++++++-
 3 files changed, 25 insertions(+), 1 deletion(-)

diff --git a/include/linux/fs.h b/include/linux/fs.h
index 40383f5cc6a2..32255473f79d 100644
--- a/include/linux/fs.h
+++ b/include/linux/fs.h
@@ -2912,6 +2912,11 @@ static inline ssize_t generic_write_sync(struct kiocb *iocb, ssize_t count)
 				(iocb->ki_flags & IOCB_SYNC) ? 0 : 1);
 		if (ret)
 			return ret;
+	} else if (iocb->ki_flags & IOCB_UNCACHED) {
+		struct address_space *mapping = iocb->ki_filp->f_mapping;
+
+		filemap_fdatawrite_range_kick(mapping, iocb->ki_pos,
+					      iocb->ki_pos + count);
 	}
 
 	return count;
diff --git a/include/linux/pagemap.h b/include/linux/pagemap.h
index f2d49dccb7c1..e49587c40157 100644
--- a/include/linux/pagemap.h
+++ b/include/linux/pagemap.h
@@ -14,6 +14,7 @@
 #include <linux/gfp.h>
 #include <linux/bitops.h>
 #include <linux/hardirq.h> /* for in_interrupt() */
+#include <linux/writeback.h>
 #include <linux/hugetlb_inline.h>
 
 struct folio_batch;
@@ -70,6 +71,14 @@ static inline int filemap_write_and_wait(struct address_space *mapping)
 	return filemap_write_and_wait_range(mapping, 0, LLONG_MAX);
 }
 
+/*
+ * Value passed in to ->write_begin() if IOCB_UNCACHED is set for the write,
+ * and the ->write_begin() handler on a file system supporting FOP_UNCACHED
+ * must check for this and pass FGP_UNCACHED for folio creation.
+ */
+#define foliop_uncached			((struct folio *) 0xfee1c001)
+#define foliop_is_uncached(foliop)	(*(foliop) == foliop_uncached)
+
 /**
  * filemap_set_wb_err - set a writeback error on an address_space
  * @mapping: mapping in which to set writeback error
diff --git a/mm/filemap.c b/mm/filemap.c
index 826df99e294f..00f3c6c58629 100644
--- a/mm/filemap.c
+++ b/mm/filemap.c
@@ -4095,7 +4095,7 @@ ssize_t generic_perform_write(struct kiocb *iocb, struct iov_iter *i)
 	ssize_t written = 0;
 
 	do {
-		struct folio *folio;
+		struct folio *folio = NULL;
 		size_t offset;		/* Offset into folio */
 		size_t bytes;		/* Bytes to write to folio */
 		size_t copied;		/* Bytes copied from user */
@@ -4123,6 +4123,16 @@ ssize_t generic_perform_write(struct kiocb *iocb, struct iov_iter *i)
 			break;
 		}
 
+		/*
+		 * If IOCB_UNCACHED is set here, we know the file system
+		 * supports it. And hence it'll know to check foliop for being
+		 * set to this magic value. If so, it's an uncached write.
+		 * Whenever ->write_begin() changes prototypes again, this
+		 * can go away and just pass iocb or iocb flags.
+		 */
+		if (iocb->ki_flags & IOCB_UNCACHED)
+			folio = foliop_uncached;
+
 		status = a_ops->write_begin(file, mapping, pos, bytes,
 						&folio, &fsdata);
 		if (unlikely(status < 0))
-- 
2.45.2




* [PATCH 12/12] mm: add FGP_UNCACHED folio creation flag
  2024-12-03 15:31 [PATCHSET v6 0/12] Uncached buffered IO Jens Axboe
                   ` (10 preceding siblings ...)
  2024-12-03 15:31 ` [PATCH 11/12] mm/filemap: make buffered writes work with RWF_UNCACHED Jens Axboe
@ 2024-12-03 15:31 ` Jens Axboe
  2024-12-03 18:23 ` [PATCHSET v6 0/12] Uncached buffered IO Christoph Lameter (Ampere)
                   ` (2 subsequent siblings)
  14 siblings, 0 replies; 43+ messages in thread
From: Jens Axboe @ 2024-12-03 15:31 UTC (permalink / raw)
  To: linux-mm, linux-fsdevel
  Cc: hannes, clm, linux-kernel, willy, kirill, bfoster, Jens Axboe

Callers can pass this in for uncached folio creation, in which case a
newly created folio gets marked as uncached. If a folio already exists
for this index and the lookup succeeds, it will not get marked as
uncached. If a lookup without FGP_UNCACHED finds a folio that has the
uncached flag set, clear the flag - there are now competing uncached
and cached users of the folio, and it should not get pruned.

Signed-off-by: Jens Axboe <axboe@kernel.dk>
---
 include/linux/pagemap.h | 2 ++
 mm/filemap.c            | 5 +++++
 2 files changed, 7 insertions(+)

diff --git a/include/linux/pagemap.h b/include/linux/pagemap.h
index e49587c40157..374872acbe1d 100644
--- a/include/linux/pagemap.h
+++ b/include/linux/pagemap.h
@@ -721,6 +721,7 @@ pgoff_t page_cache_prev_miss(struct address_space *mapping,
  * * %FGP_NOFS - __GFP_FS will get cleared in gfp.
  * * %FGP_NOWAIT - Don't block on the folio lock.
  * * %FGP_STABLE - Wait for the folio to be stable (finished writeback)
+ * * %FGP_UNCACHED - Uncached buffered IO
  * * %FGP_WRITEBEGIN - The flags to use in a filesystem write_begin()
  *   implementation.
  */
@@ -734,6 +735,7 @@ typedef unsigned int __bitwise fgf_t;
 #define FGP_NOWAIT		((__force fgf_t)0x00000020)
 #define FGP_FOR_MMAP		((__force fgf_t)0x00000040)
 #define FGP_STABLE		((__force fgf_t)0x00000080)
+#define FGP_UNCACHED		((__force fgf_t)0x00000100)
 #define FGF_GET_ORDER(fgf)	(((__force unsigned)fgf) >> 26)	/* top 6 bits */
 
 #define FGP_WRITEBEGIN		(FGP_LOCK | FGP_WRITE | FGP_CREAT | FGP_STABLE)
diff --git a/mm/filemap.c b/mm/filemap.c
index 00f3c6c58629..a03a9b9127b8 100644
--- a/mm/filemap.c
+++ b/mm/filemap.c
@@ -2001,6 +2001,8 @@ struct folio *__filemap_get_folio(struct address_space *mapping, pgoff_t index,
 			/* Init accessed so avoid atomic mark_page_accessed later */
 			if (fgp_flags & FGP_ACCESSED)
 				__folio_set_referenced(folio);
+			if (fgp_flags & FGP_UNCACHED)
+				__folio_set_uncached(folio);
 
 			err = filemap_add_folio(mapping, folio, index, gfp);
 			if (!err)
@@ -2023,6 +2025,9 @@ struct folio *__filemap_get_folio(struct address_space *mapping, pgoff_t index,
 
 	if (!folio)
 		return ERR_PTR(-ENOENT);
+	/* not an uncached lookup, clear uncached if set */
+	if (folio_test_uncached(folio) && !(fgp_flags & FGP_UNCACHED))
+		folio_clear_uncached(folio);
 	return folio;
 }
 EXPORT_SYMBOL(__filemap_get_folio);
-- 
2.45.2




* Re: [PATCHSET v6 0/12] Uncached buffered IO
  2024-12-03 15:31 [PATCHSET v6 0/12] Uncached buffered IO Jens Axboe
                   ` (11 preceding siblings ...)
  2024-12-03 15:31 ` [PATCH 12/12] mm: add FGP_UNCACHED folio creation flag Jens Axboe
@ 2024-12-03 18:23 ` Christoph Lameter (Ampere)
  2024-12-03 21:06   ` Jens Axboe
  2024-12-06 17:37 ` Darrick J. Wong
  2024-12-10  9:48 ` Bharata B Rao
  14 siblings, 1 reply; 43+ messages in thread
From: Christoph Lameter (Ampere) @ 2024-12-03 18:23 UTC (permalink / raw)
  To: Jens Axboe
  Cc: linux-mm, linux-fsdevel, hannes, clm, linux-kernel, willy,
	kirill, bfoster

On Tue, 3 Dec 2024, Jens Axboe wrote:

>
> So here's a new approach to the same concent, but using the page cache
> as synchronization. That makes RWF_UNCACHED less special, in that it's
> just page cache IO, except it prunes the ranges once IO is completed.


Great idea and something that is really important these days.

However, one nit that I have is the use of the term "uncached" for a
folio/page. An uncached "page frame" refers to a page frame that requires
accesses not going through the CPU cache, i.e. device mappings. This is
an established mm/cpu term as far as I can tell.

So maybe be a bit more specific about which cache this is?

PAGE_CACHE_UNCACHED?

or use a different term. It is cached after all but only for a brief
period. So this may be a "TEMPORAL_PAGE" or so? (Similar to the x86
"non-temporal" stores).





* Re: [PATCHSET v6 0/12] Uncached buffered IO
  2024-12-03 18:23 ` [PATCHSET v6 0/12] Uncached buffered IO Christoph Lameter (Ampere)
@ 2024-12-03 21:06   ` Jens Axboe
  2024-12-03 22:16     ` Christoph Lameter (Ampere)
  0 siblings, 1 reply; 43+ messages in thread
From: Jens Axboe @ 2024-12-03 21:06 UTC (permalink / raw)
  To: Christoph Lameter (Ampere)
  Cc: linux-mm, linux-fsdevel, hannes, clm, linux-kernel, willy,
	kirill, bfoster

On 12/3/24 11:23 AM, Christoph Lameter (Ampere) wrote:
> On Tue, 3 Dec 2024, Jens Axboe wrote:
> 
>>
>> So here's a new approach to the same concent, but using the page cache
>> as synchronization. That makes RWF_UNCACHED less special, in that it's
>> just page cache IO, except it prunes the ranges once IO is completed.
> 
> 
> Great idea and someting that is really important these days.
> 
> However, one nit that I have is the use of the term "uncached" for a
> folio/page. An uncached "page frame" refers to a page frame that requires
> accesses not  going through the cpu cache. I.e. device mappings. This is
> an established mm/cpu term as far as I can tell.
> 
> So maybe be a bit more specific about which cache this is?
> 
> PAGE_CACHE_UNCACHED?
> 
> or use a different term. It is cached after all but only for a brief
> period. So this may be a "TEMPORAL_PAGE" or so? (Similar to the x86
> "non-temporal" stores).

I actually did consider using some form of temporal, as it's the only
other name I liked. But I do think cached_uncached becomes pretty
unwieldy. Which is why I just stuck with uncached. Yes I know it means
different things in different circles, but probably mostly an overlap
with deeper technical things like that. An honestly almost impossible to
avoid overlap these days, everything has been used already :-)

IOW, I think uncached is probably still the most descriptive thing out
there, even if I'm certainly open to entertaining other names. Just not
anything yet that has really resonated with me.

-- 
Jens Axboe



* Re: [PATCHSET v6 0/12] Uncached buffered IO
  2024-12-03 21:06   ` Jens Axboe
@ 2024-12-03 22:16     ` Christoph Lameter (Ampere)
  2024-12-03 22:41       ` Jens Axboe
  0 siblings, 1 reply; 43+ messages in thread
From: Christoph Lameter (Ampere) @ 2024-12-03 22:16 UTC (permalink / raw)
  To: Jens Axboe
  Cc: linux-mm, linux-fsdevel, hannes, clm, linux-kernel, willy,
	kirill, bfoster

On Tue, 3 Dec 2024, Jens Axboe wrote:

> I actually did consider using some form of temporal, as it's the only
> other name I liked. But I do think cached_uncached becomes pretty
> unwieldy. Which is why I just stuck with uncached. Yes I know it means
> different things in different circles, but probably mostly an overlap
> with deeper technical things like that. An honestly almost impossible to
> avoid overlap these days, everything has been used already :-)
>
> IOW, I think uncached is probably still the most descriptive thing out
> there, even if I'm certainly open to entertaining other names. Just not
> anything yet that has really resonated with me.

How about calling this a "transitory" page? It means fleeting, not
persistent and I think we have not used that term with a page/folio yet.






* Re: [PATCHSET v6 0/12] Uncached buffered IO
  2024-12-03 22:16     ` Christoph Lameter (Ampere)
@ 2024-12-03 22:41       ` Jens Axboe
  2024-12-04  5:52         ` Darrick J. Wong
  0 siblings, 1 reply; 43+ messages in thread
From: Jens Axboe @ 2024-12-03 22:41 UTC (permalink / raw)
  To: Christoph Lameter (Ampere)
  Cc: linux-mm, linux-fsdevel, hannes, clm, linux-kernel, willy,
	kirill, bfoster

On 12/3/24 3:16 PM, Christoph Lameter (Ampere) wrote:
> On Tue, 3 Dec 2024, Jens Axboe wrote:
> 
>> I actually did consider using some form of temporal, as it's the only
>> other name I liked. But I do think cached_uncached becomes pretty
>> unwieldy. Which is why I just stuck with uncached. Yes I know it means
>> different things in different circles, but probably mostly an overlap
>> with deeper technical things like that. An honestly almost impossible to
>> avoid overlap these days, everything has been used already :-)
>>
>> IOW, I think uncached is probably still the most descriptive thing out
>> there, even if I'm certainly open to entertaining other names. Just not
>> anything yet that has really resonated with me.
> 
> How about calling this a "transitory" page? It means fleeting, not
> persistent and I think we have not used that term with a page/folio yet.

I also hit the thesaurus ;-)

I'm honestly not too worried about the internal name, as developers can
figure that out. It's more about presenting an external name that sys
developers will not need a lot of explaining to know what it's about.
And something that isn't too long. BRIEFLY_CACHED? TRANSIENT_CACHE?

Dunno, I keep going back to uncached as it's pretty easy to grok!

-- 
Jens Axboe




* Re: [PATCHSET v6 0/12] Uncached buffered IO
  2024-12-03 22:41       ` Jens Axboe
@ 2024-12-04  5:52         ` Darrick J. Wong
  2024-12-04 16:36           ` Jens Axboe
  2024-12-10 11:11           ` Christoph Hellwig
  0 siblings, 2 replies; 43+ messages in thread
From: Darrick J. Wong @ 2024-12-04  5:52 UTC (permalink / raw)
  To: Jens Axboe
  Cc: Christoph Lameter (Ampere),
	linux-mm, linux-fsdevel, hannes, clm, linux-kernel, willy,
	kirill, bfoster

On Tue, Dec 03, 2024 at 03:41:53PM -0700, Jens Axboe wrote:
> On 12/3/24 3:16 PM, Christoph Lameter (Ampere) wrote:
> > On Tue, 3 Dec 2024, Jens Axboe wrote:
> > 
> >> I actually did consider using some form of temporal, as it's the only
> >> other name I liked. But I do think cached_uncached becomes pretty
> >> unwieldy. Which is why I just stuck with uncached. Yes I know it means
> >> different things in different circles, but probably mostly an overlap
> >> with deeper technical things like that. An honestly almost impossible to
> >> avoid overlap these days, everything has been used already :-)
> >>
> >> IOW, I think uncached is probably still the most descriptive thing out
> >> there, even if I'm certainly open to entertaining other names. Just not
> >> anything yet that has really resonated with me.
> > 
> > How about calling this a "transitory" page? It means fleeting, not
> > persistent and I think we have not used that term with a page/folio yet.
> 
> I also hit the thesaurus ;-)
> 
> I'm honestly not too worried about the internal name, as developers can
> figure that out. It's more about presenting an external name that sys
> developers will not need a lot of explaining to know what it's about.
> And something that isn't too long. BRIEFLY_CACHED? TRANSIENT_CACHE?
> 
> Dunno, I keep going back to uncached as it's pretty easy to grok!

<shrug> RWF_DONTCACHE, to match {I,DCACHE}_DONTCACHE ? ;)

They sound pretty similar ("load this so I can do something with it,
evict it immediately if possible") though I wouldn't rely on people
outside the kernel being familiar with the existing dontcaches.

--D

> -- 
> Jens Axboe
> 
> 



* Re: [PATCHSET v6 0/12] Uncached buffered IO
  2024-12-04  5:52         ` Darrick J. Wong
@ 2024-12-04 16:36           ` Jens Axboe
  2024-12-10 11:11           ` Christoph Hellwig
  1 sibling, 0 replies; 43+ messages in thread
From: Jens Axboe @ 2024-12-04 16:36 UTC (permalink / raw)
  To: Darrick J. Wong
  Cc: Christoph Lameter (Ampere),
	linux-mm, linux-fsdevel, hannes, clm, linux-kernel, willy,
	kirill, bfoster

On 12/3/24 10:52 PM, Darrick J. Wong wrote:
> On Tue, Dec 03, 2024 at 03:41:53PM -0700, Jens Axboe wrote:
>> On 12/3/24 3:16 PM, Christoph Lameter (Ampere) wrote:
>>> On Tue, 3 Dec 2024, Jens Axboe wrote:
>>>
>>>> I actually did consider using some form of temporal, as it's the only
>>>> other name I liked. But I do think cached_uncached becomes pretty
>>>> unwieldy. Which is why I just stuck with uncached. Yes I know it means
>>>> different things in different circles, but probably mostly an overlap
>>>> with deeper technical things like that. An honestly almost impossible to
>>>> avoid overlap these days, everything has been used already :-)
>>>>
>>>> IOW, I think uncached is probably still the most descriptive thing out
>>>> there, even if I'm certainly open to entertaining other names. Just not
>>>> anything yet that has really resonated with me.
>>>
>>> How about calling this a "transitory" page? It means fleeting, not
>>> persistent and I think we have not used that term with a page/folio yet.
>>
>> I also hit the thesaurus ;-)
>>
>> I'm honestly not too worried about the internal name, as developers can
>> figure that out. It's more about presenting an external name that sys
>> developers will not need a lot of explaining to know what it's about.
>> And something that isn't too long. BRIEFLY_CACHED? TRANSIENT_CACHE?
>>
>> Dunno, I keep going back to uncached as it's pretty easy to grok!
> 
> <shrug> RWF_DONTCACHE, to match {I,DCACHE}_DONTCACHE ? ;)
> 
> They sound pretty similar ("load this so I can do something with it,
> evict it immediately if possible") though I wouldn't rely on people
> outside the kernel being familiar with the existing dontcaches.

Naming is hard! Most people do seem to grok what uncached means when
I've shopped it around. The fact that it does use the page cache is
pretty irrelevant; that's more of an implementation detail to solve
various issues around competing users of it. That it doesn't persist is
the important bit, and uncached does seem to convey that pretty nicely.

-- 
Jens Axboe


^ permalink raw reply	[flat|nested] 43+ messages in thread

* Re: [PATCH 11/12] mm/filemap: make buffered writes work with RWF_UNCACHED
  2024-12-03 15:31 ` [PATCH 11/12] mm/filemap: make buffered writes work with RWF_UNCACHED Jens Axboe
@ 2024-12-06 17:17   ` Darrick J. Wong
  2024-12-06 18:22     ` Jens Axboe
  0 siblings, 1 reply; 43+ messages in thread
From: Darrick J. Wong @ 2024-12-06 17:17 UTC (permalink / raw)
  To: Jens Axboe
  Cc: linux-mm, linux-fsdevel, hannes, clm, linux-kernel, willy,
	kirill, bfoster

On Tue, Dec 03, 2024 at 08:31:47AM -0700, Jens Axboe wrote:
> If RWF_UNCACHED is set for a write, mark new folios being written with
> uncached. This is done by passing in the fact that it's an uncached write
> through the folio pointer. We can only get there when IOCB_UNCACHED was
> allowed, which can only happen if the file system opts in. Opting in means
> they need to check for the LSB in the folio pointer to know if it's an
> uncached write or not. If it is, then FGP_UNCACHED should be used if
> creating new folios is necessary.
> 
> Uncached writes will drop any folios they create upon writeback
> completion, but leave folios that may exist in that range alone. Since
> ->write_begin() doesn't currently take any flags, and to avoid needing
> to change the callback kernel wide, use the foliop being passed in to
> ->write_begin() to signal if this is an uncached write or not. File
> systems can then use that to mark newly created folios as uncached.
> 
> This provides similar benefits to using RWF_UNCACHED with reads. Testing
> buffered writes on 32 files:
> 
> writing bs 65536, uncached 0
>   1s: 196035MB/sec
>   2s: 132308MB/sec
>   3s: 132438MB/sec
>   4s: 116528MB/sec
>   5s: 103898MB/sec
>   6s: 108893MB/sec
>   7s: 99678MB/sec
>   8s: 106545MB/sec
>   9s: 106826MB/sec
>  10s: 101544MB/sec
>  11s: 111044MB/sec
>  12s: 124257MB/sec
>  13s: 116031MB/sec
>  14s: 114540MB/sec
>  15s: 115011MB/sec
>  16s: 115260MB/sec
>  17s: 116068MB/sec
>  18s: 116096MB/sec
> 
> where it's quite obvious where the page cache filled, and performance
> dropped to about half of where it started, settling in at around
> 115GB/sec. Meanwhile, 32 kswapds were running full steam trying to
> reclaim pages.
> 
> Running the same test with uncached buffered writes:
> 
> writing bs 65536, uncached 1
>   1s: 198974MB/sec
>   2s: 189618MB/sec
>   3s: 193601MB/sec
>   4s: 188582MB/sec
>   5s: 193487MB/sec
>   6s: 188341MB/sec
>   7s: 194325MB/sec
>   8s: 188114MB/sec
>   9s: 192740MB/sec
>  10s: 189206MB/sec
>  11s: 193442MB/sec
>  12s: 189659MB/sec
>  13s: 191732MB/sec
>  14s: 190701MB/sec
>  15s: 191789MB/sec
>  16s: 191259MB/sec
>  17s: 190613MB/sec
>  18s: 191951MB/sec
> 
> and the behavior is fully predictable, performing the same throughout
> even after the page cache would otherwise have fully filled with dirty
> data. It's also about 65% faster, and using half the CPU of the system
> compared to the normal buffered write.
> 
> Signed-off-by: Jens Axboe <axboe@kernel.dk>
> ---
>  include/linux/fs.h      |  5 +++++
>  include/linux/pagemap.h |  9 +++++++++
>  mm/filemap.c            | 12 +++++++++++-
>  3 files changed, 25 insertions(+), 1 deletion(-)
> 
> diff --git a/include/linux/fs.h b/include/linux/fs.h
> index 40383f5cc6a2..32255473f79d 100644
> --- a/include/linux/fs.h
> +++ b/include/linux/fs.h
> @@ -2912,6 +2912,11 @@ static inline ssize_t generic_write_sync(struct kiocb *iocb, ssize_t count)
>  				(iocb->ki_flags & IOCB_SYNC) ? 0 : 1);
>  		if (ret)
>  			return ret;
> +	} else if (iocb->ki_flags & IOCB_UNCACHED) {
> +		struct address_space *mapping = iocb->ki_filp->f_mapping;
> +
> +		filemap_fdatawrite_range_kick(mapping, iocb->ki_pos,
> +					      iocb->ki_pos + count);
>  	}
>  
>  	return count;
> diff --git a/include/linux/pagemap.h b/include/linux/pagemap.h
> index f2d49dccb7c1..e49587c40157 100644
> --- a/include/linux/pagemap.h
> +++ b/include/linux/pagemap.h
> @@ -14,6 +14,7 @@
>  #include <linux/gfp.h>
>  #include <linux/bitops.h>
>  #include <linux/hardirq.h> /* for in_interrupt() */
> +#include <linux/writeback.h>
>  #include <linux/hugetlb_inline.h>
>  
>  struct folio_batch;
> @@ -70,6 +71,14 @@ static inline int filemap_write_and_wait(struct address_space *mapping)
>  	return filemap_write_and_wait_range(mapping, 0, LLONG_MAX);
>  }
>  
> +/*
> + * Value passed in to ->write_begin() if IOCB_UNCACHED is set for the write,
> + * and the ->write_begin() handler on a file system supporting FOP_UNCACHED
> + * must check for this and pass FGP_UNCACHED for folio creation.
> + */
> +#define foliop_uncached			((struct folio *) 0xfee1c001)
> +#define foliop_is_uncached(foliop)	(*(foliop) == foliop_uncached)

Honestly, I'm not a fan of foliop_uncached or foliop_is_uncached.

The first one because it's a magic value and can you guarantee that
0xfee1c001 will never be a pointer to an actual struct folio, even on
32-bit?

Second, they're both named "foliop" even though the first one doesn't
return a (struct folio **) but the second one takes that as an arg.

I think these two macros are only used for ext4 (or really, !iomap)
support, right?  And that's only to avoid messing with ->write_begin?
What if you dropped ext4 support instead? :D

--D

>  /**
>   * filemap_set_wb_err - set a writeback error on an address_space
>   * @mapping: mapping in which to set writeback error
> diff --git a/mm/filemap.c b/mm/filemap.c
> index 826df99e294f..00f3c6c58629 100644
> --- a/mm/filemap.c
> +++ b/mm/filemap.c
> @@ -4095,7 +4095,7 @@ ssize_t generic_perform_write(struct kiocb *iocb, struct iov_iter *i)
>  	ssize_t written = 0;
>  
>  	do {
> -		struct folio *folio;
> +		struct folio *folio = NULL;
>  		size_t offset;		/* Offset into folio */
>  		size_t bytes;		/* Bytes to write to folio */
>  		size_t copied;		/* Bytes copied from user */
> @@ -4123,6 +4123,16 @@ ssize_t generic_perform_write(struct kiocb *iocb, struct iov_iter *i)
>  			break;
>  		}
>  
> +		/*
> +		 * If IOCB_UNCACHED is set here, we know the file system
> +		 * supports it. And hence it'll know to check foliop for being
> +		 * set to this magic value. If so, it's an uncached write.
> +		 * Whenever ->write_begin() changes prototypes again, this
> +		 * can go away and just pass iocb or iocb flags.
> +		 */
> +		if (iocb->ki_flags & IOCB_UNCACHED)
> +			folio = foliop_uncached;
> +
>  		status = a_ops->write_begin(file, mapping, pos, bytes,
>  						&folio, &fsdata);
>  		if (unlikely(status < 0))
> -- 
> 2.45.2
> 
> 


^ permalink raw reply	[flat|nested] 43+ messages in thread

* Re: [PATCH 07/12] fs: add RWF_UNCACHED iocb and FOP_UNCACHED file_operations flag
  2024-12-03 15:31 ` [PATCH 07/12] fs: add RWF_UNCACHED iocb and FOP_UNCACHED file_operations flag Jens Axboe
@ 2024-12-06 17:35   ` Darrick J. Wong
  2024-12-10 11:22   ` Christoph Hellwig
  1 sibling, 0 replies; 43+ messages in thread
From: Darrick J. Wong @ 2024-12-06 17:35 UTC (permalink / raw)
  To: Jens Axboe
  Cc: linux-mm, linux-fsdevel, hannes, clm, linux-kernel, willy,
	kirill, bfoster

On Tue, Dec 03, 2024 at 08:31:43AM -0700, Jens Axboe wrote:
> If a file system supports uncached buffered IO, it may set FOP_UNCACHED
> and enable RWF_UNCACHED. If RWF_UNCACHED is attempted without the file
> system supporting it, it'll get errored with -EOPNOTSUPP.
> 
> Signed-off-by: Jens Axboe <axboe@kernel.dk>
> ---
>  include/linux/fs.h      | 14 +++++++++++++-
>  include/uapi/linux/fs.h |  6 +++++-
>  2 files changed, 18 insertions(+), 2 deletions(-)
> 
> diff --git a/include/linux/fs.h b/include/linux/fs.h
> index 7e29433c5ecc..b64a78582f06 100644
> --- a/include/linux/fs.h
> +++ b/include/linux/fs.h
> @@ -322,6 +322,7 @@ struct readahead_control;
>  #define IOCB_NOWAIT		(__force int) RWF_NOWAIT
>  #define IOCB_APPEND		(__force int) RWF_APPEND
>  #define IOCB_ATOMIC		(__force int) RWF_ATOMIC
> +#define IOCB_UNCACHED		(__force int) RWF_UNCACHED
>  
>  /* non-RWF related bits - start at 16 */
>  #define IOCB_EVENTFD		(1 << 16)
> @@ -356,7 +357,8 @@ struct readahead_control;
>  	{ IOCB_SYNC,		"SYNC" }, \
>  	{ IOCB_NOWAIT,		"NOWAIT" }, \
>  	{ IOCB_APPEND,		"APPEND" }, \
> -	{ IOCB_ATOMIC,		"ATOMIC"}, \
> +	{ IOCB_ATOMIC,		"ATOMIC" }, \
> +	{ IOCB_UNCACHED,	"UNCACHED" }, \
>  	{ IOCB_EVENTFD,		"EVENTFD"}, \
>  	{ IOCB_DIRECT,		"DIRECT" }, \
>  	{ IOCB_WRITE,		"WRITE" }, \
> @@ -2127,6 +2129,8 @@ struct file_operations {
>  #define FOP_UNSIGNED_OFFSET	((__force fop_flags_t)(1 << 5))
>  /* Supports asynchronous lock callbacks */
>  #define FOP_ASYNC_LOCK		((__force fop_flags_t)(1 << 6))
> +/* File system supports uncached read/write buffered IO */
> +#define FOP_UNCACHED		((__force fop_flags_t)(1 << 7))
>  
>  /* Wrap a directory iterator that needs exclusive inode access */
>  int wrap_directory_iterator(struct file *, struct dir_context *,
> @@ -3614,6 +3618,14 @@ static inline int kiocb_set_rw_flags(struct kiocb *ki, rwf_t flags,
>  		if (!(ki->ki_filp->f_mode & FMODE_CAN_ATOMIC_WRITE))
>  			return -EOPNOTSUPP;
>  	}
> +	if (flags & RWF_UNCACHED) {

Should FMODE_NOREUSE imply RWF_UNCACHED?  I know, I'm dredging this up
again from v3:

https://lore.kernel.org/linux-fsdevel/ZzKn4OyHXq5r6eiI@dread.disaster.area/

but the manpage for fadvise says NOREUSE means "The specified data will
be accessed only once." and I think that fits what you're doing here.
And yeah, it's annoying that people keep asking for moar knobs to tweak
io operations: Let's have a mount option, and a fadvise mode, and a
fcntl mode, and finally per-io flags!  (mostly kidding)

Also, one of your replies referenced a poc to set UNCACHED on NOREUSE
involving willy and yu.  Where was that?  I've found this:

https://lore.kernel.org/linux-fsdevel/ZzI97bky3Rwzw18C@casper.infradead.org/

but that turned into a documentation discussion.
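
Just to make the suggestion concrete, a rough (untested) sketch of what
that implication could look like in kiocb_set_rw_flags(), gated on the
fs having opted in via FOP_UNCACHED so we don't silently enable it where
it would be rejected anyway:

	/* sketch: treat a NOREUSE file as if RWF_UNCACHED had been passed */
	if ((ki->ki_filp->f_mode & FMODE_NOREUSE) &&
	    (ki->ki_filp->f_op->fop_flags & FOP_UNCACHED))
		flags |= RWF_UNCACHED;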

There were also a few unanswered questions (imo) from the last few
iterations of this patchset.

If someone issues a lot of small appending uncached writes to a file,
does that mean the writes and writeback will now be lockstepping each
other to write out the folio?  Or should programs simply not do that?

What if I wanted to do a bunch of small writes to adjacent bytes,
amortize writeback over a single disk io, and not wait for reclaim to
drop the folio?  Admittedly that doesn't really fit with "will be
accessed only once" so I think "don't do that" is an acceptable answer.

And, I guess if the application really wants fine-grained control then
it /can/ still pwrite, sync_file_range, and fadvise(WONTNEED).  Though
that's three syscalls/uring ops/whatever.  But that might be cheaper
than repeated rewrites.
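
For reference, the long-hand userspace sequence would be roughly the
below (untested sketch; POSIX_FADV_DONTNEED only drops pages that are
already clean, hence the wait flags):

	ssize_t ret = pwrite(fd, buf, len, off);

	/* kick writeback for the range and wait for it to finish */
	sync_file_range(fd, off, len, SYNC_FILE_RANGE_WAIT_BEFORE |
			SYNC_FILE_RANGE_WRITE | SYNC_FILE_RANGE_WAIT_AFTER);

	/* then ask for the (now clean) range to be dropped */
	posix_fadvise(fd, off, len, POSIX_FADV_DONTNEED);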

--D

> +		/* file system must support it */
> +		if (!(ki->ki_filp->f_op->fop_flags & FOP_UNCACHED))
> +			return -EOPNOTSUPP;
> +		/* DAX mappings not supported */
> +		if (IS_DAX(ki->ki_filp->f_mapping->host))
> +			return -EOPNOTSUPP;
> +	}
>  	kiocb_flags |= (__force int) (flags & RWF_SUPPORTED);
>  	if (flags & RWF_SYNC)
>  		kiocb_flags |= IOCB_DSYNC;
> diff --git a/include/uapi/linux/fs.h b/include/uapi/linux/fs.h
> index 753971770733..dc77cd8ae1a3 100644
> --- a/include/uapi/linux/fs.h
> +++ b/include/uapi/linux/fs.h
> @@ -332,9 +332,13 @@ typedef int __bitwise __kernel_rwf_t;
>  /* Atomic Write */
>  #define RWF_ATOMIC	((__force __kernel_rwf_t)0x00000040)
>  
> +/* buffered IO that drops the cache after reading or writing data */
> +#define RWF_UNCACHED	((__force __kernel_rwf_t)0x00000080)
> +
>  /* mask of flags supported by the kernel */
>  #define RWF_SUPPORTED	(RWF_HIPRI | RWF_DSYNC | RWF_SYNC | RWF_NOWAIT |\
> -			 RWF_APPEND | RWF_NOAPPEND | RWF_ATOMIC)
> +			 RWF_APPEND | RWF_NOAPPEND | RWF_ATOMIC |\
> +			 RWF_UNCACHED)
>  
>  #define PROCFS_IOCTL_MAGIC 'f'
>  
> -- 
> 2.45.2
> 
> 


^ permalink raw reply	[flat|nested] 43+ messages in thread

* Re: [PATCHSET v6 0/12] Uncached buffered IO
  2024-12-03 15:31 [PATCHSET v6 0/12] Uncached buffered IO Jens Axboe
                   ` (12 preceding siblings ...)
  2024-12-03 18:23 ` [PATCHSET v6 0/12] Uncached buffered IO Christoph Lameter (Ampere)
@ 2024-12-06 17:37 ` Darrick J. Wong
  2024-12-10  9:48 ` Bharata B Rao
  14 siblings, 0 replies; 43+ messages in thread
From: Darrick J. Wong @ 2024-12-06 17:37 UTC (permalink / raw)
  To: Jens Axboe
  Cc: linux-mm, linux-fsdevel, hannes, clm, linux-kernel, willy,
	kirill, bfoster

On Tue, Dec 03, 2024 at 08:31:36AM -0700, Jens Axboe wrote:
> Hi,
> 
> 5 years ago I posted patches adding support for RWF_UNCACHED, as a way
> to do buffered IO that isn't page cache persistent. The approach back
> then was to have private pages for IO, and then get rid of them once IO
> was done. But that then runs into all the issues that O_DIRECT has, in
> terms of synchronizing with the page cache.
> 
> So here's a new approach to the same concent, but using the page cache
> as synchronization. That makes RWF_UNCACHED less special, in that it's
> just page cache IO, except it prunes the ranges once IO is completed.
> 
> Why do this, you may ask? The tldr is that device speeds are only
> getting faster, while reclaim is not. Doing normal buffered IO can be
> very unpredictable, and suck up a lot of resources on the reclaim side.
> This leads people to use O_DIRECT as a work-around, which has its own
> set of restrictions in terms of size, offset, and length of IO. It's
> also inherently synchronous, and now you need async IO as well. While
> the latter isn't necessarily a big problem as we have good options
> available there, it also should not be a requirement when all you want
> to do is read or write some data without caching.
> 
> Even on desktop type systems, a normal NVMe device can fill the entire
> page cache in seconds. On the big system I used for testing, there's a
> lot more RAM, but also a lot more devices. As can be seen in some of the
> results in the following patches, you can still fill RAM in seconds even
> when there's 1TB of it. Hence this problem isn't solely a "big
> hyperscaler system" issue, it's common across the board.
> 
> Common for both reads and writes with RWF_UNCACHED is that they use the
> page cache for IO. Reads work just like a normal buffered read would,
> with the only exception being that the touched ranges will get pruned
> after data has been copied. For writes, the ranges will get writeback
> kicked off before the syscall returns, and then writeback completion
> will prune the range. Hence writes aren't synchronous, and it's easy to
> pipeline writes using RWF_UNCACHED. Folios that aren't instantiated by
> RWF_UNCACHED IO are left untouched. This means you that uncached IO
> will take advantage of the page cache for uptodate data, but not leave
> anything it instantiated/created in cache.
> 
> File systems need to support this. The patches add support for the
> generic filemap helpers, and for iomap. Then ext4 and XFS are marked as
> supporting it. The last patch adds support for btrfs as well, lightly
> tested. The read side is already done by filemap, only the write side
> needs a bit of help. The amount of code here is really trivial, and the
> only reason the fs opt-in is necessary is to have an RWF_UNCACHED IO
> return -EOPNOTSUPP just in case the fs doesn't use either the generic
> paths or iomap. Adding "support" to other file systems should be
> trivial, most of the time just a one-liner adding FOP_UNCACHED to the
> fop_flags in the file_operations struct.
> 
> Performance results are in patch 8 for reads and patch 10 for writes,
> with the tldr being that I see about a 65% improvement in performance
> for both, with fully predictable IO times. CPU reduction is substantial
> as well, with no kswapd activity at all for reclaim when using uncached
> IO.
> 
> Using it from applications is trivial - just set RWF_UNCACHED for the
> read or write, using pwritev2(2) or preadv2(2). For io_uring, same
> thing, just set RWF_UNCACHED in sqe->rw_flags for a buffered read/write
> operation. And that's it.
> 
> Patches 1..7 are just prep patches, and should have no functional
> changes at all. Patch 8 adds support for the filemap path for
> RWF_UNCACHED reads, patch 11 adds support for filemap RWF_UNCACHED
> writes. In the below mentioned branch, there are then patches to
> adopt uncached reads and writes for ext4, xfs, and btrfs.
> 
> Passes full xfstests and fsx overnight runs, no issues observed. That
> includes the vm running the testing also using RWF_UNCACHED on the host.
> I'll post fsstress and fsx patches for RWF_UNCACHED separately. As far
> as I'm concerned, no further work needs doing here.
> 
> And git tree for the patches is here:
> 
> https://git.kernel.dk/cgit/linux/log/?h=buffered-uncached.8

Oh good, I much prefer browsing git branches these days. :)

 * mm/filemap: change filemap_create_folio() to take a struct kiocb
 * mm/readahead: add folio allocation helper
 * mm: add PG_uncached page flag
 * mm/readahead: add readahead_control->uncached member
 * mm/filemap: use page_cache_sync_ra() to kick off read-ahead
 * mm/truncate: add folio_unmap_invalidate() helper

The mm patches look ok to me, but I think you ought to get at least an
ack from willy since they're largely pagecache changes.

 * fs: add RWF_UNCACHED iocb and FOP_UNCACHED file_operations flag

See more detailed reply in the thread.

 * mm/filemap: add read support for RWF_UNCACHED

Looks cleaner now that we don't even unmap the page if it's dirty.

 * mm/filemap: drop uncached pages when writeback completes
 * mm/filemap: add filemap_fdatawrite_range_kick() helper
 * mm/filemap: make buffered writes work with RWF_UNCACHED

See more detailed reply in the thread.

 * mm: add FGP_UNCACHED folio creation flag

I appreciate that !UNCACHED callers of __filemap_get_folio now clear the
uncached bit if it's set.

Now I proceed into the rest of your branch, because I felt like it:

 * ext4: add RWF_UNCACHED write support

(Dunno about the WARN_ON removals in this patch, but this is really
Ted's call anyway).

 * iomap: make buffered writes work with RWF_UNCACHED

The commit message references an "iocb_uncached_write" but I don't find
any such function in the extended patchset?  Otherwise this looks ready
to me.  Thanks for changing it only to set uncached if we're actually
creating a folio, and not just returning one that was already in the
pagecache.

 * xfs: punt uncached write completions to the completion wq

Dumb nit: spaces between "IOMAP_F_SHARED|IOMAP_F_UNCACHED" in this
patch.

 * xfs: flag as supporting FOP_UNCACHED

Otherwise the xfs changes look ready too.

 * btrfs: add support for uncached writes
 * block: support uncached IO

Not sure why the definition of bio_dirty_lock gets moved around, but in
principle this looks ok to me too.

For the whole pile of mm changes (aka patches 1-6,8-10,12),
Acked-by: "Darrick J. Wong" <djwong@kernel.org>

--D

> 
>  include/linux/fs.h             |  21 +++++-
>  include/linux/page-flags.h     |   5 ++
>  include/linux/pagemap.h        |  14 ++++
>  include/trace/events/mmflags.h |   3 +-
>  include/uapi/linux/fs.h        |   6 +-
>  mm/filemap.c                   | 114 +++++++++++++++++++++++++++++----
>  mm/readahead.c                 |  22 +++++--
>  mm/swap.c                      |   2 +
>  mm/truncate.c                  |  35 ++++++----
>  9 files changed, 187 insertions(+), 35 deletions(-)
> 
> Since v5
> - Skip invalidation in filemap_uncached_read() if the folio is dirty
>   as well, retaining the uncached setting for later cleaning to do
>   the actual invalidation.
> - Use the same trylock approach in read invalidation as the writeback
>   invalidation does.
> - Swap order of patches 10 and 11 to fix a bisection issue.
> - Split core mm changes and fs series patches. Once the generic side
>   has been approved, I'll send out the fs series separately.
> - Rebase on 6.13-rc1
> 
> -- 
> Jens Axboe
> 
> 


^ permalink raw reply	[flat|nested] 43+ messages in thread

* Re: [PATCH 11/12] mm/filemap: make buffered writes work with RWF_UNCACHED
  2024-12-06 17:17   ` Darrick J. Wong
@ 2024-12-06 18:22     ` Jens Axboe
  2024-12-10 11:31       ` Christoph Hellwig
  0 siblings, 1 reply; 43+ messages in thread
From: Jens Axboe @ 2024-12-06 18:22 UTC (permalink / raw)
  To: Darrick J. Wong
  Cc: linux-mm, linux-fsdevel, hannes, clm, linux-kernel, willy,
	kirill, bfoster

On 12/6/24 10:17 AM, Darrick J. Wong wrote:
> On Tue, Dec 03, 2024 at 08:31:47AM -0700, Jens Axboe wrote:
>> If RWF_UNCACHED is set for a write, mark new folios being written with
>> uncached. This is done by passing in the fact that it's an uncached write
>> through the folio pointer. We can only get there when IOCB_UNCACHED was
>> allowed, which can only happen if the file system opts in. Opting in means
>> they need to check for the LSB in the folio pointer to know if it's an
>> uncached write or not. If it is, then FGP_UNCACHED should be used if
>> creating new folios is necessary.
>>
>> Uncached writes will drop any folios they create upon writeback
>> completion, but leave folios that may exist in that range alone. Since
>> ->write_begin() doesn't currently take any flags, and to avoid needing
>> to change the callback kernel wide, use the foliop being passed in to
>> ->write_begin() to signal if this is an uncached write or not. File
>> systems can then use that to mark newly created folios as uncached.
>>
>> This provides similar benefits to using RWF_UNCACHED with reads. Testing
>> buffered writes on 32 files:
>>
>> writing bs 65536, uncached 0
>>   1s: 196035MB/sec
>>   2s: 132308MB/sec
>>   3s: 132438MB/sec
>>   4s: 116528MB/sec
>>   5s: 103898MB/sec
>>   6s: 108893MB/sec
>>   7s: 99678MB/sec
>>   8s: 106545MB/sec
>>   9s: 106826MB/sec
>>  10s: 101544MB/sec
>>  11s: 111044MB/sec
>>  12s: 124257MB/sec
>>  13s: 116031MB/sec
>>  14s: 114540MB/sec
>>  15s: 115011MB/sec
>>  16s: 115260MB/sec
>>  17s: 116068MB/sec
>>  18s: 116096MB/sec
>>
>> where it's quite obvious where the page cache filled, and performance
>> dropped to about half of where it started, settling in at around
>> 115GB/sec. Meanwhile, 32 kswapds were running full steam trying to
>> reclaim pages.
>>
>> Running the same test with uncached buffered writes:
>>
>> writing bs 65536, uncached 1
>>   1s: 198974MB/sec
>>   2s: 189618MB/sec
>>   3s: 193601MB/sec
>>   4s: 188582MB/sec
>>   5s: 193487MB/sec
>>   6s: 188341MB/sec
>>   7s: 194325MB/sec
>>   8s: 188114MB/sec
>>   9s: 192740MB/sec
>>  10s: 189206MB/sec
>>  11s: 193442MB/sec
>>  12s: 189659MB/sec
>>  13s: 191732MB/sec
>>  14s: 190701MB/sec
>>  15s: 191789MB/sec
>>  16s: 191259MB/sec
>>  17s: 190613MB/sec
>>  18s: 191951MB/sec
>>
>> and the behavior is fully predictable, performing the same throughout
>> even after the page cache would otherwise have fully filled with dirty
>> data. It's also about 65% faster, and using half the CPU of the system
>> compared to the normal buffered write.
>>
>> Signed-off-by: Jens Axboe <axboe@kernel.dk>
>> ---
>>  include/linux/fs.h      |  5 +++++
>>  include/linux/pagemap.h |  9 +++++++++
>>  mm/filemap.c            | 12 +++++++++++-
>>  3 files changed, 25 insertions(+), 1 deletion(-)
>>
>> diff --git a/include/linux/fs.h b/include/linux/fs.h
>> index 40383f5cc6a2..32255473f79d 100644
>> --- a/include/linux/fs.h
>> +++ b/include/linux/fs.h
>> @@ -2912,6 +2912,11 @@ static inline ssize_t generic_write_sync(struct kiocb *iocb, ssize_t count)
>>  				(iocb->ki_flags & IOCB_SYNC) ? 0 : 1);
>>  		if (ret)
>>  			return ret;
>> +	} else if (iocb->ki_flags & IOCB_UNCACHED) {
>> +		struct address_space *mapping = iocb->ki_filp->f_mapping;
>> +
>> +		filemap_fdatawrite_range_kick(mapping, iocb->ki_pos,
>> +					      iocb->ki_pos + count);
>>  	}
>>  
>>  	return count;
>> diff --git a/include/linux/pagemap.h b/include/linux/pagemap.h
>> index f2d49dccb7c1..e49587c40157 100644
>> --- a/include/linux/pagemap.h
>> +++ b/include/linux/pagemap.h
>> @@ -14,6 +14,7 @@
>>  #include <linux/gfp.h>
>>  #include <linux/bitops.h>
>>  #include <linux/hardirq.h> /* for in_interrupt() */
>> +#include <linux/writeback.h>
>>  #include <linux/hugetlb_inline.h>
>>  
>>  struct folio_batch;
>> @@ -70,6 +71,14 @@ static inline int filemap_write_and_wait(struct address_space *mapping)
>>  	return filemap_write_and_wait_range(mapping, 0, LLONG_MAX);
>>  }
>>  
>> +/*
>> + * Value passed in to ->write_begin() if IOCB_UNCACHED is set for the write,
>> + * and the ->write_begin() handler on a file system supporting FOP_UNCACHED
>> + * must check for this and pass FGP_UNCACHED for folio creation.
>> + */
>> +#define foliop_uncached			((struct folio *) 0xfee1c001)
>> +#define foliop_is_uncached(foliop)	(*(foliop) == foliop_uncached)
> 
> Honestly, I'm not a fan of foliop_uncached or foliop_is_uncached.

It definitely is what I would elegantly refer to as somewhat of a
hack... But it's not _that_ bad imho.

> The first one because it's a magic value and can you guarantee that
> 0xfee1c001 will never be a pointer to an actual struct folio, even on
> 32-bit?

I don't think that should be possible, since the value deliberately ends
in 1. A struct like folio (or anything else) will be at least word
aligned, and this value is not.

> Second, they're both named "foliop" even though the first one doesn't
> return a (struct folio **) but the second one takes that as an arg.

I just named them as such since they only deal with the folio ** that is
being passed in. I can certainly rename the second one to
folio_uncached, that would be an improvement I think. Thanks!
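
Something like the below, I guess, so that only the helper actually
operating on a struct folio ** keeps the "foliop" prefix (sketch):

#define folio_uncached			((struct folio *) 0xfee1c001)
#define foliop_is_uncached(foliop)	(*(foliop) == folio_uncached)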

> I think these two macros are only used for ext4 (or really, !iomap)
> support, right?  And that's only to avoid messing with ->write_begin?

Indeed, ideally we'd change ->write_begin() instead. And that probably
should still be done, I just did not want to deal with that nightmare in
terms of managing the patchset. And honestly I think it'd be OK to defer
that part until ->write_begin() needs to be changed for other reasons,
it's a lot of churn just for this particular thing and dealing with the
magic pointer value (at least to me) is liveable.

> What if you dropped ext4 support instead? :D

Hah, yes obviously that'd be a solution, then I'd need to drop btrfs as
well. And I would kind of prefer not doing that ;-)

-- 
Jens Axboe


^ permalink raw reply	[flat|nested] 43+ messages in thread

* [PATCHSET v6 0/12] Uncached buffered IO
  2024-12-03 15:31 [PATCHSET v6 0/12] Uncached buffered IO Jens Axboe
                   ` (13 preceding siblings ...)
  2024-12-06 17:37 ` Darrick J. Wong
@ 2024-12-10  9:48 ` Bharata B Rao
  2024-12-12 15:46   ` Jens Axboe
  14 siblings, 1 reply; 43+ messages in thread
From: Bharata B Rao @ 2024-12-10  9:48 UTC (permalink / raw)
  To: axboe
  Cc: bfoster, clm, hannes, kirill, linux-fsdevel, linux-kernel,
	linux-mm, willy

Hi Jens,

I ran a couple of variants of FIO to check how this patchset affects the
FIO numbers. My other motivation with this patchset is to check if it
improves the scalability issues that were discussed in [1] and [2].
But for now, here are some initial numbers.

To enable uncached buffered IO, I have modified the FIO pvsync2 engine to
issue preadv2()/pwritev2() calls with RWF_UNCACHED set. The FIO change
is shown below, and I assume this is good enough to correctly use this
patchset.

diff --git a/engines/sync.c b/engines/sync.c
index b8be4eb3..44e9da3d 100644
--- a/engines/sync.c
+++ b/engines/sync.c
@@ -170,6 +170,8 @@ static enum fio_q_status fio_pvsyncio2_queue(struct thread_data *td,
        if (o->nowait)
                flags |= RWF_NOWAIT;
 
+       flags |= RWF_UNCACHED;
+
        iov->iov_base = io_u->xfer_buf;
        iov->iov_len = io_u->xfer_buflen;
 
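For a quick sanity check outside of fio, the raw syscall usage is just
the below (sketch, assuming _GNU_SOURCE and <sys/uio.h>; RWF_UNCACHED is
defined locally in case the installed uapi headers don't have it yet):

	#ifndef RWF_UNCACHED
	#define RWF_UNCACHED	0x00000080	/* value from the patchset's uapi header */
	#endif

	struct iovec iov = { .iov_base = buf, .iov_len = len };
	ssize_t ret = pwritev2(fd, &iov, 1, off, RWF_UNCACHED);
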
Also note that I am using your buffered-uncached.8 branch from
https://git.kernel.dk/cgit/linux/log/?h=buffered-uncached.8 that has
changes to enable uncached buffered IO for EXT4 and block devices.

In the below reported numbers,
'base' means kernel from buffered-uncached.8 branch and
'patched' means kernel from buffered-uncached.8 branch + above shown FIO change

FIO on EXT4 partitions
======================
nvme1n1     259:12   0   3.5T  0 disk 
├─nvme1n1p1 259:13   0 894.3G  0 part /mnt1
├─nvme1n1p2 259:14   0 894.3G  0 part /mnt2
├─nvme1n1p3 259:15   0 894.3G  0 part /mnt3
└─nvme1n1p4 259:16   0 894.1G  0 part /mnt4

fio -directory=/mnt4/ -direct=0 -thread -size=3G -rw=rw -rwmixwrite=30 --norandommap --randrepeat=0 -ioengine=pvsync2 -bs=64k -numjobs=252 -runtime=3600 --time_based -group_reporting -name=mytest
fio -directory=/mnt3/ -direct=0 -thread -size=3G -rw=rw -rwmixwrite=30 --norandommap --randrepeat=0 -ioengine=pvsync2 -bs=64k -numjobs=252 -runtime=3600 --time_based -group_reporting -name=mytest
fio -directory=/mnt1/ -direct=0 -thread -size=3G -rw=rw -rwmixwrite=30 --norandommap --randrepeat=0 -ioengine=pvsync2 -bs=64k -numjobs=252 -runtime=3600 --time_based -group_reporting -name=mytest
fio -directory=/mnt2/ -direct=0 -thread -size=3G -rw=rw -rwmixwrite=30 --norandommap --randrepeat=0 -ioengine=pvsync2 -bs=64k -numjobs=252 -runtime=3600 --time_based -group_reporting -name=mytest

Four NVME devices are formatted with EXT4 and four parallel FIO instances
are run on them with the options as shown above.

FIO output looks like this:

base:
   READ: bw=1233MiB/s (1293MB/s), 1233MiB/s-1233MiB/s (1293MB/s-1293MB/s), io=4335GiB (4654GB), run=3600097-3600097msec
  WRITE: bw=529MiB/s (554MB/s), 529MiB/s-529MiB/s (554MB/s-554MB/s), io=1858GiB (1995GB), run=3600097-3600097msec
   READ: bw=1248MiB/s (1308MB/s), 1248MiB/s-1248MiB/s (1308MB/s-1308MB/s), io=4387GiB (4710GB), run=3600091-3600091msec
  WRITE: bw=535MiB/s (561MB/s), 535MiB/s-535MiB/s (561MB/s-561MB/s), io=1880GiB (2019GB), run=3600091-3600091msec
   READ: bw=1235MiB/s (1294MB/s), 1235MiB/s-1235MiB/s (1294MB/s-1294MB/s), io=4340GiB (4660GB), run=3600094-3600094msec
  WRITE: bw=529MiB/s (555MB/s), 529MiB/s-529MiB/s (555MB/s-555MB/s), io=1860GiB (1997GB), run=3600094-3600094msec
   READ: bw=1234MiB/s (1294MB/s), 1234MiB/s-1234MiB/s (1294MB/s-1294MB/s), io=4337GiB (4657GB), run=3600093-3600093msec
  WRITE: bw=529MiB/s (554MB/s), 529MiB/s-529MiB/s (554MB/s-554MB/s), io=1859GiB (1996GB), run=3600093-3600093msec

patched:
   READ: bw=1400MiB/s (1469MB/s), 1400MiB/s-1400MiB/s (1469MB/s-1469MB/s), io=4924GiB (5287GB), run=3600100-3600100msec
  WRITE: bw=600MiB/s (629MB/s), 600MiB/s-600MiB/s (629MB/s-629MB/s), io=2110GiB (2266GB), run=3600100-3600100msec
   READ: bw=1395MiB/s (1463MB/s), 1395MiB/s-1395MiB/s (1463MB/s-1463MB/s), io=4904GiB (5266GB), run=3600148-3600148msec
  WRITE: bw=598MiB/s (627MB/s), 598MiB/s-598MiB/s (627MB/s-627MB/s), io=2102GiB (2257GB), run=3600148-3600148msec
   READ: bw=1385MiB/s (1452MB/s), 1385MiB/s-1385MiB/s (1452MB/s-1452MB/s), io=4868GiB (5227GB), run=3600136-3600136msec
  WRITE: bw=594MiB/s (622MB/s), 594MiB/s-594MiB/s (622MB/s-622MB/s), io=2087GiB (2241GB), run=3600136-3600136msec
   READ: bw=1376MiB/s (1443MB/s), 1376MiB/s-1376MiB/s (1443MB/s-1443MB/s), io=4837GiB (5194GB), run=3600145-3600145msec
  WRITE: bw=590MiB/s (618MB/s), 590MiB/s-590MiB/s (618MB/s-618MB/s), io=2073GiB (2226GB), run=3600145-3600145msec

FIO on block devices
====================
nvme1n1     259:12   0   3.5T  0 disk 
├─nvme1n1p1 259:13   0 894.3G  0 part 
├─nvme1n1p2 259:14   0 894.3G  0 part 
├─nvme1n1p3 259:15   0 894.3G  0 part 
└─nvme1n1p4 259:16   0 894.1G  0 part 

fio -filename=/dev/nvme1n1p4 -direct=0 -thread -size=800G -rw=rw -rwmixwrite=30 --norandommap --randrepeat=0 -ioengine=pvsync2 -bs=64k -numjobs=252 -runtime=3600 --time_based -group_reporting -name=mytest
fio -filename=/dev/nvme1n1p2 -direct=0 -thread -size=800G -rw=rw -rwmixwrite=30 --norandommap --randrepeat=0 -ioengine=pvsync2 -bs=64k -numjobs=252 -runtime=3600 --time_based -group_reporting -name=mytest
fio -filename=/dev/nvme1n1p1 -direct=0 -thread -size=800G -rw=rw -rwmixwrite=30 --norandommap --randrepeat=0 -ioengine=pvsync2 -bs=64k -numjobs=252 -runtime=3600 --time_based -group_reporting -name=mytest
fio -filename=/dev/nvme1n1p3 -direct=0 -thread -size=800G -rw=rw -rwmixwrite=30 --norandommap --randrepeat=0 -ioengine=pvsync2 -bs=64k -numjobs=252 -runtime=3600 --time_based -group_reporting -name=mytest

Four instances of FIO are run on four different NVME block devices
with the options as shown above.

base:
   READ: bw=8712MiB/s (9135MB/s), 8712MiB/s-8712MiB/s (9135MB/s-9135MB/s), io=29.9TiB (32.9TB), run=3600011-3600011msec
  WRITE: bw=3734MiB/s (3915MB/s), 3734MiB/s-3734MiB/s (3915MB/s-3915MB/s), io=12.8TiB (14.1TB), run=3600011-3600011msec
   READ: bw=8727MiB/s (9151MB/s), 8727MiB/s-8727MiB/s (9151MB/s-9151MB/s), io=30.0TiB (32.9TB), run=3600005-3600005msec
  WRITE: bw=3740MiB/s (3922MB/s), 3740MiB/s-3740MiB/s (3922MB/s-3922MB/s), io=12.8TiB (14.1TB), run=3600005-3600005msec
   READ: bw=8701MiB/s (9123MB/s), 8701MiB/s-8701MiB/s (9123MB/s-9123MB/s), io=29.9TiB (32.8TB), run=3600004-3600004msec
  WRITE: bw=3729MiB/s (3910MB/s), 3729MiB/s-3729MiB/s (3910MB/s-3910MB/s), io=12.8TiB (14.1TB), run=3600004-3600004msec
   READ: bw=8706MiB/s (9128MB/s), 8706MiB/s-8706MiB/s (9128MB/s-9128MB/s), io=29.9TiB (32.9TB), run=3600005-3600005msec
  WRITE: bw=3731MiB/s (3913MB/s), 3731MiB/s-3731MiB/s (3913MB/s-3913MB/s), io=12.8TiB (14.1TB), run=3600005-3600005msec

patched:
   READ: bw=1844MiB/s (1933MB/s), 1844MiB/s-1844MiB/s (1933MB/s-1933MB/s), io=6500GiB (6980GB), run=3610641-3610641msec
  WRITE: bw=790MiB/s (828MB/s), 790MiB/s-790MiB/s (828MB/s-828MB/s), io=2786GiB (2991GB), run=3610642-3610642msec
   READ: bw=1753MiB/s (1838MB/s), 1753MiB/s-1753MiB/s (1838MB/s-1838MB/s), io=6235GiB (6695GB), run=3641973-3641973msec
  WRITE: bw=751MiB/s (788MB/s), 751MiB/s-751MiB/s (788MB/s-788MB/s), io=2672GiB (2869GB), run=3641969-3641969msec
   READ: bw=1078MiB/s (1130MB/s), 1078MiB/s-1078MiB/s (1130MB/s-1130MB/s), io=3788GiB (4068GB), run=3600007-3600007msec
  WRITE: bw=462MiB/s (484MB/s), 462MiB/s-462MiB/s (484MB/s-484MB/s), io=1624GiB (1743GB), run=3600007-3600007msec
   READ: bw=1752MiB/s (1838MB/s), 1752MiB/s-1752MiB/s (1838MB/s-1838MB/s), io=6234GiB (6694GB), run=3642657-3642657msec
  WRITE: bw=751MiB/s (788MB/s), 751MiB/s-751MiB/s (788MB/s-788MB/s), io=2672GiB (2869GB), run=3642622-3642622msec

While FIO on FS shows improvement, FIO on block shows numbers going down.
Is this expected or am I missing enabling anything else for the block option?

Regards,
Bharata.

[1] https://lore.kernel.org/linux-mm/d2841226-e27b-4d3d-a578-63587a3aa4f3@amd.com/
[2] https://lore.kernel.org/linux-fsdevel/20241127054737.33351-1-bharata@amd.com/


^ permalink raw reply	[flat|nested] 43+ messages in thread

* Re: [PATCHSET v6 0/12] Uncached buffered IO
  2024-12-04  5:52         ` Darrick J. Wong
  2024-12-04 16:36           ` Jens Axboe
@ 2024-12-10 11:11           ` Christoph Hellwig
  2024-12-12 15:48             ` Jens Axboe
  1 sibling, 1 reply; 43+ messages in thread
From: Christoph Hellwig @ 2024-12-10 11:11 UTC (permalink / raw)
  To: Darrick J. Wong
  Cc: Jens Axboe, Christoph Lameter (Ampere),
	linux-mm, linux-fsdevel, hannes, clm, linux-kernel, willy,
	kirill, bfoster

On Tue, Dec 03, 2024 at 09:52:41PM -0800, Darrick J. Wong wrote:
> <shrug> RWF_DONTCACHE, to match {I,DCACHE}_DONTCACHE ? ;)
> 
> They sound pretty similar ("load this so I can do something with it,
> evict it immediately if possible") though I wouldn't rely on people
> outside the kernel being familiar with the existing dontcaches.

FYI, another vote for dontcache.  uncached just has too many connotations
in the kernel context.



^ permalink raw reply	[flat|nested] 43+ messages in thread

* Re: [PATCH 01/12] mm/filemap: change filemap_create_folio() to take a struct kiocb
  2024-12-03 15:31 ` [PATCH 01/12] mm/filemap: change filemap_create_folio() to take a struct kiocb Jens Axboe
@ 2024-12-10 11:13   ` Christoph Hellwig
  2024-12-12 15:49     ` Jens Axboe
  0 siblings, 1 reply; 43+ messages in thread
From: Christoph Hellwig @ 2024-12-10 11:13 UTC (permalink / raw)
  To: Jens Axboe
  Cc: linux-mm, linux-fsdevel, hannes, clm, linux-kernel, willy,
	kirill, bfoster

On Tue, Dec 03, 2024 at 08:31:37AM -0700, Jens Axboe wrote:
> +static int filemap_create_folio(struct kiocb *iocb,
> +		struct address_space *mapping, struct folio_batch *fbatch)

We might as well drop passing the mapping and deriving it from the iocb
as well.

Otherwise this looks fine to me.



^ permalink raw reply	[flat|nested] 43+ messages in thread

* Re: [PATCH 05/12] mm/filemap: use page_cache_sync_ra() to kick off read-ahead
  2024-12-03 15:31 ` [PATCH 05/12] mm/filemap: use page_cache_sync_ra() to kick off read-ahead Jens Axboe
@ 2024-12-10 11:15   ` Christoph Hellwig
  0 siblings, 0 replies; 43+ messages in thread
From: Christoph Hellwig @ 2024-12-10 11:15 UTC (permalink / raw)
  To: Jens Axboe
  Cc: linux-mm, linux-fsdevel, hannes, clm, linux-kernel, willy,
	kirill, bfoster

On Tue, Dec 03, 2024 at 08:31:41AM -0700, Jens Axboe wrote:
> Rather than use the page_cache_sync_readahead() helper, define our own
> ractl and use page_cache_sync_ra() directly. In preparation for needing
> to modify ractl inside filemap_get_pages().
> 
> No functional changes in this patch.

Looks good:

Reviewed-by: Christoph Hellwig <hch@lst.de>

FYI, I usually try to keep these pure cleanup patches that don't even
directly touch the series subject matter at the beginning to reduce
the cognitive load.



^ permalink raw reply	[flat|nested] 43+ messages in thread

* Re: [PATCH 06/12] mm/truncate: add folio_unmap_invalidate() helper
  2024-12-03 15:31 ` [PATCH 06/12] mm/truncate: add folio_unmap_invalidate() helper Jens Axboe
@ 2024-12-10 11:21   ` Christoph Hellwig
  2024-12-12 20:19     ` Jens Axboe
  0 siblings, 1 reply; 43+ messages in thread
From: Christoph Hellwig @ 2024-12-10 11:21 UTC (permalink / raw)
  To: Jens Axboe
  Cc: linux-mm, linux-fsdevel, hannes, clm, linux-kernel, willy,
	kirill, bfoster

On Tue, Dec 03, 2024 at 08:31:42AM -0700, Jens Axboe wrote:
> Add a folio_unmap_invalidate() helper, which unmaps and invalidates a
> given folio. The caller must already have locked the folio. Use this
> new helper in invalidate_inode_pages2_range(), rather than duplicate
> the code there.

This new helper ends up being the only caller of invalidate_complete_folio2,
so you might as well merge the two instead of adding yet another
invalidate/unmap helper; they are getting impossible to keep track of.

Also it is only used in mm/, so add the prototype to mm/internal.h
instead of the public pagemap.h.  And a little comment about what the
function does would be pretty useful as well.
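
Roughly something like the below in mm/internal.h (sketch only, modulo
whatever the final signature ends up being):

/*
 * Unmap and invalidate a locked folio, dropping it from the page cache
 * if that succeeds.  The caller must hold the folio lock.
 */
int folio_unmap_invalidate(struct address_space *mapping, struct folio *folio,
			   gfp_t gfp);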

> In preparation for using this elsewhere as well, have it take a gfp_t
> mask rather than assume GFP_KERNEL is the right choice. This bubbles
> back to invalidate_complete_folio2() as well.

Looking at the callers, the gfp_t looks a bit odd to me, as it is
either GFP_KERNEL or 0, which is a valid but rather unusable gfp_t
value.  But I guess this comes from filemap_release_folio, which
works similarly.



^ permalink raw reply	[flat|nested] 43+ messages in thread

* Re: [PATCH 07/12] fs: add RWF_UNCACHED iocb and FOP_UNCACHED file_operations flag
  2024-12-03 15:31 ` [PATCH 07/12] fs: add RWF_UNCACHED iocb and FOP_UNCACHED file_operations flag Jens Axboe
  2024-12-06 17:35   ` Darrick J. Wong
@ 2024-12-10 11:22   ` Christoph Hellwig
  2024-12-12 19:42     ` Jens Axboe
  1 sibling, 1 reply; 43+ messages in thread
From: Christoph Hellwig @ 2024-12-10 11:22 UTC (permalink / raw)
  To: Jens Axboe
  Cc: linux-mm, linux-fsdevel, hannes, clm, linux-kernel, willy,
	kirill, bfoster

On Tue, Dec 03, 2024 at 08:31:43AM -0700, Jens Axboe wrote:
> +	if (flags & RWF_UNCACHED) {
> +		/* file system must support it */
> +		if (!(ki->ki_filp->f_op->fop_flags & FOP_UNCACHED))
> +			return -EOPNOTSUPP;
> +		/* DAX mappings not supported */
> +		if (IS_DAX(ki->ki_filp->f_mapping->host))
> +			return -EOPNOTSUPP;

I'd argue that DAX is always uncached and could just ignore the flag.
Same for direct I/O.
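
I.e., something like the below instead of erroring out (sketch):

	if (flags & RWF_UNCACHED) {
		/* file system must support it */
		if (!(ki->ki_filp->f_op->fop_flags & FOP_UNCACHED))
			return -EOPNOTSUPP;
		/*
		 * DAX never puts anything in the page cache to begin with,
		 * so just drop the flag rather than failing the IO.
		 */
		if (IS_DAX(ki->ki_filp->f_mapping->host))
			flags &= ~RWF_UNCACHED;
	}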


^ permalink raw reply	[flat|nested] 43+ messages in thread

* Re: [PATCH 11/12] mm/filemap: make buffered writes work with RWF_UNCACHED
  2024-12-06 18:22     ` Jens Axboe
@ 2024-12-10 11:31       ` Christoph Hellwig
  2024-12-12 15:51         ` Jens Axboe
  0 siblings, 1 reply; 43+ messages in thread
From: Christoph Hellwig @ 2024-12-10 11:31 UTC (permalink / raw)
  To: Jens Axboe
  Cc: Darrick J. Wong, linux-mm, linux-fsdevel, hannes, clm,
	linux-kernel, willy, kirill, bfoster

On Fri, Dec 06, 2024 at 11:22:55AM -0700, Jens Axboe wrote:
> > Honestly, I'm not a fan of foliop_uncached or foliop_is_uncached.
> 
> It definitely is what I would elegantly refer to as somewhat of a
> hack... But it's not _that_ bad imho.

It's pretty horrible actually.

> > I think these two macros are only used for ext4 (or really, !iomap)
> > support, right?  And that's only to avoid messing with ->write_begin?
> 
> Indeed, ideally we'd change ->write_begin() instead. And that probably
> should still be done, I just did not want to deal with that nightmare in
> terms of managing the patchset. And honestly I think it'd be OK to defer
> that part until ->write_begin() needs to be changed for other reasons,
> it's a lot of churn just for this particular thing and dealing with the
> magic pointer value (at least to me) is liveable.

->write_begin() really should just go away, it is a horrible interface.
Note that in that past it actually had a flags argument, but that got
killed a while ago.

> > What if you dropped ext4 support instead? :D
> 
> Hah, yes obviously that'd be a solution, then I'd need to drop btrfs as
> well. And I would kind of prefer not doing that ;-)

Btrfs doesn't need it.  In fact the code would be cleaner and do less
work without it, see the patch below.  And for ext4 there already is an
iomap conversion patch series on the list that just needs more review,
so skipping it here and growing the uncached support through that sounds
sensible.

diff --git a/fs/btrfs/file.c b/fs/btrfs/file.c
index 444bee219b4e..db3a3c7ecf77 100644
--- a/fs/btrfs/file.c
+++ b/fs/btrfs/file.c
@@ -894,20 +894,17 @@ static gfp_t get_prepare_gfp_flags(struct inode *inode, bool nowait)
  */
 static noinline int prepare_one_folio(struct inode *inode, struct folio **folio_ret,
 				      loff_t pos, size_t write_bytes,
-				      bool force_uptodate, bool nowait)
+				      bool force_uptodate, fgf_t fgp_flags)
 {
 	unsigned long index = pos >> PAGE_SHIFT;
-	gfp_t mask = get_prepare_gfp_flags(inode, nowait);
-	fgf_t fgp_flags = (nowait ? FGP_WRITEBEGIN | FGP_NOWAIT : FGP_WRITEBEGIN);
+	gfp_t mask = get_prepare_gfp_flags(inode, fgp_flags & FGP_NOWAIT);
 	struct folio *folio;
 	int ret = 0;
 
-	if (foliop_is_uncached(folio_ret))
-		fgp_flags |= FGP_UNCACHED;
 again:
 	folio = __filemap_get_folio(inode->i_mapping, index, fgp_flags, mask);
 	if (IS_ERR(folio)) {
-		if (nowait)
+		if (fgp_flags & FGP_NOWAIT)
 			ret = -EAGAIN;
 		else
 			ret = PTR_ERR(folio);
@@ -925,7 +922,7 @@ static noinline int prepare_one_folio(struct inode *inode, struct folio **folio_
 	if (ret) {
 		/* The folio is already unlocked. */
 		folio_put(folio);
-		if (!nowait && ret == -EAGAIN) {
+		if (!(fgp_flags & FGP_NOWAIT) && ret == -EAGAIN) {
 			ret = 0;
 			goto again;
 		}
@@ -1135,9 +1132,15 @@ ssize_t btrfs_buffered_write(struct kiocb *iocb, struct iov_iter *i)
 	const bool nowait = (iocb->ki_flags & IOCB_NOWAIT);
 	unsigned int bdp_flags = (nowait ? BDP_ASYNC : 0);
 	bool only_release_metadata = false;
+	fgf_t fgp_flags = FGP_WRITEBEGIN;
 
-	if (nowait)
+	if (nowait) {
 		ilock_flags |= BTRFS_ILOCK_TRY;
+		fgp_flags |= FGP_NOWAIT;
+	}
+
+	if (iocb->ki_flags & IOCB_UNCACHED)
+		fgp_flags |= FGP_UNCACHED;
 
 	ret = btrfs_inode_lock(BTRFS_I(inode), ilock_flags);
 	if (ret < 0)
@@ -1232,11 +1235,8 @@ ssize_t btrfs_buffered_write(struct kiocb *iocb, struct iov_iter *i)
 			break;
 		}
 
-		if (iocb->ki_flags & IOCB_UNCACHED)
-			folio = foliop_uncached;
-
 		ret = prepare_one_folio(inode, &folio, pos, write_bytes,
-					force_page_uptodate, false);
+					force_page_uptodate, fgp_flags);
 		if (ret) {
 			btrfs_delalloc_release_extents(BTRFS_I(inode),
 						       reserve_bytes);


^ permalink raw reply	[flat|nested] 43+ messages in thread

* Re: [PATCHSET v6 0/12] Uncached buffered IO
  2024-12-10  9:48 ` Bharata B Rao
@ 2024-12-12 15:46   ` Jens Axboe
  0 siblings, 0 replies; 43+ messages in thread
From: Jens Axboe @ 2024-12-12 15:46 UTC (permalink / raw)
  To: Bharata B Rao
  Cc: bfoster, clm, hannes, kirill, linux-fsdevel, linux-kernel,
	linux-mm, willy

On 12/10/24 2:48 AM, Bharata B Rao wrote:
> Hi Jens,
> 
> I ran a couple of variants of FIO to check how this patchset affects the
> FIO numbers. My other motivation with this patchset is to check if it
> improves the scalability issues that were discussed in [1] and [2].
> But for now, here are some initial numbers.

Thanks for testing!

> Also note that I am using your buffered-uncached.8 branch from
> https://git.kernel.dk/cgit/linux/log/?h=buffered-uncached.8 that has
> changes to enable uncached buffered IO for EXT4 and block devices.
> 
> In the below reported numbers,
> 'base' means kernel from buffered-uncached.8 branch and
> 'patched' means kernel from buffered-uncached.8 branch + above shown FIO change
> 
> FIO on EXT4 partitions
> ======================
> nvme1n1     259:12   0   3.5T  0 disk 
> ├─nvme1n1p1 259:13   0 894.3G  0 part /mnt1
> ├─nvme1n1p2 259:14   0 894.3G  0 part /mnt2
> ├─nvme1n1p3 259:15   0 894.3G  0 part /mnt3
> └─nvme1n1p4 259:16   0 894.1G  0 part /mnt4
> 
> fio -directory=/mnt4/ -direct=0 -thread -size=3G -rw=rw -rwmixwrite=30 --norandommap --randrepeat=0 -ioengine=pvsync2 -bs=64k -numjobs=252 -runtime=3600 --time_based -group_reporting -name=mytest
> fio -directory=/mnt3/ -direct=0 -thread -size=3G -rw=rw -rwmixwrite=30 --norandommap --randrepeat=0 -ioengine=pvsync2 -bs=64k -numjobs=252 -runtime=3600 --time_based -group_reporting -name=mytest
> fio -directory=/mnt1/ -direct=0 -thread -size=3G -rw=rw -rwmixwrite=30 --norandommap --randrepeat=0 -ioengine=pvsync2 -bs=64k -numjobs=252 -runtime=3600 --time_based -group_reporting -name=mytest
> fio -directory=/mnt2/ -direct=0 -thread -size=3G -rw=rw -rwmixwrite=30 --norandommap --randrepeat=0 -ioengine=pvsync2 -bs=64k -numjobs=252 -runtime=3600 --time_based -group_reporting -name=mytest
> 
> Four NVME devices are formatted with EXT4 and four parallel FIO instances
> are run on them with the options as shown above.
> 
> FIO output looks like this:
> 
> base:
>    READ: bw=1233MiB/s (1293MB/s), 1233MiB/s-1233MiB/s (1293MB/s-1293MB/s), io=4335GiB (4654GB), run=3600097-3600097msec
>   WRITE: bw=529MiB/s (554MB/s), 529MiB/s-529MiB/s (554MB/s-554MB/s), io=1858GiB (1995GB), run=3600097-3600097msec
>    READ: bw=1248MiB/s (1308MB/s), 1248MiB/s-1248MiB/s (1308MB/s-1308MB/s), io=4387GiB (4710GB), run=3600091-3600091msec
>   WRITE: bw=535MiB/s (561MB/s), 535MiB/s-535MiB/s (561MB/s-561MB/s), io=1880GiB (2019GB), run=3600091-3600091msec
>    READ: bw=1235MiB/s (1294MB/s), 1235MiB/s-1235MiB/s (1294MB/s-1294MB/s), io=4340GiB (4660GB), run=3600094-3600094msec
>   WRITE: bw=529MiB/s (555MB/s), 529MiB/s-529MiB/s (555MB/s-555MB/s), io=1860GiB (1997GB), run=3600094-3600094msec
>    READ: bw=1234MiB/s (1294MB/s), 1234MiB/s-1234MiB/s (1294MB/s-1294MB/s), io=4337GiB (4657GB), run=3600093-3600093msec
>   WRITE: bw=529MiB/s (554MB/s), 529MiB/s-529MiB/s (554MB/s-554MB/s), io=1859GiB (1996GB), run=3600093-3600093msec
> 
> patched:
>    READ: bw=1400MiB/s (1469MB/s), 1400MiB/s-1400MiB/s (1469MB/s-1469MB/s), io=4924GiB (5287GB), run=3600100-3600100msec
>   WRITE: bw=600MiB/s (629MB/s), 600MiB/s-600MiB/s (629MB/s-629MB/s), io=2110GiB (2266GB), run=3600100-3600100msec
>    READ: bw=1395MiB/s (1463MB/s), 1395MiB/s-1395MiB/s (1463MB/s-1463MB/s), io=4904GiB (5266GB), run=3600148-3600148msec
>   WRITE: bw=598MiB/s (627MB/s), 598MiB/s-598MiB/s (627MB/s-627MB/s), io=2102GiB (2257GB), run=3600148-3600148msec
>    READ: bw=1385MiB/s (1452MB/s), 1385MiB/s-1385MiB/s (1452MB/s-1452MB/s), io=4868GiB (5227GB), run=3600136-3600136msec
>   WRITE: bw=594MiB/s (622MB/s), 594MiB/s-594MiB/s (622MB/s-622MB/s), io=2087GiB (2241GB), run=3600136-3600136msec
>    READ: bw=1376MiB/s (1443MB/s), 1376MiB/s-1376MiB/s (1443MB/s-1443MB/s), io=4837GiB (5194GB), run=3600145-3600145msec
>   WRITE: bw=590MiB/s (618MB/s), 590MiB/s-590MiB/s (618MB/s-618MB/s), io=2073GiB (2226GB), run=3600145-3600145msec
> 
> FIO on block devices
> ====================
> nvme1n1     259:12   0   3.5T  0 disk 
> ├─nvme1n1p1 259:13   0 894.3G  0 part 
> ├─nvme1n1p2 259:14   0 894.3G  0 part 
> ├─nvme1n1p3 259:15   0 894.3G  0 part 
> └─nvme1n1p4 259:16   0 894.1G  0 part 
> 
> fio -filename=/dev/nvme1n1p4 -direct=0 -thread -size=800G -rw=rw -rwmixwrite=30 --norandommap --randrepeat=0 -ioengine=pvsync2 -bs=64k -numjobs=252 -runtime=3600 --time_based -group_reporting -name=mytest
> fio -filename=/dev/nvme1n1p2 -direct=0 -thread -size=800G -rw=rw -rwmixwrite=30 --norandommap --randrepeat=0 -ioengine=pvsync2 -bs=64k -numjobs=252 -runtime=3600 --time_based -group_reporting -name=mytest
> fio -filename=/dev/nvme1n1p1 -direct=0 -thread -size=800G -rw=rw -rwmixwrite=30 --norandommap --randrepeat=0 -ioengine=pvsync2 -bs=64k -numjobs=252 -runtime=3600 --time_based -group_reporting -name=mytest
> fio -filename=/dev/nvme1n1p3 -direct=0 -thread -size=800G -rw=rw -rwmixwrite=30 --norandommap --randrepeat=0 -ioengine=pvsync2 -bs=64k -numjobs=252 -runtime=3600 --time_based -group_reporting -name=mytest
> 
> Four instances of FIO are run on four different NVME block devices
> with the options as shown above.
> 
> base:
>    READ: bw=8712MiB/s (9135MB/s), 8712MiB/s-8712MiB/s (9135MB/s-9135MB/s), io=29.9TiB (32.9TB), run=3600011-3600011msec
>   WRITE: bw=3734MiB/s (3915MB/s), 3734MiB/s-3734MiB/s (3915MB/s-3915MB/s), io=12.8TiB (14.1TB), run=3600011-3600011msec
>    READ: bw=8727MiB/s (9151MB/s), 8727MiB/s-8727MiB/s (9151MB/s-9151MB/s), io=30.0TiB (32.9TB), run=3600005-3600005msec
>   WRITE: bw=3740MiB/s (3922MB/s), 3740MiB/s-3740MiB/s (3922MB/s-3922MB/s), io=12.8TiB (14.1TB), run=3600005-3600005msec
>    READ: bw=8701MiB/s (9123MB/s), 8701MiB/s-8701MiB/s (9123MB/s-9123MB/s), io=29.9TiB (32.8TB), run=3600004-3600004msec
>   WRITE: bw=3729MiB/s (3910MB/s), 3729MiB/s-3729MiB/s (3910MB/s-3910MB/s), io=12.8TiB (14.1TB), run=3600004-3600004msec
>    READ: bw=8706MiB/s (9128MB/s), 8706MiB/s-8706MiB/s (9128MB/s-9128MB/s), io=29.9TiB (32.9TB), run=3600005-3600005msec
>   WRITE: bw=3731MiB/s (3913MB/s), 3731MiB/s-3731MiB/s (3913MB/s-3913MB/s), io=12.8TiB (14.1TB), run=3600005-3600005msec
> 
> patched:
>    READ: bw=1844MiB/s (1933MB/s), 1844MiB/s-1844MiB/s (1933MB/s-1933MB/s), io=6500GiB (6980GB), run=3610641-3610641msec
>   WRITE: bw=790MiB/s (828MB/s), 790MiB/s-790MiB/s (828MB/s-828MB/s), io=2786GiB (2991GB), run=3610642-3610642msec
>    READ: bw=1753MiB/s (1838MB/s), 1753MiB/s-1753MiB/s (1838MB/s-1838MB/s), io=6235GiB (6695GB), run=3641973-3641973msec
>   WRITE: bw=751MiB/s (788MB/s), 751MiB/s-751MiB/s (788MB/s-788MB/s), io=2672GiB (2869GB), run=3641969-3641969msec
>    READ: bw=1078MiB/s (1130MB/s), 1078MiB/s-1078MiB/s (1130MB/s-1130MB/s), io=3788GiB (4068GB), run=3600007-3600007msec
>   WRITE: bw=462MiB/s (484MB/s), 462MiB/s-462MiB/s (484MB/s-484MB/s), io=1624GiB (1743GB), run=3600007-3600007msec
>    READ: bw=1752MiB/s (1838MB/s), 1752MiB/s-1752MiB/s (1838MB/s-1838MB/s), io=6234GiB (6694GB), run=3642657-3642657msec
>   WRITE: bw=751MiB/s (788MB/s), 751MiB/s-751MiB/s (788MB/s-788MB/s), io=2672GiB (2869GB), run=3642622-3642622msec
> 
> While FIO on FS shows improvement, FIO on block shows numbers going down.
> Is this expected or am I missing enabling anything else for the block option?

The fs side looks as expected, that's a nice win. For the bdev side, I
deliberately did not post the bdev patch for enabling uncached buffered
IO on a raw block device, as it's just a hack for testing. It currently
needs punting similar to dirtying of pages, and it's not optimized at
all. We really need the raw bdev ops moving fully to iomap and not being
dependent on buffer_heads for this to pan out, so the most likely
outcome here is that raw bdev uncached buffered IO will not really be
supported until the time that someone (probably Christoph) does that
work.

I don't think this is a big issue, can't imagine buffered IO on raw
block devices being THAT interesting of a use case.

-- 
Jens Axboe


^ permalink raw reply	[flat|nested] 43+ messages in thread

* Re: [PATCHSET v6 0/12] Uncached buffered IO
  2024-12-10 11:11           ` Christoph Hellwig
@ 2024-12-12 15:48             ` Jens Axboe
  2024-12-12 16:59               ` Christoph Lameter (Ampere)
  0 siblings, 1 reply; 43+ messages in thread
From: Jens Axboe @ 2024-12-12 15:48 UTC (permalink / raw)
  To: Christoph Hellwig, Darrick J. Wong
  Cc: Christoph Lameter (Ampere),
	linux-mm, linux-fsdevel, hannes, clm, linux-kernel, willy,
	kirill, bfoster

On 12/10/24 4:11 AM, Christoph Hellwig wrote:
> On Tue, Dec 03, 2024 at 09:52:41PM -0800, Darrick J. Wong wrote:
>> <shrug> RWF_DONTCACHE, to match {I,DCACHE}_DONTCACHE ? ;)
>>
>> They sound pretty similar ("load this so I can do something with it,
>> evict it immediately if possible") though I wouldn't rely on people
>> outside the kernel being familiar with the existing dontcaches.
> 
> FYI, another word for dontcache.  uncached just has too many conotations
> in the kernel context.

Sure, we can go with DONTCACHE instead. Only thing I don't like about
that is that you can use uncached as a verb and adjective, eg talking
about uncached IO. Talking about dontcached IO sounds pretty weird.

As I've said previously in this and other threads, I don't feel too
strongly about the in-kernel naming, I care more about the exposed
name. And uncached does seem to be the most descriptive and most
easily understandable by users.

-- 
Jens Axboe



^ permalink raw reply	[flat|nested] 43+ messages in thread

* Re: [PATCH 01/12] mm/filemap: change filemap_create_folio() to take a struct kiocb
  2024-12-10 11:13   ` Christoph Hellwig
@ 2024-12-12 15:49     ` Jens Axboe
  0 siblings, 0 replies; 43+ messages in thread
From: Jens Axboe @ 2024-12-12 15:49 UTC (permalink / raw)
  To: Christoph Hellwig
  Cc: linux-mm, linux-fsdevel, hannes, clm, linux-kernel, willy,
	kirill, bfoster

On 12/10/24 4:13 AM, Christoph Hellwig wrote:
> On Tue, Dec 03, 2024 at 08:31:37AM -0700, Jens Axboe wrote:
>> +static int filemap_create_folio(struct kiocb *iocb,
>> +		struct address_space *mapping, struct folio_batch *fbatch)
> 
> We might as well drop passing the mapping and deriving it from the iocb
> as well.
> 
> Otherwise this looks fine to me.

Sure, I can do that cleanup at the same time.

-- 
Jens Axboe



^ permalink raw reply	[flat|nested] 43+ messages in thread

* Re: [PATCH 11/12] mm/filemap: make buffered writes work with RWF_UNCACHED
  2024-12-10 11:31       ` Christoph Hellwig
@ 2024-12-12 15:51         ` Jens Axboe
  0 siblings, 0 replies; 43+ messages in thread
From: Jens Axboe @ 2024-12-12 15:51 UTC (permalink / raw)
  To: Christoph Hellwig
  Cc: Darrick J. Wong, linux-mm, linux-fsdevel, hannes, clm,
	linux-kernel, willy, kirill, bfoster

On 12/10/24 4:31 AM, Christoph Hellwig wrote:
> On Fri, Dec 06, 2024 at 11:22:55AM -0700, Jens Axboe wrote:
>>> Honestly, I'm not a fan of foliop_uncached or foliop_is_uncached.
>>
>> It definitely is what I would elegantly refer to as somewhat of a
>> hack... But it's not _that_ bad imho.
> 
> It's pretty horrible actually.

Tell us how you really feel :-)

>>> I think these two macros are only used for ext4 (or really, !iomap)
>>> support, right?  And that's only to avoid messing with ->write_begin?
>>
>> Indeed, ideally we'd change ->write_begin() instead. That probably
>> should still be done; I just did not want to deal with that nightmare in
>> terms of managing the patchset. And honestly I think it'd be OK to defer
>> that part until ->write_begin() needs to be changed for other reasons;
>> it's a lot of churn just for this particular thing, and dealing with the
>> magic pointer value is (at least to me) liveable.
> 
> ->write_begin() really should just go away, it is a horrible interface.
> Note that in the past it actually had a flags argument, but that got
> killed a while ago.
> 
>>> What if you dropped ext4 support instead? :D
>>
>> Hah, yes obviously that'd be a solution, then I'd need to drop btrfs as
>> well. And I would kind of prefer not doing that ;-)
> 
> Btrfs doesn't need it.  In fact the code would be cleaner and do less
> work without it, see the patch below.  And for ext4 there already is an
> iomap conversion patch series on the list that just needs more review,
> so skipping it here and growing the uncached support through that sounds
> sensible.

I can certainly defer the ext4 side if the below sorts out btrfs and
that iomap conversion series is making progress. I don't have an issue
slotting in behind that.

I'll check and test your btrfs tweak, thanks!
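
For anyone reading along without the patches handy: the "magic pointer
value" being discussed is roughly a sentinel passed via the foliop
argument of ->write_begin(). A hypothetical sketch, with names and
values that are illustrative rather than taken from the patch:

/*
 * Hypothetical sketch of the sentinel scheme under discussion; names
 * and values are illustrative, not copied verbatim from the patchset.
 */
#define foliop_uncached			((struct folio *)0x1)
#define foliop_is_uncached(foliop)	(*(foliop) == foliop_uncached)

/*
 * The caller primes *foliop with foliop_uncached before calling
 * ->write_begin(); an implementation that cares can then do
 *
 *	bool uncached = foliop_is_uncached(foliop);
 *
 * and otherwise treat *foliop exactly as before.
 */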

-- 
Jens Axboe


^ permalink raw reply	[flat|nested] 43+ messages in thread

* Re: [PATCHSET v6 0/12] Uncached buffered IO
  2024-12-12 15:48             ` Jens Axboe
@ 2024-12-12 16:59               ` Christoph Lameter (Ampere)
  2024-12-12 19:14                 ` Jens Axboe
  0 siblings, 1 reply; 43+ messages in thread
From: Christoph Lameter (Ampere) @ 2024-12-12 16:59 UTC (permalink / raw)
  To: Jens Axboe
  Cc: Christoph Hellwig, Darrick J. Wong, linux-mm, linux-fsdevel,
	hannes, clm, linux-kernel, willy, kirill, bfoster

On Thu, 12 Dec 2024, Jens Axboe wrote:

> On 12/10/24 4:11 AM, Christoph Hellwig wrote:
> > On Tue, Dec 03, 2024 at 09:52:41PM -0800, Darrick J. Wong wrote:
> >> <shrug> RWF_DONTCACHE, to match {I,DCACHE}_DONTCACHE ? ;)
> >>
> >> They sound pretty similar ("load this so I can do something with it,
> >> evict it immediately if possible") though I wouldn't rely on people
> >> outside the kernel being familiar with the existing dontcaches.
> >
> > FYI, another word for dontcache.  uncached just has too many connotations
> > in the kernel context.
>
> Sure, we can go with DONTCACHE instead. Only thing I don't like about
> that is that you can use uncached as a verb and adjective, eg talking
> about uncached IO. Talking about dontcached IO sounds pretty weird.
>
> As I've said previously in this and other threads, I don't feel too
> strongly about the in-kernel naming, I care more about the exposed
> name. And uncached does seem to be the most descriptive and most
> easily understandable by users.

The page is cached while the operation is ongoing. "Transitory" would be
more accurate and it is a new term that was not used with pages before.




^ permalink raw reply	[flat|nested] 43+ messages in thread

* Re: [PATCHSET v6 0/12] Uncached buffered IO
  2024-12-12 16:59               ` Christoph Lameter (Ampere)
@ 2024-12-12 19:14                 ` Jens Axboe
  2024-12-12 19:35                   ` Matthew Wilcox
  0 siblings, 1 reply; 43+ messages in thread
From: Jens Axboe @ 2024-12-12 19:14 UTC (permalink / raw)
  To: Christoph Lameter (Ampere)
  Cc: Christoph Hellwig, Darrick J. Wong, linux-mm, linux-fsdevel,
	hannes, clm, linux-kernel, willy, kirill, bfoster

On 12/12/24 9:59 AM, Christoph Lameter (Ampere) wrote:
> On Thu, 12 Dec 2024, Jens Axboe wrote:
> 
>> On 12/10/24 4:11 AM, Christoph Hellwig wrote:
>>> On Tue, Dec 03, 2024 at 09:52:41PM -0800, Darrick J. Wong wrote:
>>>> <shrug> RWF_DONTCACHE, to match {I,DCACHE}_DONTCACHE ? ;)
>>>>
>>>> They sound pretty similar ("load this so I can do something with it,
>>>> evict it immediately if possible") though I wouldn't rely on people
>>>> outside the kernel being familiar with the existing dontcaches.
>>>
>>> FYI, another word for dontcache.  uncached just has too many connotations
>>> in the kernel context.
>>
>> Sure, we can go with DONTCACHE instead. Only thing I don't like about
>> that is that you can use uncached as a verb and adjective, eg talking
>> about uncached IO. Talking about dontcached IO sounds pretty weird.
>>
>> As I've said previously in this and other threads, I don't feel too
>> strongly about the in-kernel naming, I care more about the exposed
>> name. And uncached does seem to be the most descriptive and most
>> easily understandable by users.
> 
> The page is cached while the operation is ongoing. "Transitory" would be
> more accurate and it is a new term that was not used with pages before.

Like I mentioned earlier, the fact that it's cached for the duration of
the operation is more of an implementation detail that developers need
not worry about. What's important is that it's not cached AFTER. I still
feel UNCACHED is the best description, but I'll change it to DONTCACHE
for the next version just to avoid the overlap with other in-kernel
uses.
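
For the exposed side, what userspace ends up with is just another
per-IO RWF_* flag to preadv2()/pwritev2(). A minimal sketch, assuming
the flag is exported as RWF_DONTCACHE in the uapi headers (the numeric
value below is a placeholder, take the real one from linux/fs.h):

#define _GNU_SOURCE
#include <sys/types.h>
#include <sys/uio.h>
#include <unistd.h>

#ifndef RWF_DONTCACHE
#define RWF_DONTCACHE	0x00000080	/* placeholder; use the value from linux/fs.h */
#endif

/* Buffered write that asks the kernel to prune the cached range after writeback */
static ssize_t write_dontcache(int fd, const void *buf, size_t len, off_t off)
{
	struct iovec iov = { .iov_base = (void *)buf, .iov_len = len };

	return pwritev2(fd, &iov, 1, off, RWF_DONTCACHE);
}

A filesystem that doesn't opt in fails such a call with -EOPNOTSUPP, so
callers can simply retry without the flag.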

-- 
Jens Axboe


^ permalink raw reply	[flat|nested] 43+ messages in thread

* Re: [PATCHSET v6 0/12] Uncached buffered IO
  2024-12-12 19:14                 ` Jens Axboe
@ 2024-12-12 19:35                   ` Matthew Wilcox
  2024-12-12 19:36                     ` Jens Axboe
                                       ` (2 more replies)
  0 siblings, 3 replies; 43+ messages in thread
From: Matthew Wilcox @ 2024-12-12 19:35 UTC (permalink / raw)
  To: Jens Axboe
  Cc: Christoph Lameter (Ampere),
	Christoph Hellwig, Darrick J. Wong, linux-mm, linux-fsdevel,
	hannes, clm, linux-kernel, kirill, bfoster

On Thu, Dec 12, 2024 at 12:14:23PM -0700, Jens Axboe wrote:
> Like I mentioned earlier, the fact that it's cached for the duration of
> the operation is more of an implementation detail that developers need
> not worry about. What's important is that it's not cached AFTER. I still
> feel UNCACHED is the best description, but I'll change it to DONTCACHE
> for the next version just to avoid the overlap with other in-kernel
> uses.

Regardless of the user API name, I like PG_streaming for the folio
flag name.


^ permalink raw reply	[flat|nested] 43+ messages in thread

* Re: [PATCHSET v6 0/12] Uncached buffered IO
  2024-12-12 19:35                   ` Matthew Wilcox
@ 2024-12-12 19:36                     ` Jens Axboe
  2024-12-12 20:06                     ` Christoph Lameter (Ampere)
  2024-12-13  5:04                     ` Johannes Weiner
  2 siblings, 0 replies; 43+ messages in thread
From: Jens Axboe @ 2024-12-12 19:36 UTC (permalink / raw)
  To: Matthew Wilcox
  Cc: Christoph Lameter (Ampere),
	Christoph Hellwig, Darrick J. Wong, linux-mm, linux-fsdevel,
	hannes, clm, linux-kernel, kirill, bfoster

On 12/12/24 12:35 PM, Matthew Wilcox wrote:
> On Thu, Dec 12, 2024 at 12:14:23PM -0700, Jens Axboe wrote:
>> Like I mentioned earlier, the fact that it's cached for the duration of
>> the operation is more of an implementation detail that developers need
>> not worry about. What's important is that it's not cached AFTER. I still
>> feel UNCACHED is the best description, but I'll change it to DONTCACHE
>> for the next version just to avoid the overlap with other in-kernel
>> uses.
> 
> Regardless of the user API name, I like PG_streaming for the folio
> flag name.

Sure, I can make that change.

-- 
Jens Axboe



^ permalink raw reply	[flat|nested] 43+ messages in thread

* Re: [PATCH 07/12] fs: add RWF_UNCACHED iocb and FOP_UNCACHED file_operations flag
  2024-12-10 11:22   ` Christoph Hellwig
@ 2024-12-12 19:42     ` Jens Axboe
  0 siblings, 0 replies; 43+ messages in thread
From: Jens Axboe @ 2024-12-12 19:42 UTC (permalink / raw)
  To: Christoph Hellwig
  Cc: linux-mm, linux-fsdevel, hannes, clm, linux-kernel, willy,
	kirill, bfoster

On 12/10/24 4:22 AM, Christoph Hellwig wrote:
> On Tue, Dec 03, 2024 at 08:31:43AM -0700, Jens Axboe wrote:
>> +	if (flags & RWF_UNCACHED) {
>> +		/* file system must support it */
>> +		if (!(ki->ki_filp->f_op->fop_flags & FOP_UNCACHED))
>> +			return -EOPNOTSUPP;
>> +		/* DAX mappings not supported */
>> +		if (IS_DAX(ki->ki_filp->f_mapping->host))
>> +			return -EOPNOTSUPP;
> 
> I'd argue that DAX is always uncached and could just ignore the flag.
> Same for direct I/O.

It's more of a safeguard, since the invalidation requires extra work
for DAX.
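
Just for context on the check quoted above, the per-filesystem opt-in
side is tiny; a hypothetical example (the filesystem name and the other
ops are placeholders, only FOP_UNCACHED is the flag from this series)
would look something like:

#include <linux/fs.h>

static const struct file_operations examplefs_file_operations = {
	.llseek		= generic_file_llseek,
	.read_iter	= generic_file_read_iter,
	.write_iter	= generic_file_write_iter,
	.mmap		= generic_file_mmap,
	.fop_flags	= FOP_UNCACHED,	/* advertise support for uncached IO */
};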

-- 
Jens Axboe



^ permalink raw reply	[flat|nested] 43+ messages in thread

* Re: [PATCHSET v6 0/12] Uncached buffered IO
  2024-12-12 19:35                   ` Matthew Wilcox
  2024-12-12 19:36                     ` Jens Axboe
@ 2024-12-12 20:06                     ` Christoph Lameter (Ampere)
  2024-12-13  5:04                     ` Johannes Weiner
  2 siblings, 0 replies; 43+ messages in thread
From: Christoph Lameter (Ampere) @ 2024-12-12 20:06 UTC (permalink / raw)
  To: Matthew Wilcox
  Cc: Jens Axboe, Christoph Hellwig, Darrick J. Wong, linux-mm,
	linux-fsdevel, hannes, clm, linux-kernel, kirill, bfoster

On Thu, 12 Dec 2024, Matthew Wilcox wrote:

> Regardless of the user API name, I like PG_streaming for the folio
> flag name.

Streaming I/O seems to be a good name for this ....



^ permalink raw reply	[flat|nested] 43+ messages in thread

* Re: [PATCH 06/12] mm/truncate: add folio_unmap_invalidate() helper
  2024-12-10 11:21   ` Christoph Hellwig
@ 2024-12-12 20:19     ` Jens Axboe
  0 siblings, 0 replies; 43+ messages in thread
From: Jens Axboe @ 2024-12-12 20:19 UTC (permalink / raw)
  To: Christoph Hellwig
  Cc: linux-mm, linux-fsdevel, hannes, clm, linux-kernel, willy,
	kirill, bfoster

On 12/10/24 4:21 AM, Christoph Hellwig wrote:
> On Tue, Dec 03, 2024 at 08:31:42AM -0700, Jens Axboe wrote:
>> Add a folio_unmap_invalidate() helper, which unmaps and invalidates a
>> given folio. The caller must already have locked the folio. Use this
>> new helper in invalidate_inode_pages2_range(), rather than duplicate
>> the code there.
> 
> This new helper ends up being the only caller of invalidate_complete_folio2,
> so you might as well merge the two instead of adding yet another
> invalidate/unmap helper; they are getting impossible to keep track of.

Sure, missed that it's the only caller now.

> Also it is only used in mm/, so add the prototype to mm/internal.h
> instead of the public pagemap.h.  And a little comment about what the
> function does would be pretty useful as well.

Good point, moved to internal.h instead.

>> In preparation for using this elsewhere as well, have it take a gfp_t
>> mask rather than assume GFP_KERNEL is the right choice. This bubbles
>> back to invalidate_complete_folio2() as well.
> 
> Looking at the callers, the gfp_t looks a bit odd to me, as it is
> either GFP_KERNEL or 0, which is a valid but rather unusable gfp_t
> value, but I guess this comes from filemap_release_folio, which
> works similarly.

Indeed
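
For reference, a sketch of how the helper's declaration would look once
it lives in mm/internal.h; the exact signature is whatever the next
version carries, and the mapping argument here is an assumption:

/* mm/internal.h (sketch): unmap and invalidate a locked folio */
int folio_unmap_invalidate(struct address_space *mapping, struct folio *folio,
			   gfp_t gfp);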

-- 
Jens Axboe


^ permalink raw reply	[flat|nested] 43+ messages in thread

* Re: [PATCHSET v6 0/12] Uncached buffered IO
  2024-12-12 19:35                   ` Matthew Wilcox
  2024-12-12 19:36                     ` Jens Axboe
  2024-12-12 20:06                     ` Christoph Lameter (Ampere)
@ 2024-12-13  5:04                     ` Johannes Weiner
  2024-12-13 14:49                       ` Jens Axboe
  2 siblings, 1 reply; 43+ messages in thread
From: Johannes Weiner @ 2024-12-13  5:04 UTC (permalink / raw)
  To: Matthew Wilcox
  Cc: Jens Axboe, Christoph Lameter (Ampere),
	Christoph Hellwig, Darrick J. Wong, linux-mm, linux-fsdevel, clm,
	linux-kernel, kirill, bfoster

On Thu, Dec 12, 2024 at 07:35:28PM +0000, Matthew Wilcox wrote:
> On Thu, Dec 12, 2024 at 12:14:23PM -0700, Jens Axboe wrote:
> > Like I mentioned earlier, the fact that it's cached for the duration of
> > the operation is more of an implementation detail that developers need
> > not worry about. What's important is that it's not cached AFTER. I still
> > feel UNCACHED is the best description, but I'll change it to DONTCACHE
> > for the next version just to avoid the overlap with other in-kernel
> > uses.
> 
> Regardless of the user API name, I like PG_streaming for the folio
> flag name.

If we're throwing names in the ring, I'm partial to PG_dropbehind.

It's a term I think has been used to describe this type of behavior
before; it juxtaposes nicely with readahead; it plainly names the
action of what will happen to the page after the current IO operation
against it has completed (i.e. pairs up with PG_reclaim).


^ permalink raw reply	[flat|nested] 43+ messages in thread

* Re: [PATCHSET v6 0/12] Uncached buffered IO
  2024-12-13  5:04                     ` Johannes Weiner
@ 2024-12-13 14:49                       ` Jens Axboe
  0 siblings, 0 replies; 43+ messages in thread
From: Jens Axboe @ 2024-12-13 14:49 UTC (permalink / raw)
  To: Johannes Weiner, Matthew Wilcox
  Cc: Christoph Lameter (Ampere),
	Christoph Hellwig, Darrick J. Wong, linux-mm, linux-fsdevel, clm,
	linux-kernel, kirill, bfoster

On 12/12/24 10:04 PM, Johannes Weiner wrote:
> On Thu, Dec 12, 2024 at 07:35:28PM +0000, Matthew Wilcox wrote:
>> On Thu, Dec 12, 2024 at 12:14:23PM -0700, Jens Axboe wrote:
>>> Like I mentioned earlier, the fact that it's cached for the duration of
>>> the operation is more of an implementation detail that developers need
>>> not worry about. What's important is that it's not cached AFTER. I still
>>> feel UNCACHED is the best description, but I'll change it to DONTCACHE
>>> for the next version just to avoid the overlap with other in-kernel
>>> uses.
>>
>> Regardless of the user API name, I like PG_streaming for the folio
>> flag name.
> 
> If we're throwing names in the ring, I'm partial to PG_dropbehind.
> 
> It's a term I think has been used to describe this type of behavior
> before; it juxtaposes nicely with readahead; it plainly names the
> action of what will happen to the page after the current IO operation
> against it has completed (i.e. pairs up with PG_reclaim).

True, I do think that's a good name for the folio flag. Streaming isn't
bad, but it's not fully descriptive, as the IO may not be streaming at
all, depending on the use case. I do remember when we used dropbehind
naming in the vm, probably some 20 years ago?

If there are no objections to this, I'll change the folio flag to
dropbehind. Also looks nicer with the bit operations on the folio, when
you have:

if (flags & RWF_DONTCACHE)
	folio_set_dropbehind(folio);

rather than:

if (flags & RWF_DONTCACHE)
	folio_set_streaming(folio);

and so forth, as the former just intuitively makes sense.

-- 
Jens Axboe


^ permalink raw reply	[flat|nested] 43+ messages in thread

end of thread, other threads:[~2024-12-13 14:50 UTC | newest]

Thread overview: 43+ messages
2024-12-03 15:31 [PATCHSET v6 0/12] Uncached buffered IO Jens Axboe
2024-12-03 15:31 ` [PATCH 01/12] mm/filemap: change filemap_create_folio() to take a struct kiocb Jens Axboe
2024-12-10 11:13   ` Christoph Hellwig
2024-12-12 15:49     ` Jens Axboe
2024-12-03 15:31 ` [PATCH 02/12] mm/readahead: add folio allocation helper Jens Axboe
2024-12-03 15:31 ` [PATCH 03/12] mm: add PG_uncached page flag Jens Axboe
2024-12-03 15:31 ` [PATCH 04/12] mm/readahead: add readahead_control->uncached member Jens Axboe
2024-12-03 15:31 ` [PATCH 05/12] mm/filemap: use page_cache_sync_ra() to kick off read-ahead Jens Axboe
2024-12-10 11:15   ` Christoph Hellwig
2024-12-03 15:31 ` [PATCH 06/12] mm/truncate: add folio_unmap_invalidate() helper Jens Axboe
2024-12-10 11:21   ` Christoph Hellwig
2024-12-12 20:19     ` Jens Axboe
2024-12-03 15:31 ` [PATCH 07/12] fs: add RWF_UNCACHED iocb and FOP_UNCACHED file_operations flag Jens Axboe
2024-12-06 17:35   ` Darrick J. Wong
2024-12-10 11:22   ` Christoph Hellwig
2024-12-12 19:42     ` Jens Axboe
2024-12-03 15:31 ` [PATCH 08/12] mm/filemap: add read support for RWF_UNCACHED Jens Axboe
2024-12-03 15:31 ` [PATCH 09/12] mm/filemap: drop uncached pages when writeback completes Jens Axboe
2024-12-03 15:31 ` [PATCH 10/12] mm/filemap: add filemap_fdatawrite_range_kick() helper Jens Axboe
2024-12-03 15:31 ` [PATCH 11/12] mm/filemap: make buffered writes work with RWF_UNCACHED Jens Axboe
2024-12-06 17:17   ` Darrick J. Wong
2024-12-06 18:22     ` Jens Axboe
2024-12-10 11:31       ` Christoph Hellwig
2024-12-12 15:51         ` Jens Axboe
2024-12-03 15:31 ` [PATCH 12/12] mm: add FGP_UNCACHED folio creation flag Jens Axboe
2024-12-03 18:23 ` [PATCHSET v6 0/12] Uncached buffered IO Christoph Lameter (Ampere)
2024-12-03 21:06   ` Jens Axboe
2024-12-03 22:16     ` Christoph Lameter (Ampere)
2024-12-03 22:41       ` Jens Axboe
2024-12-04  5:52         ` Darrick J. Wong
2024-12-04 16:36           ` Jens Axboe
2024-12-10 11:11           ` Christoph Hellwig
2024-12-12 15:48             ` Jens Axboe
2024-12-12 16:59               ` Christoph Lameter (Ampere)
2024-12-12 19:14                 ` Jens Axboe
2024-12-12 19:35                   ` Matthew Wilcox
2024-12-12 19:36                     ` Jens Axboe
2024-12-12 20:06                     ` Christoph Lameter (Ampere)
2024-12-13  5:04                     ` Johannes Weiner
2024-12-13 14:49                       ` Jens Axboe
2024-12-06 17:37 ` Darrick J. Wong
2024-12-10  9:48 ` Bharata B Rao
2024-12-12 15:46   ` Jens Axboe
