linux-mm.kvack.org archive mirror
* [PATCHSET v4] Uncached buffered IO
@ 2024-11-08 17:43 Jens Axboe
  2024-11-08 17:43 ` [PATCH 01/13] mm/filemap: change filemap_create_folio() to take a struct kiocb Jens Axboe
                   ` (13 more replies)
  0 siblings, 14 replies; 36+ messages in thread
From: Jens Axboe @ 2024-11-08 17:43 UTC (permalink / raw)
  To: linux-mm, linux-fsdevel; +Cc: hannes, clm, linux-kernel

Hi,

5 years ago I posted patches adding support for RWF_UNCACHED, as a way
to do buffered IO that isn't page cache persistent. The approach back
then was to have private pages for IO, and then get rid of them once IO
was done. But that then runs into all the issues that O_DIRECT has, in
terms of synchronizing with the page cache.

So here's a new approach to the same concept, but using the page cache
as synchronization. That makes RWF_UNCACHED less special, in that it's
just page cache IO, except it prunes the ranges once IO is completed.

Why do this, you may ask? The tldr is that device speeds are only
getting faster, while reclaim is not. Doing normal buffered IO can be
very unpredictable, and suck up a lot of resources on the reclaim side.
This leads people to use O_DIRECT as a work-around, which has its own
set of restrictions in terms of size, offset, and length of IO. It's
also inherently synchronous, and now you need async IO as well. While
the latter isn't necessarily a big problem as we have good options
available there, it also should not be a requirement when all you want
to do is read or write some data without caching.

Even on desktop type systems, a normal NVMe device can fill the entire
page cache in seconds. On the big system I used for testing, there's a
lot more RAM, but also a lot more devices. As can be seen in some of the
results in the following patches, you can still fill RAM in seconds even
when there's 1TB of it. Hence this problem isn't solely a "big
hyperscaler system" issue, it's common across the board. Normal users
do big backups too, edit videos, etc.

Reads and writes with RWF_UNCACHED both go through the page cache for
IO. Reads work just like a normal buffered read would,
with the only exception being that the touched ranges will get pruned
after data has been copied. For writes, the ranges will get writeback
kicked off before the syscall returns, and then writeback completion
will prune the range. Hence writes aren't synchronous, and it's easy to
pipeline writes using RWF_UNCACHED.

File systems need to support this. The patches add support for the
generic filemap helpers, and for iomap. Then ext4 and XFS are marked as
supporting it. The amount of code here is really trivial, and the only
reason the fs opt-in is necessary is to have an RWF_UNCACHED IO return
-EOPNOTSUPP just in case the fs doesn't use either the generic paths or
iomap. Adding "support" to other file systems should be trivial, most of
the time just a one-liner adding FOP_UNCACHED to the fop_flags in the
file_operations struct.
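
As a concrete illustration, the per-filesystem opt-in described above
would look roughly like the following. This is a sketch only, not the
actual ext4/XFS diff from patches 12 and 13; "myfs" and the handlers
shown are placeholders:

```c
/*
 * Sketch only: a file system advertises uncached buffered IO support
 * by OR-ing FOP_UNCACHED (added in patch 7) into its fop_flags. The
 * name "myfs" and the handlers are placeholders, not the real change.
 */
static const struct file_operations myfs_file_operations = {
	.read_iter	= generic_file_read_iter,
	.write_iter	= generic_file_write_iter,
	/* ... other methods unchanged ... */
	.fop_flags	= FOP_UNCACHED,
};
```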

Performance results are in patch 8 for reads and patch 10 for writes,
with the tldr being that I see about a 65% improvement in performance
for both, with fully predictable IO times. CPU reduction is substantial
as well, with no kswapd activity at all for reclaim when using uncached
IO.

Using it from applications is trivial - just set RWF_UNCACHED for the
read or write, using pwritev2(2) or preadv2(2). For io_uring, same
thing, just set RWF_UNCACHED in sqe->rw_flags for a buffered read/write
operation. And that's it.

The goal with this patchset was to make it less special than before. I
think if you look at the diffstat you'll agree that this is the case.

Patches 1..7 are just prep patches, and should have no functional
changes at all. Patch 8 adds support for the filemap path for
RWF_UNCACHED reads, patch 10 adds support for filemap RWF_UNCACHED
writes, and patch 11 adds iomap support for uncached writes. Finally,
patches 12 and 13 do the simple one-liner wiring up for ext4 and XFS.

Git tree can be found here:

https://git.kernel.dk/cgit/linux/log/?h=buffered-uncached.4

 fs/ext4/file.c                 |  2 +-
 fs/iomap/buffered-io.c         | 12 ++++++++-
 fs/xfs/xfs_file.c              |  3 ++-
 include/linux/fs.h             | 10 +++++++-
 include/linux/iomap.h          |  3 ++-
 include/linux/page-flags.h     |  5 ++++
 include/linux/pagemap.h        |  3 +++
 include/trace/events/mmflags.h |  3 ++-
 include/uapi/linux/fs.h        |  6 ++++-
 mm/filemap.c                   | 58 ++++++++++++++++++++++++++++++++++--------
 mm/readahead.c                 | 22 ++++++++++++----
 mm/swap.c                      |  2 ++
 mm/truncate.c                  |  9 ++++---
 13 files changed, 111 insertions(+), 27 deletions(-)

-- 
Jens Axboe



^ permalink raw reply	[flat|nested] 36+ messages in thread

* [PATCH 01/13] mm/filemap: change filemap_create_folio() to take a struct kiocb
  2024-11-08 17:43 [PATCHSET v4] Uncached buffered IO Jens Axboe
@ 2024-11-08 17:43 ` Jens Axboe
  2024-11-08 18:18   ` Matthew Wilcox
  2024-11-08 17:43 ` [PATCH 02/13] mm/readahead: add folio allocation helper Jens Axboe
                   ` (12 subsequent siblings)
  13 siblings, 1 reply; 36+ messages in thread
From: Jens Axboe @ 2024-11-08 17:43 UTC (permalink / raw)
  To: linux-mm, linux-fsdevel; +Cc: hannes, clm, linux-kernel, Jens Axboe

Rather than pass in both the file and position directly from the kiocb,
just take a struct kiocb instead. In preparation for actually needing
the kiocb in the function.

No functional changes in this patch.

Signed-off-by: Jens Axboe <axboe@kernel.dk>
---
 mm/filemap.c | 12 ++++++------
 1 file changed, 6 insertions(+), 6 deletions(-)

diff --git a/mm/filemap.c b/mm/filemap.c
index 36d22968be9a..2ae26a0f961b 100644
--- a/mm/filemap.c
+++ b/mm/filemap.c
@@ -2460,9 +2460,8 @@ static int filemap_update_page(struct kiocb *iocb,
 	return error;
 }
 
-static int filemap_create_folio(struct file *file,
-		struct address_space *mapping, loff_t pos,
-		struct folio_batch *fbatch)
+static int filemap_create_folio(struct kiocb *iocb,
+		struct address_space *mapping, struct folio_batch *fbatch)
 {
 	struct folio *folio;
 	int error;
@@ -2487,7 +2486,7 @@ static int filemap_create_folio(struct file *file,
 	 * well to keep locking rules simple.
 	 */
 	filemap_invalidate_lock_shared(mapping);
-	index = (pos >> (PAGE_SHIFT + min_order)) << min_order;
+	index = (iocb->ki_pos >> (PAGE_SHIFT + min_order)) << min_order;
 	error = filemap_add_folio(mapping, folio, index,
 			mapping_gfp_constraint(mapping, GFP_KERNEL));
 	if (error == -EEXIST)
@@ -2495,7 +2494,8 @@ static int filemap_create_folio(struct file *file,
 	if (error)
 		goto error;
 
-	error = filemap_read_folio(file, mapping->a_ops->read_folio, folio);
+	error = filemap_read_folio(iocb->ki_filp, mapping->a_ops->read_folio,
+					folio);
 	if (error)
 		goto error;
 
@@ -2553,7 +2553,7 @@ static int filemap_get_pages(struct kiocb *iocb, size_t count,
 	if (!folio_batch_count(fbatch)) {
 		if (iocb->ki_flags & (IOCB_NOWAIT | IOCB_WAITQ))
 			return -EAGAIN;
-		err = filemap_create_folio(filp, mapping, iocb->ki_pos, fbatch);
+		err = filemap_create_folio(iocb, mapping, fbatch);
 		if (err == AOP_TRUNCATED_PAGE)
 			goto retry;
 		return err;
-- 
2.45.2




* [PATCH 02/13] mm/readahead: add folio allocation helper
  2024-11-08 17:43 [PATCHSET v4] Uncached buffered IO Jens Axboe
  2024-11-08 17:43 ` [PATCH 01/13] mm/filemap: change filemap_create_folio() to take a struct kiocb Jens Axboe
@ 2024-11-08 17:43 ` Jens Axboe
  2024-11-08 17:43 ` [PATCH 03/13] mm: add PG_uncached page flag Jens Axboe
                   ` (11 subsequent siblings)
  13 siblings, 0 replies; 36+ messages in thread
From: Jens Axboe @ 2024-11-08 17:43 UTC (permalink / raw)
  To: linux-mm, linux-fsdevel; +Cc: hannes, clm, linux-kernel, Jens Axboe

Just a wrapper around filemap_alloc_folio() for now, but add it in
preparation for modifying the folio based on the 'ractl' being passed
in.

No functional changes in this patch.

Signed-off-by: Jens Axboe <axboe@kernel.dk>
---
 mm/readahead.c | 16 +++++++++++-----
 1 file changed, 11 insertions(+), 5 deletions(-)

diff --git a/mm/readahead.c b/mm/readahead.c
index 3dc6c7a128dd..003cfe79880d 100644
--- a/mm/readahead.c
+++ b/mm/readahead.c
@@ -188,6 +188,12 @@ static void read_pages(struct readahead_control *rac)
 	BUG_ON(readahead_count(rac));
 }
 
+static struct folio *ractl_alloc_folio(struct readahead_control *ractl,
+				       gfp_t gfp_mask, unsigned int order)
+{
+	return filemap_alloc_folio(gfp_mask, order);
+}
+
 /**
  * page_cache_ra_unbounded - Start unchecked readahead.
  * @ractl: Readahead control.
@@ -260,8 +266,8 @@ void page_cache_ra_unbounded(struct readahead_control *ractl,
 			continue;
 		}
 
-		folio = filemap_alloc_folio(gfp_mask,
-					    mapping_min_folio_order(mapping));
+		folio = ractl_alloc_folio(ractl, gfp_mask,
+					mapping_min_folio_order(mapping));
 		if (!folio)
 			break;
 
@@ -431,7 +437,7 @@ static inline int ra_alloc_folio(struct readahead_control *ractl, pgoff_t index,
 		pgoff_t mark, unsigned int order, gfp_t gfp)
 {
 	int err;
-	struct folio *folio = filemap_alloc_folio(gfp, order);
+	struct folio *folio = ractl_alloc_folio(ractl, gfp, order);
 
 	if (!folio)
 		return -ENOMEM;
@@ -753,7 +759,7 @@ void readahead_expand(struct readahead_control *ractl,
 		if (folio && !xa_is_value(folio))
 			return; /* Folio apparently present */
 
-		folio = filemap_alloc_folio(gfp_mask, min_order);
+		folio = ractl_alloc_folio(ractl, gfp_mask, min_order);
 		if (!folio)
 			return;
 
@@ -782,7 +788,7 @@ void readahead_expand(struct readahead_control *ractl,
 		if (folio && !xa_is_value(folio))
 			return; /* Folio apparently present */
 
-		folio = filemap_alloc_folio(gfp_mask, min_order);
+		folio = ractl_alloc_folio(ractl, gfp_mask, min_order);
 		if (!folio)
 			return;
 
-- 
2.45.2




* [PATCH 03/13] mm: add PG_uncached page flag
  2024-11-08 17:43 [PATCHSET v4] Uncached buffered IO Jens Axboe
  2024-11-08 17:43 ` [PATCH 01/13] mm/filemap: change filemap_create_folio() to take a struct kiocb Jens Axboe
  2024-11-08 17:43 ` [PATCH 02/13] mm/readahead: add folio allocation helper Jens Axboe
@ 2024-11-08 17:43 ` Jens Axboe
  2024-11-08 19:25   ` Kirill A. Shutemov
  2024-11-08 17:43 ` [PATCH 04/13] mm/readahead: add readahead_control->uncached member Jens Axboe
                   ` (10 subsequent siblings)
  13 siblings, 1 reply; 36+ messages in thread
From: Jens Axboe @ 2024-11-08 17:43 UTC (permalink / raw)
  To: linux-mm, linux-fsdevel; +Cc: hannes, clm, linux-kernel, Jens Axboe

Add a page flag that file IO can use to indicate that the IO being done
is uncached, as in it should not persist in the page cache after the IO
has been completed.

Signed-off-by: Jens Axboe <axboe@kernel.dk>
---
 include/linux/page-flags.h     | 5 +++++
 include/trace/events/mmflags.h | 3 ++-
 2 files changed, 7 insertions(+), 1 deletion(-)

diff --git a/include/linux/page-flags.h b/include/linux/page-flags.h
index cc839e4365c1..3c4003495929 100644
--- a/include/linux/page-flags.h
+++ b/include/linux/page-flags.h
@@ -110,6 +110,7 @@ enum pageflags {
 	PG_reclaim,		/* To be reclaimed asap */
 	PG_swapbacked,		/* Page is backed by RAM/swap */
 	PG_unevictable,		/* Page is "unevictable"  */
+	PG_uncached,		/* uncached read/write IO */
 #ifdef CONFIG_MMU
 	PG_mlocked,		/* Page is vma mlocked */
 #endif
@@ -562,6 +563,10 @@ PAGEFLAG(Reclaim, reclaim, PF_NO_TAIL)
 FOLIO_FLAG(readahead, FOLIO_HEAD_PAGE)
 	FOLIO_TEST_CLEAR_FLAG(readahead, FOLIO_HEAD_PAGE)
 
+FOLIO_FLAG(uncached, FOLIO_HEAD_PAGE)
+	FOLIO_TEST_CLEAR_FLAG(uncached, FOLIO_HEAD_PAGE)
+	__FOLIO_SET_FLAG(uncached, FOLIO_HEAD_PAGE)
+
 #ifdef CONFIG_HIGHMEM
 /*
  * Must use a macro here due to header dependency issues. page_zone() is not
diff --git a/include/trace/events/mmflags.h b/include/trace/events/mmflags.h
index bb8a59c6caa2..b60057284102 100644
--- a/include/trace/events/mmflags.h
+++ b/include/trace/events/mmflags.h
@@ -116,7 +116,8 @@
 	DEF_PAGEFLAG_NAME(head),					\
 	DEF_PAGEFLAG_NAME(reclaim),					\
 	DEF_PAGEFLAG_NAME(swapbacked),					\
-	DEF_PAGEFLAG_NAME(unevictable)					\
+	DEF_PAGEFLAG_NAME(unevictable),					\
+	DEF_PAGEFLAG_NAME(uncached)					\
 IF_HAVE_PG_MLOCK(mlocked)						\
 IF_HAVE_PG_HWPOISON(hwpoison)						\
 IF_HAVE_PG_IDLE(idle)							\
-- 
2.45.2




* [PATCH 04/13] mm/readahead: add readahead_control->uncached member
  2024-11-08 17:43 [PATCHSET v4] Uncached buffered IO Jens Axboe
                   ` (2 preceding siblings ...)
  2024-11-08 17:43 ` [PATCH 03/13] mm: add PG_uncached page flag Jens Axboe
@ 2024-11-08 17:43 ` Jens Axboe
  2024-11-08 18:21   ` Matthew Wilcox
  2024-11-08 17:43 ` [PATCH 05/13] mm/filemap: use page_cache_sync_ra() to kick off read-ahead Jens Axboe
                   ` (9 subsequent siblings)
  13 siblings, 1 reply; 36+ messages in thread
From: Jens Axboe @ 2024-11-08 17:43 UTC (permalink / raw)
  To: linux-mm, linux-fsdevel; +Cc: hannes, clm, linux-kernel, Jens Axboe

If ractl->uncached is set to true, then folios created are marked as
uncached as well.

Signed-off-by: Jens Axboe <axboe@kernel.dk>
---
 include/linux/pagemap.h | 1 +
 mm/readahead.c          | 8 +++++++-
 2 files changed, 8 insertions(+), 1 deletion(-)

diff --git a/include/linux/pagemap.h b/include/linux/pagemap.h
index 68a5f1ff3301..8afacb7520d4 100644
--- a/include/linux/pagemap.h
+++ b/include/linux/pagemap.h
@@ -1350,6 +1350,7 @@ struct readahead_control {
 	pgoff_t _index;
 	unsigned int _nr_pages;
 	unsigned int _batch_count;
+	bool uncached;
 	bool _workingset;
 	unsigned long _pflags;
 };
diff --git a/mm/readahead.c b/mm/readahead.c
index 003cfe79880d..09cddbbfe28f 100644
--- a/mm/readahead.c
+++ b/mm/readahead.c
@@ -191,7 +191,13 @@ static void read_pages(struct readahead_control *rac)
 static struct folio *ractl_alloc_folio(struct readahead_control *ractl,
 				       gfp_t gfp_mask, unsigned int order)
 {
-	return filemap_alloc_folio(gfp_mask, order);
+	struct folio *folio;
+
+	folio = filemap_alloc_folio(gfp_mask, order);
+	if (folio && ractl->uncached)
+		folio_set_uncached(folio);
+
+	return folio;
 }
 
 /**
-- 
2.45.2




* [PATCH 05/13] mm/filemap: use page_cache_sync_ra() to kick off read-ahead
  2024-11-08 17:43 [PATCHSET v4] Uncached buffered IO Jens Axboe
                   ` (3 preceding siblings ...)
  2024-11-08 17:43 ` [PATCH 04/13] mm/readahead: add readahead_control->uncached member Jens Axboe
@ 2024-11-08 17:43 ` Jens Axboe
  2024-11-08 17:43 ` [PATCH 06/13] mm/truncate: make invalidate_complete_folio2() public Jens Axboe
                   ` (8 subsequent siblings)
  13 siblings, 0 replies; 36+ messages in thread
From: Jens Axboe @ 2024-11-08 17:43 UTC (permalink / raw)
  To: linux-mm, linux-fsdevel; +Cc: hannes, clm, linux-kernel, Jens Axboe

Rather than use the page_cache_sync_readahead() helper, define our own
ractl and use page_cache_sync_ra() directly. In preparation for needing
to modify ractl inside filemap_get_pages().

No functional changes in this patch.

Signed-off-by: Jens Axboe <axboe@kernel.dk>
---
 mm/filemap.c | 6 +++---
 1 file changed, 3 insertions(+), 3 deletions(-)

diff --git a/mm/filemap.c b/mm/filemap.c
index 2ae26a0f961b..7f8d13f06c04 100644
--- a/mm/filemap.c
+++ b/mm/filemap.c
@@ -2525,7 +2525,6 @@ static int filemap_get_pages(struct kiocb *iocb, size_t count,
 {
 	struct file *filp = iocb->ki_filp;
 	struct address_space *mapping = filp->f_mapping;
-	struct file_ra_state *ra = &filp->f_ra;
 	pgoff_t index = iocb->ki_pos >> PAGE_SHIFT;
 	pgoff_t last_index;
 	struct folio *folio;
@@ -2540,12 +2539,13 @@ static int filemap_get_pages(struct kiocb *iocb, size_t count,
 
 	filemap_get_read_batch(mapping, index, last_index - 1, fbatch);
 	if (!folio_batch_count(fbatch)) {
+		DEFINE_READAHEAD(ractl, filp, &filp->f_ra, mapping, index);
+
 		if (iocb->ki_flags & IOCB_NOIO)
 			return -EAGAIN;
 		if (iocb->ki_flags & IOCB_NOWAIT)
 			flags = memalloc_noio_save();
-		page_cache_sync_readahead(mapping, ra, filp, index,
-				last_index - index);
+		page_cache_sync_ra(&ractl, last_index - index);
 		if (iocb->ki_flags & IOCB_NOWAIT)
 			memalloc_noio_restore(flags);
 		filemap_get_read_batch(mapping, index, last_index - 1, fbatch);
-- 
2.45.2




* [PATCH 06/13] mm/truncate: make invalidate_complete_folio2() public
  2024-11-08 17:43 [PATCHSET v4] Uncached buffered IO Jens Axboe
                   ` (4 preceding siblings ...)
  2024-11-08 17:43 ` [PATCH 05/13] mm/filemap: use page_cache_sync_ra() to kick off read-ahead Jens Axboe
@ 2024-11-08 17:43 ` Jens Axboe
  2024-11-08 17:43 ` [PATCH 07/13] fs: add FOP_UNCACHED flag Jens Axboe
                   ` (7 subsequent siblings)
  13 siblings, 0 replies; 36+ messages in thread
From: Jens Axboe @ 2024-11-08 17:43 UTC (permalink / raw)
  To: linux-mm, linux-fsdevel; +Cc: hannes, clm, linux-kernel, Jens Axboe

Make invalidate_complete_folio2() publicly available, and have it
take a gfp_t mask as well rather than hardcode GFP_KERNEL. The only
caller just passes in GFP_KERNEL, no functional changes in this patch.

Signed-off-by: Jens Axboe <axboe@kernel.dk>
---
 include/linux/pagemap.h | 2 ++
 mm/truncate.c           | 9 +++++----
 2 files changed, 7 insertions(+), 4 deletions(-)

diff --git a/include/linux/pagemap.h b/include/linux/pagemap.h
index 8afacb7520d4..0122b3fbe2ac 100644
--- a/include/linux/pagemap.h
+++ b/include/linux/pagemap.h
@@ -34,6 +34,8 @@ int kiocb_invalidate_pages(struct kiocb *iocb, size_t count);
 void kiocb_invalidate_post_direct_write(struct kiocb *iocb, size_t count);
 int filemap_invalidate_pages(struct address_space *mapping,
 			     loff_t pos, loff_t end, bool nowait);
+int invalidate_complete_folio2(struct address_space *mapping,
+				struct folio *folio, gfp_t gfp_mask);
 
 int write_inode_now(struct inode *, int sync);
 int filemap_fdatawrite(struct address_space *);
diff --git a/mm/truncate.c b/mm/truncate.c
index 0668cd340a46..e084f7aa9370 100644
--- a/mm/truncate.c
+++ b/mm/truncate.c
@@ -546,13 +546,13 @@ EXPORT_SYMBOL(invalidate_mapping_pages);
  * shrink_folio_list() has a temp ref on them, or because they're transiently
  * sitting in the folio_add_lru() caches.
  */
-static int invalidate_complete_folio2(struct address_space *mapping,
-					struct folio *folio)
+int invalidate_complete_folio2(struct address_space *mapping,
+				struct folio *folio, gfp_t gfp_mask)
 {
 	if (folio->mapping != mapping)
 		return 0;
 
-	if (!filemap_release_folio(folio, GFP_KERNEL))
+	if (!filemap_release_folio(folio, gfp_mask))
 		return 0;
 
 	spin_lock(&mapping->host->i_lock);
@@ -650,7 +650,8 @@ int invalidate_inode_pages2_range(struct address_space *mapping,
 
 			ret2 = folio_launder(mapping, folio);
 			if (ret2 == 0) {
-				if (!invalidate_complete_folio2(mapping, folio))
+				if (!invalidate_complete_folio2(mapping, folio,
+								GFP_KERNEL))
 					ret2 = -EBUSY;
 			}
 			if (ret2 < 0)
-- 
2.45.2




* [PATCH 07/13] fs: add FOP_UNCACHED flag
  2024-11-08 17:43 [PATCHSET v4] Uncached buffered IO Jens Axboe
                   ` (5 preceding siblings ...)
  2024-11-08 17:43 ` [PATCH 06/13] mm/truncate: make invalidate_complete_folio2() public Jens Axboe
@ 2024-11-08 17:43 ` Jens Axboe
  2024-11-08 18:27   ` Matthew Wilcox
  2024-11-08 17:43 ` [PATCH 08/13] fs: add read support for RWF_UNCACHED Jens Axboe
                   ` (6 subsequent siblings)
  13 siblings, 1 reply; 36+ messages in thread
From: Jens Axboe @ 2024-11-08 17:43 UTC (permalink / raw)
  To: linux-mm, linux-fsdevel; +Cc: hannes, clm, linux-kernel, Jens Axboe

If a file system supports uncached buffered IO, it may set FOP_UNCACHED
and enable RWF_UNCACHED. If RWF_UNCACHED is attempted without the file
system supporting it, the IO fails with -EOPNOTSUPP.

Signed-off-by: Jens Axboe <axboe@kernel.dk>
---
 include/linux/fs.h | 6 ++++++
 1 file changed, 6 insertions(+)

diff --git a/include/linux/fs.h b/include/linux/fs.h
index 3559446279c1..491eeb73e725 100644
--- a/include/linux/fs.h
+++ b/include/linux/fs.h
@@ -2116,6 +2116,8 @@ struct file_operations {
 #define FOP_HUGE_PAGES		((__force fop_flags_t)(1 << 4))
 /* Treat loff_t as unsigned (e.g., /dev/mem) */
 #define FOP_UNSIGNED_OFFSET	((__force fop_flags_t)(1 << 5))
+/* File system supports uncached read/write buffered IO */
+#define FOP_UNCACHED		((__force fop_flags_t)(1 << 6))
 
 /* Wrap a directory iterator that needs exclusive inode access */
 int wrap_directory_iterator(struct file *, struct dir_context *,
@@ -3532,6 +3534,10 @@ static inline int kiocb_set_rw_flags(struct kiocb *ki, rwf_t flags,
 		if (!(ki->ki_filp->f_mode & FMODE_CAN_ATOMIC_WRITE))
 			return -EOPNOTSUPP;
 	}
+	if (flags & RWF_UNCACHED) {
+		if (!(ki->ki_filp->f_op->fop_flags & FOP_UNCACHED))
+			return -EOPNOTSUPP;
+	}
 	kiocb_flags |= (__force int) (flags & RWF_SUPPORTED);
 	if (flags & RWF_SYNC)
 		kiocb_flags |= IOCB_DSYNC;
-- 
2.45.2




* [PATCH 08/13] fs: add read support for RWF_UNCACHED
  2024-11-08 17:43 [PATCHSET v4] Uncached buffered IO Jens Axboe
                   ` (6 preceding siblings ...)
  2024-11-08 17:43 ` [PATCH 07/13] fs: add FOP_UNCACHED flag Jens Axboe
@ 2024-11-08 17:43 ` Jens Axboe
  2024-11-08 18:33   ` Matthew Wilcox
  2024-11-11 13:04   ` Stefan Metzmacher
  2024-11-08 17:43 ` [PATCH 09/13] mm: drop uncached pages when writeback completes Jens Axboe
                   ` (5 subsequent siblings)
  13 siblings, 2 replies; 36+ messages in thread
From: Jens Axboe @ 2024-11-08 17:43 UTC (permalink / raw)
  To: linux-mm, linux-fsdevel; +Cc: hannes, clm, linux-kernel, Jens Axboe

Add RWF_UNCACHED as a read operation flag, which means that any data
read will be removed from the page cache upon completion. Uses the page
cache to synchronize, and simply prunes folios that were instantiated
when the operation completes. While it would be possible to use private
pages for this, using the page cache as synchronization is handy for a
variety of reasons:

1) No special truncate magic is needed
2) Async buffered reads need some place to serialize, using the page
   cache is a lot easier than writing extra code for this
3) The pruning cost is pretty reasonable

and the code to support this is much simpler as a result.

You can think of uncached buffered IO as being the much more attractive
cousin of O_DIRECT - it has none of the restrictions of O_DIRECT. Yes,
it will copy the data, but unlike regular buffered IO, it doesn't run
into the unpredictability of the page cache in terms of reclaim. As an
example, on a test box with 32 drives, reading them with buffered IO
looks as follows:

Reading bs 65536, uncached 0
  1s: 145945MB/sec
  2s: 158067MB/sec
  3s: 157007MB/sec
  4s: 148622MB/sec
  5s: 118824MB/sec
  6s: 70494MB/sec
  7s: 41754MB/sec
  8s: 90811MB/sec
  9s: 92204MB/sec
 10s: 95178MB/sec
 11s: 95488MB/sec
 12s: 95552MB/sec
 13s: 96275MB/sec

where it's quite easy to see where the page cache filled up, and
performance went from good to erratic, and finally settles at a much
lower rate. Looking at top while this is ongoing, we see:

 PID USER      PR  NI    VIRT    RES    SHR S  %CPU  %MEM     TIME+ COMMAND
7535 root      20   0  267004      0      0 S  3199   0.0   8:40.65 uncached
3326 root      20   0       0      0      0 R 100.0   0.0   0:16.40 kswapd4
3327 root      20   0       0      0      0 R 100.0   0.0   0:17.22 kswapd5
3328 root      20   0       0      0      0 R 100.0   0.0   0:13.29 kswapd6
3332 root      20   0       0      0      0 R 100.0   0.0   0:11.11 kswapd10
3339 root      20   0       0      0      0 R 100.0   0.0   0:16.25 kswapd17
3348 root      20   0       0      0      0 R 100.0   0.0   0:16.40 kswapd26
3343 root      20   0       0      0      0 R 100.0   0.0   0:16.30 kswapd21
3344 root      20   0       0      0      0 R 100.0   0.0   0:11.92 kswapd22
3349 root      20   0       0      0      0 R 100.0   0.0   0:16.28 kswapd27
3352 root      20   0       0      0      0 R  99.7   0.0   0:11.89 kswapd30
3353 root      20   0       0      0      0 R  96.7   0.0   0:16.04 kswapd31
3329 root      20   0       0      0      0 R  96.4   0.0   0:11.41 kswapd7
3345 root      20   0       0      0      0 R  96.4   0.0   0:13.40 kswapd23
3330 root      20   0       0      0      0 S  91.1   0.0   0:08.28 kswapd8
3350 root      20   0       0      0      0 S  86.8   0.0   0:11.13 kswapd28
3325 root      20   0       0      0      0 S  76.3   0.0   0:07.43 kswapd3
3341 root      20   0       0      0      0 S  74.7   0.0   0:08.85 kswapd19
3334 root      20   0       0      0      0 S  71.7   0.0   0:10.04 kswapd12
3351 root      20   0       0      0      0 R  60.5   0.0   0:09.59 kswapd29
3323 root      20   0       0      0      0 R  57.6   0.0   0:11.50 kswapd1
[...]

which is just showing a partial list of the 32 kswapd threads that are
running mostly full tilt, burning ~28 full CPU cores.

If the same test case is run with RWF_UNCACHED set for the buffered read,
the output looks as follows:

Reading bs 65536, uncached 1
  1s: 153144MB/sec
  2s: 156760MB/sec
  3s: 158110MB/sec
  4s: 158009MB/sec
  5s: 158043MB/sec
  6s: 157638MB/sec
  7s: 157999MB/sec
  8s: 158024MB/sec
  9s: 157764MB/sec
 10s: 157477MB/sec
 11s: 157417MB/sec
 12s: 157455MB/sec
 13s: 157233MB/sec
 14s: 156692MB/sec

which is just chugging along at ~155GB/sec of read performance. Looking
at top, we see:

 PID USER      PR  NI    VIRT    RES    SHR S  %CPU  %MEM     TIME+ COMMAND
7961 root      20   0  267004      0      0 S  3180   0.0   5:37.95 uncached
8024 axboe     20   0   14292   4096      0 R   1.0   0.0   0:00.13 top

where just the test app is using CPU, no reclaim is taking place outside
of the main thread. Not only is performance 65% better, it's also using
half the CPU to do it.

Signed-off-by: Jens Axboe <axboe@kernel.dk>
---
 include/linux/fs.h      |  4 +++-
 include/uapi/linux/fs.h |  6 +++++-
 mm/filemap.c            | 18 ++++++++++++++++--
 mm/swap.c               |  2 ++
 4 files changed, 26 insertions(+), 4 deletions(-)

diff --git a/include/linux/fs.h b/include/linux/fs.h
index 491eeb73e725..5abc53991cd0 100644
--- a/include/linux/fs.h
+++ b/include/linux/fs.h
@@ -320,6 +320,7 @@ struct readahead_control;
 #define IOCB_NOWAIT		(__force int) RWF_NOWAIT
 #define IOCB_APPEND		(__force int) RWF_APPEND
 #define IOCB_ATOMIC		(__force int) RWF_ATOMIC
+#define IOCB_UNCACHED		(__force int) RWF_UNCACHED
 
 /* non-RWF related bits - start at 16 */
 #define IOCB_EVENTFD		(1 << 16)
@@ -354,7 +355,8 @@ struct readahead_control;
 	{ IOCB_SYNC,		"SYNC" }, \
 	{ IOCB_NOWAIT,		"NOWAIT" }, \
 	{ IOCB_APPEND,		"APPEND" }, \
-	{ IOCB_ATOMIC,		"ATOMIC"}, \
+	{ IOCB_ATOMIC,		"ATOMIC" }, \
+	{ IOCB_UNCACHED,	"UNCACHED" }, \
 	{ IOCB_EVENTFD,		"EVENTFD"}, \
 	{ IOCB_DIRECT,		"DIRECT" }, \
 	{ IOCB_WRITE,		"WRITE" }, \
diff --git a/include/uapi/linux/fs.h b/include/uapi/linux/fs.h
index 753971770733..dc77cd8ae1a3 100644
--- a/include/uapi/linux/fs.h
+++ b/include/uapi/linux/fs.h
@@ -332,9 +332,13 @@ typedef int __bitwise __kernel_rwf_t;
 /* Atomic Write */
 #define RWF_ATOMIC	((__force __kernel_rwf_t)0x00000040)
 
+/* buffered IO that drops the cache after reading or writing data */
+#define RWF_UNCACHED	((__force __kernel_rwf_t)0x00000080)
+
 /* mask of flags supported by the kernel */
 #define RWF_SUPPORTED	(RWF_HIPRI | RWF_DSYNC | RWF_SYNC | RWF_NOWAIT |\
-			 RWF_APPEND | RWF_NOAPPEND | RWF_ATOMIC)
+			 RWF_APPEND | RWF_NOAPPEND | RWF_ATOMIC |\
+			 RWF_UNCACHED)
 
 #define PROCFS_IOCTL_MAGIC 'f'
 
diff --git a/mm/filemap.c b/mm/filemap.c
index 7f8d13f06c04..6f65025782bb 100644
--- a/mm/filemap.c
+++ b/mm/filemap.c
@@ -2471,6 +2471,8 @@ static int filemap_create_folio(struct kiocb *iocb,
 	folio = filemap_alloc_folio(mapping_gfp_mask(mapping), min_order);
 	if (!folio)
 		return -ENOMEM;
+	if (iocb->ki_flags & IOCB_UNCACHED)
+		folio_set_uncached(folio);
 
 	/*
 	 * Protect against truncate / hole punch. Grabbing invalidate_lock
@@ -2516,6 +2518,8 @@ static int filemap_readahead(struct kiocb *iocb, struct file *file,
 
 	if (iocb->ki_flags & IOCB_NOIO)
 		return -EAGAIN;
+	if (iocb->ki_flags & IOCB_UNCACHED)
+		ractl.uncached = 1;
 	page_cache_async_ra(&ractl, folio, last_index - folio->index);
 	return 0;
 }
@@ -2545,6 +2549,8 @@ static int filemap_get_pages(struct kiocb *iocb, size_t count,
 			return -EAGAIN;
 		if (iocb->ki_flags & IOCB_NOWAIT)
 			flags = memalloc_noio_save();
+		if (iocb->ki_flags & IOCB_UNCACHED)
+			ractl.uncached = 1;
 		page_cache_sync_ra(&ractl, last_index - index);
 		if (iocb->ki_flags & IOCB_NOWAIT)
 			memalloc_noio_restore(flags);
@@ -2705,8 +2711,16 @@ ssize_t filemap_read(struct kiocb *iocb, struct iov_iter *iter,
 			}
 		}
 put_folios:
-		for (i = 0; i < folio_batch_count(&fbatch); i++)
-			folio_put(fbatch.folios[i]);
+		for (i = 0; i < folio_batch_count(&fbatch); i++) {
+			struct folio *folio = fbatch.folios[i];
+
+			if (folio_test_uncached(folio)) {
+				folio_lock(folio);
+				invalidate_complete_folio2(mapping, folio, 0);
+				folio_unlock(folio);
+			}
+			folio_put(folio);
+		}
 		folio_batch_init(&fbatch);
 	} while (iov_iter_count(iter) && iocb->ki_pos < isize && !error);
 
diff --git a/mm/swap.c b/mm/swap.c
index 835bdf324b76..f2457acae383 100644
--- a/mm/swap.c
+++ b/mm/swap.c
@@ -472,6 +472,8 @@ static void folio_inc_refs(struct folio *folio)
  */
 void folio_mark_accessed(struct folio *folio)
 {
+	if (folio_test_uncached(folio))
+		return;
 	if (lru_gen_enabled()) {
 		folio_inc_refs(folio);
 		return;
-- 
2.45.2




* [PATCH 09/13] mm: drop uncached pages when writeback completes
  2024-11-08 17:43 [PATCHSET v4] Uncached buffered IO Jens Axboe
                   ` (7 preceding siblings ...)
  2024-11-08 17:43 ` [PATCH 08/13] fs: add read support for RWF_UNCACHED Jens Axboe
@ 2024-11-08 17:43 ` Jens Axboe
  2024-11-08 17:43 ` [PATCH 10/13] mm/filemap: make buffered writes work with RWF_UNCACHED Jens Axboe
                   ` (4 subsequent siblings)
  13 siblings, 0 replies; 36+ messages in thread
From: Jens Axboe @ 2024-11-08 17:43 UTC (permalink / raw)
  To: linux-mm, linux-fsdevel; +Cc: hannes, clm, linux-kernel, Jens Axboe

If the folio is marked as uncached, drop pages when writeback completes.
Intended to be used with RWF_UNCACHED, to avoid needing sync writes for
uncached IO.

Signed-off-by: Jens Axboe <axboe@kernel.dk>
---
 mm/filemap.c | 12 ++++++++++++
 1 file changed, 12 insertions(+)

diff --git a/mm/filemap.c b/mm/filemap.c
index 6f65025782bb..1e455ca872b5 100644
--- a/mm/filemap.c
+++ b/mm/filemap.c
@@ -1634,6 +1634,18 @@ void folio_end_writeback(struct folio *folio)
 	if (__folio_end_writeback(folio))
 		folio_wake_bit(folio, PG_writeback);
 	acct_reclaim_writeback(folio);
+
+	/*
+	 * If folio is marked as uncached, then pages should be dropped when
+	 * writeback completes. Do that now.
+	 */
+	if (folio_test_uncached(folio)) {
+		folio_lock(folio);
+		if (invalidate_complete_folio2(folio->mapping, folio, 0))
+			folio_clear_uncached(folio);
+		folio_unlock(folio);
+
+	}
 	folio_put(folio);
 }
 EXPORT_SYMBOL(folio_end_writeback);
-- 
2.45.2




* [PATCH 10/13] mm/filemap: make buffered writes work with RWF_UNCACHED
  2024-11-08 17:43 [PATCHSET v4] Uncached buffered IO Jens Axboe
                   ` (8 preceding siblings ...)
  2024-11-08 17:43 ` [PATCH 09/13] mm: drop uncached pages when writeback completes Jens Axboe
@ 2024-11-08 17:43 ` Jens Axboe
  2024-11-08 17:43 ` [PATCH 11/13] iomap: " Jens Axboe
                   ` (3 subsequent siblings)
  13 siblings, 0 replies; 36+ messages in thread
From: Jens Axboe @ 2024-11-08 17:43 UTC (permalink / raw)
  To: linux-mm, linux-fsdevel; +Cc: hannes, clm, linux-kernel, Jens Axboe

If RWF_UNCACHED is set for a write, mark the folios being written with
drop_writeback. Then writeback completion will drop the pages. The
write_iter handler simply kicks off writeback for the pages, and
writeback completion will take care of the rest.

This provides similar benefits to using RWF_UNCACHED with reads. Testing
buffered writes on 32 files:

writing bs 65536, uncached 0
  1s: 196035MB/sec, MB=196035
  2s: 132308MB/sec, MB=328147
  3s: 132438MB/sec, MB=460586
  4s: 116528MB/sec, MB=577115
  5s: 103898MB/sec, MB=681014
  6s: 108893MB/sec, MB=789907
  7s: 99678MB/sec, MB=889586
  8s: 106545MB/sec, MB=996132
  9s: 106826MB/sec, MB=1102958
 10s: 101544MB/sec, MB=1204503
 11s: 111044MB/sec, MB=1315548
 12s: 124257MB/sec, MB=1441121
 13s: 116031MB/sec, MB=1557153
 14s: 114540MB/sec, MB=1671694
 15s: 115011MB/sec, MB=1786705
 16s: 115260MB/sec, MB=1901966
 17s: 116068MB/sec, MB=2018034
 18s: 116096MB/sec, MB=2134131

where it's quite obvious where the page cache filled, and performance
dropped to about half of where it started, settling in at around
115GB/sec. Meanwhile, 32 kswapds were running full steam trying to
reclaim pages.

Running the same test with uncached buffered writes:

writing bs 65536, uncached 1
  1s: 198974MB/sec
  2s: 189618MB/sec
  3s: 193601MB/sec
  4s: 188582MB/sec
  5s: 193487MB/sec
  6s: 188341MB/sec
  7s: 194325MB/sec
  8s: 188114MB/sec
  9s: 192740MB/sec
 10s: 189206MB/sec
 11s: 193442MB/sec
 12s: 189659MB/sec
 13s: 191732MB/sec
 14s: 190701MB/sec
 15s: 191789MB/sec
 16s: 191259MB/sec
 17s: 190613MB/sec
 18s: 191951MB/sec

and the behavior is fully predictable, performing the same throughout
even after the page cache would otherwise have fully filled with dirty
data. It's also about 65% faster, and using half the CPU of the system
compared to the normal buffered write.

Signed-off-by: Jens Axboe <axboe@kernel.dk>
---
 mm/filemap.c | 20 +++++++++++++++-----
 1 file changed, 15 insertions(+), 5 deletions(-)

diff --git a/mm/filemap.c b/mm/filemap.c
index 1e455ca872b5..d4c5928c5e2a 100644
--- a/mm/filemap.c
+++ b/mm/filemap.c
@@ -1610,6 +1610,8 @@ EXPORT_SYMBOL(folio_wait_private_2_killable);
  */
 void folio_end_writeback(struct folio *folio)
 {
+	bool folio_uncached;
+
 	VM_BUG_ON_FOLIO(!folio_test_writeback(folio), folio);
 
 	/*
@@ -1631,6 +1633,7 @@ void folio_end_writeback(struct folio *folio)
 	 * reused before the folio_wake_bit().
 	 */
 	folio_get(folio);
+	folio_uncached = folio_test_clear_uncached(folio);
 	if (__folio_end_writeback(folio))
 		folio_wake_bit(folio, PG_writeback);
 	acct_reclaim_writeback(folio);
@@ -1639,12 +1642,10 @@ void folio_end_writeback(struct folio *folio)
 	 * If folio is marked as uncached, then pages should be dropped when
 	 * writeback completes. Do that now.
 	 */
-	if (folio_test_uncached(folio)) {
-		folio_lock(folio);
-		if (invalidate_complete_folio2(folio->mapping, folio, 0))
-			folio_clear_uncached(folio);
+	if (folio_uncached && folio_trylock(folio)) {
+		if (folio->mapping)
+			invalidate_complete_folio2(folio->mapping, folio, 0);
 		folio_unlock(folio);
-
 	}
 	folio_put(folio);
 }
@@ -4082,6 +4083,9 @@ ssize_t generic_perform_write(struct kiocb *iocb, struct iov_iter *i)
 		if (unlikely(status < 0))
 			break;
 
+		if (iocb->ki_flags & IOCB_UNCACHED)
+			folio_set_uncached(folio);
+
 		offset = offset_in_folio(folio, pos);
 		if (bytes > folio_size(folio) - offset)
 			bytes = folio_size(folio) - offset;
@@ -4122,6 +4126,12 @@ ssize_t generic_perform_write(struct kiocb *iocb, struct iov_iter *i)
 
 	if (!written)
 		return status;
+	if (iocb->ki_flags & IOCB_UNCACHED) {
+		/* kick off uncached writeback, completion will drop it */
+		__filemap_fdatawrite_range(mapping, iocb->ki_pos,
+						iocb->ki_pos + written,
+						WB_SYNC_NONE);
+	}
 	iocb->ki_pos += written;
 	return written;
 }
-- 
2.45.2




* [PATCH 11/13] iomap: make buffered writes work with RWF_UNCACHED
  2024-11-08 17:43 [PATCHSET v4] Uncached buffered IO Jens Axboe
                   ` (9 preceding siblings ...)
  2024-11-08 17:43 ` [PATCH 10/13] mm/filemap: make buffered writes work with RWF_UNCACHED Jens Axboe
@ 2024-11-08 17:43 ` Jens Axboe
  2024-11-08 18:46   ` Matthew Wilcox
  2024-11-08 17:43 ` [PATCH 12/13] ext4: flag as supporting FOP_UNCACHED Jens Axboe
                   ` (2 subsequent siblings)
  13 siblings, 1 reply; 36+ messages in thread
From: Jens Axboe @ 2024-11-08 17:43 UTC (permalink / raw)
  To: linux-mm, linux-fsdevel; +Cc: hannes, clm, linux-kernel, Jens Axboe

Add iomap buffered write support for RWF_UNCACHED. If RWF_UNCACHED is
set for a write, mark the folios being written with drop_writeback. Then
writeback completion will drop the pages. The write_iter handler simply
kicks off writeback for the pages, and writeback completion will take
care of the rest.

See the similar patch for the generic filemap handling for performance
results, those were in fact done on XFS using this patch.

Signed-off-by: Jens Axboe <axboe@kernel.dk>
---
 fs/iomap/buffered-io.c | 12 +++++++++++-
 include/linux/iomap.h  |  3 ++-
 2 files changed, 13 insertions(+), 2 deletions(-)

diff --git a/fs/iomap/buffered-io.c b/fs/iomap/buffered-io.c
index ef0b68bccbb6..609256885094 100644
--- a/fs/iomap/buffered-io.c
+++ b/fs/iomap/buffered-io.c
@@ -959,6 +959,8 @@ static loff_t iomap_write_iter(struct iomap_iter *iter, struct iov_iter *i)
 		}
 		if (iter->iomap.flags & IOMAP_F_STALE)
 			break;
+		if (iter->flags & IOMAP_UNCACHED)
+			folio_set_uncached(folio);
 
 		offset = offset_in_folio(folio, pos);
 		if (bytes > folio_size(folio) - offset)
@@ -1023,8 +1025,9 @@ ssize_t
 iomap_file_buffered_write(struct kiocb *iocb, struct iov_iter *i,
 		const struct iomap_ops *ops, void *private)
 {
+	struct address_space *mapping = iocb->ki_filp->f_mapping;
 	struct iomap_iter iter = {
-		.inode		= iocb->ki_filp->f_mapping->host,
+		.inode		= mapping->host,
 		.pos		= iocb->ki_pos,
 		.len		= iov_iter_count(i),
 		.flags		= IOMAP_WRITE,
@@ -1034,12 +1037,19 @@ iomap_file_buffered_write(struct kiocb *iocb, struct iov_iter *i,
 
 	if (iocb->ki_flags & IOCB_NOWAIT)
 		iter.flags |= IOMAP_NOWAIT;
+	if (iocb->ki_flags & IOCB_UNCACHED)
+		iter.flags |= IOMAP_UNCACHED;
 
 	while ((ret = iomap_iter(&iter, ops)) > 0)
 		iter.processed = iomap_write_iter(&iter, i);
 
 	if (unlikely(iter.pos == iocb->ki_pos))
 		return ret;
+	if (iocb->ki_flags & IOCB_UNCACHED) {
+		/* kick off uncached writeback, completion will drop it */
+		__filemap_fdatawrite_range(mapping, iocb->ki_pos, iter.pos,
+						WB_SYNC_NONE);
+	}
 	ret = iter.pos - iocb->ki_pos;
 	iocb->ki_pos = iter.pos;
 	return ret;
diff --git a/include/linux/iomap.h b/include/linux/iomap.h
index f61407e3b121..89b24fbb1399 100644
--- a/include/linux/iomap.h
+++ b/include/linux/iomap.h
@@ -173,8 +173,9 @@ struct iomap_folio_ops {
 #define IOMAP_NOWAIT		(1 << 5) /* do not block */
 #define IOMAP_OVERWRITE_ONLY	(1 << 6) /* only pure overwrites allowed */
 #define IOMAP_UNSHARE		(1 << 7) /* unshare_file_range */
+#define IOMAP_UNCACHED		(1 << 8) /* uncached IO */
 #ifdef CONFIG_FS_DAX
-#define IOMAP_DAX		(1 << 8) /* DAX mapping */
+#define IOMAP_DAX		(1 << 9) /* DAX mapping */
 #else
 #define IOMAP_DAX		0
 #endif /* CONFIG_FS_DAX */
-- 
2.45.2




* [PATCH 12/13] ext4: flag as supporting FOP_UNCACHED
  2024-11-08 17:43 [PATCHSET v4] Uncached buffered IO Jens Axboe
                   ` (10 preceding siblings ...)
  2024-11-08 17:43 ` [PATCH 11/13] iomap: " Jens Axboe
@ 2024-11-08 17:43 ` Jens Axboe
  2024-11-08 17:43 ` [PATCH 13/13] xfs: " Jens Axboe
  2024-11-11 12:55 ` [PATCHSET v4] Uncached buffered IO Stefan Metzmacher
  13 siblings, 0 replies; 36+ messages in thread
From: Jens Axboe @ 2024-11-08 17:43 UTC (permalink / raw)
  To: linux-mm, linux-fsdevel; +Cc: hannes, clm, linux-kernel, Jens Axboe

ext4 uses the generic read/write paths, and can fully support
FOP_UNCACHED. Set the flag to indicate support, enabling use of
RWF_UNCACHED.

Signed-off-by: Jens Axboe <axboe@kernel.dk>
---
 fs/ext4/file.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/fs/ext4/file.c b/fs/ext4/file.c
index f14aed14b9cf..0ef39d738598 100644
--- a/fs/ext4/file.c
+++ b/fs/ext4/file.c
@@ -944,7 +944,7 @@ const struct file_operations ext4_file_operations = {
 	.splice_write	= iter_file_splice_write,
 	.fallocate	= ext4_fallocate,
 	.fop_flags	= FOP_MMAP_SYNC | FOP_BUFFER_RASYNC |
-			  FOP_DIO_PARALLEL_WRITE,
+			  FOP_DIO_PARALLEL_WRITE | FOP_UNCACHED,
 };
 
 const struct inode_operations ext4_file_inode_operations = {
-- 
2.45.2




* [PATCH 13/13] xfs: flag as supporting FOP_UNCACHED
  2024-11-08 17:43 [PATCHSET v4] Uncached buffered IO Jens Axboe
                   ` (11 preceding siblings ...)
  2024-11-08 17:43 ` [PATCH 12/13] ext4: flag as supporting FOP_UNCACHED Jens Axboe
@ 2024-11-08 17:43 ` Jens Axboe
  2024-11-11 12:55 ` [PATCHSET v4] Uncached buffered IO Stefan Metzmacher
  13 siblings, 0 replies; 36+ messages in thread
From: Jens Axboe @ 2024-11-08 17:43 UTC (permalink / raw)
  To: linux-mm, linux-fsdevel; +Cc: hannes, clm, linux-kernel, Jens Axboe

iomap supports uncached IO, enable the use of RWF_UNCACHED with XFS by
flagging support with FOP_UNCACHED.

Signed-off-by: Jens Axboe <axboe@kernel.dk>
---
 fs/xfs/xfs_file.c | 3 ++-
 1 file changed, 2 insertions(+), 1 deletion(-)

diff --git a/fs/xfs/xfs_file.c b/fs/xfs/xfs_file.c
index b19916b11fd5..4fe593896bc5 100644
--- a/fs/xfs/xfs_file.c
+++ b/fs/xfs/xfs_file.c
@@ -1595,7 +1595,8 @@ const struct file_operations xfs_file_operations = {
 	.fadvise	= xfs_file_fadvise,
 	.remap_file_range = xfs_file_remap_range,
 	.fop_flags	= FOP_MMAP_SYNC | FOP_BUFFER_RASYNC |
-			  FOP_BUFFER_WASYNC | FOP_DIO_PARALLEL_WRITE,
+			  FOP_BUFFER_WASYNC | FOP_DIO_PARALLEL_WRITE |
+			  FOP_UNCACHED,
 };
 
 const struct file_operations xfs_dir_file_operations = {
-- 
2.45.2




* Re: [PATCH 01/13] mm/filemap: change filemap_create_folio() to take a struct kiocb
  2024-11-08 17:43 ` [PATCH 01/13] mm/filemap: change filemap_create_folio() to take a struct kiocb Jens Axboe
@ 2024-11-08 18:18   ` Matthew Wilcox
  2024-11-08 19:22     ` Jens Axboe
  0 siblings, 1 reply; 36+ messages in thread
From: Matthew Wilcox @ 2024-11-08 18:18 UTC (permalink / raw)
  To: Jens Axboe; +Cc: linux-mm, linux-fsdevel, hannes, clm, linux-kernel

On Fri, Nov 08, 2024 at 10:43:24AM -0700, Jens Axboe wrote:
> Rather than pass in both the file and position directly from the kiocb,
> just take a struct kiocb instead. In preparation for actually needing
> the kiocb in the function.

If you're undoing this part of f253e1854ce8, it's probably worth moving
the IOCB flag checks back to where they were too.




* Re: [PATCH 04/13] mm/readahead: add readahead_control->uncached member
  2024-11-08 17:43 ` [PATCH 04/13] mm/readahead: add readahead_control->uncached member Jens Axboe
@ 2024-11-08 18:21   ` Matthew Wilcox
  2024-11-08 19:22     ` Jens Axboe
  0 siblings, 1 reply; 36+ messages in thread
From: Matthew Wilcox @ 2024-11-08 18:21 UTC (permalink / raw)
  To: Jens Axboe; +Cc: linux-mm, linux-fsdevel, hannes, clm, linux-kernel

On Fri, Nov 08, 2024 at 10:43:27AM -0700, Jens Axboe wrote:
> +++ b/mm/readahead.c
> @@ -191,7 +191,13 @@ static void read_pages(struct readahead_control *rac)
>  static struct folio *ractl_alloc_folio(struct readahead_control *ractl,
>  				       gfp_t gfp_mask, unsigned int order)
>  {
> -	return filemap_alloc_folio(gfp_mask, order);
> +	struct folio *folio;
> +
> +	folio = filemap_alloc_folio(gfp_mask, order);
> +	if (folio && ractl->uncached)
> +		folio_set_uncached(folio);

If we've just allocated it, it should be safe to use
__folio_set_uncached() here, no?

Not that I'm keen on using a folio flag here, but I'm reserving judgement
on that until I've got further through this series and see how it's used.
I can see that it might be necessary.



* Re: [PATCH 07/13] fs: add FOP_UNCACHED flag
  2024-11-08 17:43 ` [PATCH 07/13] fs: add FOP_UNCACHED flag Jens Axboe
@ 2024-11-08 18:27   ` Matthew Wilcox
  2024-11-08 19:23     ` Jens Axboe
  0 siblings, 1 reply; 36+ messages in thread
From: Matthew Wilcox @ 2024-11-08 18:27 UTC (permalink / raw)
  To: Jens Axboe; +Cc: linux-mm, linux-fsdevel, hannes, clm, linux-kernel

On Fri, Nov 08, 2024 at 10:43:30AM -0700, Jens Axboe wrote:
> +	if (flags & RWF_UNCACHED) {

You introduce RWF_UNCACHED in the next patch, so this one's a bisection
hazard.



* Re: [PATCH 08/13] fs: add read support for RWF_UNCACHED
  2024-11-08 17:43 ` [PATCH 08/13] fs: add read support for RWF_UNCACHED Jens Axboe
@ 2024-11-08 18:33   ` Matthew Wilcox
  2024-11-08 19:25     ` Jens Axboe
  2024-11-11 13:04   ` Stefan Metzmacher
  1 sibling, 1 reply; 36+ messages in thread
From: Matthew Wilcox @ 2024-11-08 18:33 UTC (permalink / raw)
  To: Jens Axboe; +Cc: linux-mm, linux-fsdevel, hannes, clm, linux-kernel

On Fri, Nov 08, 2024 at 10:43:31AM -0700, Jens Axboe wrote:
> +++ b/mm/swap.c
> @@ -472,6 +472,8 @@ static void folio_inc_refs(struct folio *folio)
>   */
>  void folio_mark_accessed(struct folio *folio)
>  {
> +	if (folio_test_uncached(folio))
> +		return;
>  	if (lru_gen_enabled()) {

This feels like it might be a problem.  If, eg, process A is doing
uncached IO and process B comes along and, say, mmap()s it, I think
we'll need to clear the uncached flag in order to have things work
correctly.  It's a performance problem, not a correctness problem.



* Re: [PATCH 11/13] iomap: make buffered writes work with RWF_UNCACHED
  2024-11-08 17:43 ` [PATCH 11/13] iomap: " Jens Axboe
@ 2024-11-08 18:46   ` Matthew Wilcox
  2024-11-08 19:26     ` Jens Axboe
  0 siblings, 1 reply; 36+ messages in thread
From: Matthew Wilcox @ 2024-11-08 18:46 UTC (permalink / raw)
  To: Jens Axboe; +Cc: linux-mm, linux-fsdevel, hannes, clm, linux-kernel

On Fri, Nov 08, 2024 at 10:43:34AM -0700, Jens Axboe wrote:
> +++ b/fs/iomap/buffered-io.c
> @@ -959,6 +959,8 @@ static loff_t iomap_write_iter(struct iomap_iter *iter, struct iov_iter *i)
>  		}
>  		if (iter->iomap.flags & IOMAP_F_STALE)
>  			break;
> +		if (iter->flags & IOMAP_UNCACHED)
> +			folio_set_uncached(folio);

This seems like it'd convert an existing page cache folio into being
uncached?  Is this just leftover from a previous version or is that a
design decision you made?




* Re: [PATCH 01/13] mm/filemap: change filemap_create_folio() to take a struct kiocb
  2024-11-08 18:18   ` Matthew Wilcox
@ 2024-11-08 19:22     ` Jens Axboe
  0 siblings, 0 replies; 36+ messages in thread
From: Jens Axboe @ 2024-11-08 19:22 UTC (permalink / raw)
  To: Matthew Wilcox; +Cc: linux-mm, linux-fsdevel, hannes, clm, linux-kernel

On 11/8/24 11:18 AM, Matthew Wilcox wrote:
> On Fri, Nov 08, 2024 at 10:43:24AM -0700, Jens Axboe wrote:
>> Rather than pass in both the file and position directly from the kiocb,
>> just take a struct kiocb instead. In preparation for actually needing
>> the kiocb in the function.
> 
> If you're undoing this part of f253e1854ce8, it's probably worth moving
> the IOCB flag checks back to where they were too.

Ah wasn't aware of that one, didn't do any git history digging. Sure,
I can move the flags checking too.

-- 
Jens Axboe




* Re: [PATCH 04/13] mm/readahead: add readahead_control->uncached member
  2024-11-08 18:21   ` Matthew Wilcox
@ 2024-11-08 19:22     ` Jens Axboe
  0 siblings, 0 replies; 36+ messages in thread
From: Jens Axboe @ 2024-11-08 19:22 UTC (permalink / raw)
  To: Matthew Wilcox; +Cc: linux-mm, linux-fsdevel, hannes, clm, linux-kernel

On 11/8/24 11:21 AM, Matthew Wilcox wrote:
> On Fri, Nov 08, 2024 at 10:43:27AM -0700, Jens Axboe wrote:
>> +++ b/mm/readahead.c
>> @@ -191,7 +191,13 @@ static void read_pages(struct readahead_control *rac)
>>  static struct folio *ractl_alloc_folio(struct readahead_control *ractl,
>>  				       gfp_t gfp_mask, unsigned int order)
>>  {
>> -	return filemap_alloc_folio(gfp_mask, order);
>> +	struct folio *folio;
>> +
>> +	folio = filemap_alloc_folio(gfp_mask, order);
>> +	if (folio && ractl->uncached)
>> +		folio_set_uncached(folio);
> 
> If we've just allocated it, it should be safe to use
> __folio_set_uncached() here, no?

Indeed, we can use __folio_set_uncached() here. I'll make that change.

> Not that I'm keen on using a folio flag here, but I'm reserving judgement
> on that unti I've got further through this series and see how it's used.
> I can see that it might be necessary.

I knew that'd be one of the more contentious items here... On the read
side, we can get by without the flag. But for writeback we do need it.
I just kept it consistent and used folio_*_uncached() throughout
because of that.

-- 
Jens Axboe




* Re: [PATCH 07/13] fs: add FOP_UNCACHED flag
  2024-11-08 18:27   ` Matthew Wilcox
@ 2024-11-08 19:23     ` Jens Axboe
  0 siblings, 0 replies; 36+ messages in thread
From: Jens Axboe @ 2024-11-08 19:23 UTC (permalink / raw)
  To: Matthew Wilcox; +Cc: linux-mm, linux-fsdevel, hannes, clm, linux-kernel

On 11/8/24 11:27 AM, Matthew Wilcox wrote:
> On Fri, Nov 08, 2024 at 10:43:30AM -0700, Jens Axboe wrote:
>> +	if (flags & RWF_UNCACHED) {
> 
> You introduce RWF_UNCACHED in the next patch, so this one's a bisection
> hazard.

Oops, I did reshuffle before sending. I'll sort that out.

-- 
Jens Axboe




* Re: [PATCH 03/13] mm: add PG_uncached page flag
  2024-11-08 17:43 ` [PATCH 03/13] mm: add PG_uncached page flag Jens Axboe
@ 2024-11-08 19:25   ` Kirill A. Shutemov
  2024-11-08 19:39     ` Jens Axboe
  0 siblings, 1 reply; 36+ messages in thread
From: Kirill A. Shutemov @ 2024-11-08 19:25 UTC (permalink / raw)
  To: Jens Axboe; +Cc: linux-mm, linux-fsdevel, hannes, clm, linux-kernel

On Fri, Nov 08, 2024 at 10:43:26AM -0700, Jens Axboe wrote:
> Add a page flag that file IO can use to indicate that the IO being done
> is uncached, as in it should not persist in the page cache after the IO
> has been completed.

Flag bits are precious resource. It would be nice to re-use an existing
bit if possible.

PG_reclaim description looks suspiciously close to what you want.
I wonder if it would be valid to re-define PG_reclaim behaviour to drop
the page after writeback instead of moving to the tail of inactive list.

-- 
  Kiryl Shutsemau / Kirill A. Shutemov



* Re: [PATCH 08/13] fs: add read support for RWF_UNCACHED
  2024-11-08 18:33   ` Matthew Wilcox
@ 2024-11-08 19:25     ` Jens Axboe
  0 siblings, 0 replies; 36+ messages in thread
From: Jens Axboe @ 2024-11-08 19:25 UTC (permalink / raw)
  To: Matthew Wilcox; +Cc: linux-mm, linux-fsdevel, hannes, clm, linux-kernel

On 11/8/24 11:33 AM, Matthew Wilcox wrote:
> On Fri, Nov 08, 2024 at 10:43:31AM -0700, Jens Axboe wrote:
>> +++ b/mm/swap.c
>> @@ -472,6 +472,8 @@ static void folio_inc_refs(struct folio *folio)
>>   */
>>  void folio_mark_accessed(struct folio *folio)
>>  {
>> +	if (folio_test_uncached(folio))
>> +		return;
>>  	if (lru_gen_enabled()) {
> 
> This feels like it might be a problem.  If, eg, process A is doing
> uncached IO and process B comes along and, say, mmap()s it, I think
> we'll need to clear the uncached flag in order to have things work
> correctly.  It's a performance problem, not a correctness problem.

I'll take a look, should be fine to just unconditionally clear it
here. uncached is a hint after all. We'll try our best to honor it,
but there will be cases where inline reclaim will fail and you'll
get cached contents, particularly if you mix uncached and buffered,
or uncached and mmap.

-- 
Jens Axboe




* Re: [PATCH 11/13] iomap: make buffered writes work with RWF_UNCACHED
  2024-11-08 18:46   ` Matthew Wilcox
@ 2024-11-08 19:26     ` Jens Axboe
  2024-11-08 19:49       ` Jens Axboe
  0 siblings, 1 reply; 36+ messages in thread
From: Jens Axboe @ 2024-11-08 19:26 UTC (permalink / raw)
  To: Matthew Wilcox; +Cc: linux-mm, linux-fsdevel, hannes, clm, linux-kernel

On 11/8/24 11:46 AM, Matthew Wilcox wrote:
> On Fri, Nov 08, 2024 at 10:43:34AM -0700, Jens Axboe wrote:
>> +++ b/fs/iomap/buffered-io.c
>> @@ -959,6 +959,8 @@ static loff_t iomap_write_iter(struct iomap_iter *iter, struct iov_iter *i)
>>  		}
>>  		if (iter->iomap.flags & IOMAP_F_STALE)
>>  			break;
>> +		if (iter->flags & IOMAP_UNCACHED)
>> +			folio_set_uncached(folio);
> 
> This seems like it'd convert an existing page cache folio into being
> uncached?  Is this just leftover from a previous version or is that a
> design decision you made?

I'll see if we can improve that. Currently both the read and write side
do drop whatever it touches. We could feasibly just have it drop
newly instantiated pages - iow, uncached just won't create new persistent
folios, but it'll happily use the ones that are there already.

-- 
Jens Axboe




* Re: [PATCH 03/13] mm: add PG_uncached page flag
  2024-11-08 19:25   ` Kirill A. Shutemov
@ 2024-11-08 19:39     ` Jens Axboe
  0 siblings, 0 replies; 36+ messages in thread
From: Jens Axboe @ 2024-11-08 19:39 UTC (permalink / raw)
  To: Kirill A. Shutemov; +Cc: linux-mm, linux-fsdevel, hannes, clm, linux-kernel

On 11/8/24 12:25 PM, Kirill A. Shutemov wrote:
> On Fri, Nov 08, 2024 at 10:43:26AM -0700, Jens Axboe wrote:
>> Add a page flag that file IO can use to indicate that the IO being done
>> is uncached, as in it should not persist in the page cache after the IO
>> has been completed.
> 
> Flag bits are precious resource. It would be nice to re-use an existing
> bit if possible.

I know, like I mentioned in the reply to willy, I knew this one would
be an interesting discussion in and of itself.

> PG_reclaim description looks suspiciously close to what you want.
> I wounder if it would be valid to re-define PG_reclaim behaviour to drop
> the page after writeback instead of moving to the tail of inactive list.

You're the mm expert - I added the flag since then it has a clearly
defined meaning, and I would not need to worry about any kind of odd
overlap in paths I didn't know about. Would definitely entertain reusing
something else, but I'll leave that in the hands of the people that know
this code and the various intricacies and assumptions a lot better than
I do.

-- 
Jens Axboe



* Re: [PATCH 11/13] iomap: make buffered writes work with RWF_UNCACHED
  2024-11-08 19:26     ` Jens Axboe
@ 2024-11-08 19:49       ` Jens Axboe
  2024-11-08 20:07         ` Matthew Wilcox
  0 siblings, 1 reply; 36+ messages in thread
From: Jens Axboe @ 2024-11-08 19:49 UTC (permalink / raw)
  To: Matthew Wilcox; +Cc: linux-mm, linux-fsdevel, hannes, clm, linux-kernel

On 11/8/24 12:26 PM, Jens Axboe wrote:
> On 11/8/24 11:46 AM, Matthew Wilcox wrote:
>> On Fri, Nov 08, 2024 at 10:43:34AM -0700, Jens Axboe wrote:
>>> +++ b/fs/iomap/buffered-io.c
>>> @@ -959,6 +959,8 @@ static loff_t iomap_write_iter(struct iomap_iter *iter, struct iov_iter *i)
>>>  		}
>>>  		if (iter->iomap.flags & IOMAP_F_STALE)
>>>  			break;
>>> +		if (iter->flags & IOMAP_UNCACHED)
>>> +			folio_set_uncached(folio);
>>
>> This seems like it'd convert an existing page cache folio into being
>> uncached?  Is this just leftover from a previous version or is that a
>> design decision you made?
> 
> I'll see if we can improve that. Currently both the read and write side
> do drop whatever it touches. We could feasibly just have it drop
> newly instantiated pages - iow, uncached just won't create new persistent
> folios, but it'll happily use the ones that are there already.

Well that was nonsense on the read side, it deliberately only prunes
entries that have uncached set. For the write side, this is a bit
trickier. We'd essentially need to know if the folio populated by
write_begin was found in the page cache, or create from new. Any way we
can do that? One way is to change ->write_begin() so it takes a kiocb
rather than a file, but that's an amount of churn I'd rather avoid!
Maybe there's a way I'm just not seeing?

-- 
Jens Axboe



* Re: [PATCH 11/13] iomap: make buffered writes work with RWF_UNCACHED
  2024-11-08 19:49       ` Jens Axboe
@ 2024-11-08 20:07         ` Matthew Wilcox
  2024-11-08 20:18           ` Jens Axboe
  0 siblings, 1 reply; 36+ messages in thread
From: Matthew Wilcox @ 2024-11-08 20:07 UTC (permalink / raw)
  To: Jens Axboe; +Cc: linux-mm, linux-fsdevel, hannes, clm, linux-kernel

On Fri, Nov 08, 2024 at 12:49:58PM -0700, Jens Axboe wrote:
> On 11/8/24 12:26 PM, Jens Axboe wrote:
> > On 11/8/24 11:46 AM, Matthew Wilcox wrote:
> >> On Fri, Nov 08, 2024 at 10:43:34AM -0700, Jens Axboe wrote:
> >>> +++ b/fs/iomap/buffered-io.c
> >>> @@ -959,6 +959,8 @@ static loff_t iomap_write_iter(struct iomap_iter *iter, struct iov_iter *i)
> >>>  		}
> >>>  		if (iter->iomap.flags & IOMAP_F_STALE)
> >>>  			break;
> >>> +		if (iter->flags & IOMAP_UNCACHED)
> >>> +			folio_set_uncached(folio);
> >>
> >> This seems like it'd convert an existing page cache folio into being
> >> uncached?  Is this just leftover from a previous version or is that a
> >> design decision you made?
> > 
> > I'll see if we can improve that. Currently both the read and write side
> > do drop whatever it touches. We could feasibly just have it drop
> > newly instantiated pages - iow, uncached just won't create new persistent
> > folios, but it'll happily use the ones that are there already.
> 
> Well that was nonsense on the read side, it deliberately only prunes
> entries that has uncached set. For the write side, this is a bit
> trickier. We'd essentially need to know if the folio populated by
> write_begin was found in the page cache, or create from new. Any way we
> can do that? One way is to change ->write_begin() so it takes a kiocb
> rather than a file, but that's an amount of churn I'd rather avoid!
> Maybe there's a way I'm just not seeing?

Umm.  We can solve it for iomap with a new FGP_UNCACHED flag and
checking IOMAP_UNCACHED in iomap_get_folio().  Not sure how we solve it
for other filesystems though.  Any filesystem which uses FGP_NOWAIT has
_a_ solution, but eg btrfs will need to plumb through a third boolean
flag (or, more efficiently, just start passing FGP flags to
prepare_one_folio()).



* Re: [PATCH 11/13] iomap: make buffered writes work with RWF_UNCACHED
  2024-11-08 20:07         ` Matthew Wilcox
@ 2024-11-08 20:18           ` Jens Axboe
  0 siblings, 0 replies; 36+ messages in thread
From: Jens Axboe @ 2024-11-08 20:18 UTC (permalink / raw)
  To: Matthew Wilcox; +Cc: linux-mm, linux-fsdevel, hannes, clm, linux-kernel

On 11/8/24 1:07 PM, Matthew Wilcox wrote:
> On Fri, Nov 08, 2024 at 12:49:58PM -0700, Jens Axboe wrote:
>> On 11/8/24 12:26 PM, Jens Axboe wrote:
>>> On 11/8/24 11:46 AM, Matthew Wilcox wrote:
>>>> On Fri, Nov 08, 2024 at 10:43:34AM -0700, Jens Axboe wrote:
>>>>> +++ b/fs/iomap/buffered-io.c
>>>>> @@ -959,6 +959,8 @@ static loff_t iomap_write_iter(struct iomap_iter *iter, struct iov_iter *i)
>>>>>  		}
>>>>>  		if (iter->iomap.flags & IOMAP_F_STALE)
>>>>>  			break;
>>>>> +		if (iter->flags & IOMAP_UNCACHED)
>>>>> +			folio_set_uncached(folio);
>>>>
>>>> This seems like it'd convert an existing page cache folio into being
>>>> uncached?  Is this just leftover from a previous version or is that a
>>>> design decision you made?
>>>
>>> I'll see if we can improve that. Currently both the read and write side
>>> do drop whatever it touches. We could feasibly just have it drop
>>> newly instantiated pages - iow, uncached just won't create new persistent
>>> folios, but it'll happily use the ones that are there already.
>>
>> Well that was nonsense on the read side, it deliberately only prunes
>> entries that has uncached set. For the write side, this is a bit
>> trickier. We'd essentially need to know if the folio populated by
>> write_begin was found in the page cache, or create from new. Any way we
>> can do that? One way is to change ->write_begin() so it takes a kiocb
>> rather than a file, but that's an amount of churn I'd rather avoid!
>> Maybe there's a way I'm just not seeing?
> 
> Umm.  We can solve it for iomap with a new FGP_UNCACHED flag and
> checking IOMAP_UNCACHED in iomap_get_folio().  Not sure how we solve it
> for other filesystems though.  Any filesystem which uses FGP_NOWAIT has
> _a_ solution, but eg btrfs will need to plumb through a third boolean
> flag (or, more efficiently, just start passing FGP flags to
> prepare_one_folio()).

Yeah that's true, forgot we already have the IOMAP_UNCACHED flag there
and it's available at creation time. Thanks, I'll start with that.

-- 
Jens Axboe



* Re: [PATCHSET v4] Uncached buffered IO
  2024-11-08 17:43 [PATCHSET v4] Uncached buffered IO Jens Axboe
                   ` (12 preceding siblings ...)
  2024-11-08 17:43 ` [PATCH 13/13] xfs: " Jens Axboe
@ 2024-11-11 12:55 ` Stefan Metzmacher
  2024-11-11 14:08   ` Jens Axboe
  13 siblings, 1 reply; 36+ messages in thread
From: Stefan Metzmacher @ 2024-11-11 12:55 UTC (permalink / raw)
  To: Jens Axboe, linux-mm, linux-fsdevel; +Cc: hannes, clm, linux-kernel

Hi Jens,

I'm wondering about the impact on memory mapped files.

Let's say one (or more) process(es) called mmap on a file in order to
use the content of the file as persistent shared memory.
As far as I understand pages from the page cache are used for this.

Now another process uses RWF_UNCACHED for a read of the same file.
What happens if the pages are removed from the page cache?
Or is the removal deferred based on some refcount?

Thanks!
metze




* Re: [PATCH 08/13] fs: add read support for RWF_UNCACHED
  2024-11-08 17:43 ` [PATCH 08/13] fs: add read support for RWF_UNCACHED Jens Axboe
  2024-11-08 18:33   ` Matthew Wilcox
@ 2024-11-11 13:04   ` Stefan Metzmacher
  2024-11-11 14:10     ` Jens Axboe
  1 sibling, 1 reply; 36+ messages in thread
From: Stefan Metzmacher @ 2024-11-11 13:04 UTC (permalink / raw)
  To: Jens Axboe, linux-mm, linux-fsdevel; +Cc: hannes, clm, linux-kernel

Hi Jens,

> If the same test case is run with RWF_UNCACHED set for the buffered read,
> the output looks as follows:
> 
> Reading bs 65536, uncached 0
>    1s: 153144MB/sec
>    2s: 156760MB/sec
>    3s: 158110MB/sec
>    4s: 158009MB/sec
>    5s: 158043MB/sec
>    6s: 157638MB/sec
>    7s: 157999MB/sec
>    8s: 158024MB/sec
>    9s: 157764MB/sec
>   10s: 157477MB/sec
>   11s: 157417MB/sec
>   12s: 157455MB/sec
>   13s: 157233MB/sec
>   14s: 156692MB/sec
> 
> which is just chugging along at ~155GB/sec of read performance. Looking
> at top, we see:
> 
>   PID USER      PR  NI    VIRT    RES    SHR S  %CPU  %MEM     TIME+ COMMAND
> 7961 root      20   0  267004      0      0 S  3180   0.0   5:37.95 uncached
> 8024 axboe     20   0   14292   4096      0 R   1.0   0.0   0:00.13 top
> 
> where just the test app is using CPU, no reclaim is taking place outside
> of the main thread. Not only is performance 65% better, it's also using
> half the CPU to do it.

Do you have numbers of similar code using O_DIRECT just to
see the impact of the memcpy from the page cache to the userspace
buffer...

Thanks!
metze




* Re: [PATCHSET v4] Uncached buffered IO
  2024-11-11 12:55 ` [PATCHSET v4] Uncached buffered IO Stefan Metzmacher
@ 2024-11-11 14:08   ` Jens Axboe
  2024-11-11 15:05     ` Jens Axboe
  0 siblings, 1 reply; 36+ messages in thread
From: Jens Axboe @ 2024-11-11 14:08 UTC (permalink / raw)
  To: Stefan Metzmacher, linux-mm, linux-fsdevel; +Cc: hannes, clm, linux-kernel

On 11/11/24 5:55 AM, Stefan Metzmacher wrote:
> Hi Jens,
> 
> I'm wondering about the impact on memory mapped files.
> 
> Let's say one (or more) process(es) called mmap on a file in order to
> use the content of the file as persistent shared memory.
> As far as I understand pages from the page cache are used for this.
> 
> Now another process uses RWF_UNCACHED for a read of the same file.
> What happens if the pages are removed from the page cache?
> Or is the removal deferred based on some refcount?

For mmap, if a given page isn't in page cache, it'll get faulted in.
Should be fine to have mmap and uncached IO co-exist. If an uncached
read IO instantiates a page, it'll get reaped when the data has been
copied. If an uncached IO hits an already existing page (eg mmap faulted
it in), then it won't get touched. Same thing happens with mixing
buffered and uncached IO. The latter will only reap parts it
instantiated to satisfy the operation. That doesn't matter in terms of
data integrity, only in terms of the policy of uncached leaving things
alone it didn't create to satisfy the operation.

This is really no different than say using mmap and evicting pages, they
will just get faulted in if needed.

-- 
Jens Axboe



* Re: [PATCH 08/13] fs: add read support for RWF_UNCACHED
  2024-11-11 13:04   ` Stefan Metzmacher
@ 2024-11-11 14:10     ` Jens Axboe
  2024-11-11 15:44       ` Jens Axboe
  0 siblings, 1 reply; 36+ messages in thread
From: Jens Axboe @ 2024-11-11 14:10 UTC (permalink / raw)
  To: Stefan Metzmacher, linux-mm, linux-fsdevel; +Cc: hannes, clm, linux-kernel

On 11/11/24 6:04 AM, Stefan Metzmacher wrote:
> Hi Jens,
> 
>> If the same test case is run with RWF_UNCACHED set for the buffered read,
>> the output looks as follows:
>>
>> Reading bs 65536, uncached 0
>>    1s: 153144MB/sec
>>    2s: 156760MB/sec
>>    3s: 158110MB/sec
>>    4s: 158009MB/sec
>>    5s: 158043MB/sec
>>    6s: 157638MB/sec
>>    7s: 157999MB/sec
>>    8s: 158024MB/sec
>>    9s: 157764MB/sec
>>   10s: 157477MB/sec
>>   11s: 157417MB/sec
>>   12s: 157455MB/sec
>>   13s: 157233MB/sec
>>   14s: 156692MB/sec
>>
>> which is just chugging along at ~155GB/sec of read performance. Looking
>> at top, we see:
>>
>>   PID USER      PR  NI    VIRT    RES    SHR S  %CPU  %MEM     TIME+ COMMAND
>> 7961 root      20   0  267004      0      0 S  3180   0.0   5:37.95 uncached
>> 8024 axboe     20   0   14292   4096      0 R   1.0   0.0   0:00.13 top
>>
>> where just the test app is using CPU, no reclaim is taking place outside
>> of the main thread. Not only is performance 65% better, it's also using
>> half the CPU to do it.
> 
> Do you have numbers of similar code using O_DIRECT just to
> see the impact of the memcpy from the page cache to the userspace
> buffer...

I don't, but I can surely generate those. I didn't consider them that
interesting for this comparison, which is why I didn't run them. O_DIRECT
reads for bigger block sizes (or even smaller block sizes, if using
io_uring + registered buffers) will definitely have lower overhead than
uncached and buffered IO. Copying 160GB/sec isn't free :-)

For writes it's a bit more complicated to do an apples to apples
comparison, as uncached IO isn't synchronous like O_DIRECT is. It only
kicks off the IO, doesn't wait for it.

-- 
Jens Axboe



* Re: [PATCHSET v4] Uncached buffered IO
  2024-11-11 14:08   ` Jens Axboe
@ 2024-11-11 15:05     ` Jens Axboe
  2024-11-11 23:54       ` Jens Axboe
  0 siblings, 1 reply; 36+ messages in thread
From: Jens Axboe @ 2024-11-11 15:05 UTC (permalink / raw)
  To: Stefan Metzmacher, linux-mm, linux-fsdevel; +Cc: hannes, clm, linux-kernel

On 11/11/24 7:08 AM, Jens Axboe wrote:
> On 11/11/24 5:55 AM, Stefan Metzmacher wrote:
>> Hi Jens,
>>
>> I'm wondering about the impact on memory mapped files.
>>
>> Let's say one (or more) process(es) called mmap on a file in order to
>> use the content of the file as persistent shared memory.
>> As far as I understand pages from the page cache are used for this.
>>
>> Now another process uses RWF_UNCACHED for a read of the same file.
>> What happens if the pages are removed from the page cache?
>> Or is the removal deferred based on some refcount?
> 
> For mmap, if a given page isn't in page cache, it'll get faulted in.
> Should be fine to have mmap and uncached IO co-exist. If an uncached
> read IO instantiates a page, it'll get reaped when the data has been
> copied. If an uncached IO hits an already existing page (eg mmap faulted
> it in), then it won't get touched. Same thing happens with mixing
> buffered and uncached IO. The latter will only reap parts it
> instantiated to satisfy the operation. That doesn't matter in terms of
> data integrity, only in terms of the policy of uncached leaving things
> alone it didn't create to satisfy the operation.
> 
> This is really no different than say using mmap and evicting pages, they
> will just get faulted in if needed.

Turns out that was nonsense, as per Kirill's comments on the other thread.
For pages that are actually mapped, we'll have to skip the invalidation
as it's not safe to do so.

-- 
Jens Axboe




* Re: [PATCH 08/13] fs: add read support for RWF_UNCACHED
  2024-11-11 14:10     ` Jens Axboe
@ 2024-11-11 15:44       ` Jens Axboe
  0 siblings, 0 replies; 36+ messages in thread
From: Jens Axboe @ 2024-11-11 15:44 UTC (permalink / raw)
  To: Stefan Metzmacher, linux-mm, linux-fsdevel; +Cc: hannes, clm, linux-kernel

On 11/11/24 7:10 AM, Jens Axboe wrote:
> On 11/11/24 6:04 AM, Stefan Metzmacher wrote:
>> Hi Jens,
>>
>>> If the same test case is run with RWF_UNCACHED set for the buffered read,
>>> the output looks as follows:
>>>
>>> Reading bs 65536, uncached 0
>>>    1s: 153144MB/sec
>>>    2s: 156760MB/sec
>>>    3s: 158110MB/sec
>>>    4s: 158009MB/sec
>>>    5s: 158043MB/sec
>>>    6s: 157638MB/sec
>>>    7s: 157999MB/sec
>>>    8s: 158024MB/sec
>>>    9s: 157764MB/sec
>>>   10s: 157477MB/sec
>>>   11s: 157417MB/sec
>>>   12s: 157455MB/sec
>>>   13s: 157233MB/sec
>>>   14s: 156692MB/sec
>>>
>>> which is just chugging along at ~155GB/sec of read performance. Looking
>>> at top, we see:
>>>
>>>   PID USER      PR  NI    VIRT    RES    SHR S  %CPU  %MEM     TIME+ COMMAND
>>> 7961 root      20   0  267004      0      0 S  3180   0.0   5:37.95 uncached
>>> 8024 axboe     20   0   14292   4096      0 R   1.0   0.0   0:00.13 top
>>>
>>> where just the test app is using CPU, no reclaim is taking place outside
>>> of the main thread. Not only is performance 65% better, it's also using
>>> half the CPU to do it.
>>
>> Do you have numbers of similar code using O_DIRECT just to
>> see the impact of the memcpy from the page cache to the userspace
>> buffer...
> 
> I don't, but I can surely generate those. I didn't consider them that
> interesting for this comparison which is why I didn't do them, O_DIRECT
> reads for bigger block sizes (or even smaller block sizes, if using
> io_uring + registered buffers) will definitely have lower overhead than
> uncached and buffered IO. Copying 160GB/sec isn't free :-)
> 
> For writes it's a bit more complicated to do an apples to apples
> comparison, as uncached IO isn't synchronous like O_DIRECT is. It only
> kicks off the IO, doesn't wait for it.

Here's the read side - same test as above, using 64K reads:

  1s: 24947MB/sec
  2s: 24840MB/sec
  3s: 24666MB/sec
  4s: 24549MB/sec
  5s: 24575MB/sec
  6s: 24669MB/sec
  7s: 24611MB/sec
  8s: 24369MB/sec
  9s: 24261MB/sec
 10s: 24125MB/sec

which is in fact pretty depressing. As before, this is 32 threads, each
reading a file from separate XFS mount points, so 32 file systems in
total. If I bump the read size to 128K, then it's about 42GB/sec. 256K
gets you to 71-72GB/sec.

Just goes to show you, you need parallelism to get the best performance
out of the devices with O_DIRECT. If I run io_uring + dio + registered
buffers, I can get ~172GB/sec out of reading the same 32 files from 32
threads.

-- 
Jens Axboe



* Re: [PATCHSET v4] Uncached buffered IO
  2024-11-11 15:05     ` Jens Axboe
@ 2024-11-11 23:54       ` Jens Axboe
  0 siblings, 0 replies; 36+ messages in thread
From: Jens Axboe @ 2024-11-11 23:54 UTC (permalink / raw)
  To: Stefan Metzmacher, linux-mm, linux-fsdevel; +Cc: hannes, clm, linux-kernel

On 11/11/24 8:05 AM, Jens Axboe wrote:
> On 11/11/24 7:08 AM, Jens Axboe wrote:
>> On 11/11/24 5:55 AM, Stefan Metzmacher wrote:
>>> Hi Jens,
>>>
>>> I'm wondering about the impact on memory mapped files.
>>>
>>> Let's say one (or more) process(es) called mmap on a file in order to
>>> use the content of the file as persistent shared memory.
>>> As far as I understand pages from the page cache are used for this.
>>>
>>> Now another process uses RWF_UNCACHED for a read of the same file.
>>> What happens if the pages are removed from the page cache?
>>> Or is the removal deferred based on some refcount?
>>
>> For mmap, if a given page isn't in page cache, it'll get faulted in.
>> Should be fine to have mmap and uncached IO co-exist. If an uncached
>> read IO instantiates a page, it'll get reaped when the data has been
>> copied. If an uncached IO hits an already existing page (eg mmap faulted
>> it in), then it won't get touched. Same thing happens with mixing
>> buffered and uncached IO. The latter will only reap parts it
>> instantiated to satisfy the operation. That doesn't matter in terms of
>> data integrity, only in terms of the policy of uncached leaving things
>> alone it didn't create to satisfy the operation.
>>
>> This is really no different than say using mmap and evicting pages, they
>> will just get faulted in if needed.
> 
> Turns out that was nonsense, as per Kirill's comments on the other thread.
> For pages that are actually mapped, we'll have to skip the invalidation
> as it's not safe to do so.

...and now v3 (just posted) actually does work like I described, it'll
co-exist with mmap.

-- 
Jens Axboe



end of thread, other threads:[~2024-11-11 23:54 UTC | newest]

Thread overview: 36+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2024-11-08 17:43 [PATCHSET v4] Uncached buffered IO Jens Axboe
2024-11-08 17:43 ` [PATCH 01/13] mm/filemap: change filemap_create_folio() to take a struct kiocb Jens Axboe
2024-11-08 18:18   ` Matthew Wilcox
2024-11-08 19:22     ` Jens Axboe
2024-11-08 17:43 ` [PATCH 02/13] mm/readahead: add folio allocation helper Jens Axboe
2024-11-08 17:43 ` [PATCH 03/13] mm: add PG_uncached page flag Jens Axboe
2024-11-08 19:25   ` Kirill A. Shutemov
2024-11-08 19:39     ` Jens Axboe
2024-11-08 17:43 ` [PATCH 04/13] mm/readahead: add readahead_control->uncached member Jens Axboe
2024-11-08 18:21   ` Matthew Wilcox
2024-11-08 19:22     ` Jens Axboe
2024-11-08 17:43 ` [PATCH 05/13] mm/filemap: use page_cache_sync_ra() to kick off read-ahead Jens Axboe
2024-11-08 17:43 ` [PATCH 06/13] mm/truncate: make invalidate_complete_folio2() public Jens Axboe
2024-11-08 17:43 ` [PATCH 07/13] fs: add FOP_UNCACHED flag Jens Axboe
2024-11-08 18:27   ` Matthew Wilcox
2024-11-08 19:23     ` Jens Axboe
2024-11-08 17:43 ` [PATCH 08/13] fs: add read support for RWF_UNCACHED Jens Axboe
2024-11-08 18:33   ` Matthew Wilcox
2024-11-08 19:25     ` Jens Axboe
2024-11-11 13:04   ` Stefan Metzmacher
2024-11-11 14:10     ` Jens Axboe
2024-11-11 15:44       ` Jens Axboe
2024-11-08 17:43 ` [PATCH 09/13] mm: drop uncached pages when writeback completes Jens Axboe
2024-11-08 17:43 ` [PATCH 10/13] mm/filemap: make buffered writes work with RWF_UNCACHED Jens Axboe
2024-11-08 17:43 ` [PATCH 11/13] iomap: " Jens Axboe
2024-11-08 18:46   ` Matthew Wilcox
2024-11-08 19:26     ` Jens Axboe
2024-11-08 19:49       ` Jens Axboe
2024-11-08 20:07         ` Matthew Wilcox
2024-11-08 20:18           ` Jens Axboe
2024-11-08 17:43 ` [PATCH 12/13] ext4: flag as supporting FOP_UNCACHED Jens Axboe
2024-11-08 17:43 ` [PATCH 13/13] xfs: " Jens Axboe
2024-11-11 12:55 ` [PATCHSET v4] Uncached buffered IO Stefan Metzmacher
2024-11-11 14:08   ` Jens Axboe
2024-11-11 15:05     ` Jens Axboe
2024-11-11 23:54       ` Jens Axboe
