* [PATCHSET v3 0/16] Uncached buffered IO
@ 2024-11-11 23:37 Jens Axboe
  2024-11-11 23:37 ` [PATCH 01/16] mm/filemap: change filemap_create_folio() to take a struct kiocb Jens Axboe
                   ` (16 more replies)
  0 siblings, 17 replies; 37+ messages in thread
From: Jens Axboe @ 2024-11-11 23:37 UTC (permalink / raw)
  To: linux-mm, linux-fsdevel
  Cc: hannes, clm, linux-kernel, willy, kirill, linux-btrfs,
	linux-ext4, linux-xfs

Hi,

(A bit of version confusion: the ordering is v4 -> v2 -> v3, as v4 was
 a relic of the 5 year old version. The next posting will be v5, and
 the numbering will be consistent again.)

5 years ago I posted patches adding support for RWF_UNCACHED, as a way
to do buffered IO that isn't page cache persistent. The approach back
then was to have private pages for IO, and then get rid of them once IO
was done. But that then runs into all the issues that O_DIRECT has, in
terms of synchronizing with the page cache.

So here's a new approach to the same concept, but using the page cache
as synchronization. That makes RWF_UNCACHED less special, in that it's
just page cache IO, except it prunes the ranges once IO is completed.

Why do this, you may ask? The tldr is that device speeds are only
getting faster, while reclaim is not. Doing normal buffered IO can be
very unpredictable, and suck up a lot of resources on the reclaim side.
This leads people to use O_DIRECT as a work-around, which has its own
set of restrictions in terms of size, offset, and length of IO. It's
also inherently synchronous, and now you need async IO as well. While
the latter isn't necessarily a big problem as we have good options
available there, it also should not be a requirement when all you want
to do is read or write some data without caching.

Even on desktop type systems, a normal NVMe device can fill the entire
page cache in seconds. On the big system I used for testing, there's a
lot more RAM, but also a lot more devices. As can be seen in some of the
results in the following patches, you can still fill RAM in seconds even
when there's 1TB of it - at the ~155GB/sec aggregate read rate shown in
patch 8, 1TB of page cache fills in under 7 seconds. Hence this problem
isn't solely a "big hyperscaler system" issue, it's common across the
board.

Common for both reads and writes with RWF_UNCACHED is that they use the
page cache for IO. Reads work just like a normal buffered read would,
with the only exception being that the touched ranges will get pruned
after data has been copied. For writes, the ranges will get writeback
kicked off before the syscall returns, and then writeback completion
will prune the range. Hence writes aren't synchronous, and it's easy to
pipeline writes using RWF_UNCACHED. Folios that aren't instantiated by
RWF_UNCACHED IO are left untouched. This means that uncached IO
will take advantage of the page cache for uptodate data, but not leave
anything it instantiated/created in cache.

File systems need to support this. The patches add support for the
generic filemap helpers, and for iomap. Then ext4 and XFS are marked as
supporting it. The last patch adds support for btrfs as well, lightly
tested. The read side is already done by filemap, only the write side
needs a bit of help. The amount of code here is really trivial, and the
only reason the fs opt-in is necessary is to have an RWF_UNCACHED IO
return -EOPNOTSUPP just in case the fs doesn't use either the generic
paths or iomap. Adding "support" to other file systems should be
trivial, most of the time just a one-liner adding FOP_UNCACHED to the
fop_flags in the file_operations struct.
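
For illustration, the opt-in for a hypothetical "myfs" would look
something like the below sketch (the ext4, xfs, and btrfs patches later
in this series do the real conversions):

	static const struct file_operations myfs_file_operations = {
		.read_iter	= generic_file_read_iter,
		.write_iter	= generic_file_write_iter,
		/* advertise RWF_UNCACHED support */
		.fop_flags	= FOP_UNCACHED,
	};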

Performance results are in patch 8 for reads and patch 10 for writes,
with the tldr being that I see about a 65% improvement in performance
for both, with fully predictable IO times. CPU reduction is substantial
as well, with no kswapd activity at all for reclaim when using uncached
IO.

Using it from applications is trivial - just set RWF_UNCACHED for the
read or write, using pwritev2(2) or preadv2(2). For io_uring, same
thing, just set RWF_UNCACHED in sqe->rw_flags for a buffered read/write
operation. And that's it.
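
As a minimal userspace sketch (assuming the uapi header from this
series; on older headers the flag value from patch 7 can be defined by
hand, and kernels without this series will fail the call with
-EOPNOTSUPP):

	#define _GNU_SOURCE
	#include <fcntl.h>
	#include <sys/uio.h>
	#include <unistd.h>

	#ifndef RWF_UNCACHED
	#define RWF_UNCACHED	0x00000080	/* from patch 7's uapi addition */
	#endif

	int main(void)
	{
		char buf[65536];
		struct iovec iov = { .iov_base = buf, .iov_len = sizeof(buf) };
		int fd = open("testfile", O_RDONLY);
		ssize_t ret;

		if (fd < 0)
			return 1;
		/* normal buffered read, but folios it instantiates get pruned */
		ret = preadv2(fd, &iov, 1, 0, RWF_UNCACHED);
		close(fd);
		return ret < 0;
	}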

Patches 1..7 are just prep patches, and should have no functional
changes at all. Patch 8 adds support for the filemap path for
RWF_UNCACHED reads, patch 10 adds support for filemap RWF_UNCACHED
writes, and patches 12..16 add ext4, xfs/iomap, and btrfs support.

I ran this through xfstests, and it found some of the issues listed as
fixed below. This posted version passes the whole generic suite of
xfstests. The xfstests patch is here:

https://lore.kernel.org/linux-mm/3da73668-a954-47b9-b66d-bb2e719f5590@kernel.dk/

And git tree for the patches is here:

https://git.kernel.dk/cgit/linux/log/?h=buffered-uncached.6


 fs/btrfs/bio.c                 |   4 +-
 fs/btrfs/bio.h                 |   2 +
 fs/btrfs/extent_io.c           |   8 ++-
 fs/btrfs/file.c                |  10 +++-
 fs/ext4/ext4.h                 |   1 +
 fs/ext4/file.c                 |   2 +-
 fs/ext4/inline.c               |   7 ++-
 fs/ext4/inode.c                |  18 +++++-
 fs/ext4/page-io.c              |  28 +++++----
 fs/iomap/buffered-io.c         |  15 ++++-
 fs/xfs/xfs_aops.c              |   7 ++-
 fs/xfs/xfs_file.c              |   4 +-
 include/linux/fs.h             |  10 +++-
 include/linux/iomap.h          |   4 +-
 include/linux/page-flags.h     |   5 ++
 include/linux/pagemap.h        |  34 +++++++++++
 include/trace/events/mmflags.h |   3 +-
 include/uapi/linux/fs.h        |   6 +-
 mm/filemap.c                   | 101 ++++++++++++++++++++++++++++-----
 mm/readahead.c                 |  22 +++++--
 mm/swap.c                      |   2 +
 mm/truncate.c                  |  33 ++++++-----
 22 files changed, 262 insertions(+), 64 deletions(-)

Since v2
- Add patch for btrfs to work on the write side, read side was already
  covered by the generic filemap changes. Now btrfs is FOP_UNCACHED
  enabled as well.
- Add folio_unmap_invalidate() helper, and use that from both the core
  code and the uncached handling.
- Add filemap_uncached_read() helper to encapsulate the uncached
  handling on the read side.
- Enable handling of invalidation of mapped folios
- Clear uncached in looked up folio, if FGP_UNCACHED isn't set. For this
  case, there are competing non-uncached page cache users and the folio
  should not get invalidated.
- Various little tweaks or comments.
- Ran fsstress with read/write uncached support, no issues seen
- Fixup a commit message
- Rebase on 6.12-rc7

-- 
Jens Axboe




* [PATCH 01/16] mm/filemap: change filemap_create_folio() to take a struct kiocb
  2024-11-11 23:37 [PATCHSET v3 0/16] Uncached buffered IO Jens Axboe
@ 2024-11-11 23:37 ` Jens Axboe
  2024-11-11 23:37 ` [PATCH 02/16] mm/readahead: add folio allocation helper Jens Axboe
                   ` (15 subsequent siblings)
  16 siblings, 0 replies; 37+ messages in thread
From: Jens Axboe @ 2024-11-11 23:37 UTC (permalink / raw)
  To: linux-mm, linux-fsdevel
  Cc: hannes, clm, linux-kernel, willy, kirill, linux-btrfs,
	linux-ext4, linux-xfs, Jens Axboe

Rather than pass in both the file and position directly from the kiocb,
just take a struct kiocb instead. While doing so, move the ki_flags
checking into filemap_create_folio() as well, in preparation for
actually needing the kiocb in the function.

No functional changes in this patch.

Signed-off-by: Jens Axboe <axboe@kernel.dk>
---
 mm/filemap.c | 17 +++++++++--------
 1 file changed, 9 insertions(+), 8 deletions(-)

diff --git a/mm/filemap.c b/mm/filemap.c
index 56fa431c52af..91974308e9bf 100644
--- a/mm/filemap.c
+++ b/mm/filemap.c
@@ -2460,15 +2460,17 @@ static int filemap_update_page(struct kiocb *iocb,
 	return error;
 }
 
-static int filemap_create_folio(struct file *file,
-		struct address_space *mapping, loff_t pos,
-		struct folio_batch *fbatch)
+static int filemap_create_folio(struct kiocb *iocb,
+		struct address_space *mapping, struct folio_batch *fbatch)
 {
 	struct folio *folio;
 	int error;
 	unsigned int min_order = mapping_min_folio_order(mapping);
 	pgoff_t index;
 
+	if (iocb->ki_flags & (IOCB_NOWAIT | IOCB_WAITQ))
+		return -EAGAIN;
+
 	folio = filemap_alloc_folio(mapping_gfp_mask(mapping), min_order);
 	if (!folio)
 		return -ENOMEM;
@@ -2487,7 +2489,7 @@ static int filemap_create_folio(struct file *file,
 	 * well to keep locking rules simple.
 	 */
 	filemap_invalidate_lock_shared(mapping);
-	index = (pos >> (PAGE_SHIFT + min_order)) << min_order;
+	index = (iocb->ki_pos >> (PAGE_SHIFT + min_order)) << min_order;
 	error = filemap_add_folio(mapping, folio, index,
 			mapping_gfp_constraint(mapping, GFP_KERNEL));
 	if (error == -EEXIST)
@@ -2495,7 +2497,8 @@ static int filemap_create_folio(struct file *file,
 	if (error)
 		goto error;
 
-	error = filemap_read_folio(file, mapping->a_ops->read_folio, folio);
+	error = filemap_read_folio(iocb->ki_filp, mapping->a_ops->read_folio,
+					folio);
 	if (error)
 		goto error;
 
@@ -2551,9 +2554,7 @@ static int filemap_get_pages(struct kiocb *iocb, size_t count,
 		filemap_get_read_batch(mapping, index, last_index - 1, fbatch);
 	}
 	if (!folio_batch_count(fbatch)) {
-		if (iocb->ki_flags & (IOCB_NOWAIT | IOCB_WAITQ))
-			return -EAGAIN;
-		err = filemap_create_folio(filp, mapping, iocb->ki_pos, fbatch);
+		err = filemap_create_folio(iocb, mapping, fbatch);
 		if (err == AOP_TRUNCATED_PAGE)
 			goto retry;
 		return err;
-- 
2.45.2




* [PATCH 02/16] mm/readahead: add folio allocation helper
  2024-11-11 23:37 [PATCHSET v3 0/16] Uncached buffered IO Jens Axboe
  2024-11-11 23:37 ` [PATCH 01/16] mm/filemap: change filemap_create_folio() to take a struct kiocb Jens Axboe
@ 2024-11-11 23:37 ` Jens Axboe
  2024-11-11 23:37 ` [PATCH 03/16] mm: add PG_uncached page flag Jens Axboe
                   ` (14 subsequent siblings)
  16 siblings, 0 replies; 37+ messages in thread
From: Jens Axboe @ 2024-11-11 23:37 UTC (permalink / raw)
  To: linux-mm, linux-fsdevel
  Cc: hannes, clm, linux-kernel, willy, kirill, linux-btrfs,
	linux-ext4, linux-xfs, Jens Axboe

Just a wrapper around filemap_alloc_folio() for now, but add it in
preparation for modifying the folio based on the 'ractl' being passed
in.

No functional changes in this patch.

Signed-off-by: Jens Axboe <axboe@kernel.dk>
---
 mm/readahead.c | 16 +++++++++++-----
 1 file changed, 11 insertions(+), 5 deletions(-)

diff --git a/mm/readahead.c b/mm/readahead.c
index 3dc6c7a128dd..003cfe79880d 100644
--- a/mm/readahead.c
+++ b/mm/readahead.c
@@ -188,6 +188,12 @@ static void read_pages(struct readahead_control *rac)
 	BUG_ON(readahead_count(rac));
 }
 
+static struct folio *ractl_alloc_folio(struct readahead_control *ractl,
+				       gfp_t gfp_mask, unsigned int order)
+{
+	return filemap_alloc_folio(gfp_mask, order);
+}
+
 /**
  * page_cache_ra_unbounded - Start unchecked readahead.
  * @ractl: Readahead control.
@@ -260,8 +266,8 @@ void page_cache_ra_unbounded(struct readahead_control *ractl,
 			continue;
 		}
 
-		folio = filemap_alloc_folio(gfp_mask,
-					    mapping_min_folio_order(mapping));
+		folio = ractl_alloc_folio(ractl, gfp_mask,
+					mapping_min_folio_order(mapping));
 		if (!folio)
 			break;
 
@@ -431,7 +437,7 @@ static inline int ra_alloc_folio(struct readahead_control *ractl, pgoff_t index,
 		pgoff_t mark, unsigned int order, gfp_t gfp)
 {
 	int err;
-	struct folio *folio = filemap_alloc_folio(gfp, order);
+	struct folio *folio = ractl_alloc_folio(ractl, gfp, order);
 
 	if (!folio)
 		return -ENOMEM;
@@ -753,7 +759,7 @@ void readahead_expand(struct readahead_control *ractl,
 		if (folio && !xa_is_value(folio))
 			return; /* Folio apparently present */
 
-		folio = filemap_alloc_folio(gfp_mask, min_order);
+		folio = ractl_alloc_folio(ractl, gfp_mask, min_order);
 		if (!folio)
 			return;
 
@@ -782,7 +788,7 @@ void readahead_expand(struct readahead_control *ractl,
 		if (folio && !xa_is_value(folio))
 			return; /* Folio apparently present */
 
-		folio = filemap_alloc_folio(gfp_mask, min_order);
+		folio = ractl_alloc_folio(ractl, gfp_mask, min_order);
 		if (!folio)
 			return;
 
-- 
2.45.2




* [PATCH 03/16] mm: add PG_uncached page flag
  2024-11-11 23:37 [PATCHSET v3 0/16] Uncached buffered IO Jens Axboe
  2024-11-11 23:37 ` [PATCH 01/16] mm/filemap: change filemap_create_folio() to take a struct kiocb Jens Axboe
  2024-11-11 23:37 ` [PATCH 02/16] mm/readahead: add folio allocation helper Jens Axboe
@ 2024-11-11 23:37 ` Jens Axboe
  2024-11-12  9:12   ` Kirill A. Shutemov
  2024-11-11 23:37 ` [PATCH 04/16] mm/readahead: add readahead_control->uncached member Jens Axboe
                   ` (13 subsequent siblings)
  16 siblings, 1 reply; 37+ messages in thread
From: Jens Axboe @ 2024-11-11 23:37 UTC (permalink / raw)
  To: linux-mm, linux-fsdevel
  Cc: hannes, clm, linux-kernel, willy, kirill, linux-btrfs,
	linux-ext4, linux-xfs, Jens Axboe

Add a page flag that file IO can use to indicate that the IO being done
is uncached, as in it should not persist in the page cache after the IO
has been completed.

Signed-off-by: Jens Axboe <axboe@kernel.dk>
---
 include/linux/page-flags.h     | 5 +++++
 include/trace/events/mmflags.h | 3 ++-
 2 files changed, 7 insertions(+), 1 deletion(-)

diff --git a/include/linux/page-flags.h b/include/linux/page-flags.h
index cc839e4365c1..3c4003495929 100644
--- a/include/linux/page-flags.h
+++ b/include/linux/page-flags.h
@@ -110,6 +110,7 @@ enum pageflags {
 	PG_reclaim,		/* To be reclaimed asap */
 	PG_swapbacked,		/* Page is backed by RAM/swap */
 	PG_unevictable,		/* Page is "unevictable"  */
+	PG_uncached,		/* uncached read/write IO */
 #ifdef CONFIG_MMU
 	PG_mlocked,		/* Page is vma mlocked */
 #endif
@@ -562,6 +563,10 @@ PAGEFLAG(Reclaim, reclaim, PF_NO_TAIL)
 FOLIO_FLAG(readahead, FOLIO_HEAD_PAGE)
 	FOLIO_TEST_CLEAR_FLAG(readahead, FOLIO_HEAD_PAGE)
 
+FOLIO_FLAG(uncached, FOLIO_HEAD_PAGE)
+	FOLIO_TEST_CLEAR_FLAG(uncached, FOLIO_HEAD_PAGE)
+	__FOLIO_SET_FLAG(uncached, FOLIO_HEAD_PAGE)
+
 #ifdef CONFIG_HIGHMEM
 /*
  * Must use a macro here due to header dependency issues. page_zone() is not
diff --git a/include/trace/events/mmflags.h b/include/trace/events/mmflags.h
index bb8a59c6caa2..b60057284102 100644
--- a/include/trace/events/mmflags.h
+++ b/include/trace/events/mmflags.h
@@ -116,7 +116,8 @@
 	DEF_PAGEFLAG_NAME(head),					\
 	DEF_PAGEFLAG_NAME(reclaim),					\
 	DEF_PAGEFLAG_NAME(swapbacked),					\
-	DEF_PAGEFLAG_NAME(unevictable)					\
+	DEF_PAGEFLAG_NAME(unevictable),					\
+	DEF_PAGEFLAG_NAME(uncached)					\
 IF_HAVE_PG_MLOCK(mlocked)						\
 IF_HAVE_PG_HWPOISON(hwpoison)						\
 IF_HAVE_PG_IDLE(idle)							\
-- 
2.45.2




* [PATCH 04/16] mm/readahead: add readahead_control->uncached member
  2024-11-11 23:37 [PATCHSET v3 0/16] Uncached buffered IO Jens Axboe
                   ` (2 preceding siblings ...)
  2024-11-11 23:37 ` [PATCH 03/16] mm: add PG_uncached page flag Jens Axboe
@ 2024-11-11 23:37 ` Jens Axboe
  2024-11-11 23:37 ` [PATCH 05/16] mm/filemap: use page_cache_sync_ra() to kick off read-ahead Jens Axboe
                   ` (12 subsequent siblings)
  16 siblings, 0 replies; 37+ messages in thread
From: Jens Axboe @ 2024-11-11 23:37 UTC (permalink / raw)
  To: linux-mm, linux-fsdevel
  Cc: hannes, clm, linux-kernel, willy, kirill, linux-btrfs,
	linux-ext4, linux-xfs, Jens Axboe

If ractl->uncached is set to true, then folios created are marked as
uncached as well.

Signed-off-by: Jens Axboe <axboe@kernel.dk>
---
 include/linux/pagemap.h | 1 +
 mm/readahead.c          | 8 +++++++-
 2 files changed, 8 insertions(+), 1 deletion(-)

diff --git a/include/linux/pagemap.h b/include/linux/pagemap.h
index 68a5f1ff3301..8afacb7520d4 100644
--- a/include/linux/pagemap.h
+++ b/include/linux/pagemap.h
@@ -1350,6 +1350,7 @@ struct readahead_control {
 	pgoff_t _index;
 	unsigned int _nr_pages;
 	unsigned int _batch_count;
+	bool uncached;
 	bool _workingset;
 	unsigned long _pflags;
 };
diff --git a/mm/readahead.c b/mm/readahead.c
index 003cfe79880d..8dbeab9bc1f0 100644
--- a/mm/readahead.c
+++ b/mm/readahead.c
@@ -191,7 +191,13 @@ static void read_pages(struct readahead_control *rac)
 static struct folio *ractl_alloc_folio(struct readahead_control *ractl,
 				       gfp_t gfp_mask, unsigned int order)
 {
-	return filemap_alloc_folio(gfp_mask, order);
+	struct folio *folio;
+
+	folio = filemap_alloc_folio(gfp_mask, order);
+	if (folio && ractl->uncached)
+		__folio_set_uncached(folio);
+
+	return folio;
 }
 
 /**
-- 
2.45.2




* [PATCH 05/16] mm/filemap: use page_cache_sync_ra() to kick off read-ahead
  2024-11-11 23:37 [PATCHSET v3 0/16] Uncached buffered IO Jens Axboe
                   ` (3 preceding siblings ...)
  2024-11-11 23:37 ` [PATCH 04/16] mm/readahead: add readahead_control->uncached member Jens Axboe
@ 2024-11-11 23:37 ` Jens Axboe
  2024-11-11 23:37 ` [PATCH 06/16] mm/truncate: add folio_unmap_invalidate() helper Jens Axboe
                   ` (11 subsequent siblings)
  16 siblings, 0 replies; 37+ messages in thread
From: Jens Axboe @ 2024-11-11 23:37 UTC (permalink / raw)
  To: linux-mm, linux-fsdevel
  Cc: hannes, clm, linux-kernel, willy, kirill, linux-btrfs,
	linux-ext4, linux-xfs, Jens Axboe

Rather than use the page_cache_sync_readahead() helper, define our own
ractl and use page_cache_sync_ra() directly, in preparation for needing
to modify the ractl inside filemap_get_pages().

No functional changes in this patch.

Signed-off-by: Jens Axboe <axboe@kernel.dk>
---
 mm/filemap.c | 6 +++---
 1 file changed, 3 insertions(+), 3 deletions(-)

diff --git a/mm/filemap.c b/mm/filemap.c
index 91974308e9bf..02d9cb585195 100644
--- a/mm/filemap.c
+++ b/mm/filemap.c
@@ -2528,7 +2528,6 @@ static int filemap_get_pages(struct kiocb *iocb, size_t count,
 {
 	struct file *filp = iocb->ki_filp;
 	struct address_space *mapping = filp->f_mapping;
-	struct file_ra_state *ra = &filp->f_ra;
 	pgoff_t index = iocb->ki_pos >> PAGE_SHIFT;
 	pgoff_t last_index;
 	struct folio *folio;
@@ -2543,12 +2542,13 @@ static int filemap_get_pages(struct kiocb *iocb, size_t count,
 
 	filemap_get_read_batch(mapping, index, last_index - 1, fbatch);
 	if (!folio_batch_count(fbatch)) {
+		DEFINE_READAHEAD(ractl, filp, &filp->f_ra, mapping, index);
+
 		if (iocb->ki_flags & IOCB_NOIO)
 			return -EAGAIN;
 		if (iocb->ki_flags & IOCB_NOWAIT)
 			flags = memalloc_noio_save();
-		page_cache_sync_readahead(mapping, ra, filp, index,
-				last_index - index);
+		page_cache_sync_ra(&ractl, last_index - index);
 		if (iocb->ki_flags & IOCB_NOWAIT)
 			memalloc_noio_restore(flags);
 		filemap_get_read_batch(mapping, index, last_index - 1, fbatch);
-- 
2.45.2




* [PATCH 06/16] mm/truncate: add folio_unmap_invalidate() helper
  2024-11-11 23:37 [PATCHSET v3 0/16] Uncached buffered IO Jens Axboe
                   ` (4 preceding siblings ...)
  2024-11-11 23:37 ` [PATCH 05/16] mm/filemap: use page_cache_sync_ra() to kick off read-ahead Jens Axboe
@ 2024-11-11 23:37 ` Jens Axboe
  2024-11-11 23:37 ` [PATCH 07/16] fs: add RWF_UNCACHED iocb and FOP_UNCACHED file_operations flag Jens Axboe
                   ` (10 subsequent siblings)
  16 siblings, 0 replies; 37+ messages in thread
From: Jens Axboe @ 2024-11-11 23:37 UTC (permalink / raw)
  To: linux-mm, linux-fsdevel
  Cc: hannes, clm, linux-kernel, willy, kirill, linux-btrfs,
	linux-ext4, linux-xfs, Jens Axboe

Add a folio_unmap_invalidate() helper, which unmaps and invalidates a
given folio. The caller must already have locked the folio. Use this
new helper in invalidate_inode_pages2_range(), rather than duplicate
the code there.

In preparation for using this elsewhere as well, have it take a gfp_t
mask rather than assume GFP_KERNEL is the right choice. This bubbles
back to invalidate_complete_folio2() as well.

Signed-off-by: Jens Axboe <axboe@kernel.dk>
---
 include/linux/pagemap.h |  2 ++
 mm/truncate.c           | 33 ++++++++++++++++++++-------------
 2 files changed, 22 insertions(+), 13 deletions(-)

diff --git a/include/linux/pagemap.h b/include/linux/pagemap.h
index 8afacb7520d4..d55bf995bd9e 100644
--- a/include/linux/pagemap.h
+++ b/include/linux/pagemap.h
@@ -34,6 +34,8 @@ int kiocb_invalidate_pages(struct kiocb *iocb, size_t count);
 void kiocb_invalidate_post_direct_write(struct kiocb *iocb, size_t count);
 int filemap_invalidate_pages(struct address_space *mapping,
 			     loff_t pos, loff_t end, bool nowait);
+int folio_unmap_invalidate(struct address_space *mapping, struct folio *folio,
+			   gfp_t gfp);
 
 int write_inode_now(struct inode *, int sync);
 int filemap_fdatawrite(struct address_space *);
diff --git a/mm/truncate.c b/mm/truncate.c
index 0668cd340a46..5663c3f1d548 100644
--- a/mm/truncate.c
+++ b/mm/truncate.c
@@ -547,12 +547,12 @@ EXPORT_SYMBOL(invalidate_mapping_pages);
  * sitting in the folio_add_lru() caches.
  */
 static int invalidate_complete_folio2(struct address_space *mapping,
-					struct folio *folio)
+				      struct folio *folio, gfp_t gfp_mask)
 {
 	if (folio->mapping != mapping)
 		return 0;
 
-	if (!filemap_release_folio(folio, GFP_KERNEL))
+	if (!filemap_release_folio(folio, gfp_mask))
 		return 0;
 
 	spin_lock(&mapping->host->i_lock);
@@ -584,6 +584,23 @@ static int folio_launder(struct address_space *mapping, struct folio *folio)
 	return mapping->a_ops->launder_folio(folio);
 }
 
+int folio_unmap_invalidate(struct address_space *mapping, struct folio *folio,
+			   gfp_t gfp)
+{
+	int ret;
+
+	VM_BUG_ON_FOLIO(!folio_test_locked(folio), folio);
+
+	if (folio_mapped(folio))
+		unmap_mapping_folio(folio);
+	BUG_ON(folio_mapped(folio));
+
+	ret = folio_launder(mapping, folio);
+	if (!ret && !invalidate_complete_folio2(mapping, folio, gfp))
+		return -EBUSY;
+	return ret;
+}
+
 /**
  * invalidate_inode_pages2_range - remove range of pages from an address_space
  * @mapping: the address_space
@@ -641,18 +658,8 @@ int invalidate_inode_pages2_range(struct address_space *mapping,
 				folio_unlock(folio);
 				continue;
 			}
-			VM_BUG_ON_FOLIO(!folio_contains(folio, indices[i]), folio);
 			folio_wait_writeback(folio);
-
-			if (folio_mapped(folio))
-				unmap_mapping_folio(folio);
-			BUG_ON(folio_mapped(folio));
-
-			ret2 = folio_launder(mapping, folio);
-			if (ret2 == 0) {
-				if (!invalidate_complete_folio2(mapping, folio))
-					ret2 = -EBUSY;
-			}
+			ret2 = folio_unmap_invalidate(mapping, folio, GFP_KERNEL);
 			if (ret2 < 0)
 				ret = ret2;
 			folio_unlock(folio);
-- 
2.45.2




* [PATCH 07/16] fs: add RWF_UNCACHED iocb and FOP_UNCACHED file_operations flag
  2024-11-11 23:37 [PATCHSET v3 0/16] Uncached buffered IO Jens Axboe
                   ` (5 preceding siblings ...)
  2024-11-11 23:37 ` [PATCH 06/16] mm/truncate: add folio_unmap_invalidate() helper Jens Axboe
@ 2024-11-11 23:37 ` Jens Axboe
  2024-11-11 23:37 ` [PATCH 08/16] mm/filemap: add read support for RWF_UNCACHED Jens Axboe
                   ` (9 subsequent siblings)
  16 siblings, 0 replies; 37+ messages in thread
From: Jens Axboe @ 2024-11-11 23:37 UTC (permalink / raw)
  To: linux-mm, linux-fsdevel
  Cc: hannes, clm, linux-kernel, willy, kirill, linux-btrfs,
	linux-ext4, linux-xfs, Jens Axboe

If a file system supports uncached buffered IO, it may set FOP_UNCACHED
and enable RWF_UNCACHED. If RWF_UNCACHED is attempted on a file system
that doesn't support it, the IO will fail with -EOPNOTSUPP.

Signed-off-by: Jens Axboe <axboe@kernel.dk>
---
 include/linux/fs.h      | 10 +++++++++-
 include/uapi/linux/fs.h |  6 +++++-
 2 files changed, 14 insertions(+), 2 deletions(-)

diff --git a/include/linux/fs.h b/include/linux/fs.h
index 3559446279c1..5abc53991cd0 100644
--- a/include/linux/fs.h
+++ b/include/linux/fs.h
@@ -320,6 +320,7 @@ struct readahead_control;
 #define IOCB_NOWAIT		(__force int) RWF_NOWAIT
 #define IOCB_APPEND		(__force int) RWF_APPEND
 #define IOCB_ATOMIC		(__force int) RWF_ATOMIC
+#define IOCB_UNCACHED		(__force int) RWF_UNCACHED
 
 /* non-RWF related bits - start at 16 */
 #define IOCB_EVENTFD		(1 << 16)
@@ -354,7 +355,8 @@ struct readahead_control;
 	{ IOCB_SYNC,		"SYNC" }, \
 	{ IOCB_NOWAIT,		"NOWAIT" }, \
 	{ IOCB_APPEND,		"APPEND" }, \
-	{ IOCB_ATOMIC,		"ATOMIC"}, \
+	{ IOCB_ATOMIC,		"ATOMIC" }, \
+	{ IOCB_UNCACHED,	"UNCACHED" }, \
 	{ IOCB_EVENTFD,		"EVENTFD"}, \
 	{ IOCB_DIRECT,		"DIRECT" }, \
 	{ IOCB_WRITE,		"WRITE" }, \
@@ -2116,6 +2118,8 @@ struct file_operations {
 #define FOP_HUGE_PAGES		((__force fop_flags_t)(1 << 4))
 /* Treat loff_t as unsigned (e.g., /dev/mem) */
 #define FOP_UNSIGNED_OFFSET	((__force fop_flags_t)(1 << 5))
+/* File system supports uncached read/write buffered IO */
+#define FOP_UNCACHED		((__force fop_flags_t)(1 << 6))
 
 /* Wrap a directory iterator that needs exclusive inode access */
 int wrap_directory_iterator(struct file *, struct dir_context *,
@@ -3532,6 +3536,10 @@ static inline int kiocb_set_rw_flags(struct kiocb *ki, rwf_t flags,
 		if (!(ki->ki_filp->f_mode & FMODE_CAN_ATOMIC_WRITE))
 			return -EOPNOTSUPP;
 	}
+	if (flags & RWF_UNCACHED) {
+		if (!(ki->ki_filp->f_op->fop_flags & FOP_UNCACHED))
+			return -EOPNOTSUPP;
+	}
 	kiocb_flags |= (__force int) (flags & RWF_SUPPORTED);
 	if (flags & RWF_SYNC)
 		kiocb_flags |= IOCB_DSYNC;
diff --git a/include/uapi/linux/fs.h b/include/uapi/linux/fs.h
index 753971770733..dc77cd8ae1a3 100644
--- a/include/uapi/linux/fs.h
+++ b/include/uapi/linux/fs.h
@@ -332,9 +332,13 @@ typedef int __bitwise __kernel_rwf_t;
 /* Atomic Write */
 #define RWF_ATOMIC	((__force __kernel_rwf_t)0x00000040)
 
+/* buffered IO that drops the cache after reading or writing data */
+#define RWF_UNCACHED	((__force __kernel_rwf_t)0x00000080)
+
 /* mask of flags supported by the kernel */
 #define RWF_SUPPORTED	(RWF_HIPRI | RWF_DSYNC | RWF_SYNC | RWF_NOWAIT |\
-			 RWF_APPEND | RWF_NOAPPEND | RWF_ATOMIC)
+			 RWF_APPEND | RWF_NOAPPEND | RWF_ATOMIC |\
+			 RWF_UNCACHED)
 
 #define PROCFS_IOCTL_MAGIC 'f'
 
-- 
2.45.2




* [PATCH 08/16] mm/filemap: add read support for RWF_UNCACHED
  2024-11-11 23:37 [PATCHSET v3 0/16] Uncached buffered IO Jens Axboe
                   ` (6 preceding siblings ...)
  2024-11-11 23:37 ` [PATCH 07/16] fs: add RWF_UNCACHED iocb and FOP_UNCACHED file_operations flag Jens Axboe
@ 2024-11-11 23:37 ` Jens Axboe
  2024-11-11 23:37 ` [PATCH 09/16] mm/filemap: drop uncached pages when writeback completes Jens Axboe
                   ` (8 subsequent siblings)
  16 siblings, 0 replies; 37+ messages in thread
From: Jens Axboe @ 2024-11-11 23:37 UTC (permalink / raw)
  To: linux-mm, linux-fsdevel
  Cc: hannes, clm, linux-kernel, willy, kirill, linux-btrfs,
	linux-ext4, linux-xfs, Jens Axboe

Add RWF_UNCACHED as a read operation flag, which means that any data
read will be removed from the page cache upon completion. Uses the page
cache to synchronize, and simply prunes folios that were instantiated
when the operation completes. While it would be possible to use private
pages for this, using the page cache as synchronization is handy for a
variety of reasons:

1) No special truncate magic is needed
2) Async buffered reads need some place to serialize, and using the
   page cache is a lot easier than writing extra code for this
3) The pruning cost is pretty reasonable

and the code to support this is much simpler as a result.

You can think of uncached buffered IO as being the much more attractive
cousin of O_DIRECT - it has none of the restrictions of O_DIRECT. Yes,
it will copy the data, but unlike regular buffered IO, it doesn't run
into the unpredictability of the page cache in terms of reclaim. As an
example, on a test box with 32 drives, reading them with buffered IO
looks as follows:

Reading bs 65536, uncached 0
  1s: 145945MB/sec
  2s: 158067MB/sec
  3s: 157007MB/sec
  4s: 148622MB/sec
  5s: 118824MB/sec
  6s: 70494MB/sec
  7s: 41754MB/sec
  8s: 90811MB/sec
  9s: 92204MB/sec
 10s: 95178MB/sec
 11s: 95488MB/sec
 12s: 95552MB/sec
 13s: 96275MB/sec

where it's quite easy to see where the page cache filled up, and
performance went from good to erratic, and finally settled at a much
lower rate. Looking at top while this is ongoing, we see:

 PID USER      PR  NI    VIRT    RES    SHR S  %CPU  %MEM     TIME+ COMMAND
7535 root      20   0  267004      0      0 S  3199   0.0   8:40.65 uncached
3326 root      20   0       0      0      0 R 100.0   0.0   0:16.40 kswapd4
3327 root      20   0       0      0      0 R 100.0   0.0   0:17.22 kswapd5
3328 root      20   0       0      0      0 R 100.0   0.0   0:13.29 kswapd6
3332 root      20   0       0      0      0 R 100.0   0.0   0:11.11 kswapd10
3339 root      20   0       0      0      0 R 100.0   0.0   0:16.25 kswapd17
3348 root      20   0       0      0      0 R 100.0   0.0   0:16.40 kswapd26
3343 root      20   0       0      0      0 R 100.0   0.0   0:16.30 kswapd21
3344 root      20   0       0      0      0 R 100.0   0.0   0:11.92 kswapd22
3349 root      20   0       0      0      0 R 100.0   0.0   0:16.28 kswapd27
3352 root      20   0       0      0      0 R  99.7   0.0   0:11.89 kswapd30
3353 root      20   0       0      0      0 R  96.7   0.0   0:16.04 kswapd31
3329 root      20   0       0      0      0 R  96.4   0.0   0:11.41 kswapd7
3345 root      20   0       0      0      0 R  96.4   0.0   0:13.40 kswapd23
3330 root      20   0       0      0      0 S  91.1   0.0   0:08.28 kswapd8
3350 root      20   0       0      0      0 S  86.8   0.0   0:11.13 kswapd28
3325 root      20   0       0      0      0 S  76.3   0.0   0:07.43 kswapd3
3341 root      20   0       0      0      0 S  74.7   0.0   0:08.85 kswapd19
3334 root      20   0       0      0      0 S  71.7   0.0   0:10.04 kswapd12
3351 root      20   0       0      0      0 R  60.5   0.0   0:09.59 kswapd29
3323 root      20   0       0      0      0 R  57.6   0.0   0:11.50 kswapd1
[...]

which is just showing a partial list of the 32 kswapd threads that are
running mostly full tilt, burning ~28 full CPU cores.

If the same test case is run with RWF_UNCACHED set for the buffered read,
the output looks as follows:

Reading bs 65536, uncached 1
  1s: 153144MB/sec
  2s: 156760MB/sec
  3s: 158110MB/sec
  4s: 158009MB/sec
  5s: 158043MB/sec
  6s: 157638MB/sec
  7s: 157999MB/sec
  8s: 158024MB/sec
  9s: 157764MB/sec
 10s: 157477MB/sec
 11s: 157417MB/sec
 12s: 157455MB/sec
 13s: 157233MB/sec
 14s: 156692MB/sec

which is just chugging along at ~155GB/sec of read performance. Looking
at top, we see:

 PID USER      PR  NI    VIRT    RES    SHR S  %CPU  %MEM     TIME+ COMMAND
7961 root      20   0  267004      0      0 S  3180   0.0   5:37.95 uncached
8024 axboe     20   0   14292   4096      0 R   1.0   0.0   0:00.13 top

where just the test app is using CPU, no reclaim is taking place outside
of the main thread. Not only is performance 65% better, it's also using
half the CPU to do it.

Signed-off-by: Jens Axboe <axboe@kernel.dk>
---
 mm/filemap.c | 28 ++++++++++++++++++++++++++--
 mm/swap.c    |  2 ++
 2 files changed, 28 insertions(+), 2 deletions(-)

diff --git a/mm/filemap.c b/mm/filemap.c
index 02d9cb585195..3d0614ea5f59 100644
--- a/mm/filemap.c
+++ b/mm/filemap.c
@@ -2474,6 +2474,8 @@ static int filemap_create_folio(struct kiocb *iocb,
 	folio = filemap_alloc_folio(mapping_gfp_mask(mapping), min_order);
 	if (!folio)
 		return -ENOMEM;
+	if (iocb->ki_flags & IOCB_UNCACHED)
+		__folio_set_uncached(folio);
 
 	/*
 	 * Protect against truncate / hole punch. Grabbing invalidate_lock
@@ -2519,6 +2521,8 @@ static int filemap_readahead(struct kiocb *iocb, struct file *file,
 
 	if (iocb->ki_flags & IOCB_NOIO)
 		return -EAGAIN;
+	if (iocb->ki_flags & IOCB_UNCACHED)
+		ractl.uncached = 1;
 	page_cache_async_ra(&ractl, folio, last_index - folio->index);
 	return 0;
 }
@@ -2548,6 +2552,8 @@ static int filemap_get_pages(struct kiocb *iocb, size_t count,
 			return -EAGAIN;
 		if (iocb->ki_flags & IOCB_NOWAIT)
 			flags = memalloc_noio_save();
+		if (iocb->ki_flags & IOCB_UNCACHED)
+			ractl.uncached = 1;
 		page_cache_sync_ra(&ractl, last_index - index);
 		if (iocb->ki_flags & IOCB_NOWAIT)
 			memalloc_noio_restore(flags);
@@ -2595,6 +2601,20 @@ static inline bool pos_same_folio(loff_t pos1, loff_t pos2, struct folio *folio)
 	return (pos1 >> shift == pos2 >> shift);
 }
 
+static void filemap_uncached_read(struct address_space *mapping,
+				  struct folio *folio)
+{
+	if (!folio_test_uncached(folio))
+		return;
+	if (folio_test_writeback(folio))
+		return;
+	if (folio_test_clear_uncached(folio)) {
+		folio_lock(folio);
+		folio_unmap_invalidate(mapping, folio, 0);
+		folio_unlock(folio);
+	}
+}
+
 /**
  * filemap_read - Read data from the page cache.
  * @iocb: The iocb to read.
@@ -2706,8 +2726,12 @@ ssize_t filemap_read(struct kiocb *iocb, struct iov_iter *iter,
 			}
 		}
 put_folios:
-		for (i = 0; i < folio_batch_count(&fbatch); i++)
-			folio_put(fbatch.folios[i]);
+		for (i = 0; i < folio_batch_count(&fbatch); i++) {
+			struct folio *folio = fbatch.folios[i];
+
+			filemap_uncached_read(mapping, folio);
+			folio_put(folio);
+		}
 		folio_batch_init(&fbatch);
 	} while (iov_iter_count(iter) && iocb->ki_pos < isize && !error);
 
diff --git a/mm/swap.c b/mm/swap.c
index b8e3259ea2c4..542f298d3dcd 100644
--- a/mm/swap.c
+++ b/mm/swap.c
@@ -472,6 +472,8 @@ static void folio_inc_refs(struct folio *folio)
  */
 void folio_mark_accessed(struct folio *folio)
 {
+	if (folio_test_uncached(folio))
+		return;
 	if (lru_gen_enabled()) {
 		folio_inc_refs(folio);
 		return;
-- 
2.45.2




* [PATCH 09/16] mm/filemap: drop uncached pages when writeback completes
  2024-11-11 23:37 [PATCHSET v3 0/16] Uncached buffered IO Jens Axboe
                   ` (7 preceding siblings ...)
  2024-11-11 23:37 ` [PATCH 08/16] mm/filemap: add read support for RWF_UNCACHED Jens Axboe
@ 2024-11-11 23:37 ` Jens Axboe
  2024-11-12  9:31   ` Kirill A. Shutemov
  2024-11-11 23:37 ` [PATCH 10/16] mm/filemap: make buffered writes work with RWF_UNCACHED Jens Axboe
                   ` (7 subsequent siblings)
  16 siblings, 1 reply; 37+ messages in thread
From: Jens Axboe @ 2024-11-11 23:37 UTC (permalink / raw)
  To: linux-mm, linux-fsdevel
  Cc: hannes, clm, linux-kernel, willy, kirill, linux-btrfs,
	linux-ext4, linux-xfs, Jens Axboe

If the folio is marked as uncached, drop pages when writeback completes.
Intended to be used with RWF_UNCACHED, to avoid needing sync writes for
uncached IO.

Signed-off-by: Jens Axboe <axboe@kernel.dk>
---
 mm/filemap.c | 28 ++++++++++++++++++++++++++++
 1 file changed, 28 insertions(+)

diff --git a/mm/filemap.c b/mm/filemap.c
index 3d0614ea5f59..40debe742abe 100644
--- a/mm/filemap.c
+++ b/mm/filemap.c
@@ -1600,6 +1600,27 @@ int folio_wait_private_2_killable(struct folio *folio)
 }
 EXPORT_SYMBOL(folio_wait_private_2_killable);
 
+/*
+ * If folio was marked as uncached, then pages should be dropped when writeback
+ * completes. Do that now. If we fail, it's likely because of a big folio -
+ * just reset uncached for that case and later completions should invalidate.
+ */
+static void folio_end_uncached(struct folio *folio)
+{
+	/*
+	 * Hitting !in_task() should not happen off RWF_UNCACHED writeback, but
+	 * can happen if normal writeback just happens to find dirty folios
+	 * that were created as part of uncached writeback, and that writeback
+	 * would otherwise not need non-IRQ handling. Just skip the
+	 * invalidation in that case.
+	 */
+	if (in_task() && folio_trylock(folio)) {
+		if (folio->mapping)
+			folio_unmap_invalidate(folio->mapping, folio, 0);
+		folio_unlock(folio);
+	}
+}
+
 /**
  * folio_end_writeback - End writeback against a folio.
  * @folio: The folio.
@@ -1610,6 +1631,8 @@ EXPORT_SYMBOL(folio_wait_private_2_killable);
  */
 void folio_end_writeback(struct folio *folio)
 {
+	bool folio_uncached = false;
+
 	VM_BUG_ON_FOLIO(!folio_test_writeback(folio), folio);
 
 	/*
@@ -1631,9 +1654,14 @@ void folio_end_writeback(struct folio *folio)
 	 * reused before the folio_wake_bit().
 	 */
 	folio_get(folio);
+	if (folio_test_uncached(folio) && folio_test_clear_uncached(folio))
+		folio_uncached = true;
 	if (__folio_end_writeback(folio))
 		folio_wake_bit(folio, PG_writeback);
 	acct_reclaim_writeback(folio);
+
+	if (folio_uncached)
+		folio_end_uncached(folio);
 	folio_put(folio);
 }
 EXPORT_SYMBOL(folio_end_writeback);
-- 
2.45.2




* [PATCH 10/16] mm/filemap: make buffered writes work with RWF_UNCACHED
  2024-11-11 23:37 [PATCHSET v3 0/16] Uncached buffered IO Jens Axboe
                   ` (8 preceding siblings ...)
  2024-11-11 23:37 ` [PATCH 09/16] mm/filemap: drop uncached pages when writeback completes Jens Axboe
@ 2024-11-11 23:37 ` Jens Axboe
  2024-11-12  0:57   ` Dave Chinner
  2024-11-11 23:37 ` [PATCH 11/16] mm: add FGP_UNCACHED folio creation flag Jens Axboe
                   ` (6 subsequent siblings)
  16 siblings, 1 reply; 37+ messages in thread
From: Jens Axboe @ 2024-11-11 23:37 UTC (permalink / raw)
  To: linux-mm, linux-fsdevel
  Cc: hannes, clm, linux-kernel, willy, kirill, linux-btrfs,
	linux-ext4, linux-xfs, Jens Axboe

If RWF_UNCACHED is set for a write, mark new folios being written as
uncached. This is done by signaling that it's an uncached write through
the folio pointer passed to ->write_begin(). We can only get there when
IOCB_UNCACHED was allowed, which can only happen if the file system opts
in. Opting in means the file system needs to check whether the folio
pointer is set to the foliop_uncached magic value, and if so, pass
FGP_UNCACHED when creating new folios.

Uncached writes will drop any folios they create upon writeback
completion, but leave folios that may exist in that range alone. Since
->write_begin() doesn't currently take any flags, and to avoid needing
to change the callback kernel wide, use the foliop being passed in to
->write_begin() to signal if this is an uncached write or not. File
systems can then use that to mark newly created folios as uncached.

Add a helper, generic_uncached_write(), that generic_file_write_iter()
calls upon successful completion of an uncached write.
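
As a sketch, a minimal ->write_begin() in a hypothetical opted-in file
system (modeled on the ext4 conversion later in this series) would do:

	static int myfs_write_begin(struct file *file,
				    struct address_space *mapping,
				    loff_t pos, unsigned len,
				    struct folio **foliop, void **fsdata)
	{
		fgf_t fgp_flags = FGP_WRITEBEGIN;

		/* generic_perform_write() signals an uncached write via *foliop */
		if (foliop_is_uncached(foliop))
			fgp_flags |= FGP_UNCACHED;

		*foliop = __filemap_get_folio(mapping, pos >> PAGE_SHIFT,
					      fgp_flags, mapping_gfp_mask(mapping));
		if (IS_ERR(*foliop))
			return PTR_ERR(*foliop);
		return 0;
	}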

This provides similar benefits to using RWF_UNCACHED with reads. Testing
buffered writes on 32 files:

writing bs 65536, uncached 0
  1s: 196035MB/sec
  2s: 132308MB/sec
  3s: 132438MB/sec
  4s: 116528MB/sec
  5s: 103898MB/sec
  6s: 108893MB/sec
  7s: 99678MB/sec
  8s: 106545MB/sec
  9s: 106826MB/sec
 10s: 101544MB/sec
 11s: 111044MB/sec
 12s: 124257MB/sec
 13s: 116031MB/sec
 14s: 114540MB/sec
 15s: 115011MB/sec
 16s: 115260MB/sec
 17s: 116068MB/sec
 18s: 116096MB/sec

where it's quite obvious where the page cache filled, and performance
dropped to about half of where it started, settling in at around
115GB/sec. Meanwhile, 32 kswapds were running full steam trying to
reclaim pages.

Running the same test with uncached buffered writes:

writing bs 65536, uncached 1
  1s: 198974MB/sec
  2s: 189618MB/sec
  3s: 193601MB/sec
  4s: 188582MB/sec
  5s: 193487MB/sec
  6s: 188341MB/sec
  7s: 194325MB/sec
  8s: 188114MB/sec
  9s: 192740MB/sec
 10s: 189206MB/sec
 11s: 193442MB/sec
 12s: 189659MB/sec
 13s: 191732MB/sec
 14s: 190701MB/sec
 15s: 191789MB/sec
 16s: 191259MB/sec
 17s: 190613MB/sec
 18s: 191951MB/sec

and the behavior is fully predictable, performing the same throughout
even after the page cache would otherwise have fully filled with dirty
data. It's also about 65% faster, and using half the CPU of the system
compared to the normal buffered write.

Signed-off-by: Jens Axboe <axboe@kernel.dk>
---
 include/linux/pagemap.h | 29 +++++++++++++++++++++++++++++
 mm/filemap.c            | 17 +++++++++++++++--
 2 files changed, 44 insertions(+), 2 deletions(-)

diff --git a/include/linux/pagemap.h b/include/linux/pagemap.h
index d55bf995bd9e..d35280744aa1 100644
--- a/include/linux/pagemap.h
+++ b/include/linux/pagemap.h
@@ -14,6 +14,7 @@
 #include <linux/gfp.h>
 #include <linux/bitops.h>
 #include <linux/hardirq.h> /* for in_interrupt() */
+#include <linux/writeback.h>
 #include <linux/hugetlb_inline.h>
 
 struct folio_batch;
@@ -70,6 +71,34 @@ static inline int filemap_write_and_wait(struct address_space *mapping)
 	return filemap_write_and_wait_range(mapping, 0, LLONG_MAX);
 }
 
+/*
+ * generic_uncached_write - start uncached writeback
+ * @iocb: the iocb that was written
+ * @written: the number of bytes written
+ *
+ * When a write has been handled by ->write_iter(), this helper should be called
+ * if the file system supports uncached writes. If %IOCB_UNCACHED is set, it
+ * will kick off writeback for the specified range.
+ */
+static inline void generic_uncached_write(struct kiocb *iocb, ssize_t written)
+{
+	if (iocb->ki_flags & IOCB_UNCACHED) {
+		struct address_space *mapping = iocb->ki_filp->f_mapping;
+
+		/* kick off uncached writeback */
+		__filemap_fdatawrite_range(mapping, iocb->ki_pos,
+					   iocb->ki_pos + written, WB_SYNC_NONE);
+	}
+}
+
+/*
+ * Value passed in to ->write_begin() if IOCB_UNCACHED is set for the write,
+ * and the ->write_begin() handler on a file system supporting FOP_UNCACHED
+ * must check for this and pass FGP_UNCACHED for folio creation.
+ */
+#define foliop_uncached			((struct folio *) 0xfee1c001)
+#define foliop_is_uncached(foliop)	(*(foliop) == foliop_uncached)
+
 /**
  * filemap_set_wb_err - set a writeback error on an address_space
  * @mapping: mapping in which to set writeback error
diff --git a/mm/filemap.c b/mm/filemap.c
index 40debe742abe..0d312de4e20c 100644
--- a/mm/filemap.c
+++ b/mm/filemap.c
@@ -430,6 +430,7 @@ int __filemap_fdatawrite_range(struct address_space *mapping, loff_t start,
 
 	return filemap_fdatawrite_wbc(mapping, &wbc);
 }
+EXPORT_SYMBOL_GPL(__filemap_fdatawrite_range);
 
 static inline int __filemap_fdatawrite(struct address_space *mapping,
 	int sync_mode)
@@ -4076,7 +4077,7 @@ ssize_t generic_perform_write(struct kiocb *iocb, struct iov_iter *i)
 	ssize_t written = 0;
 
 	do {
-		struct folio *folio;
+		struct folio *folio = NULL;
 		size_t offset;		/* Offset into folio */
 		size_t bytes;		/* Bytes to write to folio */
 		size_t copied;		/* Bytes copied from user */
@@ -4104,6 +4105,16 @@ ssize_t generic_perform_write(struct kiocb *iocb, struct iov_iter *i)
 			break;
 		}
 
+		/*
+		 * If IOCB_UNCACHED is set here, we know the file system
+		 * supports it. And hence it'll know to check foliop for being
+		 * set to this magic value. If so, it's an uncached write.
+		 * Whenever ->write_begin() changes prototypes again, this
+		 * can go away and just pass iocb or iocb flags.
+		 */
+		if (iocb->ki_flags & IOCB_UNCACHED)
+			folio = foliop_uncached;
+
 		status = a_ops->write_begin(file, mapping, pos, bytes,
 						&folio, &fsdata);
 		if (unlikely(status < 0))
@@ -4234,8 +4245,10 @@ ssize_t generic_file_write_iter(struct kiocb *iocb, struct iov_iter *from)
 		ret = __generic_file_write_iter(iocb, from);
 	inode_unlock(inode);
 
-	if (ret > 0)
+	if (ret > 0) {
+		generic_uncached_write(iocb, ret);
 		ret = generic_write_sync(iocb, ret);
+	}
 	return ret;
 }
 EXPORT_SYMBOL(generic_file_write_iter);
-- 
2.45.2




* [PATCH 11/16] mm: add FGP_UNCACHED folio creation flag
  2024-11-11 23:37 [PATCHSET v3 0/16] Uncached buffered IO Jens Axboe
                   ` (9 preceding siblings ...)
  2024-11-11 23:37 ` [PATCH 10/16] mm/filemap: make buffered writes work with RWF_UNCACHED Jens Axboe
@ 2024-11-11 23:37 ` Jens Axboe
  2024-11-11 23:37 ` [PATCH 12/16] ext4: add RWF_UNCACHED write support Jens Axboe
                   ` (5 subsequent siblings)
  16 siblings, 0 replies; 37+ messages in thread
From: Jens Axboe @ 2024-11-11 23:37 UTC (permalink / raw)
  To: linux-mm, linux-fsdevel
  Cc: hannes, clm, linux-kernel, willy, kirill, linux-btrfs,
	linux-ext4, linux-xfs, Jens Axboe

Callers can pass this in for uncached folio creation, in which case if
a folio is newly created it gets marked as uncached. If a folio exists
for this index and lookup succeeds, then it will not get marked as
uncached. If an !uncached lookup finds a cached folio, clear the flag.
For that case, there are competing uncached and cached users of the
folio, and it should not get pruned.
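
As a sketch of the intended semantics (hypothetical call sites, not
part of this patch):

	struct folio *folio;

	/* uncached write path: a newly created folio gets marked uncached */
	folio = __filemap_get_folio(mapping, index,
				    FGP_WRITEBEGIN | FGP_UNCACHED,
				    mapping_gfp_mask(mapping));

	/* later plain lookup of the same index: a cached user showed up,
	 * so the uncached flag is cleared and the folio won't be pruned */
	folio = __filemap_get_folio(mapping, index, FGP_WRITEBEGIN,
				    mapping_gfp_mask(mapping));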

Signed-off-by: Jens Axboe <axboe@kernel.dk>
---
 include/linux/pagemap.h | 2 ++
 mm/filemap.c            | 5 +++++
 2 files changed, 7 insertions(+)

diff --git a/include/linux/pagemap.h b/include/linux/pagemap.h
index d35280744aa1..0b298e81fcae 100644
--- a/include/linux/pagemap.h
+++ b/include/linux/pagemap.h
@@ -741,6 +741,7 @@ pgoff_t page_cache_prev_miss(struct address_space *mapping,
  * * %FGP_NOFS - __GFP_FS will get cleared in gfp.
  * * %FGP_NOWAIT - Don't block on the folio lock.
  * * %FGP_STABLE - Wait for the folio to be stable (finished writeback)
+ * * %FGP_UNCACHED - Uncached buffered IO
  * * %FGP_WRITEBEGIN - The flags to use in a filesystem write_begin()
  *   implementation.
  */
@@ -754,6 +755,7 @@ typedef unsigned int __bitwise fgf_t;
 #define FGP_NOWAIT		((__force fgf_t)0x00000020)
 #define FGP_FOR_MMAP		((__force fgf_t)0x00000040)
 #define FGP_STABLE		((__force fgf_t)0x00000080)
+#define FGP_UNCACHED		((__force fgf_t)0x00000100)
 #define FGF_GET_ORDER(fgf)	(((__force unsigned)fgf) >> 26)	/* top 6 bits */
 
 #define FGP_WRITEBEGIN		(FGP_LOCK | FGP_WRITE | FGP_CREAT | FGP_STABLE)
diff --git a/mm/filemap.c b/mm/filemap.c
index 0d312de4e20c..0949f0f340f5 100644
--- a/mm/filemap.c
+++ b/mm/filemap.c
@@ -1985,6 +1985,8 @@ struct folio *__filemap_get_folio(struct address_space *mapping, pgoff_t index,
 			/* Init accessed so avoid atomic mark_page_accessed later */
 			if (fgp_flags & FGP_ACCESSED)
 				__folio_set_referenced(folio);
+			if (fgp_flags & FGP_UNCACHED)
+				__folio_set_uncached(folio);
 
 			err = filemap_add_folio(mapping, folio, index, gfp);
 			if (!err)
@@ -2007,6 +2009,9 @@ struct folio *__filemap_get_folio(struct address_space *mapping, pgoff_t index,
 
 	if (!folio)
 		return ERR_PTR(-ENOENT);
+	/* not an uncached lookup, clear uncached if set */
+	if (folio_test_uncached(folio) && !(fgp_flags & FGP_UNCACHED))
+		folio_clear_uncached(folio);
 	return folio;
 }
 EXPORT_SYMBOL(__filemap_get_folio);
-- 
2.45.2




* [PATCH 12/16] ext4: add RWF_UNCACHED write support
  2024-11-11 23:37 [PATCHSET v3 0/16] Uncached buffered IO Jens Axboe
                   ` (10 preceding siblings ...)
  2024-11-11 23:37 ` [PATCH 11/16] mm: add FGP_UNCACHED folio creation flag Jens Axboe
@ 2024-11-11 23:37 ` Jens Axboe
  2024-11-12 16:36   ` Brian Foster
  2024-11-11 23:37 ` [PATCH 13/16] iomap: make buffered writes work with RWF_UNCACHED Jens Axboe
                   ` (4 subsequent siblings)
  16 siblings, 1 reply; 37+ messages in thread
From: Jens Axboe @ 2024-11-11 23:37 UTC (permalink / raw)
  To: linux-mm, linux-fsdevel
  Cc: hannes, clm, linux-kernel, willy, kirill, linux-btrfs,
	linux-ext4, linux-xfs, Jens Axboe

IOCB_UNCACHED IO needs to prune writeback regions on IO completion,
and hence needs the worker punt that ext4 also does for unwritten
extents. Add an io_end flag to manage that.

If foliop is set to foliop_uncached in ext4_write_begin(), then set
FGP_UNCACHED so that __filemap_get_folio() will mark newly created
folios as uncached. That in turn will make writeback completion drop
these ranges from the page cache.

Now that ext4 supports both uncached reads and writes, add the fop_flag
FOP_UNCACHED to enable it.

Signed-off-by: Jens Axboe <axboe@kernel.dk>
---
 fs/ext4/ext4.h    |  1 +
 fs/ext4/file.c    |  2 +-
 fs/ext4/inline.c  |  7 ++++++-
 fs/ext4/inode.c   | 18 ++++++++++++++++--
 fs/ext4/page-io.c | 28 ++++++++++++++++------------
 5 files changed, 40 insertions(+), 16 deletions(-)

diff --git a/fs/ext4/ext4.h b/fs/ext4/ext4.h
index 44b0d418143c..60dc9ffae076 100644
--- a/fs/ext4/ext4.h
+++ b/fs/ext4/ext4.h
@@ -279,6 +279,7 @@ struct ext4_system_blocks {
  * Flags for ext4_io_end->flags
  */
 #define	EXT4_IO_END_UNWRITTEN	0x0001
+#define EXT4_IO_UNCACHED	0x0002
 
 struct ext4_io_end_vec {
 	struct list_head list;		/* list of io_end_vec */
diff --git a/fs/ext4/file.c b/fs/ext4/file.c
index f14aed14b9cf..0ef39d738598 100644
--- a/fs/ext4/file.c
+++ b/fs/ext4/file.c
@@ -944,7 +944,7 @@ const struct file_operations ext4_file_operations = {
 	.splice_write	= iter_file_splice_write,
 	.fallocate	= ext4_fallocate,
 	.fop_flags	= FOP_MMAP_SYNC | FOP_BUFFER_RASYNC |
-			  FOP_DIO_PARALLEL_WRITE,
+			  FOP_DIO_PARALLEL_WRITE | FOP_UNCACHED,
 };
 
 const struct inode_operations ext4_file_inode_operations = {
diff --git a/fs/ext4/inline.c b/fs/ext4/inline.c
index 3536ca7e4fcc..4089d0744164 100644
--- a/fs/ext4/inline.c
+++ b/fs/ext4/inline.c
@@ -667,6 +667,7 @@ int ext4_try_to_write_inline_data(struct address_space *mapping,
 	handle_t *handle;
 	struct folio *folio;
 	struct ext4_iloc iloc;
+	fgf_t fgp_flags;
 
 	if (pos + len > ext4_get_max_inline_size(inode))
 		goto convert;
@@ -702,7 +703,11 @@ int ext4_try_to_write_inline_data(struct address_space *mapping,
 	if (ret)
 		goto out;
 
-	folio = __filemap_get_folio(mapping, 0, FGP_WRITEBEGIN | FGP_NOFS,
+	fgp_flags = FGP_WRITEBEGIN | FGP_NOFS;
+	if (*foliop == foliop_uncached)
+		fgp_flags |= FGP_UNCACHED;
+
+	folio = __filemap_get_folio(mapping, 0, fgp_flags,
 					mapping_gfp_mask(mapping));
 	if (IS_ERR(folio)) {
 		ret = PTR_ERR(folio);
diff --git a/fs/ext4/inode.c b/fs/ext4/inode.c
index 54bdd4884fe6..afae3ab64c9e 100644
--- a/fs/ext4/inode.c
+++ b/fs/ext4/inode.c
@@ -1138,6 +1138,7 @@ static int ext4_write_begin(struct file *file, struct address_space *mapping,
 	int ret, needed_blocks;
 	handle_t *handle;
 	int retries = 0;
+	fgf_t fgp_flags;
 	struct folio *folio;
 	pgoff_t index;
 	unsigned from, to;
@@ -1164,6 +1165,15 @@ static int ext4_write_begin(struct file *file, struct address_space *mapping,
 			return 0;
 	}
 
+	/*
+	 * Set FGP_WRITEBEGIN, and FGP_UNCACHED if foliop contains
+	 * foliop_uncached. That's how generic_perform_write() informs us
+	 * that this is an uncached write.
+	 */
+	fgp_flags = FGP_WRITEBEGIN;
+	if (*foliop == foliop_uncached)
+		fgp_flags |= FGP_UNCACHED;
+
 	/*
 	 * __filemap_get_folio() can take a long time if the
 	 * system is thrashing due to memory pressure, or if the folio
@@ -1172,7 +1182,7 @@ static int ext4_write_begin(struct file *file, struct address_space *mapping,
 	 * the folio (if needed) without using GFP_NOFS.
 	 */
 retry_grab:
-	folio = __filemap_get_folio(mapping, index, FGP_WRITEBEGIN,
+	folio = __filemap_get_folio(mapping, index, fgp_flags,
 					mapping_gfp_mask(mapping));
 	if (IS_ERR(folio))
 		return PTR_ERR(folio);
@@ -2903,6 +2913,7 @@ static int ext4_da_write_begin(struct file *file, struct address_space *mapping,
 	struct folio *folio;
 	pgoff_t index;
 	struct inode *inode = mapping->host;
+	fgf_t fgp_flags;
 
 	if (unlikely(ext4_forced_shutdown(inode->i_sb)))
 		return -EIO;
@@ -2926,8 +2937,11 @@ static int ext4_da_write_begin(struct file *file, struct address_space *mapping,
 			return 0;
 	}
 
+	fgp_flags = FGP_WRITEBEGIN;
+	if (*foliop == foliop_uncached)
+		fgp_flags |= FGP_UNCACHED;
 retry:
-	folio = __filemap_get_folio(mapping, index, FGP_WRITEBEGIN,
+	folio = __filemap_get_folio(mapping, index, fgp_flags,
 			mapping_gfp_mask(mapping));
 	if (IS_ERR(folio))
 		return PTR_ERR(folio);
diff --git a/fs/ext4/page-io.c b/fs/ext4/page-io.c
index ad5543866d21..10447c3c4ff1 100644
--- a/fs/ext4/page-io.c
+++ b/fs/ext4/page-io.c
@@ -226,8 +226,6 @@ static void ext4_add_complete_io(ext4_io_end_t *io_end)
 	unsigned long flags;
 
 	/* Only reserved conversions from writeback should enter here */
-	WARN_ON(!(io_end->flag & EXT4_IO_END_UNWRITTEN));
-	WARN_ON(!io_end->handle && sbi->s_journal);
 	spin_lock_irqsave(&ei->i_completed_io_lock, flags);
 	wq = sbi->rsv_conversion_wq;
 	if (list_empty(&ei->i_rsv_conversion_list))
@@ -252,7 +250,7 @@ static int ext4_do_flush_completed_IO(struct inode *inode,
 
 	while (!list_empty(&unwritten)) {
 		io_end = list_entry(unwritten.next, ext4_io_end_t, list);
-		BUG_ON(!(io_end->flag & EXT4_IO_END_UNWRITTEN));
+		BUG_ON(!(io_end->flag & (EXT4_IO_END_UNWRITTEN|EXT4_IO_UNCACHED)));
 		list_del_init(&io_end->list);
 
 		err = ext4_end_io_end(io_end);
@@ -287,14 +285,15 @@ ext4_io_end_t *ext4_init_io_end(struct inode *inode, gfp_t flags)
 
 void ext4_put_io_end_defer(ext4_io_end_t *io_end)
 {
-	if (refcount_dec_and_test(&io_end->count)) {
-		if (!(io_end->flag & EXT4_IO_END_UNWRITTEN) ||
-				list_empty(&io_end->list_vec)) {
-			ext4_release_io_end(io_end);
-			return;
-		}
-		ext4_add_complete_io(io_end);
+	if (!refcount_dec_and_test(&io_end->count))
+		return;
+	if ((!(io_end->flag & EXT4_IO_END_UNWRITTEN) ||
+	    list_empty(&io_end->list_vec)) &&
+	    !(io_end->flag & EXT4_IO_UNCACHED)) {
+		ext4_release_io_end(io_end);
+		return;
 	}
+	ext4_add_complete_io(io_end);
 }
 
 int ext4_put_io_end(ext4_io_end_t *io_end)
@@ -348,7 +347,7 @@ static void ext4_end_bio(struct bio *bio)
 				blk_status_to_errno(bio->bi_status));
 	}
 
-	if (io_end->flag & EXT4_IO_END_UNWRITTEN) {
+	if (io_end->flag & (EXT4_IO_END_UNWRITTEN|EXT4_IO_UNCACHED)) {
 		/*
 		 * Link bio into list hanging from io_end. We have to do it
 		 * atomically as bio completions can be racing against each
@@ -417,8 +416,13 @@ static void io_submit_add_bh(struct ext4_io_submit *io,
 submit_and_retry:
 		ext4_io_submit(io);
 	}
-	if (io->io_bio == NULL)
+	if (io->io_bio == NULL) {
 		io_submit_init_bio(io, bh);
+		if (folio_test_uncached(folio)) {
+			ext4_io_end_t *io_end = io->io_bio->bi_private;
+			io_end->flag |= EXT4_IO_UNCACHED;
+		}
+	}
 	if (!bio_add_folio(io->io_bio, io_folio, bh->b_size, bh_offset(bh)))
 		goto submit_and_retry;
 	wbc_account_cgroup_owner(io->io_wbc, &folio->page, bh->b_size);
-- 
2.45.2




* [PATCH 13/16] iomap: make buffered writes work with RWF_UNCACHED
  2024-11-11 23:37 [PATCHSET v3 0/16] Uncached buffered IO Jens Axboe
                   ` (11 preceding siblings ...)
  2024-11-11 23:37 ` [PATCH 12/16] ext4: add RWF_UNCACHED write support Jens Axboe
@ 2024-11-11 23:37 ` Jens Axboe
  2024-11-12  1:01   ` Darrick J. Wong
  2024-11-12 16:37   ` Brian Foster
  2024-11-11 23:37 ` [PATCH 14/16] xfs: punt uncached write completions to the completion wq Jens Axboe
                   ` (3 subsequent siblings)
  16 siblings, 2 replies; 37+ messages in thread
From: Jens Axboe @ 2024-11-11 23:37 UTC (permalink / raw)
  To: linux-mm, linux-fsdevel
  Cc: hannes, clm, linux-kernel, willy, kirill, linux-btrfs,
	linux-ext4, linux-xfs, Jens Axboe

Add iomap buffered write support for RWF_UNCACHED. If RWF_UNCACHED is
set for a write, mark the folios being written as uncached. Then
writeback completion will drop the pages. The write_iter handler simply
kicks off writeback for the pages, and writeback completion will take
care of the rest.

This still needs the user of the iomap buffered write helpers to call
generic_uncached_write() upon successful issue of the writes.
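
A sketch of that caller side, for a hypothetical "myfs" using iomap
(locking elided, and myfs_iomap_ops is assumed):

	static ssize_t myfs_buffered_write_iter(struct kiocb *iocb,
						struct iov_iter *from)
	{
		ssize_t ret;

		ret = iomap_file_buffered_write(iocb, from, &myfs_iomap_ops, NULL);
		if (ret > 0) {
			/* kick writeback so completion can prune the range */
			generic_uncached_write(iocb, ret);
			ret = generic_write_sync(iocb, ret);
		}
		return ret;
	}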

Signed-off-by: Jens Axboe <axboe@kernel.dk>
---
 fs/iomap/buffered-io.c | 15 +++++++++++++--
 include/linux/iomap.h  |  4 +++-
 2 files changed, 16 insertions(+), 3 deletions(-)

diff --git a/fs/iomap/buffered-io.c b/fs/iomap/buffered-io.c
index ef0b68bccbb6..2f2a5db04a68 100644
--- a/fs/iomap/buffered-io.c
+++ b/fs/iomap/buffered-io.c
@@ -603,6 +603,8 @@ struct folio *iomap_get_folio(struct iomap_iter *iter, loff_t pos, size_t len)
 
 	if (iter->flags & IOMAP_NOWAIT)
 		fgp |= FGP_NOWAIT;
+	if (iter->flags & IOMAP_UNCACHED)
+		fgp |= FGP_UNCACHED;
 	fgp |= fgf_set_order(len);
 
 	return __filemap_get_folio(iter->inode->i_mapping, pos >> PAGE_SHIFT,
@@ -1023,8 +1025,9 @@ ssize_t
 iomap_file_buffered_write(struct kiocb *iocb, struct iov_iter *i,
 		const struct iomap_ops *ops, void *private)
 {
+	struct address_space *mapping = iocb->ki_filp->f_mapping;
 	struct iomap_iter iter = {
-		.inode		= iocb->ki_filp->f_mapping->host,
+		.inode		= mapping->host,
 		.pos		= iocb->ki_pos,
 		.len		= iov_iter_count(i),
 		.flags		= IOMAP_WRITE,
@@ -1034,9 +1037,14 @@ iomap_file_buffered_write(struct kiocb *iocb, struct iov_iter *i,
 
 	if (iocb->ki_flags & IOCB_NOWAIT)
 		iter.flags |= IOMAP_NOWAIT;
+	if (iocb->ki_flags & IOCB_UNCACHED)
+		iter.flags |= IOMAP_UNCACHED;
 
-	while ((ret = iomap_iter(&iter, ops)) > 0)
+	while ((ret = iomap_iter(&iter, ops)) > 0) {
+		if (iocb->ki_flags & IOCB_UNCACHED)
+			iter.iomap.flags |= IOMAP_F_UNCACHED;
 		iter.processed = iomap_write_iter(&iter, i);
+	}
 
 	if (unlikely(iter.pos == iocb->ki_pos))
 		return ret;
@@ -1770,6 +1778,9 @@ static int iomap_add_to_ioend(struct iomap_writepage_ctx *wpc,
 	size_t poff = offset_in_folio(folio, pos);
 	int error;
 
+	if (folio_test_uncached(folio))
+		wpc->iomap.flags |= IOMAP_F_UNCACHED;
+
 	if (!wpc->ioend || !iomap_can_add_to_ioend(wpc, pos)) {
 new_ioend:
 		error = iomap_submit_ioend(wpc, 0);
diff --git a/include/linux/iomap.h b/include/linux/iomap.h
index f61407e3b121..2efc72df19a2 100644
--- a/include/linux/iomap.h
+++ b/include/linux/iomap.h
@@ -64,6 +64,7 @@ struct vm_fault;
 #define IOMAP_F_BUFFER_HEAD	0
 #endif /* CONFIG_BUFFER_HEAD */
 #define IOMAP_F_XATTR		(1U << 5)
+#define IOMAP_F_UNCACHED	(1U << 6)
 
 /*
  * Flags set by the core iomap code during operations:
@@ -173,8 +174,9 @@ struct iomap_folio_ops {
 #define IOMAP_NOWAIT		(1 << 5) /* do not block */
 #define IOMAP_OVERWRITE_ONLY	(1 << 6) /* only pure overwrites allowed */
 #define IOMAP_UNSHARE		(1 << 7) /* unshare_file_range */
+#define IOMAP_UNCACHED		(1 << 8) /* uncached IO */
 #ifdef CONFIG_FS_DAX
-#define IOMAP_DAX		(1 << 8) /* DAX mapping */
+#define IOMAP_DAX		(1 << 9) /* DAX mapping */
 #else
 #define IOMAP_DAX		0
 #endif /* CONFIG_FS_DAX */
-- 
2.45.2




* [PATCH 14/16] xfs: punt uncached write completions to the completion wq
  2024-11-11 23:37 [PATCHSET v3 0/16] Uncached buffered IO Jens Axboe
                   ` (12 preceding siblings ...)
  2024-11-11 23:37 ` [PATCH 13/16] iomap: make buffered writes work with RWF_UNCACHED Jens Axboe
@ 2024-11-11 23:37 ` Jens Axboe
  2024-11-11 23:37 ` [PATCH 15/16] xfs: flag as supporting FOP_UNCACHED Jens Axboe
                   ` (2 subsequent siblings)
  16 siblings, 0 replies; 37+ messages in thread
From: Jens Axboe @ 2024-11-11 23:37 UTC (permalink / raw)
  To: linux-mm, linux-fsdevel
  Cc: hannes, clm, linux-kernel, willy, kirill, linux-btrfs,
	linux-ext4, linux-xfs, Jens Axboe

Uncached write completions need guaranteed non-irq context to be able
to prune ranges from the page cache. Treat them like unwritten extents
and punt them to the
completion workqueue.

Signed-off-by: Jens Axboe <axboe@kernel.dk>
---
 fs/xfs/xfs_aops.c | 7 +++++--
 1 file changed, 5 insertions(+), 2 deletions(-)

diff --git a/fs/xfs/xfs_aops.c b/fs/xfs/xfs_aops.c
index 559a3a577097..c86fc2b8f344 100644
--- a/fs/xfs/xfs_aops.c
+++ b/fs/xfs/xfs_aops.c
@@ -416,9 +416,12 @@ xfs_prepare_ioend(
 
 	memalloc_nofs_restore(nofs_flag);
 
-	/* send ioends that might require a transaction to the completion wq */
+	/*
+	 * Send ioends that might require a transaction or need blocking
+	 * context to the completion wq
+	 */
 	if (xfs_ioend_is_append(ioend) || ioend->io_type == IOMAP_UNWRITTEN ||
-	    (ioend->io_flags & IOMAP_F_SHARED))
+	    (ioend->io_flags & (IOMAP_F_SHARED|IOMAP_F_UNCACHED)))
 		ioend->io_bio.bi_end_io = xfs_end_bio;
 	return status;
 }
-- 
2.45.2




* [PATCH 15/16] xfs: flag as supporting FOP_UNCACHED
  2024-11-11 23:37 [PATCHSET v3 0/16] Uncached buffered IO Jens Axboe
                   ` (13 preceding siblings ...)
  2024-11-11 23:37 ` [PATCH 14/16] xfs: punt uncached write completions to the completion wq Jens Axboe
@ 2024-11-11 23:37 ` Jens Axboe
  2024-11-11 23:37 ` [PATCH 16/16] btrfs: add support for uncached writes Jens Axboe
  2024-11-12  1:31 ` [PATCHSET v3 0/16] Uncached buffered IO Jens Axboe
  16 siblings, 0 replies; 37+ messages in thread
From: Jens Axboe @ 2024-11-11 23:37 UTC (permalink / raw)
  To: linux-mm, linux-fsdevel
  Cc: hannes, clm, linux-kernel, willy, kirill, linux-btrfs,
	linux-ext4, linux-xfs, Jens Axboe

The read side was already fully supported; for the write side, all
that's needed now is calling generic_uncached_write() when uncached writes
have been submitted. With that, enable the use of RWF_UNCACHED with XFS
by flagging support with FOP_UNCACHED.
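
From userspace this is then just a matter of passing the new flag to
pwritev2(). A minimal sketch, assuming the RWF_UNCACHED definition that
this series adds to the uapi headers:

	#define _GNU_SOURCE
	#include <sys/uio.h>

	static ssize_t uncached_pwrite(int fd, const void *buf, size_t len,
				       off_t off)
	{
		struct iovec iov = { .iov_base = (void *)buf, .iov_len = len };

		/* fails with EOPNOTSUPP if the fs doesn't set FOP_UNCACHED */
		return pwritev2(fd, &iov, 1, off, RWF_UNCACHED);
	}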

Signed-off-by: Jens Axboe <axboe@kernel.dk>
---
 fs/xfs/xfs_file.c | 4 +++-
 1 file changed, 3 insertions(+), 1 deletion(-)

diff --git a/fs/xfs/xfs_file.c b/fs/xfs/xfs_file.c
index b19916b11fd5..1a7f46e13464 100644
--- a/fs/xfs/xfs_file.c
+++ b/fs/xfs/xfs_file.c
@@ -825,6 +825,7 @@ xfs_file_buffered_write(
 
 	if (ret > 0) {
 		XFS_STATS_ADD(ip->i_mount, xs_write_bytes, ret);
+		generic_uncached_write(iocb, ret);
 		/* Handle various SYNC-type writes */
 		ret = generic_write_sync(iocb, ret);
 	}
@@ -1595,7 +1596,8 @@ const struct file_operations xfs_file_operations = {
 	.fadvise	= xfs_file_fadvise,
 	.remap_file_range = xfs_file_remap_range,
 	.fop_flags	= FOP_MMAP_SYNC | FOP_BUFFER_RASYNC |
-			  FOP_BUFFER_WASYNC | FOP_DIO_PARALLEL_WRITE,
+			  FOP_BUFFER_WASYNC | FOP_DIO_PARALLEL_WRITE |
+			  FOP_UNCACHED,
 };
 
 const struct file_operations xfs_dir_file_operations = {
-- 
2.45.2




* [PATCH 16/16] btrfs: add support for uncached writes
  2024-11-11 23:37 [PATCHSET v3 0/16] Uncached buffered IO Jens Axboe
                   ` (14 preceding siblings ...)
  2024-11-11 23:37 ` [PATCH 15/16] xfs: flag as supporting FOP_UNCACHED Jens Axboe
@ 2024-11-11 23:37 ` Jens Axboe
  2024-11-12  1:31 ` [PATCHSET v3 0/16] Uncached buffered IO Jens Axboe
  16 siblings, 0 replies; 37+ messages in thread
From: Jens Axboe @ 2024-11-11 23:37 UTC (permalink / raw)
  To: linux-mm, linux-fsdevel
  Cc: hannes, clm, linux-kernel, willy, kirill, linux-btrfs,
	linux-ext4, linux-xfs, Jens Axboe

The read side is already covered as btrfs uses the generic filemap
helpers. For writes, just pass in FGP_UNCACHED if uncached IO is being
done, so that any folios created are marked appropriately.

For IO completion, ensure that writeback of uncached folios gets punted
to one of the btrfs workers, as task context is needed for
that. Add an 'uncached_io' member to struct btrfs_bio to manage that.

Outside of that, call generic_uncached_write() upon successful
completion of a buffered write.

With that, add FOP_UNCACHED to the btrfs file_operations fop_flags
structure, enabling use of RWF_UNCACHED.

Signed-off-by: Jens Axboe <axboe@kernel.dk>
---
 fs/btrfs/bio.c       |  4 ++--
 fs/btrfs/bio.h       |  2 ++
 fs/btrfs/extent_io.c |  8 +++++++-
 fs/btrfs/file.c      | 10 +++++++---
 4 files changed, 18 insertions(+), 6 deletions(-)

diff --git a/fs/btrfs/bio.c b/fs/btrfs/bio.c
index 7e0f9600b80c..253e1a656934 100644
--- a/fs/btrfs/bio.c
+++ b/fs/btrfs/bio.c
@@ -334,7 +334,7 @@ static void btrfs_end_bio_work(struct work_struct *work)
 	struct btrfs_bio *bbio = container_of(work, struct btrfs_bio, end_io_work);
 
 	/* Metadata reads are checked and repaired by the submitter. */
-	if (is_data_bbio(bbio))
+	if (bio_op(&bbio->bio) == REQ_OP_READ && is_data_bbio(bbio))
 		btrfs_check_read_bio(bbio, bbio->bio.bi_private);
 	else
 		btrfs_bio_end_io(bbio, bbio->bio.bi_status);
@@ -351,7 +351,7 @@ static void btrfs_simple_end_io(struct bio *bio)
 	if (bio->bi_status)
 		btrfs_log_dev_io_error(bio, dev);
 
-	if (bio_op(bio) == REQ_OP_READ) {
+	if (bio_op(bio) == REQ_OP_READ || bbio->uncached_io) {
 		INIT_WORK(&bbio->end_io_work, btrfs_end_bio_work);
 		queue_work(btrfs_end_io_wq(fs_info, bio), &bbio->end_io_work);
 	} else {
diff --git a/fs/btrfs/bio.h b/fs/btrfs/bio.h
index e2fe16074ad6..39b98326c98f 100644
--- a/fs/btrfs/bio.h
+++ b/fs/btrfs/bio.h
@@ -82,6 +82,8 @@ struct btrfs_bio {
 	/* Save the first error status of split bio. */
 	blk_status_t status;
 
+	bool uncached_io;
+
 	/*
 	 * This member must come last, bio_alloc_bioset will allocate enough
 	 * bytes for entire btrfs_bio but relies on bio being last.
diff --git a/fs/btrfs/extent_io.c b/fs/btrfs/extent_io.c
index 872cca54cc6c..b97b21178ed7 100644
--- a/fs/btrfs/extent_io.c
+++ b/fs/btrfs/extent_io.c
@@ -760,8 +760,11 @@ static void submit_extent_folio(struct btrfs_bio_ctrl *bio_ctrl,
 	ASSERT(bio_ctrl->end_io_func);
 
 	if (bio_ctrl->bbio &&
-	    !btrfs_bio_is_contig(bio_ctrl, folio, disk_bytenr, pg_offset))
+	    !btrfs_bio_is_contig(bio_ctrl, folio, disk_bytenr, pg_offset)) {
+		if (folio_test_uncached(folio))
+			bio_ctrl->bbio->uncached_io = true;
 		submit_one_bio(bio_ctrl);
+	}
 
 	do {
 		u32 len = size;
@@ -779,6 +782,9 @@ static void submit_extent_folio(struct btrfs_bio_ctrl *bio_ctrl,
 			len = bio_ctrl->len_to_oe_boundary;
 		}
 
+		if (folio_test_uncached(folio))
+			bio_ctrl->bbio->uncached_io = true;
+
 		if (!bio_add_folio(&bio_ctrl->bbio->bio, folio, len, pg_offset)) {
 			/* bio full: move on to a new one */
 			submit_one_bio(bio_ctrl);
diff --git a/fs/btrfs/file.c b/fs/btrfs/file.c
index 4fb521d91b06..a27d194a28e0 100644
--- a/fs/btrfs/file.c
+++ b/fs/btrfs/file.c
@@ -919,7 +919,7 @@ static gfp_t get_prepare_gfp_flags(struct inode *inode, bool nowait)
 static noinline int prepare_pages(struct inode *inode, struct page **pages,
 				  size_t num_pages, loff_t pos,
 				  size_t write_bytes, bool force_uptodate,
-				  bool nowait)
+				  bool nowait, bool uncached)
 {
 	int i;
 	unsigned long index = pos >> PAGE_SHIFT;
@@ -928,6 +928,8 @@ static noinline int prepare_pages(struct inode *inode, struct page **pages,
 	int ret = 0;
 	int faili;
 
+	if (uncached)
+		fgp_flags |= FGP_UNCACHED;
 	for (i = 0; i < num_pages; i++) {
 again:
 		pages[i] = pagecache_get_page(inode->i_mapping, index + i,
@@ -1323,7 +1325,8 @@ ssize_t btrfs_buffered_write(struct kiocb *iocb, struct iov_iter *i)
 		 * contents of pages from loop to loop
 		 */
 		ret = prepare_pages(inode, pages, num_pages,
-				    pos, write_bytes, force_page_uptodate, false);
+				    pos, write_bytes, force_page_uptodate,
+				    false, iocb->ki_flags & IOCB_UNCACHED);
 		if (ret) {
 			btrfs_delalloc_release_extents(BTRFS_I(inode),
 						       reserve_bytes);
@@ -1512,6 +1515,7 @@ ssize_t btrfs_do_write_iter(struct kiocb *iocb, struct iov_iter *from,
 	btrfs_set_inode_last_sub_trans(inode);
 
 	if (num_sync > 0) {
+		generic_uncached_write(iocb, num_sync);
 		num_sync = generic_write_sync(iocb, num_sync);
 		if (num_sync < 0)
 			num_written = num_sync;
@@ -3802,7 +3806,7 @@ const struct file_operations btrfs_file_operations = {
 	.compat_ioctl	= btrfs_compat_ioctl,
 #endif
 	.remap_file_range = btrfs_remap_file_range,
-	.fop_flags	= FOP_BUFFER_RASYNC | FOP_BUFFER_WASYNC,
+	.fop_flags	= FOP_BUFFER_RASYNC | FOP_BUFFER_WASYNC | FOP_UNCACHED,
 };
 
 int btrfs_fdatawrite_range(struct btrfs_inode *inode, loff_t start, loff_t end)
-- 
2.45.2




* Re: [PATCH 10/16] mm/filemap: make buffered writes work with RWF_UNCACHED
  2024-11-11 23:37 ` [PATCH 10/16] mm/filemap: make buffered writes work with RWF_UNCACHED Jens Axboe
@ 2024-11-12  0:57   ` Dave Chinner
  2024-11-12  1:27     ` Jens Axboe
  0 siblings, 1 reply; 37+ messages in thread
From: Dave Chinner @ 2024-11-12  0:57 UTC (permalink / raw)
  To: Jens Axboe
  Cc: linux-mm, linux-fsdevel, hannes, clm, linux-kernel, willy,
	kirill, linux-btrfs, linux-ext4, linux-xfs

On Mon, Nov 11, 2024 at 04:37:37PM -0700, Jens Axboe wrote:
> If RWF_UNCACHED is set for a write, mark new folios being written with
> uncached. This is done by passing in the fact that it's an uncached write
> through the folio pointer. We can only get there when IOCB_UNCACHED was
> allowed, which can only happen if the file system opts in. Opting in means
> they need to check for the LSB in the folio pointer to know if it's an
> uncached write or not. If it is, then FGP_UNCACHED should be used if
> creating new folios is necessary.
> 
> Uncached writes will drop any folios they create upon writeback
> completion, but leave folios that may exist in that range alone. Since
> ->write_begin() doesn't currently take any flags, and to avoid needing
> to change the callback kernel wide, use the foliop being passed in to
> ->write_begin() to signal if this is an uncached write or not. File
> systems can then use that to mark newly created folios as uncached.
> 
> Add a helper, generic_uncached_write(), that generic_file_write_iter()
> calls upon successful completion of an uncached write.

This doesn't implement an "uncached" write operation. This
implements a cache write-through operation.

We've actually been talking about this for some time as a desirable
general buffered write trait on fast SSDs. Excessive write-behind
caching is a real problem in general, especially when doing
streaming sequential writes to pcie 4 and 5 nvme SSDs that can do
more than 7GB/s to disk. When the page cache fills up, we see all
the same problems you are trying to work around in this series
with "uncached" writes.

IOWS, what we really want is page cache write-through as an
automatic feature for buffered writes.


> @@ -70,6 +71,34 @@ static inline int filemap_write_and_wait(struct address_space *mapping)
>  	return filemap_write_and_wait_range(mapping, 0, LLONG_MAX);
>  }
>  
> +/*
> + * generic_uncached_write - start uncached writeback
> + * @iocb: the iocb that was written
> + * @written: the amount of bytes written
> + *
> + * When writeback has been handled by write_iter, this helper should be called
> + * if the file system supports uncached writes. If %IOCB_UNCACHED is set, it
> + * will kick off writeback for the specified range.
> + */
> +static inline void generic_uncached_write(struct kiocb *iocb, ssize_t written)
> +{
> +	if (iocb->ki_flags & IOCB_UNCACHED) {
> +		struct address_space *mapping = iocb->ki_filp->f_mapping;
> +
> +		/* kick off uncached writeback */
> +		__filemap_fdatawrite_range(mapping, iocb->ki_pos,
> +					   iocb->ki_pos + written, WB_SYNC_NONE);
> +	}
> +}

Yup, this is basically write-through.

> +
> +/*
> + * Value passed in to ->write_begin() if IOCB_UNCACHED is set for the write,
> + * and the ->write_begin() handler on a file system supporting FOP_UNCACHED
> + * must check for this and pass FGP_UNCACHED for folio creation.
> + */
> +#define foliop_uncached			((struct folio *) 0xfee1c001)
> +#define foliop_is_uncached(foliop)	(*(foliop) == foliop_uncached)
> +
>  /**
>   * filemap_set_wb_err - set a writeback error on an address_space
>   * @mapping: mapping in which to set writeback error
> diff --git a/mm/filemap.c b/mm/filemap.c
> index 40debe742abe..0d312de4e20c 100644
> --- a/mm/filemap.c
> +++ b/mm/filemap.c
> @@ -430,6 +430,7 @@ int __filemap_fdatawrite_range(struct address_space *mapping, loff_t start,
>  
>  	return filemap_fdatawrite_wbc(mapping, &wbc);
>  }
> +EXPORT_SYMBOL_GPL(__filemap_fdatawrite_range);
>  
>  static inline int __filemap_fdatawrite(struct address_space *mapping,
>  	int sync_mode)
> @@ -4076,7 +4077,7 @@ ssize_t generic_perform_write(struct kiocb *iocb, struct iov_iter *i)
>  	ssize_t written = 0;
>  
>  	do {
> -		struct folio *folio;
> +		struct folio *folio = NULL;
>  		size_t offset;		/* Offset into folio */
>  		size_t bytes;		/* Bytes to write to folio */
>  		size_t copied;		/* Bytes copied from user */
> @@ -4104,6 +4105,16 @@ ssize_t generic_perform_write(struct kiocb *iocb, struct iov_iter *i)
>  			break;
>  		}
>  
> +		/*
> +		 * If IOCB_UNCACHED is set here, we know the file system
> +		 * supports it. And hence it'll know to check foliop for being
> +		 * set to this magic value. If so, it's an uncached write.
> +		 * Whenever ->write_begin() changes prototypes again, this
> +		 * can go away and just pass iocb or iocb flags.
> +		 */
> +		if (iocb->ki_flags & IOCB_UNCACHED)
> +			folio = foliop_uncached;
> +
>  		status = a_ops->write_begin(file, mapping, pos, bytes,
>  						&folio, &fsdata);
>  		if (unlikely(status < 0))
> @@ -4234,8 +4245,10 @@ ssize_t generic_file_write_iter(struct kiocb *iocb, struct iov_iter *from)
>  		ret = __generic_file_write_iter(iocb, from);
>  	inode_unlock(inode);
>  
> -	if (ret > 0)
> +	if (ret > 0) {
> +		generic_uncached_write(iocb, ret);
>  		ret = generic_write_sync(iocb, ret);

Why isn't the writethrough check inside generic_write_sync()?
Having to add it to every filesystem that supports write-through is
unwieldy. If the IO is DSYNC or SYNC, we're going to run WB_SYNC_ALL
writeback through the generic_write_sync() path already, so the only time we
actually want to run WB_SYNC_NONE write-through here is if the iocb
is not marked as dsync.

Hence I think this write-through check should be done conditionally
inside generic_write_sync(), not in addition to the writeback
generic_write_sync() might need to do...

That also gives us a common place for adding cache write-through
trigger logic (think writebehind trigger logic similar to readahead)
and this is also a place where we could automatically tag mapping
ranges for reclaim on writeback completion....

-Dave.
-- 
Dave Chinner
david@fromorbit.com



* Re: [PATCH 13/16] iomap: make buffered writes work with RWF_UNCACHED
  2024-11-11 23:37 ` [PATCH 13/16] iomap: make buffered writes work with RWF_UNCACHED Jens Axboe
@ 2024-11-12  1:01   ` Darrick J. Wong
  2024-11-12  1:30     ` Jens Axboe
  2024-11-12 16:37   ` Brian Foster
  1 sibling, 1 reply; 37+ messages in thread
From: Darrick J. Wong @ 2024-11-12  1:01 UTC (permalink / raw)
  To: Jens Axboe
  Cc: linux-mm, linux-fsdevel, hannes, clm, linux-kernel, willy,
	kirill, linux-btrfs, linux-ext4, linux-xfs

On Mon, Nov 11, 2024 at 04:37:40PM -0700, Jens Axboe wrote:
> Add iomap buffered write support for RWF_UNCACHED. If RWF_UNCACHED is
> set for a write, mark the folios being written as uncached. Then
> writeback completion will drop the pages. The write_iter handler simply
> kicks off writeback for the pages, and writeback completion will take
> care of the rest.
> 
> This still needs the user of the iomap buffered write helpers to call
> generic_uncached_write() upon successful issue of the writes.
> 
> Signed-off-by: Jens Axboe <axboe@kernel.dk>
> ---
>  fs/iomap/buffered-io.c | 15 +++++++++++++--
>  include/linux/iomap.h  |  4 +++-
>  2 files changed, 16 insertions(+), 3 deletions(-)
> 
> diff --git a/fs/iomap/buffered-io.c b/fs/iomap/buffered-io.c
> index ef0b68bccbb6..2f2a5db04a68 100644
> --- a/fs/iomap/buffered-io.c
> +++ b/fs/iomap/buffered-io.c
> @@ -603,6 +603,8 @@ struct folio *iomap_get_folio(struct iomap_iter *iter, loff_t pos, size_t len)
>  
>  	if (iter->flags & IOMAP_NOWAIT)
>  		fgp |= FGP_NOWAIT;
> +	if (iter->flags & IOMAP_UNCACHED)
> +		fgp |= FGP_UNCACHED;
>  	fgp |= fgf_set_order(len);
>  
>  	return __filemap_get_folio(iter->inode->i_mapping, pos >> PAGE_SHIFT,
> @@ -1023,8 +1025,9 @@ ssize_t
>  iomap_file_buffered_write(struct kiocb *iocb, struct iov_iter *i,
>  		const struct iomap_ops *ops, void *private)
>  {
> +	struct address_space *mapping = iocb->ki_filp->f_mapping;
>  	struct iomap_iter iter = {
> -		.inode		= iocb->ki_filp->f_mapping->host,
> +		.inode		= mapping->host,
>  		.pos		= iocb->ki_pos,
>  		.len		= iov_iter_count(i),
>  		.flags		= IOMAP_WRITE,
> @@ -1034,9 +1037,14 @@ iomap_file_buffered_write(struct kiocb *iocb, struct iov_iter *i,
>  
>  	if (iocb->ki_flags & IOCB_NOWAIT)
>  		iter.flags |= IOMAP_NOWAIT;
> +	if (iocb->ki_flags & IOCB_UNCACHED)
> +		iter.flags |= IOMAP_UNCACHED;
>  
> -	while ((ret = iomap_iter(&iter, ops)) > 0)
> +	while ((ret = iomap_iter(&iter, ops)) > 0) {
> +		if (iocb->ki_flags & IOCB_UNCACHED)
> +			iter.iomap.flags |= IOMAP_F_UNCACHED;
>  		iter.processed = iomap_write_iter(&iter, i);
> +	}
>  
>  	if (unlikely(iter.pos == iocb->ki_pos))
>  		return ret;
> @@ -1770,6 +1778,9 @@ static int iomap_add_to_ioend(struct iomap_writepage_ctx *wpc,
>  	size_t poff = offset_in_folio(folio, pos);
>  	int error;
>  
> +	if (folio_test_uncached(folio))
> +		wpc->iomap.flags |= IOMAP_F_UNCACHED;
> +
>  	if (!wpc->ioend || !iomap_can_add_to_ioend(wpc, pos)) {
>  new_ioend:
>  		error = iomap_submit_ioend(wpc, 0);
> diff --git a/include/linux/iomap.h b/include/linux/iomap.h
> index f61407e3b121..2efc72df19a2 100644
> --- a/include/linux/iomap.h
> +++ b/include/linux/iomap.h
> @@ -64,6 +64,7 @@ struct vm_fault;
>  #define IOMAP_F_BUFFER_HEAD	0
>  #endif /* CONFIG_BUFFER_HEAD */
>  #define IOMAP_F_XATTR		(1U << 5)
> +#define IOMAP_F_UNCACHED	(1U << 6)

This value ^^^ is set only by the core iomap code, right?

>  /*
>   * Flags set by the core iomap code during operations:

...in which case it should be set down here.  It probably ought to have
a description of what it does, too:

"IOMAP_F_UNCACHED is set to indicate that writes to the page cache (and
hence writeback) will result in folios being evicted as soon as the
updated bytes are written back to the storage."
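
i.e. something like this down in that section (a sketch, keeping the
bit value from the patch):

	/*
	 * IOMAP_F_UNCACHED is set to indicate that writes to the page
	 * cache (and hence writeback) will result in folios being evicted
	 * as soon as the updated bytes are written back to the storage.
	 */
	#define IOMAP_F_UNCACHED	(1U << 6)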

If the writeback fails, does that mean that the dirty data will /not/ be
retained in the page cache?  IIRC we finally got to the point where the
major filesystems leave pagecache alone after writeback EIO.

The rest of the mechanics looks nifty to me; there's plenty of places
where this could be useful to me personally. :)

--D

> @@ -173,8 +174,9 @@ struct iomap_folio_ops {
>  #define IOMAP_NOWAIT		(1 << 5) /* do not block */
>  #define IOMAP_OVERWRITE_ONLY	(1 << 6) /* only pure overwrites allowed */
>  #define IOMAP_UNSHARE		(1 << 7) /* unshare_file_range */
> +#define IOMAP_UNCACHED		(1 << 8) /* uncached IO */
>  #ifdef CONFIG_FS_DAX
> -#define IOMAP_DAX		(1 << 8) /* DAX mapping */
> +#define IOMAP_DAX		(1 << 9) /* DAX mapping */
>  #else
>  #define IOMAP_DAX		0
>  #endif /* CONFIG_FS_DAX */
> -- 
> 2.45.2
> 
> 



* Re: [PATCH 10/16] mm/filemap: make buffered writes work with RWF_UNCACHED
  2024-11-12  0:57   ` Dave Chinner
@ 2024-11-12  1:27     ` Jens Axboe
  2024-11-12  8:02       ` Dave Chinner
  0 siblings, 1 reply; 37+ messages in thread
From: Jens Axboe @ 2024-11-12  1:27 UTC (permalink / raw)
  To: Dave Chinner
  Cc: linux-mm, linux-fsdevel, hannes, clm, linux-kernel, willy,
	kirill, linux-btrfs, linux-ext4, linux-xfs

On 11/11/24 5:57 PM, Dave Chinner wrote:
> On Mon, Nov 11, 2024 at 04:37:37PM -0700, Jens Axboe wrote:
>> If RWF_UNCACHED is set for a write, mark new folios being written with
>> uncached. This is done by passing in the fact that it's an uncached write
>> through the folio pointer. We can only get there when IOCB_UNCACHED was
>> allowed, which can only happen if the file system opts in. Opting in means
>> they need to check for the LSB in the folio pointer to know if it's an
>> uncached write or not. If it is, then FGP_UNCACHED should be used if
>> creating new folios is necessary.
>>
>> Uncached writes will drop any folios they create upon writeback
>> completion, but leave folios that may exist in that range alone. Since
>> ->write_begin() doesn't currently take any flags, and to avoid needing
>> to change the callback kernel wide, use the foliop being passed in to
>> ->write_begin() to signal if this is an uncached write or not. File
>> systems can then use that to mark newly created folios as uncached.
>>
>> Add a helper, generic_uncached_write(), that generic_file_write_iter()
>> calls upon successful completion of an uncached write.
> 
> This doesn't implement an "uncached" write operation. This
> implements a cache write-through operation.

It's uncached in the sense that the range gets pruned on writeback
completion. For write-through, I'd consider that just the fact that it
gets kicked off once dirtied rather than wait for writeback to get
kicked at some point.

So I'd say write-through is a subset of that.

> We've actually been talking about this for some time as a desirable
> general buffered write trait on fast SSDs. Excessive write-behind
> caching is a real problem in general, especially when doing
> streaming sequential writes to pcie 4 and 5 nvme SSDs that can do
> more than 7GB/s to disk. When the page cache fills up, we see all

Try twice that, 14GB/sec.

> the same problems you are trying to work around in this series
> with "uncached" writes.
> 
> IOWS, what we really want is page cache write-through as an
> automatic feature for buffered writes.

I don't know who "we" is here - what I really want is for the write to
get kicked off, but also reclaimed as part of completion. I don't want
kswapd to do that, as it's inefficient.

>> @@ -70,6 +71,34 @@ static inline int filemap_write_and_wait(struct address_space *mapping)
>>  	return filemap_write_and_wait_range(mapping, 0, LLONG_MAX);
>>  }
>>  
>> +/*
>> + * generic_uncached_write - start uncached writeback
>> + * @iocb: the iocb that was written
>> + * @written: the amount of bytes written
>> + *
>> + * When writeback has been handled by write_iter, this helper should be called
>> + * if the file system supports uncached writes. If %IOCB_UNCACHED is set, it
>> + * will kick off writeback for the specified range.
>> + */
>> +static inline void generic_uncached_write(struct kiocb *iocb, ssize_t written)
>> +{
>> +	if (iocb->ki_flags & IOCB_UNCACHED) {
>> +		struct address_space *mapping = iocb->ki_filp->f_mapping;
>> +
>> +		/* kick off uncached writeback */
>> +		__filemap_fdatawrite_range(mapping, iocb->ki_pos,
>> +					   iocb->ki_pos + written, WB_SYNC_NONE);
>> +	}
>> +}
> 
> Yup, this is basically write-through.

... with pruning once it completes.

>> + * Value passed in to ->write_begin() if IOCB_UNCACHED is set for the write,
>> + * and the ->write_begin() handler on a file system supporting FOP_UNCACHED
>> + * must check for this and pass FGP_UNCACHED for folio creation.
>> + */
>> +#define foliop_uncached			((struct folio *) 0xfee1c001)
>> +#define foliop_is_uncached(foliop)	(*(foliop) == foliop_uncached)
>> +
>>  /**
>>   * filemap_set_wb_err - set a writeback error on an address_space
>>   * @mapping: mapping in which to set writeback error
>> diff --git a/mm/filemap.c b/mm/filemap.c
>> index 40debe742abe..0d312de4e20c 100644
>> --- a/mm/filemap.c
>> +++ b/mm/filemap.c
>> @@ -430,6 +430,7 @@ int __filemap_fdatawrite_range(struct address_space *mapping, loff_t start,
>>  
>>  	return filemap_fdatawrite_wbc(mapping, &wbc);
>>  }
>> +EXPORT_SYMBOL_GPL(__filemap_fdatawrite_range);
>>  
>>  static inline int __filemap_fdatawrite(struct address_space *mapping,
>>  	int sync_mode)
>> @@ -4076,7 +4077,7 @@ ssize_t generic_perform_write(struct kiocb *iocb, struct iov_iter *i)
>>  	ssize_t written = 0;
>>  
>>  	do {
>> -		struct folio *folio;
>> +		struct folio *folio = NULL;
>>  		size_t offset;		/* Offset into folio */
>>  		size_t bytes;		/* Bytes to write to folio */
>>  		size_t copied;		/* Bytes copied from user */
>> @@ -4104,6 +4105,16 @@ ssize_t generic_perform_write(struct kiocb *iocb, struct iov_iter *i)
>>  			break;
>>  		}
>>  
>> +		/*
>> +		 * If IOCB_UNCACHED is set here, we now the file system
>> +		 * supports it. And hence it'll know to check folip for being
>> +		 * set to this magic value. If so, it's an uncached write.
>> +		 * Whenever ->write_begin() changes prototypes again, this
>> +		 * can go away and just pass iocb or iocb flags.
>> +		 */
>> +		if (iocb->ki_flags & IOCB_UNCACHED)
>> +			folio = foliop_uncached;
>> +
>>  		status = a_ops->write_begin(file, mapping, pos, bytes,
>>  						&folio, &fsdata);
>>  		if (unlikely(status < 0))
>> @@ -4234,8 +4245,10 @@ ssize_t generic_file_write_iter(struct kiocb *iocb, struct iov_iter *from)
>>  		ret = __generic_file_write_iter(iocb, from);
>>  	inode_unlock(inode);
>>  
>> -	if (ret > 0)
>> +	if (ret > 0) {
>> +		generic_uncached_write(iocb, ret);
>>  		ret = generic_write_sync(iocb, ret);
> 
> Why isn't the writethrough check inside generic_write_sync()?
> Having to add it to every filesystem that supports write-through is
> unwieldy. If the IO is DSYNC or SYNC, we're going to run WB_SYNC_ALL
> writeback through the generic_write_sync() path already, so the only time we
> actually want to run WB_SYNC_NONE write-through here is if the iocb
> is not marked as dsync.
> 
> Hence I think this write-through check should be done conditionally
> inside generic_write_sync(), not in addition to the writeback
> generic_write_sync() might need to do...

True good point, I'll move it to generic_write_sync() instead, it needs
to go there for all three spots where it's added anyway.

> That also gives us a common place for adding cache write-through
> trigger logic (think writebehind trigger logic similar to readahead)
> and this is also a place where we could automatically tag mapping
> ranges for reclaim on writeback completion....

I appreciate that you seemingly like the concept, but not that you are
also seemingly trying to commandeer this to be something else. Unless
you like the automatic reclaiming as well, it's not clear to me.

I'm certainly open to doing the reclaim differently, and I originally
wanted to do tagging for that - but there are no more tags available.
Yes we could add that, but then the question was whether that was
worthwhile. Hence I just went with a folio flag instead.

But the mechanism for doing that doesn't interest me as much as the
concept of it. Certainly open to ideas on that end.

-- 
Jens Axboe



* Re: [PATCH 13/16] iomap: make buffered writes work with RWF_UNCACHED
  2024-11-12  1:01   ` Darrick J. Wong
@ 2024-11-12  1:30     ` Jens Axboe
  0 siblings, 0 replies; 37+ messages in thread
From: Jens Axboe @ 2024-11-12  1:30 UTC (permalink / raw)
  To: Darrick J. Wong
  Cc: linux-mm, linux-fsdevel, hannes, clm, linux-kernel, willy,
	kirill, linux-btrfs, linux-ext4, linux-xfs

On 11/11/24 6:01 PM, Darrick J. Wong wrote:
> On Mon, Nov 11, 2024 at 04:37:40PM -0700, Jens Axboe wrote:
>> Add iomap buffered write support for RWF_UNCACHED. If RWF_UNCACHED is
>> set for a write, mark the folios being written as uncached. Then
>> writeback completion will drop the pages. The write_iter handler simply
>> kicks off writeback for the pages, and writeback completion will take
>> care of the rest.
>>
>> This still needs the user of the iomap buffered write helpers to call
>> generic_uncached_write() upon successful issue of the writes.
>>
>> Signed-off-by: Jens Axboe <axboe@kernel.dk>
>> ---
>>  fs/iomap/buffered-io.c | 15 +++++++++++++--
>>  include/linux/iomap.h  |  4 +++-
>>  2 files changed, 16 insertions(+), 3 deletions(-)
>>
>> diff --git a/fs/iomap/buffered-io.c b/fs/iomap/buffered-io.c
>> index ef0b68bccbb6..2f2a5db04a68 100644
>> --- a/fs/iomap/buffered-io.c
>> +++ b/fs/iomap/buffered-io.c
>> @@ -603,6 +603,8 @@ struct folio *iomap_get_folio(struct iomap_iter *iter, loff_t pos, size_t len)
>>  
>>  	if (iter->flags & IOMAP_NOWAIT)
>>  		fgp |= FGP_NOWAIT;
>> +	if (iter->flags & IOMAP_UNCACHED)
>> +		fgp |= FGP_UNCACHED;
>>  	fgp |= fgf_set_order(len);
>>  
>>  	return __filemap_get_folio(iter->inode->i_mapping, pos >> PAGE_SHIFT,
>> @@ -1023,8 +1025,9 @@ ssize_t
>>  iomap_file_buffered_write(struct kiocb *iocb, struct iov_iter *i,
>>  		const struct iomap_ops *ops, void *private)
>>  {
>> +	struct address_space *mapping = iocb->ki_filp->f_mapping;
>>  	struct iomap_iter iter = {
>> -		.inode		= iocb->ki_filp->f_mapping->host,
>> +		.inode		= mapping->host,
>>  		.pos		= iocb->ki_pos,
>>  		.len		= iov_iter_count(i),
>>  		.flags		= IOMAP_WRITE,
>> @@ -1034,9 +1037,14 @@ iomap_file_buffered_write(struct kiocb *iocb, struct iov_iter *i,
>>  
>>  	if (iocb->ki_flags & IOCB_NOWAIT)
>>  		iter.flags |= IOMAP_NOWAIT;
>> +	if (iocb->ki_flags & IOCB_UNCACHED)
>> +		iter.flags |= IOMAP_UNCACHED;
>>  
>> -	while ((ret = iomap_iter(&iter, ops)) > 0)
>> +	while ((ret = iomap_iter(&iter, ops)) > 0) {
>> +		if (iocb->ki_flags & IOCB_UNCACHED)
>> +			iter.iomap.flags |= IOMAP_F_UNCACHED;
>>  		iter.processed = iomap_write_iter(&iter, i);
>> +	}
>>  
>>  	if (unlikely(iter.pos == iocb->ki_pos))
>>  		return ret;
>> @@ -1770,6 +1778,9 @@ static int iomap_add_to_ioend(struct iomap_writepage_ctx *wpc,
>>  	size_t poff = offset_in_folio(folio, pos);
>>  	int error;
>>  
>> +	if (folio_test_uncached(folio))
>> +		wpc->iomap.flags |= IOMAP_F_UNCACHED;
>> +
>>  	if (!wpc->ioend || !iomap_can_add_to_ioend(wpc, pos)) {
>>  new_ioend:
>>  		error = iomap_submit_ioend(wpc, 0);
>> diff --git a/include/linux/iomap.h b/include/linux/iomap.h
>> index f61407e3b121..2efc72df19a2 100644
>> --- a/include/linux/iomap.h
>> +++ b/include/linux/iomap.h
>> @@ -64,6 +64,7 @@ struct vm_fault;
>>  #define IOMAP_F_BUFFER_HEAD	0
>>  #endif /* CONFIG_BUFFER_HEAD */
>>  #define IOMAP_F_XATTR		(1U << 5)
>> +#define IOMAP_F_UNCACHED	(1U << 6)
> 
> This value ^^^ is set only by the core iomap code, right?

Correct

>>  /*
>>   * Flags set by the core iomap code during operations:
> 
> ...in which case it should be set down here.  It probably ought to have
> a description of what it does, too:

Ah yes indeed, good point. I'll move it and add a description.

> "IOMAP_F_UNCACHED is set to indicate that writes to the page cache (and
> hence writeback) will result in folios being evicted as soon as the
> updated bytes are written back to the storage."

Excellent, I'll go with that.

> If the writeback fails, does that mean that the dirty data will /not/ be
> retained in the page cache?  IIRC we finally got to the point where the
> major filesystems leave pagecache alone after writeback EIO.

Good question - didn't change any of those bits. It currently relies on
writeback completion to prune the ranges. So if an EIO completion
triggers writeback completion, then it'll get pruned. But for that case,
I suspect the range is still dirty, and hence the pruning would not
succeed, for obvious reasons. So you'd need further things on top of
that, I'm afraid.

> The rest of the mechanics looks nifty to me; there's plenty of places
> where this could be useful to me personally. :)

For sure, I think there are tons of use cases for this as well. Thanks
for taking a look!

-- 
Jens Axboe



* Re: [PATCHSET v3 0/16] Uncached buffered IO
  2024-11-11 23:37 [PATCHSET v3 0/16] Uncached buffered IO Jens Axboe
                   ` (15 preceding siblings ...)
  2024-11-11 23:37 ` [PATCH 16/16] btrfs: add support for uncached writes Jens Axboe
@ 2024-11-12  1:31 ` Jens Axboe
  16 siblings, 0 replies; 37+ messages in thread
From: Jens Axboe @ 2024-11-12  1:31 UTC (permalink / raw)
  To: linux-mm, linux-fsdevel
  Cc: hannes, clm, linux-kernel, willy, kirill, linux-btrfs,
	linux-ext4, linux-xfs

On 11/11/24 4:37 PM, Jens Axboe wrote:
I ran this through xfstests, and it found some of the issues listed as
> fixed below. This posted version passes the whole generic suite of
> xfstests. The xfstests patch is here:

FWIW, the xfs grouping also ran to completion after that, some hours
later... At least from the "is it semi sane, at least?" perspective, the
answer should be yes.

-- 
Jens Axboe



* Re: [PATCH 10/16] mm/filemap: make buffered writes work with RWF_UNCACHED
  2024-11-12  1:27     ` Jens Axboe
@ 2024-11-12  8:02       ` Dave Chinner
  2024-11-12  9:50         ` Kirill A. Shutemov
  2024-11-12 14:51         ` Jens Axboe
  0 siblings, 2 replies; 37+ messages in thread
From: Dave Chinner @ 2024-11-12  8:02 UTC (permalink / raw)
  To: Jens Axboe
  Cc: linux-mm, linux-fsdevel, hannes, clm, linux-kernel, willy,
	kirill, linux-btrfs, linux-ext4, linux-xfs

On Mon, Nov 11, 2024 at 06:27:46PM -0700, Jens Axboe wrote:
> On 11/11/24 5:57 PM, Dave Chinner wrote:
> > On Mon, Nov 11, 2024 at 04:37:37PM -0700, Jens Axboe wrote:
> >> If RWF_UNCACHED is set for a write, mark new folios being written with
> >> uncached. This is done by passing in the fact that it's an uncached write
> >> through the folio pointer. We can only get there when IOCB_UNCACHED was
> >> allowed, which can only happen if the file system opts in. Opting in means
> >> they need to check for the LSB in the folio pointer to know if it's an
> >> uncached write or not. If it is, then FGP_UNCACHED should be used if
> >> creating new folios is necessary.
> >>
> >> Uncached writes will drop any folios they create upon writeback
> >> completion, but leave folios that may exist in that range alone. Since
> >> ->write_begin() doesn't currently take any flags, and to avoid needing
> >> to change the callback kernel wide, use the foliop being passed in to
> >> ->write_begin() to signal if this is an uncached write or not. File
> >> systems can then use that to mark newly created folios as uncached.
> >>
> >> Add a helper, generic_uncached_write(), that generic_file_write_iter()
> >> calls upon successful completion of an uncached write.
> > 
> > This doesn't implement an "uncached" write operation. This
> > implements a cache write-through operation.
> 
> It's uncached in the sense that the range gets pruned on writeback
> completion.

That's not the definition of "uncached". Direct IO is, by
definition, "uncached" because it bypasses the cache and is not
coherent with the contents of the cache.

This IO, however, is moving the data coherently through the cache
(both on read and write).  The cached folios are transient - i.e.
-temporarily resident- in the cache whilst the IO is in progress -
but this behaviour does not make it "uncached IO".

Calling it "uncached IO" is simply wrong from any direction I look
at it....

> For write-through, I'd consider that just the fact that it
> gets kicked off once dirtied rather than wait for writeback to get
> kicked at some point.
> 
> So I'd say write-through is a subset of that.

I think the post-IO invalidation that these IOs do is largely
irrelevant to how the page cache processes the write. Indeed,
from userspace, the functionality in this patchset would be
implemented like this:

oneshot_data_write(fd, buf, len, off)
{
	/* write into page cache */
	pwrite(fd, buf, len, off);

	/* force the write through the page cache */
	sync_file_range(fd, off, len, SYNC_FILE_RANGE_WRITE | SYNC_FILE_RANGE_WAIT_AFTER);

	/* Invalidate the single use data in the cache now it is on disk */
	posix_fadvise(fd, off, len, POSIX_FADV_DONTNEED);
}

Allowing the application to control writeback and invalidation
granularity is a much more flexible solution to the problem here;
when IO is sequential, delayed allocation will be allowed to ensure
large contiguous extents are created and that will greatly reduce
file fragmentation on XFS, btrfs, bcachefs and ext4. For random
writes, it'll submit async IOs in batches...

Given that io_uring already supports sync_file_range() and
posix_fadvise(), I'm wondering why we need a new IO API to perform
this specific write-through behaviour in a way that is less flexible
than what applications can already implement through existing
APIs....
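
For instance, with liburing the whole sequence can be issued as a
single linked chain of SQEs - a sketch, assuming the ring was set up
with io_uring_queue_init() and with error handling elided:

	struct io_uring_sqe *sqe;

	sqe = io_uring_get_sqe(&ring);
	io_uring_prep_write(sqe, fd, buf, len, off);
	sqe->flags |= IOSQE_IO_LINK;

	sqe = io_uring_get_sqe(&ring);
	io_uring_prep_sync_file_range(sqe, fd, len, off,
			SYNC_FILE_RANGE_WRITE | SYNC_FILE_RANGE_WAIT_AFTER);
	sqe->flags |= IOSQE_IO_LINK;

	sqe = io_uring_get_sqe(&ring);
	io_uring_prep_fadvise(sqe, fd, off, len, POSIX_FADV_DONTNEED);

	io_uring_submit(&ring);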

> > the same problems you are trying to work around in this series
> > with "uncached" writes.
> > 
> > IOWS, what we really want is page cache write-through as an
> > automatic feature for buffered writes.
> 
> I don't know who "we" is here - what I really want is for the write to
> get kicked off, but also reclaimed as part of completion. I don't want
> kswapd to do that, as it's inefficient.

"we" as in the general cohort of filesystem and mm
developers who interact closely with the page cache all the time.
There was a fair bit of talk about writethrough and other
transparent page cache IO path improvements at LSFMM this year.

> > That also gives us a common place for adding cache write-through
> > trigger logic (think writebehind trigger logic similar to readahead)
> > and this is also a place where we could automatically tag mapping
> > ranges for reclaim on writeback completion....
> 
> I appreciate that you seemingly like the concept, but not that you are
> also seemingly trying to commandeer this to be something else. Unless
> you like the automatic reclaiming as well, it's not clear to me.

I'm not trying to commandeer anything.

Having thought about it more, I think this new API is unnecessary
for custom-written applications to perform fine-grained control of
page cache residency of one-shot data. We already have APIs that
allow applications to do exactly what this patchset is doing. rather
than choosing to modify whatever benchmark being used to use
existing APIs, a choice was made to modify both the applicaiton and
the kernel to implement a whole new API....

I think that was the -wrong choice-.

I think this is partly because the kernel modifications don't
really help further us towards the goal of transparent mode
switching in the page cache.

Read-through should be a mode that the readahead control activates,
not something triggered by a special read() syscall flag. We
already have access patterns and fadvise modes guiding this.
Write-through should be controlled in a similar way.

And making the data being read and written behave as transient page
cache objects should be done via an existing fadvise mode, too,
because the model you have implemented here exactly matches the 
definition of FADV_NOREUSE:

	POSIX_FADV_NOREUSE
              The specified data will be accessed only once.

Having a new per-IO flag that effectively collides existing
control functionality into a single inflexible API bit doesn't
really make a whole lot of sense to me.

IOWs, I'm not questioning whether we need rw-through modes and/or
IO-transient residency for page cache based IO - it's been on our
radar for a while. I'm more concerned that the chosen API in this
patchset is a poor one as it cannot replace any of the existing
controls we already have for these sorts of application directed
page cache manipulations...

-Dave.
-- 
Dave Chinner
david@fromorbit.com



* Re: [PATCH 03/16] mm: add PG_uncached page flag
  2024-11-11 23:37 ` [PATCH 03/16] mm: add PG_uncached page flag Jens Axboe
@ 2024-11-12  9:12   ` Kirill A. Shutemov
  2024-11-12 14:07     ` Jens Axboe
  0 siblings, 1 reply; 37+ messages in thread
From: Kirill A. Shutemov @ 2024-11-12  9:12 UTC (permalink / raw)
  To: Jens Axboe
  Cc: linux-mm, linux-fsdevel, hannes, clm, linux-kernel, willy,
	linux-btrfs, linux-ext4, linux-xfs

On Mon, Nov 11, 2024 at 04:37:30PM -0700, Jens Axboe wrote:
> Add a page flag that file IO can use to indicate that the IO being done
> is uncached, as in it should not persist in the page cache after the IO
> has been completed.

I have not found a way to avoid using a new bit. I am unsure if we have
enough bits on 32-bit systems with all possible features enabled.

In the worst-case scenario, we may need to make the feature 64-bit only.
I believe it should be acceptable as long as userspace is prepared for the
possibility that RWF_UNCACHED may fail. It is not going to be supported by
all filesystems anyway.

-- 
  Kiryl Shutsemau / Kirill A. Shutemov



* Re: [PATCH 09/16] mm/filemap: drop uncached pages when writeback completes
  2024-11-11 23:37 ` [PATCH 09/16] mm/filemap: drop uncached pages when writeback completes Jens Axboe
@ 2024-11-12  9:31   ` Kirill A. Shutemov
  2024-11-12 14:09     ` Jens Axboe
  0 siblings, 1 reply; 37+ messages in thread
From: Kirill A. Shutemov @ 2024-11-12  9:31 UTC (permalink / raw)
  To: Jens Axboe
  Cc: linux-mm, linux-fsdevel, hannes, clm, linux-kernel, willy,
	linux-btrfs, linux-ext4, linux-xfs

On Mon, Nov 11, 2024 at 04:37:36PM -0700, Jens Axboe wrote:
> If the folio is marked as uncached, drop pages when writeback completes.
> Intended to be used with RWF_UNCACHED, to avoid needing sync writes for
> uncached IO.
> 
> Signed-off-by: Jens Axboe <axboe@kernel.dk>
> ---
>  mm/filemap.c | 28 ++++++++++++++++++++++++++++
>  1 file changed, 28 insertions(+)
> 
> diff --git a/mm/filemap.c b/mm/filemap.c
> index 3d0614ea5f59..40debe742abe 100644
> --- a/mm/filemap.c
> +++ b/mm/filemap.c
> @@ -1600,6 +1600,27 @@ int folio_wait_private_2_killable(struct folio *folio)
>  }
>  EXPORT_SYMBOL(folio_wait_private_2_killable);
>  
> +/*
> + * If folio was marked as uncached, then pages should be dropped when writeback
> + * completes. Do that now. If we fail, it's likely because of a big folio -
> + * just reset uncached for that case and later completions should invalidate.
> + */
> +static void folio_end_uncached(struct folio *folio)
> +{
> +	/*
> +	 * Hitting !in_task() should not happen off RWF_UNCACHED writeback, but
> +	 * can happen if normal writeback just happens to find dirty folios
> +	 * that were created as part of uncached writeback, and that writeback
> +	 * would otherwise not need non-IRQ handling. Just skip the
> +	 * invalidation in that case.
> +	 */
> +	if (in_task() && folio_trylock(folio)) {
> +		if (folio->mapping)
> +			folio_unmap_invalidate(folio->mapping, folio, 0);
> +		folio_unlock(folio);
> +	}
> +}
> +
>  /**
>   * folio_end_writeback - End writeback against a folio.
>   * @folio: The folio.
> @@ -1610,6 +1631,8 @@ EXPORT_SYMBOL(folio_wait_private_2_killable);
>   */
>  void folio_end_writeback(struct folio *folio)
>  {
> +	bool folio_uncached = false;
> +
>  	VM_BUG_ON_FOLIO(!folio_test_writeback(folio), folio);
>  
>  	/*
> @@ -1631,9 +1654,14 @@ void folio_end_writeback(struct folio *folio)
>  	 * reused before the folio_wake_bit().
>  	 */
>  	folio_get(folio);
> +	if (folio_test_uncached(folio) && folio_test_clear_uncached(folio))
> +		folio_uncached = true;

Hm? Maybe

	folio_uncached = folio_test_clear_uncached(folio);

?

>  	if (__folio_end_writeback(folio))
>  		folio_wake_bit(folio, PG_writeback);
>  	acct_reclaim_writeback(folio);
> +
> +	if (folio_uncached)
> +		folio_end_uncached(folio);
>  	folio_put(folio);
>  }
>  EXPORT_SYMBOL(folio_end_writeback);
> -- 
> 2.45.2
> 

-- 
  Kiryl Shutsemau / Kirill A. Shutemov



* Re: [PATCH 10/16] mm/filemap: make buffered writes work with RWF_UNCACHED
  2024-11-12  8:02       ` Dave Chinner
@ 2024-11-12  9:50         ` Kirill A. Shutemov
  2024-11-12 13:36           ` Dave Chinner
  2024-11-12 14:51         ` Jens Axboe
  1 sibling, 1 reply; 37+ messages in thread
From: Kirill A. Shutemov @ 2024-11-12  9:50 UTC (permalink / raw)
  To: Dave Chinner
  Cc: Jens Axboe, linux-mm, linux-fsdevel, hannes, clm, linux-kernel,
	willy, linux-btrfs, linux-ext4, linux-xfs

On Tue, Nov 12, 2024 at 07:02:33PM +1100, Dave Chinner wrote:
> I think the post-IO invalidation that these IOs do is largely
> irrelevant to how the page cache processes the write. Indeed,
> from userspace, the functionality in this patchset would be
> implemented like this:
> 
> oneshot_data_write(fd, buf, len, off)
> {
> 	/* write into page cache */
> 	pwrite(fd, buf, len, off);
> 
> 	/* force the write through the page cache */
> 	sync_file_range(fd, off, len, SYNC_FILE_RANGE_WRITE | SYNC_FILE_RANGE_WAIT_AFTER);
> 
> 	/* Invalidate the single use data in the cache now it is on disk */
> 	posix_fadvise(fd, off, len, POSIX_FADV_DONTNEED);
> }
> 
> Allowing the application to control writeback and invalidation
> granularity is a much more flexible solution to the problem here;
> when IO is sequential, delayed allocation will be allowed to ensure
> large contiguous extents are created and that will greatly reduce
> file fragmentation on XFS, btrfs, bcachefs and ext4. For random
> writes, it'll submit async IOs in batches...
> 
> Given that io_uring already supports sync_file_range() and
> posix_fadvise(), I'm wondering why we need an new IO API to perform
> this specific write-through behaviour in a way that is less flexible
> than what applications can already implement through existing
> APIs....

Attaching the hint to the IO operation allows kernel to keep the data in
page cache if it is there for other reason. You cannot do it with a
separate syscall.

Consider a scenario of a nightly backup of the data. The same data is in
cache because the actual workload needs it. You don't want backup task to
invalidate the data from cache. Your snippet would do that.

-- 
  Kiryl Shutsemau / Kirill A. Shutemov



* Re: [PATCH 10/16] mm/filemap: make buffered writes work with RWF_UNCACHED
  2024-11-12  9:50         ` Kirill A. Shutemov
@ 2024-11-12 13:36           ` Dave Chinner
  0 siblings, 0 replies; 37+ messages in thread
From: Dave Chinner @ 2024-11-12 13:36 UTC (permalink / raw)
  To: Kirill A. Shutemov
  Cc: Jens Axboe, linux-mm, linux-fsdevel, hannes, clm, linux-kernel,
	willy, linux-btrfs, linux-ext4, linux-xfs

On Tue, Nov 12, 2024 at 11:50:46AM +0200, Kirill A. Shutemov wrote:
> On Tue, Nov 12, 2024 at 07:02:33PM +1100, Dave Chinner wrote:
> > I think the post-IO invalidation that these IOs do is largely
> > irrelevant to how the page cache processes the write. Indeed,
> > from userspace, the functionality in this patchset would be
> > implemented like this:
> > 
> > oneshot_data_write(fd, buf, len, off)
> > {
> > 	/* write into page cache */
> > 	pwrite(fd, buf, len, off);
> > 
> > 	/* force the write through the page cache */
> > 	sync_file_range(fd, off, len, SYNC_FILE_RANGE_WRITE | SYNC_FILE_RANGE_WAIT_AFTER);
> > 
> > 	/* Invalidate the single use data in the cache now it is on disk */
> > 	posix_fadvise(fd, off, len, POSIX_FADV_DONTNEED);
> > }
> > 
> > Allowing the application to control writeback and invalidation
> > granularity is a much more flexible solution to the problem here;
> > when IO is sequential, delayed allocation will be allowed to ensure
> > large contiguous extents are created and that will greatly reduce
> > file fragmentation on XFS, btrfs, bcachefs and ext4. For random
> > writes, it'll submit async IOs in batches...
> > 
> > Given that io_uring already supports sync_file_range() and
> > posix_fadvise(), I'm wondering why we need an new IO API to perform
> > this specific write-through behaviour in a way that is less flexible
> > than what applications can already implement through existing
> > APIs....
> 
> Attaching the hint to the IO operation allows kernel to keep the data in
> page cache if it is there for other reason. You cannot do it with a
> separate syscall.

Sure we can. FADV_NOREUSE is attached to the struct file - that's
available to every IO that is done on that file. Hence we know
before we start every IO on that file if we only need to preserve
existing page cache or all data we access.

Having a file marked like this doesn't affect any other application
that is accessing the same inode. It just means that the specific
fd opened by a specific process will not perturb the long term
residency of the page cache on that inode.

> Consider a scenario of a nightly backup of the data. The same data is in
> cache because the actual workload needs it. You don't want backup task to
> invalidate the data from cache. Your snippet would do that.

The code I presented was essentially just a demonstration of what
"uncached IO" was doing. That it is actually cached IO, and that it
can be done from userspace right now. Yes, it's not exactly the same
cache invalidation semantics, but that's not the point.

The point was that the existing APIs are *much more flexible* than
this proposal, and we don't actually need new kernel functionality
for applications to see the same benchmark results as Jens has
presented. All they need to do is be modified to use existing APIs.

The additional point to that end is that FADV_NOREUSE should be
hooked up to the conditional cache invalidation mechanism Jens added
to the page cache IO paths. Then we have all the functionality of
this patch set individually selectable by userspace applications
without needing a new IO API to be rolled out. i.e. the snippet
then becomes:

	/* don't cache after IO */
	posix_fadvise(fd, 0, 0, POSIX_FADV_NOREUSE);
	....
	pwrite(fd, buf, len, off);
	/* write through */
	sync_file_range(fd, off, len, SYNC_FILE_RANGE_WRITE);

Note how this doesn't need to block in sync_file_range() before
doing the invalidation anymore? We've separated the cache control
behaviour from the writeback behaviour. We can now do both write
back and write through buffered writes that clean up the page cache
after IO completion has occurred - write-through is not restricted
to uncached writes, nor is the cache purge after writeback
completion.

IOWs, we can do:

	/* don't cache after IO */
	posix_fadvise(fd, 0, 0, POSIX_FADV_NOREUSE);
	....
	off = pos;
	count = 4096;
	while (off < pos + len) {
		ret = pwrite(fd, buf, count, off);
		/* get more data and put it in buf */
		off += ret;
	}
	/* write through */
	sync_file_range(fd, pos, len, SYNC_FILE_RANGE_WRITE);

And now we only do one set of writeback on the file range, instead
of one per IO. And we still get the page cache being released on
writeback IO completion.

This is a *much* better API for IO and page cache control. It is not
constrained to individual IOs, so applications can allow the page
cache to write-combine data from multiple syscalls into a single
physical extent allocation and writeback IO.

This is much more efficient for modern filesystems - the "writeback
per IO" model forces filesystems like XFS and ext4 to work like ext3
did, and defeats buffered write IO optimisations like delayed
allocation. If we are going to do small "allocation and write IO"
patterns, we may as well be using direct IO as it is optimised for
that sort of behaviour.

So let's consider the backup application example. IMO, backup
applications really don't want to use this new uncached IO
mechanism for either reading or writing data.

Backup programs do sequential data read IO as they walk the backup set -
if they are doing buffered IO then we -really- want readahead to be
active.

However, uncached IO turns off readahead, which is the equivalent of
the backup application doing:

	posix_fadvise(fd, 0, 0, POSIX_FADV_RANDOM);
	while (len > 0) {
		ret = pread(fd, buf, len, off);
		posix_fadvise(fd, off, ret, POSIX_FADV_DONTNEED);

		/* do stuff with buf */

		off += ret;
		len -= ret;
	}

Sequential buffered read IO after setting FADV_RANDOM absolutely
*sucks* from a performance perspective.

This is when FADV_NOREUSE is useful. We can leave readahead turned
on, and when we do the first read from the page cache after
readahead completes, we can then apply the NOREUSE policy. i.e. if
the data we are reading has not been accessed, then turf it after
reading if NOREUSE is set. If the data was already resident in
cache, then leave it there as per a normal read.

IOWs, if we separate the cache control from the read IO itself,
there is no need to turn off readahead to implement "drop cache
on-read" semantics. We just need to know if the folio has been
accessed or not to determine what to do with it.
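
In pseudo-kernel-C, something like the below in the read path - a
sketch only; FMODE_NOREUSE is the per-file flag POSIX_FADV_NOREUSE
already sets, the rest is illustrative:

	/* after copying the folio contents to userspace */
	if ((iocb->ki_filp->f_mode & FMODE_NOREUSE) &&
	    !folio_test_referenced(folio) && folio_trylock(folio)) {
		if (folio->mapping)
			folio_unmap_invalidate(folio->mapping, folio, 0);
		folio_unlock(folio);
	}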

Let's also consider the backup data file - that is written
sequentially.  It's going to be large and we don't know its size
ahead of time. If we are using buffered writes we want delayed
allocation to optimise the file layout and hence writeback IO
throughput.  We also want to drop the page cache when writeback
eventually happens, but we really don't want writeback to happen on
every write.

IOWs, backup programs can take advantage of "drop cache when clean"
semantics, but can't really take any significant advantage from
per-IO write-through semantics. IOWs, backup applications really
want per-file NOREUSE write semantics that are separately controlled
w.r.t. cache write-through behaviour.

One of the points I tried to make was that the uncached IO proposal
smashes multiple disparate semantics into a single per-IO control
bit. The backup application example above shows exactly how that API
isn't actually very good for the applications that could benefit
from the functionality this patchset adds to the page cache to
support that single control bit...

-Dave.
-- 
Dave Chinner
david@fromorbit.com



* Re: [PATCH 03/16] mm: add PG_uncached page flag
  2024-11-12  9:12   ` Kirill A. Shutemov
@ 2024-11-12 14:07     ` Jens Axboe
  0 siblings, 0 replies; 37+ messages in thread
From: Jens Axboe @ 2024-11-12 14:07 UTC (permalink / raw)
  To: Kirill A. Shutemov
  Cc: linux-mm, linux-fsdevel, hannes, clm, linux-kernel, willy,
	linux-btrfs, linux-ext4, linux-xfs

On 11/12/24 2:12 AM, Kirill A. Shutemov wrote:
> On Mon, Nov 11, 2024 at 04:37:30PM -0700, Jens Axboe wrote:
>> Add a page flag that file IO can use to indicate that the IO being done
>> is uncached, as in it should not persist in the page cache after the IO
>> has been completed.
> 
> I have not found a way to avoid using a new bit. I am unsure if we have
> enough bits on 32-bit systems with all possible features enabled.

I think it should be OK, at least the kernel test bot reports build
success on all the archs it tests, which has a lot of 32-bit archs. I
have to say I didn't check on numbering and if the mm subsystem has a
BUILD_BUG_ON() for bits exceeding the allowable value for unsigned long
on the host, but I'm assuming it does?
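
Something along the lines of the below is what I'd expect to exist in
the page-flags headers - a sketch from memory, so treat the exact
widths involved as illustrative:

	/* flag bits plus the zone/node/section fields must fit in a word */
	BUILD_BUG_ON(__NR_PAGEFLAGS + ZONES_WIDTH + NODES_WIDTH +
		     SECTIONS_WIDTH > BITS_PER_LONG);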

> In the worst-case scenario, we may need to make the feature 64-bit only.
> I believe it should be acceptable as long as userspace is prepared for the
> possibility that RWF_UNCACHED may fail. It is not going to be supported by
> all filesystems anyway.

Right, I would not even see that as a big issue. 32-bit would just see
-EOPNOTSUPP for any fs, even ones that support it on 64-bit archs.

-- 
Jens Axboe



* Re: [PATCH 09/16] mm/filemap: drop uncached pages when writeback completes
  2024-11-12  9:31   ` Kirill A. Shutemov
@ 2024-11-12 14:09     ` Jens Axboe
  0 siblings, 0 replies; 37+ messages in thread
From: Jens Axboe @ 2024-11-12 14:09 UTC (permalink / raw)
  To: Kirill A. Shutemov
  Cc: linux-mm, linux-fsdevel, hannes, clm, linux-kernel, willy,
	linux-btrfs, linux-ext4, linux-xfs

On 11/12/24 2:31 AM, Kirill A. Shutemov wrote:
> On Mon, Nov 11, 2024 at 04:37:36PM -0700, Jens Axboe wrote:
>> If the folio is marked as uncached, drop pages when writeback completes.
>> Intended to be used with RWF_UNCACHED, to avoid needing sync writes for
>> uncached IO.
>>
>> Signed-off-by: Jens Axboe <axboe@kernel.dk>
>> ---
>>  mm/filemap.c | 28 ++++++++++++++++++++++++++++
>>  1 file changed, 28 insertions(+)
>>
>> diff --git a/mm/filemap.c b/mm/filemap.c
>> index 3d0614ea5f59..40debe742abe 100644
>> --- a/mm/filemap.c
>> +++ b/mm/filemap.c
>> @@ -1600,6 +1600,27 @@ int folio_wait_private_2_killable(struct folio *folio)
>>  }
>>  EXPORT_SYMBOL(folio_wait_private_2_killable);
>>  
>> +/*
>> + * If folio was marked as uncached, then pages should be dropped when writeback
>> + * completes. Do that now. If we fail, it's likely because of a big folio -
>> + * just reset uncached for that case and later completions should invalidate.
>> + */
>> +static void folio_end_uncached(struct folio *folio)
>> +{
>> +	/*
>> +	 * Hitting !in_task() should not happen off RWF_UNCACHED writeback, but
>> +	 * can happen if normal writeback just happens to find dirty folios
>> +	 * that were created as part of uncached writeback, and that writeback
>> +	 * would otherwise not need non-IRQ handling. Just skip the
>> +	 * invalidation in that case.
>> +	 */
>> +	if (in_task() && folio_trylock(folio)) {
>> +		if (folio->mapping)
>> +			folio_unmap_invalidate(folio->mapping, folio, 0);
>> +		folio_unlock(folio);
>> +	}
>> +}
>> +
>>  /**
>>   * folio_end_writeback - End writeback against a folio.
>>   * @folio: The folio.
>> @@ -1610,6 +1631,8 @@ EXPORT_SYMBOL(folio_wait_private_2_killable);
>>   */
>>  void folio_end_writeback(struct folio *folio)
>>  {
>> +	bool folio_uncached = false;
>> +
>>  	VM_BUG_ON_FOLIO(!folio_test_writeback(folio), folio);
>>  
>>  	/*
>> @@ -1631,9 +1654,14 @@ void folio_end_writeback(struct folio *folio)
>>  	 * reused before the folio_wake_bit().
>>  	 */
>>  	folio_get(folio);
>> +	if (folio_test_uncached(folio) && folio_test_clear_uncached(folio))
>> +		folio_uncached = true;
> 
> Hm? Maybe
> 
> 	folio_uncached = folio_test_clear_uncached(folio);
> 
> ?

It's done that way to avoid an RMW for the (for now, at least) common
case of not seeing uncached folios. For that case, you can get by with a
cheap test_bit; for the uncached case you pay the full price of the
test_clear.

Previous versions just had the test_clear, happy to just go back or add
a comment, whatever is preferred.
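
Spelled out with the raw bitops, the pattern is just:

	/* cheap non-atomic test first; only pay for the atomic RMW of
	 * test_and_clear when the flag is actually set */
	if (test_bit(PG_uncached, &folio->flags))
		folio_uncached = test_and_clear_bit(PG_uncached,
						    &folio->flags);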

-- 
Jens Axboe



* Re: [PATCH 10/16] mm/filemap: make buffered writes work with RWF_UNCACHED
  2024-11-12  8:02       ` Dave Chinner
  2024-11-12  9:50         ` Kirill A. Shutemov
@ 2024-11-12 14:51         ` Jens Axboe
  1 sibling, 0 replies; 37+ messages in thread
From: Jens Axboe @ 2024-11-12 14:51 UTC (permalink / raw)
  To: Dave Chinner
  Cc: linux-mm, linux-fsdevel, hannes, clm, linux-kernel, willy,
	kirill, linux-btrfs, linux-ext4, linux-xfs

On 11/12/24 1:02 AM, Dave Chinner wrote:
> On Mon, Nov 11, 2024 at 06:27:46PM -0700, Jens Axboe wrote:
>> On 11/11/24 5:57 PM, Dave Chinner wrote:
>>> On Mon, Nov 11, 2024 at 04:37:37PM -0700, Jens Axboe wrote:
>>>> If RWF_UNCACHED is set for a write, mark new folios being written with
>>>> uncached. This is done by passing in the fact that it's an uncached write
>>>> through the folio pointer. We can only get there when IOCB_UNCACHED was
>>>> allowed, which can only happen if the file system opts in. Opting in means
>>>> they need to check for the LSB in the folio pointer to know if it's an
>>>> uncached write or not. If it is, then FGP_UNCACHED should be used if
>>>> creating new folios is necessary.
>>>>
>>>> Uncached writes will drop any folios they create upon writeback
>>>> completion, but leave folios that may exist in that range alone. Since
>>>> ->write_begin() doesn't currently take any flags, and to avoid needing
>>>> to change the callback kernel wide, use the foliop being passed in to
>>>> ->write_begin() to signal if this is an uncached write or not. File
>>>> systems can then use that to mark newly created folios as uncached.
>>>>
>>>> Add a helper, generic_uncached_write(), that generic_file_write_iter()
>>>> calls upon successful completion of an uncached write.
>>>
>>> This doesn't implement an "uncached" write operation. This
>>> implements a cache write-through operation.
>>
>> It's uncached in the sense that the range gets pruned on writeback
>> completion.
> 
> That's not the definition of "uncached". Direct IO is, by
> definition, "uncached" because it bypasses the cache and is not
> coherent with the contents of the cache.

I grant you it's not the best word in the world to describe it, but it
is uncached in the sense that it's not persistent in cache. It does very
much use the page cache as the synchronization point, exactly to avoid
the pitfalls of the giant mess that is O_DIRECT. But it's not persistent
in cache, whereas write-through very much traditionally is. Hence I
think uncached is a much better word than write-through, though as
mentioned I'll be happy to take other suggestions. Write-through isn't
it though, as the uncached concept is as much about reads as it is about
writes.

> This IO, however, is moving the data coherently through the cache
> (both on read and write).  The cached folios are transient - i.e.
> -temporarily resident- in the cache whilst the IO is in progress -
> but this behaviour does not make it "uncached IO".
> 
> Calling it "uncached IO" is simply wrong from any direction I look
> at it....

As mentioned, better words welcome :-)

>> For write-through, I'd consider that just the fact that it
>> gets kicked off once dirtied rather than wait for writeback to get
>> kicked at some point.
>>
>> So I'd say write-through is a subset of that.
> 
> I think the post-IO invalidation that these IOs do is largely
> irrelevant to how the page cache processes the write. Indeed,
> from userspace, the functionality in this patchset would be
> implemented like this:
> 
> oneshot_data_write(fd, buf, len, off)
> {
> 	/* write into page cache */
> 	pwrite(fd, buf, len, off);
> 
> 	/* force the write through the page cache */
> 	sync_file_range(fd, off, len, SYNC_FILE_RANGE_WRITE | SYNC_FILE_RANGE_WAIT_AFTER);
> 
> 	/* Invalidate the single use data in the cache now it is on disk */
> 	posix_fadvise(fd, off, len, POSIX_FADV_DONTNEED);
> }

Right, you could do that, it'd obviously just be much slower as you lose
the pipelining of the writes. This is the reason for the patch, after
all.

> Allowing the application to control writeback and invalidation
> granularity is a much more flexible solution to the problem here;
> when IO is sequential, delayed allocation will be allowed to ensure
> large contiguous extents are created and that will greatly reduce
> file fragmentation on XFS, btrfs, bcachefs and ext4. For random
> writes, it'll submit async IOs in batches...
> 
> Given that io_uring already supports sync_file_range() and
> posix_fadvise(), I'm wondering why we need an new IO API to perform
> this specific write-through behaviour in a way that is less flexible
> than what applications can already implement through existing
> APIs....

Just to make it available generically, it's just a read/write flag after
all. And yes, you can very much do this already with io_uring, just by
linking the ops. But the way I see it, it's a generic solution to a
generic problem.
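
For reference, a rough liburing sketch of that linked-ops version
(ring setup and error handling omitted):

	struct io_uring_sqe *sqe;

	/* buffered write */
	sqe = io_uring_get_sqe(&ring);
	io_uring_prep_write(sqe, fd, buf, len, off);
	sqe->flags |= IOSQE_IO_LINK;

	/* kick and wait on writeback for that range */
	sqe = io_uring_get_sqe(&ring);
	io_uring_prep_sync_file_range(sqe, fd, len, off,
			SYNC_FILE_RANGE_WRITE | SYNC_FILE_RANGE_WAIT_AFTER);
	sqe->flags |= IOSQE_IO_LINK;

	/* then drop the now-clean range from the cache */
	sqe = io_uring_get_sqe(&ring);
	io_uring_prep_fadvise(sqe, fd, off, len, POSIX_FADV_DONTNEED);

	io_uring_submit(&ring);

That works, but it's three SQEs and three completions per range, where
RWF_UNCACHED is a single flag on the write itself.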

>>> That also gives us a common place for adding cache write-through
>>> trigger logic (think writebehind trigger logic similar to readahead)
>>> and this is also a place where we could automatically tag mapping
>>> ranges for reclaim on writeback completion....
>>
>> I appreciate that you seemingly like the concept, but not that you are
>> also seemingly trying to commandeer this to be something else. Unless
>> you like the automatic reclaiming as well, it's not clear to me.
> 
> I'm not trying to commandeer anything.

No? You're very much trying to steer it in a direction that you find
better. There's a difference between making suggestions and speaking
as if you are sitting on the ultimate truth.

> Having thought about it more, I think this new API is unnecessary
> for custom written applications to perform fine grained control of
> page cache residency of one-shot data. We already have APIs that
> allow applications to do exactly what this patchset is doing. Rather
> than modifying whatever benchmark was being used so it uses the
> existing APIs, a choice was made to modify both the application and
> the kernel to implement a whole new API....
> 
> I think that was the -wrong choice-.
> 
> I think this partly because the kernel modifications don't really
> help further us towards the goal of transparent mode switching in
> the page cache.
> 
> Read-through should be a mode that the readahead control activates,
> not be something triggered by a special read() syscall flag. We
> already have access patterns and fadvise modes guiding this.
> Write-through should be controlled in a similar way.
> 
> And making the data being read and written behave as transient page
> cache objects should be done via an existing fadvise mode, too,
> because the model you have implemented here exactly matches the 
> definition of FADV_NOREUSE:
> 
> 	POSIX_FADV_NOREUSE
>               The specified data will be accessed only once.
> 
> Having a new per-IO flag that effectively collides existing
> control functionality into a single inflexible API bit doesn't
> really make a whole lot of sense to me.
> 
> IOWs, I'm not questioning whether we need rw-through modes and/or
> IO-transient residency for page cache based IO - it's been on our
> radar for a while. I'm more concerned that the chosen API in this
> patchset is a poor one as it cannot replace any of the existing
> controls we already have for these sorts of application directed
> page cache manipulations...

We'll just have to disagree, then. Per-file settings are fine for sync
IO; for anything async, per-IO is the way to go. It's why we have things
like RWF_NOWAIT as well, even though O_NONBLOCK exists too. I'd argue
that RWF_NOWAIT should always have been a thing, and O_NONBLOCK is a
mistake. That's why RWF_UNCACHED exists. And yes, FADV_NOREUSE was
already discussed with Willy and Yu, and I already did a poc patch to
just unconditionally set RWF_UNCACHED for FADV_NOREUSE enabled files.
While it's not exactly the same concept, I think the overlap is large
enough that it makes sense to do that. Especially since, historically,
FADV_NOREUSE has been largely a no-op, and even now it doesn't have well
defined semantics.
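
For completeness, the per-IO usage looks like this - RWF_UNCACHED is
the flag this patchset adds, and filesystems that don't opt in via
FOP_UNCACHED return -EOPNOTSUPP:

	struct iovec iov = { .iov_base = buf, .iov_len = len };
	ssize_t ret = pwritev2(fd, &iov, 1, off, RWF_UNCACHED);

	/* fall back to a plain buffered write if unsupported */
	if (ret < 0 && errno == EOPNOTSUPP)
		ret = pwritev2(fd, &iov, 1, off, 0);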

-- 
Jens Axboe



* Re: [PATCH 12/16] ext4: add RWF_UNCACHED write support
  2024-11-11 23:37 ` [PATCH 12/16] ext4: add RWF_UNCACHED write support Jens Axboe
@ 2024-11-12 16:36   ` Brian Foster
  2024-11-12 17:13     ` Jens Axboe
  0 siblings, 1 reply; 37+ messages in thread
From: Brian Foster @ 2024-11-12 16:36 UTC (permalink / raw)
  To: Jens Axboe
  Cc: linux-mm, linux-fsdevel, hannes, clm, linux-kernel, willy,
	kirill, linux-btrfs, linux-ext4, linux-xfs

On Mon, Nov 11, 2024 at 04:37:39PM -0700, Jens Axboe wrote:
> IOCB_UNCACHED IO needs to prune writeback regions on IO completion,
> and hence needs the worker punt that ext4 also does for unwritten
> extents. Add an io_end flag to manage that.
> 
> If foliop is set to foliop_uncached in ext4_write_begin(), then set
> FGP_UNCACHED so that __filemap_get_folio() will mark newly created
> folios as uncached. That in turn will make writeback completion drop
> these ranges from the page cache.
> 
> Now that ext4 supports both uncached reads and writes, add the fop_flag
> FOP_UNCACHED to enable it.
> 
> Signed-off-by: Jens Axboe <axboe@kernel.dk>
> ---
>  fs/ext4/ext4.h    |  1 +
>  fs/ext4/file.c    |  2 +-
>  fs/ext4/inline.c  |  7 ++++++-
>  fs/ext4/inode.c   | 18 ++++++++++++++++--
>  fs/ext4/page-io.c | 28 ++++++++++++++++------------
>  5 files changed, 40 insertions(+), 16 deletions(-)
> 
...
> diff --git a/fs/ext4/inode.c b/fs/ext4/inode.c
> index 54bdd4884fe6..afae3ab64c9e 100644
> --- a/fs/ext4/inode.c
> +++ b/fs/ext4/inode.c
> @@ -1138,6 +1138,7 @@ static int ext4_write_begin(struct file *file, struct address_space *mapping,
>  	int ret, needed_blocks;
>  	handle_t *handle;
>  	int retries = 0;
> +	fgf_t fgp_flags;
>  	struct folio *folio;
>  	pgoff_t index;
>  	unsigned from, to;
> @@ -1164,6 +1165,15 @@ static int ext4_write_begin(struct file *file, struct address_space *mapping,
>  			return 0;
>  	}
>  
> +	/*
> +	 * Set FGP_WRITEBEGIN, and FGP_UNCACHED if foliop contains
> +	 * foliop_uncached. That's how generic_perform_write() informs us
> +	 * that this is an uncached write.
> +	 */
> +	fgp_flags = FGP_WRITEBEGIN;
> +	if (*foliop == foliop_uncached)
> +		fgp_flags |= FGP_UNCACHED;
> +
>  	/*
>  	 * __filemap_get_folio() can take a long time if the
>  	 * system is thrashing due to memory pressure, or if the folio
> @@ -1172,7 +1182,7 @@ static int ext4_write_begin(struct file *file, struct address_space *mapping,
>  	 * the folio (if needed) without using GFP_NOFS.
>  	 */
>  retry_grab:
> -	folio = __filemap_get_folio(mapping, index, FGP_WRITEBEGIN,
> +	folio = __filemap_get_folio(mapping, index, fgp_flags,
>  					mapping_gfp_mask(mapping));
>  	if (IS_ERR(folio))
>  		return PTR_ERR(folio);

JFYI, I notice that ext4 cycles the folio lock here in this path and
thus follows up with a couple checks presumably to accommodate that. One
is whether i_mapping has changed, which I assume means uncached state
would have been handled/cleared externally somewhere..? I.e., if an
uncached folio is somehow truncated/freed without ever having been
written back?

The next is a folio_wait_stable() call "in case writeback began ..."
It's not immediately clear to me if that is possible here, but taking
that at face value, is it an issue if we were to create an uncached
folio, drop the folio lock, then have some other task dirty and
writeback the folio (due to a sync write or something), then have
writeback completion invalidate the folio before we relock it here?

Brian

> @@ -2903,6 +2913,7 @@ static int ext4_da_write_begin(struct file *file, struct address_space *mapping,
>  	struct folio *folio;
>  	pgoff_t index;
>  	struct inode *inode = mapping->host;
> +	fgf_t fgp_flags;
>  
>  	if (unlikely(ext4_forced_shutdown(inode->i_sb)))
>  		return -EIO;
> @@ -2926,8 +2937,11 @@ static int ext4_da_write_begin(struct file *file, struct address_space *mapping,
>  			return 0;
>  	}
>  
> +	fgp_flags = FGP_WRITEBEGIN;
> +	if (*foliop == foliop_uncached)
> +		fgp_flags |= FGP_UNCACHED;
>  retry:
> -	folio = __filemap_get_folio(mapping, index, FGP_WRITEBEGIN,
> +	folio = __filemap_get_folio(mapping, index, fgp_flags,
>  			mapping_gfp_mask(mapping));
>  	if (IS_ERR(folio))
>  		return PTR_ERR(folio);
> diff --git a/fs/ext4/page-io.c b/fs/ext4/page-io.c
> index ad5543866d21..10447c3c4ff1 100644
> --- a/fs/ext4/page-io.c
> +++ b/fs/ext4/page-io.c
> @@ -226,8 +226,6 @@ static void ext4_add_complete_io(ext4_io_end_t *io_end)
>  	unsigned long flags;
>  
>  	/* Only reserved conversions from writeback should enter here */
> -	WARN_ON(!(io_end->flag & EXT4_IO_END_UNWRITTEN));
> -	WARN_ON(!io_end->handle && sbi->s_journal);
>  	spin_lock_irqsave(&ei->i_completed_io_lock, flags);
>  	wq = sbi->rsv_conversion_wq;
>  	if (list_empty(&ei->i_rsv_conversion_list))
> @@ -252,7 +250,7 @@ static int ext4_do_flush_completed_IO(struct inode *inode,
>  
>  	while (!list_empty(&unwritten)) {
>  		io_end = list_entry(unwritten.next, ext4_io_end_t, list);
> -		BUG_ON(!(io_end->flag & EXT4_IO_END_UNWRITTEN));
> +		BUG_ON(!(io_end->flag & (EXT4_IO_END_UNWRITTEN|EXT4_IO_UNCACHED)));
>  		list_del_init(&io_end->list);
>  
>  		err = ext4_end_io_end(io_end);
> @@ -287,14 +285,15 @@ ext4_io_end_t *ext4_init_io_end(struct inode *inode, gfp_t flags)
>  
>  void ext4_put_io_end_defer(ext4_io_end_t *io_end)
>  {
> -	if (refcount_dec_and_test(&io_end->count)) {
> -		if (!(io_end->flag & EXT4_IO_END_UNWRITTEN) ||
> -				list_empty(&io_end->list_vec)) {
> -			ext4_release_io_end(io_end);
> -			return;
> -		}
> -		ext4_add_complete_io(io_end);
> +	if (!refcount_dec_and_test(&io_end->count))
> +		return;
> +	if ((!(io_end->flag & EXT4_IO_END_UNWRITTEN) ||
> +	    list_empty(&io_end->list_vec)) &&
> +	    !(io_end->flag & EXT4_IO_UNCACHED)) {
> +		ext4_release_io_end(io_end);
> +		return;
>  	}
> +	ext4_add_complete_io(io_end);
>  }
>  
>  int ext4_put_io_end(ext4_io_end_t *io_end)
> @@ -348,7 +347,7 @@ static void ext4_end_bio(struct bio *bio)
>  				blk_status_to_errno(bio->bi_status));
>  	}
>  
> -	if (io_end->flag & EXT4_IO_END_UNWRITTEN) {
> +	if (io_end->flag & (EXT4_IO_END_UNWRITTEN|EXT4_IO_UNCACHED)) {
>  		/*
>  		 * Link bio into list hanging from io_end. We have to do it
>  		 * atomically as bio completions can be racing against each
> @@ -417,8 +416,13 @@ static void io_submit_add_bh(struct ext4_io_submit *io,
>  submit_and_retry:
>  		ext4_io_submit(io);
>  	}
> -	if (io->io_bio == NULL)
> +	if (io->io_bio == NULL) {
>  		io_submit_init_bio(io, bh);
> +		if (folio_test_uncached(folio)) {
> +			ext4_io_end_t *io_end = io->io_bio->bi_private;
> +			io_end->flag |= EXT4_IO_UNCACHED;
> +		}
> +	}
>  	if (!bio_add_folio(io->io_bio, io_folio, bh->b_size, bh_offset(bh)))
>  		goto submit_and_retry;
>  	wbc_account_cgroup_owner(io->io_wbc, &folio->page, bh->b_size);
> -- 
> 2.45.2
> 
> 




* Re: [PATCH 13/16] iomap: make buffered writes work with RWF_UNCACHED
  2024-11-11 23:37 ` [PATCH 13/16] iomap: make buffered writes work with RWF_UNCACHED Jens Axboe
  2024-11-12  1:01   ` Darrick J. Wong
@ 2024-11-12 16:37   ` Brian Foster
  2024-11-12 17:16     ` Jens Axboe
  1 sibling, 1 reply; 37+ messages in thread
From: Brian Foster @ 2024-11-12 16:37 UTC (permalink / raw)
  To: Jens Axboe
  Cc: linux-mm, linux-fsdevel, hannes, clm, linux-kernel, willy,
	kirill, linux-btrfs, linux-ext4, linux-xfs

On Mon, Nov 11, 2024 at 04:37:40PM -0700, Jens Axboe wrote:
> Add iomap buffered write support for RWF_UNCACHED. If RWF_UNCACHED is
> set for a write, mark the folios being written with drop_writeback. Then

s/drop_writeback/uncached/ ?

BTW, this might be getting into wonky "don't care that much" territory,
but something else to be aware of is that certain writes can potentially
change pagecache state as a side effect outside of the actual buffered
write itself.

For example, xfs calls iomap_zero_range() on write extension (i.e. pos >
isize), which uses buffered writes and thus could populate a pagecache
folio without setting it uncached, even if done on behalf of an uncached
write.

I've only made a first pass and could be missing some details, but IIUC
I _think_ this means something like writing out a stream of small,
sparse and file extending uncached writes could actually end up behaving
more like sync I/O. Again, not saying that's something we really care
about, just raising it in case it's worth considering or documenting..
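
(For reference, the path I mean is roughly the following - shape only,
from memory:

	/* in xfs_file_write_checks(), on a write past EOF */
	if (iocb->ki_pos > isize)
		error = xfs_zero_range(ip, isize,
				       iocb->ki_pos - isize, NULL);

xfs_zero_range() ends up in iomap_zero_range(), which goes through the
buffered write machinery and so can instantiate folios without ever
seeing IOCB_UNCACHED.)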

Brian

> writeback completion will drop the pages. The write_iter handler simply
> kicks off writeback for the pages, and writeback completion will take
> care of the rest.
> 
> This still needs the user of the iomap buffered write helpers to call
> iocb_uncached_write() upon successful issue of the writes.
> 
> Signed-off-by: Jens Axboe <axboe@kernel.dk>
> ---
>  fs/iomap/buffered-io.c | 15 +++++++++++++--
>  include/linux/iomap.h  |  4 +++-
>  2 files changed, 16 insertions(+), 3 deletions(-)
> 
> diff --git a/fs/iomap/buffered-io.c b/fs/iomap/buffered-io.c
> index ef0b68bccbb6..2f2a5db04a68 100644
> --- a/fs/iomap/buffered-io.c
> +++ b/fs/iomap/buffered-io.c
> @@ -603,6 +603,8 @@ struct folio *iomap_get_folio(struct iomap_iter *iter, loff_t pos, size_t len)
>  
>  	if (iter->flags & IOMAP_NOWAIT)
>  		fgp |= FGP_NOWAIT;
> +	if (iter->flags & IOMAP_UNCACHED)
> +		fgp |= FGP_UNCACHED;
>  	fgp |= fgf_set_order(len);
>  
>  	return __filemap_get_folio(iter->inode->i_mapping, pos >> PAGE_SHIFT,
> @@ -1023,8 +1025,9 @@ ssize_t
>  iomap_file_buffered_write(struct kiocb *iocb, struct iov_iter *i,
>  		const struct iomap_ops *ops, void *private)
>  {
> +	struct address_space *mapping = iocb->ki_filp->f_mapping;
>  	struct iomap_iter iter = {
> -		.inode		= iocb->ki_filp->f_mapping->host,
> +		.inode		= mapping->host,
>  		.pos		= iocb->ki_pos,
>  		.len		= iov_iter_count(i),
>  		.flags		= IOMAP_WRITE,
> @@ -1034,9 +1037,14 @@ iomap_file_buffered_write(struct kiocb *iocb, struct iov_iter *i,
>  
>  	if (iocb->ki_flags & IOCB_NOWAIT)
>  		iter.flags |= IOMAP_NOWAIT;
> +	if (iocb->ki_flags & IOCB_UNCACHED)
> +		iter.flags |= IOMAP_UNCACHED;
>  
> -	while ((ret = iomap_iter(&iter, ops)) > 0)
> +	while ((ret = iomap_iter(&iter, ops)) > 0) {
> +		if (iocb->ki_flags & IOCB_UNCACHED)
> +			iter.iomap.flags |= IOMAP_F_UNCACHED;
>  		iter.processed = iomap_write_iter(&iter, i);
> +	}
>  
>  	if (unlikely(iter.pos == iocb->ki_pos))
>  		return ret;
> @@ -1770,6 +1778,9 @@ static int iomap_add_to_ioend(struct iomap_writepage_ctx *wpc,
>  	size_t poff = offset_in_folio(folio, pos);
>  	int error;
>  
> +	if (folio_test_uncached(folio))
> +		wpc->iomap.flags |= IOMAP_F_UNCACHED;
> +
>  	if (!wpc->ioend || !iomap_can_add_to_ioend(wpc, pos)) {
>  new_ioend:
>  		error = iomap_submit_ioend(wpc, 0);
> diff --git a/include/linux/iomap.h b/include/linux/iomap.h
> index f61407e3b121..2efc72df19a2 100644
> --- a/include/linux/iomap.h
> +++ b/include/linux/iomap.h
> @@ -64,6 +64,7 @@ struct vm_fault;
>  #define IOMAP_F_BUFFER_HEAD	0
>  #endif /* CONFIG_BUFFER_HEAD */
>  #define IOMAP_F_XATTR		(1U << 5)
> +#define IOMAP_F_UNCACHED	(1U << 6)
>  
>  /*
>   * Flags set by the core iomap code during operations:
> @@ -173,8 +174,9 @@ struct iomap_folio_ops {
>  #define IOMAP_NOWAIT		(1 << 5) /* do not block */
>  #define IOMAP_OVERWRITE_ONLY	(1 << 6) /* only pure overwrites allowed */
>  #define IOMAP_UNSHARE		(1 << 7) /* unshare_file_range */
> +#define IOMAP_UNCACHED		(1 << 8) /* uncached IO */
>  #ifdef CONFIG_FS_DAX
> -#define IOMAP_DAX		(1 << 8) /* DAX mapping */
> +#define IOMAP_DAX		(1 << 9) /* DAX mapping */
>  #else
>  #define IOMAP_DAX		0
>  #endif /* CONFIG_FS_DAX */
> -- 
> 2.45.2
> 
> 




* Re: [PATCH 12/16] ext4: add RWF_UNCACHED write support
  2024-11-12 16:36   ` Brian Foster
@ 2024-11-12 17:13     ` Jens Axboe
  2024-11-12 18:11       ` Brian Foster
  0 siblings, 1 reply; 37+ messages in thread
From: Jens Axboe @ 2024-11-12 17:13 UTC (permalink / raw)
  To: Brian Foster
  Cc: linux-mm, linux-fsdevel, hannes, clm, linux-kernel, willy,
	kirill, linux-btrfs, linux-ext4, linux-xfs

On 11/12/24 9:36 AM, Brian Foster wrote:
> On Mon, Nov 11, 2024 at 04:37:39PM -0700, Jens Axboe wrote:
>> IOCB_UNCACHED IO needs to prune writeback regions on IO completion,
>> and hence needs the worker punt that ext4 also does for unwritten
>> extents. Add an io_end flag to manage that.
>>
>> If foliop is set to foliop_uncached in ext4_write_begin(), then set
>> FGP_UNCACHED so that __filemap_get_folio() will mark newly created
>> folios as uncached. That in turn will make writeback completion drop
>> these ranges from the page cache.
>>
>> Now that ext4 supports both uncached reads and writes, add the fop_flag
>> FOP_UNCACHED to enable it.
>>
>> Signed-off-by: Jens Axboe <axboe@kernel.dk>
>> ---
>>  fs/ext4/ext4.h    |  1 +
>>  fs/ext4/file.c    |  2 +-
>>  fs/ext4/inline.c  |  7 ++++++-
>>  fs/ext4/inode.c   | 18 ++++++++++++++++--
>>  fs/ext4/page-io.c | 28 ++++++++++++++++------------
>>  5 files changed, 40 insertions(+), 16 deletions(-)
>>
> ...
>> diff --git a/fs/ext4/inode.c b/fs/ext4/inode.c
>> index 54bdd4884fe6..afae3ab64c9e 100644
>> --- a/fs/ext4/inode.c
>> +++ b/fs/ext4/inode.c
>> @@ -1138,6 +1138,7 @@ static int ext4_write_begin(struct file *file, struct address_space *mapping,
>>  	int ret, needed_blocks;
>>  	handle_t *handle;
>>  	int retries = 0;
>> +	fgf_t fgp_flags;
>>  	struct folio *folio;
>>  	pgoff_t index;
>>  	unsigned from, to;
>> @@ -1164,6 +1165,15 @@ static int ext4_write_begin(struct file *file, struct address_space *mapping,
>>  			return 0;
>>  	}
>>  
>> +	/*
>> +	 * Set FGP_WRITEBEGIN, and FGP_UNCACHED if foliop contains
>> +	 * foliop_uncached. That's how generic_perform_write() informs us
>> +	 * that this is an uncached write.
>> +	 */
>> +	fgp_flags = FGP_WRITEBEGIN;
>> +	if (*foliop == foliop_uncached)
>> +		fgp_flags |= FGP_UNCACHED;
>> +
>>  	/*
>>  	 * __filemap_get_folio() can take a long time if the
>>  	 * system is thrashing due to memory pressure, or if the folio
>> @@ -1172,7 +1182,7 @@ static int ext4_write_begin(struct file *file, struct address_space *mapping,
>>  	 * the folio (if needed) without using GFP_NOFS.
>>  	 */
>>  retry_grab:
>> -	folio = __filemap_get_folio(mapping, index, FGP_WRITEBEGIN,
>> +	folio = __filemap_get_folio(mapping, index, fgp_flags,
>>  					mapping_gfp_mask(mapping));
>>  	if (IS_ERR(folio))
>>  		return PTR_ERR(folio);
> 
> JFYI, I notice that ext4 cycles the folio lock here in this path and
> thus follows up with a couple checks presumably to accommodate that. One
> is whether i_mapping has changed, which I assume means uncached state
> would have been handled/cleared externally somewhere..? I.e., if an
> uncached folio is somehow truncated/freed without ever having been
> written back?
> 
> The next is a folio_wait_stable() call "in case writeback began ..."
> It's not immediately clear to me if that is possible here, but taking
> that at face value, is it an issue if we were to create an uncached
> folio, drop the folio lock, then have some other task dirty and
> writeback the folio (due to a sync write or something), then have
> writeback completion invalidate the folio before we relock it here?

I don't think either of those is an issue. The UNCACHED flag will only
be set on a newly created folio, it does not get inherited by folios
that already exist.

-- 
Jens Axboe



* Re: [PATCH 13/16] iomap: make buffered writes work with RWF_UNCACHED
  2024-11-12 16:37   ` Brian Foster
@ 2024-11-12 17:16     ` Jens Axboe
  2024-11-12 18:15       ` Brian Foster
  0 siblings, 1 reply; 37+ messages in thread
From: Jens Axboe @ 2024-11-12 17:16 UTC (permalink / raw)
  To: Brian Foster
  Cc: linux-mm, linux-fsdevel, hannes, clm, linux-kernel, willy,
	kirill, linux-btrfs, linux-ext4, linux-xfs

On 11/12/24 9:37 AM, Brian Foster wrote:
> On Mon, Nov 11, 2024 at 04:37:40PM -0700, Jens Axboe wrote:
>> Add iomap buffered write support for RWF_UNCACHED. If RWF_UNCACHED is
>> set for a write, mark the folios being written with drop_writeback. Then
> 
> s/drop_writeback/uncached/ ?

Ah indeed, guess that never got changed. Thanks, will fix that in the
commit message.

> BTW, this might be getting into wonky "don't care that much" territory,
> but something else to be aware of is that certain writes can potentially
> change pagecache state as a side effect outside of the actual buffered
> write itself.
> 
> For example, xfs calls iomap_zero_range() on write extension (i.e. pos >
> isize), which uses buffered writes and thus could populate a pagecache
> folio without setting it uncached, even if done on behalf of an uncached
> write.
> 
> I've only made a first pass and could be missing some details, but IIUC
> I _think_ this means something like writing out a stream of small,
> sparse and file extending uncached writes could actually end up behaving
> more like sync I/O. Again, not saying that's something we really care
> about, just raising it in case it's worth considering or documenting..

No, that's useful info, I'm not really surprised that there would still
be cases where UNCACHED goes unnoticed. In other words, I'd be surprised
if the current patches for eg xfs/ext4 cover all the cases where new
folios are created and should be marked as UNCACHED if IOCB_UNCACHED is
set in the iocb.

I think those can be sorted out or documented as we move forward.
UNCACHED is really just a hint - the kernel should do its best to not
have permanent folios for this IO, but there are certainly cases where
it won't be honored if you're racing with regular buffered IO or mmap.
For the case above, though, it sounds like we could cover that, and
probably should.

-- 
Jens Axboe



* Re: [PATCH 12/16] ext4: add RWF_UNCACHED write support
  2024-11-12 17:13     ` Jens Axboe
@ 2024-11-12 18:11       ` Brian Foster
  2024-11-12 18:47         ` Jens Axboe
  0 siblings, 1 reply; 37+ messages in thread
From: Brian Foster @ 2024-11-12 18:11 UTC (permalink / raw)
  To: Jens Axboe
  Cc: linux-mm, linux-fsdevel, hannes, clm, linux-kernel, willy,
	kirill, linux-btrfs, linux-ext4, linux-xfs

On Tue, Nov 12, 2024 at 10:13:12AM -0700, Jens Axboe wrote:
> On 11/12/24 9:36 AM, Brian Foster wrote:
> > On Mon, Nov 11, 2024 at 04:37:39PM -0700, Jens Axboe wrote:
> >> IOCB_UNCACHED IO needs to prune writeback regions on IO completion,
> >> and hence need the worker punt that ext4 also does for unwritten
> >> extents. Add an io_end flag to manage that.
> >>
> >> If foliop is set to foliop_uncached in ext4_write_begin(), then set
> >> FGP_UNCACHED so that __filemap_get_folio() will mark newly created
> >> folios as uncached. That in turn will make writeback completion drop
> >> these ranges from the page cache.
> >>
> >> Now that ext4 supports both uncached reads and writes, add the fop_flag
> >> FOP_UNCACHED to enable it.
> >>
> >> Signed-off-by: Jens Axboe <axboe@kernel.dk>
> >> ---
> >>  fs/ext4/ext4.h    |  1 +
> >>  fs/ext4/file.c    |  2 +-
> >>  fs/ext4/inline.c  |  7 ++++++-
> >>  fs/ext4/inode.c   | 18 ++++++++++++++++--
> >>  fs/ext4/page-io.c | 28 ++++++++++++++++------------
> >>  5 files changed, 40 insertions(+), 16 deletions(-)
> >>
> > ...
> >> diff --git a/fs/ext4/inode.c b/fs/ext4/inode.c
> >> index 54bdd4884fe6..afae3ab64c9e 100644
> >> --- a/fs/ext4/inode.c
> >> +++ b/fs/ext4/inode.c
> >> @@ -1138,6 +1138,7 @@ static int ext4_write_begin(struct file *file, struct address_space *mapping,
> >>  	int ret, needed_blocks;
> >>  	handle_t *handle;
> >>  	int retries = 0;
> >> +	fgf_t fgp_flags;
> >>  	struct folio *folio;
> >>  	pgoff_t index;
> >>  	unsigned from, to;
> >> @@ -1164,6 +1165,15 @@ static int ext4_write_begin(struct file *file, struct address_space *mapping,
> >>  			return 0;
> >>  	}
> >>  
> >> +	/*
> >> +	 * Set FGP_WRITEBEGIN, and FGP_UNCACHED if foliop contains
> >> +	 * foliop_uncached. That's how generic_perform_write() informs us
> >> +	 * that this is an uncached write.
> >> +	 */
> >> +	fgp_flags = FGP_WRITEBEGIN;
> >> +	if (*foliop == foliop_uncached)
> >> +		fgp_flags |= FGP_UNCACHED;
> >> +
> >>  	/*
> >>  	 * __filemap_get_folio() can take a long time if the
> >>  	 * system is thrashing due to memory pressure, or if the folio
> >> @@ -1172,7 +1182,7 @@ static int ext4_write_begin(struct file *file, struct address_space *mapping,
> >>  	 * the folio (if needed) without using GFP_NOFS.
> >>  	 */
> >>  retry_grab:
> >> -	folio = __filemap_get_folio(mapping, index, FGP_WRITEBEGIN,
> >> +	folio = __filemap_get_folio(mapping, index, fgp_flags,
> >>  					mapping_gfp_mask(mapping));
> >>  	if (IS_ERR(folio))
> >>  		return PTR_ERR(folio);
> > 
> > JFYI, I notice that ext4 cycles the folio lock here in this path and
> > thus follows up with a couple checks presumably to accommodate that. One
> > is whether i_mapping has changed, which I assume means uncached state
> > would have been handled/cleared externally somewhere..? I.e., if an
> > uncached folio is somehow truncated/freed without ever having been
> > written back?
> > 
> > The next is a folio_wait_stable() call "in case writeback began ..."
> > It's not immediately clear to me if that is possible here, but taking
> > that at face value, is it an issue if we were to create an uncached
> > folio, drop the folio lock, then have some other task dirty and
> > writeback the folio (due to a sync write or something), then have
> > writeback completion invalidate the folio before we relock it here?
> 
> I don't think either of those is an issue. The UNCACHED flag will only
> be set on a newly created folio, it does not get inherited by folios
> that already exist.
> 

Right.. but what I was wondering about for that latter case is when the
folio is created here by ext4, so uncached is set before it is unlocked.

On second look I guess the uncached completion invalidation should clear
mapping and thus trigger the retry logic here. That seems reasonable
enough, but is it still possible to race with writeback?

Maybe this is a better way to ask.. what happens if a write completes to
an uncached folio that is already under writeback? For example, uncached
write 1 completes, submits for writeback and returns to userspace. Then
write 2 begins and redirties the same folio before the uncached
writeback completes.

If I follow correctly, if write 2 is also uncached, it eventually blocks
in writeback submission (folio_prepare_writeback() ->
folio_wait_writeback()). It looks like folio lock is held there, so
presumably that would bypass the completion time invalidation in
folio_end_uncached(). But what if write 2 was not uncached or perhaps
writeback completion won the race for folio lock vs. the write side
(between locking the folio for dirtying and later for writeback
submission)? Does anything prevent invalidation of the folio before the
second write is submitted for writeback?

IOW, I'm wondering if the uncached completion time invalidation also
needs a folio dirty check..?

Brian

> -- 
> Jens Axboe
> 




* Re: [PATCH 13/16] iomap: make buffered writes work with RWF_UNCACHED
  2024-11-12 17:16     ` Jens Axboe
@ 2024-11-12 18:15       ` Brian Foster
  0 siblings, 0 replies; 37+ messages in thread
From: Brian Foster @ 2024-11-12 18:15 UTC (permalink / raw)
  To: Jens Axboe
  Cc: linux-mm, linux-fsdevel, hannes, clm, linux-kernel, willy,
	kirill, linux-btrfs, linux-ext4, linux-xfs

On Tue, Nov 12, 2024 at 10:16:10AM -0700, Jens Axboe wrote:
> On 11/12/24 9:37 AM, Brian Foster wrote:
> > On Mon, Nov 11, 2024 at 04:37:40PM -0700, Jens Axboe wrote:
> >> Add iomap buffered write support for RWF_UNCACHED. If RWF_UNCACHED is
> >> set for a write, mark the folios being written with drop_writeback. Then
> > 
> > s/drop_writeback/uncached/ ?
> 
> Ah indeed, guess that never got changed. Thanks, will fix that in the
> commit message.
> 
> > BTW, this might be getting into wonky "don't care that much" territory,
> > but something else to be aware of is that certain writes can potentially
> > change pagecache state as a side effect outside of the actual buffered
> > write itself.
> > 
> > For example, xfs calls iomap_zero_range() on write extension (i.e. pos >
> > isize), which uses buffered writes and thus could populate a pagecache
> > folio without setting it uncached, even if done on behalf of an uncached
> > write.
> > 
> > I've only made a first pass and could be missing some details, but IIUC
> > I _think_ this means something like writing out a stream of small,
> > sparse and file extending uncached writes could actually end up behaving
> > more like sync I/O. Again, not saying that's something we really care
> > about, just raising it in case it's worth considering or documenting..
> 
> No, that's useful info, I'm not really surprised that there would still
> be cases where UNCACHED goes unnoticed. In other words, I'd be surprised
> if the current patches for eg xfs/ext4 cover all the cases where new
> folios are created and should be marked as UNCACHED if IOCB_UNCACHED is
> set in the iocb.
> 
> I think those can be sorted out or documented as we move forward.
> UNCACHED is really just a hint - the kernel should do its best to not
> have permanent folios for this IO, but there are certainly cases where
> it won't be honored if you're racing with regular buffered IO or mmap.
> For the case above, though, it sounds like we could cover that, and
> probably should.
> 

Ok. I suppose you could plumb the iocb state through the zero range call
as well, but given the description/semantics I wouldn't blame you if you
wanted to leave that for a follow-on improvement. Thanks for the
explanation.

Brian

> -- 
> Jens Axboe
> 




* Re: [PATCH 12/16] ext4: add RWF_UNCACHED write support
  2024-11-12 18:11       ` Brian Foster
@ 2024-11-12 18:47         ` Jens Axboe
  0 siblings, 0 replies; 37+ messages in thread
From: Jens Axboe @ 2024-11-12 18:47 UTC (permalink / raw)
  To: Brian Foster
  Cc: linux-mm, linux-fsdevel, hannes, clm, linux-kernel, willy,
	kirill, linux-btrfs, linux-ext4, linux-xfs

On 11/12/24 11:11 AM, Brian Foster wrote:
> On Tue, Nov 12, 2024 at 10:13:12AM -0700, Jens Axboe wrote:
>> On 11/12/24 9:36 AM, Brian Foster wrote:
>>> On Mon, Nov 11, 2024 at 04:37:39PM -0700, Jens Axboe wrote:
>>>> IOCB_UNCACHED IO needs to prune writeback regions on IO completion,
>>>> and hence needs the worker punt that ext4 also does for unwritten
>>>> extents. Add an io_end flag to manage that.
>>>>
>>>> If foliop is set to foliop_uncached in ext4_write_begin(), then set
>>>> FGP_UNCACHED so that __filemap_get_folio() will mark newly created
>>>> folios as uncached. That in turn will make writeback completion drop
>>>> these ranges from the page cache.
>>>>
>>>> Now that ext4 supports both uncached reads and writes, add the fop_flag
>>>> FOP_UNCACHED to enable it.
>>>>
>>>> Signed-off-by: Jens Axboe <axboe@kernel.dk>
>>>> ---
>>>>  fs/ext4/ext4.h    |  1 +
>>>>  fs/ext4/file.c    |  2 +-
>>>>  fs/ext4/inline.c  |  7 ++++++-
>>>>  fs/ext4/inode.c   | 18 ++++++++++++++++--
>>>>  fs/ext4/page-io.c | 28 ++++++++++++++++------------
>>>>  5 files changed, 40 insertions(+), 16 deletions(-)
>>>>
>>> ...
>>>> diff --git a/fs/ext4/inode.c b/fs/ext4/inode.c
>>>> index 54bdd4884fe6..afae3ab64c9e 100644
>>>> --- a/fs/ext4/inode.c
>>>> +++ b/fs/ext4/inode.c
>>>> @@ -1138,6 +1138,7 @@ static int ext4_write_begin(struct file *file, struct address_space *mapping,
>>>>  	int ret, needed_blocks;
>>>>  	handle_t *handle;
>>>>  	int retries = 0;
>>>> +	fgf_t fgp_flags;
>>>>  	struct folio *folio;
>>>>  	pgoff_t index;
>>>>  	unsigned from, to;
>>>> @@ -1164,6 +1165,15 @@ static int ext4_write_begin(struct file *file, struct address_space *mapping,
>>>>  			return 0;
>>>>  	}
>>>>  
>>>> +	/*
>>>> +	 * Set FGP_WRITEBEGIN, and FGP_UNCACHED if foliop contains
>>>> +	 * foliop_uncached. That's how generic_perform_write() informs us
>>>> +	 * that this is an uncached write.
>>>> +	 */
>>>> +	fgp_flags = FGP_WRITEBEGIN;
>>>> +	if (*foliop == foliop_uncached)
>>>> +		fgp_flags |= FGP_UNCACHED;
>>>> +
>>>>  	/*
>>>>  	 * __filemap_get_folio() can take a long time if the
>>>>  	 * system is thrashing due to memory pressure, or if the folio
>>>> @@ -1172,7 +1182,7 @@ static int ext4_write_begin(struct file *file, struct address_space *mapping,
>>>>  	 * the folio (if needed) without using GFP_NOFS.
>>>>  	 */
>>>>  retry_grab:
>>>> -	folio = __filemap_get_folio(mapping, index, FGP_WRITEBEGIN,
>>>> +	folio = __filemap_get_folio(mapping, index, fgp_flags,
>>>>  					mapping_gfp_mask(mapping));
>>>>  	if (IS_ERR(folio))
>>>>  		return PTR_ERR(folio);
>>>
>>> JFYI, I notice that ext4 cycles the folio lock here in this path and
>>> thus follows up with a couple checks presumably to accommodate that. One
>>> is whether i_mapping has changed, which I assume means uncached state
>>> would have been handled/cleared externally somewhere..? I.e., if an
>>> uncached folio is somehow truncated/freed without ever having been
>>> written back?
>>>
>>> The next is a folio_wait_stable() call "in case writeback began ..."
>>> It's not immediately clear to me if that is possible here, but taking
>>> that at face value, is it an issue if we were to create an uncached
>>> folio, drop the folio lock, then have some other task dirty and
>>> writeback the folio (due to a sync write or something), then have
>>> writeback completion invalidate the folio before we relock it here?
>>
>> I don't think either of those is an issue. The UNCACHED flag will only
>> be set on a newly created folio, it does not get inherited by folios
>> that already exist.
>>
> 
> Right.. but what I was wondering about for that latter case is when the
> folio is created here by ext4, so uncached is set before it is unlocked.
> 
> On second look I guess the uncached completion invalidation should clear
> mapping and thus trigger the retry logic here. That seems reasonable
> enough, but is it still possible to race with writeback?
> 
> Maybe this is a better way to ask.. what happens if a write completes to
> an uncached folio that is already under writeback? For example, uncached
> write 1 completes, submits for writeback and returns to userspace. Then
> write 2 begins and redirties the same folio before the uncached
> writeback completes.
> 
> If I follow correctly, if write 2 is also uncached, it eventually blocks
> in writeback submission (folio_prepare_writeback() ->
> folio_wait_writeback()). It looks like folio lock is held there, so
> presumably that would bypass the completion time invalidation in
> folio_end_uncached(). But what if write 2 was not uncached or perhaps
> writeback completion won the race for folio lock vs. the write side
> (between locking the folio for dirtying and later for writeback
> submission)? Does anything prevent invalidation of the folio before the
> second write is submitted for writeback?
> 
> IOW, I'm wondering if the uncached completion time invalidation also
> needs a folio dirty check..?

Ah ok, I see what you mean. If the folio is dirty, the unmapping will
fail. But I guess with the recent change, we'll actually unmap it first.
I'll add the folio dirty check, thanks!
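
Roughly, folio_end_uncached() would gain a check along these lines
(sketch - the final form may differ):

	if (in_task() && folio_trylock(folio)) {
		/* skip the invalidation if a racing write redirtied
		 * the folio between submission and completion */
		if (folio->mapping && !folio_test_dirty(folio))
			folio_unmap_invalidate(folio->mapping, folio, 0);
		folio_unlock(folio);
	}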

-- 
Jens Axboe

