* [PATCH 01/12] mm/filemap: change filemap_create_folio() to take a struct kiocb
From: Jens Axboe @ 2024-12-20 15:47 UTC
To: linux-mm, linux-fsdevel
Cc: hannes, clm, linux-kernel, willy, kirill, bfoster, Jens Axboe,
Kirill A. Shutemov
Rather than pass in both the file and position directly from the kiocb,
just take a struct kiocb instead. With the kiocb being passed in, skip
passing in the address_space separately as well. While doing so, move the
ki_flags checking into filemap_create_folio() too, in preparation for
actually needing the kiocb in the function.
No functional changes in this patch.
Reviewed-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
---
mm/filemap.c | 17 +++++++++--------
1 file changed, 9 insertions(+), 8 deletions(-)
diff --git a/mm/filemap.c b/mm/filemap.c
index f61cf51c2238..8b29323b15d7 100644
--- a/mm/filemap.c
+++ b/mm/filemap.c
@@ -2459,15 +2459,17 @@ static int filemap_update_page(struct kiocb *iocb,
return error;
}
-static int filemap_create_folio(struct file *file,
- struct address_space *mapping, loff_t pos,
- struct folio_batch *fbatch)
+static int filemap_create_folio(struct kiocb *iocb, struct folio_batch *fbatch)
{
+ struct address_space *mapping = iocb->ki_filp->f_mapping;
struct folio *folio;
int error;
unsigned int min_order = mapping_min_folio_order(mapping);
pgoff_t index;
+ if (iocb->ki_flags & (IOCB_NOWAIT | IOCB_WAITQ))
+ return -EAGAIN;
+
folio = filemap_alloc_folio(mapping_gfp_mask(mapping), min_order);
if (!folio)
return -ENOMEM;
@@ -2486,7 +2488,7 @@ static int filemap_create_folio(struct file *file,
* well to keep locking rules simple.
*/
filemap_invalidate_lock_shared(mapping);
- index = (pos >> (PAGE_SHIFT + min_order)) << min_order;
+ index = (iocb->ki_pos >> (PAGE_SHIFT + min_order)) << min_order;
error = filemap_add_folio(mapping, folio, index,
mapping_gfp_constraint(mapping, GFP_KERNEL));
if (error == -EEXIST)
@@ -2494,7 +2496,8 @@ static int filemap_create_folio(struct file *file,
if (error)
goto error;
- error = filemap_read_folio(file, mapping->a_ops->read_folio, folio);
+ error = filemap_read_folio(iocb->ki_filp, mapping->a_ops->read_folio,
+ folio);
if (error)
goto error;
@@ -2550,9 +2553,7 @@ static int filemap_get_pages(struct kiocb *iocb, size_t count,
filemap_get_read_batch(mapping, index, last_index - 1, fbatch);
}
if (!folio_batch_count(fbatch)) {
- if (iocb->ki_flags & (IOCB_NOWAIT | IOCB_WAITQ))
- return -EAGAIN;
- err = filemap_create_folio(filp, mapping, iocb->ki_pos, fbatch);
+ err = filemap_create_folio(iocb, fbatch);
if (err == AOP_TRUNCATED_PAGE)
goto retry;
return err;
--
2.45.2
* Re: [PATCH 01/12] mm/filemap: change filemap_create_folio() to take a struct kiocb
From: Matthew Wilcox @ 2024-12-20 16:11 UTC
To: Jens Axboe
Cc: linux-mm, linux-fsdevel, hannes, clm, linux-kernel, kirill,
bfoster, Kirill A. Shutemov
On Fri, Dec 20, 2024 at 08:47:39AM -0700, Jens Axboe wrote:
> Rather than pass in both the file and position directly from the kiocb,
> just take a struct kiocb instead. With the kiocb being passed in, skip
> passing in the address_space separately as well. While doing so, move the
> ki_flags checking into filemap_create_folio() too, in preparation for
> actually needing the kiocb in the function.
>
> No functional changes in this patch.
>
> Reviewed-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
> Signed-off-by: Jens Axboe <axboe@kernel.dk>
Reviewed-by: Matthew Wilcox (Oracle) <willy@infradead.org>
* [PATCH 02/12] mm/filemap: use page_cache_sync_ra() to kick off read-ahead
From: Jens Axboe @ 2024-12-20 15:47 UTC
To: linux-mm, linux-fsdevel
Cc: hannes, clm, linux-kernel, willy, kirill, bfoster, Jens Axboe,
Kirill A. Shutemov, Christoph Hellwig
Rather than use the page_cache_sync_readahead() helper, define our own
ractl and use page_cache_sync_ra() directly, in preparation for needing
to modify ractl inside filemap_get_pages().
No functional changes in this patch.
Reviewed-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
---
mm/filemap.c | 6 +++---
1 file changed, 3 insertions(+), 3 deletions(-)
diff --git a/mm/filemap.c b/mm/filemap.c
index 8b29323b15d7..220dc7c6e12f 100644
--- a/mm/filemap.c
+++ b/mm/filemap.c
@@ -2527,7 +2527,6 @@ static int filemap_get_pages(struct kiocb *iocb, size_t count,
{
struct file *filp = iocb->ki_filp;
struct address_space *mapping = filp->f_mapping;
- struct file_ra_state *ra = &filp->f_ra;
pgoff_t index = iocb->ki_pos >> PAGE_SHIFT;
pgoff_t last_index;
struct folio *folio;
@@ -2542,12 +2541,13 @@ static int filemap_get_pages(struct kiocb *iocb, size_t count,
filemap_get_read_batch(mapping, index, last_index - 1, fbatch);
if (!folio_batch_count(fbatch)) {
+ DEFINE_READAHEAD(ractl, filp, &filp->f_ra, mapping, index);
+
if (iocb->ki_flags & IOCB_NOIO)
return -EAGAIN;
if (iocb->ki_flags & IOCB_NOWAIT)
flags = memalloc_noio_save();
- page_cache_sync_readahead(mapping, ra, filp, index,
- last_index - index);
+ page_cache_sync_ra(&ractl, last_index - index);
if (iocb->ki_flags & IOCB_NOWAIT)
memalloc_noio_restore(flags);
filemap_get_read_batch(mapping, index, last_index - 1, fbatch);
--
2.45.2
* Re: [PATCH 02/12] mm/filemap: use page_cache_sync_ra() to kick off read-ahead
From: Matthew Wilcox @ 2024-12-20 16:12 UTC
To: Jens Axboe
Cc: linux-mm, linux-fsdevel, hannes, clm, linux-kernel, kirill,
bfoster, Kirill A. Shutemov, Christoph Hellwig
On Fri, Dec 20, 2024 at 08:47:40AM -0700, Jens Axboe wrote:
> Rather than use the page_cache_sync_readahead() helper, define our own
> ractl and use page_cache_sync_ra() directly, in preparation for needing
> to modify ractl inside filemap_get_pages().
>
> No functional changes in this patch.
>
> Reviewed-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
> Reviewed-by: Christoph Hellwig <hch@lst.de>
> Signed-off-by: Jens Axboe <axboe@kernel.dk>
Reviewed-by: Matthew Wilcox (Oracle) <willy@infradead.org>
* [PATCH 03/12] mm/readahead: add folio allocation helper
From: Jens Axboe @ 2024-12-20 15:47 UTC
To: linux-mm, linux-fsdevel
Cc: hannes, clm, linux-kernel, willy, kirill, bfoster, Jens Axboe,
Kirill A. Shutemov
Just a wrapper around filemap_alloc_folio() for now, but add it in
preparation for modifying the folio based on the 'ractl' being passed
in.
No functional changes in this patch.
Reviewed-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
---
mm/readahead.c | 16 +++++++++++-----
1 file changed, 11 insertions(+), 5 deletions(-)
diff --git a/mm/readahead.c b/mm/readahead.c
index ea650b8b02fb..8a62ad4106ff 100644
--- a/mm/readahead.c
+++ b/mm/readahead.c
@@ -188,6 +188,12 @@ static void read_pages(struct readahead_control *rac)
BUG_ON(readahead_count(rac));
}
+static struct folio *ractl_alloc_folio(struct readahead_control *ractl,
+ gfp_t gfp_mask, unsigned int order)
+{
+ return filemap_alloc_folio(gfp_mask, order);
+}
+
/**
* page_cache_ra_unbounded - Start unchecked readahead.
* @ractl: Readahead control.
@@ -265,8 +271,8 @@ void page_cache_ra_unbounded(struct readahead_control *ractl,
continue;
}
- folio = filemap_alloc_folio(gfp_mask,
- mapping_min_folio_order(mapping));
+ folio = ractl_alloc_folio(ractl, gfp_mask,
+ mapping_min_folio_order(mapping));
if (!folio)
break;
@@ -436,7 +442,7 @@ static inline int ra_alloc_folio(struct readahead_control *ractl, pgoff_t index,
pgoff_t mark, unsigned int order, gfp_t gfp)
{
int err;
- struct folio *folio = filemap_alloc_folio(gfp, order);
+ struct folio *folio = ractl_alloc_folio(ractl, gfp, order);
if (!folio)
return -ENOMEM;
@@ -750,7 +756,7 @@ void readahead_expand(struct readahead_control *ractl,
if (folio && !xa_is_value(folio))
return; /* Folio apparently present */
- folio = filemap_alloc_folio(gfp_mask, min_order);
+ folio = ractl_alloc_folio(ractl, gfp_mask, min_order);
if (!folio)
return;
@@ -779,7 +785,7 @@ void readahead_expand(struct readahead_control *ractl,
if (folio && !xa_is_value(folio))
return; /* Folio apparently present */
- folio = filemap_alloc_folio(gfp_mask, min_order);
+ folio = ractl_alloc_folio(ractl, gfp_mask, min_order);
if (!folio)
return;
--
2.45.2
* Re: [PATCH 03/12] mm/readahead: add folio allocation helper
From: Matthew Wilcox @ 2024-12-20 16:12 UTC
To: Jens Axboe
Cc: linux-mm, linux-fsdevel, hannes, clm, linux-kernel, kirill,
bfoster, Kirill A. Shutemov
On Fri, Dec 20, 2024 at 08:47:41AM -0700, Jens Axboe wrote:
> Just a wrapper around filemap_alloc_folio() for now, but add it in
> preparation for modifying the folio based on the 'ractl' being passed
> in.
>
> No functional changes in this patch.
>
> Reviewed-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
> Signed-off-by: Jens Axboe <axboe@kernel.dk>
Reviewed-by: Matthew Wilcox (Oracle) <willy@infradead.org>
* [PATCH 04/12] mm: add PG_dropbehind folio flag
From: Jens Axboe @ 2024-12-20 15:47 UTC
To: linux-mm, linux-fsdevel
Cc: hannes, clm, linux-kernel, willy, kirill, bfoster, Jens Axboe,
Kirill A. Shutemov
Add a folio flag that file IO can use to indicate that the cached IO
being done should be dropped from the page cache upon completion.
Reviewed-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
---
include/linux/page-flags.h | 5 +++++
include/trace/events/mmflags.h | 3 ++-
2 files changed, 7 insertions(+), 1 deletion(-)
diff --git a/include/linux/page-flags.h b/include/linux/page-flags.h
index cf46ac720802..16607f02abd0 100644
--- a/include/linux/page-flags.h
+++ b/include/linux/page-flags.h
@@ -110,6 +110,7 @@ enum pageflags {
PG_reclaim, /* To be reclaimed asap */
PG_swapbacked, /* Page is backed by RAM/swap */
PG_unevictable, /* Page is "unevictable" */
+ PG_dropbehind, /* drop pages on IO completion */
#ifdef CONFIG_MMU
PG_mlocked, /* Page is vma mlocked */
#endif
@@ -562,6 +563,10 @@ PAGEFLAG(Reclaim, reclaim, PF_NO_TAIL)
FOLIO_FLAG(readahead, FOLIO_HEAD_PAGE)
FOLIO_TEST_CLEAR_FLAG(readahead, FOLIO_HEAD_PAGE)
+FOLIO_FLAG(dropbehind, FOLIO_HEAD_PAGE)
+ FOLIO_TEST_CLEAR_FLAG(dropbehind, FOLIO_HEAD_PAGE)
+ __FOLIO_SET_FLAG(dropbehind, FOLIO_HEAD_PAGE)
+
#ifdef CONFIG_HIGHMEM
/*
* Must use a macro here due to header dependency issues. page_zone() is not
diff --git a/include/trace/events/mmflags.h b/include/trace/events/mmflags.h
index bb8a59c6caa2..3bc8656c8359 100644
--- a/include/trace/events/mmflags.h
+++ b/include/trace/events/mmflags.h
@@ -116,7 +116,8 @@
DEF_PAGEFLAG_NAME(head), \
DEF_PAGEFLAG_NAME(reclaim), \
DEF_PAGEFLAG_NAME(swapbacked), \
- DEF_PAGEFLAG_NAME(unevictable) \
+ DEF_PAGEFLAG_NAME(unevictable), \
+ DEF_PAGEFLAG_NAME(dropbehind) \
IF_HAVE_PG_MLOCK(mlocked) \
IF_HAVE_PG_HWPOISON(hwpoison) \
IF_HAVE_PG_IDLE(idle) \
--
2.45.2
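The two macro hunks above generate the accessors that the rest of the series
relies on. A sketch of the resulting API, assuming the standard page-flags
macro expansion (the prototypes are inferred from the macro conventions, not
spelled out in the patch):

    /* from FOLIO_FLAG(dropbehind, FOLIO_HEAD_PAGE): */
    bool folio_test_dropbehind(const struct folio *folio);
    void folio_set_dropbehind(struct folio *folio);
    void folio_clear_dropbehind(struct folio *folio);

    /* from FOLIO_TEST_CLEAR_FLAG(dropbehind, FOLIO_HEAD_PAGE): */
    bool folio_test_clear_dropbehind(struct folio *folio);

    /* from __FOLIO_SET_FLAG(dropbehind, FOLIO_HEAD_PAGE):
     * non-atomic set, safe only while the folio is not yet visible */
    void __folio_set_dropbehind(struct folio *folio);

Later patches in the series use __folio_set_dropbehind() at folio creation
time and folio_test_clear_dropbehind() at writeback completion.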
* [PATCH 05/12] mm/readahead: add readahead_control->dropbehind member
From: Jens Axboe @ 2024-12-20 15:47 UTC
To: linux-mm, linux-fsdevel
Cc: hannes, clm, linux-kernel, willy, kirill, bfoster, Jens Axboe,
Kirill A. Shutemov
If ractl->dropbehind is set to true, then folios created are marked as
dropbehind as well.
Reviewed-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
---
include/linux/pagemap.h | 1 +
mm/readahead.c | 8 +++++++-
2 files changed, 8 insertions(+), 1 deletion(-)
diff --git a/include/linux/pagemap.h b/include/linux/pagemap.h
index bcf0865a38ae..5da4b6d42fae 100644
--- a/include/linux/pagemap.h
+++ b/include/linux/pagemap.h
@@ -1353,6 +1353,7 @@ struct readahead_control {
pgoff_t _index;
unsigned int _nr_pages;
unsigned int _batch_count;
+ bool dropbehind;
bool _workingset;
unsigned long _pflags;
};
diff --git a/mm/readahead.c b/mm/readahead.c
index 8a62ad4106ff..c0a6dc5d5686 100644
--- a/mm/readahead.c
+++ b/mm/readahead.c
@@ -191,7 +191,13 @@ static void read_pages(struct readahead_control *rac)
static struct folio *ractl_alloc_folio(struct readahead_control *ractl,
gfp_t gfp_mask, unsigned int order)
{
- return filemap_alloc_folio(gfp_mask, order);
+ struct folio *folio;
+
+ folio = filemap_alloc_folio(gfp_mask, order);
+ if (folio && ractl->dropbehind)
+ __folio_set_dropbehind(folio);
+
+ return folio;
}
/**
--
2.45.2
* [PATCH 06/12] mm/truncate: add folio_unmap_invalidate() helper
From: Jens Axboe @ 2024-12-20 15:47 UTC
To: linux-mm, linux-fsdevel
Cc: hannes, clm, linux-kernel, willy, kirill, bfoster, Jens Axboe
Add a folio_unmap_invalidate() helper, which unmaps and invalidates a
given folio. The caller must already have locked the folio. Embed the
old invalidate_complete_folio2() helper in there as well, as nobody else
calls it.
Use this new helper in invalidate_inode_pages2_range(), rather than
duplicate the code there.
In preparation for using this elsewhere as well, have it take a gfp_t
mask rather than assume GFP_KERNEL is the right choice. This bubbles
back to invalidate_complete_folio2() as well.
Signed-off-by: Jens Axboe <axboe@kernel.dk>
---
mm/internal.h | 2 ++
mm/truncate.c | 53 +++++++++++++++++++++++++++------------------------
2 files changed, 30 insertions(+), 25 deletions(-)
diff --git a/mm/internal.h b/mm/internal.h
index cb8d8e8e3ffa..ed3c3690eb03 100644
--- a/mm/internal.h
+++ b/mm/internal.h
@@ -392,6 +392,8 @@ void unmap_page_range(struct mmu_gather *tlb,
struct vm_area_struct *vma,
unsigned long addr, unsigned long end,
struct zap_details *details);
+int folio_unmap_invalidate(struct address_space *mapping, struct folio *folio,
+ gfp_t gfp);
void page_cache_ra_order(struct readahead_control *, struct file_ra_state *,
unsigned int order);
diff --git a/mm/truncate.c b/mm/truncate.c
index 7c304d2f0052..e2e115adfbc5 100644
--- a/mm/truncate.c
+++ b/mm/truncate.c
@@ -525,6 +525,15 @@ unsigned long invalidate_mapping_pages(struct address_space *mapping,
}
EXPORT_SYMBOL(invalidate_mapping_pages);
+static int folio_launder(struct address_space *mapping, struct folio *folio)
+{
+ if (!folio_test_dirty(folio))
+ return 0;
+ if (folio->mapping != mapping || mapping->a_ops->launder_folio == NULL)
+ return 0;
+ return mapping->a_ops->launder_folio(folio);
+}
+
/*
* This is like mapping_evict_folio(), except it ignores the folio's
* refcount. We do this because invalidate_inode_pages2() needs stronger
@@ -532,14 +541,26 @@ EXPORT_SYMBOL(invalidate_mapping_pages);
* shrink_folio_list() has a temp ref on them, or because they're transiently
* sitting in the folio_add_lru() caches.
*/
-static int invalidate_complete_folio2(struct address_space *mapping,
- struct folio *folio)
+int folio_unmap_invalidate(struct address_space *mapping, struct folio *folio,
+ gfp_t gfp)
{
- if (folio->mapping != mapping)
- return 0;
+ int ret;
+
+ VM_BUG_ON_FOLIO(!folio_test_locked(folio), folio);
- if (!filemap_release_folio(folio, GFP_KERNEL))
+ if (folio_test_dirty(folio))
return 0;
+ if (folio_mapped(folio))
+ unmap_mapping_folio(folio);
+ BUG_ON(folio_mapped(folio));
+
+ ret = folio_launder(mapping, folio);
+ if (ret)
+ return ret;
+ if (folio->mapping != mapping)
+ return -EBUSY;
+ if (!filemap_release_folio(folio, gfp))
+ return -EBUSY;
spin_lock(&mapping->host->i_lock);
xa_lock_irq(&mapping->i_pages);
@@ -558,16 +579,7 @@ static int invalidate_complete_folio2(struct address_space *mapping,
failed:
xa_unlock_irq(&mapping->i_pages);
spin_unlock(&mapping->host->i_lock);
- return 0;
-}
-
-static int folio_launder(struct address_space *mapping, struct folio *folio)
-{
- if (!folio_test_dirty(folio))
- return 0;
- if (folio->mapping != mapping || mapping->a_ops->launder_folio == NULL)
- return 0;
- return mapping->a_ops->launder_folio(folio);
+ return -EBUSY;
}
/**
@@ -631,16 +643,7 @@ int invalidate_inode_pages2_range(struct address_space *mapping,
}
VM_BUG_ON_FOLIO(!folio_contains(folio, indices[i]), folio);
folio_wait_writeback(folio);
-
- if (folio_mapped(folio))
- unmap_mapping_folio(folio);
- BUG_ON(folio_mapped(folio));
-
- ret2 = folio_launder(mapping, folio);
- if (ret2 == 0) {
- if (!invalidate_complete_folio2(mapping, folio))
- ret2 = -EBUSY;
- }
+ ret2 = folio_unmap_invalidate(mapping, folio, GFP_KERNEL);
if (ret2 < 0)
ret = ret2;
folio_unlock(folio);
--
2.45.2
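The last hunk above also captures the helper's contract as used by the
series: the caller holds the folio lock, a dirty folio is skipped with a 0
return, and -EBUSY means the folio could not be released or is no longer
attached to this mapping. A condensed sketch of the calling pattern,
mirroring the invalidate_inode_pages2_range() hunk rather than new API:

    folio_lock(folio);
    folio_wait_writeback(folio);
    ret = folio_unmap_invalidate(mapping, folio, GFP_KERNEL);
    /* ret == 0: invalidated, or skipped because the folio is dirty;
     * ret < 0: -EBUSY etc., caller decides whether to retry */
    folio_unlock(folio);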
* Re: [PATCH 06/12] mm/truncate: add folio_unmap_invalidate() helper
From: Matthew Wilcox @ 2024-12-20 16:21 UTC
To: Jens Axboe
Cc: linux-mm, linux-fsdevel, hannes, clm, linux-kernel, kirill, bfoster
On Fri, Dec 20, 2024 at 08:47:44AM -0700, Jens Axboe wrote:
> +int folio_unmap_invalidate(struct address_space *mapping, struct folio *folio,
> + gfp_t gfp)
> {
> - if (folio->mapping != mapping)
> - return 0;
> + int ret;
> +
> + VM_BUG_ON_FOLIO(!folio_test_locked(folio), folio);
>
> - if (!filemap_release_folio(folio, GFP_KERNEL))
> + if (folio_test_dirty(folio))
> return 0;
> + if (folio_mapped(folio))
> + unmap_mapping_folio(folio);
> + BUG_ON(folio_mapped(folio));
> +
> + ret = folio_launder(mapping, folio);
> + if (ret)
> + return ret;
> + if (folio->mapping != mapping)
> + return -EBUSY;
The position of this test confuses me. Usually we want to test
folio->mapping early on, since if the folio is no longer part of this
file, we want to stop doing things to it, rather than go to the trouble
of unmapping it. Also, why do we want to return -EBUSY in this case?
If the folio is no longer part of this file, it has been successfully
removed from this file, right?
* Re: [PATCH 06/12] mm/truncate: add folio_unmap_invalidate() helper
From: Jens Axboe @ 2024-12-20 16:28 UTC
To: Matthew Wilcox
Cc: linux-mm, linux-fsdevel, hannes, clm, linux-kernel, kirill, bfoster
On 12/20/24 9:21 AM, Matthew Wilcox wrote:
> On Fri, Dec 20, 2024 at 08:47:44AM -0700, Jens Axboe wrote:
>> +int folio_unmap_invalidate(struct address_space *mapping, struct folio *folio,
>> + gfp_t gfp)
>> {
>> - if (folio->mapping != mapping)
>> - return 0;
>> + int ret;
>> +
>> + VM_BUG_ON_FOLIO(!folio_test_locked(folio), folio);
>>
>> - if (!filemap_release_folio(folio, GFP_KERNEL))
>> + if (folio_test_dirty(folio))
>> return 0;
>> + if (folio_mapped(folio))
>> + unmap_mapping_folio(folio);
>> + BUG_ON(folio_mapped(folio));
>> +
>> + ret = folio_launder(mapping, folio);
>> + if (ret)
>> + return ret;
>> + if (folio->mapping != mapping)
>> + return -EBUSY;
>
> The position of this test confuses me. Usually we want to test
> folio->mapping early on, since if the folio is no longer part of this
> file, we want to stop doing things to it, rather than go to the trouble
> of unmapping it. Also, why do we want to return -EBUSY in this case?
> If the folio is no longer part of this file, it has been successfully
> removed from this file, right?
It's simply doing what the code did before. I do agree the mapping check
is a bit odd at that point, but that's how
invalidate_inode_pages2_range() and folio_launder() were set up. We can
certainly clean that up after the merge of these helpers, but I didn't
want to introduce any potential changes with this merge.
-EBUSY matches what a 0 return from those two helpers meant before.
--
Jens Axboe
* Re: [PATCH 06/12] mm/truncate: add folio_unmap_invalidate() helper
From: Jens Axboe @ 2025-01-02 20:12 UTC
To: Matthew Wilcox
Cc: linux-mm, linux-fsdevel, hannes, clm, linux-kernel, kirill, bfoster
On 12/20/24 9:28 AM, Jens Axboe wrote:
> On 12/20/24 9:21 AM, Matthew Wilcox wrote:
>> On Fri, Dec 20, 2024 at 08:47:44AM -0700, Jens Axboe wrote:
>>> +int folio_unmap_invalidate(struct address_space *mapping, struct folio *folio,
>>> + gfp_t gfp)
>>> {
>>> - if (folio->mapping != mapping)
>>> - return 0;
>>> + int ret;
>>> +
>>> + VM_BUG_ON_FOLIO(!folio_test_locked(folio), folio);
>>>
>>> - if (!filemap_release_folio(folio, GFP_KERNEL))
>>> + if (folio_test_dirty(folio))
>>> return 0;
>>> + if (folio_mapped(folio))
>>> + unmap_mapping_folio(folio);
>>> + BUG_ON(folio_mapped(folio));
>>> +
>>> + ret = folio_launder(mapping, folio);
>>> + if (ret)
>>> + return ret;
>>> + if (folio->mapping != mapping)
>>> + return -EBUSY;
>>
>> The position of this test confuses me. Usually we want to test
>> folio->mapping early on, since if the folio is no longer part of this
>> file, we want to stop doing things to it, rather than go to the trouble
>> of unmapping it. Also, why do we want to return -EBUSY in this case?
>> If the folio is no longer part of this file, it has been successfully
>> removed from this file, right?
>
> It's simply doing what the code did before. I do agree the mapping check
> is a bit odd at that point, but that's how
> invalidate_inode_pages2_range() and folio_launder() were set up. We can
> certainly clean that up after the merge of these helpers, but I didn't
> want to introduce any potential changes with this merge.
>
> -EBUSY matches what a 0 return from those two helpers meant before.
Any further concerns with this? Trying to nudge this patchset forward...
It's not like there's a lot of time left for 6.14.
--
Jens Axboe
* [PATCH 07/12] fs: add RWF_DONTCACHE iocb and FOP_DONTCACHE file_operations flag
From: Jens Axboe @ 2024-12-20 15:47 UTC
To: linux-mm, linux-fsdevel
Cc: hannes, clm, linux-kernel, willy, kirill, bfoster, Jens Axboe
If a file system supports uncached buffered IO, it may set FOP_DONTCACHE
and enable support for RWF_DONTCACHE. If RWF_DONTCACHE is attempted
without the file system supporting it, it'll fail with -EOPNOTSUPP.
Signed-off-by: Jens Axboe <axboe@kernel.dk>
---
include/linux/fs.h | 14 +++++++++++++-
include/uapi/linux/fs.h | 6 +++++-
2 files changed, 18 insertions(+), 2 deletions(-)
diff --git a/include/linux/fs.h b/include/linux/fs.h
index 7e29433c5ecc..6a838b5479a6 100644
--- a/include/linux/fs.h
+++ b/include/linux/fs.h
@@ -322,6 +322,7 @@ struct readahead_control;
#define IOCB_NOWAIT (__force int) RWF_NOWAIT
#define IOCB_APPEND (__force int) RWF_APPEND
#define IOCB_ATOMIC (__force int) RWF_ATOMIC
+#define IOCB_DONTCACHE (__force int) RWF_DONTCACHE
/* non-RWF related bits - start at 16 */
#define IOCB_EVENTFD (1 << 16)
@@ -356,7 +357,8 @@ struct readahead_control;
{ IOCB_SYNC, "SYNC" }, \
{ IOCB_NOWAIT, "NOWAIT" }, \
{ IOCB_APPEND, "APPEND" }, \
- { IOCB_ATOMIC, "ATOMIC"}, \
+ { IOCB_ATOMIC, "ATOMIC" }, \
+ { IOCB_DONTCACHE, "DONTCACHE" }, \
{ IOCB_EVENTFD, "EVENTFD"}, \
{ IOCB_DIRECT, "DIRECT" }, \
{ IOCB_WRITE, "WRITE" }, \
@@ -2127,6 +2129,8 @@ struct file_operations {
#define FOP_UNSIGNED_OFFSET ((__force fop_flags_t)(1 << 5))
/* Supports asynchronous lock callbacks */
#define FOP_ASYNC_LOCK ((__force fop_flags_t)(1 << 6))
+/* File system supports uncached read/write buffered IO */
+#define FOP_DONTCACHE ((__force fop_flags_t)(1 << 7))
/* Wrap a directory iterator that needs exclusive inode access */
int wrap_directory_iterator(struct file *, struct dir_context *,
@@ -3614,6 +3618,14 @@ static inline int kiocb_set_rw_flags(struct kiocb *ki, rwf_t flags,
if (!(ki->ki_filp->f_mode & FMODE_CAN_ATOMIC_WRITE))
return -EOPNOTSUPP;
}
+ if (flags & RWF_DONTCACHE) {
+ /* file system must support it */
+ if (!(ki->ki_filp->f_op->fop_flags & FOP_DONTCACHE))
+ return -EOPNOTSUPP;
+ /* DAX mappings not supported */
+ if (IS_DAX(ki->ki_filp->f_mapping->host))
+ return -EOPNOTSUPP;
+ }
kiocb_flags |= (__force int) (flags & RWF_SUPPORTED);
if (flags & RWF_SYNC)
kiocb_flags |= IOCB_DSYNC;
diff --git a/include/uapi/linux/fs.h b/include/uapi/linux/fs.h
index 753971770733..56a4f93a08f4 100644
--- a/include/uapi/linux/fs.h
+++ b/include/uapi/linux/fs.h
@@ -332,9 +332,13 @@ typedef int __bitwise __kernel_rwf_t;
/* Atomic Write */
#define RWF_ATOMIC ((__force __kernel_rwf_t)0x00000040)
+/* buffered IO that drops the cache after reading or writing data */
+#define RWF_DONTCACHE ((__force __kernel_rwf_t)0x00000080)
+
/* mask of flags supported by the kernel */
#define RWF_SUPPORTED (RWF_HIPRI | RWF_DSYNC | RWF_SYNC | RWF_NOWAIT |\
- RWF_APPEND | RWF_NOAPPEND | RWF_ATOMIC)
+ RWF_APPEND | RWF_NOAPPEND | RWF_ATOMIC |\
+ RWF_DONTCACHE)
#define PROCFS_IOCTL_MAGIC 'f'
--
2.45.2
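Since RWF_DONTCACHE is a per-call flag, userspace reaches it through
preadv2()/pwritev2(). A minimal, hedged example of an uncached buffered
write; the #define fallback uses the value from the uapi hunk above and is
only needed until libc headers carry the flag, and the file name and buffer
size are arbitrary:

    #define _GNU_SOURCE
    #include <fcntl.h>
    #include <stdio.h>
    #include <string.h>
    #include <sys/uio.h>
    #include <unistd.h>

    #ifndef RWF_DONTCACHE
    #define RWF_DONTCACHE 0x00000080 /* matches the uapi addition above */
    #endif

    int main(void)
    {
        char buf[65536];
        struct iovec iov = { .iov_base = buf, .iov_len = sizeof(buf) };
        int fd = open("testfile", O_WRONLY | O_CREAT, 0644);

        if (fd < 0)
            return 1;
        memset(buf, 0xa5, sizeof(buf));
        /* buffered write whose folios get dropped on completion */
        if (pwritev2(fd, &iov, 1, 0, RWF_DONTCACHE) < 0)
            perror("pwritev2"); /* EOPNOTSUPP without FOP_DONTCACHE */
        close(fd);
        return 0;
    }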
* Re: (subset) [PATCH 07/12] fs: add RWF_DONTCACHE iocb and FOP_DONTCACHE file_operations flag
From: Christian Brauner @ 2025-01-04 8:39 UTC
To: linux-mm, linux-fsdevel, Jens Axboe
Cc: Christian Brauner, hannes, clm, linux-kernel, willy, kirill, bfoster
On Fri, 20 Dec 2024 08:47:45 -0700, Jens Axboe wrote:
> If a file system supports uncached buffered IO, it may set FOP_DONTCACHE
> and enable support for RWF_DONTCACHE. If RWF_DONTCACHE is attempted
> without the file system supporting it, it'll fail with -EOPNOTSUPP.
>
>
Jens, you can use this as a stable branch for the VFS changes.
---
Applied to the vfs-6.14.uncached_buffered_io branch of the vfs/vfs.git tree.
Patches in the vfs-6.14.uncached_buffered_io branch should appear in linux-next soon.
Please report any outstanding bugs that were missed during review in a
new review to the original patch series allowing us to drop it.
It's encouraged to provide Acked-bys and Reviewed-bys even though the
patch has now been applied. If possible patch trailers will be updated.
Note that commit hashes shown below are subject to change due to rebase,
trailer updates or similar. If in doubt, please check the listed branch.
tree: https://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs.git
branch: vfs-6.14.uncached_buffered_io
[07/12] fs: add RWF_DONTCACHE iocb and FOP_DONTCACHE file_operations flag
https://git.kernel.org/vfs/vfs/c/af6505e5745b
* Re: (subset) [PATCH 07/12] fs: add RWF_DONTCACHE iocb and FOP_DONTCACHE file_operations flag
From: Jens Axboe @ 2025-01-06 15:44 UTC
To: Christian Brauner, linux-mm, linux-fsdevel
Cc: hannes, clm, linux-kernel, willy, kirill, bfoster
On 1/4/25 1:39 AM, Christian Brauner wrote:
> On Fri, 20 Dec 2024 08:47:45 -0700, Jens Axboe wrote:
>> If a file system supports uncached buffered IO, it may set FOP_DONTCACHE
>> and enable support for RWF_DONTCACHE. If RWF_DONTCACHE is attempted
>> without the file system supporting it, it'll fail with -EOPNOTSUPP.
>>
>>
>
> Jens, you can use this as a stable branch for the VFS changes.
Thanks, but not sure it's super useful for a patch in the middle. It
just adds a dependency, and I'd need to rebase. Providing an ack
or a Reviewed-by would probably be more useful.
But given how slow the mm side is going, honestly not sure what the
target or plan is for inclusion at this point in time.
--
Jens Axboe
* [PATCH 08/12] mm/filemap: add read support for RWF_DONTCACHE
From: Jens Axboe @ 2024-12-20 15:47 UTC
To: linux-mm, linux-fsdevel
Cc: hannes, clm, linux-kernel, willy, kirill, bfoster, Jens Axboe
Add RWF_DONTCACHE as a read operation flag, which means that any data
read will be removed from the page cache upon completion. Uses the page
cache to synchronize, and simply prunes folios that were instantiated
when the operation completes. While it would be possible to use private
pages for this, using the page cache as synchronization is handy for a
variety of reasons:
1) No special truncate magic is needed
2) Async buffered reads need some place to serialize, using the page
cache is a lot easier than writing extra code for this
3) The pruning cost is pretty reasonable
and the code to support this is much simpler as a result.
You can think of uncached buffered IO as being the much more attractive
cousin of O_DIRECT - it has none of the restrictions of O_DIRECT. Yes,
it will copy the data, but unlike regular buffered IO, it doesn't run
into the unpredictability of the page cache in terms of reclaim. As an
example, on a test box with 32 drives, reading them with buffered IO
looks as follows:
Reading bs 65536, uncached 0
1s: 145945MB/sec
2s: 158067MB/sec
3s: 157007MB/sec
4s: 148622MB/sec
5s: 118824MB/sec
6s: 70494MB/sec
7s: 41754MB/sec
8s: 90811MB/sec
9s: 92204MB/sec
10s: 95178MB/sec
11s: 95488MB/sec
12s: 95552MB/sec
13s: 96275MB/sec
where it's quite easy to see where the page cache filled up, and
performance went from good to erratic, finally settling at a much
lower rate. Looking at top while this is ongoing, we see:
PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND
7535 root 20 0 267004 0 0 S 3199 0.0 8:40.65 uncached
3326 root 20 0 0 0 0 R 100.0 0.0 0:16.40 kswapd4
3327 root 20 0 0 0 0 R 100.0 0.0 0:17.22 kswapd5
3328 root 20 0 0 0 0 R 100.0 0.0 0:13.29 kswapd6
3332 root 20 0 0 0 0 R 100.0 0.0 0:11.11 kswapd10
3339 root 20 0 0 0 0 R 100.0 0.0 0:16.25 kswapd17
3348 root 20 0 0 0 0 R 100.0 0.0 0:16.40 kswapd26
3343 root 20 0 0 0 0 R 100.0 0.0 0:16.30 kswapd21
3344 root 20 0 0 0 0 R 100.0 0.0 0:11.92 kswapd22
3349 root 20 0 0 0 0 R 100.0 0.0 0:16.28 kswapd27
3352 root 20 0 0 0 0 R 99.7 0.0 0:11.89 kswapd30
3353 root 20 0 0 0 0 R 96.7 0.0 0:16.04 kswapd31
3329 root 20 0 0 0 0 R 96.4 0.0 0:11.41 kswapd7
3345 root 20 0 0 0 0 R 96.4 0.0 0:13.40 kswapd23
3330 root 20 0 0 0 0 S 91.1 0.0 0:08.28 kswapd8
3350 root 20 0 0 0 0 S 86.8 0.0 0:11.13 kswapd28
3325 root 20 0 0 0 0 S 76.3 0.0 0:07.43 kswapd3
3341 root 20 0 0 0 0 S 74.7 0.0 0:08.85 kswapd19
3334 root 20 0 0 0 0 S 71.7 0.0 0:10.04 kswapd12
3351 root 20 0 0 0 0 R 60.5 0.0 0:09.59 kswapd29
3323 root 20 0 0 0 0 R 57.6 0.0 0:11.50 kswapd1
[...]
which is just showing a partial list of the 32 kswapd threads that are
running mostly full tilt, burning ~28 full CPU cores.
If the same test case is run with RWF_DONTCACHE set for the buffered read,
the output looks as follows:
Reading bs 65536, uncached 1
1s: 153144MB/sec
2s: 156760MB/sec
3s: 158110MB/sec
4s: 158009MB/sec
5s: 158043MB/sec
6s: 157638MB/sec
7s: 157999MB/sec
8s: 158024MB/sec
9s: 157764MB/sec
10s: 157477MB/sec
11s: 157417MB/sec
12s: 157455MB/sec
13s: 157233MB/sec
14s: 156692MB/sec
which is just chugging along at ~155GB/sec of read performance. Looking
at top, we see:
PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND
7961 root 20 0 267004 0 0 S 3180 0.0 5:37.95 uncached
8024 axboe 20 0 14292 4096 0 R 1.0 0.0 0:00.13 top
where just the test app is using CPU; no reclaim is taking place outside
of the main thread. Not only is performance 65% better, it's also using
half the CPU to do it.
Signed-off-by: Jens Axboe <axboe@kernel.dk>
---
mm/filemap.c | 28 ++++++++++++++++++++++++++--
mm/swap.c | 2 ++
2 files changed, 28 insertions(+), 2 deletions(-)
diff --git a/mm/filemap.c b/mm/filemap.c
index 220dc7c6e12f..dd563208d09d 100644
--- a/mm/filemap.c
+++ b/mm/filemap.c
@@ -2473,6 +2473,8 @@ static int filemap_create_folio(struct kiocb *iocb, struct folio_batch *fbatch)
folio = filemap_alloc_folio(mapping_gfp_mask(mapping), min_order);
if (!folio)
return -ENOMEM;
+ if (iocb->ki_flags & IOCB_DONTCACHE)
+ __folio_set_dropbehind(folio);
/*
* Protect against truncate / hole punch. Grabbing invalidate_lock
@@ -2518,6 +2520,8 @@ static int filemap_readahead(struct kiocb *iocb, struct file *file,
if (iocb->ki_flags & IOCB_NOIO)
return -EAGAIN;
+ if (iocb->ki_flags & IOCB_DONTCACHE)
+ ractl.dropbehind = 1;
page_cache_async_ra(&ractl, folio, last_index - folio->index);
return 0;
}
@@ -2547,6 +2551,8 @@ static int filemap_get_pages(struct kiocb *iocb, size_t count,
return -EAGAIN;
if (iocb->ki_flags & IOCB_NOWAIT)
flags = memalloc_noio_save();
+ if (iocb->ki_flags & IOCB_DONTCACHE)
+ ractl.dropbehind = 1;
page_cache_sync_ra(&ractl, last_index - index);
if (iocb->ki_flags & IOCB_NOWAIT)
memalloc_noio_restore(flags);
@@ -2594,6 +2600,20 @@ static inline bool pos_same_folio(loff_t pos1, loff_t pos2, struct folio *folio)
return (pos1 >> shift == pos2 >> shift);
}
+static void filemap_end_dropbehind_read(struct address_space *mapping,
+ struct folio *folio)
+{
+ if (!folio_test_dropbehind(folio))
+ return;
+ if (folio_test_writeback(folio) || folio_test_dirty(folio))
+ return;
+ if (folio_trylock(folio)) {
+ if (folio_test_clear_dropbehind(folio))
+ folio_unmap_invalidate(mapping, folio, 0);
+ folio_unlock(folio);
+ }
+}
+
/**
* filemap_read - Read data from the page cache.
* @iocb: The iocb to read.
@@ -2707,8 +2727,12 @@ ssize_t filemap_read(struct kiocb *iocb, struct iov_iter *iter,
}
}
put_folios:
- for (i = 0; i < folio_batch_count(&fbatch); i++)
- folio_put(fbatch.folios[i]);
+ for (i = 0; i < folio_batch_count(&fbatch); i++) {
+ struct folio *folio = fbatch.folios[i];
+
+ filemap_end_dropbehind_read(mapping, folio);
+ folio_put(folio);
+ }
folio_batch_init(&fbatch);
} while (iov_iter_count(iter) && iocb->ki_pos < isize && !error);
diff --git a/mm/swap.c b/mm/swap.c
index 10decd9dffa1..ba02bd5ba145 100644
--- a/mm/swap.c
+++ b/mm/swap.c
@@ -427,6 +427,8 @@ static void folio_inc_refs(struct folio *folio)
*/
void folio_mark_accessed(struct folio *folio)
{
+ if (folio_test_dropbehind(folio))
+ return;
if (lru_gen_enabled()) {
folio_inc_refs(folio);
return;
--
2.45.2
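The numbers above come from a simple sequential-read tool that is not part
of this posting. A minimal sketch of an equivalent uncached read loop, using
the 65536-byte block size from the output and the RWF_DONTCACHE definition
shown in the previous patch (fd setup omitted):

    off_t off = 0;
    char buf[65536];
    struct iovec iov = { .iov_base = buf, .iov_len = sizeof(buf) };
    ssize_t ret;

    while ((ret = preadv2(fd, &iov, 1, off, RWF_DONTCACHE)) > 0)
        off += ret; /* folios are pruned once each read completes */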
* [PATCH 09/12] mm/filemap: drop streaming/uncached pages when writeback completes
From: Jens Axboe @ 2024-12-20 15:47 UTC
To: linux-mm, linux-fsdevel
Cc: hannes, clm, linux-kernel, willy, kirill, bfoster, Jens Axboe
If the folio is marked as streaming, drop pages when writeback completes.
Intended to be used with RWF_DONTCACHE, to avoid needing sync writes for
uncached IO.
Signed-off-by: Jens Axboe <axboe@kernel.dk>
---
mm/filemap.c | 28 ++++++++++++++++++++++++++++
1 file changed, 28 insertions(+)
diff --git a/mm/filemap.c b/mm/filemap.c
index dd563208d09d..aa0b3af6533d 100644
--- a/mm/filemap.c
+++ b/mm/filemap.c
@@ -1599,6 +1599,27 @@ int folio_wait_private_2_killable(struct folio *folio)
}
EXPORT_SYMBOL(folio_wait_private_2_killable);
+/*
+ * If folio was marked as dropbehind, then pages should be dropped when writeback
+ * completes. Do that now. If we fail, it's likely because of a big folio -
+ * just reset dropbehind for that case and later completions should invalidate.
+ */
+static void folio_end_dropbehind_write(struct folio *folio)
+{
+ /*
+ * Hitting !in_task() should not happen off RWF_DONTCACHE writeback,
+ * but can happen if normal writeback just happens to find dirty folios
+ * that were created as part of uncached writeback, and that writeback
+ * would otherwise not need non-IRQ handling. Just skip the
+ * invalidation in that case.
+ */
+ if (in_task() && folio_trylock(folio)) {
+ if (folio->mapping)
+ folio_unmap_invalidate(folio->mapping, folio, 0);
+ folio_unlock(folio);
+ }
+}
+
/**
* folio_end_writeback - End writeback against a folio.
* @folio: The folio.
@@ -1609,6 +1630,8 @@ EXPORT_SYMBOL(folio_wait_private_2_killable);
*/
void folio_end_writeback(struct folio *folio)
{
+ bool folio_dropbehind = false;
+
VM_BUG_ON_FOLIO(!folio_test_writeback(folio), folio);
/*
@@ -1630,9 +1653,14 @@ void folio_end_writeback(struct folio *folio)
* reused before the folio_wake_bit().
*/
folio_get(folio);
+ if (!folio_test_dirty(folio))
+ folio_dropbehind = folio_test_clear_dropbehind(folio);
if (__folio_end_writeback(folio))
folio_wake_bit(folio, PG_writeback);
acct_reclaim_writeback(folio);
+
+ if (folio_dropbehind)
+ folio_end_dropbehind_write(folio);
folio_put(folio);
}
EXPORT_SYMBOL(folio_end_writeback);
--
2.45.2
* Re: [PATCH 09/12] mm/filemap: drop streaming/uncached pages when writeback completes
From: Jingbo Xu @ 2025-01-18 3:29 UTC
To: Jens Axboe, linux-mm, linux-fsdevel
Cc: hannes, clm, linux-kernel, willy, kirill, bfoster
On 12/20/24 11:47 PM, Jens Axboe wrote:
> If the folio is marked as streaming, drop pages when writeback completes.
> Intended to be used with RWF_DONTCACHE, to avoid needing sync writes for
> uncached IO.
>
> Signed-off-by: Jens Axboe <axboe@kernel.dk>
> ---
> mm/filemap.c | 28 ++++++++++++++++++++++++++++
> 1 file changed, 28 insertions(+)
>
> diff --git a/mm/filemap.c b/mm/filemap.c
> index dd563208d09d..aa0b3af6533d 100644
> --- a/mm/filemap.c
> +++ b/mm/filemap.c
> @@ -1599,6 +1599,27 @@ int folio_wait_private_2_killable(struct folio *folio)
> }
> EXPORT_SYMBOL(folio_wait_private_2_killable);
>
> +/*
> + * If folio was marked as dropbehind, then pages should be dropped when writeback
> + * completes. Do that now. If we fail, it's likely because of a big folio -
> + * just reset dropbehind for that case and later completions should invalidate.
> + */
> +static void folio_end_dropbehind_write(struct folio *folio)
> +{
> + /*
> + * Hitting !in_task() should not happen off RWF_DONTCACHE writeback,
> + * but can happen if normal writeback just happens to find dirty folios
> + * that were created as part of uncached writeback, and that writeback
> + * would otherwise not need non-IRQ handling. Just skip the
> + * invalidation in that case.
> + */
> + if (in_task() && folio_trylock(folio)) {
> + if (folio->mapping)
> + folio_unmap_invalidate(folio->mapping, folio, 0);
> + folio_unlock(folio);
> + }
> +}
Sorry, shouldn't folio_end_writeback() be called from IRQ context (of the
block device) when the IO completes? This may be a stupid question, but
I just can't understand that...
--
Thanks,
Jingbo
* Re: [PATCH 09/12] mm/filemap: drop streaming/uncached pages when writeback completes
From: Ritesh Harjani @ 2025-03-04 3:12 UTC
To: Jens Axboe, Darrick J . Wong
Cc: hannes, clm, linux-kernel, willy, kirill, bfoster, Jens Axboe,
linux-mm, linux-fsdevel, linux-xfs, fstests, Ritesh Harjani
Jens Axboe <axboe@kernel.dk> writes:
> If the folio is marked as streaming, drop pages when writeback completes.
> Intended to be used with RWF_DONTCACHE, to avoid needing sync writes for
> uncached IO.
>
> Signed-off-by: Jens Axboe <axboe@kernel.dk>
> ---
> mm/filemap.c | 28 ++++++++++++++++++++++++++++
> 1 file changed, 28 insertions(+)
>
> diff --git a/mm/filemap.c b/mm/filemap.c
> index dd563208d09d..aa0b3af6533d 100644
> --- a/mm/filemap.c
> +++ b/mm/filemap.c
> @@ -1599,6 +1599,27 @@ int folio_wait_private_2_killable(struct folio *folio)
> }
> EXPORT_SYMBOL(folio_wait_private_2_killable);
>
> +/*
> + * If folio was marked as dropbehind, then pages should be dropped when writeback
> + * completes. Do that now. If we fail, it's likely because of a big folio -
> + * just reset dropbehind for that case and later completions should invalidate.
> + */
> +static void folio_end_dropbehind_write(struct folio *folio)
> +{
> + /*
> + * Hitting !in_task() should not happen off RWF_DONTCACHE writeback,
> + * but can happen if normal writeback just happens to find dirty folios
> + * that were created as part of uncached writeback, and that writeback
> + * would otherwise not need non-IRQ handling. Just skip the
> + * invalidation in that case.
> + */
> + if (in_task() && folio_trylock(folio)) {
> + if (folio->mapping)
> + folio_unmap_invalidate(folio->mapping, folio, 0);
> + folio_unlock(folio);
> + }
> +}
> +
Hi Jens,
I want to ensure that my understanding is correct here w.r.t. the above
function, where we call folio_unmap_invalidate() only when in_task() is
true.
The writeback completion will almost always run in softirq context,
right? Do you know of cases where the writeback completion runs in
process context (in_task())? A few cases from the filesystem side where
the completion can run in process context are when the bio->bi_end_io is
hijacked by the filesystem, e.g.
/* send ioends that might require a transaction to the completion wq */
if (xfs_ioend_is_append(ioend) ||
(ioend->io_flags & (IOMAP_IOEND_UNWRITTEN | IOMAP_IOEND_SHARED)))
ioend->io_bio.bi_end_io = xfs_end_bio;
if (status)
return status;
submit_bio(&ioend->io_bio);
In this case xfs_end_bio() will be called for completing the ioends and
will queue the completion to a workqueue (since this requires txn
processing), which means in_task() will be true.
That means the user-visible effects of having an in_task() check for
calling folio_unmap_invalidate() are:
1. If an append write, a write to an unwritten extent, or a COW write is
done using RWF_DONTCACHE on a file, it will free up the page cache folios
after the I/O completes (as expected).
2. However, if RWF_DONTCACHE writes are done to any existing written
extents (i.e. overwrites), then it will not free up the page cache folios
after the I/O completes (because the completions run in softirq context).
Is this understanding correct? Is this expected behavior too, and is it
documented someplace like a man page or the kernel Documentation?
The other thing I wanted to check: we can then have folios still in the
page cache which were written using RWF_DONTCACHE. So do we drop these
page cache folios later from somewhere else? Or does any other mm
subsystem use the fact that the pages remaining in the page cache which
were written using RWF_DONTCACHE can be candidates for removal first?
I am just trying to better understand whether we are treating these
page cache folios marked with PG_dropbehind as special, or just as
regular page cache folios, after the I/O is complete but they didn't get
freed. I understand the other reason they won't get freed is when
someone else might be using the folios.
<Below experiment to show page cache pages cached after doing an
overwrite using RWF_DONTCACHE>
I was trying to add a -U flag to xfs_io for uncached preadv2/pwritev2
calls (I will post those patches soon).
Using the -U flag in xfs_io, we can see the above-mentioned behavior.
e.g.
# mount /dev/loop6 /mnt1/test
// Do uncached writes using (-U) to a new file
# ./io/xfs_io -fc "pwrite -U -V 1 0 16K" /mnt1/test/f1;
wrote 16384/16384 bytes at offset 0
16 KiB, 4 ops; 0.0036 sec (4.277 MiB/sec and 1094.9904 ops/sec)
//no page cache pages found as expected
# ./io/xfs_io -c "mmap 0 16K" -c "mincore" -c "munmap" /mnt1/test/f1
// overwrite to the same area using uncached writes (-U)
# ./io/xfs_io -c "pwrite -U -V 1 0 16K" /mnt1/test/f1;
wrote 16384/16384 bytes at offset 0
16 KiB, 4 ops; 0.0016 sec (9.340 MiB/sec and 2390.9145 ops/sec)
// Overwrite causes the page cache pages to be found even with -U
# ./io/xfs_io -c "mmap 0 16K" -c "mincore" -c "munmap" /mnt1/test/f1
0x7ffff7fb5000 - 0x7ffff7fb9000 4 pages (0 : 16384)
// Try the same after a mount cycle
# umount /dev/loop6
# mount /dev/loop6 /mnt1/test
// Overwrite causes the page cache pages to be found even with -U
# ./io/xfs_io -c "pwrite -U -V 1 0 16K" /mnt1/test/f1;
wrote 16384/16384 bytes at offset 0
16 KiB, 4 ops; 0.0009 sec (16.361 MiB/sec and 4188.4817 ops/sec)
# ./io/xfs_io -c "mmap 0 16K" -c "mincore" -c "munmap" /mnt1/test/f1
0x7ffff7fb5000 - 0x7ffff7fb9000 4 pages (0 : 16384)
#
Should we add a few unit tests in xfstests for the above behavior (for
preadv2() and pwritev2())? I was planning to add a few, but since we are
already discussing other things here, it's better to get an opinion
from others on this too.
-ritesh
> /**
> * folio_end_writeback - End writeback against a folio.
> * @folio: The folio.
> @@ -1609,6 +1630,8 @@ EXPORT_SYMBOL(folio_wait_private_2_killable);
> */
> void folio_end_writeback(struct folio *folio)
> {
> + bool folio_dropbehind = false;
> +
> VM_BUG_ON_FOLIO(!folio_test_writeback(folio), folio);
>
> /*
> @@ -1630,9 +1653,14 @@ void folio_end_writeback(struct folio *folio)
> * reused before the folio_wake_bit().
> */
> folio_get(folio);
> + if (!folio_test_dirty(folio))
> + folio_dropbehind = folio_test_clear_dropbehind(folio);
> if (__folio_end_writeback(folio))
> folio_wake_bit(folio, PG_writeback);
> acct_reclaim_writeback(folio);
> +
> + if (folio_dropbehind)
> + folio_end_dropbehind_write(folio);
> folio_put(folio);
> }
> EXPORT_SYMBOL(folio_end_writeback);
> --
> 2.45.2
* [PATCH 10/12] mm/filemap: add filemap_fdatawrite_range_kick() helper
From: Jens Axboe @ 2024-12-20 15:47 UTC
To: linux-mm, linux-fsdevel
Cc: hannes, clm, linux-kernel, willy, kirill, bfoster, Jens Axboe
Works like filemap_fdatawrite_range(), except it's a non-integrity data
writeback and hence only starts writeback on the specified range. Will
help facilitate generically starting uncached writeback from
generic_write_sync(), as header dependencies preclude doing this inline
from fs.h.
Signed-off-by: Jens Axboe <axboe@kernel.dk>
---
include/linux/fs.h | 2 ++
mm/filemap.c | 18 ++++++++++++++++++
2 files changed, 20 insertions(+)
diff --git a/include/linux/fs.h b/include/linux/fs.h
index 6a838b5479a6..653b5efa3d3f 100644
--- a/include/linux/fs.h
+++ b/include/linux/fs.h
@@ -2878,6 +2878,8 @@ extern int __must_check file_fdatawait_range(struct file *file, loff_t lstart,
extern int __must_check file_check_and_advance_wb_err(struct file *file);
extern int __must_check file_write_and_wait_range(struct file *file,
loff_t start, loff_t end);
+int filemap_fdatawrite_range_kick(struct address_space *mapping, loff_t start,
+ loff_t end);
static inline int file_write_and_wait(struct file *file)
{
diff --git a/mm/filemap.c b/mm/filemap.c
index aa0b3af6533d..9842258ba343 100644
--- a/mm/filemap.c
+++ b/mm/filemap.c
@@ -449,6 +449,24 @@ int filemap_fdatawrite_range(struct address_space *mapping, loff_t start,
}
EXPORT_SYMBOL(filemap_fdatawrite_range);
+/**
+ * filemap_fdatawrite_range_kick - start writeback on a range
+ * @mapping: target address_space
+ * @start: index to start writeback on
+ * @end: last (non-inclusive) index for writeback
+ *
+ * This is a non-integrity writeback helper, to start writing back folios
+ * for the indicated range.
+ *
+ * Return: %0 on success, negative error code otherwise.
+ */
+int filemap_fdatawrite_range_kick(struct address_space *mapping, loff_t start,
+ loff_t end)
+{
+ return __filemap_fdatawrite_range(mapping, start, end, WB_SYNC_NONE);
+}
+EXPORT_SYMBOL_GPL(filemap_fdatawrite_range_kick);
+
/**
* filemap_flush - mostly a non-blocking flush
* @mapping: target address_space
--
2.45.2
* Re: [PATCH 10/12] mm/filemap: add filemap_fdatawrite_range_kick() helper
From: Jingbo Xu @ 2025-01-18 3:25 UTC
To: Jens Axboe, linux-mm, linux-fsdevel
Cc: hannes, clm, linux-kernel, willy, kirill, bfoster
On 12/20/24 11:47 PM, Jens Axboe wrote:
> Works like filemap_fdatawrite_range(), except it's a non-integrity data
> writeback and hence only starts writeback on the specified range. Will
> help facilitate generically starting uncached writeback from
> generic_write_sync(), as header dependencies preclude doing this inline
> from fs.h.
>
> Signed-off-by: Jens Axboe <axboe@kernel.dk>
> ---
> include/linux/fs.h | 2 ++
> mm/filemap.c | 18 ++++++++++++++++++
> 2 files changed, 20 insertions(+)
>
> diff --git a/include/linux/fs.h b/include/linux/fs.h
> index 6a838b5479a6..653b5efa3d3f 100644
> --- a/include/linux/fs.h
> +++ b/include/linux/fs.h
> @@ -2878,6 +2878,8 @@ extern int __must_check file_fdatawait_range(struct file *file, loff_t lstart,
> extern int __must_check file_check_and_advance_wb_err(struct file *file);
> extern int __must_check file_write_and_wait_range(struct file *file,
> loff_t start, loff_t end);
> +int filemap_fdatawrite_range_kick(struct address_space *mapping, loff_t start,
> + loff_t end);
>
> static inline int file_write_and_wait(struct file *file)
> {
> diff --git a/mm/filemap.c b/mm/filemap.c
> index aa0b3af6533d..9842258ba343 100644
> --- a/mm/filemap.c
> +++ b/mm/filemap.c
> @@ -449,6 +449,24 @@ int filemap_fdatawrite_range(struct address_space *mapping, loff_t start,
> }
> EXPORT_SYMBOL(filemap_fdatawrite_range);
>
> +/**
> + * filemap_fdatawrite_range_kick - start writeback on a range
> + * @mapping: target address_space
> + * @start: index to start writeback on
> + * @end: last (non-inclusive) index for writeback
> + *
> + * This is a non-integrity writeback helper, to start writing back folios
> + * for the indicated range.
> + *
> + * Return: %0 on success, negative error code otherwise.
> + */
> +int filemap_fdatawrite_range_kick(struct address_space *mapping, loff_t start,
> + loff_t end)
> +{
> + return __filemap_fdatawrite_range(mapping, start, end, WB_SYNC_NONE);
> +}
__filemap_fdatawrite_range() actually accepts @end argument as "last
(**inclusive**) index".
--
Thanks,
Jingbo
* [PATCH 11/12] mm: call filemap_fdatawrite_range_kick() after IOCB_DONTCACHE issue
From: Jens Axboe @ 2024-12-20 15:47 UTC
To: linux-mm, linux-fsdevel
Cc: hannes, clm, linux-kernel, willy, kirill, bfoster, Jens Axboe
When a buffered write submitted with IOCB_DONTCACHE has been successfully
submitted, call filemap_fdatawrite_range_kick() to kick off the IO.
File systems call generic_write_sync() for any successful buffered write
submission, hence add the logic here rather than needing to modify the
file system.
Signed-off-by: Jens Axboe <axboe@kernel.dk>
---
include/linux/fs.h | 5 +++++
1 file changed, 5 insertions(+)
diff --git a/include/linux/fs.h b/include/linux/fs.h
index 653b5efa3d3f..58a618853574 100644
--- a/include/linux/fs.h
+++ b/include/linux/fs.h
@@ -2912,6 +2912,11 @@ static inline ssize_t generic_write_sync(struct kiocb *iocb, ssize_t count)
(iocb->ki_flags & IOCB_SYNC) ? 0 : 1);
if (ret)
return ret;
+ } else if (iocb->ki_flags & IOCB_DONTCACHE) {
+ struct address_space *mapping = iocb->ki_filp->f_mapping;
+
+ filemap_fdatawrite_range_kick(mapping, iocb->ki_pos,
+ iocb->ki_pos + count);
}
return count;
--
2.45.2
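For context on where this fires: generic_write_sync() sits at the tail of
the common buffered write path, so file systems pick up the writeback kick
without per-fs changes. A simplified paraphrase of the existing call site in
generic_file_write_iter() (not part of this patch, locking elided):

    ssize_t generic_file_write_iter(struct kiocb *iocb, struct iov_iter *from)
    {
        ssize_t ret;
        /* ... inode lock and write checks ... */
        ret = __generic_file_write_iter(iocb, from);
        /* ... unlock ... */
        if (ret > 0)
            /* with IOCB_DONTCACHE, this now also kicks writeback */
            ret = generic_write_sync(iocb, ret);
        return ret;
    }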
* [PATCH 12/12] mm: add FGP_DONTCACHE folio creation flag
From: Jens Axboe @ 2024-12-20 15:47 UTC
To: linux-mm, linux-fsdevel
Cc: hannes, clm, linux-kernel, willy, kirill, bfoster, Jens Axboe
Callers can pass this in for uncached folio creation, in which case a
newly created folio gets marked as uncached. If a folio already exists
for this index and the lookup succeeds, it will not get marked as
uncached. If a normal (!uncached) lookup finds an uncached folio, clear
the flag: in that case there are competing uncached and cached users of
the folio, and it should not get pruned.
Signed-off-by: Jens Axboe <axboe@kernel.dk>
---
include/linux/pagemap.h | 2 ++
mm/filemap.c | 5 +++++
2 files changed, 7 insertions(+)
diff --git a/include/linux/pagemap.h b/include/linux/pagemap.h
index 5da4b6d42fae..64c6dada837e 100644
--- a/include/linux/pagemap.h
+++ b/include/linux/pagemap.h
@@ -710,6 +710,7 @@ pgoff_t page_cache_prev_miss(struct address_space *mapping,
* * %FGP_NOFS - __GFP_FS will get cleared in gfp.
* * %FGP_NOWAIT - Don't block on the folio lock.
* * %FGP_STABLE - Wait for the folio to be stable (finished writeback)
+ * * %FGP_DONTCACHE - Uncached buffered IO
* * %FGP_WRITEBEGIN - The flags to use in a filesystem write_begin()
* implementation.
*/
@@ -723,6 +724,7 @@ typedef unsigned int __bitwise fgf_t;
#define FGP_NOWAIT ((__force fgf_t)0x00000020)
#define FGP_FOR_MMAP ((__force fgf_t)0x00000040)
#define FGP_STABLE ((__force fgf_t)0x00000080)
+#define FGP_DONTCACHE ((__force fgf_t)0x00000100)
#define FGF_GET_ORDER(fgf) (((__force unsigned)fgf) >> 26) /* top 6 bits */
#define FGP_WRITEBEGIN (FGP_LOCK | FGP_WRITE | FGP_CREAT | FGP_STABLE)
diff --git a/mm/filemap.c b/mm/filemap.c
index 9842258ba343..68bdfff4117e 100644
--- a/mm/filemap.c
+++ b/mm/filemap.c
@@ -2001,6 +2001,8 @@ struct folio *__filemap_get_folio(struct address_space *mapping, pgoff_t index,
/* Init accessed so avoid atomic mark_page_accessed later */
if (fgp_flags & FGP_ACCESSED)
__folio_set_referenced(folio);
+ if (fgp_flags & FGP_DONTCACHE)
+ __folio_set_dropbehind(folio);
err = filemap_add_folio(mapping, folio, index, gfp);
if (!err)
@@ -2023,6 +2025,9 @@ struct folio *__filemap_get_folio(struct address_space *mapping, pgoff_t index,
if (!folio)
return ERR_PTR(-ENOENT);
+ /* not an uncached lookup, clear uncached if set */
+ if (folio_test_dropbehind(folio) && !(fgp_flags & FGP_DONTCACHE))
+ folio_clear_dropbehind(folio);
return folio;
}
EXPORT_SYMBOL(__filemap_get_folio);
--
2.45.2
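To make the intended use concrete, here is a hedged sketch of a caller;
get_dontcache_folio() is a hypothetical helper, while __filemap_get_folio()
and the FGP flags are the real interfaces touched by this patch:

#include <linux/pagemap.h>

/* Hypothetical helper: look up or create a folio for uncached IO. */
static struct folio *get_dontcache_folio(struct address_space *mapping,
					 pgoff_t index)
{
	/*
	 * FGP_DONTCACHE marks the folio dropbehind only if this call
	 * creates it; an already-cached folio is returned with the
	 * flag untouched.
	 */
	return __filemap_get_folio(mapping, index,
				   FGP_LOCK | FGP_CREAT | FGP_DONTCACHE,
				   mapping_gfp_mask(mapping));
}

On failure this returns an ERR_PTR, as with any other FGP_CREAT lookup.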
^ permalink raw reply [flat|nested] 32+ messages in thread
* Re: [PATCHSET v8 0/12] Uncached buffered IO
2024-12-20 15:47 [PATCHSET v8 0/12] Uncached buffered IO Jens Axboe
` (11 preceding siblings ...)
2024-12-20 15:47 ` [PATCH 12/12] mm: add FGP_DONTCACHE folio creation flag Jens Axboe
@ 2025-01-08 3:35 ` Andrew Morton
2025-01-13 15:34 ` Jens Axboe
12 siblings, 1 reply; 32+ messages in thread
From: Andrew Morton @ 2025-01-08 3:35 UTC (permalink / raw)
To: Jens Axboe
Cc: linux-mm, linux-fsdevel, hannes, clm, linux-kernel, willy,
kirill, bfoster
On Fri, 20 Dec 2024 08:47:38 -0700 Jens Axboe <axboe@kernel.dk> wrote:
> So here's a new approach to the same concept, but using the page cache
> as synchronization. Due to excessive bike shedding on the naming, this
> is now named RWF_DONTCACHE, and is less special in that it's just page
> cache IO, except it prunes the ranges once IO is completed.
>
> Why do this, you may ask? The tldr is that device speeds are only
> getting faster, while reclaim is not. Doing normal buffered IO can be
> very unpredictable, and suck up a lot of resources on the reclaim side.
> This leads people to use O_DIRECT as a work-around, which has its own
> set of restrictions in terms of size, offset, and length of IO. It's
> also inherently synchronous, and now you need async IO as well. While
> the latter isn't necessarily a big problem as we have good options
> available there, it also should not be a requirement when all you want
> to do is read or write some data without caching.
Of course, we're doing something here which userspace could itself do:
drop the pagecache after reading it (with appropriate chunk sizing) and
for writes, sync the written area then invalidate it. Possible
added benefits from using separate threads for this.
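Concretely, such a userspace scheme might look like the following rough
sketch (chunk sizing, threading and error handling elided; only existing
syscalls are used):

#define _GNU_SOURCE
#include <fcntl.h>
#include <unistd.h>

/* Read, then drop whatever the read pulled into the page cache. */
static ssize_t read_dropbehind(int fd, void *buf, size_t len, off_t off)
{
	ssize_t ret = pread(fd, buf, len, off);

	if (ret > 0)
		posix_fadvise(fd, off, ret, POSIX_FADV_DONTNEED);
	return ret;
}

/* Write, sync the written range, then invalidate it. */
static ssize_t write_dropbehind(int fd, const void *buf, size_t len,
				off_t off)
{
	ssize_t ret = pwrite(fd, buf, len, off);

	if (ret > 0) {
		sync_file_range(fd, off, ret, SYNC_FILE_RANGE_WRITE |
				SYNC_FILE_RANGE_WAIT_AFTER);
		posix_fadvise(fd, off, ret, POSIX_FADV_DONTNEED);
	}
	return ret;
}

Note that POSIX_FADV_DONTNEED also drops pages that were cached before the
IO, which is one behavioral difference from what's proposed here.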
I suggest that diligence requires that we at least justify an in-kernel
approach at this time, please.
And there's a possible middle-ground implementation where the kernel
itself kicks off threads to do the drop-behind just before the read or
write syscall returns, which will probably be simpler. Can we please
describe why this also isn't acceptable?
Also, it seems wrong for a read(RWF_DONTCACHE) to drop cache if it was
already present. Because it was presumably present for a reason. Does
this implementation already take care of this? To make an application
which does read(/etc/passwd, RWF_DONTCACHE) less annoying?
Also, consuming a new page flag isn't a minor thing. It would be nice
to see some justification around this, and some description of how many
we have left.
^ permalink raw reply [flat|nested] 32+ messages in thread
* Re: [PATCHSET v8 0/12] Uncached buffered IO
2025-01-08 3:35 ` [PATCHSET v8 0/12] Uncached buffered IO Andrew Morton
@ 2025-01-13 15:34 ` Jens Axboe
2025-01-14 0:46 ` Andrew Morton
0 siblings, 1 reply; 32+ messages in thread
From: Jens Axboe @ 2025-01-13 15:34 UTC (permalink / raw)
To: Andrew Morton
Cc: linux-mm, linux-fsdevel, hannes, clm, linux-kernel, willy,
kirill, bfoster
(sorry missed this reply!)
On 1/7/25 8:35 PM, Andrew Morton wrote:
> On Fri, 20 Dec 2024 08:47:38 -0700 Jens Axboe <axboe@kernel.dk> wrote:
>
>> So here's a new approach to the same concept, but using the page cache
>> as synchronization. Due to excessive bike shedding on the naming, this
>> is now named RWF_DONTCACHE, and is less special in that it's just page
>> cache IO, except it prunes the ranges once IO is completed.
>>
>> Why do this, you may ask? The tldr is that device speeds are only
>> getting faster, while reclaim is not. Doing normal buffered IO can be
>> very unpredictable, and suck up a lot of resources on the reclaim side.
>> This leads people to use O_DIRECT as a work-around, which has its own
>> set of restrictions in terms of size, offset, and length of IO. It's
>> also inherently synchronous, and now you need async IO as well. While
>> the latter isn't necessarily a big problem as we have good options
>> available there, it also should not be a requirement when all you want
>> to do is read or write some data without caching.
>
> Of course, we're doing something here which userspace could itself do:
> drop the pagecache after reading it (with appropriate chunk sizing) and
> for writes, sync the written area then invalidate it. Possible
> added benefits from using separate threads for this.
>
> I suggest that diligence requires that we at least justify an in-kernel
> approach at this time, please.
Conceptually yes. But you'd end up doing extra work to do it. Some of
that is not so expensive, like the system calls, and some of it more so,
like the LRU manipulation. Outside of that, I do think it makes sense to
expose this as a generic thing, rather than require applications to kick
writeback manually, reclaim manually, etc.
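For comparison, the application side with this series is just a flag on
the IO. A sketch, assuming a kernel with this series and headers exposing
the new RWF_DONTCACHE uapi flag:

#define _GNU_SOURCE
#include <sys/uio.h>

/* One uncached buffered read; the kernel prunes what it added. */
static ssize_t uncached_pread(int fd, void *buf, size_t len, off_t off)
{
	struct iovec iov = { .iov_base = buf, .iov_len = len };

	/* RWF_DONTCACHE is the uapi flag added by this series */
	return preadv2(fd, &iov, 1, off, RWF_DONTCACHE);
}

The same flag works for writes via pwritev2(), with writeback kicked off
after submission as per patch 11.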
> And there's a possible middle-ground implementation where the kernel
> itself kicks off threads to do the drop-behind just before the read or
> write syscall returns, which will probably be simpler. Can we please
> describe why this also isn't acceptable?
That's more of an implementation detail. I didn't test anything like
that, though we surely could. If it's better, there's no reason why it
can't just be changed to do that. My gut tells me you want the task/CPU
that just did the page cache additions to do the pruning too, as that should
be more efficient than having a kworker or similar do it.
> Also, it seems wrong for a read(RWF_DONTCACHE) to drop cache if it was
> already present. Because it was presumably present for a reason. Does
> this implementation already take care of this? To make an application
> which does read(/etc/passwd, RWF_DONTCACHE) less annoying?
The implementation doesn't drop pages that were already present, only
pages that got created/added to the page cache for the operation. So
that part should already work as you expect.
> Also, consuming a new page flag isn't a minor thing. It would be nice
> to see some justification around this, and some description of how many
> we have left.
For sure, though various discussions on this already occurred and Kirill
posted patches for unifying some of this already. It's not something I
wanted to tackle, as I think that should be left to people more familiar
with the page/folio flags and their (sometimes odd) interactions.
--
Jens Axboe
^ permalink raw reply [flat|nested] 32+ messages in thread
* Re: [PATCHSET v8 0/12] Uncached buffered IO
2025-01-13 15:34 ` Jens Axboe
@ 2025-01-14 0:46 ` Andrew Morton
2025-01-14 0:56 ` Jens Axboe
2025-01-16 10:06 ` Kirill A. Shutemov
0 siblings, 2 replies; 32+ messages in thread
From: Andrew Morton @ 2025-01-14 0:46 UTC (permalink / raw)
To: Jens Axboe
Cc: linux-mm, linux-fsdevel, hannes, clm, linux-kernel, willy,
kirill, bfoster
On Mon, 13 Jan 2025 08:34:18 -0700 Jens Axboe <axboe@kernel.dk> wrote:
> >
>
> ...
>
> > Of course, we're doing something here which userspace could itself do:
> > drop the pagecache after reading it (with appropriate chunk sizing) and
> > for writes, sync the written area then invalidate it. Possible
> > added benefits from using separate threads for this.
> >
> > I suggest that diligence requires that we at least justify an in-kernel
> > approach at this time, please.
>
> Conceptually yes. But you'd end up doing extra work to do it. Some of
> that is not so expensive, like the system calls, and some of it more so,
> like the LRU manipulation. Outside of that, I do think it makes sense to
> expose this as a generic thing, rather than require applications to kick
> writeback manually, reclaim manually, etc.
>
> > And there's a possible middle-ground implementation where the kernel
> > itself kicks off threads to do the drop-behind just before the read or
> > write syscall returns, which will probably be simpler. Can we please
> > describe why this also isn't acceptable?
>
> That's more of an implementation detail. I didn't test anything like
> that, though we surely could. If it's better, there's no reason why it
> can't just be changed to do that. My gut tells me you want the task/CPU
> that just did the page cache additions to do the pruning too, as that should
> be more efficient than having a kworker or similar do it.
Well, gut might be wrong ;)
There may be benefit in using different CPUs to perform the dropbehind,
rather than making the read() caller do this synchronously.
If I understand correctly, the write() dropbehind is performed at
interrupt (write completion) time so that's already async.
> > Also, it seems wrong for a read(RWF_DONTCACHE) to drop cache if it was
> > already present. Because it was presumably present for a reason. Does
> > this implementation already take care of this? To make an application
> > which does read(/etc/passwd, RWF_DONTCACHE) less annoying?
>
> The implementation doesn't drop pages that were already present, only
> pages that got created/added to the page cache for the operation. So
> that part should already work as you expect.
>
> > Also, consuming a new page flag isn't a minor thing. It would be nice
> > to see some justification around this, and some description of how many
> > we have left.
>
> For sure, though various discussions on this already occurred and Kirill
> posted patches for unifying some of this already. It's not something I
> wanted to tackle, as I think that should be left to people more familiar
> with the page/folio flags and their (sometimes odd) interactions.
Matthew & Kirill: are you OK with merging this as-is and then
revisiting the page-flag consumption at a later time?
^ permalink raw reply [flat|nested] 32+ messages in thread
* Re: [PATCHSET v8 0/12] Uncached buffered IO
2025-01-14 0:46 ` Andrew Morton
@ 2025-01-14 0:56 ` Jens Axboe
2025-01-16 10:06 ` Kirill A. Shutemov
1 sibling, 0 replies; 32+ messages in thread
From: Jens Axboe @ 2025-01-14 0:56 UTC (permalink / raw)
To: Andrew Morton
Cc: linux-mm, linux-fsdevel, hannes, clm, linux-kernel, willy,
kirill, bfoster
On 1/13/25 5:46 PM, Andrew Morton wrote:
> On Mon, 13 Jan 2025 08:34:18 -0700 Jens Axboe <axboe@kernel.dk> wrote:
>
>>>
>>
>> ...
>>
>>> Of course, we're doing something here which userspace could itself do:
>>> drop the pagecache after reading it (with appropriate chunk sizing) and
>>> for writes, sync the written area then invalidate it. Possible
>>> added benefits from using separate threads for this.
>>>
>>> I suggest that diligence requires that we at least justify an in-kernel
>>> approach at this time, please.
>>
>> Conceptually yes. But you'd end up doing extra work to do it. Some of
>> that is not so expensive, like the system calls, and some of it more so,
>> like the LRU manipulation. Outside of that, I do think it makes sense to
>> expose this as a generic thing, rather than require applications to kick
>> writeback manually, reclaim manually, etc.
>>
>>> And there's a possible middle-ground implementation where the kernel
>>> itself kicks off threads to do the drop-behind just before the read or
>>> write syscall returns, which will probably be simpler. Can we please
>>> describe why this also isn't acceptable?
>>
>> That's more of an implementation detail. I didn't test anything like
>> that, though we surely could. If it's better, there's no reason why it
>> can't just be changed to do that. My gut tells me you want the task/CPU
>> that just did the page cache additions to do the pruning too, as that should
>> be more efficient than having a kworker or similar do it.
>
> Well, gut might be wrong ;)
A gut this big is rarely wrong ;-)
> There may be benefit in using different CPUs to perform the dropbehind,
> rather than making the read() caller do this synchronously.
>
> If I understand correctly, the write() dropbehind is performed at
> interrupt (write completion) time so that's already async.
It does, but we could actually get rid of that, at least when called via
io_uring. From the testing I've done, doing it inline is definitely
superior. Though it will depend on whether you care about overall efficiency
or just sheer speed/overhead of the read/write itself.
--
Jens Axboe
^ permalink raw reply [flat|nested] 32+ messages in thread
* Re: [PATCHSET v8 0/12] Uncached buffered IO
2025-01-14 0:46 ` Andrew Morton
2025-01-14 0:56 ` Jens Axboe
@ 2025-01-16 10:06 ` Kirill A. Shutemov
1 sibling, 0 replies; 32+ messages in thread
From: Kirill A. Shutemov @ 2025-01-16 10:06 UTC (permalink / raw)
To: Andrew Morton
Cc: Jens Axboe, linux-mm, linux-fsdevel, hannes, clm, linux-kernel,
willy, bfoster
On Mon, Jan 13, 2025 at 04:46:50PM -0800, Andrew Morton wrote:
> > > Also, consuming a new page flag isn't a minor thing. It would be nice
> > > to see some justification around this, and some description of how many
> > > we have left.
> >
> > For sure, though various discussions on this already occurred and Kirill
> > posted patches for unifying some of this already. It's not something I
> > wanted to tackle, as I think that should be left to people more familiar
> > with the page/folio flags and their (sometimes odd) interactions.
>
> Matthew & Kirill: are you OK with merging this as-is and then
> revisiting the page-flag consumption at a later time?
I have tried to find a way to avoid adding a new flag bit, but I have not
found one. I am okay with merging it as it is.
--
Kiryl Shutsemau / Kirill A. Shutemov
^ permalink raw reply [flat|nested] 32+ messages in thread