* Re: [RFC PATCH 0/2] Use high-order folios in mmap sync RA
2026-04-15 19:28 [RFC PATCH 0/2] Use high-order folios in mmap sync RA Anatoly Stepanov
@ 2026-04-15 13:18 ` Matthew Wilcox
2026-04-15 13:33 ` Stepanov Anatoly
2026-04-15 19:28 ` [RFC PATCH 1/2] procfs: add contpte info into smaps Anatoly Stepanov
2026-04-15 19:28 ` [RFC PATCH 2/2] filemap: use high-order folios in filemap sync RA Anatoly Stepanov
2 siblings, 1 reply; 9+ messages in thread
From: Matthew Wilcox @ 2026-04-15 13:18 UTC (permalink / raw)
To: Anatoly Stepanov
Cc: akpm, david, ljs, Liam.Howlett, vbabka, rppt, surenb, mhocko,
wangkefeng.wang, yanquanmin1, zuoze1, artem.kuzin,
gutierrez.asier, linux-fsdevel, linux-mm, linux-kernel
On Thu, Apr 16, 2026 at 03:28:51AM +0800, Anatoly Stepanov wrote:
> When "fault around" is enabled, 0-order folios might significantly
> slow down filemap_map_pages().
There's a lot of "might" in this patchset. I'd like to know that there
is a real workload that benefits from this, and if so by how much.
You raise an interesting point that faultaround may be slow, and maybe
we should start out with 0 faultaround until we've determined (somehow)
that faultaround would be beneficial for this particular mapping. Like
we adjust the readahead window.
> For example when async RA won't be able to start,
> we might end up with a large mmap'ed file with 0-orders.
That is a feature, not a bug. If access is random, then we don't want
to do any async readahead because we don't know where the next access
will be. We just end up occupying large chunks of memory with
never-used data.
^ permalink raw reply [flat|nested] 9+ messages in thread
* Re: [RFC PATCH 0/2] Use high-order folios in mmap sync RA
2026-04-15 13:18 ` Matthew Wilcox
@ 2026-04-15 13:33 ` Stepanov Anatoly
0 siblings, 0 replies; 9+ messages in thread
From: Stepanov Anatoly @ 2026-04-15 13:33 UTC (permalink / raw)
To: Matthew Wilcox
Cc: akpm, david, ljs, Liam.Howlett, vbabka, rppt, surenb, mhocko,
wangkefeng.wang, yanquanmin1, zuoze1, artem.kuzin,
gutierrez.asier, linux-fsdevel, linux-mm, linux-kernel
On 4/15/2026 4:18 PM, Matthew Wilcox wrote:
> On Thu, Apr 16, 2026 at 03:28:51AM +0800, Anatoly Stepanov wrote:
>> When "fault around" is enabled, 0-order folios might significantly
>> slow down filemap_map_pages().
>
> There's a lot of "might" in this patchset. I'd like to know that there
> is a real workload that benefits from this, and if so by how much.
>
Actually, there is no real workload at the moment.
The intention is to highlight the filemap_map_pages() issue;
I found it during my experiments with the page cache.
> You raise an interesting point that faultaround may be slow, and maybe
> we should start out with 0 faultaround until we've determined (somehow)
> that faultaround would be beneficial for this particular mapping. Like
> we adjust the readahead window.
>
Sounds nice; it looks like there should be some kind of "virtual
readahead" or something similar.
BTW, for the benchmark I posted, if fault_around is disabled (4K),
the throughput is even higher.
>> For example when async RA won't be able to start,
>> we might end up with a large mmap'ed file with 0-orders.
>
> That is a feature, not a bug. If access is random, then we don't want
> to do any async readahead because we don't know where the next access
> will be. We just end up occupying large chunks of memory with
> never-used data.
>
>
Yes, I understand the logic behind this; what I mean is that it can actually happen.
--
Anatoly Stepanov, Huawei
* [RFC PATCH 1/2] procfs: add contpte info into smaps
2026-04-15 19:28 [RFC PATCH 0/2] Use high-order folios in mmap sync RA Anatoly Stepanov
2026-04-15 13:18 ` Matthew Wilcox
@ 2026-04-15 19:28 ` Anatoly Stepanov
2026-04-15 12:52 ` David Hildenbrand (Arm)
2026-04-15 19:28 ` [RFC PATCH 2/2] filemap: use high-order folios in filemap sync RA Anatoly Stepanov
2 siblings, 1 reply; 9+ messages in thread
From: Anatoly Stepanov @ 2026-04-15 19:28 UTC (permalink / raw)
To: willy, akpm, david, ljs, Liam.Howlett, vbabka, rppt, surenb,
mhocko, wangkefeng.wang, yanquanmin1, zuoze1, artem.kuzin,
gutierrez.asier
Cc: linux-fsdevel, linux-mm, linux-kernel, Anatoly Stepanov
Signed-off-by: Anatoly Stepanov <stepanov.anatoly@huawei.com>
---
fs/proc/task_mmu.c | 20 +++++++++++++++++---
1 file changed, 17 insertions(+), 3 deletions(-)
diff --git a/fs/proc/task_mmu.c b/fs/proc/task_mmu.c
index e091931d7..22bcd36b9 100644
--- a/fs/proc/task_mmu.c
+++ b/fs/proc/task_mmu.c
@@ -874,6 +874,7 @@ struct mem_size_stats {
unsigned long shared_hugetlb;
unsigned long private_hugetlb;
unsigned long ksm;
+ unsigned long cont_pte;
u64 pss;
u64 pss_anon;
u64 pss_file;
@@ -915,7 +916,7 @@ static void smaps_page_accumulate(struct mem_size_stats *mss,
static void smaps_account(struct mem_size_stats *mss, struct page *page,
bool compound, bool young, bool dirty, bool locked,
- bool present)
+ bool present, bool cont)
{
struct folio *folio = page_folio(page);
int i, nr = compound ? compound_nr(page) : 1;
@@ -938,6 +939,8 @@ static void smaps_account(struct mem_size_stats *mss, struct page *page,
mss->ksm += size;
mss->resident += size;
+ if (cont)
+ mss->cont_pte += PAGE_SIZE;
/* Accumulate the size in pages that have been accessed. */
if (young || folio_test_young(folio) || folio_test_referenced(folio))
mss->referenced += size;
@@ -1015,6 +1018,10 @@ static void smaps_pte_hole_lookup(unsigned long addr, struct mm_walk *walk)
#endif
}
+#ifndef pte_cont
+#define pte_cont(pte) (false)
+#endif
+
static void smaps_pte_entry(pte_t *pte, unsigned long addr,
struct mm_walk *walk)
{
@@ -1023,12 +1030,14 @@ static void smaps_pte_entry(pte_t *pte, unsigned long addr,
bool locked = !!(vma->vm_flags & VM_LOCKED);
struct page *page = NULL;
bool present = false, young = false, dirty = false;
+ bool cont = false;
pte_t ptent = ptep_get(pte);
if (pte_present(ptent)) {
page = vm_normal_page(vma, addr, ptent);
young = pte_young(ptent);
dirty = pte_dirty(ptent);
+ cont = pte_cont(ptent);
present = true;
} else if (pte_none(ptent)) {
smaps_pte_hole_lookup(addr, walk);
@@ -1058,7 +1067,7 @@ static void smaps_pte_entry(pte_t *pte, unsigned long addr,
if (!page)
return;
- smaps_account(mss, page, false, young, dirty, locked, present);
+ smaps_account(mss, page, false, young, dirty, locked, present, cont);
}
#ifdef CONFIG_TRANSPARENT_HUGEPAGE
@@ -1096,7 +1105,7 @@ static void smaps_pmd_entry(pmd_t *pmd, unsigned long addr,
mss->file_thp += HPAGE_PMD_SIZE;
smaps_account(mss, page, true, pmd_young(*pmd), pmd_dirty(*pmd),
- locked, present);
+ locked, present, false);
}
#else
static void smaps_pmd_entry(pmd_t *pmd, unsigned long addr,
@@ -1356,6 +1365,11 @@ static void __show_smap(struct seq_file *m, const struct mem_size_stats *mss,
SEQ_PUT_DEC(" kB\nAnonHugePages: ", mss->anonymous_thp);
SEQ_PUT_DEC(" kB\nShmemPmdMapped: ", mss->shmem_thp);
SEQ_PUT_DEC(" kB\nFilePmdMapped: ", mss->file_thp);
+ if (mss->cont_pte) {
+ SEQ_PUT_DEC(" kB\nContPTE(Rss): ", mss->cont_pte);
+ SEQ_PUT_DEC(" ", mss->resident);
+ }
+
SEQ_PUT_DEC(" kB\nShared_Hugetlb: ", mss->shared_hugetlb);
seq_put_decimal_ull_width(m, " kB\nPrivate_Hugetlb: ",
mss->private_hugetlb >> 10, 7);
--
2.34.1
* Re: [RFC PATCH 1/2] procfs: add contpte info into smaps
2026-04-15 19:28 ` [RFC PATCH 1/2] procfs: add contpte info into smaps Anatoly Stepanov
@ 2026-04-15 12:52 ` David Hildenbrand (Arm)
0 siblings, 0 replies; 9+ messages in thread
From: David Hildenbrand (Arm) @ 2026-04-15 12:52 UTC (permalink / raw)
To: Anatoly Stepanov, willy, akpm, ljs, Liam.Howlett, vbabka, rppt,
surenb, mhocko, wangkefeng.wang, yanquanmin1, zuoze1,
artem.kuzin, gutierrez.asier
Cc: linux-fsdevel, linux-mm, linux-kernel
> +#ifndef pte_cont
> +#define pte_cont(pte) (false)
> +#endif
> +
> static void smaps_pte_entry(pte_t *pte, unsigned long addr,
> struct mm_walk *walk)
> {
> @@ -1023,12 +1030,14 @@ static void smaps_pte_entry(pte_t *pte, unsigned long addr,
> bool locked = !!(vma->vm_flags & VM_LOCKED);
> struct page *page = NULL;
> bool present = false, young = false, dirty = false;
> + bool cont = false;
> pte_t ptent = ptep_get(pte);
>
> if (pte_present(ptent)) {
> page = vm_normal_page(vma, addr, ptent);
> young = pte_young(ptent);
> dirty = pte_dirty(ptent);
> + cont = pte_cont(ptent);
No, none of this low-level pte_cont fiddling in common code.
We have folio_pte_batch() to batch over folio ptes. And we want some
better page table walkers to just do the batching for us:
https://lore.kernel.org/r/20260412174244.133715-1-osalvador@suse.de
--
Cheers,
David
* [RFC PATCH 2/2] filemap: use high-order folios in filemap sync RA
2026-04-15 19:28 [RFC PATCH 0/2] Use high-order folios in mmap sync RA Anatoly Stepanov
2026-04-15 13:18 ` Matthew Wilcox
2026-04-15 19:28 ` [RFC PATCH 1/2] procfs: add contpte info into smaps Anatoly Stepanov
@ 2026-04-15 19:28 ` Anatoly Stepanov
2026-04-15 12:06 ` Pedro Falcato
2 siblings, 1 reply; 9+ messages in thread
From: Anatoly Stepanov @ 2026-04-15 19:28 UTC (permalink / raw)
To: willy, akpm, david, ljs, Liam.Howlett, vbabka, rppt, surenb,
mhocko, wangkefeng.wang, yanquanmin1, zuoze1, artem.kuzin,
gutierrez.asier
Cc: linux-fsdevel, linux-mm, linux-kernel, Anatoly Stepanov
[Idea]
If an mmap'ed file is accessed such that async RA never
kicks in, we might end up with only 0-order folios in the page cache.
If fault_around_bytes is larger than a single page, then
it's beneficial to use high-order folios, which brings a significant
filemap_map_pages() speedup.
So let's just use fault_around_bytes as a starting point here.
If an arch supports PTE coalescing, we get more of those for free
(see the arm64 example below).
We don't save the new order to "ra->order", so if async RA happens
later, it starts from order 0 as usual.
[Things to be discussed]
At the same time, I can see a drawback for 16K and 64K base pages: in
those cases fault_around will still be 64K by default. It may make more
sense for fault_around_bytes to be expressed as order-N of PAGE_SIZE
rather than a fixed byte count.
Another issue is when fault_around=0 but we'd still like to use
high-order folios for sync RA, e.g. for cont-PTE. For this we could use
something like "max(fault_around_order, cont_pte_order)", or introduce a
dedicated tunable such as "sync_mmap_order".
[Benchmark]
The simple benchmark below reads a 100M file in 4M (RA size) strides,
such that async RA doesn't kick in and the page cache ends up
filled with 0-order folios.
The patched kernel gives a ~3x increase in throughput once the page
cache has been populated.
The main speedup comes from filemap_map_pages() due to the use of
high-order folios.
As a bonus, we get better cont_pte bit coverage on arm64.
Example:
// Open 100M file and read every 4M chunk, given max_ra=4M
// Perform 10 runs, measure the throughput.
...
char *map = mmap(NULL, filesize, PROT_READ, MAP_PRIVATE, fd, 0);
if (map == MAP_FAILED) {
        perror("Error mapping file");
        close(fd);
        return 1;
}

struct timespec start, end;
clock_gettime(CLOCK_MONOTONIC, &start);

unsigned int size_4M = 4*1024*1024;
unsigned int num_reads = filesize / size_4M;
volatile char val;
for (unsigned int i = 0; i < num_reads; i++) {
        off_t offset = (off_t)i * size_4M;
        val = map[offset];
}

clock_gettime(CLOCK_MONOTONIC, &end);
clock_gettime(CLOCK_MONOTONIC, &end);
...
Before patch (last 3 runs):
...
Throughput: 127942.68 operations per second
Throughput: 133646.96 operations per second
Throughput: 134321.94 operations per second
// filemap_map_pages(), fault_around_bytes = 64K
Time per 10 runs: ~2000 usec
// "smaps" numbers for the test file:
Rss: 1600 kB
Private_Clean: 1600 kB
Referenced: 1540 kB
ContPTE: 0 kB
Patched kernel (last 3 runs):
...
Throughput: 366515.17 operations per second
Throughput: 404465.30 operations per second
Throughput: 370535.05 operations per second
// filemap_map_pages(), fault_around_bytes = 64K
Time per 10 runs: ~730 usec
// "smaps" numbers for the test file:
Rss: 1600 kB
Private_Clean: 1600 kB
Referenced: 1540 kB
ContPTE(Rss): 1536 kB
Signed-off-by: Anatoly Stepanov <stepanov.anatoly@huawei.com>
---
include/linux/pagemap.h | 1 +
mm/filemap.c | 1 +
mm/internal.h | 1 +
mm/memory.c | 2 +-
mm/readahead.c | 5 +++--
5 files changed, 7 insertions(+), 3 deletions(-)
diff --git a/include/linux/pagemap.h b/include/linux/pagemap.h
index ec442af3f..e133a3a6b 100644
--- a/include/linux/pagemap.h
+++ b/include/linux/pagemap.h
@@ -1359,6 +1359,7 @@ struct readahead_control {
struct file *file;
struct address_space *mapping;
struct file_ra_state *ra;
+ unsigned int sync_mmap_order;
/* private: use the readahead_* accessors instead */
pgoff_t _index;
unsigned int _nr_pages;
diff --git a/mm/filemap.c b/mm/filemap.c
index 406cef06b..1ed5a0688 100644
--- a/mm/filemap.c
+++ b/mm/filemap.c
@@ -3398,6 +3398,7 @@ static struct file *do_sync_mmap_readahead(struct vm_fault *vmf)
ra->size = ra->ra_pages;
ra->async_size = ra->ra_pages / 4;
ra->order = 0;
+ ractl.sync_mmap_order = __ffs(fault_around_pages);
}
fpin = maybe_unlock_mmap_for_io(vmf, fpin);
diff --git a/mm/internal.h b/mm/internal.h
index cb0af847d..96157c82b 100644
--- a/mm/internal.h
+++ b/mm/internal.h
@@ -1770,4 +1770,5 @@ static inline int io_remap_pfn_range_complete(struct vm_area_struct *vma,
return remap_pfn_range_complete(vma, addr, pfn, size, prot);
}
+extern unsigned long fault_around_pages;
#endif /* __MM_INTERNAL_H */
diff --git a/mm/memory.c b/mm/memory.c
index 2f815a34d..57ae027dd 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -5670,7 +5670,7 @@ vm_fault_t finish_fault(struct vm_fault *vmf)
return ret;
}
-static unsigned long fault_around_pages __read_mostly =
+unsigned long fault_around_pages __read_mostly =
65536 >> PAGE_SHIFT;
#ifdef CONFIG_DEBUG_FS
diff --git a/mm/readahead.c b/mm/readahead.c
index 7b05082c8..322bc115b 100644
--- a/mm/readahead.c
+++ b/mm/readahead.c
@@ -476,7 +476,7 @@ void page_cache_ra_order(struct readahead_control *ractl,
unsigned int nofs;
int err = 0;
gfp_t gfp = readahead_gfp_mask(mapping);
- unsigned int new_order = ra->order;
+ unsigned int new_order = max(ra->order, ractl->sync_mmap_order);
trace_page_cache_ra_order(mapping->host, start, ra);
if (!mapping_large_folio_support(mapping)) {
@@ -490,7 +490,8 @@ void page_cache_ra_order(struct readahead_control *ractl,
new_order = min_t(unsigned int, new_order, ilog2(ra->size));
new_order = max(new_order, min_order);
- ra->order = new_order;
+ if (ra->order >= ractl->sync_mmap_order)
+ ra->order = new_order;
/* See comment in page_cache_ra_unbounded() */
nofs = memalloc_nofs_save();
--
2.34.1
* Re: [RFC PATCH 2/2] filemap: use high-order folios in filemap sync RA
2026-04-15 19:28 ` [RFC PATCH 2/2] filemap: use high-order folios in filemap sync RA Anatoly Stepanov
@ 2026-04-15 12:06 ` Pedro Falcato
2026-04-15 12:31 ` Stepanov Anatoly
2026-04-15 12:46 ` Stepanov Anatoly
0 siblings, 2 replies; 9+ messages in thread
From: Pedro Falcato @ 2026-04-15 12:06 UTC (permalink / raw)
To: Anatoly Stepanov
Cc: willy, akpm, david, ljs, Liam.Howlett, vbabka, rppt, surenb,
mhocko, wangkefeng.wang, yanquanmin1, zuoze1, artem.kuzin,
gutierrez.asier, linux-fsdevel, linux-mm, linux-kernel
On Thu, Apr 16, 2026 at 03:28:53AM +0800, Anatoly Stepanov wrote:
> [Idea]
>
> If a mmap'ed file being accessed such that async RA never
> kicks in, we might end up with only 0-order folios in the page cache.
>
> if fault_around_bytes is larger than 1 single page, then
> it's beneficial to use high-order folios, which brings significant
> filemap_map_pages() speedup.
> So, let's just use fault_around_bytes as a starting point here.
Well, this heuristic looks arbitrary. I don't like to mix different concepts.
With this, in practice most file folios will be 64K. Why? Why is it related
to faultaround when faultaround is a separate mechanism that isn't particularly
relevant here?
>
> if an arch supports PTE-coalescing we can get more of those for free.
> (see arm64 example below)
>
> We don't save the new order to "ra->order", so if async RA will happen
> it would normally start from order-0.
>
> [Things to be discussed]
>
> But at the same time, i can see drawback for 16K, 64K pages, in this case fault_around will still be 64K by default.
> In this case, it seems makes sense to make the fault_around_bytes be like order-N of PAGE_SIZE, not fixed bytes number.
>
> Another issue is - when fault_around=0, but we'd like to use high-order folios for sync_RA, for cont-PTE for example,
> For this we can use kind of "max(fault_around_order, cont_pte_order)".
>
> Or introduce some dedicated tunable like "sync_mmap_order".
>
> [Benchmark]
>
> Simple benchmark below reading 100M file in 4M (RA size) chunks
> such that async RA doesn't kick in and the page cache ends up being
> filled up with 0-order folios.
Well, the problem is that you are _never_ getting RA to kick in. Folio
size is the least of your concern, you are effectively not doing much
readahead since the kernel thinks you're doing random accesses.
>
> The patched kernel gives ~3 times increase in throughput,
> considering the page cache is filled up at the moment.
>
> The main speedup comes from filemap_map_pages() due to high-order
> folios usage.
>
> As a bonus, we get better cont_pte bit coverage for Arm64.
>
> Example:
> // Open 100M file and read every 4M chunk, given max_ra=4M
> // Perform 10 runs, measure the throughput.
> ...
> char *map = mmap(NULL, filesize, PROT_READ, MAP_PRIVATE, fd, 0);
> if (map == MAP_FAILED) {
> perror("Error mapping file");
> close(fd);
> return 1;
> }
>
> struct timespec start, end;
> clock_gettime(CLOCK_MONOTONIC, &start);
>
> unsigned int size_4M = 4*1024*1024;
> unsigned int num_reads = filesize / size_4M;
> volatile char val;
> for (int i = 0; i < num_reads; i++) {
> off_t offset = (off_t)i * size_4M;
> val = map[offset];
> }
This doesn't seem like a real issue. And if it is, you can always issue
readahead manually. But the whole pattern of "every perfectly-sized RA
window, access 4 bytes and advance" is completely bizarre. And _if_ this
is your workload, then having order-0 folios at the read site is much better
than filling your page cache with data you are not accessing.
Do you have an actual use case for this? Where have you observed these
problems?
--
Pedro
* Re: [RFC PATCH 2/2] filemap: use high-order folios in filemap sync RA
2026-04-15 12:06 ` Pedro Falcato
@ 2026-04-15 12:31 ` Stepanov Anatoly
2026-04-15 12:46 ` Stepanov Anatoly
1 sibling, 0 replies; 9+ messages in thread
From: Stepanov Anatoly @ 2026-04-15 12:31 UTC (permalink / raw)
To: Pedro Falcato
Cc: willy, akpm, david, ljs, Liam.Howlett, vbabka, rppt, surenb,
mhocko, wangkefeng.wang, yanquanmin1, zuoze1, artem.kuzin,
gutierrez.asier, linux-fsdevel, linux-mm, linux-kernel
On 4/15/2026 3:06 PM, Pedro Falcato wrote:
> On Thu, Apr 16, 2026 at 03:28:53AM +0800, Anatoly Stepanov wrote:
>> [Idea]
>>
>> If a mmap'ed file being accessed such that async RA never
>> kicks in, we might end up with only 0-order folios in the page cache.
>>
>> if fault_around_bytes is larger than 1 single page, then
>> it's beneficial to use high-order folios, which brings significant
>> filemap_map_pages() speedup.
>> So, let's just use fault_around_bytes as a starting point here.
>
> Well, this heuristic looks arbitrary. I don't like to mix different concepts.
>
> With this, in practice most file folios will be 64K. Why? Why is it related
> to faultaround when faultaround is a separate mechanism that isn't particularly
> relevant here?
>
fault_around_bytes > 4K means we need to iterate over folios in the page
cache; with high orders that iteration is obviously faster, which is
shown below in the benchmark. So the heuristic actually makes sense.
Regarding the value itself, I don't have a perfect answer, for instance
for 16K/64K base pages or when fault_around is disabled.
That's why I would like to gather feedback from the community on this.
>>
>> if an arch supports PTE-coalescing we can get more of those for free.
>> (see arm64 example below)
>>
>> We don't save the new order to "ra->order", so if async RA will happen
>> it would normally start from order-0.
>>
>> [Things to be discussed]
>>
>> But at the same time, i can see drawback for 16K, 64K pages, in this case fault_around will still be 64K by default.
>> In this case, it seems makes sense to make the fault_around_bytes be like order-N of PAGE_SIZE, not fixed bytes number.
>>
>> Another issue is - when fault_around=0, but we'd like to use high-order folios for sync_RA, for cont-PTE for example,
>> For this we can use kind of "max(fault_around_order, cont_pte_order)".
>>
>> Or introduce some dedicated tunable like "sync_mmap_order".
>>
>> [Benchmark]
>>
>> Simple benchmark below reading 100M file in 4M (RA size) chunks
>> such that async RA doesn't kick in and the page cache ends up being
>> filled up with 0-order folios.
>
> Well, the problem is that you are _never_ getting RA to kick in. Folio
> size is the least of your concern, you are effectively not doing much
> readahead since the kernel thinks you're doing random accesses.
>>
>> The patched kernel gives ~3 times increase in throughput,
>> considering the page cache is filled up at the moment.
>>
>> The main speedup comes from filemap_map_pages() due to high-order
>> folios usage.
>>
>> As a bonus, we get better cont_pte bit coverage for Arm64.
>>
>> Example:
>> // Open 100M file and read every 4M chunk, given max_ra=4M
>> // Perform 10 runs, measure the throughput.
>> ...
>> char *map = mmap(NULL, filesize, PROT_READ, MAP_PRIVATE, fd, 0);
>> if (map == MAP_FAILED) {
>> perror("Error mapping file");
>> close(fd);
>> return 1;
>> }
>>
>> struct timespec start, end;
>> clock_gettime(CLOCK_MONOTONIC, &start);
>>
>> unsigned int size_4M = 4*1024*1024;
>> unsigned int num_reads = filesize / size_4M;
>> volatile char val;
>> for (int i = 0; i < num_reads; i++) {
>> off_t offset = (off_t)i * size_4M;
>> val = map[offset];
>> }
>
> This doesn't seem like a real issue. And if it is, you can always issue
> readahead manually. But the whole pattern of "every perfectly-sized RA
> window, access 4 bytes and advance" is completely bizarre. And _if_ this
> is your workload, then having order-0 folios at the read site is much better
> than filling your page cache with data you are not accessing.
This benchmark is only intended to highlight a possible case where async RA
doesn't kick in and we can easily get more performance by increasing the RA
order.
>
> Do you have an actual use case for this? Where have you observed these
> problems?
>
If you're asking about a real production scenario, I don't have one yet.
--
Anatoly Stepanov, Huawei
2026-04-15 12:06 ` Pedro Falcato
2026-04-15 12:31 ` Stepanov Anatoly
@ 2026-04-15 12:46 ` Stepanov Anatoly
1 sibling, 0 replies; 9+ messages in thread
From: Stepanov Anatoly @ 2026-04-15 12:46 UTC (permalink / raw)
To: Pedro Falcato
Cc: willy, akpm, david, ljs, Liam.Howlett, vbabka, rppt, surenb,
mhocko, wangkefeng.wang, yanquanmin1, zuoze1, artem.kuzin,
gutierrez.asier, linux-fsdevel, linux-mm, linux-kernel
On 4/15/2026 3:06 PM, Pedro Falcato wrote:
> On Thu, Apr 16, 2026 at 03:28:53AM +0800, Anatoly Stepanov wrote:
>> [Idea]
>>
>> If a mmap'ed file being accessed such that async RA never
>> kicks in, we might end up with only 0-order folios in the page cache.
>>
>> if fault_around_bytes is larger than 1 single page, then
>> it's beneficial to use high-order folios, which brings significant
>> filemap_map_pages() speedup.
>> So, let's just use fault_around_bytes as a starting point here.
>
> Well, this heuristic looks arbitrary. I don't like to mix different concepts.
>
> With this, in practice most file folios will be 64K. Why? Why is it related
> to faultaround when faultaround is a separate mechanism that isn't particularly
> relevant here?
>
>>
>> if an arch supports PTE-coalescing we can get more of those for free.
>> (see arm64 example below)
>>
>> We don't save the new order to "ra->order", so if async RA will happen
>> it would normally start from order-0.
>>
>> [Things to be discussed]
>>
>> But at the same time, i can see drawback for 16K, 64K pages, in this case fault_around will still be 64K by default.
>> In this case, it seems makes sense to make the fault_around_bytes be like order-N of PAGE_SIZE, not fixed bytes number.
>>
>> Another issue is - when fault_around=0, but we'd like to use high-order folios for sync_RA, for cont-PTE for example,
>> For this we can use kind of "max(fault_around_order, cont_pte_order)".
>>
>> Or introduce some dedicated tunable like "sync_mmap_order".
>>
>> [Benchmark]
>>
>> Simple benchmark below reading 100M file in 4M (RA size) chunks
>> such that async RA doesn't kick in and the page cache ends up being
>> filled up with 0-order folios.
>
> Well, the problem is that you are _never_ getting RA to kick in. Folio
> size is the least of your concern, you are effectively not doing much
> readahead since the kernel thinks you're doing random accesses.
No, that's not true: "sync mmap readahead" actually works in this case;
the problem is that "async RA" doesn't kick in.
>>
>> The patched kernel gives ~3 times increase in throughput,
>> considering the page cache is filled up at the moment.
>>
>> The main speedup comes from filemap_map_pages() due to high-order
>> folios usage.
>>
>> As a bonus, we get better cont_pte bit coverage for Arm64.
>>
>> Example:
>> // Open 100M file and read every 4M chunk, given max_ra=4M
>> // Perform 10 runs, measure the throughput.
>> ...
>> char *map = mmap(NULL, filesize, PROT_READ, MAP_PRIVATE, fd, 0);
>> if (map == MAP_FAILED) {
>> perror("Error mapping file");
>> close(fd);
>> return 1;
>> }
>>
>> struct timespec start, end;
>> clock_gettime(CLOCK_MONOTONIC, &start);
>>
>> unsigned int size_4M = 4*1024*1024;
>> unsigned int num_reads = filesize / size_4M;
>> volatile char val;
>> for (int i = 0; i < num_reads; i++) {
>> off_t offset = (off_t)i * size_4M;
>> val = map[offset];
>> }
>
> This doesn't seem like a real issue. And if it is, you can always issue
> readahead manually. But the whole pattern of "every perfectly-sized RA
> window, access 4 bytes and advance" is completely bizarre. And _if_ this
> is your workload, then having order-0 folios at the read site is much better
> than filling your page cache with data you are not accessing.
>
> Do you have an actual use case for this? Where have you observed these
> problems?
>
--
Anatoly Stepanov, Huawei