linux-mm.kvack.org archive mirror
* [PATCH v3] mm: readahead: make thp readahead conditional to mmap_miss logic
@ 2025-10-06 17:51 Roman Gushchin
  2025-10-07  4:33 ` Dev Jain
                   ` (2 more replies)
  0 siblings, 3 replies; 7+ messages in thread
From: Roman Gushchin @ 2025-10-06 17:51 UTC
  To: Andrew Morton
  Cc: linux-kernel, Roman Gushchin, Matthew Wilcox (Oracle),
	Jan Kara, Dev Jain, linux-mm

Commit 4687fdbb805a ("mm/filemap: Support VM_HUGEPAGE for file mappings")
introduced special handling for VM_HUGEPAGE mappings: even if
readahead is disabled, 1 or 2 HPAGE_PMD_ORDER pages are
allocated.

This change causes a significant regression for containers with a
tight memory.max limit when VM_HUGEPAGE is widely used. Prior to this
commit, the mmap_miss logic would eventually disable readahead,
effectively reducing the memory pressure in the cgroup. With this
change the kernel tries to allocate 1-2 huge pages for each fault,
whether or not these pages are used before being evicted, increasing
the memory pressure multi-fold.

To fix the regression, make the VM_HUGEPAGE path conditional on the
mmap_miss check, but keep it independent of ra->ra_pages. This way
the main intention of commit 4687fdbb805a ("mm/filemap: Support
VM_HUGEPAGE for file mappings") stays intact, but the regression is
resolved.

The logic behind this change is simple: even if a user explicitly
requests that huge pages back the file mapping (via the VM_HUGEPAGE
flag), under very strong memory pressure it is better to fall back
to ordinary pages.

Signed-off-by: Roman Gushchin <roman.gushchin@linux.dev>
Cc: Matthew Wilcox (Oracle) <willy@infradead.org>
Cc: Jan Kara <jack@suse.cz>
Cc: Dev Jain <dev.jain@arm.com>
Cc: linux-mm@kvack.org

--

v3: fixed VM_SEQ_READ handling for the THP case (by Jan Kara)
v2: fixed VM_SEQ_READ handling (by Dev Jain)
---
 mm/filemap.c | 68 +++++++++++++++++++++++++++++-----------------------
 1 file changed, 38 insertions(+), 30 deletions(-)

diff --git a/mm/filemap.c b/mm/filemap.c
index a52dd38d2b4a..ec731ac05551 100644
--- a/mm/filemap.c
+++ b/mm/filemap.c
@@ -3235,11 +3235,47 @@ static struct file *do_sync_mmap_readahead(struct vm_fault *vmf)
 	DEFINE_READAHEAD(ractl, file, ra, mapping, vmf->pgoff);
 	struct file *fpin = NULL;
 	vm_flags_t vm_flags = vmf->vma->vm_flags;
+	bool force_thp_readahead = false;
 	unsigned short mmap_miss;
 
-#ifdef CONFIG_TRANSPARENT_HUGEPAGE
 	/* Use the readahead code, even if readahead is disabled */
-	if ((vm_flags & VM_HUGEPAGE) && HPAGE_PMD_ORDER <= MAX_PAGECACHE_ORDER) {
+	if (IS_ENABLED(CONFIG_TRANSPARENT_HUGEPAGE) &&
+	    (vm_flags & VM_HUGEPAGE) && HPAGE_PMD_ORDER <= MAX_PAGECACHE_ORDER)
+		force_thp_readahead = true;
+
+	if (!force_thp_readahead) {
+		/*
+		 * If we don't want any read-ahead, don't bother.
+		 * VM_EXEC case below is already intended for random access.
+		 */
+		if ((vm_flags & (VM_RAND_READ | VM_EXEC)) == VM_RAND_READ)
+			return fpin;
+
+		if (!ra->ra_pages)
+			return fpin;
+
+		if (vm_flags & VM_SEQ_READ) {
+			fpin = maybe_unlock_mmap_for_io(vmf, fpin);
+			page_cache_sync_ra(&ractl, ra->ra_pages);
+			return fpin;
+		}
+	}
+
+	if (!(vm_flags & VM_SEQ_READ)) {
+		/* Avoid banging the cache line if not needed */
+		mmap_miss = READ_ONCE(ra->mmap_miss);
+		if (mmap_miss < MMAP_LOTSAMISS * 10)
+			WRITE_ONCE(ra->mmap_miss, ++mmap_miss);
+
+		/*
+		 * Do we miss much more than hit in this file? If so,
+		 * stop bothering with read-ahead. It will only hurt.
+		 */
+		if (mmap_miss > MMAP_LOTSAMISS)
+			return fpin;
+	}
+
+	if (force_thp_readahead) {
 		fpin = maybe_unlock_mmap_for_io(vmf, fpin);
 		ractl._index &= ~((unsigned long)HPAGE_PMD_NR - 1);
 		ra->size = HPAGE_PMD_NR;
@@ -3254,34 +3290,6 @@ static struct file *do_sync_mmap_readahead(struct vm_fault *vmf)
 		page_cache_ra_order(&ractl, ra);
 		return fpin;
 	}
-#endif
-
-	/*
-	 * If we don't want any read-ahead, don't bother. VM_EXEC case below is
-	 * already intended for random access.
-	 */
-	if ((vm_flags & (VM_RAND_READ | VM_EXEC)) == VM_RAND_READ)
-		return fpin;
-	if (!ra->ra_pages)
-		return fpin;
-
-	if (vm_flags & VM_SEQ_READ) {
-		fpin = maybe_unlock_mmap_for_io(vmf, fpin);
-		page_cache_sync_ra(&ractl, ra->ra_pages);
-		return fpin;
-	}
-
-	/* Avoid banging the cache line if not needed */
-	mmap_miss = READ_ONCE(ra->mmap_miss);
-	if (mmap_miss < MMAP_LOTSAMISS * 10)
-		WRITE_ONCE(ra->mmap_miss, ++mmap_miss);
-
-	/*
-	 * Do we miss much more than hit in this file? If so,
-	 * stop bothering with read-ahead. It will only hurt.
-	 */
-	if (mmap_miss > MMAP_LOTSAMISS)
-		return fpin;
 
 	if (vm_flags & VM_EXEC) {
 		/*
-- 
2.51.0




* Re: [PATCH v3] mm: readahead: make thp readahead conditional to mmap_miss logic
  2025-10-06 17:51 [PATCH v3] mm: readahead: make thp readahead conditional to mmap_miss logic Roman Gushchin
@ 2025-10-07  4:33 ` Dev Jain
  2025-10-07 11:41 ` Jan Kara
  2025-10-07 22:34 ` Andrew Morton
  2 siblings, 0 replies; 7+ messages in thread
From: Dev Jain @ 2025-10-07  4:33 UTC
  To: Roman Gushchin, Andrew Morton
  Cc: linux-kernel, Matthew Wilcox (Oracle), Jan Kara, linux-mm


On 06/10/25 11:21 pm, Roman Gushchin wrote:
> Commit 4687fdbb805a ("mm/filemap: Support VM_HUGEPAGE for file mappings")
> introduced special handling for VM_HUGEPAGE mappings: even if
> readahead is disabled, 1 or 2 HPAGE_PMD_ORDER pages are
> allocated.
>
> This change causes a significant regression for containers with a
> tight memory.max limit when VM_HUGEPAGE is widely used. Prior to this
> commit, the mmap_miss logic would eventually disable readahead,
> effectively reducing the memory pressure in the cgroup. With this
> change the kernel tries to allocate 1-2 huge pages for each fault,
> whether or not these pages are used before being evicted, increasing
> the memory pressure multi-fold.
>
> To fix the regression, make the VM_HUGEPAGE path conditional on the
> mmap_miss check, but keep it independent of ra->ra_pages. This way
> the main intention of commit 4687fdbb805a ("mm/filemap: Support
> VM_HUGEPAGE for file mappings") stays intact, but the regression is
> resolved.
>
> The logic behind this change is simple: even if a user explicitly
> requests that huge pages back the file mapping (via the VM_HUGEPAGE
> flag), under very strong memory pressure it is better to fall back
> to ordinary pages.
>
> Signed-off-by: Roman Gushchin <roman.gushchin@linux.dev>
> Cc: Matthew Wilcox (Oracle) <willy@infradead.org>
> Cc: Jan Kara <jack@suse.cz>
> Cc: Dev Jain <dev.jain@arm.com>
> Cc: linux-mm@kvack.org
>
> --
>
> v3: fixed VM_SEQ_READ handling for the THP case (by Jan Kara)
> v2: fixed VM_SEQ_READ handling (by Dev Jain)
>
>   

LGTM

Reviewed-by: Dev Jain <dev.jain@arm.com>




* Re: [PATCH v3] mm: readahead: make thp readahead conditional to mmap_miss logic
  2025-10-06 17:51 [PATCH v3] mm: readahead: make thp readahead conditional to mmap_miss logic Roman Gushchin
  2025-10-07  4:33 ` Dev Jain
@ 2025-10-07 11:41 ` Jan Kara
  2025-10-07 22:34 ` Andrew Morton
  2 siblings, 0 replies; 7+ messages in thread
From: Jan Kara @ 2025-10-07 11:41 UTC
  To: Roman Gushchin
  Cc: Andrew Morton, linux-kernel, Matthew Wilcox (Oracle),
	Jan Kara, Dev Jain, linux-mm

On Mon 06-10-25 10:51:06, Roman Gushchin wrote:
> Commit 4687fdbb805a ("mm/filemap: Support VM_HUGEPAGE for file mappings")
> introduced special handling for VM_HUGEPAGE mappings: even if
> readahead is disabled, 1 or 2 HPAGE_PMD_ORDER pages are
> allocated.
>
> This change causes a significant regression for containers with a
> tight memory.max limit when VM_HUGEPAGE is widely used. Prior to this
> commit, the mmap_miss logic would eventually disable readahead,
> effectively reducing the memory pressure in the cgroup. With this
> change the kernel tries to allocate 1-2 huge pages for each fault,
> whether or not these pages are used before being evicted, increasing
> the memory pressure multi-fold.
>
> To fix the regression, make the VM_HUGEPAGE path conditional on the
> mmap_miss check, but keep it independent of ra->ra_pages. This way
> the main intention of commit 4687fdbb805a ("mm/filemap: Support
> VM_HUGEPAGE for file mappings") stays intact, but the regression is
> resolved.
>
> The logic behind this change is simple: even if a user explicitly
> requests that huge pages back the file mapping (via the VM_HUGEPAGE
> flag), under very strong memory pressure it is better to fall back
> to ordinary pages.
> 
> Signed-off-by: Roman Gushchin <roman.gushchin@linux.dev>
> Cc: Matthew Wilcox (Oracle) <willy@infradead.org>
> Cc: Jan Kara <jack@suse.cz>
> Cc: Dev Jain <dev.jain@arm.com>
> Cc: linux-mm@kvack.org

Looks good. Feel free to add:

Reviewed-by: Jan Kara <jack@suse.cz>

								Honza

> 
> --
> 
> v3: fixed VM_SEQ_READ handling for the THP case (by Jan Kara)
> v2: fixed VM_SEQ_READ handling (by Dev Jain)
> ---
>  mm/filemap.c | 68 +++++++++++++++++++++++++++++-----------------------
>  1 file changed, 38 insertions(+), 30 deletions(-)
> 
> diff --git a/mm/filemap.c b/mm/filemap.c
> index a52dd38d2b4a..ec731ac05551 100644
> --- a/mm/filemap.c
> +++ b/mm/filemap.c
> @@ -3235,11 +3235,47 @@ static struct file *do_sync_mmap_readahead(struct vm_fault *vmf)
>  	DEFINE_READAHEAD(ractl, file, ra, mapping, vmf->pgoff);
>  	struct file *fpin = NULL;
>  	vm_flags_t vm_flags = vmf->vma->vm_flags;
> +	bool force_thp_readahead = false;
>  	unsigned short mmap_miss;
>  
> -#ifdef CONFIG_TRANSPARENT_HUGEPAGE
>  	/* Use the readahead code, even if readahead is disabled */
> -	if ((vm_flags & VM_HUGEPAGE) && HPAGE_PMD_ORDER <= MAX_PAGECACHE_ORDER) {
> +	if (IS_ENABLED(CONFIG_TRANSPARENT_HUGEPAGE) &&
> +	    (vm_flags & VM_HUGEPAGE) && HPAGE_PMD_ORDER <= MAX_PAGECACHE_ORDER)
> +		force_thp_readahead = true;
> +
> +	if (!force_thp_readahead) {
> +		/*
> +		 * If we don't want any read-ahead, don't bother.
> +		 * VM_EXEC case below is already intended for random access.
> +		 */
> +		if ((vm_flags & (VM_RAND_READ | VM_EXEC)) == VM_RAND_READ)
> +			return fpin;
> +
> +		if (!ra->ra_pages)
> +			return fpin;
> +
> +		if (vm_flags & VM_SEQ_READ) {
> +			fpin = maybe_unlock_mmap_for_io(vmf, fpin);
> +			page_cache_sync_ra(&ractl, ra->ra_pages);
> +			return fpin;
> +		}
> +	}
> +
> +	if (!(vm_flags & VM_SEQ_READ)) {
> +		/* Avoid banging the cache line if not needed */
> +		mmap_miss = READ_ONCE(ra->mmap_miss);
> +		if (mmap_miss < MMAP_LOTSAMISS * 10)
> +			WRITE_ONCE(ra->mmap_miss, ++mmap_miss);
> +
> +		/*
> +		 * Do we miss much more than hit in this file? If so,
> +		 * stop bothering with read-ahead. It will only hurt.
> +		 */
> +		if (mmap_miss > MMAP_LOTSAMISS)
> +			return fpin;
> +	}
> +
> +	if (force_thp_readahead) {
>  		fpin = maybe_unlock_mmap_for_io(vmf, fpin);
>  		ractl._index &= ~((unsigned long)HPAGE_PMD_NR - 1);
>  		ra->size = HPAGE_PMD_NR;
> @@ -3254,34 +3290,6 @@ static struct file *do_sync_mmap_readahead(struct vm_fault *vmf)
>  		page_cache_ra_order(&ractl, ra);
>  		return fpin;
>  	}
> -#endif
> -
> -	/*
> -	 * If we don't want any read-ahead, don't bother. VM_EXEC case below is
> -	 * already intended for random access.
> -	 */
> -	if ((vm_flags & (VM_RAND_READ | VM_EXEC)) == VM_RAND_READ)
> -		return fpin;
> -	if (!ra->ra_pages)
> -		return fpin;
> -
> -	if (vm_flags & VM_SEQ_READ) {
> -		fpin = maybe_unlock_mmap_for_io(vmf, fpin);
> -		page_cache_sync_ra(&ractl, ra->ra_pages);
> -		return fpin;
> -	}
> -
> -	/* Avoid banging the cache line if not needed */
> -	mmap_miss = READ_ONCE(ra->mmap_miss);
> -	if (mmap_miss < MMAP_LOTSAMISS * 10)
> -		WRITE_ONCE(ra->mmap_miss, ++mmap_miss);
> -
> -	/*
> -	 * Do we miss much more than hit in this file? If so,
> -	 * stop bothering with read-ahead. It will only hurt.
> -	 */
> -	if (mmap_miss > MMAP_LOTSAMISS)
> -		return fpin;
>  
>  	if (vm_flags & VM_EXEC) {
>  		/*
> -- 
> 2.51.0
> 
-- 
Jan Kara <jack@suse.com>
SUSE Labs, CR



* Re: [PATCH v3] mm: readahead: make thp readahead conditional to mmap_miss logic
  2025-10-06 17:51 [PATCH v3] mm: readahead: make thp readahead conditional to mmap_miss logic Roman Gushchin
  2025-10-07  4:33 ` Dev Jain
  2025-10-07 11:41 ` Jan Kara
@ 2025-10-07 22:34 ` Andrew Morton
  2025-10-07 22:52   ` Roman Gushchin
  2 siblings, 1 reply; 7+ messages in thread
From: Andrew Morton @ 2025-10-07 22:34 UTC
  To: Roman Gushchin
  Cc: linux-kernel, Matthew Wilcox (Oracle), Jan Kara, Dev Jain, linux-mm

On Mon,  6 Oct 2025 10:51:06 -0700 Roman Gushchin <roman.gushchin@linux.dev> wrote:

> Commit 4687fdbb805a ("mm/filemap: Support VM_HUGEPAGE for file mappings")
> introduced special handling for VM_HUGEPAGE mappings: even if
> readahead is disabled, 1 or 2 HPAGE_PMD_ORDER pages are
> allocated.

Three years ago.

> This change causes a significant regression

So no backport suggested?  I guess reasonable given how long 4687fdbb805a has
been in tree.

>
> ...
>

> for containers with a
> tight memory.max limit when VM_HUGEPAGE is widely used. Prior to this
> commit, the mmap_miss logic would eventually disable readahead,
> effectively reducing the memory pressure in the cgroup. With this
> change the kernel tries to allocate 1-2 huge pages for each fault,
> whether or not these pages are used before being evicted, increasing
> the memory pressure multi-fold.
>
> ...
>
> Signed-off-by: Roman Gushchin <roman.gushchin@linux.dev>
> Cc: Matthew Wilcox (Oracle) <willy@infradead.org>
> Cc: Jan Kara <jack@suse.cz>
> Cc: Dev Jain <dev.jain@arm.com>

But I'll slap the Fixes: in there, it might help someone.



* Re: [PATCH v3] mm: readahead: make thp readahead conditional to mmap_miss logic
  2025-10-07 22:34 ` Andrew Morton
@ 2025-10-07 22:52   ` Roman Gushchin
  2025-10-08  0:53     ` Andrew Morton
  0 siblings, 1 reply; 7+ messages in thread
From: Roman Gushchin @ 2025-10-07 22:52 UTC
  To: Andrew Morton
  Cc: linux-kernel, Matthew Wilcox (Oracle), Jan Kara, Dev Jain, linux-mm

Andrew Morton <akpm@linux-foundation.org> writes:

> On Mon,  6 Oct 2025 10:51:06 -0700 Roman Gushchin <roman.gushchin@linux.dev> wrote:
>
>> Commit 4687fdbb805a ("mm/filemap: Support VM_HUGEPAGE for file mappings")
>> introduced special handling for VM_HUGEPAGE mappings: even if
>> readahead is disabled, 1 or 2 HPAGE_PMD_ORDER pages are
>> allocated.
>
> Three years ago.
>
>> This change causes a significant regression
>
> So no backport suggested?  I guess reasonable given how long 4687fdbb805a has
> been in tree.

Yes, this was my thinking. Also you need a very specific setup to reveal
this regression.

>> Signed-off-by: Roman Gushchin <roman.gushchin@linux.dev>
>> Cc: Matthew Wilcox (Oracle) <willy@infradead.org>
>> Cc: Jan Kara <jack@suse.cz>
>> Cc: Dev Jain <dev.jain@arm.com>
>
> But I'll slap the Fixes: in there, it might help someone.

I'd do exactly what you suggested: Fixes + no stable backport.

But I guess it still might end up in the LTS tree thanks to
the automation picking up all fixes. Should be ok too.

Thanks!



* Re: [PATCH v3] mm: readahead: make thp readahead conditional to mmap_miss logic
  2025-10-07 22:52   ` Roman Gushchin
@ 2025-10-08  0:53     ` Andrew Morton
  2025-10-08  2:07       ` Roman Gushchin
  0 siblings, 1 reply; 7+ messages in thread
From: Andrew Morton @ 2025-10-08  0:53 UTC
  To: Roman Gushchin
  Cc: linux-kernel, Matthew Wilcox (Oracle), Jan Kara, Dev Jain, linux-mm

On Tue, 07 Oct 2025 15:52:49 -0700 Roman Gushchin <roman.gushchin@linux.dev> wrote:

> >> Signed-off-by: Roman Gushchin <roman.gushchin@linux.dev>
> >> Cc: Matthew Wilcox (Oracle) <willy@infradead.org>
> >> Cc: Jan Kara <jack@suse.cz>
> >> Cc: Dev Jain <dev.jain@arm.com>
> >
> > But I'll slap the Fixes: in there, it might help someone.
> 
> I'd do exactly what you suggested: Fixes + no stable backport.
> 
> But I guess it still might end up in the LTS tree thanks to
> the automation picking up all fixes. Should be ok too.

They've been asked not to override the MM developers' decisions (ie, mm
is special).

I'm not sure how reliable this is...  And I'm not sure how they
identify the dont-do-that patches.  Maybe mm/*, maybe mm.git, maybe
s-o-b:akpm.  But I haven't seen any transgressions in a year or three.




* Re: [PATCH v3] mm: readahead: make thp readahead conditional to mmap_miss logic
  2025-10-08  0:53     ` Andrew Morton
@ 2025-10-08  2:07       ` Roman Gushchin
  0 siblings, 0 replies; 7+ messages in thread
From: Roman Gushchin @ 2025-10-08  2:07 UTC
  To: Andrew Morton
  Cc: linux-kernel, Matthew Wilcox (Oracle), Jan Kara, Dev Jain, linux-mm

Andrew Morton <akpm@linux-foundation.org> writes:

> On Tue, 07 Oct 2025 15:52:49 -0700 Roman Gushchin <roman.gushchin@linux.dev> wrote:
>
>> >> Signed-off-by: Roman Gushchin <roman.gushchin@linux.dev>
>> >> Cc: Matthew Wilcox (Oracle) <willy@infradead.org>
>> >> Cc: Jan Kara <jack@suse.cz>
>> >> Cc: Dev Jain <dev.jain@arm.com>
>> >
>> > But I'll slap the Fixes: in there, it might help someone.
>> 
>> I'd do exactly what you suggested: Fixes + no stable backport.
>> 
>> But I guess it still might end up in the LTS tree thanks to
>> the automation picking up all fixes. Should be ok too.
>
> They've been asked not to override the MM developers' decisions (ie, mm
> is special).
>
> I'm not sure how reliable this is...  And I'm not sure how they
> identify the dont-do-that patches.  Maybe mm/*, maybe mm.git, maybe
> s-o-b:akpm.  But I haven't seen any transgressions in a year or three.

Nice! I didn't know this.


