* [PATCH RFC 0/2] tweaks for costly order __GFP_NORETRY reclaim
@ 2025-12-16 15:54 Vlastimil Babka
  2025-12-16 15:54 ` [PATCH RFC 1/2] mm, page_alloc, thp: prevent reclaim for __GFP_THISNODE THP allocations Vlastimil Babka
  2025-12-16 15:54 ` [PATCH RFC 2/2] mm, page_alloc: fail costly __GFP_NORETRY allocations faster Vlastimil Babka
  0 siblings, 2 replies; 11+ messages in thread
From: Vlastimil Babka @ 2025-12-16 15:54 UTC (permalink / raw)
  To: Andrew Morton, Suren Baghdasaryan, Michal Hocko, Brendan Jackman,
	Johannes Weiner, Zi Yan, David Rientjes, David Hildenbrand,
	Lorenzo Stoakes, Liam R. Howlett, Mike Rapoport, Joshua Hahn,
	Pedro Falcato
  Cc: linux-mm, linux-kernel, Vlastimil Babka

This series is motivated by an excessive reclaim issue we saw in a
customer workload. It was associated with THP page faults and an
unexpected zone_reclaim_mode-like behavior that can happen for those.

The first patch is intended as a minimal targeted backportable fix for
the observed issue. I hope I'll be able to confirm it has fixed the
problem, but it might take some time.

The second patch is more of a "let's take a step back and rethink what
we're doing here" change, so it's separate, as it slightly changes the
semantics of __GFP_NORETRY.

Signed-off-by: Vlastimil Babka <vbabka@suse.cz>
---
Vlastimil Babka (2):
      mm, page_alloc, thp: prevent reclaim for __GFP_THISNODE THP allocations
      mm, page_alloc: fail costly __GFP_NORETRY allocations faster

 include/linux/gfp_types.h |  2 ++
 mm/page_alloc.c           | 33 +++------------------------------
 2 files changed, 5 insertions(+), 30 deletions(-)
---
base-commit: 8f0b4cce4481fb22653697cced8d0d04027cb1e8
change-id: 20251216-thp-thisnode-tweak-c9c2acb3a627

Best regards,
-- 
Vlastimil Babka <vbabka@suse.cz>




* [PATCH RFC 1/2] mm, page_alloc, thp: prevent reclaim for __GFP_THISNODE THP allocations
  2025-12-16 15:54 [PATCH RFC 0/2] tweaks for costly order __GFP_NORETRY reclaim Vlastimil Babka
@ 2025-12-16 15:54 ` Vlastimil Babka
  2025-12-16 16:26   ` Michal Hocko
                     ` (3 more replies)
  2025-12-16 15:54 ` [PATCH RFC 2/2] mm, page_alloc: fail costly __GFP_NORETRY allocations faster Vlastimil Babka
  1 sibling, 4 replies; 11+ messages in thread
From: Vlastimil Babka @ 2025-12-16 15:54 UTC (permalink / raw)
  To: Andrew Morton, Suren Baghdasaryan, Michal Hocko, Brendan Jackman,
	Johannes Weiner, Zi Yan, David Rientjes, David Hildenbrand,
	Lorenzo Stoakes, Liam R. Howlett, Mike Rapoport, Joshua Hahn,
	Pedro Falcato
  Cc: linux-mm, linux-kernel, Vlastimil Babka

Since commit cc638f329ef6 ("mm, thp: tweak reclaim/compaction effort of
local-only and all-node allocations"), THP page fault allocations have
settled on the following scheme (from the commit log):

1. local node only THP allocation with no reclaim, just compaction.
2. for madvised VMA's or when synchronous compaction is enabled always - THP
   allocation from any node with effort determined by global defrag setting
   and VMA madvise
3. fallback to base pages on any node

Recent customer reports however revealed we have a gap in step 1 above.
What we have seen is excessive reclaim due to THP page faults on a NUMA
node that's close to its high watermark, while other nodes have plenty
of free memory.

The problem with step 1 is that it promises no reclaim after the
compaction attempt; however, reclaim is only avoided for certain
compaction outcomes (deferred, or skipped due to insufficient free base
pages), and not e.g. when compaction is actually performed but fails (we
did see compact_fail vmstat counter increasing).

THP page faults can therefore exhibit a zone_reclaim_mode-like behavior,
which is not the intention.

Thus add a check for __GFP_THISNODE that corresponds to this exact
situation and prevents continuing with reclaim/compaction once the
initial compaction attempt isn't successful in allocating the page.
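
For illustration only, the decision this patch changes can be modeled in
userspace C roughly as follows. This is a sketch, not the kernel code: the
MODEL_* names, the flag bit values and model_proceeds_to_reclaim() are
hypothetical stand-ins, and the real logic lives in
__alloc_pages_slowpath().

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical stand-ins; the real definitions live in kernel headers. */
enum model_compact_result {
	MODEL_COMPACT_SKIPPED,	/* too few free base pages to compact */
	MODEL_COMPACT_DEFERRED,	/* compaction failed recently, backing off */
	MODEL_COMPACT_FAILED,	/* compaction ran but produced no page */
};

#define MODEL_GFP_NORETRY	(1u << 0)	/* assumed bit value */
#define MODEL_GFP_THISNODE	(1u << 1)	/* assumed bit value */

/*
 * Returns true when the modeled slowpath would continue to direct reclaim
 * after the initial compaction attempt did not yield a page.
 * with_thisnode_check selects between the behavior before (false) and
 * after (true) the check added by this patch.
 */
static bool model_proceeds_to_reclaim(unsigned int gfp, bool costly_order,
				      enum model_compact_result result,
				      bool with_thisnode_check)
{
	if (!(costly_order && (gfp & MODEL_GFP_NORETRY)))
		return true;	/* not the case this patch touches */

	if (result == MODEL_COMPACT_SKIPPED ||
	    result == MODEL_COMPACT_DEFERRED)
		return false;	/* pre-existing check: goto nopage */

	if (with_thisnode_check && (gfp & MODEL_GFP_THISNODE))
		return false;	/* check added here: goto nopage */

	return true;		/* one round of reclaim/compaction */
}
```

In this model, a costly MODEL_GFP_NORETRY | MODEL_GFP_THISNODE request whose
compaction ran and failed proceeds to reclaim when with_thisnode_check is
false, which is the gap described above, and fails immediately when it is
true.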

Note that commit cc638f329ef6 has not introduced this over-reclaim
possibility; it appears to exist in some form since commit 2f0799a0ffc0
("mm, thp: restore node-local hugepage allocations"). Followup commits
b39d0ee2632d ("mm, page_alloc: avoid expensive reclaim when compaction
may not succeed") and cc638f329ef6 have moved in the right direction,
but left the abovementioned gap.

Fixes: 2f0799a0ffc0 ("mm, thp: restore node-local hugepage allocations")
Signed-off-by: Vlastimil Babka <vbabka@suse.cz>
---
 mm/page_alloc.c | 14 ++++++++++++++
 1 file changed, 14 insertions(+)

diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index 822e05f1a964..e6fd1213328b 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -4788,6 +4788,20 @@ __alloc_pages_slowpath(gfp_t gfp_mask, unsigned int order,
 			    compact_result == COMPACT_DEFERRED)
 				goto nopage;

+			/*
+			 * THP page faults may attempt local node only first,
+			 * but are then allowed to only compact, not reclaim,
+			 * see alloc_pages_mpol()
+			 *
+			 * compaction can fail for other reasons than those
+			 * checked above and we don't want such THP allocations
+			 * to put reclaim pressure on a single node in a
+			 * situation where other nodes might have plenty of
+			 * available memory
+			 */
+			if (gfp_mask & __GFP_THISNODE)
+				goto nopage;
+
 			/*
 			 * Looks like reclaim/compaction is worth trying, but
 			 * sync compaction could be very expensive, so keep

-- 
2.52.0


* [PATCH RFC 2/2] mm, page_alloc: fail costly __GFP_NORETRY allocations faster
  2025-12-16 15:54 [PATCH RFC 0/2] tweaks for costly order __GFP_NORETRY reclaim Vlastimil Babka
  2025-12-16 15:54 ` [PATCH RFC 1/2] mm, page_alloc, thp: prevent reclaim for __GFP_THISNODE THP allocations Vlastimil Babka
@ 2025-12-16 15:54 ` Vlastimil Babka
  2025-12-16 16:28   ` Michal Hocko
  2025-12-16 20:32   ` Johannes Weiner
  1 sibling, 2 replies; 11+ messages in thread
From: Vlastimil Babka @ 2025-12-16 15:54 UTC (permalink / raw)
  To: Andrew Morton, Suren Baghdasaryan, Michal Hocko, Brendan Jackman,
	Johannes Weiner, Zi Yan, David Rientjes, David Hildenbrand,
	Lorenzo Stoakes, Liam R. Howlett, Mike Rapoport, Joshua Hahn,
	Pedro Falcato
  Cc: linux-mm, linux-kernel, Vlastimil Babka

For allocations that are of costly order and __GFP_NORETRY (and can
perform compaction) we attempt direct compaction first. If that fails,
we continue with a single round of direct reclaim+compaction (as for
other __GFP_NORETRY allocations, except the compaction is of lower
priority), with two exceptions that fail immediately:

- __GFP_THISNODE is specified, to prevent zone_reclaim_mode-like
  behavior for e.g. THP page faults

- compaction failed because it was deferred (i.e. has been failing
  recently so further attempts are not done for a while) or skipped,
  which means there are insufficient free base pages to defragment to
  begin with

Upon closer inspection, the second condition rests on somewhat flawed
reasoning. If there are not enough base pages and reclaim could create
them, we instead fail. When there are enough base pages and compaction
has already run and failed, we proceed and hope that reclaim and the
subsequent compaction attempt will succeed. But it's unclear why they
should and whether it will be as inexpensive as intended.

It might therefore make more sense to just fail unconditionally after
the initial compaction attempt, so do that instead. Costly allocations
that do want the reclaim/compaction to happen at least once can omit
__GFP_NORETRY, or even specify __GFP_RETRY_MAYFAIL for more than one
attempt.

There is a slight potential unfairness in that costly __GFP_NORETRY
allocations that can't perform direct compaction (i.e. lack __GFP_IO)
will still be allowed to direct reclaim, while those that can direct
compact will now never attempt direct reclaim. However, in cases of
memory pressure causing compaction to be skipped due to insufficient
base pages, direct reclaim was already not done before, so there should
be no functional regressions from this change.
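
As a rough userspace sketch of the before/after behavior described above
(not the kernel code; the SKETCH_* names, the flag bit value and both
functions are hypothetical, the "before" model deliberately ignores the
__GFP_THISNODE exception, and the outer compaction condition is reduced to
a single can_compact flag):

```c
#include <assert.h>
#include <stdbool.h>

#define SKETCH_GFP_NORETRY	(1u << 0)	/* assumed bit value */

/* Hypothetical stand-in for the kernel's compaction result codes. */
enum sketch_compact_result {
	SKETCH_COMPACT_SKIPPED,		/* too few free base pages */
	SKETCH_COMPACT_DEFERRED,	/* recently failed, backing off */
	SKETCH_COMPACT_FAILED,		/* ran, but produced no page */
};

/* True if this allocation takes the "costly __GFP_NORETRY" special path. */
static bool sketch_special_path(unsigned int gfp, bool costly_order,
				bool can_compact)
{
	return can_compact && costly_order && (gfp & SKETCH_GFP_NORETRY);
}

/* Before: reclaim only follows a compaction that ran and failed. */
static bool sketch_reclaims_before(unsigned int gfp, bool costly_order,
				   bool can_compact,
				   enum sketch_compact_result result)
{
	if (!sketch_special_path(gfp, costly_order, can_compact))
		return true;
	return result == SKETCH_COMPACT_FAILED;
}

/* After: the special path never reaches direct reclaim. */
static bool sketch_reclaims_after(unsigned int gfp, bool costly_order,
				  bool can_compact)
{
	return !sketch_special_path(gfp, costly_order, can_compact);
}
```

The can_compact == false case models the unfairness mentioned above: such
an allocation skips the special path and still reaches direct reclaim.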

Signed-off-by: Vlastimil Babka <vbabka@suse.cz>
---
 include/linux/gfp_types.h |  2 ++
 mm/page_alloc.c           | 47 +++--------------------------------------------
 2 files changed, 5 insertions(+), 44 deletions(-)

diff --git a/include/linux/gfp_types.h b/include/linux/gfp_types.h
index 3de43b12209e..051311fdbdb1 100644
--- a/include/linux/gfp_types.h
+++ b/include/linux/gfp_types.h
@@ -218,6 +218,8 @@ enum {
  * caller must handle the failure which is quite likely to happen under
  * heavy memory pressure. The flag is suitable when failure can easily be
  * handled at small cost, such as reduced throughput.
+ * For costly orders, only memory compaction can be attempted with no reclaim
+ * under some conditions.
  *
  * %__GFP_RETRY_MAYFAIL: The VM implementation will retry memory reclaim
  * procedures that have previously failed if there is some indication
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index e6fd1213328b..2671cbbd6375 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -4763,52 +4763,11 @@ __alloc_pages_slowpath(gfp_t gfp_mask, unsigned int order,
 			goto got_pg;
 
 		/*
-		 * Checks for costly allocations with __GFP_NORETRY, which
-		 * includes some THP page fault allocations
+		 * Compaction didn't succeed and we were told not to try hard,
+		 * so fail now.
 		 */
 		if (costly_order && (gfp_mask & __GFP_NORETRY)) {
-			/*
-			 * If allocating entire pageblock(s) and compaction
-			 * failed because all zones are below low watermarks
-			 * or is prohibited because it recently failed at this
-			 * order, fail immediately unless the allocator has
-			 * requested compaction and reclaim retry.
-			 *
-			 * Reclaim is
-			 *  - potentially very expensive because zones are far
-			 *    below their low watermarks or this is part of very
-			 *    bursty high order allocations,
-			 *  - not guaranteed to help because isolate_freepages()
-			 *    may not iterate over freed pages as part of its
-			 *    linear scan, and
-			 *  - unlikely to make entire pageblocks free on its
-			 *    own.
-			 */
-			if (compact_result == COMPACT_SKIPPED ||
-			    compact_result == COMPACT_DEFERRED)
-				goto nopage;
-
-			/*
-			 * THP page faults may attempt local node only first,
-			 * but are then allowed to only compact, not reclaim,
-			 * see alloc_pages_mpol()
-			 *
-			 * compaction can fail for other reasons than those
-			 * checked above and we don't want such THP allocations
-			 * to put reclaim pressure on a single node in a
-			 * situation where other nodes might have plenty of
-			 * available memory
-			 */
-			if (gfp_mask & __GFP_THISNODE)
-				goto nopage;
-
-			/*
-			 * Looks like reclaim/compaction is worth trying, but
-			 * sync compaction could be very expensive, so keep
-			 * using async compaction.
-			 */
-			compact_priority = INIT_COMPACT_PRIORITY;
-		}
+			goto nopage;
 	}
 
 retry:

-- 
2.52.0




* Re: [PATCH RFC 1/2] mm, page_alloc, thp: prevent reclaim for __GFP_THISNODE THP allocations
  2025-12-16 15:54 ` [PATCH RFC 1/2] mm, page_alloc, thp: prevent reclaim for __GFP_THISNODE THP allocations Vlastimil Babka
@ 2025-12-16 16:26   ` Michal Hocko
  2025-12-16 20:11   ` Johannes Weiner
                     ` (2 subsequent siblings)
  3 siblings, 0 replies; 11+ messages in thread
From: Michal Hocko @ 2025-12-16 16:26 UTC (permalink / raw)
  To: Vlastimil Babka
  Cc: Andrew Morton, Suren Baghdasaryan, Brendan Jackman,
	Johannes Weiner, Zi Yan, David Rientjes, David Hildenbrand,
	Lorenzo Stoakes, Liam R. Howlett, Mike Rapoport, Joshua Hahn,
	Pedro Falcato, linux-mm, linux-kernel

On Tue 16-12-25 16:54:21, Vlastimil Babka wrote:
> Since commit cc638f329ef6 ("mm, thp: tweak reclaim/compaction effort of
> local-only and all-node allocations"), THP page fault allocations have
> settled on the following scheme (from the commit log):
> 
> 1. local node only THP allocation with no reclaim, just compaction.
> 2. for madvised VMA's or when synchronous compaction is enabled always - THP
>    allocation from any node with effort determined by global defrag setting
>    and VMA madvise
> 3. fallback to base pages on any node
> 
> Recent customer reports however revealed we have a gap in step 1 above.
> What we have seen is excessive reclaim due to THP page faults on a NUMA
> node that's close to its high watermark, while other nodes have plenty
> of free memory.
> 
> The problem with step 1 is that it promises no reclaim after the
> compaction attempt; however, reclaim is only avoided for certain
> compaction outcomes (deferred, or skipped due to insufficient free base
> pages), and not e.g. when compaction is actually performed but fails (we
> did see compact_fail vmstat counter increasing).
> 
> THP page faults can therefore exhibit a zone_reclaim_mode-like behavior,
> which is not the intention.
> 
> Thus add a check for __GFP_THISNODE that corresponds to this exact
> situation and prevents continuing with reclaim/compaction once the
> initial compaction attempt isn't successful in allocating the page.
> 
> Note that commit cc638f329ef6 has not introduced this over-reclaim
> possibility; it appears to exist in some form since commit 2f0799a0ffc0
> ("mm, thp: restore node-local hugepage allocations"). Followup commits
> b39d0ee2632d ("mm, page_alloc: avoid expensive reclaim when compaction
> may not succeed") and cc638f329ef6 have moved in the right direction,
> but left the abovementioned gap.
> 
> Fixes: 2f0799a0ffc0 ("mm, thp: restore node-local hugepage allocations")
> Signed-off-by: Vlastimil Babka <vbabka@suse.cz>

Yes, this makes sense as an intermediate state (to provide a fix for
stable and other older kernels that might be interested in it). I would
have objected that we should just simplify this whole thing, but you
have done that in patch 2.

Acked-by: Michal Hocko <mhocko@suse.com>
Thanks

> ---
>  mm/page_alloc.c | 14 ++++++++++++++
>  1 file changed, 14 insertions(+)
> 
> diff --git a/mm/page_alloc.c b/mm/page_alloc.c
> index 822e05f1a964..e6fd1213328b 100644
> --- a/mm/page_alloc.c
> +++ b/mm/page_alloc.c
> @@ -4788,6 +4788,20 @@ __alloc_pages_slowpath(gfp_t gfp_mask, unsigned int order,
>  			    compact_result == COMPACT_DEFERRED)
>  				goto nopage;
>  
> +			/*
> +			 * THP page faults may attempt local node only first,
> +			 * but are then allowed to only compact, not reclaim,
> +			 * see alloc_pages_mpol()
> +			 *
> +			 * compaction can fail for other reasons than those
> +			 * checked above and we don't want such THP allocations
> +			 * to put reclaim pressure on a single node in a
> +			 * situation where other nodes might have plenty of
> +			 * available memory
> +			 */
> +			if (gfp_mask & __GFP_THISNODE)
> +				goto nopage;
> +
>  			/*
>  			 * Looks like reclaim/compaction is worth trying, but
>  			 * sync compaction could be very expensive, so keep
> 
> -- 
> 2.52.0

-- 
Michal Hocko
SUSE Labs



* Re: [PATCH RFC 2/2] mm, page_alloc: fail costly __GFP_NORETRY allocations faster
  2025-12-16 15:54 ` [PATCH RFC 2/2] mm, page_alloc: fail costly __GFP_NORETRY allocations faster Vlastimil Babka
@ 2025-12-16 16:28   ` Michal Hocko
  2025-12-16 20:32   ` Johannes Weiner
  1 sibling, 0 replies; 11+ messages in thread
From: Michal Hocko @ 2025-12-16 16:28 UTC (permalink / raw)
  To: Vlastimil Babka
  Cc: Andrew Morton, Suren Baghdasaryan, Brendan Jackman,
	Johannes Weiner, Zi Yan, David Rientjes, David Hildenbrand,
	Lorenzo Stoakes, Liam R. Howlett, Mike Rapoport, Joshua Hahn,
	Pedro Falcato, linux-mm, linux-kernel

On Tue 16-12-25 16:54:22, Vlastimil Babka wrote:
> For allocations that are of costly order and __GFP_NORETRY (and can
> perform compaction) we attempt direct compaction first. If that fails,
> we continue with a single round of direct reclaim+compaction (as for
> other __GFP_NORETRY allocations, except the compaction is of lower
> priority), with two exceptions that fail immediately:
> 
> - __GFP_THISNODE is specified, to prevent zone_reclaim_mode-like
>   behavior for e.g. THP page faults
> 
> - compaction failed because it was deferred (i.e. has been failing
>   recently so further attempts are not done for a while) or skipped,
>   which means there are insufficient free base pages to defragment to
>   begin with
> 
> Upon closer inspection, the second condition rests on somewhat flawed
> reasoning. If there are not enough base pages and reclaim could create
> them, we instead fail. When there are enough base pages and compaction
> has already run and failed, we proceed and hope that reclaim and the
> subsequent compaction attempt will succeed. But it's unclear why they
> should and whether it will be as inexpensive as intended.
> 
> It might therefore make more sense to just fail unconditionally after
> the initial compaction attempt, so do that instead. Costly allocations
> that do want the reclaim/compaction to happen at least once can omit
> __GFP_NORETRY, or even specify __GFP_RETRY_MAYFAIL for more than one
> attempt.
> 
> There is a slight potential unfairness in that costly __GFP_NORETRY
> allocations that can't perform direct compaction (i.e. lack __GFP_IO)
> will still be allowed to direct reclaim, while those that can direct
> compact will now never attempt direct reclaim. However, in cases of
> memory pressure causing compaction to be skipped due to insufficient
> base pages, direct reclaim was already not done before, so there should
> be no functional regressions from this change.
> 
> Signed-off-by: Vlastimil Babka <vbabka@suse.cz>

I like this because, quite honestly, our attempts to over-optimize for
THP (which seems to be the only costly allocation with __GFP_NORETRY)
have turned out quite tricky and hard to reason about. So simplifying
this wrt. the compaction feedback makes a lot of sense. Let's see where
we get from here.

Acked-by: Michal Hocko <mhocko@suse.com>

Thanks!

> ---
>  include/linux/gfp_types.h |  2 ++
>  mm/page_alloc.c           | 47 +++--------------------------------------------
>  2 files changed, 5 insertions(+), 44 deletions(-)
> 
> diff --git a/include/linux/gfp_types.h b/include/linux/gfp_types.h
> index 3de43b12209e..051311fdbdb1 100644
> --- a/include/linux/gfp_types.h
> +++ b/include/linux/gfp_types.h
> @@ -218,6 +218,8 @@ enum {
>   * caller must handle the failure which is quite likely to happen under
>   * heavy memory pressure. The flag is suitable when failure can easily be
>   * handled at small cost, such as reduced throughput.
> + * For costly orders, only memory compaction can be attempted with no reclaim
> + * under some conditions.
>   *
>   * %__GFP_RETRY_MAYFAIL: The VM implementation will retry memory reclaim
>   * procedures that have previously failed if there is some indication
> diff --git a/mm/page_alloc.c b/mm/page_alloc.c
> index e6fd1213328b..2671cbbd6375 100644
> --- a/mm/page_alloc.c
> +++ b/mm/page_alloc.c
> @@ -4763,52 +4763,11 @@ __alloc_pages_slowpath(gfp_t gfp_mask, unsigned int order,
>  			goto got_pg;
>  
>  		/*
> -		 * Checks for costly allocations with __GFP_NORETRY, which
> -		 * includes some THP page fault allocations
> +		 * Compaction didn't succeed and we were told not to try hard,
> +		 * so fail now.
>  		 */
>  		if (costly_order && (gfp_mask & __GFP_NORETRY)) {
> -			/*
> -			 * If allocating entire pageblock(s) and compaction
> -			 * failed because all zones are below low watermarks
> -			 * or is prohibited because it recently failed at this
> -			 * order, fail immediately unless the allocator has
> -			 * requested compaction and reclaim retry.
> -			 *
> -			 * Reclaim is
> -			 *  - potentially very expensive because zones are far
> -			 *    below their low watermarks or this is part of very
> -			 *    bursty high order allocations,
> -			 *  - not guaranteed to help because isolate_freepages()
> -			 *    may not iterate over freed pages as part of its
> -			 *    linear scan, and
> -			 *  - unlikely to make entire pageblocks free on its
> -			 *    own.
> -			 */
> -			if (compact_result == COMPACT_SKIPPED ||
> -			    compact_result == COMPACT_DEFERRED)
> -				goto nopage;
> -
> -			/*
> -			 * THP page faults may attempt local node only first,
> -			 * but are then allowed to only compact, not reclaim,
> -			 * see alloc_pages_mpol()
> -			 *
> -			 * compaction can fail for other reasons than those
> -			 * checked above and we don't want such THP allocations
> -			 * to put reclaim pressure on a single node in a
> -			 * situation where other nodes might have plenty of
> -			 * available memory
> -			 */
> -			if (gfp_mask & __GFP_THISNODE)
> -				goto nopage;
> -
> -			/*
> -			 * Looks like reclaim/compaction is worth trying, but
> -			 * sync compaction could be very expensive, so keep
> -			 * using async compaction.
> -			 */
> -			compact_priority = INIT_COMPACT_PRIORITY;
> -		}
> +			goto nopage;
>  	}
>  
>  retry:
> 
> -- 
> 2.52.0

-- 
Michal Hocko
SUSE Labs



* Re: [PATCH RFC 1/2] mm, page_alloc, thp: prevent reclaim for __GFP_THISNODE THP allocations
  2025-12-16 15:54 ` [PATCH RFC 1/2] mm, page_alloc, thp: prevent reclaim for __GFP_THISNODE THP allocations Vlastimil Babka
  2025-12-16 16:26   ` Michal Hocko
@ 2025-12-16 20:11   ` Johannes Weiner
  2025-12-16 20:23   ` Zi Yan
  2025-12-17 15:53   ` Pedro Falcato
  3 siblings, 0 replies; 11+ messages in thread
From: Johannes Weiner @ 2025-12-16 20:11 UTC (permalink / raw)
  To: Vlastimil Babka
  Cc: Andrew Morton, Suren Baghdasaryan, Michal Hocko, Brendan Jackman,
	Zi Yan, David Rientjes, David Hildenbrand, Lorenzo Stoakes,
	Liam R. Howlett, Mike Rapoport, Joshua Hahn, Pedro Falcato,
	linux-mm, linux-kernel

On Tue, Dec 16, 2025 at 04:54:21PM +0100, Vlastimil Babka wrote:
> Since commit cc638f329ef6 ("mm, thp: tweak reclaim/compaction effort of
> local-only and all-node allocations"), THP page fault allocations have
> settled on the following scheme (from the commit log):
> 
> 1. local node only THP allocation with no reclaim, just compaction.
> 2. for madvised VMA's or when synchronous compaction is enabled always - THP
>    allocation from any node with effort determined by global defrag setting
>    and VMA madvise
> 3. fallback to base pages on any node
> 
> Recent customer reports however revealed we have a gap in step 1 above.
> What we have seen is excessive reclaim due to THP page faults on a NUMA
> node that's close to its high watermark, while other nodes have plenty
> of free memory.
> 
> The problem with step 1 is that it promises no reclaim after the
> compaction attempt; however, reclaim is only avoided for certain
> compaction outcomes (deferred, or skipped due to insufficient free base
> pages), and not e.g. when compaction is actually performed but fails (we
> did see compact_fail vmstat counter increasing).
> 
> THP page faults can therefore exhibit a zone_reclaim_mode-like behavior,
> which is not the intention.
> 
> Thus add a check for __GFP_THISNODE that corresponds to this exact
> situation and prevents continuing with reclaim/compaction once the
> initial compaction attempt isn't successful in allocating the page.
> 
> Note that commit cc638f329ef6 has not introduced this over-reclaim
> possibility; it appears to exist in some form since commit 2f0799a0ffc0
> ("mm, thp: restore node-local hugepage allocations"). Followup commits
> b39d0ee2632d ("mm, page_alloc: avoid expensive reclaim when compaction
> may not succeed") and cc638f329ef6 have moved in the right direction,
> but left the abovementioned gap.
> 
> Fixes: 2f0799a0ffc0 ("mm, thp: restore node-local hugepage allocations")
> Signed-off-by: Vlastimil Babka <vbabka@suse.cz>

Acked-by: Johannes Weiner <hannes@cmpxchg.org>



* Re: [PATCH RFC 1/2] mm, page_alloc, thp: prevent reclaim for __GFP_THISNODE THP allocations
  2025-12-16 15:54 ` [PATCH RFC 1/2] mm, page_alloc, thp: prevent reclaim for __GFP_THISNODE THP allocations Vlastimil Babka
  2025-12-16 16:26   ` Michal Hocko
  2025-12-16 20:11   ` Johannes Weiner
@ 2025-12-16 20:23   ` Zi Yan
  2025-12-17 15:53   ` Pedro Falcato
  3 siblings, 0 replies; 11+ messages in thread
From: Zi Yan @ 2025-12-16 20:23 UTC (permalink / raw)
  To: Vlastimil Babka
  Cc: Andrew Morton, Suren Baghdasaryan, Michal Hocko, Brendan Jackman,
	Johannes Weiner, David Rientjes, David Hildenbrand,
	Lorenzo Stoakes, Liam R. Howlett, Mike Rapoport, Joshua Hahn,
	Pedro Falcato, linux-mm, linux-kernel

On 16 Dec 2025, at 10:54, Vlastimil Babka wrote:

> Since commit cc638f329ef6 ("mm, thp: tweak reclaim/compaction effort of
> local-only and all-node allocations"), THP page fault allocations have
> settled on the following scheme (from the commit log):
>
> 1. local node only THP allocation with no reclaim, just compaction.
> 2. for madvised VMA's or when synchronous compaction is enabled always - THP
>    allocation from any node with effort determined by global defrag setting
>    and VMA madvise
> 3. fallback to base pages on any node
>
> Recent customer reports however revealed we have a gap in step 1 above.
> What we have seen is excessive reclaim due to THP page faults on a NUMA
> node that's close to its high watermark, while other nodes have plenty
> of free memory.
>
> The problem with step 1 is that it promises no reclaim after the
> compaction attempt; however, reclaim is only avoided for certain
> compaction outcomes (deferred, or skipped due to insufficient free base
> pages), and not e.g. when compaction is actually performed but fails (we
> did see compact_fail vmstat counter increasing).
>
> THP page faults can therefore exhibit a zone_reclaim_mode-like behavior,
> which is not the intention.
>
> Thus add a check for __GFP_THISNODE that corresponds to this exact
> situation and prevents continuing with reclaim/compaction once the
> initial compaction attempt isn't successful in allocating the page.
>
> Note that commit cc638f329ef6 has not introduced this over-reclaim
> possibility; it appears to exist in some form since commit 2f0799a0ffc0
> ("mm, thp: restore node-local hugepage allocations"). Followup commits
> b39d0ee2632d ("mm, page_alloc: avoid expensive reclaim when compaction
> may not succeed") and cc638f329ef6 have moved in the right direction,
> but left the abovementioned gap.
>
> Fixes: 2f0799a0ffc0 ("mm, thp: restore node-local hugepage allocations")
> Signed-off-by: Vlastimil Babka <vbabka@suse.cz>
> ---
>  mm/page_alloc.c | 14 ++++++++++++++
>  1 file changed, 14 insertions(+)
>
LGTM. Acked-by: Zi Yan <ziy@nvidia.com>

Best Regards,
Yan, Zi



* Re: [PATCH RFC 2/2] mm, page_alloc: fail costly __GFP_NORETRY allocations faster
  2025-12-16 15:54 ` [PATCH RFC 2/2] mm, page_alloc: fail costly __GFP_NORETRY allocations faster Vlastimil Babka
  2025-12-16 16:28   ` Michal Hocko
@ 2025-12-16 20:32   ` Johannes Weiner
  2025-12-17  8:46     ` Vlastimil Babka
  1 sibling, 1 reply; 11+ messages in thread
From: Johannes Weiner @ 2025-12-16 20:32 UTC (permalink / raw)
  To: Vlastimil Babka
  Cc: Andrew Morton, Suren Baghdasaryan, Michal Hocko, Brendan Jackman,
	Zi Yan, David Rientjes, David Hildenbrand, Lorenzo Stoakes,
	Liam R. Howlett, Mike Rapoport, Joshua Hahn, Pedro Falcato,
	linux-mm, linux-kernel

On Tue, Dec 16, 2025 at 04:54:22PM +0100, Vlastimil Babka wrote:
> It might therefore make more sense to just fail unconditionally after
> the initial compaction attempt, so do that instead. Costly allocations
> that do want the reclaim/compaction to happen at least once can omit
> __GFP_NORETRY, or even specify __GFP_RETRY_MAYFAIL for more than one
> attempt.
>
> There is a slight potential unfairness in that costly __GFP_NORETRY
> allocations that can't perform direct compaction (i.e. lack __GFP_IO)
> will still be allowed to direct reclaim, while those that can direct
> compact will now never attempt direct reclaim. However, in cases of
> memory pressure causing compaction to be skipped due to insufficient
> base pages, direct reclaim was already not done before, so there should
> be no functional regressions from this change.

Hm, kind of. There could be enough basepages for compaction_suitable()
but compaction odds are still higher with more free pages. So there
might be cases it regresses.

__GFP_NORETRY semantics say it'll try reclaim at least once. We should
be able to keep that and still simplify, no?

>  		if (costly_order && (gfp_mask & __GFP_NORETRY)) {
> -			if (gfp_mask & __GFP_THISNODE)
> -				goto nopage;
> +			goto nopage;

IOW, maybe directly select for the NUMA-THP special case here?

		/* Optimistic node-local huge page - only compact once */
		if (costly_order &&
		    ((gfp_mask & (__GFP_NORETRY|__GFP_THISNODE)) ==
		     (__GFP_NORETRY|__GFP_THISNODE)))
			goto nopage;

and then let other __GFP_NORETRY fall through.
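
The mask comparison in the snippet above requires both flags to be set at
once; a plain (mask & (A|B)) != 0 test would match either one. A minimal
userspace sketch of that idiom, where the DEMO_* names, the bit values and
both helpers are hypothetical and not kernel API:

```c
#include <assert.h>
#include <stdbool.h>

#define DEMO_GFP_NORETRY	(1u << 0)	/* assumed bit value */
#define DEMO_GFP_THISNODE	(1u << 1)	/* assumed bit value */
#define DEMO_GFP_OTHER		(1u << 2)	/* any unrelated flag */

/* True only if every bit in wanted is also set in mask. */
static bool demo_has_all_flags(unsigned int mask, unsigned int wanted)
{
	return (mask & wanted) == wanted;
}

/*
 * Models the suggested condition: fail right after the first compaction
 * attempt only for costly, node-local, no-retry requests, and let every
 * other request fall through.
 */
static bool demo_fail_after_compaction(unsigned int gfp, bool costly_order)
{
	return costly_order &&
	       demo_has_all_flags(gfp, DEMO_GFP_NORETRY | DEMO_GFP_THISNODE);
}
```

With this shape, a request carrying only DEMO_GFP_NORETRY is not selected
and would continue to the regular reclaim path in the model.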



* Re: [PATCH RFC 2/2] mm, page_alloc: fail costly __GFP_NORETRY allocations faster
  2025-12-16 20:32   ` Johannes Weiner
@ 2025-12-17  8:46     ` Vlastimil Babka
  2025-12-17 16:35       ` Johannes Weiner
  0 siblings, 1 reply; 11+ messages in thread
From: Vlastimil Babka @ 2025-12-17  8:46 UTC (permalink / raw)
  To: Johannes Weiner
  Cc: Andrew Morton, Suren Baghdasaryan, Michal Hocko, Brendan Jackman,
	Zi Yan, David Rientjes, David Hildenbrand, Lorenzo Stoakes,
	Liam R. Howlett, Mike Rapoport, Joshua Hahn, Pedro Falcato,
	linux-mm, linux-kernel

On 12/16/25 21:32, Johannes Weiner wrote:
> On Tue, Dec 16, 2025 at 04:54:22PM +0100, Vlastimil Babka wrote:
>> It might therefore make more sense to just fail unconditionally after
>> the initial compaction attempt, so do that instead. Costly allocations
>> that do want the reclaim/compaction to happen at least once can omit
>> __GFP_NORETRY, or even specify __GFP_RETRY_MAYFAIL for more than one
>> attempt.
>>
>> There is a slight potential unfairness in that costly __GFP_NORETRY
>> allocations that can't perform direct compaction (i.e. lack __GFP_IO)
>> will still be allowed to direct reclaim, while those that can direct
>> compact will now never attempt direct reclaim. However, in cases of
>> memory pressure causing compaction to be skipped due to insufficient
>> base pages, direct reclaim was already not done before, so there should
>> be no functional regressions from this change.
> 
> Hm, kind of. There could be enough basepages for compaction_suitable()
> but compaction odds are still higher with more free pages. So there
> might be cases it regresses.
> 
> __GFP_NORETRY semantics say it'll try reclaim at least once. We should
> be able to keep that and still simplify, no?
> 
>>  		if (costly_order && (gfp_mask & __GFP_NORETRY)) {
>> -			if (gfp_mask & __GFP_THISNODE)
>> -				goto nopage;
>> +			goto nopage;
> 
> IOW, maybe directly select for the NUMA-THP special case here?
> 
> 		/* Optimistic node-local huge page - only compact once */
> 		if (costly_order &&
>                     ((gfp_mask & (__GFP_NORETRY|__GFP_THISNODE)) ==
>                      (__GFP_NORETRY|__GFP_THISNODE)))
> 			goto nopage;
> 
> and then let other __GFP_NORETRY fall through.

I did consider it as an alternative when realizing the potential unfairness
mentioned above, but then went with the simpler code option.
With your suggestion we keep the THP-specific check but at least remove the
arguably illogical compaction feedback.



* Re: [PATCH RFC 1/2] mm, page_alloc, thp: prevent reclaim for __GFP_THISNODE THP allocations
  2025-12-16 15:54 ` [PATCH RFC 1/2] mm, page_alloc, thp: prevent reclaim for __GFP_THISNODE THP allocations Vlastimil Babka
                     ` (2 preceding siblings ...)
  2025-12-16 20:23   ` Zi Yan
@ 2025-12-17 15:53   ` Pedro Falcato
  3 siblings, 0 replies; 11+ messages in thread
From: Pedro Falcato @ 2025-12-17 15:53 UTC (permalink / raw)
  To: Vlastimil Babka
  Cc: Andrew Morton, Suren Baghdasaryan, Michal Hocko, Brendan Jackman,
	Johannes Weiner, Zi Yan, David Rientjes, David Hildenbrand,
	Lorenzo Stoakes, Liam R. Howlett, Mike Rapoport, Joshua Hahn,
	linux-mm, linux-kernel

On Tue, Dec 16, 2025 at 04:54:21PM +0100, Vlastimil Babka wrote:
> Since commit cc638f329ef6 ("mm, thp: tweak reclaim/compaction effort of
> local-only and all-node allocations"), THP page fault allocations have
> settled on the following scheme (from the commit log):
> 
> 1. local node only THP allocation with no reclaim, just compaction.
> 2. for madvised VMA's or when synchronous compaction is enabled always - THP
>    allocation from any node with effort determined by global defrag setting
>    and VMA madvise
> 3. fallback to base pages on any node
> 
> Recent customer reports however revealed we have a gap in step 1 above.
> What we have seen is excessive reclaim due to THP page faults on a NUMA
> node that's close to its high watermark, while other nodes have plenty
> of free memory.
> 
> The problem with step 1 is that it promises no reclaim after the
> compaction attempt, however reclaim is only avoided for certain
> compaction outcomes (deferred, or skipped due to insufficient free base
> pages), and not e.g. when compaction is actually performed but fails (we
> did see compact_fail vmstat counter increasing).
> 
> THP page faults can therefore exhibit a zone_reclaim_mode-like behavior,
> which is not the intention.
> 
> Thus add a check for __GFP_THISNODE that corresponds to this exact
> situation and prevents continuing with reclaim/compaction once the
> initial compaction attempt isn't successful in allocating the page.
> 
> Note that commit cc638f329ef6 has not introduced this over-reclaim
> possibility; it appears to exist in some form since commit 2f0799a0ffc0
> ("mm, thp: restore node-local hugepage allocations"). Followup commits
> b39d0ee2632d ("mm, page_alloc: avoid expensive reclaim when compaction
> may not succeed") and cc638f329ef6 have moved in the right direction,
> but left the abovementioned gap.
> 
> Fixes: 2f0799a0ffc0 ("mm, thp: restore node-local hugepage allocations")
> Signed-off-by: Vlastimil Babka <vbabka@suse.cz>

Acked-by: Pedro Falcato <pfalcato@suse.de> 
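
For readers following along, the gap described in the changelog can be
modelled as a small decision function. This is only a sketch of my reading
of the commit message, not the page_alloc.c code: the enum is a simplified,
hypothetical subset of compaction outcomes, and all demo_* names are made up.

```c
#include <stdbool.h>

/* Simplified, hypothetical outcome set; the kernel tracks more results. */
enum demo_compact_result {
	DEMO_COMPACT_SKIPPED,	/* too few free base pages to even try */
	DEMO_COMPACT_DEFERRED,	/* attempt postponed after earlier failures */
	DEMO_COMPACT_FAILED,	/* compaction ran but produced no page */
};

/*
 * Returns true if a costly __GFP_NORETRY allocation would go on to direct
 * reclaim after the initial compaction attempt did not yield a page.
 */
static bool demo_continues_to_reclaim(enum demo_compact_result res,
				      bool thisnode, bool patched)
{
	/* With the patch as described: node-local attempts stop here. */
	if (patched && thisnode)
		return false;

	/* Existing behavior: give up if compaction was skipped or deferred. */
	if (res == DEMO_COMPACT_SKIPPED || res == DEMO_COMPACT_DEFERRED)
		return false;

	/* Otherwise the allocation proceeds to reclaim; this is the gap. */
	return true;
}
```

In this model, demo_continues_to_reclaim(DEMO_COMPACT_FAILED, true, false)
is the over-reclaim case from the report, and passing patched = true turns
it into an immediate failure while non-THISNODE requests are unaffected.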

-- 
Pedro



* Re: [PATCH RFC 2/2] mm, page_alloc: fail costly __GFP_NORETRY allocations faster
  2025-12-17  8:46     ` Vlastimil Babka
@ 2025-12-17 16:35       ` Johannes Weiner
  0 siblings, 0 replies; 11+ messages in thread
From: Johannes Weiner @ 2025-12-17 16:35 UTC (permalink / raw)
  To: Vlastimil Babka
  Cc: Andrew Morton, Suren Baghdasaryan, Michal Hocko, Brendan Jackman,
	Zi Yan, David Rientjes, David Hildenbrand, Lorenzo Stoakes,
	Liam R. Howlett, Mike Rapoport, Joshua Hahn, Pedro Falcato,
	linux-mm, linux-kernel

On Wed, Dec 17, 2025 at 09:46:34AM +0100, Vlastimil Babka wrote:
> On 12/16/25 21:32, Johannes Weiner wrote:
> > On Tue, Dec 16, 2025 at 04:54:22PM +0100, Vlastimil Babka wrote:
> >> It might make therefore more sense to just fail unconditionally after
> >> the initial compaction attempt, so do that instead. Costly allocations
> >> that do want the reclaim/compaction to happen at least once can omit
> >> __GFP_NORETRY, or even specify __GFP_RETRY_MAYFAIL for more than one
> >> attempt.
> >>
> >> There is a slight potential unfairness in that costly __GFP_NORETRY
> >> allocations that can't perform direct compaction (i.e. lack __GFP_IO)
> >> will still be allowed to direct reclaim, while those that can direct
> >> compact will now never attempt direct reclaim. However, in cases of
> >> memory pressure causing compaction to be skipped due to insufficient
> >> base pages, direct reclaim was already not done before, so there should
> >> be no functional regressions from this change.
> > 
> > Hm, kind of. There could be enough basepages for compaction_suitable()
> > but compaction odds are still higher with more free pages. So there
> > might be cases it regresses.
> > 
> > __GFP_NORETRY semantics say it'll try reclaim at least once. We should
> > be able to keep that and still simplify, no?
> > 
> >>  		if (costly_order && (gfp_mask & __GFP_NORETRY)) {
> >> -			if (gfp_mask & __GFP_THISNODE)
> >> -				goto nopage;
> >> +			goto nopage;
> > 
> > IOW, maybe directly select for the NUMA-THP special case here?
> > 
> > 		/* Optimistic node-local huge page - only compact once */
> > 		if (costly_order &&
> >                     ((gfp_mask & (__GFP_NORETRY|__GFP_THISNODE)) ==
> >                      (__GFP_NORETRY|__GFP_THISNODE)))
> > 			goto nopage;
> > 
> > and then let other __GFP_NORETRY fall through.
> 
> I did consider it as an alternative when realizing the potential unfairness
> mentioned above, but then went with the simpler code option.
>
> With your suggestion we keep the THP-specific check but at least remove the
> arguably illogical compaction feedback.

Yes, I'm in favor of removing those either way.

Reclaim makes its own decisions around costly orders. For example, it
targets a higher number of free pages through compaction_ready() than
where compaction would return SKIPPED, to account for concurrency. I
don't think the allocator should have conflicting opinions.

Regarding __GFP_NORETRY: I think it would just be a chance to simplify
the mental model around it again. If somebody does a NORETRY request
when memory is full of stale page cache, I think it's reasonable to
expect at least one shot at dropping some cache to make it happen.

Shortcutting directly to compaction is a good optimization when we
suspect it could succeed without requiring reclaim. But I'm not sure
it's reasonable to ONLY do that and give up.

Btw, I do wonder why that up-front compaction run is so explicit, when
we have

	__alloc_pages_direct_reclaim()
	__alloc_pages_direct_compact()

calls following below. Couldn't we check for conditions upfront and
set a flag to skip reclaim initially? Then handle priority adjustments
in the retry conditions? IOW, something like:

	unsigned long did_some_progress = 0;
	bool skip_reclaim = false;

	/* Costly orders: try compaction alone on the first pass */
	if (can_compact && costly_order)
		skip_reclaim = true;

	/* Same for non-movable high-order requests */
	if (can_compact && order > 0 && ac->migratetype != MIGRATE_MOVABLE)
		skip_reclaim = true;

	/* Optimistic node-local huge page */
	if (gfp_thisnode_noretry(gfp_mask))
		skip_reclaim = true;
retry:
	page = get_page_from_freelist(..., alloc_flags, ...);
	if (page)
		goto got_pg;

	if (!skip_reclaim) {
		page = __alloc_pages_direct_reclaim(..., &did_some_progress);
		if (page)
			goto got_pg;
	}

	page = __alloc_pages_direct_compact(...);
	if (page)
		goto got_pg;

	if (should_loop()) {
		/* Later passes: allow reclaim, adjust compaction priority */
		skip_reclaim = false;
		compact_priority = ...;
		goto retry;
	}

That would naturally get rid of the gfp_pfmemalloc_allowed() branch
for the upfront check as well, because the ALLOC_NO_WATERMARKS attempt
happens before we do the reclaim/compaction calls.
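
To make the control flow concrete, here is a toy userspace model of that
loop. It is a sketch under loud assumptions: struct demo_sim and demo_alloc()
are hypothetical, the freelist attempt and the NORETRY exit conditions are
left out, should_loop() is reduced to "allow exactly one more pass", and
compaction "succeeds" based on two made-up booleans.

```c
#include <stdbool.h>

/* Hypothetical toy state, not a kernel structure. */
struct demo_sim {
	bool compaction_works;	/* compaction alone can produce the page */
	bool reclaim_helps;	/* compaction succeeds once reclaim has run */
	int reclaim_calls;	/* how often direct reclaim was invoked */
	int compact_calls;	/* how often direct compaction was invoked */
};

/* Returns true if the simulated allocation succeeds within two passes. */
static bool demo_alloc(struct demo_sim *s, bool skip_reclaim)
{
	bool reclaimed = false;
	int pass;

	for (pass = 0; pass < 2; pass++) {
		/* First pass may skip reclaim and go straight to compaction. */
		if (!skip_reclaim) {
			s->reclaim_calls++;
			reclaimed = true;
		}

		s->compact_calls++;
		if (s->compaction_works || (s->reclaim_helps && reclaimed))
			return true;

		/* Stand-in for should_loop(): permit reclaim on the retry. */
		skip_reclaim = false;
	}

	return false;
}
```

When compaction_works is set, the model returns after one compaction call
with no reclaim at all. When only reclaim_helps is set, the second pass
performs one reclaim call and then succeeds, which is the "one shot at
dropping some cache" case.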



Thread overview: 11+ messages
2025-12-16 15:54 [PATCH RFC 0/2] tweaks for costly order __GFP_NORETRY reclaim Vlastimil Babka
2025-12-16 15:54 ` [PATCH RFC 1/2] mm, page_alloc, thp: prevent reclaim for __GFP_THISNODE THP allocations Vlastimil Babka
2025-12-16 16:26   ` Michal Hocko
2025-12-16 20:11   ` Johannes Weiner
2025-12-16 20:23   ` Zi Yan
2025-12-17 15:53   ` Pedro Falcato
2025-12-16 15:54 ` [PATCH RFC 2/2] mm, page_alloc: fail costly __GFP_NORETRY allocations faster Vlastimil Babka
2025-12-16 16:28   ` Michal Hocko
2025-12-16 20:32   ` Johannes Weiner
2025-12-17  8:46     ` Vlastimil Babka
2025-12-17 16:35       ` Johannes Weiner
