* [PATCH] mm, page_alloc, thp: prevent reclaim for __GFP_THISNODE THP allocations
@ 2025-12-19 16:31 Vlastimil Babka
2025-12-19 16:33 ` Zi Yan
2025-12-20 22:16 ` Andrew Morton
0 siblings, 2 replies; 4+ messages in thread
From: Vlastimil Babka @ 2025-12-19 16:31 UTC
To: Andrew Morton, Suren Baghdasaryan, Michal Hocko, Brendan Jackman,
Johannes Weiner, Zi Yan, David Rientjes, David Hildenbrand,
Lorenzo Stoakes, Liam R. Howlett, Mike Rapoport, Joshua Hahn,
Pedro Falcato
Cc: linux-mm, linux-kernel, Vlastimil Babka
Since commit cc638f329ef6 ("mm, thp: tweak reclaim/compaction effort of
local-only and all-node allocations"), THP page fault allocations have
settled on the following scheme (from the commit log):
1. local node only THP allocation with no reclaim, just compaction.
2. for madvised VMA's or when synchronous compaction is enabled always - THP
allocation from any node with effort determined by global defrag setting
and VMA madvise
3. fallback to base pages on any node
Recent customer reports however revealed we have a gap in step 1 above.
What we have seen is excessive reclaim due to THP page faults on a NUMA
node that's close to its high watermark, while other nodes have plenty
of free memory.
The problem with step 1 is that it promises no reclaim after the
compaction attempt, however reclaim is only avoided for certain
compaction outcomes (deferred, or skipped due to insufficient free base
pages), and not e.g. when compaction is actually performed but fails (we
did see compact_fail vmstat counter increasing).
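
For anyone wanting to check their own systems for the same symptom, below is
a minimal userspace sketch of pulling a single counter such as compact_fail
out of /proc/vmstat-style text. The helper name vmstat_lookup() is made up
for illustration and is not a kernel or libc interface; it only assumes the
usual one "name value" pair per line format.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/*
 * Hypothetical helper: look up a "name value" line in vmstat-style text.
 * Returns 0 and stores the counter in *value on success, -1 otherwise.
 */
static int vmstat_lookup(const char *text, const char *name,
			 unsigned long long *value)
{
	const char *line = text;
	size_t len;

	assert(text && name && value);
	len = strlen(name);

	while (line && *line) {
		/* Require a full-name match followed by the separating space. */
		if (strncmp(line, name, len) == 0 && line[len] == ' ') {
			if (sscanf(line + len + 1, "%llu", value) == 1)
				return 0;
			return -1;
		}
		/* Advance to the start of the next line, if there is one. */
		line = strchr(line, '\n');
		if (line)
			line++;
	}
	return -1;
}
```

Sampling the counter twice and comparing the two values would show whether
compaction is being attempted and failing during the interval.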
THP page faults can therefore exhibit a zone_reclaim_mode-like behavior,
which is not the intention.
Thus add a check for __GFP_THISNODE that corresponds to this exact
situation and prevents continuing with reclaim/compaction once the
initial compaction attempt isn't successful in allocating the page.
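
To illustrate the control flow, here is a small userspace model of the
decision described above. All SK_* names and sk_bails_before_reclaim() are
hypothetical stand-ins, not the kernel's definitions, and the model
deliberately ignores the other conditions surrounding this check in
__alloc_pages_slowpath().

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical flag bit; the value is illustrative only. */
#define SK_GFP_THISNODE	0x1u

/* Hypothetical subset of compaction outcomes relevant to the check. */
enum sk_compact_result {
	SK_COMPACT_SKIPPED,	/* skipped: not enough free base pages */
	SK_COMPACT_DEFERRED,	/* deferred after earlier failures */
	SK_COMPACT_FAILED,	/* compaction ran but produced no page */
};

/*
 * Returns true when the modeled slowpath gives up ("goto nopage") instead
 * of continuing with reclaim/compaction. 'patched' selects whether the
 * __GFP_THISNODE check added by this patch is in effect.
 */
static bool sk_bails_before_reclaim(unsigned int gfp,
				    enum sk_compact_result res, bool patched)
{
	assert(res >= SK_COMPACT_SKIPPED && res <= SK_COMPACT_FAILED);

	/* Pre-existing check: only these two outcomes avoided reclaim. */
	if (res == SK_COMPACT_SKIPPED || res == SK_COMPACT_DEFERRED)
		return true;

	/* New check: a local-node-only attempt never goes on to reclaim. */
	if (patched && (gfp & SK_GFP_THISNODE))
		return true;

	return false;
}
```

In this model, a node-local attempt whose compaction actually ran and failed
proceeds to reclaim without the patch and bails out with it, while
allocations without the THISNODE bit behave the same either way.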
Note that commit cc638f329ef6 has not introduced this over-reclaim
possibility; it appears to exist in some form since commit 2f0799a0ffc0
("mm, thp: restore node-local hugepage allocations"). Followup commits
b39d0ee2632d ("mm, page_alloc: avoid expensive reclaim when compaction
may not succeed") and cc638f329ef6 have moved in the right direction,
but left the abovementioned gap.
Fixes: 2f0799a0ffc0 ("mm, thp: restore node-local hugepage allocations")
Acked-by: Michal Hocko <mhocko@suse.com>
Acked-by: Johannes Weiner <hannes@cmpxchg.org>
Acked-by: Pedro Falcato <pfalcato@suse.de>
Signed-off-by: Vlastimil Babka <vbabka@suse.cz>
---
This is patch 1 taken from the RFC [1] with review tags applied, and it
should be ready for exposure in linux-next.
The rest of [1] will become another cleanup RFC, with changes according to
feedback. It is likely to result in more discussion and to be delayed by
holidays etc., so it will be posted separately so that the fix is not held up.
[1] https://lore.kernel.org/all/20251216-thp-thisnode-tweak-v1-0-0e499d13d2eb@suse.cz/
---
mm/page_alloc.c | 14 ++++++++++++++
1 file changed, 14 insertions(+)
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index 822e05f1a964..6f5e1b902999 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -4788,6 +4788,20 @@ __alloc_pages_slowpath(gfp_t gfp_mask, unsigned int order,
 			    compact_result == COMPACT_DEFERRED)
 				goto nopage;

+			/*
+			 * THP page faults may attempt local node only first,
+			 * but are then allowed to only compact, not reclaim,
+			 * see alloc_pages_mpol().
+			 *
+			 * Compaction can fail for other reasons than those
+			 * checked above and we don't want such THP allocations
+			 * to put reclaim pressure on a single node in a
+			 * situation where other nodes might have plenty of
+			 * available memory.
+			 */
+			if (gfp_mask & __GFP_THISNODE)
+				goto nopage;
+
 			/*
 			 * Looks like reclaim/compaction is worth trying, but
 			 * sync compaction could be very expensive, so keep
---
base-commit: 8f0b4cce4481fb22653697cced8d0d04027cb1e8
change-id: 20251219-costly-noretry-thisnode-fix-b9ec46b62c1b
Best regards,
--
Vlastimil Babka <vbabka@suse.cz>
* Re: [PATCH] mm, page_alloc, thp: prevent reclaim for __GFP_THISNODE THP allocations
From: Zi Yan @ 2025-12-19 16:33 UTC
To: Vlastimil Babka
Cc: Andrew Morton, Suren Baghdasaryan, Michal Hocko, Brendan Jackman,
Johannes Weiner, David Rientjes, David Hildenbrand,
Lorenzo Stoakes, Liam R. Howlett, Mike Rapoport, Joshua Hahn,
Pedro Falcato, linux-mm, linux-kernel
On 19 Dec 2025, at 11:31, Vlastimil Babka wrote:
> Since commit cc638f329ef6 ("mm, thp: tweak reclaim/compaction effort of
> local-only and all-node allocations"), THP page fault allocations have
> settled on the following scheme (from the commit log):
>
> 1. local node only THP allocation with no reclaim, just compaction.
> 2. for madvised VMA's or when synchronous compaction is enabled always - THP
> allocation from any node with effort determined by global defrag setting
> and VMA madvise
> 3. fallback to base pages on any node
>
> Recent customer reports however revealed we have a gap in step 1 above.
> What we have seen is excessive reclaim due to THP page faults on a NUMA
> node that's close to its high watermark, while other nodes have plenty
> of free memory.
>
> The problem with step 1 is that it promises no reclaim after the
> compaction attempt, however reclaim is only avoided for certain
> compaction outcomes (deferred, or skipped due to insufficient free base
> pages), and not e.g. when compaction is actually performed but fails (we
> did see compact_fail vmstat counter increasing).
>
> THP page faults can therefore exhibit a zone_reclaim_mode-like behavior,
> which is not the intention.
>
> Thus add a check for __GFP_THISNODE that corresponds to this exact
> situation and prevents continuing with reclaim/compaction once the
> initial compaction attempt isn't successful in allocating the page.
>
> Note that commit cc638f329ef6 has not introduced this over-reclaim
> possibility; it appears to exist in some form since commit 2f0799a0ffc0
> ("mm, thp: restore node-local hugepage allocations"). Followup commits
> b39d0ee2632d ("mm, page_alloc: avoid expensive reclaim when compaction
> may not succeed") and cc638f329ef6 have moved in the right direction,
> but left the abovementioned gap.
>
> Fixes: 2f0799a0ffc0 ("mm, thp: restore node-local hugepage allocations")
> Acked-by: Michal Hocko <mhocko@suse.com>
> Acked-by: Johannes Weiner <hannes@cmpxchg.org>
> Acked-by: Pedro Falcato <pfalcato@suse.de>
> Signed-off-by: Vlastimil Babka <vbabka@suse.cz>
> ---
> This is patch 1 taken from the RFC [1] with review tags applied, and it
> should be ready for exposure in linux-next.
>
> The rest of [1] will become another cleanup RFC, with changes according to
> feedback. It is likely to result in more discussion and to be delayed by
> holidays etc., so it will be posted separately so that the fix is not held up.
>
> [1] https://lore.kernel.org/all/20251216-thp-thisnode-tweak-v1-0-0e499d13d2eb@suse.cz/
> ---
> mm/page_alloc.c | 14 ++++++++++++++
> 1 file changed, 14 insertions(+)
>
Acked-by: Zi Yan <ziy@nvidia.com>
Best Regards,
Yan, Zi
* Re: [PATCH] mm, page_alloc, thp: prevent reclaim for __GFP_THISNODE THP allocations
From: Andrew Morton @ 2025-12-20 22:16 UTC
To: Vlastimil Babka
Cc: Suren Baghdasaryan, Michal Hocko, Brendan Jackman,
Johannes Weiner, Zi Yan, David Rientjes, David Hildenbrand,
Lorenzo Stoakes, Liam R. Howlett, Mike Rapoport, Joshua Hahn,
Pedro Falcato, linux-mm, linux-kernel
On Fri, 19 Dec 2025 17:31:57 +0100 Vlastimil Babka <vbabka@suse.cz> wrote:
> Since commit cc638f329ef6 ("mm, thp: tweak reclaim/compaction effort of
> local-only and all-node allocations"), THP page fault allocations have
> settled on the following scheme (from the commit log):
>
> 1. local node only THP allocation with no reclaim, just compaction.
> 2. for madvised VMA's or when synchronous compaction is enabled always - THP
> allocation from any node with effort determined by global defrag setting
> and VMA madvise
> 3. fallback to base pages on any node
>
> Recent customer reports however revealed we have a gap in step 1 above.
> What we have seen is excessive reclaim due to THP page faults on a NUMA
> node that's close to its high watermark, while other nodes have plenty
> of free memory.
>
> The problem with step 1 is that it promises no reclaim after the
> compaction attempt, however reclaim is only avoided for certain
> compaction outcomes (deferred, or skipped due to insufficient free base
> pages), and not e.g. when compaction is actually performed but fails (we
> did see compact_fail vmstat counter increasing).
>
> THP page faults can therefore exhibit a zone_reclaim_mode-like behavior,
> which is not the intention.
>
> Thus add a check for __GFP_THISNODE that corresponds to this exact
> situation and prevents continuing with reclaim/compaction once the
> initial compaction attempt isn't successful in allocating the page.
>
> Note that commit cc638f329ef6 has not introduced this over-reclaim
> possibility; it appears to exist in some form since commit 2f0799a0ffc0
> ("mm, thp: restore node-local hugepage allocations"). Followup commits
> b39d0ee2632d ("mm, page_alloc: avoid expensive reclaim when compaction
> may not succeed") and cc638f329ef6 have moved in the right direction,
> but left the abovementioned gap.
Cool. What are your thoughts on which kernel version(s) should receive
this?
* Re: [PATCH] mm, page_alloc, thp: prevent reclaim for __GFP_THISNODE THP allocations
From: Vlastimil Babka @ 2025-12-21 20:43 UTC
To: Andrew Morton
Cc: Suren Baghdasaryan, Michal Hocko, Brendan Jackman,
Johannes Weiner, Zi Yan, David Rientjes, David Hildenbrand,
Lorenzo Stoakes, Liam R. Howlett, Mike Rapoport, Joshua Hahn,
Pedro Falcato, linux-mm, linux-kernel
On 12/20/25 23:16, Andrew Morton wrote:
> On Fri, 19 Dec 2025 17:31:57 +0100 Vlastimil Babka <vbabka@suse.cz> wrote:
>
>> Since commit cc638f329ef6 ("mm, thp: tweak reclaim/compaction effort of
>> local-only and all-node allocations"), THP page fault allocations have
>> settled on the following scheme (from the commit log):
>>
>> 1. local node only THP allocation with no reclaim, just compaction.
>> 2. for madvised VMA's or when synchronous compaction is enabled always - THP
>> allocation from any node with effort determined by global defrag setting
>> and VMA madvise
>> 3. fallback to base pages on any node
>>
>> Recent customer reports however revealed we have a gap in step 1 above.
>> What we have seen is excessive reclaim due to THP page faults on a NUMA
>> node that's close to its high watermark, while other nodes have plenty
>> of free memory.
>>
>> The problem with step 1 is that it promises no reclaim after the
>> compaction attempt, however reclaim is only avoided for certain
>> compaction outcomes (deferred, or skipped due to insufficient free base
>> pages), and not e.g. when compaction is actually performed but fails (we
>> did see compact_fail vmstat counter increasing).
>>
>> THP page faults can therefore exhibit a zone_reclaim_mode-like behavior,
>> which is not the intention.
>>
>> Thus add a check for __GFP_THISNODE that corresponds to this exact
>> situation and prevents continuing with reclaim/compaction once the
>> initial compaction attempt isn't successful in allocating the page.
>>
>> Note that commit cc638f329ef6 has not introduced this over-reclaim
>> possibility; it appears to exist in some form since commit 2f0799a0ffc0
>> ("mm, thp: restore node-local hugepage allocations"). Followup commits
>> b39d0ee2632d ("mm, page_alloc: avoid expensive reclaim when compaction
>> may not succeed") and cc638f329ef6 have moved in the right direction,
>> but left the abovementioned gap.
>
> Cool. What are your thoughts on which kernel version(s) should receive
> this?
We could just cc stable and let Fixes: do the job. Should be safe to backport.