* [PATCH 0/2] mm: two allocator fixes for 6.15
@ 2025-04-16 13:45 Johannes Weiner
2025-04-16 13:45 ` [PATCH 1/2] mm: vmscan: restore high-cpu watermark safety in kswapd Johannes Weiner
2025-04-16 13:45 ` [PATCH 2/2] mm: vmscan: fix kswapd exit condition in defrag_mode Johannes Weiner
0 siblings, 2 replies; 5+ messages in thread
From: Johannes Weiner @ 2025-04-16 13:45 UTC (permalink / raw)
To: Andrew Morton; +Cc: Vlastimil Babka, Brendan Jackman, linux-mm, linux-kernel
Two fixes based on Vlastimil's review of the defrag_mode patches.
#1 fixes a bug that can lead to memory deadlocks on high-CPU-count
machines. This affects not just defrag_mode.
#2 fixes an overreclaim issue when defrag_mode is enabled.
Based on:
commit 16176182efbf3dfce6dad18dcc8801164329d1c2 (akpm/mm-hotfixes-unstable)
Author: Uros Bizjak <ubizjak@gmail.com>
Date: Fri Apr 4 12:24:37 2025 +0200
include/linux/mmzone.h | 2 --
mm/page_alloc.c | 12 ------------
mm/vmscan.c | 29 ++++++++++++++++++++++++++---
3 files changed, 26 insertions(+), 17 deletions(-)
* [PATCH 1/2] mm: vmscan: restore high-cpu watermark safety in kswapd
2025-04-16 13:45 [PATCH 0/2] mm: two allocator fixes for 6.15 Johannes Weiner
@ 2025-04-16 13:45 ` Johannes Weiner
2025-04-16 14:53 ` Vlastimil Babka
2025-04-16 13:45 ` [PATCH 2/2] mm: vmscan: fix kswapd exit condition in defrag_mode Johannes Weiner
1 sibling, 1 reply; 5+ messages in thread
From: Johannes Weiner @ 2025-04-16 13:45 UTC (permalink / raw)
To: Andrew Morton; +Cc: Vlastimil Babka, Brendan Jackman, linux-mm, linux-kernel
Vlastimil points out that commit a211c6550efc ("mm: page_alloc:
defrag_mode kswapd/kcompactd watermarks") switched kswapd from
zone_watermark_ok_safe() to the standard, percpu-cached version of
reading free pages, thus dropping the watermark safety precautions for
systems with high CPU counts (e.g. >212 cpus on 64G). Restore them.
Since zone_watermark_ok_safe() is no longer the right interface, and
this was the last caller of the function anyway, open-code the
zone_page_state_snapshot() conditional and delete the function.
Reported-by: Vlastimil Babka <vbabka@suse.cz>
Fixes: a211c6550efc ("mm: page_alloc: defrag_mode kswapd/kcompactd watermarks")
Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
---
include/linux/mmzone.h | 2 --
mm/page_alloc.c | 12 ------------
mm/vmscan.c | 21 +++++++++++++++++++--
3 files changed, 19 insertions(+), 16 deletions(-)
diff --git a/include/linux/mmzone.h b/include/linux/mmzone.h
index 4c95fcc9e9df..6ccec1bf2896 100644
--- a/include/linux/mmzone.h
+++ b/include/linux/mmzone.h
@@ -1502,8 +1502,6 @@ bool __zone_watermark_ok(struct zone *z, unsigned int order, unsigned long mark,
bool zone_watermark_ok(struct zone *z, unsigned int order,
unsigned long mark, int highest_zoneidx,
unsigned int alloc_flags);
-bool zone_watermark_ok_safe(struct zone *z, unsigned int order,
- unsigned long mark, int highest_zoneidx);
/*
* Memory initialization context, use to differentiate memory added by
* the platform statically or via memory hotplug interface.
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index d7cfcfa2b077..928a81f67326 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -3470,18 +3470,6 @@ static inline bool zone_watermark_fast(struct zone *z, unsigned int order,
return false;
}
-bool zone_watermark_ok_safe(struct zone *z, unsigned int order,
- unsigned long mark, int highest_zoneidx)
-{
- long free_pages = zone_page_state(z, NR_FREE_PAGES);
-
- if (z->percpu_drift_mark && free_pages < z->percpu_drift_mark)
- free_pages = zone_page_state_snapshot(z, NR_FREE_PAGES);
-
- return __zone_watermark_ok(z, order, mark, highest_zoneidx, 0,
- free_pages);
-}
-
#ifdef CONFIG_NUMA
int __read_mostly node_reclaim_distance = RECLAIM_DISTANCE;
diff --git a/mm/vmscan.c b/mm/vmscan.c
index b620d74b0f66..cc422ad830d6 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -6736,6 +6736,7 @@ static bool pgdat_balanced(pg_data_t *pgdat, int order, int highest_zoneidx)
* meet watermarks.
*/
for_each_managed_zone_pgdat(zone, pgdat, i, highest_zoneidx) {
+ enum zone_stat_item item;
unsigned long free_pages;
if (sysctl_numa_balancing_mode & NUMA_BALANCING_MEMORY_TIERING)
@@ -6748,9 +6749,25 @@ static bool pgdat_balanced(pg_data_t *pgdat, int order, int highest_zoneidx)
* blocks to avoid polluting allocator fallbacks.
*/
if (defrag_mode)
- free_pages = zone_page_state(zone, NR_FREE_PAGES_BLOCKS);
+ item = NR_FREE_PAGES_BLOCKS;
else
- free_pages = zone_page_state(zone, NR_FREE_PAGES);
+ item = NR_FREE_PAGES;
+
+ /*
+ * When there is a high number of CPUs in the system,
+ * the cumulative error from the vmstat per-cpu cache
+ * can blur the line between the watermarks. In that
+ * case, be safe and get an accurate snapshot.
+ *
+ * TODO: NR_FREE_PAGES_BLOCKS moves in steps of
+ * pageblock_nr_pages, while the vmstat pcp threshold
+ * is limited to 125. On many configurations that
+ * counter won't actually be per-cpu cached. But keep
+ * things simple for now; revisit when somebody cares.
+ */
+ free_pages = zone_page_state(zone, item);
+ if (zone->percpu_drift_mark && free_pages < zone->percpu_drift_mark)
+ free_pages = zone_page_state_snapshot(zone, item);
if (__zone_watermark_ok(zone, order, mark, highest_zoneidx,
0, free_pages))
--
2.49.0
* [PATCH 2/2] mm: vmscan: fix kswapd exit condition in defrag_mode
2025-04-16 13:45 [PATCH 0/2] mm: two allocator fixes for 6.15 Johannes Weiner
2025-04-16 13:45 ` [PATCH 1/2] mm: vmscan: restore high-cpu watermark safety in kswapd Johannes Weiner
@ 2025-04-16 13:45 ` Johannes Weiner
2025-04-16 14:54 ` Vlastimil Babka
1 sibling, 1 reply; 5+ messages in thread
From: Johannes Weiner @ 2025-04-16 13:45 UTC (permalink / raw)
To: Andrew Morton; +Cc: Vlastimil Babka, Brendan Jackman, linux-mm, linux-kernel
Vlastimil points out an issue with kswapd in defrag_mode not waking up
kcompactd reliably.
Background: When kswapd is woken for any higher-order request, it
initially checks those high-order watermarks to decide if work is
necessary. However, it cannot (efficiently) meet the contiguity goal of
such a request by itself. So once it has reclaimed a compaction gap,
it adjusts the request down to check for free order-0 pages, then
wakes kcompactd to coalesce them into larger blocks.
In defrag_mode, the initial watermark check analogously needs to be
made against free pageblocks. However, once kswapd drops the order to
hand off contiguity work, it also needs to fall back to base page
watermarks - otherwise it'll keep reclaiming until whole blocks are freed.
While it appears kcompactd is woken up frequently enough to do most of
the compaction work, kswapd ends up overreclaiming by quite a bit:
                                 DEFRAGMODE               DEFRAGMODE-thispatch
Hugealloc Time mean                79381.34 (  +0.00%)     88126.12 ( +11.02%)
Hugealloc Time stddev              85852.16 (  +0.00%)    135366.75 ( +57.67%)
Kbuild Real time                     249.35 (  +0.00%)       226.71 (  -9.04%)
Kbuild User time                    1249.16 (  +0.00%)      1249.37 (  +0.02%)
Kbuild System time                   171.76 (  +0.00%)       166.93 (  -2.79%)
THP fault alloc                    51666.87 (  +0.00%)     52685.60 (  +1.97%)
THP fault fallback                 16970.00 (  +0.00%)     15951.87 (  -6.00%)
Direct compact fail                  166.53 (  +0.00%)       178.93 (  +7.40%)
Direct compact success                17.13 (  +0.00%)         4.13 ( -71.69%)
Compact daemon scanned migrate   3095413.33 (  +0.00%)   9231239.53 (+198.22%)
Compact daemon scanned free      2155966.53 (  +0.00%)   7053692.87 (+227.17%)
Compact direct scanned migrate    265642.47 (  +0.00%)     68388.33 ( -74.26%)
Compact direct scanned free       130252.60 (  +0.00%)     55634.87 ( -57.29%)
Compact total migrate scanned    3361055.80 (  +0.00%)   9299627.87 (+176.69%)
Compact total free scanned       2286219.13 (  +0.00%)   7109327.73 (+210.96%)
Alloc stall                         1890.80 (  +0.00%)      6297.60 (+232.94%)
Pages kswapd scanned             9043558.80 (  +0.00%)   5952576.73 ( -34.18%)
Pages kswapd reclaimed           1891708.67 (  +0.00%)   1030645.00 ( -45.52%)
Pages direct scanned             1017090.60 (  +0.00%)   2688047.60 (+164.29%)
Pages direct reclaimed             92682.60 (  +0.00%)    309770.53 (+234.22%)
Pages total scanned             10060649.40 (  +0.00%)   8640624.33 ( -14.11%)
Pages total reclaimed            1984391.27 (  +0.00%)   1340415.53 ( -32.45%)
Swap out                          884585.73 (  +0.00%)    417781.93 ( -52.77%)
Swap in                           287106.27 (  +0.00%)     95589.73 ( -66.71%)
File refaults                     551697.60 (  +0.00%)    426474.80 ( -22.70%)
Reported-by: Vlastimil Babka <vbabka@suse.cz>
Fixes: a211c6550efc ("mm: page_alloc: defrag_mode kswapd/kcompactd watermarks")
Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
---
mm/vmscan.c | 8 +++++++-
1 file changed, 7 insertions(+), 1 deletion(-)
diff --git a/mm/vmscan.c b/mm/vmscan.c
index cc422ad830d6..3783e45bfc92 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -6747,8 +6747,14 @@ static bool pgdat_balanced(pg_data_t *pgdat, int order, int highest_zoneidx)
/*
* In defrag_mode, watermarks must be met in whole
* blocks to avoid polluting allocator fallbacks.
+ *
+ * However, kswapd usually cannot accomplish this on
+ * its own and needs kcompactd support. Once it's
+ * reclaimed a compaction gap, and kswapd_shrink_node
+ * has dropped order, simply ensure there are enough
+ * base pages for compaction, wake kcompactd & sleep.
*/
- if (defrag_mode)
+ if (defrag_mode && order)
item = NR_FREE_PAGES_BLOCKS;
else
item = NR_FREE_PAGES;
--
2.49.0
* Re: [PATCH 1/2] mm: vmscan: restore high-cpu watermark safety in kswapd
2025-04-16 13:45 ` [PATCH 1/2] mm: vmscan: restore high-cpu watermark safety in kswapd Johannes Weiner
@ 2025-04-16 14:53 ` Vlastimil Babka
0 siblings, 0 replies; 5+ messages in thread
From: Vlastimil Babka @ 2025-04-16 14:53 UTC (permalink / raw)
To: Johannes Weiner, Andrew Morton; +Cc: Brendan Jackman, linux-mm, linux-kernel
On 4/16/25 15:45, Johannes Weiner wrote:
> Vlastimil points out that commit a211c6550efc ("mm: page_alloc:
> defrag_mode kswapd/kcompactd watermarks") switched kswapd from
> zone_watermark_ok_safe() to the standard, percpu-cached version of
> reading free pages, thus dropping the watermark safety precautions for
> systems with high CPU counts (e.g. >212 cpus on 64G). Restore them.
>
> Since zone_watermark_ok_safe() is no longer the right interface, and
> this was the last caller of the function anyway, open-code the
> zone_page_state_snapshot() conditional and delete the function.
>
> Reported-by: Vlastimil Babka <vbabka@suse.cz>
> Fixes: a211c6550efc ("mm: page_alloc: defrag_mode kswapd/kcompactd watermarks")
> Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
Reviewed-by: Vlastimil Babka <vbabka@suse.cz>
* Re: [PATCH 2/2] mm: vmscan: fix kswapd exit condition in defrag_mode
2025-04-16 13:45 ` [PATCH 2/2] mm: vmscan: fix kswapd exit condition in defrag_mode Johannes Weiner
@ 2025-04-16 14:54 ` Vlastimil Babka
0 siblings, 0 replies; 5+ messages in thread
From: Vlastimil Babka @ 2025-04-16 14:54 UTC (permalink / raw)
To: Johannes Weiner, Andrew Morton; +Cc: Brendan Jackman, linux-mm, linux-kernel
On 4/16/25 15:45, Johannes Weiner wrote:
> Vlastimil points out an issue with kswapd in defrag_mode not waking up
> kcompactd reliably.
>
> Background: When kswapd is woken for any higher-order request, it
> initially checks those high-order watermarks to decide if work is
> necessary. However, it cannot (efficiently) meet the contiguity goal of
> such a request by itself. So once it has reclaimed a compaction gap,
> it adjusts the request down to check for free order-0 pages, then
> wakes kcompactd to coalesce them into larger blocks.
>
> In defrag_mode, the initial watermark check needs to be analogously
> against free pageblocks. However, once kswapd drops the high-order to
> hand off contiguity work, it also needs to fall back to base page
> watermarks - otherwise it'll keep reclaiming until blocks are freed.
>
> While it appears kcompactd is woken up frequently enough to do most of
> the compaction work, kswapd ends up overreclaiming by quite a bit:
>
>                                  DEFRAGMODE               DEFRAGMODE-thispatch
> Hugealloc Time mean                79381.34 (  +0.00%)     88126.12 ( +11.02%)
> Hugealloc Time stddev              85852.16 (  +0.00%)    135366.75 ( +57.67%)
> Kbuild Real time                     249.35 (  +0.00%)       226.71 (  -9.04%)
> Kbuild User time                    1249.16 (  +0.00%)      1249.37 (  +0.02%)
> Kbuild System time                   171.76 (  +0.00%)       166.93 (  -2.79%)
> THP fault alloc                    51666.87 (  +0.00%)     52685.60 (  +1.97%)
> THP fault fallback                 16970.00 (  +0.00%)     15951.87 (  -6.00%)
> Direct compact fail                  166.53 (  +0.00%)       178.93 (  +7.40%)
> Direct compact success                17.13 (  +0.00%)         4.13 ( -71.69%)
> Compact daemon scanned migrate   3095413.33 (  +0.00%)   9231239.53 (+198.22%)
> Compact daemon scanned free      2155966.53 (  +0.00%)   7053692.87 (+227.17%)
> Compact direct scanned migrate    265642.47 (  +0.00%)     68388.33 ( -74.26%)
> Compact direct scanned free       130252.60 (  +0.00%)     55634.87 ( -57.29%)
> Compact total migrate scanned    3361055.80 (  +0.00%)   9299627.87 (+176.69%)
> Compact total free scanned       2286219.13 (  +0.00%)   7109327.73 (+210.96%)
> Alloc stall                         1890.80 (  +0.00%)      6297.60 (+232.94%)
> Pages kswapd scanned             9043558.80 (  +0.00%)   5952576.73 ( -34.18%)
> Pages kswapd reclaimed           1891708.67 (  +0.00%)   1030645.00 ( -45.52%)
> Pages direct scanned             1017090.60 (  +0.00%)   2688047.60 (+164.29%)
> Pages direct reclaimed             92682.60 (  +0.00%)    309770.53 (+234.22%)
> Pages total scanned             10060649.40 (  +0.00%)   8640624.33 ( -14.11%)
> Pages total reclaimed            1984391.27 (  +0.00%)   1340415.53 ( -32.45%)
> Swap out                          884585.73 (  +0.00%)    417781.93 ( -52.77%)
> Swap in                           287106.27 (  +0.00%)     95589.73 ( -66.71%)
> File refaults                     551697.60 (  +0.00%)    426474.80 ( -22.70%)
>
> Reported-by: Vlastimil Babka <vbabka@suse.cz>
> Fixes: a211c6550efc ("mm: page_alloc: defrag_mode kswapd/kcompactd watermarks")
> Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
Reviewed-by: Vlastimil Babka <vbabka@suse.cz>
Thanks!