linux-mm.kvack.org archive mirror
* [PATCH 0/2] mm: two allocator fixes for 6.15
@ 2025-04-16 13:45 Johannes Weiner
  2025-04-16 13:45 ` [PATCH 1/2] mm: vmscan: restore high-cpu watermark safety in kswapd Johannes Weiner
  2025-04-16 13:45 ` [PATCH 2/2] mm: vmscan: fix kswapd exit condition in defrag_mode Johannes Weiner
  0 siblings, 2 replies; 5+ messages in thread
From: Johannes Weiner @ 2025-04-16 13:45 UTC (permalink / raw)
  To: Andrew Morton; +Cc: Vlastimil Babka, Brendan Jackman, linux-mm, linux-kernel

Two fixes based on Vlastimil's review of the defrag_mode patches.

#1 fixes a bug that can lead to memory deadlocks on high-CPU-count
machines. This affects not just defrag_mode.

#2 fixes an overreclaim issue when defrag_mode is enabled.

Based on:

  commit 16176182efbf3dfce6dad18dcc8801164329d1c2 (akpm/mm-hotfixes-unstable)
  Author: Uros Bizjak <ubizjak@gmail.com>
  Date:   Fri Apr 4 12:24:37 2025 +0200

 include/linux/mmzone.h |  2 --
 mm/page_alloc.c        | 12 ------------
 mm/vmscan.c            | 29 ++++++++++++++++++++++++++---
 3 files changed, 26 insertions(+), 17 deletions(-)




* [PATCH 1/2] mm: vmscan: restore high-cpu watermark safety in kswapd
  2025-04-16 13:45 [PATCH 0/2] mm: two allocator fixes for 6.15 Johannes Weiner
@ 2025-04-16 13:45 ` Johannes Weiner
  2025-04-16 14:53   ` Vlastimil Babka
  2025-04-16 13:45 ` [PATCH 2/2] mm: vmscan: fix kswapd exit condition in defrag_mode Johannes Weiner
  1 sibling, 1 reply; 5+ messages in thread
From: Johannes Weiner @ 2025-04-16 13:45 UTC (permalink / raw)
  To: Andrew Morton; +Cc: Vlastimil Babka, Brendan Jackman, linux-mm, linux-kernel

Vlastimil points out that commit a211c6550efc ("mm: page_alloc:
defrag_mode kswapd/kcompactd watermarks") switched kswapd from
zone_watermark_ok_safe() to the standard, percpu-cached version of
reading free pages, thus dropping the watermark safety precautions for
systems with high CPU counts (e.g. >212 cpus on 64G). Restore them.

Since zone_watermark_ok_safe() is no longer the right interface, and
this was the last caller of the function anyway, open-code the
zone_page_state_snapshot() conditional and delete the function.

Reported-by: Vlastimil Babka <vbabka@suse.cz>
Fixes: a211c6550efc ("mm: page_alloc: defrag_mode kswapd/kcompactd watermarks")
Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
---
 include/linux/mmzone.h |  2 --
 mm/page_alloc.c        | 12 ------------
 mm/vmscan.c            | 21 +++++++++++++++++++--
 3 files changed, 19 insertions(+), 16 deletions(-)

diff --git a/include/linux/mmzone.h b/include/linux/mmzone.h
index 4c95fcc9e9df..6ccec1bf2896 100644
--- a/include/linux/mmzone.h
+++ b/include/linux/mmzone.h
@@ -1502,8 +1502,6 @@ bool __zone_watermark_ok(struct zone *z, unsigned int order, unsigned long mark,
 bool zone_watermark_ok(struct zone *z, unsigned int order,
 		unsigned long mark, int highest_zoneidx,
 		unsigned int alloc_flags);
-bool zone_watermark_ok_safe(struct zone *z, unsigned int order,
-		unsigned long mark, int highest_zoneidx);
 /*
  * Memory initialization context, use to differentiate memory added by
  * the platform statically or via memory hotplug interface.
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index d7cfcfa2b077..928a81f67326 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -3470,18 +3470,6 @@ static inline bool zone_watermark_fast(struct zone *z, unsigned int order,
 	return false;
 }
 
-bool zone_watermark_ok_safe(struct zone *z, unsigned int order,
-			unsigned long mark, int highest_zoneidx)
-{
-	long free_pages = zone_page_state(z, NR_FREE_PAGES);
-
-	if (z->percpu_drift_mark && free_pages < z->percpu_drift_mark)
-		free_pages = zone_page_state_snapshot(z, NR_FREE_PAGES);
-
-	return __zone_watermark_ok(z, order, mark, highest_zoneidx, 0,
-								free_pages);
-}
-
 #ifdef CONFIG_NUMA
 int __read_mostly node_reclaim_distance = RECLAIM_DISTANCE;
 
diff --git a/mm/vmscan.c b/mm/vmscan.c
index b620d74b0f66..cc422ad830d6 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -6736,6 +6736,7 @@ static bool pgdat_balanced(pg_data_t *pgdat, int order, int highest_zoneidx)
 	 * meet watermarks.
 	 */
 	for_each_managed_zone_pgdat(zone, pgdat, i, highest_zoneidx) {
+		enum zone_stat_item item;
 		unsigned long free_pages;
 
 		if (sysctl_numa_balancing_mode & NUMA_BALANCING_MEMORY_TIERING)
@@ -6748,9 +6749,25 @@ static bool pgdat_balanced(pg_data_t *pgdat, int order, int highest_zoneidx)
 		 * blocks to avoid polluting allocator fallbacks.
 		 */
 		if (defrag_mode)
-			free_pages = zone_page_state(zone, NR_FREE_PAGES_BLOCKS);
+			item = NR_FREE_PAGES_BLOCKS;
 		else
-			free_pages = zone_page_state(zone, NR_FREE_PAGES);
+			item = NR_FREE_PAGES;
+
+		/*
+		 * When there is a high number of CPUs in the system,
+		 * the cumulative error from the vmstat per-cpu cache
+		 * can blur the line between the watermarks. In that
+		 * case, be safe and get an accurate snapshot.
+		 *
+		 * TODO: NR_FREE_PAGES_BLOCKS moves in steps of
+		 * pageblock_nr_pages, while the vmstat pcp threshold
+		 * is limited to 125. On many configurations that
+		 * counter won't actually be per-cpu cached. But keep
+		 * things simple for now; revisit when somebody cares.
+		 */
+		free_pages = zone_page_state(zone, item);
+		if (zone->percpu_drift_mark && free_pages < zone->percpu_drift_mark)
+			free_pages = zone_page_state_snapshot(zone, item);
 
 		if (__zone_watermark_ok(zone, order, mark, highest_zoneidx,
 					0, free_pages))
-- 
2.49.0




* [PATCH 2/2] mm: vmscan: fix kswapd exit condition in defrag_mode
  2025-04-16 13:45 [PATCH 0/2] mm: two allocator fixes for 6.15 Johannes Weiner
  2025-04-16 13:45 ` [PATCH 1/2] mm: vmscan: restore high-cpu watermark safety in kswapd Johannes Weiner
@ 2025-04-16 13:45 ` Johannes Weiner
  2025-04-16 14:54   ` Vlastimil Babka
  1 sibling, 1 reply; 5+ messages in thread
From: Johannes Weiner @ 2025-04-16 13:45 UTC (permalink / raw)
  To: Andrew Morton; +Cc: Vlastimil Babka, Brendan Jackman, linux-mm, linux-kernel

Vlastimil points out an issue with kswapd in defrag_mode not waking up
kcompactd reliably.

Background: When kswapd is woken for any higher-order request, it
initially checks those high-order watermarks to decide if work is
necessary. However, it cannot (efficiently) meet the contiguity goal of
such a request by itself. So once it has reclaimed a compaction gap,
it adjusts the request down to check for free order-0 pages, then
wakes kcompactd to coalesce them into larger blocks.

In defrag_mode, the initial watermark check analogously needs to be
made against free pageblocks. However, once kswapd drops the order to
hand off contiguity work, it also needs to fall back to base page
watermarks; otherwise it will keep reclaiming until whole blocks are
freed.

While it appears kcompactd is woken up frequently enough to do most of
the compaction work, kswapd ends up overreclaiming by quite a bit:

                                                     DEFRAGMODE     DEFRAGMODE-thispatch
Hugealloc Time mean                       79381.34 (    +0.00%)    88126.12 (   +11.02%)
Hugealloc Time stddev                     85852.16 (    +0.00%)   135366.75 (   +57.67%)
Kbuild Real time                            249.35 (    +0.00%)      226.71 (    -9.04%)
Kbuild User time                           1249.16 (    +0.00%)     1249.37 (    +0.02%)
Kbuild System time                          171.76 (    +0.00%)      166.93 (    -2.79%)
THP fault alloc                           51666.87 (    +0.00%)    52685.60 (    +1.97%)
THP fault fallback                        16970.00 (    +0.00%)    15951.87 (    -6.00%)
Direct compact fail                         166.53 (    +0.00%)      178.93 (    +7.40%)
Direct compact success                       17.13 (    +0.00%)        4.13 (   -71.69%)
Compact daemon scanned migrate          3095413.33 (    +0.00%)  9231239.53 (  +198.22%)
Compact daemon scanned free             2155966.53 (    +0.00%)  7053692.87 (  +227.17%)
Compact direct scanned migrate           265642.47 (    +0.00%)    68388.33 (   -74.26%)
Compact direct scanned free              130252.60 (    +0.00%)    55634.87 (   -57.29%)
Compact total migrate scanned           3361055.80 (    +0.00%)  9299627.87 (  +176.69%)
Compact total free scanned              2286219.13 (    +0.00%)  7109327.73 (  +210.96%)
Alloc stall                                1890.80 (    +0.00%)     6297.60 (  +232.94%)
Pages kswapd scanned                    9043558.80 (    +0.00%)  5952576.73 (   -34.18%)
Pages kswapd reclaimed                  1891708.67 (    +0.00%)  1030645.00 (   -45.52%)
Pages direct scanned                    1017090.60 (    +0.00%)  2688047.60 (  +164.29%)
Pages direct reclaimed                    92682.60 (    +0.00%)   309770.53 (  +234.22%)
Pages total scanned                    10060649.40 (    +0.00%)  8640624.33 (   -14.11%)
Pages total reclaimed                   1984391.27 (    +0.00%)  1340415.53 (   -32.45%)
Swap out                                 884585.73 (    +0.00%)   417781.93 (   -52.77%)
Swap in                                  287106.27 (    +0.00%)    95589.73 (   -66.71%)
File refaults                            551697.60 (    +0.00%)   426474.80 (   -22.70%)

Reported-by: Vlastimil Babka <vbabka@suse.cz>
Fixes: a211c6550efc ("mm: page_alloc: defrag_mode kswapd/kcompactd watermarks")
Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
---
 mm/vmscan.c | 8 +++++++-
 1 file changed, 7 insertions(+), 1 deletion(-)

diff --git a/mm/vmscan.c b/mm/vmscan.c
index cc422ad830d6..3783e45bfc92 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -6747,8 +6747,14 @@ static bool pgdat_balanced(pg_data_t *pgdat, int order, int highest_zoneidx)
 		/*
 		 * In defrag_mode, watermarks must be met in whole
 		 * blocks to avoid polluting allocator fallbacks.
+		 *
+		 * However, kswapd usually cannot accomplish this on
+		 * its own and needs kcompactd support. Once it's
+		 * reclaimed a compaction gap, and kswapd_shrink_node
+		 * has dropped order, simply ensure there are enough
+		 * base pages for compaction, wake kcompactd & sleep.
 		 */
-		if (defrag_mode)
+		if (defrag_mode && order)
 			item = NR_FREE_PAGES_BLOCKS;
 		else
 			item = NR_FREE_PAGES;
-- 
2.49.0




* Re: [PATCH 1/2] mm: vmscan: restore high-cpu watermark safety in kswapd
  2025-04-16 13:45 ` [PATCH 1/2] mm: vmscan: restore high-cpu watermark safety in kswapd Johannes Weiner
@ 2025-04-16 14:53   ` Vlastimil Babka
  0 siblings, 0 replies; 5+ messages in thread
From: Vlastimil Babka @ 2025-04-16 14:53 UTC (permalink / raw)
  To: Johannes Weiner, Andrew Morton; +Cc: Brendan Jackman, linux-mm, linux-kernel

On 4/16/25 15:45, Johannes Weiner wrote:
> Vlastimil points out that commit a211c6550efc ("mm: page_alloc:
> defrag_mode kswapd/kcompactd watermarks") switched kswapd from
> zone_watermark_ok_safe() to the standard, percpu-cached version of
> reading free pages, thus dropping the watermark safety precautions for
> systems with high CPU counts (e.g. >212 cpus on 64G). Restore them.
> 
> Since zone_watermark_ok_safe() is no longer the right interface, and
> this was the last caller of the function anyway, open-code the
> zone_page_state_snapshot() conditional and delete the function.
> 
> Reported-by: Vlastimil Babka <vbabka@suse.cz>
> Fixes: a211c6550efc ("mm: page_alloc: defrag_mode kswapd/kcompactd watermarks")
> Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>

Reviewed-by: Vlastimil Babka <vbabka@suse.cz>





* Re: [PATCH 2/2] mm: vmscan: fix kswapd exit condition in defrag_mode
  2025-04-16 13:45 ` [PATCH 2/2] mm: vmscan: fix kswapd exit condition in defrag_mode Johannes Weiner
@ 2025-04-16 14:54   ` Vlastimil Babka
  0 siblings, 0 replies; 5+ messages in thread
From: Vlastimil Babka @ 2025-04-16 14:54 UTC (permalink / raw)
  To: Johannes Weiner, Andrew Morton; +Cc: Brendan Jackman, linux-mm, linux-kernel

On 4/16/25 15:45, Johannes Weiner wrote:
> Vlastimil points out an issue with kswapd in defrag_mode not waking up
> kcompactd reliably.
> 
> Background: When kswapd is woken for any higher-order request, it
> initially checks those high-order watermarks to decide if work is
> necessary. However, it cannot (efficiently) meet the contiguity goal of
> such a request by itself. So once it has reclaimed a compaction gap,
> it adjusts the request down to check for free order-0 pages, then
> wakes kcompactd to coalesce them into larger blocks.
> 
> In defrag_mode, the initial watermark check needs to be analogously
> against free pageblocks. However, once kswapd drops the high-order to
> hand off contiguity work, it also needs to fall back to base page
> watermarks - otherwise it'll keep reclaiming until blocks are freed.
> 
> While it appears kcompactd is woken up frequently enough to do most of
> the compaction work, kswapd ends up overreclaiming by quite a bit:
> 
>                                                      DEFRAGMODE     DEFRAGMODE-thispatch
> Hugealloc Time mean                       79381.34 (    +0.00%)    88126.12 (   +11.02%)
> Hugealloc Time stddev                     85852.16 (    +0.00%)   135366.75 (   +57.67%)
> Kbuild Real time                            249.35 (    +0.00%)      226.71 (    -9.04%)
> Kbuild User time                           1249.16 (    +0.00%)     1249.37 (    +0.02%)
> Kbuild System time                          171.76 (    +0.00%)      166.93 (    -2.79%)
> THP fault alloc                           51666.87 (    +0.00%)    52685.60 (    +1.97%)
> THP fault fallback                        16970.00 (    +0.00%)    15951.87 (    -6.00%)
> Direct compact fail                         166.53 (    +0.00%)      178.93 (    +7.40%)
> Direct compact success                       17.13 (    +0.00%)        4.13 (   -71.69%)
> Compact daemon scanned migrate          3095413.33 (    +0.00%)  9231239.53 (  +198.22%)
> Compact daemon scanned free             2155966.53 (    +0.00%)  7053692.87 (  +227.17%)
> Compact direct scanned migrate           265642.47 (    +0.00%)    68388.33 (   -74.26%)
> Compact direct scanned free              130252.60 (    +0.00%)    55634.87 (   -57.29%)
> Compact total migrate scanned           3361055.80 (    +0.00%)  9299627.87 (  +176.69%)
> Compact total free scanned              2286219.13 (    +0.00%)  7109327.73 (  +210.96%)
> Alloc stall                                1890.80 (    +0.00%)     6297.60 (  +232.94%)
> Pages kswapd scanned                    9043558.80 (    +0.00%)  5952576.73 (   -34.18%)
> Pages kswapd reclaimed                  1891708.67 (    +0.00%)  1030645.00 (   -45.52%)
> Pages direct scanned                    1017090.60 (    +0.00%)  2688047.60 (  +164.29%)
> Pages direct reclaimed                    92682.60 (    +0.00%)   309770.53 (  +234.22%)
> Pages total scanned                    10060649.40 (    +0.00%)  8640624.33 (   -14.11%)
> Pages total reclaimed                   1984391.27 (    +0.00%)  1340415.53 (   -32.45%)
> Swap out                                 884585.73 (    +0.00%)   417781.93 (   -52.77%)
> Swap in                                  287106.27 (    +0.00%)    95589.73 (   -66.71%)
> File refaults                            551697.60 (    +0.00%)   426474.80 (   -22.70%)
> 
> Reported-by: Vlastimil Babka <vbabka@suse.cz>
> Fixes: a211c6550efc ("mm: page_alloc: defrag_mode kswapd/kcompactd watermarks")
> Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>

Reviewed-by: Vlastimil Babka <vbabka@suse.cz>

Thanks!


