* [PATCH 0/5] mm: reliable huge page allocator
@ 2025-03-13 21:05 Johannes Weiner
2025-03-13 21:05 ` [PATCH 1/5] mm: compaction: push watermark into compaction_suitable() callers Johannes Weiner
` (4 more replies)
0 siblings, 5 replies; 32+ messages in thread
From: Johannes Weiner @ 2025-03-13 21:05 UTC (permalink / raw)
To: Andrew Morton; +Cc: Vlastimil Babka, Mel Gorman, Zi Yan, linux-mm, linux-kernel
This series changes the allocator and the reclaim/compaction code to
try harder to avoid fragmentation. As a result, huge page allocations
become cheaper, more reliable and more sustainable.
It's a subset of the huge page allocator RFC initially proposed here:
https://lore.kernel.org/lkml/20230418191313.268131-1-hannes@cmpxchg.org/
The following results are from a kernel build test, with additional
concurrent bursts of THP allocations on a memory-constrained system.
Comparing before and after the changes over 15 runs:
before after
Hugealloc Time mean 52739.45 ( +0.00%) 28904.00 ( -45.19%)
Hugealloc Time stddev 56541.26 ( +0.00%) 33464.37 ( -40.81%)
Kbuild Real time 197.47 ( +0.00%) 196.59 ( -0.44%)
Kbuild User time 1240.49 ( +0.00%) 1231.67 ( -0.71%)
Kbuild System time 70.08 ( +0.00%) 59.10 ( -15.45%)
THP fault alloc 46727.07 ( +0.00%) 63223.67 ( +35.30%)
THP fault fallback 21910.60 ( +0.00%) 5412.47 ( -75.29%)
Direct compact fail 195.80 ( +0.00%) 59.07 ( -69.48%)
Direct compact success 7.93 ( +0.00%) 2.80 ( -57.46%)
Direct compact success rate % 3.51 ( +0.00%) 3.99 ( +10.49%)
Compact daemon scanned migrate 3369601.27 ( +0.00%) 2267500.33 ( -32.71%)
Compact daemon scanned free 5075474.47 ( +0.00%) 2339773.00 ( -53.90%)
Compact direct scanned migrate 161787.27 ( +0.00%) 47659.93 ( -70.54%)
Compact direct scanned free 163467.53 ( +0.00%) 40729.67 ( -75.08%)
Compact total migrate scanned 3531388.53 ( +0.00%) 2315160.27 ( -34.44%)
Compact total free scanned 5238942.00 ( +0.00%) 2380502.67 ( -54.56%)
Alloc stall 2371.07 ( +0.00%) 638.87 ( -73.02%)
Pages kswapd scanned 2160926.73 ( +0.00%) 4002186.33 ( +85.21%)
Pages kswapd reclaimed 533191.07 ( +0.00%) 718577.80 ( +34.77%)
Pages direct scanned 400450.33 ( +0.00%) 355172.73 ( -11.31%)
Pages direct reclaimed 94441.73 ( +0.00%) 31162.80 ( -67.00%)
Pages total scanned 2561377.07 ( +0.00%) 4357359.07 ( +70.12%)
Pages total reclaimed 627632.80 ( +0.00%) 749740.60 ( +19.46%)
Swap out 47959.53 ( +0.00%) 110084.33 ( +129.53%)
Swap in 7276.00 ( +0.00%) 24457.00 ( +236.10%)
File refaults 138043.00 ( +0.00%) 188226.93 ( +36.35%)
THP latencies are cut in half, and failure rates are cut by 75%. These
metrics also hold up over time, while the vanilla kernel sees a steady
downward trend in success rates with each subsequent run, owing to the
cumulative effects of fragmentation.
A more detailed discussion of results is in the patch changelogs.
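As an aside, the THP success and failure rates above are computed from
the thp_fault_alloc and thp_fault_fallback counters in /proc/vmstat. A
minimal userspace sketch of that calculation (illustrative only, not
part of the series):

#include <stdio.h>
#include <string.h>

/* Print the THP fault success rate from /proc/vmstat counters. */
int main(void)
{
	char name[64];
	unsigned long long val, alloc = 0, fallback = 0;
	FILE *f = fopen("/proc/vmstat", "r");

	if (!f)
		return 1;
	while (fscanf(f, "%63s %llu", name, &val) == 2) {
		if (!strcmp(name, "thp_fault_alloc"))
			alloc = val;
		else if (!strcmp(name, "thp_fault_fallback"))
			fallback = val;
	}
	fclose(f);
	if (alloc + fallback)
		printf("THP fault success rate: %.1f%%\n",
		       100.0 * alloc / (alloc + fallback));
	return 0;
}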
The patches first introduce a vm.defrag_mode sysctl, which enforces
the existing ALLOC_NOFRAGMENT alloc flag until after reclaim and
compaction have run. They then change kswapd and kcompactd to target
pageblocks, which boosts success in the ALLOC_NOFRAGMENT hotpaths.
Main differences to the RFC:
- The freelist hygiene patches have since been upstreamed separately.
- The RFC version would prohibit fallbacks entirely, and make
pageblock reclaim and compaction mandatory for all allocation
contexts. This opens up a large dependency graph: compaction support
for all allocation contexts, possibly remaining sources of pollution,
and the handling of low-memory situations, OOMs and deadlocks.
This version uses only kswapd & kcompactd to pre-produce pageblocks,
while still allowing last-ditch fallbacks to avoid memory deadlocks.
The long-term goal remains converging on the version proposed in the
RFC and its ~100% THP success rate. But this is reserved for future
iterations that can build on the changes proposed here.
- The RFC version proposed a new MIGRATE_FREE type as well as
per-migratetype counters. This allowed making compaction more
efficient, and the pre-compaction gap checks more precise, but again
at the cost of complex changes in an already invasive series.
This series simply uses a new vmstat counter to track the number of
free pages in whole blocks to base reclaim/compaction goals on.
- The behavior is opt-in and can be toggled at runtime. The risk for
regressions with any allocator change is sizable, and while many
users care about huge pages, obviously not all do. A runtime knob is
warranted to make the behavior optional and provide an escape hatch.
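The knob is a plain sysctl, so it can be flipped at runtime or from an
init script. A minimal sketch of enabling it from userspace (the helper
is illustrative; the /proc/sys/vm/defrag_mode path follows from the
sysctl added in patch 3):

#include <fcntl.h>
#include <unistd.h>

/* Illustrative only: enable vm.defrag_mode (requires root). */
static int enable_defrag_mode(void)
{
	int fd = open("/proc/sys/vm/defrag_mode", O_WRONLY);

	if (fd < 0)
		return -1;
	if (write(fd, "1", 1) != 1) {
		close(fd);
		return -1;
	}
	return close(fd);
}

int main(void)
{
	return enable_defrag_mode() ? 1 : 0;
}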
Based on today's akpm/mm-unstable.
Patches #1 and #2 are somewhat unrelated cleanups, but they touch the
same code and so are included here to avoid conflicts from re-ordering.
Documentation/admin-guide/sysctl/vm.rst | 9 ++++
include/linux/compaction.h | 5 +-
include/linux/mmzone.h | 1 +
mm/compaction.c | 87 ++++++++++++++++++++-----------
mm/internal.h | 1 +
mm/page_alloc.c | 72 +++++++++++++++++++++----
mm/vmscan.c | 41 ++++++++++-----
mm/vmstat.c | 1 +
8 files changed, 161 insertions(+), 56 deletions(-)
* [PATCH 1/5] mm: compaction: push watermark into compaction_suitable() callers
2025-03-13 21:05 [PATCH 0/5] mm: reliable huge page allocator Johannes Weiner
@ 2025-03-13 21:05 ` Johannes Weiner
2025-03-14 15:08 ` Zi Yan
` (3 more replies)
2025-03-13 21:05 ` [PATCH 2/5] mm: page_alloc: trace type pollution from compaction capturing Johannes Weiner
` (3 subsequent siblings)
4 siblings, 4 replies; 32+ messages in thread
From: Johannes Weiner @ 2025-03-13 21:05 UTC (permalink / raw)
To: Andrew Morton; +Cc: Vlastimil Babka, Mel Gorman, Zi Yan, linux-mm, linux-kernel
compaction_suitable() hardcodes the min watermark, with a boost to the
low watermark for costly orders. However, compaction_ready() requires
order-0 at the high watermark. It currently checks the marks twice.
Make the watermark a parameter to compaction_suitable() and have the
callers pass in what they require:
- compaction_zonelist_suitable() is used by the direct reclaim path,
so use the min watermark.
- compact_suit_allocation_order() has a watermark in context derived
from cc->alloc_flags.
The only quirk is that kcompactd doesn't initialize cc->alloc_flags
explicitly. There is a direct check in kcompactd_do_work() that
passes ALLOC_WMARK_MIN, but there is another check downstack in
compact_zone() that ends up passing the unset alloc_flags. Since
they default to 0, and that coincides with ALLOC_WMARK_MIN, it is
correct. But it's subtle. Set cc->alloc_flags explicitly.
- should_continue_reclaim() is direct reclaim, use the min watermark.
- Finally, consolidate the two checks in compaction_ready() to a
single compaction_suitable() call passing the high watermark.
There is a tiny change in behavior: before, compaction_suitable()
would check order-0 against min or low, depending on costly
order. Then there'd be another high watermark check.
Now, the high watermark is passed to compaction_suitable(), and the
costly order-boost (low - min) is added on top. This means
compaction_ready() sets a marginally higher target for free pages.
In a kernelbuild + THP pressure test, though, this didn't show any
measurable negative effects on memory pressure or reclaim rates. As
the comment above the check says, reclaim is usually stopped short
on should_continue_reclaim(), and this just defines the worst-case
reclaim cutoff in case compaction is not making any headway.
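To put rough numbers on the new cutoff (illustrative watermark values;
compact_gap(order) is 2UL << order):

#include <stdio.h>

/* Worked example of the reclaim cutoff change in compaction_ready(). */
int main(void)
{
	unsigned long min = 8000, low = 10000, high = 12000; /* assumed zone watermarks, in pages */
	unsigned int order = 9;                              /* PMD-sized THP on x86-64 */
	unsigned long gap = 2UL << order;                    /* compact_gap(order) */

	/* before: order-0 free pages checked against high + compact_gap */
	unsigned long old_cutoff = high + gap;
	/* after: high watermark passed in, plus the costly-order boost (low - min) */
	unsigned long new_cutoff = high + gap + (low - min);

	printf("old cutoff: %lu pages\n", old_cutoff); /* 13024 */
	printf("new cutoff: %lu pages\n", new_cutoff); /* 15024 */
	return 0;
}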
Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
---
include/linux/compaction.h | 5 ++--
mm/compaction.c | 52 ++++++++++++++++++++------------------
mm/vmscan.c | 26 ++++++++++---------
3 files changed, 45 insertions(+), 38 deletions(-)
diff --git a/include/linux/compaction.h b/include/linux/compaction.h
index 7bf0c521db63..173d9c07a895 100644
--- a/include/linux/compaction.h
+++ b/include/linux/compaction.h
@@ -95,7 +95,7 @@ extern enum compact_result try_to_compact_pages(gfp_t gfp_mask,
struct page **page);
extern void reset_isolation_suitable(pg_data_t *pgdat);
extern bool compaction_suitable(struct zone *zone, int order,
- int highest_zoneidx);
+ unsigned long watermark, int highest_zoneidx);
extern void compaction_defer_reset(struct zone *zone, int order,
bool alloc_success);
@@ -113,7 +113,8 @@ static inline void reset_isolation_suitable(pg_data_t *pgdat)
}
static inline bool compaction_suitable(struct zone *zone, int order,
- int highest_zoneidx)
+ unsigned long watermark,
+ int highest_zoneidx)
{
return false;
}
diff --git a/mm/compaction.c b/mm/compaction.c
index 550ce5021807..036353ef1878 100644
--- a/mm/compaction.c
+++ b/mm/compaction.c
@@ -2382,40 +2382,42 @@ static enum compact_result compact_finished(struct compact_control *cc)
}
static bool __compaction_suitable(struct zone *zone, int order,
- int highest_zoneidx,
- unsigned long wmark_target)
+ unsigned long watermark, int highest_zoneidx,
+ unsigned long free_pages)
{
- unsigned long watermark;
/*
* Watermarks for order-0 must be met for compaction to be able to
* isolate free pages for migration targets. This means that the
- * watermark and alloc_flags have to match, or be more pessimistic than
- * the check in __isolate_free_page(). We don't use the direct
- * compactor's alloc_flags, as they are not relevant for freepage
- * isolation. We however do use the direct compactor's highest_zoneidx
- * to skip over zones where lowmem reserves would prevent allocation
- * even if compaction succeeds.
- * For costly orders, we require low watermark instead of min for
- * compaction to proceed to increase its chances.
+ * watermark have to match, or be more pessimistic than the check in
+ * __isolate_free_page().
+ *
+ * For costly orders, we require a higher watermark for compaction to
+ * proceed to increase its chances.
+ *
+ * We use the direct compactor's highest_zoneidx to skip over zones
+ * where lowmem reserves would prevent allocation even if compaction
+ * succeeds.
+ *
* ALLOC_CMA is used, as pages in CMA pageblocks are considered
- * suitable migration targets
+ * suitable migration targets.
*/
- watermark = (order > PAGE_ALLOC_COSTLY_ORDER) ?
- low_wmark_pages(zone) : min_wmark_pages(zone);
watermark += compact_gap(order);
+ if (order > PAGE_ALLOC_COSTLY_ORDER)
+ watermark += low_wmark_pages(zone) - min_wmark_pages(zone);
return __zone_watermark_ok(zone, 0, watermark, highest_zoneidx,
- ALLOC_CMA, wmark_target);
+ ALLOC_CMA, free_pages);
}
/*
* compaction_suitable: Is this suitable to run compaction on this zone now?
*/
-bool compaction_suitable(struct zone *zone, int order, int highest_zoneidx)
+bool compaction_suitable(struct zone *zone, int order, unsigned long watermark,
+ int highest_zoneidx)
{
enum compact_result compact_result;
bool suitable;
- suitable = __compaction_suitable(zone, order, highest_zoneidx,
+ suitable = __compaction_suitable(zone, order, watermark, highest_zoneidx,
zone_page_state(zone, NR_FREE_PAGES));
/*
* fragmentation index determines if allocation failures are due to
@@ -2453,6 +2455,7 @@ bool compaction_suitable(struct zone *zone, int order, int highest_zoneidx)
return suitable;
}
+/* Used by direct reclaimers */
bool compaction_zonelist_suitable(struct alloc_context *ac, int order,
int alloc_flags)
{
@@ -2475,8 +2478,8 @@ bool compaction_zonelist_suitable(struct alloc_context *ac, int order,
*/
available = zone_reclaimable_pages(zone) / order;
available += zone_page_state_snapshot(zone, NR_FREE_PAGES);
- if (__compaction_suitable(zone, order, ac->highest_zoneidx,
- available))
+ if (__compaction_suitable(zone, order, min_wmark_pages(zone),
+ ac->highest_zoneidx, available))
return true;
}
@@ -2513,13 +2516,13 @@ compaction_suit_allocation_order(struct zone *zone, unsigned int order,
*/
if (order > PAGE_ALLOC_COSTLY_ORDER && async &&
!(alloc_flags & ALLOC_CMA)) {
- watermark = low_wmark_pages(zone) + compact_gap(order);
- if (!__zone_watermark_ok(zone, 0, watermark, highest_zoneidx,
- 0, zone_page_state(zone, NR_FREE_PAGES)))
+ if (!__zone_watermark_ok(zone, 0, watermark + compact_gap(order),
+ highest_zoneidx, 0,
+ zone_page_state(zone, NR_FREE_PAGES)))
return COMPACT_SKIPPED;
}
- if (!compaction_suitable(zone, order, highest_zoneidx))
+ if (!compaction_suitable(zone, order, watermark, highest_zoneidx))
return COMPACT_SKIPPED;
return COMPACT_CONTINUE;
@@ -3082,6 +3085,7 @@ static void kcompactd_do_work(pg_data_t *pgdat)
.mode = MIGRATE_SYNC_LIGHT,
.ignore_skip_hint = false,
.gfp_mask = GFP_KERNEL,
+ .alloc_flags = ALLOC_WMARK_MIN,
};
enum compact_result ret;
@@ -3100,7 +3104,7 @@ static void kcompactd_do_work(pg_data_t *pgdat)
continue;
ret = compaction_suit_allocation_order(zone,
- cc.order, zoneid, ALLOC_WMARK_MIN,
+ cc.order, zoneid, cc.alloc_flags,
false);
if (ret != COMPACT_CONTINUE)
continue;
diff --git a/mm/vmscan.c b/mm/vmscan.c
index 2bc740637a6c..3370bdca6868 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -5890,12 +5890,15 @@ static inline bool should_continue_reclaim(struct pglist_data *pgdat,
/* If compaction would go ahead or the allocation would succeed, stop */
for_each_managed_zone_pgdat(zone, pgdat, z, sc->reclaim_idx) {
+ unsigned long watermark = min_wmark_pages(zone);
+
/* Allocation can already succeed, nothing to do */
- if (zone_watermark_ok(zone, sc->order, min_wmark_pages(zone),
+ if (zone_watermark_ok(zone, sc->order, watermark,
sc->reclaim_idx, 0))
return false;
- if (compaction_suitable(zone, sc->order, sc->reclaim_idx))
+ if (compaction_suitable(zone, sc->order, watermark,
+ sc->reclaim_idx))
return false;
}
@@ -6122,22 +6125,21 @@ static inline bool compaction_ready(struct zone *zone, struct scan_control *sc)
sc->reclaim_idx, 0))
return true;
- /* Compaction cannot yet proceed. Do reclaim. */
- if (!compaction_suitable(zone, sc->order, sc->reclaim_idx))
- return false;
-
/*
- * Compaction is already possible, but it takes time to run and there
- * are potentially other callers using the pages just freed. So proceed
- * with reclaim to make a buffer of free pages available to give
- * compaction a reasonable chance of completing and allocating the page.
+ * Direct reclaim usually targets the min watermark, but compaction
+ * takes time to run and there are potentially other callers using the
+ * pages just freed. So target a higher buffer to give compaction a
+ * reasonable chance of completing and allocating the pages.
+ *
* Note that we won't actually reclaim the whole buffer in one attempt
* as the target watermark in should_continue_reclaim() is lower. But if
* we are already above the high+gap watermark, don't reclaim at all.
*/
- watermark = high_wmark_pages(zone) + compact_gap(sc->order);
+ watermark = high_wmark_pages(zone);
+ if (compaction_suitable(zone, sc->order, watermark, sc->reclaim_idx))
+ return true;
- return zone_watermark_ok_safe(zone, 0, watermark, sc->reclaim_idx);
+ return false;
}
static void consider_reclaim_throttle(pg_data_t *pgdat, struct scan_control *sc)
--
2.48.1
* [PATCH 2/5] mm: page_alloc: trace type pollution from compaction capturing
2025-03-13 21:05 [PATCH 0/5] mm: reliable huge page allocator Johannes Weiner
2025-03-13 21:05 ` [PATCH 1/5] mm: compaction: push watermark into compaction_suitable() callers Johannes Weiner
@ 2025-03-13 21:05 ` Johannes Weiner
2025-03-14 18:36 ` Zi Yan
2025-03-13 21:05 ` [PATCH 3/5] mm: page_alloc: defrag_mode Johannes Weiner
` (2 subsequent siblings)
4 siblings, 1 reply; 32+ messages in thread
From: Johannes Weiner @ 2025-03-13 21:05 UTC (permalink / raw)
To: Andrew Morton; +Cc: Vlastimil Babka, Mel Gorman, Zi Yan, linux-mm, linux-kernel
When the page allocator places pages of a certain migratetype into
blocks of another type, it has lasting effects on the ability to
compact and defragment down the line. For improving placement and
compaction, visibility into such events is crucial.
The most common case, allocator fallbacks, is already annotated, but
compaction capturing is also allowed to grab pages of a different
type. Extend the tracepoint to cover this case.
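For observing these events, the existing mm_page_alloc_extfrag
tracepoint in the kmem group can be enabled through tracefs. A minimal
sketch, assuming tracefs is mounted at /sys/kernel/tracing and the
program runs as root:

#include <stdio.h>

/* Enable the extfrag tracepoint and stream events (illustrative). */
int main(void)
{
	FILE *en = fopen("/sys/kernel/tracing/events/kmem/mm_page_alloc_extfrag/enable", "w");
	FILE *pipe;
	char line[512];

	if (!en)
		return 1;
	fputs("1\n", en);
	fclose(en);

	pipe = fopen("/sys/kernel/tracing/trace_pipe", "r");
	if (!pipe)
		return 1;
	while (fgets(line, sizeof(line), pipe))
		fputs(line, stdout);	/* one line per fallback / capture event */
	fclose(pipe);
	return 0;
}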
Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
---
mm/page_alloc.c | 4 ++++
1 file changed, 4 insertions(+)
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index 9b4a5e6dfee9..6f0404941886 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -614,6 +614,10 @@ compaction_capture(struct capture_control *capc, struct page *page,
capc->cc->migratetype != MIGRATE_MOVABLE)
return false;
+ if (migratetype != capc->cc->migratetype)
+ trace_mm_page_alloc_extfrag(page, capc->cc->order, order,
+ capc->cc->migratetype, migratetype);
+
capc->page = page;
return true;
}
--
2.48.1
* [PATCH 3/5] mm: page_alloc: defrag_mode
2025-03-13 21:05 [PATCH 0/5] mm: reliable huge page allocator Johannes Weiner
2025-03-13 21:05 ` [PATCH 1/5] mm: compaction: push watermark into compaction_suitable() callers Johannes Weiner
2025-03-13 21:05 ` [PATCH 2/5] mm: page_alloc: trace type pollution from compaction capturing Johannes Weiner
@ 2025-03-13 21:05 ` Johannes Weiner
2025-03-14 18:54 ` Zi Yan
2025-03-22 15:05 ` Brendan Jackman
2025-03-13 21:05 ` [PATCH 4/5] mm: page_alloc: defrag_mode kswapd/kcompactd assistance Johannes Weiner
2025-03-13 21:05 ` [PATCH 5/5] mm: page_alloc: defrag_mode kswapd/kcompactd watermarks Johannes Weiner
4 siblings, 2 replies; 32+ messages in thread
From: Johannes Weiner @ 2025-03-13 21:05 UTC (permalink / raw)
To: Andrew Morton; +Cc: Vlastimil Babka, Mel Gorman, Zi Yan, linux-mm, linux-kernel
The page allocator groups requests by migratetype to stave off
fragmentation. However, in practice this is routinely defeated by the
fact that it gives up *before* invoking reclaim and compaction - which
may well produce suitable pages. As a result, fragmentation of
physical memory is a common ongoing process in many load scenarios.
Fragmentation deteriorates compaction's ability to produce huge
pages. Depending on the lifetime of the fragmenting allocations, those
effects can be long-lasting or even permanent, requiring drastic
measures like forcible idle states or even reboots as the only
reliable ways to recover the address space for THP production.
In a kernel build test with supplemental THP pressure, the THP
allocation rate steadily declines over 15 runs:
thp_fault_alloc
61988
56474
57258
50187
52388
55409
52925
47648
43669
40621
36077
41721
36685
34641
33215
This is a hurdle in adopting THP in any environment where hosts are
shared between multiple overlapping workloads (cloud environments),
and rarely experience true idle periods. To make THP a reliable and
predictable optimization, there needs to be a stronger guarantee to
avoid such fragmentation.
Introduce defrag_mode. When enabled, reclaim/compaction is invoked to
its full extent *before* falling back. Specifically, ALLOC_NOFRAGMENT
is enforced on the allocator fastpath and the reclaiming slowpath.
For now, fallbacks are permitted to avert OOMs. There is a plan to add
defrag_mode=2 to prefer OOMs over fragmentation, but this requires
additional prep work in compaction and the reserve management to make
it ready for all possible allocation contexts.
The following test results are from a kernel build with periodic
bursts of THP allocations, over 15 runs:
vanilla defrag_mode=1
@claimer[unmovable]: 189 103
@claimer[movable]: 92 103
@claimer[reclaimable]: 207 61
@pollute[unmovable from movable]: 25 0
@pollute[unmovable from reclaimable]: 28 0
@pollute[movable from unmovable]: 38835 0
@pollute[movable from reclaimable]: 147136 0
@pollute[reclaimable from unmovable]: 178 0
@pollute[reclaimable from movable]: 33 0
@steal[unmovable from movable]: 11 0
@steal[unmovable from reclaimable]: 5 0
@steal[reclaimable from unmovable]: 107 0
@steal[reclaimable from movable]: 90 0
@steal[movable from reclaimable]: 354 0
@steal[movable from unmovable]: 130 0
Both types of polluting fallbacks are eliminated in this workload.
Interestingly, whole block conversions are reduced as well. This is
because once a block is claimed for a type, its empty space remains
available for future allocations, instead of being padded with
fallbacks; this allows the native type to group up instead of
spreading out to new blocks. The assumption in the allocator has been
that pollution from movable allocations is less harmful than from
other types, since they can be reclaimed or migrated out should the
space be needed. However, since fallbacks occur *before*
reclaim/compaction is invoked, movable pollution will still cause
non-movable allocations to spread out and claim more blocks.
Without fragmentation, THP rates hold steady with defrag_mode=1:
thp_fault_alloc
32478
20725
45045
32130
14018
21711
40791
29134
34458
45381
28305
17265
22584
28454
30850
While the downward trend is eliminated, the keen reader will of course
notice that the baseline rate is much smaller than the vanilla
kernel's to begin with. This is due to deficiencies in how reclaim and
compaction are currently driven: ALLOC_NOFRAGMENT increases the extent
to which smaller allocations are competing with THPs for pageblocks,
while making no effort themselves to reclaim or compact beyond their
own request size. This effect already exists with the current usage of
ALLOC_NOFRAGMENT, but is amplified by defrag_mode insisting on whole
block stealing much more strongly.
Subsequent patches will address defrag_mode reclaim strategy to raise
the THP success baseline above the vanilla kernel.
Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
---
Documentation/admin-guide/sysctl/vm.rst | 9 +++++++++
mm/page_alloc.c | 27 +++++++++++++++++++++++--
2 files changed, 34 insertions(+), 2 deletions(-)
diff --git a/Documentation/admin-guide/sysctl/vm.rst b/Documentation/admin-guide/sysctl/vm.rst
index ec6343ee4248..e169dbf48180 100644
--- a/Documentation/admin-guide/sysctl/vm.rst
+++ b/Documentation/admin-guide/sysctl/vm.rst
@@ -29,6 +29,7 @@ files can be found in mm/swap.c.
- compaction_proactiveness
- compaction_proactiveness_leeway
- compact_unevictable_allowed
+- defrag_mode
- dirty_background_bytes
- dirty_background_ratio
- dirty_bytes
@@ -162,6 +163,14 @@ On CONFIG_PREEMPT_RT the default value is 0 in order to avoid a page fault, due
to compaction, which would block the task from becoming active until the fault
is resolved.
+defrag_mode
+===========
+
+When set to 1, the page allocator tries harder to avoid fragmentation
+and maintain the ability to produce huge pages / higher-order pages.
+
+It is recommended to enable this right after boot, as fragmentation,
+once it occurred, can be long-lasting or even permanent.
dirty_background_bytes
======================
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index 6f0404941886..9a02772c2461 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -273,6 +273,7 @@ int min_free_kbytes = 1024;
int user_min_free_kbytes = -1;
static int watermark_boost_factor __read_mostly = 15000;
static int watermark_scale_factor = 10;
+static int defrag_mode;
/* movable_zone is the "real" zone pages in ZONE_MOVABLE are taken from */
int movable_zone;
@@ -3389,6 +3390,11 @@ alloc_flags_nofragment(struct zone *zone, gfp_t gfp_mask)
*/
alloc_flags = (__force int) (gfp_mask & __GFP_KSWAPD_RECLAIM);
+ if (defrag_mode) {
+ alloc_flags |= ALLOC_NOFRAGMENT;
+ return alloc_flags;
+ }
+
#ifdef CONFIG_ZONE_DMA32
if (!zone)
return alloc_flags;
@@ -3480,7 +3486,7 @@ get_page_from_freelist(gfp_t gfp_mask, unsigned int order, int alloc_flags,
continue;
}
- if (no_fallback && nr_online_nodes > 1 &&
+ if (no_fallback && !defrag_mode && nr_online_nodes > 1 &&
zone != zonelist_zone(ac->preferred_zoneref)) {
int local_nid;
@@ -3591,7 +3597,7 @@ get_page_from_freelist(gfp_t gfp_mask, unsigned int order, int alloc_flags,
* It's possible on a UMA machine to get through all zones that are
* fragmented. If avoiding fragmentation, reset and try again.
*/
- if (no_fallback) {
+ if (no_fallback && !defrag_mode) {
alloc_flags &= ~ALLOC_NOFRAGMENT;
goto retry;
}
@@ -4128,6 +4134,9 @@ gfp_to_alloc_flags(gfp_t gfp_mask, unsigned int order)
alloc_flags = gfp_to_alloc_flags_cma(gfp_mask, alloc_flags);
+ if (defrag_mode)
+ alloc_flags |= ALLOC_NOFRAGMENT;
+
return alloc_flags;
}
@@ -4510,6 +4519,11 @@ __alloc_pages_slowpath(gfp_t gfp_mask, unsigned int order,
&compaction_retries))
goto retry;
+ /* Reclaim/compaction failed to prevent the fallback */
+ if (defrag_mode && (alloc_flags & ALLOC_NOFRAGMENT)) {
+ alloc_flags &= ~ALLOC_NOFRAGMENT;
+ goto retry;
+ }
/*
* Deal with possible cpuset update races or zonelist updates to avoid
@@ -6286,6 +6300,15 @@ static const struct ctl_table page_alloc_sysctl_table[] = {
.extra1 = SYSCTL_ONE,
.extra2 = SYSCTL_THREE_THOUSAND,
},
+ {
+ .procname = "defrag_mode",
+ .data = &defrag_mode,
+ .maxlen = sizeof(defrag_mode),
+ .mode = 0644,
+ .proc_handler = proc_dointvec_minmax,
+ .extra1 = SYSCTL_ZERO,
+ .extra2 = SYSCTL_ONE,
+ },
{
.procname = "percpu_pagelist_high_fraction",
.data = &percpu_pagelist_high_fraction,
--
2.48.1
* [PATCH 4/5] mm: page_alloc: defrag_mode kswapd/kcompactd assistance
2025-03-13 21:05 [PATCH 0/5] mm: reliable huge page allocator Johannes Weiner
` (2 preceding siblings ...)
2025-03-13 21:05 ` [PATCH 3/5] mm: page_alloc: defrag_mode Johannes Weiner
@ 2025-03-13 21:05 ` Johannes Weiner
2025-03-13 21:05 ` [PATCH 5/5] mm: page_alloc: defrag_mode kswapd/kcompactd watermarks Johannes Weiner
4 siblings, 0 replies; 32+ messages in thread
From: Johannes Weiner @ 2025-03-13 21:05 UTC (permalink / raw)
To: Andrew Morton; +Cc: Vlastimil Babka, Mel Gorman, Zi Yan, linux-mm, linux-kernel
When defrag_mode is enabled, allocation fallbacks strongly prefer
whole block conversions instead of polluting or stealing partially
used blocks. This means there is a demand for pageblocks even from
sub-block requests. Let kswapd/kcompactd help produce them.
By the time kswapd gets woken up, normal rmqueue and block conversion
fallbacks have been attempted and failed. So always wake kswapd with
the block order; it will take care of producing a suitable compaction
gap and then chain-wake kcompactd with the block order when it's done.
VANILLA DEFRAGMODE-ASYNC
Hugealloc Time mean 52739.45 ( +0.00%) 34300.36 ( -34.96%)
Hugealloc Time stddev 56541.26 ( +0.00%) 36390.42 ( -35.64%)
Kbuild Real time 197.47 ( +0.00%) 196.13 ( -0.67%)
Kbuild User time 1240.49 ( +0.00%) 1234.74 ( -0.46%)
Kbuild System time 70.08 ( +0.00%) 62.62 ( -10.50%)
THP fault alloc 46727.07 ( +0.00%) 57054.53 ( +22.10%)
THP fault fallback 21910.60 ( +0.00%) 11581.40 ( -47.14%)
Direct compact fail 195.80 ( +0.00%) 107.80 ( -44.72%)
Direct compact success 7.93 ( +0.00%) 4.53 ( -38.06%)
Direct compact success rate % 3.51 ( +0.00%) 3.20 ( -6.89%)
Compact daemon scanned migrate 3369601.27 ( +0.00%) 5461033.93 ( +62.07%)
Compact daemon scanned free 5075474.47 ( +0.00%) 5824897.93 ( +14.77%)
Compact direct scanned migrate 161787.27 ( +0.00%) 58336.93 ( -63.94%)
Compact direct scanned free 163467.53 ( +0.00%) 32791.87 ( -79.94%)
Compact total migrate scanned 3531388.53 ( +0.00%) 5519370.87 ( +56.29%)
Compact total free scanned 5238942.00 ( +0.00%) 5857689.80 ( +11.81%)
Alloc stall 2371.07 ( +0.00%) 2424.60 ( +2.26%)
Pages kswapd scanned 2160926.73 ( +0.00%) 2657018.33 ( +22.96%)
Pages kswapd reclaimed 533191.07 ( +0.00%) 559583.07 ( +4.95%)
Pages direct scanned 400450.33 ( +0.00%) 722094.07 ( +80.32%)
Pages direct reclaimed 94441.73 ( +0.00%) 107257.80 ( +13.57%)
Pages total scanned 2561377.07 ( +0.00%) 3379112.40 ( +31.93%)
Pages total reclaimed 627632.80 ( +0.00%) 666840.87 ( +6.25%)
Swap out 47959.53 ( +0.00%) 77238.20 ( +61.05%)
Swap in 7276.00 ( +0.00%) 11712.80 ( +60.97%)
File refaults 138043.00 ( +0.00%) 143438.80 ( +3.91%)
With this patch, defrag_mode=1 beats the vanilla kernel in THP success
rates and allocation latencies. The trend holds over time:
thp_fault_alloc
VANILLA DEFRAGMODE-ASYNC
61988 52066
56474 58844
57258 58233
50187 58476
52388 54516
55409 59938
52925 57204
47648 60238
43669 55733
40621 56211
36077 59861
41721 57771
36685 58579
34641 51868
33215 56280
DEFRAGMODE-ASYNC also wins on %sys as ~3/4 of the direct compaction
work is shifted to kcompactd.
Reclaim activity is higher. Part of that is simply due to the
increased memory footprint from higher THP use. The other aspect is
that *direct* reclaim/compaction are still going for requested orders
rather than targeting the page blocks required for fallbacks, which is
less efficient than it could be. However, this is already a useful
tradeoff to make, as in many environments peak periods are short and
retaining the ability to produce THP through them is more important.
Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
---
mm/page_alloc.c | 14 ++++++++++----
1 file changed, 10 insertions(+), 4 deletions(-)
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index 9a02772c2461..4a0d8f871e56 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -4076,15 +4076,21 @@ static void wake_all_kswapds(unsigned int order, gfp_t gfp_mask,
struct zone *zone;
pg_data_t *last_pgdat = NULL;
enum zone_type highest_zoneidx = ac->highest_zoneidx;
+ unsigned int reclaim_order;
+
+ if (defrag_mode)
+ reclaim_order = max(order, pageblock_order);
+ else
+ reclaim_order = order;
for_each_zone_zonelist_nodemask(zone, z, ac->zonelist, highest_zoneidx,
ac->nodemask) {
if (!managed_zone(zone))
continue;
- if (last_pgdat != zone->zone_pgdat) {
- wakeup_kswapd(zone, gfp_mask, order, highest_zoneidx);
- last_pgdat = zone->zone_pgdat;
- }
+ if (last_pgdat == zone->zone_pgdat)
+ continue;
+ wakeup_kswapd(zone, gfp_mask, reclaim_order, highest_zoneidx);
+ last_pgdat = zone->zone_pgdat;
}
}
--
2.48.1
* [PATCH 5/5] mm: page_alloc: defrag_mode kswapd/kcompactd watermarks
2025-03-13 21:05 [PATCH 0/5] mm: reliable huge page allocator Johannes Weiner
` (3 preceding siblings ...)
2025-03-13 21:05 ` [PATCH 4/5] mm: page_alloc: defrag_mode kswapd/kcompactd assistance Johannes Weiner
@ 2025-03-13 21:05 ` Johannes Weiner
2025-03-14 21:05 ` Johannes Weiner
2025-04-11 8:19 ` Vlastimil Babka
4 siblings, 2 replies; 32+ messages in thread
From: Johannes Weiner @ 2025-03-13 21:05 UTC (permalink / raw)
To: Andrew Morton; +Cc: Vlastimil Babka, Mel Gorman, Zi Yan, linux-mm, linux-kernel
The previous patch added pageblock_order reclaim to kswapd/kcompactd,
which helps, but produces only one block at a time. Allocation stalls
and THP failure rates are still higher than they could be.
To adequately reflect ALLOC_NOFRAGMENT demand for pageblocks, change
the watermarking for kswapd & kcompactd: instead of targeting the high
watermark in order-0 pages and checking for one suitable block, simply
require that the high watermark is entirely met in pageblocks.
To this end, track the number of free pages within contiguous
pageblocks, then change pgdat_balanced() and compact_finished() to
check watermarks against this new value.
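The new counter shows up as nr_free_pages_blocks in /proc/vmstat (see
the vmstat.c hunk below). A minimal sketch for comparing it against
nr_free_pages (illustrative only):

#include <stdio.h>
#include <string.h>

/* Print how much of the free memory sits in whole free pageblocks. */
int main(void)
{
	char name[64];
	unsigned long long val, free_pages = 0, free_blocks = 0;
	FILE *f = fopen("/proc/vmstat", "r");

	if (!f)
		return 1;
	while (fscanf(f, "%63s %llu", name, &val) == 2) {
		if (!strcmp(name, "nr_free_pages"))
			free_pages = val;
		else if (!strcmp(name, "nr_free_pages_blocks"))
			free_blocks = val;
	}
	fclose(f);
	printf("free: %llu pages, of which %llu are in whole pageblocks (%.1f%%)\n",
	       free_pages, free_blocks,
	       free_pages ? 100.0 * free_blocks / free_pages : 0.0);
	return 0;
}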
This further reduces THP latencies and allocation stalls, and improves
THP success rates against the previous patch:
DEFRAGMODE-ASYNC DEFRAGMODE-ASYNC-WMARKS
Hugealloc Time mean 34300.36 ( +0.00%) 28904.00 ( -15.73%)
Hugealloc Time stddev 36390.42 ( +0.00%) 33464.37 ( -8.04%)
Kbuild Real time 196.13 ( +0.00%) 196.59 ( +0.23%)
Kbuild User time 1234.74 ( +0.00%) 1231.67 ( -0.25%)
Kbuild System time 62.62 ( +0.00%) 59.10 ( -5.54%)
THP fault alloc 57054.53 ( +0.00%) 63223.67 ( +10.81%)
THP fault fallback 11581.40 ( +0.00%) 5412.47 ( -53.26%)
Direct compact fail 107.80 ( +0.00%) 59.07 ( -44.79%)
Direct compact success 4.53 ( +0.00%) 2.80 ( -31.33%)
Direct compact success rate % 3.20 ( +0.00%) 3.99 ( +18.66%)
Compact daemon scanned migrate 5461033.93 ( +0.00%) 2267500.33 ( -58.48%)
Compact daemon scanned free 5824897.93 ( +0.00%) 2339773.00 ( -59.83%)
Compact direct scanned migrate 58336.93 ( +0.00%) 47659.93 ( -18.30%)
Compact direct scanned free 32791.87 ( +0.00%) 40729.67 ( +24.21%)
Compact total migrate scanned 5519370.87 ( +0.00%) 2315160.27 ( -58.05%)
Compact total free scanned 5857689.80 ( +0.00%) 2380502.67 ( -59.36%)
Alloc stall 2424.60 ( +0.00%) 638.87 ( -73.62%)
Pages kswapd scanned 2657018.33 ( +0.00%) 4002186.33 ( +50.63%)
Pages kswapd reclaimed 559583.07 ( +0.00%) 718577.80 ( +28.41%)
Pages direct scanned 722094.07 ( +0.00%) 355172.73 ( -50.81%)
Pages direct reclaimed 107257.80 ( +0.00%) 31162.80 ( -70.95%)
Pages total scanned 3379112.40 ( +0.00%) 4357359.07 ( +28.95%)
Pages total reclaimed 666840.87 ( +0.00%) 749740.60 ( +12.43%)
Swap out 77238.20 ( +0.00%) 110084.33 ( +42.53%)
Swap in 11712.80 ( +0.00%) 24457.00 ( +108.80%)
File refaults 143438.80 ( +0.00%) 188226.93 ( +31.22%)
Also of note is that compaction work overall is reduced. The reason
for this is that when free pageblocks are more readily available,
allocations are also much more likely to get physically placed in LRU
order, instead of being forced to scavenge free space here and there.
This means that reclaim by itself has better chances of freeing up
whole blocks, and the system relies less on compaction.
Comparing all changes to the vanilla kernel:
VANILLA DEFRAGMODE-ASYNC-WMARKS
Hugealloc Time mean 52739.45 ( +0.00%) 28904.00 ( -45.19%)
Hugealloc Time stddev 56541.26 ( +0.00%) 33464.37 ( -40.81%)
Kbuild Real time 197.47 ( +0.00%) 196.59 ( -0.44%)
Kbuild User time 1240.49 ( +0.00%) 1231.67 ( -0.71%)
Kbuild System time 70.08 ( +0.00%) 59.10 ( -15.45%)
THP fault alloc 46727.07 ( +0.00%) 63223.67 ( +35.30%)
THP fault fallback 21910.60 ( +0.00%) 5412.47 ( -75.29%)
Direct compact fail 195.80 ( +0.00%) 59.07 ( -69.48%)
Direct compact success 7.93 ( +0.00%) 2.80 ( -57.46%)
Direct compact success rate % 3.51 ( +0.00%) 3.99 ( +10.49%)
Compact daemon scanned migrate 3369601.27 ( +0.00%) 2267500.33 ( -32.71%)
Compact daemon scanned free 5075474.47 ( +0.00%) 2339773.00 ( -53.90%)
Compact direct scanned migrate 161787.27 ( +0.00%) 47659.93 ( -70.54%)
Compact direct scanned free 163467.53 ( +0.00%) 40729.67 ( -75.08%)
Compact total migrate scanned 3531388.53 ( +0.00%) 2315160.27 ( -34.44%)
Compact total free scanned 5238942.00 ( +0.00%) 2380502.67 ( -54.56%)
Alloc stall 2371.07 ( +0.00%) 638.87 ( -73.02%)
Pages kswapd scanned 2160926.73 ( +0.00%) 4002186.33 ( +85.21%)
Pages kswapd reclaimed 533191.07 ( +0.00%) 718577.80 ( +34.77%)
Pages direct scanned 400450.33 ( +0.00%) 355172.73 ( -11.31%)
Pages direct reclaimed 94441.73 ( +0.00%) 31162.80 ( -67.00%)
Pages total scanned 2561377.07 ( +0.00%) 4357359.07 ( +70.12%)
Pages total reclaimed 627632.80 ( +0.00%) 749740.60 ( +19.46%)
Swap out 47959.53 ( +0.00%) 110084.33 ( +129.53%)
Swap in 7276.00 ( +0.00%) 24457.00 ( +236.10%)
File refaults 138043.00 ( +0.00%) 188226.93 ( +36.35%)
THP allocation latencies and %sys time are down dramatically.
THP allocation failures are down from nearly 50% to 8.5%. And to
recall previous data points, the success rates are steady and reliable
without the cumulative deterioration of fragmentation events.
Compaction work is down overall. Direct compaction work especially is
drastically reduced. As an aside, its success rate of 4% indicates
there is room for improvement. For now it's good to rely on it less.
Reclaim work is up overall, however direct reclaim work is down. Part
of the increase can be attributed to a higher use of THPs, which due
to internal fragmentation increase the memory footprint. This is not
necessarily an unexpected side-effect for users of THP.
However, taken both points together, there may well be some
opportunities for fine tuning in the reclaim/compaction coordination.
Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
---
include/linux/mmzone.h | 1 +
mm/compaction.c | 37 ++++++++++++++++++++++++++++++-------
mm/internal.h | 1 +
mm/page_alloc.c | 29 +++++++++++++++++++++++------
mm/vmscan.c | 15 ++++++++++++++-
mm/vmstat.c | 1 +
6 files changed, 70 insertions(+), 14 deletions(-)
diff --git a/include/linux/mmzone.h b/include/linux/mmzone.h
index dbb0ad69e17f..37c29f3fbca8 100644
--- a/include/linux/mmzone.h
+++ b/include/linux/mmzone.h
@@ -138,6 +138,7 @@ enum numa_stat_item {
enum zone_stat_item {
/* First 128 byte cacheline (assuming 64 bit words) */
NR_FREE_PAGES,
+ NR_FREE_PAGES_BLOCKS,
NR_ZONE_LRU_BASE, /* Used only for compaction and reclaim retry */
NR_ZONE_INACTIVE_ANON = NR_ZONE_LRU_BASE,
NR_ZONE_ACTIVE_ANON,
diff --git a/mm/compaction.c b/mm/compaction.c
index 036353ef1878..4a2ccb82d0b2 100644
--- a/mm/compaction.c
+++ b/mm/compaction.c
@@ -2329,6 +2329,22 @@ static enum compact_result __compact_finished(struct compact_control *cc)
if (!pageblock_aligned(cc->migrate_pfn))
return COMPACT_CONTINUE;
+ /*
+ * When defrag_mode is enabled, make kcompactd target
+ * watermarks in whole pageblocks. Because they can be stolen
+ * without polluting, no further fallback checks are needed.
+ */
+ if (defrag_mode && !cc->direct_compaction) {
+ if (__zone_watermark_ok(cc->zone, cc->order,
+ high_wmark_pages(cc->zone),
+ cc->highest_zoneidx, cc->alloc_flags,
+ zone_page_state(cc->zone,
+ NR_FREE_PAGES_BLOCKS)))
+ return COMPACT_SUCCESS;
+
+ return COMPACT_CONTINUE;
+ }
+
/* Direct compactor: Is a suitable page free? */
ret = COMPACT_NO_SUITABLE_PAGE;
for (order = cc->order; order < NR_PAGE_ORDERS; order++) {
@@ -2496,13 +2512,19 @@ bool compaction_zonelist_suitable(struct alloc_context *ac, int order,
static enum compact_result
compaction_suit_allocation_order(struct zone *zone, unsigned int order,
int highest_zoneidx, unsigned int alloc_flags,
- bool async)
+ bool async, bool kcompactd)
{
+ unsigned long free_pages;
unsigned long watermark;
+ if (kcompactd && defrag_mode)
+ free_pages = zone_page_state(zone, NR_FREE_PAGES_BLOCKS);
+ else
+ free_pages = zone_page_state(zone, NR_FREE_PAGES);
+
watermark = wmark_pages(zone, alloc_flags & ALLOC_WMARK_MASK);
- if (zone_watermark_ok(zone, order, watermark, highest_zoneidx,
- alloc_flags))
+ if (__zone_watermark_ok(zone, order, watermark, highest_zoneidx,
+ alloc_flags, free_pages))
return COMPACT_SUCCESS;
/*
@@ -2558,7 +2580,8 @@ compact_zone(struct compact_control *cc, struct capture_control *capc)
ret = compaction_suit_allocation_order(cc->zone, cc->order,
cc->highest_zoneidx,
cc->alloc_flags,
- cc->mode == MIGRATE_ASYNC);
+ cc->mode == MIGRATE_ASYNC,
+ !cc->direct_compaction);
if (ret != COMPACT_CONTINUE)
return ret;
}
@@ -3062,7 +3085,7 @@ static bool kcompactd_node_suitable(pg_data_t *pgdat)
ret = compaction_suit_allocation_order(zone,
pgdat->kcompactd_max_order,
highest_zoneidx, ALLOC_WMARK_MIN,
- false);
+ false, true);
if (ret == COMPACT_CONTINUE)
return true;
}
@@ -3085,7 +3108,7 @@ static void kcompactd_do_work(pg_data_t *pgdat)
.mode = MIGRATE_SYNC_LIGHT,
.ignore_skip_hint = false,
.gfp_mask = GFP_KERNEL,
- .alloc_flags = ALLOC_WMARK_MIN,
+ .alloc_flags = ALLOC_WMARK_HIGH,
};
enum compact_result ret;
@@ -3105,7 +3128,7 @@ static void kcompactd_do_work(pg_data_t *pgdat)
ret = compaction_suit_allocation_order(zone,
cc.order, zoneid, cc.alloc_flags,
- false);
+ false, true);
if (ret != COMPACT_CONTINUE)
continue;
diff --git a/mm/internal.h b/mm/internal.h
index 2f52a65272c1..286520a424fe 100644
--- a/mm/internal.h
+++ b/mm/internal.h
@@ -536,6 +536,7 @@ extern char * const zone_names[MAX_NR_ZONES];
DECLARE_STATIC_KEY_MAYBE(CONFIG_DEBUG_VM, check_pages_enabled);
extern int min_free_kbytes;
+extern int defrag_mode;
void setup_per_zone_wmarks(void);
void calculate_min_free_kbytes(void);
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index 4a0d8f871e56..c33c08e278f9 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -273,7 +273,7 @@ int min_free_kbytes = 1024;
int user_min_free_kbytes = -1;
static int watermark_boost_factor __read_mostly = 15000;
static int watermark_scale_factor = 10;
-static int defrag_mode;
+int defrag_mode;
/* movable_zone is the "real" zone pages in ZONE_MOVABLE are taken from */
int movable_zone;
@@ -660,16 +660,20 @@ static inline void __add_to_free_list(struct page *page, struct zone *zone,
bool tail)
{
struct free_area *area = &zone->free_area[order];
+ int nr_pages = 1 << order;
VM_WARN_ONCE(get_pageblock_migratetype(page) != migratetype,
"page type is %lu, passed migratetype is %d (nr=%d)\n",
- get_pageblock_migratetype(page), migratetype, 1 << order);
+ get_pageblock_migratetype(page), migratetype, nr_pages);
if (tail)
list_add_tail(&page->buddy_list, &area->free_list[migratetype]);
else
list_add(&page->buddy_list, &area->free_list[migratetype]);
area->nr_free++;
+
+ if (order >= pageblock_order && !is_migrate_isolate(migratetype))
+ __mod_zone_page_state(zone, NR_FREE_PAGES_BLOCKS, nr_pages);
}
/*
@@ -681,24 +685,34 @@ static inline void move_to_free_list(struct page *page, struct zone *zone,
unsigned int order, int old_mt, int new_mt)
{
struct free_area *area = &zone->free_area[order];
+ int nr_pages = 1 << order;
/* Free page moving can fail, so it happens before the type update */
VM_WARN_ONCE(get_pageblock_migratetype(page) != old_mt,
"page type is %lu, passed migratetype is %d (nr=%d)\n",
- get_pageblock_migratetype(page), old_mt, 1 << order);
+ get_pageblock_migratetype(page), old_mt, nr_pages);
list_move_tail(&page->buddy_list, &area->free_list[new_mt]);
- account_freepages(zone, -(1 << order), old_mt);
- account_freepages(zone, 1 << order, new_mt);
+ account_freepages(zone, -nr_pages, old_mt);
+ account_freepages(zone, nr_pages, new_mt);
+
+ if (order >= pageblock_order &&
+ is_migrate_isolate(old_mt) != is_migrate_isolate(new_mt)) {
+ if (!is_migrate_isolate(old_mt))
+ nr_pages = -nr_pages;
+ __mod_zone_page_state(zone, NR_FREE_PAGES_BLOCKS, nr_pages);
+ }
}
static inline void __del_page_from_free_list(struct page *page, struct zone *zone,
unsigned int order, int migratetype)
{
+ int nr_pages = 1 << order;
+
VM_WARN_ONCE(get_pageblock_migratetype(page) != migratetype,
"page type is %lu, passed migratetype is %d (nr=%d)\n",
- get_pageblock_migratetype(page), migratetype, 1 << order);
+ get_pageblock_migratetype(page), migratetype, nr_pages);
/* clear reported state and update reported page count */
if (page_reported(page))
@@ -708,6 +722,9 @@ static inline void __del_page_from_free_list(struct page *page, struct zone *zon
__ClearPageBuddy(page);
set_page_private(page, 0);
zone->free_area[order].nr_free--;
+
+ if (order >= pageblock_order && !is_migrate_isolate(migratetype))
+ __mod_zone_page_state(zone, NR_FREE_PAGES_BLOCKS, -nr_pages);
}
static inline void del_page_from_free_list(struct page *page, struct zone *zone,
diff --git a/mm/vmscan.c b/mm/vmscan.c
index 3370bdca6868..b5c7dfc2b189 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -6724,11 +6724,24 @@ static bool pgdat_balanced(pg_data_t *pgdat, int order, int highest_zoneidx)
* meet watermarks.
*/
for_each_managed_zone_pgdat(zone, pgdat, i, highest_zoneidx) {
+ unsigned long free_pages;
+
if (sysctl_numa_balancing_mode & NUMA_BALANCING_MEMORY_TIERING)
mark = promo_wmark_pages(zone);
else
mark = high_wmark_pages(zone);
- if (zone_watermark_ok_safe(zone, order, mark, highest_zoneidx))
+
+ /*
+ * In defrag_mode, watermarks must be met in whole
+ * blocks to avoid polluting allocator fallbacks.
+ */
+ if (defrag_mode)
+ free_pages = zone_page_state(zone, NR_FREE_PAGES_BLOCKS);
+ else
+ free_pages = zone_page_state(zone, NR_FREE_PAGES);
+
+ if (__zone_watermark_ok(zone, order, mark, highest_zoneidx,
+ 0, free_pages))
return true;
}
diff --git a/mm/vmstat.c b/mm/vmstat.c
index 16bfe1c694dd..ed49a86348f7 100644
--- a/mm/vmstat.c
+++ b/mm/vmstat.c
@@ -1190,6 +1190,7 @@ int fragmentation_index(struct zone *zone, unsigned int order)
const char * const vmstat_text[] = {
/* enum zone_stat_item counters */
"nr_free_pages",
+ "nr_free_pages_blocks",
"nr_zone_inactive_anon",
"nr_zone_active_anon",
"nr_zone_inactive_file",
--
2.48.1
* Re: [PATCH 1/5] mm: compaction: push watermark into compaction_suitable() callers
2025-03-13 21:05 ` [PATCH 1/5] mm: compaction: push watermark into compaction_suitable() callers Johannes Weiner
@ 2025-03-14 15:08 ` Zi Yan
2025-03-16 4:28 ` Hugh Dickins
` (2 subsequent siblings)
3 siblings, 0 replies; 32+ messages in thread
From: Zi Yan @ 2025-03-14 15:08 UTC (permalink / raw)
To: Johannes Weiner
Cc: Andrew Morton, Vlastimil Babka, Mel Gorman, linux-mm, linux-kernel
On 13 Mar 2025, at 17:05, Johannes Weiner wrote:
> [...]
The changes look good to me. Acked-by: Zi Yan <ziy@nvidia.com>
Best Regards,
Yan, Zi
* Re: [PATCH 2/5] mm: page_alloc: trace type pollution from compaction capturing
2025-03-13 21:05 ` [PATCH 2/5] mm: page_alloc: trace type pollution from compaction capturing Johannes Weiner
@ 2025-03-14 18:36 ` Zi Yan
0 siblings, 0 replies; 32+ messages in thread
From: Zi Yan @ 2025-03-14 18:36 UTC (permalink / raw)
To: Johannes Weiner
Cc: Andrew Morton, Vlastimil Babka, Mel Gorman, linux-mm, linux-kernel
On 13 Mar 2025, at 17:05, Johannes Weiner wrote:
> [...]
Acked-by: Zi Yan <ziy@nvidia.com>
Best Regards,
Yan, Zi
* Re: [PATCH 3/5] mm: page_alloc: defrag_mode
2025-03-13 21:05 ` [PATCH 3/5] mm: page_alloc: defrag_mode Johannes Weiner
@ 2025-03-14 18:54 ` Zi Yan
2025-03-14 20:50 ` Johannes Weiner
2025-03-22 15:05 ` Brendan Jackman
1 sibling, 1 reply; 32+ messages in thread
From: Zi Yan @ 2025-03-14 18:54 UTC (permalink / raw)
To: Johannes Weiner
Cc: Andrew Morton, Vlastimil Babka, Mel Gorman, linux-mm, linux-kernel
On 13 Mar 2025, at 17:05, Johannes Weiner wrote:
> [...]
All makes sense to me. But is there a better name than defrag_mode?
It sounds very similar to /sys/kernel/mm/transparent_hugepage/defrag.
Or does it actually mean the THP defrag mode?
>
> Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
> ---
> Documentation/admin-guide/sysctl/vm.rst | 9 +++++++++
> mm/page_alloc.c | 27 +++++++++++++++++++++++--
> 2 files changed, 34 insertions(+), 2 deletions(-)
>
While checking ALLOC_NOFRAGMENT, I found that in get_page_from_freelist(),
ALLOC_NOFRAGMENT is removed when allocation goes into a remote node. I wonder
if this could reduce the anti-fragmentation effort on NUMA systems. Basically,
falling back to a remote node for allocation would fragment the remote node,
even if the remote node is trying hard not to fragment itself. Have you tested
on a NUMA system?
Thanks.
Best Regards,
Yan, Zi
* Re: [PATCH 3/5] mm: page_alloc: defrag_mode
2025-03-14 18:54 ` Zi Yan
@ 2025-03-14 20:50 ` Johannes Weiner
2025-03-14 22:54 ` Zi Yan
0 siblings, 1 reply; 32+ messages in thread
From: Johannes Weiner @ 2025-03-14 20:50 UTC (permalink / raw)
To: Zi Yan; +Cc: Andrew Morton, Vlastimil Babka, Mel Gorman, linux-mm, linux-kernel
On Fri, Mar 14, 2025 at 02:54:03PM -0400, Zi Yan wrote:
> On 13 Mar 2025, at 17:05, Johannes Weiner wrote:
>
> > [...]
>
> All makes sense to me. But is there a better name than defrag_mode?
> It sounds very similar to /sys/kernel/mm/transparent_hugepage/defrag.
> Or does it actually mean the THP defrag mode?
Thanks for taking a look!
I'm not set on defrag_mode, but I also couldn't think of anything
better.
The proximity to the THP flag name strikes me as beneficial, since
it's an established term for "try harder to make huge pages".
Suggestions welcome :)
> > Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
> > ---
> > Documentation/admin-guide/sysctl/vm.rst | 9 +++++++++
> > mm/page_alloc.c | 27 +++++++++++++++++++++++--
> > 2 files changed, 34 insertions(+), 2 deletions(-)
> >
>
> When I am checking ALLOC_NOFRAGMENT, I find that in get_page_from_freelist(),
> ALLOC_NOFRAGMENT is removed when allocation goes into a remote node. I wonder
> if this could reduce the anti-fragmentation effort for NUMA systems. Basically,
> falling back to a remote node for allocation would fragment the remote node,
> even if the remote node is trying hard to not fragment itself. Have you tested
> on a NUMA system?
There is this hunk in the patch:
@@ -3480,7 +3486,7 @@ get_page_from_freelist(gfp_t gfp_mask, unsigned int order, int alloc_flags,
continue;
}
- if (no_fallback && nr_online_nodes > 1 &&
+ if (no_fallback && !defrag_mode && nr_online_nodes > 1 &&
zone != zonelist_zone(ac->preferred_zoneref)) {
int local_nid;
So it shouldn't clear the flag when spilling into the next node.
Am I missing something?
* Re: [PATCH 5/5] mm: page_alloc: defrag_mode kswapd/kcompactd watermarks
2025-03-13 21:05 ` [PATCH 5/5] mm: page_alloc: defrag_mode kswapd/kcompactd watermarks Johannes Weiner
@ 2025-03-14 21:05 ` Johannes Weiner
2025-04-11 8:19 ` Vlastimil Babka
1 sibling, 0 replies; 32+ messages in thread
From: Johannes Weiner @ 2025-03-14 21:05 UTC (permalink / raw)
To: Andrew Morton; +Cc: Vlastimil Babka, Mel Gorman, Zi Yan, linux-mm, linux-kernel
Andrew, could you please fold this delta patch?
---
From 3d2ff7b72df9e4f1a31b3cff2ae6a4584c06bdca Mon Sep 17 00:00:00 2001
From: Johannes Weiner <hannes@cmpxchg.org>
Date: Fri, 14 Mar 2025 11:38:41 -0400
Subject: [PATCH] mm: page_alloc: defrag_mode kswapd/kcompactd watermarks fix
Fix squawks from rebasing that affect the behavior of !defrag_mode.
FWIW, it seems to actually have slightly helped the vanilla kernel in
the benchmark. But the point was to not change the default behavior:
VANILLA WMARKFIX-VANILLA
Hugealloc Time mean 52739.45 ( +0.00%) 62758.21 ( +19.00%)
Hugealloc Time stddev 56541.26 ( +0.00%) 76253.41 ( +34.86%)
Kbuild Real time 197.47 ( +0.00%) 197.25 ( -0.11%)
Kbuild User time 1240.49 ( +0.00%) 1241.33 ( +0.07%)
Kbuild System time 70.08 ( +0.00%) 71.00 ( +1.28%)
THP fault alloc 46727.07 ( +0.00%) 41492.73 ( -11.20%)
THP fault fallback 21910.60 ( +0.00%) 27146.53 ( +23.90%)
Direct compact fail 195.80 ( +0.00%) 260.93 ( +33.10%)
Direct compact success 7.93 ( +0.00%) 6.67 ( -14.18%)
Direct compact success rate % 3.51 ( +0.00%) 2.76 ( -16.78%)
Compact daemon scanned migrate 3369601.27 ( +0.00%) 3827734.27 ( +13.60%)
Compact daemon scanned free 5075474.47 ( +0.00%) 5910839.73 ( +16.46%)
Compact direct scanned migrate 161787.27 ( +0.00%) 168271.13 ( +4.01%)
Compact direct scanned free 163467.53 ( +0.00%) 222558.33 ( +36.15%)
Compact total migrate scanned 3531388.53 ( +0.00%) 3996005.40 ( +13.16%)
Compact total free scanned 5238942.00 ( +0.00%) 6133398.07 ( +17.07%)
Alloc stall 2371.07 ( +0.00%) 2478.00 ( +4.51%)
Pages kswapd scanned 2160926.73 ( +0.00%) 1726204.67 ( -20.12%)
Pages kswapd reclaimed 533191.07 ( +0.00%) 537963.73 ( +0.90%)
Pages direct scanned 400450.33 ( +0.00%) 450004.87 ( +12.37%)
Pages direct reclaimed 94441.73 ( +0.00%) 99193.07 ( +5.03%)
Pages total scanned 2561377.07 ( +0.00%) 2176209.53 ( -15.04%)
Pages total reclaimed 627632.80 ( +0.00%) 637156.80 ( +1.52%)
Swap out 47959.53 ( +0.00%) 45186.20 ( -5.78%)
Swap in 7276.00 ( +0.00%) 7109.40 ( -2.29%)
File refaults 138043.00 ( +0.00%) 145238.73 ( +5.21%)
Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
---
mm/compaction.c | 6 ++++--
1 file changed, 4 insertions(+), 2 deletions(-)
diff --git a/mm/compaction.c b/mm/compaction.c
index 4a2ccb82d0b2..a481755791a9 100644
--- a/mm/compaction.c
+++ b/mm/compaction.c
@@ -3075,6 +3075,8 @@ static bool kcompactd_node_suitable(pg_data_t *pgdat)
struct zone *zone;
enum zone_type highest_zoneidx = pgdat->kcompactd_highest_zoneidx;
enum compact_result ret;
+ unsigned int alloc_flags = defrag_mode ?
+ ALLOC_WMARK_HIGH : ALLOC_WMARK_MIN;
for (zoneid = 0; zoneid <= highest_zoneidx; zoneid++) {
zone = &pgdat->node_zones[zoneid];
@@ -3084,7 +3086,7 @@ static bool kcompactd_node_suitable(pg_data_t *pgdat)
ret = compaction_suit_allocation_order(zone,
pgdat->kcompactd_max_order,
- highest_zoneidx, ALLOC_WMARK_MIN,
+ highest_zoneidx, alloc_flags,
false, true);
if (ret == COMPACT_CONTINUE)
return true;
@@ -3108,7 +3110,7 @@ static void kcompactd_do_work(pg_data_t *pgdat)
.mode = MIGRATE_SYNC_LIGHT,
.ignore_skip_hint = false,
.gfp_mask = GFP_KERNEL,
- .alloc_flags = ALLOC_WMARK_HIGH,
+ .alloc_flags = defrag_mode ? ALLOC_WMARK_HIGH : ALLOC_WMARK_MIN,
};
enum compact_result ret;
--
2.48.1
* Re: [PATCH 3/5] mm: page_alloc: defrag_mode
2025-03-14 20:50 ` Johannes Weiner
@ 2025-03-14 22:54 ` Zi Yan
0 siblings, 0 replies; 32+ messages in thread
From: Zi Yan @ 2025-03-14 22:54 UTC (permalink / raw)
To: Johannes Weiner
Cc: Andrew Morton, Vlastimil Babka, Mel Gorman, linux-mm, linux-kernel
On 14 Mar 2025, at 16:50, Johannes Weiner wrote:
> On Fri, Mar 14, 2025 at 02:54:03PM -0400, Zi Yan wrote:
>> On 13 Mar 2025, at 17:05, Johannes Weiner wrote:
>>
>>> The page allocator groups requests by migratetype to stave off
>>> fragmentation. However, in practice this is routinely defeated by the
>>> fact that it gives up *before* invoking reclaim and compaction - which
>>> may well produce suitable pages. As a result, fragmentation of
>>> physical memory is a common ongoing process in many load scenarios.
>>>
>>> Fragmentation deteriorates compaction's ability to produce huge
>>> pages. Depending on the lifetime of the fragmenting allocations, those
>>> effects can be long-lasting or even permanent, requiring drastic
>>> measures like forcible idle states or even reboots as the only
>>> reliable ways to recover the address space for THP production.
>>>
>>> In a kernel build test with supplemental THP pressure, the THP
>>> allocation rate steadily declines over 15 runs:
>>>
>>> thp_fault_alloc
>>> 61988
>>> 56474
>>> 57258
>>> 50187
>>> 52388
>>> 55409
>>> 52925
>>> 47648
>>> 43669
>>> 40621
>>> 36077
>>> 41721
>>> 36685
>>> 34641
>>> 33215
>>>
>>> This is a hurdle in adopting THP in any environment where hosts are
>>> shared between multiple overlapping workloads (cloud environments),
>>> and rarely experience true idle periods. To make THP a reliable and
>>> predictable optimization, there needs to be a stronger guarantee to
>>> avoid such fragmentation.
>>>
>>> Introduce defrag_mode. When enabled, reclaim/compaction is invoked to
>>> its full extent *before* falling back. Specifically, ALLOC_NOFRAGMENT
>>> is enforced on the allocator fastpath and the reclaiming slowpath.
>>>
>>> For now, fallbacks are permitted to avert OOMs. There is a plan to add
>>> defrag_mode=2 to prefer OOMs over fragmentation, but this requires
>>> additional prep work in compaction and the reserve management to make
>>> it ready for all possible allocation contexts.
>>>
>>> The following test results are from a kernel build with periodic
>>> bursts of THP allocations, over 15 runs:
>>>
>>> vanilla defrag_mode=1
>>> @claimer[unmovable]: 189 103
>>> @claimer[movable]: 92 103
>>> @claimer[reclaimable]: 207 61
>>> @pollute[unmovable from movable]: 25 0
>>> @pollute[unmovable from reclaimable]: 28 0
>>> @pollute[movable from unmovable]: 38835 0
>>> @pollute[movable from reclaimable]: 147136 0
>>> @pollute[reclaimable from unmovable]: 178 0
>>> @pollute[reclaimable from movable]: 33 0
>>> @steal[unmovable from movable]: 11 0
>>> @steal[unmovable from reclaimable]: 5 0
>>> @steal[reclaimable from unmovable]: 107 0
>>> @steal[reclaimable from movable]: 90 0
>>> @steal[movable from reclaimable]: 354 0
>>> @steal[movable from unmovable]: 130 0
>>>
>>> Both types of polluting fallbacks are eliminated in this workload.
>>>
>>> Interestingly, whole block conversions are reduced as well. This is
>>> because once a block is claimed for a type, its empty space remains
>>> available for future allocations, instead of being padded with
>>> fallbacks; this allows the native type to group up instead of
>>> spreading out to new blocks. The assumption in the allocator has been
>>> that pollution from movable allocations is less harmful than from
>>> other types, since they can be reclaimed or migrated out should the
>>> space be needed. However, since fallbacks occur *before*
>>> reclaim/compaction is invoked, movable pollution will still cause
>>> non-movable allocations to spread out and claim more blocks.
>>>
>>> Without fragmentation, THP rates hold steady with defrag_mode=1:
>>>
>>> thp_fault_alloc
>>> 32478
>>> 20725
>>> 45045
>>> 32130
>>> 14018
>>> 21711
>>> 40791
>>> 29134
>>> 34458
>>> 45381
>>> 28305
>>> 17265
>>> 22584
>>> 28454
>>> 30850
>>>
>>> While the downward trend is eliminated, the keen reader will of course
>>> notice that the baseline rate is much smaller than the vanilla
>>> kernel's to begin with. This is due to deficiencies in how reclaim and
>>> compaction are currently driven: ALLOC_NOFRAGMENT increases the extent
>>> to which smaller allocations are competing with THPs for pageblocks,
>>> while making no effort themselves to reclaim or compact beyond their
>>> own request size. This effect already exists with the current usage of
>>> ALLOC_NOFRAGMENT, but is amplified by defrag_mode insisting on whole
>>> block stealing much more strongly.
>>>
>>> Subsequent patches will address defrag_mode reclaim strategy to raise
>>> the THP success baseline above the vanilla kernel.
>>
>> All makes sense to me. But is there a better name than defrag_mode?
>> It sounds very similar to /sys/kernel/mm/transparent_hugepage/defrag.
>> Or does it actually mean the THP defrag mode?
>
> Thanks for taking a look!
>
> I'm not set on defrag_mode, but I also couldn't think of anything
> better.
>
> The proximity to the THP flag name strikes me as beneficial, since
> it's an established term for "try harder to make huge pages".
>
> Suggestions welcome :)
>
>>> Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
>>> ---
>>> Documentation/admin-guide/sysctl/vm.rst | 9 +++++++++
>>> mm/page_alloc.c | 27 +++++++++++++++++++++++--
>>> 2 files changed, 34 insertions(+), 2 deletions(-)
>>>
>>
>> When I am checking ALLOC_NOFRAGMENT, I find that in get_page_from_freelist(),
>> ALLOC_NOFRAGMENT is removed when allocation goes into a remote node. I wonder
>> if this could reduce the anti-fragmentation effort for NUMA systems. Basically,
>> falling back to a remote node for allocation would fragment the remote node,
>> even if the remote node is trying hard to not fragment itself. Have you tested
>> on a NUMA system?
>
> There is this hunk in the patch:
>
> @@ -3480,7 +3486,7 @@ get_page_from_freelist(gfp_t gfp_mask, unsigned int order, int alloc_flags,
> continue;
> }
>
> - if (no_fallback && nr_online_nodes > 1 &&
> + if (no_fallback && !defrag_mode && nr_online_nodes > 1 &&
> zone != zonelist_zone(ac->preferred_zoneref)) {
> int local_nid;
>
> So it shouldn't clear the flag when spilling into the next node.
>
> Am I missing something?
Oh, I missed that part. Thank you for pointing it out.
Best Regards,
Yan, Zi
* Re: [PATCH 1/5] mm: compaction: push watermark into compaction_suitable() callers
2025-03-13 21:05 ` [PATCH 1/5] mm: compaction: push watermark into compaction_suitable() callers Johannes Weiner
2025-03-14 15:08 ` Zi Yan
@ 2025-03-16 4:28 ` Hugh Dickins
2025-03-17 18:18 ` Johannes Weiner
2025-03-21 6:21 ` kernel test robot
2025-04-10 15:19 ` Vlastimil Babka
3 siblings, 1 reply; 32+ messages in thread
From: Hugh Dickins @ 2025-03-16 4:28 UTC (permalink / raw)
To: Andrew Morton
Cc: Johannes Weiner, Vlastimil Babka, Mel Gorman, Zi Yan, linux-mm,
linux-kernel
[PATCH] mm: compaction: push watermark into compaction_suitable() callers fix
Stop oops on out-of-range highest_zoneidx: compaction_suitable() pass
args to __compaction_suitable() in the order which it is expecting.
Signed-off-by: Hugh Dickins <hughd@google.com>
---
mm/compaction.c | 2 +-
1 file changed, 1 insertion(+), 1 deletion(-)
diff --git a/mm/compaction.c b/mm/compaction.c
index 4a2ccb82d0b2..b5c9e8fd39f1 100644
--- a/mm/compaction.c
+++ b/mm/compaction.c
@@ -2433,7 +2433,7 @@ bool compaction_suitable(struct zone *zone, int order, unsigned long watermark,
enum compact_result compact_result;
bool suitable;
- suitable = __compaction_suitable(zone, order, highest_zoneidx, watermark,
+ suitable = __compaction_suitable(zone, order, watermark, highest_zoneidx,
zone_page_state(zone, NR_FREE_PAGES));
/*
* fragmentation index determines if allocation failures are due to
--
2.43.0
* Re: [PATCH 1/5] mm: compaction: push watermark into compaction_suitable() callers
2025-03-16 4:28 ` Hugh Dickins
@ 2025-03-17 18:18 ` Johannes Weiner
0 siblings, 0 replies; 32+ messages in thread
From: Johannes Weiner @ 2025-03-17 18:18 UTC (permalink / raw)
To: Hugh Dickins
Cc: Andrew Morton, Johannes Weiner, Vlastimil Babka, Mel Gorman,
Zi Yan, linux-mm, linux-kernel
On Sat, Mar 15, 2025 at 09:28:16PM -0700, Hugh Dickins wrote:
> [PATCH] mm: compaction: push watermark into compaction_suitable() callers fix
>
> Stop oops on out-of-range highest_zoneidx: compaction_suitable() pass
> args to __compaction_suitable() in the order which it is expecting.
>
> Signed-off-by: Hugh Dickins <hughd@google.com>
> ---
> mm/compaction.c | 2 +-
> 1 file changed, 1 insertion(+), 1 deletion(-)
>
> diff --git a/mm/compaction.c b/mm/compaction.c
> index 4a2ccb82d0b2..b5c9e8fd39f1 100644
> --- a/mm/compaction.c
> +++ b/mm/compaction.c
> @@ -2433,7 +2433,7 @@ bool compaction_suitable(struct zone *zone, int order, unsigned long watermark,
> enum compact_result compact_result;
> bool suitable;
>
> - suitable = __compaction_suitable(zone, order, highest_zoneidx, watermark,
> + suitable = __compaction_suitable(zone, order, watermark, highest_zoneidx,
> zone_page_state(zone, NR_FREE_PAGES));
Ouch, thanks for the fix Hugh.
This obviously didn't crash for me, but I re-ran the benchmarks with
your fix in my test environment.
This affects the direct compaction path, and I indeed see a minor
uptick in direct compaction, with a larger reduction in daemon work.
Compact daemon scanned migrate 2455570.93 ( +0.00%) 1770142.33 ( -27.91%)
Compact daemon scanned free 2429309.20 ( +0.00%) 1604744.00 ( -33.94%)
Compact direct scanned migrate 40136.60 ( +0.00%) 58326.67 ( +45.32%)
Compact direct scanned free 22127.13 ( +0.00%) 52216.93 ( +135.98%)
Compact total migrate scanned 2495707.53 ( +0.00%) 1828469.00 ( -26.74%)
Compact total free scanned 2451436.33 ( +0.00%) 1656960.93 ( -32.41%)
It doesn't change the overall A/B picture between baseline and the
series, so I'm comfortable keeping the current changelog results.
* Re: [PATCH 1/5] mm: compaction: push watermark into compaction_suitable() callers
2025-03-13 21:05 ` [PATCH 1/5] mm: compaction: push watermark into compaction_suitable() callers Johannes Weiner
2025-03-14 15:08 ` Zi Yan
2025-03-16 4:28 ` Hugh Dickins
@ 2025-03-21 6:21 ` kernel test robot
2025-03-21 13:55 ` Johannes Weiner
2025-04-10 15:19 ` Vlastimil Babka
3 siblings, 1 reply; 32+ messages in thread
From: kernel test robot @ 2025-03-21 6:21 UTC (permalink / raw)
To: Johannes Weiner
Cc: oe-lkp, lkp, linux-kernel, linux-mm, Andrew Morton,
Vlastimil Babka, Mel Gorman, Zi Yan, oliver.sang
Hello,
kernel test robot noticed "BUG:unable_to_handle_page_fault_for_address" on:
commit: 6304be90cf5460f33b031e1e19cbe7ffdcbc9f66 ("[PATCH 1/5] mm: compaction: push watermark into compaction_suitable() callers")
url: https://github.com/intel-lab-lkp/linux/commits/Johannes-Weiner/mm-compaction-push-watermark-into-compaction_suitable-callers/20250314-050839
base: https://git.kernel.org/cgit/linux/kernel/git/akpm/mm.git mm-everything
patch link: https://lore.kernel.org/all/20250313210647.1314586-2-hannes@cmpxchg.org/
patch subject: [PATCH 1/5] mm: compaction: push watermark into compaction_suitable() callers
in testcase: trinity
version: trinity-x86_64-ba2360ed-1_20241228
with following parameters:
runtime: 300s
group: group-03
nr_groups: 5
config: x86_64-kexec
compiler: clang-20
test machine: qemu-system-x86_64 -enable-kvm -cpu SandyBridge -smp 2 -m 16G
(please refer to attached dmesg/kmsg for entire log/backtrace)
+---------------------------------------------+------------+------------+
| | 0174ed04ed | 6304be90cf |
+---------------------------------------------+------------+------------+
| BUG:unable_to_handle_page_fault_for_address | 0 | 5 |
| Oops | 0 | 5 |
| RIP:__zone_watermark_ok | 0 | 5 |
+---------------------------------------------+------------+------------+
If you fix the issue in a separate patch/commit (i.e. not just a new version of
the same patch/commit), kindly add following tags
| Reported-by: kernel test robot <oliver.sang@intel.com>
| Closes: https://lore.kernel.org/oe-lkp/202503201604.a3aa6a95-lkp@intel.com
[ 24.321289][ T36] BUG: unable to handle page fault for address: ffff88844000c5f8
[ 24.322631][ T36] #PF: supervisor read access in kernel mode
[ 24.323577][ T36] #PF: error_code(0x0000) - not-present page
[ 24.324482][ T36] PGD 3a01067 P4D 3a01067 PUD 0
[ 24.325301][ T36] Oops: Oops: 0000 [#1] PREEMPT SMP PTI
[ 24.326157][ T36] CPU: 1 UID: 0 PID: 36 Comm: kcompactd0 Not tainted 6.14.0-rc6-00559-g6304be90cf54 #1
[ 24.327631][ T36] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.16.2-debian-1.16.2-1 04/01/2014
[ 24.329194][ T36] RIP: 0010:__zone_watermark_ok (mm/page_alloc.c:3256)
[ 24.330125][ T36] Code: 84 c0 78 14 4c 8b 97 48 06 00 00 45 31 db 4d 85 d2 4d 0f 4f da 4c 01 de 49 29 f1 41 f7 c0 38 02 00 00 0f 85 92 00 00 00 48 98 <48> 03 54 c7 38 49 39 d1 7e 7e b0 01 85 c9 74 7a 83 f9 0a 7f 73 48
All code
========
0: 84 c0 test %al,%al
2: 78 14 js 0x18
4: 4c 8b 97 48 06 00 00 mov 0x648(%rdi),%r10
b: 45 31 db xor %r11d,%r11d
e: 4d 85 d2 test %r10,%r10
11: 4d 0f 4f da cmovg %r10,%r11
15: 4c 01 de add %r11,%rsi
18: 49 29 f1 sub %rsi,%r9
1b: 41 f7 c0 38 02 00 00 test $0x238,%r8d
22: 0f 85 92 00 00 00 jne 0xba
28: 48 98 cltq
2a:* 48 03 54 c7 38 add 0x38(%rdi,%rax,8),%rdx <-- trapping instruction
2f: 49 39 d1 cmp %rdx,%r9
32: 7e 7e jle 0xb2
34: b0 01 mov $0x1,%al
36: 85 c9 test %ecx,%ecx
38: 74 7a je 0xb4
3a: 83 f9 0a cmp $0xa,%ecx
3d: 7f 73 jg 0xb2
3f: 48 rex.W
Code starting with the faulting instruction
===========================================
0: 48 03 54 c7 38 add 0x38(%rdi,%rax,8),%rdx
5: 49 39 d1 cmp %rdx,%r9
8: 7e 7e jle 0x88
a: b0 01 mov $0x1,%al
c: 85 c9 test %ecx,%ecx
e: 74 7a je 0x8a
10: 83 f9 0a cmp $0xa,%ecx
13: 7f 73 jg 0x88
15: 48 rex.W
[ 24.333001][ T36] RSP: 0018:ffffc90000137cd0 EFLAGS: 00010246
[ 24.334003][ T36] RAX: 00000000000036a8 RBX: 0000000000000001 RCX: 0000000000000000
[ 24.335270][ T36] RDX: 0000000000000006 RSI: 0000000000000000 RDI: ffff88843fff1080
[ 24.336551][ T36] RBP: 0000000000000001 R08: 0000000000000080 R09: 0000000000003b14
[ 24.337799][ T36] R10: 00000000000018b0 R11: 00000000000018b0 R12: 0000000000000001
[ 24.339130][ T36] R13: 0000000000000000 R14: ffff88843fff1080 R15: 00000000000036a8
[ 24.340412][ T36] FS: 0000000000000000(0000) GS:ffff88842fd00000(0000) knlGS:0000000000000000
[ 24.341739][ T36] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[ 24.342448][ T36] CR2: ffff88844000c5f8 CR3: 00000001bceba000 CR4: 00000000000406f0
[ 24.343331][ T36] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
[ 24.347498][ T36] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
[ 24.348260][ T36] Call Trace:
[ 24.348621][ T36] <TASK>
[ 24.348958][ T36] ? __die_body (arch/x86/kernel/dumpstack.c:421)
[ 24.349447][ T36] ? page_fault_oops (arch/x86/mm/fault.c:710)
[ 24.350008][ T36] ? do_kern_addr_fault (arch/x86/mm/fault.c:1198)
[ 24.350582][ T36] ? exc_page_fault (arch/x86/mm/fault.c:1479)
[ 24.351065][ T36] ? asm_exc_page_fault (arch/x86/include/asm/idtentry.h:623)
[ 24.351550][ T36] ? __zone_watermark_ok (mm/page_alloc.c:3256)
[ 24.352049][ T36] compaction_suitable (mm/compaction.c:2407)
[ 24.352532][ T36] compaction_suit_allocation_order (mm/compaction.c:?)
[ 24.353127][ T36] kcompactd (mm/compaction.c:3109)
[ 24.353618][ T36] kthread (kernel/kthread.c:466)
[ 24.354105][ T36] ? __pfx_kcompactd (mm/compaction.c:3184)
[ 24.354658][ T36] ? __pfx_kthread (kernel/kthread.c:413)
[ 24.355121][ T36] ret_from_fork (arch/x86/kernel/process.c:154)
[ 24.355567][ T36] ? __pfx_kthread (kernel/kthread.c:413)
[ 24.356032][ T36] ret_from_fork_asm (arch/x86/entry/entry_64.S:257)
[ 24.356505][ T36] </TASK>
[ 24.356837][ T36] Modules linked in: can_bcm can_raw can cn scsi_transport_iscsi sr_mod ipmi_msghandler cdrom sg ata_generic dm_mod fuse
[ 24.358098][ T36] CR2: ffff88844000c5f8
[ 24.358662][ T36] ---[ end trace 0000000000000000 ]---
[ 24.359178][ T36] RIP: 0010:__zone_watermark_ok (mm/page_alloc.c:3256)
[ 24.359726][ T36] Code: 84 c0 78 14 4c 8b 97 48 06 00 00 45 31 db 4d 85 d2 4d 0f 4f da 4c 01 de 49 29 f1 41 f7 c0 38 02 00 00 0f 85 92 00 00 00 48 98 <48> 03 54 c7 38 49 39 d1 7e 7e b0 01 85 c9 74 7a 83 f9 0a 7f 73 48
All code
========
0: 84 c0 test %al,%al
2: 78 14 js 0x18
4: 4c 8b 97 48 06 00 00 mov 0x648(%rdi),%r10
b: 45 31 db xor %r11d,%r11d
e: 4d 85 d2 test %r10,%r10
11: 4d 0f 4f da cmovg %r10,%r11
15: 4c 01 de add %r11,%rsi
18: 49 29 f1 sub %rsi,%r9
1b: 41 f7 c0 38 02 00 00 test $0x238,%r8d
22: 0f 85 92 00 00 00 jne 0xba
28: 48 98 cltq
2a:* 48 03 54 c7 38 add 0x38(%rdi,%rax,8),%rdx <-- trapping instruction
2f: 49 39 d1 cmp %rdx,%r9
32: 7e 7e jle 0xb2
34: b0 01 mov $0x1,%al
36: 85 c9 test %ecx,%ecx
38: 74 7a je 0xb4
3a: 83 f9 0a cmp $0xa,%ecx
3d: 7f 73 jg 0xb2
3f: 48 rex.W
Code starting with the faulting instruction
===========================================
0: 48 03 54 c7 38 add 0x38(%rdi,%rax,8),%rdx
5: 49 39 d1 cmp %rdx,%r9
8: 7e 7e jle 0x88
a: b0 01 mov $0x1,%al
c: 85 c9 test %ecx,%ecx
e: 74 7a je 0x8a
10: 83 f9 0a cmp $0xa,%ecx
13: 7f 73 jg 0x88
15: 48 rex.W
The kernel config and materials to reproduce are available at:
https://download.01.org/0day-ci/archive/20250320/202503201604.a3aa6a95-lkp@intel.com
--
0-DAY CI Kernel Test Service
https://github.com/intel/lkp-tests/wiki
* Re: [PATCH 1/5] mm: compaction: push watermark into compaction_suitable() callers
2025-03-21 6:21 ` kernel test robot
@ 2025-03-21 13:55 ` Johannes Weiner
0 siblings, 0 replies; 32+ messages in thread
From: Johannes Weiner @ 2025-03-21 13:55 UTC (permalink / raw)
To: kernel test robot
Cc: oe-lkp, lkp, linux-kernel, linux-mm, Andrew Morton,
Vlastimil Babka, Mel Gorman, Zi Yan
On Fri, Mar 21, 2025 at 02:21:20PM +0800, kernel test robot wrote:
> commit: 6304be90cf5460f33b031e1e19cbe7ffdcbc9f66 ("[PATCH 1/5] mm: compaction: push watermark into compaction_suitable() callers")
> url: https://github.com/intel-lab-lkp/linux/commits/Johannes-Weiner/mm-compaction-push-watermark-into-compaction_suitable-callers/20250314-050839
> base: https://git.kernel.org/cgit/linux/kernel/git/akpm/mm.git mm-everything
> patch link: https://lore.kernel.org/all/20250313210647.1314586-2-hannes@cmpxchg.org/
> patch subject: [PATCH 1/5] mm: compaction: push watermark into compaction_suitable() callers
> test machine: qemu-system-x86_64 -enable-kvm -cpu SandyBridge -smp 2 -m 16G
> [ 24.321289][ T36] BUG: unable to handle page fault for address: ffff88844000c5f8
> [ 24.322631][ T36] #PF: supervisor read access in kernel mode
> [ 24.323577][ T36] #PF: error_code(0x0000) - not-present page
> [ 24.324482][ T36] PGD 3a01067 P4D 3a01067 PUD 0
> [ 24.325301][ T36] Oops: Oops: 0000 [#1] PREEMPT SMP PTI
> [ 24.326157][ T36] CPU: 1 UID: 0 PID: 36 Comm: kcompactd0 Not tainted 6.14.0-rc6-00559-g6304be90cf54 #1
> [ 24.327631][ T36] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.16.2-debian-1.16.2-1 04/01/2014
> [ 24.329194][ T36] RIP: 0010:__zone_watermark_ok (mm/page_alloc.c:3256)
> [ 24.330125][ T36] Code: 84 c0 78 14 4c 8b 97 48 06 00 00 45 31 db 4d 85 d2 4d 0f 4f da 4c 01 de 49 29 f1 41 f7 c0 38 02 00 00 0f 85 92 00 00 00 48 98 <48> 03 54 c7 38 49 39 d1 7e 7e b0 01 85 c9 74 7a 83 f9 0a 7f 73 48
> All code
> ========
> 0: 84 c0 test %al,%al
> 2: 78 14 js 0x18
> 4: 4c 8b 97 48 06 00 00 mov 0x648(%rdi),%r10
> b: 45 31 db xor %r11d,%r11d
> e: 4d 85 d2 test %r10,%r10
> 11: 4d 0f 4f da cmovg %r10,%r11
> 15: 4c 01 de add %r11,%rsi
> 18: 49 29 f1 sub %rsi,%r9
> 1b: 41 f7 c0 38 02 00 00 test $0x238,%r8d
> 22: 0f 85 92 00 00 00 jne 0xba
> 28: 48 98 cltq
> 2a:* 48 03 54 c7 38 add 0x38(%rdi,%rax,8),%rdx <-- trapping instruction
That would be the zone->lowmem_reserve[highest_zoneidx] deref:
long int lowmem_reserve[4]; /* 0x38 0x20 */
> 2f: 49 39 d1 cmp %rdx,%r9
> 32: 7e 7e jle 0xb2
> 34: b0 01 mov $0x1,%al
> 36: 85 c9 test %ecx,%ecx
> 38: 74 7a je 0xb4
> 3a: 83 f9 0a cmp $0xa,%ecx
> 3d: 7f 73 jg 0xb2
> 3f: 48 rex.W
>
> Code starting with the faulting instruction
> ===========================================
> 0: 48 03 54 c7 38 add 0x38(%rdi,%rax,8),%rdx
> 5: 49 39 d1 cmp %rdx,%r9
> 8: 7e 7e jle 0x88
> a: b0 01 mov $0x1,%al
> c: 85 c9 test %ecx,%ecx
> e: 74 7a je 0x8a
> 10: 83 f9 0a cmp $0xa,%ecx
> 13: 7f 73 jg 0x88
> 15: 48 rex.W
> [ 24.333001][ T36] RSP: 0018:ffffc90000137cd0 EFLAGS: 00010246
> [ 24.334003][ T36] RAX: 00000000000036a8 RBX: 0000000000000001 RCX: 0000000000000000
> [ 24.335270][ T36] RDX: 0000000000000006 RSI: 0000000000000000 RDI: ffff88843fff1080
and %rax and %rdx look like the swapped watermark and zoneidx (36a8 is
14k pages, or 54M, which matches a min watermark on a 16G system).
So this is the bug that Hugh fixed here:
https://lore.kernel.org/all/005ace8b-07fa-01d4-b54b-394a3e029c07@google.com/
It's resolved in the latest version of the patch in -mm.
* Re: [PATCH 3/5] mm: page_alloc: defrag_mode
2025-03-13 21:05 ` [PATCH 3/5] mm: page_alloc: defrag_mode Johannes Weiner
2025-03-14 18:54 ` Zi Yan
@ 2025-03-22 15:05 ` Brendan Jackman
2025-03-23 0:58 ` Johannes Weiner
1 sibling, 1 reply; 32+ messages in thread
From: Brendan Jackman @ 2025-03-22 15:05 UTC (permalink / raw)
To: Johannes Weiner, Andrew Morton
Cc: Vlastimil Babka, Mel Gorman, Zi Yan, linux-mm, linux-kernel
On Thu Mar 13, 2025 at 10:05 PM CET, Johannes Weiner wrote:
> + /* Reclaim/compaction failed to prevent the fallback */
> + if (defrag_mode) {
> + alloc_flags &= ALLOC_NOFRAGMENT;
> + goto retry;
> + }
I can't see where ALLOC_NOFRAGMENT gets cleared, is it supposed to be
here (i.e. should this be ~ALLOC_NOFRAGMENT)?
* Re: [PATCH 3/5] mm: page_alloc: defrag_mode
2025-03-22 15:05 ` Brendan Jackman
@ 2025-03-23 0:58 ` Johannes Weiner
2025-03-23 1:34 ` Johannes Weiner
0 siblings, 1 reply; 32+ messages in thread
From: Johannes Weiner @ 2025-03-23 0:58 UTC (permalink / raw)
To: Brendan Jackman
Cc: Andrew Morton, Vlastimil Babka, Mel Gorman, Zi Yan, linux-mm,
linux-kernel
On Sat, Mar 22, 2025 at 04:05:52PM +0100, Brendan Jackman wrote:
> On Thu Mar 13, 2025 at 10:05 PM CET, Johannes Weiner wrote:
> > + /* Reclaim/compaction failed to prevent the fallback */
> > + if (defrag_mode) {
> > + alloc_flags &= ALLOC_NOFRAGMENT;
> > + goto retry;
> > + }
>
> I can't see where ALLOC_NOFRAGMENT gets cleared, is it supposed to be
> here (i.e. should this be ~ALLOC_NOFRAGMENT)?
Yes, it should be. Thanks for catching that.
Note that this happens right before OOM, and __alloc_pages_may_oom()
does another allocation attempt without the flag set. In fact, I was
briefly debating whether I need the explicit retry here at all, but
then decided it's clearer and more future proof than quietly relying
on that OOM attempt, which is really only there to check for racing
frees. But this is most likely what hid this during testing.
What might be more of an issue is retrying without ALLOC_CPUSET and
then potentially violating cgroup placement rules too readily -
e.g. OOM only does that for __GFP_NOFAIL.
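For reference, the difference between the two mask operations on the flags word, as a minimal sketch with made-up flag values (the real ALLOC_* definitions live in mm/internal.h):

/* Illustrative values only -- not the actual ALLOC_* flags. */
#define ALLOC_CPUSET		0x01
#define ALLOC_NOFRAGMENT	0x10

static unsigned int clear_nofragment_buggy(unsigned int flags)
{
	flags &= ALLOC_NOFRAGMENT;	/* keeps *only* NOFRAGMENT: ALLOC_CPUSET is dropped */
	return flags;
}

static unsigned int clear_nofragment_fixed(unsigned int flags)
{
	flags &= ~ALLOC_NOFRAGMENT;	/* clears *only* NOFRAGMENT: ALLOC_CPUSET survives */
	return flags;
}

/* clear_nofragment_buggy(0x11) == 0x10, clear_nofragment_fixed(0x11) == 0x01 */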
---
From e81c2086ee8e4b9f2750b821e104d3b5174b81f2 Mon Sep 17 00:00:00 2001
From: Johannes Weiner <hannes@cmpxchg.org>
Date: Sat, 22 Mar 2025 19:21:45 -0400
Subject: [PATCH] mm: page_alloc: fix defrag_mode's last allocation before OOM
Brendan points out that defrag_mode doesn't properly clear
ALLOC_NOFRAGMENT on its last-ditch attempt to allocate.
This is not necessarily a practical issue because it's followed by
__alloc_pages_may_oom(), which does its own attempt at the freelist
without ALLOC_NOFRAGMENT set. However, this is restricted to the high
watermark instead of the usual min mark (since it's merely to check
for racing frees). While this usually works - we just ran a full set
of reclaim/compaction, after all, and likely failed due to a lack of
pageblocks rather than watermarks - it's not as reliable as intended.
A more practical implication is retrying with the other flags cleared,
which means ALLOC_CPUSET is cleared, which can violate placement rules
defined by cgroup policy - OOM usually only does this for GFP_NOFAIL.
Reported-by: Brendan Jackman <jackmanb@google.com>
Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
---
mm/page_alloc.c | 2 +-
1 file changed, 1 insertion(+), 1 deletion(-)
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index 0c01998cb3a0..b9ee0c00eea5 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -4544,7 +4544,7 @@ __alloc_pages_slowpath(gfp_t gfp_mask, unsigned int order,
/* Reclaim/compaction failed to prevent the fallback */
if (defrag_mode) {
- alloc_flags &= ALLOC_NOFRAGMENT;
+ alloc_flags &= ~ALLOC_NOFRAGMENT;
goto retry;
}
--
2.49.0
* Re: [PATCH 3/5] mm: page_alloc: defrag_mode
2025-03-23 0:58 ` Johannes Weiner
@ 2025-03-23 1:34 ` Johannes Weiner
2025-03-23 3:46 ` Johannes Weiner
0 siblings, 1 reply; 32+ messages in thread
From: Johannes Weiner @ 2025-03-23 1:34 UTC (permalink / raw)
To: Brendan Jackman
Cc: Andrew Morton, Vlastimil Babka, Mel Gorman, Zi Yan, linux-mm,
linux-kernel
On Sat, Mar 22, 2025 at 08:58:27PM -0400, Johannes Weiner wrote:
> On Sat, Mar 22, 2025 at 04:05:52PM +0100, Brendan Jackman wrote:
> > On Thu Mar 13, 2025 at 10:05 PM CET, Johannes Weiner wrote:
> > > + /* Reclaim/compaction failed to prevent the fallback */
> > > + if (defrag_mode) {
> > > + alloc_flags &= ALLOC_NOFRAGMENT;
> > > + goto retry;
> > > + }
> >
> > I can't see where ALLOC_NOFRAGMENT gets cleared, is it supposed to be
> > here (i.e. should this be ~ALLOC_NOFRAGMENT)?
Please ignore my previous email, this is actually a much more severe
issue than I thought at first. The screwed up clearing is bad, but
this will also not check the flag before retrying, which means the
thread will retry reclaim/compaction and never reach OOM.
This code has weeks of load testing, with workloads fine-tuned to
*avoid* OOM. A blatant OOM test shows this problem immediately.
A simple fix, but I'll put it through the wringer before sending it.
* Re: [PATCH 3/5] mm: page_alloc: defrag_mode
2025-03-23 1:34 ` Johannes Weiner
@ 2025-03-23 3:46 ` Johannes Weiner
2025-03-23 18:04 ` Brendan Jackman
0 siblings, 1 reply; 32+ messages in thread
From: Johannes Weiner @ 2025-03-23 3:46 UTC (permalink / raw)
To: Brendan Jackman
Cc: Andrew Morton, Vlastimil Babka, Mel Gorman, Zi Yan, linux-mm,
linux-kernel
On Sat, Mar 22, 2025 at 09:34:09PM -0400, Johannes Weiner wrote:
> On Sat, Mar 22, 2025 at 08:58:27PM -0400, Johannes Weiner wrote:
> > On Sat, Mar 22, 2025 at 04:05:52PM +0100, Brendan Jackman wrote:
> > > On Thu Mar 13, 2025 at 10:05 PM CET, Johannes Weiner wrote:
> > > > + /* Reclaim/compaction failed to prevent the fallback */
> > > > + if (defrag_mode) {
> > > > + alloc_flags &= ALLOC_NOFRAGMENT;
> > > > + goto retry;
> > > > + }
> > >
> > > I can't see where ALLOC_NOFRAGMENT gets cleared, is it supposed to be
> > > here (i.e. should this be ~ALLOC_NOFRAGMENT)?
>
> Please ignore my previous email, this is actually a much more severe
> issue than I thought at first. The screwed up clearing is bad, but
> this will also not check the flag before retrying, which means the
> thread will retry reclaim/compaction and never reach OOM.
>
> This code has weeks of load testing, with workloads fine-tuned to
> *avoid* OOM. A blatant OOM test shows this problem immediately.
>
> A simple fix, but I'll put it through the wringer before sending it.
Ok, here is the patch. I verified this with intentional OOMing 100
times in a loop; this would previously lock up on first try in
defrag_mode, but kills and recovers reliably with this applied.
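The trigger itself can be as trivial as the sketch below (illustrative only, not the actual test harness): fault in anonymous memory until the OOM killer steps in, and run it repeatedly from a loop.

#include <string.h>
#include <sys/mman.h>

int main(void)
{
	const size_t chunk = 64UL << 20;	/* 64M per step */

	for (;;) {
		char *p = mmap(NULL, chunk, PROT_READ | PROT_WRITE,
			       MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);

		if (p == MAP_FAILED)
			break;			/* out of address space or overcommit limit */
		memset(p, 0xaa, chunk);		/* fault it in; eventually the OOM killer fires */
	}
	return 0;
}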
I also re-ran the full THP benchmarks, to verify that erroneous
looping here did not accidentally contribute to fragmentation
avoidance and thus THP success & latency rates. It in fact did not;
the improvements claimed for defrag_mode are unchanged with this fix:
VANILLA defrag_mode=1-OOMFIX
Hugealloc Time mean 52739.45 ( +0.00%) 27342.44 ( -48.15%)
Hugealloc Time stddev 56541.26 ( +0.00%) 33227.16 ( -41.23%)
Kbuild Real time 197.47 ( +0.00%) 196.32 ( -0.58%)
Kbuild User time 1240.49 ( +0.00%) 1231.89 ( -0.69%)
Kbuild System time 70.08 ( +0.00%) 58.75 ( -15.95%)
THP fault alloc 46727.07 ( +0.00%) 62669.93 ( +34.12%)
THP fault fallback 21910.60 ( +0.00%) 5966.40 ( -72.77%)
Direct compact fail 195.80 ( +0.00%) 50.53 ( -73.81%)
Direct compact success 7.93 ( +0.00%) 4.07 ( -43.28%)
Compact daemon scanned migrate 3369601.27 ( +0.00%) 1588238.93 ( -52.87%)
Compact daemon scanned free 5075474.47 ( +0.00%) 1441944.27 ( -71.59%)
Compact direct scanned migrate 161787.27 ( +0.00%) 64838.53 ( -59.92%)
Compact direct scanned free 163467.53 ( +0.00%) 37243.00 ( -77.22%)
Compact total migrate scanned 3531388.53 ( +0.00%) 1653077.47 ( -53.19%)
Compact total free scanned 5238942.00 ( +0.00%) 1479187.27 ( -71.77%)
Alloc stall 2371.07 ( +0.00%) 553.00 ( -76.64%)
Pages kswapd scanned 2160926.73 ( +0.00%) 4052539.93 ( +87.54%)
Pages kswapd reclaimed 533191.07 ( +0.00%) 765447.47 ( +43.56%)
Pages direct scanned 400450.33 ( +0.00%) 358933.93 ( -10.37%)
Pages direct reclaimed 94441.73 ( +0.00%) 26991.60 ( -71.42%)
Pages total scanned 2561377.07 ( +0.00%) 4411473.87 ( +72.23%)
Pages total reclaimed 627632.80 ( +0.00%) 792439.07 ( +26.26%)
Swap out 47959.53 ( +0.00%) 128511.80 ( +167.96%)
Swap in 7276.00 ( +0.00%) 27736.20 ( +281.16%)
File refaults 138043.00 ( +0.00%) 206198.40 ( +49.37%)
Many thanks for your careful review, Brendan.
---
From c84651a46910448c6cfaf44885644fdb215f7f6a Mon Sep 17 00:00:00 2001
From: Johannes Weiner <hannes@cmpxchg.org>
Date: Sat, 22 Mar 2025 19:21:45 -0400
Subject: [PATCH] mm: page_alloc: fix defrag_mode's retry & OOM path
Brendan points out that defrag_mode doesn't properly clear
ALLOC_NOFRAGMENT on its last-ditch attempt to allocate. But looking
closer, the problem is actually more severe: it doesn't actually
*check* whether it's already retried, and keeps looping. This means
the OOM path is never taken, and the thread can loop indefinitely.
This is verified with an intentional OOM test on defrag_mode=1, which
results in the machine hanging. After this patch, it triggers the OOM
kill reliably and recovers.
Clear ALLOC_NOFRAGMENT properly, and only retry once.
Fixes: e3aa7df331bc ("mm: page_alloc: defrag_mode")
Reported-by: Brendan Jackman <jackmanb@google.com>
Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
---
mm/page_alloc.c | 4 ++--
1 file changed, 2 insertions(+), 2 deletions(-)
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index 0c01998cb3a0..582364d42906 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -4543,8 +4543,8 @@ __alloc_pages_slowpath(gfp_t gfp_mask, unsigned int order,
goto retry;
/* Reclaim/compaction failed to prevent the fallback */
- if (defrag_mode) {
- alloc_flags &= ALLOC_NOFRAGMENT;
+ if (defrag_mode && (alloc_flags & ALLOC_NOFRAGMENT)) {
+ alloc_flags &= ~ALLOC_NOFRAGMENT;
goto retry;
}
--
2.49.0
* Re: [PATCH 3/5] mm: page_alloc: defrag_mode
2025-03-23 3:46 ` Johannes Weiner
@ 2025-03-23 18:04 ` Brendan Jackman
2025-03-31 15:55 ` Johannes Weiner
0 siblings, 1 reply; 32+ messages in thread
From: Brendan Jackman @ 2025-03-23 18:04 UTC (permalink / raw)
To: Johannes Weiner
Cc: Andrew Morton, Vlastimil Babka, Mel Gorman, Zi Yan, linux-mm,
linux-kernel
On Sun Mar 23, 2025 at 4:46 AM CET, Johannes Weiner wrote:
> On Sat, Mar 22, 2025 at 09:34:09PM -0400, Johannes Weiner wrote:
> > On Sat, Mar 22, 2025 at 08:58:27PM -0400, Johannes Weiner wrote:
> > > On Sat, Mar 22, 2025 at 04:05:52PM +0100, Brendan Jackman wrote:
> > > > On Thu Mar 13, 2025 at 10:05 PM CET, Johannes Weiner wrote:
> > > > > + /* Reclaim/compaction failed to prevent the fallback */
> > > > > + if (defrag_mode) {
> > > > > + alloc_flags &= ALLOC_NOFRAGMENT;
> > > > > + goto retry;
> > > > > + }
> > > >
> > > > I can't see where ALLOC_NOFRAGMENT gets cleared, is it supposed to be
> > > > here (i.e. should this be ~ALLOC_NOFRAGMENT)?
> >
> > Please ignore my previous email, this is actually a much more severe
> > issue than I thought at first. The screwed up clearing is bad, but
> > this will also not check the flag before retrying, which means the
> > thread will retry reclaim/compaction and never reach OOM.
> >
> > This code has weeks of load testing, with workloads fine-tuned to
> > *avoid* OOM. A blatant OOM test shows this problem immediately.
> >
> > A simple fix, but I'll put it through the wringer before sending it.
>
> Ok, here is the patch. I verified this with intentional OOMing 100
> times in a loop; this would previously lock up on first try in
> defrag_mode, but kills and recovers reliably with this applied.
>
> I also re-ran the full THP benchmarks, to verify that erroneous
> looping here did not accidentally contribute to fragmentation
> avoidance and thus THP success & latency rates. It in fact did not;
> the improvements claimed for defrag_mode are unchanged with this fix:
Sounds good :)
Off topic, but could you share some details about the
tests/benchmarks you're running here? Do you have any links e.g. to
the scripts you're using to run them?
* Re: [PATCH 3/5] mm: page_alloc: defrag_mode
2025-03-23 18:04 ` Brendan Jackman
@ 2025-03-31 15:55 ` Johannes Weiner
0 siblings, 0 replies; 32+ messages in thread
From: Johannes Weiner @ 2025-03-31 15:55 UTC (permalink / raw)
To: Brendan Jackman
Cc: Andrew Morton, Vlastimil Babka, Mel Gorman, Zi Yan, linux-mm,
linux-kernel
Hi Brendan,
On Sun, Mar 23, 2025 at 07:04:29PM +0100, Brendan Jackman wrote:
> On Sun Mar 23, 2025 at 4:46 AM CET, Johannes Weiner wrote:
> > On Sat, Mar 22, 2025 at 09:34:09PM -0400, Johannes Weiner wrote:
> > > On Sat, Mar 22, 2025 at 08:58:27PM -0400, Johannes Weiner wrote:
> > > > On Sat, Mar 22, 2025 at 04:05:52PM +0100, Brendan Jackman wrote:
> > > > > On Thu Mar 13, 2025 at 10:05 PM CET, Johannes Weiner wrote:
> > > > > > + /* Reclaim/compaction failed to prevent the fallback */
> > > > > > + if (defrag_mode) {
> > > > > > + alloc_flags &= ALLOC_NOFRAGMENT;
> > > > > > + goto retry;
> > > > > > + }
> > > > >
> > > > > I can't see where ALLOC_NOFRAGMENT gets cleared, is it supposed to be
> > > > > here (i.e. should this be ~ALLOC_NOFRAGMENT)?
> > >
> > > Please ignore my previous email, this is actually a much more severe
> > > issue than I thought at first. The screwed up clearing is bad, but
> > > this will also not check the flag before retrying, which means the
> > > thread will retry reclaim/compaction and never reach OOM.
> > >
> > > This code has weeks of load testing, with workloads fine-tuned to
> > > *avoid* OOM. A blatant OOM test shows this problem immediately.
> > >
> > > A simple fix, but I'll put it through the wringer before sending it.
> >
> > Ok, here is the patch. I verified this with intentional OOMing 100
> > times in a loop; this would previously lock up on first try in
> > defrag_mode, but kills and recovers reliably with this applied.
> >
> > I also re-ran the full THP benchmarks, to verify that erroneous
> > looping here did not accidentally contribute to fragmentation
> > avoidance and thus THP success & latency rates. It in fact did not;
> > the improvements claimed for defrag_mode are unchanged with this fix:
>
> Sounds good :)
>
> Off topic, but could you share some details about the
> tests/benchmarks you're running here? Do you have any links e.g. to
> the scripts you're using to run them?
Sure! The numbers I quoted here are from a dual workload of kernel
build and THP allocation bursts. The kernel build is an x86_64
defconfig, -j16 on 8 cores (no ht). I boot this machine with mem=1800M
to make sure there is some memory pressure, but not hopeless
thrashing. Filesystem and conventional swap on an older SATA SSD.
While the kernel builds, every 20s another worker mmaps 80M, madvises
for THP, measures the time to memset-fault the range in, and unmaps.
THP policy is upstream defaults: enabled=always, defrag=madvise. So
the kernel build itself will also optimistically consume THPs, but
only the burst allocations will direct reclaim/compact for them.
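Roughly, the burst worker boils down to something like this sketch (illustrative, not the actual script; the 80M size and 20s period come from the description above, everything else is assumed):

#include <stdio.h>
#include <string.h>
#include <time.h>
#include <unistd.h>
#include <sys/mman.h>

#define BURST_SIZE	(80UL << 20)	/* 80M per burst */

/* One burst: map, advise for THP, time the memset faults, unmap. */
static double thp_burst(void)
{
	struct timespec t0, t1;
	char *buf = mmap(NULL, BURST_SIZE, PROT_READ | PROT_WRITE,
			 MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);

	if (buf == MAP_FAILED)
		return -1.0;
	madvise(buf, BURST_SIZE, MADV_HUGEPAGE);

	clock_gettime(CLOCK_MONOTONIC, &t0);
	memset(buf, 0, BURST_SIZE);		/* fault the range in */
	clock_gettime(CLOCK_MONOTONIC, &t1);

	munmap(buf, BURST_SIZE);
	return (t1.tv_sec - t0.tv_sec) + (t1.tv_nsec - t0.tv_nsec) / 1e9;
}

int main(void)
{
	for (;;) {
		printf("fault time: %.3fs\n", thp_burst());
		fflush(stdout);
		sleep(20);			/* one burst every 20s */
	}
}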
Aside from that - and this is a lot less scientific - I just run the
patches on the machines I use every day, looking for interactivity
problems, kswapd or kcompactd going crazy, and generally paying
attention to how well they cope under pressure compared to upstream.
My desktop is an 8G ARM machine (with zswap), so it's almost always
under some form of memory pressure. It's also using 16k pages and
order-11 pageblocks (32M THPs), which adds extra spice.
* Re: [PATCH 1/5] mm: compaction: push watermark into compaction_suitable() callers
2025-03-13 21:05 ` [PATCH 1/5] mm: compaction: push watermark into compaction_suitable() callers Johannes Weiner
` (2 preceding siblings ...)
2025-03-21 6:21 ` kernel test robot
@ 2025-04-10 15:19 ` Vlastimil Babka
2025-04-10 20:17 ` Johannes Weiner
3 siblings, 1 reply; 32+ messages in thread
From: Vlastimil Babka @ 2025-04-10 15:19 UTC (permalink / raw)
To: Johannes Weiner, Andrew Morton; +Cc: Mel Gorman, Zi Yan, linux-mm, linux-kernel
On 3/13/25 22:05, Johannes Weiner wrote:
> compaction_suitable() hardcodes the min watermark, with a boost to the
> low watermark for costly orders. However, compaction_ready() requires
> order-0 at the high watermark. It currently checks the marks twice.
>
> Make the watermark a parameter to compaction_suitable() and have the
> callers pass in what they require:
>
> - compaction_zonelist_suitable() is used by the direct reclaim path,
> so use the min watermark.
>
> - compact_suit_allocation_order() has a watermark in context derived
> from cc->alloc_flags.
>
> The only quirk is that kcompactd doesn't initialize cc->alloc_flags
> explicitly. There is a direct check in kcompactd_do_work() that
> passes ALLOC_WMARK_MIN, but there is another check downstack in
> compact_zone() that ends up passing the unset alloc_flags. Since
> they default to 0, and that coincides with ALLOC_WMARK_MIN, it is
> correct. But it's subtle. Set cc->alloc_flags explicitly.
>
> - should_continue_reclaim() is direct reclaim, use the min watermark.
>
> - Finally, consolidate the two checks in compaction_ready() to a
> single compaction_suitable() call passing the high watermark.
>
> There is a tiny change in behavior: before, compaction_suitable()
> would check order-0 against min or low, depending on costly
> order. Then there'd be another high watermark check.
>
> Now, the high watermark is passed to compaction_suitable(), and the
> costly order-boost (low - min) is added on top. This means
> compaction_ready() sets a marginally higher target for free pages.
>
> In a kernelbuild + THP pressure test, though, this didn't show any
> measurable negative effects on memory pressure or reclaim rates. As
> the comment above the check says, reclaim is usually stopped short
> on should_continue_reclaim(), and this just defines the worst-case
> reclaim cutoff in case compaction is not making any headway.
>
> Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
<snip>
> @@ -2513,13 +2516,13 @@ compaction_suit_allocation_order(struct zone *zone, unsigned int order,
> */
> if (order > PAGE_ALLOC_COSTLY_ORDER && async &&
> !(alloc_flags & ALLOC_CMA)) {
> - watermark = low_wmark_pages(zone) + compact_gap(order);
> - if (!__zone_watermark_ok(zone, 0, watermark, highest_zoneidx,
> - 0, zone_page_state(zone, NR_FREE_PAGES)))
> + if (!__zone_watermark_ok(zone, 0, watermark + compact_gap(order),
> + highest_zoneidx, 0,
> + zone_page_state(zone, NR_FREE_PAGES)))
> return COMPACT_SKIPPED;
The watermark here is no longer recalculated as low_wmark_pages() but the
value from above based on alloc_flags is reused.
It's probably ok, maybe even more correct; it just wasn't mentioned in the
changelog as another tiny change of behavior, so I wanted to point it out.
* Re: [PATCH 1/5] mm: compaction: push watermark into compaction_suitable() callers
2025-04-10 15:19 ` Vlastimil Babka
@ 2025-04-10 20:17 ` Johannes Weiner
2025-04-11 7:32 ` Vlastimil Babka
0 siblings, 1 reply; 32+ messages in thread
From: Johannes Weiner @ 2025-04-10 20:17 UTC (permalink / raw)
To: Vlastimil Babka; +Cc: Andrew Morton, Mel Gorman, Zi Yan, linux-mm, linux-kernel
On Thu, Apr 10, 2025 at 05:19:06PM +0200, Vlastimil Babka wrote:
> On 3/13/25 22:05, Johannes Weiner wrote:
> > compaction_suitable() hardcodes the min watermark, with a boost to the
> > low watermark for costly orders. However, compaction_ready() requires
> > order-0 at the high watermark. It currently checks the marks twice.
> >
> > Make the watermark a parameter to compaction_suitable() and have the
> > callers pass in what they require:
> >
> > - compaction_zonelist_suitable() is used by the direct reclaim path,
> > so use the min watermark.
> >
> > - compact_suit_allocation_order() has a watermark in context derived
> > from cc->alloc_flags.
> >
> > The only quirk is that kcompactd doesn't initialize cc->alloc_flags
> > explicitly. There is a direct check in kcompactd_do_work() that
> > passes ALLOC_WMARK_MIN, but there is another check downstack in
> > compact_zone() that ends up passing the unset alloc_flags. Since
> > they default to 0, and that coincides with ALLOC_WMARK_MIN, it is
> > correct. But it's subtle. Set cc->alloc_flags explicitly.
> >
> > - should_continue_reclaim() is direct reclaim, use the min watermark.
> >
> > - Finally, consolidate the two checks in compaction_ready() to a
> > single compaction_suitable() call passing the high watermark.
> >
> > There is a tiny change in behavior: before, compaction_suitable()
> > would check order-0 against min or low, depending on costly
> > order. Then there'd be another high watermark check.
> >
> > Now, the high watermark is passed to compaction_suitable(), and the
> > costly order-boost (low - min) is added on top. This means
> > compaction_ready() sets a marginally higher target for free pages.
> >
> > In a kernelbuild + THP pressure test, though, this didn't show any
> > measurable negative effects on memory pressure or reclaim rates. As
> > the comment above the check says, reclaim is usually stopped short
> > on should_continue_reclaim(), and this just defines the worst-case
> > reclaim cutoff in case compaction is not making any headway.
> >
> > Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
>
> <snip>
>
> > @@ -2513,13 +2516,13 @@ compaction_suit_allocation_order(struct zone *zone, unsigned int order,
> > */
> > if (order > PAGE_ALLOC_COSTLY_ORDER && async &&
> > !(alloc_flags & ALLOC_CMA)) {
> > - watermark = low_wmark_pages(zone) + compact_gap(order);
> > - if (!__zone_watermark_ok(zone, 0, watermark, highest_zoneidx,
> > - 0, zone_page_state(zone, NR_FREE_PAGES)))
> > + if (!__zone_watermark_ok(zone, 0, watermark + compact_gap(order),
> > + highest_zoneidx, 0,
> > + zone_page_state(zone, NR_FREE_PAGES)))
> > return COMPACT_SKIPPED;
>
> The watermark here is no longer recalculated as low_wmark_pages() but the
> value from above based on alloc_flags is reused.
> It's probably ok, maybe even more correct; it just wasn't mentioned in the
> changelog as another tiny change of behavior, so I wanted to point it out.
Ah yes, it would have made sense to point out.
I was wondering about this check. It was introduced to bail on
compaction if there are not enough free non-CMA pages. But if there
are, we still fall through and check the superset of regular + CMA
pages against the watermarks as well. We know this will succeed, so
this seems moot.
It's also a little odd that compaction_suitable() hardcodes ALLOC_CMA
with the explanation that "CMA are migration targets", but then this
check says "actually, it doesn't help us if blocks are formed in CMA".
Does it make more sense to plumb alloc_flags to compaction_suitable()?
There is more head-scratching, though. The check is meant to test
whether compaction has a chance of forming non-CMA blocks. But free
pages are targets. You could have plenty of non-contiguous, free
non-CMA memory - compaction will then form blocks in CMA by moving CMA
pages into those non-CMA targets.
The longer I look at this, the more I feel like this just hard-coded
the very specific scenario the patch author had a problem with: CMA is
massive. The page allocator fills up regular memory first. Once
regular memory is full, non-CMA requests stall on compaction making
CMA blocks. So just bail on compaction then.
It's a valid problem, but I don't see how this code makes any general
sense outside of this exact sequence of events. Especially once
compaction has moved stuff around between regular and CMA memory, the
issue will be back, and the check does nothing to prevent it.
* Re: [PATCH 1/5] mm: compaction: push watermark into compaction_suitable() callers
2025-04-10 20:17 ` Johannes Weiner
@ 2025-04-11 7:32 ` Vlastimil Babka
0 siblings, 0 replies; 32+ messages in thread
From: Vlastimil Babka @ 2025-04-11 7:32 UTC (permalink / raw)
To: Johannes Weiner; +Cc: Andrew Morton, Mel Gorman, Zi Yan, linux-mm, linux-kernel
On 4/10/25 22:17, Johannes Weiner wrote:
> On Thu, Apr 10, 2025 at 05:19:06PM +0200, Vlastimil Babka wrote:
>> On 3/13/25 22:05, Johannes Weiner wrote:
>
> Ah yes, it would have made sense to point out.
>
> I was wondering about this check. It was introduced to bail on
> compaction if there are not enough free non-CMA pages. But if there
> are, we still fall through and check the superset of regular + CMA
> pages against the watermarks as well. We know this will succeed, so
> this seems moot.
I guess we didn't want to avoid the fragindex part of compaction_suitable(),
which in theory may not succeed?
> It's also a little odd that compaction_suitable() hardcodes ALLOC_CMA
> with the explanation that "CMA are migration targets", but then this
> check says "actually, it doesn't help us if blocks are formed in CMA".
Hm yes.
> Does it make more sense to plumb alloc_flags to compaction_suitable()?
Possibly.
> There is more head-scratching, though. The check is meant to test
> whether compaction has a chance of forming non-CMA blocks. But free
> pages are targets. You could have plenty of non-contiguous, free
> non-CMA memory - compaction will then form blocks in CMA by moving CMA
> pages into those non-CMA targets.
>
> The longer I look at this, the more I feel like this just hard-coded
> the very specific scenario the patch author had a problem with: CMA is
> massive. The page allocator fills up regular memory first. Once
> regular memory is full, non-CMA requests stall on compaction making
> CMA blocks. So just bail on compaction then.
Right.
> It's a valid problem, but I don't see how this code makes any general
> sense outside of this exact sequence of events. Especially once
> compaction has moved stuff around between regular and CMA memory, the
> issue will be back, and the check does nothing to prevent it.
Yeah, it seemed to fix a real problem and we both acked it :) but it's not
ideal.
Maybe the true solution (or a step towards it) would be for compaction for
!ALLOC_CMA to only use non-CMA pageblocks as migration sources.
IMHO it's just another symptom of the general problem that CMA pageblocks
exist as part of a zone that's not otherwise ZONE_MOVABLE and suddenly the
watermarks have to depend on the allocation type. And of course for
high-order allocations it's not just the amount of memory in the cma vs
non-cma parts of the zone that matters, but also its contiguity.
* Re: [PATCH 5/5] mm: page_alloc: defrag_mode kswapd/kcompactd watermarks
2025-03-13 21:05 ` [PATCH 5/5] mm: page_alloc: defrag_mode kswapd/kcompactd watermarks Johannes Weiner
2025-03-14 21:05 ` Johannes Weiner
@ 2025-04-11 8:19 ` Vlastimil Babka
2025-04-11 15:39 ` Johannes Weiner
1 sibling, 1 reply; 32+ messages in thread
From: Vlastimil Babka @ 2025-04-11 8:19 UTC (permalink / raw)
To: Johannes Weiner, Andrew Morton; +Cc: Mel Gorman, Zi Yan, linux-mm, linux-kernel
On 3/13/25 22:05, Johannes Weiner wrote:
> The previous patch added pageblock_order reclaim to kswapd/kcompactd,
> which helps, but produces only one block at a time. Allocation stalls
> and THP failure rates are still higher than they could be.
>
> To adequately reflect ALLOC_NOFRAGMENT demand for pageblocks, change
> the watermarking for kswapd & kcompactd: instead of targeting the high
> watermark in order-0 pages and checking for one suitable block, simply
> require that the high watermark is entirely met in pageblocks.
Hrm.
> ---
> include/linux/mmzone.h | 1 +
> mm/compaction.c | 37 ++++++++++++++++++++++++++++++-------
> mm/internal.h | 1 +
> mm/page_alloc.c | 29 +++++++++++++++++++++++------
> mm/vmscan.c | 15 ++++++++++++++-
> mm/vmstat.c | 1 +
> 6 files changed, 70 insertions(+), 14 deletions(-)
>
> diff --git a/include/linux/mmzone.h b/include/linux/mmzone.h
> index dbb0ad69e17f..37c29f3fbca8 100644
> --- a/include/linux/mmzone.h
> +++ b/include/linux/mmzone.h
> @@ -138,6 +138,7 @@ enum numa_stat_item {
> enum zone_stat_item {
> /* First 128 byte cacheline (assuming 64 bit words) */
> NR_FREE_PAGES,
> + NR_FREE_PAGES_BLOCKS,
> NR_ZONE_LRU_BASE, /* Used only for compaction and reclaim retry */
> NR_ZONE_INACTIVE_ANON = NR_ZONE_LRU_BASE,
> NR_ZONE_ACTIVE_ANON,
> diff --git a/mm/compaction.c b/mm/compaction.c
> index 036353ef1878..4a2ccb82d0b2 100644
> --- a/mm/compaction.c
> +++ b/mm/compaction.c
> @@ -2329,6 +2329,22 @@ static enum compact_result __compact_finished(struct compact_control *cc)
> if (!pageblock_aligned(cc->migrate_pfn))
> return COMPACT_CONTINUE;
>
> + /*
> + * When defrag_mode is enabled, make kcompactd target
> + * watermarks in whole pageblocks. Because they can be stolen
> + * without polluting, no further fallback checks are needed.
> + */
> + if (defrag_mode && !cc->direct_compaction) {
> + if (__zone_watermark_ok(cc->zone, cc->order,
> + high_wmark_pages(cc->zone),
> + cc->highest_zoneidx, cc->alloc_flags,
> + zone_page_state(cc->zone,
> + NR_FREE_PAGES_BLOCKS)))
> + return COMPACT_SUCCESS;
> +
> + return COMPACT_CONTINUE;
> + }
Wonder if this ever succeeds in practice. Is high_wmark_pages() even aligned
to pageblock size? If not, and it's X pageblocks and a half, we will rarely
have NR_FREE_PAGES_BLOCKS cover all of that? Also concurrent allocations can
put us below high wmark quickly and then we never satisfy this?
Doesn't it then happen that with defrag_mode, in practice, kcompactd basically
always runs until the scanners meet?
Maybe the check could instead e.g. compare NR_FREE_PAGES (aligned down to
pageblock size, or even with some extra slack?) with NR_FREE_PAGES_BLOCKS?
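For illustration, such a termination check might look roughly like this
(an untested sketch; the helper name is made up):

static bool zone_free_fully_defragmented(struct zone *zone)
{
	unsigned long free = zone_page_state(zone, NR_FREE_PAGES);
	unsigned long free_blocks = zone_page_state(zone, NR_FREE_PAGES_BLOCKS);

	/* Free pages sitting in partial blocks never count toward the block counter */
	return free_blocks >= ALIGN_DOWN(free, pageblock_nr_pages);
}

I.e. stop once the free memory that does exist is as block-contiguous as it
can get, regardless of whether the high watermark is met.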
> --- a/mm/vmscan.c
> +++ b/mm/vmscan.c
> @@ -6724,11 +6724,24 @@ static bool pgdat_balanced(pg_data_t *pgdat, int order, int highest_zoneidx)
> * meet watermarks.
> */
> for_each_managed_zone_pgdat(zone, pgdat, i, highest_zoneidx) {
> + unsigned long free_pages;
> +
> if (sysctl_numa_balancing_mode & NUMA_BALANCING_MEMORY_TIERING)
> mark = promo_wmark_pages(zone);
> else
> mark = high_wmark_pages(zone);
> - if (zone_watermark_ok_safe(zone, order, mark, highest_zoneidx))
I think you just removed the only user of this _safe() function. Is the
cpu-drift control it does no longer necessary?
> +
> + /*
> + * In defrag_mode, watermarks must be met in whole
> + * blocks to avoid polluting allocator fallbacks.
> + */
> + if (defrag_mode)
> + free_pages = zone_page_state(zone, NR_FREE_PAGES_BLOCKS);
> + else
> + free_pages = zone_page_state(zone, NR_FREE_PAGES);
> +
> + if (__zone_watermark_ok(zone, order, mark, highest_zoneidx,
> + 0, free_pages))
> return true;
> }
>
* Re: [PATCH 5/5] mm: page_alloc: defrag_mode kswapd/kcompactd watermarks
2025-04-11 8:19 ` Vlastimil Babka
@ 2025-04-11 15:39 ` Johannes Weiner
2025-04-11 16:51 ` Vlastimil Babka
0 siblings, 1 reply; 32+ messages in thread
From: Johannes Weiner @ 2025-04-11 15:39 UTC (permalink / raw)
To: Vlastimil Babka; +Cc: Andrew Morton, Mel Gorman, Zi Yan, linux-mm, linux-kernel
On Fri, Apr 11, 2025 at 10:19:58AM +0200, Vlastimil Babka wrote:
> On 3/13/25 22:05, Johannes Weiner wrote:
> > The previous patch added pageblock_order reclaim to kswapd/kcompactd,
> > which helps, but produces only one block at a time. Allocation stalls
> > and THP failure rates are still higher than they could be.
> >
> > To adequately reflect ALLOC_NOFRAGMENT demand for pageblocks, change
> > the watermarking for kswapd & kcompactd: instead of targeting the high
> > watermark in order-0 pages and checking for one suitable block, simply
> > require that the high watermark is entirely met in pageblocks.
>
> Hrm.
Hrm!
> > @@ -2329,6 +2329,22 @@ static enum compact_result __compact_finished(struct compact_control *cc)
> > if (!pageblock_aligned(cc->migrate_pfn))
> > return COMPACT_CONTINUE;
> >
> > + /*
> > + * When defrag_mode is enabled, make kcompactd target
> > + * watermarks in whole pageblocks. Because they can be stolen
> > + * without polluting, no further fallback checks are needed.
> > + */
> > + if (defrag_mode && !cc->direct_compaction) {
> > + if (__zone_watermark_ok(cc->zone, cc->order,
> > + high_wmark_pages(cc->zone),
> > + cc->highest_zoneidx, cc->alloc_flags,
> > + zone_page_state(cc->zone,
> > + NR_FREE_PAGES_BLOCKS)))
> > + return COMPACT_SUCCESS;
> > +
> > + return COMPACT_CONTINUE;
> > + }
>
> Wonder if this ever succeeds in practice. Is high_wmark_pages() even aligned
> to pageblock size? If not, and it's X pageblocks and a half, we will rarely
> have NR_FREE_PAGES_BLOCKS cover all of that? Also concurrent allocations can
> put us below high wmark quickly and then we never satisfy this?
The high watermark is not aligned, but why does it have to be? It's a
binary condition: met or not met. Compaction continues until it's met.
NR_FREE_PAGES_BLOCKS moves in pageblock_nr_pages steps. This means
it'll really work until align_up(highmark, pageblock_nr_pages), as
that's when NR_FREE_PAGES_BLOCKS snaps above the (unaligned) mark. But
that seems reasonable, no?
The allocator side is using low/min, so we have the conventional
hysteresis between consumer and producer.
For illustration, on my 2G test box, the watermarks in DMA32 look like
this:
pages free 212057
boost 0
min 11164 (21.8 blocks)
low 13955 (27.3 blocks)
high 16746 (32.7 blocks)
promo 19537
spanned 456704
present 455680
managed 431617 (843.1 blocks)
So there are several blocks between the kblahds wakeup and sleep. The
first allocation to cut into a whole free block will decrease
NR_FREE_PAGES_BLOCKS by a whole block. But subsequent allocs that fill
the remaining space won't change that counter. So the distance between
the watermarks didn't fundamentally change (modulo block rounding).
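As a concrete number for the zone above, assuming pageblock_nr_pages is 512
(2MB blocks with 4k pages), the effective kcompactd target under defrag_mode
becomes the high watermark rounded up to whole blocks:

	align_up(16746, 512) = 33 * 512 = 16896 pages (33 blocks instead of 32.7)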
> Doesn't it then happen that with defrag_mode, in practice, kcompactd basically
> always runs until the scanners meet?
Tracing kcompactd calls to compaction_finished() with defrag_mode:
@[COMPACT_CONTINUE]: 6955
@[COMPACT_COMPLETE]: 19
@[COMPACT_PARTIAL_SKIPPED]: 1
@[COMPACT_SUCCESS]: 17
@wakeuprequests: 3
Of course, similar to kswapd, it might not reach the watermarks and
keep running if there is a continuous stream of allocations consuming
the blocks it's making. Hence the ratio between wakeups & continues.
But when demand stops, it'll balance the high mark and quit.
> > --- a/mm/vmscan.c
> > +++ b/mm/vmscan.c
> > @@ -6724,11 +6724,24 @@ static bool pgdat_balanced(pg_data_t *pgdat, int order, int highest_zoneidx)
> > * meet watermarks.
> > */
> > for_each_managed_zone_pgdat(zone, pgdat, i, highest_zoneidx) {
> > + unsigned long free_pages;
> > +
> > if (sysctl_numa_balancing_mode & NUMA_BALANCING_MEMORY_TIERING)
> > mark = promo_wmark_pages(zone);
> > else
> > mark = high_wmark_pages(zone);
> > - if (zone_watermark_ok_safe(zone, order, mark, highest_zoneidx))
>
> I think you just removed the only user of this _safe() function. Is the
> cpu-drift control it does no longer necessary?
Ah good catch. This should actually use zone_page_state_snapshot()
below depending on z->percpu_drift_mark.
This is active when there are enough cpus for the vmstat pcp deltas to
exceed low-min. Afaics this is still necessary, but also still
requires a lot of CPUs to matter (>212 cpus with 64G of memory).
I'll send a follow-up fix.
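For concreteness, such a follow-up might look roughly like this inside
pgdat_balanced() (a sketch only, assuming the drift handling carries over
unchanged from zone_watermark_ok_safe()):

	enum zone_stat_item item;
	unsigned long free_pages;

	item = defrag_mode ? NR_FREE_PAGES_BLOCKS : NR_FREE_PAGES;
	free_pages = zone_page_state(zone, item);

	/* Fold in the pcp deltas when drift could hide a breached watermark */
	if (zone->percpu_drift_mark && free_pages < zone->percpu_drift_mark)
		free_pages = zone_page_state_snapshot(zone, item);

	if (__zone_watermark_ok(zone, order, mark, highest_zoneidx, 0, free_pages))
		return true;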
> > + /*
> > + * In defrag_mode, watermarks must be met in whole
> > + * blocks to avoid polluting allocator fallbacks.
> > + */
> > + if (defrag_mode)
> > + free_pages = zone_page_state(zone, NR_FREE_PAGES_BLOCKS);
> > + else
> > + free_pages = zone_page_state(zone, NR_FREE_PAGES);
> > +
> > + if (__zone_watermark_ok(zone, order, mark, highest_zoneidx,
> > + 0, free_pages))
> > return true;
> > }
> >
* Re: [PATCH 5/5] mm: page_alloc: defrag_mode kswapd/kcompactd watermarks
2025-04-11 15:39 ` Johannes Weiner
@ 2025-04-11 16:51 ` Vlastimil Babka
2025-04-11 18:21 ` Johannes Weiner
0 siblings, 1 reply; 32+ messages in thread
From: Vlastimil Babka @ 2025-04-11 16:51 UTC (permalink / raw)
To: Johannes Weiner; +Cc: Andrew Morton, Mel Gorman, Zi Yan, linux-mm, linux-kernel
On 4/11/25 17:39, Johannes Weiner wrote:
> On Fri, Apr 11, 2025 at 10:19:58AM +0200, Vlastimil Babka wrote:
>> On 3/13/25 22:05, Johannes Weiner wrote:
>> > The previous patch added pageblock_order reclaim to kswapd/kcompactd,
>> > which helps, but produces only one block at a time. Allocation stalls
>> > and THP failure rates are still higher than they could be.
>> >
>> > To adequately reflect ALLOC_NOFRAGMENT demand for pageblocks, change
>> > the watermarking for kswapd & kcompactd: instead of targeting the high
>> > watermark in order-0 pages and checking for one suitable block, simply
>> > require that the high watermark is entirely met in pageblocks.
>>
>> Hrm.
>
> Hrm!
>
>> > @@ -2329,6 +2329,22 @@ static enum compact_result __compact_finished(struct compact_control *cc)
>> > if (!pageblock_aligned(cc->migrate_pfn))
>> > return COMPACT_CONTINUE;
>> >
>> > + /*
>> > + * When defrag_mode is enabled, make kcompactd target
>> > + * watermarks in whole pageblocks. Because they can be stolen
>> > + * without polluting, no further fallback checks are needed.
>> > + */
>> > + if (defrag_mode && !cc->direct_compaction) {
>> > + if (__zone_watermark_ok(cc->zone, cc->order,
>> > + high_wmark_pages(cc->zone),
>> > + cc->highest_zoneidx, cc->alloc_flags,
>> > + zone_page_state(cc->zone,
>> > + NR_FREE_PAGES_BLOCKS)))
>> > + return COMPACT_SUCCESS;
>> > +
>> > + return COMPACT_CONTINUE;
>> > + }
>>
>> Wonder if this ever succeeds in practice. Is high_wmark_pages() even aligned
>> to pageblock size? If not, and it's X pageblocks and a half, we will rarely
>> have NR_FREE_PAGES_BLOCKS cover all of that? Also concurrent allocations can
>> put us below high wmark quickly and then we never satisfy this?
>
> The high watermark is not aligned, but why does it have to be? It's a
> binary condition: met or not met. Compaction continues until it's met.
What I mean is, kswapd will reclaim until the high watermark, which would be
32.7 blocks, wake up kcompactd [*] but that can only create up to 32 blocks
of NR_FREE_PAGES_BLOCKS so it has already lost at that point? (unless
there's concurrent freeing pushing it above the high wmark)
> NR_FREE_PAGES_BLOCKS moves in pageblock_nr_pages steps. This means
> it'll really work until align_up(highmark, pageblock_nr_pages), as
> that's when NR_FREE_PAGES_BLOCKS snaps above the (unaligned) mark. But
> that seems reasonable, no?
How can it snap if it doesn't have enough free pages? Unlike kswapd,
kcompactd doesn't create them, only defragments.
> The allocator side is using low/min, so we have the conventional
> hysteresis between consumer and producer.
Sure, but we cap kswapd at the high wmark, and the hunk quoted above also uses
the high wmark, so there's no hysteresis happening between kswapd and kcompactd?
> For illustration, on my 2G test box, the watermarks in DMA32 look like
> this:
>
> pages free 212057
> boost 0
> min 11164 (21.8 blocks)
> low 13955 (27.3 blocks)
> high 16746 (32.7 blocks)
> promo 19537
> spanned 456704
> present 455680
> managed 431617 (843.1 blocks)
>
> So there are several blocks between the kblahds wakeup and sleep. The
> first allocation to cut into a whole free block will decrease
> NR_FREE_PAGES_BLOCKS by a whole block. But subsequent allocs that fill
> the remaining space won't change that counter. So the distance between
> the watermarks didn't fundamentally change (modulo block rounding).
>
>> Doesn't it then happen that with defrag_mode, in practice, kcompactd basically
>> always runs until the scanners meet?
>
> Tracing kcompactd calls to compaction_finished() with defrag_mode:
>
> @[COMPACT_CONTINUE]: 6955
> @[COMPACT_COMPLETE]: 19
> @[COMPACT_PARTIAL_SKIPPED]: 1
> @[COMPACT_SUCCESS]: 17
> @wakeuprequests: 3
OK that doesn't look that bad.
> Of course, similar to kswapd, it might not reach the watermarks and
> keep running if there is a continuous stream of allocations consuming
> the blocks it's making. Hence the ratio between wakeups & continues.
>
> But when demand stops, it'll balance the high mark and quit.
Again, since kcompactd can only defragment free space and not create it, it
may be trying in vain?
[*] now, when checking the code for the handover between kswapd and kcompactd, I
think I found another problem?
we have:
kswapd_try_to_sleep()
prepare_kswapd_sleep() - needs to succeed for wakeup_kcompactd()
pgdat_balanced() - needs to be true for prepare_kswapd_sleep() to be true
- with defrag_mode we want high watermark of NR_FREE_PAGES_BLOCKS, but
we were only reclaiming until now and didn't wake up kcompactd and
this actually prevents the wake up?
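Condensed, that gating looks roughly like this (paraphrased from mm/vmscan.c,
details elided):

	/* in kswapd_try_to_sleep() */
	if (prepare_kswapd_sleep(pgdat, reclaim_order, highest_zoneidx)) {
		/* only reached once pgdat_balanced() considers the node balanced */
		wakeup_kcompactd(pgdat, alloc_order, highest_zoneidx);
		/* ... kswapd goes to sleep ... */
	}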
* Re: [PATCH 5/5] mm: page_alloc: defrag_mode kswapd/kcompactd watermarks
2025-04-11 16:51 ` Vlastimil Babka
@ 2025-04-11 18:21 ` Johannes Weiner
2025-04-13 2:20 ` Johannes Weiner
0 siblings, 1 reply; 32+ messages in thread
From: Johannes Weiner @ 2025-04-11 18:21 UTC (permalink / raw)
To: Vlastimil Babka; +Cc: Andrew Morton, Mel Gorman, Zi Yan, linux-mm, linux-kernel
On Fri, Apr 11, 2025 at 06:51:51PM +0200, Vlastimil Babka wrote:
> On 4/11/25 17:39, Johannes Weiner wrote:
> > On Fri, Apr 11, 2025 at 10:19:58AM +0200, Vlastimil Babka wrote:
> >> On 3/13/25 22:05, Johannes Weiner wrote:
> >> > @@ -2329,6 +2329,22 @@ static enum compact_result __compact_finished(struct compact_control *cc)
> >> > if (!pageblock_aligned(cc->migrate_pfn))
> >> > return COMPACT_CONTINUE;
> >> >
> >> > + /*
> >> > + * When defrag_mode is enabled, make kcompactd target
> >> > + * watermarks in whole pageblocks. Because they can be stolen
> >> > + * without polluting, no further fallback checks are needed.
> >> > + */
> >> > + if (defrag_mode && !cc->direct_compaction) {
> >> > + if (__zone_watermark_ok(cc->zone, cc->order,
> >> > + high_wmark_pages(cc->zone),
> >> > + cc->highest_zoneidx, cc->alloc_flags,
> >> > + zone_page_state(cc->zone,
> >> > + NR_FREE_PAGES_BLOCKS)))
> >> > + return COMPACT_SUCCESS;
> >> > +
> >> > + return COMPACT_CONTINUE;
> >> > + }
> >>
> >> Wonder if this ever succeeds in practice. Is high_wmark_pages() even aligned
> >> to pageblock size? If not, and it's X pageblocks and a half, we will rarely
> >> have NR_FREE_PAGES_BLOCKS cover all of that? Also concurrent allocations can
> >> put us below high wmark quickly and then we never satisfy this?
> >
> > The high watermark is not aligned, but why does it have to be? It's a
> > binary condition: met or not met. Compaction continues until it's met.
>
> What I mean is, kswapd will reclaim until the high watermark, which would be
> 32.7 blocks, wake up kcompactd [*] but that can only create up to 32 blocks
> of NR_FREE_PAGES_BLOCKS so it has already lost at that point? (unless
> there's concurrent freeing pushing it above the high wmark)
Ah, but kswapd also uses the (rounded up) NR_FREE_PAGES_BLOCKS check.
Buckle up...
> > Of course, similar to kswapd, it might not reach the watermarks and
> > keep running if there is a continuous stream of allocations consuming
> > the blocks it's making. Hence the ratio between wakeups & continues.
> >
> > But when demand stops, it'll balance the high mark and quit.
>
> Again, since kcompactd can only defragment free space and not create it, it
> may be trying in vain?
>
> [*] now, when checking the code for the handover between kswapd and kcompactd, I
> think I found another problem?
>
> we have:
> kswapd_try_to_sleep()
> prepare_kswapd_sleep() - needs to succeed for wakeup_kcompactd()
> pgdat_balanced() - needs to be true for prepare_kswapd_sleep() to be true
> - with defrag_mode we want high watermark of NR_FREE_PAGES_BLOCKS, but
> we were only reclaiming until now and didn't wake up kcompactd and
> this actually prevents the wake up?
Correct, so as per above, kswapd also does the NR_FREE_PAGES_BLOCKS
check. At first, at least. So it continues to produce adequate scratch
space and won't leave compaction high and dry on a watermark it cannot
meet. They are indeed coordinated in this aspect.
As far as the *handoff* to kcompactd goes, I've been pulling my hair out
over this for a very long time. You're correct about the call chain
above. And actually, this is the case before defrag_mode too: if you
wake kswapd with, say, an order-8, it will do pgdat_balanced() checks
against that, seemingly reclaim until the request can succeed, *then*
wake kcompactd and sleep. WTF?
But kswapd has this:
/*
* Fragmentation may mean that the system cannot be rebalanced for
* high-order allocations. If twice the allocation size has been
* reclaimed then recheck watermarks only at order-0 to prevent
* excessive reclaim. Assume that a process requested a high-order
* can direct reclaim/compact.
*/
if (sc->order && sc->nr_reclaimed >= compact_gap(sc->order))
sc->order = 0;
Ignore the comment and just consider the code. What it does for higher
orders (whether defrag_mode is enabled or not) is reclaim a gap for
the order, ensure order-0 is met (which it most likely already is), then
enter the sleep path: wake kcompactd and wait for more work.
Effectively, as long as there are pending higher-order requests
looping in the allocator, kswapd does this:
1) reclaim a compaction gap delta
2) wake kcompactd
3) goto 1
This pipelining seems to work *very* well in practice, especially when
there is a large number of concurrent requests.
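For reference, the compact_gap() used in step 1 above is simply twice the
allocation size (from mm/internal.h):

static inline unsigned long compact_gap(unsigned int order)
{
	/* reclaim/compaction headroom: twice the requested allocation size */
	return 2UL << order;
}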
In the original huge page allocator series, I tried to convert kswapd to
use compaction_suitable() to hand over quicker. However, this ran into
scaling issues with higher allocation concurrency: maintaining just a
single, static compact gap when there could be hundreds of allocation
requests waiting for compaction results falls apart fast.
The current code has it right. The comments might be a bit dated and
maybe it could use some fine tuning. But generally, as long as there
are incoming wakeups from the allocator, it makes sense to keep making
more space for compaction as well.
I think Mel was playing 4d chess with this stuff.
[ I kept direct reclaim/compaction out of this defrag_mode series, but
testing suggests the same is likely true for the direct path.
Direct reclaim bails from compaction_ready() if there is a static
compaction gap for that order. But once the gap for a given order is
there, you can get a thundering herd of direct compactors storming
on this gap, most of which will then fail compaction_suitable().
A pipeline of "reclaim gap delta, direct compact, retry" seems to
make more sense there as well. With adequate checks to prevent
excessive reclaim in corner cases of course... ]
* Re: [PATCH 5/5] mm: page_alloc: defrag_mode kswapd/kcompactd watermarks
2025-04-11 18:21 ` Johannes Weiner
@ 2025-04-13 2:20 ` Johannes Weiner
2025-04-15 7:31 ` Vlastimil Babka
2025-04-15 7:44 ` Vlastimil Babka
0 siblings, 2 replies; 32+ messages in thread
From: Johannes Weiner @ 2025-04-13 2:20 UTC (permalink / raw)
To: Vlastimil Babka; +Cc: Andrew Morton, Mel Gorman, Zi Yan, linux-mm, linux-kernel
On Fri, Apr 11, 2025 at 02:21:58PM -0400, Johannes Weiner wrote:
> On Fri, Apr 11, 2025 at 06:51:51PM +0200, Vlastimil Babka wrote:
> > [*] now, when checking the code for the handover between kswapd and kcompactd, I
> > think I found another problem?
> >
> > we have:
> > kswapd_try_to_sleep()
> > prepare_kswapd_sleep() - needs to succeed for wakeup_kcompactd()
> > pgdat_balanced() - needs to be true for prepare_kswapd_sleep() to be true
> > - with defrag_mode we want high watermark of NR_FREE_PAGES_BLOCKS, but
> > we were only reclaiming until now and didn't wake up kcompactd and
> > this actually prevents the wake up?
Coming back to this, there is indeed a defrag_mode issue. My
apologies, I misunderstood what you were pointing at.
Like I said, kswapd reverts to order-0 in some other place to go to
sleep and trigger the handoff. At that point, defrag_mode also needs
to revert to NR_FREE_PAGES.
It's curious that this didn't stand out in testing. On the contrary,
kcompactd was still doing the vast majority of the compaction work. It
looks like kswapd and direct workers on their own manage to balance
NR_FREE_PAGES_BLOCKS every so often, and then kswapd hands off. Given
the low number of kcompactd wakeups, the consumers keep it going.
So testing with this:
diff --git a/mm/vmscan.c b/mm/vmscan.c
index cc422ad830d6..c2aa0a4b67de 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -6747,8 +6747,11 @@ static bool pgdat_balanced(pg_data_t *pgdat, int order, int highest_zoneidx)
/*
* In defrag_mode, watermarks must be met in whole
* blocks to avoid polluting allocator fallbacks.
+ *
+ * When kswapd has compact gap, check regular
+ * NR_FREE_PAGES and hand over to kcompactd.
*/
- if (defrag_mode)
+ if (defrag_mode && order)
item = NR_FREE_PAGES_BLOCKS;
else
item = NR_FREE_PAGES;
I'm getting the following results:
fallbackspeed/STUPID-DEFRAGMODE fallbackspeed/DEFRAGMODE
Hugealloc Time mean 79381.34 ( +0.00%) 88126.12 ( +11.02%)
Hugealloc Time stddev 85852.16 ( +0.00%) 135366.75 ( +57.67%)
Kbuild Real time 249.35 ( +0.00%) 226.71 ( -9.04%)
Kbuild User time 1249.16 ( +0.00%) 1249.37 ( +0.02%)
Kbuild System time 171.76 ( +0.00%) 166.93 ( -2.79%)
THP fault alloc 51666.87 ( +0.00%) 52685.60 ( +1.97%)
THP fault fallback 16970.00 ( +0.00%) 15951.87 ( -6.00%)
Direct compact fail 166.53 ( +0.00%) 178.93 ( +7.40%)
Direct compact success 17.13 ( +0.00%) 4.13 ( -71.69%)
Compact daemon scanned migrate 3095413.33 ( +0.00%) 9231239.53 ( +198.22%)
Compact daemon scanned free 2155966.53 ( +0.00%) 7053692.87 ( +227.17%)
Compact direct scanned migrate 265642.47 ( +0.00%) 68388.33 ( -74.26%)
Compact direct scanned free 130252.60 ( +0.00%) 55634.87 ( -57.29%)
Compact total migrate scanned 3361055.80 ( +0.00%) 9299627.87 ( +176.69%)
Compact total free scanned 2286219.13 ( +0.00%) 7109327.73 ( +210.96%)
Alloc stall 1890.80 ( +0.00%) 6297.60 ( +232.94%)
Pages kswapd scanned 9043558.80 ( +0.00%) 5952576.73 ( -34.18%)
Pages kswapd reclaimed 1891708.67 ( +0.00%) 1030645.00 ( -45.52%)
Pages direct scanned 1017090.60 ( +0.00%) 2688047.60 ( +164.29%)
Pages direct reclaimed 92682.60 ( +0.00%) 309770.53 ( +234.22%)
Pages total scanned 10060649.40 ( +0.00%) 8640624.33 ( -14.11%)
Pages total reclaimed 1984391.27 ( +0.00%) 1340415.53 ( -32.45%)
Swap out 884585.73 ( +0.00%) 417781.93 ( -52.77%)
Swap in 287106.27 ( +0.00%) 95589.73 ( -66.71%)
File refaults 551697.60 ( +0.00%) 426474.80 ( -22.70%)
Work has shifted from direct to kcompactd. In aggregate there is more
compaction happening. Meanwhile, aggregate reclaim and swapping drop
quite substantially. %sys is down, so this is just more efficient.
Reclaim and swapping are down substantially, which is great. But the
reclaim work that remains has shifted somewhat to direct reclaim,
which is unfortunate. THP delivery is also a tad worse, but still much
better than !defrag_mode, so not too concerning. That part deserves a
bit more thought.
Overall, this looks good, though. I'll send a proper patch next week.
Thanks for the review, Vlastimil.
* Re: [PATCH 5/5] mm: page_alloc: defrag_mode kswapd/kcompactd watermarks
2025-04-13 2:20 ` Johannes Weiner
@ 2025-04-15 7:31 ` Vlastimil Babka
2025-04-15 7:44 ` Vlastimil Babka
1 sibling, 0 replies; 32+ messages in thread
From: Vlastimil Babka @ 2025-04-15 7:31 UTC (permalink / raw)
To: Johannes Weiner; +Cc: Andrew Morton, Mel Gorman, Zi Yan, linux-mm, linux-kernel
On 4/13/25 04:20, Johannes Weiner wrote:
> On Fri, Apr 11, 2025 at 02:21:58PM -0400, Johannes Weiner wrote:
>> On Fri, Apr 11, 2025 at 06:51:51PM +0200, Vlastimil Babka wrote:
>> > [*] now, when checking the code for the handover between kswapd and kcompactd, I
>> > think I found another problem?
>> >
>> > we have:
>> > kswapd_try_to_sleep()
>> > prepare_kswapd_sleep() - needs to succeed for wakeup_kcompactd()
>> > pgdat_balanced() - needs to be true for prepare_kswapd_sleep() to be true
>> > - with defrag_mode we want high watermark of NR_FREE_PAGES_BLOCKS, but
>> > we were only reclaiming until now and didn't wake up kcompactd and
>> > this actually prevents the wake up?
>
> Coming back to this, there is indeed a defrag_mode issue. My
> apologies, I misunderstood what you were pointing at.
>
> Like I said, kswapd reverts to order-0 in some other place to go to
> sleep and trigger the handoff. At that point, defrag_mode also needs
> to revert to NR_FREE_PAGES.
I missed that revert to order-0 and that without it the current code also
wouldn't make sense. But I agree with the fix.
> It's curious that this didn't stand out in testing. On the contrary,
> kcompactd was still doing the vast majority of the compaction work. It
> looks like kswapd and direct workers on their own manage to balance
> NR_FREE_PAGES_BLOCKS every so often, and then kswapd hands off. Given
> the low number of kcompactd wakeups, the consumers keep it going.
>
> So testing with this:
>
> diff --git a/mm/vmscan.c b/mm/vmscan.c
> index cc422ad830d6..c2aa0a4b67de 100644
> --- a/mm/vmscan.c
> +++ b/mm/vmscan.c
> @@ -6747,8 +6747,11 @@ static bool pgdat_balanced(pg_data_t *pgdat, int order, int highest_zoneidx)
> /*
> * In defrag_mode, watermarks must be met in whole
> * blocks to avoid polluting allocator fallbacks.
> + *
> + * When kswapd has compact gap, check regular
> + * NR_FREE_PAGES and hand over to kcompactd.
> */
> - if (defrag_mode)
> + if (defrag_mode && order)
> item = NR_FREE_PAGES_BLOCKS;
> else
> item = NR_FREE_PAGES;
>
> I'm getting the following results:
>
> fallbackspeed/STUPID-DEFRAGMODE fallbackspeed/DEFRAGMODE
> Hugealloc Time mean 79381.34 ( +0.00%) 88126.12 ( +11.02%)
> Hugealloc Time stddev 85852.16 ( +0.00%) 135366.75 ( +57.67%)
> Kbuild Real time 249.35 ( +0.00%) 226.71 ( -9.04%)
> Kbuild User time 1249.16 ( +0.00%) 1249.37 ( +0.02%)
> Kbuild System time 171.76 ( +0.00%) 166.93 ( -2.79%)
> THP fault alloc 51666.87 ( +0.00%) 52685.60 ( +1.97%)
> THP fault fallback 16970.00 ( +0.00%) 15951.87 ( -6.00%)
> Direct compact fail 166.53 ( +0.00%) 178.93 ( +7.40%)
> Direct compact success 17.13 ( +0.00%) 4.13 ( -71.69%)
> Compact daemon scanned migrate 3095413.33 ( +0.00%) 9231239.53 ( +198.22%)
> Compact daemon scanned free 2155966.53 ( +0.00%) 7053692.87 ( +227.17%)
> Compact direct scanned migrate 265642.47 ( +0.00%) 68388.33 ( -74.26%)
> Compact direct scanned free 130252.60 ( +0.00%) 55634.87 ( -57.29%)
> Compact total migrate scanned 3361055.80 ( +0.00%) 9299627.87 ( +176.69%)
> Compact total free scanned 2286219.13 ( +0.00%) 7109327.73 ( +210.96%)
> Alloc stall 1890.80 ( +0.00%) 6297.60 ( +232.94%)
> Pages kswapd scanned 9043558.80 ( +0.00%) 5952576.73 ( -34.18%)
> Pages kswapd reclaimed 1891708.67 ( +0.00%) 1030645.00 ( -45.52%)
> Pages direct scanned 1017090.60 ( +0.00%) 2688047.60 ( +164.29%)
> Pages direct reclaimed 92682.60 ( +0.00%) 309770.53 ( +234.22%)
> Pages total scanned 10060649.40 ( +0.00%) 8640624.33 ( -14.11%)
> Pages total reclaimed 1984391.27 ( +0.00%) 1340415.53 ( -32.45%)
> Swap out 884585.73 ( +0.00%) 417781.93 ( -52.77%)
> Swap in 287106.27 ( +0.00%) 95589.73 ( -66.71%)
> File refaults 551697.60 ( +0.00%) 426474.80 ( -22.70%)
>
> Work has shifted from direct to kcompactd. In aggregate there is more
> compaction happening. Meanwhile, aggregate reclaim and swapping drop
> quite substantially. %sys is down, so this is just more efficient.
>
> Reclaim and swapping are down substantially, which is great. But the
> reclaim work that remains has shifted somewhat to direct reclaim,
> which is unfortunate. THP delivery is also a tad worse, but still much
> better than !defrag_mode, so not too concerning. That part deserves a
> bit more thought.
>
> Overall, this looks good, though. I'll send a proper patch next week.
>
> Thanks for the review, Vlastimil.
NP!
* Re: [PATCH 5/5] mm: page_alloc: defrag_mode kswapd/kcompactd watermarks
2025-04-13 2:20 ` Johannes Weiner
2025-04-15 7:31 ` Vlastimil Babka
@ 2025-04-15 7:44 ` Vlastimil Babka
1 sibling, 0 replies; 32+ messages in thread
From: Vlastimil Babka @ 2025-04-15 7:44 UTC (permalink / raw)
To: Johannes Weiner; +Cc: Andrew Morton, Mel Gorman, Zi Yan, linux-mm, linux-kernel
On 4/13/25 04:20, Johannes Weiner wrote:
> On Fri, Apr 11, 2025 at 02:21:58PM -0400, Johannes Weiner wrote:
>> On Fri, Apr 11, 2025 at 06:51:51PM +0200, Vlastimil Babka wrote:
>> > [*] now, when checking the code for the handover between kswapd and kcompactd, I
>> > think I found another problem?
>> >
>> > we have:
>> > kswapd_try_to_sleep()
>> > prepare_kswapd_sleep() - needs to succeed for wakeup_kcompactd()
>> > pgdat_balanced() - needs to be true for prepare_kswapd_sleep() to be true
>> > - with defrag_mode we want high watermark of NR_FREE_PAGES_BLOCKS, but
>> > we were only reclaiming until now and didn't wake up kcompactd and
>> > this actually prevents the wake up?
>
> Coming back to this, there is indeed a defrag_mode issue. My
> apologies, I misunderstood what you were pointing at.
>
> Like I said, kswapd reverts to order-0 in some other place to go to
> sleep and trigger the handoff. At that point, defrag_mode also needs
> to revert to NR_FREE_PAGES.
>
> It's curious that this didn't stand out in testing. On the contrary,
> kcompactd was still doing the vast majority of the compaction work. It
> looks like kswapd and direct workers on their own manage to balance
> NR_FREE_PAGES_BLOCKS every so often, and then kswapd hands off. Given
> the low number of kcompactd wakeups, the consumers keep it going.
>
> So testing with this:
>
> diff --git a/mm/vmscan.c b/mm/vmscan.c
> index cc422ad830d6..c2aa0a4b67de 100644
> --- a/mm/vmscan.c
> +++ b/mm/vmscan.c
> @@ -6747,8 +6747,11 @@ static bool pgdat_balanced(pg_data_t *pgdat, int order, int highest_zoneidx)
> /*
> * In defrag_mode, watermarks must be met in whole
> * blocks to avoid polluting allocator fallbacks.
> + *
> + * When kswapd has compact gap, check regular
> + * NR_FREE_PAGES and hand over to kcompactd.
> */
> - if (defrag_mode)
> + if (defrag_mode && order)
> item = NR_FREE_PAGES_BLOCKS;
> else
> item = NR_FREE_PAGES;
>
> I'm getting the following results:
>
> fallbackspeed/STUPID-DEFRAGMODE fallbackspeed/DEFRAGMODE
> Hugealloc Time mean 79381.34 ( +0.00%) 88126.12 ( +11.02%)
> Hugealloc Time stddev 85852.16 ( +0.00%) 135366.75 ( +57.67%)
> Kbuild Real time 249.35 ( +0.00%) 226.71 ( -9.04%)
> Kbuild User time 1249.16 ( +0.00%) 1249.37 ( +0.02%)
> Kbuild System time 171.76 ( +0.00%) 166.93 ( -2.79%)
> THP fault alloc 51666.87 ( +0.00%) 52685.60 ( +1.97%)
> THP fault fallback 16970.00 ( +0.00%) 15951.87 ( -6.00%)
> Direct compact fail 166.53 ( +0.00%) 178.93 ( +7.40%)
> Direct compact success 17.13 ( +0.00%) 4.13 ( -71.69%)
> Compact daemon scanned migrate 3095413.33 ( +0.00%) 9231239.53 ( +198.22%)
> Compact daemon scanned free 2155966.53 ( +0.00%) 7053692.87 ( +227.17%)
However, this brings me back to my concern with __compact_finished()
requiring the high watermark in NR_FREE_PAGES_BLOCKS. IMHO it can easily
lead to situations where all free memory is already contiguous, but because
of one or two concurrent THP allocations we're below the high watermark (yet
not below the low watermark that would wake kswapd again), so further
compaction by kcompactd is simply wasting cpu cycles at that point.
Again, I think a comparison of NR_FREE_PAGES_BLOCKS to NR_FREE_PAGES would in
theory work better for determining whether all free space is as defragmented
as possible.
> Compact direct scanned migrate 265642.47 ( +0.00%) 68388.33 ( -74.26%)
> Compact direct scanned free 130252.60 ( +0.00%) 55634.87 ( -57.29%)
> Compact total migrate scanned 3361055.80 ( +0.00%) 9299627.87 ( +176.69%)
> Compact total free scanned 2286219.13 ( +0.00%) 7109327.73 ( +210.96%)
> Alloc stall 1890.80 ( +0.00%) 6297.60 ( +232.94%)
> Pages kswapd scanned 9043558.80 ( +0.00%) 5952576.73 ( -34.18%)
> Pages kswapd reclaimed 1891708.67 ( +0.00%) 1030645.00 ( -45.52%)
> Pages direct scanned 1017090.60 ( +0.00%) 2688047.60 ( +164.29%)
> Pages direct reclaimed 92682.60 ( +0.00%) 309770.53 ( +234.22%)
> Pages total scanned 10060649.40 ( +0.00%) 8640624.33 ( -14.11%)
> Pages total reclaimed 1984391.27 ( +0.00%) 1340415.53 ( -32.45%)
> Swap out 884585.73 ( +0.00%) 417781.93 ( -52.77%)
> Swap in 287106.27 ( +0.00%) 95589.73 ( -66.71%)
> File refaults 551697.60 ( +0.00%) 426474.80 ( -22.70%)
>
> Work has shifted from direct to kcompactd. In aggregate there is more
> compaction happening. Meanwhile, aggregate reclaim and swapping drop
> quite substantially. %sys is down, so this is just more efficient.
>
> Reclaim and swapping are down substantially, which is great. But the
> reclaim work that remains has shifted somewhat to direct reclaim,
> which is unfortunate. THP delivery is also a tad worse, but still much
> better than !defrag_mode, so not too concerning. That part deserves a
> bit more thought.
>
> Overall, this looks good, though. I'll send a proper patch next week.
>
> Thanks for the review, Vlastimil.
Thread overview: 32+ messages
2025-03-13 21:05 [PATCH 0/5] mm: reliable huge page allocator Johannes Weiner
2025-03-13 21:05 ` [PATCH 1/5] mm: compaction: push watermark into compaction_suitable() callers Johannes Weiner
2025-03-14 15:08 ` Zi Yan
2025-03-16 4:28 ` Hugh Dickins
2025-03-17 18:18 ` Johannes Weiner
2025-03-21 6:21 ` kernel test robot
2025-03-21 13:55 ` Johannes Weiner
2025-04-10 15:19 ` Vlastimil Babka
2025-04-10 20:17 ` Johannes Weiner
2025-04-11 7:32 ` Vlastimil Babka
2025-03-13 21:05 ` [PATCH 2/5] mm: page_alloc: trace type pollution from compaction capturing Johannes Weiner
2025-03-14 18:36 ` Zi Yan
2025-03-13 21:05 ` [PATCH 3/5] mm: page_alloc: defrag_mode Johannes Weiner
2025-03-14 18:54 ` Zi Yan
2025-03-14 20:50 ` Johannes Weiner
2025-03-14 22:54 ` Zi Yan
2025-03-22 15:05 ` Brendan Jackman
2025-03-23 0:58 ` Johannes Weiner
2025-03-23 1:34 ` Johannes Weiner
2025-03-23 3:46 ` Johannes Weiner
2025-03-23 18:04 ` Brendan Jackman
2025-03-31 15:55 ` Johannes Weiner
2025-03-13 21:05 ` [PATCH 4/5] mm: page_alloc: defrag_mode kswapd/kcompactd assistance Johannes Weiner
2025-03-13 21:05 ` [PATCH 5/5] mm: page_alloc: defrag_mode kswapd/kcompactd watermarks Johannes Weiner
2025-03-14 21:05 ` Johannes Weiner
2025-04-11 8:19 ` Vlastimil Babka
2025-04-11 15:39 ` Johannes Weiner
2025-04-11 16:51 ` Vlastimil Babka
2025-04-11 18:21 ` Johannes Weiner
2025-04-13 2:20 ` Johannes Weiner
2025-04-15 7:31 ` Vlastimil Babka
2025-04-15 7:44 ` Vlastimil Babka