[PATCH v4 1/1] mm/page_alloc: auto-tune watermarks on atomic allocation failure
From: wujing <realwujing@qq.com>
Date: 2026-01-06 6:19 UTC
To: Andrew Morton
Cc: Vlastimil Babka, Matthew Wilcox, Lance Yang, David Hildenbrand,
Michal Hocko, Johannes Weiner, Brendan Jackman,
Suren Baghdasaryan, Zi Yan, Mike Rapoport, Qi Zheng,
Shakeel Butt, linux-mm, netdev, linux-kernel, wujing,
Qiliang Yuan
During high-concurrency network traffic bursts, GFP_ATOMIC order-0
allocations can fail because the atomic reserves are exhausted faster
than they are replenished. The kernel currently lacks a reactive
mechanism to refill these reserves quickly enough, which leads to
packet drops and performance degradation.
This patch introduces a multi-tier, reactive auto-tuning mechanism
built on the watermark_boost infrastructure, with the following
optimizations for robustness and per-zone precision:
1. Per-Zone Debounce: Move the boost debounce timestamp from a global
variable into struct zone. Memory pressure is then handled
independently per node/zone, preventing one node from inadvertently
stifling the response of another.
2. Scaled Boosting Strength: Replace the fixed pageblock_nr_pages
increment with a dynamic value scaled by zone size
(zone_managed_pages >> 10, approx. 0.1%). This provides sufficient
reclaim pressure on large-memory systems where a single pageblock
would be insufficient; see the arithmetic sketch after this list.
3. Precision Path: Restrict the slowpath failure logic to boost only
the preferred candidate zone(s) in the zonelist, avoiding unnecessary
reclaim overhead on distant or unrelated nodes.
4. Proactive Soft-Boosting: Trigger a smaller, half-strength
(pageblock_nr_pages >> 1) boost when an atomic request enters the
slowpath but has not yet failed, aiming to head off reserve
exhaustion before it turns into an allocation failure.
5. Hybrid Tuning & Gradual Decay: Introduce watermark_scale_boost in
struct zone. On failure we not only boost the watermark level but
also temporarily raise the effective watermark_scale_factor. To
ensure stability, the scale boost decays gradually (-5 per kswapd
cycle) in balance_pgdat() rather than being reset instantly, with
watermarks recalculated at each step via setup_per_zone_wmarks().
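For a feel of the numbers, below is a short standalone sketch of the
arithmetic behind items 2 and 5 (userspace C, purely illustrative; the
4 KiB page size, pageblock_order = 9, the example zone sizes and the
default watermark_scale_factor of 10 are assumptions made for this
example, not values taken from the patch):

/*
 * Illustrative userspace sketch, not part of the patch. It mirrors the
 * boost increment from boost_watermark() (item 2), the watermark step
 * (min->low, low->high) that watermark_scale_boost enlarges in
 * __setup_per_zone_wmarks() (item 5, ignoring the min_wmark/4 floor),
 * and the -5 decay applied in balance_pgdat().
 */
#include <stdio.h>

#define PAGEBLOCK_PAGES	(1UL << 9)		/* 512 pages = 2 MiB */

static unsigned long boost_increment(unsigned long managed)
{
	unsigned long scaled = managed >> 10;	/* ~0.1% of the zone */

	return scaled > PAGEBLOCK_PAGES ? scaled : PAGEBLOCK_PAGES;
}

static unsigned long wmark_step(unsigned long managed, unsigned int scale,
				unsigned int scale_boost)
{
	/* mirrors mult_frac(managed, scale + scale_boost, 10000) */
	return managed * (scale + scale_boost) / 10000;
}

int main(void)
{
	/* 4 GiB and 256 GiB zones, counted in 4 KiB pages */
	unsigned long zones[] = { 1UL << 20, 64UL << 20 };

	for (int i = 0; i < 2; i++)
		printf("%10lu managed pages: boost +%lu pages, step %lu -> %lu pages\n",
		       zones[i], boost_increment(zones[i]),
		       wmark_step(zones[i], 10, 0), wmark_step(zones[i], 10, 5));

	/* scale boost decays by 5 per kswapd pass until it reaches zero */
	for (unsigned int s = 15; s; s = s > 5 ? s - 5 : 0)
		printf("watermark_scale_boost = %u\n", s);

	return 0;
}

On the 256 GiB example the boost increment grows from one pageblock
(2 MiB) to roughly 256 MiB worth of pages, which is the point of the
zone_managed_pages scaling in item 2.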
Additionally, the patch gates the tuning behind a strict
(gfp_mask & GFP_ATOMIC) == GFP_ATOMIC check, i.e. both __GFP_HIGH and
__GFP_KSWAPD_RECLAIM must be set, so that only genuine mission-critical
atomic requests trigger it while less sensitive non-blocking
allocations such as GFP_NOWAIT are excluded.
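For reference, a minimal sketch of that gating condition (userspace C,
illustrative only; the flag bit values are placeholders and the
should_tune() helper is hypothetical, but the mask relationship matches
include/linux/gfp_types.h, where GFP_ATOMIC is
__GFP_HIGH | __GFP_KSWAPD_RECLAIM and GFP_NOWAIT lacks __GFP_HIGH):

#include <stdbool.h>
#include <stdio.h>

typedef unsigned int gfp_t;

#define __GFP_HIGH		0x1u	/* placeholder bit value */
#define __GFP_KSWAPD_RECLAIM	0x2u	/* placeholder bit value */

#define GFP_ATOMIC	(__GFP_HIGH | __GFP_KSWAPD_RECLAIM)
#define GFP_NOWAIT	(__GFP_KSWAPD_RECLAIM)

/* Hypothetical helper mirroring the patch's gating condition */
static bool should_tune(gfp_t gfp_mask, unsigned int order)
{
	return ((gfp_mask & GFP_ATOMIC) == GFP_ATOMIC) && order == 0;
}

int main(void)
{
	printf("GFP_ATOMIC order-0: %d\n", should_tune(GFP_ATOMIC, 0)); /* 1 */
	printf("GFP_NOWAIT order-0: %d\n", should_tune(GFP_NOWAIT, 0)); /* 0 */
	printf("GFP_ATOMIC order-3: %d\n", should_tune(GFP_ATOMIC, 3)); /* 0 */
	return 0;
}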
Together, these changes provide a robust, scalable, and precise
defense-in-depth for critical atomic allocations.
Observed failure logs:
[38535644.718700] node 0: slabs: 1031, objs: 43328, free: 0
[38535644.725059] node 1: slabs: 339, objs: 17616, free: 317
[38535645.428345] SLUB: Unable to allocate memory on node -1, gfp=0x480020(GFP_ATOMIC)
[38535645.436888] cache: skbuff_head_cache, object size: 232, buffer size: 256, default order: 2, min order: 0
[38535645.447664] node 0: slabs: 940, objs: 40864, free: 144
[38535645.454026] node 1: slabs: 322, objs: 19168, free: 383
[38535645.556122] SLUB: Unable to allocate memory on node -1, gfp=0x480020(GFP_ATOMIC)
[38535645.564576] cache: skbuff_head_cache, object size: 232, buffer size: 256, default order: 2, min order: 0
[38535649.655523] warn_alloc: 59 callbacks suppressed
[38535649.655527] swapper/100: page allocation failure: order:0, mode:0x480020(GFP_ATOMIC), nodemask=(null)
[38535649.671692] swapper/100 cpuset=/ mems_allowed=0-1
Signed-off-by: wujing <realwujing@qq.com>
Signed-off-by: Qiliang Yuan <yuanql9@chinatelecom.cn>
---
include/linux/mmzone.h | 2 ++
mm/page_alloc.c | 55 +++++++++++++++++++++++++++++++++++++++---
mm/vmscan.c | 10 ++++++++
3 files changed, 64 insertions(+), 3 deletions(-)
diff --git a/include/linux/mmzone.h b/include/linux/mmzone.h
index 75ef7c9f9307..4d06b041f318 100644
--- a/include/linux/mmzone.h
+++ b/include/linux/mmzone.h
@@ -882,6 +882,8 @@ struct zone {
/* zone watermarks, access with *_wmark_pages(zone) macros */
unsigned long _watermark[NR_WMARK];
unsigned long watermark_boost;
+ unsigned long last_boost_jiffies;
+ unsigned int watermark_scale_boost;
unsigned long nr_reserved_highatomic;
unsigned long nr_free_highatomic;
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index c380f063e8b7..4a8243abfb17 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -217,6 +217,7 @@ unsigned int pageblock_order __read_mostly;
static void __free_pages_ok(struct page *page, unsigned int order,
fpi_t fpi_flags);
+static void __setup_per_zone_wmarks(void);
/*
* results with 256, 32 in the lowmem_reserve sysctl:
@@ -2189,7 +2190,7 @@ static inline bool boost_watermark(struct zone *zone)
max_boost = max(pageblock_nr_pages, max_boost);
- zone->watermark_boost = min(zone->watermark_boost + pageblock_nr_pages,
+ zone->watermark_boost = min(zone->watermark_boost + max(pageblock_nr_pages, zone_managed_pages(zone) >> 10),
max_boost);
return true;
@@ -3975,6 +3976,9 @@ static void warn_alloc_show_mem(gfp_t gfp_mask, nodemask_t *nodemask)
mem_cgroup_show_protected_memory(NULL);
}
+/* Auto-tuning watermarks on atomic allocation failures */
+#define BOOST_DEBOUNCE_MS 10000 /* 10 seconds debounce */
+
void warn_alloc(gfp_t gfp_mask, nodemask_t *nodemask, const char *fmt, ...)
{
struct va_format vaf;
@@ -4742,6 +4746,27 @@ __alloc_pages_slowpath(gfp_t gfp_mask, unsigned int order,
if (page)
goto got_pg;
+ /* Proactively boost watermarks when atomic request enters slowpath */
+ if (((gfp_mask & GFP_ATOMIC) == GFP_ATOMIC) && order == 0) {
+ struct zoneref *z;
+ struct zone *zone;
+
+ for_each_zone_zonelist(zone, z, ac->zonelist, ac->highest_zoneidx) {
+ if (time_after(jiffies, zone->last_boost_jiffies + msecs_to_jiffies(BOOST_DEBOUNCE_MS))) {
+ zone->last_boost_jiffies = jiffies;
+ /* Smaller boost than the failure path */
+ zone->watermark_boost = min(zone->watermark_boost + (pageblock_nr_pages >> 1),
+ high_wmark_pages(zone) >> 1);
+ wakeup_kswapd(zone, gfp_mask, 0, ac->highest_zoneidx);
+ /*
+ * Precision: only boost the preferred zone(s) to avoid
+ * overallocation across all nodes if one is sufficient.
+ */
+ break;
+ }
+ }
+ }
+
/*
* For costly allocations, try direct compaction first, as it's likely
* that we have enough base pages and don't need to reclaim. For non-
@@ -4947,6 +4972,30 @@ __alloc_pages_slowpath(gfp_t gfp_mask, unsigned int order,
goto retry;
}
fail:
+ /* Auto-tuning: boost watermarks on atomic allocation failure */
+ if (((gfp_mask & GFP_ATOMIC) == GFP_ATOMIC) && order == 0) {
+ unsigned long now = jiffies;
+ struct zoneref *z;
+ struct zone *zone;
+
+ for_each_zone_zonelist(zone, z, ac->zonelist, ac->highest_zoneidx) {
+ if (time_after(now, zone->last_boost_jiffies + msecs_to_jiffies(BOOST_DEBOUNCE_MS))) {
+ zone->last_boost_jiffies = now;
+ if (boost_watermark(zone)) {
+ /* Temporarily increase scale factor to accelerate reclaim */
+ zone->watermark_scale_boost = min(zone->watermark_scale_boost + 5, 100U);
+ __setup_per_zone_wmarks();
+ wakeup_kswapd(zone, gfp_mask, 0, ac->highest_zoneidx);
+ }
+ /*
+ * Precision: only boost the preferred zone(s) to avoid
+ * overallocation across all nodes if one is sufficient.
+ */
+ break;
+ }
+ }
+ }
+
warn_alloc(gfp_mask, ac->nodemask,
"page allocation failure: order:%u", order);
got_pg:
@@ -6296,6 +6345,7 @@ void __init page_alloc_init_cpuhp(void)
* calculate_totalreserve_pages - called when sysctl_lowmem_reserve_ratio
* or min_free_kbytes changes.
*/
+static void __setup_per_zone_wmarks(void);
static void calculate_totalreserve_pages(void)
{
struct pglist_data *pgdat;
@@ -6440,9 +6490,8 @@ static void __setup_per_zone_wmarks(void)
*/
tmp = max_t(u64, tmp >> 2,
mult_frac(zone_managed_pages(zone),
- watermark_scale_factor, 10000));
+ watermark_scale_factor + zone->watermark_scale_boost, 10000));
- zone->watermark_boost = 0;
zone->_watermark[WMARK_LOW] = min_wmark_pages(zone) + tmp;
zone->_watermark[WMARK_HIGH] = low_wmark_pages(zone) + tmp;
zone->_watermark[WMARK_PROMO] = high_wmark_pages(zone) + tmp;
diff --git a/mm/vmscan.c b/mm/vmscan.c
index 670fe9fae5ba..7fca44bdbfe5 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -7143,6 +7143,7 @@ static int balance_pgdat(pg_data_t *pgdat, int order, int highest_zoneidx)
/* If reclaim was boosted, account for the reclaim done in this pass */
if (boosted) {
unsigned long flags;
+ bool scale_decayed = false;
for (i = 0; i <= highest_zoneidx; i++) {
if (!zone_boosts[i])
@@ -7152,9 +7153,18 @@ static int balance_pgdat(pg_data_t *pgdat, int order, int highest_zoneidx)
zone = pgdat->node_zones + i;
spin_lock_irqsave(&zone->lock, flags);
zone->watermark_boost -= min(zone->watermark_boost, zone_boosts[i]);
+ /* Decay scale boost gradually after kswapd completes work */
+ if (zone->watermark_scale_boost) {
+ zone->watermark_scale_boost = (zone->watermark_scale_boost > 5) ?
+ (zone->watermark_scale_boost - 5) : 0;
+ scale_decayed = true;
+ }
spin_unlock_irqrestore(&zone->lock, flags);
}
+ if (scale_decayed)
+ setup_per_zone_wmarks();
+
/*
* As there is now likely space, wakeup kcompact to defragment
* pageblocks.
--
2.39.5