From: Qiliang Yuan <realwujing@gmail.com>
To: akpm@linux-foundation.org
Cc: david@kernel.org, mhocko@suse.com, vbabka@suse.cz,
willy@infradead.org, lance.yang@linux.dev, hannes@cmpxchg.org,
surenb@google.com, jackmanb@google.com, ziy@nvidia.com,
weixugc@google.com, rppt@kernel.org, linux-mm@kvack.org,
linux-kernel@vger.kernel.org, netdev@vger.kernel.org,
edumazet@google.com
Subject: Re: [PATCH v5] mm/page_alloc: boost watermarks on atomic allocation failure
Date: Wed, 21 Jan 2026 20:40:10 -0500 [thread overview]
Message-ID: <20260122014034.223163-1-realwujing@gmail.com> (raw)
In-Reply-To: <20260121125603.47b204cc8fbe9466b25cce16@linux-foundation.org>
On Wed, 21 Jan 2026 12:56:03 -0800 Andrew Morton <akpm@linux-foundation.org> wrote:
> This seems sensible to me - dynamically boost reserves in response to
> sustained GFP_ATOMIC allocation failures. It's very much a networking
> thing and I expect the networking people have been looking at these
> issues for years. So let's start by cc'ing them!
Thank you for the feedback and for cc'ing the networking folks! I appreciate
your continued engagement throughout this patch series (v1-v5).
> Obvious question, which I think was asked before: what about gradually
> decreasing those reserves when the packet storm has subsided?
>
> > v4:
> > - Introduced watermark_scale_boost and gradual decay via balance_pgdat.
>
> And there it is, but v5 removed this. Why? Or perhaps I'm misreading
> the implementation.
You're absolutely right - v4 did include a gradual decay mechanism. The
evolution from v1 to v5 was driven by community feedback, and I'd like to
explain the rationale for each major change:
**v1 → v2**: Following your and Matthew Wilcox's feedback on v1, I:
- Reduced the boost from doubling (100%) to 50% increase
- Added a decay mechanism (5% every 5 minutes)
- Added debounce logic
- v1: https://lore.kernel.org/all/tencent_9DB6637676D639B4B7AEA09CC6A6F9E49D0A@qq.com/
- v2: https://lore.kernel.org/all/tencent_6FE67BA7BE8376AB038A71ACAD4FF8A90006@qq.com/
**v2 → v3**: Following Michal Hocko's suggestion to use watermark_scale_factor
instead of min_free_kbytes, I switched to the watermark_boost infrastructure.
This was a significant simplification that reused existing MM subsystem patterns.
- v3: https://lore.kernel.org/all/tencent_44B556221480D8371FBC534ACCF3CE2C8707@qq.com/
**v3 → v4**: Added watermark_scale_boost and gradual decay via balance_pgdat()
to provide more fine-grained control over the reclaim aggressiveness.
- v4: https://lore.kernel.org/all/tencent_D23BFCB69EA088C55AFAF89F926036743E0A@qq.com/
**v4 → v5**: Removed watermark_scale_boost for the following reasons:
- v5: https://lore.kernel.org/all/20260121065740.35616-1-realwujing@gmail.com/
1. **Natural decay exists**: The existing watermark_boost infrastructure already
has a built-in decay path. When kswapd successfully reclaims memory and the
zone becomes balanced, kswapd_shrink_node() automatically resets
watermark_boost to 0. This happens organically without custom decay logic.
2. **Simplicity**: The v4 approach added custom watermark_scale_boost tracking
and manual decay in balance_pgdat(). This added complexity that duplicated
functionality already present in the kswapd reclaim path.
3. **Production validation**: In our production environment (high-throughput
networking workloads), the natural decay via kswapd proved sufficient. Once
memory pressure subsides and kswapd successfully reclaims to the high
watermark, the boost is cleared automatically within seconds.
However, I recognize this is a trade-off. The v4 gradual decay provided more
explicit control over the decay rate. If you or the networking maintainers feel
that explicit decay control is important for packet storm scenarios, I'm happy
to reintroduce the v4 approach or explore alternative decay strategies (e.g.,
time-based decay independent of kswapd success).
> > + zone->watermark_boost = min(zone->watermark_boost +
> > + max(pageblock_nr_pages, zone_managed_pages(zone) >> 10),
>
> ">> 10" is a magic number. What is the reasoning behind choosing this
> value?
Good catch. The ">> 10" (divide by 1024) was chosen to provide a
zone-proportional boost that scales with zone size:
- For a 1GB zone: ~1MB boost per trigger
- For a 16GB zone: ~16MB boost per trigger
The rationale:
1. **Proportionality**: Larger zones experiencing atomic allocation pressure
likely need proportionally larger safety buffers. A fixed pageblock_nr_pages
(typically 2MB) might be insufficient for large zones under heavy load.
2. **Conservative scaling**: 1/1024 (~0.1%) is aggressive enough to help during
sustained pressure but conservative enough to avoid over-reclaim. This was
empirically tuned based on our production workload.
3. **Production results**: In our high-throughput networking environment
(100Gbps+ traffic bursts), this value reduced GFP_ATOMIC failures by ~95%
without causing excessive kswapd activity or impacting normal allocations.
I should document this better. I propose adding a #define:
```c
/*
* Boost watermarks by ~0.1% of zone size on atomic allocation pressure.
* This provides zone-proportional safety buffers: ~1MB per 1GB of zone size.
*/
#define ATOMIC_BOOST_SCALE_SHIFT 10
```
Best regards,
Qiliang Yuan