From: Barry Song <baohua@kernel.org>
To: Baolin Wang <baolin.wang@linux.alibaba.com>
Cc: akpm@linux-foundation.org, linux-mm@kvack.org,
linux-kernel@vger.kernel.org,
Johannes Weiner <hannes@cmpxchg.org>,
David Hildenbrand <david@kernel.org>,
Michal Hocko <mhocko@kernel.org>,
Qi Zheng <zhengqi.arch@bytedance.com>,
Shakeel Butt <shakeel.butt@linux.dev>,
Lorenzo Stoakes <ljs@kernel.org>,
Kairui Song <kasong@tencent.com>,
Axel Rasmussen <axelrasmussen@google.com>,
Yuanchu Xie <yuanchu@google.com>, Wei Xu <weixugc@google.com>,
Wang Lian <wanglian@kylinos.cn>, Kunwu Chan <chentao@kylinos.cn>
Subject: Re: [RFC PATCH v2] mm: Improve pgdat_balanced() to avoid over-reclamation for higher-order allocation
Date: Wed, 22 Apr 2026 18:56:26 +0800 [thread overview]
Message-ID: <CAGsJ_4wyDqnoBXcBQL932pkg8QY79EWrbmKaVqNvm_s5RQrNFw@mail.gmail.com> (raw)
In-Reply-To: <8d4df864-2954-4eb6-b8d7-ae6595646e6e@linux.alibaba.com>
On Wed, Apr 22, 2026 at 2:59 PM Baolin Wang
<baolin.wang@linux.alibaba.com> wrote:
>
>
>
> On 4/22/26 10:18 AM, Barry Song (Xiaomi) wrote:
> > We may encounter cases where the system still has plenty of free
> > memory, but cannot satisfy higher-order allocations. On phones, we
> > have observed that bursty network transfers can cause devices to
> > heat up. Baolin and Kairui have seen similar behavior on servers.
> >
> > Currently, kswapd behaves as follows: when a higher-order allocation
> > is issued with __GFP_KSWAPD_RECLAIM, pgdat_balanced() returns false
> > because __zone_watermark_ok() fails if no suitable higher-order
> > pages exist, even when free memory is well above the high watermark.
> > As a result, kswapd_shrink_node() sets an excessively large
> > sc->nr_to_reclaim and attempts aggressive reclamation:
> >
> > 	for_each_managed_zone_pgdat(zone, pgdat, z, sc->reclaim_idx) {
> > 		sc->nr_to_reclaim += max(high_wmark_pages(zone), SWAP_CLUSTER_MAX);
> > 	}
> >
> > We have an opportunity to re-evaluate the balance by resetting
> > sc->order to 0 after shrink_node() with the following code
> > in kswapd_shrink_node():
> > 	/*
> > 	 * Fragmentation may mean that the system cannot be rebalanced for
> > 	 * high-order allocations. If twice the allocation size has been
> > 	 * reclaimed then recheck watermarks only at order-0 to prevent
> > 	 * excessive reclaim.
> > 	 */
> > 	if (sc->order && sc->nr_reclaimed >= compact_gap(sc->order))
> > 		sc->order = 0;
> >
> > But we have actually scanned and over-reclaimed far more than
> > compact_gap(sc->order). If higher-order allocations continue, we may
> > see persistently high kswapd CPU utilization coexisting with plenty of
> > free memory in the system.
> >
> > We may want to evaluate the situation earlier, in pgdat_balanced()
> > itself: if there is plenty of free memory, we could avoid triggering
> > reclamation with an excessively large sc->nr_to_reclaim value and
> > instead prefer compaction.
> >
> > Cc: Baolin Wang <baolin.wang@linux.alibaba.com>
> > Cc: Johannes Weiner <hannes@cmpxchg.org>
> > Cc: David Hildenbrand <david@kernel.org>
> > Cc: Michal Hocko <mhocko@kernel.org>
> > Cc: Qi Zheng <zhengqi.arch@bytedance.com>
> > Cc: Shakeel Butt <shakeel.butt@linux.dev>
> > Cc: Lorenzo Stoakes <ljs@kernel.org>
> > Cc: Kairui Song <kasong@tencent.com>
> > Cc: Axel Rasmussen <axelrasmussen@google.com>
> > Cc: Yuanchu Xie <yuanchu@google.com>
> > Cc: Wei Xu <weixugc@google.com>
> > Co-developed-by: Wang Lian <wanglian@kylinos.cn>
> > Co-developed-by: Kunwu Chan <chentao@kylinos.cn>
> > Signed-off-by: Barry Song (Xiaomi) <baohua@kernel.org>
> > ---
>
> Thanks Barry for sending out the RFC patch for discussion.
>
> Yes, we have indeed seen reports from our customers' scenarios where
> fragmentation caused kswapd to be woken up and reclaim too many file
> folios (even when free memory was sufficient), leading to severe I/O
> contention that impacted some applications.
>
> However, I'm concerned that this patch might also have side effects,
> such as affecting system defragmentation. In some scenarios, directly
> reclaiming clean pagecache to free up space might be a faster way to
> defragment.

balance_pgdat() can still reclaim clean page cache even when
pgdat_balanced() returns true, provided that nr_boost_reclaim is
non-zero:

	/*
	 * If boosting is not active then only reclaim if there are no
	 * eligible zones. Note that sc.reclaim_idx is not used as
	 * buffer_heads_over_limit may have adjusted it.
	 */
	if (!nr_boost_reclaim && balanced)
		goto out;

	/* Limit the priority of boosting to avoid reclaim writeback */
	if (nr_boost_reclaim && sc.priority == DEF_PRIORITY - 2)
		raise_priority = false;

	/*
	 * Do not writeback or swap pages for boosted reclaim. The
	 * intent is to relieve pressure not issue sub-optimal IO
	 * from reclaim context. If no pages are reclaimed, the
	 * reclaim will be aborted.
	 */
	sc.may_writepage = !nr_boost_reclaim;
	sc.may_swap = !nr_boost_reclaim;

I find that nr_boost_reclaim is almost always non-zero in bursty
network scenarios, so clean page cache should still be reclaimed,
just with much lower kswapd pressure.

> At the very least, I think under defrag_mode, we should be
> more aggressive about defragmentation (including reclaiming some memory
> by kswapd).
I guess we can keep the current behavior when defrag_mode prefers
over-reclaiming in order to form contiguous pages. Would that simply
be an if (defrag_mode) check?
>
> > -RFC v1 was "mm: net: disable kswapd for high-order network
> > buffer allocation":
> > https://lore.kernel.org/linux-mm/20251013101636.69220-1-21cnbao@gmail.com/
> >
> > mm/vmscan.c | 7 +++++++
> > 1 file changed, 7 insertions(+)
> >
> > diff --git a/mm/vmscan.c b/mm/vmscan.c
> > index bd1b1aa12581..4f9668aa8eef 100644
> > --- a/mm/vmscan.c
> > +++ b/mm/vmscan.c
> > @@ -6964,6 +6964,13 @@ static bool pgdat_balanced(pg_data_t *pgdat, int order, int highest_zoneidx)
> >  		if (__zone_watermark_ok(zone, order, mark, highest_zoneidx,
> >  					0, free_pages))
> >  			return true;
> > +		/*
> > +		 * Free pages may be well above the watermark, but if
> > +		 * higher-order pages are unavailable, kswapd may still
> > +		 * trigger excessive reclamation.
> > +		 */
> > +		if (order && compaction_suitable(zone, order, mark, highest_zoneidx))
> > +			return true;
> >  	}
> >
> > /*
>
Thanks
Barry