linux-mm.kvack.org archive mirror
From: wang lian <lianux.mm@gmail.com>
To: willy@infradead.org
Cc: 21cnbao@gmail.com, corbet@lwn.net, davem@davemloft.net,
	edumazet@google.com, hannes@cmpxchg.org, horms@kernel.org,
	jackmanb@google.com, kuba@kernel.org, kuniyu@google.com,
	linux-doc@vger.kernel.org, linux-kernel@vger.kernel.org,
	linux-mm@kvack.org, linyunsheng@huawei.com, mhocko@suse.com,
	netdev@vger.kernel.org, pabeni@redhat.com, surenb@google.com,
	v-songbaohua@oppo.com, vbabka@suse.cz, willemb@google.com,
	zhouhuacai@oppo.com, ziy@nvidia.com,
	wang lian <lianux.mm@gmail.com>
Subject: Re: [RFC PATCH] mm: net: disable kswapd for high-order network buffer allocation
Date: Fri, 17 Apr 2026 16:11:34 +0800
Message-ID: <20260417081138.23426-1-lianux.mm@gmail.com>
In-Reply-To: <aO11jqD6jgNs5h8K@casper.infradead.org>


Hi Matthew, Barry,

> So, we try to do an order-3 allocation. kswapd runs and ...
> succeeds in creating order-3 pages? Or fails to?
From our reproducer runs, both happen. We observe intermittent order-3
successes, but also frequent high-order failures followed by order-0
fallback.

> If it fails, that's something we need to sort out.
Agreed. In this workload, the bottleneck appears to be contiguity, not
a shortage of reclaimable memory. Order-0 memory remains available while
suitable order-3 blocks are often unavailable.

> If it succeeds, now we have several order-3 pages, great. But where do
> they all go that we need to run kswapd again?
In our runs, order-3 pockets do show up, but they do not last long.
They get consumed quickly by ongoing skb demand, and the pressure returns.

To investigate this, we built a reproducer that keeps fragmenting memory
while the network stack continuously requests order-3 allocations [1][2].

Raw sample output (trimmed):
---------------------------------------------------------------------------------------------------
TIME       | BUDDYINFO (Normal Zone)        | MEMINFO                   | KSWAPD CPU & VMSTAT      
---------------------------------------------------------------------------------------------------
11:08:11   | ord0:11622 ord3:0              | Free:96MB Avail:1309MB    | CPU: 10.0%  scan:83107932
[*] PHASE 3: Triggering Order-3 Pressure (UDP Storm).
11:08:15   | ord0:52079 ord3:0              | Free:273MB Avail:1300MB   | CPU: 90.9%  scan:85328881
11:08:16   | ord0:102895 ord3:0             | Free:477MB Avail:1309MB   | CPU: 60.0%  scan:85873777
11:08:17   | ord0:115459 ord3:5             | Free:517MB Avail:1284MB   | CPU: 54.5%  scan:86584389
11:08:18   | ord0:115164 ord3:0             | Free:509MB Avail:1107MB   | CPU: 36.4%  scan:87083561
---------------------------------------------------------------------------------------------------
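
For readers who want to sample the same ord0/ord3 figures on their own
systems, here is a minimal sketch. It assumes the BUDDYINFO column above
comes from the Normal zone line of /proc/buddyinfo, which lists free
block counts per order. The helper names (parse_buddyinfo_line,
pages_at_or_above) and the MAX_ORDERS bound are hypothetical and are not
taken from the reproducer in [1][2].

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

#define MAX_ORDERS 16 /* upper bound for this sketch only; the kernel's value is config-dependent */

/*
 * Parse one /proc/buddyinfo line such as
 *   "Node 0, zone   Normal  115459 300 20 5 0 ..."
 * and store the free block count for each order in counts[].
 * Returns the number of orders parsed, or -1 if the line does not
 * describe the requested zone.
 */
static int parse_buddyinfo_line(const char *line, const char *zone,
                                unsigned long counts[MAX_ORDERS])
{
    char name[32];
    int node, consumed = 0, step = 0, n = 0;
    const char *p;

    assert(line && zone && counts);

    if (sscanf(line, "Node %d, zone %31s%n", &node, name, &consumed) != 2)
        return -1;
    if (strcmp(name, zone) != 0)
        return -1;

    /* The remaining fields are per-order counts, starting at order 0. */
    p = line + consumed;
    while (n < MAX_ORDERS && sscanf(p, "%lu%n", &counts[n], &step) == 1) {
        p += step;
        n++;
    }
    return n;
}

/* Base pages held in free blocks of at least the given order. */
static unsigned long pages_at_or_above(const unsigned long *counts, int n,
                                       int order)
{
    unsigned long pages = 0;

    for (int o = order; o < n; o++)
        pages += counts[o] << o;
    return pages;
}
```

Feeding each line of /proc/buddyinfo through parse_buddyinfo_line() with
zone "Normal" gives counts[0] and counts[3] as the ord0 and ord3 values.
pages_at_or_above(counts, n, 3) then shows how little of the free memory
is usable for an order-3 request.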

What we observe is this: free memory is plentiful, order-0 pages are
abundant, and the network allocation has already fallen back to the
order-0 path. Everything looks normal on the surface, yet kswapd remains
trapped in a futile loop.

kswapd appears to be stuck in the following path:
wakeup_kswapd() -> pgdat_balanced() -> __zone_watermark_ok().

Specifically, in __zone_watermark_ok():

        /* For a high-order request, check at least one suitable page is free */
        for (o = order; o < NR_PAGE_ORDERS; o++) {
                struct free_area *area = &z->free_area[o];
                int mt;

                if (!area->nr_free)
                        continue;

                for (mt = 0; mt < MIGRATE_PCPTYPES; mt++) {
                        if (!free_area_empty(area, mt))
                                return true;
                }
        }

Because our reproducer keeps creating fragmentation while the network
stack requests order-3, this loop rarely finds a suitable free block, so
the check keeps returning false for the high-order requirement even
though the system is functionally fine with order-0. To be clear, we are
not creating "artificial" fragmentation for its own sake. We designed the
reproducer to stress-test and expose the existing feedback gap in the
reclaim/compaction logic, and to help pinpoint why kswapd keeps burning
CPU cycles to satisfy a watermark that the allocator has already
abandoned in favor of order-0 fallback.
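
To make the failure mode concrete, here is a small user-space model of
the loop quoted above. It is only a sketch: struct zone_model and
struct free_area_model are hypothetical stand-ins for the kernel
structures, NR_PAGE_ORDERS is assumed to be 11, and the real
__zone_watermark_ok() performs other checks before reaching this loop.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define NR_PAGE_ORDERS   11 /* assumed: MAX_PAGE_ORDER of 10 */
#define MIGRATE_PCPTYPES 3  /* unmovable, movable, reclaimable */

/* Hypothetical stand-in for struct free_area: free blocks per migratetype. */
struct free_area_model {
    unsigned long nr_free_mt[MIGRATE_PCPTYPES];
};

/* Hypothetical stand-in for the part of struct zone used by the check. */
struct zone_model {
    struct free_area_model free_area[NR_PAGE_ORDERS];
};

/*
 * Models only the high-order loop: true if at least one free block of
 * the requested order or larger exists in a pcp migratetype list.
 * Lower-order free pages never contribute, however many there are.
 */
static bool high_order_block_available(const struct zone_model *z,
                                       unsigned int order)
{
    assert(z != NULL);

    for (unsigned int o = order; o < NR_PAGE_ORDERS; o++) {
        for (int mt = 0; mt < MIGRATE_PCPTYPES; mt++) {
            if (z->free_area[o].nr_free_mt[mt])
                return true;
        }
    }
    return false;
}
```

With 115164 free order-0 blocks and nothing at order 3 or above (the
11:08:18 sample), high_order_block_available(&z, 3) is false while the
same call with order 0 is true. That is the state in which the
high-order check keeps failing although order-0 allocations succeed.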

A related discussion in [3] helps reduce vmpressure noise in this area.
Useful, but it does not close the contiguity gap by itself: high-order
wake/reclaim can still repeat when contiguous blocks cannot be formed.

This suggests we need to take a much closer look at how kswapd behaves
in these scenarios. After reviewing everyone's input, we believe it is
time for some targeted work on high-order allocation handling.

We have some rough ideas and plan to run further experiments in this
area. We would appreciate a broader discussion to help address what may
be a gap that has been overlooked so far.

Links:
[1] https://github.com/hack-kernel-just-for-fun/kswap/blob/main/kswapd_spin_repro.c
[2] https://github.com/hack-kernel-just-for-fun/kswap/blob/main/kswapd.sh
[3] https://lore.kernel.org/all/20260406195014.112521-1-jp.kobryn@linux.dev/#r

This was reproduced and cross-checked independently by our team
(Wang Lian <lianux.mm@gmail.com> and Kunwu Chan <kunwu.chan@gmail.com>).

--
Best Regards,
wang lian

