Hi Matthew, Barry,

> So, we try to do an order-3 allocation.  kswapd runs and ...
> succeeds in creating order-3 pages?  Or fails to?

From our reproducer runs, both happen. We observe intermittent order-3
successes, but also frequent high-order failures followed by order-0
fallback.

> If it fails, that's something we need to sort out.

Agreed. In this workload the bottleneck appears to be contiguity, not a
shortage of reclaimable memory: order-0 pages remain available while
suitable order-3 blocks are often not.

> If it succeeds, now we have several order-3 pages, great.  But where do
> they all go that we need to run kswapd again?

In our runs, order-3 pockets do show up, but they do not last long: they
are consumed quickly by ongoing skb demand, and the pressure returns.

To investigate this, we built a reproducer that keeps creating memory
fragments while the network stack continuously requests order-3
allocations. [1][2]

Raw sample output (trimmed):

---------------------------------------------------------------------------------------------------
TIME     | BUDDYINFO (Normal Zone)  | MEMINFO                 | KSWAPD CPU & VMSTAT
---------------------------------------------------------------------------------------------------
11:08:11 | ord0:11622   ord3:0     | Free:96MB   Avail:1309MB | CPU: 10.0%  scan:83107932
[*] PHASE 3: Triggering Order-3 Pressure (UDP Storm).
11:08:15 | ord0:52079   ord3:0     | Free:273MB  Avail:1300MB | CPU: 90.9%  scan:85328881
11:08:16 | ord0:102895  ord3:0     | Free:477MB  Avail:1309MB | CPU: 60.0%  scan:85873777
11:08:17 | ord0:115459  ord3:5     | Free:517MB  Avail:1284MB | CPU: 54.5%  scan:86584389
11:08:18 | ord0:115164  ord3:0     | Free:509MB  Avail:1107MB | CPU: 36.4%  scan:87083561
---------------------------------------------------------------------------------------------------

The phenomenon we observe is: free memory is plentiful, order-0 pages
are abundant, and the network allocation has already fallen back to the
order-0 path successfully.
Everything seems normal on the surface, yet kswapd remains trapped in a
futile loop.

It appears that kswapd is stuck in the following path:

  wakeup_kswapd -> pgdat_balanced -> __zone_watermark_ok

Specifically, in __zone_watermark_ok():

	/* For a high-order request, check at least one suitable page is free */
	for (o = order; o < NR_PAGE_ORDERS; o++) {
		struct free_area *area = &z->free_area[o];
		int mt;

		if (!area->nr_free)
			continue;

		for (mt = 0; mt < MIGRATE_PCPTYPES; mt++) {
			if (!free_area_empty(area, mt))
				return true;
		}
	}

Because our reproducer keeps creating fragmentation while the network
stack requests order-3, this loop keeps returning false for the
high-order requirement, even though the system is functionally fine
with order-0.

To be clear, we are not creating "artificial" fragments just for the
sake of it. Rather, the reproducer is designed to stress and expose an
existing feedback gap in the reclaim/compaction logic: kswapd keeps
burning CPU cycles to satisfy a watermark that the allocator has
already abandoned in favor of order-0 fallback.

A related discussion in [3] helps reduce vmpressure noise in this area.
That is useful, but it does not close the contiguity gap by itself:
high-order wakeups and reclaim can still repeat when contiguous blocks
cannot be formed.

This situation suggests we should take a much closer look at how kswapd
behaves in these scenarios. After reviewing everyone's input, we
believe it is time to do some targeted work on handling these
high-order cases. We already have some rough ideas and plan to run
further experiments, and we would appreciate a broader discussion to
help address this gap that we may have collectively missed.
Links:
[1] https://github.com/hack-kernel-just-for-fun/kswap/blob/main/kswapd_spin_repro.c
[2] https://github.com/hack-kernel-just-for-fun/kswap/blob/main/kswapd.sh
[3] https://lore.kernel.org/all/20260406195014.112521-1-jp.kobryn@linux.dev/#r

This was reproduced and cross-checked independently by our team
(Wang Lian and Kunwu Chan).

--
Best Regards,
wang lian