From: Shakeel Butt <shakeel.butt@linux.dev>
To: Vlastimil Babka <vbabka@suse.cz>
Cc: Barry Song <21cnbao@gmail.com>,
netdev@vger.kernel.org, linux-mm@kvack.org,
linux-doc@vger.kernel.org, linux-kernel@vger.kernel.org,
Barry Song <v-songbaohua@oppo.com>,
Jonathan Corbet <corbet@lwn.net>,
Eric Dumazet <edumazet@google.com>,
Kuniyuki Iwashima <kuniyu@google.com>,
Paolo Abeni <pabeni@redhat.com>,
Willem de Bruijn <willemb@google.com>,
"David S. Miller" <davem@davemloft.net>,
Jakub Kicinski <kuba@kernel.org>,
Simon Horman <horms@kernel.org>,
Suren Baghdasaryan <surenb@google.com>,
Michal Hocko <mhocko@suse.com>,
Brendan Jackman <jackmanb@google.com>,
Johannes Weiner <hannes@cmpxchg.org>, Zi Yan <ziy@nvidia.com>,
Yunsheng Lin <linyunsheng@huawei.com>,
Huacai Zhou <zhouhuacai@oppo.com>,
Alexei Starovoitov <alexei.starovoitov@gmail.com>,
Harry Yoo <harry.yoo@oracle.com>,
David Hildenbrand <david@redhat.com>,
Matthew Wilcox <willy@infradead.org>,
Roman Gushchin <roman.gushchin@linux.dev>
Subject: Re: [RFC PATCH] mm: net: disable kswapd for high-order network buffer allocation
Date: Mon, 13 Oct 2025 14:35:28 -0700
Message-ID: <dhmafwxu2jj4lu6acoqdhqh46k33sbsj5jvepcfzly4c7dn2t7@ln5dgubll4ac>
In-Reply-To: <927bcdf7-1283-4ddd-bd5e-d2e399b26f7d@suse.cz>
On Mon, Oct 13, 2025 at 08:30:13PM +0200, Vlastimil Babka wrote:
> On 10/13/25 12:16, Barry Song wrote:
> > From: Barry Song <v-songbaohua@oppo.com>
> >
> > On phones, we have observed significant phone heating when running apps
> > with high network bandwidth. This is caused by the network stack frequently
> > waking kswapd for order-3 allocations. As a result, memory reclamation becomes
> > constantly active, even though plenty of memory is still available for network
> > allocations which can fall back to order-0.
> >
> > Commit ce27ec60648d ("net: add high_order_alloc_disable sysctl/static key")
> > introduced high_order_alloc_disable for the transmit (TX) path
> > (skb_page_frag_refill()) to mitigate some memory reclamation issues,
> > allowing the TX path to fall back to order-0 immediately, while leaving the
> > receive (RX) path (__page_frag_cache_refill()) unaffected. Users are
> > generally unaware of the sysctl and cannot easily adjust it for specific use
> > cases. Enabling high_order_alloc_disable also completely disables the
> > benefit of order-3 allocations. Additionally, the sysctl does not apply to the
> > RX path.
> >
> > An alternative approach is to disable kswapd for these frequent
> > allocations and provide best-effort order-3 service for both TX and RX paths,
> > while removing the sysctl entirely.
> >
> > Cc: Jonathan Corbet <corbet@lwn.net>
> > Cc: Eric Dumazet <edumazet@google.com>
> > Cc: Kuniyuki Iwashima <kuniyu@google.com>
> > Cc: Paolo Abeni <pabeni@redhat.com>
> > Cc: Willem de Bruijn <willemb@google.com>
> > Cc: "David S. Miller" <davem@davemloft.net>
> > Cc: Jakub Kicinski <kuba@kernel.org>
> > Cc: Simon Horman <horms@kernel.org>
> > Cc: Vlastimil Babka <vbabka@suse.cz>
> > Cc: Suren Baghdasaryan <surenb@google.com>
> > Cc: Michal Hocko <mhocko@suse.com>
> > Cc: Brendan Jackman <jackmanb@google.com>
> > Cc: Johannes Weiner <hannes@cmpxchg.org>
> > Cc: Zi Yan <ziy@nvidia.com>
> > Cc: Yunsheng Lin <linyunsheng@huawei.com>
> > Cc: Huacai Zhou <zhouhuacai@oppo.com>
> > Signed-off-by: Barry Song <v-songbaohua@oppo.com>
> > ---
> > Documentation/admin-guide/sysctl/net.rst | 12 ------------
> > include/net/sock.h | 1 -
> > mm/page_frag_cache.c | 2 +-
> > net/core/sock.c | 8 ++------
> > net/core/sysctl_net_core.c | 7 -------
> > 5 files changed, 3 insertions(+), 27 deletions(-)
> >
> > diff --git a/Documentation/admin-guide/sysctl/net.rst b/Documentation/admin-guide/sysctl/net.rst
> > index 2ef50828aff1..b903bbae239c 100644
> > --- a/Documentation/admin-guide/sysctl/net.rst
> > +++ b/Documentation/admin-guide/sysctl/net.rst
> > @@ -415,18 +415,6 @@ GRO has decided not to coalesce, it is placed on a per-NAPI list. This
> > list is then passed to the stack when the number of segments reaches the
> > gro_normal_batch limit.
> >
> > -high_order_alloc_disable
> > -------------------------
> > -
> > -By default the allocator for page frags tries to use high order pages (order-3
> > -on x86). While the default behavior gives good results in most cases, some users
> > -might have hit a contention in page allocations/freeing. This was especially
> > -true on older kernels (< 5.14) when high-order pages were not stored on per-cpu
> > -lists. This allows to opt-in for order-0 allocation instead but is now mostly of
> > -historical importance.
> > -
> > -Default: 0
> > -
> > 2. /proc/sys/net/unix - Parameters for Unix domain sockets
> > ----------------------------------------------------------
> >
> > diff --git a/include/net/sock.h b/include/net/sock.h
> > index 60bcb13f045c..62306c1095d5 100644
> > --- a/include/net/sock.h
> > +++ b/include/net/sock.h
> > @@ -3011,7 +3011,6 @@ extern __u32 sysctl_wmem_default;
> > extern __u32 sysctl_rmem_default;
> >
> > #define SKB_FRAG_PAGE_ORDER get_order(32768)
> > -DECLARE_STATIC_KEY_FALSE(net_high_order_alloc_disable_key);
> >
> > static inline int sk_get_wmem0(const struct sock *sk, const struct proto *proto)
> > {
> > diff --git a/mm/page_frag_cache.c b/mm/page_frag_cache.c
> > index d2423f30577e..dd36114dd16f 100644
> > --- a/mm/page_frag_cache.c
> > +++ b/mm/page_frag_cache.c
> > @@ -54,7 +54,7 @@ static struct page *__page_frag_cache_refill(struct page_frag_cache *nc,
> > gfp_t gfp = gfp_mask;
> >
> > #if (PAGE_SIZE < PAGE_FRAG_CACHE_MAX_SIZE)
> > - gfp_mask = (gfp_mask & ~__GFP_DIRECT_RECLAIM) | __GFP_COMP |
> > + gfp_mask = (gfp_mask & ~__GFP_RECLAIM) | __GFP_COMP |
> > __GFP_NOWARN | __GFP_NORETRY | __GFP_NOMEMALLOC;
>
> I'm a bit worried about proliferating "~__GFP_RECLAIM" allocations now that
> we introduced alloc_pages_nolock() and kmalloc_nolock() where it's
> interpreted as "cannot spin" - see gfpflags_allow_spinning(). Currently it's
> fine for the page allocator itself where we have a different entry point
> that uses ALLOC_TRYLOCK, but it can affect nested allocations of all kinds
> of debugging and accounting metadata (page_owner, memcg, alloc tags for slab
> objects etc). kmalloc_nolock() relies on gfpflags_allow_spinning() fully.
>
> I wonder if we should either:
>
> 1) sacrifice a new __GFP flag specifically for "!allow_spin" case to
> determine it precisely.
>
> 2) keep __GFP_KSWAPD_RECLAIM for allocations that remove it for purposes of
> not being disturbing (like proposed here), but that can in fact allow
> spinning. Instead, decide to not wake up kswapd by those when other
> information indicates it's an opportunistic allocation
> (~__GFP_DIRECT_RECLAIM, __GFP_NOWARN | __GFP_NORETRY | __GFP_NOMEMALLOC,
> order > 0...)
>
> 3) something better?
>
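To make the concern quoted above concrete, here is a minimal userspace
sketch of how the proposed mask change interacts with a check shaped like
gfpflags_allow_spinning() as it is described in this thread (no reclaim bit
set means "cannot spin"). The DEMO_* names, the demo_* helpers and the bit
values are assumptions made for illustration only; they are not the
kernel's definitions.

```c
#include <assert.h>
#include <stdbool.h>

/* Illustrative stand-in for gfp_t; bit values below are NOT the kernel's. */
typedef unsigned int demo_gfp_t;

#define DEMO_GFP_DIRECT_RECLAIM 0x1u
#define DEMO_GFP_KSWAPD_RECLAIM 0x2u
#define DEMO_GFP_RECLAIM (DEMO_GFP_DIRECT_RECLAIM | DEMO_GFP_KSWAPD_RECLAIM)

/* Models the interpretation described above: no reclaim bit => cannot spin. */
static inline bool demo_allow_spinning(demo_gfp_t gfp)
{
	return (gfp & DEMO_GFP_RECLAIM) != 0;
}

/* Mask as in the '-' line of the hunk: only direct reclaim is cleared. */
static inline demo_gfp_t demo_mask_before_patch(demo_gfp_t gfp)
{
	return gfp & ~DEMO_GFP_DIRECT_RECLAIM;
}

/* Mask as in the '+' line of the hunk: both reclaim bits are cleared. */
static inline demo_gfp_t demo_mask_after_patch(demo_gfp_t gfp)
{
	return gfp & ~DEMO_GFP_RECLAIM;
}

/* A caller that passes both reclaim bits ends up in different states. */
static inline void demo_self_check(void)
{
	demo_gfp_t caller = DEMO_GFP_RECLAIM;

	/* kswapd bit survives, so nested allocations may still spin. */
	assert(demo_allow_spinning(demo_mask_before_patch(caller)));
	/* No reclaim bit survives, so the mask now reads as "cannot spin". */
	assert(!demo_allow_spinning(demo_mask_after_patch(caller)));
}
```

Calling demo_self_check() from a test program exercises both masks. Under
these assumptions, the high-order page frag attempt would be treated by
nested metadata allocations as if it came from a no-lock context, even
though its only intent is to avoid waking kswapd.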
For the !allow_spin allocations, I think we should just add a new __GFP
flag instead of adding more complexity to other allocators which may or
may not want kswapd wakeup for many different reasons.
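
As a rough illustration of the dedicated-flag idea, the userspace sketch
below decouples "may spin" from the reclaim bits. SKETCH_GFP_NO_SPIN is a
hypothetical flag invented for this example, as are the sketch_* helpers
and all bit values; nothing here is an existing kernel interface.

```c
#include <assert.h>
#include <stdbool.h>

/* Illustrative stand-in for gfp_t; bit values below are NOT the kernel's. */
typedef unsigned int sketch_gfp_t;

#define SKETCH_GFP_DIRECT_RECLAIM 0x1u
#define SKETCH_GFP_KSWAPD_RECLAIM 0x2u
#define SKETCH_GFP_NO_SPIN        0x4u /* hypothetical dedicated flag */
#define SKETCH_GFP_RECLAIM \
	(SKETCH_GFP_DIRECT_RECLAIM | SKETCH_GFP_KSWAPD_RECLAIM)

/* With a dedicated flag, spinning depends only on that flag. */
static inline bool sketch_allow_spinning(sketch_gfp_t gfp)
{
	return (gfp & SKETCH_GFP_NO_SPIN) == 0;
}

/* kswapd wakeup stays an independent decision made by the caller. */
static inline bool sketch_wakes_kswapd(sketch_gfp_t gfp)
{
	return (gfp & SKETCH_GFP_KSWAPD_RECLAIM) != 0;
}

/* Opportunistic high-order attempt: no reclaim at all, spinning allowed. */
static inline sketch_gfp_t sketch_opportunistic(sketch_gfp_t gfp)
{
	return gfp & ~SKETCH_GFP_RECLAIM;
}

/* No-lock style attempt: no reclaim and explicitly marked as no-spin. */
static inline sketch_gfp_t sketch_nolock(sketch_gfp_t gfp)
{
	return (gfp & ~SKETCH_GFP_RECLAIM) | SKETCH_GFP_NO_SPIN;
}

static inline void sketch_self_check(void)
{
	sketch_gfp_t caller = SKETCH_GFP_RECLAIM;

	assert(sketch_allow_spinning(sketch_opportunistic(caller)));
	assert(!sketch_wakes_kswapd(sketch_opportunistic(caller)));
	assert(!sketch_allow_spinning(sketch_nolock(caller)));
}
```

In this model an allocator that wants to skip the kswapd wakeup simply
clears the kswapd bit, and only callers that truly cannot spin set the
hypothetical flag, so neither decision has to be inferred from the other.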