From: Eric Dumazet <edumazet@google.com>
To: Barry Song <21cnbao@gmail.com>
Cc: corbet@lwn.net, davem@davemloft.net, hannes@cmpxchg.org,
horms@kernel.org, jackmanb@google.com, kuba@kernel.org,
kuniyu@google.com, linux-doc@vger.kernel.org,
linux-kernel@vger.kernel.org, linux-mm@kvack.org,
linyunsheng@huawei.com, mhocko@suse.com, netdev@vger.kernel.org,
pabeni@redhat.com, surenb@google.com, v-songbaohua@oppo.com,
vbabka@suse.cz, willemb@google.com, zhouhuacai@oppo.com,
ziy@nvidia.com
Subject: Re: [RFC PATCH] mm: net: disable kswapd for high-order network buffer allocation
Date: Mon, 13 Oct 2025 22:07:55 -0700
Message-ID: <CANn89iKCZyYi+J=5t2sdmvtERnknkwXrGi4QRzM9btYUywkDfw@mail.gmail.com>
In-Reply-To: <20251014035846.1519-1-21cnbao@gmail.com>

On Mon, Oct 13, 2025 at 8:58 PM Barry Song <21cnbao@gmail.com> wrote:
>
> > >
> > > diff --git a/Documentation/admin-guide/sysctl/net.rst b/Documentation/admin-guide/sysctl/net.rst
> > > index 2ef50828aff1..b903bbae239c 100644
> > > --- a/Documentation/admin-guide/sysctl/net.rst
> > > +++ b/Documentation/admin-guide/sysctl/net.rst
> > > @@ -415,18 +415,6 @@ GRO has decided not to coalesce, it is placed on a per-NAPI list. This
> > > list is then passed to the stack when the number of segments reaches the
> > > gro_normal_batch limit.
> > >
> > > -high_order_alloc_disable
> > > -------------------------
> > > -
> > > -By default the allocator for page frags tries to use high order pages (order-3
> > > -on x86). While the default behavior gives good results in most cases, some users
> > > -might have hit a contention in page allocations/freeing. This was especially
> > > -true on older kernels (< 5.14) when high-order pages were not stored on per-cpu
> > > -lists. This allows to opt-in for order-0 allocation instead but is now mostly of
> > > -historical importance.
> > > -
> >
> > The sysctl is quite useful for testing purposes, say on a freshly
> > booted host, with plenty of free memory.
> >
> > Also, having order-3 pages if possible is quite important for IOMMU use cases.
> >
> > Perhaps kswapd should have some kind of heuristic to not start if a
> > recent run has already happened.
>
> I don’t understand why it shouldn’t start when users continuously request
> order-3 allocations and ask kswapd to prepare order-3 memory — it doesn’t
> make sense logically to skip it just because earlier requests were already
> satisfied.
>
> >
> > I am guessing phones do not need to send 1.6 Tbit per second on
> > network devices (yet),
> > an option could be to disable it in your boot scripts.
>
> A problem with the existing sysctl is that it only covers the TX path;
> for the RX path, we also observe that kswapd consumes significant power.
> I could add the patch below to make it support the RX path, but it feels
> like a bit of a layer violation, since the RX path code resides in mm
> and is intended to serve generic users rather than networking, even
> though the current callers are primarily network-related.

You might have a buggy driver.

High performance drivers use order-0 allocations only.

>
> diff --git a/mm/page_frag_cache.c b/mm/page_frag_cache.c
> index d2423f30577e..8ad18ec49f39 100644
> --- a/mm/page_frag_cache.c
> +++ b/mm/page_frag_cache.c
> @@ -18,6 +18,7 @@
> #include <linux/init.h>
> #include <linux/mm.h>
> #include <linux/page_frag_cache.h>
> +#include <net/sock.h>
> #include "internal.h"
>
> static unsigned long encoded_page_create(struct page *page, unsigned int order,
> @@ -54,10 +55,12 @@ static struct page *__page_frag_cache_refill(struct page_frag_cache *nc,
>  	gfp_t gfp = gfp_mask;
>
>  #if (PAGE_SIZE < PAGE_FRAG_CACHE_MAX_SIZE)
> -	gfp_mask = (gfp_mask & ~__GFP_DIRECT_RECLAIM) | __GFP_COMP |
> -		   __GFP_NOWARN | __GFP_NORETRY | __GFP_NOMEMALLOC;
> -	page = __alloc_pages(gfp_mask, PAGE_FRAG_CACHE_MAX_ORDER,
> -			     numa_mem_id(), NULL);
> +	if (!static_branch_unlikely(&net_high_order_alloc_disable_key)) {
> +		gfp_mask = (gfp_mask & ~__GFP_DIRECT_RECLAIM) | __GFP_COMP |
> +			   __GFP_NOWARN | __GFP_NORETRY | __GFP_NOMEMALLOC;
> +		page = __alloc_pages(gfp_mask, PAGE_FRAG_CACHE_MAX_ORDER,
> +				     numa_mem_id(), NULL);
> +	}
>  #endif
>  	if (unlikely(!page)) {
>
>
> Do you have a better idea on how to make the sysctl also cover the RX path?
>
> Thanks
> Barry
>
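
For readers following the quoted diff, the control flow it proposes can be
modeled in plain userspace C. This is only a sketch under stated assumptions:
every model_* name and the high_order_disabled flag are hypothetical stand-ins
(the flag plays the role of net_high_order_alloc_disable_key), the allocator is
faked, and the kswapd wakeup on a failed high-order attempt is a simplification
of what the page allocator slow path does. It is not kernel code.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical model, not kernel API. Order-3 matches the x86 default
 * mentioned in the quoted documentation. */
#define MODEL_MAX_ORDER 3

struct model_alloc {
	int order;		/* order of the page obtained, -1 on failure */
	bool woke_kswapd;	/* did the failed attempt wake background reclaim? */
};

/* Stand-in for the static key toggled by the sysctl. */
bool high_order_disabled;

/* Fake allocator: order-0 always succeeds, higher orders succeed only
 * when memory is not fragmented. */
bool model_alloc_pages(int order, bool fragmented)
{
	assert(order >= 0 && order <= MODEL_MAX_ORDER);
	return order == 0 || !fragmented;
}

/* Models the refill path: an opportunistic high-order attempt that never
 * reclaims directly, followed by an order-0 fallback. */
struct model_alloc model_refill(bool fragmented)
{
	struct model_alloc res = { .order = -1, .woke_kswapd = false };

	if (!high_order_disabled) {
		if (model_alloc_pages(MODEL_MAX_ORDER, fragmented)) {
			res.order = MODEL_MAX_ORDER;
			return res;
		}
		/* Assumption: a failed attempt still wakes kswapd, which is
		 * the cost the thread is discussing. */
		res.woke_kswapd = true;
	}

	/* Fallback: a single page. */
	if (model_alloc_pages(0, fragmented))
		res.order = 0;
	return res;
}
```

With the flag clear and fragmented memory, model_refill(true) reports an
order-0 page plus a kswapd wakeup; with the flag set, the high-order attempt is
skipped entirely and no wakeup is recorded.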