* [RFC PATCH] mm: net: disable kswapd for high-order network buffer allocation
@ 2025-10-13 10:16 Barry Song
2025-10-13 18:30 ` Vlastimil Babka
` (2 more replies)
0 siblings, 3 replies; 36+ messages in thread
From: Barry Song @ 2025-10-13 10:16 UTC (permalink / raw)
To: netdev, linux-mm, linux-doc
Cc: linux-kernel, Barry Song, Jonathan Corbet, Eric Dumazet,
Kuniyuki Iwashima, Paolo Abeni, Willem de Bruijn,
David S. Miller, Jakub Kicinski, Simon Horman, Vlastimil Babka,
Suren Baghdasaryan, Michal Hocko, Brendan Jackman,
Johannes Weiner, Zi Yan, Yunsheng Lin, Huacai Zhou
From: Barry Song <v-songbaohua@oppo.com>
On phones, we have observed significant device heating when running apps
with high network bandwidth. This is caused by the network stack frequently
waking kswapd for order-3 allocations. As a result, memory reclamation is
constantly active, even though plenty of memory is still available for these
network allocations, which can fall back to order-0.
Commit ce27ec60648d ("net: add high_order_alloc_disable sysctl/static key")
introduced high_order_alloc_disable for the transmit (TX) path
(skb_page_frag_refill()) to mitigate some memory reclamation issues,
allowing the TX path to fall back to order-0 immediately, while leaving the
receive (RX) path (__page_frag_cache_refill()) unaffected. Users are
generally unaware of the sysctl and cannot easily adjust it for specific use
cases. Enabling high_order_alloc_disable also completely disables the
benefit of order-3 allocations. Additionally, the sysctl does not apply to the
RX path.
An alternative approach is to disable kswapd for these frequent
allocations and provide best-effort order-3 service for both TX and RX paths,
while removing the sysctl entirely.
Cc: Jonathan Corbet <corbet@lwn.net>
Cc: Eric Dumazet <edumazet@google.com>
Cc: Kuniyuki Iwashima <kuniyu@google.com>
Cc: Paolo Abeni <pabeni@redhat.com>
Cc: Willem de Bruijn <willemb@google.com>
Cc: "David S. Miller" <davem@davemloft.net>
Cc: Jakub Kicinski <kuba@kernel.org>
Cc: Simon Horman <horms@kernel.org>
Cc: Vlastimil Babka <vbabka@suse.cz>
Cc: Suren Baghdasaryan <surenb@google.com>
Cc: Michal Hocko <mhocko@suse.com>
Cc: Brendan Jackman <jackmanb@google.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Zi Yan <ziy@nvidia.com>
Cc: Yunsheng Lin <linyunsheng@huawei.com>
Cc: Huacai Zhou <zhouhuacai@oppo.com>
Signed-off-by: Barry Song <v-songbaohua@oppo.com>
---
Documentation/admin-guide/sysctl/net.rst | 12 ------------
include/net/sock.h | 1 -
mm/page_frag_cache.c | 2 +-
net/core/sock.c | 8 ++------
net/core/sysctl_net_core.c | 7 -------
5 files changed, 3 insertions(+), 27 deletions(-)
diff --git a/Documentation/admin-guide/sysctl/net.rst b/Documentation/admin-guide/sysctl/net.rst
index 2ef50828aff1..b903bbae239c 100644
--- a/Documentation/admin-guide/sysctl/net.rst
+++ b/Documentation/admin-guide/sysctl/net.rst
@@ -415,18 +415,6 @@ GRO has decided not to coalesce, it is placed on a per-NAPI list. This
list is then passed to the stack when the number of segments reaches the
gro_normal_batch limit.
-high_order_alloc_disable
-------------------------
-
-By default the allocator for page frags tries to use high order pages (order-3
-on x86). While the default behavior gives good results in most cases, some users
-might have hit a contention in page allocations/freeing. This was especially
-true on older kernels (< 5.14) when high-order pages were not stored on per-cpu
-lists. This allows to opt-in for order-0 allocation instead but is now mostly of
-historical importance.
-
-Default: 0
-
2. /proc/sys/net/unix - Parameters for Unix domain sockets
----------------------------------------------------------
diff --git a/include/net/sock.h b/include/net/sock.h
index 60bcb13f045c..62306c1095d5 100644
--- a/include/net/sock.h
+++ b/include/net/sock.h
@@ -3011,7 +3011,6 @@ extern __u32 sysctl_wmem_default;
extern __u32 sysctl_rmem_default;
#define SKB_FRAG_PAGE_ORDER get_order(32768)
-DECLARE_STATIC_KEY_FALSE(net_high_order_alloc_disable_key);
static inline int sk_get_wmem0(const struct sock *sk, const struct proto *proto)
{
diff --git a/mm/page_frag_cache.c b/mm/page_frag_cache.c
index d2423f30577e..dd36114dd16f 100644
--- a/mm/page_frag_cache.c
+++ b/mm/page_frag_cache.c
@@ -54,7 +54,7 @@ static struct page *__page_frag_cache_refill(struct page_frag_cache *nc,
gfp_t gfp = gfp_mask;
#if (PAGE_SIZE < PAGE_FRAG_CACHE_MAX_SIZE)
- gfp_mask = (gfp_mask & ~__GFP_DIRECT_RECLAIM) | __GFP_COMP |
+ gfp_mask = (gfp_mask & ~__GFP_RECLAIM) | __GFP_COMP |
__GFP_NOWARN | __GFP_NORETRY | __GFP_NOMEMALLOC;
page = __alloc_pages(gfp_mask, PAGE_FRAG_CACHE_MAX_ORDER,
numa_mem_id(), NULL);
diff --git a/net/core/sock.c b/net/core/sock.c
index dc03d4b5909a..1fa1e9177d86 100644
--- a/net/core/sock.c
+++ b/net/core/sock.c
@@ -3085,8 +3085,6 @@ static void sk_leave_memory_pressure(struct sock *sk)
}
}
-DEFINE_STATIC_KEY_FALSE(net_high_order_alloc_disable_key);
-
/**
* skb_page_frag_refill - check that a page_frag contains enough room
* @sz: minimum size of the fragment we want to get
@@ -3110,10 +3108,8 @@ bool skb_page_frag_refill(unsigned int sz, struct page_frag *pfrag, gfp_t gfp)
}
pfrag->offset = 0;
- if (SKB_FRAG_PAGE_ORDER &&
- !static_branch_unlikely(&net_high_order_alloc_disable_key)) {
- /* Avoid direct reclaim but allow kswapd to wake */
- pfrag->page = alloc_pages((gfp & ~__GFP_DIRECT_RECLAIM) |
+ if (SKB_FRAG_PAGE_ORDER) {
+ pfrag->page = alloc_pages((gfp & ~__GFP_RECLAIM) |
__GFP_COMP | __GFP_NOWARN |
__GFP_NORETRY,
SKB_FRAG_PAGE_ORDER);
diff --git a/net/core/sysctl_net_core.c b/net/core/sysctl_net_core.c
index 8cf04b57ade1..181f6532beb8 100644
--- a/net/core/sysctl_net_core.c
+++ b/net/core/sysctl_net_core.c
@@ -599,13 +599,6 @@ static struct ctl_table net_core_table[] = {
.extra1 = SYSCTL_ZERO,
.extra2 = SYSCTL_THREE,
},
- {
- .procname = "high_order_alloc_disable",
- .data = &net_high_order_alloc_disable_key.key,
- .maxlen = sizeof(net_high_order_alloc_disable_key),
- .mode = 0644,
- .proc_handler = proc_do_static_key,
- },
{
.procname = "gro_normal_batch",
.data = &net_hotdata.gro_normal_batch,
--
2.39.3 (Apple Git-146)
^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: [RFC PATCH] mm: net: disable kswapd for high-order network buffer allocation
  2025-10-13 10:16 [RFC PATCH] mm: net: disable kswapd for high-order network buffer allocation Barry Song
@ 2025-10-13 18:30 ` Vlastimil Babka
  2025-10-13 21:35   ` Shakeel Butt
                      ` (2 more replies)
  2025-10-13 18:53 ` Eric Dumazet
  2025-10-13 21:56 ` Matthew Wilcox
  2 siblings, 3 replies; 36+ messages in thread
From: Vlastimil Babka @ 2025-10-13 18:30 UTC (permalink / raw)
  To: Barry Song, netdev, linux-mm, linux-doc
  Cc: linux-kernel, Barry Song, Jonathan Corbet, Eric Dumazet,
	Kuniyuki Iwashima, Paolo Abeni, Willem de Bruijn,
	David S. Miller, Jakub Kicinski, Simon Horman, Suren Baghdasaryan,
	Michal Hocko, Brendan Jackman, Johannes Weiner, Zi Yan,
	Yunsheng Lin, Huacai Zhou, Alexei Starovoitov, Harry Yoo,
	David Hildenbrand, Matthew Wilcox, Roman Gushchin

On 10/13/25 12:16, Barry Song wrote:
> From: Barry Song <v-songbaohua@oppo.com>
>
> An alternative approach is to disable kswapd for these frequent
> allocations and provide best-effort order-3 service for both TX and RX paths,
> while removing the sysctl entirely.
[...]
> diff --git a/mm/page_frag_cache.c b/mm/page_frag_cache.c
> index d2423f30577e..dd36114dd16f 100644
> --- a/mm/page_frag_cache.c
> +++ b/mm/page_frag_cache.c
> @@ -54,7 +54,7 @@ static struct page *__page_frag_cache_refill(struct page_frag_cache *nc,
>  	gfp_t gfp = gfp_mask;
>
>  #if (PAGE_SIZE < PAGE_FRAG_CACHE_MAX_SIZE)
> -	gfp_mask = (gfp_mask & ~__GFP_DIRECT_RECLAIM) | __GFP_COMP |
> +	gfp_mask = (gfp_mask & ~__GFP_RECLAIM) | __GFP_COMP |
>  		   __GFP_NOWARN | __GFP_NORETRY | __GFP_NOMEMALLOC;

I'm a bit worried about proliferating "~__GFP_RECLAIM" allocations now that
we introduced alloc_pages_nolock() and kmalloc_nolock(), where it's
interpreted as "cannot spin" - see gfpflags_allow_spinning(). Currently it's
fine for the page allocator itself, where we have a different entry point
that uses ALLOC_TRYLOCK, but it can affect nested allocations of all kinds
of debugging and accounting metadata (page_owner, memcg, alloc tags for slab
objects, etc.). kmalloc_nolock() relies on gfpflags_allow_spinning() fully.

I wonder if we should either:

1) sacrifice a new __GFP flag specifically for the "!allow_spin" case to
determine it precisely.

2) keep __GFP_KSWAPD_RECLAIM for allocations that remove it for purposes of
not being disturbing (like proposed here), but that can in fact allow
spinning. Instead, decide not to wake up kswapd when other information
indicates it's an opportunistic allocation (~__GFP_DIRECT_RECLAIM,
__GFP_NOWARN | __GFP_NORETRY | __GFP_NOMEMALLOC, order > 0, ...)

3) something better?

Vlastimil

^ permalink raw reply	[flat|nested] 36+ messages in thread
* Re: [RFC PATCH] mm: net: disable kswapd for high-order network buffer allocation
  2025-10-13 18:30 ` Vlastimil Babka
@ 2025-10-13 21:35   ` Shakeel Butt
  2025-10-13 21:53     ` Alexei Starovoitov
  2025-10-13 22:46   ` Roman Gushchin
  2025-10-14  7:26   ` Michal Hocko
  2 siblings, 1 reply; 36+ messages in thread
From: Shakeel Butt @ 2025-10-13 21:35 UTC (permalink / raw)
  To: Vlastimil Babka
  Cc: Barry Song, netdev, linux-mm, linux-doc, linux-kernel, Barry Song,
	Jonathan Corbet, Eric Dumazet, Kuniyuki Iwashima, Paolo Abeni,
	Willem de Bruijn, David S. Miller, Jakub Kicinski, Simon Horman,
	Suren Baghdasaryan, Michal Hocko, Brendan Jackman, Johannes Weiner,
	Zi Yan, Yunsheng Lin, Huacai Zhou, Alexei Starovoitov, Harry Yoo,
	David Hildenbrand, Matthew Wilcox, Roman Gushchin

On Mon, Oct 13, 2025 at 08:30:13PM +0200, Vlastimil Babka wrote:
[...]
> I'm a bit worried about proliferating "~__GFP_RECLAIM" allocations now that
> we introduced alloc_pages_nolock() and kmalloc_nolock() where it's
> interpreted as "cannot spin" - see gfpflags_allow_spinning().
[...]
> I wonder if we should either:
>
> 1) sacrifice a new __GFP flag specifically for "!allow_spin" case to
> determine it precisely.
>
> 2) keep __GFP_KSWAPD_RECLAIM for allocations that remove it for purposes of
> not being disturbing (like proposed here), but that can in fact allow
> spinning. Instead, decide to not wake up kswapd by those when other
> information indicates it's an opportunistic allocation
> (~__GFP_DIRECT_RECLAIM, __GFP_NOWARN | __GFP_NORETRY | __GFP_NOMEMALLOC,
> order > 0...)
>
> 3) something better?

For the !allow_spin allocations, I think we should just add a new __GFP
flag instead of adding more complexity to other allocators which may or
may not want kswapd wakeup for many different reasons.

^ permalink raw reply	[flat|nested] 36+ messages in thread
* Re: [RFC PATCH] mm: net: disable kswapd for high-order network buffer allocation
  2025-10-13 21:35   ` Shakeel Butt
@ 2025-10-13 21:53     ` Alexei Starovoitov
  2025-10-13 22:25       ` Shakeel Butt
  0 siblings, 1 reply; 36+ messages in thread
From: Alexei Starovoitov @ 2025-10-13 21:53 UTC (permalink / raw)
  To: Shakeel Butt
  Cc: Vlastimil Babka, Barry Song, Network Development, linux-mm,
	open list:DOCUMENTATION, LKML, Barry Song, Jonathan Corbet,
	Eric Dumazet, Kuniyuki Iwashima, Paolo Abeni, Willem de Bruijn,
	David S. Miller, Jakub Kicinski, Simon Horman, Suren Baghdasaryan,
	Michal Hocko, Brendan Jackman, Johannes Weiner, Zi Yan,
	Yunsheng Lin, Huacai Zhou, Harry Yoo, David Hildenbrand,
	Matthew Wilcox, Roman Gushchin

On Mon, Oct 13, 2025 at 2:35 PM Shakeel Butt <shakeel.butt@linux.dev> wrote:
>
> On Mon, Oct 13, 2025 at 08:30:13PM +0200, Vlastimil Babka wrote:
> [...]
> > I wonder if we should either:
> >
> > 1) sacrifice a new __GFP flag specifically for "!allow_spin" case to
> > determine it precisely.
> [...]
>
> For the !allow_spin allocations, I think we should just add a new __GFP
> flag instead of adding more complexity to other allocators which may or
> may not want kswapd wakeup for many different reasons.

That's what I proposed long ago, but was convinced that the new flag
adds more complexity. Looks like we walked this road far enough and
the new flag will actually make things simpler.
Back then I proposed __GFP_TRYLOCK, which is not a good name.
How about __GFP_NOLOCK ? or __GFP_NOSPIN ?

^ permalink raw reply	[flat|nested] 36+ messages in thread
* Re: [RFC PATCH] mm: net: disable kswapd for high-order network buffer allocation
  2025-10-13 21:53     ` Alexei Starovoitov
@ 2025-10-13 22:25       ` Shakeel Butt
  0 siblings, 0 replies; 36+ messages in thread
From: Shakeel Butt @ 2025-10-13 22:25 UTC (permalink / raw)
  To: Alexei Starovoitov
  Cc: Vlastimil Babka, Barry Song, Network Development, linux-mm,
	open list:DOCUMENTATION, LKML, Barry Song, Jonathan Corbet,
	Eric Dumazet, Kuniyuki Iwashima, Paolo Abeni, Willem de Bruijn,
	David S. Miller, Jakub Kicinski, Simon Horman, Suren Baghdasaryan,
	Michal Hocko, Brendan Jackman, Johannes Weiner, Zi Yan,
	Yunsheng Lin, Huacai Zhou, Harry Yoo, David Hildenbrand,
	Matthew Wilcox, Roman Gushchin

On Mon, Oct 13, 2025 at 02:53:17PM -0700, Alexei Starovoitov wrote:
> On Mon, Oct 13, 2025 at 2:35 PM Shakeel Butt <shakeel.butt@linux.dev> wrote:
[...]
> > For the !allow_spin allocations, I think we should just add a new __GFP
> > flag instead of adding more complexity to other allocators which may or
> > may not want kswapd wakeup for many different reasons.
>
> That's what I proposed long ago, but was convinced that the new flag
> adds more complexity.

Oh somehow I thought we took that route because we are low on available
bits.

> Looks like we walked this road far enough and
> the new flag will actually make things simpler.
> Back then I proposed __GFP_TRYLOCK which is not a good name.
> How about __GFP_NOLOCK ? or __GFP_NOSPIN ?

Let's go with __GFP_NOLOCK as we already have nolock variants of the
allocation APIs.

^ permalink raw reply	[flat|nested] 36+ messages in thread
* Re: [RFC PATCH] mm: net: disable kswapd for high-order network buffer allocation
  2025-10-13 18:30 ` Vlastimil Babka
  2025-10-13 21:35   ` Shakeel Butt
@ 2025-10-13 22:46   ` Roman Gushchin
  2025-10-14  4:31     ` Barry Song
  2025-10-14  7:24     ` Michal Hocko
  1 sibling, 2 replies; 36+ messages in thread
From: Roman Gushchin @ 2025-10-13 22:46 UTC (permalink / raw)
  To: Vlastimil Babka
  Cc: Barry Song, netdev, linux-mm, linux-doc, linux-kernel, Barry Song,
	Jonathan Corbet, Eric Dumazet, Kuniyuki Iwashima, Paolo Abeni,
	Willem de Bruijn, David S. Miller, Jakub Kicinski, Simon Horman,
	Suren Baghdasaryan, Michal Hocko, Brendan Jackman, Johannes Weiner,
	Zi Yan, Yunsheng Lin, Huacai Zhou, Alexei Starovoitov, Harry Yoo,
	David Hildenbrand, Matthew Wilcox

Vlastimil Babka <vbabka@suse.cz> writes:

> On 10/13/25 12:16, Barry Song wrote:
>> From: Barry Song <v-songbaohua@oppo.com>
[...]
>> An alternative approach is to disable kswapd for these frequent
>> allocations and provide best-effort order-3 service for both TX and RX paths,
>> while removing the sysctl entirely.

I'm not sure this is the right path long-term. There are significant
benefits associated with using larger pages, so making the kernel fall
back to order-0 pages easier and sooner feels wrong, tbh. Without kswapd
trying to defragment memory, the only other option is to force tasks
into direct compaction, and that is known to be problematic.

I wonder if instead we should look into optimizing kswapd to be less
power-hungry?

And if you still prefer to disable kswapd for this purpose, it should at
least be conditional on vm.laptop_mode. But again, I don't think it's
the right long-term approach.

Thanks!

^ permalink raw reply	[flat|nested] 36+ messages in thread
* Re: [RFC PATCH] mm: net: disable kswapd for high-order network buffer allocation 2025-10-13 22:46 ` Roman Gushchin @ 2025-10-14 4:31 ` Barry Song 2025-10-14 7:24 ` Michal Hocko 1 sibling, 0 replies; 36+ messages in thread From: Barry Song @ 2025-10-14 4:31 UTC (permalink / raw) To: Roman Gushchin Cc: Vlastimil Babka, netdev, linux-mm, linux-doc, linux-kernel, Barry Song, Jonathan Corbet, Eric Dumazet, Kuniyuki Iwashima, Paolo Abeni, Willem de Bruijn, David S. Miller, Jakub Kicinski, Simon Horman, Suren Baghdasaryan, Michal Hocko, Brendan Jackman, Johannes Weiner, Zi Yan, Yunsheng Lin, Huacai Zhou, Alexei Starovoitov, Harry Yoo, David Hildenbrand, Matthew Wilcox On Tue, Oct 14, 2025 at 6:47 AM Roman Gushchin <roman.gushchin@linux.dev> wrote: > > Vlastimil Babka <vbabka@suse.cz> writes: > > > On 10/13/25 12:16, Barry Song wrote: > >> From: Barry Song <v-songbaohua@oppo.com> > >> > >> On phones, we have observed significant phone heating when running apps > >> with high network bandwidth. This is caused by the network stack frequently > >> waking kswapd for order-3 allocations. As a result, memory reclamation becomes > >> constantly active, even though plenty of memory is still available for network > >> allocations which can fall back to order-0. > >> > >> Commit ce27ec60648d ("net: add high_order_alloc_disable sysctl/static key") > >> introduced high_order_alloc_disable for the transmit (TX) path > >> (skb_page_frag_refill()) to mitigate some memory reclamation issues, > >> allowing the TX path to fall back to order-0 immediately, while leaving the > >> receive (RX) path (__page_frag_cache_refill()) unaffected. Users are > >> generally unaware of the sysctl and cannot easily adjust it for specific use > >> cases. Enabling high_order_alloc_disable also completely disables the > >> benefit of order-3 allocations. Additionally, the sysctl does not apply to the > >> RX path. 
> >> > >> An alternative approach is to disable kswapd for these frequent > >> allocations and provide best-effort order-3 service for both TX and RX paths, > >> while removing the sysctl entirely. > > I'm not sure this is the right path long-term. There are significant > benefits associated with using larger pages, so making the kernel fall > back to order-0 pages easier and sooner feels wrong, tbh. Without kswapd > trying to defragment memory, the only other option is to force tasks > into the direct compaction and it's known to be problematic. I guess the benefits depend on the hardware: for loopback, they might be significant, while for slower network devices, order-3 memory may provide much smaller gains? On the other hand, I wonder if we could make kcompactd more active when kswapd is woken for order-3 allocations, instead of reclaiming order-0 pages to form order-3. > > I wonder if instead we should look into optimizing kswapd to be less > power-hungry? People have been working on this for years, yet reclaiming a folio still requires a lot of effort, including folio_referenced, try_to_unmap_one, and compressing folios to swap out to zRAM. > > And if you still prefer to disable kswapd for this purpose, at least it > should be conditional to vm.laptop_mode. But again, I don't think it's > the right long-term approach. My point is that phones generally have much slower network hardware compared to PCs, and far slower hardware compared to servers, so they are likely not very sensitive to whether memory is order-3 or order-0. On the other hand, phones are highly sensitive to power consumption. As a result, the power cost of creating order-3 pages is likely to outweigh any benefit that order-3 memory might offer for network performance. It might be worth extending the existing net_high_order_alloc_disable_key to the RX path, as I mentioned in my reply to Eric[1], allowing users to decide whether network or power consumption is more important? 
[1] https://lore.kernel.org/linux-mm/20251014035846.1519-1-21cnbao@gmail.com/ Thanks Barry ^ permalink raw reply [flat|nested] 36+ messages in thread
* Re: [RFC PATCH] mm: net: disable kswapd for high-order network buffer allocation 2025-10-13 22:46 ` Roman Gushchin 2025-10-14 4:31 ` Barry Song @ 2025-10-14 7:24 ` Michal Hocko 1 sibling, 0 replies; 36+ messages in thread From: Michal Hocko @ 2025-10-14 7:24 UTC (permalink / raw) To: Roman Gushchin Cc: Vlastimil Babka, Barry Song, netdev, linux-mm, linux-doc, linux-kernel, Barry Song, Jonathan Corbet, Eric Dumazet, Kuniyuki Iwashima, Paolo Abeni, Willem de Bruijn, David S. Miller, Jakub Kicinski, Simon Horman, Suren Baghdasaryan, Brendan Jackman, Johannes Weiner, Zi Yan, Yunsheng Lin, Huacai Zhou, Alexei Starovoitov, Harry Yoo, David Hildenbrand, Matthew Wilcox On Mon 13-10-25 15:46:54, Roman Gushchin wrote: > Vlastimil Babka <vbabka@suse.cz> writes: > > > On 10/13/25 12:16, Barry Song wrote: [...] > >> An alternative approach is to disable kswapd for these frequent > >> allocations and provide best-effort order-3 service for both TX and RX paths, > >> while removing the sysctl entirely. > > I'm not sure this is the right path long-term. There are significant > benefits associated with using larger pages, so making the kernel fall > back to order-0 pages easier and sooner feels wrong, tbh. Without kswapd > trying to defragment memory, the only other option is to force tasks > into the direct compaction and it's known to be problematic. > > I wonder if instead we should look into optimizing kswapd to be less > power-hungry? Exactly. If your specific needs prefer low power consumption to higher order pages availability then we should have a more flexible way to say that than a hardcoded allocation mode. We should be able to tell kswapd/kcompactd how much to try for those allocations. -- Michal Hocko SUSE Labs ^ permalink raw reply [flat|nested] 36+ messages in thread
* Re: [RFC PATCH] mm: net: disable kswapd for high-order network buffer allocation 2025-10-13 18:30 ` Vlastimil Babka 2025-10-13 21:35 ` Shakeel Butt 2025-10-13 22:46 ` Roman Gushchin @ 2025-10-14 7:26 ` Michal Hocko 2025-10-14 8:08 ` Barry Song 2025-10-14 14:27 ` Shakeel Butt 2 siblings, 2 replies; 36+ messages in thread From: Michal Hocko @ 2025-10-14 7:26 UTC (permalink / raw) To: Vlastimil Babka Cc: Barry Song, netdev, linux-mm, linux-doc, linux-kernel, Barry Song, Jonathan Corbet, Eric Dumazet, Kuniyuki Iwashima, Paolo Abeni, Willem de Bruijn, David S. Miller, Jakub Kicinski, Simon Horman, Suren Baghdasaryan, Brendan Jackman, Johannes Weiner, Zi Yan, Yunsheng Lin, Huacai Zhou, Alexei Starovoitov, Harry Yoo, David Hildenbrand, Matthew Wilcox, Roman Gushchin On Mon 13-10-25 20:30:13, Vlastimil Babka wrote: > On 10/13/25 12:16, Barry Song wrote: > > From: Barry Song <v-songbaohua@oppo.com> [...] > I wonder if we should either: > > 1) sacrifice a new __GFP flag specifically for "!allow_spin" case to > determine it precisely. As said in the other reply I do not think this is a good fit for this specific case as it is an all-or-nothing approach. Soon enough we discover that "no effort to reclaim/compact" hurts other usecases. So I do not think we need a dedicated flag for this specific case. We need a way to tell kswapd/kcompactd how much to try instead. -- Michal Hocko SUSE Labs ^ permalink raw reply [flat|nested] 36+ messages in thread
* Re: [RFC PATCH] mm: net: disable kswapd for high-order network buffer allocation 2025-10-14 7:26 ` Michal Hocko @ 2025-10-14 8:08 ` Barry Song 2025-10-14 14:27 ` Shakeel Butt 1 sibling, 0 replies; 36+ messages in thread From: Barry Song @ 2025-10-14 8:08 UTC (permalink / raw) To: mhocko Cc: 21cnbao, alexei.starovoitov, corbet, davem, david, edumazet, hannes, harry.yoo, horms, jackmanb, kuba, kuniyu, linux-doc, linux-kernel, linux-mm, linyunsheng, netdev, pabeni, roman.gushchin, surenb, v-songbaohua, vbabka, willemb, willy, zhouhuacai, ziy, baolin.wang On Tue, Oct 14, 2025 at 3:26 PM Michal Hocko <mhocko@suse.com> wrote: > > On Mon 13-10-25 20:30:13, Vlastimil Babka wrote: > > On 10/13/25 12:16, Barry Song wrote: > > > From: Barry Song <v-songbaohua@oppo.com> > [...] > > I wonder if we should either: > > > > 1) sacrifice a new __GFP flag specifically for "!allow_spin" case to > > determine it precisely. > > As said in other reply I do not think this is a good fit for this > specific case as it is all or nothing approach. Soon enough we discover > that "no effort to reclaim/compact" hurts other usecases. So I do not > think we need a dedicated flag for this specific case. We need a way to > tell kswapd/kcompactd how much to try instead. +Baolin, who may have observed the same issue. An issue with vmscan is that kcompactd is woken up very late, only after reclaiming a large number of order-0 pages to satisfy an order-3 allocation. static int balance_pgdat(pg_data_t *pgdat, int order, int highest_zoneidx) { ... balanced = pgdat_balanced(pgdat, sc.order, highest_zoneidx); if (!balanced && nr_boost_reclaim) { nr_boost_reclaim = 0; goto restart; } /* * If boosting is not active then only reclaim if there are no * eligible zones. Note that sc.reclaim_idx is not used as * buffer_heads_over_limit may have adjusted it. */ if (!nr_boost_reclaim && balanced) goto out; ... if (kswapd_shrink_node(pgdat, &sc)) raise_priority = false; ... out: ...
/* * As there is now likely space, wakeup kcompact to defragment * pageblocks. */ wakeup_kcompactd(pgdat, pageblock_order, highest_zoneidx); } As pgdat_balanced() needs at least one order-3 page to return true: bool __zone_watermark_ok(struct zone *z, unsigned int order, unsigned long mark, int highest_zoneidx, unsigned int alloc_flags, long free_pages) { ... if (free_pages <= min + z->lowmem_reserve[highest_zoneidx]) return false; /* If this is an order-0 request then the watermark is fine */ if (!order) return true; /* For a high-order request, check at least one suitable page is free */ for (o = order; o < NR_PAGE_ORDERS; o++) { struct free_area *area = &z->free_area[o]; int mt; if (!area->nr_free) continue; for (mt = 0; mt < MIGRATE_PCPTYPES; mt++) { if (!free_area_empty(area, mt)) return true; } #ifdef CONFIG_CMA if ((alloc_flags & ALLOC_CMA) && !free_area_empty(area, MIGRATE_CMA)) { return true; } #endif if ((alloc_flags & (ALLOC_HIGHATOMIC|ALLOC_OOM)) && !free_area_empty(area, MIGRATE_HIGHATOMIC)) { return true; } } This appears to be incorrect and will always lead to over-reclamation of order-0 pages to satisfy high-order allocations. I wonder if we should "goto out" earlier to wake up kcompactd when there is plenty of memory available, even if no order-3 pages exist. Conceptually, what I mean is: diff --git a/mm/vmscan.c b/mm/vmscan.c index c80fcae7f2a1..d0e03066bbaa 100644 --- a/mm/vmscan.c +++ b/mm/vmscan.c @@ -7057,9 +7057,8 @@ static int balance_pgdat(pg_data_t *pgdat, int order, int highest_zoneidx) * eligible zones. Note that sc.reclaim_idx is not used as * buffer_heads_over_limit may have adjusted it. */ - if (!nr_boost_reclaim && balanced) + if (!nr_boost_reclaim && (balanced || we_have_plenty_memory_to_compact())) goto out; /* Limit the priority of boosting to avoid reclaim writeback */ if (nr_boost_reclaim && sc.priority == DEF_PRIORITY - 2) raise_priority = false; Thanks Barry ^ permalink raw reply [flat|nested] 36+ messages in thread
* Re: [RFC PATCH] mm: net: disable kswapd for high-order network buffer allocation 2025-10-14 7:26 ` Michal Hocko 2025-10-14 8:08 ` Barry Song @ 2025-10-14 14:27 ` Shakeel Butt 2025-10-14 15:14 ` Michal Hocko 1 sibling, 1 reply; 36+ messages in thread From: Shakeel Butt @ 2025-10-14 14:27 UTC (permalink / raw) To: Michal Hocko Cc: Vlastimil Babka, Barry Song, netdev, linux-mm, linux-doc, linux-kernel, Barry Song, Jonathan Corbet, Eric Dumazet, Kuniyuki Iwashima, Paolo Abeni, Willem de Bruijn, David S. Miller, Jakub Kicinski, Simon Horman, Suren Baghdasaryan, Brendan Jackman, Johannes Weiner, Zi Yan, Yunsheng Lin, Huacai Zhou, Alexei Starovoitov, Harry Yoo, David Hildenbrand, Matthew Wilcox, Roman Gushchin On Tue, Oct 14, 2025 at 09:26:49AM +0200, Michal Hocko wrote: > On Mon 13-10-25 20:30:13, Vlastimil Babka wrote: > > On 10/13/25 12:16, Barry Song wrote: > > > From: Barry Song <v-songbaohua@oppo.com> > [...] > > I wonder if we should either: > > > > 1) sacrifice a new __GFP flag specifically for "!allow_spin" case to > > determine it precisely. > > As said in other reply I do not think this is a good fit for this > specific case as it is all or nothing approach. Soon enough we discover > that "no effort to reclaim/compact" hurts other usecases. So I do not > think we need a dedicated flag for this specific case. We need a way to > tell kswapd/kcompactd how much to try instead. To me this new flag is to decouple two orthogonal requests i.e. no lock semantic and don't wakeup kswapd. At the moment the lack of the kswapd gfp flag conveys the semantics of no lock. This can lead to unintended usage of no lock semantics by users which for whatever reason don't want to wakeup kswapd. ^ permalink raw reply [flat|nested] 36+ messages in thread
* Re: [RFC PATCH] mm: net: disable kswapd for high-order network buffer allocation 2025-10-14 14:27 ` Shakeel Butt @ 2025-10-14 15:14 ` Michal Hocko 2025-10-14 17:22 ` Shakeel Butt 0 siblings, 1 reply; 36+ messages in thread From: Michal Hocko @ 2025-10-14 15:14 UTC (permalink / raw) To: Shakeel Butt Cc: Vlastimil Babka, Barry Song, netdev, linux-mm, linux-doc, linux-kernel, Barry Song, Jonathan Corbet, Eric Dumazet, Kuniyuki Iwashima, Paolo Abeni, Willem de Bruijn, David S. Miller, Jakub Kicinski, Simon Horman, Suren Baghdasaryan, Brendan Jackman, Johannes Weiner, Zi Yan, Yunsheng Lin, Huacai Zhou, Alexei Starovoitov, Harry Yoo, David Hildenbrand, Matthew Wilcox, Roman Gushchin On Tue 14-10-25 07:27:06, Shakeel Butt wrote: > On Tue, Oct 14, 2025 at 09:26:49AM +0200, Michal Hocko wrote: > > On Mon 13-10-25 20:30:13, Vlastimil Babka wrote: > > > On 10/13/25 12:16, Barry Song wrote: > > > > From: Barry Song <v-songbaohua@oppo.com> > > [...] > > > I wonder if we should either: > > > > > > 1) sacrifice a new __GFP flag specifically for "!allow_spin" case to > > > determine it precisely. > > > > As said in other reply I do not think this is a good fit for this > > specific case as it is all or nothing approach. Soon enough we discover > > that "no effort to reclaim/compact" hurts other usecases. So I do not > > think we need a dedicated flag for this specific case. We need a way to > > tell kswapd/kcompactd how much to try instead. > > To me this new floag is to decouple two orthogonal requests i.e. no lock > semantic and don't wakeup kswapd. At the moment the lack of kswapd gfp > flag convey the semantics of no lock. This can lead to unintended usage > of no lock semantics by users which for whatever reason don't want to > wakeup kswapd. I would argue that callers have no business saying whether the MM should wake up kswapd or not. The flag name currently suggests
A random page allocator user shouldn't really care about this low level detail, really. -- Michal Hocko SUSE Labs ^ permalink raw reply [flat|nested] 36+ messages in thread
* Re: [RFC PATCH] mm: net: disable kswapd for high-order network buffer allocation 2025-10-14 15:14 ` Michal Hocko @ 2025-10-14 17:22 ` Shakeel Butt 2025-10-15 6:21 ` Michal Hocko 0 siblings, 1 reply; 36+ messages in thread From: Shakeel Butt @ 2025-10-14 17:22 UTC (permalink / raw) To: Michal Hocko Cc: Vlastimil Babka, Barry Song, netdev, linux-mm, linux-doc, linux-kernel, Barry Song, Jonathan Corbet, Eric Dumazet, Kuniyuki Iwashima, Paolo Abeni, Willem de Bruijn, David S. Miller, Jakub Kicinski, Simon Horman, Suren Baghdasaryan, Brendan Jackman, Johannes Weiner, Zi Yan, Yunsheng Lin, Huacai Zhou, Alexei Starovoitov, Harry Yoo, David Hildenbrand, Matthew Wilcox, Roman Gushchin On Tue, Oct 14, 2025 at 05:14:47PM +0200, Michal Hocko wrote: > On Tue 14-10-25 07:27:06, Shakeel Butt wrote: > > On Tue, Oct 14, 2025 at 09:26:49AM +0200, Michal Hocko wrote: > > > On Mon 13-10-25 20:30:13, Vlastimil Babka wrote: > > > > On 10/13/25 12:16, Barry Song wrote: > > > > > From: Barry Song <v-songbaohua@oppo.com> > > > [...] > > > > I wonder if we should either: > > > > > > > > 1) sacrifice a new __GFP flag specifically for "!allow_spin" case to > > > > determine it precisely. > > > > > > As said in other reply I do not think this is a good fit for this > > > specific case as it is all or nothing approach. Soon enough we discover > > > that "no effort to reclaim/compact" hurts other usecases. So I do not > > > think we need a dedicated flag for this specific case. We need a way to > > > tell kswapd/kcompactd how much to try instead. > > > > To me this new floag is to decouple two orthogonal requests i.e. no lock > > semantic and don't wakeup kswapd. At the moment the lack of kswapd gfp > > flag convey the semantics of no lock. This can lead to unintended usage > > of no lock semantics by users which for whatever reason don't want to > > wakeup kswapd. > > I would argue that callers should have no business into saying whether > the MM should wake up kswapd or not. 
The flag name currently suggests > that but that is mostly for historic reasons. A random page allocator > user shouldn't really care about this low level detail, really. I agree but unless we somehow enforce/warn for such cases, there will be users doing this. A simple grep shows kmsan is doing this. I worry there might be users who are manually setting up gfp flags for their allocations and not providing kswapd flag explicitly. Finding such cases with grep is not easy. ^ permalink raw reply [flat|nested] 36+ messages in thread
* Re: [RFC PATCH] mm: net: disable kswapd for high-order network buffer allocation 2025-10-14 17:22 ` Shakeel Butt @ 2025-10-15 6:21 ` Michal Hocko 2025-10-15 18:26 ` Shakeel Butt 0 siblings, 1 reply; 36+ messages in thread From: Michal Hocko @ 2025-10-15 6:21 UTC (permalink / raw) To: Shakeel Butt Cc: Vlastimil Babka, Barry Song, netdev, linux-mm, linux-doc, linux-kernel, Barry Song, Jonathan Corbet, Eric Dumazet, Kuniyuki Iwashima, Paolo Abeni, Willem de Bruijn, David S. Miller, Jakub Kicinski, Simon Horman, Suren Baghdasaryan, Brendan Jackman, Johannes Weiner, Zi Yan, Yunsheng Lin, Huacai Zhou, Alexei Starovoitov, Harry Yoo, David Hildenbrand, Matthew Wilcox, Roman Gushchin On Tue 14-10-25 10:22:03, Shakeel Butt wrote: > On Tue, Oct 14, 2025 at 05:14:47PM +0200, Michal Hocko wrote: > > On Tue 14-10-25 07:27:06, Shakeel Butt wrote: > > > On Tue, Oct 14, 2025 at 09:26:49AM +0200, Michal Hocko wrote: > > > > On Mon 13-10-25 20:30:13, Vlastimil Babka wrote: > > > > > On 10/13/25 12:16, Barry Song wrote: > > > > > > From: Barry Song <v-songbaohua@oppo.com> > > > > [...] > > > > > I wonder if we should either: > > > > > > > > > > 1) sacrifice a new __GFP flag specifically for "!allow_spin" case to > > > > > determine it precisely. > > > > > > > > As said in other reply I do not think this is a good fit for this > > > > specific case as it is all or nothing approach. Soon enough we discover > > > > that "no effort to reclaim/compact" hurts other usecases. So I do not > > > > think we need a dedicated flag for this specific case. We need a way to > > > > tell kswapd/kcompactd how much to try instead. > > > > > > To me this new floag is to decouple two orthogonal requests i.e. no lock > > > semantic and don't wakeup kswapd. At the moment the lack of kswapd gfp > > > flag convey the semantics of no lock. This can lead to unintended usage > > > of no lock semantics by users which for whatever reason don't want to > > > wakeup kswapd. 
> > > > I would argue that callers should have no business into saying whether > > the MM should wake up kswapd or not. The flag name currently suggests > > that but that is mostly for historic reasons. A random page allocator > > user shouldn't really care about this low level detail, really. > > I agree but unless we somehow enforce/warn for such cases, there will be > users doing this. A simple grep shows kmsan is doing this. I worry there > might be users who are manually setting up gfp flags for their > allocations and not providing kswapd flag explicitly. Finding such cases > with grep is not easy. You are right but this is an inherent problem of our gfp interface. It is too late to have a defensive interface I am afraid. -- Michal Hocko SUSE Labs ^ permalink raw reply [flat|nested] 36+ messages in thread
* Re: [RFC PATCH] mm: net: disable kswapd for high-order network buffer allocation 2025-10-15 6:21 ` Michal Hocko @ 2025-10-15 18:26 ` Shakeel Butt 0 siblings, 0 replies; 36+ messages in thread From: Shakeel Butt @ 2025-10-15 18:26 UTC (permalink / raw) To: Michal Hocko Cc: Vlastimil Babka, Barry Song, netdev, linux-mm, linux-doc, linux-kernel, Barry Song, Jonathan Corbet, Eric Dumazet, Kuniyuki Iwashima, Paolo Abeni, Willem de Bruijn, David S. Miller, Jakub Kicinski, Simon Horman, Suren Baghdasaryan, Brendan Jackman, Johannes Weiner, Zi Yan, Yunsheng Lin, Huacai Zhou, Alexei Starovoitov, Harry Yoo, David Hildenbrand, Matthew Wilcox, Roman Gushchin On Wed, Oct 15, 2025 at 08:21:21AM +0200, Michal Hocko wrote: > On Tue 14-10-25 10:22:03, Shakeel Butt wrote: > > On Tue, Oct 14, 2025 at 05:14:47PM +0200, Michal Hocko wrote: > > > On Tue 14-10-25 07:27:06, Shakeel Butt wrote: > > > > On Tue, Oct 14, 2025 at 09:26:49AM +0200, Michal Hocko wrote: > > > > > On Mon 13-10-25 20:30:13, Vlastimil Babka wrote: > > > > > > On 10/13/25 12:16, Barry Song wrote: > > > > > > > From: Barry Song <v-songbaohua@oppo.com> > > > > > [...] > > > > > > I wonder if we should either: > > > > > > > > > > > > 1) sacrifice a new __GFP flag specifically for "!allow_spin" case to > > > > > > determine it precisely. > > > > > > > > > > As said in other reply I do not think this is a good fit for this > > > > > specific case as it is all or nothing approach. Soon enough we discover > > > > > that "no effort to reclaim/compact" hurts other usecases. So I do not > > > > > think we need a dedicated flag for this specific case. We need a way to > > > > > tell kswapd/kcompactd how much to try instead. > > > > > > > > To me this new floag is to decouple two orthogonal requests i.e. no lock > > > > semantic and don't wakeup kswapd. At the moment the lack of kswapd gfp > > > > flag convey the semantics of no lock. 
This can lead to unintended usage > > > > of no lock semantics by users which for whatever reason don't want to > > > > wakeup kswapd. > > > > > > I would argue that callers should have no business into saying whether > > > the MM should wake up kswapd or not. The flag name currently suggests > > > that but that is mostly for historic reasons. A random page allocator > > > user shouldn't really care about this low level detail, really. > > > > I agree but unless we somehow enforce/warn for such cases, there will be > > users doing this. A simple grep shows kmsan is doing this. I worry there > > might be users who are manually setting up gfp flags for their > > allocations and not providing kswapd flag explicitly. Finding such cases > > with grep is not easy. > > You are right but this is inherent problem of our gfp interface. It is > too late to have a defensive interface I am afraid. I am not really asking to overhaul the whole gfp interface but rather not introduce one more case which can easily be misused. Anyways, this conversation is orthogonal to the original email and I am fine with wait and see approach here for now. ^ permalink raw reply [flat|nested] 36+ messages in thread
* Re: [RFC PATCH] mm: net: disable kswapd for high-order network buffer allocation 2025-10-13 10:16 [RFC PATCH] mm: net: disable kswapd for high-order network buffer allocation Barry Song 2025-10-13 18:30 ` Vlastimil Babka @ 2025-10-13 18:53 ` Eric Dumazet 2025-10-14 3:58 ` Barry Song 2025-10-13 21:56 ` Matthew Wilcox 2 siblings, 1 reply; 36+ messages in thread From: Eric Dumazet @ 2025-10-13 18:53 UTC (permalink / raw) To: Barry Song Cc: netdev, linux-mm, linux-doc, linux-kernel, Barry Song, Jonathan Corbet, Kuniyuki Iwashima, Paolo Abeni, Willem de Bruijn, David S. Miller, Jakub Kicinski, Simon Horman, Vlastimil Babka, Suren Baghdasaryan, Michal Hocko, Brendan Jackman, Johannes Weiner, Zi Yan, Yunsheng Lin, Huacai Zhou On Mon, Oct 13, 2025 at 3:16 AM Barry Song <21cnbao@gmail.com> wrote: > > From: Barry Song <v-songbaohua@oppo.com> > > On phones, we have observed significant phone heating when running apps > with high network bandwidth. This is caused by the network stack frequently > waking kswapd for order-3 allocations. As a result, memory reclamation becomes > constantly active, even though plenty of memory is still available for network > allocations which can fall back to order-0. > > Commit ce27ec60648d ("net: add high_order_alloc_disable sysctl/static key") > introduced high_order_alloc_disable for the transmit (TX) path > (skb_page_frag_refill()) to mitigate some memory reclamation issues, > allowing the TX path to fall back to order-0 immediately, while leaving the > receive (RX) path (__page_frag_cache_refill()) unaffected. Users are > generally unaware of the sysctl and cannot easily adjust it for specific use > cases. Enabling high_order_alloc_disable also completely disables the > benefit of order-3 allocations. Additionally, the sysctl does not apply to the > RX path. 
> > An alternative approach is to disable kswapd for these frequent > allocations and provide best-effort order-3 service for both TX and RX paths, > while removing the sysctl entirely. > > ... > Signed-off-by: Barry Song <v-songbaohua@oppo.com> > --- > Documentation/admin-guide/sysctl/net.rst | 12 ------------ > include/net/sock.h | 1 - > mm/page_frag_cache.c | 2 +- > net/core/sock.c | 8 ++------ > net/core/sysctl_net_core.c | 7 ------- > 5 files changed, 3 insertions(+), 27 deletions(-) > > diff --git a/Documentation/admin-guide/sysctl/net.rst b/Documentation/admin-guide/sysctl/net.rst > index 2ef50828aff1..b903bbae239c 100644 > --- a/Documentation/admin-guide/sysctl/net.rst > +++ b/Documentation/admin-guide/sysctl/net.rst > @@ -415,18 +415,6 @@ GRO has decided not to coalesce, it is placed on a per-NAPI list. This > list is then passed to the stack when the number of segments reaches the > gro_normal_batch limit. > > -high_order_alloc_disable > ------------------------- > - > -By default the allocator for page frags tries to use high order pages (order-3 > -on x86). While the default behavior gives good results in most cases, some users > -might have hit a contention in page allocations/freeing. This was especially > -true on older kernels (< 5.14) when high-order pages were not stored on per-cpu > -lists. This allows to opt-in for order-0 allocation instead but is now mostly of > -historical importance. > - The sysctl is quite useful for testing purposes, say on a freshly booted host, with plenty of free memory. Also, having order-3 pages if possible is quite important for IOMMU use cases. Perhaps kswapd should have some kind of heuristic to not start if a recent run has already happened. I am guessing phones do not need to send 1.6 Tbit per second on network devices (yet), an option could be to disable it in your boot scripts. ^ permalink raw reply [flat|nested] 36+ messages in thread
* Re: [RFC PATCH] mm: net: disable kswapd for high-order network buffer allocation 2025-10-13 18:53 ` Eric Dumazet @ 2025-10-14 3:58 ` Barry Song 2025-10-14 5:07 ` Eric Dumazet 0 siblings, 1 reply; 36+ messages in thread From: Barry Song @ 2025-10-14 3:58 UTC (permalink / raw) To: edumazet Cc: 21cnbao, corbet, davem, hannes, horms, jackmanb, kuba, kuniyu, linux-doc, linux-kernel, linux-mm, linyunsheng, mhocko, netdev, pabeni, surenb, v-songbaohua, vbabka, willemb, zhouhuacai, ziy > > > > diff --git a/Documentation/admin-guide/sysctl/net.rst b/Documentation/admin-guide/sysctl/net.rst > > index 2ef50828aff1..b903bbae239c 100644 > > --- a/Documentation/admin-guide/sysctl/net.rst > > +++ b/Documentation/admin-guide/sysctl/net.rst > > @@ -415,18 +415,6 @@ GRO has decided not to coalesce, it is placed on a per-NAPI list. This > > list is then passed to the stack when the number of segments reaches the > > gro_normal_batch limit. > > > > -high_order_alloc_disable > > ------------------------- > > - > > -By default the allocator for page frags tries to use high order pages (order-3 > > -on x86). While the default behavior gives good results in most cases, some users > > -might have hit a contention in page allocations/freeing. This was especially > > -true on older kernels (< 5.14) when high-order pages were not stored on per-cpu > > -lists. This allows to opt-in for order-0 allocation instead but is now mostly of > > -historical importance. > > - > > The sysctl is quite useful for testing purposes, say on a freshly > booted host, with plenty of free memory. > > Also, having order-3 pages if possible is quite important for IOMM use cases. > > Perhaps kswapd should have some kind of heuristic to not start if a > recent run has already happened. I don’t understand why it shouldn’t start when users continuously request order-3 allocations and ask kswapd to prepare order-3 memory — it doesn’t make sense logically to skip it just because earlier requests were already satisfied. 
> > I am guessing phones do not need to send 1.6 Tbit per second on > network devices (yet), > an option could be to disable it in your boot scripts. A problem with the existing sysctl is that it only covers the TX path; for the RX path, we also observe that kswapd consumes significant power. I could add the patch below to make it support the RX path, but it feels like a bit of a layer violation, since the RX path code resides in mm and is intended to serve generic users rather than networking, even though the current callers are primarily network-related. diff --git a/mm/page_frag_cache.c b/mm/page_frag_cache.c index d2423f30577e..8ad18ec49f39 100644 --- a/mm/page_frag_cache.c +++ b/mm/page_frag_cache.c @@ -18,6 +18,7 @@ #include <linux/init.h> #include <linux/mm.h> #include <linux/page_frag_cache.h> +#include <net/sock.h> #include "internal.h" static unsigned long encoded_page_create(struct page *page, unsigned int order, @@ -54,10 +55,12 @@ static struct page *__page_frag_cache_refill(struct page_frag_cache *nc, gfp_t gfp = gfp_mask; #if (PAGE_SIZE < PAGE_FRAG_CACHE_MAX_SIZE) - gfp_mask = (gfp_mask & ~__GFP_DIRECT_RECLAIM) | __GFP_COMP | - __GFP_NOWARN | __GFP_NORETRY | __GFP_NOMEMALLOC; - page = __alloc_pages(gfp_mask, PAGE_FRAG_CACHE_MAX_ORDER, - numa_mem_id(), NULL); + if (!static_branch_unlikely(&net_high_order_alloc_disable_key)) { + gfp_mask = (gfp_mask & ~__GFP_DIRECT_RECLAIM) | __GFP_COMP | + __GFP_NOWARN | __GFP_NORETRY | __GFP_NOMEMALLOC; + page = __alloc_pages(gfp_mask, PAGE_FRAG_CACHE_MAX_ORDER, + numa_mem_id(), NULL); + } #endif if (unlikely(!page)) { Do you have a better idea on how to make the sysctl also cover the RX path? Thanks Barry ^ permalink raw reply [flat|nested] 36+ messages in thread
* Re: [RFC PATCH] mm: net: disable kswapd for high-order network buffer allocation 2025-10-14 3:58 ` Barry Song @ 2025-10-14 5:07 ` Eric Dumazet 2025-10-14 6:43 ` Barry Song 0 siblings, 1 reply; 36+ messages in thread From: Eric Dumazet @ 2025-10-14 5:07 UTC (permalink / raw) To: Barry Song Cc: corbet, davem, hannes, horms, jackmanb, kuba, kuniyu, linux-doc, linux-kernel, linux-mm, linyunsheng, mhocko, netdev, pabeni, surenb, v-songbaohua, vbabka, willemb, zhouhuacai, ziy On Mon, Oct 13, 2025 at 8:58 PM Barry Song <21cnbao@gmail.com> wrote: > > > > > > > diff --git a/Documentation/admin-guide/sysctl/net.rst b/Documentation/admin-guide/sysctl/net.rst > > > index 2ef50828aff1..b903bbae239c 100644 > > > --- a/Documentation/admin-guide/sysctl/net.rst > > > +++ b/Documentation/admin-guide/sysctl/net.rst > > > @@ -415,18 +415,6 @@ GRO has decided not to coalesce, it is placed on a per-NAPI list. This > > > list is then passed to the stack when the number of segments reaches the > > > gro_normal_batch limit. > > > > > > -high_order_alloc_disable > > > ------------------------- > > > - > > > -By default the allocator for page frags tries to use high order pages (order-3 > > > -on x86). While the default behavior gives good results in most cases, some users > > > -might have hit a contention in page allocations/freeing. This was especially > > > -true on older kernels (< 5.14) when high-order pages were not stored on per-cpu > > > -lists. This allows to opt-in for order-0 allocation instead but is now mostly of > > > -historical importance. > > > - > > > > The sysctl is quite useful for testing purposes, say on a freshly > > booted host, with plenty of free memory. > > > > Also, having order-3 pages if possible is quite important for IOMM use cases. > > > > Perhaps kswapd should have some kind of heuristic to not start if a > > recent run has already happened. 
> > I don’t understand why it shouldn’t start when users continuously request > order-3 allocations and ask kswapd to prepare order-3 memory — it doesn’t > make sense logically to skip it just because earlier requests were already > satisfied. > > > > > I am guessing phones do not need to send 1.6 Tbit per second on > > network devices (yet), > > an option could be to disable it in your boot scripts. > > A problem with the existing sysctl is that it only covers the TX path; > for the RX path, we also observe that kswapd consumes significant power. > I could add the patch below to make it support the RX path, but it feels > like a bit of a layer violation, since the RX path code resides in mm > and is intended to serve generic users rather than networking, even > though the current callers are primarily network-related. You might have a buggy driver. High performance drivers use order-0 allocations only. > > diff --git a/mm/page_frag_cache.c b/mm/page_frag_cache.c > index d2423f30577e..8ad18ec49f39 100644 > --- a/mm/page_frag_cache.c > +++ b/mm/page_frag_cache.c > @@ -18,6 +18,7 @@ > #include <linux/init.h> > #include <linux/mm.h> > #include <linux/page_frag_cache.h> > +#include <net/sock.h> > #include "internal.h" > > static unsigned long encoded_page_create(struct page *page, unsigned int order, > @@ -54,10 +55,12 @@ static struct page *__page_frag_cache_refill(struct page_frag_cache *nc, > gfp_t gfp = gfp_mask; > > #if (PAGE_SIZE < PAGE_FRAG_CACHE_MAX_SIZE) > - gfp_mask = (gfp_mask & ~__GFP_DIRECT_RECLAIM) | __GFP_COMP | > - __GFP_NOWARN | __GFP_NORETRY | __GFP_NOMEMALLOC; > - page = __alloc_pages(gfp_mask, PAGE_FRAG_CACHE_MAX_ORDER, > - numa_mem_id(), NULL); > + if (!static_branch_unlikely(&net_high_order_alloc_disable_key)) { > + gfp_mask = (gfp_mask & ~__GFP_DIRECT_RECLAIM) | __GFP_COMP | > + __GFP_NOWARN | __GFP_NORETRY | __GFP_NOMEMALLOC; > + page = __alloc_pages(gfp_mask, PAGE_FRAG_CACHE_MAX_ORDER, > + numa_mem_id(), NULL); > + } > #endif > if 
(unlikely(!page)) { > > > Do you have a better idea on how to make the sysctl also cover the RX path? > > Thanks > Barry > ^ permalink raw reply [flat|nested] 36+ messages in thread
* Re: [RFC PATCH] mm: net: disable kswapd for high-order network buffer allocation 2025-10-14 5:07 ` Eric Dumazet @ 2025-10-14 6:43 ` Barry Song 2025-10-14 7:01 ` Eric Dumazet 0 siblings, 1 reply; 36+ messages in thread From: Barry Song @ 2025-10-14 6:43 UTC (permalink / raw) To: Eric Dumazet Cc: corbet, davem, hannes, horms, jackmanb, kuba, kuniyu, linux-doc, linux-kernel, linux-mm, linyunsheng, mhocko, netdev, pabeni, surenb, v-songbaohua, vbabka, willemb, zhouhuacai, ziy > > > > A problem with the existing sysctl is that it only covers the TX path; > > for the RX path, we also observe that kswapd consumes significant power. > > I could add the patch below to make it support the RX path, but it feels > > like a bit of a layer violation, since the RX path code resides in mm > > and is intended to serve generic users rather than networking, even > > though the current callers are primarily network-related. > > You might have a buggy driver. We are observing the RX path as follows: do_softirq taskset_hi_action kalPacketAlloc __netdev_alloc_skb page_frag_alloc_align __page_frag_cache_refill This appears to be a fairly common stack. So it is a buggy driver? > > High performance drivers use order-0 allocations only. > Do you have an example of high-performance drivers that use only order-0 memory? Thanks Barry ^ permalink raw reply [flat|nested] 36+ messages in thread
* Re: [RFC PATCH] mm: net: disable kswapd for high-order network buffer allocation 2025-10-14 6:43 ` Barry Song @ 2025-10-14 7:01 ` Eric Dumazet 2025-10-14 8:17 ` Barry Song 0 siblings, 1 reply; 36+ messages in thread From: Eric Dumazet @ 2025-10-14 7:01 UTC (permalink / raw) To: Barry Song Cc: corbet, davem, hannes, horms, jackmanb, kuba, kuniyu, linux-doc, linux-kernel, linux-mm, linyunsheng, mhocko, netdev, pabeni, surenb, v-songbaohua, vbabka, willemb, zhouhuacai, ziy On Mon, Oct 13, 2025 at 11:43 PM Barry Song <21cnbao@gmail.com> wrote: > > > > > > > A problem with the existing sysctl is that it only covers the TX path; > > > for the RX path, we also observe that kswapd consumes significant power. > > > I could add the patch below to make it support the RX path, but it feels > > > like a bit of a layer violation, since the RX path code resides in mm > > > and is intended to serve generic users rather than networking, even > > > though the current callers are primarily network-related. > > > > You might have a buggy driver. > > We are observing the RX path as follows: > > do_softirq > taskset_hi_action > kalPacketAlloc > __netdev_alloc_skb > page_frag_alloc_align > __page_frag_cache_refill > > This appears to be a fairly common stack. > > So it is a buggy driver? No idea, kalPacketAlloc is not in upstream trees. It apparently needs high order allocations. It will fail at some point. > > > > > High performance drivers use order-0 allocations only. > > > > Do you have an example of high-performance drivers that use only order-0 memory? About all drivers using XDP, and/or using napi_get_frags() XDP has been using order-0 pages from the very beginning. ^ permalink raw reply [flat|nested] 36+ messages in thread
* Re: [RFC PATCH] mm: net: disable kswapd for high-order network buffer allocation 2025-10-14 7:01 ` Eric Dumazet @ 2025-10-14 8:17 ` Barry Song 2025-10-14 8:25 ` Eric Dumazet 0 siblings, 1 reply; 36+ messages in thread From: Barry Song @ 2025-10-14 8:17 UTC (permalink / raw) To: Eric Dumazet Cc: corbet, davem, hannes, horms, jackmanb, kuba, kuniyu, linux-doc, linux-kernel, linux-mm, linyunsheng, mhocko, netdev, pabeni, surenb, v-songbaohua, vbabka, willemb, zhouhuacai, ziy On Tue, Oct 14, 2025 at 3:01 PM Eric Dumazet <edumazet@google.com> wrote: > > On Mon, Oct 13, 2025 at 11:43 PM Barry Song <21cnbao@gmail.com> wrote: > > > > > > > > > > A problem with the existing sysctl is that it only covers the TX path; > > > > for the RX path, we also observe that kswapd consumes significant power. > > > > I could add the patch below to make it support the RX path, but it feels > > > > like a bit of a layer violation, since the RX path code resides in mm > > > > and is intended to serve generic users rather than networking, even > > > > though the current callers are primarily network-related. > > > > > > You might have a buggy driver. > > > > We are observing the RX path as follows: > > > > do_softirq > > taskset_hi_action > > kalPacketAlloc > > __netdev_alloc_skb > > page_frag_alloc_align > > __page_frag_cache_refill > > > > This appears to be a fairly common stack. > > > > So it is a buggy driver? > > No idea, kalPacketAlloc is not in upstream trees. > > It apparently needs high order allocations. It will fail at some point. > > > > > > > > > High performance drivers use order-0 allocations only. > > > > > > > Do you have an example of high-performance drivers that use only order-0 memory? > > About all drivers using XDP, and/or using napi_get_frags() > > XDP has been using order-0 pages from the very beginning. Thanks! But there are still many drivers using netdev_alloc_skb()—we shouldn’t overlook them, right? 
net % git grep netdev_alloc_skb | wc -l 359 Thanks Barry ^ permalink raw reply [flat|nested] 36+ messages in thread
* Re: [RFC PATCH] mm: net: disable kswapd for high-order network buffer allocation 2025-10-14 8:17 ` Barry Song @ 2025-10-14 8:25 ` Eric Dumazet 0 siblings, 0 replies; 36+ messages in thread From: Eric Dumazet @ 2025-10-14 8:25 UTC (permalink / raw) To: Barry Song Cc: corbet, davem, hannes, horms, jackmanb, kuba, kuniyu, linux-doc, linux-kernel, linux-mm, linyunsheng, mhocko, netdev, pabeni, surenb, v-songbaohua, vbabka, willemb, zhouhuacai, ziy On Tue, Oct 14, 2025 at 1:17 AM Barry Song <21cnbao@gmail.com> wrote: > > On Tue, Oct 14, 2025 at 3:01 PM Eric Dumazet <edumazet@google.com> wrote: > > > > On Mon, Oct 13, 2025 at 11:43 PM Barry Song <21cnbao@gmail.com> wrote: > > > > > > > > > > > > > A problem with the existing sysctl is that it only covers the TX path; > > > > > for the RX path, we also observe that kswapd consumes significant power. > > > > > I could add the patch below to make it support the RX path, but it feels > > > > > like a bit of a layer violation, since the RX path code resides in mm > > > > > and is intended to serve generic users rather than networking, even > > > > > though the current callers are primarily network-related. > > > > > > > > You might have a buggy driver. > > > > > > We are observing the RX path as follows: > > > > > > do_softirq > > > taskset_hi_action > > > kalPacketAlloc > > > __netdev_alloc_skb > > > page_frag_alloc_align > > > __page_frag_cache_refill > > > > > > This appears to be a fairly common stack. > > > > > > So it is a buggy driver? > > > > No idea, kalPacketAlloc is not in upstream trees. > > > > It apparently needs high order allocations. It will fail at some point. > > > > > > > > > > > > > High performance drivers use order-0 allocations only. > > > > > > > > > > Do you have an example of high-performance drivers that use only order-0 memory? > > > > About all drivers using XDP, and/or using napi_get_frags() > > > > XDP has been using order-0 pages from the very beginning. > > Thanks! 
But there are still many drivers using netdev_alloc_skb()—we > shouldn’t overlook them, right? > > net % git grep netdev_alloc_skb | wc -l > 359 Only the ones that are using 16KB allocations like some WAN drivers :) Some networks use MTU=9000 If a hardware does not provide SG support on receive, a kmalloc() based will use 16KB of memory. By using a frag allocator, we can pack 3 allocations per 32KB instead of 2. TCP can go 50% faster. If memory is short, it will fail no matter what. ^ permalink raw reply [flat|nested] 36+ messages in thread
* Re: [RFC PATCH] mm: net: disable kswapd for high-order network buffer allocation 2025-10-13 10:16 [RFC PATCH] mm: net: disable kswapd for high-order network buffer allocation Barry Song 2025-10-13 18:30 ` Vlastimil Babka 2025-10-13 18:53 ` Eric Dumazet @ 2025-10-13 21:56 ` Matthew Wilcox 2025-10-14 4:09 ` Barry Song 2 siblings, 1 reply; 36+ messages in thread From: Matthew Wilcox @ 2025-10-13 21:56 UTC (permalink / raw) To: Barry Song Cc: netdev, linux-mm, linux-doc, linux-kernel, Barry Song, Jonathan Corbet, Eric Dumazet, Kuniyuki Iwashima, Paolo Abeni, Willem de Bruijn, David S. Miller, Jakub Kicinski, Simon Horman, Vlastimil Babka, Suren Baghdasaryan, Michal Hocko, Brendan Jackman, Johannes Weiner, Zi Yan, Yunsheng Lin, Huacai Zhou On Mon, Oct 13, 2025 at 06:16:36PM +0800, Barry Song wrote: > On phones, we have observed significant phone heating when running apps > with high network bandwidth. This is caused by the network stack frequently > waking kswapd for order-3 allocations. As a result, memory reclamation becomes > constantly active, even though plenty of memory is still available for network > allocations which can fall back to order-0. I think we need to understand what's going on here a whole lot more than this! So, we try to do an order-3 allocation. kswapd runs and ... succeeds in creating order-3 pages? Or fails to? If it fails, that's something we need to sort out. If it succeeds, now we have several order-3 pages, great. But where do they all go that we need to run kswapd again? ^ permalink raw reply [flat|nested] 36+ messages in thread
* Re: [RFC PATCH] mm: net: disable kswapd for high-order network buffer allocation 2025-10-13 21:56 ` Matthew Wilcox @ 2025-10-14 4:09 ` Barry Song 2025-10-14 5:04 ` Eric Dumazet 0 siblings, 1 reply; 36+ messages in thread From: Barry Song @ 2025-10-14 4:09 UTC (permalink / raw) To: Matthew Wilcox Cc: netdev, linux-mm, linux-doc, linux-kernel, Barry Song, Jonathan Corbet, Eric Dumazet, Kuniyuki Iwashima, Paolo Abeni, Willem de Bruijn, David S. Miller, Jakub Kicinski, Simon Horman, Vlastimil Babka, Suren Baghdasaryan, Michal Hocko, Brendan Jackman, Johannes Weiner, Zi Yan, Yunsheng Lin, Huacai Zhou On Tue, Oct 14, 2025 at 5:56 AM Matthew Wilcox <willy@infradead.org> wrote: > > On Mon, Oct 13, 2025 at 06:16:36PM +0800, Barry Song wrote: > > On phones, we have observed significant phone heating when running apps > > with high network bandwidth. This is caused by the network stack frequently > > waking kswapd for order-3 allocations. As a result, memory reclamation becomes > > constantly active, even though plenty of memory is still available for network > > allocations which can fall back to order-0. > > I think we need to understand what's going on here a whole lot more than > this! > > So, we try to do an order-3 allocation. kswapd runs and ... succeeds in > creating order-3 pages? Or fails to? > Our team observed that most of the time we successfully obtain order-3 memory, but the cost is excessive memory reclamation, since we end up over-reclaiming order-0 pages that could have remained in memory. > If it fails, that's something we need to sort out. > > If it succeeds, now we have several order-3 pages, great. But where do > they all go that we need to run kswapd again? The network app keeps running and continues to issue new order-3 allocation requests, so those few order-3 pages won’t be enough to satisfy the continuous demand. Thanks Barry ^ permalink raw reply [flat|nested] 36+ messages in thread
* Re: [RFC PATCH] mm: net: disable kswapd for high-order network buffer allocation 2025-10-14 4:09 ` Barry Song @ 2025-10-14 5:04 ` Eric Dumazet 2025-10-14 8:58 ` Barry Song 0 siblings, 1 reply; 36+ messages in thread From: Eric Dumazet @ 2025-10-14 5:04 UTC (permalink / raw) To: Barry Song Cc: Matthew Wilcox, netdev, linux-mm, linux-doc, linux-kernel, Barry Song, Jonathan Corbet, Kuniyuki Iwashima, Paolo Abeni, Willem de Bruijn, David S. Miller, Jakub Kicinski, Simon Horman, Vlastimil Babka, Suren Baghdasaryan, Michal Hocko, Brendan Jackman, Johannes Weiner, Zi Yan, Yunsheng Lin, Huacai Zhou On Mon, Oct 13, 2025 at 9:09 PM Barry Song <21cnbao@gmail.com> wrote: > > On Tue, Oct 14, 2025 at 5:56 AM Matthew Wilcox <willy@infradead.org> wrote: > > > > On Mon, Oct 13, 2025 at 06:16:36PM +0800, Barry Song wrote: > > > On phones, we have observed significant phone heating when running apps > > > with high network bandwidth. This is caused by the network stack frequently > > > waking kswapd for order-3 allocations. As a result, memory reclamation becomes > > > constantly active, even though plenty of memory is still available for network > > > allocations which can fall back to order-0. > > > > I think we need to understand what's going on here a whole lot more than > > this! > > > > So, we try to do an order-3 allocation. kswapd runs and ... succeeds in > > creating order-3 pages? Or fails to? > > > > Our team observed that most of the time we successfully obtain order-3 > memory, but the cost is excessive memory reclamation, since we end up > over-reclaiming order-0 pages that could have remained in memory. > > > If it fails, that's something we need to sort out. > > > > If it succeeds, now we have several order-3 pages, great. But where do > > they all go that we need to run kswapd again? > > The network app keeps running and continues to issue new order-3 allocation > requests, so those few order-3 pages won’t be enough to satisfy the > continuous demand. 
These pages are freed as order-3 pages, and should replenish the buddy as if nothing happened. I think you are missing something to control how much memory can be pushed on each TCP socket ? What is tcp_wmem on your phones ? What about tcp_mem ? Have you looked at /proc/sys/net/ipv4/tcp_notsent_lowat ^ permalink raw reply [flat|nested] 36+ messages in thread
* Re: [RFC PATCH] mm: net: disable kswapd for high-order network buffer allocation 2025-10-14 5:04 ` Eric Dumazet @ 2025-10-14 8:58 ` Barry Song 2025-10-14 9:49 ` Eric Dumazet 0 siblings, 1 reply; 36+ messages in thread From: Barry Song @ 2025-10-14 8:58 UTC (permalink / raw) To: Eric Dumazet Cc: Matthew Wilcox, netdev, linux-mm, linux-doc, linux-kernel, Barry Song, Jonathan Corbet, Kuniyuki Iwashima, Paolo Abeni, Willem de Bruijn, David S. Miller, Jakub Kicinski, Simon Horman, Vlastimil Babka, Suren Baghdasaryan, Michal Hocko, Brendan Jackman, Johannes Weiner, Zi Yan, Yunsheng Lin, Huacai Zhou On Tue, Oct 14, 2025 at 1:04 PM Eric Dumazet <edumazet@google.com> wrote: > > On Mon, Oct 13, 2025 at 9:09 PM Barry Song <21cnbao@gmail.com> wrote: > > > > On Tue, Oct 14, 2025 at 5:56 AM Matthew Wilcox <willy@infradead.org> wrote: > > > > > > On Mon, Oct 13, 2025 at 06:16:36PM +0800, Barry Song wrote: > > > > On phones, we have observed significant phone heating when running apps > > > > with high network bandwidth. This is caused by the network stack frequently > > > > waking kswapd for order-3 allocations. As a result, memory reclamation becomes > > > > constantly active, even though plenty of memory is still available for network > > > > allocations which can fall back to order-0. > > > > > > I think we need to understand what's going on here a whole lot more than > > > this! > > > > > > So, we try to do an order-3 allocation. kswapd runs and ... succeeds in > > > creating order-3 pages? Or fails to? > > > > > > > Our team observed that most of the time we successfully obtain order-3 > > memory, but the cost is excessive memory reclamation, since we end up > > over-reclaiming order-0 pages that could have remained in memory. > > > > > If it fails, that's something we need to sort out. > > > > > > If it succeeds, now we have several order-3 pages, great. But where do > > > they all go that we need to run kswapd again? 
> > > > The network app keeps running and continues to issue new order-3 allocation > > requests, so those few order-3 pages won’t be enough to satisfy the > > continuous demand. > > These pages are freed as order-3 pages, and should replenish the buddy > as if nothing happened. Ideally, that would be the case if the workload were simple. However, the system may have many other processes and kernel drivers running simultaneously, also consuming memory from the buddy allocator and possibly taking the replenished pages. As a result, we can still observe multiple kswapd wakeups and instances of over-reclamation caused by the network stack’s high-order allocations. > > I think you are missing something to control how much memory can be > pushed on each TCP socket ? > > What is tcp_wmem on your phones ? What about tcp_mem ? > > Have you looked at /proc/sys/net/ipv4/tcp_notsent_lowat # cat /proc/sys/net/ipv4/tcp_wmem 524288 1048576 6710886 # cat /proc/sys/net/ipv4/tcp_mem 131220 174961 262440 # cat /proc/sys/net/ipv4/tcp_notsent_lowat 4294967295 Any thoughts on these settings? Thanks Barry ^ permalink raw reply [flat|nested] 36+ messages in thread
* Re: [RFC PATCH] mm: net: disable kswapd for high-order network buffer allocation 2025-10-14 8:58 ` Barry Song @ 2025-10-14 9:49 ` Eric Dumazet 2025-10-14 10:19 ` Barry Song 0 siblings, 1 reply; 36+ messages in thread From: Eric Dumazet @ 2025-10-14 9:49 UTC (permalink / raw) To: Barry Song Cc: Matthew Wilcox, netdev, linux-mm, linux-doc, linux-kernel, Barry Song, Jonathan Corbet, Kuniyuki Iwashima, Paolo Abeni, Willem de Bruijn, David S. Miller, Jakub Kicinski, Simon Horman, Vlastimil Babka, Suren Baghdasaryan, Michal Hocko, Brendan Jackman, Johannes Weiner, Zi Yan, Yunsheng Lin, Huacai Zhou On Tue, Oct 14, 2025 at 1:58 AM Barry Song <21cnbao@gmail.com> wrote: > > On Tue, Oct 14, 2025 at 1:04 PM Eric Dumazet <edumazet@google.com> wrote: > > > > On Mon, Oct 13, 2025 at 9:09 PM Barry Song <21cnbao@gmail.com> wrote: > > > > > > On Tue, Oct 14, 2025 at 5:56 AM Matthew Wilcox <willy@infradead.org> wrote: > > > > > > > > On Mon, Oct 13, 2025 at 06:16:36PM +0800, Barry Song wrote: > > > > > On phones, we have observed significant phone heating when running apps > > > > > with high network bandwidth. This is caused by the network stack frequently > > > > > waking kswapd for order-3 allocations. As a result, memory reclamation becomes > > > > > constantly active, even though plenty of memory is still available for network > > > > > allocations which can fall back to order-0. > > > > > > > > I think we need to understand what's going on here a whole lot more than > > > > this! > > > > > > > > So, we try to do an order-3 allocation. kswapd runs and ... succeeds in > > > > creating order-3 pages? Or fails to? > > > > > > > > > > Our team observed that most of the time we successfully obtain order-3 > > > memory, but the cost is excessive memory reclamation, since we end up > > > over-reclaiming order-0 pages that could have remained in memory. > > > > > > > If it fails, that's something we need to sort out. 
> > > > > > > > If it succeeds, now we have several order-3 pages, great. But where do > > > > they all go that we need to run kswapd again? > > > > > > The network app keeps running and continues to issue new order-3 allocation > > > requests, so those few order-3 pages won’t be enough to satisfy the > > > continuous demand. > > > > These pages are freed as order-3 pages, and should replenish the buddy > > as if nothing happened. > > Ideally, that would be the case if the workload were simple. However, the > system may have many other processes and kernel drivers running > simultaneously, also consuming memory from the buddy allocator and possibly > taking the replenished pages. As a result, we can still observe multiple > kswapd wakeups and instances of over-reclamation caused by the network > stack’s high-order allocations. > > > > > I think you are missing something to control how much memory can be > > pushed on each TCP socket ? > > > > What is tcp_wmem on your phones ? What about tcp_mem ? > > > > Have you looked at /proc/sys/net/ipv4/tcp_notsent_lowat > > # cat /proc/sys/net/ipv4/tcp_wmem > 524288 1048576 6710886 Ouch. That is insane tcp_wmem[0] . Please stick to 4096, or risk OOM of various sorts. > > # cat /proc/sys/net/ipv4/tcp_notsent_lowat > 4294967295 > > Any thoughts on these settings? Please look at https://www.kernel.org/doc/Documentation/networking/ip-sysctl.txt tcp_notsent_lowat - UNSIGNED INTEGER A TCP socket can control the amount of unsent bytes in its write queue, thanks to TCP_NOTSENT_LOWAT socket option. poll()/select()/epoll() reports POLLOUT events if the amount of unsent bytes is below a per socket value, and if the write queue is not full. sendmsg() will also not add new buffers if the limit is hit. This global variable controls the amount of unsent data for sockets not using TCP_NOTSENT_LOWAT. For these sockets, a change to the global variable has immediate effect. 
Setting this sysctl to 2MB can effectively reduce the amount of memory in TCP write queues by 66%, or allow you to increase tcp_wmem[2] so that only flows needing big BDP can get it. ^ permalink raw reply [flat|nested] 36+ messages in thread
* Re: [RFC PATCH] mm: net: disable kswapd for high-order network buffer allocation 2025-10-14 9:49 ` Eric Dumazet @ 2025-10-14 10:19 ` Barry Song 2025-10-14 10:39 ` Eric Dumazet 2025-10-14 14:37 ` Shakeel Butt 0 siblings, 2 replies; 36+ messages in thread From: Barry Song @ 2025-10-14 10:19 UTC (permalink / raw) To: Eric Dumazet Cc: Matthew Wilcox, netdev, linux-mm, linux-doc, linux-kernel, Barry Song, Jonathan Corbet, Kuniyuki Iwashima, Paolo Abeni, Willem de Bruijn, David S. Miller, Jakub Kicinski, Simon Horman, Vlastimil Babka, Suren Baghdasaryan, Michal Hocko, Brendan Jackman, Johannes Weiner, Zi Yan, Yunsheng Lin, Huacai Zhou > > > > > > > > I think you are missing something to control how much memory can be > > > pushed on each TCP socket ? > > > > > > What is tcp_wmem on your phones ? What about tcp_mem ? > > > > > > Have you looked at /proc/sys/net/ipv4/tcp_notsent_lowat > > > > # cat /proc/sys/net/ipv4/tcp_wmem > > 524288 1048576 6710886 > > Ouch. That is insane tcp_wmem[0] . > > Please stick to 4096, or risk OOM of various sorts. > > > > > # cat /proc/sys/net/ipv4/tcp_notsent_lowat > > 4294967295 > > > > Any thoughts on these settings? > > Please look at > https://www.kernel.org/doc/Documentation/networking/ip-sysctl.txt > > tcp_notsent_lowat - UNSIGNED INTEGER > A TCP socket can control the amount of unsent bytes in its write queue, > thanks to TCP_NOTSENT_LOWAT socket option. poll()/select()/epoll() > reports POLLOUT events if the amount of unsent bytes is below a per > socket value, and if the write queue is not full. sendmsg() will > also not add new buffers if the limit is hit. > > This global variable controls the amount of unsent data for > sockets not using TCP_NOTSENT_LOWAT. For these sockets, a change > to the global variable has immediate effect. > > > Setting this sysctl to 2MB can effectively reduce the amount of memory > in TCP write queues by 66 %, > or allow you to increase tcp_wmem[2] so that only flows needing big > BDP can get it. 
We obtained these settings from our hardware vendors. It might be worth exploring these settings further, but I can’t quite see their connection to high-order allocations, since the high-order allocation size is hard-coded by kernel macros. #define SKB_FRAG_PAGE_ORDER get_order(32768) #define PAGE_FRAG_CACHE_MAX_SIZE __ALIGN_MASK(32768, ~PAGE_MASK) #define PAGE_FRAG_CACHE_MAX_ORDER get_order(PAGE_FRAG_CACHE_MAX_SIZE) Is there anything I’m missing? Thanks Barry ^ permalink raw reply [flat|nested] 36+ messages in thread
* Re: [RFC PATCH] mm: net: disable kswapd for high-order network buffer allocation 2025-10-14 10:19 ` Barry Song @ 2025-10-14 10:39 ` Eric Dumazet 2025-10-14 20:17 ` Barry Song 2025-10-14 14:37 ` Shakeel Butt 1 sibling, 1 reply; 36+ messages in thread From: Eric Dumazet @ 2025-10-14 10:39 UTC (permalink / raw) To: Barry Song Cc: Matthew Wilcox, netdev, linux-mm, linux-doc, linux-kernel, Barry Song, Jonathan Corbet, Kuniyuki Iwashima, Paolo Abeni, Willem de Bruijn, David S. Miller, Jakub Kicinski, Simon Horman, Vlastimil Babka, Suren Baghdasaryan, Michal Hocko, Brendan Jackman, Johannes Weiner, Zi Yan, Yunsheng Lin, Huacai Zhou On Tue, Oct 14, 2025 at 3:19 AM Barry Song <21cnbao@gmail.com> wrote: > > > > > > > > > > > > I think you are missing something to control how much memory can be > > > > pushed on each TCP socket ? > > > > > > > > What is tcp_wmem on your phones ? What about tcp_mem ? > > > > > > > > Have you looked at /proc/sys/net/ipv4/tcp_notsent_lowat > > > > > > # cat /proc/sys/net/ipv4/tcp_wmem > > > 524288 1048576 6710886 > > > > Ouch. That is insane tcp_wmem[0] . > > > > Please stick to 4096, or risk OOM of various sorts. > > > > > > > > # cat /proc/sys/net/ipv4/tcp_notsent_lowat > > > 4294967295 > > > > > > Any thoughts on these settings? > > > > Please look at > > https://www.kernel.org/doc/Documentation/networking/ip-sysctl.txt > > > > tcp_notsent_lowat - UNSIGNED INTEGER > > A TCP socket can control the amount of unsent bytes in its write queue, > > thanks to TCP_NOTSENT_LOWAT socket option. poll()/select()/epoll() > > reports POLLOUT events if the amount of unsent bytes is below a per > > socket value, and if the write queue is not full. sendmsg() will > > also not add new buffers if the limit is hit. > > > > This global variable controls the amount of unsent data for > > sockets not using TCP_NOTSENT_LOWAT. For these sockets, a change > > to the global variable has immediate effect. 
> > > > > > Setting this sysctl to 2MB can effectively reduce the amount of memory > > in TCP write queues by 66 %, > > or allow you to increase tcp_wmem[2] so that only flows needing big > > BDP can get it. > > We obtained these settings from our hardware vendors. Tell them they are wrong. > > It might be worth exploring these settings further, but I can’t quite see > their connection to high-order allocations, since high-order allocations are > kernel macros. > > #define SKB_FRAG_PAGE_ORDER get_order(32768) > #define PAGE_FRAG_CACHE_MAX_SIZE __ALIGN_MASK(32768, ~PAGE_MASK) > #define PAGE_FRAG_CACHE_MAX_ORDER get_order(PAGE_FRAG_CACHE_MAX_SIZE) > > Is there anything I’m missing? What is your question exactly ? You read these macros just fine. What is your point ? We had in the past something dynamic that we removed commit d9b2938aabf757da2d40153489b251d4fc3fdd18 Author: Eric Dumazet <edumazet@google.com> Date: Wed Aug 27 20:49:34 2014 -0700 net: attempt a single high order allocation ^ permalink raw reply [flat|nested] 36+ messages in thread
* Re: [RFC PATCH] mm: net: disable kswapd for high-order network buffer allocation 2025-10-14 10:39 ` Eric Dumazet @ 2025-10-14 20:17 ` Barry Song 2025-10-15 6:39 ` Eric Dumazet 0 siblings, 1 reply; 36+ messages in thread From: Barry Song @ 2025-10-14 20:17 UTC (permalink / raw) To: Eric Dumazet Cc: Matthew Wilcox, netdev, linux-mm, linux-doc, linux-kernel, Barry Song, Jonathan Corbet, Kuniyuki Iwashima, Paolo Abeni, Willem de Bruijn, David S. Miller, Jakub Kicinski, Simon Horman, Vlastimil Babka, Suren Baghdasaryan, Michal Hocko, Brendan Jackman, Johannes Weiner, Zi Yan, Yunsheng Lin, Huacai Zhou On Tue, Oct 14, 2025 at 6:39 PM Eric Dumazet <edumazet@google.com> wrote: > > On Tue, Oct 14, 2025 at 3:19 AM Barry Song <21cnbao@gmail.com> wrote: > > > > > > > > > > > > > > > > I think you are missing something to control how much memory can be > > > > > pushed on each TCP socket ? > > > > > > > > > > What is tcp_wmem on your phones ? What about tcp_mem ? > > > > > > > > > > Have you looked at /proc/sys/net/ipv4/tcp_notsent_lowat > > > > > > > > # cat /proc/sys/net/ipv4/tcp_wmem > > > > 524288 1048576 6710886 > > > > > > Ouch. That is insane tcp_wmem[0] . > > > > > > Please stick to 4096, or risk OOM of various sorts. > > > > > > > > > > > # cat /proc/sys/net/ipv4/tcp_notsent_lowat > > > > 4294967295 > > > > > > > > Any thoughts on these settings? > > > > > > Please look at > > > https://www.kernel.org/doc/Documentation/networking/ip-sysctl.txt > > > > > > tcp_notsent_lowat - UNSIGNED INTEGER > > > A TCP socket can control the amount of unsent bytes in its write queue, > > > thanks to TCP_NOTSENT_LOWAT socket option. poll()/select()/epoll() > > > reports POLLOUT events if the amount of unsent bytes is below a per > > > socket value, and if the write queue is not full. sendmsg() will > > > also not add new buffers if the limit is hit. > > > > > > This global variable controls the amount of unsent data for > > > sockets not using TCP_NOTSENT_LOWAT. 
For these sockets, a change > > > to the global variable has immediate effect. > > > > > > > > > Setting this sysctl to 2MB can effectively reduce the amount of memory > > > in TCP write queues by 66 %, > > > or allow you to increase tcp_wmem[2] so that only flows needing big > > > BDP can get it. > > > > We obtained these settings from our hardware vendors. > > Tell them they are wrong. Well, we checked Qualcomm and MTK, and it seems both set these values relatively high. In other words, all the AOSP products we examined also use high values for these settings. Nobody is using tcp_wmem[0]=4096. We’ll need some time to understand why these are configured this way in AOSP hardware. > > > > > It might be worth exploring these settings further, but I can’t quite see > > their connection to high-order allocations, since high-order allocations are > > kernel macros. > > > > #define SKB_FRAG_PAGE_ORDER get_order(32768) > > #define PAGE_FRAG_CACHE_MAX_SIZE __ALIGN_MASK(32768, ~PAGE_MASK) > > #define PAGE_FRAG_CACHE_MAX_ORDER get_order(PAGE_FRAG_CACHE_MAX_SIZE) > > > > Is there anything I’m missing? > > What is your question exactly ? You read these macros just fine. What > is your point ? My question is whether these settings influence how often high-order allocations occur. In other words, would lowering these values make high-order allocations less frequent? If so, why? I’m not a network expert, apologies if the question sounds naive. > > We had in the past something dynamic that we removed > > commit d9b2938aabf757da2d40153489b251d4fc3fdd18 > Author: Eric Dumazet <edumazet@google.com> > Date: Wed Aug 27 20:49:34 2014 -0700 > > net: attempt a single high order allocation Thanks Barry ^ permalink raw reply [flat|nested] 36+ messages in thread
* Re: [RFC PATCH] mm: net: disable kswapd for high-order network buffer allocation 2025-10-14 20:17 ` Barry Song @ 2025-10-15 6:39 ` Eric Dumazet 2025-10-15 7:35 ` Barry Song 0 siblings, 1 reply; 36+ messages in thread From: Eric Dumazet @ 2025-10-15 6:39 UTC (permalink / raw) To: Barry Song Cc: Matthew Wilcox, netdev, linux-mm, linux-doc, linux-kernel, Barry Song, Jonathan Corbet, Kuniyuki Iwashima, Paolo Abeni, Willem de Bruijn, David S. Miller, Jakub Kicinski, Simon Horman, Vlastimil Babka, Suren Baghdasaryan, Michal Hocko, Brendan Jackman, Johannes Weiner, Zi Yan, Yunsheng Lin, Huacai Zhou On Tue, Oct 14, 2025 at 1:17 PM Barry Song <21cnbao@gmail.com> wrote: > > On Tue, Oct 14, 2025 at 6:39 PM Eric Dumazet <edumazet@google.com> wrote: > > > > On Tue, Oct 14, 2025 at 3:19 AM Barry Song <21cnbao@gmail.com> wrote: > > > > > > > > > > > > > > > > > > > > I think you are missing something to control how much memory can be > > > > > > pushed on each TCP socket ? > > > > > > > > > > > > What is tcp_wmem on your phones ? What about tcp_mem ? > > > > > > > > > > > > Have you looked at /proc/sys/net/ipv4/tcp_notsent_lowat > > > > > > > > > > # cat /proc/sys/net/ipv4/tcp_wmem > > > > > 524288 1048576 6710886 > > > > > > > > Ouch. That is insane tcp_wmem[0] . > > > > > > > > Please stick to 4096, or risk OOM of various sorts. > > > > > > > > > > > > > > # cat /proc/sys/net/ipv4/tcp_notsent_lowat > > > > > 4294967295 > > > > > > > > > > Any thoughts on these settings? > > > > > > > > Please look at > > > > https://www.kernel.org/doc/Documentation/networking/ip-sysctl.txt > > > > > > > > tcp_notsent_lowat - UNSIGNED INTEGER > > > > A TCP socket can control the amount of unsent bytes in its write queue, > > > > thanks to TCP_NOTSENT_LOWAT socket option. poll()/select()/epoll() > > > > reports POLLOUT events if the amount of unsent bytes is below a per > > > > socket value, and if the write queue is not full. sendmsg() will > > > > also not add new buffers if the limit is hit. 
> > > > > > > > This global variable controls the amount of unsent data for > > > > sockets not using TCP_NOTSENT_LOWAT. For these sockets, a change > > > > to the global variable has immediate effect. > > > > > > > > > > > > Setting this sysctl to 2MB can effectively reduce the amount of memory > > > > in TCP write queues by 66 %, > > > > or allow you to increase tcp_wmem[2] so that only flows needing big > > > > BDP can get it. > > > > > > We obtained these settings from our hardware vendors. > > > > Tell them they are wrong. > > Well, we checked Qualcomm and MTK, and it seems both set these values > relatively high. In other words, all the AOSP products we examined also > use high values for these settings. Nobody is using tcp_wmem[0]=4096. > The (fine and safe) default should be PAGE_SIZE. Perhaps they are dealing with systems with PAGE_SIZE=65536, but then the skb_page_frag_refill() would be a non issue there, because it would only allocate order-0 pages. > We’ll need some time to understand why these are configured this way in > AOSP hardware. > > > > > > > > > It might be worth exploring these settings further, but I can’t quite see > > > their connection to high-order allocations, since high-order allocations are > > > kernel macros. > > > > > > #define SKB_FRAG_PAGE_ORDER get_order(32768) > > > #define PAGE_FRAG_CACHE_MAX_SIZE __ALIGN_MASK(32768, ~PAGE_MASK) > > > #define PAGE_FRAG_CACHE_MAX_ORDER get_order(PAGE_FRAG_CACHE_MAX_SIZE) > > > > > > Is there anything I’m missing? > > > > What is your question exactly ? You read these macros just fine. What > > is your point ? > > My question is whether these settings influence how often high-order > allocations occur. In other words, would lowering these values make > high-order allocations less frequent? If so, why? Because almost all of the buffers stored in TCP write queues are using order-3 pages on arches with 4K pages. 
I am a bit confused because you posted a patch changing skb_page_frag_refill()
without realizing its first user is TCP.

Look for sk_page_frag_refill() in tcp_sendmsg_locked()

> I’m not a network expert, apologies if the question sounds naive.
>
> >
> > We had in the past something dynamic that we removed
> >
> > commit d9b2938aabf757da2d40153489b251d4fc3fdd18
> > Author: Eric Dumazet <edumazet@google.com>
> > Date: Wed Aug 27 20:49:34 2014 -0700
> >
> > net: attempt a single high order allocation
>
> Thanks
> Barry

^ permalink raw reply [flat|nested] 36+ messages in thread
* Re: [RFC PATCH] mm: net: disable kswapd for high-order network buffer allocation 2025-10-15 6:39 ` Eric Dumazet @ 2025-10-15 7:35 ` Barry Song 2025-10-15 16:39 ` Suren Baghdasaryan 0 siblings, 1 reply; 36+ messages in thread From: Barry Song @ 2025-10-15 7:35 UTC (permalink / raw) To: Eric Dumazet Cc: Matthew Wilcox, netdev, linux-mm, linux-doc, linux-kernel, Barry Song, Jonathan Corbet, Kuniyuki Iwashima, Paolo Abeni, Willem de Bruijn, David S. Miller, Jakub Kicinski, Simon Horman, Vlastimil Babka, Suren Baghdasaryan, Michal Hocko, Brendan Jackman, Johannes Weiner, Zi Yan, Yunsheng Lin, Huacai Zhou On Wed, Oct 15, 2025 at 2:39 PM Eric Dumazet <edumazet@google.com> wrote: > > > > > > Tell them they are wrong. > > > > Well, we checked Qualcomm and MTK, and it seems both set these values > > relatively high. In other words, all the AOSP products we examined also > > use high values for these settings. Nobody is using tcp_wmem[0]=4096. > > > > The (fine and safe) default should be PAGE_SIZE. > > Perhaps they are dealing with systems with PAGE_SIZE=65536, but then > the skb_page_frag_refill() would be a non issue there, because it would > only allocate order-0 pages. I am 100% sure that all of them handle PAGE_SIZE=4096. Google is working on 16KB page size for Android, but it is not ready yet(Please correct me if 16KB has been ready, Suren). > > > We’ll need some time to understand why these are configured this way in > > AOSP hardware. > > > > > > > > > > > > > It might be worth exploring these settings further, but I can’t quite see > > > > their connection to high-order allocations, since high-order allocations are > > > > kernel macros. > > > > > > > > #define SKB_FRAG_PAGE_ORDER get_order(32768) > > > > #define PAGE_FRAG_CACHE_MAX_SIZE __ALIGN_MASK(32768, ~PAGE_MASK) > > > > #define PAGE_FRAG_CACHE_MAX_ORDER get_order(PAGE_FRAG_CACHE_MAX_SIZE) > > > > > > > > Is there anything I’m missing? > > > > > > What is your question exactly ? 
You read these macros just fine. What > > > is your point ? > > > > My question is whether these settings influence how often high-order > > allocations occur. In other words, would lowering these values make > > high-order allocations less frequent? If so, why? > > Because almost all of the buffers stored in TCP write queues are using > order-3 pages > on arches with 4K pages. > > I am a bit confused because you posted a patch changing skb_page_frag_refill() > without realizing its first user is TCP. > > Look for sk_page_frag_refill() in tcp_sendmsg_locked() Sure. Let me review the code further. The problem was observed on the MM side, causing over-reclamation and phone heating, while the source of the allocations lies in network activity. I am not a network expert and may be missing many network details, so I am raising this RFC to both lists to see if the network and MM folks can discuss together to find a solution. As you can see, the discussion has absolutely forked into two branches. :-) Thanks Barry ^ permalink raw reply [flat|nested] 36+ messages in thread
* Re: [RFC PATCH] mm: net: disable kswapd for high-order network buffer allocation 2025-10-15 7:35 ` Barry Song @ 2025-10-15 16:39 ` Suren Baghdasaryan 0 siblings, 0 replies; 36+ messages in thread From: Suren Baghdasaryan @ 2025-10-15 16:39 UTC (permalink / raw) To: Barry Song Cc: Eric Dumazet, Matthew Wilcox, netdev, linux-mm, linux-doc, linux-kernel, Barry Song, Jonathan Corbet, Kuniyuki Iwashima, Paolo Abeni, Willem de Bruijn, David S. Miller, Jakub Kicinski, Simon Horman, Vlastimil Babka, Michal Hocko, Brendan Jackman, Johannes Weiner, Zi Yan, Yunsheng Lin, Huacai Zhou On Wed, Oct 15, 2025 at 12:35 AM Barry Song <21cnbao@gmail.com> wrote: > > On Wed, Oct 15, 2025 at 2:39 PM Eric Dumazet <edumazet@google.com> wrote: > > > > > > > > > Tell them they are wrong. > > > > > > Well, we checked Qualcomm and MTK, and it seems both set these values > > > relatively high. In other words, all the AOSP products we examined also > > > use high values for these settings. Nobody is using tcp_wmem[0]=4096. > > > > > > > The (fine and safe) default should be PAGE_SIZE. > > > > Perhaps they are dealing with systems with PAGE_SIZE=65536, but then > > the skb_page_frag_refill() would be a non issue there, because it would > > only allocate order-0 pages. > > I am 100% sure that all of them handle PAGE_SIZE=4096. Google is working on > 16KB page size for Android, but it is not ready yet(Please correct me > if 16KB has been > ready, Suren). It is ready but it is new, so it will take some time before we see it in production devices. > > > > > > We’ll need some time to understand why these are configured this way in > > > AOSP hardware. > > > > > > > > > > > > > > > > > It might be worth exploring these settings further, but I can’t quite see > > > > > their connection to high-order allocations, since high-order allocations are > > > > > kernel macros. 
> > > > > > > > > > #define SKB_FRAG_PAGE_ORDER get_order(32768) > > > > > #define PAGE_FRAG_CACHE_MAX_SIZE __ALIGN_MASK(32768, ~PAGE_MASK) > > > > > #define PAGE_FRAG_CACHE_MAX_ORDER get_order(PAGE_FRAG_CACHE_MAX_SIZE) > > > > > > > > > > Is there anything I’m missing? > > > > > > > > What is your question exactly ? You read these macros just fine. What > > > > is your point ? > > > > > > My question is whether these settings influence how often high-order > > > allocations occur. In other words, would lowering these values make > > > high-order allocations less frequent? If so, why? > > > > Because almost all of the buffers stored in TCP write queues are using > > order-3 pages > > on arches with 4K pages. > > > > I am a bit confused because you posted a patch changing skb_page_frag_refill() > > without realizing its first user is TCP. > > > > Look for sk_page_frag_refill() in tcp_sendmsg_locked() > > Sure. Let me review the code further. The problem was observed on the MM > side, causing over-reclamation and phone heating, while the source of the > allocations lies in network activity. I am not a network expert and may be > missing many network details, so I am raising this RFC to both lists to see > if the network and MM folks can discuss together to find a solution. > > As you can see, the discussion has absolutely forked into two branches. :-) > > Thanks > Barry ^ permalink raw reply [flat|nested] 36+ messages in thread
* Re: [RFC PATCH] mm: net: disable kswapd for high-order network buffer allocation 2025-10-14 10:19 ` Barry Song 2025-10-14 10:39 ` Eric Dumazet @ 2025-10-14 14:37 ` Shakeel Butt 2025-10-14 20:28 ` Barry Song 1 sibling, 1 reply; 36+ messages in thread From: Shakeel Butt @ 2025-10-14 14:37 UTC (permalink / raw) To: Barry Song Cc: Eric Dumazet, Matthew Wilcox, netdev, linux-mm, linux-doc, linux-kernel, Barry Song, Jonathan Corbet, Kuniyuki Iwashima, Paolo Abeni, Willem de Bruijn, David S. Miller, Jakub Kicinski, Simon Horman, Vlastimil Babka, Suren Baghdasaryan, Michal Hocko, Brendan Jackman, Johannes Weiner, Zi Yan, Yunsheng Lin, Huacai Zhou On Tue, Oct 14, 2025 at 06:19:05PM +0800, Barry Song wrote: > > > > > > > > > > > I think you are missing something to control how much memory can be > > > > pushed on each TCP socket ? > > > > > > > > What is tcp_wmem on your phones ? What about tcp_mem ? > > > > > > > > Have you looked at /proc/sys/net/ipv4/tcp_notsent_lowat > > > > > > # cat /proc/sys/net/ipv4/tcp_wmem > > > 524288 1048576 6710886 > > > > Ouch. That is insane tcp_wmem[0] . > > > > Please stick to 4096, or risk OOM of various sorts. > > > > > > > > # cat /proc/sys/net/ipv4/tcp_notsent_lowat > > > 4294967295 > > > > > > Any thoughts on these settings? > > > > Please look at > > https://www.kernel.org/doc/Documentation/networking/ip-sysctl.txt > > > > tcp_notsent_lowat - UNSIGNED INTEGER > > A TCP socket can control the amount of unsent bytes in its write queue, > > thanks to TCP_NOTSENT_LOWAT socket option. poll()/select()/epoll() > > reports POLLOUT events if the amount of unsent bytes is below a per > > socket value, and if the write queue is not full. sendmsg() will > > also not add new buffers if the limit is hit. > > > > This global variable controls the amount of unsent data for > > sockets not using TCP_NOTSENT_LOWAT. For these sockets, a change > > to the global variable has immediate effect. 
> >
> > Setting this sysctl to 2MB can effectively reduce the amount of memory
> > in TCP write queues by 66%, or allow you to increase tcp_wmem[2] so that
> > only flows needing big BDP can get it.
>
> We obtained these settings from our hardware vendors.
>
> It might be worth exploring these settings further, but I can’t quite see
> their connection to high-order allocations,

I don't think there is a connection between them. Is there a reason you
are expecting a connection/relation between them?

^ permalink raw reply [flat|nested] 36+ messages in thread
* Re: [RFC PATCH] mm: net: disable kswapd for high-order network buffer allocation
  2025-10-14 14:37 ` Shakeel Butt
@ 2025-10-14 20:28 ` Barry Song
  2025-10-15 18:13 ` Shakeel Butt
  0 siblings, 1 reply; 36+ messages in thread

From: Barry Song @ 2025-10-14 20:28 UTC (permalink / raw)
To: Shakeel Butt
Cc: Eric Dumazet, Matthew Wilcox, netdev, linux-mm, linux-doc,
    linux-kernel, Barry Song, Jonathan Corbet, Kuniyuki Iwashima,
    Paolo Abeni, Willem de Bruijn, David S. Miller, Jakub Kicinski,
    Simon Horman, Vlastimil Babka, Suren Baghdasaryan, Michal Hocko,
    Brendan Jackman, Johannes Weiner, Zi Yan, Yunsheng Lin, Huacai Zhou

On Tue, Oct 14, 2025 at 10:38 PM Shakeel Butt <shakeel.butt@linux.dev> wrote:
>
> > It might be worth exploring these settings further, but I can’t quite see
> > their connection to high-order allocations,
>
> I don't think there is a connection between them. Is there a reason you
> are expecting a connection/relation between them?

Eric replied to my email about frequent high-order allocation requests,
suggesting that I might be missing some proper configurations for these
settings[1]. So I’m trying to understand whether these configurations
affect the frequency of high-order allocations.

[1] https://lore.kernel.org/linux-mm/pow5zt7dmo2wiydophoap6ntaycyjt2yrszo3ue7mg2hgnzcmv@oi3epbtyoufn/T/#m9b94a1c60452551496738e4e15235329f860d1f9

Thanks
Barry

^ permalink raw reply [flat|nested] 36+ messages in thread
* Re: [RFC PATCH] mm: net: disable kswapd for high-order network buffer allocation
  2025-10-14 20:28 ` Barry Song
@ 2025-10-15 18:13 ` Shakeel Butt
  0 siblings, 0 replies; 36+ messages in thread

From: Shakeel Butt @ 2025-10-15 18:13 UTC (permalink / raw)
To: Barry Song
Cc: Eric Dumazet, Matthew Wilcox, netdev, linux-mm, linux-doc,
    linux-kernel, Barry Song, Jonathan Corbet, Kuniyuki Iwashima,
    Paolo Abeni, Willem de Bruijn, David S. Miller, Jakub Kicinski,
    Simon Horman, Vlastimil Babka, Suren Baghdasaryan, Michal Hocko,
    Brendan Jackman, Johannes Weiner, Zi Yan, Yunsheng Lin, Huacai Zhou

On Wed, Oct 15, 2025 at 04:28:17AM +0800, Barry Song wrote:
> On Tue, Oct 14, 2025 at 10:38 PM Shakeel Butt <shakeel.butt@linux.dev> wrote:
> >
> > > It might be worth exploring these settings further, but I can’t quite see
> > > their connection to high-order allocations,
> >
> > I don't think there is a connection between them. Is there a reason you
> > are expecting a connection/relation between them?
>
> Eric replied to my email about frequent high-order allocation requests,
> suggesting that I might be missing some proper configurations for these
> settings[1]. So I’m trying to understand whether these configurations affect
> the frequency of high-order allocations.

If I understand Eric correctly, those configurations do indirectly
affect the number of memory allocations and their lifetime (irrespective
of order). In one scenario, setting tcp_wmem[0] higher allows the kernel
to allocate more memory even when the system is under memory pressure.
See tcp_wmem_schedule(). In your case it would be up to 0.5MiB per
socket.

Have you tested the configuration values suggested by Eric?

^ permalink raw reply [flat|nested] 36+ messages in thread
end of thread, other threads: [~2025-10-15 18:26 UTC | newest]

Thread overview: 36+ messages

2025-10-13 10:16 [RFC PATCH] mm: net: disable kswapd for high-order network buffer allocation Barry Song
2025-10-13 18:30 ` Vlastimil Babka
2025-10-13 21:35 ` Shakeel Butt
2025-10-13 21:53 ` Alexei Starovoitov
2025-10-13 22:25 ` Shakeel Butt
2025-10-13 22:46 ` Roman Gushchin
2025-10-14  4:31 ` Barry Song
2025-10-14  7:24 ` Michal Hocko
2025-10-14  7:26 ` Michal Hocko
2025-10-14  8:08 ` Barry Song
2025-10-14 14:27 ` Shakeel Butt
2025-10-14 15:14 ` Michal Hocko
2025-10-14 17:22 ` Shakeel Butt
2025-10-15  6:21 ` Michal Hocko
2025-10-15 18:26 ` Shakeel Butt
2025-10-13 18:53 ` Eric Dumazet
2025-10-14  3:58 ` Barry Song
2025-10-14  5:07 ` Eric Dumazet
2025-10-14  6:43 ` Barry Song
2025-10-14  7:01 ` Eric Dumazet
2025-10-14  8:17 ` Barry Song
2025-10-14  8:25 ` Eric Dumazet
2025-10-13 21:56 ` Matthew Wilcox
2025-10-14  4:09 ` Barry Song
2025-10-14  5:04 ` Eric Dumazet
2025-10-14  8:58 ` Barry Song
2025-10-14  9:49 ` Eric Dumazet
2025-10-14 10:19 ` Barry Song
2025-10-14 10:39 ` Eric Dumazet
2025-10-14 20:17 ` Barry Song
2025-10-15  6:39 ` Eric Dumazet
2025-10-15  7:35 ` Barry Song
2025-10-15 16:39 ` Suren Baghdasaryan
2025-10-14 14:37 ` Shakeel Butt
2025-10-14 20:28 ` Barry Song
2025-10-15 18:13 ` Shakeel Butt