linux-mm.kvack.org archive mirror
* [RFC PATCH] mm: net: disable kswapd for high-order network buffer allocation
@ 2025-10-13 10:16 Barry Song
  2025-10-13 18:30 ` Vlastimil Babka
                   ` (2 more replies)
  0 siblings, 3 replies; 36+ messages in thread
From: Barry Song @ 2025-10-13 10:16 UTC (permalink / raw)
  To: netdev, linux-mm, linux-doc
  Cc: linux-kernel, Barry Song, Jonathan Corbet, Eric Dumazet,
	Kuniyuki Iwashima, Paolo Abeni, Willem de Bruijn,
	David S. Miller, Jakub Kicinski, Simon Horman, Vlastimil Babka,
	Suren Baghdasaryan, Michal Hocko, Brendan Jackman,
	Johannes Weiner, Zi Yan, Yunsheng Lin, Huacai Zhou

From: Barry Song <v-songbaohua@oppo.com>

On phones, we have observed significant phone heating when running apps
with high network bandwidth. This is caused by the network stack frequently
waking kswapd for order-3 allocations. As a result, memory reclamation becomes
constantly active, even though plenty of memory is still available for network
allocations which can fall back to order-0.

Commit ce27ec60648d ("net: add high_order_alloc_disable sysctl/static key")
introduced high_order_alloc_disable for the transmit (TX) path
(skb_page_frag_refill()) to mitigate some memory reclamation issues,
allowing the TX path to fall back to order-0 immediately, while leaving the
receive (RX) path (__page_frag_cache_refill()) unaffected. Users are
generally unaware of the sysctl and cannot easily adjust it for specific use
cases. Enabling high_order_alloc_disable also completely disables the
benefit of order-3 allocations. Additionally, the sysctl does not apply to the
RX path.

An alternative approach is to disable kswapd for these frequent
allocations and provide best-effort order-3 service for both TX and RX paths,
while removing the sysctl entirely.
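
For reference, SKB_FRAG_PAGE_ORDER is get_order(32768), i.e. order-3 with
4 KiB pages, and the change below clears __GFP_RECLAIM rather than only
__GFP_DIRECT_RECLAIM. In recent kernels __GFP_RECLAIM is the union of the
two reclaim bits (include/linux/gfp_types.h):

	#define __GFP_DIRECT_RECLAIM	((__force gfp_t)___GFP_DIRECT_RECLAIM) /* Caller can reclaim */
	#define __GFP_KSWAPD_RECLAIM	((__force gfp_t)___GFP_KSWAPD_RECLAIM) /* kswapd can wake */
	#define __GFP_RECLAIM ((__force gfp_t)(___GFP_DIRECT_RECLAIM|___GFP_KSWAPD_RECLAIM))

so the allocation neither enters direct reclaim nor wakes kswapd, and
simply falls back to order-0 when no order-3 page is readily available.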

Cc: Jonathan Corbet <corbet@lwn.net>
Cc: Eric Dumazet <edumazet@google.com>
Cc: Kuniyuki Iwashima <kuniyu@google.com>
Cc: Paolo Abeni <pabeni@redhat.com>
Cc: Willem de Bruijn <willemb@google.com>
Cc: "David S. Miller" <davem@davemloft.net>
Cc: Jakub Kicinski <kuba@kernel.org>
Cc: Simon Horman <horms@kernel.org>
Cc: Vlastimil Babka <vbabka@suse.cz>
Cc: Suren Baghdasaryan <surenb@google.com>
Cc: Michal Hocko <mhocko@suse.com>
Cc: Brendan Jackman <jackmanb@google.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Zi Yan <ziy@nvidia.com>
Cc: Yunsheng Lin <linyunsheng@huawei.com>
Cc: Huacai Zhou <zhouhuacai@oppo.com>
Signed-off-by: Barry Song <v-songbaohua@oppo.com>
---
 Documentation/admin-guide/sysctl/net.rst | 12 ------------
 include/net/sock.h                       |  1 -
 mm/page_frag_cache.c                     |  2 +-
 net/core/sock.c                          |  8 ++------
 net/core/sysctl_net_core.c               |  7 -------
 5 files changed, 3 insertions(+), 27 deletions(-)

diff --git a/Documentation/admin-guide/sysctl/net.rst b/Documentation/admin-guide/sysctl/net.rst
index 2ef50828aff1..b903bbae239c 100644
--- a/Documentation/admin-guide/sysctl/net.rst
+++ b/Documentation/admin-guide/sysctl/net.rst
@@ -415,18 +415,6 @@ GRO has decided not to coalesce, it is placed on a per-NAPI list. This
 list is then passed to the stack when the number of segments reaches the
 gro_normal_batch limit.
 
-high_order_alloc_disable
-------------------------
-
-By default the allocator for page frags tries to use high order pages (order-3
-on x86). While the default behavior gives good results in most cases, some users
-might have hit a contention in page allocations/freeing. This was especially
-true on older kernels (< 5.14) when high-order pages were not stored on per-cpu
-lists. This allows to opt-in for order-0 allocation instead but is now mostly of
-historical importance.
-
-Default: 0
-
 2. /proc/sys/net/unix - Parameters for Unix domain sockets
 ----------------------------------------------------------
 
diff --git a/include/net/sock.h b/include/net/sock.h
index 60bcb13f045c..62306c1095d5 100644
--- a/include/net/sock.h
+++ b/include/net/sock.h
@@ -3011,7 +3011,6 @@ extern __u32 sysctl_wmem_default;
 extern __u32 sysctl_rmem_default;
 
 #define SKB_FRAG_PAGE_ORDER	get_order(32768)
-DECLARE_STATIC_KEY_FALSE(net_high_order_alloc_disable_key);
 
 static inline int sk_get_wmem0(const struct sock *sk, const struct proto *proto)
 {
diff --git a/mm/page_frag_cache.c b/mm/page_frag_cache.c
index d2423f30577e..dd36114dd16f 100644
--- a/mm/page_frag_cache.c
+++ b/mm/page_frag_cache.c
@@ -54,7 +54,7 @@ static struct page *__page_frag_cache_refill(struct page_frag_cache *nc,
 	gfp_t gfp = gfp_mask;
 
 #if (PAGE_SIZE < PAGE_FRAG_CACHE_MAX_SIZE)
-	gfp_mask = (gfp_mask & ~__GFP_DIRECT_RECLAIM) |  __GFP_COMP |
+	gfp_mask = (gfp_mask & ~__GFP_RECLAIM) |  __GFP_COMP |
 		   __GFP_NOWARN | __GFP_NORETRY | __GFP_NOMEMALLOC;
 	page = __alloc_pages(gfp_mask, PAGE_FRAG_CACHE_MAX_ORDER,
 			     numa_mem_id(), NULL);
diff --git a/net/core/sock.c b/net/core/sock.c
index dc03d4b5909a..1fa1e9177d86 100644
--- a/net/core/sock.c
+++ b/net/core/sock.c
@@ -3085,8 +3085,6 @@ static void sk_leave_memory_pressure(struct sock *sk)
 	}
 }
 
-DEFINE_STATIC_KEY_FALSE(net_high_order_alloc_disable_key);
-
 /**
  * skb_page_frag_refill - check that a page_frag contains enough room
  * @sz: minimum size of the fragment we want to get
@@ -3110,10 +3108,8 @@ bool skb_page_frag_refill(unsigned int sz, struct page_frag *pfrag, gfp_t gfp)
 	}
 
 	pfrag->offset = 0;
-	if (SKB_FRAG_PAGE_ORDER &&
-	    !static_branch_unlikely(&net_high_order_alloc_disable_key)) {
-		/* Avoid direct reclaim but allow kswapd to wake */
-		pfrag->page = alloc_pages((gfp & ~__GFP_DIRECT_RECLAIM) |
+	if (SKB_FRAG_PAGE_ORDER) {
+		pfrag->page = alloc_pages((gfp & ~__GFP_RECLAIM) |
 					  __GFP_COMP | __GFP_NOWARN |
 					  __GFP_NORETRY,
 					  SKB_FRAG_PAGE_ORDER);
diff --git a/net/core/sysctl_net_core.c b/net/core/sysctl_net_core.c
index 8cf04b57ade1..181f6532beb8 100644
--- a/net/core/sysctl_net_core.c
+++ b/net/core/sysctl_net_core.c
@@ -599,13 +599,6 @@ static struct ctl_table net_core_table[] = {
 		.extra1		= SYSCTL_ZERO,
 		.extra2		= SYSCTL_THREE,
 	},
-	{
-		.procname	= "high_order_alloc_disable",
-		.data		= &net_high_order_alloc_disable_key.key,
-		.maxlen         = sizeof(net_high_order_alloc_disable_key),
-		.mode		= 0644,
-		.proc_handler	= proc_do_static_key,
-	},
 	{
 		.procname	= "gro_normal_batch",
 		.data		= &net_hotdata.gro_normal_batch,
-- 
2.39.3 (Apple Git-146)




* Re: [RFC PATCH] mm: net: disable kswapd for high-order network buffer allocation
  2025-10-13 10:16 [RFC PATCH] mm: net: disable kswapd for high-order network buffer allocation Barry Song
@ 2025-10-13 18:30 ` Vlastimil Babka
  2025-10-13 21:35   ` Shakeel Butt
                     ` (2 more replies)
  2025-10-13 18:53 ` Eric Dumazet
  2025-10-13 21:56 ` Matthew Wilcox
  2 siblings, 3 replies; 36+ messages in thread
From: Vlastimil Babka @ 2025-10-13 18:30 UTC (permalink / raw)
  To: Barry Song, netdev, linux-mm, linux-doc
  Cc: linux-kernel, Barry Song, Jonathan Corbet, Eric Dumazet,
	Kuniyuki Iwashima, Paolo Abeni, Willem de Bruijn,
	David S. Miller, Jakub Kicinski, Simon Horman,
	Suren Baghdasaryan, Michal Hocko, Brendan Jackman,
	Johannes Weiner, Zi Yan, Yunsheng Lin, Huacai Zhou,
	Alexei Starovoitov, Harry Yoo, David Hildenbrand, Matthew Wilcox,
	Roman Gushchin

On 10/13/25 12:16, Barry Song wrote:
> From: Barry Song <v-songbaohua@oppo.com>
> 
> On phones, we have observed significant phone heating when running apps
> with high network bandwidth. This is caused by the network stack frequently
> waking kswapd for order-3 allocations. As a result, memory reclamation becomes
> constantly active, even though plenty of memory is still available for network
> allocations which can fall back to order-0.
> 
> Commit ce27ec60648d ("net: add high_order_alloc_disable sysctl/static key")
> introduced high_order_alloc_disable for the transmit (TX) path
> (skb_page_frag_refill()) to mitigate some memory reclamation issues,
> allowing the TX path to fall back to order-0 immediately, while leaving the
> receive (RX) path (__page_frag_cache_refill()) unaffected. Users are
> generally unaware of the sysctl and cannot easily adjust it for specific use
> cases. Enabling high_order_alloc_disable also completely disables the
> benefit of order-3 allocations. Additionally, the sysctl does not apply to the
> RX path.
> 
> An alternative approach is to disable kswapd for these frequent
> allocations and provide best-effort order-3 service for both TX and RX paths,
> while removing the sysctl entirely.
> 
> Cc: Jonathan Corbet <corbet@lwn.net>
> Cc: Eric Dumazet <edumazet@google.com>
> Cc: Kuniyuki Iwashima <kuniyu@google.com>
> Cc: Paolo Abeni <pabeni@redhat.com>
> Cc: Willem de Bruijn <willemb@google.com>
> Cc: "David S. Miller" <davem@davemloft.net>
> Cc: Jakub Kicinski <kuba@kernel.org>
> Cc: Simon Horman <horms@kernel.org>
> Cc: Vlastimil Babka <vbabka@suse.cz>
> Cc: Suren Baghdasaryan <surenb@google.com>
> Cc: Michal Hocko <mhocko@suse.com>
> Cc: Brendan Jackman <jackmanb@google.com>
> Cc: Johannes Weiner <hannes@cmpxchg.org>
> Cc: Zi Yan <ziy@nvidia.com>
> Cc: Yunsheng Lin <linyunsheng@huawei.com>
> Cc: Huacai Zhou <zhouhuacai@oppo.com>
> Signed-off-by: Barry Song <v-songbaohua@oppo.com>
> ---
>  Documentation/admin-guide/sysctl/net.rst | 12 ------------
>  include/net/sock.h                       |  1 -
>  mm/page_frag_cache.c                     |  2 +-
>  net/core/sock.c                          |  8 ++------
>  net/core/sysctl_net_core.c               |  7 -------
>  5 files changed, 3 insertions(+), 27 deletions(-)
> 
> diff --git a/Documentation/admin-guide/sysctl/net.rst b/Documentation/admin-guide/sysctl/net.rst
> index 2ef50828aff1..b903bbae239c 100644
> --- a/Documentation/admin-guide/sysctl/net.rst
> +++ b/Documentation/admin-guide/sysctl/net.rst
> @@ -415,18 +415,6 @@ GRO has decided not to coalesce, it is placed on a per-NAPI list. This
>  list is then passed to the stack when the number of segments reaches the
>  gro_normal_batch limit.
>  
> -high_order_alloc_disable
> -------------------------
> -
> -By default the allocator for page frags tries to use high order pages (order-3
> -on x86). While the default behavior gives good results in most cases, some users
> -might have hit a contention in page allocations/freeing. This was especially
> -true on older kernels (< 5.14) when high-order pages were not stored on per-cpu
> -lists. This allows to opt-in for order-0 allocation instead but is now mostly of
> -historical importance.
> -
> -Default: 0
> -
>  2. /proc/sys/net/unix - Parameters for Unix domain sockets
>  ----------------------------------------------------------
>  
> diff --git a/include/net/sock.h b/include/net/sock.h
> index 60bcb13f045c..62306c1095d5 100644
> --- a/include/net/sock.h
> +++ b/include/net/sock.h
> @@ -3011,7 +3011,6 @@ extern __u32 sysctl_wmem_default;
>  extern __u32 sysctl_rmem_default;
>  
>  #define SKB_FRAG_PAGE_ORDER	get_order(32768)
> -DECLARE_STATIC_KEY_FALSE(net_high_order_alloc_disable_key);
>  
>  static inline int sk_get_wmem0(const struct sock *sk, const struct proto *proto)
>  {
> diff --git a/mm/page_frag_cache.c b/mm/page_frag_cache.c
> index d2423f30577e..dd36114dd16f 100644
> --- a/mm/page_frag_cache.c
> +++ b/mm/page_frag_cache.c
> @@ -54,7 +54,7 @@ static struct page *__page_frag_cache_refill(struct page_frag_cache *nc,
>  	gfp_t gfp = gfp_mask;
>  
>  #if (PAGE_SIZE < PAGE_FRAG_CACHE_MAX_SIZE)
> -	gfp_mask = (gfp_mask & ~__GFP_DIRECT_RECLAIM) |  __GFP_COMP |
> +	gfp_mask = (gfp_mask & ~__GFP_RECLAIM) |  __GFP_COMP |
>  		   __GFP_NOWARN | __GFP_NORETRY | __GFP_NOMEMALLOC;

I'm a bit worried about proliferating "~__GFP_RECLAIM" allocations now that
we introduced alloc_pages_nolock() and kmalloc_nolock() where it's
interpreted as "cannot spin" - see gfpflags_allow_spinning(). Currently it's
fine for the page allocator itself where we have a different entry point
that uses ALLOC_TRYLOCK, but it can affect nested allocations of all kinds
of debugging and accounting metadata (page_owner, memcg, alloc tags for slab
objects etc). kmalloc_nolock() relies on gfpflags_allow_spinning() fully.
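
For context, a sketch of that helper as introduced alongside the nolock
allocators (comment wording and exact form may differ between trees):

	static inline bool gfpflags_allow_spinning(const gfp_t gfp_flags)
	{
		/*
		 * !__GFP_DIRECT_RECLAIM -> direct reclaim is not allowed.
		 * !__GFP_KSWAPD_RECLAIM -> it is not safe to wake up kswapd.
		 * All GFP_* flags, including GFP_NOWAIT, set one or both bits;
		 * only the *_nolock() entry points pass neither, so clearing
		 * __GFP_RECLAIM makes an allocation look like one that must
		 * not spin on any locks.
		 */
		return !!(gfp_flags & __GFP_RECLAIM);
	}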

I wonder if we should either:

1) sacrifice a new __GFP flag specifically for "!allow_spin" case to
determine it precisely.

2) keep __GFP_KSWAPD_RECLAIM for allocations that remove it for purposes of
not being disturbing (like proposed here), but that can in fact allow
spinning. Instead, decide not to wake up kswapd for those when other
information indicates it's an opportunistic allocation
(~__GFP_DIRECT_RECLAIM, __GFP_NOWARN | __GFP_NORETRY | __GFP_NOMEMALLOC,
order > 0...); a rough sketch follows below.

3) something better?
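
A rough sketch of what 2) could look like, purely illustrative (the helper
name and the exact set of heuristics are placeholders, not an agreed
interface):

	/* hypothetical; would gate the kswapd wakeup in the page allocator */
	static bool gfp_is_opportunistic(gfp_t gfp, unsigned int order)
	{
		/* best-effort high-order attempt: no direct reclaim, no retries */
		return order > 0 &&
		       !(gfp & __GFP_DIRECT_RECLAIM) &&
		       (gfp & __GFP_NORETRY) &&
		       (gfp & __GFP_NOMEMALLOC);
	}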

Vlastimil

>  	page = __alloc_pages(gfp_mask, PAGE_FRAG_CACHE_MAX_ORDER,
>  			     numa_mem_id(), NULL);
> diff --git a/net/core/sock.c b/net/core/sock.c
> index dc03d4b5909a..1fa1e9177d86 100644
> --- a/net/core/sock.c
> +++ b/net/core/sock.c
> @@ -3085,8 +3085,6 @@ static void sk_leave_memory_pressure(struct sock *sk)
>  	}
>  }
>  
> -DEFINE_STATIC_KEY_FALSE(net_high_order_alloc_disable_key);
> -
>  /**
>   * skb_page_frag_refill - check that a page_frag contains enough room
>   * @sz: minimum size of the fragment we want to get
> @@ -3110,10 +3108,8 @@ bool skb_page_frag_refill(unsigned int sz, struct page_frag *pfrag, gfp_t gfp)
>  	}
>  
>  	pfrag->offset = 0;
> -	if (SKB_FRAG_PAGE_ORDER &&
> -	    !static_branch_unlikely(&net_high_order_alloc_disable_key)) {
> -		/* Avoid direct reclaim but allow kswapd to wake */
> -		pfrag->page = alloc_pages((gfp & ~__GFP_DIRECT_RECLAIM) |
> +	if (SKB_FRAG_PAGE_ORDER) {
> +		pfrag->page = alloc_pages((gfp & ~__GFP_RECLAIM) |
>  					  __GFP_COMP | __GFP_NOWARN |
>  					  __GFP_NORETRY,
>  					  SKB_FRAG_PAGE_ORDER);
> diff --git a/net/core/sysctl_net_core.c b/net/core/sysctl_net_core.c
> index 8cf04b57ade1..181f6532beb8 100644
> --- a/net/core/sysctl_net_core.c
> +++ b/net/core/sysctl_net_core.c
> @@ -599,13 +599,6 @@ static struct ctl_table net_core_table[] = {
>  		.extra1		= SYSCTL_ZERO,
>  		.extra2		= SYSCTL_THREE,
>  	},
> -	{
> -		.procname	= "high_order_alloc_disable",
> -		.data		= &net_high_order_alloc_disable_key.key,
> -		.maxlen         = sizeof(net_high_order_alloc_disable_key),
> -		.mode		= 0644,
> -		.proc_handler	= proc_do_static_key,
> -	},
>  	{
>  		.procname	= "gro_normal_batch",
>  		.data		= &net_hotdata.gro_normal_batch,




* Re: [RFC PATCH] mm: net: disable kswapd for high-order network buffer allocation
  2025-10-13 10:16 [RFC PATCH] mm: net: disable kswapd for high-order network buffer allocation Barry Song
  2025-10-13 18:30 ` Vlastimil Babka
@ 2025-10-13 18:53 ` Eric Dumazet
  2025-10-14  3:58   ` Barry Song
  2025-10-13 21:56 ` Matthew Wilcox
  2 siblings, 1 reply; 36+ messages in thread
From: Eric Dumazet @ 2025-10-13 18:53 UTC (permalink / raw)
  To: Barry Song
  Cc: netdev, linux-mm, linux-doc, linux-kernel, Barry Song,
	Jonathan Corbet, Kuniyuki Iwashima, Paolo Abeni,
	Willem de Bruijn, David S. Miller, Jakub Kicinski, Simon Horman,
	Vlastimil Babka, Suren Baghdasaryan, Michal Hocko,
	Brendan Jackman, Johannes Weiner, Zi Yan, Yunsheng Lin,
	Huacai Zhou

On Mon, Oct 13, 2025 at 3:16 AM Barry Song <21cnbao@gmail.com> wrote:
>
> From: Barry Song <v-songbaohua@oppo.com>
>
> On phones, we have observed significant phone heating when running apps
> with high network bandwidth. This is caused by the network stack frequently
> waking kswapd for order-3 allocations. As a result, memory reclamation becomes
> constantly active, even though plenty of memory is still available for network
> allocations which can fall back to order-0.
>
> Commit ce27ec60648d ("net: add high_order_alloc_disable sysctl/static key")
> introduced high_order_alloc_disable for the transmit (TX) path
> (skb_page_frag_refill()) to mitigate some memory reclamation issues,
> allowing the TX path to fall back to order-0 immediately, while leaving the
> receive (RX) path (__page_frag_cache_refill()) unaffected. Users are
> generally unaware of the sysctl and cannot easily adjust it for specific use
> cases. Enabling high_order_alloc_disable also completely disables the
> benefit of order-3 allocations. Additionally, the sysctl does not apply to the
> RX path.
>
> An alternative approach is to disable kswapd for these frequent
> allocations and provide best-effort order-3 service for both TX and RX paths,
> while removing the sysctl entirely.
>
>
...

> Signed-off-by: Barry Song <v-songbaohua@oppo.com>
> ---
>  Documentation/admin-guide/sysctl/net.rst | 12 ------------
>  include/net/sock.h                       |  1 -
>  mm/page_frag_cache.c                     |  2 +-
>  net/core/sock.c                          |  8 ++------
>  net/core/sysctl_net_core.c               |  7 -------
>  5 files changed, 3 insertions(+), 27 deletions(-)
>
> diff --git a/Documentation/admin-guide/sysctl/net.rst b/Documentation/admin-guide/sysctl/net.rst
> index 2ef50828aff1..b903bbae239c 100644
> --- a/Documentation/admin-guide/sysctl/net.rst
> +++ b/Documentation/admin-guide/sysctl/net.rst
> @@ -415,18 +415,6 @@ GRO has decided not to coalesce, it is placed on a per-NAPI list. This
>  list is then passed to the stack when the number of segments reaches the
>  gro_normal_batch limit.
>
> -high_order_alloc_disable
> -------------------------
> -
> -By default the allocator for page frags tries to use high order pages (order-3
> -on x86). While the default behavior gives good results in most cases, some users
> -might have hit a contention in page allocations/freeing. This was especially
> -true on older kernels (< 5.14) when high-order pages were not stored on per-cpu
> -lists. This allows to opt-in for order-0 allocation instead but is now mostly of
> -historical importance.
> -

The sysctl is quite useful for testing purposes, say on a freshly
booted host, with plenty of free memory.

Also, having order-3 pages if possible is quite important for IOMMU use cases.

Perhaps kswapd should have some kind of heuristic to not start if a
recent run has already happened.

I am guessing phones do not need to send 1.6 Tbit per second on
network devices (yet), so an option could be to disable it in your boot
scripts.
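
For reference, that is the existing net.core sysctl this patch removes,
i.e. "sysctl -w net.core.high_order_alloc_disable=1" at boot.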



* Re: [RFC PATCH] mm: net: disable kswapd for high-order network buffer allocation
  2025-10-13 18:30 ` Vlastimil Babka
@ 2025-10-13 21:35   ` Shakeel Butt
  2025-10-13 21:53     ` Alexei Starovoitov
  2025-10-13 22:46   ` Roman Gushchin
  2025-10-14  7:26   ` Michal Hocko
  2 siblings, 1 reply; 36+ messages in thread
From: Shakeel Butt @ 2025-10-13 21:35 UTC (permalink / raw)
  To: Vlastimil Babka
  Cc: Barry Song, netdev, linux-mm, linux-doc, linux-kernel,
	Barry Song, Jonathan Corbet, Eric Dumazet, Kuniyuki Iwashima,
	Paolo Abeni, Willem de Bruijn, David S. Miller, Jakub Kicinski,
	Simon Horman, Suren Baghdasaryan, Michal Hocko, Brendan Jackman,
	Johannes Weiner, Zi Yan, Yunsheng Lin, Huacai Zhou,
	Alexei Starovoitov, Harry Yoo, David Hildenbrand, Matthew Wilcox,
	Roman Gushchin

On Mon, Oct 13, 2025 at 08:30:13PM +0200, Vlastimil Babka wrote:
> On 10/13/25 12:16, Barry Song wrote:
> > From: Barry Song <v-songbaohua@oppo.com>
> > 
> > On phones, we have observed significant phone heating when running apps
> > with high network bandwidth. This is caused by the network stack frequently
> > waking kswapd for order-3 allocations. As a result, memory reclamation becomes
> > constantly active, even though plenty of memory is still available for network
> > allocations which can fall back to order-0.
> > 
> > Commit ce27ec60648d ("net: add high_order_alloc_disable sysctl/static key")
> > introduced high_order_alloc_disable for the transmit (TX) path
> > (skb_page_frag_refill()) to mitigate some memory reclamation issues,
> > allowing the TX path to fall back to order-0 immediately, while leaving the
> > receive (RX) path (__page_frag_cache_refill()) unaffected. Users are
> > generally unaware of the sysctl and cannot easily adjust it for specific use
> > cases. Enabling high_order_alloc_disable also completely disables the
> > benefit of order-3 allocations. Additionally, the sysctl does not apply to the
> > RX path.
> > 
> > An alternative approach is to disable kswapd for these frequent
> > allocations and provide best-effort order-3 service for both TX and RX paths,
> > while removing the sysctl entirely.
> > 
> > Cc: Jonathan Corbet <corbet@lwn.net>
> > Cc: Eric Dumazet <edumazet@google.com>
> > Cc: Kuniyuki Iwashima <kuniyu@google.com>
> > Cc: Paolo Abeni <pabeni@redhat.com>
> > Cc: Willem de Bruijn <willemb@google.com>
> > Cc: "David S. Miller" <davem@davemloft.net>
> > Cc: Jakub Kicinski <kuba@kernel.org>
> > Cc: Simon Horman <horms@kernel.org>
> > Cc: Vlastimil Babka <vbabka@suse.cz>
> > Cc: Suren Baghdasaryan <surenb@google.com>
> > Cc: Michal Hocko <mhocko@suse.com>
> > Cc: Brendan Jackman <jackmanb@google.com>
> > Cc: Johannes Weiner <hannes@cmpxchg.org>
> > Cc: Zi Yan <ziy@nvidia.com>
> > Cc: Yunsheng Lin <linyunsheng@huawei.com>
> > Cc: Huacai Zhou <zhouhuacai@oppo.com>
> > Signed-off-by: Barry Song <v-songbaohua@oppo.com>
> > ---
> >  Documentation/admin-guide/sysctl/net.rst | 12 ------------
> >  include/net/sock.h                       |  1 -
> >  mm/page_frag_cache.c                     |  2 +-
> >  net/core/sock.c                          |  8 ++------
> >  net/core/sysctl_net_core.c               |  7 -------
> >  5 files changed, 3 insertions(+), 27 deletions(-)
> > 
> > diff --git a/Documentation/admin-guide/sysctl/net.rst b/Documentation/admin-guide/sysctl/net.rst
> > index 2ef50828aff1..b903bbae239c 100644
> > --- a/Documentation/admin-guide/sysctl/net.rst
> > +++ b/Documentation/admin-guide/sysctl/net.rst
> > @@ -415,18 +415,6 @@ GRO has decided not to coalesce, it is placed on a per-NAPI list. This
> >  list is then passed to the stack when the number of segments reaches the
> >  gro_normal_batch limit.
> >  
> > -high_order_alloc_disable
> > -------------------------
> > -
> > -By default the allocator for page frags tries to use high order pages (order-3
> > -on x86). While the default behavior gives good results in most cases, some users
> > -might have hit a contention in page allocations/freeing. This was especially
> > -true on older kernels (< 5.14) when high-order pages were not stored on per-cpu
> > -lists. This allows to opt-in for order-0 allocation instead but is now mostly of
> > -historical importance.
> > -
> > -Default: 0
> > -
> >  2. /proc/sys/net/unix - Parameters for Unix domain sockets
> >  ----------------------------------------------------------
> >  
> > diff --git a/include/net/sock.h b/include/net/sock.h
> > index 60bcb13f045c..62306c1095d5 100644
> > --- a/include/net/sock.h
> > +++ b/include/net/sock.h
> > @@ -3011,7 +3011,6 @@ extern __u32 sysctl_wmem_default;
> >  extern __u32 sysctl_rmem_default;
> >  
> >  #define SKB_FRAG_PAGE_ORDER	get_order(32768)
> > -DECLARE_STATIC_KEY_FALSE(net_high_order_alloc_disable_key);
> >  
> >  static inline int sk_get_wmem0(const struct sock *sk, const struct proto *proto)
> >  {
> > diff --git a/mm/page_frag_cache.c b/mm/page_frag_cache.c
> > index d2423f30577e..dd36114dd16f 100644
> > --- a/mm/page_frag_cache.c
> > +++ b/mm/page_frag_cache.c
> > @@ -54,7 +54,7 @@ static struct page *__page_frag_cache_refill(struct page_frag_cache *nc,
> >  	gfp_t gfp = gfp_mask;
> >  
> >  #if (PAGE_SIZE < PAGE_FRAG_CACHE_MAX_SIZE)
> > -	gfp_mask = (gfp_mask & ~__GFP_DIRECT_RECLAIM) |  __GFP_COMP |
> > +	gfp_mask = (gfp_mask & ~__GFP_RECLAIM) |  __GFP_COMP |
> >  		   __GFP_NOWARN | __GFP_NORETRY | __GFP_NOMEMALLOC;
> 
> I'm a bit worried about proliferating "~__GFP_RECLAIM" allocations now that
> we introduced alloc_pages_nolock() and kmalloc_nolock() where it's
> interpreted as "cannot spin" - see gfpflags_allow_spinning(). Currently it's
> fine for the page allocator itself where we have a different entry point
> that uses ALLOC_TRYLOCK, but it can affect nested allocations of all kinds
> of debugging and accounting metadata (page_owner, memcg, alloc tags for slab
> objects etc). kmalloc_nolock() relies on gfpflags_allow_spinning() fully.
> 
> I wonder if we should either:
> 
> 1) sacrifice a new __GFP flag specifically for "!allow_spin" case to
> determine it precisely.
> 
> 2) keep __GFP_KSWAPD_RECLAIM for allocations that remove it for purposes of
> not being disturbing (like proposed here), but that can in fact allow
> spinning. Instead, decide not to wake up kswapd for those when other
> information indicates it's an opportunistic allocation
> (~__GFP_DIRECT_RECLAIM, __GFP_NOWARN | __GFP_NORETRY | __GFP_NOMEMALLOC,
> order > 0...)
> 
> 3) something better?
> 

For the !allow_spin allocations, I think we should just add a new __GFP
flag instead of adding more complexity to other allocators which may or
may not want kswapd wakeup for many different reasons.






* Re: [RFC PATCH] mm: net: disable kswapd for high-order network buffer allocation
  2025-10-13 21:35   ` Shakeel Butt
@ 2025-10-13 21:53     ` Alexei Starovoitov
  2025-10-13 22:25       ` Shakeel Butt
  0 siblings, 1 reply; 36+ messages in thread
From: Alexei Starovoitov @ 2025-10-13 21:53 UTC (permalink / raw)
  To: Shakeel Butt
  Cc: Vlastimil Babka, Barry Song, Network Development, linux-mm,
	open list:DOCUMENTATION, LKML, Barry Song, Jonathan Corbet,
	Eric Dumazet, Kuniyuki Iwashima, Paolo Abeni, Willem de Bruijn,
	David S. Miller, Jakub Kicinski, Simon Horman,
	Suren Baghdasaryan, Michal Hocko, Brendan Jackman,
	Johannes Weiner, Zi Yan, Yunsheng Lin, Huacai Zhou, Harry Yoo,
	David Hildenbrand, Matthew Wilcox, Roman Gushchin

On Mon, Oct 13, 2025 at 2:35 PM Shakeel Butt <shakeel.butt@linux.dev> wrote:
>
> On Mon, Oct 13, 2025 at 08:30:13PM +0200, Vlastimil Babka wrote:
> > On 10/13/25 12:16, Barry Song wrote:
> > > From: Barry Song <v-songbaohua@oppo.com>
> > >
> > > On phones, we have observed significant phone heating when running apps
> > > with high network bandwidth. This is caused by the network stack frequently
> > > waking kswapd for order-3 allocations. As a result, memory reclamation becomes
> > > constantly active, even though plenty of memory is still available for network
> > > allocations which can fall back to order-0.
> > >
> > > Commit ce27ec60648d ("net: add high_order_alloc_disable sysctl/static key")
> > > introduced high_order_alloc_disable for the transmit (TX) path
> > > (skb_page_frag_refill()) to mitigate some memory reclamation issues,
> > > allowing the TX path to fall back to order-0 immediately, while leaving the
> > > receive (RX) path (__page_frag_cache_refill()) unaffected. Users are
> > > generally unaware of the sysctl and cannot easily adjust it for specific use
> > > cases. Enabling high_order_alloc_disable also completely disables the
> > > benefit of order-3 allocations. Additionally, the sysctl does not apply to the
> > > RX path.
> > >
> > > An alternative approach is to disable kswapd for these frequent
> > > allocations and provide best-effort order-3 service for both TX and RX paths,
> > > while removing the sysctl entirely.
> > >
> > > Cc: Jonathan Corbet <corbet@lwn.net>
> > > Cc: Eric Dumazet <edumazet@google.com>
> > > Cc: Kuniyuki Iwashima <kuniyu@google.com>
> > > Cc: Paolo Abeni <pabeni@redhat.com>
> > > Cc: Willem de Bruijn <willemb@google.com>
> > > Cc: "David S. Miller" <davem@davemloft.net>
> > > Cc: Jakub Kicinski <kuba@kernel.org>
> > > Cc: Simon Horman <horms@kernel.org>
> > > Cc: Vlastimil Babka <vbabka@suse.cz>
> > > Cc: Suren Baghdasaryan <surenb@google.com>
> > > Cc: Michal Hocko <mhocko@suse.com>
> > > Cc: Brendan Jackman <jackmanb@google.com>
> > > Cc: Johannes Weiner <hannes@cmpxchg.org>
> > > Cc: Zi Yan <ziy@nvidia.com>
> > > Cc: Yunsheng Lin <linyunsheng@huawei.com>
> > > Cc: Huacai Zhou <zhouhuacai@oppo.com>
> > > Signed-off-by: Barry Song <v-songbaohua@oppo.com>
> > > ---
> > >  Documentation/admin-guide/sysctl/net.rst | 12 ------------
> > >  include/net/sock.h                       |  1 -
> > >  mm/page_frag_cache.c                     |  2 +-
> > >  net/core/sock.c                          |  8 ++------
> > >  net/core/sysctl_net_core.c               |  7 -------
> > >  5 files changed, 3 insertions(+), 27 deletions(-)
> > >
> > > diff --git a/Documentation/admin-guide/sysctl/net.rst b/Documentation/admin-guide/sysctl/net.rst
> > > index 2ef50828aff1..b903bbae239c 100644
> > > --- a/Documentation/admin-guide/sysctl/net.rst
> > > +++ b/Documentation/admin-guide/sysctl/net.rst
> > > @@ -415,18 +415,6 @@ GRO has decided not to coalesce, it is placed on a per-NAPI list. This
> > >  list is then passed to the stack when the number of segments reaches the
> > >  gro_normal_batch limit.
> > >
> > > -high_order_alloc_disable
> > > -------------------------
> > > -
> > > -By default the allocator for page frags tries to use high order pages (order-3
> > > -on x86). While the default behavior gives good results in most cases, some users
> > > -might have hit a contention in page allocations/freeing. This was especially
> > > -true on older kernels (< 5.14) when high-order pages were not stored on per-cpu
> > > -lists. This allows to opt-in for order-0 allocation instead but is now mostly of
> > > -historical importance.
> > > -
> > > -Default: 0
> > > -
> > >  2. /proc/sys/net/unix - Parameters for Unix domain sockets
> > >  ----------------------------------------------------------
> > >
> > > diff --git a/include/net/sock.h b/include/net/sock.h
> > > index 60bcb13f045c..62306c1095d5 100644
> > > --- a/include/net/sock.h
> > > +++ b/include/net/sock.h
> > > @@ -3011,7 +3011,6 @@ extern __u32 sysctl_wmem_default;
> > >  extern __u32 sysctl_rmem_default;
> > >
> > >  #define SKB_FRAG_PAGE_ORDER        get_order(32768)
> > > -DECLARE_STATIC_KEY_FALSE(net_high_order_alloc_disable_key);
> > >
> > >  static inline int sk_get_wmem0(const struct sock *sk, const struct proto *proto)
> > >  {
> > > diff --git a/mm/page_frag_cache.c b/mm/page_frag_cache.c
> > > index d2423f30577e..dd36114dd16f 100644
> > > --- a/mm/page_frag_cache.c
> > > +++ b/mm/page_frag_cache.c
> > > @@ -54,7 +54,7 @@ static struct page *__page_frag_cache_refill(struct page_frag_cache *nc,
> > >     gfp_t gfp = gfp_mask;
> > >
> > >  #if (PAGE_SIZE < PAGE_FRAG_CACHE_MAX_SIZE)
> > > -   gfp_mask = (gfp_mask & ~__GFP_DIRECT_RECLAIM) |  __GFP_COMP |
> > > +   gfp_mask = (gfp_mask & ~__GFP_RECLAIM) |  __GFP_COMP |
> > >                __GFP_NOWARN | __GFP_NORETRY | __GFP_NOMEMALLOC;
> >
> > I'm a bit worried about proliferating "~__GFP_RECLAIM" allocations now that
> > we introduced alloc_pages_nolock() and kmalloc_nolock() where it's
> > interpreted as "cannot spin" - see gfpflags_allow_spinning(). Currently it's
> > fine for the page allocator itself where we have a different entry point
> > that uses ALLOC_TRYLOCK, but it can affect nested allocations of all kinds
> > of debugging and accounting metadata (page_owner, memcg, alloc tags for slab
> > objects etc). kmalloc_nolock() relies on gfpflags_allow_spinning() fully.
> >
> > I wonder if we should either:
> >
> > 1) sacrifice a new __GFP flag specifically for "!allow_spin" case to
> > determine it precisely.
> >
> > 2) keep __GFP_KSWAPD_RECLAIM for allocations that remove it for purposes of
> > not being disturbing (like proposed here), but that can in fact allow
> > spinning. Instead, decide not to wake up kswapd for those when other
> > information indicates it's an opportunistic allocation
> > (~__GFP_DIRECT_RECLAIM, __GFP_NOWARN | __GFP_NORETRY | __GFP_NOMEMALLOC,
> > order > 0...)
> >
> > 3) something better?
> >
>
> For the !allow_spin allocations, I think we should just add a new __GFP
> flag instead of adding more complexity to other allocators which may or
> may not want kswapd wakeup for many different reasons.

That's what I proposed long ago, but was convinced that the new flag
adds more complexity. Looks like we walked this road far enough and
the new flag will actually make things simpler.
Back then I proposed __GFP_TRYLOCK which is not a good name.
How about __GFP_NOLOCK ? or __GFP_NOSPIN ?



* Re: [RFC PATCH] mm: net: disable kswapd for high-order network buffer allocation
  2025-10-13 10:16 [RFC PATCH] mm: net: disable kswapd for high-order network buffer allocation Barry Song
  2025-10-13 18:30 ` Vlastimil Babka
  2025-10-13 18:53 ` Eric Dumazet
@ 2025-10-13 21:56 ` Matthew Wilcox
  2025-10-14  4:09   ` Barry Song
  2 siblings, 1 reply; 36+ messages in thread
From: Matthew Wilcox @ 2025-10-13 21:56 UTC (permalink / raw)
  To: Barry Song
  Cc: netdev, linux-mm, linux-doc, linux-kernel, Barry Song,
	Jonathan Corbet, Eric Dumazet, Kuniyuki Iwashima, Paolo Abeni,
	Willem de Bruijn, David S. Miller, Jakub Kicinski, Simon Horman,
	Vlastimil Babka, Suren Baghdasaryan, Michal Hocko,
	Brendan Jackman, Johannes Weiner, Zi Yan, Yunsheng Lin,
	Huacai Zhou

On Mon, Oct 13, 2025 at 06:16:36PM +0800, Barry Song wrote:
> On phones, we have observed significant phone heating when running apps
> with high network bandwidth. This is caused by the network stack frequently
> waking kswapd for order-3 allocations. As a result, memory reclamation becomes
> constantly active, even though plenty of memory is still available for network
> allocations which can fall back to order-0.

I think we need to understand what's going on here a whole lot more than
this!

So, we try to do an order-3 allocation.  kswapd runs and ... succeeds in
creating order-3 pages?  Or fails to?

If it fails, that's something we need to sort out.

If it succeeds, now we have several order-3 pages, great.  But where do
they all go that we need to run kswapd again?



* Re: [RFC PATCH] mm: net: disable kswapd for high-order network buffer allocation
  2025-10-13 21:53     ` Alexei Starovoitov
@ 2025-10-13 22:25       ` Shakeel Butt
  0 siblings, 0 replies; 36+ messages in thread
From: Shakeel Butt @ 2025-10-13 22:25 UTC (permalink / raw)
  To: Alexei Starovoitov
  Cc: Vlastimil Babka, Barry Song, Network Development, linux-mm,
	open list:DOCUMENTATION, LKML, Barry Song, Jonathan Corbet,
	Eric Dumazet, Kuniyuki Iwashima, Paolo Abeni, Willem de Bruijn,
	David S. Miller, Jakub Kicinski, Simon Horman,
	Suren Baghdasaryan, Michal Hocko, Brendan Jackman,
	Johannes Weiner, Zi Yan, Yunsheng Lin, Huacai Zhou, Harry Yoo,
	David Hildenbrand, Matthew Wilcox, Roman Gushchin

On Mon, Oct 13, 2025 at 02:53:17PM -0700, Alexei Starovoitov wrote:
> On Mon, Oct 13, 2025 at 2:35 PM Shakeel Butt <shakeel.butt@linux.dev> wrote:
> >
> > On Mon, Oct 13, 2025 at 08:30:13PM +0200, Vlastimil Babka wrote:
[...]
> > >
> > > I'm a bit worried about proliferating "~__GFP_RECLAIM" allocations now that
> > > we introduced alloc_pages_nolock() and kmalloc_nolock() where it's
> > > interpreted as "cannot spin" - see gfpflags_allow_spinning(). Currently it's
> > > fine for the page allocator itself where we have a different entry point
> > > that uses ALLOC_TRYLOCK, but it can affect nested allocations of all kinds
> > > of debugging and accounting metadata (page_owner, memcg, alloc tags for slab
> > > objects etc). kmalloc_nolock() relies on gfpflags_allow_spinning() fully.
> > >
> > > I wonder if we should either:
> > >
> > > 1) sacrifice a new __GFP flag specifically for "!allow_spin" case to
> > > determine it precisely.
> > >
> > > 2) keep __GFP_KSWAPD_RECLAIM for allocations that remove it for purposes of
> > > not being disturbing (like proposed here), but that can in fact allow
> > > spinning. Instead, decide not to wake up kswapd for those when other
> > > information indicates it's an opportunistic allocation
> > > (~__GFP_DIRECT_RECLAIM, __GFP_NOWARN | __GFP_NORETRY | __GFP_NOMEMALLOC,
> > > order > 0...)
> > >
> > > 3) something better?
> > >
> >
> > For the !allow_spin allocations, I think we should just add a new __GFP
> > flag instead of adding more complexity to other allocators which may or
> > may not want kswapd wakeup for many different reasons.
> 
> That's what I proposed long ago, but was convinced that the new flag
> adds more complexity. 

Oh somehow I thought we took that route because we are low on available
bits.

> Looks like we walked this road far enough and
> the new flag will actually make things simpler.
> Back then I proposed __GFP_TRYLOCK which is not a good name.
> How about __GFP_NOLOCK ? or __GFP_NOSPIN ?

Let's go with __GFP_NOLOCK as we already have nolock variants of the
allocation APIs. 
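
To make that concrete, a purely hypothetical sketch (bit value, name and
wiring are all placeholders, not an agreed API):

	/* include/linux/gfp_types.h, hypothetical */
	#define ___GFP_NOLOCK	0x8000000u	/* placeholder bit */
	#define __GFP_NOLOCK	((__force gfp_t)___GFP_NOLOCK)

	/* gfpflags_allow_spinning() could then test the flag directly */
	static inline bool gfpflags_allow_spinning(const gfp_t gfp_flags)
	{
		/* spin unless the caller explicitly asked for trylock-only mode */
		return !(gfp_flags & __GFP_NOLOCK);
	}

That would let callers such as skb_page_frag_refill() drop
__GFP_KSWAPD_RECLAIM without being misread as nolock allocations.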



* Re: [RFC PATCH] mm: net: disable kswapd for high-order network buffer allocation
  2025-10-13 18:30 ` Vlastimil Babka
  2025-10-13 21:35   ` Shakeel Butt
@ 2025-10-13 22:46   ` Roman Gushchin
  2025-10-14  4:31     ` Barry Song
  2025-10-14  7:24     ` Michal Hocko
  2025-10-14  7:26   ` Michal Hocko
  2 siblings, 2 replies; 36+ messages in thread
From: Roman Gushchin @ 2025-10-13 22:46 UTC (permalink / raw)
  To: Vlastimil Babka
  Cc: Barry Song, netdev, linux-mm, linux-doc, linux-kernel,
	Barry Song, Jonathan Corbet, Eric Dumazet, Kuniyuki Iwashima,
	Paolo Abeni, Willem de Bruijn, David S. Miller, Jakub Kicinski,
	Simon Horman, Suren Baghdasaryan, Michal Hocko, Brendan Jackman,
	Johannes Weiner, Zi Yan, Yunsheng Lin, Huacai Zhou,
	Alexei Starovoitov, Harry Yoo, David Hildenbrand, Matthew Wilcox

Vlastimil Babka <vbabka@suse.cz> writes:

> On 10/13/25 12:16, Barry Song wrote:
>> From: Barry Song <v-songbaohua@oppo.com>
>> 
>> On phones, we have observed significant phone heating when running apps
>> with high network bandwidth. This is caused by the network stack frequently
>> waking kswapd for order-3 allocations. As a result, memory reclamation becomes
>> constantly active, even though plenty of memory is still available for network
>> allocations which can fall back to order-0.
>> 
>> Commit ce27ec60648d ("net: add high_order_alloc_disable sysctl/static key")
>> introduced high_order_alloc_disable for the transmit (TX) path
>> (skb_page_frag_refill()) to mitigate some memory reclamation issues,
>> allowing the TX path to fall back to order-0 immediately, while leaving the
>> receive (RX) path (__page_frag_cache_refill()) unaffected. Users are
>> generally unaware of the sysctl and cannot easily adjust it for specific use
>> cases. Enabling high_order_alloc_disable also completely disables the
>> benefit of order-3 allocations. Additionally, the sysctl does not apply to the
>> RX path.
>> 
>> An alternative approach is to disable kswapd for these frequent
>> allocations and provide best-effort order-3 service for both TX and RX paths,
>> while removing the sysctl entirely.

I'm not sure this is the right path long-term. There are significant
benefits associated with using larger pages, so making the kernel fall
back to order-0 pages easier and sooner feels wrong, tbh. Without kswapd
trying to defragment memory, the only other option is to force tasks
into the direct compaction and it's known to be problematic.

I wonder if instead we should look into optimizing kswapd to be less
power-hungry?

And if you still prefer to disable kswapd for this purpose, at least it
should be conditional to vm.laptop_mode. But again, I don't think it's
the right long-term approach.

Thanks!



* Re: [RFC PATCH] mm: net: disable kswapd for high-order network buffer allocation
  2025-10-13 18:53 ` Eric Dumazet
@ 2025-10-14  3:58   ` Barry Song
  2025-10-14  5:07     ` Eric Dumazet
  0 siblings, 1 reply; 36+ messages in thread
From: Barry Song @ 2025-10-14  3:58 UTC (permalink / raw)
  To: edumazet
  Cc: 21cnbao, corbet, davem, hannes, horms, jackmanb, kuba, kuniyu,
	linux-doc, linux-kernel, linux-mm, linyunsheng, mhocko, netdev,
	pabeni, surenb, v-songbaohua, vbabka, willemb, zhouhuacai, ziy

> >
> > diff --git a/Documentation/admin-guide/sysctl/net.rst b/Documentation/admin-guide/sysctl/net.rst
> > index 2ef50828aff1..b903bbae239c 100644
> > --- a/Documentation/admin-guide/sysctl/net.rst
> > +++ b/Documentation/admin-guide/sysctl/net.rst
> > @@ -415,18 +415,6 @@ GRO has decided not to coalesce, it is placed on a per-NAPI list. This
> >  list is then passed to the stack when the number of segments reaches the
> >  gro_normal_batch limit.
> >
> > -high_order_alloc_disable
> > -------------------------
> > -
> > -By default the allocator for page frags tries to use high order pages (order-3
> > -on x86). While the default behavior gives good results in most cases, some users
> > -might have hit a contention in page allocations/freeing. This was especially
> > -true on older kernels (< 5.14) when high-order pages were not stored on per-cpu
> > -lists. This allows to opt-in for order-0 allocation instead but is now mostly of
> > -historical importance.
> > -
>
> The sysctl is quite useful for testing purposes, say on a freshly
> booted host, with plenty of free memory.
>
> Also, having order-3 pages if possible is quite important for IOMMU use cases.
>
> Perhaps kswapd should have some kind of heuristic to not start if a
> recent run has already happened.

I don’t understand why it shouldn’t start when users continuously request
order-3 allocations and ask kswapd to prepare order-3 memory — it doesn’t
make sense logically to skip it just because earlier requests were already
satisfied.

>
> I am guessing phones do not need to send 1.6 Tbit per second on
> network devices (yet), so an option could be to disable it in your
> boot scripts.

A problem with the existing sysctl is that it only covers the TX path;
for the RX path, we also observe that kswapd consumes significant power.
I could add the patch below to make it support the RX path, but it feels
like a bit of a layer violation, since the RX path code resides in mm
and is intended to serve generic users rather than networking, even
though the current callers are primarily network-related.

diff --git a/mm/page_frag_cache.c b/mm/page_frag_cache.c
index d2423f30577e..8ad18ec49f39 100644
--- a/mm/page_frag_cache.c
+++ b/mm/page_frag_cache.c
@@ -18,6 +18,7 @@
 #include <linux/init.h>
 #include <linux/mm.h>
 #include <linux/page_frag_cache.h>
+#include <net/sock.h>
 #include "internal.h"
 
 static unsigned long encoded_page_create(struct page *page, unsigned int order,
@@ -54,10 +55,12 @@ static struct page *__page_frag_cache_refill(struct page_frag_cache *nc,
        gfp_t gfp = gfp_mask;
 
 #if (PAGE_SIZE < PAGE_FRAG_CACHE_MAX_SIZE)
-       gfp_mask = (gfp_mask & ~__GFP_DIRECT_RECLAIM) |  __GFP_COMP |
-                  __GFP_NOWARN | __GFP_NORETRY | __GFP_NOMEMALLOC;
-       page = __alloc_pages(gfp_mask, PAGE_FRAG_CACHE_MAX_ORDER,
-                            numa_mem_id(), NULL);
+       if (!static_branch_unlikely(&net_high_order_alloc_disable_key)) {
+               gfp_mask = (gfp_mask & ~__GFP_DIRECT_RECLAIM) |  __GFP_COMP |
+                       __GFP_NOWARN | __GFP_NORETRY | __GFP_NOMEMALLOC;
+               page = __alloc_pages(gfp_mask, PAGE_FRAG_CACHE_MAX_ORDER,
+                               numa_mem_id(), NULL);
+       }
 #endif
        if (unlikely(!page)) {


Do you have a better idea on how to make the sysctl also cover the RX path?

Thanks
Barry




* Re: [RFC PATCH] mm: net: disable kswapd for high-order network buffer allocation
  2025-10-13 21:56 ` Matthew Wilcox
@ 2025-10-14  4:09   ` Barry Song
  2025-10-14  5:04     ` Eric Dumazet
  0 siblings, 1 reply; 36+ messages in thread
From: Barry Song @ 2025-10-14  4:09 UTC (permalink / raw)
  To: Matthew Wilcox
  Cc: netdev, linux-mm, linux-doc, linux-kernel, Barry Song,
	Jonathan Corbet, Eric Dumazet, Kuniyuki Iwashima, Paolo Abeni,
	Willem de Bruijn, David S. Miller, Jakub Kicinski, Simon Horman,
	Vlastimil Babka, Suren Baghdasaryan, Michal Hocko,
	Brendan Jackman, Johannes Weiner, Zi Yan, Yunsheng Lin,
	Huacai Zhou

On Tue, Oct 14, 2025 at 5:56 AM Matthew Wilcox <willy@infradead.org> wrote:
>
> On Mon, Oct 13, 2025 at 06:16:36PM +0800, Barry Song wrote:
> > On phones, we have observed significant phone heating when running apps
> > with high network bandwidth. This is caused by the network stack frequently
> > waking kswapd for order-3 allocations. As a result, memory reclamation becomes
> > constantly active, even though plenty of memory is still available for network
> > allocations which can fall back to order-0.
>
> I think we need to understand what's going on here a whole lot more than
> this!
>
> So, we try to do an order-3 allocation.  kswapd runs and ... succeeds in
> creating order-3 pages?  Or fails to?
>

Our team observed that most of the time we successfully obtain order-3
memory, but the cost is excessive memory reclamation, since we end up
over-reclaiming order-0 pages that could have remained in memory.

> If it fails, that's something we need to sort out.
>
> If it succeeds, now we have several order-3 pages, great.  But where do
> they all go that we need to run kswapd again?

The network app keeps running and continues to issue new order-3 allocation
requests, so those few order-3 pages won’t be enough to satisfy the
continuous demand.

Thanks
Barry



* Re: [RFC PATCH] mm: net: disable kswapd for high-order network buffer allocation
  2025-10-13 22:46   ` Roman Gushchin
@ 2025-10-14  4:31     ` Barry Song
  2025-10-14  7:24     ` Michal Hocko
  1 sibling, 0 replies; 36+ messages in thread
From: Barry Song @ 2025-10-14  4:31 UTC (permalink / raw)
  To: Roman Gushchin
  Cc: Vlastimil Babka, netdev, linux-mm, linux-doc, linux-kernel,
	Barry Song, Jonathan Corbet, Eric Dumazet, Kuniyuki Iwashima,
	Paolo Abeni, Willem de Bruijn, David S. Miller, Jakub Kicinski,
	Simon Horman, Suren Baghdasaryan, Michal Hocko, Brendan Jackman,
	Johannes Weiner, Zi Yan, Yunsheng Lin, Huacai Zhou,
	Alexei Starovoitov, Harry Yoo, David Hildenbrand, Matthew Wilcox

On Tue, Oct 14, 2025 at 6:47 AM Roman Gushchin <roman.gushchin@linux.dev> wrote:
>
> Vlastimil Babka <vbabka@suse.cz> writes:
>
> > On 10/13/25 12:16, Barry Song wrote:
> >> From: Barry Song <v-songbaohua@oppo.com>
> >>
> >> On phones, we have observed significant phone heating when running apps
> >> with high network bandwidth. This is caused by the network stack frequently
> >> waking kswapd for order-3 allocations. As a result, memory reclamation becomes
> >> constantly active, even though plenty of memory is still available for network
> >> allocations which can fall back to order-0.
> >>
> >> Commit ce27ec60648d ("net: add high_order_alloc_disable sysctl/static key")
> >> introduced high_order_alloc_disable for the transmit (TX) path
> >> (skb_page_frag_refill()) to mitigate some memory reclamation issues,
> >> allowing the TX path to fall back to order-0 immediately, while leaving the
> >> receive (RX) path (__page_frag_cache_refill()) unaffected. Users are
> >> generally unaware of the sysctl and cannot easily adjust it for specific use
> >> cases. Enabling high_order_alloc_disable also completely disables the
> >> benefit of order-3 allocations. Additionally, the sysctl does not apply to the
> >> RX path.
> >>
> >> An alternative approach is to disable kswapd for these frequent
> >> allocations and provide best-effort order-3 service for both TX and RX paths,
> >> while removing the sysctl entirely.
>
> I'm not sure this is the right path long-term. There are significant
> benefits associated with using larger pages, so making the kernel fall
> back to order-0 pages easier and sooner feels wrong, tbh. Without kswapd
> trying to defragment memory, the only other option is to force tasks
> into the direct compaction and it's known to be problematic.

I guess the benefits depend on the hardware: for loopback, they might be
significant, while for slower network devices, order-3 memory may provide
much smaller gains?

On the other hand, I wonder if we could make kcompactd more active when
kswapd is woken for order-3 allocations, instead of reclaiming
order-0 pages to form order-3.

>
> I wonder if instead we should look into optimizing kswapd to be less
> power-hungry?

People have been working on this for years, yet reclaiming a folio still
requires a lot of effort, including folio_referenced, try_to_unmap_one,
and compressing folios to swap out to zRAM.

>
> And if you still prefer to disable kswapd for this purpose, at least it
> should be conditional to vm.laptop_mode. But again, I don't think it's
> the right long-term approach.

My point is that phones generally have much slower network hardware
compared to PCs, and far slower hardware compared to servers, so they
are likely not very sensitive to whether memory is order-3 or order-0. On
the other hand, phones are highly sensitive to power consumption. As a
result, the power cost of creating order-3 pages is likely to outweigh any
benefit that order-3 memory might offer for network performance.

It might be worth extending the existing net_high_order_alloc_disable_key
to the RX path, as I mentioned in my reply to Eric[1], allowing users to
decide whether network or power consumption is more important?

[1]  https://lore.kernel.org/linux-mm/20251014035846.1519-1-21cnbao@gmail.com/

Thanks
Barry



* Re: [RFC PATCH] mm: net: disable kswapd for high-order network buffer allocation
  2025-10-14  4:09   ` Barry Song
@ 2025-10-14  5:04     ` Eric Dumazet
  2025-10-14  8:58       ` Barry Song
  0 siblings, 1 reply; 36+ messages in thread
From: Eric Dumazet @ 2025-10-14  5:04 UTC (permalink / raw)
  To: Barry Song
  Cc: Matthew Wilcox, netdev, linux-mm, linux-doc, linux-kernel,
	Barry Song, Jonathan Corbet, Kuniyuki Iwashima, Paolo Abeni,
	Willem de Bruijn, David S. Miller, Jakub Kicinski, Simon Horman,
	Vlastimil Babka, Suren Baghdasaryan, Michal Hocko,
	Brendan Jackman, Johannes Weiner, Zi Yan, Yunsheng Lin,
	Huacai Zhou

On Mon, Oct 13, 2025 at 9:09 PM Barry Song <21cnbao@gmail.com> wrote:
>
> On Tue, Oct 14, 2025 at 5:56 AM Matthew Wilcox <willy@infradead.org> wrote:
> >
> > On Mon, Oct 13, 2025 at 06:16:36PM +0800, Barry Song wrote:
> > > On phones, we have observed significant phone heating when running apps
> > > with high network bandwidth. This is caused by the network stack frequently
> > > waking kswapd for order-3 allocations. As a result, memory reclamation becomes
> > > constantly active, even though plenty of memory is still available for network
> > > allocations which can fall back to order-0.
> >
> > I think we need to understand what's going on here a whole lot more than
> > this!
> >
> > So, we try to do an order-3 allocation.  kswapd runs and ... succeeds in
> > creating order-3 pages?  Or fails to?
> >
>
> Our team observed that most of the time we successfully obtain order-3
> memory, but the cost is excessive memory reclamation, since we end up
> over-reclaiming order-0 pages that could have remained in memory.
>
> > If it fails, that's something we need to sort out.
> >
> > If it succeeds, now we have several order-3 pages, great.  But where do
> > they all go that we need to run kswapd again?
>
> The network app keeps running and continues to issue new order-3 allocation
> requests, so those few order-3 pages won’t be enough to satisfy the
> continuous demand.

These pages are freed as order-3 pages, and should replenish the buddy
as if nothing happened.

I think you are missing something to control how much memory can be
pushed on each TCP socket?

What is tcp_wmem on your phones? What about tcp_mem?

Have you looked at /proc/sys/net/ipv4/tcp_notsent_lowat?
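
(For reference: tcp_wmem bounds per-socket send buffer autotuning, tcp_mem
sets the global TCP memory thresholds in pages, and tcp_notsent_lowat
limits how much not-yet-sent data a socket will queue, which together
bound how often skb_page_frag_refill() needs fresh pages.)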



* Re: [RFC PATCH] mm: net: disable kswapd for high-order network buffer allocation
  2025-10-14  3:58   ` Barry Song
@ 2025-10-14  5:07     ` Eric Dumazet
  2025-10-14  6:43       ` Barry Song
  0 siblings, 1 reply; 36+ messages in thread
From: Eric Dumazet @ 2025-10-14  5:07 UTC (permalink / raw)
  To: Barry Song
  Cc: corbet, davem, hannes, horms, jackmanb, kuba, kuniyu, linux-doc,
	linux-kernel, linux-mm, linyunsheng, mhocko, netdev, pabeni,
	surenb, v-songbaohua, vbabka, willemb, zhouhuacai, ziy

On Mon, Oct 13, 2025 at 8:58 PM Barry Song <21cnbao@gmail.com> wrote:
>
> > >
> > > diff --git a/Documentation/admin-guide/sysctl/net.rst b/Documentation/admin-guide/sysctl/net.rst
> > > index 2ef50828aff1..b903bbae239c 100644
> > > --- a/Documentation/admin-guide/sysctl/net.rst
> > > +++ b/Documentation/admin-guide/sysctl/net.rst
> > > @@ -415,18 +415,6 @@ GRO has decided not to coalesce, it is placed on a per-NAPI list. This
> > >  list is then passed to the stack when the number of segments reaches the
> > >  gro_normal_batch limit.
> > >
> > > -high_order_alloc_disable
> > > -------------------------
> > > -
> > > -By default the allocator for page frags tries to use high order pages (order-3
> > > -on x86). While the default behavior gives good results in most cases, some users
> > > -might have hit a contention in page allocations/freeing. This was especially
> > > -true on older kernels (< 5.14) when high-order pages were not stored on per-cpu
> > > -lists. This allows to opt-in for order-0 allocation instead but is now mostly of
> > > -historical importance.
> > > -
> >
> > The sysctl is quite useful for testing purposes, say on a freshly
> > booted host, with plenty of free memory.
> >
> > Also, having order-3 pages if possible is quite important for IOMMU use cases.
> >
> > Perhaps kswapd should have some kind of heuristic to not start if a
> > recent run has already happened.
>
> I don’t understand why it shouldn’t start when users continuously request
> order-3 allocations and ask kswapd to prepare order-3 memory — it doesn’t
> make sense logically to skip it just because earlier requests were already
> satisfied.
>
> >
> > I am guessing phones do not need to send 1.6 Tbit per second on
> > network devices (yet), so an option could be to disable it in your
> > boot scripts.
>
> A problem with the existing sysctl is that it only covers the TX path;
> for the RX path, we also observe that kswapd consumes significant power.
> I could add the patch below to make it support the RX path, but it feels
> like a bit of a layer violation, since the RX path code resides in mm
> and is intended to serve generic users rather than networking, even
> though the current callers are primarily network-related.

You might have a buggy driver.

High performance drivers use order-0 allocations only.



>
> diff --git a/mm/page_frag_cache.c b/mm/page_frag_cache.c
> index d2423f30577e..8ad18ec49f39 100644
> --- a/mm/page_frag_cache.c
> +++ b/mm/page_frag_cache.c
> @@ -18,6 +18,7 @@
>  #include <linux/init.h>
>  #include <linux/mm.h>
>  #include <linux/page_frag_cache.h>
> +#include <net/sock.h>
>  #include "internal.h"
>
>  static unsigned long encoded_page_create(struct page *page, unsigned int order,
> @@ -54,10 +55,12 @@ static struct page *__page_frag_cache_refill(struct page_frag_cache *nc,
>         gfp_t gfp = gfp_mask;
>
>  #if (PAGE_SIZE < PAGE_FRAG_CACHE_MAX_SIZE)
> -       gfp_mask = (gfp_mask & ~__GFP_DIRECT_RECLAIM) |  __GFP_COMP |
> -                  __GFP_NOWARN | __GFP_NORETRY | __GFP_NOMEMALLOC;
> -       page = __alloc_pages(gfp_mask, PAGE_FRAG_CACHE_MAX_ORDER,
> -                            numa_mem_id(), NULL);
> +       if (!static_branch_unlikely(&net_high_order_alloc_disable_key)) {
> +               gfp_mask = (gfp_mask & ~__GFP_DIRECT_RECLAIM) |  __GFP_COMP |
> +                       __GFP_NOWARN | __GFP_NORETRY | __GFP_NOMEMALLOC;
> +               page = __alloc_pages(gfp_mask, PAGE_FRAG_CACHE_MAX_ORDER,
> +                               numa_mem_id(), NULL);
> +       }
>  #endif
>         if (unlikely(!page)) {
>
>
> Do you have a better idea on how to make the sysctl also cover the RX path?
>
> Thanks
> Barry
>



* Re: [RFC PATCH] mm: net: disable kswapd for high-order network buffer allocation
  2025-10-14  5:07     ` Eric Dumazet
@ 2025-10-14  6:43       ` Barry Song
  2025-10-14  7:01         ` Eric Dumazet
  0 siblings, 1 reply; 36+ messages in thread
From: Barry Song @ 2025-10-14  6:43 UTC (permalink / raw)
  To: Eric Dumazet
  Cc: corbet, davem, hannes, horms, jackmanb, kuba, kuniyu, linux-doc,
	linux-kernel, linux-mm, linyunsheng, mhocko, netdev, pabeni,
	surenb, v-songbaohua, vbabka, willemb, zhouhuacai, ziy

> >
> > A problem with the existing sysctl is that it only covers the TX path;
> > for the RX path, we also observe that kswapd consumes significant power.
> > I could add the patch below to make it support the RX path, but it feels
> > like a bit of a layer violation, since the RX path code resides in mm
> > and is intended to serve generic users rather than networking, even
> > though the current callers are primarily network-related.
>
> You might have a buggy driver.

We are observing the RX path as follows:

do_softirq
    tasklet_hi_action
       kalPacketAlloc
           __netdev_alloc_skb
               page_frag_alloc_align
                   __page_frag_cache_refill

This appears to be a fairly common stack.

So it is a buggy driver?

>
> High performance drivers use order-0 allocations only.
>

Do you have an example of high-performance drivers that use only order-0 memory?

Thanks
Barry


^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: [RFC PATCH] mm: net: disable kswapd for high-order network buffer allocation
  2025-10-14  6:43       ` Barry Song
@ 2025-10-14  7:01         ` Eric Dumazet
  2025-10-14  8:17           ` Barry Song
  0 siblings, 1 reply; 36+ messages in thread
From: Eric Dumazet @ 2025-10-14  7:01 UTC (permalink / raw)
  To: Barry Song
  Cc: corbet, davem, hannes, horms, jackmanb, kuba, kuniyu, linux-doc,
	linux-kernel, linux-mm, linyunsheng, mhocko, netdev, pabeni,
	surenb, v-songbaohua, vbabka, willemb, zhouhuacai, ziy

On Mon, Oct 13, 2025 at 11:43 PM Barry Song <21cnbao@gmail.com> wrote:
>
> > >
> > > A problem with the existing sysctl is that it only covers the TX path;
> > > for the RX path, we also observe that kswapd consumes significant power.
> > > I could add the patch below to make it support the RX path, but it feels
> > > like a bit of a layer violation, since the RX path code resides in mm
> > > and is intended to serve generic users rather than networking, even
> > > though the current callers are primarily network-related.
> >
> > You might have a buggy driver.
>
> We are observing the RX path as follows:
>
> do_softirq
>     tasklet_hi_action
>        kalPacketAlloc
>            __netdev_alloc_skb
>                page_frag_alloc_align
>                    __page_frag_cache_refill
>
> This appears to be a fairly common stack.
>
> So it is a buggy driver?

No idea, kalPacketAlloc is not in upstream trees.

It apparently needs high order allocations. It will fail at some point.

>
> >
> > High performance drivers use order-0 allocations only.
> >
>
> Do you have an example of high-performance drivers that use only order-0 memory?

About all drivers using XDP, and/or using napi_get_frags()

XDP has been using order-0 pages from the very beginning.
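
For illustration, the order-0 pattern those drivers follow via the page_pool
API looks roughly like this (a minimal sketch with made-up sizes, not code
from any particular driver):

#include <linux/dma-mapping.h>
#include <linux/numa.h>
#include <net/page_pool/helpers.h>

/* Create an RX pool that only ever hands out order-0 pages. */
static struct page_pool *rx_create_pool(struct device *dev)
{
	struct page_pool_params pp = {
		.order		= 0,		/* order-0 by construction */
		.pool_size	= 256,		/* illustrative ring size */
		.nid		= NUMA_NO_NODE,
		.dev		= dev,
		.dma_dir	= DMA_FROM_DEVICE,
	};

	return page_pool_create(&pp);
}

/* RX refill: one order-0 page per descriptor, so the hot path never
 * issues a high-order request that could wake kswapd for order-3. */
static struct page *rx_refill_one(struct page_pool *pool)
{
	return page_pool_dev_alloc_pages(pool);
}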


^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: [RFC PATCH] mm: net: disable kswapd for high-order network buffer allocation
  2025-10-13 22:46   ` Roman Gushchin
  2025-10-14  4:31     ` Barry Song
@ 2025-10-14  7:24     ` Michal Hocko
  1 sibling, 0 replies; 36+ messages in thread
From: Michal Hocko @ 2025-10-14  7:24 UTC (permalink / raw)
  To: Roman Gushchin
  Cc: Vlastimil Babka, Barry Song, netdev, linux-mm, linux-doc,
	linux-kernel, Barry Song, Jonathan Corbet, Eric Dumazet,
	Kuniyuki Iwashima, Paolo Abeni, Willem de Bruijn,
	David S. Miller, Jakub Kicinski, Simon Horman,
	Suren Baghdasaryan, Brendan Jackman, Johannes Weiner, Zi Yan,
	Yunsheng Lin, Huacai Zhou, Alexei Starovoitov, Harry Yoo,
	David Hildenbrand, Matthew Wilcox

On Mon 13-10-25 15:46:54, Roman Gushchin wrote:
> Vlastimil Babka <vbabka@suse.cz> writes:
> 
> > On 10/13/25 12:16, Barry Song wrote:
[...]
> >> An alternative approach is to disable kswapd for these frequent
> >> allocations and provide best-effort order-3 service for both TX and RX paths,
> >> while removing the sysctl entirely.
> 
> I'm not sure this is the right path long-term. There are significant
> benefits associated with using larger pages, so making the kernel fall
> back to order-0 pages easier and sooner feels wrong, tbh. Without kswapd
> trying to defragment memory, the only other option is to force tasks
> into the direct compaction and it's known to be problematic.
> 
> I wonder if instead we should look into optimizing kswapd to be less
> power-hungry?

Exactly. If your specific needs prefer low power consumption to higher-order
page availability, then we should have a more flexible way to say
that than a hardcoded allocation mode. We should be able to tell
kswapd/kcompactd how much to try for those allocations.
-- 
Michal Hocko
SUSE Labs


^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: [RFC PATCH] mm: net: disable kswapd for high-order network buffer allocation
  2025-10-13 18:30 ` Vlastimil Babka
  2025-10-13 21:35   ` Shakeel Butt
  2025-10-13 22:46   ` Roman Gushchin
@ 2025-10-14  7:26   ` Michal Hocko
  2025-10-14  8:08     ` Barry Song
  2025-10-14 14:27     ` Shakeel Butt
  2 siblings, 2 replies; 36+ messages in thread
From: Michal Hocko @ 2025-10-14  7:26 UTC (permalink / raw)
  To: Vlastimil Babka
  Cc: Barry Song, netdev, linux-mm, linux-doc, linux-kernel,
	Barry Song, Jonathan Corbet, Eric Dumazet, Kuniyuki Iwashima,
	Paolo Abeni, Willem de Bruijn, David S. Miller, Jakub Kicinski,
	Simon Horman, Suren Baghdasaryan, Brendan Jackman,
	Johannes Weiner, Zi Yan, Yunsheng Lin, Huacai Zhou,
	Alexei Starovoitov, Harry Yoo, David Hildenbrand, Matthew Wilcox,
	Roman Gushchin

On Mon 13-10-25 20:30:13, Vlastimil Babka wrote:
> On 10/13/25 12:16, Barry Song wrote:
> > From: Barry Song <v-songbaohua@oppo.com>
[...]
> I wonder if we should either:
> 
> 1) sacrifice a new __GFP flag specifically for "!allow_spin" case to
> determine it precisely.

As said in other reply I do not think this is a good fit for this
specific case as it is all or nothing approach. Soon enough we discover
that "no effort to reclaim/compact" hurts other usecases. So I do not
think we need a dedicated flag for this specific case. We need a way to
tell kswapd/kcompactd how much to try instead.
-- 
Michal Hocko
SUSE Labs


^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: [RFC PATCH] mm: net: disable kswapd for high-order network buffer allocation
  2025-10-14  7:26   ` Michal Hocko
@ 2025-10-14  8:08     ` Barry Song
  2025-10-14 14:27     ` Shakeel Butt
  1 sibling, 0 replies; 36+ messages in thread
From: Barry Song @ 2025-10-14  8:08 UTC (permalink / raw)
  To: mhocko
  Cc: 21cnbao, alexei.starovoitov, corbet, davem, david, edumazet,
	hannes, harry.yoo, horms, jackmanb, kuba, kuniyu, linux-doc,
	linux-kernel, linux-mm, linyunsheng, netdev, pabeni,
	roman.gushchin, surenb, v-songbaohua, vbabka, willemb, willy,
	zhouhuacai, ziy, baolin.wang

On Tue, Oct 14, 2025 at 3:26 PM Michal Hocko <mhocko@suse.com> wrote:
>
> On Mon 13-10-25 20:30:13, Vlastimil Babka wrote:
> > On 10/13/25 12:16, Barry Song wrote:
> > > From: Barry Song <v-songbaohua@oppo.com>
> [...]
> > I wonder if we should either:
> >
> > 1) sacrifice a new __GFP flag specifically for "!allow_spin" case to
> > determine it precisely.
>
> As said in other reply I do not think this is a good fit for this
> specific case as it is all or nothing approach. Soon enough we discover
> that "no effort to reclaim/compact" hurts other usecases. So I do not
> think we need a dedicated flag for this specific case. We need a way to
> tell kswapd/kcompactd how much to try instead.

+Baolin, who may have observed the same issue.

An issue with vmscan is that kcompactd is woken up very late, only after
reclaiming a large number of order-0 pages to satisfy an order-3
allocation.

static int balance_pgdat(pg_data_t *pgdat, int order, int highest_zoneidx)
{

...
                balanced = pgdat_balanced(pgdat, sc.order, highest_zoneidx);
                if (!balanced && nr_boost_reclaim) {
                        nr_boost_reclaim = 0;
                        goto restart;
                }

                /*
                 * If boosting is not active then only reclaim if there are no
                 * eligible zones. Note that sc.reclaim_idx is not used as
                 * buffer_heads_over_limit may have adjusted it.
                 */
                if (!nr_boost_reclaim && balanced)
                        goto out;
...
                if (kswapd_shrink_node(pgdat, &sc))
                        raise_priority = false;
...

out:

                ...
                /*
                 * As there is now likely space, wakeup kcompact to defragment
                 * pageblocks.
                 */
                wakeup_kcompactd(pgdat, pageblock_order, highest_zoneidx);
}

As pgdat_balanced() needs at least one free order-3 page to return true:

bool __zone_watermark_ok(struct zone *z, unsigned int order, unsigned long mark,
                         int highest_zoneidx, unsigned int alloc_flags,
                         long free_pages)
{
        ...  
        if (free_pages <= min + z->lowmem_reserve[highest_zoneidx])
                return false;

        /* If this is an order-0 request then the watermark is fine */
        if (!order)
                return true;

        /* For a high-order request, check at least one suitable page is free */
        for (o = order; o < NR_PAGE_ORDERS; o++) {
                struct free_area *area = &z->free_area[o];
                int mt;

                if (!area->nr_free)
                        continue;

                for (mt = 0; mt < MIGRATE_PCPTYPES; mt++) {
                        if (!free_area_empty(area, mt)) 
                                return true;
                }    

#ifdef CONFIG_CMA
                if ((alloc_flags & ALLOC_CMA) &&
                    !free_area_empty(area, MIGRATE_CMA)) {
                        return true;
                }    
#endif
                if ((alloc_flags & (ALLOC_HIGHATOMIC|ALLOC_OOM)) &&
                    !free_area_empty(area, MIGRATE_HIGHATOMIC)) {
                        return true;
                }
        }

        return false;
}

This appears to be incorrect and will always lead to over-reclamation of
order-0 pages to satisfy high-order allocations.

I wonder if we should "goto out" earlier to wake up kcompactd when there
is plenty of memory available, even if no order-3 pages exist.

Conceptually, what I mean is:

diff --git a/mm/vmscan.c b/mm/vmscan.c
index c80fcae7f2a1..d0e03066bbaa 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -7057,9 +7057,8 @@ static int balance_pgdat(pg_data_t *pgdat, int order, int highest_zoneidx)
                 * eligible zones. Note that sc.reclaim_idx is not used as
                 * buffer_heads_over_limit may have adjusted it.
                 */
-               if (!nr_boost_reclaim && balanced)
+               if (!nr_boost_reclaim && (balanced || we_have_plenty_memory_to_compact()))
                        goto out;

                /* Limit the priority of boosting to avoid reclaim writeback */
                if (nr_boost_reclaim && sc.priority == DEF_PRIORITY - 2)
                        raise_priority = false;


Thanks
Barry


^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: [RFC PATCH] mm: net: disable kswapd for high-order network buffer allocation
  2025-10-14  7:01         ` Eric Dumazet
@ 2025-10-14  8:17           ` Barry Song
  2025-10-14  8:25             ` Eric Dumazet
  0 siblings, 1 reply; 36+ messages in thread
From: Barry Song @ 2025-10-14  8:17 UTC (permalink / raw)
  To: Eric Dumazet
  Cc: corbet, davem, hannes, horms, jackmanb, kuba, kuniyu, linux-doc,
	linux-kernel, linux-mm, linyunsheng, mhocko, netdev, pabeni,
	surenb, v-songbaohua, vbabka, willemb, zhouhuacai, ziy

On Tue, Oct 14, 2025 at 3:01 PM Eric Dumazet <edumazet@google.com> wrote:
>
> On Mon, Oct 13, 2025 at 11:43 PM Barry Song <21cnbao@gmail.com> wrote:
> >
> > > >
> > > > A problem with the existing sysctl is that it only covers the TX path;
> > > > for the RX path, we also observe that kswapd consumes significant power.
> > > > I could add the patch below to make it support the RX path, but it feels
> > > > like a bit of a layer violation, since the RX path code resides in mm
> > > > and is intended to serve generic users rather than networking, even
> > > > though the current callers are primarily network-related.
> > >
> > > You might have a buggy driver.
> >
> > We are observing the RX path as follows:
> >
> > do_softirq
> >     tasklet_hi_action
> >        kalPacketAlloc
> >            __netdev_alloc_skb
> >                page_frag_alloc_align
> >                    __page_frag_cache_refill
> >
> > This appears to be a fairly common stack.
> >
> > So it is a buggy driver?
>
> No idea, kalPacketAlloc is not in upstream trees.
>
> It apparently needs high order allocations. It will fail at some point.
>
> >
> > >
> > > High performance drivers use order-0 allocations only.
> > >
> >
> > Do you have an example of high-performance drivers that use only order-0 memory?
>
> About all drivers using XDP, and/or using napi_get_frags()
>
> XDP has been using order-0 pages from the very beginning.

Thanks! But there are still many drivers using netdev_alloc_skb()—we
shouldn’t overlook them, right?

net % git grep netdev_alloc_skb | wc -l
     359

Thanks
Barry


^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: [RFC PATCH] mm: net: disable kswapd for high-order network buffer allocation
  2025-10-14  8:17           ` Barry Song
@ 2025-10-14  8:25             ` Eric Dumazet
  0 siblings, 0 replies; 36+ messages in thread
From: Eric Dumazet @ 2025-10-14  8:25 UTC (permalink / raw)
  To: Barry Song
  Cc: corbet, davem, hannes, horms, jackmanb, kuba, kuniyu, linux-doc,
	linux-kernel, linux-mm, linyunsheng, mhocko, netdev, pabeni,
	surenb, v-songbaohua, vbabka, willemb, zhouhuacai, ziy

On Tue, Oct 14, 2025 at 1:17 AM Barry Song <21cnbao@gmail.com> wrote:
>
> On Tue, Oct 14, 2025 at 3:01 PM Eric Dumazet <edumazet@google.com> wrote:
> >
> > On Mon, Oct 13, 2025 at 11:43 PM Barry Song <21cnbao@gmail.com> wrote:
> > >
> > > > >
> > > > > A problem with the existing sysctl is that it only covers the TX path;
> > > > > for the RX path, we also observe that kswapd consumes significant power.
> > > > > I could add the patch below to make it support the RX path, but it feels
> > > > > like a bit of a layer violation, since the RX path code resides in mm
> > > > > and is intended to serve generic users rather than networking, even
> > > > > though the current callers are primarily network-related.
> > > >
> > > > You might have a buggy driver.
> > >
> > > We are observing the RX path as follows:
> > >
> > > do_softirq
> > >     tasklet_hi_action
> > >        kalPacketAlloc
> > >            __netdev_alloc_skb
> > >                page_frag_alloc_align
> > >                    __page_frag_cache_refill
> > >
> > > This appears to be a fairly common stack.
> > >
> > > So it is a buggy driver?
> >
> > No idea, kalPacketAlloc is not in upstream trees.
> >
> > It apparently needs high order allocations. It will fail at some point.
> >
> > >
> > > >
> > > > High performance drivers use order-0 allocations only.
> > > >
> > >
> > > Do you have an example of high-performance drivers that use only order-0 memory?
> >
> > About all drivers using XDP, and/or using napi_get_frags()
> >
> > XDP has been using order-0 pages from the very beginning.
>
> Thanks! But there are still many drivers using netdev_alloc_skb()—we
> shouldn’t overlook them, right?
>
> net % git grep netdev_alloc_skb | wc -l
>      359

Only the ones that are using 16KB allocations like some WAN drivers :)

Some networks use MTU=9000

If the hardware does not provide SG support on receive, a kmalloc()-based
allocation will use 16KB of memory.

By using a frag allocator, we can pack 3 allocations per 32KB instead of 2.

TCP can go 50% faster.

If memory is short, it will fail no matter what.
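
Back-of-the-envelope, assuming 4K pages (figures illustrative): an MTU-9000
frame needs ~9000 bytes of packet data plus shared-info overhead, so a bit
over 9KB per buffer. kmalloc() rounds that up to the next power of two,
16KB, so 32KB holds only 2 such buffers, while a page-frag cache can carve
the same order-3 32KB chunk into 3 of them; that is where the 3-vs-2
packing and the ~50% figure come from.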


^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: [RFC PATCH] mm: net: disable kswapd for high-order network buffer allocation
  2025-10-14  5:04     ` Eric Dumazet
@ 2025-10-14  8:58       ` Barry Song
  2025-10-14  9:49         ` Eric Dumazet
  0 siblings, 1 reply; 36+ messages in thread
From: Barry Song @ 2025-10-14  8:58 UTC (permalink / raw)
  To: Eric Dumazet
  Cc: Matthew Wilcox, netdev, linux-mm, linux-doc, linux-kernel,
	Barry Song, Jonathan Corbet, Kuniyuki Iwashima, Paolo Abeni,
	Willem de Bruijn, David S. Miller, Jakub Kicinski, Simon Horman,
	Vlastimil Babka, Suren Baghdasaryan, Michal Hocko,
	Brendan Jackman, Johannes Weiner, Zi Yan, Yunsheng Lin,
	Huacai Zhou

On Tue, Oct 14, 2025 at 1:04 PM Eric Dumazet <edumazet@google.com> wrote:
>
> On Mon, Oct 13, 2025 at 9:09 PM Barry Song <21cnbao@gmail.com> wrote:
> >
> > On Tue, Oct 14, 2025 at 5:56 AM Matthew Wilcox <willy@infradead.org> wrote:
> > >
> > > On Mon, Oct 13, 2025 at 06:16:36PM +0800, Barry Song wrote:
> > > > On phones, we have observed significant phone heating when running apps
> > > > with high network bandwidth. This is caused by the network stack frequently
> > > > waking kswapd for order-3 allocations. As a result, memory reclamation becomes
> > > > constantly active, even though plenty of memory is still available for network
> > > > allocations which can fall back to order-0.
> > >
> > > I think we need to understand what's going on here a whole lot more than
> > > this!
> > >
> > > So, we try to do an order-3 allocation.  kswapd runs and ... succeeds in
> > > creating order-3 pages?  Or fails to?
> > >
> >
> > Our team observed that most of the time we successfully obtain order-3
> > memory, but the cost is excessive memory reclamation, since we end up
> > over-reclaiming order-0 pages that could have remained in memory.
> >
> > > If it fails, that's something we need to sort out.
> > >
> > > If it succeeds, now we have several order-3 pages, great.  But where do
> > > they all go that we need to run kswapd again?
> >
> > The network app keeps running and continues to issue new order-3 allocation
> > requests, so those few order-3 pages won’t be enough to satisfy the
> > continuous demand.
>
> These pages are freed as order-3 pages, and should replenish the buddy
> as if nothing happened.

Ideally, that would be the case if the workload were simple. However, the
system may have many other processes and kernel drivers running
simultaneously, also consuming memory from the buddy allocator and possibly
taking the replenished pages. As a result, we can still observe multiple
kswapd wakeups and instances of over-reclamation caused by the network
stack’s high-order allocations.

>
> I think you are missing something to control how much memory  can be
> pushed on each TCP socket ?
>
> What is tcp_wmem on your phones ? What about tcp_mem ?
>
> Have you looked at /proc/sys/net/ipv4/tcp_notsent_lowat

# cat /proc/sys/net/ipv4/tcp_wmem
524288  1048576 6710886

# cat /proc/sys/net/ipv4/tcp_mem
131220  174961  262440

# cat /proc/sys/net/ipv4/tcp_notsent_lowat
4294967295

Any thoughts on these settings?

Thanks
Barry


^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: [RFC PATCH] mm: net: disable kswapd for high-order network buffer allocation
  2025-10-14  8:58       ` Barry Song
@ 2025-10-14  9:49         ` Eric Dumazet
  2025-10-14 10:19           ` Barry Song
  0 siblings, 1 reply; 36+ messages in thread
From: Eric Dumazet @ 2025-10-14  9:49 UTC (permalink / raw)
  To: Barry Song
  Cc: Matthew Wilcox, netdev, linux-mm, linux-doc, linux-kernel,
	Barry Song, Jonathan Corbet, Kuniyuki Iwashima, Paolo Abeni,
	Willem de Bruijn, David S. Miller, Jakub Kicinski, Simon Horman,
	Vlastimil Babka, Suren Baghdasaryan, Michal Hocko,
	Brendan Jackman, Johannes Weiner, Zi Yan, Yunsheng Lin,
	Huacai Zhou

On Tue, Oct 14, 2025 at 1:58 AM Barry Song <21cnbao@gmail.com> wrote:
>
> On Tue, Oct 14, 2025 at 1:04 PM Eric Dumazet <edumazet@google.com> wrote:
> >
> > On Mon, Oct 13, 2025 at 9:09 PM Barry Song <21cnbao@gmail.com> wrote:
> > >
> > > On Tue, Oct 14, 2025 at 5:56 AM Matthew Wilcox <willy@infradead.org> wrote:
> > > >
> > > > On Mon, Oct 13, 2025 at 06:16:36PM +0800, Barry Song wrote:
> > > > > On phones, we have observed significant phone heating when running apps
> > > > > with high network bandwidth. This is caused by the network stack frequently
> > > > > waking kswapd for order-3 allocations. As a result, memory reclamation becomes
> > > > > constantly active, even though plenty of memory is still available for network
> > > > > allocations which can fall back to order-0.
> > > >
> > > > I think we need to understand what's going on here a whole lot more than
> > > > this!
> > > >
> > > > So, we try to do an order-3 allocation.  kswapd runs and ... succeeds in
> > > > creating order-3 pages?  Or fails to?
> > > >
> > >
> > > Our team observed that most of the time we successfully obtain order-3
> > > memory, but the cost is excessive memory reclamation, since we end up
> > > over-reclaiming order-0 pages that could have remained in memory.
> > >
> > > > If it fails, that's something we need to sort out.
> > > >
> > > > If it succeeds, now we have several order-3 pages, great.  But where do
> > > > they all go that we need to run kswapd again?
> > >
> > > The network app keeps running and continues to issue new order-3 allocation
> > > requests, so those few order-3 pages won’t be enough to satisfy the
> > > continuous demand.
> >
> > These pages are freed as order-3 pages, and should replenish the buddy
> > as if nothing happened.
>
> Ideally, that would be the case if the workload were simple. However, the
> system may have many other processes and kernel drivers running
> simultaneously, also consuming memory from the buddy allocator and possibly
> taking the replenished pages. As a result, we can still observe multiple
> kswapd wakeups and instances of over-reclamation caused by the network
> stack’s high-order allocations.
>
> >
> > I think you are missing something to control how much memory  can be
> > pushed on each TCP socket ?
> >
> > What is tcp_wmem on your phones ? What about tcp_mem ?
> >
> > Have you looked at /proc/sys/net/ipv4/tcp_notsent_lowat
>
> # cat /proc/sys/net/ipv4/tcp_wmem
> 524288  1048576 6710886

Ouch. That is an insane tcp_wmem[0].

Please stick to 4096, or risk OOM of various sorts.

>
> # cat /proc/sys/net/ipv4/tcp_notsent_lowat
> 4294967295
>
> Any thoughts on these settings?

Please look at
https://www.kernel.org/doc/Documentation/networking/ip-sysctl.txt

tcp_notsent_lowat - UNSIGNED INTEGER
A TCP socket can control the amount of unsent bytes in its write queue,
thanks to TCP_NOTSENT_LOWAT socket option. poll()/select()/epoll()
reports POLLOUT events if the amount of unsent bytes is below a per
socket value, and if the write queue is not full. sendmsg() will
also not add new buffers if the limit is hit.

This global variable controls the amount of unsent data for
sockets not using TCP_NOTSENT_LOWAT. For these sockets, a change
to the global variable has immediate effect.


Setting this sysctl to 2MB can effectively reduce the amount of memory
in TCP write queues by 66 %,
or allow you to increase tcp_wmem[2] so that only flows needing big
BDP can get it.
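
For example, applying the above (illustrative values only; the tcp_wmem
triple keeps your reported values for the other two fields):

# echo "4096 1048576 6710886" > /proc/sys/net/ipv4/tcp_wmem
# echo 2097152 > /proc/sys/net/ipv4/tcp_notsent_lowat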


^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: [RFC PATCH] mm: net: disable kswapd for high-order network buffer allocation
  2025-10-14  9:49         ` Eric Dumazet
@ 2025-10-14 10:19           ` Barry Song
  2025-10-14 10:39             ` Eric Dumazet
  2025-10-14 14:37             ` Shakeel Butt
  0 siblings, 2 replies; 36+ messages in thread
From: Barry Song @ 2025-10-14 10:19 UTC (permalink / raw)
  To: Eric Dumazet
  Cc: Matthew Wilcox, netdev, linux-mm, linux-doc, linux-kernel,
	Barry Song, Jonathan Corbet, Kuniyuki Iwashima, Paolo Abeni,
	Willem de Bruijn, David S. Miller, Jakub Kicinski, Simon Horman,
	Vlastimil Babka, Suren Baghdasaryan, Michal Hocko,
	Brendan Jackman, Johannes Weiner, Zi Yan, Yunsheng Lin,
	Huacai Zhou

> >
> > >
> > > I think you are missing something to control how much memory  can be
> > > pushed on each TCP socket ?
> > >
> > > What is tcp_wmem on your phones ? What about tcp_mem ?
> > >
> > > Have you looked at /proc/sys/net/ipv4/tcp_notsent_lowat
> >
> > # cat /proc/sys/net/ipv4/tcp_wmem
> > 524288  1048576 6710886
>
> Ouch. That is insane tcp_wmem[0] .
>
> Please stick to 4096, or risk OOM of various sorts.
>
> >
> > # cat /proc/sys/net/ipv4/tcp_notsent_lowat
> > 4294967295
> >
> > Any thoughts on these settings?
>
> Please look at
> https://www.kernel.org/doc/Documentation/networking/ip-sysctl.txt
>
> tcp_notsent_lowat - UNSIGNED INTEGER
> A TCP socket can control the amount of unsent bytes in its write queue,
> thanks to TCP_NOTSENT_LOWAT socket option. poll()/select()/epoll()
> reports POLLOUT events if the amount of unsent bytes is below a per
> socket value, and if the write queue is not full. sendmsg() will
> also not add new buffers if the limit is hit.
>
> This global variable controls the amount of unsent data for
> sockets not using TCP_NOTSENT_LOWAT. For these sockets, a change
> to the global variable has immediate effect.
>
>
> Setting this sysctl to 2MB can effectively reduce the amount of memory
> in TCP write queues by 66 %,
> or allow you to increase tcp_wmem[2] so that only flows needing big
> BDP can get it.

We obtained these settings from our hardware vendors.

It might be worth exploring these settings further, but I can’t quite see
their connection to high-order allocations, since the high-order allocation
size is fixed by kernel macros.

#define SKB_FRAG_PAGE_ORDER     get_order(32768)
#define PAGE_FRAG_CACHE_MAX_SIZE        __ALIGN_MASK(32768, ~PAGE_MASK)
#define PAGE_FRAG_CACHE_MAX_ORDER       get_order(PAGE_FRAG_CACHE_MAX_SIZE)

Is there anything I’m missing?

Thanks
Barry


^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: [RFC PATCH] mm: net: disable kswapd for high-order network buffer allocation
  2025-10-14 10:19           ` Barry Song
@ 2025-10-14 10:39             ` Eric Dumazet
  2025-10-14 20:17               ` Barry Song
  2025-10-14 14:37             ` Shakeel Butt
  1 sibling, 1 reply; 36+ messages in thread
From: Eric Dumazet @ 2025-10-14 10:39 UTC (permalink / raw)
  To: Barry Song
  Cc: Matthew Wilcox, netdev, linux-mm, linux-doc, linux-kernel,
	Barry Song, Jonathan Corbet, Kuniyuki Iwashima, Paolo Abeni,
	Willem de Bruijn, David S. Miller, Jakub Kicinski, Simon Horman,
	Vlastimil Babka, Suren Baghdasaryan, Michal Hocko,
	Brendan Jackman, Johannes Weiner, Zi Yan, Yunsheng Lin,
	Huacai Zhou

On Tue, Oct 14, 2025 at 3:19 AM Barry Song <21cnbao@gmail.com> wrote:
>
> > >
> > > >
> > > > I think you are missing something to control how much memory  can be
> > > > pushed on each TCP socket ?
> > > >
> > > > What is tcp_wmem on your phones ? What about tcp_mem ?
> > > >
> > > > Have you looked at /proc/sys/net/ipv4/tcp_notsent_lowat
> > >
> > > # cat /proc/sys/net/ipv4/tcp_wmem
> > > 524288  1048576 6710886
> >
> > Ouch. That is insane tcp_wmem[0] .
> >
> > Please stick to 4096, or risk OOM of various sorts.
> >
> > >
> > > # cat /proc/sys/net/ipv4/tcp_notsent_lowat
> > > 4294967295
> > >
> > > Any thoughts on these settings?
> >
> > Please look at
> > https://www.kernel.org/doc/Documentation/networking/ip-sysctl.txt
> >
> > tcp_notsent_lowat - UNSIGNED INTEGER
> > A TCP socket can control the amount of unsent bytes in its write queue,
> > thanks to TCP_NOTSENT_LOWAT socket option. poll()/select()/epoll()
> > reports POLLOUT events if the amount of unsent bytes is below a per
> > socket value, and if the write queue is not full. sendmsg() will
> > also not add new buffers if the limit is hit.
> >
> > This global variable controls the amount of unsent data for
> > sockets not using TCP_NOTSENT_LOWAT. For these sockets, a change
> > to the global variable has immediate effect.
> >
> >
> > Setting this sysctl to 2MB can effectively reduce the amount of memory
> > in TCP write queues by 66 %,
> > or allow you to increase tcp_wmem[2] so that only flows needing big
> > BDP can get it.
>
> We obtained these settings from our hardware vendors.

Tell them they are wrong.

>
> It might be worth exploring these settings further, but I can’t quite see
> their connection to high-order allocations, since the high-order allocation
> size is fixed by kernel macros.
>
> #define SKB_FRAG_PAGE_ORDER     get_order(32768)
> #define PAGE_FRAG_CACHE_MAX_SIZE        __ALIGN_MASK(32768, ~PAGE_MASK)
> #define PAGE_FRAG_CACHE_MAX_ORDER       get_order(PAGE_FRAG_CACHE_MAX_SIZE)
>
> Is there anything I’m missing?

What is your question exactly? You read these macros just fine. What
is your point?

We had in the past something dynamic that we removed

commit d9b2938aabf757da2d40153489b251d4fc3fdd18
Author: Eric Dumazet <edumazet@google.com>
Date:   Wed Aug 27 20:49:34 2014 -0700

    net: attempt a single high order allocation


^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: [RFC PATCH] mm: net: disable kswapd for high-order network buffer allocation
  2025-10-14  7:26   ` Michal Hocko
  2025-10-14  8:08     ` Barry Song
@ 2025-10-14 14:27     ` Shakeel Butt
  2025-10-14 15:14       ` Michal Hocko
  1 sibling, 1 reply; 36+ messages in thread
From: Shakeel Butt @ 2025-10-14 14:27 UTC (permalink / raw)
  To: Michal Hocko
  Cc: Vlastimil Babka, Barry Song, netdev, linux-mm, linux-doc,
	linux-kernel, Barry Song, Jonathan Corbet, Eric Dumazet,
	Kuniyuki Iwashima, Paolo Abeni, Willem de Bruijn,
	David S. Miller, Jakub Kicinski, Simon Horman,
	Suren Baghdasaryan, Brendan Jackman, Johannes Weiner, Zi Yan,
	Yunsheng Lin, Huacai Zhou, Alexei Starovoitov, Harry Yoo,
	David Hildenbrand, Matthew Wilcox, Roman Gushchin

On Tue, Oct 14, 2025 at 09:26:49AM +0200, Michal Hocko wrote:
> On Mon 13-10-25 20:30:13, Vlastimil Babka wrote:
> > On 10/13/25 12:16, Barry Song wrote:
> > > From: Barry Song <v-songbaohua@oppo.com>
> [...]
> > I wonder if we should either:
> > 
> > 1) sacrifice a new __GFP flag specifically for "!allow_spin" case to
> > determine it precisely.
> 
> As said in other reply I do not think this is a good fit for this
> specific case as it is all or nothing approach. Soon enough we discover
> that "no effort to reclaim/compact" hurts other usecases. So I do not
> think we need a dedicated flag for this specific case. We need a way to
> tell kswapd/kcompactd how much to try instead.

To me this new flag is to decouple two orthogonal requests, i.e. no-lock
semantics and "don't wake up kswapd". At the moment the lack of the kswapd
gfp flag conveys the no-lock semantics. This can lead to unintended usage
of no-lock semantics by users who, for whatever reason, don't want to
wake up kswapd.


^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: [RFC PATCH] mm: net: disable kswapd for high-order network buffer allocation
  2025-10-14 10:19           ` Barry Song
  2025-10-14 10:39             ` Eric Dumazet
@ 2025-10-14 14:37             ` Shakeel Butt
  2025-10-14 20:28               ` Barry Song
  1 sibling, 1 reply; 36+ messages in thread
From: Shakeel Butt @ 2025-10-14 14:37 UTC (permalink / raw)
  To: Barry Song
  Cc: Eric Dumazet, Matthew Wilcox, netdev, linux-mm, linux-doc,
	linux-kernel, Barry Song, Jonathan Corbet, Kuniyuki Iwashima,
	Paolo Abeni, Willem de Bruijn, David S. Miller, Jakub Kicinski,
	Simon Horman, Vlastimil Babka, Suren Baghdasaryan, Michal Hocko,
	Brendan Jackman, Johannes Weiner, Zi Yan, Yunsheng Lin,
	Huacai Zhou

On Tue, Oct 14, 2025 at 06:19:05PM +0800, Barry Song wrote:
> > >
> > > >
> > > > I think you are missing something to control how much memory  can be
> > > > pushed on each TCP socket ?
> > > >
> > > > What is tcp_wmem on your phones ? What about tcp_mem ?
> > > >
> > > > Have you looked at /proc/sys/net/ipv4/tcp_notsent_lowat
> > >
> > > # cat /proc/sys/net/ipv4/tcp_wmem
> > > 524288  1048576 6710886
> >
> > Ouch. That is insane tcp_wmem[0] .
> >
> > Please stick to 4096, or risk OOM of various sorts.
> >
> > >
> > > # cat /proc/sys/net/ipv4/tcp_notsent_lowat
> > > 4294967295
> > >
> > > Any thoughts on these settings?
> >
> > Please look at
> > https://www.kernel.org/doc/Documentation/networking/ip-sysctl.txt
> >
> > tcp_notsent_lowat - UNSIGNED INTEGER
> > A TCP socket can control the amount of unsent bytes in its write queue,
> > thanks to TCP_NOTSENT_LOWAT socket option. poll()/select()/epoll()
> > reports POLLOUT events if the amount of unsent bytes is below a per
> > socket value, and if the write queue is not full. sendmsg() will
> > also not add new buffers if the limit is hit.
> >
> > This global variable controls the amount of unsent data for
> > sockets not using TCP_NOTSENT_LOWAT. For these sockets, a change
> > to the global variable has immediate effect.
> >
> >
> > Setting this sysctl to 2MB can effectively reduce the amount of memory
> > in TCP write queues by 66 %,
> > or allow you to increase tcp_wmem[2] so that only flows needing big
> > BDP can get it.
> 
> We obtained these settings from our hardware vendors.
> 
> It might be worth exploring these settings further, but I can’t quite see
> their connection to high-order allocations,

I don't think there is a connection between them. Is there a reason you
are expecting a connection/relation between them?


^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: [RFC PATCH] mm: net: disable kswapd for high-order network buffer allocation
  2025-10-14 14:27     ` Shakeel Butt
@ 2025-10-14 15:14       ` Michal Hocko
  2025-10-14 17:22         ` Shakeel Butt
  0 siblings, 1 reply; 36+ messages in thread
From: Michal Hocko @ 2025-10-14 15:14 UTC (permalink / raw)
  To: Shakeel Butt
  Cc: Vlastimil Babka, Barry Song, netdev, linux-mm, linux-doc,
	linux-kernel, Barry Song, Jonathan Corbet, Eric Dumazet,
	Kuniyuki Iwashima, Paolo Abeni, Willem de Bruijn,
	David S. Miller, Jakub Kicinski, Simon Horman,
	Suren Baghdasaryan, Brendan Jackman, Johannes Weiner, Zi Yan,
	Yunsheng Lin, Huacai Zhou, Alexei Starovoitov, Harry Yoo,
	David Hildenbrand, Matthew Wilcox, Roman Gushchin

On Tue 14-10-25 07:27:06, Shakeel Butt wrote:
> On Tue, Oct 14, 2025 at 09:26:49AM +0200, Michal Hocko wrote:
> > On Mon 13-10-25 20:30:13, Vlastimil Babka wrote:
> > > On 10/13/25 12:16, Barry Song wrote:
> > > > From: Barry Song <v-songbaohua@oppo.com>
> > [...]
> > > I wonder if we should either:
> > > 
> > > 1) sacrifice a new __GFP flag specifically for "!allow_spin" case to
> > > determine it precisely.
> > 
> > As said in other reply I do not think this is a good fit for this
> > specific case as it is all or nothing approach. Soon enough we discover
> > that "no effort to reclaim/compact" hurts other usecases. So I do not
> > think we need a dedicated flag for this specific case. We need a way to
> > tell kswapd/kcompactd how much to try instead.
> 
> To me this new flag is to decouple two orthogonal requests, i.e. no-lock
> semantics and "don't wake up kswapd". At the moment the lack of the kswapd
> gfp flag conveys the no-lock semantics. This can lead to unintended usage
> of no-lock semantics by users who, for whatever reason, don't want to
> wake up kswapd.

I would argue that callers should have no business in saying whether
the MM should wake up kswapd or not. The flag name currently suggests
that but that is mostly for historic reasons. A random page allocator
user shouldn't really care about this low level detail, really.
-- 
Michal Hocko
SUSE Labs


^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: [RFC PATCH] mm: net: disable kswapd for high-order network buffer allocation
  2025-10-14 15:14       ` Michal Hocko
@ 2025-10-14 17:22         ` Shakeel Butt
  2025-10-15  6:21           ` Michal Hocko
  0 siblings, 1 reply; 36+ messages in thread
From: Shakeel Butt @ 2025-10-14 17:22 UTC (permalink / raw)
  To: Michal Hocko
  Cc: Vlastimil Babka, Barry Song, netdev, linux-mm, linux-doc,
	linux-kernel, Barry Song, Jonathan Corbet, Eric Dumazet,
	Kuniyuki Iwashima, Paolo Abeni, Willem de Bruijn,
	David S. Miller, Jakub Kicinski, Simon Horman,
	Suren Baghdasaryan, Brendan Jackman, Johannes Weiner, Zi Yan,
	Yunsheng Lin, Huacai Zhou, Alexei Starovoitov, Harry Yoo,
	David Hildenbrand, Matthew Wilcox, Roman Gushchin

On Tue, Oct 14, 2025 at 05:14:47PM +0200, Michal Hocko wrote:
> On Tue 14-10-25 07:27:06, Shakeel Butt wrote:
> > On Tue, Oct 14, 2025 at 09:26:49AM +0200, Michal Hocko wrote:
> > > On Mon 13-10-25 20:30:13, Vlastimil Babka wrote:
> > > > On 10/13/25 12:16, Barry Song wrote:
> > > > > From: Barry Song <v-songbaohua@oppo.com>
> > > [...]
> > > > I wonder if we should either:
> > > > 
> > > > 1) sacrifice a new __GFP flag specifically for "!allow_spin" case to
> > > > determine it precisely.
> > > 
> > > As said in other reply I do not think this is a good fit for this
> > > specific case as it is all or nothing approach. Soon enough we discover
> > > that "no effort to reclaim/compact" hurts other usecases. So I do not
> > > think we need a dedicated flag for this specific case. We need a way to
> > > tell kswapd/kcompactd how much to try instead.
> > 
> > To me this new flag is to decouple two orthogonal requests, i.e. no-lock
> > semantics and "don't wake up kswapd". At the moment the lack of the kswapd
> > gfp flag conveys the no-lock semantics. This can lead to unintended usage
> > of no-lock semantics by users who, for whatever reason, don't want to
> > wake up kswapd.
> 
> I would argue that callers should have no business in saying whether
> the MM should wake up kswapd or not. The flag name currently suggests
> that but that is mostly for historic reasons. A random page allocator
> user shouldn't really care about this low level detail, really.

I agree but unless we somehow enforce/warn for such cases, there will be
users doing this. A simple grep shows kmsan is doing this. I worry there
might be users who are manually setting up gfp flags for their
allocations and not providing the kswapd flag explicitly. Finding such cases
with grep is not easy.
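
Purely to illustrate the enforce/warn idea, a debug-only sketch (not
proposed for merging) that would flag callers clearing the kswapd bit
while still allowing direct reclaim:

/*
 * Hypothetical check: such a caller very likely meant "spend less
 * reclaim effort", not "I am in a context that cannot take locks".
 */
VM_WARN_ON_ONCE((gfp_mask & __GFP_DIRECT_RECLAIM) &&
		!(gfp_mask & __GFP_KSWAPD_RECLAIM));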


^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: [RFC PATCH] mm: net: disable kswapd for high-order network buffer allocation
  2025-10-14 10:39             ` Eric Dumazet
@ 2025-10-14 20:17               ` Barry Song
  2025-10-15  6:39                 ` Eric Dumazet
  0 siblings, 1 reply; 36+ messages in thread
From: Barry Song @ 2025-10-14 20:17 UTC (permalink / raw)
  To: Eric Dumazet
  Cc: Matthew Wilcox, netdev, linux-mm, linux-doc, linux-kernel,
	Barry Song, Jonathan Corbet, Kuniyuki Iwashima, Paolo Abeni,
	Willem de Bruijn, David S. Miller, Jakub Kicinski, Simon Horman,
	Vlastimil Babka, Suren Baghdasaryan, Michal Hocko,
	Brendan Jackman, Johannes Weiner, Zi Yan, Yunsheng Lin,
	Huacai Zhou

On Tue, Oct 14, 2025 at 6:39 PM Eric Dumazet <edumazet@google.com> wrote:
>
> On Tue, Oct 14, 2025 at 3:19 AM Barry Song <21cnbao@gmail.com> wrote:
> >
> > > >
> > > > >
> > > > > I think you are missing something to control how much memory  can be
> > > > > pushed on each TCP socket ?
> > > > >
> > > > > What is tcp_wmem on your phones ? What about tcp_mem ?
> > > > >
> > > > > Have you looked at /proc/sys/net/ipv4/tcp_notsent_lowat
> > > >
> > > > # cat /proc/sys/net/ipv4/tcp_wmem
> > > > 524288  1048576 6710886
> > >
> > > Ouch. That is insane tcp_wmem[0] .
> > >
> > > Please stick to 4096, or risk OOM of various sorts.
> > >
> > > >
> > > > # cat /proc/sys/net/ipv4/tcp_notsent_lowat
> > > > 4294967295
> > > >
> > > > Any thoughts on these settings?
> > >
> > > Please look at
> > > https://www.kernel.org/doc/Documentation/networking/ip-sysctl.txt
> > >
> > > tcp_notsent_lowat - UNSIGNED INTEGER
> > > A TCP socket can control the amount of unsent bytes in its write queue,
> > > thanks to TCP_NOTSENT_LOWAT socket option. poll()/select()/epoll()
> > > reports POLLOUT events if the amount of unsent bytes is below a per
> > > socket value, and if the write queue is not full. sendmsg() will
> > > also not add new buffers if the limit is hit.
> > >
> > > This global variable controls the amount of unsent data for
> > > sockets not using TCP_NOTSENT_LOWAT. For these sockets, a change
> > > to the global variable has immediate effect.
> > >
> > >
> > > Setting this sysctl to 2MB can effectively reduce the amount of memory
> > > in TCP write queues by 66 %,
> > > or allow you to increase tcp_wmem[2] so that only flows needing big
> > > BDP can get it.
> >
> > We obtained these settings from our hardware vendors.
>
> Tell them they are wrong.

Well, we checked Qualcomm and MTK, and it seems both set these values
relatively high. In other words, all the AOSP products we examined also
use high values for these settings. Nobody is using tcp_wmem[0]=4096.

We’ll need some time to understand why these are configured this way in
AOSP hardware.

>
> >
> > It might be worth exploring these settings further, but I can’t quite see
> > their connection to high-order allocations, since the high-order allocation
> > size is fixed by kernel macros.
> >
> > #define SKB_FRAG_PAGE_ORDER     get_order(32768)
> > #define PAGE_FRAG_CACHE_MAX_SIZE        __ALIGN_MASK(32768, ~PAGE_MASK)
> > #define PAGE_FRAG_CACHE_MAX_ORDER       get_order(PAGE_FRAG_CACHE_MAX_SIZE)
> >
> > Is there anything I’m missing?
>
> What is your question exactly ? You read these macros just fine. What
> is your point ?

My question is whether these settings influence how often high-order
allocations occur. In other words, would lowering these values make
high-order allocations less frequent? If so, why?
I’m not a network expert; apologies if the question sounds naive.

>
> We had in the past something dynamic that we removed
>
> commit d9b2938aabf757da2d40153489b251d4fc3fdd18
> Author: Eric Dumazet <edumazet@google.com>
> Date:   Wed Aug 27 20:49:34 2014 -0700
>
>     net: attempt a single high order allocation

Thanks
Barry


^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: [RFC PATCH] mm: net: disable kswapd for high-order network buffer allocation
  2025-10-14 14:37             ` Shakeel Butt
@ 2025-10-14 20:28               ` Barry Song
  2025-10-15 18:13                 ` Shakeel Butt
  0 siblings, 1 reply; 36+ messages in thread
From: Barry Song @ 2025-10-14 20:28 UTC (permalink / raw)
  To: Shakeel Butt
  Cc: Eric Dumazet, Matthew Wilcox, netdev, linux-mm, linux-doc,
	linux-kernel, Barry Song, Jonathan Corbet, Kuniyuki Iwashima,
	Paolo Abeni, Willem de Bruijn, David S. Miller, Jakub Kicinski,
	Simon Horman, Vlastimil Babka, Suren Baghdasaryan, Michal Hocko,
	Brendan Jackman, Johannes Weiner, Zi Yan, Yunsheng Lin,
	Huacai Zhou

On Tue, Oct 14, 2025 at 10:38 PM Shakeel Butt <shakeel.butt@linux.dev> wrote:

> >
> > It might be worth exploring these settings further, but I can’t quite see
> > their connection to high-order allocations,
>
> I don't think there is a connection between them. Is there a reason you
> are expecting a connection/relation between them?

Eric replied to my email about frequent high-order allocation requests,
suggesting that I might be missing some proper configurations for these
settings[1]. So I’m trying to understand whether these configurations affect
the frequency of high-order allocations.

[1] https://lore.kernel.org/linux-mm/pow5zt7dmo2wiydophoap6ntaycyjt2yrszo3ue7mg2hgnzcmv@oi3epbtyoufn/T/#m9b94a1c60452551496738e4e15235329f860d1f9

Thanks
Barry


^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: [RFC PATCH] mm: net: disable kswapd for high-order network buffer allocation
  2025-10-14 17:22         ` Shakeel Butt
@ 2025-10-15  6:21           ` Michal Hocko
  2025-10-15 18:26             ` Shakeel Butt
  0 siblings, 1 reply; 36+ messages in thread
From: Michal Hocko @ 2025-10-15  6:21 UTC (permalink / raw)
  To: Shakeel Butt
  Cc: Vlastimil Babka, Barry Song, netdev, linux-mm, linux-doc,
	linux-kernel, Barry Song, Jonathan Corbet, Eric Dumazet,
	Kuniyuki Iwashima, Paolo Abeni, Willem de Bruijn,
	David S. Miller, Jakub Kicinski, Simon Horman,
	Suren Baghdasaryan, Brendan Jackman, Johannes Weiner, Zi Yan,
	Yunsheng Lin, Huacai Zhou, Alexei Starovoitov, Harry Yoo,
	David Hildenbrand, Matthew Wilcox, Roman Gushchin

On Tue 14-10-25 10:22:03, Shakeel Butt wrote:
> On Tue, Oct 14, 2025 at 05:14:47PM +0200, Michal Hocko wrote:
> > On Tue 14-10-25 07:27:06, Shakeel Butt wrote:
> > > On Tue, Oct 14, 2025 at 09:26:49AM +0200, Michal Hocko wrote:
> > > > On Mon 13-10-25 20:30:13, Vlastimil Babka wrote:
> > > > > On 10/13/25 12:16, Barry Song wrote:
> > > > > > From: Barry Song <v-songbaohua@oppo.com>
> > > > [...]
> > > > > I wonder if we should either:
> > > > > 
> > > > > 1) sacrifice a new __GFP flag specifically for "!allow_spin" case to
> > > > > determine it precisely.
> > > > 
> > > > As said in other reply I do not think this is a good fit for this
> > > > specific case as it is all or nothing approach. Soon enough we discover
> > > > that "no effort to reclaim/compact" hurts other usecases. So I do not
> > > > think we need a dedicated flag for this specific case. We need a way to
> > > > tell kswapd/kcompactd how much to try instead.
> > > 
> > > To me this new flag is to decouple two orthogonal requests, i.e. no-lock
> > > semantics and "don't wake up kswapd". At the moment the lack of the kswapd
> > > gfp flag conveys the no-lock semantics. This can lead to unintended usage
> > > of no-lock semantics by users who, for whatever reason, don't want to
> > > wake up kswapd.
> > 
> > I would argue that callers should have no business in saying whether
> > the MM should wake up kswapd or not. The flag name currently suggests
> > that but that is mostly for historic reasons. A random page allocator
> > user shouldn't really care about this low level detail, really.
> 
> I agree but unless we somehow enforce/warn for such cases, there will be
> users doing this. A simple grep shows kmsan is doing this. I worry there
> might be users who are manually setting up gfp flags for their
> allocations and not providing the kswapd flag explicitly. Finding such cases
> with grep is not easy.

You are right, but this is an inherent problem of our gfp interface. It is
too late to have a defensive interface, I am afraid.

-- 
Michal Hocko
SUSE Labs


^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: [RFC PATCH] mm: net: disable kswapd for high-order network buffer allocation
  2025-10-14 20:17               ` Barry Song
@ 2025-10-15  6:39                 ` Eric Dumazet
  2025-10-15  7:35                   ` Barry Song
  0 siblings, 1 reply; 36+ messages in thread
From: Eric Dumazet @ 2025-10-15  6:39 UTC (permalink / raw)
  To: Barry Song
  Cc: Matthew Wilcox, netdev, linux-mm, linux-doc, linux-kernel,
	Barry Song, Jonathan Corbet, Kuniyuki Iwashima, Paolo Abeni,
	Willem de Bruijn, David S. Miller, Jakub Kicinski, Simon Horman,
	Vlastimil Babka, Suren Baghdasaryan, Michal Hocko,
	Brendan Jackman, Johannes Weiner, Zi Yan, Yunsheng Lin,
	Huacai Zhou

On Tue, Oct 14, 2025 at 1:17 PM Barry Song <21cnbao@gmail.com> wrote:
>
> On Tue, Oct 14, 2025 at 6:39 PM Eric Dumazet <edumazet@google.com> wrote:
> >
> > On Tue, Oct 14, 2025 at 3:19 AM Barry Song <21cnbao@gmail.com> wrote:
> > >
> > > > >
> > > > > >
> > > > > > I think you are missing something to control how much memory  can be
> > > > > > pushed on each TCP socket ?
> > > > > >
> > > > > > What is tcp_wmem on your phones ? What about tcp_mem ?
> > > > > >
> > > > > > Have you looked at /proc/sys/net/ipv4/tcp_notsent_lowat
> > > > >
> > > > > # cat /proc/sys/net/ipv4/tcp_wmem
> > > > > 524288  1048576 6710886
> > > >
> > > > Ouch. That is insane tcp_wmem[0] .
> > > >
> > > > Please stick to 4096, or risk OOM of various sorts.
> > > >
> > > > >
> > > > > # cat /proc/sys/net/ipv4/tcp_notsent_lowat
> > > > > 4294967295
> > > > >
> > > > > Any thoughts on these settings?
> > > >
> > > > Please look at
> > > > https://www.kernel.org/doc/Documentation/networking/ip-sysctl.txt
> > > >
> > > > tcp_notsent_lowat - UNSIGNED INTEGER
> > > > A TCP socket can control the amount of unsent bytes in its write queue,
> > > > thanks to TCP_NOTSENT_LOWAT socket option. poll()/select()/epoll()
> > > > reports POLLOUT events if the amount of unsent bytes is below a per
> > > > socket value, and if the write queue is not full. sendmsg() will
> > > > also not add new buffers if the limit is hit.
> > > >
> > > > This global variable controls the amount of unsent data for
> > > > sockets not using TCP_NOTSENT_LOWAT. For these sockets, a change
> > > > to the global variable has immediate effect.
> > > >
> > > >
> > > > Setting this sysctl to 2MB can effectively reduce the amount of memory
> > > > in TCP write queues by 66 %,
> > > > or allow you to increase tcp_wmem[2] so that only flows needing big
> > > > BDP can get it.
> > >
> > > We obtained these settings from our hardware vendors.
> >
> > Tell them they are wrong.
>
> Well, we checked Qualcomm and MTK, and it seems both set these values
> relatively high. In other words, all the AOSP products we examined also
> use high values for these settings. Nobody is using tcp_wmem[0]=4096.
>

The (fine and safe) default should be PAGE_SIZE.

Perhaps they are dealing with systems with PAGE_SIZE=65536, but then
the skb_page_frag_refill() would be a non issue there, because it would
only allocate order-0 pages.

> We’ll need some time to understand why these are configured this way in
> AOSP hardware.
>
> >
> > >
> > > It might be worth exploring these settings further, but I can’t quite see
> > > their connection to high-order allocations, since the high-order allocation
> > > size is fixed by kernel macros.
> > >
> > > #define SKB_FRAG_PAGE_ORDER     get_order(32768)
> > > #define PAGE_FRAG_CACHE_MAX_SIZE        __ALIGN_MASK(32768, ~PAGE_MASK)
> > > #define PAGE_FRAG_CACHE_MAX_ORDER       get_order(PAGE_FRAG_CACHE_MAX_SIZE)
> > >
> > > Is there anything I’m missing?
> >
> > What is your question exactly ? You read these macros just fine. What
> > is your point ?
>
> My question is whether these settings influence how often high-order
> allocations occur. In other words, would lowering these values make
> high-order allocations less frequent? If so, why?

Because almost all of the buffers stored in TCP write queues are using
order-3 pages on arches with 4K pages.

I am a bit confused because you posted a patch changing skb_page_frag_refill()
without realizing its first user is TCP.

Look for sk_page_frag_refill() in tcp_sendmsg_locked()
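
For reference, the chain looks like this (paraphrased rather than quoted
verbatim from net/ipv4/tcp.c and net/core/sock.c):

	/* tcp_sendmsg_locked() */
	struct page_frag *pfrag = sk_page_frag(sk);

	if (!sk_page_frag_refill(sk, pfrag))
		goto wait_for_space;

	/* sk_page_frag_refill() calls
	 * skb_page_frag_refill(32U, pfrag, sk->sk_allocation), which is
	 * where SKB_FRAG_PAGE_ORDER (order-3 with 4K pages) comes in.
	 */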

> I’m not a network expert, apologies if the question sounds naive.
>
> >
> > We had in the past something dynamic that we removed
> >
> > commit d9b2938aabf757da2d40153489b251d4fc3fdd18
> > Author: Eric Dumazet <edumazet@google.com>
> > Date:   Wed Aug 27 20:49:34 2014 -0700
> >
> >     net: attempt a single high order allocation
>
> Thanks
> Barry


^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: [RFC PATCH] mm: net: disable kswapd for high-order network buffer allocation
  2025-10-15  6:39                 ` Eric Dumazet
@ 2025-10-15  7:35                   ` Barry Song
  2025-10-15 16:39                     ` Suren Baghdasaryan
  0 siblings, 1 reply; 36+ messages in thread
From: Barry Song @ 2025-10-15  7:35 UTC (permalink / raw)
  To: Eric Dumazet
  Cc: Matthew Wilcox, netdev, linux-mm, linux-doc, linux-kernel,
	Barry Song, Jonathan Corbet, Kuniyuki Iwashima, Paolo Abeni,
	Willem de Bruijn, David S. Miller, Jakub Kicinski, Simon Horman,
	Vlastimil Babka, Suren Baghdasaryan, Michal Hocko,
	Brendan Jackman, Johannes Weiner, Zi Yan, Yunsheng Lin,
	Huacai Zhou

On Wed, Oct 15, 2025 at 2:39 PM Eric Dumazet <edumazet@google.com> wrote:

> > >
> > > Tell them they are wrong.
> >
> > Well, we checked Qualcomm and MTK, and it seems both set these values
> > relatively high. In other words, all the AOSP products we examined also
> > use high values for these settings. Nobody is using tcp_wmem[0]=4096.
> >
>
> The (fine and safe) default should be PAGE_SIZE.
>
> Perhaps they are dealing with systems with PAGE_SIZE=65536, but then
> the skb_page_frag_refill() would be a non issue there, because it would
> only allocate order-0 pages.

I am 100% sure that all of them handle PAGE_SIZE=4096. Google is working on
a 16KB page size for Android, but it is not ready yet (please correct me
if 16KB is ready by now, Suren).

>
> > We’ll need some time to understand why these are configured this way in
> > AOSP hardware.
> >
> > >
> > > >
> > > > It might be worth exploring these settings further, but I can’t quite see
> > > > their connection to high-order allocations, since high-order allocations are
> > > > kernel macros.
> > > >
> > > > #define SKB_FRAG_PAGE_ORDER     get_order(32768)
> > > > #define PAGE_FRAG_CACHE_MAX_SIZE        __ALIGN_MASK(32768, ~PAGE_MASK)
> > > > #define PAGE_FRAG_CACHE_MAX_ORDER       get_order(PAGE_FRAG_CACHE_MAX_SIZE)
> > > >
> > > > Is there anything I’m missing?
> > >
> > > What is your question exactly ? You read these macros just fine. What
> > > is your point ?
> >
> > My question is whether these settings influence how often high-order
> > allocations occur. In other words, would lowering these values make
> > high-order allocations less frequent? If so, why?
>
> Because almost all of the buffers stored in TCP write queues are using
> order-3 pages
> on arches with 4K pages.
>
> I am a bit confused because you posted a patch changing skb_page_frag_refill()
> without realizing its first user is TCP.
>
> Look for sk_page_frag_refill() in tcp_sendmsg_locked()

Sure. Let me review the code further. The problem was observed on the MM
side, causing over-reclamation and phone heating, while the source of the
allocations lies in network activity. I am not a network expert and may be
missing many network details, so I am raising this RFC to both lists to see
if the network and MM folks can discuss together to find a solution.

As you can see, the discussion has absolutely forked into two branches. :-)

Thanks
Barry


^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: [RFC PATCH] mm: net: disable kswapd for high-order network buffer allocation
  2025-10-15  7:35                   ` Barry Song
@ 2025-10-15 16:39                     ` Suren Baghdasaryan
  0 siblings, 0 replies; 36+ messages in thread
From: Suren Baghdasaryan @ 2025-10-15 16:39 UTC (permalink / raw)
  To: Barry Song
  Cc: Eric Dumazet, Matthew Wilcox, netdev, linux-mm, linux-doc,
	linux-kernel, Barry Song, Jonathan Corbet, Kuniyuki Iwashima,
	Paolo Abeni, Willem de Bruijn, David S. Miller, Jakub Kicinski,
	Simon Horman, Vlastimil Babka, Michal Hocko, Brendan Jackman,
	Johannes Weiner, Zi Yan, Yunsheng Lin, Huacai Zhou

On Wed, Oct 15, 2025 at 12:35 AM Barry Song <21cnbao@gmail.com> wrote:
>
> On Wed, Oct 15, 2025 at 2:39 PM Eric Dumazet <edumazet@google.com> wrote:
>
> > > >
> > > > Tell them they are wrong.
> > >
> > > Well, we checked Qualcomm and MTK, and it seems both set these values
> > > relatively high. In other words, all the AOSP products we examined also
> > > use high values for these settings. Nobody is using tcp_wmem[0]=4096.
> > >
> >
> > The (fine and safe) default should be PAGE_SIZE.
> >
> > Perhaps they are dealing with systems with PAGE_SIZE=65536, but then
> > the skb_page_frag_refill() would be a non issue there, because it would
> > only allocate order-0 pages.
>
> I am 100% sure that all of them handle PAGE_SIZE=4096. Google is working on
> a 16KB page size for Android, but it is not ready yet (please correct me
> if 16KB is ready by now, Suren).

It is ready but it is new, so it will take some time before we see it
in production devices.

>
> >
> > > We’ll need some time to understand why these are configured this way in
> > > AOSP hardware.
> > >
> > > >
> > > > >
> > > > > It might be worth exploring these settings further, but I can’t quite see
> > > > > their connection to high-order allocations, since the high-order allocation
> > > > > size is fixed by kernel macros.
> > > > >
> > > > > #define SKB_FRAG_PAGE_ORDER     get_order(32768)
> > > > > #define PAGE_FRAG_CACHE_MAX_SIZE        __ALIGN_MASK(32768, ~PAGE_MASK)
> > > > > #define PAGE_FRAG_CACHE_MAX_ORDER       get_order(PAGE_FRAG_CACHE_MAX_SIZE)
> > > > >
> > > > > Is there anything I’m missing?
> > > >
> > > > What is your question exactly ? You read these macros just fine. What
> > > > is your point ?
> > >
> > > My question is whether these settings influence how often high-order
> > > allocations occur. In other words, would lowering these values make
> > > high-order allocations less frequent? If so, why?
> >
> > Because almost all of the buffers stored in TCP write queues are using
> > order-3 pages
> > on arches with 4K pages.
> >
> > I am a bit confused because you posted a patch changing skb_page_frag_refill()
> > without realizing its first user is TCP.
> >
> > Look for sk_page_frag_refill() in tcp_sendmsg_locked()
>
> Sure. Let me review the code further. The problem was observed on the MM
> side, causing over-reclamation and phone heating, while the source of the
> allocations lies in network activity. I am not a network expert and may be
> missing many network details, so I am raising this RFC on both lists so that
> the network and MM folks can discuss it together and find a solution.
>
> As you can see, the discussion has effectively forked into two branches. :-)
>
> Thanks
> Barry
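
For context, the order-3-with-fallback pattern Eric refers to lives in
skb_page_frag_refill(). Below is a simplified sketch, paraphrased from
mainline (the high_order_alloc_disable static-key check is omitted), not
verbatim code:

	pfrag->offset = 0;
	if (SKB_FRAG_PAGE_ORDER) {
		/* get_order(32768) == 3 with 4K pages but 0 with 64K pages,
		 * so on 64K-page systems this branch is skipped entirely.
		 * Direct reclaim is masked off, yet __GFP_KSWAPD_RECLAIM
		 * stays set, so a failed attempt still wakes kswapd, which
		 * is the behaviour this RFC targets. */
		pfrag->page = alloc_pages((gfp & ~__GFP_DIRECT_RECLAIM) |
					  __GFP_COMP | __GFP_NOWARN |
					  __GFP_NORETRY,
					  SKB_FRAG_PAGE_ORDER);
		if (likely(pfrag->page)) {
			pfrag->size = PAGE_SIZE << SKB_FRAG_PAGE_ORDER;
			return true;
		}
	}
	/* Fall back to a single order-0 page. */
	pfrag->page = alloc_page(gfp);
	if (likely(pfrag->page)) {
		pfrag->size = PAGE_SIZE;
		return true;
	}
	return false;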


^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: [RFC PATCH] mm: net: disable kswapd for high-order network buffer allocation
  2025-10-14 20:28               ` Barry Song
@ 2025-10-15 18:13                 ` Shakeel Butt
  0 siblings, 0 replies; 36+ messages in thread
From: Shakeel Butt @ 2025-10-15 18:13 UTC (permalink / raw)
  To: Barry Song
  Cc: Eric Dumazet, Matthew Wilcox, netdev, linux-mm, linux-doc,
	linux-kernel, Barry Song, Jonathan Corbet, Kuniyuki Iwashima,
	Paolo Abeni, Willem de Bruijn, David S. Miller, Jakub Kicinski,
	Simon Horman, Vlastimil Babka, Suren Baghdasaryan, Michal Hocko,
	Brendan Jackman, Johannes Weiner, Zi Yan, Yunsheng Lin,
	Huacai Zhou

On Wed, Oct 15, 2025 at 04:28:17AM +0800, Barry Song wrote:
> On Tue, Oct 14, 2025 at 10:38 PM Shakeel Butt <shakeel.butt@linux.dev> wrote:
> 
> > >
> > > It might be worth exploring these settings further, but I can’t quite see
> > > their connection to high-order allocations,
> >
> > I don't think there is a connection between them. Is there a reason you
> > are expecting a connection/relation between them?
> 
> Eric replied to my email about frequent high-order allocation requests,
> suggesting that I might be missing the proper configuration of these
> settings[1]. So I’m trying to understand whether these configurations affect
> the frequency of high-order allocations.

If I understand Eric correctly, those configurations do indirectly
affect the number of memory allocations and their lifetime (irrespective
of order). For example, setting tcp_wmem[0] higher allows the kernel to
allocate more memory even when the system is under memory pressure. See
tcp_wmem_schedule(). In your case it would be up to 0.5MiB per socket.

Have you tested the configuration values suggested by Eric?
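
For context, a paraphrased sketch of the tcp_wmem_schedule() logic
referenced above (from net/ipv4/tcp.c; not verbatim, and the 524288 value
below is only an illustration of a "high" vendor tcp_wmem[0] setting):

	static int tcp_wmem_schedule(struct sock *sk, int copy)
	{
		int left;

		if (likely(sk_wmem_schedule(sk, copy)))
			return copy;

		/* Under memory pressure, still let this socket charge up to
		 * tcp_wmem[0] bytes to guarantee forward progress: with
		 * tcp_wmem[0]=524288 that is ~0.5MiB per socket, versus a
		 * single page with the default tcp_wmem[0]=4096. */
		left = sock_net(sk)->ipv4.sysctl_tcp_wmem[0] - sk->sk_wmem_queued;
		if (left > 0)
			sk_mem_schedule(sk, left, SK_MEM_SEND);
		return min(copy, sk->sk_forward_alloc);
	}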


^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: [RFC PATCH] mm: net: disable kswapd for high-order network buffer allocation
  2025-10-15  6:21           ` Michal Hocko
@ 2025-10-15 18:26             ` Shakeel Butt
  0 siblings, 0 replies; 36+ messages in thread
From: Shakeel Butt @ 2025-10-15 18:26 UTC (permalink / raw)
  To: Michal Hocko
  Cc: Vlastimil Babka, Barry Song, netdev, linux-mm, linux-doc,
	linux-kernel, Barry Song, Jonathan Corbet, Eric Dumazet,
	Kuniyuki Iwashima, Paolo Abeni, Willem de Bruijn,
	David S. Miller, Jakub Kicinski, Simon Horman,
	Suren Baghdasaryan, Brendan Jackman, Johannes Weiner, Zi Yan,
	Yunsheng Lin, Huacai Zhou, Alexei Starovoitov, Harry Yoo,
	David Hildenbrand, Matthew Wilcox, Roman Gushchin

On Wed, Oct 15, 2025 at 08:21:21AM +0200, Michal Hocko wrote:
> On Tue 14-10-25 10:22:03, Shakeel Butt wrote:
> > On Tue, Oct 14, 2025 at 05:14:47PM +0200, Michal Hocko wrote:
> > > On Tue 14-10-25 07:27:06, Shakeel Butt wrote:
> > > > On Tue, Oct 14, 2025 at 09:26:49AM +0200, Michal Hocko wrote:
> > > > > On Mon 13-10-25 20:30:13, Vlastimil Babka wrote:
> > > > > > On 10/13/25 12:16, Barry Song wrote:
> > > > > > > From: Barry Song <v-songbaohua@oppo.com>
> > > > > [...]
> > > > > > I wonder if we should either:
> > > > > > 
> > > > > > 1) sacrifice a new __GFP flag specifically for the "!allow_spin" case to
> > > > > > determine it precisely.
> > > > > 
> > > > > As said in another reply, I do not think this is a good fit for this
> > > > > specific case, as it is an all-or-nothing approach. Soon enough we would
> > > > > discover that "no effort to reclaim/compact" hurts other use cases. So I
> > > > > do not think we need a dedicated flag for this specific case. We need a
> > > > > way to tell kswapd/kcompactd how much to try instead.
> > > > 
> > > > To me this new flag is to decouple two orthogonal requests, i.e. no-lock
> > > > semantics and don't-wake-kswapd. At the moment the lack of the kswapd gfp
> > > > flag conveys the semantics of no-lock. This can lead to unintended usage
> > > > of no-lock semantics by users who, for whatever reason, don't want to
> > > > wake up kswapd.
> > > 
> > > I would argue that callers have no business saying whether the MM
> > > should wake up kswapd or not. The flag name currently suggests that,
> > > but that is mostly for historic reasons. A random page allocator user
> > > really shouldn't care about this low-level detail.
> > 
> > I agree, but unless we somehow enforce/warn for such cases, there will be
> > users doing this. A simple grep shows kmsan is doing this. I worry there
> > might be users who are manually setting up gfp flags for their
> > allocations and not providing the kswapd flag explicitly. Finding such
> > cases with grep is not easy.
> 
> You are right, but this is an inherent problem of our gfp interface. It is
> too late to make the interface defensive, I am afraid.

I am not really asking to overhaul the whole gfp interface, but rather not
to introduce one more case which can easily be misused. Anyway, this
conversation is orthogonal to the original email, and I am fine with a
wait-and-see approach here for now.
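
For readers unfamiliar with the coupling discussed above: there is only one
knob, the __GFP_KSWAPD_RECLAIM bit, so "don't wake kswapd" and the no-lock
allocation semantics currently ride on the same flag. A minimal illustration
(a sketch only, not a proposed patch):

	/* GFP_KERNEL includes __GFP_KSWAPD_RECLAIM; masking it off is
	 * currently the only way for a caller to say "don't wake kswapd".
	 * A caller doing this purely for throttling reasons is therefore
	 * indistinguishable from one that needs no-lock semantics. */
	struct page *p = alloc_pages(GFP_KERNEL & ~__GFP_KSWAPD_RECLAIM,
				     SKB_FRAG_PAGE_ORDER);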



^ permalink raw reply	[flat|nested] 36+ messages in thread

end of thread, other threads:[~2025-10-15 18:26 UTC | newest]

Thread overview: 36+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2025-10-13 10:16 [RFC PATCH] mm: net: disable kswapd for high-order network buffer allocation Barry Song
2025-10-13 18:30 ` Vlastimil Babka
2025-10-13 21:35   ` Shakeel Butt
2025-10-13 21:53     ` Alexei Starovoitov
2025-10-13 22:25       ` Shakeel Butt
2025-10-13 22:46   ` Roman Gushchin
2025-10-14  4:31     ` Barry Song
2025-10-14  7:24     ` Michal Hocko
2025-10-14  7:26   ` Michal Hocko
2025-10-14  8:08     ` Barry Song
2025-10-14 14:27     ` Shakeel Butt
2025-10-14 15:14       ` Michal Hocko
2025-10-14 17:22         ` Shakeel Butt
2025-10-15  6:21           ` Michal Hocko
2025-10-15 18:26             ` Shakeel Butt
2025-10-13 18:53 ` Eric Dumazet
2025-10-14  3:58   ` Barry Song
2025-10-14  5:07     ` Eric Dumazet
2025-10-14  6:43       ` Barry Song
2025-10-14  7:01         ` Eric Dumazet
2025-10-14  8:17           ` Barry Song
2025-10-14  8:25             ` Eric Dumazet
2025-10-13 21:56 ` Matthew Wilcox
2025-10-14  4:09   ` Barry Song
2025-10-14  5:04     ` Eric Dumazet
2025-10-14  8:58       ` Barry Song
2025-10-14  9:49         ` Eric Dumazet
2025-10-14 10:19           ` Barry Song
2025-10-14 10:39             ` Eric Dumazet
2025-10-14 20:17               ` Barry Song
2025-10-15  6:39                 ` Eric Dumazet
2025-10-15  7:35                   ` Barry Song
2025-10-15 16:39                     ` Suren Baghdasaryan
2025-10-14 14:37             ` Shakeel Butt
2025-10-14 20:28               ` Barry Song
2025-10-15 18:13                 ` Shakeel Butt

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox