* [PATCH v3 0/3] Use kmem_cache for memcg alloc
@ 2025-04-25  3:19 Huan Yang
  2025-04-25  3:19 ` [PATCH v3 1/3] mm/memcg: move mem_cgroup_init() ahead of cgroup_init() Huan Yang
                   ` (3 more replies)
  0 siblings, 4 replies; 14+ messages in thread
From: Huan Yang @ 2025-04-25  3:19 UTC (permalink / raw)
  To: Johannes Weiner, Michal Hocko, Roman Gushchin, Shakeel Butt,
	Muchun Song, Andrew Morton, Petr Mladek, Vlastimil Babka,
	Rasmus Villemoes, Francesco Valla, Raul E Rangel,
	Paul E. McKenney, Huang Shijie, Guo Weikang,
	Uladzislau Rezki (Sony),
	KP Singh, cgroups, linux-mm, linux-kernel, Huan Yang
  Cc: opensource.kernel

The mem_cgroup_alloc() function creates the mem_cgroup struct and its
associated structures, including mem_cgroup_per_node.
Through detailed analysis on our test machine (Arm64, 16GB RAM, 6.6 kernel,
1 NUMA node, memcgv2 with nokmem,nosocket,cgroup_disable=pressure),
we can observe the memory allocation for these structures using the
following shell commands:
  # Enable tracing
  echo 1 > /sys/kernel/tracing/events/kmem/kmalloc/enable
  echo 1 > /sys/kernel/tracing/tracing_on
  cat /sys/kernel/tracing/trace_pipe | grep kmalloc | grep mem_cgroup

  # Trigger an allocation by enabling memcg in the cgroup subtree (if not already enabled)
  echo +memory > /sys/fs/cgroup/cgroup.subtree_control

Ftrace Output:
  # mem_cgroup struct allocation
  sh-6312    [000] ..... 58015.698365: kmalloc:
    call_site=mem_cgroup_css_alloc+0xd8/0x5b4
    ptr=000000003e4c3799 bytes_req=2312 bytes_alloc=4096
    gfp_flags=GFP_KERNEL|__GFP_ZERO node=-1 accounted=false

  # mem_cgroup_per_node allocation
  sh-6312    [000] ..... 58015.698389: kmalloc:
    call_site=mem_cgroup_css_alloc+0x1d8/0x5b4
    ptr=00000000d798700c bytes_req=2896 bytes_alloc=4096
    gfp_flags=GFP_KERNEL|__GFP_ZERO node=0 accounted=false

Key Observations:
  1. Both structures use kmalloc with requested sizes between 2KB and 4KB
  2. Allocation alignment forces 4KB slab usage due to pre-defined sizes
     (64B, 128B,..., 2KB, 4KB, 8KB)
  3. Memory waste per memcg instance:
      Base struct: 4096 - 2312 = 1784 bytes
      Per-node struct: 4096 - 2896 = 1200 bytes
      Total waste: 2984 bytes (1-node system)
      NUMA scaling: (1200 + 8) * nr_node_ids bytes
So each memcg instance wastes a non-trivial amount of memory.
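
As a quick sanity check, the per-instance waste follows directly from the
traced sizes (plain shell arithmetic, 1-node case):

  echo $((4096 - 2312))   # 1784 bytes wasted for the base struct
  echo $((4096 - 2896))   # 1200 bytes wasted for the per-node struct
  echo $((1784 + 1200))   # 2984 bytes total per memcg on this 1-node system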

This patchset introduces dedicated kmem_caches:
  Patch2 - mem_cgroup kmem_cache - memcg_cachep
  Patch3 - mem_cgroup_per_node kmem_cache - memcg_pn_cachep

The benefits of this change can be observed with the following tracing
commands:
  # Enable tracing
  echo 1 > /sys/kernel/tracing/events/kmem/kmem_cache_alloc/enable
  echo 1 > /sys/kernel/tracing/tracing_on
  cat /sys/kernel/tracing/trace_pipe | grep kmem_cache_alloc | grep mem_cgroup
  # In another terminal:
  echo +memory > /sys/fs/cgroup/cgroup.subtree_control


The output might now look like this:

  # mem_cgroup struct allocation
  sh-9827     [000] .....   289.513598: kmem_cache_alloc:
    call_site=mem_cgroup_css_alloc+0xbc/0x5d4 ptr=00000000695c1806
    bytes_req=2312 bytes_alloc=2368 gfp_flags=GFP_KERNEL|__GFP_ZERO node=-1
    accounted=false
  # mem_cgroup_per_node allocation
  sh-9827     [000] .....   289.513602: kmem_cache_alloc:
    call_site=mem_cgroup_css_alloc+0x1b8/0x5d4 ptr=000000002989e63a
    bytes_req=2896 bytes_alloc=2944 gfp_flags=GFP_KERNEL|__GFP_ZERO node=0
    accounted=false

This indicates that the `mem_cgroup` struct now requests 2312 bytes
and is allocated 2368 bytes, while `mem_cgroup_per_node` requests 2896 bytes
and is allocated 2944 bytes.
The slight increase in allocated size is due to `SLAB_HWCACHE_ALIGN` in the
`kmem_cache`.
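
The rounded-up figures are simply the requested sizes aligned up to the CPU
cache line (assuming a 64-byte cache line on this machine):

  echo $(( (2312 + 63) / 64 * 64 ))   # 2368 for struct mem_cgroup
  echo $(( (2896 + 63) / 64 * 64 ))   # 2944 for struct mem_cgroup_per_node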

Without `SLAB_HWCACHE_ALIGN`, the allocation might appear as:

  # mem_cgroup struct allocation
  sh-9269     [003] .....    80.396366: kmem_cache_alloc:
    call_site=mem_cgroup_css_alloc+0xbc/0x5d4 ptr=000000005b12b475
    bytes_req=2312 bytes_alloc=2312 gfp_flags=GFP_KERNEL|__GFP_ZERO node=-1
    accounted=false

  # mem_cgroup_per_node allocation
  sh-9269     [003] .....    80.396411: kmem_cache_alloc:
    call_site=mem_cgroup_css_alloc+0x1b8/0x5d4 ptr=00000000f347adc6
    bytes_req=2896 bytes_alloc=2896 gfp_flags=GFP_KERNEL|__GFP_ZERO node=0
    accounted=false

While the `bytes_alloc` now matches the `bytes_req`, this patchset defaults
to using `SLAB_HWCACHE_ALIGN` as it is generally considered more beneficial
for performance. Please let me know if there are any issues or if I've
misunderstood anything.
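
For completeness, when SLUB sysfs is available, the resulting object layout of
the new cache can also be inspected after boot (the cache name "mem_cgroup"
comes from patch 2; values assume no slab debugging):

  cat /sys/kernel/slab/mem_cgroup/object_size   # requested size, 2312 here
  cat /sys/kernel/slab/mem_cgroup/slab_size     # per-object footprint, 2368 with SLAB_HWCACHE_ALIGN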

This patchset also moves mem_cgroup_init() ahead of cgroup_init(), because
cgroup_init() allocates root_mem_cgroup while initcalls only run after
cgroup_init(); without the reordering the kmem_caches would not yet exist at
that point, and every user would have to check them for NULL before use.

ChangeLog:
 v2 -> v3:
   Move v2's patch 3 to the front; reuse mem_cgroup_init() and move it ahead
   of cgroup_init().
 v1 -> v2:
   Patch 1-2: minor commit message changes.
   Patch 3: Add mem_cgroup_init_early() to help "memcg" prepare resources
            before cgroup_init().

v2: https://lore.kernel.org/all/20250424120937.96164-1-link@vivo.com/
v1: https://lore.kernel.org/all/20250423084306.65706-1-link@vivo.com/


Huan Yang (3):
  mm/memcg: move mem_cgroup_init() ahead of cgroup_init()
  mm/memcg: use kmem_cache when alloc memcg
  mm/memcg: use kmem_cache when alloc memcg pernode info

 include/linux/memcontrol.h |  3 +++
 init/main.c                |  2 ++
 mm/memcontrol.c            | 21 ++++++++++++++++-----
 3 files changed, 21 insertions(+), 5 deletions(-)


base-commit: 2c9c612abeb38aab0e87d48496de6fd6daafb00b
--
2.48.1




* [PATCH v3 1/3] mm/memcg: move mem_cgroup_init() ahead of cgroup_init()
  2025-04-25  3:19 [PATCH v3 0/3] Use kmem_cache for memcg alloc Huan Yang
@ 2025-04-25  3:19 ` Huan Yang
  2025-04-25  4:11   ` Shakeel Butt
  2025-04-27 11:46   ` Johannes Weiner
  2025-04-25  3:19 ` [PATCH v3 2/3] mm/memcg: use kmem_cache when alloc memcg Huan Yang
                   ` (2 subsequent siblings)
  3 siblings, 2 replies; 14+ messages in thread
From: Huan Yang @ 2025-04-25  3:19 UTC (permalink / raw)
  To: Johannes Weiner, Michal Hocko, Roman Gushchin, Shakeel Butt,
	Muchun Song, Andrew Morton, Petr Mladek, Vlastimil Babka,
	Rasmus Villemoes, Francesco Valla, Raul E Rangel,
	Paul E. McKenney, Huang Shijie, Guo Weikang,
	Uladzislau Rezki (Sony),
	KP Singh, cgroups, linux-mm, linux-kernel, Huan Yang,
	Alexander Gordeev, Boqun Feng
  Cc: opensource.kernel

When cgroup_init() creates root_mem_cgroup through css_alloc callback,
some critical resources might not be fully initialized, forcing later
operations to perform conditional checks for resource availability.

This patch moves mem_cgroup_init() to address the init order: it is now
invoked before cgroup_init(), so, compared to subsys_initcall, it can be
used to prepare key resources before root_mem_cgroup is allocated.

Signed-off-by: Huan Yang <link@vivo.com>
Suggested-by: Shakeel Butt <shakeel.butt@linux.dev>
---
 include/linux/memcontrol.h | 3 +++
 init/main.c                | 2 ++
 mm/memcontrol.c            | 5 ++---
 3 files changed, 7 insertions(+), 3 deletions(-)

diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h
index 5264d148bdd9..a6bbdd9cab5b 100644
--- a/include/linux/memcontrol.h
+++ b/include/linux/memcontrol.h
@@ -1057,6 +1057,7 @@ static inline u64 cgroup_id_from_mm(struct mm_struct *mm)
 	return id;
 }
 
+extern int mem_cgroup_init(void);
 #else /* CONFIG_MEMCG */
 
 #define MEM_CGROUP_ID_SHIFT	0
@@ -1472,6 +1473,8 @@ static inline u64 cgroup_id_from_mm(struct mm_struct *mm)
 {
 	return 0;
 }
+
+static inline int mem_cgroup_init(void) { return 0; }
 #endif /* CONFIG_MEMCG */
 
 /*
diff --git a/init/main.c b/init/main.c
index 6b14e6116a1f..d9c646960fdd 100644
--- a/init/main.c
+++ b/init/main.c
@@ -50,6 +50,7 @@
 #include <linux/writeback.h>
 #include <linux/cpu.h>
 #include <linux/cpuset.h>
+#include <linux/memcontrol.h>
 #include <linux/cgroup.h>
 #include <linux/efi.h>
 #include <linux/tick.h>
@@ -1087,6 +1088,7 @@ void start_kernel(void)
 	nsfs_init();
 	pidfs_init();
 	cpuset_init();
+	mem_cgroup_init();
 	cgroup_init();
 	taskstats_init_early();
 	delayacct_init();
diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index 5e2ea8b8a898..64a4213227c4 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -5029,14 +5029,14 @@ static int __init cgroup_memory(char *s)
 __setup("cgroup.memory=", cgroup_memory);
 
 /*
- * subsys_initcall() for memory controller.
+ * Memory controller init before cgroup_init() initialize root_mem_cgroup.
  *
  * Some parts like memcg_hotplug_cpu_dead() have to be initialized from this
  * context because of lock dependencies (cgroup_lock -> cpu hotplug) but
  * basically everything that doesn't depend on a specific mem_cgroup structure
  * should be initialized from here.
  */
-static int __init mem_cgroup_init(void)
+int __init mem_cgroup_init(void)
 {
 	int cpu;
 
@@ -5057,7 +5057,6 @@ static int __init mem_cgroup_init(void)
 
 	return 0;
 }
-subsys_initcall(mem_cgroup_init);
 
 #ifdef CONFIG_SWAP
 /**
-- 
2.48.1




* [PATCH v3 2/3] mm/memcg: use kmem_cache when alloc memcg
  2025-04-25  3:19 [PATCH v3 0/3] Use kmem_cache for memcg alloc Huan Yang
  2025-04-25  3:19 ` [PATCH v3 1/3] mm/memcg: move mem_cgroup_init() ahead of cgroup_init() Huan Yang
@ 2025-04-25  3:19 ` Huan Yang
  2025-04-25  4:12   ` Shakeel Butt
  2025-04-27 11:56   ` Johannes Weiner
  2025-04-25  3:19 ` [PATCH v3 3/3] mm/memcg: use kmem_cache when alloc memcg pernode info Huan Yang
  2025-04-25  4:35 ` [PATCH v3 0/3] Use kmem_cache for memcg alloc Matthew Wilcox
  3 siblings, 2 replies; 14+ messages in thread
From: Huan Yang @ 2025-04-25  3:19 UTC (permalink / raw)
  To: Johannes Weiner, Michal Hocko, Roman Gushchin, Shakeel Butt,
	Muchun Song, Andrew Morton, Petr Mladek, Vlastimil Babka,
	Rasmus Villemoes, Francesco Valla, Raul E Rangel,
	Paul E. McKenney, Huang Shijie, Guo Weikang,
	Uladzislau Rezki (Sony),
	KP Singh, cgroups, linux-mm, linux-kernel, Huan Yang
  Cc: opensource.kernel

When tracing mem_cgroup_alloc() with kmalloc ftrace, we observe:

kmalloc: call_site=mem_cgroup_css_alloc+0xd8/0x5b4 ptr=000000003e4c3799
    bytes_req=2312 bytes_alloc=4096 gfp_flags=GFP_KERNEL|__GFP_ZERO node=-1
    accounted=false

The output indicates that while allocating mem_cgroup struct (2312 bytes),
the slab allocator actually provides 4096-byte chunks. This occurs because:

1. The slab allocator predefines bucket sizes from 64B to 8KB
2. The mem_cgroup allocation size (2312B) falls between the 2KB and 4KB
   slabs
3. The allocator rounds up to the nearest larger slab (4KB), resulting in
   ~1KB wasted memory per allocation

This patch introduces a dedicated kmem_cache for mem_cgroup structs,
achieving precise memory allocation. Post-patch ftrace verification shows:

kmem_cache_alloc: call_site=mem_cgroup_css_alloc+0xbc/0x5d4
    ptr=00000000695c1806 bytes_req=2312 bytes_alloc=2368
    gfp_flags=GFP_KERNEL|__GFP_ZERO node=-1 accounted=false

Each memcg allocation now takes 2368 bytes (including HW cacheline
alignment) instead of 4096, avoiding the waste.

Signed-off-by: Huan Yang <link@vivo.com>
---
 mm/memcontrol.c | 9 ++++++++-
 1 file changed, 8 insertions(+), 1 deletion(-)

diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index 64a4213227c4..e34216e55688 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -95,6 +95,8 @@ static bool cgroup_memory_nokmem __ro_after_init;
 /* BPF memory accounting disabled? */
 static bool cgroup_memory_nobpf __ro_after_init;
 
+static struct kmem_cache *memcg_cachep;
+
 #ifdef CONFIG_CGROUP_WRITEBACK
 static DECLARE_WAIT_QUEUE_HEAD(memcg_cgwb_frn_waitq);
 #endif
@@ -3652,7 +3654,7 @@ static struct mem_cgroup *mem_cgroup_alloc(struct mem_cgroup *parent)
 	int __maybe_unused i;
 	long error;
 
-	memcg = kzalloc(struct_size(memcg, nodeinfo, nr_node_ids), GFP_KERNEL);
+	memcg = kmem_cache_zalloc(memcg_cachep, GFP_KERNEL);
 	if (!memcg)
 		return ERR_PTR(-ENOMEM);
 
@@ -5038,6 +5040,7 @@ __setup("cgroup.memory=", cgroup_memory);
  */
 int __init mem_cgroup_init(void)
 {
+	unsigned int memcg_size;
 	int cpu;
 
 	/*
@@ -5055,6 +5058,10 @@ int __init mem_cgroup_init(void)
 		INIT_WORK(&per_cpu_ptr(&memcg_stock, cpu)->work,
 			  drain_local_stock);
 
+	memcg_size = struct_size_t(struct mem_cgroup, nodeinfo, nr_node_ids);
+	memcg_cachep = kmem_cache_create("mem_cgroup", memcg_size, 0,
+					 SLAB_PANIC | SLAB_HWCACHE_ALIGN, NULL);
+
 	return 0;
 }
 
-- 
2.48.1




* [PATCH v3 3/3] mm/memcg: use kmem_cache when alloc memcg pernode info
  2025-04-25  3:19 [PATCH v3 0/3] Use kmem_cache for memcg alloc Huan Yang
  2025-04-25  3:19 ` [PATCH v3 1/3] mm/memcg: move mem_cgroup_init() ahead of cgroup_init() Huan Yang
  2025-04-25  3:19 ` [PATCH v3 2/3] mm/memcg: use kmem_cache when alloc memcg Huan Yang
@ 2025-04-25  3:19 ` Huan Yang
  2025-04-25  4:12   ` Shakeel Butt
  2025-04-27 12:00   ` Johannes Weiner
  2025-04-25  4:35 ` [PATCH v3 0/3] Use kmem_cache for memcg alloc Matthew Wilcox
  3 siblings, 2 replies; 14+ messages in thread
From: Huan Yang @ 2025-04-25  3:19 UTC (permalink / raw)
  To: Johannes Weiner, Michal Hocko, Roman Gushchin, Shakeel Butt,
	Muchun Song, Andrew Morton, Petr Mladek, Vlastimil Babka,
	Rasmus Villemoes, Francesco Valla, Raul E Rangel,
	Paul E. McKenney, Huang Shijie, Guo Weikang,
	Uladzislau Rezki (Sony),
	KP Singh, cgroups, linux-mm, linux-kernel, Huan Yang,
	Hyeonggon Yoo, Paul Moore
  Cc: opensource.kernel

When tracing mem_cgroup_per_node allocations with kmalloc ftrace:

kmalloc: call_site=mem_cgroup_css_alloc+0x1d8/0x5b4 ptr=00000000d798700c
    bytes_req=2896 bytes_alloc=4096 gfp_flags=GFP_KERNEL|__GFP_ZERO node=0
    accounted=false

This reveals that the slab allocator provides 4096B chunks for the 2896B
mem_cgroup_per_node because:

1. The slab allocator predefines bucket sizes from 64B to 8KB
2. The mem_cgroup_per_node allocation size (2896B) falls between the 2KB
   and 4KB slabs
3. The allocator rounds up to the nearest larger slab (4KB), resulting in
   ~1KB wasted memory per memcg alloc - per node.

This patch introduces a dedicated kmem_cache for mem_cgroup_per_node structs,
achieving precise memory allocation. Post-patch ftrace verification shows:

kmem_cache_alloc: call_site=mem_cgroup_css_alloc+0x1b8/0x5d4
    ptr=000000002989e63a bytes_req=2896 bytes_alloc=2944
    gfp_flags=GFP_KERNEL|__GFP_ZERO node=0 accounted=false

Each mem_cgroup_per_node allocation now takes 2944 bytes (including HW
cacheline alignment) instead of 4096, avoiding the waste.

Signed-off-by: Huan Yang <link@vivo.com>
---
 mm/memcontrol.c | 7 ++++++-
 1 file changed, 6 insertions(+), 1 deletion(-)

diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index e34216e55688..af1cd5adfd6c 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -96,6 +96,7 @@ static bool cgroup_memory_nokmem __ro_after_init;
 static bool cgroup_memory_nobpf __ro_after_init;
 
 static struct kmem_cache *memcg_cachep;
+static struct kmem_cache *memcg_pn_cachep;
 
 #ifdef CONFIG_CGROUP_WRITEBACK
 static DECLARE_WAIT_QUEUE_HEAD(memcg_cgwb_frn_waitq);
@@ -3601,7 +3602,8 @@ static bool alloc_mem_cgroup_per_node_info(struct mem_cgroup *memcg, int node)
 {
 	struct mem_cgroup_per_node *pn;
 
-	pn = kzalloc_node(sizeof(*pn), GFP_KERNEL, node);
+	pn = kmem_cache_alloc_node(memcg_pn_cachep, GFP_KERNEL | __GFP_ZERO,
+				   node);
 	if (!pn)
 		return false;
 
@@ -5062,6 +5064,9 @@ int __init mem_cgroup_init(void)
 	memcg_cachep = kmem_cache_create("mem_cgroup", memcg_size, 0,
 					 SLAB_PANIC | SLAB_HWCACHE_ALIGN, NULL);
 
+	memcg_pn_cachep = KMEM_CACHE(mem_cgroup_per_node,
+				     SLAB_PANIC | SLAB_HWCACHE_ALIGN);
+
 	return 0;
 }
 
-- 
2.48.1




* Re: [PATCH v3 1/3] mm/memcg: move mem_cgroup_init() ahead of cgroup_init()
  2025-04-25  3:19 ` [PATCH v3 1/3] mm/memcg: move mem_cgroup_init() ahead of cgroup_init() Huan Yang
@ 2025-04-25  4:11   ` Shakeel Butt
  2025-04-28  2:20     ` Huan Yang
  2025-04-27 11:46   ` Johannes Weiner
  1 sibling, 1 reply; 14+ messages in thread
From: Shakeel Butt @ 2025-04-25  4:11 UTC (permalink / raw)
  To: Huan Yang
  Cc: Johannes Weiner, Michal Hocko, Roman Gushchin, Muchun Song,
	Andrew Morton, Petr Mladek, Vlastimil Babka, Rasmus Villemoes,
	Francesco Valla, Raul E Rangel, Paul E. McKenney, Huang Shijie,
	Guo Weikang, Uladzislau Rezki (Sony),
	KP Singh, cgroups, linux-mm, linux-kernel, Alexander Gordeev,
	Boqun Feng, opensource.kernel

On Thu, Apr 24, 2025 at 8:19 PM Huan Yang <link@vivo.com> wrote:
>
> When cgroup_init() creates root_mem_cgroup through css_alloc callback,
> some critical resources might not be fully initialized, forcing later
> operations to perform conditional checks for resource availability.
>
> This patch moves mem_cgroup_init() to address the init order: it is now
> invoked before cgroup_init(), so, compared to subsys_initcall, it can be
> used to prepare key resources before root_mem_cgroup is allocated.
>
> Signed-off-by: Huan Yang <link@vivo.com>
> Suggested-by: Shakeel Butt <shakeel.butt@linux.dev>

Acked-by: Shakeel Butt <shakeel.butt@linux.dev>



* Re: [PATCH v3 2/3] mm/memcg: use kmem_cache when alloc memcg
  2025-04-25  3:19 ` [PATCH v3 2/3] mm/memcg: use kmem_cache when alloc memcg Huan Yang
@ 2025-04-25  4:12   ` Shakeel Butt
  2025-04-27 11:56   ` Johannes Weiner
  1 sibling, 0 replies; 14+ messages in thread
From: Shakeel Butt @ 2025-04-25  4:12 UTC (permalink / raw)
  To: Huan Yang
  Cc: Johannes Weiner, Michal Hocko, Roman Gushchin, Muchun Song,
	Andrew Morton, Petr Mladek, Vlastimil Babka, Rasmus Villemoes,
	Francesco Valla, Raul E Rangel, Paul E. McKenney, Huang Shijie,
	Guo Weikang, Uladzislau Rezki (Sony),
	KP Singh, cgroups, linux-mm, linux-kernel, opensource.kernel

On Thu, Apr 24, 2025 at 8:19 PM Huan Yang <link@vivo.com> wrote:
>
> When tracing mem_cgroup_alloc() with kmalloc ftrace, we observe:
>
> kmalloc: call_site=mem_cgroup_css_alloc+0xd8/0x5b4 ptr=000000003e4c3799
>     bytes_req=2312 bytes_alloc=4096 gfp_flags=GFP_KERNEL|__GFP_ZERO node=-1
>     accounted=false
>
> The output indicates that while allocating mem_cgroup struct (2312 bytes),
> the slab allocator actually provides 4096-byte chunks. This occurs because:
>
> 1. The slab allocator predefines bucket sizes from 64B to 8KB
> 2. The mem_cgroup allocation size (2312B) falls between the 2KB and 4KB
>    slabs
> 3. The allocator rounds up to the nearest larger slab (4KB), resulting in
>    ~1KB wasted memory per allocation
>
> This patch introduces a dedicated kmem_cache for mem_cgroup structs,
> achieving precise memory allocation. Post-patch ftrace verification shows:
>
> kmem_cache_alloc: call_site=mem_cgroup_css_alloc+0xbc/0x5d4
>     ptr=00000000695c1806 bytes_req=2312 bytes_alloc=2368
>     gfp_flags=GFP_KERNEL|__GFP_ZERO node=-1 accounted=false
>
> Each memcg allocation now takes 2368 bytes (including HW cacheline
> alignment) instead of 4096, avoiding the waste.
>
> Signed-off-by: Huan Yang <link@vivo.com>

Acked-by: Shakeel Butt <shakeel.butt@linux.dev>



* Re: [PATCH v3 3/3] mm/memcg: use kmem_cache when alloc memcg pernode info
  2025-04-25  3:19 ` [PATCH v3 3/3] mm/memcg: use kmem_cache when alloc memcg pernode info Huan Yang
@ 2025-04-25  4:12   ` Shakeel Butt
  2025-04-27 12:00   ` Johannes Weiner
  1 sibling, 0 replies; 14+ messages in thread
From: Shakeel Butt @ 2025-04-25  4:12 UTC (permalink / raw)
  To: Huan Yang
  Cc: Johannes Weiner, Michal Hocko, Roman Gushchin, Muchun Song,
	Andrew Morton, Petr Mladek, Vlastimil Babka, Rasmus Villemoes,
	Francesco Valla, Raul E Rangel, Paul E. McKenney, Huang Shijie,
	Guo Weikang, Uladzislau Rezki (Sony),
	KP Singh, cgroups, linux-mm, linux-kernel, Hyeonggon Yoo,
	Paul Moore, opensource.kernel

On Thu, Apr 24, 2025 at 8:19 PM Huan Yang <link@vivo.com> wrote:
>
> When tracing mem_cgroup_per_node allocations with kmalloc ftrace:
>
> kmalloc: call_site=mem_cgroup_css_alloc+0x1d8/0x5b4 ptr=00000000d798700c
>     bytes_req=2896 bytes_alloc=4096 gfp_flags=GFP_KERNEL|__GFP_ZERO node=0
>     accounted=false
>
> This reveals that the slab allocator provides 4096B chunks for the 2896B
> mem_cgroup_per_node because:
>
> 1. The slab allocator predefines bucket sizes from 64B to 8KB
> 2. The mem_cgroup_per_node allocation size (2896B) falls between the 2KB
>    and 4KB slabs
> 3. The allocator rounds up to the nearest larger slab (4KB), resulting in
>    ~1KB wasted memory per memcg alloc - per node.
>
> This patch introduces a dedicated kmem_cache for mem_cgroup_per_node structs,
> achieving precise memory allocation. Post-patch ftrace verification shows:
>
> kmem_cache_alloc: call_site=mem_cgroup_css_alloc+0x1b8/0x5d4
>     ptr=000000002989e63a bytes_req=2896 bytes_alloc=2944
>     gfp_flags=GFP_KERNEL|__GFP_ZERO node=0 accounted=false
>
> Each mem_cgroup_per_node allocation now takes 2944 bytes (including HW
> cacheline alignment) instead of 4096, avoiding the waste.
>
> Signed-off-by: Huan Yang <link@vivo.com>

Acked-by: Shakeel Butt <shakeel.butt@linux.dev>



* Re: [PATCH v3 0/3] Use kmem_cache for memcg alloc
  2025-04-25  3:19 [PATCH v3 0/3] Use kmem_cache for memcg alloc Huan Yang
                   ` (2 preceding siblings ...)
  2025-04-25  3:19 ` [PATCH v3 3/3] mm/memcg: use kmem_cache when alloc memcg pernode info Huan Yang
@ 2025-04-25  4:35 ` Matthew Wilcox
  2025-04-28  2:19   ` Huan Yang
  3 siblings, 1 reply; 14+ messages in thread
From: Matthew Wilcox @ 2025-04-25  4:35 UTC (permalink / raw)
  To: Huan Yang
  Cc: Johannes Weiner, Michal Hocko, Roman Gushchin, Shakeel Butt,
	Muchun Song, Andrew Morton, Petr Mladek, Vlastimil Babka,
	Rasmus Villemoes, Francesco Valla, Raul E Rangel,
	Paul E. McKenney, Huang Shijie, Guo Weikang,
	Uladzislau Rezki (Sony),
	KP Singh, cgroups, linux-mm, linux-kernel, opensource.kernel

On Fri, Apr 25, 2025 at 11:19:22AM +0800, Huan Yang wrote:
> Key Observations:
>   1. Both structures use kmalloc with requested sizes between 2KB and 4KB
>   2. Allocation alignment forces 4KB slab usage due to pre-defined sizes
>      (64B, 128B,..., 2KB, 4KB, 8KB)
>   3. Memory waste per memcg instance:
>       Base struct: 4096 - 2312 = 1784 bytes
>       Per-node struct: 4096 - 2896 = 1200 bytes
>       Total waste: 2984 bytes (1-node system)
>       NUMA scaling: (1200 + 8) * nr_node_ids bytes
> So each memcg instance wastes a non-trivial amount of memory.

[...]

> This indicates that the `mem_cgroup` struct now requests 2312 bytes
> and is allocated 2368 bytes, while `mem_cgroup_per_node` requests 2896 bytes
> and is allocated 2944 bytes.
> The slight increase in allocated size is due to `SLAB_HWCACHE_ALIGN` in the
> `kmem_cache`.
> 
> Without `SLAB_HWCACHE_ALIGN`, the allocation might appear as:
> 
>   # mem_cgroup struct allocation
>   sh-9269     [003] .....    80.396366: kmem_cache_alloc:
>     call_site=mem_cgroup_css_alloc+0xbc/0x5d4 ptr=000000005b12b475
>     bytes_req=2312 bytes_alloc=2312 gfp_flags=GFP_KERNEL|__GFP_ZERO node=-1
>     accounted=false
> 
>   # mem_cgroup_per_node allocation
>   sh-9269     [003] .....    80.396411: kmem_cache_alloc:
>     call_site=mem_cgroup_css_alloc+0x1b8/0x5d4 ptr=00000000f347adc6
>     bytes_req=2896 bytes_alloc=2896 gfp_flags=GFP_KERNEL|__GFP_ZERO node=0
>     accounted=false
> 
> While the `bytes_alloc` now matches the `bytes_req`, this patchset defaults
> to using `SLAB_HWCACHE_ALIGN` as it is generally considered more beneficial
> for performance. Please let me know if there are any issues or if I've
> misunderstood anything.

This isn't really the right way to think about this.  Memory is ultimately
allocated from the page allocator.  So what you want to know is how many
objects you get per page.  Before, it's one per page (since both objects
are between 2k and 4k and rounded up to 4k).  After, slab will create
slabs of a certain order to minimise waste, but also not inflate the
allocation order too high.  Let's assume it goes all the way to order 3
(like kmalloc-4k does), so you want to know how many objects fit in a
32KiB allocation.

With HWCACHE_ALIGN, you get floor(32768/2368) = 13 and
floor(32768/2944) = 11.

Without HWCACHE_ALIGN, you get floor(32768/2312) = 14 and
floor(32768/2896) = 11.

So there is a packing advantage to turning off HWCACHE_ALIGN (for the
first slab; no difference for the second).  BUT!  Now you have cacheline
aliasing between two objects, and that's probably bad.  It's the kind
of performance problem that's really hard to see.

Anyway, you've gone from allocating 8 objects per 32KiB to allocating
13 objects per 32KiB, a 62% improvement in memory consumption.



* Re: [PATCH v3 1/3] mm/memcg: move mem_cgroup_init() ahead of cgroup_init()
  2025-04-25  3:19 ` [PATCH v3 1/3] mm/memcg: move mem_cgroup_init() ahead of cgroup_init() Huan Yang
  2025-04-25  4:11   ` Shakeel Butt
@ 2025-04-27 11:46   ` Johannes Weiner
  2025-04-28  2:20     ` Huan Yang
  1 sibling, 1 reply; 14+ messages in thread
From: Johannes Weiner @ 2025-04-27 11:46 UTC (permalink / raw)
  To: Huan Yang
  Cc: Michal Hocko, Roman Gushchin, Shakeel Butt, Muchun Song,
	Andrew Morton, Petr Mladek, Vlastimil Babka, Rasmus Villemoes,
	Francesco Valla, Raul E Rangel, Paul E. McKenney, Huang Shijie,
	Guo Weikang, Uladzislau Rezki (Sony),
	KP Singh, cgroups, linux-mm, linux-kernel, Alexander Gordeev,
	Boqun Feng, opensource.kernel

On Fri, Apr 25, 2025 at 11:19:23AM +0800, Huan Yang wrote:
> When cgroup_init() creates root_mem_cgroup through css_alloc callback,
> some critical resources might not be fully initialized, forcing later
> operations to perform conditional checks for resource availability.
> 
> This patch moves mem_cgroup_init() to address the init order: it is now
> invoked before cgroup_init(), so, compared to subsys_initcall, it can be
> used to prepare key resources before root_mem_cgroup is allocated.
> 
> Signed-off-by: Huan Yang <link@vivo.com>
> Suggested-by: Shakeel Butt <shakeel.butt@linux.dev>

Acked-by: Johannes Weiner <hannes@cmpxchg.org>



* Re: [PATCH v3 2/3] mm/memcg: use kmem_cache when alloc memcg
  2025-04-25  3:19 ` [PATCH v3 2/3] mm/memcg: use kmem_cache when alloc memcg Huan Yang
  2025-04-25  4:12   ` Shakeel Butt
@ 2025-04-27 11:56   ` Johannes Weiner
  1 sibling, 0 replies; 14+ messages in thread
From: Johannes Weiner @ 2025-04-27 11:56 UTC (permalink / raw)
  To: Huan Yang
  Cc: Michal Hocko, Roman Gushchin, Shakeel Butt, Muchun Song,
	Andrew Morton, Petr Mladek, Vlastimil Babka, Rasmus Villemoes,
	Francesco Valla, Raul E Rangel, Paul E. McKenney, Huang Shijie,
	Guo Weikang, Uladzislau Rezki (Sony),
	KP Singh, cgroups, linux-mm, linux-kernel, opensource.kernel

On Fri, Apr 25, 2025 at 11:19:24AM +0800, Huan Yang wrote:
> When tracing mem_cgroup_alloc() with kmalloc ftrace, we observe:
> 
> kmalloc: call_site=mem_cgroup_css_alloc+0xd8/0x5b4 ptr=000000003e4c3799
>     bytes_req=2312 bytes_alloc=4096 gfp_flags=GFP_KERNEL|__GFP_ZERO node=-1
>     accounted=false
> 
> The output indicates that while allocating mem_cgroup struct (2312 bytes),
> the slab allocator actually provides 4096-byte chunks. This occurs because:
> 
> 1. The slab allocator predefines bucket sizes from 64B to 8KB
> 2. The mem_cgroup allocation size (2312B) falls between the 2KB and 4KB
>    slabs
> 3. The allocator rounds up to the nearest larger slab (4KB), resulting in
>    ~1KB wasted memory per allocation
> 
> This patch introduces a dedicated kmem_cache for mem_cgroup structs,
> achieving precise memory allocation. Post-patch ftrace verification shows:
> 
> kmem_cache_alloc: call_site=mem_cgroup_css_alloc+0xbc/0x5d4
>     ptr=00000000695c1806 bytes_req=2312 bytes_alloc=2368
>     gfp_flags=GFP_KERNEL|__GFP_ZERO node=-1 accounted=false
> 
> Each memcg allocation now takes 2368 bytes (including HW cacheline
> alignment) instead of 4096, avoiding the waste.
> 
> Signed-off-by: Huan Yang <link@vivo.com>

Acked-by: Johannes Weiner <hannes@cmpxchg.org>



* Re: [PATCH v3 3/3] mm/memcg: use kmem_cache when alloc memcg pernode info
  2025-04-25  3:19 ` [PATCH v3 3/3] mm/memcg: use kmem_cache when alloc memcg pernode info Huan Yang
  2025-04-25  4:12   ` Shakeel Butt
@ 2025-04-27 12:00   ` Johannes Weiner
  1 sibling, 0 replies; 14+ messages in thread
From: Johannes Weiner @ 2025-04-27 12:00 UTC (permalink / raw)
  To: Huan Yang
  Cc: Michal Hocko, Roman Gushchin, Shakeel Butt, Muchun Song,
	Andrew Morton, Petr Mladek, Vlastimil Babka, Rasmus Villemoes,
	Francesco Valla, Raul E Rangel, Paul E. McKenney, Huang Shijie,
	Guo Weikang, Uladzislau Rezki (Sony),
	KP Singh, cgroups, linux-mm, linux-kernel, Hyeonggon Yoo,
	Paul Moore, opensource.kernel

On Fri, Apr 25, 2025 at 11:19:25AM +0800, Huan Yang wrote:
> When tracing mem_cgroup_per_node allocations with kmalloc ftrace:
> 
> kmalloc: call_site=mem_cgroup_css_alloc+0x1d8/0x5b4 ptr=00000000d798700c
>     bytes_req=2896 bytes_alloc=4096 gfp_flags=GFP_KERNEL|__GFP_ZERO node=0
>     accounted=false
> 
> This reveals that the slab allocator provides 4096B chunks for the 2896B
> mem_cgroup_per_node because:
>
> 1. The slab allocator predefines bucket sizes from 64B to 8KB
> 2. The mem_cgroup_per_node allocation size (2896B) falls between the 2KB
>    and 4KB slabs
> 3. The allocator rounds up to the nearest larger slab (4KB), resulting in
>    ~1KB wasted memory per memcg alloc - per node.
> 
> This patch introduces a dedicated kmem_cache for mem_cgroup_per_node structs,
> achieving precise memory allocation. Post-patch ftrace verification shows:
> 
> kmem_cache_alloc: call_site=mem_cgroup_css_alloc+0x1b8/0x5d4
>     ptr=000000002989e63a bytes_req=2896 bytes_alloc=2944
>     gfp_flags=GFP_KERNEL|__GFP_ZERO node=0 accounted=false
> 
> Each mem_cgroup_per_node allocation now takes 2944 bytes (including HW
> cacheline alignment) instead of 4096, avoiding the waste.
> 
> Signed-off-by: Huan Yang <link@vivo.com>

Acked-by: Johannes Weiner <hannes@cmpxchg.org>



* Re: [PATCH v3 0/3] Use kmem_cache for memcg alloc
  2025-04-25  4:35 ` [PATCH v3 0/3] Use kmem_cache for memcg alloc Matthew Wilcox
@ 2025-04-28  2:19   ` Huan Yang
  0 siblings, 0 replies; 14+ messages in thread
From: Huan Yang @ 2025-04-28  2:19 UTC (permalink / raw)
  To: Matthew Wilcox
  Cc: Johannes Weiner, Michal Hocko, Roman Gushchin, Shakeel Butt,
	Muchun Song, Andrew Morton, Petr Mladek, Vlastimil Babka,
	Rasmus Villemoes, Francesco Valla, Raul E Rangel,
	Paul E. McKenney, Huang Shijie, Guo Weikang,
	Uladzislau Rezki (Sony),
	KP Singh, cgroups, linux-mm, linux-kernel, opensource.kernel

Hi Matthew,

On 2025/4/25 12:35, Matthew Wilcox wrote:
> On Fri, Apr 25, 2025 at 11:19:22AM +0800, Huan Yang wrote:
>> Key Observations:
>>    1. Both structures use kmalloc with requested sizes between 2KB and 4KB
>>    2. Allocation alignment forces 4KB slab usage due to pre-defined sizes
>>       (64B, 128B,..., 2KB, 4KB, 8KB)
>>    3. Memory waste per memcg instance:
>>        Base struct: 4096 - 2312 = 1784 bytes
>>        Per-node struct: 4096 - 2896 = 1200 bytes
>>        Total waste: 2984 bytes (1-node system)
>>        NUMA scaling: (1200 + 8) * nr_node_ids bytes
>> So each memcg instance wastes a non-trivial amount of memory.
> [...]
>
>> This indicates that the `mem_cgroup` struct now requests 2312 bytes
>> and is allocated 2368 bytes, while `mem_cgroup_per_node` requests 2896 bytes
>> and is allocated 2944 bytes.
>> The slight increase in allocated size is due to `SLAB_HWCACHE_ALIGN` in the
>> `kmem_cache`.
>>
>> Without `SLAB_HWCACHE_ALIGN`, the allocation might appear as:
>>
>>    # mem_cgroup struct allocation
>>    sh-9269     [003] .....    80.396366: kmem_cache_alloc:
>>      call_site=mem_cgroup_css_alloc+0xbc/0x5d4 ptr=000000005b12b475
>>      bytes_req=2312 bytes_alloc=2312 gfp_flags=GFP_KERNEL|__GFP_ZERO node=-1
>>      accounted=false
>>
>>    # mem_cgroup_per_node allocation
>>    sh-9269     [003] .....    80.396411: kmem_cache_alloc:
>>      call_site=mem_cgroup_css_alloc+0x1b8/0x5d4 ptr=00000000f347adc6
>>      bytes_req=2896 bytes_alloc=2896 gfp_flags=GFP_KERNEL|__GFP_ZERO node=0
>>      accounted=false
>>
>> While the `bytes_alloc` now matches the `bytes_req`, this patchset defaults
>> to using `SLAB_HWCACHE_ALIGN` as it is generally considered more beneficial
>> for performance. Please let me know if there are any issues or if I've
>> misunderstood anything.
> This isn't really the right way to think about this.  Memory is ultimately
> allocated from the page allocator.  So what you want to know is how many
> objects you get per page.  Before, it's one per page (since both objects
> are between 2k and 4k and rounded up to 4k).  After, slab will create
> slabs of a certain order to minimise waste, but also not inflate the
> allocation order too high.  Let's assume it goes all the way to order 3
> (like kmalloc-4k does), so you want to know how many objects fit in a
> 32KiB allocation.
>
> With HWCACHE_ALIGN, you get floor(32768/2368) = 13 and
> floor(32768/2944) = 11.
>
> Without HWCACHE_ALIGN, you get floor(32768/2312) = 14 and
> floor(32768/2896) = 11.

Yes, thanks. And this can easily be observed with the following commands:

   # show mem_cgroup slab's order, it's 3.
   cat /sys/kernel/slab/mem_cgroup/order
   # show mem_cgroup slab's objs per slab, it's 13
   cat /sys/kernel/slab/mem_cgroup/objs_per_slab

And we can quickly calculate, for each page order the slab allocator might use, how many objects it can store:

# mem_cgroup, 2368 size

| ORDER | SIZE | NUM_OBJS | ORIGIN |
| ----- | ---- | -------- | ------ |
| 3     | 32KB | 13       | 8      |
| 2     | 16KB | 6        | 4      |
| 1     | 8KB  | 3        | 2      |
| 0     | 4KB  | 1        | 1      |

# mem_cgroup_per_node, 2944 size

| ORDER | SIZE | NUM_OBJS | ORIGIN |
| ----- | ---- | -------- | ------ |
| 3     | 32KB | 11       | 8      |
| 2     | 16KB | 5        | 4      |
| 1     | 8KB  | 2        | 2      |
| 0     | 4KB  | 1        | 1      |

So for mem_cgroup there is already a gain at page order 1 and above, while mem_cgroup_per_node needs at least order 2. :)
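
For reference, the NUM_OBJS column is just floor(slab size / object size),
e.g. for order 3:

  echo $(( (4096 << 3) / 2368 ))   # 13 mem_cgroup objects per 32KB slab
  echo $(( (4096 << 3) / 2944 ))   # 11 mem_cgroup_per_node objects per 32KB slab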

>
> So there is a packing advantage to turning off HWCACHE_ALIGN (for the
> first slab; no difference for the second).  BUT!  Now you have cacheline
> aliasing between two objects, and that's probably bad.  It's the kind
> of performance problem that's really hard to see.

Yes. And I would like to learn: in what situations do you think dropping HWCACHE_ALIGN might cause issues?

Could it be direct memory reclaim by multiple processes? Or multiple processes charging memory simultaneously?

>
> Anyway, you've gone from allocating 8 objects per 32KiB to allocating
> 13 objects per 32KiB, a 62% improvement in memory consumption.

Thanks, that's much clearer.

Huan




* Re: [PATCH v3 1/3] mm/memcg: move mem_cgroup_init() ahead of cgroup_init()
  2025-04-25  4:11   ` Shakeel Butt
@ 2025-04-28  2:20     ` Huan Yang
  0 siblings, 0 replies; 14+ messages in thread
From: Huan Yang @ 2025-04-28  2:20 UTC (permalink / raw)
  To: Shakeel Butt
  Cc: Johannes Weiner, Michal Hocko, Roman Gushchin, Muchun Song,
	Andrew Morton, Petr Mladek, Vlastimil Babka, Rasmus Villemoes,
	Francesco Valla, Raul E Rangel, Paul E. McKenney, Huang Shijie,
	Guo Weikang, Uladzislau Rezki (Sony),
	KP Singh, cgroups, linux-mm, linux-kernel, Alexander Gordeev,
	Boqun Feng, opensource.kernel


On 2025/4/25 12:11, Shakeel Butt wrote:
> On Thu, Apr 24, 2025 at 8:19 PM Huan Yang <link@vivo.com> wrote:
>> When cgroup_init() creates root_mem_cgroup through css_alloc callback,
>> some critical resources might not be fully initialized, forcing later
>> operations to perform conditional checks for resource availability.
>>
>> This patch moves mem_cgroup_init() to address the init order: it is now
>> invoked before cgroup_init(), so, compared to subsys_initcall, it can be
>> used to prepare key resources before root_mem_cgroup is allocated.
>>
>> Signed-off-by: Huan Yang <link@vivo.com>
>> Suggested-by: Shakeel Butt <shakeel.butt@linux.dev>
> Acked-by: Shakeel Butt <shakeel.butt@linux.dev>
thx.
>



* Re: [PATCH v3 1/3] mm/memcg: move mem_cgroup_init() ahead of cgroup_init()
  2025-04-27 11:46   ` Johannes Weiner
@ 2025-04-28  2:20     ` Huan Yang
  0 siblings, 0 replies; 14+ messages in thread
From: Huan Yang @ 2025-04-28  2:20 UTC (permalink / raw)
  To: Johannes Weiner
  Cc: Michal Hocko, Roman Gushchin, Shakeel Butt, Muchun Song,
	Andrew Morton, Petr Mladek, Vlastimil Babka, Rasmus Villemoes,
	Francesco Valla, Raul E Rangel, Paul E. McKenney, Huang Shijie,
	Guo Weikang, Uladzislau Rezki (Sony),
	KP Singh, cgroups, linux-mm, linux-kernel, Alexander Gordeev,
	Boqun Feng, opensource.kernel


On 2025/4/27 19:46, Johannes Weiner wrote:
> On Fri, Apr 25, 2025 at 11:19:23AM +0800, Huan Yang wrote:
>> When cgroup_init() creates root_mem_cgroup through css_alloc callback,
>> some critical resources might not be fully initialized, forcing later
>> operations to perform conditional checks for resource availability.
>>
>> This patch moves mem_cgroup_init() to address the init order: it is now
>> invoked before cgroup_init(), so, compared to subsys_initcall, it can be
>> used to prepare key resources before root_mem_cgroup is allocated.
>>
>> Signed-off-by: Huan Yang <link@vivo.com>
>> Suggested-by: Shakeel Butt <shakeel.butt@linux.dev>
> Acked-by: Johannes Weiner <hannes@cmpxchg.org>
thx.


