* [RFC 0/2] An attempt to improve SLUB on NUMA / under memory pressure
@ 2023-07-23 19:09 Hyeonggon Yoo
  2023-07-23 19:09 ` [RFC 1/2] Revert "mm, slub: change percpu partial accounting from objects to pages" Hyeonggon Yoo
                   ` (2 more replies)
  0 siblings, 3 replies; 12+ messages in thread
From: Hyeonggon Yoo @ 2023-07-23 19:09 UTC (permalink / raw)
  To: Vlastimil Babka, Christoph Lameter, Pekka Enberg, Joonsoo Kim,
	David Rientjes, Andrew Morton
  Cc: Roman Gushchin, Feng Tang, Sang, Oliver, Jay Patel, Binder Makin,
	aneesh.kumar, tsahu, piyushs, fengwei.yin, ying.huang, lkp,
	oe-lkp, linux-mm, linux-kernel, Hyeonggon Yoo

Hello folks,

This series is motivated by a kernel test robot report [1] on Jay's patch
that modifies the slab order. While the patch was not merged and not in its
final form, I think it was a good lesson that changing the slab order has
more impact on performance than we expected.

While inspecting the report, I found some potential points to improve
SLUB. [2] It's _potential_ because it shows no improvement on hackbench,
but I believe more realistic workloads would benefit from this. Due to a
lack of resources and of my own understanding of *realistic* workloads,
I am asking you to help evaluate this together.

It only consists of two patches. Patch #1 addresses an inaccuracy in
SLUB's heuristic, which can negatively affect workloads' performance
when large folios are not available from the buddy allocator.

Patch #2 changes SLUB's behavior when there are no slabs available on the
local node's partial slab list, increasing NUMA locality when memory is
available (without reclamation) on the local node from the buddy allocator.

This is at an early stage, but I think it's good enough to start a discussion.
Any feedback and ideas are welcome. Thank you in advance!

Hyeonggon

https://lore.kernel.org/linux-mm/202307172140.3b34825a-oliver.sang@intel.com [1]
https://lore.kernel.org/linux-mm/CAB=+i9S6Ykp90+4N1kCE=hiTJTE4wzJDi8k5pBjjO_3sf0aeqg@mail.gmail.com [2]

Hyeonggon Yoo (2):
  Revert "mm, slub: change percpu partial accounting from objects to
    pages"
  mm/slub: prefer NUMA locality over slight memory saving on NUMA
    machines

 include/linux/slub_def.h |  2 --
 mm/slab.h                |  6 ++++
 mm/slub.c                | 76 ++++++++++++++++++++++++++--------------
 3 files changed, 55 insertions(+), 29 deletions(-)

-- 
2.41.0




* [RFC 1/2] Revert "mm, slub: change percpu partial accounting from objects to pages"
  2023-07-23 19:09 [RFC 0/2] An attempt to improve SLUB on NUMA / under memory pressure Hyeonggon Yoo
@ 2023-07-23 19:09 ` Hyeonggon Yoo
  2023-07-26 10:34   ` Vlastimil Babka
  2023-07-23 19:09 ` [RFC 2/2] mm/slub: prefer NUMA locality over slight memory saving on NUMA machines Hyeonggon Yoo
  2023-08-10 10:55 ` [RFC 0/2] An attempt to improve SLUB on NUMA / under memory pressure Jay Patel
  2 siblings, 1 reply; 12+ messages in thread
From: Hyeonggon Yoo @ 2023-07-23 19:09 UTC (permalink / raw)
  To: Vlastimil Babka, Christoph Lameter, Pekka Enberg, Joonsoo Kim,
	David Rientjes, Andrew Morton
  Cc: Roman Gushchin, Feng Tang, Sang, Oliver, Jay Patel, Binder Makin,
	aneesh.kumar, tsahu, piyushs, fengwei.yin, ying.huang, lkp,
	oe-lkp, linux-mm, linux-kernel, Hyeonggon Yoo

This is a partial revert of commit b47291ef02b0 ("mm, slub: change percpu
partial accounting from objects to pages") and a full revert of commit
662188c3a20e ("mm/slub: Simplify struct slab slabs field definition").

While b47291ef02b0 prevents the percpu partial slab list from becoming too
long, it assumes that the order of slabs is always oo_order(s->oo).

The current approach can unexpectedly lower the number of objects cached
per cpu when it fails to allocate high-order slabs. Instead of accounting
the number of slabs, change it back to accounting objects, but keep
the assumption that a slab is always half-full.

With this change, the number of cached objects per cpu is no longer
unexpectedly decreased when allocating high-order slabs fails. It still
prevents large inaccuracy because it does not account based on the
number of free objects when taking slabs.
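
As a rough illustration with hypothetical numbers (assuming only the
half-full approximation that both schemes use): with cpu_partial = 120
objects and oo_objects(s->oo) = 32, the current code precomputes

	cpu_partial_slabs = DIV_ROUND_UP(120 * 2, 32) = 8 slabs

which corresponds to ~120 objects only while order-oo allocations succeed.
If allocation falls back to order-0 slabs holding, say, 4 objects each,
those 8 half-full slabs cache only ~16 objects. Accounting objects
directly, as this patch does via

	pobjects += slab->objects / 2;

keeps the cap at ~120 objects regardless of the slab order actually
allocated.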
---
 include/linux/slub_def.h |  2 --
 mm/slab.h                |  6 ++++++
 mm/slub.c                | 31 ++++++++++++-------------------
 3 files changed, 18 insertions(+), 21 deletions(-)

diff --git a/include/linux/slub_def.h b/include/linux/slub_def.h
index deb90cf4bffb..589ff6a2a23f 100644
--- a/include/linux/slub_def.h
+++ b/include/linux/slub_def.h
@@ -109,8 +109,6 @@ struct kmem_cache {
 #ifdef CONFIG_SLUB_CPU_PARTIAL
 	/* Number of per cpu partial objects to keep around */
 	unsigned int cpu_partial;
-	/* Number of per cpu partial slabs to keep around */
-	unsigned int cpu_partial_slabs;
 #endif
 	struct kmem_cache_order_objects oo;
 
diff --git a/mm/slab.h b/mm/slab.h
index 799a315695c6..be38a264df16 100644
--- a/mm/slab.h
+++ b/mm/slab.h
@@ -65,7 +65,13 @@ struct slab {
 #ifdef CONFIG_SLUB_CPU_PARTIAL
 				struct {
 					struct slab *next;
+#ifdef CONFIG_64BIT
 					int slabs;	/* Nr of slabs left */
+					int pobjects;	/* Approximate count */
+#else
+					short int slabs;
+					short int pobjects;
+#endif
 				};
 #endif
 			};
diff --git a/mm/slub.c b/mm/slub.c
index f7940048138c..199d3d03d5b9 100644
--- a/mm/slub.c
+++ b/mm/slub.c
@@ -486,18 +486,7 @@ static inline unsigned int oo_objects(struct kmem_cache_order_objects x)
 #ifdef CONFIG_SLUB_CPU_PARTIAL
 static void slub_set_cpu_partial(struct kmem_cache *s, unsigned int nr_objects)
 {
-	unsigned int nr_slabs;
-
 	s->cpu_partial = nr_objects;
-
-	/*
-	 * We take the number of objects but actually limit the number of
-	 * slabs on the per cpu partial list, in order to limit excessive
-	 * growth of the list. For simplicity we assume that the slabs will
-	 * be half-full.
-	 */
-	nr_slabs = DIV_ROUND_UP(nr_objects * 2, oo_objects(s->oo));
-	s->cpu_partial_slabs = nr_slabs;
 }
 #else
 static inline void
@@ -2275,7 +2264,7 @@ static void *get_partial_node(struct kmem_cache *s, struct kmem_cache_node *n,
 	struct slab *slab, *slab2;
 	void *object = NULL;
 	unsigned long flags;
-	unsigned int partial_slabs = 0;
+	int objects_taken = 0;
 
 	/*
 	 * Racy check. If we mistakenly see no partial slabs then we
@@ -2312,11 +2301,11 @@ static void *get_partial_node(struct kmem_cache *s, struct kmem_cache_node *n,
 		} else {
 			put_cpu_partial(s, slab, 0);
 			stat(s, CPU_PARTIAL_NODE);
-			partial_slabs++;
+			objects_taken += slab->objects / 2;
 		}
 #ifdef CONFIG_SLUB_CPU_PARTIAL
 		if (!kmem_cache_has_cpu_partial(s)
-			|| partial_slabs > s->cpu_partial_slabs / 2)
+			|| objects_taken > s->cpu_partial / 2)
 			break;
 #else
 		break;
@@ -2699,13 +2688,14 @@ static void put_cpu_partial(struct kmem_cache *s, struct slab *slab, int drain)
 	struct slab *slab_to_unfreeze = NULL;
 	unsigned long flags;
 	int slabs = 0;
+	int pobjects = 0;
 
 	local_lock_irqsave(&s->cpu_slab->lock, flags);
 
 	oldslab = this_cpu_read(s->cpu_slab->partial);
 
 	if (oldslab) {
-		if (drain && oldslab->slabs >= s->cpu_partial_slabs) {
+		if (drain && oldslab->pobjects >= s->cpu_partial) {
 			/*
 			 * Partial array is full. Move the existing set to the
 			 * per node partial list. Postpone the actual unfreezing
@@ -2714,14 +2704,17 @@ static void put_cpu_partial(struct kmem_cache *s, struct slab *slab, int drain)
 			slab_to_unfreeze = oldslab;
 			oldslab = NULL;
 		} else {
+			pobjects = oldslab->pobjects;
 			slabs = oldslab->slabs;
 		}
 	}
 
 	slabs++;
+	pobjects += slab->objects / 2;
 
 	slab->slabs = slabs;
 	slab->next = oldslab;
+	slab->pobjects = pobjects;
 
 	this_cpu_write(s->cpu_slab->partial, slab);
 
@@ -5653,13 +5646,13 @@ static ssize_t slabs_cpu_partial_show(struct kmem_cache *s, char *buf)
 
 		slab = slub_percpu_partial(per_cpu_ptr(s->cpu_slab, cpu));
 
-		if (slab)
+		if (slab) {
 			slabs += slab->slabs;
+			objects += slab->objects;
+		}
 	}
 #endif
 
-	/* Approximate half-full slabs, see slub_set_cpu_partial() */
-	objects = (slabs * oo_objects(s->oo)) / 2;
 	len += sysfs_emit_at(buf, len, "%d(%d)", objects, slabs);
 
 #ifdef CONFIG_SLUB_CPU_PARTIAL
@@ -5669,7 +5662,7 @@ static ssize_t slabs_cpu_partial_show(struct kmem_cache *s, char *buf)
 		slab = slub_percpu_partial(per_cpu_ptr(s->cpu_slab, cpu));
 		if (slab) {
 			slabs = READ_ONCE(slab->slabs);
-			objects = (slabs * oo_objects(s->oo)) / 2;
+			objects = READ_ONCE(slab->pobjects);
 			len += sysfs_emit_at(buf, len, " C%d=%d(%d)",
 					     cpu, objects, slabs);
 		}
-- 
2.41.0




* [RFC 2/2] mm/slub: prefer NUMA locality over slight memory saving on NUMA machines
  2023-07-23 19:09 [RFC 0/2] An attempt to improve SLUB on NUMA / under memory pressure Hyeonggon Yoo
  2023-07-23 19:09 ` [RFC 1/2] Revert "mm, slub: change percpu partial accounting from objects to pages" Hyeonggon Yoo
@ 2023-07-23 19:09 ` Hyeonggon Yoo
  2023-08-03 14:54   ` Vlastimil Babka
  2023-08-10 10:55 ` [RFC 0/2] An attempt to improve SLUB on NUMA / under memory pressure Jay Patel
  2 siblings, 1 reply; 12+ messages in thread
From: Hyeonggon Yoo @ 2023-07-23 19:09 UTC (permalink / raw)
  To: Vlastimil Babka, Christoph Lameter, Pekka Enberg, Joonsoo Kim,
	David Rientjes, Andrew Morton
  Cc: Roman Gushchin, Feng Tang, Sang, Oliver, Jay Patel, Binder Makin,
	aneesh.kumar, tsahu, piyushs, fengwei.yin, ying.huang, lkp,
	oe-lkp, linux-mm, linux-kernel, Hyeonggon Yoo

By default, SLUB sets remote_node_defrag_ratio to 1000, which makes it
(in most cases) take slabs from remote nodes first before trying to allocate
new folios on the local node from the buddy allocator.

Documentation/ABI/testing/sysfs-kernel-slab says:
> The file remote_node_defrag_ratio specifies the percentage of
> times SLUB will attempt to refill the cpu slab with a partial
> slab from a remote node as opposed to allocating a new slab on
> the local node.  This reduces the amount of wasted memory over
> the entire system but can be expensive.

Although this made sense when it was introduced, the share of per-node
partial lists in overall SLUB memory usage has decreased since the
introduction of per-cpu partial lists. Therefore, it's worth reevaluating
its impact on performance and memory usage.

[
	XXX: Add performance data. I tried to measure its impact on
	hackbench with a 2-socket NUMA machine, but it seems hackbench is
	too synthetic to benefit from this, because skbuff_head_cache's
	size fits into the last level cache.

	Probably more realistic workloads like netperf would benefit
	from this?
]

Set remote_node_defrag_ratio to zero by default, and the new behavior is:
	1) try refilling per CPU partial list from the local node
	2) try allocating new slabs from the local node without reclamation
	3) try refilling per CPU partial list from remote nodes
	4) try allocating new slabs from the local node or remote nodes

If the user specifies remote_node_defrag_ratio, it probabilistically tries
3) first and then tries 2) and 4) in order, to avoid an unexpected behavioral
change from the user's perspective.
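
To make "probabilistically" concrete (this is just arithmetic on the
existing check in get_any_partial(), visible in the hunk below):

	if (!s->remote_node_defrag_ratio ||
			get_cycles() % 1024 > s->remote_node_defrag_ratio)
		return NULL;

With the old default of 1000, the remote path is skipped only when
get_cycles() % 1024 falls in 1001..1023, i.e. roughly 2% of the time.
With the new default of 0, this check always skips it, and remote partial
lists are reached only via the new force_defrag fallback in
___slab_alloc().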
---
 mm/slub.c | 45 +++++++++++++++++++++++++++++++++++++--------
 1 file changed, 37 insertions(+), 8 deletions(-)

diff --git a/mm/slub.c b/mm/slub.c
index 199d3d03d5b9..cfdea3e3e221 100644
--- a/mm/slub.c
+++ b/mm/slub.c
@@ -2319,7 +2319,8 @@ static void *get_partial_node(struct kmem_cache *s, struct kmem_cache_node *n,
 /*
  * Get a slab from somewhere. Search in increasing NUMA distances.
  */
-static void *get_any_partial(struct kmem_cache *s, struct partial_context *pc)
+static void *get_any_partial(struct kmem_cache *s, struct partial_context *pc,
+			     bool force_defrag)
 {
 #ifdef CONFIG_NUMA
 	struct zonelist *zonelist;
@@ -2347,8 +2348,8 @@ static void *get_any_partial(struct kmem_cache *s, struct partial_context *pc)
 	 * may be expensive if we do it every time we are trying to find a slab
 	 * with available objects.
 	 */
-	if (!s->remote_node_defrag_ratio ||
-			get_cycles() % 1024 > s->remote_node_defrag_ratio)
+	if (!force_defrag && (!s->remote_node_defrag_ratio ||
+			get_cycles() % 1024 > s->remote_node_defrag_ratio))
 		return NULL;
 
 	do {
@@ -2382,7 +2383,8 @@ static void *get_any_partial(struct kmem_cache *s, struct partial_context *pc)
 /*
  * Get a partial slab, lock it and return it.
  */
-static void *get_partial(struct kmem_cache *s, int node, struct partial_context *pc)
+static void *get_partial(struct kmem_cache *s, int node, struct partial_context *pc,
+			 bool force_defrag)
 {
 	void *object;
 	int searchnode = node;
@@ -2394,7 +2396,7 @@ static void *get_partial(struct kmem_cache *s, int node, struct partial_context
 	if (object || node != NUMA_NO_NODE)
 		return object;
 
-	return get_any_partial(s, pc);
+	return get_any_partial(s, pc, force_defrag);
 }
 
 #ifndef CONFIG_SLUB_TINY
@@ -3092,6 +3094,7 @@ static void *___slab_alloc(struct kmem_cache *s, gfp_t gfpflags, int node,
 	struct slab *slab;
 	unsigned long flags;
 	struct partial_context pc;
+	gfp_t local_flags;
 
 	stat(s, ALLOC_SLOWPATH);
 
@@ -3208,10 +3211,35 @@ static void *___slab_alloc(struct kmem_cache *s, gfp_t gfpflags, int node,
 	pc.flags = gfpflags;
 	pc.slab = &slab;
 	pc.orig_size = orig_size;
-	freelist = get_partial(s, node, &pc);
+
+	freelist = get_partial(s, node, &pc, false);
 	if (freelist)
 		goto check_new_slab;
 
+	/*
+	 * try allocating slab from the local node first before taking slabs
+	 * from remote nodes. If user specified remote_node_defrag_ratio,
+	 * try taking slabs from remote nodes first.
+	 */
+	slub_put_cpu_ptr(s->cpu_slab);
+	local_flags = (gfpflags | __GFP_NOWARN | __GFP_THISNODE);
+	local_flags &= ~(__GFP_NOFAIL | __GFP_RECLAIM);
+	slab = new_slab(s, local_flags, node);
+	c = slub_get_cpu_ptr(s->cpu_slab);
+
+	if (slab)
+		goto alloc_slab;
+
+	/*
+	 * At this point no memory can be allocated lightly.
+	 * Take slabs from remote nodes.
+	 */
+	if (node == NUMA_NO_NODE) {
+		freelist = get_any_partial(s, &pc, true);
+		if (freelist)
+			goto check_new_slab;
+	}
+
 	slub_put_cpu_ptr(s->cpu_slab);
 	slab = new_slab(s, gfpflags, node);
 	c = slub_get_cpu_ptr(s->cpu_slab);
@@ -3221,6 +3249,7 @@ static void *___slab_alloc(struct kmem_cache *s, gfp_t gfpflags, int node,
 		return NULL;
 	}
 
+alloc_slab:
 	stat(s, ALLOC_SLAB);
 
 	if (kmem_cache_debug(s)) {
@@ -3404,7 +3433,7 @@ static void *__slab_alloc_node(struct kmem_cache *s,
 	pc.flags = gfpflags;
 	pc.slab = &slab;
 	pc.orig_size = orig_size;
-	object = get_partial(s, node, &pc);
+	object = get_partial(s, node, &pc, false);
 
 	if (object)
 		return object;
@@ -4538,7 +4567,7 @@ static int kmem_cache_open(struct kmem_cache *s, slab_flags_t flags)
 	set_cpu_partial(s);
 
 #ifdef CONFIG_NUMA
-	s->remote_node_defrag_ratio = 1000;
+	s->remote_node_defrag_ratio = 0;
 #endif
 
 	/* Initialize the pre-computed randomized freelist if slab is up */
-- 
2.41.0




* Re: [RFC 1/2] Revert "mm, slub: change percpu partial accounting from objects to pages"
  2023-07-23 19:09 ` [RFC 1/2] Revert "mm, slub: change percpu partial accounting from objects to pages" Hyeonggon Yoo
@ 2023-07-26 10:34   ` Vlastimil Babka
  2023-08-21 15:11     ` Hyeonggon Yoo
  0 siblings, 1 reply; 12+ messages in thread
From: Vlastimil Babka @ 2023-07-26 10:34 UTC (permalink / raw)
  To: Hyeonggon Yoo, Christoph Lameter, Pekka Enberg, Joonsoo Kim,
	David Rientjes, Andrew Morton
  Cc: Roman Gushchin, Feng Tang, Sang, Oliver, Jay Patel, Binder Makin,
	aneesh.kumar, tsahu, piyushs, fengwei.yin, ying.huang, lkp,
	oe-lkp, linux-mm, linux-kernel

Nit: I would change the subject from "Revert: " as it's not a revert
exactly. If we can come up with a good subject that's not very long :)

On 7/23/23 21:09, Hyeonggon Yoo wrote:
> This is partial revert of commit b47291ef02b0 ("mm, slub: change percpu
> partial accounting from objects to pages"). and full revert of commit
> 662188c3a20e ("mm/slub: Simplify struct slab slabs field definition").
> 
> While b47291ef02b0 prevents percpu partial slab list becoming too long,
> it assumes that the order of slabs are always oo_order(s->oo).

I think I've considered this possibility, but decided it's not important,
because if the system becomes memory pressured in a way that it can't
allocate the oo_order() and has to fall back, we no longer care about
accurate percpu caching, as we're unlikely to have optimum performance anyway.

> The current approach can surprisingly lower the number of objects cached
> per cpu when it fails to allocate high order slabs. Instead of accounting
> the number of slabs, change it back to accounting objects, but keep
> the assumption that the slab is always half-full.

That's a nice solution as it avoids converting the sysfs variable, so I
wouldn't mind going that way even if I doubt the performance benefits in a
memory-pressured system. But maybe there's a concern that if the system is
really memory pressured and has to fall back to smaller orders, before this
patch it would keep fewer percpu partial slabs than after this patch, which
would increase the pressure further and thus be counter-productive?

> With this change, the number of cached objects per cpu is not surprisingly
> decreased even when it fails to allocate high order slabs. It still
> prevents large inaccuracy because it does not account based on the
> number of free objects when taking slabs.
> ---
>  include/linux/slub_def.h |  2 --
>  mm/slab.h                |  6 ++++++
>  mm/slub.c                | 31 ++++++++++++-------------------
>  3 files changed, 18 insertions(+), 21 deletions(-)
> 
> diff --git a/include/linux/slub_def.h b/include/linux/slub_def.h
> index deb90cf4bffb..589ff6a2a23f 100644
> --- a/include/linux/slub_def.h
> +++ b/include/linux/slub_def.h
> @@ -109,8 +109,6 @@ struct kmem_cache {
>  #ifdef CONFIG_SLUB_CPU_PARTIAL
>  	/* Number of per cpu partial objects to keep around */
>  	unsigned int cpu_partial;
> -	/* Number of per cpu partial slabs to keep around */
> -	unsigned int cpu_partial_slabs;
>  #endif
>  	struct kmem_cache_order_objects oo;
>  
> diff --git a/mm/slab.h b/mm/slab.h
> index 799a315695c6..be38a264df16 100644
> --- a/mm/slab.h
> +++ b/mm/slab.h
> @@ -65,7 +65,13 @@ struct slab {
>  #ifdef CONFIG_SLUB_CPU_PARTIAL
>  				struct {
>  					struct slab *next;
> +#ifdef CONFIG_64BIT
>  					int slabs;	/* Nr of slabs left */
> +					int pobjects;	/* Approximate count */
> +#else
> +					short int slabs;
> +					short int pobjects;
> +#endif
>  				};
>  #endif
>  			};
> diff --git a/mm/slub.c b/mm/slub.c
> index f7940048138c..199d3d03d5b9 100644
> --- a/mm/slub.c
> +++ b/mm/slub.c
> @@ -486,18 +486,7 @@ static inline unsigned int oo_objects(struct kmem_cache_order_objects x)
>  #ifdef CONFIG_SLUB_CPU_PARTIAL
>  static void slub_set_cpu_partial(struct kmem_cache *s, unsigned int nr_objects)
>  {
> -	unsigned int nr_slabs;
> -
>  	s->cpu_partial = nr_objects;
> -
> -	/*
> -	 * We take the number of objects but actually limit the number of
> -	 * slabs on the per cpu partial list, in order to limit excessive
> -	 * growth of the list. For simplicity we assume that the slabs will
> -	 * be half-full.
> -	 */
> -	nr_slabs = DIV_ROUND_UP(nr_objects * 2, oo_objects(s->oo));
> -	s->cpu_partial_slabs = nr_slabs;
>  }
>  #else
>  static inline void
> @@ -2275,7 +2264,7 @@ static void *get_partial_node(struct kmem_cache *s, struct kmem_cache_node *n,
>  	struct slab *slab, *slab2;
>  	void *object = NULL;
>  	unsigned long flags;
> -	unsigned int partial_slabs = 0;
> +	int objects_taken = 0;
>  
>  	/*
>  	 * Racy check. If we mistakenly see no partial slabs then we
> @@ -2312,11 +2301,11 @@ static void *get_partial_node(struct kmem_cache *s, struct kmem_cache_node *n,
>  		} else {
>  			put_cpu_partial(s, slab, 0);
>  			stat(s, CPU_PARTIAL_NODE);
> -			partial_slabs++;
> +			objects_taken += slab->objects / 2;
>  		}
>  #ifdef CONFIG_SLUB_CPU_PARTIAL
>  		if (!kmem_cache_has_cpu_partial(s)
> -			|| partial_slabs > s->cpu_partial_slabs / 2)
> +			|| objects_taken > s->cpu_partial / 2)
>  			break;
>  #else
>  		break;
> @@ -2699,13 +2688,14 @@ static void put_cpu_partial(struct kmem_cache *s, struct slab *slab, int drain)
>  	struct slab *slab_to_unfreeze = NULL;
>  	unsigned long flags;
>  	int slabs = 0;
> +	int pobjects = 0;
>  
>  	local_lock_irqsave(&s->cpu_slab->lock, flags);
>  
>  	oldslab = this_cpu_read(s->cpu_slab->partial);
>  
>  	if (oldslab) {
> -		if (drain && oldslab->slabs >= s->cpu_partial_slabs) {
> +		if (drain && oldslab->pobjects >= s->cpu_partial) {
>  			/*
>  			 * Partial array is full. Move the existing set to the
>  			 * per node partial list. Postpone the actual unfreezing
> @@ -2714,14 +2704,17 @@ static void put_cpu_partial(struct kmem_cache *s, struct slab *slab, int drain)
>  			slab_to_unfreeze = oldslab;
>  			oldslab = NULL;
>  		} else {
> +			pobjects = oldslab->pobjects;
>  			slabs = oldslab->slabs;
>  		}
>  	}
>  
>  	slabs++;
> +	pobjects += slab->objects / 2;
>  
>  	slab->slabs = slabs;
>  	slab->next = oldslab;
> +	slab->pobjects = pobjects;
>  
>  	this_cpu_write(s->cpu_slab->partial, slab);
>  
> @@ -5653,13 +5646,13 @@ static ssize_t slabs_cpu_partial_show(struct kmem_cache *s, char *buf)
>  
>  		slab = slub_percpu_partial(per_cpu_ptr(s->cpu_slab, cpu));
>  
> -		if (slab)
> +		if (slab) {
>  			slabs += slab->slabs;
> +			objects += slab->objects;
> +		}
>  	}
>  #endif
>  
> -	/* Approximate half-full slabs, see slub_set_cpu_partial() */
> -	objects = (slabs * oo_objects(s->oo)) / 2;
>  	len += sysfs_emit_at(buf, len, "%d(%d)", objects, slabs);
>  
>  #ifdef CONFIG_SLUB_CPU_PARTIAL
> @@ -5669,7 +5662,7 @@ static ssize_t slabs_cpu_partial_show(struct kmem_cache *s, char *buf)
>  		slab = slub_percpu_partial(per_cpu_ptr(s->cpu_slab, cpu));
>  		if (slab) {
>  			slabs = READ_ONCE(slab->slabs);
> -			objects = (slabs * oo_objects(s->oo)) / 2;
> +			objects = READ_ONCE(slab->pobjects);
>  			len += sysfs_emit_at(buf, len, " C%d=%d(%d)",
>  					     cpu, objects, slabs);
>  		}




* Re: [RFC 2/2] mm/slub: prefer NUMA locality over slight memory saving on NUMA machines
  2023-07-23 19:09 ` [RFC 2/2] mm/slub: prefer NUMA locality over slight memory saving on NUMA machines Hyeonggon Yoo
@ 2023-08-03 14:54   ` Vlastimil Babka
  2023-08-07  8:39     ` Hyeonggon Yoo
  0 siblings, 1 reply; 12+ messages in thread
From: Vlastimil Babka @ 2023-08-03 14:54 UTC (permalink / raw)
  To: Hyeonggon Yoo, Christoph Lameter, Pekka Enberg, Joonsoo Kim,
	David Rientjes, Andrew Morton
  Cc: Roman Gushchin, Feng Tang, Sang, Oliver, Jay Patel, Binder Makin,
	aneesh.kumar, tsahu, piyushs, fengwei.yin, ying.huang, lkp,
	oe-lkp, linux-mm, linux-kernel

On 7/23/23 21:09, Hyeonggon Yoo wrote:
> By default, SLUB sets remote_node_defrag_ratio to 1000, which makes it
> (in most cases) take slabs from remote nodes first before trying allocating
> new folios on the local node from buddy.
> 
> Documentation/ABI/testing/sysfs-kernel-slab says:
>> The file remote_node_defrag_ratio specifies the percentage of
>> times SLUB will attempt to refill the cpu slab with a partial
>> slab from a remote node as opposed to allocating a new slab on
>> the local node.  This reduces the amount of wasted memory over
>> the entire system but can be expensive.
> 
> Although this made sense when it was introduced, the portion of
> per node partial lists in the overall SLUB memory usage has been decreased
> since the introduction of per cpu partial lists. Therefore, it's worth
> reevaluating its overhead on performance and memory usage.
> 
> [
> 	XXX: Add performance data. I tried to measure its impact on
> 	hackbench with a 2 socket NUMA 	machine. but it seems hackbench is
> 	too synthetic to benefit from this, because the	skbuff_head_cache's
> 	size fits into the last level cache.
> 
> 	Probably more realistic workloads like netperf would benefit
> 	from this?
> ]
> 
> Set remote_node_defrag_ratio to zero by default, and the new behavior is:
> 	1) try refilling per CPU partial list from the local node
> 	2) try allocating new slabs from the local node without reclamation
> 	3) try refilling per CPU partial list from remote nodes
> 	4) try allocating new slabs from the local node or remote nodes
> 
> If user specified remote_node_defrag_ratio, it probabilistically tries
> 3) first and then try 2) and 4) in order, to avoid unexpected behavioral
> change from user's perspective.

It makes sense to me, but as you note it would be great to demonstrate the
benefits, because it adds complexity, especially in the already complex
___slab_alloc(). Networking has indeed historically been a workload very
sensitive to slab performance, so it seems a good candidate.

We could also postpone this until we have tried the percpu arrays
improvements discussed at LSF/MM.




* Re: [RFC 2/2] mm/slub: prefer NUMA locality over slight memory saving on NUMA machines
  2023-08-03 14:54   ` Vlastimil Babka
@ 2023-08-07  8:39     ` Hyeonggon Yoo
  2023-08-08  9:59       ` Vlastimil Babka
  0 siblings, 1 reply; 12+ messages in thread
From: Hyeonggon Yoo @ 2023-08-07  8:39 UTC (permalink / raw)
  To: Vlastimil Babka
  Cc: Christoph Lameter, Pekka Enberg, Joonsoo Kim, David Rientjes,
	Andrew Morton, Roman Gushchin, Feng Tang, Sang, Oliver,
	Jay Patel, Binder Makin, aneesh.kumar, tsahu, piyushs,
	fengwei.yin, ying.huang, lkp, oe-lkp, linux-mm, linux-kernel

On Thu, Aug 3, 2023 at 11:54 PM Vlastimil Babka <vbabka@suse.cz> wrote:
>
> On 7/23/23 21:09, Hyeonggon Yoo wrote:
> > By default, SLUB sets remote_node_defrag_ratio to 1000, which makes it
> > (in most cases) take slabs from remote nodes first before trying allocating
> > new folios on the local node from buddy.
> >
> > Documentation/ABI/testing/sysfs-kernel-slab says:
> >> The file remote_node_defrag_ratio specifies the percentage of
> >> times SLUB will attempt to refill the cpu slab with a partial
> >> slab from a remote node as opposed to allocating a new slab on
> >> the local node.  This reduces the amount of wasted memory over
> >> the entire system but can be expensive.
> >
> > Although this made sense when it was introduced, the portion of
> > per node partial lists in the overall SLUB memory usage has been decreased
> > since the introduction of per cpu partial lists. Therefore, it's worth
> > reevaluating its overhead on performance and memory usage.
> >
> > [
> >       XXX: Add performance data. I tried to measure its impact on
> >       hackbench with a 2 socket NUMA  machine. but it seems hackbench is
> >       too synthetic to benefit from this, because the skbuff_head_cache's
> >       size fits into the last level cache.
> >
> >       Probably more realistic workloads like netperf would benefit
> >       from this?
> > ]
> >
> > Set remote_node_defrag_ratio to zero by default, and the new behavior is:
> >       1) try refilling per CPU partial list from the local node
> >       2) try allocating new slabs from the local node without reclamation
> >       3) try refilling per CPU partial list from remote nodes
> >       4) try allocating new slabs from the local node or remote nodes
> >
> > If user specified remote_node_defrag_ratio, it probabilistically tries
> > 3) first and then try 2) and 4) in order, to avoid unexpected behavioral
> > change from user's perspective.
>
> It makes sense to me, but as you note it would be great to demonstrate
> benefits, because it adds complexity, especially in the already complex
> ___slab_alloc(). Networking has been indeed historically a workload very
> sensitive to slab performance, so seems a good candidate.

Thank you for looking at it!

Yeah, it was a PoC for what I thought "oh, it might be useful",
and I will definitely try to measure it.

> We could also postpone this until we have tried the percpu arrays
> improvements discussed at LSF/MM.

Possibly, but can you please share your plans/opinions on it?
I think one possible way is simply to allow the cpu freelist to contain
objects from different slabs if we want to minimize changes, or to
introduce a per-cpu array similar to what SLAB does now.

And one thing I'm having difficulty understanding is: what is the mind
behind (or impact of) managing objects on a slab basis, other than
avoiding array queues in 2007?



* Re: [RFC 2/2] mm/slub: prefer NUMA locality over slight memory saving on NUMA machines
  2023-08-07  8:39     ` Hyeonggon Yoo
@ 2023-08-08  9:59       ` Vlastimil Babka
  0 siblings, 0 replies; 12+ messages in thread
From: Vlastimil Babka @ 2023-08-08  9:59 UTC (permalink / raw)
  To: Hyeonggon Yoo
  Cc: Christoph Lameter, Pekka Enberg, Joonsoo Kim, David Rientjes,
	Andrew Morton, Roman Gushchin, Feng Tang, Sang, Oliver,
	Jay Patel, Binder Makin, aneesh.kumar, tsahu, piyushs,
	fengwei.yin, ying.huang, lkp, oe-lkp, linux-mm, linux-kernel

On 8/7/23 10:39, Hyeonggon Yoo wrote:
> On Thu, Aug 3, 2023 at 11:54 PM Vlastimil Babka <vbabka@suse.cz> wrote:
>>
> 
> Thank you for looking at it!
> 
> Yeah, it was a PoC for what I thought "oh, it might be useful"
> and definitely I will try to measure it.
> 
>> We could also postpone this until we have tried the percpu arrays
>> improvements discussed at LSF/MM.
> 
> Possibly, but can you please share your plans/opinions on it?

Here's the very first attempt :)
https://lore.kernel.org/linux-mm/20230808095342.12637-7-vbabka@suse.cz/

> I think one possible way is simply to allow the cpu freelist to be
> mixed by objects from different slabs

I didn't try that way; it might be much trickier than it looks.

> if we want to minimize changes, Or introduce a per cpu array similar
> to what SLAB does now.

Yes.

> And one thing I'm having difficulty understanding is - what is the
> mind behind/or impact of managing objects
> on a slab basis, other than avoiding array queues in 2007?

"The mind" is Christoph's so I'll leave that question to him :)



* Re: [RFC 0/2] An attempt to improve SLUB on NUMA / under memory pressure
  2023-07-23 19:09 [RFC 0/2] An attempt to improve SLUB on NUMA / under memory pressure Hyeonggon Yoo
  2023-07-23 19:09 ` [RFC 1/2] Revert "mm, slub: change percpu partial accounting from objects to pages" Hyeonggon Yoo
  2023-07-23 19:09 ` [RFC 2/2] mm/slub: prefer NUMA locality over slight memory saving on NUMA machines Hyeonggon Yoo
@ 2023-08-10 10:55 ` Jay Patel
  2023-08-10 18:06   ` Hyeonggon Yoo
  2 siblings, 1 reply; 12+ messages in thread
From: Jay Patel @ 2023-08-10 10:55 UTC (permalink / raw)
  To: Hyeonggon Yoo, Vlastimil Babka, Christoph Lameter, Pekka Enberg,
	Joonsoo Kim, David Rientjes, Andrew Morton
  Cc: Roman Gushchin, Feng Tang, Sang, Oliver, Binder Makin,
	aneesh.kumar, tsahu, piyushs, fengwei.yin, ying.huang, lkp,
	oe-lkp, linux-mm, linux-kernel

On Mon, 2023-07-24 at 04:09 +0900, Hyeonggon Yoo wrote:
> Hello folks,
> 
> This series is motivated by kernel test bot report [1] on Jay's patch
> that modifies slab order. While the patch was not merged and not in
> the
> final form, I think it was a good lesson that changing slab order has
> more
> impacts on performance than we expected.
> 
> While inspecting the report, I found some potential points to improve
> SLUB. [2] It's _potential_ because it shows no improvements on
> hackbench.
> but I believe more realistic workloads would benefit from this. Due
> to
> lack of resources and lack of my understanding of *realistic*
> workloads,
> I am asking you to help evaluating this together.

Hi Hyeonggon,
I tried a hackbench test on a PowerPC machine with 16 CPUs but
got a ~32% regression with the patch.

Results are as follows:

+-------+----+---------+------------+------------+
|       |    | Normal  | With Patch |            |
+-------+----+---------+------------+------------+
| Amean | 1  | 1.3700  | 2.0353     | ( -32.69%) |
| Amean | 4  | 5.1663  | 7.6563     | ( -32.52%) |
| Amean | 7  | 8.9180  | 13.3353    | ( -33.13%) |
| Amean | 12 | 15.4290 | 23.0757    | ( -33.14%) |
| Amean | 21 | 27.3333 | 40.7823    | ( -32.98%) |
| Amean | 30 | 38.7677 | 58.5300    | ( -33.76%) |
| Amean | 48 | 62.2987 | 92.9850    | ( -33.00%) |
| Amean | 64 | 82.8993 | 123.4717   | ( -32.86%) |
+-------+----+---------+------------+------------+

Thanks
Jay Patel
> 
> It only consists of two patches. Patch #1 addresses inaccuracy in
> SLUB's heuristic, which can negatively affect workloads' performance
> when large folios are not available from buddy.
> 
> Patch #2 changes SLUB's behavior when there are no slabs available on
> the
> local node's partial slab list, increasing NUMA locality when there
> are
> available memory (without reclamation) on the local node from buddy.
> 
> This is early state, but I think it's a good enough to start
> discussion.
> Any feedbacks and ideas are welcome. Thank you in advance!
> 
> Hyeonggon
> 
> https://lore.kernel.org/linux-mm/202307172140.3b34825a-oliver.sang@intel.com
> [1]
> https://lore.kernel.org/linux-mm/CAB=+i9S6Ykp90+4N1kCE=hiTJTE4wzJDi8k5pBjjO_3sf0aeqg@mail.gmail.com
> [2]
> 
> Hyeonggon Yoo (2):
>   Revert "mm, slub: change percpu partial accounting from objects to
>     pages"
>   mm/slub: prefer NUMA locality over slight memory saving on NUMA
>     machines
> 
>  include/linux/slub_def.h |  2 --
>  mm/slab.h                |  6 ++++
>  mm/slub.c                | 76 ++++++++++++++++++++++++++----------
> ----
>  3 files changed, 55 insertions(+), 29 deletions(-)
> 




* Re: [RFC 0/2] An attempt to improve SLUB on NUMA / under memory pressure
  2023-08-10 10:55 ` [RFC 0/2] An attempt to improve SLUB on NUMA / under memory pressure Jay Patel
@ 2023-08-10 18:06   ` Hyeonggon Yoo
  2023-08-18  6:45     ` Jay Patel
  0 siblings, 1 reply; 12+ messages in thread
From: Hyeonggon Yoo @ 2023-08-10 18:06 UTC (permalink / raw)
  To: jaypatel
  Cc: Vlastimil Babka, Christoph Lameter, Pekka Enberg, Joonsoo Kim,
	David Rientjes, Andrew Morton, Roman Gushchin, Feng Tang, Sang,
	Oliver, Binder Makin, aneesh.kumar, tsahu, piyushs, fengwei.yin,
	ying.huang, lkp, oe-lkp, linux-mm, linux-kernel

On Thu, Aug 10, 2023 at 7:56 PM Jay Patel <jaypatel@linux.ibm.com> wrote:
>
> On Mon, 2023-07-24 at 04:09 +0900, Hyeonggon Yoo wrote:
> > Hello folks,
> >
> > This series is motivated by kernel test bot report [1] on Jay's patch
> > that modifies slab order. While the patch was not merged and not in
> > the
> > final form, I think it was a good lesson that changing slab order has
> > more
> > impacts on performance than we expected.
> >
> > While inspecting the report, I found some potential points to improve
> > SLUB. [2] It's _potential_ because it shows no improvements on
> > hackbench.
> > but I believe more realistic workloads would benefit from this. Due
> > to
> > lack of resources and lack of my understanding of *realistic*
> > workloads,
> > I am asking you to help evaluating this together.
>
> Hi Hyeonggon,
> I tried hackbench test on Powerpc machine with 16 cpus but
> got ~32% of Regression with patch.

Thank you so much for measuring this! That's very helpful.
It's interesting because on an AMD machine with 2 NUMA nodes there was
not much difference.

Does it have more than one socket?

Could you confirm whether the offending patch is patch 1 or 2?
If the offending one is patch 2, can you please check how large the L3
cache miss rate is during hackbench?

> Results as
>
> +-------+----+---------+------------+------------+
> |       |    | Normal  | With Patch |            |
> +-------+----+---------+------------+------------+
> | Amean | 1  | 1.3700  | 2.0353     | ( -32.69%) |
> | Amean | 4  | 5.1663  | 7.6563     | ( -32.52%) |
> | Amean | 7  | 8.9180  | 13.3353    | ( -33.13%) |
> | Amean | 12 | 15.4290 | 23.0757    | ( -33.14%) |
> | Amean | 21 | 27.3333 | 40.7823    | ( -32.98%) |
> | Amean | 30 | 38.7677 | 58.5300    | ( -33.76%) |
> | Amean | 48 | 62.2987 | 92.9850    | ( -33.00%) |
> | Amean | 64 | 82.8993 | 123.4717   | ( -32.86%) |
> +-------+----+---------+------------+------------+
>
> Thanks
> Jay Patel
> >
> > It only consists of two patches. Patch #1 addresses inaccuracy in
> > SLUB's heuristic, which can negatively affect workloads' performance
> > when large folios are not available from buddy.
> >
> > Patch #2 changes SLUB's behavior when there are no slabs available on
> > the
> > local node's partial slab list, increasing NUMA locality when there
> > are
> > available memory (without reclamation) on the local node from buddy.
> >
> > This is early state, but I think it's a good enough to start
> > discussion.
> > Any feedbacks and ideas are welcome. Thank you in advance!
> >
> > Hyeonggon
> >
> > https://lore.kernel.org/linux-mm/202307172140.3b34825a-oliver.sang@intel.com
> > [1]
> > https://lore.kernel.org/linux-mm/CAB=+i9S6Ykp90+4N1kCE=hiTJTE4wzJDi8k5pBjjO_3sf0aeqg@mail.gmail.com
> > [2]
> >
> > Hyeonggon Yoo (2):
> >   Revert "mm, slub: change percpu partial accounting from objects to
> >     pages"
> >   mm/slub: prefer NUMA locality over slight memory saving on NUMA
> >     machines
> >
> >  include/linux/slub_def.h |  2 --
> >  mm/slab.h                |  6 ++++
> >  mm/slub.c                | 76 ++++++++++++++++++++++++++----------
> > ----
> >  3 files changed, 55 insertions(+), 29 deletions(-)
> >
>



* Re: [RFC 0/2] An attempt to improve SLUB on NUMA / under memory pressure
  2023-08-10 18:06   ` Hyeonggon Yoo
@ 2023-08-18  6:45     ` Jay Patel
  2023-08-18 15:18       ` Hyeonggon Yoo
  0 siblings, 1 reply; 12+ messages in thread
From: Jay Patel @ 2023-08-18  6:45 UTC (permalink / raw)
  To: Hyeonggon Yoo
  Cc: Vlastimil Babka, Christoph Lameter, Pekka Enberg, Joonsoo Kim,
	David Rientjes, Andrew Morton, Roman Gushchin, Feng Tang, Sang,
	Oliver, Binder Makin, aneesh.kumar, tsahu, piyushs, fengwei.yin,
	ying.huang, lkp, oe-lkp, linux-mm, linux-kernel

On Fri, 2023-08-11 at 03:06 +0900, Hyeonggon Yoo wrote:
> On Thu, Aug 10, 2023 at 7:56 PM Jay Patel <jaypatel@linux.ibm.com>
> wrote:
> > On Mon, 2023-07-24 at 04:09 +0900, Hyeonggon Yoo wrote:
> > > Hello folks,
> > > 
> > > This series is motivated by kernel test bot report [1] on Jay's
> > > patch
> > > that modifies slab order. While the patch was not merged and not
> > > in
> > > the
> > > final form, I think it was a good lesson that changing slab order
> > > has
> > > more
> > > impacts on performance than we expected.
> > > 
> > > While inspecting the report, I found some potential points to
> > > improve
> > > SLUB. [2] It's _potential_ because it shows no improvements on
> > > hackbench.
> > > but I believe more realistic workloads would benefit from this.
> > > Due
> > > to
> > > lack of resources and lack of my understanding of *realistic*
> > > workloads,
> > > I am asking you to help evaluating this together.
> > 
> > Hi Hyeonggon,
> > I tried hackbench test on Powerpc machine with 16 cpus but
> > got ~32% of Regression with patch.
> 
> Thank you so much for measuring this! That's very helpful.
> It's interesting because on an AMD machine with 2 NUMA nodes there
> was
> not much difference.
> 
> Does it have more than one socket?

I have tested on a single-socket system.
> 
> Could you confirm if the offending patch is patch 1 or 2?
> If the offending one is patch 2, can you please check how large is L3
> cache miss rate
> during hackbench?
> 
The regression below is caused by patch 1, "Revert mm, slub: change percpu
partial accounting from objects to pages".

Thanks 
Jay Patel

> > Results as
> > 
> > +-------+----+---------+------------+------------+
> > |       |    | Normal  | With Patch |            |
> > +-------+----+---------+------------+------------+
> > | Amean | 1  | 1.3700  | 2.0353     | ( -32.69%) |
> > | Amean | 4  | 5.1663  | 7.6563     | ( -32.52%) |
> > | Amean | 7  | 8.9180  | 13.3353    | ( -33.13%) |
> > | Amean | 12 | 15.4290 | 23.0757    | ( -33.14%) |
> > | Amean | 21 | 27.3333 | 40.7823    | ( -32.98%) |
> > | Amean | 30 | 38.7677 | 58.5300    | ( -33.76%) |
> > | Amean | 48 | 62.2987 | 92.9850    | ( -33.00%) |
> > | Amean | 64 | 82.8993 | 123.4717   | ( -32.86%) |
> > +-------+----+---------+------------+------------+
> > 
> > Thanks
> > Jay Patel
> > > It only consists of two patches. Patch #1 addresses inaccuracy in
> > > SLUB's heuristic, which can negatively affect workloads'
> > > performance
> > > when large folios are not available from buddy.
> > > 
> > > Patch #2 changes SLUB's behavior when there are no slabs
> > > available on
> > > the
> > > local node's partial slab list, increasing NUMA locality when
> > > there
> > > are
> > > available memory (without reclamation) on the local node from
> > > buddy.
> > > 
> > > This is early state, but I think it's a good enough to start
> > > discussion.
> > > Any feedbacks and ideas are welcome. Thank you in advance!
> > > 
> > > Hyeonggon
> > > 
> > > https://lore.kernel.org/linux-mm/202307172140.3b34825a-oliver.sang@intel.com
> > > [1]
> > > https://lore.kernel.org/linux-mm/CAB=+i9S6Ykp90+4N1kCE=hiTJTE4wzJDi8k5pBjjO_3sf0aeqg@mail.gmail.com
> > > [2]
> > > 
> > > Hyeonggon Yoo (2):
> > >   Revert "mm, slub: change percpu partial accounting from objects
> > > to
> > >     pages"
> > >   mm/slub: prefer NUMA locality over slight memory saving on NUMA
> > >     machines
> > > 
> > >  include/linux/slub_def.h |  2 --
> > >  mm/slab.h                |  6 ++++
> > >  mm/slub.c                | 76 ++++++++++++++++++++++++++------
> > > ----
> > > ----
> > >  3 files changed, 55 insertions(+), 29 deletions(-)
> > > 




* Re: [RFC 0/2] An attempt to improve SLUB on NUMA / under memory pressure
  2023-08-18  6:45     ` Jay Patel
@ 2023-08-18 15:18       ` Hyeonggon Yoo
  0 siblings, 0 replies; 12+ messages in thread
From: Hyeonggon Yoo @ 2023-08-18 15:18 UTC (permalink / raw)
  To: jaypatel
  Cc: Vlastimil Babka, Christoph Lameter, Pekka Enberg, Joonsoo Kim,
	David Rientjes, Andrew Morton, Roman Gushchin, Feng Tang, Sang,
	Oliver, Binder Makin, aneesh.kumar, tsahu, piyushs, fengwei.yin,
	ying.huang, lkp, oe-lkp, linux-mm, linux-kernel

On Fri, Aug 18, 2023 at 4:11 PM Jay Patel <jaypatel@linux.ibm.com> wrote:
>
> On Fri, 2023-08-11 at 03:06 +0900, Hyeonggon Yoo wrote:
> > On Thu, Aug 10, 2023 at 7:56 PM Jay Patel <jaypatel@linux.ibm.com>
> > wrote:
> > > On Mon, 2023-07-24 at 04:09 +0900, Hyeonggon Yoo wrote:
> > > > Hello folks,
> > > >
> > > > This series is motivated by kernel test bot report [1] on Jay's
> > > > patch
> > > > that modifies slab order. While the patch was not merged and not
> > > > in
> > > > the
> > > > final form, I think it was a good lesson that changing slab order
> > > > has
> > > > more
> > > > impacts on performance than we expected.
> > > >
> > > > While inspecting the report, I found some potential points to
> > > > improve
> > > > SLUB. [2] It's _potential_ because it shows no improvements on
> > > > hackbench.
> > > > but I believe more realistic workloads would benefit from this.
> > > > Due
> > > > to
> > > > lack of resources and lack of my understanding of *realistic*
> > > > workloads,
> > > > I am asking you to help evaluating this together.
> > >
> > > Hi Hyeonggon,
> > > I tried hackbench test on Powerpc machine with 16 cpus but
> > > got ~32% of Regression with patch.
> >
> > Thank you so much for measuring this! That's very helpful.
> > It's interesting because on an AMD machine with 2 NUMA nodes there
> > was
> > not much difference.
> >
> > Does it have more than one socket?
>
> I have tested on single socket system.
> >
> > Could you confirm if the offending patch is patch 1 or 2?
> > If the offending one is patch 2, can you please check how large is L3
> > cache miss rate
> > during hackbench?
> >
> Below regression is cause by Patch 1 "Revert mm, slub: change percpu
> partial accounting from objects to pages"

Fortunately I was able to reproduce the regression (5~10%) on my AMD laptop :)
It's interesting, and thank you so much for pointing it out!

It only modifies the slowpath, so the overhead of the calculation itself
should be negligible.
And I think it's fair to assume that this is because fewer objects are
cached per cpu after the patch, since the code it removes rounds up the
number of slabs:
> nr_slabs = DIV_ROUND_UP(nr_objects * 2, oo_objects(s->oo));

So before the patch more objects were cached than intended.
I'll try to bump up the default value to the point where it does not
use more memory than before.
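
(Spelling the round-up out with hypothetical numbers: for cpu_partial = 24
objects and oo_objects(s->oo) = 64, DIV_ROUND_UP(24 * 2, 64) = 1 slab,
which under the half-full assumption corresponds to ~32 objects rather
than 24, so the rounded-up limit can exceed cpu_partial by up to half a
slab's worth of objects.)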

By the way, what the optimal default value should be is very unclear to me.
Obviously 'good enough for hackbench' is not a good standard,
because it's quite a synthetic workload.


> Thanks
> Jay Patel
>
> > > Results as
> > >
> > > +-------+----+---------+------------+------------+
> > > |       |    | Normal  | With Patch |            |
> > > +-------+----+---------+------------+------------+
> > > | Amean | 1  | 1.3700  | 2.0353     | ( -32.69%) |
> > > | Amean | 4  | 5.1663  | 7.6563     | ( -32.52%) |
> > > | Amean | 7  | 8.9180  | 13.3353    | ( -33.13%) |
> > > | Amean | 12 | 15.4290 | 23.0757    | ( -33.14%) |
> > > | Amean | 21 | 27.3333 | 40.7823    | ( -32.98%) |
> > > | Amean | 30 | 38.7677 | 58.5300    | ( -33.76%) |
> > > | Amean | 48 | 62.2987 | 92.9850    | ( -33.00%) |
> > > | Amean | 64 | 82.8993 | 123.4717   | ( -32.86%) |
> > > +-------+----+---------+------------+------------+
> > >
> > > Thanks
> > > Jay Patel



* Re: [RFC 1/2] Revert "mm, slub: change percpu partial accounting from objects to pages"
  2023-07-26 10:34   ` Vlastimil Babka
@ 2023-08-21 15:11     ` Hyeonggon Yoo
  0 siblings, 0 replies; 12+ messages in thread
From: Hyeonggon Yoo @ 2023-08-21 15:11 UTC (permalink / raw)
  To: Vlastimil Babka
  Cc: Christoph Lameter, Pekka Enberg, Joonsoo Kim, David Rientjes,
	Andrew Morton, Roman Gushchin, Feng Tang, Sang, Oliver,
	Jay Patel, Binder Makin, aneesh.kumar, tsahu, piyushs,
	fengwei.yin, ying.huang, lkp, oe-lkp, linux-mm, linux-kernel,
	Jesper Dangaard Brouer

[ +Cc Jesper - he might have an opinion on this. ]

On Wed, Jul 26, 2023 at 7:34 PM Vlastimil Babka <vbabka@suse.cz> wrote:
>
> Nit: I would change the subject from "Revert: " as it's not a revert
> exactly. If we can come up with a good subject that's not very long :)

Will do :)

> On 7/23/23 21:09, Hyeonggon Yoo wrote:
> > This is partial revert of commit b47291ef02b0 ("mm, slub: change percpu
> > partial accounting from objects to pages"). and full revert of commit
> > 662188c3a20e ("mm/slub: Simplify struct slab slabs field definition").
> >
> > While b47291ef02b0 prevents percpu partial slab list becoming too long,
> > it assumes that the order of slabs are always oo_order(s->oo).

> I think I've considered this possibility, but decided it's not important
> because if the system becomes memory pressured in a way that it can't
> allocate the oo_order() and has to fallback, we no longer care about
> accurate percpu caching, as we're unlikely having optimum performance anyway.

But it does not perform any direct reclamation/compaction to allocate
high-order slabs, so isn't that an easier condition to hit than real
memory pressure?
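
Roughly what I mean, from my reading of allocate_slab() (the exact flag
handling may differ between versions, so treat this as a sketch):

	alloc_gfp = (flags | __GFP_NOWARN | __GFP_NORETRY) & ~__GFP_NOFAIL;
	if ((alloc_gfp & __GFP_DIRECT_RECLAIM) && oo_order(oo) > oo_order(s->min))
		alloc_gfp = (alloc_gfp | __GFP_NOMEMALLOC) & ~__GFP_RECLAIM;

	slab = alloc_slab_page(alloc_gfp, node, oo);
	if (unlikely(!slab)) {
		/* the high-order attempt gave up without reclaiming */
		oo = s->min;
		...
	}

i.e. the initial oo_order(s->oo) attempt deliberately avoids reclaim and
just falls back to the minimum order when it fails.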

> > The current approach can surprisingly lower the number of objects cached
> > per cpu when it fails to allocate high order slabs. Instead of accounting
> > the number of slabs, change it back to accounting objects, but keep
> > the assumption that the slab is always half-full.
>
> That's a nice solution as that avoids converting the sysfs variable, so I
> wouldn't mind going that way even if I doubt the performance benefits in a
> memory pressured system.

> But maybe there's a concern that if the system is
> really memory pressured and has to fallback to smaller orders, before this
> patch it would keep fewer percpu partial slabs than after this patch, which
> would increase the pressure further and thus be counter-productive?

You mean SLUB needs to stop per-cpu caching when direct or indirect
reclamation is desired?

> > With this change, the number of cached objects per cpu is not surprisingly
> > decreased even when it fails to allocate high order slabs. It still
> > prevents large inaccuracy because it does not account based on the
> > number of free objects when taking slabs.
> > ---
> >  include/linux/slub_def.h |  2 --
> >  mm/slab.h                |  6 ++++++
> >  mm/slub.c                | 31 ++++++++++++-------------------
> >  3 files changed, 18 insertions(+), 21 deletions(-)
> >
> > diff --git a/include/linux/slub_def.h b/include/linux/slub_def.h
> > index deb90cf4bffb..589ff6a2a23f 100644
> > --- a/include/linux/slub_def.h
> > +++ b/include/linux/slub_def.h
> > @@ -109,8 +109,6 @@ struct kmem_cache {
> >  #ifdef CONFIG_SLUB_CPU_PARTIAL
> >       /* Number of per cpu partial objects to keep around */
> >       unsigned int cpu_partial;
> > -     /* Number of per cpu partial slabs to keep around */
> > -     unsigned int cpu_partial_slabs;
> >  #endif
> >       struct kmem_cache_order_objects oo;
> >
> > diff --git a/mm/slab.h b/mm/slab.h
> > index 799a315695c6..be38a264df16 100644
> > --- a/mm/slab.h
> > +++ b/mm/slab.h
> > @@ -65,7 +65,13 @@ struct slab {
> >  #ifdef CONFIG_SLUB_CPU_PARTIAL
> >                               struct {
> >                                       struct slab *next;
> > +#ifdef CONFIG_64BIT
> >                                       int slabs;      /* Nr of slabs left */
> > +                                     int pobjects;   /* Approximate count */
> > +#else
> > +                                     short int slabs;
> > +                                     short int pobjects;
> > +#endif
> >                               };
> >  #endif
> >                       };
> > diff --git a/mm/slub.c b/mm/slub.c
> > index f7940048138c..199d3d03d5b9 100644
> > --- a/mm/slub.c
> > +++ b/mm/slub.c
> > @@ -486,18 +486,7 @@ static inline unsigned int oo_objects(struct kmem_cache_order_objects x)
> >  #ifdef CONFIG_SLUB_CPU_PARTIAL
> >  static void slub_set_cpu_partial(struct kmem_cache *s, unsigned int nr_objects)
> >  {
> > -     unsigned int nr_slabs;
> > -
> >       s->cpu_partial = nr_objects;
> > -
> > -     /*
> > -      * We take the number of objects but actually limit the number of
> > -      * slabs on the per cpu partial list, in order to limit excessive
> > -      * growth of the list. For simplicity we assume that the slabs will
> > -      * be half-full.
> > -      */
> > -     nr_slabs = DIV_ROUND_UP(nr_objects * 2, oo_objects(s->oo));
> > -     s->cpu_partial_slabs = nr_slabs;
> >  }
> >  #else
> >  static inline void
> > @@ -2275,7 +2264,7 @@ static void *get_partial_node(struct kmem_cache *s, struct kmem_cache_node *n,
> >       struct slab *slab, *slab2;
> >       void *object = NULL;
> >       unsigned long flags;
> > -     unsigned int partial_slabs = 0;
> > +     int objects_taken = 0;
> >
> >       /*
> >        * Racy check. If we mistakenly see no partial slabs then we
> > @@ -2312,11 +2301,11 @@ static void *get_partial_node(struct kmem_cache *s, struct kmem_cache_node *n,
> >               } else {
> >                       put_cpu_partial(s, slab, 0);
> >                       stat(s, CPU_PARTIAL_NODE);
> > -                     partial_slabs++;
> > +                     objects_taken += slab->objects / 2;
> >               }
> >  #ifdef CONFIG_SLUB_CPU_PARTIAL
> >               if (!kmem_cache_has_cpu_partial(s)
> > -                     || partial_slabs > s->cpu_partial_slabs / 2)
> > +                     || objects_taken > s->cpu_partial / 2)
> >                       break;
> >  #else
> >               break;
> > @@ -2699,13 +2688,14 @@ static void put_cpu_partial(struct kmem_cache *s, struct slab *slab, int drain)
> >       struct slab *slab_to_unfreeze = NULL;
> >       unsigned long flags;
> >       int slabs = 0;
> > +     int pobjects = 0;
> >
> >       local_lock_irqsave(&s->cpu_slab->lock, flags);
> >
> >       oldslab = this_cpu_read(s->cpu_slab->partial);
> >
> >       if (oldslab) {
> > -             if (drain && oldslab->slabs >= s->cpu_partial_slabs) {
> > +             if (drain && oldslab->pobjects >= s->cpu_partial) {
> >                       /*
> >                        * Partial array is full. Move the existing set to the
> >                        * per node partial list. Postpone the actual unfreezing
> > @@ -2714,14 +2704,17 @@ static void put_cpu_partial(struct kmem_cache *s, struct slab *slab, int drain)
> >                       slab_to_unfreeze = oldslab;
> >                       oldslab = NULL;
> >               } else {
> > +                     pobjects = oldslab->pobjects;
> >                       slabs = oldslab->slabs;
> >               }
> >       }
> >
> >       slabs++;
> > +     pobjects += slab->objects / 2;
> >
> >       slab->slabs = slabs;
> >       slab->next = oldslab;
> > +     slab->pobjects = pobjects;
> >
> >       this_cpu_write(s->cpu_slab->partial, slab);
> >
> > @@ -5653,13 +5646,13 @@ static ssize_t slabs_cpu_partial_show(struct kmem_cache *s, char *buf)
> >
> >               slab = slub_percpu_partial(per_cpu_ptr(s->cpu_slab, cpu));
> >
> > -             if (slab)
> > +             if (slab) {
> >                       slabs += slab->slabs;
> > +                     objects += slab->objects;
> > +             }
> >       }
> >  #endif
> >
> > -     /* Approximate half-full slabs, see slub_set_cpu_partial() */
> > -     objects = (slabs * oo_objects(s->oo)) / 2;
> >       len += sysfs_emit_at(buf, len, "%d(%d)", objects, slabs);
> >
> >  #ifdef CONFIG_SLUB_CPU_PARTIAL
> > @@ -5669,7 +5662,7 @@ static ssize_t slabs_cpu_partial_show(struct kmem_cache *s, char *buf)
> >               slab = slub_percpu_partial(per_cpu_ptr(s->cpu_slab, cpu));
> >               if (slab) {
> >                       slabs = READ_ONCE(slab->slabs);
> > -                     objects = (slabs * oo_objects(s->oo)) / 2;
> > +                     objects = READ_ONCE(slab->pobjects);
> >                       len += sysfs_emit_at(buf, len, " C%d=%d(%d)",
> >                                            cpu, objects, slabs);
> >               }
>

