* [PATCH] slub: limit number of slabs to scan in count_partial()
From: Jianfeng Wang @ 2024-04-11 16:40 UTC (permalink / raw)
To: linux-mm, linux-kernel
Cc: cl, penberg, rientjes, iamjoonsoo.kim, akpm, vbabka, junxiao.bi
When reading "/proc/slabinfo", the kernel needs to report the number
of free objects for each kmem_cache. The current implementation relies
on count_partial() that counts the number of free objects by scanning
each kmem_cache_node's list of partial slabs and summing free objects
from every partial slab in the list. This process must hold the
per-kmem_cache_node spinlock with IRQs disabled and may take a long
time. Consequently, it can block slab allocation requests on other CPU
cores and cause timeouts for network devices etc., when the partial
slab list is long. In production, even the NMI watchdog can be
triggered by this: e.g., for "buffer_head", the number of partial slabs was
observed to be ~1M in one kmem_cache_node. This problem was also
confirmed by several others [1-3] in the past.
Iterating a partial list to get the exact count of objects can cause
soft lockups for a long list with or without the lock (e.g., if
preemption is disabled), and is not very useful either: the object
count can change right after the lock is released. The approach of
maintaining free-object counters requires atomic operations on the
fast path [3].
So, the fix is to limit the number of slabs to scan in
count_partial(), and output an approximated result if the list is too
long. Default to 10000 which should be enough for most sane cases.
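As a rough illustration with hypothetical numbers: if the scanned prefix
of 10000 partial slabs holds 200000 free objects and the node has 250000
partial slabs in total, the reported value would be roughly

  200000 * 250000 / 10000 = 5000000

i.e. the per-slab average of the scanned prefix extrapolated to the
whole list.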
[1] https://lore.kernel.org/linux-mm/alpine.DEB.2.21.2003031602460.1537@www.lameter.com/T/
[2] https://lore.kernel.org/lkml/alpine.DEB.2.22.394.2008071258020.55871@www.lameter.com/T/
[3] https://lore.kernel.org/lkml/1e01092b-140d-2bab-aeba-321a74a194ee@linux.com/T/
Signed-off-by: Jianfeng Wang <jianfeng.w.wang@oracle.com>
---
mm/slub.c | 11 ++++++++++-
1 file changed, 10 insertions(+), 1 deletion(-)
diff --git a/mm/slub.c b/mm/slub.c
index 1bb2a93cf7b6..5ed998ec7d6d 100644
--- a/mm/slub.c
+++ b/mm/slub.c
@@ -3213,16 +3213,25 @@ static inline bool free_debug_processing(struct kmem_cache *s,
 #endif /* CONFIG_SLUB_DEBUG */
 
 #if defined(CONFIG_SLUB_DEBUG) || defined(SLAB_SUPPORTS_SYSFS)
+#define MAX_PARTIAL_TO_SCAN 10000
+
 static unsigned long count_partial(struct kmem_cache_node *n,
 					int (*get_count)(struct slab *))
 {
 	unsigned long flags;
 	unsigned long x = 0;
+	unsigned long scanned = 0;
 	struct slab *slab;
 
 	spin_lock_irqsave(&n->list_lock, flags);
-	list_for_each_entry(slab, &n->partial, slab_list)
+	list_for_each_entry(slab, &n->partial, slab_list) {
 		x += get_count(slab);
+		if (++scanned > MAX_PARTIAL_TO_SCAN) {
+			/* Approximate total count of objects */
+			x = mult_frac(x, n->nr_partial, scanned);
+			break;
+		}
+	}
 	spin_unlock_irqrestore(&n->list_lock, flags);
 	return x;
 }
--
2.42.1
* Re: [PATCH] slub: limit number of slabs to scan in count_partial()
From: Christoph Lameter (Ampere) @ 2024-04-11 17:02 UTC (permalink / raw)
To: Jianfeng Wang
Cc: linux-mm, linux-kernel, penberg, rientjes, iamjoonsoo.kim, akpm,
vbabka, junxiao.bi
On Thu, 11 Apr 2024, Jianfeng Wang wrote:
> So, the fix is to limit the number of slabs to scan in
> count_partial(), and output an approximated result if the list is too
> long. Default to 10000 which should be enough for most sane cases.
That is a creative approach. The problem though is that objects on the
partial lists are kind of sorted. The partial slabs with only a few
objects available are at the start of the list so that allocations cause
them to be removed from the partial list fast. Full slabs do not need to
be tracked on any list.
The partial slabs with few objects are put at the end of the partial list
in the hope that the few objects remaining will also be freed which would
allow the freeing of the slab folio.
So the object density may be higher at the beginning of the list.
kmem_cache_shrink() will explicitly sort the partial lists to put the
partial pages in that order.
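To put made-up numbers on that concern: if the 10000 slabs at the head of
a 250000-slab partial list average 2 free objects each while the slabs
behind them average 30, a head-only sample would estimate about
2 * 250000 = 500000 free objects, far below the real count of roughly
2 * 10000 + 30 * 240000 = 7220000.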
Can you run some tests showing the difference between the estimation and
the real count?
* Re: [PATCH] slub: limit number of slabs to scan in count_partial()
From: Vlastimil Babka @ 2024-04-12 7:41 UTC (permalink / raw)
To: Jianfeng Wang, linux-mm, linux-kernel
Cc: cl, penberg, rientjes, iamjoonsoo.kim, akpm, junxiao.bi
On 4/11/24 6:40 PM, Jianfeng Wang wrote:
> When reading "/proc/slabinfo", the kernel needs to report the number
> of free objects for each kmem_cache. The current implementation relies
> on count_partial() that counts the number of free objects by scanning
> each kmem_cache_node's list of partial slabs and summing free objects
> from every partial slab in the list. This process must hold per
> kmem_cache_node spinlock and disable IRQ and may take a long time.
> Consequently, it can block slab allocation requests on other CPU cores
> and cause timeouts for network devices etc., when the partial slab list
> is long. In production, even NMI watchdog can be triggered due to this
> matter: e.g., for "buffer_head", the number of partial slabs was
> observed to be ~1M in one kmem_cache_node. This problem was also
Wonder what led to such a situation, whether it could have been the NUMA
related problem this patch is fixing:
https://lore.kernel.org/all/20240330082335.29710-1-chenjun102@huawei.com/
But guess we don't do that with buffer heads, and it's just possible that a
surge of allocations led to many slabs being allocated, and then a
slower trickle of freeing didn't happen in a matching order, leading to many
slabs becoming partially free...
> confirmed by several others [1-3] in the past.
>
> Iterating a partial list to get the exact count of objects can cause
> soft lockups for a long list with or without the lock (e.g., if
> preemption is disabled), and is not very useful too: the object
> count can change right after the lock is released. The approach of
> maintaining free-object counters requires atomic operations on the
> fast path [3].
>
> So, the fix is to limit the number of slabs to scan in
> count_partial(), and output an approximated result if the list is too
> long. Default to 10000 which should be enough for most sane cases.
>
> [1] https://lore.kernel.org/linux-mm/
> alpine.DEB.2.21.2003031602460.1537@www.lameter.com/T/
> [2] https://lore.kernel.org/lkml/
> alpine.DEB.2.22.394.2008071258020.55871@www.lameter.com/T/
> [3] https://lore.kernel.org/lkml/
> 1e01092b-140d-2bab-aeba-321a74a194ee@linux.com/T/
>
> Signed-off-by: Jianfeng Wang <jianfeng.w.wang@oracle.com>
> ---
> mm/slub.c | 11 ++++++++++-
> 1 file changed, 10 insertions(+), 1 deletion(-)
>
> diff --git a/mm/slub.c b/mm/slub.c
> index 1bb2a93cf7b6..5ed998ec7d6d 100644
> --- a/mm/slub.c
> +++ b/mm/slub.c
> @@ -3213,16 +3213,25 @@ static inline bool free_debug_processing(struct kmem_cache *s,
> #endif /* CONFIG_SLUB_DEBUG */
>
> #if defined(CONFIG_SLUB_DEBUG) || defined(SLAB_SUPPORTS_SYSFS)
> +#define MAX_PARTIAL_TO_SCAN 10000
> +
> static unsigned long count_partial(struct kmem_cache_node *n,
> int (*get_count)(struct slab *))
> {
> unsigned long flags;
> unsigned long x = 0;
> + unsigned long scanned = 0;
> struct slab *slab;
>
> spin_lock_irqsave(&n->list_lock, flags);
> - list_for_each_entry(slab, &n->partial, slab_list)
> + list_for_each_entry(slab, &n->partial, slab_list) {
> x += get_count(slab);
> + if (++scanned > MAX_PARTIAL_TO_SCAN) {
> + /* Approximate total count of objects */
> + x = mult_frac(x, n->nr_partial, scanned);
> + break;
> + }
> + }
> spin_unlock_irqrestore(&n->list_lock, flags);
> return x;
> }
* Re: [PATCH] slub: limit number of slabs to scan in count_partial()
From: Vlastimil Babka @ 2024-04-12 7:48 UTC (permalink / raw)
To: Christoph Lameter (Ampere), Jianfeng Wang
Cc: linux-mm, linux-kernel, penberg, rientjes, iamjoonsoo.kim, akpm,
junxiao.bi
On 4/11/24 7:02 PM, Christoph Lameter (Ampere) wrote:
> On Thu, 11 Apr 2024, Jianfeng Wang wrote:
>
>> So, the fix is to limit the number of slabs to scan in
>> count_partial(), and output an approximated result if the list is too
>> long. Default to 10000 which should be enough for most sane cases.
>
>
> That is a creative approach. The problem though is that objects on the
> partial lists are kind of sorted. The partial slabs with only a few
> objects available are at the start of the list so that allocations cause
> them to be removed from the partial list fast. Full slabs do not need to
> be tracked on any list.
>
> The partial slabs with few objects are put at the end of the partial list
> in the hope that the few objects remaining will also be freed which would
> allow the freeing of the slab folio.
>
> So the object density may be higher at the beginning of the list.
>
> kmem_cache_shrink() will explicitly sort the partial lists to put the
> partial pages in that order.
>
> Can you run some tests showing the difference between the estimation and
> the real count?
Maybe we could also get a more accurate picture by counting N slabs from the
head and N from the tail and approximating from both. Also not perfect, but
could be able to answer the question if the kmem_cache is significantly
fragmented. Which is probably the only information we can get from the
slabinfo <active_objs> vs <num_objs>. IIRC the latter is always accurate,
the former never because of cpu slabs, so we never know how many objects are
exactly in use. By comparing both we can get an idea of the fragmentation,
and if this change won't make that estimate significantly worse, it should
be acceptable.
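As a hypothetical example: if slabinfo reports <num_objs> = 10000000 and
<active_objs> = 6000000, then about 4000000 objects (40%) of the allocated
slab capacity are free, which already signals heavy fragmentation; the
question for this change is only whether the approximation shifts that
ratio noticeably.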
* Re: [External] : Re: [PATCH] slub: limit number of slabs to scan in count_partial()
From: Jianfeng Wang @ 2024-04-12 17:29 UTC (permalink / raw)
To: Vlastimil Babka, Christoph Lameter (Ampere)
Cc: linux-mm, linux-kernel, penberg, rientjes, iamjoonsoo.kim, akpm,
junxiao.bi
On 4/12/24 12:48 AM, Vlastimil Babka wrote:
> On 4/11/24 7:02 PM, Christoph Lameter (Ampere) wrote:
>> On Thu, 11 Apr 2024, Jianfeng Wang wrote:
>>
>>> So, the fix is to limit the number of slabs to scan in
>>> count_partial(), and output an approximated result if the list is too
>>> long. Default to 10000 which should be enough for most sane cases.
>>
>>
>> That is a creative approach. The problem though is that objects on the
>> partial lists are kind of sorted. The partial slabs with only a few
>> objects available are at the start of the list so that allocations cause
>> them to be removed from the partial list fast. Full slabs do not need to
>> be tracked on any list.
>>
>> The partial slabs with few objects are put at the end of the partial list
>> in the hope that the few objects remaining will also be freed which would
>> allow the freeing of the slab folio.
>>
>> So the object density may be higher at the beginning of the list.
>>
>> kmem_cache_shrink() will explicitly sort the partial lists to put the
>> partial pages in that order.
>>
>> Can you run some tests showing the difference between the estimation and
>> the real count?
Yes.
On a server with one NUMA node, I create a case that uses many dentry objects.
For "dentry", the length of partial slabs is slightly above 250000. Then, I
compare my approach of scanning N slabs from the list's head vs. the original
approach of scanning the full list. I do it by getting both results using
the new and the original count_partial() and printing them in /proc/slabinfo.
N = 10000
my_result = 4741651
org_result = 4744966
diff = (org_result - my_result) / org_result = 0.00069 = 0.069 %
Increasing N further to 25000 only slightly improves the accuracy:
N = 15000 -> diff = 0.02 %
N = 20000 -> diff = 0.01 %
N = 25000 -> diff = -0.017 %
Based on the measurement, I think the difference between the estimation and
the real count is very limited (i.e. less than 0.1% for N = 10000). The
benefit is significant: shorter execution time for get_slabinfo(); no more
soft lockups or crashes caused by count_partial().
>
> Maybe we could also get a more accurate picture by counting N slabs from the
> head and N from the tail and approximating from both. Also not perfect, but
> could be able to answer the question if the kmem_cache is significantly
> fragmented. Which is probably the only information we can get from the
> slabinfo <active_objs> vs <num_objs>. IIRC the latter is always accurate,
> the former never because of cpu slabs, so we never know how many objects are
> exactly in use. By comparing both we can get an idea of the fragmentation,
> and if this change won't make that estimate significantly worse, it should
> be acceptable.
* Re: [PATCH] slub: limit number of slabs to scan in count_partial()
From: Christoph Lameter (Ampere) @ 2024-04-12 18:16 UTC (permalink / raw)
To: Jianfeng Wang
Cc: Vlastimil Babka, linux-mm, linux-kernel, penberg, rientjes,
iamjoonsoo.kim, akpm, junxiao.bi
On Fri, 12 Apr 2024, Jianfeng Wang wrote:
>>> Can you run some tests showing the difference between the estimation and
>>> the real count?
>
> Yes.
> On a server with one NUMA node, I create a case that uses many dentry objects.
> For "dentry", the length of partial slabs is slightly above 250000. Then, I
> compare my approach of scanning N slabs from the list's head v.s. the original
> approach of scanning the full list. I do it by getting both results using
> the new and the original count_partial() and printing them in /proc/slabinfo.
>
> N = 10000
> my_result = 4741651
> org_result = 4744966
> diff = (org_result - my_result) / org_result = 0.00069 = 0.069 %
>
> Increasing N further to 25000 will only slight improve the accuracy:
> N = 15000 -> diff = 0.02 %
> N = 20000 -> diff = 0.01 %
> N = 25000 -> diff = -0.017 %
>
> Based on the measurement, I think the difference between the estimation and
> the real count is very limited (i.e. less than 0.1% for N = 10000). The
> benefit is significant: shorter execution time for get_slabinfo(); no more
> soft lockups or crashes caused by count_partial().
Wow. That is good. Maybe decrease N to 1000 instead?
* Re: [PATCH] slub: limit number of slabs to scan in count_partial()
From: Jianfeng Wang @ 2024-04-12 18:32 UTC (permalink / raw)
To: Christoph Lameter (Ampere)
Cc: Vlastimil Babka, linux-mm, linux-kernel, penberg, rientjes,
iamjoonsoo.kim, akpm, junxiao.bi
On 4/12/24 11:16 AM, Christoph Lameter (Ampere) wrote:
> On Fri, 12 Apr 2024, Jianfeng Wang wrote:
>
>>>> Can you run some tests showing the difference between the estimation and
>>>> the real count?
>>
>> Yes.
>> On a server with one NUMA node, I create a case that uses many dentry objects.
>> For "dentry", the length of partial slabs is slightly above 250000. Then, I
>> compare my approach of scanning N slabs from the list's head v.s. the original
>> approach of scanning the full list. I do it by getting both results using
>> the new and the original count_partial() and printing them in /proc/slabinfo.
>>
>> N = 10000
>> my_result = 4741651
>> org_result = 4744966
>> diff = (org_result - my_result) / org_result = 0.00069 = 0.069 %
>>
>> Increasing N further to 25000 will only slight improve the accuracy:
>> N = 15000 -> diff = 0.02 %
>> N = 20000 -> diff = 0.01 %
>> N = 25000 -> diff = -0.017 %
>>
>> Based on the measurement, I think the difference between the estimation and
>> the real count is very limited (i.e. less than 0.1% for N = 10000). The
>> benefit is significant: shorter execution time for get_slabinfo(); no more
>> soft lockups or crashes caused by count_partial().
>
> Wow. That is good. Maybe decrease N to 1000 instead?
>
Yes, the diff is still limited. Here are some numbers:
N = 5000 -> diff = 0.0019 = 0.19 %
N = 3000 -> diff = 0.0023 = 0.23 %
N = 1000 -> diff = 0.0040 = 0.40 %
So, the estimation is quite accurate.
* Re: [External] : Re: [PATCH] slub: limit number of slabs to scan in count_partial()
From: Vlastimil Babka @ 2024-04-12 20:20 UTC (permalink / raw)
To: Jianfeng Wang, Christoph Lameter (Ampere)
Cc: linux-mm, linux-kernel, penberg, rientjes, iamjoonsoo.kim, akpm,
junxiao.bi
On 4/12/24 7:29 PM, Jianfeng Wang wrote:
>
>
> On 4/12/24 12:48 AM, Vlastimil Babka wrote:
>> On 4/11/24 7:02 PM, Christoph Lameter (Ampere) wrote:
>>> On Thu, 11 Apr 2024, Jianfeng Wang wrote:
>>>
>>>> So, the fix is to limit the number of slabs to scan in
>>>> count_partial(), and output an approximated result if the list is too
>>>> long. Default to 10000 which should be enough for most sane cases.
>>>
>>>
>>> That is a creative approach. The problem though is that objects on the
>>> partial lists are kind of sorted. The partial slabs with only a few
>>> objects available are at the start of the list so that allocations cause
>>> them to be removed from the partial list fast. Full slabs do not need to
>>> be tracked on any list.
>>>
>>> The partial slabs with few objects are put at the end of the partial list
>>> in the hope that the few objects remaining will also be freed which would
>>> allow the freeing of the slab folio.
>>>
>>> So the object density may be higher at the beginning of the list.
>>>
>>> kmem_cache_shrink() will explicitly sort the partial lists to put the
>>> partial pages in that order.
>>>
>>> Can you run some tests showing the difference between the estimation and
>>> the real count?
>
> Yes.
> On a server with one NUMA node, I create a case that uses many dentry objects.
Could you describe in more detail how you make the dentry cache grow such
a large partial slab list? Thanks.
* Re: [PATCH] slub: limit number of slabs to scan in count_partial()
From: Jianfeng Wang @ 2024-04-12 20:44 UTC (permalink / raw)
To: Vlastimil Babka, Christoph Lameter (Ampere)
Cc: linux-mm, linux-kernel, penberg, rientjes, iamjoonsoo.kim, akpm,
junxiao.bi
On 4/12/24 1:20 PM, Vlastimil Babka wrote:
> On 4/12/24 7:29 PM, Jianfeng Wang wrote:
>>
>>
>> On 4/12/24 12:48 AM, Vlastimil Babka wrote:
>>> On 4/11/24 7:02 PM, Christoph Lameter (Ampere) wrote:
>>>> On Thu, 11 Apr 2024, Jianfeng Wang wrote:
>>>>
>>>>> So, the fix is to limit the number of slabs to scan in
>>>>> count_partial(), and output an approximated result if the list is too
>>>>> long. Default to 10000 which should be enough for most sane cases.
>>>>
>>>>
>>>> That is a creative approach. The problem though is that objects on the
>>>> partial lists are kind of sorted. The partial slabs with only a few
>>>> objects available are at the start of the list so that allocations cause
>>>> them to be removed from the partial list fast. Full slabs do not need to
>>>> be tracked on any list.
>>>>
>>>> The partial slabs with few objects are put at the end of the partial list
>>>> in the hope that the few objects remaining will also be freed which would
>>>> allow the freeing of the slab folio.
>>>>
>>>> So the object density may be higher at the beginning of the list.
>>>>
>>>> kmem_cache_shrink() will explicitly sort the partial lists to put the
>>>> partial pages in that order.
>>>>
>>>> Can you run some tests showing the difference between the estimation and
>>>> the real count?
>>
>> Yes.
>> On a server with one NUMA node, I create a case that uses many dentry objects.
>
> Could you describe in more detail how do you make dentry cache to grow such
> a large partial slabs list? Thanks.
>
I utilized the fact that creating a folder will create a new dentry object,
while deleting a folder will delete all of its sub-folders' dentry objects.

I started by creating N folders, each with M empty sub-folders. These
operations should consume a large number of dentry objects in sequential
order, so their slabs were very likely to be full slabs. After all folders
were created, I deleted a subset of the N folders (i.e., one out of every
two), which created many holes and turned a subset of the full slabs into
partial slabs.
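For reference, a minimal user-space sketch of that procedure (the directory
names and the values of N and M below are made up; the actual test setup
may differ):

---
/*
 * Create N top-level directories, each with M empty sub-directories,
 * then remove every other top-level tree so that many formerly-full
 * dentry slabs become partially free.
 */
#include <stdio.h>
#include <sys/stat.h>
#include <unistd.h>

#define N 100000	/* number of top-level directories (made up) */
#define M 10		/* empty sub-directories per directory (made up) */

int main(void)
{
	char path[64];

	mkdir("dtest", 0755);

	for (int i = 0; i < N; i++) {
		snprintf(path, sizeof(path), "dtest/%d", i);
		mkdir(path, 0755);
		for (int j = 0; j < M; j++) {
			/* each mkdir leaves a positive dentry in the dcache */
			snprintf(path, sizeof(path), "dtest/%d/%d", i, j);
			mkdir(path, 0755);
		}
	}

	/* Delete every other tree to punch holes into the dentry slabs. */
	for (int i = 0; i < N; i += 2) {
		for (int j = 0; j < M; j++) {
			snprintf(path, sizeof(path), "dtest/%d/%d", i, j);
			rmdir(path);
		}
		snprintf(path, sizeof(path), "dtest/%d", i);
		rmdir(path);
	}

	return 0;
}
---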
* Re: [PATCH] slub: limit number of slabs to scan in count_partial()
From: Jianfeng Wang @ 2024-04-13 1:17 UTC (permalink / raw)
To: Vlastimil Babka, Christoph Lameter (Ampere)
Cc: linux-mm, linux-kernel, penberg, rientjes, iamjoonsoo.kim, akpm,
Junxiao Bi
> On Apr 12, 2024, at 1:44 PM, Jianfeng Wang <jianfeng.w.wang@oracle.com> wrote:
>
> On 4/12/24 1:20 PM, Vlastimil Babka wrote:
>> On 4/12/24 7:29 PM, Jianfeng Wang wrote:
>>>
>>>
>>> On 4/12/24 12:48 AM, Vlastimil Babka wrote:
>>>> On 4/11/24 7:02 PM, Christoph Lameter (Ampere) wrote:
>>>>> On Thu, 11 Apr 2024, Jianfeng Wang wrote:
>>>>>
>>>>>> So, the fix is to limit the number of slabs to scan in
>>>>>> count_partial(), and output an approximated result if the list is too
>>>>>> long. Default to 10000 which should be enough for most sane cases.
>>>>>
>>>>>
>>>>> That is a creative approach. The problem though is that objects on the
>>>>> partial lists are kind of sorted. The partial slabs with only a few
>>>>> objects available are at the start of the list so that allocations cause
>>>>> them to be removed from the partial list fast. Full slabs do not need to
>>>>> be tracked on any list.
>>>>>
>>>>> The partial slabs with few objects are put at the end of the partial list
>>>>> in the hope that the few objects remaining will also be freed which would
>>>>> allow the freeing of the slab folio.
>>>>>
>>>>> So the object density may be higher at the beginning of the list.
>>>>>
>>>>> kmem_cache_shrink() will explicitly sort the partial lists to put the
>>>>> partial pages in that order.
>>>>>
I realized that I should do "echo 1 > /sys/kernel/slab/dentry/shrink" to sort the list explicitly.
After that, the numbers become:
N = 10000 -> diff = 7.1 %
N = 20000 -> diff = 5.7 %
N = 25000 -> diff = 5.4 %
So, I expect a ~5-7% difference after shrinking.
>>>>> Can you run some tests showing the difference between the estimation and
>>>>> the real count?
>>>
>>> Yes.
>>> On a server with one NUMA node, I create a case that uses many dentry objects.
>>
>> Could you describe in more detail how do you make dentry cache to grow such
>> a large partial slabs list? Thanks.
>>
>
> I utilized the fact that creating a folder will create a new dentry object;
> deleting a folder will delete all its sub-folder's dentry objects.
>
> Then, I started to create N folders, while each folder has M empty sub-folders.
> Assuming that these operations would consume a large number of dentry
> objects in the sequential order. Their slabs were very likely to be full slabs.
> After all folders were created, I deleted a subset of the N folders (i.e.,
> one out of every two folders). This would create many holes, which turned a
> subset of full slabs into partial slabs.
* Re: [External] : Re: [PATCH] slub: limit number of slabs to scan in count_partial()
From: Matthew Wilcox @ 2024-04-13 4:43 UTC (permalink / raw)
To: Vlastimil Babka
Cc: Jianfeng Wang, Christoph Lameter (Ampere),
linux-mm, linux-kernel, penberg, rientjes, iamjoonsoo.kim, akpm,
junxiao.bi
On Fri, Apr 12, 2024 at 10:20:48PM +0200, Vlastimil Babka wrote:
> On 4/12/24 7:29 PM, Jianfeng Wang wrote:
> > Yes.
> > On a server with one NUMA node, I create a case that uses many dentry objects.
>
> Could you describe in more detail how do you make dentry cache to grow such
> a large partial slabs list? Thanks.
I don't know precisely which (customer) problem Jianfeng is looking
at, but here's one we got to publish a couple of years ago:
https://lore.kernel.org/linux-fsdevel/3a7abaca-e20f-ad59-f6f0-caedd8450f9f@oracle.com/
(is LSFMM coming up? Must be time to have that discussion again!)
https://lwn.net/Articles/894098/
* Re: [PATCH] slub: limit number of slabs to scan in count_partial()
From: Vlastimil Babka @ 2024-04-15 7:35 UTC (permalink / raw)
To: Jianfeng Wang, Christoph Lameter (Ampere)
Cc: linux-mm, linux-kernel, penberg, rientjes, iamjoonsoo.kim, akpm,
Junxiao Bi
On 4/13/24 3:17 AM, Jianfeng Wang wrote:
>
>> On Apr 12, 2024, at 1:44 PM, Jianfeng Wang <jianfeng.w.wang@oracle.com> wrote:
>>
>> On 4/12/24 1:20 PM, Vlastimil Babka wrote:
>>> On 4/12/24 7:29 PM, Jianfeng Wang wrote:
>>>>
>>>> On 4/12/24 12:48 AM, Vlastimil Babka wrote:
>>>>> On 4/11/24 7:02 PM, Christoph Lameter (Ampere) wrote:
>>>>>> On Thu, 11 Apr 2024, Jianfeng Wang wrote:
>>>>>>
>>>>>>> So, the fix is to limit the number of slabs to scan in
>>>>>>> count_partial(), and output an approximated result if the list is too
>>>>>>> long. Default to 10000 which should be enough for most sane cases.
>>>>>>
>>>>>>
>>>>>> That is a creative approach. The problem though is that objects on the
>>>>>> partial lists are kind of sorted. The partial slabs with only a few
>>>>>> objects available are at the start of the list so that allocations cause
>>>>>> them to be removed from the partial list fast. Full slabs do not need to
>>>>>> be tracked on any list.
>>>>>>
>>>>>> The partial slabs with few objects are put at the end of the partial list
>>>>>> in the hope that the few objects remaining will also be freed which would
>>>>>> allow the freeing of the slab folio.
>>>>>>
>>>>>> So the object density may be higher at the beginning of the list.
>>>>>>
>>>>>> kmem_cache_shrink() will explicitly sort the partial lists to put the
>>>>>> partial pages in that order.
>>>>>>
>
> Realized that I’d do "echo 1 > /sys/kernel/slab/dentry/shrink” to sort the list explicitly.
> After that, the numbers become:
> N = 10000 -> diff = 7.1 %
> N = 20000 -> diff = 5.7 %
> N = 25000 -> diff = 5.4 %
> So, expecting ~5-7% difference after shrinking.
>
>>>>>> Can you run some tests showing the difference between the estimation and
>>>>>> the real count?
>>>>
>>>> Yes.
>>>> On a server with one NUMA node, I create a case that uses many dentry objects.
>>>
>>> Could you describe in more detail how do you make dentry cache to grow such
>>> a large partial slabs list? Thanks.
>>>
>>
>> I utilized the fact that creating a folder will create a new dentry object;
>> deleting a folder will delete all its sub-folder's dentry objects.
>>
>> Then, I started to create N folders, while each folder has M empty sub-folders.
>> Assuming that these operations would consume a large number of dentry
>> objects in the sequential order. Their slabs were very likely to be full slabs.
>> After all folders were created, I deleted a subset of the N folders (i.e.,
>> one out of every two folders). This would create many holes, which turned a
>> subset of full slabs into partial slabs.
Thanks, right, so that's quite a deterministic way to achieve the long
partial lists with a very close to uniform free/used ratio, so no wonder
the resulting accuracy is good and the diff is very small. But in practice
the workloads that may lead to long lists will not be so uniform. The result
after shrinking shows what happens if there's bias in which slabs we inspect
due to the sorting. But still, most of the slabs will have the near-uniform
free/used ratio, so the sorting will not make much of a difference. Another
workload might, though.
So what happens if you inspect X slabs from the head and X from the tail as
I suggested? That should help your test case even after you sort, and also
should in theory be more accurate even for less uniform workloads.
* Re: [PATCH] slub: limit number of slabs to scan in count_partial()
From: Christoph Lameter (Ampere) @ 2024-04-15 16:20 UTC (permalink / raw)
To: Jianfeng Wang
Cc: Vlastimil Babka, linux-mm, linux-kernel, penberg, rientjes,
iamjoonsoo.kim, akpm, Junxiao Bi
On Sat, 13 Apr 2024, Jianfeng Wang wrote:
>>>>>> kmem_cache_shrink() will explicitly sort the partial lists to put the
>>>>>> partial pages in that order.
>>>>>>
>
> Realized that I’d do "echo 1 > /sys/kernel/slab/dentry/shrink” to sort the list explicitly.
> After that, the numbers become:
> N = 10000 -> diff = 7.1 %
> N = 20000 -> diff = 5.7 %
> N = 25000 -> diff = 5.4 %
> So, expecting ~5-7% difference after shrinking.
That still looks ok to me.
* Re: [PATCH] slub: limit number of slabs to scan in count_partial()
From: Jianfeng Wang @ 2024-04-16 18:58 UTC (permalink / raw)
To: Vlastimil Babka, Christoph Lameter (Ampere)
Cc: linux-mm, linux-kernel, penberg, rientjes, iamjoonsoo.kim, akpm,
Junxiao Bi
On 4/15/24 12:35 AM, Vlastimil Babka wrote:
> On 4/13/24 3:17 AM, Jianfeng Wang wrote:
>>
>>> On Apr 12, 2024, at 1:44 PM, Jianfeng Wang <jianfeng.w.wang@oracle.com> wrote:
>>>
>>> On 4/12/24 1:20 PM, Vlastimil Babka wrote:
>>>> On 4/12/24 7:29 PM, Jianfeng Wang wrote:
>>>>>
>>>>> On 4/12/24 12:48 AM, Vlastimil Babka wrote:
>>>>>> On 4/11/24 7:02 PM, Christoph Lameter (Ampere) wrote:
>>>>>>> On Thu, 11 Apr 2024, Jianfeng Wang wrote:
>>>>>>>
>>>>>>>> So, the fix is to limit the number of slabs to scan in
>>>>>>>> count_partial(), and output an approximated result if the list is too
>>>>>>>> long. Default to 10000 which should be enough for most sane cases.
>>>>>>>
>>>>>>>
>>>>>>> That is a creative approach. The problem though is that objects on the
>>>>>>> partial lists are kind of sorted. The partial slabs with only a few
>>>>>>> objects available are at the start of the list so that allocations cause
>>>>>>> them to be removed from the partial list fast. Full slabs do not need to
>>>>>>> be tracked on any list.
>>>>>>>
>>>>>>> The partial slabs with few objects are put at the end of the partial list
>>>>>>> in the hope that the few objects remaining will also be freed which would
>>>>>>> allow the freeing of the slab folio.
>>>>>>>
>>>>>>> So the object density may be higher at the beginning of the list.
>>>>>>>
>>>>>>> kmem_cache_shrink() will explicitly sort the partial lists to put the
>>>>>>> partial pages in that order.
>>>>>>>
>>
>> Realized that I’d do "echo 1 > /sys/kernel/slab/dentry/shrink” to sort the list explicitly.
>> After that, the numbers become:
>> N = 10000 -> diff = 7.1 %
>> N = 20000 -> diff = 5.7 %
>> N = 25000 -> diff = 5.4 %
>> So, expecting ~5-7% difference after shrinking.
>>
>>>>>>> Can you run some tests showing the difference between the estimation and
>>>>>>> the real count?
>>>>>
>>>>> Yes.
>>>>> On a server with one NUMA node, I create a case that uses many dentry objects.
>>>>
>>>> Could you describe in more detail how do you make dentry cache to grow such
>>>> a large partial slabs list? Thanks.
>>>>
>>>
>>> I utilized the fact that creating a folder will create a new dentry object;
>>> deleting a folder will delete all its sub-folder's dentry objects.
>>>
>>> Then, I started to create N folders, while each folder has M empty sub-folders.
>>> Assuming that these operations would consume a large number of dentry
>>> objects in the sequential order. Their slabs were very likely to be full slabs.
>>> After all folders were created, I deleted a subset of the N folders (i.e.,
>>> one out of every two folders). This would create many holes, which turned a
>>> subset of full slabs into partial slabs.
>
> Thanks, right, so that's quite a deterministic way to achieve the long
> partial lists with very close to uniform ratio of free/used, so no wonder
> the resulting accuracy is good and the diff is very small. But in practice
> the workloads that may lead to long lists will not be so uniform. The result
> after shrinking shows what happens if there's bias in which slabs we inspect
> due to the sorting. But still most of the slabs will have the near-uniform
> free/used ratio so the sorting will not do so much difference. But another
> workload might do that.
>
> So what happens if you inspect X slabs from the head and X from the tail as
> I suggested? That should help your test case even after you sort, and also
> should in theory be more accurate even for less uniform workloads.
Yes, the approach of counting from both directions and then approximating
works better after sorting the partial list.
Here is what I did.
---
+static unsigned long count_partial(struct kmem_cache_node *n,
+					int (*get_count)(struct slab *))
+{
+	unsigned long flags;
+	unsigned long x = 0;
+	unsigned long scanned = 0;
+	struct slab *slab;
+
+	spin_lock_irqsave(&n->list_lock, flags);
+	list_for_each_entry(slab, &n->partial, slab_list) {
+		x += get_count(slab);
+		if (++scanned > MAX_PARTIAL_TO_SCAN)
+			break;
+	}
+
+	if (scanned > MAX_PARTIAL_TO_SCAN) {
+		scanned = 0;
+		list_for_each_entry_reverse(slab, &n->partial, slab_list) {
+			x += get_count(slab);
+			if (++scanned > MAX_PARTIAL_TO_SCAN) {
+				/* Approximate total count of objects */
+				x = mult_frac(x, n->nr_partial, scanned * 2);
+				x = min(x, node_nr_objs(n));
+				break;
+			}
+		}
+	}
+	spin_unlock_irqrestore(&n->list_lock, flags);
+	return x;
+}
---
Results:
---
* Pre-shrink:
MAX_SLAB_TO_SCAN | Diff (single-direction) | Diff (double-direction) |
1000 | 0.43 % | 0.80 % |
5000 | 0.06 % | 0.16 % |
10000 | 0.02 % | -0.003 % |
20000 | 0.009 % | 0.03 % |
* After-shrink:
MAX_SLAB_TO_SCAN | Diff (single-direction) | Diff (double-direction) |
1000 | 12.46 % | 3.60 % |
5000 | 5.38 % | 0.22 % |
10000 | 4.99 % | -0.06 % |
20000 | 4.86 % | -0.17 % |
---
For |MAX_SLAB_TO_SCAN| >= 5000, count_partial() returns the exact
object count when the partial list is not longer than
|MAX_SLAB_TO_SCAN|. Otherwise, it counts |MAX_SLAB_TO_SCAN| slabs from
both the head and the tail and outputs an approximation that shows
<1% difference.

With a slightly larger limit (like 10000), count_partial() should
produce the exact number in most cases (those that wouldn't lead to a
lockup anyway) and otherwise avoid lockups while still giving a good
estimate.
* Re: [PATCH] slub: limit number of slabs to scan in count_partial()
From: Vlastimil Babka @ 2024-04-16 20:14 UTC (permalink / raw)
To: Jianfeng Wang, Christoph Lameter (Ampere)
Cc: linux-mm, linux-kernel, penberg, rientjes, iamjoonsoo.kim, akpm,
Junxiao Bi
On 4/16/24 8:58 PM, Jianfeng Wang wrote:
>
>
> On 4/15/24 12:35 AM, Vlastimil Babka wrote:
>> On 4/13/24 3:17 AM, Jianfeng Wang wrote:
>>>
>>>> On Apr 12, 2024, at 1:44 PM, Jianfeng Wang <jianfeng.w.wang@oracle.com> wrote:
>>>>
>>>> On 4/12/24 1:20 PM, Vlastimil Babka wrote:
>>>>> On 4/12/24 7:29 PM, Jianfeng Wang wrote:
>>>>>>
>>>>>> On 4/12/24 12:48 AM, Vlastimil Babka wrote:
>>>>>>> On 4/11/24 7:02 PM, Christoph Lameter (Ampere) wrote:
>>>>>>>> On Thu, 11 Apr 2024, Jianfeng Wang wrote:
>>>>>>>>
>>>>>>>>> So, the fix is to limit the number of slabs to scan in
>>>>>>>>> count_partial(), and output an approximated result if the list is too
>>>>>>>>> long. Default to 10000 which should be enough for most sane cases.
>>>>>>>>
>>>>>>>>
>>>>>>>> That is a creative approach. The problem though is that objects on the
>>>>>>>> partial lists are kind of sorted. The partial slabs with only a few
>>>>>>>> objects available are at the start of the list so that allocations cause
>>>>>>>> them to be removed from the partial list fast. Full slabs do not need to
>>>>>>>> be tracked on any list.
>>>>>>>>
>>>>>>>> The partial slabs with few objects are put at the end of the partial list
>>>>>>>> in the hope that the few objects remaining will also be freed which would
>>>>>>>> allow the freeing of the slab folio.
>>>>>>>>
>>>>>>>> So the object density may be higher at the beginning of the list.
>>>>>>>>
>>>>>>>> kmem_cache_shrink() will explicitly sort the partial lists to put the
>>>>>>>> partial pages in that order.
>>>>>>>>
>>>
>>> Realized that I’d do "echo 1 > /sys/kernel/slab/dentry/shrink” to sort the list explicitly.
>>> After that, the numbers become:
>>> N = 10000 -> diff = 7.1 %
>>> N = 20000 -> diff = 5.7 %
>>> N = 25000 -> diff = 5.4 %
>>> So, expecting ~5-7% difference after shrinking.
>>>
>>>>>>>> Can you run some tests showing the difference between the estimation and
>>>>>>>> the real count?
>>>>>>
>>>>>> Yes.
>>>>>> On a server with one NUMA node, I create a case that uses many dentry objects.
>>>>>
>>>>> Could you describe in more detail how do you make dentry cache to grow such
>>>>> a large partial slabs list? Thanks.
>>>>>
>>>>
>>>> I utilized the fact that creating a folder will create a new dentry object;
>>>> deleting a folder will delete all its sub-folder's dentry objects.
>>>>
>>>> Then, I started to create N folders, while each folder has M empty sub-folders.
>>>> Assuming that these operations would consume a large number of dentry
>>>> objects in the sequential order. Their slabs were very likely to be full slabs.
>>>> After all folders were created, I deleted a subset of the N folders (i.e.,
>>>> one out of every two folders). This would create many holes, which turned a
>>>> subset of full slabs into partial slabs.
>>
>> Thanks, right, so that's quite a deterministic way to achieve the long
>> partial lists with very close to uniform ratio of free/used, so no wonder
>> the resulting accuracy is good and the diff is very small. But in practice
>> the workloads that may lead to long lists will not be so uniform. The result
>> after shrinking shows what happens if there's bias in which slabs we inspect
>> due to the sorting. But still most of the slabs will have the near-uniform
>> free/used ratio so the sorting will not do so much difference. But another
>> workload might do that.
>>
>> So what happens if you inspect X slabs from the head and X from the tail as
>> I suggested? That should help your test case even after you sort, and also
>> should in theory be more accurate even for less uniform workloads.
>
> Yes, the approach of counting from both directions and then approximating
> works better after sorting the partial list.
Yeah, I think we could go with that approach then. Let's do 5000 from each
side. You can check whether n->nr_partial < 10000 and then just scan the
whole list in a single direction with no approximation, and otherwise 5000
from each side with approximation. I think the code as you show below will
scan some slabs in the middle of the list twice if there are between 5000
and 10000 slabs on the list, so checking n->nr_partial would avoid that.
Thanks!
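For illustration, a minimal sketch of that variant (untested; it keeps
MAX_PARTIAL_TO_SCAN as the per-direction limit and reuses the helpers
already used in the code above, so it is a sketch of the suggestion rather
than a final patch):

---
#define MAX_PARTIAL_TO_SCAN 5000	/* per direction */

static unsigned long count_partial(struct kmem_cache_node *n,
					int (*get_count)(struct slab *))
{
	unsigned long flags;
	unsigned long x = 0;
	struct slab *slab;

	spin_lock_irqsave(&n->list_lock, flags);
	if (n->nr_partial <= 2 * MAX_PARTIAL_TO_SCAN) {
		/* Short list: count every partial slab exactly. */
		list_for_each_entry(slab, &n->partial, slab_list)
			x += get_count(slab);
	} else {
		unsigned long scanned = 0;

		/* Sample disjoint ranges from the head and the tail... */
		list_for_each_entry(slab, &n->partial, slab_list) {
			x += get_count(slab);
			if (++scanned == MAX_PARTIAL_TO_SCAN)
				break;
		}
		scanned = 0;
		list_for_each_entry_reverse(slab, &n->partial, slab_list) {
			x += get_count(slab);
			if (++scanned == MAX_PARTIAL_TO_SCAN)
				break;
		}
		/* ...and scale the sample up to the whole list. */
		x = mult_frac(x, n->nr_partial, 2 * MAX_PARTIAL_TO_SCAN);
		x = min(x, node_nr_objs(n));
	}
	spin_unlock_irqrestore(&n->list_lock, flags);
	return x;
}
---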
> Here is what I did.
> ---
> +static unsigned long count_partial(struct kmem_cache_node *n,
> + int (*get_count)(struct slab *))
> +{
> + unsigned long flags;
> + unsigned long x = 0;
> + unsigned long scanned = 0;
> + struct slab *slab;
> +
> + spin_lock_irqsave(&n->list_lock, flags);
> + list_for_each_entry(slab, &n->partial, slab_list) {
> + x += get_count(slab);
> + if (++scanned > MAX_PARTIAL_TO_SCAN)
> + break;
> + }
> +
> + if (scanned > max_partial_to_scan) {
> + scanned = 0;
> + list_for_each_entry_reverse(slab, &n->partial, slab_list) {
> + x += get_count(slab);
> + if (++scanned > MAX_PARTIAL_TO_SCAN) {
> + /* Approximate total count of objects */
> + x = mult_frac(x, n->nr_partial, scanned * 2);
> + x = min(x, node_nr_objs(n));
> + break;
> + }
> + }
> + }
> + spin_unlock_irqrestore(&n->list_lock, flags);
> + return x;
> +}
> ---
>
> Results:
> ---
> * Pre-shrink:
> MAX_SLAB_TO_SCAN | Diff (single-direction) | Diff (double-direction) |
> 1000 | 0.43 % | 0.80 % |
> 5000 | 0.06 % | 0.16 % |
> 10000 | 0.02 % | -0.003 % |
> 20000 | 0.009 % | 0.03 % |
>
> * After-shrink:
> MAX_SLAB_TO_SCAN | Diff (single-direction) | Diff (double-direction) |
> 1000 | 12.46 % | 3.60 % |
> 5000 | 5.38 % | 0.22 % |
> 10000 | 4.99 % | -0.06 % |
> 20000 | 4.86 % | -0.17 % |
> ---
>
> For |MAX_SLAB_TO_SCAN| >= 5000, count_partial() returns the exact
> object count if the length of the partial list is not larger than
> |MAX_SLAB_TO_SCAN|. Otherwise, it counts |MAX_SLAB_TO_SCAN| from both
> the head and from the tail, and outputs an approximation that shows
> <1% difference.
>
> With a slightly larger limit (like 10000), count_partial() should
> produce the exact number for most cases (that won't lead to a
> lockup) and avoid lockups with a good estimate.