[PATCH v3 0/2] slub: introduce count_partial_free

linux-mm.kvack.org archive mirror
 help / color / mirror / Atom feed

* [PATCH v3 0/2] slub: introduce count_partial_free_approx()
@ 2024-04-19 17:56 Jianfeng Wang
  2024-04-19 17:56 ` [PATCH v3 1/2] " Jianfeng Wang
                   ` (2 more replies)
  0 siblings, 3 replies; 7+ messages in thread
From: Jianfeng Wang @ 2024-04-19 17:56 UTC (permalink / raw)
  To: linux-mm, linux-kernel; +Cc: vbabka, cl, akpm, penberg, rientjes

This patch fixes a known issue in get_slabinfo() which relies on
count_partial() to get the exact count of free objects in a
kmem_cache_node's partial list. For some slubs, their partial lists
can be extremely long. Currently, count_partial() traverses a partial
list to get the exact count of objects. This process may take a long
time, during which slab allocations are blocked and IRQs are disabled.
In production, even NMI watchdog can be triggered due to this matter.

The proposed fix is to limit the number of slabs to scan and output an
approximated count for a long partial list. The v1 patch counts N slabs
from the list's head and then uses it to estimate the total object
count in the list. As suggested by Vlastimil, the v2 patch uses an
alternative, i.e., counting N/2 from the list's head and tail, produces
a more accurate approximation after the partial list is sorted by
kmem_cache_shrink(). In this version, the implementation is moved to a
new function count_partial_free_approx(). count_partial() is still used
in sysfs for users who still want the exact object count.

---
Changes since v2 [2]
 - Introduce count_partial_free_approx() and keep count_partial()
 - Use count_partial_free_approx() in get_slabinfo() and slab_out_of_memory()

Changes since v1 [1]
 - Update the approximation method by counting from the list's head and tail
 - Cap the approximation by the total object count
 - Update the commit message to add benchmark results and explain the choice

[1] https://lore.kernel.org/linux-mm/20240411164023.99368-1-jianfeng.w.wang@oracle.com/
[2] https://lore.kernel.org/linux-mm/20240417185938.5237-2-jianfeng.w.wang@oracle.com/

Thanks,
--Jianfeng

Jianfeng Wang (2):
  slub: introduce count_partial_free_approx()
  slub: use count_partial_free_approx() in slab_out_of_memory()

 mm/slub.c | 41 +++++++++++++++++++++++++++++++++++++++--
 1 file changed, 39 insertions(+), 2 deletions(-)

-- 
2.42.1

^ permalink raw reply	[flat|nested] 7+ messages in thread

* [PATCH v3 1/2] slub: introduce count_partial_free_approx()
  2024-04-19 17:56 [PATCH v3 0/2] slub: introduce count_partial_free_approx() Jianfeng Wang
@ 2024-04-19 17:56 ` Jianfeng Wang
  2024-04-20  0:18   ` David Rientjes
  2024-04-19 17:56 ` [PATCH v3 2/2] slub: use count_partial_free_approx() in slab_out_of_memory() Jianfeng Wang
  2024-04-22  7:56 ` [PATCH v3 0/2] slub: introduce count_partial_free_approx() Vlastimil Babka
  2 siblings, 1 reply; 7+ messages in thread
From: Jianfeng Wang @ 2024-04-19 17:56 UTC (permalink / raw)
  To: linux-mm, linux-kernel; +Cc: vbabka, cl, akpm, penberg, rientjes

When reading "/proc/slabinfo", the kernel needs to report the number
of free objects for each kmem_cache. The current implementation uses
count_partial() to get it by scanning each kmem_cache_node's partial
slab list and summing free objects from every partial slab. This
process must hold per-kmem_cache_node spinlock and disable IRQ, and
may take a long time. Consequently, it can block slab allocations on
other CPUs and cause timeouts for network devices, when the partial
list is long. In production, even NMI watchdog can be triggered due
to this matter: e.g., for "buffer_head", the number of partial slabs
was observed to be ~1M in one kmem_cache_node. This problem was also
confirmed by others [1-3].

Iterating a partial list to get the exact count of objects can cause
soft lockups for a long list with or without the lock (e.g., if
preemption is disabled), and may not be very useful: the object count
can change after the lock is released. The approach of maintaining
free-object counters requires atomic operations on the fast path [3].

So, the fix is to introduce count_partial_free_approx(). This function
can be used for getting the free object count in a kmem_cache_node's
partial list. It limits the number of slabs to scan and avoids scanning
the whole list by giving an approximation for a long list. Suppose the
limit is N. If the list's length is not greater than N, output the exact
count by traversing the list; if its length is greater than N, output an
approximated count by traversing a subset of the list. The proposed
method is to scan N/2 slabs from the list's head and N/2 slabs from
the tail. For a partial list with ~280K slabs, benchmarks show that
it performs better than just counting from the list's head, after slabs
get sorted by kmem_cache_shrink(). Default the limit to 10000, as it
produces an approximation within 1% of the exact count for both
scenarios. Then, use count_partial_free_approx() in get_slabinfo().

Benchmarks: Diff = (exact - approximated) / exact
* Normal case (w/o kmem_cache_shrink()):
| MAX_TO_SCAN | Diff (count from head)| Diff (count head+tail)|
| 1000        |  0.43  %              |  1.09  %              |
| 5000        |  0.06  %              |  0.37  %              |
| 10000       |  0.02  %              |  0.16  %              |
| 20000       |  0.009 %              | -0.003 %              |

* Skewed case (w/ kmem_cache_shrink()):
| MAX_TO_SCAN | Diff (count from head)| Diff (count head+tail)|
| 1000        |  12.46 %              |  6.75  %              |
| 5000        |  5.38  %              |  1.27  %              |
| 10000       |  4.99  %              |  0.22  %              |
| 20000       |  4.86  %              | -0.06  %              |

[1] https://lore.kernel.org/linux-mm/
alpine.DEB.2.21.2003031602460.1537@www.lameter.com/T/
[2] https://lore.kernel.org/lkml/
alpine.DEB.2.22.394.2008071258020.55871@www.lameter.com/T/
[3] https://lore.kernel.org/lkml/
1e01092b-140d-2bab-aeba-321a74a194ee@linux.com/T/

Signed-off-by: Jianfeng Wang <jianfeng.w.wang@oracle.com>
---
 mm/slub.c | 39 ++++++++++++++++++++++++++++++++++++++-
 1 file changed, 38 insertions(+), 1 deletion(-)

diff --git a/mm/slub.c b/mm/slub.c
index 1bb2a93cf7b6..993cbbdd2b6c 100644
--- a/mm/slub.c
+++ b/mm/slub.c
@@ -3213,6 +3213,43 @@ static inline bool free_debug_processing(struct kmem_cache *s,
 #endif /* CONFIG_SLUB_DEBUG */

 #if defined(CONFIG_SLUB_DEBUG) || defined(SLAB_SUPPORTS_SYSFS)
+#define MAX_PARTIAL_TO_SCAN 10000
+
+static unsigned long count_partial_free_approx(struct kmem_cache_node *n)
+{
+	unsigned long flags;
+	unsigned long x = 0;
+	struct slab *slab;
+
+	spin_lock_irqsave(&n->list_lock, flags);
+	if (n->nr_partial <= MAX_PARTIAL_TO_SCAN) {
+		list_for_each_entry(slab, &n->partial, slab_list)
+			x += slab->objects - slab->inuse;
+	} else {
+		/*
+		 * For a long list, approximate the total count of objects in
+		 * it to meet the limit on the number of slabs to scan.
+		 * Scan from both the list's head and tail for better accuracy.
+		 */
+		unsigned long scanned = 0;
+
+		list_for_each_entry(slab, &n->partial, slab_list) {
+			x += slab->objects - slab->inuse;
+			if (++scanned == MAX_PARTIAL_TO_SCAN / 2)
+				break;
+		}
+		list_for_each_entry_reverse(slab, &n->partial, slab_list) {
+			x += slab->objects - slab->inuse;
+			if (++scanned == MAX_PARTIAL_TO_SCAN)
+				break;
+		}
+		x = mult_frac(x, n->nr_partial, scanned);
+		x = min(x, node_nr_objs(n));
+	}
+	spin_unlock_irqrestore(&n->list_lock, flags);
+	return x;
+}
+
 static unsigned long count_partial(struct kmem_cache_node *n,
 					int (*get_count)(struct slab *))
 {
@@ -7089,7 +7126,7 @@ void get_slabinfo(struct kmem_cache *s, struct slabinfo *sinfo)
 	for_each_kmem_cache_node(s, node, n) {
 		nr_slabs += node_nr_slabs(n);
 		nr_objs += node_nr_objs(n);
-		nr_free += count_partial(n, count_free);
+		nr_free += count_partial_free_approx(n);
 	}

 	sinfo->active_objs = nr_objs - nr_free;
-- 
2.42.1

^ permalink raw reply	[flat|nested] 7+ messages in thread

* Re: [PATCH v3 1/2] slub: introduce count_partial_free_approx()
  2024-04-19 17:56 ` [PATCH v3 1/2] " Jianfeng Wang
@ 2024-04-20  0:18   ` David Rientjes
  2024-04-22  7:49     ` Vlastimil Babka
  0 siblings, 1 reply; 7+ messages in thread
From: David Rientjes @ 2024-04-20  0:18 UTC (permalink / raw)
  To: Jianfeng Wang; +Cc: linux-mm, linux-kernel, vbabka, cl, akpm, penberg

On Fri, 19 Apr 2024, Jianfeng Wang wrote:

> diff --git a/mm/slub.c b/mm/slub.c
> index 1bb2a93cf7b6..993cbbdd2b6c 100644
> --- a/mm/slub.c
> +++ b/mm/slub.c
> @@ -3213,6 +3213,43 @@ static inline bool free_debug_processing(struct kmem_cache *s,
>  #endif /* CONFIG_SLUB_DEBUG */
>  
>  #if defined(CONFIG_SLUB_DEBUG) || defined(SLAB_SUPPORTS_SYSFS)
> +#define MAX_PARTIAL_TO_SCAN 10000
> +
> +static unsigned long count_partial_free_approx(struct kmem_cache_node *n)
> +{
> +	unsigned long flags;
> +	unsigned long x = 0;
> +	struct slab *slab;
> +
> +	spin_lock_irqsave(&n->list_lock, flags);
> +	if (n->nr_partial <= MAX_PARTIAL_TO_SCAN) {
> +		list_for_each_entry(slab, &n->partial, slab_list)
> +			x += slab->objects - slab->inuse;
> +	} else {
> +		/*
> +		 * For a long list, approximate the total count of objects in
> +		 * it to meet the limit on the number of slabs to scan.
> +		 * Scan from both the list's head and tail for better accuracy.
> +		 */
> +		unsigned long scanned = 0;
> +
> +		list_for_each_entry(slab, &n->partial, slab_list) {
> +			x += slab->objects - slab->inuse;
> +			if (++scanned == MAX_PARTIAL_TO_SCAN / 2)
> +				break;
> +		}
> +		list_for_each_entry_reverse(slab, &n->partial, slab_list) {
> +			x += slab->objects - slab->inuse;
> +			if (++scanned == MAX_PARTIAL_TO_SCAN)
> +				break;
> +		}
> +		x = mult_frac(x, n->nr_partial, scanned);
> +		x = min(x, node_nr_objs(n));
> +	}
> +	spin_unlock_irqrestore(&n->list_lock, flags);
> +	return x;
> +}

Creative :)

The default value of MAX_PARTIAL_TO_SCAN seems to work well in practice 
while being large enough to bias for actual values?

I can't think of a better way to avoid the disruption that very long 
partial lists cause.  If the actual value is needed, it will need to be 
read from the sysfs file for that slab cache.

It does beg the question of whether we want to extend slabinfo to indicate 
that some fields are approximations, however.  Adding a suffix such as 
" : approx" to a slab cache line may be helpful if the disparity in the 
estimates would actually make a difference in practice.

I have a hard time believing that this approximation will not be "close 
enough" for all practical purposes, given that the value could very well 
substantially change the instant after the iteration is done anyway.

So for that reason, this sounds good to me!

Acked-by: David Rientjes <rientjes@google.com>

> +
>  static unsigned long count_partial(struct kmem_cache_node *n,
>  					int (*get_count)(struct slab *))
>  {
> @@ -7089,7 +7126,7 @@ void get_slabinfo(struct kmem_cache *s, struct slabinfo *sinfo)
>  	for_each_kmem_cache_node(s, node, n) {
>  		nr_slabs += node_nr_slabs(n);
>  		nr_objs += node_nr_objs(n);
> -		nr_free += count_partial(n, count_free);
> +		nr_free += count_partial_free_approx(n);
>  	}
>  
>  	sinfo->active_objs = nr_objs - nr_free;


^ permalink raw reply	[flat|nested] 7+ messages in thread

* Re: [PATCH v3 1/2] slub: introduce count_partial_free_approx()
  2024-04-20  0:18   ` David Rientjes
@ 2024-04-22  7:49     ` Vlastimil Babka
  0 siblings, 0 replies; 7+ messages in thread
From: Vlastimil Babka @ 2024-04-22  7:49 UTC (permalink / raw)
  To: David Rientjes, Jianfeng Wang; +Cc: linux-mm, linux-kernel, cl, akpm, penberg

On 4/20/24 2:18 AM, David Rientjes wrote:
> On Fri, 19 Apr 2024, Jianfeng Wang wrote:
> 
>> diff --git a/mm/slub.c b/mm/slub.c
>> index 1bb2a93cf7b6..993cbbdd2b6c 100644
>> --- a/mm/slub.c
>> +++ b/mm/slub.c
>> @@ -3213,6 +3213,43 @@ static inline bool free_debug_processing(struct kmem_cache *s,
>>  #endif /* CONFIG_SLUB_DEBUG */
>>  
>>  #if defined(CONFIG_SLUB_DEBUG) || defined(SLAB_SUPPORTS_SYSFS)
>> +#define MAX_PARTIAL_TO_SCAN 10000
>> +
>> +static unsigned long count_partial_free_approx(struct kmem_cache_node *n)
>> +{
>> +	unsigned long flags;
>> +	unsigned long x = 0;
>> +	struct slab *slab;
>> +
>> +	spin_lock_irqsave(&n->list_lock, flags);
>> +	if (n->nr_partial <= MAX_PARTIAL_TO_SCAN) {
>> +		list_for_each_entry(slab, &n->partial, slab_list)
>> +			x += slab->objects - slab->inuse;
>> +	} else {
>> +		/*
>> +		 * For a long list, approximate the total count of objects in
>> +		 * it to meet the limit on the number of slabs to scan.
>> +		 * Scan from both the list's head and tail for better accuracy.
>> +		 */
>> +		unsigned long scanned = 0;
>> +
>> +		list_for_each_entry(slab, &n->partial, slab_list) {
>> +			x += slab->objects - slab->inuse;
>> +			if (++scanned == MAX_PARTIAL_TO_SCAN / 2)
>> +				break;
>> +		}
>> +		list_for_each_entry_reverse(slab, &n->partial, slab_list) {
>> +			x += slab->objects - slab->inuse;
>> +			if (++scanned == MAX_PARTIAL_TO_SCAN)
>> +				break;
>> +		}
>> +		x = mult_frac(x, n->nr_partial, scanned);
>> +		x = min(x, node_nr_objs(n));
>> +	}
>> +	spin_unlock_irqrestore(&n->list_lock, flags);
>> +	return x;
>> +}
> 
> Creative :)
> 
> The default value of MAX_PARTIAL_TO_SCAN seems to work well in practice 
> while being large enough to bias for actual values?
> 
> I can't think of a better way to avoid the disruption that very long 
> partial lists cause.  If the actual value is needed, it will need to be 
> read from the sysfs file for that slab cache.
> 
> It does beg the question of whether we want to extend slabinfo to indicate 
> that some fields are approximations, however.  Adding a suffix such as 
> " : approx" to a slab cache line may be helpful if the disparity in the 
> estimates would actually make a difference in practice.

I'm afraid that changing the layout of /proc/slabinfo has a much higher
chance of breaking some consumer, than the imprecision due to approximation
has. So I would rather not risk it.

> I have a hard time believing that this approximation will not be "close 
> enough" for all practical purposes, given that the value could very well 
> substantially change the instant after the iteration is done anyway.
> 
> So for that reason, this sounds good to me!
> 
> Acked-by: David Rientjes <rientjes@google.com>
> 
>> +
>>  static unsigned long count_partial(struct kmem_cache_node *n,
>>  					int (*get_count)(struct slab *))
>>  {
>> @@ -7089,7 +7126,7 @@ void get_slabinfo(struct kmem_cache *s, struct slabinfo *sinfo)
>>  	for_each_kmem_cache_node(s, node, n) {
>>  		nr_slabs += node_nr_slabs(n);
>>  		nr_objs += node_nr_objs(n);
>> -		nr_free += count_partial(n, count_free);
>> +		nr_free += count_partial_free_approx(n);
>>  	}
>>  
>>  	sinfo->active_objs = nr_objs - nr_free;



^ permalink raw reply	[flat|nested] 7+ messages in thread

* [PATCH v3 2/2] slub: use count_partial_free_approx() in slab_out_of_memory()
  2024-04-19 17:56 [PATCH v3 0/2] slub: introduce count_partial_free_approx() Jianfeng Wang
  2024-04-19 17:56 ` [PATCH v3 1/2] " Jianfeng Wang
@ 2024-04-19 17:56 ` Jianfeng Wang
  2024-04-20  0:18   ` David Rientjes
  2024-04-22  7:56 ` [PATCH v3 0/2] slub: introduce count_partial_free_approx() Vlastimil Babka
  2 siblings, 1 reply; 7+ messages in thread
From: Jianfeng Wang @ 2024-04-19 17:56 UTC (permalink / raw)
  To: linux-mm, linux-kernel; +Cc: vbabka, cl, akpm, penberg, rientjes

slab_out_of_memory() uses count_partial() to get the exact count
of free objects for each node. As it may get called in the slab
allocation path, count_partial_free_approx() can be used to avoid
the risk and overhead of traversing a long partial slab list.

At the same time, show_slab_objects() still uses count_partial().
Thus, slub users can still have the option to access the exact
count of objects via sysfs if the overhead is acceptable to them.

Signed-off-by: Jianfeng Wang <jianfeng.w.wang@oracle.com>
---
 mm/slub.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/mm/slub.c b/mm/slub.c
index 993cbbdd2b6c..fa55699aa21a 100644
--- a/mm/slub.c
+++ b/mm/slub.c
@@ -3292,7 +3292,7 @@ slab_out_of_memory(struct kmem_cache *s, gfp_t gfpflags, int nid)
 		unsigned long nr_objs;
 		unsigned long nr_free;
 
-		nr_free  = count_partial(n, count_free);
+		nr_free  = count_partial_free_approx(n);
 		nr_slabs = node_nr_slabs(n);
 		nr_objs  = node_nr_objs(n);
 
-- 
2.42.1



^ permalink raw reply	[flat|nested] 7+ messages in thread

* Re: [PATCH v3 2/2] slub: use count_partial_free_approx() in slab_out_of_memory()
  2024-04-19 17:56 ` [PATCH v3 2/2] slub: use count_partial_free_approx() in slab_out_of_memory() Jianfeng Wang
@ 2024-04-20  0:18   ` David Rientjes
  0 siblings, 0 replies; 7+ messages in thread
From: David Rientjes @ 2024-04-20  0:18 UTC (permalink / raw)
  To: Jianfeng Wang; +Cc: linux-mm, linux-kernel, vbabka, cl, akpm, penberg

On Fri, 19 Apr 2024, Jianfeng Wang wrote:

> slab_out_of_memory() uses count_partial() to get the exact count
> of free objects for each node. As it may get called in the slab
> allocation path, count_partial_free_approx() can be used to avoid
> the risk and overhead of traversing a long partial slab list.
> 
> At the same time, show_slab_objects() still uses count_partial().
> Thus, slub users can still have the option to access the exact
> count of objects via sysfs if the overhead is acceptable to them.
> 
> Signed-off-by: Jianfeng Wang <jianfeng.w.wang@oracle.com>

Acked-by: David Rientjes <rientjes@google.com>


^ permalink raw reply	[flat|nested] 7+ messages in thread

* Re: [PATCH v3 0/2] slub: introduce count_partial_free_approx()
  2024-04-19 17:56 [PATCH v3 0/2] slub: introduce count_partial_free_approx() Jianfeng Wang
  2024-04-19 17:56 ` [PATCH v3 1/2] " Jianfeng Wang
  2024-04-19 17:56 ` [PATCH v3 2/2] slub: use count_partial_free_approx() in slab_out_of_memory() Jianfeng Wang
@ 2024-04-22  7:56 ` Vlastimil Babka
  2 siblings, 0 replies; 7+ messages in thread
From: Vlastimil Babka @ 2024-04-22  7:56 UTC (permalink / raw)
  To: Jianfeng Wang, linux-mm, linux-kernel; +Cc: cl, akpm, penberg, rientjes

On 4/19/24 7:56 PM, Jianfeng Wang wrote:
> This patch fixes a known issue in get_slabinfo() which relies on
> count_partial() to get the exact count of free objects in a
> kmem_cache_node's partial list. For some slubs, their partial lists
> can be extremely long. Currently, count_partial() traverses a partial
> list to get the exact count of objects. This process may take a long
> time, during which slab allocations are blocked and IRQs are disabled.
> In production, even NMI watchdog can be triggered due to this matter.
> 
> The proposed fix is to limit the number of slabs to scan and output an
> approximated count for a long partial list. The v1 patch counts N slabs
> from the list's head and then uses it to estimate the total object
> count in the list. As suggested by Vlastimil, the v2 patch uses an
> alternative, i.e., counting N/2 from the list's head and tail, produces
> a more accurate approximation after the partial list is sorted by
> kmem_cache_shrink(). In this version, the implementation is moved to a
> new function count_partial_free_approx(). count_partial() is still used
> in sysfs for users who still want the exact object count.

Added to slab/for-next, thanks!

> 
> ---
> Changes since v2 [2]
>  - Introduce count_partial_free_approx() and keep count_partial()
>  - Use count_partial_free_approx() in get_slabinfo() and slab_out_of_memory()
> 
> Changes since v1 [1]
>  - Update the approximation method by counting from the list's head and tail
>  - Cap the approximation by the total object count
>  - Update the commit message to add benchmark results and explain the choice
> 
> [1] https://lore.kernel.org/linux-mm/20240411164023.99368-1-jianfeng.w.wang@oracle.com/
> [2] https://lore.kernel.org/linux-mm/20240417185938.5237-2-jianfeng.w.wang@oracle.com/
> 
> Thanks,
> --Jianfeng
> 
> Jianfeng Wang (2):
>   slub: introduce count_partial_free_approx()
>   slub: use count_partial_free_approx() in slab_out_of_memory()
> 
>  mm/slub.c | 41 +++++++++++++++++++++++++++++++++++++++--
>  1 file changed, 39 insertions(+), 2 deletions(-)
> 



^ permalink raw reply	[flat|nested] 7+ messages in thread

end of thread, other threads:[~2024-04-22  7:56 UTC | newest]

Thread overview: 7+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2024-04-19 17:56 [PATCH v3 0/2] slub: introduce count_partial_free_approx() Jianfeng Wang
2024-04-19 17:56 ` [PATCH v3 1/2] " Jianfeng Wang
2024-04-20  0:18   ` David Rientjes
2024-04-22  7:49     ` Vlastimil Babka
2024-04-19 17:56 ` [PATCH v3 2/2] slub: use count_partial_free_approx() in slab_out_of_memory() Jianfeng Wang
2024-04-20  0:18   ` David Rientjes
2024-04-22  7:56 ` [PATCH v3 0/2] slub: introduce count_partial_free_approx() Vlastimil Babka

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox