From: Vlastimil Babka <vbabka@suse.cz>
To: Jianfeng Wang <jianfeng.w.wang@oracle.com>,
linux-mm@kvack.org, linux-kernel@vger.kernel.org
Cc: cl@linux.com, penberg@kernel.org, rientjes@google.com,
iamjoonsoo.kim@lge.com, akpm@linux-foundation.org,
junxiao.bi@oracle.com
Subject: Re: [PATCH] slub: limit number of slabs to scan in count_partial()
Date: Fri, 12 Apr 2024 09:41:24 +0200
Message-ID: <d9794e5f-a57c-4d3b-bca3-6b1e34f6f163@suse.cz>
In-Reply-To: <20240411164023.99368-1-jianfeng.w.wang@oracle.com>

On 4/11/24 6:40 PM, Jianfeng Wang wrote:
> When reading "/proc/slabinfo", the kernel needs to report the number
> of free objects for each kmem_cache. The current implementation relies
> on count_partial(), which counts the free objects by scanning each
> kmem_cache_node's list of partial slabs and summing the free objects
> of every partial slab in the list. This scan must hold the per
> kmem_cache_node spinlock with IRQs disabled and may take a long time.
> Consequently, it can block slab allocation requests on other CPU cores
> and cause timeouts for network devices etc. when the partial slab list
> is long. In production, even the NMI watchdog can be triggered because
> of this: e.g., for "buffer_head", the number of partial slabs was
> observed to be ~1M in one kmem_cache_node. This problem was also

Wonder what led to such a situation, whether it could have been the
NUMA-related problem this patch is fixing:
https://lore.kernel.org/all/20240330082335.29710-1-chenjun102@huawei.com/

But I guess we don't do that with buffer heads, so it's possible that a
surge of allocations led to many slabs being allocated, and then a slower
trickle of freeing didn't happen in a matching order, leaving many slabs
partially free...

> confirmed by several others [1-3] in the past.
>
> Iterating a partial list to get the exact count of objects can cause
> soft lockups for a long list, with or without the lock (e.g., if
> preemption is disabled), and is not very useful either: the object
> count can change right after the lock is released. The alternative of
> maintaining free-object counters requires atomic operations on the
> fast path [3].
>
> So, the fix is to limit the number of slabs to scan in
> count_partial() and output an approximate result if the list is too
> long. The limit defaults to 10000, which should be enough for most
> sane cases.
>
> [1] https://lore.kernel.org/linux-mm/alpine.DEB.2.21.2003031602460.1537@www.lameter.com/T/
> [2] https://lore.kernel.org/lkml/alpine.DEB.2.22.394.2008071258020.55871@www.lameter.com/T/
> [3] https://lore.kernel.org/lkml/1e01092b-140d-2bab-aeba-321a74a194ee@linux.com/T/
>
> Signed-off-by: Jianfeng Wang <jianfeng.w.wang@oracle.com>
> ---
>  mm/slub.c | 11 ++++++++++-
>  1 file changed, 10 insertions(+), 1 deletion(-)
> 
> diff --git a/mm/slub.c b/mm/slub.c
> index 1bb2a93cf7b6..5ed998ec7d6d 100644
> --- a/mm/slub.c
> +++ b/mm/slub.c
> @@ -3213,16 +3213,25 @@ static inline bool free_debug_processing(struct kmem_cache *s,
>  #endif /* CONFIG_SLUB_DEBUG */
>  
>  #if defined(CONFIG_SLUB_DEBUG) || defined(SLAB_SUPPORTS_SYSFS)
> +#define MAX_PARTIAL_TO_SCAN 10000
> +
>  static unsigned long count_partial(struct kmem_cache_node *n,
>                                         int (*get_count)(struct slab *))
>  {
>         unsigned long flags;
>         unsigned long x = 0;
> +       unsigned long scanned = 0;
>         struct slab *slab;
>  
>         spin_lock_irqsave(&n->list_lock, flags);
> -       list_for_each_entry(slab, &n->partial, slab_list)
> +       list_for_each_entry(slab, &n->partial, slab_list) {
>                 x += get_count(slab);
> +               if (++scanned > MAX_PARTIAL_TO_SCAN) {
> +                       /* Approximate total count of objects */
> +                       x = mult_frac(x, n->nr_partial, scanned);
> +                       break;
> +               }
> +       }
>         spin_unlock_irqrestore(&n->list_lock, flags);
>         return x;
>  }
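
Just to illustrate what the mult_frac() extrapolation above does (this is
not part of the patch, and the numbers are made up): the free objects
counted over the scanned prefix of the list are scaled by
nr_partial / scanned, i.e. the per-slab average seen so far is assumed to
hold for the rest of the partial list. A standalone userspace sketch:

/*
 * Illustration only - mult_frac() is re-implemented here along the lines
 * of the kernel macro (quotient/remainder split to reduce the risk of
 * overflow in x * n); all values below are hypothetical.
 */
#include <stdio.h>

static unsigned long mult_frac(unsigned long x, unsigned long n,
                               unsigned long d)
{
        unsigned long q = x / d;
        unsigned long r = x % d;

        return q * n + r * n / d;
}

int main(void)
{
        unsigned long scanned = 10000;      /* slabs actually scanned */
        unsigned long x = 153000;           /* free objects seen in them */
        unsigned long nr_partial = 1000000; /* ~1M slabs on the node's list */

        /*
         * ~15.3 free objects per scanned slab, extrapolated to the whole
         * 1M-slab list -> prints 15300000.
         */
        printf("estimated free objects: %lu\n",
               mult_frac(x, nr_partial, scanned));
        return 0;
}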