Message-ID: <208e2b4a-cff5-432e-a330-74a8fc3a22cc@suse.cz>
Date: Tue, 22 Apr 2025 17:02:47 +0200
MIME-Version: 1.0
User-Agent: Mozilla Thunderbird
Subject: Re: [PATCH RFC v3 2/8] slab: add opt-in caching layer of percpu sheaves
Content-Language: en-US
To: Suren Baghdasaryan
Cc: "Liam R. Howlett", Christoph Lameter, David Rientjes, Roman Gushchin,
 Harry Yoo, Uladzislau Rezki, linux-mm@kvack.org,
 linux-kernel@vger.kernel.org, rcu@vger.kernel.org,
 maple-tree@lists.infradead.org
References: <20250317-slub-percpu-caches-v3-0-9d9884d8b643@suse.cz>
 <20250317-slub-percpu-caches-v3-2-9d9884d8b643@suse.cz>
From: Vlastimil Babka
In-Reply-To:
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 7bit
On 4/10/25 21:51, Suren Baghdasaryan wrote:
>> +static void __pcs_flush_all_cpu(struct kmem_cache *s, unsigned int cpu)
>> +{
>> +	struct slub_percpu_sheaves *pcs;
>> +
>> +	pcs = per_cpu_ptr(s->cpu_sheaves, cpu);
>> +
>> +	/* The cpu is not executing anymore so we don't need pcs->lock */
>> +	sheaf_flush_unused(s, pcs->main);
>
> You are flushing pcs->main but sheaf_flush_unused() will account that
> in SHEAF_FLUSH_OTHER instead of SHEAF_FLUSH_MAIN. Perhaps
> sheaf_flush_unused() needs a parameter to indicate which counter to be
> incremented.

Hm, it's technically true, but it's a cpu hotremove operation, so it's
rather rare and not performance sensitive, and one could argue it's not
a main sheaf (of an active cpu) anymore. So I went with a comment rather
than complicating the code.

>> +	if (pcs->spare) {
>> +		sheaf_flush_unused(s, pcs->spare);
>> +		free_empty_sheaf(s, pcs->spare);
>> +		pcs->spare = NULL;
>> +	}
>> +}
>> +
>> +static void pcs_destroy(struct kmem_cache *s)
>> +{
>> +	int cpu;
>> +
>> +	for_each_possible_cpu(cpu) {
>> +		struct slub_percpu_sheaves *pcs;
>> +
>> +		pcs = per_cpu_ptr(s->cpu_sheaves, cpu);
>> +
>> +		/* can happen when unwinding failed create */
>> +		if (!pcs->main)
>> +			continue;
>> +
>> +		WARN_ON(pcs->spare);
>> +
>> +		if (!WARN_ON(pcs->main->size)) {
>
> If pcs->main->size > 0, shouldn't we flush pcs->main?
> I understand it's not a normal situation but I think something like
> this would be more correct:
>
> if (WARN_ON(pcs->main->size))
>     sheaf_flush_unused(s, pcs->main);
> free_empty_sheaf(s, pcs->main);
> pcs->main = NULL;

Initially I agreed, but then I realized we only reach this code when we
have already found all slabs to contain only free objects. So if we then
find some objects in sheaves, something must have gone terribly wrong,
and trying to flush could only make things worse. I've added a comment
instead.

>> +			free_empty_sheaf(s, pcs->main);
>> +			pcs->main = NULL;
>> +		}
>> +	}
>> +
>> +	free_percpu(s->cpu_sheaves);
>> +	s->cpu_sheaves = NULL;
>> +}
>> +
>> +static struct slab_sheaf *barn_get_empty_sheaf(struct node_barn *barn)
>> +{
>> +	struct slab_sheaf *empty = NULL;
>> +	unsigned long flags;
>> +
>> +	spin_lock_irqsave(&barn->lock, flags);
>> +
>> +	if (barn->nr_empty) {
>> +		empty = list_first_entry(&barn->sheaves_empty,
>> +					 struct slab_sheaf, barn_list);
>> +		list_del(&empty->barn_list);
>> +		barn->nr_empty--;
>> +	}
>> +
>> +	spin_unlock_irqrestore(&barn->lock, flags);
>> +
>> +	return empty;
>> +}
>> +
>> +static int barn_put_empty_sheaf(struct node_barn *barn,
>> +				struct slab_sheaf *sheaf, bool ignore_limit)
>
> This ignore_limit in barn_put_empty_sheaf()/barn_put_full_sheaf() is
> sticking out and really used only inside rcu_free_sheaf() in the next
> patch. Every time I saw the return value of these functions ignored I
> had to remind myself that they pass ignore_limit=true, so the function
> can't really fail. Maybe we could get rid of this flag and do trimming
> of barn lists inside rcu_free_sheaf() separately in one go? IIUC

Good suggestion; I didn't realize the barn_put_*() functions are only
used in the fallback-due-to-a-race-or-cpu-migration paths, so I did as
you suggested.
> because we ignore the limits in all other places, at the time of
> rcu_free_sheaf() we might end up with a barn having many more sheaves
> than the limit allows for, so trimming in bulk might be even more
> productive.

There are many caches where no rcu freeing happens, so making
rcu_free_sheaf() a bulk trimming point could be insufficient. I'd just
leave it for now and hope that ignoring limits only due to races won't
cause barns to grow more than further allocation/freeing could
self-correct for.

>> +{
>> +	unsigned long flags;
>> +	int ret = 0;
>> +
>> +	spin_lock_irqsave(&barn->lock, flags);
>> +
>> +	if (!ignore_limit && barn->nr_empty >= MAX_EMPTY_SHEAVES) {
>> +		ret = -E2BIG;
>> +	} else {
>> +		list_add(&sheaf->barn_list, &barn->sheaves_empty);
>> +		barn->nr_empty++;
>> +	}
>> +
>> +	spin_unlock_irqrestore(&barn->lock, flags);
>> +	return ret;
>> +}
>> +
>> +static int barn_put_full_sheaf(struct node_barn *barn, struct slab_sheaf *sheaf,
>> +			       bool ignore_limit)
>
> Can this function be called for partially populated sheaves or only
> full ones? I think rcu_free_sheaf() in the next patch might end up
> calling it for a partially populated sheaf.

It can, but the code handles not-really-full sheaves fine, and it's
easier to allow them than to try to be strict. There should just be a
bit lower efficiency in debug scenarios, and that's fine. Adding a
comment.

>> +{
>> +	unsigned long flags;
>> +	int ret = 0;
>> +
>> +	spin_lock_irqsave(&barn->lock, flags);
>> +
>> +	if (!ignore_limit && barn->nr_full >= MAX_FULL_SHEAVES) {
>> +		ret = -E2BIG;
>> +	} else {
>> +		list_add(&sheaf->barn_list, &barn->sheaves_full);
>> +		barn->nr_full++;
>> +	}
>> +
>> +	spin_unlock_irqrestore(&barn->lock, flags);
>> +	return ret;
>> +}
>> +
>> +/*
>> + * If a full sheaf is available, return it and put the supplied empty one to
>> + * barn. We ignore the limit on empty sheaves as the number of sheaves doesn't
>> + * change.
>
> I'm unclear why we ignore the limit here but not in
> barn_replace_full_sheaf(). Maybe because full sheaves consume more
> memory?

Yes.

> But then why do we mostly pass ignore_limit=true when invoking
> barn_put_full_sheaf()?

Not anymore with your suggestion.

> Explanation of this limit logic needs to be
> clarified.
>
>> +static __fastpath_inline
>> +unsigned int alloc_from_pcs_bulk(struct kmem_cache *s, size_t size, void **p)
>> +{
>> +	struct slub_percpu_sheaves *pcs;
>> +	struct slab_sheaf *main;
>> +	unsigned int allocated = 0;
>> +	unsigned int batch;
>> +
>> +next_batch:
>> +	if (!localtry_trylock(&s->cpu_sheaves->lock))
>> +		return allocated;
>> +
>> +	pcs = this_cpu_ptr(s->cpu_sheaves);
>> +
>> +	if (unlikely(pcs->main->size == 0)) {
>
> The above condition is unlikely only for the first batch. I think it's
> actually guaranteed on all consecutive batches once you do "goto
> next_batch", right?

Hm yes, but perhaps bulk allocs larger than the batch size are rare?

>> +static __fastpath_inline
>> +bool free_to_pcs(struct kmem_cache *s, void *object)
>> +{
>> +	struct slub_percpu_sheaves *pcs;
>> +
>> +restart:
>> +	if (!localtry_trylock(&s->cpu_sheaves->lock))
>> +		return false;
>> +
>> +	pcs = this_cpu_ptr(s->cpu_sheaves);
>> +
>> +	if (unlikely(pcs->main->size == s->sheaf_capacity)) {
>> +
>> +		struct slab_sheaf *empty;
>> +
>> +		if (!pcs->spare) {
>> +			empty = barn_get_empty_sheaf(pcs->barn);
>> +			if (empty) {
>> +				pcs->spare = pcs->main;
>> +				pcs->main = empty;
>> +				goto do_free;
>> +			}
>> +			goto alloc_empty;
>> +		}
>> +
>> +		if (pcs->spare->size < s->sheaf_capacity) {
>> +			stat(s, SHEAF_SWAP);
>> +			swap(pcs->main, pcs->spare);
>> +			goto do_free;
>> +		}
>> +
>> +		empty = barn_replace_full_sheaf(pcs->barn, pcs->main);
>
> This function reads easier now but if barn_replace_full_sheaf() could
> ignore the MAX_FULL_SHEAVES and barn list trimming could be done
> later, it would simplify this function even further.
It could increase barn lock contention to put full sheaves there over
the limit and then flush them afterwards, so I think it's better not to
do that, especially with "slab: determine barn status racily outside of
lock".

>
> nit: I think the section below would be more readable if structured
> with fail handling blocks at the end. Something like this:
>
> next_batch:
>     if (!localtry_trylock(&s->cpu_sheaves->lock))
>         goto fallback;
>
>     pcs = this_cpu_ptr(s->cpu_sheaves);
>
>     if (likely(pcs->main->size < s->sheaf_capacity))
>         goto do_free;

Nice suggestion, did that. Thanks.