From: Vlastimil Babka <vbabka@suse.cz>
To: Suren Baghdasaryan <surenb@google.com>
Cc: "Liam R. Howlett" <Liam.Howlett@oracle.com>,
Christoph Lameter <cl@linux.com>,
David Rientjes <rientjes@google.com>,
Roman Gushchin <roman.gushchin@linux.dev>,
Hyeonggon Yoo <42.hyeyoo@gmail.com>,
Uladzislau Rezki <urezki@gmail.com>,
linux-mm@kvack.org, linux-kernel@vger.kernel.org,
rcu@vger.kernel.org, maple-tree@lists.infradead.org
Subject: Re: [PATCH RFC v2 01/10] slab: add opt-in caching layer of percpu sheaves
Date: Wed, 12 Mar 2025 15:57:59 +0100
Message-ID: <befd17b0-160e-4933-96d9-8d5c4a774162@suse.cz>
In-Reply-To: <CAJuCfpG4BYNWM24_Jha-SapfeaGdO0GKuteHwNE1hDdWXRS+1Q@mail.gmail.com>

On 2/22/25 23:46, Suren Baghdasaryan wrote:
> On Fri, Feb 14, 2025 at 8:27 AM Vlastimil Babka <vbabka@suse.cz> wrote:
>>
>> Specifying a non-zero value for a new struct kmem_cache_args field
>> sheaf_capacity will set up a caching layer of percpu arrays called
>> sheaves of given capacity for the created cache.
>>
>> Allocations from the cache will allocate via the percpu sheaves (main or
>> spare) as long as they have no NUMA node preference. Frees will also
>> refill one of the sheaves.
>>
>> When both percpu sheaves are found empty during an allocation, an empty
>> sheaf may be replaced with a full one from the per-node barn. If none
>> are available and the allocation is allowed to block, an empty sheaf is
>> refilled from slab(s) by an internal bulk alloc operation. When both
>> percpu sheaves are full during freeing, the barn can replace a full one
>> with an empty one, unless the barn is over its limit of full sheaves. In
>> that case a sheaf is flushed to slab(s) by an internal bulk free
>> operation. Flushing
>> sheaves and barns is also wired to the existing cpu flushing and cache
>> shrinking operations.
>>
>> The sheaves do not distinguish NUMA locality of the cached objects. If
>> an allocation is requested with kmem_cache_alloc_node() with a specific
>> node (not NUMA_NO_NODE), sheaves are bypassed.
>>
>> The bulk operations exposed to slab users also try to utilize the
>> sheaves as long as the necessary (full or empty) sheaves are available
>> on the cpu or in the barn. Once depleted, they will fall back to bulk
>> alloc/free to slabs directly to avoid double copying.
>>
>> Sysfs stat counters alloc_cpu_sheaf and free_cpu_sheaf count objects
>> allocated or freed using the sheaves. Counters sheaf_refill,
>> sheaf_flush_main and sheaf_flush_other count objects filled or flushed
>> from or to slab pages, and can be used to assess how effective the
>> caching is. The refill and flush operations will also count towards the
>> usual alloc_fastpath/slowpath, free_fastpath/slowpath and other
>> counters.
>>
>> Access to the percpu sheaves is protected by local_lock_irqsave()
>> operations; each per-NUMA-node barn has a spin_lock.
>>
>> A current limitation is that when slub_debug is enabled for a cache with
>> percpu sheaves, the objects in the array are considered as allocated from
>> the slub_debug perspective, and the alloc/free debugging hooks occur
>> when moving the objects between the array and slab pages. This means
>> that e.g. a use-after-free that occurs for an object cached in the
>> array is undetected. Collected alloc/free stacktraces might also be less
>> useful. This limitation could be changed in the future.
>>
>> On the other hand, KASAN, kmemcg and other hooks are executed on actual
>> allocations and frees by kmem_cache users even if those use the array,
>> so their debugging or accounting accuracy should be unaffected.
>>
>> Signed-off-by: Vlastimil Babka <vbabka@suse.cz>
>
> Only one possible issue in __pcs_flush_all_cpu(); all other comments
> are nits and suggestions.
Thanks.
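For anyone skimming the changelog: the main/spare/refill ordering it describes can be condensed into a userspace sketch. This is illustrative only; the names loosely follow the patch, the barn exchange step is elided, and none of this is the kernel code itself:

```c
#include <assert.h>
#include <stddef.h>

#define CAP 4

/* Userspace model of a sheaf: a bounded stack of objects. */
struct sheaf {
	size_t size;
	int objects[CAP];
};

/* Stand-in for the internal bulk alloc that refills an empty sheaf. */
static void bulk_refill(struct sheaf *s)
{
	while (s->size < CAP)
		s->objects[s->size++] = 0;
}

/*
 * Alloc path as described in the changelog: take from main; if main is
 * empty, swap in the spare when it has objects, otherwise bulk-refill.
 */
static int sheaf_alloc(struct sheaf **main, struct sheaf **spare)
{
	if ((*main)->size == 0) {
		if ((*spare)->size > 0) {
			struct sheaf *tmp = *main;
			*main = *spare;
			*spare = tmp;
		} else {
			bulk_refill(*main);
		}
	}
	return (*main)->objects[--(*main)->size];
}
```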
>> + * Limitations: when slub_debug is enabled for the cache, all relevant
>> + * actions (i.e. poisoning, obtaining stacktraces) and checks happen
>> + * when objects move between sheaves and slab pages, which may result in
>> + * e.g. not detecting a use-after-free while the object is in the array
>> + * cache, and the stacktraces may be less useful.
>
> I would also love to see a short comparison of sheaves (when objects
> are freed using kfree_rcu()) vs SLAB_TYPESAFE_BY_RCU. I think both
> mechanisms rcu-free objects in bulk but sheaves would not reuse an
> object before RCU grace period is passed. Is that right?
I don't think that's right. SLAB_TYPESAFE_BY_RCU doesn't rcu-free objects in
bulk; the objects are freed (and can be reused) immediately. It only
rcu-delays freeing the slab folio once all its objects are freed.
>> +struct slub_percpu_sheaves {
>> + local_lock_t lock;
>> + struct slab_sheaf *main; /* never NULL when unlocked */
>> + struct slab_sheaf *spare; /* empty or full, may be NULL */
>> + struct slab_sheaf *rcu_free;
>
> Would be nice to have a short comment for rcu_free as well. I could
> guess what main and spare are but for rcu_free had to look further.
Added.
>> +static int __kmem_cache_alloc_bulk(struct kmem_cache *s, gfp_t flags,
>> + size_t size, void **p);
>> +
>> +
>> +static int refill_sheaf(struct kmem_cache *s, struct slab_sheaf *sheaf,
>> + gfp_t gfp)
>> +{
>> + int to_fill = s->sheaf_capacity - sheaf->size;
>> + int filled;
>> +
>> + if (!to_fill)
>> + return 0;
>> +
>> + filled = __kmem_cache_alloc_bulk(s, gfp, to_fill,
>> + &sheaf->objects[sheaf->size]);
>> +
>> + if (!filled)
>> + return -ENOMEM;
>> +
>> + sheaf->size = s->sheaf_capacity;
>
> nit: __kmem_cache_alloc_bulk() either allocates requested number of
> objects or returns 0, so the current code is fine but if at some point
> the implementation changes so that it can return smaller number of
> objects than requested (filled < to_fill) then the above assignment
> will become invalid. I think a safer thing here would be to just:
>
> sheaf->size += filled;
>
> which also makes logical sense. Alternatively you could add
> VM_BUG_ON(filled != to_fill) but the increment I think would be
> better.
It's useful to indicate that the refill was not successful, for patch 6. So
I'm changing this to:

	sheaf->size += filled;

	stat_add(s, SHEAF_REFILL, filled);

	if (filled < to_fill)
		return -ENOMEM;

	return 0;
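The changed logic can be exercised in isolation; here is a hedged userspace sketch with a fake bulk allocator that may return fewer objects than requested (the helper names and the ENOMEM constant are stand-ins, not the kernel's):

```c
#include <assert.h>
#include <stddef.h>

#define ENOMEM 12
#define SHEAF_CAPACITY 8

struct sheaf {
	size_t size;
	void *objects[SHEAF_CAPACITY];
};

/* Fake bulk allocator: hands out at most 'avail' objects, modeling a
 * partially successful __kmem_cache_alloc_bulk(). */
static size_t avail = 5;

static int fake_bulk_alloc(size_t count, void **p)
{
	size_t i, n = count < avail ? count : avail;

	for (i = 0; i < n; i++)
		p[i] = &avail;	/* dummy object pointer */
	avail -= n;
	return n;
}

/* Mirrors the fixed refill: size grows by what was actually filled,
 * and a partial fill is reported as failure. */
static int refill_sheaf(struct sheaf *sheaf)
{
	size_t to_fill = SHEAF_CAPACITY - sheaf->size;
	int filled;

	if (!to_fill)
		return 0;

	filled = fake_bulk_alloc(to_fill, &sheaf->objects[sheaf->size]);
	sheaf->size += filled;

	if ((size_t)filled < to_fill)
		return -ENOMEM;
	return 0;
}
```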
>> +
>> + stat_add(s, SHEAF_REFILL, filled);
>> +
>> + return 0;
>> +}
>> +
>> +
>> +static struct slab_sheaf *alloc_full_sheaf(struct kmem_cache *s, gfp_t gfp)
>> +{
>> + struct slab_sheaf *sheaf = alloc_empty_sheaf(s, gfp);
>> +
>> + if (!sheaf)
>> + return NULL;
>> +
>> + if (refill_sheaf(s, sheaf, gfp)) {
>> + free_empty_sheaf(s, sheaf);
>> + return NULL;
>> + }
>> +
>> + return sheaf;
>> +}
>> +
>> +/*
>> + * Maximum number of objects freed during a single flush of main pcs sheaf.
>> + * Translates directly to an on-stack array size.
>> + */
>> +#define PCS_BATCH_MAX 32U
>> +
>> +static void __kmem_cache_free_bulk(struct kmem_cache *s, size_t size,
>> +				    void **p);
>> +
>
> A comment clarifying why you are freeing in PCS_BATCH_MAX batches here
> would be helpful. My understanding is that you do that to free objects
> outside of the cpu_sheaves->lock, so you isolate a batch, release the
> lock and then free the batch.
OK.
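Indeed, the pattern is: detach a bounded batch under the lock, free it only after unlocking. A userspace sketch of that invariant (the lock is modeled as a flag so the no-lock-while-freeing rule can be asserted; only object counts are tracked, unlike the real code which copies the pointers out):

```c
#include <assert.h>
#include <stddef.h>

#define PCS_BATCH_MAX 32U

static int lock_held;
static void lock(void)   { lock_held = 1; }
static void unlock(void) { lock_held = 0; }

static size_t freed_total;

/* Stand-in for __kmem_cache_free_bulk(): must run without the lock. */
static void bulk_free(size_t n)
{
	assert(!lock_held);
	freed_total += n;
}

/* Flush a sheaf of 'size' objects in PCS_BATCH_MAX chunks, mirroring
 * sheaf_flush_main(): objects are detached under the lock and freed
 * only after it is dropped. */
static void flush(size_t *size)
{
	size_t batch;

	do {
		lock();
		batch = *size < PCS_BATCH_MAX ? *size : PCS_BATCH_MAX;
		*size -= batch;
		unlock();

		bulk_free(batch);
	} while (*size);
}
```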
>> +static void sheaf_flush_main(struct kmem_cache *s)
>> +{
>> + struct slub_percpu_sheaves *pcs;
>> + unsigned int batch, remaining;
>> + void *objects[PCS_BATCH_MAX];
>> + struct slab_sheaf *sheaf;
>> + unsigned long flags;
>> +
>> +next_batch:
>> + local_lock_irqsave(&s->cpu_sheaves->lock, flags);
>> + pcs = this_cpu_ptr(s->cpu_sheaves);
>> + sheaf = pcs->main;
>> +
>> + batch = min(PCS_BATCH_MAX, sheaf->size);
>> +
>> + sheaf->size -= batch;
>> + memcpy(objects, sheaf->objects + sheaf->size, batch * sizeof(void *));
>> +
>> + remaining = sheaf->size;
>> +
>> + local_unlock_irqrestore(&s->cpu_sheaves->lock, flags);
>> +
>> + __kmem_cache_free_bulk(s, batch, &objects[0]);
>> +
>> + stat_add(s, SHEAF_FLUSH_MAIN, batch);
>> +
>> + if (remaining)
>> + goto next_batch;
>> +}
>> +
>
> This function seems to be used against either isolated sheaves or in
> slub_cpu_dead() --> __pcs_flush_all_cpu() path where we hold
> slab_mutex and I think that guarantees that the sheaf is unused. Maybe
> a short comment clarifying this requirement or rename the function to
> reflect that? Something like flush_unused_sheaf()?
It's not slab_mutex, but the fact that slub_cpu_dead() is executed in a
hotplug phase when the given cpu is no longer executing and thus cannot be
manipulating its percpu sheaves, so we are the only ones that do.
So I will clarify and rename to sheaf_flush_unused().
>> +
>> +static void __pcs_flush_all_cpu(struct kmem_cache *s, unsigned int cpu)
>> +{
>> + struct slub_percpu_sheaves *pcs;
>> +
>> + pcs = per_cpu_ptr(s->cpu_sheaves, cpu);
>> +
>> + if (pcs->spare) {
>> + sheaf_flush(s, pcs->spare);
>> + free_empty_sheaf(s, pcs->spare);
>> + pcs->spare = NULL;
>> + }
>> +
>> + // TODO: handle rcu_free
>> + BUG_ON(pcs->rcu_free);
>> +
>> + sheaf_flush_main(s);
>
> Hmm. sheaf_flush_main() always flushes for this_cpu only, so IIUC this
> call will not necessarily flush the main sheaf for the cpu passed to
> __pcs_flush_all_cpu().
Thanks, yes, I need to call sheaf_flush_unused(pcs->main). It's ok to do
that given my reply above.
>> +/*
>> + * Free an object to the percpu sheaves.
>> + * The object is expected to have passed slab_free_hook() already.
>> + */
>> +static __fastpath_inline
>> +void free_to_pcs(struct kmem_cache *s, void *object)
>> +{
>> + struct slub_percpu_sheaves *pcs;
>> + unsigned long flags;
>> +
>> +restart:
>> + local_lock_irqsave(&s->cpu_sheaves->lock, flags);
>> + pcs = this_cpu_ptr(s->cpu_sheaves);
>> +
>> + if (unlikely(pcs->main->size == s->sheaf_capacity)) {
>> +
>> + struct slab_sheaf *empty;
>> +
>> + if (!pcs->spare) {
>> + empty = barn_get_empty_sheaf(pcs->barn);
>> + if (empty) {
>> + pcs->spare = pcs->main;
>> + pcs->main = empty;
>> + goto do_free;
>> + }
>> + goto alloc_empty;
>> + }
>> +
>> + if (pcs->spare->size < s->sheaf_capacity) {
>> + stat(s, SHEAF_SWAP);
>> + swap(pcs->main, pcs->spare);
>> + goto do_free;
>> + }
>> +
>> + empty = barn_replace_full_sheaf(pcs->barn, pcs->main);
>> +
>> + if (!IS_ERR(empty)) {
>> + pcs->main = empty;
>> + goto do_free;
>> + }
>> +
>> + if (PTR_ERR(empty) == -E2BIG) {
>> + /* Since we got here, spare exists and is full */
>> + struct slab_sheaf *to_flush = pcs->spare;
>> +
>> + pcs->spare = NULL;
>> + local_unlock_irqrestore(&s->cpu_sheaves->lock, flags);
>> +
>> + sheaf_flush(s, to_flush);
>> + empty = to_flush;
>> + goto got_empty;
>> + }
>> +
>> +alloc_empty:
>> + local_unlock_irqrestore(&s->cpu_sheaves->lock, flags);
>> +
>> + empty = alloc_empty_sheaf(s, GFP_NOWAIT);
>> +
>> + if (!empty) {
>> + sheaf_flush_main(s);
>> + goto restart;
>> + }
>> +
>> +got_empty:
>> + local_lock_irqsave(&s->cpu_sheaves->lock, flags);
>> + pcs = this_cpu_ptr(s->cpu_sheaves);
>> +
>> + /*
>> + * if we put any sheaf to barn here, it's because we raced or
>> + * have been migrated to a different cpu, which should be rare
>> + * enough so just ignore the barn's limits to simplify
>> + */
>> + if (unlikely(pcs->main->size < s->sheaf_capacity)) {
>> + if (!pcs->spare)
>> + pcs->spare = empty;
>> + else
>> + barn_put_empty_sheaf(pcs->barn, empty, true);
>> + goto do_free;
>> + }
>> +
>> + if (!pcs->spare) {
>> + pcs->spare = pcs->main;
>> + pcs->main = empty;
>> + goto do_free;
>> + }
>> +
>> + barn_put_full_sheaf(pcs->barn, pcs->main, true);
>> + pcs->main = empty;
>
> I find the program flow in this function quite complex and hard to
> follow. I think refactoring the above block starting from "pcs =
> this_cpu_ptr(s->cpu_sheaves)" would somewhat simplify it. That
> eliminates the need for the "got_empty" label and makes the
> locking/unlocking sequence of s->cpu_sheaves->lock a bit more clear.
I'm a bit lost, refactoring how exactly?
>> + }
>> +
>> +do_free:
>> + pcs->main->objects[pcs->main->size++] = object;
>> +
>> + local_unlock_irqrestore(&s->cpu_sheaves->lock, flags);
>> +
>> + stat(s, FREE_PCS);
>> +}
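FWIW, to help with following the flow: the slow-path decisions when the main sheaf is full reduce to a flat ordering. Here is a sketch of that ordering as I read the function (a userspace model returning which action is taken; it deliberately ignores the retry/race handling after the lock is dropped):

```c
#include <assert.h>

enum action {
	USE_BARN_EMPTY,		/* barn had an empty sheaf: main -> spare */
	SWAP_SPARE,		/* spare has room: swap main <-> spare */
	REPLACE_IN_BARN,	/* barn exchanges full main for an empty */
	FLUSH_SPARE,		/* barn over full-sheaf limit: flush spare */
	ALLOC_EMPTY,		/* no spare, no barn empty: allocate one */
};

/* Decision order when the main sheaf is full, mirroring free_to_pcs(). */
static enum action full_main_action(int have_spare, int spare_full,
				    int barn_has_empty, int barn_over_limit)
{
	if (!have_spare)
		return barn_has_empty ? USE_BARN_EMPTY : ALLOC_EMPTY;
	if (!spare_full)
		return SWAP_SPARE;
	return barn_over_limit ? FLUSH_SPARE : REPLACE_IN_BARN;
}
```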