[LSF/MM/BPF TOPIC] SLUB allocator, mainly the sheaves caching layer

linux-mm.kvack.org archive mirror
 help / color / mirror / Atom feed

* [LSF/MM/BPF TOPIC] SLUB allocator, mainly the sheaves caching layer
@ 2025-02-24 16:13 Vlastimil Babka
  2025-02-24 18:02 ` Shakeel Butt
                   ` (2 more replies)
  0 siblings, 3 replies; 10+ messages in thread
From: Vlastimil Babka @ 2025-02-24 16:13 UTC (permalink / raw)
  To: lsf-pc, linux-mm, bpf
  Cc: Christoph Lameter, David Rientjes, Hyeonggon Yoo,
	Uladzislau Rezki (Sony),
	Alexei Starovoitov

Hi,

I'd like to propose a session about the SLUB allocator.

Mainly I would like to discuss the addition of the sheaves caching layer,
the latest RFC posted at [1].

The goals of that work is to:

- Reduce fastpath overhead. The current freeing fastpath only can be used if
the same target slab is still the cpu slab, which can be only expected for a
very short term allocations. Further improvements should come from the new
local_trylock_t primitive.

- Improve efficiency of users such as like maple tree, thanks to more
efficient preallocations, and kfree_rcu batching/reusal

- Hopefully also facilitate further changes needed for bpf allocations, also
via local_trylock_t, that could possibly extend to the other parts of the
implementation as needed.

The controversial discussion points I expect about this approach are:

- Either sheaves will not support NUMA restrictions (as in current RFC), or
bring back the alien cache flushing issues of SLAB (or there's a better idea?)

- Will it be possible to eventually have sheaves enabled for every cache and
replace the current slub's fastpaths with it? Arguably these are also not
very efficient when NUMA-restricted allocations are requested for varying
NUMA nodes (cpu slab is flushed if it's from a wrong node, to load a slab
from the requested node).

Besides sheaves, I'd like to summarize recent kfree_rcu() changes and we
could discuss further improvements to that.

Also we can discuss what's needed to support bpf allocations. I've talked
about it last year, but then focused on other things, so Alexei has been
driving that recently (so far in the page allocator).

[1]
https://lore.kernel.org/all/20250214-slub-percpu-caches-v2-0-88592ee0966a@suse.cz/

Thanks,
Vlastimil


^ permalink raw reply	[flat|nested] 10+ messages in thread

* Re: [LSF/MM/BPF TOPIC] SLUB allocator, mainly the sheaves caching layer
  2025-02-24 16:13 [LSF/MM/BPF TOPIC] SLUB allocator, mainly the sheaves caching layer Vlastimil Babka
@ 2025-02-24 18:02 ` Shakeel Butt
  2025-02-24 18:15   ` Vlastimil Babka
  2025-02-24 18:46   ` Mateusz Guzik
  2025-02-26  0:17 ` Christoph Lameter (Ampere)
  2025-03-25 17:43 ` Vlastimil Babka
  2 siblings, 2 replies; 10+ messages in thread
From: Shakeel Butt @ 2025-02-24 18:02 UTC (permalink / raw)
  To: Vlastimil Babka
  Cc: lsf-pc, linux-mm, bpf, Christoph Lameter, David Rientjes,
	Hyeonggon Yoo, Uladzislau Rezki (Sony),
	Alexei Starovoitov

On Mon, Feb 24, 2025 at 05:13:25PM +0100, Vlastimil Babka wrote:
> Hi,
> 
> I'd like to propose a session about the SLUB allocator.
> 
> Mainly I would like to discuss the addition of the sheaves caching layer,
> the latest RFC posted at [1].
> 
> The goals of that work is to:
> 
> - Reduce fastpath overhead. The current freeing fastpath only can be used if
> the same target slab is still the cpu slab, which can be only expected for a
> very short term allocations. Further improvements should come from the new
> local_trylock_t primitive.
> 
> - Improve efficiency of users such as like maple tree, thanks to more
> efficient preallocations, and kfree_rcu batching/reusal
> 
> - Hopefully also facilitate further changes needed for bpf allocations, also
> via local_trylock_t, that could possibly extend to the other parts of the
> implementation as needed.
> 
> The controversial discussion points I expect about this approach are:
> 
> - Either sheaves will not support NUMA restrictions (as in current RFC), or
> bring back the alien cache flushing issues of SLAB (or there's a better idea?)
> 
> - Will it be possible to eventually have sheaves enabled for every cache and
> replace the current slub's fastpaths with it? Arguably these are also not
> very efficient when NUMA-restricted allocations are requested for varying
> NUMA nodes (cpu slab is flushed if it's from a wrong node, to load a slab
> from the requested node).
> 
> Besides sheaves, I'd like to summarize recent kfree_rcu() changes and we
> could discuss further improvements to that.
> 
> Also we can discuss what's needed to support bpf allocations. I've talked
> about it last year, but then focused on other things, so Alexei has been
> driving that recently (so far in the page allocator).

What about pre-memcg-charged sheaves? We had to disable memcg charging
of some kernel allocations and I think sheaves can help in reenabling
it.



^ permalink raw reply	[flat|nested] 10+ messages in thread

* Re: [LSF/MM/BPF TOPIC] SLUB allocator, mainly the sheaves caching layer
  2025-02-24 18:02 ` Shakeel Butt
@ 2025-02-24 18:15   ` Vlastimil Babka
  2025-02-24 20:52     ` Shakeel Butt
  2025-02-24 18:46   ` Mateusz Guzik
  1 sibling, 1 reply; 10+ messages in thread
From: Vlastimil Babka @ 2025-02-24 18:15 UTC (permalink / raw)
  To: Shakeel Butt
  Cc: lsf-pc, linux-mm, bpf, Christoph Lameter, David Rientjes,
	Hyeonggon Yoo, Uladzislau Rezki (Sony),
	Alexei Starovoitov

On 2/24/25 19:02, Shakeel Butt wrote:
> On Mon, Feb 24, 2025 at 05:13:25PM +0100, Vlastimil Babka wrote:
>> Hi,
>> 
>> I'd like to propose a session about the SLUB allocator.
>> 
>> Mainly I would like to discuss the addition of the sheaves caching layer,
>> the latest RFC posted at [1].
>> 
>> The goals of that work is to:
>> 
>> - Reduce fastpath overhead. The current freeing fastpath only can be used if
>> the same target slab is still the cpu slab, which can be only expected for a
>> very short term allocations. Further improvements should come from the new
>> local_trylock_t primitive.
>> 
>> - Improve efficiency of users such as like maple tree, thanks to more
>> efficient preallocations, and kfree_rcu batching/reusal
>> 
>> - Hopefully also facilitate further changes needed for bpf allocations, also
>> via local_trylock_t, that could possibly extend to the other parts of the
>> implementation as needed.
>> 
>> The controversial discussion points I expect about this approach are:
>> 
>> - Either sheaves will not support NUMA restrictions (as in current RFC), or
>> bring back the alien cache flushing issues of SLAB (or there's a better idea?)
>> 
>> - Will it be possible to eventually have sheaves enabled for every cache and
>> replace the current slub's fastpaths with it? Arguably these are also not
>> very efficient when NUMA-restricted allocations are requested for varying
>> NUMA nodes (cpu slab is flushed if it's from a wrong node, to load a slab
>> from the requested node).
>> 
>> Besides sheaves, I'd like to summarize recent kfree_rcu() changes and we
>> could discuss further improvements to that.
>> 
>> Also we can discuss what's needed to support bpf allocations. I've talked
>> about it last year, but then focused on other things, so Alexei has been
>> driving that recently (so far in the page allocator).
> 
> What about pre-memcg-charged sheaves? We had to disable memcg charging
> of some kernel allocations

You mean due to bad performance? Which ones for example? Was the overhead
due to accounting of how much is charged, or due to the associating memcgs
with objects?

> and I think sheaves can help in reenabling
> it.

You mean by mean having separate sheaves per memcg? Wouldn't that mean
risking that too many objects could be cached in them, we'd have to flush
eventually e.g. the least recently used ones, etc? Or do you mean some other
scheme?

Thanks, Vlastimil


^ permalink raw reply	[flat|nested] 10+ messages in thread

* Re: [LSF/MM/BPF TOPIC] SLUB allocator, mainly the sheaves caching layer
  2025-02-24 18:02 ` Shakeel Butt
  2025-02-24 18:15   ` Vlastimil Babka
@ 2025-02-24 18:46   ` Mateusz Guzik
  2025-02-24 21:12     ` Shakeel Butt
  1 sibling, 1 reply; 10+ messages in thread
From: Mateusz Guzik @ 2025-02-24 18:46 UTC (permalink / raw)
  To: Shakeel Butt
  Cc: Vlastimil Babka, lsf-pc, linux-mm, bpf, Christoph Lameter,
	David Rientjes, Hyeonggon Yoo, Uladzislau Rezki (Sony),
	Alexei Starovoitov

On Mon, Feb 24, 2025 at 10:02:09AM -0800, Shakeel Butt wrote:
> What about pre-memcg-charged sheaves? We had to disable memcg charging
> of some kernel allocations and I think sheaves can help in reenabling
> it.

It has been several months since last I looked at memcg, so details are
fuzzy and I don't have time to refresh everything.

However, if memory serves right the primary problem was the irq on/off
trip associated with them (sometimes happening twice, second time with
refill_obj_stock()).

I think the real fix(tm) would recognize only some allocations need
interrupt safety -- as in some slabs should not be allowed to be used
outside of the process context. This is somewhat what sheaves is doing,
but can be applied without fronting the current kmem caching mechanism.
This may be a tough sell and even then it plays whackamole with patching
up all consumers.

Suppose it is not an option.

Then there are 2 ways that I considered.

The easiest splits memcg accounting for irq and process level -- similar
to what localtry thing is doing. this would only cost preemption off/on
trip in the common case and a branch on the current state. But suppose
this is a no-go as well.

My primary idea was using hand-rolled sequence counters and local 8-byte
cmpxchg (*without* the lock prefix, also not to be confused with 16-byte
used by the current slub fast path). Should this work, it would be
significantly faster than irq trips. 

The irq thing is there only to facilitate several fields being updated
or memcg itself getting replaced in an atomic manner for process vs
interrupt context.

The observation is that all values which are getting updated are 4
bytes. Then perhaps an additional counter can be added next to each one
so that an 8-byte cmpxchg is going to fail should an irq swoop in and
change stuff from under us.

The percpu state would have a sequence counter associated with the
assigned memcg_stock_pcp. The memcg_stock_pcp object would have the same
value replicated inside for every var which can be updated in the fast
path.

Then the fast path would only succeed if the value read off from per-cpu
did not change vs what's in the stock thing.

Any change to memcg_stock_pcp (e.g., rolling up bytes after passing the
page size threshold) would disable interrupts and modify all these
counters.

There is some more work needed to make sure the stock obj can be safely
swapped out for a new one and not accidentally have a value which lines
up with the prevoius one, I don't remember what I had for that (and yes,
I recognize a 4 byte value will invariably roll over and *in principle*
a conflict will be possible).

This is a rough outline since Vlasta keeps prodding me about it.

That said, maybe someone will have a better idea. The above is up for
grabs if someone wants to do it, I can't commit to looking at it.

^ permalink raw reply	[flat|nested] 10+ messages in thread

* Re: [LSF/MM/BPF TOPIC] SLUB allocator, mainly the sheaves caching layer
  2025-02-24 18:15   ` Vlastimil Babka
@ 2025-02-24 20:52     ` Shakeel Butt
  0 siblings, 0 replies; 10+ messages in thread
From: Shakeel Butt @ 2025-02-24 20:52 UTC (permalink / raw)
  To: Vlastimil Babka
  Cc: lsf-pc, linux-mm, bpf, Christoph Lameter, David Rientjes,
	Hyeonggon Yoo, Uladzislau Rezki (Sony),
	Alexei Starovoitov

On Mon, Feb 24, 2025 at 07:15:16PM +0100, Vlastimil Babka wrote:
> On 2/24/25 19:02, Shakeel Butt wrote:
> > On Mon, Feb 24, 2025 at 05:13:25PM +0100, Vlastimil Babka wrote:
> >> Hi,
> >> 
> >> I'd like to propose a session about the SLUB allocator.
> >> 
> >> Mainly I would like to discuss the addition of the sheaves caching layer,
> >> the latest RFC posted at [1].
> >> 
> >> The goals of that work is to:
> >> 
> >> - Reduce fastpath overhead. The current freeing fastpath only can be used if
> >> the same target slab is still the cpu slab, which can be only expected for a
> >> very short term allocations. Further improvements should come from the new
> >> local_trylock_t primitive.
> >> 
> >> - Improve efficiency of users such as like maple tree, thanks to more
> >> efficient preallocations, and kfree_rcu batching/reusal
> >> 
> >> - Hopefully also facilitate further changes needed for bpf allocations, also
> >> via local_trylock_t, that could possibly extend to the other parts of the
> >> implementation as needed.
> >> 
> >> The controversial discussion points I expect about this approach are:
> >> 
> >> - Either sheaves will not support NUMA restrictions (as in current RFC), or
> >> bring back the alien cache flushing issues of SLAB (or there's a better idea?)
> >> 
> >> - Will it be possible to eventually have sheaves enabled for every cache and
> >> replace the current slub's fastpaths with it? Arguably these are also not
> >> very efficient when NUMA-restricted allocations are requested for varying
> >> NUMA nodes (cpu slab is flushed if it's from a wrong node, to load a slab
> >> from the requested node).
> >> 
> >> Besides sheaves, I'd like to summarize recent kfree_rcu() changes and we
> >> could discuss further improvements to that.
> >> 
> >> Also we can discuss what's needed to support bpf allocations. I've talked
> >> about it last year, but then focused on other things, so Alexei has been
> >> driving that recently (so far in the page allocator).
> > 
> > What about pre-memcg-charged sheaves? We had to disable memcg charging
> > of some kernel allocations
> 
> You mean due to bad performance? Which ones for example? Was the overhead
> due to accounting of how much is charged, or due to the associating memcgs
> with objects?
> 

I know of the following two cases but we do hear frequently that kmemcg
accounting is not cheap.

3754707bcc3e ("Revert "memcg: enable accounting for file lock caches"")
0bcfe68b8767 ("Revert "memcg: enable accounting for pollfd and select
bits arrays"")

> > and I think sheaves can help in reenabling
> > it.
> 
> You mean by mean having separate sheaves per memcg? Wouldn't that mean
> risking that too many objects could be cached in them, we'd have to flush
> eventually e.g. the least recently used ones, etc? Or do you mean some other
> scheme?
> 

As you pointed out a simple scheme of separate sheaves per memcg might
not work. Maybe targeting specific kmem caches or allocation sites would
be a first step. I will need to think more on this.


^ permalink raw reply	[flat|nested] 10+ messages in thread

* Re: [LSF/MM/BPF TOPIC] SLUB allocator, mainly the sheaves caching layer
  2025-02-24 18:46   ` Mateusz Guzik
@ 2025-02-24 21:12     ` Shakeel Butt
  2025-02-24 22:21       ` Mateusz Guzik
  0 siblings, 1 reply; 10+ messages in thread
From: Shakeel Butt @ 2025-02-24 21:12 UTC (permalink / raw)
  To: Mateusz Guzik
  Cc: Vlastimil Babka, lsf-pc, linux-mm, bpf, Christoph Lameter,
	David Rientjes, Hyeonggon Yoo, Uladzislau Rezki (Sony),
	Alexei Starovoitov

On Mon, Feb 24, 2025 at 07:46:52PM +0100, Mateusz Guzik wrote:
> On Mon, Feb 24, 2025 at 10:02:09AM -0800, Shakeel Butt wrote:
> > What about pre-memcg-charged sheaves? We had to disable memcg charging
> > of some kernel allocations and I think sheaves can help in reenabling
> > it.
> 
> It has been several months since last I looked at memcg, so details are
> fuzzy and I don't have time to refresh everything.
> 
> However, if memory serves right the primary problem was the irq on/off
> trip associated with them (sometimes happening twice, second time with
> refill_obj_stock()).
> 
> I think the real fix(tm) would recognize only some allocations need
> interrupt safety -- as in some slabs should not be allowed to be used
> outside of the process context. This is somewhat what sheaves is doing,
> but can be applied without fronting the current kmem caching mechanism.
> This may be a tough sell and even then it plays whackamole with patching
> up all consumers.
> 
> Suppose it is not an option.
> 
> Then there are 2 ways that I considered.
> 
> The easiest splits memcg accounting for irq and process level -- similar
> to what localtry thing is doing. this would only cost preemption off/on
> trip in the common case and a branch on the current state. But suppose
> this is a no-go as well.

Have you seen 559271146efc ("mm/memcg: optimize user context object
stock access"). It got reverted for RT (or something). Maybe we can look
at it again.

> 
> My primary idea was using hand-rolled sequence counters and local 8-byte
> cmpxchg (*without* the lock prefix, also not to be confused with 16-byte
> used by the current slub fast path). Should this work, it would be
> significantly faster than irq trips. 
> 
> The irq thing is there only to facilitate several fields being updated
> or memcg itself getting replaced in an atomic manner for process vs
> interrupt context.
> 
> The observation is that all values which are getting updated are 4
> bytes. Then perhaps an additional counter can be added next to each one
> so that an 8-byte cmpxchg is going to fail should an irq swoop in and
> change stuff from under us.
> 
> The percpu state would have a sequence counter associated with the
> assigned memcg_stock_pcp. The memcg_stock_pcp object would have the same
> value replicated inside for every var which can be updated in the fast
> path.
> 
> Then the fast path would only succeed if the value read off from per-cpu
> did not change vs what's in the stock thing.
> 
> Any change to memcg_stock_pcp (e.g., rolling up bytes after passing the
> page size threshold) would disable interrupts and modify all these
> counters.
> 
> There is some more work needed to make sure the stock obj can be safely
> swapped out for a new one and not accidentally have a value which lines
> up with the prevoius one, I don't remember what I had for that (and yes,
> I recognize a 4 byte value will invariably roll over and *in principle*
> a conflict will be possible).
> 
> This is a rough outline since Vlasta keeps prodding me about it.

By chance do you have this code lying around somewhere? Not saying this
is the way to go but wanted to take a look.

> 
> That said, maybe someone will have a better idea. The above is up for
> grabs if someone wants to do it, I can't commit to looking at it.


^ permalink raw reply	[flat|nested] 10+ messages in thread

* Re: [LSF/MM/BPF TOPIC] SLUB allocator, mainly the sheaves caching layer
  2025-02-24 21:12     ` Shakeel Butt
@ 2025-02-24 22:21       ` Mateusz Guzik
  0 siblings, 0 replies; 10+ messages in thread
From: Mateusz Guzik @ 2025-02-24 22:21 UTC (permalink / raw)
  To: Shakeel Butt
  Cc: Vlastimil Babka, lsf-pc, linux-mm, bpf, Christoph Lameter,
	David Rientjes, Hyeonggon Yoo, Uladzislau Rezki (Sony),
	Alexei Starovoitov

On Mon, Feb 24, 2025 at 10:12 PM Shakeel Butt <shakeel.butt@linux.dev> wrote:
>
> On Mon, Feb 24, 2025 at 07:46:52PM +0100, Mateusz Guzik wrote:
> > On Mon, Feb 24, 2025 at 10:02:09AM -0800, Shakeel Butt wrote:
> > > What about pre-memcg-charged sheaves? We had to disable memcg charging
> > > of some kernel allocations and I think sheaves can help in reenabling
> > > it.
> >
> > It has been several months since last I looked at memcg, so details are
> > fuzzy and I don't have time to refresh everything.
> >
> > However, if memory serves right the primary problem was the irq on/off
> > trip associated with them (sometimes happening twice, second time with
> > refill_obj_stock()).
> >
> > I think the real fix(tm) would recognize only some allocations need
> > interrupt safety -- as in some slabs should not be allowed to be used
> > outside of the process context. This is somewhat what sheaves is doing,
> > but can be applied without fronting the current kmem caching mechanism.
> > This may be a tough sell and even then it plays whackamole with patching
> > up all consumers.
> >
> > Suppose it is not an option.
> >
> > Then there are 2 ways that I considered.
> >
> > The easiest splits memcg accounting for irq and process level -- similar
> > to what localtry thing is doing. this would only cost preemption off/on
> > trip in the common case and a branch on the current state. But suppose
> > this is a no-go as well.
>
> Have you seen 559271146efc ("mm/memcg: optimize user context object
> stock access"). It got reverted for RT (or something). Maybe we can look
> at it again.
>

Huh. I have not it, it does look like the same core idea.

Even if RT itself is the problem, perhaps this could be made build
time conditional on it?

> >
> > My primary idea was using hand-rolled sequence counters and local 8-byte
> > cmpxchg (*without* the lock prefix, also not to be confused with 16-byte
> > used by the current slub fast path). Should this work, it would be
> > significantly faster than irq trips.
> >
> > The irq thing is there only to facilitate several fields being updated
> > or memcg itself getting replaced in an atomic manner for process vs
> > interrupt context.
> >
> > The observation is that all values which are getting updated are 4
> > bytes. Then perhaps an additional counter can be added next to each one
> > so that an 8-byte cmpxchg is going to fail should an irq swoop in and
> > change stuff from under us.
> >
> > The percpu state would have a sequence counter associated with the
> > assigned memcg_stock_pcp. The memcg_stock_pcp object would have the same
> > value replicated inside for every var which can be updated in the fast
> > path.
> >
> > Then the fast path would only succeed if the value read off from per-cpu
> > did not change vs what's in the stock thing.
> >
> > Any change to memcg_stock_pcp (e.g., rolling up bytes after passing the
> > page size threshold) would disable interrupts and modify all these
> > counters.
> >
> > There is some more work needed to make sure the stock obj can be safely
> > swapped out for a new one and not accidentally have a value which lines
> > up with the prevoius one, I don't remember what I had for that (and yes,
> > I recognize a 4 byte value will invariably roll over and *in principle*
> > a conflict will be possible).
> >
> > This is a rough outline since Vlasta keeps prodding me about it.
>
> By chance do you have this code lying around somewhere? Not saying this
> is the way to go but wanted to take a look.

Sorry mate, there was a lot of handwaving produced around this and
kmem fast paths, but no code. :)

Conceptually though I think this is pretty straightforward.

Anyhow, I forgot to mention another angle: perhaps a kernel-equivalent
of rseq could be somehow employed here?

As in you prep the op. Should an interrupt come in, it can detect you
were going to execute it and redirect your IP to a fallback or just
restart. I have no idea how feasible this is here, food for thought.
-- 
Mateusz Guzik <mjguzik gmail.com>


^ permalink raw reply	[flat|nested] 10+ messages in thread

* Re: [LSF/MM/BPF TOPIC] SLUB allocator, mainly the sheaves caching layer
  2025-02-24 16:13 [LSF/MM/BPF TOPIC] SLUB allocator, mainly the sheaves caching layer Vlastimil Babka
  2025-02-24 18:02 ` Shakeel Butt
@ 2025-02-26  0:17 ` Christoph Lameter (Ampere)
  2025-03-05 10:26   ` Vlastimil Babka
  2025-03-25 17:43 ` Vlastimil Babka
  2 siblings, 1 reply; 10+ messages in thread
From: Christoph Lameter (Ampere) @ 2025-02-26  0:17 UTC (permalink / raw)
  To: Vlastimil Babka
  Cc: lsf-pc, linux-mm, bpf, David Rientjes, Hyeonggon Yoo,
	Uladzislau Rezki (Sony),
	Alexei Starovoitov

Let me just express my general concern. SLUB was written because SLAB
became a Byzantine mess with layer upon layer of debugging and queues
here and there and with "maintenance" for these queues going on every 2
seconds staggered on all processors. This caused a degree of OS noise that
caused HPC jobs (and today we see similar issues with AI jobs) to not be
able to accomplish a deterministic rendezvous. On some large machines
we had ~10% of the whole memory vanish into one of the other queue on boot
up with  the customers being a bit upset were all the expensive memory
went.

It seems that were have nearly recreated the old nightmare again.

I would suggest rewriting the whole allocator once again trying to
simplify things as much as possible and isolating specialized allocator
functionality needed for some subsystems into different APIs.

The main allocation / free path needs to be as simple and as efficient as
possible. It may not be possible to accomplish something like that given
all the special casing that we have been pushing into it. Also consider the
runtime security measures and verification stuff that is on by default at
runtime as well.

^ permalink raw reply	[flat|nested] 10+ messages in thread

* Re: [LSF/MM/BPF TOPIC] SLUB allocator, mainly the sheaves caching layer
  2025-02-26  0:17 ` Christoph Lameter (Ampere)
@ 2025-03-05 10:26   ` Vlastimil Babka
  0 siblings, 0 replies; 10+ messages in thread
From: Vlastimil Babka @ 2025-03-05 10:26 UTC (permalink / raw)
  To: Christoph Lameter (Ampere)
  Cc: lsf-pc, linux-mm, bpf, David Rientjes, Hyeonggon Yoo,
	Uladzislau Rezki (Sony),
	Alexei Starovoitov, Kees Cook

On 2/26/25 01:17, Christoph Lameter (Ampere) wrote:
> 
> Let me just express my general concern. SLUB was written because SLAB
> became a Byzantine mess with layer upon layer of debugging and queues

I don't recall it having much debugging. IIRC it was behind some config that
nobody enabled. SLUB's debugging that can be dynamically enabled on boot is
so much better.

> here and there and with "maintenance" for these queues going on every 2
> seconds staggered on all processors. This caused a degree of OS noise that
> caused HPC jobs (and today we see similar issues with AI jobs) to not be
> able to accomplish a deterministic rendezvous. On some large machines

Yeah, I don't want to reintroduce this, hence sheaves intentionally don't
support NUMA restricted allocations so none of the flushed alien arrays are
necessary.

> we had ~10% of the whole memory vanish into one of the other queue on boot
> up with  the customers being a bit upset were all the expensive memory
> went.
> 
> It seems that were have nearly recreated the old nightmare again.

I don't see it that bleak.

> I would suggest rewriting the whole allocator once again trying to
> simplify things as much as possible and isolating specialized allocator
> functionality needed for some subsystems into different APIs.

Any specific suggestions? Some things are hard to isolate i.e. make them
work on top of the core allocator because not interacting with the internals
would not allow some useful functionality, or efficiency.

> The main allocation / free path needs to be as simple and as efficient as
> possible. It may not be possible to accomplish something like that given
> all the special casing that we have been pushing into it. Also consider the

I see some possibilities for simplification in not trying to support KASAN
together with slab_debug anymore. KASAN should be superior for that purpose
(of course you pay the extra cost) and it's tricky to not have it step on
each other's toes with slab_debug.

> runtime security measures and verification stuff that is on by default at
> runtime as well.

Yeah more and more hardening seems to be the current trend. But also not
realistically possible to isolate away from the core. I at least tried to
always compile all of it away completely when the respective CONFIG is not
enabled. OTOH I'd like to see some of that to support boot parameters (via
static keys etc) so it can be compiled in but not enabled. That would not
completely eliminate the overhead of passing e.g. the bucket parameter or
performing kmalloc random index evaluation, but would not allocate the
separate caches if not enabled, so the memory overhead of that would not be
imposed.

^ permalink raw reply	[flat|nested] 10+ messages in thread

* Re: [LSF/MM/BPF TOPIC] SLUB allocator, mainly the sheaves caching layer
  2025-02-24 16:13 [LSF/MM/BPF TOPIC] SLUB allocator, mainly the sheaves caching layer Vlastimil Babka
  2025-02-24 18:02 ` Shakeel Butt
  2025-02-26  0:17 ` Christoph Lameter (Ampere)
@ 2025-03-25 17:43 ` Vlastimil Babka
  2 siblings, 0 replies; 10+ messages in thread
From: Vlastimil Babka @ 2025-03-25 17:43 UTC (permalink / raw)
  To: lsf-pc, linux-mm, bpf
  Cc: Christoph Lameter, David Rientjes, Hyeonggon Yoo,
	Uladzislau Rezki (Sony),
	Alexei Starovoitov

[-- Attachment #1: Type: text/plain, Size: 147 bytes --]

On 2/24/25 5:13 PM, Vlastimil Babka wrote:
> Hi,
> 
> I'd like to propose a session about the SLUB allocator.

Here are my slides from the session.

[-- Attachment #2: sheaves.pdf --]
[-- Type: application/pdf, Size: 207012 bytes --]

^ permalink raw reply	[flat|nested] 10+ messages in thread

end of thread, other threads:[~2025-03-25 17:41 UTC | newest]

Thread overview: 10+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2025-02-24 16:13 [LSF/MM/BPF TOPIC] SLUB allocator, mainly the sheaves caching layer Vlastimil Babka
2025-02-24 18:02 ` Shakeel Butt
2025-02-24 18:15   ` Vlastimil Babka
2025-02-24 20:52     ` Shakeel Butt
2025-02-24 18:46   ` Mateusz Guzik
2025-02-24 21:12     ` Shakeel Butt
2025-02-24 22:21       ` Mateusz Guzik
2025-02-26  0:17 ` Christoph Lameter (Ampere)
2025-03-05 10:26   ` Vlastimil Babka
2025-03-25 17:43 ` Vlastimil Babka

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox