* a case for a destructor for slub: mm_struct
@ 2025-03-12 10:33 Mateusz Guzik
  2025-03-13  8:59 ` Harry Yoo
  0 siblings, 1 reply; 11+ messages in thread
From: Mateusz Guzik @ 2025-03-12 10:33 UTC (permalink / raw)
  To: Vlastimil Babka, Hyeonggon Yoo; +Cc: linux-mm, Dennis Zhou

I'm looking for someone(tm) willing to implement a destructor for slub.

Currently SLUB only supports a constructor, a callback to use when
first creating an object, but there is no matching callback for
getting rid of it.

The pair would come in handy when a frequently allocated and freed
object performs the same expensive work each time.
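
To illustrate, today only a constructor can be registered; the mm_ctor/mm_dtor
hooks below are hypothetical (mm_struct does not use a ctor at the moment),
they only show the shape of the pair:

/* existing: a ctor run when objects are set up in a freshly allocated slab */
static void mm_ctor(void *obj)
{
	struct mm_struct *mm = obj;

	spin_lock_init(&mm->page_table_lock);	/* expensive, reusable setup */
}

mm_cachep = kmem_cache_create("mm_struct", sizeof(struct mm_struct),
			      0, SLAB_ACCOUNT, mm_ctor);

/* missing: a matching hook run when the slab releases the object for good
 * (not on every kmem_cache_free()), e.g.: */
static void mm_dtor(void *obj);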

The specific usage I have in mind is mm_struct -- it gets allocated on
each fork and exec and suffers global serialization several times.

The primary thing I'm looking to handle this way is cid and percpu
counter allocation, both going down to the percpu allocator, which
only has a global lock. The problem is exacerbated as it happens
back-to-back, so that's 4 acquires per lifetime cycle (alloc and
free).
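
For reference, per mm lifetime that is roughly the following (a sketch
based on kernel/fork.c with CONFIG_SCHED_MM_CID; details elided):

/* fork/exec path, mm_init(): two trips into the percpu allocator */
mm->pcpu_cid = alloc_percpu(struct mm_cid);
percpu_counter_init_many(mm->rss_stat, 0, GFP_KERNEL_ACCOUNT, NR_MM_COUNTERS);

/* exit path, __mmdrop(): two more */
percpu_counter_destroy_many(mm->rss_stat, NR_MM_COUNTERS);
free_percpu(mm->pcpu_cid);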

There is other expensive work which can also be modified this way.

I recognize something like this would pose a tradeoff in terms of
memory usage, but I don't believe it's a big deal. If you have a
mm_struct hanging out, you are going to need to have the percpu memory
up for grabs to make any use of it anyway. Granted, there may be
spurious mm_structs hanging out and eating pcpu resources. Something
can be added to let the pcpu allocator reclaim those.

So that's it for making the case. As for the APIs, I think it would be
best if both the dtor and ctor accepted a batch of objects to operate
on, but that's a lot of extra churn due to pre-existing ctor users.
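
Something along these lines, with made-up signatures purely for
illustration -- the hooks would run once per slab page rather than once
per object:

int  (*ctor_batch)(void *objs[], unsigned int nr);
void (*dtor_batch)(void *objs[], unsigned int nr);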

ACHTUNG: I think this particular usage would still want some buy-in
from the mm folks and at least Dennis (the percpu allocator
maintainer), but one has to start somewhere. There were 2 different
patchsets posted to move the rss counters away from the current pcpu
scheme, but both had different tradeoffs and ultimately died off.

Should someone(tm) commit to sorting this out, I'll handle the percpu
thing. There are some other tweaks warranted here (e.g., depessimizing
the rss counter validation loop at exit).

So what do you think?

In order to benchmark it yourself, you can grab the code from here:
http://apollo.backplane.com/DFlyMisc/doexec.c

$ cc -static -O2 -o static-doexec doexec.c
$ ./static-doexec $(nproc)

I check for spinlock problems with:
bpftrace -e 'kprobe:__pv_queued_spin_lock_slowpath { @[kstack()] = count(); }'
-- 
Mateusz Guzik <mjguzik gmail.com>



* Re: a case for a destructor for slub: mm_struct
  2025-03-12 10:33 a case for a destructor for slub: mm_struct Mateusz Guzik
@ 2025-03-13  8:59 ` Harry Yoo
  2025-03-13 11:23   ` Mateusz Guzik
  0 siblings, 1 reply; 11+ messages in thread
From: Harry Yoo @ 2025-03-13  8:59 UTC (permalink / raw)
  To: Mateusz Guzik; +Cc: Vlastimil Babka, Hyeonggon Yoo, linux-mm, Dennis Zhou

Hi Mateusz,

Thank you for providing a detailed explanation of how a destructor in the
slab allocator can be useful. A few questions and comments are inlined below.

On Wed, Mar 12, 2025 at 11:33:09AM +0100, Mateusz Guzik wrote:
> I'm looking for someone(tm) willing to implement a destructor for slub.
> 
> Currently SLUB only supports a constructor, a callback to use when
> first creating an object, but there is no matching callback for
> getting rid of it.
> 
> The pair would come in handy when a frequently allocated and freed
> object performs the same expensive work each time.

Actually, the destructor feature existed previously, but it was removed by
commit c59def9f222d ("Slab allocators: Drop support for destructors").

It was removed because there were not many users and there was uncertainty
about its usefulness.

> The specific usage I have in mind is mm_struct -- it gets allocated on
> both each fork and exec and suffers global serialization several
> times.
> 
> The primary thing I'm looking to handle this way is cid and percpu
> counter allocation, both going to down to the percpu allocator which
> only has a global lock. The problem is exacerbated as it happens
> back-to-back, so that's 4 acquires per lifetime cycle (alloc and
> free).

That could be beneficial :-)

> There is other expensive work which can also be modified this way.

Not sure what you're referring to?

> I recognize something like this would pose a tradeoff in terms of
> memory usage, but I don't believe it's a big deal. If you have a
> mm_struct hanging out, you are going to need to have the percpu memory
> up for grabs to make any use of it anyway. Granted,

Yes. Some memory overhead is expected, but I don't think it'd be excessive.

> there may be spurious mm_struct's hanging out and eating pcpu resources.
> Something can be added to reclaim those by the pcpu allocator.

Not sure if I follow. What do you mean by spurious mm_struct, and how
does the pcpu allocator reclaim that?

> So that's it for making the case, as for the APIs, I think it would be
> best if both dtor and ctor accepted a batch of objects to operate on,
> but that's a lot of extra churn due to pre-existing ctor users.

Why do you want to pass a batch of objects, instead of calling it one by
one for each object when a slab folio is allocated/freed?

Is it solely to reduce the overhead of extra function calls when
allocating or freeing a slab folio?

> ACHTUNG: I think this particular usage would still want some buy in
> from the mm folk and at least Dennis (the percpu allocator
> maintainer), but one has to start somewhere. There were 2 different
> patchsets posted to move rss counters away from the current pcpu
> scheme, but both had different tradeoffs and ultimately died off.
>
> Should someone(tm) commit to sorting this out, I'll handle the percpu
> thing. There are some other tweaks warranted here (e.g., depessimizing
> the rss counter validation loop at exit).
> 
> So what do you think?

I'd love to take on the project and work on it; it makes sense to revive
the destructor feature if that turns out to be beneficial.

I'll do the slab part.

> In order to bench yourself,  you can grab code from here:
> http://apollo.backplane.com/DFlyMisc/doexec.c
> 
> $ cc -static -O2 -o static-doexec doexec.c
> $ ./static-doexec $(nproc)
> 
> I check spinlock problems with: bpftrace -e
> 'kprobe:__pv_queued_spin_lock_slowpath { @[kstack()] = count(); }'

Yay! I was looking for something like this to evaluate the performance.
Thank you for providing it!

-- 
Cheers,
Harry (formerly known as Hyeonggon)



* Re: a case for a destructor for slub: mm_struct
  2025-03-13  8:59 ` Harry Yoo
@ 2025-03-13 11:23   ` Mateusz Guzik
  2025-03-13 11:25     ` Mateusz Guzik
  2025-03-14  8:27     ` Harry Yoo
  0 siblings, 2 replies; 11+ messages in thread
From: Mateusz Guzik @ 2025-03-13 11:23 UTC (permalink / raw)
  To: Harry Yoo; +Cc: Vlastimil Babka, Hyeonggon Yoo, linux-mm, Dennis Zhou

On Thu, Mar 13, 2025 at 9:59 AM Harry Yoo <harry.yoo@oracle.com> wrote:
> > There is other expensive work which can also be modified this way.
>
> Not sure what you're referring to?
>

There are pgd_alloc/free calls which end up taking the global pgd_list
lock. Instead, this could be patched to let the mm hang out on the list
and have iterations of said list skip it as needed.
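
Roughly (x86 names, heavily simplified; the liveness check is hypothetical):

/* today: every pgd_alloc()/pgd_free() takes the global lock */
spin_lock(&pgd_lock);
list_add(&page->lru, &pgd_list);	/* or list_del() on free */
spin_unlock(&pgd_lock);

/* with a dtor, a cached-but-free mm could stay on the list and walkers
 * would just skip it: */
list_for_each_entry(page, &pgd_list, lru) {
	if (!pgd_entry_live(page))	/* hypothetical check */
		continue;
	/* ... sync/update the pgd ... */
}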

> > there may be spurious mm_struct's hanging out and eating pcpu resources.
> > Something can be added to reclaim those by the pcpu allocator.
>
> Not sure if I follow. What do you mean by spurious mm_struct, and how
> does the pcpu allocator reclaim that?
>

Suppose a workload was run which created tons of mm_structs. The
workload is done and they can be reclaimed, but they hang out just in case.

Another workload showed up, but one which wants to do many percpu
allocs and is not mm_struct-heavy.

In case of resource shortage it would be good if the percpu allocator
knew how to reclaim the known cached-but-not-used memory instead of
grabbing new pages.

As for how to get there, it so happens the primary consumer (percpu
counters) already has a global list of all allocated objects. The
allocator could walk it and reclaim as needed.
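
(For reference, lib/percpu_counter.c already keeps every counter on a
global list for CPU hotplug purposes; the reclaim pass itself does not
exist, the sketch below is only to show what it could iterate:)

/* existing infrastructure in lib/percpu_counter.c (simplified) */
static LIST_HEAD(percpu_counters);
static DEFINE_SPINLOCK(percpu_counters_lock);

/* hypothetical reclaim pass */
spin_lock(&percpu_counters_lock);
list_for_each_entry(fbc, &percpu_counters, list) {
	/* if the owner is only cached in the slab, release fbc->counters */
}
spin_unlock(&percpu_counters_lock);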

> > So that's it for making the case, as for the APIs, I think it would be
> > best if both dtor and ctor accepted a batch of objects to operate on,
> > but that's a lot of extra churn due to pre-existing ctor users.
>
> Why do you want to pass batch of objects, instead of calling one by one
> for each object when a slab folio is allocated/freed?
>
> Is it solely to reduce the overhead of extra function calls when
> allocating or freeing a slab folio?
>

The single-threaded overhead is one thing; some contention is another.
Back-to-back acquires are a degenerate case from a scalability
standpoint.

While these codepaths should execute rarely, there is still no
need to get spikes when they do.

Even so, that's a minor thing which can be patched up later -- even a
ctor/dtor pair which operates on one obj at a time captures almost the
entirety of the improvement.

> > ACHTUNG: I think this particular usage would still want some buy in
> > from the mm folk and at least Dennis (the percpu allocator
> > maintainer), but one has to start somewhere. There were 2 different
> > patchsets posted to move rss counters away from the current pcpu
> > scheme, but both had different tradeoffs and ultimately died off.
> >
> > Should someone(tm) commit to sorting this out, I'll handle the percpu
> > thing. There are some other tweaks warranted here (e.g., depessimizing
> > the rss counter validation loop at exit).
> >
> > So what do you think?
>
> I'd love to take the project and work on it, it makes sense to revive
> the destructor feature if that turns out to be beneficial.
>
> I'll do that the slab part.

Cool, thanks.

-- 
Mateusz Guzik <mjguzik gmail.com>



* Re: a case for a destructor for slub: mm_struct
  2025-03-13 11:23   ` Mateusz Guzik
@ 2025-03-13 11:25     ` Mateusz Guzik
  2025-03-14  8:27     ` Harry Yoo
  1 sibling, 0 replies; 11+ messages in thread
From: Mateusz Guzik @ 2025-03-13 11:25 UTC (permalink / raw)
  To: Harry Yoo; +Cc: Vlastimil Babka, Hyeonggon Yoo, linux-mm, Dennis Zhou

On Thu, Mar 13, 2025 at 12:23 PM Mateusz Guzik <mjguzik@gmail.com> wrote:

geez, my spelling today is really off even by my own "standard"

>
> On Thu, Mar 13, 2025 at 9:59 AM Harry Yoo <harry.yoo@oracle.com> wrote:
> > > There is other expensive work which can also be modified this way.
> >
> > Not sure what you're referring to?
> >
>
> There is pgd_alloc/free calls which end up taking the global pgd_list
> lock. Instead, this could be patched to let mm hang out on the get
> skipped by the pgd list iterating machinery as needed.

Instead, this could be patched to let mm hang out on the list and have
iterations of said list skip it as needed.

>
> > > there may be spurious mm_struct's hanging out and eating pcpu resources.
> > > Something can be added to reclaim those by the pcpu allocator.
> >
> > Not sure if I follow. What do you mean by spurious mm_struct, and how
> > does the pcpu allocator reclaim that?
> >
>
> Suppose a workload was ran which created tons of mm_struct. The
> workload is done and they can be reclaimed, but hang out just in case.
>
> Another workload showed up, but one which wants to do many percpu
> allocs and is not mm_struct-heavy.
>
> In case of resource shortage it would be good if the percpu allocator
> knew how to reclaim the known cached-but-not-used memory instead of
> grabbing new patches.

new pages

>
> As for how to get there, so happens the primary consumer (percpu
> counters) already has a global list of all allocated objects. The
> allocator could walk it and reclaim as needed.
>
> > > So that's it for making the case, as for the APIs, I think it would be
> > > best if both dtor and ctor accepted a batch of objects to operate on,
> > > but that's a lot of extra churn due to pre-existing ctor users.
> >
> > Why do you want to pass batch of objects, instead of calling one by one
> > for each object when a slab folio is allocated/freed?
> >
> > Is it solely to reduce the overhead of extra function calls when
> > allocating or freeing a slab folio?
> >
>
> The single-threaded overhead is one thing, some contention is another.
> back-to-back acquires are a degenerate case from scalability
> standpoint.
>
> While these codepaths should be executing rarely, there is still no
> need to get spikes when they do.
>
> Even so, that's a minor thing which can be patched up later -- even a
> ctor/dtor pair which operates one obj at a time is almost entirety of
> the improvement.
>
> > > ACHTUNG: I think this particular usage would still want some buy in
> > > from the mm folk and at least Dennis (the percpu allocator
> > > maintainer), but one has to start somewhere. There were 2 different
> > > patchsets posted to move rss counters away from the current pcpu
> > > scheme, but both had different tradeoffs and ultimately died off.
> > >
> > > Should someone(tm) commit to sorting this out, I'll handle the percpu
> > > thing. There are some other tweaks warranted here (e.g., depessimizing
> > > the rss counter validation loop at exit).
> > >
> > > So what do you think?
> >
> > I'd love to take the project and work on it, it makes sense to revive
> > the destructor feature if that turns out to be beneficial.
> >
> > I'll do that the slab part.
>
> Cool, thanks.
>
> --
> Mateusz Guzik <mjguzik gmail.com>



-- 
Mateusz Guzik <mjguzik gmail.com>



* Re: a case for a destructor for slub: mm_struct
  2025-03-13 11:23   ` Mateusz Guzik
  2025-03-13 11:25     ` Mateusz Guzik
@ 2025-03-14  8:27     ` Harry Yoo
  2025-03-14  8:34       ` Harry Yoo
  2025-03-14 12:32       ` Mateusz Guzik
  1 sibling, 2 replies; 11+ messages in thread
From: Harry Yoo @ 2025-03-14  8:27 UTC (permalink / raw)
  To: Mateusz Guzik; +Cc: Vlastimil Babka, Hyeonggon Yoo, linux-mm, Dennis Zhou

On Thu, Mar 13, 2025 at 12:23:55PM +0100, Mateusz Guzik wrote:
> On Thu, Mar 13, 2025 at 9:59 AM Harry Yoo <harry.yoo@oracle.com> wrote:
> > > There is other expensive work which can also be modified this way.
> >
> > Not sure what you're referring to?
> >
> 
> There is pgd_alloc/free calls which end up taking the global pgd_list
> lock. Instead, this could be patched to let mm hang out on the get
> skipped by the pgd list iterating machinery as needed.

Okay, IIUC, you want to remove it from the list in the destructor
rather than when freeing the object, and the code iterating over the
list needs to verify whether each entry is still allocated.

Would this be with SLAB_TYPESAFE_BY_RCU, or by taking a lock within
the destructor?

The RFC patch [1] that removes the destructor states that "taking a spinlock
in a destructor is a bit risky since the slab allocators may run the
destructors anytime they decide a slab is no longer needed",
but I need to think a bit more about how risky that actually is.

[1] https://lore.kernel.org/linux-kernel/Pine.LNX.4.64.0705101156190.10663@schroedinger.engr.sgi.com/#t

> > > there may be spurious mm_struct's hanging out and eating pcpu resources.
> > > Something can be added to reclaim those by the pcpu allocator.
> >
> > Not sure if I follow. What do you mean by spurious mm_struct, and how
> > does the pcpu allocator reclaim that?
> >
> 
> Suppose a workload was ran which created tons of mm_struct. The
> workload is done and they can be reclaimed, but hang out just in case.
>
> Another workload showed up, but one which wants to do many percpu
> allocs and is not mm_struct-heavy.
> 
> In case of resource shortage it would be good if the percpu allocator
> knew how to reclaim the known cached-but-not-used memory instead of
> grabbing new patches.
> 
> As for how to get there, so happens the primary consumer (percpu
> counters) already has a global list of all allocated objects. The
> allocator could walk it and reclaim as needed.

You mean reclaiming the per-cpu objects along with the slab objects that
use them? That sounds like a new slab shrinker for mm_struct?

> > > So that's it for making the case, as for the APIs, I think it would be
> > > best if both dtor and ctor accepted a batch of objects to operate on,
> > > but that's a lot of extra churn due to pre-existing ctor users.
> >
> > Why do you want to pass batch of objects, instead of calling one by one
> > for each object when a slab folio is allocated/freed?
> >
> > Is it solely to reduce the overhead of extra function calls when
> > allocating or freeing a slab folio?
> >
> 
> The single-threaded overhead is one thing, some contention is another.
> back-to-back acquires are a degenerate case from scalability
> standpoint.
> 
> While these codepaths should be executing rarely, there is still no
> need to get spikes when they do.
> 
> Even so, that's a minor thing which can be patched up later -- even a
> ctor/dtor pair which operates one obj at a time is almost entirety of
> the improvement.

Hmm, but I think even a ctor/dtor pair that operates on one object at
a time requires changing the semantics of how the ctor works :'(

Currently the constructor is not expected to fail, but looking at
mm_init()... it can fail. There is no way to let the slab allocator
know that it failed.

Looks like some churn is needed anyway?
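
(In other words, the prototype would have to grow a return value; a
minimal sketch of the churn, with the mm conversion being hypothetical:)

/* today */
void (*ctor)(void *obj);

/* would need to become something like */
int (*ctor)(void *obj);

static int mm_ctor(void *obj)
{
	struct mm_struct *mm = obj;

	mm->pcpu_cid = alloc_percpu(struct mm_cid);
	if (!mm->pcpu_cid)
		return -ENOMEM;
	if (percpu_counter_init_many(mm->rss_stat, 0, GFP_KERNEL_ACCOUNT,
				     NR_MM_COUNTERS)) {
		free_percpu(mm->pcpu_cid);
		return -ENOMEM;
	}
	return 0;
}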

> > > ACHTUNG: I think this particular usage would still want some buy in
> > > from the mm folk and at least Dennis (the percpu allocator
> > > maintainer), but one has to start somewhere. There were 2 different
> > > patchsets posted to move rss counters away from the current pcpu
> > > scheme, but both had different tradeoffs and ultimately died off.
> > >
> > > Should someone(tm) commit to sorting this out, I'll handle the percpu
> > > thing. There are some other tweaks warranted here (e.g., depessimizing
> > > the rss counter validation loop at exit).
> > >
> > > So what do you think?
> >
> > I'd love to take the project and work on it, it makes sense to revive
> > the destructor feature if that turns out to be beneficial.
> >
> > I'll do that the slab part.
> 
> Cool, thanks.

-- 
Cheers,
Harry



* Re: a case for a destructor for slub: mm_struct
  2025-03-14  8:27     ` Harry Yoo
@ 2025-03-14  8:34       ` Harry Yoo
  2025-03-14 12:32       ` Mateusz Guzik
  1 sibling, 0 replies; 11+ messages in thread
From: Harry Yoo @ 2025-03-14  8:34 UTC (permalink / raw)
  To: Mateusz Guzik; +Cc: Vlastimil Babka, Hyeonggon Yoo, linux-mm, Dennis Zhou

On Fri, Mar 14, 2025 at 05:27:33PM +0900, Harry Yoo wrote:
> On Thu, Mar 13, 2025 at 12:23:55PM +0100, Mateusz Guzik wrote:
> > On Thu, Mar 13, 2025 at 9:59 AM Harry Yoo <harry.yoo@oracle.com> wrote:
> > > > There is other expensive work which can also be modified this way.
> > >
> > > Not sure what you're referring to?
> > >
> > 
> > There is pgd_alloc/free calls which end up taking the global pgd_list
> > lock. Instead, this could be patched to let mm hang out on the get
> > skipped by the pgd list iterating machinery as needed.
> 
> Okay, IIUC, you want to remove it from the list in the destructor
> and not remove it from the list when freeing it, and the code iterating
> over the list needs to verify whether it is still allocated.
> 
> Would this be with SLAB_TYPESAFE_BY_RCU, or by taking a lock within
> the destructor?

Nah, it slipped my mind. A lock needs to be acquired
even with SLAB_TYPESAFE_BY_RCU.

> The RFC patch [1] that removes the destructor states that "taking a spinlock
> in a destructor is a bit risky since the slab allocators may run the
> destructors anytime they decide a slab is no longer needed",
> but I need to think a bit more about how risky that actually is.
> 
> [1] https://lore.kernel.org/linux-kernel/Pine.LNX.4.64.0705101156190.10663@schroedinger.engr.sgi.com/#t
> 
> > > > there may be spurious mm_struct's hanging out and eating pcpu resources.
> > > > Something can be added to reclaim those by the pcpu allocator.
> > >
> > > Not sure if I follow. What do you mean by spurious mm_struct, and how
> > > does the pcpu allocator reclaim that?
> > >
> > 
> > Suppose a workload was ran which created tons of mm_struct. The
> > workload is done and they can be reclaimed, but hang out just in case.
> >
> > Another workload showed up, but one which wants to do many percpu
> > allocs and is not mm_struct-heavy.
> > 
> > In case of resource shortage it would be good if the percpu allocator
> > knew how to reclaim the known cached-but-not-used memory instead of
> > grabbing new patches.
> > 
> > As for how to get there, so happens the primary consumer (percpu
> > counters) already has a global list of all allocated objects. The
> > allocator could walk it and reclaim as needed.
> 
> You mean reclaiming per-cpu objects along withthe slab objects that uses them?
> That sounds like a new slab shrinker for mm_struct?
> 
> > > > So that's it for making the case, as for the APIs, I think it would be
> > > > best if both dtor and ctor accepted a batch of objects to operate on,
> > > > but that's a lot of extra churn due to pre-existing ctor users.
> > >
> > > Why do you want to pass batch of objects, instead of calling one by one
> > > for each object when a slab folio is allocated/freed?
> > >
> > > Is it solely to reduce the overhead of extra function calls when
> > > allocating or freeing a slab folio?
> > >
> > 
> > The single-threaded overhead is one thing, some contention is another.
> > back-to-back acquires are a degenerate case from scalability
> > standpoint.
> > 
> > While these codepaths should be executing rarely, there is still no
> > need to get spikes when they do.
> > 
> > Even so, that's a minor thing which can be patched up later -- even a
> > ctor/dtor pair which operates one obj at a time is almost entirety of
> > the improvement.
> 
> Hmm, but I think even a ctor/dtor pair that operations on one object at
> a time requires changing the semantic of how the ctor works :'(
> 
> For now the constructor is not expected to fail, but looking at
> mm_init()... it can fail. There is no way to let slab allocator know
> that it failed.
> 
> Looks like some churn is needed anyway?
> 
> > > > ACHTUNG: I think this particular usage would still want some buy in
> > > > from the mm folk and at least Dennis (the percpu allocator
> > > > maintainer), but one has to start somewhere. There were 2 different
> > > > patchsets posted to move rss counters away from the current pcpu
> > > > scheme, but both had different tradeoffs and ultimately died off.
> > > >
> > > > Should someone(tm) commit to sorting this out, I'll handle the percpu
> > > > thing. There are some other tweaks warranted here (e.g., depessimizing
> > > > the rss counter validation loop at exit).
> > > >
> > > > So what do you think?
> > >
> > > I'd love to take the project and work on it, it makes sense to revive
> > > the destructor feature if that turns out to be beneficial.
> > >
> > > I'll do that the slab part.
> > 
> > Cool, thanks.
> 
> -- 
> Cheers,
> Harry

-- 
Cheers,
Harry



* Re: a case for a destructor for slub: mm_struct
  2025-03-14  8:27     ` Harry Yoo
  2025-03-14  8:34       ` Harry Yoo
@ 2025-03-14 12:32       ` Mateusz Guzik
  2025-03-17  5:42         ` Harry Yoo
  1 sibling, 1 reply; 11+ messages in thread
From: Mateusz Guzik @ 2025-03-14 12:32 UTC (permalink / raw)
  To: Harry Yoo; +Cc: Vlastimil Babka, Hyeonggon Yoo, linux-mm, Dennis Zhou

On Fri, Mar 14, 2025 at 9:27 AM Harry Yoo <harry.yoo@oracle.com> wrote:
>
> On Thu, Mar 13, 2025 at 12:23:55PM +0100, Mateusz Guzik wrote:
> > On Thu, Mar 13, 2025 at 9:59 AM Harry Yoo <harry.yoo@oracle.com> wrote:
> > > > There is other expensive work which can also be modified this way.
> > >
> > > Not sure what you're referring to?
> > >
> >
> > There is pgd_alloc/free calls which end up taking the global pgd_list
> > lock. Instead, this could be patched to let mm hang out on the get
> > skipped by the pgd list iterating machinery as needed.
>
> Okay, IIUC, you want to remove it from the list in the destructor
> and not remove it from the list when freeing it, and the code iterating
> over the list needs to verify whether it is still allocated.
>
> Would this be with SLAB_TYPESAFE_BY_RCU, or by taking a lock within
> the destructor?
>
> The RFC patch [1] that removes the destructor states that "taking a spinlock
> in a destructor is a bit risky since the slab allocators may run the
> destructors anytime they decide a slab is no longer needed",
> but I need to think a bit more about how risky that actually is.
>
> [1] https://lore.kernel.org/linux-kernel/Pine.LNX.4.64.0705101156190.10663@schroedinger.engr.sgi.com/#t
>

It's a spinlock which disables interrupts around itself, so it should
not be a problem.

> > > > there may be spurious mm_struct's hanging out and eating pcpu resources.
> > > > Something can be added to reclaim those by the pcpu allocator.
> > >
> > > Not sure if I follow. What do you mean by spurious mm_struct, and how
> > > does the pcpu allocator reclaim that?
> > >
> >
> > Suppose a workload was ran which created tons of mm_struct. The
> > workload is done and they can be reclaimed, but hang out just in case.
> >
> > Another workload showed up, but one which wants to do many percpu
> > allocs and is not mm_struct-heavy.
> >
> > In case of resource shortage it would be good if the percpu allocator
> > knew how to reclaim the known cached-but-not-used memory instead of
> > grabbing new patches.
> >
> > As for how to get there, so happens the primary consumer (percpu
> > counters) already has a global list of all allocated objects. The
> > allocator could walk it and reclaim as needed.
>
> You mean reclaiming per-cpu objects along withthe slab objects that uses them?
> That sounds like a new slab shrinker for mm_struct?
>

At least the per-cpu thing; mm_struct itself optionally.

> > > > So that's it for making the case, as for the APIs, I think it would be
> > > > best if both dtor and ctor accepted a batch of objects to operate on,
> > > > but that's a lot of extra churn due to pre-existing ctor users.
> > >
> > > Why do you want to pass batch of objects, instead of calling one by one
> > > for each object when a slab folio is allocated/freed?
> > >
> > > Is it solely to reduce the overhead of extra function calls when
> > > allocating or freeing a slab folio?
> > >
> >
> > The single-threaded overhead is one thing, some contention is another.
> > back-to-back acquires are a degenerate case from scalability
> > standpoint.
> >
> > While these codepaths should be executing rarely, there is still no
> > need to get spikes when they do.
> >
> > Even so, that's a minor thing which can be patched up later -- even a
> > ctor/dtor pair which operates one obj at a time is almost entirety of
> > the improvement.
>
> Hmm, but I think even a ctor/dtor pair that operations on one object at
> a time requires changing the semantic of how the ctor works :'(
>
> For now the constructor is not expected to fail, but looking at
> mm_init()... it can fail. There is no way to let slab allocator know
> that it failed.
>
> Looks like some churn is needed anyway?
>

Indeed, I did not realize it at the time, sorry :)

I'm happy to split the churn and add some 'return 0;' myself.

-- 
Mateusz Guzik <mjguzik gmail.com>



* Re: a case for a destructor for slub: mm_struct
  2025-03-14 12:32       ` Mateusz Guzik
@ 2025-03-17  5:42         ` Harry Yoo
  2025-03-17  9:02           ` Vlastimil Babka
  2025-03-17  9:23           ` Mateusz Guzik
  0 siblings, 2 replies; 11+ messages in thread
From: Harry Yoo @ 2025-03-17  5:42 UTC (permalink / raw)
  To: Mateusz Guzik; +Cc: Vlastimil Babka, Hyeonggon Yoo, linux-mm, Dennis Zhou

On Fri, Mar 14, 2025 at 01:32:16PM +0100, Mateusz Guzik wrote:
> On Fri, Mar 14, 2025 at 9:27 AM Harry Yoo <harry.yoo@oracle.com> wrote:
> >
> > On Thu, Mar 13, 2025 at 12:23:55PM +0100, Mateusz Guzik wrote:
> > > On Thu, Mar 13, 2025 at 9:59 AM Harry Yoo <harry.yoo@oracle.com> wrote:
> > > > > There is other expensive work which can also be modified this way.
> > > >
> > > > Not sure what you're referring to?
> > > >
> > >
> > > There is pgd_alloc/free calls which end up taking the global pgd_list
> > > lock. Instead, this could be patched to let mm hang out on the get
> > > skipped by the pgd list iterating machinery as needed.
> >
> > Okay, IIUC, you want to remove it from the list in the destructor
> > and not remove it from the list when freeing it, and the code iterating
> > over the list needs to verify whether it is still allocated.
> >
> > Would this be with SLAB_TYPESAFE_BY_RCU, or by taking a lock within
> > the destructor?
> >
> > The RFC patch [1] that removes the destructor states that "taking a spinlock
> > in a destructor is a bit risky since the slab allocators may run the
> > destructors anytime they decide a slab is no longer needed",
> > but I need to think a bit more about how risky that actually is.
> >
> > [1] https://lore.kernel.org/linux-kernel/Pine.LNX.4.64.0705101156190.10663@schroedinger.engr.sgi.com/#t
> >
> 
> It's a spinlock which disables interrupts around itself, so it should
> not be a problem.
> 
> > > > > there may be spurious mm_struct's hanging out and eating pcpu resources.
> > > > > Something can be added to reclaim those by the pcpu allocator.
> > > >
> > > > Not sure if I follow. What do you mean by spurious mm_struct, and how
> > > > does the pcpu allocator reclaim that?
> > > >
> > >
> > > Suppose a workload was ran which created tons of mm_struct. The
> > > workload is done and they can be reclaimed, but hang out just in case.
> > >
> > > Another workload showed up, but one which wants to do many percpu
> > > allocs and is not mm_struct-heavy.
> > >
> > > In case of resource shortage it would be good if the percpu allocator
> > > knew how to reclaim the known cached-but-not-used memory instead of
> > > grabbing new patches.
> > >
> > > As for how to get there, so happens the primary consumer (percpu
> > > counters) already has a global list of all allocated objects. The
> > > allocator could walk it and reclaim as needed.
> >
> > You mean reclaiming per-cpu objects along withthe slab objects that uses them?
> > That sounds like a new slab shrinker for mm_struct?
> >
> 
> at least the per-cpu thing, mm_struct itself optionally

If we allow reclaiming only the per-cpu stuff but do not reclaim
the slab object that contains it...

Does that mean the users of the cache need to check whether the percpu
memory has been reclaimed and, if so, call the init routines (e.g.,
mm_init()) again?

> > > > > So that's it for making the case, as for the APIs, I think it would be
> > > > > best if both dtor and ctor accepted a batch of objects to operate on,
> > > > > but that's a lot of extra churn due to pre-existing ctor users.
> > > >
> > > > Why do you want to pass batch of objects, instead of calling one by one
> > > > for each object when a slab folio is allocated/freed?
> > > >
> > > > Is it solely to reduce the overhead of extra function calls when
> > > > allocating or freeing a slab folio?
> > > >
> > >
> > > The single-threaded overhead is one thing, some contention is another.
> > > back-to-back acquires are a degenerate case from scalability
> > > standpoint.
> > >
> > > While these codepaths should be executing rarely, there is still no
> > > need to get spikes when they do.
> > >
> > > Even so, that's a minor thing which can be patched up later -- even a
> > > ctor/dtor pair which operates one obj at a time is almost entirety of
> > > the improvement.
> >
> > Hmm, but I think even a ctor/dtor pair that operations on one object at
> > a time requires changing the semantic of how the ctor works :'(
> >
> > For now the constructor is not expected to fail, but looking at
> > mm_init()... it can fail. There is no way to let slab allocator know
> > that it failed.
> >
> > Looks like some churn is needed anyway?
> >
> 
> indeed, i did not realize it at the time, sorry :)
> 
> I'm happy to split the churn and add some 'return 0;' myself.

Fortunately, there were only a handful of users of the constructor ;-)
Never mind, I converted all of them in an hour (splitting the work +
ping-ponging seemed like unnecessary overhead).

I will send the patch series before too long.
But please be patient; since LSFMM is next week,
I'm expecting to be busier than usual :)

-- 
Cheers,
Harry



* Re: a case for a destructor for slub: mm_struct
  2025-03-17  5:42         ` Harry Yoo
@ 2025-03-17  9:02           ` Vlastimil Babka
  2025-03-17  9:17             ` Mateusz Guzik
  2025-03-17  9:23           ` Mateusz Guzik
  1 sibling, 1 reply; 11+ messages in thread
From: Vlastimil Babka @ 2025-03-17  9:02 UTC (permalink / raw)
  To: Harry Yoo, Mateusz Guzik; +Cc: Hyeonggon Yoo, linux-mm, Dennis Zhou

On 3/17/25 06:42, Harry Yoo wrote:
> On Fri, Mar 14, 2025 at 01:32:16PM +0100, Mateusz Guzik wrote:
>> 
>> It's a spinlock which disables interrupts around itself, so it should
>> not be a problem.
>> 
>> > > > > there may be spurious mm_struct's hanging out and eating pcpu resources.
>> > > > > Something can be added to reclaim those by the pcpu allocator.
>> > > >
>> > > > Not sure if I follow. What do you mean by spurious mm_struct, and how
>> > > > does the pcpu allocator reclaim that?
>> > > >
>> > >
>> > > Suppose a workload was ran which created tons of mm_struct. The
>> > > workload is done and they can be reclaimed, but hang out just in case.
>> > >
>> > > Another workload showed up, but one which wants to do many percpu
>> > > allocs and is not mm_struct-heavy.
>> > >
>> > > In case of resource shortage it would be good if the percpu allocator
>> > > knew how to reclaim the known cached-but-not-used memory instead of
>> > > grabbing new patches.
>> > >
>> > > As for how to get there, so happens the primary consumer (percpu
>> > > counters) already has a global list of all allocated objects. The
>> > > allocator could walk it and reclaim as needed.
>> >
>> > You mean reclaiming per-cpu objects along withthe slab objects that uses them?
>> > That sounds like a new slab shrinker for mm_struct?
>> >
>> 
>> at least the per-cpu thing, mm_struct itself optionally
> 
> If we allow reclaiming per-cpu stuff only biut do not reclaim
> the slab object that contains it...
> 
> Does that mean the users of the cache need to check if the percpu
> memory has been reclaimed and if so, should call init routines (e.g.,
> mm_init())?

That sounds like something we'd better avoid? I think it would require
some locking between the shrinker and the slab allocator so it doesn't
hand out a mm_struct whose percpu memory has been reclaimed.

I hope it's enough if we're able to shrink what the slab allocator has
cached in per-cpu (partial) slabs; there's already flushing of that from
e.g. sysfs, but I can't recall if there's a shrinker. Of course there will
always be free mm_struct objects in partially full slabs due to
fragmentation, but I doubt we'd need to worry specifically about the
percpu memory those "own".
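
(For reference, that flushing can be triggered from userspace today with
something like the following, assuming the mm_struct cache is not merged
under another name:)

echo 1 > /sys/kernel/slab/mm_struct/shrink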



* Re: a case for a destructor for slub: mm_struct
  2025-03-17  9:02           ` Vlastimil Babka
@ 2025-03-17  9:17             ` Mateusz Guzik
  0 siblings, 0 replies; 11+ messages in thread
From: Mateusz Guzik @ 2025-03-17  9:17 UTC (permalink / raw)
  To: Vlastimil Babka; +Cc: Harry Yoo, Hyeonggon Yoo, linux-mm, Dennis Zhou

On Mon, Mar 17, 2025 at 10:02 AM Vlastimil Babka <vbabka@suse.cz> wrote:
>
> On 3/17/25 06:42, Harry Yoo wrote:
> > On Fri, Mar 14, 2025 at 01:32:16PM +0100, Mateusz Guzik wrote:
> >> > You mean reclaiming per-cpu objects along withthe slab objects that uses them?
> >> > That sounds like a new slab shrinker for mm_struct?
> >> >
> >>
> >> at least the per-cpu thing, mm_struct itself optionally
> >
> > If we allow reclaiming per-cpu stuff only biut do not reclaim
> > the slab object that contains it...
> >
> > Does that mean the users of the cache need to check if the percpu
> > memory has been reclaimed and if so, should call init routines (e.g.,
> > mm_init())?
>
> That sounds like something we'd better avoid? Think it would need to imply
> some locking between the shrinker and slab allocator so it doesn't hand out
> a mm_struct where its percpu memory is reclaimed.
>
> I hope it's enough if we're able to shrink what slab allocator has cached in
> per-cpu (partial) slabs, there's already flushing of that from e.g. sysfs
> but can't recall if there's a shrinker. Of course there will always be free
> mm_struct objects in partially full slabs due to fragmentation, but I doubt
> we'd need to worry specifically about the percpu memory those "own".

My preference here is that a shrinker would just whack some mm_struct
objects, like would normally happen during memory reclaim.

However, should that be difficult to achieve, whacking just the
per-cpu areas is perfectly doable at an added cost at allocation time
-- you "claim" the obj with an atomic op. After that it is illegal to
whack the percpu areas from under you, and if they are already missing,
you just allocate them. Getting here is not ideal and preferably avoided
in favor of whacking the entire struct, but it can be done with minor
hacks.
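
(A rough sketch with a hypothetical state word in mm_struct; only the cid
part is shown, the rss counters would be analogous:)

/* alloc side claims the object */
if (cmpxchg(&mm->pcpu_state, PCPU_CACHED, PCPU_CLAIMED) != PCPU_CACHED) {
	/* the shrinker won the race and freed the percpu areas; redo them */
	if (mm_alloc_cid(mm))
		goto fail;
	mm->pcpu_state = PCPU_CLAIMED;
}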

-- 
Mateusz Guzik <mjguzik gmail.com>



* Re: a case for a destructor for slub: mm_struct
  2025-03-17  5:42         ` Harry Yoo
  2025-03-17  9:02           ` Vlastimil Babka
@ 2025-03-17  9:23           ` Mateusz Guzik
  1 sibling, 0 replies; 11+ messages in thread
From: Mateusz Guzik @ 2025-03-17  9:23 UTC (permalink / raw)
  To: Harry Yoo; +Cc: Vlastimil Babka, Hyeonggon Yoo, linux-mm, Dennis Zhou

On Mon, Mar 17, 2025 at 6:42 AM Harry Yoo <harry.yoo@oracle.com> wrote:
> If we allow reclaiming per-cpu stuff only biut do not reclaim
> the slab object that contains it...
>
> Does that mean the users of the cache need to check if the percpu
> memory has been reclaimed and if so, should call init routines (e.g.,
> mm_init())?
>

See the other e-mail (about to be written).

> > I'm happy to split the churn and add some 'return 0;' myself.
>
> Fortunately, there were only a handful users of of constructor ;-)
> Never mind, I converted all of them in an hour (splitting the work + ping
> pong-ing seemed like unnecessary overhead).
>

Cool, and apologies again :)

> Will send the patch series before too late.
> But please be patient—since LSFMM is next week,
> I'm expecting to be busier than usual :)
>

There is zero time pressure here, mate.

I appreciate you looking into this.

--
Mateusz Guzik <mjguzik gmail.com>


