Re: [RFC PATCH 0/7] Reviving the slab destructor to tackle the percpu allocator scalability problem

linux-mm.kvack.org archive mirror
 help / color / mirror / Atom feed

From: Harry Yoo <harry.yoo@oracle.com>
To: Pedro Falcato <pfalcato@suse.de>
Cc: Vlastimil Babka <vbabka@suse.cz>,
	Christoph Lameter <cl@gentwo.org>,
	David Rientjes <rientjes@google.com>,
	Andrew Morton <akpm@linux-foundation.org>,
	Dennis Zhou <dennis@kernel.org>, Tejun Heo <tj@kernel.org>,
	Mateusz Guzik <mjguzik@gmail.com>,
	Jamal Hadi Salim <jhs@mojatatu.com>,
	Cong Wang <xiyou.wangcong@gmail.com>,
	Jiri Pirko <jiri@resnulli.us>, Vlad Buslov <vladbu@nvidia.com>,
	Yevgeny Kliteynik <kliteyn@nvidia.com>, Jan Kara <jack@suse.cz>,
	Byungchul Park <byungchul@sk.com>,
	linux-mm@kvack.org, netdev@vger.kernel.org,
	linux-kernel@vger.kernel.org
Subject: Re: [RFC PATCH 0/7] Reviving the slab destructor to tackle the percpu allocator scalability problem
Date: Mon, 28 Apr 2025 10:18:34 +0900	[thread overview]
Message-ID: <aA7XagXweCLdATTH@harry> (raw)
In-Reply-To: <vd3k2bljkzow6ozzan2hkeiyytcqe2g6gavroej23457erucza@fknlr6cmzvo7>

On Fri, Apr 25, 2025 at 11:42:27AM +0100, Pedro Falcato wrote:
> On Fri, Apr 25, 2025 at 07:12:02PM +0900, Harry Yoo wrote:
> > On Thu, Apr 24, 2025 at 12:28:37PM +0100, Pedro Falcato wrote:
> > > On Thu, Apr 24, 2025 at 05:07:48PM +0900, Harry Yoo wrote:
> > > > Overview
> > > > ========
> > > > 
> > > > The slab destructor feature existed in early days of slab allocator(s).
> > > > It was removed by the commit c59def9f222d ("Slab allocators: Drop support
> > > > for destructors") in 2007 due to lack of serious use cases at that time.
> > > > 
> > > > Eighteen years later, Mateusz Guzik proposed [1] re-introducing a slab
> > > > constructor/destructor pair to mitigate the global serialization point
> > > > (pcpu_alloc_mutex) that occurs when each slab object allocates and frees
> > > > percpu memory during its lifetime.
> > > > 
> > > > Consider mm_struct: it allocates two percpu regions (mm_cid and rss_stat),
> > > > so each allocate–free cycle requires two expensive acquire/release on
> > > > that mutex.
> > > > 
> > > > We can mitigate this contention by retaining the percpu regions after
> > > > the object is freed and releasing them only when the backing slab pages
> > > > are freed.
> > > > 
> > > > How to do this with slab constructors and destructors: the constructor
> > > > allocates percpu memory, and the destructor frees it when the slab pages
> > > > are reclaimed; this slightly alters the constructor’s semantics,
> > > > as it can now fail.
> > > > 
> > > 
> > > I really really really really don't like this. We're opening a pandora's box
> > > of locking issues for slab deadlocks and other subtle issues. IMO the best
> > > solution there would be, what, failing dtors? which says a lot about the whole
> > > situation...
> > > 
> > > Case in point:
> > 
> > <...snip...>
> > 
> > > Then there are obviously other problems like: whatever you're calling must
> > > not ever require the slab allocator (directly or indirectly) and must not
> > > do direct reclaim (ever!), at the risk of a deadlock. The pcpu allocator
> > > is a no-go (AIUI!) already because of such issues.
> > 
> > Could you please elaborate more on this?
>
> Well, as discussed multiple-times both on-and-off-list, the pcpu allocator is
> not a problem here because the freeing path takes a spinlock, not a mutex.

Yes, and it seems to be a leaf spinlock (no code path in the kernel takes
any other locks while holding the lock).

> But obviously you can see the fun locking horror dependency chains
> we're creating with this patchset.
>
> ->ctor() needs to be super careful calling things, avoiding
> any sort of loop.

You mean recursion to avoid exhausting the kernel stack?

It'd be fine as long as ->ctor does not allocate objects from
the same cache (and of course it should not).

> ->dtor() needs to be super careful calling things, avoiding
> _any_ sort of direct reclaim possibilities.

Why would a dtor _ever_ need to perform direct reclamation?

>You also now need to pass a gfp_t  to both ->ctor and ->dtor.

Passing gfp_t to ->ctor agreed, but why to ->dtor?

Destructors should not allocate any memory and thus no need for
'Get Free Page' flags.

> So basically:
> - ->ctor takes more args, can fail, can do fancier things (multiple allocations,
>   lock holding, etc, can be hidden with a normal kmem_cache_alloc;

Speaking of deadlocks involving ->ctor, of course, you shouldn't allocate
mm_struct while holding pcpu_lock, or allocate a pgd while holding pgd_lock.
I don't think avoiding deadlocks caused by ->ctor is that difficult, and
lockdep can detect them even if someone makes a mistake.

> - ->dtor *will* do fancy things like recursing back onto the slab allocator and
>   grabbing locks

AIUI it can't recurse back onto the slab allocator.
'Taking only leaf spinlocks' is a very restrictive rule.

For example, slab takes list_lock, and it is not a leaf spinlock because
in some path slab can take other locks while holding it. And thus you can't
call kmem_cache_free() directly in ->dtor.

'Doing fancy things and grabbing locks' in dtor is safe as long as
you 1) disable interrupts and 2) take only leaf spinlocks.
 
> - a normal kmem_cache_free can suddenly attempt to grab !SLUB locks as it tries
>   to dispose of slabs. It can also uncontrollably do $whatever.

'Can uncontrollably do $whatever' is not true.
It's safe as long as ->dtor only takes leaf spinlocks.

> - a normal kmem_cache_alloc can call vast swaths of code, uncontrollably, due to
>   ->ctor. It can also set off direct reclaim, and thus run into all sorts of kmem_
>   cache_free/slab disposal issues
> 
> - a normal, possibly-unrelated GFP_KERNEL allocation can also run into all of these
>   issues by purely starting up shrinkers on direct reclaim as well.

I don't see that is a problem (as long as ->dtor only takes leaf spinlocks).

-- 
Cheers,
Harry / Hyeonggon

next prev parent reply	other threads:[~2025-04-28  1:25 UTC|newest]

Thread overview: 27+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2025-04-24  8:07 Harry Yoo
2025-04-24  8:07 ` [RFC PATCH 1/7] mm/slab: refactor freelist shuffle Harry Yoo
2025-04-24  8:07 ` [RFC PATCH 2/7] treewide, slab: allow slab constructor to return an error Harry Yoo
2025-04-24  8:07 ` [RFC PATCH 3/7] mm/slab: revive the destructor feature in slab allocator Harry Yoo
2025-04-24  8:07 ` [RFC PATCH 4/7] net/sched/act_api: use slab ctor/dtor to reduce contention on pcpu alloc Harry Yoo
2025-04-24  8:07 ` [RFC PATCH 5/7] mm/percpu: allow (un)charging objects without alloc/free Harry Yoo
2025-04-24  8:07 ` [RFC PATCH 6/7] lib/percpu_counter: allow (un)charging percpu counters " Harry Yoo
2025-04-24  8:07 ` [RFC PATCH 7/7] kernel/fork: improve exec() throughput with slab ctor/dtor pair Harry Yoo
2025-04-24  9:29 ` [RFC PATCH 0/7] Reviving the slab destructor to tackle the percpu allocator scalability problem Mateusz Guzik
2025-04-24  9:58   ` Harry Yoo
2025-04-24 15:00     ` Mateusz Guzik
2025-04-24 11:28 ` Pedro Falcato
2025-04-24 15:20   ` Mateusz Guzik
2025-04-24 16:11     ` Mateusz Guzik
2025-04-25  7:40     ` Harry Yoo
2025-04-25 10:12   ` Harry Yoo
2025-04-25 10:42     ` Pedro Falcato
2025-04-28  1:18       ` Harry Yoo [this message]
2025-04-30 19:49       ` Mateusz Guzik
2025-05-12 11:00         ` Harry Yoo
2025-04-24 15:50 ` Christoph Lameter (Ampere)
2025-04-24 16:03   ` Mateusz Guzik
2025-04-24 16:39     ` Christoph Lameter (Ampere)
2025-04-24 17:26       ` Mateusz Guzik
2025-04-24 18:47 ` Tejun Heo
2025-04-25 10:10   ` Harry Yoo
2025-04-25 19:03     ` Tejun Heo

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=aA7XagXweCLdATTH@harry \
    --to=harry.yoo@oracle.com \
    --cc=akpm@linux-foundation.org \
    --cc=byungchul@sk.com \
    --cc=cl@gentwo.org \
    --cc=dennis@kernel.org \
    --cc=jack@suse.cz \
    --cc=jhs@mojatatu.com \
    --cc=jiri@resnulli.us \
    --cc=kliteyn@nvidia.com \
    --cc=linux-kernel@vger.kernel.org \
    --cc=linux-mm@kvack.org \
    --cc=mjguzik@gmail.com \
    --cc=netdev@vger.kernel.org \
    --cc=pfalcato@suse.de \
    --cc=rientjes@google.com \
    --cc=tj@kernel.org \
    --cc=vbabka@suse.cz \
    --cc=vladbu@nvidia.com \
    --cc=xiyou.wangcong@gmail.com \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link

Be sure your reply has a Subject: header at the top and a blank line before the message body.

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox