From: Pedro Falcato <pfalcato@suse.de>
To: Jan Kara <jack@suse.cz>
Cc: Harry Yoo <harry.yoo@oracle.com>,
linux-mm@kvack.org, lsf-pc@lists.linux-foundation.org,
Mateusz Guzik <mjguzik@gmail.com>,
Mathieu Desnoyers <mathieu.desnoyers@efficios.com>,
Gabriel Krisman Bertazi <krisman@suse.de>,
Tejun Heo <tj@kernel.org>, Christoph Lameter <cl@gentwo.org>,
Dennis Zhou <dennis@kernel.org>,
Vlastimil Babka <vbabka@suse.cz>, Hao Li <hao.li@linux.dev>
Subject: Re: [LSF/MM/BPF TOPIC] Ways to mitigate limitations of percpu memory allocator
Date: Fri, 6 Mar 2026 15:35:36 +0000 [thread overview]
Message-ID: <qz3f4p2ra6nq5cx3vlacmwif2ih5ojbf7s3ydzw6d7tgqn24lj@pnynq4l6oovc> (raw)
In-Reply-To: <z7bjxfk7jah6zgyikhiz6eqxd3xwywxp745bykcr3sm3p525yi@4diwkjsoyckl>
On Thu, Mar 05, 2026 at 12:48:21PM +0100, Jan Kara wrote:
> On Thu 05-03-26 11:33:21, Pedro Falcato wrote:
> > On Fri, Feb 27, 2026 at 03:41:50PM +0900, Harry Yoo wrote:
> > > Hi folks, I'd like to discuss ways to mitigate limitations of
> > > percpu memory allocator.
> > >
> > > While the percpu memory allocator has served its role well,
> > > it has a few problems: 1) its global lock contention, and
> > > 2) lack of features to avoid high initialization cost of percpu memory.
> > >
> > > Global lock contention
> > > =======================
> > >
> > > Percpu allocator has a global lock when allocating or freeing memory.
> > > Of course, caching percpu memory is not always worth it, because
> > > it would meaningfully increase memory usage.
> > >
> > > However, some users (e.g., fork+exec, tc filter) suffer from
> > > the lock contention when many CPUs allocate / free percpu memory
> > > concurrently.
> > >
> > > That said, we need a way to cache percpu memory per cpu, in a selective
> > > way. As an opt-in approach, Mateusz Guzik proposed [1] keeping percpu
> > > memory in slab objects and letting slab cache them per cpu,
> > > with slab ctor+dtor pair: allocate percpu memory and
> > > associate it with slab object in constructor, and free it when
> > > deallocating slabs (with resurrecting slab destructor feature).
> > >
> > > This only works when percpu memory is associated with slab objects.
> > > I would like to hear if anybody thinks it's still worth redesigning
> > > percpu memory allocator for better scalability.
> >
> > I think this (make alloc_percpu actually scale) is the obvious suggestion.
> > Everything else is just papering over the cracks.
>
> I disagree. There are two separate (although related) issues that need
> solving. One issue is certainly scalability of the percpu allocator.
> Another issue (which is also visible in singlethreaded workloads) is that
> a percpu counter creation has a rather large cost even if the allocator is
> totally uncontended - this is because of the initialization (and final
> summarization) cost. And this is very visible e.g. in the fork() intensive
> loads such as shell scripts where we currently allocate several percpu
> arrays for each fork() and significant part of the fork() cost is currently
> the initialization of percpu arrays on larger machines. Reducing this
> overhead is a separate goal.
I agree that it's a separate issue. But it's as much of an issue for
single-threaded processes as for multi-threaded ones. Say you have a 64-core
CPU: why should you pay for 64 separate per-CPU copies when you only spawned
2 threads? (And yes, this is a not-so-rare situation; lld, for example, spawns
at most 16 threads (https://reviews.llvm.org/D147493) even if you have
hundreds of CPUs.)
So perhaps the best way to attack this problem would be to go back to
per-task RSS accounting. That scheme had accuracy problems with many tasks,
whereas the current one has accuracy problems with many CPUs. A
single-threaded optimization could paper over the problem for the vast
majority of programs, but exceptions exist.
Or another possible idea: lazily initialize these per-CPU counters somehow,
e.g. on task switch.
I'm afraid that while the solution Mathieu presented fixes a problem with
the current scheme (severe inaccuracy at large CPU counts), it might also add
to the percpu allocation + initialization cost (though this might not be
true; I have not looked at it closely).
--
Pedro
Thread overview: 8+ messages
2026-02-27 6:41 Harry Yoo
2026-03-04 17:50 ` Gabriel Krisman Bertazi
2026-03-05 4:24 ` Mathieu Desnoyers
2026-03-05 10:05 ` Jan Kara
2026-03-05 11:33 ` Pedro Falcato
2026-03-05 11:48 ` Jan Kara
2026-03-06 15:35 ` Pedro Falcato [this message]
2026-03-06 16:26 ` Gabriel Krisman Bertazi