From: Gabriel Krisman Bertazi <krisman@suse.de>
To: Pedro Falcato <pfalcato@suse.de>
Cc: Jan Kara <jack@suse.cz>, Harry Yoo <harry.yoo@oracle.com>,
linux-mm@kvack.org, lsf-pc@lists.linux-foundation.org,
Mateusz Guzik <mjguzik@gmail.com>,
Mathieu Desnoyers <mathieu.desnoyers@efficios.com>,
Tejun Heo <tj@kernel.org>, Christoph Lameter <cl@gentwo.org>,
Dennis Zhou <dennis@kernel.org>,
Vlastimil Babka <vbabka@suse.cz>, Hao Li <hao.li@linux.dev>
Subject: Re: [LSF/MM/BPF TOPIC] Ways to mitigate limitations of percpu memory allocator
Date: Fri, 06 Mar 2026 11:26:22 -0500
Message-ID: <87seaczzyp.fsf@mailhost.krisman.be>
In-Reply-To: <qz3f4p2ra6nq5cx3vlacmwif2ih5ojbf7s3ydzw6d7tgqn24lj@pnynq4l6oovc> (Pedro Falcato's message of "Fri, 6 Mar 2026 15:35:36 +0000")

Pedro Falcato <pfalcato@suse.de> writes:

> On Thu, Mar 05, 2026 at 12:48:21PM +0100, Jan Kara wrote:
>
>> On Thu 05-03-26 11:33:21, Pedro Falcato wrote:
>> > On Fri, Feb 27, 2026 at 03:41:50PM +0900, Harry Yoo wrote:
>> > > Hi folks, I'd like to discuss ways to mitigate limitations of
>> > > percpu memory allocator.
>> > >
>> > > While the percpu memory allocator has served its role well,
>> > > it has a few problems: 1) its global lock contention, and
>> > > 2) lack of features to avoid high initialization cost of percpu memory.
>> > >
>> > > Global lock contention
>> > > =======================
>> > >
>> > > Percpu allocator has a global lock when allocating or freeing memory.
>> > > Of course, caching percpu memory is not always worth it, because
>> > > it would meaningfully increase memory usage.
>> > >
>> > > However, some users (e.g., fork+exec, tc filter) suffer from
>> > > the lock contention when many CPUs allocate / free percpu memory
>> > > concurrently.
>> > >
>> > > That said, we need a way to cache percpu memory per cpu, in a selective
>> > > way. As an opt-in approach, Mateusz Guzik proposed [1] keeping percpu
>> > > memory in slab objects and letting slab cache them per cpu,
>> > > with slab ctor+dtor pair: allocate percpu memory and
>> > > associate it with slab object in constructor, and free it when
>> > > deallocating slabs (with resurrecting slab destructor feature).
>> > >
>> > > This only works when percpu memory is associated with slab objects.
>> > > I would like to hear if anybody thinks it's still worth redesigning
>> > > percpu memory allocator for better scalability.
>> >
>> > I think this (make alloc_percpu actually scale) is the obvious suggestion.
>> > Everything else is just papering over the cracks.
>>
>> I disagree. There are two separate (although related) issues that need
>> solving. One issue is certainly scalability of the percpu allocator.
>> Another issue (which is also visible in singlethreaded workloads) is that
>> a percpu counter creation has a rather large cost even if the allocator is
>> totally uncontended - this is because of the initialization (and final
>> summarization) cost. And this is very visible e.g. in the fork() intensive
>> loads such as shell scripts where we currently allocate several percpu
>> arrays for each fork() and significant part of the fork() cost is currently
>> the initialization of percpu arrays on larger machines. Reducing this
>> overhead is a separate goal.
>
> I agree that it's a separate issue. But it's as much of an issue for
> single-threaded processes as much as multi-threaded. Say you have a 64 core
> CPU. Why should you pay for 64 separate cores when you only spawned 2 threads?
> (and, yes, this is a not-so-rare situation, like lld which spawns up to 16
> threads (https://reviews.llvm.org/D147493), even if you have hundreds
> of CPUs)

True. Still, since this is an up-front initialization cost, it matters
more the shorter the task lives. I'd imagine that even for something
like lld doing 16 clone syscalls, the overhead of a single percpu
counter initialization is a very small blip in the profile, not worth
special-casing. The single-threaded case is the obvious one to
optimize in this sense.

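
To illustrate the single-threaded case, here is a rough user-space
sketch, not kernel code. All names (lazy_counter and its helpers) are
made up for illustration, and it ignores concurrency entirely. The
counter starts as a plain long, so init and the final sum are O(1), and
it only pays the O(nr_cpus) allocation and zeroing when it is upgraded,
e.g. when a second thread appears:

```c
#include <assert.h>
#include <stdlib.h>

/* Hypothetical model: none of these names exist in the kernel. */
struct lazy_counter {
	long single;   /* used while only one thread touches the counter */
	long *percpu;  /* NULL until the counter is upgraded */
	int nr_cpus;
};

/* O(1): no per-CPU slots are allocated or zeroed here. */
static void lazy_counter_init(struct lazy_counter *c, int nr_cpus)
{
	assert(nr_cpus > 0);
	c->single = 0;
	c->percpu = NULL;
	c->nr_cpus = nr_cpus;
}

/* O(nr_cpus): paid only on upgrade. Returns 0 on success, -1 on failure. */
static int lazy_counter_upgrade(struct lazy_counter *c)
{
	if (c->percpu)
		return 0;
	c->percpu = calloc((size_t)c->nr_cpus, sizeof(*c->percpu));
	if (!c->percpu)
		return -1;
	c->percpu[0] = c->single;  /* carry over the single-threaded count */
	c->single = 0;
	return 0;
}

static void lazy_counter_add(struct lazy_counter *c, int cpu, long delta)
{
	assert(cpu >= 0 && cpu < c->nr_cpus);
	if (c->percpu)
		c->percpu[cpu] += delta;
	else
		c->single += delta;
}

/* Final summarization: O(1) if never upgraded, O(nr_cpus) otherwise. */
static long lazy_counter_sum(const struct lazy_counter *c)
{
	long sum = c->single;

	if (c->percpu)
		for (int i = 0; i < c->nr_cpus; i++)
			sum += c->percpu[i];
	return sum;
}

static void lazy_counter_destroy(struct lazy_counter *c)
{
	free(c->percpu);
	c->percpu = NULL;
}
```

A task that never leaves the single-threaded mode never touches the
per-CPU array at all, which is where the saving for short-lived
processes would come from.
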
> So perhaps the best way to go about this problem would be to go back to
> per-task RSS accounting. This one had problems with many-task RSS accuracy,
> but the current one has problems for many-cpu RSS accuracy.

The current percpu one has a much smaller accuracy error than the
per-task one, which justified its inclusion in the first place, no?
IIRC, there was a real use case where the worse accuracy mattered for
process selection during OOM.

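
For reference, the accuracy trade-off can be modelled with a generic
batched counter. This is a user-space sketch under my own assumptions
(the names and the batch value are hypothetical, not the kernel's): each
slot accumulates a local delta and folds it into the global value only
once it reaches the batch size, so a cheap read of the global value can
be off by up to nr_slots * (batch - 1). With per-task accounting
nr_slots is the number of threads; with percpu accounting it is the
number of CPUs.

```c
#include <assert.h>
#include <stdlib.h>

/* Hypothetical model of a batched split counter. */
struct batched_counter {
	long global;   /* what a cheap, approximate read returns */
	long *local;   /* one pending delta per slot (task or CPU) */
	int nr_slots;
	long batch;    /* a slot folds into global once |delta| >= batch */
};

/* Returns 0 on success, -1 on allocation failure. */
static int batched_init(struct batched_counter *c, int nr_slots, long batch)
{
	assert(nr_slots > 0 && batch > 0);
	c->global = 0;
	c->nr_slots = nr_slots;
	c->batch = batch;
	c->local = calloc((size_t)nr_slots, sizeof(*c->local));
	return c->local ? 0 : -1;
}

static void batched_add(struct batched_counter *c, int slot, long delta)
{
	long v;

	assert(slot >= 0 && slot < c->nr_slots);
	v = c->local[slot] + delta;
	if (v >= c->batch || v <= -c->batch) {
		c->global += v;        /* fold the pending delta */
		c->local[slot] = 0;
	} else {
		c->local[slot] = v;    /* keep it local, invisible to fast reads */
	}
}

/* Cheap read: ignores whatever is still pending in the slots. */
static long batched_read_fast(const struct batched_counter *c)
{
	return c->global;
}

/* Exact read: walks every slot, O(nr_slots). */
static long batched_read_exact(const struct batched_counter *c)
{
	long sum = c->global;

	for (int i = 0; i < c->nr_slots; i++)
		sum += c->local[i];
	return sum;
}

/* Worst-case distance between the fast and the exact read. */
static long batched_max_error(const struct batched_counter *c)
{
	return (long)c->nr_slots * (c->batch - 1);
}

static void batched_destroy(struct batched_counter *c)
{
	free(c->local);
	c->local = NULL;
}
```

The bound grows with whichever dimension the counter is split on, which
is the many-task versus many-cpu distinction in this model.
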
> A single-threaded
> optimization could patch over the problem for the vast majority of programs,
> but exceptions exist.
>
> Or another possible idea: lazily initialize these cpu counters somehow,
> on task switch.
>
> I'm afraid that while the solution presented by Mathieu fixes a problem with
> the current scheme (insane inaccuracy on large-cpu-count), it might also add
> to the percpu allocation + init problem (this might not be true, I have not
> paid too much attention).
--
Gabriel Krisman Bertazi