Re: [LSF/MM/BPF TOPIC] Ways to mitigate limitations of percpu memory allocator


From: Gabriel Krisman Bertazi <krisman@suse.de>
To: Pedro Falcato <pfalcato@suse.de>
Cc: Jan Kara <jack@suse.cz>,  Harry Yoo <harry.yoo@oracle.com>,
	linux-mm@kvack.org,  lsf-pc@lists.linux-foundation.org,
	 Mateusz Guzik <mjguzik@gmail.com>,
	 Mathieu Desnoyers <mathieu.desnoyers@efficios.com>,
	Tejun Heo <tj@kernel.org>,  Christoph Lameter <cl@gentwo.org>,
	 Dennis Zhou <dennis@kernel.org>,
	 Vlastimil Babka <vbabka@suse.cz>,  Hao Li <hao.li@linux.dev>
Subject: Re: [LSF/MM/BPF TOPIC] Ways to mitigate limitations of percpu memory allocator
Date: Fri, 06 Mar 2026 11:26:22 -0500
Message-ID: <87seaczzyp.fsf@mailhost.krisman.be>
In-Reply-To: <qz3f4p2ra6nq5cx3vlacmwif2ih5ojbf7s3ydzw6d7tgqn24lj@pnynq4l6oovc> (Pedro Falcato's message of "Fri, 6 Mar 2026 15:35:36 +0000")

Pedro Falcato <pfalcato@suse.de> writes:

> On Thu, Mar 05, 2026 at 12:48:21PM +0100, Jan Kara wrote:
>
>> On Thu 05-03-26 11:33:21, Pedro Falcato wrote:
>> > On Fri, Feb 27, 2026 at 03:41:50PM +0900, Harry Yoo wrote:
>> > > Hi folks, I'd like to discuss ways to mitigate limitations of
>> > > percpu memory allocator. 
>> > > 
>> > > While the percpu memory allocator has served its role well,
>> > > it has a few problems: 1) its global lock contention, and
>> > > 2) lack of features to avoid high initialization cost of percpu memory.
>> > > 
>> > > Global lock contention
>> > > =======================
>> > > 
>> > > Percpu allocator has a global lock when allocating or freeing memory.
>> > > Of course, caching percpu memory is not always worth it, because
>> > > it would meaningfully increase memory usage.
>> > > 
>> > > However, some users (e.g., fork+exec, tc filter) suffer from
>> > > the lock contention when many CPUs allocate / free percpu memory
>> > > concurrently.
>> > > 
>> > > That said, we need a way to cache percpu memory per cpu, in a selective
>> > > way. As an opt-in approach, Mateusz Guzik proposed [1] keeping percpu
>> > > memory in slab objects and letting slab cache them per cpu,
>> > > with slab ctor+dtor pair: allocate percpu memory and
>> > > associate it with slab object in constructor, and free it when
>> > > deallocating slabs (with resurrecting slab destructor feature).
>> > > 
>> > > This only works when percpu memory is associated with slab objects.
>> > > I would like to hear if anybody thinks it's still worth redesigning
>> > > percpu memory allocator for better scalability.
>> > 
>> > I think this (make alloc_percpu actually scale) is the obvious suggestion.
>> > Everything else is just papering over the cracks.
>> 
>> I disagree. There are two separate (although related) issues that need
>> solving. One issue is certainly scalability of the percpu allocator.
>> Another issue (which is also visible in singlethreaded workloads) is that
>> a percpu counter creation has a rather large cost even if the allocator is
>> totally uncontended - this is because of the initialization (and final
>> summarization) cost. And this is very visible e.g. in the fork() intensive
>> loads such as shell scripts where we currently allocate several percpu
>> arrays for each fork() and significant part of the fork() cost is currently
>> the initialization of percpu arrays on larger machines. Reducing this
>> overhead is a separate goal.
>
> I agree that it's a separate issue. But it's as much of an issue for
> single-threaded processes as much as multi-threaded. Say you have a 64 core
> CPU. Why should you pay for 64 separate cores when you only spawned 2 threads?
> (and, yes, this is a not-so-rare situation, like lld which spawns up to 16
> threads (https://reviews.llvm.org/D147493), even if you have hundreds
> of CPUs)

True.  Still, being an up-front initialization cost, it matters most
for the shortest-lived tasks.  I'd imagine that even for something like
lld doing 16 clone syscalls, the overhead of a single percpu counter
initialization is a very small blip in the profile, not worth
special-casing.  The single-threaded case is the obvious one to
optimize in this sense.
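
To make the single-threaded point concrete, here is a minimal userspace
sketch of a counter that stays a plain integer until a second thread
shows up, and only then pays the O(nr_cpus) allocation and zeroing
cost.  Everything in it (struct toy_counter, the toy_counter_*()
helpers) is hypothetical and made up for illustration; it is not the
kernel's percpu_counter API, and it ignores the locking a real upgrade
path would need.

```c
#include <assert.h>
#include <stdlib.h>

/* Hypothetical model: not the kernel's percpu_counter. */
struct toy_counter {
	long single;	/* used while the task is single-threaded */
	long *percpu;	/* NULL until the counter is upgraded */
	int nr_cpus;
};

/* O(1) init: no per-CPU memory is touched here. */
static void toy_counter_init(struct toy_counter *c, int nr_cpus)
{
	assert(nr_cpus > 0);
	c->single = 0;
	c->percpu = NULL;
	c->nr_cpus = nr_cpus;
}

/*
 * Would be called when a second thread is created.  This is where the
 * O(nr_cpus) allocation + zeroing cost is paid.  Returns 0 on success,
 * -1 if the allocation fails.
 */
static int toy_counter_upgrade(struct toy_counter *c)
{
	if (c->percpu)
		return 0;
	c->percpu = calloc((size_t)c->nr_cpus, sizeof(*c->percpu));
	if (!c->percpu)
		return -1;
	/* Fold the single-threaded value into slot 0. */
	c->percpu[0] = c->single;
	c->single = 0;
	return 0;
}

static void toy_counter_add(struct toy_counter *c, int cpu, long delta)
{
	if (c->percpu)
		c->percpu[cpu] += delta;
	else
		c->single += delta;
}

/* O(1) before the upgrade, O(nr_cpus) after it. */
static long toy_counter_sum(const struct toy_counter *c)
{
	long sum = c->single;

	if (c->percpu)
		for (int i = 0; i < c->nr_cpus; i++)
			sum += c->percpu[i];
	return sum;
}

static void toy_counter_destroy(struct toy_counter *c)
{
	free(c->percpu);
	c->percpu = NULL;
}
```

A short-lived single-threaded task in this model never calls
toy_counter_upgrade(), so init, sum and destroy all stay O(1) however
many CPUs the machine has.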

> So perhaps the best way to go about this problem would be to go back to
> per-task RSS accounting. This one had problems with many-task RSS accuracy,
> but the current one has problems for many-cpu RSS accuracy.

The current percpu scheme has a much smaller accuracy error than the
per-task one, which justified its inclusion in the first place, no?
IIRC, there was a real use case where the worse accuracy mattered for
process selection during OOM.
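
For reference, the accuracy trade-off can be modelled in userspace as
below.  This is only a sketch of a generic batched counter: each
caching context (a task in a per-task scheme, a CPU in a percpu scheme)
holds a local delta and folds it into the shared value once it reaches
a batch size.  The names (struct batched_counter, bc_*()) and the
MAX_CTX limit are assumptions for illustration, not kernel code.  In
this model a cheap read can be off by up to nr_ctx * (batch - 1), so
the error scales with whichever population is larger, tasks or CPUs.

```c
#include <assert.h>
#include <stdlib.h>

#define MAX_CTX 1024	/* arbitrary limit for this sketch */

/* Hypothetical model of a counter with per-context batching. */
struct batched_counter {
	long global;		/* what a cheap read sees */
	long local[MAX_CTX];	/* unflushed per-context deltas */
	int nr_ctx;		/* number of tasks or CPUs caching deltas */
	long batch;		/* fold threshold */
};

static void bc_init(struct batched_counter *bc, int nr_ctx, long batch)
{
	assert(nr_ctx > 0 && nr_ctx <= MAX_CTX && batch > 0);
	bc->global = 0;
	bc->nr_ctx = nr_ctx;
	bc->batch = batch;
	for (int i = 0; i < nr_ctx; i++)
		bc->local[i] = 0;
}

/* Accumulate locally; fold into the shared value at the threshold. */
static void bc_add(struct batched_counter *bc, int ctx, long delta)
{
	long v = bc->local[ctx] + delta;

	if (labs(v) >= bc->batch) {
		bc->global += v;
		v = 0;
	}
	bc->local[ctx] = v;
}

/* Cheap, possibly stale read. */
static long bc_read_fast(const struct batched_counter *bc)
{
	return bc->global;
}

/* Exact read: walks every context, O(nr_ctx). */
static long bc_read_exact(const struct batched_counter *bc)
{
	long sum = bc->global;

	for (int i = 0; i < bc->nr_ctx; i++)
		sum += bc->local[i];
	return sum;
}

/* Worst-case gap between the fast and exact reads in this model. */
static long bc_max_drift(const struct batched_counter *bc)
{
	return bc->nr_ctx * (bc->batch - 1);
}
```

With 4 contexts and a batch of 8, adding 7 in each context leaves the
fast read at 0 while the exact value is 28, which is the worst case for
those parameters.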

> A single-threaded
> optimization could patch over the problem for the vast majority of programs,
> but exceptions exist.
>
> Or another possible idea: lazily initialize these cpu counters somehow,
> on task switch.
>
> I'm afraid that while the solution presented by Mathieu fixes a problem with
> the current scheme (insane inaccuracy on large-cpu-count), it might also add
> to the percpu allocation + init problem (this might not be true, I have not
> paid too much attention).

-- 
Gabriel Krisman Bertazi

