linux-mm.kvack.org archive mirror
From: Pedro Falcato <pfalcato@suse.de>
To: Jan Kara <jack@suse.cz>
Cc: Harry Yoo <harry.yoo@oracle.com>,
	linux-mm@kvack.org,  lsf-pc@lists.linux-foundation.org,
	Mateusz Guzik <mjguzik@gmail.com>,
	 Mathieu Desnoyers <mathieu.desnoyers@efficios.com>,
	Gabriel Krisman Bertazi <krisman@suse.de>,
	 Tejun Heo <tj@kernel.org>, Christoph Lameter <cl@gentwo.org>,
	 Dennis Zhou <dennis@kernel.org>,
	Vlastimil Babka <vbabka@suse.cz>, Hao Li <hao.li@linux.dev>
Subject: Re: [LSF/MM/BPF TOPIC] Ways to mitigate limitations of percpu memory allocator
Date: Fri, 6 Mar 2026 15:35:36 +0000	[thread overview]
Message-ID: <qz3f4p2ra6nq5cx3vlacmwif2ih5ojbf7s3ydzw6d7tgqn24lj@pnynq4l6oovc> (raw)
In-Reply-To: <z7bjxfk7jah6zgyikhiz6eqxd3xwywxp745bykcr3sm3p525yi@4diwkjsoyckl>

On Thu, Mar 05, 2026 at 12:48:21PM +0100, Jan Kara wrote:
> On Thu 05-03-26 11:33:21, Pedro Falcato wrote:
> > On Fri, Feb 27, 2026 at 03:41:50PM +0900, Harry Yoo wrote:
> > > Hi folks, I'd like to discuss ways to mitigate limitations of
> > > the percpu memory allocator.
> > > 
> > > While the percpu memory allocator has served its role well,
> > > it has a few problems: 1) global lock contention, and
> > > 2) the lack of features to avoid the high initialization cost of
> > > percpu memory.
> > > 
> > > Global lock contention
> > > =======================
> > > 
> > > The percpu allocator takes a global lock when allocating or freeing
> > > memory. An obvious mitigation is to cache percpu memory, but caching
> > > is not always worth it, because it can meaningfully increase memory
> > > usage.
> > > 
> > > However, some users (e.g., fork+exec, tc filter) suffer from
> > > the lock contention when many CPUs allocate / free percpu memory
> > > concurrently.
> > > 
> > > Therefore, we need a way to cache percpu memory per CPU in a
> > > selective way. As an opt-in approach, Mateusz Guzik proposed [1]
> > > keeping percpu memory in slab objects and letting slab cache them
> > > per CPU with a slab ctor+dtor pair: allocate the percpu memory and
> > > associate it with the slab object in the constructor, and free it
> > > when the slab itself is deallocated (which requires resurrecting
> > > the slab destructor feature).
> > > 
> > > This only works when percpu memory is associated with slab objects.
> > > I would like to hear whether anybody thinks it's still worth
> > > redesigning the percpu memory allocator for better scalability.
> > 
> > I think this (make alloc_percpu actually scale) is the obvious suggestion.
> > Everything else is just papering over the cracks.
> 
> I disagree. There are two separate (although related) issues that need
> solving. One issue is certainly the scalability of the percpu allocator.
> Another issue (which is also visible in single-threaded workloads) is that
> creating a percpu counter has a rather large cost even if the allocator is
> totally uncontended - this is because of the initialization (and final
> summarization) cost. This is very visible e.g. in fork()-intensive
> workloads such as shell scripts, where we currently allocate several
> percpu arrays for each fork() and a significant part of the fork() cost is
> the initialization of those percpu arrays on larger machines. Reducing
> this overhead is a separate goal.

I agree that it's a separate issue. But it's as much of an issue for
single-threaded processes as for multi-threaded ones. Say you have a 64-core
CPU. Why should you pay to initialize 64 separate per-CPU slots when you only
spawned 2 threads? (And yes, this is a not-so-rare situation: lld, for
example, spawns up to 16 threads (https://reviews.llvm.org/D147493), even if
you have hundreds of CPUs.)

So perhaps the best way to approach this problem would be to go back to
per-task RSS accounting. That scheme had accuracy problems with many tasks,
while the current one has accuracy problems with many CPUs. A
single-threaded optimization could paper over the problem for the vast
majority of programs, but exceptions exist.

Or another possible idea: lazily initialize these per-CPU counters somehow,
e.g. on task switch.

I'm afraid that while the solution presented by Mathieu fixes a problem with
the current scheme (wild inaccuracy at large CPU counts), it might also add
to the percpu allocation + initialization cost (this might not be true; I
have not paid too much attention).

-- 
Pedro


Thread overview: 8+ messages
2026-02-27  6:41 Harry Yoo
2026-03-04 17:50 ` Gabriel Krisman Bertazi
2026-03-05  4:24   ` Mathieu Desnoyers
2026-03-05 10:05     ` Jan Kara
2026-03-05 11:33 ` Pedro Falcato
2026-03-05 11:48   ` Jan Kara
2026-03-06 15:35     ` Pedro Falcato [this message]
2026-03-06 16:26       ` Gabriel Krisman Bertazi
