* [LSF/MM/BPF TOPIC] Ways to mitigate limitations of percpu memory allocator
From: Harry Yoo @ 2026-02-27 6:41 UTC (permalink / raw)
To: linux-mm
Cc: lsf-pc, Mateusz Guzik, Mathieu Desnoyers,
Gabriel Krisman Bertazi, Tejun Heo, Christoph Lameter,
Dennis Zhou, Vlastimil Babka, Hao Li, Jan Kara
Hi folks, I'd like to discuss ways to mitigate the limitations of
the percpu memory allocator.

While the percpu memory allocator has served its role well,
it has a few problems: 1) contention on its global lock, and
2) a lack of features to avoid the high initialization cost of percpu
memory.
Global lock contention
======================
The percpu allocator takes a global lock whenever it allocates or frees
memory. Caching percpu memory would reduce how often that lock is taken,
but of course caching is not always worth it, because it would
meaningfully increase memory usage.
However, some users (e.g., fork+exec, tc filter) suffer from
the lock contention when many CPUs allocate / free percpu memory
concurrently.
So we need a way to cache percpu memory per CPU, selectively.
As an opt-in approach, Mateusz Guzik proposed [1] keeping percpu
memory in slab objects and letting slab cache them per CPU,
using a slab ctor+dtor pair: allocate percpu memory and
associate it with the slab object in the constructor, and free it in
the destructor when the slab is deallocated (this requires resurrecting
the slab destructor feature).
This approach only works when percpu memory is associated with slab
objects. I would like to hear if anybody thinks it's still worth
redesigning the percpu memory allocator for better scalability.
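
To make the ctor+dtor scheme concrete, here is a rough user-space
sketch. It is not the code from [1]. Every name in it (NR_CPUS_SIM,
fake_pcpu_alloc(), struct obj, obj_cache_*()) is made up for
illustration, a plain counter stands in for the allocator's global lock,
and a small array stands in for slab's per-CPU caching:

```c
#include <assert.h>
#include <stdlib.h>

#define NR_CPUS_SIM 4	/* hypothetical CPU count for the simulation */
#define CACHE_SLOTS 8	/* hypothetical capacity of the object cache */

/* Stands in for the percpu allocator's global lock. */
static int global_lock_acquisitions;

/* Stand-in for a percpu allocation: every call "takes" the global lock. */
static long *fake_pcpu_alloc(void)
{
	global_lock_acquisitions++;
	return calloc(NR_CPUS_SIM, sizeof(long));
}

static void fake_pcpu_free(long *p)
{
	global_lock_acquisitions++;
	free(p);
}

struct obj {
	long *pcpu;	/* stays attached while the object sits in the cache */
};

/* Constructor: runs once, when the cache creates the object. */
static int obj_ctor(struct obj *o)
{
	o->pcpu = fake_pcpu_alloc();
	return o->pcpu ? 0 : -1;
}

/* Destructor: runs only when the cache gives the object's memory back. */
static void obj_dtor(struct obj *o)
{
	fake_pcpu_free(o->pcpu);
	o->pcpu = NULL;
}

struct obj_cache {
	struct obj *free_objs[CACHE_SLOTS];
	int nr_free;
};

static struct obj *obj_cache_alloc(struct obj_cache *c)
{
	struct obj *o;

	if (c->nr_free > 0)
		return c->free_objs[--c->nr_free]; /* fast path: no ctor, no lock */

	o = malloc(sizeof(*o));
	if (o && obj_ctor(o)) {
		free(o);
		o = NULL;
	}
	return o;
}

static void obj_cache_free(struct obj_cache *c, struct obj *o)
{
	if (c->nr_free < CACHE_SLOTS) {
		c->free_objs[c->nr_free++] = o; /* keep the percpu area attached */
		return;
	}
	obj_dtor(o);
	free(o);
}

/* Tear down the cache, running the destructor on everything it holds. */
static void obj_cache_destroy(struct obj_cache *c)
{
	while (c->nr_free > 0) {
		struct obj *o = c->free_objs[--c->nr_free];

		obj_dtor(o);
		free(o);
	}
}

/* Allocate and free one object 'rounds' times; return lock acquisitions. */
static int obj_cache_demo(int rounds)
{
	struct obj_cache cache = { .nr_free = 0 };
	int i;

	global_lock_acquisitions = 0;
	for (i = 0; i < rounds; i++) {
		struct obj *o = obj_cache_alloc(&cache);

		assert(o);
		obj_cache_free(&cache, o);
	}
	obj_cache_destroy(&cache);
	return global_lock_acquisitions;
}
```

In this toy model, obj_cache_demo() takes the simulated lock once in the
constructor and once in the destructor, however many alloc/free rounds
run in between. Calling fake_pcpu_alloc()/fake_pcpu_free() directly on
every round would take it twice per round.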
Initialization of percpu data has high overhead
===============================================
Initializing percpu data has non-negligible overhead on systems with
many CPUs. A few approaches have been proposed to mitigate this.
I'd like to discuss the status of these ideas, and whether there are
other approaches worth exploring.
Slab constructor + destructor pair
----------------------------------
Unlike slab, the percpu allocator doesn't distinguish object types,
and it doesn't support constructors that could avoid re-initializing
objects on every allocation.

One solution is to use a slab ctor+dtor pair: as long as a certain
state is preserved on free (e.g. the sum of a percpu counter is zero),
initialization needs to be done only once, at construction.
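
The "state preserved on free" condition can be illustrated with a small
user-space sketch. All names here (NR_CPUS_SIM, struct pcpu_counter_sim,
counter_get()/counter_put()) are invented for the example, and a single
static object stands in for a slab cache:

```c
#include <assert.h>

#define NR_CPUS_SIM 8	/* hypothetical CPU count for the simulation */

static int slots_initialized;	/* per-CPU slot writes done by the ctor */

struct pcpu_counter_sim {
	long slot[NR_CPUS_SIM];
};

/* Constructor: the only place that walks every CPU's slot. */
static void counter_ctor(struct pcpu_counter_sim *c)
{
	int cpu;

	for (cpu = 0; cpu < NR_CPUS_SIM; cpu++) {
		c->slot[cpu] = 0;
		slots_initialized++;
	}
}

static void counter_add(struct pcpu_counter_sim *c, int cpu, long delta)
{
	c->slot[cpu] += delta;
}

static long counter_sum(const struct pcpu_counter_sim *c)
{
	long sum = 0;
	int cpu;

	for (cpu = 0; cpu < NR_CPUS_SIM; cpu++)
		sum += c->slot[cpu];
	return sum;
}

/* One cached object stands in for a whole slab cache. */
static struct pcpu_counter_sim backing;
static int backing_constructed;

static struct pcpu_counter_sim *counter_get(void)
{
	if (!backing_constructed) {
		counter_ctor(&backing);
		backing_constructed = 1;
	}
	return &backing;	/* reused as-is: no per-CPU re-initialization */
}

static void counter_put(struct pcpu_counter_sim *c)
{
	/* The invariant that makes init-once valid: the sum is zero on free. */
	assert(counter_sum(c) == 0);
}

/* Reuse the counter 'rounds' times; return how many slots were initialized. */
static int init_once_demo(int rounds)
{
	int i;

	slots_initialized = 0;
	backing_constructed = 0;
	for (i = 0; i < rounds; i++) {
		struct pcpu_counter_sim *c = counter_get();

		assert(counter_sum(c) == 0);
		counter_add(c, 0, 5);	/* charged on one CPU ... */
		counter_add(c, 3, -5);	/* ... uncharged on another */
		counter_put(c);
	}
	return slots_initialized;
}
```

Note that individual slots end up non-zero after the first round. Only
the sum has to be zero on free, so nothing needs to be rewritten before
the object is handed out again.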
Dual-mode percpu counters
-------------------------
Gabriel Krisman Bertazi proposed [2] introducing dual-mode percpu
counters: single-threaded tasks use a simple counter, which is cheaper
to initialize. Later, when a new task is spawned, the counter is
upgraded to a more expensive, full-fledged percpu counter.
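
As a rough illustration of the dual-mode idea (not the design from [2]),
the user-space sketch below starts with a plain counter and switches to
a per-CPU array on upgrade. struct dual_counter and its helpers are
hypothetical names, and the sketch ignores the synchronization a real
upgrade would need:

```c
#include <assert.h>
#include <stdlib.h>

#define NR_CPUS_SIM 4	/* hypothetical CPU count for the simulation */

struct dual_counter {
	long single;	/* used while the owner is single-threaded */
	long *pcpu;	/* NULL until the counter is upgraded */
};

/* Cheap init: constant time, no per-CPU memory is touched. */
static void dual_counter_init(struct dual_counter *c)
{
	c->single = 0;
	c->pcpu = NULL;
}

/* Switch to per-CPU mode, carrying the accumulated value over. */
static int dual_counter_upgrade(struct dual_counter *c)
{
	if (c->pcpu)
		return 0;	/* already upgraded */

	c->pcpu = calloc(NR_CPUS_SIM, sizeof(*c->pcpu));
	if (!c->pcpu)
		return -1;

	c->pcpu[0] = c->single;
	c->single = 0;
	return 0;
}

static void dual_counter_add(struct dual_counter *c, int cpu, long delta)
{
	if (c->pcpu)
		c->pcpu[cpu] += delta;
	else
		c->single += delta;
}

static long dual_counter_sum(const struct dual_counter *c)
{
	long sum = c->single;
	int cpu;

	if (c->pcpu)
		for (cpu = 0; cpu < NR_CPUS_SIM; cpu++)
			sum += c->pcpu[cpu];
	return sum;
}

static void dual_counter_destroy(struct dual_counter *c)
{
	free(c->pcpu);
	c->pcpu = NULL;
}

/* Count in simple mode, upgrade, keep counting; return the final sum. */
static long dual_counter_demo(void)
{
	struct dual_counter c;
	long sum;

	dual_counter_init(&c);
	dual_counter_add(&c, 0, 3);	/* single-threaded phase */
	if (dual_counter_upgrade(&c))	/* "a new task is spawned" */
		return -1;
	dual_counter_add(&c, 1, 4);
	dual_counter_add(&c, 2, -2);
	sum = dual_counter_sum(&c);
	dual_counter_destroy(&c);
	return sum;
}
```

A counter that is never upgraded never pays for the per-CPU allocation
or its initialization, which is the saving this approach is after.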
On-demand initialization of mm_cid counters
-------------------------------------------
Mathieu Desnoyers proposed [3] initializing mm_cid counters on demand
at clone time, instead of initializing them for all CPUs on every
allocation.
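
The effect of deferring the per-CPU initialization can be sketched in
user space as follows. This is only my reading of the idea, not the
code from [3]: struct mm_sim, mm_sim_clone() and the use of -1 as the
"unset" value are all assumptions made for the example.

```c
#include <assert.h>
#include <stdbool.h>

#define NR_CPUS_SIM 8	/* hypothetical CPU count for the simulation */

static int slot_writes;	/* per-CPU slot writes performed so far */

struct mm_sim {
	bool cid_ready;		/* have the per-CPU slots been set up? */
	int cid[NR_CPUS_SIM];	/* stand-in for per-CPU mm_cid state */
};

/* Allocation path: constant time, per-CPU setup is deferred. */
static void mm_sim_alloc(struct mm_sim *mm)
{
	mm->cid_ready = false;
}

static void mm_sim_init_cids(struct mm_sim *mm)
{
	int cpu;

	for (cpu = 0; cpu < NR_CPUS_SIM; cpu++) {
		mm->cid[cpu] = -1;	/* assumed "unset" value */
		slot_writes++;
	}
	mm->cid_ready = true;
}

/* Hypothetical clone hook: the first clone pays for the initialization. */
static void mm_sim_clone(struct mm_sim *mm)
{
	if (!mm->cid_ready)
		mm_sim_init_cids(mm);
}

/*
 * Allocate 'nr_mms' address spaces, of which the first 'nr_threaded'
 * go on to clone twice. Return the total number of slot writes.
 */
static int on_demand_demo(int nr_mms, int nr_threaded)
{
	struct mm_sim mm;
	int i;

	slot_writes = 0;
	for (i = 0; i < nr_mms; i++) {
		mm_sim_alloc(&mm);
		if (i < nr_threaded) {
			mm_sim_clone(&mm);
			mm_sim_clone(&mm); /* already initialized: no-op */
		}
	}
	return slot_writes;
}
```

In this model the cost scales with the number of address spaces that
actually clone, not with the number that are allocated: eager
initialization would perform nr_mms * NR_CPUS_SIM slot writes, while the
deferred version performs nr_threaded * NR_CPUS_SIM.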
[1] https://lore.kernel.org/linux-mm/20250424080755.272925-1-harry.yoo@oracle.com
[2] https://lore.kernel.org/linux-mm/20251127233635.4170047-1-krisman@suse.de
[3] https://lore.kernel.org/linux-mm/355143c9-78c7-4da1-9033-5ae6fa50efad@efficios.com
--
Cheers,
Harry / Hyeonggon
* Re: [LSF/MM/BPF TOPIC] Ways to mitigate limitations of percpu memory allocator
From: Gabriel Krisman Bertazi @ 2026-03-04 17:50 UTC (permalink / raw)
To: Harry Yoo
Cc: linux-mm, lsf-pc, Mateusz Guzik, Mathieu Desnoyers, Tejun Heo,
Christoph Lameter, Dennis Zhou, Vlastimil Babka, Hao Li,
Jan Kara
Harry Yoo <harry.yoo@oracle.com> writes:
> Hi folks, I'd like to discuss ways to mitigate limitations of
> percpu memory allocator.
>
> While the percpu memory allocator has served its role well,
> it has a few problems: 1) its global lock contention, and
> 2) lack of features to avoid high initialization cost of percpu
> memory.
I won't be going to LSF this year, but Jan was the proponent of my
dual-mode pcpu initialization work and he'll be around. I'm also not
sure this requires a full session, as it might not attract broader
interest. Are Mathieu and Mateusz attending?
--
Gabriel Krisman Bertazi