* [LSF/MM/BPF TOPIC] Ways to mitigate limitations of percpu memory allocator
@ 2026-02-27  6:41 Harry Yoo
  2026-03-04 17:50 ` Gabriel Krisman Bertazi
  2026-03-05 11:33 ` Pedro Falcato
  0 siblings, 2 replies; 8+ messages in thread
From: Harry Yoo @ 2026-02-27  6:41 UTC (permalink / raw)
  To: linux-mm
  Cc: lsf-pc, Mateusz Guzik, Mathieu Desnoyers, Gabriel Krisman Bertazi,
      Tejun Heo, Christoph Lameter, Dennis Zhou, Vlastimil Babka, Hao Li,
      Jan Kara

Hi folks, I'd like to discuss ways to mitigate limitations of the
percpu memory allocator.

While the percpu memory allocator has served its role well, it has a
few problems: 1) global lock contention, and 2) a lack of features to
avoid the high initialization cost of percpu memory.

Global lock contention
======================

The percpu allocator takes a global lock when allocating or freeing
memory. Of course, caching percpu memory is not always worth it,
because it would meaningfully increase memory usage.

However, some users (e.g., fork+exec, tc filter) suffer from the lock
contention when many CPUs allocate / free percpu memory concurrently.

Therefore, we need a way to cache percpu memory per CPU, in a
selective way. As an opt-in approach, Mateusz Guzik proposed [1]
keeping percpu memory in slab objects and letting slab cache them per
CPU, with a slab ctor+dtor pair: allocate percpu memory and associate
it with the slab object in the constructor, and free it when the slab
is deallocated (with the resurrecting slab destructor feature).

This only works when percpu memory is associated with slab objects.
I would like to hear if anybody thinks it's still worth redesigning
the percpu memory allocator for better scalability.

Initialization of percpu data has high overhead
===============================================

Initializing percpu data has non-negligible overhead on systems with
many CPUs. There have been a few approaches proposed to mitigate this.
I'd like to discuss the status of the ideas proposed, and potentially
whether there are other approaches worth exploring.

Slab constructor + destructor pair
----------------------------------

Unlike slab, the percpu allocator doesn't distinguish between types of
objects, and it doesn't support constructors that could avoid
re-initializing them on every allocation. One solution to this is
using a slab ctor+dtor pair; as long as a certain state is preserved
on free (e.g. the sum of a percpu counter is zero), initialization
needs to be done only once, at construction time.

Dual-mode percpu counters
-------------------------

Gabriel Krisman Bertazi proposed [2] introducing dual-mode percpu
counters; single-threaded tasks use a simple counter, which is cheaper
to initialize. Later, when a new task is spawned, it is upgraded to a
more expensive, full-fledged counter.

On-demand initialization of mm_cid counters
-------------------------------------------

Mathieu Desnoyers proposed [3] initializing mm_cid counters on demand
on clone instead of initializing them for all CPUs on every allocation.

[1] https://lore.kernel.org/linux-mm/20250424080755.272925-1-harry.yoo@oracle.com
[2] https://lore.kernel.org/linux-mm/20251127233635.4170047-1-krisman@suse.de
[3] https://lore.kernel.org/linux-mm/355143c9-78c7-4da1-9033-5ae6fa50efad@efficios.com

-- 
Cheers,
Harry / Hyeonggon
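[Editor's illustration: the dual-mode counter idea above can be sketched as a
userspace C model. This is not the proposed kernel patch; all names
(`dual_counter`, `counter_upgrade`, `NCPUS`) are hypothetical, "CPU" slots are
modeled as a heap array, and a real implementation would need atomics/RCU to
make the upgrade safe against concurrent writers.]

```c
#include <stdlib.h>

#define NCPUS 8  /* hypothetical stand-in for num_possible_cpus() */

/* Dual-mode counter: a plain long while the task is single-threaded,
 * upgraded to a per-"CPU" array once a second thread is spawned. */
struct dual_counter {
	int percpu;   /* 0: simple mode, 1: per-CPU mode */
	long simple;  /* used in simple mode */
	long *slots;  /* used in per-CPU mode, NCPUS slots */
};

static void counter_init(struct dual_counter *c)
{
	/* Cheap: no per-CPU allocation or zeroing for the common
	 * single-threaded case. */
	c->percpu = 0;
	c->simple = 0;
	c->slots = NULL;
}

/* Upgrade on first clone(): pay the per-CPU allocation cost only for
 * tasks that actually become multi-threaded. */
static void counter_upgrade(struct dual_counter *c)
{
	if (c->percpu)
		return;
	c->slots = calloc(NCPUS, sizeof(long));
	c->slots[0] = c->simple;  /* carry over the accumulated value */
	c->percpu = 1;
}

static void counter_add(struct dual_counter *c, int cpu, long v)
{
	if (c->percpu)
		c->slots[cpu] += v;  /* no cross-CPU contention */
	else
		c->simple += v;      /* single writer, no atomics needed */
}

static long counter_sum(const struct dual_counter *c)
{
	long sum = 0;
	if (!c->percpu)
		return c->simple;
	for (int i = 0; i < NCPUS; i++)
		sum += c->slots[i];
	return sum;
}
```

The point of the sketch: reads and writes go through one function, so callers
never see which mode the counter is in, and single-threaded tasks never touch
NCPUS cache lines.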
* Re: [LSF/MM/BPF TOPIC] Ways to mitigate limitations of percpu memory allocator
  2026-02-27  6:41 [LSF/MM/BPF TOPIC] Ways to mitigate limitations of percpu memory allocator Harry Yoo
@ 2026-03-04 17:50 ` Gabriel Krisman Bertazi
  2026-03-05  4:24   ` Mathieu Desnoyers
  2026-03-05 11:33 ` Pedro Falcato
  1 sibling, 1 reply; 8+ messages in thread
From: Gabriel Krisman Bertazi @ 2026-03-04 17:50 UTC (permalink / raw)
  To: Harry Yoo
  Cc: linux-mm, lsf-pc, Mateusz Guzik, Mathieu Desnoyers, Tejun Heo,
      Christoph Lameter, Dennis Zhou, Vlastimil Babka, Hao Li, Jan Kara

Harry Yoo <harry.yoo@oracle.com> writes:

> Hi folks, I'd like to discuss ways to mitigate limitations of
> percpu memory allocator.
>
> While the percpu memory allocator has served its role well,
> it has a few problems: 1) its global lock contention, and
> 2) lack of features to avoid high initialization cost of percpu
> memory.

I won't be going to LSF this year. But Jan was the proponent of my
dual-mode pcpu initialization work and he'll be around. I'm not sure
this requires a full session either, as it might not garner broader
interest. Are Mathieu and Mateusz attending?

[...]

-- 
Gabriel Krisman Bertazi
* Re: [LSF/MM/BPF TOPIC] Ways to mitigate limitations of percpu memory allocator
  2026-03-04 17:50 ` Gabriel Krisman Bertazi
@ 2026-03-05  4:24   ` Mathieu Desnoyers
  2026-03-05 10:05     ` Jan Kara
  0 siblings, 1 reply; 8+ messages in thread
From: Mathieu Desnoyers @ 2026-03-05  4:24 UTC (permalink / raw)
  To: Gabriel Krisman Bertazi, Harry Yoo
  Cc: linux-mm, lsf-pc, Mateusz Guzik, Tejun Heo, Christoph Lameter,
      Dennis Zhou, Vlastimil Babka, Hao Li, Jan Kara

On 2026-03-04 12:50, Gabriel Krisman Bertazi wrote:
[...]
> I won't be going to LSF this year. But Jan was the proponent of my
> dual-mode pcpu initialization work and he'll be around. I'm not sure
> this requires a full session either, as it might not grasp broader
> interest. Are Mathieu and Mateusz attending?

No sorry, I was not planning to attend in person this year.
Is it possible to attend remotely?

Thanks,

Mathieu

-- 
Mathieu Desnoyers
EfficiOS Inc.
https://www.efficios.com
* Re: [LSF/MM/BPF TOPIC] Ways to mitigate limitations of percpu memory allocator
  2026-03-05  4:24 ` Mathieu Desnoyers
@ 2026-03-05 10:05   ` Jan Kara
  0 siblings, 0 replies; 8+ messages in thread
From: Jan Kara @ 2026-03-05 10:05 UTC (permalink / raw)
  To: Mathieu Desnoyers
  Cc: Gabriel Krisman Bertazi, Harry Yoo, linux-mm, lsf-pc,
      Mateusz Guzik, Tejun Heo, Christoph Lameter, Dennis Zhou,
      Vlastimil Babka, Hao Li, Jan Kara

On Wed 04-03-26 23:24:37, Mathieu Desnoyers wrote:
> On 2026-03-04 12:50, Gabriel Krisman Bertazi wrote:
[...]
> > I won't be going to LSF this year. But Jan was the proponent of my
> > dual-mode pcpu initialization work and he'll be around. I'm not sure
> > this requires a full session either, as it might not grasp broader
> > interest. Are Mathieu and Mateusz attending?
>
> No sorry, I was not planning to attend in person this year.
> Is it possible to attend remotely ?

We don't offer a general remote attendance option, but we will have two
meeting owls available on site, so we can set something up on a
case-by-case basis. If you're interested in remotely participating in
this session, please talk to the MM track leaders about how / whether
they can accommodate this.

								Honza
-- 
Jan Kara <jack@suse.com>
SUSE Labs, CR
* Re: [LSF/MM/BPF TOPIC] Ways to mitigate limitations of percpu memory allocator
  2026-02-27  6:41 [LSF/MM/BPF TOPIC] Ways to mitigate limitations of percpu memory allocator Harry Yoo
  2026-03-04 17:50 ` Gabriel Krisman Bertazi
@ 2026-03-05 11:33 ` Pedro Falcato
  2026-03-05 11:48   ` Jan Kara
  1 sibling, 1 reply; 8+ messages in thread
From: Pedro Falcato @ 2026-03-05 11:33 UTC (permalink / raw)
  To: Harry Yoo
  Cc: linux-mm, lsf-pc, Mateusz Guzik, Mathieu Desnoyers,
      Gabriel Krisman Bertazi, Tejun Heo, Christoph Lameter, Dennis Zhou,
      Vlastimil Babka, Hao Li, Jan Kara

On Fri, Feb 27, 2026 at 03:41:50PM +0900, Harry Yoo wrote:
[...]
> This only works when percpu memory is associated with slab objects.
> I would like to hear if anybody thinks it's still worth redesigning
> percpu memory allocator for better scalability.

I think this (make alloc_percpu actually scale) is the obvious
suggestion. Everything else is just papering over the cracks.

> Slab constructor + destructor Pair
> ----------------------------------
>
> Percpu allocator doesn't distinguish types of objects
> unlike slab and it doesn't support constructors that could avoid
> re-initializing them on every allocation.
> One solution to this is using slab ctor+dtor pair; as long as a certain
> state is preserved on free (e.g. sum of percpu counter is zero),
> initialization needs to be done only once on construction.

As I said way back when, making an object permanently accessible
(a la TYPESAFE_BY_RCU) is screwy and messes with the object lifetime
too much. Not to mention the locking problems that we discussed
back-and-forth.

-- 
Pedro
* Re: [LSF/MM/BPF TOPIC] Ways to mitigate limitations of percpu memory allocator
  2026-03-05 11:33 ` Pedro Falcato
@ 2026-03-05 11:48   ` Jan Kara
  2026-03-06 15:35     ` Pedro Falcato
  0 siblings, 1 reply; 8+ messages in thread
From: Jan Kara @ 2026-03-05 11:48 UTC (permalink / raw)
  To: Pedro Falcato
  Cc: Harry Yoo, linux-mm, lsf-pc, Mateusz Guzik, Mathieu Desnoyers,
      Gabriel Krisman Bertazi, Tejun Heo, Christoph Lameter, Dennis Zhou,
      Vlastimil Babka, Hao Li, Jan Kara

On Thu 05-03-26 11:33:21, Pedro Falcato wrote:
> On Fri, Feb 27, 2026 at 03:41:50PM +0900, Harry Yoo wrote:
[...]
> > I would like to hear if anybody thinks it's still worth redesigning
> > percpu memory allocator for better scalability.
>
> I think this (make alloc_percpu actually scale) is the obvious suggestion.
> Everything else is just papering over the cracks.

I disagree. There are two separate (although related) issues that need
solving. One issue is certainly the scalability of the percpu
allocator. Another issue (which is also visible in singlethreaded
workloads) is that percpu counter creation has a rather large cost even
if the allocator is totally uncontended - this is because of the
initialization (and final summarization) cost. And this is very visible
e.g. in fork()-intensive loads such as shell scripts, where we
currently allocate several percpu arrays for each fork(), and a
significant part of the fork() cost on larger machines is currently the
initialization of percpu arrays. Reducing this overhead is a separate
goal.

> > Slab constructor + destructor Pair
> > ----------------------------------
> >
> > Percpu allocator doesn't distinguish types of objects
> > unlike slab and it doesn't support constructors that could avoid
> > re-initializing them on every allocation.
> > One solution to this is using slab ctor+dtor pair; as long as a certain
> > state is preserved on free (e.g. sum of percpu counter is zero),
> > initialization needs to be done only once on construction.
>
> As I said way back when, making an object permanently accessible
> a-la TYPESAFE_BY_RCU) is screwey and messes with the object lifetime
> too much. Not to mention the locking problems that we discussed back-and-forth.

Yeah, I was not enthusiastic about this solution either.

								Honza
-- 
Jan Kara <jack@suse.com>
SUSE Labs, CR
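[Editor's illustration: the ctor+dtor idea relies on the invariant Harry
described - only the *sum* of a percpu counter is observable, so a recycled
counter whose slots sum to zero is indistinguishable from a freshly zeroed
one. A userspace model under assumptions (hypothetical names `pc_ctor`,
`pc_recycle`, `NCPUS`; the real kernel counters also batch through a shared
field, omitted here):]

```c
#include <assert.h>
#include <string.h>

#define NCPUS 256  /* hypothetical stand-in for a large machine */

/* Model of a percpu counter: one slot per possible CPU.
 * Only the sum of the slots is ever exposed to users. */
struct pc {
	long slots[NCPUS];
};

/* Constructor: the O(NCPUS) zeroing, paid once per slab object. */
static void pc_ctor(struct pc *c)
{
	memset(c->slots, 0, sizeof(c->slots));
}

static void pc_add(struct pc *c, int cpu, long v)
{
	c->slots[cpu] += v;
}

static long pc_sum(const struct pc *c)
{
	long sum = 0;
	for (int i = 0; i < NCPUS; i++)
		sum += c->slots[i];
	return sum;
}

/* "Free" back to the cache: if the user guarantees the sum is back to
 * zero on teardown (e.g. every RSS page that was mapped was also
 * unmapped), the slots may hold nonzero residue like {+5, -5, 0, ...},
 * but since only the sum is observable, the next user can start from
 * this state directly - no O(NCPUS) re-initialization needed. */
static void pc_recycle(struct pc *c)
{
	assert(pc_sum(c) == 0);  /* the invariant a dtor would rely on */
}
```

In this model the expensive memset happens once per object lifetime in the
cache, not once per fork(), which is the saving Harry's proposal is after.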
* Re: [LSF/MM/BPF TOPIC] Ways to mitigate limitations of percpu memory allocator
  2026-03-05 11:48 ` Jan Kara
@ 2026-03-06 15:35   ` Pedro Falcato
  2026-03-06 16:26     ` Gabriel Krisman Bertazi
  0 siblings, 1 reply; 8+ messages in thread
From: Pedro Falcato @ 2026-03-06 15:35 UTC (permalink / raw)
  To: Jan Kara
  Cc: Harry Yoo, linux-mm, lsf-pc, Mateusz Guzik, Mathieu Desnoyers,
      Gabriel Krisman Bertazi, Tejun Heo, Christoph Lameter, Dennis Zhou,
      Vlastimil Babka, Hao Li

On Thu, Mar 05, 2026 at 12:48:21PM +0100, Jan Kara wrote:
[...]
> I disagree. There are two separate (although related) issues that need
> solving. One issue is certainly scalability of the percpu allocator.
> Another issue (which is also visible in singlethreaded workloads) is that
> a percpu counter creation has a rather large cost even if the allocator is
> totally uncontended - this is because of the initialization (and final
> summarization) cost. And this is very visible e.g. in the fork() intensive
> loads such as shell scripts where we currently allocate several percpu
> arrays for each fork() and significant part of the fork() cost is currently
> the initialization of percpu arrays on larger machines. Reducing this
> overhead is a separate goal.

I agree that it's a separate issue. But it's as much of an issue for
single-threaded processes as for multi-threaded ones. Say you have a
64-core CPU. Why should you pay for 64 separate per-CPU counters when
you only spawned 2 threads? (And yes, this is a not-so-rare situation:
lld, for example, spawns up to 16 threads
(https://reviews.llvm.org/D147493), even if you have hundreds of CPUs.)

So perhaps the best way to go about this problem would be to go back to
per-task RSS accounting. That one had problems with many-task RSS
accuracy, but the current one has problems with many-CPU RSS accuracy.
A single-threaded optimization could patch over the problem for the
vast majority of programs, but exceptions exist.

Or another possible idea: lazily initialize these cpu counters somehow,
on task switch.

I'm afraid that while the solution presented by Mathieu fixes a problem
with the current scheme (insane inaccuracy on large-cpu-count), it
might also add to the percpu allocation + init problem (this might not
be true, I have not paid too much attention).

-- 
Pedro
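[Editor's illustration: one way to read the "lazily initialize on task
switch" idea is to defer zeroing a CPU's slot until the task first runs
there. A userspace sketch under assumptions (first-touch is modeled as the
first add from a CPU; all names such as `lazy_pc` are hypothetical, and the
kernel would need to handle preemption and memory ordering):]

```c
#include <string.h>

#define NCPUS 64  /* hypothetical CPU count */

/* Lazy per-CPU slot initialization: instead of zeroing all NCPUS slots
 * at fork(), zero a slot the first time the task actually touches that
 * CPU. A bitmap tracks which slots are valid. */
struct lazy_pc {
	unsigned long live[(NCPUS + 63) / 64];  /* which slots are valid */
	long slots[NCPUS];                      /* deliberately not zeroed */
};

static void lazy_pc_init(struct lazy_pc *c)
{
	/* O(NCPUS/64) bitmap words instead of O(NCPUS) counter slots. */
	memset(c->live, 0, sizeof(c->live));
}

static void lazy_pc_add(struct lazy_pc *c, int cpu, long v)
{
	unsigned long bit = 1UL << (cpu % 64);

	if (!(c->live[cpu / 64] & bit)) {
		c->slots[cpu] = 0;  /* first touch: init this slot only */
		c->live[cpu / 64] |= bit;
	}
	c->slots[cpu] += v;
}

static long lazy_pc_sum(const struct lazy_pc *c)
{
	long sum = 0;

	for (int cpu = 0; cpu < NCPUS; cpu++)
		if (c->live[cpu / 64] & (1UL << (cpu % 64)))
			sum += c->slots[cpu];
	return sum;
}
```

A task that only ever runs on 2 of 64 CPUs then pays for 2 slot
initializations rather than 64, which is the asymmetry Pedro is pointing at.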
* Re: [LSF/MM/BPF TOPIC] Ways to mitigate limitations of percpu memory allocator
  2026-03-06 15:35 ` Pedro Falcato
@ 2026-03-06 16:26   ` Gabriel Krisman Bertazi
  0 siblings, 0 replies; 8+ messages in thread
From: Gabriel Krisman Bertazi @ 2026-03-06 16:26 UTC (permalink / raw)
  To: Pedro Falcato
  Cc: Jan Kara, Harry Yoo, linux-mm, lsf-pc, Mateusz Guzik,
      Mathieu Desnoyers, Tejun Heo, Christoph Lameter, Dennis Zhou,
      Vlastimil Babka, Hao Li

Pedro Falcato <pfalcato@suse.de> writes:

[...]
> I agree that it's a separate issue. But it's as much of an issue for
> single-threaded processes as much as multi-threaded. Say you have a 64 core
> CPU. Why should you pay for 64 separate cores when you only spawned 2 threads?
> (and, yes, this is a not-so-rare situation, like lld which spawns up to 16
> threads (https://reviews.llvm.org/D147493), even if you have hundreds
> of CPUs)

True. Still, being an up-front initialization cost, it is most relevant
the shorter the task lives. I'd imagine that even for something like
lld doing 16 clone syscalls, the overhead of a single percpu counter
initialization is a very small blip in the profile, not worth
special-casing for. The single-threaded case is the obvious
optimizable case in this sense.

> So perhaps the best way to go about this problem would be to go back to
> per-task RSS accounting. This one had problems with many-task RSS accuracy,
> but the current one has problems for many-cpu RSS accuracy.

The current pcpu one has a much smaller accuracy error than per-task,
which justified its inclusion in the first place, no? IIRC, there was a
real use case where the worse accuracy mattered for process selection
during OOM.

> A single-threaded
> optimization could patch over the problem for the vast majority of programs,
> but exceptions exist.
>
> Or another possible idea: lazily initialize these cpu counters somehow,
> on task switch.
>
> I'm afraid that while the solution presented by Mathieu fixes a problem with
> the current scheme (insane inaccuracy on large-cpu-count), it might also add
> to the percpu allocation + init problem (this might not be true, I have not
> paid too much attention).

-- 
Gabriel Krisman Bertazi
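[Editor's illustration: the accuracy trade-off being debated comes from
batching. A percpu-counter-style scheme lets each CPU accumulate locally and
only fold into a shared total every BATCH events, so a cheap read of the
shared total can be stale by up to roughly BATCH * NCPUS. This is a
simplified userspace model, not the kernel's percpu_counter implementation;
the fold policy and names are assumptions:]

```c
#define NCPUS 4   /* hypothetical CPU count */
#define BATCH 32  /* hypothetical batch size */

/* Batched counter: per-CPU residue folds into a shared total only
 * when it reaches BATCH in magnitude. */
struct batched {
	long total;         /* cheap-to-read approximate sum */
	long local[NCPUS];  /* per-CPU residue, |residue| < BATCH */
};

static void batched_add(struct batched *c, int cpu, long v)
{
	c->local[cpu] += v;
	if (c->local[cpu] >= BATCH || c->local[cpu] <= -BATCH) {
		c->total += c->local[cpu];  /* fold into shared total */
		c->local[cpu] = 0;
	}
}

static long batched_read_fast(const struct batched *c)
{
	return c->total;  /* may be off by up to ~BATCH * NCPUS */
}

static long batched_read_exact(const struct batched *c)
{
	long sum = c->total;

	for (int i = 0; i < NCPUS; i++)
		sum += c->local[i];  /* the expensive summarization */
	return sum;
}
```

With many CPUs the BATCH * NCPUS bound grows, which is the "inaccuracy on
large-cpu-count" Pedro mentions; per-task accounting instead accumulates
error proportional to the number of tasks, matching Gabriel's comparison.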
end of thread, other threads:[~2026-03-06 16:26 UTC | newest]

Thread overview: 8+ messages -- links below jump to the message on this page --
2026-02-27  6:41 [LSF/MM/BPF TOPIC] Ways to mitigate limitations of percpu memory allocator Harry Yoo
2026-03-04 17:50 ` Gabriel Krisman Bertazi
2026-03-05  4:24   ` Mathieu Desnoyers
2026-03-05 10:05     ` Jan Kara
2026-03-05 11:33 ` Pedro Falcato
2026-03-05 11:48   ` Jan Kara
2026-03-06 15:35     ` Pedro Falcato
2026-03-06 16:26       ` Gabriel Krisman Bertazi