* [RFC PATCH 0/4] Optimize rss_stat initialization/teardown for single-threaded tasks
@ 2025-11-27 23:36 Gabriel Krisman Bertazi
2025-11-27 23:36 ` [RFC PATCH 1/4] lib/percpu_counter: Split out a helper to insert into hotplug list Gabriel Krisman Bertazi
` (4 more replies)
0 siblings, 5 replies; 19+ messages in thread
From: Gabriel Krisman Bertazi @ 2025-11-27 23:36 UTC (permalink / raw)
To: linux-mm
Cc: Gabriel Krisman Bertazi, linux-kernel, jack, Mateusz Guzik,
Shakeel Butt, Michal Hocko, Mathieu Desnoyers, Dennis Zhou,
Tejun Heo, Christoph Lameter, Andrew Morton, David Hildenbrand,
Lorenzo Stoakes, Liam R. Howlett, Vlastimil Babka, Mike Rapoport,
Suren Baghdasaryan
The cost of the pcpu memory allocation is non-negligible for systems
with many CPUs, and it is quite visible when forking a new task, as has
been reported on a few occasions. In particular, Jan Kara reported that
the commit introducing per-cpu counters for rss_stat caused a 10%
regression in system time for gitsource on his system [1]. On that
occasion, Jan suggested special-casing the single-threaded case: since
we know there won't be frequent remote updates of rss_stat for
single-threaded applications, we can handle it with a local counter for
most updates and an atomic counter for the infrequent remote updates.
This patchset implements that idea.
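Roughly, the idea looks like the sketch below. The names are only
illustrative and assume the kernel's atomic_long_t primitives; the real
API is introduced in patch 2.

/* Rough sketch only -- not the API added by this series. */
struct single_thread_counter {
	long local;		/* updated only from the owning task's context */
	atomic_long_t remote;	/* infrequent updates from other contexts */
};

static inline void stc_add(struct single_thread_counter *c, long v, bool local)
{
	if (local)
		c->local += v;			/* plain arithmetic, no atomics */
	else
		atomic_long_add(v, &c->remote);	/* rare remote update */
}

static inline long stc_sum(struct single_thread_counter *c)
{
	return c->local + atomic_long_read(&c->remote);
}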
It exposes a dual-mode counter that starts as a simple counter, cheap
to initialize for single-threaded tasks, and can be upgraded in flight
to a fully-fledged per-cpu counter later. Patch 3 then modifies the
rss_stat counters to use that structure, forcing the upgrade as soon as
a second task sharing the mm_struct is spawned. By delaying the
initialization cost until the MM is shared, we cover single-threaded
applications fairly cheaply, while not penalizing applications that
spawn multiple threads. On a 256c system, where the pcpu allocation of
the rss_stats is quite noticeable, this reduced the wall-clock time of
an artificial fork-intensive microbenchmark (calling /bin/true in a
loop) by 6% to 15%, depending on the number of cores. In a more
realistic benchmark, it showed a 1.5% improvement in kernbench elapsed
time. More performance data, including profiles, is available in the
patch modifying the rss_stat counters.
While this patchset exposes a single user of this API, it should be
useful in more cases, which is why I made it a proper API. In addition,
considering the recent efforts in this area, such as hierarchical
per-cpu counters (orthogonal to this work, since they improve
multi-threaded workloads), abstracting this behind a new API could help
merge both efforts later.
Finally, this is an RFC because it is early work. In particular, I'd be
interested in more benchmark suggestions, and I'd like feedback on
whether this new interface should be implemented inside percpu_counter
as lazy counters or as a completely separate interface.
Thanks,
[1] https://lore.kernel.org/all/20230608111408.s2minsenlcjow7q3@quack3
---
Cc: linux-kernel@vger.kernel.org
Cc: jack@suse.cz
Cc: Mateusz Guzik <mjguzik@gmail.com>
Cc: Shakeel Butt <shakeel.butt@linux.dev>
Cc: Michal Hocko <mhocko@kernel.org>
Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Cc: Dennis Zhou <dennis@kernel.org>
Cc: Tejun Heo <tj@kernel.org>
Cc: Christoph Lameter <cl@gentwo.org>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: David Hildenbrand <david@redhat.com>
Cc: Lorenzo Stoakes <lorenzo.stoakes@oracle.com>
Cc: "Liam R. Howlett" <Liam.Howlett@oracle.com>
Cc: Vlastimil Babka <vbabka@suse.cz>
Cc: Mike Rapoport <rppt@kernel.org>
Cc: Suren Baghdasaryan <surenb@google.com>
Gabriel Krisman Bertazi (4):
lib/percpu_counter: Split out a helper to insert into hotplug list
lib: Support lazy initialization of per-cpu counters
mm: Avoid percpu MM counters on single-threaded tasks
mm: Split a slow path for updating mm counters
arch/s390/mm/gmap_helpers.c | 4 +-
arch/s390/mm/pgtable.c | 4 +-
fs/exec.c | 2 +-
include/linux/lazy_percpu_counter.h | 145 ++++++++++++++++++++++++++++
include/linux/mm.h | 26 ++---
include/linux/mm_types.h | 4 +-
include/linux/percpu_counter.h | 5 +-
include/trace/events/kmem.h | 4 +-
kernel/events/uprobes.c | 2 +-
kernel/fork.c | 14 ++-
lib/percpu_counter.c | 68 ++++++++++---
mm/filemap.c | 2 +-
mm/huge_memory.c | 22 ++---
mm/khugepaged.c | 6 +-
mm/ksm.c | 2 +-
mm/madvise.c | 2 +-
mm/memory.c | 20 ++--
mm/migrate.c | 2 +-
mm/migrate_device.c | 2 +-
mm/rmap.c | 16 +--
mm/swapfile.c | 6 +-
mm/userfaultfd.c | 2 +-
22 files changed, 276 insertions(+), 84 deletions(-)
create mode 100644 include/linux/lazy_percpu_counter.h
--
2.51.0
^ permalink raw reply [flat|nested] 19+ messages in thread* [RFC PATCH 1/4] lib/percpu_counter: Split out a helper to insert into hotplug list 2025-11-27 23:36 [RFC PATCH 0/4] Optimize rss_stat initialization/teardown for single-threaded tasks Gabriel Krisman Bertazi @ 2025-11-27 23:36 ` Gabriel Krisman Bertazi 2025-11-27 23:36 ` [RFC PATCH 2/4] lib: Support lazy initialization of per-cpu counters Gabriel Krisman Bertazi ` (3 subsequent siblings) 4 siblings, 0 replies; 19+ messages in thread From: Gabriel Krisman Bertazi @ 2025-11-27 23:36 UTC (permalink / raw) To: linux-mm Cc: Gabriel Krisman Bertazi, linux-kernel, jack, Mateusz Guzik, Shakeel Butt, Michal Hocko, Mathieu Desnoyers, Dennis Zhou, Tejun Heo, Christoph Lameter, Andrew Morton, David Hildenbrand, Lorenzo Stoakes, Liam R. Howlett, Vlastimil Babka, Mike Rapoport, Suren Baghdasaryan In preparation to using it with the lazy pcpu counter. Signed-off-by: Gabriel Krisman Bertazi <krisman@suse.de> --- lib/percpu_counter.c | 28 +++++++++++++++++----------- 1 file changed, 17 insertions(+), 11 deletions(-) diff --git a/lib/percpu_counter.c b/lib/percpu_counter.c index 2891f94a11c6..c2322d53f3b1 100644 --- a/lib/percpu_counter.c +++ b/lib/percpu_counter.c @@ -185,11 +185,26 @@ s64 __percpu_counter_sum(struct percpu_counter *fbc) } EXPORT_SYMBOL(__percpu_counter_sum); +static int cpu_hotplug_add_watchlist(struct percpu_counter *fbc, int nr_counters) +{ +#ifdef CONFIG_HOTPLUG_CPU + unsigned long flags; + int i; + + spin_lock_irqsave(&percpu_counters_lock, flags); + for (i = 0; i < nr_counters; i++) { + INIT_LIST_HEAD(&fbc[i].list); + list_add(&fbc[i].list, &percpu_counters); + } + spin_unlock_irqrestore(&percpu_counters_lock, flags); +#endif + return 0; +} + int __percpu_counter_init_many(struct percpu_counter *fbc, s64 amount, gfp_t gfp, u32 nr_counters, struct lock_class_key *key) { - unsigned long flags __maybe_unused; size_t counter_size; s32 __percpu *counters; u32 i; @@ -205,21 +220,12 @@ int __percpu_counter_init_many(struct percpu_counter *fbc, s64 amount, for (i = 0; i < nr_counters; i++) { raw_spin_lock_init(&fbc[i].lock); lockdep_set_class(&fbc[i].lock, key); -#ifdef CONFIG_HOTPLUG_CPU - INIT_LIST_HEAD(&fbc[i].list); -#endif fbc[i].count = amount; fbc[i].counters = (void __percpu *)counters + i * counter_size; debug_percpu_counter_activate(&fbc[i]); } - -#ifdef CONFIG_HOTPLUG_CPU - spin_lock_irqsave(&percpu_counters_lock, flags); - for (i = 0; i < nr_counters; i++) - list_add(&fbc[i].list, &percpu_counters); - spin_unlock_irqrestore(&percpu_counters_lock, flags); -#endif + cpu_hotplug_add_watchlist(fbc, nr_counters); return 0; } EXPORT_SYMBOL(__percpu_counter_init_many); -- 2.51.0 ^ permalink raw reply [flat|nested] 19+ messages in thread
* [RFC PATCH 2/4] lib: Support lazy initialization of per-cpu counters 2025-11-27 23:36 [RFC PATCH 0/4] Optimize rss_stat initialization/teardown for single-threaded tasks Gabriel Krisman Bertazi 2025-11-27 23:36 ` [RFC PATCH 1/4] lib/percpu_counter: Split out a helper to insert into hotplug list Gabriel Krisman Bertazi @ 2025-11-27 23:36 ` Gabriel Krisman Bertazi 2025-11-27 23:36 ` [RFC PATCH 3/4] mm: Avoid percpu MM counters on single-threaded tasks Gabriel Krisman Bertazi ` (2 subsequent siblings) 4 siblings, 0 replies; 19+ messages in thread From: Gabriel Krisman Bertazi @ 2025-11-27 23:36 UTC (permalink / raw) To: linux-mm Cc: Gabriel Krisman Bertazi, linux-kernel, jack, Mateusz Guzik, Shakeel Butt, Michal Hocko, Mathieu Desnoyers, Dennis Zhou, Tejun Heo, Christoph Lameter, Andrew Morton, David Hildenbrand, Lorenzo Stoakes, Liam R. Howlett, Vlastimil Babka, Mike Rapoport, Suren Baghdasaryan While per-cpu counters are efficient when there is a need for frequent updates from different cpus, they have a non-trivial upfront initialization cost, mainly due to the percpu variable allocation. This cost becomes relevant both for short-lived counters and for cases where we don't know beforehand if there will be frequent updates from remote cpus. On both cases, it could have been better to just use a simple counter. The prime example is rss_stats of single-threaded tasks, where the vast majority of counter updates happen from a single-cpu context at a time, except for slowpath cases, such as OOM, khugepage. For those workloads, a simple counter would have sufficed and likely yielded better overall performance if the tasks were sufficiently short. There is no end of examples of short-lived single-thread workloads, in particular coreutils tools. This patch introduces a new counter flavor that delays the percpu initialization until needed. It is a dual-mode counter. It starts with a two-part counter that can be updated either from a local context through simple arithmetic or from a remote context through an atomic operation. Once remote accesses become more frequent, and the user considers the overhead of atomic updates surpasses the cost of initializing a fully-fledged per-cpu counter, the user can seamlessly upgrade the counter to the per-cpu counter. The first user of this are the rss_stat counters. Benchmarks results are provided on that patch. Suggested-by: Jan Kara <jack@suse.cz> Signed-off-by: Gabriel Krisman Bertazi <krisman@suse.de> --- include/linux/lazy_percpu_counter.h | 145 ++++++++++++++++++++++++++++ include/linux/percpu_counter.h | 5 +- lib/percpu_counter.c | 40 ++++++++ 3 files changed, 189 insertions(+), 1 deletion(-) create mode 100644 include/linux/lazy_percpu_counter.h diff --git a/include/linux/lazy_percpu_counter.h b/include/linux/lazy_percpu_counter.h new file mode 100644 index 000000000000..7300b8c33507 --- /dev/null +++ b/include/linux/lazy_percpu_counter.h @@ -0,0 +1,145 @@ +/* SPDX-License-Identifier: GPL-2.0 */ +#include <linux/percpu_counter.h> +#ifndef _LAZY_PERCPU_COUNTER +#define _LAZY_PERCPU_COUNTER + +/* Lazy percpu counter is a bi-modal distributed counter structure that + * starts off as a simple counter and can be upgraded to a full per-cpu + * counter when the user considers more non-local updates are likely to + * happen more frequently in the future. It is useful when non-local + * updates are rare, but might become more frequent after other + * operations. 
+ * + * - Lazy-mode: + * + * Local updates are handled with a simple variable write, while + * non-local updates are handled through an atomic operation. Once + * non-local updates become more likely to happen in the future, the + * user can upgrade the counter, turning it into a normal + * per-cpu counter. + * + * Concurrency safety of 'local' accesses must be guaranteed by the + * caller API, either through task-local accesses or by external locks. + * + * In the initial lazy-mode, read is guaranteed to be exact only when + * reading from the local context with lazy_percpu_counter_sum_local. + * + * - Non-lazy-mode: + * Behaves as a per-cpu counter. + */ + +struct lazy_percpu_counter { + struct percpu_counter c; +}; + +#define LAZY_INIT_BIAS (1<<0) + +static inline s64 add_bias(long val) +{ + return (val << 1) | LAZY_INIT_BIAS; +} +static inline s64 remove_bias(long val) +{ + return val >> 1; +} + +static inline bool lazy_percpu_counter_initialized(struct lazy_percpu_counter *lpc) +{ + return !(atomic_long_read(&lpc->c.remote) & LAZY_INIT_BIAS); +} + +static inline void lazy_percpu_counter_init_many(struct lazy_percpu_counter *lpc, int amount, + int nr_counters) +{ + for (int i = 0; i < nr_counters; i++) { + lpc[i].c.count = amount; + atomic_long_set(&lpc[i].c.remote, LAZY_INIT_BIAS); + raw_spin_lock_init(&lpc[i].c.lock); + } +} + +static inline void lazy_percpu_counter_add_atomic(struct lazy_percpu_counter *lpc, s64 amount) +{ + long x = amount << 1; + long counter; + + do { + counter = atomic_long_read(&lpc->c.remote); + if (!(counter & LAZY_INIT_BIAS)) { + percpu_counter_add(&lpc->c, amount); + return; + } + } while (atomic_long_cmpxchg_relaxed(&lpc->c.remote, counter, (counter+x)) != counter); +} + +static inline void lazy_percpu_counter_add_fast(struct lazy_percpu_counter *lpc, s64 amount) +{ + if (lazy_percpu_counter_initialized(lpc)) + percpu_counter_add(&lpc->c, amount); + else + lpc->c.count += amount; +} + +/* + * lazy_percpu_counter_sync needs to be protected against concurrent + * local updates. + */ +static inline s64 lazy_percpu_counter_sum_local(struct lazy_percpu_counter *lpc) +{ + if (lazy_percpu_counter_initialized(lpc)) + return percpu_counter_sum(&lpc->c); + + lazy_percpu_counter_add_atomic(lpc, lpc->c.count); + lpc->c.count = 0; + return remove_bias(atomic_long_read(&lpc->c.remote)); +} + +static inline s64 lazy_percpu_counter_sum(struct lazy_percpu_counter *lpc) +{ + if (lazy_percpu_counter_initialized(lpc)) + return percpu_counter_sum(&lpc->c); + return remove_bias(atomic_long_read(&lpc->c.remote)) + lpc->c.count; +} + +static inline s64 lazy_percpu_counter_sum_positive(struct lazy_percpu_counter *lpc) +{ + s64 val = lazy_percpu_counter_sum(lpc); + + return (val > 0) ? val : 0; +} + +static inline s64 lazy_percpu_counter_read(struct lazy_percpu_counter *lpc) +{ + if (lazy_percpu_counter_initialized(lpc)) + return percpu_counter_read(&lpc->c); + return remove_bias(atomic_long_read(&lpc->c.remote)) + lpc->c.count; +} + +static inline s64 lazy_percpu_counter_read_positive(struct lazy_percpu_counter *lpc) +{ + s64 val = lazy_percpu_counter_read(lpc); + + return (val > 0) ? val : 0; +} + +int __lazy_percpu_counter_upgrade_many(struct lazy_percpu_counter *c, + int nr_counters, gfp_t gfp); +static inline int lazy_percpu_counter_upgrade_many(struct lazy_percpu_counter *c, + int nr_counters, gfp_t gfp) +{ + /* Only check the first element, as batches are expected to be + * upgraded together. 
+ */ + if (!lazy_percpu_counter_initialized(c)) + return __lazy_percpu_counter_upgrade_many(c, nr_counters, gfp); + return 0; +} + +static inline void lazy_percpu_counter_destroy_many(struct lazy_percpu_counter *lpc, + u32 nr_counters) +{ + /* Only check the first element, as they must have been initialized together. */ + if (lazy_percpu_counter_initialized(lpc)) + percpu_counter_destroy_many((struct percpu_counter *)lpc, nr_counters); +} +#endif diff --git a/include/linux/percpu_counter.h b/include/linux/percpu_counter.h index 3a44dd1e33d2..e6fada9cba44 100644 --- a/include/linux/percpu_counter.h +++ b/include/linux/percpu_counter.h @@ -25,7 +25,10 @@ struct percpu_counter { #ifdef CONFIG_HOTPLUG_CPU struct list_head list; /* All percpu_counters are on a list */ #endif - s32 __percpu *counters; + union { + s32 __percpu *counters; + atomic_long_t remote; + }; }; extern int percpu_counter_batch; diff --git a/lib/percpu_counter.c b/lib/percpu_counter.c index c2322d53f3b1..0a210496f219 100644 --- a/lib/percpu_counter.c +++ b/lib/percpu_counter.c @@ -4,6 +4,7 @@ */ #include <linux/percpu_counter.h> +#include <linux/lazy_percpu_counter.h> #include <linux/mutex.h> #include <linux/init.h> #include <linux/cpu.h> @@ -397,6 +398,45 @@ bool __percpu_counter_limited_add(struct percpu_counter *fbc, return good; } +int __lazy_percpu_counter_upgrade_many(struct lazy_percpu_counter *counters, + int nr_counters, gfp_t gfp) +{ + s32 __percpu *pcpu_mem; + size_t counter_size; + + counter_size = ALIGN(sizeof(*pcpu_mem), __alignof__(*pcpu_mem)); + pcpu_mem = __alloc_percpu_gfp(nr_counters * counter_size, + __alignof__(*pcpu_mem), gfp); + if (!pcpu_mem) + return -ENOMEM; + + for (int i = 0; i < nr_counters; i++) { + struct lazy_percpu_counter *lpc = &(counters[i]); + s32 __percpu *n_counter; + s64 remote = 0; + + WARN_ON(lazy_percpu_counter_initialized(lpc)); + + /* + * After the xchg, lazy_percpu_counter behaves as a + * regular percpu counter. + */ + n_counter = (void __percpu *)pcpu_mem + i * counter_size; + remote = (s64) atomic_long_xchg(&lpc->c.remote, (s64)(uintptr_t) n_counter); + + BUG_ON(!(remote & LAZY_INIT_BIAS)); + + percpu_counter_add_local(&lpc->c, remove_bias(remote)); + } + + for (int i = 0; i < nr_counters; i++) + debug_percpu_counter_activate(&counters[i].c); + + cpu_hotplug_add_watchlist((struct percpu_counter *) counters, nr_counters); + + return 0; +} + static int __init percpu_counter_startup(void) { int ret; -- 2.51.0 ^ permalink raw reply [flat|nested] 19+ messages in thread
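For reference, a minimal usage sketch of the interface introduced
above. The function names and signatures are the ones added by this
patch; the GFP flag, error handling and wrapping function are only
illustrative -- the real user is the rss_stat conversion in the next
patch.

#include <linux/lazy_percpu_counter.h>

static int lazy_counter_example(void)
{
	struct lazy_percpu_counter counters[2];

	/* Lazy mode: no per-cpu allocation happens here. */
	lazy_percpu_counter_init_many(counters, 0, 2);

	/* Update from the owning (local) context: plain arithmetic. */
	lazy_percpu_counter_add_fast(&counters[0], 1);

	/* Infrequent update from a remote context: atomic, still lazy. */
	lazy_percpu_counter_add_atomic(&counters[0], -1);

	/* Remote updates are expected to become frequent: upgrade to percpu. */
	if (lazy_percpu_counter_upgrade_many(counters, 2, GFP_KERNEL)) {
		lazy_percpu_counter_destroy_many(counters, 2);
		return -ENOMEM;
	}

	pr_info("counter 0: %lld\n", lazy_percpu_counter_sum(&counters[0]));

	lazy_percpu_counter_destroy_many(counters, 2);
	return 0;
}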
* [RFC PATCH 3/4] mm: Avoid percpu MM counters on single-threaded tasks 2025-11-27 23:36 [RFC PATCH 0/4] Optimize rss_stat initialization/teardown for single-threaded tasks Gabriel Krisman Bertazi 2025-11-27 23:36 ` [RFC PATCH 1/4] lib/percpu_counter: Split out a helper to insert into hotplug list Gabriel Krisman Bertazi 2025-11-27 23:36 ` [RFC PATCH 2/4] lib: Support lazy initialization of per-cpu counters Gabriel Krisman Bertazi @ 2025-11-27 23:36 ` Gabriel Krisman Bertazi 2025-11-27 23:36 ` [RFC PATCH 4/4] mm: Split a slow path for updating mm counters Gabriel Krisman Bertazi 2025-11-28 13:30 ` [RFC PATCH 0/4] Optimize rss_stat initialization/teardown for single-threaded tasks Mathieu Desnoyers 4 siblings, 0 replies; 19+ messages in thread From: Gabriel Krisman Bertazi @ 2025-11-27 23:36 UTC (permalink / raw) To: linux-mm Cc: Gabriel Krisman Bertazi, linux-kernel, jack, Mateusz Guzik, Shakeel Butt, Michal Hocko, Mathieu Desnoyers, Dennis Zhou, Tejun Heo, Christoph Lameter, Andrew Morton, David Hildenbrand, Lorenzo Stoakes, Liam R. Howlett, Vlastimil Babka, Mike Rapoport, Suren Baghdasaryan The cost of the pcpu memory allocation when forking a new task is non-negligible, as reported in a few occasions, such as [1]. But it can also be fully avoided for single-threaded applications, where we know the vast majority of updates happen from the local task context. For the trivial benchmark, bound to cpu 0 to reduce cost of migrations), like below: for (( i = 0; i < 20000; i++ )); do /bin/true; done on an 80c machine, this patchset yielded a 6% improvement in system time. On a 256c machine, the system time reduced by 11%. Profiling shows mm_init went from 13.5% of samples to less than 3.33% in the same 256c machine: Before: - 13.50% 3.93% benchmark.sh [kernel.kallsyms] [k] mm_init - 9.57% mm_init + 4.80% pcpu_alloc_noprof + 3.87% __percpu_counter_init_many After: - 3.33% 0.80% benchmark.sh [kernel.kallsyms] [k] mm_init - 2.53% mm_init + 2.05% pcpu_alloc_noprof For kernbench in 256c, the patchset yields a 1.4% improvement on system time. For gitsource, the improvement in system time I'm measuring is around 3.12%. The upgrade adds some overhead to the second fork, in particular an atomic operation, besides the expensive allocation that was moved from the first fork to the second. So a fair question is the impact of this patchset on multi-threaded applications. I wrote a microbenchmark similar to the /bin/true above, but that just spawns a second pthread and waits for it to finish. The second thread just returns immediately. This is executed in a loop, bound to a single NUMA node, with: for (( i = 0; i < 20000; i++ )); do /bin/parallel-true; done Profiling shows the lazy upgrade impact is minimal to the performance: - 0.68% 0.04% parallel-true [kernel.kallsyms] [k] __lazy_percpu_counter_upgrade_many - 0.64% __lazy_percpu_counter_upgrade_many 0.62% pcpu_alloc_noprof Which is confirmed by the measured system time. With 20k runs, i'm still getting a slight improvement from baseline for the 2t case (2-4%). 
[1] https://lore.kernel.org/all/20230608111408.s2minsenlcjow7q3@quack3 Suggested-by: Jan Kara <jack@suse.cz> Signed-off-by: Gabriel Krisman Bertazi <krisman@suse.de> --- include/linux/mm.h | 24 ++++++++---------------- include/linux/mm_types.h | 4 ++-- include/trace/events/kmem.h | 4 ++-- kernel/fork.c | 14 ++++++-------- 4 files changed, 18 insertions(+), 28 deletions(-) diff --git a/include/linux/mm.h b/include/linux/mm.h index d16b33bacc32..29de4c60ac6c 100644 --- a/include/linux/mm.h +++ b/include/linux/mm.h @@ -2679,36 +2679,28 @@ static inline bool get_user_page_fast_only(unsigned long addr, */ static inline unsigned long get_mm_counter(struct mm_struct *mm, int member) { - return percpu_counter_read_positive(&mm->rss_stat[member]); + return lazy_percpu_counter_read_positive(&mm->rss_stat[member]); } static inline unsigned long get_mm_counter_sum(struct mm_struct *mm, int member) { - return percpu_counter_sum_positive(&mm->rss_stat[member]); + return lazy_percpu_counter_sum_positive(&mm->rss_stat[member]); } void mm_trace_rss_stat(struct mm_struct *mm, int member); static inline void add_mm_counter(struct mm_struct *mm, int member, long value) { - percpu_counter_add(&mm->rss_stat[member], value); - - mm_trace_rss_stat(mm, member); -} - -static inline void inc_mm_counter(struct mm_struct *mm, int member) -{ - percpu_counter_inc(&mm->rss_stat[member]); + if (READ_ONCE(current->mm) == mm) + lazy_percpu_counter_add_fast(&mm->rss_stat[member], value); + else + lazy_percpu_counter_add_atomic(&mm->rss_stat[member], value); mm_trace_rss_stat(mm, member); } -static inline void dec_mm_counter(struct mm_struct *mm, int member) -{ - percpu_counter_dec(&mm->rss_stat[member]); - - mm_trace_rss_stat(mm, member); -} +#define inc_mm_counter(mm, member) add_mm_counter(mm, member, 1) +#define dec_mm_counter(mm, member) add_mm_counter(mm, member, -1) /* Optimized variant when folio is already known not to be anon */ static inline int mm_counter_file(struct folio *folio) diff --git a/include/linux/mm_types.h b/include/linux/mm_types.h index 90e5790c318f..5a8d677efa85 100644 --- a/include/linux/mm_types.h +++ b/include/linux/mm_types.h @@ -18,7 +18,7 @@ #include <linux/page-flags-layout.h> #include <linux/workqueue.h> #include <linux/seqlock.h> -#include <linux/percpu_counter.h> +#include <linux/lazy_percpu_counter.h> #include <linux/types.h> #include <linux/bitmap.h> @@ -1119,7 +1119,7 @@ struct mm_struct { unsigned long saved_e_flags; #endif - struct percpu_counter rss_stat[NR_MM_COUNTERS]; + struct lazy_percpu_counter rss_stat[NR_MM_COUNTERS]; struct linux_binfmt *binfmt; diff --git a/include/trace/events/kmem.h b/include/trace/events/kmem.h index 7f93e754da5c..e21572f4d8a6 100644 --- a/include/trace/events/kmem.h +++ b/include/trace/events/kmem.h @@ -442,8 +442,8 @@ TRACE_EVENT(rss_stat, __entry->mm_id = mm_ptr_to_hash(mm); __entry->curr = !!(current->mm == mm); __entry->member = member; - __entry->size = (percpu_counter_sum_positive(&mm->rss_stat[member]) - << PAGE_SHIFT); + __entry->size = (lazy_percpu_counter_sum_positive(&mm->rss_stat[member]) + << PAGE_SHIFT); ), TP_printk("mm_id=%u curr=%d type=%s size=%ldB", diff --git a/kernel/fork.c b/kernel/fork.c index 3da0f08615a9..92698c60922e 100644 --- a/kernel/fork.c +++ b/kernel/fork.c @@ -583,7 +583,7 @@ static void check_mm(struct mm_struct *mm) "Please make sure 'struct resident_page_types[]' is updated as well"); for (i = 0; i < NR_MM_COUNTERS; i++) { - long x = percpu_counter_sum(&mm->rss_stat[i]); + long x = 
lazy_percpu_counter_sum_local(&mm->rss_stat[i]); if (unlikely(x)) { pr_alert("BUG: Bad rss-counter state mm:%p type:%s val:%ld Comm:%s Pid:%d\n", @@ -688,7 +688,7 @@ void __mmdrop(struct mm_struct *mm) put_user_ns(mm->user_ns); mm_pasid_drop(mm); mm_destroy_cid(mm); - percpu_counter_destroy_many(mm->rss_stat, NR_MM_COUNTERS); + lazy_percpu_counter_destroy_many(mm->rss_stat, NR_MM_COUNTERS); free_mm(mm); } @@ -1083,16 +1083,11 @@ static struct mm_struct *mm_init(struct mm_struct *mm, struct task_struct *p, if (mm_alloc_cid(mm, p)) goto fail_cid; - if (percpu_counter_init_many(mm->rss_stat, 0, GFP_KERNEL_ACCOUNT, - NR_MM_COUNTERS)) - goto fail_pcpu; - + lazy_percpu_counter_init_many(mm->rss_stat, 0, NR_MM_COUNTERS); mm->user_ns = get_user_ns(user_ns); lru_gen_init_mm(mm); return mm; -fail_pcpu: - mm_destroy_cid(mm); fail_cid: destroy_context(mm); fail_nocontext: @@ -1535,6 +1530,9 @@ static int copy_mm(u64 clone_flags, struct task_struct *tsk) return 0; if (clone_flags & CLONE_VM) { + if (lazy_percpu_counter_upgrade_many(oldmm->rss_stat, + NR_MM_COUNTERS, GFP_KERNEL_ACCOUNT)) + return -ENOMEM; mmget(oldmm); mm = oldmm; } else { -- 2.51.0 ^ permalink raw reply [flat|nested] 19+ messages in thread
* [RFC PATCH 4/4] mm: Split a slow path for updating mm counters 2025-11-27 23:36 [RFC PATCH 0/4] Optimize rss_stat initialization/teardown for single-threaded tasks Gabriel Krisman Bertazi ` (2 preceding siblings ...) 2025-11-27 23:36 ` [RFC PATCH 3/4] mm: Avoid percpu MM counters on single-threaded tasks Gabriel Krisman Bertazi @ 2025-11-27 23:36 ` Gabriel Krisman Bertazi 2025-12-01 10:19 ` David Hildenbrand (Red Hat) 2025-11-28 13:30 ` [RFC PATCH 0/4] Optimize rss_stat initialization/teardown for single-threaded tasks Mathieu Desnoyers 4 siblings, 1 reply; 19+ messages in thread From: Gabriel Krisman Bertazi @ 2025-11-27 23:36 UTC (permalink / raw) To: linux-mm Cc: Gabriel Krisman Bertazi, linux-kernel, jack, Mateusz Guzik, Shakeel Butt, Michal Hocko, Mathieu Desnoyers, Dennis Zhou, Tejun Heo, Christoph Lameter, Andrew Morton, David Hildenbrand, Lorenzo Stoakes, Liam R. Howlett, Vlastimil Babka, Mike Rapoport, Suren Baghdasaryan For cases where we know we are not coming from local context, there is no point in touching current when incrementing/decrementing the counters. Split this path into another helper to avoid this cost. Signed-off-by: Gabriel Krisman Bertazi <krisman@suse.de> --- arch/s390/mm/gmap_helpers.c | 4 ++-- arch/s390/mm/pgtable.c | 4 ++-- fs/exec.c | 2 +- include/linux/mm.h | 14 +++++++++++--- kernel/events/uprobes.c | 2 +- mm/filemap.c | 2 +- mm/huge_memory.c | 22 +++++++++++----------- mm/khugepaged.c | 6 +++--- mm/ksm.c | 2 +- mm/madvise.c | 2 +- mm/memory.c | 20 ++++++++++---------- mm/migrate.c | 2 +- mm/migrate_device.c | 2 +- mm/rmap.c | 16 ++++++++-------- mm/swapfile.c | 6 +++--- mm/userfaultfd.c | 2 +- 16 files changed, 58 insertions(+), 50 deletions(-) diff --git a/arch/s390/mm/gmap_helpers.c b/arch/s390/mm/gmap_helpers.c index d4c3c36855e2..6d8498c56d08 100644 --- a/arch/s390/mm/gmap_helpers.c +++ b/arch/s390/mm/gmap_helpers.c @@ -29,9 +29,9 @@ static void ptep_zap_swap_entry(struct mm_struct *mm, swp_entry_t entry) { if (!non_swap_entry(entry)) - dec_mm_counter(mm, MM_SWAPENTS); + dec_mm_counter_other(mm, MM_SWAPENTS); else if (is_migration_entry(entry)) - dec_mm_counter(mm, mm_counter(pfn_swap_entry_folio(entry))); + dec_mm_counter_other(mm, mm_counter(pfn_swap_entry_folio(entry))); free_swap_and_cache(entry); } diff --git a/arch/s390/mm/pgtable.c b/arch/s390/mm/pgtable.c index 0fde20bbc50b..021a04f958e5 100644 --- a/arch/s390/mm/pgtable.c +++ b/arch/s390/mm/pgtable.c @@ -686,11 +686,11 @@ void ptep_unshadow_pte(struct mm_struct *mm, unsigned long saddr, pte_t *ptep) static void ptep_zap_swap_entry(struct mm_struct *mm, swp_entry_t entry) { if (!non_swap_entry(entry)) - dec_mm_counter(mm, MM_SWAPENTS); + dec_mm_counter_other(mm, MM_SWAPENTS); else if (is_migration_entry(entry)) { struct folio *folio = pfn_swap_entry_folio(entry); - dec_mm_counter(mm, mm_counter(folio)); + dec_mm_counter_other(mm, mm_counter(folio)); } free_swap_and_cache(entry); } diff --git a/fs/exec.c b/fs/exec.c index 4298e7e08d5d..33d0eb00d315 100644 --- a/fs/exec.c +++ b/fs/exec.c @@ -137,7 +137,7 @@ static void acct_arg_size(struct linux_binprm *bprm, unsigned long pages) return; bprm->vma_pages = pages; - add_mm_counter(mm, MM_ANONPAGES, diff); + add_mm_counter_local(mm, MM_ANONPAGES, diff); } static struct page *get_arg_page(struct linux_binprm *bprm, unsigned long pos, diff --git a/include/linux/mm.h b/include/linux/mm.h index 29de4c60ac6c..2db12280e938 100644 --- a/include/linux/mm.h +++ b/include/linux/mm.h @@ -2689,7 +2689,7 @@ static inline unsigned long 
get_mm_counter_sum(struct mm_struct *mm, int member) void mm_trace_rss_stat(struct mm_struct *mm, int member); -static inline void add_mm_counter(struct mm_struct *mm, int member, long value) +static inline void add_mm_counter_local(struct mm_struct *mm, int member, long value) { if (READ_ONCE(current->mm) == mm) lazy_percpu_counter_add_fast(&mm->rss_stat[member], value); @@ -2698,9 +2698,17 @@ static inline void add_mm_counter(struct mm_struct *mm, int member, long value) mm_trace_rss_stat(mm, member); } +static inline void add_mm_counter_other(struct mm_struct *mm, int member, long value) +{ + lazy_percpu_counter_add_atomic(&mm->rss_stat[member], value); + + mm_trace_rss_stat(mm, member); +} -#define inc_mm_counter(mm, member) add_mm_counter(mm, member, 1) -#define dec_mm_counter(mm, member) add_mm_counter(mm, member, -1) +#define inc_mm_counter_local(mm, member) add_mm_counter_local(mm, member, 1) +#define dec_mm_counter_local(mm, member) add_mm_counter_local(mm, member, -1) +#define inc_mm_counter_other(mm, member) add_mm_counter_other(mm, member, 1) +#define dec_mm_counter_other(mm, member) add_mm_counter_other(mm, member, -1) /* Optimized variant when folio is already known not to be anon */ static inline int mm_counter_file(struct folio *folio) diff --git a/kernel/events/uprobes.c b/kernel/events/uprobes.c index 8709c69118b5..9c0e73dd2948 100644 --- a/kernel/events/uprobes.c +++ b/kernel/events/uprobes.c @@ -447,7 +447,7 @@ static int __uprobe_write(struct vm_area_struct *vma, if (!orig_page_is_identical(vma, vaddr, fw->page, &pmd_mappable)) goto remap; - dec_mm_counter(vma->vm_mm, MM_ANONPAGES); + dec_mm_counter_other(vma->vm_mm, MM_ANONPAGES); folio_remove_rmap_pte(folio, fw->page, vma); if (!folio_mapped(folio) && folio_test_swapcache(folio) && folio_trylock(folio)) { diff --git a/mm/filemap.c b/mm/filemap.c index 13f0259d993c..5d1656e63602 100644 --- a/mm/filemap.c +++ b/mm/filemap.c @@ -3854,7 +3854,7 @@ vm_fault_t filemap_map_pages(struct vm_fault *vmf, folio_unlock(folio); } while ((folio = next_uptodate_folio(&xas, mapping, end_pgoff)) != NULL); - add_mm_counter(vma->vm_mm, folio_type, rss); + add_mm_counter_other(vma->vm_mm, folio_type, rss); pte_unmap_unlock(vmf->pte, vmf->ptl); trace_mm_filemap_map_pages(mapping, start_pgoff, end_pgoff); out: diff --git a/mm/huge_memory.c b/mm/huge_memory.c index 1b81680b4225..614b0a8e168b 100644 --- a/mm/huge_memory.c +++ b/mm/huge_memory.c @@ -1228,7 +1228,7 @@ static void map_anon_folio_pmd(struct folio *folio, pmd_t *pmd, folio_add_lru_vma(folio, vma); set_pmd_at(vma->vm_mm, haddr, pmd, entry); update_mmu_cache_pmd(vma, haddr, pmd); - add_mm_counter(vma->vm_mm, MM_ANONPAGES, HPAGE_PMD_NR); + add_mm_counter_local(vma->vm_mm, MM_ANONPAGES, HPAGE_PMD_NR); count_vm_event(THP_FAULT_ALLOC); count_mthp_stat(HPAGE_PMD_ORDER, MTHP_STAT_ANON_FAULT_ALLOC); count_memcg_event_mm(vma->vm_mm, THP_FAULT_ALLOC); @@ -1444,7 +1444,7 @@ static vm_fault_t insert_pmd(struct vm_area_struct *vma, unsigned long addr, } else { folio_get(fop.folio); folio_add_file_rmap_pmd(fop.folio, &fop.folio->page, vma); - add_mm_counter(mm, mm_counter_file(fop.folio), HPAGE_PMD_NR); + add_mm_counter_local(mm, mm_counter_file(fop.folio), HPAGE_PMD_NR); } } else { entry = pmd_mkhuge(pfn_pmd(fop.pfn, prot)); @@ -1563,7 +1563,7 @@ static vm_fault_t insert_pud(struct vm_area_struct *vma, unsigned long addr, folio_get(fop.folio); folio_add_file_rmap_pud(fop.folio, &fop.folio->page, vma); - add_mm_counter(mm, mm_counter_file(fop.folio), HPAGE_PUD_NR); + add_mm_counter_local(mm, 
mm_counter_file(fop.folio), HPAGE_PUD_NR); } else { entry = pud_mkhuge(pfn_pud(fop.pfn, prot)); entry = pud_mkspecial(entry); @@ -1714,7 +1714,7 @@ int copy_huge_pmd(struct mm_struct *dst_mm, struct mm_struct *src_mm, pmd = pmd_swp_mkuffd_wp(pmd); set_pmd_at(src_mm, addr, src_pmd, pmd); } - add_mm_counter(dst_mm, MM_ANONPAGES, HPAGE_PMD_NR); + add_mm_counter_local(dst_mm, MM_ANONPAGES, HPAGE_PMD_NR); mm_inc_nr_ptes(dst_mm); pgtable_trans_huge_deposit(dst_mm, dst_pmd, pgtable); if (!userfaultfd_wp(dst_vma)) @@ -1758,7 +1758,7 @@ int copy_huge_pmd(struct mm_struct *dst_mm, struct mm_struct *src_mm, __split_huge_pmd(src_vma, src_pmd, addr, false); return -EAGAIN; } - add_mm_counter(dst_mm, MM_ANONPAGES, HPAGE_PMD_NR); + add_mm_counter_local(dst_mm, MM_ANONPAGES, HPAGE_PMD_NR); out_zero_page: mm_inc_nr_ptes(dst_mm); pgtable_trans_huge_deposit(dst_mm, dst_pmd, pgtable); @@ -2223,11 +2223,11 @@ int zap_huge_pmd(struct mmu_gather *tlb, struct vm_area_struct *vma, if (folio_test_anon(folio)) { zap_deposited_table(tlb->mm, pmd); - add_mm_counter(tlb->mm, MM_ANONPAGES, -HPAGE_PMD_NR); + add_mm_counter_other(tlb->mm, MM_ANONPAGES, -HPAGE_PMD_NR); } else { if (arch_needs_pgtable_deposit()) zap_deposited_table(tlb->mm, pmd); - add_mm_counter(tlb->mm, mm_counter_file(folio), + add_mm_counter_other(tlb->mm, mm_counter_file(folio), -HPAGE_PMD_NR); /* @@ -2719,7 +2719,7 @@ int zap_huge_pud(struct mmu_gather *tlb, struct vm_area_struct *vma, page = pud_page(orig_pud); folio = page_folio(page); folio_remove_rmap_pud(folio, page, vma); - add_mm_counter(tlb->mm, mm_counter_file(folio), -HPAGE_PUD_NR); + add_mm_counter_other(tlb->mm, mm_counter_file(folio), -HPAGE_PUD_NR); spin_unlock(ptl); tlb_remove_page_size(tlb, page, HPAGE_PUD_SIZE); @@ -2755,7 +2755,7 @@ static void __split_huge_pud_locked(struct vm_area_struct *vma, pud_t *pud, folio_set_referenced(folio); folio_remove_rmap_pud(folio, page, vma); folio_put(folio); - add_mm_counter(vma->vm_mm, mm_counter_file(folio), + add_mm_counter_local(vma->vm_mm, mm_counter_file(folio), -HPAGE_PUD_NR); } @@ -2874,7 +2874,7 @@ static void __split_huge_pmd_locked(struct vm_area_struct *vma, pmd_t *pmd, folio_remove_rmap_pmd(folio, page, vma); folio_put(folio); } - add_mm_counter(mm, mm_counter_file(folio), -HPAGE_PMD_NR); + add_mm_counter_local(mm, mm_counter_file(folio), -HPAGE_PMD_NR); return; } @@ -3188,7 +3188,7 @@ static bool __discard_anon_folio_pmd_locked(struct vm_area_struct *vma, folio_remove_rmap_pmd(folio, pmd_page(orig_pmd), vma); zap_deposited_table(mm, pmdp); - add_mm_counter(mm, MM_ANONPAGES, -HPAGE_PMD_NR); + add_mm_counter_local(mm, MM_ANONPAGES, -HPAGE_PMD_NR); if (vma->vm_flags & VM_LOCKED) mlock_drain_local(); folio_put(folio); diff --git a/mm/khugepaged.c b/mm/khugepaged.c index abe54f0043c7..a6634ca0667d 100644 --- a/mm/khugepaged.c +++ b/mm/khugepaged.c @@ -691,7 +691,7 @@ static void __collapse_huge_page_copy_succeeded(pte_t *pte, nr_ptes = 1; pteval = ptep_get(_pte); if (pte_none(pteval) || is_zero_pfn(pte_pfn(pteval))) { - add_mm_counter(vma->vm_mm, MM_ANONPAGES, 1); + add_mm_counter_other(vma->vm_mm, MM_ANONPAGES, 1); if (is_zero_pfn(pte_pfn(pteval))) { /* * ptl mostly unnecessary. @@ -1664,7 +1664,7 @@ int collapse_pte_mapped_thp(struct mm_struct *mm, unsigned long addr, /* step 3: set proper refcount and mm_counters. 
*/ if (nr_mapped_ptes) { folio_ref_sub(folio, nr_mapped_ptes); - add_mm_counter(mm, mm_counter_file(folio), -nr_mapped_ptes); + add_mm_counter_other(mm, mm_counter_file(folio), -nr_mapped_ptes); } /* step 4: remove empty page table */ @@ -1700,7 +1700,7 @@ int collapse_pte_mapped_thp(struct mm_struct *mm, unsigned long addr, if (nr_mapped_ptes) { flush_tlb_mm(mm); folio_ref_sub(folio, nr_mapped_ptes); - add_mm_counter(mm, mm_counter_file(folio), -nr_mapped_ptes); + add_mm_counter_other(mm, mm_counter_file(folio), -nr_mapped_ptes); } unlock: if (start_pte) diff --git a/mm/ksm.c b/mm/ksm.c index 7bc726b50b2f..7434cf1f4925 100644 --- a/mm/ksm.c +++ b/mm/ksm.c @@ -1410,7 +1410,7 @@ static int replace_page(struct vm_area_struct *vma, struct page *page, * will get wrong values in /proc, and a BUG message in dmesg * when tearing down the mm. */ - dec_mm_counter(mm, MM_ANONPAGES); + dec_mm_counter_other(mm, MM_ANONPAGES); } flush_cache_page(vma, addr, pte_pfn(ptep_get(ptep))); diff --git a/mm/madvise.c b/mm/madvise.c index fb1c86e630b6..ba7ea134f5ad 100644 --- a/mm/madvise.c +++ b/mm/madvise.c @@ -776,7 +776,7 @@ static int madvise_free_pte_range(pmd_t *pmd, unsigned long addr, } if (nr_swap) - add_mm_counter(mm, MM_SWAPENTS, nr_swap); + add_mm_counter_local(mm, MM_SWAPENTS, nr_swap); if (start_pte) { arch_leave_lazy_mmu_mode(); pte_unmap_unlock(start_pte, ptl); diff --git a/mm/memory.c b/mm/memory.c index 74b45e258323..9a18ac25955c 100644 --- a/mm/memory.c +++ b/mm/memory.c @@ -488,7 +488,7 @@ static inline void add_mm_rss_vec(struct mm_struct *mm, int *rss) for (i = 0; i < NR_MM_COUNTERS; i++) if (rss[i]) - add_mm_counter(mm, i, rss[i]); + add_mm_counter_other(mm, i, rss[i]); } static bool is_bad_page_map_ratelimited(void) @@ -2306,7 +2306,7 @@ static int insert_page_into_pte_locked(struct vm_area_struct *vma, pte_t *pte, pteval = pte_mkyoung(pteval); pteval = maybe_mkwrite(pte_mkdirty(pteval), vma); } - inc_mm_counter(vma->vm_mm, mm_counter_file(folio)); + inc_mm_counter_local(vma->vm_mm, mm_counter_file(folio)); folio_add_file_rmap_pte(folio, page, vma); } set_pte_at(vma->vm_mm, addr, pte, pteval); @@ -3716,12 +3716,12 @@ static vm_fault_t wp_page_copy(struct vm_fault *vmf) if (likely(vmf->pte && pte_same(ptep_get(vmf->pte), vmf->orig_pte))) { if (old_folio) { if (!folio_test_anon(old_folio)) { - dec_mm_counter(mm, mm_counter_file(old_folio)); - inc_mm_counter(mm, MM_ANONPAGES); + dec_mm_counter_other(mm, mm_counter_file(old_folio)); + inc_mm_counter_other(mm, MM_ANONPAGES); } } else { ksm_might_unmap_zero_page(mm, vmf->orig_pte); - inc_mm_counter(mm, MM_ANONPAGES); + inc_mm_counter_other(mm, MM_ANONPAGES); } flush_cache_page(vma, vmf->address, pte_pfn(vmf->orig_pte)); entry = folio_mk_pte(new_folio, vma->vm_page_prot); @@ -4916,8 +4916,8 @@ vm_fault_t do_swap_page(struct vm_fault *vmf) if (should_try_to_free_swap(folio, vma, vmf->flags)) folio_free_swap(folio); - add_mm_counter(vma->vm_mm, MM_ANONPAGES, nr_pages); - add_mm_counter(vma->vm_mm, MM_SWAPENTS, -nr_pages); + add_mm_counter_other(vma->vm_mm, MM_ANONPAGES, nr_pages); + add_mm_counter_other(vma->vm_mm, MM_SWAPENTS, -nr_pages); pte = mk_pte(page, vma->vm_page_prot); if (pte_swp_soft_dirty(vmf->orig_pte)) pte = pte_mksoft_dirty(pte); @@ -5223,7 +5223,7 @@ static vm_fault_t do_anonymous_page(struct vm_fault *vmf) } folio_ref_add(folio, nr_pages - 1); - add_mm_counter(vma->vm_mm, MM_ANONPAGES, nr_pages); + add_mm_counter_other(vma->vm_mm, MM_ANONPAGES, nr_pages); count_mthp_stat(folio_order(folio), MTHP_STAT_ANON_FAULT_ALLOC); 
folio_add_new_anon_rmap(folio, vma, addr, RMAP_EXCLUSIVE); folio_add_lru_vma(folio, vma); @@ -5375,7 +5375,7 @@ vm_fault_t do_set_pmd(struct vm_fault *vmf, struct folio *folio, struct page *pa if (write) entry = maybe_pmd_mkwrite(pmd_mkdirty(entry), vma); - add_mm_counter(vma->vm_mm, mm_counter_file(folio), HPAGE_PMD_NR); + add_mm_counter_other(vma->vm_mm, mm_counter_file(folio), HPAGE_PMD_NR); folio_add_file_rmap_pmd(folio, page, vma); /* @@ -5561,7 +5561,7 @@ vm_fault_t finish_fault(struct vm_fault *vmf) folio_ref_add(folio, nr_pages - 1); set_pte_range(vmf, folio, page, nr_pages, addr); type = is_cow ? MM_ANONPAGES : mm_counter_file(folio); - add_mm_counter(vma->vm_mm, type, nr_pages); + add_mm_counter_other(vma->vm_mm, type, nr_pages); ret = 0; unlock: diff --git a/mm/migrate.c b/mm/migrate.c index e3065c9edb55..dd8c6e6224f9 100644 --- a/mm/migrate.c +++ b/mm/migrate.c @@ -329,7 +329,7 @@ static bool try_to_map_unused_to_zeropage(struct page_vma_mapped_walk *pvmw, set_pte_at(pvmw->vma->vm_mm, pvmw->address, pvmw->pte, newpte); - dec_mm_counter(pvmw->vma->vm_mm, mm_counter(folio)); + dec_mm_counter_other(pvmw->vma->vm_mm, mm_counter(folio)); return true; } diff --git a/mm/migrate_device.c b/mm/migrate_device.c index abd9f6850db6..7f3e5d7b3109 100644 --- a/mm/migrate_device.c +++ b/mm/migrate_device.c @@ -676,7 +676,7 @@ static void migrate_vma_insert_page(struct migrate_vma *migrate, if (userfaultfd_missing(vma)) goto unlock_abort; - inc_mm_counter(mm, MM_ANONPAGES); + inc_mm_counter_other(mm, MM_ANONPAGES); folio_add_new_anon_rmap(folio, vma, addr, RMAP_EXCLUSIVE); if (!folio_is_zone_device(folio)) folio_add_lru_vma(folio, vma); diff --git a/mm/rmap.c b/mm/rmap.c index ac4f783d6ec2..0f6023ffb65d 100644 --- a/mm/rmap.c +++ b/mm/rmap.c @@ -2085,7 +2085,7 @@ static bool try_to_unmap_one(struct folio *folio, struct vm_area_struct *vma, set_huge_pte_at(mm, address, pvmw.pte, pteval, hsz); } else { - dec_mm_counter(mm, mm_counter(folio)); + dec_mm_counter_other(mm, mm_counter(folio)); set_pte_at(mm, address, pvmw.pte, pteval); } } else if (likely(pte_present(pteval)) && pte_unused(pteval) && @@ -2100,7 +2100,7 @@ static bool try_to_unmap_one(struct folio *folio, struct vm_area_struct *vma, * migration) will not expect userfaults on already * copied pages. 
*/ - dec_mm_counter(mm, mm_counter(folio)); + dec_mm_counter_other(mm, mm_counter(folio)); } else if (folio_test_anon(folio)) { swp_entry_t entry = page_swap_entry(subpage); pte_t swp_pte; @@ -2155,7 +2155,7 @@ static bool try_to_unmap_one(struct folio *folio, struct vm_area_struct *vma, set_ptes(mm, address, pvmw.pte, pteval, nr_pages); goto walk_abort; } - add_mm_counter(mm, MM_ANONPAGES, -nr_pages); + add_mm_counter_other(mm, MM_ANONPAGES, -nr_pages); goto discard; } @@ -2188,8 +2188,8 @@ static bool try_to_unmap_one(struct folio *folio, struct vm_area_struct *vma, list_add(&mm->mmlist, &init_mm.mmlist); spin_unlock(&mmlist_lock); } - dec_mm_counter(mm, MM_ANONPAGES); - inc_mm_counter(mm, MM_SWAPENTS); + dec_mm_counter_other(mm, MM_ANONPAGES); + inc_mm_counter_other(mm, MM_SWAPENTS); swp_pte = swp_entry_to_pte(entry); if (anon_exclusive) swp_pte = pte_swp_mkexclusive(swp_pte); @@ -2217,7 +2217,7 @@ static bool try_to_unmap_one(struct folio *folio, struct vm_area_struct *vma, * * See Documentation/mm/mmu_notifier.rst */ - dec_mm_counter(mm, mm_counter_file(folio)); + dec_mm_counter_other(mm, mm_counter_file(folio)); } discard: if (unlikely(folio_test_hugetlb(folio))) { @@ -2476,7 +2476,7 @@ static bool try_to_migrate_one(struct folio *folio, struct vm_area_struct *vma, set_huge_pte_at(mm, address, pvmw.pte, pteval, hsz); } else { - dec_mm_counter(mm, mm_counter(folio)); + dec_mm_counter_other(mm, mm_counter(folio)); set_pte_at(mm, address, pvmw.pte, pteval); } } else if (likely(pte_present(pteval)) && pte_unused(pteval) && @@ -2491,7 +2491,7 @@ static bool try_to_migrate_one(struct folio *folio, struct vm_area_struct *vma, * migration) will not expect userfaults on already * copied pages. */ - dec_mm_counter(mm, mm_counter(folio)); + dec_mm_counter_other(mm, mm_counter(folio)); } else { swp_entry_t entry; pte_t swp_pte; diff --git a/mm/swapfile.c b/mm/swapfile.c index 10760240a3a2..70f7d31c0854 100644 --- a/mm/swapfile.c +++ b/mm/swapfile.c @@ -2163,7 +2163,7 @@ static int unuse_pte(struct vm_area_struct *vma, pmd_t *pmd, if (unlikely(hwpoisoned || !folio_test_uptodate(folio))) { swp_entry_t swp_entry; - dec_mm_counter(vma->vm_mm, MM_SWAPENTS); + dec_mm_counter_other(vma->vm_mm, MM_SWAPENTS); if (hwpoisoned) { swp_entry = make_hwpoison_entry(page); } else { @@ -2181,8 +2181,8 @@ static int unuse_pte(struct vm_area_struct *vma, pmd_t *pmd, */ arch_swap_restore(folio_swap(entry, folio), folio); - dec_mm_counter(vma->vm_mm, MM_SWAPENTS); - inc_mm_counter(vma->vm_mm, MM_ANONPAGES); + dec_mm_counter_other(vma->vm_mm, MM_SWAPENTS); + inc_mm_counter_other(vma->vm_mm, MM_ANONPAGES); folio_get(folio); if (folio == swapcache) { rmap_t rmap_flags = RMAP_NONE; diff --git a/mm/userfaultfd.c b/mm/userfaultfd.c index af61b95c89e4..34e760c37b7b 100644 --- a/mm/userfaultfd.c +++ b/mm/userfaultfd.c @@ -221,7 +221,7 @@ int mfill_atomic_install_pte(pmd_t *dst_pmd, * Must happen after rmap, as mm_counter() checks mapping (via * PageAnon()), which is set by __page_set_anon_rmap(). */ - inc_mm_counter(dst_mm, mm_counter(folio)); + inc_mm_counter_other(dst_mm, mm_counter(folio)); set_pte_at(dst_mm, dst_addr, dst_pte, _dst_pte); -- 2.51.0 ^ permalink raw reply [flat|nested] 19+ messages in thread
* Re: [RFC PATCH 4/4] mm: Split a slow path for updating mm counters 2025-11-27 23:36 ` [RFC PATCH 4/4] mm: Split a slow path for updating mm counters Gabriel Krisman Bertazi @ 2025-12-01 10:19 ` David Hildenbrand (Red Hat) 0 siblings, 0 replies; 19+ messages in thread From: David Hildenbrand (Red Hat) @ 2025-12-01 10:19 UTC (permalink / raw) To: Gabriel Krisman Bertazi, linux-mm Cc: linux-kernel, jack, Mateusz Guzik, Shakeel Butt, Michal Hocko, Mathieu Desnoyers, Dennis Zhou, Tejun Heo, Christoph Lameter, Andrew Morton, Lorenzo Stoakes, Liam R. Howlett, Vlastimil Babka, Mike Rapoport, Suren Baghdasaryan On 11/28/25 00:36, Gabriel Krisman Bertazi wrote: > For cases where we know we are not coming from local context, there is > no point in touching current when incrementing/decrementing the > counters. Split this path into another helper to avoid this cost. > > Signed-off-by: Gabriel Krisman Bertazi <krisman@suse.de> > --- > arch/s390/mm/gmap_helpers.c | 4 ++-- > arch/s390/mm/pgtable.c | 4 ++-- > fs/exec.c | 2 +- > include/linux/mm.h | 14 +++++++++++--- > kernel/events/uprobes.c | 2 +- > mm/filemap.c | 2 +- > mm/huge_memory.c | 22 +++++++++++----------- > mm/khugepaged.c | 6 +++--- > mm/ksm.c | 2 +- > mm/madvise.c | 2 +- > mm/memory.c | 20 ++++++++++---------- > mm/migrate.c | 2 +- > mm/migrate_device.c | 2 +- > mm/rmap.c | 16 ++++++++-------- > mm/swapfile.c | 6 +++--- > mm/userfaultfd.c | 2 +- > 16 files changed, 58 insertions(+), 50 deletions(-) > > diff --git a/arch/s390/mm/gmap_helpers.c b/arch/s390/mm/gmap_helpers.c > index d4c3c36855e2..6d8498c56d08 100644 > --- a/arch/s390/mm/gmap_helpers.c > +++ b/arch/s390/mm/gmap_helpers.c > @@ -29,9 +29,9 @@ > static void ptep_zap_swap_entry(struct mm_struct *mm, swp_entry_t entry) > { > if (!non_swap_entry(entry)) > - dec_mm_counter(mm, MM_SWAPENTS); > + dec_mm_counter_other(mm, MM_SWAPENTS); > else if (is_migration_entry(entry)) > - dec_mm_counter(mm, mm_counter(pfn_swap_entry_folio(entry))); > + dec_mm_counter_other(mm, mm_counter(pfn_swap_entry_folio(entry))); > free_swap_and_cache(entry); > } > > diff --git a/arch/s390/mm/pgtable.c b/arch/s390/mm/pgtable.c > index 0fde20bbc50b..021a04f958e5 100644 > --- a/arch/s390/mm/pgtable.c > +++ b/arch/s390/mm/pgtable.c > @@ -686,11 +686,11 @@ void ptep_unshadow_pte(struct mm_struct *mm, unsigned long saddr, pte_t *ptep) > static void ptep_zap_swap_entry(struct mm_struct *mm, swp_entry_t entry) > { > if (!non_swap_entry(entry)) > - dec_mm_counter(mm, MM_SWAPENTS); > + dec_mm_counter_other(mm, MM_SWAPENTS); > else if (is_migration_entry(entry)) { > struct folio *folio = pfn_swap_entry_folio(entry); > > - dec_mm_counter(mm, mm_counter(folio)); > + dec_mm_counter_other(mm, mm_counter(folio)); > } > free_swap_and_cache(entry); > } > diff --git a/fs/exec.c b/fs/exec.c > index 4298e7e08d5d..33d0eb00d315 100644 > --- a/fs/exec.c > +++ b/fs/exec.c > @@ -137,7 +137,7 @@ static void acct_arg_size(struct linux_binprm *bprm, unsigned long pages) > return; > > bprm->vma_pages = pages; > - add_mm_counter(mm, MM_ANONPAGES, diff); > + add_mm_counter_local(mm, MM_ANONPAGES, diff); > } > > static struct page *get_arg_page(struct linux_binprm *bprm, unsigned long pos, > diff --git a/include/linux/mm.h b/include/linux/mm.h > index 29de4c60ac6c..2db12280e938 100644 > --- a/include/linux/mm.h > +++ b/include/linux/mm.h > @@ -2689,7 +2689,7 @@ static inline unsigned long get_mm_counter_sum(struct mm_struct *mm, int member) > > void mm_trace_rss_stat(struct mm_struct *mm, int member); > > -static inline void 
add_mm_counter(struct mm_struct *mm, int member, long value) > +static inline void add_mm_counter_local(struct mm_struct *mm, int member, long value) > { > if (READ_ONCE(current->mm) == mm) > lazy_percpu_counter_add_fast(&mm->rss_stat[member], value); > @@ -2698,9 +2698,17 @@ static inline void add_mm_counter(struct mm_struct *mm, int member, long value) > > mm_trace_rss_stat(mm, member); > } > +static inline void add_mm_counter_other(struct mm_struct *mm, int member, long value) > +{ > + lazy_percpu_counter_add_atomic(&mm->rss_stat[member], value); > + > + mm_trace_rss_stat(mm, member); > +} > > -#define inc_mm_counter(mm, member) add_mm_counter(mm, member, 1) > -#define dec_mm_counter(mm, member) add_mm_counter(mm, member, -1) > +#define inc_mm_counter_local(mm, member) add_mm_counter_local(mm, member, 1) > +#define dec_mm_counter_local(mm, member) add_mm_counter_local(mm, member, -1) > +#define inc_mm_counter_other(mm, member) add_mm_counter_other(mm, member, 1) > +#define dec_mm_counter_other(mm, member) add_mm_counter_other(mm, member, -1) I'd have thought that there is a local and !local version, whereby the latter one would simply maintain the old name. The "_other()" sticks out a bit. E.g., cmpxch() vs. cmpxchg_local(). Or would "_remote()" better describe what "_other()" intends to do? -- Cheers David ^ permalink raw reply [flat|nested] 19+ messages in thread
* Re: [RFC PATCH 0/4] Optimize rss_stat initialization/teardown for single-threaded tasks 2025-11-27 23:36 [RFC PATCH 0/4] Optimize rss_stat initialization/teardown for single-threaded tasks Gabriel Krisman Bertazi ` (3 preceding siblings ...) 2025-11-27 23:36 ` [RFC PATCH 4/4] mm: Split a slow path for updating mm counters Gabriel Krisman Bertazi @ 2025-11-28 13:30 ` Mathieu Desnoyers 2025-11-28 20:10 ` Jan Kara 4 siblings, 1 reply; 19+ messages in thread From: Mathieu Desnoyers @ 2025-11-28 13:30 UTC (permalink / raw) To: Gabriel Krisman Bertazi, linux-mm Cc: linux-kernel, jack, Mateusz Guzik, Shakeel Butt, Michal Hocko, Dennis Zhou, Tejun Heo, Christoph Lameter, Andrew Morton, David Hildenbrand, Lorenzo Stoakes, Liam R. Howlett, Vlastimil Babka, Mike Rapoport, Suren Baghdasaryan, Thomas Gleixner On 2025-11-27 18:36, Gabriel Krisman Bertazi wrote: > The cost of the pcpu memory allocation is non-negligible for systems > with many cpus, and it is quite visible when forking a new task, as > reported in a few occasions. I've come to the same conclusion within the development of the hierarchical per-cpu counters. But while the mm_struct has a SLAB cache (initialized in kernel/fork.c:mm_cache_init()), there is no such thing for the per-mm per-cpu data. In the mm_struct, we have the following per-cpu data (please let me know if I missed any in the maze): - struct mm_cid __percpu *pcpu_cid (or equivalent through struct mm_mm_cid after Thomas Gleixner gets his rewrite upstream), - unsigned int __percpu *futex_ref, - NR_MM_COUNTERS rss_stats per-cpu counters. What would really reduce memory allocation overhead on fork is to move all those fields into a top level "struct mm_percpu_struct" as a first step. This would merge 3 per-cpu allocations into one when forking a new task. Then the second step is to create a mm_percpu_struct cache to bypass the per-cpu allocator. I suspect that by doing just that we'd get most of the performance benefits provided by the single-threaded special-case proposed here. I'm not against special casing single-threaded if it's still worth it after doing the underlying data structure layout/caching changes I'm proposing here, but I think we need to fix the memory allocation overhead issue first before working around it with special cases and added complexity. Thanks, Mathieu -- Mathieu Desnoyers EfficiOS Inc. https://www.efficios.com ^ permalink raw reply [flat|nested] 19+ messages in thread
* Re: [RFC PATCH 0/4] Optimize rss_stat initialization/teardown for single-threaded tasks 2025-11-28 13:30 ` [RFC PATCH 0/4] Optimize rss_stat initialization/teardown for single-threaded tasks Mathieu Desnoyers @ 2025-11-28 20:10 ` Jan Kara 2025-11-28 20:12 ` Mathieu Desnoyers 2025-11-29 5:57 ` Mateusz Guzik 0 siblings, 2 replies; 19+ messages in thread From: Jan Kara @ 2025-11-28 20:10 UTC (permalink / raw) To: Mathieu Desnoyers Cc: Gabriel Krisman Bertazi, linux-mm, linux-kernel, jack, Mateusz Guzik, Shakeel Butt, Michal Hocko, Dennis Zhou, Tejun Heo, Christoph Lameter, Andrew Morton, David Hildenbrand, Lorenzo Stoakes, Liam R. Howlett, Vlastimil Babka, Mike Rapoport, Suren Baghdasaryan, Thomas Gleixner On Fri 28-11-25 08:30:08, Mathieu Desnoyers wrote: > On 2025-11-27 18:36, Gabriel Krisman Bertazi wrote: > > The cost of the pcpu memory allocation is non-negligible for systems > > with many cpus, and it is quite visible when forking a new task, as > > reported in a few occasions. > I've come to the same conclusion within the development of > the hierarchical per-cpu counters. > > But while the mm_struct has a SLAB cache (initialized in > kernel/fork.c:mm_cache_init()), there is no such thing > for the per-mm per-cpu data. > > In the mm_struct, we have the following per-cpu data (please > let me know if I missed any in the maze): > > - struct mm_cid __percpu *pcpu_cid (or equivalent through > struct mm_mm_cid after Thomas Gleixner gets his rewrite > upstream), > > - unsigned int __percpu *futex_ref, > > - NR_MM_COUNTERS rss_stats per-cpu counters. > > What would really reduce memory allocation overhead on fork > is to move all those fields into a top level > "struct mm_percpu_struct" as a first step. This would > merge 3 per-cpu allocations into one when forking a new > task. > > Then the second step is to create a mm_percpu_struct > cache to bypass the per-cpu allocator. > > I suspect that by doing just that we'd get most of the > performance benefits provided by the single-threaded special-case > proposed here. I don't think so. Because in the profiles I have been doing for these loads the biggest cost wasn't actually the per-cpu allocation itself but the cost of zeroing the allocated counter for many CPUs (and then the counter summarization on exit) and you're not going to get rid of that with just reshuffling per-cpu fields and adding slab allocator in front. Honza -- Jan Kara <jack@suse.com> SUSE Labs, CR ^ permalink raw reply [flat|nested] 19+ messages in thread
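For illustration, the consolidation Mathieu suggests above could look
roughly like the sketch below; the field types are guesses based on the
per-mm members he lists, not something proposed in the thread.

/* Sketch: fold the per-mm per-cpu data into one per-cpu allocation. */
struct mm_percpu_struct {
	struct mm_cid cid;		/* today: mm->pcpu_cid */
	unsigned int futex_ref;		/* today: mm->futex_ref */
	s32 rss_stat[NR_MM_COUNTERS];	/* today: the per-cpu side of mm->rss_stat */
};

/* mm_struct would then carry a single pointer:
 *	struct mm_percpu_struct __percpu *pcpu;
 */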
* Re: [RFC PATCH 0/4] Optimize rss_stat initialization/teardown for single-threaded tasks 2025-11-28 20:10 ` Jan Kara @ 2025-11-28 20:12 ` Mathieu Desnoyers 2025-11-29 5:57 ` Mateusz Guzik 1 sibling, 0 replies; 19+ messages in thread From: Mathieu Desnoyers @ 2025-11-28 20:12 UTC (permalink / raw) To: Jan Kara Cc: Gabriel Krisman Bertazi, linux-mm, linux-kernel, Mateusz Guzik, Shakeel Butt, Michal Hocko, Dennis Zhou, Tejun Heo, Christoph Lameter, Andrew Morton, David Hildenbrand, Lorenzo Stoakes, Liam R. Howlett, Vlastimil Babka, Mike Rapoport, Suren Baghdasaryan, Thomas Gleixner On 2025-11-28 15:10, Jan Kara wrote: > On Fri 28-11-25 08:30:08, Mathieu Desnoyers wrote: [...] >> I suspect that by doing just that we'd get most of the >> performance benefits provided by the single-threaded special-case >> proposed here. > > I don't think so. Because in the profiles I have been doing for these > loads the biggest cost wasn't actually the per-cpu allocation itself but > the cost of zeroing the allocated counter for many CPUs (and then the > counter summarization on exit) and you're not going to get rid of that with > just reshuffling per-cpu fields and adding slab allocator in front. That's a good point ! So skipping the zeroing of per-cpu fields would indeed justify special-casing the single-threaded case. Thanks, Mathieu -- Mathieu Desnoyers EfficiOS Inc. https://www.efficios.com ^ permalink raw reply [flat|nested] 19+ messages in thread
* Re: [RFC PATCH 0/4] Optimize rss_stat initialization/teardown for single-threaded tasks 2025-11-28 20:10 ` Jan Kara 2025-11-28 20:12 ` Mathieu Desnoyers @ 2025-11-29 5:57 ` Mateusz Guzik 2025-11-29 7:50 ` Mateusz Guzik ` (2 more replies) 1 sibling, 3 replies; 19+ messages in thread From: Mateusz Guzik @ 2025-11-29 5:57 UTC (permalink / raw) To: Jan Kara Cc: Mathieu Desnoyers, Gabriel Krisman Bertazi, linux-mm, linux-kernel, Shakeel Butt, Michal Hocko, Dennis Zhou, Tejun Heo, Christoph Lameter, Andrew Morton, David Hildenbrand, Lorenzo Stoakes, Liam R. Howlett, Vlastimil Babka, Mike Rapoport, Suren Baghdasaryan, Thomas Gleixner On Fri, Nov 28, 2025 at 9:10 PM Jan Kara <jack@suse.cz> wrote: > On Fri 28-11-25 08:30:08, Mathieu Desnoyers wrote: > > What would really reduce memory allocation overhead on fork > > is to move all those fields into a top level > > "struct mm_percpu_struct" as a first step. This would > > merge 3 per-cpu allocations into one when forking a new > > task. > > > > Then the second step is to create a mm_percpu_struct > > cache to bypass the per-cpu allocator. > > > > I suspect that by doing just that we'd get most of the > > performance benefits provided by the single-threaded special-case > > proposed here. > > I don't think so. Because in the profiles I have been doing for these > loads the biggest cost wasn't actually the per-cpu allocation itself but > the cost of zeroing the allocated counter for many CPUs (and then the > counter summarization on exit) and you're not going to get rid of that with > just reshuffling per-cpu fields and adding slab allocator in front. > The entire ordeal has been discussed several times already. I'm rather disappointed there is a new patchset posted which does not address any of it and goes straight to special-casing single-threaded operation. The major claims (by me anyway) are: 1. single-threaded operation for fork + exec suffers avoidable overhead even without the rss counter problem, which are tractable with the same kind of thing which would sort out the multi-threaded problem 2. unfortunately there is an increasing number of multi-threaded (and often short lived) processes (example: lld, the linker form the llvm project; more broadly plenty of things Rust where people think threading == performance) Bottom line is, solutions like the one proposed in the patchset are at best a stopgap and even they leave performance on the table for the case they are optimizing for. The pragmatic way forward (as I see it anyway) is to fix up the multi-threaded thing and see if trying to special case for single-threaded case is justifiable afterwards. Given that the current patchset has to resort to atomics in certain cases, there is some error-pronnes and runtime overhead associated with it going beyond merely checking if the process is single-threaded, which puts an additional question mark on it. Now to business: You mentioned the rss loops are a problem. I agree, but they can be largely damage-controlled. More importantly there are 2 loops of the sort already happening even with the patchset at hand. mm_alloc_cid() results in one loop in the percpu allocator to zero out the area, then mm_init_cid() performs the following: for_each_possible_cpu(i) { struct mm_cid *pcpu_cid = per_cpu_ptr(mm->pcpu_cid, i); pcpu_cid->cid = MM_CID_UNSET; pcpu_cid->recent_cid = MM_CID_UNSET; pcpu_cid->time = 0; } There is no way this is not visible already on 256 threads. 
Preferably some magic would be done to init this on first use on a given CPU. There is some bitmap tracking CPU presence, maybe this can be tackled on top of it. But for the sake of argument let's say that's too expensive or perhaps not feasible. Even then, the walk can be done *once* by telling the percpu allocator to refrain from zeroing memory. Which brings me to rss counters. In the current kernel that's *another* loop over everything to zero it out. But it does not have to be that way. Suppose bitmap shenanigans mentioned above are no-go for these as well. So instead the code could reach out to the percpu allocator to allocate memory for both cid and rss (as mentioned by Mathieu), but have it returned uninitialized and loop over it once sorting out both cid and rss in the same body. This should be drastically faster than the current code. But one may observe it is an invariant that the values sum up to 0 on process exit. So if one was to make sure the first time this is handed out by the percpu allocator the values are all 0s and then cache the area somewhere for future allocs/frees of mm, there would be no need to do the zeroing on alloc. On the free side summing up rss counters in check_mm() is only there for debugging purposes. Suppose it is useful enough that it needs to stay. Even then, as implemented right now, this is just slow for no reason: for (i = 0; i < NR_MM_COUNTERS; i++) { long x = percpu_counter_sum(&mm->rss_stat[i]); [snip] } That's *four* loops with extra overhead of irq-trips for every single one. This can be patched up to only do one loop, possibly even with irqs enabled the entire time. Doing the loop is still slower than not doing it, but this may be just fast enough to obsolete ideas like the one in the proposed patchset. While per-cpu level caching for all possible allocations seems like the easiest way out, it in fact does *NOT* fully solve the problem -- you are still going to globally serialize in lru_gen_add_mm() (and the del part), pgd_alloc() and other places. Or to put it differently, per-cpu caching of mm_struct itself makes no sense in the current kernel (with the patchset or not) because on the way to finish the alloc or free you are going to globally serialize several times and *that* is the issue to fix in the long run. You can make the problematic locks fine-grained (and consequently alleviate the scalability aspect), but you are still going to suffer the overhead of taking them. As far as I'm concerned the real long term solution(tm) would make the cached mm's retain the expensive-to-sort-out state -- list presence, percpu memory and whatever else. To that end I see 2 feasible approaches: 1. a dedicated allocator with coarse granularity Instead of per-cpu, you could have an instance for every n threads (let's say 8 or whatever). This would pose a tradeoff between total memory usage and scalability outside of a microbenchmark setting. You are still going to serialize in some cases, but only once on alloc and once on free, not several times, and you are still cheaper single-threaded. This is faster all around. 2. dtor support in the slub allocator ctor does the hard work and dtor undoes it. There is an unfinished patchset by Harry which implements the idea[1]. There is a serious concern about deadlock potential stemming from running arbitrary dtor code during memory reclaim. 
I already described elsewhere how with a little bit of discipline supported by lockdep this is a non-issue (tl;dr add spinlocks marked as "leaf" (you can't take any locks if you hold them and you have to disable interrupts) + mark dtors as only allowed to hold a leaf spinlock et voila, code guaranteed to not deadlock). But then all code trying to cache state that is to be undone with the dtor has to be patched to facilitate it. Again, bugs in the area are sorted out by lockdep. The good news is that folks were apparently open to punting reclaim of such memory into a workqueue, which completely alleviates that concern anyway. As it happens, if fork + exit is involved there are numerous other bottlenecks which overshadow the above, but that's a rant for another day. Here we can pretend for a minute they are solved. [1] https://gitlab.com/hyeyoo/linux/-/commits/slab-destructor-rfc-v2-wip?ref_type=heads ^ permalink raw reply [flat|nested] 19+ messages in thread
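As a concrete illustration of the "one loop instead of four" point above, the exit-time summation could be restructured along the following lines. This is a sketch only: the function name is made up, the debug report is simplified (the real check_mm() prints resident_page_types[i]), and locking/hotplug interactions are deliberately ignored:

/*
 * Illustrative only: sum all NR_MM_COUNTERS rss counters in a single
 * pass over the possible CPUs instead of calling percpu_counter_sum()
 * four times (each with its own lock and irq round trip).
 */
static void check_mm_single_pass(struct mm_struct *mm)
{
	long sum[NR_MM_COUNTERS];
	int i, cpu;

	for (i = 0; i < NR_MM_COUNTERS; i++)
		sum[i] = mm->rss_stat[i].count;

	for_each_possible_cpu(cpu) {
		for (i = 0; i < NR_MM_COUNTERS; i++)
			sum[i] += *per_cpu_ptr(mm->rss_stat[i].counters, cpu);
	}

	for (i = 0; i < NR_MM_COUNTERS; i++) {
		if (unlikely(sum[i]))
			pr_alert("BUG: Bad rss-counter state mm:%p type:%d val:%ld\n",
				 mm, i, sum[i]);
	}
}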
* Re: [RFC PATCH 0/4] Optimize rss_stat initialization/teardown for single-threaded tasks 2025-11-29 5:57 ` Mateusz Guzik @ 2025-11-29 7:50 ` Mateusz Guzik 2025-12-01 10:38 ` Harry Yoo 2025-12-01 15:23 ` Gabriel Krisman Bertazi 2 siblings, 0 replies; 19+ messages in thread From: Mateusz Guzik @ 2025-11-29 7:50 UTC (permalink / raw) To: Jan Kara Cc: Mathieu Desnoyers, Gabriel Krisman Bertazi, linux-mm, linux-kernel, Shakeel Butt, Michal Hocko, Dennis Zhou, Tejun Heo, Christoph Lameter, Andrew Morton, David Hildenbrand, Lorenzo Stoakes, Liam R. Howlett, Vlastimil Babka, Mike Rapoport, Suren Baghdasaryan, Thomas Gleixner On Sat, Nov 29, 2025 at 06:57:21AM +0100, Mateusz Guzik wrote: > Now to business: > You mentioned the rss loops are a problem. I agree, but they can be > largely damage-controlled. More importantly there are 2 loops of the > sort already happening even with the patchset at hand. > > mm_alloc_cid() results in one loop in the percpu allocator to zero out > the area, then mm_init_cid() performs the following: > for_each_possible_cpu(i) { > struct mm_cid *pcpu_cid = per_cpu_ptr(mm->pcpu_cid, i); > > pcpu_cid->cid = MM_CID_UNSET; > pcpu_cid->recent_cid = MM_CID_UNSET; > pcpu_cid->time = 0; > } > > There is no way this is not visible already on 256 threads. > > Preferably some magic would be done to init this on first use on given > CPU.There is some bitmap tracking CPU presence, maybe this can be > tackled on top of it. But for the sake of argument let's say that's > too expensive or perhaps not feasible. Even then, the walk can be done > *once* by telling the percpu allocator to refrain from zeroing memory. > > Which brings me to rss counters. In the current kernel that's > *another* loop over everything to zero it out. But it does not have to > be that way. Suppose bitmap shenanigans mentioned above are no-go for > these as well. > So I had another look and I think bitmapping it is perfectly feasible, albeit requiring a little bit of refactoring to avoid adding overhead in the common case. There is a bitmap for tlb tracking, updated like so on context switch in switch_mm_irqs_off(): if (next != &init_mm && !cpumask_test_cpu(cpu, mm_cpumask(next))) cpumask_set_cpu(cpu, mm_cpumask(next)); ... and of course cleared at times. The easiest way out would be to add an additional bitmap with bits which are *never* cleared. But that's another cache miss, preferably avoided. Instead the entire thing could be reimplemented to have 2 bits per CPU in the bitmap -- one for tlb and another for ever running on it. Having spotted you are running on the given cpu for the first time, the rss area gets zeroed out and *both* bits get set et voila. The common case gets away with the same load as always. The less common case gets more work of having to zero the counters and initialize cid. In return both cid and rss handling can avoid mandatory linear walks by cpu count, instead merely having to visit the cpus known to have used a given mm. I don't think this is particularly ugly or complicated, just needs some care & time to sit through and refactor away all the direct access into helpers. So if I was tasked with working on the overall problem, I would definitely try to get this done. Fortunately for me this is not the case. :-) ^ permalink raw reply [flat|nested] 19+ messages in thread
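A rough sketch of where the "ever ran on this CPU" tracking could hook in. mm_ever_cpumask() is a hypothetical accessor standing in for the sticky second bit per CPU described above, none of this exists in the kernel beyond the cpumask/per-cpu helpers used, and ordering against remote readers of the counters is ignored:

/*
 * Hypothetical: a never-cleared mask records every CPU this mm has run
 * on.  The first arrival on a CPU lazily initializes only that CPU's
 * rss and cid state; exit-time walks can then iterate this mask instead
 * of all possible CPUs.
 */
static inline void mm_note_first_run(struct mm_struct *mm, int cpu)
{
	struct mm_cid *pcpu_cid;
	int i;

	if (likely(cpumask_test_cpu(cpu, mm_ever_cpumask(mm))))
		return;		/* common case: one test, no extra work */

	for (i = 0; i < NR_MM_COUNTERS; i++)
		*per_cpu_ptr(mm->rss_stat[i].counters, cpu) = 0;

	pcpu_cid = per_cpu_ptr(mm->pcpu_cid, cpu);
	pcpu_cid->cid = MM_CID_UNSET;
	pcpu_cid->recent_cid = MM_CID_UNSET;
	pcpu_cid->time = 0;

	cpumask_set_cpu(cpu, mm_ever_cpumask(mm));
}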
* Re: [RFC PATCH 0/4] Optimize rss_stat initialization/teardown for single-threaded tasks 2025-11-29 5:57 ` Mateusz Guzik 2025-11-29 7:50 ` Mateusz Guzik @ 2025-12-01 10:38 ` Harry Yoo 2025-12-01 11:31 ` Mateusz Guzik 2025-12-01 15:23 ` Gabriel Krisman Bertazi 2 siblings, 1 reply; 19+ messages in thread From: Harry Yoo @ 2025-12-01 10:38 UTC (permalink / raw) To: Mateusz Guzik Cc: Jan Kara, Mathieu Desnoyers, Gabriel Krisman Bertazi, linux-mm, linux-kernel, Shakeel Butt, Michal Hocko, Dennis Zhou, Tejun Heo, Christoph Lameter, Andrew Morton, David Hildenbrand, Lorenzo Stoakes, Liam R. Howlett, Vlastimil Babka, Mike Rapoport, Suren Baghdasaryan, Thomas Gleixner On Sat, Nov 29, 2025 at 06:57:21AM +0100, Mateusz Guzik wrote: > On Fri, Nov 28, 2025 at 9:10 PM Jan Kara <jack@suse.cz> wrote: > > On Fri 28-11-25 08:30:08, Mathieu Desnoyers wrote: > > > What would really reduce memory allocation overhead on fork > > > is to move all those fields into a top level > > > "struct mm_percpu_struct" as a first step. This would > > > merge 3 per-cpu allocations into one when forking a new > > > task. > > > > > > Then the second step is to create a mm_percpu_struct > > > cache to bypass the per-cpu allocator. > > > > > > I suspect that by doing just that we'd get most of the > > > performance benefits provided by the single-threaded special-case > > > proposed here. > > > > I don't think so. Because in the profiles I have been doing for these > > loads the biggest cost wasn't actually the per-cpu allocation itself but > > the cost of zeroing the allocated counter for many CPUs (and then the > > counter summarization on exit) and you're not going to get rid of that with > > just reshuffling per-cpu fields and adding slab allocator in front. > > > > The entire ordeal has been discussed several times already. I'm rather > disappointed there is a new patchset posted which does not address any > of it and goes straight to special-casing single-threaded operation. > > The major claims (by me anyway) are: > 1. single-threaded operation for fork + exec suffers avoidable > overhead even without the rss counter problem, which are tractable > with the same kind of thing which would sort out the multi-threaded > problem > 2. unfortunately there is an increasing number of multi-threaded (and > often short lived) processes (example: lld, the linker form the llvm > project; more broadly plenty of things Rust where people think > threading == performance) > > Bottom line is, solutions like the one proposed in the patchset are at > best a stopgap and even they leave performance on the table for the > case they are optimizing for. > > The pragmatic way forward (as I see it anyway) is to fix up the > multi-threaded thing and see if trying to special case for > single-threaded case is justifiable afterwards. > > Given that the current patchset has to resort to atomics in certain > cases, there is some error-pronnes and runtime overhead associated > with it going beyond merely checking if the process is > single-threaded, which puts an additional question mark on it. > > Now to business: > You mentioned the rss loops are a problem. I agree, but they can be > largely damage-controlled. More importantly there are 2 loops of the > sort already happening even with the patchset at hand. 
> > mm_alloc_cid() results in one loop in the percpu allocator to zero out > the area, then mm_init_cid() performs the following: > for_each_possible_cpu(i) { > struct mm_cid *pcpu_cid = per_cpu_ptr(mm->pcpu_cid, i); > > pcpu_cid->cid = MM_CID_UNSET; > pcpu_cid->recent_cid = MM_CID_UNSET; > pcpu_cid->time = 0; > } > > There is no way this is not visible already on 256 threads. > > Preferably some magic would be done to init this on first use on given > CPU.There is some bitmap tracking CPU presence, maybe this can be > tackled on top of it. But for the sake of argument let's say that's > too expensive or perhaps not feasible. Even then, the walk can be done > *once* by telling the percpu allocator to refrain from zeroing memory. > > Which brings me to rss counters. In the current kernel that's > *another* loop over everything to zero it out. But it does not have to > be that way. Suppose bitmap shenanigans mentioned above are no-go for > these as well. > > So instead the code could reach out to the percpu allocator to > allocate memory for both cid and rss (as mentined by Mathieu), but > have it returned uninitialized and loop over it once sorting out both > cid and rss in the same body. This should be drastically faster than > the current code. > > But one may observe it is an invariant the values sum up to 0 on process exit. > > So if one was to make sure the first time this is handed out by the > percpu allocator the values are all 0s and then cache the area > somewhere for future allocs/frees of mm, there would be no need to do > the zeroing on alloc. That's what slab constructor is for! > On the free side summing up rss counters in check_mm() is only there > for debugging purposes. Suppose it is useful enough that it needs to > stay. Even then, as implemented right now, this is just slow for no > reason: > > for (i = 0; i < NR_MM_COUNTERS; i++) { > long x = percpu_counter_sum(&mm->rss_stat[i]); > [snip] > } > > That's *four* loops with extra overhead of irq-trips for every single > one. This can be patched up to only do one loop, possibly even with > irqs enabled the entire time. > > Doing the loop is still slower than not doing it, but his may be just > fast enough to obsolete the ideas like in the proposed patchset. > > While per-cpu level caching for all possible allocations seems like > the easiest way out, it in fact does *NOT* fully solve problem -- you > are still going to globally serialize in lru_gen_add_mm() (and the del > part), pgd_alloc() and other places. > > Or to put it differently, per-cpu caching of mm_struct itself makes no > sense in the current kernel (with the patchset or not) because on the > way to finish the alloc or free you are going to globally serialize > several times and *that* is the issue to fix in the long run. You can > make the problematic locks fine-grained (and consequently alleviate > the scalability aspect), but you are still going to suffer the > overhead of taking them. > > As far as I'm concerned the real long term solution(tm) would make the > cached mm's retain the expensive to sort out state -- list presence, > percpu memory and whatever else. > > To that end I see 2 feasible approaches: > 1. a dedicated allocator with coarse granularity > > Instead of per-cpu, you could have an instance for every n threads > (let's say 8 or whatever). this would pose a tradeoff between total > memory usage and scalability outside of a microbenchmark setting. 
you > are still going to serialize in some cases, but only once on alloc and > once on free, not several times and you are still cheaper > single-threaded. This is faster all around. > > 2. dtor support in the slub allocator > > ctor does the hard work and dtor undoes it. There is an unfinished > patchset by Harry which implements the idea[1]. Apologies for not reposting it for a while. I have limited capacity to push this forward right now, but FYI... I just pushed slab-destructor-rfc-v2r2-wip branch after rebasing it onto the latest slab/for-next. https://gitlab.com/hyeyoo/linux/-/commits/slab-destructor-rfc-v2r2-wip?ref_type=heads My review on the version is limited, but did a little bit of testing. > There is a serious concern about deadlock potential stemming from > running arbitrary dtor code during memory reclaim. I already described > elsewhere how with a little bit of discipline supported by lockdep > this is a non-issue (tl;dr add spinlocks marked as "leaf" (you can't > take any locks if you hold them and you have to disable interrupts) + > mark dtors as only allowed to hold a leaf spinlock et voila, code > guaranteed to not deadlock). But then all code trying to cache its > state in to be undone with dtor has to be patched to facilitate it. > Again bugs in the area sorted out by lockdep. > > The good news is that folks were apparently open to punting reclaim of > such memory into a workqueue, which completely alleviates that concern > anyway. I took the good news and switched to using workqueue to reclaim slabs (for caches with dtor) in v2. > So happens if fork + exit is involved there are numerous other > bottlenecks which overshadow the above, but that's a rant for another > day. Here we can pretend for a minute they are solved. > > [1] https://gitlab.com/hyeyoo/linux/-/commits/slab-destructor-rfc-v2-wip?ref_type=heads -- Cheers, Harry / Hyeonggon ^ permalink raw reply [flat|nested] 19+ messages in thread
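For readers who have not looked at the branch, the shape of the idea is roughly the pair below. Note that mainline slab has no destructor and its constructors return void, so the int-returning ctor and the existence of a dtor hook are assumptions following the RFC, not upstream API; the slab-side registration is omitted for the same reason:

/*
 * Sketch of a ctor/dtor pair that keeps the expensive per-CPU state
 * alive across mm_struct alloc/free cycles.
 */
static int mm_struct_ctor_sketch(void *object)
{
	struct mm_struct *mm = object;

	/* Paid once per slab object, not once per fork(). */
	if (percpu_counter_init_many(mm->rss_stat, 0, GFP_KERNEL,
				     NR_MM_COUNTERS))
		return -ENOMEM;
	return 0;
}

static void mm_struct_dtor_sketch(void *object)
{
	struct mm_struct *mm = object;

	/* Only runs when the slab object itself is finally reclaimed. */
	percpu_counter_destroy_many(mm->rss_stat, NR_MM_COUNTERS);
}

Because the counters are known to sum to zero when a well-behaved process exits (the invariant noted earlier in the thread), a reused object needs no re-zeroing on the next fork.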
* Re: [RFC PATCH 0/4] Optimize rss_stat initialization/teardown for single-threaded tasks 2025-12-01 10:38 ` Harry Yoo @ 2025-12-01 11:31 ` Mateusz Guzik 2025-12-01 14:47 ` Mathieu Desnoyers 0 siblings, 1 reply; 19+ messages in thread From: Mateusz Guzik @ 2025-12-01 11:31 UTC (permalink / raw) To: Harry Yoo Cc: Jan Kara, Mathieu Desnoyers, Gabriel Krisman Bertazi, linux-mm, linux-kernel, Shakeel Butt, Michal Hocko, Dennis Zhou, Tejun Heo, Christoph Lameter, Andrew Morton, David Hildenbrand, Lorenzo Stoakes, Liam R. Howlett, Vlastimil Babka, Mike Rapoport, Suren Baghdasaryan, Thomas Gleixner On Mon, Dec 1, 2025 at 11:39 AM Harry Yoo <harry.yoo@oracle.com> wrote: > Apologies for not reposting it for a while. I have limited capacity to push > this forward right now, but FYI... I just pushed slab-destructor-rfc-v2r2-wip > branch after rebasing it onto the latest slab/for-next. > > https://gitlab.com/hyeyoo/linux/-/commits/slab-destructor-rfc-v2r2-wip?ref_type=heads > nice, thanks. This takes care of the majority of the needful(tm). To reiterate, should something like this land, it is going to address the multicore scalability concern for single-threaded processes better than the patchset by Gabriel thanks to also taking care of cid. Bonus points for handling creation and teardown of multi-threaded processes. However, this is still going to suffer from doing a full cpu walk on process exit. As I described earlier the current handling can be massively depessimized by reimplementing this to take care of all 4 counters in each iteration, instead of walking everything 4 times. This is still going to be slower than not doing the walk at all, but it may be fast enough that Gabriel's patchset is no longer justifiable. But then the test box is "only" 256 hw threads, what about bigger boxes? Given my previous note about increased use of multithreading in userspace, the more concerned you happen to be about such a walk, the more you want an actual solution which takes care of multithreaded processes. Additionally one has to assume per-cpu memory will be useful for other facilities down the line, making such a walk into an even bigger problem. Thus ultimately *some* tracking of whether a given mm was ever active on a given cpu is needed, preferably cheaply implemented at least for the context switch code. Per what I described in another e-mail, one way to do it would be to coalesce it with tlb handling by changing how the bitmap tracking is handled -- having 2 adjacent bits denote cpu usage + tlb separately. For the common case this should be almost the same code to set the two. Iteration for tlb shootdowns would be less efficient but that's probably tolerable. Maybe there is a better way, I did not put much thought into it. I just claim sooner or later this will need to get solved. At the same time it would be a bummer to add stopgaps without even trying. With the cpu tracking problem solved, check_mm would visit a few cpus in the benchmark (probably just 1) and it would be faster single-threaded than the proposed patch *and* would retain that for processes which went multithreaded. I'm not signing up to handle this though and someone else would have to sign off on the cpu tracking thing anyway. That is to say, I laid out the lay of the land as I see it but I'm not doing any work. :) ^ permalink raw reply [flat|nested] 19+ messages in thread
* Re: [RFC PATCH 0/4] Optimize rss_stat initialization/teardown for single-threaded tasks 2025-12-01 11:31 ` Mateusz Guzik @ 2025-12-01 14:47 ` Mathieu Desnoyers 0 siblings, 0 replies; 19+ messages in thread From: Mathieu Desnoyers @ 2025-12-01 14:47 UTC (permalink / raw) To: Mateusz Guzik, Harry Yoo Cc: Jan Kara, Gabriel Krisman Bertazi, linux-mm, linux-kernel, Shakeel Butt, Michal Hocko, Dennis Zhou, Tejun Heo, Christoph Lameter, Andrew Morton, David Hildenbrand, Lorenzo Stoakes, Liam R. Howlett, Vlastimil Babka, Mike Rapoport, Suren Baghdasaryan, Thomas Gleixner On 2025-12-01 06:31, Mateusz Guzik wrote: > On Mon, Dec 1, 2025 at 11:39 AM Harry Yoo <harry.yoo@oracle.com> wrote: >> Apologies for not reposting it for a while. I have limited capacity to push >> this forward right now, but FYI... I just pushed slab-destructor-rfc-v2r2-wip >> branch after rebasing it onto the latest slab/for-next. >> >> https://gitlab.com/hyeyoo/linux/-/commits/slab-destructor-rfc-v2r2-wip?ref_type=heads >> > > nice, thanks. This takes care of majority of the needful(tm). > > To reiterate, should something like this land, it is going to address > the multicore scalability concern for single-threaded processes better > than the patchset by Gabriel thanks to also taking care of cid. Bonus > points for handling creation and teardown of multi-threaded processes. > > However, this is still going to suffer from doing a full cpu walk on > process exit. As I described earlier the current handling can be > massively depessimized by reimplementing this to take care of all 4 > counters in each iteration, instead of walking everything 4 times. > This is still going to be slower than not doing the walk at all, but > it may be fast enough that Gabriel's patchset is no longer > justifiable. > > But then the test box is "only" 256 hw threads, what about bigger boxes? > > Given my previous note about increased use of multithreading in > userspace, the more concerned you happen to be about such a walk, the > more you want an actual solution which takes care of multithreaded > processes. > > Additionally one has to assume per-cpu memory will be useful for other > facilities down the line, making such a walk into an even bigger > problem. > > Thus ultimately *some* tracking of whether given mm was ever active on > a given cpu is needed, preferably cheaply implemented at least for the > context switch code. Per what I described in another e-mail, one way > to do it would be to coalesce it with tlb handling by changing how the > bitmap tracking is handled -- having 2 adjacent bits denote cpu usage > + tlb separately. For the common case this should be almost the code > to set the two. Iteration for tlb shootdowns would be less efficient > but that's probably tolerable. Maybe there is a better way, I did not > put much thought into it. I just claim sooner or later this will need > to get solved. At the same time would be a bummer to add stopgaps > without even trying. > > With the cpu tracking problem solved, check_mm would visit few cpus in > the benchmark (probably just 1) and it would be faster single-threaded > than the proposed patch *and* would retain that for processes which > went multithreaded. Looking at this problem, it appears to be a good fit for rseq mm_cid (per-mm concurrency ids). Let me explain. I originally implemented the rseq mm_cid for userspace. It keeps track of max_mm_cid = min(nr_threads, nr_allowed_cpus) for each mm, and lets the scheduler select a current mm_cid value within the range [0 .. 
max_mm_cid - 1]. With Thomas Gleixner's rewrite (currently in tip), we even have hooks in thread clone/exit where we know when max_mm_cid is increased/decreased for a mm. So we could keep track of the maximum value of max_mm_cid over the lifetime of a mm. So using mm_cid for per-mm rss counter would involve: - Still allocating memory per-cpu on mm allocation (nr_cpu_ids), but without zeroing all that memory (we eliminate a possible cpus walk on allocation). - Initialize CPU counters on thread clone when max_mm_cid is increased. Keep track of the max value of max_mm_cid over mm lifetime. - Rather than using the per-cpu accessors to access the counters, we would have to load the per-task mm_cid field to get the counter index. This would have a slight added overhead on the fast path, because we would change a segment-selector prefix operation for an access that depends on a load of the task struct current mm_cid index. - Iteration on all possible cpus at process exit is replaced by an iteration on mm maximum max_mm_cid, which will be bound by the maximum value of min(nr_threads, nr_allowed_cpus) over the mm lifetime. This iteration should be done with the new mm_cid mutex held across thread clone/exit. One more downside to consider is loss of NUMA locality, because the index used to access the per-cpu memory would not take into account the hardware topology. The index to topology should stay stable for a given mm, but if we mix the memory allocation of per-cpu data across different mm, then the NUMA locality would be degraded. Ideally we'd have a per-cpu allocator with per-mm arenas for mm_cid indexing if we care about NUMA locality. So let's say you have a 256-core machine, where cpu numbers can go from 0 to 255, with a 4-thread process, mm_cid will be limited to the range [0..3]. Likewise if there are tons of threads in a process limited to a few cores (e.g. pinned on cores from 10 to 19), which will limit the range to [0..9]. This approach solves the runtime overhead issue of zeroing per-cpu memory for all scenarios: * single-threaded: index = 0 * nr_threads < nr_cpu_ids * nr_threads < nr_allowed_cpus: index = [0 .. nr_threads - 1] * nr_threads >= nr_allowed_cpus: index = [0 .. nr_allowed_cpus - 1] * nr_threads >= nr_cpus_ids: index = [0 .. nr_cpu_ids - 1] Thanks, Mathieu -- Mathieu Desnoyers EfficiOS Inc. https://www.efficios.com ^ permalink raw reply [flat|nested] 19+ messages in thread
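To make the proposed indexing change concrete, a rough sketch follows. rss_cid_stat and max_mm_cid_seen are invented fields standing in for whatever layout a real series would use, the way the current cid is read may differ after the rewrite in tip, and cid reassignment on migration is glossed over (only the thread currently owning a cid may write its slots):

/*
 * Sketch: rss counters indexed by the task's mm_cid rather than by CPU.
 * The fast path trades a segment-prefixed this_cpu op for a load of
 * current->mm_cid; teardown iterates up to the maximum concurrency the
 * mm ever had instead of over all possible CPUs.
 */
static inline void mm_counter_add_cid(struct mm_struct *mm, int member, long value)
{
	int cid = READ_ONCE(current->mm_cid);

	/* only the thread owning this cid writes these slots */
	mm->rss_cid_stat[cid][member] += value;
}

static void mm_counter_check_cid(struct mm_struct *mm)
{
	int i, cid;

	for (i = 0; i < NR_MM_COUNTERS; i++) {
		long sum = 0;

		/* single-threaded and unpinned: max_mm_cid_seen == 1 */
		for (cid = 0; cid < mm->max_mm_cid_seen; cid++)
			sum += mm->rss_cid_stat[cid][i];

		WARN_ON_ONCE(sum);
	}
}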
* Re: [RFC PATCH 0/4] Optimize rss_stat initialization/teardown for single-threaded tasks 2025-11-29 5:57 ` Mateusz Guzik 2025-11-29 7:50 ` Mateusz Guzik 2025-12-01 10:38 ` Harry Yoo @ 2025-12-01 15:23 ` Gabriel Krisman Bertazi 2025-12-01 19:16 ` Harry Yoo 2025-12-03 11:02 ` Mateusz Guzik 2 siblings, 2 replies; 19+ messages in thread From: Gabriel Krisman Bertazi @ 2025-12-01 15:23 UTC (permalink / raw) To: Mateusz Guzik Cc: Jan Kara, Mathieu Desnoyers, linux-mm, linux-kernel, Shakeel Butt, Michal Hocko, Dennis Zhou, Tejun Heo, Christoph Lameter, Andrew Morton, David Hildenbrand, Lorenzo Stoakes, Liam R. Howlett, Vlastimil Babka, Mike Rapoport, Suren Baghdasaryan, Thomas Gleixner Mateusz Guzik <mjguzik@gmail.com> writes: > On Fri, Nov 28, 2025 at 9:10 PM Jan Kara <jack@suse.cz> wrote: >> On Fri 28-11-25 08:30:08, Mathieu Desnoyers wrote: >> > What would really reduce memory allocation overhead on fork >> > is to move all those fields into a top level >> > "struct mm_percpu_struct" as a first step. This would >> > merge 3 per-cpu allocations into one when forking a new >> > task. >> > >> > Then the second step is to create a mm_percpu_struct >> > cache to bypass the per-cpu allocator. >> > >> > I suspect that by doing just that we'd get most of the >> > performance benefits provided by the single-threaded special-case >> > proposed here. >> >> I don't think so. Because in the profiles I have been doing for these >> loads the biggest cost wasn't actually the per-cpu allocation itself but >> the cost of zeroing the allocated counter for many CPUs (and then the >> counter summarization on exit) and you're not going to get rid of that with >> just reshuffling per-cpu fields and adding slab allocator in front. >> Hi Mateusz, > The major claims (by me anyway) are: > 1. single-threaded operation for fork + exec suffers avoidable > overhead even without the rss counter problem, which are tractable > with the same kind of thing which would sort out the multi-threaded > problem Agreed, there are more issues in the fork/exec path than just the rss_stat. The rss_stat performance is particularly relevant to us, though, because it is a clear regression for single-threaded tasks introduced in 6.2. I took the time to test the slab constructor approach with the /sbin/true microbenchmark. I've seen only 2% gain on that tight loop in the 80c machine, which, granted, is an artificial benchmark, but still a good stressor of the single-threaded case. With this patchset, I reported 6% improvement, getting it close to the performance before the pcpu rss_stats introduction. This is expected, as avoiding the pcpu allocation and initialization altogether for the single-threaded case, where it is not necessary, will always be better than speeding up the allocation (even though that is a worthwhile effort in itself, as Mathieu pointed out). > 2. unfortunately there is an increasing number of multi-threaded (and > often short lived) processes (example: lld, the linker form the llvm > project; more broadly plenty of things Rust where people think > threading == performance) I don't agree with this argument, though. Sure, there is an increasing number of multi-threaded applications, but this is not relevant. The relevant argument is the amount of single-threaded workloads. One example is coreutils, which are spawned to death by scripts. I did take care to test the patchset with a full distro on my day-to-day laptop and I wasn't surprised to see the vast majority of forked tasks never fork a second thread. 
The ones that do are most often long-lived applications, where the cost of mm initialization is way less relevant to the overall system performance. Another example is the fact that real-world benchmarks, like kernbench, can be improved by special-casing single-threaded tasks. > The pragmatic way forward (as I see it anyway) is to fix up the > multi-threaded thing and see if trying to special case for > single-threaded case is justifiable afterwards. > > Given that the current patchset has to resort to atomics in certain > cases, there is some error-pronnes and runtime overhead associated > with it going beyond merely checking if the process is > single-threaded, which puts an additional question mark on it. I don't get why atomics would make it error-prone. But, regarding the runtime overhead, please note the main point of this approach is that the hot path can be handled with a simple non-atomic variable write in the task context, and not an atomic operation. The latter is only used for the infrequent case where the counter is touched by an external task such as OOM, khugepaged, etc. > > Now to business: -- Gabriel Krisman Bertazi ^ permalink raw reply [flat|nested] 19+ messages in thread
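As a rough model of the split Gabriel describes (a plain write for the owner, an atomic only for the infrequent remote updaters); this is not the patchset's actual data structure, the names are invented, and the upgrade to a full percpu_counter once a second thread appears is left out:

/*
 * Model of a dual-mode counter: while the mm is single-threaded the
 * owning task updates a plain long; remote updaters (OOM, khugepaged,
 * ...) fold their contribution into a separate atomic.
 */
struct lazy_counter {
	long		local;		/* owner-only, task context */
	atomic_long_t	remote;		/* infrequent external updates */
};

static inline void lazy_counter_add(struct lazy_counter *c, long v, bool owner)
{
	if (owner)
		c->local += v;			/* fast path: no atomics */
	else
		atomic_long_add(v, &c->remote);	/* rare remote update */
}

static inline long lazy_counter_read(const struct lazy_counter *c)
{
	return READ_ONCE(c->local) + atomic_long_read(&c->remote);
}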
* Re: [RFC PATCH 0/4] Optimize rss_stat initialization/teardown for single-threaded tasks 2025-12-01 15:23 ` Gabriel Krisman Bertazi @ 2025-12-01 19:16 ` Harry Yoo 2025-12-03 11:02 ` Mateusz Guzik 1 sibling, 0 replies; 19+ messages in thread From: Harry Yoo @ 2025-12-01 19:16 UTC (permalink / raw) To: Gabriel Krisman Bertazi Cc: Mateusz Guzik, Jan Kara, Mathieu Desnoyers, linux-mm, linux-kernel, Shakeel Butt, Michal Hocko, Dennis Zhou, Tejun Heo, Christoph Lameter, Andrew Morton, David Hildenbrand, Lorenzo Stoakes, Liam R. Howlett, Vlastimil Babka, Mike Rapoport, Suren Baghdasaryan, Thomas Gleixner On Mon, Dec 01, 2025 at 10:23:43AM -0500, Gabriel Krisman Bertazi wrote: > Mateusz Guzik <mjguzik@gmail.com> writes: > > > On Fri, Nov 28, 2025 at 9:10 PM Jan Kara <jack@suse.cz> wrote: > >> On Fri 28-11-25 08:30:08, Mathieu Desnoyers wrote: > >> > What would really reduce memory allocation overhead on fork > >> > is to move all those fields into a top level > >> > "struct mm_percpu_struct" as a first step. This would > >> > merge 3 per-cpu allocations into one when forking a new > >> > task. > >> > > >> > Then the second step is to create a mm_percpu_struct > >> > cache to bypass the per-cpu allocator. > >> > > >> > I suspect that by doing just that we'd get most of the > >> > performance benefits provided by the single-threaded special-case > >> > proposed here. > >> > >> I don't think so. Because in the profiles I have been doing for these > >> loads the biggest cost wasn't actually the per-cpu allocation itself but > >> the cost of zeroing the allocated counter for many CPUs (and then the > >> counter summarization on exit) and you're not going to get rid of that with > >> just reshuffling per-cpu fields and adding slab allocator in front. > >> > > Hi Mateusz, > > > The major claims (by me anyway) are: > > 1. single-threaded operation for fork + exec suffers avoidable > > overhead even without the rss counter problem, which are tractable > > with the same kind of thing which would sort out the multi-threaded > > problem > > Agreed, there are more issues in the fork/exec path than just the > rss_stat. The rss_stat performance is particularly relevant to us, > though, because it is a clear regression for single-threaded introduced > in 6.2. > > I took the time to test the slab constructor approach with the > /sbin/true microbenchmark. I've seen only 2% gain on that tight loop in > the 80c machine, which, granted, is an artificial benchmark, but still a > good stressor of the single-threaded case. With this patchset, I > reported 6% improvement, getting it close to the performance before the > pcpu rss_stats introduction. Hi Gabriel, I don't want to argue which approach is better, but just wanted to mention that maybe this is not a fair comparison because we can (almost) eliminate initialization cost with slab ctor & dtor pair. As Mateusz pointed out, under normal conditions, we know that the sum of each rss_stat counter is zero when it's freed. That is what slab constructor is for; if we know that certain fields of a type are freed in a particular state, then we only need to initialize them once in the constructor when the object is first created, and no initialization is needed for subsequent allocations. We couldn't use slab constructor to do this because percpu memory is not allocated when it's called, but with ctor/dtor pair we can do this. 
> This is expected, as avoiding the pcpu > allocation and initialization all together for the single-threaded case, > where it is not necessary, will always be better than speeding up the > allocation (even though that a worthwhile effort itself, as Mathieu > pointed out). -- Cheers, Harry / Hyeonggon ^ permalink raw reply [flat|nested] 19+ messages in thread
* Re: [RFC PATCH 0/4] Optimize rss_stat initialization/teardown for single-threaded tasks 2025-12-01 15:23 ` Gabriel Krisman Bertazi 2025-12-01 19:16 ` Harry Yoo @ 2025-12-03 11:02 ` Mateusz Guzik 2025-12-03 11:54 ` Mateusz Guzik 1 sibling, 1 reply; 19+ messages in thread From: Mateusz Guzik @ 2025-12-03 11:02 UTC (permalink / raw) To: Gabriel Krisman Bertazi Cc: Jan Kara, Mathieu Desnoyers, linux-mm, linux-kernel, Shakeel Butt, Michal Hocko, Dennis Zhou, Tejun Heo, Christoph Lameter, Andrew Morton, David Hildenbrand, Lorenzo Stoakes, Liam R. Howlett, Vlastimil Babka, Mike Rapoport, Suren Baghdasaryan, Thomas Gleixner On Mon, Dec 1, 2025 at 4:23 PM Gabriel Krisman Bertazi <krisman@suse.de> wrote: > > Mateusz Guzik <mjguzik@gmail.com> writes: > > The major claims (by me anyway) are: > > 1. single-threaded operation for fork + exec suffers avoidable > > overhead even without the rss counter problem, which are tractable > > with the same kind of thing which would sort out the multi-threaded > > problem > > Agreed, there are more issues in the fork/exec path than just the > rss_stat. The rss_stat performance is particularly relevant to us, > though, because it is a clear regression for single-threaded introduced > in 6.2. > > I took the time to test the slab constructor approach with the > /sbin/true microbenchmark. I've seen only 2% gain on that tight loop in > the 80c machine, which, granted, is an artificial benchmark, but still a > good stressor of the single-threaded case. With this patchset, I > reported 6% improvement, getting it close to the performance before the > pcpu rss_stats introduction. This is expected, as avoiding the pcpu > allocation and initialization all together for the single-threaded case, > where it is not necessary, will always be better than speeding up the > allocation (even though that a worthwhile effort itself, as Mathieu > pointed out) I'm fine with the benchmark method, but it was used on a kernel which remains gimped by the avoidably slow walk in check_mm which I already talked about. Per my prior commentary, it can be patched up to only do the walk once instead of 4 times, and without taking locks. But that's still more work than nothing and let's say that's still too slow. 2 ideas were proposed for how to avoid the walk altogether: I proposed expanding the tlb bitmap and Mathieu went with the cid machinery. Either way the walk over all CPUs is not there. With the walk issue fixed and all allocations cached thanks to ctor/dtor, even the single-threaded fork/exec will be faster than it is with your patch thanks to *never* reaching into the per-cpu allocator (with your patch it is still going to happen for the cid stuff). Additionally there are other locks which can be elided later with the ctor/dtor pair, further improving perf. > > > 2. unfortunately there is an increasing number of multi-threaded (and > > often short lived) processes (example: lld, the linker form the llvm > > project; more broadly plenty of things Rust where people think > > threading == performance) > > I don't agree with this argument, though. Sure, there is an increasing > amount of multi-threaded applications, but this is not relevant. The > relevant argument is the amount of single-threaded workloads. One > example are coreutils, which are spawned to death by scripts. I did > take the care of testing the patchset with a full distro on my > day-to-day laptop and I wasn't surprised to see the vast majority of > forked tasks never fork a second thread. 
The ones that do are most > often long-lived applications, where the cost of mm initialization is > way less relevant to the overall system performance. Another example is > the fact real-world benchmarks, like kernbench, can be improved with > special-casing single-threads. > I stress one more time that a full fixup for the situation as I described above not only gets rid of the problem for *both* single- and multi- threaded operation, but ends up with code which is faster than your patchset even for the case you are patching for. The multi-threaded stuff *is* very much relevant because it is increasingly more common (see below). I did not claim that single-threaded workloads don't matter. I would not be arguing here if there was no feasible way to handle both or if handling the multi-threaded case still resulted in measurable overhead for single-threaded workloads. Since you mention configure scripts, I'm intimately familiar with large-scale building as a workload. While it is true that there is rampant usage of shell, sed and whatnot (all of which are single-threaded), things turn multi-threaded (and short-lived) very quickly once you go past the gnu toolchain and/or c/c++ codebases. For example the llvm linker is multi-threaded and short-lived. Since most real programs are small, during a large scale build of different programs you end up with tons of lld spawning and quitting all the time. Beyond that java, erlang, zig and others like to multi-thread as well. Rust is an emerging ecosystem where people think adding threading equals automatically better performance and where crate authors think it's fine to sneak in threads (my favourite offender is the ctrlc crate). And since Rust is growing in popularity you can expect the kind of single-threaded tooling you see right now will turn multi-threaded from under you over time. > > The pragmatic way forward (as I see it anyway) is to fix up the > > multi-threaded thing and see if trying to special case for > > single-threaded case is justifiable afterwards. > > > > Given that the current patchset has to resort to atomics in certain > > cases, there is some error-pronnes and runtime overhead associated > > with it going beyond merely checking if the process is > > single-threaded, which puts an additional question mark on it. > > I don't get why atomics would make it error-prone. But, regarding the > runtime overhead, please note the main point of this approach is that > the hot path can be handled with a simple non-atomic variable write in > the task context, and not the atomic operation. The later is only used > for infrequent case where the counter is touched by an external task > such as OOM, khugepaged, etc. > The claim is there may be a bug where something should be using the atomic codepath but is not. ^ permalink raw reply [flat|nested] 19+ messages in thread
* Re: [RFC PATCH 0/4] Optimize rss_stat initialization/teardown for single-threaded tasks 2025-12-03 11:02 ` Mateusz Guzik @ 2025-12-03 11:54 ` Mateusz Guzik 2025-12-03 14:36 ` Mateusz Guzik 0 siblings, 1 reply; 19+ messages in thread From: Mateusz Guzik @ 2025-12-03 11:54 UTC (permalink / raw) To: Gabriel Krisman Bertazi Cc: Jan Kara, Mathieu Desnoyers, linux-mm, linux-kernel, Shakeel Butt, Michal Hocko, Dennis Zhou, Tejun Heo, Christoph Lameter, Andrew Morton, David Hildenbrand, Lorenzo Stoakes, Liam R. Howlett, Vlastimil Babka, Mike Rapoport, Suren Baghdasaryan, Thomas Gleixner On Wed, Dec 3, 2025 at 12:02 PM Mateusz Guzik <mjguzik@gmail.com> wrote: > > On Mon, Dec 1, 2025 at 4:23 PM Gabriel Krisman Bertazi <krisman@suse.de> wrote: > > > > Mateusz Guzik <mjguzik@gmail.com> writes: > > > The major claims (by me anyway) are: > > > 1. single-threaded operation for fork + exec suffers avoidable > > > overhead even without the rss counter problem, which are tractable > > > with the same kind of thing which would sort out the multi-threaded > > > problem > > > > Agreed, there are more issues in the fork/exec path than just the > > rss_stat. The rss_stat performance is particularly relevant to us, > > though, because it is a clear regression for single-threaded introduced > > in 6.2. > > > > I took the time to test the slab constructor approach with the > > /sbin/true microbenchmark. I've seen only 2% gain on that tight loop in > > the 80c machine, which, granted, is an artificial benchmark, but still a > > good stressor of the single-threaded case. With this patchset, I > > reported 6% improvement, getting it close to the performance before the > > pcpu rss_stats introduction. This is expected, as avoiding the pcpu > > allocation and initialization all together for the single-threaded case, > > where it is not necessary, will always be better than speeding up the > > allocation (even though that a worthwhile effort itself, as Mathieu > > pointed out) > > I'm fine with the benchmark method, but it was used on a kernel which > remains gimped by the avoidably slow walk in check_mm which I already > talked about. > > Per my prior commentary and can be patched up to only do the walk once > instead of 4 times, and without taking locks. > > But that's still more work than nothing and let's say that's still too > slow. 2 ideas were proposed how to avoid the walk altogether: I > proposed expanding the tlb bitmap and Mathieu went with the cid > machinery. Either way the walk over all CPUs is not there. > So I got another idea and it boils down to coalescing cid init with rss checks on exit. I repeat that with your patchset the single-threaded case is left with one walk on alloc (for cid stuff) and that's where issues arise for machines with tons of cpus. If the walk gets fixed, the same method can be used to avoid the walk for rss, obsoleting the patchset. So let's say it is unfixable for the time being. mm_init_cid stores a bunch of -1 per-cpu. I'm assuming this can't be changed. One can still handle allocation in ctor/dtor and make it an invariant that the state present is ready to use, so in particular mm_init_cid was already issued on it. Then it is on the exit side to clean it up and this is where the walk checks rss state *and* reinits cid in one loop. Excluding the repeat lock and irq trips which don't need to be there, I take it almost all of the overhead is cache misses. With one loop that's sorted out. Maybe I'm going to hack it up, but perhaps Mathieu or Harry would be happy to do it? 
(or have a better idea?) > With the walk issue fixed and all allocations cached thanks ctor/dtor, > even the single-threaded fork/exec will be faster than it is with your > patch thanks to *never* reaching to the per-cpu allocator (with your > patch it is still going to happen for the cid stuff). > > Additionally there are other locks which can be elided later with the > ctor/dtor pair, further improving perf. > > > > > > 2. unfortunately there is an increasing number of multi-threaded (and > > > often short lived) processes (example: lld, the linker form the llvm > > > project; more broadly plenty of things Rust where people think > > > threading == performance) > > > > I don't agree with this argument, though. Sure, there is an increasing > > amount of multi-threaded applications, but this is not relevant. The > > relevant argument is the amount of single-threaded workloads. One > > example are coreutils, which are spawned to death by scripts. I did > > take the care of testing the patchset with a full distro on my > > day-to-day laptop and I wasn't surprised to see the vast majority of > > forked tasks never fork a second thread. The ones that do are most > > often long-lived applications, where the cost of mm initialization is > > way less relevant to the overall system performance. Another example is > > the fact real-world benchmarks, like kernbench, can be improved with > > special-casing single-threads. > > > > I stress one more time that a full fixup for the situation as I > described above not only gets rid of the problem for *both* single- > and multi- threaded operation, but ends up with code which is faster > than your patchset even for the case you are patching for. > > The multi-threaded stuff *is* very much relevant because it is > increasingly more common (see below). I did not claim that > single-threaded workloads don't matter. > > I would not be arguing here if there was no feasible way to handle > both or if handling the multi-threaded case still resulted in > measurable overhead for single-threaded workloads. > > Since you mention configure scripts, I'm intimately familiar with > large-scale building as a workload. While it is true that there is > rampant usage of shell, sed and whatnot (all of which are > single-threaded), things turn multi-threaded (and short-lived) very > quickly once you go past the gnu toolchain and/or c/c++ codebases. > > For example the llvm linker is multi-threaded and short-lived. Since > most real programs are small, during a large scale build of different > programs you end up with tons of lld spawning and quitting all the > time. > > Beyond that java, erlang, zig and others like to multi-thread as well. > > Rust is an emerging ecosystem where people think adding threading > equals automatically better performance and where crate authors think > it's fine to sneak in threads (my favourite offender is the ctrlc > crate). And since Rust is growing in popularity you can expect the > kind of single-threaded tooling you see right now will turn > multi-threaded from under you over time. > > > > The pragmatic way forward (as I see it anyway) is to fix up the > > > multi-threaded thing and see if trying to special case for > > > single-threaded case is justifiable afterwards. 
> > > > > > Given that the current patchset has to resort to atomics in certain > > > cases, there is some error-pronnes and runtime overhead associated > > > with it going beyond merely checking if the process is > > > single-threaded, which puts an additional question mark on it. > > > > I don't get why atomics would make it error-prone. But, regarding the > > runtime overhead, please note the main point of this approach is that > > the hot path can be handled with a simple non-atomic variable write in > > the task context, and not the atomic operation. The later is only used > > for infrequent case where the counter is touched by an external task > > such as OOM, khugepaged, etc. > > > > The claim is there may be a bug where something should be using the > atomic codepath but is not. ^ permalink raw reply [flat|nested] 19+ messages in thread
* Re: [RFC PATCH 0/4] Optimize rss_stat initialization/teardown for single-threaded tasks 2025-12-03 11:54 ` Mateusz Guzik @ 2025-12-03 14:36 ` Mateusz Guzik 0 siblings, 0 replies; 19+ messages in thread From: Mateusz Guzik @ 2025-12-03 14:36 UTC (permalink / raw) To: Gabriel Krisman Bertazi Cc: Jan Kara, Mathieu Desnoyers, linux-mm, linux-kernel, Shakeel Butt, Michal Hocko, Dennis Zhou, Tejun Heo, Christoph Lameter, Andrew Morton, David Hildenbrand, Lorenzo Stoakes, Liam R. Howlett, Vlastimil Babka, Mike Rapoport, Suren Baghdasaryan, Thomas Gleixner On Wed, Dec 03, 2025 at 12:54:34PM +0100, Mateusz Guzik wrote: > So I got another idea and it boils down to coalescing cid init with > rss checks on exit. > So short version is I implemented a POC and I have the same performance for single-threaded processes as your patchset when testing on Sapphire Rapids in an 80-way vm. Caveats: - there is a performance bug on the cpu vs rep movsb (see https://lore.kernel.org/all/mwwusvl7jllmck64xczeka42lglmsh7mlthuvmmqlmi5stp3na@raiwozh466wz/), I worked around it like so: diff --git a/arch/x86/Makefile b/arch/x86/Makefile index e20e25b8b16c..1b538f7bbd89 100644 --- a/arch/x86/Makefile +++ b/arch/x86/Makefile @@ -189,6 +189,29 @@ ifeq ($(CONFIG_STACKPROTECTOR),y) endif endif +ifdef CONFIG_CC_IS_GCC +# +# Inline memcpy and memset handling policy for gcc. +# +# For ops of sizes known at compilation time it quickly resorts to issuing rep +# movsq and stosq. On most uarchs rep-prefixed ops have a significant startup +# latency and it is faster to issue regular stores (even if in loops) to handle +# small buffers. +# +# This of course comes at an expense in terms of i-cache footprint. bloat-o-meter +# reported 0.23% increase for enabling these. +# +# We inline up to 256 bytes, which in the best case issues few movs, in the +# worst case creates a 4 * 8 store loop. +# +# The upper limit was chosen semi-arbitrarily -- uarchs wildly differ between a +# threshold past which a rep-prefixed op becomes faster, 256 being the lowest +# common denominator. Someone(tm) should revisit this from time to time. +# +KBUILD_CFLAGS += -mmemcpy-strategy=unrolled_loop:256:noalign,libcall:-1:noalign +KBUILD_CFLAGS += -mmemset-strategy=unrolled_loop:256:noalign,libcall:-1:noalign +endif + # # If the function graph tracer is used with mcount instead of fentry, # '-maccumulate-outgoing-args' is needed to prevent a GCC bug - qemu version i'm saddled with does not pass FSRS to the guest, thus: diff --git a/arch/x86/lib/memset_64.S b/arch/x86/lib/memset_64.S index fb5a03cf5ab7..a692bb4cece4 100644 --- a/arch/x86/lib/memset_64.S +++ b/arch/x86/lib/memset_64.S @@ -30,7 +30,7 @@ * which the compiler could/should do much better anyway. */ SYM_TYPED_FUNC_START(__memset) - ALTERNATIVE "jmp memset_orig", "", X86_FEATURE_FSRS +// ALTERNATIVE "jmp memset_orig", "", X86_FEATURE_FSRS movq %rdi,%r9 movb %sil,%al Baseline commit (+ the 2 above hacks) is the following: commit a8ec08bf32595ea4b109e3c7f679d4457d1c58c0 Merge: ed80cc758b78 48233291461b Author: Vlastimil Babka <vbabka@suse.cz> Date: Tue Nov 25 14:38:41 2025 +0100 Merge branch 'slab/for-6.19/mempool_alloc_bulk' into slab/for-next This is what the ctor/dtor branch is rebased on. It is missing some of the further changes to cid machinery in upstream, but they don't fundamentally mess with the core idea of the patch (pcpu memory is still allocated on mm creation and it is being zeroed) so I did not bother rebasing -- end perf will be the same. 
Benchmark is a static binary executing itself in a loop: http://apollo.backplane.com/DFlyMisc/doexec.c $ cc -O2 -o static-doexec doexec.c $ taskset --cpu-list 1 ./static-doexec 1 With ctor+dtor+unified walk I'm seeing 2% improvement over the baseline and the same performance as lazy counter. If nobody is willing to productize this I'm going to do it. non-production hack below for reference: diff --git a/include/linux/mm_types.h b/include/linux/mm_types.h index cb9c6b16c311..f952ec1f59d1 100644 --- a/include/linux/mm_types.h +++ b/include/linux/mm_types.h @@ -1439,7 +1439,7 @@ static inline cpumask_t *mm_cidmask(struct mm_struct *mm) return (struct cpumask *)cid_bitmap; } -static inline void mm_init_cid(struct mm_struct *mm, struct task_struct *p) +static inline void mm_init_cid_percpu(struct mm_struct *mm, struct task_struct *p) { int i; @@ -1457,6 +1457,15 @@ static inline void mm_init_cid(struct mm_struct *mm, struct task_struct *p) cpumask_clear(mm_cidmask(mm)); } +static inline void mm_init_cid(struct mm_struct *mm, struct task_struct *p) +{ + mm->nr_cpus_allowed = p->nr_cpus_allowed; + atomic_set(&mm->max_nr_cid, 0); + raw_spin_lock_init(&mm->cpus_allowed_lock); + cpumask_copy(mm_cpus_allowed(mm), &p->cpus_mask); + cpumask_clear(mm_cidmask(mm)); +} + static inline int mm_alloc_cid_noprof(struct mm_struct *mm) { mm->pcpu_cid = alloc_percpu_noprof(struct mm_cid); diff --git a/kernel/fork.c b/kernel/fork.c index a26319cddc3c..1575db9f0198 100644 --- a/kernel/fork.c +++ b/kernel/fork.c @@ -575,21 +575,46 @@ static inline int mm_alloc_id(struct mm_struct *mm) { return 0; } static inline void mm_free_id(struct mm_struct *mm) {} #endif /* CONFIG_MM_ID */ +/* + * pretend this is fully integrated into hotplug support + */ +__cacheline_aligned_in_smp DEFINE_SEQLOCK(cpu_hotplug_lock); + static void check_mm(struct mm_struct *mm) { - int i; + long rss_stat[NR_MM_COUNTERS]; + unsigned cpu_seq; + int i, cpu; BUILD_BUG_ON_MSG(ARRAY_SIZE(resident_page_types) != NR_MM_COUNTERS, "Please make sure 'struct resident_page_types[]' is updated as well"); - for (i = 0; i < NR_MM_COUNTERS; i++) { - long x = percpu_counter_sum(&mm->rss_stat[i]); + cpu_seq = read_seqbegin(&cpu_hotplug_lock); + local_irq_disable(); + for (i = 0; i < NR_MM_COUNTERS; i++) + rss_stat[i] = mm->rss_stat[i].count; + + for_each_possible_cpu(cpu) { + struct mm_cid *pcpu_cid = per_cpu_ptr(mm->pcpu_cid, cpu); + + pcpu_cid->cid = MM_CID_UNSET; + pcpu_cid->recent_cid = MM_CID_UNSET; + pcpu_cid->time = 0; - if (unlikely(x)) { + for (i = 0; i < NR_MM_COUNTERS; i++) + rss_stat[i] += *per_cpu_ptr(mm->rss_stat[i].counters, cpu); + } + local_irq_enable(); + if (read_seqretry(&cpu_hotplug_lock, cpu_seq)) + BUG(); + + for (i = 0; i < NR_MM_COUNTERS; i++) { + if (unlikely(rss_stat[i])) { pr_alert("BUG: Bad rss-counter state mm:%p type:%s val:%ld Comm:%s Pid:%d\n", - mm, resident_page_types[i], x, + mm, resident_page_types[i], rss_stat[i], current->comm, task_pid_nr(current)); + /* XXXBUG: ZERO IT OUT */ } } @@ -2953,10 +2978,19 @@ static int sighand_ctor(void *data) static int mm_struct_ctor(void *object) { struct mm_struct *mm = object; + int cpu; if (mm_alloc_cid(mm)) return -ENOMEM; + for_each_possible_cpu(cpu) { + struct mm_cid *pcpu_cid = per_cpu_ptr(mm->pcpu_cid, cpu); + + pcpu_cid->cid = MM_CID_UNSET; + pcpu_cid->recent_cid = MM_CID_UNSET; + pcpu_cid->time = 0; + } + if (percpu_counter_init_many(mm->rss_stat, 0, GFP_KERNEL, NR_MM_COUNTERS)) { mm_destroy_cid(mm); diff --git a/mm/percpu.c b/mm/percpu.c index 7d036f42b5af..47e23ea90d7b 100644 --- 
a/mm/percpu.c +++ b/mm/percpu.c @@ -1693,7 +1693,7 @@ static void pcpu_memcg_free_hook(struct pcpu_chunk *chunk, int off, size_t size) obj_cgroup_put(objcg); } -bool pcpu_charge(void *ptr, size_t size, gfp_t gfp) +bool pcpu_charge(void __percpu *ptr, size_t size, gfp_t gfp) { struct obj_cgroup *objcg = NULL; void *addr; @@ -1710,7 +1710,7 @@ bool pcpu_charge(void *ptr, size_t size, gfp_t gfp) return true; } -void pcpu_uncharge(void *ptr, size_t size) +void pcpu_uncharge(void __percpu *ptr, size_t size) { void *addr; struct pcpu_chunk *chunk; ^ permalink raw reply [flat|nested] 19+ messages in thread
Thread overview: 19+ messages
2025-11-27 23:36 [RFC PATCH 0/4] Optimize rss_stat initialization/teardown for single-threaded tasks Gabriel Krisman Bertazi
2025-11-27 23:36 ` [RFC PATCH 1/4] lib/percpu_counter: Split out a helper to insert into hotplug list Gabriel Krisman Bertazi
2025-11-27 23:36 ` [RFC PATCH 2/4] lib: Support lazy initialization of per-cpu counters Gabriel Krisman Bertazi
2025-11-27 23:36 ` [RFC PATCH 3/4] mm: Avoid percpu MM counters on single-threaded tasks Gabriel Krisman Bertazi
2025-11-27 23:36 ` [RFC PATCH 4/4] mm: Split a slow path for updating mm counters Gabriel Krisman Bertazi
2025-12-01 10:19 ` David Hildenbrand (Red Hat)
2025-11-28 13:30 ` [RFC PATCH 0/4] Optimize rss_stat initialization/teardown for single-threaded tasks Mathieu Desnoyers
2025-11-28 20:10 ` Jan Kara
2025-11-28 20:12 ` Mathieu Desnoyers
2025-11-29 5:57 ` Mateusz Guzik
2025-11-29 7:50 ` Mateusz Guzik
2025-12-01 10:38 ` Harry Yoo
2025-12-01 11:31 ` Mateusz Guzik
2025-12-01 14:47 ` Mathieu Desnoyers
2025-12-01 15:23 ` Gabriel Krisman Bertazi
2025-12-01 19:16 ` Harry Yoo
2025-12-03 11:02 ` Mateusz Guzik
2025-12-03 11:54 ` Mateusz Guzik
2025-12-03 14:36 ` Mateusz Guzik