linux-mm.kvack.org archive mirror
* [RFC PATCH 0/4] Optimize rss_stat initialization/teardown for single-threaded tasks
@ 2025-11-27 23:36 Gabriel Krisman Bertazi
  2025-11-27 23:36 ` [RFC PATCH 1/4] lib/percpu_counter: Split out a helper to insert into hotplug list Gabriel Krisman Bertazi
                   ` (4 more replies)
  0 siblings, 5 replies; 19+ messages in thread
From: Gabriel Krisman Bertazi @ 2025-11-27 23:36 UTC (permalink / raw)
  To: linux-mm
  Cc: Gabriel Krisman Bertazi, linux-kernel, jack, Mateusz Guzik,
	Shakeel Butt, Michal Hocko, Mathieu Desnoyers, Dennis Zhou,
	Tejun Heo, Christoph Lameter, Andrew Morton, David Hildenbrand,
	Lorenzo Stoakes, Liam R. Howlett, Vlastimil Babka, Mike Rapoport,
	Suren Baghdasaryan

The cost of the pcpu memory allocation is non-negligible for systems
with many cpus, and it is quite visible when forking a new task, as
reported on a few occasions.  In particular, Jan Kara reported that the
commit introducing per-cpu counters for rss_stat caused a 10% regression
in system time for gitsource on his system [1].  On that same occasion,
Jan suggested we special-case the single-threaded scenario: since we
know there won't be frequent remote updates of rss_stat for
single-threaded applications, we could cover it with a local counter for
most updates and an atomic counter for the infrequent remote updates.
This patchset implements that idea.

It exposes a dual-mode counter that starts as a simple counter, cheap to
initialize for single-threaded tasks, and can be upgraded in flight to a
fully-fledged per-cpu counter later.  Patch 3 then modifies the rss_stat
counters to use that structure, forcing the upgrade as soon as a second
task sharing the mm_struct is spawned.  By delaying the initialization
cost until the MM is shared, we cover single-threaded applications
fairly cheaply, while not penalizing applications that spawn multiple
threads.  On a 256c system, where the pcpu allocation of the rss_stats
is quite noticeable, this reduced the wall-clock time of an artificial
fork-intensive microbenchmark (calling /bin/true in a loop) by 6% to
15%, depending on the number of cores.  In a more realistic benchmark,
it showed a 1.5% improvement in kernbench elapsed time.

More performance data, including profiles, is available in the patch
modifying the rss_stat counters.

While this patchset exposes a single user of this API, it should be
useful in more cases, which is why I made it a proper API.  In addition,
considering the recent efforts in this area, such as the hierarchical
per-cpu counters (orthogonal to this work, since they improve
multi-threaded workloads), abstracting this behind a new API could help
merge both efforts.

Finally, this is an RFC because it is early work.  In particular, I'd
be interested in more benchmark suggestions, and I'd like feedback on
whether this new interface should be implemented inside percpu_counter
as lazy counters or as a completely separate interface.

Thanks,

[1] https://lore.kernel.org/all/20230608111408.s2minsenlcjow7q3@quack3

---

Cc: linux-kernel@vger.kernel.org
Cc: jack@suse.cz
Cc: Mateusz Guzik <mjguzik@gmail.com>
Cc: Shakeel Butt <shakeel.butt@linux.dev>
Cc: Michal Hocko <mhocko@kernel.org>
Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Cc: Dennis Zhou <dennis@kernel.org>
Cc: Tejun Heo <tj@kernel.org>
Cc: Christoph Lameter <cl@gentwo.org>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: David Hildenbrand <david@redhat.com>
Cc: Lorenzo Stoakes <lorenzo.stoakes@oracle.com>
Cc: "Liam R. Howlett" <Liam.Howlett@oracle.com>
Cc: Vlastimil Babka <vbabka@suse.cz>
Cc: Mike Rapoport <rppt@kernel.org>
Cc: Suren Baghdasaryan <surenb@google.com>

Gabriel Krisman Bertazi (4):
  lib/percpu_counter: Split out a helper to insert into hotplug list
  lib: Support lazy initialization of per-cpu counters
  mm: Avoid percpu MM counters on single-threaded tasks
  mm: Split a slow path for updating mm counters

 arch/s390/mm/gmap_helpers.c         |   4 +-
 arch/s390/mm/pgtable.c              |   4 +-
 fs/exec.c                           |   2 +-
 include/linux/lazy_percpu_counter.h | 145 ++++++++++++++++++++++++++++
 include/linux/mm.h                  |  26 ++---
 include/linux/mm_types.h            |   4 +-
 include/linux/percpu_counter.h      |   5 +-
 include/trace/events/kmem.h         |   4 +-
 kernel/events/uprobes.c             |   2 +-
 kernel/fork.c                       |  14 ++-
 lib/percpu_counter.c                |  68 ++++++++++---
 mm/filemap.c                        |   2 +-
 mm/huge_memory.c                    |  22 ++---
 mm/khugepaged.c                     |   6 +-
 mm/ksm.c                            |   2 +-
 mm/madvise.c                        |   2 +-
 mm/memory.c                         |  20 ++--
 mm/migrate.c                        |   2 +-
 mm/migrate_device.c                 |   2 +-
 mm/rmap.c                           |  16 +--
 mm/swapfile.c                       |   6 +-
 mm/userfaultfd.c                    |   2 +-
 22 files changed, 276 insertions(+), 84 deletions(-)
 create mode 100644 include/linux/lazy_percpu_counter.h

-- 
2.51.0




* [RFC PATCH 1/4] lib/percpu_counter: Split out a helper to insert into hotplug list
  2025-11-27 23:36 [RFC PATCH 0/4] Optimize rss_stat initialization/teardown for single-threaded tasks Gabriel Krisman Bertazi
@ 2025-11-27 23:36 ` Gabriel Krisman Bertazi
  2025-11-27 23:36 ` [RFC PATCH 2/4] lib: Support lazy initialization of per-cpu counters Gabriel Krisman Bertazi
                   ` (3 subsequent siblings)
  4 siblings, 0 replies; 19+ messages in thread
From: Gabriel Krisman Bertazi @ 2025-11-27 23:36 UTC (permalink / raw)
  To: linux-mm
  Cc: Gabriel Krisman Bertazi, linux-kernel, jack, Mateusz Guzik,
	Shakeel Butt, Michal Hocko, Mathieu Desnoyers, Dennis Zhou,
	Tejun Heo, Christoph Lameter, Andrew Morton, David Hildenbrand,
	Lorenzo Stoakes, Liam R. Howlett, Vlastimil Babka, Mike Rapoport,
	Suren Baghdasaryan

In preparation for using it with the lazy pcpu counter.

Signed-off-by: Gabriel Krisman Bertazi <krisman@suse.de>
---
 lib/percpu_counter.c | 28 +++++++++++++++++-----------
 1 file changed, 17 insertions(+), 11 deletions(-)

diff --git a/lib/percpu_counter.c b/lib/percpu_counter.c
index 2891f94a11c6..c2322d53f3b1 100644
--- a/lib/percpu_counter.c
+++ b/lib/percpu_counter.c
@@ -185,11 +185,26 @@ s64 __percpu_counter_sum(struct percpu_counter *fbc)
 }
 EXPORT_SYMBOL(__percpu_counter_sum);
 
+static int cpu_hotplug_add_watchlist(struct percpu_counter *fbc, int nr_counters)
+{
+#ifdef CONFIG_HOTPLUG_CPU
+	unsigned long flags;
+	int i;
+
+	spin_lock_irqsave(&percpu_counters_lock, flags);
+	for (i = 0; i < nr_counters; i++) {
+		INIT_LIST_HEAD(&fbc[i].list);
+		list_add(&fbc[i].list, &percpu_counters);
+	}
+	spin_unlock_irqrestore(&percpu_counters_lock, flags);
+#endif
+	return 0;
+}
+
 int __percpu_counter_init_many(struct percpu_counter *fbc, s64 amount,
 			       gfp_t gfp, u32 nr_counters,
 			       struct lock_class_key *key)
 {
-	unsigned long flags __maybe_unused;
 	size_t counter_size;
 	s32 __percpu *counters;
 	u32 i;
@@ -205,21 +220,12 @@ int __percpu_counter_init_many(struct percpu_counter *fbc, s64 amount,
 	for (i = 0; i < nr_counters; i++) {
 		raw_spin_lock_init(&fbc[i].lock);
 		lockdep_set_class(&fbc[i].lock, key);
-#ifdef CONFIG_HOTPLUG_CPU
-		INIT_LIST_HEAD(&fbc[i].list);
-#endif
 		fbc[i].count = amount;
 		fbc[i].counters = (void __percpu *)counters + i * counter_size;
 
 		debug_percpu_counter_activate(&fbc[i]);
 	}
-
-#ifdef CONFIG_HOTPLUG_CPU
-	spin_lock_irqsave(&percpu_counters_lock, flags);
-	for (i = 0; i < nr_counters; i++)
-		list_add(&fbc[i].list, &percpu_counters);
-	spin_unlock_irqrestore(&percpu_counters_lock, flags);
-#endif
+	cpu_hotplug_add_watchlist(fbc, nr_counters);
 	return 0;
 }
 EXPORT_SYMBOL(__percpu_counter_init_many);
-- 
2.51.0




* [RFC PATCH 2/4] lib: Support lazy initialization of per-cpu counters
  2025-11-27 23:36 [RFC PATCH 0/4] Optimize rss_stat initialization/teardown for single-threaded tasks Gabriel Krisman Bertazi
  2025-11-27 23:36 ` [RFC PATCH 1/4] lib/percpu_counter: Split out a helper to insert into hotplug list Gabriel Krisman Bertazi
@ 2025-11-27 23:36 ` Gabriel Krisman Bertazi
  2025-11-27 23:36 ` [RFC PATCH 3/4] mm: Avoid percpu MM counters on single-threaded tasks Gabriel Krisman Bertazi
                   ` (2 subsequent siblings)
  4 siblings, 0 replies; 19+ messages in thread
From: Gabriel Krisman Bertazi @ 2025-11-27 23:36 UTC (permalink / raw)
  To: linux-mm
  Cc: Gabriel Krisman Bertazi, linux-kernel, jack, Mateusz Guzik,
	Shakeel Butt, Michal Hocko, Mathieu Desnoyers, Dennis Zhou,
	Tejun Heo, Christoph Lameter, Andrew Morton, David Hildenbrand,
	Lorenzo Stoakes, Liam R. Howlett, Vlastimil Babka, Mike Rapoport,
	Suren Baghdasaryan

While per-cpu counters are efficient when there is a need for frequent
updates from different cpus, they have a non-trivial upfront
initialization cost, mainly due to the percpu variable allocation.  This
cost becomes relevant both for short-lived counters and for cases
where we don't know beforehand whether there will be frequent updates
from remote cpus.  In both cases, it could be better to just use a
simple counter.

The prime example is the rss_stat of single-threaded tasks, where the
vast majority of counter updates happen from a single-cpu context at a
time, except for slow-path cases such as OOM and khugepaged.  For those
workloads, a simple counter would have sufficed and likely yielded
better overall performance if the tasks were sufficiently short.  There
is no shortage of short-lived single-threaded workloads, coreutils
tools in particular.

This patch introduces a new counter flavor that delays the percpu
initialization until it is needed.  It is a dual-mode counter: it starts
as a two-part counter that can be updated either from the local context
through simple arithmetic or from a remote context through an atomic
operation.  Once remote accesses become more frequent, and the user
considers that the overhead of atomic updates surpasses the cost of
initializing a fully-fledged per-cpu counter, the counter can be
seamlessly upgraded to a per-cpu counter.

The first users of this are the rss_stat counters.  Benchmark results
are provided in that patch.
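
For reference, the 'remote' word below is a tagged value: while in lazy
mode, the remote total is stored shifted left by one with bit 0 set
(LAZY_INIT_BIAS); after the upgrade, the same word holds the percpu
pointer, whose (at least 4-byte) alignment guarantees bit 0 is clear.  A
minimal userspace model of that encoding (illustrative only, not part of
the patch; like the kernel code, it relies on arithmetic right shift for
negative values):

    #include <assert.h>

    #define LAZY_INIT_BIAS (1 << 0)

    static long add_bias(long val)    { return (val << 1) | LAZY_INIT_BIAS; }
    static long remove_bias(long val) { return val >> 1; }

    int main(void)
    {
            long remote = add_bias(0);          /* freshly initialized, lazy */

            remote += 2 * 3;                    /* remote add of +3; bit 0 stays set */
            assert(remote & LAZY_INIT_BIAS);
            assert(remove_bias(remote) == 3);

            remote += 2 * -5;                   /* remote add of -5 */
            assert(remove_bias(remote) == -2);  /* 3 - 5 */
            return 0;
    }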

Suggested-by: Jan Kara <jack@suse.cz>
Signed-off-by: Gabriel Krisman Bertazi <krisman@suse.de>
---
 include/linux/lazy_percpu_counter.h | 145 ++++++++++++++++++++++++++++
 include/linux/percpu_counter.h      |   5 +-
 lib/percpu_counter.c                |  40 ++++++++
 3 files changed, 189 insertions(+), 1 deletion(-)
 create mode 100644 include/linux/lazy_percpu_counter.h

diff --git a/include/linux/lazy_percpu_counter.h b/include/linux/lazy_percpu_counter.h
new file mode 100644
index 000000000000..7300b8c33507
--- /dev/null
+++ b/include/linux/lazy_percpu_counter.h
@@ -0,0 +1,145 @@
+/* SPDX-License-Identifier: GPL-2.0 */
+#include <linux/percpu_counter.h>
+#ifndef _LAZY_PERCPU_COUNTER
+#define _LAZY_PERCPU_COUNTER
+
+/* Lazy percpu counter is a bi-modal distributed counter structure that
+ * starts off as a simple counter and can be upgraded to a full per-cpu
+ * counter when the user considers that non-local updates are likely to
+ * happen more frequently in the future.  It is useful when non-local
+ * updates are rare, but might become more frequent after other
+ * operations.
+ *
+ * - Lazy-mode:
+ *
+ * Local updates are handled with a simple variable write, while
+ * non-local updates are handled through an atomic operation.  Once
+ * non-local updates become more likely to happen in the future, the
+ * user can upgrade the counter, turning it into a normal
+ * per-cpu counter.
+ *
+ * Concurrency safety of 'local' accesses must be guaranteed by the
+ * caller API, either through task-local accesses or by external locks.
+ *
+ * In the initial lazy-mode, read is guaranteed to be exact only when
+ * reading from the local context with lazy_percpu_counter_sum_local.
+ *
+ * - Non-lazy-mode:
+ *   Behaves as a per-cpu counter.
+ */
+
+struct lazy_percpu_counter {
+	struct percpu_counter c;
+};
+
+#define LAZY_INIT_BIAS (1<<0)
+
+static inline s64 add_bias(long val)
+{
+	return (val << 1) | LAZY_INIT_BIAS;
+}
+static inline s64 remove_bias(long val)
+{
+	return val >> 1;
+}
+
+static inline bool lazy_percpu_counter_initialized(struct lazy_percpu_counter *lpc)
+{
+	return !(atomic_long_read(&lpc->c.remote) & LAZY_INIT_BIAS);
+}
+
+static inline void lazy_percpu_counter_init_many(struct lazy_percpu_counter *lpc, int amount,
+					       int nr_counters)
+{
+	for (int i = 0; i < nr_counters; i++) {
+		lpc[i].c.count = amount;
+		atomic_long_set(&lpc[i].c.remote, LAZY_INIT_BIAS);
+		raw_spin_lock_init(&lpc[i].c.lock);
+	}
+}
+
+static inline void lazy_percpu_counter_add_atomic(struct lazy_percpu_counter *lpc, s64 amount)
+{
+	long x = amount << 1;
+	long counter;
+
+	do {
+		counter = atomic_long_read(&lpc->c.remote);
+		if (!(counter & LAZY_INIT_BIAS)) {
+			percpu_counter_add(&lpc->c, amount);
+			return;
+		}
+	} while (atomic_long_cmpxchg_relaxed(&lpc->c.remote, counter, (counter+x)) != counter);
+}
+
+static inline void lazy_percpu_counter_add_fast(struct lazy_percpu_counter *lpc, s64 amount)
+{
+	if (lazy_percpu_counter_initialized(lpc))
+		percpu_counter_add(&lpc->c, amount);
+	else
+		lpc->c.count += amount;
+}
+
+/*
+ * lazy_percpu_counter_sum_local needs to be protected against concurrent
+ * local updates.
+ */
+static inline s64 lazy_percpu_counter_sum_local(struct lazy_percpu_counter *lpc)
+{
+	if (lazy_percpu_counter_initialized(lpc))
+		return percpu_counter_sum(&lpc->c);
+
+	lazy_percpu_counter_add_atomic(lpc, lpc->c.count);
+	lpc->c.count = 0;
+	return remove_bias(atomic_long_read(&lpc->c.remote));
+}
+
+static inline s64 lazy_percpu_counter_sum(struct lazy_percpu_counter *lpc)
+{
+	if (lazy_percpu_counter_initialized(lpc))
+		return percpu_counter_sum(&lpc->c);
+	return remove_bias(atomic_long_read(&lpc->c.remote)) + lpc->c.count;
+}
+
+static inline s64 lazy_percpu_counter_sum_positive(struct lazy_percpu_counter *lpc)
+{
+	s64 val = lazy_percpu_counter_sum(lpc);
+
+	return (val > 0) ? val : 0;
+}
+
+static inline s64 lazy_percpu_counter_read(struct lazy_percpu_counter *lpc)
+{
+	if (lazy_percpu_counter_initialized(lpc))
+		return percpu_counter_read(&lpc->c);
+	return remove_bias(atomic_long_read(&lpc->c.remote)) + lpc->c.count;
+}
+
+static inline s64 lazy_percpu_counter_read_positive(struct lazy_percpu_counter *lpc)
+{
+	s64 val = lazy_percpu_counter_read(lpc);
+
+	return (val > 0) ? val : 0;
+}
+
+int __lazy_percpu_counter_upgrade_many(struct lazy_percpu_counter *c,
+				       int nr_counters, gfp_t gfp);
+static inline int lazy_percpu_counter_upgrade_many(struct lazy_percpu_counter *c,
+						   int nr_counters, gfp_t gfp)
+{
+	/* Only check the first element, as batches are expected to be
+	 * upgraded together.
+	 */
+	if (!lazy_percpu_counter_initialized(c))
+		return __lazy_percpu_counter_upgrade_many(c, nr_counters, gfp);
+	return 0;
+}
+
+static inline void lazy_percpu_counter_destroy_many(struct lazy_percpu_counter *lpc,
+						    u32 nr_counters)
+{
+	/* Only check the first element, as they must have been initialized together. */
+	if (lazy_percpu_counter_initialized(lpc))
+		percpu_counter_destroy_many((struct percpu_counter *)lpc, nr_counters);
+}
+#endif
diff --git a/include/linux/percpu_counter.h b/include/linux/percpu_counter.h
index 3a44dd1e33d2..e6fada9cba44 100644
--- a/include/linux/percpu_counter.h
+++ b/include/linux/percpu_counter.h
@@ -25,7 +25,10 @@ struct percpu_counter {
 #ifdef CONFIG_HOTPLUG_CPU
 	struct list_head list;	/* All percpu_counters are on a list */
 #endif
-	s32 __percpu *counters;
+	union {
+		s32 __percpu *counters;
+		atomic_long_t remote;
+	};
 };
 
 extern int percpu_counter_batch;
diff --git a/lib/percpu_counter.c b/lib/percpu_counter.c
index c2322d53f3b1..0a210496f219 100644
--- a/lib/percpu_counter.c
+++ b/lib/percpu_counter.c
@@ -4,6 +4,7 @@
  */
 
 #include <linux/percpu_counter.h>
+#include <linux/lazy_percpu_counter.h>
 #include <linux/mutex.h>
 #include <linux/init.h>
 #include <linux/cpu.h>
@@ -397,6 +398,45 @@ bool __percpu_counter_limited_add(struct percpu_counter *fbc,
 	return good;
 }
 
+int __lazy_percpu_counter_upgrade_many(struct lazy_percpu_counter *counters,
+				       int nr_counters, gfp_t gfp)
+{
+	s32 __percpu *pcpu_mem;
+	size_t counter_size;
+
+	counter_size = ALIGN(sizeof(*pcpu_mem), __alignof__(*pcpu_mem));
+	pcpu_mem = __alloc_percpu_gfp(nr_counters * counter_size,
+				      __alignof__(*pcpu_mem), gfp);
+	if (!pcpu_mem)
+		return -ENOMEM;
+
+	for (int i = 0; i < nr_counters; i++) {
+		struct lazy_percpu_counter *lpc = &(counters[i]);
+		s32 __percpu *n_counter;
+		s64 remote = 0;
+
+		WARN_ON(lazy_percpu_counter_initialized(lpc));
+
+		/*
+		 * After the xchg, lazy_percpu_counter behaves as a
+		 * regular percpu counter.
+		 */
+		n_counter = (void __percpu *)pcpu_mem + i * counter_size;
+		remote = (s64) atomic_long_xchg(&lpc->c.remote, (s64)(uintptr_t) n_counter);
+
+		BUG_ON(!(remote & LAZY_INIT_BIAS));
+
+		percpu_counter_add_local(&lpc->c, remove_bias(remote));
+	}
+
+	for (int i = 0; i < nr_counters; i++)
+		debug_percpu_counter_activate(&counters[i].c);
+
+	cpu_hotplug_add_watchlist((struct percpu_counter *) counters, nr_counters);
+
+	return 0;
+}
+
 static int __init percpu_counter_startup(void)
 {
 	int ret;
-- 
2.51.0




* [RFC PATCH 3/4] mm: Avoid percpu MM counters on single-threaded tasks
  2025-11-27 23:36 [RFC PATCH 0/4] Optimize rss_stat initialization/teardown for single-threaded tasks Gabriel Krisman Bertazi
  2025-11-27 23:36 ` [RFC PATCH 1/4] lib/percpu_counter: Split out a helper to insert into hotplug list Gabriel Krisman Bertazi
  2025-11-27 23:36 ` [RFC PATCH 2/4] lib: Support lazy initialization of per-cpu counters Gabriel Krisman Bertazi
@ 2025-11-27 23:36 ` Gabriel Krisman Bertazi
  2025-11-27 23:36 ` [RFC PATCH 4/4] mm: Split a slow path for updating mm counters Gabriel Krisman Bertazi
  2025-11-28 13:30 ` [RFC PATCH 0/4] Optimize rss_stat initialization/teardown for single-threaded tasks Mathieu Desnoyers
  4 siblings, 0 replies; 19+ messages in thread
From: Gabriel Krisman Bertazi @ 2025-11-27 23:36 UTC (permalink / raw)
  To: linux-mm
  Cc: Gabriel Krisman Bertazi, linux-kernel, jack, Mateusz Guzik,
	Shakeel Butt, Michal Hocko, Mathieu Desnoyers, Dennis Zhou,
	Tejun Heo, Christoph Lameter, Andrew Morton, David Hildenbrand,
	Lorenzo Stoakes, Liam R. Howlett, Vlastimil Babka, Mike Rapoport,
	Suren Baghdasaryan

The cost of the pcpu memory allocation when forking a new task is
non-negligible, as reported on a few occasions, such as [1].

But it can also be fully avoided for single-threaded applications, where
we know the vast majority of updates happen from the local task context.

For a trivial benchmark (bound to CPU 0 to reduce the cost of
migrations) like the one below:

     for (( i = 0; i < 20000; i++ )); do /bin/true; done

on an 80c machine, this patchset yielded a 6% improvement in system
time.  On a 256c machine, system time was reduced by 11%.  Profiling
shows mm_init went from 13.5% of samples to less than 3.33% on the same
256c machine:

Before:
-   13.50%     3.93%  benchmark.sh     [kernel.kallsyms] [k] mm_init
   - 9.57% mm_init
      + 4.80% pcpu_alloc_noprof
      + 3.87% __percpu_counter_init_many

After:
-    3.33%     0.80%  benchmark.sh  [kernel.kallsyms]  [k] mm_init
   - 2.53% mm_init
      + 2.05% pcpu_alloc_noprof

For kernbench on the 256c machine, the patchset yields a 1.4%
improvement in system time.  For gitsource, the improvement in system
time I'm measuring is around 3.12%.

The upgrade adds some overhead to the second fork, in particular an
atomic operation, besides the expensive allocation that was moved from
the first fork to the second.  So a fair question is the impact of this
patchset on multi-threaded applications.  I wrote a microbenchmark
similar to the /bin/true one above, but one that just spawns a second
pthread and waits for it to finish.  The second thread returns immediately.
This is executed in a loop, bound to a single NUMA node, with:

       for (( i = 0; i < 20000; i++ )); do /bin/parallel-true; done
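
Roughly, parallel-true looks like this (illustrative sketch, not the
exact benchmark source used):

    #include <pthread.h>

    static void *noop(void *arg)
    {
            return NULL;
    }

    int main(void)
    {
            pthread_t t;

            if (pthread_create(&t, NULL, noop, NULL))
                    return 1;
            pthread_join(t, NULL);
            return 0;
    }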

Profiling shows the lazy upgrade's impact on performance is minimal:

-    0.68%     0.04%  parallel-true  [kernel.kallsyms]  [k] __lazy_percpu_counter_upgrade_many
   - 0.64% __lazy_percpu_counter_upgrade_many
        0.62% pcpu_alloc_noprof

This is confirmed by the measured system time.  With 20k runs, I'm still
getting a slight improvement over baseline for the 2-thread case (2-4%).

[1] https://lore.kernel.org/all/20230608111408.s2minsenlcjow7q3@quack3

Suggested-by: Jan Kara <jack@suse.cz>
Signed-off-by: Gabriel Krisman Bertazi <krisman@suse.de>
---
 include/linux/mm.h          | 24 ++++++++----------------
 include/linux/mm_types.h    |  4 ++--
 include/trace/events/kmem.h |  4 ++--
 kernel/fork.c               | 14 ++++++--------
 4 files changed, 18 insertions(+), 28 deletions(-)

diff --git a/include/linux/mm.h b/include/linux/mm.h
index d16b33bacc32..29de4c60ac6c 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -2679,36 +2679,28 @@ static inline bool get_user_page_fast_only(unsigned long addr,
  */
 static inline unsigned long get_mm_counter(struct mm_struct *mm, int member)
 {
-	return percpu_counter_read_positive(&mm->rss_stat[member]);
+	return lazy_percpu_counter_read_positive(&mm->rss_stat[member]);
 }
 
 static inline unsigned long get_mm_counter_sum(struct mm_struct *mm, int member)
 {
-	return percpu_counter_sum_positive(&mm->rss_stat[member]);
+	return lazy_percpu_counter_sum_positive(&mm->rss_stat[member]);
 }
 
 void mm_trace_rss_stat(struct mm_struct *mm, int member);
 
 static inline void add_mm_counter(struct mm_struct *mm, int member, long value)
 {
-	percpu_counter_add(&mm->rss_stat[member], value);
-
-	mm_trace_rss_stat(mm, member);
-}
-
-static inline void inc_mm_counter(struct mm_struct *mm, int member)
-{
-	percpu_counter_inc(&mm->rss_stat[member]);
+	if (READ_ONCE(current->mm) == mm)
+		lazy_percpu_counter_add_fast(&mm->rss_stat[member], value);
+	else
+		lazy_percpu_counter_add_atomic(&mm->rss_stat[member], value);
 
 	mm_trace_rss_stat(mm, member);
 }
 
-static inline void dec_mm_counter(struct mm_struct *mm, int member)
-{
-	percpu_counter_dec(&mm->rss_stat[member]);
-
-	mm_trace_rss_stat(mm, member);
-}
+#define inc_mm_counter(mm, member) add_mm_counter(mm, member, 1)
+#define dec_mm_counter(mm, member) add_mm_counter(mm, member, -1)
 
 /* Optimized variant when folio is already known not to be anon */
 static inline int mm_counter_file(struct folio *folio)
diff --git a/include/linux/mm_types.h b/include/linux/mm_types.h
index 90e5790c318f..5a8d677efa85 100644
--- a/include/linux/mm_types.h
+++ b/include/linux/mm_types.h
@@ -18,7 +18,7 @@
 #include <linux/page-flags-layout.h>
 #include <linux/workqueue.h>
 #include <linux/seqlock.h>
-#include <linux/percpu_counter.h>
+#include <linux/lazy_percpu_counter.h>
 #include <linux/types.h>
 #include <linux/bitmap.h>
 
@@ -1119,7 +1119,7 @@ struct mm_struct {
 		unsigned long saved_e_flags;
 #endif
 
-		struct percpu_counter rss_stat[NR_MM_COUNTERS];
+		struct lazy_percpu_counter rss_stat[NR_MM_COUNTERS];
 
 		struct linux_binfmt *binfmt;
 
diff --git a/include/trace/events/kmem.h b/include/trace/events/kmem.h
index 7f93e754da5c..e21572f4d8a6 100644
--- a/include/trace/events/kmem.h
+++ b/include/trace/events/kmem.h
@@ -442,8 +442,8 @@ TRACE_EVENT(rss_stat,
 		__entry->mm_id = mm_ptr_to_hash(mm);
 		__entry->curr = !!(current->mm == mm);
 		__entry->member = member;
-		__entry->size = (percpu_counter_sum_positive(&mm->rss_stat[member])
-							    << PAGE_SHIFT);
+		__entry->size = (lazy_percpu_counter_sum_positive(&mm->rss_stat[member])
+				 << PAGE_SHIFT);
 	),
 
 	TP_printk("mm_id=%u curr=%d type=%s size=%ldB",
diff --git a/kernel/fork.c b/kernel/fork.c
index 3da0f08615a9..92698c60922e 100644
--- a/kernel/fork.c
+++ b/kernel/fork.c
@@ -583,7 +583,7 @@ static void check_mm(struct mm_struct *mm)
 			 "Please make sure 'struct resident_page_types[]' is updated as well");
 
 	for (i = 0; i < NR_MM_COUNTERS; i++) {
-		long x = percpu_counter_sum(&mm->rss_stat[i]);
+		long x = lazy_percpu_counter_sum_local(&mm->rss_stat[i]);
 
 		if (unlikely(x)) {
 			pr_alert("BUG: Bad rss-counter state mm:%p type:%s val:%ld Comm:%s Pid:%d\n",
@@ -688,7 +688,7 @@ void __mmdrop(struct mm_struct *mm)
 	put_user_ns(mm->user_ns);
 	mm_pasid_drop(mm);
 	mm_destroy_cid(mm);
-	percpu_counter_destroy_many(mm->rss_stat, NR_MM_COUNTERS);
+	lazy_percpu_counter_destroy_many(mm->rss_stat, NR_MM_COUNTERS);
 
 	free_mm(mm);
 }
@@ -1083,16 +1083,11 @@ static struct mm_struct *mm_init(struct mm_struct *mm, struct task_struct *p,
 	if (mm_alloc_cid(mm, p))
 		goto fail_cid;
 
-	if (percpu_counter_init_many(mm->rss_stat, 0, GFP_KERNEL_ACCOUNT,
-				     NR_MM_COUNTERS))
-		goto fail_pcpu;
-
+	lazy_percpu_counter_init_many(mm->rss_stat, 0, NR_MM_COUNTERS);
 	mm->user_ns = get_user_ns(user_ns);
 	lru_gen_init_mm(mm);
 	return mm;
 
-fail_pcpu:
-	mm_destroy_cid(mm);
 fail_cid:
 	destroy_context(mm);
 fail_nocontext:
@@ -1535,6 +1530,9 @@ static int copy_mm(u64 clone_flags, struct task_struct *tsk)
 		return 0;
 
 	if (clone_flags & CLONE_VM) {
+		if (lazy_percpu_counter_upgrade_many(oldmm->rss_stat,
+						     NR_MM_COUNTERS, GFP_KERNEL_ACCOUNT))
+			return -ENOMEM;
 		mmget(oldmm);
 		mm = oldmm;
 	} else {
-- 
2.51.0




* [RFC PATCH 4/4] mm: Split a slow path for updating mm counters
  2025-11-27 23:36 [RFC PATCH 0/4] Optimize rss_stat initialization/teardown for single-threaded tasks Gabriel Krisman Bertazi
                   ` (2 preceding siblings ...)
  2025-11-27 23:36 ` [RFC PATCH 3/4] mm: Avoid percpu MM counters on single-threaded tasks Gabriel Krisman Bertazi
@ 2025-11-27 23:36 ` Gabriel Krisman Bertazi
  2025-12-01 10:19   ` David Hildenbrand (Red Hat)
  2025-11-28 13:30 ` [RFC PATCH 0/4] Optimize rss_stat initialization/teardown for single-threaded tasks Mathieu Desnoyers
  4 siblings, 1 reply; 19+ messages in thread
From: Gabriel Krisman Bertazi @ 2025-11-27 23:36 UTC (permalink / raw)
  To: linux-mm
  Cc: Gabriel Krisman Bertazi, linux-kernel, jack, Mateusz Guzik,
	Shakeel Butt, Michal Hocko, Mathieu Desnoyers, Dennis Zhou,
	Tejun Heo, Christoph Lameter, Andrew Morton, David Hildenbrand,
	Lorenzo Stoakes, Liam R. Howlett, Vlastimil Babka, Mike Rapoport,
	Suren Baghdasaryan

For cases where we know we are not coming from the local context, there
is no point in touching current when incrementing/decrementing the
counters.  Split this path into separate helpers to avoid that cost.

Signed-off-by: Gabriel Krisman Bertazi <krisman@suse.de>
---
 arch/s390/mm/gmap_helpers.c |  4 ++--
 arch/s390/mm/pgtable.c      |  4 ++--
 fs/exec.c                   |  2 +-
 include/linux/mm.h          | 14 +++++++++++---
 kernel/events/uprobes.c     |  2 +-
 mm/filemap.c                |  2 +-
 mm/huge_memory.c            | 22 +++++++++++-----------
 mm/khugepaged.c             |  6 +++---
 mm/ksm.c                    |  2 +-
 mm/madvise.c                |  2 +-
 mm/memory.c                 | 20 ++++++++++----------
 mm/migrate.c                |  2 +-
 mm/migrate_device.c         |  2 +-
 mm/rmap.c                   | 16 ++++++++--------
 mm/swapfile.c               |  6 +++---
 mm/userfaultfd.c            |  2 +-
 16 files changed, 58 insertions(+), 50 deletions(-)

diff --git a/arch/s390/mm/gmap_helpers.c b/arch/s390/mm/gmap_helpers.c
index d4c3c36855e2..6d8498c56d08 100644
--- a/arch/s390/mm/gmap_helpers.c
+++ b/arch/s390/mm/gmap_helpers.c
@@ -29,9 +29,9 @@
 static void ptep_zap_swap_entry(struct mm_struct *mm, swp_entry_t entry)
 {
 	if (!non_swap_entry(entry))
-		dec_mm_counter(mm, MM_SWAPENTS);
+		dec_mm_counter_other(mm, MM_SWAPENTS);
 	else if (is_migration_entry(entry))
-		dec_mm_counter(mm, mm_counter(pfn_swap_entry_folio(entry)));
+		dec_mm_counter_other(mm, mm_counter(pfn_swap_entry_folio(entry)));
 	free_swap_and_cache(entry);
 }
 
diff --git a/arch/s390/mm/pgtable.c b/arch/s390/mm/pgtable.c
index 0fde20bbc50b..021a04f958e5 100644
--- a/arch/s390/mm/pgtable.c
+++ b/arch/s390/mm/pgtable.c
@@ -686,11 +686,11 @@ void ptep_unshadow_pte(struct mm_struct *mm, unsigned long saddr, pte_t *ptep)
 static void ptep_zap_swap_entry(struct mm_struct *mm, swp_entry_t entry)
 {
 	if (!non_swap_entry(entry))
-		dec_mm_counter(mm, MM_SWAPENTS);
+		dec_mm_counter_other(mm, MM_SWAPENTS);
 	else if (is_migration_entry(entry)) {
 		struct folio *folio = pfn_swap_entry_folio(entry);
 
-		dec_mm_counter(mm, mm_counter(folio));
+		dec_mm_counter_other(mm, mm_counter(folio));
 	}
 	free_swap_and_cache(entry);
 }
diff --git a/fs/exec.c b/fs/exec.c
index 4298e7e08d5d..33d0eb00d315 100644
--- a/fs/exec.c
+++ b/fs/exec.c
@@ -137,7 +137,7 @@ static void acct_arg_size(struct linux_binprm *bprm, unsigned long pages)
 		return;
 
 	bprm->vma_pages = pages;
-	add_mm_counter(mm, MM_ANONPAGES, diff);
+	add_mm_counter_local(mm, MM_ANONPAGES, diff);
 }
 
 static struct page *get_arg_page(struct linux_binprm *bprm, unsigned long pos,
diff --git a/include/linux/mm.h b/include/linux/mm.h
index 29de4c60ac6c..2db12280e938 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -2689,7 +2689,7 @@ static inline unsigned long get_mm_counter_sum(struct mm_struct *mm, int member)
 
 void mm_trace_rss_stat(struct mm_struct *mm, int member);
 
-static inline void add_mm_counter(struct mm_struct *mm, int member, long value)
+static inline void add_mm_counter_local(struct mm_struct *mm, int member, long value)
 {
 	if (READ_ONCE(current->mm) == mm)
 		lazy_percpu_counter_add_fast(&mm->rss_stat[member], value);
@@ -2698,9 +2698,17 @@ static inline void add_mm_counter(struct mm_struct *mm, int member, long value)
 
 	mm_trace_rss_stat(mm, member);
 }
+static inline void add_mm_counter_other(struct mm_struct *mm, int member, long value)
+{
+	lazy_percpu_counter_add_atomic(&mm->rss_stat[member], value);
+
+	mm_trace_rss_stat(mm, member);
+}
 
-#define inc_mm_counter(mm, member) add_mm_counter(mm, member, 1)
-#define dec_mm_counter(mm, member) add_mm_counter(mm, member, -1)
+#define inc_mm_counter_local(mm, member) add_mm_counter_local(mm, member, 1)
+#define dec_mm_counter_local(mm, member) add_mm_counter_local(mm, member, -1)
+#define inc_mm_counter_other(mm, member) add_mm_counter_other(mm, member, 1)
+#define dec_mm_counter_other(mm, member) add_mm_counter_other(mm, member, -1)
 
 /* Optimized variant when folio is already known not to be anon */
 static inline int mm_counter_file(struct folio *folio)
diff --git a/kernel/events/uprobes.c b/kernel/events/uprobes.c
index 8709c69118b5..9c0e73dd2948 100644
--- a/kernel/events/uprobes.c
+++ b/kernel/events/uprobes.c
@@ -447,7 +447,7 @@ static int __uprobe_write(struct vm_area_struct *vma,
 	if (!orig_page_is_identical(vma, vaddr, fw->page, &pmd_mappable))
 		goto remap;
 
-	dec_mm_counter(vma->vm_mm, MM_ANONPAGES);
+	dec_mm_counter_other(vma->vm_mm, MM_ANONPAGES);
 	folio_remove_rmap_pte(folio, fw->page, vma);
 	if (!folio_mapped(folio) && folio_test_swapcache(folio) &&
 	     folio_trylock(folio)) {
diff --git a/mm/filemap.c b/mm/filemap.c
index 13f0259d993c..5d1656e63602 100644
--- a/mm/filemap.c
+++ b/mm/filemap.c
@@ -3854,7 +3854,7 @@ vm_fault_t filemap_map_pages(struct vm_fault *vmf,
 
 		folio_unlock(folio);
 	} while ((folio = next_uptodate_folio(&xas, mapping, end_pgoff)) != NULL);
-	add_mm_counter(vma->vm_mm, folio_type, rss);
+	add_mm_counter_other(vma->vm_mm, folio_type, rss);
 	pte_unmap_unlock(vmf->pte, vmf->ptl);
 	trace_mm_filemap_map_pages(mapping, start_pgoff, end_pgoff);
 out:
diff --git a/mm/huge_memory.c b/mm/huge_memory.c
index 1b81680b4225..614b0a8e168b 100644
--- a/mm/huge_memory.c
+++ b/mm/huge_memory.c
@@ -1228,7 +1228,7 @@ static void map_anon_folio_pmd(struct folio *folio, pmd_t *pmd,
 	folio_add_lru_vma(folio, vma);
 	set_pmd_at(vma->vm_mm, haddr, pmd, entry);
 	update_mmu_cache_pmd(vma, haddr, pmd);
-	add_mm_counter(vma->vm_mm, MM_ANONPAGES, HPAGE_PMD_NR);
+	add_mm_counter_local(vma->vm_mm, MM_ANONPAGES, HPAGE_PMD_NR);
 	count_vm_event(THP_FAULT_ALLOC);
 	count_mthp_stat(HPAGE_PMD_ORDER, MTHP_STAT_ANON_FAULT_ALLOC);
 	count_memcg_event_mm(vma->vm_mm, THP_FAULT_ALLOC);
@@ -1444,7 +1444,7 @@ static vm_fault_t insert_pmd(struct vm_area_struct *vma, unsigned long addr,
 		} else {
 			folio_get(fop.folio);
 			folio_add_file_rmap_pmd(fop.folio, &fop.folio->page, vma);
-			add_mm_counter(mm, mm_counter_file(fop.folio), HPAGE_PMD_NR);
+			add_mm_counter_local(mm, mm_counter_file(fop.folio), HPAGE_PMD_NR);
 		}
 	} else {
 		entry = pmd_mkhuge(pfn_pmd(fop.pfn, prot));
@@ -1563,7 +1563,7 @@ static vm_fault_t insert_pud(struct vm_area_struct *vma, unsigned long addr,
 
 		folio_get(fop.folio);
 		folio_add_file_rmap_pud(fop.folio, &fop.folio->page, vma);
-		add_mm_counter(mm, mm_counter_file(fop.folio), HPAGE_PUD_NR);
+		add_mm_counter_local(mm, mm_counter_file(fop.folio), HPAGE_PUD_NR);
 	} else {
 		entry = pud_mkhuge(pfn_pud(fop.pfn, prot));
 		entry = pud_mkspecial(entry);
@@ -1714,7 +1714,7 @@ int copy_huge_pmd(struct mm_struct *dst_mm, struct mm_struct *src_mm,
 				pmd = pmd_swp_mkuffd_wp(pmd);
 			set_pmd_at(src_mm, addr, src_pmd, pmd);
 		}
-		add_mm_counter(dst_mm, MM_ANONPAGES, HPAGE_PMD_NR);
+		add_mm_counter_local(dst_mm, MM_ANONPAGES, HPAGE_PMD_NR);
 		mm_inc_nr_ptes(dst_mm);
 		pgtable_trans_huge_deposit(dst_mm, dst_pmd, pgtable);
 		if (!userfaultfd_wp(dst_vma))
@@ -1758,7 +1758,7 @@ int copy_huge_pmd(struct mm_struct *dst_mm, struct mm_struct *src_mm,
 		__split_huge_pmd(src_vma, src_pmd, addr, false);
 		return -EAGAIN;
 	}
-	add_mm_counter(dst_mm, MM_ANONPAGES, HPAGE_PMD_NR);
+	add_mm_counter_local(dst_mm, MM_ANONPAGES, HPAGE_PMD_NR);
 out_zero_page:
 	mm_inc_nr_ptes(dst_mm);
 	pgtable_trans_huge_deposit(dst_mm, dst_pmd, pgtable);
@@ -2223,11 +2223,11 @@ int zap_huge_pmd(struct mmu_gather *tlb, struct vm_area_struct *vma,
 
 		if (folio_test_anon(folio)) {
 			zap_deposited_table(tlb->mm, pmd);
-			add_mm_counter(tlb->mm, MM_ANONPAGES, -HPAGE_PMD_NR);
+			add_mm_counter_other(tlb->mm, MM_ANONPAGES, -HPAGE_PMD_NR);
 		} else {
 			if (arch_needs_pgtable_deposit())
 				zap_deposited_table(tlb->mm, pmd);
-			add_mm_counter(tlb->mm, mm_counter_file(folio),
+			add_mm_counter_other(tlb->mm, mm_counter_file(folio),
 				       -HPAGE_PMD_NR);
 
 			/*
@@ -2719,7 +2719,7 @@ int zap_huge_pud(struct mmu_gather *tlb, struct vm_area_struct *vma,
 		page = pud_page(orig_pud);
 		folio = page_folio(page);
 		folio_remove_rmap_pud(folio, page, vma);
-		add_mm_counter(tlb->mm, mm_counter_file(folio), -HPAGE_PUD_NR);
+		add_mm_counter_other(tlb->mm, mm_counter_file(folio), -HPAGE_PUD_NR);
 
 		spin_unlock(ptl);
 		tlb_remove_page_size(tlb, page, HPAGE_PUD_SIZE);
@@ -2755,7 +2755,7 @@ static void __split_huge_pud_locked(struct vm_area_struct *vma, pud_t *pud,
 		folio_set_referenced(folio);
 	folio_remove_rmap_pud(folio, page, vma);
 	folio_put(folio);
-	add_mm_counter(vma->vm_mm, mm_counter_file(folio),
+	add_mm_counter_local(vma->vm_mm, mm_counter_file(folio),
 		-HPAGE_PUD_NR);
 }
 
@@ -2874,7 +2874,7 @@ static void __split_huge_pmd_locked(struct vm_area_struct *vma, pmd_t *pmd,
 			folio_remove_rmap_pmd(folio, page, vma);
 			folio_put(folio);
 		}
-		add_mm_counter(mm, mm_counter_file(folio), -HPAGE_PMD_NR);
+		add_mm_counter_local(mm, mm_counter_file(folio), -HPAGE_PMD_NR);
 		return;
 	}
 
@@ -3188,7 +3188,7 @@ static bool __discard_anon_folio_pmd_locked(struct vm_area_struct *vma,
 
 	folio_remove_rmap_pmd(folio, pmd_page(orig_pmd), vma);
 	zap_deposited_table(mm, pmdp);
-	add_mm_counter(mm, MM_ANONPAGES, -HPAGE_PMD_NR);
+	add_mm_counter_local(mm, MM_ANONPAGES, -HPAGE_PMD_NR);
 	if (vma->vm_flags & VM_LOCKED)
 		mlock_drain_local();
 	folio_put(folio);
diff --git a/mm/khugepaged.c b/mm/khugepaged.c
index abe54f0043c7..a6634ca0667d 100644
--- a/mm/khugepaged.c
+++ b/mm/khugepaged.c
@@ -691,7 +691,7 @@ static void __collapse_huge_page_copy_succeeded(pte_t *pte,
 		nr_ptes = 1;
 		pteval = ptep_get(_pte);
 		if (pte_none(pteval) || is_zero_pfn(pte_pfn(pteval))) {
-			add_mm_counter(vma->vm_mm, MM_ANONPAGES, 1);
+			add_mm_counter_other(vma->vm_mm, MM_ANONPAGES, 1);
 			if (is_zero_pfn(pte_pfn(pteval))) {
 				/*
 				 * ptl mostly unnecessary.
@@ -1664,7 +1664,7 @@ int collapse_pte_mapped_thp(struct mm_struct *mm, unsigned long addr,
 	/* step 3: set proper refcount and mm_counters. */
 	if (nr_mapped_ptes) {
 		folio_ref_sub(folio, nr_mapped_ptes);
-		add_mm_counter(mm, mm_counter_file(folio), -nr_mapped_ptes);
+		add_mm_counter_other(mm, mm_counter_file(folio), -nr_mapped_ptes);
 	}
 
 	/* step 4: remove empty page table */
@@ -1700,7 +1700,7 @@ int collapse_pte_mapped_thp(struct mm_struct *mm, unsigned long addr,
 	if (nr_mapped_ptes) {
 		flush_tlb_mm(mm);
 		folio_ref_sub(folio, nr_mapped_ptes);
-		add_mm_counter(mm, mm_counter_file(folio), -nr_mapped_ptes);
+		add_mm_counter_other(mm, mm_counter_file(folio), -nr_mapped_ptes);
 	}
 unlock:
 	if (start_pte)
diff --git a/mm/ksm.c b/mm/ksm.c
index 7bc726b50b2f..7434cf1f4925 100644
--- a/mm/ksm.c
+++ b/mm/ksm.c
@@ -1410,7 +1410,7 @@ static int replace_page(struct vm_area_struct *vma, struct page *page,
 		 * will get wrong values in /proc, and a BUG message in dmesg
 		 * when tearing down the mm.
 		 */
-		dec_mm_counter(mm, MM_ANONPAGES);
+		dec_mm_counter_other(mm, MM_ANONPAGES);
 	}
 
 	flush_cache_page(vma, addr, pte_pfn(ptep_get(ptep)));
diff --git a/mm/madvise.c b/mm/madvise.c
index fb1c86e630b6..ba7ea134f5ad 100644
--- a/mm/madvise.c
+++ b/mm/madvise.c
@@ -776,7 +776,7 @@ static int madvise_free_pte_range(pmd_t *pmd, unsigned long addr,
 	}
 
 	if (nr_swap)
-		add_mm_counter(mm, MM_SWAPENTS, nr_swap);
+		add_mm_counter_local(mm, MM_SWAPENTS, nr_swap);
 	if (start_pte) {
 		arch_leave_lazy_mmu_mode();
 		pte_unmap_unlock(start_pte, ptl);
diff --git a/mm/memory.c b/mm/memory.c
index 74b45e258323..9a18ac25955c 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -488,7 +488,7 @@ static inline void add_mm_rss_vec(struct mm_struct *mm, int *rss)
 
 	for (i = 0; i < NR_MM_COUNTERS; i++)
 		if (rss[i])
-			add_mm_counter(mm, i, rss[i]);
+			add_mm_counter_other(mm, i, rss[i]);
 }
 
 static bool is_bad_page_map_ratelimited(void)
@@ -2306,7 +2306,7 @@ static int insert_page_into_pte_locked(struct vm_area_struct *vma, pte_t *pte,
 			pteval = pte_mkyoung(pteval);
 			pteval = maybe_mkwrite(pte_mkdirty(pteval), vma);
 		}
-		inc_mm_counter(vma->vm_mm, mm_counter_file(folio));
+		inc_mm_counter_local(vma->vm_mm, mm_counter_file(folio));
 		folio_add_file_rmap_pte(folio, page, vma);
 	}
 	set_pte_at(vma->vm_mm, addr, pte, pteval);
@@ -3716,12 +3716,12 @@ static vm_fault_t wp_page_copy(struct vm_fault *vmf)
 	if (likely(vmf->pte && pte_same(ptep_get(vmf->pte), vmf->orig_pte))) {
 		if (old_folio) {
 			if (!folio_test_anon(old_folio)) {
-				dec_mm_counter(mm, mm_counter_file(old_folio));
-				inc_mm_counter(mm, MM_ANONPAGES);
+				dec_mm_counter_other(mm, mm_counter_file(old_folio));
+				inc_mm_counter_other(mm, MM_ANONPAGES);
 			}
 		} else {
 			ksm_might_unmap_zero_page(mm, vmf->orig_pte);
-			inc_mm_counter(mm, MM_ANONPAGES);
+			inc_mm_counter_other(mm, MM_ANONPAGES);
 		}
 		flush_cache_page(vma, vmf->address, pte_pfn(vmf->orig_pte));
 		entry = folio_mk_pte(new_folio, vma->vm_page_prot);
@@ -4916,8 +4916,8 @@ vm_fault_t do_swap_page(struct vm_fault *vmf)
 	if (should_try_to_free_swap(folio, vma, vmf->flags))
 		folio_free_swap(folio);
 
-	add_mm_counter(vma->vm_mm, MM_ANONPAGES, nr_pages);
-	add_mm_counter(vma->vm_mm, MM_SWAPENTS, -nr_pages);
+	add_mm_counter_other(vma->vm_mm, MM_ANONPAGES, nr_pages);
+	add_mm_counter_other(vma->vm_mm, MM_SWAPENTS, -nr_pages);
 	pte = mk_pte(page, vma->vm_page_prot);
 	if (pte_swp_soft_dirty(vmf->orig_pte))
 		pte = pte_mksoft_dirty(pte);
@@ -5223,7 +5223,7 @@ static vm_fault_t do_anonymous_page(struct vm_fault *vmf)
 	}
 
 	folio_ref_add(folio, nr_pages - 1);
-	add_mm_counter(vma->vm_mm, MM_ANONPAGES, nr_pages);
+	add_mm_counter_other(vma->vm_mm, MM_ANONPAGES, nr_pages);
 	count_mthp_stat(folio_order(folio), MTHP_STAT_ANON_FAULT_ALLOC);
 	folio_add_new_anon_rmap(folio, vma, addr, RMAP_EXCLUSIVE);
 	folio_add_lru_vma(folio, vma);
@@ -5375,7 +5375,7 @@ vm_fault_t do_set_pmd(struct vm_fault *vmf, struct folio *folio, struct page *pa
 	if (write)
 		entry = maybe_pmd_mkwrite(pmd_mkdirty(entry), vma);
 
-	add_mm_counter(vma->vm_mm, mm_counter_file(folio), HPAGE_PMD_NR);
+	add_mm_counter_other(vma->vm_mm, mm_counter_file(folio), HPAGE_PMD_NR);
 	folio_add_file_rmap_pmd(folio, page, vma);
 
 	/*
@@ -5561,7 +5561,7 @@ vm_fault_t finish_fault(struct vm_fault *vmf)
 	folio_ref_add(folio, nr_pages - 1);
 	set_pte_range(vmf, folio, page, nr_pages, addr);
 	type = is_cow ? MM_ANONPAGES : mm_counter_file(folio);
-	add_mm_counter(vma->vm_mm, type, nr_pages);
+	add_mm_counter_other(vma->vm_mm, type, nr_pages);
 	ret = 0;
 
 unlock:
diff --git a/mm/migrate.c b/mm/migrate.c
index e3065c9edb55..dd8c6e6224f9 100644
--- a/mm/migrate.c
+++ b/mm/migrate.c
@@ -329,7 +329,7 @@ static bool try_to_map_unused_to_zeropage(struct page_vma_mapped_walk *pvmw,
 
 	set_pte_at(pvmw->vma->vm_mm, pvmw->address, pvmw->pte, newpte);
 
-	dec_mm_counter(pvmw->vma->vm_mm, mm_counter(folio));
+	dec_mm_counter_other(pvmw->vma->vm_mm, mm_counter(folio));
 	return true;
 }
 
diff --git a/mm/migrate_device.c b/mm/migrate_device.c
index abd9f6850db6..7f3e5d7b3109 100644
--- a/mm/migrate_device.c
+++ b/mm/migrate_device.c
@@ -676,7 +676,7 @@ static void migrate_vma_insert_page(struct migrate_vma *migrate,
 	if (userfaultfd_missing(vma))
 		goto unlock_abort;
 
-	inc_mm_counter(mm, MM_ANONPAGES);
+	inc_mm_counter_other(mm, MM_ANONPAGES);
 	folio_add_new_anon_rmap(folio, vma, addr, RMAP_EXCLUSIVE);
 	if (!folio_is_zone_device(folio))
 		folio_add_lru_vma(folio, vma);
diff --git a/mm/rmap.c b/mm/rmap.c
index ac4f783d6ec2..0f6023ffb65d 100644
--- a/mm/rmap.c
+++ b/mm/rmap.c
@@ -2085,7 +2085,7 @@ static bool try_to_unmap_one(struct folio *folio, struct vm_area_struct *vma,
 				set_huge_pte_at(mm, address, pvmw.pte, pteval,
 						hsz);
 			} else {
-				dec_mm_counter(mm, mm_counter(folio));
+				dec_mm_counter_other(mm, mm_counter(folio));
 				set_pte_at(mm, address, pvmw.pte, pteval);
 			}
 		} else if (likely(pte_present(pteval)) && pte_unused(pteval) &&
@@ -2100,7 +2100,7 @@ static bool try_to_unmap_one(struct folio *folio, struct vm_area_struct *vma,
 			 * migration) will not expect userfaults on already
 			 * copied pages.
 			 */
-			dec_mm_counter(mm, mm_counter(folio));
+			dec_mm_counter_other(mm, mm_counter(folio));
 		} else if (folio_test_anon(folio)) {
 			swp_entry_t entry = page_swap_entry(subpage);
 			pte_t swp_pte;
@@ -2155,7 +2155,7 @@ static bool try_to_unmap_one(struct folio *folio, struct vm_area_struct *vma,
 					set_ptes(mm, address, pvmw.pte, pteval, nr_pages);
 					goto walk_abort;
 				}
-				add_mm_counter(mm, MM_ANONPAGES, -nr_pages);
+				add_mm_counter_other(mm, MM_ANONPAGES, -nr_pages);
 				goto discard;
 			}
 
@@ -2188,8 +2188,8 @@ static bool try_to_unmap_one(struct folio *folio, struct vm_area_struct *vma,
 					list_add(&mm->mmlist, &init_mm.mmlist);
 				spin_unlock(&mmlist_lock);
 			}
-			dec_mm_counter(mm, MM_ANONPAGES);
-			inc_mm_counter(mm, MM_SWAPENTS);
+			dec_mm_counter_other(mm, MM_ANONPAGES);
+			inc_mm_counter_other(mm, MM_SWAPENTS);
 			swp_pte = swp_entry_to_pte(entry);
 			if (anon_exclusive)
 				swp_pte = pte_swp_mkexclusive(swp_pte);
@@ -2217,7 +2217,7 @@ static bool try_to_unmap_one(struct folio *folio, struct vm_area_struct *vma,
 			 *
 			 * See Documentation/mm/mmu_notifier.rst
 			 */
-			dec_mm_counter(mm, mm_counter_file(folio));
+			dec_mm_counter_other(mm, mm_counter_file(folio));
 		}
 discard:
 		if (unlikely(folio_test_hugetlb(folio))) {
@@ -2476,7 +2476,7 @@ static bool try_to_migrate_one(struct folio *folio, struct vm_area_struct *vma,
 				set_huge_pte_at(mm, address, pvmw.pte, pteval,
 						hsz);
 			} else {
-				dec_mm_counter(mm, mm_counter(folio));
+				dec_mm_counter_other(mm, mm_counter(folio));
 				set_pte_at(mm, address, pvmw.pte, pteval);
 			}
 		} else if (likely(pte_present(pteval)) && pte_unused(pteval) &&
@@ -2491,7 +2491,7 @@ static bool try_to_migrate_one(struct folio *folio, struct vm_area_struct *vma,
 			 * migration) will not expect userfaults on already
 			 * copied pages.
 			 */
-			dec_mm_counter(mm, mm_counter(folio));
+			dec_mm_counter_other(mm, mm_counter(folio));
 		} else {
 			swp_entry_t entry;
 			pte_t swp_pte;
diff --git a/mm/swapfile.c b/mm/swapfile.c
index 10760240a3a2..70f7d31c0854 100644
--- a/mm/swapfile.c
+++ b/mm/swapfile.c
@@ -2163,7 +2163,7 @@ static int unuse_pte(struct vm_area_struct *vma, pmd_t *pmd,
 	if (unlikely(hwpoisoned || !folio_test_uptodate(folio))) {
 		swp_entry_t swp_entry;
 
-		dec_mm_counter(vma->vm_mm, MM_SWAPENTS);
+		dec_mm_counter_other(vma->vm_mm, MM_SWAPENTS);
 		if (hwpoisoned) {
 			swp_entry = make_hwpoison_entry(page);
 		} else {
@@ -2181,8 +2181,8 @@ static int unuse_pte(struct vm_area_struct *vma, pmd_t *pmd,
 	 */
 	arch_swap_restore(folio_swap(entry, folio), folio);
 
-	dec_mm_counter(vma->vm_mm, MM_SWAPENTS);
-	inc_mm_counter(vma->vm_mm, MM_ANONPAGES);
+	dec_mm_counter_other(vma->vm_mm, MM_SWAPENTS);
+	inc_mm_counter_other(vma->vm_mm, MM_ANONPAGES);
 	folio_get(folio);
 	if (folio == swapcache) {
 		rmap_t rmap_flags = RMAP_NONE;
diff --git a/mm/userfaultfd.c b/mm/userfaultfd.c
index af61b95c89e4..34e760c37b7b 100644
--- a/mm/userfaultfd.c
+++ b/mm/userfaultfd.c
@@ -221,7 +221,7 @@ int mfill_atomic_install_pte(pmd_t *dst_pmd,
 	 * Must happen after rmap, as mm_counter() checks mapping (via
 	 * PageAnon()), which is set by __page_set_anon_rmap().
 	 */
-	inc_mm_counter(dst_mm, mm_counter(folio));
+	inc_mm_counter_other(dst_mm, mm_counter(folio));
 
 	set_pte_at(dst_mm, dst_addr, dst_pte, _dst_pte);
 
-- 
2.51.0




* Re: [RFC PATCH 0/4] Optimize rss_stat initialization/teardown for single-threaded tasks
  2025-11-27 23:36 [RFC PATCH 0/4] Optimize rss_stat initialization/teardown for single-threaded tasks Gabriel Krisman Bertazi
                   ` (3 preceding siblings ...)
  2025-11-27 23:36 ` [RFC PATCH 4/4] mm: Split a slow path for updating mm counters Gabriel Krisman Bertazi
@ 2025-11-28 13:30 ` Mathieu Desnoyers
  2025-11-28 20:10   ` Jan Kara
  4 siblings, 1 reply; 19+ messages in thread
From: Mathieu Desnoyers @ 2025-11-28 13:30 UTC (permalink / raw)
  To: Gabriel Krisman Bertazi, linux-mm
  Cc: linux-kernel, jack, Mateusz Guzik, Shakeel Butt, Michal Hocko,
	Dennis Zhou, Tejun Heo, Christoph Lameter, Andrew Morton,
	David Hildenbrand, Lorenzo Stoakes, Liam R. Howlett,
	Vlastimil Babka, Mike Rapoport, Suren Baghdasaryan,
	Thomas Gleixner

On 2025-11-27 18:36, Gabriel Krisman Bertazi wrote:
> The cost of the pcpu memory allocation is non-negligible for systems
> with many cpus, and it is quite visible when forking a new task, as
> reported on a few occasions.
I've come to the same conclusion within the development of
the hierarchical per-cpu counters.

But while the mm_struct has a SLAB cache (initialized in
kernel/fork.c:mm_cache_init()), there is no such thing
for the per-mm per-cpu data.

In the mm_struct, we have the following per-cpu data (please
let me know if I missed any in the maze):

- struct mm_cid __percpu *pcpu_cid (or equivalent through
   struct mm_mm_cid after Thomas Gleixner gets his rewrite
   upstream),

- unsigned int __percpu *futex_ref,

- NR_MM_COUNTERS rss_stats per-cpu counters.

What would really reduce memory allocation overhead on fork
is to move all those fields into a top level
"struct mm_percpu_struct" as a first step. This would
merge 3 per-cpu allocations into one when forking a new
task.

Then the second step is to create a mm_percpu_struct
cache to bypass the per-cpu allocator.
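
Something along these lines (rough sketch only, assuming the three
allocations keep their current per-cpu element types):

    struct mm_percpu_struct {
            struct mm_cid cid;              /* was mm->pcpu_cid */
            unsigned int futex_ref;         /* was mm->futex_ref */
            s32 rss_stat[NR_MM_COUNTERS];   /* per-cpu part of rss_stat */
    };

    /* in struct mm_struct: */
    struct mm_percpu_struct __percpu *pcpu;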

I suspect that by doing just that we'd get most of the
performance benefits provided by the single-threaded special-case
proposed here.

I'm not against special casing single-threaded if it's still
worth it after doing the underlying data structure layout/caching
changes I'm proposing here, but I think we need to fix the
memory allocation overhead issue first before working around it
with special cases and added complexity.

Thanks,

Mathieu

-- 
Mathieu Desnoyers
EfficiOS Inc.
https://www.efficios.com



* Re: [RFC PATCH 0/4] Optimize rss_stat initialization/teardown for single-threaded tasks
  2025-11-28 13:30 ` [RFC PATCH 0/4] Optimize rss_stat initialization/teardown for single-threaded tasks Mathieu Desnoyers
@ 2025-11-28 20:10   ` Jan Kara
  2025-11-28 20:12     ` Mathieu Desnoyers
  2025-11-29  5:57     ` Mateusz Guzik
  0 siblings, 2 replies; 19+ messages in thread
From: Jan Kara @ 2025-11-28 20:10 UTC (permalink / raw)
  To: Mathieu Desnoyers
  Cc: Gabriel Krisman Bertazi, linux-mm, linux-kernel, jack,
	Mateusz Guzik, Shakeel Butt, Michal Hocko, Dennis Zhou,
	Tejun Heo, Christoph Lameter, Andrew Morton, David Hildenbrand,
	Lorenzo Stoakes, Liam R. Howlett, Vlastimil Babka, Mike Rapoport,
	Suren Baghdasaryan, Thomas Gleixner

On Fri 28-11-25 08:30:08, Mathieu Desnoyers wrote:
> On 2025-11-27 18:36, Gabriel Krisman Bertazi wrote:
> > The cost of the pcpu memory allocation is non-negligible for systems
> > with many cpus, and it is quite visible when forking a new task, as
> > reported on a few occasions.
> I've come to the same conclusion within the development of
> the hierarchical per-cpu counters.
> 
> But while the mm_struct has a SLAB cache (initialized in
> kernel/fork.c:mm_cache_init()), there is no such thing
> for the per-mm per-cpu data.
> 
> In the mm_struct, we have the following per-cpu data (please
> let me know if I missed any in the maze):
> 
> - struct mm_cid __percpu *pcpu_cid (or equivalent through
>   struct mm_mm_cid after Thomas Gleixner gets his rewrite
>   upstream),
> 
> - unsigned int __percpu *futex_ref,
> 
> - NR_MM_COUNTERS rss_stats per-cpu counters.
> 
> What would really reduce memory allocation overhead on fork
> is to move all those fields into a top level
> "struct mm_percpu_struct" as a first step. This would
> merge 3 per-cpu allocations into one when forking a new
> task.
> 
> Then the second step is to create a mm_percpu_struct
> cache to bypass the per-cpu allocator.
> 
> I suspect that by doing just that we'd get most of the
> performance benefits provided by the single-threaded special-case
> proposed here.

I don't think so. Because in the profiles I have been doing for these
loads the biggest cost wasn't actually the per-cpu allocation itself but
the cost of zeroing the allocated counter for many CPUs (and then the
counter summarization on exit) and you're not going to get rid of that with
just reshuffling per-cpu fields and adding slab allocator in front.

								Honza

-- 
Jan Kara <jack@suse.com>
SUSE Labs, CR



* Re: [RFC PATCH 0/4] Optimize rss_stat initialization/teardown for single-threaded tasks
  2025-11-28 20:10   ` Jan Kara
@ 2025-11-28 20:12     ` Mathieu Desnoyers
  2025-11-29  5:57     ` Mateusz Guzik
  1 sibling, 0 replies; 19+ messages in thread
From: Mathieu Desnoyers @ 2025-11-28 20:12 UTC (permalink / raw)
  To: Jan Kara
  Cc: Gabriel Krisman Bertazi, linux-mm, linux-kernel, Mateusz Guzik,
	Shakeel Butt, Michal Hocko, Dennis Zhou, Tejun Heo,
	Christoph Lameter, Andrew Morton, David Hildenbrand,
	Lorenzo Stoakes, Liam R. Howlett, Vlastimil Babka, Mike Rapoport,
	Suren Baghdasaryan, Thomas Gleixner

On 2025-11-28 15:10, Jan Kara wrote:
> On Fri 28-11-25 08:30:08, Mathieu Desnoyers wrote:
[...]
>> I suspect that by doing just that we'd get most of the
>> performance benefits provided by the single-threaded special-case
>> proposed here.
> 
> I don't think so. Because in the profiles I have been doing for these
> loads the biggest cost wasn't actually the per-cpu allocation itself but
> the cost of zeroing the allocated counter for many CPUs (and then the
> counter summarization on exit) and you're not going to get rid of that with
> just reshuffling per-cpu fields and adding slab allocator in front.

That's a good point ! So skipping the zeroing of per-cpu fields would
indeed justify special-casing the single-threaded case.

Thanks,

Mathieu

-- 
Mathieu Desnoyers
EfficiOS Inc.
https://www.efficios.com



* Re: [RFC PATCH 0/4] Optimize rss_stat initialization/teardown for single-threaded tasks
  2025-11-28 20:10   ` Jan Kara
  2025-11-28 20:12     ` Mathieu Desnoyers
@ 2025-11-29  5:57     ` Mateusz Guzik
  2025-11-29  7:50       ` Mateusz Guzik
                         ` (2 more replies)
  1 sibling, 3 replies; 19+ messages in thread
From: Mateusz Guzik @ 2025-11-29  5:57 UTC (permalink / raw)
  To: Jan Kara
  Cc: Mathieu Desnoyers, Gabriel Krisman Bertazi, linux-mm,
	linux-kernel, Shakeel Butt, Michal Hocko, Dennis Zhou, Tejun Heo,
	Christoph Lameter, Andrew Morton, David Hildenbrand,
	Lorenzo Stoakes, Liam R. Howlett, Vlastimil Babka, Mike Rapoport,
	Suren Baghdasaryan, Thomas Gleixner

On Fri, Nov 28, 2025 at 9:10 PM Jan Kara <jack@suse.cz> wrote:
> On Fri 28-11-25 08:30:08, Mathieu Desnoyers wrote:
> > What would really reduce memory allocation overhead on fork
> > is to move all those fields into a top level
> > "struct mm_percpu_struct" as a first step. This would
> > merge 3 per-cpu allocations into one when forking a new
> > task.
> >
> > Then the second step is to create a mm_percpu_struct
> > cache to bypass the per-cpu allocator.
> >
> > I suspect that by doing just that we'd get most of the
> > performance benefits provided by the single-threaded special-case
> > proposed here.
>
> I don't think so. Because in the profiles I have been doing for these
> loads the biggest cost wasn't actually the per-cpu allocation itself but
> the cost of zeroing the allocated counter for many CPUs (and then the
> counter summarization on exit) and you're not going to get rid of that with
> just reshuffling per-cpu fields and adding slab allocator in front.
>

The entire ordeal has been discussed several times already. I'm rather
disappointed there is a new patchset posted which does not address any
of it and goes straight to special-casing single-threaded operation.

The major claims (by me anyway) are:
1. single-threaded operation for fork + exec suffers avoidable
overhead even without the rss counter problem, which is tractable
with the same kind of thing that would sort out the multi-threaded
problem
2. unfortunately there is an increasing number of multi-threaded (and
often short lived) processes (example: lld, the linker from the llvm
project; more broadly plenty of things in Rust where people think
threading == performance)

Bottom line is, solutions like the one proposed in the patchset are at
best a stopgap and even they leave performance on the table for the
case they are optimizing for.

The pragmatic way forward (as I see it anyway) is to fix up the
multi-threaded thing and see if trying to special case for
single-threaded case is justifiable afterwards.

Given that the current patchset has to resort to atomics in certain
cases, there is some error-proneness and runtime overhead associated
with it going beyond merely checking whether the process is
single-threaded, which puts an additional question mark on it.

Now to business:
You mentioned the rss loops are a problem. I agree, but they can be
largely damage-controlled. More importantly there are 2 loops of the
sort already happening even with the patchset at hand.

mm_alloc_cid() results in one loop in the percpu allocator to zero out
the area, then mm_init_cid() performs the following:
        for_each_possible_cpu(i) {
                struct mm_cid *pcpu_cid = per_cpu_ptr(mm->pcpu_cid, i);

                pcpu_cid->cid = MM_CID_UNSET;
                pcpu_cid->recent_cid = MM_CID_UNSET;
                pcpu_cid->time = 0;
        }

There is no way this is not visible already on 256 threads.

Preferably some magic would be done to init this on first use on a
given CPU. There is some bitmap tracking CPU presence, maybe this can
be tackled on top of it. But for the sake of argument let's say that's
too expensive or perhaps not feasible. Even then, the walk can be done
*once* by telling the percpu allocator to refrain from zeroing memory.

Which brings me to rss counters. In the current kernel that's
*another* loop over everything to zero it out. But it does not have to
be that way. Suppose the bitmap shenanigans mentioned above are a no-go
for these as well.

So instead the code could reach out to the percpu allocator to
allocate memory for both cid and rss (as mentioned by Mathieu), but
have it returned uninitialized and loop over it once, sorting out both
cid and rss in the same body. This should be drastically faster than
the current code.
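
For illustration, a minimal sketch of such a combined pass, assuming a
hypothetical mm_percpu_struct/mm->mm_pcpu (made-up names) and assuming
the percpu allocator grew a way to hand the area out without its own
zeroing pass:

struct mm_percpu_struct {
        struct mm_cid cid;
        long rss[NR_MM_COUNTERS];
};

static void mm_pcpu_init(struct mm_struct *mm)
{
        int cpu, i;

        /* One walk instead of one per field: set up cid and rss together. */
        for_each_possible_cpu(cpu) {
                struct mm_percpu_struct *p = per_cpu_ptr(mm->mm_pcpu, cpu);

                p->cid.cid = MM_CID_UNSET;
                p->cid.recent_cid = MM_CID_UNSET;
                p->cid.time = 0;
                for (i = 0; i < NR_MM_COUNTERS; i++)
                        p->rss[i] = 0;
        }
}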

But one may observe it is an invariant that the values sum up to 0 on process exit.

So if one was to make sure the first time this is handed out by the
percpu allocator the values are all 0s and then cache the area
somewhere for future allocs/frees of mm, there would be no need to do
the zeroing on alloc.

On the free side summing up rss counters in check_mm() is only there
for debugging purposes. Suppose it is useful enough that it needs to
stay. Even then, as implemented right now, this is just slow for no
reason:

        for (i = 0; i < NR_MM_COUNTERS; i++) {
                long x = percpu_counter_sum(&mm->rss_stat[i]);
[snip]
        }

That's *four* loops with extra overhead of irq-trips for every single
one. This can be patched up to only do one loop, possibly even with
irqs enabled the entire time.
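
Roughly something like this (sketch only, ignoring cpu hotplug and
whether irqs really need to stay off; rss_stat layout as in the
current kernel):

static void check_mm_rss_once(struct mm_struct *mm)
{
        long sum[NR_MM_COUNTERS];
        int i, cpu;

        /* Start from the central counts... */
        for (i = 0; i < NR_MM_COUNTERS; i++)
                sum[i] = mm->rss_stat[i].count;

        /* ...and fold in all per-cpu deviations in a single pass. */
        for_each_possible_cpu(cpu)
                for (i = 0; i < NR_MM_COUNTERS; i++)
                        sum[i] += *per_cpu_ptr(mm->rss_stat[i].counters, cpu);

        for (i = 0; i < NR_MM_COUNTERS; i++)
                if (unlikely(sum[i]))
                        pr_alert("BUG: Bad rss-counter state mm:%p type:%s val:%ld\n",
                                 mm, resident_page_types[i], sum[i]);
}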

Doing the loop is still slower than not doing it, but this may be just
fast enough to obsolete ideas like the one in the proposed patchset.

While per-cpu level caching for all possible allocations seems like
the easiest way out, it in fact does *NOT* fully solve the problem -- you
are still going to globally serialize in lru_gen_add_mm() (and the del
part), pgd_alloc() and other places.

Or to put it differently, per-cpu caching of mm_struct itself makes no
sense in the current kernel (with the patchset or not) because on the
way to finish the alloc or free you are going to globally serialize
several times and *that* is the issue to fix in the long run. You can
make the problematic locks fine-grained (and consequently alleviate
the scalability aspect), but you are still going to suffer the
overhead of taking them.

As far as I'm concerned the real long term solution(tm) would make the
cached mm's retain the expensive-to-sort-out state -- list presence,
percpu memory and whatever else.

To that end I see 2 feasible approaches:
1. a dedicated allocator with coarse granularity

Instead of per-cpu, you could have an instance for every n threads
(let's say 8 or whatever). This would pose a tradeoff between total
memory usage and scalability outside of a microbenchmark setting. You
are still going to serialize in some cases, but only once on alloc and
once on free, not several times and you are still cheaper
single-threaded. This is faster all around.

2. dtor support in the slub allocator

ctor does the hard work and dtor undoes it. There is an unfinished
patchset by Harry which implements the idea[1].

There is a serious concern about deadlock potential stemming from
running arbitrary dtor code during memory reclaim. I already described
elsewhere how with a little bit of discipline supported by lockdep
this is a non-issue (tl;dr add spinlocks marked as "leaf" (you can't
take any locks if you hold them and you have to disable interrupts) +
mark dtors as only allowed to hold a leaf spinlock et voila, code
guaranteed to not deadlock). But then all code trying to cache state
to be undone with the dtor has to be patched to facilitate it.
Again, bugs in the area get sorted out by lockdep.

The good news is that folks were apparently open to punting reclaim of
such memory into a workqueue, which completely alleviates that concern
anyway.

As it happens, if fork + exit is involved there are numerous other
bottlenecks which overshadow the above, but that's a rant for another
day. Here we can pretend for a minute they are solved.

[1] https://gitlab.com/hyeyoo/linux/-/commits/slab-destructor-rfc-v2-wip?ref_type=heads


^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: [RFC PATCH 0/4] Optimize rss_stat initialization/teardown for single-threaded tasks
  2025-11-29  5:57     ` Mateusz Guzik
@ 2025-11-29  7:50       ` Mateusz Guzik
  2025-12-01 10:38       ` Harry Yoo
  2025-12-01 15:23       ` Gabriel Krisman Bertazi
  2 siblings, 0 replies; 19+ messages in thread
From: Mateusz Guzik @ 2025-11-29  7:50 UTC (permalink / raw)
  To: Jan Kara
  Cc: Mathieu Desnoyers, Gabriel Krisman Bertazi, linux-mm,
	linux-kernel, Shakeel Butt, Michal Hocko, Dennis Zhou, Tejun Heo,
	Christoph Lameter, Andrew Morton, David Hildenbrand,
	Lorenzo Stoakes, Liam R. Howlett, Vlastimil Babka, Mike Rapoport,
	Suren Baghdasaryan, Thomas Gleixner

On Sat, Nov 29, 2025 at 06:57:21AM +0100, Mateusz Guzik wrote:
> Now to business:
> You mentioned the rss loops are a problem. I agree, but they can be
> largely damage-controlled. More importantly there are 2 loops of the
> sort already happening even with the patchset at hand.
> 
> mm_alloc_cid() results in one loop in the percpu allocator to zero out
> the area, then mm_init_cid() performs the following:
>         for_each_possible_cpu(i) {
>                 struct mm_cid *pcpu_cid = per_cpu_ptr(mm->pcpu_cid, i);
> 
>                 pcpu_cid->cid = MM_CID_UNSET;
>                 pcpu_cid->recent_cid = MM_CID_UNSET;
>                 pcpu_cid->time = 0;
>         }
> 
> There is no way this is not visible already on 256 threads.
> 
> Preferably some magic would be done to init this on first use on given
> CPU.There is some bitmap tracking CPU presence, maybe this can be
> tackled on top of it. But for the sake of argument let's say that's
> too expensive or perhaps not feasible. Even then, the walk can be done
> *once* by telling the percpu allocator to refrain from zeroing memory.
> 
> Which brings me to rss counters. In the current kernel that's
> *another* loop over everything to zero it out. But it does not have to
> be that way. Suppose bitmap shenanigans mentioned above are no-go for
> these as well.
> 

So I had another look and I think bitmapping it is perfectly feasible,
albeit requiring a little bit of refactoring to avoid adding overhead in
the common case.

There is a bitmap for tlb tracking, updated like so on context switch in
switch_mm_irqs_off():

	if (next != &init_mm && !cpumask_test_cpu(cpu, mm_cpumask(next)))
		cpumask_set_cpu(cpu, mm_cpumask(next));

... and of course cleared at times.

The easiest way out would be to add an additional bitmap with bits which are
*never* cleared. But that's another cache miss, preferably avoided.

Instead the entire thing could be reimplemented to have 2 bits per CPU
in the bitmap -- one for tlb and another for ever running on it.

Having spotted you are running on the given cpu for the first time, the
rss area gets zeroed out and *both* bits get set et voila. The common
case gets away with the same load as always. The less common case gets
the extra work of zeroing the counters and initializing cid.
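
A hedged sketch of that context-switch hook, with mm_cpu_seen() and
mm_mark_cpu_seen() standing in for whatever the 2-bit-per-cpu encoding
ends up looking like (neither helper exists today):

static inline void mm_note_cpu_use(struct mm_struct *mm, int cpu)
{
        struct mm_cid *pcpu_cid;
        int i;

        /* Common case: the "ever ran here" bit is already set. */
        if (likely(mm_cpu_seen(mm, cpu)))
                return;

        /* First use of this mm on this cpu: do the per-cpu init lazily. */
        pcpu_cid = per_cpu_ptr(mm->pcpu_cid, cpu);
        pcpu_cid->cid = MM_CID_UNSET;
        pcpu_cid->recent_cid = MM_CID_UNSET;
        pcpu_cid->time = 0;

        for (i = 0; i < NR_MM_COUNTERS; i++)
                *per_cpu_ptr(mm->rss_stat[i].counters, cpu) = 0;

        mm_mark_cpu_seen(mm, cpu);      /* sets both the tlb and the "seen" bit */
}

Teardown then only has to walk the cpus with the "seen" bit set.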

In return both cid and rss handling can avoid mandatory linear walks by
cpu count, instead merely having to visit the cpus known to have used a
given mm.

I don't think this is particularly ugly or complicated, just needs some
care & time to sit down and refactor away all the direct accesses into
helpers.

So if I was tasked with working on the overall problem, I would
definitely try to get this done. Fortunately for me this is not the
case. :-)


^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: [RFC PATCH 4/4] mm: Split a slow path for updating mm counters
  2025-11-27 23:36 ` [RFC PATCH 4/4] mm: Split a slow path for updating mm counters Gabriel Krisman Bertazi
@ 2025-12-01 10:19   ` David Hildenbrand (Red Hat)
  0 siblings, 0 replies; 19+ messages in thread
From: David Hildenbrand (Red Hat) @ 2025-12-01 10:19 UTC (permalink / raw)
  To: Gabriel Krisman Bertazi, linux-mm
  Cc: linux-kernel, jack, Mateusz Guzik, Shakeel Butt, Michal Hocko,
	Mathieu Desnoyers, Dennis Zhou, Tejun Heo, Christoph Lameter,
	Andrew Morton, Lorenzo Stoakes, Liam R. Howlett, Vlastimil Babka,
	Mike Rapoport, Suren Baghdasaryan

On 11/28/25 00:36, Gabriel Krisman Bertazi wrote:
> For cases where we know we are not coming from local context, there is
> no point in touching current when incrementing/decrementing the
> counters.  Split this path into another helper to avoid this cost.
> 
> Signed-off-by: Gabriel Krisman Bertazi <krisman@suse.de>
> ---
>   arch/s390/mm/gmap_helpers.c |  4 ++--
>   arch/s390/mm/pgtable.c      |  4 ++--
>   fs/exec.c                   |  2 +-
>   include/linux/mm.h          | 14 +++++++++++---
>   kernel/events/uprobes.c     |  2 +-
>   mm/filemap.c                |  2 +-
>   mm/huge_memory.c            | 22 +++++++++++-----------
>   mm/khugepaged.c             |  6 +++---
>   mm/ksm.c                    |  2 +-
>   mm/madvise.c                |  2 +-
>   mm/memory.c                 | 20 ++++++++++----------
>   mm/migrate.c                |  2 +-
>   mm/migrate_device.c         |  2 +-
>   mm/rmap.c                   | 16 ++++++++--------
>   mm/swapfile.c               |  6 +++---
>   mm/userfaultfd.c            |  2 +-
>   16 files changed, 58 insertions(+), 50 deletions(-)
> 
> diff --git a/arch/s390/mm/gmap_helpers.c b/arch/s390/mm/gmap_helpers.c
> index d4c3c36855e2..6d8498c56d08 100644
> --- a/arch/s390/mm/gmap_helpers.c
> +++ b/arch/s390/mm/gmap_helpers.c
> @@ -29,9 +29,9 @@
>   static void ptep_zap_swap_entry(struct mm_struct *mm, swp_entry_t entry)
>   {
>   	if (!non_swap_entry(entry))
> -		dec_mm_counter(mm, MM_SWAPENTS);
> +		dec_mm_counter_other(mm, MM_SWAPENTS);
>   	else if (is_migration_entry(entry))
> -		dec_mm_counter(mm, mm_counter(pfn_swap_entry_folio(entry)));
> +		dec_mm_counter_other(mm, mm_counter(pfn_swap_entry_folio(entry)));
>   	free_swap_and_cache(entry);
>   }
>   
> diff --git a/arch/s390/mm/pgtable.c b/arch/s390/mm/pgtable.c
> index 0fde20bbc50b..021a04f958e5 100644
> --- a/arch/s390/mm/pgtable.c
> +++ b/arch/s390/mm/pgtable.c
> @@ -686,11 +686,11 @@ void ptep_unshadow_pte(struct mm_struct *mm, unsigned long saddr, pte_t *ptep)
>   static void ptep_zap_swap_entry(struct mm_struct *mm, swp_entry_t entry)
>   {
>   	if (!non_swap_entry(entry))
> -		dec_mm_counter(mm, MM_SWAPENTS);
> +		dec_mm_counter_other(mm, MM_SWAPENTS);
>   	else if (is_migration_entry(entry)) {
>   		struct folio *folio = pfn_swap_entry_folio(entry);
>   
> -		dec_mm_counter(mm, mm_counter(folio));
> +		dec_mm_counter_other(mm, mm_counter(folio));
>   	}
>   	free_swap_and_cache(entry);
>   }
> diff --git a/fs/exec.c b/fs/exec.c
> index 4298e7e08d5d..33d0eb00d315 100644
> --- a/fs/exec.c
> +++ b/fs/exec.c
> @@ -137,7 +137,7 @@ static void acct_arg_size(struct linux_binprm *bprm, unsigned long pages)
>   		return;
>   
>   	bprm->vma_pages = pages;
> -	add_mm_counter(mm, MM_ANONPAGES, diff);
> +	add_mm_counter_local(mm, MM_ANONPAGES, diff);
>   }
>   
>   static struct page *get_arg_page(struct linux_binprm *bprm, unsigned long pos,
> diff --git a/include/linux/mm.h b/include/linux/mm.h
> index 29de4c60ac6c..2db12280e938 100644
> --- a/include/linux/mm.h
> +++ b/include/linux/mm.h
> @@ -2689,7 +2689,7 @@ static inline unsigned long get_mm_counter_sum(struct mm_struct *mm, int member)
>   
>   void mm_trace_rss_stat(struct mm_struct *mm, int member);
>   
> -static inline void add_mm_counter(struct mm_struct *mm, int member, long value)
> +static inline void add_mm_counter_local(struct mm_struct *mm, int member, long value)
>   {
>   	if (READ_ONCE(current->mm) == mm)
>   		lazy_percpu_counter_add_fast(&mm->rss_stat[member], value);
> @@ -2698,9 +2698,17 @@ static inline void add_mm_counter(struct mm_struct *mm, int member, long value)
>   
>   	mm_trace_rss_stat(mm, member);
>   }
> +static inline void add_mm_counter_other(struct mm_struct *mm, int member, long value)
> +{
> +	lazy_percpu_counter_add_atomic(&mm->rss_stat[member], value);
> +
> +	mm_trace_rss_stat(mm, member);
> +}
>   
> -#define inc_mm_counter(mm, member) add_mm_counter(mm, member, 1)
> -#define dec_mm_counter(mm, member) add_mm_counter(mm, member, -1)
> +#define inc_mm_counter_local(mm, member) add_mm_counter_local(mm, member, 1)
> +#define dec_mm_counter_local(mm, member) add_mm_counter_local(mm, member, -1)
> +#define inc_mm_counter_other(mm, member) add_mm_counter_other(mm, member, 1)
> +#define dec_mm_counter_other(mm, member) add_mm_counter_other(mm, member, -1)

I'd have thought that there is a local and !local version, whereby the 
latter one would simply maintain the old name. The "_other()" sticks out 
a bit.

E.g., cmpxchg() vs. cmpxchg_local().

Or would "_remote()" better describe what "_other()" intends to do?
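
i.e. something along these lines (naming sketch only, not a patch):

/* caller knows it is updating its own mm from task context */
static inline void add_mm_counter_local(struct mm_struct *mm, int member, long value);

/* keeps the old name and covers every other (remote) context */
static inline void add_mm_counter(struct mm_struct *mm, int member, long value);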

-- 
Cheers

David


^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: [RFC PATCH 0/4] Optimize rss_stat initialization/teardown for single-threaded tasks
  2025-11-29  5:57     ` Mateusz Guzik
  2025-11-29  7:50       ` Mateusz Guzik
@ 2025-12-01 10:38       ` Harry Yoo
  2025-12-01 11:31         ` Mateusz Guzik
  2025-12-01 15:23       ` Gabriel Krisman Bertazi
  2 siblings, 1 reply; 19+ messages in thread
From: Harry Yoo @ 2025-12-01 10:38 UTC (permalink / raw)
  To: Mateusz Guzik
  Cc: Jan Kara, Mathieu Desnoyers, Gabriel Krisman Bertazi, linux-mm,
	linux-kernel, Shakeel Butt, Michal Hocko, Dennis Zhou, Tejun Heo,
	Christoph Lameter, Andrew Morton, David Hildenbrand,
	Lorenzo Stoakes, Liam R. Howlett, Vlastimil Babka, Mike Rapoport,
	Suren Baghdasaryan, Thomas Gleixner

On Sat, Nov 29, 2025 at 06:57:21AM +0100, Mateusz Guzik wrote:
> On Fri, Nov 28, 2025 at 9:10 PM Jan Kara <jack@suse.cz> wrote:
> > On Fri 28-11-25 08:30:08, Mathieu Desnoyers wrote:
> > > What would really reduce memory allocation overhead on fork
> > > is to move all those fields into a top level
> > > "struct mm_percpu_struct" as a first step. This would
> > > merge 3 per-cpu allocations into one when forking a new
> > > task.
> > >
> > > Then the second step is to create a mm_percpu_struct
> > > cache to bypass the per-cpu allocator.
> > >
> > > I suspect that by doing just that we'd get most of the
> > > performance benefits provided by the single-threaded special-case
> > > proposed here.
> >
> > I don't think so. Because in the profiles I have been doing for these
> > loads the biggest cost wasn't actually the per-cpu allocation itself but
> > the cost of zeroing the allocated counter for many CPUs (and then the
> > counter summarization on exit) and you're not going to get rid of that with
> > just reshuffling per-cpu fields and adding slab allocator in front.
> >
> 
> The entire ordeal has been discussed several times already. I'm rather
> disappointed there is a new patchset posted which does not address any
> of it and goes straight to special-casing single-threaded operation.
> 
> The major claims (by me anyway) are:
> 1. single-threaded operation for fork + exec suffers avoidable
> overhead even without the rss counter problem, which are tractable
> with the same kind of thing which would sort out the multi-threaded
> problem
> 2. unfortunately there is an increasing number of multi-threaded (and
> often short lived) processes (example: lld, the linker form the llvm
> project; more broadly plenty of things Rust where people think
> threading == performance)
> 
> Bottom line is, solutions like the one proposed in the patchset are at
> best a stopgap and even they leave performance on the table for the
> case they are optimizing for.
> 
> The pragmatic way forward (as I see it anyway) is to fix up the
> multi-threaded thing and see if trying to special case for
> single-threaded case is justifiable afterwards.
> 
> Given that the current patchset has to resort to atomics in certain
> cases, there is some error-pronnes and runtime overhead associated
> with it going beyond merely checking if the process is
> single-threaded, which puts an additional question mark on it.
> 
> Now to business:
> You mentioned the rss loops are a problem. I agree, but they can be
> largely damage-controlled. More importantly there are 2 loops of the
> sort already happening even with the patchset at hand.
> 
> mm_alloc_cid() results in one loop in the percpu allocator to zero out
> the area, then mm_init_cid() performs the following:
>         for_each_possible_cpu(i) {
>                 struct mm_cid *pcpu_cid = per_cpu_ptr(mm->pcpu_cid, i);
> 
>                 pcpu_cid->cid = MM_CID_UNSET;
>                 pcpu_cid->recent_cid = MM_CID_UNSET;
>                 pcpu_cid->time = 0;
>         }
> 
> There is no way this is not visible already on 256 threads.
> 
> Preferably some magic would be done to init this on first use on given
> CPU.There is some bitmap tracking CPU presence, maybe this can be
> tackled on top of it. But for the sake of argument let's say that's
> too expensive or perhaps not feasible. Even then, the walk can be done
> *once* by telling the percpu allocator to refrain from zeroing memory.
> 
> Which brings me to rss counters. In the current kernel that's
> *another* loop over everything to zero it out. But it does not have to
> be that way. Suppose bitmap shenanigans mentioned above are no-go for
> these as well.
> 
> So instead the code could reach out to the percpu allocator to
> allocate memory for both cid and rss (as mentined by Mathieu), but
> have it returned uninitialized and loop over it once sorting out both
> cid and rss in the same body. This should be drastically faster than
> the current code.
> 
> But one may observe it is an invariant the values sum up to 0 on process exit.
> 
> So if one was to make sure the first time this is handed out by the
> percpu allocator the values are all 0s and then cache the area
> somewhere for future allocs/frees of mm, there would be no need to do
> the zeroing on alloc.

That's what the slab constructor is for!

> On the free side summing up rss counters in check_mm() is only there
> for debugging purposes. Suppose it is useful enough that it needs to
> stay. Even then, as implemented right now, this is just slow for no
> reason:
> 
>         for (i = 0; i < NR_MM_COUNTERS; i++) {
>                 long x = percpu_counter_sum(&mm->rss_stat[i]);
> [snip]
>         }
> 
> That's *four* loops with extra overhead of irq-trips for every single
> one. This can be patched up to only do one loop, possibly even with
> irqs enabled the entire time.
> 
> Doing the loop is still slower than not doing it, but his may be just
> fast enough to obsolete the ideas like in the proposed patchset.
> 
> While per-cpu level caching for all possible allocations seems like
> the easiest way out, it in fact does *NOT* fully solve problem -- you
> are still going to globally serialize in lru_gen_add_mm() (and the del
> part), pgd_alloc() and other places.
> 
> Or to put it differently, per-cpu caching of mm_struct itself makes no
> sense in the current kernel (with the patchset or not) because on the
> way to finish the alloc or free you are going to globally serialize
> several times and *that* is the issue to fix in the long run. You can
> make the problematic locks fine-grained (and consequently alleviate
> the scalability aspect), but you are still going to suffer the
> overhead of taking them.
> 
> As far as I'm concerned the real long term solution(tm) would make the
> cached mm's retain the expensive to sort out state -- list presence,
> percpu memory and whatever else.
> 
> To that end I see 2 feasible approaches:
> 1. a dedicated allocator with coarse granularity
> 
> Instead of per-cpu, you could have an instance for every n threads
> (let's say 8 or whatever). this would pose a tradeoff between total
> memory usage and scalability outside of a microbenchmark setting. you
> are still going to serialize in some cases, but only once on alloc and
> once on free, not several times and you are still cheaper
> single-threaded. This is faster all around.
> 
> 2. dtor support in the slub allocator
> 
> ctor does the hard work and dtor undoes it. There is an unfinished
> patchset by Harry which implements the idea[1].

Apologies for not reposting it for a while. I have limited capacity to push
this forward right now, but FYI... I just pushed slab-destructor-rfc-v2r2-wip
branch after rebasing it onto the latest slab/for-next.

https://gitlab.com/hyeyoo/linux/-/commits/slab-destructor-rfc-v2r2-wip?ref_type=heads

My review of this version is limited, but I did a little bit of testing.

> There is a serious concern about deadlock potential stemming from
> running arbitrary dtor code during memory reclaim. I already described
> elsewhere how with a little bit of discipline supported by lockdep
> this is a non-issue (tl;dr add spinlocks marked as "leaf" (you can't
> take any locks if you hold them and you have to disable interrupts) +
> mark dtors as only allowed to hold a leaf spinlock et voila, code
> guaranteed to not deadlock). But then all code trying to cache its
> state in to be undone with dtor has to be patched to facilitate it.
> Again bugs in the area sorted out by lockdep.
> 
> The good news is that folks were apparently open to punting reclaim of
> such memory into a workqueue, which completely alleviates that concern
> anyway.

I took the good news and switched to using a workqueue to reclaim slabs
(for caches with dtor) in v2.

> So happens if fork + exit is involved there are numerous other
> bottlenecks which overshadow the above, but that's a rant for another
> day. Here we can pretend for a minute they are solved.
> 
> [1] https://gitlab.com/hyeyoo/linux/-/commits/slab-destructor-rfc-v2-wip?ref_type=heads

-- 
Cheers,
Harry / Hyeonggon


^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: [RFC PATCH 0/4] Optimize rss_stat initialization/teardown for single-threaded tasks
  2025-12-01 10:38       ` Harry Yoo
@ 2025-12-01 11:31         ` Mateusz Guzik
  2025-12-01 14:47           ` Mathieu Desnoyers
  0 siblings, 1 reply; 19+ messages in thread
From: Mateusz Guzik @ 2025-12-01 11:31 UTC (permalink / raw)
  To: Harry Yoo
  Cc: Jan Kara, Mathieu Desnoyers, Gabriel Krisman Bertazi, linux-mm,
	linux-kernel, Shakeel Butt, Michal Hocko, Dennis Zhou, Tejun Heo,
	Christoph Lameter, Andrew Morton, David Hildenbrand,
	Lorenzo Stoakes, Liam R. Howlett, Vlastimil Babka, Mike Rapoport,
	Suren Baghdasaryan, Thomas Gleixner

On Mon, Dec 1, 2025 at 11:39 AM Harry Yoo <harry.yoo@oracle.com> wrote:
> Apologies for not reposting it for a while. I have limited capacity to push
> this forward right now, but FYI... I just pushed slab-destructor-rfc-v2r2-wip
> branch after rebasing it onto the latest slab/for-next.
>
> https://gitlab.com/hyeyoo/linux/-/commits/slab-destructor-rfc-v2r2-wip?ref_type=heads
>

Nice, thanks. This takes care of the majority of the needful(tm).

To reiterate, should something like this land, it is going to address
the multicore scalability concern for single-threaded processes better
than the patchset by Gabriel thanks to also taking care of cid. Bonus
points for handling creation and teardown of multi-threaded processes.

However, this is still going to suffer from doing a full cpu walk on
process exit. As I described earlier the current handling can be
massively depessimized by reimplementing this to take care of all 4
counters in each iteration, instead of walking everything 4 times.
This is still going to be slower than not doing the walk at all, but
it may be fast enough that Gabriel's patchset is no longer
justifiable.

But then the test box is "only" 256 hw threads, what about bigger boxes?

Given my previous note about increased use of multithreading in
userspace, the more concerned you happen to be about such a walk, the
more you want an actual solution which takes care of multithreaded
processes.

Additionally one has to assume per-cpu memory will be useful for other
facilities down the line, making such a walk into an even bigger
problem.

Thus ultimately *some* tracking of whether a given mm was ever active on
a given cpu is needed, preferably cheaply implemented at least for the
context switch code. Per what I described in another e-mail, one way
to do it would be to coalesce it with tlb handling by changing how the
bitmap tracking is handled -- having 2 adjacent bits denote cpu usage
+ tlb separately. For the common case setting the two should cost almost
the same as it does today. Iteration for tlb shootdowns would be less efficient
but that's probably tolerable. Maybe there is a better way, I did not
put much thought into it. I just claim sooner or later this will need
to get solved. At the same time it would be a bummer to add stopgaps
without even trying.

With the cpu tracking problem solved, check_mm would only visit a few
cpus in the benchmark (probably just 1) and it would be faster single-threaded
than the proposed patch *and* would retain that for processes which
went multithreaded.

I'm not signing up to handle this though and someone else would have
to sign off on the cpu tracking thing anyway.

That is to say, I laid out the lay of the land as I see it but I'm not
doing any work. :)


^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: [RFC PATCH 0/4] Optimize rss_stat initialization/teardown for single-threaded tasks
  2025-12-01 11:31         ` Mateusz Guzik
@ 2025-12-01 14:47           ` Mathieu Desnoyers
  0 siblings, 0 replies; 19+ messages in thread
From: Mathieu Desnoyers @ 2025-12-01 14:47 UTC (permalink / raw)
  To: Mateusz Guzik, Harry Yoo
  Cc: Jan Kara, Gabriel Krisman Bertazi, linux-mm, linux-kernel,
	Shakeel Butt, Michal Hocko, Dennis Zhou, Tejun Heo,
	Christoph Lameter, Andrew Morton, David Hildenbrand,
	Lorenzo Stoakes, Liam R. Howlett, Vlastimil Babka, Mike Rapoport,
	Suren Baghdasaryan, Thomas Gleixner

On 2025-12-01 06:31, Mateusz Guzik wrote:
> On Mon, Dec 1, 2025 at 11:39 AM Harry Yoo <harry.yoo@oracle.com> wrote:
>> Apologies for not reposting it for a while. I have limited capacity to push
>> this forward right now, but FYI... I just pushed slab-destructor-rfc-v2r2-wip
>> branch after rebasing it onto the latest slab/for-next.
>>
>> https://gitlab.com/hyeyoo/linux/-/commits/slab-destructor-rfc-v2r2-wip?ref_type=heads
>>
> 
> nice, thanks. This takes care of majority of the needful(tm).
> 
> To reiterate, should something like this land, it is going to address
> the multicore scalability concern for single-threaded processes better
> than the patchset by Gabriel thanks to also taking care of cid. Bonus
> points for handling creation and teardown of multi-threaded processes.
> 
> However, this is still going to suffer from doing a full cpu walk on
> process exit. As I described earlier the current handling can be
> massively depessimized by reimplementing this to take care of all 4
> counters in each iteration, instead of walking everything 4 times.
> This is still going to be slower than not doing the walk at all, but
> it may be fast enough that Gabriel's patchset is no longer
> justifiable.
> 
> But then the test box is "only" 256 hw threads, what about bigger boxes?
> 
> Given my previous note about increased use of multithreading in
> userspace, the more concerned you happen to be about such a walk, the
> more you want an actual solution which takes care of multithreaded
> processes.
> 
> Additionally one has to assume per-cpu memory will be useful for other
> facilities down the line, making such a walk into an even bigger
> problem.
> 
> Thus ultimately *some* tracking of whether given mm was ever active on
> a given cpu is needed, preferably cheaply implemented at least for the
> context switch code. Per what I described in another e-mail, one way
> to do it would be to coalesce it with tlb handling by changing how the
> bitmap tracking is handled -- having 2 adjacent bits denote cpu usage
> + tlb separately. For the common case this should be almost the code
> to set the two. Iteration for tlb shootdowns would be less efficient
> but that's probably tolerable. Maybe there is a better way, I did not
> put much thought into it. I just claim sooner or later this will need
> to get solved. At the same time would be a bummer to add stopgaps
> without even trying.
> 
> With the cpu tracking problem solved, check_mm would visit few cpus in
> the benchmark (probably just 1) and it would be faster single-threaded
> than the proposed patch *and* would retain that for processes which
> went multithreaded.

Looking at this problem, it appears to be a good fit for rseq mm_cid
(per-mm concurrency ids). Let me explain.

I originally implemented the rseq mm_cid for userspace. It keeps track
of max_mm_cid = min(nr_threads, nr_allowed_cpus) for each mm, and lets
the scheduler select a current mm_cid value within the range
[0 .. max_mm_cid - 1]. With Thomas Gleixner's rewrite (currently in
tip), we even have hooks in thread clone/exit where we know when
max_mm_cid is increased/decreased for a mm. So we could keep track of
the maximum value of max_mm_cid over the lifetime of a mm.

So using mm_cid for per-mm rss counter would involve:

- Still allocating memory per-cpu on mm allocation (nr_cpu_ids), but
   without zeroing all that memory (we eliminate a walk over all possible
   cpus on allocation).

- Initialize CPU counters on thread clone when max_mm_cid is increased.
   Keep track of the max value of max_mm_cid over mm lifetime.

- Rather than using the per-cpu accessors to access the counters, we
   would have to load the per-task mm_cid field to get the counter index.
   This would add a slight overhead on the fast path, because we
   would trade a segment-selector-prefixed operation for an access that
   depends on a load of the current task's mm_cid index.

- Iteration on all possible cpus at process exit is replaced by an
   iteration on mm maximum max_mm_cid, which will be bound by
   the maximum value of min(nr_threads, nr_allowed_cpus) over the
   mm lifetime. This iteration should be done with the new mm_cid
   mutex held across thread clone/exit.

One more downside to consider is loss of NUMA locality, because the
index used to access the per-cpu memory would not take into account
the hardware topology. The index-to-topology mapping should stay stable
for a given mm, but if we mix the memory allocation of per-cpu data
across different mms, then NUMA locality would be degraded.
Ideally we'd have a per-cpu allocator with per-mm arenas for mm_cid
indexing if we care about NUMA locality.

So let's say you have a 256-core machine, where cpu numbers can go
from 0 to 255: with a 4-thread process, mm_cid will be limited to
the range [0..3]. Likewise, if there are tons of threads in a process
limited to a few cores (e.g. pinned on cores 10 to 19), the range
will be limited to [0..9].

This approach solves the runtime overhead issue of zeroing per-cpu
memory for all scenarios:

* single-threaded: index = 0

* nr_threads < nr_cpu_ids
   * nr_threads < nr_allowed_cpus: index = [0 .. nr_threads - 1]
   * nr_threads >= nr_allowed_cpus: index = [0 .. nr_allowed_cpus - 1]

* nr_threads >= nr_cpu_ids: index = [0 .. nr_cpu_ids - 1]
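
To make the indexing concrete, a rough sketch (mm->rss_cid and
mm->max_ever_mm_cid are hypothetical fields, and this glosses over the
cid changing under a preempted task, which a real implementation would
have to handle):

static inline void add_mm_counter_cid(struct mm_struct *mm, int member, long value)
{
        int cid = READ_ONCE(current->mm_cid);   /* scheduler-assigned index */

        /* Same per-cpu storage as today, but indexed by cid, not by cpu. */
        *per_cpu_ptr(mm->rss_cid[member], cid) += value;
}

static long sum_mm_counter_cid(struct mm_struct *mm, int member)
{
        long sum = 0;
        int cid;

        /* Bound: max of min(nr_threads, nr_allowed_cpus) over the mm lifetime. */
        for (cid = 0; cid < READ_ONCE(mm->max_ever_mm_cid); cid++)
                sum += *per_cpu_ptr(mm->rss_cid[member], cid);

        return sum;
}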

Thanks,

Mathieu

-- 
Mathieu Desnoyers
EfficiOS Inc.
https://www.efficios.com


^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: [RFC PATCH 0/4] Optimize rss_stat initialization/teardown for single-threaded tasks
  2025-11-29  5:57     ` Mateusz Guzik
  2025-11-29  7:50       ` Mateusz Guzik
  2025-12-01 10:38       ` Harry Yoo
@ 2025-12-01 15:23       ` Gabriel Krisman Bertazi
  2025-12-01 19:16         ` Harry Yoo
  2025-12-03 11:02         ` Mateusz Guzik
  2 siblings, 2 replies; 19+ messages in thread
From: Gabriel Krisman Bertazi @ 2025-12-01 15:23 UTC (permalink / raw)
  To: Mateusz Guzik
  Cc: Jan Kara, Mathieu Desnoyers, linux-mm, linux-kernel,
	Shakeel Butt, Michal Hocko, Dennis Zhou, Tejun Heo,
	Christoph Lameter, Andrew Morton, David Hildenbrand,
	Lorenzo Stoakes, Liam R. Howlett, Vlastimil Babka, Mike Rapoport,
	Suren Baghdasaryan, Thomas Gleixner

Mateusz Guzik <mjguzik@gmail.com> writes:

> On Fri, Nov 28, 2025 at 9:10 PM Jan Kara <jack@suse.cz> wrote:
>> On Fri 28-11-25 08:30:08, Mathieu Desnoyers wrote:
>> > What would really reduce memory allocation overhead on fork
>> > is to move all those fields into a top level
>> > "struct mm_percpu_struct" as a first step. This would
>> > merge 3 per-cpu allocations into one when forking a new
>> > task.
>> >
>> > Then the second step is to create a mm_percpu_struct
>> > cache to bypass the per-cpu allocator.
>> >
>> > I suspect that by doing just that we'd get most of the
>> > performance benefits provided by the single-threaded special-case
>> > proposed here.
>>
>> I don't think so. Because in the profiles I have been doing for these
>> loads the biggest cost wasn't actually the per-cpu allocation itself but
>> the cost of zeroing the allocated counter for many CPUs (and then the
>> counter summarization on exit) and you're not going to get rid of that with
>> just reshuffling per-cpu fields and adding slab allocator in front.
>>

Hi Mateusz,

> The major claims (by me anyway) are:
> 1. single-threaded operation for fork + exec suffers avoidable
> overhead even without the rss counter problem, which are tractable
> with the same kind of thing which would sort out the multi-threaded
> problem

Agreed, there are more issues in the fork/exec path than just the
rss_stat.  The rss_stat performance is particularly relevant to us,
though, because it is a clear regression for single-threaded workloads
introduced in 6.2.

I took the time to test the slab constructor approach with the
/sbin/true microbenchmark.  I've seen only 2% gain on that tight loop in
the 80c machine, which, granted, is an artificial benchmark, but still a
good stressor of the single-threaded case.  With this patchset, I
reported 6% improvement, getting it close to the performance before the
pcpu rss_stats introduction. This is expected, as avoiding the pcpu
allocation and initialization altogether for the single-threaded case,
where it is not necessary, will always be better than speeding up the
allocation (even though that is a worthwhile effort in itself, as Mathieu
pointed out).

> 2. unfortunately there is an increasing number of multi-threaded (and
> often short lived) processes (example: lld, the linker form the llvm
> project; more broadly plenty of things Rust where people think
> threading == performance)

I don't agree with this argument, though.  Sure, there is an increasing
number of multi-threaded applications, but this is not relevant.  The
relevant argument is the number of single-threaded workloads. One
example is coreutils, which are spawned to death by scripts.  I took
the care to test the patchset with a full distro on my
day-to-day laptop and I wasn't surprised to see the vast majority of
forked tasks never fork a second thread.  The ones that do are most
often long-lived applications, where the cost of mm initialization is
way less relevant to the overall system performance.  Another example is
the fact that real-world benchmarks, like kernbench, can be improved by
special-casing single-threaded tasks.

> The pragmatic way forward (as I see it anyway) is to fix up the
> multi-threaded thing and see if trying to special case for
> single-threaded case is justifiable afterwards.
>
> Given that the current patchset has to resort to atomics in certain
> cases, there is some error-pronnes and runtime overhead associated
> with it going beyond merely checking if the process is
> single-threaded, which puts an additional question mark on it.

I don't get why atomics would make it error-prone.  But, regarding the
runtime overhead, please note the main point of this approach is that
the hot path can be handled with a simple non-atomic variable write in
the task context, and not an atomic operation. The latter is only used
for the infrequent case where the counter is touched by an external task
such as OOM, khugepaged, etc.
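
Conceptually, the single-threaded mode boils down to something like the
sketch below (not the actual lazy_percpu_counter code from the patchset,
and it leaves out the upgrade to a real percpu counter once a second
thread shows up):

struct lazy_counter_sketch {
        long            local;  /* only ever written by the owning task */
        atomic_long_t   remote; /* rare updates from OOM, khugepaged, ... */
};

static inline void lazy_add_fast(struct lazy_counter_sketch *c, long v)
{
        /* owner context: plain write, no atomic needed */
        WRITE_ONCE(c->local, READ_ONCE(c->local) + v);
}

static inline void lazy_add_atomic(struct lazy_counter_sketch *c, long v)
{
        atomic_long_add(v, &c->remote);
}

static inline long lazy_read(struct lazy_counter_sketch *c)
{
        return READ_ONCE(c->local) + atomic_long_read(&c->remote);
}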

>
> Now to business:

-- 
Gabriel Krisman Bertazi


^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: [RFC PATCH 0/4] Optimize rss_stat initialization/teardown for single-threaded tasks
  2025-12-01 15:23       ` Gabriel Krisman Bertazi
@ 2025-12-01 19:16         ` Harry Yoo
  2025-12-03 11:02         ` Mateusz Guzik
  1 sibling, 0 replies; 19+ messages in thread
From: Harry Yoo @ 2025-12-01 19:16 UTC (permalink / raw)
  To: Gabriel Krisman Bertazi
  Cc: Mateusz Guzik, Jan Kara, Mathieu Desnoyers, linux-mm,
	linux-kernel, Shakeel Butt, Michal Hocko, Dennis Zhou, Tejun Heo,
	Christoph Lameter, Andrew Morton, David Hildenbrand,
	Lorenzo Stoakes, Liam R. Howlett, Vlastimil Babka, Mike Rapoport,
	Suren Baghdasaryan, Thomas Gleixner

On Mon, Dec 01, 2025 at 10:23:43AM -0500, Gabriel Krisman Bertazi wrote:
> Mateusz Guzik <mjguzik@gmail.com> writes:
> 
> > On Fri, Nov 28, 2025 at 9:10 PM Jan Kara <jack@suse.cz> wrote:
> >> On Fri 28-11-25 08:30:08, Mathieu Desnoyers wrote:
> >> > What would really reduce memory allocation overhead on fork
> >> > is to move all those fields into a top level
> >> > "struct mm_percpu_struct" as a first step. This would
> >> > merge 3 per-cpu allocations into one when forking a new
> >> > task.
> >> >
> >> > Then the second step is to create a mm_percpu_struct
> >> > cache to bypass the per-cpu allocator.
> >> >
> >> > I suspect that by doing just that we'd get most of the
> >> > performance benefits provided by the single-threaded special-case
> >> > proposed here.
> >>
> >> I don't think so. Because in the profiles I have been doing for these
> >> loads the biggest cost wasn't actually the per-cpu allocation itself but
> >> the cost of zeroing the allocated counter for many CPUs (and then the
> >> counter summarization on exit) and you're not going to get rid of that with
> >> just reshuffling per-cpu fields and adding slab allocator in front.
> >>
> 
> Hi Mateusz,
> 
> > The major claims (by me anyway) are:
> > 1. single-threaded operation for fork + exec suffers avoidable
> > overhead even without the rss counter problem, which are tractable
> > with the same kind of thing which would sort out the multi-threaded
> > problem
> 
> Agreed, there are more issues in the fork/exec path than just the
> rss_stat.  The rss_stat performance is particularly relevant to us,
> though, because it is a clear regression for single-threaded introduced
> in 6.2.
> 
> I took the time to test the slab constructor approach with the
> /sbin/true microbenchmark.  I've seen only 2% gain on that tight loop in
> the 80c machine, which, granted, is an artificial benchmark, but still a
> good stressor of the single-threaded case.  With this patchset, I
> reported 6% improvement, getting it close to the performance before the
> pcpu rss_stats introduction.

Hi Gabriel,

I don't want to argue which approach is better, but just wanted to
mention that maybe this is not a fair comparison because we can (almost)
eliminate the initialization cost with a slab ctor & dtor pair. As Mateusz
pointed out, under normal conditions, we know that the sum of
each rss_stat counter is zero when it's freed.

That is what the slab constructor is for; if we know that certain fields of
a type are freed in a particular state, then we only need to initialize
them once in the constructor when the object is first created, and no
initialization is needed for subsequent allocations.

We couldn't use a slab constructor alone to do this because the percpu memory
is not allocated when the ctor is called, but with a ctor/dtor pair we can.
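
For reference, the ctor half of that pattern with the existing API looks
roughly like this (the dtor half only exists in the WIP branch, so its
exact interface is not shown):

struct counted_obj {
        long counter;   /* invariant: zero whenever the object sits free */
};

static void counted_obj_ctor(void *p)
{
        struct counted_obj *o = p;

        /* Runs once when the backing slab is created, not on every alloc. */
        o->counter = 0;
}

static struct kmem_cache *counted_obj_cache;

static int __init counted_obj_cache_init(void)
{
        counted_obj_cache = kmem_cache_create("counted_obj",
                                              sizeof(struct counted_obj), 0,
                                              SLAB_HWCACHE_ALIGN,
                                              counted_obj_ctor);
        return counted_obj_cache ? 0 : -ENOMEM;
}

With a dtor in the picture, the expensive state (like the percpu area)
could then be set up in the ctor and only torn down in the dtor, so a
freed-and-reallocated mm keeps it.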

> This is expected, as avoiding the pcpu
> allocation and initialization all together for the single-threaded case,
> where it is not necessary, will always be better than speeding up the
> allocation (even though that a worthwhile effort itself, as Mathieu
> pointed out).

-- 
Cheers,
Harry / Hyeonggon


^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: [RFC PATCH 0/4] Optimize rss_stat initialization/teardown for single-threaded tasks
  2025-12-01 15:23       ` Gabriel Krisman Bertazi
  2025-12-01 19:16         ` Harry Yoo
@ 2025-12-03 11:02         ` Mateusz Guzik
  2025-12-03 11:54           ` Mateusz Guzik
  1 sibling, 1 reply; 19+ messages in thread
From: Mateusz Guzik @ 2025-12-03 11:02 UTC (permalink / raw)
  To: Gabriel Krisman Bertazi
  Cc: Jan Kara, Mathieu Desnoyers, linux-mm, linux-kernel,
	Shakeel Butt, Michal Hocko, Dennis Zhou, Tejun Heo,
	Christoph Lameter, Andrew Morton, David Hildenbrand,
	Lorenzo Stoakes, Liam R. Howlett, Vlastimil Babka, Mike Rapoport,
	Suren Baghdasaryan, Thomas Gleixner

On Mon, Dec 1, 2025 at 4:23 PM Gabriel Krisman Bertazi <krisman@suse.de> wrote:
>
> Mateusz Guzik <mjguzik@gmail.com> writes:
> > The major claims (by me anyway) are:
> > 1. single-threaded operation for fork + exec suffers avoidable
> > overhead even without the rss counter problem, which are tractable
> > with the same kind of thing which would sort out the multi-threaded
> > problem
>
> Agreed, there are more issues in the fork/exec path than just the
> rss_stat.  The rss_stat performance is particularly relevant to us,
> though, because it is a clear regression for single-threaded introduced
> in 6.2.
>
> I took the time to test the slab constructor approach with the
> /sbin/true microbenchmark.  I've seen only 2% gain on that tight loop in
> the 80c machine, which, granted, is an artificial benchmark, but still a
> good stressor of the single-threaded case.  With this patchset, I
> reported 6% improvement, getting it close to the performance before the
> pcpu rss_stats introduction. This is expected, as avoiding the pcpu
> allocation and initialization all together for the single-threaded case,
> where it is not necessary, will always be better than speeding up the
> allocation (even though that a worthwhile effort itself, as Mathieu
> pointed out)

I'm fine with the benchmark method, but it was used on a kernel which
remains gimped by the avoidably slow walk in check_mm which I already
talked about.

Per my prior commentary, it can be patched up to only do the walk once
instead of 4 times, and without taking locks.

But that's still more work than nothing and let's say that's still too
slow. 2 ideas were proposed for how to avoid the walk altogether: I
proposed expanding the tlb bitmap and Mathieu went with the cid
machinery. Either way the walk over all CPUs goes away.

With the walk issue fixed and all allocations cached thanks to ctor/dtor,
even the single-threaded fork/exec will be faster than it is with your
patch thanks to *never* reaching into the per-cpu allocator (with your
patch it is still going to happen for the cid stuff).

Additionally there are other locks which can be elided later with the
ctor/dtor pair, further improving perf.

>
> > 2. unfortunately there is an increasing number of multi-threaded (and
> > often short lived) processes (example: lld, the linker form the llvm
> > project; more broadly plenty of things Rust where people think
> > threading == performance)
>
> I don't agree with this argument, though.  Sure, there is an increasing
> amount of multi-threaded applications, but this is not relevant.  The
> relevant argument is the amount of single-threaded workloads. One
> example are coreutils, which are spawned to death by scripts.  I did
> take the care of testing the patchset with a full distro on my
> day-to-day laptop and I wasn't surprised to see the vast majority of
> forked tasks never fork a second thread.  The ones that do are most
> often long-lived applications, where the cost of mm initialization is
> way less relevant to the overall system performance.  Another example is
> the fact real-world benchmarks, like kernbench, can be improved with
> special-casing single-threads.
>

I stress one more time that a full fixup for the situation as I
described above not only gets rid of the problem for *both* single-
and multi- threaded operation, but ends up with code which is faster
than your patchset even for the case you are patching for.

The multi-threaded stuff *is* very much relevant because it is
increasingly common (see below). I did not claim that
single-threaded workloads don't matter.

I would not be arguing here if there was no feasible way to handle
both or if handling the multi-threaded case still resulted in
measurable overhead for single-threaded workloads.

Since you mention configure scripts, I'm intimately familiar with
large-scale building as a workload. While it is true that there is
rampant usage of shell, sed and whatnot (all of which are
single-threaded), things turn multi-threaded (and short-lived) very
quickly once you go past the gnu toolchain and/or c/c++ codebases.

For example the llvm linker is multi-threaded and short-lived. Since
most real programs are small, during a large scale build of different
programs you end up with tons of lld spawning and quitting all the
time.

Beyond that java, erlang, zig and others like to multi-thread as well.

Rust is an emerging ecosystem where people think adding threading
equals automatically better performance and where crate authors think
it's fine to sneak in threads (my favourite offender is the ctrlc
crate). And since Rust is growing in popularity you can expect the
kind of single-threaded tooling you see right now will turn
multi-threaded from under you over time.

> > The pragmatic way forward (as I see it anyway) is to fix up the
> > multi-threaded thing and see if trying to special case for
> > single-threaded case is justifiable afterwards.
> >
> > Given that the current patchset has to resort to atomics in certain
> > cases, there is some error-pronnes and runtime overhead associated
> > with it going beyond merely checking if the process is
> > single-threaded, which puts an additional question mark on it.
>
> I don't get why atomics would make it error-prone.  But, regarding the
> runtime overhead, please note the main point of this approach is that
> the hot path can be handled with a simple non-atomic variable write in
> the task context, and not the atomic operation. The later is only used
> for infrequent case where the counter is touched by an external task
> such as OOM, khugepaged, etc.
>

The claim is there may be a bug where something should be using the
atomic codepath but is not.


^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: [RFC PATCH 0/4] Optimize rss_stat initialization/teardown for single-threaded tasks
  2025-12-03 11:02         ` Mateusz Guzik
@ 2025-12-03 11:54           ` Mateusz Guzik
  2025-12-03 14:36             ` Mateusz Guzik
  0 siblings, 1 reply; 19+ messages in thread
From: Mateusz Guzik @ 2025-12-03 11:54 UTC (permalink / raw)
  To: Gabriel Krisman Bertazi
  Cc: Jan Kara, Mathieu Desnoyers, linux-mm, linux-kernel,
	Shakeel Butt, Michal Hocko, Dennis Zhou, Tejun Heo,
	Christoph Lameter, Andrew Morton, David Hildenbrand,
	Lorenzo Stoakes, Liam R. Howlett, Vlastimil Babka, Mike Rapoport,
	Suren Baghdasaryan, Thomas Gleixner

On Wed, Dec 3, 2025 at 12:02 PM Mateusz Guzik <mjguzik@gmail.com> wrote:
>
> On Mon, Dec 1, 2025 at 4:23 PM Gabriel Krisman Bertazi <krisman@suse.de> wrote:
> >
> > Mateusz Guzik <mjguzik@gmail.com> writes:
> > > The major claims (by me anyway) are:
> > > 1. single-threaded operation for fork + exec suffers avoidable
> > > overhead even without the rss counter problem, which are tractable
> > > with the same kind of thing which would sort out the multi-threaded
> > > problem
> >
> > Agreed, there are more issues in the fork/exec path than just the
> > rss_stat.  The rss_stat performance is particularly relevant to us,
> > though, because it is a clear regression for single-threaded introduced
> > in 6.2.
> >
> > I took the time to test the slab constructor approach with the
> > /sbin/true microbenchmark.  I've seen only 2% gain on that tight loop in
> > the 80c machine, which, granted, is an artificial benchmark, but still a
> > good stressor of the single-threaded case.  With this patchset, I
> > reported 6% improvement, getting it close to the performance before the
> > pcpu rss_stats introduction. This is expected, as avoiding the pcpu
> > allocation and initialization all together for the single-threaded case,
> > where it is not necessary, will always be better than speeding up the
> > allocation (even though that a worthwhile effort itself, as Mathieu
> > pointed out)
>
> I'm fine with the benchmark method, but it was used on a kernel which
> remains gimped by the avoidably slow walk in check_mm which I already
> talked about.
>
> Per my prior commentary and can be patched up to only do the walk once
> instead of 4 times, and without taking locks.
>
> But that's still more work than nothing and let's say that's still too
> slow. 2 ideas were proposed how to avoid the walk altogether: I
> proposed expanding the tlb bitmap and Mathieu went with the cid
> machinery. Either way the walk over all CPUs is not there.
>

So I got another idea and it boils down to coalescing cid init with
rss checks on exit.

I repeat that with your patchset the single-threaded case is left with
one walk on alloc (for cid stuff) and that's where issues arise for
machines with tons of cpus.

If the walk gets fixed, the same method can be used to avoid the walk
for rss, obsoleting the patchset.

So let's say it is unfixable for the time being.

mm_init_cid stores a bunch of -1 per-cpu. I'm assuming this can't be changed.

One can still handle allocation in ctor/dtor and make it an invariant
that the state present is ready to use, so in particular mm_init_cid
was already issued on it.

Then it is on the exit side to clean it up and this is where the walk
checks rss state *and* reinits cid in one loop.

Excluding the repeated lock and irq trips which don't need to be there,
I take it almost all of the overhead is cache misses. With one loop
that's sorted out.

Maybe I'm going to hack it up, but perhaps Mathieu or Harry would be
happy to do it? (or have a better idea?)

> With the walk issue fixed and all allocations cached thanks ctor/dtor,
> even the single-threaded fork/exec will be faster than it is with your
> patch thanks to *never* reaching to the per-cpu allocator (with your
> patch it is still going to happen for the cid stuff).
>
> Additionally there are other locks which can be elided later with the
> ctor/dtor pair, further improving perf.
>
> >
> > > 2. unfortunately there is an increasing number of multi-threaded (and
> > > often short lived) processes (example: lld, the linker form the llvm
> > > project; more broadly plenty of things Rust where people think
> > > threading == performance)
> >
> > I don't agree with this argument, though.  Sure, there is an increasing
> > amount of multi-threaded applications, but this is not relevant.  The
> > relevant argument is the amount of single-threaded workloads. One
> > example are coreutils, which are spawned to death by scripts.  I did
> > take the care of testing the patchset with a full distro on my
> > day-to-day laptop and I wasn't surprised to see the vast majority of
> > forked tasks never fork a second thread.  The ones that do are most
> > often long-lived applications, where the cost of mm initialization is
> > way less relevant to the overall system performance.  Another example is
> > the fact real-world benchmarks, like kernbench, can be improved with
> > special-casing single-threads.
> >
>
> I stress one more time that a full fixup for the situation as I
> described above not only gets rid of the problem for *both* single-
> and multi- threaded operation, but ends up with code which is faster
> than your patchset even for the case you are patching for.
>
> The multi-threaded stuff *is* very much relevant because it is
> increasingly more common (see below). I did not claim that
> single-threaded workloads don't matter.
>
> I would not be arguing here if there was no feasible way to handle
> both or if handling the multi-threaded case still resulted in
> measurable overhead for single-threaded workloads.
>
> Since you mention configure scripts, I'm intimately familiar with
> large-scale building as a workload. While it is true that there is
> rampant usage of shell, sed and whatnot (all of which are
> single-threaded), things turn multi-threaded (and short-lived) very
> quickly once you go past the gnu toolchain and/or c/c++ codebases.
>
> For example the llvm linker is multi-threaded and short-lived. Since
> most real programs are small, during a large scale build of different
> programs you end up with tons of lld spawning and quitting all the
> time.
>
> Beyond that java, erlang, zig and others like to multi-thread as well.
>
> Rust is an emerging ecosystem where people think adding threading
> equals automatically better performance and where crate authors think
> it's fine to sneak in threads (my favourite offender is the ctrlc
> crate). And since Rust is growing in popularity you can expect the
> kind of single-threaded tooling you see right now will turn
> multi-threaded from under you over time.
>
> > > The pragmatic way forward (as I see it anyway) is to fix up the
> > > multi-threaded thing and see if trying to special case for
> > > single-threaded case is justifiable afterwards.
> > >
> > > Given that the current patchset has to resort to atomics in certain
> > > cases, there is some error-pronnes and runtime overhead associated
> > > with it going beyond merely checking if the process is
> > > single-threaded, which puts an additional question mark on it.
> >
> > I don't get why atomics would make it error-prone.  But, regarding the
> > runtime overhead, please note the main point of this approach is that
> > the hot path can be handled with a simple non-atomic variable write in
> > the task context, and not the atomic operation. The later is only used
> > for infrequent case where the counter is touched by an external task
> > such as OOM, khugepaged, etc.
> >
>
> The claim is there may be a bug where something should be using the
> atomic codepath but is not.


^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: [RFC PATCH 0/4] Optimize rss_stat initialization/teardown for single-threaded tasks
  2025-12-03 11:54           ` Mateusz Guzik
@ 2025-12-03 14:36             ` Mateusz Guzik
  0 siblings, 0 replies; 19+ messages in thread
From: Mateusz Guzik @ 2025-12-03 14:36 UTC (permalink / raw)
  To: Gabriel Krisman Bertazi
  Cc: Jan Kara, Mathieu Desnoyers, linux-mm, linux-kernel,
	Shakeel Butt, Michal Hocko, Dennis Zhou, Tejun Heo,
	Christoph Lameter, Andrew Morton, David Hildenbrand,
	Lorenzo Stoakes, Liam R. Howlett, Vlastimil Babka, Mike Rapoport,
	Suren Baghdasaryan, Thomas Gleixner

On Wed, Dec 03, 2025 at 12:54:34PM +0100, Mateusz Guzik wrote:
> So I got another idea and it boils down to coalescing cid init with
> rss checks on exit.
> 
 
So the short version is: I implemented a POC and I have the same performance
for single-threaded processes as your patchset when testing on Sapphire
Rapids in an 80-way vm.

Caveats:
- there is a performance bug with rep movsb on this cpu (see https://lore.kernel.org/all/mwwusvl7jllmck64xczeka42lglmsh7mlthuvmmqlmi5stp3na@raiwozh466wz/); I worked around it like so:
diff --git a/arch/x86/Makefile b/arch/x86/Makefile
index e20e25b8b16c..1b538f7bbd89 100644
--- a/arch/x86/Makefile
+++ b/arch/x86/Makefile
@@ -189,6 +189,29 @@ ifeq ($(CONFIG_STACKPROTECTOR),y)
     endif
 endif
 
+ifdef CONFIG_CC_IS_GCC
+#
+# Inline memcpy and memset handling policy for gcc.
+#
+# For ops of sizes known at compilation time it quickly resorts to issuing rep
+# movsq and stosq. On most uarchs rep-prefixed ops have a significant startup
+# latency and it is faster to issue regular stores (even if in loops) to handle
+# small buffers.
+#
+# This of course comes at an expense in terms of i-cache footprint. bloat-o-meter
+# reported 0.23% increase for enabling these.
+#
+# We inline up to 256 bytes, which in the best case issues few movs, in the
+# worst case creates a 4 * 8 store loop.
+#
+# The upper limit was chosen semi-arbitrarily -- uarchs wildly differ between a
+# threshold past which a rep-prefixed op becomes faster, 256 being the lowest
+# common denominator. Someone(tm) should revisit this from time to time.
+#
+KBUILD_CFLAGS += -mmemcpy-strategy=unrolled_loop:256:noalign,libcall:-1:noalign
+KBUILD_CFLAGS += -mmemset-strategy=unrolled_loop:256:noalign,libcall:-1:noalign
+endif
+
 #
 # If the function graph tracer is used with mcount instead of fentry,
 # '-maccumulate-outgoing-args' is needed to prevent a GCC bug

- the qemu version I'm saddled with does not pass FSRS to the guest, thus:
diff --git a/arch/x86/lib/memset_64.S b/arch/x86/lib/memset_64.S
index fb5a03cf5ab7..a692bb4cece4 100644
--- a/arch/x86/lib/memset_64.S
+++ b/arch/x86/lib/memset_64.S
@@ -30,7 +30,7 @@
  * which the compiler could/should do much better anyway.
  */
 SYM_TYPED_FUNC_START(__memset)
-       ALTERNATIVE "jmp memset_orig", "", X86_FEATURE_FSRS
+//     ALTERNATIVE "jmp memset_orig", "", X86_FEATURE_FSRS
 
        movq %rdi,%r9
        movb %sil,%al

Baseline commit (+ the 2 above hacks) is the following:
commit a8ec08bf32595ea4b109e3c7f679d4457d1c58c0
Merge: ed80cc758b78 48233291461b
Author: Vlastimil Babka <vbabka@suse.cz>
Date:   Tue Nov 25 14:38:41 2025 +0100

    Merge branch 'slab/for-6.19/mempool_alloc_bulk' into slab/for-next

This is what the ctor/dtor branch is rebased on. It is missing some of
the further changes to the cid machinery in upstream, but they don't
fundamentally mess with the core idea of the patch (pcpu memory is still
allocated on mm creation and it is being zeroed), so I did not bother
rebasing -- the end performance will be the same.

The benchmark is a static binary executing itself in a loop: http://apollo.backplane.com/DFlyMisc/doexec.c

$ cc -O2 -o static-doexec doexec.c
$ taskset --cpu-list 1 ./static-doexec 1

With ctor+dtor+unified walk I'm seeing a 2% improvement over the
baseline and the same performance as the lazy counter.

If nobody is willing to productize this, I'm going to do it.

non-production hack below for reference:
diff --git a/include/linux/mm_types.h b/include/linux/mm_types.h
index cb9c6b16c311..f952ec1f59d1 100644
--- a/include/linux/mm_types.h
+++ b/include/linux/mm_types.h
@@ -1439,7 +1439,7 @@ static inline cpumask_t *mm_cidmask(struct mm_struct *mm)
 	return (struct cpumask *)cid_bitmap;
 }
 
-static inline void mm_init_cid(struct mm_struct *mm, struct task_struct *p)
+static inline void mm_init_cid_percpu(struct mm_struct *mm, struct task_struct *p)
 {
 	int i;
 
@@ -1457,6 +1457,15 @@ static inline void mm_init_cid(struct mm_struct *mm, struct task_struct *p)
 	cpumask_clear(mm_cidmask(mm));
 }
 
+static inline void mm_init_cid(struct mm_struct *mm, struct task_struct *p)
+{
+	mm->nr_cpus_allowed = p->nr_cpus_allowed;
+	atomic_set(&mm->max_nr_cid, 0);
+	raw_spin_lock_init(&mm->cpus_allowed_lock);
+	cpumask_copy(mm_cpus_allowed(mm), &p->cpus_mask);
+	cpumask_clear(mm_cidmask(mm));
+}
+
 static inline int mm_alloc_cid_noprof(struct mm_struct *mm)
 {
 	mm->pcpu_cid = alloc_percpu_noprof(struct mm_cid);
diff --git a/kernel/fork.c b/kernel/fork.c
index a26319cddc3c..1575db9f0198 100644
--- a/kernel/fork.c
+++ b/kernel/fork.c
@@ -575,21 +575,46 @@ static inline int mm_alloc_id(struct mm_struct *mm) { return 0; }
 static inline void mm_free_id(struct mm_struct *mm) {}
 #endif /* CONFIG_MM_ID */
 
+/*
+ * pretend this is fully integrated into hotplug support
+ */
+__cacheline_aligned_in_smp DEFINE_SEQLOCK(cpu_hotplug_lock);
+
 static void check_mm(struct mm_struct *mm)
 {
-	int i;
+	long rss_stat[NR_MM_COUNTERS];
+	unsigned cpu_seq;
+	int i, cpu;
 
 	BUILD_BUG_ON_MSG(ARRAY_SIZE(resident_page_types) != NR_MM_COUNTERS,
 			 "Please make sure 'struct resident_page_types[]' is updated as well");
 
-	for (i = 0; i < NR_MM_COUNTERS; i++) {
-		long x = percpu_counter_sum(&mm->rss_stat[i]);
+	cpu_seq = read_seqbegin(&cpu_hotplug_lock);
+	local_irq_disable();
+	for (i = 0; i < NR_MM_COUNTERS; i++)
+		rss_stat[i] = mm->rss_stat[i].count;
+
+	for_each_possible_cpu(cpu) {
+		struct mm_cid *pcpu_cid = per_cpu_ptr(mm->pcpu_cid, cpu);
+
+		pcpu_cid->cid = MM_CID_UNSET;
+		pcpu_cid->recent_cid = MM_CID_UNSET;
+		pcpu_cid->time = 0;
 
-		if (unlikely(x)) {
+		for (i = 0; i < NR_MM_COUNTERS; i++)
+			rss_stat[i] += *per_cpu_ptr(mm->rss_stat[i].counters, cpu);
+	}
+	local_irq_enable();
+	if (read_seqretry(&cpu_hotplug_lock, cpu_seq))
+		BUG();
+
+	for (i = 0; i < NR_MM_COUNTERS; i++) {
+		if (unlikely(rss_stat[i])) {
 			pr_alert("BUG: Bad rss-counter state mm:%p type:%s val:%ld Comm:%s Pid:%d\n",
-				 mm, resident_page_types[i], x,
+				 mm, resident_page_types[i], rss_stat[i],
 				 current->comm,
 				 task_pid_nr(current));
+			/* XXXBUG: ZERO IT OUT */
 		}
 	}
 
@@ -2953,10 +2978,19 @@ static int sighand_ctor(void *data)
 static int mm_struct_ctor(void *object)
 {
 	struct mm_struct *mm = object;
+	int cpu;
 
 	if (mm_alloc_cid(mm))
 		return -ENOMEM;
 
+	for_each_possible_cpu(cpu) {
+		struct mm_cid *pcpu_cid = per_cpu_ptr(mm->pcpu_cid, cpu);
+
+		pcpu_cid->cid = MM_CID_UNSET;
+		pcpu_cid->recent_cid = MM_CID_UNSET;
+		pcpu_cid->time = 0;
+	}
+
 	if (percpu_counter_init_many(mm->rss_stat, 0, GFP_KERNEL,
 				     NR_MM_COUNTERS)) {
 		mm_destroy_cid(mm);
diff --git a/mm/percpu.c b/mm/percpu.c
index 7d036f42b5af..47e23ea90d7b 100644
--- a/mm/percpu.c
+++ b/mm/percpu.c
@@ -1693,7 +1693,7 @@ static void pcpu_memcg_free_hook(struct pcpu_chunk *chunk, int off, size_t size)
 
 	obj_cgroup_put(objcg);
 }
-bool pcpu_charge(void *ptr, size_t size, gfp_t gfp)
+bool pcpu_charge(void __percpu *ptr, size_t size, gfp_t gfp)
 {
 	struct obj_cgroup *objcg = NULL;
 	void *addr;
@@ -1710,7 +1710,7 @@ bool pcpu_charge(void *ptr, size_t size, gfp_t gfp)
 	return true;
 }
 
-void pcpu_uncharge(void *ptr, size_t size)
+void pcpu_uncharge(void __percpu *ptr, size_t size)
 {
 	void *addr;
 	struct pcpu_chunk *chunk;
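
[Editorial note: for anyone puzzled by the "pretend this is fully integrated
into hotplug support" comment above -- the hack only contains the read side of
that seqlock.  Roughly, the full pattern it assumes (writer taken around
hotplug transitions, reader retrying the per-cpu walk instead of calling
BUG()) would look like the following; example_hotplug_seq,
example_hotplug_transition and example_sum are made-up names for illustration:

#include <linux/seqlock.h>
#include <linux/cpumask.h>
#include <linux/percpu.h>

/* Assumed to be wired into the CPU hotplug path (not shown in the hack). */
static DEFINE_SEQLOCK(example_hotplug_seq);

static void example_hotplug_transition(void)
{
	write_seqlock(&example_hotplug_seq);
	/* ... CPU comes online / goes offline ... */
	write_sequnlock(&example_hotplug_seq);
}

/*
 * Sum a per-cpu counter, retrying the walk if a hotplug transition raced
 * with it (the hack above BUG()s on read_seqretry() instead of retrying).
 */
static long example_sum(long __percpu *counter)
{
	unsigned int seq;
	long sum;
	int cpu;

	do {
		seq = read_seqbegin(&example_hotplug_seq);
		sum = 0;
		for_each_possible_cpu(cpu)
			sum += *per_cpu_ptr(counter, cpu);
	} while (read_seqretry(&example_hotplug_seq, seq));

	return sum;
}
]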



Thread overview: 19+ messages
2025-11-27 23:36 [RFC PATCH 0/4] Optimize rss_stat initialization/teardown for single-threaded tasks Gabriel Krisman Bertazi
2025-11-27 23:36 ` [RFC PATCH 1/4] lib/percpu_counter: Split out a helper to insert into hotplug list Gabriel Krisman Bertazi
2025-11-27 23:36 ` [RFC PATCH 2/4] lib: Support lazy initialization of per-cpu counters Gabriel Krisman Bertazi
2025-11-27 23:36 ` [RFC PATCH 3/4] mm: Avoid percpu MM counters on single-threaded tasks Gabriel Krisman Bertazi
2025-11-27 23:36 ` [RFC PATCH 4/4] mm: Split a slow path for updating mm counters Gabriel Krisman Bertazi
2025-12-01 10:19   ` David Hildenbrand (Red Hat)
2025-11-28 13:30 ` [RFC PATCH 0/4] Optimize rss_stat initialization/teardown for single-threaded tasks Mathieu Desnoyers
2025-11-28 20:10   ` Jan Kara
2025-11-28 20:12     ` Mathieu Desnoyers
2025-11-29  5:57     ` Mateusz Guzik
2025-11-29  7:50       ` Mateusz Guzik
2025-12-01 10:38       ` Harry Yoo
2025-12-01 11:31         ` Mateusz Guzik
2025-12-01 14:47           ` Mathieu Desnoyers
2025-12-01 15:23       ` Gabriel Krisman Bertazi
2025-12-01 19:16         ` Harry Yoo
2025-12-03 11:02         ` Mateusz Guzik
2025-12-03 11:54           ` Mateusz Guzik
2025-12-03 14:36             ` Mateusz Guzik
