linux-mm.kvack.org archive mirror
* [RFC PATCH 0/3] mm: convert mm's rss stats into lazy_percpu_counter
@ 2024-04-12  9:24 Peng Zhang
  2024-04-12  9:24 ` [RFC PATCH 1/3] Lazy percpu counters Peng Zhang
                   ` (3 more replies)
  0 siblings, 4 replies; 6+ messages in thread
From: Peng Zhang @ 2024-04-12  9:24 UTC (permalink / raw)
  To: linux-mm, linux-kernel
  Cc: akpm, dennisszhou, shakeelb, jack, surenb, kent.overstreet,
	mhocko, vbabka, yuzhao, yu.ma, wangkefeng.wang, sunnanyong,
	zhangpeng362

From: ZhangPeng <zhangpeng362@huawei.com>

Since commit f1a7941243c1 ("mm: convert mm's rss stats into
percpu_counter"), the rss_stats have been converted into percpu_counter,
which changes the error margin from (nr_threads * 64) to approximately
(nr_cpus ^ 2). However, the new percpu allocation in mm_init() causes a
performance regression on fork/exec/shell. Even after commit
14ef95be6f55 ("kernel/fork: group allocation/free of per-cpu counters
for mm struct"), the performance of fork/exec/shell is still poor
compared to previous kernel versions.

To mitigate the performance regression, we use lazy_percpu_counter[1] to
delay the allocation of percpu memory for rss_stats. With lmbench, we
see a 3% ~ 6% performance improvement for fork_proc/exec_proc/shell_proc
after the conversion.

The test results are as follows:

             base           base+revert        base+lazy_percpu_counter

fork_proc    427.4ms        394.1ms  (7.8%)    413.9ms  (3.2%)
exec_proc    2205.1ms       2042.2ms (7.4%)    2072.0ms (6.0%)
shell_proc   3180.9ms       2963.7ms (6.8%)    3010.7ms (5.4%)

This solution has not been fully evaluated and tested. The main idea of
this RFC patch series is to get the community's opinion on this approach.

[1] https://lore.kernel.org/linux-iommu/20230501165450.15352-8-surenb@google.com/
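
For illustration, the intended usage on the mm side looks roughly like
the sketch below (simplified from patch 3; tracing, the read/sum helpers
and error handling are omitted):

	/* rss_stat starts in atomic mode, so mm_init() no longer needs
	 * a percpu allocation up front */
	struct mm_struct {
		...
		struct lazy_percpu_counter rss_stat[NR_MM_COUNTERS];
		...
	};

	static inline void add_mm_counter(struct mm_struct *mm, int member,
					  long value)
	{
		/* cheap atomic update until the counter gets hot enough
		 * to switch itself to percpu mode */
		lazy_percpu_counter_add(&mm->rss_stat[member], value);
	}

	/* in __mmdrop(): frees percpu memory only if it was ever allocated */
	lazy_percpu_counter_destroy_many(mm->rss_stat, NR_MM_COUNTERS);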

Kent Overstreet (1):
  Lazy percpu counters

ZhangPeng (2):
  lazy_percpu_counter: include struct percpu_counter in struct
    lazy_percpu_counter
  mm: convert mm's rss stats into lazy_percpu_counter

 include/linux/lazy-percpu-counter.h |  88 +++++++++++++++++++
 include/linux/mm.h                  |   8 +-
 include/linux/mm_types.h            |   4 +-
 include/trace/events/kmem.h         |   4 +-
 kernel/fork.c                       |  12 +--
 lib/Makefile                        |   2 +-
 lib/lazy-percpu-counter.c           | 131 ++++++++++++++++++++++++++++
 7 files changed, 232 insertions(+), 17 deletions(-)
 create mode 100644 include/linux/lazy-percpu-counter.h
 create mode 100644 lib/lazy-percpu-counter.c

-- 
2.25.1




* [RFC PATCH 1/3] Lazy percpu counters
  2024-04-12  9:24 [RFC PATCH 0/3] mm: convert mm's rss stats into lazy_percpu_counter Peng Zhang
@ 2024-04-12  9:24 ` Peng Zhang
  2024-04-12  9:24 ` [RFC PATCH 2/3] lazy_percpu_counter: include struct percpu_counter in struct lazy_percpu_counter Peng Zhang
                   ` (2 subsequent siblings)
  3 siblings, 0 replies; 6+ messages in thread
From: Peng Zhang @ 2024-04-12  9:24 UTC (permalink / raw)
  To: linux-mm, linux-kernel
  Cc: akpm, dennisszhou, shakeelb, jack, surenb, kent.overstreet,
	mhocko, vbabka, yuzhao, yu.ma, wangkefeng.wang, sunnanyong,
	zhangpeng362

From: Kent Overstreet <kent.overstreet@linux.dev>

This patch adds lib/lazy-percpu-counter.c, which implements counters
that start out as atomics, but lazily switch to percpu mode if the
update rate crosses some threshold (arbitrarily set at 256 per second).
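
To make the encoding concrete, an atomic-mode update and the value
decode look roughly like this (illustrative only; the real macros and
helpers are in the patch below):

	/*
	 * bits 63..56: mod counter, bumped on every atomic-mode update
	 * bits 55..1 : the counter value itself (signed, 55 bits)
	 * bit  0     : set if the word is really a pointer to percpu data
	 */
	atomic_i  = (i << COUNTER_IS_PCPU_BIT) & ~COUNTER_MOD_MASK;
	atomic_i |= 1ULL << COUNTER_MOD_BITS_START;	/* bump mod counter */

	/* decode the value, shifting out the flag bit with sign extension */
	val = ((s64)v << COUNTER_MOD_BITS) >> (COUNTER_MOD_BITS + COUNTER_IS_PCPU_BIT);

When the 8-bit mod counter wraps, 256 updates have happened since the
previous wrap; if that took less than a second, the counter switches
itself to percpu mode.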

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
Signed-off-by: Suren Baghdasaryan <surenb@google.com>
Signed-off-by: ZhangPeng <zhangpeng362@huawei.com>
---
 include/linux/lazy-percpu-counter.h | 82 +++++++++++++++++++++++++++++
 lib/Makefile                        |  2 +-
 lib/lazy-percpu-counter.c           | 82 +++++++++++++++++++++++++++++
 3 files changed, 165 insertions(+), 1 deletion(-)
 create mode 100644 include/linux/lazy-percpu-counter.h
 create mode 100644 lib/lazy-percpu-counter.c

diff --git a/include/linux/lazy-percpu-counter.h b/include/linux/lazy-percpu-counter.h
new file mode 100644
index 000000000000..281b8dd88cb2
--- /dev/null
+++ b/include/linux/lazy-percpu-counter.h
@@ -0,0 +1,82 @@
+/* SPDX-License-Identifier: GPL-2.0 */
+/*
+ * Lazy percpu counters:
+ * (C) 2022 Kent Overstreet
+ *
+ * Lazy percpu counters start out in atomic mode, then switch to percpu mode if
+ * the update rate crosses some threshold.
+ *
+ * This means we don't have to decide between low memory overhead atomic
+ * counters and higher performance percpu counters - we can have our cake and
+ * eat it, too!
+ *
+ * Internally we use an atomic64_t, where the low bit indicates whether we're in
+ * percpu mode, and the high 8 bits are a secondary counter that's incremented
+ * when the counter is modified - meaning 55 bits of precision are available for
+ * the counter itself.
+ */
+
+#ifndef _LINUX_LAZY_PERCPU_COUNTER_H
+#define _LINUX_LAZY_PERCPU_COUNTER_H
+
+#include <linux/atomic.h>
+#include <asm/percpu.h>
+
+struct lazy_percpu_counter {
+	atomic64_t			v;
+	unsigned long			last_wrap;
+};
+
+void lazy_percpu_counter_exit(struct lazy_percpu_counter *c);
+void lazy_percpu_counter_add_slowpath(struct lazy_percpu_counter *c, s64 i);
+
+/*
+ * We use the high bits of the atomic counter for a secondary counter, which is
+ * incremented every time the counter is touched. When the secondary counter
+ * wraps, we check the time the counter last wrapped, and if it was recent
+ * enough that means the update frequency has crossed our threshold and we
+ * switch to percpu mode:
+ */
+#define COUNTER_MOD_BITS		8
+#define COUNTER_MOD_MASK		~(~0ULL >> COUNTER_MOD_BITS)
+#define COUNTER_MOD_BITS_START		(64 - COUNTER_MOD_BITS)
+
+/*
+ * We use the low bit of the counter to indicate whether we're in atomic mode
+ * (low bit clear), or percpu mode (low bit set, counter is a pointer to actual
+ * percpu counters):
+ */
+#define COUNTER_IS_PCPU_BIT		1
+
+static inline u64 __percpu *lazy_percpu_counter_is_pcpu(u64 v)
+{
+	if (!(v & COUNTER_IS_PCPU_BIT))
+		return NULL;
+
+	v ^= COUNTER_IS_PCPU_BIT;
+	return (u64 __percpu *)(unsigned long)v;
+}
+
+/**
+ * lazy_percpu_counter_add: Add a value to a lazy_percpu_counter
+ *
+ * @c: counter to modify
+ * @i: value to add
+ */
+static inline void lazy_percpu_counter_add(struct lazy_percpu_counter *c, s64 i)
+{
+	u64 v = atomic64_read(&c->v);
+	u64 __percpu *pcpu_v = lazy_percpu_counter_is_pcpu(v);
+
+	if (likely(pcpu_v))
+		this_cpu_add(*pcpu_v, i);
+	else
+		lazy_percpu_counter_add_slowpath(c, i);
+}
+
+static inline void lazy_percpu_counter_sub(struct lazy_percpu_counter *c, s64 i)
+{
+	lazy_percpu_counter_add(c, -i);
+}
+
+#endif /* _LINUX_LAZY_PERCPU_COUNTER_H */
diff --git a/lib/Makefile b/lib/Makefile
index 2f4e17bfb299..7afa0c3e7cc7 100644
--- a/lib/Makefile
+++ b/lib/Makefile
@@ -46,7 +46,7 @@ obj-y += bcd.o sort.o parser.o debug_locks.o random32.o \
 	 bust_spinlocks.o kasprintf.o bitmap.o scatterlist.o \
 	 list_sort.o uuid.o iov_iter.o clz_ctz.o \
 	 bsearch.o find_bit.o llist.o lwq.o memweight.o kfifo.o \
-	 percpu-refcount.o rhashtable.o base64.o \
+	 percpu-refcount.o lazy-percpu-counter.o rhashtable.o base64.o \
 	 once.o refcount.o rcuref.o usercopy.o errseq.o bucket_locks.o \
 	 generic-radix-tree.o bitmap-str.o
 obj-$(CONFIG_STRING_KUNIT_TEST) += string_kunit.o
diff --git a/lib/lazy-percpu-counter.c b/lib/lazy-percpu-counter.c
new file mode 100644
index 000000000000..e1914207214d
--- /dev/null
+++ b/lib/lazy-percpu-counter.c
@@ -0,0 +1,82 @@
+// SPDX-License-Identifier: GPL-2.0-only
+
+#include <linux/atomic.h>
+#include <linux/gfp.h>
+#include <linux/jiffies.h>
+#include <linux/lazy-percpu-counter.h>
+#include <linux/percpu.h>
+
+static inline s64 lazy_percpu_counter_atomic_val(s64 v)
+{
+	/* Ensure output is sign extended properly: */
+	return (v << COUNTER_MOD_BITS) >>
+		(COUNTER_MOD_BITS + COUNTER_IS_PCPU_BIT);
+}
+
+static void lazy_percpu_counter_switch_to_pcpu(struct lazy_percpu_counter *c)
+{
+	u64 __percpu *pcpu_v = alloc_percpu_gfp(u64, GFP_ATOMIC|__GFP_NOWARN);
+	u64 old, new, v;
+
+	if (!pcpu_v)
+		return;
+
+	preempt_disable();
+	v = atomic64_read(&c->v);
+	do {
+		if (lazy_percpu_counter_is_pcpu(v)) {
+			free_percpu(pcpu_v);
+			return;
+		}
+
+		old = v;
+		new = (unsigned long)pcpu_v | 1;
+
+		*this_cpu_ptr(pcpu_v) = lazy_percpu_counter_atomic_val(v);
+	} while ((v = atomic64_cmpxchg(&c->v, old, new)) != old);
+	preempt_enable();
+}
+
+/**
+ * lazy_percpu_counter_exit: Free resources associated with a
+ * lazy_percpu_counter
+ *
+ * @c: counter to exit
+ */
+void lazy_percpu_counter_exit(struct lazy_percpu_counter *c)
+{
+	free_percpu(lazy_percpu_counter_is_pcpu(atomic64_read(&c->v)));
+}
+EXPORT_SYMBOL_GPL(lazy_percpu_counter_exit);
+
+void lazy_percpu_counter_add_slowpath(struct lazy_percpu_counter *c, s64 i)
+{
+	u64 atomic_i;
+	u64 old, v = atomic64_read(&c->v);
+	u64 __percpu *pcpu_v;
+
+	atomic_i  = i << COUNTER_IS_PCPU_BIT;
+	atomic_i &= ~COUNTER_MOD_MASK;
+	atomic_i |= 1ULL << COUNTER_MOD_BITS_START;
+
+	do {
+		pcpu_v = lazy_percpu_counter_is_pcpu(v);
+		if (pcpu_v) {
+			this_cpu_add(*pcpu_v, i);
+			return;
+		}
+
+		old = v;
+	} while ((v = atomic64_cmpxchg(&c->v, old, old + atomic_i)) != old);
+
+	if (unlikely(!(v & COUNTER_MOD_MASK))) {
+		unsigned long now = jiffies;
+
+		if (c->last_wrap &&
+		    unlikely(time_after(c->last_wrap + HZ, now)))
+			lazy_percpu_counter_switch_to_pcpu(c);
+		else
+			c->last_wrap = now;
+	}
+}
+EXPORT_SYMBOL(lazy_percpu_counter_add_slowpath);
-- 
2.25.1




* [RFC PATCH 2/3] lazy_percpu_counter: include struct percpu_counter in struct lazy_percpu_counter
  2024-04-12  9:24 [RFC PATCH 0/3] mm: convert mm's rss stats into lazy_percpu_counter Peng Zhang
  2024-04-12  9:24 ` [RFC PATCH 1/3] Lazy percpu counters Peng Zhang
@ 2024-04-12  9:24 ` Peng Zhang
  2024-04-12  9:24 ` [RFC PATCH 3/3] mm: convert mm's rss stats into lazy_percpu_counter Peng Zhang
  2024-04-12 13:53 ` [RFC PATCH 0/3] " Jan Kara
  3 siblings, 0 replies; 6+ messages in thread
From: Peng Zhang @ 2024-04-12  9:24 UTC (permalink / raw)
  To: linux-mm, linux-kernel
  Cc: akpm, dennisszhou, shakeelb, jack, surenb, kent.overstreet,
	mhocko, vbabka, yuzhao, yu.ma, wangkefeng.wang, sunnanyong,
	zhangpeng362

From: ZhangPeng <zhangpeng362@huawei.com>

Embed a struct percpu_counter (fbc) in struct lazy_percpu_counter and
convert the u64 __percpu pointers used by the lazy percpu counter
functions to struct percpu_counter pointers, in preparation for
converting mm's rss stats into lazy_percpu_counter.

Signed-off-by: ZhangPeng <zhangpeng362@huawei.com>
Signed-off-by: Kefeng Wang <wangkefeng.wang@huawei.com>
---
 include/linux/lazy-percpu-counter.h | 16 ++++--
 lib/lazy-percpu-counter.c           | 83 +++++++++++++++++++++++------
 2 files changed, 77 insertions(+), 22 deletions(-)

diff --git a/include/linux/lazy-percpu-counter.h b/include/linux/lazy-percpu-counter.h
index 281b8dd88cb2..03ff24f0128d 100644
--- a/include/linux/lazy-percpu-counter.h
+++ b/include/linux/lazy-percpu-counter.h
@@ -20,15 +20,21 @@
 #define _LINUX_LAZY_PERCPU_COUNTER_H
 
 #include <linux/atomic.h>
+#include <linux/percpu_counter.h>
 #include <asm/percpu.h>
 
 struct lazy_percpu_counter {
 	atomic64_t			v;
 	unsigned long			last_wrap;
+	struct percpu_counter		fbc;
 };
 
-void lazy_percpu_counter_exit(struct lazy_percpu_counter *c);
+void lazy_percpu_counter_destroy_many(struct lazy_percpu_counter *c,
+				      u32 nr_counters);
 void lazy_percpu_counter_add_slowpath(struct lazy_percpu_counter *c, s64 i);
+s64 lazy_percpu_counter_read_positive(struct lazy_percpu_counter *c);
+s64 lazy_percpu_counter_sum(struct lazy_percpu_counter *c);
+s64 lazy_percpu_counter_sum_positive(struct lazy_percpu_counter *c);
 
 /*
  * We use the high bits of the atomic counter for a secondary counter, which is
@@ -48,13 +54,13 @@ void lazy_percpu_counter_add_slowpath(struct lazy_percpu_counter *c, s64 i);
  */
 #define COUNTER_IS_PCPU_BIT		1
 
-static inline u64 __percpu *lazy_percpu_counter_is_pcpu(u64 v)
+static inline struct percpu_counter *lazy_percpu_counter_is_pcpu(u64 v)
 {
 	if (!(v & COUNTER_IS_PCPU_BIT))
 		return NULL;
 
 	v ^= COUNTER_IS_PCPU_BIT;
-	return (u64 __percpu *)(unsigned long)v;
+	return (struct percpu_counter *)(unsigned long)v;
 }
 
 /**
@@ -66,10 +72,10 @@ static inline u64 __percpu *lazy_percpu_counter_is_pcpu(u64 v)
 static inline void lazy_percpu_counter_add(struct lazy_percpu_counter *c, s64 i)
 {
 	u64 v = atomic64_read(&c->v);
-	u64 __percpu *pcpu_v = lazy_percpu_counter_is_pcpu(v);
+	struct percpu_counter *pcpu_v = lazy_percpu_counter_is_pcpu(v);
 
 	if (likely(pcpu_v))
-		this_cpu_add(*pcpu_v, i);
+		percpu_counter_add(pcpu_v, i);
 	else
 		lazy_percpu_counter_add_slowpath(c, i);
 }
diff --git a/lib/lazy-percpu-counter.c b/lib/lazy-percpu-counter.c
index e1914207214d..c360903cc02a 100644
--- a/lib/lazy-percpu-counter.c
+++ b/lib/lazy-percpu-counter.c
@@ -15,45 +15,94 @@ static inline s64 lazy_percpu_counter_atomic_val(s64 v)
 
 static void lazy_percpu_counter_switch_to_pcpu(struct lazy_percpu_counter *c)
 {
-	u64 __percpu *pcpu_v = alloc_percpu_gfp(u64, GFP_ATOMIC|__GFP_NOWARN);
 	u64 old, new, v;
+	unsigned long flags;
+	bool allocated = false;
 
-	if (!pcpu_v)
-		return;
-
+	local_irq_save(flags);
 	preempt_disable();
 	v = atomic64_read(&c->v);
 	do {
-		if (lazy_percpu_counter_is_pcpu(v)) {
-			free_percpu(pcpu_v);
-			return;
+		if (lazy_percpu_counter_is_pcpu(v))
+			break;
+
+		if (!allocated) {
+			if (percpu_counter_init(&c->fbc, 0, GFP_ATOMIC|__GFP_NOWARN))
+				break;
+			allocated = true;
 		}
 
 		old = v;
-		new = (unsigned long)pcpu_v | 1;
+		new = (unsigned long)&c->fbc | 1;
 
-		*this_cpu_ptr(pcpu_v) = lazy_percpu_counter_atomic_val(v);
+		percpu_counter_set(&c->fbc, lazy_percpu_counter_atomic_val(v));
 	} while ((v = atomic64_cmpxchg(&c->v, old, new)) != old);
 	preempt_enable();
+	local_irq_restore(flags);
 }
 
 /**
- * lazy_percpu_counter_exit: Free resources associated with a
- * lazy_percpu_counter
+ * lazy_percpu_counter_destroy_many: Free resources associated with
+ * lazy_percpu_counters
  *
- * @c: counter to exit
+ * @c: counters to exit
+ * @nr_counters: number of counters
  */
-void lazy_percpu_counter_exit(struct lazy_percpu_counter *c)
+void lazy_percpu_counter_destroy_many(struct lazy_percpu_counter *c,
+				      u32 nr_counters)
+{
+	struct percpu_counter *pcpu_v;
+	u32 i;
+
+	for (i = 0; i < nr_counters; i++) {
+		pcpu_v = lazy_percpu_counter_is_pcpu(atomic64_read(&c[i].v));
+		if (pcpu_v)
+			percpu_counter_destroy(pcpu_v);
+	}
+}
+EXPORT_SYMBOL_GPL(lazy_percpu_counter_destroy_many);
+
+s64 lazy_percpu_counter_read_positive(struct lazy_percpu_counter *c)
+{
+	s64 v = atomic64_read(&c->v);
+	struct percpu_counter *pcpu_v = lazy_percpu_counter_is_pcpu(v);
+
+	if (pcpu_v)
+		return percpu_counter_read_positive(pcpu_v);
+
+	return lazy_percpu_counter_atomic_val(v);
+}
+EXPORT_SYMBOL_GPL(lazy_percpu_counter_read_positive);
+
+s64 lazy_percpu_counter_sum(struct lazy_percpu_counter *c)
+{
+	s64 v = atomic64_read(&c->v);
+	struct percpu_counter *pcpu_v = lazy_percpu_counter_is_pcpu(v);
+
+	if (pcpu_v)
+		return percpu_counter_sum(pcpu_v);
+
+	return lazy_percpu_counter_atomic_val(v);
+}
+EXPORT_SYMBOL_GPL(lazy_percpu_counter_sum);
+
+s64 lazy_percpu_counter_sum_positive(struct lazy_percpu_counter *c)
 {
-	free_percpu(lazy_percpu_counter_is_pcpu(atomic64_read(&c->v)));
+	s64 v = atomic64_read(&c->v);
+	struct percpu_counter *pcpu_v = lazy_percpu_counter_is_pcpu(v);
+
+	if (pcpu_v)
+		return percpu_counter_sum_positive(pcpu_v);
+
+	return lazy_percpu_counter_atomic_val(v);
 }
-EXPORT_SYMBOL_GPL(lazy_percpu_counter_exit);
+EXPORT_SYMBOL_GPL(lazy_percpu_counter_sum_positive);
 
 void lazy_percpu_counter_add_slowpath(struct lazy_percpu_counter *c, s64 i)
 {
 	u64 atomic_i;
 	u64 old, v = atomic64_read(&c->v);
-	u64 __percpu *pcpu_v;
+	struct percpu_counter *pcpu_v;
 
 	atomic_i  = i << COUNTER_IS_PCPU_BIT;
 	atomic_i &= ~COUNTER_MOD_MASK;
@@ -62,7 +111,7 @@ void lazy_percpu_counter_add_slowpath(struct lazy_percpu_counter *c, s64 i)
 	do {
 		pcpu_v = lazy_percpu_counter_is_pcpu(v);
 		if (pcpu_v) {
-			this_cpu_add(*pcpu_v, i);
+			percpu_counter_add(pcpu_v, i);
 			return;
 		}
 
-- 
2.25.1




* [RFC PATCH 3/3] mm: convert mm's rss stats into lazy_percpu_counter
  2024-04-12  9:24 [RFC PATCH 0/3] mm: convert mm's rss stats into lazy_percpu_counter Peng Zhang
  2024-04-12  9:24 ` [RFC PATCH 1/3] Lazy percpu counters Peng Zhang
  2024-04-12  9:24 ` [RFC PATCH 2/3] lazy_percpu_counter: include struct percpu_counter in struct lazy_percpu_counter Peng Zhang
@ 2024-04-12  9:24 ` Peng Zhang
  2024-04-12 13:53 ` [RFC PATCH 0/3] " Jan Kara
  3 siblings, 0 replies; 6+ messages in thread
From: Peng Zhang @ 2024-04-12  9:24 UTC (permalink / raw)
  To: linux-mm, linux-kernel
  Cc: akpm, dennisszhou, shakeelb, jack, surenb, kent.overstreet,
	mhocko, vbabka, yuzhao, yu.ma, wangkefeng.wang, sunnanyong,
	zhangpeng362

From: ZhangPeng <zhangpeng362@huawei.com>

Since commit f1a7941243c1 ("mm: convert mm's rss stats into
percpu_counter"), the rss_stats have been converted into percpu_counter,
which changes the error margin from (nr_threads * 64) to approximately
(nr_cpus ^ 2). However, the new percpu allocation in mm_init() causes a
performance regression on fork/exec/shell. Even after commit 14ef95be6f55
("kernel/fork: group allocation/free of per-cpu counters for mm struct"),
the performance of fork/exec/shell is still poor compared to previous
kernel versions.

To mitigate the performance regression, we use lazy_percpu_counter to
delay the allocation of percpu memory for rss_stats. With lmbench, we see
a 3% ~ 6% performance improvement for fork_proc/exec_proc/shell_proc
after the conversion. The test results are as follows:

             base           base+revert        base+lazy_percpu_counter

fork_proc    427.4ms        394.1ms  (7.8%)    413.9ms  (3.2%)
exec_proc    2205.1ms       2042.2ms (7.4%)    2072.0ms (6.0%)
shell_proc   3180.9ms       2963.7ms (6.8%)    3010.7ms (5.4%)

Signed-off-by: ZhangPeng <zhangpeng362@huawei.com>
Signed-off-by: Kefeng Wang <wangkefeng.wang@huawei.com>
---
 include/linux/mm.h          |  8 ++++----
 include/linux/mm_types.h    |  4 ++--
 include/trace/events/kmem.h |  4 ++--
 kernel/fork.c               | 10 ++--------
 4 files changed, 10 insertions(+), 16 deletions(-)

diff --git a/include/linux/mm.h b/include/linux/mm.h
index 07c73451d42f..d1ea246b99c3 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -2631,28 +2631,28 @@ static inline bool get_user_page_fast_only(unsigned long addr,
  */
 static inline unsigned long get_mm_counter(struct mm_struct *mm, int member)
 {
-	return percpu_counter_read_positive(&mm->rss_stat[member]);
+	return lazy_percpu_counter_read_positive(&mm->rss_stat[member]);
 }
 
 void mm_trace_rss_stat(struct mm_struct *mm, int member);
 
 static inline void add_mm_counter(struct mm_struct *mm, int member, long value)
 {
-	percpu_counter_add(&mm->rss_stat[member], value);
+	lazy_percpu_counter_add(&mm->rss_stat[member], value);
 
 	mm_trace_rss_stat(mm, member);
 }
 
 static inline void inc_mm_counter(struct mm_struct *mm, int member)
 {
-	percpu_counter_inc(&mm->rss_stat[member]);
+	lazy_percpu_counter_add(&mm->rss_stat[member], 1);
 
 	mm_trace_rss_stat(mm, member);
 }
 
 static inline void dec_mm_counter(struct mm_struct *mm, int member)
 {
-	percpu_counter_dec(&mm->rss_stat[member]);
+	lazy_percpu_counter_sub(&mm->rss_stat[member], 1);
 
 	mm_trace_rss_stat(mm, member);
 }
diff --git a/include/linux/mm_types.h b/include/linux/mm_types.h
index c432add95913..bf44c3a6fc99 100644
--- a/include/linux/mm_types.h
+++ b/include/linux/mm_types.h
@@ -18,7 +18,7 @@
 #include <linux/page-flags-layout.h>
 #include <linux/workqueue.h>
 #include <linux/seqlock.h>
-#include <linux/percpu_counter.h>
+#include <linux/lazy-percpu-counter.h>
 
 #include <asm/mmu.h>
 
@@ -898,7 +898,7 @@ struct mm_struct {
 
 		unsigned long saved_auxv[AT_VECTOR_SIZE]; /* for /proc/PID/auxv */
 
-		struct percpu_counter rss_stat[NR_MM_COUNTERS];
+		struct lazy_percpu_counter rss_stat[NR_MM_COUNTERS];
 
 		struct linux_binfmt *binfmt;
 
diff --git a/include/trace/events/kmem.h b/include/trace/events/kmem.h
index 6e62cc64cd92..3a35d9a665b7 100644
--- a/include/trace/events/kmem.h
+++ b/include/trace/events/kmem.h
@@ -399,8 +399,8 @@ TRACE_EVENT(rss_stat,
 		__entry->mm_id = mm_ptr_to_hash(mm);
 		__entry->curr = !!(current->mm == mm);
 		__entry->member = member;
-		__entry->size = (percpu_counter_sum_positive(&mm->rss_stat[member])
-							    << PAGE_SHIFT);
+		__entry->size = (lazy_percpu_counter_sum_positive(&mm->rss_stat[member])
+							  << PAGE_SHIFT);
 	),
 
 	TP_printk("mm_id=%u curr=%d type=%s size=%ldB",
diff --git a/kernel/fork.c b/kernel/fork.c
index 99076dbe27d8..0a4efb436030 100644
--- a/kernel/fork.c
+++ b/kernel/fork.c
@@ -823,7 +823,7 @@ static void check_mm(struct mm_struct *mm)
 			 "Please make sure 'struct resident_page_types[]' is updated as well");
 
 	for (i = 0; i < NR_MM_COUNTERS; i++) {
-		long x = percpu_counter_sum(&mm->rss_stat[i]);
+		long x = lazy_percpu_counter_sum(&mm->rss_stat[i]);
 
 		if (unlikely(x))
 			pr_alert("BUG: Bad rss-counter state mm:%p type:%s val:%ld\n",
@@ -924,7 +924,7 @@ void __mmdrop(struct mm_struct *mm)
 	put_user_ns(mm->user_ns);
 	mm_pasid_drop(mm);
 	mm_destroy_cid(mm);
-	percpu_counter_destroy_many(mm->rss_stat, NR_MM_COUNTERS);
+	lazy_percpu_counter_destroy_many(mm->rss_stat, NR_MM_COUNTERS);
 
 	free_mm(mm);
 }
@@ -1301,16 +1303,10 @@ static struct mm_struct *mm_init(struct mm_struct *mm, struct task_struct *p,
 	if (mm_alloc_cid(mm))
 		goto fail_cid;
 
-	if (percpu_counter_init_many(mm->rss_stat, 0, GFP_KERNEL_ACCOUNT,
-				     NR_MM_COUNTERS))
-		goto fail_pcpu;
-
 	mm->user_ns = get_user_ns(user_ns);
 	lru_gen_init_mm(mm);
 	return mm;
 
-fail_pcpu:
-	mm_destroy_cid(mm);
 fail_cid:
 	destroy_context(mm);
 fail_nocontext:
-- 
2.25.1




* Re: [RFC PATCH 0/3] mm: convert mm's rss stats into lazy_percpu_counter
  2024-04-12  9:24 [RFC PATCH 0/3] mm: convert mm's rss stats into lazy_percpu_counter Peng Zhang
                   ` (2 preceding siblings ...)
  2024-04-12  9:24 ` [RFC PATCH 3/3] mm: convert mm's rss stats into lazy_percpu_counter Peng Zhang
@ 2024-04-12 13:53 ` Jan Kara
  2024-04-15 12:33   ` zhangpeng (AS)
  3 siblings, 1 reply; 6+ messages in thread
From: Jan Kara @ 2024-04-12 13:53 UTC (permalink / raw)
  To: Peng Zhang
  Cc: linux-mm, linux-kernel, akpm, dennisszhou, shakeelb, jack,
	surenb, kent.overstreet, mhocko, vbabka, yuzhao, yu.ma,
	wangkefeng.wang, sunnanyong

On Fri 12-04-24 17:24:38, Peng Zhang wrote:
> From: ZhangPeng <zhangpeng362@huawei.com>
> 
> Since commit f1a7941243c1 ("mm: convert mm's rss stats into
> percpu_counter"), the rss_stats have converted into percpu_counter,
> which convert the error margin from (nr_threads * 64) to approximately
> (nr_cpus ^ 2). However, the new percpu allocation in mm_init() causes a
> performance regression on fork/exec/shell. Even after commit
> 14ef95be6f55 ("kernel/fork: group allocation/free of per-cpu counters
> for mm struct"), the performance of fork/exec/shell is still poor
> compared to previous kernel versions.
> 
> To mitigate performance regression, we use lazy_percpu_counter[1] to
> delay the allocation of percpu memory for rss_stats. After lmbench test,
> we will get 3% ~ 6% performance improvement for lmbench
> fork_proc/exec_proc/shell_proc after conversion.
> 
> The test results are as follows:
> 
>              base           base+revert        base+lazy_percpu_counter
> 
> fork_proc    427.4ms        394.1ms  (7.8%)    413.9ms  (3.2%)
> exec_proc    2205.1ms       2042.2ms (7.4%)    2072.0ms (6.0%)
> shell_proc   3180.9ms       2963.7ms (6.8%)    3010.7ms (5.4%)
> 
> This solution has not been fully evaluated and tested. The main idea of
> this RFC patch series is to get the community's opinion on this approach.

Thanks! I like the idea and in fact I wanted to do something similar (just
never got to it). Thread [2] has a couple of good observations regarding
this problem. A couple of thoughts on your approach:

1) I think switching to a pcpu counter when the update rate exceeds 256
updates/s is not a great fit for RSS because the updates are going to be
frequent in some cases but usually they will all happen from one thread.
So I think it would make more sense to move the decision of switching to
pcpu mode from the counter itself into the callers and just switch on
clone() when the second thread gets created.
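
(As a purely hypothetical illustration of that direction - the helper
below is made up and does not exist in this series:)

	/* in copy_mm(): the mm is about to gain a second user, so
	 * upgrade its rss counters to percpu mode up front */
	if (clone_flags & CLONE_VM)
		lazy_percpu_counter_force_pcpu_many(oldmm->rss_stat,
						    NR_MM_COUNTERS);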

2) I thought that for RSS lazy percpu counters, we could directly use
struct percpu_counter and just make it that if 'counters' is NULL, the
counter is in atomic mode (count is used as atomic_long_t), if counters !=
NULL, we are in pcpu mode.
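
(Again just a rough sketch of that idea - the helper below is
hypothetical and glosses over the atomic-to-percpu transition:)

	/* 'counters' == NULL: atomic mode, fbc->count holds the value */
	static inline void rss_counter_add(struct percpu_counter *fbc, long amount)
	{
		if (!READ_ONCE(fbc->counters))
			atomic_long_add(amount, (atomic_long_t *)&fbc->count);
		else
			percpu_counter_add(fbc, amount);
	}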

3) In [2] Mateusz had a good observation that the old RSS counters actually
used atomic operations only in rare cases so even lazy pcpu counters are
going to have worse performance for singlethreaded processes than the old
code. We could *almost* get away with non-atomic updates to counter->count
if it was not for occasional RSS updates from unrelated tasks. So it might
be worth it to further optimize the counters as:

struct rss_counter_single {
	void *state;			/* To detect switching to pcpu mode */
	atomic_long_t counter_atomic;	/* Used for foreign updates */
	long counter;			/* Used by local updates */
}

struct rss_counter {
	union {
		struct rss_counter_single single;
		/* struct percpu_counter needs to be modified to have
		 * 'counters' first to avoid issues for different
		 * architectures or with CONFIG_HOTPLUG_CPU enabled */
		struct percpu_counter pcpu;
	}
}

But I'm not sure this complexity is worth it so I'd do it as a separate
patch with separate benchmarking if at all.

								Honza

[2] https://lore.kernel.org/all/ZOPSEJTzrow8YFix@snowbird/

> 
> [1] https://lore.kernel.org/linux-iommu/20230501165450.15352-8-surenb@google.com/
> 
> Kent Overstreet (1):
>   Lazy percpu counters
> 
> ZhangPeng (2):
>   lazy_percpu_counter: include struct percpu_counter in struct
>     lazy_percpu_counter
>   mm: convert mm's rss stats into lazy_percpu_counter
> 
>  include/linux/lazy-percpu-counter.h |  88 +++++++++++++++++++
>  include/linux/mm.h                  |   8 +-
>  include/linux/mm_types.h            |   4 +-
>  include/trace/events/kmem.h         |   4 +-
>  kernel/fork.c                       |  12 +--
>  lib/Makefile                        |   2 +-
>  lib/lazy-percpu-counter.c           | 131 ++++++++++++++++++++++++++++
>  7 files changed, 232 insertions(+), 17 deletions(-)
>  create mode 100644 include/linux/lazy-percpu-counter.h
>  create mode 100644 lib/lazy-percpu-counter.c
> 
> -- 
> 2.25.1
> 
-- 
Jan Kara <jack@suse.com>
SUSE Labs, CR



* Re: [RFC PATCH 0/3] mm: convert mm's rss stats into lazy_percpu_counter
  2024-04-12 13:53 ` [RFC PATCH 0/3] " Jan Kara
@ 2024-04-15 12:33   ` zhangpeng (AS)
  0 siblings, 0 replies; 6+ messages in thread
From: zhangpeng (AS) @ 2024-04-15 12:33 UTC (permalink / raw)
  To: Jan Kara
  Cc: linux-mm, linux-kernel, akpm, dennisszhou, shakeelb, surenb,
	kent.overstreet, mhocko, vbabka, yuzhao, yu.ma, wangkefeng.wang,
	sunnanyong

On 2024/4/12 21:53, Jan Kara wrote:

> On Fri 12-04-24 17:24:38, Peng Zhang wrote:
>> From: ZhangPeng <zhangpeng362@huawei.com>
>>
>> Since commit f1a7941243c1 ("mm: convert mm's rss stats into
>> percpu_counter"), the rss_stats have converted into percpu_counter,
>> which convert the error margin from (nr_threads * 64) to approximately
>> (nr_cpus ^ 2). However, the new percpu allocation in mm_init() causes a
>> performance regression on fork/exec/shell. Even after commit
>> 14ef95be6f55 ("kernel/fork: group allocation/free of per-cpu counters
>> for mm struct"), the performance of fork/exec/shell is still poor
>> compared to previous kernel versions.
>>
>> To mitigate performance regression, we use lazy_percpu_counter[1] to
>> delay the allocation of percpu memory for rss_stats. After lmbench test,
>> we will get 3% ~ 6% performance improvement for lmbench
>> fork_proc/exec_proc/shell_proc after conversion.
>>
>> The test results are as follows:
>>
>>               base           base+revert        base+lazy_percpu_counter
>>
>> fork_proc    427.4ms        394.1ms  (7.8%)    413.9ms  (3.2%)
>> exec_proc    2205.1ms       2042.2ms (7.4%)    2072.0ms (6.0%)
>> shell_proc   3180.9ms       2963.7ms (6.8%)    3010.7ms (5.4%)
>>
>> This solution has not been fully evaluated and tested. The main idea of
>> this RFC patch series is to get the community's opinion on this approach.
> Thanks! I like the idea and in fact I wanted to do something similar (just
> never got to it). Thread [2] has couple of good observations regarding this
> problem. Couple of thoughts regarding your approach:
>
> 1) I think switching to pcpu counter when update rate exceeds 256 updates/s
> is not a great fit for RSS because the updates are going to be frequent in
> some cases but usually they will all happen from one thread. So I think it
> would make more sense to move the decision of switching to pcpu mode from
> the counter itself into the callers and just switch on clone() when the
> second thread gets created.
>
> 2) I thought that for RSS lazy percpu counters, we could directly use
> struct percpu_counter and just make it that if 'counters' is NULL, the
> counter is in atomic mode (count is used as atomic_long_t), if counters !=
> NULL, we are in pcpu mode.

Thanks for your reply!
I agree with your thoughts; I'll implement them in the next version.

> 3) In [2] Mateusz had a good observation that the old RSS counters actually
> used atomic operations only in rare cases so even lazy pcpu counters are
> going to have worse performance for singlethreaded processes than the old
> code. We could *almost* get away with non-atomic updates to counter->count
> if it was not for occasional RSS updates from unrelated tasks. So it might
> be worth it to further optimize the counters as:
>
> struct rss_counter_single {
> 	void *state;			/* To detect switching to pcpu mode */
> 	atomic_long_t counter_atomic;	/* Used for foreign updates */
> 	long counter;			/* Used by local updates */
> }
>
> struct rss_counter {
> 	union {
> 		struct rss_counter_single single;
> 		/* struct percpu_counter needs to be modified to have
> 		 * 'counters' first to avoid issues for different
> 		 * architectures or with CONFIG_HOTPLUG_CPU enabled */
> 		struct percpu_counter pcpu;
> 	}
> }
>
> But I'm not sure this complexity is worth it so I'd do it as a separate
> patch with separate benchmarking if at all.
>
> 								Honza

Agreed. Single-threaded processes don't need atomic operations, and
this scenario needs to be thoroughly tested.
I'll try to implement it in another patch series after I finish the
basic approach.

> [2] https://lore.kernel.org/all/ZOPSEJTzrow8YFix@snowbird/
>
>> [1] https://lore.kernel.org/linux-iommu/20230501165450.15352-8-surenb@google.com/
>>
>> Kent Overstreet (1):
>>    Lazy percpu counters
>>
>> ZhangPeng (2):
>>    lazy_percpu_counter: include struct percpu_counter in struct
>>      lazy_percpu_counter
>>    mm: convert mm's rss stats into lazy_percpu_counter
>>
>>   include/linux/lazy-percpu-counter.h |  88 +++++++++++++++++++
>>   include/linux/mm.h                  |   8 +-
>>   include/linux/mm_types.h            |   4 +-
>>   include/trace/events/kmem.h         |   4 +-
>>   kernel/fork.c                       |  12 +--
>>   lib/Makefile                        |   2 +-
>>   lib/lazy-percpu-counter.c           | 131 ++++++++++++++++++++++++++++
>>   7 files changed, 232 insertions(+), 17 deletions(-)
>>   create mode 100644 include/linux/lazy-percpu-counter.h
>>   create mode 100644 lib/lazy-percpu-counter.c
>>
>> -- 
>> 2.25.1
>>
-- 
Best Regards,
Peng



