* [RFC PATCH v7 0/2] mm: Fix OOM killer inaccuracy on large many-core systems
@ 2025-10-31 14:42 Mathieu Desnoyers
  2025-10-31 14:42 ` [RFC PATCH v7 1/2] lib: Introduce hierarchical per-cpu counters Mathieu Desnoyers
  2025-10-31 14:42 ` [RFC PATCH v7 2/2] mm: Fix OOM killer inaccuracy on large many-core systems Mathieu Desnoyers
  0 siblings, 2 replies; 9+ messages in thread
From: Mathieu Desnoyers @ 2025-10-31 14:42 UTC
  To: Mateusz Guzik, Vlastimil Babka, Sweet Tea Dorminy
  Cc: linux-kernel, Mathieu Desnoyers, Andrew Morton, Paul E. McKenney,
	Steven Rostedt, Masami Hiramatsu, Dennis Zhou, Tejun Heo,
	Christoph Lameter, Martin Liu, David Rientjes, christian.koenig,
	Shakeel Butt, SeongJae Park, Michal Hocko, Johannes Weiner,
	Lorenzo Stoakes, Liam R . Howlett, Mike Rapoport,
	Suren Baghdasaryan, Christian Brauner, Wei Yang,
	David Hildenbrand, Miaohe Lin, Al Viro, linux-mm,
	linux-trace-kernel, Yu Zhao, Roman Gushchin, Matthew Wilcox,
	Baolin Wang, Aboorva Devarajan

Introduce hierarchical per-cpu counters and use them for per-mm RSS
tracking, which has become too inaccurate for OOM killer purposes on
large many-core systems.

The approach proposed here is to replace the flat per-cpu counters
currently used for RSS tracking with hierarchical per-cpu counters,
which bound the inaccuracy based on the system topology: the error
grows as O(N*log(N)) with the number of CPUs rather than O(N^2).
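
For reference, here is a minimal usage sketch of the proposed API
(names taken from patch 1 of this series; error handling and
surrounding context elided):

	struct percpu_counter_tree c;
	int ret, v;

	ret = percpu_counter_tree_init(&c, 32, GFP_KERNEL);	/* batch size must be a power of 2 */
	if (ret)
		return ret;
	percpu_counter_tree_add(&c, 1);			/* fast path: per-cpu add, carry up on batch overflow */
	v = percpu_counter_tree_approximate_sum(&c);	/* O(1) read, within ± inaccuracy */
	v = percpu_counter_tree_precise_sum(&c);	/* precise: sums all per-cpu counters */
	percpu_counter_tree_destroy(&c);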

Relevant delta since v6: Rebase to v6.18-rc3 and implement
get_mm_counter_sum() as percpu_counter_tree_precise_sum().

Testing and feedback are welcome!

Thanks,

Mathieu

Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: "Paul E. McKenney" <paulmck@kernel.org>
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: Masami Hiramatsu <mhiramat@kernel.org>
Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Cc: Dennis Zhou <dennis@kernel.org>
Cc: Tejun Heo <tj@kernel.org>
Cc: Christoph Lameter <cl@linux.com>
Cc: Martin Liu <liumartin@google.com>
Cc: David Rientjes <rientjes@google.com>
Cc: christian.koenig@amd.com
Cc: Shakeel Butt <shakeel.butt@linux.dev>
Cc: SeongJae Park <sj@kernel.org>
Cc: Michal Hocko <mhocko@suse.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Sweet Tea Dorminy <sweettea@google.com>
Cc: Lorenzo Stoakes <lorenzo.stoakes@oracle.com>
Cc: "Liam R . Howlett" <liam.howlett@oracle.com>
Cc: Mike Rapoport <rppt@kernel.org>
Cc: Suren Baghdasaryan <surenb@google.com>
Cc: Vlastimil Babka <vbabka@suse.cz>
Cc: Christian Brauner <brauner@kernel.org>
Cc: Wei Yang <richard.weiyang@gmail.com>
Cc: David Hildenbrand <david@redhat.com>
Cc: Miaohe Lin <linmiaohe@huawei.com>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: linux-mm@kvack.org
Cc: linux-trace-kernel@vger.kernel.org
Cc: Yu Zhao <yuzhao@google.com>
Cc: Roman Gushchin <roman.gushchin@linux.dev>
Cc: Mateusz Guzik <mjguzik@gmail.com>
Cc: Matthew Wilcox <willy@infradead.org>
Cc: Baolin Wang <baolin.wang@linux.alibaba.com>
Cc: Aboorva Devarajan <aboorvad@linux.ibm.com>

Mathieu Desnoyers (2):
  lib: Introduce hierarchical per-cpu counters
  mm: Fix OOM killer inaccuracy on large many-core systems

 include/linux/mm.h                  |  10 +-
 include/linux/mm_types.h            |   4 +-
 include/linux/percpu_counter_tree.h | 203 ++++++++++++++
 include/trace/events/kmem.h         |   2 +-
 kernel/fork.c                       |  32 ++-
 lib/Makefile                        |   1 +
 lib/percpu_counter_tree.c           | 394 ++++++++++++++++++++++++++++
 7 files changed, 627 insertions(+), 19 deletions(-)
 create mode 100644 include/linux/percpu_counter_tree.h
 create mode 100644 lib/percpu_counter_tree.c

-- 
2.39.5



* [RFC PATCH v7 1/2] lib: Introduce hierarchical per-cpu counters
  2025-10-31 14:42 [RFC PATCH v7 0/2] mm: Fix OOM killer inaccuracy on large many-core systems Mathieu Desnoyers
@ 2025-10-31 14:42 ` Mathieu Desnoyers
  2025-10-31 14:42 ` [RFC PATCH v7 2/2] mm: Fix OOM killer inaccuracy on large many-core systems Mathieu Desnoyers
  1 sibling, 0 replies; 9+ messages in thread
From: Mathieu Desnoyers @ 2025-10-31 14:42 UTC
  To: Mateusz Guzik, Vlastimil Babka, Sweet Tea Dorminy
  Cc: linux-kernel, Mathieu Desnoyers, Andrew Morton, Paul E. McKenney,
	Steven Rostedt, Masami Hiramatsu, Dennis Zhou, Tejun Heo,
	Christoph Lameter, Martin Liu, David Rientjes, christian.koenig,
	Shakeel Butt, SeongJae Park, Michal Hocko, Johannes Weiner,
	Lorenzo Stoakes, Liam R . Howlett, Mike Rapoport,
	Suren Baghdasaryan, Christian Brauner, Wei Yang,
	David Hildenbrand, Miaohe Lin, Al Viro, linux-mm,
	linux-trace-kernel, Yu Zhao, Roman Gushchin, Matthew Wilcox,
	Baolin Wang, Aboorva Devarajan

* Motivation

The purpose of this hierarchical split-counter scheme is to:

- Minimize contention when incrementing and decrementing counters,
- Provide fast access to a sum approximation,
- Provide a sum approximation with an acceptable accuracy level when
  scaling to many-core systems,
- Provide approximate and precise comparisons of two counters, and
  between a counter and a value.

It aims to fix per-mm RSS tracking, which has become too inaccurate
for OOM killer purposes on large many-core systems [1].

* Design

The hierarchical per-CPU counters propagate a sum approximation through
an N-way tree. When a counter reaches the batch size of its level, the
carry is propagated up the tree, which consists of log_N(nr_cpu_ids)
levels. In the binary-tree example below, the batch size of each level
is twice the batch size of the prior level.

Example propagation diagram with 8 cpus through a binary tree:

Level 0:  0    1    2    3    4    5    6    7
          |   /     |   /     |   /     |   /
          |  /      |  /      |  /      |  /
          | /       | /       | /       | /
Level 1:  0         1         2         3
          |       /           |       /
          |    /              |    /
          | /                 | /
Level 2:  0                   1
          |               /
          |         /
          |   /
Level 3:  0

For a binary tree, the maximum inaccuracy is bounded by:
   batch_size * log2(nr_cpus) * nr_cpus
which grows as O(n*log(n)) as the number of CPUs increases.

For an N-way tree, the maximum inaccuracy can be pre-calculated
based on the N-arity of each level and the batch size.
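
For example, with the 8-CPU binary tree above and a level-0 batch size
of 32, the bound works out to 32 * log2(8) * 8 = 32 * 3 * 8 = 768,
matching the ±768 global inaccuracy derived level by level in the
comments of lib/percpu_counter_tree.c below.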

Link: https://lore.kernel.org/lkml/20250331223516.7810-2-sweettea-kernel@dorminy.me/ # [1]
Signed-off-by: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: "Paul E. McKenney" <paulmck@kernel.org>
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: Masami Hiramatsu <mhiramat@kernel.org>
Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Cc: Dennis Zhou <dennis@kernel.org>
Cc: Tejun Heo <tj@kernel.org>
Cc: Christoph Lameter <cl@linux.com>
Cc: Martin Liu <liumartin@google.com>
Cc: David Rientjes <rientjes@google.com>
Cc: christian.koenig@amd.com
Cc: Shakeel Butt <shakeel.butt@linux.dev>
Cc: SeongJae Park <sj@kernel.org>
Cc: Michal Hocko <mhocko@suse.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Sweet Tea Dorminy <sweettea@google.com>
Cc: Lorenzo Stoakes <lorenzo.stoakes@oracle.com>
Cc: "Liam R . Howlett" <liam.howlett@oracle.com>
Cc: Mike Rapoport <rppt@kernel.org>
Cc: Suren Baghdasaryan <surenb@google.com>
Cc: Vlastimil Babka <vbabka@suse.cz>
Cc: Christian Brauner <brauner@kernel.org>
Cc: Wei Yang <richard.weiyang@gmail.com>
Cc: David Hildenbrand <david@redhat.com>
Cc: Miaohe Lin <linmiaohe@huawei.com>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: linux-mm@kvack.org
Cc: linux-trace-kernel@vger.kernel.org
Cc: Yu Zhao <yuzhao@google.com>
Cc: Roman Gushchin <roman.gushchin@linux.dev>
Cc: Mateusz Guzik <mjguzik@gmail.com>
Cc: Matthew Wilcox <willy@infradead.org>
Cc: Baolin Wang <baolin.wang@linux.alibaba.com>
Cc: Aboorva Devarajan <aboorvad@linux.ibm.com>
---
Changes since v5:
- Introduce percpu_counter_tree_approximate_sum_positive.
- Introduce !CONFIG_SMP static inlines for UP build.
- Remove percpu_counter_tree_set_bias from the public API and make it
  static.

Changes since v3:
- Add gfp flags to init function.

Changes since v2:
- Introduce N-way tree to reduce tree depth on larger systems.

Changes since v1:
- Remove percpu_counter_tree_precise_sum_unbiased from public header,
  make this function static,
- Introduce precise and approximate comparisons between two counters,
- Reorder the struct percpu_counter_tree fields,
- Introduce approx_sum field, which points to the approximate sum
  for the percpu_counter_tree_approximate_sum() fast path.
---
 include/linux/percpu_counter_tree.h | 203 ++++++++++++++
 lib/Makefile                        |   1 +
 lib/percpu_counter_tree.c           | 394 ++++++++++++++++++++++++++++
 3 files changed, 598 insertions(+)
 create mode 100644 include/linux/percpu_counter_tree.h
 create mode 100644 lib/percpu_counter_tree.c

diff --git a/include/linux/percpu_counter_tree.h b/include/linux/percpu_counter_tree.h
new file mode 100644
index 000000000000..8795e782680a
--- /dev/null
+++ b/include/linux/percpu_counter_tree.h
@@ -0,0 +1,203 @@
+/* SPDX-License-Identifier: GPL-2.0+ OR MIT */
+/* SPDX-FileCopyrightText: 2025 Mathieu Desnoyers <mathieu.desnoyers@efficios.com> */
+
+#ifndef _PERCPU_COUNTER_TREE_H
+#define _PERCPU_COUNTER_TREE_H
+
+#include <linux/cleanup.h>
+#include <linux/preempt.h>
+#include <linux/atomic.h>
+#include <linux/percpu.h>
+
+#ifdef CONFIG_SMP
+
+struct percpu_counter_tree_level_item {
+	atomic_t count;
+} ____cacheline_aligned_in_smp;
+
+struct percpu_counter_tree {
+	/* Fast-path fields. */
+	unsigned int __percpu *level0;
+	unsigned int level0_bit_mask;
+	union {
+		unsigned int *i;
+		atomic_t *a;
+	} approx_sum;
+	int bias;			/* bias for counter_set */
+
+	/* Slow-path fields. */
+	struct percpu_counter_tree_level_item *items;
+	unsigned int batch_size;
+	unsigned int inaccuracy;	/* approximation imprecise within ± inaccuracy */
+};
+
+int percpu_counter_tree_init(struct percpu_counter_tree *counter, unsigned int batch_size, gfp_t gfp_flags);
+void percpu_counter_tree_destroy(struct percpu_counter_tree *counter);
+void percpu_counter_tree_add_slowpath(struct percpu_counter_tree *counter, int inc);
+int percpu_counter_tree_precise_sum(struct percpu_counter_tree *counter);
+int percpu_counter_tree_approximate_compare(struct percpu_counter_tree *a, struct percpu_counter_tree *b);
+int percpu_counter_tree_approximate_compare_value(struct percpu_counter_tree *counter, int v);
+int percpu_counter_tree_precise_compare(struct percpu_counter_tree *a, struct percpu_counter_tree *b);
+int percpu_counter_tree_precise_compare_value(struct percpu_counter_tree *counter, int v);
+void percpu_counter_tree_set(struct percpu_counter_tree *counter, int v);
+unsigned int percpu_counter_tree_inaccuracy(struct percpu_counter_tree *counter);
+
+/* Fast paths */
+
+static inline
+int percpu_counter_tree_carry(int orig, int res, int inc, unsigned int bit_mask)
+{
+	if (inc < 0) {
+		inc = -(-inc & ~(bit_mask - 1));
+		/*
+		 * xor bit_mask: underflow.
+		 *
+		 * If inc has bit set, decrement an additional bit if
+		 * there is _no_ bit transition between orig and res.
+		 * Else, inc has bit cleared, decrement an additional
+		 * bit if there is a bit transition between orig and
+		 * res.
+		 */
+		if ((inc ^ orig ^ res) & bit_mask)
+			inc -= bit_mask;
+	} else {
+		inc &= ~(bit_mask - 1);
+		/*
+		 * xor bit_mask: overflow.
+		 *
+		 * If inc has bit set, increment an additional bit if
+		 * there is _no_ bit transition between orig and res.
+		 * Else, inc has bit cleared, increment an additional
+		 * bit if there is a bit transition between orig and
+		 * res.
+		 */
+		if ((inc ^ orig ^ res) & bit_mask)
+			inc += bit_mask;
+	}
+	return inc;
+}
+
+static inline
+void percpu_counter_tree_add(struct percpu_counter_tree *counter, int inc)
+{
+	unsigned int bit_mask = counter->level0_bit_mask, orig, res;
+
+	if (!inc)
+		return;
+	/* Make sure the fast and slow paths use the same cpu number. */
+	guard(migrate)();
+	res = this_cpu_add_return(*counter->level0, inc);
+	orig = res - inc;
+	inc = percpu_counter_tree_carry(orig, res, inc, bit_mask);
+	if (!inc)
+		return;
+	percpu_counter_tree_add_slowpath(counter, inc);
+}
+
+static inline
+int percpu_counter_tree_approximate_sum(struct percpu_counter_tree *counter)
+{
+	unsigned int v;
+
+	if (!counter->level0_bit_mask)
+		v = READ_ONCE(*counter->approx_sum.i);
+	else
+		v = atomic_read(counter->approx_sum.a);
+	return (int) (v + (unsigned int)READ_ONCE(counter->bias));
+}
+
+#else	/* !CONFIG_SMP */
+
+struct percpu_counter_tree {
+	atomic_t count;
+};
+
+static inline
+int percpu_counter_tree_init(struct percpu_counter_tree *counter, unsigned int batch_size, gfp_t gfp_flags)
+{
+	atomic_set(&counter->count, 0);
+	return 0;
+}
+
+static inline
+void percpu_counter_tree_destroy(struct percpu_counter_tree *counter)
+{
+}
+
+static inline
+int percpu_counter_tree_precise_sum(struct percpu_counter_tree *counter)
+{
+	return atomic_read(&counter->count);
+}
+
+static inline
+int percpu_counter_tree_precise_compare(struct percpu_counter_tree *a, struct percpu_counter_tree *b)
+{
+	int count_a = percpu_counter_tree_precise_sum(a),
+	    count_b = percpu_counter_tree_precise_sum(b);
+
+	if (count_a == count_b)
+		return 0;
+	if (count_a < count_b)
+		return -1;
+	return 1;
+}
+
+static inline
+int percpu_counter_tree_precise_compare_value(struct percpu_counter_tree *counter, int v)
+{
+	int count = percpu_counter_tree_precise_sum(counter);
+
+	if (count == v)
+		return 0;
+	if (count < v)
+		return -1;
+	return 1;
+}
+
+static inline
+int percpu_counter_tree_approximate_compare(struct percpu_counter_tree *a, struct percpu_counter_tree *b)
+{
+	return percpu_counter_tree_precise_compare(a, b);
+}
+
+static inline
+int percpu_counter_tree_approximate_compare_value(struct percpu_counter_tree *counter, int v)
+{
+	return percpu_counter_tree_precise_compare_value(counter, v);
+}
+
+static inline
+void percpu_counter_tree_set(struct percpu_counter_tree *counter, int v)
+{
+	atomic_set(&counter->count, v);
+}
+
+static inline
+unsigned int percpu_counter_tree_inaccuracy(struct percpu_counter_tree *counter)
+{
+	return 0;
+}
+
+static inline
+void percpu_counter_tree_add(struct percpu_counter_tree *counter, int inc)
+{
+	atomic_add(inc, &counter->count);
+}
+
+static inline
+int percpu_counter_tree_approximate_sum(struct percpu_counter_tree *counter)
+{
+	return percpu_counter_tree_precise_sum(counter);
+}
+
+#endif	/* CONFIG_SMP */
+
+static inline
+int percpu_counter_tree_approximate_sum_positive(struct percpu_counter_tree *counter)
+{
+	int v = percpu_counter_tree_approximate_sum(counter);
+	return v > 0 ? v : 0;
+}
+
+#endif  /* _PERCPU_COUNTER_TREE_H */
diff --git a/lib/Makefile b/lib/Makefile
index 1ab2c4be3b66..767dc178a55c 100644
--- a/lib/Makefile
+++ b/lib/Makefile
@@ -179,6 +179,7 @@ obj-$(CONFIG_TEXTSEARCH_KMP) += ts_kmp.o
 obj-$(CONFIG_TEXTSEARCH_BM) += ts_bm.o
 obj-$(CONFIG_TEXTSEARCH_FSM) += ts_fsm.o
 obj-$(CONFIG_SMP) += percpu_counter.o
+obj-$(CONFIG_SMP) += percpu_counter_tree.o
 obj-$(CONFIG_AUDIT_GENERIC) += audit.o
 obj-$(CONFIG_AUDIT_COMPAT_GENERIC) += compat_audit.o
 
diff --git a/lib/percpu_counter_tree.c b/lib/percpu_counter_tree.c
new file mode 100644
index 000000000000..9577d94251d1
--- /dev/null
+++ b/lib/percpu_counter_tree.c
@@ -0,0 +1,394 @@
+// SPDX-License-Identifier: GPL-2.0+ OR MIT
+// SPDX-FileCopyrightText: 2025 Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
+
+/*
+ * Split Counters With Tree Approximation Propagation
+ *
+ * * Propagation diagram when reaching batch size thresholds (± batch size):
+ *
+ * Example diagram for 8 CPUs:
+ *
+ * log2(8) = 3 levels
+ *
+ * At each level, each pair propagates its values to the next level when
+ * reaching the batch size thresholds.
+ *
+ * Counters at levels 0, 1, 2 can be kept on a single byte (±128 range),
+ * although it may be relevant to keep them on 32-bit counters for
+ * simplicity. (complexity vs memory footprint tradeoff)
+ *
+ * Counter at level 3 can be kept on a 32-bit counter.
+ *
+ * Level 0:  0    1    2    3    4    5    6    7
+ *           |   /     |   /     |   /     |   /
+ *           |  /      |  /      |  /      |  /
+ *           | /       | /       | /       | /
+ * Level 1:  0         1         2         3
+ *           |       /           |       /
+ *           |    /              |    /
+ *           | /                 | /
+ * Level 2:  0                   1
+ *           |               /
+ *           |         /
+ *           |   /
+ * Level 3:  0
+ *
+ * * Approximation inaccuracy:
+ *
+ * BATCH(level N): Level N batch size.
+ *
+ * Example for BATCH(level 0) = 32.
+ *
+ * BATCH(level 0) =  32
+ * BATCH(level 1) =  64
+ * BATCH(level 2) = 128
+ * BATCH(level N) = BATCH(level 0) * 2^N
+ *
+ *            per-counter     global
+ *            inaccuracy      inaccuracy
+ * Level 0:   [ -32 ..  +31]  ±256  (8 * 32)
+ * Level 1:   [ -64 ..  +63]  ±256  (4 * 64)
+ * Level 2:   [-128 .. +127]  ±256  (2 * 128)
+ * Total:      ------         ±768  (log2(nr_cpu_ids) * BATCH(level 0) * nr_cpu_ids)
+ *
+ * -----
+ *
+ * Approximate Sum Carry Propagation
+ *
+ * Let's define a number of counter bits for each level, e.g.:
+ *
+ * log2(BATCH(level 0)) = log2(32) = 5
+ *
+ *               nr_bit        value_mask                      range
+ * Level 0:      5 bits        v                             0 ..  +31
+ * Level 1:      1 bit        (v & ~((1UL << 5) - 1))        0 ..  +63
+ * Level 2:      1 bit        (v & ~((1UL << 6) - 1))        0 .. +127
+ * Level 3:     25 bits       (v & ~((1UL << 7) - 1))        0 .. 2^32-1
+ *
+ * Note: Use a full 32-bit per-cpu counter at level 0 to allow precise sum.
+ *
+ * Note: Use cacheline aligned counters at levels above 0 to prevent false sharing.
+ *       If memory footprint is an issue, a specialized allocator could be used
+ *       to eliminate padding.
+ *
+ * Example with expanded values:
+ *
+ * counter_add(counter, inc):
+ *
+ *         if (!inc)
+ *                 return;
+ *
+ *         res = percpu_add_return(counter @ Level 0, inc);
+ *         orig = res - inc;
+ *         if (inc < 0) {
+ *                 inc = -(-inc & ~0b00011111);  // Clear used bits
+ *                 // xor bit 5: underflow
+ *                 if ((inc ^ orig ^ res) & 0b00100000)
+ *                         inc -= 0b00100000;
+ *         } else {
+ *                 inc &= ~0b00011111;           // Clear used bits
+ *                 // xor bit 5: overflow
+ *                 if ((inc ^ orig ^ res) & 0b00100000)
+ *                         inc += 0b00100000;
+ *         }
+ *         if (!inc)
+ *                 return;
+ *
+ *         res = atomic_add_return(counter @ Level 1, inc);
+ *         orig = res - inc;
+ *         if (inc < 0) {
+ *                 inc = -(-inc & ~0b00111111);  // Clear used bits
+ *                 // xor bit 6: underflow
+ *                 if ((inc ^ orig ^ res) & 0b01000000)
+ *                         inc -= 0b01000000;
+ *         } else {
+ *                 inc &= ~0b00111111;           // Clear used bits
+ *                 // xor bit 6: overflow
+ *                 if ((inc ^ orig ^ res) & 0b01000000)
+ *                         inc += 0b01000000;
+ *         }
+ *         if (!inc)
+ *                 return;
+ *
+ *         res = atomic_add_return(counter @ Level 2, inc);
+ *         orig = res - inc;
+ *         if (inc < 0) {
+ *                 inc = -(-inc & ~0b01111111);  // Clear used bits
+ *                 // xor bit 7: underflow
+ *                 if ((inc ^ orig ^ res) & 0b10000000)
+ *                         inc -= 0b10000000;
+ *         } else {
+ *                 inc &= ~0b01111111;           // Clear used bits
+ *                 // xor bit 7: overflow
+ *                 if ((inc ^ orig ^ res) & 0b10000000)
+ *                         inc += 0b10000000;
+ *         }
+ *         if (!inc)
+ *                 return;
+ *
+ *         atomic_add(counter @ Level 3, inc);
+ */
+
+#include <linux/percpu_counter_tree.h>
+#include <linux/cpumask.h>
+#include <linux/percpu.h>
+#include <linux/atomic.h>
+#include <linux/module.h>
+#include <linux/errno.h>
+#include <linux/slab.h>
+#include <linux/math.h>
+
+#define MAX_NR_LEVELS 5
+
+struct counter_config {
+	unsigned int nr_items;
+	unsigned char nr_levels;
+	unsigned char n_arity_order[MAX_NR_LEVELS];
+};
+
+/*
+ * nr_items is the number of items in the tree from level 1 up to and
+ * including the final level (approximate sum). It excludes the level 0
+ * per-cpu counters.
+ */
+static const struct counter_config per_nr_cpu_order_config[] = {
+	[0] =	{ .nr_items = 1,	.nr_levels = 0,		.n_arity_order = { 0 } },
+	[1] =	{ .nr_items = 3,	.nr_levels = 1,		.n_arity_order = { 1 } },
+	[2] =	{ .nr_items = 3,	.nr_levels = 2,		.n_arity_order = { 1, 1 } },
+	[3] =	{ .nr_items = 7,	.nr_levels = 3,		.n_arity_order = { 1, 1, 1 } },
+	[4] =	{ .nr_items = 7,	.nr_levels = 3,		.n_arity_order = { 2, 1, 1 } },
+	[5] =	{ .nr_items = 11,	.nr_levels = 3,		.n_arity_order = { 2, 2, 1 } },
+	[6] =	{ .nr_items = 21,	.nr_levels = 3,		.n_arity_order = { 2, 2, 2 } },
+	[7] =	{ .nr_items = 21,	.nr_levels = 3,		.n_arity_order = { 3, 2, 2 } },
+	[8] =	{ .nr_items = 37,	.nr_levels = 3,		.n_arity_order = { 3, 3, 2 } },
+	[9] =	{ .nr_items = 73,	.nr_levels = 3,		.n_arity_order = { 3, 3, 3 } },
+	[10] =	{ .nr_items = 149,	.nr_levels = 4,		.n_arity_order = { 3, 3, 2, 2 } },
+	[11] =	{ .nr_items = 293,	.nr_levels = 4,		.n_arity_order = { 3, 3, 3, 2 } },
+	[12] =	{ .nr_items = 585,	.nr_levels = 4,		.n_arity_order = { 3, 3, 3, 3 } },
+	[13] =	{ .nr_items = 1173,	.nr_levels = 5,		.n_arity_order = { 3, 3, 3, 2, 2 } },
+	[14] =	{ .nr_items = 2341,	.nr_levels = 5,		.n_arity_order = { 3, 3, 3, 3, 2 } },
+	[15] =	{ .nr_items = 4681,	.nr_levels = 5,		.n_arity_order = { 3, 3, 3, 3, 3 } },
+	[16] =	{ .nr_items = 4681,	.nr_levels = 5,		.n_arity_order = { 4, 3, 3, 3, 3 } },
+	[17] =	{ .nr_items = 8777,	.nr_levels = 5,		.n_arity_order = { 4, 4, 3, 3, 3 } },
+	[18] =	{ .nr_items = 17481,	.nr_levels = 5,		.n_arity_order = { 4, 4, 4, 3, 3 } },
+	[19] =	{ .nr_items = 34953,	.nr_levels = 5,		.n_arity_order = { 4, 4, 4, 4, 3 } },
+	[20] =	{ .nr_items = 69905,	.nr_levels = 5,		.n_arity_order = { 4, 4, 4, 4, 4 } },
+};
+
+static const struct counter_config *counter_config;
+static unsigned int nr_cpus_order, inaccuracy_multiplier;
+
+int percpu_counter_tree_init(struct percpu_counter_tree *counter, unsigned int batch_size, gfp_t gfp_flags)
+{
+	/* Batch size must be power of 2 */
+	if (!batch_size || (batch_size & (batch_size - 1)))
+		return -EINVAL;
+	counter->batch_size = batch_size;
+	counter->bias = 0;
+	counter->level0 = alloc_percpu_gfp(unsigned int, gfp_flags);
+	if (!counter->level0)
+		return -ENOMEM;
+	if (!nr_cpus_order) {
+		counter->items = NULL;
+		counter->approx_sum.i = per_cpu_ptr(counter->level0, 0);
+		counter->level0_bit_mask = 0;
+	} else {
+		counter->items = kzalloc(counter_config->nr_items *
+					 sizeof(struct percpu_counter_tree_level_item),
+					 gfp_flags);
+		if (!counter->items) {
+			free_percpu(counter->level0);
+			return -ENOMEM;
+		}
+		counter->approx_sum.a = &counter->items[counter_config->nr_items - 1].count;
+		counter->level0_bit_mask = 1UL << get_count_order(batch_size);
+	}
+	counter->inaccuracy = batch_size * inaccuracy_multiplier;
+	return 0;
+}
+
+void percpu_counter_tree_destroy(struct percpu_counter_tree *counter)
+{
+	free_percpu(counter->level0);
+	kfree(counter->items);
+}
+
+/* Called with migration disabled. */
+void percpu_counter_tree_add_slowpath(struct percpu_counter_tree *counter, int inc)
+{
+	unsigned int level_items, nr_levels = counter_config->nr_levels,
+		     level, n_arity_order, bit_mask;
+	struct percpu_counter_tree_level_item *item = counter->items;
+	unsigned int cpu = smp_processor_id();
+
+	WARN_ON_ONCE(!nr_cpus_order);	/* Should never be called for 1 cpu. */
+
+	n_arity_order = counter_config->n_arity_order[0];
+	bit_mask = counter->level0_bit_mask << n_arity_order;
+	level_items = 1U << (nr_cpus_order - n_arity_order);
+
+	for (level = 1; level < nr_levels; level++) {
+		atomic_t *count = &item[cpu & (level_items - 1)].count;
+		unsigned int orig, res;
+
+		res = atomic_add_return_relaxed(inc, count);
+		orig = res - inc;
+		inc = percpu_counter_tree_carry(orig, res, inc, bit_mask);
+		if (!inc)
+			return;
+		item += level_items;
+		n_arity_order = counter_config->n_arity_order[level];
+		level_items >>= n_arity_order;
+		bit_mask <<= n_arity_order;
+	}
+	atomic_add(inc, counter->approx_sum.a);
+}
+
+/*
+ * Precise sum. Perform the sum of all per-cpu counters.
+ */
+static int percpu_counter_tree_precise_sum_unbiased(struct percpu_counter_tree *counter)
+{
+	unsigned int sum = 0;
+	int cpu;
+
+	for_each_possible_cpu(cpu)
+		sum += *per_cpu_ptr(counter->level0, cpu);
+	return (int) sum;
+}
+
+int percpu_counter_tree_precise_sum(struct percpu_counter_tree *counter)
+{
+	return percpu_counter_tree_precise_sum_unbiased(counter) + READ_ONCE(counter->bias);
+}
+
+/*
+ * Do an approximate comparison of two counters.
+ * Return 0 if counters do not differ by more than the sum of their
+ * respective inaccuracy ranges,
+ * Return -1 if counter @a is less than counter @b,
+ * Return 1 if counter @a is greater than counter @b.
+ */
+int percpu_counter_tree_approximate_compare(struct percpu_counter_tree *a, struct percpu_counter_tree *b)
+{
+	int count_a = percpu_counter_tree_approximate_sum(a),
+	    count_b = percpu_counter_tree_approximate_sum(b);
+
+	if (abs(count_a - count_b) <= (a->inaccuracy + b->inaccuracy))
+		return 0;
+	if (count_a < count_b)
+		return -1;
+	return 1;
+}
+
+/*
+ * Do an approximate comparison of a counter against a given value.
+ * Return 0 if the value is within the inaccuracy range of the counter,
+ * Return -1 if the counter is less than the value,
+ * Return 1 if the counter is greater than the value.
+ */
+int percpu_counter_tree_approximate_compare_value(struct percpu_counter_tree *counter, int v)
+{
+	int count = percpu_counter_tree_approximate_sum(counter);
+
+	if (abs(v - count) <= counter->inaccuracy)
+		return 0;
+	if (count < v)
+		return -1;
+	return 1;
+}
+
+/*
+ * Do a precise comparison of two counters.
+ * Return 0 if the counters are equal,
+ * Return -1 if counter @a is less than counter @b,
+ * Return 1 if counter @a is greater than counter @b.
+ */
+int percpu_counter_tree_precise_compare(struct percpu_counter_tree *a, struct percpu_counter_tree *b)
+{
+	int count_a = percpu_counter_tree_approximate_sum(a),
+	    count_b = percpu_counter_tree_approximate_sum(b);
+
+	if (abs(count_a - count_b) <= (a->inaccuracy + b->inaccuracy)) {
+		if (b->inaccuracy < a->inaccuracy) {
+			count_a = percpu_counter_tree_precise_sum(a);
+			if (abs(count_a - count_b) <= b->inaccuracy)
+				count_b = percpu_counter_tree_precise_sum(b);
+		} else {
+			count_b = percpu_counter_tree_precise_sum(b);
+			if (abs(count_a - count_b) <= a->inaccuracy)
+				count_a = percpu_counter_tree_precise_sum(a);
+		}
+	}
+	if (count_a < count_b)
+		return -1;
+	if (count_a > count_b)
+		return 1;
+	return 0;
+}
+
+/*
+ * Do a precise comparison of a counter against a given value.
+ * Return 0 if the value is equal to the counter,
+ * Return -1 if the counter is less than the value,
+ * Return 1 if the counter is greater than the value.
+ */
+int percpu_counter_tree_precise_compare_value(struct percpu_counter_tree *counter, int v)
+{
+	int count = percpu_counter_tree_approximate_sum(counter);
+
+	if (abs(v - count) <= counter->inaccuracy)
+		count = percpu_counter_tree_precise_sum(counter);
+	if (count < v)
+		return -1;
+	if (count > v)
+		return 1;
+	return 0;
+}
+
+static
+void percpu_counter_tree_set_bias(struct percpu_counter_tree *counter, int bias)
+{
+	WRITE_ONCE(counter->bias, bias);
+}
+
+void percpu_counter_tree_set(struct percpu_counter_tree *counter, int v)
+{
+	percpu_counter_tree_set_bias(counter,
+				     v - percpu_counter_tree_precise_sum_unbiased(counter));
+}
+
+unsigned int percpu_counter_tree_inaccuracy(struct percpu_counter_tree *counter)
+{
+	return counter->inaccuracy;
+}
+
+static unsigned int __init calculate_inaccuracy_multiplier(void)
+{
+	unsigned int nr_levels = counter_config->nr_levels, level;
+	unsigned int level_items = 1U << nr_cpus_order;
+	unsigned int inaccuracy = 0, batch_size = 1;
+
+	for (level = 0; level < nr_levels; level++) {
+		unsigned int n_arity_order = counter_config->n_arity_order[level];
+
+		inaccuracy += batch_size * level_items;
+		batch_size <<= n_arity_order;
+		level_items >>= n_arity_order;
+	}
+	return inaccuracy;
+}
+
+static int __init percpu_counter_startup(void)
+{
+
+	nr_cpus_order = get_count_order(nr_cpu_ids);
+	if (WARN_ON_ONCE(nr_cpus_order >= ARRAY_SIZE(per_nr_cpu_order_config))) {
+		printk(KERN_ERR "Unsupported number of CPUs (%u)\n", nr_cpu_ids);
+		return -1;
+	}
+	counter_config = &per_nr_cpu_order_config[nr_cpus_order];
+	inaccuracy_multiplier = calculate_inaccuracy_multiplier();
+	return 0;
+}
+module_init(percpu_counter_startup);
-- 
2.39.5




* [RFC PATCH v7 2/2] mm: Fix OOM killer inaccuracy on large many-core systems
  2025-10-31 14:42 [RFC PATCH v7 0/2] mm: Fix OOM killer inaccuracy on large many-core systems Mathieu Desnoyers
  2025-10-31 14:42 ` [RFC PATCH v7 1/2] lib: Introduce hierarchical per-cpu counters Mathieu Desnoyers
@ 2025-10-31 14:42 ` Mathieu Desnoyers
  2025-11-06  6:53   ` kernel test robot
  1 sibling, 1 reply; 9+ messages in thread
From: Mathieu Desnoyers @ 2025-10-31 14:42 UTC
  To: Mateusz Guzik, Vlastimil Babka, Sweet Tea Dorminy
  Cc: linux-kernel, Mathieu Desnoyers, Andrew Morton, Paul E. McKenney,
	Steven Rostedt, Masami Hiramatsu, Dennis Zhou, Tejun Heo,
	Christoph Lameter, Martin Liu, David Rientjes, christian.koenig,
	Shakeel Butt, SeongJae Park, Michal Hocko, Johannes Weiner,
	Lorenzo Stoakes, Liam R . Howlett, Mike Rapoport,
	Suren Baghdasaryan, Christian Brauner, Wei Yang,
	David Hildenbrand, Miaohe Lin, Al Viro, linux-mm,
	linux-trace-kernel, Yu Zhao, Roman Gushchin, Matthew Wilcox,
	Baolin Wang, Aboorva Devarajan

Use hierarchical per-cpu counters for rss tracking to fix the per-mm RSS
tracking which has become too inaccurate for OOM killer purposes on
large many-core systems.

The following rss tracking issues were noted by Sweet Tea Dorminy [1];
they led to picking the wrong task as the OOM kill target:

  Recently, several internal services had an RSS usage regression as part of a
  kernel upgrade. Previously, they were on a pre-6.2 kernel and were able to
  read RSS statistics in a backup watchdog process to monitor and decide if
  they'd overrun their memory budget. Now, however, a representative service
  with five threads, expected to use about a hundred MB of memory, on a 250-cpu
  machine had memory usage tens of megabytes different from the expected amount
  -- this constituted a significant percentage of inaccuracy, causing the
  watchdog to act.

  This was a result of f1a7941243c1 ("mm: convert mm's rss stats into
  percpu_counter") [1].  Previously, the memory error was bounded by
  64*nr_threads pages, a very livable megabyte. Now, however, as a result of
  scheduler decisions moving the threads around the CPUs, the memory error could
  be as large as a gigabyte.

  This is a really tremendous inaccuracy for any few-threaded program on a
  large machine and impedes monitoring significantly. These stat counters are
  also used to make OOM killing decisions, so this additional inaccuracy could
  make a big difference in OOM situations -- either resulting in the wrong
  process being killed, or in less memory being returned from an OOM-kill than
  expected.

Here is a (possibly incomplete) list of the prior approaches that were
used or proposed, along with their downsides:

1) Per-thread rss tracking: large error on many-thread processes.

2) Per-CPU counters: up to 12% slower for short-lived processes and 9%
   increased system time in make test workloads [1]. Moreover, the
   inaccuracy increases as O(n^2) with the number of CPUs.

3) Per-NUMA-node counters: require atomics on the fast path (overhead),
   and the error is high on systems with many NUMA nodes (32 times the
   number of NUMA nodes).

The approach proposed here is to replace the flat per-cpu counters with
hierarchical per-cpu counters, which bound the inaccuracy based on the
system topology, growing as O(N*log(N)) with the number of CPUs.
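
As a back-of-the-envelope example (assuming the RSS_STAT_BATCH_SIZE of
32 chosen below and the per_nr_cpu_order_config table from patch 1): a
250-CPU machine rounds up to nr_cpus_order = 8, i.e. a 3-level
{8, 8, 4}-way tree whose inaccuracy multiplier is 256 + 256 + 256 = 768.
The approximate sum is therefore within ±(32 * 768) = ±24576 pages, or
±96 MiB with 4 KiB pages, regardless of how the scheduler migrates
threads, instead of the gigabyte-scale error described above.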

Commit 82241a83cd15 from Baolin Wang <baolin.wang@linux.alibaba.com>
introduced get_mm_counter_sum() for precise /proc memory status queries.
Implement it with percpu_counter_tree_precise_sum(), since it is not a
fast path and precision is preferred over speed.

Link: https://lore.kernel.org/lkml/20250331223516.7810-2-sweettea-kernel@dorminy.me/ # [1]
Link: https://lore.kernel.org/lkml/20250704150226.47980-1-mathieu.desnoyers@efficios.com/
Signed-off-by: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: "Paul E. McKenney" <paulmck@kernel.org>
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: Masami Hiramatsu <mhiramat@kernel.org>
Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Cc: Dennis Zhou <dennis@kernel.org>
Cc: Tejun Heo <tj@kernel.org>
Cc: Christoph Lameter <cl@linux.com>
Cc: Martin Liu <liumartin@google.com>
Cc: David Rientjes <rientjes@google.com>
Cc: christian.koenig@amd.com
Cc: Shakeel Butt <shakeel.butt@linux.dev>
Cc: SeongJae Park <sj@kernel.org>
Cc: Michal Hocko <mhocko@suse.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Sweet Tea Dorminy <sweettea@google.com>
Cc: Lorenzo Stoakes <lorenzo.stoakes@oracle.com>
Cc: "Liam R . Howlett" <liam.howlett@oracle.com>
Cc: Mike Rapoport <rppt@kernel.org>
Cc: Suren Baghdasaryan <surenb@google.com>
Cc: Vlastimil Babka <vbabka@suse.cz>
Cc: Christian Brauner <brauner@kernel.org>
Cc: Wei Yang <richard.weiyang@gmail.com>
Cc: David Hildenbrand <david@redhat.com>
Cc: Miaohe Lin <linmiaohe@huawei.com>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: linux-mm@kvack.org
Cc: linux-trace-kernel@vger.kernel.org
Cc: Yu Zhao <yuzhao@google.com>
Cc: Roman Gushchin <roman.gushchin@linux.dev>
Cc: Mateusz Guzik <mjguzik@gmail.com>
Cc: Matthew Wilcox <willy@infradead.org>
Cc: Baolin Wang <baolin.wang@linux.alibaba.com>
Cc: Aboorva Devarajan <aboorvad@linux.ibm.com>
---
Changes since v6:
- Rebased on v6.18-rc3.
- Implement get_mm_counter_sum as percpu_counter_tree_precise_sum for
  /proc virtual files memory state queries.

Changes since v5:
- Use percpu_counter_tree_approximate_sum_positive.

Changes since v4:
- get_mm_counter needs to return 0 or a positive value.
---
 include/linux/mm.h          | 10 +++++-----
 include/linux/mm_types.h    |  4 ++--
 include/trace/events/kmem.h |  2 +-
 kernel/fork.c               | 32 +++++++++++++++++++++-----------
 4 files changed, 29 insertions(+), 19 deletions(-)

diff --git a/include/linux/mm.h b/include/linux/mm.h
index d16b33bacc32..4f8f3118cfd3 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -2679,33 +2679,33 @@ static inline bool get_user_page_fast_only(unsigned long addr,
  */
 static inline unsigned long get_mm_counter(struct mm_struct *mm, int member)
 {
-	return percpu_counter_read_positive(&mm->rss_stat[member]);
+	return percpu_counter_tree_approximate_sum_positive(&mm->rss_stat[member]);
 }
 
 static inline unsigned long get_mm_counter_sum(struct mm_struct *mm, int member)
 {
-	return percpu_counter_sum_positive(&mm->rss_stat[member]);
+	return percpu_counter_tree_precise_sum(&mm->rss_stat[member]);
 }
 
 void mm_trace_rss_stat(struct mm_struct *mm, int member);
 
 static inline void add_mm_counter(struct mm_struct *mm, int member, long value)
 {
-	percpu_counter_add(&mm->rss_stat[member], value);
+	percpu_counter_tree_add(&mm->rss_stat[member], value);
 
 	mm_trace_rss_stat(mm, member);
 }
 
 static inline void inc_mm_counter(struct mm_struct *mm, int member)
 {
-	percpu_counter_inc(&mm->rss_stat[member]);
+	percpu_counter_tree_add(&mm->rss_stat[member], 1);
 
 	mm_trace_rss_stat(mm, member);
 }
 
 static inline void dec_mm_counter(struct mm_struct *mm, int member)
 {
-	percpu_counter_dec(&mm->rss_stat[member]);
+	percpu_counter_tree_add(&mm->rss_stat[member], -1);
 
 	mm_trace_rss_stat(mm, member);
 }
diff --git a/include/linux/mm_types.h b/include/linux/mm_types.h
index 90e5790c318f..adb2f227bac7 100644
--- a/include/linux/mm_types.h
+++ b/include/linux/mm_types.h
@@ -18,7 +18,7 @@
 #include <linux/page-flags-layout.h>
 #include <linux/workqueue.h>
 #include <linux/seqlock.h>
-#include <linux/percpu_counter.h>
+#include <linux/percpu_counter_tree.h>
 #include <linux/types.h>
 #include <linux/bitmap.h>
 
@@ -1119,7 +1119,7 @@ struct mm_struct {
 		unsigned long saved_e_flags;
 #endif
 
-		struct percpu_counter rss_stat[NR_MM_COUNTERS];
+		struct percpu_counter_tree rss_stat[NR_MM_COUNTERS];
 
 		struct linux_binfmt *binfmt;
 
diff --git a/include/trace/events/kmem.h b/include/trace/events/kmem.h
index 7f93e754da5c..91c81c44f884 100644
--- a/include/trace/events/kmem.h
+++ b/include/trace/events/kmem.h
@@ -442,7 +442,7 @@ TRACE_EVENT(rss_stat,
 		__entry->mm_id = mm_ptr_to_hash(mm);
 		__entry->curr = !!(current->mm == mm);
 		__entry->member = member;
-		__entry->size = (percpu_counter_sum_positive(&mm->rss_stat[member])
+		__entry->size = (percpu_counter_tree_approximate_sum_positive(&mm->rss_stat[member])
 							    << PAGE_SHIFT);
 	),
 
diff --git a/kernel/fork.c b/kernel/fork.c
index 3da0f08615a9..e3dd00809cf3 100644
--- a/kernel/fork.c
+++ b/kernel/fork.c
@@ -133,6 +133,11 @@
  */
 #define MAX_THREADS FUTEX_TID_MASK
 
+/*
+ * Batch size of rss stat approximation
+ */
+#define RSS_STAT_BATCH_SIZE	32
+
 /*
  * Protected counters by write_lock_irq(&tasklist_lock)
  */
@@ -583,14 +588,12 @@ static void check_mm(struct mm_struct *mm)
 			 "Please make sure 'struct resident_page_types[]' is updated as well");
 
 	for (i = 0; i < NR_MM_COUNTERS; i++) {
-		long x = percpu_counter_sum(&mm->rss_stat[i]);
-
-		if (unlikely(x)) {
-			pr_alert("BUG: Bad rss-counter state mm:%p type:%s val:%ld Comm:%s Pid:%d\n",
-				 mm, resident_page_types[i], x,
+		if (unlikely(percpu_counter_tree_precise_compare_value(&mm->rss_stat[i], 0) != 0))
+			pr_alert("BUG: Bad rss-counter state mm:%p type:%s val:%d Comm:%s Pid:%d\n",
+				 mm, resident_page_types[i],
+				 percpu_counter_tree_precise_sum(&mm->rss_stat[i]),
 				 current->comm,
 				 task_pid_nr(current));
-		}
 	}
 
 	if (mm_pgtables_bytes(mm))
@@ -673,6 +676,8 @@ static void cleanup_lazy_tlbs(struct mm_struct *mm)
  */
 void __mmdrop(struct mm_struct *mm)
 {
+	int i;
+
 	BUG_ON(mm == &init_mm);
 	WARN_ON_ONCE(mm == current->mm);
 
@@ -688,8 +693,8 @@ void __mmdrop(struct mm_struct *mm)
 	put_user_ns(mm->user_ns);
 	mm_pasid_drop(mm);
 	mm_destroy_cid(mm);
-	percpu_counter_destroy_many(mm->rss_stat, NR_MM_COUNTERS);
-
+	for (i = 0; i < NR_MM_COUNTERS; i++)
+		percpu_counter_tree_destroy(&mm->rss_stat[i]);
 	free_mm(mm);
 }
 EXPORT_SYMBOL_GPL(__mmdrop);
@@ -1030,6 +1035,8 @@ static void mmap_init_lock(struct mm_struct *mm)
 static struct mm_struct *mm_init(struct mm_struct *mm, struct task_struct *p,
 	struct user_namespace *user_ns)
 {
+	int i;
+
 	mt_init_flags(&mm->mm_mt, MM_MT_FLAGS);
 	mt_set_external_lock(&mm->mm_mt, &mm->mmap_lock);
 	atomic_set(&mm->mm_users, 1);
@@ -1083,15 +1090,18 @@ static struct mm_struct *mm_init(struct mm_struct *mm, struct task_struct *p,
 	if (mm_alloc_cid(mm, p))
 		goto fail_cid;
 
-	if (percpu_counter_init_many(mm->rss_stat, 0, GFP_KERNEL_ACCOUNT,
-				     NR_MM_COUNTERS))
-		goto fail_pcpu;
+	for (i = 0; i < NR_MM_COUNTERS; i++) {
+		if (percpu_counter_tree_init(&mm->rss_stat[i], RSS_STAT_BATCH_SIZE, GFP_KERNEL_ACCOUNT))
+			goto fail_pcpu;
+	}
 
 	mm->user_ns = get_user_ns(user_ns);
 	lru_gen_init_mm(mm);
 	return mm;
 
 fail_pcpu:
+	for (i--; i >= 0; i--)
+		percpu_counter_tree_destroy(&mm->rss_stat[i]);
 	mm_destroy_cid(mm);
 fail_cid:
 	destroy_context(mm);
-- 
2.39.5




* Re: [RFC PATCH v7 2/2] mm: Fix OOM killer inaccuracy on large many-core systems
  2025-10-31 14:42 ` [RFC PATCH v7 2/2] mm: Fix OOM killer inaccuracy on large many-core systems Mathieu Desnoyers
@ 2025-11-06  6:53   ` kernel test robot
  2025-11-07  0:32     ` Shakeel Butt
  0 siblings, 1 reply; 9+ messages in thread
From: kernel test robot @ 2025-11-06  6:53 UTC
  To: Mathieu Desnoyers
  Cc: oe-lkp, lkp, Andrew Morton, Paul E. McKenney, Steven Rostedt,
	Masami Hiramatsu, Dennis Zhou, Tejun Heo, Christoph Lameter,
	Martin Liu, David Rientjes, Shakeel Butt, SeongJae Park,
	Michal Hocko, Johannes Weiner, Sweet Tea Dorminy,
	Lorenzo Stoakes, Liam R . Howlett, Mike Rapoport,
	Suren Baghdasaryan, Vlastimil Babka, Christian Brauner, Wei Yang,
	David Hildenbrand, Miaohe Lin, Al Viro, Yu Zhao, Roman Gushchin,
	Mateusz Guzik, Matthew Wilcox, Baolin Wang, Aboorva Devarajan,
	linux-mm, linux-kernel, linux-trace-kernel, Mathieu Desnoyers,
	christian.koenig, oliver.sang



Hello,

kernel test robot noticed "BUG:Bad_rss-counter_state_mm:#type:MM_ANONPAGES_val:#Comm:kworker##Pid" on:

commit: 25ae03e80acad812e536694c1a07a3f57784ae23 ("[RFC PATCH v7 2/2] mm: Fix OOM killer inaccuracy on large many-core systems")
url: https://github.com/intel-lab-lkp/linux/commits/Mathieu-Desnoyers/lib-Introduce-hierarchical-per-cpu-counters/20251031-224455
base: https://git.kernel.org/cgit/linux/kernel/git/akpm/mm.git mm-everything
patch link: https://lore.kernel.org/all/20251031144232.15284-3-mathieu.desnoyers@efficios.com/
patch subject: [RFC PATCH v7 2/2] mm: Fix OOM killer inaccuracy on large many-core systems

in testcase: boot

config: x86_64-randconfig-002-20251103
compiler: clang-20
test machine: qemu-system-x86_64 -enable-kvm -cpu SandyBridge -smp 2 -m 16G

(please refer to attached dmesg/kmsg for entire log/backtrace)


in fact, we observed various BUG:Bad_rss-counter_state_mm issues for this commit
but clean on parent, as below

+------------------------------------------------------------------------+------------+------------+
|                                                                        | 05880dc4af | 25ae03e80a |
+------------------------------------------------------------------------+------------+------------+
| BUG:Bad_rss-counter_state_mm:#type:MM_FILEPAGES_val:#Comm:kworker##Pid | 0          | 10         |
| BUG:Bad_rss-counter_state_mm:#type:MM_ANONPAGES_val:#Comm:kworker##Pid | 0          | 17         |
| BUG:Bad_rss-counter_state_mm:#type:MM_ANONPAGES_val:#Comm:swapper_Pid  | 0          | 2          |
| BUG:Bad_rss-counter_state_mm:#type:MM_ANONPAGES_val:#Comm:modprobe_Pid | 0          | 3          |
| BUG:Bad_rss-counter_state_mm:#type:MM_FILEPAGES_val:#Comm:modprobe_Pid | 0          | 1          |
+------------------------------------------------------------------------+------------+------------+


If you fix the issue in a separate patch/commit (i.e. not just a new version of
the same patch/commit), kindly add following tags
| Reported-by: kernel test robot <oliver.sang@intel.com>
| Closes: https://lore.kernel.org/oe-lkp/202511061432.4e534796-lkp@intel.com


[   14.858862][   T67] BUG: Bad rss-counter state mm:ffff8881000655c0 type:MM_ANONPAGES val:0 Comm:kworker/u9:0 Pid:67
[   14.894890][   T69] BUG: Bad rss-counter state mm:ffff888100061cc0 type:MM_FILEPAGES val:0 Comm:kworker/u9:0 Pid:69
[   14.896108][   T69] BUG: Bad rss-counter state mm:ffff888100061cc0 type:MM_ANONPAGES val:0 Comm:kworker/u9:0 Pid:69
[   14.918858][   T71] module: module-autoload: duplicate request for module crypto-aes
[   14.919479][   T71] module: module-autoload: duplicate request for module crypto-aes-all
[   14.920801][    T1] krb5: Running aes128-cts-hmac-sha256-128 enc plain<block
[   14.921844][    T1] krb5: Running aes128-cts-hmac-sha256-128 enc plain==block
[   14.922852][    T1] krb5: Running aes128-cts-hmac-sha256-128 enc plain>block
[   14.923843][    T1] krb5: Running aes256-cts-hmac-sha384-192 enc no plain
[   14.939591][    T1] krb5: Running aes256-cts-hmac-sha384-192 enc plain<block
[   14.940614][    T1] krb5: Running aes256-cts-hmac-sha384-192 enc plain==block
[   14.941586][    T1] krb5: Running aes256-cts-hmac-sha384-192 enc plain>block
[   14.942547][    T1] krb5: Running camellia128-cts-cmac enc no plain
[   15.018568][   T85] BUG: Bad rss-counter state mm:ffff888160f81340 type:MM_ANONPAGES val:0 Comm:kworker/u9:0 Pid:85
[   15.054490][   T89] module: module-autoload: duplicate request for module crypto-camellia
[   15.055466][   T89] module: module-autoload: duplicate request for module crypto-camellia-all
[   15.056999][    T1] krb5: Running camellia128-cts-cmac enc 1 plain
[   15.057912][    T1] krb5: Running camellia128-cts-cmac enc 9 plain
[   15.058781][    T1] krb5: Running camellia128-cts-cmac enc 13 plain
[   15.059603][    T1] krb5: Running camellia128-cts-cmac enc 30 plain
[   15.061279][    T1] krb5: Running camellia256-cts-cmac enc no plain
[   15.062207][    T1] krb5: Running camellia256-cts-cmac enc 1 plain
[   15.063150][    T1] krb5: Running camellia256-cts-cmac enc 9 plain
[   15.072917][    T1] krb5: Running camellia256-cts-cmac enc 13 plain
[   15.073896][    T1] krb5: Running camellia256-cts-cmac enc 30 plain
[   15.074834][    T1] krb5: Running aes128-cts-hmac-sha256-128 mic
[   15.075625][    T1] krb5: Running aes256-cts-hmac-sha384-192 mic
[   15.076396][    T1] krb5: Running camellia128-cts-cmac mic abc
[   15.077225][    T1] krb5: Running camellia128-cts-cmac mic ABC
[   15.078052][    T1] krb5: Running camellia256-cts-cmac mic 123
[   15.078853][    T1] krb5: Running camellia256-cts-cmac mic !@#
[   15.079621][    T1] krb5: Selftests succeeded
[   15.080683][    T1] Block layer SCSI generic (bsg) driver version 0.4 loaded (major 248)
[   15.081610][    T1] io scheduler kyber registered
[   15.082527][    T1] test_mul_u64_u64_div_u64: Starting mul_u64_u64_div_u64() test
[   15.083365][    T1] test_mul_u64_u64_div_u64: ERROR: 0x000000000000000b * 0x0000000000000007 +/ 0x0000000000000003
[   15.086382][    T1] test_mul_u64_u64_div_u64: ERROR: expected result: 000000000000001a
[   15.087178][    T1] test_mul_u64_u64_div_u64: ERROR: obtained result: 0000000000000019
[   15.088064][    T1] test_mul_u64_u64_div_u64: ERROR: 0x00000000ffffffff * 0x00000000ffffffff +/ 0x0000000000000002
[   15.089105][    T1] test_mul_u64_u64_div_u64: ERROR: expected result: 7fffffff00000001
[   15.089924][    T1] test_mul_u64_u64_div_u64: ERROR: obtained result: 7fffffff00000000
[   15.090696][    T1] test_mul_u64_u64_div_u64: ERROR: 0x00000001ffffffff * 0x00000000ffffffff +/ 0x0000000000000002
[   15.091734][    T1] test_mul_u64_u64_div_u64: ERROR: expected result: fffffffe80000001
[   15.092502][    T1] test_mul_u64_u64_div_u64: ERROR: obtained result: fffffffe80000000
[   15.093281][    T1] test_mul_u64_u64_div_u64: ERROR: 0x00000001ffffffff * 0x00000001ffffffff +/ 0x0000000000000004
[   15.094337][    T1] test_mul_u64_u64_div_u64: ERROR: expected result: ffffffff00000001
[   15.095172][    T1] test_mul_u64_u64_div_u64: ERROR: obtained result: ffffffff00000000
[   15.095953][    T1] test_mul_u64_u64_div_u64: ERROR: 0xffff000000000000 * 0xffff000000000000 +/ 0xffff000000000001
[   15.097175][    T1] test_mul_u64_u64_div_u64: ERROR: expected result: ffff000000000000
[   15.098020][    T1] test_mul_u64_u64_div_u64: ERROR: obtained result: fffeffffffffffff
[   15.098837][    T1] test_mul_u64_u64_div_u64: ERROR: 0x3333333333333333 * 0x3333333333333333 +/ 0x5555555555555555
[   15.099924][    T1] test_mul_u64_u64_div_u64: ERROR: expected result: 1eb851eb851eb852
[   15.100721][    T1] test_mul_u64_u64_div_u64: ERROR: obtained result: 1eb851eb851eb851
[   15.101542][    T1] test_mul_u64_u64_div_u64: ERROR: 0x7fffffffffffffff * 0x0000000000000002 +/ 0x0000000000000003
[   15.102565][    T1] test_mul_u64_u64_div_u64: ERROR: expected result: 5555555555555555
[   15.103368][    T1] test_mul_u64_u64_div_u64: ERROR: obtained result: 5555555555555554
[   15.107134][    T1] test_mul_u64_u64_div_u64: ERROR: 0xffffffffffffffff * 0x0000000000000002 +/ 0x8000000000000000
[   15.108196][    T1] test_mul_u64_u64_div_u64: ERROR: expected result: 0000000000000004
[   15.109049][    T1] test_mul_u64_u64_div_u64: ERROR: obtained result: 0000000000000003
[   15.109887][    T1] test_mul_u64_u64_div_u64: ERROR: 0xffffffffffffffff * 0x0000000000000002 +/ 0xc000000000000000
[   15.111017][    T1] test_mul_u64_u64_div_u64: ERROR: expected result: 0000000000000003
[   15.111907][    T1] test_mul_u64_u64_div_u64: ERROR: obtained result: 0000000000000002
[   15.112666][    T1] test_mul_u64_u64_div_u64: ERROR: 0xffffffffffffffff * 0x4000000000000004 +/ 0x8000000000000000
[   15.113703][    T1] test_mul_u64_u64_div_u64: ERROR: expected result: 8000000000000008
[   15.114527][    T1] test_mul_u64_u64_div_u64: ERROR: obtained result: 8000000000000007
[   15.115424][    T1] test_mul_u64_u64_div_u64: ERROR: 0xffffffffffffffff * 0x4000000000000001 +/ 0x8000000000000000
[   15.116279][    T1] test_mul_u64_u64_div_u64: ERROR: expected result: 8000000000000002
[   15.116882][    T1] test_mul_u64_u64_div_u64: ERROR: obtained result: 8000000000000001
[   15.117490][    T1] test_mul_u64_u64_div_u64: ERROR: 0xfffffffffffffffe * 0x8000000000000001 +/ 0xffffffffffffffff
[   15.118363][    T1] test_mul_u64_u64_div_u64: ERROR: expected result: 8000000000000001
[   15.119240][    T1] test_mul_u64_u64_div_u64: ERROR: obtained result: 8000000000000000
[   15.119914][    T1] test_mul_u64_u64_div_u64: ERROR: 0xffffffffffffffff * 0x8000000000000001 +/ 0xfffffffffffffffe
[   15.120785][    T1] test_mul_u64_u64_div_u64: ERROR: expected result: 8000000000000002
[   15.121627][    T1] test_mul_u64_u64_div_u64: ERROR: obtained result: 8000000000000001
[   15.122503][    T1] test_mul_u64_u64_div_u64: ERROR: 0xffffffffffffffff * 0x8000000000000001 +/ 0xfffffffffffffffd
[   15.123624][    T1] test_mul_u64_u64_div_u64: ERROR: expected result: 8000000000000003
[   15.124521][    T1] test_mul_u64_u64_div_u64: ERROR: obtained result: 8000000000000002
[   15.125399][    T1] test_mul_u64_u64_div_u64: ERROR: 0x7fffffffffffffff * 0xffffffffffffffff +/ 0xc000000000000000
[   15.126592][    T1] test_mul_u64_u64_div_u64: ERROR: expected result: aaaaaaaaaaaaaaa9
[   15.127438][    T1] test_mul_u64_u64_div_u64: ERROR: obtained result: aaaaaaaaaaaaaaa8
[   15.128411][    T1] test_mul_u64_u64_div_u64: ERROR: 0xffffffffffffffff * 0x7fffffffffffffff +/ 0xa000000000000000
[   15.129565][    T1] test_mul_u64_u64_div_u64: ERROR: expected result: cccccccccccccccb
[   15.130454][    T1] test_mul_u64_u64_div_u64: ERROR: obtained result: ccccccccccccccca
[   15.131239][    T1] test_mul_u64_u64_div_u64: ERROR: 0xffffffffffffffff * 0x7fffffffffffffff +/ 0x9000000000000000
[   15.132213][    T1] test_mul_u64_u64_div_u64: ERROR: expected result: e38e38e38e38e38c
[   15.132793][    T1] test_mul_u64_u64_div_u64: ERROR: obtained result: e38e38e38e38e38b
[   15.133374][    T1] test_mul_u64_u64_div_u64: ERROR: 0x7fffffffffffffff * 0x7fffffffffffffff +/ 0x5000000000000000
[   15.134101][    T1] test_mul_u64_u64_div_u64: ERROR: expected result: ccccccccccccccca
[   15.134674][    T1] test_mul_u64_u64_div_u64: ERROR: obtained result: ccccccccccccccc9
[   15.135235][    T1] test_mul_u64_u64_div_u64: ERROR: 0xe6102d256d7ea3ae * 0x70a77d0be4c31201 +/ 0xd63ec35ab3220357
[   15.135984][    T1] test_mul_u64_u64_div_u64: ERROR: expected result: 78f8bf8cc86c6e19
[   15.136587][    T1] test_mul_u64_u64_div_u64: ERROR: obtained result: 78f8bf8cc86c6e18
[   15.137140][    T1] test_mul_u64_u64_div_u64: ERROR: 0xf53bae05cb86c6e1 * 0x3847b32d2f8d32e0 +/ 0xcfd4f55a647f403c
[   15.137964][    T1] test_mul_u64_u64_div_u64: ERROR: expected result: 42687f79d8998d36
[   15.138541][    T1] test_mul_u64_u64_div_u64: ERROR: obtained result: 42687f79d8998d35
[   15.139135][    T1] test_mul_u64_u64_div_u64: ERROR: 0x9951c5498f941092 * 0x1f8c8bfdf287a251 +/ 0xa3c8dc5f81ea3fe2
[   15.139884][    T1] test_mul_u64_u64_div_u64: ERROR: expected result: 1d887cb259000920
[   15.140444][    T1] test_mul_u64_u64_div_u64: ERROR: obtained result: 1d887cb25900091f
[   15.141025][    T1] test_mul_u64_u64_div_u64: ERROR: 0x374fee9daa1bb2bb * 0x0d0bfbff7b8ae3ef +/ 0xc169337bd42d5179
[   15.141759][    T1] test_mul_u64_u64_div_u64: ERROR: expected result: 03bb2dbaffcbb962
[   15.142324][    T1] test_mul_u64_u64_div_u64: ERROR: obtained result: 03bb2dbaffcbb961
[   15.142890][    T1] test_mul_u64_u64_div_u64: ERROR: 0xeac0d03ac10eeaf0 * 0x89be05dfa162ed9b +/ 0x92bb1679a41f0e4b
[   15.143618][    T1] test_mul_u64_u64_div_u64: ERROR: expected result: dc5f5cc9e270d217
[   15.144200][    T1] test_mul_u64_u64_div_u64: ERROR: obtained result: dc5f5cc9e270d216
[   15.144767][    T1] test_mul_u64_u64_div_u64: Completed mul_u64_u64_div_u64() test, 56 tests, 23 errors, 61402015 ns
[   15.147067][    T1] gpio_virtuser: Failed to create the debugfs tree: -2
[   15.148313][    T1] gpio_winbond: chip ID at 2e is ffff
[   15.148884][    T1] gpio_winbond: not an our chip
[   15.149345][    T1] gpio_winbond: chip ID at 4e is ffff
[   15.149721][    T1] gpio_winbond: not an our chip
[   15.151343][    T1] IPMI message handler: version 39.2
[   15.151885][    T1] ipmi device interface
[   15.152644][    T1] ipmi_si: IPMI System Interface driver
[   15.153494][    T1] ipmi_si: Unable to find any System Interface(s)


The kernel config and materials to reproduce are available at:
https://download.01.org/0day-ci/archive/20251106/202511061432.4e534796-lkp@intel.com



-- 
0-DAY CI Kernel Test Service
https://github.com/intel/lkp-tests/wiki




* Re: [RFC PATCH v7 2/2] mm: Fix OOM killer inaccuracy on large many-core systems
  2025-11-06  6:53   ` kernel test robot
@ 2025-11-07  0:32     ` Shakeel Butt
  2025-11-07 14:43       ` Mathieu Desnoyers
  0 siblings, 1 reply; 9+ messages in thread
From: Shakeel Butt @ 2025-11-07  0:32 UTC
  To: kernel test robot
  Cc: Mathieu Desnoyers, oe-lkp, lkp, Andrew Morton, Paul E. McKenney,
	Steven Rostedt, Masami Hiramatsu, Dennis Zhou, Tejun Heo,
	Christoph Lameter, Martin Liu, David Rientjes, SeongJae Park,
	Michal Hocko, Johannes Weiner, Sweet Tea Dorminy,
	Lorenzo Stoakes, Liam R . Howlett, Mike Rapoport,
	Suren Baghdasaryan, Vlastimil Babka, Christian Brauner, Wei Yang,
	David Hildenbrand, Miaohe Lin, Al Viro, Yu Zhao, Roman Gushchin,
	Mateusz Guzik, Matthew Wilcox, Baolin Wang, Aboorva Devarajan,
	linux-mm, linux-kernel, linux-trace-kernel, christian.koenig

On Thu, Nov 06, 2025 at 02:53:09PM +0800, kernel test robot wrote:
> 
> 
> Hello,
> 
> kernel test robot noticed "BUG:Bad_rss-counter_state_mm:#type:MM_ANONPAGES_val:#Comm:kworker##Pid" on:
> 
> commit: 25ae03e80acad812e536694c1a07a3f57784ae23 ("[RFC PATCH v7 2/2] mm: Fix OOM killer inaccuracy on large many-core systems")
> url: https://github.com/intel-lab-lkp/linux/commits/Mathieu-Desnoyers/lib-Introduce-hierarchical-per-cpu-counters/20251031-224455
> base: https://git.kernel.org/cgit/linux/kernel/git/akpm/mm.git mm-everything
> patch link: https://lore.kernel.org/all/20251031144232.15284-3-mathieu.desnoyers@efficios.com/
> patch subject: [RFC PATCH v7 2/2] mm: Fix OOM killer inaccuracy on large many-core systems
> 
> in testcase: boot
> 
> config: x86_64-randconfig-002-20251103
> compiler: clang-20
> test machine: qemu-system-x86_64 -enable-kvm -cpu SandyBridge -smp 2 -m 16G
> 
> (please refer to attached dmesg/kmsg for entire log/backtrace)
> 
> 
> in fact, we observed various BUG:Bad_rss-counter_state_mm issues for this commit
> but clean on parent, as below
> 
> +------------------------------------------------------------------------+------------+------------+
> |                                                                        | 05880dc4af | 25ae03e80a |
> +------------------------------------------------------------------------+------------+------------+
> | BUG:Bad_rss-counter_state_mm:#type:MM_FILEPAGES_val:#Comm:kworker##Pid | 0          | 10         |
> | BUG:Bad_rss-counter_state_mm:#type:MM_ANONPAGES_val:#Comm:kworker##Pid | 0          | 17         |
> | BUG:Bad_rss-counter_state_mm:#type:MM_ANONPAGES_val:#Comm:swapper_Pid  | 0          | 2          |
> | BUG:Bad_rss-counter_state_mm:#type:MM_ANONPAGES_val:#Comm:modprobe_Pid | 0          | 3          |
> | BUG:Bad_rss-counter_state_mm:#type:MM_FILEPAGES_val:#Comm:modprobe_Pid | 0          | 1          |
> +------------------------------------------------------------------------+------------+------------+
> 
> 
> If you fix the issue in a separate patch/commit (i.e. not just a new version of
> the same patch/commit), kindly add following tags
> | Reported-by: kernel test robot <oliver.sang@intel.com>
> | Closes: https://lore.kernel.org/oe-lkp/202511061432.4e534796-lkp@intel.com
> 
> 
> [   14.858862][   T67] BUG: Bad rss-counter state mm:ffff8881000655c0 type:MM_ANONPAGES val:0 Comm:kworker/u9:0 Pid:67
> [   14.894890][   T69] BUG: Bad rss-counter state mm:ffff888100061cc0 type:MM_FILEPAGES val:0 Comm:kworker/u9:0 Pid:69
> [   14.896108][   T69] BUG: Bad rss-counter state mm:ffff888100061cc0 type:MM_ANONPAGES val:0 Comm:kworker/u9:0 Pid:69

Hmm, this shows that percpu_counter_tree_precise_sum() is returning 0 but
percpu_counter_tree_approximate_sum() is off by more than
counter->inaccuracy. I have not dug deeper to find out why, but this needs
to be resolved before considering this series for upstream.
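
For reference, here is my mental model of the check that fires. This is
only a sketch of how I assume the series wires up check_mm() in
kernel/fork.c (names taken from the series headers; the actual patch may
differ):

	for (i = 0; i < NR_MM_COUNTERS; i++) {
		struct percpu_counter_tree *c = &mm->rss_stat[i];

		/*
		 * Assumed behavior: the compare consults the cheap
		 * approximate sum first and only falls back to the
		 * precise sum when the approximation is within
		 * counter->inaccuracy of the target value.
		 */
		if (percpu_counter_tree_precise_compare_value(c, 0) != 0)
			pr_alert("BUG: Bad rss-counter state mm:%p type:%s val:%d\n",
				 mm, resident_page_types[i],
				 percpu_counter_tree_precise_sum(c));
	}

val:0 in the logs means the precise sum is 0, so the compare must have
concluded "nonzero" from the approximate sum alone, i.e. without the
precise fallback ever kicking in.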


^ permalink raw reply	[flat|nested] 9+ messages in thread

* Re: [RFC PATCH v7 2/2] mm: Fix OOM killer inaccuracy on large many-core systems
  2025-11-07  0:32     ` Shakeel Butt
@ 2025-11-07 14:43       ` Mathieu Desnoyers
  2025-11-07 15:53         ` Mathieu Desnoyers
  0 siblings, 1 reply; 9+ messages in thread
From: Mathieu Desnoyers @ 2025-11-07 14:43 UTC (permalink / raw)
  To: Shakeel Butt, kernel test robot
  Cc: oe-lkp, lkp, Andrew Morton, Paul E. McKenney, Steven Rostedt,
	Masami Hiramatsu, Dennis Zhou, Tejun Heo, Christoph Lameter,
	Martin Liu, David Rientjes, SeongJae Park, Michal Hocko,
	Johannes Weiner, Sweet Tea Dorminy, Lorenzo Stoakes,
	Liam R . Howlett, Mike Rapoport, Suren Baghdasaryan,
	Vlastimil Babka, Christian Brauner, Wei Yang, David Hildenbrand,
	Miaohe Lin, Al Viro, Yu Zhao, Roman Gushchin, Mateusz Guzik,
	Matthew Wilcox, Baolin Wang, Aboorva Devarajan, linux-mm,
	linux-kernel, linux-trace-kernel, christian.koenig

On 2025-11-06 19:32, Shakeel Butt wrote:

[...]

>> [   14.858862][   T67] BUG: Bad rss-counter state mm:ffff8881000655c0 type:MM_ANONPAGES val:0 Comm:kworker/u9:0 Pid:67
>> [   14.894890][   T69] BUG: Bad rss-counter state mm:ffff888100061cc0 type:MM_FILEPAGES val:0 Comm:kworker/u9:0 Pid:69
>> [   14.896108][   T69] BUG: Bad rss-counter state mm:ffff888100061cc0 type:MM_ANONPAGES val:0 Comm:kworker/u9:0 Pid:69
> 
> Hmm, this shows that percpu_counter_tree_precise_sum() is returning 0 but
> percpu_counter_tree_approximate_sum() is off by more than
> counter->inaccuracy. I have not dug deeper to find out why, but this needs
> to be resolved before considering this series for upstream.

I notice that those BUGs show up while loading modules at boot in kworker context, e.g.:

[   14.858862][   T67] BUG: Bad rss-counter state mm:ffff8881000655c0 type:MM_ANONPAGES val:0 Comm:kworker/u9:0 Pid:67
[   14.894890][   T69] BUG: Bad rss-counter state mm:ffff888100061cc0 type:MM_FILEPAGES val:0 Comm:kworker/u9:0 Pid:69
[   14.896108][   T69] BUG: Bad rss-counter state mm:ffff888100061cc0 type:MM_ANONPAGES val:0 Comm:kworker/u9:0 Pid:69
[   14.918858][   T71] module: module-autoload: duplicate request for module crypto-aes
[   14.919479][   T71] module: module-autoload: duplicate request for module crypto-aes-all
[   14.920801][    T1] krb5: Running aes128-cts-hmac-sha256-128 enc plain<block
[   14.921844][    T1] krb5: Running aes128-cts-hmac-sha256-128 enc plain==block
[   14.922852][    T1] krb5: Running aes128-cts-hmac-sha256-128 enc plain>block
[   14.923843][    T1] krb5: Running aes256-cts-hmac-sha384-192 enc no plain
[   14.939591][    T1] krb5: Running aes256-cts-hmac-sha384-192 enc plain<block
[   14.940614][    T1] krb5: Running aes256-cts-hmac-sha384-192 enc plain==block
[   14.941586][    T1] krb5: Running aes256-cts-hmac-sha384-192 enc plain>block
[   14.942547][    T1] krb5: Running camellia128-cts-cmac enc no plain
[   15.018568][   T85] BUG: Bad rss-counter state mm:ffff888160f81340 type:MM_ANONPAGES val:0 Comm:kworker/u9:0 Pid:85

I used "module_init" similarly to lib/percpu_counter.c, but I think it
happens too late in the boot sequence:

   module_init(percpu_counter_startup);

module_init maps to __initcall within a built-in compile unit, which
maps to device_initcall(), which runs quite late within the sequence
called from do_initcalls(), itself called from do_basic_setup().
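
For built-in code, the macro chain is roughly the following (paraphrasing
include/linux/module.h and include/linux/init.h; exact definitions vary
slightly between kernel versions):

	/* include/linux/module.h, !MODULE case: */
	#define module_init(x)	__initcall(x);

	/* include/linux/init.h: */
	#define __initcall(fn)		device_initcall(fn)
	#define device_initcall(fn)	__define_initcall(fn, 6)

So a built-in percpu_counter_startup() only runs at initcall level 6,
long after the first kernel threads have been created.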

And even do_basic_setup is documented as:

  * Ok, the machine is now initialized. None of the devices
  * have been touched yet, but the CPU subsystem is up and
  * running, and memory and process management works.

which makes it clear that the mm subsystem is expected to be ready by
that point.

It probably was not an issue for the non-hierarchical percpu counters,
because all their startup code initialized was CPU hotplug handling.
The new hierarchical counters, however, also initialize the
pre-calculated inaccuracy value, which is used to decide whether the
approximate sum suffices for comparing values or whether the precise
sum is needed.

I think this is why we are hitting this BUG.
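
To illustrate my assumption about the ordering (a sketch; the globals
below are file-static in lib/percpu_counter_tree.c and only get their
real values once percpu_counter_startup() has run):

	/* Zero-initialized until the (too late) initcall runs: */
	static unsigned int nr_cpus_order;		/* 0 at early boot */
	static unsigned int inaccuracy_multiplier;	/* 0 at early boot */

	/*
	 * Assumption: with inaccuracy_multiplier == 0, the inaccuracy
	 * computed for counters allocated before the initcall is 0, so
	 * any nonzero approximate sum is treated as exact and the
	 * precise-sum fallback never triggers, producing the bogus
	 * "Bad rss-counter state" reports above.
	 */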

Now I wonder where I should move this initialization. It requires
"nr_cpu_ids" to be initialized, and it pretty much needs to be done
before any mm is created. I'm starting to suspect that the module init
code can spawn kworkers that have an mm before the init process runs.

Thoughts ?

Thanks,

Mathieu

-- 
Mathieu Desnoyers
EfficiOS Inc.
https://www.efficios.com


^ permalink raw reply	[flat|nested] 9+ messages in thread

* Re: [RFC PATCH v7 2/2] mm: Fix OOM killer inaccuracy on large many-core systems
  2025-11-07 14:43       ` Mathieu Desnoyers
@ 2025-11-07 15:53         ` Mathieu Desnoyers
  2025-11-07 16:04           ` Mathieu Desnoyers
  0 siblings, 1 reply; 9+ messages in thread
From: Mathieu Desnoyers @ 2025-11-07 15:53 UTC (permalink / raw)
  To: Shakeel Butt, kernel test robot
  Cc: oe-lkp, lkp, Andrew Morton, Paul E. McKenney, Steven Rostedt,
	Masami Hiramatsu, Dennis Zhou, Tejun Heo, Christoph Lameter,
	Martin Liu, David Rientjes, SeongJae Park, Michal Hocko,
	Johannes Weiner, Sweet Tea Dorminy, Lorenzo Stoakes,
	Liam R . Howlett, Mike Rapoport, Suren Baghdasaryan,
	Vlastimil Babka, Christian Brauner, Wei Yang, David Hildenbrand,
	Miaohe Lin, Al Viro, Yu Zhao, Roman Gushchin, Mateusz Guzik,
	Matthew Wilcox, Baolin Wang, Aboorva Devarajan, linux-mm,
	linux-kernel, linux-trace-kernel, christian.koenig

On 2025-11-07 09:43, Mathieu Desnoyers wrote:
> On 2025-11-06 19:32, Shakeel Butt wrote:
> 
> [...]
> 
>>> [   14.858862][   T67] BUG: Bad rss-counter state mm:ffff8881000655c0 type:MM_ANONPAGES val:0 Comm:kworker/u9:0 Pid:67
>>> [   14.894890][   T69] BUG: Bad rss-counter state mm:ffff888100061cc0 type:MM_FILEPAGES val:0 Comm:kworker/u9:0 Pid:69
>>> [   14.896108][   T69] BUG: Bad rss-counter state mm:ffff888100061cc0 type:MM_ANONPAGES val:0 Comm:kworker/u9:0 Pid:69
>>
>> Hmm, this shows that percpu_counter_tree_precise_sum() is returning 0 but
>> percpu_counter_tree_approximate_sum() is off by more than
>> counter->inaccuracy. I have not dug deeper to find out why, but this needs
>> to be resolved before considering this series for upstream.
> 
> I notice that those BUGs show up while loading modules at boot in kworker
> context, e.g.:
> 
> [   14.858862][   T67] BUG: Bad rss-counter state mm:ffff8881000655c0 type:MM_ANONPAGES val:0 Comm:kworker/u9:0 Pid:67
> [   14.894890][   T69] BUG: Bad rss-counter state mm:ffff888100061cc0 type:MM_FILEPAGES val:0 Comm:kworker/u9:0 Pid:69
> [   14.896108][   T69] BUG: Bad rss-counter state mm:ffff888100061cc0 type:MM_ANONPAGES val:0 Comm:kworker/u9:0 Pid:69
> [   14.918858][   T71] module: module-autoload: duplicate request for module crypto-aes
> [   14.919479][   T71] module: module-autoload: duplicate request for module crypto-aes-all
> [   14.920801][    T1] krb5: Running aes128-cts-hmac-sha256-128 enc plain<block
> [   14.921844][    T1] krb5: Running aes128-cts-hmac-sha256-128 enc plain==block
> [   14.922852][    T1] krb5: Running aes128-cts-hmac-sha256-128 enc plain>block
> [   14.923843][    T1] krb5: Running aes256-cts-hmac-sha384-192 enc no plain
> [   14.939591][    T1] krb5: Running aes256-cts-hmac-sha384-192 enc plain<block
> [   14.940614][    T1] krb5: Running aes256-cts-hmac-sha384-192 enc plain==block
> [   14.941586][    T1] krb5: Running aes256-cts-hmac-sha384-192 enc plain>block
> [   14.942547][    T1] krb5: Running camellia128-cts-cmac enc no plain
> [   15.018568][   T85] BUG: Bad rss-counter state mm:ffff888160f81340 type:MM_ANONPAGES val:0 Comm:kworker/u9:0 Pid:85
> 
> I used "module_init" similarly to lib/percpu_counter.c, but I think it
> happens too late in the boot sequence:
> 
>    module_init(percpu_counter_startup);
> 
> module_init maps to __initcall within a built-in compile unit, which
> maps to device_initcall(), which runs quite late within the sequence
> called from do_initcalls(), itself called from do_basic_setup().
> 
> And even do_basic_setup is documented as:
> 
>   * Ok, the machine is now initialized. None of the devices
>   * have been touched yet, but the CPU subsystem is up and
>   * running, and memory and process management works.
> 
> which makes it clear that the mm subsystem is expected to be ready by
> that point.
> 
> It probably was not an issue for the non-hierarchical percpu counters,
> because all their startup code initialized was CPU hotplug handling.
> The new hierarchical counters, however, also initialize the
> pre-calculated inaccuracy value, which is used to decide whether the
> approximate sum suffices for comparing values or whether the precise
> sum is needed.
> 
> I think this is why we are hitting this BUG.
> 
> Now I wonder where I should move this initialization. It requires
> "nr_cpu_ids" to be initialized, and it pretty much needs to be done
> before any mm is created. I'm starting to suspect that the module init
> code can spawn kworkers that have an mm before the init process runs.

At least on x86, nr_cpu_ids appears to be set by set_nr_cpu_ids()
through early_param("possible_cpus", setup_possible_cpus), which is
AFAIU called from parse_early_param(), and that happens very early in
the boot sequence.
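
For reference, the x86 side looks roughly like this (paraphrased from
arch/x86/kernel/smpboot.c; details vary by kernel version):

	static int __init _setup_possible_cpus(char *str)
	{
		get_option(&str, &setup_possible_cpus);
		return 0;
	}
	early_param("possible_cpus", _setup_possible_cpus);

	/* later, prefill_possible_map() derives the final value: */
	set_nr_cpu_ids(possible);

Either way, nr_cpu_ids is stable long before the first mm is created.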

It would make sense to call an explicit percpu counter tree init
function from start_kernel(), between the call to mm_core_init() and
the call to maple_tree_init(). This way it would be initialized right
after mm. But given that the hierarchical counter tree is a lib that
can be used for purposes other than mm accounting, I think it makes
sense to call its init explicitly from start_kernel() rather than bury
it within mm_core_init().

Thoughts ?

Thanks,

Mathieu

-- 
Mathieu Desnoyers
EfficiOS Inc.
https://www.efficios.com


^ permalink raw reply	[flat|nested] 9+ messages in thread

* Re: [RFC PATCH v7 2/2] mm: Fix OOM killer inaccuracy on large many-core systems
  2025-11-07 15:53         ` Mathieu Desnoyers
@ 2025-11-07 16:04           ` Mathieu Desnoyers
  2025-11-08  0:01             ` Shakeel Butt
  0 siblings, 1 reply; 9+ messages in thread
From: Mathieu Desnoyers @ 2025-11-07 16:04 UTC (permalink / raw)
  To: Shakeel Butt, kernel test robot
  Cc: oe-lkp, lkp, Andrew Morton, Paul E. McKenney, Steven Rostedt,
	Masami Hiramatsu, Dennis Zhou, Tejun Heo, Christoph Lameter,
	Martin Liu, David Rientjes, SeongJae Park, Michal Hocko,
	Johannes Weiner, Sweet Tea Dorminy, Lorenzo Stoakes,
	Liam R . Howlett, Mike Rapoport, Suren Baghdasaryan,
	Vlastimil Babka, Christian Brauner, Wei Yang, David Hildenbrand,
	Miaohe Lin, Al Viro, Yu Zhao, Roman Gushchin, Mateusz Guzik,
	Matthew Wilcox, Baolin Wang, Aboorva Devarajan, linux-mm,
	linux-kernel, linux-trace-kernel, christian.koenig

On 2025-11-07 10:53, Mathieu Desnoyers wrote:
[...]
> 
> It would make sense to call an explicit percpu counter tree init
> function from start_kernel(), between the call to mm_core_init() and
> the call to maple_tree_init(). This way it would be initialized right
> after mm. But given that the hierarchical counter tree is a lib that
> can be used for purposes other than mm accounting, I think it makes
> sense to call its init explicitly from start_kernel() rather than bury
> it within mm_core_init().

See the following diff. If nobody objects, I'll prepare a v8 which
includes it.

diff --git a/include/linux/percpu_counter_tree.h b/include/linux/percpu_counter_tree.h
index 8795e782680a..40fcdd6456b6 100644
--- a/include/linux/percpu_counter_tree.h
+++ b/include/linux/percpu_counter_tree.h
@@ -41,6 +41,7 @@ int percpu_counter_tree_precise_compare(struct percpu_counter_tree *a, struct pe
 int percpu_counter_tree_precise_compare_value(struct percpu_counter_tree *counter, int v);
 void percpu_counter_tree_set(struct percpu_counter_tree *counter, int v);
 unsigned int percpu_counter_tree_inaccuracy(struct percpu_counter_tree *counter);
+int percpu_counter_tree_subsystem_init(void);

 /* Fast paths */

@@ -191,6 +192,12 @@ int percpu_counter_tree_approximate_sum(struct percpu_counter_tree *counter)
 	return percpu_counter_tree_precise_sum(counter);
 }

+static inline
+int percpu_counter_tree_subsystem_init(void)
+{
+	return 0;
+}
+
 #endif	/* CONFIG_SMP */

 static inline
diff --git a/init/main.c b/init/main.c
index 07a3116811c5..204d9f913130 100644
--- a/init/main.c
+++ b/init/main.c
@@ -104,6 +104,7 @@
 #include <linux/pidfs.h>
 #include <linux/ptdump.h>
 #include <linux/time_namespace.h>
+#include <linux/percpu_counter_tree.h>
 #include <net/net_namespace.h>

 #include <asm/io.h>
@@ -969,6 +970,7 @@ void start_kernel(void)
 	sort_main_extable();
 	trap_init();
 	mm_core_init();
+	percpu_counter_tree_subsystem_init();
 	maple_tree_init();
 	poking_init();
 	ftrace_init();
diff --git a/lib/percpu_counter_tree.c b/lib/percpu_counter_tree.c
index 9577d94251d1..05c3db0ce5b1 100644
--- a/lib/percpu_counter_tree.c
+++ b/lib/percpu_counter_tree.c
@@ -379,7 +379,7 @@ static unsigned int __init calculate_inaccuracy_multiplier(void)
 	return inaccuracy;
 }

-static int __init percpu_counter_startup(void)
+int __init percpu_counter_tree_subsystem_init(void)
 {

 	nr_cpus_order = get_count_order(nr_cpu_ids);
@@ -391,4 +391,3 @@ static int __init percpu_counter_startup(void)
 	inaccuracy_multiplier = calculate_inaccuracy_multiplier();
 	return 0;
 }
-module_init(percpu_counter_startup);


-- 
Mathieu Desnoyers
EfficiOS Inc.
https://www.efficios.com


^ permalink raw reply	[flat|nested] 9+ messages in thread

* Re: [RFC PATCH v7 2/2] mm: Fix OOM killer inaccuracy on large many-core systems
  2025-11-07 16:04           ` Mathieu Desnoyers
@ 2025-11-08  0:01             ` Shakeel Butt
  0 siblings, 0 replies; 9+ messages in thread
From: Shakeel Butt @ 2025-11-08  0:01 UTC (permalink / raw)
  To: Mathieu Desnoyers
  Cc: kernel test robot, oe-lkp, lkp, Andrew Morton, Paul E. McKenney,
	Steven Rostedt, Masami Hiramatsu, Dennis Zhou, Tejun Heo,
	Christoph Lameter, Martin Liu, David Rientjes, SeongJae Park,
	Michal Hocko, Johannes Weiner, Sweet Tea Dorminy,
	Lorenzo Stoakes, Liam R . Howlett, Mike Rapoport,
	Suren Baghdasaryan, Vlastimil Babka, Christian Brauner, Wei Yang,
	David Hildenbrand, Miaohe Lin, Al Viro, Yu Zhao, Roman Gushchin,
	Mateusz Guzik, Matthew Wilcox, Baolin Wang, Aboorva Devarajan,
	linux-mm, linux-kernel, linux-trace-kernel, christian.koenig

On Fri, Nov 07, 2025 at 11:04:01AM -0500, Mathieu Desnoyers wrote:
> On 2025-11-07 10:53, Mathieu Desnoyers wrote:
> [...]
> > 
> > It would make sense to call an explicit percpu counter tree init
> > function from start_kernel(), between the call to mm_core_init() and
> > the call to maple_tree_init(). This way it would be initialized right
> > after mm. But given that the hierarchical counter tree is a lib that
> > can be used for purposes other than mm accounting, I think it makes
> > sense to call its init explicitly from start_kernel() rather than bury
> > it within mm_core_init().
> 
> See the following diff. If nobody objects, I'll prepare a v8 which
> includes it.

This seems reasonable to me. I see v8 has already been posted; I will
take a deeper look.



^ permalink raw reply	[flat|nested] 9+ messages in thread

end of thread, other threads:[~2025-11-08  0:02 UTC | newest]

Thread overview: 9+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2025-10-31 14:42 [RFC PATCH v7 0/2] mm: Fix OOM killer inaccuracy on large many-core systems Mathieu Desnoyers
2025-10-31 14:42 ` [RFC PATCH v7 1/2] lib: Introduce hierarchical per-cpu counters Mathieu Desnoyers
2025-10-31 14:42 ` [RFC PATCH v7 2/2] mm: Fix OOM killer inaccuracy on large many-core systems Mathieu Desnoyers
2025-11-06  6:53   ` kernel test robot
2025-11-07  0:32     ` Shakeel Butt
2025-11-07 14:43       ` Mathieu Desnoyers
2025-11-07 15:53         ` Mathieu Desnoyers
2025-11-07 16:04           ` Mathieu Desnoyers
2025-11-08  0:01             ` Shakeel Butt

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox