linux-mm.kvack.org archive mirror
* [PATCH v12 0/3] mm: Fix OOM killer inaccuracy on large many-core systems
From: Mathieu Desnoyers @ 2026-01-11 15:02 UTC
  To: Andrew Morton
  Cc: linux-kernel, Mathieu Desnoyers, Paul E. McKenney,
	Steven Rostedt, Masami Hiramatsu, Dennis Zhou, Tejun Heo,
	Christoph Lameter, Martin Liu, David Rientjes, christian.koenig,
	Shakeel Butt, SeongJae Park, Michal Hocko, Johannes Weiner,
	Sweet Tea Dorminy, Lorenzo Stoakes, Liam R . Howlett,
	Mike Rapoport, Suren Baghdasaryan, Vlastimil Babka,
	Christian Brauner, Wei Yang, David Hildenbrand, Miaohe Lin,
	Al Viro, linux-mm, linux-trace-kernel, Yu Zhao, Roman Gushchin,
	Mateusz Guzik, Matthew Wilcox, Baolin Wang, Aboorva Devarajan

Introduce hierarchical per-cpu counters and use them for per-mm RSS
tracking, which has become too inaccurate for OOM killer purposes on
large many-core systems.

The following RSS tracking issues were noted by Sweet Tea Dorminy [1],
and lead to the wrong tasks being picked as OOM kill targets:

  Recently, several internal services had an RSS usage regression as part of a
  kernel upgrade. Previously, they were on a pre-6.2 kernel and were able to
  read RSS statistics in a backup watchdog process to monitor and decide if
  they'd overrun their memory budget. Now, however, a representative service
  with five threads, expected to use about a hundred MB of memory, on a 250-cpu
  machine had memory usage tens of megabytes different from the expected amount
  -- this constituted a significant percentage of inaccuracy, causing the
  watchdog to act.

  This was a result of commit f1a7941243c1 ("mm: convert mm's rss stats
  into percpu_counter") [1].  Previously, the memory error was bounded by
  64*nr_threads pages, a very livable megabyte. Now, however, as a result of
  scheduler decisions moving the threads around the CPUs, the memory error could
  be as large as a gigabyte.

  This is a really tremendous inaccuracy for any few-threaded program on a
  large machine and impedes monitoring significantly. These stat counters are
  also used to make OOM killing decisions, so this additional inaccuracy could
  make a big difference in OOM situations -- either resulting in the wrong
  process being killed, or in less memory being returned from an OOM-kill than
  expected.
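
To give a sense of scale for the two error bounds quoted above, here is
a small arithmetic sketch. It is illustrative only: the 4 KiB page size
and the percpu_counter batch formula max(32, 2 * nr_cpus) are
assumptions on my part, and the helper names are hypothetical.

```c
#include <assert.h>

/* Assumption: 4 KiB pages; other page sizes scale the result. */
#define PAGE_SIZE_BYTES 4096L

/* Pre-6.2 bound quoted above: up to 64 pages of drift per thread. */
static long old_error_pages(long nr_threads)
{
	assert(nr_threads >= 0);
	return 64 * nr_threads;
}

/* Assumed percpu_counter batch size: max(32, 2 * nr_cpus). */
static long percpu_batch(long nr_cpus)
{
	long batch = 2 * nr_cpus;

	return batch > 32 ? batch : 32;
}

/* Each CPU may hold just under one batch of unflushed updates. */
static long percpu_error_pages(long nr_cpus)
{
	assert(nr_cpus > 0);
	return nr_cpus * (percpu_batch(nr_cpus) - 1);
}

/* Convert a page count to KiB under the page size assumed above. */
static long pages_to_kib(long pages)
{
	return pages * (PAGE_SIZE_BYTES / 1024);
}
```

Under these assumptions, five threads give 320 pages (1.25 MiB) with the
old scheme, while 250 CPUs give 124750 pages (about 487 MiB) per
counter, and RSS is the sum of several such counters.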

The approach proposed here is to replace the percpu_counter-based RSS
tracking with hierarchical per-cpu counters, which bound the inaccuracy
based on the system topology to O(N*logN).
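
As a rough illustration of the idea, here is a single-threaded userspace
toy, not the code from this series. It assumes a binary tree over 8 CPUs
in which each level carries its value upward once it reaches a batch
size that doubles per level. All names (toy_tree_counter, toy_add, and
so on) and constants are hypothetical.

```c
#include <assert.h>
#include <stdlib.h>

#define NR_CPUS         8	/* hypothetical CPU count, power of two */
#define NR_INNER_LEVELS 2	/* levels between per-CPU slots and root */
#define BATCH           4	/* carry threshold at the per-CPU level */

struct toy_tree_counter {
	long percpu[NR_CPUS];				/* one slot per CPU */
	long inner[NR_INNER_LEVELS][NR_CPUS / 2];	/* intermediate nodes */
	long approx;					/* root: approximate sum */
};

/* Add v on behalf of cpu, carrying upward whenever a slot reaches its batch. */
static void toy_add(struct toy_tree_counter *c, int cpu, long v)
{
	long batch = BATCH;
	long *slot = &c->percpu[cpu];
	int idx = cpu;

	assert(cpu >= 0 && cpu < NR_CPUS);
	for (int level = 0; ; level++) {
		*slot += v;
		if (labs(*slot) < batch)
			return;		/* below threshold: stays local */
		v = *slot;		/* carry the whole slot upward */
		*slot = 0;
		if (level == NR_INNER_LEVELS)
			break;
		idx >>= 1;		/* parent node index */
		batch <<= 1;		/* parents use a larger batch */
		slot = &c->inner[level][idx];
	}
	c->approx += v;
}

/* Precise sum: root plus everything still held in the tree. */
static long toy_precise_sum(const struct toy_tree_counter *c)
{
	long sum = c->approx;

	for (int i = 0; i < NR_CPUS; i++)
		sum += c->percpu[i];
	for (int l = 0; l < NR_INNER_LEVELS; l++)
		for (int i = 0; i < (NR_CPUS >> (l + 1)); i++)
			sum += c->inner[l][i];
	return sum;
}

/* Worst-case |precise - approx|: every slot just below its batch. */
static long toy_max_error(void)
{
	long err = NR_CPUS * (BATCH - 1);

	for (int l = 0; l < NR_INNER_LEVELS; l++)
		err += (NR_CPUS >> (l + 1)) * ((BATCH << (l + 1)) - 1);
	return err;
}
```

In this toy, each level holds back roughly NR_CPUS * BATCH at most, and
the number of levels grows with log2(NR_CPUS), which is where an
N*logN-shaped bound comes from.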

Notable changes for v12:

- Reduce per-CPU counter memory allocation size to sizeof(long)
  (fixing a mixup with the size of the intermediate cache-line-aligned
  items).
- Use "long" counter types rather than "int".
- get_mm_counter_sum() returns a precise sum.
- Introduce and use functions to calculate the min/max possible precise
  sum values associated with an approximate sum.
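
To illustrate the last item, here is a minimal sketch of deriving a
min/max range for the precise sum from an approximate sum. It assumes
the maximum inaccuracy in each direction is known. The names
(sum_range, toy_sum_range, toy_ranges_overlap) are hypothetical and are
not the functions added by this series. The overlap check shows one way
such ranges could be used: when two ranges do not overlap, comparing
the approximate sums is already conclusive.

```c
#include <assert.h>
#include <stdbool.h>

/* Interval guaranteed to contain the precise sum. */
struct sum_range {
	long min;
	long max;
};

/*
 * max_err_below: how far the precise sum may lie below approx.
 * max_err_above: how far the precise sum may lie above approx.
 */
static struct sum_range toy_sum_range(long approx, long max_err_below,
				      long max_err_above)
{
	struct sum_range r;

	assert(max_err_below >= 0 && max_err_above >= 0);
	r.min = approx - max_err_below;
	r.max = approx + max_err_above;
	return r;
}

/* True when the two intervals share at least one value. */
static bool toy_ranges_overlap(struct sum_range a, struct sum_range b)
{
	return a.min <= b.max && b.min <= a.max;
}
```

For example, with an inaccuracy of 82 in each direction, approximate
sums of 1000 and 2000 yield non-overlapping ranges, whereas 1000 and
1100 overlap and would need a precise sum to be ordered reliably.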

I've done moderate testing of this series on a 256-core VM with 128GB
RAM. Figuring out whether this indeed helps solve issues with real-life
workloads will require broader feedback from the community.

This series is based on v6.19-rc4, on top of the following two
preparation series:

https://lore.kernel.org/linux-mm/20251224173358.647691-1-mathieu.desnoyers@efficios.com/T/#t
https://lore.kernel.org/linux-mm/20251224173810.648699-1-mathieu.desnoyers@efficios.com/T/#t

Andrew, this series replaces v11, for testing in mm-new.

Thanks!

Mathieu

Link: https://lore.kernel.org/lkml/20250331223516.7810-2-sweettea-kernel@dorminy.me/ # [1]
To: Andrew Morton <akpm@linux-foundation.org>
Cc: "Paul E. McKenney" <paulmck@kernel.org>
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: Masami Hiramatsu <mhiramat@kernel.org>
Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Cc: Dennis Zhou <dennis@kernel.org>
Cc: Tejun Heo <tj@kernel.org>
Cc: Christoph Lameter <cl@linux.com>
Cc: Martin Liu <liumartin@google.com>
Cc: David Rientjes <rientjes@google.com>
Cc: christian.koenig@amd.com
Cc: Shakeel Butt <shakeel.butt@linux.dev>
Cc: SeongJae Park <sj@kernel.org>
Cc: Michal Hocko <mhocko@suse.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Sweet Tea Dorminy <sweettea-kernel@dorminy.me>
Cc: Lorenzo Stoakes <lorenzo.stoakes@oracle.com>
Cc: "Liam R . Howlett" <liam.howlett@oracle.com>
Cc: Mike Rapoport <rppt@kernel.org>
Cc: Suren Baghdasaryan <surenb@google.com>
Cc: Vlastimil Babka <vbabka@suse.cz>
Cc: Christian Brauner <brauner@kernel.org>
Cc: Wei Yang <richard.weiyang@gmail.com>
Cc: David Hildenbrand <david@redhat.com>
Cc: Miaohe Lin <linmiaohe@huawei.com>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: linux-mm@kvack.org
Cc: linux-trace-kernel@vger.kernel.org
Cc: Yu Zhao <yuzhao@google.com>
Cc: Roman Gushchin <roman.gushchin@linux.dev>
Cc: Mateusz Guzik <mjguzik@gmail.com>
Cc: Matthew Wilcox <willy@infradead.org>
Cc: Baolin Wang <baolin.wang@linux.alibaba.com>
Cc: Aboorva Devarajan <aboorvad@linux.ibm.com>

Mathieu Desnoyers (3):
  lib: Introduce hierarchical per-cpu counters
  mm: Fix OOM killer inaccuracy on large many-core systems
  mm: Implement precise OOM killer task selection

 fs/proc/base.c                      |   2 +-
 include/linux/mm.h                  |  49 +-
 include/linux/mm_types.h            |  54 ++-
 include/linux/oom.h                 |  11 +-
 include/linux/percpu_counter_tree.h | 344 ++++++++++++++
 include/trace/events/kmem.h         |   2 +-
 init/main.c                         |   2 +
 kernel/fork.c                       |  22 +-
 lib/Makefile                        |   1 +
 lib/percpu_counter_tree.c           | 702 ++++++++++++++++++++++++++++
 mm/oom_kill.c                       |  82 +++-
 11 files changed, 1222 insertions(+), 49 deletions(-)
 create mode 100644 include/linux/percpu_counter_tree.h
 create mode 100644 lib/percpu_counter_tree.c

-- 
2.39.5

