* [PATCH v1 0/1] mm: Fix OOM killer and proc stats inaccuracy on large many-core systems
@ 2026-01-13 19:47 Mathieu Desnoyers
2026-01-13 19:47 ` [PATCH v1 1/1] " Mathieu Desnoyers
0 siblings, 1 reply; 10+ messages in thread
From: Mathieu Desnoyers @ 2026-01-13 19:47 UTC (permalink / raw)
To: Andrew Morton
Cc: linux-kernel, Mathieu Desnoyers, Paul E. McKenney,
Steven Rostedt, Masami Hiramatsu, Dennis Zhou, Tejun Heo,
Christoph Lameter, Martin Liu, David Rientjes, christian.koenig,
Shakeel Butt, SeongJae Park, Michal Hocko, Johannes Weiner,
Sweet Tea Dorminy, Lorenzo Stoakes, Liam R . Howlett,
Mike Rapoport, Suren Baghdasaryan, Vlastimil Babka,
Christian Brauner, Wei Yang, David Hildenbrand, Miaohe Lin,
Al Viro, linux-mm, stable, linux-trace-kernel, Yu Zhao,
Roman Gushchin, Mateusz Guzik, Matthew Wilcox, Baolin Wang,
Aboorva Devarajan
Hi Andrew,
This patch modifies the OOM killer and all proc RSS stats to use the
precise for-each-possible-cpu sum to fix the inaccuracy issues. This
approach was suggested by Michal Hocko as a straightforward fix for the
inaccuracy issue by using more precise (but slower) RSS stats sum.
With this, the hierarchical per-cpu counters become a simple
optimization rather than a bug fix. I will post a new version
of the HPCC soon which will be based on this patch.
Feedback is welcome!
Thanks,
Mathieu
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: "Paul E. McKenney" <paulmck@kernel.org>
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: Masami Hiramatsu <mhiramat@kernel.org>
Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Cc: Dennis Zhou <dennis@kernel.org>
Cc: Tejun Heo <tj@kernel.org>
Cc: Christoph Lameter <cl@linux.com>
Cc: Martin Liu <liumartin@google.com>
Cc: David Rientjes <rientjes@google.com>
Cc: christian.koenig@amd.com
Cc: Shakeel Butt <shakeel.butt@linux.dev>
Cc: SeongJae Park <sj@kernel.org>
Cc: Michal Hocko <mhocko@suse.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Sweet Tea Dorminy <sweettea-kernel@dorminy.me>
Cc: Lorenzo Stoakes <lorenzo.stoakes@oracle.com>
Cc: "Liam R . Howlett" <liam.howlett@oracle.com>
Cc: Mike Rapoport <rppt@kernel.org>
Cc: Suren Baghdasaryan <surenb@google.com>
Cc: Vlastimil Babka <vbabka@suse.cz>
Cc: Christian Brauner <brauner@kernel.org>
Cc: Wei Yang <richard.weiyang@gmail.com>
Cc: David Hildenbrand <david@redhat.com>
Cc: Miaohe Lin <linmiaohe@huawei.com>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: linux-mm@kvack.org
Cc: stable@vger.kernel.org
Cc: linux-trace-kernel@vger.kernel.org
Cc: Yu Zhao <yuzhao@google.com>
Cc: Roman Gushchin <roman.gushchin@linux.dev>
Cc: Mateusz Guzik <mjguzik@gmail.com>
Cc: Matthew Wilcox <willy@infradead.org>
Cc: Baolin Wang <baolin.wang@linux.alibaba.com>
Cc: Aboorva Devarajan <aboorvad@linux.ibm.com>
Mathieu Desnoyers (1):
mm: Fix OOM killer and proc stats inaccuracy on large many-core
systems
fs/proc/task_mmu.c | 14 +++++++-------
include/linux/mm.h | 5 -----
2 files changed, 7 insertions(+), 12 deletions(-)
--
2.39.5
* [PATCH v1 1/1] mm: Fix OOM killer and proc stats inaccuracy on large many-core systems
2026-01-13 19:47 [PATCH v1 0/1] mm: Fix OOM killer and proc stats inaccuracy on large many-core systems Mathieu Desnoyers
@ 2026-01-13 19:47 ` Mathieu Desnoyers
2026-01-13 21:46 ` Andrew Morton
2026-01-14 3:18 ` Baolin Wang
0 siblings, 2 replies; 10+ messages in thread
From: Mathieu Desnoyers @ 2026-01-13 19:47 UTC (permalink / raw)
To: Andrew Morton
Cc: linux-kernel, Mathieu Desnoyers, Paul E. McKenney,
Steven Rostedt, Masami Hiramatsu, Dennis Zhou, Tejun Heo,
Christoph Lameter, Martin Liu, David Rientjes, christian.koenig,
Shakeel Butt, SeongJae Park, Michal Hocko, Johannes Weiner,
Sweet Tea Dorminy, Lorenzo Stoakes, Liam R . Howlett,
Mike Rapoport, Suren Baghdasaryan, Vlastimil Babka,
Christian Brauner, Wei Yang, David Hildenbrand, Miaohe Lin,
Al Viro, linux-mm, stable, linux-trace-kernel, Yu Zhao,
Roman Gushchin, Mateusz Guzik, Matthew Wilcox, Baolin Wang,
Aboorva Devarajan
Use the precise, albeit slower, RSS counter sums for the OOM killer
task selection and proc statistics. The approximated value is too
imprecise on large many-core systems.
The following rss tracking issues were noted by Sweet Tea Dorminy [1],
which led to picking the wrong tasks as OOM kill targets:
Recently, several internal services had an RSS usage regression as part of a
kernel upgrade. Previously, they were on a pre-6.2 kernel and were able to
read RSS statistics in a backup watchdog process to monitor and decide if
they'd overrun their memory budget. Now, however, a representative service
with five threads, expected to use about a hundred MB of memory, on a 250-cpu
machine had memory usage tens of megabytes different from the expected amount
-- this constituted a significant percentage of inaccuracy, causing the
watchdog to act.
This was a result of commit f1a7941243c1 ("mm: convert mm's rss stats
into percpu_counter") [1]. Previously, the memory error was bounded by
64*nr_threads pages, a very livable megabyte. Now, however, as a result of
scheduler decisions moving the threads around the CPUs, the memory error could
be as large as a gigabyte.
This is a really tremendous inaccuracy for any few-threaded program on a
large machine and impedes monitoring significantly. These stat counters are
also used to make OOM killing decisions, so this additional inaccuracy could
make a big difference in OOM situations -- either resulting in the wrong
process being killed, or in less memory being returned from an OOM-kill than
expected.
Here is a (possibly incomplete) list of the prior approaches that were
used or proposed, along with their downside:
1) Per-thread rss tracking: large error on many-thread processes.
2) Per-CPU counters: up to 12% slower for short-lived processes and 9%
increased system time in make test workloads [1]. Moreover, the
inaccuracy grows as O(n^2) with the number of CPUs.
3) Per-NUMA-node counters: requires atomics on fast-path (overhead),
error is high with systems that have lots of NUMA nodes (32 times
the number of NUMA nodes).
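The error bounds quoted above can be reproduced with a small userspace
sketch. It is not kernel code: the helper names are hypothetical, the page
size is assumed to be 4 KiB, and the batch size is assumed to follow
max(32, 2 * nr_cpus) as computed in lib/percpu_counter.c at the time of
writing.

```c
#include <stdint.h>

/* Assumption: 4 KiB pages; other page sizes scale the byte results. */
#define SKETCH_PAGE_SIZE 4096ULL

/* Pre-6.2 per-thread caching: error bounded by 64 pages per thread. */
static uint64_t per_thread_error_pages(uint64_t nr_threads)
{
	return 64 * nr_threads;
}

/* Assumed percpu_counter batch size: max(32, 2 * nr_cpus). */
static uint64_t percpu_batch(uint64_t nr_cpus)
{
	uint64_t batch = 2 * nr_cpus;

	return batch < 32 ? 32 : batch;
}

/*
 * Each CPU can hold back just under one batch before folding into the
 * shared count, so batch * nr_cpus is a simple upper bound on the drift
 * in one direction.
 */
static uint64_t percpu_error_pages(uint64_t nr_cpus)
{
	return percpu_batch(nr_cpus) * nr_cpus;
}

/* Convert a page count to bytes under the assumed page size. */
static uint64_t pages_to_bytes(uint64_t pages)
{
	return pages * SKETCH_PAGE_SIZE;
}
```

Under these assumptions, five threads give 320 pages (1.25 MiB), while 250
CPUs give 125000 pages, about 488 MiB in each direction, which is consistent
with the "as large as a gigabyte" spread. Doubling the CPU count quadruples
the bound, which is the O(n^2) growth mentioned in (2).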
The simple fix proposed here is to do the precise per-cpu counters sum
every time a counter value needs to be read. This applies to the OOM
killer task selection, to the /proc statistics, and to the oom mark_victim
trace event.
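To illustrate the difference between the approximate read and the precise
sum, here is a minimal single-threaded userspace model. It only mimics the
batching behaviour of percpu_counter; the model_* names, the CPU count and
the batch size are hypothetical, and locking and preemption are ignored.

```c
#include <stdint.h>

#define MODEL_NR_CPUS 4   /* hypothetical CPU count for the model */
#define MODEL_BATCH   32  /* hypothetical batch size */

struct model_counter {
	int64_t count;                /* shared value, updated once per batch */
	int32_t pcpu[MODEL_NR_CPUS];  /* per-CPU deltas not yet folded in */
};

/*
 * Update from a given CPU: the local delta is folded into the shared
 * count only once its magnitude reaches the batch size.
 */
static void model_add(struct model_counter *c, int cpu, int32_t amount)
{
	int32_t delta = c->pcpu[cpu] + amount;

	if (delta >= MODEL_BATCH || delta <= -MODEL_BATCH) {
		c->count += delta;
		c->pcpu[cpu] = 0;
	} else {
		c->pcpu[cpu] = delta;
	}
}

/* Approximate read: shared count only, clamped at zero. */
static int64_t model_read_positive(const struct model_counter *c)
{
	return c->count < 0 ? 0 : c->count;
}

/* Precise read: shared count plus every per-CPU delta, clamped at zero. */
static int64_t model_sum_positive(const struct model_counter *c)
{
	int64_t sum = c->count;

	for (int cpu = 0; cpu < MODEL_NR_CPUS; cpu++)
		sum += c->pcpu[cpu];
	return sum < 0 ? 0 : sum;
}
```

In this model, adding 31 pages on each of the four CPUs leaves the
approximate read at 0 while the precise sum reports 124. The precise read
costs one extra load per CPU, which is the trade-off this patch accepts.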
Note that commit 82241a83cd15 ("mm: fix the inaccurate memory statistics
issue for users") introduced get_mm_counter_sum() for precise proc
memory status queries for _some_ proc files. This change renames
get_mm_counter_sum() to get_mm_counter(), thus moving the rest of the
proc files to the precise sum.
This change increases the latency of the OOM killer in exchange for a
more precise OOM target task selection. The OOM killer iterates over all
tasks and, for each relevant page type, the precise sum iterates over
all possible CPUs.
As a reference, here is the execution time of the OOM killer
before/after the change:
AMD EPYC 9654 96-Core (2 sockets)
Within a KVM, configured with 256 logical cpus.
                                  |  before  |  after   |
----------------------------------|----------|----------|
nr_processes=40                   |   0.3 ms |   0.5 ms |
nr_processes=10000                |   3.0 ms |  80.0 ms |
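As a rough sanity check of these numbers, the sketch below counts the
per-CPU loads implied by one OOM scan. The helper names are hypothetical,
and the counter count of 4 (file, anon and shmem RSS plus swap entries) is
an assumption about what the OOM badness computation reads, not something
stated in this patch.

```c
#include <stdint.h>

/* Per-CPU loads for one OOM scan when every counter read is a precise sum. */
static uint64_t oom_scan_percpu_loads(uint64_t nr_tasks, uint64_t nr_counters,
				      uint64_t nr_cpus)
{
	return nr_tasks * nr_counters * nr_cpus;
}

/*
 * Average cost per load in picoseconds, given a scan time in microseconds.
 * The scan time also covers the task list walk and other overhead, so the
 * result is an upper-bound estimate and not a measurement of a single load.
 */
static uint64_t avg_ps_per_load(uint64_t scan_us, uint64_t loads)
{
	return loads ? scan_us * 1000000ULL / loads : 0;
}
```

With 10000 processes, 4 counters and 256 possible CPUs this gives 10240000
loads. Spreading the 80 ms "after" figure over them yields at most about
7.8 ns per load, which is plausible for a walk over mostly remote cache
lines.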
Suggested-by: Michal Hocko <mhocko@suse.com>
Fixes: f1a7941243c1 ("mm: convert mm's rss stats into percpu_counter")
Link: https://lore.kernel.org/lkml/20250331223516.7810-2-sweettea-kernel@dorminy.me/ # [1]
Signed-off-by: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: "Paul E. McKenney" <paulmck@kernel.org>
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: Masami Hiramatsu <mhiramat@kernel.org>
Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Cc: Dennis Zhou <dennis@kernel.org>
Cc: Tejun Heo <tj@kernel.org>
Cc: Christoph Lameter <cl@linux.com>
Cc: Martin Liu <liumartin@google.com>
Cc: David Rientjes <rientjes@google.com>
Cc: christian.koenig@amd.com
Cc: Shakeel Butt <shakeel.butt@linux.dev>
Cc: SeongJae Park <sj@kernel.org>
Cc: Michal Hocko <mhocko@suse.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Sweet Tea Dorminy <sweettea-kernel@dorminy.me>
Cc: Lorenzo Stoakes <lorenzo.stoakes@oracle.com>
Cc: "Liam R . Howlett" <liam.howlett@oracle.com>
Cc: Mike Rapoport <rppt@kernel.org>
Cc: Suren Baghdasaryan <surenb@google.com>
Cc: Vlastimil Babka <vbabka@suse.cz>
Cc: Christian Brauner <brauner@kernel.org>
Cc: Wei Yang <richard.weiyang@gmail.com>
Cc: David Hildenbrand <david@redhat.com>
Cc: Miaohe Lin <linmiaohe@huawei.com>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: linux-mm@kvack.org
Cc: stable@vger.kernel.org
Cc: linux-trace-kernel@vger.kernel.org
Cc: Yu Zhao <yuzhao@google.com>
Cc: Roman Gushchin <roman.gushchin@linux.dev>
Cc: Mateusz Guzik <mjguzik@gmail.com>
Cc: Matthew Wilcox <willy@infradead.org>
Cc: Baolin Wang <baolin.wang@linux.alibaba.com>
Cc: Aboorva Devarajan <aboorvad@linux.ibm.com>
---
fs/proc/task_mmu.c | 14 +++++++-------
include/linux/mm.h | 5 -----
2 files changed, 7 insertions(+), 12 deletions(-)
diff --git a/fs/proc/task_mmu.c b/fs/proc/task_mmu.c
index 81dfc26bfae8..8ca4fbf53fc5 100644
--- a/fs/proc/task_mmu.c
+++ b/fs/proc/task_mmu.c
@@ -39,9 +39,9 @@ void task_mem(struct seq_file *m, struct mm_struct *mm)
unsigned long text, lib, swap, anon, file, shmem;
unsigned long hiwater_vm, total_vm, hiwater_rss, total_rss;
- anon = get_mm_counter_sum(mm, MM_ANONPAGES);
- file = get_mm_counter_sum(mm, MM_FILEPAGES);
- shmem = get_mm_counter_sum(mm, MM_SHMEMPAGES);
+ anon = get_mm_counter(mm, MM_ANONPAGES);
+ file = get_mm_counter(mm, MM_FILEPAGES);
+ shmem = get_mm_counter(mm, MM_SHMEMPAGES);
/*
* Note: to minimize their overhead, mm maintains hiwater_vm and
@@ -62,7 +62,7 @@ void task_mem(struct seq_file *m, struct mm_struct *mm)
text = min(text, mm->exec_vm << PAGE_SHIFT);
lib = (mm->exec_vm << PAGE_SHIFT) - text;
- swap = get_mm_counter_sum(mm, MM_SWAPENTS);
+ swap = get_mm_counter(mm, MM_SWAPENTS);
SEQ_PUT_DEC("VmPeak:\t", hiwater_vm);
SEQ_PUT_DEC(" kB\nVmSize:\t", total_vm);
SEQ_PUT_DEC(" kB\nVmLck:\t", mm->locked_vm);
@@ -95,12 +95,12 @@ unsigned long task_statm(struct mm_struct *mm,
unsigned long *shared, unsigned long *text,
unsigned long *data, unsigned long *resident)
{
- *shared = get_mm_counter_sum(mm, MM_FILEPAGES) +
- get_mm_counter_sum(mm, MM_SHMEMPAGES);
+ *shared = get_mm_counter(mm, MM_FILEPAGES) +
+ get_mm_counter(mm, MM_SHMEMPAGES);
*text = (PAGE_ALIGN(mm->end_code) - (mm->start_code & PAGE_MASK))
>> PAGE_SHIFT;
*data = mm->data_vm + mm->stack_vm;
- *resident = *shared + get_mm_counter_sum(mm, MM_ANONPAGES);
+ *resident = *shared + get_mm_counter(mm, MM_ANONPAGES);
return mm->total_vm;
}
diff --git a/include/linux/mm.h b/include/linux/mm.h
index 6f959d8ca4b4..d096bb3593ba 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -2847,11 +2847,6 @@ static inline bool get_user_page_fast_only(unsigned long addr,
* per-process(per-mm_struct) statistics.
*/
static inline unsigned long get_mm_counter(struct mm_struct *mm, int member)
-{
- return percpu_counter_read_positive(&mm->rss_stat[member]);
-}
-
-static inline unsigned long get_mm_counter_sum(struct mm_struct *mm, int member)
{
return percpu_counter_sum_positive(&mm->rss_stat[member]);
}
--
2.39.5
* Re: [PATCH v1 1/1] mm: Fix OOM killer and proc stats inaccuracy on large many-core systems
2026-01-13 19:47 ` [PATCH v1 1/1] " Mathieu Desnoyers
@ 2026-01-13 21:46 ` Andrew Morton
2026-01-13 22:16 ` Mathieu Desnoyers
2026-01-14 3:18 ` Baolin Wang
1 sibling, 1 reply; 10+ messages in thread
From: Andrew Morton @ 2026-01-13 21:46 UTC (permalink / raw)
To: Mathieu Desnoyers
Cc: linux-kernel, Paul E. McKenney, Steven Rostedt, Masami Hiramatsu,
Dennis Zhou, Tejun Heo, Christoph Lameter, Martin Liu,
David Rientjes, christian.koenig, Shakeel Butt, SeongJae Park,
Michal Hocko, Johannes Weiner, Sweet Tea Dorminy,
Lorenzo Stoakes, Liam R . Howlett, Mike Rapoport,
Suren Baghdasaryan, Vlastimil Babka, Christian Brauner, Wei Yang,
David Hildenbrand, Miaohe Lin, Al Viro, linux-mm, stable,
linux-trace-kernel, Yu Zhao, Roman Gushchin, Mateusz Guzik,
Matthew Wilcox, Baolin Wang, Aboorva Devarajan
On Tue, 13 Jan 2026 14:47:34 -0500 Mathieu Desnoyers <mathieu.desnoyers@efficios.com> wrote:
> Use the precise, albeit slower, RSS counter sums for the OOM
> killer task selection and proc statistics. The approximated value is
> too imprecise on large many-core systems.
Thanks.
Problem: if I also queue your "mm: Reduce latency of OOM killer task
selection" series then this single patch won't get tested, because the
larger series erases this patch, yes?
Obvious solution: aim this patch at next-merge-window and let's look at
the larger series for the next -rc cycle. Thoughts?
> The following rss tracking issues were noted by Sweet Tea Dorminy [1],
> which lead to picking wrong tasks as OOM kill target:
>
> Recently, several internal services had an RSS usage regression as part of a
> kernel upgrade. Previously, they were on a pre-6.2 kernel and were able to
> read RSS statistics in a backup watchdog process to monitor and decide if
> they'd overrun their memory budget. Now, however, a representative service
> with five threads, expected to use about a hundred MB of memory, on a 250-cpu
> machine had memory usage tens of megabytes different from the expected amount
> -- this constituted a significant percentage of inaccuracy, causing the
> watchdog to act.
>
> This was a result of commit f1a7941243c1 ("mm: convert mm's rss stats
> into percpu_counter") [1]. Previously, the memory error was bounded by
> 64*nr_threads pages, a very livable megabyte. Now, however, as a result of
> scheduler decisions moving the threads around the CPUs, the memory error could
> be as large as a gigabyte.
>
> This is a really tremendous inaccuracy for any few-threaded program on a
> large machine and impedes monitoring significantly. These stat counters are
> also used to make OOM killing decisions, so this additional inaccuracy could
> make a big difference in OOM situations -- either resulting in the wrong
> process being killed, or in less memory being returned from an OOM-kill than
> expected.
>
> Here is a (possibly incomplete) list of the prior approaches that were
> used or proposed, along with their downside:
>
> 1) Per-thread rss tracking: large error on many-thread processes.
>
> 2) Per-CPU counters: up to 12% slower for short-lived processes and 9%
> increased system time in make test workloads [1]. Moreover, the
> inaccuracy increases with O(n^2) with the number of CPUs.
>
> 3) Per-NUMA-node counters: requires atomics on fast-path (overhead),
> error is high with systems that have lots of NUMA nodes (32 times
> the number of NUMA nodes).
>
> The simple fix proposed here is to do the precise per-cpu counters sum
> every time a counter value needs to be read. This applies to the OOM
> killer task selection, to the /proc statistics, and to the oom mark_victim
> trace event.
>
> Note that commit 82241a83cd15 ("mm: fix the inaccurate memory statistics
> issue for users") introduced get_mm_counter_sum() for precise proc
> memory status queries for _some_ proc files. This change renames
> get_mm_counter_sum() to get_mm_counter(), thus moving the rest of the
> proc files to the precise sum.
Please confirm - switching /proc functions from get_mm_counter_sum() to
get_mm_counter() doesn't actually change anything, right? It would
be concerning to add possible overhead to things like task_statm().
> This change effectively increases the latency introduced when the OOM
> killer executes in favor of doing a more precise OOM target task
> selection. Effectively, the OOM killer iterates on all tasks, for all
> relevant page types, for which the precise sum iterates on all possible
> CPUs.
>
> As a reference, here is the execution time of the OOM killer
> before/after the change:
>
> AMD EPYC 9654 96-Core (2 sockets)
> Within a KVM, configured with 256 logical cpus.
>
> | before | after |
> ----------------------------------|----------|----------|
> nr_processes=40 | 0.3 ms | 0.5 ms |
> nr_processes=10000 | 3.0 ms | 80.0 ms |
That seems acceptable.
* Re: [PATCH v1 1/1] mm: Fix OOM killer and proc stats inaccuracy on large many-core systems
2026-01-13 21:46 ` Andrew Morton
@ 2026-01-13 22:16 ` Mathieu Desnoyers
2026-01-13 23:55 ` Andrew Morton
0 siblings, 1 reply; 10+ messages in thread
From: Mathieu Desnoyers @ 2026-01-13 22:16 UTC (permalink / raw)
To: Andrew Morton
Cc: linux-kernel, Paul E. McKenney, Steven Rostedt, Masami Hiramatsu,
Dennis Zhou, Tejun Heo, Christoph Lameter, Martin Liu,
David Rientjes, christian.koenig, Shakeel Butt, SeongJae Park,
Michal Hocko, Johannes Weiner, Sweet Tea Dorminy,
Lorenzo Stoakes, Liam R . Howlett, Mike Rapoport,
Suren Baghdasaryan, Vlastimil Babka, Christian Brauner, Wei Yang,
David Hildenbrand, Miaohe Lin, Al Viro, linux-mm, stable,
linux-trace-kernel, Yu Zhao, Roman Gushchin, Mateusz Guzik,
Matthew Wilcox, Baolin Wang, Aboorva Devarajan
On 2026-01-13 16:46, Andrew Morton wrote:
> On Tue, 13 Jan 2026 14:47:34 -0500 Mathieu Desnoyers <mathieu.desnoyers@efficios.com> wrote:
>
>> Use the precise, albeit slower, RSS counter sums for the OOM
>> killer task selection and proc statistics. The approximated value is
>> too imprecise on large many-core systems.
>
> Thanks.
>
> Problem: if I also queue your "mm: Reduce latency of OOM killer task
> selection" series then this single patch won't get tested, because the
> larger series erases this patch, yes?
That's a good point.
>
> Obvious solution: aim this patch at next-merge-window and let's look at
> the larger series for the next -rc cycle. Thoughts?
Yes, that works for me. Does it mean I should re-submit the hpcc
series after the next merge window closes, or do you keep a queue of
stuff waiting for the next -rc cycle somewhere?
>
>> The following rss tracking issues were noted by Sweet Tea Dorminy [1],
>> which lead to picking wrong tasks as OOM kill target:
>>
>> Recently, several internal services had an RSS usage regression as part of a
>> kernel upgrade. Previously, they were on a pre-6.2 kernel and were able to
>> read RSS statistics in a backup watchdog process to monitor and decide if
>> they'd overrun their memory budget. Now, however, a representative service
>> with five threads, expected to use about a hundred MB of memory, on a 250-cpu
>> machine had memory usage tens of megabytes different from the expected amount
>> -- this constituted a significant percentage of inaccuracy, causing the
>> watchdog to act.
>>
>> This was a result of commit f1a7941243c1 ("mm: convert mm's rss stats
>> into percpu_counter") [1]. Previously, the memory error was bounded by
>> 64*nr_threads pages, a very livable megabyte. Now, however, as a result of
>> scheduler decisions moving the threads around the CPUs, the memory error could
>> be as large as a gigabyte.
>>
>> This is a really tremendous inaccuracy for any few-threaded program on a
>> large machine and impedes monitoring significantly. These stat counters are
>> also used to make OOM killing decisions, so this additional inaccuracy could
>> make a big difference in OOM situations -- either resulting in the wrong
>> process being killed, or in less memory being returned from an OOM-kill than
>> expected.
>>
>> Here is a (possibly incomplete) list of the prior approaches that were
>> used or proposed, along with their downside:
>>
>> 1) Per-thread rss tracking: large error on many-thread processes.
>>
>> 2) Per-CPU counters: up to 12% slower for short-lived processes and 9%
>> increased system time in make test workloads [1]. Moreover, the
>> inaccuracy increases with O(n^2) with the number of CPUs.
>>
>> 3) Per-NUMA-node counters: requires atomics on fast-path (overhead),
>> error is high with systems that have lots of NUMA nodes (32 times
>> the number of NUMA nodes).
>>
>> The simple fix proposed here is to do the precise per-cpu counters sum
>> every time a counter value needs to be read. This applies to the OOM
>> killer task selection, to the /proc statistics, and to the oom mark_victim
>> trace event.
>>
>> Note that commit 82241a83cd15 ("mm: fix the inaccurate memory statistics
>> issue for users") introduced get_mm_counter_sum() for precise proc
>> memory status queries for _some_ proc files. This change renames
>> get_mm_counter_sum() to get_mm_counter(), thus moving the rest of the
>> proc files to the precise sum.
>
> Please confirm - switching /proc functions from get_mm_counter_sum() to
> get_mm_counter() doesn't actually change anything, right? It would
> be concerning to add possible overhead to things like task_statm().
The approach proposed by this patch is to switch all proc ABIs which
query RSS to the precise sum, eliminating any discrepancy caused by
overly imprecise approximate sums. It's a big hammer, and it can slow
down those proc interfaces, including task_statm(). Is it an issue?
The hpcc series introduces an approximation which provides accuracy
limits that keep the result somewhat meaningful on large many-core
systems.
The overall approach here would be to move those proc interfaces which
care about low overhead back to the hpcc approximate sum when it lands
upstream. But to do that, we need to know which proc interface files
are performance-sensitive. How can we get that data?
Thanks,
Mathieu
--
Mathieu Desnoyers
EfficiOS Inc.
https://www.efficios.com
* Re: [PATCH v1 1/1] mm: Fix OOM killer and proc stats inaccuracy on large many-core systems
2026-01-13 22:16 ` Mathieu Desnoyers
@ 2026-01-13 23:55 ` Andrew Morton
2026-01-14 1:22 ` Mathieu Desnoyers
0 siblings, 1 reply; 10+ messages in thread
From: Andrew Morton @ 2026-01-13 23:55 UTC (permalink / raw)
To: Mathieu Desnoyers
Cc: linux-kernel, Paul E. McKenney, Steven Rostedt, Masami Hiramatsu,
Dennis Zhou, Tejun Heo, Christoph Lameter, Martin Liu,
David Rientjes, christian.koenig, Shakeel Butt, SeongJae Park,
Michal Hocko, Johannes Weiner, Sweet Tea Dorminy,
Lorenzo Stoakes, Liam R . Howlett, Mike Rapoport,
Suren Baghdasaryan, Vlastimil Babka, Christian Brauner, Wei Yang,
David Hildenbrand, Miaohe Lin, Al Viro, linux-mm, stable,
linux-trace-kernel, Yu Zhao, Roman Gushchin, Mateusz Guzik,
Matthew Wilcox, Baolin Wang, Aboorva Devarajan
On Tue, 13 Jan 2026 17:16:16 -0500 Mathieu Desnoyers <mathieu.desnoyers@efficios.com> wrote:
> On 2026-01-13 16:46, Andrew Morton wrote:
> > On Tue, 13 Jan 2026 14:47:34 -0500 Mathieu Desnoyers <mathieu.desnoyers@efficios.com> wrote:
> >
> >> Use the precise, albeit slower, RSS counter sums for the OOM
> >> killer task selection and proc statistics. The approximated value is
> >> too imprecise on large many-core systems.
> >
> > Thanks.
> >
> > Problem: if I also queue your "mm: Reduce latency of OOM killer task
> > selection" series then this single patch won't get tested, because the
> > larger series erases this patch, yes?
>
> That's a good point.
>
> >
> > Obvious solution: aim this patch at next-merge-window and let's look at
> > the larger series for the next -rc cycle. Thoughts?
>
> Yes, that works for me. Does it mean I should re-submit the hpcc
> series after the next merge window closes, or do you keep a queue of
> stuff waiting for the next -rc cycle somewhere ?
I do keep such a queue, but I rarely use it - things go stale quickly.
So a fresh version would be best please.
> >> Note that commit 82241a83cd15 ("mm: fix the inaccurate memory statistics
> >> issue for users") introduced get_mm_counter_sum() for precise proc
> >> memory status queries for _some_ proc files. This change renames
> >> get_mm_counter_sum() to get_mm_counter(), thus moving the rest of the
> >> proc files to the precise sum.
> >
> > Please confirm - switching /proc functions from get_mm_counter_sum() to
> > get_mm_counter() doesn't actually change anything, right? It would
> > be concerning to add possible overhead to things like task_statm().
>
> The approach proposed by this patch is to switch all proc ABIs which
> query RSS to the precise sum to eliminate any discrepancy caused by too
> imprecise approximate sums. It's a big hammer, and it can slow down
> those proc interfaces, including task_statm().
Oh, so I misunderstood.
> Is it an issue ?
Well it might be - there are a lot of users out there and they do the
weirdest stuff.
> The hpcc series introduces an approximation which provides accuracy
> limits that keep the result somewhat meaningful on large many-core
> systems.
Can we leave the non-oom related parts of procfs as-is for now, then
migrate them over to hpcc when that is available? Safer that way.
> The overall approach here would be to move back those proc interfaces
> which care about low overhead to the hpcc approximate sum when it lands
> upstream. But in order to learn that, we need to know which proc
> interface files are performance-sensitive. How can we get that data ?
Gee. Wait for the unhappy emails :(
People do sometimes search all-of-open-source for API changes, but that
doesn't cover in-house things, and tools which whack away at /proc
files are often in-house-only.
* Re: [PATCH v1 1/1] mm: Fix OOM killer and proc stats inaccuracy on large many-core systems
2026-01-13 23:55 ` Andrew Morton
@ 2026-01-14 1:22 ` Mathieu Desnoyers
2026-01-14 1:35 ` Andrew Morton
2026-01-14 8:18 ` Michal Hocko
0 siblings, 2 replies; 10+ messages in thread
From: Mathieu Desnoyers @ 2026-01-14 1:22 UTC (permalink / raw)
To: Andrew Morton
Cc: linux-kernel, Paul E. McKenney, Steven Rostedt, Masami Hiramatsu,
Dennis Zhou, Tejun Heo, Christoph Lameter, Martin Liu,
David Rientjes, christian.koenig, Shakeel Butt, SeongJae Park,
Michal Hocko, Johannes Weiner, Sweet Tea Dorminy,
Lorenzo Stoakes, Liam R . Howlett, Mike Rapoport,
Suren Baghdasaryan, Vlastimil Babka, Christian Brauner, Wei Yang,
David Hildenbrand, Miaohe Lin, Al Viro, linux-mm, stable,
linux-trace-kernel, Yu Zhao, Roman Gushchin, Mateusz Guzik,
Matthew Wilcox, Baolin Wang, Aboorva Devarajan
On 2026-01-13 18:55, Andrew Morton wrote:
> On Tue, 13 Jan 2026 17:16:16 -0500 Mathieu Desnoyers <mathieu.desnoyers@efficios.com> wrote:
>
>> The hpcc series introduces an approximation which provides accuracy
>> limits that keep the result somewhat meaningful on large many-core
>> systems.
>
> Can we leave the non-oom related parts of procfs as-is for now, then
> migrate them over to hpcc when that is available? Safer that way.
Of course.
So AFAIU the plan is:
1) update the oom accuracy fix to only use the precise sum for
the oom killer, no changes to procfs ABIs. This targets mm-new.
2) update the hpcc series to base them on top of the new fix from (1).
Update their commit messages to indicate that they bring accuracy
improvements to the procfs ABI on large many-core systems, as well as
latency improvements to the oom killer. This will target upstreaming
after the next merge window, but I will still post it soon to gather
feedback.
Does that plan look OK?
Thanks,
Mathieu
--
Mathieu Desnoyers
EfficiOS Inc.
https://www.efficios.com
* Re: [PATCH v1 1/1] mm: Fix OOM killer and proc stats inaccuracy on large many-core systems
2026-01-14 1:22 ` Mathieu Desnoyers
@ 2026-01-14 1:35 ` Andrew Morton
2026-01-14 8:18 ` Michal Hocko
1 sibling, 0 replies; 10+ messages in thread
From: Andrew Morton @ 2026-01-14 1:35 UTC (permalink / raw)
To: Mathieu Desnoyers
Cc: linux-kernel, Paul E. McKenney, Steven Rostedt, Masami Hiramatsu,
Dennis Zhou, Tejun Heo, Christoph Lameter, Martin Liu,
David Rientjes, christian.koenig, Shakeel Butt, SeongJae Park,
Michal Hocko, Johannes Weiner, Sweet Tea Dorminy,
Lorenzo Stoakes, Liam R . Howlett, Mike Rapoport,
Suren Baghdasaryan, Vlastimil Babka, Christian Brauner, Wei Yang,
David Hildenbrand, Miaohe Lin, Al Viro, linux-mm, stable,
linux-trace-kernel, Yu Zhao, Roman Gushchin, Mateusz Guzik,
Matthew Wilcox, Baolin Wang, Aboorva Devarajan
On Tue, 13 Jan 2026 20:22:16 -0500 Mathieu Desnoyers <mathieu.desnoyers@efficios.com> wrote:
> On 2026-01-13 18:55, Andrew Morton wrote:
> > On Tue, 13 Jan 2026 17:16:16 -0500 Mathieu Desnoyers <mathieu.desnoyers@efficios.com> wrote:
> >
> >> The hpcc series introduces an approximation which provides accuracy
> >> limits that keep the result somewhat meaningful on large many-core
> >> systems.
> >
> > Can we leave the non-oom related parts of procfs as-is for now, then
> > migrate them over to hpcc when that is available? Safer that way.
>
> Of course.
>
> So AFAIU the plan is:
>
> 1) update the oom accuracy fix to only use the precise sum for
> the oom killer, no changes to procfs ABIs. This targets mm-new.
>
> 2) update the hpcc series to base them on top of the new fix from (1).
> Update their commit messages to indicate that they bring accuracy
> improvements to the procfs ABI on large many-core systems, as well as
> latency improvements to the oom killer. This will target upstreaming
> after the next merge window, but I will still post it soon to gather
> feedback.
>
> Does that plan look OK ?
Perfect, thanks. Except there is no "(1)". We shall survive ;)
* Re: [PATCH v1 1/1] mm: Fix OOM killer and proc stats inaccuracy on large many-core systems
2026-01-13 19:47 ` [PATCH v1 1/1] " Mathieu Desnoyers
2026-01-13 21:46 ` Andrew Morton
@ 2026-01-14 3:18 ` Baolin Wang
2026-01-14 14:57 ` Mathieu Desnoyers
1 sibling, 1 reply; 10+ messages in thread
From: Baolin Wang @ 2026-01-14 3:18 UTC (permalink / raw)
To: Mathieu Desnoyers, Andrew Morton
Cc: linux-kernel, Paul E. McKenney, Steven Rostedt, Masami Hiramatsu,
Dennis Zhou, Tejun Heo, Christoph Lameter, Martin Liu,
David Rientjes, christian.koenig, Shakeel Butt, SeongJae Park,
Michal Hocko, Johannes Weiner, Sweet Tea Dorminy,
Lorenzo Stoakes, Liam R . Howlett, Mike Rapoport,
Suren Baghdasaryan, Vlastimil Babka, Christian Brauner, Wei Yang,
David Hildenbrand, Miaohe Lin, Al Viro, linux-mm, stable,
linux-trace-kernel, Yu Zhao, Roman Gushchin, Mateusz Guzik,
Matthew Wilcox, Aboorva Devarajan
Hi,
On 1/14/26 3:47 AM, Mathieu Desnoyers wrote:
> Use the precise, albeit slower, RSS counter sums for the OOM
> killer task selection and proc statistics. The approximated value is
> too imprecise on large many-core systems.
>
> The following rss tracking issues were noted by Sweet Tea Dorminy [1],
> which lead to picking wrong tasks as OOM kill target:
>
> Recently, several internal services had an RSS usage regression as part of a
> kernel upgrade. Previously, they were on a pre-6.2 kernel and were able to
> read RSS statistics in a backup watchdog process to monitor and decide if
> they'd overrun their memory budget. Now, however, a representative service
> with five threads, expected to use about a hundred MB of memory, on a 250-cpu
> machine had memory usage tens of megabytes different from the expected amount
> -- this constituted a significant percentage of inaccuracy, causing the
> watchdog to act.
>
> This was a result of commit f1a7941243c1 ("mm: convert mm's rss stats
> into percpu_counter") [1]. Previously, the memory error was bounded by
> 64*nr_threads pages, a very livable megabyte. Now, however, as a result of
> scheduler decisions moving the threads around the CPUs, the memory error could
> be as large as a gigabyte.
>
> This is a really tremendous inaccuracy for any few-threaded program on a
> large machine and impedes monitoring significantly. These stat counters are
> also used to make OOM killing decisions, so this additional inaccuracy could
> make a big difference in OOM situations -- either resulting in the wrong
> process being killed, or in less memory being returned from an OOM-kill than
> expected.
>
> Here is a (possibly incomplete) list of the prior approaches that were
> used or proposed, along with their downside:
>
> 1) Per-thread rss tracking: large error on many-thread processes.
>
> 2) Per-CPU counters: up to 12% slower for short-lived processes and 9%
> increased system time in make test workloads [1]. Moreover, the
> inaccuracy grows as O(n^2) with the number of CPUs.
>
> 3) Per-NUMA-node counters: requires atomics on fast-path (overhead),
> error is high with systems that have lots of NUMA nodes (32 times
> the number of NUMA nodes).
>
> The simple fix proposed here is to do the precise per-cpu counters sum
> every time a counter value needs to be read. This applies to the OOM
> killer task selection, to the /proc statistics, and to the oom mark_victim
> trace event.
>
> Note that commit 82241a83cd15 ("mm: fix the inaccurate memory statistics
> issue for users") introduced get_mm_counter_sum() for precise proc
> memory status queries for _some_ proc files. This change renames
> get_mm_counter_sum() to get_mm_counter(), thus moving the rest of the
> proc files to the precise sum.
I'm not against this patch. However, I’m concerned that it may affect
not only the rest of the proc files, but also fork(), which calls
get_mm_rss(). At least we should evaluate its impact on fork()?
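To make the fork() concern concrete, here is a minimal userspace model of the two read paths. It is only a sketch: `struct toy_counter`, the `toy_*` helpers, `NR_CPUS` and `BATCH` are hypothetical names and values chosen for illustration, not the kernel's percpu_counter implementation. It assumes per-CPU deltas are folded into the global count once their magnitude reaches the batch size.

```c
#include <assert.h>

#define NR_CPUS 256            /* hypothetical machine size */
#define BATCH   (2 * NR_CPUS)  /* assumed per-CPU fold threshold */

/* Toy stand-in for a per-CPU counter; not the kernel's struct. */
struct toy_counter {
	long global;           /* cheap, approximate value */
	long pcpu[NR_CPUS];    /* per-CPU deltas not yet folded */
};

/* Add on one CPU; fold into the global count once the delta reaches BATCH. */
static void toy_add(struct toy_counter *c, int cpu, long amount)
{
	assert(cpu >= 0 && cpu < NR_CPUS);
	c->pcpu[cpu] += amount;
	if (c->pcpu[cpu] >= BATCH || c->pcpu[cpu] <= -BATCH) {
		c->global += c->pcpu[cpu];
		c->pcpu[cpu] = 0;
	}
}

/* O(1) read: ignores unfolded per-CPU deltas, clamps negatives to 0. */
static long toy_read_positive(const struct toy_counter *c)
{
	return c->global > 0 ? c->global : 0;
}

/* O(NR_CPUS) read: walks every CPU slot to get the exact value. */
static long toy_sum_positive(const struct toy_counter *c)
{
	long sum = c->global;

	for (int cpu = 0; cpu < NR_CPUS; cpu++)
		sum += c->pcpu[cpu];
	return sum > 0 ? sum : 0;
}

/* Charge (BATCH - 1) pages on every CPU, so nothing is ever folded. */
static void toy_worst_case(struct toy_counter *c)
{
	for (int cpu = 0; cpu < NR_CPUS; cpu++)
		toy_add(c, cpu, BATCH - 1);
}
```

In this model the cheap read can miss everything still parked in the per-CPU slots, while the exact read costs one loop over all CPUs per counter. That loop is the overhead a hot path such as fork() would pay if get_mm_rss() switched to the precise sum.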
> This change effectively increases the latency introduced when the OOM
> killer executes in favor of doing a more precise OOM target task
> selection. Effectively, the OOM killer iterates on all tasks, for all
> relevant page types, for which the precise sum iterates on all possible
> CPUs.
>
> As a reference, here is the execution time of the OOM killer
> before/after the change:
>
> AMD EPYC 9654 96-Core (2 sockets)
> Within a KVM, configured with 256 logical cpus.
>
>                                   |  before  |  after   |
> ----------------------------------|----------|----------|
> nr_processes=40                   |   0.3 ms |   0.5 ms |
> nr_processes=10000                |   3.0 ms |  80.0 ms |
>
> Suggested-by: Michal Hocko <mhocko@suse.com>
> Fixes: f1a7941243c1 ("mm: convert mm's rss stats into percpu_counter")
> Link: https://lore.kernel.org/lkml/20250331223516.7810-2-sweettea-kernel@dorminy.me/ # [1]
> Signed-off-by: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
> Cc: Andrew Morton <akpm@linux-foundation.org>
> Cc: "Paul E. McKenney" <paulmck@kernel.org>
> Cc: Steven Rostedt <rostedt@goodmis.org>
> Cc: Masami Hiramatsu <mhiramat@kernel.org>
> Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
> Cc: Dennis Zhou <dennis@kernel.org>
> Cc: Tejun Heo <tj@kernel.org>
> Cc: Christoph Lameter <cl@linux.com>
> Cc: Martin Liu <liumartin@google.com>
> Cc: David Rientjes <rientjes@google.com>
> Cc: christian.koenig@amd.com
> Cc: Shakeel Butt <shakeel.butt@linux.dev>
> Cc: SeongJae Park <sj@kernel.org>
> Cc: Michal Hocko <mhocko@suse.com>
> Cc: Johannes Weiner <hannes@cmpxchg.org>
> Cc: Sweet Tea Dorminy <sweettea-kernel@dorminy.me>
> Cc: Lorenzo Stoakes <lorenzo.stoakes@oracle.com>
> Cc: "Liam R . Howlett" <liam.howlett@oracle.com>
> Cc: Mike Rapoport <rppt@kernel.org>
> Cc: Suren Baghdasaryan <surenb@google.com>
> Cc: Vlastimil Babka <vbabka@suse.cz>
> Cc: Christian Brauner <brauner@kernel.org>
> Cc: Wei Yang <richard.weiyang@gmail.com>
> Cc: David Hildenbrand <david@redhat.com>
> Cc: Miaohe Lin <linmiaohe@huawei.com>
> Cc: Al Viro <viro@zeniv.linux.org.uk>
> Cc: linux-mm@kvack.org
> Cc: stable@vger.kernel.org
> Cc: linux-trace-kernel@vger.kernel.org
> Cc: Yu Zhao <yuzhao@google.com>
> Cc: Roman Gushchin <roman.gushchin@linux.dev>
> Cc: Mateusz Guzik <mjguzik@gmail.com>
> Cc: Matthew Wilcox <willy@infradead.org>
> Cc: Baolin Wang <baolin.wang@linux.alibaba.com>
> Cc: Aboorva Devarajan <aboorvad@linux.ibm.com>
> ---
> fs/proc/task_mmu.c | 14 +++++++-------
> include/linux/mm.h | 5 -----
> 2 files changed, 7 insertions(+), 12 deletions(-)
>
> diff --git a/fs/proc/task_mmu.c b/fs/proc/task_mmu.c
> index 81dfc26bfae8..8ca4fbf53fc5 100644
> --- a/fs/proc/task_mmu.c
> +++ b/fs/proc/task_mmu.c
> @@ -39,9 +39,9 @@ void task_mem(struct seq_file *m, struct mm_struct *mm)
> unsigned long text, lib, swap, anon, file, shmem;
> unsigned long hiwater_vm, total_vm, hiwater_rss, total_rss;
>
> - anon = get_mm_counter_sum(mm, MM_ANONPAGES);
> - file = get_mm_counter_sum(mm, MM_FILEPAGES);
> - shmem = get_mm_counter_sum(mm, MM_SHMEMPAGES);
> + anon = get_mm_counter(mm, MM_ANONPAGES);
> + file = get_mm_counter(mm, MM_FILEPAGES);
> + shmem = get_mm_counter(mm, MM_SHMEMPAGES);
>
> /*
> * Note: to minimize their overhead, mm maintains hiwater_vm and
> @@ -62,7 +62,7 @@ void task_mem(struct seq_file *m, struct mm_struct *mm)
> text = min(text, mm->exec_vm << PAGE_SHIFT);
> lib = (mm->exec_vm << PAGE_SHIFT) - text;
>
> - swap = get_mm_counter_sum(mm, MM_SWAPENTS);
> + swap = get_mm_counter(mm, MM_SWAPENTS);
> SEQ_PUT_DEC("VmPeak:\t", hiwater_vm);
> SEQ_PUT_DEC(" kB\nVmSize:\t", total_vm);
> SEQ_PUT_DEC(" kB\nVmLck:\t", mm->locked_vm);
> @@ -95,12 +95,12 @@ unsigned long task_statm(struct mm_struct *mm,
> unsigned long *shared, unsigned long *text,
> unsigned long *data, unsigned long *resident)
> {
> - *shared = get_mm_counter_sum(mm, MM_FILEPAGES) +
> - get_mm_counter_sum(mm, MM_SHMEMPAGES);
> + *shared = get_mm_counter(mm, MM_FILEPAGES) +
> + get_mm_counter(mm, MM_SHMEMPAGES);
> *text = (PAGE_ALIGN(mm->end_code) - (mm->start_code & PAGE_MASK))
> >> PAGE_SHIFT;
> *data = mm->data_vm + mm->stack_vm;
> - *resident = *shared + get_mm_counter_sum(mm, MM_ANONPAGES);
> + *resident = *shared + get_mm_counter(mm, MM_ANONPAGES);
> return mm->total_vm;
> }
>
> diff --git a/include/linux/mm.h b/include/linux/mm.h
> index 6f959d8ca4b4..d096bb3593ba 100644
> --- a/include/linux/mm.h
> +++ b/include/linux/mm.h
> @@ -2847,11 +2847,6 @@ static inline bool get_user_page_fast_only(unsigned long addr,
> * per-process(per-mm_struct) statistics.
> */
> static inline unsigned long get_mm_counter(struct mm_struct *mm, int member)
> -{
> - return percpu_counter_read_positive(&mm->rss_stat[member]);
> -}
> -
> -static inline unsigned long get_mm_counter_sum(struct mm_struct *mm, int member)
> {
> return percpu_counter_sum_positive(&mm->rss_stat[member]);
> }
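The O(n^2) growth mentioned in the quoted commit message can be worked through with a little arithmetic. The sketch below assumes the default percpu_counter batch is max(32, 2 * nr_cpus) and that pages are 4 KiB; both are assumptions stated here rather than facts taken from this thread, and the function names are hypothetical.

```c
#include <assert.h>

#define PAGE_SIZE_BYTES 4096UL  /* assumption: 4 KiB pages */

/* Assumed default batch size: max(32, 2 * nr_cpus). */
static unsigned long assumed_batch(unsigned long nr_cpus)
{
	unsigned long b = 2 * nr_cpus;

	return b > 32 ? b : 32;
}

/*
 * One-sided upper bound, in pages, on what an approximate read can miss:
 * each CPU may hold up to (batch - 1) pages that are not yet folded.
 */
static unsigned long max_error_pages(unsigned long nr_cpus)
{
	return nr_cpus * (assumed_batch(nr_cpus) - 1);
}

/* Same bound expressed in MiB (rounded down). */
static unsigned long max_error_mib(unsigned long nr_cpus)
{
	return max_error_pages(nr_cpus) * PAGE_SIZE_BYTES / (1024UL * 1024UL);
}
```

Under these assumptions, max_error_mib(8) is 0 (248 pages, just under 1 MiB) while max_error_mib(256) is 511, because both factors of the product scale with the CPU count.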
* Re: [PATCH v1 1/1] mm: Fix OOM killer and proc stats inaccuracy on large many-core systems
2026-01-14 1:22 ` Mathieu Desnoyers
2026-01-14 1:35 ` Andrew Morton
@ 2026-01-14 8:18 ` Michal Hocko
1 sibling, 0 replies; 10+ messages in thread
From: Michal Hocko @ 2026-01-14 8:18 UTC (permalink / raw)
To: Mathieu Desnoyers
Cc: Andrew Morton, linux-kernel, Paul E. McKenney, Steven Rostedt,
Masami Hiramatsu, Dennis Zhou, Tejun Heo, Christoph Lameter,
Martin Liu, David Rientjes, christian.koenig, Shakeel Butt,
SeongJae Park, Johannes Weiner, Sweet Tea Dorminy,
Lorenzo Stoakes, Liam R . Howlett, Mike Rapoport,
Suren Baghdasaryan, Vlastimil Babka, Christian Brauner, Wei Yang,
David Hildenbrand, Miaohe Lin, Al Viro, linux-mm, stable,
linux-trace-kernel, Yu Zhao, Roman Gushchin, Mateusz Guzik,
Matthew Wilcox, Baolin Wang, Aboorva Devarajan
On Tue 13-01-26 20:22:16, Mathieu Desnoyers wrote:
> On 2026-01-13 18:55, Andrew Morton wrote:
> > On Tue, 13 Jan 2026 17:16:16 -0500 Mathieu Desnoyers <mathieu.desnoyers@efficios.com> wrote:
> >
> > > The hpcc series introduces an approximation which provides accuracy
> > > limits on the approximation that make the result still somewhat
> > > meaningful on large many-core systems.
> >
> > Can we leave the non-oom related parts of procfs as-is for now, then
> > migrate them over to hpcc when that is available? Safer that way.
>
> Of course.
>
> So AFAIU the plan is:
>
> 1) update the oom accuracy fix to only use the precise sum for
> the oom killer, no changes to procfs ABIs. This targets mm-new.
>
> 2) update the hpcc series to base them on top of the new fix from (1).
> Update their commit messages to indicate that they bring accuracy
> improvements to the procfs ABI on large many-core systems, as well as
> latency improvements to the oom killer. This will target upstreaming
> after the next merge window, but I will still post it soon to gather
> feedback.
>
> Does that plan look OK ?
I was about to propose the same. 1) is a regression fix and should be
merged first and go to stable trees (it is a low priority fix but still
worth having addressed).
Also a minor nit. You do not have to send a cover letter for a
single-patch series.
Thanks for working on this, Mathieu!
--
Michal Hocko
SUSE Labs
* Re: [PATCH v1 1/1] mm: Fix OOM killer and proc stats inaccuracy on large many-core systems
2026-01-14 3:18 ` Baolin Wang
@ 2026-01-14 14:57 ` Mathieu Desnoyers
0 siblings, 0 replies; 10+ messages in thread
From: Mathieu Desnoyers @ 2026-01-14 14:57 UTC (permalink / raw)
To: Baolin Wang, Andrew Morton
Cc: linux-kernel, Paul E. McKenney, Steven Rostedt, Masami Hiramatsu,
Dennis Zhou, Tejun Heo, Christoph Lameter, Martin Liu,
David Rientjes, christian.koenig, Shakeel Butt, SeongJae Park,
Michal Hocko, Johannes Weiner, Sweet Tea Dorminy,
Lorenzo Stoakes, Liam R . Howlett, Mike Rapoport,
Suren Baghdasaryan, Vlastimil Babka, Christian Brauner, Wei Yang,
David Hildenbrand, Miaohe Lin, Al Viro, linux-mm, stable,
linux-trace-kernel, Yu Zhao, Roman Gushchin, Mateusz Guzik,
Matthew Wilcox, Aboorva Devarajan
On 2026-01-13 22:18, Baolin Wang wrote:
>
> I'm not against this patch. However, I’m concerned that it may affect
> not only the rest of the proc files, but also fork(), which calls
> get_mm_rss(). At least we should evaluate its impact on fork()?
>
It's fixed in v2. I'm introducing get_mm_rss_sum() specifically for
the oom killer.
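The v2 code is not shown in this thread, so the following is only a guess at the shape of the split, written as a userspace model. Everything prefixed `toy_` is hypothetical, as is `TOY_NR_CPUS`. The idea it illustrates is that the existing RSS helper keeps the cheap approximate read, and a separate `_sum` variant does the precise walk for the oom killer only.

```c
#include <assert.h>

#define TOY_NR_CPUS 4  /* hypothetical CPU count for the model */

enum { MM_FILEPAGES, MM_ANONPAGES, MM_SWAPENTS, MM_SHMEMPAGES, NR_MM_COUNTERS };

/* Toy per-CPU counter: a global count plus unfolded per-CPU deltas. */
struct toy_counter {
	long global;
	long pcpu[TOY_NR_CPUS];
};

struct toy_mm {
	struct toy_counter rss_stat[NR_MM_COUNTERS];
};

static unsigned long toy_clamp(long v)
{
	return v > 0 ? (unsigned long)v : 0;
}

/* Approximate read: global count only, O(1). */
static unsigned long toy_get_mm_counter(const struct toy_mm *mm, int member)
{
	return toy_clamp(mm->rss_stat[member].global);
}

/* Precise read: global count plus every per-CPU delta, O(nr_cpus). */
static unsigned long toy_get_mm_counter_sum(const struct toy_mm *mm, int member)
{
	const struct toy_counter *c = &mm->rss_stat[member];
	long sum = c->global;

	for (int cpu = 0; cpu < TOY_NR_CPUS; cpu++)
		sum += c->pcpu[cpu];
	return toy_clamp(sum);
}

/* Fast-path RSS (e.g. for fork()): stays approximate. */
static unsigned long toy_get_mm_rss(const struct toy_mm *mm)
{
	return toy_get_mm_counter(mm, MM_FILEPAGES) +
	       toy_get_mm_counter(mm, MM_ANONPAGES) +
	       toy_get_mm_counter(mm, MM_SHMEMPAGES);
}

/* Slow-path RSS for the oom killer only: precise. */
static unsigned long toy_get_mm_rss_sum(const struct toy_mm *mm)
{
	return toy_get_mm_counter_sum(mm, MM_FILEPAGES) +
	       toy_get_mm_counter_sum(mm, MM_ANONPAGES) +
	       toy_get_mm_counter_sum(mm, MM_SHMEMPAGES);
}
```

With this split, callers on hot paths are untouched and only the oom victim selection pays for the per-CPU walk.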
Thanks,
Mathieu
--
Mathieu Desnoyers
EfficiOS Inc.
https://www.efficios.com
Thread overview: 10+ messages
2026-01-13 19:47 [PATCH v1 0/1] mm: Fix OOM killer and proc stats inaccuracy on large many-core systems Mathieu Desnoyers
2026-01-13 19:47 ` [PATCH v1 1/1] " Mathieu Desnoyers
2026-01-13 21:46 ` Andrew Morton
2026-01-13 22:16 ` Mathieu Desnoyers
2026-01-13 23:55 ` Andrew Morton
2026-01-14 1:22 ` Mathieu Desnoyers
2026-01-14 1:35 ` Andrew Morton
2026-01-14 8:18 ` Michal Hocko
2026-01-14 3:18 ` Baolin Wang
2026-01-14 14:57 ` Mathieu Desnoyers