From: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
To: Michal Hocko <mhocko@suse.com>
Cc: Andrew Morton <akpm@linux-foundation.org>,
linux-kernel@vger.kernel.org,
"Paul E. McKenney" <paulmck@kernel.org>,
Steven Rostedt <rostedt@goodmis.org>,
Masami Hiramatsu <mhiramat@kernel.org>,
Dennis Zhou <dennis@kernel.org>, Tejun Heo <tj@kernel.org>,
Christoph Lameter <cl@linux.com>,
Martin Liu <liumartin@google.com>,
David Rientjes <rientjes@google.com>,
christian.koenig@amd.com, Shakeel Butt <shakeel.butt@linux.dev>,
SeongJae Park <sj@kernel.org>,
Johannes Weiner <hannes@cmpxchg.org>,
Sweet Tea Dorminy <sweettea-kernel@dorminy.me>,
Lorenzo Stoakes <lorenzo.stoakes@oracle.com>,
"Liam R . Howlett" <liam.howlett@oracle.com>,
Mike Rapoport <rppt@kernel.org>,
Suren Baghdasaryan <surenb@google.com>,
Vlastimil Babka <vbabka@suse.cz>,
Christian Brauner <brauner@kernel.org>,
Wei Yang <richard.weiyang@gmail.com>,
David Hildenbrand <david@redhat.com>,
Miaohe Lin <linmiaohe@huawei.com>,
Al Viro <viro@zeniv.linux.org.uk>,
linux-mm@kvack.org, linux-trace-kernel@vger.kernel.org,
Yu Zhao <yuzhao@google.com>,
Roman Gushchin <roman.gushchin@linux.dev>,
Mateusz Guzik <mjguzik@gmail.com>,
Matthew Wilcox <willy@infradead.org>,
Baolin Wang <baolin.wang@linux.alibaba.com>,
Aboorva Devarajan <aboorvad@linux.ibm.com>
Subject: Re: [PATCH v13 2/3] mm: Fix OOM killer inaccuracy on large many-core systems
Date: Tue, 13 Jan 2026 08:51:45 -0500
Message-ID: <c99778c3-6ef0-48de-98ac-10913419ec90@efficios.com>
In-Reply-To: <aWYPWNIv4lR2FpUZ@tiehlicka>
On 2026-01-13 04:24, Michal Hocko wrote:
[...]
>> Would you be OK with introducing changes in the following order ?
>>
>> 1) Fix the OOM killer inaccuracy by using counter sum (iteration on all
>> cpu counters) in task selection. This may slow down the oom killer,
>> but would at least fix its current inaccuracy issues. This could be
>> backported to stable kernels.
>>
>> 2) Introduce the hierarchical percpu counters on top, as an OOM killer
>> task selection performance optimization (reduce latency of oom kill).
>>
>> This way, (2) becomes purely a performance optimization, so it's easy
>> to bisect and revert if it causes issues.
>
> Yes, this makes more sense.
>
>> I agree that bringing a fix along with a performance optimization within
>> a single commit makes it hard to backport to stable, and tricky to
>> revert if it causes problems.
>>
>> As for finding other users of the hpcc, I have ideas, but not so much
>> time available to try them out, as I'm pretty much doing this in my
>> spare time.
>
> I do understand this constraint and the motivation to have the OOM
> situation addressed as a priority. I am pretty sure that if you see
> issues in the OOM path then other consumers of get_mm_counter would be
> affected as well. Namely /proc/<pid>/stat.
Indeed, /proc/<pid>/stat (implemented in fs/proc/array.c:do_task_stat())
uses get_mm_rss(), which currently exports the approximate value to
userspace.
> There might be others but I can imagine
> that some of them are more performance than precision sensitive.
Agreed.
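
To make the approximate-versus-precise distinction concrete, here is a
toy userspace model in Python of a flat, batched per-CPU counter. It is
only a sketch under assumptions: the class ToyPerCpuCounter, its method
names, and the batch value are made up for illustration, and it does not
model the hierarchical counters from patch 1/3. It shows why a cheap read
that ignores the per-CPU deltas can drift from the precise sum by up to
nr_cpus * (batch - 1).

```python
class ToyPerCpuCounter:
    """Toy model (hypothetical, not kernel code) of a batched per-CPU counter."""

    def __init__(self, nr_cpus: int, batch: int) -> None:
        self.batch = batch
        self.global_count = 0          # cheap to read, possibly stale
        self.local = [0] * nr_cpus     # per-CPU deltas not yet folded in

    def add(self, cpu: int, amount: int) -> None:
        # Accumulate locally; fold into the global count only once the
        # local delta reaches the batch size.
        self.local[cpu] += amount
        if abs(self.local[cpu]) >= self.batch:
            self.global_count += self.local[cpu]
            self.local[cpu] = 0

    def read_approximate(self) -> int:
        # Fast path: a single read, ignores all per-CPU deltas.
        return self.global_count

    def sum_precise(self) -> int:
        # Slow path: iterate over every CPU's delta.
        return self.global_count + sum(self.local)

    def max_error(self) -> int:
        # Each local delta has absolute value strictly below batch.
        return len(self.local) * (self.batch - 1)


# Demo: every CPU stays just under the batch size, so nothing is folded.
counter = ToyPerCpuCounter(nr_cpus=256, batch=32)
for cpu in range(256):
    counter.add(cpu, 31)

approx = counter.read_approximate()
precise = counter.sum_precise()
print(approx, precise, counter.max_error())
```

In this model the error bound grows linearly with the number of CPUs,
which is the kind of drift that becomes significant on large many-core
systems.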
> All that being said it seems that we need slow-and-precise and
> fast-approximate interfaces to have an incremental path for other users
> as well. Looking at patch 1 it seems there are interfaces available for
> that. I think it would be great to call those out explicitly in the
> highlevel doc to give some guidance on what to use when, and with what
> kind of expectations.
I figured I'd first focus on the OOM killer's internals before tackling
the userspace ABI aspect of the problem, but since you're bringing it
up, here is what I have in mind, more or less:

- Introduce new proc files, e.g.

    /proc/<pid>/rss/approximate
    /proc/<pid>/rss/precise

  Where the "approximate" file would export the following lines for each
  page type (MM_FILEPAGES, MM_ANONPAGES, MM_SWAPENTS, MM_SHMPAGES,
  allowing future additions):

    <page type> <approximate> <precise_sum_min> <precise_sum_max>

  And "precise" would export lines for each page type:

    <page type> <precise_sum>

  The key thing here is to have different files to query approximated
  vs precise values, so we don't have the overhead of the precise sum
  when all we need is an approximation.

This would expose all the bits and pieces needed to allow userspace to
implement something similar to the 2-pass algorithm I'm proposing for
the OOM killer, but tweaked for other use-cases.

This proposed ABI is purely hypothetical at this stage. Please let me
know if you have something different in mind.
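
As a rough illustration of how userspace could consume such files, here
is a Python sketch of a two-pass selection of the largest task. It rests
on several assumptions: the file format is the hypothetical one
described above, the page type is printed as its enum name, the helper
names (parse_approximate, parse_precise, total_bounds, pick_largest) are
invented for this example, and the two-pass logic is only one plausible
reading of the approach, not the algorithm from patch 3/3. The readers
are passed in as callables, so the demo runs on canned strings rather
than on real /proc files.

```python
from typing import Callable, Dict, Iterable, Optional, Tuple


def parse_approximate(text: str) -> Dict[str, Tuple[int, int, int]]:
    """Parse '<page type> <approximate> <precise_sum_min> <precise_sum_max>' lines."""
    rows = {}
    for line in text.splitlines():
        fields = line.split()
        if len(fields) != 4:
            continue  # skip blank or unexpected lines
        name, approx, lo, hi = fields
        rows[name] = (int(approx), int(lo), int(hi))
    return rows


def parse_precise(text: str) -> Dict[str, int]:
    """Parse '<page type> <precise_sum>' lines."""
    rows = {}
    for line in text.splitlines():
        fields = line.split()
        if len(fields) == 2:
            rows[fields[0]] = int(fields[1])
    return rows


def total_bounds(rows: Dict[str, Tuple[int, int, int]]) -> Tuple[int, int]:
    """Bounds on the total: sum of per-type minima and of per-type maxima."""
    return (sum(lo for _, lo, _ in rows.values()),
            sum(hi for _, _, hi in rows.values()))


def pick_largest(pids: Iterable[int],
                 read_approx: Callable[[int], str],
                 read_precise: Callable[[int], str]) -> Optional[int]:
    # Pass 1: cheap approximate reads give a [min, max] range per task.
    bounds = {pid: total_bounds(parse_approximate(read_approx(pid)))
              for pid in pids}
    if not bounds:
        return None
    best_floor = max(lo for lo, _ in bounds.values())
    # Only tasks whose maximum reaches the best minimum can still win.
    candidates = [pid for pid, (_, hi) in bounds.items() if hi >= best_floor]
    # Pass 2: expensive precise reads, restricted to the candidates.
    return max(candidates,
               key=lambda pid: sum(parse_precise(read_precise(pid)).values()))


# Canned demo data standing in for the hypothetical proc files.
APPROX = {
    100: "MM_FILEPAGES 100 90 110\nMM_ANONPAGES 1000 900 1100\n",
    200: "MM_FILEPAGES 50 40 60\nMM_ANONPAGES 1050 950 1150\n",
    300: "MM_FILEPAGES 10 5 15\nMM_ANONPAGES 20 10 30\n",
}
PRECISE = {
    100: "MM_FILEPAGES 95\nMM_ANONPAGES 1080\n",
    200: "MM_FILEPAGES 55\nMM_ANONPAGES 1100\n",
    300: "MM_FILEPAGES 12\nMM_ANONPAGES 25\n",
}
precise_reads = []  # records which pids needed the expensive read


def fake_read_precise(pid: int) -> str:
    precise_reads.append(pid)
    return PRECISE[pid]


winner = pick_largest(APPROX, APPROX.__getitem__, fake_read_precise)
print(winner, sorted(precise_reads))
```

In the demo, pid 300 is ruled out by its approximate range alone, so the
precise file is only read for the two tasks whose ranges overlap.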

When you mention "highlevel doc", which document do you have in mind?
Something related to lib/percpu_counter_tree.c or to the /proc ABI?

Thanks,
Mathieu
--
Mathieu Desnoyers
EfficiOS Inc.
https://www.efficios.com