linux-mm.kvack.org archive mirror
 help / color / mirror / Atom feed
From: Michal Hocko <mhocko@suse.com>
To: Roman Gushchin <roman.gushchin@linux.dev>
Cc: Andrew Morton <akpm@linux-foundation.org>,
	linux-kernel@vger.kernel.org, Alexei Starovoitov <ast@kernel.org>,
	Suren Baghdasaryan <surenb@google.com>,
	Shakeel Butt <shakeel.butt@linux.dev>,
	Johannes Weiner <hannes@cmpxchg.org>,
	Andrii Nakryiko <andrii@kernel.org>,
	JP Kobryn <inwardvessel@gmail.com>,
	linux-mm@kvack.org, cgroups@vger.kernel.org, bpf@vger.kernel.org,
	Martin KaFai Lau <martin.lau@kernel.org>,
	Song Liu <song@kernel.org>,
	Kumar Kartikeya Dwivedi <memxor@gmail.com>,
	Tejun Heo <tj@kernel.org>
Subject: Re: [PATCH v2 06/23] mm: introduce BPF struct ops for OOM handling
Date: Tue, 4 Nov 2025 20:22:06 +0100	[thread overview]
Message-ID: <aQpSXgUL1fUzRgyL@tiehlicka> (raw)
In-Reply-To: <87h5v93bte.fsf@linux.dev>

On Tue 04-11-25 10:14:05, Roman Gushchin wrote:
> Michal Hocko <mhocko@suse.com> writes:
> 
> > On Mon 03-11-25 17:45:09, Roman Gushchin wrote:
> >> Michal Hocko <mhocko@suse.com> writes:
> >> 
> >> > On Sun 02-11-25 13:36:25, Roman Gushchin wrote:
> >> >> Michal Hocko <mhocko@suse.com> writes:
> > [...]
> >> > No, I do not feel strongly one way or the other but I would like to
> >> > understand thinking behind that. My slight preference would be to have a
> >> > single return status that clearly describe the intention. If you want to
> >> > have more flexible chaining semantic then an enum { IGNORED, HANDLED,
> >> > PASS_TO_PARENT, ...} would be both more flexible, extensible and easier
> >> > to understand.
> >> 
> >> The thinking is simple:
> >> 1) Most users will have a single global bpf oom policy, which basically
> >> replaces the in-kernel oom killer.
> >> 2) If there are standalone containers, they might want to do the same on
> >> their level. And the "host" system doesn't directly control it.
> >> 3) If for some reason the inner oom handler fails to free up some
> >> memory, there are two potential fallback options: call the in-kernel oom
> >> killer for that memory cgroup or call an upper level bpf oom killer, if
> >> there is one.
> >> 
> >> I think the latter is more logical and less surprising. Imagine you're
> >> running multiple containers and some of them implement their own bpf oom
> >> logic and some don't. Why would we treat them differently if their bpf
> >> logic fails?
> >
> > I think both approaches are valid and it should be the actual handler to
> > tell what to do next. If the handler would prefer the in-kernel fallback
> > it should be able to enforce that rather than a potentially unknown bpf
> > handler up the chain.
> 
> The counter-argument is that cgroups are hierarchical and higher level
> cgroups should be able to enforce the desired behavior for their
> sub-trees. I'm not sure what's more important here and have to think
> more about it.

Right and they can enforce that through their limits - hence oom.

> Do you have an example when it might be important for container to not
> pass to a higher level bpf handler?

Nothing really specific. I still trying to wrap my head around what
level of flexibility is necessary here. My initial thoughts would be
just deal with it in the scope of the bpf handler and fallback to the
kernel implementation if it cannot deal with the situation. Since you
brought that up you made me think.

I know that we do not provide userspace like no-regression policy to BPF
programs but it would be still good to have a way to add new potential
fallback policies without breaking existing handlers.

> >> Re a single return value: I can absolutely specify return values as an
> >> enum, my point is that unlike the kernel code we can't fully trust the
> >> value returned from a bpf program, this is why the second check is in
> >> place.
> >
> > I do not understand this. Could you elaborate? Why we cannot trust the
> > return value but we can trust a combination of the return value and a
> > state stored in a helper structure?
> 
> Imagine bpf program which does nothing and simple returns 1. Imagine
> it's loaded as a system-wide oom handler. This will effectively disable
> the oom killer and lead to a potential deadlock on memory.
> But it's a perfectly valid bpf program.
> This is something I want to avoid (and it's a common practice with other
> bpf programs).
> 
> What I do I also rely on the value of the oom control's field, which is
> not accessible to the bpf program for write directly, but can be changed
> by calling certain helper functions, e.g. bpf_oom_kill_process.

OK, now I can see your point. You want to have a line of defense from
trusted BPF facing interface. This makes sense to me. Maybe it would be
good to call that out more explicitly. Something like 
The BPF OOM infrastructure only trusts BPF handlers which are using pre
selected functions to free up memory e.g. bpf_oom_kill_process. Those
will set an internal state not available to those handlers directly.
BPF handler return value is ignored if that state is not set.

I would rather call this differently to freed_memory as the actual
memory might be freed asynchronously (e.g. oom_reaper) and this is more
about conformity/trust than actual physical memory being freed. I do not
care much about naming as long as this is clearly document though.
Including set of functions that are forming that prescribed API.

[...]
> > OK. All I am trying to say is that only safe sleepable locks are
> > trylocks and that should be documented because I do not think it can be
> > enforced
> 
> It can! Not directly, but by controlling which kfuncs/helpers are
> available to bpf programs.

OK, I see. This is better than relying only on having this documented.

-- 
Michal Hocko
SUSE Labs


  reply	other threads:[~2025-11-04 19:22 UTC|newest]

Thread overview: 86+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2025-10-27 23:17 [PATCH v2 00/23] mm: BPF OOM Roman Gushchin
2025-10-27 23:17 ` [PATCH v2 01/23] bpf: move bpf_struct_ops_link into bpf.h Roman Gushchin
2025-10-27 23:17 ` [PATCH v2 02/23] bpf: initial support for attaching struct ops to cgroups Roman Gushchin
2025-10-27 23:48   ` bot+bpf-ci
2025-10-28 15:57     ` Roman Gushchin
2025-10-29 18:01   ` Song Liu
2025-10-29 20:26     ` Roman Gushchin
2025-10-30 17:22     ` Roman Gushchin
2025-10-30 18:03       ` Song Liu
2025-10-30 18:19         ` Amery Hung
2025-10-30 19:06           ` Roman Gushchin
2025-10-30 21:34             ` Song Liu
2025-10-30 22:42               ` Martin KaFai Lau
2025-10-30 23:14                 ` Roman Gushchin
2025-10-31  0:05                 ` Song Liu
2025-10-30 22:19             ` bpf_st_ops and cgroups. Was: " Alexei Starovoitov
2025-10-30 23:24               ` Roman Gushchin
2025-10-31  3:03                 ` Yafang Shao
2025-10-31  6:14                 ` Song Liu
2025-10-31 11:35                   ` Yafang Shao
2025-10-31 17:37                 ` Alexei Starovoitov
2025-10-29 18:14   ` Tejun Heo
2025-10-29 20:25     ` Roman Gushchin
2025-10-29 20:36       ` Tejun Heo
2025-10-29 21:18         ` Song Liu
2025-10-29 21:27           ` Tejun Heo
2025-10-29 21:37             ` Song Liu
2025-10-29 21:45               ` Tejun Heo
2025-10-30  4:32                 ` Song Liu
2025-10-30 16:13                   ` Tejun Heo
2025-10-30 17:56                     ` Song Liu
2025-10-29 21:53           ` Roman Gushchin
2025-10-29 22:43             ` Alexei Starovoitov
2025-10-29 22:53               ` Tejun Heo
2025-10-29 23:53                 ` Alexei Starovoitov
2025-10-30  0:03                   ` Tejun Heo
2025-10-30  0:16                     ` Alexei Starovoitov
2025-10-30  6:33                       ` Yafang Shao
2025-10-29 21:04   ` Song Liu
2025-10-30  0:43   ` Martin KaFai Lau
2025-10-27 23:17 ` [PATCH v2 03/23] bpf: mark struct oom_control's memcg field as TRUSTED_OR_NULL Roman Gushchin
2025-10-27 23:17 ` [PATCH v2 04/23] mm: define mem_cgroup_get_from_ino() outside of CONFIG_SHRINKER_DEBUG Roman Gushchin
2025-10-31  8:32   ` Michal Hocko
2025-10-27 23:17 ` [PATCH v2 05/23] mm: declare memcg_page_state_output() in memcontrol.h Roman Gushchin
2025-10-31  8:34   ` Michal Hocko
2025-10-27 23:17 ` [PATCH v2 06/23] mm: introduce BPF struct ops for OOM handling Roman Gushchin
2025-10-27 23:57   ` bot+bpf-ci
2025-10-28 17:45   ` Alexei Starovoitov
2025-10-28 18:42     ` Roman Gushchin
2025-10-28 22:07       ` Alexei Starovoitov
2025-10-28 22:56         ` Roman Gushchin
2025-10-28 21:33   ` Song Liu
2025-10-28 23:24     ` Roman Gushchin
2025-10-30  0:20   ` Martin KaFai Lau
2025-10-30  5:57   ` Yafang Shao
2025-10-30 14:26     ` Roman Gushchin
2025-10-31  9:02   ` Michal Hocko
2025-11-02 21:36     ` Roman Gushchin
2025-11-03 19:00       ` Michal Hocko
2025-11-04  1:45         ` Roman Gushchin
2025-11-04  8:18           ` Michal Hocko
2025-11-04 18:14             ` Roman Gushchin
2025-11-04 19:22               ` Michal Hocko [this message]
2026-01-12 14:54   ` Matt Bobrowski
2026-01-12 17:20     ` Roman Gushchin
2026-01-12 18:20       ` Matt Bobrowski
2025-10-27 23:17 ` [PATCH v2 07/23] mm: introduce bpf_oom_kill_process() bpf kfunc Roman Gushchin
2025-10-31  9:05   ` Michal Hocko
2025-11-02 21:09     ` Roman Gushchin
2025-10-27 23:17 ` [PATCH v2 08/23] mm: introduce BPF kfuncs to deal with memcg pointers Roman Gushchin
2025-10-27 23:48   ` bot+bpf-ci
2025-10-28 16:10     ` Roman Gushchin
2025-10-28 17:12       ` Alexei Starovoitov
2025-10-28 18:03         ` Chris Mason
2025-10-28 18:32           ` Roman Gushchin
2025-10-28 17:42   ` Tejun Heo
2025-10-28 18:12     ` Roman Gushchin
2025-10-27 23:17 ` [PATCH v2 09/23] mm: introduce bpf_get_root_mem_cgroup() BPF kfunc Roman Gushchin
2025-10-27 23:17 ` [PATCH v2 10/23] mm: introduce BPF kfuncs to access memcg statistics and events Roman Gushchin
2025-10-27 23:48   ` bot+bpf-ci
2025-10-28 16:16     ` Roman Gushchin
2025-10-31  9:08   ` Michal Hocko
2025-10-31  9:31 ` [PATCH v2 00/23] mm: BPF OOM Michal Hocko
2025-10-31 16:48   ` Lance Yang
2025-11-02 20:53   ` Roman Gushchin
2025-11-03 18:18     ` Michal Hocko

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=aQpSXgUL1fUzRgyL@tiehlicka \
    --to=mhocko@suse.com \
    --cc=akpm@linux-foundation.org \
    --cc=andrii@kernel.org \
    --cc=ast@kernel.org \
    --cc=bpf@vger.kernel.org \
    --cc=cgroups@vger.kernel.org \
    --cc=hannes@cmpxchg.org \
    --cc=inwardvessel@gmail.com \
    --cc=linux-kernel@vger.kernel.org \
    --cc=linux-mm@kvack.org \
    --cc=martin.lau@kernel.org \
    --cc=memxor@gmail.com \
    --cc=roman.gushchin@linux.dev \
    --cc=shakeel.butt@linux.dev \
    --cc=song@kernel.org \
    --cc=surenb@google.com \
    --cc=tj@kernel.org \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox