Re: [PATCH bpf-next v3 07/17] mm: introduce BPF OOM struct ops

linux-mm.kvack.org archive mirror
 help / color / mirror / Atom feed

From: Michal Hocko <mhocko@suse.com>
To: Roman Gushchin <roman.gushchin@linux.dev>
Cc: bpf@vger.kernel.org, Alexei Starovoitov <ast@kernel.org>,
	Matt Bobrowski <mattbobrowski@google.com>,
	Shakeel Butt <shakeel.butt@linux.dev>,
	JP Kobryn <inwardvessel@gmail.com>,
	linux-kernel@vger.kernel.org, linux-mm@kvack.org,
	Suren Baghdasaryan <surenb@google.com>,
	Johannes Weiner <hannes@cmpxchg.org>,
	Andrew Morton <akpm@linux-foundation.org>
Subject: Re: [PATCH bpf-next v3 07/17] mm: introduce BPF OOM struct ops
Date: Wed, 28 Jan 2026 09:00:45 +0100	[thread overview]
Message-ID: <aXnCLWYbQ8xZ2IyO@tiehlicka> (raw)
In-Reply-To: <7ia4tsw6hi93.fsf@castle.c.googlers.com>

On Tue 27-01-26 21:12:56, Roman Gushchin wrote:
> Michal Hocko <mhocko@suse.com> writes:
> 
> > On Mon 26-01-26 18:44:10, Roman Gushchin wrote:
> >> Introduce a bpf struct ops for implementing custom OOM handling
> >> policies.
> >> 
> >> It's possible to load one bpf_oom_ops for the system and one
> >> bpf_oom_ops for every memory cgroup. In case of a memcg OOM, the
> >> cgroup tree is traversed from the OOM'ing memcg up to the root and
> >> corresponding BPF OOM handlers are executed until some memory is
> >> freed. If no memory is freed, the kernel OOM killer is invoked.
> >> 
> >> The struct ops provides the bpf_handle_out_of_memory() callback,
> >> which expected to return 1 if it was able to free some memory and 0
> >> otherwise. If 1 is returned, the kernel also checks the bpf_memory_freed
> >> field of the oom_control structure, which is expected to be set by
> >> kfuncs suitable for releasing memory (which will be introduced later
> >> in the patch series). If both are set, OOM is considered handled,
> >> otherwise the next OOM handler in the chain is executed: e.g. BPF OOM
> >> attached to the parent cgroup or the kernel OOM killer.
> >
> > I still find this dual reporting a bit confusing. I can see your
> > intention in having a pre-defined "releasers" of the memory to trust BPF
> > handlers more but they do have access to oc->bpf_memory_freed so they
> > can manipulate it. Therefore an additional level of protection is rather
> > weak.
> 
> No, they can't. They have only a read-only access.

Could you explain this a bit more. This must be some BPF magic because
they are getting a standard pointer to oom_control.
 
> > It is also not really clear to me how this works while there is OOM
> > victim on the way out. (i.e. tsk_is_oom_victim() -> abort case). This
> > will result in no killing therefore no bpf_memory_freed, right? Handler
> > itself should consider its work done. How exactly is this handled.
> 
> It's a good question, I see your point...
> Basically we want to give a handler an option to exit with "I promise,
> some memory will be freed soon" without doing anything destructive.
> But keeping it save at the same time.

Yes, something like OOM_BACKOFF, OOM_PROCESSED, OOM_FAILED.

> I don't have a perfect answer out of my head, maybe some sort of a
> rate-limiter/counter might work? E.g. a handler can promise this N times
> before the kernel kicks in? Any ideas?

Counters usually do not work very well for async operations. In this
case there is oom_repaer and/or task exit to finish the oom operation.
The former is bound and guaranteed to make a forward progress but there
is no time frame to assume when that happens as it depends on how many
tasks might be queued (usually a single one but this is not something to
rely on because of concurrent ooms in memcgs and also multiple tasks
could be killed at the same time).

Another complication is that there are multiple levels of OOM to track
(global, NUMA, memcg) so any watchdog would have to be aware of that as
well. I am really wondering whether we really need to be so careful with
handlers. It is not like you would allow any random oom handler to be
loaded, right? Would it make sense to start without this protection and
converge to something as we see how this evolves? Maybe this will raise
the bar for oom handlers as the price for bugs is going to be really
high.

> > Also is there any way to handle the oom by increasing the memcg limit?
> > I do not see a callback for that.
> 
> There is no kfunc yet, but it's a good idea (which we accidentally
> discussed few days ago). I'll implement it.

Cool!
-- 
Michal Hocko
SUSE Labs

next prev parent reply	other threads:[~2026-01-28  8:00 UTC|newest]

Thread overview: 64+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2026-01-27  2:44 [PATCH bpf-next v3 00/17] mm: BPF OOM Roman Gushchin
2026-01-27  2:44 ` [PATCH bpf-next v3 01/17] bpf: move bpf_struct_ops_link into bpf.h Roman Gushchin
2026-01-27  5:50   ` Yafang Shao
2026-01-28 11:28   ` Matt Bobrowski
2026-01-27  2:44 ` [PATCH bpf-next v3 02/17] bpf: allow attaching struct_ops to cgroups Roman Gushchin
2026-01-27  3:08   ` bot+bpf-ci
2026-01-27  5:49   ` Yafang Shao
2026-01-28  3:10   ` Josh Don
2026-01-28 18:52     ` Roman Gushchin
2026-01-28 11:25   ` Matt Bobrowski
2026-01-28 19:18     ` Roman Gushchin
2026-01-27  2:44 ` [PATCH bpf-next v3 03/17] libbpf: fix return value on memory allocation failure Roman Gushchin
2026-01-27  5:52   ` Yafang Shao
2026-01-27  2:44 ` [PATCH bpf-next v3 04/17] libbpf: introduce bpf_map__attach_struct_ops_opts() Roman Gushchin
2026-01-27  3:08   ` bot+bpf-ci
2026-01-27  2:44 ` [PATCH bpf-next v3 05/17] bpf: mark struct oom_control's memcg field as TRUSTED_OR_NULL Roman Gushchin
2026-01-27  6:06   ` Yafang Shao
2026-02-02  4:56   ` Matt Bobrowski
2026-01-27  2:44 ` [PATCH bpf-next v3 06/17] mm: define mem_cgroup_get_from_ino() outside of CONFIG_SHRINKER_DEBUG Roman Gushchin
2026-01-27  6:12   ` Yafang Shao
2026-02-02  3:50   ` Shakeel Butt
2026-03-26  8:17   ` teawater
2026-01-27  2:44 ` [PATCH bpf-next v3 07/17] mm: introduce BPF OOM struct ops Roman Gushchin
2026-01-27  9:38   ` Michal Hocko
2026-01-27 21:12     ` Roman Gushchin
2026-01-28  8:00       ` Michal Hocko [this message]
2026-01-28 18:44         ` Roman Gushchin
2026-02-02  4:06       ` Matt Bobrowski
2026-01-28  3:26   ` Josh Don
2026-01-28 19:03     ` Roman Gushchin
2026-01-28 11:19   ` Michal Hocko
2026-01-28 18:53     ` Roman Gushchin
2026-01-29 21:00   ` Martin KaFai Lau
2026-01-30 23:29     ` Roman Gushchin
2026-02-02 20:27       ` Martin KaFai Lau
2026-01-27  2:44 ` [PATCH bpf-next v3 08/17] mm: introduce bpf_oom_kill_process() bpf kfunc Roman Gushchin
2026-01-27 20:21   ` Martin KaFai Lau
2026-01-27 20:47     ` Roman Gushchin
2026-02-02  4:49   ` Matt Bobrowski
2026-01-27  2:44 ` [PATCH bpf-next v3 09/17] mm: introduce bpf_out_of_memory() BPF kfunc Roman Gushchin
2026-01-28 20:21   ` Matt Bobrowski
2026-01-27  2:44 ` [PATCH bpf-next v3 10/17] mm: introduce bpf_task_is_oom_victim() kfunc Roman Gushchin
2026-02-02  5:39   ` Matt Bobrowski
2026-02-02 17:30     ` Alexei Starovoitov
2026-02-03  0:14       ` Roman Gushchin
2026-02-03 13:23         ` Michal Hocko
2026-02-03 16:31           ` Alexei Starovoitov
2026-02-04  9:02             ` Michal Hocko
2026-02-05  0:12               ` Alexei Starovoitov
2026-01-27  2:44 ` [PATCH bpf-next v3 11/17] bpf: selftests: introduce read_cgroup_file() helper Roman Gushchin
2026-01-27  3:08   ` bot+bpf-ci
2026-01-27  2:44 ` [PATCH bpf-next v3 12/17] bpf: selftests: BPF OOM struct ops test Roman Gushchin
2026-01-27  2:44 ` [PATCH bpf-next v3 13/17] sched: psi: add a trace point to psi_avgs_work() Roman Gushchin
2026-01-27  2:44 ` [PATCH bpf-next v3 14/17] sched: psi: add cgroup_id field to psi_group structure Roman Gushchin
2026-01-27  2:44 ` [PATCH bpf-next v3 15/17] bpf: allow calling bpf_out_of_memory() from a PSI tracepoint Roman Gushchin
2026-01-27  9:02 ` [PATCH bpf-next v3 00/17] mm: BPF OOM Michal Hocko
2026-01-27 21:01   ` Roman Gushchin
2026-01-28  8:06     ` Michal Hocko
2026-01-28 16:59       ` Alexei Starovoitov
2026-01-28 18:23         ` Roman Gushchin
2026-01-28 18:53           ` Alexei Starovoitov
2026-02-02  3:26         ` Matt Bobrowski
2026-02-02 17:50           ` Alexei Starovoitov
2026-02-04 23:52             ` Matt Bobrowski

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=aXnCLWYbQ8xZ2IyO@tiehlicka \
    --to=mhocko@suse.com \
    --cc=akpm@linux-foundation.org \
    --cc=ast@kernel.org \
    --cc=bpf@vger.kernel.org \
    --cc=hannes@cmpxchg.org \
    --cc=inwardvessel@gmail.com \
    --cc=linux-kernel@vger.kernel.org \
    --cc=linux-mm@kvack.org \
    --cc=mattbobrowski@google.com \
    --cc=roman.gushchin@linux.dev \
    --cc=shakeel.butt@linux.dev \
    --cc=surenb@google.com \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link

Be sure your reply has a Subject: header at the top and a blank line before the message body.

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox