linux-mm.kvack.org archive mirror
* [PATCH bpf-next v3 00/17] mm: BPF OOM
@ 2026-01-27  2:44 Roman Gushchin
  2026-01-27  2:44 ` [PATCH bpf-next v3 01/17] bpf: move bpf_struct_ops_link into bpf.h Roman Gushchin
                   ` (15 more replies)
  0 siblings, 16 replies; 63+ messages in thread
From: Roman Gushchin @ 2026-01-27  2:44 UTC (permalink / raw)
  To: bpf
  Cc: Michal Hocko, Alexei Starovoitov, Matt Bobrowski, Shakeel Butt,
	JP Kobryn, linux-kernel, linux-mm, Suren Baghdasaryan,
	Johannes Weiner, Andrew Morton, Roman Gushchin

This patchset adds the ability to customize the out of memory
handling using bpf.

It focuses on two parts:
1) OOM handling policy,
2) PSI-based OOM invocation.

The idea of using bpf to customize the OOM handling is not new, but
unlike the previous proposal [1], which augmented the existing task
ranking policy, this one tries to be as generic as possible and
leverage the full power of modern bpf.

It provides a generic interface which is called before the existing OOM
killer code and allows implementing any policy, e.g. picking a victim
task or memory cgroup, or potentially even releasing memory in other
ways, e.g. by deleting tmpfs files (the latter might require some
additional but relatively simple changes).

The past attempt to implement a memory-cgroup-aware policy [2] showed
that there are multiple opinions on what the best policy is.  As it's
highly workload-dependent and specific to a concrete way of organizing
workloads, the structure of the cgroup tree, etc., a customizable
bpf-based implementation is preferable to an in-kernel implementation
with a dozen sysctls.

The second part is related to the fundamental question of when to
declare the OOM event. It's a trade-off between the risk of
unnecessary OOM kills (and the associated loss of work) and the risk
of infinite thrashing and effective soft lockups.  In the last few
years several PSI-based userspace solutions were developed (e.g. oomd
[3] or systemd-oomd [4]). The common idea was to use userspace daemons
to implement custom OOM logic and to rely on PSI monitoring to avoid
stalls. In this scenario the userspace daemon was supposed to handle
the majority of OOMs, while the in-kernel OOM killer worked as a
last-resort measure to guarantee that the system would never deadlock
on memory. But this approach creates additional infrastructure churn:
a userspace OOM daemon is a separate entity which needs to be
deployed, updated and monitored. A completely different pipeline needs
to be built to monitor both types of OOM events and collect the
associated logs. A userspace daemon is also more restricted in terms
of what data is available to it. Implementing a daemon which can work
reliably under heavy memory pressure is tricky as well.

This patchset includes the code, tests and many ideas from the patchset
of JP Kobryn, which implemented bpf kfuncs to provide a faster method
to access memcg data [5].

[1]: https://lwn.net/ml/linux-kernel/20230810081319.65668-1-zhouchuyi@bytedance.com/
[2]: https://lore.kernel.org/lkml/20171130152824.1591-1-guro@fb.com/
[3]: https://github.com/facebookincubator/oomd
[4]: https://www.freedesktop.org/software/systemd/man/latest/systemd-oomd.service.html
[5]: https://lkml.org/lkml/2025/10/15/1554

---

v3:
  1) Replaced bpf_psi struct ops with a tracepoint in psi_avgs_work() (Tejun H.)
  2) Updated bpf_oom struct ops:
     - removed bpf_oom_ctx, passing bpf_struct_ops_link instead (by Alexei S.)
     - removed handle_cgroup_offline callback.
  3) Updated kfuncs:
     - bpf_out_of_memory() dropped constraint_text argument (by Michal H.)
     - bpf_oom_kill_process() added check for OOM_SCORE_ADJ_MIN.
  4) Libbpf: updated bpf_map__attach_struct_ops_opts to use target_fd. (by Alexei S.)

v2:
  1) A single bpf_oom can be attached system-wide and a single bpf_oom per memcg.
     (by Alexei Starovoitov)
  2) Initial support for attaching struct ops to cgroups (Martin KaFai Lau,
     Andrii Nakryiko and others)
  3) bpf memcontrol kfuncs enhancements and tests (co-developed by JP Kobryn)
  4) Many small-ish fixes and cleanups (suggested by Andrew Morton, Suren Baghdasaryan,
     Andrii Nakryiko and Kumar Kartikeya Dwivedi)
  5) bpf_out_of_memory() is taking u64 flags instead of bool wait_on_oom_lock
     (suggested by Kumar Kartikeya Dwivedi)
  6) bpf_get_mem_cgroup() got KF_RCU flag (suggested by Kumar Kartikeya Dwivedi)
  7) cgroup online and offline callbacks for bpf_psi, cgroup offline for bpf_oom

v1:
  1) Both OOM and PSI parts are now implemented using bpf struct ops,
     providing a path for future extensions (suggested by Kumar Kartikeya Dwivedi,
     Song Liu and Matt Bobrowski)
  2) It's possible to create PSI triggers from BPF, no need for an additional
     userspace agent. (suggested by Suren Baghdasaryan)
     Also there is now a callback for the cgroup release event.
  3) Added an ability to block on oom_lock instead of bailing out (suggested by Michal Hocko)
  4) Added bpf_task_is_oom_victim (suggested by Michal Hocko)
  5) PSI callbacks are scheduled using a separate workqueue (suggested by Suren Baghdasaryan)

RFC:
  https://lwn.net/ml/all/20250428033617.3797686-1-roman.gushchin@linux.dev/


JP Kobryn (1):
  bpf: selftests: add config for psi

Roman Gushchin (16):
  bpf: move bpf_struct_ops_link into bpf.h
  bpf: allow attaching struct_ops to cgroups
  libbpf: fix return value on memory allocation failure
  libbpf: introduce bpf_map__attach_struct_ops_opts()
  bpf: mark struct oom_control's memcg field as TRUSTED_OR_NULL
  mm: define mem_cgroup_get_from_ino() outside of CONFIG_SHRINKER_DEBUG
  mm: introduce BPF OOM struct ops
  mm: introduce bpf_oom_kill_process() bpf kfunc
  mm: introduce bpf_out_of_memory() BPF kfunc
  mm: introduce bpf_task_is_oom_victim() kfunc
  bpf: selftests: introduce read_cgroup_file() helper
  bpf: selftests: BPF OOM struct ops test
  sched: psi: add a trace point to psi_avgs_work()
  sched: psi: add cgroup_id field to psi_group structure
  bpf: allow calling bpf_out_of_memory() from a PSI tracepoint
  bpf: selftests: PSI struct ops test

 MAINTAINERS                                   |   2 +
 include/linux/bpf-cgroup-defs.h               |   6 +
 include/linux/bpf-cgroup.h                    |  16 ++
 include/linux/bpf.h                           |  10 +
 include/linux/bpf_oom.h                       |  46 ++++
 include/linux/memcontrol.h                    |   4 +-
 include/linux/oom.h                           |  13 +
 include/linux/psi_types.h                     |   4 +
 include/trace/events/psi.h                    |  27 ++
 include/uapi/linux/bpf.h                      |   3 +
 kernel/bpf/bpf_struct_ops.c                   |  77 +++++-
 kernel/bpf/cgroup.c                           |  46 ++++
 kernel/bpf/verifier.c                         |   5 +
 kernel/sched/psi.c                            |   7 +
 mm/Makefile                                   |   2 +-
 mm/bpf_oom.c                                  | 192 +++++++++++++
 mm/memcontrol.c                               |   2 -
 mm/oom_kill.c                                 | 202 ++++++++++++++
 tools/include/uapi/linux/bpf.h                |   1 +
 tools/lib/bpf/libbpf.c                        |  22 +-
 tools/lib/bpf/libbpf.h                        |  14 +
 tools/lib/bpf/libbpf.map                      |   1 +
 tools/testing/selftests/bpf/cgroup_helpers.c  |  45 +++
 tools/testing/selftests/bpf/cgroup_helpers.h  |   3 +
 tools/testing/selftests/bpf/config            |   1 +
 .../selftests/bpf/prog_tests/test_oom.c       | 256 ++++++++++++++++++
 .../selftests/bpf/prog_tests/test_psi.c       | 225 +++++++++++++++
 tools/testing/selftests/bpf/progs/test_oom.c  | 111 ++++++++
 tools/testing/selftests/bpf/progs/test_psi.c  |  90 ++++++
 29 files changed, 1412 insertions(+), 21 deletions(-)
 create mode 100644 include/linux/bpf_oom.h
 create mode 100644 include/trace/events/psi.h
 create mode 100644 mm/bpf_oom.c
 create mode 100644 tools/testing/selftests/bpf/prog_tests/test_oom.c
 create mode 100644 tools/testing/selftests/bpf/prog_tests/test_psi.c
 create mode 100644 tools/testing/selftests/bpf/progs/test_oom.c
 create mode 100644 tools/testing/selftests/bpf/progs/test_psi.c

-- 
2.52.0



^ permalink raw reply	[flat|nested] 63+ messages in thread

* [PATCH bpf-next v3 01/17] bpf: move bpf_struct_ops_link into bpf.h
  2026-01-27  2:44 [PATCH bpf-next v3 00/17] mm: BPF OOM Roman Gushchin
@ 2026-01-27  2:44 ` Roman Gushchin
  2026-01-27  5:50   ` Yafang Shao
  2026-01-28 11:28   ` Matt Bobrowski
  2026-01-27  2:44 ` [PATCH bpf-next v3 02/17] bpf: allow attaching struct_ops to cgroups Roman Gushchin
                   ` (14 subsequent siblings)
  15 siblings, 2 replies; 63+ messages in thread
From: Roman Gushchin @ 2026-01-27  2:44 UTC (permalink / raw)
  To: bpf
  Cc: Michal Hocko, Alexei Starovoitov, Matt Bobrowski, Shakeel Butt,
	JP Kobryn, linux-kernel, linux-mm, Suren Baghdasaryan,
	Johannes Weiner, Andrew Morton, Roman Gushchin

Move the definition of struct bpf_struct_ops_link into bpf.h,
where the other custom bpf link definitions are.

It's necessary to access its members from outside of the generic
bpf_struct_ops implementation, which will be done by the following
patches in the series.

Signed-off-by: Roman Gushchin <roman.gushchin@linux.dev>
---
 include/linux/bpf.h         | 6 ++++++
 kernel/bpf/bpf_struct_ops.c | 6 ------
 2 files changed, 6 insertions(+), 6 deletions(-)

diff --git a/include/linux/bpf.h b/include/linux/bpf.h
index 4427c6e98331..899dd911dc82 100644
--- a/include/linux/bpf.h
+++ b/include/linux/bpf.h
@@ -1891,6 +1891,12 @@ struct bpf_raw_tp_link {
 	u64 cookie;
 };
 
+struct bpf_struct_ops_link {
+	struct bpf_link link;
+	struct bpf_map __rcu *map;
+	wait_queue_head_t wait_hup;
+};
+
 struct bpf_link_primer {
 	struct bpf_link *link;
 	struct file *file;
diff --git a/kernel/bpf/bpf_struct_ops.c b/kernel/bpf/bpf_struct_ops.c
index c43346cb3d76..de01cf3025b3 100644
--- a/kernel/bpf/bpf_struct_ops.c
+++ b/kernel/bpf/bpf_struct_ops.c
@@ -55,12 +55,6 @@ struct bpf_struct_ops_map {
 	struct bpf_struct_ops_value kvalue;
 };
 
-struct bpf_struct_ops_link {
-	struct bpf_link link;
-	struct bpf_map __rcu *map;
-	wait_queue_head_t wait_hup;
-};
-
 static DEFINE_MUTEX(update_mutex);
 
 #define VALUE_PREFIX "bpf_struct_ops_"
-- 
2.52.0



^ permalink raw reply	[flat|nested] 63+ messages in thread

* [PATCH bpf-next v3 02/17] bpf: allow attaching struct_ops to cgroups
  2026-01-27  2:44 [PATCH bpf-next v3 00/17] mm: BPF OOM Roman Gushchin
  2026-01-27  2:44 ` [PATCH bpf-next v3 01/17] bpf: move bpf_struct_ops_link into bpf.h Roman Gushchin
@ 2026-01-27  2:44 ` Roman Gushchin
  2026-01-27  3:08   ` bot+bpf-ci
                     ` (3 more replies)
  2026-01-27  2:44 ` [PATCH bpf-next v3 03/17] libbpf: fix return value on memory allocation failure Roman Gushchin
                   ` (13 subsequent siblings)
  15 siblings, 4 replies; 63+ messages in thread
From: Roman Gushchin @ 2026-01-27  2:44 UTC (permalink / raw)
  To: bpf
  Cc: Michal Hocko, Alexei Starovoitov, Matt Bobrowski, Shakeel Butt,
	JP Kobryn, linux-kernel, linux-mm, Suren Baghdasaryan,
	Johannes Weiner, Andrew Morton, Roman Gushchin

Introduce the ability to attach bpf struct_ops to cgroups.

From the user's standpoint it works in the following way:
the user passes the BPF_F_CGROUP_FD flag and specifies the target
cgroup fd while creating a struct_ops link. As a result, the bpf
struct_ops link is created and attached to the cgroup.

The cgroup.bpf structure maintains a list of attached struct_ops
links. If the cgroup is deleted, the attached struct_ops are
auto-detached and the userspace program gets a notification.

This change doesn't answer the question of how bpf programs belonging
to these struct_ops will be executed. It will be done individually
for every bpf struct_ops which supports this.

Please note that unlike "normal" bpf programs, struct_ops are
not propagated to cgroup sub-trees.
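
A minimal user-space sketch of the new flag (illustrative only: the
cgroup path and the struct_ops map fd are placeholders, and
BPF_F_CGROUP_FD requires the updated uapi header):

#include <fcntl.h>
#include <unistd.h>
#include <bpf/bpf.h>

/* Attach a loaded struct_ops map (map_fd) to the cgroup at cgrp_path.
 * For struct_ops links bpf_link_create() takes the map fd in place of
 * a program fd; with BPF_F_CGROUP_FD, target_fd is the cgroup fd.
 */
static int attach_struct_ops_to_cgroup(int map_fd, const char *cgrp_path)
{
	LIBBPF_OPTS(bpf_link_create_opts, opts, .flags = BPF_F_CGROUP_FD);
	int cgroup_fd, link_fd;

	cgroup_fd = open(cgrp_path, O_RDONLY);
	if (cgroup_fd < 0)
		return -1;

	link_fd = bpf_link_create(map_fd, cgroup_fd, BPF_STRUCT_OPS, &opts);
	close(cgroup_fd);
	return link_fd;
}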

Signed-off-by: Roman Gushchin <roman.gushchin@linux.dev>
---
 include/linux/bpf-cgroup-defs.h |  3 ++
 include/linux/bpf-cgroup.h      | 16 +++++++++
 include/linux/bpf.h             |  3 ++
 include/uapi/linux/bpf.h        |  3 ++
 kernel/bpf/bpf_struct_ops.c     | 59 ++++++++++++++++++++++++++++++---
 kernel/bpf/cgroup.c             | 46 +++++++++++++++++++++++++
 tools/include/uapi/linux/bpf.h  |  1 +
 7 files changed, 127 insertions(+), 4 deletions(-)

diff --git a/include/linux/bpf-cgroup-defs.h b/include/linux/bpf-cgroup-defs.h
index c9e6b26abab6..6c5e37190dad 100644
--- a/include/linux/bpf-cgroup-defs.h
+++ b/include/linux/bpf-cgroup-defs.h
@@ -71,6 +71,9 @@ struct cgroup_bpf {
 	/* temp storage for effective prog array used by prog_attach/detach */
 	struct bpf_prog_array *inactive;
 
+	/* list of bpf struct ops links */
+	struct list_head struct_ops_links;
+
 	/* reference counter used to detach bpf programs after cgroup removal */
 	struct percpu_ref refcnt;
 
diff --git a/include/linux/bpf-cgroup.h b/include/linux/bpf-cgroup.h
index 2f535331f926..a6c327257006 100644
--- a/include/linux/bpf-cgroup.h
+++ b/include/linux/bpf-cgroup.h
@@ -423,6 +423,11 @@ int cgroup_bpf_link_attach(const union bpf_attr *attr, struct bpf_prog *prog);
 int cgroup_bpf_prog_query(const union bpf_attr *attr,
 			  union bpf_attr __user *uattr);
 
+int cgroup_bpf_attach_struct_ops(struct cgroup *cgrp,
+				 struct bpf_struct_ops_link *link);
+void cgroup_bpf_detach_struct_ops(struct cgroup *cgrp,
+				  struct bpf_struct_ops_link *link);
+
 const struct bpf_func_proto *
 cgroup_common_func_proto(enum bpf_func_id func_id, const struct bpf_prog *prog);
 #else
@@ -451,6 +456,17 @@ static inline int cgroup_bpf_link_attach(const union bpf_attr *attr,
 	return -EINVAL;
 }
 
+static inline int cgroup_bpf_attach_struct_ops(struct cgroup *cgrp,
+					       struct bpf_struct_ops_link *link)
+{
+	return -EINVAL;
+}
+
+static inline void cgroup_bpf_detach_struct_ops(struct cgroup *cgrp,
+						struct bpf_struct_ops_link *link)
+{
+}
+
 static inline int cgroup_bpf_prog_query(const union bpf_attr *attr,
 					union bpf_attr __user *uattr)
 {
diff --git a/include/linux/bpf.h b/include/linux/bpf.h
index 899dd911dc82..391888eb257c 100644
--- a/include/linux/bpf.h
+++ b/include/linux/bpf.h
@@ -1894,6 +1894,9 @@ struct bpf_raw_tp_link {
 struct bpf_struct_ops_link {
 	struct bpf_link link;
 	struct bpf_map __rcu *map;
+	struct cgroup *cgroup;
+	bool cgroup_removed;
+	struct list_head list;
 	wait_queue_head_t wait_hup;
 };
 
diff --git a/include/uapi/linux/bpf.h b/include/uapi/linux/bpf.h
index 44e7dbc278e3..28544e8af1cd 100644
--- a/include/uapi/linux/bpf.h
+++ b/include/uapi/linux/bpf.h
@@ -1237,6 +1237,7 @@ enum bpf_perf_event_type {
 #define BPF_F_AFTER		(1U << 4)
 #define BPF_F_ID		(1U << 5)
 #define BPF_F_PREORDER		(1U << 6)
+#define BPF_F_CGROUP_FD		(1U << 7)
 #define BPF_F_LINK		BPF_F_LINK /* 1 << 13 */
 
 /* If BPF_F_STRICT_ALIGNMENT is used in BPF_PROG_LOAD command, the
@@ -6775,6 +6776,8 @@ struct bpf_link_info {
 		} xdp;
 		struct {
 			__u32 map_id;
+			__u32 :32;
+			__u64 cgroup_id;
 		} struct_ops;
 		struct {
 			__u32 pf;
diff --git a/kernel/bpf/bpf_struct_ops.c b/kernel/bpf/bpf_struct_ops.c
index de01cf3025b3..2e361e22cfa0 100644
--- a/kernel/bpf/bpf_struct_ops.c
+++ b/kernel/bpf/bpf_struct_ops.c
@@ -13,6 +13,8 @@
 #include <linux/btf_ids.h>
 #include <linux/rcupdate_wait.h>
 #include <linux/poll.h>
+#include <linux/bpf-cgroup.h>
+#include <linux/cgroup.h>
 
 struct bpf_struct_ops_value {
 	struct bpf_struct_ops_common_value common;
@@ -1220,6 +1222,10 @@ static void bpf_struct_ops_map_link_dealloc(struct bpf_link *link)
 		st_map->st_ops_desc->st_ops->unreg(&st_map->kvalue.data, link);
 		bpf_map_put(&st_map->map);
 	}
+
+	if (st_link->cgroup)
+		cgroup_bpf_detach_struct_ops(st_link->cgroup, st_link);
+
 	kfree(st_link);
 }
 
@@ -1228,6 +1234,7 @@ static void bpf_struct_ops_map_link_show_fdinfo(const struct bpf_link *link,
 {
 	struct bpf_struct_ops_link *st_link;
 	struct bpf_map *map;
+	u64 cgrp_id = 0;
 
 	st_link = container_of(link, struct bpf_struct_ops_link, link);
 	rcu_read_lock();
@@ -1235,6 +1242,14 @@ static void bpf_struct_ops_map_link_show_fdinfo(const struct bpf_link *link,
 	if (map)
 		seq_printf(seq, "map_id:\t%d\n", map->id);
 	rcu_read_unlock();
+
+	cgroup_lock();
+	if (st_link->cgroup)
+		cgrp_id = cgroup_id(st_link->cgroup);
+	cgroup_unlock();
+
+	if (cgrp_id)
+		seq_printf(seq, "cgroup_id:\t%llu\n", cgrp_id);
 }
 
 static int bpf_struct_ops_map_link_fill_link_info(const struct bpf_link *link,
@@ -1242,6 +1257,7 @@ static int bpf_struct_ops_map_link_fill_link_info(const struct bpf_link *link,
 {
 	struct bpf_struct_ops_link *st_link;
 	struct bpf_map *map;
+	u64 cgrp_id = 0;
 
 	st_link = container_of(link, struct bpf_struct_ops_link, link);
 	rcu_read_lock();
@@ -1249,6 +1265,13 @@ static int bpf_struct_ops_map_link_fill_link_info(const struct bpf_link *link,
 	if (map)
 		info->struct_ops.map_id = map->id;
 	rcu_read_unlock();
+
+	cgroup_lock();
+	if (st_link->cgroup)
+		cgrp_id = cgroup_id(st_link->cgroup);
+	cgroup_unlock();
+
+	info->struct_ops.cgroup_id = cgrp_id;
 	return 0;
 }
 
@@ -1327,6 +1350,9 @@ static int bpf_struct_ops_map_link_detach(struct bpf_link *link)
 
 	mutex_unlock(&update_mutex);
 
+	if (st_link->cgroup)
+		cgroup_bpf_detach_struct_ops(st_link->cgroup, st_link);
+
 	wake_up_interruptible_poll(&st_link->wait_hup, EPOLLHUP);
 
 	return 0;
@@ -1339,6 +1365,9 @@ static __poll_t bpf_struct_ops_map_link_poll(struct file *file,
 
 	poll_wait(file, &st_link->wait_hup, pts);
 
+	if (st_link->cgroup_removed)
+		return EPOLLHUP;
+
 	return rcu_access_pointer(st_link->map) ? 0 : EPOLLHUP;
 }
 
@@ -1357,8 +1386,12 @@ int bpf_struct_ops_link_create(union bpf_attr *attr)
 	struct bpf_link_primer link_primer;
 	struct bpf_struct_ops_map *st_map;
 	struct bpf_map *map;
+	struct cgroup *cgrp;
 	int err;
 
+	if (attr->link_create.flags & ~BPF_F_CGROUP_FD)
+		return -EINVAL;
+
 	map = bpf_map_get(attr->link_create.map_fd);
 	if (IS_ERR(map))
 		return PTR_ERR(map);
@@ -1378,11 +1411,26 @@ int bpf_struct_ops_link_create(union bpf_attr *attr)
 	bpf_link_init(&link->link, BPF_LINK_TYPE_STRUCT_OPS, &bpf_struct_ops_map_lops, NULL,
 		      attr->link_create.attach_type);
 
+	init_waitqueue_head(&link->wait_hup);
+
+	if (attr->link_create.flags & BPF_F_CGROUP_FD) {
+		cgrp = cgroup_get_from_fd(attr->link_create.target_fd);
+		if (IS_ERR(cgrp)) {
+			err = PTR_ERR(cgrp);
+			goto err_out;
+		}
+		link->cgroup = cgrp;
+		err = cgroup_bpf_attach_struct_ops(cgrp, link);
+		if (err) {
+			cgroup_put(cgrp);
+			link->cgroup = NULL;
+			goto err_out;
+		}
+	}
+
 	err = bpf_link_prime(&link->link, &link_primer);
 	if (err)
-		goto err_out;
-
-	init_waitqueue_head(&link->wait_hup);
+		goto err_put_cgroup;
 
 	/* Hold the update_mutex such that the subsystem cannot
 	 * do link->ops->detach() before the link is fully initialized.
@@ -1393,13 +1441,16 @@ int bpf_struct_ops_link_create(union bpf_attr *attr)
 		mutex_unlock(&update_mutex);
 		bpf_link_cleanup(&link_primer);
 		link = NULL;
-		goto err_out;
+		goto err_put_cgroup;
 	}
 	RCU_INIT_POINTER(link->map, map);
 	mutex_unlock(&update_mutex);
 
 	return bpf_link_settle(&link_primer);
 
+err_put_cgroup:
+	if (link && link->cgroup)
+		cgroup_bpf_detach_struct_ops(link->cgroup, link);
 err_out:
 	bpf_map_put(map);
 	kfree(link);
diff --git a/kernel/bpf/cgroup.c b/kernel/bpf/cgroup.c
index 69988af44b37..7b1903be6f69 100644
--- a/kernel/bpf/cgroup.c
+++ b/kernel/bpf/cgroup.c
@@ -16,6 +16,7 @@
 #include <linux/bpf-cgroup.h>
 #include <linux/bpf_lsm.h>
 #include <linux/bpf_verifier.h>
+#include <linux/poll.h>
 #include <net/sock.h>
 #include <net/bpf_sk_storage.h>
 
@@ -307,12 +308,23 @@ static void cgroup_bpf_release(struct work_struct *work)
 					       bpf.release_work);
 	struct bpf_prog_array *old_array;
 	struct list_head *storages = &cgrp->bpf.storages;
+	struct bpf_struct_ops_link *st_link, *st_tmp;
 	struct bpf_cgroup_storage *storage, *stmp;
+	LIST_HEAD(st_links);
 
 	unsigned int atype;
 
 	cgroup_lock();
 
+	list_splice_init(&cgrp->bpf.struct_ops_links, &st_links);
+	list_for_each_entry_safe(st_link, st_tmp, &st_links, list) {
+		st_link->cgroup = NULL;
+		st_link->cgroup_removed = true;
+		cgroup_put(cgrp);
+		if (IS_ERR(bpf_link_inc_not_zero(&st_link->link)))
+			list_del(&st_link->list);
+	}
+
 	for (atype = 0; atype < ARRAY_SIZE(cgrp->bpf.progs); atype++) {
 		struct hlist_head *progs = &cgrp->bpf.progs[atype];
 		struct bpf_prog_list *pl;
@@ -346,6 +358,11 @@ static void cgroup_bpf_release(struct work_struct *work)
 
 	cgroup_unlock();
 
+	list_for_each_entry_safe(st_link, st_tmp, &st_links, list) {
+		st_link->link.ops->detach(&st_link->link);
+		bpf_link_put(&st_link->link);
+	}
+
 	for (p = cgroup_parent(cgrp); p; p = cgroup_parent(p))
 		cgroup_bpf_put(p);
 
@@ -525,6 +542,7 @@ static int cgroup_bpf_inherit(struct cgroup *cgrp)
 		INIT_HLIST_HEAD(&cgrp->bpf.progs[i]);
 
 	INIT_LIST_HEAD(&cgrp->bpf.storages);
+	INIT_LIST_HEAD(&cgrp->bpf.struct_ops_links);
 
 	for (i = 0; i < NR; i++)
 		if (compute_effective_progs(cgrp, i, &arrays[i]))
@@ -2759,3 +2777,31 @@ cgroup_common_func_proto(enum bpf_func_id func_id, const struct bpf_prog *prog)
 		return NULL;
 	}
 }
+
+int cgroup_bpf_attach_struct_ops(struct cgroup *cgrp,
+				 struct bpf_struct_ops_link *link)
+{
+	int ret = 0;
+
+	cgroup_lock();
+	if (percpu_ref_is_zero(&cgrp->bpf.refcnt)) {
+		ret = -EBUSY;
+		goto out;
+	}
+	list_add_tail(&link->list, &cgrp->bpf.struct_ops_links);
+out:
+	cgroup_unlock();
+	return ret;
+}
+
+void cgroup_bpf_detach_struct_ops(struct cgroup *cgrp,
+				  struct bpf_struct_ops_link *link)
+{
+	cgroup_lock();
+	if (link->cgroup == cgrp) {
+		list_del(&link->list);
+		link->cgroup = NULL;
+		cgroup_put(cgrp);
+	}
+	cgroup_unlock();
+}
diff --git a/tools/include/uapi/linux/bpf.h b/tools/include/uapi/linux/bpf.h
index 3ca7d76e05f0..d5492e60744a 100644
--- a/tools/include/uapi/linux/bpf.h
+++ b/tools/include/uapi/linux/bpf.h
@@ -1237,6 +1237,7 @@ enum bpf_perf_event_type {
 #define BPF_F_AFTER		(1U << 4)
 #define BPF_F_ID		(1U << 5)
 #define BPF_F_PREORDER		(1U << 6)
+#define BPF_F_CGROUP_FD		(1U << 7)
 #define BPF_F_LINK		BPF_F_LINK /* 1 << 13 */
 
 /* If BPF_F_STRICT_ALIGNMENT is used in BPF_PROG_LOAD command, the
-- 
2.52.0



^ permalink raw reply	[flat|nested] 63+ messages in thread

* [PATCH bpf-next v3 03/17] libbpf: fix return value on memory allocation failure
  2026-01-27  2:44 [PATCH bpf-next v3 00/17] mm: BPF OOM Roman Gushchin
  2026-01-27  2:44 ` [PATCH bpf-next v3 01/17] bpf: move bpf_struct_ops_link into bpf.h Roman Gushchin
  2026-01-27  2:44 ` [PATCH bpf-next v3 02/17] bpf: allow attaching struct_ops to cgroups Roman Gushchin
@ 2026-01-27  2:44 ` Roman Gushchin
  2026-01-27  5:52   ` Yafang Shao
  2026-01-27  2:44 ` [PATCH bpf-next v3 04/17] libbpf: introduce bpf_map__attach_struct_ops_opts() Roman Gushchin
                   ` (12 subsequent siblings)
  15 siblings, 1 reply; 63+ messages in thread
From: Roman Gushchin @ 2026-01-27  2:44 UTC (permalink / raw)
  To: bpf
  Cc: Michal Hocko, Alexei Starovoitov, Matt Bobrowski, Shakeel Butt,
	JP Kobryn, linux-kernel, linux-mm, Suren Baghdasaryan,
	Johannes Weiner, Andrew Morton, Roman Gushchin

bpf_map__attach_struct_ops() returns -EINVAL instead of -ENOMEM
on memory allocation failure. Fix it.

Fixes: 590a00888250 ("bpf: libbpf: Add STRUCT_OPS support")
Signed-off-by: Roman Gushchin <roman.gushchin@linux.dev>
---
 tools/lib/bpf/libbpf.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/tools/lib/bpf/libbpf.c b/tools/lib/bpf/libbpf.c
index 0c8bf0b5cce4..46d2762f5993 100644
--- a/tools/lib/bpf/libbpf.c
+++ b/tools/lib/bpf/libbpf.c
@@ -13480,7 +13480,7 @@ struct bpf_link *bpf_map__attach_struct_ops(const struct bpf_map *map)
 
 	link = calloc(1, sizeof(*link));
 	if (!link)
-		return libbpf_err_ptr(-EINVAL);
+		return libbpf_err_ptr(-ENOMEM);
 
 	/* kern_vdata should be prepared during the loading phase. */
 	err = bpf_map_update_elem(map->fd, &zero, map->st_ops->kern_vdata, 0);
-- 
2.52.0



^ permalink raw reply	[flat|nested] 63+ messages in thread

* [PATCH bpf-next v3 04/17] libbpf: introduce bpf_map__attach_struct_ops_opts()
  2026-01-27  2:44 [PATCH bpf-next v3 00/17] mm: BPF OOM Roman Gushchin
                   ` (2 preceding siblings ...)
  2026-01-27  2:44 ` [PATCH bpf-next v3 03/17] libbpf: fix return value on memory allocation failure Roman Gushchin
@ 2026-01-27  2:44 ` Roman Gushchin
  2026-01-27  3:08   ` bot+bpf-ci
  2026-01-27  2:44 ` [PATCH bpf-next v3 05/17] bpf: mark struct oom_control's memcg field as TRUSTED_OR_NULL Roman Gushchin
                   ` (11 subsequent siblings)
  15 siblings, 1 reply; 63+ messages in thread
From: Roman Gushchin @ 2026-01-27  2:44 UTC (permalink / raw)
  To: bpf
  Cc: Michal Hocko, Alexei Starovoitov, Matt Bobrowski, Shakeel Butt,
	JP Kobryn, linux-kernel, linux-mm, Suren Baghdasaryan,
	Johannes Weiner, Andrew Morton, Roman Gushchin

Introduce bpf_map__attach_struct_ops_opts(), an extended version of
bpf_map__attach_struct_ops(), which takes an additional struct
bpf_struct_ops_opts argument.

This allows passing a target_fd and the BPF_F_CGROUP_FD flag, and
thereby attaching the struct_ops to a cgroup.
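
Intended usage sketch (the function and map names are hypothetical;
BPF_F_CGROUP_FD comes from the previous patch):

#include <bpf/libbpf.h>

/* Attach a struct_ops map from a loaded object to a cgroup */
static struct bpf_link *attach_oom_ops(struct bpf_map *oom_ops_map,
				       int cgroup_fd)
{
	LIBBPF_OPTS(bpf_struct_ops_opts, opts,
		    .flags = BPF_F_CGROUP_FD,
		    .target_fd = cgroup_fd);

	/* returns NULL and sets errno on failure */
	return bpf_map__attach_struct_ops_opts(oom_ops_map, &opts);
}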

Signed-off-by: Roman Gushchin <roman.gushchin@linux.dev>
---
 tools/lib/bpf/libbpf.c   | 20 +++++++++++++++++---
 tools/lib/bpf/libbpf.h   | 14 ++++++++++++++
 tools/lib/bpf/libbpf.map |  1 +
 3 files changed, 32 insertions(+), 3 deletions(-)

diff --git a/tools/lib/bpf/libbpf.c b/tools/lib/bpf/libbpf.c
index 46d2762f5993..9ba67089bf9d 100644
--- a/tools/lib/bpf/libbpf.c
+++ b/tools/lib/bpf/libbpf.c
@@ -13462,11 +13462,18 @@ static int bpf_link__detach_struct_ops(struct bpf_link *link)
 	return close(link->fd);
 }
 
-struct bpf_link *bpf_map__attach_struct_ops(const struct bpf_map *map)
+struct bpf_link *bpf_map__attach_struct_ops_opts(const struct bpf_map *map,
+						 const struct bpf_struct_ops_opts *opts)
 {
+	DECLARE_LIBBPF_OPTS(bpf_link_create_opts, link_opts);
 	struct bpf_link_struct_ops *link;
+	int err, fd, target_fd;
 	__u32 zero = 0;
-	int err, fd;
+
+	if (!OPTS_VALID(opts, bpf_struct_ops_opts)) {
+		pr_warn("map '%s': invalid opts\n", map->name);
+		return libbpf_err_ptr(-EINVAL);
+	}
 
 	if (!bpf_map__is_struct_ops(map)) {
 		pr_warn("map '%s': can't attach non-struct_ops map\n", map->name);
@@ -13503,7 +13510,9 @@ struct bpf_link *bpf_map__attach_struct_ops(const struct bpf_map *map)
 		return &link->link;
 	}
 
-	fd = bpf_link_create(map->fd, 0, BPF_STRUCT_OPS, NULL);
+	link_opts.flags = OPTS_GET(opts, flags, 0);
+	target_fd = OPTS_GET(opts, target_fd, 0);
+	fd = bpf_link_create(map->fd, target_fd, BPF_STRUCT_OPS, &link_opts);
 	if (fd < 0) {
 		free(link);
 		return libbpf_err_ptr(fd);
@@ -13515,6 +13524,11 @@ struct bpf_link *bpf_map__attach_struct_ops(const struct bpf_map *map)
 	return &link->link;
 }
 
+struct bpf_link *bpf_map__attach_struct_ops(const struct bpf_map *map)
+{
+	return bpf_map__attach_struct_ops_opts(map, NULL);
+}
+
 /*
  * Swap the back struct_ops of a link with a new struct_ops map.
  */
diff --git a/tools/lib/bpf/libbpf.h b/tools/lib/bpf/libbpf.h
index dfc37a615578..2c28cf80e7fe 100644
--- a/tools/lib/bpf/libbpf.h
+++ b/tools/lib/bpf/libbpf.h
@@ -920,6 +920,20 @@ bpf_program__attach_cgroup_opts(const struct bpf_program *prog, int cgroup_fd,
 struct bpf_map;
 
 LIBBPF_API struct bpf_link *bpf_map__attach_struct_ops(const struct bpf_map *map);
+
+struct bpf_struct_ops_opts {
+	/* size of this struct, for forward/backward compatibility */
+	size_t sz;
+	__u32 flags;
+	__u32 target_fd;
+	__u64 expected_revision;
+	size_t :0;
+};
+#define bpf_struct_ops_opts__last_field expected_revision
+
+LIBBPF_API struct bpf_link *
+bpf_map__attach_struct_ops_opts(const struct bpf_map *map,
+				const struct bpf_struct_ops_opts *opts);
 LIBBPF_API int bpf_link__update_map(struct bpf_link *link, const struct bpf_map *map);
 
 struct bpf_iter_attach_opts {
diff --git a/tools/lib/bpf/libbpf.map b/tools/lib/bpf/libbpf.map
index d18fbcea7578..4779190c97b6 100644
--- a/tools/lib/bpf/libbpf.map
+++ b/tools/lib/bpf/libbpf.map
@@ -454,4 +454,5 @@ LIBBPF_1.7.0 {
 		bpf_prog_assoc_struct_ops;
 		bpf_program__assoc_struct_ops;
 		btf__permute;
+		bpf_map__attach_struct_ops_opts;
 } LIBBPF_1.6.0;
-- 
2.52.0



^ permalink raw reply	[flat|nested] 63+ messages in thread

* [PATCH bpf-next v3 05/17] bpf: mark struct oom_control's memcg field as TRUSTED_OR_NULL
  2026-01-27  2:44 [PATCH bpf-next v3 00/17] mm: BPF OOM Roman Gushchin
                   ` (3 preceding siblings ...)
  2026-01-27  2:44 ` [PATCH bpf-next v3 04/17] libbpf: introduce bpf_map__attach_struct_ops_opts() Roman Gushchin
@ 2026-01-27  2:44 ` Roman Gushchin
  2026-01-27  6:06   ` Yafang Shao
  2026-02-02  4:56   ` Matt Bobrowski
  2026-01-27  2:44 ` [PATCH bpf-next v3 06/17] mm: define mem_cgroup_get_from_ino() outside of CONFIG_SHRINKER_DEBUG Roman Gushchin
                   ` (10 subsequent siblings)
  15 siblings, 2 replies; 63+ messages in thread
From: Roman Gushchin @ 2026-01-27  2:44 UTC (permalink / raw)
  To: bpf
  Cc: Michal Hocko, Alexei Starovoitov, Matt Bobrowski, Shakeel Butt,
	JP Kobryn, linux-kernel, linux-mm, Suren Baghdasaryan,
	Johannes Weiner, Andrew Morton, Roman Gushchin,
	Kumar Kartikeya Dwivedi

Struct oom_control is used to describe the OOM context.
Its memcg field defines the scope of the OOM: it's NULL for global
OOMs and a valid memcg pointer for memcg-scoped OOMs.
Teach the bpf verifier to recognize it as a trusted-or-NULL pointer.
This provides the bpf OOM handler with a trusted memcg pointer,
which is required, for example, for iterating over the memcg's subtree.

Signed-off-by: Roman Gushchin <roman.gushchin@linux.dev>
Acked-by: Kumar Kartikeya Dwivedi <memxor@gmail.com>
---
 kernel/bpf/verifier.c | 5 +++++
 1 file changed, 5 insertions(+)

diff --git a/kernel/bpf/verifier.c b/kernel/bpf/verifier.c
index c2f2650db9fd..cca36edb460d 100644
--- a/kernel/bpf/verifier.c
+++ b/kernel/bpf/verifier.c
@@ -7242,6 +7242,10 @@ BTF_TYPE_SAFE_TRUSTED_OR_NULL(struct vm_area_struct) {
 	struct file *vm_file;
 };
 
+BTF_TYPE_SAFE_TRUSTED_OR_NULL(struct oom_control) {
+	struct mem_cgroup *memcg;
+};
+
 static bool type_is_rcu(struct bpf_verifier_env *env,
 			struct bpf_reg_state *reg,
 			const char *field_name, u32 btf_id)
@@ -7284,6 +7288,7 @@ static bool type_is_trusted_or_null(struct bpf_verifier_env *env,
 	BTF_TYPE_EMIT(BTF_TYPE_SAFE_TRUSTED_OR_NULL(struct socket));
 	BTF_TYPE_EMIT(BTF_TYPE_SAFE_TRUSTED_OR_NULL(struct dentry));
 	BTF_TYPE_EMIT(BTF_TYPE_SAFE_TRUSTED_OR_NULL(struct vm_area_struct));
+	BTF_TYPE_EMIT(BTF_TYPE_SAFE_TRUSTED_OR_NULL(struct oom_control));
 
 	return btf_nested_type_is_trusted(&env->log, reg, field_name, btf_id,
 					  "__safe_trusted_or_null");
-- 
2.52.0



^ permalink raw reply	[flat|nested] 63+ messages in thread

* [PATCH bpf-next v3 06/17] mm: define mem_cgroup_get_from_ino() outside of CONFIG_SHRINKER_DEBUG
  2026-01-27  2:44 [PATCH bpf-next v3 00/17] mm: BPF OOM Roman Gushchin
                   ` (4 preceding siblings ...)
  2026-01-27  2:44 ` [PATCH bpf-next v3 05/17] bpf: mark struct oom_control's memcg field as TRUSTED_OR_NULL Roman Gushchin
@ 2026-01-27  2:44 ` Roman Gushchin
  2026-01-27  6:12   ` Yafang Shao
  2026-02-02  3:50   ` Shakeel Butt
  2026-01-27  2:44 ` [PATCH bpf-next v3 07/17] mm: introduce BPF OOM struct ops Roman Gushchin
                   ` (9 subsequent siblings)
  15 siblings, 2 replies; 63+ messages in thread
From: Roman Gushchin @ 2026-01-27  2:44 UTC (permalink / raw)
  To: bpf
  Cc: Michal Hocko, Alexei Starovoitov, Matt Bobrowski, Shakeel Butt,
	JP Kobryn, linux-kernel, linux-mm, Suren Baghdasaryan,
	Johannes Weiner, Andrew Morton, Roman Gushchin

mem_cgroup_get_from_ino() can be reused by the BPF OOM implementation,
but currently depends on CONFIG_SHRINKER_DEBUG. Remove this dependency.

Signed-off-by: Roman Gushchin <roman.gushchin@linux.dev>
Acked-by: Michal Hocko <mhocko@suse.com>
---
 include/linux/memcontrol.h | 4 ++--
 mm/memcontrol.c            | 2 --
 2 files changed, 2 insertions(+), 4 deletions(-)

diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h
index 229ac9835adb..f3b8c71870d8 100644
--- a/include/linux/memcontrol.h
+++ b/include/linux/memcontrol.h
@@ -833,9 +833,9 @@ static inline unsigned long mem_cgroup_ino(struct mem_cgroup *memcg)
 {
 	return memcg ? cgroup_ino(memcg->css.cgroup) : 0;
 }
+#endif
 
 struct mem_cgroup *mem_cgroup_get_from_ino(unsigned long ino);
-#endif
 
 static inline struct mem_cgroup *mem_cgroup_from_seq(struct seq_file *m)
 {
@@ -1298,12 +1298,12 @@ static inline unsigned long mem_cgroup_ino(struct mem_cgroup *memcg)
 {
 	return 0;
 }
+#endif
 
 static inline struct mem_cgroup *mem_cgroup_get_from_ino(unsigned long ino)
 {
 	return NULL;
 }
-#endif
 
 static inline struct mem_cgroup *mem_cgroup_from_seq(struct seq_file *m)
 {
diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index 3808845bc8cc..1f74fce27677 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -3658,7 +3658,6 @@ struct mem_cgroup *mem_cgroup_from_id(unsigned short id)
 	return xa_load(&mem_cgroup_ids, id);
 }
 
-#ifdef CONFIG_SHRINKER_DEBUG
 struct mem_cgroup *mem_cgroup_get_from_ino(unsigned long ino)
 {
 	struct cgroup *cgrp;
@@ -3679,7 +3678,6 @@ struct mem_cgroup *mem_cgroup_get_from_ino(unsigned long ino)
 
 	return memcg;
 }
-#endif
 
 static void free_mem_cgroup_per_node_info(struct mem_cgroup_per_node *pn)
 {
-- 
2.52.0



^ permalink raw reply	[flat|nested] 63+ messages in thread

* [PATCH bpf-next v3 07/17] mm: introduce BPF OOM struct ops
  2026-01-27  2:44 [PATCH bpf-next v3 00/17] mm: BPF OOM Roman Gushchin
                   ` (5 preceding siblings ...)
  2026-01-27  2:44 ` [PATCH bpf-next v3 06/17] mm: define mem_cgroup_get_from_ino() outside of CONFIG_SHRINKER_DEBUG Roman Gushchin
@ 2026-01-27  2:44 ` Roman Gushchin
  2026-01-27  9:38   ` Michal Hocko
                     ` (3 more replies)
  2026-01-27  2:44 ` [PATCH bpf-next v3 08/17] mm: introduce bpf_oom_kill_process() bpf kfunc Roman Gushchin
                   ` (8 subsequent siblings)
  15 siblings, 4 replies; 63+ messages in thread
From: Roman Gushchin @ 2026-01-27  2:44 UTC (permalink / raw)
  To: bpf
  Cc: Michal Hocko, Alexei Starovoitov, Matt Bobrowski, Shakeel Butt,
	JP Kobryn, linux-kernel, linux-mm, Suren Baghdasaryan,
	Johannes Weiner, Andrew Morton, Roman Gushchin

Introduce a bpf struct ops for implementing custom OOM handling
policies.

It's possible to load one bpf_oom_ops for the system and one
bpf_oom_ops for every memory cgroup. In case of a memcg OOM, the
cgroup tree is traversed from the OOM'ing memcg up to the root and
corresponding BPF OOM handlers are executed until some memory is
freed. If no memory is freed, the kernel OOM killer is invoked.

The struct ops provides the handle_out_of_memory() callback, which is
expected to return 1 if it was able to free some memory and 0
otherwise. If 1 is returned, the kernel also checks the bpf_memory_freed
field of the oom_control structure, which is expected to be set by
kfuncs suitable for releasing memory (which will be introduced later
in the patch series). If both are set, the OOM is considered handled;
otherwise the next OOM handler in the chain is executed, e.g. a BPF OOM
handler attached to the parent cgroup or the kernel OOM killer.

The handle_out_of_memory() callback program is sleepable to allow
using iterators, e.g. cgroup iterators. The callback receives struct
oom_control as an argument, so it can determine the scope of the OOM
event: whether it is a memcg-wide or a system-wide OOM. It also receives
bpf_struct_ops_link as the second argument, so it can detect the
cgroup level at which this specific instance is attached.

The handle_out_of_memory() callback is executed just before the
kernel victim task selection algorithm, so all heuristics and sysctls
like sysctl_panic_on_oom and sysctl_oom_kill_allocating_task are
respected.

The struct ops has a name field, which allows defining a custom name
for the implemented policy. It's printed in the OOM report
("oom bpf handler: <name>") only if a bpf handler was invoked.
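
A minimal sketch of the BPF side (illustrative, not the selftest code;
section names follow libbpf's struct_ops conventions and the
memory-freeing kfuncs are added by later patches):

#include "vmlinux.h"
#include <bpf/bpf_helpers.h>
#include <bpf/bpf_tracing.h>

SEC("struct_ops.s/handle_out_of_memory")
int BPF_PROG(handle_oom, struct oom_control *oc,
	     struct bpf_struct_ops_link *st_link)
{
	/* oc->memcg is NULL for global OOMs and a trusted memcg
	 * pointer for memcg-scoped OOMs.
	 */
	if (!oc->memcg)
		return 0;

	/* Return 1 only after freeing memory via a kfunc that sets
	 * oc->bpf_memory_freed; otherwise the next handler in the
	 * chain or the kernel OOM killer runs.
	 */
	return 0;
}

SEC(".struct_ops.link")
struct bpf_oom_ops test_oom_ops = {
	.handle_out_of_memory = (void *)handle_oom,
	.name = "my_oom_policy",
};

char _license[] SEC("license") = "GPL";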

Signed-off-by: Roman Gushchin <roman.gushchin@linux.dev>
---
 MAINTAINERS                     |   2 +
 include/linux/bpf-cgroup-defs.h |   3 +
 include/linux/bpf.h             |   1 +
 include/linux/bpf_oom.h         |  46 ++++++++
 include/linux/oom.h             |   8 ++
 kernel/bpf/bpf_struct_ops.c     |  12 +-
 mm/Makefile                     |   2 +-
 mm/bpf_oom.c                    | 192 ++++++++++++++++++++++++++++++++
 mm/oom_kill.c                   |  19 ++++
 9 files changed, 282 insertions(+), 3 deletions(-)
 create mode 100644 include/linux/bpf_oom.h
 create mode 100644 mm/bpf_oom.c

diff --git a/MAINTAINERS b/MAINTAINERS
index 491d567f7dc8..53465570c1e5 100644
--- a/MAINTAINERS
+++ b/MAINTAINERS
@@ -4807,7 +4807,9 @@ M:	Shakeel Butt <shakeel.butt@linux.dev>
 L:	bpf@vger.kernel.org
 L:	linux-mm@kvack.org
 S:	Maintained
+F:	include/linux/bpf_oom.h
 F:	mm/bpf_memcontrol.c
+F:	mm/bpf_oom.c
 
 BPF [MISC]
 L:	bpf@vger.kernel.org
diff --git a/include/linux/bpf-cgroup-defs.h b/include/linux/bpf-cgroup-defs.h
index 6c5e37190dad..52395834ce13 100644
--- a/include/linux/bpf-cgroup-defs.h
+++ b/include/linux/bpf-cgroup-defs.h
@@ -74,6 +74,9 @@ struct cgroup_bpf {
 	/* list of bpf struct ops links */
 	struct list_head struct_ops_links;
 
+	/* BPF OOM struct ops link */
+	struct bpf_struct_ops_link __rcu *bpf_oom_link;
+
 	/* reference counter used to detach bpf programs after cgroup removal */
 	struct percpu_ref refcnt;
 
diff --git a/include/linux/bpf.h b/include/linux/bpf.h
index 391888eb257c..a5cee5a657b0 100644
--- a/include/linux/bpf.h
+++ b/include/linux/bpf.h
@@ -3944,6 +3944,7 @@ static inline bool bpf_is_subprog(const struct bpf_prog *prog)
 int bpf_prog_get_file_line(struct bpf_prog *prog, unsigned long ip, const char **filep,
 			   const char **linep, int *nump);
 struct bpf_prog *bpf_prog_find_from_stack(void);
+void *bpf_struct_ops_data(struct bpf_map *map);
 
 int bpf_insn_array_init(struct bpf_map *map, const struct bpf_prog *prog);
 int bpf_insn_array_ready(struct bpf_map *map);
diff --git a/include/linux/bpf_oom.h b/include/linux/bpf_oom.h
new file mode 100644
index 000000000000..c81133145c50
--- /dev/null
+++ b/include/linux/bpf_oom.h
@@ -0,0 +1,46 @@
+/* SPDX-License-Identifier: GPL-2.0+ */
+
+#ifndef __BPF_OOM_H
+#define __BPF_OOM_H
+
+struct oom_control;
+
+#define BPF_OOM_NAME_MAX_LEN 64
+
+struct bpf_oom_ops {
+	/**
+	 * @handle_out_of_memory: Out of memory bpf handler, called before
+	 * the in-kernel OOM killer.
+	 * @oc: OOM control structure
+	 * @st_link: struct ops link
+	 *
+	 * Should return 1 if some memory was freed up, otherwise
+	 * the in-kernel OOM killer is invoked.
+	 */
+	int (*handle_out_of_memory)(struct oom_control *oc,
+				    struct bpf_struct_ops_link *st_link);
+
+	/**
+	 * @name: BPF OOM policy name
+	 */
+	char name[BPF_OOM_NAME_MAX_LEN];
+};
+
+#ifdef CONFIG_BPF_SYSCALL
+/**
+ * @bpf_handle_oom: handle out of memory condition using bpf
+ * @oc: OOM control structure
+ *
+ * Returns true if some memory was freed.
+ */
+bool bpf_handle_oom(struct oom_control *oc);
+
+#else /* CONFIG_BPF_SYSCALL */
+static inline bool bpf_handle_oom(struct oom_control *oc)
+{
+	return false;
+}
+
+#endif /* CONFIG_BPF_SYSCALL */
+
+#endif /* __BPF_OOM_H */
diff --git a/include/linux/oom.h b/include/linux/oom.h
index 7b02bc1d0a7e..c2dce336bcb4 100644
--- a/include/linux/oom.h
+++ b/include/linux/oom.h
@@ -51,6 +51,14 @@ struct oom_control {
 
 	/* Used to print the constraint info. */
 	enum oom_constraint constraint;
+
+#ifdef CONFIG_BPF_SYSCALL
+	/* Used by the bpf oom implementation to mark the forward progress */
+	bool bpf_memory_freed;
+
+	/* Handler name */
+	const char *bpf_handler_name;
+#endif
 };
 
 extern struct mutex oom_lock;
diff --git a/kernel/bpf/bpf_struct_ops.c b/kernel/bpf/bpf_struct_ops.c
index 2e361e22cfa0..6285a6d56b98 100644
--- a/kernel/bpf/bpf_struct_ops.c
+++ b/kernel/bpf/bpf_struct_ops.c
@@ -1009,7 +1009,7 @@ static void bpf_struct_ops_map_free(struct bpf_map *map)
 	 * in the tramopline image to finish before releasing
 	 * the trampoline image.
 	 */
-	synchronize_rcu_mult(call_rcu, call_rcu_tasks);
+	synchronize_rcu_mult(call_rcu, call_rcu_tasks, call_rcu_tasks_trace);
 
 	__bpf_struct_ops_map_free(map);
 }
@@ -1226,7 +1226,8 @@ static void bpf_struct_ops_map_link_dealloc(struct bpf_link *link)
 	if (st_link->cgroup)
 		cgroup_bpf_detach_struct_ops(st_link->cgroup, st_link);
 
-	kfree(st_link);
+	synchronize_rcu_tasks_trace();
+	kfree_rcu(st_link, link.rcu);
 }
 
 static void bpf_struct_ops_map_link_show_fdinfo(const struct bpf_link *link,
@@ -1535,3 +1536,10 @@ void bpf_map_struct_ops_info_fill(struct bpf_map_info *info, struct bpf_map *map
 
 	info->btf_vmlinux_id = btf_obj_id(st_map->btf);
 }
+
+void *bpf_struct_ops_data(struct bpf_map *map)
+{
+	struct bpf_struct_ops_map *st_map = (struct bpf_struct_ops_map *)map;
+
+	return &st_map->kvalue.data;
+}
diff --git a/mm/Makefile b/mm/Makefile
index bf46fe31dc14..e939525ba01b 100644
--- a/mm/Makefile
+++ b/mm/Makefile
@@ -107,7 +107,7 @@ ifdef CONFIG_SWAP
 obj-$(CONFIG_MEMCG) += swap_cgroup.o
 endif
 ifdef CONFIG_BPF_SYSCALL
-obj-$(CONFIG_MEMCG) += bpf_memcontrol.o
+obj-$(CONFIG_MEMCG) += bpf_memcontrol.o bpf_oom.o
 endif
 obj-$(CONFIG_CGROUP_HUGETLB) += hugetlb_cgroup.o
 obj-$(CONFIG_GUP_TEST) += gup_test.o
diff --git a/mm/bpf_oom.c b/mm/bpf_oom.c
new file mode 100644
index 000000000000..ea70be6e2c26
--- /dev/null
+++ b/mm/bpf_oom.c
@@ -0,0 +1,192 @@
+// SPDX-License-Identifier: GPL-2.0-or-later
+/*
+ * BPF-driven OOM killer customization
+ *
+ * Author: Roman Gushchin <roman.gushchin@linux.dev>
+ */
+
+#include <linux/bpf.h>
+#include <linux/oom.h>
+#include <linux/bpf_oom.h>
+#include <linux/bpf-cgroup.h>
+#include <linux/cgroup.h>
+#include <linux/memcontrol.h>
+#include <linux/uaccess.h>
+
+static int bpf_ops_handle_oom(struct bpf_oom_ops *bpf_oom_ops,
+			      struct bpf_struct_ops_link *st_link,
+			      struct oom_control *oc)
+{
+	int ret;
+
+	oc->bpf_handler_name = &bpf_oom_ops->name[0];
+	oc->bpf_memory_freed = false;
+	pagefault_disable();
+	ret = bpf_oom_ops->handle_out_of_memory(oc, st_link);
+	pagefault_enable();
+	oc->bpf_handler_name = NULL;
+
+	return ret;
+}
+
+bool bpf_handle_oom(struct oom_control *oc)
+{
+	struct bpf_struct_ops_link *st_link;
+	struct bpf_oom_ops *bpf_oom_ops;
+	struct mem_cgroup *memcg;
+	struct bpf_map *map;
+	int ret = 0;
+
+	/*
+	 * System-wide OOMs are handled by the struct ops attached
+	 * to the root memory cgroup
+	 */
+	memcg = oc->memcg ? oc->memcg : root_mem_cgroup;
+
+	rcu_read_lock_trace();
+
+	/* Find the nearest bpf_oom_ops traversing the cgroup tree upwards */
+	for (; memcg; memcg = parent_mem_cgroup(memcg)) {
+		st_link = rcu_dereference_check(memcg->css.cgroup->bpf.bpf_oom_link,
+						rcu_read_lock_trace_held());
+		if (!st_link)
+			continue;
+
+		map = rcu_dereference_check((st_link->map),
+					    rcu_read_lock_trace_held());
+		if (!map)
+			continue;
+
+		/* Call BPF OOM handler */
+		bpf_oom_ops = bpf_struct_ops_data(map);
+		ret = bpf_ops_handle_oom(bpf_oom_ops, st_link, oc);
+		if (ret && oc->bpf_memory_freed)
+			break;
+		ret = 0;
+	}
+
+	rcu_read_unlock_trace();
+
+	return ret && oc->bpf_memory_freed;
+}
+
+static int __handle_out_of_memory(struct oom_control *oc,
+				  struct bpf_struct_ops_link *st_link)
+{
+	return 0;
+}
+
+static struct bpf_oom_ops __bpf_oom_ops = {
+	.handle_out_of_memory = __handle_out_of_memory,
+};
+
+static const struct bpf_func_proto *
+bpf_oom_func_proto(enum bpf_func_id func_id, const struct bpf_prog *prog)
+{
+	return tracing_prog_func_proto(func_id, prog);
+}
+
+static bool bpf_oom_ops_is_valid_access(int off, int size,
+					enum bpf_access_type type,
+					const struct bpf_prog *prog,
+					struct bpf_insn_access_aux *info)
+{
+	return bpf_tracing_btf_ctx_access(off, size, type, prog, info);
+}
+
+static const struct bpf_verifier_ops bpf_oom_verifier_ops = {
+	.get_func_proto = bpf_oom_func_proto,
+	.is_valid_access = bpf_oom_ops_is_valid_access,
+};
+
+static int bpf_oom_ops_reg(void *kdata, struct bpf_link *link)
+{
+	struct bpf_struct_ops_link *st_link = (struct bpf_struct_ops_link *)link;
+	struct cgroup *cgrp;
+
+	/* The link is not yet fully initialized, but cgroup should be set */
+	if (!link)
+		return -EOPNOTSUPP;
+
+	cgrp = st_link->cgroup;
+	if (!cgrp)
+		return -EINVAL;
+
+	if (cmpxchg(&cgrp->bpf.bpf_oom_link, NULL, st_link))
+		return -EEXIST;
+
+	return 0;
+}
+
+static void bpf_oom_ops_unreg(void *kdata, struct bpf_link *link)
+{
+	struct bpf_struct_ops_link *st_link = (struct bpf_struct_ops_link *)link;
+	struct cgroup *cgrp;
+
+	if (!link)
+		return;
+
+	cgrp = st_link->cgroup;
+	if (!cgrp)
+		return;
+
+	WARN_ON(cmpxchg(&cgrp->bpf.bpf_oom_link, st_link, NULL) != st_link);
+}
+
+static int bpf_oom_ops_check_member(const struct btf_type *t,
+				    const struct btf_member *member,
+				    const struct bpf_prog *prog)
+{
+	u32 moff = __btf_member_bit_offset(t, member) / 8;
+
+	switch (moff) {
+	case offsetof(struct bpf_oom_ops, handle_out_of_memory):
+		if (!prog)
+			return -EINVAL;
+		break;
+	}
+
+	return 0;
+}
+
+static int bpf_oom_ops_init_member(const struct btf_type *t,
+				   const struct btf_member *member,
+				   void *kdata, const void *udata)
+{
+	const struct bpf_oom_ops *uops = udata;
+	struct bpf_oom_ops *ops = kdata;
+	u32 moff = __btf_member_bit_offset(t, member) / 8;
+
+	switch (moff) {
+	case offsetof(struct bpf_oom_ops, name):
+		if (uops->name[0])
+			strscpy_pad(ops->name, uops->name, sizeof(ops->name));
+		else
+			strscpy_pad(ops->name, "bpf_defined_policy");
+		return 1;
+	}
+	return 0;
+}
+
+static int bpf_oom_ops_init(struct btf *btf)
+{
+	return 0;
+}
+
+static struct bpf_struct_ops bpf_oom_bpf_ops = {
+	.verifier_ops = &bpf_oom_verifier_ops,
+	.reg = bpf_oom_ops_reg,
+	.unreg = bpf_oom_ops_unreg,
+	.check_member = bpf_oom_ops_check_member,
+	.init_member = bpf_oom_ops_init_member,
+	.init = bpf_oom_ops_init,
+	.name = "bpf_oom_ops",
+	.owner = THIS_MODULE,
+	.cfi_stubs = &__bpf_oom_ops
+};
+
+static int __init bpf_oom_struct_ops_init(void)
+{
+	return register_bpf_struct_ops(&bpf_oom_bpf_ops, bpf_oom_ops);
+}
+late_initcall(bpf_oom_struct_ops_init);
diff --git a/mm/oom_kill.c b/mm/oom_kill.c
index 5eb11fbba704..44bbcf033804 100644
--- a/mm/oom_kill.c
+++ b/mm/oom_kill.c
@@ -45,6 +45,7 @@
 #include <linux/mmu_notifier.h>
 #include <linux/cred.h>
 #include <linux/nmi.h>
+#include <linux/bpf_oom.h>
 
 #include <asm/tlb.h>
 #include "internal.h"
@@ -246,6 +247,15 @@ static const char * const oom_constraint_text[] = {
 	[CONSTRAINT_MEMCG] = "CONSTRAINT_MEMCG",
 };
 
+static const char *oom_handler_name(struct oom_control *oc)
+{
+#ifdef CONFIG_BPF_SYSCALL
+	if (oc->bpf_handler_name)
+		return oc->bpf_handler_name;
+#endif
+	return NULL;
+}
+
 /*
  * Determine the type of allocation constraint.
  */
@@ -461,6 +471,8 @@ static void dump_header(struct oom_control *oc)
 	pr_warn("%s invoked oom-killer: gfp_mask=%#x(%pGg), order=%d, oom_score_adj=%hd\n",
 		current->comm, oc->gfp_mask, &oc->gfp_mask, oc->order,
 			current->signal->oom_score_adj);
+	if (oom_handler_name(oc))
+		pr_warn("oom bpf handler: %s\n", oom_handler_name(oc));
 	if (!IS_ENABLED(CONFIG_COMPACTION) && oc->order)
 		pr_warn("COMPACTION is disabled!!!\n");
 
@@ -1168,6 +1180,13 @@ bool out_of_memory(struct oom_control *oc)
 		return true;
 	}
 
+	/*
+	 * Let bpf handle the OOM first. If it was able to free up some memory,
+	 * bail out. Otherwise fall back to the kernel OOM killer.
+	 */
+	if (bpf_handle_oom(oc))
+		return true;
+
 	select_bad_process(oc);
 	/* Found nothing?!?! */
 	if (!oc->chosen) {
-- 
2.52.0



^ permalink raw reply	[flat|nested] 63+ messages in thread

* [PATCH bpf-next v3 08/17] mm: introduce bpf_oom_kill_process() bpf kfunc
  2026-01-27  2:44 [PATCH bpf-next v3 00/17] mm: BPF OOM Roman Gushchin
                   ` (6 preceding siblings ...)
  2026-01-27  2:44 ` [PATCH bpf-next v3 07/17] mm: introduce BPF OOM struct ops Roman Gushchin
@ 2026-01-27  2:44 ` Roman Gushchin
  2026-01-27 20:21   ` Martin KaFai Lau
  2026-02-02  4:49   ` Matt Bobrowski
  2026-01-27  2:44 ` [PATCH bpf-next v3 09/17] mm: introduce bpf_out_of_memory() BPF kfunc Roman Gushchin
                   ` (7 subsequent siblings)
  15 siblings, 2 replies; 63+ messages in thread
From: Roman Gushchin @ 2026-01-27  2:44 UTC (permalink / raw)
  To: bpf
  Cc: Michal Hocko, Alexei Starovoitov, Matt Bobrowski, Shakeel Butt,
	JP Kobryn, linux-kernel, linux-mm, Suren Baghdasaryan,
	Johannes Weiner, Andrew Morton, Roman Gushchin

Introduce the bpf_oom_kill_process() bpf kfunc, which is supposed
to be used by BPF OOM programs. It allows killing a process
in exactly the same way the OOM killer does: using the OOM reaper,
bumping the corresponding memcg and global statistics, respecting
memory.oom.group, etc.

On success, it sets the oom_control's bpf_memory_freed field to true,
enabling the bpf program to bypass the kernel OOM killer.
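
A sketch of intended usage from a handle_out_of_memory() program
(illustrative: victim_pid and the message are hypothetical, the
struct_ops wiring is as in the previous patch, and it assumes the
generic task kfuncs are available to bpf_oom struct_ops programs):

#include "vmlinux.h"
#include <bpf/bpf_helpers.h>
#include <bpf/bpf_tracing.h>

extern struct task_struct *bpf_task_from_pid(s32 pid) __weak __ksym;
extern void bpf_task_release(struct task_struct *p) __weak __ksym;
extern int bpf_oom_kill_process(struct oom_control *oc,
				struct task_struct *task,
				const char *message) __weak __ksym;

/* pid of a designated victim, e.g. set by a userspace agent */
const volatile int victim_pid;

SEC("struct_ops.s/handle_out_of_memory")
int BPF_PROG(kill_designated_victim, struct oom_control *oc,
	     struct bpf_struct_ops_link *st_link)
{
	struct task_struct *p;
	int err;

	p = bpf_task_from_pid(victim_pid);
	if (!p)
		return 0;	/* fall back to the kernel OOM killer */

	err = bpf_oom_kill_process(oc, p, "bpf oom: designated victim");
	bpf_task_release(p);

	/* on success the kfunc sets oc->bpf_memory_freed, so returning
	 * 1 is enough to skip the kernel OOM killer
	 */
	return err ? 0 : 1;
}

char _license[] SEC("license") = "GPL";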

Signed-off-by: Roman Gushchin <roman.gushchin@linux.dev>
---
 mm/oom_kill.c | 80 +++++++++++++++++++++++++++++++++++++++++++++++++++
 1 file changed, 80 insertions(+)

diff --git a/mm/oom_kill.c b/mm/oom_kill.c
index 44bbcf033804..09897597907f 100644
--- a/mm/oom_kill.c
+++ b/mm/oom_kill.c
@@ -46,6 +46,7 @@
 #include <linux/cred.h>
 #include <linux/nmi.h>
 #include <linux/bpf_oom.h>
+#include <linux/btf.h>
 
 #include <asm/tlb.h>
 #include "internal.h"
@@ -1290,3 +1291,82 @@ SYSCALL_DEFINE2(process_mrelease, int, pidfd, unsigned int, flags)
 	return -ENOSYS;
 #endif /* CONFIG_MMU */
 }
+
+#ifdef CONFIG_BPF_SYSCALL
+
+__bpf_kfunc_start_defs();
+/**
+ * bpf_oom_kill_process - Kill a process as OOM killer
+ * @oc: pointer to oom_control structure, describes OOM context
+ * @task: task to be killed
+ * @message__str: message to print in dmesg
+ *
+ * Kill a process in a way similar to the kernel OOM killer.
+ * This means dump the necessary information to dmesg, adjust memcg
+ * statistics, leverage the oom reaper, respect memory.oom.group etc.
+ *
+ * bpf_oom_kill_process() marks the forward progress by setting
+ * oc->bpf_memory_freed. If the progress was made, the bpf program
+ * is free to decide if the kernel oom killer should be invoked.
+ * Otherwise it's enforced, so that a bad bpf program can't
+ * deadlock the machine on memory.
+ */
+__bpf_kfunc int bpf_oom_kill_process(struct oom_control *oc,
+				     struct task_struct *task,
+				     const char *message__str)
+{
+	if (oom_unkillable_task(task))
+		return -EPERM;
+
+	if (task->signal->oom_score_adj == OOM_SCORE_ADJ_MIN)
+		return -EINVAL;
+
+	/* paired with put_task_struct() in oom_kill_process() */
+	get_task_struct(task);
+
+	oc->chosen = task;
+
+	oom_kill_process(oc, message__str);
+
+	oc->chosen = NULL;
+	oc->bpf_memory_freed = true;
+
+	return 0;
+}
+
+__bpf_kfunc_end_defs();
+
+BTF_KFUNCS_START(bpf_oom_kfuncs)
+BTF_ID_FLAGS(func, bpf_oom_kill_process, KF_SLEEPABLE)
+BTF_KFUNCS_END(bpf_oom_kfuncs)
+
+BTF_ID_LIST_SINGLE(bpf_oom_ops_ids, struct, bpf_oom_ops)
+
+static int bpf_oom_kfunc_filter(const struct bpf_prog *prog, u32 kfunc_id)
+{
+	if (prog->type != BPF_PROG_TYPE_STRUCT_OPS ||
+	    prog->aux->attach_btf_id != bpf_oom_ops_ids[0])
+		return -EACCES;
+	return 0;
+}
+
+static const struct btf_kfunc_id_set bpf_oom_kfunc_set = {
+	.owner          = THIS_MODULE,
+	.set            = &bpf_oom_kfuncs,
+	.filter         = bpf_oom_kfunc_filter,
+};
+
+static int __init bpf_oom_init(void)
+{
+	int err;
+
+	err = register_btf_kfunc_id_set(BPF_PROG_TYPE_STRUCT_OPS,
+					&bpf_oom_kfunc_set);
+	if (err)
+		pr_warn("error while registering bpf oom kfuncs: %d", err);
+
+	return err;
+}
+late_initcall(bpf_oom_init);
+
+#endif
-- 
2.52.0



^ permalink raw reply	[flat|nested] 63+ messages in thread

* [PATCH bpf-next v3 09/17] mm: introduce bpf_out_of_memory() BPF kfunc
  2026-01-27  2:44 [PATCH bpf-next v3 00/17] mm: BPF OOM Roman Gushchin
                   ` (7 preceding siblings ...)
  2026-01-27  2:44 ` [PATCH bpf-next v3 08/17] mm: introduce bpf_oom_kill_process() bpf kfunc Roman Gushchin
@ 2026-01-27  2:44 ` Roman Gushchin
  2026-01-28 20:21   ` Matt Bobrowski
  2026-01-27  2:44 ` [PATCH bpf-next v3 10/17] mm: introduce bpf_task_is_oom_victim() kfunc Roman Gushchin
                   ` (6 subsequent siblings)
  15 siblings, 1 reply; 63+ messages in thread
From: Roman Gushchin @ 2026-01-27  2:44 UTC (permalink / raw)
  To: bpf
  Cc: Michal Hocko, Alexei Starovoitov, Matt Bobrowski, Shakeel Butt,
	JP Kobryn, linux-kernel, linux-mm, Suren Baghdasaryan,
	Johannes Weiner, Andrew Morton, Roman Gushchin

Introduce the bpf_out_of_memory() bpf kfunc, which allows declaring
an out-of-memory event and triggering the corresponding kernel OOM
handling mechanism.

It takes a trusted memcg pointer (or NULL for system-wide OOMs)
as an argument, as well as the page order.

If the BPF_OOM_FLAGS_WAIT_ON_OOM_LOCK flag is not set, only one OOM
can be declared and handled in the system at once, so if the function
is called in parallel to another OOM handling, it bails out with -EBUSY.
This mode is suited for global OOMs: any concurrent OOM will likely
do the job and release some memory. In the blocking mode (which is
suited for memcg OOMs) the execution waits on the oom_lock mutex.

The function is declared as sleepable, which guarantees that it won't
be called from an atomic context. This is required by the OOM handling
code, which shouldn't be called from a non-blocking context.

Handling a memcg OOM almost always requires taking the css_set_lock
spinlock. The fact that bpf_out_of_memory() is sleepable also
guarantees that it can't be called with css_set_lock held, so the
kernel can't deadlock on it.

To avoid deadlocks on the oom lock, the function is filtered out for
bpf oom struct ops programs and all tracing programs.
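
A sketch of a possible caller (hypothetical: any sleepable program
type not excluded by the filter, e.g. a BPF_PROG_TYPE_SYSCALL program;
the PSI tracepoint path is enabled later in the series):

#include "vmlinux.h"
#include <bpf/bpf_helpers.h>

extern int bpf_out_of_memory(struct mem_cgroup *memcg__nullable,
			     int order, u64 flags) __weak __ksym;

SEC("syscall")
int declare_global_oom(void *ctx)
{
	/* NULL memcg == system-wide OOM; flags == 0 means don't wait,
	 * so the call returns -EBUSY if another OOM is being handled.
	 * Pass BPF_OOM_FLAGS_WAIT_ON_OOM_LOCK to block on oom_lock.
	 */
	return bpf_out_of_memory(NULL, 0, 0);
}

char _license[] SEC("license") = "GPL";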

Signed-off-by: Roman Gushchin <roman.gushchin@linux.dev>
---
 include/linux/oom.h |  5 +++
 mm/oom_kill.c       | 85 +++++++++++++++++++++++++++++++++++++++++++--
 2 files changed, 88 insertions(+), 2 deletions(-)

diff --git a/include/linux/oom.h b/include/linux/oom.h
index c2dce336bcb4..851dba9287b5 100644
--- a/include/linux/oom.h
+++ b/include/linux/oom.h
@@ -21,6 +21,11 @@ enum oom_constraint {
 	CONSTRAINT_MEMCG,
 };
 
+enum bpf_oom_flags {
+	BPF_OOM_FLAGS_WAIT_ON_OOM_LOCK = 1 << 0,
+	BPF_OOM_FLAGS_LAST = 1 << 1,
+};
+
 /*
  * Details of the page allocation that triggered the oom killer that are used to
  * determine what should be killed.
diff --git a/mm/oom_kill.c b/mm/oom_kill.c
index 09897597907f..8f63a370b8f5 100644
--- a/mm/oom_kill.c
+++ b/mm/oom_kill.c
@@ -1334,6 +1334,53 @@ __bpf_kfunc int bpf_oom_kill_process(struct oom_control *oc,
 	return 0;
 }
 
+/**
+ * bpf_out_of_memory - declare Out Of Memory state and invoke OOM killer
+ * @memcg__nullable: memcg or NULL for system-wide OOMs
+ * @order: order of page which wasn't allocated
+ * @flags: flags
+ *
+ * Declares the Out Of Memory state and invokes the OOM killer.
+ *
+ * OOM handlers are synchronized using the oom_lock mutex. If wait_on_oom_lock
+ * is true, the function will wait on it. Otherwise it bails out with -EBUSY
+ * if oom_lock is contended.
+ *
+ * Generally it's advised to pass wait_on_oom_lock=false for global OOMs
+ * and wait_on_oom_lock=true for memcg-scoped OOMs.
+ *
+ * Returns 1 if the forward progress was achieved and some memory was freed.
+ * Returns a negative value if an error occurred.
+ */
+__bpf_kfunc int bpf_out_of_memory(struct mem_cgroup *memcg__nullable,
+				  int order, u64 flags)
+{
+	struct oom_control oc = {
+		.memcg = memcg__nullable,
+		.gfp_mask = GFP_KERNEL,
+		.order = order,
+	};
+	int ret;
+
+	if (flags & ~(BPF_OOM_FLAGS_LAST - 1))
+		return -EINVAL;
+
+	if (oc.order < 0 || oc.order > MAX_PAGE_ORDER)
+		return -EINVAL;
+
+	if (flags & BPF_OOM_FLAGS_WAIT_ON_OOM_LOCK) {
+		ret = mutex_lock_killable(&oom_lock);
+		if (ret)
+			return ret;
+	} else if (!mutex_trylock(&oom_lock))
+		return -EBUSY;
+
+	ret = out_of_memory(&oc);
+
+	mutex_unlock(&oom_lock);
+	return ret;
+}
+
 __bpf_kfunc_end_defs();
 
 BTF_KFUNCS_START(bpf_oom_kfuncs)
@@ -1356,14 +1403,48 @@ static const struct btf_kfunc_id_set bpf_oom_kfunc_set = {
 	.filter         = bpf_oom_kfunc_filter,
 };
 
+BTF_KFUNCS_START(bpf_declare_oom_kfuncs)
+BTF_ID_FLAGS(func, bpf_out_of_memory, KF_SLEEPABLE)
+BTF_KFUNCS_END(bpf_declare_oom_kfuncs)
+
+static int bpf_declare_oom_kfunc_filter(const struct bpf_prog *prog, u32 kfunc_id)
+{
+	if (!btf_id_set8_contains(&bpf_declare_oom_kfuncs, kfunc_id))
+		return 0;
+
+	if (prog->type == BPF_PROG_TYPE_STRUCT_OPS &&
+	    prog->aux->attach_btf_id == bpf_oom_ops_ids[0])
+		return -EACCES;
+
+	if (prog->type == BPF_PROG_TYPE_TRACING)
+		return -EACCES;
+
+	return 0;
+}
+
+static const struct btf_kfunc_id_set bpf_declare_oom_kfunc_set = {
+	.owner          = THIS_MODULE,
+	.set            = &bpf_declare_oom_kfuncs,
+	.filter         = bpf_declare_oom_kfunc_filter,
+};
+
 static int __init bpf_oom_init(void)
 {
 	int err;
 
 	err = register_btf_kfunc_id_set(BPF_PROG_TYPE_STRUCT_OPS,
 					&bpf_oom_kfunc_set);
-	if (err)
-		pr_warn("error while registering bpf oom kfuncs: %d", err);
+	if (err) {
+		pr_warn("error while registering struct_ops bpf oom kfuncs: %d", err);
+		return err;
+	}
+
+	err = register_btf_kfunc_id_set(BPF_PROG_TYPE_UNSPEC,
+					&bpf_declare_oom_kfunc_set);
+	if (err) {
+		pr_warn("error while registering unspec bpf oom kfuncs: %d", err);
+		return err;
+	}
 
 	return err;
 }
-- 
2.52.0



^ permalink raw reply	[flat|nested] 63+ messages in thread

* [PATCH bpf-next v3 10/17] mm: introduce bpf_task_is_oom_victim() kfunc
  2026-01-27  2:44 [PATCH bpf-next v3 00/17] mm: BPF OOM Roman Gushchin
                   ` (8 preceding siblings ...)
  2026-01-27  2:44 ` [PATCH bpf-next v3 09/17] mm: introduce bpf_out_of_memory() BPF kfunc Roman Gushchin
@ 2026-01-27  2:44 ` Roman Gushchin
  2026-02-02  5:39   ` Matt Bobrowski
  2026-01-27  2:44 ` [PATCH bpf-next v3 11/17] bpf: selftests: introduce read_cgroup_file() helper Roman Gushchin
                   ` (5 subsequent siblings)
  15 siblings, 1 reply; 63+ messages in thread
From: Roman Gushchin @ 2026-01-27  2:44 UTC (permalink / raw)
  To: bpf
  Cc: Michal Hocko, Alexei Starovoitov, Matt Bobrowski, Shakeel Butt,
	JP Kobryn, linux-kernel, linux-mm, Suren Baghdasaryan,
	Johannes Weiner, Andrew Morton, Roman Gushchin

Export the tsk_is_oom_victim() helper as a BPF kfunc.
It's useful for avoiding redundant OOM kills.
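
For example, an OOM handler can bail out early if a previously selected
victim hasn't exited yet (a sketch mirroring the selftest added later in
the series; error handling and the actual victim selection are omitted):

  SEC("struct_ops.s/handle_out_of_memory")
  int BPF_PROG(handle_oom, struct oom_control *oc, struct bpf_struct_ops_link *link)
  {
          struct mem_cgroup *memcg = oc->memcg;
          struct task_struct *task;

          if (!memcg)
                  return 0;

          /* A task in the OOMing memcg is already being killed and will
           * release memory soon, so no new kill is needed.
           */
          bpf_for_each(css_task, task, &memcg->css, CSS_TASK_ITER_PROCS)
                  if (bpf_task_is_oom_victim(task))
                          return 1;

          /* ... otherwise pick a victim and call bpf_oom_kill_process() ... */
          return 0;
  }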

Signed-off-by: Roman Gushchin <roman.gushchin@linux.dev>
Suggested-by: Michal Hocko <mhocko@suse.com>
---
 mm/oom_kill.c | 14 ++++++++++++++
 1 file changed, 14 insertions(+)

diff --git a/mm/oom_kill.c b/mm/oom_kill.c
index 8f63a370b8f5..53f9f9674658 100644
--- a/mm/oom_kill.c
+++ b/mm/oom_kill.c
@@ -1381,10 +1381,24 @@ __bpf_kfunc int bpf_out_of_memory(struct mem_cgroup *memcg__nullable,
 	return ret;
 }
 
+/**
+ * bpf_task_is_oom_victim - Check if the task has been marked as an OOM victim
+ * @task: task to check
+ *
+ * Returns true if the task has been previously selected by the OOM killer
+ * to be killed. It's expected that the task will be destroyed soon and some
+ * memory will be freed, so no additional actions may be required.
+ */
+__bpf_kfunc bool bpf_task_is_oom_victim(struct task_struct *task)
+{
+	return tsk_is_oom_victim(task);
+}
+
 __bpf_kfunc_end_defs();
 
 BTF_KFUNCS_START(bpf_oom_kfuncs)
 BTF_ID_FLAGS(func, bpf_oom_kill_process, KF_SLEEPABLE)
+BTF_ID_FLAGS(func, bpf_task_is_oom_victim)
 BTF_KFUNCS_END(bpf_oom_kfuncs)
 
 BTF_ID_LIST_SINGLE(bpf_oom_ops_ids, struct, bpf_oom_ops)
-- 
2.52.0



^ permalink raw reply	[flat|nested] 63+ messages in thread

* [PATCH bpf-next v3 11/17] bpf: selftests: introduce read_cgroup_file() helper
  2026-01-27  2:44 [PATCH bpf-next v3 00/17] mm: BPF OOM Roman Gushchin
                   ` (9 preceding siblings ...)
  2026-01-27  2:44 ` [PATCH bpf-next v3 10/17] mm: introduce bpf_task_is_oom_victim() kfunc Roman Gushchin
@ 2026-01-27  2:44 ` Roman Gushchin
  2026-01-27  3:08   ` bot+bpf-ci
  2026-01-27  2:44 ` [PATCH bpf-next v3 12/17] bpf: selftests: BPF OOM struct ops test Roman Gushchin
                   ` (4 subsequent siblings)
  15 siblings, 1 reply; 63+ messages in thread
From: Roman Gushchin @ 2026-01-27  2:44 UTC (permalink / raw)
  To: bpf
  Cc: Michal Hocko, Alexei Starovoitov, Matt Bobrowski, Shakeel Butt,
	JP Kobryn, linux-kernel, linux-mm, Suren Baghdasaryan,
	Johannes Weiner, Andrew Morton, Roman Gushchin

Implement a read_cgroup_file() helper to read from cgroup control files,
e.g. statistics.
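
For example, a test can verify the per-memcg OOM counter after an
expected kill (a sketch; "/oom_test/victim" is a hypothetical cgroup
created by the test):

  char buf[4096] = {};
  ssize_t n;

  n = read_cgroup_file("/oom_test/victim", "memory.events", buf, sizeof(buf));
  if (!ASSERT_GT(n, 0, "read memory.events"))
          return;
  ASSERT_OK_PTR(strstr(buf, "oom_kill 1"), "oom_kill count");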

Signed-off-by: Roman Gushchin <roman.gushchin@linux.dev>
---
 tools/testing/selftests/bpf/cgroup_helpers.c | 45 ++++++++++++++++++++
 tools/testing/selftests/bpf/cgroup_helpers.h |  3 ++
 2 files changed, 48 insertions(+)

diff --git a/tools/testing/selftests/bpf/cgroup_helpers.c b/tools/testing/selftests/bpf/cgroup_helpers.c
index 20cede4db3ce..fc5f22409ce5 100644
--- a/tools/testing/selftests/bpf/cgroup_helpers.c
+++ b/tools/testing/selftests/bpf/cgroup_helpers.c
@@ -126,6 +126,51 @@ int enable_controllers(const char *relative_path, const char *controllers)
 	return __enable_controllers(cgroup_path, controllers);
 }
 
+static ssize_t __read_cgroup_file(const char *cgroup_path, const char *file,
+				 char *buf, size_t size)
+{
+	char file_path[PATH_MAX + 1];
+	ssize_t ret;
+	int fd;
+
+	snprintf(file_path, sizeof(file_path), "%s/%s", cgroup_path, file);
+	fd = open(file_path, O_RDONLY);
+	if (fd < 0) {
+		log_err("Opening %s", file_path);
+		return -1;
+	}
+
+	ret = read(fd, buf, size);
+	if (ret < 0) {
+		close(fd);
+		log_err("Reading %s", file_path);
+		return -1;
+	}
+
+	close(fd);
+	return ret;
+}
+
+/**
+ * read_cgroup_file() - Read from a cgroup file
+ * @relative_path: The cgroup path, relative to the workdir
+ * @file: The name of the file in cgroupfs to read from
+ * @buf: Buffer to read from the file
+ * @size: Size of the buffer
+ *
+ * Read from a file in the given cgroup's directory.
+ *
+ * If successful, the number of read bytes is returned.
+ */
+ssize_t read_cgroup_file(const char *relative_path, const char *file,
+			 char *buf, size_t size)
+{
+	char cgroup_path[PATH_MAX - 24];
+
+	format_cgroup_path(cgroup_path, relative_path);
+	return __read_cgroup_file(cgroup_path, file, buf, size);
+}
+
 static int __write_cgroup_file(const char *cgroup_path, const char *file,
 			       const char *buf)
 {
diff --git a/tools/testing/selftests/bpf/cgroup_helpers.h b/tools/testing/selftests/bpf/cgroup_helpers.h
index 3857304be874..66a08b64838b 100644
--- a/tools/testing/selftests/bpf/cgroup_helpers.h
+++ b/tools/testing/selftests/bpf/cgroup_helpers.h
@@ -4,6 +4,7 @@
 
 #include <errno.h>
 #include <string.h>
+#include <sys/types.h>
 
 #define clean_errno() (errno == 0 ? "None" : strerror(errno))
 #define log_err(MSG, ...) fprintf(stderr, "(%s:%d: errno: %s) " MSG "\n", \
@@ -11,6 +12,8 @@
 
 /* cgroupv2 related */
 int enable_controllers(const char *relative_path, const char *controllers);
+ssize_t read_cgroup_file(const char *relative_path, const char *file,
+			char *buf, size_t size);
 int write_cgroup_file(const char *relative_path, const char *file,
 		      const char *buf);
 int write_cgroup_file_parent(const char *relative_path, const char *file,
-- 
2.52.0



^ permalink raw reply	[flat|nested] 63+ messages in thread

* [PATCH bpf-next v3 12/17] bpf: selftests: BPF OOM struct ops test
  2026-01-27  2:44 [PATCH bpf-next v3 00/17] mm: BPF OOM Roman Gushchin
                   ` (10 preceding siblings ...)
  2026-01-27  2:44 ` [PATCH bpf-next v3 11/17] bpf: selftests: introduce read_cgroup_file() helper Roman Gushchin
@ 2026-01-27  2:44 ` Roman Gushchin
  2026-01-27  2:44 ` [PATCH bpf-next v3 13/17] sched: psi: add a trace point to psi_avgs_work() Roman Gushchin
                   ` (3 subsequent siblings)
  15 siblings, 0 replies; 63+ messages in thread
From: Roman Gushchin @ 2026-01-27  2:44 UTC (permalink / raw)
  To: bpf
  Cc: Michal Hocko, Alexei Starovoitov, Matt Bobrowski, Shakeel Butt,
	JP Kobryn, linux-kernel, linux-mm, Suren Baghdasaryan,
	Johannes Weiner, Andrew Morton, Roman Gushchin

Implement a kselftest for the OOM handling functionality.

The OOM handling policy implemented in BPF is to kill all tasks
belonging to the biggest leaf cgroup that doesn't contain unkillable
tasks (tasks with oom_score_adj set to -1000). Pagecache size is
excluded from the accounting.

The test creates a hierarchy of memory cgroups, causes an
OOM at the top level, checks that the expected process is
killed and verifies the memcg's oom statistics.

The same BPF OOM policy is attached both to a memory cgroup and
system-wide. In the first case the program does nothing and returns 0,
so it's executed a second time (via the system-wide attachment), when
it properly handles the OOM.

Signed-off-by: Roman Gushchin <roman.gushchin@linux.dev>
---
 .../selftests/bpf/prog_tests/test_oom.c       | 256 ++++++++++++++++++
 tools/testing/selftests/bpf/progs/test_oom.c  | 111 ++++++++
 2 files changed, 367 insertions(+)
 create mode 100644 tools/testing/selftests/bpf/prog_tests/test_oom.c
 create mode 100644 tools/testing/selftests/bpf/progs/test_oom.c

diff --git a/tools/testing/selftests/bpf/prog_tests/test_oom.c b/tools/testing/selftests/bpf/prog_tests/test_oom.c
new file mode 100644
index 000000000000..a1eadbe1ae83
--- /dev/null
+++ b/tools/testing/selftests/bpf/prog_tests/test_oom.c
@@ -0,0 +1,256 @@
+// SPDX-License-Identifier: GPL-2.0-only
+#include <test_progs.h>
+#include <bpf/btf.h>
+#include <bpf/bpf.h>
+
+#include "cgroup_helpers.h"
+#include "test_oom.skel.h"
+
+struct cgroup_desc {
+	const char *path;
+	int fd;
+	unsigned long long id;
+	int pid;
+	size_t target;
+	size_t max;
+	int oom_score_adj;
+	bool victim;
+};
+
+#define MB (1024 * 1024)
+#define OOM_SCORE_ADJ_MIN	(-1000)
+#define OOM_SCORE_ADJ_MAX	1000
+
+static struct cgroup_desc cgroups[] = {
+	{ .path = "/oom_test", .max = 80 * MB},
+	{ .path = "/oom_test/cg1", .target = 10 * MB,
+	  .oom_score_adj = OOM_SCORE_ADJ_MAX },
+	{ .path = "/oom_test/cg2", .target = 40 * MB,
+	  .oom_score_adj = OOM_SCORE_ADJ_MIN },
+	{ .path = "/oom_test/cg3" },
+	{ .path = "/oom_test/cg3/cg4", .target = 30 * MB,
+	  .victim = true },
+	{ .path = "/oom_test/cg3/cg5", .target = 20 * MB },
+};
+
+static int spawn_task(struct cgroup_desc *desc)
+{
+	char *ptr;
+	int pid;
+
+	pid = fork();
+	if (pid < 0)
+		return pid;
+
+	if (pid > 0) {
+		/* parent */
+		desc->pid = pid;
+		return 0;
+	}
+
+	/* child */
+	if (desc->oom_score_adj) {
+		char buf[64];
+		int fd = open("/proc/self/oom_score_adj", O_WRONLY);
+
+		if (fd < 0)
+			return -1;
+
+		snprintf(buf, sizeof(buf), "%d", desc->oom_score_adj);
+		write(fd, buf, sizeof(buf));
+		close(fd);
+	}
+
+	ptr = (char *)malloc(desc->target);
+	if (!ptr)
+		return -ENOMEM;
+
+	memset(ptr, 'a', desc->target);
+
+	while (1)
+		sleep(1000);
+
+	return 0;
+}
+
+static void setup_environment(void)
+{
+	int i, err;
+
+	err = setup_cgroup_environment();
+	if (!ASSERT_OK(err, "setup_cgroup_environment"))
+		goto cleanup;
+
+	for (i = 0; i < ARRAY_SIZE(cgroups); i++) {
+		cgroups[i].fd = create_and_get_cgroup(cgroups[i].path);
+		if (!ASSERT_GE(cgroups[i].fd, 0, "create_and_get_cgroup"))
+			goto cleanup;
+
+		cgroups[i].id = get_cgroup_id(cgroups[i].path);
+		if (!ASSERT_GT(cgroups[i].id, 0, "get_cgroup_id"))
+			goto cleanup;
+
+		/* Freeze the top-level cgroup */
+		if (i == 0) {
+			/* Freeze the top-level cgroup */
+			err = write_cgroup_file(cgroups[i].path, "cgroup.freeze", "1");
+			if (!ASSERT_OK(err, "freeze cgroup"))
+				goto cleanup;
+		}
+
+		/* Recursively enable the memory controller */
+		if (!cgroups[i].target) {
+
+			err = write_cgroup_file(cgroups[i].path, "cgroup.subtree_control",
+						"+memory");
+			if (!ASSERT_OK(err, "enable memory controller"))
+				goto cleanup;
+		}
+
+		/* Set memory.max */
+		if (cgroups[i].max) {
+			char buf[256];
+
+			snprintf(buf, sizeof(buf), "%lu", cgroups[i].max);
+			err = write_cgroup_file(cgroups[i].path, "memory.max", buf);
+			if (!ASSERT_OK(err, "set memory.max"))
+				goto cleanup;
+
+			snprintf(buf, sizeof(buf), "0");
+			write_cgroup_file(cgroups[i].path, "memory.swap.max", buf);
+
+		}
+
+		/* Spawn tasks creating memory pressure */
+		if (cgroups[i].target) {
+			char buf[256];
+
+			err = spawn_task(&cgroups[i]);
+			if (!ASSERT_OK(err, "spawn task"))
+				goto cleanup;
+
+			snprintf(buf, sizeof(buf), "%d", cgroups[i].pid);
+			err = write_cgroup_file(cgroups[i].path, "cgroup.procs", buf);
+			if (!ASSERT_OK(err, "put child into a cgroup"))
+				goto cleanup;
+		}
+	}
+
+	return;
+
+cleanup:
+	cleanup_cgroup_environment();
+
+	// TODO return an error?
+}
+
+static int run_and_wait_for_oom(void)
+{
+	int ret = -1;
+	bool first = true;
+	char buf[4096] = {};
+	size_t size;
+
+	/* Unfreeze the top-level cgroup */
+	ret = write_cgroup_file(cgroups[0].path, "cgroup.freeze", "0");
+	if (!ASSERT_OK(ret, "freeze cgroup"))
+		return -1;
+
+	for (;;) {
+		int i, status;
+		pid_t pid = wait(&status);
+
+		if (pid == -1) {
+			if (errno == EINTR)
+				continue;
+			/* ECHILD */
+			break;
+		}
+
+		if (!first)
+			continue;
+
+		first = false;
+
+		/* Check which process was terminated first */
+		for (i = 0; i < ARRAY_SIZE(cgroups); i++) {
+			if (!ASSERT_OK(cgroups[i].victim !=
+				       (pid == cgroups[i].pid),
+				       "correct process was killed")) {
+				ret = -1;
+				break;
+			}
+
+			if (!cgroups[i].victim)
+				continue;
+
+			/* Check the memcg oom counter */
+			size = read_cgroup_file(cgroups[i].path,
+						"memory.events",
+						buf, sizeof(buf));
+			if (!ASSERT_OK(size <= 0, "read memory.events")) {
+				ret = -1;
+				break;
+			}
+
+			if (!ASSERT_OK(strstr(buf, "oom_kill 1") == NULL,
+				       "oom_kill count check")) {
+				ret = -1;
+				break;
+			}
+		}
+
+		/* Kill all remaining tasks */
+		for (i = 0; i < ARRAY_SIZE(cgroups); i++)
+			if (cgroups[i].pid && cgroups[i].pid != pid)
+				kill(cgroups[i].pid, SIGKILL);
+	}
+
+	return ret;
+}
+
+void test_oom(void)
+{
+	DECLARE_LIBBPF_OPTS(bpf_struct_ops_opts, opts);
+	struct bpf_link *link1 = NULL, *link2 = NULL;
+	struct test_oom *skel;
+	int err = 0;
+
+	setup_environment();
+
+	skel = test_oom__open_and_load();
+	if (!skel) {
+		err = -errno;
+		CHECK_FAIL(err);
+		goto cleanup;
+	}
+
+	opts.flags = BPF_F_CGROUP_FD;
+	opts.target_fd = cgroups[0].fd;
+	link1 = bpf_map__attach_struct_ops_opts(skel->maps.test_bpf_oom, &opts);
+	if (!link1) {
+		err = -errno;
+		CHECK_FAIL(err);
+		goto cleanup;
+	}
+
+	opts.target_fd = get_root_cgroup();
+	link2 = bpf_map__attach_struct_ops_opts(skel->maps.test_bpf_oom, &opts);
+	if (!link2) {
+		err = -errno;
+		CHECK_FAIL(err);
+		goto cleanup;
+	}
+
+	/* Unfreeze all child tasks and create the memory pressure */
+	err = run_and_wait_for_oom();
+	CHECK_FAIL(err);
+
+cleanup:
+	bpf_link__destroy(link1);
+	bpf_link__destroy(link2);
+	write_cgroup_file(cgroups[0].path, "cgroup.kill", "1");
+	write_cgroup_file(cgroups[0].path, "cgroup.freeze", "0");
+	cleanup_cgroup_environment();
+	test_oom__destroy(skel);
+}
diff --git a/tools/testing/selftests/bpf/progs/test_oom.c b/tools/testing/selftests/bpf/progs/test_oom.c
new file mode 100644
index 000000000000..7ff354e416bc
--- /dev/null
+++ b/tools/testing/selftests/bpf/progs/test_oom.c
@@ -0,0 +1,111 @@
+// SPDX-License-Identifier: GPL-2.0-only
+#include "vmlinux.h"
+#include <bpf/bpf_helpers.h>
+#include <bpf/bpf_tracing.h>
+
+char _license[] SEC("license") = "GPL";
+
+#define OOM_SCORE_ADJ_MIN	(-1000)
+
+static bool mem_cgroup_killable(struct mem_cgroup *memcg)
+{
+	struct task_struct *task;
+	bool ret = true;
+
+	bpf_for_each(css_task, task, &memcg->css, CSS_TASK_ITER_PROCS)
+		if (task->signal->oom_score_adj == OOM_SCORE_ADJ_MIN)
+			return false;
+
+	return ret;
+}
+
+/*
+ * Find the largest leaf cgroup (ignoring page cache) without unkillable tasks
+ * and kill all belonging tasks.
+ */
+SEC("struct_ops.s/handle_out_of_memory")
+int BPF_PROG(test_out_of_memory, struct oom_control *oc, struct bpf_struct_ops_link *link)
+{
+	struct task_struct *task;
+	struct mem_cgroup *root_memcg = oc->memcg;
+	struct mem_cgroup *memcg, *victim = NULL;
+	struct cgroup_subsys_state *css_pos, *css;
+	unsigned long usage, max_usage = 0;
+	unsigned long pagecache = 0;
+	int ret = 0;
+
+	if (root_memcg)
+		root_memcg = bpf_get_mem_cgroup(&root_memcg->css);
+	else
+		root_memcg = bpf_get_root_mem_cgroup();
+
+	if (!root_memcg)
+		return 0;
+
+	css = &root_memcg->css;
+	if (css && css->cgroup == link->cgroup)
+		goto exit;
+
+	bpf_rcu_read_lock();
+	bpf_for_each(css, css_pos, &root_memcg->css, BPF_CGROUP_ITER_DESCENDANTS_POST) {
+		if (css_pos->cgroup->nr_descendants + css_pos->cgroup->nr_dying_descendants)
+			continue;
+
+		memcg = bpf_get_mem_cgroup(css_pos);
+		if (!memcg)
+			continue;
+
+		usage = bpf_mem_cgroup_usage(memcg);
+		pagecache = bpf_mem_cgroup_page_state(memcg, NR_FILE_PAGES);
+
+		if (usage > pagecache)
+			usage -= pagecache;
+		else
+			usage = 0;
+
+		if ((usage > max_usage) && mem_cgroup_killable(memcg)) {
+			max_usage = usage;
+			if (victim)
+				bpf_put_mem_cgroup(victim);
+			victim = bpf_get_mem_cgroup(&memcg->css);
+		}
+
+		bpf_put_mem_cgroup(memcg);
+	}
+	bpf_rcu_read_unlock();
+
+	if (!victim)
+		goto exit;
+
+	bpf_for_each(css_task, task, &victim->css, CSS_TASK_ITER_PROCS) {
+		struct task_struct *t = bpf_task_acquire(task);
+
+		if (t) {
+			/*
+			 * If the task is already an OOM victim, it will
+			 * quit soon and release some memory.
+			 */
+			if (bpf_task_is_oom_victim(task)) {
+				bpf_task_release(t);
+				ret = 1;
+				break;
+			}
+
+			bpf_oom_kill_process(oc, task, "bpf oom test");
+			bpf_task_release(t);
+			ret = 1;
+		}
+	}
+
+	bpf_put_mem_cgroup(victim);
+exit:
+	bpf_put_mem_cgroup(root_memcg);
+
+	return ret;
+}
+
+SEC(".struct_ops.link")
+struct bpf_oom_ops test_bpf_oom = {
+	.name = "bpf_test_policy",
+	.handle_out_of_memory = (void *)test_out_of_memory,
+};
-- 
2.52.0



^ permalink raw reply	[flat|nested] 63+ messages in thread

* [PATCH bpf-next v3 13/17] sched: psi: add a trace point to psi_avgs_work()
  2026-01-27  2:44 [PATCH bpf-next v3 00/17] mm: BPF OOM Roman Gushchin
                   ` (11 preceding siblings ...)
  2026-01-27  2:44 ` [PATCH bpf-next v3 12/17] bpf: selftests: BPF OOM struct ops test Roman Gushchin
@ 2026-01-27  2:44 ` Roman Gushchin
  2026-01-27  2:44 ` [PATCH bpf-next v3 14/17] sched: psi: add cgroup_id field to psi_group structure Roman Gushchin
                   ` (2 subsequent siblings)
  15 siblings, 0 replies; 63+ messages in thread
From: Roman Gushchin @ 2026-01-27  2:44 UTC (permalink / raw)
  To: bpf
  Cc: Michal Hocko, Alexei Starovoitov, Matt Bobrowski, Shakeel Butt,
	JP Kobryn, linux-kernel, linux-mm, Suren Baghdasaryan,
	Johannes Weiner, Andrew Morton, Roman Gushchin

Add a tracepoint to psi_avgs_work().

It can be used to attach a BPF handler which monitors PSI values
system-wide or for specific cgroup(s) and potentially performs some
actions, e.g. declares an OOM.
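
For example (a sketch; the threshold logic and program boilerplate are
omitted), a handler can read the PSI running averages directly from the
passed psi_group:

  SEC("tp_btf/psi_avgs_work")
  int BPF_PROG(handle_psi_avgs_work, struct psi_group *group)
  {
          /* 10s running average of memory "some" pressure, in the PSI
           * core's fixed-point representation
           */
          unsigned long mem_some_avg10 = group->avg[PSI_MEM_SOME][0];

          /* ... compare against a threshold and act on it ... */
          return 0;
  }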

Signed-off-by: Roman Gushchin <roman.gushchin@linux.dev>
---
 include/trace/events/psi.h | 27 +++++++++++++++++++++++++++
 kernel/sched/psi.c         |  6 ++++++
 2 files changed, 33 insertions(+)
 create mode 100644 include/trace/events/psi.h

diff --git a/include/trace/events/psi.h b/include/trace/events/psi.h
new file mode 100644
index 000000000000..57c46de18616
--- /dev/null
+++ b/include/trace/events/psi.h
@@ -0,0 +1,27 @@
+/* SPDX-License-Identifier: GPL-2.0 */
+#undef TRACE_SYSTEM
+#define TRACE_SYSTEM psi
+
+#if !defined(_TRACE_PSI_H) || defined(TRACE_HEADER_MULTI_READ)
+#define _TRACE_PSI_H
+
+#include <linux/tracepoint.h>
+
+TRACE_EVENT(psi_avgs_work,
+	TP_PROTO(struct psi_group *group),
+	TP_ARGS(group),
+	TP_STRUCT__entry(
+		__field(struct psi_group *, group)
+	),
+
+	TP_fast_assign(
+		__entry->group = group;
+	),
+
+	TP_printk("group=%p", __entry->group)
+);
+
+#endif /* _TRACE_PSI_H */
+
+/* This part must be outside protection */
+#include <trace/define_trace.h>
diff --git a/kernel/sched/psi.c b/kernel/sched/psi.c
index 59fdb7ebbf22..72757ba2ed96 100644
--- a/kernel/sched/psi.c
+++ b/kernel/sched/psi.c
@@ -141,6 +141,10 @@
 #include <linux/psi.h>
 #include "sched.h"
 
+#define CREATE_TRACE_POINTS
+#include <trace/events/psi.h>
+#undef CREATE_TRACE_POINTS
+
 static int psi_bug __read_mostly;
 
 DEFINE_STATIC_KEY_FALSE(psi_disabled);
@@ -607,6 +611,8 @@ static void psi_avgs_work(struct work_struct *work)
 				group->avg_next_update - now) + 1);
 	}
 
+	trace_psi_avgs_work(group);
+
 	mutex_unlock(&group->avgs_lock);
 }
 
-- 
2.52.0



^ permalink raw reply	[flat|nested] 63+ messages in thread

* [PATCH bpf-next v3 14/17] sched: psi: add cgroup_id field to psi_group structure
  2026-01-27  2:44 [PATCH bpf-next v3 00/17] mm: BPF OOM Roman Gushchin
                   ` (12 preceding siblings ...)
  2026-01-27  2:44 ` [PATCH bpf-next v3 13/17] sched: psi: add a trace point to psi_avgs_work() Roman Gushchin
@ 2026-01-27  2:44 ` Roman Gushchin
  2026-01-27  2:44 ` [PATCH bpf-next v3 15/17] bpf: allow calling bpf_out_of_memory() from a PSI tracepoint Roman Gushchin
  2026-01-27  9:02 ` [PATCH bpf-next v3 00/17] mm: BPF OOM Michal Hocko
  15 siblings, 0 replies; 63+ messages in thread
From: Roman Gushchin @ 2026-01-27  2:44 UTC (permalink / raw)
  To: bpf
  Cc: Michal Hocko, Alexei Starovoitov, Matt Bobrowski, Shakeel Butt,
	JP Kobryn, linux-kernel, linux-mm, Suren Baghdasaryan,
	Johannes Weiner, Andrew Morton, Roman Gushchin

To allow more efficient filtering of cgroups in the psi_avgs_work()
tracepoint handler, let's add a u64 cgroup_id field to the psi_group
structure. For the system-wide PSI group, 0 is used.
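
A tracepoint handler can then filter out uninteresting groups cheaply,
e.g. (a sketch; target_cgroup_id is a hypothetical value known to the
program, for instance set from userspace before loading):

  /* 0 means the system-wide PSI group, see above */
  if (group->cgroup_id != target_cgroup_id)
          return 0;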

Signed-off-by: Roman Gushchin <roman.gushchin@linux.dev>
---
 include/linux/psi_types.h | 4 ++++
 kernel/sched/psi.c        | 1 +
 2 files changed, 5 insertions(+)

diff --git a/include/linux/psi_types.h b/include/linux/psi_types.h
index dd10c22299ab..749a08d48abd 100644
--- a/include/linux/psi_types.h
+++ b/include/linux/psi_types.h
@@ -159,6 +159,10 @@ struct psi_trigger {
 
 struct psi_group {
 	struct psi_group *parent;
+
+	/* Cgroup id for cgroup PSI, 0 for system PSI */
+	u64 cgroup_id;
+
 	bool enabled;
 
 	/* Protects data used by the aggregator */
diff --git a/kernel/sched/psi.c b/kernel/sched/psi.c
index 72757ba2ed96..cf1ec4dc242b 100644
--- a/kernel/sched/psi.c
+++ b/kernel/sched/psi.c
@@ -1124,6 +1124,7 @@ int psi_cgroup_alloc(struct cgroup *cgroup)
 	if (!cgroup->psi)
 		return -ENOMEM;
 
+	cgroup->psi->cgroup_id = cgroup_id(cgroup);
 	cgroup->psi->pcpu = alloc_percpu(struct psi_group_cpu);
 	if (!cgroup->psi->pcpu) {
 		kfree(cgroup->psi);
-- 
2.52.0



^ permalink raw reply	[flat|nested] 63+ messages in thread

* [PATCH bpf-next v3 15/17] bpf: allow calling bpf_out_of_memory() from a PSI tracepoint
  2026-01-27  2:44 [PATCH bpf-next v3 00/17] mm: BPF OOM Roman Gushchin
                   ` (13 preceding siblings ...)
  2026-01-27  2:44 ` [PATCH bpf-next v3 14/17] sched: psi: add cgroup_id field to psi_group structure Roman Gushchin
@ 2026-01-27  2:44 ` Roman Gushchin
  2026-01-27  9:02 ` [PATCH bpf-next v3 00/17] mm: BPF OOM Michal Hocko
  15 siblings, 0 replies; 63+ messages in thread
From: Roman Gushchin @ 2026-01-27  2:44 UTC (permalink / raw)
  To: bpf
  Cc: Michal Hocko, Alexei Starovoitov, Matt Bobrowski, Shakeel Butt,
	JP Kobryn, linux-kernel, linux-mm, Suren Baghdasaryan,
	Johannes Weiner, Andrew Morton, Roman Gushchin

Allow calling bpf_out_of_memory() from a PSI tracepoint to enable
PSI-based OOM killer policies.
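
Together with the two previous patches this allows implementing a
minimal PSI-based OOM policy along these lines (a sketch: the threshold
knob is illustrative, the section name is shown only for illustration,
and since bpf_out_of_memory() is a sleepable kfunc, the tracepoint
program has to be loaded as a sleepable one):

  #include "vmlinux.h"
  #include <bpf/bpf_helpers.h>
  #include <bpf/bpf_tracing.h>

  char _license[] SEC("license") = "GPL";

  /* Illustrative knob, in the PSI core's fixed-point units */
  const volatile unsigned long mem_full_avg10_min;

  SEC("tp_btf/psi_avgs_work")
  int BPF_PROG(psi_based_oom, struct psi_group *group)
  {
          /* Only look at the system-wide PSI group (cgroup_id == 0) */
          if (group->cgroup_id)
                  return 0;

          if (group->avg[PSI_MEM_FULL][0] < mem_full_avg10_min)
                  return 0;

          /* Declare a system-wide OOM; don't wait on oom_lock, a
           * concurrent OOM handling will likely free some memory.
           * For a memcg-scoped OOM a trusted memcg pointer would be
           * passed instead of NULL.
           */
          bpf_out_of_memory(NULL, 0, 0);
          return 0;
  }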

Signed-off-by: Roman Gushchin <roman.gushchin@linux.dev>
---
 mm/oom_kill.c | 10 +++++++++-
 1 file changed, 9 insertions(+), 1 deletion(-)

diff --git a/mm/oom_kill.c b/mm/oom_kill.c
index 53f9f9674658..276cf8a34449 100644
--- a/mm/oom_kill.c
+++ b/mm/oom_kill.c
@@ -1421,6 +1421,13 @@ BTF_KFUNCS_START(bpf_declare_oom_kfuncs)
 BTF_ID_FLAGS(func, bpf_out_of_memory, KF_SLEEPABLE)
 BTF_KFUNCS_END(bpf_declare_oom_kfuncs)
 
+BTF_ID_LIST(bpf_oom_trace_ids)
+#ifdef CONFIG_PSI
+BTF_ID(typedef, btf_trace_psi_avgs_work)
+#else
+BTF_ID_UNUSED
+#endif
+
 static int bpf_declare_oom_kfunc_filter(const struct bpf_prog *prog, u32 kfunc_id)
 {
 	if (!btf_id_set8_contains(&bpf_declare_oom_kfuncs, kfunc_id))
@@ -1430,7 +1437,8 @@ static int bpf_declare_oom_kfunc_filter(const struct bpf_prog *prog, u32 kfunc_i
 	    prog->aux->attach_btf_id == bpf_oom_ops_ids[0])
 		return -EACCES;
 
-	if (prog->type == BPF_PROG_TYPE_TRACING)
+	if (prog->type == BPF_PROG_TYPE_TRACING &&
+	    prog->aux->attach_btf_id != bpf_oom_trace_ids[0])
 		return -EACCES;
 
 	return 0;
-- 
2.52.0



^ permalink raw reply	[flat|nested] 63+ messages in thread

* Re: [PATCH bpf-next v3 11/17] bpf: selftests: introduce read_cgroup_file() helper
  2026-01-27  2:44 ` [PATCH bpf-next v3 11/17] bpf: selftests: introduce read_cgroup_file() helper Roman Gushchin
@ 2026-01-27  3:08   ` bot+bpf-ci
  0 siblings, 0 replies; 63+ messages in thread
From: bot+bpf-ci @ 2026-01-27  3:08 UTC (permalink / raw)
  To: roman.gushchin, bpf
  Cc: mhocko, ast, mattbobrowski, shakeel.butt, inwardvessel,
	linux-kernel, linux-mm, surenb, hannes, akpm, roman.gushchin,
	ast, andrii, daniel, martin.lau, eddyz87, yonghong.song, clm,
	ihor.solodrai

[-- Attachment #1: Type: text/plain, Size: 1685 bytes --]

> diff --git a/tools/testing/selftests/bpf/cgroup_helpers.c b/tools/testing/selftests/bpf/cgroup_helpers.c
> index 20cede4db3ce..fc5f22409ce5 100644
> --- a/tools/testing/selftests/bpf/cgroup_helpers.c
> +++ b/tools/testing/selftests/bpf/cgroup_helpers.c

[ ... ]

> +/**
> + * read_cgroup_file() - Read from a cgroup file
> + * @relative_path: The cgroup path, relative to the workdir
> + * @file: The name of the file in cgroupfs to read from
> + * @buf: Buffer to read from the file
          ^^^^^^^^^^^^^^^^^^^^^^^^

This isn't a bug, but should this say "Buffer to store data read from the
file" or "Buffer to read into"? The current wording suggests data is read
from the buffer, but the code reads from the file into the buffer. Compare
with write_cgroup_file() which says "Buffer to write to the file" - that
phrasing works for writes but the inverse doesn't quite work for reads.

> + * @size: Size of the buffer
> + *
> + * Read from a file in the given cgroup's directory.
> + *
> + * If successful, the number of read bytes is returned.
> + */

[ ... ]


---
AI reviewed your patch. Please fix the bug or email reply why it's not a bug.
See: https://github.com/kernel-patches/vmtest/blob/master/ci/claude/README.md

CI run summary: https://github.com/kernel-patches/bpf/actions/runs/21382620878

AI-authorship-score: low
AI-authorship-explanation: The code closely follows existing patterns in the file with a minor documentation error typical of copy-paste adaptation, suggesting human authorship.
issues-found: 1
issue-severity-score: low
issue-severity-explanation: Minor documentation error in test code where @buf description says the opposite of what the code does.

^ permalink raw reply	[flat|nested] 63+ messages in thread

* Re: [PATCH bpf-next v3 02/17] bpf: allow attaching struct_ops to cgroups
  2026-01-27  2:44 ` [PATCH bpf-next v3 02/17] bpf: allow attaching struct_ops to cgroups Roman Gushchin
@ 2026-01-27  3:08   ` bot+bpf-ci
  2026-01-27  5:49   ` Yafang Shao
                     ` (2 subsequent siblings)
  3 siblings, 0 replies; 63+ messages in thread
From: bot+bpf-ci @ 2026-01-27  3:08 UTC (permalink / raw)
  To: roman.gushchin, bpf
  Cc: mhocko, ast, mattbobrowski, shakeel.butt, inwardvessel,
	linux-kernel, linux-mm, surenb, hannes, akpm, roman.gushchin,
	ast, andrii, daniel, martin.lau, eddyz87, yonghong.song, clm,
	ihor.solodrai

[-- Attachment #1: Type: text/plain, Size: 2642 bytes --]

> diff --git a/kernel/bpf/cgroup.c b/kernel/bpf/cgroup.c
> index 69988af44b37..7b1903be6f69 100644

[ ... ]

> @@ -307,12 +308,23 @@ static void cgroup_bpf_release(struct work_struct *work)
>  					       bpf.release_work);
>  	struct bpf_prog_array *old_array;
>  	struct list_head *storages = &cgrp->bpf.storages;
> +	struct bpf_struct_ops_link *st_link, *st_tmp;
>  	struct bpf_cgroup_storage *storage, *stmp;
> +	LIST_HEAD(st_links);
>
>  	unsigned int atype;
>
>  	cgroup_lock();
>
> +	list_splice_init(&cgrp->bpf.struct_ops_links, &st_links);
> +	list_for_each_entry_safe(st_link, st_tmp, &st_links, list) {
> +		st_link->cgroup = NULL;
> +		st_link->cgroup_removed = true;
> +		cgroup_put(cgrp);
> +		if (IS_ERR(bpf_link_inc_not_zero(&st_link->link)))
> +			list_del(&st_link->list);
> +	}

Can this code access st_link after it has been freed?

Once st_link->cgroup is set to NULL, if another CPU concurrently closes
the link FD, bpf_struct_ops_map_link_dealloc() will see st_link->cgroup
as NULL, skip the cgroup_bpf_detach_struct_ops() call, and proceed to
kfree(st_link). Then the bpf_link_inc_not_zero() call here would access
freed memory.

The race window is between setting st_link->cgroup = NULL and calling
bpf_link_inc_not_zero(). During this window:

  CPU 0 (cgroup_bpf_release)     CPU 1 (close link FD)
  ---------------------------    ---------------------
  st_link->cgroup = NULL
                                 bpf_link_put_direct()
                                   bpf_link_free()
                                     bpf_struct_ops_map_link_dealloc()
                                       if (st_link->cgroup) // false
                                       kfree(st_link) // freed!
  bpf_link_inc_not_zero() // use-after-free

Would it be safer to call bpf_link_inc_not_zero() before modifying
st_link fields? That way, if the link is already at refcount zero,
the loop would skip it; otherwise, the held reference would prevent
the concurrent free.


---
AI reviewed your patch. Please fix the bug or email reply why it's not a bug.
See: https://github.com/kernel-patches/vmtest/blob/master/ci/claude/README.md

CI run summary: https://github.com/kernel-patches/bpf/actions/runs/21382620878

AI-authorship-score: low
AI-authorship-explanation: The commit is authored by a known kernel maintainer with a consistent, human-like technical writing style and appropriate kernel development patterns.
issues-found: 1
issue-severity-score: medium
issue-severity-explanation: Use-after-free race condition in cgroup_bpf_release() that can cause kernel crash when link FD is closed concurrently with cgroup deletion.

^ permalink raw reply	[flat|nested] 63+ messages in thread

* Re: [PATCH bpf-next v3 04/17] libbpf: introduce bpf_map__attach_struct_ops_opts()
  2026-01-27  2:44 ` [PATCH bpf-next v3 04/17] libbpf: introduce bpf_map__attach_struct_ops_opts() Roman Gushchin
@ 2026-01-27  3:08   ` bot+bpf-ci
  0 siblings, 0 replies; 63+ messages in thread
From: bot+bpf-ci @ 2026-01-27  3:08 UTC (permalink / raw)
  To: roman.gushchin, bpf
  Cc: mhocko, ast, mattbobrowski, shakeel.butt, inwardvessel,
	linux-kernel, linux-mm, surenb, hannes, akpm, roman.gushchin,
	ast, andrii, daniel, martin.lau, eddyz87, yonghong.song, clm,
	ihor.solodrai

[-- Attachment #1: Type: text/plain, Size: 3013 bytes --]

> diff --git a/tools/lib/bpf/libbpf.c b/tools/lib/bpf/libbpf.c
> index 46d2762f5993..9ba67089bf9d 100644
> --- a/tools/lib/bpf/libbpf.c
> +++ b/tools/lib/bpf/libbpf.c

[ ... ]

> @@ -13503,7 +13510,9 @@ struct bpf_link *bpf_map__attach_struct_ops(const struct bpf_map *map)
>  		return &link->link;
>  	}
>
> -	fd = bpf_link_create(map->fd, 0, BPF_STRUCT_OPS, NULL);
> +	link_opts.flags = OPTS_GET(opts, flags, 0);
> +	target_fd = OPTS_GET(opts, target_fd, 0);
> +	fd = bpf_link_create(map->fd, target_fd, BPF_STRUCT_OPS, &link_opts);

The code reads flags and target_fd from opts, but the expected_revision field
declared in struct bpf_struct_ops_opts is never read. Looking at similar opts
structures (bpf_cgroup_opts, bpf_tcx_opts, bpf_netkit_opts), they all pass
expected_revision to bpf_link_create_opts. For example:

    link_create_opts.cgroup.expected_revision = OPTS_GET(opts, expected_revision, 0);

However, bpf_link_create_opts has no struct_ops sub-struct to receive
expected_revision, and the kernel's bpf_attr link_create union has no
struct_ops.expected_revision field either.

Is expected_revision in bpf_struct_ops_opts intentionally unused? If so,
should the field be removed from the struct to avoid user confusion? If users
set expected_revision expecting it to be honored, the value is silently
ignored.

[ ... ]

> diff --git a/tools/lib/bpf/libbpf.h b/tools/lib/bpf/libbpf.h
> index dfc37a615578..2c28cf80e7fe 100644
> --- a/tools/lib/bpf/libbpf.h
> +++ b/tools/lib/bpf/libbpf.h
> @@ -920,6 +920,20 @@ bpf_program__attach_cgroup_opts(const struct bpf_program *prog, int cgroup_fd,
>  struct bpf_map;
>
>  LIBBPF_API struct bpf_link *bpf_map__attach_struct_ops(const struct bpf_map *map);
> +
> +struct bpf_struct_ops_opts {
> +	/* size of this struct, for forward/backward compatibility */
> +	size_t sz;
> +	__u32 flags;
> +	__u32 target_fd;
> +	__u64 expected_revision;
                 ^^^^^^^^^^^^^^^^^

The expected_revision field is declared here, matching the pattern used in
bpf_cgroup_opts, bpf_tcx_opts, and bpf_netkit_opts. However, unlike those
structures where expected_revision is read and passed through, this field
has no corresponding code path in bpf_map__attach_struct_ops_opts().

> +	size_t :0;
> +};
> +#define bpf_struct_ops_opts__last_field expected_revision

[ ... ]


---
AI reviewed your patch. Please fix the bug or email reply why it's not a bug.
See: https://github.com/kernel-patches/vmtest/blob/master/ci/claude/README.md

CI run summary: https://github.com/kernel-patches/bpf/actions/runs/21382620878

AI-authorship-score: low
AI-authorship-explanation: The commit follows kernel conventions and existing code patterns, with an oversight (unused expected_revision field) more typical of human development than AI-generated code.
issues-found: 1
issue-severity-score: low
issue-severity-explanation: The expected_revision field in bpf_struct_ops_opts is declared but never used, causing silent data loss if users set this field expecting it to be honored.

^ permalink raw reply	[flat|nested] 63+ messages in thread

* Re: [PATCH bpf-next v3 02/17] bpf: allow attaching struct_ops to cgroups
  2026-01-27  2:44 ` [PATCH bpf-next v3 02/17] bpf: allow attaching struct_ops to cgroups Roman Gushchin
  2026-01-27  3:08   ` bot+bpf-ci
@ 2026-01-27  5:49   ` Yafang Shao
  2026-01-28  3:10   ` Josh Don
  2026-01-28 11:25   ` Matt Bobrowski
  3 siblings, 0 replies; 63+ messages in thread
From: Yafang Shao @ 2026-01-27  5:49 UTC (permalink / raw)
  To: Roman Gushchin
  Cc: bpf, Michal Hocko, Alexei Starovoitov, Matt Bobrowski,
	Shakeel Butt, JP Kobryn, linux-kernel, linux-mm,
	Suren Baghdasaryan, Johannes Weiner, Andrew Morton

On Tue, Jan 27, 2026 at 10:47 AM Roman Gushchin
<roman.gushchin@linux.dev> wrote:
>
> Introduce an ability to attach bpf struct_ops'es to cgroups.
>
> From user's standpoint it works in the following way:
> a user passes a BPF_F_CGROUP_FD flag and specifies the target cgroup

Since both fdinfo and link_info show the cgroup ID, why not use
BPF_F_CGROUP_ID for better alignment?

> fd while creating a struct_ops link. As the result, the bpf struct_ops
> link will be created and attached to a cgroup.
>
> The cgroup.bpf structure maintains a list of attached struct ops links.
> If the cgroup is getting deleted, attached struct ops'es are getting
> auto-detached and the userspace program gets a notification.
>
> This change doesn't answer the question how bpf programs belonging
> to these struct ops'es will be executed. It will be done individually
> for every bpf struct ops which supports this.
>
> Please, note that unlike "normal" bpf programs, struct ops'es
> are not propagated to cgroup sub-trees.
>
> Signed-off-by: Roman Gushchin <roman.gushchin@linux.dev>
> ---
>  include/linux/bpf-cgroup-defs.h |  3 ++
>  include/linux/bpf-cgroup.h      | 16 +++++++++
>  include/linux/bpf.h             |  3 ++
>  include/uapi/linux/bpf.h        |  3 ++
>  kernel/bpf/bpf_struct_ops.c     | 59 ++++++++++++++++++++++++++++++---
>  kernel/bpf/cgroup.c             | 46 +++++++++++++++++++++++++
>  tools/include/uapi/linux/bpf.h  |  1 +
>  7 files changed, 127 insertions(+), 4 deletions(-)
>
> diff --git a/include/linux/bpf-cgroup-defs.h b/include/linux/bpf-cgroup-defs.h
> index c9e6b26abab6..6c5e37190dad 100644
> --- a/include/linux/bpf-cgroup-defs.h
> +++ b/include/linux/bpf-cgroup-defs.h
> @@ -71,6 +71,9 @@ struct cgroup_bpf {
>         /* temp storage for effective prog array used by prog_attach/detach */
>         struct bpf_prog_array *inactive;
>
> +       /* list of bpf struct ops links */
> +       struct list_head struct_ops_links;
> +
>         /* reference counter used to detach bpf programs after cgroup removal */
>         struct percpu_ref refcnt;
>
> diff --git a/include/linux/bpf-cgroup.h b/include/linux/bpf-cgroup.h
> index 2f535331f926..a6c327257006 100644
> --- a/include/linux/bpf-cgroup.h
> +++ b/include/linux/bpf-cgroup.h
> @@ -423,6 +423,11 @@ int cgroup_bpf_link_attach(const union bpf_attr *attr, struct bpf_prog *prog);
>  int cgroup_bpf_prog_query(const union bpf_attr *attr,
>                           union bpf_attr __user *uattr);
>
> +int cgroup_bpf_attach_struct_ops(struct cgroup *cgrp,
> +                                struct bpf_struct_ops_link *link);
> +void cgroup_bpf_detach_struct_ops(struct cgroup *cgrp,
> +                                 struct bpf_struct_ops_link *link);
> +
>  const struct bpf_func_proto *
>  cgroup_common_func_proto(enum bpf_func_id func_id, const struct bpf_prog *prog);
>  #else
> @@ -451,6 +456,17 @@ static inline int cgroup_bpf_link_attach(const union bpf_attr *attr,
>         return -EINVAL;
>  }
>
> +static inline int cgroup_bpf_attach_struct_ops(struct cgroup *cgrp,
> +                                              struct bpf_struct_ops_link *link)
> +{
> +       return -EINVAL;
> +}
> +
> +static inline void cgroup_bpf_detach_struct_ops(struct cgroup *cgrp,
> +                                               struct bpf_struct_ops_link *link)
> +{
> +}
> +
>  static inline int cgroup_bpf_prog_query(const union bpf_attr *attr,
>                                         union bpf_attr __user *uattr)
>  {
> diff --git a/include/linux/bpf.h b/include/linux/bpf.h
> index 899dd911dc82..391888eb257c 100644
> --- a/include/linux/bpf.h
> +++ b/include/linux/bpf.h
> @@ -1894,6 +1894,9 @@ struct bpf_raw_tp_link {
>  struct bpf_struct_ops_link {
>         struct bpf_link link;
>         struct bpf_map __rcu *map;
> +       struct cgroup *cgroup;
> +       bool cgroup_removed;
> +       struct list_head list;

We may need to support other structs in the future.
Could we implement a more generic solution, such as:

           int type;  // cgroup, task, etc
           void *private;  // ptr to type-specific data

-- 
Regards
Yafang


^ permalink raw reply	[flat|nested] 63+ messages in thread

* Re: [PATCH bpf-next v3 01/17] bpf: move bpf_struct_ops_link into bpf.h
  2026-01-27  2:44 ` [PATCH bpf-next v3 01/17] bpf: move bpf_struct_ops_link into bpf.h Roman Gushchin
@ 2026-01-27  5:50   ` Yafang Shao
  2026-01-28 11:28   ` Matt Bobrowski
  1 sibling, 0 replies; 63+ messages in thread
From: Yafang Shao @ 2026-01-27  5:50 UTC (permalink / raw)
  To: Roman Gushchin
  Cc: bpf, Michal Hocko, Alexei Starovoitov, Matt Bobrowski,
	Shakeel Butt, JP Kobryn, linux-kernel, linux-mm,
	Suren Baghdasaryan, Johannes Weiner, Andrew Morton

On Tue, Jan 27, 2026 at 10:46 AM Roman Gushchin
<roman.gushchin@linux.dev> wrote:
>
> Move struct bpf_struct_ops_link's definition into bpf.h,
> where other custom bpf links definitions are.
>
> It's necessary to access its members from outside of generic
> bpf_struct_ops implementation, which will be done by following
> patches in the series.
>
> Signed-off-by: Roman Gushchin <roman.gushchin@linux.dev>

Feel free to add:

Acked-by: Yafang Shao <laoar.shao@gmail.com>

> ---
>  include/linux/bpf.h         | 6 ++++++
>  kernel/bpf/bpf_struct_ops.c | 6 ------
>  2 files changed, 6 insertions(+), 6 deletions(-)
>
> diff --git a/include/linux/bpf.h b/include/linux/bpf.h
> index 4427c6e98331..899dd911dc82 100644
> --- a/include/linux/bpf.h
> +++ b/include/linux/bpf.h
> @@ -1891,6 +1891,12 @@ struct bpf_raw_tp_link {
>         u64 cookie;
>  };
>
> +struct bpf_struct_ops_link {
> +       struct bpf_link link;
> +       struct bpf_map __rcu *map;
> +       wait_queue_head_t wait_hup;
> +};
> +
>  struct bpf_link_primer {
>         struct bpf_link *link;
>         struct file *file;
> diff --git a/kernel/bpf/bpf_struct_ops.c b/kernel/bpf/bpf_struct_ops.c
> index c43346cb3d76..de01cf3025b3 100644
> --- a/kernel/bpf/bpf_struct_ops.c
> +++ b/kernel/bpf/bpf_struct_ops.c
> @@ -55,12 +55,6 @@ struct bpf_struct_ops_map {
>         struct bpf_struct_ops_value kvalue;
>  };
>
> -struct bpf_struct_ops_link {
> -       struct bpf_link link;
> -       struct bpf_map __rcu *map;
> -       wait_queue_head_t wait_hup;
> -};
> -
>  static DEFINE_MUTEX(update_mutex);
>
>  #define VALUE_PREFIX "bpf_struct_ops_"
> --
> 2.52.0
>
>


-- 
Regards
Yafang


^ permalink raw reply	[flat|nested] 63+ messages in thread

* Re: [PATCH bpf-next v3 03/17] libbpf: fix return value on memory allocation failure
  2026-01-27  2:44 ` [PATCH bpf-next v3 03/17] libbpf: fix return value on memory allocation failure Roman Gushchin
@ 2026-01-27  5:52   ` Yafang Shao
  0 siblings, 0 replies; 63+ messages in thread
From: Yafang Shao @ 2026-01-27  5:52 UTC (permalink / raw)
  To: Roman Gushchin
  Cc: bpf, Michal Hocko, Alexei Starovoitov, Matt Bobrowski,
	Shakeel Butt, JP Kobryn, linux-kernel, linux-mm,
	Suren Baghdasaryan, Johannes Weiner, Andrew Morton

On Tue, Jan 27, 2026 at 10:53 AM Roman Gushchin
<roman.gushchin@linux.dev> wrote:
>
> bpf_map__attach_struct_ops() returns -EINVAL instead of -ENOMEM
> on the memory allocation failure. Fix it.
>
> Fixes: 590a00888250 ("bpf: libbpf: Add STRUCT_OPS support")
> Signed-off-by: Roman Gushchin <roman.gushchin@linux.dev>

Acked-by: Yafang Shao <laoar.shao@gmail.com>

> ---
>  tools/lib/bpf/libbpf.c | 2 +-
>  1 file changed, 1 insertion(+), 1 deletion(-)
>
> diff --git a/tools/lib/bpf/libbpf.c b/tools/lib/bpf/libbpf.c
> index 0c8bf0b5cce4..46d2762f5993 100644
> --- a/tools/lib/bpf/libbpf.c
> +++ b/tools/lib/bpf/libbpf.c
> @@ -13480,7 +13480,7 @@ struct bpf_link *bpf_map__attach_struct_ops(const struct bpf_map *map)
>
>         link = calloc(1, sizeof(*link));
>         if (!link)
> -               return libbpf_err_ptr(-EINVAL);
> +               return libbpf_err_ptr(-ENOMEM);
>
>         /* kern_vdata should be prepared during the loading phase. */
>         err = bpf_map_update_elem(map->fd, &zero, map->st_ops->kern_vdata, 0);
> --
> 2.52.0
>
>


-- 
Regards
Yafang


^ permalink raw reply	[flat|nested] 63+ messages in thread

* Re: [PATCH bpf-next v3 05/17] bpf: mark struct oom_control's memcg field as TRUSTED_OR_NULL
  2026-01-27  2:44 ` [PATCH bpf-next v3 05/17] bpf: mark struct oom_control's memcg field as TRUSTED_OR_NULL Roman Gushchin
@ 2026-01-27  6:06   ` Yafang Shao
  2026-02-02  4:56   ` Matt Bobrowski
  1 sibling, 0 replies; 63+ messages in thread
From: Yafang Shao @ 2026-01-27  6:06 UTC (permalink / raw)
  To: Roman Gushchin
  Cc: bpf, Michal Hocko, Alexei Starovoitov, Matt Bobrowski,
	Shakeel Butt, JP Kobryn, linux-kernel, linux-mm,
	Suren Baghdasaryan, Johannes Weiner, Andrew Morton,
	Kumar Kartikeya Dwivedi

On Tue, Jan 27, 2026 at 10:49 AM Roman Gushchin
<roman.gushchin@linux.dev> wrote:
>
> Struct oom_control is used to describe the OOM context.
> It's memcg field defines the scope of OOM: it's NULL for global
> OOMs and a valid memcg pointer for memcg-scoped OOMs.
> Teach bpf verifier to recognize it as trusted or NULL pointer.
> It will provide the bpf OOM handler a trusted memcg pointer,
> which for example is required for iterating the memcg's subtree.
>
> Signed-off-by: Roman Gushchin <roman.gushchin@linux.dev>
> Acked-by: Kumar Kartikeya Dwivedi <memxor@gmail.com>

Acked-by: Yafang Shao <laoar.shao@gmail.com>

> ---
>  kernel/bpf/verifier.c | 5 +++++
>  1 file changed, 5 insertions(+)
>
> diff --git a/kernel/bpf/verifier.c b/kernel/bpf/verifier.c
> index c2f2650db9fd..cca36edb460d 100644
> --- a/kernel/bpf/verifier.c
> +++ b/kernel/bpf/verifier.c
> @@ -7242,6 +7242,10 @@ BTF_TYPE_SAFE_TRUSTED_OR_NULL(struct vm_area_struct) {
>         struct file *vm_file;
>  };
>
> +BTF_TYPE_SAFE_TRUSTED_OR_NULL(struct oom_control) {
> +       struct mem_cgroup *memcg;
> +};
> +
>  static bool type_is_rcu(struct bpf_verifier_env *env,
>                         struct bpf_reg_state *reg,
>                         const char *field_name, u32 btf_id)
> @@ -7284,6 +7288,7 @@ static bool type_is_trusted_or_null(struct bpf_verifier_env *env,
>         BTF_TYPE_EMIT(BTF_TYPE_SAFE_TRUSTED_OR_NULL(struct socket));
>         BTF_TYPE_EMIT(BTF_TYPE_SAFE_TRUSTED_OR_NULL(struct dentry));
>         BTF_TYPE_EMIT(BTF_TYPE_SAFE_TRUSTED_OR_NULL(struct vm_area_struct));
> +       BTF_TYPE_EMIT(BTF_TYPE_SAFE_TRUSTED_OR_NULL(struct oom_control));
>
>         return btf_nested_type_is_trusted(&env->log, reg, field_name, btf_id,
>                                           "__safe_trusted_or_null");
> --
> 2.52.0
>
>


-- 
Regards
Yafang


^ permalink raw reply	[flat|nested] 63+ messages in thread

* Re: [PATCH bpf-next v3 06/17] mm: define mem_cgroup_get_from_ino() outside of CONFIG_SHRINKER_DEBUG
  2026-01-27  2:44 ` [PATCH bpf-next v3 06/17] mm: define mem_cgroup_get_from_ino() outside of CONFIG_SHRINKER_DEBUG Roman Gushchin
@ 2026-01-27  6:12   ` Yafang Shao
  2026-02-02  3:50   ` Shakeel Butt
  1 sibling, 0 replies; 63+ messages in thread
From: Yafang Shao @ 2026-01-27  6:12 UTC (permalink / raw)
  To: Roman Gushchin
  Cc: bpf, Michal Hocko, Alexei Starovoitov, Matt Bobrowski,
	Shakeel Butt, JP Kobryn, linux-kernel, linux-mm,
	Suren Baghdasaryan, Johannes Weiner, Andrew Morton

On Tue, Jan 27, 2026 at 10:49 AM Roman Gushchin
<roman.gushchin@linux.dev> wrote:
>
> mem_cgroup_get_from_ino() can be reused by the BPF OOM implementation,
> but currently depends on CONFIG_SHRINKER_DEBUG. Remove this dependency.
>
> Signed-off-by: Roman Gushchin <roman.gushchin@linux.dev>
> Acked-by: Michal Hocko <mhocko@suse.com>
> ---
>  include/linux/memcontrol.h | 4 ++--
>  mm/memcontrol.c            | 2 --
>  2 files changed, 2 insertions(+), 4 deletions(-)
>
> diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h
> index 229ac9835adb..f3b8c71870d8 100644
> --- a/include/linux/memcontrol.h
> +++ b/include/linux/memcontrol.h
> @@ -833,9 +833,9 @@ static inline unsigned long mem_cgroup_ino(struct mem_cgroup *memcg)
>  {
>         return memcg ? cgroup_ino(memcg->css.cgroup) : 0;
>  }
> +#endif
>
>  struct mem_cgroup *mem_cgroup_get_from_ino(unsigned long ino);
> -#endif
>
>  static inline struct mem_cgroup *mem_cgroup_from_seq(struct seq_file *m)
>  {
> @@ -1298,12 +1298,12 @@ static inline unsigned long mem_cgroup_ino(struct mem_cgroup *memcg)
>  {
>         return 0;
>  }
> +#endif

Given that mem_cgroup_ino() pairs with mem_cgroup_get_from_ino(),
should we also define mem_cgroup_ino() outside CONFIG_SHRINKER_DEBUG?

>
>  static inline struct mem_cgroup *mem_cgroup_get_from_ino(unsigned long ino)
>  {
>         return NULL;
>  }
> -#endif

-- 
Regards
Yafang


^ permalink raw reply	[flat|nested] 63+ messages in thread

* Re: [PATCH bpf-next v3 00/17] mm: BPF OOM
  2026-01-27  2:44 [PATCH bpf-next v3 00/17] mm: BPF OOM Roman Gushchin
                   ` (14 preceding siblings ...)
  2026-01-27  2:44 ` [PATCH bpf-next v3 15/17] bpf: allow calling bpf_out_of_memory() from a PSI tracepoint Roman Gushchin
@ 2026-01-27  9:02 ` Michal Hocko
  2026-01-27 21:01   ` Roman Gushchin
  15 siblings, 1 reply; 63+ messages in thread
From: Michal Hocko @ 2026-01-27  9:02 UTC (permalink / raw)
  To: Roman Gushchin
  Cc: bpf, Alexei Starovoitov, Matt Bobrowski, Shakeel Butt, JP Kobryn,
	linux-kernel, linux-mm, Suren Baghdasaryan, Johannes Weiner,
	Andrew Morton

On Mon 26-01-26 18:44:03, Roman Gushchin wrote:
> This patchset adds an ability to customize the out of memory
> handling using bpf.
> 
> It focuses on two parts:
> 1) OOM handling policy,
> 2) PSI-based OOM invocation.
> 
> The idea to use bpf for customizing the OOM handling is not new, but
> unlike the previous proposal [1], which augmented the existing task
> ranking policy, this one tries to be as generic as possible and
> leverage the full power of the modern bpf.
> 
> It provides a generic interface which is called before the existing OOM
> killer code and allows implementing any policy, e.g. picking a victim
> task or memory cgroup or potentially even releasing memory in other
> ways, e.g. deleting tmpfs files (the last one might require some
> additional but relatively simple changes).

Are you planning to write any highlevel documentation on how to use the
existing infrastructure to implement proper/correct OOM handlers with
these generic interfaces?
-- 
Michal Hocko
SUSE Labs


^ permalink raw reply	[flat|nested] 63+ messages in thread

* Re: [PATCH bpf-next v3 07/17] mm: introduce BPF OOM struct ops
  2026-01-27  2:44 ` [PATCH bpf-next v3 07/17] mm: introduce BPF OOM struct ops Roman Gushchin
@ 2026-01-27  9:38   ` Michal Hocko
  2026-01-27 21:12     ` Roman Gushchin
  2026-01-28  3:26   ` Josh Don
                     ` (2 subsequent siblings)
  3 siblings, 1 reply; 63+ messages in thread
From: Michal Hocko @ 2026-01-27  9:38 UTC (permalink / raw)
  To: Roman Gushchin
  Cc: bpf, Alexei Starovoitov, Matt Bobrowski, Shakeel Butt, JP Kobryn,
	linux-kernel, linux-mm, Suren Baghdasaryan, Johannes Weiner,
	Andrew Morton

On Mon 26-01-26 18:44:10, Roman Gushchin wrote:
> Introduce a bpf struct ops for implementing custom OOM handling
> policies.
> 
> It's possible to load one bpf_oom_ops for the system and one
> bpf_oom_ops for every memory cgroup. In case of a memcg OOM, the
> cgroup tree is traversed from the OOM'ing memcg up to the root and
> corresponding BPF OOM handlers are executed until some memory is
> freed. If no memory is freed, the kernel OOM killer is invoked.
> 
> The struct ops provides the bpf_handle_out_of_memory() callback,
> which expected to return 1 if it was able to free some memory and 0
> otherwise. If 1 is returned, the kernel also checks the bpf_memory_freed
> field of the oom_control structure, which is expected to be set by
> kfuncs suitable for releasing memory (which will be introduced later
> in the patch series). If both are set, OOM is considered handled,
> otherwise the next OOM handler in the chain is executed: e.g. BPF OOM
> attached to the parent cgroup or the kernel OOM killer.

I still find this dual reporting a bit confusing. I can see your
intention in having a pre-defined "releasers" of the memory to trust BPF
handlers more but they do have access to oc->bpf_memory_freed so they
can manipulate it. Therefore an additional level of protection is rather
weak. 

It is also not really clear to me how this works while there is OOM
victim on the way out. (i.e. tsk_is_oom_victim() -> abort case). This
will result in no killing therefore no bpf_memory_freed, right? Handler
itself should consider its work done. How exactly is this handled.

Also is there any way to handle the oom by increasing the memcg limit?
I do not see a callback for that.
-- 
Michal Hocko
SUSE Labs


^ permalink raw reply	[flat|nested] 63+ messages in thread

* Re: [PATCH bpf-next v3 08/17] mm: introduce bpf_oom_kill_process() bpf kfunc
  2026-01-27  2:44 ` [PATCH bpf-next v3 08/17] mm: introduce bpf_oom_kill_process() bpf kfunc Roman Gushchin
@ 2026-01-27 20:21   ` Martin KaFai Lau
  2026-01-27 20:47     ` Roman Gushchin
  2026-02-02  4:49   ` Matt Bobrowski
  1 sibling, 1 reply; 63+ messages in thread
From: Martin KaFai Lau @ 2026-01-27 20:21 UTC (permalink / raw)
  To: Roman Gushchin
  Cc: Michal Hocko, Alexei Starovoitov, Matt Bobrowski, Shakeel Butt,
	JP Kobryn, linux-kernel, linux-mm, Suren Baghdasaryan,
	Johannes Weiner, Andrew Morton, bpf

On 1/26/26 6:44 PM, Roman Gushchin wrote:
> +static int bpf_oom_kfunc_filter(const struct bpf_prog *prog, u32 kfunc_id)

The filter callback is registered for BPF_PROG_TYPE_STRUCT_OPS. It is 
checking if a kfunc_id is allowed for other struct_ops progs also, e.g. 
the bpf-tcp-cc struct_ops progs.


> +{
> +	if (prog->type != BPF_PROG_TYPE_STRUCT_OPS ||
> +	    prog->aux->attach_btf_id != bpf_oom_ops_ids[0])
> +		return -EACCES;

The 'return -EACCES' should be the cause of the "calling kernel function 
XXX is not allowed" error reported by the CI. Take a look at 
btf_kfunc_is_allowed().

Take a look at bpf_qdisc_kfunc_filter(). I suspect it should be 
something like this, untested:

         if (btf_id_set8_contains(&bpf_oom_kfuncs, kfunc_id) &&
	    prog->aux->st_ops != &bpf_oom_bpf_ops)
                 return -EACCES;

         return 0;

> +
> +static const struct btf_kfunc_id_set bpf_oom_kfunc_set = {
> +	.owner          = THIS_MODULE,
> +	.set            = &bpf_oom_kfuncs,
> +	.filter         = bpf_oom_kfunc_filter,
> +};
> +
> +static int __init bpf_oom_init(void)
> +{
> +	int err;
> +
> +	err = register_btf_kfunc_id_set(BPF_PROG_TYPE_STRUCT_OPS,
> +					&bpf_oom_kfunc_set);



^ permalink raw reply	[flat|nested] 63+ messages in thread

* Re: [PATCH bpf-next v3 08/17] mm: introduce bpf_oom_kill_process() bpf kfunc
  2026-01-27 20:21   ` Martin KaFai Lau
@ 2026-01-27 20:47     ` Roman Gushchin
  0 siblings, 0 replies; 63+ messages in thread
From: Roman Gushchin @ 2026-01-27 20:47 UTC (permalink / raw)
  To: Martin KaFai Lau
  Cc: Michal Hocko, Alexei Starovoitov, Matt Bobrowski, Shakeel Butt,
	JP Kobryn, linux-kernel, linux-mm, Suren Baghdasaryan,
	Johannes Weiner, Andrew Morton, bpf

Martin KaFai Lau <martin.lau@linux.dev> writes:

> On 1/26/26 6:44 PM, Roman Gushchin wrote:
>> +static int bpf_oom_kfunc_filter(const struct bpf_prog *prog, u32 kfunc_id)
>
> The filter callback is registered for BPF_PROG_TYPE_STRUCT_OPS. It is
> checking if a kfunc_id is allowed for other struct_ops progs also,
> e.g. the bpf-tcp-cc struct_ops progs.
>
>
>> +{
>> +	if (prog->type != BPF_PROG_TYPE_STRUCT_OPS ||
>> +	    prog->aux->attach_btf_id != bpf_oom_ops_ids[0])
>> +		return -EACCES;
>
> The 'return -EACCES' should be the cause of the "calling kernel
> function XXX is not allowed" error reported by the CI. Take a look at
> btf_kfunc_is_allowed().
>
> Take a look at bpf_qdisc_kfunc_filter(). I suspect it should be
> something like this, untested:
>
>         if (btf_id_set8_contains(&bpf_oom_kfuncs, kfunc_id) &&
> 	    prog->aux->st_ops != &bpf_oom_bpf_ops)
>                 return -EACCES;
>
>         return 0;

Oh, I see.. It's a bit surprising that these .filter() functions
have non-local effects...

Will fix in v4.

Thank you, Martin!


^ permalink raw reply	[flat|nested] 63+ messages in thread

* Re: [PATCH bpf-next v3 00/17] mm: BPF OOM
  2026-01-27  9:02 ` [PATCH bpf-next v3 00/17] mm: BPF OOM Michal Hocko
@ 2026-01-27 21:01   ` Roman Gushchin
  2026-01-28  8:06     ` Michal Hocko
  0 siblings, 1 reply; 63+ messages in thread
From: Roman Gushchin @ 2026-01-27 21:01 UTC (permalink / raw)
  To: Michal Hocko
  Cc: bpf, Alexei Starovoitov, Matt Bobrowski, Shakeel Butt, JP Kobryn,
	linux-kernel, linux-mm, Suren Baghdasaryan, Johannes Weiner,
	Andrew Morton

Michal Hocko <mhocko@suse.com> writes:

> On Mon 26-01-26 18:44:03, Roman Gushchin wrote:
>> This patchset adds an ability to customize the out of memory
>> handling using bpf.
>> 
>> It focuses on two parts:
>> 1) OOM handling policy,
>> 2) PSI-based OOM invocation.
>> 
>> The idea to use bpf for customizing the OOM handling is not new, but
>> unlike the previous proposal [1], which augmented the existing task
>> ranking policy, this one tries to be as generic as possible and
>> leverage the full power of the modern bpf.
>> 
>> It provides a generic interface which is called before the existing OOM
>> killer code and allows implementing any policy, e.g. picking a victim
>> task or memory cgroup or potentially even releasing memory in other
>> ways, e.g. deleting tmpfs files (the last one might require some
>> additional but relatively simple changes).
>
> Are you planning to write any highlevel documentation on how to use the
> existing infrastructure to implement proper/correct OOM handlers with
> these generic interfaces?

What do you expect from such a document, can you please elaborate?
I'm asking because the main promise of bpf is to provide some sort
of a safe playground, so anyone can experiment with writing their
bpf implementations (like sched_ext schedulers or bpf oom policies)
with minimal risk. Yes, it might work sub-optimally and kill too many
tasks, but it won't crash or deadlock the system.
So in a way I don't want to prescribe the "right way" of writing
an OOM handler, but it totally makes sense to provide an example.

As of now the best way to get an example of a bpf handler is to look
into the commit "[PATCH bpf-next v3 12/17] bpf: selftests: BPF OOM
struct ops test".

Another viable idea (also suggested by Andrew Morton) is to develop
a production-ready memcg-aware OOM killer in BPF, put the source code
into the kernel tree and make it loadable by default (obviously under a
config option). I or one of my colleagues will try to explore it a
bit later: the tricky part is this by-default loading because there are
no existing precedents.

Thanks!


^ permalink raw reply	[flat|nested] 63+ messages in thread

* Re: [PATCH bpf-next v3 07/17] mm: introduce BPF OOM struct ops
  2026-01-27  9:38   ` Michal Hocko
@ 2026-01-27 21:12     ` Roman Gushchin
  2026-01-28  8:00       ` Michal Hocko
  2026-02-02  4:06       ` Matt Bobrowski
  0 siblings, 2 replies; 63+ messages in thread
From: Roman Gushchin @ 2026-01-27 21:12 UTC (permalink / raw)
  To: Michal Hocko
  Cc: bpf, Alexei Starovoitov, Matt Bobrowski, Shakeel Butt, JP Kobryn,
	linux-kernel, linux-mm, Suren Baghdasaryan, Johannes Weiner,
	Andrew Morton

Michal Hocko <mhocko@suse.com> writes:

> On Mon 26-01-26 18:44:10, Roman Gushchin wrote:
>> Introduce a bpf struct ops for implementing custom OOM handling
>> policies.
>> 
>> It's possible to load one bpf_oom_ops for the system and one
>> bpf_oom_ops for every memory cgroup. In case of a memcg OOM, the
>> cgroup tree is traversed from the OOM'ing memcg up to the root and
>> corresponding BPF OOM handlers are executed until some memory is
>> freed. If no memory is freed, the kernel OOM killer is invoked.
>> 
>> The struct ops provides the bpf_handle_out_of_memory() callback,
>> which expected to return 1 if it was able to free some memory and 0
>> otherwise. If 1 is returned, the kernel also checks the bpf_memory_freed
>> field of the oom_control structure, which is expected to be set by
>> kfuncs suitable for releasing memory (which will be introduced later
>> in the patch series). If both are set, OOM is considered handled,
>> otherwise the next OOM handler in the chain is executed: e.g. BPF OOM
>> attached to the parent cgroup or the kernel OOM killer.
>
> I still find this dual reporting a bit confusing. I can see your
> intention in having a pre-defined "releasers" of the memory to trust BPF
> handlers more but they do have access to oc->bpf_memory_freed so they
> can manipulate it. Therefore an additional level of protection is rather
> weak.

No, they can't. They only have read-only access.

> It is also not really clear to me how this works while there is OOM
> victim on the way out. (i.e. tsk_is_oom_victim() -> abort case). This
> will result in no killing therefore no bpf_memory_freed, right? Handler
> itself should consider its work done. How exactly is this handled.

It's a good question, I see your point...
Basically we want to give a handler an option to exit with "I promise,
some memory will be freed soon" without doing anything destructive,
while keeping it safe at the same time.

I don't have a perfect answer off the top of my head; maybe some sort of
rate limiter/counter might work? E.g. a handler can make this promise N
times before the kernel kicks in? Any ideas?
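
For illustration only, a rough sketch of such a per-link "promise
budget" (all names hypothetical, nothing like this exists in the
series):

	/* Hypothetical: cap how many times a handler may claim that
	 * memory will be freed soon before the kernel OOM killer
	 * takes over anyway.
	 */
	#define BPF_OOM_MAX_PROMISES	3

	struct bpf_oom_link_state {
		atomic_t pending_promises;
	};

	static bool bpf_oom_may_defer(struct bpf_oom_link_state *state)
	{
		return atomic_inc_return(&state->pending_promises) <=
			BPF_OOM_MAX_PROMISES;
	}

	/* Reset once memory is actually freed, e.g. when the victim
	 * exits or the OOM reaper makes progress.
	 */
	static void bpf_oom_promise_fulfilled(struct bpf_oom_link_state *state)
	{
		atomic_set(&state->pending_promises, 0);
	}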

> Also is there any way to handle the oom by increasing the memcg limit?
> I do not see a callback for that.

There is no kfunc for that yet, but it's a good idea (which we
incidentally discussed a few days ago). I'll implement it.

Thank you!


^ permalink raw reply	[flat|nested] 63+ messages in thread

* Re: [PATCH bpf-next v3 02/17] bpf: allow attaching struct_ops to cgroups
  2026-01-27  2:44 ` [PATCH bpf-next v3 02/17] bpf: allow attaching struct_ops to cgroups Roman Gushchin
  2026-01-27  3:08   ` bot+bpf-ci
  2026-01-27  5:49   ` Yafang Shao
@ 2026-01-28  3:10   ` Josh Don
  2026-01-28 18:52     ` Roman Gushchin
  2026-01-28 11:25   ` Matt Bobrowski
  3 siblings, 1 reply; 63+ messages in thread
From: Josh Don @ 2026-01-28  3:10 UTC (permalink / raw)
  To: Roman Gushchin
  Cc: bpf, Michal Hocko, Alexei Starovoitov, Matt Bobrowski,
	Shakeel Butt, JP Kobryn, linux-kernel, linux-mm,
	Suren Baghdasaryan, Johannes Weiner, Andrew Morton

Hi Roman,

On Mon, Jan 26, 2026 at 6:50 PM Roman Gushchin <roman.gushchin@linux.dev> wrote:
>
> Introduce an ability to attach bpf struct_ops'es to cgroups.
>
[snip]
>  struct bpf_struct_ops_value {
>         struct bpf_struct_ops_common_value common;
> @@ -1220,6 +1222,10 @@ static void bpf_struct_ops_map_link_dealloc(struct bpf_link *link)
>                 st_map->st_ops_desc->st_ops->unreg(&st_map->kvalue.data, link);
>                 bpf_map_put(&st_map->map);
>         }
> +
> +       if (st_link->cgroup)
> +               cgroup_bpf_detach_struct_ops(st_link->cgroup, st_link);
> +

I was worried about concurrency with cgroup ops until I saw that
cgroup_bpf_detach_struct_ops() takes cgroup_lock() internally (since
you sometimes take it inline below, I falsely assumed it wasn't taken
here). In any case, I'm wondering why you need to pass the cgroup
pointer to cgroup_bpf_detach_struct_ops() at all, rather than just
the link?


> @@ -1357,8 +1386,12 @@ int bpf_struct_ops_link_create(union bpf_attr *attr)
>         struct bpf_link_primer link_primer;
>         struct bpf_struct_ops_map *st_map;
>         struct bpf_map *map;
> +       struct cgroup *cgrp;
>         int err;
>
> +       if (attr->link_create.flags & ~BPF_F_CGROUP_FD)
> +               return -EINVAL;
> +
>         map = bpf_map_get(attr->link_create.map_fd);
>         if (IS_ERR(map))
>                 return PTR_ERR(map);
> @@ -1378,11 +1411,26 @@ int bpf_struct_ops_link_create(union bpf_attr *attr)
>         bpf_link_init(&link->link, BPF_LINK_TYPE_STRUCT_OPS, &bpf_struct_ops_map_lops, NULL,
>                       attr->link_create.attach_type);
>
> +       init_waitqueue_head(&link->wait_hup);
> +
> +       if (attr->link_create.flags & BPF_F_CGROUP_FD) {
> +               cgrp = cgroup_get_from_fd(attr->link_create.target_fd);
> +               if (IS_ERR(cgrp)) {
> +                       err = PTR_ERR(cgrp);
> +                       goto err_out;
> +               }
> +               link->cgroup = cgrp;
> +               err = cgroup_bpf_attach_struct_ops(cgrp, link);

We have to be careful at this point: cgroup release could now occur
concurrently, which would clear link->cgroup. Maybe worth a comment
here since this is a bit subtle.
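
Something like this, purely to illustrate the kind of comment I mean:

		/*
		 * From this point on the cgroup can be released
		 * concurrently (see cgroup_bpf_release()), which clears
		 * link->cgroup, so don't assume it stays stable without
		 * holding cgroup_lock().
		 */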

> +               if (err) {
> +                       cgroup_put(cgrp);
> +                       link->cgroup = NULL;
> +                       goto err_out;
> +               }
> +       }


^ permalink raw reply	[flat|nested] 63+ messages in thread

* Re: [PATCH bpf-next v3 07/17] mm: introduce BPF OOM struct ops
  2026-01-27  2:44 ` [PATCH bpf-next v3 07/17] mm: introduce BPF OOM struct ops Roman Gushchin
  2026-01-27  9:38   ` Michal Hocko
@ 2026-01-28  3:26   ` Josh Don
  2026-01-28 19:03     ` Roman Gushchin
  2026-01-28 11:19   ` Michal Hocko
  2026-01-29 21:00   ` Martin KaFai Lau
  3 siblings, 1 reply; 63+ messages in thread
From: Josh Don @ 2026-01-28  3:26 UTC (permalink / raw)
  To: Roman Gushchin
  Cc: bpf, Michal Hocko, Alexei Starovoitov, Matt Bobrowski,
	Shakeel Butt, JP Kobryn, linux-kernel, linux-mm,
	Suren Baghdasaryan, Johannes Weiner, Andrew Morton

Thanks Roman!

On Mon, Jan 26, 2026 at 6:51 PM Roman Gushchin <roman.gushchin@linux.dev> wrote:
>
> Introduce a bpf struct ops for implementing custom OOM handling
> policies.
>
> +bool bpf_handle_oom(struct oom_control *oc)
> +{
> +       struct bpf_struct_ops_link *st_link;
> +       struct bpf_oom_ops *bpf_oom_ops;
> +       struct mem_cgroup *memcg;
> +       struct bpf_map *map;
> +       int ret = 0;
> +
> +       /*
> +        * System-wide OOMs are handled by the struct ops attached
> +        * to the root memory cgroup
> +        */
> +       memcg = oc->memcg ? oc->memcg : root_mem_cgroup;
> +
> +       rcu_read_lock_trace();
> +
> +       /* Find the nearest bpf_oom_ops traversing the cgroup tree upwards */
> +       for (; memcg; memcg = parent_mem_cgroup(memcg)) {
> +               st_link = rcu_dereference_check(memcg->css.cgroup->bpf.bpf_oom_link,
> +                                               rcu_read_lock_trace_held());
> +               if (!st_link)
> +                       continue;
> +
> +               map = rcu_dereference_check((st_link->map),
> +                                           rcu_read_lock_trace_held());
> +               if (!map)
> +                       continue;
> +
> +               /* Call BPF OOM handler */
> +               bpf_oom_ops = bpf_struct_ops_data(map);
> +               ret = bpf_ops_handle_oom(bpf_oom_ops, st_link, oc);
> +               if (ret && oc->bpf_memory_freed)
> +                       break;
> +               ret = 0;
> +       }
> +
> +       rcu_read_unlock_trace();
> +
> +       return ret && oc->bpf_memory_freed;

If bpf claims to have freed memory but didn't actually do so, that
seems like something potentially worth alerting on. Perhaps something
to add to the oom header output?
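
For example, roughly something like this inside the loop (untested,
just a sketch):

		ret = bpf_ops_handle_oom(bpf_oom_ops, st_link, oc);
		if (ret && oc->bpf_memory_freed)
			break;
		/* handler claimed success but freed nothing: surface it */
		if (ret)
			pr_warn_ratelimited("bpf_oom: handler returned success but freed no memory\n");
		ret = 0;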


^ permalink raw reply	[flat|nested] 63+ messages in thread

* Re: [PATCH bpf-next v3 07/17] mm: introduce BPF OOM struct ops
  2026-01-27 21:12     ` Roman Gushchin
@ 2026-01-28  8:00       ` Michal Hocko
  2026-01-28 18:44         ` Roman Gushchin
  2026-02-02  4:06       ` Matt Bobrowski
  1 sibling, 1 reply; 63+ messages in thread
From: Michal Hocko @ 2026-01-28  8:00 UTC (permalink / raw)
  To: Roman Gushchin
  Cc: bpf, Alexei Starovoitov, Matt Bobrowski, Shakeel Butt, JP Kobryn,
	linux-kernel, linux-mm, Suren Baghdasaryan, Johannes Weiner,
	Andrew Morton

On Tue 27-01-26 21:12:56, Roman Gushchin wrote:
> Michal Hocko <mhocko@suse.com> writes:
> 
> > On Mon 26-01-26 18:44:10, Roman Gushchin wrote:
> >> Introduce a bpf struct ops for implementing custom OOM handling
> >> policies.
> >> 
> >> It's possible to load one bpf_oom_ops for the system and one
> >> bpf_oom_ops for every memory cgroup. In case of a memcg OOM, the
> >> cgroup tree is traversed from the OOM'ing memcg up to the root and
> >> corresponding BPF OOM handlers are executed until some memory is
> >> freed. If no memory is freed, the kernel OOM killer is invoked.
> >> 
> >> The struct ops provides the bpf_handle_out_of_memory() callback,
> >> which expected to return 1 if it was able to free some memory and 0
> >> otherwise. If 1 is returned, the kernel also checks the bpf_memory_freed
> >> field of the oom_control structure, which is expected to be set by
> >> kfuncs suitable for releasing memory (which will be introduced later
> >> in the patch series). If both are set, OOM is considered handled,
> >> otherwise the next OOM handler in the chain is executed: e.g. BPF OOM
> >> attached to the parent cgroup or the kernel OOM killer.
> >
> > I still find this dual reporting a bit confusing. I can see your
> > intention in having a pre-defined "releasers" of the memory to trust BPF
> > handlers more but they do have access to oc->bpf_memory_freed so they
> > can manipulate it. Therefore an additional level of protection is rather
> > weak.
> 
> No, they can't. They have only a read-only access.

Could you explain this a bit more? This must be some BPF magic, because
they are getting a standard pointer to oom_control.
 
> > It is also not really clear to me how this works while there is OOM
> > victim on the way out. (i.e. tsk_is_oom_victim() -> abort case). This
> > will result in no killing therefore no bpf_memory_freed, right? Handler
> > itself should consider its work done. How exactly is this handled.
> 
> It's a good question, I see your point...
> Basically we want to give a handler an option to exit with "I promise,
> some memory will be freed soon" without doing anything destructive.
> But keeping it save at the same time.

Yes, something like OOM_BACKOFF, OOM_PROCESSED, OOM_FAILED.
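
Purely as a sketch, something like:

	enum bpf_oom_result {
		OOM_FAILED,	/* nothing freed, fall through to the next handler */
		OOM_PROCESSED,	/* memory was freed or a victim was killed */
		OOM_BACKOFF,	/* a victim is already on the way out, don't kill more */
	};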

> I don't have a perfect answer out of my head, maybe some sort of a
> rate-limiter/counter might work? E.g. a handler can promise this N times
> before the kernel kicks in? Any ideas?

Counters usually do not work very well for async operations. In this
case there is the oom_reaper and/or task exit to finish the oom operation.
The former is bounded and guaranteed to make forward progress, but there
is no time frame to assume when that happens, as it depends on how many
tasks might be queued (usually a single one, but this is not something to
rely on because of concurrent ooms in memcgs; also, multiple tasks
could be killed at the same time).

Another complication is that there are multiple levels of OOM to track
(global, NUMA, memcg), so any watchdog would have to be aware of that as
well. I am really wondering whether we need to be so careful with
handlers. It is not like you would allow any random oom handler to be
loaded, right? Would it make sense to start without this protection and
converge to something as we see how this evolves? Maybe this will raise
the bar for oom handlers, as the price for bugs is going to be really
high.

> > Also is there any way to handle the oom by increasing the memcg limit?
> > I do not see a callback for that.
> 
> There is no kfunc yet, but it's a good idea (which we accidentally
> discussed few days ago). I'll implement it.

Cool!
-- 
Michal Hocko
SUSE Labs


^ permalink raw reply	[flat|nested] 63+ messages in thread

* Re: [PATCH bpf-next v3 00/17] mm: BPF OOM
  2026-01-27 21:01   ` Roman Gushchin
@ 2026-01-28  8:06     ` Michal Hocko
  2026-01-28 16:59       ` Alexei Starovoitov
  0 siblings, 1 reply; 63+ messages in thread
From: Michal Hocko @ 2026-01-28  8:06 UTC (permalink / raw)
  To: Roman Gushchin
  Cc: bpf, Alexei Starovoitov, Matt Bobrowski, Shakeel Butt, JP Kobryn,
	linux-kernel, linux-mm, Suren Baghdasaryan, Johannes Weiner,
	Andrew Morton

On Tue 27-01-26 21:01:48, Roman Gushchin wrote:
> Michal Hocko <mhocko@suse.com> writes:
> 
> > On Mon 26-01-26 18:44:03, Roman Gushchin wrote:
> >> This patchset adds an ability to customize the out of memory
> >> handling using bpf.
> >> 
> >> It focuses on two parts:
> >> 1) OOM handling policy,
> >> 2) PSI-based OOM invocation.
> >> 
> >> The idea to use bpf for customizing the OOM handling is not new, but
> >> unlike the previous proposal [1], which augmented the existing task
> >> ranking policy, this one tries to be as generic as possible and
> >> leverage the full power of the modern bpf.
> >> 
> >> It provides a generic interface which is called before the existing OOM
> >> killer code and allows implementing any policy, e.g. picking a victim
> >> task or memory cgroup or potentially even releasing memory in other
> >> ways, e.g. deleting tmpfs files (the last one might require some
> >> additional but relatively simple changes).
> >
> > Are you planning to write any highlevel documentation on how to use the
> > existing infrastructure to implement proper/correct OOM handlers with
> > these generic interfaces?
> 
> What do you expect from such a document, can you, please, elaborate?

Sure. Essentially the expected structure of the handler: what API
it can use, what it has to do and what it must not do. Essentially a
single place you can read to get enough information to start developing
your oom handler.

> I'm asking because the main promise of bpf is to provide some sort
> of a safe playground, so anyone can experiment with writing their
> bpf implementations (like sched_ext schedulers or bpf oom policies)
> with minimum risk. Yes, it might work sub-optimally and kill too many
> tasks, but it won't crash or deadlock the system.
> So in way I don't want to prescribe the "right way" of writing
> oom handler, but it totally makes sense to provide an example.
> 
> As of now the best way to get an example of a bpf handler is to look
> into the commit "[PATCH bpf-next v3 12/17] bpf: selftests: BPF OOM
> struct ops test".

Examples are really great, but having a central place to document the
available API is much more helpful IMHO. The generally scattered nature
of BPF hooks makes it really hard to even know what is available for oom
handlers to use.

> Another viable idea (also suggested by Andrew Morton) is to develop
> a production ready memcg-aware OOM killer in BPF, put the source code
> into the kernel tree and make it loadable by default (obviously under a
> config option). Myself or one of my colleagues will try to explore it a
> bit later: the tricky part is this by-default loading because there are
> no existing precedents.

It certainly makes sense to have a trusted implementation of a commonly
requested oom policy that we couldn't implement due to its specific
nature not applying to many users, and to have that in the tree. I am
not thrilled about auto-loading, because this could easily be done by
simple tooling.

-- 
Michal Hocko
SUSE Labs


^ permalink raw reply	[flat|nested] 63+ messages in thread

* Re: [PATCH bpf-next v3 07/17] mm: introduce BPF OOM struct ops
  2026-01-27  2:44 ` [PATCH bpf-next v3 07/17] mm: introduce BPF OOM struct ops Roman Gushchin
  2026-01-27  9:38   ` Michal Hocko
  2026-01-28  3:26   ` Josh Don
@ 2026-01-28 11:19   ` Michal Hocko
  2026-01-28 18:53     ` Roman Gushchin
  2026-01-29 21:00   ` Martin KaFai Lau
  3 siblings, 1 reply; 63+ messages in thread
From: Michal Hocko @ 2026-01-28 11:19 UTC (permalink / raw)
  To: Roman Gushchin
  Cc: bpf, Alexei Starovoitov, Matt Bobrowski, Shakeel Butt, JP Kobryn,
	linux-kernel, linux-mm, Suren Baghdasaryan, Johannes Weiner,
	Andrew Morton

One additional point I forgot to mention previously:

On Mon 26-01-26 18:44:10, Roman Gushchin wrote:
> @@ -1168,6 +1180,13 @@ bool out_of_memory(struct oom_control *oc)
>  		return true;
>  	}
>  
> +	/*
> +	 * Let bpf handle the OOM first. If it was able to free up some memory,
> +	 * bail out. Otherwise fall back to the kernel OOM killer.
> +	 */
> +	if (bpf_handle_oom(oc))
> +		return true;
> +
>  	select_bad_process(oc);
>  	/* Found nothing?!?! */
>  	if (!oc->chosen) {

Should this check for is_sysrq_oom() and always use the in-kernel OOM
handling for sysrq-triggered ooms as a failsafe measure?
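
Something like this, untested:

	if (!is_sysrq_oom(oc) && bpf_handle_oom(oc))
		return true;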
-- 
Michal Hocko
SUSE Labs


^ permalink raw reply	[flat|nested] 63+ messages in thread

* Re: [PATCH bpf-next v3 02/17] bpf: allow attaching struct_ops to cgroups
  2026-01-27  2:44 ` [PATCH bpf-next v3 02/17] bpf: allow attaching struct_ops to cgroups Roman Gushchin
                     ` (2 preceding siblings ...)
  2026-01-28  3:10   ` Josh Don
@ 2026-01-28 11:25   ` Matt Bobrowski
  2026-01-28 19:18     ` Roman Gushchin
  3 siblings, 1 reply; 63+ messages in thread
From: Matt Bobrowski @ 2026-01-28 11:25 UTC (permalink / raw)
  To: Roman Gushchin
  Cc: bpf, Michal Hocko, Alexei Starovoitov, Shakeel Butt, JP Kobryn,
	linux-kernel, linux-mm, Suren Baghdasaryan, Johannes Weiner,
	Andrew Morton

On Mon, Jan 26, 2026 at 06:44:05PM -0800, Roman Gushchin wrote:
> Introduce an ability to attach bpf struct_ops'es to cgroups.
> 
> From user's standpoint it works in the following way:
> a user passes a BPF_F_CGROUP_FD flag and specifies the target cgroup
> fd while creating a struct_ops link. As the result, the bpf struct_ops
> link will be created and attached to a cgroup.
> 
> The cgroup.bpf structure maintains a list of attached struct ops links.
> If the cgroup is getting deleted, attached struct ops'es are getting
> auto-detached and the userspace program gets a notification.
> 
> This change doesn't answer the question how bpf programs belonging
> to these struct ops'es will be executed. It will be done individually
> for every bpf struct ops which supports this.
> 
> Please, note that unlike "normal" bpf programs, struct ops'es
> are not propagated to cgroup sub-trees.
> 
> Signed-off-by: Roman Gushchin <roman.gushchin@linux.dev>
> ---
>  include/linux/bpf-cgroup-defs.h |  3 ++
>  include/linux/bpf-cgroup.h      | 16 +++++++++
>  include/linux/bpf.h             |  3 ++
>  include/uapi/linux/bpf.h        |  3 ++
>  kernel/bpf/bpf_struct_ops.c     | 59 ++++++++++++++++++++++++++++++---
>  kernel/bpf/cgroup.c             | 46 +++++++++++++++++++++++++
>  tools/include/uapi/linux/bpf.h  |  1 +
>  7 files changed, 127 insertions(+), 4 deletions(-)
> 
> diff --git a/include/linux/bpf-cgroup-defs.h b/include/linux/bpf-cgroup-defs.h
> index c9e6b26abab6..6c5e37190dad 100644
> --- a/include/linux/bpf-cgroup-defs.h
> +++ b/include/linux/bpf-cgroup-defs.h
> @@ -71,6 +71,9 @@ struct cgroup_bpf {
>  	/* temp storage for effective prog array used by prog_attach/detach */
>  	struct bpf_prog_array *inactive;
>  
> +	/* list of bpf struct ops links */
> +	struct list_head struct_ops_links;
> +
>  	/* reference counter used to detach bpf programs after cgroup removal */
>  	struct percpu_ref refcnt;
>  
> diff --git a/include/linux/bpf-cgroup.h b/include/linux/bpf-cgroup.h
> index 2f535331f926..a6c327257006 100644
> --- a/include/linux/bpf-cgroup.h
> +++ b/include/linux/bpf-cgroup.h
> @@ -423,6 +423,11 @@ int cgroup_bpf_link_attach(const union bpf_attr *attr, struct bpf_prog *prog);
>  int cgroup_bpf_prog_query(const union bpf_attr *attr,
>  			  union bpf_attr __user *uattr);
>  
> +int cgroup_bpf_attach_struct_ops(struct cgroup *cgrp,
> +				 struct bpf_struct_ops_link *link);
> +void cgroup_bpf_detach_struct_ops(struct cgroup *cgrp,
> +				  struct bpf_struct_ops_link *link);
> +
>  const struct bpf_func_proto *
>  cgroup_common_func_proto(enum bpf_func_id func_id, const struct bpf_prog *prog);
>  #else
> @@ -451,6 +456,17 @@ static inline int cgroup_bpf_link_attach(const union bpf_attr *attr,
>  	return -EINVAL;
>  }
>  
> +static inline int cgroup_bpf_attach_struct_ops(struct cgroup *cgrp,
> +					       struct bpf_struct_ops_link *link)
> +{
> +	return -EINVAL;
> +}
> +
> +static inline void cgroup_bpf_detach_struct_ops(struct cgroup *cgrp,
> +						struct bpf_struct_ops_link *link)
> +{
> +}
> +
>  static inline int cgroup_bpf_prog_query(const union bpf_attr *attr,
>  					union bpf_attr __user *uattr)
>  {
> diff --git a/include/linux/bpf.h b/include/linux/bpf.h
> index 899dd911dc82..391888eb257c 100644
> --- a/include/linux/bpf.h
> +++ b/include/linux/bpf.h
> @@ -1894,6 +1894,9 @@ struct bpf_raw_tp_link {
>  struct bpf_struct_ops_link {
>  	struct bpf_link link;
>  	struct bpf_map __rcu *map;
> +	struct cgroup *cgroup;
> +	bool cgroup_removed;
> +	struct list_head list;
>  	wait_queue_head_t wait_hup;
>  };
>  
> diff --git a/include/uapi/linux/bpf.h b/include/uapi/linux/bpf.h
> index 44e7dbc278e3..28544e8af1cd 100644
> --- a/include/uapi/linux/bpf.h
> +++ b/include/uapi/linux/bpf.h
> @@ -1237,6 +1237,7 @@ enum bpf_perf_event_type {
>  #define BPF_F_AFTER		(1U << 4)
>  #define BPF_F_ID		(1U << 5)
>  #define BPF_F_PREORDER		(1U << 6)
> +#define BPF_F_CGROUP_FD		(1U << 7)
>  #define BPF_F_LINK		BPF_F_LINK /* 1 << 13 */
>  
>  /* If BPF_F_STRICT_ALIGNMENT is used in BPF_PROG_LOAD command, the
> @@ -6775,6 +6776,8 @@ struct bpf_link_info {
>  		} xdp;
>  		struct {
>  			__u32 map_id;
> +			__u32 :32;
> +			__u64 cgroup_id;
>  		} struct_ops;
>  		struct {
>  			__u32 pf;
> diff --git a/kernel/bpf/bpf_struct_ops.c b/kernel/bpf/bpf_struct_ops.c
> index de01cf3025b3..2e361e22cfa0 100644
> --- a/kernel/bpf/bpf_struct_ops.c
> +++ b/kernel/bpf/bpf_struct_ops.c
> @@ -13,6 +13,8 @@
>  #include <linux/btf_ids.h>
>  #include <linux/rcupdate_wait.h>
>  #include <linux/poll.h>
> +#include <linux/bpf-cgroup.h>
> +#include <linux/cgroup.h>
>  
>  struct bpf_struct_ops_value {
>  	struct bpf_struct_ops_common_value common;
> @@ -1220,6 +1222,10 @@ static void bpf_struct_ops_map_link_dealloc(struct bpf_link *link)
>  		st_map->st_ops_desc->st_ops->unreg(&st_map->kvalue.data, link);
>  		bpf_map_put(&st_map->map);
>  	}
> +
> +	if (st_link->cgroup)
> +		cgroup_bpf_detach_struct_ops(st_link->cgroup, st_link);
> +
>  	kfree(st_link);
>  }
>  
> @@ -1228,6 +1234,7 @@ static void bpf_struct_ops_map_link_show_fdinfo(const struct bpf_link *link,
>  {
>  	struct bpf_struct_ops_link *st_link;
>  	struct bpf_map *map;
> +	u64 cgrp_id = 0;

Assigning 0 to cgrp_id would technically be incorrect, right? Like,
cgroup_id() for !CONFIG_CGROUPS defaults to returning 1, and for
CONFIG_CGROUPS the ID allocation is done via the idr_alloc_cyclic()
API using a range between 1 and INT_MAX. Perhaps here 0 serves as a
valid sentinel value? Is that the rationale?

In general, shouldn't all the cgroup related logic within this source
file be protected by a CONFIG_CGROUPS ifdef? For example, both
cgroup_get_from_fd() and cgroup_put() lack stubs when building with
!CONFIG_CGROUPS.

>  	st_link = container_of(link, struct bpf_struct_ops_link, link);
>  	rcu_read_lock();
> @@ -1235,6 +1242,14 @@ static void bpf_struct_ops_map_link_show_fdinfo(const struct bpf_link *link,
>  	if (map)
>  		seq_printf(seq, "map_id:\t%d\n", map->id);
>  	rcu_read_unlock();
> +
> +	cgroup_lock();
> +	if (st_link->cgroup)
> +		cgrp_id = cgroup_id(st_link->cgroup);
> +	cgroup_unlock();
> +
> +	if (cgrp_id)
> +		seq_printf(seq, "cgroup_id:\t%llu\n", cgrp_id);

You could probably introduce a simple inline helper for the
cgroup_lock()/cgroup_id()/cgroup_unlock() dance that's going on
here and in bpf_struct_ops_map_link_fill_link_info() below.
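
E.g. something like this, untested (name made up):

	static u64 bpf_struct_ops_link_cgroup_id(struct bpf_struct_ops_link *st_link)
	{
		u64 cgrp_id = 0;

		cgroup_lock();
		if (st_link->cgroup)
			cgrp_id = cgroup_id(st_link->cgroup);
		cgroup_unlock();

		return cgrp_id;
	}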

>  }
>  
>  static int bpf_struct_ops_map_link_fill_link_info(const struct bpf_link *link,
> @@ -1242,6 +1257,7 @@ static int bpf_struct_ops_map_link_fill_link_info(const struct bpf_link *link,
>  {
>  	struct bpf_struct_ops_link *st_link;
>  	struct bpf_map *map;
> +	u64 cgrp_id = 0;
>  
>  	st_link = container_of(link, struct bpf_struct_ops_link, link);
>  	rcu_read_lock();
> @@ -1249,6 +1265,13 @@ static int bpf_struct_ops_map_link_fill_link_info(const struct bpf_link *link,
>  	if (map)
>  		info->struct_ops.map_id = map->id;
>  	rcu_read_unlock();
> +
> +	cgroup_lock();
> +	if (st_link->cgroup)
> +		cgrp_id = cgroup_id(st_link->cgroup);
> +	cgroup_unlock();
> +
> +	info->struct_ops.cgroup_id = cgrp_id;

As mentioned above, a simple inline helper would reduce this to the
following here:

...
	  info->struct_ops.cgroup_id = bpf_struct_ops_link_cgroup_id(st_link);
...

>  	return 0;
>  }
>  
> @@ -1327,6 +1350,9 @@ static int bpf_struct_ops_map_link_detach(struct bpf_link *link)
>  
>  	mutex_unlock(&update_mutex);
>  
> +	if (st_link->cgroup)
> +		cgroup_bpf_detach_struct_ops(st_link->cgroup, st_link);
> +
>  	wake_up_interruptible_poll(&st_link->wait_hup, EPOLLHUP);
>  
>  	return 0;
> @@ -1339,6 +1365,9 @@ static __poll_t bpf_struct_ops_map_link_poll(struct file *file,
>  
>  	poll_wait(file, &st_link->wait_hup, pts);
>  
> +	if (st_link->cgroup_removed)
> +		return EPOLLHUP;
> +
>  	return rcu_access_pointer(st_link->map) ? 0 : EPOLLHUP;
>  }
>  
> @@ -1357,8 +1386,12 @@ int bpf_struct_ops_link_create(union bpf_attr *attr)
>  	struct bpf_link_primer link_primer;
>  	struct bpf_struct_ops_map *st_map;
>  	struct bpf_map *map;
> +	struct cgroup *cgrp;
>  	int err;
>  
> +	if (attr->link_create.flags & ~BPF_F_CGROUP_FD)
> +		return -EINVAL;
> +

BPF_F_CGROUP_FD depends on the cgroup subsystem, therefore it
probably makes sense to only accept BPF_F_CGROUP_FD when
CONFIG_CGROUP_BPF is enabled, and return -EOPNOTSUPP otherwise?

I'd also probably rewrite this such that we do:

...
	struct cgroup *cgrp = NULL;
	...
	if (attr->link_create.flags & BPF_F_CGROUP_FD) {
#if IS_ENABLED(CONFIG_CGROUP_BPF)
		cgrp = cgroup_get_from_fd(attr->link_create.target_fd);
		if (IS_ERR(cgrp))
			return PTR_ERR(cgrp);
#else
		return -EOPNOTSUPP;
#endif
	}
...
	if (cgrp) {
		link->cgroup = cgrp;
		err = cgroup_bpf_attach_struct_ops(cgrp, link);
		if (err) {
			cgroup_put(cgrp);
			goto err_out;
		}
	}

IMO the code is cleaner and reads better too.

>  	map = bpf_map_get(attr->link_create.map_fd);
>  	if (IS_ERR(map))
>  		return PTR_ERR(map);
> @@ -1378,11 +1411,26 @@ int bpf_struct_ops_link_create(union bpf_attr *attr)
>  	bpf_link_init(&link->link, BPF_LINK_TYPE_STRUCT_OPS, &bpf_struct_ops_map_lops, NULL,
>  		      attr->link_create.attach_type);
>  
> +	init_waitqueue_head(&link->wait_hup);
> +
> +	if (attr->link_create.flags & BPF_F_CGROUP_FD) {
> +		cgrp = cgroup_get_from_fd(attr->link_create.target_fd);
> +		if (IS_ERR(cgrp)) {
> +			err = PTR_ERR(cgrp);
> +			goto err_out;
> +		}
> +		link->cgroup = cgrp;
> +		err = cgroup_bpf_attach_struct_ops(cgrp, link);
> +		if (err) {
> +			cgroup_put(cgrp);
> +			link->cgroup = NULL;
> +			goto err_out;
> +		}
> +	}
> +
>  	err = bpf_link_prime(&link->link, &link_primer);
>  	if (err)
> -		goto err_out;
> -
> -	init_waitqueue_head(&link->wait_hup);
> +		goto err_put_cgroup;
>  
>  	/* Hold the update_mutex such that the subsystem cannot
>  	 * do link->ops->detach() before the link is fully initialized.
> @@ -1393,13 +1441,16 @@ int bpf_struct_ops_link_create(union bpf_attr *attr)
>  		mutex_unlock(&update_mutex);
>  		bpf_link_cleanup(&link_primer);
>  		link = NULL;
> -		goto err_out;
> +		goto err_put_cgroup;
>  	}
>  	RCU_INIT_POINTER(link->map, map);
>  	mutex_unlock(&update_mutex);
>  
>  	return bpf_link_settle(&link_primer);
>  
> +err_put_cgroup:
> +	if (link && link->cgroup)
> +		cgroup_bpf_detach_struct_ops(link->cgroup, link);
>  err_out:
>  	bpf_map_put(map);
>  	kfree(link);
> diff --git a/kernel/bpf/cgroup.c b/kernel/bpf/cgroup.c
> index 69988af44b37..7b1903be6f69 100644
> --- a/kernel/bpf/cgroup.c
> +++ b/kernel/bpf/cgroup.c
> @@ -16,6 +16,7 @@
>  #include <linux/bpf-cgroup.h>
>  #include <linux/bpf_lsm.h>
>  #include <linux/bpf_verifier.h>
> +#include <linux/poll.h>
>  #include <net/sock.h>
>  #include <net/bpf_sk_storage.h>
>  
> @@ -307,12 +308,23 @@ static void cgroup_bpf_release(struct work_struct *work)
>  					       bpf.release_work);
>  	struct bpf_prog_array *old_array;
>  	struct list_head *storages = &cgrp->bpf.storages;
> +	struct bpf_struct_ops_link *st_link, *st_tmp;
>  	struct bpf_cgroup_storage *storage, *stmp;
> +	LIST_HEAD(st_links);
>  
>  	unsigned int atype;
>  
>  	cgroup_lock();
>  
> +	list_splice_init(&cgrp->bpf.struct_ops_links, &st_links);
> +	list_for_each_entry_safe(st_link, st_tmp, &st_links, list) {
> +		st_link->cgroup = NULL;
> +		st_link->cgroup_removed = true;
> +		cgroup_put(cgrp);
> +		if (IS_ERR(bpf_link_inc_not_zero(&st_link->link)))
> +			list_del(&st_link->list);
> +	}
> +
>  	for (atype = 0; atype < ARRAY_SIZE(cgrp->bpf.progs); atype++) {
>  		struct hlist_head *progs = &cgrp->bpf.progs[atype];
>  		struct bpf_prog_list *pl;
> @@ -346,6 +358,11 @@ static void cgroup_bpf_release(struct work_struct *work)
>  
>  	cgroup_unlock();
>  
> +	list_for_each_entry_safe(st_link, st_tmp, &st_links, list) {
> +		st_link->link.ops->detach(&st_link->link);
> +		bpf_link_put(&st_link->link);
> +	}
> +
>  	for (p = cgroup_parent(cgrp); p; p = cgroup_parent(p))
>  		cgroup_bpf_put(p);
>  
> @@ -525,6 +542,7 @@ static int cgroup_bpf_inherit(struct cgroup *cgrp)
>  		INIT_HLIST_HEAD(&cgrp->bpf.progs[i]);
>  
>  	INIT_LIST_HEAD(&cgrp->bpf.storages);
> +	INIT_LIST_HEAD(&cgrp->bpf.struct_ops_links);
>  
>  	for (i = 0; i < NR; i++)
>  		if (compute_effective_progs(cgrp, i, &arrays[i]))
> @@ -2759,3 +2777,31 @@ cgroup_common_func_proto(enum bpf_func_id func_id, const struct bpf_prog *prog)
>  		return NULL;
>  	}
>  }
> +
> +int cgroup_bpf_attach_struct_ops(struct cgroup *cgrp,
> +				 struct bpf_struct_ops_link *link)
> +{
> +	int ret = 0;
> +
> +	cgroup_lock();
> +	if (percpu_ref_is_zero(&cgrp->bpf.refcnt)) {
> +		ret = -EBUSY;

If the cgroup is dying, then perhaps -EINVAL would be more appropriate
here, no? I'd argue that -EBUSY implies a temporary or transient
state.

> +		goto out;
> +	}
> +	list_add_tail(&link->list, &cgrp->bpf.struct_ops_links);
> +out:
> +	cgroup_unlock();
> +	return ret;
> +}
> +
> +void cgroup_bpf_detach_struct_ops(struct cgroup *cgrp,
> +				  struct bpf_struct_ops_link *link)
> +{
> +	cgroup_lock();
> +	if (link->cgroup == cgrp) {
> +		list_del(&link->list);
> +		link->cgroup = NULL;
> +		cgroup_put(cgrp);
> +	}
> +	cgroup_unlock();
> +}

Within cgroup_bpf_attach_struct_ops() and
cgroup_bpf_detach_struct_ops() the cgrp pointer appears to be
superfluous? Both should probably operate on link->cgroup only
instead? A !link->cgroup when calling either should be treated as
-EINVAL.

> diff --git a/tools/include/uapi/linux/bpf.h b/tools/include/uapi/linux/bpf.h
> index 3ca7d76e05f0..d5492e60744a 100644
> --- a/tools/include/uapi/linux/bpf.h
> +++ b/tools/include/uapi/linux/bpf.h
> @@ -1237,6 +1237,7 @@ enum bpf_perf_event_type {
>  #define BPF_F_AFTER		(1U << 4)
>  #define BPF_F_ID		(1U << 5)
>  #define BPF_F_PREORDER		(1U << 6)
> +#define BPF_F_CGROUP_FD		(1U << 7)
>  #define BPF_F_LINK		BPF_F_LINK /* 1 << 13 */
>  
>  /* If BPF_F_STRICT_ALIGNMENT is used in BPF_PROG_LOAD command, the
> -- 
> 2.52.0
> 


^ permalink raw reply	[flat|nested] 63+ messages in thread

* Re: [PATCH bpf-next v3 01/17] bpf: move bpf_struct_ops_link into bpf.h
  2026-01-27  2:44 ` [PATCH bpf-next v3 01/17] bpf: move bpf_struct_ops_link into bpf.h Roman Gushchin
  2026-01-27  5:50   ` Yafang Shao
@ 2026-01-28 11:28   ` Matt Bobrowski
  1 sibling, 0 replies; 63+ messages in thread
From: Matt Bobrowski @ 2026-01-28 11:28 UTC (permalink / raw)
  To: Roman Gushchin
  Cc: bpf, Michal Hocko, Alexei Starovoitov, Shakeel Butt, JP Kobryn,
	linux-kernel, linux-mm, Suren Baghdasaryan, Johannes Weiner,
	Andrew Morton

On Mon, Jan 26, 2026 at 06:44:04PM -0800, Roman Gushchin wrote:
> Move struct bpf_struct_ops_link's definition into bpf.h,
> where other custom bpf links definitions are.
> 
> It's necessary to access its members from outside of generic
> bpf_struct_ops implementation, which will be done by following
> patches in the series.

Looks OK to me:

Acked-by: Matt Bobrowski <mattbobrowski@google.com>

> Signed-off-by: Roman Gushchin <roman.gushchin@linux.dev>
> ---
>  include/linux/bpf.h         | 6 ++++++
>  kernel/bpf/bpf_struct_ops.c | 6 ------
>  2 files changed, 6 insertions(+), 6 deletions(-)
> 
> diff --git a/include/linux/bpf.h b/include/linux/bpf.h
> index 4427c6e98331..899dd911dc82 100644
> --- a/include/linux/bpf.h
> +++ b/include/linux/bpf.h
> @@ -1891,6 +1891,12 @@ struct bpf_raw_tp_link {
>  	u64 cookie;
>  };
>  
> +struct bpf_struct_ops_link {
> +	struct bpf_link link;
> +	struct bpf_map __rcu *map;
> +	wait_queue_head_t wait_hup;
> +};
> +
>  struct bpf_link_primer {
>  	struct bpf_link *link;
>  	struct file *file;
> diff --git a/kernel/bpf/bpf_struct_ops.c b/kernel/bpf/bpf_struct_ops.c
> index c43346cb3d76..de01cf3025b3 100644
> --- a/kernel/bpf/bpf_struct_ops.c
> +++ b/kernel/bpf/bpf_struct_ops.c
> @@ -55,12 +55,6 @@ struct bpf_struct_ops_map {
>  	struct bpf_struct_ops_value kvalue;
>  };
>  
> -struct bpf_struct_ops_link {
> -	struct bpf_link link;
> -	struct bpf_map __rcu *map;
> -	wait_queue_head_t wait_hup;
> -};
> -
>  static DEFINE_MUTEX(update_mutex);
>  
>  #define VALUE_PREFIX "bpf_struct_ops_"
> -- 
> 2.52.0
> 


^ permalink raw reply	[flat|nested] 63+ messages in thread

* Re: [PATCH bpf-next v3 00/17] mm: BPF OOM
  2026-01-28  8:06     ` Michal Hocko
@ 2026-01-28 16:59       ` Alexei Starovoitov
  2026-01-28 18:23         ` Roman Gushchin
  2026-02-02  3:26         ` Matt Bobrowski
  0 siblings, 2 replies; 63+ messages in thread
From: Alexei Starovoitov @ 2026-01-28 16:59 UTC (permalink / raw)
  To: Michal Hocko
  Cc: Roman Gushchin, bpf, Alexei Starovoitov, Matt Bobrowski,
	Shakeel Butt, JP Kobryn, LKML, linux-mm, Suren Baghdasaryan,
	Johannes Weiner, Andrew Morton

On Wed, Jan 28, 2026 at 12:06 AM Michal Hocko <mhocko@suse.com> wrote:
>
>
> > Another viable idea (also suggested by Andrew Morton) is to develop
> > a production ready memcg-aware OOM killer in BPF, put the source code
> > into the kernel tree and make it loadable by default (obviously under a
> > config option). Myself or one of my colleagues will try to explore it a
> > bit later: the tricky part is this by-default loading because there are
> > no existing precedents.
>
> It certainly makes sense to have trusted implementation of a commonly
> requested oom policy that we couldn't implement due to specific nature
> that doesn't really apply to many users. And have that in the tree. I am
> not thrilled about auto-loading because this could be easily done by a
> simple tooling.

Production-ready bpf-oom program(s) must be part of this set.
We've seen enough attempts to add bpf st_ops in various parts of
the kernel without providing realistic bpf progs that will drive
those hooks. It's great to have flexibility, and people need
to have the freedom to develop their own bpf-oom policy, but
the author of the patch set who's advocating for the new
bpf hooks must provide their real production progs and
share their real use case with the community.
It's not cool to hide it.
In that sense, enabling auto-loading without requiring an end user
to install the toolchain and build bpf programs/rust/whatnot
is necessary too.
bpf-oom can be a self-contained part of the vmlinux binary.
We already have a mechanism to do that.
This way the end user doesn't need to be a bpf expert, doesn't need
to install clang, build the tools, etc.
They can just enable a fancy new bpf-oom policy and see whether
it's helping their apps or not while knowing nothing about bpf.


^ permalink raw reply	[flat|nested] 63+ messages in thread

* Re: [PATCH bpf-next v3 00/17] mm: BPF OOM
  2026-01-28 16:59       ` Alexei Starovoitov
@ 2026-01-28 18:23         ` Roman Gushchin
  2026-01-28 18:53           ` Alexei Starovoitov
  2026-02-02  3:26         ` Matt Bobrowski
  1 sibling, 1 reply; 63+ messages in thread
From: Roman Gushchin @ 2026-01-28 18:23 UTC (permalink / raw)
  To: Alexei Starovoitov
  Cc: Michal Hocko, bpf, Alexei Starovoitov, Matt Bobrowski,
	Shakeel Butt, JP Kobryn, LKML, linux-mm, Suren Baghdasaryan,
	Johannes Weiner, Andrew Morton

Alexei Starovoitov <alexei.starovoitov@gmail.com> writes:

> On Wed, Jan 28, 2026 at 12:06 AM Michal Hocko <mhocko@suse.com> wrote:
>>
>>
>> > Another viable idea (also suggested by Andrew Morton) is to develop
>> > a production ready memcg-aware OOM killer in BPF, put the source code
>> > into the kernel tree and make it loadable by default (obviously under a
>> > config option). Myself or one of my colleagues will try to explore it a
>> > bit later: the tricky part is this by-default loading because there are
>> > no existing precedents.
>>
>> It certainly makes sense to have trusted implementation of a commonly
>> requested oom policy that we couldn't implement due to specific nature
>> that doesn't really apply to many users. And have that in the tree. I am
>> not thrilled about auto-loading because this could be easily done by a
>> simple tooling.
>
> Production ready bpf-oom program(s) must be part of this set.
> We've seen enough attempts to add bpf st_ops in various parts of
> the kernel without providing realistic bpf progs that will drive
> those hooks. It's great to have flexibility and people need
> to have a freedom to develop their own bpf-oom policy, but
> the author of the patch set who's advocating for the new
> bpf hooks must provide their real production progs and
> share their real use case with the community.
> It's not cool to hide it.

In my case it's not about hiding, it's a chicken-and-egg problem:
the upstream-first model conflicts with the idea of including
production results in the patchset. In other words, I want to settle
the interface before shipping something to prod.

I guess the compromise here is to initially include a bpf oom policy
inspired by what systemd-oomd does and what is proven to work for a
broad range of users. Policies suited for large datacenters can be
added later, but their generic usefulness might also be limited by the
need for proprietary userspace orchestration engines.

> In that sense enabling auto-loading without requiring an end user
> to install the toolchain and build bpf programs/rust/whatnot
> is necessary too.
> bpf-oom can be a self contained part of vmlinux binary.
> We already have a mechanism to do that.
> This way the end user doesn't need to be a bpf expert, doesn't need
> to install clang, build the tools, etc.
> They can just enable fancy new bpf-oom policy and see whether
> it's helping their apps or not while knowing nothing about bpf.

Fully agree here. Will implement in v4.

Thanks!


^ permalink raw reply	[flat|nested] 63+ messages in thread

* Re: [PATCH bpf-next v3 07/17] mm: introduce BPF OOM struct ops
  2026-01-28  8:00       ` Michal Hocko
@ 2026-01-28 18:44         ` Roman Gushchin
  0 siblings, 0 replies; 63+ messages in thread
From: Roman Gushchin @ 2026-01-28 18:44 UTC (permalink / raw)
  To: Michal Hocko
  Cc: bpf, Alexei Starovoitov, Matt Bobrowski, Shakeel Butt, JP Kobryn,
	linux-kernel, linux-mm, Suren Baghdasaryan, Johannes Weiner,
	Andrew Morton

Michal Hocko <mhocko@suse.com> writes:

> On Tue 27-01-26 21:12:56, Roman Gushchin wrote:
>> Michal Hocko <mhocko@suse.com> writes:
>> 
>> > On Mon 26-01-26 18:44:10, Roman Gushchin wrote:
>> >> Introduce a bpf struct ops for implementing custom OOM handling
>> >> policies.
>> >> 
>> >> It's possible to load one bpf_oom_ops for the system and one
>> >> bpf_oom_ops for every memory cgroup. In case of a memcg OOM, the
>> >> cgroup tree is traversed from the OOM'ing memcg up to the root and
>> >> corresponding BPF OOM handlers are executed until some memory is
>> >> freed. If no memory is freed, the kernel OOM killer is invoked.
>> >> 
>> >> The struct ops provides the bpf_handle_out_of_memory() callback,
>> >> which expected to return 1 if it was able to free some memory and 0
>> >> otherwise. If 1 is returned, the kernel also checks the bpf_memory_freed
>> >> field of the oom_control structure, which is expected to be set by
>> >> kfuncs suitable for releasing memory (which will be introduced later
>> >> in the patch series). If both are set, OOM is considered handled,
>> >> otherwise the next OOM handler in the chain is executed: e.g. BPF OOM
>> >> attached to the parent cgroup or the kernel OOM killer.
>> >
>> > I still find this dual reporting a bit confusing. I can see your
>> > intention in having a pre-defined "releasers" of the memory to trust BPF
>> > handlers more but they do have access to oc->bpf_memory_freed so they
>> > can manipulate it. Therefore an additional level of protection is rather
>> > weak.
>> 
>> No, they can't. They have only a read-only access.
>
> Could you explain this a bit more. This must be some BPF magic because
> they are getting a standard pointer to oom_control.

Yes, but bpf programs (unlike kernel modules) go through the
verifier when being loaded into the kernel. The verifier ensures that
programs are safe: e.g. they can't access memory outside of safe areas,
can't contain infinite loops, can't dereference a NULL pointer, etc.

So even though it looks like a normal argument, it's read-only. And the
program can't even read memory outside of the structure itself, e.g. a
program doing something like (oc + 1)->bpf_memory_freed won't be allowed
to load.
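
For example, roughly (just a sketch; the exact callback arguments in
this series may differ):

	SEC("struct_ops/bpf_handle_out_of_memory")
	int BPF_PROG(handle_oom, struct oom_control *oc)
	{
		/* reading the field is fine */
		if (oc->bpf_memory_freed)
			return 1;

		/* but a direct store such as
		 *	oc->bpf_memory_freed = true;
		 * would be rejected by the verifier at load time
		 */
		return 0;
	}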

>> > It is also not really clear to me how this works while there is OOM
>> > victim on the way out. (i.e. tsk_is_oom_victim() -> abort case). This
>> > will result in no killing therefore no bpf_memory_freed, right? Handler
>> > itself should consider its work done. How exactly is this handled.
>> 
>> It's a good question, I see your point...
>> Basically we want to give a handler an option to exit with "I promise,
>> some memory will be freed soon" without doing anything destructive.
>> But keeping it save at the same time.
>
> Yes, something like OOM_BACKOFF, OOM_PROCESSED, OOM_FAILED.
>
>> I don't have a perfect answer out of my head, maybe some sort of a
>> rate-limiter/counter might work? E.g. a handler can promise this N times
>> before the kernel kicks in? Any ideas?
>
> Counters usually do not work very well for async operations. In this
> case there is oom_repaer and/or task exit to finish the oom operation.
> The former is bound and guaranteed to make a forward progress but there
> is no time frame to assume when that happens as it depends on how many
> tasks might be queued (usually a single one but this is not something to
> rely on because of concurrent ooms in memcgs and also multiple tasks
> could be killed at the same time).
> Another complication is that there are multiple levels of OOM to track
> (global, NUMA, memcg) so any watchdog would have to be aware of that as
> well.

Yeah, it has to be an atomic counter attached to the bpf oom "instance":
a policy attached to a specific cgroup or system-wide.

> I am really wondering whether we really need to be so careful with
> handlers. It is not like you would allow any random oom handler to be
> loaded, right? Would it make sense to start without this protection and
> converge to something as we see how this evolves? Maybe this will raise
> the bar for oom handlers as the price for bugs is going to be really
> high.

Right, bpf programs require CAP_SYS_ADMIN to be loaded.
I would still prefer to keep it 100% safe, but the more I think about it
the more I agree with you: the limitations of the protection mechanism
will likely create more issues than the protection itself is worth.

>> > Also is there any way to handle the oom by increasing the memcg limit?
>> > I do not see a callback for that.
>> 
>> There is no kfunc yet, but it's a good idea (which we accidentally
>> discussed few days ago). I'll implement it.
>
> Cool!

Thank you!


^ permalink raw reply	[flat|nested] 63+ messages in thread

* Re: [PATCH bpf-next v3 02/17] bpf: allow attaching struct_ops to cgroups
  2026-01-28  3:10   ` Josh Don
@ 2026-01-28 18:52     ` Roman Gushchin
  0 siblings, 0 replies; 63+ messages in thread
From: Roman Gushchin @ 2026-01-28 18:52 UTC (permalink / raw)
  To: Josh Don
  Cc: bpf, Michal Hocko, Alexei Starovoitov, Matt Bobrowski,
	Shakeel Butt, JP Kobryn, linux-kernel, linux-mm,
	Suren Baghdasaryan, Johannes Weiner, Andrew Morton

Josh Don <joshdon@google.com> writes:

> Hi Roman,
>
> On Mon, Jan 26, 2026 at 6:50 PM Roman Gushchin <roman.gushchin@linux.dev> wrote:
>>
>> Introduce an ability to attach bpf struct_ops'es to cgroups.
>>
> [snip]
>>  struct bpf_struct_ops_value {
>>         struct bpf_struct_ops_common_value common;
>> @@ -1220,6 +1222,10 @@ static void bpf_struct_ops_map_link_dealloc(struct bpf_link *link)
>>                 st_map->st_ops_desc->st_ops->unreg(&st_map->kvalue.data, link);
>>                 bpf_map_put(&st_map->map);
>>         }
>> +
>> +       if (st_link->cgroup)
>> +               cgroup_bpf_detach_struct_ops(st_link->cgroup, st_link);
>> +

Hi Josh!

>
> I was worried about concurrency with cgroup ops until I saw
> cgroup_bpf_detach_struct_ops() takes cgroup_lock() internally (since
> you take it inline sometimes below I falsely assumed it wasn't
> present). In any case, I'm wondering why you need to pass in the
> cgroup pointer to cgroup_bpf_detach_struct_ops() at all, rather than
> just the link?

Sure, good point.

>> @@ -1357,8 +1386,12 @@ int bpf_struct_ops_link_create(union bpf_attr *attr)
>>         struct bpf_link_primer link_primer;
>>         struct bpf_struct_ops_map *st_map;
>>         struct bpf_map *map;
>> +       struct cgroup *cgrp;
>>         int err;
>>
>> +       if (attr->link_create.flags & ~BPF_F_CGROUP_FD)
>> +               return -EINVAL;
>> +
>>         map = bpf_map_get(attr->link_create.map_fd);
>>         if (IS_ERR(map))
>>                 return PTR_ERR(map);
>> @@ -1378,11 +1411,26 @@ int bpf_struct_ops_link_create(union bpf_attr *attr)
>>         bpf_link_init(&link->link, BPF_LINK_TYPE_STRUCT_OPS, &bpf_struct_ops_map_lops, NULL,
>>                       attr->link_create.attach_type);
>>
>> +       init_waitqueue_head(&link->wait_hup);
>> +
>> +       if (attr->link_create.flags & BPF_F_CGROUP_FD) {
>> +               cgrp = cgroup_get_from_fd(attr->link_create.target_fd);
>> +               if (IS_ERR(cgrp)) {
>> +                       err = PTR_ERR(cgrp);
>> +                       goto err_out;
>> +               }
>> +               link->cgroup = cgrp;
>> +               err = cgroup_bpf_attach_struct_ops(cgrp, link);
>
> We have to be careful at this point. cgroup release could now occur
> concurrently which would clear link->cgroup. Maybe worth a comment
> here since this is a bit subtle.

Agree, will add.

Thanks!


^ permalink raw reply	[flat|nested] 63+ messages in thread

* Re: [PATCH bpf-next v3 07/17] mm: introduce BPF OOM struct ops
  2026-01-28 11:19   ` Michal Hocko
@ 2026-01-28 18:53     ` Roman Gushchin
  0 siblings, 0 replies; 63+ messages in thread
From: Roman Gushchin @ 2026-01-28 18:53 UTC (permalink / raw)
  To: Michal Hocko
  Cc: bpf, Alexei Starovoitov, Matt Bobrowski, Shakeel Butt, JP Kobryn,
	linux-kernel, linux-mm, Suren Baghdasaryan, Johannes Weiner,
	Andrew Morton

Michal Hocko <mhocko@suse.com> writes:

> Once additional point I forgot to mention previously
>
> On Mon 26-01-26 18:44:10, Roman Gushchin wrote:
>> @@ -1168,6 +1180,13 @@ bool out_of_memory(struct oom_control *oc)
>>  		return true;
>>  	}
>>  
>> +	/*
>> +	 * Let bpf handle the OOM first. If it was able to free up some memory,
>> +	 * bail out. Otherwise fall back to the kernel OOM killer.
>> +	 */
>> +	if (bpf_handle_oom(oc))
>> +		return true;
>> +
>>  	select_bad_process(oc);
>>  	/* Found nothing?!?! */
>>  	if (!oc->chosen) {
>
> Should this check for is_sysrq_oom and always use the in kernel OOM
> handling for Sysrq triggered ooms as a failsafe measure?

Yep, good point. Will implement in v4.

Thanks!


^ permalink raw reply	[flat|nested] 63+ messages in thread

* Re: [PATCH bpf-next v3 00/17] mm: BPF OOM
  2026-01-28 18:23         ` Roman Gushchin
@ 2026-01-28 18:53           ` Alexei Starovoitov
  0 siblings, 0 replies; 63+ messages in thread
From: Alexei Starovoitov @ 2026-01-28 18:53 UTC (permalink / raw)
  To: Roman Gushchin
  Cc: Michal Hocko, bpf, Alexei Starovoitov, Matt Bobrowski,
	Shakeel Butt, JP Kobryn, LKML, linux-mm, Suren Baghdasaryan,
	Johannes Weiner, Andrew Morton

On Wed, Jan 28, 2026 at 10:23 AM Roman Gushchin
<roman.gushchin@linux.dev> wrote:
>
> Alexei Starovoitov <alexei.starovoitov@gmail.com> writes:
>
> > On Wed, Jan 28, 2026 at 12:06 AM Michal Hocko <mhocko@suse.com> wrote:
> >>
> >>
> >> > Another viable idea (also suggested by Andrew Morton) is to develop
> >> > a production ready memcg-aware OOM killer in BPF, put the source code
> >> > into the kernel tree and make it loadable by default (obviously under a
> >> > config option). Myself or one of my colleagues will try to explore it a
> >> > bit later: the tricky part is this by-default loading because there are
> >> > no existing precedents.
> >>
> >> It certainly makes sense to have trusted implementation of a commonly
> >> requested oom policy that we couldn't implement due to specific nature
> >> that doesn't really apply to many users. And have that in the tree. I am
> >> not thrilled about auto-loading because this could be easily done by a
> >> simple tooling.
> >
> > Production ready bpf-oom program(s) must be part of this set.
> > We've seen enough attempts to add bpf st_ops in various parts of
> > the kernel without providing realistic bpf progs that will drive
> > those hooks. It's great to have flexibility and people need
> > to have a freedom to develop their own bpf-oom policy, but
> > the author of the patch set who's advocating for the new
> > bpf hooks must provide their real production progs and
> > share their real use case with the community.
> > It's not cool to hide it.
>
> In my case it's not about hiding, it's a chicken and egg problem:
> the upstream first model contradicts with the idea to include the
> production results into the patchset. In other words, I want to settle
> down the interface before shipping something to prod.
>
> I guess the compromise here is to initially include a bpf oom policy
> inspired by what systemd-oomd does and what is proven to work for a
> broad range of users.

Works for me.

> Policies suited for large datacenters can be
> added later, but also their generic usefulness might be limited by the
> need of proprietary userspace orchestration engines.

Agree. That's the flexibility part that makes the whole thing worthwhile
and the reason to do such oom policies as bpf progs.
But something tangible and useful needs to be there from day one.
systemd-oomd-like sounds very reasonable.


^ permalink raw reply	[flat|nested] 63+ messages in thread

* Re: [PATCH bpf-next v3 07/17] mm: introduce BPF OOM struct ops
  2026-01-28  3:26   ` Josh Don
@ 2026-01-28 19:03     ` Roman Gushchin
  0 siblings, 0 replies; 63+ messages in thread
From: Roman Gushchin @ 2026-01-28 19:03 UTC (permalink / raw)
  To: Josh Don
  Cc: bpf, Michal Hocko, Alexei Starovoitov, Matt Bobrowski,
	Shakeel Butt, JP Kobryn, linux-kernel, linux-mm,
	Suren Baghdasaryan, Johannes Weiner, Andrew Morton

Josh Don <joshdon@google.com> writes:

> Thanks Roman!
>
> On Mon, Jan 26, 2026 at 6:51 PM Roman Gushchin <roman.gushchin@linux.dev> wrote:
>>
>> Introduce a bpf struct ops for implementing custom OOM handling
>> policies.
>>
>> +bool bpf_handle_oom(struct oom_control *oc)
>> +{
>> +       struct bpf_struct_ops_link *st_link;
>> +       struct bpf_oom_ops *bpf_oom_ops;
>> +       struct mem_cgroup *memcg;
>> +       struct bpf_map *map;
>> +       int ret = 0;
>> +
>> +       /*
>> +        * System-wide OOMs are handled by the struct ops attached
>> +        * to the root memory cgroup
>> +        */
>> +       memcg = oc->memcg ? oc->memcg : root_mem_cgroup;
>> +
>> +       rcu_read_lock_trace();
>> +
>> +       /* Find the nearest bpf_oom_ops traversing the cgroup tree upwards */
>> +       for (; memcg; memcg = parent_mem_cgroup(memcg)) {
>> +               st_link = rcu_dereference_check(memcg->css.cgroup->bpf.bpf_oom_link,
>> +                                               rcu_read_lock_trace_held());
>> +               if (!st_link)
>> +                       continue;
>> +
>> +               map = rcu_dereference_check((st_link->map),
>> +                                           rcu_read_lock_trace_held());
>> +               if (!map)
>> +                       continue;
>> +
>> +               /* Call BPF OOM handler */
>> +               bpf_oom_ops = bpf_struct_ops_data(map);
>> +               ret = bpf_ops_handle_oom(bpf_oom_ops, st_link, oc);
>> +               if (ret && oc->bpf_memory_freed)
>> +                       break;
>> +               ret = 0;
>> +       }
>> +
>> +       rcu_read_unlock_trace();
>> +
>> +       return ret && oc->bpf_memory_freed;
>
> If bpf claims to have freed memory but didn't actually do so, that
> seems like something potentially worth alerting to. Perhaps something
> to add to the oom header output?

Michal pointed at a more fundamental problem: if a bpf handler performed
some actions (e.g. killed a process), how do we safely allow other
handlers to exit without performing redundant destructive operations?
Right now it works by marking victim processes, so that subsequent kernel
oom handlers just bail out if they see a marked process.

I don't know how to extend it to generic actions. E.g. we can have an
atomic counter attached to the bpf oom instance (link) and bump it when
performing a destructive operation, but it's not clear when to clear it.

So maybe it's not worth it at all and it's better to drop this
protection mechanism altogether.

Thanks!


^ permalink raw reply	[flat|nested] 63+ messages in thread

* Re: [PATCH bpf-next v3 02/17] bpf: allow attaching struct_ops to cgroups
  2026-01-28 11:25   ` Matt Bobrowski
@ 2026-01-28 19:18     ` Roman Gushchin
  0 siblings, 0 replies; 63+ messages in thread
From: Roman Gushchin @ 2026-01-28 19:18 UTC (permalink / raw)
  To: Matt Bobrowski
  Cc: bpf, Michal Hocko, Alexei Starovoitov, Shakeel Butt, JP Kobryn,
	linux-kernel, linux-mm, Suren Baghdasaryan, Johannes Weiner,
	Andrew Morton

Matt Bobrowski <mattbobrowski@google.com> writes:

> On Mon, Jan 26, 2026 at 06:44:05PM -0800, Roman Gushchin wrote:
>> Introduce an ability to attach bpf struct_ops'es to cgroups.
>> 
>> From user's standpoint it works in the following way:
>> a user passes a BPF_F_CGROUP_FD flag and specifies the target cgroup
>> fd while creating a struct_ops link. As the result, the bpf struct_ops
>> link will be created and attached to a cgroup.
>> 
>> The cgroup.bpf structure maintains a list of attached struct ops links.
>> If the cgroup is getting deleted, attached struct ops'es are getting
>> auto-detached and the userspace program gets a notification.
>> 
>> This change doesn't answer the question how bpf programs belonging
>> to these struct ops'es will be executed. It will be done individually
>> for every bpf struct ops which supports this.
>> 
>> Please, note that unlike "normal" bpf programs, struct ops'es
>> are not propagated to cgroup sub-trees.
>> 
>> Signed-off-by: Roman Gushchin <roman.gushchin@linux.dev>
>> ---
>>  include/linux/bpf-cgroup-defs.h |  3 ++
>>  include/linux/bpf-cgroup.h      | 16 +++++++++
>>  include/linux/bpf.h             |  3 ++
>>  include/uapi/linux/bpf.h        |  3 ++
>>  kernel/bpf/bpf_struct_ops.c     | 59 ++++++++++++++++++++++++++++++---
>>  kernel/bpf/cgroup.c             | 46 +++++++++++++++++++++++++
>>  tools/include/uapi/linux/bpf.h  |  1 +
>>  7 files changed, 127 insertions(+), 4 deletions(-)
>> 
>> diff --git a/include/linux/bpf-cgroup-defs.h b/include/linux/bpf-cgroup-defs.h
>> index c9e6b26abab6..6c5e37190dad 100644
>> --- a/include/linux/bpf-cgroup-defs.h
>> +++ b/include/linux/bpf-cgroup-defs.h
>> @@ -71,6 +71,9 @@ struct cgroup_bpf {
>>  	/* temp storage for effective prog array used by prog_attach/detach */
>>  	struct bpf_prog_array *inactive;
>>  
>> +	/* list of bpf struct ops links */
>> +	struct list_head struct_ops_links;
>> +
>>  	/* reference counter used to detach bpf programs after cgroup removal */
>>  	struct percpu_ref refcnt;
>>  
>> diff --git a/include/linux/bpf-cgroup.h b/include/linux/bpf-cgroup.h
>> index 2f535331f926..a6c327257006 100644
>> --- a/include/linux/bpf-cgroup.h
>> +++ b/include/linux/bpf-cgroup.h
>> @@ -423,6 +423,11 @@ int cgroup_bpf_link_attach(const union bpf_attr *attr, struct bpf_prog *prog);
>>  int cgroup_bpf_prog_query(const union bpf_attr *attr,
>>  			  union bpf_attr __user *uattr);
>>  
>> +int cgroup_bpf_attach_struct_ops(struct cgroup *cgrp,
>> +				 struct bpf_struct_ops_link *link);
>> +void cgroup_bpf_detach_struct_ops(struct cgroup *cgrp,
>> +				  struct bpf_struct_ops_link *link);
>> +
>>  const struct bpf_func_proto *
>>  cgroup_common_func_proto(enum bpf_func_id func_id, const struct bpf_prog *prog);
>>  #else
>> @@ -451,6 +456,17 @@ static inline int cgroup_bpf_link_attach(const union bpf_attr *attr,
>>  	return -EINVAL;
>>  }
>>  
>> +static inline int cgroup_bpf_attach_struct_ops(struct cgroup *cgrp,
>> +					       struct bpf_struct_ops_link *link)
>> +{
>> +	return -EINVAL;
>> +}
>> +
>> +static inline void cgroup_bpf_detach_struct_ops(struct cgroup *cgrp,
>> +						struct bpf_struct_ops_link *link)
>> +{
>> +}
>> +
>>  static inline int cgroup_bpf_prog_query(const union bpf_attr *attr,
>>  					union bpf_attr __user *uattr)
>>  {
>> diff --git a/include/linux/bpf.h b/include/linux/bpf.h
>> index 899dd911dc82..391888eb257c 100644
>> --- a/include/linux/bpf.h
>> +++ b/include/linux/bpf.h
>> @@ -1894,6 +1894,9 @@ struct bpf_raw_tp_link {
>>  struct bpf_struct_ops_link {
>>  	struct bpf_link link;
>>  	struct bpf_map __rcu *map;
>> +	struct cgroup *cgroup;
>> +	bool cgroup_removed;
>> +	struct list_head list;
>>  	wait_queue_head_t wait_hup;
>>  };
>>  
>> diff --git a/include/uapi/linux/bpf.h b/include/uapi/linux/bpf.h
>> index 44e7dbc278e3..28544e8af1cd 100644
>> --- a/include/uapi/linux/bpf.h
>> +++ b/include/uapi/linux/bpf.h
>> @@ -1237,6 +1237,7 @@ enum bpf_perf_event_type {
>>  #define BPF_F_AFTER		(1U << 4)
>>  #define BPF_F_ID		(1U << 5)
>>  #define BPF_F_PREORDER		(1U << 6)
>> +#define BPF_F_CGROUP_FD		(1U << 7)
>>  #define BPF_F_LINK		BPF_F_LINK /* 1 << 13 */
>>  
>>  /* If BPF_F_STRICT_ALIGNMENT is used in BPF_PROG_LOAD command, the
>> @@ -6775,6 +6776,8 @@ struct bpf_link_info {
>>  		} xdp;
>>  		struct {
>>  			__u32 map_id;
>> +			__u32 :32;
>> +			__u64 cgroup_id;
>>  		} struct_ops;
>>  		struct {
>>  			__u32 pf;
>> diff --git a/kernel/bpf/bpf_struct_ops.c b/kernel/bpf/bpf_struct_ops.c
>> index de01cf3025b3..2e361e22cfa0 100644
>> --- a/kernel/bpf/bpf_struct_ops.c
>> +++ b/kernel/bpf/bpf_struct_ops.c
>> @@ -13,6 +13,8 @@
>>  #include <linux/btf_ids.h>
>>  #include <linux/rcupdate_wait.h>
>>  #include <linux/poll.h>
>> +#include <linux/bpf-cgroup.h>
>> +#include <linux/cgroup.h>
>>  
>>  struct bpf_struct_ops_value {
>>  	struct bpf_struct_ops_common_value common;
>> @@ -1220,6 +1222,10 @@ static void bpf_struct_ops_map_link_dealloc(struct bpf_link *link)
>>  		st_map->st_ops_desc->st_ops->unreg(&st_map->kvalue.data, link);
>>  		bpf_map_put(&st_map->map);
>>  	}
>> +
>> +	if (st_link->cgroup)
>> +		cgroup_bpf_detach_struct_ops(st_link->cgroup, st_link);
>> +
>>  	kfree(st_link);
>>  }
>>  
>> @@ -1228,6 +1234,7 @@ static void bpf_struct_ops_map_link_show_fdinfo(const struct bpf_link *link,
>>  {
>>  	struct bpf_struct_ops_link *st_link;
>>  	struct bpf_map *map;
>> +	u64 cgrp_id = 0;
>
> Assigning 0 to cgrp_id would technically be incorrect, right? Like,
> cgroup_id() for !CONFIG_CGROUPS defaults to returning 1, and for
> CONFIG_CGROUPS the ID allocation is done via the idr_alloc_cyclic()
> API using a range between 1 and INT_MAX. Perhaps here it serves as a
> valid sentinel value? Is that the rationale?

Yes. Idk, maybe (u64)-1 works better here, I don't have a strong
opinion. Realistically I doubt there are too many bpf users with
!CONFIG_CGROUPS. Alexei even suggested in the past to make CONFIG_MEMCG
mandatory, which implies CONFIG_CGROUPS.

> In general, shouldn't all the cgroup related logic within this source
> file be protected by a CONFIG_CGROUPS ifdef? For example, both
> cgroup_get_from_fd() and cgroup_put() lack stubs when building with
> !CONFIG_CGROUPS.
>
>>  	st_link = container_of(link, struct bpf_struct_ops_link, link);
>>  	rcu_read_lock();
>> @@ -1235,6 +1242,14 @@ static void bpf_struct_ops_map_link_show_fdinfo(const struct bpf_link *link,
>>  	if (map)
>>  		seq_printf(seq, "map_id:\t%d\n", map->id);
>>  	rcu_read_unlock();
>> +
>> +	cgroup_lock();
>> +	if (st_link->cgroup)
>> +		cgrp_id = cgroup_id(st_link->cgroup);
>> +	cgroup_unlock();
>> +
>> +	if (cgrp_id)
>> +		seq_printf(seq, "cgroup_id:\t%llu\n", cgrp_id);
>
> Probably could introduce a simple inline helper for the
> cgroup_lock()/cgroup_id()/cgroup_unlock() dance that's going on here
> and in bpf_struct_ops_map_link_fill_link_info() below.

I'll try, thanks!
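
Something along these lines, I suppose (untested sketch, just moving the
existing cgroup_lock()/cgroup_id()/cgroup_unlock() sequence into a helper):

	static u64 bpf_struct_ops_link_cgroup_id(struct bpf_struct_ops_link *st_link)
	{
		u64 cgrp_id = 0;

		cgroup_lock();
		if (st_link->cgroup)
			cgrp_id = cgroup_id(st_link->cgroup);
		cgroup_unlock();

		return cgrp_id;
	}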

>
>>  }
>>  
>>  static int bpf_struct_ops_map_link_fill_link_info(const struct bpf_link *link,
>> @@ -1242,6 +1257,7 @@ static int bpf_struct_ops_map_link_fill_link_info(const struct bpf_link *link,
>>  {
>>  	struct bpf_struct_ops_link *st_link;
>>  	struct bpf_map *map;
>> +	u64 cgrp_id = 0;
>>  
>>  	st_link = container_of(link, struct bpf_struct_ops_link, link);
>>  	rcu_read_lock();
>> @@ -1249,6 +1265,13 @@ static int bpf_struct_ops_map_link_fill_link_info(const struct bpf_link *link,
>>  	if (map)
>>  		info->struct_ops.map_id = map->id;
>>  	rcu_read_unlock();
>> +
>> +	cgroup_lock();
>> +	if (st_link->cgroup)
>> +		cgrp_id = cgroup_id(st_link->cgroup);
>> +	cgroup_unlock();
>> +
>> +	info->struct_ops.cgroup_id = cgrp_id;
>
> As mentioned above a simple inline helper could simply yield the
> following here:
>
> ...
> 	  info->struct_ops.cgroup_id = bpf_struct_ops_link_cgroup_id(st_link);
> ...
>
>>  	return 0;
>>  }
>>  
>> @@ -1327,6 +1350,9 @@ static int bpf_struct_ops_map_link_detach(struct bpf_link *link)
>>  
>>  	mutex_unlock(&update_mutex);
>>  
>> +	if (st_link->cgroup)
>> +		cgroup_bpf_detach_struct_ops(st_link->cgroup, st_link);
>> +
>>  	wake_up_interruptible_poll(&st_link->wait_hup, EPOLLHUP);
>>  
>>  	return 0;
>> @@ -1339,6 +1365,9 @@ static __poll_t bpf_struct_ops_map_link_poll(struct file *file,
>>  
>>  	poll_wait(file, &st_link->wait_hup, pts);
>>  
>> +	if (st_link->cgroup_removed)
>> +		return EPOLLHUP;
>> +
>>  	return rcu_access_pointer(st_link->map) ? 0 : EPOLLHUP;
>>  }
>>  
>> @@ -1357,8 +1386,12 @@ int bpf_struct_ops_link_create(union bpf_attr *attr)
>>  	struct bpf_link_primer link_primer;
>>  	struct bpf_struct_ops_map *st_map;
>>  	struct bpf_map *map;
>> +	struct cgroup *cgrp;
>>  	int err;
>>  
>> +	if (attr->link_create.flags & ~BPF_F_CGROUP_FD)
>> +		return -EINVAL;
>> +
>
> BPF_F_CGROUP_FD is dependent on the cgroup subsystem, therefore it
> probably makes some sense to only accept BPF_F_CGROUP_FD when
> CONFIG_CGROUP_BPF is enabled, otherwise -EOPNOTSUPP?
>
> I'd also probably rewrite this such that we do:
>
> ...
> 	struct cgroup *cgrp = NULL;
> 	...
> 	if (attr->link_create.flags & BPF_F_CGROUP_FD) {
> #if IS_ENABLED(CONFIG_CGROUP_BPF)
> 		cgrp = cgroup_get_from_fd(attr->link_create.target_fd);
> 		if (IS_ERR(cgrp))
> 			return PTR_ERR(cgrp);
> #else
> 		return -EOPNOTSUPP;
> #endif
> 	}
> ...
> 	if (cgrp) {
> 		link->cgroup = cgrp;
> 		if (cgroup_bpf_attach_struct_ops(cgrp, link)) {
> 			cgroup_put(cgrp);
> 			goto err_out;
> 		}
> 	}
>
> IMO the code is cleaner and reads better too.
>
>>  	map = bpf_map_get(attr->link_create.map_fd);
>>  	if (IS_ERR(map))
>>  		return PTR_ERR(map);
>> @@ -1378,11 +1411,26 @@ int bpf_struct_ops_link_create(union bpf_attr *attr)
>>  	bpf_link_init(&link->link, BPF_LINK_TYPE_STRUCT_OPS, &bpf_struct_ops_map_lops, NULL,
>>  		      attr->link_create.attach_type);
>>  
>> +	init_waitqueue_head(&link->wait_hup);
>> +
>> +	if (attr->link_create.flags & BPF_F_CGROUP_FD) {
>> +		cgrp = cgroup_get_from_fd(attr->link_create.target_fd);
>> +		if (IS_ERR(cgrp)) {
>> +			err = PTR_ERR(cgrp);
>> +			goto err_out;
>> +		}
>> +		link->cgroup = cgrp;
>> +		err = cgroup_bpf_attach_struct_ops(cgrp, link);
>> +		if (err) {
>> +			cgroup_put(cgrp);
>> +			link->cgroup = NULL;
>> +			goto err_out;
>> +		}
>> +	}
>> +
>>  	err = bpf_link_prime(&link->link, &link_primer);
>>  	if (err)
>> -		goto err_out;
>> -
>> -	init_waitqueue_head(&link->wait_hup);
>> +		goto err_put_cgroup;
>>  
>>  	/* Hold the update_mutex such that the subsystem cannot
>>  	 * do link->ops->detach() before the link is fully initialized.
>> @@ -1393,13 +1441,16 @@ int bpf_struct_ops_link_create(union bpf_attr *attr)
>>  		mutex_unlock(&update_mutex);
>>  		bpf_link_cleanup(&link_primer);
>>  		link = NULL;
>> -		goto err_out;
>> +		goto err_put_cgroup;
>>  	}
>>  	RCU_INIT_POINTER(link->map, map);
>>  	mutex_unlock(&update_mutex);
>>  
>>  	return bpf_link_settle(&link_primer);
>>  
>> +err_put_cgroup:
>> +	if (link && link->cgroup)
>> +		cgroup_bpf_detach_struct_ops(link->cgroup, link);
>>  err_out:
>>  	bpf_map_put(map);
>>  	kfree(link);
>> diff --git a/kernel/bpf/cgroup.c b/kernel/bpf/cgroup.c
>> index 69988af44b37..7b1903be6f69 100644
>> --- a/kernel/bpf/cgroup.c
>> +++ b/kernel/bpf/cgroup.c
>> @@ -16,6 +16,7 @@
>>  #include <linux/bpf-cgroup.h>
>>  #include <linux/bpf_lsm.h>
>>  #include <linux/bpf_verifier.h>
>> +#include <linux/poll.h>
>>  #include <net/sock.h>
>>  #include <net/bpf_sk_storage.h>
>>  
>> @@ -307,12 +308,23 @@ static void cgroup_bpf_release(struct work_struct *work)
>>  					       bpf.release_work);
>>  	struct bpf_prog_array *old_array;
>>  	struct list_head *storages = &cgrp->bpf.storages;
>> +	struct bpf_struct_ops_link *st_link, *st_tmp;
>>  	struct bpf_cgroup_storage *storage, *stmp;
>> +	LIST_HEAD(st_links);
>>  
>>  	unsigned int atype;
>>  
>>  	cgroup_lock();
>>  
>> +	list_splice_init(&cgrp->bpf.struct_ops_links, &st_links);
>> +	list_for_each_entry_safe(st_link, st_tmp, &st_links, list) {
>> +		st_link->cgroup = NULL;
>> +		st_link->cgroup_removed = true;
>> +		cgroup_put(cgrp);
>> +		if (IS_ERR(bpf_link_inc_not_zero(&st_link->link)))
>> +			list_del(&st_link->list);
>> +	}
>> +
>>  	for (atype = 0; atype < ARRAY_SIZE(cgrp->bpf.progs); atype++) {
>>  		struct hlist_head *progs = &cgrp->bpf.progs[atype];
>>  		struct bpf_prog_list *pl;
>> @@ -346,6 +358,11 @@ static void cgroup_bpf_release(struct work_struct *work)
>>  
>>  	cgroup_unlock();
>>  
>> +	list_for_each_entry_safe(st_link, st_tmp, &st_links, list) {
>> +		st_link->link.ops->detach(&st_link->link);
>> +		bpf_link_put(&st_link->link);
>> +	}
>> +
>>  	for (p = cgroup_parent(cgrp); p; p = cgroup_parent(p))
>>  		cgroup_bpf_put(p);
>>  
>> @@ -525,6 +542,7 @@ static int cgroup_bpf_inherit(struct cgroup *cgrp)
>>  		INIT_HLIST_HEAD(&cgrp->bpf.progs[i]);
>>  
>>  	INIT_LIST_HEAD(&cgrp->bpf.storages);
>> +	INIT_LIST_HEAD(&cgrp->bpf.struct_ops_links);
>>  
>>  	for (i = 0; i < NR; i++)
>>  		if (compute_effective_progs(cgrp, i, &arrays[i]))
>> @@ -2759,3 +2777,31 @@ cgroup_common_func_proto(enum bpf_func_id func_id, const struct bpf_prog *prog)
>>  		return NULL;
>>  	}
>>  }
>> +
>> +int cgroup_bpf_attach_struct_ops(struct cgroup *cgrp,
>> +				 struct bpf_struct_ops_link *link)
>> +{
>> +	int ret = 0;
>> +
>> +	cgroup_lock();
>> +	if (percpu_ref_is_zero(&cgrp->bpf.refcnt)) {
>> +		ret = -EBUSY;
>
> If the cgroup is dying, then perhaps -EINVAL would be more appropriate
> here, no? I'd argue that -EBUSY implies a temporary or transient
> state.

Idk, I thought about it and settled on -EBUSY to highlight the
transient nature of the issue. ENOENT is another option.
I don't really think EINVAL is the best choice here.

>
>> +		goto out;
>> +	}
>> +	list_add_tail(&link->list, &cgrp->bpf.struct_ops_links);
>> +out:
>> +	cgroup_unlock();
>> +	return ret;
>> +}
>> +
>> +void cgroup_bpf_detach_struct_ops(struct cgroup *cgrp,
>> +				  struct bpf_struct_ops_link *link)
>> +{
>> +	cgroup_lock();
>> +	if (link->cgroup == cgrp) {
>> +		list_del(&link->list);
>> +		link->cgroup = NULL;
>> +		cgroup_put(cgrp);
>> +	}
>> +	cgroup_unlock();
>> +}
>
> Within cgroup_bpf_attach_struct_ops() and
> cgroup_bpf_detach_struct_ops() the cgrp pointer appears to be
> superfluous? Both should probably only operate on link->cgroup
> instead? A !link->cgroup when calling either should be considered as
> -EINVAL.

Ack.

Thank you for the review!


^ permalink raw reply	[flat|nested] 63+ messages in thread

* Re: [PATCH bpf-next v3 09/17] mm: introduce bpf_out_of_memory() BPF kfunc
  2026-01-27  2:44 ` [PATCH bpf-next v3 09/17] mm: introduce bpf_out_of_memory() BPF kfunc Roman Gushchin
@ 2026-01-28 20:21   ` Matt Bobrowski
  0 siblings, 0 replies; 63+ messages in thread
From: Matt Bobrowski @ 2026-01-28 20:21 UTC (permalink / raw)
  To: Roman Gushchin
  Cc: bpf, Michal Hocko, Alexei Starovoitov, Shakeel Butt, JP Kobryn,
	linux-kernel, linux-mm, Suren Baghdasaryan, Johannes Weiner,
	Andrew Morton

On Mon, Jan 26, 2026 at 06:44:12PM -0800, Roman Gushchin wrote:
> Introduce the bpf_out_of_memory() bpf kfunc, which allows declaring
> an out of memory event and triggering the corresponding kernel OOM
> handling mechanism.
> 
> It takes a trusted memcg pointer (or NULL for system-wide OOMs)
> as an argument, as well as the page order.
> 
> If the BPF_OOM_FLAGS_WAIT_ON_OOM_LOCK flag is not set, only one OOM
> can be declared and handled in the system at once, so if the function
> is called in parallel with another OOM being handled, it bails out with -EBUSY.
> This mode is suited for global OOMs: any concurrent OOM will likely
> do the job and release some memory. In the blocking mode (which is
> suited for memcg OOMs) the execution waits on the oom_lock mutex.
> 
> The function is declared as sleepable. This guarantees that it won't
> be called from an atomic context. This is required by the OOM handling
> code, which shouldn't be called from a non-blocking context.
> 
> Handling of a memcg OOM almost always requires taking the
> css_set_lock spinlock. The fact that bpf_out_of_memory() is sleepable
> also guarantees that it can't be called with the css_set_lock held,
> so the kernel can't deadlock on it.
> 
> To avoid deadlocks on the oom lock, the function is filtered out for
> bpf oom struct ops programs and all tracing programs.
> 
> Signed-off-by: Roman Gushchin <roman.gushchin@linux.dev>
> ---
>  include/linux/oom.h |  5 +++
>  mm/oom_kill.c       | 85 +++++++++++++++++++++++++++++++++++++++++++--
>  2 files changed, 88 insertions(+), 2 deletions(-)
> 
> diff --git a/include/linux/oom.h b/include/linux/oom.h
> index c2dce336bcb4..851dba9287b5 100644
> --- a/include/linux/oom.h
> +++ b/include/linux/oom.h
> @@ -21,6 +21,11 @@ enum oom_constraint {
>  	CONSTRAINT_MEMCG,
>  };
>  
> +enum bpf_oom_flags {
> +	BPF_OOM_FLAGS_WAIT_ON_OOM_LOCK = 1 << 0,
> +	BPF_OOM_FLAGS_LAST = 1 << 1,
> +};
> +
>  /*
>   * Details of the page allocation that triggered the oom killer that are used to
>   * determine what should be killed.
> diff --git a/mm/oom_kill.c b/mm/oom_kill.c
> index 09897597907f..8f63a370b8f5 100644
> --- a/mm/oom_kill.c
> +++ b/mm/oom_kill.c
> @@ -1334,6 +1334,53 @@ __bpf_kfunc int bpf_oom_kill_process(struct oom_control *oc,
>  	return 0;
>  }
>  
> +/**
> + * bpf_out_of_memory - declare Out Of Memory state and invoke OOM killer
> + * @memcg__nullable: memcg or NULL for system-wide OOMs
> + * @order: order of page which wasn't allocated
> + * @flags: flags
> + *
> + * Declares the Out Of Memory state and invokes the OOM killer.
> + *
> + * OOM handlers are synchronized using the oom_lock mutex. If wait_on_oom_lock
> + * is true, the function will wait on it. Otherwise it bails out with -EBUSY
> + * if oom_lock is contended.
> + *
> + * Generally it's advised to pass wait_on_oom_lock=false for global OOMs
> + * and wait_on_oom_lock=true for memcg-scoped OOMs.
> + *
> + * Returns 1 if the forward progress was achieved and some memory was freed.
> + * Returns a negative value if an error occurred.
> + */
> +__bpf_kfunc int bpf_out_of_memory(struct mem_cgroup *memcg__nullable,
> +				  int order, u64 flags)
> +{
> +	struct oom_control oc = {
> +		.memcg = memcg__nullable,
> +		.gfp_mask = GFP_KERNEL,
> +		.order = order,
> +	};
> +	int ret;
> +
> +	if (flags & ~(BPF_OOM_FLAGS_LAST - 1))
> +		return -EINVAL;
> +
> +	if (oc.order < 0 || oc.order > MAX_PAGE_ORDER)
> +		return -EINVAL;
> +
> +	if (flags & BPF_OOM_FLAGS_WAIT_ON_OOM_LOCK) {
> +		ret = mutex_lock_killable(&oom_lock);

If contended and we end up waiting here, some forward progress could
have been made in the interim, enough that this pending OOM event
initiated by the call into bpf_out_of_memory() may no longer even be
warranted. What do you think about adding an escape hatch here, which
could simply be in the form of a user-defined function callback?

> +		if (ret)
> +			return ret;
> +	} else if (!mutex_trylock(&oom_lock))
> +		return -EBUSY;
> +
> +	ret = out_of_memory(&oc);
> +
> +	mutex_unlock(&oom_lock);
> +	return ret;
> +}
> +
>  __bpf_kfunc_end_defs();
>  
>  BTF_KFUNCS_START(bpf_oom_kfuncs)
> @@ -1356,14 +1403,48 @@ static const struct btf_kfunc_id_set bpf_oom_kfunc_set = {
>  	.filter         = bpf_oom_kfunc_filter,
>  };
>  
> +BTF_KFUNCS_START(bpf_declare_oom_kfuncs)
> +BTF_ID_FLAGS(func, bpf_out_of_memory, KF_SLEEPABLE)
> +BTF_KFUNCS_END(bpf_declare_oom_kfuncs)
> +
> +static int bpf_declare_oom_kfunc_filter(const struct bpf_prog *prog, u32 kfunc_id)
> +{
> +	if (!btf_id_set8_contains(&bpf_declare_oom_kfuncs, kfunc_id))
> +		return 0;
> +
> +	if (prog->type == BPF_PROG_TYPE_STRUCT_OPS &&
> +	    prog->aux->attach_btf_id == bpf_oom_ops_ids[0])
> +		return -EACCES;
> +
> +	if (prog->type == BPF_PROG_TYPE_TRACING)
> +		return -EACCES;
> +
> +	return 0;
> +}
> +
> +static const struct btf_kfunc_id_set bpf_declare_oom_kfunc_set = {
> +	.owner          = THIS_MODULE,
> +	.set            = &bpf_declare_oom_kfuncs,
> +	.filter         = bpf_declare_oom_kfunc_filter,
> +};
> +
>  static int __init bpf_oom_init(void)
>  {
>  	int err;
>  
>  	err = register_btf_kfunc_id_set(BPF_PROG_TYPE_STRUCT_OPS,
>  					&bpf_oom_kfunc_set);
> -	if (err)
> -		pr_warn("error while registering bpf oom kfuncs: %d", err);
> +	if (err) {
> +		pr_warn("error while registering struct_ops bpf oom kfuncs: %d", err);
> +		return err;
> +	}
> +
> +	err = register_btf_kfunc_id_set(BPF_PROG_TYPE_UNSPEC,
> +					&bpf_declare_oom_kfunc_set);
> +	if (err) {
> +		pr_warn("error while registering unspec bpf oom kfuncs: %d", err);
> +		return err;
> +	}
>  
>  	return err;
>  }
> -- 
> 2.52.0
> 


^ permalink raw reply	[flat|nested] 63+ messages in thread

* Re: [PATCH bpf-next v3 07/17] mm: introduce BPF OOM struct ops
  2026-01-27  2:44 ` [PATCH bpf-next v3 07/17] mm: introduce BPF OOM struct ops Roman Gushchin
                     ` (2 preceding siblings ...)
  2026-01-28 11:19   ` Michal Hocko
@ 2026-01-29 21:00   ` Martin KaFai Lau
  2026-01-30 23:29     ` Roman Gushchin
  3 siblings, 1 reply; 63+ messages in thread
From: Martin KaFai Lau @ 2026-01-29 21:00 UTC (permalink / raw)
  To: Roman Gushchin
  Cc: Michal Hocko, Alexei Starovoitov, Matt Bobrowski, Shakeel Butt,
	JP Kobryn, linux-kernel, linux-mm, Suren Baghdasaryan,
	Johannes Weiner, Andrew Morton, bpf

On 1/26/26 6:44 PM, Roman Gushchin wrote:
> +bool bpf_handle_oom(struct oom_control *oc)
> +{
> +	struct bpf_struct_ops_link *st_link;
> +	struct bpf_oom_ops *bpf_oom_ops;
> +	struct mem_cgroup *memcg;
> +	struct bpf_map *map;
> +	int ret = 0;
> +
> +	/*
> +	 * System-wide OOMs are handled by the struct ops attached
> +	 * to the root memory cgroup
> +	 */
> +	memcg = oc->memcg ? oc->memcg : root_mem_cgroup;
> +
> +	rcu_read_lock_trace();
> +
> +	/* Find the nearest bpf_oom_ops traversing the cgroup tree upwards */
> +	for (; memcg; memcg = parent_mem_cgroup(memcg)) {
> +		st_link = rcu_dereference_check(memcg->css.cgroup->bpf.bpf_oom_link,
> +						rcu_read_lock_trace_held());
> +		if (!st_link)
> +			continue;
> +
> +		map = rcu_dereference_check((st_link->map),
> +					    rcu_read_lock_trace_held());
> +		if (!map)
> +			continue;
> +
> +		/* Call BPF OOM handler */
> +		bpf_oom_ops = bpf_struct_ops_data(map);
> +		ret = bpf_ops_handle_oom(bpf_oom_ops, st_link, oc);
> +		if (ret && oc->bpf_memory_freed)
> +			break;
> +		ret = 0;
> +	}
> +
> +	rcu_read_unlock_trace();
> +
> +	return ret && oc->bpf_memory_freed;
> +}
> +

[ ... ]

> +static int bpf_oom_ops_reg(void *kdata, struct bpf_link *link)
> +{
> +	struct bpf_struct_ops_link *st_link = (struct bpf_struct_ops_link *)link;
> +	struct cgroup *cgrp;
> +
> +	/* The link is not yet fully initialized, but cgroup should be set */
> +	if (!link)
> +		return -EOPNOTSUPP;
> +
> +	cgrp = st_link->cgroup;
> +	if (!cgrp)
> +		return -EINVAL;
> +
> +	if (cmpxchg(&cgrp->bpf.bpf_oom_link, NULL, st_link))
> +		return -EEXIST;
iiuc, this will allow only one oom_ops to be attached to a cgroup. 
Considering oom_ops is the only user of the cgrp->bpf.struct_ops_links 
(added in patch 2), the list should have only one element for now.

Copy some context from the patch 2 commit log.

 > This change doesn't answer the question how bpf programs belonging
 > to these struct ops'es will be executed. It will be done individually
 > for every bpf struct ops which supports this.
 >
 > Please, note that unlike "normal" bpf programs, struct ops'es
 > are not propagated to cgroup sub-trees.

There are NONE, BPF_F_ALLOW_OVERRIDE, and BPF_F_ALLOW_MULTI; which one 
is closest to the bpf_handle_oom() semantic? If the ordering needs to 
change (or multi needs to be allowed) in the future, does it need a new 
flag, or can the existing BPF_F_xxx flags be used?



^ permalink raw reply	[flat|nested] 63+ messages in thread

* Re: [PATCH bpf-next v3 07/17] mm: introduce BPF OOM struct ops
  2026-01-29 21:00   ` Martin KaFai Lau
@ 2026-01-30 23:29     ` Roman Gushchin
  2026-02-02 20:27       ` Martin KaFai Lau
  0 siblings, 1 reply; 63+ messages in thread
From: Roman Gushchin @ 2026-01-30 23:29 UTC (permalink / raw)
  To: Martin KaFai Lau
  Cc: Michal Hocko, Alexei Starovoitov, Matt Bobrowski, Shakeel Butt,
	JP Kobryn, linux-kernel, linux-mm, Suren Baghdasaryan,
	Johannes Weiner, Andrew Morton, bpf

Martin KaFai Lau <martin.lau@linux.dev> writes:

> On 1/26/26 6:44 PM, Roman Gushchin wrote:
>> +bool bpf_handle_oom(struct oom_control *oc)
>> +{
>> +	struct bpf_struct_ops_link *st_link;
>> +	struct bpf_oom_ops *bpf_oom_ops;
>> +	struct mem_cgroup *memcg;
>> +	struct bpf_map *map;
>> +	int ret = 0;
>> +
>> +	/*
>> +	 * System-wide OOMs are handled by the struct ops attached
>> +	 * to the root memory cgroup
>> +	 */
>> +	memcg = oc->memcg ? oc->memcg : root_mem_cgroup;
>> +
>> +	rcu_read_lock_trace();
>> +
>> +	/* Find the nearest bpf_oom_ops traversing the cgroup tree upwards */
>> +	for (; memcg; memcg = parent_mem_cgroup(memcg)) {
>> +		st_link = rcu_dereference_check(memcg->css.cgroup->bpf.bpf_oom_link,
>> +						rcu_read_lock_trace_held());
>> +		if (!st_link)
>> +			continue;
>> +
>> +		map = rcu_dereference_check((st_link->map),
>> +					    rcu_read_lock_trace_held());
>> +		if (!map)
>> +			continue;
>> +
>> +		/* Call BPF OOM handler */
>> +		bpf_oom_ops = bpf_struct_ops_data(map);
>> +		ret = bpf_ops_handle_oom(bpf_oom_ops, st_link, oc);
>> +		if (ret && oc->bpf_memory_freed)
>> +			break;
>> +		ret = 0;
>> +	}
>> +
>> +	rcu_read_unlock_trace();
>> +
>> +	return ret && oc->bpf_memory_freed;
>> +}
>> +
>
> [ ... ]
>
>> +static int bpf_oom_ops_reg(void *kdata, struct bpf_link *link)
>> +{
>> +	struct bpf_struct_ops_link *st_link = (struct bpf_struct_ops_link *)link;
>> +	struct cgroup *cgrp;
>> +
>> +	/* The link is not yet fully initialized, but cgroup should be set */
>> +	if (!link)
>> +		return -EOPNOTSUPP;
>> +
>> +	cgrp = st_link->cgroup;
>> +	if (!cgrp)
>> +		return -EINVAL;
>> +
>> +	if (cmpxchg(&cgrp->bpf.bpf_oom_link, NULL, st_link))
>> +		return -EEXIST;
> iiuc, this will allow only one oom_ops to be attached to a
> cgroup. Considering oom_ops is the only user of the
> cgrp->bpf.struct_ops_links (added in patch 2), the list should have
> only one element for now.
>
> Copy some context from the patch 2 commit log.

Hi Martin!

Sorry, I'm not quite sure what you mean, can you please elaborate
more?

We decided (in conversations at LPC) that 1 bpf oom policy for
memcg is good for now (with a potential to extend in the future, if
there will be use cases). But it seems like there is a lot of interest
to attach struct ops'es to cgroups (there are already a couple of
patchsets posted based on my earlier v2 patches), so I tried to make the
bpf link mechanics suitable for multiple use cases from scratch.

Did I answer your question?

>
>> This change doesn't answer the question how bpf programs belonging
>> to these struct ops'es will be executed. It will be done individually
>> for every bpf struct ops which supports this.
>>
>> Please, note that unlike "normal" bpf programs, struct ops'es
>> are not propagated to cgroup sub-trees.
>
> There are NONE, BPF_F_ALLOW_OVERRIDE, and BPF_F_ALLOW_MULTI; which one
> is closest to the bpf_handle_oom() semantic? If the ordering needs to
> change (or multi needs to be allowed) in the future, does it need a new
> flag, or can the existing BPF_F_xxx flags be used?

I hope that existing flags can be used, but also I'm not sure we ever
would need multiple oom handlers per cgroup. Do you have any specific
concerns here?

Thanks!


^ permalink raw reply	[flat|nested] 63+ messages in thread

* Re: [PATCH bpf-next v3 00/17] mm: BPF OOM
  2026-01-28 16:59       ` Alexei Starovoitov
  2026-01-28 18:23         ` Roman Gushchin
@ 2026-02-02  3:26         ` Matt Bobrowski
  2026-02-02 17:50           ` Alexei Starovoitov
  1 sibling, 1 reply; 63+ messages in thread
From: Matt Bobrowski @ 2026-02-02  3:26 UTC (permalink / raw)
  To: Alexei Starovoitov
  Cc: Michal Hocko, Roman Gushchin, bpf, Alexei Starovoitov,
	Shakeel Butt, JP Kobryn, LKML, linux-mm, Suren Baghdasaryan,
	Johannes Weiner, Andrew Morton, joshdon

On Wed, Jan 28, 2026 at 08:59:34AM -0800, Alexei Starovoitov wrote:
> On Wed, Jan 28, 2026 at 12:06 AM Michal Hocko <mhocko@suse.com> wrote:
> >
> >
> > > Another viable idea (also suggested by Andrew Morton) is to develop
> > > a production ready memcg-aware OOM killer in BPF, put the source code
> > > into the kernel tree and make it loadable by default (obviously under a
> > > config option). Myself or one of my colleagues will try to explore it a
> > > bit later: the tricky part is this by-default loading because there are
> > > no existing precedents.
> >
> > It certainly makes sense to have trusted implementation of a commonly
> > requested oom policy that we couldn't implement due to specific nature
> > that doesn't really apply to many users. And have that in the tree. I am
> > not thrilled about auto-loading because this could be easily done by a
> > simple tooling.
> 
> Production ready bpf-oom program(s) must be part of this set.
> We've seen enough attempts to add bpf st_ops in various parts of
> the kernel without providing realistic bpf progs that will drive
> those hooks. It's great to have flexibility and people need
> to have a freedom to develop their own bpf-oom policy, but
> the author of the patch set who's advocating for the new
> bpf hooks must provide their real production progs and
> share their real use case with the community.
> It's not cool to hide it.
> In that sense enabling auto-loading without requiring an end user
> to install the toolchain and build bpf programs/rust/whatnot
> is necessary too.
> bpf-oom can be a self contained part of vmlinux binary.
> We already have a mechanism to do that.
> This way the end user doesn't need to be a bpf expert, doesn't need
> to install clang, build the tools, etc.
> They can just enable fancy new bpf-oom policy and see whether
> it's helping their apps or not while knowing nothing about bpf.

For the auto-loading capability you speak of here, I'm currently
interpreting it as being some form of conceptually similar extension
to the BPF preload functionality. Have I understood this correctly? If
so, I feel as though something like this would be a completely
independent stream of work, orthogonal to this BPF OOM feature, right?
Or is it that you'd like this new auto-loading capability completed as a
hard prerequisite before pulling in the BPF OOM feature?


^ permalink raw reply	[flat|nested] 63+ messages in thread

* Re: [PATCH bpf-next v3 06/17] mm: define mem_cgroup_get_from_ino() outside of CONFIG_SHRINKER_DEBUG
  2026-01-27  2:44 ` [PATCH bpf-next v3 06/17] mm: define mem_cgroup_get_from_ino() outside of CONFIG_SHRINKER_DEBUG Roman Gushchin
  2026-01-27  6:12   ` Yafang Shao
@ 2026-02-02  3:50   ` Shakeel Butt
  1 sibling, 0 replies; 63+ messages in thread
From: Shakeel Butt @ 2026-02-02  3:50 UTC (permalink / raw)
  To: Roman Gushchin
  Cc: bpf, Michal Hocko, Alexei Starovoitov, Matt Bobrowski, JP Kobryn,
	linux-kernel, linux-mm, Suren Baghdasaryan, Johannes Weiner,
	Andrew Morton

On Mon, Jan 26, 2026 at 06:44:09PM -0800, Roman Gushchin wrote:
> mem_cgroup_get_from_ino() can be reused by the BPF OOM implementation,
> but currently depends on CONFIG_SHRINKER_DEBUG. Remove this dependency.
> 
> Signed-off-by: Roman Gushchin <roman.gushchin@linux.dev>
> Acked-by: Michal Hocko <mhocko@suse.com>

This code has been changed in the mm-tree and you can directly use
mem_cgroup_get_from_id() on top of those changes.


^ permalink raw reply	[flat|nested] 63+ messages in thread

* Re: [PATCH bpf-next v3 07/17] mm: introduce BPF OOM struct ops
  2026-01-27 21:12     ` Roman Gushchin
  2026-01-28  8:00       ` Michal Hocko
@ 2026-02-02  4:06       ` Matt Bobrowski
  1 sibling, 0 replies; 63+ messages in thread
From: Matt Bobrowski @ 2026-02-02  4:06 UTC (permalink / raw)
  To: Roman Gushchin
  Cc: Michal Hocko, bpf, Alexei Starovoitov, Shakeel Butt, JP Kobryn,
	linux-kernel, linux-mm, Suren Baghdasaryan, Johannes Weiner,
	Andrew Morton

On Tue, Jan 27, 2026 at 09:12:56PM +0000, Roman Gushchin wrote:
> Michal Hocko <mhocko@suse.com> writes:
> 
> > On Mon 26-01-26 18:44:10, Roman Gushchin wrote:
> >> Introduce a bpf struct ops for implementing custom OOM handling
> >> policies.
> >> 
> >> It's possible to load one bpf_oom_ops for the system and one
> >> bpf_oom_ops for every memory cgroup. In case of a memcg OOM, the
> >> cgroup tree is traversed from the OOM'ing memcg up to the root and
> >> corresponding BPF OOM handlers are executed until some memory is
> >> freed. If no memory is freed, the kernel OOM killer is invoked.
> >> 
> >> The struct ops provides the bpf_handle_out_of_memory() callback,
> >> which is expected to return 1 if it was able to free some memory and 0
> >> otherwise. If 1 is returned, the kernel also checks the bpf_memory_freed
> >> field of the oom_control structure, which is expected to be set by
> >> kfuncs suitable for releasing memory (which will be introduced later
> >> in the patch series). If both are set, OOM is considered handled,
> >> otherwise the next OOM handler in the chain is executed: e.g. BPF OOM
> >> attached to the parent cgroup or the kernel OOM killer.
> >
> > I still find this dual reporting a bit confusing. I can see your
> > intention in having a pre-defined "releasers" of the memory to trust BPF
> > handlers more but they do have access to oc->bpf_memory_freed so they
> > can manipulate it. Therefore an additional level of protection is rather
> > weak.
> 
> No, they can't. They only have read-only access.
> 
> > It is also not really clear to me how this works while there is OOM
> > victim on the way out. (i.e. tsk_is_oom_victim() -> abort case). This
> > will result in no killing therefore no bpf_memory_freed, right? Handler
> > itself should consider its work done. How exactly is this handled.
> 
> It's a good question, I see your point...
> Basically we want to give a handler an option to exit with "I promise,
> some memory will be freed soon" without doing anything destructive,
> while keeping it safe at the same time.
> 
> I don't have a perfect answer off the top of my head, maybe some sort of a
> rate-limiter/counter might work? E.g. a handler can promise this N times
> before the kernel kicks in? Any ideas?
> 
> > Also is there any way to handle the oom by increasing the memcg limit?
> > I do not see a callback for that.
> 
> There is no kfunc yet, but it's a good idea (which we happened to
> discuss a few days ago). I'll implement it.

Yes, please, this is something that I had mentioned to you the other
day too. With this kind of BPF kfunc, we'll basically be able to
handle memcg scoped OOM events inline without necessarily being forced
to kill off anything.


^ permalink raw reply	[flat|nested] 63+ messages in thread

* Re: [PATCH bpf-next v3 08/17] mm: introduce bpf_oom_kill_process() bpf kfunc
  2026-01-27  2:44 ` [PATCH bpf-next v3 08/17] mm: introduce bpf_oom_kill_process() bpf kfunc Roman Gushchin
  2026-01-27 20:21   ` Martin KaFai Lau
@ 2026-02-02  4:49   ` Matt Bobrowski
  1 sibling, 0 replies; 63+ messages in thread
From: Matt Bobrowski @ 2026-02-02  4:49 UTC (permalink / raw)
  To: Roman Gushchin
  Cc: bpf, Michal Hocko, Alexei Starovoitov, Shakeel Butt, JP Kobryn,
	linux-kernel, linux-mm, Suren Baghdasaryan, Johannes Weiner,
	Andrew Morton

On Mon, Jan 26, 2026 at 06:44:11PM -0800, Roman Gushchin wrote:
> Introduce bpf_oom_kill_process() bpf kfunc, which is supposed
> to be used by BPF OOM programs. It allows killing a process
> in exactly the same way the OOM killer does: using the OOM reaper,
> bumping the corresponding memcg and global statistics, respecting
> memory.oom.group etc.
> 
> On success, it sets the oom_control's bpf_memory_freed field to true,
> enabling the bpf program to bypass the kernel OOM killer.
> 
> Signed-off-by: Roman Gushchin <roman.gushchin@linux.dev>
> ---
>  mm/oom_kill.c | 80 +++++++++++++++++++++++++++++++++++++++++++++++++++
>  1 file changed, 80 insertions(+)
> 
> diff --git a/mm/oom_kill.c b/mm/oom_kill.c
> index 44bbcf033804..09897597907f 100644
> --- a/mm/oom_kill.c
> +++ b/mm/oom_kill.c
> @@ -46,6 +46,7 @@
>  #include <linux/cred.h>
>  #include <linux/nmi.h>
>  #include <linux/bpf_oom.h>
> +#include <linux/btf.h>
>  
>  #include <asm/tlb.h>
>  #include "internal.h"
> @@ -1290,3 +1291,82 @@ SYSCALL_DEFINE2(process_mrelease, int, pidfd, unsigned int, flags)
>  	return -ENOSYS;
>  #endif /* CONFIG_MMU */
>  }
> +
> +#ifdef CONFIG_BPF_SYSCALL
> +
> +__bpf_kfunc_start_defs();
> +/**
> + * bpf_oom_kill_process - Kill a process as OOM killer
> + * @oc: pointer to oom_control structure, describes OOM context
> + * @task: task to be killed
> + * @message__str: message to print in dmesg
> + *
> + * Kill a process in a way similar to the kernel OOM killer.
> + * This means dump the necessary information to dmesg, adjust memcg
> + * statistics, leverage the oom reaper, respect memory.oom.group etc.
> + *
> + * bpf_oom_kill_process() marks the forward progress by setting
> + * oc->bpf_memory_freed. If the progress was made, the bpf program
> + * is free to decide if the kernel oom killer should be invoked.
> + * Otherwise it's enforced, so that a bad bpf program can't
> + * deadlock the machine on memory.
> + */
> +__bpf_kfunc int bpf_oom_kill_process(struct oom_control *oc,
> +				     struct task_struct *task,
> +				     const char *message__str)
> +{
> +	if (oom_unkillable_task(task))
> +		return -EPERM;
> +
> +	if (task->signal->oom_score_adj == OOM_SCORE_ADJ_MIN)
> +		return -EINVAL;

task->signal->oom_score_adj == OOM_SCORE_ADJ_MIN is also
representative of an unkillable task, so why not fold this up into the
above conditional? Also, why not check states like
mm_flags_test(MMF_OOM_SKIP, task->mm) and in_vfork() here too?

In all fairness I'm a little surprised about constraints like
task->signal->oom_score_adj == OOM_SCORE_ADJ_MIN being enforced
here. You could argue that the whole purpose of BPF OOM is such that
you can implement your own victim selection algorithms entirely in BPF
using your own set of heuristics and what not without needing to
strictly respect properties like oom_score_adj.

In any case, I think we should at least clearly document such
constraints.
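
Concretely, the fold could look something like this (which would also mean
returning -EPERM rather than -EINVAL for the OOM_SCORE_ADJ_MIN case):

	if (oom_unkillable_task(task) ||
	    task->signal->oom_score_adj == OOM_SCORE_ADJ_MIN)
		return -EPERM;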

> +	/* paired with put_task_struct() in oom_kill_process() */
> +	get_task_struct(task);
> +
> +	oc->chosen = task;
> +
> +	oom_kill_process(oc, message__str);
> +
> +	oc->chosen = NULL;
> +	oc->bpf_memory_freed = true;
> +
> +	return 0;
> +}
> +
> +__bpf_kfunc_end_defs();
> +
> +BTF_KFUNCS_START(bpf_oom_kfuncs)
> +BTF_ID_FLAGS(func, bpf_oom_kill_process, KF_SLEEPABLE)
> +BTF_KFUNCS_END(bpf_oom_kfuncs)
> +
> +BTF_ID_LIST_SINGLE(bpf_oom_ops_ids, struct, bpf_oom_ops)
> +
> +static int bpf_oom_kfunc_filter(const struct bpf_prog *prog, u32 kfunc_id)
> +{
> +	if (prog->type != BPF_PROG_TYPE_STRUCT_OPS ||
> +	    prog->aux->attach_btf_id != bpf_oom_ops_ids[0])
> +		return -EACCES;
> +	return 0;
> +}
> +
> +static const struct btf_kfunc_id_set bpf_oom_kfunc_set = {
> +	.owner          = THIS_MODULE,
> +	.set            = &bpf_oom_kfuncs,
> +	.filter         = bpf_oom_kfunc_filter,
> +};
> +
> +static int __init bpf_oom_init(void)
> +{
> +	int err;
> +
> +	err = register_btf_kfunc_id_set(BPF_PROG_TYPE_STRUCT_OPS,
> +					&bpf_oom_kfunc_set);
> +	if (err)
> +		pr_warn("error while registering bpf oom kfuncs: %d", err);
> +
> +	return err;
> +}
> +late_initcall(bpf_oom_init);
> +
> +#endif
> -- 
> 2.52.0
> 


^ permalink raw reply	[flat|nested] 63+ messages in thread

* Re: [PATCH bpf-next v3 05/17] bpf: mark struct oom_control's memcg field as TRUSTED_OR_NULL
  2026-01-27  2:44 ` [PATCH bpf-next v3 05/17] bpf: mark struct oom_control's memcg field as TRUSTED_OR_NULL Roman Gushchin
  2026-01-27  6:06   ` Yafang Shao
@ 2026-02-02  4:56   ` Matt Bobrowski
  1 sibling, 0 replies; 63+ messages in thread
From: Matt Bobrowski @ 2026-02-02  4:56 UTC (permalink / raw)
  To: Roman Gushchin
  Cc: bpf, Michal Hocko, Alexei Starovoitov, Shakeel Butt, JP Kobryn,
	linux-kernel, linux-mm, Suren Baghdasaryan, Johannes Weiner,
	Andrew Morton, Kumar Kartikeya Dwivedi

On Mon, Jan 26, 2026 at 06:44:08PM -0800, Roman Gushchin wrote:
> Struct oom_control is used to describe the OOM context.
> Its memcg field defines the scope of the OOM: it's NULL for global
> OOMs and a valid memcg pointer for memcg-scoped OOMs.
> Teach the bpf verifier to recognize it as a trusted or NULL pointer.
> This provides the bpf OOM handler with a trusted memcg pointer,
> which for example is required for iterating the memcg's subtree.

This is fine. Feel free to add:

Acked-by: Matt Bobrowski <mattbobrowski@google.com>

> Signed-off-by: Roman Gushchin <roman.gushchin@linux.dev>
> Acked-by: Kumar Kartikeya Dwivedi <memxor@gmail.com>
> ---
>  kernel/bpf/verifier.c | 5 +++++
>  1 file changed, 5 insertions(+)
> 
> diff --git a/kernel/bpf/verifier.c b/kernel/bpf/verifier.c
> index c2f2650db9fd..cca36edb460d 100644
> --- a/kernel/bpf/verifier.c
> +++ b/kernel/bpf/verifier.c
> @@ -7242,6 +7242,10 @@ BTF_TYPE_SAFE_TRUSTED_OR_NULL(struct vm_area_struct) {
>  	struct file *vm_file;
>  };
>  
> +BTF_TYPE_SAFE_TRUSTED_OR_NULL(struct oom_control) {
> +	struct mem_cgroup *memcg;
> +};
> +
>  static bool type_is_rcu(struct bpf_verifier_env *env,
>  			struct bpf_reg_state *reg,
>  			const char *field_name, u32 btf_id)
> @@ -7284,6 +7288,7 @@ static bool type_is_trusted_or_null(struct bpf_verifier_env *env,
>  	BTF_TYPE_EMIT(BTF_TYPE_SAFE_TRUSTED_OR_NULL(struct socket));
>  	BTF_TYPE_EMIT(BTF_TYPE_SAFE_TRUSTED_OR_NULL(struct dentry));
>  	BTF_TYPE_EMIT(BTF_TYPE_SAFE_TRUSTED_OR_NULL(struct vm_area_struct));
> +	BTF_TYPE_EMIT(BTF_TYPE_SAFE_TRUSTED_OR_NULL(struct oom_control));
>  
>  	return btf_nested_type_is_trusted(&env->log, reg, field_name, btf_id,
>  					  "__safe_trusted_or_null");
> -- 
> 2.52.0
> 


^ permalink raw reply	[flat|nested] 63+ messages in thread

* Re: [PATCH bpf-next v3 10/17] mm: introduce bpf_task_is_oom_victim() kfunc
  2026-01-27  2:44 ` [PATCH bpf-next v3 10/17] mm: introduce bpf_task_is_oom_victim() kfunc Roman Gushchin
@ 2026-02-02  5:39   ` Matt Bobrowski
  2026-02-02 17:30     ` Alexei Starovoitov
  0 siblings, 1 reply; 63+ messages in thread
From: Matt Bobrowski @ 2026-02-02  5:39 UTC (permalink / raw)
  To: Roman Gushchin
  Cc: bpf, Michal Hocko, Alexei Starovoitov, Shakeel Butt, JP Kobryn,
	linux-kernel, linux-mm, Suren Baghdasaryan, Johannes Weiner,
	Andrew Morton

On Mon, Jan 26, 2026 at 06:44:13PM -0800, Roman Gushchin wrote:
> Export tsk_is_oom_victim() helper as a BPF kfunc.
> It's very useful to avoid redundant oom kills.
> 
> Signed-off-by: Roman Gushchin <roman.gushchin@linux.dev>
> Suggested-by: Michal Hocko <mhocko@suse.com>
> ---
>  mm/oom_kill.c | 14 ++++++++++++++
>  1 file changed, 14 insertions(+)
> 
> diff --git a/mm/oom_kill.c b/mm/oom_kill.c
> index 8f63a370b8f5..53f9f9674658 100644
> --- a/mm/oom_kill.c
> +++ b/mm/oom_kill.c
> @@ -1381,10 +1381,24 @@ __bpf_kfunc int bpf_out_of_memory(struct mem_cgroup *memcg__nullable,
>  	return ret;
>  }
>  
> +/**
> + * bpf_task_is_oom_victim - Check if the task has been marked as an OOM victim
> + * @task: task to check
> + *
> + * Returns true if the task has been previously selected by the OOM killer
> + * to be killed. It's expected that the task will be destroyed soon and some
> + * memory will be freed, so maybe no additional actions required.
> + */
> +__bpf_kfunc bool bpf_task_is_oom_victim(struct task_struct *task)
> +{
> +	return tsk_is_oom_victim(task);
> +}

Why not just do a direct memory read (i.e., task->signal->oom_mm)
within the BPF program? I'm not quite convinced that a BPF kfunc
wrapper for something like tsk_is_oom_victim() is warranted as you can
literally achieve the same semantics without one.
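
For reference, the open-coded equivalent is just a CO-RE read of that
field, something like this sketch (assuming vmlinux.h and libbpf's CO-RE
helpers; tsk_is_oom_victim() simply tests signal->oom_mm):

	#include "vmlinux.h"
	#include <bpf/bpf_core_read.h>

	/* open-coded tsk_is_oom_victim(): the task was selected by the
	 * OOM killer iff task->signal->oom_mm is set */
	static bool task_is_oom_victim(struct task_struct *task)
	{
		return BPF_CORE_READ(task, signal, oom_mm) != NULL;
	}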


^ permalink raw reply	[flat|nested] 63+ messages in thread

* Re: [PATCH bpf-next v3 10/17] mm: introduce bpf_task_is_oom_victim() kfunc
  2026-02-02  5:39   ` Matt Bobrowski
@ 2026-02-02 17:30     ` Alexei Starovoitov
  2026-02-03  0:14       ` Roman Gushchin
  0 siblings, 1 reply; 63+ messages in thread
From: Alexei Starovoitov @ 2026-02-02 17:30 UTC (permalink / raw)
  To: Matt Bobrowski
  Cc: Roman Gushchin, bpf, Michal Hocko, Alexei Starovoitov,
	Shakeel Butt, JP Kobryn, LKML, linux-mm, Suren Baghdasaryan,
	Johannes Weiner, Andrew Morton

On Sun, Feb 1, 2026 at 9:39 PM Matt Bobrowski <mattbobrowski@google.com> wrote:
>
> On Mon, Jan 26, 2026 at 06:44:13PM -0800, Roman Gushchin wrote:
> > Export tsk_is_oom_victim() helper as a BPF kfunc.
> > It's very useful to avoid redundant oom kills.
> >
> > Signed-off-by: Roman Gushchin <roman.gushchin@linux.dev>
> > Suggested-by: Michal Hocko <mhocko@suse.com>
> > ---
> >  mm/oom_kill.c | 14 ++++++++++++++
> >  1 file changed, 14 insertions(+)
> >
> > diff --git a/mm/oom_kill.c b/mm/oom_kill.c
> > index 8f63a370b8f5..53f9f9674658 100644
> > --- a/mm/oom_kill.c
> > +++ b/mm/oom_kill.c
> > @@ -1381,10 +1381,24 @@ __bpf_kfunc int bpf_out_of_memory(struct mem_cgroup *memcg__nullable,
> >       return ret;
> >  }
> >
> > +/**
> > + * bpf_task_is_oom_victim - Check if the task has been marked as an OOM victim
> > + * @task: task to check
> > + *
> > + * Returns true if the task has been previously selected by the OOM killer
> > + * to be killed. It's expected that the task will be destroyed soon and some
> > + * memory will be freed, so maybe no additional actions required.
> > + */
> > +__bpf_kfunc bool bpf_task_is_oom_victim(struct task_struct *task)
> > +{
> > +     return tsk_is_oom_victim(task);
> > +}
>
> Why not just do a direct memory read (i.e., task->signal->oom_mm)
> within the BPF program? I'm not quite convinced that a BPF kfunc
> wrapper for something like tsk_is_oom_victim() is warranted as you can
> literally achieve the same semantics without one.

+1
there is no need for this kfunc.


^ permalink raw reply	[flat|nested] 63+ messages in thread

* Re: [PATCH bpf-next v3 00/17] mm: BPF OOM
  2026-02-02  3:26         ` Matt Bobrowski
@ 2026-02-02 17:50           ` Alexei Starovoitov
  2026-02-04 23:52             ` Matt Bobrowski
  0 siblings, 1 reply; 63+ messages in thread
From: Alexei Starovoitov @ 2026-02-02 17:50 UTC (permalink / raw)
  To: Matt Bobrowski
  Cc: Michal Hocko, Roman Gushchin, bpf, Alexei Starovoitov,
	Shakeel Butt, JP Kobryn, LKML, linux-mm, Suren Baghdasaryan,
	Johannes Weiner, Andrew Morton, Josh Don

On Sun, Feb 1, 2026 at 7:26 PM Matt Bobrowski <mattbobrowski@google.com> wrote:
>
> On Wed, Jan 28, 2026 at 08:59:34AM -0800, Alexei Starovoitov wrote:
> > On Wed, Jan 28, 2026 at 12:06 AM Michal Hocko <mhocko@suse.com> wrote:
> > >
> > >
> > > > Another viable idea (also suggested by Andrew Morton) is to develop
> > > > a production ready memcg-aware OOM killer in BPF, put the source code
> > > > into the kernel tree and make it loadable by default (obviously under a
> > > > config option). Myself or one of my colleagues will try to explore it a
> > > > bit later: the tricky part is this by-default loading because there are
> > > > no existing precedents.
> > >
> > > It certainly makes sense to have trusted implementation of a commonly
> > > requested oom policy that we couldn't implement due to specific nature
> > > that doesn't really apply to many users. And have that in the tree. I am
> > > not thrilled about auto-loading because this could be easily done by a
> > > simple tooling.
> >
> > Production ready bpf-oom program(s) must be part of this set.
> > We've seen enough attempts to add bpf st_ops in various parts of
> > the kernel without providing realistic bpf progs that will drive
> > those hooks. It's great to have flexibility and people need
> > to have a freedom to develop their own bpf-oom policy, but
> > the author of the patch set who's advocating for the new
> > bpf hooks must provide their real production progs and
> > share their real use case with the community.
> > It's not cool to hide it.
> > In that sense enabling auto-loading without requiring an end user
> > to install the toolchain and build bpf programs/rust/whatnot
> > is necessary too.
> > bpf-oom can be a self contained part of vmlinux binary.
> > We already have a mechanism to do that.
> > This way the end user doesn't need to be a bpf expert, doesn't need
> > to install clang, build the tools, etc.
> > They can just enable fancy new bpf-oom policy and see whether
> > it's helping their apps or not while knowing nothing about bpf.
>
> For the auto-loading capability you speak of here, I'm currently
> interpreting it as being some form of conceptually similar extension
> to the BPF preload functionality. Have I understood this correctly? If
> so, I feel as though something like this would be a completely
> independent stream of work, orthogonal to this BPF OOM feature, right?
> Or, is that you'd like this new auto-loading capability completed as a
> hard prerequisite before pulling in the BPF OOM feature?

It's not a hard prerequisite, but it has to be thought through.
bpf side is ready today. bpf preload is an example of it.
The oom side needs to design an interface to do it.
sysctl to enable builtin bpf-oom policy is probably too rigid.
Maybe a file in cgroupfs? Writing a name of bpf-oom policy would
trigger load and attach to that cgroup.
Or you can plug it exactly like bpf preload:
when bpffs is mounted all builtin bpf progs get loaded and create
".debug" files in bpffs.

I recall we discussed an ability to create files in bpffs from
tracepoints. This way bpffs can replicate cgroupfs directory
structure without user space involvement. New cgroup -> new directory
in cgroupfs -> tracepoint -> bpf prog -> new directory in bpffs
-> create "enable_bpf_oom.debug" file in there.
Writing to that file would trigger a bpf prog that attaches the bpf-oom
prog to that cgroup.
Could be any combination of the above or something else,
but needs to be designed and agreed upon.
Otherwise, I'm afraid, we will have bpf-oom progs in selftests
and users who want to experiment with it would need kernel source
code, clang, etc to try it. We need to lower the barrier to use it.


^ permalink raw reply	[flat|nested] 63+ messages in thread

* Re: [PATCH bpf-next v3 07/17] mm: introduce BPF OOM struct ops
  2026-01-30 23:29     ` Roman Gushchin
@ 2026-02-02 20:27       ` Martin KaFai Lau
  0 siblings, 0 replies; 63+ messages in thread
From: Martin KaFai Lau @ 2026-02-02 20:27 UTC (permalink / raw)
  To: Roman Gushchin
  Cc: Michal Hocko, Alexei Starovoitov, Matt Bobrowski, Shakeel Butt,
	JP Kobryn, linux-kernel, linux-mm, Suren Baghdasaryan,
	Johannes Weiner, Andrew Morton, bpf



On 1/30/26 3:29 PM, Roman Gushchin wrote:
> Martin KaFai Lau <martin.lau@linux.dev> writes:
> 
>> On 1/26/26 6:44 PM, Roman Gushchin wrote:
>>> +bool bpf_handle_oom(struct oom_control *oc)
>>> +{
>>> +	struct bpf_struct_ops_link *st_link;
>>> +	struct bpf_oom_ops *bpf_oom_ops;
>>> +	struct mem_cgroup *memcg;
>>> +	struct bpf_map *map;
>>> +	int ret = 0;
>>> +
>>> +	/*
>>> +	 * System-wide OOMs are handled by the struct ops attached
>>> +	 * to the root memory cgroup
>>> +	 */
>>> +	memcg = oc->memcg ? oc->memcg : root_mem_cgroup;
>>> +
>>> +	rcu_read_lock_trace();
>>> +
>>> +	/* Find the nearest bpf_oom_ops traversing the cgroup tree upwards */
>>> +	for (; memcg; memcg = parent_mem_cgroup(memcg)) {
>>> +		st_link = rcu_dereference_check(memcg->css.cgroup->bpf.bpf_oom_link,
>>> +						rcu_read_lock_trace_held());
>>> +		if (!st_link)
>>> +			continue;
>>> +
>>> +		map = rcu_dereference_check((st_link->map),
>>> +					    rcu_read_lock_trace_held());
>>> +		if (!map)
>>> +			continue;
>>> +
>>> +		/* Call BPF OOM handler */
>>> +		bpf_oom_ops = bpf_struct_ops_data(map);
>>> +		ret = bpf_ops_handle_oom(bpf_oom_ops, st_link, oc);
>>> +		if (ret && oc->bpf_memory_freed)
>>> +			break;
>>> +		ret = 0;
>>> +	}
>>> +
>>> +	rcu_read_unlock_trace();
>>> +
>>> +	return ret && oc->bpf_memory_freed;
>>> +}
>>> +
>>
>> [ ... ]
>>
>>> +static int bpf_oom_ops_reg(void *kdata, struct bpf_link *link)
>>> +{
>>> +	struct bpf_struct_ops_link *st_link = (struct bpf_struct_ops_link *)link;
>>> +	struct cgroup *cgrp;
>>> +
>>> +	/* The link is not yet fully initialized, but cgroup should be set */
>>> +	if (!link)
>>> +		return -EOPNOTSUPP;
>>> +
>>> +	cgrp = st_link->cgroup;
>>> +	if (!cgrp)
>>> +		return -EINVAL;
>>> +
>>> +	if (cmpxchg(&cgrp->bpf.bpf_oom_link, NULL, st_link))
>>> +		return -EEXIST;
>> iiuc, this will allow only one oom_ops to be attached to a
>> cgroup. Considering oom_ops is the only user of the
>> cgrp->bpf.struct_ops_links (added in patch 2), the list should have
>> only one element for now.
>>
>> Copy some context from the patch 2 commit log.
> 
> Hi Martin!
> 
> Sorry, I'm not quite sure what you mean, can you please elaborate
> more?
> 
> We decided (in conversations at LPC) that 1 bpf oom policy for
> memcg is good for now (with a potential to extend in the future, if
> there will be use cases). But it seems like there is a lot of interest
> to attach struct ops'es to cgroups (there are already a couple of
> patchsets posted based on my earlier v2 patches), so I tried to make the
> bpf link mechanics suitable for multiple use cases from scratch.
> 
> Did I answer your question?

Got it. The linked list is for future struct_ops implementations to 
attach to a cgroup.

I should have mentioned the context. My bad.

BPF_PROG_TYPE_SOCK_OPS is currently a cgroup BPF prog. I am thinking of 
adding bpf_struct_ops support to provide hooks similar to those in 
BPF_PROG_TYPE_SOCK_OPS. There are some issues that need to be worked 
out. A major one is that the current cgroup progs have expectations on 
the ordering and override behavior based on the BPF_F_* and the runtime 
cgroup hierarchy. I was trying to see if there are pieces in this set 
that can be built upon. The linked list is a start but will need more 
work to make it performant for networking use.

> 
>>
>>> This change doesn't answer the question how bpf programs belonging
>>> to these struct ops'es will be executed. It will be done individually
>>> for every bpf struct ops which supports this.
>>>
>>> Please, note that unlike "normal" bpf programs, struct ops'es
>>> are not propagated to cgroup sub-trees.
>>
>> There are NONE, BPF_F_ALLOW_OVERRIDE, and BPF_F_ALLOW_MULTI, which one
>> may be closer to the bpf_handle_oom() semantic. If it needs to change
>> the ordering (or allow multi) in the future, does it need a new flag
>> or the existing BPF_F_xxx flags can be used.
> 
> I hope that existing flags can be used, but also I'm not sure we ever
> would need multiple oom handlers per cgroup. Do you have any specific
> concerns here?

Another question I have is about the default behavior when none of the 
BPF_F_* flags is specified when attaching a struct_ops to a cgroup.

 From uapi/bpf.h:

* NONE (default): No further BPF programs allowed in the subtree

iiuc, the bpf_handle_oom() is not the same as NONE. Should each 
struct_ops implementation have its own default policy? For the 
BPF_PROG_TYPE_SOCK_OPS work, I am thinking the default policy should be 
BPF_F_ALLOW_MULTI, which is now always set in 
cgroup_bpf_link_attach().




^ permalink raw reply	[flat|nested] 63+ messages in thread

* Re: [PATCH bpf-next v3 10/17] mm: introduce bpf_task_is_oom_victim() kfunc
  2026-02-02 17:30     ` Alexei Starovoitov
@ 2026-02-03  0:14       ` Roman Gushchin
  2026-02-03 13:23         ` Michal Hocko
  0 siblings, 1 reply; 63+ messages in thread
From: Roman Gushchin @ 2026-02-03  0:14 UTC (permalink / raw)
  To: Alexei Starovoitov, Michal Hocko
  Cc: Matt Bobrowski, bpf, Michal Hocko, Alexei Starovoitov,
	Shakeel Butt, JP Kobryn, LKML, linux-mm, Suren Baghdasaryan,
	Johannes Weiner, Andrew Morton

Alexei Starovoitov <alexei.starovoitov@gmail.com> writes:

> On Sun, Feb 1, 2026 at 9:39 PM Matt Bobrowski <mattbobrowski@google.com> wrote:
>>
>> On Mon, Jan 26, 2026 at 06:44:13PM -0800, Roman Gushchin wrote:
>> > Export tsk_is_oom_victim() helper as a BPF kfunc.
>> > It's very useful to avoid redundant oom kills.
>> >
>> > Signed-off-by: Roman Gushchin <roman.gushchin@linux.dev>
>> > Suggested-by: Michal Hocko <mhocko@suse.com>
>> > ---
>> >  mm/oom_kill.c | 14 ++++++++++++++
>> >  1 file changed, 14 insertions(+)
>> >
>> > diff --git a/mm/oom_kill.c b/mm/oom_kill.c
>> > index 8f63a370b8f5..53f9f9674658 100644
>> > --- a/mm/oom_kill.c
>> > +++ b/mm/oom_kill.c
>> > @@ -1381,10 +1381,24 @@ __bpf_kfunc int bpf_out_of_memory(struct mem_cgroup *memcg__nullable,
>> >       return ret;
>> >  }
>> >
>> > +/**
>> > + * bpf_task_is_oom_victim - Check if the task has been marked as an OOM victim
>> > + * @task: task to check
>> > + *
>> > + * Returns true if the task has been previously selected by the OOM killer
>> > + * to be killed. It's expected that the task will be destroyed soon and some
>> > + * memory will be freed, so maybe no additional actions required.
>> > + */
>> > +__bpf_kfunc bool bpf_task_is_oom_victim(struct task_struct *task)
>> > +{
>> > +     return tsk_is_oom_victim(task);
>> > +}
>>
>> Why not just do a direct memory read (i.e., task->signal->oom_mm)
>> within the BPF program? I'm not quite convinced that a BPF kfunc
>> wrapper for something like tsk_is_oom_victim() is warranted as you can
>> literally achieve the same semantics without one.
>
> +1
> there is no need for this kfunc.

It was explicitly asked by Michal Hocko, who is (co)maintaining the oom
code. I don't have a strong opinion here. I agree that it can be easily
open-coded without a kfunc, but at the same time the cost of having an
extra kfunc is not high and it makes the API more consistent.

Michal, do you feel strongly about having a dedicated kfunc vs the
direct memory read?


^ permalink raw reply	[flat|nested] 63+ messages in thread

* Re: [PATCH bpf-next v3 10/17] mm: introduce bpf_task_is_oom_victim() kfunc
  2026-02-03  0:14       ` Roman Gushchin
@ 2026-02-03 13:23         ` Michal Hocko
  2026-02-03 16:31           ` Alexei Starovoitov
  0 siblings, 1 reply; 63+ messages in thread
From: Michal Hocko @ 2026-02-03 13:23 UTC (permalink / raw)
  To: Roman Gushchin
  Cc: Alexei Starovoitov, Matt Bobrowski, bpf, Alexei Starovoitov,
	Shakeel Butt, JP Kobryn, LKML, linux-mm, Suren Baghdasaryan,
	Johannes Weiner, Andrew Morton

On Mon 02-02-26 16:14:37, Roman Gushchin wrote:
> Alexei Starovoitov <alexei.starovoitov@gmail.com> writes:
> 
> > On Sun, Feb 1, 2026 at 9:39 PM Matt Bobrowski <mattbobrowski@google.com> wrote:
> >>
> >> On Mon, Jan 26, 2026 at 06:44:13PM -0800, Roman Gushchin wrote:
> >> > Export tsk_is_oom_victim() helper as a BPF kfunc.
> >> > It's very useful to avoid redundant oom kills.
> >> >
> >> > Signed-off-by: Roman Gushchin <roman.gushchin@linux.dev>
> >> > Suggested-by: Michal Hocko <mhocko@suse.com>
> >> > ---
> >> >  mm/oom_kill.c | 14 ++++++++++++++
> >> >  1 file changed, 14 insertions(+)
> >> >
> >> > diff --git a/mm/oom_kill.c b/mm/oom_kill.c
> >> > index 8f63a370b8f5..53f9f9674658 100644
> >> > --- a/mm/oom_kill.c
> >> > +++ b/mm/oom_kill.c
> >> > @@ -1381,10 +1381,24 @@ __bpf_kfunc int bpf_out_of_memory(struct mem_cgroup *memcg__nullable,
> >> >       return ret;
> >> >  }
> >> >
> >> > +/**
> >> > + * bpf_task_is_oom_victim - Check if the task has been marked as an OOM victim
> >> > + * @task: task to check
> >> > + *
> >> > + * Returns true if the task has been previously selected by the OOM killer
> >> > + * to be killed. It's expected that the task will be destroyed soon and some
> >> > + * memory will be freed, so maybe no additional actions required.
> >> > + */
> >> > +__bpf_kfunc bool bpf_task_is_oom_victim(struct task_struct *task)
> >> > +{
> >> > +     return tsk_is_oom_victim(task);
> >> > +}
> >>
> >> Why not just do a direct memory read (i.e., task->signal->oom_mm)
> >> within the BPF program? I'm not quite convinced that a BPF kfunc
> >> wrapper for something like tsk_is_oom_victim() is warranted as you can
> >> literally achieve the same semantics without one.
> >
> > +1
> > there is no need for this kfunc.
> 
> It was explicitly asked by Michal Hocko, who is (co)maintaining the oom
> code. I don't have a strong opinion here. I agree that it can be easily
> open-coded without a kfunc, but at the same time the cost of having an
> extra kfunc is not high and it makes the API more consistent.
> 
> Michal, do you feel strongly about having a dedicated kfunc vs the
> direct memory read?

The reason I wanted this to be an explicit API is that the oom state is
quite an internal part of the oom synchronization. And I would really
like to keep that completely transparent to oom policies. In other words,
I do not want to have to touch all potential oom policies, or break them
in the worst case, just because we need to change this. So while it is a
trivial interface now (and hopefully for a long time), it is really an
internal thing.

Do I insist? No, I do not, but I would like to hear why this is a bad
idea.

-- 
Michal Hocko
SUSE Labs


^ permalink raw reply	[flat|nested] 63+ messages in thread

* Re: [PATCH bpf-next v3 10/17] mm: introduce bpf_task_is_oom_victim() kfunc
  2026-02-03 13:23         ` Michal Hocko
@ 2026-02-03 16:31           ` Alexei Starovoitov
  2026-02-04  9:02             ` Michal Hocko
  0 siblings, 1 reply; 63+ messages in thread
From: Alexei Starovoitov @ 2026-02-03 16:31 UTC (permalink / raw)
  To: Michal Hocko
  Cc: Roman Gushchin, Matt Bobrowski, bpf, Alexei Starovoitov,
	Shakeel Butt, JP Kobryn, LKML, linux-mm, Suren Baghdasaryan,
	Johannes Weiner, Andrew Morton

On Tue, Feb 3, 2026 at 5:23 AM Michal Hocko <mhocko@suse.com> wrote:
>
> On Mon 02-02-26 16:14:37, Roman Gushchin wrote:
> > Alexei Starovoitov <alexei.starovoitov@gmail.com> writes:
> >
> > > On Sun, Feb 1, 2026 at 9:39 PM Matt Bobrowski <mattbobrowski@google.com> wrote:
> > >>
> > >> On Mon, Jan 26, 2026 at 06:44:13PM -0800, Roman Gushchin wrote:
> > >> > Export tsk_is_oom_victim() helper as a BPF kfunc.
> > >> > It's very useful to avoid redundant oom kills.
> > >> >
> > >> > Signed-off-by: Roman Gushchin <roman.gushchin@linux.dev>
> > >> > Suggested-by: Michal Hocko <mhocko@suse.com>
> > >> > ---
> > >> >  mm/oom_kill.c | 14 ++++++++++++++
> > >> >  1 file changed, 14 insertions(+)
> > >> >
> > >> > diff --git a/mm/oom_kill.c b/mm/oom_kill.c
> > >> > index 8f63a370b8f5..53f9f9674658 100644
> > >> > --- a/mm/oom_kill.c
> > >> > +++ b/mm/oom_kill.c
> > >> > @@ -1381,10 +1381,24 @@ __bpf_kfunc int bpf_out_of_memory(struct mem_cgroup *memcg__nullable,
> > >> >       return ret;
> > >> >  }
> > >> >
> > >> > +/**
> > >> > + * bpf_task_is_oom_victim - Check if the task has been marked as an OOM victim
> > >> > + * @task: task to check
> > >> > + *
> > >> > + * Returns true if the task has been previously selected by the OOM killer
> > >> > + * to be killed. It's expected that the task will be destroyed soon and some
> > >> > + * memory will be freed, so maybe no additional actions required.
> > >> > + */
> > >> > +__bpf_kfunc bool bpf_task_is_oom_victim(struct task_struct *task)
> > >> > +{
> > >> > +     return tsk_is_oom_victim(task);
> > >> > +}
> > >>
> > >> Why not just do a direct memory read (i.e., task->signal->oom_mm)
> > >> within the BPF program? I'm not quite convinced that a BPF kfunc
> > >> wrapper for something like tsk_is_oom_victim() is warranted as you can
> > >> literally achieve the same semantics without one.
> > >
> > > +1
> > > there is no need for this kfunc.
> >
> > It was explicitly asked by Michal Hocko, who is (co)maintaining the oom
> > code. I don't have a strong opinion here. I agree that it can be easily
> > open-coded without a kfunc, but at the same time the cost of having an
> > extra kfunc is not high and it makes the API more consistent.
> >
> > Michal, do you feel strongly about having a dedicated kfunc vs the
> > direct memory read?
>
> The reason I wanted this an explicit API is that oom states are quite
> internal part of the oom synchronization. And I would really like to
> have that completely transparent for oom policies. In other words I do
> not want to touch all potential oom policies or break them in the worst
> case just because we need to change this. So while a trivial interface
> now (and hopefully for a long time) it is really an internal thing.
>
> Do I insist? No, I do not but I would like to hear why this is a bad
> idea.

It's a bad idea, since it doesn't address your goal.
A bpf prog can access task->signal->oom_mm without a kfunc just fine,
and it will be doing so because performance matters: something like

static inline bool foo(struct task_struct *task)
{
	return task->signal->oom_mm;
}

will be inlined as two loads, while a kfunc is a function call with six
registers being scratched.

If anything changes and, say, oom_mm gets renamed, whether it was a
kfunc or not doesn't change much. Progs will adapt to the new layout
easily with CO-RE. kfuncs can also be renamed/deleted, etc.
You're thinking about kfuncs as a stable api. They're definitely not.
They're not a layer of isolation either. kfuncs are necessary only
for the cases where a bpf prog cannot do something on its own.

"internal thing" is also a wrong way of thinking of bpf-oom.
bpf-oom _will_ look into oom, cgroup and kernel internals in general.
All bpf progs do because they have to do that to achieve their goals.
Everything in mm/internal.h have been available to access by bpf progs
for a decade now. Did it cause any issue to mm development? No.
So let's not build some non-existent wall or "internal oom thing".
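
For illustration only, the open-coded check might look roughly like this
in a CO-RE-enabled program. This is a sketch: the helper name is made up,
the field it reads (signal->oom_mm) is exactly the internal state under
discussion and not a stable interface, and in a struct_ops program that
receives a trusted task pointer a plain dereference may also be accepted
by the verifier:

#include "vmlinux.h"
#include <bpf/bpf_helpers.h>
#include <bpf/bpf_core_read.h>

/* Open-coded equivalent of tsk_is_oom_victim(): the task has already
 * been selected by the OOM killer iff signal->oom_mm is set. */
static __always_inline bool task_is_oom_victim(struct task_struct *p)
{
	return BPF_CORE_READ(p, signal, oom_mm) != NULL;
}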


^ permalink raw reply	[flat|nested] 63+ messages in thread

* Re: [PATCH bpf-next v3 10/17] mm: introduce bpf_task_is_oom_victim() kfunc
  2026-02-03 16:31           ` Alexei Starovoitov
@ 2026-02-04  9:02             ` Michal Hocko
  2026-02-05  0:12               ` Alexei Starovoitov
  0 siblings, 1 reply; 63+ messages in thread
From: Michal Hocko @ 2026-02-04  9:02 UTC (permalink / raw)
  To: Alexei Starovoitov
  Cc: Roman Gushchin, Matt Bobrowski, bpf, Alexei Starovoitov,
	Shakeel Butt, JP Kobryn, LKML, linux-mm, Suren Baghdasaryan,
	Johannes Weiner, Andrew Morton

On Tue 03-02-26 08:31:19, Alexei Starovoitov wrote:
> On Tue, Feb 3, 2026 at 5:23 AM Michal Hocko <mhocko@suse.com> wrote:
> >
> > On Mon 02-02-26 16:14:37, Roman Gushchin wrote:
[...]
> > > Michal, do you feel strongly about having a dedicated kfunc vs the
> > > direct memory read?
> >
> > The reason I wanted this an explicit API is that oom states are quite
> > internal part of the oom synchronization. And I would really like to
> > have that completely transparent for oom policies. In other words I do
> > not want to touch all potential oom policies or break them in the worst
> > case just because we need to change this. So while a trivial interface
> > now (and hopefully for a long time) it is really an internal thing.
> >
> > Do I insist? No, I do not but I would like to hear why this is a bad
> > idea.
> 
> It's a bad idea, since it doesn't address your goal.
> bpf prog can access task->signal->oom_mm without kfunc just fine
> and it will be doing so because performance matters and
> static inline bool foo(task)
> {
>   return task->signal->oom_mm;
> }

OK, so my understanding was that BPF can only use exported
functionality. If those progs can access whatever they can get a pointer
to, and then traverse from there, then this is largely moot.

> will be inlined as 2 loads while kfunc is a function call with 6 registers
> being scratched.

performance is not really crucial in this context. We are OOM; a couple
of loads vs. scratched registers will not make much difference. It is
really more about what code writers can/should be using. OOM is a piece
of complex code with many loose ends that might not be obvious.

> If anything changes and, say, oom_mm will get renamed whether
> it was kfunc or not doesn't change much. progs will adopt to a new
> way easily with CORE. kfuncs can also be renamed/deleted, etc.
> You're thinking about kfuncs as a stable api. It's definitely not.
> It's not a layer of isolation either. kfuncs are necessary only
> for the cases where bpf prog cannot do it on its own.

It is obviously not clear to me where that line is for BPF progs. Where
is this documented?

-- 
Michal Hocko
SUSE Labs


^ permalink raw reply	[flat|nested] 63+ messages in thread

* Re: [PATCH bpf-next v3 00/17] mm: BPF OOM
  2026-02-02 17:50           ` Alexei Starovoitov
@ 2026-02-04 23:52             ` Matt Bobrowski
  0 siblings, 0 replies; 63+ messages in thread
From: Matt Bobrowski @ 2026-02-04 23:52 UTC (permalink / raw)
  To: Alexei Starovoitov
  Cc: Michal Hocko, Roman Gushchin, bpf, Alexei Starovoitov,
	Shakeel Butt, JP Kobryn, LKML, linux-mm, Suren Baghdasaryan,
	Johannes Weiner, Andrew Morton, Josh Don

On Mon, Feb 02, 2026 at 09:50:05AM -0800, Alexei Starovoitov wrote:
> On Sun, Feb 1, 2026 at 7:26 PM Matt Bobrowski <mattbobrowski@google.com> wrote:
> >
> > On Wed, Jan 28, 2026 at 08:59:34AM -0800, Alexei Starovoitov wrote:
> > > On Wed, Jan 28, 2026 at 12:06 AM Michal Hocko <mhocko@suse.com> wrote:
> > > >
> > > >
> > > > > Another viable idea (also suggested by Andrew Morton) is to develop
> > > > > a production ready memcg-aware OOM killer in BPF, put the source code
> > > > > into the kernel tree and make it loadable by default (obviously under a
> > > > > config option). Myself or one of my colleagues will try to explore it a
> > > > > bit later: the tricky part is this by-default loading because there are
> > > > > no existing precedents.
> > > >
> > > > It certainly makes sense to have trusted implementation of a commonly
> > > > requested oom policy that we couldn't implement due to specific nature
> > > > that doesn't really apply to many users. And have that in the tree. I am
> > > > not thrilled about auto-loading because this could be easily done by a
> > > > simple tooling.
> > >
> > > Production ready bpf-oom program(s) must be part of this set.
> > > We've seen enough attempts to add bpf st_ops in various parts of
> > > the kernel without providing realistic bpf progs that will drive
> > > those hooks. It's great to have flexibility and people need
> > > to have a freedom to develop their own bpf-oom policy, but
> > > the author of the patch set who's advocating for the new
> > > bpf hooks must provide their real production progs and
> > > share their real use case with the community.
> > > It's not cool to hide it.
> > > In that sense enabling auto-loading without requiring an end user
> > > to install the toolchain and build bpf programs/rust/whatnot
> > > is necessary too.
> > > bpf-oom can be a self contained part of vmlinux binary.
> > > We already have a mechanism to do that.
> > > This way the end user doesn't need to be a bpf expert, doesn't need
> > > to install clang, build the tools, etc.
> > > They can just enable fancy new bpf-oom policy and see whether
> > > it's helping their apps or not while knowing nothing about bpf.
> >
> > For the auto-loading capability you speak of here, I'm currently
> > interpreting it as being some form of conceptually similar extension
> > to the BPF preload functionality. Have I understood this correctly? If
> > so, I feel as though something like this would be a completely
> > independent stream of work, orthogonal to this BPF OOM feature, right?
> > Or, is that you'd like this new auto-loading capability completed as a
> > hard prerequisite before pulling in the BPF OOM feature?
> 
> It's not a hard prerequisite, but it has to be thought through.
> bpf side is ready today. bpf preload is an example of it.
> The oom side needs to design an interface to do it.
> sysctl to enable builtin bpf-oom policy is probably too rigid.
> Maybe a file in cgroupfs? Writing a name of bpf-oom policy would
> trigger load and attach to that cgroup.
> Or you can plug it exactly like bpf preload:
> when bpffs is mounted all builtin bpf progs get loaded and create
> ".debug" files in bpffs.
> 
> I recall we discussed an ability to create files in bpffs from
> tracepoints. This way bpffs can replicate cgroupfs directory
> structure without user space involvement. New cgroup -> new directory
> in cgroupfs -> tracepoint -> bpf prog -> new directory in bpffs
> -> create "enable_bpf_oom.debug" file in there.
> Writing to that file we trigger bpf prog that will attach bpf-oom
> prog to that cgroup.
> Could be any combination of the above or something else,
> but needs to be designed and agreed upon.
> Otherwise, I'm afraid, we will have bpf-oom progs in selftests
> and users who want to experiment with it would need kernel source
> code, clang, etc to try it. We need to lower the barrier to use it.

OK, I see what you're saying here. I'll have a chat to Roman about
this and see what his thoughts are on it.


^ permalink raw reply	[flat|nested] 63+ messages in thread

* Re: [PATCH bpf-next v3 10/17] mm: introduce bpf_task_is_oom_victim() kfunc
  2026-02-04  9:02             ` Michal Hocko
@ 2026-02-05  0:12               ` Alexei Starovoitov
  0 siblings, 0 replies; 63+ messages in thread
From: Alexei Starovoitov @ 2026-02-05  0:12 UTC (permalink / raw)
  To: Michal Hocko
  Cc: Roman Gushchin, Matt Bobrowski, bpf, Alexei Starovoitov,
	Shakeel Butt, JP Kobryn, LKML, linux-mm, Suren Baghdasaryan,
	Johannes Weiner, Andrew Morton

On Wed, Feb 4, 2026 at 1:02 AM Michal Hocko <mhocko@suse.com> wrote:
>
> On Tue 03-02-26 08:31:19, Alexei Starovoitov wrote:
> > On Tue, Feb 3, 2026 at 5:23 AM Michal Hocko <mhocko@suse.com> wrote:
> > >
> > > On Mon 02-02-26 16:14:37, Roman Gushchin wrote:
> [...]
> > > > Michal, do you feel strongly about having a dedicated kfunc vs the
> > > > direct memory read?
> > >
> > > The reason I wanted this an explicit API is that oom states are quite
> > > internal part of the oom synchronization. And I would really like to
> > > have that completely transparent for oom policies. In other words I do
> > > not want to touch all potential oom policies or break them in the worst
> > > case just because we need to change this. So while a trivial interface
> > > now (and hopefully for a long time) it is really an internal thing.
> > >
> > > Do I insist? No, I do not but I would like to hear why this is a bad
> > > idea.
> >
> > It's a bad idea, since it doesn't address your goal.
> > bpf prog can access task->signal->oom_mm without kfunc just fine
> > and it will be doing so because performance matters and
> > static inline bool foo(task)
> > {
> >   return task->signal->oom_mm;
> > }
>
> OK, so my understanding was that BPF can only use exported
> functionality. If those progs can access whatever they get a pointer for
> and than traverse down the road then this is moot from a large part.

bpf has been able to access all kernel internals from day one, 10 years ago.
We made it more ergonomic over the years.

> > If anything changes and, say, oom_mm will get renamed whether
> > it was kfunc or not doesn't change much. progs will adopt to a new
> > way easily with CORE. kfuncs can also be renamed/deleted, etc.
> > You're thinking about kfuncs as a stable api. It's definitely not.
> > It's not a layer of isolation either. kfuncs are necessary only
> > for the cases where bpf prog cannot do it on its own.
>
> It is obviously not clear to me where that line is for BPF progs. Where
> is this documented?

See Documentation/bpf/kfuncs.rst,
especially the "kfunc lifecycle expectations" section.


^ permalink raw reply	[flat|nested] 63+ messages in thread

end of thread, other threads:[~2026-02-05  0:12 UTC | newest]

Thread overview: 63+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2026-01-27  2:44 [PATCH bpf-next v3 00/17] mm: BPF OOM Roman Gushchin
2026-01-27  2:44 ` [PATCH bpf-next v3 01/17] bpf: move bpf_struct_ops_link into bpf.h Roman Gushchin
2026-01-27  5:50   ` Yafang Shao
2026-01-28 11:28   ` Matt Bobrowski
2026-01-27  2:44 ` [PATCH bpf-next v3 02/17] bpf: allow attaching struct_ops to cgroups Roman Gushchin
2026-01-27  3:08   ` bot+bpf-ci
2026-01-27  5:49   ` Yafang Shao
2026-01-28  3:10   ` Josh Don
2026-01-28 18:52     ` Roman Gushchin
2026-01-28 11:25   ` Matt Bobrowski
2026-01-28 19:18     ` Roman Gushchin
2026-01-27  2:44 ` [PATCH bpf-next v3 03/17] libbpf: fix return value on memory allocation failure Roman Gushchin
2026-01-27  5:52   ` Yafang Shao
2026-01-27  2:44 ` [PATCH bpf-next v3 04/17] libbpf: introduce bpf_map__attach_struct_ops_opts() Roman Gushchin
2026-01-27  3:08   ` bot+bpf-ci
2026-01-27  2:44 ` [PATCH bpf-next v3 05/17] bpf: mark struct oom_control's memcg field as TRUSTED_OR_NULL Roman Gushchin
2026-01-27  6:06   ` Yafang Shao
2026-02-02  4:56   ` Matt Bobrowski
2026-01-27  2:44 ` [PATCH bpf-next v3 06/17] mm: define mem_cgroup_get_from_ino() outside of CONFIG_SHRINKER_DEBUG Roman Gushchin
2026-01-27  6:12   ` Yafang Shao
2026-02-02  3:50   ` Shakeel Butt
2026-01-27  2:44 ` [PATCH bpf-next v3 07/17] mm: introduce BPF OOM struct ops Roman Gushchin
2026-01-27  9:38   ` Michal Hocko
2026-01-27 21:12     ` Roman Gushchin
2026-01-28  8:00       ` Michal Hocko
2026-01-28 18:44         ` Roman Gushchin
2026-02-02  4:06       ` Matt Bobrowski
2026-01-28  3:26   ` Josh Don
2026-01-28 19:03     ` Roman Gushchin
2026-01-28 11:19   ` Michal Hocko
2026-01-28 18:53     ` Roman Gushchin
2026-01-29 21:00   ` Martin KaFai Lau
2026-01-30 23:29     ` Roman Gushchin
2026-02-02 20:27       ` Martin KaFai Lau
2026-01-27  2:44 ` [PATCH bpf-next v3 08/17] mm: introduce bpf_oom_kill_process() bpf kfunc Roman Gushchin
2026-01-27 20:21   ` Martin KaFai Lau
2026-01-27 20:47     ` Roman Gushchin
2026-02-02  4:49   ` Matt Bobrowski
2026-01-27  2:44 ` [PATCH bpf-next v3 09/17] mm: introduce bpf_out_of_memory() BPF kfunc Roman Gushchin
2026-01-28 20:21   ` Matt Bobrowski
2026-01-27  2:44 ` [PATCH bpf-next v3 10/17] mm: introduce bpf_task_is_oom_victim() kfunc Roman Gushchin
2026-02-02  5:39   ` Matt Bobrowski
2026-02-02 17:30     ` Alexei Starovoitov
2026-02-03  0:14       ` Roman Gushchin
2026-02-03 13:23         ` Michal Hocko
2026-02-03 16:31           ` Alexei Starovoitov
2026-02-04  9:02             ` Michal Hocko
2026-02-05  0:12               ` Alexei Starovoitov
2026-01-27  2:44 ` [PATCH bpf-next v3 11/17] bpf: selftests: introduce read_cgroup_file() helper Roman Gushchin
2026-01-27  3:08   ` bot+bpf-ci
2026-01-27  2:44 ` [PATCH bpf-next v3 12/17] bpf: selftests: BPF OOM struct ops test Roman Gushchin
2026-01-27  2:44 ` [PATCH bpf-next v3 13/17] sched: psi: add a trace point to psi_avgs_work() Roman Gushchin
2026-01-27  2:44 ` [PATCH bpf-next v3 14/17] sched: psi: add cgroup_id field to psi_group structure Roman Gushchin
2026-01-27  2:44 ` [PATCH bpf-next v3 15/17] bpf: allow calling bpf_out_of_memory() from a PSI tracepoint Roman Gushchin
2026-01-27  9:02 ` [PATCH bpf-next v3 00/17] mm: BPF OOM Michal Hocko
2026-01-27 21:01   ` Roman Gushchin
2026-01-28  8:06     ` Michal Hocko
2026-01-28 16:59       ` Alexei Starovoitov
2026-01-28 18:23         ` Roman Gushchin
2026-01-28 18:53           ` Alexei Starovoitov
2026-02-02  3:26         ` Matt Bobrowski
2026-02-02 17:50           ` Alexei Starovoitov
2026-02-04 23:52             ` Matt Bobrowski
