linux-mm.kvack.org archive mirror
* [PATCH rfc 00/12] mm: BPF OOM
@ 2025-04-28  3:36 Roman Gushchin
  2025-04-28  3:36 ` [PATCH rfc 01/12] mm: introduce a bpf hook for OOM handling Roman Gushchin
                   ` (14 more replies)
  0 siblings, 15 replies; 33+ messages in thread
From: Roman Gushchin @ 2025-04-28  3:36 UTC (permalink / raw)
  To: linux-kernel
  Cc: Andrew Morton, Alexei Starovoitov, Johannes Weiner, Michal Hocko,
	Shakeel Butt, Suren Baghdasaryan, David Rientjes, Josh Don,
	Chuyi Zhou, cgroups, linux-mm, bpf, Roman Gushchin

This patchset adds the ability to customize out-of-memory (OOM)
handling using bpf.

It focuses on two parts:
1) OOM handling policy,
2) PSI-based OOM invocation.

The idea of using bpf to customize OOM handling is not new, but
unlike the previous proposal [1], which augmented the existing task
ranking-based policy, this one tries to be as generic as possible and
leverage the full power of modern bpf.

It provides a generic hook which is called before the existing OOM
killer code and allows implementing any policy, e.g. picking a victim
task or memory cgroup, or potentially even releasing memory in other
ways, e.g. deleting tmpfs files (the last one might require some
additional but relatively simple changes).

The past attempt to implement a memory-cgroup-aware policy [2] showed
that there are multiple opinions on what the best policy is.  As it's
highly workload-dependent and specific to a concrete way of organizing
workloads, the structure of the cgroup tree etc., a customizable
bpf-based implementation is preferable to an in-kernel implementation
with a dozen sysctls.

The second part is related to the fundamental question of when to
declare an OOM event. It's a trade-off between the risk of unnecessary
OOM kills with the associated loss of work and the risk of infinite
thrashing and effective soft lockups.  In the last few years several
PSI-based userspace solutions were developed (e.g. OOMd [3] or
systemd-oomd [4]). The common idea was to use userspace daemons to
implement custom OOM logic and to rely on PSI monitoring to avoid
stalls. In this scenario the userspace daemon was supposed to handle
the majority of OOMs, while the in-kernel OOM killer worked as the
last-resort measure to guarantee that the system would never deadlock
on memory. But this approach creates additional infrastructure churn:
a userspace OOM daemon is a separate entity which needs to be
deployed, updated, and monitored. A completely different pipeline
needs to be built to monitor both types of OOM events and collect the
associated logs. A userspace daemon is also more restricted in terms
of what data is available to it. And implementing a daemon which can
work reliably under heavy memory pressure is tricky.

[1]: https://lwn.net/ml/linux-kernel/20230810081319.65668-1-zhouchuyi@bytedance.com/
[2]: https://lore.kernel.org/lkml/20171130152824.1591-1-guro@fb.com/
[3]: https://github.com/facebookincubator/oomd
[4]: https://www.freedesktop.org/software/systemd/man/latest/systemd-oomd.service.html

----

This is an RFC version, which is not intended to be merged in its current form.
Open questions/TODOs:
1) Program type/attachment type for the bpf_handle_out_of_memory() hook.
   It has to be able to return a value, to be sleepable (to use cgroup iterators)
   and to have trusted arguments to pass oom_control down to bpf_oom_kill_process().
   The current patchset has a workaround (patch "bpf: treat fmodret tracing program's
   arguments as trusted"), which is not safe. One option is to fake acquire/release
   semantics for the oom_control pointer. Another option is to introduce a completely
   new attachment or program type, similar to lsm hooks.
2) Currently lockdep complains about a potential circular dependency because the
   sleepable bpf_handle_out_of_memory() hook calls might_fault() under oom_lock.
   One way to fix it is to make the hook non-sleepable, but then it will require
   some additional work to allow it to use cgroup iterators. This is intertwined
   with 1).
3) What kind of hierarchical features are required? Do we want to nest oom policies?
   Do we want to attach oom policies to cgroups? I think it's too complicated,
   but if we want full hierarchical support, it might be required.
   The patch "mm: introduce bpf_get_root_mem_cgroup() bpf kfunc" exposes the true
   root memcg, which is potentially outside of the namespace of the loading process.
   Does it require some additional capability checks? Should it be removed?
4) Documentation is lacking and will be added in the next version.


Roman Gushchin (12):
  mm: introduce a bpf hook for OOM handling
  bpf: mark struct oom_control's memcg field as TRUSTED_OR_NULL
  bpf: treat fmodret tracing program's arguments as trusted
  mm: introduce bpf_oom_kill_process() bpf kfunc
  mm: introduce bpf kfuncs to deal with memcg pointers
  mm: introduce bpf_get_root_mem_cgroup() bpf kfunc
  bpf: selftests: introduce read_cgroup_file() helper
  bpf: selftests: bpf OOM handler test
  sched: psi: bpf hook to handle psi events
  mm: introduce bpf_out_of_memory() bpf kfunc
  bpf: selftests: introduce open_cgroup_file() helper
  bpf: selftests: psi handler test

 include/linux/memcontrol.h                   |   2 +
 include/linux/oom.h                          |   5 +
 kernel/bpf/btf.c                             |   9 +-
 kernel/bpf/verifier.c                        |   5 +
 kernel/sched/psi.c                           |  36 ++-
 mm/Makefile                                  |   3 +
 mm/bpf_memcontrol.c                          | 108 +++++++++
 mm/oom_kill.c                                | 140 +++++++++++
 tools/testing/selftests/bpf/cgroup_helpers.c |  67 ++++++
 tools/testing/selftests/bpf/cgroup_helpers.h |   3 +
 tools/testing/selftests/bpf/prog_tests/oom.c | 227 ++++++++++++++++++
 tools/testing/selftests/bpf/prog_tests/psi.c | 234 +++++++++++++++++++
 tools/testing/selftests/bpf/progs/test_oom.c | 103 ++++++++
 tools/testing/selftests/bpf/progs/test_psi.c |  43 ++++
 14 files changed, 983 insertions(+), 2 deletions(-)
 create mode 100644 mm/bpf_memcontrol.c
 create mode 100644 tools/testing/selftests/bpf/prog_tests/oom.c
 create mode 100644 tools/testing/selftests/bpf/prog_tests/psi.c
 create mode 100644 tools/testing/selftests/bpf/progs/test_oom.c
 create mode 100644 tools/testing/selftests/bpf/progs/test_psi.c

-- 
2.49.0.901.g37484f566f-goog




* [PATCH rfc 01/12] mm: introduce a bpf hook for OOM handling
  2025-04-28  3:36 [PATCH rfc 00/12] mm: BPF OOM Roman Gushchin
@ 2025-04-28  3:36 ` Roman Gushchin
  2025-04-28  3:36 ` [PATCH rfc 02/12] bpf: mark struct oom_control's memcg field as TRUSTED_OR_NULL Roman Gushchin
                   ` (13 subsequent siblings)
  14 siblings, 0 replies; 33+ messages in thread
From: Roman Gushchin @ 2025-04-28  3:36 UTC (permalink / raw)
  To: linux-kernel
  Cc: Andrew Morton, Alexei Starovoitov, Johannes Weiner, Michal Hocko,
	Shakeel Butt, Suren Baghdasaryan, David Rientjes, Josh Don,
	Chuyi Zhou, cgroups, linux-mm, bpf, Roman Gushchin

Introduce a bpf hook for implementing custom OOM handling policies.

The hook is the int bpf_handle_out_of_memory(struct oom_control *oc)
function, which is expected to return 1 if it was able to free some
memory and 0 otherwise. In the latter case it's guaranteed that
the in-kernel OOM killer will be invoked. Otherwise the kernel
also checks the bpf_memory_freed field of the oom_control structure,
which is expected to be set by kfuncs suitable for releasing memory.
It's a safety mechanism which prevents a bpf program from claiming
forward progress without actually releasing memory.

The hook program is sleepable to enable the use of iterators, e.g.
cgroup iterators.

The hook is executed just before the kernel victim task selection
algorithm, so all heuristics and sysctls like panic_on_oom and
sysctl_oom_kill_allocating_task are respected.
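
For illustration, a minimal handler could look like this (a sketch
based on the selftest added later in this series; the program name is
arbitrary):

  SEC("fmod_ret.s/bpf_handle_out_of_memory")
  int BPF_PROG(my_oom_handler, struct oom_control *oc)
  {
          /*
           * Nothing was freed here, so return 0: the in-kernel
           * OOM killer is then guaranteed to be invoked.
           */
          return 0;
  }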

Signed-off-by: Roman Gushchin <roman.gushchin@linux.dev>
---
 include/linux/oom.h |  5 ++++
 mm/oom_kill.c       | 68 +++++++++++++++++++++++++++++++++++++++++++++
 2 files changed, 73 insertions(+)

diff --git a/include/linux/oom.h b/include/linux/oom.h
index 1e0fc6931ce9..cc14aac9742c 100644
--- a/include/linux/oom.h
+++ b/include/linux/oom.h
@@ -51,6 +51,11 @@ struct oom_control {
 
 	/* Used to print the constraint info. */
 	enum oom_constraint constraint;
+
+#ifdef CONFIG_BPF_SYSCALL
+	/* Used by the bpf oom implementation to mark the forward progress */
+	bool bpf_memory_freed;
+#endif
 };
 
 extern struct mutex oom_lock;
diff --git a/mm/oom_kill.c b/mm/oom_kill.c
index 25923cfec9c6..d00776b63c0a 100644
--- a/mm/oom_kill.c
+++ b/mm/oom_kill.c
@@ -45,6 +45,7 @@
 #include <linux/mmu_notifier.h>
 #include <linux/cred.h>
 #include <linux/nmi.h>
+#include <linux/bpf.h>
 
 #include <asm/tlb.h>
 #include "internal.h"
@@ -1100,6 +1101,30 @@ int unregister_oom_notifier(struct notifier_block *nb)
 }
 EXPORT_SYMBOL_GPL(unregister_oom_notifier);
 
+#ifdef CONFIG_BPF_SYSCALL
+int bpf_handle_out_of_memory(struct oom_control *oc);
+
+/*
+ * Returns true if the bpf oom program returns 1 and some memory was
+ * freed.
+ */
+static bool bpf_handle_oom(struct oom_control *oc)
+{
+	if (WARN_ON_ONCE(oc->chosen))
+		oc->chosen = NULL;
+
+	oc->bpf_memory_freed = false;
+
+	return bpf_handle_out_of_memory(oc) && oc->bpf_memory_freed;
+}
+
+#else
+static inline bool bpf_handle_oom(struct oom_control *oc)
+{
+	return false;
+}
+#endif
+
 /**
  * out_of_memory - kill the "best" process when we run out of memory
  * @oc: pointer to struct oom_control
@@ -1161,6 +1186,13 @@ bool out_of_memory(struct oom_control *oc)
 		return true;
 	}
 
+	/*
+	 * Let bpf handle the OOM first. If it was able to free up some memory,
+	 * bail out. Otherwise fall back to the kernel OOM killer.
+	 */
+	if (bpf_handle_oom(oc))
+		return true;
+
 	select_bad_process(oc);
 	/* Found nothing?!?! */
 	if (!oc->chosen) {
@@ -1264,3 +1296,39 @@ SYSCALL_DEFINE2(process_mrelease, int, pidfd, unsigned int, flags)
 	return -ENOSYS;
 #endif /* CONFIG_MMU */
 }
+
+#ifdef CONFIG_BPF_SYSCALL
+
+__bpf_hook_start();
+
+/*
+ * Bpf hook to customize the oom handling policy.
+ */
+__weak noinline int bpf_handle_out_of_memory(struct oom_control *oc)
+{
+	return 0;
+}
+
+__bpf_hook_end();
+
+BTF_KFUNCS_START(bpf_oom_hooks)
+BTF_ID_FLAGS(func, bpf_handle_out_of_memory, KF_SLEEPABLE)
+BTF_KFUNCS_END(bpf_oom_hooks)
+
+static const struct btf_kfunc_id_set bpf_oom_hook_set = {
+	.owner = THIS_MODULE,
+	.set   = &bpf_oom_hooks,
+};
+static int __init bpf_oom_init(void)
+{
+	int err;
+
+	err = register_btf_fmodret_id_set(&bpf_oom_hook_set);
+	if (err)
+		pr_warn("error while registering bpf oom hooks: %d\n", err);
+
+	return err;
+}
+late_initcall(bpf_oom_init);
+
+#endif
-- 
2.49.0.901.g37484f566f-goog




* [PATCH rfc 02/12] bpf: mark struct oom_control's memcg field as TRUSTED_OR_NULL
  2025-04-28  3:36 [PATCH rfc 00/12] mm: BPF OOM Roman Gushchin
  2025-04-28  3:36 ` [PATCH rfc 01/12] mm: introduce a bpf hook for OOM handling Roman Gushchin
@ 2025-04-28  3:36 ` Roman Gushchin
  2025-04-28  3:36 ` [PATCH rfc 03/12] bpf: treat fmodret tracing program's arguments as trusted Roman Gushchin
                   ` (12 subsequent siblings)
  14 siblings, 0 replies; 33+ messages in thread
From: Roman Gushchin @ 2025-04-28  3:36 UTC (permalink / raw)
  To: linux-kernel
  Cc: Andrew Morton, Alexei Starovoitov, Johannes Weiner, Michal Hocko,
	Shakeel Butt, Suren Baghdasaryan, David Rientjes, Josh Don,
	Chuyi Zhou, cgroups, linux-mm, bpf, Roman Gushchin

Struct oom_control is used to describe the OOM context.
Its memcg field defines the scope of the OOM: it's NULL for global
OOMs and a valid memcg pointer for memcg-scoped OOMs.
Teach the bpf verifier to recognize it as a trusted-or-NULL pointer.
This provides the bpf OOM handler with a trusted memcg pointer,
which is, for example, required for iterating over the memcg's subtree.
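
For example, a handler can distinguish the two scopes like this
(a sketch mirroring the selftest added later in this series, which
also introduces the bpf_get_mem_cgroup() and bpf_get_root_mem_cgroup()
kfuncs used here):

  struct mem_cgroup *root_memcg = oc->memcg;

  if (root_memcg)
          /* memcg-scoped OOM: oc->memcg is a trusted pointer */
          root_memcg = bpf_get_mem_cgroup(&root_memcg->css);
  else
          /* global OOM */
          root_memcg = bpf_get_root_mem_cgroup();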

Signed-off-by: Roman Gushchin <roman.gushchin@linux.dev>
---
 kernel/bpf/verifier.c | 5 +++++
 1 file changed, 5 insertions(+)

diff --git a/kernel/bpf/verifier.c b/kernel/bpf/verifier.c
index 54c6953a8b84..d2d9f9b87065 100644
--- a/kernel/bpf/verifier.c
+++ b/kernel/bpf/verifier.c
@@ -7047,6 +7047,10 @@ BTF_TYPE_SAFE_TRUSTED_OR_NULL(struct socket) {
 	struct sock *sk;
 };
 
+BTF_TYPE_SAFE_TRUSTED_OR_NULL(struct oom_control) {
+	struct mem_cgroup *memcg;
+};
+
 static bool type_is_rcu(struct bpf_verifier_env *env,
 			struct bpf_reg_state *reg,
 			const char *field_name, u32 btf_id)
@@ -7087,6 +7091,7 @@ static bool type_is_trusted_or_null(struct bpf_verifier_env *env,
 				    const char *field_name, u32 btf_id)
 {
 	BTF_TYPE_EMIT(BTF_TYPE_SAFE_TRUSTED_OR_NULL(struct socket));
+	BTF_TYPE_EMIT(BTF_TYPE_SAFE_TRUSTED_OR_NULL(struct oom_control));
 
 	return btf_nested_type_is_trusted(&env->log, reg, field_name, btf_id,
 					  "__safe_trusted_or_null");
-- 
2.49.0.901.g37484f566f-goog




* [PATCH rfc 03/12] bpf: treat fmodret tracing program's arguments as trusted
  2025-04-28  3:36 [PATCH rfc 00/12] mm: BPF OOM Roman Gushchin
  2025-04-28  3:36 ` [PATCH rfc 01/12] mm: introduce a bpf hook for OOM handling Roman Gushchin
  2025-04-28  3:36 ` [PATCH rfc 02/12] bpf: mark struct oom_control's memcg field as TRUSTED_OR_NULL Roman Gushchin
@ 2025-04-28  3:36 ` Roman Gushchin
  2025-04-28  3:36 ` [PATCH rfc 04/12] mm: introduce bpf_oom_kill_process() bpf kfunc Roman Gushchin
                   ` (11 subsequent siblings)
  14 siblings, 0 replies; 33+ messages in thread
From: Roman Gushchin @ 2025-04-28  3:36 UTC (permalink / raw)
  To: linux-kernel
  Cc: Andrew Morton, Alexei Starovoitov, Johannes Weiner, Michal Hocko,
	Shakeel Butt, Suren Baghdasaryan, David Rientjes, Josh Don,
	Chuyi Zhou, cgroups, linux-mm, bpf, Roman Gushchin

*** DO NOT MERGE! ***

This is a temporary workaround, which will be fixed/replaced
in the next version.

--

The bpf oom handler hook has to:
1) have a trusted pointer to the oom_control structure,
2) return a value,
3) be sleepable to use cgroup iterator functions.

fmodret tracing programs fulfill 2) and 3).
This patch enables 1); however, this change contradicts
commit c6b0337f0120 ("bpf: Don't mark arguments to fentry/fexit
programs as trusted.").

Signed-off-by: Roman Gushchin <roman.gushchin@linux.dev>
---
 kernel/bpf/btf.c | 9 ++++++++-
 1 file changed, 8 insertions(+), 1 deletion(-)

diff --git a/kernel/bpf/btf.c b/kernel/bpf/btf.c
index a91822bae043..aa86c4eabfa0 100644
--- a/kernel/bpf/btf.c
+++ b/kernel/bpf/btf.c
@@ -6424,7 +6424,14 @@ static bool prog_args_trusted(const struct bpf_prog *prog)
 
 	switch (prog->type) {
 	case BPF_PROG_TYPE_TRACING:
-		return atype == BPF_TRACE_RAW_TP || atype == BPF_TRACE_ITER;
+		switch (atype) {
+		case BPF_TRACE_RAW_TP:
+		case BPF_TRACE_ITER:
+		case BPF_MODIFY_RETURN:
+			return true;
+		default:
+			return false;
+		}
 	case BPF_PROG_TYPE_LSM:
 		return bpf_lsm_is_trusted(prog);
 	case BPF_PROG_TYPE_STRUCT_OPS:
-- 
2.49.0.901.g37484f566f-goog




* [PATCH rfc 04/12] mm: introduce bpf_oom_kill_process() bpf kfunc
  2025-04-28  3:36 [PATCH rfc 00/12] mm: BPF OOM Roman Gushchin
                   ` (2 preceding siblings ...)
  2025-04-28  3:36 ` [PATCH rfc 03/12] bpf: treat fmodret tracing program's arguments as trusted Roman Gushchin
@ 2025-04-28  3:36 ` Roman Gushchin
  2025-04-28  3:36 ` [PATCH rfc 05/12] mm: introduce bpf kfuncs to deal with memcg pointers Roman Gushchin
                   ` (10 subsequent siblings)
  14 siblings, 0 replies; 33+ messages in thread
From: Roman Gushchin @ 2025-04-28  3:36 UTC (permalink / raw)
  To: linux-kernel
  Cc: Andrew Morton, Alexei Starovoitov, Johannes Weiner, Michal Hocko,
	Shakeel Butt, Suren Baghdasaryan, David Rientjes, Josh Don,
	Chuyi Zhou, cgroups, linux-mm, bpf, Roman Gushchin

Introduce the bpf_oom_kill_process() bpf kfunc, which is supposed
to be used by bpf OOM programs. It allows killing a process
in exactly the same way the OOM killer does: using the OOM reaper,
bumping the corresponding memcg and global statistics, respecting
memory.oom.group etc.

On success, it sets oom_control's bpf_memory_freed field to true,
enabling the bpf program to bypass the kernel OOM killer.
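
A bpf OOM program is expected to pass an acquired task pointer, e.g.
(a sketch following the OOM selftest added later in this series, with
the victim memcg already chosen):

  bpf_for_each(css_task, task, &victim->css, CSS_TASK_ITER_PROCS) {
          struct task_struct *t = bpf_task_acquire(task);

          if (t) {
                  bpf_oom_kill_process(oc, t, "bpf oom test");
                  bpf_task_release(t);
          }
  }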

Signed-off-by: Roman Gushchin <roman.gushchin@linux.dev>
---
 mm/oom_kill.c | 50 ++++++++++++++++++++++++++++++++++++++++++++++++++
 1 file changed, 50 insertions(+)

diff --git a/mm/oom_kill.c b/mm/oom_kill.c
index d00776b63c0a..2e922e75a9df 100644
--- a/mm/oom_kill.c
+++ b/mm/oom_kill.c
@@ -1299,6 +1299,42 @@ SYSCALL_DEFINE2(process_mrelease, int, pidfd, unsigned int, flags)
 
 #ifdef CONFIG_BPF_SYSCALL
 
+__bpf_kfunc_start_defs();
+/*
+ * Kill a process in a way similar to the kernel OOM killer.
+ * This means dump the necessary information to dmesg, adjust memcg
+ * statistics, leverage the oom reaper, respect memory.oom.group etc.
+ *
+ * bpf_oom_kill_process() marks the forward progress by setting
+ * oc->bpf_memory_freed. If the progress was made, the bpf program
+ * is free to decide if the kernel oom killer should be invoked.
+ * Otherwise it's enforced, so that a bad bpf program can't
+ * deadlock the machine on memory.
+ */
+__bpf_kfunc int bpf_oom_kill_process(struct oom_control *oc,
+				     struct task_struct *task,
+				     const char *message__str)
+{
+	if (oom_unkillable_task(task))
+		return -EPERM;
+
+	/* paired with put_task_struct() in oom_kill_process() */
+	task = tryget_task_struct(task);
+	if (!task)
+		return -EINVAL;
+
+	oc->chosen = task;
+
+	oom_kill_process(oc, message__str);
+
+	oc->chosen = NULL;
+	oc->bpf_memory_freed = true;
+
+	return 0;
+}
+
+__bpf_kfunc_end_defs();
+
 __bpf_hook_start();
 
 /*
@@ -1319,6 +1355,16 @@ static const struct btf_kfunc_id_set bpf_oom_hook_set = {
 	.owner = THIS_MODULE,
 	.set   = &bpf_oom_hooks,
 };
+
+BTF_KFUNCS_START(bpf_oom_kfuncs)
+BTF_ID_FLAGS(func, bpf_oom_kill_process, KF_SLEEPABLE | KF_TRUSTED_ARGS)
+BTF_KFUNCS_END(bpf_oom_kfuncs)
+
+static const struct btf_kfunc_id_set bpf_oom_kfunc_set = {
+	.owner          = THIS_MODULE,
+	.set            = &bpf_oom_kfuncs,
+};
+
 static int __init bpf_oom_init(void)
 {
 	int err;
@@ -1326,6 +1372,10 @@ static int __init bpf_oom_init(void)
 	err = register_btf_fmodret_id_set(&bpf_oom_hook_set);
 	if (err)
 		pr_warn("error while registering bpf oom hooks: %d\n", err);
+	err = register_btf_kfunc_id_set(BPF_PROG_TYPE_TRACING,
+					&bpf_oom_kfunc_set);
+	if (err)
+		pr_warn("error while registering bpf oom kfuncs: %d\n", err);
 
 	return err;
 }
-- 
2.49.0.901.g37484f566f-goog




* [PATCH rfc 05/12] mm: introduce bpf kfuncs to deal with memcg pointers
  2025-04-28  3:36 [PATCH rfc 00/12] mm: BPF OOM Roman Gushchin
                   ` (3 preceding siblings ...)
  2025-04-28  3:36 ` [PATCH rfc 04/12] mm: introduce bpf_oom_kill_process() bpf kfunc Roman Gushchin
@ 2025-04-28  3:36 ` Roman Gushchin
  2025-04-28  3:36 ` [PATCH rfc 06/12] mm: introduce bpf_get_root_mem_cgroup() bpf kfunc Roman Gushchin
                   ` (9 subsequent siblings)
  14 siblings, 0 replies; 33+ messages in thread
From: Roman Gushchin @ 2025-04-28  3:36 UTC (permalink / raw)
  To: linux-kernel
  Cc: Andrew Morton, Alexei Starovoitov, Johannes Weiner, Michal Hocko,
	Shakeel Butt, Suren Baghdasaryan, David Rientjes, Josh Don,
	Chuyi Zhou, cgroups, linux-mm, bpf, Roman Gushchin

To effectively operate with memory cgroups in bpf there is a need
to convert css pointers to memcg pointers. The simple container_of
cast which is used in the kernel code can't be used in bpf, because
from the verifier's point of view it's an out-of-bounds memory access.

Introduce helper get/put kfuncs which can be used to get
a refcounted memcg pointer from a css pointer:
  - bpf_get_mem_cgroup,
  - bpf_put_mem_cgroup.

bpf_get_mem_cgroup() can take both a memcg's css and the corresponding
cgroup's "self" css. This allows it to be used with the existing cgroup
iterator, which iterates over the cgroup tree, not the memcg tree.
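
For example (a sketch based on the OOM selftest added later in this
series), a handler can walk the memcg tree with the generic cgroup
iterator:

  /* under bpf_rcu_read_lock(), as in the selftest */
  bpf_for_each(css, css_pos, &root_memcg->css, BPF_CGROUP_ITER_DESCENDANTS_POST) {
          struct mem_cgroup *memcg = bpf_get_mem_cgroup(css_pos);

          if (!memcg)
                  continue;

          /* ... read statistics, remember a potential victim, etc. ... */

          bpf_put_mem_cgroup(memcg);
  }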

Signed-off-by: Roman Gushchin <roman.gushchin@linux.dev>
---
 include/linux/memcontrol.h |   2 +
 mm/Makefile                |   3 ++
 mm/bpf_memcontrol.c        | 101 +++++++++++++++++++++++++++++++++++++
 3 files changed, 106 insertions(+)
 create mode 100644 mm/bpf_memcontrol.c

diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h
index 53364526d877..a2ecd9caacfb 100644
--- a/include/linux/memcontrol.h
+++ b/include/linux/memcontrol.h
@@ -932,6 +932,8 @@ static inline void mod_memcg_page_state(struct page *page,
 	rcu_read_unlock();
 }
 
+unsigned long memcg_events(struct mem_cgroup *memcg, int event);
+unsigned long mem_cgroup_usage(struct mem_cgroup *memcg, bool swap);
 unsigned long memcg_page_state(struct mem_cgroup *memcg, int idx);
 unsigned long lruvec_page_state(struct lruvec *lruvec, enum node_stat_item idx);
 unsigned long lruvec_page_state_local(struct lruvec *lruvec,
diff --git a/mm/Makefile b/mm/Makefile
index e7f6bbf8ae5f..3eedba68e8cb 100644
--- a/mm/Makefile
+++ b/mm/Makefile
@@ -105,6 +105,9 @@ obj-$(CONFIG_MEMCG) += memcontrol.o vmpressure.o
 ifdef CONFIG_SWAP
 obj-$(CONFIG_MEMCG) += swap_cgroup.o
 endif
+ifdef CONFIG_BPF_SYSCALL
+obj-$(CONFIG_MEMCG) += bpf_memcontrol.o
+endif
 obj-$(CONFIG_CGROUP_HUGETLB) += hugetlb_cgroup.o
 obj-$(CONFIG_GUP_TEST) += gup_test.o
 obj-$(CONFIG_DMAPOOL_TEST) += dmapool_test.o
diff --git a/mm/bpf_memcontrol.c b/mm/bpf_memcontrol.c
new file mode 100644
index 000000000000..dacdf53735e5
--- /dev/null
+++ b/mm/bpf_memcontrol.c
@@ -0,0 +1,101 @@
+// SPDX-License-Identifier: GPL-2.0-or-later
+/*
+ * Memory Controller-related BPF kfuncs and auxiliary code
+ *
+ * Author: Roman Gushchin <roman.gushchin@linux.dev>
+ */
+
+#include <linux/memcontrol.h>
+#include <linux/bpf.h>
+
+__bpf_kfunc_start_defs();
+
+__bpf_kfunc struct mem_cgroup *
+bpf_get_mem_cgroup(struct cgroup_subsys_state *css)
+{
+	struct mem_cgroup *memcg = NULL;
+	bool rcu_unlock = false;
+
+	if (!root_mem_cgroup)
+		return NULL;
+
+	if (root_mem_cgroup->css.ss != css->ss) {
+		struct cgroup *cgroup = css->cgroup;
+		int ssid = root_mem_cgroup->css.ss->id;
+
+		rcu_read_lock();
+		rcu_unlock = true;
+		css = rcu_dereference_raw(cgroup->subsys[ssid]);
+	}
+
+	if (css && css_tryget(css))
+		memcg = container_of(css, struct mem_cgroup, css);
+
+	if (rcu_unlock)
+		rcu_read_unlock();
+
+	return memcg;
+}
+
+__bpf_kfunc void bpf_put_mem_cgroup(struct mem_cgroup *memcg)
+{
+	css_put(&memcg->css);
+}
+
+__bpf_kfunc unsigned long bpf_mem_cgroup_events(struct mem_cgroup *memcg, int event)
+{
+	if (event < 0 || event >= NR_VM_EVENT_ITEMS)
+		return (unsigned long)-1;
+
+	return memcg_events(memcg, event);
+}
+
+__bpf_kfunc unsigned long bpf_mem_cgroup_usage(struct mem_cgroup *memcg, bool swap)
+{
+	return mem_cgroup_usage(memcg, swap);
+}
+
+__bpf_kfunc unsigned long bpf_mem_cgroup_page_state(struct mem_cgroup *memcg, int idx)
+{
+	if (idx < 0 || idx >= MEMCG_NR_STAT)
+		return (unsigned long)-1;
+
+	return memcg_page_state(memcg, idx);
+}
+
+__bpf_kfunc void bpf_mem_cgroup_flush_stats(struct mem_cgroup *memcg)
+{
+	mem_cgroup_flush_stats(memcg);
+}
+
+__bpf_kfunc_end_defs();
+
+BTF_KFUNCS_START(bpf_memcontrol_kfuncs)
+BTF_ID_FLAGS(func, bpf_get_mem_cgroup, KF_ACQUIRE | KF_RET_NULL)
+BTF_ID_FLAGS(func, bpf_put_mem_cgroup, KF_RELEASE)
+
+BTF_ID_FLAGS(func, bpf_mem_cgroup_events, KF_TRUSTED_ARGS)
+BTF_ID_FLAGS(func, bpf_mem_cgroup_usage, KF_TRUSTED_ARGS)
+BTF_ID_FLAGS(func, bpf_mem_cgroup_page_state, KF_TRUSTED_ARGS)
+BTF_ID_FLAGS(func, bpf_mem_cgroup_flush_stats, KF_TRUSTED_ARGS)
+
+BTF_KFUNCS_END(bpf_memcontrol_kfuncs)
+
+static const struct btf_kfunc_id_set bpf_memcontrol_kfunc_set = {
+	.owner          = THIS_MODULE,
+	.set            = &bpf_memcontrol_kfuncs,
+};
+
+static int __init bpf_memcontrol_init(void)
+{
+	int err;
+
+	err = register_btf_kfunc_id_set(BPF_PROG_TYPE_TRACING,
+					&bpf_memcontrol_kfunc_set);
+	if (err)
+		pr_warn("error while registering bpf memcontrol kfuncs: %d\n", err);
+
+	return err;
+}
+late_initcall(bpf_memcontrol_init);
-- 
2.49.0.901.g37484f566f-goog




* [PATCH rfc 06/12] mm: introduce bpf_get_root_mem_cgroup() bpf kfunc
  2025-04-28  3:36 [PATCH rfc 00/12] mm: BPF OOM Roman Gushchin
                   ` (4 preceding siblings ...)
  2025-04-28  3:36 ` [PATCH rfc 05/12] mm: introduce bpf kfuncs to deal with memcg pointers Roman Gushchin
@ 2025-04-28  3:36 ` Roman Gushchin
  2025-04-28  3:36 ` [PATCH rfc 07/12] bpf: selftests: introduce read_cgroup_file() helper Roman Gushchin
                   ` (8 subsequent siblings)
  14 siblings, 0 replies; 33+ messages in thread
From: Roman Gushchin @ 2025-04-28  3:36 UTC (permalink / raw)
  To: linux-kernel
  Cc: Andrew Morton, Alexei Starovoitov, Johannes Weiner, Michal Hocko,
	Shakeel Butt, Suren Baghdasaryan, David Rientjes, Josh Don,
	Chuyi Zhou, cgroups, linux-mm, bpf, Roman Gushchin

Introduce a bpf kfunc to get a trusted pointer to the root memory
cgroup. It's very handy for traversing the full memcg tree, e.g.
for handling a system-wide OOM.

It's possible to obtain this pointer by traversing the memcg tree
up from any known memcg, but that's sub-optimal and makes bpf
programs more complex and less efficient.

bpf_get_root_mem_cgroup() has KF_ACQUIRE | KF_RET_NULL semantics;
however, in reality it's not necessary to bump the corresponding
reference counter: the root memory cgroup is immortal and reference
counting for it is skipped, see css_get(). Once set, root_mem_cgroup
is always a valid memcg pointer. It's safe to call bpf_put_mem_cgroup()
on a pointer obtained with bpf_get_root_mem_cgroup(); it's effectively
a no-op.
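
Usage is symmetric with the other memcg kfuncs, e.g. (sketch):

  struct mem_cgroup *root_memcg = bpf_get_root_mem_cgroup();

  if (!root_memcg)
          return 0;

  /* ... traverse the memcg tree starting from the root ... */

  bpf_put_mem_cgroup(root_memcg);	/* effectively a no-op */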

Signed-off-by: Roman Gushchin <roman.gushchin@linux.dev>
---
 mm/bpf_memcontrol.c | 7 +++++++
 1 file changed, 7 insertions(+)

diff --git a/mm/bpf_memcontrol.c b/mm/bpf_memcontrol.c
index dacdf53735e5..94bc6c17d80b 100644
--- a/mm/bpf_memcontrol.c
+++ b/mm/bpf_memcontrol.c
@@ -10,6 +10,12 @@
 
 __bpf_kfunc_start_defs();
 
+__bpf_kfunc struct mem_cgroup *bpf_get_root_mem_cgroup(void)
+{
+	/* css_get() is not needed */
+	return root_mem_cgroup;
+}
+
 __bpf_kfunc struct mem_cgroup *
 bpf_get_mem_cgroup(struct cgroup_subsys_state *css)
 {
@@ -72,6 +78,7 @@ __bpf_kfunc void bpf_mem_cgroup_flush_stats(struct mem_cgroup *memcg)
 __bpf_kfunc_end_defs();
 
 BTF_KFUNCS_START(bpf_memcontrol_kfuncs)
+BTF_ID_FLAGS(func, bpf_get_root_mem_cgroup, KF_ACQUIRE | KF_RET_NULL)
 BTF_ID_FLAGS(func, bpf_get_mem_cgroup, KF_ACQUIRE | KF_RET_NULL)
 BTF_ID_FLAGS(func, bpf_put_mem_cgroup, KF_RELEASE)
 
-- 
2.49.0.901.g37484f566f-goog




* [PATCH rfc 07/12] bpf: selftests: introduce read_cgroup_file() helper
  2025-04-28  3:36 [PATCH rfc 00/12] mm: BPF OOM Roman Gushchin
                   ` (5 preceding siblings ...)
  2025-04-28  3:36 ` [PATCH rfc 06/12] mm: introduce bpf_get_root_mem_cgroup() bpf kfunc Roman Gushchin
@ 2025-04-28  3:36 ` Roman Gushchin
  2025-04-28  3:36 ` [PATCH rfc 08/12] bpf: selftests: bpf OOM handler test Roman Gushchin
                   ` (7 subsequent siblings)
  14 siblings, 0 replies; 33+ messages in thread
From: Roman Gushchin @ 2025-04-28  3:36 UTC (permalink / raw)
  To: linux-kernel
  Cc: Andrew Morton, Alexei Starovoitov, Johannes Weiner, Michal Hocko,
	Shakeel Butt, Suren Baghdasaryan, David Rientjes, Josh Don,
	Chuyi Zhou, cgroups, linux-mm, bpf, Roman Gushchin

Implement the read_cgroup_file() helper to read from cgroup control
files, e.g. statistics.
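
For example, a test can check a memcg's OOM statistics like this
(a sketch based on the OOM selftest added later in this series; the
cgroup path is arbitrary):

  char buf[4096] = {};
  ssize_t size;

  size = read_cgroup_file("/oom_test/cg1", "memory.events",
                          buf, sizeof(buf));
  if (size <= 0)
          return -1;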

Signed-off-by: Roman Gushchin <roman.gushchin@linux.dev>
---
 tools/testing/selftests/bpf/cgroup_helpers.c | 39 ++++++++++++++++++++
 tools/testing/selftests/bpf/cgroup_helpers.h |  2 +
 2 files changed, 41 insertions(+)

diff --git a/tools/testing/selftests/bpf/cgroup_helpers.c b/tools/testing/selftests/bpf/cgroup_helpers.c
index e4535451322e..3ffd4b764f91 100644
--- a/tools/testing/selftests/bpf/cgroup_helpers.c
+++ b/tools/testing/selftests/bpf/cgroup_helpers.c
@@ -125,6 +125,45 @@ int enable_controllers(const char *relative_path, const char *controllers)
 	return __enable_controllers(cgroup_path, controllers);
 }
 
+static ssize_t __read_cgroup_file(const char *cgroup_path, const char *file,
+				  char *buf, size_t size)
+{
+	char file_path[PATH_MAX + 1];
+	ssize_t ret;
+	int fd;
+
+	snprintf(file_path, sizeof(file_path), "%s/%s", cgroup_path, file);
+	fd = open(file_path, O_RDONLY);
+	if (fd < 0) {
+		log_err("Opening %s", file_path);
+		return -1;
+	}
+
+	ret = read(fd, buf, size);
+	close(fd);
+	return ret;
+}
+
+/**
+ * read_cgroup_file() - Read from a cgroup file
+ * @relative_path: The cgroup path, relative to the workdir
+ * @file: The name of the file in cgroupfs to read from
+ * @buf: Buffer to read the file contents into
+ * @size: Size of the buffer
+ *
+ * Read from a file in the given cgroup's directory.
+ *
+ * If successful, the number of bytes read is returned;
+ * on error, -1 is returned.
+ */
+ssize_t read_cgroup_file(const char *relative_path, const char *file,
+			 char *buf, size_t size)
+{
+	char cgroup_path[PATH_MAX - 24];
+
+	format_cgroup_path(cgroup_path, relative_path);
+	return __read_cgroup_file(cgroup_path, file, buf, size);
+}
+
 static int __write_cgroup_file(const char *cgroup_path, const char *file,
 			       const char *buf)
 {
diff --git a/tools/testing/selftests/bpf/cgroup_helpers.h b/tools/testing/selftests/bpf/cgroup_helpers.h
index 502845160d88..821cb76db1f7 100644
--- a/tools/testing/selftests/bpf/cgroup_helpers.h
+++ b/tools/testing/selftests/bpf/cgroup_helpers.h
@@ -11,6 +11,8 @@
 
 /* cgroupv2 related */
 int enable_controllers(const char *relative_path, const char *controllers);
+ssize_t read_cgroup_file(const char *relative_path, const char *file,
+			 char *buf, size_t size);
 int write_cgroup_file(const char *relative_path, const char *file,
 		      const char *buf);
 int write_cgroup_file_parent(const char *relative_path, const char *file,
-- 
2.49.0.901.g37484f566f-goog




* [PATCH rfc 08/12] bpf: selftests: bpf OOM handler test
  2025-04-28  3:36 [PATCH rfc 00/12] mm: BPF OOM Roman Gushchin
                   ` (6 preceding siblings ...)
  2025-04-28  3:36 ` [PATCH rfc 07/12] bpf: selftests: introduce read_cgroup_file() helper Roman Gushchin
@ 2025-04-28  3:36 ` Roman Gushchin
  2025-04-28  3:36 ` [PATCH rfc 09/12] sched: psi: bpf hook to handle psi events Roman Gushchin
                   ` (6 subsequent siblings)
  14 siblings, 0 replies; 33+ messages in thread
From: Roman Gushchin @ 2025-04-28  3:36 UTC (permalink / raw)
  To: linux-kernel
  Cc: Andrew Morton, Alexei Starovoitov, Johannes Weiner, Michal Hocko,
	Shakeel Butt, Suren Baghdasaryan, David Rientjes, Josh Don,
	Chuyi Zhou, cgroups, linux-mm, bpf, Roman Gushchin

Implement a pseudo-realistic test for the OOM handling
functionality.

The OOM handling policy implemented in bpf is to kill all
tasks belonging to the biggest leaf cgroup which doesn't
contain unkillable tasks (tasks with oom_score_adj set
to -1000). Page cache size is excluded from the accounting.

The test creates a hierarchy of memory cgroups, causes an
OOM at the top level, checks that the expected process is
killed and checks the memcg's oom statistics.

Signed-off-by: Roman Gushchin <roman.gushchin@linux.dev>
---
 tools/testing/selftests/bpf/prog_tests/oom.c | 227 +++++++++++++++++++
 tools/testing/selftests/bpf/progs/test_oom.c | 103 +++++++++
 2 files changed, 330 insertions(+)
 create mode 100644 tools/testing/selftests/bpf/prog_tests/oom.c
 create mode 100644 tools/testing/selftests/bpf/progs/test_oom.c

diff --git a/tools/testing/selftests/bpf/prog_tests/oom.c b/tools/testing/selftests/bpf/prog_tests/oom.c
new file mode 100644
index 000000000000..224c25334385
--- /dev/null
+++ b/tools/testing/selftests/bpf/prog_tests/oom.c
@@ -0,0 +1,227 @@
+// SPDX-License-Identifier: GPL-2.0-only
+#define _GNU_SOURCE
+
+#include <stdio.h>
+#include <fcntl.h>
+#include <unistd.h>
+#include <stdlib.h>
+#include <signal.h>
+#include <sys/stat.h>
+#include <test_progs.h>
+#include <bpf/btf.h>
+#include <bpf/bpf.h>
+
+#include "cgroup_helpers.h"
+#include "test_oom.skel.h"
+
+struct cgroup_desc {
+	const char *path;
+	int fd;
+	unsigned long long id;
+	int pid;
+	size_t target;
+	size_t max;
+	int oom_score_adj;
+	bool victim;
+};
+
+#define MB (1024 * 1024)
+#define OOM_SCORE_ADJ_MIN	(-1000)
+#define OOM_SCORE_ADJ_MAX	1000
+
+static struct cgroup_desc cgroups[] = {
+	{ .path = "/oom_test", .max = 80 * MB},
+	{ .path = "/oom_test/cg1", .target = 10 * MB,
+	  .oom_score_adj = OOM_SCORE_ADJ_MAX },
+	{ .path = "/oom_test/cg2", .target = 40 * MB,
+	  .oom_score_adj = OOM_SCORE_ADJ_MIN },
+	{ .path = "/oom_test/cg3" },
+	{ .path = "/oom_test/cg3/cg4", .target = 30 * MB,
+	  .victim = true },
+	{ .path = "/oom_test/cg3/cg5", .target = 20 * MB },
+};
+
+static int spawn_task(struct cgroup_desc *desc)
+{
+	char *ptr;
+	int pid;
+
+	pid = fork();
+	if (pid < 0)
+		return pid;
+
+	if (pid > 0) {
+		/* parent */
+		desc->pid = pid;
+		return 0;
+	}
+
+	/* child */
+	if (desc->oom_score_adj) {
+		char buf[64];
+		int fd = open("/proc/self/oom_score_adj", O_WRONLY);
+
+		if (fd < 0)
+			return -1;
+
+		snprintf(buf, sizeof(buf), "%d", desc->oom_score_adj);
+		write(fd, buf, strlen(buf));
+		close(fd);
+	}
+
+	ptr = (char *)malloc(desc->target);
+	if (!ptr)
+		return -ENOMEM;
+
+	memset(ptr, 'a', desc->target);
+
+	while (1)
+		sleep(1000);
+
+	return 0;
+}
+
+static void setup_environment(void)
+{
+	int i, err;
+
+	err = setup_cgroup_environment();
+	if (!ASSERT_OK(err, "setup_cgroup_environment"))
+		goto cleanup;
+
+	for (i = 0; i < ARRAY_SIZE(cgroups); i++) {
+		cgroups[i].fd = create_and_get_cgroup(cgroups[i].path);
+		if (!ASSERT_GE(cgroups[i].fd, 0, "create_and_get_cgroup"))
+			goto cleanup;
+
+		cgroups[i].id = get_cgroup_id(cgroups[i].path);
+		if (!ASSERT_GT(cgroups[i].id, 0, "get_cgroup_id"))
+			goto cleanup;
+
+		if (i == 0) {
+			/* Freeze the top-level cgroup */
+			err = write_cgroup_file(cgroups[i].path, "cgroup.freeze", "1");
+			if (!ASSERT_OK(err, "freeze cgroup"))
+				goto cleanup;
+		}
+
+		if (!cgroups[i].target) {
+			/* Recursively enable the memory controller */
+			err = write_cgroup_file(cgroups[i].path, "cgroup.subtree_control",
+						"+memory");
+			if (!ASSERT_OK(err, "enable memory controller"))
+				goto cleanup;
+		}
+
+		if (cgroups[i].max) {
+			char buf[256];
+
+			snprintf(buf, sizeof(buf), "%lu", cgroups[i].max);
+			err = write_cgroup_file(cgroups[i].path, "memory.max", buf);
+			if (!ASSERT_OK(err, "set memory.max"))
+				goto cleanup;
+
+			snprintf(buf, sizeof(buf), "0");
+			write_cgroup_file(cgroups[i].path, "memory.swap.max", buf);
+
+		}
+
+		if (cgroups[i].target) {
+			char buf[256];
+
+			err = spawn_task(&cgroups[i]);
+			if (!ASSERT_OK(err, "spawn task"))
+				goto cleanup;
+
+			snprintf(buf, sizeof(buf), "%d", cgroups[i].pid);
+			err = write_cgroup_file(cgroups[i].path, "cgroup.procs", buf);
+			if (!ASSERT_OK(err, "put child into a cgroup"))
+				goto cleanup;
+		}
+	}
+
+	return;
+
+cleanup:
+	cleanup_cgroup_environment();
+}
+
+static int run_and_wait_for_oom(void)
+{
+	int ret = -1;
+	bool first = true;
+	char buf[4096] = {};
+	ssize_t size;
+
+	ret = write_cgroup_file(cgroups[0].path, "cgroup.freeze", "0");
+	if (!ASSERT_OK(ret, "freeze cgroup"))
+		return -1;
+
+	for (;;) {
+		int i, status;
+		pid_t pid = wait(&status);
+
+		if (pid == -1) {
+			if (errno == EINTR)
+				continue;
+			/* ECHILD */
+			break;
+		}
+
+		if (!first)
+			continue;
+
+		first = false;
+
+		for (i = 0; i < ARRAY_SIZE(cgroups); i++) {
+			if (!ASSERT_OK(cgroups[i].victim !=
+				       (pid == cgroups[i].pid),
+				       "correct process was killed")) {
+				ret = -1;
+				break;
+			}
+
+			if (!cgroups[i].victim)
+				continue;
+
+			size = read_cgroup_file(cgroups[i].path,
+						"memory.events",
+						buf, sizeof(buf));
+			if (!ASSERT_OK(size <= 0, "read memory.events")) {
+				ret = -1;
+				break;
+			}
+
+			if (!ASSERT_OK(strstr(buf, "oom_kill 1") == NULL,
+				       "oom_kill count check")) {
+				ret = -1;
+				break;
+			}
+		}
+
+		for (i = 0; i < ARRAY_SIZE(cgroups); i++)
+			if (cgroups[i].pid && cgroups[i].pid != pid)
+				kill(cgroups[i].pid, SIGKILL);
+	}
+
+	return ret;
+}
+
+void test_oom(void)
+{
+	struct test_oom *skel;
+	int err;
+
+	skel = test_oom__open_and_load();
+	err = test_oom__attach(skel);
+	if (!ASSERT_OK(err, "test_oom__attach"))
+		goto cleanup;
+
+	setup_environment();
+
+	run_and_wait_for_oom();
+
+	cleanup_cgroup_environment();
+cleanup:
+	test_oom__destroy(skel);
+}
diff --git a/tools/testing/selftests/bpf/progs/test_oom.c b/tools/testing/selftests/bpf/progs/test_oom.c
new file mode 100644
index 000000000000..a8224d7c3fed
--- /dev/null
+++ b/tools/testing/selftests/bpf/progs/test_oom.c
@@ -0,0 +1,103 @@
+// SPDX-License-Identifier: GPL-2.0-only
+#include "vmlinux.h"
+#include <bpf/bpf_helpers.h>
+#include <bpf/bpf_tracing.h>
+#include "bpf_misc.h"
+#include "bpf_experimental.h"
+
+char _license[] SEC("license") = "GPL";
+
+#define OOM_SCORE_ADJ_MIN	(-1000)
+
+void bpf_rcu_read_lock(void) __ksym;
+void bpf_rcu_read_unlock(void) __ksym;
+struct task_struct *bpf_task_acquire(struct task_struct *p) __ksym;
+void bpf_task_release(struct task_struct *p) __ksym;
+struct mem_cgroup *bpf_get_root_mem_cgroup(void) __ksym;
+struct mem_cgroup *bpf_get_mem_cgroup(struct cgroup_subsys_state *css) __ksym;
+void bpf_put_mem_cgroup(struct mem_cgroup *memcg) __ksym;
+int bpf_oom_kill_process(struct oom_control *oc, struct task_struct *task,
+			 const char *message__str) __ksym;
+
+bool mem_cgroup_killable(struct mem_cgroup *memcg)
+{
+	struct task_struct *task;
+	bool ret = true;
+
+	bpf_for_each(css_task, task, &memcg->css, CSS_TASK_ITER_PROCS)
+		if (task->signal->oom_score_adj == OOM_SCORE_ADJ_MIN)
+			return false;
+
+	return ret;
+}
+
+/*
+ * Find the largest leaf cgroup (ignoring page cache) without unkillable tasks
+ * and kill all belonging tasks.
+ */
+SEC("fmod_ret.s/bpf_handle_out_of_memory")
+int BPF_PROG(test_bpf_out_of_memory, struct oom_control *oc)
+{
+	struct task_struct *task;
+	struct mem_cgroup *root_memcg = oc->memcg;
+	struct mem_cgroup *memcg, *victim = NULL;
+	struct cgroup_subsys_state *css_pos;
+	unsigned long usage, max_usage = 0;
+	unsigned long pagecache = 0;
+	int ret = 0;
+
+	if (root_memcg)
+		root_memcg = bpf_get_mem_cgroup(&root_memcg->css);
+	else
+		root_memcg = bpf_get_root_mem_cgroup();
+
+	if (!root_memcg)
+		return 0;
+
+	bpf_rcu_read_lock();
+	bpf_for_each(css, css_pos, &root_memcg->css, BPF_CGROUP_ITER_DESCENDANTS_POST) {
+		if (css_pos->cgroup->nr_descendants + css_pos->cgroup->nr_dying_descendants)
+			continue;
+
+		memcg = bpf_get_mem_cgroup(css_pos);
+		if (!memcg)
+			continue;
+
+		usage = bpf_mem_cgroup_usage(memcg, false);
+		pagecache = bpf_mem_cgroup_page_state(memcg, NR_FILE_PAGES);
+
+		if (usage > pagecache)
+			usage -= pagecache;
+		else
+			usage = 0;
+
+		if ((usage > max_usage) && mem_cgroup_killable(memcg)) {
+			max_usage = usage;
+			if (victim)
+				bpf_put_mem_cgroup(victim);
+			victim = bpf_get_mem_cgroup(&memcg->css);
+		}
+
+		bpf_put_mem_cgroup(memcg);
+	}
+	bpf_rcu_read_unlock();
+
+	if (!victim)
+		goto exit;
+
+	bpf_for_each(css_task, task, &victim->css, CSS_TASK_ITER_PROCS) {
+		struct task_struct *t = bpf_task_acquire(task);
+
+		if (t) {
+			bpf_oom_kill_process(oc, t, "bpf oom test");
+			bpf_task_release(t);
+			ret = 1;
+		}
+	}
+
+	bpf_put_mem_cgroup(victim);
+exit:
+	bpf_put_mem_cgroup(root_memcg);
+
+	return ret;
+}
-- 
2.49.0.901.g37484f566f-goog




* [PATCH rfc 09/12] sched: psi: bpf hook to handle psi events
  2025-04-28  3:36 [PATCH rfc 00/12] mm: BPF OOM Roman Gushchin
                   ` (7 preceding siblings ...)
  2025-04-28  3:36 ` [PATCH rfc 08/12] bpf: selftests: bpf OOM handler test Roman Gushchin
@ 2025-04-28  3:36 ` Roman Gushchin
  2025-04-28  6:11   ` kernel test robot
  2025-04-30  0:28   ` Suren Baghdasaryan
  2025-04-28  3:36 ` [PATCH rfc 10/12] mm: introduce bpf_out_of_memory() bpf kfunc Roman Gushchin
                   ` (5 subsequent siblings)
  14 siblings, 2 replies; 33+ messages in thread
From: Roman Gushchin @ 2025-04-28  3:36 UTC (permalink / raw)
  To: linux-kernel
  Cc: Andrew Morton, Alexei Starovoitov, Johannes Weiner, Michal Hocko,
	Shakeel Butt, Suren Baghdasaryan, David Rientjes, Josh Don,
	Chuyi Zhou, cgroups, linux-mm, bpf, Roman Gushchin

Introduce a bpf hook to handle psi events. The primary intended
purpose of this hook is to declare OOM events when a certain memory
pressure level is reached, similar to what systemd-oomd and oomd
do in userspace.
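
A handler attaches to the hook in the same way as the OOM handler
(a sketch based on the psi selftest added later in this series):

  SEC("fmod_ret.s/bpf_handle_psi_event")
  int BPF_PROG(handle_psi_event, struct psi_trigger *t)
  {
          /* e.g. resolve the cgroup behind the trigger and declare an OOM */
          return 0;
  }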

Signed-off-by: Roman Gushchin <roman.gushchin@linux.dev>
---
 kernel/sched/psi.c | 36 +++++++++++++++++++++++++++++++++++-
 1 file changed, 35 insertions(+), 1 deletion(-)

diff --git a/kernel/sched/psi.c b/kernel/sched/psi.c
index 1396674fa722..4c4eb4ead8f6 100644
--- a/kernel/sched/psi.c
+++ b/kernel/sched/psi.c
@@ -176,6 +176,32 @@ static void psi_avgs_work(struct work_struct *work);
 
 static void poll_timer_fn(struct timer_list *t);
 
+#ifdef CONFIG_BPF_SYSCALL
+__bpf_hook_start();
+
+__weak noinline int bpf_handle_psi_event(struct psi_trigger *t)
+{
+	return 0;
+}
+
+__bpf_hook_end();
+
+BTF_KFUNCS_START(bpf_psi_hooks)
+BTF_ID_FLAGS(func, bpf_handle_psi_event, KF_SLEEPABLE)
+BTF_KFUNCS_END(bpf_psi_hooks)
+
+static const struct btf_kfunc_id_set bpf_psi_hook_set = {
+	.owner = THIS_MODULE,
+	.set   = &bpf_psi_hooks,
+};
+
+#else
+static inline int bpf_handle_psi_event(struct psi_trigger *t)
+{
+	return 0;
+}
+#endif
+
 static void group_init(struct psi_group *group)
 {
 	int cpu;
@@ -489,6 +515,7 @@ static void update_triggers(struct psi_group *group, u64 now,
 
 		/* Generate an event */
 		if (cmpxchg(&t->event, 0, 1) == 0) {
+			bpf_handle_psi_event(t);
 			if (t->of)
 				kernfs_notify(t->of->kn);
 			else
@@ -1655,6 +1682,8 @@ static const struct proc_ops psi_irq_proc_ops = {
 
 static int __init psi_proc_init(void)
 {
+	int err = 0;
+
 	if (psi_enable) {
 		proc_mkdir("pressure", NULL);
 		proc_create("pressure/io", 0666, NULL, &psi_io_proc_ops);
@@ -1662,9 +1691,14 @@ static int __init psi_proc_init(void)
 		proc_create("pressure/cpu", 0666, NULL, &psi_cpu_proc_ops);
 #ifdef CONFIG_IRQ_TIME_ACCOUNTING
 		proc_create("pressure/irq", 0666, NULL, &psi_irq_proc_ops);
+#endif
+#ifdef CONFIG_BPF_SYSCALL
+		err = register_btf_fmodret_id_set(&bpf_psi_hook_set);
+		if (err)
+			pr_err("error while registering bpf psi hooks: %d\n", err);
 #endif
 	}
-	return 0;
+	return err;
 }
 module_init(psi_proc_init);
 
-- 
2.49.0.901.g37484f566f-goog




* [PATCH rfc 10/12] mm: introduce bpf_out_of_memory() bpf kfunc
  2025-04-28  3:36 [PATCH rfc 00/12] mm: BPF OOM Roman Gushchin
                   ` (8 preceding siblings ...)
  2025-04-28  3:36 ` [PATCH rfc 09/12] sched: psi: bpf hook to handle psi events Roman Gushchin
@ 2025-04-28  3:36 ` Roman Gushchin
  2025-04-29 11:46   ` Michal Hocko
  2025-04-28  3:36 ` [PATCH rfc 11/12] bpf: selftests: introduce open_cgroup_file() helper Roman Gushchin
                   ` (4 subsequent siblings)
  14 siblings, 1 reply; 33+ messages in thread
From: Roman Gushchin @ 2025-04-28  3:36 UTC (permalink / raw)
  To: linux-kernel
  Cc: Andrew Morton, Alexei Starovoitov, Johannes Weiner, Michal Hocko,
	Shakeel Butt, Suren Baghdasaryan, David Rientjes, Josh Don,
	Chuyi Zhou, cgroups, linux-mm, bpf, Roman Gushchin

Introduce the bpf_out_of_memory() bpf kfunc, which allows declaring
an out-of-memory event and triggering the corresponding kernel OOM
handling mechanism.

It takes a trusted memcg pointer (or NULL for system-wide OOMs)
as an argument, as well as the page order.

Only one OOM can be declared and handled in the system at once,
so if the function is called in parallel with another OOM handling,
it bails out with -EBUSY.

The function is declared as sleepable, which guarantees that it won't
be called from an atomic context. This is required by the OOM handling
code, which is not guaranteed to work in a non-blocking context.

Handling a memcg OOM almost always requires taking the css_set_lock
spinlock. The fact that bpf_out_of_memory() is sleepable also
guarantees that it can't be called with css_set_lock held, so the
kernel can't deadlock on it.
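
For example, a psi event handler can declare a memcg OOM like this
(a sketch following the psi selftest added later in this series):

  memcg = bpf_get_mem_cgroup(&cgroup->self);
  if (memcg) {
          /* returns -EBUSY if another OOM is already being handled */
          bpf_out_of_memory(memcg, 0);
          bpf_put_mem_cgroup(memcg);
  }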

Signed-off-by: Roman Gushchin <roman.gushchin@linux.dev>
---
 mm/oom_kill.c | 22 ++++++++++++++++++++++
 1 file changed, 22 insertions(+)

diff --git a/mm/oom_kill.c b/mm/oom_kill.c
index 2e922e75a9df..246510572e34 100644
--- a/mm/oom_kill.c
+++ b/mm/oom_kill.c
@@ -1333,6 +1333,27 @@ __bpf_kfunc int bpf_oom_kill_process(struct oom_control *oc,
 	return 0;
 }
 
+__bpf_kfunc int bpf_out_of_memory(struct mem_cgroup *memcg__nullable, int order)
+{
+	struct oom_control oc = {
+		.memcg = memcg__nullable,
+		.order = order,
+	};
+	int ret = -EINVAL;
+
+	if (oc.order < 0 || oc.order > MAX_PAGE_ORDER)
+		goto out;
+
+	ret = -EBUSY;
+	if (mutex_trylock(&oom_lock)) {
+		ret = out_of_memory(&oc);
+		mutex_unlock(&oom_lock);
+	}
+
+out:
+	return ret;
+}
+
 __bpf_kfunc_end_defs();
 
 __bpf_hook_start();
@@ -1358,6 +1379,7 @@ static const struct btf_kfunc_id_set bpf_oom_hook_set = {
 
 BTF_KFUNCS_START(bpf_oom_kfuncs)
 BTF_ID_FLAGS(func, bpf_oom_kill_process, KF_SLEEPABLE | KF_TRUSTED_ARGS)
+BTF_ID_FLAGS(func, bpf_out_of_memory, KF_SLEEPABLE | KF_TRUSTED_ARGS)
 BTF_KFUNCS_END(bpf_oom_kfuncs)
 
 static const struct btf_kfunc_id_set bpf_oom_kfunc_set = {
-- 
2.49.0.901.g37484f566f-goog




* [PATCH rfc 11/12] bpf: selftests: introduce open_cgroup_file() helper
  2025-04-28  3:36 [PATCH rfc 00/12] mm: BPF OOM Roman Gushchin
                   ` (9 preceding siblings ...)
  2025-04-28  3:36 ` [PATCH rfc 10/12] mm: introduce bpf_out_of_memory() bpf kfunc Roman Gushchin
@ 2025-04-28  3:36 ` Roman Gushchin
  2025-04-28  3:36 ` [PATCH rfc 12/12] bpf: selftests: psi handler test Roman Gushchin
                   ` (3 subsequent siblings)
  14 siblings, 0 replies; 33+ messages in thread
From: Roman Gushchin @ 2025-04-28  3:36 UTC (permalink / raw)
  To: linux-kernel
  Cc: Andrew Morton, Alexei Starovoitov, Johannes Weiner, Michal Hocko,
	Shakeel Butt, Suren Baghdasaryan, David Rientjes, Josh Don,
	Chuyi Zhou, cgroups, linux-mm, bpf, Roman Gushchin

Implement the open_cgroup_file() helper, which opens a cgroup
control file with the given flags and returns a file descriptor.

It's useful when a test needs to do something more sophisticated
than a simple read or write, e.g. listen for poll events or keep
the file descriptor open.
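
For example, setting up a psi trigger requires keeping the file
descriptor open (a sketch based on the psi selftest added later in
this series):

  const char *trig = "some 100000 1000000";
  int fd = open_cgroup_file("/oom_test/cg2", "memory.pressure", O_RDWR);

  if (fd < 0)
          return 1;
  if (write(fd, trig, strlen(trig) + 1) < 0)
          return 1;
  /* keep fd open, otherwise the psi trigger will be deleted */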

Signed-off-by: Roman Gushchin <roman.gushchin@linux.dev>
---
 tools/testing/selftests/bpf/cgroup_helpers.c | 28 ++++++++++++++++++++
 tools/testing/selftests/bpf/cgroup_helpers.h |  1 +
 2 files changed, 29 insertions(+)

diff --git a/tools/testing/selftests/bpf/cgroup_helpers.c b/tools/testing/selftests/bpf/cgroup_helpers.c
index 3ffd4b764f91..50dbe4f45cb1 100644
--- a/tools/testing/selftests/bpf/cgroup_helpers.c
+++ b/tools/testing/selftests/bpf/cgroup_helpers.c
@@ -125,6 +125,34 @@ int enable_controllers(const char *relative_path, const char *controllers)
 	return __enable_controllers(cgroup_path, controllers);
 }
 
+static int __open_cgroup_file(const char *cgroup_path, const char *file,
+			      int flags)
+{
+	char file_path[PATH_MAX + 1];
+
+	snprintf(file_path, sizeof(file_path), "%s/%s", cgroup_path, file);
+	return open(file_path, flags);
+}
+
+/**
+ * open_cgroup_file() - Open a cgroup file
+ * @relative_path: The cgroup path, relative to the workdir
+ * @file: The name of the file in cgroupfs to open
+ * @flags: Flags passed to open()
+ *
+ * Open a file in the given cgroup's directory.
+ *
+ * If successful, a file descriptor is returned.
+ */
+int open_cgroup_file(const char *relative_path, const char *file,
+		     int flags)
+{
+	char cgroup_path[PATH_MAX - 24];
+
+	format_cgroup_path(cgroup_path, relative_path);
+	return __open_cgroup_file(cgroup_path, file, flags);
+}
+
 static size_t __read_cgroup_file(const char *cgroup_path, const char *file,
 				 char *buf, size_t size)
 {
diff --git a/tools/testing/selftests/bpf/cgroup_helpers.h b/tools/testing/selftests/bpf/cgroup_helpers.h
index 821cb76db1f7..f45007d5fea5 100644
--- a/tools/testing/selftests/bpf/cgroup_helpers.h
+++ b/tools/testing/selftests/bpf/cgroup_helpers.h
@@ -11,6 +11,7 @@
 
 /* cgroupv2 related */
 int enable_controllers(const char *relative_path, const char *controllers);
+int open_cgroup_file(const char *relative_path, const char *file, int flags);
 ssize_t read_cgroup_file(const char *relative_path, const char *file,
 			char *buf, size_t size);
 int write_cgroup_file(const char *relative_path, const char *file,
-- 
2.49.0.901.g37484f566f-goog




* [PATCH rfc 12/12] bpf: selftests: psi handler test
  2025-04-28  3:36 [PATCH rfc 00/12] mm: BPF OOM Roman Gushchin
                   ` (10 preceding siblings ...)
  2025-04-28  3:36 ` [PATCH rfc 11/12] bpf: selftests: introduce open_cgroup_file() helper Roman Gushchin
@ 2025-04-28  3:36 ` Roman Gushchin
  2025-04-28 10:43 ` [PATCH rfc 00/12] mm: BPF OOM Matt Bobrowski
                   ` (2 subsequent siblings)
  14 siblings, 0 replies; 33+ messages in thread
From: Roman Gushchin @ 2025-04-28  3:36 UTC (permalink / raw)
  To: linux-kernel
  Cc: Andrew Morton, Alexei Starovoitov, Johannes Weiner, Michal Hocko,
	Shakeel Butt, Suren Baghdasaryan, David Rientjes, Josh Don,
	Chuyi Zhou, cgroups, linux-mm, bpf, Roman Gushchin

Add a psi handler test. The test creates a cgroup with two child
sub-cgroups, sets up memory.high for one of them and puts
a memory-hungry process in each.

Then it sets up a psi trigger for one of the cgroups and waits
until the process in this cgroup is killed by the OOM killer.
To make sure there was indeed an OOM event, it checks the
corresponding memcg statistics.

Signed-off-by: Roman Gushchin <roman.gushchin@linux.dev>
---
 tools/testing/selftests/bpf/prog_tests/psi.c | 234 +++++++++++++++++++
 tools/testing/selftests/bpf/progs/test_psi.c |  43 ++++
 2 files changed, 277 insertions(+)
 create mode 100644 tools/testing/selftests/bpf/prog_tests/psi.c
 create mode 100644 tools/testing/selftests/bpf/progs/test_psi.c

diff --git a/tools/testing/selftests/bpf/prog_tests/psi.c b/tools/testing/selftests/bpf/prog_tests/psi.c
new file mode 100644
index 000000000000..99d68bc20eee
--- /dev/null
+++ b/tools/testing/selftests/bpf/prog_tests/psi.c
@@ -0,0 +1,234 @@
+// SPDX-License-Identifier: GPL-2.0-only
+#define _GNU_SOURCE
+
+#include <stdio.h>
+#include <fcntl.h>
+#include <unistd.h>
+#include <stdlib.h>
+#include <signal.h>
+#include <sys/stat.h>
+#include <test_progs.h>
+#include <bpf/btf.h>
+#include <bpf/bpf.h>
+#include <errno.h>
+#include <string.h>
+
+#include "cgroup_helpers.h"
+#include "test_psi.skel.h"
+
+struct cgroup_desc {
+	const char *path;
+	int fd;
+	unsigned long long id;
+	int pid;
+	size_t target;
+	size_t high;
+	bool victim;
+	bool psi;
+};
+
+#define MB (1024 * 1024)
+
+static struct cgroup_desc cgroups[] = {
+	{ .path = "/oom_test" },
+	{ .path = "/oom_test/cg1", .target = 100 * MB },
+	{ .path = "/oom_test/cg2", .target = 500 * MB,
+	  .high = 40 * MB, .psi = true, .victim = true },
+};
+
+static int spawn_task(struct cgroup_desc *desc)
+{
+	char *ptr;
+	int pid;
+
+	pid = fork();
+	if (pid < 0)
+		return pid;
+
+	if (pid > 0) {
+		/* parent */
+		desc->pid = pid;
+		return 0;
+	}
+
+	/* child */
+	ptr = (char *)malloc(desc->target);
+	if (!ptr)
+		return -ENOMEM;
+
+	memset(ptr, 'a', desc->target);
+
+	while (1)
+		sleep(1000);
+
+	return 0;
+}
+
+static int setup_psi_alert(struct cgroup_desc *desc)
+{
+	const char *trig = "some 100000 1000000";
+	int fd;
+
+	fd = open_cgroup_file(desc->path, "memory.pressure", O_RDWR);
+	if (fd < 0) {
+		printf("memory.pressure open error: %s\n", strerror(errno));
+		return 1;
+	}
+
+	if (write(fd, trig, strlen(trig) + 1) < 0) {
+		printf("memory.pressure write error: %s\n", strerror(errno));
+		return 1;
+	}
+
+	/* keep fd open, otherwise the psi trigger will be deleted */
+	return 0;
+}
+
+static void setup_environment(void)
+{
+	int i, err;
+
+	err = setup_cgroup_environment();
+	if (!ASSERT_OK(err, "setup_cgroup_environment"))
+		goto cleanup;
+
+	for (i = 0; i < ARRAY_SIZE(cgroups); i++) {
+		cgroups[i].fd = create_and_get_cgroup(cgroups[i].path);
+		if (!ASSERT_GE(cgroups[i].fd, 0, "create_and_get_cgroup"))
+			goto cleanup;
+
+		cgroups[i].id = get_cgroup_id(cgroups[i].path);
+		if (!ASSERT_GT(cgroups[i].id, 0, "get_cgroup_id"))
+			goto cleanup;
+
+		if (i == 0) {
+			/* Freeze the top-level cgroup */
+			err = write_cgroup_file(cgroups[i].path, "cgroup.freeze", "1");
+			if (!ASSERT_OK(err, "freeze cgroup"))
+				goto cleanup;
+		}
+
+		if (!cgroups[i].target) {
+			/* Recursively enable the memory controller */
+			err = write_cgroup_file(cgroups[i].path, "cgroup.subtree_control",
+						"+memory");
+			if (!ASSERT_OK(err, "enable memory controller"))
+				goto cleanup;
+		}
+
+		if (cgroups[i].high) {
+			char buf[256];
+
+			snprintf(buf, sizeof(buf), "%lu", cgroups[i].high);
+			err = write_cgroup_file(cgroups[i].path, "memory.high", buf);
+			if (!ASSERT_OK(err, "set memory.high"))
+				goto cleanup;
+
+			snprintf(buf, sizeof(buf), "0");
+			write_cgroup_file(cgroups[i].path, "memory.swap.max", buf);
+		}
+
+		if (cgroups[i].target) {
+			char buf[256];
+
+			err = spawn_task(&cgroups[i]);
+			if (!ASSERT_OK(err, "spawn task"))
+				goto cleanup;
+
+			snprintf(buf, sizeof(buf), "%d", cgroups[i].pid);
+			err = write_cgroup_file(cgroups[i].path, "cgroup.procs", buf);
+			if (!ASSERT_OK(err, "put child into a cgroup"))
+				goto cleanup;
+		}
+
+		if (cgroups[i].psi) {
+			err = setup_psi_alert(&cgroups[i]);
+			if (!ASSERT_OK(err, "create psi trigger"))
+				goto cleanup;
+		}
+	}
+
+	return;
+
+cleanup:
+	cleanup_cgroup_environment();
+}
+
+static int run_and_wait_for_oom(void)
+{
+	int ret = -1;
+	bool first = true;
+	char buf[4096] = {};
+	ssize_t size;
+
+	ret = write_cgroup_file(cgroups[0].path, "cgroup.freeze", "0");
+	if (!ASSERT_OK(ret, "freeze cgroup"))
+		return -1;
+
+	for (;;) {
+		int i, status;
+		pid_t pid = wait(&status);
+
+		if (pid == -1) {
+			if (errno == EINTR)
+				continue;
+			/* ECHILD */
+			break;
+		}
+
+		if (!first)
+			continue;
+
+		first = false;
+
+		for (i = 0; i < ARRAY_SIZE(cgroups); i++) {
+			if (!ASSERT_OK(cgroups[i].victim !=
+				       (pid == cgroups[i].pid),
+				       "correct process was killed")) {
+				ret = -1;
+				break;
+			}
+
+			if (!cgroups[i].victim)
+				continue;
+
+			size = read_cgroup_file(cgroups[i].path, "memory.events",
+						buf, sizeof(buf));
+			if (!ASSERT_OK(size <= 0, "read memory.events")) {
+				ret = -1;
+				break;
+			}
+
+			if (!ASSERT_OK(strstr(buf, "oom_kill 1") == NULL,
+				       "oom_kill count check")) {
+				ret = -1;
+				break;
+			}
+		}
+
+		for (i = 0; i < ARRAY_SIZE(cgroups); i++)
+			if (cgroups[i].pid && cgroups[i].pid != pid)
+				kill(cgroups[i].pid, SIGKILL);
+	}
+
+	return ret;
+}
+
+void test_psi(void)
+{
+	struct test_psi *skel;
+	int err;
+
+	skel = test_psi__open_and_load();
+	err = test_psi__attach(skel);
+	if (!ASSERT_OK(err, "test_psi__attach"))
+		goto cleanup;
+
+	setup_environment();
+
+	run_and_wait_for_oom();
+
+	cleanup_cgroup_environment();
+cleanup:
+	test_psi__destroy(skel);
+}
diff --git a/tools/testing/selftests/bpf/progs/test_psi.c b/tools/testing/selftests/bpf/progs/test_psi.c
new file mode 100644
index 000000000000..8cbc1e0a5b24
--- /dev/null
+++ b/tools/testing/selftests/bpf/progs/test_psi.c
@@ -0,0 +1,43 @@
+// SPDX-License-Identifier: GPL-2.0-only
+#include "vmlinux.h"
+#include <bpf/bpf_helpers.h>
+#include <bpf/bpf_tracing.h>
+#include "bpf_misc.h"
+#include "bpf_experimental.h"
+
+char _license[] SEC("license") = "GPL";
+
+struct mem_cgroup *bpf_get_mem_cgroup(struct cgroup_subsys_state *css) __ksym;
+void bpf_put_mem_cgroup(struct mem_cgroup *memcg) __ksym;
+int bpf_out_of_memory(struct mem_cgroup *memcg, int order) __ksym;
+
+SEC("fmod_ret.s/bpf_handle_psi_event")
+int BPF_PROG(test_psi_event, struct psi_trigger *t)
+{
+	struct cgroup *cgroup = NULL;
+	struct mem_cgroup *memcg;
+	u64 cgroup_id;
+
+	if (!t->of || !t->of->kn) {
+		bpf_out_of_memory(NULL, 0);
+		return 1;
+	}
+
+	cgroup_id = t->of->kn->__parent->id;
+	cgroup = bpf_cgroup_from_id(cgroup_id);
+	if (!cgroup)
+		return 0;
+
+	memcg = bpf_get_mem_cgroup(&cgroup->self);
+	if (!memcg) {
+		bpf_cgroup_release(cgroup);
+		return 0;
+	}
+
+	bpf_out_of_memory(memcg, 0); /* declare a memcg OOM */
+
+	bpf_put_mem_cgroup(memcg);
+	bpf_cgroup_release(cgroup);
+
+	return 1;
+}
-- 
2.49.0.901.g37484f566f-goog



^ permalink raw reply	[flat|nested] 33+ messages in thread

* Re: [PATCH rfc 09/12] sched: psi: bpf hook to handle psi events
  2025-04-28  3:36 ` [PATCH rfc 09/12] sched: psi: bpf hook to handle psi events Roman Gushchin
@ 2025-04-28  6:11   ` kernel test robot
  2025-04-30  0:28   ` Suren Baghdasaryan
  1 sibling, 0 replies; 33+ messages in thread
From: kernel test robot @ 2025-04-28  6:11 UTC (permalink / raw)
  To: Roman Gushchin, linux-kernel
  Cc: oe-kbuild-all, Andrew Morton, Linux Memory Management List,
	Alexei Starovoitov, Johannes Weiner, Michal Hocko, Shakeel Butt,
	Suren Baghdasaryan, David Rientjes, Josh Don, Chuyi Zhou,
	cgroups, bpf, Roman Gushchin

Hi Roman,

kernel test robot noticed the following build warnings:

[auto build test WARNING on akpm-mm/mm-everything]

url:    https://github.com/intel-lab-lkp/linux/commits/Roman-Gushchin/mm-introduce-a-bpf-hook-for-OOM-handling/20250428-113742
base:   https://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm.git mm-everything
patch link:    https://lore.kernel.org/r/20250428033617.3797686-10-roman.gushchin%40linux.dev
patch subject: [PATCH rfc 09/12] sched: psi: bpf hook to handle psi events
config: sh-randconfig-001-20250428 (https://download.01.org/0day-ci/archive/20250428/202504281309.smYiDStM-lkp@intel.com/config)
compiler: sh4-linux-gcc (GCC) 10.5.0
reproduce (this is a W=1 build): (https://download.01.org/0day-ci/archive/20250428/202504281309.smYiDStM-lkp@intel.com/reproduce)

If you fix the issue in a separate patch/commit (i.e. not just a new version of
the same patch/commit), kindly add following tags
| Reported-by: kernel test robot <lkp@intel.com>
| Closes: https://lore.kernel.org/oe-kbuild-all/202504281309.smYiDStM-lkp@intel.com/

All warnings (new ones prefixed by >>):

   In file included from kernel/sched/build_utility.c:94:
>> kernel/sched/psi.c:193:38: warning: 'bpf_psi_hook_set' defined but not used [-Wunused-const-variable=]
     193 | static const struct btf_kfunc_id_set bpf_psi_hook_set = {
         |                                      ^~~~~~~~~~~~~~~~


vim +/bpf_psi_hook_set +193 kernel/sched/psi.c

   192	
 > 193	static const struct btf_kfunc_id_set bpf_psi_hook_set = {
   194		.owner = THIS_MODULE,
   195		.set   = &bpf_psi_hooks,
   196	};
   197	

-- 
0-DAY CI Kernel Test Service
https://github.com/intel/lkp-tests/wiki


^ permalink raw reply	[flat|nested] 33+ messages in thread

* Re: [PATCH rfc 00/12] mm: BPF OOM
  2025-04-28  3:36 [PATCH rfc 00/12] mm: BPF OOM Roman Gushchin
                   ` (11 preceding siblings ...)
  2025-04-28  3:36 ` [PATCH rfc 12/12] bpf: selftests: psi handler test Roman Gushchin
@ 2025-04-28 10:43 ` Matt Bobrowski
  2025-04-28 17:24   ` Roman Gushchin
  2025-04-29 11:42 ` Michal Hocko
  2025-04-29 22:44 ` Suren Baghdasaryan
  14 siblings, 1 reply; 33+ messages in thread
From: Matt Bobrowski @ 2025-04-28 10:43 UTC (permalink / raw)
  To: Roman Gushchin
  Cc: linux-kernel, Andrew Morton, Alexei Starovoitov, Johannes Weiner,
	Michal Hocko, Shakeel Butt, Suren Baghdasaryan, David Rientjes,
	Josh Don, Chuyi Zhou, cgroups, linux-mm, bpf

On Mon, Apr 28, 2025 at 03:36:05AM +0000, Roman Gushchin wrote:
> [...]
> This is an RFC version, which is not intended to be merged in the current form.
> Open questions/TODOs:
> 1) Program type/attachment type for the bpf_handle_out_of_memory() hook.
>    It has to be able to return a value, to be sleepable (to use cgroup iterators)
>    and to have trusted arguments to pass oom_control down to bpf_oom_kill_process().
>    Current patchset has a workaround (patch "bpf: treat fmodret tracing program's
>    arguments as trusted"), which is not safe. One option is to fake acquire/release
>    semantics for the oom_control pointer. Other option is to introduce a completely
>    new attachment or program type, similar to lsm hooks.

Thinking out loud now, but rather than introducing and having a single
BPF-specific function/interface, and BPF program for that matter,
which can effectively be used to short-circuit steps from within
out_of_memory(), why not introduce a
tcp_congestion_ops/sched_ext_ops-like interface which essentially
provides a multifaceted interface for controlling OOM killing
(->select_bad_process, ->oom_kill_process, etc), optionally also from
the context of a BPF program (BPF_PROG_TYPE_STRUCT_OPS)?
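
Roughly, I'm imagining something along these lines (all names are
hypothetical, just to illustrate the shape of the interface):

struct bpf_oom_ops {
	/* return 1 if the OOM was handled */
	int  (*handle_out_of_memory)(struct oom_control *oc);
	/* optional overrides for the individual steps */
	void (*select_bad_process)(struct oom_control *oc);
	void (*oom_kill_process)(struct oom_control *oc, const char *message);
};

A BPF implementation could then override only the pieces it actually
cares about and fall back to the in-kernel defaults for the rest.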

I don't know whether that's what you meant by introducing a new
attachment, or program type, but an approach like this is what
immediately comes to mind when wanting to provide more than a single
implementation for a set of operations within the Linux kernel,
particularly also from the context of a BPF program.

> [...]


^ permalink raw reply	[flat|nested] 33+ messages in thread

* Re: [PATCH rfc 00/12] mm: BPF OOM
  2025-04-28 10:43 ` [PATCH rfc 00/12] mm: BPF OOM Matt Bobrowski
@ 2025-04-28 17:24   ` Roman Gushchin
  2025-04-29  1:56     ` Kumar Kartikeya Dwivedi
  0 siblings, 1 reply; 33+ messages in thread
From: Roman Gushchin @ 2025-04-28 17:24 UTC (permalink / raw)
  To: Matt Bobrowski
  Cc: linux-kernel, Andrew Morton, Alexei Starovoitov, Johannes Weiner,
	Michal Hocko, Shakeel Butt, Suren Baghdasaryan, David Rientjes,
	Josh Don, Chuyi Zhou, cgroups, linux-mm, bpf

On Mon, Apr 28, 2025 at 10:43:07AM +0000, Matt Bobrowski wrote:
> On Mon, Apr 28, 2025 at 03:36:05AM +0000, Roman Gushchin wrote:
> > [...]
> > This is an RFC version, which is not intended to be merged in the current form.
> > Open questions/TODOs:
> > 1) Program type/attachment type for the bpf_handle_out_of_memory() hook.
> >    It has to be able to return a value, to be sleepable (to use cgroup iterators)
> >    and to have trusted arguments to pass oom_control down to bpf_oom_kill_process().
> >    Current patchset has a workaround (patch "bpf: treat fmodret tracing program's
> >    arguments as trusted"), which is not safe. One option is to fake acquire/release
> >    semantics for the oom_control pointer. Other option is to introduce a completely
> >    new attachment or program type, similar to lsm hooks.
> 
> Thinking out loud now, but rather than introducing and having a single
> BPF-specific function/interface, and BPF program for that matter,
> which can effectively be used to short-circuit steps from within
> out_of_memory(), why not introduce a
> tcp_congestion_ops/sched_ext_ops-like interface which essentially
> provides a multifaceted interface for controlling OOM killing
> (->select_bad_process, ->oom_kill_process, etc), optionally also from
> the context of a BPF program (BPF_PROG_TYPE_STRUCT_OPS)?

It's certainly an option and I thought about it. I don't think we need a bunch
of hooks though. This patchset adds 2 and they belong to completely different
subsystems (mm and sched/psi), so Idk how well they can be gathered
into a single struct ops. But maybe it's fine.

The only potentially new hook I can envision now is one to customize
the oom reporting.

Thanks for the suggestion!



^ permalink raw reply	[flat|nested] 33+ messages in thread

* Re: [PATCH rfc 00/12] mm: BPF OOM
  2025-04-28 17:24   ` Roman Gushchin
@ 2025-04-29  1:56     ` Kumar Kartikeya Dwivedi
  2025-04-29 15:42       ` Roman Gushchin
  2025-05-02 17:26       ` Song Liu
  0 siblings, 2 replies; 33+ messages in thread
From: Kumar Kartikeya Dwivedi @ 2025-04-29  1:56 UTC (permalink / raw)
  To: Roman Gushchin
  Cc: Matt Bobrowski, linux-kernel, Andrew Morton, Alexei Starovoitov,
	Johannes Weiner, Michal Hocko, Shakeel Butt, Suren Baghdasaryan,
	David Rientjes, Josh Don, Chuyi Zhou, cgroups, linux-mm, bpf

On Mon, 28 Apr 2025 at 19:24, Roman Gushchin <roman.gushchin@linux.dev> wrote:
>
> On Mon, Apr 28, 2025 at 10:43:07AM +0000, Matt Bobrowski wrote:
> > On Mon, Apr 28, 2025 at 03:36:05AM +0000, Roman Gushchin wrote:
> > > [...]
> > > This is an RFC version, which is not intended to be merged in the current form.
> > > Open questions/TODOs:
> > > 1) Program type/attachment type for the bpf_handle_out_of_memory() hook.
> > >    It has to be able to return a value, to be sleepable (to use cgroup iterators)
> > >    and to have trusted arguments to pass oom_control down to bpf_oom_kill_process().
> > >    Current patchset has a workaround (patch "bpf: treat fmodret tracing program's
> > >    arguments as trusted"), which is not safe. One option is to fake acquire/release
> > >    semantics for the oom_control pointer. Other option is to introduce a completely
> > >    new attachment or program type, similar to lsm hooks.
> >
> > Thinking out loud now, but rather than introducing and having a single
> > BPF-specific function/interface, and BPF program for that matter,
> > which can effectively be used to short-circuit steps from within
> > out_of_memory(), why not introduce a
> > tcp_congestion_ops/sched_ext_ops-like interface which essentially
> > provides a multifaceted interface for controlling OOM killing
> > (->select_bad_process, ->oom_kill_process, etc), optionally also from
> > the context of a BPF program (BPF_PROG_TYPE_STRUCT_OPS)?
>
> It's certainly an option and I thought about it. I don't think we need a bunch
> of hooks though. This patchset adds 2 and they belong to completely different
> subsystems (mm and sched/psi), so Idk how well they can be gathered
> into a single struct ops. But maybe it's fine.
>
> The only potentially new hook I can envision now is one to customize
> the oom reporting.
>

If you're considering scoping it down to a particular cgroup (as you
allude to in the TODO), or building a hierarchical interface, using
struct_ops will be much better than fmod_ret etc., which is global in
nature. Even if you don't support it now. I don't think a struct_ops
is warranted only when you have more than a few callbacks. As an
illustration, sched_ext started out without supporting hierarchical
attachment, but will piggy-back on the struct_ops interface to do so
in the near future.
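
For illustration, the BPF side of such an interface could be registered
roughly the way sched_ext does it (struct and section names below are
hypothetical):

SEC("struct_ops/handle_out_of_memory")
int BPF_PROG(my_oom_handler, struct oom_control *oc)
{
	/* pick and kill a victim here; return 0 to fall back
	 * to the in-kernel oom killer */
	return 0;
}

SEC(".struct_ops.link")
struct bpf_oom_ops my_oom_policy = {
	.handle_out_of_memory = (void *)my_oom_handler,
};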

> Thanks for the suggestion!
>
>


^ permalink raw reply	[flat|nested] 33+ messages in thread

* Re: [PATCH rfc 00/12] mm: BPF OOM
  2025-04-28  3:36 [PATCH rfc 00/12] mm: BPF OOM Roman Gushchin
                   ` (12 preceding siblings ...)
  2025-04-28 10:43 ` [PATCH rfc 00/12] mm: BPF OOM Matt Bobrowski
@ 2025-04-29 11:42 ` Michal Hocko
  2025-04-29 14:44   ` Roman Gushchin
  2025-04-29 22:44 ` Suren Baghdasaryan
  14 siblings, 1 reply; 33+ messages in thread
From: Michal Hocko @ 2025-04-29 11:42 UTC (permalink / raw)
  To: Roman Gushchin
  Cc: linux-kernel, Andrew Morton, Alexei Starovoitov, Johannes Weiner,
	Shakeel Butt, Suren Baghdasaryan, David Rientjes, Josh Don,
	Chuyi Zhou, cgroups, linux-mm, bpf

On Mon 28-04-25 03:36:05, Roman Gushchin wrote:
> This patchset adds an ability to customize the out of memory
> handling using bpf.
> 
> It focuses on two parts:
> 1) OOM handling policy,
> 2) PSI-based OOM invocation.
> 
> The idea to use bpf for customizing the OOM handling is not new, but
> unlike the previous proposal [1], which augmented the existing task
> ranking-based policy, this one tries to be as generic as possible and
> leverage the full power of the modern bpf.
> 
> It provides a generic hook which is called before the existing OOM
> killer code and allows implementing any policy, e.g.  picking a victim
> task or memory cgroup or potentially even releasing memory in other
> ways, e.g. deleting tmpfs files (the last one might require some
> additional but relatively simple changes).

Makes sense to me. I still have a slight concern though. We have 3
different oom handlers smashed into a single one with special casing
involved. This is manageable (although not great) for the in kernel
code but I am wondering whether we should do better for BPF based OOM
implementations. Would it make sense to have different callbacks for
cpuset, memcg and global oom killer handlers?

I can see you have already added some helper functions to deal with
memcgs but I do not see anything to iterate processes or find a process to
kill etc. Is that functionality generally available (sorry I am not
really familiar with BPF all that much so please bear with me)?

I like the way you naturally hooked into existing OOM primitives
like oom_kill_process but I do not see tsk_is_oom_victim exposed. Are
you waiting for a first user that needs to implement oom victim
synchronization or do you plan to integrate that into task iterators?
I am mostly asking because it is exactly these kinds of details that
make the current in-kernel oom handler quite complex and it would be
great if custom ones do not have to reproduce that complexity and only
focus on the high-level policy.

> The second part is related to the fundamental question on when to
> declare the OOM event. It's a trade-off between the risk of
> unnecessary OOM kills and associated work losses and the risk of
> infinite trashing and effective soft lockups.  In the last few years
> several PSI-based userspace solutions were developed (e.g. OOMd [3] or
> systemd-OOMd [4]). The common idea was to use userspace daemons to
> implement custom OOM logic as well as rely on PSI monitoring to avoid
> stalls.

This makes sense to me as well. I have to admit I am not fully familiar
with the PSI integration into the sched code but from what I can see the
evaluation is done on a regular basis from the worker context kicked off
from the scheduler code. There shouldn't be any locking constraints, which
is good. Is there any risk if the oom handler takes too long though?

Also an important question. I can see selftests which are using the
infrastructure. But have you tried to implement a real OOM handler with
this proposed infrastructure?

> [1]: https://lwn.net/ml/linux-kernel/20230810081319.65668-1-zhouchuyi@bytedance.com/
> [2]: https://lore.kernel.org/lkml/20171130152824.1591-1-guro@fb.com/
> [3]: https://github.com/facebookincubator/oomd
> [4]: https://www.freedesktop.org/software/systemd/man/latest/systemd-oomd.service.html
> 
> ----
> 
> This is an RFC version, which is not intended to be merged in the current form.
> Open questions/TODOs:
> 1) Program type/attachment type for the bpf_handle_out_of_memory() hook.
>    It has to be able to return a value, to be sleepable (to use cgroup iterators)
>    and to have trusted arguments to pass oom_control down to bpf_oom_kill_process().
>    Current patchset has a workaround (patch "bpf: treat fmodret tracing program's
>    arguments as trusted"), which is not safe. One option is to fake acquire/release
>    semantics for the oom_control pointer. Other option is to introduce a completely
>    new attachment or program type, similar to lsm hooks.
> 2) Currently lockdep complaints about a potential circular dependency because
>    sleepable bpf_handle_out_of_memory() hook calls might_fault() under oom_lock.
>    One way to fix it is to make it non-sleepable, but then it will require some
>    additional work to allow it using cgroup iterators. It's intervened with 1).

I cannot see this in the code. Could you be more specific please? Where
is this might_fault coming from? Is this a BPF constraint?

> 3) What kind of hierarchical features are required? Do we want to nest oom policies?
>    Do we want to attach oom policies to cgroups? I think it's too complicated,
>    but if we want a full hierarchical support, it might be required.
>    Patch "mm: introduce bpf_get_root_mem_cgroup() bpf kfunc" exposes the true root
>    memcg, which is potentially outside of the ns of the loading process. Does
>    it require some additional capabilities checks? Should it be removed?

Yes, let's start simple and see where we get from there.

> 4) Documentation is lacking and will be added in the next version.

+1

Thanks!
-- 
Michal Hocko
SUSE Labs


^ permalink raw reply	[flat|nested] 33+ messages in thread

* Re: [PATCH rfc 10/12] mm: introduce bpf_out_of_memory() bpf kfunc
  2025-04-28  3:36 ` [PATCH rfc 10/12] mm: introduce bpf_out_of_memory() bpf kfunc Roman Gushchin
@ 2025-04-29 11:46   ` Michal Hocko
  2025-04-29 21:31     ` Roman Gushchin
  0 siblings, 1 reply; 33+ messages in thread
From: Michal Hocko @ 2025-04-29 11:46 UTC (permalink / raw)
  To: Roman Gushchin
  Cc: linux-kernel, Andrew Morton, Alexei Starovoitov, Johannes Weiner,
	Shakeel Butt, Suren Baghdasaryan, David Rientjes, Josh Don,
	Chuyi Zhou, cgroups, linux-mm, bpf

On Mon 28-04-25 03:36:15, Roman Gushchin wrote:
> Introduce bpf_out_of_memory() bpf kfunc, which allows declaring
> an out of memory event and triggering the corresponding kernel OOM
> handling mechanism.
> 
> It takes a trusted memcg pointer (or NULL for system-wide OOMs)
> as an argument, as well as the page order.
> 
> Only one OOM can be declared and handled in the system at once,
> so if the function is called in parallel to another OOM handling,
> it bails out with -EBUSY.

This makes sense for the global OOM handler because concurrent handlers
are cooperative. But is this really correct for memcg ooms which could
happen for different hierarchies? Currently we do block on oom_lock in
that case to make sure one oom doesn't starve others. Do we want the
same behavior for custom OOM handlers?

-- 
Michal Hocko
SUSE Labs


^ permalink raw reply	[flat|nested] 33+ messages in thread

* Re: [PATCH rfc 00/12] mm: BPF OOM
  2025-04-29 11:42 ` Michal Hocko
@ 2025-04-29 14:44   ` Roman Gushchin
  2025-04-29 21:56     ` Suren Baghdasaryan
  2025-04-29 23:01     ` Suren Baghdasaryan
  0 siblings, 2 replies; 33+ messages in thread
From: Roman Gushchin @ 2025-04-29 14:44 UTC (permalink / raw)
  To: Michal Hocko
  Cc: linux-kernel, Andrew Morton, Alexei Starovoitov, Johannes Weiner,
	Shakeel Butt, Suren Baghdasaryan, David Rientjes, Josh Don,
	Chuyi Zhou, cgroups, linux-mm, bpf

Michal Hocko <mhocko@suse.com> writes:

> On Mon 28-04-25 03:36:05, Roman Gushchin wrote:
>> This patchset adds an ability to customize the out of memory
>> handling using bpf.
>> 
>> It focuses on two parts:
>> 1) OOM handling policy,
>> 2) PSI-based OOM invocation.
>> 
>> The idea to use bpf for customizing the OOM handling is not new, but
>> unlike the previous proposal [1], which augmented the existing task
>> ranking-based policy, this one tries to be as generic as possible and
>> leverage the full power of the modern bpf.
>> 
>> It provides a generic hook which is called before the existing OOM
>> killer code and allows implementing any policy, e.g.  picking a victim
>> task or memory cgroup or potentially even releasing memory in other
>> ways, e.g. deleting tmpfs files (the last one might require some
>> additional but relatively simple changes).
>
> Makes sense to me. I still have a slight concern though. We have 3
> different oom handlers smashed into a single one with special casing
> involved. This is manageable (although not great) for the in kernel
> code but I am wondering whether we should do better for BPF based OOM
> implementations. Would it make sense to have different callbacks for
> cpuset, memcg and global oom killer handlers?

Yes, it's certainly possible. If we go the struct_ops path, we can even
have both a common hook which handles all types of OOMs and separate
hooks for each type. The user can then choose what's more convenient.
Good point.

>
> I can see you have already added some helper functions to deal with
> memcgs but I do not see anything to iterate processes or find a process to
> kill etc. Is that functionality generally available (sorry I am not
> really familiar with BPF all that much so please bear with me)?

Yes, task iterator is available since v6.7:
https://docs.ebpf.io/linux/kfuncs/bpf_iter_task_new/
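
E.g. a handler can walk all processes with the open-coded iterator,
roughly like this (a sketch using the bpf_for_each() helper from the
selftests' bpf_experimental.h):

	struct task_struct *t;

	bpf_rcu_read_lock();
	bpf_for_each(task, t, NULL, BPF_TASK_ITER_ALL_PROCS) {
		/* inspect each process, e.g. t->signal->oom_score_adj,
		 * and remember the preferred victim */
	}
	bpf_rcu_read_unlock();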

>
> I like the way you naturally hooked into existing OOM primitives
> like oom_kill_process but I do not see tsk_is_oom_victim exposed. Are
> you waiting for a first user that needs to implement oom victim
> synchronization or do you plan to integrate that into task iterators?

It can be implemented in bpf directly, but I agree that it probably
deserves at least an example in the test or a separate in-kernel helper.
An in-kernel helper is probably a better idea.
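
Something as trivial as this should do (a sketch, not part of this
series):

__bpf_kfunc bool bpf_task_is_oom_victim(struct task_struct *task)
{
	return tsk_is_oom_victim(task);
}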

> I am mostly asking because it is exactly these kinds of details that
> make the current in-kernel oom handler quite complex and it would be
> great if custom ones do not have to reproduce that complexity and only
> focus on the high-level policy.

Totally agree.

>
>> The second part is related to the fundamental question on when to
>> declare the OOM event. It's a trade-off between the risk of
>> unnecessary OOM kills and associated work losses and the risk of
>> infinite trashing and effective soft lockups.  In the last few years
>> several PSI-based userspace solutions were developed (e.g. OOMd [3] or
>> systemd-OOMd [4]). The common idea was to use userspace daemons to
>> implement custom OOM logic as well as rely on PSI monitoring to avoid
>> stalls.
>
> This makes sense to me as well. I have to admit I am not fully familiar
> with the PSI integration into the sched code but from what I can see the
> evaluation is done on a regular basis from the worker context kicked off
> from the scheduler code. There shouldn't be any locking constraints, which
> is good. Is there any risk if the oom handler takes too long though?

It's a good question. In theory yes, it can affect the timing of other
PSI events. An option here is to move it into a separate work item, though
I'm not sure if it's worth the added complexity. I actually tried this
approach in an earlier version of this patchset, but the problem was
that the code for scheduling this work would have to be dynamically turned
on/off when a bpf program is attached/detached, otherwise it adds an
obvious cpu overhead.
It's doable, but Idk if it's justified.
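
For reference, the dropped approach looked roughly like this (names
are made up):

static struct work_struct bpf_psi_work;	/* INIT_WORK()'ed on attach */
static bool bpf_psi_prog_attached;	/* flipped on attach/detach */

static void bpf_psi_work_fn(struct work_struct *work)
{
	/* run the sleepable bpf handler outside of the psi worker */
}

/* called from the psi poll worker instead of invoking bpf directly */
static void bpf_psi_notify(void)
{
	if (READ_ONCE(bpf_psi_prog_attached))
		schedule_work(&bpf_psi_work);
}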

>
> Also an important question. I can see selftests which are using the
> infrastructure. But have you tried to implement a real OOM handler with
> this proposed infrastructure?

Not yet. Given the size and complexity of the infrastructure of my
current employer, it's not a short process. But we're working on it.

>
>> [1]: https://lwn.net/ml/linux-kernel/20230810081319.65668-1-zhouchuyi@bytedance.com/
>> [2]: https://lore.kernel.org/lkml/20171130152824.1591-1-guro@fb.com/
>> [3]: https://github.com/facebookincubator/oomd
>> [4]: https://www.freedesktop.org/software/systemd/man/latest/systemd-oomd.service.html
>> 
>> ----
>> 
>> This is an RFC version, which is not intended to be merged in the current form.
>> Open questions/TODOs:
>> 1) Program type/attachment type for the bpf_handle_out_of_memory() hook.
>>    It has to be able to return a value, to be sleepable (to use cgroup iterators)
>>    and to have trusted arguments to pass oom_control down to bpf_oom_kill_process().
>>    Current patchset has a workaround (patch "bpf: treat fmodret tracing program's
>>    arguments as trusted"), which is not safe. One option is to fake acquire/release
>>    semantics for the oom_control pointer. Other option is to introduce a completely
>>    new attachment or program type, similar to lsm hooks.
>> 2) Currently lockdep complaints about a potential circular dependency because
>>    sleepable bpf_handle_out_of_memory() hook calls might_fault() under oom_lock.
>>    One way to fix it is to make it non-sleepable, but then it will require some
>>    additional work to allow it using cgroup iterators. It's intervened with 1).
>
> I cannot see this in the code. Could you be more specific please? Where
> is this might_fault coming from? Is this a BPF constraint?

It's in __bpf_prog_enter_sleepable(). But I hope I can make this hook
non-sleepable (by going the struct_ops path) and the problem will go away.

>
>> 3) What kind of hierarchical features are required? Do we want to nest oom policies?
>>    Do we want to attach oom policies to cgroups? I think it's too complicated,
>>    but if we want a full hierarchical support, it might be required.
>>    Patch "mm: introduce bpf_get_root_mem_cgroup() bpf kfunc" exposes the true root
>>    memcg, which is potentially outside of the ns of the loading process. Does
>>    it require some additional capabilities checks? Should it be removed?
>
> Yes, let's start simple and see where we get from there.

Agree.

Thank you for taking a look and your comments/ideas!


^ permalink raw reply	[flat|nested] 33+ messages in thread

* Re: [PATCH rfc 00/12] mm: BPF OOM
  2025-04-29  1:56     ` Kumar Kartikeya Dwivedi
@ 2025-04-29 15:42       ` Roman Gushchin
  2025-05-02 17:26       ` Song Liu
  1 sibling, 0 replies; 33+ messages in thread
From: Roman Gushchin @ 2025-04-29 15:42 UTC (permalink / raw)
  To: Kumar Kartikeya Dwivedi
  Cc: Matt Bobrowski, linux-kernel, Andrew Morton, Alexei Starovoitov,
	Johannes Weiner, Michal Hocko, Shakeel Butt, Suren Baghdasaryan,
	David Rientjes, Josh Don, Chuyi Zhou, cgroups, linux-mm, bpf

On Tue, Apr 29, 2025 at 03:56:54AM +0200, Kumar Kartikeya Dwivedi wrote:
> On Mon, 28 Apr 2025 at 19:24, Roman Gushchin <roman.gushchin@linux.dev> wrote:
> >
> > On Mon, Apr 28, 2025 at 10:43:07AM +0000, Matt Bobrowski wrote:
> > > On Mon, Apr 28, 2025 at 03:36:05AM +0000, Roman Gushchin wrote:
> > > > [...]
> > >
> > > Thinking out loud now, but rather than introducing and having a single
> > > BPF-specific function/interface, and BPF program for that matter,
> > > which can effectively be used to short-circuit steps from within
> > > out_of_memory(), why not introduce a
> > > tcp_congestion_ops/sched_ext_ops-like interface which essentially
> > > provides a multifaceted interface for controlling OOM killing
> > > (->select_bad_process, ->oom_kill_process, etc), optionally also from
> > > the context of a BPF program (BPF_PROG_TYPE_STRUCT_OPS)?
> >
> > It's certainly an option and I thought about it. I don't think we need a bunch
> > of hooks though. This patchset adds 2 and they belong to completely different
> > subsystems (mm and sched/psi), so Idk how well they can be gathered
> > into a single struct ops. But maybe it's fine.
> >
> > The only potentially new hook I can envision now is one to customize
> > the oom reporting.
> >
> 
> If you're considering scoping it down to a particular cgroup (as you
> allude to in the TODO), or building a hierarchical interface, using
> struct_ops will be much better than fmod_ret etc., which is global in
> nature. Even if you don't support it now. I don't think a struct_ops
> is warranted only when you have more than a few callbacks. As an
> illustration, sched_ext started out without supporting hierarchical
> attachment, but will piggy-back on the struct_ops interface to do so
> in the near future.

Good point! I agree.

Thanks


^ permalink raw reply	[flat|nested] 33+ messages in thread

* Re: [PATCH rfc 10/12] mm: introduce bpf_out_of_memory() bpf kfunc
  2025-04-29 11:46   ` Michal Hocko
@ 2025-04-29 21:31     ` Roman Gushchin
  2025-04-30  7:27       ` Michal Hocko
  0 siblings, 1 reply; 33+ messages in thread
From: Roman Gushchin @ 2025-04-29 21:31 UTC (permalink / raw)
  To: Michal Hocko
  Cc: linux-kernel, Andrew Morton, Alexei Starovoitov, Johannes Weiner,
	Shakeel Butt, Suren Baghdasaryan, David Rientjes, Josh Don,
	Chuyi Zhou, cgroups, linux-mm, bpf

On Tue, Apr 29, 2025 at 01:46:07PM +0200, Michal Hocko wrote:
> On Mon 28-04-25 03:36:15, Roman Gushchin wrote:
> > Introduce bpf_out_of_memory() bpf kfunc, which allows to declare
> > an out of memory events and trigger the corresponding kernel OOM
> > handling mechanism.
> > 
> > It takes a trusted memcg pointer (or NULL for system-wide OOMs)
> > as an argument, as well as the page order.
> > 
> > Only one OOM can be declared and handled in the system at once,
> > so if the function is called in parallel to another OOM handling,
> > it bails out with -EBUSY.
> 
> This makes sense for the global OOM handler because concurrent handlers
> are cooperative. But is this really correct for memcg ooms which could
> happen for different hierarchies? Currently we do block on oom_lock in
> that case to make sure one oom doesn't starve others. Do we want the
> same behavior for custom OOM handlers?

It's a good point and I had similar thoughts when I was working on it.
But I think it's orthogonal to the customization of the oom handling.
Even for the existing oom killer it makes no sense to serialize memcg ooms
in independent memcg subtrees. But I'm worried about the dmesg reporting:
it can become really messy with 2+ concurrent OOMs.

Also, some memory can be shared, so one OOM can eliminate the need for another
OOM, even if they look independent.

So my conclusion here is to leave things as they are until we see signs
of real-world problems with the (lack of) concurrency between ooms.
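
For context, the concurrency behavior boils down to a trylock on
oom_lock, roughly (a simplified sketch of the kfunc, not the exact
patch code):

__bpf_kfunc int bpf_out_of_memory(struct mem_cgroup *memcg, int order)
{
	struct oom_control oc = {
		.memcg = memcg,
		.order = order,
	};

	if (!mutex_trylock(&oom_lock))
		return -EBUSY;

	out_of_memory(&oc);
	mutex_unlock(&oom_lock);
	return 0;
}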

Thanks


^ permalink raw reply	[flat|nested] 33+ messages in thread

* Re: [PATCH rfc 00/12] mm: BPF OOM
  2025-04-29 14:44   ` Roman Gushchin
@ 2025-04-29 21:56     ` Suren Baghdasaryan
  2025-04-29 22:17       ` Roman Gushchin
  2025-04-29 23:01     ` Suren Baghdasaryan
  1 sibling, 1 reply; 33+ messages in thread
From: Suren Baghdasaryan @ 2025-04-29 21:56 UTC (permalink / raw)
  To: Roman Gushchin
  Cc: Michal Hocko, linux-kernel, Andrew Morton, Alexei Starovoitov,
	Johannes Weiner, Shakeel Butt, David Rientjes, Josh Don,
	Chuyi Zhou, cgroups, linux-mm, bpf

On Tue, Apr 29, 2025 at 7:45 AM Roman Gushchin <roman.gushchin@linux.dev> wrote:
> [...]
> >
> > Also an important question. I can see selftests which are using the
> > infrastructure. But have you tried to implement a real OOM handler with
> > this proposed infrastructure?
>
> Not yet. Given the size and complexity of the infrastructure of my
> current employer, it's not a short process. But we're working on it.

Hi Roman,
This might end up being very useful for Android. Since we have a
shared current employer, we might be able to provide an early test
environment for this concept on Android and speed up development of a
real OOM handler. I'll be following the development of this patchset
and will see if we can come up with an early prototype for testing.

> [...]


^ permalink raw reply	[flat|nested] 33+ messages in thread

* Re: [PATCH rfc 00/12] mm: BPF OOM
  2025-04-29 21:56     ` Suren Baghdasaryan
@ 2025-04-29 22:17       ` Roman Gushchin
  0 siblings, 0 replies; 33+ messages in thread
From: Roman Gushchin @ 2025-04-29 22:17 UTC (permalink / raw)
  To: Suren Baghdasaryan
  Cc: Michal Hocko, linux-kernel, Andrew Morton, Alexei Starovoitov,
	Johannes Weiner, Shakeel Butt, David Rientjes, Josh Don,
	Chuyi Zhou, cgroups, linux-mm, bpf

On Tue, Apr 29, 2025 at 02:56:31PM -0700, Suren Baghdasaryan wrote:
> On Tue, Apr 29, 2025 at 7:45 AM Roman Gushchin <roman.gushchin@linux.dev> wrote:
> > [...]
> > > Also an important question. I can see selftests which are using the
> > > infrastructure. But have you tried to implement a real OOM handler with
> > > this proposed infrastructure?
> >
> > Not yet. Given the size and complexity of the infrastructure of my
> > current employer, it's not a short process. But we're working on it.
> 
> Hi Roman,
> This might end up being very useful for Android. Since we have a
> shared current employer, we might be able to provide an early test
> environment for this concept on Android and speed up development of a
> real OOM handler. I'll be following the development of this patchset
> and will see if we can come up with an early prototype for testing.

Hi Suren,

Sounds great, thank you!


^ permalink raw reply	[flat|nested] 33+ messages in thread

* Re: [PATCH rfc 00/12] mm: BPF OOM
  2025-04-28  3:36 [PATCH rfc 00/12] mm: BPF OOM Roman Gushchin
                   ` (13 preceding siblings ...)
  2025-04-29 11:42 ` Michal Hocko
@ 2025-04-29 22:44 ` Suren Baghdasaryan
  2025-04-29 23:01   ` Roman Gushchin
  14 siblings, 1 reply; 33+ messages in thread
From: Suren Baghdasaryan @ 2025-04-29 22:44 UTC (permalink / raw)
  To: Roman Gushchin
  Cc: linux-kernel, Andrew Morton, Alexei Starovoitov, Johannes Weiner,
	Michal Hocko, Shakeel Butt, David Rientjes, Josh Don, Chuyi Zhou,
	cgroups, linux-mm, bpf

On Sun, Apr 27, 2025 at 8:36 PM Roman Gushchin <roman.gushchin@linux.dev> wrote:
>
> This patchset adds the ability to customize the out-of-memory
> handling using bpf.
>
> It focuses on two parts:
> 1) OOM handling policy,
> 2) PSI-based OOM invocation.
>
> The idea to use bpf for customizing the OOM handling is not new, but
> unlike the previous proposal [1], which augmented the existing task
> ranking-based policy, this one tries to be as generic as possible and
> leverage the full power of the modern bpf.
>
> It provides a generic hook which is called before the existing OOM
> killer code and allows implementing any policy, e.g.  picking a victim
> task or memory cgroup or potentially even releasing memory in other
> ways, e.g. deleting tmpfs files (the last one might require some
> additional but relatively simple changes).
>
> The past attempt to implement memory-cgroup aware policy [2] showed
> that there are multiple opinions on what the best policy is.  As it's
> highly workload-dependent and specific to a concrete way of organizing
> workloads, the structure of the cgroup tree etc, a customizable
> bpf-based implementation is preferable over an in-kernel implementation
> with a dozen sysctls.
>
> The second part is related to the fundamental question on when to
> declare the OOM event. It's a trade-off between the risk of
> unnecessary OOM kills and associated work losses and the risk of
> infinite thrashing and effective soft lockups. In the last few years
> several PSI-based userspace solutions were developed (e.g. OOMd [3] or
> systemd-OOMd [4]). The common idea was to use userspace daemons to
> implement custom OOM logic as well as rely on PSI monitoring to avoid
> stalls. In this scenario the userspace daemon was supposed to handle
> the majority of OOMs, while the in-kernel OOM killer worked as the
> last resort measure to guarantee that the system would never deadlock
> on the memory. But this approach creates additional infrastructure
> churn: userspace OOM daemon is a separate entity which needs to be
> deployed, updated, monitored. A completely different pipeline needs to
> be built to monitor both types of OOM events and collect associated
> logs. A userspace daemon is more restricted in terms of what data is
> available to it. Implementing a daemon which can work reliably under
> heavy memory pressure in the system is also tricky.

I didn't read the whole patchset yet but want to mention a couple of
features that we should not forget:
- memory reaping. Maybe you already call oom_reap_task_mm() after the
BPF oom-handler kills a process, or maybe the BPF handler is expected
to implement it?
- kill reporting to userspace. I think the BPF handler would be
expected to implement it?

>
> [1]: https://lwn.net/ml/linux-kernel/20230810081319.65668-1-zhouchuyi@bytedance.com/
> [2]: https://lore.kernel.org/lkml/20171130152824.1591-1-guro@fb.com/
> [3]: https://github.com/facebookincubator/oomd
> [4]: https://www.freedesktop.org/software/systemd/man/latest/systemd-oomd.service.html
>
> ----
>
> This is an RFC version, which is not intended to be merged in the current form.
> Open questions/TODOs:
> 1) Program type/attachment type for the bpf_handle_out_of_memory() hook.
>    It has to be able to return a value, to be sleepable (to use cgroup iterators)
>    and to have trusted arguments to pass oom_control down to bpf_oom_kill_process().
>    The current patchset has a workaround (patch "bpf: treat fmodret tracing program's
>    arguments as trusted"), which is not safe. One option is to fake acquire/release
>    semantics for the oom_control pointer. Another option is to introduce a completely
>    new attachment or program type, similar to lsm hooks.
> 2) Currently lockdep complains about a potential circular dependency because
>    the sleepable bpf_handle_out_of_memory() hook calls might_fault() under oom_lock.
>    One way to fix it is to make it non-sleepable, but then it will require some
>    additional work to allow it to use cgroup iterators. It's intertwined with 1).
> 3) What kind of hierarchical features are required? Do we want to nest oom policies?
>    Do we want to attach oom policies to cgroups? I think it's too complicated,
>    but if we want a full hierarchical support, it might be required.
>    Patch "mm: introduce bpf_get_root_mem_cgroup() bpf kfunc" exposes the true root
>    memcg, which is potentially outside of the ns of the loading process. Does
>    it require some additional capabilities checks? Should it be removed?
> 4) Documentation is lacking and will be added in the next version.
>
>
> Roman Gushchin (12):
>   mm: introduce a bpf hook for OOM handling
>   bpf: mark struct oom_control's memcg field as TRUSTED_OR_NULL
>   bpf: treat fmodret tracing program's arguments as trusted
>   mm: introduce bpf_oom_kill_process() bpf kfunc
>   mm: introduce bpf kfuncs to deal with memcg pointers
>   mm: introduce bpf_get_root_mem_cgroup() bpf kfunc
>   bpf: selftests: introduce read_cgroup_file() helper
>   bpf: selftests: bpf OOM handler test
>   sched: psi: bpf hook to handle psi events
>   mm: introduce bpf_out_of_memory() bpf kfunc
>   bpf: selftests: introduce open_cgroup_file() helper
>   bpf: selftests: psi handler test
>
>  include/linux/memcontrol.h                   |   2 +
>  include/linux/oom.h                          |   5 +
>  kernel/bpf/btf.c                             |   9 +-
>  kernel/bpf/verifier.c                        |   5 +
>  kernel/sched/psi.c                           |  36 ++-
>  mm/Makefile                                  |   3 +
>  mm/bpf_memcontrol.c                          | 108 +++++++++
>  mm/oom_kill.c                                | 140 +++++++++++
>  tools/testing/selftests/bpf/cgroup_helpers.c |  67 ++++++
>  tools/testing/selftests/bpf/cgroup_helpers.h |   3 +
>  tools/testing/selftests/bpf/prog_tests/oom.c | 227 ++++++++++++++++++
>  tools/testing/selftests/bpf/prog_tests/psi.c | 234 +++++++++++++++++++
>  tools/testing/selftests/bpf/progs/test_oom.c | 103 ++++++++
>  tools/testing/selftests/bpf/progs/test_psi.c |  43 ++++
>  14 files changed, 983 insertions(+), 2 deletions(-)
>  create mode 100644 mm/bpf_memcontrol.c
>  create mode 100644 tools/testing/selftests/bpf/prog_tests/oom.c
>  create mode 100644 tools/testing/selftests/bpf/prog_tests/psi.c
>  create mode 100644 tools/testing/selftests/bpf/progs/test_oom.c
>  create mode 100644 tools/testing/selftests/bpf/progs/test_psi.c
>
> --
> 2.49.0.901.g37484f566f-goog
>



* Re: [PATCH rfc 00/12] mm: BPF OOM
  2025-04-29 22:44 ` Suren Baghdasaryan
@ 2025-04-29 23:01   ` Roman Gushchin
  0 siblings, 0 replies; 33+ messages in thread
From: Roman Gushchin @ 2025-04-29 23:01 UTC (permalink / raw)
  To: Suren Baghdasaryan
  Cc: linux-kernel, Andrew Morton, Alexei Starovoitov, Johannes Weiner,
	Michal Hocko, Shakeel Butt, David Rientjes, Josh Don, Chuyi Zhou,
	cgroups, linux-mm, bpf

On Tue, Apr 29, 2025 at 03:44:14PM -0700, Suren Baghdasaryan wrote:
> On Sun, Apr 27, 2025 at 8:36 PM Roman Gushchin <roman.gushchin@linux.dev> wrote:
> >
> > This patchset adds the ability to customize the out-of-memory
> > handling using bpf.
> >
> > It focuses on two parts:
> > 1) OOM handling policy,
> > 2) PSI-based OOM invocation.
> >
> > The idea to use bpf for customizing the OOM handling is not new, but
> > unlike the previous proposal [1], which augmented the existing task
> > ranking-based policy, this one tries to be as generic as possible and
> > leverage the full power of the modern bpf.
> >
> > It provides a generic hook which is called before the existing OOM
> > killer code and allows implementing any policy, e.g.  picking a victim
> > task or memory cgroup or potentially even releasing memory in other
> > ways, e.g. deleting tmpfs files (the last one might require some
> > additional but relatively simple changes).
> >
> > The past attempt to implement memory-cgroup aware policy [2] showed
> > that there are multiple opinions on what the best policy is.  As it's
> > highly workload-dependent and specific to a concrete way of organizing
> > workloads, the structure of the cgroup tree etc, a customizable
> > bpf-based implementation is preferable over an in-kernel implementation
> > with a dozen sysctls.
> >
> > The second part is related to the fundamental question on when to
> > declare the OOM event. It's a trade-off between the risk of
> > unnecessary OOM kills and associated work losses and the risk of
> > infinite thrashing and effective soft lockups. In the last few years
> > several PSI-based userspace solutions were developed (e.g. OOMd [3] or
> > systemd-OOMd [4]). The common idea was to use userspace daemons to
> > implement custom OOM logic as well as rely on PSI monitoring to avoid
> > stalls. In this scenario the userspace daemon was supposed to handle
> > the majority of OOMs, while the in-kernel OOM killer worked as the
> > last resort measure to guarantee that the system would never deadlock
> > on the memory. But this approach creates additional infrastructure
> > churn: userspace OOM daemon is a separate entity which needs to be
> > deployed, updated, monitored. A completely different pipeline needs to
> > be built to monitor both types of OOM events and collect associated
> > logs. A userspace daemon is more restricted in terms of what data is
> > available to it. Implementing a daemon which can work reliably under
> > heavy memory pressure in the system is also tricky.
> 
> I didn't read the whole patchset yet but want to mention a couple of
> features that we should not forget:
> - memory reaping. Maybe you already call oom_reap_task_mm() after the
> BPF oom-handler kills a process, or maybe the BPF handler is expected
> to implement it?
> - kill reporting to userspace. I think the BPF handler would be
> expected to implement it?

The patchset implements the bpf_oom_kill_process() helper, which kills
the desired process the same way as the kernel oom killer: with the help
of the oom reaper, dumping the corresponding stats into dmesg and
bumping the corresponding memcg- and system-level stats.

For additional reporting, generic bpf<->userspace interaction mechanisms
can be used, e.g. a ring buffer.
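
For illustration, a minimal sketch of such reporting from the handler
side (the map name and event layout here are made up):

	#include <vmlinux.h>
	#include <bpf/bpf_helpers.h>

	struct kill_event {
		int pid;
		unsigned long long memcg_id;
	};

	struct {
		__uint(type, BPF_MAP_TYPE_RINGBUF);
		__uint(max_entries, 4096);
	} kill_events SEC(".maps");

	/* called by the bpf OOM handler after a successful kill */
	static void report_kill(int pid, unsigned long long memcg_id)
	{
		struct kill_event *e;

		e = bpf_ringbuf_reserve(&kill_events, sizeof(*e), 0);
		if (!e)
			return;
		e->pid = pid;
		e->memcg_id = memcg_id;
		bpf_ringbuf_submit(e, 0);
	}

A userspace daemon can then consume these events via the libbpf
ring_buffer API without sitting on the memory allocation hot path.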

Thanks!



* Re: [PATCH rfc 00/12] mm: BPF OOM
  2025-04-29 14:44   ` Roman Gushchin
  2025-04-29 21:56     ` Suren Baghdasaryan
@ 2025-04-29 23:01     ` Suren Baghdasaryan
  1 sibling, 0 replies; 33+ messages in thread
From: Suren Baghdasaryan @ 2025-04-29 23:01 UTC (permalink / raw)
  To: Roman Gushchin
  Cc: Michal Hocko, linux-kernel, Andrew Morton, Alexei Starovoitov,
	Johannes Weiner, Shakeel Butt, David Rientjes, Josh Don,
	Chuyi Zhou, cgroups, linux-mm, bpf

On Tue, Apr 29, 2025 at 7:45 AM Roman Gushchin <roman.gushchin@linux.dev> wrote:
>
> Michal Hocko <mhocko@suse.com> writes:
>
> > On Mon 28-04-25 03:36:05, Roman Gushchin wrote:
> >> This patchset adds the ability to customize the out-of-memory
> >> handling using bpf.
> >>
> >> It focuses on two parts:
> >> 1) OOM handling policy,
> >> 2) PSI-based OOM invocation.
> >>
> >> The idea to use bpf for customizing the OOM handling is not new, but
> >> unlike the previous proposal [1], which augmented the existing task
> >> ranking-based policy, this one tries to be as generic as possible and
> >> leverage the full power of the modern bpf.
> >>
> >> It provides a generic hook which is called before the existing OOM
> >> killer code and allows implementing any policy, e.g.  picking a victim
> >> task or memory cgroup or potentially even releasing memory in other
> >> ways, e.g. deleting tmpfs files (the last one might require some
> >> additional but relatively simple changes).
> >
> > Makes sense to me. I still have a slight concern though. We have 3
> > different oom handlers smashed into a single one with special casing
> > involved. This is manageable (although not great) for the in-kernel
> > code but I am wondering whether we should do better for BPF-based OOM
> > implementations. Would it make sense to have different callbacks for
> > cpuset, memcg and global oom killer handlers?
>
> Yes, it's certainly possible. If we go the struct_ops path, we can even
> have both a common hook which handles all types of OOMs and separate
> hooks for each type. The user then can choose what's more convenient.
> Good point.
>
> >
> > I can see you have already added some helper functions to deal with
> > memcgs but I do not see anything to iterate processes or find a process to
> > kill etc. Is that functionality generally available (sorry, I am not
> > all that familiar with BPF, so please bear with me)?
>
> Yes, task iterator is available since v6.7:
> https://docs.ebpf.io/linux/kfuncs/bpf_iter_task_new/
>
> >
> > I like the way you naturally hooked into existing OOM primitives
> > like oom_kill_process, but I do not see tsk_is_oom_victim exposed. Are
> > you waiting for a first user that needs to implement oom victim
> > synchronization or do you plan to integrate that into task iterators?
>
> It can be implemented in bpf directly, but I agree that it probably
> deserves at least an example in the test or a separate in-kernel helper.
> An in-kernel helper is probably a better idea.
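>
> E.g. something as simple as this would do (a hypothetical sketch, not
> part of this series; the kfunc name is made up):
>
> 	/* e.g. in mm/oom_kill.c, next to bpf_oom_kill_process(),
> 	 * plus the usual registration in the kfunc id set */
> 	__bpf_kfunc bool bpf_task_is_oom_victim(struct task_struct *task)
> 	{
> 		return tsk_is_oom_victim(task);
> 	}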
>
> > I am mostly asking because it is exactly these kinds of details that
> > make the current in-kernel oom handler quite complex and it would be
> > great if custom ones do not have to reproduce that complexity and only
> > focus on the high-level policy.
>
> Totally agree.
>
> >
> >> The second part is related to the fundamental question on when to
> >> declare the OOM event. It's a trade-off between the risk of
> >> unnecessary OOM kills and associated work losses and the risk of
> >> infinite thrashing and effective soft lockups. In the last few years
> >> several PSI-based userspace solutions were developed (e.g. OOMd [3] or
> >> systemd-OOMd [4]). The common idea was to use userspace daemons to
> >> implement custom OOM logic as well as rely on PSI monitoring to avoid
> >> stalls.
> >
> > This makes sense to me as well. I have to admit I am not fully familiar
> > with PSI integration into sched code but from what I can see the
> > evaluation is done on a regular basis from the worker context kicked off
> > from the scheduler code. There shouldn't be any locking constraints, which
> > is good. Is there any risk if the oom handler took too long though?
>
> It's a good question. In theory yes, it can affect the timing of other
> PSI events. An option here is to move it into a separate work item, however
> I'm not sure if it's worth the added complexity. I actually tried this
> approach in an earlier version of this patchset, but the problem was
> that the code for scheduling this work has to be dynamically turned
> on/off when a bpf program is attached/detached, otherwise it adds an
> obvious cpu overhead.
> It's doable, but I don't know if it's justified.

I think this is a legitimate concern. bpf_handle_psi_event() can block
update_triggers() and delay other PSI triggers.
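
One way around it might be to defer the callback to a dedicated work
item, gated by a static key flipped on attach/detach (a rough sketch;
the names are made up and trigger lifetime handling is omitted):

	static struct psi_trigger *bpf_psi_pending;
	static DEFINE_STATIC_KEY_FALSE(bpf_psi_attached);

	static void bpf_psi_work_fn(struct work_struct *work)
	{
		struct psi_trigger *t = xchg(&bpf_psi_pending, NULL);

		if (t)
			bpf_handle_psi_event(t);
	}
	static DECLARE_WORK(bpf_psi_work, bpf_psi_work_fn);

	/* called from update_triggers() instead of the direct call */
	static void bpf_psi_kick(struct psi_trigger *t)
	{
		if (!static_branch_unlikely(&bpf_psi_attached))
			return;
		if (!cmpxchg(&bpf_psi_pending, NULL, t))
			schedule_work(&bpf_psi_work);
	}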

>
> >
> > Also an important question. I can see selftests which are using the
> > infrastructure. But have you tried to implement a real OOM handler with
> > this proposed infrastructure?
>
> Not yet. Given the size and complexity of the infrastructure of my
> current employer, it's not a short process. But we're working on it.
>
> >
> >> [1]: https://lwn.net/ml/linux-kernel/20230810081319.65668-1-zhouchuyi@bytedance.com/
> >> [2]: https://lore.kernel.org/lkml/20171130152824.1591-1-guro@fb.com/
> >> [3]: https://github.com/facebookincubator/oomd
> >> [4]: https://www.freedesktop.org/software/systemd/man/latest/systemd-oomd.service.html
> >>
> >> ----
> >>
> >> This is an RFC version, which is not intended to be merged in the current form.
> >> Open questions/TODOs:
> >> 1) Program type/attachment type for the bpf_handle_out_of_memory() hook.
> >>    It has to be able to return a value, to be sleepable (to use cgroup iterators)
> >>    and to have trusted arguments to pass oom_control down to bpf_oom_kill_process().
> >>    The current patchset has a workaround (patch "bpf: treat fmodret tracing program's
> >>    arguments as trusted"), which is not safe. One option is to fake acquire/release
> >>    semantics for the oom_control pointer. Another option is to introduce a completely
> >>    new attachment or program type, similar to lsm hooks.
> >> 2) Currently lockdep complains about a potential circular dependency because
> >>    the sleepable bpf_handle_out_of_memory() hook calls might_fault() under oom_lock.
> >>    One way to fix it is to make it non-sleepable, but then it will require some
> >>    additional work to allow it to use cgroup iterators. It's intertwined with 1).
> >
> > I cannot see this in the code. Could you be more specific please? Where
> > is this might_fault coming from? Is this a BPF constraint?
>
> It's in __bpf_prog_enter_sleepable(). But I hope I can make this hook
> non-sleepable (by going the struct_ops path) and the problem will go away.
>
> >
> >> 3) What kind of hierarchical features are required? Do we want to nest oom policies?
> >>    Do we want to attach oom policies to cgroups? I think it's too complicated,
> >>    but if we want a full hierarchical support, it might be required.
> >>    Patch "mm: introduce bpf_get_root_mem_cgroup() bpf kfunc" exposes the true root
> >>    memcg, which is potentially outside of the ns of the loading process. Does
> >>    it require some additional capabilities checks? Should it be removed?
> >
> > Yes, let's start simple and see where we get from there.
>
> Agree.
>
> Thank you for taking a look and your comments/ideas!
>



* Re: [PATCH rfc 09/12] sched: psi: bpf hook to handle psi events
  2025-04-28  3:36 ` [PATCH rfc 09/12] sched: psi: bpf hook to handle psi events Roman Gushchin
  2025-04-28  6:11   ` kernel test robot
@ 2025-04-30  0:28   ` Suren Baghdasaryan
  2025-04-30  0:58     ` Roman Gushchin
  1 sibling, 1 reply; 33+ messages in thread
From: Suren Baghdasaryan @ 2025-04-30  0:28 UTC (permalink / raw)
  To: Roman Gushchin
  Cc: linux-kernel, Andrew Morton, Alexei Starovoitov, Johannes Weiner,
	Michal Hocko, Shakeel Butt, David Rientjes, Josh Don, Chuyi Zhou,
	cgroups, linux-mm, bpf

On Sun, Apr 27, 2025 at 8:37 PM Roman Gushchin <roman.gushchin@linux.dev> wrote:
>
> Introduce a bpf hook to handle psi events. The primary intended
> purpose of this hook is to declare OOM events based on reaching
> a certain memory pressure level, similar to what systemd-oomd and oomd
> are doing in userspace.

It's a bit awkward that this requires additional userspace action to
create PSI triggers. I have almost no experience with BPF, so this
might be a stupid question, but maybe we could provide a bpf kfunc for
the BPF handler to register its PSI trigger(s) upon handler
registration?


>
> Signed-off-by: Roman Gushchin <roman.gushchin@linux.dev>
> ---
>  kernel/sched/psi.c | 36 +++++++++++++++++++++++++++++++++++-
>  1 file changed, 35 insertions(+), 1 deletion(-)
>
> diff --git a/kernel/sched/psi.c b/kernel/sched/psi.c
> index 1396674fa722..4c4eb4ead8f6 100644
> --- a/kernel/sched/psi.c
> +++ b/kernel/sched/psi.c
> @@ -176,6 +176,32 @@ static void psi_avgs_work(struct work_struct *work);
>
>  static void poll_timer_fn(struct timer_list *t);
>
> +#ifdef CONFIG_BPF_SYSCALL
> +__bpf_hook_start();
> +
> +__weak noinline int bpf_handle_psi_event(struct psi_trigger *t)
> +{
> +       return 0;
> +}
> +
> +__bpf_hook_end();
> +
> +BTF_KFUNCS_START(bpf_psi_hooks)
> +BTF_ID_FLAGS(func, bpf_handle_psi_event, KF_SLEEPABLE)
> +BTF_KFUNCS_END(bpf_psi_hooks)
> +
> +static const struct btf_kfunc_id_set bpf_psi_hook_set = {
> +       .owner = THIS_MODULE,
> +       .set   = &bpf_psi_hooks,
> +};
> +
> +#else
> +static inline int bpf_handle_psi_event(struct psi_trigger *t)
> +{
> +       return 0;
> +}
> +#endif
> +
>  static void group_init(struct psi_group *group)
>  {
>         int cpu;
> @@ -489,6 +515,7 @@ static void update_triggers(struct psi_group *group, u64 now,
>
>                 /* Generate an event */
>                 if (cmpxchg(&t->event, 0, 1) == 0) {
> +                       bpf_handle_psi_event(t);
>                         if (t->of)
>                                 kernfs_notify(t->of->kn);
>                         else
> @@ -1655,6 +1682,8 @@ static const struct proc_ops psi_irq_proc_ops = {
>
>  static int __init psi_proc_init(void)
>  {
> +       int err = 0;
> +
>         if (psi_enable) {
>                 proc_mkdir("pressure", NULL);
>                 proc_create("pressure/io", 0666, NULL, &psi_io_proc_ops);
> @@ -1662,9 +1691,14 @@ static int __init psi_proc_init(void)
>                 proc_create("pressure/cpu", 0666, NULL, &psi_cpu_proc_ops);
>  #ifdef CONFIG_IRQ_TIME_ACCOUNTING
>                 proc_create("pressure/irq", 0666, NULL, &psi_irq_proc_ops);
> +#endif
> +#ifdef CONFIG_BPF_SYSCALL
> +               err = register_btf_fmodret_id_set(&bpf_psi_hook_set);
> +               if (err)
> +                       pr_err("error while registering bpf psi hooks: %d", err);
>  #endif
>         }
> -       return 0;
> +       return err;
>  }
>  module_init(psi_proc_init);
>
> --
> 2.49.0.901.g37484f566f-goog
>



* Re: [PATCH rfc 09/12] sched: psi: bpf hook to handle psi events
  2025-04-30  0:28   ` Suren Baghdasaryan
@ 2025-04-30  0:58     ` Roman Gushchin
  0 siblings, 0 replies; 33+ messages in thread
From: Roman Gushchin @ 2025-04-30  0:58 UTC (permalink / raw)
  To: Suren Baghdasaryan
  Cc: linux-kernel, Andrew Morton, Alexei Starovoitov, Johannes Weiner,
	Michal Hocko, Shakeel Butt, David Rientjes, Josh Don, Chuyi Zhou,
	cgroups, linux-mm, bpf

On Tue, Apr 29, 2025 at 05:28:59PM -0700, Suren Baghdasaryan wrote:
> On Sun, Apr 27, 2025 at 8:37 PM Roman Gushchin <roman.gushchin@linux.dev> wrote:
> >
> > Introduce a bpf hook to handle psi events. The primary intended
> > purpose of this hook is to declare OOM events based on reaching
> > a certain memory pressure level, similar to what systemd-oomd and oomd
> > are doing in userspace.
> 
> It's a bit awkward that this requires additional userspace action to
> create PSI triggers. I have almost no experience with BPF, so this
> might be a stupid question, but maybe we could provide a bpf kfunc for
> the BPF handler to register its PSI trigger(s) upon handler
> registration?

It looks like it's doable using the struct_ops path: the .init callback
can create psi triggers and "attach" them to the loaded bpf program.
But I need to figure out the details.
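
Roughly something like this (purely hypothetical, just to illustrate
the shape; none of these names exist yet):

	struct bpf_psi_ops {
		/* create psi trigger(s) and bind them to the program */
		int (*init)(void);
		/* invoked when a bound trigger fires */
		int (*handle_psi_event)(struct psi_trigger *t);
		/* tear the trigger(s) down on detach */
		void (*release)(void);
	};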

Good point, thank you!



* Re: [PATCH rfc 10/12] mm: introduce bpf_out_of_memory() bpf kfunc
  2025-04-29 21:31     ` Roman Gushchin
@ 2025-04-30  7:27       ` Michal Hocko
  2025-04-30 14:53         ` Roman Gushchin
  0 siblings, 1 reply; 33+ messages in thread
From: Michal Hocko @ 2025-04-30  7:27 UTC (permalink / raw)
  To: Roman Gushchin
  Cc: linux-kernel, Andrew Morton, Alexei Starovoitov, Johannes Weiner,
	Shakeel Butt, Suren Baghdasaryan, David Rientjes, Josh Don,
	Chuyi Zhou, cgroups, linux-mm, bpf

On Tue 29-04-25 21:31:35, Roman Gushchin wrote:
> On Tue, Apr 29, 2025 at 01:46:07PM +0200, Michal Hocko wrote:
> > On Mon 28-04-25 03:36:15, Roman Gushchin wrote:
> > > Introduce the bpf_out_of_memory() bpf kfunc, which allows declaring
> > > an out-of-memory event and triggering the corresponding kernel OOM
> > > handling mechanism.
> > > 
> > > It takes a trusted memcg pointer (or NULL for system-wide OOMs)
> > > as an argument, as well as the page order.
> > > 
> > > Only one OOM can be declared and handled in the system at once,
> > > so if the function is called in parallel to another OOM handling,
> > > it bails out with -EBUSY.
> > 
> > This makes sense for the global OOM handler because concurrent handlers
> > are cooperative. But is this really correct for memcg ooms which could
> > happen for different hierarchies? Currently we do block on oom_lock in
> > that case to make sure one oom doesn't starve others. Do we want the
> > same behavior for custom OOM handlers?
> 
> It's a good point and I had similar thoughts when I was working on it.
> But I think it's orthogonal to the customization of the oom handling.
> Even for the existing oom killer it makes no sense to serialize memcg ooms
> in independent memcg subtrees. But I'm worried about the dmesg reporting,
> it can become really messy for 2+ concurrent OOMs.
> 
> Also, some memory can be shared, so one OOM can eliminate a need for another
> OOM, even if they look independent.
> 
> So my conclusion here is to leave things as they are until we get signs
> of real-world problems with the (lack of) concurrency between ooms.

How do we learn about that happening though? I do not think we have any
counters to watch to suspect that some oom handlers cannot run.
-- 
Michal Hocko
SUSE Labs



* Re: [PATCH rfc 10/12] mm: introduce bpf_out_of_memory() bpf kfunc
  2025-04-30  7:27       ` Michal Hocko
@ 2025-04-30 14:53         ` Roman Gushchin
  2025-05-05  8:08           ` Michal Hocko
  0 siblings, 1 reply; 33+ messages in thread
From: Roman Gushchin @ 2025-04-30 14:53 UTC (permalink / raw)
  To: Michal Hocko
  Cc: linux-kernel, Andrew Morton, Alexei Starovoitov, Johannes Weiner,
	Shakeel Butt, Suren Baghdasaryan, David Rientjes, Josh Don,
	Chuyi Zhou, cgroups, linux-mm, bpf

On Wed, Apr 30, 2025 at 09:27:39AM +0200, Michal Hocko wrote:
> On Tue 29-04-25 21:31:35, Roman Gushchin wrote:
> > On Tue, Apr 29, 2025 at 01:46:07PM +0200, Michal Hocko wrote:
> > > On Mon 28-04-25 03:36:15, Roman Gushchin wrote:
> > > > Introduce the bpf_out_of_memory() bpf kfunc, which allows declaring
> > > > an out-of-memory event and triggering the corresponding kernel OOM
> > > > handling mechanism.
> > > > 
> > > > It takes a trusted memcg pointer (or NULL for system-wide OOMs)
> > > > as an argument, as well as the page order.
> > > > 
> > > > Only one OOM can be declared and handled in the system at once,
> > > > so if the function is called in parallel to another OOM handling,
> > > > it bails out with -EBUSY.
> > > 
> > > This makes sense for the global OOM handler because concurrent handlers
> > > are cooperative. But is this really correct for memcg ooms which could
> > > happen for different hierarchies? Currently we do block on oom_lock in
> > > that case to make sure one oom doesn't starve others. Do we want the
> > > same behavior for custom OOM handlers?
> > 
> > It's a good point and I had similar thoughts when I was working on it.
> > But I think it's orthogonal to the customization of the oom handling.
> > Even for the existing oom killer it makes no sense to serialize memcg ooms
> > in independent memcg subtrees. But I'm worried about the dmesg reporting,
> > it can become really messy for 2+ concurrent OOMs.
> > 
> > Also, some memory can be shared, so one OOM can eliminate a need for another
> > OOM, even if they look independent.
> > 
> > So my conclusion here is to leave things as they are until we get signs
> > of real-world problems with the (lack of) concurrency between ooms.
> 
> How do we learn about that happening though? I do not think we have any
> counters to watch to suspect that some oom handlers cannot run.

The bpf program which declares an OOM can handle this: e.g. retry, wait
and retry, etc. We can also try to mimic the existing behavior and wait
on oom_lock (potentially splitting it into multiple locks to support
concurrent ooms in various memcgs). Do you think that's preferable?
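
E.g. a bounded retry on the declaring side could look like this (a
sketch; bpf_out_of_memory() is the kfunc from this patchset, the
bpf_for macro comes from bpf_experimental.h, and the retry policy is
made up):

	int i, ret = -EBUSY;

	/* memcg is a trusted pointer obtained earlier, or NULL */
	bpf_for(i, 0, 5) {
		ret = bpf_out_of_memory(memcg, 0);
		if (ret != -EBUSY)
			break;
	}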



* Re: [PATCH rfc 00/12] mm: BPF OOM
  2025-04-29  1:56     ` Kumar Kartikeya Dwivedi
  2025-04-29 15:42       ` Roman Gushchin
@ 2025-05-02 17:26       ` Song Liu
  1 sibling, 0 replies; 33+ messages in thread
From: Song Liu @ 2025-05-02 17:26 UTC (permalink / raw)
  To: Kumar Kartikeya Dwivedi
  Cc: Roman Gushchin, Matt Bobrowski, linux-kernel, Andrew Morton,
	Alexei Starovoitov, Johannes Weiner, Michal Hocko, Shakeel Butt,
	Suren Baghdasaryan, David Rientjes, Josh Don, Chuyi Zhou,
	cgroups, linux-mm, bpf

On Mon, Apr 28, 2025 at 6:57 PM Kumar Kartikeya Dwivedi
<memxor@gmail.com> wrote:
[...]
> >
> > It's certainly an option and I thought about it. I don't think we need a bunch
> > of hooks though. This patchset adds two, and they belong to completely different
> > subsystems (mm and sched/psi), so I don't know how well they can be gathered
> > into a single struct_ops. But maybe it's fine.
> >
> > The only potentially new hook I can envision now is one to customize
> > the oom reporting.
> >
>
> If you're considering scoping it down to a particular cgroup (as you
> allude to in the TODO), or building a hierarchical interface, using
> struct_ops will be much better than fmod_ret etc., which is global in
> nature. Even if you don't support it now. I don't think a struct_ops
> is warranted only when you have more than a few callbacks. As an
> illustration, sched_ext started out without supporting hierarchical
> attachment, but will piggy-back on the struct_ops interface to do so
> in the near future.

+1 for using struct_ops, which is the best way to enable BPF in
existing use cases.

Song



* Re: [PATCH rfc 10/12] mm: introduce bpf_out_of_memory() bpf kfunc
  2025-04-30 14:53         ` Roman Gushchin
@ 2025-05-05  8:08           ` Michal Hocko
  0 siblings, 0 replies; 33+ messages in thread
From: Michal Hocko @ 2025-05-05  8:08 UTC (permalink / raw)
  To: Roman Gushchin
  Cc: linux-kernel, Andrew Morton, Alexei Starovoitov, Johannes Weiner,
	Shakeel Butt, Suren Baghdasaryan, David Rientjes, Josh Don,
	Chuyi Zhou, cgroups, linux-mm, bpf

On Wed 30-04-25 14:53:50, Roman Gushchin wrote:
> On Wed, Apr 30, 2025 at 09:27:39AM +0200, Michal Hocko wrote:
> > On Tue 29-04-25 21:31:35, Roman Gushchin wrote:
> > > On Tue, Apr 29, 2025 at 01:46:07PM +0200, Michal Hocko wrote:
> > > > On Mon 28-04-25 03:36:15, Roman Gushchin wrote:
> > > > > Introduce the bpf_out_of_memory() bpf kfunc, which allows declaring
> > > > > an out-of-memory event and triggering the corresponding kernel OOM
> > > > > handling mechanism.
> > > > > 
> > > > > It takes a trusted memcg pointer (or NULL for system-wide OOMs)
> > > > > as an argument, as well as the page order.
> > > > > 
> > > > > Only one OOM can be declared and handled in the system at once,
> > > > > so if the function is called in parallel to another OOM handling,
> > > > > it bails out with -EBUSY.
> > > > 
> > > > This makes sense for the global OOM handler because concurrent handlers
> > > > are cooperative. But is this really correct for memcg ooms which could
> > > > happen for different hierarchies? Currently we do block on oom_lock in
> > > > that case to make sure one oom doesn't starve others. Do we want the
> > > > same behavior for custom OOM handlers?
> > > 
> > > It's a good point and I had similar thoughts when I was working on it.
> > > But I think it's orthogonal to the customization of the oom handling.
> > > Even for the existing oom killer it makes no sense to serialize memcg ooms
> > > in independent memcg subtrees. But I'm worried about the dmesg reporting,
> > > it can become really messy for 2+ concurrent OOMs.
> > > 
> > > Also, some memory can be shared, so one OOM can eliminate a need for another
> > > OOM, even if they look independent.
> > > 
> > > So my conclusion here is to leave things as they are until we get signs
> > > of real-world problems with the (lack of) concurrency between ooms.
> > 
> > How do we learn about that happening though? I do not think we have any
> > counters to watch to suspect that some oom handlers cannot run.
> 
> The bpf program which declares an OOM can handle this: e.g. retry, wait
> and retry, etc. We can also try to mimic the existing behavior and wait
> on oom_lock (potentially splitting it into multiple locks to support
> concurrent ooms in various memcgs). Do you think that's preferable?

Yes, I would just provide different callbacks for global and memcg ooms
and do the blocking for the latter. It will be consistent with the
in-kernel implementation (and therefore less surprising behavior).
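
I.e. something along these lines (hypothetical, just to illustrate the
split; the names are made up):

	struct bpf_oom_ops {
		/* cooperative: bails out if another oom is in flight */
		int (*handle_global_oom)(struct oom_control *oc);
		/* blocks on oom_lock, like the in-kernel memcg oom path */
		int (*handle_memcg_oom)(struct oom_control *oc);
	};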

-- 
Michal Hocko
SUSE Labs



end of thread

Thread overview: 33+ messages
2025-04-28  3:36 [PATCH rfc 00/12] mm: BPF OOM Roman Gushchin
2025-04-28  3:36 ` [PATCH rfc 01/12] mm: introduce a bpf hook for OOM handling Roman Gushchin
2025-04-28  3:36 ` [PATCH rfc 02/12] bpf: mark struct oom_control's memcg field as TRUSTED_OR_NULL Roman Gushchin
2025-04-28  3:36 ` [PATCH rfc 03/12] bpf: treat fmodret tracing program's arguments as trusted Roman Gushchin
2025-04-28  3:36 ` [PATCH rfc 04/12] mm: introduce bpf_oom_kill_process() bpf kfunc Roman Gushchin
2025-04-28  3:36 ` [PATCH rfc 05/12] mm: introduce bpf kfuncs to deal with memcg pointers Roman Gushchin
2025-04-28  3:36 ` [PATCH rfc 06/12] mm: introduce bpf_get_root_mem_cgroup() bpf kfunc Roman Gushchin
2025-04-28  3:36 ` [PATCH rfc 07/12] bpf: selftests: introduce read_cgroup_file() helper Roman Gushchin
2025-04-28  3:36 ` [PATCH rfc 08/12] bpf: selftests: bpf OOM handler test Roman Gushchin
2025-04-28  3:36 ` [PATCH rfc 09/12] sched: psi: bpf hook to handle psi events Roman Gushchin
2025-04-28  6:11   ` kernel test robot
2025-04-30  0:28   ` Suren Baghdasaryan
2025-04-30  0:58     ` Roman Gushchin
2025-04-28  3:36 ` [PATCH rfc 10/12] mm: introduce bpf_out_of_memory() bpf kfunc Roman Gushchin
2025-04-29 11:46   ` Michal Hocko
2025-04-29 21:31     ` Roman Gushchin
2025-04-30  7:27       ` Michal Hocko
2025-04-30 14:53         ` Roman Gushchin
2025-05-05  8:08           ` Michal Hocko
2025-04-28  3:36 ` [PATCH rfc 11/12] bpf: selftests: introduce open_cgroup_file() helper Roman Gushchin
2025-04-28  3:36 ` [PATCH rfc 12/12] bpf: selftests: psi handler test Roman Gushchin
2025-04-28 10:43 ` [PATCH rfc 00/12] mm: BPF OOM Matt Bobrowski
2025-04-28 17:24   ` Roman Gushchin
2025-04-29  1:56     ` Kumar Kartikeya Dwivedi
2025-04-29 15:42       ` Roman Gushchin
2025-05-02 17:26       ` Song Liu
2025-04-29 11:42 ` Michal Hocko
2025-04-29 14:44   ` Roman Gushchin
2025-04-29 21:56     ` Suren Baghdasaryan
2025-04-29 22:17       ` Roman Gushchin
2025-04-29 23:01     ` Suren Baghdasaryan
2025-04-29 22:44 ` Suren Baghdasaryan
2025-04-29 23:01   ` Roman Gushchin
