* [RFC PATCH bpf-next v6 00/12] mm: memcontrol: Add BPF hooks for memory controller
From: Hui Zhu @ 2026-02-04 8:56 UTC (permalink / raw)
To: Andrew Morton, Johannes Weiner, Michal Hocko, Roman Gushchin,
Shakeel Butt, Muchun Song, Alexei Starovoitov, Daniel Borkmann,
Andrii Nakryiko, Martin KaFai Lau, Eduard Zingerman, Song Liu,
Yonghong Song, John Fastabend, KP Singh, Stanislav Fomichev,
Hao Luo, Jiri Olsa, Shuah Khan, Peter Zijlstra, Miguel Ojeda,
Nathan Chancellor, Kees Cook, Tejun Heo, Jeff Xu, mkoutny,
Jan Hendrik Farr, Christian Brauner, Randy Dunlap, Brian Gerst,
Masahiro Yamada, davem, Jakub Kicinski, Jesper Dangaard Brouer,
JP Kobryn, Willem de Bruijn, Jason Xing, Paul Chaignon,
Anton Protopopov, Amery Hung, Chen Ridong, Lance Yang,
Jiayuan Chen, linux-kernel, linux-mm, cgroups, bpf, netdev,
linux-kselftest
Cc: Hui Zhu
From: Hui Zhu <zhuhui@kylinos.cn>
Changelog:
v6:
Based on the bot+bpf-ci comments, fixed the following issues:
Added a fast-path check with unlikely() before SRCU lock acquisition to
optimize the no-BPF case in BPF_MEMCG_CALL.
Added a missing newline in the pr_warn() message in bpf_memcontrol_init().
Added comprehensive child process exit status checking with WIFEXITED()
and WEXITSTATUS(), and added zombie process prevention in
real_test_memcg_ops.
Changed malloc() to calloc() for BSS data allocation in all test
functions and the samples main function.
Changed srcu_read_lock(&memcg_bpf_srcu) to
lockdep_assert_held(&cgroup_mutex) in memcontrol_bpf_online()
and memcontrol_bpf_offline().
v5:
Based on the bot+bpf-ci comments, fixed the following issues:
Fixed issues in memcg_ops.c and memcg.bpf.c by moving the variable
declaration to the beginning of the need_threshold() function; the
'u64 current_ts' variable must be declared before any executable
statements.
Improved input validation in samples/bpf/memcg.c by adding a new
parse_u64() helper function. This function properly handles errors
from strtoull() and provides better error messages when parsing the
threshold and over_high_ms command-line arguments.
Moved the check for prog->sleepable after validating member offsets in
bpf_memcg_ops_check_member in mm/bpf_memcontrol.c.
Fixed the sscanf() return value check in prog_tests/memcg_ops.c.
Changed the condition from 'sscanf() < 0' to 'sscanf() != 1' because
sscanf() returns the number of successfully matched items, not a
negative value on error. This makes the test more reliable when
reading timing data from temporary files.
v4:
Fixed the issues according to the comments from bot+bpf-ci.
According to JP Kobryn's comments, moved exit(0) from
real_test_memcg_ops_child_work to real_test_memcg_ops.
Fixed issues in the bpf_memcg_ops_reg function.
v3:
According to the comments from Michal Koutný and Chen Ridong, updated
the hooks to get_high_delay_ms, below_low, below_min,
handle_cgroup_online, and handle_cgroup_offline.
According to Michal Koutný's comments, added BPF_F_ALLOW_OVERRIDE
support to memcg_bpf_ops.
v2:
According to Tejun Heo's comments, rebased on Roman Gushchin's BPF
OOM patch series [1] and added hierarchical delegation support.
According to the comments from Roman Gushchin and Michal Hocko, designed
concrete use case scenarios and provided test results.
The eBPF infrastructure provides rich visibility into system
performance metrics through various tracepoints and statistics.
This patch series introduces BPF struct_ops for the memory controller,
so that an eBPF program can help the system steer the memory controller
based on those metrics, improving the utilization of system memory
while ensuring memory limits are respected.
The following example illustrates how memcg eBPF can improve memory
utilization in some scenarios.
The example runs on an x86_64 QEMU guest (10 CPUs, 4 GB RAM), using a
file in tmpfs on the host as the swap device to reduce I/O impact.
root@ubuntu:~# cat /proc/sys/vm/swappiness
60
This is the high-priority memcg.
root@ubuntu:~# mkdir /sys/fs/cgroup/high
This is the low-priority memcg.
root@ubuntu:~# mkdir /sys/fs/cgroup/low
root@ubuntu:~# free
               total        used        free      shared  buff/cache   available
Mem:         4007276      392320     3684940         908      101476     3614956
Swap:       10485756           0    10485756
First, the following test uses memory.low to reduce the likelihood that
tasks in the high-priority memory cgroup are reclaimed.
root@ubuntu:~# echo $((3 * 1024 * 1024 * 1024)) > /sys/fs/cgroup/high/memory.low
root@ubuntu:~# cgexec -g memory:low stress-ng --vm 4 --vm-keep \
--vm-bytes $((3 * 1024 * 1024 * 1024)) \
--vm-method all --seed 2025 --metrics -t 60 \
& cgexec -g memory:high stress-ng --vm 4 --vm-keep \
--vm-bytes $((3 * 1024 * 1024 * 1024)) \
--vm-method all --seed 2025 --metrics -t 60
[1] 1176
stress-ng: info: [1177] setting to a 1 min, 0 secs run per stressor
stress-ng: info: [1176] setting to a 1 min, 0 secs run per stressor
stress-ng: info: [1177] dispatching hogs: 4 vm
stress-ng: info: [1176] dispatching hogs: 4 vm
stress-ng: metrc: [1177] stressor bogo ops real time usr time sys time bogo ops/s bogo ops/s CPU used per RSS Max
stress-ng: metrc: [1177] (secs) (secs) (secs) (real time) (usr+sys time) instance (%) (KB)
stress-ng: metrc: [1177] vm 27047770 60.07 217.79 8.87 450289.91 119330.63 94.34 886936
stress-ng: info: [1177] skipped: 0
stress-ng: info: [1177] passed: 4: vm (4)
stress-ng: info: [1177] failed: 0
stress-ng: info: [1177] metrics untrustworthy: 0
stress-ng: info: [1177] successful run completed in 1 min, 0.07 secs
stress-ng: metrc: [1176] stressor bogo ops real time usr time sys time bogo ops/s bogo ops/s CPU used per RSS Max
stress-ng: metrc: [1176] (secs) (secs) (secs) (real time) (usr+sys time) instance (%) (KB)
stress-ng: metrc: [1176] vm 679754 60.12 11.82 72.78 11307.18 8034.42 35.18 469884
stress-ng: info: [1176] skipped: 0
stress-ng: info: [1176] passed: 4: vm (4)
stress-ng: info: [1176] failed: 0
stress-ng: info: [1176] metrics untrustworthy: 0
stress-ng: info: [1176] successful run completed in 1 min, 0.13 secs
[1]+ Done cgexec -g memory:low stress-ng --vm 4 --vm-keep --vm-bytes $((3 * 1024 * 1024 * 1024)) --vm-method all --seed 2025 --metrics -t 60
The following test continues to use memory.low to reduce the likelihood
of tasks in high-priority memory cgroups (memcg) being reclaimed.
In this scenario, a Python script within the high-priority memcg simulates
a low-load task.
As a result, the Python script's performance is not affected by memory
reclamation (it sleeps after allocating memory).
However, the performance of stress-ng is still impacted by the
memory.low setting.
root@ubuntu:~# cgexec -g memory:low stress-ng --vm 4 --vm-keep \
--vm-bytes $((3 * 1024 * 1024 * 1024)) \
--vm-method all --seed 2025 --metrics -t 60 \
& cgexec -g memory:high python3 -c \
"import time; a = bytearray(3*1024*1024*1024); time.sleep(62)"
[1] 1196
stress-ng: info: [1196] setting to a 1 min, 0 secs run per stressor
stress-ng: info: [1196] dispatching hogs: 4 vm
stress-ng: metrc: [1196] stressor bogo ops real time usr time sys time bogo ops/s bogo ops/s CPU used per RSS Max
stress-ng: metrc: [1196] (secs) (secs) (secs) (real time) (usr+sys time) instance (%) (KB)
stress-ng: metrc: [1196] vm 886893 60.10 17.76 56.61 14756.92 11925.69 30.94 788676
stress-ng: info: [1196] skipped: 0
stress-ng: info: [1196] passed: 4: vm (4)
stress-ng: info: [1196] failed: 0
stress-ng: info: [1196] metrics untrustworthy: 0
stress-ng: info: [1196] successful run completed in 1 min, 0.10 secs
[1]+ Done cgexec -g memory:low stress-ng --vm 4 --vm-keep --vm-bytes $((3 * 1024 * 1024 * 1024)) --vm-method all --seed 2025 --metrics -t 60
root@ubuntu:~# echo 0 > /sys/fs/cgroup/high/memory.low
Now, we switch to using the memcg eBPF program for memory priority control.
memcg is a test program added to samples/bpf in this patch series.
It loads memcg.bpf.c into the kernel.
memcg.bpf.c monitors PGFAULT events in the high-priority memory cgroup.
When the number of events triggered within one second exceeds a
predefined threshold, the eBPF hooks for the memory cgroup enforce
their control for the next second.
The following command configures the high-priority memory cgroup to
return below_min during memory reclamation if the number of PGFAULT
events per second exceeds one.
root@ubuntu:~# ./memcg --low_path=/sys/fs/cgroup/low \
--high_path=/sys/fs/cgroup/high \
--threshold=1 --use_below_min
Successfully attached!
root@ubuntu:~# cgexec -g memory:low stress-ng --vm 4 --vm-keep \
--vm-bytes $((3 * 1024 * 1024 * 1024)) \
--vm-method all --seed 2025 --metrics -t 60 \
& cgexec -g memory:high stress-ng --vm 4 --vm-keep \
--vm-bytes $((3 * 1024 * 1024 * 1024)) \
--vm-method all --seed 2025 --metrics -t 60
[1] 1220
stress-ng: info: [1220] setting to a 1 min, 0 secs run per stressor
stress-ng: info: [1221] setting to a 1 min, 0 secs run per stressor
stress-ng: info: [1220] dispatching hogs: 4 vm
stress-ng: info: [1221] dispatching hogs: 4 vm
stress-ng: metrc: [1221] stressor bogo ops real time usr time sys time bogo ops/s bogo ops/s CPU used per RSS Max
stress-ng: metrc: [1221] (secs) (secs) (secs) (real time) (usr+sys time) instance (%) (KB)
stress-ng: metrc: [1221] vm 24295240 60.08 221.36 7.64 404392.49 106095.60 95.29 886684
stress-ng: info: [1221] skipped: 0
stress-ng: info: [1221] passed: 4: vm (4)
stress-ng: info: [1221] failed: 0
stress-ng: info: [1221] metrics untrustworthy: 0
stress-ng: info: [1221] successful run completed in 1 min, 0.11 secs
stress-ng: metrc: [1220] stressor bogo ops real time usr time sys time bogo ops/s bogo ops/s CPU used per RSS Max
stress-ng: metrc: [1220] (secs) (secs) (secs) (real time) (usr+sys time) instance (%) (KB)
stress-ng: metrc: [1220] vm 685732 60.13 11.69 75.98 11403.88 7822.30 36.45 496496
stress-ng: info: [1220] skipped: 0
stress-ng: info: [1220] passed: 4: vm (4)
stress-ng: info: [1220] failed: 0
stress-ng: info: [1220] metrics untrustworthy: 0
stress-ng: info: [1220] successful run completed in 1 min, 0.14 secs
[1]+ Done cgexec -g memory:low stress-ng --vm 4 --vm-keep --vm-bytes $((3 * 1024 * 1024 * 1024)) --vm-method all --seed 2025 --metrics -t 60
This test demonstrates that because the Python process within the
high-priority memory cgroup is sleeping after memory allocation,
no page fault events occur.
As a result, the stress-ng process in the low-priority memory cgroup
achieves normal memory performance.
root@ubuntu:~# cgexec -g memory:low stress-ng --vm 4 --vm-keep \
--vm-bytes $((3 * 1024 * 1024 * 1024)) \
--vm-method all --seed 2025 --metrics -t 60 \
& cgexec -g memory:high python3 -c \
"import time; a = bytearray(3*1024*1024*1024); time.sleep(62)"
[1] 1238
stress-ng: info: [1238] setting to a 1 min, 0 secs run per stressor
stress-ng: info: [1238] dispatching hogs: 4 vm
stress-ng: metrc: [1238] stressor bogo ops real time usr time sys time bogo ops/s bogo ops/s CPU used per RSS Max
stress-ng: metrc: [1238] (secs) (secs) (secs) (real time) (usr+sys time) instance (%) (KB)
stress-ng: metrc: [1238] vm 33107485 60.08 205.41 13.19 551082.91 151448.44 90.97 886064
stress-ng: info: [1238] skipped: 0
stress-ng: info: [1238] passed: 4: vm (4)
stress-ng: info: [1238] failed: 0
stress-ng: info: [1238] metrics untrustworthy: 0
stress-ng: info: [1238] successful run completed in 1 min, 0.09 secs
In this patch series, I've incorporated a portion of Roman's patch in
[1] to ensure the entire series can be compiled cleanly with bpf-next.
I made some modifications to bpf_struct_ops_link_create
in "bpf: Pass flags in bpf_link_create for struct_ops" and
"libbpf: Support passing user-defined flags for struct_ops" to allow
the flags parameter to be passed into the kernel.
With this change, patch "mm/bpf: Add BPF_F_ALLOW_OVERRIDE support for
memcg_bpf_ops" enables BPF_F_ALLOW_OVERRIDE support for memcg_bpf_ops.
Patch "mm: memcontrol: Add BPF struct_ops for memory controller"
introduces BPF struct_ops support to the memory controller, enabling
custom and dynamic control over memory pressure. This is achieved
through a new struct_ops type, `memcg_bpf_ops`.
The `memcg_bpf_ops` struct provides the following hooks:
- `get_high_delay_ms`: Returns a custom throttling delay in
milliseconds for a cgroup that has breached its `memory.high`
limit. This is the primary mechanism for BPF-driven throttling.
- `below_low`: Overrides the `memory.low` protection check. If this
hook returns true, the cgroup is considered to be protected by its
`memory.low` setting, regardless of its actual usage.
- `below_min`: Similar to `below_low`, this overrides the `memory.min`
protection check.
- `handle_cgroup_online`/`offline`: Callbacks invoked when a cgroup
with an attached program comes online or goes offline, allowing for
state management.
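As a rough illustration, a minimal BPF-side implementation of these
hooks could look like the sketch below. This is not the sample shipped
with the series (that is memcg.bpf.c, added later); the fixed 10 ms
throttle policy is purely illustrative.

  /* A minimal sketch, assuming a vmlinux.h generated from a kernel
   * with this series applied.
   */
  #include "vmlinux.h"
  #include <bpf/bpf_helpers.h>
  #include <bpf/bpf_tracing.h>

  char _license[] SEC("license") = "GPL";

  SEC("struct_ops/get_high_delay_ms")
  unsigned int BPF_PROG(get_high_delay_ms, struct mem_cgroup *memcg)
  {
          return 10;      /* throttle delay in milliseconds */
  }

  SEC("struct_ops/below_low")
  bool BPF_PROG(below_low, struct mem_cgroup *memcg)
  {
          return false;   /* fall back to the regular memory.low check */
  }

  SEC(".struct_ops.link")
  struct memcg_bpf_ops pressure_ops = {
          .get_high_delay_ms = (void *)get_high_delay_ms,
          .below_low = (void *)below_low,
  };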
Patch "samples/bpf: Add memcg priority control example" introduces
the programs memcg.c and memcg.bpf.c that were used in the previous
examples.
[1] https://lore.kernel.org/lkml/20251027231727.472628-1-roman.gushchin@linux.dev/
Hui Zhu (7):
bpf: Pass flags in bpf_link_create for struct_ops
libbpf: Support passing user-defined flags for struct_ops
mm: memcontrol: Add BPF struct_ops for memory controller
selftests/bpf: Add tests for memcg_bpf_ops
mm/bpf: Add BPF_F_ALLOW_OVERRIDE support for memcg_bpf_ops
selftests/bpf: Add test for memcg_bpf_ops hierarchies
samples/bpf: Add memcg priority control example
Roman Gushchin (5):
bpf: move bpf_struct_ops_link into bpf.h
bpf: initial support for attaching struct ops to cgroups
bpf: mark struct oom_control's memcg field as TRUSTED_OR_NULL
mm: define mem_cgroup_get_from_ino() outside of CONFIG_SHRINKER_DEBUG
libbpf: introduce bpf_map__attach_struct_ops_opts()
MAINTAINERS | 4 +
include/linux/bpf.h | 8 +
include/linux/memcontrol.h | 122 +++-
kernel/bpf/bpf_struct_ops.c | 22 +-
kernel/bpf/verifier.c | 5 +
mm/bpf_memcontrol.c | 281 +++++++-
mm/memcontrol.c | 35 +-
samples/bpf/.gitignore | 1 +
samples/bpf/Makefile | 8 +-
samples/bpf/memcg.bpf.c | 130 ++++
samples/bpf/memcg.c | 343 ++++++++++
tools/include/uapi/linux/bpf.h | 2 +-
tools/lib/bpf/bpf.c | 8 +
tools/lib/bpf/libbpf.c | 19 +-
tools/lib/bpf/libbpf.h | 14 +
tools/lib/bpf/libbpf.map | 1 +
.../selftests/bpf/prog_tests/memcg_ops.c | 626 ++++++++++++++++++
tools/testing/selftests/bpf/progs/memcg_ops.c | 130 ++++
18 files changed, 1732 insertions(+), 27 deletions(-)
create mode 100644 samples/bpf/memcg.bpf.c
create mode 100644 samples/bpf/memcg.c
create mode 100644 tools/testing/selftests/bpf/prog_tests/memcg_ops.c
create mode 100644 tools/testing/selftests/bpf/progs/memcg_ops.c
--
2.43.0
* [RFC PATCH bpf-next v6 01/12] bpf: move bpf_struct_ops_link into bpf.h
From: Hui Zhu @ 2026-02-04 8:56 UTC (permalink / raw)
From: Roman Gushchin <roman.gushchin@linux.dev>
Move struct bpf_struct_ops_link's definition into bpf.h, where the
other custom bpf link definitions are.
It's necessary to access its members from outside of the generic
bpf_struct_ops implementation, which will be done by the following
patches in the series.
Signed-off-by: Roman Gushchin <roman.gushchin@linux.dev>
---
include/linux/bpf.h | 6 ++++++
kernel/bpf/bpf_struct_ops.c | 6 ------
2 files changed, 6 insertions(+), 6 deletions(-)
diff --git a/include/linux/bpf.h b/include/linux/bpf.h
index 4427c6e98331..899dd911dc82 100644
--- a/include/linux/bpf.h
+++ b/include/linux/bpf.h
@@ -1891,6 +1891,12 @@ struct bpf_raw_tp_link {
u64 cookie;
};
+struct bpf_struct_ops_link {
+ struct bpf_link link;
+ struct bpf_map __rcu *map;
+ wait_queue_head_t wait_hup;
+};
+
struct bpf_link_primer {
struct bpf_link *link;
struct file *file;
diff --git a/kernel/bpf/bpf_struct_ops.c b/kernel/bpf/bpf_struct_ops.c
index c43346cb3d76..de01cf3025b3 100644
--- a/kernel/bpf/bpf_struct_ops.c
+++ b/kernel/bpf/bpf_struct_ops.c
@@ -55,12 +55,6 @@ struct bpf_struct_ops_map {
struct bpf_struct_ops_value kvalue;
};
-struct bpf_struct_ops_link {
- struct bpf_link link;
- struct bpf_map __rcu *map;
- wait_queue_head_t wait_hup;
-};
-
static DEFINE_MUTEX(update_mutex);
#define VALUE_PREFIX "bpf_struct_ops_"
--
2.43.0
* [RFC PATCH bpf-next v6 02/12] bpf: initial support for attaching struct ops to cgroups
From: Hui Zhu @ 2026-02-04 8:56 UTC (permalink / raw)
From: Roman Gushchin <roman.gushchin@linux.dev>
When a struct ops is being attached and a bpf link is created,
allow passing a cgroup fd via bpf attr, so that the struct ops
can be attached to a cgroup instead of globally.
The attached struct ops doesn't hold a reference to the cgroup;
it only preserves the cgroup id.
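From userspace, the syscall-level usage after this change is roughly
the following sketch (existing union bpf_attr fields; the fds are
hypothetical):

  union bpf_attr attr = {};

  attr.link_create.map_fd = map_fd;       /* struct_ops map */
  attr.link_create.attach_type = BPF_STRUCT_OPS;
  /* reuses the existing cgroup mprog field to carry the cgroup fd;
   * 0 keeps the old global-attach behavior
   */
  attr.link_create.cgroup.relative_fd = cgroup_fd;
  link_fd = syscall(__NR_bpf, BPF_LINK_CREATE, &attr, sizeof(attr));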
Signed-off-by: Roman Gushchin <roman.gushchin@linux.dev>
---
include/linux/bpf.h | 1 +
kernel/bpf/bpf_struct_ops.c | 15 +++++++++++++++
2 files changed, 16 insertions(+)
diff --git a/include/linux/bpf.h b/include/linux/bpf.h
index 899dd911dc82..720055d1dbce 100644
--- a/include/linux/bpf.h
+++ b/include/linux/bpf.h
@@ -1895,6 +1895,7 @@ struct bpf_struct_ops_link {
struct bpf_link link;
struct bpf_map __rcu *map;
wait_queue_head_t wait_hup;
+ u64 cgroup_id;
};
struct bpf_link_primer {
diff --git a/kernel/bpf/bpf_struct_ops.c b/kernel/bpf/bpf_struct_ops.c
index de01cf3025b3..c807793e7633 100644
--- a/kernel/bpf/bpf_struct_ops.c
+++ b/kernel/bpf/bpf_struct_ops.c
@@ -13,6 +13,7 @@
#include <linux/btf_ids.h>
#include <linux/rcupdate_wait.h>
#include <linux/poll.h>
+#include <linux/cgroup.h>
struct bpf_struct_ops_value {
struct bpf_struct_ops_common_value common;
@@ -1377,6 +1378,20 @@ int bpf_struct_ops_link_create(union bpf_attr *attr)
}
bpf_link_init(&link->link, BPF_LINK_TYPE_STRUCT_OPS, &bpf_struct_ops_map_lops, NULL,
attr->link_create.attach_type);
+#ifdef CONFIG_CGROUPS
+ if (attr->link_create.cgroup.relative_fd) {
+ struct cgroup *cgrp;
+
+ cgrp = cgroup_get_from_fd(attr->link_create.cgroup.relative_fd);
+ if (IS_ERR(cgrp)) {
+ err = PTR_ERR(cgrp);
+ goto err_out;
+ }
+
+ link->cgroup_id = cgroup_id(cgrp);
+ cgroup_put(cgrp);
+ }
+#endif /* CONFIG_CGROUPS */
err = bpf_link_prime(&link->link, &link_primer);
if (err)
--
2.43.0
* [RFC PATCH bpf-next v6 03/12] bpf: mark struct oom_control's memcg field as TRUSTED_OR_NULL
From: Hui Zhu @ 2026-02-04 8:56 UTC (permalink / raw)
Cc: Kumar Kartikeya Dwivedi
From: Roman Gushchin <roman.gushchin@linux.dev>
Struct oom_control is used to describe the OOM context.
Its memcg field defines the scope of the OOM: it's NULL for global
OOMs and a valid memcg pointer for memcg-scoped OOMs.
Teach the bpf verifier to recognize it as a trusted-or-NULL pointer.
This provides the bpf OOM handler with a trusted memcg pointer,
which is required, for example, for iterating over the memcg's subtree.
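In practice this means a handler can dereference the field after a
NULL check, along the lines of the sketch below (the hook name and
signature are borrowed from the BPF OOM series [1] and are not part
of this patch):

  SEC("struct_ops.s/handle_out_of_memory")
  int BPF_PROG(handle_oom, struct oom_control *oc)
  {
          struct mem_cgroup *memcg = oc->memcg;

          if (!memcg)
                  return 0;       /* global OOM, no memcg scope */

          /* memcg is now a trusted pointer; it can be passed to
           * kfuncs or used to walk the memcg subtree.
           */
          return 0;
  }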
Signed-off-by: Roman Gushchin <roman.gushchin@linux.dev>
Acked-by: Kumar Kartikeya Dwivedi <memxor@gmail.com>
---
kernel/bpf/verifier.c | 5 +++++
1 file changed, 5 insertions(+)
diff --git a/kernel/bpf/verifier.c b/kernel/bpf/verifier.c
index c2f2650db9fd..cca36edb460d 100644
--- a/kernel/bpf/verifier.c
+++ b/kernel/bpf/verifier.c
@@ -7242,6 +7242,10 @@ BTF_TYPE_SAFE_TRUSTED_OR_NULL(struct vm_area_struct) {
struct file *vm_file;
};
+BTF_TYPE_SAFE_TRUSTED_OR_NULL(struct oom_control) {
+ struct mem_cgroup *memcg;
+};
+
static bool type_is_rcu(struct bpf_verifier_env *env,
struct bpf_reg_state *reg,
const char *field_name, u32 btf_id)
@@ -7284,6 +7288,7 @@ static bool type_is_trusted_or_null(struct bpf_verifier_env *env,
BTF_TYPE_EMIT(BTF_TYPE_SAFE_TRUSTED_OR_NULL(struct socket));
BTF_TYPE_EMIT(BTF_TYPE_SAFE_TRUSTED_OR_NULL(struct dentry));
BTF_TYPE_EMIT(BTF_TYPE_SAFE_TRUSTED_OR_NULL(struct vm_area_struct));
+ BTF_TYPE_EMIT(BTF_TYPE_SAFE_TRUSTED_OR_NULL(struct oom_control));
return btf_nested_type_is_trusted(&env->log, reg, field_name, btf_id,
"__safe_trusted_or_null");
--
2.43.0
* [RFC PATCH bpf-next v6 04/12] mm: define mem_cgroup_get_from_ino() outside of CONFIG_SHRINKER_DEBUG
From: Hui Zhu @ 2026-02-04 8:56 UTC (permalink / raw)
From: Roman Gushchin <roman.gushchin@linux.dev>
mem_cgroup_get_from_ino() can be reused by the BPF OOM implementation,
but currently depends on CONFIG_SHRINKER_DEBUG. Remove this dependency.
Signed-off-by: Roman Gushchin <roman.gushchin@linux.dev>
---
include/linux/memcontrol.h | 4 ++--
mm/memcontrol.c | 2 --
2 files changed, 2 insertions(+), 4 deletions(-)
diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h
index 229ac9835adb..f3b8c71870d8 100644
--- a/include/linux/memcontrol.h
+++ b/include/linux/memcontrol.h
@@ -833,9 +833,9 @@ static inline unsigned long mem_cgroup_ino(struct mem_cgroup *memcg)
{
return memcg ? cgroup_ino(memcg->css.cgroup) : 0;
}
+#endif
struct mem_cgroup *mem_cgroup_get_from_ino(unsigned long ino);
-#endif
static inline struct mem_cgroup *mem_cgroup_from_seq(struct seq_file *m)
{
@@ -1298,12 +1298,12 @@ static inline unsigned long mem_cgroup_ino(struct mem_cgroup *memcg)
{
return 0;
}
+#endif
static inline struct mem_cgroup *mem_cgroup_get_from_ino(unsigned long ino)
{
return NULL;
}
-#endif
static inline struct mem_cgroup *mem_cgroup_from_seq(struct seq_file *m)
{
diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index 3808845bc8cc..1f74fce27677 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -3658,7 +3658,6 @@ struct mem_cgroup *mem_cgroup_from_id(unsigned short id)
return xa_load(&mem_cgroup_ids, id);
}
-#ifdef CONFIG_SHRINKER_DEBUG
struct mem_cgroup *mem_cgroup_get_from_ino(unsigned long ino)
{
struct cgroup *cgrp;
@@ -3679,7 +3678,6 @@ struct mem_cgroup *mem_cgroup_get_from_ino(unsigned long ino)
return memcg;
}
-#endif
static void free_mem_cgroup_per_node_info(struct mem_cgroup_per_node *pn)
{
--
2.43.0
* [RFC PATCH bpf-next v6 05/12] libbpf: introduce bpf_map__attach_struct_ops_opts()
From: Hui Zhu @ 2026-02-04 8:56 UTC (permalink / raw)
From: Roman Gushchin <roman.gushchin@linux.dev>
Introduce bpf_map__attach_struct_ops_opts(), an extended version of
bpf_map__attach_struct_ops() that takes an additional struct
bpf_struct_ops_opts argument.
struct bpf_struct_ops_opts has a relative_fd member, which allows
passing an additional file descriptor argument. It can be used to
attach struct_ops maps to cgroups.
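A usage sketch (the skeleton and fd names are hypothetical):

  LIBBPF_OPTS(bpf_struct_ops_opts, opts,
              .relative_fd = cgroup_fd); /* e.g. fd of /sys/fs/cgroup/high */

  link = bpf_map__attach_struct_ops_opts(skel->maps.memcg_ops, &opts);
  if (!link)
          goto err;       /* errno is set by libbpf */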
Signed-off-by: Roman Gushchin <roman.gushchin@linux.dev>
---
tools/lib/bpf/bpf.c | 8 ++++++++
tools/lib/bpf/libbpf.c | 18 ++++++++++++++++--
tools/lib/bpf/libbpf.h | 14 ++++++++++++++
tools/lib/bpf/libbpf.map | 1 +
4 files changed, 39 insertions(+), 2 deletions(-)
diff --git a/tools/lib/bpf/bpf.c b/tools/lib/bpf/bpf.c
index 5846de364209..84a53c594f48 100644
--- a/tools/lib/bpf/bpf.c
+++ b/tools/lib/bpf/bpf.c
@@ -884,6 +884,14 @@ int bpf_link_create(int prog_fd, int target_fd,
if (!OPTS_ZEROED(opts, cgroup))
return libbpf_err(-EINVAL);
break;
+ case BPF_STRUCT_OPS:
+ relative_fd = OPTS_GET(opts, cgroup.relative_fd, 0);
+ attr.link_create.cgroup.relative_fd = relative_fd;
+ attr.link_create.cgroup.expected_revision =
+ OPTS_GET(opts, cgroup.expected_revision, 0);
+ if (!OPTS_ZEROED(opts, cgroup))
+ return libbpf_err(-EINVAL);
+ break;
default:
if (!OPTS_ZEROED(opts, flags))
return libbpf_err(-EINVAL);
diff --git a/tools/lib/bpf/libbpf.c b/tools/lib/bpf/libbpf.c
index 0c8bf0b5cce4..70a00da54ff5 100644
--- a/tools/lib/bpf/libbpf.c
+++ b/tools/lib/bpf/libbpf.c
@@ -13462,12 +13462,19 @@ static int bpf_link__detach_struct_ops(struct bpf_link *link)
return close(link->fd);
}
-struct bpf_link *bpf_map__attach_struct_ops(const struct bpf_map *map)
+struct bpf_link *bpf_map__attach_struct_ops_opts(const struct bpf_map *map,
+ const struct bpf_struct_ops_opts *opts)
{
+ DECLARE_LIBBPF_OPTS(bpf_link_create_opts, link_opts);
struct bpf_link_struct_ops *link;
__u32 zero = 0;
int err, fd;
+ if (!OPTS_VALID(opts, bpf_struct_ops_opts)) {
+ pr_warn("map '%s': invalid opts\n", map->name);
+ return libbpf_err_ptr(-EINVAL);
+ }
+
if (!bpf_map__is_struct_ops(map)) {
pr_warn("map '%s': can't attach non-struct_ops map\n", map->name);
return libbpf_err_ptr(-EINVAL);
@@ -13503,7 +13510,9 @@ struct bpf_link *bpf_map__attach_struct_ops(const struct bpf_map *map)
return &link->link;
}
- fd = bpf_link_create(map->fd, 0, BPF_STRUCT_OPS, NULL);
+ link_opts.cgroup.relative_fd = OPTS_GET(opts, relative_fd, 0);
+
+ fd = bpf_link_create(map->fd, 0, BPF_STRUCT_OPS, &link_opts);
if (fd < 0) {
free(link);
return libbpf_err_ptr(fd);
@@ -13515,6 +13524,11 @@ struct bpf_link *bpf_map__attach_struct_ops(const struct bpf_map *map)
return &link->link;
}
+struct bpf_link *bpf_map__attach_struct_ops(const struct bpf_map *map)
+{
+ return bpf_map__attach_struct_ops_opts(map, NULL);
+}
+
/*
* Swap the back struct_ops of a link with a new struct_ops map.
*/
diff --git a/tools/lib/bpf/libbpf.h b/tools/lib/bpf/libbpf.h
index dfc37a615578..5aef44bcfcc2 100644
--- a/tools/lib/bpf/libbpf.h
+++ b/tools/lib/bpf/libbpf.h
@@ -920,6 +920,20 @@ bpf_program__attach_cgroup_opts(const struct bpf_program *prog, int cgroup_fd,
struct bpf_map;
LIBBPF_API struct bpf_link *bpf_map__attach_struct_ops(const struct bpf_map *map);
+
+struct bpf_struct_ops_opts {
+ /* size of this struct, for forward/backward compatibility */
+ size_t sz;
+ __u32 flags;
+ __u32 relative_fd;
+ __u64 expected_revision;
+ size_t :0;
+};
+#define bpf_struct_ops_opts__last_field expected_revision
+
+LIBBPF_API struct bpf_link *
+bpf_map__attach_struct_ops_opts(const struct bpf_map *map,
+ const struct bpf_struct_ops_opts *opts);
LIBBPF_API int bpf_link__update_map(struct bpf_link *link, const struct bpf_map *map);
struct bpf_iter_attach_opts {
diff --git a/tools/lib/bpf/libbpf.map b/tools/lib/bpf/libbpf.map
index d18fbcea7578..4779190c97b6 100644
--- a/tools/lib/bpf/libbpf.map
+++ b/tools/lib/bpf/libbpf.map
@@ -454,4 +454,5 @@ LIBBPF_1.7.0 {
bpf_prog_assoc_struct_ops;
bpf_program__assoc_struct_ops;
btf__permute;
+ bpf_map__attach_struct_ops_opts;
} LIBBPF_1.6.0;
--
2.43.0
* [RFC PATCH bpf-next v6 06/12] bpf: Pass flags in bpf_link_create for struct_ops
From: Hui Zhu @ 2026-02-04 8:56 UTC (permalink / raw)
Cc: Hui Zhu, Geliang Tang
From: Hui Zhu <zhuhui@kylinos.cn>
To support features like allowing overrides in cgroup hierarchies,
we need a way to pass flags from userspace to the kernel when
attaching a struct_ops.
Extend `bpf_struct_ops_link` to include a `flags` field. This field
is populated from `attr->link_create.flags` during link creation. This
will allow struct_ops implementations, such as the upcoming memory
controller ops, to interpret these flags and modify their attachment
behavior accordingly.
UAPI Change:
This patch updates the comment in include/uapi/linux/bpf.h to reflect
that the cgroup-bpf attach flags (such as BPF_F_ALLOW_OVERRIDE) are
now applicable to both BPF_PROG_ATTACH and BPF_LINK_CREATE commands.
Previously, these flags were only documented for BPF_PROG_ATTACH.
The actual flag definitions remain unchanged, so this is a compatible
extension of the existing API. Older userspace will continue to work
(by not passing flags), and newer userspace can opt-in to the new
functionality by setting appropriate flags.
Signed-off-by: Geliang Tang <geliang@kernel.org>
Signed-off-by: Hui Zhu <zhuhui@kylinos.cn>
---
include/linux/bpf.h | 1 +
kernel/bpf/bpf_struct_ops.c | 1 +
tools/include/uapi/linux/bpf.h | 2 +-
3 files changed, 3 insertions(+), 1 deletion(-)
diff --git a/include/linux/bpf.h b/include/linux/bpf.h
index 720055d1dbce..13c933cfc614 100644
--- a/include/linux/bpf.h
+++ b/include/linux/bpf.h
@@ -1896,6 +1896,7 @@ struct bpf_struct_ops_link {
struct bpf_map __rcu *map;
wait_queue_head_t wait_hup;
u64 cgroup_id;
+ u32 flags;
};
struct bpf_link_primer {
diff --git a/kernel/bpf/bpf_struct_ops.c b/kernel/bpf/bpf_struct_ops.c
index c807793e7633..0df608c88403 100644
--- a/kernel/bpf/bpf_struct_ops.c
+++ b/kernel/bpf/bpf_struct_ops.c
@@ -1392,6 +1392,7 @@ int bpf_struct_ops_link_create(union bpf_attr *attr)
cgroup_put(cgrp);
}
#endif /* CONFIG_CGROUPS */
+ link->flags = attr->link_create.flags;
err = bpf_link_prime(&link->link, &link_primer);
if (err)
diff --git a/tools/include/uapi/linux/bpf.h b/tools/include/uapi/linux/bpf.h
index 3ca7d76e05f0..4e1c5d6d91ae 100644
--- a/tools/include/uapi/linux/bpf.h
+++ b/tools/include/uapi/linux/bpf.h
@@ -1185,7 +1185,7 @@ enum bpf_perf_event_type {
BPF_PERF_EVENT_EVENT = 6,
};
-/* cgroup-bpf attach flags used in BPF_PROG_ATTACH command
+/* cgroup-bpf attach flags used in BPF_PROG_ATTACH and BPF_LINK_CREATE command
*
* NONE(default): No further bpf programs allowed in the subtree.
*
--
2.43.0
* [RFC PATCH bpf-next v6 07/12] libbpf: Support passing user-defined flags for struct_ops
From: Hui Zhu @ 2026-02-04 9:00 UTC (permalink / raw)
Cc: Hui Zhu, Geliang Tang
From: Hui Zhu <zhuhui@kylinos.cn>
Building on the previous change that added flags to the kernel's link
creation path, this patch exposes this functionality through libbpf.
The `bpf_struct_ops_opts` struct is extended with a `flags` member,
which is then passed to the `bpf_link_create` syscall within
`bpf_map__attach_struct_ops_opts`.
This enables userspace applications to pass flags, such as
`BPF_F_ALLOW_OVERRIDE`, when attaching struct_ops to cgroups,
providing more control over the attachment behavior in nested
hierarchies.
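With this change, allowing override in a subtree becomes (a sketch
with hypothetical names, building on the previous patch's example):

  LIBBPF_OPTS(bpf_struct_ops_opts, opts,
              .relative_fd = cgroup_fd,
              .flags = BPF_F_ALLOW_OVERRIDE);

  link = bpf_map__attach_struct_ops_opts(skel->maps.memcg_ops, &opts);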
Signed-off-by: Geliang Tang <geliang@kernel.org>
Signed-off-by: Hui Zhu <zhuhui@kylinos.cn>
---
tools/lib/bpf/libbpf.c | 1 +
1 file changed, 1 insertion(+)
diff --git a/tools/lib/bpf/libbpf.c b/tools/lib/bpf/libbpf.c
index 70a00da54ff5..06c936bad211 100644
--- a/tools/lib/bpf/libbpf.c
+++ b/tools/lib/bpf/libbpf.c
@@ -13511,6 +13511,7 @@ struct bpf_link *bpf_map__attach_struct_ops_opts(const struct bpf_map *map,
}
link_opts.cgroup.relative_fd = OPTS_GET(opts, relative_fd, 0);
+ link_opts.flags = OPTS_GET(opts, flags, 0);
fd = bpf_link_create(map->fd, 0, BPF_STRUCT_OPS, &link_opts);
if (fd < 0) {
--
2.43.0
* [RFC PATCH bpf-next v6 08/12] mm: memcontrol: Add BPF struct_ops for memory controller
From: Hui Zhu @ 2026-02-04 9:00 UTC (permalink / raw)
Cc: Hui Zhu, Geliang Tang
From: Hui Zhu <zhuhui@kylinos.cn>
Introduce BPF struct_ops support to the memory controller, enabling
custom and dynamic control over memory pressure. This is achieved
through a new struct_ops type, `memcg_bpf_ops`.
This new interface allows a BPF program to implement hooks that
influence a memory cgroup's behavior. The `memcg_bpf_ops` struct
provides the following hooks:
- `get_high_delay_ms`: Returns a custom throttling delay in
milliseconds for a cgroup that has breached its `memory.high`
limit. This is the primary mechanism for BPF-driven throttling.
- `below_low`: Overrides the `memory.low` protection check. If this
hook returns true, the cgroup is considered to be protected by its
`memory.low` setting, regardless of its actual usage.
- `below_min`: Similar to `below_low`, this overrides the `memory.min`
protection check.
- `handle_cgroup_online`/`offline`: Callbacks invoked when a cgroup
with an attached program comes online or goes offline, allowing for
state management.
This patch integrates these hooks into the core memory control logic.
The `get_high_delay_ms` value is incorporated into charge paths like
`try_charge_memcg` and the high-limit handler
`__mem_cgroup_handle_over_high`. The `below_low` and `below_min`
hooks are checked within their respective protection functions.
Lifecycle management is handled to ensure BPF programs are correctly
inherited by child cgroups and cleaned up on detachment. SRCU is used
to protect concurrent access to the `memcg->bpf_ops` pointer.
Signed-off-by: Geliang Tang <geliang@kernel.org>
Signed-off-by: Hui Zhu <zhuhui@kylinos.cn>
---
include/linux/memcontrol.h | 117 +++++++++++++++++-
mm/bpf_memcontrol.c | 247 ++++++++++++++++++++++++++++++++++++-
mm/memcontrol.c | 33 +++--
3 files changed, 384 insertions(+), 13 deletions(-)
diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h
index f3b8c71870d8..d91dbb95069b 100644
--- a/include/linux/memcontrol.h
+++ b/include/linux/memcontrol.h
@@ -23,6 +23,7 @@
#include <linux/writeback.h>
#include <linux/page-flags.h>
#include <linux/shrinker.h>
+#include <linux/srcu.h>
struct mem_cgroup;
struct obj_cgroup;
@@ -181,6 +182,37 @@ struct obj_cgroup {
};
};
+#ifdef CONFIG_BPF_SYSCALL
+/**
+ * struct memcg_bpf_ops - BPF callbacks for memory cgroup operations
+ * @handle_cgroup_online: Called when a cgroup comes online
+ * @handle_cgroup_offline: Called when a cgroup goes offline
+ * @below_low: Override memory.low protection check. If this callback returns
+ * true, mem_cgroup_below_low() will return true immediately without
+ * performing the standard comparison. If it returns false, the
+ * original memory.low threshold comparison will proceed normally.
+ * @below_min: Override memory.min protection check. If this callback returns
+ * true, mem_cgroup_below_min() will return true immediately without
+ * performing the standard comparison. If it returns false, the
+ * original memory.min threshold comparison will proceed normally.
+ * @get_high_delay_ms: Return custom throttle delay in milliseconds
+ *
+ * This structure defines the interface for BPF programs to customize
+ * memory cgroup behavior through struct_ops programs.
+ */
+struct memcg_bpf_ops {
+ void (*handle_cgroup_online)(struct mem_cgroup *memcg);
+
+ void (*handle_cgroup_offline)(struct mem_cgroup *memcg);
+
+ bool (*below_low)(struct mem_cgroup *memcg);
+
+ bool (*below_min)(struct mem_cgroup *memcg);
+
+ unsigned int (*get_high_delay_ms)(struct mem_cgroup *memcg);
+};
+#endif /* CONFIG_BPF_SYSCALL */
+
/*
* The memory controller data structure. The memory controller controls both
* page cache and RSS per cgroup. We would eventually like to provide
@@ -321,6 +353,10 @@ struct mem_cgroup {
spinlock_t event_list_lock;
#endif /* CONFIG_MEMCG_V1 */
+#ifdef CONFIG_BPF_SYSCALL
+ struct memcg_bpf_ops *bpf_ops;
+#endif
+
struct mem_cgroup_per_node *nodeinfo[];
};
@@ -554,6 +590,76 @@ static inline bool mem_cgroup_disabled(void)
return !cgroup_subsys_enabled(memory_cgrp_subsys);
}
+#ifdef CONFIG_BPF_SYSCALL
+
+/* SRCU for protecting concurrent access to memcg->bpf_ops */
+extern struct srcu_struct memcg_bpf_srcu;
+
+/**
+ * BPF_MEMCG_CALL - Safely invoke a BPF memcg callback
+ * @memcg: The memory cgroup
+ * @op: The operation name (struct member)
+ * @default_val: Default return value if no BPF program attached
+ *
+ * This macro safely calls a BPF callback under SRCU protection.
+ *
+ * The first READ_ONCE() serves as a fast-path check to avoid the overhead
+ * of SRCU read lock acquisition when no BPF program is attached. This keeps
+ * the common no-BPF case performance unchanged. The second READ_ONCE() under
+ * SRCU protection ensures we see a consistent view of bpf_ops after acquiring
+ * the lock, protecting against concurrent updates.
+ */
+#define BPF_MEMCG_CALL(memcg, op, default_val) ({ \
+ typeof(default_val) __ret = (default_val); \
+ struct memcg_bpf_ops *__ops; \
+ int __idx; \
+ \
+ if (unlikely(READ_ONCE((memcg)->bpf_ops))) { \
+ __idx = srcu_read_lock(&memcg_bpf_srcu); \
+ __ops = READ_ONCE((memcg)->bpf_ops); \
+ if (__ops && __ops->op) \
+ __ret = __ops->op(memcg); \
+ srcu_read_unlock(&memcg_bpf_srcu, __idx); \
+ } \
+ __ret; \
+})
+
+static inline bool bpf_memcg_below_low(struct mem_cgroup *memcg)
+{
+ return BPF_MEMCG_CALL(memcg, below_low, false);
+}
+
+static inline bool bpf_memcg_below_min(struct mem_cgroup *memcg)
+{
+ return BPF_MEMCG_CALL(memcg, below_min, false);
+}
+
+static inline unsigned long bpf_memcg_get_high_delay(struct mem_cgroup *memcg)
+{
+ unsigned int ret;
+
+ ret = BPF_MEMCG_CALL(memcg, get_high_delay_ms, 0U);
+ return msecs_to_jiffies(ret);
+}
+
+#undef BPF_MEMCG_CALL
+
+extern void memcontrol_bpf_online(struct mem_cgroup *memcg);
+extern void memcontrol_bpf_offline(struct mem_cgroup *memcg);
+
+#else /* CONFIG_BPF_SYSCALL */
+
+static inline unsigned long
+bpf_memcg_get_high_delay(struct mem_cgroup *memcg) { return 0; }
+static inline bool
+bpf_memcg_below_low(struct mem_cgroup *memcg) { return false; }
+static inline bool
+bpf_memcg_below_min(struct mem_cgroup *memcg) { return false; }
+static inline void memcontrol_bpf_online(struct mem_cgroup *memcg) { }
+static inline void memcontrol_bpf_offline(struct mem_cgroup *memcg) { }
+
+#endif /* CONFIG_BPF_SYSCALL */
+
static inline void mem_cgroup_protection(struct mem_cgroup *root,
struct mem_cgroup *memcg,
unsigned long *min,
@@ -625,6 +731,9 @@ static inline bool mem_cgroup_below_low(struct mem_cgroup *target,
if (mem_cgroup_unprotected(target, memcg))
return false;
+ if (bpf_memcg_below_low(memcg))
+ return true;
+
return READ_ONCE(memcg->memory.elow) >=
page_counter_read(&memcg->memory);
}
@@ -635,6 +744,9 @@ static inline bool mem_cgroup_below_min(struct mem_cgroup *target,
if (mem_cgroup_unprotected(target, memcg))
return false;
+ if (bpf_memcg_below_min(memcg))
+ return true;
+
return READ_ONCE(memcg->memory.emin) >=
page_counter_read(&memcg->memory);
}
@@ -909,12 +1021,13 @@ unsigned long mem_cgroup_get_zone_lru_size(struct lruvec *lruvec,
return READ_ONCE(mz->lru_zone_size[zone_idx][lru]);
}
-void __mem_cgroup_handle_over_high(gfp_t gfp_mask);
+void __mem_cgroup_handle_over_high(gfp_t gfp_mask,
+ unsigned long bpf_high_delay);
static inline void mem_cgroup_handle_over_high(gfp_t gfp_mask)
{
if (unlikely(current->memcg_nr_pages_over_high))
- __mem_cgroup_handle_over_high(gfp_mask);
+ __mem_cgroup_handle_over_high(gfp_mask, 0);
}
unsigned long mem_cgroup_get_max(struct mem_cgroup *memcg);
diff --git a/mm/bpf_memcontrol.c b/mm/bpf_memcontrol.c
index 716df49d7647..72b720400628 100644
--- a/mm/bpf_memcontrol.c
+++ b/mm/bpf_memcontrol.c
@@ -8,6 +8,9 @@
#include <linux/memcontrol.h>
#include <linux/bpf.h>
+/* Protects memcg->bpf_ops pointer for read and write. */
+DEFINE_SRCU(memcg_bpf_srcu);
+
__bpf_kfunc_start_defs();
/**
@@ -179,15 +182,255 @@ static const struct btf_kfunc_id_set bpf_memcontrol_kfunc_set = {
.set = &bpf_memcontrol_kfuncs,
};
+/**
+ * memcontrol_bpf_online - Inherit BPF programs for a new online cgroup.
+ * @memcg: The memory cgroup that is coming online.
+ *
+ * When a new memcg is brought online, it inherits the BPF programs
+ * attached to its parent. This ensures consistent BPF-based memory
+ * control policies throughout the cgroup hierarchy.
+ *
+ * After inheriting, if the BPF program has an online handler, it is
+ * invoked for the new memcg.
+ */
+void memcontrol_bpf_online(struct mem_cgroup *memcg)
+{
+ struct memcg_bpf_ops *ops;
+ struct mem_cgroup *parent_memcg;
+
+ /* The root cgroup does not inherit from a parent. */
+ if (mem_cgroup_is_root(memcg))
+ return;
+
+ /*
+ * Because only functions bpf_memcg_ops_reg and bpf_memcg_ops_unreg
+ * write to memcg->bpf_ops under the protection of cgroup_mutex,
+ * ensuring that cgroup_mutex is already locked here allows safe
+ * reading and writing of memcg->bpf_ops without needing to acquire
+ * a lock on memcg_bpf_srcu.
+ */
+ lockdep_assert_held(&cgroup_mutex);
+
+ parent_memcg = parent_mem_cgroup(memcg);
+
+ /* Inherit the BPF program from the parent cgroup. */
+ ops = READ_ONCE(parent_memcg->bpf_ops);
+ if (!ops)
+ return;
+ WRITE_ONCE(memcg->bpf_ops, ops);
+
+ /*
+ * If the BPF program implements it, call the online handler to
+ * allow the program to perform setup tasks for the new cgroup.
+ */
+ if (ops->handle_cgroup_online)
+ ops->handle_cgroup_online(memcg);
+}
+
+/**
+ * memcontrol_bpf_offline - Run BPF cleanup for an offline cgroup.
+ * @memcg: The memory cgroup that is going offline.
+ *
+ * If a BPF program is attached and implements an offline handler,
+ * it is invoked to perform cleanup tasks before the memcg goes
+ * completely offline.
+ */
+void memcontrol_bpf_offline(struct mem_cgroup *memcg)
+{
+ struct memcg_bpf_ops *ops;
+
+ /* Same with function memcontrol_bpf_online. */
+ lockdep_assert_held(&cgroup_mutex);
+
+ ops = READ_ONCE(memcg->bpf_ops);
+ if (!ops || !ops->handle_cgroup_offline)
+ return;
+
+ ops->handle_cgroup_offline(memcg);
+}
+
+static int memcg_ops_btf_struct_access(struct bpf_verifier_log *log,
+ const struct bpf_reg_state *reg,
+ int off, int size)
+{
+ return -EACCES;
+}
+
+static bool memcg_ops_is_valid_access(int off, int size, enum bpf_access_type type,
+ const struct bpf_prog *prog,
+ struct bpf_insn_access_aux *info)
+{
+ return bpf_tracing_btf_ctx_access(off, size, type, prog, info);
+}
+
+const struct bpf_verifier_ops bpf_memcg_verifier_ops = {
+ .get_func_proto = bpf_base_func_proto,
+ .btf_struct_access = memcg_ops_btf_struct_access,
+ .is_valid_access = memcg_ops_is_valid_access,
+};
+
+static void cfi_handle_cgroup_online(struct mem_cgroup *memcg)
+{
+}
+
+static void cfi_handle_cgroup_offline(struct mem_cgroup *memcg)
+{
+}
+
+static bool cfi_below_low(struct mem_cgroup *memcg)
+{
+ return false;
+}
+
+static bool cfi_below_min(struct mem_cgroup *memcg)
+{
+ return false;
+}
+
+static unsigned int cfi_get_high_delay_ms(struct mem_cgroup *memcg)
+{
+ return 0;
+}
+
+static struct memcg_bpf_ops cfi_bpf_memcg_ops = {
+ .handle_cgroup_online = cfi_handle_cgroup_online,
+ .handle_cgroup_offline = cfi_handle_cgroup_offline,
+ .below_low = cfi_below_low,
+ .below_min = cfi_below_min,
+ .get_high_delay_ms = cfi_get_high_delay_ms,
+};
+
+static int bpf_memcg_ops_init(struct btf *btf)
+{
+ return 0;
+}
+
+static int bpf_memcg_ops_check_member(const struct btf_type *t,
+ const struct btf_member *member,
+ const struct bpf_prog *prog)
+{
+ u32 moff = __btf_member_bit_offset(t, member) / 8;
+
+ switch (moff) {
+ case offsetof(struct memcg_bpf_ops, handle_cgroup_online):
+ case offsetof(struct memcg_bpf_ops, handle_cgroup_offline):
+ case offsetof(struct memcg_bpf_ops, below_low):
+ case offsetof(struct memcg_bpf_ops, below_min):
+ case offsetof(struct memcg_bpf_ops, get_high_delay_ms):
+ break;
+ default:
+ return -EINVAL;
+ }
+
+ if (prog->sleepable)
+ return -EINVAL;
+
+ return 0;
+}
+
+static int bpf_memcg_ops_init_member(const struct btf_type *t,
+ const struct btf_member *member,
+ void *kdata, const void *udata)
+{
+ return 0;
+}
+
+/**
+ * clean_memcg_bpf_ops - Clear BPF ops from a memory cgroup hierarchy
+ * @memcg: Root memory cgroup to start from
+ * @ops: The specific BPF ops to remove
+ *
+ * Walks the cgroup hierarchy and clears bpf_ops for any cgroup that
+ * matches @ops.
+ */
+static void clean_memcg_bpf_ops(struct mem_cgroup *memcg,
+ struct memcg_bpf_ops *ops)
+{
+ struct mem_cgroup *iter = NULL;
+
+ while ((iter = mem_cgroup_iter(memcg, iter, NULL))) {
+ if (READ_ONCE(iter->bpf_ops) == ops)
+ WRITE_ONCE(iter->bpf_ops, NULL);
+ }
+}
+
+static int bpf_memcg_ops_reg(void *kdata, struct bpf_link *link)
+{
+ struct bpf_struct_ops_link *ops_link
+ = container_of(link, struct bpf_struct_ops_link, link);
+ struct memcg_bpf_ops *ops = kdata;
+ struct mem_cgroup *memcg, *iter = NULL;
+ int err = 0;
+
+ memcg = mem_cgroup_get_from_ino(ops_link->cgroup_id);
+ if (IS_ERR(memcg))
+ return PTR_ERR(memcg);
+
+ cgroup_lock();
+ while ((iter = mem_cgroup_iter(memcg, iter, NULL))) {
+ if (READ_ONCE(iter->bpf_ops)) {
+ mem_cgroup_iter_break(memcg, iter);
+ err = -EBUSY;
+ break;
+ }
+ WRITE_ONCE(iter->bpf_ops, ops);
+ }
+ if (err)
+ clean_memcg_bpf_ops(memcg, ops);
+ cgroup_unlock();
+
+ mem_cgroup_put(memcg);
+ return err;
+}
+
+/* Unregister the struct ops instance */
+static void bpf_memcg_ops_unreg(void *kdata, struct bpf_link *link)
+{
+ struct bpf_struct_ops_link *ops_link
+ = container_of(link, struct bpf_struct_ops_link, link);
+ struct memcg_bpf_ops *ops = kdata;
+ struct mem_cgroup *memcg;
+
+ memcg = mem_cgroup_get_from_ino(ops_link->cgroup_id);
+ if (IS_ERR_OR_NULL(memcg))
+ goto out;
+
+ cgroup_lock();
+ clean_memcg_bpf_ops(memcg, ops);
+ cgroup_unlock();
+
+ mem_cgroup_put(memcg);
+
+out:
+ synchronize_srcu(&memcg_bpf_srcu);
+}
+
+static struct bpf_struct_ops bpf_memcg_bpf_ops = {
+ .verifier_ops = &bpf_memcg_verifier_ops,
+ .init = bpf_memcg_ops_init,
+ .check_member = bpf_memcg_ops_check_member,
+ .init_member = bpf_memcg_ops_init_member,
+ .reg = bpf_memcg_ops_reg,
+ .unreg = bpf_memcg_ops_unreg,
+ .name = "memcg_bpf_ops",
+ .owner = THIS_MODULE,
+ .cfi_stubs = &cfi_bpf_memcg_ops,
+};
+
static int __init bpf_memcontrol_init(void)
{
- int err;
+ int err, err2;
err = register_btf_kfunc_id_set(BPF_PROG_TYPE_UNSPEC,
&bpf_memcontrol_kfunc_set);
if (err)
pr_warn("error while registering bpf memcontrol kfuncs: %d", err);
- return err;
+ err2 = register_bpf_struct_ops(&bpf_memcg_bpf_ops, memcg_bpf_ops);
+ if (err2)
+ pr_warn("error while registering memcontrol bpf ops: %d\n",
+ err2);
+
+ return err ? err : err2;
}
late_initcall(bpf_memcontrol_init);
diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index 1f74fce27677..3f358d9bc412 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -2252,7 +2252,8 @@ static unsigned long calculate_high_delay(struct mem_cgroup *memcg,
* try_charge() (context permitting), as well as from the userland
* return path where reclaim is always able to block.
*/
-void __mem_cgroup_handle_over_high(gfp_t gfp_mask)
+void
+__mem_cgroup_handle_over_high(gfp_t gfp_mask, unsigned long bpf_high_delay)
{
unsigned long penalty_jiffies;
unsigned long pflags;
@@ -2294,11 +2295,15 @@ void __mem_cgroup_handle_over_high(gfp_t gfp_mask)
* memory.high is breached and reclaim is unable to keep up. Throttle
* allocators proactively to slow down excessive growth.
*/
- penalty_jiffies = calculate_high_delay(memcg, nr_pages,
- mem_find_max_overage(memcg));
+ if (nr_pages) {
+ penalty_jiffies = calculate_high_delay(
+ memcg, nr_pages, mem_find_max_overage(memcg));
- penalty_jiffies += calculate_high_delay(memcg, nr_pages,
- swap_find_max_overage(memcg));
+ penalty_jiffies += calculate_high_delay(
+ memcg, nr_pages, swap_find_max_overage(memcg));
+ } else
+ penalty_jiffies = 0;
+ penalty_jiffies = max(penalty_jiffies, bpf_high_delay);
/*
* Clamp the max delay per usermode return so as to still keep the
@@ -2356,6 +2361,7 @@ static int try_charge_memcg(struct mem_cgroup *memcg, gfp_t gfp_mask,
bool raised_max_event = false;
unsigned long pflags;
bool allow_spinning = gfpflags_allow_spinning(gfp_mask);
+ struct mem_cgroup *orig_memcg;
retry:
if (consume_stock(memcg, nr_pages))
@@ -2481,6 +2487,7 @@ static int try_charge_memcg(struct mem_cgroup *memcg, gfp_t gfp_mask,
if (batch > nr_pages)
refill_stock(memcg, batch - nr_pages);
+ orig_memcg = memcg;
/*
* If the hierarchy is above the normal consumption range, schedule
* reclaim on returning to userland. We can perform reclaim here
@@ -2530,10 +2537,15 @@ static int try_charge_memcg(struct mem_cgroup *memcg, gfp_t gfp_mask,
* kernel. If this is successful, the return path will see it
* when it rechecks the overage and simply bail out.
*/
- if (current->memcg_nr_pages_over_high > MEMCG_CHARGE_BATCH &&
- !(current->flags & PF_MEMALLOC) &&
- gfpflags_allow_blocking(gfp_mask))
- __mem_cgroup_handle_over_high(gfp_mask);
+ if (!(current->flags & PF_MEMALLOC) &&
+ gfpflags_allow_blocking(gfp_mask)) {
+ unsigned long bpf_high_delay;
+
+ bpf_high_delay = bpf_memcg_get_high_delay(orig_memcg);
+ if (bpf_high_delay ||
+ current->memcg_nr_pages_over_high > MEMCG_CHARGE_BATCH)
+ __mem_cgroup_handle_over_high(gfp_mask, bpf_high_delay);
+ }
return 0;
}
@@ -3906,6 +3918,8 @@ static int mem_cgroup_css_online(struct cgroup_subsys_state *css)
*/
xa_store(&mem_cgroup_ids, memcg->id.id, memcg, GFP_KERNEL);
+ memcontrol_bpf_online(memcg);
+
return 0;
offline_kmem:
memcg_offline_kmem(memcg);
@@ -3925,6 +3939,7 @@ static void mem_cgroup_css_offline(struct cgroup_subsys_state *css)
zswap_memcg_offline_cleanup(memcg);
+ memcontrol_bpf_offline(memcg);
memcg_offline_kmem(memcg);
reparent_deferred_split_queue(memcg);
reparent_shrinker_deferred(memcg);
--
2.43.0
^ permalink raw reply [flat|nested] 18+ messages in thread
* [RFC PATCH bpf-next v6 09/12] selftests/bpf: Add tests for memcg_bpf_ops
2026-02-04 8:56 [RFC PATCH bpf-next v6 00/12] mm: memcontrol: Add BPF hooks for memory controller Hui Zhu
` (7 preceding siblings ...)
2026-02-04 9:00 ` [RFC PATCH bpf-next v6 08/12] mm: memcontrol: Add BPF struct_ops for memory controller Hui Zhu
@ 2026-02-04 9:00 ` Hui Zhu
2026-02-04 9:00 ` [RFC PATCH bpf-next v6 10/12] mm/bpf: Add BPF_F_ALLOW_OVERRIDE support " Hui Zhu
` (2 subsequent siblings)
11 siblings, 0 replies; 18+ messages in thread
From: Hui Zhu @ 2026-02-04 9:00 UTC (permalink / raw)
To: Andrew Morton, Johannes Weiner, Michal Hocko, Roman Gushchin,
Shakeel Butt, Muchun Song, Alexei Starovoitov, Daniel Borkmann,
Andrii Nakryiko, Martin KaFai Lau, Eduard Zingerman, Song Liu,
Yonghong Song, John Fastabend, KP Singh, Stanislav Fomichev,
Hao Luo, Jiri Olsa, Shuah Khan, Peter Zijlstra, Miguel Ojeda,
Nathan Chancellor, Kees Cook, Tejun Heo, Jeff Xu, mkoutny,
Jan Hendrik Farr, Christian Brauner, Randy Dunlap, Brian Gerst,
Masahiro Yamada, davem, Jakub Kicinski, Jesper Dangaard Brouer,
JP Kobryn, Willem de Bruijn, Jason Xing, Paul Chaignon,
Anton Protopopov, Amery Hung, Chen Ridong, Lance Yang,
Jiayuan Chen, linux-kernel, linux-mm, cgroups, bpf, netdev,
linux-kselftest
Cc: Hui Zhu, Geliang Tang
From: Hui Zhu <zhuhui@kylinos.cn>
Add a comprehensive selftest suite for the `memcg_bpf_ops`
functionality. These tests validate that BPF programs can correctly
influence memory cgroup throttling behavior by implementing the new
hooks.
The test suite is added in `prog_tests/memcg_ops.c` and covers
several key scenarios:
1. `test_memcg_ops_over_high`:
Verifies that a BPF program can trigger throttling on a low-priority
cgroup by returning a delay from the `get_high_delay_ms` hook when a
high-priority cgroup is under pressure.
2. `test_memcg_ops_below_low_over_high`:
Tests the combination of the `below_low` and `get_high_delay_ms`
hooks, ensuring they work together as expected.
3. `test_memcg_ops_below_min_over_high`:
Validates the interaction between the `below_min` and
`get_high_delay_ms` hooks.
The test framework sets up a cgroup hierarchy with high and low
priority groups, attaches BPF programs, runs memory-intensive
workloads, and asserts that the observed throttling (measured by
workload execution time) matches expectations.
The BPF program (`progs/memcg_ops.c`) uses a tracepoint on
`memcg:count_memcg_events` (specifically PGFAULT) to detect memory
pressure and trigger the appropriate hooks in response. This test
suite provides essential validation for the new memory control
mechanisms.
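The suite plugs into the standard BPF selftest runner, so (assuming the
usual tools/testing/selftests/bpf build) the new cases can be run by
name filter, for example:
  # ./test_progs -t memcg_ops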
Signed-off-by: Geliang Tang <geliang@kernel.org>
Signed-off-by: Hui Zhu <zhuhui@kylinos.cn>
---
MAINTAINERS | 2 +
.../selftests/bpf/prog_tests/memcg_ops.c | 555 ++++++++++++++++++
tools/testing/selftests/bpf/progs/memcg_ops.c | 130 ++++
3 files changed, 687 insertions(+)
create mode 100644 tools/testing/selftests/bpf/prog_tests/memcg_ops.c
create mode 100644 tools/testing/selftests/bpf/progs/memcg_ops.c
diff --git a/MAINTAINERS b/MAINTAINERS
index 491d567f7dc8..7e07bb330eae 100644
--- a/MAINTAINERS
+++ b/MAINTAINERS
@@ -6471,6 +6471,8 @@ F: mm/memcontrol-v1.h
F: mm/page_counter.c
F: mm/swap_cgroup.c
F: samples/cgroup/*
+F: tools/testing/selftests/bpf/prog_tests/memcg_ops.c
+F: tools/testing/selftests/bpf/progs/memcg_ops.c
F: tools/testing/selftests/cgroup/memcg_protection.m
F: tools/testing/selftests/cgroup/test_hugetlb_memcg.c
F: tools/testing/selftests/cgroup/test_kmem.c
diff --git a/tools/testing/selftests/bpf/prog_tests/memcg_ops.c b/tools/testing/selftests/bpf/prog_tests/memcg_ops.c
new file mode 100644
index 000000000000..8c787439f83c
--- /dev/null
+++ b/tools/testing/selftests/bpf/prog_tests/memcg_ops.c
@@ -0,0 +1,555 @@
+// SPDX-License-Identifier: GPL-2.0
+/*
+ * Memory controller eBPF struct ops test
+ */
+
+#include <test_progs.h>
+#include <bpf/btf.h>
+#include <sys/mman.h>
+#include <sys/stat.h>
+#include <fcntl.h>
+#include <unistd.h>
+#include "cgroup_helpers.h"
+
+struct local_config {
+ u64 threshold;
+ u64 high_cgroup_id;
+ bool use_below_low;
+ bool use_below_min;
+ unsigned int over_high_ms;
+} local_config;
+
+#include "memcg_ops.skel.h"
+
+#define TRIGGER_THRESHOLD 1
+#define OVER_HIGH_MS 2000
+#define FILE_SIZE (64 * 1024 * 1024ul)
+#define BUFFER_SIZE (4096)
+#define CG_LIMIT (120 * 1024 * 1024ul)
+
+#define CG_DIR "/memcg_ops_test"
+#define CG_HIGH_DIR CG_DIR "/high"
+#define CG_LOW_DIR CG_DIR "/low"
+
+static int
+setup_cgroup(u64 *high_cgroup_id, int *low_cgroup_fd, int *high_cgroup_fd)
+{
+ int ret;
+ char limit_buf[20];
+
+ ret = setup_cgroup_environment();
+ if (!ASSERT_OK(ret, "setup_cgroup_environment"))
+ goto cleanup;
+
+ ret = create_and_get_cgroup(CG_DIR);
+ if (!ASSERT_GE(ret, 0, "create_and_get_cgroup "CG_DIR))
+ goto cleanup;
+ close(ret);
+ ret = enable_controllers(CG_DIR, "memory");
+ if (!ASSERT_OK(ret, "enable_controllers"))
+ goto cleanup;
+ snprintf(limit_buf, 20, "%lu", CG_LIMIT);
+ ret = write_cgroup_file(CG_DIR, "memory.max", limit_buf);
+ if (!ASSERT_OK(ret, "write_cgroup_file memory.max"))
+ goto cleanup;
+ ret = write_cgroup_file(CG_DIR, "memory.swap.max", "0");
+ if (!ASSERT_OK(ret, "write_cgroup_file memory.swap.max"))
+ goto cleanup;
+
+ ret = create_and_get_cgroup(CG_HIGH_DIR);
+ if (!ASSERT_GE(ret, 0, "create_and_get_cgroup "CG_HIGH_DIR))
+ goto cleanup;
+ if (high_cgroup_fd)
+ *high_cgroup_fd = ret;
+ else
+ close(ret);
+ *high_cgroup_id = get_cgroup_id(CG_HIGH_DIR);
+ if (!ASSERT_GT(*high_cgroup_id, 0, "get_cgroup_id"))
+ goto cleanup;
+
+ ret = create_and_get_cgroup(CG_LOW_DIR);
+ if (!ASSERT_GE(ret, 0, "create_and_get_cgroup "CG_LOW_DIR))
+ goto cleanup;
+ if (low_cgroup_fd)
+ *low_cgroup_fd = ret;
+ else
+ close(ret);
+
+ return 0;
+
+cleanup:
+ cleanup_cgroup_environment();
+ return -1;
+}
+
+int write_file(const char *filename)
+{
+ int ret = -1;
+ size_t written = 0;
+ char *buffer;
+ FILE *fp;
+
+ fp = fopen(filename, "wb");
+ if (!fp)
+ goto out;
+
+ buffer = malloc(BUFFER_SIZE);
+ if (!buffer)
+ goto cleanup_fp;
+
+ memset(buffer, 'A', BUFFER_SIZE);
+
+ while (written < FILE_SIZE) {
+ size_t to_write = (FILE_SIZE - written < BUFFER_SIZE) ?
+ (FILE_SIZE - written) :
+ BUFFER_SIZE;
+
+ if (fwrite(buffer, 1, to_write, fp) != to_write)
+ goto cleanup;
+ written += to_write;
+ }
+
+ ret = 0;
+cleanup:
+ free(buffer);
+cleanup_fp:
+ fclose(fp);
+out:
+ return ret;
+}
+
+int read_file(const char *filename, int iterations)
+{
+ int ret = -1;
+ long page_size = sysconf(_SC_PAGESIZE);
+ char *p;
+ char *map;
+ size_t i;
+ int fd;
+ struct stat sb;
+
+ fd = open(filename, O_RDONLY);
+ if (fd == -1)
+ goto out;
+
+ if (fstat(fd, &sb) == -1)
+ goto cleanup_fd;
+
+ if (sb.st_size != FILE_SIZE) {
+ fprintf(stderr, "File size mismatch: expected %lu, got %lu\n",
+ (unsigned long)FILE_SIZE, (unsigned long)sb.st_size);
+ goto cleanup_fd;
+ }
+
+ map = mmap(NULL, FILE_SIZE, PROT_READ, MAP_PRIVATE, fd, 0);
+ if (map == MAP_FAILED)
+ goto cleanup_fd;
+
+ for (int iter = 0; iter < iterations; iter++) {
+ for (i = 0; i < FILE_SIZE; i += page_size) {
+ /* access a byte to trigger page fault */
+ p = &map[i];
+ __asm__ __volatile__("" : : "r"(p) : "memory");
+ }
+
+ if (env.verbosity >= VERBOSE_NORMAL)
+ printf("%s %d %d done\n", __func__, getpid(), iter);
+ }
+
+ if (munmap(map, FILE_SIZE) == -1)
+ goto cleanup_fd;
+
+ ret = 0;
+
+cleanup_fd:
+ close(fd);
+out:
+ return ret;
+}
+
+static int
+real_test_memcg_ops_child_work(const char *cgroup_path,
+ char *data_filename,
+ char *time_filename,
+ int read_times)
+{
+ struct timeval start, end;
+ double elapsed;
+ FILE *fp;
+ int ret = -1;
+
+ if (!ASSERT_OK(join_parent_cgroup(cgroup_path), "join_parent_cgroup"))
+ goto out;
+
+ if (env.verbosity >= VERBOSE_NORMAL)
+ printf("%s %d begin\n", __func__, getpid());
+
+ gettimeofday(&start, NULL);
+
+ if (!ASSERT_OK(write_file(data_filename), "write_file"))
+ goto out;
+
+ if (env.verbosity >= VERBOSE_NORMAL)
+ printf("%s %d write_file done\n", __func__, getpid());
+
+ if (!ASSERT_OK(read_file(data_filename, read_times), "read_file"))
+ goto out;
+
+ gettimeofday(&end, NULL);
+
+ elapsed = (end.tv_sec - start.tv_sec) +
+ (end.tv_usec - start.tv_usec) / 1000000.0;
+
+ if (env.verbosity >= VERBOSE_NORMAL)
+ printf("%s %d end %.6f\n", __func__, getpid(), elapsed);
+
+ fp = fopen(time_filename, "w");
+ if (!ASSERT_OK_PTR(fp, "fopen"))
+ goto out;
+ fprintf(fp, "%.6f", elapsed);
+ fclose(fp);
+
+ ret = 0;
+out:
+ return ret;
+}
+
+static int get_time(char *time_filename, double *time)
+{
+ int ret = -1;
+ FILE *fp;
+ char buf[64];
+
+ fp = fopen(time_filename, "r");
+ if (!ASSERT_OK_PTR(fp, "fopen"))
+ goto out;
+
+ if (!ASSERT_OK_PTR(fgets(buf, sizeof(buf), fp), "fgets"))
+ goto cleanup;
+
+ if (sscanf(buf, "%lf", time) != 1) {
+ PRINT_FAIL("sscanf %s", buf);
+ goto cleanup;
+ }
+
+ ret = 0;
+cleanup:
+ fclose(fp);
+out:
+ return ret;
+}
+
+static void real_test_memcg_ops(int read_times)
+{
+ int ret;
+ char data_file1[] = "/tmp/test_data_XXXXXX";
+ char data_file2[] = "/tmp/test_data_XXXXXX";
+ char time_file1[] = "/tmp/test_time_XXXXXX";
+ char time_file2[] = "/tmp/test_time_XXXXXX";
+ pid_t pid1, pid2;
+ double time1, time2;
+ int status;
+
+ ret = mkstemp(data_file1);
+ if (!ASSERT_GE(ret, 0, "mkstemp"))
+ return;
+ close(ret);
+ ret = mkstemp(data_file2);
+ if (!ASSERT_GE(ret, 0, "mkstemp"))
+ goto cleanup_data_file1;
+ close(ret);
+ ret = mkstemp(time_file1);
+ if (!ASSERT_GE(ret, 0, "mkstemp"))
+ goto cleanup_data_file2;
+ close(ret);
+ ret = mkstemp(time_file2);
+ if (!ASSERT_GE(ret, 0, "mkstemp"))
+ goto cleanup_time_file1;
+ close(ret);
+
+ pid1 = fork();
+ if (!ASSERT_GE(pid1, 0, "fork"))
+ goto cleanup;
+ if (pid1 == 0) {
+ exit(real_test_memcg_ops_child_work(CG_LOW_DIR,
+ data_file1,
+ time_file1,
+ read_times));
+ }
+
+ pid2 = fork();
+ if (!ASSERT_GE(pid2, 0, "fork")) {
+ /* Reap first child to avoid a zombie if second fork fails. */
+ (void)waitpid(pid1, NULL, 0);
+ goto cleanup;
+ }
+ if (pid2 == 0) {
+ exit(real_test_memcg_ops_child_work(CG_HIGH_DIR,
+ data_file2,
+ time_file2,
+ read_times));
+ }
+
+ ret = waitpid(pid1, &status, 0);
+ if (!ASSERT_GT(ret, 0, "child1 waitpid"))
+ goto cleanup;
+ if (!ASSERT_TRUE(WIFEXITED(status), "child1 exited normally"))
+ goto cleanup;
+ if (!ASSERT_EQ(WEXITSTATUS(status), 0, "child1 exit status"))
+ goto cleanup;
+
+ ret = waitpid(pid2, &status, 0);
+ if (!ASSERT_GT(ret, 0, "child2 waitpid"))
+ goto cleanup;
+ if (!ASSERT_TRUE(WIFEXITED(status), "child2 exited normally"))
+ goto cleanup;
+ if (!ASSERT_EQ(WEXITSTATUS(status), 0, "child2 exit status"))
+ goto cleanup;
+
+ if (get_time(time_file1, &time1))
+ goto cleanup;
+
+ if (get_time(time_file2, &time2))
+ goto cleanup;
+
+ if (time1 < time2 || time1 - time2 <= 1)
+ PRINT_FAIL("Low priority cgroup not slower: low=%f vs high=%f",
+ time1, time2);
+
+cleanup:
+ unlink(time_file2);
+cleanup_time_file1:
+ unlink(time_file1);
+cleanup_data_file2:
+ unlink(data_file2);
+cleanup_data_file1:
+ unlink(data_file1);
+}
+
+void test_memcg_ops_over_high(void)
+{
+ int err, map_fd;
+ struct memcg_ops *skel = NULL;
+ struct bpf_map *map;
+ struct memcg_ops__bss *bss_data;
+ __u32 key = 0;
+ struct bpf_program *prog = NULL;
+ struct bpf_link *link = NULL, *link2 = NULL;
+ DECLARE_LIBBPF_OPTS(bpf_struct_ops_opts, opts);
+ u64 high_cgroup_id;
+ int low_cgroup_fd = -1;
+
+ err = setup_cgroup(&high_cgroup_id, &low_cgroup_fd, NULL);
+ if (!ASSERT_OK(err, "setup_cgroup"))
+ goto out;
+
+ skel = memcg_ops__open_and_load();
+ if (!ASSERT_OK_PTR(skel, "memcg_ops__open_and_load"))
+ goto out;
+
+ map = bpf_object__find_map_by_name(skel->obj, ".bss");
+ if (!ASSERT_OK_PTR(map, "bpf_object__find_map_by_name .bss"))
+ goto out;
+
+ map_fd = bpf_map__fd(map);
+ bss_data = calloc(1, bpf_map__value_size(map));
+ if (!ASSERT_OK_PTR(bss_data, "calloc(1, bpf_map__value_size(map))"))
+ goto out;
+ bss_data->local_config.high_cgroup_id = high_cgroup_id;
+ bss_data->local_config.threshold = TRIGGER_THRESHOLD;
+ bss_data->local_config.use_below_low = false;
+ bss_data->local_config.use_below_min = false;
+ bss_data->local_config.over_high_ms = OVER_HIGH_MS;
+ err = bpf_map_update_elem(map_fd, &key, bss_data, BPF_EXIST);
+ free(bss_data);
+ if (!ASSERT_OK(err, "bpf_map_update_elem"))
+ goto out;
+
+ prog = bpf_object__find_program_by_name(skel->obj,
+ "handle_count_memcg_events");
+ if (!ASSERT_OK_PTR(prog, "bpf_object__find_program_by_name"))
+ goto out;
+
+ link = bpf_program__attach(prog);
+ if (!ASSERT_OK_PTR(link, "bpf_program__attach"))
+ goto out;
+
+ map = bpf_object__find_map_by_name(skel->obj, "low_mcg_ops");
+ if (!ASSERT_OK_PTR(map, "bpf_object__find_map_by_name low_mcg_ops"))
+ goto out;
+
+ opts.relative_fd = low_cgroup_fd;
+ link2 = bpf_map__attach_struct_ops_opts(map, &opts);
+ if (!ASSERT_OK_PTR(link2, "bpf_map__attach_struct_ops_opts"))
+ goto out;
+
+ real_test_memcg_ops(5);
+
+out:
+ bpf_link__destroy(link);
+ bpf_link__destroy(link2);
+ if (skel) {
+ memcg_ops__detach(skel);
+ memcg_ops__destroy(skel);
+ }
+ close(low_cgroup_fd);
+ cleanup_cgroup_environment();
+}
+
+void test_memcg_ops_below_low_over_high(void)
+{
+ int err, map_fd;
+ struct memcg_ops *skel = NULL;
+ struct bpf_map *map;
+ struct memcg_ops__bss *bss_data;
+ __u32 key = 0;
+ struct bpf_program *prog = NULL;
+ struct bpf_link *link = NULL, *link_high = NULL, *link_low = NULL;
+ DECLARE_LIBBPF_OPTS(bpf_struct_ops_opts, opts);
+ u64 high_cgroup_id;
+ int high_cgroup_fd = -1, low_cgroup_fd = -1;
+
+ err = setup_cgroup(&high_cgroup_id, &low_cgroup_fd, &high_cgroup_fd);
+ if (!ASSERT_OK(err, "setup_cgroup"))
+ goto out;
+
+ skel = memcg_ops__open_and_load();
+ if (!ASSERT_OK_PTR(skel, "memcg_ops__open_and_load"))
+ goto out;
+
+ map = bpf_object__find_map_by_name(skel->obj, ".bss");
+ if (!ASSERT_OK_PTR(map, "bpf_object__find_map_by_name .bss"))
+ goto out;
+
+ map_fd = bpf_map__fd(map);
+ bss_data = calloc(1, bpf_map__value_size(map));
+ if (!ASSERT_OK_PTR(bss_data, "calloc(1, bpf_map__value_size(map))"))
+ goto out;
+ bss_data->local_config.high_cgroup_id = high_cgroup_id;
+ bss_data->local_config.threshold = TRIGGER_THRESHOLD;
+ bss_data->local_config.use_below_low = true;
+ bss_data->local_config.use_below_min = false;
+ bss_data->local_config.over_high_ms = OVER_HIGH_MS;
+ err = bpf_map_update_elem(map_fd, &key, bss_data, BPF_EXIST);
+ free(bss_data);
+ if (!ASSERT_OK(err, "bpf_map_update_elem"))
+ goto out;
+
+ prog = bpf_object__find_program_by_name(skel->obj,
+ "handle_count_memcg_events");
+ if (!ASSERT_OK_PTR(prog, "bpf_object__find_program_by_name"))
+ goto out;
+
+ link = bpf_program__attach(prog);
+ if (!ASSERT_OK_PTR(link, "bpf_program__attach"))
+ goto out;
+
+ map = bpf_object__find_map_by_name(skel->obj, "high_mcg_ops");
+ if (!ASSERT_OK_PTR(map, "bpf_object__find_map_by_name high_mcg_ops"))
+ goto out;
+ opts.relative_fd = high_cgroup_fd;
+ link_high = bpf_map__attach_struct_ops_opts(map, &opts);
+ if (!ASSERT_OK_PTR(link_high, "bpf_map__attach_struct_ops_opts"))
+ goto out;
+
+ map = bpf_object__find_map_by_name(skel->obj, "low_mcg_ops");
+ if (!ASSERT_OK_PTR(map, "bpf_object__find_map_by_name low_mcg_ops"))
+ goto out;
+ opts.relative_fd = low_cgroup_fd;
+ link_low = bpf_map__attach_struct_ops_opts(map, &opts);
+ if (!ASSERT_OK_PTR(link_low, "bpf_map__attach_struct_ops_opts"))
+ goto out;
+
+ real_test_memcg_ops(50);
+
+out:
+ bpf_link__destroy(link);
+ bpf_link__destroy(link_high);
+ bpf_link__destroy(link_low);
+ if (skel) {
+ memcg_ops__detach(skel);
+ memcg_ops__destroy(skel);
+ }
+ close(high_cgroup_fd);
+ close(low_cgroup_fd);
+ cleanup_cgroup_environment();
+}
+
+void test_memcg_ops_below_min_over_high(void)
+{
+ int err, map_fd;
+ struct memcg_ops *skel = NULL;
+ struct bpf_map *map;
+ struct memcg_ops__bss *bss_data;
+ __u32 key = 0;
+ struct bpf_program *prog = NULL;
+ struct bpf_link *link = NULL, *link_high = NULL, *link_low = NULL;
+ DECLARE_LIBBPF_OPTS(bpf_struct_ops_opts, opts);
+ u64 high_cgroup_id;
+ int high_cgroup_fd = -1, low_cgroup_fd = -1;
+
+ err = setup_cgroup(&high_cgroup_id, &low_cgroup_fd, &high_cgroup_fd);
+ if (!ASSERT_OK(err, "setup_cgroup"))
+ goto out;
+
+ skel = memcg_ops__open_and_load();
+ if (!ASSERT_OK_PTR(skel, "memcg_ops__open_and_load"))
+ goto out;
+
+ map = bpf_object__find_map_by_name(skel->obj, ".bss");
+ if (!ASSERT_OK_PTR(map, "bpf_object__find_map_by_name .bss"))
+ goto out;
+
+ map_fd = bpf_map__fd(map);
+ bss_data = calloc(1, bpf_map__value_size(map));
+ if (!ASSERT_OK_PTR(bss_data, "calloc(1, bpf_map__value_size(map))"))
+ goto out;
+ bss_data->local_config.high_cgroup_id = high_cgroup_id;
+ bss_data->local_config.threshold = TRIGGER_THRESHOLD;
+ bss_data->local_config.use_below_low = false;
+ bss_data->local_config.use_below_min = true;
+ bss_data->local_config.over_high_ms = OVER_HIGH_MS;
+ err = bpf_map_update_elem(map_fd, &key, bss_data, BPF_EXIST);
+ free(bss_data);
+ if (!ASSERT_OK(err, "bpf_map_update_elem"))
+ goto out;
+
+ prog = bpf_object__find_program_by_name(skel->obj,
+ "handle_count_memcg_events");
+ if (!ASSERT_OK_PTR(prog, "bpf_object__find_program_by_name"))
+ goto out;
+
+ link = bpf_program__attach(prog);
+ if (!ASSERT_OK_PTR(link, "bpf_program__attach"))
+ goto out;
+
+ map = bpf_object__find_map_by_name(skel->obj, "high_mcg_ops");
+ if (!ASSERT_OK_PTR(map, "bpf_object__find_map_by_name high_mcg_ops"))
+ goto out;
+ opts.relative_fd = high_cgroup_fd;
+ link_high = bpf_map__attach_struct_ops_opts(map, &opts);
+ if (!ASSERT_OK_PTR(link_high, "bpf_map__attach_struct_ops_opts"))
+ goto out;
+
+ map = bpf_object__find_map_by_name(skel->obj, "low_mcg_ops");
+ if (!ASSERT_OK_PTR(map, "bpf_object__find_map_by_name low_mcg_ops"))
+ goto out;
+ opts.relative_fd = low_cgroup_fd;
+ link_low = bpf_map__attach_struct_ops_opts(map, &opts);
+ if (!ASSERT_OK_PTR(link_low, "bpf_map__attach_struct_ops_opts"))
+ goto out;
+
+ real_test_memcg_ops(50);
+
+out:
+ bpf_link__destroy(link);
+ bpf_link__destroy(link_high);
+ bpf_link__destroy(link_low);
+ if (skel) {
+ memcg_ops__detach(skel);
+ memcg_ops__destroy(skel);
+ }
+ close(high_cgroup_fd);
+ close(low_cgroup_fd);
+ cleanup_cgroup_environment();
+}
diff --git a/tools/testing/selftests/bpf/progs/memcg_ops.c b/tools/testing/selftests/bpf/progs/memcg_ops.c
new file mode 100644
index 000000000000..97c5897933c7
--- /dev/null
+++ b/tools/testing/selftests/bpf/progs/memcg_ops.c
@@ -0,0 +1,130 @@
+// SPDX-License-Identifier: GPL-2.0
+
+#include "vmlinux.h"
+#include <bpf/bpf_helpers.h>
+#include <bpf/bpf_tracing.h>
+
+#define ONE_SECOND_NS 1000000000
+
+struct local_config {
+ u64 threshold;
+ u64 high_cgroup_id;
+ bool use_below_low;
+ bool use_below_min;
+ unsigned int over_high_ms;
+} local_config;
+
+struct AggregationData {
+ u64 sum;
+ u64 window_start_ts;
+};
+
+struct {
+ __uint(type, BPF_MAP_TYPE_ARRAY);
+ __uint(max_entries, 1);
+ __type(key, u32);
+ __type(value, struct AggregationData);
+} aggregation_map SEC(".maps");
+
+struct {
+ __uint(type, BPF_MAP_TYPE_ARRAY);
+ __uint(max_entries, 1);
+ __type(key, u32);
+ __type(value, u64);
+} trigger_ts_map SEC(".maps");
+
+SEC("tp/memcg/count_memcg_events")
+int
+handle_count_memcg_events(struct trace_event_raw_memcg_rstat_events *ctx)
+{
+ u32 key = 0;
+ struct AggregationData *data;
+ u64 current_ts;
+
+ if (ctx->id != local_config.high_cgroup_id ||
+ (ctx->item != PGFAULT))
+ goto out;
+
+ data = bpf_map_lookup_elem(&aggregation_map, &key);
+ if (!data)
+ goto out;
+
+ current_ts = bpf_ktime_get_ns();
+
+ if (current_ts - data->window_start_ts < ONE_SECOND_NS) {
+ data->sum += ctx->val;
+ } else {
+ data->window_start_ts = current_ts;
+ data->sum = ctx->val;
+ }
+
+ if (data->sum > local_config.threshold) {
+ bpf_map_update_elem(&trigger_ts_map, &key, &current_ts,
+ BPF_ANY);
+ data->sum = 0;
+ data->window_start_ts = current_ts;
+ }
+
+out:
+ return 0;
+}
+
+static bool need_threshold(void)
+{
+ u32 key = 0;
+ u64 *trigger_ts;
+ bool ret = false;
+ u64 current_ts;
+
+ trigger_ts = bpf_map_lookup_elem(&trigger_ts_map, &key);
+ if (!trigger_ts || *trigger_ts == 0)
+ goto out;
+
+ current_ts = bpf_ktime_get_ns();
+
+ if (current_ts - *trigger_ts < ONE_SECOND_NS)
+ ret = true;
+
+out:
+ return ret;
+}
+
+SEC("struct_ops/below_low")
+bool below_low_impl(struct mem_cgroup *memcg)
+{
+ if (!local_config.use_below_low)
+ return false;
+
+ return need_threshold();
+}
+
+SEC("struct_ops/below_min")
+bool below_min_impl(struct mem_cgroup *memcg)
+{
+ if (!local_config.use_below_min)
+ return false;
+
+ return need_threshold();
+}
+
+SEC("struct_ops/get_high_delay_ms")
+unsigned int get_high_delay_ms_impl(struct mem_cgroup *memcg)
+{
+ if (local_config.over_high_ms && need_threshold())
+ return local_config.over_high_ms;
+
+ return 0;
+}
+
+SEC(".struct_ops.link")
+struct memcg_bpf_ops high_mcg_ops = {
+ .below_low = (void *)below_low_impl,
+ .below_min = (void *)below_min_impl,
+};
+
+SEC(".struct_ops.link")
+struct memcg_bpf_ops low_mcg_ops = {
+ .get_high_delay_ms = (void *)get_high_delay_ms_impl,
+};
+
+char LICENSE[] SEC("license") = "GPL";
--
2.43.0
^ permalink raw reply [flat|nested] 18+ messages in thread
* [RFC PATCH bpf-next v6 10/12] mm/bpf: Add BPF_F_ALLOW_OVERRIDE support for memcg_bpf_ops
2026-02-04 8:56 [RFC PATCH bpf-next v6 00/12] mm: memcontrol: Add BPF hooks for memory controller Hui Zhu
` (8 preceding siblings ...)
2026-02-04 9:00 ` [RFC PATCH bpf-next v6 09/12] selftests/bpf: Add tests for memcg_bpf_ops Hui Zhu
@ 2026-02-04 9:00 ` Hui Zhu
2026-02-04 9:00 ` [RFC PATCH bpf-next v6 11/12] selftests/bpf: Add test for memcg_bpf_ops hierarchies Hui Zhu
2026-02-04 9:00 ` [RFC PATCH bpf-next v6 12/12] samples/bpf: Add memcg priority control example Hui Zhu
11 siblings, 0 replies; 18+ messages in thread
From: Hui Zhu @ 2026-02-04 9:00 UTC (permalink / raw)
To: Andrew Morton, Johannes Weiner, Michal Hocko, Roman Gushchin,
Shakeel Butt, Muchun Song, Alexei Starovoitov, Daniel Borkmann,
Andrii Nakryiko, Martin KaFai Lau, Eduard Zingerman, Song Liu,
Yonghong Song, John Fastabend, KP Singh, Stanislav Fomichev,
Hao Luo, Jiri Olsa, Shuah Khan, Peter Zijlstra, Miguel Ojeda,
Nathan Chancellor, Kees Cook, Tejun Heo, Jeff Xu, mkoutny,
Jan Hendrik Farr, Christian Brauner, Randy Dunlap, Brian Gerst,
Masahiro Yamada, davem, Jakub Kicinski, Jesper Dangaard Brouer,
JP Kobryn, Willem de Bruijn, Jason Xing, Paul Chaignon,
Anton Protopopov, Amery Hung, Chen Ridong, Lance Yang,
Jiayuan Chen, linux-kernel, linux-mm, cgroups, bpf, netdev,
linux-kselftest
Cc: Hui Zhu, Geliang Tang
From: Hui Zhu <zhuhui@kylinos.cn>
To allow for more flexible attachment policies in nested cgroup
hierarchies, this patch introduces support for the
`BPF_F_ALLOW_OVERRIDE` flag for `memcg_bpf_ops`.
When a `memcg_bpf_ops` is attached to a cgroup with this flag, it
permits child cgroups to attach their own, different `memcg_bpf_ops`,
overriding the parent's inherited program. Without this flag,
attaching a BPF program to a cgroup that already has one (either
directly or via inheritance) will fail.
The implementation involves:
- Adding a `bpf_ops_flags` field to `struct mem_cgroup`.
- During registration (`bpf_memcg_ops_reg`), checking for existing
programs and the `BPF_F_ALLOW_OVERRIDE` flag.
- During unregistration (`bpf_memcg_ops_unreg`), correctly restoring
the parent's BPF program to the cgroup hierarchy.
- Ensuring flags are inherited by child cgroups during online events.
This change enables complex, multi-level policy enforcement where
different subtrees of the cgroup hierarchy can have distinct memory
management BPF programs.
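For illustration, a minimal userspace sketch of the intended semantics,
built on the bpf_map__attach_struct_ops_opts() API used elsewhere in
this series (the cgroup fds and the map variable are hypothetical):
	DECLARE_LIBBPF_OPTS(bpf_struct_ops_opts, opts);
	struct bpf_link *parent_link, *child_link;
	/* Parent attach: allow descendants to override the inherited ops. */
	opts.relative_fd = parent_cgroup_fd;
	opts.flags = BPF_F_ALLOW_OVERRIDE;
	parent_link = bpf_map__attach_struct_ops_opts(map, &opts);
	/*
	 * Child attach: succeeds only because the parent was attached with
	 * BPF_F_ALLOW_OVERRIDE; without that flag on the parent, this
	 * attach would fail with -EBUSY.
	 */
	opts.relative_fd = child_cgroup_fd;
	opts.flags = 0;
	child_link = bpf_map__attach_struct_ops_opts(map, &opts);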
Signed-off-by: Geliang Tang <geliang@kernel.org>
Signed-off-by: Hui Zhu <zhuhui@kylinos.cn>
---
include/linux/memcontrol.h | 1 +
mm/bpf_memcontrol.c | 96 ++++++++++++++++++++++++++------------
2 files changed, 66 insertions(+), 31 deletions(-)
diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h
index d91dbb95069b..c7b32a01a854 100644
--- a/include/linux/memcontrol.h
+++ b/include/linux/memcontrol.h
@@ -355,6 +355,7 @@ struct mem_cgroup {
#ifdef CONFIG_BPF_SYSCALL
struct memcg_bpf_ops *bpf_ops;
+ u32 bpf_ops_flags;
#endif
struct mem_cgroup_per_node *nodeinfo[];
diff --git a/mm/bpf_memcontrol.c b/mm/bpf_memcontrol.c
index 72b720400628..909751263f98 100644
--- a/mm/bpf_memcontrol.c
+++ b/mm/bpf_memcontrol.c
@@ -204,10 +204,11 @@ void memcontrol_bpf_online(struct mem_cgroup *memcg)
/*
* Because only functions bpf_memcg_ops_reg and bpf_memcg_ops_unreg
- * write to memcg->bpf_ops under the protection of cgroup_mutex,
- * ensuring that cgroup_mutex is already locked here allows safe
- * reading and writing of memcg->bpf_ops without needing to acquire
- * a lock on memcg_bpf_srcu.
+ * write to memcg->bpf_ops and memcg->bpf_ops_flags under the
+ * protection of cgroup_mutex, ensuring that cgroup_mutex is already
+ * locked here allows safe reading and writing of memcg->bpf_ops and
+ * memcg->bpf_ops_flags without needing to acquire a lock on
+ * memcg_bpf_srcu.
*/
lockdep_assert_held(&cgroup_mutex);
@@ -218,6 +219,7 @@ void memcontrol_bpf_online(struct mem_cgroup *memcg)
if (!ops)
return;
WRITE_ONCE(memcg->bpf_ops, ops);
+ memcg->bpf_ops_flags = parent_memcg->bpf_ops_flags;
/*
* If the BPF program implements it, call the online handler to
@@ -239,7 +241,7 @@ void memcontrol_bpf_offline(struct mem_cgroup *memcg)
{
struct memcg_bpf_ops *ops;
- /* Same with function memcontrol_bpf_online. */
+ /* Same locking rules as memcontrol_bpf_online(). */
lockdep_assert_held(&cgroup_mutex);
ops = READ_ONCE(memcg->bpf_ops);
@@ -335,48 +337,62 @@ static int bpf_memcg_ops_init_member(const struct btf_type *t,
return 0;
}
-/**
- * clean_memcg_bpf_ops - Clear BPF ops from a memory cgroup hierarchy
- * @memcg: Root memory cgroup to start from
- * @ops: The specific BPF ops to remove
- *
- * Walks the cgroup hierarchy and clears bpf_ops for any cgroup that
- * matches @ops.
- */
-static void clean_memcg_bpf_ops(struct mem_cgroup *memcg,
- struct memcg_bpf_ops *ops)
-{
- struct mem_cgroup *iter = NULL;
-
- while ((iter = mem_cgroup_iter(memcg, iter, NULL))) {
- if (READ_ONCE(iter->bpf_ops) == ops)
- WRITE_ONCE(iter->bpf_ops, NULL);
- }
-}
-
static int bpf_memcg_ops_reg(void *kdata, struct bpf_link *link)
{
struct bpf_struct_ops_link *ops_link
= container_of(link, struct bpf_struct_ops_link, link);
- struct memcg_bpf_ops *ops = kdata;
- struct mem_cgroup *memcg, *iter = NULL;
+ struct memcg_bpf_ops *ops = kdata, *old_ops;
+ struct mem_cgroup *memcg, *iter;
int err = 0;
+ if (ops_link->flags & ~BPF_F_ALLOW_OVERRIDE) {
+ pr_err("only BPF_F_ALLOW_OVERRIDE supported for struct_ops\n");
+ return -EOPNOTSUPP;
+ }
+
memcg = mem_cgroup_get_from_ino(ops_link->cgroup_id);
if (IS_ERR(memcg))
return PTR_ERR(memcg);
cgroup_lock();
+
+ /*
+ * Check if memcg has bpf_ops and whether it is inherited from
+ * parent.
+ * If inherited and BPF_F_ALLOW_OVERRIDE is set, allow override.
+ */
+ old_ops = READ_ONCE(memcg->bpf_ops);
+ if (old_ops) {
+ struct mem_cgroup *parent_memcg = parent_mem_cgroup(memcg);
+
+ if (!parent_memcg ||
+ !(memcg->bpf_ops_flags & BPF_F_ALLOW_OVERRIDE) ||
+ READ_ONCE(parent_memcg->bpf_ops) != old_ops) {
+ err = -EBUSY;
+ goto unlock_out;
+ }
+ }
+
+ /* Check for incompatible bpf_ops in descendants. */
+ iter = NULL;
while ((iter = mem_cgroup_iter(memcg, iter, NULL))) {
- if (READ_ONCE(iter->bpf_ops)) {
+ struct memcg_bpf_ops *iter_ops = READ_ONCE(iter->bpf_ops);
+
+ if (iter_ops && iter_ops != old_ops) {
+ /* cannot override existing bpf_ops of sub-cgroup. */
mem_cgroup_iter_break(memcg, iter);
err = -EBUSY;
- break;
+ goto unlock_out;
}
+ }
+
+ iter = NULL;
+ while ((iter = mem_cgroup_iter(memcg, iter, NULL))) {
WRITE_ONCE(iter->bpf_ops, ops);
+ iter->bpf_ops_flags = ops_link->flags;
}
- if (err)
- clean_memcg_bpf_ops(memcg, ops);
+
+unlock_out:
cgroup_unlock();
mem_cgroup_put(memcg);
@@ -390,13 +406,31 @@ static void bpf_memcg_ops_unreg(void *kdata, struct bpf_link *link)
= container_of(link, struct bpf_struct_ops_link, link);
struct memcg_bpf_ops *ops = kdata;
struct mem_cgroup *memcg;
+ struct mem_cgroup *iter;
+ struct memcg_bpf_ops *parent_bpf_ops = NULL;
+ u32 parent_bpf_ops_flags = 0;
memcg = mem_cgroup_get_from_ino(ops_link->cgroup_id);
if (IS_ERR_OR_NULL(memcg))
goto out;
cgroup_lock();
- clean_memcg_bpf_ops(memcg, ops);
+
+ /* Get the parent bpf_ops and bpf_ops_flags */
+ iter = parent_mem_cgroup(memcg);
+ if (iter) {
+ parent_bpf_ops = READ_ONCE(iter->bpf_ops);
+ parent_bpf_ops_flags = iter->bpf_ops_flags;
+ }
+
+ iter = NULL;
+ while ((iter = mem_cgroup_iter(memcg, iter, NULL))) {
+ if (READ_ONCE(iter->bpf_ops) == ops) {
+ WRITE_ONCE(iter->bpf_ops, parent_bpf_ops);
+ iter->bpf_ops_flags = parent_bpf_ops_flags;
+ }
+ }
+
cgroup_unlock();
mem_cgroup_put(memcg);
--
2.43.0
^ permalink raw reply [flat|nested] 18+ messages in thread
* [RFC PATCH bpf-next v6 11/12] selftests/bpf: Add test for memcg_bpf_ops hierarchies
2026-02-04 8:56 [RFC PATCH bpf-next v6 00/12] mm: memcontrol: Add BPF hooks for memory controller Hui Zhu
` (9 preceding siblings ...)
2026-02-04 9:00 ` [RFC PATCH bpf-next v6 10/12] mm/bpf: Add BPF_F_ALLOW_OVERRIDE support " Hui Zhu
@ 2026-02-04 9:00 ` Hui Zhu
2026-02-04 9:28 ` bot+bpf-ci
2026-02-04 9:00 ` [RFC PATCH bpf-next v6 12/12] samples/bpf: Add memcg priority control example Hui Zhu
11 siblings, 1 reply; 18+ messages in thread
From: Hui Zhu @ 2026-02-04 9:00 UTC (permalink / raw)
To: Andrew Morton, Johannes Weiner, Michal Hocko, Roman Gushchin,
Shakeel Butt, Muchun Song, Alexei Starovoitov, Daniel Borkmann,
Andrii Nakryiko, Martin KaFai Lau, Eduard Zingerman, Song Liu,
Yonghong Song, John Fastabend, KP Singh, Stanislav Fomichev,
Hao Luo, Jiri Olsa, Shuah Khan, Peter Zijlstra, Miguel Ojeda,
Nathan Chancellor, Kees Cook, Tejun Heo, Jeff Xu, mkoutny,
Jan Hendrik Farr, Christian Brauner, Randy Dunlap, Brian Gerst,
Masahiro Yamada, davem, Jakub Kicinski, Jesper Dangaard Brouer,
JP Kobryn, Willem de Bruijn, Jason Xing, Paul Chaignon,
Anton Protopopov, Amery Hung, Chen Ridong, Lance Yang,
Jiayuan Chen, linux-kernel, linux-mm, cgroups, bpf, netdev,
linux-kselftest
Cc: Hui Zhu, Geliang Tang
From: Hui Zhu <zhuhui@kylinos.cn>
Add a new selftest, `test_memcg_ops_hierarchies`, to validate the
behavior of attaching `memcg_bpf_ops` in a nested cgroup hierarchy,
specifically testing the `BPF_F_ALLOW_OVERRIDE` flag.
The test case performs the following steps:
1. Creates a three-level deep cgroup hierarchy: `/cg`, `/cg/cg`, and
`/cg/cg/cg`.
2. Attaches a BPF struct_ops to the top-level cgroup (`/cg`) with the
`BPF_F_ALLOW_OVERRIDE` flag.
3. Successfully attaches a new struct_ops to the middle cgroup
(`/cg/cg`) without the flag, overriding the inherited one.
4. Asserts that attaching another struct_ops to the deepest cgroup
(`/cg/cg/cg`) fails with -EBUSY, because its parent did not specify
`BPF_F_ALLOW_OVERRIDE`.
This test ensures that the attachment logic correctly enforces the
override rules across a cgroup subtree.
Signed-off-by: Geliang Tang <geliang@kernel.org>
Signed-off-by: Hui Zhu <zhuhui@kylinos.cn>
---
.../selftests/bpf/prog_tests/memcg_ops.c | 71 +++++++++++++++++++
1 file changed, 71 insertions(+)
diff --git a/tools/testing/selftests/bpf/prog_tests/memcg_ops.c b/tools/testing/selftests/bpf/prog_tests/memcg_ops.c
index 8c787439f83c..378ee3b3bc01 100644
--- a/tools/testing/selftests/bpf/prog_tests/memcg_ops.c
+++ b/tools/testing/selftests/bpf/prog_tests/memcg_ops.c
@@ -553,3 +553,74 @@ void test_memcg_ops_below_min_over_high(void)
close(low_cgroup_fd);
cleanup_cgroup_environment();
}
+
+void test_memcg_ops_hierarchies(void)
+{
+ int ret, first = -1, second = -1, third = -1;
+ struct memcg_ops *skel = NULL;
+ struct bpf_map *map;
+ struct bpf_link *link1 = NULL, *link2 = NULL, *link3 = NULL;
+ DECLARE_LIBBPF_OPTS(bpf_struct_ops_opts, opts);
+
+ ret = setup_cgroup_environment();
+ if (!ASSERT_OK(ret, "setup_cgroup_environment"))
+ goto cleanup;
+
+ first = create_and_get_cgroup("/cg");
+ if (!ASSERT_GE(first, 0, "create_and_get_cgroup /cg"))
+ goto cleanup;
+ ret = enable_controllers("/cg", "memory");
+ if (!ASSERT_OK(ret, "enable_controllers"))
+ goto cleanup;
+
+ second = create_and_get_cgroup("/cg/cg");
+ if (!ASSERT_GE(second, 0, "create_and_get_cgroup /cg/cg"))
+ goto cleanup;
+ ret = enable_controllers("/cg/cg", "memory");
+ if (!ASSERT_OK(ret, "enable_controllers"))
+ goto cleanup;
+
+ third = create_and_get_cgroup("/cg/cg/cg");
+ if (!ASSERT_GE(third, 0, "create_and_get_cgroup /cg/cg/cg"))
+ goto cleanup;
+ ret = enable_controllers("/cg/cg/cg", "memory");
+ if (!ASSERT_OK(ret, "enable_controllers"))
+ goto cleanup;
+
+ skel = memcg_ops__open_and_load();
+ if (!ASSERT_OK_PTR(skel, "memcg_ops__open_and_load"))
+ goto cleanup;
+
+ map = bpf_object__find_map_by_name(skel->obj, "low_mcg_ops");
+ if (!ASSERT_OK_PTR(map, "bpf_object__find_map_by_name low_mcg_ops"))
+ goto cleanup;
+
+ opts.relative_fd = first;
+ opts.flags = BPF_F_ALLOW_OVERRIDE;
+ link1 = bpf_map__attach_struct_ops_opts(map, &opts);
+ if (!ASSERT_OK_PTR(link1, "bpf_map__attach_struct_ops_opts"))
+ goto cleanup;
+
+ opts.relative_fd = second;
+ opts.flags = 0;
+ link2 = bpf_map__attach_struct_ops_opts(map, &opts);
+ if (!ASSERT_OK_PTR(link2, "bpf_map__attach_struct_ops_opts"))
+ goto cleanup;
+
+ opts.relative_fd = third;
+ opts.flags = 0;
+ link3 = bpf_map__attach_struct_ops_opts(map, &opts);
+ if (!ASSERT_ERR_PTR(link3, "bpf_map__attach_struct_ops_opts"))
+ goto cleanup;
+
+cleanup:
+ bpf_link__destroy(link1);
+ bpf_link__destroy(link2);
+ bpf_link__destroy(link3);
+ memcg_ops__detach(skel);
+ memcg_ops__destroy(skel);
+ close(first);
+ close(second);
+ close(third);
+ cleanup_cgroup_environment();
+}
--
2.43.0
^ permalink raw reply [flat|nested] 18+ messages in thread
* [RFC PATCH bpf-next v6 12/12] samples/bpf: Add memcg priority control example
2026-02-04 8:56 [RFC PATCH bpf-next v6 00/12] mm: memcontrol: Add BPF hooks for memory controller Hui Zhu
` (10 preceding siblings ...)
2026-02-04 9:00 ` [RFC PATCH bpf-next v6 11/12] selftests/bpf: Add test for memcg_bpf_ops hierarchies Hui Zhu
@ 2026-02-04 9:00 ` Hui Zhu
2026-02-04 9:28 ` bot+bpf-ci
11 siblings, 1 reply; 18+ messages in thread
From: Hui Zhu @ 2026-02-04 9:00 UTC (permalink / raw)
To: Andrew Morton, Johannes Weiner, Michal Hocko, Roman Gushchin,
Shakeel Butt, Muchun Song, Alexei Starovoitov, Daniel Borkmann,
Andrii Nakryiko, Martin KaFai Lau, Eduard Zingerman, Song Liu,
Yonghong Song, John Fastabend, KP Singh, Stanislav Fomichev,
Hao Luo, Jiri Olsa, Shuah Khan, Peter Zijlstra, Miguel Ojeda,
Nathan Chancellor, Kees Cook, Tejun Heo, Jeff Xu, mkoutny,
Jan Hendrik Farr, Christian Brauner, Randy Dunlap, Brian Gerst,
Masahiro Yamada, davem, Jakub Kicinski, Jesper Dangaard Brouer,
JP Kobryn, Willem de Bruijn, Jason Xing, Paul Chaignon,
Anton Protopopov, Amery Hung, Chen Ridong, Lance Yang,
Jiayuan Chen, linux-kernel, linux-mm, cgroups, bpf, netdev,
linux-kselftest
Cc: Hui Zhu, Geliang Tang
From: Hui Zhu <zhuhui@kylinos.cn>
Add a sample program to demonstrate a practical use case for the
`memcg_bpf_ops` feature: priority-based memory throttling.
The sample consists of a BPF program and a userspace loader:
1. memcg.bpf.c: A BPF program that monitors PGFAULT events on a
high-priority cgroup. When activity exceeds a threshold, it uses
the `get_high_delay_ms`, `below_low`, or `below_min` hooks to
apply pressure on a low-priority cgroup.
2. memcg.c: A userspace loader that configures and attaches the BPF
program. It takes command-line arguments for the high and low
priority cgroup paths, a pressure threshold, and the desired
throttling delay (`over_high_ms`).
This provides a clear, working example of how to implement a dynamic,
priority-aware memory management policy. A user can create two
cgroups, run workloads of different priorities, and observe the
low-priority workload being throttled to protect the high-priority one.
Example usage:
# ./memcg --low_path /sys/fs/cgroup/low \
# --high_path /sys/fs/cgroup/high \
# --threshold 100 --over_high_ms 1024
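The example assumes both cgroups already exist with the memory
controller enabled; a minimal cgroup v2 setup might be:
# echo "+memory" > /sys/fs/cgroup/cgroup.subtree_control
# mkdir /sys/fs/cgroup/high /sys/fs/cgroup/low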
Signed-off-by: Geliang Tang <geliang@kernel.org>
Signed-off-by: Hui Zhu <zhuhui@kylinos.cn>
---
MAINTAINERS | 2 +
samples/bpf/.gitignore | 1 +
samples/bpf/Makefile | 8 +-
samples/bpf/memcg.bpf.c | 130 +++++++++++++++
samples/bpf/memcg.c | 343 ++++++++++++++++++++++++++++++++++++++++
5 files changed, 483 insertions(+), 1 deletion(-)
create mode 100644 samples/bpf/memcg.bpf.c
create mode 100644 samples/bpf/memcg.c
diff --git a/MAINTAINERS b/MAINTAINERS
index 7e07bb330eae..819ef271e011 100644
--- a/MAINTAINERS
+++ b/MAINTAINERS
@@ -6470,6 +6470,8 @@ F: mm/memcontrol-v1.c
F: mm/memcontrol-v1.h
F: mm/page_counter.c
F: mm/swap_cgroup.c
+F: samples/bpf/memcg.bpf.c
+F: samples/bpf/memcg.c
F: samples/cgroup/*
F: tools/testing/selftests/bpf/prog_tests/memcg_ops.c
F: tools/testing/selftests/bpf/progs/memcg_ops.c
diff --git a/samples/bpf/.gitignore b/samples/bpf/.gitignore
index 0002cd359fb1..0de6569cdefd 100644
--- a/samples/bpf/.gitignore
+++ b/samples/bpf/.gitignore
@@ -49,3 +49,4 @@ iperf.*
/vmlinux.h
/bpftool/
/libbpf/
+memcg
diff --git a/samples/bpf/Makefile b/samples/bpf/Makefile
index 95a4fa1f1e44..b00698bdc53b 100644
--- a/samples/bpf/Makefile
+++ b/samples/bpf/Makefile
@@ -37,6 +37,7 @@ tprogs-y += xdp_fwd
tprogs-y += task_fd_query
tprogs-y += ibumad
tprogs-y += hbm
+tprogs-y += memcg
# Libbpf dependencies
LIBBPF_SRC = $(TOOLS_PATH)/lib/bpf
@@ -122,6 +123,7 @@ always-y += task_fd_query_kern.o
always-y += ibumad_kern.o
always-y += hbm_out_kern.o
always-y += hbm_edt_kern.o
+always-y += memcg.bpf.o
COMMON_CFLAGS = $(TPROGS_USER_CFLAGS)
TPROGS_LDFLAGS = $(TPROGS_USER_LDFLAGS)
@@ -289,6 +291,8 @@ $(obj)/hbm_out_kern.o: $(src)/hbm.h $(src)/hbm_kern.h
$(obj)/hbm.o: $(src)/hbm.h
$(obj)/hbm_edt_kern.o: $(src)/hbm.h $(src)/hbm_kern.h
+memcg: $(obj)/memcg.skel.h
+
# Override includes for xdp_sample_user.o because $(srctree)/usr/include in
# TPROGS_CFLAGS causes conflicts
XDP_SAMPLE_CFLAGS += -Wall -O2 \
@@ -347,11 +351,13 @@ $(obj)/%.bpf.o: $(src)/%.bpf.c $(obj)/vmlinux.h $(src)/xdp_sample.bpf.h $(src)/x
-I$(LIBBPF_INCLUDE) $(CLANG_SYS_INCLUDES) \
-c $(filter %.bpf.c,$^) -o $@
-LINKED_SKELS := xdp_router_ipv4.skel.h
+LINKED_SKELS := xdp_router_ipv4.skel.h memcg.skel.h
clean-files += $(LINKED_SKELS)
xdp_router_ipv4.skel.h-deps := xdp_router_ipv4.bpf.o xdp_sample.bpf.o
+memcg.skel.h-deps := memcg.bpf.o
+
LINKED_BPF_SRCS := $(patsubst %.bpf.o,%.bpf.c,$(foreach skel,$(LINKED_SKELS),$($(skel)-deps)))
BPF_SRCS_LINKED := $(notdir $(wildcard $(src)/*.bpf.c))
diff --git a/samples/bpf/memcg.bpf.c b/samples/bpf/memcg.bpf.c
new file mode 100644
index 000000000000..97c5897933c7
--- /dev/null
+++ b/samples/bpf/memcg.bpf.c
@@ -0,0 +1,130 @@
+// SPDX-License-Identifier: GPL-2.0
+
+#include "vmlinux.h"
+#include <bpf/bpf_helpers.h>
+#include <bpf/bpf_tracing.h>
+
+#define ONE_SECOND_NS 1000000000
+
+struct local_config {
+ u64 threshold;
+ u64 high_cgroup_id;
+ bool use_below_low;
+ bool use_below_min;
+ unsigned int over_high_ms;
+} local_config;
+
+struct AggregationData {
+ u64 sum;
+ u64 window_start_ts;
+};
+
+struct {
+ __uint(type, BPF_MAP_TYPE_ARRAY);
+ __uint(max_entries, 1);
+ __type(key, u32);
+ __type(value, struct AggregationData);
+} aggregation_map SEC(".maps");
+
+struct {
+ __uint(type, BPF_MAP_TYPE_ARRAY);
+ __uint(max_entries, 1);
+ __type(key, u32);
+ __type(value, u64);
+} trigger_ts_map SEC(".maps");
+
+SEC("tp/memcg/count_memcg_events")
+int
+handle_count_memcg_events(struct trace_event_raw_memcg_rstat_events *ctx)
+{
+ u32 key = 0;
+ struct AggregationData *data;
+ u64 current_ts;
+
+ if (ctx->id != local_config.high_cgroup_id ||
+ (ctx->item != PGFAULT))
+ goto out;
+
+ data = bpf_map_lookup_elem(&aggregation_map, &key);
+ if (!data)
+ goto out;
+
+ current_ts = bpf_ktime_get_ns();
+
+ if (current_ts - data->window_start_ts < ONE_SECOND_NS) {
+ data->sum += ctx->val;
+ } else {
+ data->window_start_ts = current_ts;
+ data->sum = ctx->val;
+ }
+
+ if (data->sum > local_config.threshold) {
+ bpf_map_update_elem(&trigger_ts_map, &key, &current_ts,
+ BPF_ANY);
+ data->sum = 0;
+ data->window_start_ts = current_ts;
+ }
+
+out:
+ return 0;
+}
+
+static bool need_threshold(void)
+{
+ u32 key = 0;
+ u64 *trigger_ts;
+ bool ret = false;
+ u64 current_ts;
+
+ trigger_ts = bpf_map_lookup_elem(&trigger_ts_map, &key);
+ if (!trigger_ts || *trigger_ts == 0)
+ goto out;
+
+ current_ts = bpf_ktime_get_ns();
+
+ if (current_ts - *trigger_ts < ONE_SECOND_NS)
+ ret = true;
+
+out:
+ return ret;
+}
+
+SEC("struct_ops/below_low")
+bool below_low_impl(struct mem_cgroup *memcg)
+{
+ if (!local_config.use_below_low)
+ return false;
+
+ return need_threshold();
+}
+
+SEC("struct_ops/below_min")
+bool below_min_impl(struct mem_cgroup *memcg)
+{
+ if (!local_config.use_below_min)
+ return false;
+
+ return need_threshold();
+}
+
+SEC("struct_ops/get_high_delay_ms")
+unsigned int get_high_delay_ms_impl(struct mem_cgroup *memcg)
+{
+ if (local_config.over_high_ms && need_threshold())
+ return local_config.over_high_ms;
+
+ return 0;
+}
+
+SEC(".struct_ops.link")
+struct memcg_bpf_ops high_mcg_ops = {
+ .below_low = (void *)below_low_impl,
+ .below_min = (void *)below_min_impl,
+};
+
+SEC(".struct_ops.link")
+struct memcg_bpf_ops low_mcg_ops = {
+ .get_high_delay_ms = (void *)get_high_delay_ms_impl,
+};
+
+char LICENSE[] SEC("license") = "GPL";
diff --git a/samples/bpf/memcg.c b/samples/bpf/memcg.c
new file mode 100644
index 000000000000..0ed174608a15
--- /dev/null
+++ b/samples/bpf/memcg.c
@@ -0,0 +1,343 @@
+// SPDX-License-Identifier: GPL-2.0
+
+#include <stdio.h>
+#include <stdlib.h>
+#include <stdint.h>
+#include <string.h>
+#include <sys/types.h>
+#include <sys/stat.h>
+#include <fcntl.h>
+#include <unistd.h>
+#include <errno.h>
+#include <signal.h>
+#include <stdbool.h>
+#include <getopt.h>
+#include <bpf/bpf.h>
+#include <bpf/libbpf.h>
+
+#ifndef __MEMCG_RSTAT_SIMPLE_BPF_SKEL_H__
+#define u64 uint64_t
+#endif
+
+struct local_config {
+ u64 threshold;
+ u64 high_cgroup_id;
+ bool use_below_low;
+ bool use_below_min;
+ unsigned int over_high_ms;
+} local_config;
+
+#include "memcg.skel.h"
+
+static bool exiting;
+
+static void sig_handler(int sig)
+{
+ exiting = true;
+}
+
+static void usage(const char *name)
+{
+ fprintf(stderr,
+ "Usage: %s --low_path=<path> --high_path=<path> \\\n"
+ " --threshold=<value> [OPTIONS]\n\n",
+ name);
+ fprintf(stderr, "Required arguments:\n");
+ fprintf(stderr,
+ " -l, --low_path=PATH Low priority memcgroup path\n");
+ fprintf(stderr,
+ " -g, --high_path=PATH High priority memcgroup path\n");
+ fprintf(stderr,
+ " -t, --threshold=VALUE The sum of 'val' PGFAULT of\n");
+ fprintf(stderr,
+ " high priority memcgroup in\n");
+ fprintf(stderr,
+ " 1 sec to trigger low priority\n");
+ fprintf(stderr,
+ " cgroup over_high\n\n");
+ fprintf(stderr, "Optional arguments:\n");
+ fprintf(stderr, " -o, --over_high_ms=VALUE\n");
+ fprintf(stderr,
+ " Low_path over_high_ms value\n");
+ fprintf(stderr,
+ " (default: 0)\n");
+ fprintf(stderr, " -L, --use_below_low Enable use_below_low flag\n");
+ fprintf(stderr, " -M, --use_below_min Enable use_below_min flag\n");
+ fprintf(stderr,
+ " -O, --allow_override Enable BPF_F_ALLOW_OVERRIDE\n");
+ fprintf(stderr,
+ " flag\n");
+ fprintf(stderr, " -h, --help Show this help message\n\n");
+ fprintf(stderr, "Examples:\n");
+ fprintf(stderr, " # Using long options:\n");
+ fprintf(stderr, " %s --low_path=/sys/fs/cgroup/low \\\n", name);
+ fprintf(stderr, " --high_path=/sys/fs/cgroup/high \\\n");
+ fprintf(stderr, " --threshold=1000 --over_high_ms=500 \\\n"
+ " --use_below_low\n\n");
+ fprintf(stderr, " # Using short options:\n");
+ fprintf(stderr, " %s -l /sys/fs/cgroup/low \\\n"
+ " -g /sys/fs/cgroup/high \\\n",
+ name);
+ fprintf(stderr, " -t 1000 -o 500 -L -M\n");
+}
+
+static uint64_t get_cgroup_id(const char *cgroup_path)
+{
+ struct stat st;
+
+ if (cgroup_path == NULL) {
+ fprintf(stderr, "Error: cgroup_path is NULL\n");
+ return 0;
+ }
+
+ if (stat(cgroup_path, &st) < 0) {
+ fprintf(stderr, "Error: stat(%s) failed: %d\n",
+ cgroup_path, errno);
+ return 0;
+ }
+
+ return (uint64_t)st.st_ino;
+}
+
+static uint64_t parse_u64(const char *str, const char *name)
+{
+ uint64_t value;
+
+ errno = 0;
+ value = strtoull(str, NULL, 10);
+
+ if (errno != 0) {
+ fprintf(stderr,
+ "ERROR: strtoull '%s' failed: %d\n",
+ str, errno);
+ usage(name);
+ exit(-errno);
+ }
+
+ return value;
+}
+
+int main(int argc, char **argv)
+{
+ int low_cgroup_fd = -1, high_cgroup_fd = -1;
+ uint64_t threshold = 0, high_cgroup_id;
+ unsigned int over_high_ms = 0;
+ bool use_below_low = false, use_below_min = false;
+ __u32 opts_flags = 0;
+ const char *low_path = NULL;
+ const char *high_path = NULL;
+ const char *bpf_obj_file = "memcg.bpf.o";
+ struct bpf_object *obj = NULL;
+ struct bpf_program *prog = NULL;
+ struct bpf_link *link = NULL, *link_low = NULL, *link_high = NULL;
+ struct bpf_map *map;
+ struct memcg__bss *bss_data;
+ DECLARE_LIBBPF_OPTS(bpf_struct_ops_opts, opts);
+ int err = -EINVAL;
+ int map_fd;
+ int opt;
+ int option_index = 0;
+
+ static struct option long_options[] = {
+ {"low_path", required_argument, 0, 'l'},
+ {"high_path", required_argument, 0, 'g'},
+ {"threshold", required_argument, 0, 't'},
+ {"over_high_ms", required_argument, 0, 'o'},
+ {"use_below_low", no_argument, 0, 'L'},
+ {"use_below_min", no_argument, 0, 'M'},
+ {"allow_override", no_argument, 0, 'O'},
+ {"help", no_argument, 0, 'h'},
+ {0, 0, 0, 0 }
+ };
+
+ while ((opt = getopt_long(argc, argv, "l:g:t:o:LMOh",
+ long_options, &option_index)) != -1) {
+ switch (opt) {
+ case 'l':
+ low_path = optarg;
+ break;
+ case 'g':
+ high_path = optarg;
+ break;
+ case 't':
+ threshold = parse_u64(optarg, argv[0]);
+ break;
+ case 'o':
+ over_high_ms = (unsigned int)parse_u64(optarg, argv[0]);
+ break;
+ case 'L':
+ use_below_low = true;
+ break;
+ case 'M':
+ use_below_min = true;
+ break;
+ case 'O':
+ opts_flags = BPF_F_ALLOW_OVERRIDE;
+ break;
+ case 'h':
+ usage(argv[0]);
+ return 0;
+ default:
+ usage(argv[0]);
+ return -EINVAL;
+ }
+ }
+
+ if (!low_path || !high_path || !threshold) {
+ fprintf(stderr,
+ "ERROR: Missing required arguments\n\n");
+ usage(argv[0]);
+ goto out;
+ }
+
+ low_cgroup_fd = open(low_path, O_RDONLY);
+ if (low_cgroup_fd < 0) {
+ fprintf(stderr,
+ "ERROR: open low cgroup '%s' failed: %d\n",
+ low_path, errno);
+ err = -errno;
+ goto out;
+ }
+
+ high_cgroup_id = get_cgroup_id(high_path);
+ if (!high_cgroup_id)
+ goto out;
+ high_cgroup_fd = open(high_path, O_RDONLY);
+ if (high_cgroup_fd < 0) {
+ fprintf(stderr,
+ "ERROR: open high cgroup '%s' failed: %d\n",
+ high_path, errno);
+ err = -errno;
+ goto out;
+ }
+
+ obj = bpf_object__open_file(bpf_obj_file, NULL);
+ err = libbpf_get_error(obj);
+ if (err) {
+ fprintf(stderr,
+ "ERROR: opening BPF object file '%s' failed: %d\n",
+ bpf_obj_file, err);
+ goto out;
+ }
+
+ map = bpf_object__find_map_by_name(obj, ".bss");
+ if (!map) {
+ fprintf(stderr, "ERROR: Failed to find .bss map\n");
+ err = -ESRCH;
+ goto out;
+ }
+
+ err = bpf_object__load(obj);
+ if (err) {
+ fprintf(stderr,
+ "ERROR: loading BPF object file failed: %d\n",
+ err);
+ goto out;
+ }
+
+ map_fd = bpf_map__fd(map);
+ bss_data = calloc(1, bpf_map__value_size(map));
+ if (bss_data) {
+ __u32 key = 0;
+
+ bss_data->local_config.high_cgroup_id = high_cgroup_id;
+ bss_data->local_config.threshold = threshold;
+ bss_data->local_config.over_high_ms = over_high_ms;
+ bss_data->local_config.use_below_low = use_below_low;
+ bss_data->local_config.use_below_min = use_below_min;
+
+ err = bpf_map_update_elem(map_fd, &key, bss_data, BPF_EXIST);
+ free(bss_data);
+ if (err) {
+ fprintf(stderr,
+ "ERROR: update config failed: %d\n",
+ err);
+ goto out;
+ }
+ } else {
+ fprintf(stderr,
+ "ERROR: allocate memory failed\n");
+ err = -ENOMEM;
+ goto out;
+ }
+
+ prog = bpf_object__find_program_by_name(obj,
+ "handle_count_memcg_events");
+ if (!prog) {
+ fprintf(stderr,
+ "ERROR: finding a prog in BPF object file failed\n");
+ goto out;
+ }
+
+ link = bpf_program__attach(prog);
+ err = libbpf_get_error(link);
+ if (err) {
+ fprintf(stderr,
+ "ERROR: bpf_program__attach failed: %d\n",
+ err);
+ goto out;
+ }
+
+ if (over_high_ms) {
+ map = bpf_object__find_map_by_name(obj, "low_mcg_ops");
+ if (!map) {
+ fprintf(stderr,
+ "ERROR: Failed to find low_mcg_ops map\n");
+ err = -ESRCH;
+ goto out;
+ }
+ LIBBPF_OPTS_RESET(opts,
+ .flags = opts_flags,
+ .relative_fd = low_cgroup_fd,
+ );
+ link_low = bpf_map__attach_struct_ops_opts(map, &opts);
+ err = libbpf_get_error(link_low);
+ if (err) {
+ fprintf(stderr,
+ "Failed to attach struct ops low_mcg_ops: %d\n",
+ err);
+ goto out;
+ }
+ }
+
+ if (use_below_low || use_below_min) {
+ map = bpf_object__find_map_by_name(obj, "high_mcg_ops");
+ if (!map) {
+ fprintf(stderr,
+ "ERROR: Failed to find high_mcg_ops map\n");
+ err = -ESRCH;
+ goto out;
+ }
+ LIBBPF_OPTS_RESET(opts,
+ .flags = opts_flags,
+ .relative_fd = high_cgroup_fd,
+ );
+ link_high = bpf_map__attach_struct_ops_opts(map, &opts);
+ err = libbpf_get_error(link_high);
+ if (err) {
+ fprintf(stderr,
+ "Failed to attach struct ops high_mcg_ops: %d\n",
+ err);
+ goto out;
+ }
+ }
+
+ printf("Successfully attached!\n");
+
+ signal(SIGINT, sig_handler);
+ signal(SIGTERM, sig_handler);
+
+ while (!exiting)
+ pause();
+
+ printf("Exiting...\n");
+
+out:
+ bpf_link__destroy(link);
+ bpf_link__destroy(link_low);
+ bpf_link__destroy(link_high);
+ bpf_object__close(obj);
+ close(low_cgroup_fd);
+ close(high_cgroup_fd);
+ return err;
+}
--
2.43.0
^ permalink raw reply [flat|nested] 18+ messages in thread
* Re: [RFC PATCH bpf-next v6 11/12] selftests/bpf: Add test for memcg_bpf_ops hierarchies
2026-02-04 9:00 ` [RFC PATCH bpf-next v6 11/12] selftests/bpf: Add test for memcg_bpf_ops hierarchies Hui Zhu
@ 2026-02-04 9:28 ` bot+bpf-ci
0 siblings, 0 replies; 18+ messages in thread
From: bot+bpf-ci @ 2026-02-04 9:28 UTC (permalink / raw)
To: hui.zhu, akpm, hannes, mhocko, roman.gushchin, shakeel.butt,
muchun.song, ast, daniel, andrii, martin.lau, eddyz87, song,
yonghong.song, john.fastabend, kpsingh, sdf, haoluo, jolsa,
shuah, peterz, ojeda, nathan, kees, tj, jeffxu, mkoutny, kernel,
brauner, rdunlap, brgerst, masahiroy, davem, kuba, hawk,
inwardvessel, willemb, kerneljasonxing, paul.chaignon,
a.s.protopopov, ameryhung, chenridong, lance.yang, jiayuan.chen,
linux-kernel, linux-mm, cgroups, bpf, netdev, linux-kselftest
Cc: zhuhui, geliang, ast, andrii, daniel, martin.lau, eddyz87,
yonghong.song, clm, ihor.solodrai
[-- Attachment #1: Type: text/plain, Size: 2129 bytes --]
> diff --git a/tools/testing/selftests/bpf/prog_tests/memcg_ops.c b/tools/testing/selftests/bpf/prog_tests/memcg_ops.c
> index 8c787439f..378ee3b3b 100644
> --- a/tools/testing/selftests/bpf/prog_tests/memcg_ops.c
> +++ b/tools/testing/selftests/bpf/prog_tests/memcg_ops.c
[ ... ]
> +void test_memcg_ops_hierarchies(void)
> +{
> + int ret, first = -1, second = -1, third = -1;
> + struct memcg_ops *skel = NULL;
[ ... ]
> +cleanup:
> + bpf_link__destroy(link1);
> + bpf_link__destroy(link2);
> + bpf_link__destroy(link3);
> + memcg_ops__detach(skel);
> + memcg_ops__destroy(skel);
Can this crash if skel is NULL? Unlike the other tests in this file
which guard these calls with if (skel), this cleanup path calls
memcg_ops__detach() unconditionally. If any of the earlier goto cleanup
paths are taken before memcg_ops__open_and_load() succeeds, skel remains
NULL.
The generated skeleton __detach() function does not have a NULL check:
static inline void
%1$s__detach(struct %1$s *obj)
{
bpf_object__detach_skeleton(obj->skeleton);
}
This would dereference NULL when accessing obj->skeleton.
For comparison, test_memcg_ops_over_high(), test_memcg_ops_below_low_over_high(),
and test_memcg_ops_below_min_over_high() in the same file all use:
if (skel) {
memcg_ops__detach(skel);
memcg_ops__destroy(skel);
}
> + close(first);
> + close(second);
> + close(third);
> + cleanup_cgroup_environment();
> +}
---
AI reviewed your patch. Please fix the bug or email reply why it's not a bug.
See: https://github.com/kernel-patches/vmtest/blob/master/ci/claude/README.md
CI run summary: https://github.com/kernel-patches/bpf/actions/runs/21665371660
AI-authorship-score: low
AI-authorship-explanation: The code follows consistent patterns with other tests in the file and uses standard BPF selftest conventions, suggesting human authorship with good domain knowledge.
issues-found: 1
issue-severity-score: low
issue-severity-explanation: NULL pointer dereference crash in selftest cleanup path when cgroup setup fails, affecting test reliability but not production kernel code.
^ permalink raw reply [flat|nested] 18+ messages in thread
* Re: [RFC PATCH bpf-next v6 12/12] samples/bpf: Add memcg priority control example
2026-02-04 9:00 ` [RFC PATCH bpf-next v6 12/12] samples/bpf: Add memcg priority control example Hui Zhu
@ 2026-02-04 9:28 ` bot+bpf-ci
0 siblings, 0 replies; 18+ messages in thread
From: bot+bpf-ci @ 2026-02-04 9:28 UTC (permalink / raw)
To: hui.zhu, akpm, hannes, mhocko, roman.gushchin, shakeel.butt,
muchun.song, ast, daniel, andrii, martin.lau, eddyz87, song,
yonghong.song, john.fastabend, kpsingh, sdf, haoluo, jolsa,
shuah, peterz, ojeda, nathan, kees, tj, jeffxu, mkoutny, kernel,
brauner, rdunlap, brgerst, masahiroy, davem, kuba, hawk,
inwardvessel, willemb, kerneljasonxing, paul.chaignon,
a.s.protopopov, ameryhung, chenridong, lance.yang, jiayuan.chen,
linux-kernel, linux-mm, cgroups, bpf, netdev, linux-kselftest
Cc: zhuhui, geliang, ast, andrii, daniel, martin.lau, eddyz87,
yonghong.song, clm, ihor.solodrai
[-- Attachment #1: Type: text/plain, Size: 2354 bytes --]
> 1. memcg.bpf.c: A BPF program that monitors PGFAULT events on a
> high-priority cgroup. When activity exceeds a threshold, it uses
> the `get_high_delay_ms`, `below_low`, or `below_min` hooks to
> apply pressure on a low-priority cgroup.
Is this description accurate? Looking at the code, below_low and
below_min are implemented in high_mcg_ops which is attached to the
high-priority cgroup:
high_mcg_ops = {
.below_low = (void *)below_low_impl,
.below_min = (void *)below_min_impl,
};
And in memcg.c, high_mcg_ops is attached via:
.relative_fd = high_cgroup_fd,
According to the memcg_bpf_ops documentation, below_low and below_min
provide protection for a cgroup (returning true means the cgroup is
protected from reclaim), rather than applying pressure. Only
get_high_delay_ms (attached to the low-priority cgroup via low_mcg_ops)
actually throttles/applies pressure.
Could the commit message clarify that below_low/below_min protect the
high-priority cgroup, while get_high_delay_ms throttles the low-priority
one?
> diff --git a/samples/bpf/memcg.c b/samples/bpf/memcg.c
> new file mode 100644
> index 000000000..0ed174608
> --- /dev/null
> +++ b/samples/bpf/memcg.c
[ ... ]
> + prog = bpf_object__find_program_by_name(obj,
> + "handle_count_memcg_events");
> + if (!prog) {
> + fprintf(stderr,
> + "ERROR: finding a prog in BPF object file failed\n");
> + goto out;
> + }
If bpf_object__find_program_by_name() fails here, should the err variable
be set to an error code before the goto? As written, err could be 0 (from
the previous successful bpf_map_update_elem() call), causing main() to
return 0 (success) even though the program failed.
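One possible shape for the fix, mirroring the surrounding error
handling (the -ESRCH choice matches the earlier map-lookup failures):
	if (!prog) {
		fprintf(stderr,
			"ERROR: finding a prog in BPF object file failed\n");
		err = -ESRCH;
		goto out;
	}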
---
AI reviewed your patch. Please fix the bug or email reply why it's not a bug.
See: https://github.com/kernel-patches/vmtest/blob/master/ci/claude/README.md
CI run summary: https://github.com/kernel-patches/bpf/actions/runs/21665371660
AI-authorship-score: low
AI-authorship-explanation: The technical misunderstanding about hook purposes and dual authorship suggest human collaboration rather than AI generation.
issues-found: 2
issue-severity-score: low
issue-severity-explanation: The issues found are a misleading commit message description and a missing error code assignment in sample code, neither of which causes system instability.
^ permalink raw reply [flat|nested] 18+ messages in thread
* Re: [RFC PATCH bpf-next v6 07/12] libbpf: Support passing user-defined flags for struct_ops
2026-02-04 9:00 ` [RFC PATCH bpf-next v6 07/12] libbpf: Support passing user-defined flags " Hui Zhu
@ 2026-02-04 9:28 ` bot+bpf-ci
0 siblings, 0 replies; 18+ messages in thread
From: bot+bpf-ci @ 2026-02-04 9:28 UTC (permalink / raw)
To: hui.zhu, akpm, hannes, mhocko, roman.gushchin, shakeel.butt,
muchun.song, ast, daniel, andrii, martin.lau, eddyz87, song,
yonghong.song, john.fastabend, kpsingh, sdf, haoluo, jolsa,
shuah, peterz, ojeda, nathan, kees, tj, jeffxu, mkoutny, kernel,
brauner, rdunlap, brgerst, masahiroy, davem, kuba, hawk,
inwardvessel, willemb, kerneljasonxing, paul.chaignon,
a.s.protopopov, ameryhung, chenridong, lance.yang, jiayuan.chen,
linux-kernel, linux-mm, cgroups, bpf, netdev, linux-kselftest
Cc: zhuhui, geliang, ast, andrii, daniel, martin.lau, eddyz87,
yonghong.song, clm, ihor.solodrai
> The `bpf_struct_ops_opts` struct is extended with a `flags` member,
> which is then passed to the `bpf_link_create` syscall within
> `bpf_map__attach_struct_ops_opts`.
The commit message states the struct is "extended with a `flags` member", but
the flags field already exists in bpf_struct_ops_opts. It was added in commit
a09c73b18 ("libbpf: introduce bpf_map__attach_struct_ops_opts()"):
struct bpf_struct_ops_opts {
	size_t sz;
	__u32 flags;
	__u32 relative_fd;
	__u64 expected_revision;
	size_t :0;
};
This patch adds code to USE the existing flags field rather than extending the struct.
Would a message like "Pass the existing flags field from bpf_struct_ops_opts
to bpf_link_create" be more accurate?
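The substantive change of this patch would then be the forwarding
itself, roughly (a sketch; it assumes flags sits at the top level of
bpf_link_create_opts, as it does for bpf_program__attach_cgroup_opts):

	link_opts.flags = OPTS_GET(opts, flags, 0);
	link_opts.cgroup.relative_fd = OPTS_GET(opts, relative_fd, 0);

	fd = bpf_link_create(map->fd, 0, BPF_STRUCT_OPS, &link_opts);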
---
AI reviewed your patch. Please fix the bug or reply by email explaining why it is not a bug.
See: https://github.com/kernel-patches/vmtest/blob/master/ci/claude/README.md
CI run summary: https://github.com/kernel-patches/bpf/actions/runs/21665371660
AI-authorship-score: medium
AI-authorship-explanation: The commit message contains a factual inaccuracy about extending the struct when the field already existed, which could indicate AI-generated text that didn't fully analyze the prior commit in the series.
issues-found: 1
issue-severity-score: low
issue-severity-explanation: The only issue is an inaccurate commit message that claims to extend a struct with a field that already exists; this is a documentation issue, not a code bug.
* Re: [RFC PATCH bpf-next v6 06/12] bpf: Pass flags in bpf_link_create for struct_ops
2026-02-04 8:56 ` [RFC PATCH bpf-next v6 06/12] bpf: Pass flags in bpf_link_create for struct_ops Hui Zhu
@ 2026-02-04 9:28 ` bot+bpf-ci
0 siblings, 0 replies; 18+ messages in thread
From: bot+bpf-ci @ 2026-02-04 9:28 UTC (permalink / raw)
To: hui.zhu, akpm, hannes, mhocko, roman.gushchin, shakeel.butt,
muchun.song, ast, daniel, andrii, martin.lau, eddyz87, song,
yonghong.song, john.fastabend, kpsingh, sdf, haoluo, jolsa,
shuah, peterz, ojeda, nathan, kees, tj, jeffxu, mkoutny, kernel,
brauner, rdunlap, brgerst, masahiroy, davem, kuba, hawk,
inwardvessel, willemb, kerneljasonxing, paul.chaignon,
a.s.protopopov, ameryhung, chenridong, lance.yang, jiayuan.chen,
linux-kernel, linux-mm, cgroups, bpf, netdev, linux-kselftest
Cc: zhuhui, geliang, ast, andrii, daniel, martin.lau, eddyz87,
yonghong.song, clm, ihor.solodrai
> UAPI Change:
> This patch updates the comment in include/uapi/linux/bpf.h to reflect
> that the cgroup-bpf attach flags (such as BPF_F_ALLOW_OVERRIDE) are
> now applicable to both BPF_PROG_ATTACH and BPF_LINK_CREATE commands.
The commit message states that include/uapi/linux/bpf.h was updated, but
the diff shows only tools/include/uapi/linux/bpf.h was modified.
Should include/uapi/linux/bpf.h also be updated to match? Currently the
main UAPI header still says:
/* cgroup-bpf attach flags used in BPF_PROG_ATTACH command
while the tools copy now says:
/* cgroup-bpf attach flags used in BPF_PROG_ATTACH and BPF_LINK_CREATE command
This leaves the two copies inconsistent.
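A matching one-line hunk for the main header would keep the copies in
sync (a sketch, mirroring the tools copy quoted above):

--- a/include/uapi/linux/bpf.h
+++ b/include/uapi/linux/bpf.h
-/* cgroup-bpf attach flags used in BPF_PROG_ATTACH command
+/* cgroup-bpf attach flags used in BPF_PROG_ATTACH and BPF_LINK_CREATE command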
---
AI reviewed your patch. Please fix the bug or reply by email explaining why it is not a bug.
See: https://github.com/kernel-patches/vmtest/blob/master/ci/claude/README.md
CI run summary: https://github.com/kernel-patches/bpf/actions/runs/21665371660
AI-authorship-score: low
AI-authorship-explanation: The minimal code change, multi-author collaboration, and human-like error in the commit message suggest human authorship, though the well-structured commit message has some AI-like characteristics.
issues-found: 1
issue-severity-score: low
issue-severity-explanation: The commit message incorrectly claims to update include/uapi/linux/bpf.h when only tools/include/uapi/linux/bpf.h was modified, leaving the two UAPI header copies inconsistent.
* Re: [RFC PATCH bpf-next v6 05/12] libbpf: introduce bpf_map__attach_struct_ops_opts()
2026-02-04 8:56 ` [RFC PATCH bpf-next v6 05/12] libbpf: introduce bpf_map__attach_struct_ops_opts() Hui Zhu
@ 2026-02-04 9:28 ` bot+bpf-ci
0 siblings, 0 replies; 18+ messages in thread
From: bot+bpf-ci @ 2026-02-04 9:28 UTC (permalink / raw)
To: hui.zhu, akpm, hannes, mhocko, roman.gushchin, shakeel.butt,
muchun.song, ast, daniel, andrii, martin.lau, eddyz87, song,
yonghong.song, john.fastabend, kpsingh, sdf, haoluo, jolsa,
shuah, peterz, ojeda, nathan, kees, tj, jeffxu, mkoutny, kernel,
brauner, rdunlap, brgerst, masahiroy, davem, kuba, hawk,
inwardvessel, willemb, kerneljasonxing, paul.chaignon,
a.s.protopopov, ameryhung, chenridong, lance.yang, jiayuan.chen,
linux-kernel, linux-mm, cgroups, bpf, netdev, linux-kselftest
Cc: ast, andrii, daniel, martin.lau, eddyz87, yonghong.song, clm,
ihor.solodrai
> diff --git a/tools/lib/bpf/libbpf.c b/tools/lib/bpf/libbpf.c
> index 0c8bf0b5c..70a00da54 100644
> --- a/tools/lib/bpf/libbpf.c
> +++ b/tools/lib/bpf/libbpf.c
[ ... ]
> @@ -13503,7 +13510,9 @@ struct bpf_link *bpf_map__attach_struct_ops(const struct bpf_map *map)
> return &link->link;
> }
>
> -	fd = bpf_link_create(map->fd, 0, BPF_STRUCT_OPS, NULL);
> +	link_opts.cgroup.relative_fd = OPTS_GET(opts, relative_fd, 0);
> +
> +	fd = bpf_link_create(map->fd, 0, BPF_STRUCT_OPS, &link_opts);
The expected_revision field from bpf_struct_ops_opts is defined in the public
API but doesn't appear to be forwarded here. bpf_link_create() reads
opts->cgroup.expected_revision for BPF_STRUCT_OPS, and similar functions like
bpf_program__attach_cgroup_opts() forward all fields including expected_revision.
Was the expected_revision field intentionally left unused, or should there be a
line here like:
	link_opts.cgroup.expected_revision = OPTS_GET(opts, expected_revision, 0);
Note: A later commit in this series (e83943e72 "libbpf: Support passing
user-defined flags for struct_ops") adds forwarding for the flags field, but
expected_revision remains unused through the end of the series.
> diff --git a/tools/lib/bpf/libbpf.h b/tools/lib/bpf/libbpf.h
> index dfc37a615..5aef44bcf 100644
> --- a/tools/lib/bpf/libbpf.h
> +++ b/tools/lib/bpf/libbpf.h
> @@ -920,6 +920,20 @@ bpf_program__attach_cgroup_opts(const struct bpf_program *prog, int cgroup_fd,
> struct bpf_map;
>
> LIBBPF_API struct bpf_link *bpf_map__attach_struct_ops(const struct bpf_map *map);
> +
> +struct bpf_struct_ops_opts {
> +	/* size of this struct, for forward/backward compatibility */
> +	size_t sz;
> +	__u32 flags;
> +	__u32 relative_fd;
> +	__u64 expected_revision;
	      ^^^^^^^^^^^^^^^^^
The expected_revision field is defined here but is not read by
bpf_map__attach_struct_ops_opts() in libbpf.c, so any value set by users
will be silently ignored.
> +	size_t :0;
> +};
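The caller-visible effect is that a revision check requested through the
opts is silently dropped, e.g. (a usage sketch; the map variable is
hypothetical and the function signature is assumed from the opts struct
above):

	LIBBPF_OPTS(bpf_struct_ops_opts, opts,
		.relative_fd = cgroup_fd,
		.expected_revision = 2,	/* set by the user, never forwarded */
	);
	link = bpf_map__attach_struct_ops_opts(map, &opts);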
---
AI reviewed your patch. Please fix the bug or reply by email explaining why it is not a bug.
See: https://github.com/kernel-patches/vmtest/blob/master/ci/claude/README.md
CI run summary: https://github.com/kernel-patches/bpf/actions/runs/21665371660
AI-authorship-score: low
AI-authorship-explanation: The commit follows standard libbpf API extension patterns with consistent naming and structure typical of experienced kernel developers.
issues-found: 1
issue-severity-score: low
issue-severity-explanation: The expected_revision field in the public API struct is silently ignored, which could confuse users but does not cause system instability or crashes.
end of thread
Thread overview: 18+ messages
2026-02-04 8:56 [RFC PATCH bpf-next v6 00/12] mm: memcontrol: Add BPF hooks for memory controller Hui Zhu
2026-02-04 8:56 ` [RFC PATCH bpf-next v6 01/12] bpf: move bpf_struct_ops_link into bpf.h Hui Zhu
2026-02-04 8:56 ` [RFC PATCH bpf-next v6 02/12] bpf: initial support for attaching struct ops to cgroups Hui Zhu
2026-02-04 8:56 ` [RFC PATCH bpf-next v6 03/12] bpf: mark struct oom_control's memcg field as TRUSTED_OR_NULL Hui Zhu
2026-02-04 8:56 ` [RFC PATCH bpf-next v6 04/12] mm: define mem_cgroup_get_from_ino() outside of CONFIG_SHRINKER_DEBUG Hui Zhu
2026-02-04 8:56 ` [RFC PATCH bpf-next v6 05/12] libbpf: introduce bpf_map__attach_struct_ops_opts() Hui Zhu
2026-02-04 9:28 ` bot+bpf-ci
2026-02-04 8:56 ` [RFC PATCH bpf-next v6 06/12] bpf: Pass flags in bpf_link_create for struct_ops Hui Zhu
2026-02-04 9:28 ` bot+bpf-ci
2026-02-04 9:00 ` [RFC PATCH bpf-next v6 07/12] libbpf: Support passing user-defined flags " Hui Zhu
2026-02-04 9:28 ` bot+bpf-ci
2026-02-04 9:00 ` [RFC PATCH bpf-next v6 08/12] mm: memcontrol: Add BPF struct_ops for memory controller Hui Zhu
2026-02-04 9:00 ` [RFC PATCH bpf-next v6 09/12] selftests/bpf: Add tests for memcg_bpf_ops Hui Zhu
2026-02-04 9:00 ` [RFC PATCH bpf-next v6 10/12] mm/bpf: Add BPF_F_ALLOW_OVERRIDE support " Hui Zhu
2026-02-04 9:00 ` [RFC PATCH bpf-next v6 11/12] selftests/bpf: Add test for memcg_bpf_ops hierarchies Hui Zhu
2026-02-04 9:28 ` bot+bpf-ci
2026-02-04 9:00 ` [RFC PATCH bpf-next v6 12/12] samples/bpf: Add memcg priority control example Hui Zhu
2026-02-04 9:28 ` bot+bpf-ci