Hi Yafang.

On Wed, Jan 14, 2026 at 08:13:44PM +0800, Yafang Shao wrote:
> On Wed, Jan 14, 2026 at 5:56 PM Michal Koutný wrote:
> >
> > On Tue, Jan 13, 2026 at 08:12:37PM +0800, Yafang Shao wrote:
> > > bpf_numab_ops enables NUMA balancing for tasks within a specific memcg,
> > > even when global NUMA balancing is disabled. This allows selective NUMA
> > > optimization for workloads that benefit from it, while avoiding potential
> > > latency spikes for other workloads.
> > >
> > > The policy must be attached to a leaf memory cgroup.
> >
> > Why this restriction?
>
> We have several potential design options to consider:
>
> Option 1: Stateless cgroup BPF prog
>   Attach the BPF program to a specific cgroup and traverse upward
> through the hierarchy within the hook, as demonstrated in Roman's
> BPF-OOM series:
>
>   https://lore.kernel.org/bpf/877bwcpisd.fsf@linux.dev/
>
>     for (memcg = oc->memcg; memcg; memcg = parent_mem_cgroup(memcg)) {
>             bpf_oom_ops = READ_ONCE(memcg->bpf_oom);
>             if (!bpf_oom_ops)
>                     continue;
>
>             ret = bpf_ops_handle_oom(bpf_oom_ops, memcg, oc);
>     }
>
>   - Benefit: The design is relatively simple and does not require manual
>     lifecycle management of the BPF program, hence the "stateless"
>     designation.
>   - Drawback: It may introduce potential overhead in the performance-critical
>     hot path due to the traversal.
>
> Option 2: Stateful cgroup BPF prog
>   Attach the BPF program to all descendant cgroups, explicitly handling
> cgroup fork/exit events. This approach is similar to the one used in my
> BPF-THP series:
>
>   https://lwn.net/ml/all/20251026100159.6103-4-laoar.shao@gmail.com/
>
>   This requires the kernel to record every cgroup where the program is
> attached, for example by maintaining a per-program list of cgroups
> (struct bpf_mm_ops with a bpf_thp_list). Because we must track this
> attachment state, I refer to this as a "stateful" approach.
>
>   - Benefit: Avoids the upward traversal overhead of Option 1.
>   - Drawback: Introduces complexity for managing attachment state and
>     lifecycle (attach/detach, cgroup creation/destruction).
>
> Option 3: Restrict attachment to leaf cgroups
>   This is the approach taken in the current patchset. It simplifies the
> kernel implementation by only allowing BPF programs to be attached to
> leaf cgroups (those without children).
>   This design is inspired by our production experience, where it has
> worked well. It encourages users to attach programs directly to the
> cgroup they intend to target, avoiding ambiguity in hierarchy
> propagation.
>
> Which of these options do you prefer? Do you have other suggestions?

I appreciate the breakdown. With options 1 and 2, I'm not sure whether
they aren't reinventing the wheel, namely the machinery in
kernel/bpf/cgroup.c:
- compute_effective_progs(), where progs are composed/prepared into a
  sequence (depending on BPF_F_ALLOW_MULTI), and then
- bpf_prog_run_array_cg(), which runs them and joins the results into a
  verdict.
(Those BPF_F_* flags are already known to userspace.)

So I think it boils down to the type of result that the individual ops
return, so that some "reduce" function can be applied to those results.

> > Do you envision how these extensions would apply hierarchically?
>
> This feature can be applied hierarchically, though it adds complexity
> to the kernel. However, I believe that by providing the core
> functionality, we can empower users to manage their specific use cases
> effectively. We do not need to implement every possible variation for
> them.
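To make the "result type plus reduce" idea a bit more concrete, here is
a very rough sketch of what a hierarchical verdict could look like for
bpf_numab_ops. This is only an illustration, not a concrete proposal;
none of the field or helper names below (memcg->bpf_numab, get_verdict,
numab_reduce, numab_effective_verdict) are taken from your patchset:

enum numab_verdict {
        NUMAB_DONT_CARE,        /* the ops has no opinion for this memcg */
        NUMAB_ENABLE,
        NUMAB_DISABLE,
};

/*
 * How two verdicts combine is the actual policy question
 * (nearest cgroup wins, topmost wins, AND/OR of booleans, ...).
 */
static enum numab_verdict numab_reduce(enum numab_verdict cur,
                                       enum numab_verdict next)
{
        return next == NUMAB_DONT_CARE ? cur : next;
}

static enum numab_verdict numab_effective_verdict(struct mem_cgroup *start)
{
        enum numab_verdict v = NUMAB_DONT_CARE;
        struct mem_cgroup *memcg;

        for (memcg = start; memcg; memcg = parent_mem_cgroup(memcg)) {
                /* hypothetical per-memcg pointer to the attached ops */
                struct bpf_numab_ops *ops = READ_ONCE(memcg->bpf_numab);

                if (ops)
                        v = numab_reduce(v, ops->get_verdict(ops, memcg));
        }
        return v;
}

With verdicts expressed like this, the walk itself is the same as in
Roman's snippet above; only the reduce step decides whose opinion wins.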
I'd also look at how sched_ext resolved (solves?) this, namely the step
from one global sched_ext class to per-cgroup extensions. I'll Cc Tejun
for more info.

> > Regardless of that, being a "leaf memcg" is not a stationary condition
> > (mkdirs, writes to `cgroup.subtree_control`) so it should also be
> > prepared for that.
>
> In the current implementation, the user has to attach the bpf prog to
> the new cgroup as well ;)

I'd say that's not ideal UX (I imagine there is some high-level policy
set on upper cgroups, but the internal cgroup (sub)structure of
containers may be opaque to the admin; production experience might have
been lucky not to hit this case). In this regard, something like
option 1 sounds better, and performance can be improved later.

Option 1's "reduce" function takes the result from the lowest ancestor,
but the hierarchical logic should be reversed, with higher cgroups
overriding the lower ones. (I don't want to claim what the correct
design is; I want to make you aware of other places in the kernel that
solve a similar challenge.)

Thanks,
Michal
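P.S. For completeness, a minimal sketch of the "higher cgroup overrides
the lower ones" variant, computed outside the hot path in the spirit of
compute_effective_progs(). Again, the field and function names here
(numab_local, numab_effective, numab_update_effective) are made up for
illustration and are not from the patchset:

/*
 * Recompute the effective ops for @memcg whenever an ops is
 * attached/detached or a child cgroup is created; the hot path then
 * only does a single READ_ONCE(memcg->numab_effective).
 */
static void numab_update_effective(struct mem_cgroup *memcg)
{
        struct mem_cgroup *parent = parent_mem_cgroup(memcg);
        struct bpf_numab_ops *eff = NULL;

        /* an ancestor's policy, if present, wins over the local one */
        if (parent)
                eff = READ_ONCE(parent->numab_effective);
        if (!eff)
                eff = memcg->numab_local;

        WRITE_ONCE(memcg->numab_effective, eff);

        /*
         * Descendants would be refreshed as well, e.g. with
         * css_for_each_descendant_pre(), so newly created children
         * inherit the policy and the "leaf memcg is not stationary"
         * problem goes away.
         */
}

That keeps the per-task path O(1) while still letting an admin set the
policy on an upper cgroup without knowing the container's internal
substructure.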