From: Yafang Shao <laoar.shao@gmail.com>
To: "Michal Koutný" <mkoutny@suse.com>
Cc: roman.gushchin@linux.dev, inwardvessel@gmail.com,
shakeel.butt@linux.dev, akpm@linux-foundation.org,
ast@kernel.org, daniel@iogearbox.net, andrii@kernel.org,
yu.c.chen@intel.com, zhao1.liu@intel.com, bpf@vger.kernel.org,
linux-mm@kvack.org, tj@kernel.org
Subject: Re: [RFC PATCH bpf-next 2/3] mm: add support for bpf based numa balancing
Date: Fri, 16 Jan 2026 10:45:38 +0800 [thread overview]
Message-ID: <CALOAHbAVcCv_1-yYp7QpEidxPm6vx2p6nzFdKEt61TF8LMCUPw@mail.gmail.com> (raw)
In-Reply-To: <z5lvdg7fonhyrt4zphak6hnhlazyntyrbvcpxtr32rrksktg3j@wpvmby6yonbr>
On Thu, Jan 15, 2026 at 6:24 PM Michal Koutný <mkoutny@suse.com> wrote:
>
> Hi Yafang.
>
> On Wed, Jan 14, 2026 at 08:13:44PM +0800, Yafang Shao <laoar.shao@gmail.com> wrote:
> > On Wed, Jan 14, 2026 at 5:56 PM Michal Koutný <mkoutny@suse.com> wrote:
> > >
> > > On Tue, Jan 13, 2026 at 08:12:37PM +0800, Yafang Shao <laoar.shao@gmail.com> wrote:
> > > > bpf_numab_ops enables NUMA balancing for tasks within a specific memcg,
> > > > even when global NUMA balancing is disabled. This allows selective NUMA
> > > > optimization for workloads that benefit from it, while avoiding potential
> > > > latency spikes for other workloads.
> > > >
> > > > The policy must be attached to a leaf memory cgroup.
> > >
> > > Why this restriction?
> >
> > We have several potential design options to consider:
> >
> > Option 1. Stateless cgroup bpf prog
> > Attach the BPF program to a specific cgroup and traverse upward
> > through the hierarchy within the hook, as demonstrated in Roman's
> > BPF-OOM series:
> >
> > https://lore.kernel.org/bpf/877bwcpisd.fsf@linux.dev/
> >
> > for (memcg = oc->memcg; memcg; memcg = parent_mem_cgroup(memcg)) {
> >         bpf_oom_ops = READ_ONCE(memcg->bpf_oom);
> >         if (!bpf_oom_ops)
> >                 continue;
> >
> >         ret = bpf_ops_handle_oom(bpf_oom_ops, memcg, oc);
> > }
> >
> > - Benefit
> > The design is relatively simple and does not require manual
> > lifecycle management of the BPF program, hence the "stateless"
> > designation.
> > - Drawback
> > It may introduce potential overhead in the performance-critical
> > hotpath due to the traversal.
> >
> > Option 2: Stateful cgroup bpf prog
> > Attach the BPF program to all descendant cgroups, explicitly
> > handling cgroup fork/exit events. This approach is similar to the one
> > used in my BPF-THP series:
> >
> > https://lwn.net/ml/all/20251026100159.6103-4-laoar.shao@gmail.com/
> >
> > This requires the kernel to record every cgroup where the program is
> > attached — for example, by maintaining a per-program list of cgroups
> > (struct bpf_mm_ops with a bpf_thp_list). Because we must track this
> > attachment state, I refer to this as a "stateful" approach.
> >
> > - Benefit: Avoids the upward traversal overhead of Option 1.
> > - Drawback: Introduces complexity for managing attachment state and
> > lifecycle (attach/detach, cgroup creation/destruction).
> >
> > Option 3: Restrict Attachment to Leaf Cgroups
> > This is the approach taken in the current patchset. It simplifies
> > the kernel implementation by only allowing BPF programs to be attached
> > to leaf cgroups (those without children).
> > This design is inspired by our production experience, where it has
> > worked well. It encourages users to attach programs directly to the
> > cgroup they intend to target, avoiding ambiguity in hierarchy
> > propagation.
> >
> > Which of these options do you prefer? Do you have other suggestions?
>
> I appreciate the breakdown.
> With the options 1 and 2, I'm not sure whether they aren't reinventing a
> wheel. Namely the stuff from kernel/bpf/cgroup.c:
> - compute_effective_progs() where progs are composed/prepared into a
> sequence (depending on BPF_F_ALLOW_MULTI) and then
> - bpf_prog_run_array_cg() runs them and joins the results into a
> verdict.
>
> (Those BPF_F_* are flags known to userspace already.)
My understanding is that struct-ops-based cgroup bpf serves as a more
efficient replacement for the legacy cgroup bpf. For instance:
Legacy feature                    New replacement
BPF_F_ALLOW_OVERRIDE/REPLACE      ->update()
BPF_F_BEFORE/BPF_F_AFTER          a priority in the struct-ops?
BPF_F_ALLOW_MULTI                 a simple for-loop
bpf_prog_run_array_cg()           a simple ->bpf_hook()
IOW, all of the control flow can be handled within the struct bpf_XXX_ops{}
itself without requiring any uAPI changes.
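For illustration only, here is a minimal user-space C model of that idea
(not actual kernel code; the struct name bpf_numab_ops and the
->update()/->numab_hook() callbacks are hypothetical), showing how the
behaviour that the legacy BPF_F_* flags encode can live entirely inside
the ops struct:

#include <stddef.h>
#include <stdio.h>

/* Hypothetical struct-ops: what the legacy cgroup-bpf uAPI expresses
 * via BPF_F_* flags is folded into fields/callbacks of the struct. */
struct bpf_numab_ops {
        /* replaces BPF_F_ALLOW_OVERRIDE/REPLACE: swap the ops in place */
        int (*update)(struct bpf_numab_ops *old_ops,
                      struct bpf_numab_ops *new_ops);
        /* replaces bpf_prog_run_array_cg(): one direct callback */
        int (*numab_hook)(int nid, unsigned long addr);
        /* replaces BPF_F_BEFORE/BPF_F_AFTER: an explicit ordering hint */
        int priority;
};

static int hook_a(int nid, unsigned long addr)
{
        (void)addr;
        return nid == 0;        /* only balance on node 0 */
}

static int hook_b(int nid, unsigned long addr)
{
        (void)nid;
        (void)addr;
        return 1;               /* always allow */
}

int main(void)
{
        struct bpf_numab_ops progs[] = {
                { .numab_hook = hook_a, .priority = 10 },
                { .numab_hook = hook_b, .priority = 20 },
        };
        int verdict = 1;
        size_t i;

        /* replaces BPF_F_ALLOW_MULTI: a plain loop over the attached ops;
         * ANDing the results is just one example of a "reduce" policy. */
        for (i = 0; i < sizeof(progs) / sizeof(progs[0]); i++)
                verdict &= progs[i].numab_hook(0, 0x1000);

        printf("verdict=%d\n", verdict);
        return 0;
}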
I believe we should avoid adding new features to the legacy cgroup bpf
(kernel/bpf/cgroup.c) and instead implement all new functionality
using struct-ops. This approach minimizes changes to the uAPI, since
the kAPI is easier to maintain than the uAPI.
Alexei, Daniel, Andrii, I'd appreciate your input to confirm or
correct my understanding.
>
> So I think it'd boil down to the type of result that individual ops
> return in order to be able to apply some "reduce" function on those.
>
> > > Do you envision how these extensions would apply hierarchically?
> >
> > This feature can be applied hierarchically, though it adds complexity
> > to the kernel. However, I believe that by providing the core
> > functionality, we can empower users to manage their specific use cases
> > effectively. We do not need to implement every possible variation for
> > them.
>
> I'd also look around how sched_ext resolved (solves?) this. Namely the
> step from one global sched_ext class to per-cg extensions.
We're planning to experiment with sched_ext (especially the LLC-aware
scheduler) in our k8s environment this year to tackle LLC performance
issues on AMD EPYC. I might work on this later, but I'm still new to it
right now.
> I'll Cc Tejun for more info.
>
>
> > > Regardless of that, being a "leaf memcg" is not a stationary condition
> > > (mkdirs, writes to `cgroup.subtree_control`) so it should also be
> > > prepared for that.
> >
> > In the current implementation, the user has to attach the bpf prog to
> > the new cgroup as well ;)
>
> I'd say that's not ideal UX (I imagine there's some high level policy
> set to upper cgroups but the internal cgroup (sub)structure of
> containers may be opaque to the admin, production experience might have
> been lucky not to hit this case).
> In this regard, something like the option 1 sounds better and
> performance can be improved later.
Agreed.
In option 1, if a performance bottleneck emerges that we can't handle
well, the user can always attach a BPF prog directly to the leaf
cgroup ;)
This way, we avoid premature optimization.
> Option 1's "reduce" function takes
> the result from the lowest ancestor but hierarchical logic should be
> reversed with higher cgroups overriding the lowers.
The exact definition of the “reduce” function isn’t fully clear to me
at this point. That said, if hierarchical attachment at multiple levels
becomes a real use case, we can always refactor this into a more
generic solution later.
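To make the discussion concrete, here is one possible interpretation,
sketched as a minimal user-space C model (the memcg struct below is a
stand-in, not the kernel's struct mem_cgroup, and the stored verdicts
are purely hypothetical): walk upward from the leaf and let the highest
ancestor with an ops attached provide the final answer, so higher
cgroups override lower ones, as you suggest:

#include <stdio.h>

struct memcg {
        struct memcg *parent;
        /* verdict this level would return, or -1 if no ops attached here */
        int verdict;
};

/* One possible "reduce": keep walking up and let the last (highest)
 * attached ops win, so ancestors override descendants. */
static int reduce_verdict(struct memcg *leaf, int dfl)
{
        struct memcg *m;
        int v = dfl;

        for (m = leaf; m; m = m->parent)
                if (m->verdict >= 0)
                        v = m->verdict;
        return v;
}

int main(void)
{
        struct memcg root = { .parent = NULL,  .verdict =  0 }; /* disable */
        struct memcg mid  = { .parent = &root, .verdict = -1 }; /* no ops  */
        struct memcg leaf = { .parent = &mid,  .verdict =  1 }; /* enable  */

        /* root's verdict overrides the leaf's: prints 0 */
        printf("%d\n", reduce_verdict(&leaf, 1));
        return 0;
}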
>
> (I don't want to claim what's the correct design, I want to make you
> aware of other places in kernel that solve similar challenge.)
Understood.
Thanks a lot for your review.
--
Regards
Yafang