From: Yafang Shao <laoar.shao@gmail.com>
To: "Michal Koutný" <mkoutny@suse.com>
Cc: roman.gushchin@linux.dev, inwardvessel@gmail.com,
shakeel.butt@linux.dev, akpm@linux-foundation.org,
ast@kernel.org, daniel@iogearbox.net, andrii@kernel.org,
yu.c.chen@intel.com, zhao1.liu@intel.com, bpf@vger.kernel.org,
linux-mm@kvack.org, tj@kernel.org
Subject: Re: [RFC PATCH bpf-next 2/3] mm: add support for bpf based numa balancing
Date: Fri, 16 Jan 2026 10:45:38 +0800 [thread overview]
Message-ID: <CALOAHbAVcCv_1-yYp7QpEidxPm6vx2p6nzFdKEt61TF8LMCUPw@mail.gmail.com> (raw)
In-Reply-To: <z5lvdg7fonhyrt4zphak6hnhlazyntyrbvcpxtr32rrksktg3j@wpvmby6yonbr>
On Thu, Jan 15, 2026 at 6:24 PM Michal Koutný <mkoutny@suse.com> wrote:
>
> Hi Yafang.
>
> On Wed, Jan 14, 2026 at 08:13:44PM +0800, Yafang Shao <laoar.shao@gmail.com> wrote:
> > On Wed, Jan 14, 2026 at 5:56 PM Michal Koutný <mkoutny@suse.com> wrote:
> > >
> > > On Tue, Jan 13, 2026 at 08:12:37PM +0800, Yafang Shao <laoar.shao@gmail.com> wrote:
> > > > bpf_numab_ops enables NUMA balancing for tasks within a specific memcg,
> > > > even when global NUMA balancing is disabled. This allows selective NUMA
> > > > optimization for workloads that benefit from it, while avoiding potential
> > > > latency spikes for other workloads.
> > > >
> > > > The policy must be attached to a leaf memory cgroup.
> > >
> > > Why this restriction?
> >
> > We have several potential design options to consider:
> >
> > Option 1. Stateless cgroup bpf prog
> > Attach the BPF program to a specific cgroup and traverse upward
> > through the hierarchy within the hook, as demonstrated in Roman's
> > BPF-OOM series:
> >
> > https://lore.kernel.org/bpf/877bwcpisd.fsf@linux.dev/
> >
> > for (memcg = oc->memcg; memcg; memcg = parent_mem_cgroup(memcg)) {
> >         bpf_oom_ops = READ_ONCE(memcg->bpf_oom);
> >         if (!bpf_oom_ops)
> >                 continue;
> >
> >         ret = bpf_ops_handle_oom(bpf_oom_ops, memcg, oc);
> > }
> >
> > - Benefit
> > The design is relatively simple and does not require manual
> > lifecycle management of the BPF program, hence the "stateless"
> > designation.
> > - Drawback
> > It may introduce potential overhead in the performance-critical
> > hotpath due to the traversal.
> >
> > Option 2: Stateful cgroup bpf prog
> > Attach the BPF program to all descendant cgroups, explicitly
> > handling cgroup fork/exit events. This approach is similar to the one
> > used in my BPF-THP series:
> >
> > https://lwn.net/ml/all/20251026100159.6103-4-laoar.shao@gmail.com/
> >
> > This requires the kernel to record every cgroup where the program is
> > attached — for example, by maintaining a per-program list of cgroups
> > (struct bpf_mm_ops with a bpf_thp_list). Because we must track this
> > attachment state, I refer to this as a "stateful" approach.
> >
> > - Benefit: Avoids the upward traversal overhead of Option 1.
> > - Drawback: Introduces complexity for managing attachment state and
> > lifecycle (attach/detach, cgroup creation/destruction).
> >
> > Option 3: Restrict Attachment to Leaf Cgroups
> > This is the approach taken in the current patchset. It simplifies
> > the kernel implementation by only allowing BPF programs to be attached
> > to leaf cgroups (those without children).
> > This design is inspired by our production experience, where it has
> > worked well. It encourages users to attach programs directly to the
> > cgroup they intend to target, avoiding ambiguity in hierarchy
> > propagation.
> >
> > Which of these options do you prefer? Do you have other suggestions?
>
> I appreciate the breakdown.
> With the options 1 and 2, I'm not sure whether they aren't reinventing a
> wheel. Namely the stuff from kernel/bpf/cgroup.c:
> - compute_effective_progs() where progs are composed/prepared into a
> sequence (depending on BPF_F_ALLOW_MULTI) and then
> - bpf_prog_run_array_cg() runs them and joins the results into a
> verdict.
>
> (Those BPF_F_* are flags known to userspace already.)
My understanding is that struct-ops-based cgroup bpf serves as a more
efficient replacement for the legacy cgroup bpf. For instance:
Legacy feature                    New replacement
BPF_F_ALLOW_OVERRIDE/REPLACE      ->update()
BPF_F_BEFORE/BPF_F_AFTER          a priority in the struct-ops?
BPF_F_ALLOW_MULTI                 a simple for-loop
bpf_prog_run_array_cg()           a simple ->bpf_hook()
IOW, all of the control flow can be handled within the struct bpf_XXX_ops{}
itself without requiring any uAPI changes.
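For illustration only, here is a minimal user-space C model of that idea
(not actual kernel code; the struct name bpf_numab_ops and the
->update()/->numab_hook() callbacks are hypothetical), showing how the
behaviour that the legacy BPF_F_* flags encode can live entirely inside
the ops struct:

#include <stddef.h>
#include <stdio.h>

/* Hypothetical struct-ops: what the legacy cgroup-bpf uAPI expresses
 * via BPF_F_* flags is folded into fields/callbacks of the struct. */
struct bpf_numab_ops {
        /* replaces BPF_F_ALLOW_OVERRIDE/REPLACE: swap the ops in place */
        int (*update)(struct bpf_numab_ops *old_ops,
                      struct bpf_numab_ops *new_ops);
        /* replaces bpf_prog_run_array_cg(): one direct callback */
        int (*numab_hook)(int nid, unsigned long addr);
        /* replaces BPF_F_BEFORE/BPF_F_AFTER: an explicit ordering hint */
        int priority;
};

static int hook_a(int nid, unsigned long addr)
{
        (void)addr;
        return nid == 0;        /* only balance on node 0 */
}

static int hook_b(int nid, unsigned long addr)
{
        (void)nid;
        (void)addr;
        return 1;               /* always allow */
}

int main(void)
{
        struct bpf_numab_ops progs[] = {
                { .numab_hook = hook_a, .priority = 10 },
                { .numab_hook = hook_b, .priority = 20 },
        };
        int verdict = 1;
        size_t i;

        /* replaces BPF_F_ALLOW_MULTI: a plain loop over the attached ops;
         * ANDing the results is just one example of a "reduce" policy. */
        for (i = 0; i < sizeof(progs) / sizeof(progs[0]); i++)
                verdict &= progs[i].numab_hook(0, 0x1000);

        printf("verdict=%d\n", verdict);
        return 0;
}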
I believe we should avoid adding new features to the legacy cgroup bpf
(kernel/bpf/cgroup.c) and instead implement all new functionality
using struct-ops. This approach minimizes changes to the uAPI, since
the kAPI is easier to maintain than the uAPI.
Alexei, Daniel, Andrii, I'd appreciate your input to confirm or
correct my understanding.
>
> So I think it'd boil down to the type of result that individual ops
> return in order to be able to apply some "reduce" function on those.
>
> > > Do you envision how these extensions would apply hierarchically?
> >
> > This feature can be applied hierarchically, though it adds complexity
> > to the kernel. However, I believe that by providing the core
> > functionality, we can empower users to manage their specific use cases
> > effectively. We do not need to implement every possible variation for
> > them.
>
> I'd also look around how sched_ext resolved (solves?) this. Namely the
> step from one global sched_ext class to per-cg extensions.
We're planning to experiment with sched_ext (especially the LLC-aware
scheduler) in our k8s environment this year to tackle LLC performance
issues on AMD EPYC. I might work on this later, but I'm still new to it
right now.
> I'll Cc Tejun for more info.
>
>
> > > Regardless of that, being a "leaf memcg" is not a stationary condition
> > > (mkdirs, writes to `cgroup.subtree_control`) so it should also be
> > > prepared for that.
> >
> > In the current implementation, the user has to attach the bpf prog to
> > the new cgroup as well ;)
>
> I'd say that's not ideal UX (I imagine there's some high level policy
> set to upper cgroups but the internal cgroup (sub)structure of
> containers may be opaque to the admin, production experience might have
> been lucky not to hit this case).
> In this regard, something like the option 1 sounds better and
> performance can be improved later.
Agreed.
In option 1, if a performance bottleneck emerges that we can't handle
well, the user can always attach a BPF prog directly to the leaf
cgroup ;)
This way, we avoid premature optimization.
> Option 1's "reduce" function takes
> the result from the lowest ancestor but hierarchical logic should be
> reversed with higher cgroups overriding the lowers.
The exact definition of the “reduce” function isn’t fully clear to me
at this point. That said, if hierarchical attachment at multiple levels
becomes a real use case, we can always refactor this into a more
generic solution later.
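To make the discussion concrete, here is one possible interpretation,
sketched as a minimal user-space C model (the memcg struct below is a
stand-in, not the kernel's struct mem_cgroup, and the stored verdicts
are purely hypothetical): walk upward from the leaf and let the highest
ancestor with an ops attached provide the final answer, so higher
cgroups override lower ones, as you suggest:

#include <stdio.h>

struct memcg {
        struct memcg *parent;
        /* verdict this level would return, or -1 if no ops attached here */
        int verdict;
};

/* One possible "reduce": keep walking up and let the last (highest)
 * attached ops win, so ancestors override descendants. */
static int reduce_verdict(struct memcg *leaf, int dfl)
{
        struct memcg *m;
        int v = dfl;

        for (m = leaf; m; m = m->parent)
                if (m->verdict >= 0)
                        v = m->verdict;
        return v;
}

int main(void)
{
        struct memcg root = { .parent = NULL,  .verdict =  0 }; /* disable */
        struct memcg mid  = { .parent = &root, .verdict = -1 }; /* no ops  */
        struct memcg leaf = { .parent = &mid,  .verdict =  1 }; /* enable  */

        /* root's verdict overrides the leaf's: prints 0 */
        printf("%d\n", reduce_verdict(&leaf, 1));
        return 0;
}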
>
> (I don't want to claim what's the correct design, I want to make you
> aware of other places in kernel that solve similar challenge.)
Understood.
Thanks a lot for your review.
--
Regards
Yafang