From: Yafang Shao <laoar.shao@gmail.com>
To: "Michal Koutný" <mkoutny@suse.com>
Cc: roman.gushchin@linux.dev, inwardvessel@gmail.com,
shakeel.butt@linux.dev, akpm@linux-foundation.org,
ast@kernel.org, daniel@iogearbox.net, andrii@kernel.org,
yu.c.chen@intel.com, zhao1.liu@intel.com, bpf@vger.kernel.org,
linux-mm@kvack.org
Subject: Re: [RFC PATCH bpf-next 2/3] mm: add support for bpf based numa balancing
Date: Wed, 14 Jan 2026 20:13:44 +0800
Message-ID: <CALOAHbB3Ruc+veQxPC8NgvxwBDrnX5XkmZN-vz1pu3U05MXnQg@mail.gmail.com>
In-Reply-To: <cfyq2n7igavmwwf5jv5uamiyhprgsf4ez7au6ssv3rw54vjh4w@nc43vkqhz5yq>
On Wed, Jan 14, 2026 at 5:56 PM Michal Koutný <mkoutny@suse.com> wrote:
>
> On Tue, Jan 13, 2026 at 08:12:37PM +0800, Yafang Shao <laoar.shao@gmail.com> wrote:
> > bpf_numab_ops enables NUMA balancing for tasks within a specific memcg,
> > even when global NUMA balancing is disabled. This allows selective NUMA
> > optimization for workloads that benefit from it, while avoiding potential
> > latency spikes for other workloads.
> >
> > The policy must be attached to a leaf memory cgroup.
>
> Why this restriction?
We have several potential design options to consider:
Option 1: Stateless cgroup bpf prog
Attach the BPF program to a specific cgroup and traverse upward
through the hierarchy within the hook, as demonstrated in Roman's
BPF-OOM series:
https://lore.kernel.org/bpf/877bwcpisd.fsf@linux.dev/
    for (memcg = oc->memcg; memcg; memcg = parent_mem_cgroup(memcg)) {
            bpf_oom_ops = READ_ONCE(memcg->bpf_oom);
            if (!bpf_oom_ops)
                    continue;
            ret = bpf_ops_handle_oom(bpf_oom_ops, memcg, oc);
    }
- Benefit: The design is relatively simple and does not require
manual lifecycle management of the BPF program, hence the
"stateless" designation.
- Drawback: It may introduce overhead in the performance-critical
hot path due to the upward traversal (a rough sketch of such a
lookup follows below).
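For illustration only, a minimal sketch of how the same upward walk
might look on the NUMA-balancing side. mem_cgroup_from_task() and
parent_mem_cgroup() are existing helpers, while the memcg->bpf_numab
field and the function itself are hypothetical, not what this
patchset implements:

    /* Hypothetical sketch: walk from the task's memcg to the root and
     * stop at the first memcg that has a NUMA-balancing policy attached.
     */
    static bool bpf_numab_enabled_for(struct task_struct *p)
    {
            struct mem_cgroup *memcg;
            bool enabled = false;

            rcu_read_lock();
            for (memcg = mem_cgroup_from_task(p); memcg;
                 memcg = parent_mem_cgroup(memcg)) {
                    if (READ_ONCE(memcg->bpf_numab)) {
                            enabled = true;
                            break;
                    }
            }
            rcu_read_unlock();

            return enabled;
    }

The per-call cost is small, but it sits in whichever hot path the
hook is placed in, which is the overhead Option 1 trades for
simplicity.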
Option 2: Stateful cgroup bpf prog
Attach the BPF program to all descendant cgroups, explicitly
handling cgroup fork/exit events. This approach is similar to the one
used in my BPF-THP series:
https://lwn.net/ml/all/20251026100159.6103-4-laoar.shao@gmail.com/
This requires the kernel to record every cgroup where the program is
attached — for example, by maintaining a per-program list of cgroups
(struct bpf_mm_ops with a bpf_thp_list). Because we must track this
attachment state, I refer to this as a "stateful" approach.
- Benefit: Avoids the upward traversal overhead of Option 1.
- Drawback: Introduces complexity for managing attachment state and
lifecycle (attach/detach, cgroup creation/destruction); see the
sketch below.
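For illustration, a minimal sketch of the kind of bookkeeping Option
2 would need. All of these names are hypothetical and only loosely
modeled on the bpf_thp_list idea mentioned above:

    /* Hypothetical per-program attachment state for the "stateful"
     * approach. None of these types exist in the kernel today.
     */
    struct bpf_numab_ops {
            /* struct_ops callbacks would live here */
            struct list_head attachments;   /* cgroups this program is attached to */
            spinlock_t lock;                /* protects the attachments list */
    };

    struct bpf_numab_attachment {
            struct cgroup *cgrp;            /* the attached cgroup */
            struct list_head node;          /* linked into bpf_numab_ops::attachments */
    };

On cgroup removal (or program unload) the corresponding attachment
has to be unlinked and freed, and on cgroup creation a new attachment
has to be added, which is exactly the lifecycle handling mentioned
above.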
Option 3: Restrict attachment to leaf cgroups
This is the approach taken in the current patchset. It simplifies
the kernel implementation by only allowing BPF programs to be attached
to leaf cgroups (those without children).
This design is inspired by our production experience, where it has
worked well. It encourages users to attach programs directly to the
cgroup they intend to target, avoiding ambiguity in hierarchy
propagation.
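As a rough illustration (not the actual patch code), the attach path
could simply refuse any cgroup that still has children.
css_has_online_children() is an existing cgroup helper; the
memcg->bpf_numab field and the error choices are made up:

    /* Hypothetical attach-time check for Option 3: only leaf memcgs
     * may carry a NUMA-balancing policy.
     */
    static int bpf_numab_attach(struct mem_cgroup *memcg,
                                struct bpf_numab_ops *ops)
    {
            if (css_has_online_children(&memcg->css))
                    return -EINVAL; /* not a leaf */

            /* Reject a second attachment to the same memcg. */
            if (cmpxchg(&memcg->bpf_numab, NULL, ops) != NULL)
                    return -EBUSY;

            return 0;
    }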
Which of these options do you prefer? Do you have other suggestions?
> Do you envision how these extensions would apply hierarchically?
This feature can be applied hierarchically, though doing so adds
complexity to the kernel. However, I believe that by providing just
the core functionality we let users handle their specific use cases
themselves; we do not need to implement every possible variation for
them.
> Regardless of that, being a "leaf memcg" is not a stationary condition
> (mkdirs, writes to `cgroup.subtree_control`) so it should also be
> prepared for that.
In the current implementation, the user has to attach the BPF
program to the newly created cgroup as well ;)
>
> Also, I think (please correct me) that NUMA balancing doesn't need
> memory controller (in contrast with OOM),
Correct.
> so the attachment shouldn't be
> through struct mem_cgroup but plain struct cgroup::bpf. If you could
> consider this or add some details about this decision, it'd be great.
That's a good suggestion. By removing the dependency on memcg, we can
simplify the design.
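Just to sketch what that could look like: task_dfl_cgroup() is an
existing helper, while the cgrp->bpf_numab field is hypothetical.

    /* Hypothetical lookup keyed off plain struct cgroup instead of
     * struct mem_cgroup, so no memory-controller dependency is needed.
     */
    static bool task_has_bpf_numab(struct task_struct *p)
    {
            struct cgroup *cgrp;
            bool attached;

            rcu_read_lock();
            cgrp = task_dfl_cgroup(p);
            attached = cgrp && READ_ONCE(cgrp->bpf_numab);
            rcu_read_unlock();

            return attached;
    }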
--
Regards
Yafang
Thread overview: 10+ messages
2026-01-13 12:12 [RFC PATCH bpf-next 0/3] BPF-based NUMA balancing Yafang Shao
2026-01-13 12:12 ` [RFC PATCH bpf-next 1/3] sched: add helpers for numa balancing Yafang Shao
2026-01-13 12:42 ` bot+bpf-ci
2026-01-13 12:48 ` Yafang Shao
2026-01-13 12:12 ` [RFC PATCH bpf-next 2/3] mm: add support for bpf based " Yafang Shao
2026-01-13 12:29 ` bot+bpf-ci
2026-01-13 12:46 ` Yafang Shao
2026-01-14 9:56 ` Michal Koutný
2026-01-14 12:13 ` Yafang Shao [this message]
2026-01-13 12:12 ` [RFC PATCH bpf-next 3/3] mm: set numa balancing hot threshold with bpf Yafang Shao