* [RFC PATCH bpf-next 0/3] BPF-based NUMA balancing
@ 2026-01-13 12:12 Yafang Shao
2026-01-13 12:12 ` [RFC PATCH bpf-next 1/3] sched: add helpers for numa balancing Yafang Shao
` (2 more replies)
0 siblings, 3 replies; 8+ messages in thread
From: Yafang Shao @ 2026-01-13 12:12 UTC (permalink / raw)
To: roman.gushchin, inwardvessel, shakeel.butt, akpm, ast, daniel,
andrii, mkoutny, yu.c.chen, zhao1.liu
Cc: bpf, linux-mm, Yafang Shao
In our large fleet of Kubernetes-managed servers, NUMA balancing has
historically been disabled globally on every server. With the increasing
deployment of AMD EPYC servers in our fleet, cross-NUMA access has become
a critical performance issue, prompting us to consider enabling NUMA
balancing to address it.
However, enabling NUMA balancing globally is not acceptable as it would
increase overall system overhead and potentially introduce latency spikes
for latency-sensitive workloads. Instead, we aim to enable it selectively
for workloads that can genuinely benefit from it. Even for such workloads,
we require fine-grained per-workload tuning capabilities.
To maximize cross-NUMA page migration while minimizing overhead, we
propose tuning NUMA balancing per workload using BPF.
This patchset introduces a new BPF hook, ->numab_hook(), as a
memory-cgroup-based struct_ops. It enables NUMA balancing for specific
workloads while keeping global NUMA balancing disabled, and it allows
tuning NUMA balancing parameters per workload. Patch #3 demonstrates how
to adjust the hot threshold per workload using BPF.
Since the bpf_struct_ops and cgroup integration [0] is still under
development by Roman, this patchset temporarily embeds the cgroup ID in
the struct_ops for review purposes. We can migrate to the new approach
once it is available.
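For illustration, a minimal BPF-side sketch of what a per-memcg policy
could look like with this series (struct layout per patch #2; the cgroup
inode number is a placeholder, and the skeleton details may change once
the cgroup integration lands):

#include "vmlinux.h"
#include <bpf/bpf_helpers.h>
#include <bpf/bpf_tracing.h>

char _license[] SEC("license") = "GPL";

/* Returning non-zero enables NUMA balancing for tasks in the target memcg. */
SEC("struct_ops/numab_hook")
int BPF_PROG(numab_hook, struct task_struct *p)
{
	return 1;
}

SEC(".struct_ops.link")
struct bpf_numab_ops numab = {
	.numab_hook = (void *)numab_hook,
	.cgroup_id  = 12345,	/* placeholder: inode number of the target leaf memcg */
};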
This is still an RFC with limited testing. Any feedback is welcome.
[0]. https://lore.kernel.org/bpf/CAADnVQJGiH_yF=AoFSRy4zh20uneJgBfqGshubLM6aVq069Fhg@mail.gmail.com/
Yafang Shao (3):
sched: add helpers for numa balancing
mm: add support for bpf based numa balancing
mm: set numa balancing hot threshold with bpf
MAINTAINERS | 1 +
include/linux/memcontrol.h | 6 +
include/linux/sched/numa_balancing.h | 44 +++++
kernel/sched/fair.c | 17 +-
kernel/sched/sched.h | 2 -
mm/Makefile | 5 +
mm/bpf_numa_balancing.c | 252 +++++++++++++++++++++++++++
mm/memory-tiers.c | 3 +-
mm/mempolicy.c | 3 +-
mm/migrate.c | 7 +-
mm/vmscan.c | 7 +-
11 files changed, 326 insertions(+), 21 deletions(-)
create mode 100644 mm/bpf_numa_balancing.c
--
2.43.5
* [RFC PATCH bpf-next 1/3] sched: add helpers for numa balancing
2026-01-13 12:12 [RFC PATCH bpf-next 0/3] BPF-based NUMA balancing Yafang Shao
@ 2026-01-13 12:12 ` Yafang Shao
2026-01-13 12:42 ` bot+bpf-ci
2026-01-13 12:12 ` [RFC PATCH bpf-next 2/3] mm: add support for bpf based " Yafang Shao
2026-01-13 12:12 ` [RFC PATCH bpf-next 3/3] mm: set numa balancing hot threshold with bpf Yafang Shao
2 siblings, 1 reply; 8+ messages in thread
From: Yafang Shao @ 2026-01-13 12:12 UTC (permalink / raw)
To: roman.gushchin, inwardvessel, shakeel.butt, akpm, ast, daniel,
andrii, mkoutny, yu.c.chen, zhao1.liu
Cc: bpf, linux-mm, Yafang Shao
Three new helpers, task_numab_enabled(), task_numab_mode_normal(), and
task_numab_mode_tiering(), are introduced for later use.
Signed-off-by: Yafang Shao <laoar.shao@gmail.com>
---
include/linux/sched/numa_balancing.h | 27 +++++++++++++++++++++++++++
kernel/sched/fair.c | 15 +++++++--------
kernel/sched/sched.h | 1 -
mm/memory-tiers.c | 3 ++-
mm/mempolicy.c | 3 +--
mm/migrate.c | 7 ++++---
mm/vmscan.c | 7 +++----
7 files changed, 44 insertions(+), 19 deletions(-)
diff --git a/include/linux/sched/numa_balancing.h b/include/linux/sched/numa_balancing.h
index 52b22c5c396d..792b6665f476 100644
--- a/include/linux/sched/numa_balancing.h
+++ b/include/linux/sched/numa_balancing.h
@@ -8,6 +8,7 @@
*/
#include <linux/sched.h>
+#include <linux/sched/sysctl.h>
#define TNF_MIGRATED 0x01
#define TNF_NO_GROUP 0x02
@@ -32,6 +33,28 @@ extern void set_numabalancing_state(bool enabled);
extern void task_numa_free(struct task_struct *p, bool final);
bool should_numa_migrate_memory(struct task_struct *p, struct folio *folio,
int src_nid, int dst_cpu);
+
+extern struct static_key_false sched_numa_balancing;
+static inline bool task_numab_enabled(struct task_struct *p)
+{
+ if (static_branch_unlikely(&sched_numa_balancing))
+ return true;
+ return false;
+}
+
+static inline bool task_numab_mode_normal(void)
+{
+ if (sysctl_numa_balancing_mode & NUMA_BALANCING_NORMAL)
+ return true;
+ return false;
+}
+
+static inline bool task_numab_mode_tiering(void)
+{
+ if (sysctl_numa_balancing_mode & NUMA_BALANCING_MEMORY_TIERING)
+ return true;
+ return false;
+}
#else
static inline void task_numa_fault(int last_node, int node, int pages,
int flags)
@@ -52,6 +75,10 @@ static inline bool should_numa_migrate_memory(struct task_struct *p,
{
return true;
}
+static inline bool task_numab_enabled(struct task_struct *p)
+{
+ return false;
+}
#endif
#endif /* _LINUX_SCHED_NUMA_BALANCING_H */
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index da46c3164537..4f6583ef83b2 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -1932,8 +1932,8 @@ bool should_numa_migrate_memory(struct task_struct *p, struct folio *folio,
this_cpupid = cpu_pid_to_cpupid(dst_cpu, current->pid);
last_cpupid = folio_xchg_last_cpupid(folio, this_cpupid);
- if (!(sysctl_numa_balancing_mode & NUMA_BALANCING_MEMORY_TIERING) &&
- !node_is_toptier(src_nid) && !cpupid_valid(last_cpupid))
+ if (!(task_numab_mode_tiering()) && !node_is_toptier(src_nid) &&
+ !cpupid_valid(last_cpupid))
return false;
/*
@@ -3140,7 +3140,7 @@ void task_numa_fault(int last_cpupid, int mem_node, int pages, int flags)
struct numa_group *ng;
int priv;
- if (!static_branch_likely(&sched_numa_balancing))
+ if (!task_numab_enabled(p))
return;
/* for example, ksmd faulting in a user's mm */
@@ -3151,8 +3151,7 @@ void task_numa_fault(int last_cpupid, int mem_node, int pages, int flags)
* NUMA faults statistics are unnecessary for the slow memory
* node for memory tiering mode.
*/
- if (!node_is_toptier(mem_node) &&
- (sysctl_numa_balancing_mode & NUMA_BALANCING_MEMORY_TIERING ||
+ if (!node_is_toptier(mem_node) && (task_numab_mode_tiering() ||
!cpupid_valid(last_cpupid)))
return;
@@ -3611,7 +3610,7 @@ static void update_scan_period(struct task_struct *p, int new_cpu)
int src_nid = cpu_to_node(task_cpu(p));
int dst_nid = cpu_to_node(new_cpu);
- if (!static_branch_likely(&sched_numa_balancing))
+ if (!task_numab_enabled(p))
return;
if (!p->mm || !p->numa_faults || (p->flags & PF_EXITING))
@@ -9353,7 +9352,7 @@ static long migrate_degrades_locality(struct task_struct *p, struct lb_env *env)
unsigned long src_weight, dst_weight;
int src_nid, dst_nid, dist;
- if (!static_branch_likely(&sched_numa_balancing))
+ if (!task_numab_enabled(p))
return 0;
if (!p->numa_faults || !(env->sd->flags & SD_NUMA))
@@ -13374,7 +13373,7 @@ static void task_tick_fair(struct rq *rq, struct task_struct *curr, int queued)
entity_tick(cfs_rq, se, queued);
}
- if (static_branch_unlikely(&sched_numa_balancing))
+ if (task_numab_enabled(curr))
task_tick_numa(rq, curr);
update_misfit_status(curr, rq);
diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
index d30cca6870f5..1247e4b0c2b0 100644
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -2269,7 +2269,6 @@ extern struct static_key sched_feat_keys[__SCHED_FEAT_NR];
#endif /* !CONFIG_JUMP_LABEL */
-extern struct static_key_false sched_numa_balancing;
extern struct static_key_false sched_schedstats;
static inline u64 global_rt_period(void)
diff --git a/mm/memory-tiers.c b/mm/memory-tiers.c
index 864811fff409..cb14d557a995 100644
--- a/mm/memory-tiers.c
+++ b/mm/memory-tiers.c
@@ -7,6 +7,7 @@
#include <linux/memory-tiers.h>
#include <linux/notifier.h>
#include <linux/sched/sysctl.h>
+#include <linux/sched/numa_balancing.h>
#include "internal.h"
@@ -64,7 +65,7 @@ static const struct bus_type memory_tier_subsys = {
*/
bool folio_use_access_time(struct folio *folio)
{
- return (sysctl_numa_balancing_mode & NUMA_BALANCING_MEMORY_TIERING) &&
+ return (task_numab_mode_tiering()) &&
!node_is_toptier(folio_nid(folio));
}
#endif
diff --git a/mm/mempolicy.c b/mm/mempolicy.c
index 68a98ba57882..589bf37bc4ee 100644
--- a/mm/mempolicy.c
+++ b/mm/mempolicy.c
@@ -863,8 +863,7 @@ bool folio_can_map_prot_numa(struct folio *folio, struct vm_area_struct *vma,
* Skip scanning top tier node if normal numa
* balancing is disabled
*/
- if (!(sysctl_numa_balancing_mode & NUMA_BALANCING_NORMAL) &&
- node_is_toptier(nid))
+ if (!task_numab_mode_normal() && node_is_toptier(nid))
return false;
if (folio_use_access_time(folio))
diff --git a/mm/migrate.c b/mm/migrate.c
index 5169f9717f60..aa540f4d4cc8 100644
--- a/mm/migrate.c
+++ b/mm/migrate.c
@@ -41,6 +41,7 @@
#include <linux/ptrace.h>
#include <linux/memory.h>
#include <linux/sched/sysctl.h>
+#include <linux/sched/numa_balancing.h>
#include <linux/memory-tiers.h>
#include <linux/pagewalk.h>
@@ -802,7 +803,7 @@ void folio_migrate_flags(struct folio *newfolio, struct folio *folio)
* memory node, reset cpupid, because that is used to record
* page access time in slow memory node.
*/
- if (sysctl_numa_balancing_mode & NUMA_BALANCING_MEMORY_TIERING) {
+ if (task_numab_mode_tiering()) {
bool f_toptier = node_is_toptier(folio_nid(folio));
bool t_toptier = node_is_toptier(folio_nid(newfolio));
@@ -2685,7 +2686,7 @@ int migrate_misplaced_folio_prepare(struct folio *folio,
if (!migrate_balanced_pgdat(pgdat, nr_pages)) {
int z;
- if (!(sysctl_numa_balancing_mode & NUMA_BALANCING_MEMORY_TIERING))
+ if (!task_numab_mode_tiering())
return -EAGAIN;
for (z = pgdat->nr_zones - 1; z >= 0; z--) {
if (managed_zone(pgdat->node_zones + z))
@@ -2737,7 +2738,7 @@ int migrate_misplaced_folio(struct folio *folio, int node)
if (nr_succeeded) {
count_vm_numa_events(NUMA_PAGE_MIGRATE, nr_succeeded);
count_memcg_events(memcg, NUMA_PAGE_MIGRATE, nr_succeeded);
- if ((sysctl_numa_balancing_mode & NUMA_BALANCING_MEMORY_TIERING)
+ if (task_numab_mode_tiering()
&& !node_is_toptier(folio_nid(folio))
&& node_is_toptier(node))
mod_lruvec_state(lruvec, PGPROMOTE_SUCCESS, nr_succeeded);
diff --git a/mm/vmscan.c b/mm/vmscan.c
index 670fe9fae5ba..7ee5695326e3 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -65,6 +65,7 @@
#include <linux/swapops.h>
#include <linux/balloon_compaction.h>
#include <linux/sched/sysctl.h>
+#include <linux/sched/numa_balancing.h>
#include "internal.h"
#include "swap.h"
@@ -4843,9 +4844,7 @@ static bool should_abort_scan(struct lruvec *lruvec, struct scan_control *sc)
if (!current_is_kswapd() || sc->order)
return false;
- mark = sysctl_numa_balancing_mode & NUMA_BALANCING_MEMORY_TIERING ?
- WMARK_PROMO : WMARK_HIGH;
-
+ mark = task_numab_mode_tiering() ? WMARK_PROMO : WMARK_HIGH;
for (i = 0; i <= sc->reclaim_idx; i++) {
struct zone *zone = lruvec_pgdat(lruvec)->node_zones + i;
unsigned long size = wmark_pages(zone, mark) + MIN_LRU_BATCH;
@@ -6774,7 +6773,7 @@ static bool pgdat_balanced(pg_data_t *pgdat, int order, int highest_zoneidx)
enum zone_stat_item item;
unsigned long free_pages;
- if (sysctl_numa_balancing_mode & NUMA_BALANCING_MEMORY_TIERING)
+ if (task_numab_mode_tiering())
mark = promo_wmark_pages(zone);
else
mark = high_wmark_pages(zone);
--
2.43.5
* [RFC PATCH bpf-next 2/3] mm: add support for bpf based numa balancing
2026-01-13 12:12 [RFC PATCH bpf-next 0/3] BPF-based NUMA balancing Yafang Shao
2026-01-13 12:12 ` [RFC PATCH bpf-next 1/3] sched: add helpers for numa balancing Yafang Shao
@ 2026-01-13 12:12 ` Yafang Shao
2026-01-13 12:29 ` bot+bpf-ci
2026-01-13 12:12 ` [RFC PATCH bpf-next 3/3] mm: set numa balancing hot threshold with bpf Yafang Shao
2 siblings, 1 reply; 8+ messages in thread
From: Yafang Shao @ 2026-01-13 12:12 UTC (permalink / raw)
To: roman.gushchin, inwardvessel, shakeel.butt, akpm, ast, daniel,
andrii, mkoutny, yu.c.chen, zhao1.liu
Cc: bpf, linux-mm, Yafang Shao
bpf_numab_ops enables NUMA balancing for tasks within a specific memcg,
even when global NUMA balancing is disabled. This allows selective NUMA
optimization for workloads that benefit from it, while avoiding potential
latency spikes for other workloads.
The policy must be attached to a leaf memory cgroup. If the per-task
lookup ever becomes a performance bottleneck, memcg::bpf_numab can later
be cached in the mm_struct of tasks within the memcg to reduce the
overhead.
The cgroup ID is embedded in bpf_numab_ops as a compile-time constant,
which restricts each instance to a single cgroup and prevents attachment
to multiple cgroups. Roman is working on a solution to remove this
limitation, after which we can migrate to the new approach.
Currently only the normal mode is supported.
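For reference, attaching could look roughly like the following from
userspace, assuming a libbpf skeleton generated as numab.skel.h with a
struct_ops map named numab (both names are hypothetical):

#include <bpf/libbpf.h>
#include "numab.skel.h"		/* hypothetical generated skeleton */

int main(void)
{
	struct numab_bpf *skel;
	struct bpf_link *link;

	skel = numab_bpf__open_and_load();
	if (!skel)
		return 1;

	/* bpf_numab_reg() only accepts link-mode attachment. */
	link = bpf_map__attach_struct_ops(skel->maps.numab);
	if (!link) {
		numab_bpf__destroy(skel);
		return 1;
	}

	/* The policy stays in effect while the link is held. */
	bpf_link__destroy(link);	/* detaches and triggers bpf_numab_unreg() */
	numab_bpf__destroy(skel);
	return 0;
}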
Signed-off-by: Yafang Shao <laoar.shao@gmail.com>
---
MAINTAINERS | 1 +
include/linux/memcontrol.h | 6 +
include/linux/sched/numa_balancing.h | 10 +-
mm/Makefile | 5 +
mm/bpf_numa_balancing.c | 224 +++++++++++++++++++++++++++
5 files changed, 245 insertions(+), 1 deletion(-)
create mode 100644 mm/bpf_numa_balancing.c
diff --git a/MAINTAINERS b/MAINTAINERS
index 70c2b73b3941..0d2c083557e0 100644
--- a/MAINTAINERS
+++ b/MAINTAINERS
@@ -4807,6 +4807,7 @@ L: bpf@vger.kernel.org
L: linux-mm@kvack.org
S: Maintained
F: mm/bpf_memcontrol.c
+F: mm/bpf_numa_balancing.c
BPF [MISC]
L: bpf@vger.kernel.org
diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h
index 229ac9835adb..b02e8f380275 100644
--- a/include/linux/memcontrol.h
+++ b/include/linux/memcontrol.h
@@ -29,6 +29,7 @@ struct obj_cgroup;
struct page;
struct mm_struct;
struct kmem_cache;
+struct bpf_numab_ops;
/* Cgroup-specific page state, on top of universal node page state */
enum memcg_stat_item {
@@ -284,6 +285,11 @@ struct mem_cgroup {
struct lru_gen_mm_list mm_list;
#endif
+#ifdef CONFIG_BPF
+ /* per cgroup NUMA balancing control */
+ struct bpf_numab_ops __rcu *bpf_numab;
+#endif
+
#ifdef CONFIG_MEMCG_V1
/* Legacy consumer-oriented counters */
struct page_counter kmem; /* v1 only */
diff --git a/include/linux/sched/numa_balancing.h b/include/linux/sched/numa_balancing.h
index 792b6665f476..c58d32ab39a7 100644
--- a/include/linux/sched/numa_balancing.h
+++ b/include/linux/sched/numa_balancing.h
@@ -35,17 +35,25 @@ bool should_numa_migrate_memory(struct task_struct *p, struct folio *folio,
int src_nid, int dst_cpu);
extern struct static_key_false sched_numa_balancing;
+extern struct static_key_false bpf_numab_enabled_key;
+int bpf_numab_hook(struct task_struct *p);
static inline bool task_numab_enabled(struct task_struct *p)
{
if (static_branch_unlikely(&sched_numa_balancing))
return true;
- return false;
+ if (!static_branch_unlikely(&bpf_numab_enabled_key))
+ return false;
+
+ /* A BPF prog is attached. */
+ return bpf_numab_hook(p);
}
static inline bool task_numab_mode_normal(void)
{
if (sysctl_numa_balancing_mode & NUMA_BALANCING_NORMAL)
return true;
+ if (static_branch_unlikely(&bpf_numab_enabled_key))
+ return true;
return false;
}
diff --git a/mm/Makefile b/mm/Makefile
index bf46fe31dc14..c2b887491f09 100644
--- a/mm/Makefile
+++ b/mm/Makefile
@@ -107,8 +107,13 @@ ifdef CONFIG_SWAP
obj-$(CONFIG_MEMCG) += swap_cgroup.o
endif
ifdef CONFIG_BPF_SYSCALL
+ifdef CONFIG_NUMA_BALANCING
obj-$(CONFIG_MEMCG) += bpf_memcontrol.o
endif
+endif
+ifdef CONFIG_BPF_SYSCALL
+obj-$(CONFIG_MEMCG) += bpf_numa_balancing.o
+endif
obj-$(CONFIG_CGROUP_HUGETLB) += hugetlb_cgroup.o
obj-$(CONFIG_GUP_TEST) += gup_test.o
obj-$(CONFIG_DMAPOOL_TEST) += dmapool_test.o
diff --git a/mm/bpf_numa_balancing.c b/mm/bpf_numa_balancing.c
new file mode 100644
index 000000000000..aac4eec7c6ba
--- /dev/null
+++ b/mm/bpf_numa_balancing.c
@@ -0,0 +1,224 @@
+// SPDX-License-Identifier: GPL-2.0-or-later
+
+#include <linux/bpf.h>
+#include <linux/btf.h>
+#include <linux/memcontrol.h>
+#include <linux/sched/numa_balancing.h>
+
+typedef int numab_fn_t(struct task_struct *p);
+
+struct bpf_numab_ops {
+ numab_fn_t *numab_hook;
+
+ /* TODO:
+ * The cgroup_id embedded in this struct is set at compile time
+ * and cannot be modified during BPF program attach time.
+ * Modifying it at attach time requires libbpf support,
+ * which is currently under development by Roman.
+ */
+ int cgroup_id;
+};
+
+static DEFINE_SPINLOCK(numab_ops_lock);
+DEFINE_STATIC_KEY_FALSE(bpf_numab_enabled_key);
+
+int bpf_numab_hook(struct task_struct *p)
+{
+ struct bpf_numab_ops *bpf_numab;
+ struct mem_cgroup *task_memcg;
+ int ret = 0;
+
+ if (!p->mm)
+ return 0;
+
+ /* We can cache memcg::bpf_numab in mm::bpf_numab if it becomes a bottleneck */
+ rcu_read_lock();
+ task_memcg = mem_cgroup_from_task(rcu_dereference(p->mm->owner));
+ if (!task_memcg)
+ goto out;
+
+ /* Users can install BPF NUMA policies on leaf memory cgroups.
+ * This eliminates the need to traverse the cgroup hierarchy or
+ * propagate policies during registration, simplifying the kernel design.
+ */
+ bpf_numab = rcu_dereference(task_memcg->bpf_numab);
+ if (!bpf_numab || !bpf_numab->numab_hook)
+ goto out;
+
+ ret = bpf_numab->numab_hook(p);
+
+out:
+ rcu_read_unlock();
+ return ret;
+}
+
+static const struct bpf_func_proto *
+bpf_numab_get_func_proto(enum bpf_func_id func_id, const struct bpf_prog *prog)
+{
+ return bpf_base_func_proto(func_id, prog);
+}
+
+static bool bpf_numab_ops_is_valid_access(int off, int size,
+ enum bpf_access_type type,
+ const struct bpf_prog *prog,
+ struct bpf_insn_access_aux *info)
+{
+ return bpf_tracing_btf_ctx_access(off, size, type, prog, info);
+}
+
+static const struct bpf_verifier_ops bpf_numab_verifier_ops = {
+ .get_func_proto = bpf_numab_get_func_proto,
+ .is_valid_access = bpf_numab_ops_is_valid_access,
+};
+
+static int bpf_numab_init(struct btf *btf)
+{
+ return 0;
+}
+
+static int bpf_numab_check_member(const struct btf_type *t,
+ const struct btf_member *member,
+ const struct bpf_prog *prog)
+{
+ /* The call site operates under RCU protection. */
+ if (prog->sleepable)
+ return -EINVAL;
+ return 0;
+}
+
+static int bpf_numab_init_member(const struct btf_type *t,
+ const struct btf_member *member,
+ void *kdata, const void *udata)
+{
+ const struct bpf_numab_ops *ubpf_numab;
+ struct bpf_numab_ops *kbpf_numab;
+ u32 moff;
+
+ ubpf_numab = (const struct bpf_numab_ops *)udata;
+ kbpf_numab = (struct bpf_numab_ops *)kdata;
+
+ moff = __btf_member_bit_offset(t, member) / 8;
+ switch (moff) {
+ case offsetof(struct bpf_numab_ops, cgroup_id):
+ /* bpf_struct_ops only handles func ptrs and zero-ed members.
+ * Return 1 to bypass the default handler.
+ */
+ kbpf_numab->cgroup_id = ubpf_numab->cgroup_id;
+ return 1;
+ }
+ return 0;
+}
+
+static int bpf_numab_reg(void *kdata, struct bpf_link *link)
+{
+ struct bpf_numab_ops *ops = kdata;
+ struct mem_cgroup *memcg;
+ int err = 0;
+
+ /* Only the link mode is supported. */
+ if (!link)
+ return -EOPNOTSUPP;
+
+ /* Depends on CONFIG_SHRINKER_DEBUG */
+ memcg = mem_cgroup_get_from_ino(ops->cgroup_id);
+ if (!memcg || IS_ERR(memcg))
+ return -ENOENT;
+
+ spin_lock(&numab_ops_lock);
+ /* Each memory cgroup can have at most one attached BPF program to ensure
+ * exclusive control and avoid interference between different BPF policies.
+ */
+ if (rcu_access_pointer(memcg->bpf_numab)) {
+ err = -EBUSY;
+ goto out;
+ }
+ rcu_assign_pointer(memcg->bpf_numab, ops);
+ spin_unlock(&numab_ops_lock);
+ static_branch_inc(&bpf_numab_enabled_key);
+
+out:
+ mem_cgroup_put(memcg);
+ return err;
+}
+
+static void bpf_numab_unreg(void *kdata, struct bpf_link *link)
+{
+ struct bpf_numab_ops *ops = kdata;
+ struct mem_cgroup *memcg;
+
+ memcg = mem_cgroup_get_from_ino(ops->cgroup_id);
+ if (!memcg)
+ return;
+
+ spin_lock(&numab_ops_lock);
+ if (!rcu_access_pointer(memcg->bpf_numab)) {
+ spin_unlock(&numab_ops_lock);
+ return;
+ }
+ rcu_replace_pointer(memcg->bpf_numab, NULL, lockdep_is_held(&numab_ops_lock));
+ spin_unlock(&numab_ops_lock);
+ static_branch_dec(&bpf_numab_enabled_key);
+ synchronize_rcu();
+}
+
+static int bpf_numab_update(void *kdata, void *old_kdata, struct bpf_link *link)
+{
+ struct bpf_numab_ops *ops = kdata;
+ struct mem_cgroup *memcg;
+
+ memcg = mem_cgroup_get_from_ino(ops->cgroup_id);
+ if (!memcg)
+ return -EINVAL;
+
+ spin_lock(&numab_ops_lock);
+ /* The update can proceed regardless of whether memcg->bpf_numab has been previously set. */
+ rcu_replace_pointer(memcg->bpf_numab, ops, lockdep_is_held(&numab_ops_lock));
+ spin_unlock(&numab_ops_lock);
+ synchronize_rcu();
+ return 0;
+}
+
+static int bpf_numab_validate(void *kdata)
+{
+ struct bpf_numab_ops *ops = kdata;
+
+ if (!ops->numab_hook) {
+ pr_err("bpf_numab: required ops isn't implemented\n");
+ return -EINVAL;
+ }
+ return 0;
+}
+
+static int bpf_numa_balancing(struct task_struct *p)
+{
+ return 1;
+}
+
+static struct bpf_numab_ops __bpf_numab_ops = {
+ .numab_hook = (numab_fn_t *)bpf_numa_balancing,
+};
+
+static struct bpf_struct_ops bpf_bpf_numab_ops = {
+ .verifier_ops = &bpf_numab_verifier_ops,
+ .init = bpf_numab_init,
+ .check_member = bpf_numab_check_member,
+ .init_member = bpf_numab_init_member,
+ .reg = bpf_numab_reg,
+ .unreg = bpf_numab_unreg,
+ .update = bpf_numab_update,
+ .validate = bpf_numab_validate,
+ .cfi_stubs = &__bpf_numab_ops,
+ .owner = THIS_MODULE,
+ .name = "bpf_numab_ops",
+};
+
+static int __init bpf_numab_ops_init(void)
+{
+ int err;
+
+ err = register_bpf_struct_ops(&bpf_bpf_numab_ops, bpf_numab_ops);
+ if (err)
+ pr_err("bpf_numab: Failed to register struct_ops (%d)\n", err);
+ return err;
+}
+late_initcall(bpf_numab_ops_init);
--
2.43.5
* [RFC PATCH bpf-next 3/3] mm: set numa balancing hot threshold with bpf
2026-01-13 12:12 [RFC PATCH bpf-next 0/3] BPF-based NUMA balancing Yafang Shao
2026-01-13 12:12 ` [RFC PATCH bpf-next 1/3] sched: add helpers for numa balancing Yafang Shao
2026-01-13 12:12 ` [RFC PATCH bpf-next 2/3] mm: add support for bpf based " Yafang Shao
@ 2026-01-13 12:12 ` Yafang Shao
2 siblings, 0 replies; 8+ messages in thread
From: Yafang Shao @ 2026-01-13 12:12 UTC (permalink / raw)
To: roman.gushchin, inwardvessel, shakeel.butt, akpm, ast, daniel,
andrii, mkoutny, yu.c.chen, zhao1.liu
Cc: bpf, linux-mm, Yafang Shao
Our experimentation with NUMA balancing across our server fleet revealed
that different workloads require distinct hot thresholds. Tuning the
threshold per workload allows us to migrate the maximum number of
cross-NUMA pages while avoiding significant latency impact on sensitive
workloads.
We can also configure other per-workload NUMA balancing parameters via BPF,
such as scan_size_mb in /sys/kernel/debug/sched/numa_balancing/.
This can be implemented later if the core approach proves acceptable.
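As a usage sketch (building on the hypothetical struct_ops map from
patch #2; the values are placeholders):

SEC(".struct_ops.link")
struct bpf_numab_ops numab = {
	.numab_hook = (void *)numab_hook,
	.cgroup_id  = 12345,	/* placeholder: inode number of the target leaf memcg */
	/* 0 keeps the global sysctl_numa_balancing_hot_threshold;
	 * any non-zero value overrides it for this memcg.
	 */
	.hot_thresh = 200,
};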
Signed-off-by: Yafang Shao <laoar.shao@gmail.com>
---
include/linux/sched/numa_balancing.h | 9 +++++++++
kernel/sched/fair.c | 2 +-
kernel/sched/sched.h | 1 -
mm/bpf_numa_balancing.c | 28 ++++++++++++++++++++++++++++
4 files changed, 38 insertions(+), 2 deletions(-)
diff --git a/include/linux/sched/numa_balancing.h b/include/linux/sched/numa_balancing.h
index c58d32ab39a7..bbf5b884aa47 100644
--- a/include/linux/sched/numa_balancing.h
+++ b/include/linux/sched/numa_balancing.h
@@ -36,7 +36,9 @@ bool should_numa_migrate_memory(struct task_struct *p, struct folio *folio,
extern struct static_key_false sched_numa_balancing;
extern struct static_key_false bpf_numab_enabled_key;
+extern unsigned int sysctl_numa_balancing_hot_threshold;
int bpf_numab_hook(struct task_struct *p);
+unsigned int bpf_numab_hot_thresh(struct task_struct *p);
static inline bool task_numab_enabled(struct task_struct *p)
{
if (static_branch_unlikely(&sched_numa_balancing))
@@ -63,6 +65,13 @@ static inline bool task_numab_mode_tiering(void)
return true;
return false;
}
+
+static inline unsigned int task_numab_hot_thresh(struct task_struct *p)
+{
+ if (!static_branch_unlikely(&bpf_numab_enabled_key))
+ return sysctl_numa_balancing_hot_threshold;
+ return bpf_numab_hot_thresh(p);
+}
#else
static inline void task_numa_fault(int last_node, int node, int pages,
int flags)
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 4f6583ef83b2..d51ddd46f4be 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -1917,7 +1917,7 @@ bool should_numa_migrate_memory(struct task_struct *p, struct folio *folio,
return true;
}
- def_th = sysctl_numa_balancing_hot_threshold;
+ def_th = task_numab_hot_thresh(p);
rate_limit = MB_TO_PAGES(sysctl_numa_balancing_promote_rate_limit);
numa_promotion_adjust_threshold(pgdat, rate_limit, def_th);
diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
index 1247e4b0c2b0..d72eaa472d7d 100644
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -2961,7 +2961,6 @@ extern unsigned int sysctl_numa_balancing_scan_delay;
extern unsigned int sysctl_numa_balancing_scan_period_min;
extern unsigned int sysctl_numa_balancing_scan_period_max;
extern unsigned int sysctl_numa_balancing_scan_size;
-extern unsigned int sysctl_numa_balancing_hot_threshold;
#ifdef CONFIG_SCHED_HRTICK
diff --git a/mm/bpf_numa_balancing.c b/mm/bpf_numa_balancing.c
index aac4eec7c6ba..26e80434f337 100644
--- a/mm/bpf_numa_balancing.c
+++ b/mm/bpf_numa_balancing.c
@@ -9,6 +9,7 @@ typedef int numab_fn_t(struct task_struct *p);
struct bpf_numab_ops {
numab_fn_t *numab_hook;
+ unsigned int hot_thresh;
/* TODO:
* The cgroup_id embedded in this struct is set at compile time
@@ -52,6 +53,30 @@ int bpf_numab_hook(struct task_struct *p)
return ret;
}
+unsigned int bpf_numab_hot_thresh(struct task_struct *p)
+{
+ unsigned int ret = sysctl_numa_balancing_hot_threshold;
+ struct bpf_numab_ops *bpf_numab;
+ struct mem_cgroup *task_memcg;
+
+ if (unlikely(!p->mm))
+ return ret;
+
+ rcu_read_lock();
+ task_memcg = mem_cgroup_from_task(rcu_dereference(p->mm->owner));
+ if (!task_memcg)
+ goto out;
+
+ bpf_numab = rcu_dereference(task_memcg->bpf_numab);
+ if (!bpf_numab || !bpf_numab->hot_thresh)
+ goto out;
+
+ ret = bpf_numab->hot_thresh;
+out:
+ rcu_read_unlock();
+ return ret;
+}
+
static const struct bpf_func_proto *
bpf_numab_get_func_proto(enum bpf_func_id func_id, const struct bpf_prog *prog)
{
@@ -105,6 +130,9 @@ static int bpf_numab_init_member(const struct btf_type *t,
*/
kbpf_numab->cgroup_id = ubpf_numab->cgroup_id;
return 1;
+ case offsetof(struct bpf_numab_ops, hot_thresh):
+ kbpf_numab->hot_thresh = ubpf_numab->hot_thresh;
+ return 1;
}
return 0;
}
--
2.43.5
* Re: [RFC PATCH bpf-next 2/3] mm: add support for bpf based numa balancing
2026-01-13 12:12 ` [RFC PATCH bpf-next 2/3] mm: add support for bpf based " Yafang Shao
@ 2026-01-13 12:29 ` bot+bpf-ci
2026-01-13 12:46 ` Yafang Shao
0 siblings, 1 reply; 8+ messages in thread
From: bot+bpf-ci @ 2026-01-13 12:29 UTC (permalink / raw)
To: laoar.shao, roman.gushchin, inwardvessel, shakeel.butt, akpm,
ast, daniel, andrii, mkoutny, yu.c.chen, zhao1.liu
Cc: bpf, linux-mm, laoar.shao, ast, andrii, daniel, martin.lau,
eddyz87, yonghong.song, clm, ihor.solodrai
> diff --git a/mm/Makefile b/mm/Makefile
> index bf46fe31dc14..c2b887491f09 100644
[ ... ]
> @@ -107,8 +107,13 @@ ifdef CONFIG_SWAP
> obj-$(CONFIG_MEMCG) += swap_cgroup.o
> endif
> ifdef CONFIG_BPF_SYSCALL
> +ifdef CONFIG_NUMA_BALANCING
> obj-$(CONFIG_MEMCG) += bpf_memcontrol.o
> endif
> +endif
> +ifdef CONFIG_BPF_SYSCALL
> +obj-$(CONFIG_MEMCG) += bpf_numa_balancing.o
> +endif
Should bpf_numa_balancing.o also require CONFIG_SHRINKER_DEBUG? The code
calls mem_cgroup_get_from_ino() which is only defined when
CONFIG_SHRINKER_DEBUG is enabled. Without it, compilation will fail due to
an undefined reference.
Also, should it require CONFIG_NUMA_BALANCING? Without that config option,
task_numab_enabled() returns false unconditionally, so the BPF hook would
never be invoked even if a program is attached.
> diff --git a/mm/bpf_numa_balancing.c b/mm/bpf_numa_balancing.c
> new file mode 100644
> index 000000000000..aac4eec7c6ba
> --- /dev/null
> +++ b/mm/bpf_numa_balancing.c
[ ... ]
> +static int bpf_numab_reg(void *kdata, struct bpf_link *link)
> +{
> + struct bpf_numab_ops *ops = kdata;
> + struct mem_cgroup *memcg;
> + int err = 0;
> +
> + /* Only the link mode is supported. */
> + if (!link)
> + return -EOPNOTSUPP;
> +
> + /* Depends on CONFIG_SHRINKER_DEBUG */
> + memcg = mem_cgroup_get_from_ino(ops->cgroup_id);
> + if (!memcg || IS_ERR(memcg))
> + return -ENOENT;
> +
> + spin_lock(&numab_ops_lock);
> + /* Each memory cgroup can have at most one attached BPF program to ensure
> + * exclusive control and avoid interference between different BPF policies.
> + */
> + if (rcu_access_pointer(memcg->bpf_numab)) {
> + err = -EBUSY;
> + goto out;
> + }
> + rcu_assign_pointer(memcg->bpf_numab, ops);
> + spin_unlock(&numab_ops_lock);
> + static_branch_inc(&bpf_numab_enabled_key);
> +
> +out:
> + mem_cgroup_put(memcg);
> + return err;
> +}
When rcu_access_pointer(memcg->bpf_numab) is non-NULL, the code sets err
to -EBUSY and jumps to out:, but this bypasses the spin_unlock(). Is the
spinlock left held when returning -EBUSY?
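One possible (untested) restructuring of the tail of bpf_numab_reg() that
releases the lock on every path and keeps the sleepable
static_branch_inc() outside the spinlock:

	spin_lock(&numab_ops_lock);
	if (rcu_access_pointer(memcg->bpf_numab)) {
		spin_unlock(&numab_ops_lock);
		err = -EBUSY;
		goto out_put;
	}
	rcu_assign_pointer(memcg->bpf_numab, ops);
	spin_unlock(&numab_ops_lock);
	/* static_branch_inc() may sleep, so keep it outside the spinlock. */
	static_branch_inc(&bpf_numab_enabled_key);
out_put:
	mem_cgroup_put(memcg);
	return err;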
> +static void bpf_numab_unreg(void *kdata, struct bpf_link *link)
> +{
> + struct bpf_numab_ops *ops = kdata;
> + struct mem_cgroup *memcg;
> +
> + memcg = mem_cgroup_get_from_ino(ops->cgroup_id);
> + if (!memcg)
> + return;
mem_cgroup_get_from_ino() returns ERR_PTR() on failure, not NULL. If the
cgroup was deleted between reg and unreg, this function would get an
ERR_PTR value, the if (!memcg) check would pass, and then
rcu_access_pointer(memcg->bpf_numab) would dereference an invalid pointer.
Should this check be if (!memcg || IS_ERR(memcg)) like in bpf_numab_reg()?
Also, mem_cgroup_get_from_ino() takes a reference on the memcg via
cgroup_get_e_css(). Where is the corresponding mem_cgroup_put() call?
The function appears to leak the memcg reference on all paths.
> + spin_lock(&numab_ops_lock);
> + if (!rcu_access_pointer(memcg->bpf_numab)) {
> + spin_unlock(&numab_ops_lock);
> + return;
> + }
> + rcu_replace_pointer(memcg->bpf_numab, NULL, lockdep_is_held(&numab_ops_lock));
> + spin_unlock(&numab_ops_lock);
> + static_branch_dec(&bpf_numab_enabled_key);
> + synchronize_rcu();
> +}
> +
> +static int bpf_numab_update(void *kdata, void *old_kdata, struct bpf_link *link)
> +{
> + struct bpf_numab_ops *ops = kdata;
> + struct mem_cgroup *memcg;
> +
> + memcg = mem_cgroup_get_from_ino(ops->cgroup_id);
> + if (!memcg)
> + return -EINVAL;
Same two issues as bpf_numab_unreg(): the check should handle ERR_PTR()
values, and there appears to be no mem_cgroup_put() to release the
reference taken by mem_cgroup_get_from_ino().
> + spin_lock(&numab_ops_lock);
> + /* The update can proceed regardless of whether memcg->bpf_numab has been previously set. */
> + rcu_replace_pointer(memcg->bpf_numab, ops, lockdep_is_held(&numab_ops_lock));
> + spin_unlock(&numab_ops_lock);
> + synchronize_rcu();
> + return 0;
> +}
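An untested sketch of how bpf_numab_unreg() might address both points
(the same pattern would apply to bpf_numab_update()):

static void bpf_numab_unreg(void *kdata, struct bpf_link *link)
{
	struct bpf_numab_ops *ops = kdata;
	struct mem_cgroup *memcg;
	bool removed = false;

	memcg = mem_cgroup_get_from_ino(ops->cgroup_id);
	if (IS_ERR_OR_NULL(memcg))
		return;

	spin_lock(&numab_ops_lock);
	if (rcu_access_pointer(memcg->bpf_numab)) {
		rcu_replace_pointer(memcg->bpf_numab, NULL,
				    lockdep_is_held(&numab_ops_lock));
		removed = true;
	}
	spin_unlock(&numab_ops_lock);

	if (removed) {
		static_branch_dec(&bpf_numab_enabled_key);
		synchronize_rcu();
	}
	mem_cgroup_put(memcg);
}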
---
AI reviewed your patch. Please fix the bug or email reply why it's not a bug.
See: https://github.com/kernel-patches/vmtest/blob/master/ci/claude/README.md
CI run summary: https://github.com/kernel-patches/bpf/actions/runs/20956455529
* Re: [RFC PATCH bpf-next 1/3] sched: add helpers for numa balancing
2026-01-13 12:12 ` [RFC PATCH bpf-next 1/3] sched: add helpers for numa balancing Yafang Shao
@ 2026-01-13 12:42 ` bot+bpf-ci
2026-01-13 12:48 ` Yafang Shao
0 siblings, 1 reply; 8+ messages in thread
From: bot+bpf-ci @ 2026-01-13 12:42 UTC (permalink / raw)
To: laoar.shao, roman.gushchin, inwardvessel, shakeel.butt, akpm,
ast, daniel, andrii, mkoutny, yu.c.chen, zhao1.liu
Cc: bpf, linux-mm, laoar.shao, ast, andrii, daniel, martin.lau,
eddyz87, yonghong.song, clm, ihor.solodrai
> diff --git a/include/linux/sched/numa_balancing.h b/include/linux/sched/numa_balancing.h
> index 52b22c5c396d..792b6665f476 100644
> --- a/include/linux/sched/numa_balancing.h
> +++ b/include/linux/sched/numa_balancing.h
[ ... ]
> @@ -32,6 +33,28 @@ extern void set_numabalancing_state(bool enabled);
> extern void task_numa_free(struct task_struct *p, bool final);
> bool should_numa_migrate_memory(struct task_struct *p, struct folio *folio,
> int src_nid, int dst_cpu);
> +
> +extern struct static_key_false sched_numa_balancing;
> +static inline bool task_numab_enabled(struct task_struct *p)
> +{
> + if (static_branch_unlikely(&sched_numa_balancing))
> + return true;
> + return false;
> +}
> +
> +static inline bool task_numab_mode_normal(void)
> +{
> + if (sysctl_numa_balancing_mode & NUMA_BALANCING_NORMAL)
> + return true;
> + return false;
> +}
> +
> +static inline bool task_numab_mode_tiering(void)
> +{
> + if (sysctl_numa_balancing_mode & NUMA_BALANCING_MEMORY_TIERING)
> + return true;
> + return false;
> +}
> #else
> static inline void task_numa_fault(int last_node, int node, int pages,
> int flags)
[ ... ]
> @@ -52,6 +75,10 @@ static inline bool should_numa_migrate_memory(struct task_struct *p,
> {
> return true;
> }
> +static inline bool task_numab_enabled(struct task_struct *p)
> +{
> + return false;
> +}
> #endif
The #else block adds a stub for task_numab_enabled(), but task_numab_mode_normal()
and task_numab_mode_tiering() are missing from this section. These helpers are
defined when CONFIG_NUMA_BALANCING is enabled, but have no stubs when it is
disabled.
> diff --git a/mm/vmscan.c b/mm/vmscan.c
> index 670fe9fae5ba..7ee5695326e3 100644
> --- a/mm/vmscan.c
> +++ b/mm/vmscan.c
[ ... ]
> @@ -4843,9 +4844,7 @@ static bool should_abort_scan(struct lruvec *lruvec, struct scan_control *sc)
> if (!current_is_kswapd() || sc->order)
> return false;
>
> - mark = sysctl_numa_balancing_mode & NUMA_BALANCING_MEMORY_TIERING ?
> - WMARK_PROMO : WMARK_HIGH;
> -
> + mark = task_numab_mode_tiering() ? WMARK_PROMO : WMARK_HIGH;
[ ... ]
> @@ -6774,7 +6773,7 @@ static bool pgdat_balanced(pg_data_t *pgdat, int order, int highest_zoneidx)
> enum zone_stat_item item;
> unsigned long free_pages;
>
> - if (sysctl_numa_balancing_mode & NUMA_BALANCING_MEMORY_TIERING)
> + if (task_numab_mode_tiering())
> mark = promo_wmark_pages(zone);
The original sysctl_numa_balancing_mode is defined to 0 when CONFIG_NUMA_BALANCING
is disabled (see include/linux/sched/sysctl.h), allowing this code to compile and
work correctly in both configurations. The new task_numab_mode_tiering() helper
lacks a stub in the #else block, so should_abort_scan() and pgdat_balanced() in
mm/vmscan.c will fail to compile when CONFIG_NUMA_BALANCING is not enabled.
Would it make sense to add stubs for task_numab_mode_normal() and
task_numab_mode_tiering() to the #else block, returning false to match the
original behavior when sysctl_numa_balancing_mode is defined as 0?
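Something along these lines in the #else branch would preserve the old
behavior (sketch):

static inline bool task_numab_mode_normal(void)
{
	return false;
}

static inline bool task_numab_mode_tiering(void)
{
	return false;
}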
---
AI reviewed your patch. Please fix the bug or email reply why it's not a bug.
See: https://github.com/kernel-patches/vmtest/blob/master/ci/claude/README.md
CI run summary: https://github.com/kernel-patches/bpf/actions/runs/20956455529
* Re: [RFC PATCH bpf-next 2/3] mm: add support for bpf based numa balancing
2026-01-13 12:29 ` bot+bpf-ci
@ 2026-01-13 12:46 ` Yafang Shao
0 siblings, 0 replies; 8+ messages in thread
From: Yafang Shao @ 2026-01-13 12:46 UTC (permalink / raw)
To: bot+bpf-ci
Cc: roman.gushchin, inwardvessel, shakeel.butt, akpm, ast, daniel,
andrii, mkoutny, yu.c.chen, zhao1.liu, bpf, linux-mm, martin.lau,
eddyz87, yonghong.song, clm, ihor.solodrai
On Tue, Jan 13, 2026 at 8:29 PM <bot+bpf-ci@kernel.org> wrote:
>
> > diff --git a/mm/Makefile b/mm/Makefile
> > index bf46fe31dc14..c2b887491f09 100644
>
> [ ... ]
>
> > @@ -107,8 +107,13 @@ ifdef CONFIG_SWAP
> > obj-$(CONFIG_MEMCG) += swap_cgroup.o
> > endif
> > ifdef CONFIG_BPF_SYSCALL
> > +ifdef CONFIG_NUMA_BALANCING
> > obj-$(CONFIG_MEMCG) += bpf_memcontrol.o
> > endif
> > +endif
> > +ifdef CONFIG_BPF_SYSCALL
> > +obj-$(CONFIG_MEMCG) += bpf_numa_balancing.o
> > +endif
>
> Should bpf_numa_balancing.o also require CONFIG_SHRINKER_DEBUG? The code
> calls mem_cgroup_get_from_ino() which is only defined when
> CONFIG_SHRINKER_DEBUG is enabled. Without it, compilation will fail due to
> an undefined reference.
Yes, this functionality depends on CONFIG_SHRINKER_DEBUG.
If this patchset is accepted, it will be merged after Roman's series
[0] since it has additional dependencies on that work. This explains
why CONFIG_SHRINKER_DEBUG is not added here.
[0]. https://lore.kernel.org/bpf/20251027231727.472628-5-roman.gushchin@linux.dev/
>
> Also, should it require CONFIG_NUMA_BALANCING? Without that config option,
> task_numab_enabled() returns false unconditionally, so the BPF hook would
> never be invoked even if a program is attached.
will fix it.
>
> > diff --git a/mm/bpf_numa_balancing.c b/mm/bpf_numa_balancing.c
> > new file mode 100644
> > index 000000000000..aac4eec7c6ba
> > --- /dev/null
> > +++ b/mm/bpf_numa_balancing.c
>
> [ ... ]
>
> > +static int bpf_numab_reg(void *kdata, struct bpf_link *link)
> > +{
> > + struct bpf_numab_ops *ops = kdata;
> > + struct mem_cgroup *memcg;
> > + int err = 0;
> > +
> > + /* Only the link mode is supported. */
> > + if (!link)
> > + return -EOPNOTSUPP;
> > +
> > + /* Depends on CONFIG_SHRINKER_DEBUG */
> > + memcg = mem_cgroup_get_from_ino(ops->cgroup_id);
> > + if (!memcg || IS_ERR(memcg))
> > + return -ENOENT;
> > +
> > + spin_lock(&numab_ops_lock);
> > + /* Each memory cgroup can have at most one attached BPF program to ensure
> > + * exclusive control and avoid interference between different BPF policies.
> > + */
> > + if (rcu_access_pointer(memcg->bpf_numab)) {
> > + err = -EBUSY;
> > + goto out;
> > + }
> > + rcu_assign_pointer(memcg->bpf_numab, ops);
> > + spin_unlock(&numab_ops_lock);
> > + static_branch_inc(&bpf_numab_enabled_key);
> > +
> > +out:
> > + mem_cgroup_put(memcg);
> > + return err;
> > +}
>
> When rcu_access_pointer(memcg->bpf_numab) is non-NULL, the code sets err
> to -EBUSY and jumps to out:, but this bypasses the spin_unlock(). Is the
> spinlock left held when returning -EBUSY?
Nice catch!
>
> > +static void bpf_numab_unreg(void *kdata, struct bpf_link *link)
> > +{
> > + struct bpf_numab_ops *ops = kdata;
> > + struct mem_cgroup *memcg;
> > +
> > + memcg = mem_cgroup_get_from_ino(ops->cgroup_id);
> > + if (!memcg)
> > + return;
>
> mem_cgroup_get_from_ino() returns ERR_PTR() on failure, not NULL. If the
> cgroup was deleted between reg and unreg, this function would get an
> ERR_PTR value, the if (!memcg) check would pass, and then
> rcu_access_pointer(memcg->bpf_numab) would dereference an invalid pointer.
> Should this check be if (!memcg || IS_ERR(memcg)) like in bpf_numab_reg()?
>
> Also, mem_cgroup_get_from_ino() takes a reference on the memcg via
> cgroup_get_e_css(). Where is the corresponding mem_cgroup_put() call?
> The function appears to leak the memcg reference on all paths.
Nice catch.
will fix it.
>
> > + spin_lock(&numab_ops_lock);
> > + if (!rcu_access_pointer(memcg->bpf_numab)) {
> > + spin_unlock(&numab_ops_lock);
> > + return;
> > + }
> > + rcu_replace_pointer(memcg->bpf_numab, NULL, lockdep_is_held(&numab_ops_lock));
> > + spin_unlock(&numab_ops_lock);
> > + static_branch_dec(&bpf_numab_enabled_key);
> > + synchronize_rcu();
> > +}
> > +
> > +static int bpf_numab_update(void *kdata, void *old_kdata, struct bpf_link *link)
> > +{
> > + struct bpf_numab_ops *ops = kdata;
> > + struct mem_cgroup *memcg;
> > +
> > + memcg = mem_cgroup_get_from_ino(ops->cgroup_id);
> > + if (!memcg)
> > + return -EINVAL;
>
> Same two issues as bpf_numab_unreg(): the check should handle ERR_PTR()
> values, and there appears to be no mem_cgroup_put() to release the
> reference taken by mem_cgroup_get_from_ino().
Thanks for your review.
You are awesome!
--
Regards
Yafang
* Re: [RFC PATCH bpf-next 1/3] sched: add helpers for numa balancing
2026-01-13 12:42 ` bot+bpf-ci
@ 2026-01-13 12:48 ` Yafang Shao
0 siblings, 0 replies; 8+ messages in thread
From: Yafang Shao @ 2026-01-13 12:48 UTC (permalink / raw)
To: bot+bpf-ci
Cc: roman.gushchin, inwardvessel, shakeel.butt, akpm, ast, daniel,
andrii, mkoutny, yu.c.chen, zhao1.liu, bpf, linux-mm, martin.lau,
eddyz87, yonghong.song, clm, ihor.solodrai
On Tue, Jan 13, 2026 at 8:42 PM <bot+bpf-ci@kernel.org> wrote:
>
> > diff --git a/include/linux/sched/numa_balancing.h b/include/linux/sched/numa_balancing.h
> > index 52b22c5c396d..792b6665f476 100644
> > --- a/include/linux/sched/numa_balancing.h
> > +++ b/include/linux/sched/numa_balancing.h
>
> [ ... ]
>
> > @@ -32,6 +33,28 @@ extern void set_numabalancing_state(bool enabled);
> > extern void task_numa_free(struct task_struct *p, bool final);
> > bool should_numa_migrate_memory(struct task_struct *p, struct folio *folio,
> > int src_nid, int dst_cpu);
> > +
> > +extern struct static_key_false sched_numa_balancing;
> > +static inline bool task_numab_enabled(struct task_struct *p)
> > +{
> > + if (static_branch_unlikely(&sched_numa_balancing))
> > + return true;
> > + return false;
> > +}
> > +
> > +static inline bool task_numab_mode_normal(void)
> > +{
> > + if (sysctl_numa_balancing_mode & NUMA_BALANCING_NORMAL)
> > + return true;
> > + return false;
> > +}
> > +
> > +static inline bool task_numab_mode_tiering(void)
> > +{
> > + if (sysctl_numa_balancing_mode & NUMA_BALANCING_MEMORY_TIERING)
> > + return true;
> > + return false;
> > +}
> > #else
> > static inline void task_numa_fault(int last_node, int node, int pages,
> > int flags)
>
> [ ... ]
>
> > @@ -52,6 +75,10 @@ static inline bool should_numa_migrate_memory(struct task_struct *p,
> > {
> > return true;
> > }
> > +static inline bool task_numab_enabled(struct task_struct *p)
> > +{
> > + return false;
> > +}
> > #endif
>
> The #else block adds a stub for task_numab_enabled(), but task_numab_mode_normal()
> and task_numab_mode_tiering() are missing from this section. These helpers are
> defined when CONFIG_NUMA_BALANCING is enabled, but have no stubs when it is
> disabled.
>
> > diff --git a/mm/vmscan.c b/mm/vmscan.c
> > index 670fe9fae5ba..7ee5695326e3 100644
> > --- a/mm/vmscan.c
> > +++ b/mm/vmscan.c
>
> [ ... ]
>
> > @@ -4843,9 +4844,7 @@ static bool should_abort_scan(struct lruvec *lruvec, struct scan_control *sc)
> > if (!current_is_kswapd() || sc->order)
> > return false;
> >
> > - mark = sysctl_numa_balancing_mode & NUMA_BALANCING_MEMORY_TIERING ?
> > - WMARK_PROMO : WMARK_HIGH;
> > -
> > + mark = task_numab_mode_tiering() ? WMARK_PROMO : WMARK_HIGH;
>
> [ ... ]
>
> > @@ -6774,7 +6773,7 @@ static bool pgdat_balanced(pg_data_t *pgdat, int order, int highest_zoneidx)
> > enum zone_stat_item item;
> > unsigned long free_pages;
> >
> > - if (sysctl_numa_balancing_mode & NUMA_BALANCING_MEMORY_TIERING)
> > + if (task_numab_mode_tiering())
> > mark = promo_wmark_pages(zone);
>
> The original sysctl_numa_balancing_mode is defined to 0 when CONFIG_NUMA_BALANCING
> is disabled (see include/linux/sched/sysctl.h), allowing this code to compile and
> work correctly in both configurations. The new task_numab_mode_tiering() helper
> lacks a stub in the #else block, so should_abort_scan() and pgdat_balanced() in
> mm/vmscan.c will fail to compile when CONFIG_NUMA_BALANCING is not enabled.
>
> Would it make sense to add stubs for task_numab_mode_normal() and
> task_numab_mode_tiering() to the #else block, returning false to match the
> original behavior when sysctl_numa_balancing_mode is defined as 0?
Right.
I missed it. Thanks for the review.
--
Regards
Yafang