[RFC PATCH 0/3] sched/numa: Introduce per cgroup numa balance control

linux-mm.kvack.org archive mirror
 help / color / mirror / Atom feed

* [RFC PATCH 0/3] sched/numa: Introduce per cgroup numa balance control
@ 2025-02-25 13:59 Chen Yu
  2025-02-25 14:00 ` [RFC PATCH 1/3] sched/numa: Introduce numa balance task migration and swap in schedstats Chen Yu
                   ` (3 more replies)
  0 siblings, 4 replies; 8+ messages in thread
From: Chen Yu @ 2025-02-25 13:59 UTC (permalink / raw)
  To: Ingo Molnar, Peter Zijlstra, Juri Lelli, Vincent Guittot, Andrew Morton
  Cc: Rik van Riel, Mel Gorman, Johannes Weiner, Michal Hocko,
	Roman Gushchin, Shakeel Butt, Muchun Song, Liam R. Howlett,
	Lorenzo Stoakes, Huang, Ying, Tim Chen, Aubrey Li, Michael Wang,
	Kaiyang Zhao, David Rientjes, Raghavendra K T, cgroups, linux-mm,
	linux-kernel, Chen Yu

Introduce a per-cgroup interface to enable NUMA balancing
for specific cgroups. The system administrator needs to set
the NUMA balancing mode to NUMA_BALANCING_CGROUP=4 to enable
this feature. When in the NUMA_BALANCING_CGROUP mode, all
cgroups' NUMA balancing is disabled by default. After the
administrator enables this feature for a specific cgroup,
NUMA balancing for that cgroup is enabled.

This per-cgroup NUMA balancing control was once proposed in
2019 by Yun Wang[1]. Then, in 2024, Kaiyang Zhao mentioned
that he was working with Meta on per-cgroup NUMA control[2]
during a discussion with David Rientjes.

I could not find further discussion regarding per-cgroup NUMA
balancing from that point on. This set of RFC patches is a
rough and compile-passed version, and may have unhandled cases
(for example, THP). It has not been thoroughly tested and is
intended to initiate or resume the discussion on the topic of
per-cgroup NUMA load balancing.

The first patch is a NUMA load balancing statistics enhancement.
The second patch introduces per-cgroup NUMA balancing. The third
one enhances NUMA load balancing for the MPOL_INTERLEAVE policy.

Any feedback would be appreciated.

[1] https://lore.kernel.org/linux-fsdevel/60b59306-5e36-e587-9145-e90657daec41@linux.alibaba.com/
[2] https://lore.kernel.org/linux-mm/ZrukILyQhMAKWwTe@localhost.localhost/T/

Chen Yu (3):
  sched/numa: Introduce numa balance task migration and swap in
    schedstats
  sched/numa: Introduce per cgroup numa balance control
  sched/numa: Allow intervale memory allocation for numa balance

 include/linux/numa.h           |  1 +
 include/linux/sched.h          |  4 ++++
 include/linux/sched/sysctl.h   |  1 +
 include/linux/vm_event_item.h  |  2 ++
 include/uapi/linux/mempolicy.h |  1 +
 kernel/sched/core.c            | 42 ++++++++++++++++++++++++++++++++--
 kernel/sched/debug.c           |  4 ++++
 kernel/sched/fair.c            | 18 +++++++++++++++
 kernel/sched/sched.h           |  3 +++
 mm/memcontrol.c                |  2 ++
 mm/memory.c                    |  2 +-
 mm/mempolicy.c                 |  7 ++++++
 mm/mprotect.c                  |  5 ++--
 mm/vmstat.c                    |  2 ++
 14 files changed, 89 insertions(+), 5 deletions(-)

-- 
2.25.1

^ permalink raw reply	[flat|nested] 8+ messages in thread

* [RFC PATCH 1/3] sched/numa: Introduce numa balance task migration and swap in schedstats
  2025-02-25 13:59 [RFC PATCH 0/3] sched/numa: Introduce per cgroup numa balance control Chen Yu
@ 2025-02-25 14:00 ` Chen Yu
  2025-02-25 14:00 ` [RFC PATCH 2/3] sched/numa: Introduce per cgroup numa balance control Chen Yu
                   ` (2 subsequent siblings)
  3 siblings, 0 replies; 8+ messages in thread
From: Chen Yu @ 2025-02-25 14:00 UTC (permalink / raw)
  To: Ingo Molnar, Peter Zijlstra, Juri Lelli, Vincent Guittot, Andrew Morton
  Cc: Rik van Riel, Mel Gorman, Johannes Weiner, Michal Hocko,
	Roman Gushchin, Shakeel Butt, Muchun Song, Liam R. Howlett,
	Lorenzo Stoakes, Huang, Ying, Tim Chen, Aubrey Li, Michael Wang,
	Kaiyang Zhao, David Rientjes, Raghavendra K T, cgroups, linux-mm,
	linux-kernel, Chen Yu

There is a requirement to track task activities during NUMA
balancing. NUMA balancing has two mechanisms for task migration:
one is to migrate the task to an idle CPU in its preferred node,
and the other is to swap tasks on different nodes if they are
on each other's preferred node. The kernel already has NUMA page
migration statistics. Add the task migration and swap count
described above in the per-task/cgroup scope. The data will be
displayed at

/sys/fs/cgroup/mytest/memory.stat and
/proc/{PID}/sched.

Signed-off-by: Chen Yu <yu.c.chen@intel.com>
---
 include/linux/sched.h         |  4 ++++
 include/linux/vm_event_item.h |  2 ++
 kernel/sched/core.c           | 10 ++++++++--
 kernel/sched/debug.c          |  4 ++++
 mm/memcontrol.c               |  2 ++
 mm/vmstat.c                   |  2 ++
 6 files changed, 22 insertions(+), 2 deletions(-)

diff --git a/include/linux/sched.h b/include/linux/sched.h
index 9632e3318e0d..01faa608ed7c 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -527,6 +527,10 @@ struct sched_statistics {
 	u64				nr_failed_migrations_running;
 	u64				nr_failed_migrations_hot;
 	u64				nr_forced_migrations;
+#ifdef CONFIG_NUMA_BALANCING
+	u64				nr_numa_migrations;
+	u64				nr_numa_swap;
+#endif
 
 	u64				nr_wakeups;
 	u64				nr_wakeups_sync;
diff --git a/include/linux/vm_event_item.h b/include/linux/vm_event_item.h
index f70d0958095c..aef817474781 100644
--- a/include/linux/vm_event_item.h
+++ b/include/linux/vm_event_item.h
@@ -64,6 +64,8 @@ enum vm_event_item { PGPGIN, PGPGOUT, PSWPIN, PSWPOUT,
 		NUMA_HINT_FAULTS,
 		NUMA_HINT_FAULTS_LOCAL,
 		NUMA_PAGE_MIGRATE,
+		NUMA_TASK_MIGRATE,
+		NUMA_TASK_SWAP,
 #endif
 #ifdef CONFIG_MIGRATION
 		PGMIGRATE_SUCCESS, PGMIGRATE_FAIL,
diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index 165c90ba64ea..44efc725054a 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -3348,6 +3348,11 @@ void set_task_cpu(struct task_struct *p, unsigned int new_cpu)
 #ifdef CONFIG_NUMA_BALANCING
 static void __migrate_swap_task(struct task_struct *p, int cpu)
 {
+	__schedstat_inc(p->stats.nr_numa_swap);
+
+	if (p->mm)
+		count_memcg_events_mm(p->mm, NUMA_TASK_SWAP, 1);
+
 	if (task_on_rq_queued(p)) {
 		struct rq *src_rq, *dst_rq;
 		struct rq_flags srf, drf;
@@ -7901,8 +7906,9 @@ int migrate_task_to(struct task_struct *p, int target_cpu)
 	if (!cpumask_test_cpu(target_cpu, p->cpus_ptr))
 		return -EINVAL;
 
-	/* TODO: This is not properly updating schedstats */
-
+	__schedstat_inc(p->stats.nr_numa_migrations);
+	if (p->mm)
+		count_memcg_events_mm(p->mm, NUMA_TASK_MIGRATE, 1);
 	trace_sched_move_numa(p, curr_cpu, target_cpu);
 	return stop_one_cpu(curr_cpu, migration_cpu_stop, &arg);
 }
diff --git a/kernel/sched/debug.c b/kernel/sched/debug.c
index ef047add7f9e..ed801cc00bf1 100644
--- a/kernel/sched/debug.c
+++ b/kernel/sched/debug.c
@@ -1204,6 +1204,10 @@ void proc_sched_show_task(struct task_struct *p, struct pid_namespace *ns,
 		P_SCHEDSTAT(nr_failed_migrations_running);
 		P_SCHEDSTAT(nr_failed_migrations_hot);
 		P_SCHEDSTAT(nr_forced_migrations);
+#ifdef CONFIG_NUMA_BALANCING
+		P_SCHEDSTAT(nr_numa_migrations);
+		P_SCHEDSTAT(nr_numa_swap);
+#endif
 		P_SCHEDSTAT(nr_wakeups);
 		P_SCHEDSTAT(nr_wakeups_sync);
 		P_SCHEDSTAT(nr_wakeups_migrate);
diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index 46f8b372d212..496b5edc3db6 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -460,6 +460,8 @@ static const unsigned int memcg_vm_event_stat[] = {
 	NUMA_PAGE_MIGRATE,
 	NUMA_PTE_UPDATES,
 	NUMA_HINT_FAULTS,
+	NUMA_TASK_MIGRATE,
+	NUMA_TASK_SWAP,
 #endif
 };
 
diff --git a/mm/vmstat.c b/mm/vmstat.c
index 16bfe1c694dd..d6651778e4bf 100644
--- a/mm/vmstat.c
+++ b/mm/vmstat.c
@@ -1339,6 +1339,8 @@ const char * const vmstat_text[] = {
 	"numa_hint_faults",
 	"numa_hint_faults_local",
 	"numa_pages_migrated",
+	"numa_task_migrated",
+	"numa_task_swaped",
 #endif
 #ifdef CONFIG_MIGRATION
 	"pgmigrate_success",
-- 
2.25.1



^ permalink raw reply	[flat|nested] 8+ messages in thread

* [RFC PATCH 2/3] sched/numa: Introduce per cgroup numa balance control
  2025-02-25 13:59 [RFC PATCH 0/3] sched/numa: Introduce per cgroup numa balance control Chen Yu
  2025-02-25 14:00 ` [RFC PATCH 1/3] sched/numa: Introduce numa balance task migration and swap in schedstats Chen Yu
@ 2025-02-25 14:00 ` Chen Yu
  2025-03-07 22:54   ` Tim Chen
  2025-02-25 14:00 ` [RFC PATCH 3/3] sched/numa: Allow intervale memory allocation for numa balance Chen Yu
  2025-03-05 14:38 ` [RFC PATCH 0/3] sched/numa: Introduce per cgroup numa balance control Kaiyang Zhao
  3 siblings, 1 reply; 8+ messages in thread
From: Chen Yu @ 2025-02-25 14:00 UTC (permalink / raw)
  To: Ingo Molnar, Peter Zijlstra, Juri Lelli, Vincent Guittot, Andrew Morton
  Cc: Rik van Riel, Mel Gorman, Johannes Weiner, Michal Hocko,
	Roman Gushchin, Shakeel Butt, Muchun Song, Liam R. Howlett,
	Lorenzo Stoakes, Huang, Ying, Tim Chen, Aubrey Li, Michael Wang,
	Kaiyang Zhao, David Rientjes, Raghavendra K T, cgroups, linux-mm,
	linux-kernel, Chen Yu

[Problem Statement]
Currently, NUMA balancing is configured system-wide. However,
in some production environments, different containers may have
varying requirements for NUMA balancing. Some containers are
CPU-intensive, while others are memory-intensive. Some do not
benefit from NUMA balancing due to the overhead associated with
VMA scanning, while the other prefers NUMA balancing as it helps
improve memory locality. In this case, system-wide NUMA balancing
is usually disabled to produce stable results.

[Proposal]
Introduce a per-cgroup interface to enable NUMA balancing for specific
cgroups. The system administrator must set the NUMA balancing mode
to NUMA_BALANCING_CGROUP=4 to enable this feature. When in global
NUMA_BALANCING_CGROUP mode, all cgroups' NUMA balancing is disabled by
default. After the administrator enables this feature for a specific
cgroup, NUMA balancing for that cgroup is enabled.

A simple example to show how to use per-cgroup Numa balancing:

Step1
//switch to global per cgroup Numa balancing,
//All cgroup's Numa balance is disabled by default.
echo 4 > /proc/sys/kernel/numa_balancing

Step2
//created a cgroup named mytest, enable its Numa balancing
echo 1 > /sys/fs/cgroup/mytest/cpu.numa_load_balance

[Benchmark]
Tested on two systems. Both systems have 4 nodes. Created a
cgroup mytest which is bind to node0 and node1(cpu affinity
as well as memory allocation policy). Launched autonumabench
NUMA01_THREADLOCAL via mmtests.

echo 0 > /sys/fs/cgroup/mytest/cpu.numa_load_balance
cgexec -g cpuset:mytest  ./run-mmtests.sh --no-monitor \
	--config config-numa baseline
echo 1 > /sys/fs/cgroup/mytest/cpu.numa_load_balance \
cgexec -g cpuset:mytest  ./run-mmtests.sh --no-monitor
	--config config-numa nb_cgroup

system1: 4 nodes, 24 Cores(48 CPUs)/node.
baseline took a total of 191.32 seconds to finish, while cgroup
numa balancing took a total of 104.46 seconds. There is around
45% improvement.

                                                 baselin               nb_cgrou
                                                baseline              nb_cgroup
Min       syst-NUMA01_THREADLOCAL       69.65 (   0.00%)      106.73 ( -53.24%)
Min       elsp-NUMA01_THREADLOCAL      191.32 (   0.00%)      104.46 (  45.40%)
Amean     syst-NUMA01_THREADLOCAL       69.65 (   0.00%)      106.73 * -53.24%*
Amean     elsp-NUMA01_THREADLOCAL      191.32 (   0.00%)      104.46 *  45.40%*  <---
Stddev    syst-NUMA01_THREADLOCAL        0.00 (   0.00%)        0.00 (   0.00%)
Stddev    elsp-NUMA01_THREADLOCAL        0.00 (   0.00%)        0.00 (   0.00%)
CoeffVar  syst-NUMA01_THREADLOCAL        0.00 (   0.00%)        0.00 (   0.00%)
CoeffVar  elsp-NUMA01_THREADLOCAL        0.00 (   0.00%)        0.00 (   0.00%)
Max       syst-NUMA01_THREADLOCAL       69.65 (   0.00%)      106.73 ( -53.24%)
Max       elsp-NUMA01_THREADLOCAL      191.32 (   0.00%)      104.46 (  45.40%)
BAmean-50 syst-NUMA01_THREADLOCAL       69.65 (   0.00%)      106.73 ( -53.24%)
BAmean-50 elsp-NUMA01_THREADLOCAL      191.32 (   0.00%)      104.46 (  45.40%)
BAmean-95 syst-NUMA01_THREADLOCAL       69.65 (   0.00%)      106.73 ( -53.24%)
BAmean-95 elsp-NUMA01_THREADLOCAL      191.32 (   0.00%)      104.46 (  45.40%)
BAmean-99 syst-NUMA01_THREADLOCAL       69.65 (   0.00%)      106.73 ( -53.24%)
BAmean-99 elsp-NUMA01_THREADLOCAL      191.32 (   0.00%)      104.46 (  45.40%)

The run-to-run deviation downgrading occurs because sometimes the
per-cgroup NUMA balancing does not improve the score, although no
performance downgrading is observed.

delta of /sys/fs/cgroup/mytest/memory.stat during the test:
numa_pages_migrated: 979933
numa_pte_updates:    21007548  <-- introduced in previous patch
numa_hint_faults:    19663982  <-- introduced in previous patch

system1: 4 nodes, 40 Cores(80 CPUs)/node.
baseline took a total of 212.94 seconds to finish, while cgroup numa
balance took a total of 127.05 second, which is of 40.34% improvment.

                                                 baselin               nb_cgrou
                                                baseline              nb_cgroup
Min       syst-NUMA01_THREADLOCAL     8356.05 (   0.00%)     8921.84 (  -6.77%)
Min       elsp-NUMA01_THREADLOCAL      212.94 (   0.00%)      127.05 (  40.34%)
Amean     syst-NUMA01_THREADLOCAL     8356.05 (   0.00%)     8921.84 (  -6.77%)
Amean     elsp-NUMA01_THREADLOCAL      212.94 (   0.00%)      127.05 (  40.34%)  <---
Stddev    syst-NUMA01_THREADLOCAL        0.00 (   0.00%)        0.00 (   0.00%)
Stddev    elsp-NUMA01_THREADLOCAL        0.00 (   0.00%)        0.00 (   0.00%)
CoeffVar  syst-NUMA01_THREADLOCAL        0.00 (   0.00%)        0.00 (   0.00%)
CoeffVar  elsp-NUMA01_THREADLOCAL        0.00 (   0.00%)        0.00 (   0.00%)
Max       syst-NUMA01_THREADLOCAL     8356.05 (   0.00%)     8921.84 (  -6.77%)
Max       elsp-NUMA01_THREADLOCAL      212.94 (   0.00%)      127.05 (  40.34%)
BAmean-50 syst-NUMA01_THREADLOCAL     8356.05 (   0.00%)     8921.84 (  -6.77%)
BAmean-50 elsp-NUMA01_THREADLOCAL      212.94 (   0.00%)      127.05 (  40.34%)
BAmean-95 syst-NUMA01_THREADLOCAL     8356.05 (   0.00%)     8921.84 (  -6.77%)
BAmean-95 elsp-NUMA01_THREADLOCAL      212.94 (   0.00%)      127.05 (  40.34%)
BAmean-99 syst-NUMA01_THREADLOCAL     8356.05 (   0.00%)     8921.84 (  -6.77%)
BAmean-99 elsp-NUMA01_THREADLOCAL      212.94 (   0.00%)      127.05 (  40.34%)

The Numa statistics delta during the test:
numa_pages_migrated:  785848
numa_pte_updates:     2359714
numa_hint_faults:     2349857

[Shortage]
It has been observed that even with per-cgroup NUMA balancing enabled,
there is still remote node access, and the benchmark score does not
increase compared to the baseline. According to the NUMA statistics,
not much NUMA page migration is detected. Further testing shows that
global NUMA balancing has the same issue—sometimes NUMA balancing
does not help. This could be a generic issue in the current kernel
code, possibly due to either the NUMA page migration or task migration
strategy, and it needs to be further investigated.

Suggested-by: Tim Chen <tim.c.chen@intel.com>
Signed-off-by: Chen Yu <yu.c.chen@intel.com>
---
 include/linux/sched/sysctl.h |  1 +
 kernel/sched/core.c          | 32 ++++++++++++++++++++++++++++++++
 kernel/sched/fair.c          | 18 ++++++++++++++++++
 kernel/sched/sched.h         |  3 +++
 mm/mprotect.c                |  5 +++--
 5 files changed, 57 insertions(+), 2 deletions(-)

diff --git a/include/linux/sched/sysctl.h b/include/linux/sched/sysctl.h
index 5a64582b086b..1e4d5a9ddb26 100644
--- a/include/linux/sched/sysctl.h
+++ b/include/linux/sched/sysctl.h
@@ -22,6 +22,7 @@ enum sched_tunable_scaling {
 #define NUMA_BALANCING_DISABLED		0x0
 #define NUMA_BALANCING_NORMAL		0x1
 #define NUMA_BALANCING_MEMORY_TIERING	0x2
+#define NUMA_BALANCING_CGROUP		0x4
 
 #ifdef CONFIG_NUMA_BALANCING
 extern int sysctl_numa_balancing_mode;
diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index 44efc725054a..f4f048b3da68 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -10023,6 +10023,31 @@ static ssize_t cpu_max_write(struct kernfs_open_file *of,
 }
 #endif
 
+#ifdef CONFIG_NUMA_BALANCING
+static DEFINE_MUTEX(numa_balance_mutex);
+static int numa_balance_write_u64(struct cgroup_subsys_state *css,
+				  struct cftype *cftype, u64 enable)
+{
+	struct task_group *tg;
+	int ret;
+
+	guard(mutex)(&numa_balance_mutex);
+	tg = css_tg(css);
+	if (tg->nlb_enabled == enable)
+		return 0;
+
+	tg->nlb_enabled = enable;
+
+	return ret;
+}
+
+static u64 numa_balance_read_u64(struct cgroup_subsys_state *css,
+				 struct cftype *cft)
+{
+	return css_tg(css)->nlb_enabled;
+}
+#endif /* CONFIG_NUMA_BALANCING */
+
 static struct cftype cpu_files[] = {
 #ifdef CONFIG_GROUP_SCHED_WEIGHT
 	{
@@ -10071,6 +10096,13 @@ static struct cftype cpu_files[] = {
 		.seq_show = cpu_uclamp_max_show,
 		.write = cpu_uclamp_max_write,
 	},
+#endif
+#ifdef CONFIG_NUMA_BALANCING
+	{
+		.name = "numa_load_balance",
+		.read_u64 = numa_balance_read_u64,
+		.write_u64 = numa_balance_write_u64,
+	},
 #endif
 	{ }	/* terminate */
 };
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 1c0ef435a7aa..526cb33b007c 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -3146,6 +3146,18 @@ void task_numa_free(struct task_struct *p, bool final)
 	}
 }
 
+/* return true if the task group has enabled the numa balance */
+static bool tg_numa_balance_enabled(struct task_struct *p)
+{
+	struct task_group *tg = task_group(p);
+
+	if (tg && (sysctl_numa_balancing_mode & NUMA_BALANCING_CGROUP) &&
+	    !tg->nlb_enabled)
+		return false;
+
+	return true;
+}
+
 /*
  * Got a PROT_NONE fault for a page on @node.
  */
@@ -3174,6 +3186,9 @@ void task_numa_fault(int last_cpupid, int mem_node, int pages, int flags)
 	     !cpupid_valid(last_cpupid)))
 		return;
 
+	if (!tg_numa_balance_enabled(p))
+		return;
+
 	/* Allocate buffer to track faults on a per-node basis */
 	if (unlikely(!p->numa_faults)) {
 		int size = sizeof(*p->numa_faults) *
@@ -3596,6 +3611,9 @@ static void task_tick_numa(struct rq *rq, struct task_struct *curr)
 	if (!curr->mm || (curr->flags & (PF_EXITING | PF_KTHREAD)) || work->next != work)
 		return;
 
+	if (!tg_numa_balance_enabled(curr))
+		return;
+
 	/*
 	 * Using runtime rather than walltime has the dual advantage that
 	 * we (mostly) drive the selection from busy threads and that the
diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
index 38e0e323dda2..9f478fb2c03a 100644
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -491,6 +491,9 @@ struct task_group {
 	/* Effective clamp values used for a task group */
 	struct uclamp_se	uclamp[UCLAMP_CNT];
 #endif
+#ifdef CONFIG_NUMA_BALANCING
+	u64			nlb_enabled;
+#endif
 
 };
 
diff --git a/mm/mprotect.c b/mm/mprotect.c
index 516b1d847e2c..ddaaf20ef94c 100644
--- a/mm/mprotect.c
+++ b/mm/mprotect.c
@@ -155,10 +155,11 @@ static long change_pte_range(struct mmu_gather *tlb,
 				toptier = node_is_toptier(nid);
 
 				/*
-				 * Skip scanning top tier node if normal numa
+				 * Skip scanning top tier node if normal/cgroup numa
 				 * balancing is disabled
 				 */
-				if (!(sysctl_numa_balancing_mode & NUMA_BALANCING_NORMAL) &&
+				if (!(sysctl_numa_balancing_mode &
+				    (NUMA_BALANCING_CGROUP | NUMA_BALANCING_NORMAL)) &&
 				    toptier)
 					continue;
 				if (folio_use_access_time(folio))
-- 
2.25.1



^ permalink raw reply	[flat|nested] 8+ messages in thread

* Re: [RFC PATCH 2/3] sched/numa: Introduce per cgroup numa balance control
  2025-02-25 14:00 ` [RFC PATCH 2/3] sched/numa: Introduce per cgroup numa balance control Chen Yu
@ 2025-03-07 22:54   ` Tim Chen
  2025-03-10 15:36     ` Chen Yu
  0 siblings, 1 reply; 8+ messages in thread
From: Tim Chen @ 2025-03-07 22:54 UTC (permalink / raw)
  To: Chen Yu, Ingo Molnar, Peter Zijlstra, Juri Lelli,
	Vincent Guittot, Andrew Morton
  Cc: Rik van Riel, Mel Gorman, Johannes Weiner, Michal Hocko,
	Roman Gushchin, Shakeel Butt, Muchun Song, Liam R. Howlett,
	Lorenzo Stoakes, Huang, Ying, Tim Chen, Aubrey Li, Michael Wang,
	Kaiyang Zhao, David Rientjes, Raghavendra K T, cgroups, linux-mm,
	linux-kernel

On Tue, 2025-02-25 at 22:00 +0800, Chen Yu wrote:
> [Problem Statement]
> Currently, NUMA balancing is configured system-wide. However,
> 
> 
> A simple example to show how to use per-cgroup Numa balancing:
> 
> Step1
> //switch to global per cgroup Numa balancing,
> //All cgroup's Numa balance is disabled by default.
> echo 4 > /proc/sys/kernel/numa_balancing
> 

Can you add documentation of this additional feature
for numa_balancing in
admin-guide/sysctl/kernel.rst

Should you make NUMA_BALANCING_NORMAL and NUMA_BALANCING_CGROUP
mutually exclusive in? In other words
echo 5 > /proc/sys/kernel/numa_balancing should result in numa_balancing to be 1?

Otherwise tg_numa_balance_enabled() can return 0 with NUMA_BALANCING_CGROUP
bit turned on even though you have NUMA_BALANCING_NORMAL bit on.

Tim
> 
> Suggested-by: Tim Chen <tim.c.chen@intel.com>
> Signed-off-by: Chen Yu <yu.c.chen@intel.com>
> ---
>  include/linux/sched/sysctl.h |  1 +
>  kernel/sched/core.c          | 32 ++++++++++++++++++++++++++++++++
>  kernel/sched/fair.c          | 18 ++++++++++++++++++
>  kernel/sched/sched.h         |  3 +++
>  mm/mprotect.c                |  5 +++--
>  5 files changed, 57 insertions(+), 2 deletions(-)
> 
> diff --git a/include/linux/sched/sysctl.h b/include/linux/sched/sysctl.h
> index 5a64582b086b..1e4d5a9ddb26 100644
> --- a/include/linux/sched/sysctl.h
> +++ b/include/linux/sched/sysctl.h
> @@ -22,6 +22,7 @@ enum sched_tunable_scaling {
>  #define NUMA_BALANCING_DISABLED		0x0
>  #define NUMA_BALANCING_NORMAL		0x1
>  #define NUMA_BALANCING_MEMORY_TIERING	0x2
> +#define NUMA_BALANCING_CGROUP		0x4
>  
>  #ifdef CONFIG_NUMA_BALANCING
>  extern int sysctl_numa_balancing_mode;
> diff --git a/kernel/sched/core.c b/kernel/sched/core.c
> index 44efc725054a..f4f048b3da68 100644
> --- a/kernel/sched/core.c
> +++ b/kernel/sched/core.c
> @@ -10023,6 +10023,31 @@ static ssize_t cpu_max_write(struct kernfs_open_file *of,
>  }
>  #endif
>  
> +#ifdef CONFIG_NUMA_BALANCING
> +static DEFINE_MUTEX(numa_balance_mutex);
> +static int numa_balance_write_u64(struct cgroup_subsys_state *css,
> +				  struct cftype *cftype, u64 enable)
> +{
> +	struct task_group *tg;
> +	int ret;
> +
> +	guard(mutex)(&numa_balance_mutex);
> +	tg = css_tg(css);
> +	if (tg->nlb_enabled == enable)
> +		return 0;
> +
> +	tg->nlb_enabled = enable;
> +
> +	return ret;
> +}
> +
> +static u64 numa_balance_read_u64(struct cgroup_subsys_state *css,
> +				 struct cftype *cft)
> +{
> +	return css_tg(css)->nlb_enabled;
> +}
> +#endif /* CONFIG_NUMA_BALANCING */
> +
>  static struct cftype cpu_files[] = {
>  #ifdef CONFIG_GROUP_SCHED_WEIGHT
>  	{
> @@ -10071,6 +10096,13 @@ static struct cftype cpu_files[] = {
>  		.seq_show = cpu_uclamp_max_show,
>  		.write = cpu_uclamp_max_write,
>  	},
> +#endif
> +#ifdef CONFIG_NUMA_BALANCING
> +	{
> +		.name = "numa_load_balance",
> +		.read_u64 = numa_balance_read_u64,
> +		.write_u64 = numa_balance_write_u64,
> +	},
>  #endif
>  	{ }	/* terminate */
>  };
> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> index 1c0ef435a7aa..526cb33b007c 100644
> --- a/kernel/sched/fair.c
> +++ b/kernel/sched/fair.c
> @@ -3146,6 +3146,18 @@ void task_numa_free(struct task_struct *p, bool final)
>  	}
>  }
>  
> +/* return true if the task group has enabled the numa balance */
> +static bool tg_numa_balance_enabled(struct task_struct *p)
> +{
> +	struct task_group *tg = task_group(p);
> +
> +	if (tg && (sysctl_numa_balancing_mode & NUMA_BALANCING_CGROUP) &&
> +	    !tg->nlb_enabled)
> +		return false;
> +
> +	return true;
> +}
> +
>  /*
>   * Got a PROT_NONE fault for a page on @node.
>   */
> @@ -3174,6 +3186,9 @@ void task_numa_fault(int last_cpupid, int mem_node, int pages, int flags)
>  	     !cpupid_valid(last_cpupid)))
>  		return;
>  
> +	if (!tg_numa_balance_enabled(p))
> +		return;
> +
>  	/* Allocate buffer to track faults on a per-node basis */
>  	if (unlikely(!p->numa_faults)) {
>  		int size = sizeof(*p->numa_faults) *
> @@ -3596,6 +3611,9 @@ static void task_tick_numa(struct rq *rq, struct task_struct *curr)
>  	if (!curr->mm || (curr->flags & (PF_EXITING | PF_KTHREAD)) || work->next != work)
>  		return;
>  
> +	if (!tg_numa_balance_enabled(curr))
> +		return;
> +
>  	/*
>  	 * Using runtime rather than walltime has the dual advantage that
>  	 * we (mostly) drive the selection from busy threads and that the
> diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
> index 38e0e323dda2..9f478fb2c03a 100644
> --- a/kernel/sched/sched.h
> +++ b/kernel/sched/sched.h
> @@ -491,6 +491,9 @@ struct task_group {
>  	/* Effective clamp values used for a task group */
>  	struct uclamp_se	uclamp[UCLAMP_CNT];
>  #endif
> +#ifdef CONFIG_NUMA_BALANCING
> +	u64			nlb_enabled;
> +#endif
>  
>  };
>  
> diff --git a/mm/mprotect.c b/mm/mprotect.c
> index 516b1d847e2c..ddaaf20ef94c 100644
> --- a/mm/mprotect.c
> +++ b/mm/mprotect.c
> @@ -155,10 +155,11 @@ static long change_pte_range(struct mmu_gather *tlb,
>  				toptier = node_is_toptier(nid);
>  
>  				/*
> -				 * Skip scanning top tier node if normal numa
> +				 * Skip scanning top tier node if normal/cgroup numa
>  				 * balancing is disabled
>  				 */
> -				if (!(sysctl_numa_balancing_mode & NUMA_BALANCING_NORMAL) &&
> +				if (!(sysctl_numa_balancing_mode &
> +				    (NUMA_BALANCING_CGROUP | NUMA_BALANCING_NORMAL)) &&
>  				    toptier)
>  					continue;
>  				if (folio_use_access_time(folio))



^ permalink raw reply	[flat|nested] 8+ messages in thread

* Re: [RFC PATCH 2/3] sched/numa: Introduce per cgroup numa balance control
  2025-03-07 22:54   ` Tim Chen
@ 2025-03-10 15:36     ` Chen Yu
  0 siblings, 0 replies; 8+ messages in thread
From: Chen Yu @ 2025-03-10 15:36 UTC (permalink / raw)
  To: Tim Chen
  Cc: Chen Yu, Ingo Molnar, Peter Zijlstra, Juri Lelli,
	Vincent Guittot, Andrew Morton, Rik van Riel, Mel Gorman,
	Johannes Weiner, Michal Hocko, Roman Gushchin, Shakeel Butt,
	Muchun Song, Liam R. Howlett, Lorenzo Stoakes, Huang, Ying,
	Tim Chen, Aubrey Li, Michael Wang, Kaiyang Zhao, David Rientjes,
	Raghavendra K T, cgroups, linux-mm, linux-kernel

On 2025-03-07 at 14:54:10 -0800, Tim Chen wrote:
> On Tue, 2025-02-25 at 22:00 +0800, Chen Yu wrote:
> > [Problem Statement]
> > Currently, NUMA balancing is configured system-wide. However,
> > 
> > 
> > A simple example to show how to use per-cgroup Numa balancing:
> > 
> > Step1
> > //switch to global per cgroup Numa balancing,
> > //All cgroup's Numa balance is disabled by default.
> > echo 4 > /proc/sys/kernel/numa_balancing
> > 
> 
> Can you add documentation of this additional feature
> for numa_balancing in
> admin-guide/sysctl/kernel.rst
>

OK, will refine in next version.
 
> Should you make NUMA_BALANCING_NORMAL and NUMA_BALANCING_CGROUP
> mutually exclusive in? In other words
> echo 5 > /proc/sys/kernel/numa_balancing should result in numa_balancing to be 1?
> 
> Otherwise tg_numa_balance_enabled() can return 0 with NUMA_BALANCING_CGROUP
> bit turned on even though you have NUMA_BALANCING_NORMAL bit on.
>

I see, will fix tg_numa_balance_enabled() in next version, thanks!

Best,
Chenyu
 
> Tim
> > 
> > Suggested-by: Tim Chen <tim.c.chen@intel.com>
> > Signed-off-by: Chen Yu <yu.c.chen@intel.com>
> > ---
> >  include/linux/sched/sysctl.h |  1 +
> >  kernel/sched/core.c          | 32 ++++++++++++++++++++++++++++++++
> >  kernel/sched/fair.c          | 18 ++++++++++++++++++
> >  kernel/sched/sched.h         |  3 +++
> >  mm/mprotect.c                |  5 +++--
> >  5 files changed, 57 insertions(+), 2 deletions(-)
> > 
> > diff --git a/include/linux/sched/sysctl.h b/include/linux/sched/sysctl.h
> > index 5a64582b086b..1e4d5a9ddb26 100644
> > --- a/include/linux/sched/sysctl.h
> > +++ b/include/linux/sched/sysctl.h
> > @@ -22,6 +22,7 @@ enum sched_tunable_scaling {
> >  #define NUMA_BALANCING_DISABLED		0x0
> >  #define NUMA_BALANCING_NORMAL		0x1
> >  #define NUMA_BALANCING_MEMORY_TIERING	0x2
> > +#define NUMA_BALANCING_CGROUP		0x4
> >  
> >  #ifdef CONFIG_NUMA_BALANCING
> >  extern int sysctl_numa_balancing_mode;
> > diff --git a/kernel/sched/core.c b/kernel/sched/core.c
> > index 44efc725054a..f4f048b3da68 100644
> > --- a/kernel/sched/core.c
> > +++ b/kernel/sched/core.c
> > @@ -10023,6 +10023,31 @@ static ssize_t cpu_max_write(struct kernfs_open_file *of,
> >  }
> >  #endif
> >  
> > +#ifdef CONFIG_NUMA_BALANCING
> > +static DEFINE_MUTEX(numa_balance_mutex);
> > +static int numa_balance_write_u64(struct cgroup_subsys_state *css,
> > +				  struct cftype *cftype, u64 enable)
> > +{
> > +	struct task_group *tg;
> > +	int ret;
> > +
> > +	guard(mutex)(&numa_balance_mutex);
> > +	tg = css_tg(css);
> > +	if (tg->nlb_enabled == enable)
> > +		return 0;
> > +
> > +	tg->nlb_enabled = enable;
> > +
> > +	return ret;
> > +}
> > +
> > +static u64 numa_balance_read_u64(struct cgroup_subsys_state *css,
> > +				 struct cftype *cft)
> > +{
> > +	return css_tg(css)->nlb_enabled;
> > +}
> > +#endif /* CONFIG_NUMA_BALANCING */
> > +
> >  static struct cftype cpu_files[] = {
> >  #ifdef CONFIG_GROUP_SCHED_WEIGHT
> >  	{
> > @@ -10071,6 +10096,13 @@ static struct cftype cpu_files[] = {
> >  		.seq_show = cpu_uclamp_max_show,
> >  		.write = cpu_uclamp_max_write,
> >  	},
> > +#endif
> > +#ifdef CONFIG_NUMA_BALANCING
> > +	{
> > +		.name = "numa_load_balance",
> > +		.read_u64 = numa_balance_read_u64,
> > +		.write_u64 = numa_balance_write_u64,
> > +	},
> >  #endif
> >  	{ }	/* terminate */
> >  };
> > diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> > index 1c0ef435a7aa..526cb33b007c 100644
> > --- a/kernel/sched/fair.c
> > +++ b/kernel/sched/fair.c
> > @@ -3146,6 +3146,18 @@ void task_numa_free(struct task_struct *p, bool final)
> >  	}
> >  }
> >  
> > +/* return true if the task group has enabled the numa balance */
> > +static bool tg_numa_balance_enabled(struct task_struct *p)
> > +{
> > +	struct task_group *tg = task_group(p);
> > +
> > +	if (tg && (sysctl_numa_balancing_mode & NUMA_BALANCING_CGROUP) &&
> > +	    !tg->nlb_enabled)
> > +		return false;
> > +
> > +	return true;
> > +}
> > +
> >  /*
> >   * Got a PROT_NONE fault for a page on @node.
> >   */
> > @@ -3174,6 +3186,9 @@ void task_numa_fault(int last_cpupid, int mem_node, int pages, int flags)
> >  	     !cpupid_valid(last_cpupid)))
> >  		return;
> >  
> > +	if (!tg_numa_balance_enabled(p))
> > +		return;
> > +
> >  	/* Allocate buffer to track faults on a per-node basis */
> >  	if (unlikely(!p->numa_faults)) {
> >  		int size = sizeof(*p->numa_faults) *
> > @@ -3596,6 +3611,9 @@ static void task_tick_numa(struct rq *rq, struct task_struct *curr)
> >  	if (!curr->mm || (curr->flags & (PF_EXITING | PF_KTHREAD)) || work->next != work)
> >  		return;
> >  
> > +	if (!tg_numa_balance_enabled(curr))
> > +		return;
> > +
> >  	/*
> >  	 * Using runtime rather than walltime has the dual advantage that
> >  	 * we (mostly) drive the selection from busy threads and that the
> > diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
> > index 38e0e323dda2..9f478fb2c03a 100644
> > --- a/kernel/sched/sched.h
> > +++ b/kernel/sched/sched.h
> > @@ -491,6 +491,9 @@ struct task_group {
> >  	/* Effective clamp values used for a task group */
> >  	struct uclamp_se	uclamp[UCLAMP_CNT];
> >  #endif
> > +#ifdef CONFIG_NUMA_BALANCING
> > +	u64			nlb_enabled;
> > +#endif
> >  
> >  };
> >  
> > diff --git a/mm/mprotect.c b/mm/mprotect.c
> > index 516b1d847e2c..ddaaf20ef94c 100644
> > --- a/mm/mprotect.c
> > +++ b/mm/mprotect.c
> > @@ -155,10 +155,11 @@ static long change_pte_range(struct mmu_gather *tlb,
> >  				toptier = node_is_toptier(nid);
> >  
> >  				/*
> > -				 * Skip scanning top tier node if normal numa
> > +				 * Skip scanning top tier node if normal/cgroup numa
> >  				 * balancing is disabled
> >  				 */
> > -				if (!(sysctl_numa_balancing_mode & NUMA_BALANCING_NORMAL) &&
> > +				if (!(sysctl_numa_balancing_mode &
> > +				    (NUMA_BALANCING_CGROUP | NUMA_BALANCING_NORMAL)) &&
> >  				    toptier)
> >  					continue;
> >  				if (folio_use_access_time(folio))
> 



^ permalink raw reply	[flat|nested] 8+ messages in thread

* [RFC PATCH 3/3] sched/numa: Allow intervale memory allocation for numa balance
  2025-02-25 13:59 [RFC PATCH 0/3] sched/numa: Introduce per cgroup numa balance control Chen Yu
  2025-02-25 14:00 ` [RFC PATCH 1/3] sched/numa: Introduce numa balance task migration and swap in schedstats Chen Yu
  2025-02-25 14:00 ` [RFC PATCH 2/3] sched/numa: Introduce per cgroup numa balance control Chen Yu
@ 2025-02-25 14:00 ` Chen Yu
  2025-03-05 14:38 ` [RFC PATCH 0/3] sched/numa: Introduce per cgroup numa balance control Kaiyang Zhao
  3 siblings, 0 replies; 8+ messages in thread
From: Chen Yu @ 2025-02-25 14:00 UTC (permalink / raw)
  To: Ingo Molnar, Peter Zijlstra, Juri Lelli, Vincent Guittot, Andrew Morton
  Cc: Rik van Riel, Mel Gorman, Johannes Weiner, Michal Hocko,
	Roman Gushchin, Shakeel Butt, Muchun Song, Liam R. Howlett,
	Lorenzo Stoakes, Huang, Ying, Tim Chen, Aubrey Li, Michael Wang,
	Kaiyang Zhao, David Rientjes, Raghavendra K T, cgroups, linux-mm,
	linux-kernel, Chen Yu

MPOL_INTERLEAVE is used to allocate pages interleaved across different
NUMA nodes to make the best use of memory bandwidth. Under MPOL_INTERLEAVE
mode, NUMA load balance page migration does not occur because the page is
already in its designated place. Similarly, NUMA load task migration does
not occur either—mpol_misplaced() returns NUMA_NO_NODE, which instructs
do_numa_page() to skip page/task migration. However, there is a scenario
in the production environment where NUMA balance could benefit
MPOL_INTERLEAVE. This typical scenario involves tasks within cgroup g_A
being bound to two SNC (Sub-NUMA Cluster) nodes via cpuset, with their
pages allocated only on these two SNC nodes in an interleaved manner
using MPOL_INTERLEAVE. This setup allows g_A to achieve good resource
isolation while effectively utilizing the memory bandwidth of the two
SNC nodes. However, it is possible that tasks t1 and t2 in g_A could
experience remote access patterns:

	Node 0		Node 1
	t1		t1.page
	t2.page		t2

Ideally, a NUMA balance task swap would be beneficial:

	Node 0		Node 1
	t2		t1.page
	t2.page		t1

In other words, NUMA balancing can help swap t1 and t2 to improve NUMA
locality without migrating pages, thereby still honoring the
MPOL_INTERLEAVE policy. To enable NUMA balancing to manage MPOL_INTERLEAVE,
add MPOL_F_MOF to the MPOL_INTERLEAVE policy if the user has requested
it via MPOL_F_NUMA_BALANCING (similar to MPOL_BIND). In summary, pages
will not be migrated for MPOL_INTERLEAVE, but tasks will be migrated to
their preferred nodes.

Tested on a system with 4 nodes, 40 Cores(80 CPUs)/node, using
autonumabench NUMA01_THREADLOCAL, with some minor changes to support
MPOL_INTERLEAVE:

  p = mmap(NULL, SIZE, PROT_READ | PROT_WRITE, \
		MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
  set_mempolicy(MPOL_INTERLEAVE | MPOL_F_NUMA_BALANCING, \
		&nodemask_global, max_nodes);
  ...
  //each thread accesses 4K of data every 8K,
  //1 thread should access the pages on 1 node.

No obvious score difference was observed, but noticed some Numa balance
task migration:

                                 baseline_nocg_interleav      nb_nocg_interlave
                                 baseline_nocg_interleave     nb_nocg_interlave/
Min       syst-NUMA01_THREADLOCAL     7156.34 (   0.00%)     7267.28 (  -1.55%)
Min       elsp-NUMA01_THREADLOCAL       90.73 (   0.00%)       90.88 (  -0.17%)
Amean     syst-NUMA01_THREADLOCAL     7156.34 (   0.00%)     7267.28 (  -1.55%)
Amean     elsp-NUMA01_THREADLOCAL       90.73 (   0.00%)       90.88 (  -0.17%)
Stddev    syst-NUMA01_THREADLOCAL        0.00 (   0.00%)        0.00 (   0.00%)
Stddev    elsp-NUMA01_THREADLOCAL        0.00 (   0.00%)        0.00 (   0.00%)
CoeffVar  syst-NUMA01_THREADLOCAL        0.00 (   0.00%)        0.00 (   0.00%)
CoeffVar  elsp-NUMA01_THREADLOCAL        0.00 (   0.00%)        0.00 (   0.00%)
Max       syst-NUMA01_THREADLOCAL     7156.34 (   0.00%)     7267.28 (  -1.55%)
Max       elsp-NUMA01_THREADLOCAL       90.73 (   0.00%)       90.88 (  -0.17%)
BAmean-50 syst-NUMA01_THREADLOCAL     7156.34 (   0.00%)     7267.28 (  -1.55%)
BAmean-50 elsp-NUMA01_THREADLOCAL       90.73 (   0.00%)       90.88 (  -0.17%)
BAmean-95 syst-NUMA01_THREADLOCAL     7156.34 (   0.00%)     7267.28 (  -1.55%)
BAmean-95 elsp-NUMA01_THREADLOCAL       90.73 (   0.00%)       90.88 (  -0.17%)
BAmean-99 syst-NUMA01_THREADLOCAL     7156.34 (   0.00%)     7267.28 (  -1.55%)
BAmean-99 elsp-NUMA01_THREADLOCAL       90.73 (   0.00%)       90.88 (  -0.17%)

delta of /sys/fs/cgroup/mytest/memory.stat during the test:
numa_pages_migrated: 0
numa_pte_updates:    9156154
numa_hint_faults:    8659673
numa_task_migrated:  282  <--- introduced in previous patch
numa_task_swaped:    114  <---- introduced in previous patch

More tests to come.

Suggested-by: Aubrey Li <aubrey.li@intel.com>
Signed-off-by: Chen Yu <yu.c.chen@intel.com>
---
 include/linux/numa.h           | 1 +
 include/uapi/linux/mempolicy.h | 1 +
 mm/memory.c                    | 2 +-
 mm/mempolicy.c                 | 7 +++++++
 4 files changed, 10 insertions(+), 1 deletion(-)

diff --git a/include/linux/numa.h b/include/linux/numa.h
index 3567e40329eb..6c3f2d839c76 100644
--- a/include/linux/numa.h
+++ b/include/linux/numa.h
@@ -14,6 +14,7 @@
 
 #define	NUMA_NO_NODE	(-1)
 #define	NUMA_NO_MEMBLK	(-1)
+#define	NUMA_TASK_MIG	(1)
 
 static inline bool numa_valid_node(int nid)
 {
diff --git a/include/uapi/linux/mempolicy.h b/include/uapi/linux/mempolicy.h
index 1f9bb10d1a47..2081365612ac 100644
--- a/include/uapi/linux/mempolicy.h
+++ b/include/uapi/linux/mempolicy.h
@@ -64,6 +64,7 @@ enum {
 #define MPOL_F_SHARED  (1 << 0)	/* identify shared policies */
 #define MPOL_F_MOF	(1 << 3) /* this policy wants migrate on fault */
 #define MPOL_F_MORON	(1 << 4) /* Migrate On protnone Reference On Node */
+#define MPOL_F_MOFT	(1 << 5) /* allow task but no page migrate on fault */
 
 /*
  * These bit locations are exposed in the vm.zone_reclaim_mode sysctl
diff --git a/mm/memory.c b/mm/memory.c
index 539c0f7c6d54..4013bbcbf40f 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -5683,7 +5683,7 @@ static vm_fault_t do_numa_page(struct vm_fault *vmf)
 
 	target_nid = numa_migrate_check(folio, vmf, vmf->address, &flags,
 					writable, &last_cpupid);
-	if (target_nid == NUMA_NO_NODE)
+	if (target_nid == NUMA_NO_NODE || target_nid == NUMA_TASK_MIG)
 		goto out_map;
 	if (migrate_misplaced_folio_prepare(folio, vma, target_nid)) {
 		flags |= TNF_MIGRATE_FAIL;
diff --git a/mm/mempolicy.c b/mm/mempolicy.c
index bbaadbeeb291..0b88601ec22d 100644
--- a/mm/mempolicy.c
+++ b/mm/mempolicy.c
@@ -1510,6 +1510,8 @@ static inline int sanitize_mpol_flags(int *mode, unsigned short *flags)
 	if (*flags & MPOL_F_NUMA_BALANCING) {
 		if (*mode == MPOL_BIND || *mode == MPOL_PREFERRED_MANY)
 			*flags |= (MPOL_F_MOF | MPOL_F_MORON);
+		else if (*mode == MPOL_INTERLEAVE)
+			*flags |= (MPOL_F_MOF | MPOL_F_MOFT);
 		else
 			return -EINVAL;
 	}
@@ -2779,6 +2781,11 @@ int mpol_misplaced(struct folio *folio, struct vm_fault *vmf,
 	if (!(pol->flags & MPOL_F_MOF))
 		goto out;
 
+	if (pol->flags & MPOL_F_MOFT) {
+		ret = NUMA_TASK_MIG;
+		goto out;
+	}
+
 	switch (pol->mode) {
 	case MPOL_INTERLEAVE:
 		polnid = interleave_nid(pol, ilx);
-- 
2.25.1



^ permalink raw reply	[flat|nested] 8+ messages in thread

* Re: [RFC PATCH 0/3] sched/numa: Introduce per cgroup numa balance control
  2025-02-25 13:59 [RFC PATCH 0/3] sched/numa: Introduce per cgroup numa balance control Chen Yu
                   ` (2 preceding siblings ...)
  2025-02-25 14:00 ` [RFC PATCH 3/3] sched/numa: Allow intervale memory allocation for numa balance Chen Yu
@ 2025-03-05 14:38 ` Kaiyang Zhao
  2025-03-10 15:12   ` Chen Yu
  3 siblings, 1 reply; 8+ messages in thread
From: Kaiyang Zhao @ 2025-03-05 14:38 UTC (permalink / raw)
  To: Chen Yu
  Cc: Ingo Molnar, Peter Zijlstra, Juri Lelli, Vincent Guittot,
	Andrew Morton, Rik van Riel, Mel Gorman, Johannes Weiner,
	Michal Hocko, Roman Gushchin, Shakeel Butt, Muchun Song,
	Liam R. Howlett, Lorenzo Stoakes, Huang, Ying, Tim Chen,
	Aubrey Li, Michael Wang, David Rientjes, Raghavendra K T,
	cgroups, linux-mm, linux-kernel

On Tue, Feb 25, 2025 at 09:59:33PM +0800, Chen Yu wrote:
> This per-cgroup NUMA balancing control was once proposed in
> 2019 by Yun Wang[1]. Then, in 2024, Kaiyang Zhao mentioned
> that he was working with Meta on per-cgroup NUMA control[2]
> during a discussion with David Rientjes.
> 
> I could not find further discussion regarding per-cgroup NUMA
> balancing from that point on. This set of RFC patches is a
> rough and compile-passed version, and may have unhandled cases
> (for example, THP). It has not been thoroughly tested and is
> intended to initiate or resume the discussion on the topic of
> per-cgroup NUMA load balancing.

Hello Chen,

It's nice to see people interested in this. I posted a set of RFC patches
later[1] that focuses on the fairness issue in memory tiering. It mostly
concerns the demotion side of things, and the promotion / NUMA balancing
side of things was left out of the patch set.

I don't work for Meta now, but my understanding is that they'll attempt
to push through a solution for per-cgroup control of memory tiering that
is in the same vein as my RFC patches, and it may include controls for
per-group NUMA balancing in the context of tiered memory.

Best,
Kaiyang

[1] https://lore.kernel.org/linux-mm/20240920221202.1734227-1-kaiyang2@cs.cmu.edu/


^ permalink raw reply	[flat|nested] 8+ messages in thread

* Re: [RFC PATCH 0/3] sched/numa: Introduce per cgroup numa balance control
  2025-03-05 14:38 ` [RFC PATCH 0/3] sched/numa: Introduce per cgroup numa balance control Kaiyang Zhao
@ 2025-03-10 15:12   ` Chen Yu
  0 siblings, 0 replies; 8+ messages in thread
From: Chen Yu @ 2025-03-10 15:12 UTC (permalink / raw)
  To: Kaiyang Zhao
  Cc: Chen Yu, Ingo Molnar, Peter Zijlstra, Juri Lelli,
	Vincent Guittot, Andrew Morton, Rik van Riel, Mel Gorman,
	Johannes Weiner, Michal Hocko, Roman Gushchin, Shakeel Butt,
	Muchun Song, Liam R. Howlett, Lorenzo Stoakes, Huang, Ying,
	Tim Chen, Aubrey Li, David Rientjes, Raghavendra K T, cgroups,
	linux-mm, linux-kernel

Hi Kaiyang,

On 2025-03-05 at 14:38:14 +0000, Kaiyang Zhao wrote:
> On Tue, Feb 25, 2025 at 09:59:33PM +0800, Chen Yu wrote:
> > This per-cgroup NUMA balancing control was once proposed in
> > 2019 by Yun Wang[1]. Then, in 2024, Kaiyang Zhao mentioned
> > that he was working with Meta on per-cgroup NUMA control[2]
> > during a discussion with David Rientjes.
> > 
> > I could not find further discussion regarding per-cgroup NUMA
> > balancing from that point on. This set of RFC patches is a
> > rough and compile-passed version, and may have unhandled cases
> > (for example, THP). It has not been thoroughly tested and is
> > intended to initiate or resume the discussion on the topic of
> > per-cgroup NUMA load balancing.
> 
> Hello Chen,
> 
> It's nice to see people interested in this. I posted a set of RFC patches
> later[1] that focuses on the fairness issue in memory tiering. It mostly
> concerns the demotion side of things, and the promotion / NUMA balancing
> side of things was left out of the patch set.
>

I see, thanks for the information.
 
> I don't work for Meta now, but my understanding is that they'll attempt
> to push through a solution for per-cgroup control of memory tiering that
> is in the same vein as my RFC patches, and it may include controls for
> per-group NUMA balancing in the context of tiered memory.
>

OK, it would be nice to see that patch set. We can continue the disscussion
on this basic per-cgroup Numa balancing control, the tiered memory promotion
could be on top of that IMO.

thanks,
Chenyu
 
> Best,
> Kaiyang
> 
> [1] https://lore.kernel.org/linux-mm/20240920221202.1734227-1-kaiyang2@cs.cmu.edu/



^ permalink raw reply	[flat|nested] 8+ messages in thread

end of thread, other threads:[~2025-03-10 15:35 UTC | newest]

Thread overview: 8+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2025-02-25 13:59 [RFC PATCH 0/3] sched/numa: Introduce per cgroup numa balance control Chen Yu
2025-02-25 14:00 ` [RFC PATCH 1/3] sched/numa: Introduce numa balance task migration and swap in schedstats Chen Yu
2025-02-25 14:00 ` [RFC PATCH 2/3] sched/numa: Introduce per cgroup numa balance control Chen Yu
2025-03-07 22:54   ` Tim Chen
2025-03-10 15:36     ` Chen Yu
2025-02-25 14:00 ` [RFC PATCH 3/3] sched/numa: Allow intervale memory allocation for numa balance Chen Yu
2025-03-05 14:38 ` [RFC PATCH 0/3] sched/numa: Introduce per cgroup numa balance control Kaiyang Zhao
2025-03-10 15:12   ` Chen Yu

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox