* [PATCH] sched/numa: Add statistics of numa balance task migration and swap
@ 2025-04-02 1:06 Chen Yu
2025-04-02 13:24 ` Michal Koutný
` (3 more replies)
0 siblings, 4 replies; 13+ messages in thread
From: Chen Yu @ 2025-04-02 1:06 UTC (permalink / raw)
To: Peter Zijlstra, Ingo Molnar, Juri Lelli, Vincent Guittot,
Mel Gorman, Johannes Weiner, Michal Hocko, Roman Gushchin,
Shakeel Butt, Muchun Song, Andrew Morton
Cc: Tim Chen, Aubrey Li, Rik van Riel, Raghavendra K T,
K Prateek Nayak, Baolin Wang, Xunlei Pang, linux-kernel, cgroups,
linux-mm, Chen Yu, Chen Yu
On systems with NUMA balancing enabled, tracking the task activity
caused by NUMA balancing is found to be helpful. NUMA balancing
has two mechanisms for task migration: one is to migrate the task to
an idle CPU in its preferred node, the other is to swap tasks on
different nodes if they are on each other's preferred node.
The kernel already has NUMA page migration statistics in
/sys/fs/cgroup/mytest/memory.stat and /proc/{PID}/sched,
but does not have statistics for task migration/swap.
Add the task migration and swap count accordingly.
The following two new fields:
numa_task_migrated
numa_task_swapped
will be displayed in both
/sys/fs/cgroup/{GROUP}/memory.stat and /proc/{PID}/sched
Previous RFC version can be found here:
https://lore.kernel.org/lkml/1847c5ef828ad4835a35e3a54b88d2e13bce0eea.1740483690.git.yu.c.chen@intel.com/
Signed-off-by: Chen Yu <yu.c.chen@intel.com>
---
RFC->v1: Rename nr_numa_task_migrated to
numa_task_migrated, and nr_numa_task_swapped to
numa_task_swapped in /proc/{PID}/sched,
so that both the cgroup's memory.stat and the task's
sched use the same field names.
---
include/linux/sched.h | 4 ++++
include/linux/vm_event_item.h | 2 ++
kernel/sched/core.c | 10 ++++++++--
kernel/sched/debug.c | 4 ++++
mm/memcontrol.c | 2 ++
mm/vmstat.c | 2 ++
6 files changed, 22 insertions(+), 2 deletions(-)
diff --git a/include/linux/sched.h b/include/linux/sched.h
index 0785268c76f8..9623e5300453 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -532,6 +532,10 @@ struct sched_statistics {
u64 nr_failed_migrations_running;
u64 nr_failed_migrations_hot;
u64 nr_forced_migrations;
+#ifdef CONFIG_NUMA_BALANCING
+ u64 numa_task_migrated;
+ u64 numa_task_swapped;
+#endif
u64 nr_wakeups;
u64 nr_wakeups_sync;
diff --git a/include/linux/vm_event_item.h b/include/linux/vm_event_item.h
index f70d0958095c..aef817474781 100644
--- a/include/linux/vm_event_item.h
+++ b/include/linux/vm_event_item.h
@@ -64,6 +64,8 @@ enum vm_event_item { PGPGIN, PGPGOUT, PSWPIN, PSWPOUT,
NUMA_HINT_FAULTS,
NUMA_HINT_FAULTS_LOCAL,
NUMA_PAGE_MIGRATE,
+ NUMA_TASK_MIGRATE,
+ NUMA_TASK_SWAP,
#endif
#ifdef CONFIG_MIGRATION
PGMIGRATE_SUCCESS, PGMIGRATE_FAIL,
diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index c86c05264719..314d5cbce2b6 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -3348,6 +3348,11 @@ void set_task_cpu(struct task_struct *p, unsigned int new_cpu)
#ifdef CONFIG_NUMA_BALANCING
static void __migrate_swap_task(struct task_struct *p, int cpu)
{
+ __schedstat_inc(p->stats.numa_task_swapped);
+
+ if (p->mm)
+ count_memcg_events_mm(p->mm, NUMA_TASK_SWAP, 1);
+
if (task_on_rq_queued(p)) {
struct rq *src_rq, *dst_rq;
struct rq_flags srf, drf;
@@ -7948,8 +7953,9 @@ int migrate_task_to(struct task_struct *p, int target_cpu)
if (!cpumask_test_cpu(target_cpu, p->cpus_ptr))
return -EINVAL;
- /* TODO: This is not properly updating schedstats */
-
+ __schedstat_inc(p->stats.numa_task_migrated);
+ if (p->mm)
+ count_memcg_events_mm(p->mm, NUMA_TASK_MIGRATE, 1);
trace_sched_move_numa(p, curr_cpu, target_cpu);
return stop_one_cpu(curr_cpu, migration_cpu_stop, &arg);
}
diff --git a/kernel/sched/debug.c b/kernel/sched/debug.c
index 56ae54e0ce6a..f971c2af7912 100644
--- a/kernel/sched/debug.c
+++ b/kernel/sched/debug.c
@@ -1206,6 +1206,10 @@ void proc_sched_show_task(struct task_struct *p, struct pid_namespace *ns,
P_SCHEDSTAT(nr_failed_migrations_running);
P_SCHEDSTAT(nr_failed_migrations_hot);
P_SCHEDSTAT(nr_forced_migrations);
+#ifdef CONFIG_NUMA_BALANCING
+ P_SCHEDSTAT(numa_task_migrated);
+ P_SCHEDSTAT(numa_task_swapped);
+#endif
P_SCHEDSTAT(nr_wakeups);
P_SCHEDSTAT(nr_wakeups_sync);
P_SCHEDSTAT(nr_wakeups_migrate);
diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index 4de6acb9b8ec..1656c90b2381 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -460,6 +460,8 @@ static const unsigned int memcg_vm_event_stat[] = {
NUMA_PAGE_MIGRATE,
NUMA_PTE_UPDATES,
NUMA_HINT_FAULTS,
+ NUMA_TASK_MIGRATE,
+ NUMA_TASK_SWAP,
#endif
};
diff --git a/mm/vmstat.c b/mm/vmstat.c
index 16bfe1c694dd..7de1583a63c9 100644
--- a/mm/vmstat.c
+++ b/mm/vmstat.c
@@ -1339,6 +1339,8 @@ const char * const vmstat_text[] = {
"numa_hint_faults",
"numa_hint_faults_local",
"numa_pages_migrated",
+ "numa_task_migrated",
+ "numa_task_swapped",
#endif
#ifdef CONFIG_MIGRATION
"pgmigrate_success",
--
2.25.1
^ permalink raw reply [flat|nested] 13+ messages in thread
* Re: [PATCH] sched/numa: Add statistics of numa balance task migration and swap
2025-04-02 1:06 [PATCH] sched/numa: Add statistics of numa balance task migration and swap Chen Yu
@ 2025-04-02 13:24 ` Michal Koutný
2025-04-02 17:43 ` K Prateek Nayak
2025-04-03 2:47 ` Chen, Yu C
2025-04-02 13:33 ` Madadi Vineeth Reddy
` (2 subsequent siblings)
3 siblings, 2 replies; 13+ messages in thread
From: Michal Koutný @ 2025-04-02 13:24 UTC (permalink / raw)
To: Chen Yu
Cc: Peter Zijlstra, Ingo Molnar, Juri Lelli, Vincent Guittot,
Mel Gorman, Johannes Weiner, Michal Hocko, Roman Gushchin,
Shakeel Butt, Muchun Song, Andrew Morton, Tim Chen, Aubrey Li,
Rik van Riel, Raghavendra K T, K Prateek Nayak, Baolin Wang,
Xunlei Pang, linux-kernel, cgroups, linux-mm, Chen Yu
Hello Chen.
On Wed, Apr 02, 2025 at 09:06:11AM +0800, Chen Yu <yu.c.chen@intel.com> wrote:
> On system with NUMA balancing enabled, it is found that tracking
> the task activities due to NUMA balancing is helpful.
...
> The following two new fields:
>
> numa_task_migrated
> numa_task_swapped
>
> will be displayed in both
> /sys/fs/cgroup/{GROUP}/memory.stat and /proc/{PID}/sched
Why is the field /proc/$pid/sched not enough?
Also, you may want to update Documentation/admin-guide/cgroup-v2.rst
too.
Thanks,
Michal
* Re: [PATCH] sched/numa: Add statistics of numa balance task migration and swap
2025-04-02 1:06 [PATCH] sched/numa: Add statistics of numa balance task migration and swap Chen Yu
2025-04-02 13:24 ` Michal Koutný
@ 2025-04-02 13:33 ` Madadi Vineeth Reddy
2025-04-02 17:23 ` K Prateek Nayak
2025-04-02 17:35 ` K Prateek Nayak
2025-04-02 18:50 ` Madadi Vineeth Reddy
3 siblings, 1 reply; 13+ messages in thread
From: Madadi Vineeth Reddy @ 2025-04-02 13:33 UTC (permalink / raw)
To: Chen Yu
Cc: Peter Zijlstra, Ingo Molnar, Juri Lelli, Vincent Guittot,
Mel Gorman, Johannes Weiner, Michal Hocko, Roman Gushchin,
Shakeel Butt, Muchun Song, Andrew Morton, Tim Chen, Aubrey Li,
Rik van Riel, Raghavendra K T, K Prateek Nayak, Baolin Wang,
Xunlei Pang, linux-kernel, cgroups, linux-mm, Chen Yu,
Madadi Vineeth Reddy
Hi Chen Yu,
On 02/04/25 06:36, Chen Yu wrote:
> On system with NUMA balancing enabled, it is found that tracking
> the task activities due to NUMA balancing is helpful. NUMA balancing
> has two mechanisms for task migration: one is to migrate the task to
> an idle CPU in its preferred node, the other is to swap tasks on
> different nodes if they are on each other's preferred node.
>
> The kernel already has NUMA page migration statistics in
> /sys/fs/cgroup/mytest/memory.stat and /proc/{PID}/sched.
> but does not have statistics for task migration/swap.
> Add the task migration and swap count accordingly.
>
> The following two new fields:
>
> numa_task_migrated
> numa_task_swapped
>
> will be displayed in both
> /sys/fs/cgroup/{GROUP}/memory.stat and /proc/{PID}/sched
I applied this patch, but I still don't see the two new fields
in /proc/{PID}/sched.
Am I missing any additional steps?
Thanks,
Madadi Vineeth Reddy
>
> Previous RFC version can be found here:
> https://lore.kernel.org/lkml/1847c5ef828ad4835a35e3a54b88d2e13bce0eea.1740483690.git.yu.c.chen@intel.com/
>
> Signed-off-by: Chen Yu <yu.c.chen@intel.com>
> ---
> RFC->v1: Rename the nr_numa_task_migrated to
> numa_task_migrated, and nr_numa_task_swapped
> numa_task_swapped in /proc/{PID}/sched,
> so both cgroup's memory.stat and task's
> sched have the same field name.
> ---
> include/linux/sched.h | 4 ++++
> include/linux/vm_event_item.h | 2 ++
> kernel/sched/core.c | 10 ++++++++--
> kernel/sched/debug.c | 4 ++++
> mm/memcontrol.c | 2 ++
> mm/vmstat.c | 2 ++
> 6 files changed, 22 insertions(+), 2 deletions(-)
* Re: [PATCH] sched/numa: Add statistics of numa balance task migration and swap
2025-04-02 13:33 ` Madadi Vineeth Reddy
@ 2025-04-02 17:23 ` K Prateek Nayak
2025-04-02 18:08 ` Madadi Vineeth Reddy
0 siblings, 1 reply; 13+ messages in thread
From: K Prateek Nayak @ 2025-04-02 17:23 UTC (permalink / raw)
To: 20250402010611.3204674-1-yu.c.chen, Chen Yu
Cc: Peter Zijlstra, Ingo Molnar, Juri Lelli, Vincent Guittot,
Mel Gorman, Johannes Weiner, Michal Hocko, Roman Gushchin,
Shakeel Butt, Muchun Song, Andrew Morton, Tim Chen, Aubrey Li,
Rik van Riel, Raghavendra K T, Baolin Wang, Xunlei Pang,
linux-kernel, cgroups, linux-mm, Chen Yu, Madadi Vineeth Reddy
On 4/2/2025 7:03 PM, Madadi Vineeth Reddy wrote:
> Hi Chen Yu,
>
> On 02/04/25 06:36, Chen Yu wrote:
>> On system with NUMA balancing enabled, it is found that tracking
>> the task activities due to NUMA balancing is helpful. NUMA balancing
>> has two mechanisms for task migration: one is to migrate the task to
>> an idle CPU in its preferred node, the other is to swap tasks on
>> different nodes if they are on each other's preferred node.
>>
>> The kernel already has NUMA page migration statistics in
>> /sys/fs/cgroup/mytest/memory.stat and /proc/{PID}/sched.
>> but does not have statistics for task migration/swap.
>> Add the task migration and swap count accordingly.
>>
>> The following two new fields:
>>
>> numa_task_migrated
>> numa_task_swapped
>>
>> will be displayed in both
>> /sys/fs/cgroup/{GROUP}/memory.stat and /proc/{PID}/sched
>
> I applied this patch, but I still don't see the two new fields
> in /proc/{PID}/sched.
>
> Am I missing any additional steps?
You also need to enable schedstats:
echo 1 > /proc/sys/kernel/sched_schedstats
After that it should be visible:
$ cat /proc/4030/sched
sched-messaging (4030, #threads: 641)
-------------------------------------------------------------------
se.exec_start : 283818.948537
...
nr_forced_migrations : 0
numa_task_migrated : 0
numa_task_swapped : 0
nr_wakeups : 0
...
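As a rough illustration of the steps above (not part of the patch), here is a minimal shell sketch that checks the schedstats sysctl and pulls the two counters out of a sched file. The helper names schedstats_enabled and numa_task_fields are hypothetical, and the parsing assumes the "name : value" layout shown in the sample output.

```sh
# Hypothetical helpers for illustration; not part of the patch.

# Succeeds if schedstats is switched on.
# $1: sysctl file (defaults to /proc/sys/kernel/sched_schedstats)
schedstats_enabled() {
    [ "$(cat "${1:-/proc/sys/kernel/sched_schedstats}" 2>/dev/null)" = "1" ]
}

# Print the two NUMA task counters from a sched file as name=value.
# $1: path such as /proc/<pid>/sched
numa_task_fields() {
    awk -F: '
        # Strip the column padding around the field name.
        { name = $1; gsub(/[ \t]/, "", name) }
        name == "numa_task_migrated" || name == "numa_task_swapped" {
            val = $2; gsub(/[ \t]/, "", val)
            print name "=" val
        }' "$1"
}
```

It could be used as: schedstats_enabled && numa_task_fields /proc/4030/sched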
--
Thanks and Regards,
Prateek
>
> Thanks,
> Madadi Vineeth Reddy
>
>>
>> Previous RFC version can be found here:
>> https://lore.kernel.org/lkml/1847c5ef828ad4835a35e3a54b88d2e13bce0eea.1740483690.git.yu.c.chen@intel.com/
>>
>> Signed-off-by: Chen Yu <yu.c.chen@intel.com>
>> ---
>> RFC->v1: Rename the nr_numa_task_migrated to
>> numa_task_migrated, and nr_numa_task_swapped
>> numa_task_swapped in /proc/{PID}/sched,
>> so both cgroup's memory.stat and task's
>> sched have the same field name.
>> ---
>> include/linux/sched.h | 4 ++++
>> include/linux/vm_event_item.h | 2 ++
>> kernel/sched/core.c | 10 ++++++++--
>> kernel/sched/debug.c | 4 ++++
>> mm/memcontrol.c | 2 ++
>> mm/vmstat.c | 2 ++
>> 6 files changed, 22 insertions(+), 2 deletions(-)
>
>
* Re: [PATCH] sched/numa: Add statistics of numa balance task migration and swap
2025-04-02 1:06 [PATCH] sched/numa: Add statistics of numa balance task migration and swap Chen Yu
2025-04-02 13:24 ` Michal Koutný
2025-04-02 13:33 ` Madadi Vineeth Reddy
@ 2025-04-02 17:35 ` K Prateek Nayak
2025-04-03 2:49 ` Chen, Yu C
2025-04-02 18:50 ` Madadi Vineeth Reddy
3 siblings, 1 reply; 13+ messages in thread
From: K Prateek Nayak @ 2025-04-02 17:35 UTC (permalink / raw)
To: Chen Yu, Peter Zijlstra, Ingo Molnar, Juri Lelli,
Vincent Guittot, Mel Gorman, Johannes Weiner, Michal Hocko,
Roman Gushchin, Shakeel Butt, Muchun Song, Andrew Morton
Cc: Tim Chen, Aubrey Li, Rik van Riel, Raghavendra K T, Baolin Wang,
Xunlei Pang, linux-kernel, cgroups, linux-mm, Chen Yu
Hello Chenyu,
On 4/2/2025 6:36 AM, Chen Yu wrote:
> On system with NUMA balancing enabled, it is found that tracking
> the task activities due to NUMA balancing is helpful. NUMA balancing
> has two mechanisms for task migration: one is to migrate the task to
> an idle CPU in its preferred node, the other is to swap tasks on
> different nodes if they are on each other's preferred node.
>
> The kernel already has NUMA page migration statistics in
> /sys/fs/cgroup/mytest/memory.stat and /proc/{PID}/sched.
> but does not have statistics for task migration/swap.
> Add the task migration and swap count accordingly.
>
> The following two new fields:
>
> numa_task_migrated
> numa_task_swapped
>
> will be displayed in both
> /sys/fs/cgroup/{GROUP}/memory.stat and /proc/{PID}/sched
Running sched-messaging with schedstats enabled, I could see both
"numa_task_migrated" and "numa_task_swapped" being populated for the
sched-messaging threads:
$ for i in $(ls /proc/4030/task/); do grep "numa_task_migrated" /proc/$i/sched; done | tr -s ' ' | cut -d ' ' -f3 | sort | uniq -c
400 0
231 1
10 2
$ for i in $(ls /proc/4030/task/); do grep "numa_task_swapped" /proc/$i/sched; done | tr -s ' ' | cut -d ' ' -f3 | sort | uniq -c
389 0
193 1
47 2
11 3
1 4
>
> Previous RFC version can be found here:
> https://lore.kernel.org/lkml/1847c5ef828ad4835a35e3a54b88d2e13bce0eea.1740483690.git.yu.c.chen@intel.com/
>
> Signed-off-by: Chen Yu <yu.c.chen@intel.com>
Feel free to add:
Tested-by: K Prateek Nayak <kprateek.nayak@amd.com>
--
Thanks and Regards,
Prateek
> ---
> RFC->v1: Rename the nr_numa_task_migrated to
> numa_task_migrated, and nr_numa_task_swapped
> numa_task_swapped in /proc/{PID}/sched,
> so both cgroup's memory.stat and task's
> sched have the same field name.
>
[..snip..]
* Re: [PATCH] sched/numa: Add statistics of numa balance task migration and swap
2025-04-02 13:24 ` Michal Koutný
@ 2025-04-02 17:43 ` K Prateek Nayak
2025-04-03 17:57 ` Michal Koutný
2025-04-03 2:47 ` Chen, Yu C
1 sibling, 1 reply; 13+ messages in thread
From: K Prateek Nayak @ 2025-04-02 17:43 UTC (permalink / raw)
To: Michal Koutný, Chen Yu
Cc: Peter Zijlstra, Ingo Molnar, Juri Lelli, Vincent Guittot,
Mel Gorman, Johannes Weiner, Michal Hocko, Roman Gushchin,
Shakeel Butt, Muchun Song, Andrew Morton, Tim Chen, Aubrey Li,
Rik van Riel, Raghavendra K T, Baolin Wang, Xunlei Pang,
linux-kernel, cgroups, linux-mm, Chen Yu
Hello Michal,
On 4/2/2025 6:54 PM, Michal Koutný wrote:
> Hello Chen.
>
> On Wed, Apr 02, 2025 at 09:06:11AM +0800, Chen Yu <yu.c.chen@intel.com> wrote:
>> On system with NUMA balancing enabled, it is found that tracking
>> the task activities due to NUMA balancing is helpful.
> ...
>> The following two new fields:
>>
>> numa_task_migrated
>> numa_task_swapped
>>
>> will be displayed in both
>> /sys/fs/cgroup/{GROUP}/memory.stat and /proc/{PID}/sched
>
> Why is the field /proc/$pid/sched not enough?
The /proc/$pid/sched accounting is only done when schedstats are
enabled. memcg users might want to track it separately without relying
on schedstats, which also enables a bunch of other scheduler-related
stats collection and adds more overhead.
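To illustrate the memcg side, which does not depend on schedstats, here is a minimal sketch that reads the two counters from a cgroup's memory.stat. The function name memcg_numa_task_stats is hypothetical, and it assumes the usual flat "key value" layout of memory.stat.

```sh
# Hypothetical helper for illustration; not part of the patch.

# Print the two NUMA task counters from a memory.stat file as name=value.
# $1: path such as /sys/fs/cgroup/<group>/memory.stat
memcg_numa_task_stats() {
    awk '$1 == "numa_task_migrated" || $1 == "numa_task_swapped" {
            print $1 "=" $2
        }' "$1"
}
```

It could be used as: memcg_numa_task_stats /sys/fs/cgroup/mytest/memory.stat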
>
> Also, you may want to update Documentation/admin-guide/cgroup-v2.rst
> too.
>
> Thanks,
> Michal
--
Thanks and Regards,
Prateek
* Re: [PATCH] sched/numa: Add statistics of numa balance task migration and swap
2025-04-02 17:23 ` K Prateek Nayak
@ 2025-04-02 18:08 ` Madadi Vineeth Reddy
0 siblings, 0 replies; 13+ messages in thread
From: Madadi Vineeth Reddy @ 2025-04-02 18:08 UTC (permalink / raw)
To: K Prateek Nayak
Cc: Peter Zijlstra, Ingo Molnar, Juri Lelli, Vincent Guittot,
Mel Gorman, Johannes Weiner, Michal Hocko, Roman Gushchin,
Shakeel Butt, Muchun Song, Andrew Morton, Tim Chen, Aubrey Li,
Rik van Riel, Raghavendra K T, Baolin Wang, Xunlei Pang,
linux-kernel, cgroups, linux-mm, Chen Yu, Chen Yu,
Madadi Vineeth Reddy
On 02/04/25 22:53, K Prateek Nayak wrote:
> On 4/2/2025 7:03 PM, Madadi Vineeth Reddy wrote:
>> Hi Chen Yu,
>>
>> On 02/04/25 06:36, Chen Yu wrote:
>>> On system with NUMA balancing enabled, it is found that tracking
>>> the task activities due to NUMA balancing is helpful. NUMA balancing
>>> has two mechanisms for task migration: one is to migrate the task to
>>> an idle CPU in its preferred node, the other is to swap tasks on
>>> different nodes if they are on each other's preferred node.
>>>
>>> The kernel already has NUMA page migration statistics in
>>> /sys/fs/cgroup/mytest/memory.stat and /proc/{PID}/sched.
>>> but does not have statistics for task migration/swap.
>>> Add the task migration and swap count accordingly.
>>>
>>> The following two new fields:
>>>
>>> numa_task_migrated
>>> numa_task_swapped
>>>
>>> will be displayed in both
>>> /sys/fs/cgroup/{GROUP}/memory.stat and /proc/{PID}/sched
>>
>> I applied this patch, but I still don't see the two new fields
>> in /proc/{PID}/sched.
>>
>> Am I missing any additional steps?
>
> You also need to enable schedstats:
>
> echo 1 > /proc/sys/kernel/sched_schedstats
>
> After that it should be visible:
Thanks, Prateek! I had missed enabling schedstats. Now that it's enabled,
I can see the fields.
Thanks,
Madadi Vineeth Reddy
>
> $ cat /proc/4030/sched
> sched-messaging (4030, #threads: 641)
> -------------------------------------------------------------------
> se.exec_start : 283818.948537
>
> ...
>
> nr_forced_migrations : 0
> numa_task_migrated : 0
> numa_task_swapped : 0
> nr_wakeups : 0
>
> ...
>
* Re: [PATCH] sched/numa: Add statistics of numa balance task migration and swap
2025-04-02 1:06 [PATCH] sched/numa: Add statistics of numa balance task migration and swap Chen Yu
` (2 preceding siblings ...)
2025-04-02 17:35 ` K Prateek Nayak
@ 2025-04-02 18:50 ` Madadi Vineeth Reddy
2025-04-03 2:52 ` Chen, Yu C
3 siblings, 1 reply; 13+ messages in thread
From: Madadi Vineeth Reddy @ 2025-04-02 18:50 UTC (permalink / raw)
To: Chen Yu
Cc: Peter Zijlstra, Ingo Molnar, Juri Lelli, Vincent Guittot,
Mel Gorman, Johannes Weiner, Michal Hocko, Roman Gushchin,
Shakeel Butt, Muchun Song, Andrew Morton, Tim Chen, Aubrey Li,
Rik van Riel, Raghavendra K T, K Prateek Nayak, Baolin Wang,
Xunlei Pang, linux-kernel, cgroups, linux-mm, Chen Yu,
Madadi Vineeth Reddy
Hi Chen Yu,
On 02/04/25 06:36, Chen Yu wrote:
> On system with NUMA balancing enabled, it is found that tracking
> the task activities due to NUMA balancing is helpful. NUMA balancing
> has two mechanisms for task migration: one is to migrate the task to
> an idle CPU in its preferred node, the other is to swap tasks on
> different nodes if they are on each other's preferred node.
>
> The kernel already has NUMA page migration statistics in
> /sys/fs/cgroup/mytest/memory.stat and /proc/{PID}/sched.
> but does not have statistics for task migration/swap.
> Add the task migration and swap count accordingly.
>
> The following two new fields:
>
> numa_task_migrated
> numa_task_swapped
>
> will be displayed in both
> /sys/fs/cgroup/{GROUP}/memory.stat and /proc/{PID}/sched
I was able to see the fields and their corresponding values for schbench:
numa_task_swapped : 2
numa_task_migrated : 0
numa_task_swapped : 1
numa_task_migrated : 0
numa_task_swapped : 0
numa_task_migrated : 0
numa_task_swapped : 1
Tested-by: Madadi Vineeth Reddy <vineethr@linux.ibm.com>
Thanks,
Madadi Vineeth Reddy
> Previous RFC version can be found here:
> https://lore.kernel.org/lkml/1847c5ef828ad4835a35e3a54b88d2e13bce0eea.1740483690.git.yu.c.chen@intel.com/
>
> Signed-off-by: Chen Yu <yu.c.chen@intel.com>
> ---
> RFC->v1: Rename the nr_numa_task_migrated to
> numa_task_migrated, and nr_numa_task_swapped
> numa_task_swapped in /proc/{PID}/sched,
> so both cgroup's memory.stat and task's
> sched have the same field name.
> ---
> include/linux/sched.h | 4 ++++
> include/linux/vm_event_item.h | 2 ++
> kernel/sched/core.c | 10 ++++++++--
> kernel/sched/debug.c | 4 ++++
> mm/memcontrol.c | 2 ++
> mm/vmstat.c | 2 ++
> 6 files changed, 22 insertions(+), 2 deletions(-)
* Re: [PATCH] sched/numa: Add statistics of numa balance task migration and swap
2025-04-02 13:24 ` Michal Koutný
2025-04-02 17:43 ` K Prateek Nayak
@ 2025-04-03 2:47 ` Chen, Yu C
2025-04-03 18:03 ` Michal Koutný
1 sibling, 1 reply; 13+ messages in thread
From: Chen, Yu C @ 2025-04-03 2:47 UTC (permalink / raw)
To: Michal Koutný
Cc: Peter Zijlstra, Ingo Molnar, Juri Lelli, Vincent Guittot,
Mel Gorman, Johannes Weiner, Michal Hocko, Roman Gushchin,
Shakeel Butt, Muchun Song, Andrew Morton, Tim Chen, Aubrey Li,
Rik van Riel, Raghavendra K T, K Prateek Nayak, Baolin Wang,
Xunlei Pang, linux-kernel, cgroups, linux-mm, Chen Yu
Hi Michal,
thanks for taking a look at this,
On 4/2/2025 9:24 PM, Michal Koutný wrote:
> Hello Chen.
>
> On Wed, Apr 02, 2025 at 09:06:11AM +0800, Chen Yu <yu.c.chen@intel.com> wrote:
>> On system with NUMA balancing enabled, it is found that tracking
>> the task activities due to NUMA balancing is helpful.
> ...
>> The following two new fields:
>>
>> numa_task_migrated
>> numa_task_swapped
>>
>> will be displayed in both
>> /sys/fs/cgroup/{GROUP}/memory.stat and /proc/{PID}/sched
>
> Why is the field /proc/$pid/sched not enough?
>
In the context of NUMA balancing, it would be helpful to monitor not
only the activities of individual tasks/threads but also the
resource usage and task migrations at the group level, which helps us
quickly evaluate the performance and resource usage of the container,
like the per-memcg numa_pages_migrated and numa_pte_updates introduced in
commit f77f0c751478 ("mm,memcg: provide per-cgroup counters for NUMA
balancing operations"). Yes, we can iterate over /proc/$pid/sched to
find the accumulated NUMA stats, but the per-cgroup
NUMA stats let users track the overall data of the
workload more conveniently.
Besides, I'm considering evaluating the per-cgroup NUMA balancing
control[1] to help users do fine-grained control per workload. These
per-cgroup NUMA balancing stats could be used to evaluate the efficiency of
per-cgroup NUMA balancing.
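For comparison, the manual alternative of iterating over /proc/$pid/sched could be sketched roughly as below. The function name cgroup_numa_task_sum and its arguments are hypothetical. The sketch only sees tasks that are still alive, and only counts events recorded while schedstats was enabled.

```sh
# Hypothetical helper for illustration; not part of the patch.

# Sum one sched counter over every thread of every process in a cgroup.
# $1: field name, e.g. numa_task_swapped
# $2: cgroup directory containing cgroup.procs
# $3: proc root (defaults to /proc)
cgroup_numa_task_sum() {
    field=$1
    cgdir=$2
    procroot=${3:-/proc}
    total=0
    while read -r pid; do
        for f in "$procroot/$pid"/task/*/sched; do
            # Tasks may exit while we scan; skip anything unreadable.
            [ -r "$f" ] || continue
            v=$(awk -F: -v want="$field" '
                { name = $1; gsub(/[ \t]/, "", name) }
                name == want {
                    val = $2; gsub(/[ \t]/, "", val)
                    print val
                    exit
                }' "$f" 2>/dev/null)
            total=$((total + ${v:-0}))
        done
    done < "$cgdir/cgroup.procs"
    echo "$total"
}
```

It could be used as: cgroup_numa_task_sum numa_task_migrated /sys/fs/cgroup/mytest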
> Also, you may want to update Documentation/admin-guide/cgroup-v2.rst
> too.
Got it, will do in next version.
[1]
https://lore.kernel.org/lkml/b3f1f6c478127a38b9091a8341374ba160d25c5a.1740483690.git.yu.c.chen@intel.com/
thanks,
Chenyu
>
> Thanks,
> Michal
* Re: [PATCH] sched/numa: Add statistics of numa balance task migration and swap
2025-04-02 17:35 ` K Prateek Nayak
@ 2025-04-03 2:49 ` Chen, Yu C
0 siblings, 0 replies; 13+ messages in thread
From: Chen, Yu C @ 2025-04-03 2:49 UTC (permalink / raw)
To: K Prateek Nayak, Peter Zijlstra, Ingo Molnar, Juri Lelli,
Vincent Guittot, Mel Gorman, Johannes Weiner, Michal Hocko,
Roman Gushchin, Shakeel Butt, Muchun Song, Andrew Morton
Cc: Tim Chen, Aubrey Li, Rik van Riel, Raghavendra K T, Baolin Wang,
Xunlei Pang, linux-kernel, cgroups, linux-mm, Chen Yu
On 4/3/2025 1:35 AM, K Prateek Nayak wrote:
> Hello Chenyu,
>
> On 4/2/2025 6:36 AM, Chen Yu wrote:
>> On system with NUMA balancing enabled, it is found that tracking
>> the task activities due to NUMA balancing is helpful. NUMA balancing
>> has two mechanisms for task migration: one is to migrate the task to
>> an idle CPU in its preferred node, the other is to swap tasks on
>> different nodes if they are on each other's preferred node.
>>
>> The kernel already has NUMA page migration statistics in
>> /sys/fs/cgroup/mytest/memory.stat and /proc/{PID}/sched.
>> but does not have statistics for task migration/swap.
>> Add the task migration and swap count accordingly.
>>
>> The following two new fields:
>>
>> numa_task_migrated
>> numa_task_swapped
>>
>> will be displayed in both
>> /sys/fs/cgroup/{GROUP}/memory.stat and /proc/{PID}/sched
>
> Running sched-messaging with schedstats enabled, I could see both
> "numa_task_migrated" and "numa_task_swapped" being populated for the
> sched-messaging threads:
>
> $ for i in $(ls /proc/4030/task/); do grep "numa_task_migrated" /proc/$i/sched; done | tr -s ' ' | cut -d ' ' -f3 | sort | uniq -c
> 400 0
> 231 1
> 10 2
>
> $ for i in $(ls /proc/4030/task/); do grep "numa_task_swapped" /proc/$i/sched; done | tr -s ' ' | cut -d ' ' -f3 | sort | uniq -c
> 389 0
> 193 1
> 47 2
> 11 3
> 1 4
>
>>
>> Previous RFC version can be found here:
>> https://lore.kernel.org/lkml/1847c5ef828ad4835a35e3a54b88d2e13bce0eea.1740483690.git.yu.c.chen@intel.com/
>>
>> Signed-off-by: Chen Yu <yu.c.chen@intel.com>
>
> Feel free to add:
>
> Tested-by: K Prateek Nayak <kprateek.nayak@amd.com>
>
Thanks Prateek!
Thanks,
Chenyu
* Re: [PATCH] sched/numa: Add statistics of numa balance task migration and swap
2025-04-02 18:50 ` Madadi Vineeth Reddy
@ 2025-04-03 2:52 ` Chen, Yu C
0 siblings, 0 replies; 13+ messages in thread
From: Chen, Yu C @ 2025-04-03 2:52 UTC (permalink / raw)
To: Madadi Vineeth Reddy
Cc: Peter Zijlstra, Ingo Molnar, Juri Lelli, Vincent Guittot,
Mel Gorman, Johannes Weiner, Michal Hocko, Roman Gushchin,
Shakeel Butt, Muchun Song, Andrew Morton, Tim Chen, Aubrey Li,
Rik van Riel, Raghavendra K T, K Prateek Nayak, Baolin Wang,
Xunlei Pang, linux-kernel, cgroups, linux-mm, Chen Yu
On 4/3/2025 2:50 AM, Madadi Vineeth Reddy wrote:
> Hi Chen Yu,
>
> On 02/04/25 06:36, Chen Yu wrote:
>> On system with NUMA balancing enabled, it is found that tracking
>> the task activities due to NUMA balancing is helpful. NUMA balancing
>> has two mechanisms for task migration: one is to migrate the task to
>> an idle CPU in its preferred node, the other is to swap tasks on
>> different nodes if they are on each other's preferred node.
>>
>> The kernel already has NUMA page migration statistics in
>> /sys/fs/cgroup/mytest/memory.stat and /proc/{PID}/sched.
>> but does not have statistics for task migration/swap.
>> Add the task migration and swap count accordingly.
>>
>> The following two new fields:
>>
>> numa_task_migrated
>> numa_task_swapped
>>
>> will be displayed in both
>> /sys/fs/cgroup/{GROUP}/memory.stat and /proc/{PID}/sched
>
> I was able to see the fields and their corresponding values for schbench:
>
> numa_task_swapped : 2
> numa_task_migrated : 0
> numa_task_swapped : 1
> numa_task_migrated : 0
> numa_task_swapped : 0
> numa_task_migrated : 0
> numa_task_swapped : 1
>
> Tested-by: Madadi Vineeth Reddy <vineethr@linux.ibm.com>
>
Yes, the schedstats sysctl has to be enabled. Thanks for testing, Madadi!
thanks,
Chenyu
> Thanks,
> Madadi Vineeth Reddy
>
>> Previous RFC version can be found here:
>> https://lore.kernel.org/lkml/1847c5ef828ad4835a35e3a54b88d2e13bce0eea.1740483690.git.yu.c.chen@intel.com/
>>
>> Signed-off-by: Chen Yu <yu.c.chen@intel.com>
>> ---
>> RFC->v1: Rename the nr_numa_task_migrated to
>> numa_task_migrated, and nr_numa_task_swapped
>> numa_task_swapped in /proc/{PID}/sched,
>> so both cgroup's memory.stat and task's
>> sched have the same field name.
>> ---
>> include/linux/sched.h | 4 ++++
>> include/linux/vm_event_item.h | 2 ++
>> kernel/sched/core.c | 10 ++++++++--
>> kernel/sched/debug.c | 4 ++++
>> mm/memcontrol.c | 2 ++
>> mm/vmstat.c | 2 ++
>> 6 files changed, 22 insertions(+), 2 deletions(-)
* Re: [PATCH] sched/numa: Add statistics of numa balance task migration and swap
2025-04-02 17:43 ` K Prateek Nayak
@ 2025-04-03 17:57 ` Michal Koutný
0 siblings, 0 replies; 13+ messages in thread
From: Michal Koutný @ 2025-04-03 17:57 UTC (permalink / raw)
To: K Prateek Nayak
Cc: Chen Yu, Peter Zijlstra, Ingo Molnar, Juri Lelli,
Vincent Guittot, Mel Gorman, Johannes Weiner, Michal Hocko,
Roman Gushchin, Shakeel Butt, Muchun Song, Andrew Morton,
Tim Chen, Aubrey Li, Rik van Riel, Raghavendra K T, Baolin Wang,
Xunlei Pang, linux-kernel, cgroups, linux-mm, Chen Yu
On Wed, Apr 02, 2025 at 11:13:03PM +0530, K Prateek Nayak <kprateek.nayak@amd.com> wrote:
> The /proc/$pid/sched accounting is only done when schedstats are
> enabled. memcg users might want to track it separately without relying
> on schedstats which also enables a bunch of other scheduler related
> stats collection adding more overheads.
.oO(memory.[numa_]stat could end up with something similar since not all
users are interested in all fields but all users are affected by added
fields)
Thanks,
Michal
* Re: [PATCH] sched/numa: Add statistics of numa balance task migration and swap
2025-04-03 2:47 ` Chen, Yu C
@ 2025-04-03 18:03 ` Michal Koutný
0 siblings, 0 replies; 13+ messages in thread
From: Michal Koutný @ 2025-04-03 18:03 UTC (permalink / raw)
To: Chen, Yu C
Cc: Peter Zijlstra, Ingo Molnar, Juri Lelli, Vincent Guittot,
Mel Gorman, Johannes Weiner, Michal Hocko, Roman Gushchin,
Shakeel Butt, Muchun Song, Andrew Morton, Tim Chen, Aubrey Li,
Rik van Riel, Raghavendra K T, K Prateek Nayak, Baolin Wang,
Xunlei Pang, linux-kernel, cgroups, linux-mm, Chen Yu
On Thu, Apr 03, 2025 at 10:47:44AM +0800, "Chen, Yu C" <yu.c.chen@intel.com> wrote:
> In the context of NUMA balancing, it would be helpful to not only monitor on
> the activities of individual task/thread but also the resource usage and
> task migrations at the group level - which helps us quickly evaluate the
> performance and resource usage of the container - like per memcg
> numa_pages_migrated, numa_pte_updates introduced in
Somehow I thought that these are the useful metrics (amount of misplaced
memory) and the aggregated task stats aren't interesting (they'd be more
or less proportional to the total number of tasks in the group and you don't
know which task it was).
> Got it, will do in next version.
OK.
Thanks,
Michal