* [PATCH v5 0/2] sched/numa: Skip VMA scanning on memory pinned to one NUMA node via cpuset.mems
@ 2025-04-24 2:45 Libo Chen
2025-04-24 2:45 ` [PATCH v5 1/2] " Libo Chen
` (5 more replies)
0 siblings, 6 replies; 9+ messages in thread
From: Libo Chen @ 2025-04-24 2:45 UTC (permalink / raw)
To: akpm, rostedt, peterz, mgorman, mingo, juri.lelli,
vincent.guittot, tj, llong
Cc: sraithal, venkat88, kprateek.nayak, raghavendra.kt, yu.c.chen,
tim.c.chen, vineethr, chris.hyser, daniel.m.jordan,
lorenzo.stoakes, mkoutny, linux-mm, cgroups, linux-kernel
v1->v2:
1. add perf improvement numbers in commit log. Yet to find a perf diff on
will-it-scale, so not included here. Plan to run more workloads.
2. add tracepoint.
3. Per peterz's comment, this will make it impossible to attract tasks to
that memory, just like other VMA skippings. This is the current
implementation; I think we can improve that in the future, but at the
moment it's probably better to keep it consistent.
v2->v3:
1. add a cpusets_enabled() check based on Mel's suggestion, though again I
think it's redundant.
2. print out nodemask with %*p.. format in the tracepoint.
v3->v4:
1. fix an unsafe dereference of a pointer to content not on ring buffer,
namely mem_allowed_ptr in the tracepoint.
v4->v5:
1. add BUILD_BUG_ON() in TP_fast_assign() to guard against future
changes (particularly in size) in nodemask_t.
Libo Chen (2):
sched/numa: Skip VMA scanning on memory pinned to one NUMA node via
cpuset.mems
sched/numa: Add tracepoint that tracks the skipping of numa balancing
due to cpuset memory pinning
include/trace/events/sched.h | 33 +++++++++++++++++++++++++++++++++
kernel/sched/fair.c | 9 +++++++++
2 files changed, 42 insertions(+)
--
2.43.5
^ permalink raw reply [flat|nested] 9+ messages in thread
* [PATCH v5 1/2] sched/numa: Skip VMA scanning on memory pinned to one NUMA node via cpuset.mems
2025-04-24 2:45 [PATCH v5 0/2] sched/numa: Skip VMA scanning on memory pinned to one NUMA node via cpuset.mems Libo Chen
@ 2025-04-24 2:45 ` Libo Chen
2025-04-24 2:45 ` [PATCH v5 2/2] sched/numa: Add tracepoint that tracks the skipping of numa balancing due to cpuset memory pinning Libo Chen
` (4 subsequent siblings)
5 siblings, 0 replies; 9+ messages in thread
From: Libo Chen @ 2025-04-24 2:45 UTC (permalink / raw)
To: akpm, rostedt, peterz, mgorman, mingo, juri.lelli,
vincent.guittot, tj, llong
Cc: sraithal, venkat88, kprateek.nayak, raghavendra.kt, yu.c.chen,
tim.c.chen, vineethr, chris.hyser, daniel.m.jordan,
lorenzo.stoakes, mkoutny, linux-mm, cgroups, linux-kernel
When the memory of the current task is pinned to one NUMA node by cgroup,
there is no point in continuing VMA scanning and NUMA-hinting page faults,
as they are pure overhead. With this change, there will be no more
unnecessary PTE updates or page faults in this scenario.
We have seen up to a 6x improvement on a typical Java workload running on
VMs with memory and CPU pinned to one NUMA node via cpuset in a two-socket
AARCH64 system. With the same pinning, on an 18-core-per-socket Intel
platform, we have seen a 20% improvement in a microbenchmark that creates
a 30-vCPU selftest KVM guest with 4GB of memory, where each vCPU reads 4KB
pages in a fixed number of loops.
Signed-off-by: Libo Chen <libo.chen@oracle.com>
Tested-by: Chen Yu <yu.c.chen@intel.com>
---
kernel/sched/fair.c | 7 +++++++
1 file changed, 7 insertions(+)
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index e43993a4e580..c9903b1b3948 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -3329,6 +3329,13 @@ static void task_numa_work(struct callback_head *work)
if (p->flags & PF_EXITING)
return;
+ /*
+ * Memory is pinned to only one NUMA node via cpuset.mems, naturally
+ * no page can be migrated.
+ */
+ if (cpusets_enabled() && nodes_weight(cpuset_current_mems_allowed) == 1)
+ return;
+
if (!mm->numa_next_scan) {
mm->numa_next_scan = now +
msecs_to_jiffies(sysctl_numa_balancing_scan_delay);
--
2.43.5
* [PATCH v5 2/2] sched/numa: Add tracepoint that tracks the skipping of numa balancing due to cpuset memory pinning
2025-04-24 2:45 [PATCH v5 0/2] sched/numa: Skip VMA scanning on memory pinned to one NUMA node via cpuset.mems Libo Chen
2025-04-24 2:45 ` [PATCH v5 1/2] " Libo Chen
@ 2025-04-24 2:45 ` Libo Chen
2025-04-24 4:42 ` [PATCH v5 0/2] sched/numa: Skip VMA scanning on memory pinned to one NUMA node via cpuset.mems K Prateek Nayak
` (3 subsequent siblings)
5 siblings, 0 replies; 9+ messages in thread
From: Libo Chen @ 2025-04-24 2:45 UTC (permalink / raw)
To: akpm, rostedt, peterz, mgorman, mingo, juri.lelli,
vincent.guittot, tj, llong
Cc: sraithal, venkat88, kprateek.nayak, raghavendra.kt, yu.c.chen,
tim.c.chen, vineethr, chris.hyser, daniel.m.jordan,
lorenzo.stoakes, mkoutny, linux-mm, cgroups, linux-kernel
Unlike the sched_skip_vma_numa tracepoint, which tracks skipped VMAs, this
one tracks the task subjected to cpuset.mems pinning and prints out its
allowed memory node mask.
Signed-off-by: Libo Chen <libo.chen@oracle.com>
---
include/trace/events/sched.h | 33 +++++++++++++++++++++++++++++++++
kernel/sched/fair.c | 4 +++-
2 files changed, 36 insertions(+), 1 deletion(-)
diff --git a/include/trace/events/sched.h b/include/trace/events/sched.h
index 8994e97d86c1..ff3990318aec 100644
--- a/include/trace/events/sched.h
+++ b/include/trace/events/sched.h
@@ -745,6 +745,39 @@ TRACE_EVENT(sched_skip_vma_numa,
__entry->vm_end,
__print_symbolic(__entry->reason, NUMAB_SKIP_REASON))
);
+
+TRACE_EVENT(sched_skip_cpuset_numa,
+
+ TP_PROTO(struct task_struct *tsk, nodemask_t *mem_allowed_ptr),
+
+ TP_ARGS(tsk, mem_allowed_ptr),
+
+ TP_STRUCT__entry(
+ __array( char, comm, TASK_COMM_LEN )
+ __field( pid_t, pid )
+ __field( pid_t, tgid )
+ __field( pid_t, ngid )
+ __array( unsigned long, mem_allowed, BITS_TO_LONGS(MAX_NUMNODES))
+ ),
+
+ TP_fast_assign(
+ memcpy(__entry->comm, tsk->comm, TASK_COMM_LEN);
+ __entry->pid = task_pid_nr(tsk);
+ __entry->tgid = task_tgid_nr(tsk);
+ __entry->ngid = task_numa_group_id(tsk);
+ BUILD_BUG_ON(sizeof(nodemask_t) != \
+ BITS_TO_LONGS(MAX_NUMNODES) * sizeof(long));
+ memcpy(__entry->mem_allowed, mem_allowed_ptr->bits,
+ sizeof(__entry->mem_allowed));
+ ),
+
+ TP_printk("comm=%s pid=%d tgid=%d ngid=%d mem_nodes_allowed=%*pbl",
+ __entry->comm,
+ __entry->pid,
+ __entry->tgid,
+ __entry->ngid,
+ MAX_NUMNODES, __entry->mem_allowed)
+);
#endif /* CONFIG_NUMA_BALANCING */
/*
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index c9903b1b3948..cc892961ce15 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -3333,8 +3333,10 @@ static void task_numa_work(struct callback_head *work)
* Memory is pinned to only one NUMA node via cpuset.mems, naturally
* no page can be migrated.
*/
- if (cpusets_enabled() && nodes_weight(cpuset_current_mems_allowed) == 1)
+ if (cpusets_enabled() && nodes_weight(cpuset_current_mems_allowed) == 1) {
+ trace_sched_skip_cpuset_numa(current, &cpuset_current_mems_allowed);
return;
+ }
if (!mm->numa_next_scan) {
mm->numa_next_scan = now +
--
2.43.5
* Re: [PATCH v5 0/2] sched/numa: Skip VMA scanning on memory pinned to one NUMA node via cpuset.mems
2025-04-24 2:45 [PATCH v5 0/2] sched/numa: Skip VMA scanning on memory pinned to one NUMA node via cpuset.mems Libo Chen
2025-04-24 2:45 ` [PATCH v5 1/2] " Libo Chen
2025-04-24 2:45 ` [PATCH v5 2/2] sched/numa: Add tracepoint that tracks the skipping of numa balancing due to cpuset memory pinning Libo Chen
@ 2025-04-24 4:42 ` K Prateek Nayak
2025-04-24 7:05 ` Venkat Rao Bagalkote
` (2 subsequent siblings)
5 siblings, 0 replies; 9+ messages in thread
From: K Prateek Nayak @ 2025-04-24 4:42 UTC (permalink / raw)
To: Libo Chen, akpm, rostedt, peterz, mgorman, mingo, juri.lelli,
vincent.guittot, tj, llong
Cc: sraithal, venkat88, raghavendra.kt, yu.c.chen, tim.c.chen,
vineethr, chris.hyser, daniel.m.jordan, lorenzo.stoakes, mkoutny,
linux-mm, cgroups, linux-kernel
Hello Libo,
On 4/24/2025 8:15 AM, Libo Chen wrote:
> v1->v2:
> 1. add perf improvment numbers in commit log. Yet to find perf diff on
> will-it-scale, so not included here. Plan to run more workloads.
> 2. add tracepoint.
> 3. To peterz's comment, this will make it impossible to attract tasks to
> those memory just like other VMA skippings. This is the current
> implementation, I think we can improve that in the future, but at the
> moment it's probabaly better to keep it consistent.
I tested the series with hackbench running on a dual socket system with
memory pinned to one node and I could see the skip_cpuset_numa traces
being logged:
sched-messaging-9430 ...: sched_skip_cpuset_numa: comm=sched-messaging pid=9430 tgid=9007 ngid=0 mem_nodes_allowed=0
sched-messaging-9640 ...: sched_skip_cpuset_numa: comm=sched-messaging pid=9640 tgid=9007 ngid=0 mem_nodes_allowed=0
sched-messaging-9645 ...: sched_skip_cpuset_numa: comm=sched-messaging pid=9645 tgid=9007 ngid=0 mem_nodes_allowed=0
sched-messaging-9637 ...: sched_skip_cpuset_numa: comm=sched-messaging pid=9637 tgid=9007 ngid=0 mem_nodes_allowed=0
sched-messaging-9629 ...: sched_skip_cpuset_numa: comm=sched-messaging pid=9629 tgid=9007 ngid=0 mem_nodes_allowed=0
sched-messaging-9639 ...: sched_skip_cpuset_numa: comm=sched-messaging pid=9639 tgid=9007 ngid=0 mem_nodes_allowed=0
sched-messaging-9630 ...: sched_skip_cpuset_numa: comm=sched-messaging pid=9630 tgid=9007 ngid=0 mem_nodes_allowed=0
sched-messaging-9487 ...: sched_skip_cpuset_numa: comm=sched-messaging pid=9487 tgid=9007 ngid=0 mem_nodes_allowed=0
sched-messaging-9635 ...: sched_skip_cpuset_numa: comm=sched-messaging pid=9635 tgid=9007 ngid=0 mem_nodes_allowed=0
sched-messaging-9647 ...: sched_skip_cpuset_numa: comm=sched-messaging pid=9647 tgid=9007 ngid=0 mem_nodes_allowed=0
...
Feel free to add:
Tested-by: K Prateek Nayak <kprateek.nayak@amd.com>
--
Thanks and Regards,
Prateek
>
> v2->v3:
> 1. add enable_cpuset() based on Mel's suggestion but again I think it's
> redundant.
> 2. print out nodemask with %*p.. format in the tracepoint.
>
> v3->v4:
> 1. fix an unsafe dereference of a pointer to content not on ring buffer,
> namely mem_allowed_ptr in the tracepoint.
>
> v4->v5:
> 1. add BUILD_BUG_ON() in TP_fast_assign() to guard against future
> changes (particularly in size) in nodemask_t.
>
> Libo Chen (2):
> sched/numa: Skip VMA scanning on memory pinned to one NUMA node via
> cpuset.mems
> sched/numa: Add tracepoint that tracks the skipping of numa balancing
> due to cpuset memory pinning
>
> include/trace/events/sched.h | 33 +++++++++++++++++++++++++++++++++
> kernel/sched/fair.c | 9 +++++++++
> 2 files changed, 42 insertions(+)
>
* Re: [PATCH v5 0/2] sched/numa: Skip VMA scanning on memory pinned to one NUMA node via cpuset.mems
2025-04-24 2:45 [PATCH v5 0/2] sched/numa: Skip VMA scanning on memory pinned to one NUMA node via cpuset.mems Libo Chen
` (2 preceding siblings ...)
2025-04-24 4:42 ` [PATCH v5 0/2] sched/numa: Skip VMA scanning on memory pinned to one NUMA node via cpuset.mems K Prateek Nayak
@ 2025-04-24 7:05 ` Venkat Rao Bagalkote
2025-04-24 7:46 ` Libo Chen
2025-04-24 7:52 ` Aithal, Srikanth
2025-04-24 9:50 ` Venkat Rao Bagalkote
5 siblings, 1 reply; 9+ messages in thread
From: Venkat Rao Bagalkote @ 2025-04-24 7:05 UTC (permalink / raw)
To: Libo Chen, akpm, rostedt, peterz, mgorman, mingo, juri.lelli,
vincent.guittot, tj, llong
Cc: sraithal, kprateek.nayak, raghavendra.kt, yu.c.chen, tim.c.chen,
vineethr, chris.hyser, daniel.m.jordan, lorenzo.stoakes, mkoutny,
linux-mm, cgroups, linux-kernel
On 24/04/25 8:15 am, Libo Chen wrote:
> v1->v2:
> 1. add perf improvment numbers in commit log. Yet to find perf diff on
> will-it-scale, so not included here. Plan to run more workloads.
> 2. add tracepoint.
> 3. To peterz's comment, this will make it impossible to attract tasks to
> those memory just like other VMA skippings. This is the current
> implementation, I think we can improve that in the future, but at the
> moment it's probabaly better to keep it consistent.
>
> v2->v3:
> 1. add enable_cpuset() based on Mel's suggestion but again I think it's
> redundant.
> 2. print out nodemask with %*p.. format in the tracepoint.
>
> v3->v4:
> 1. fix an unsafe dereference of a pointer to content not on ring buffer,
> namely mem_allowed_ptr in the tracepoint.
>
> v4->v5:
> 1. add BUILD_BUG_ON() in TP_fast_assign() to guard against future
> changes (particularly in size) in nodemask_t.
>
> Libo Chen (2):
> sched/numa: Skip VMA scanning on memory pinned to one NUMA node via
> cpuset.mems
> sched/numa: Add tracepoint that tracks the skipping of numa balancing
> due to cpuset memory pinning
>
> include/trace/events/sched.h | 33 +++++++++++++++++++++++++++++++++
> kernel/sched/fair.c | 9 +++++++++
> 2 files changed, 42 insertions(+)
>
Hello Libo,
For some reason I am not able to apply this patch. I am trying to test
the boot warning[1].
I am trying to apply on top of next-20250423. Below is the error. Am I
missing anything?
[1]: https://lore.kernel.org/all/20250422205740.02c4893a@canb.auug.org.au/
Error:
git am -i
v5_20250423_libo_chen_sched_numa_skip_vma_scanning_on_memory_pinned_to_one_numa_node_via_cpuset_mems.mbx
Commit Body is:
--------------------------
sched/numa: Skip VMA scanning on memory pinned to one NUMA node via
cpuset.mems
When the memory of the current task is pinned to one NUMA node by cgroup,
there is no point in continuing the rest of VMA scanning and hinting page
faults as they will just be overhead. With this change, there will be no
more unnecessary PTE updates or page faults in this scenario.
We have seen up to a 6x improvement on a typical java workload running on
VMs with memory and CPU pinned to one NUMA node via cpuset in a two-socket
AARCH64 system. With the same pinning, on a 18-cores-per-socket Intel
platform, we have seen 20% improvment in a microbench that creates a
30-vCPU selftest KVM guest with 4GB memory, where each vCPU reads 4KB
pages in a fixed number of loops.
Signed-off-by: Libo Chen <libo.chen@oracle.com>
Tested-by: Chen Yu <yu.c.chen@intel.com>
Tested-by: K Prateek Nayak <kprateek.nayak@amd.com>
--------------------------
Apply? [y]es/[n]o/[e]dit/[v]iew patch/[a]ccept all: a
Applying: sched/numa: Skip VMA scanning on memory pinned to one NUMA
node via cpuset.mems
error: patch failed: kernel/sched/fair.c:3329
error: kernel/sched/fair.c: patch does not apply
Patch failed at 0001 sched/numa: Skip VMA scanning on memory pinned to
one NUMA node via cpuset.mems
Regards,
Venkat.
* Re: [PATCH v5 0/2] sched/numa: Skip VMA scanning on memory pinned to one NUMA node via cpuset.mems
2025-04-24 7:05 ` Venkat Rao Bagalkote
@ 2025-04-24 7:46 ` Libo Chen
2025-04-24 9:47 ` Venkat Rao Bagalkote
0 siblings, 1 reply; 9+ messages in thread
From: Libo Chen @ 2025-04-24 7:46 UTC (permalink / raw)
To: Venkat Rao Bagalkote, akpm, rostedt, peterz, mgorman, mingo,
juri.lelli, vincent.guittot, tj, llong
Cc: sraithal, kprateek.nayak, raghavendra.kt, yu.c.chen, tim.c.chen,
vineethr, chris.hyser, daniel.m.jordan, lorenzo.stoakes, mkoutny,
linux-mm, cgroups, linux-kernel
On 4/24/25 00:05, Venkat Rao Bagalkote wrote:
>
> On 24/04/25 8:15 am, Libo Chen wrote:
>> v1->v2:
>> 1. add perf improvment numbers in commit log. Yet to find perf diff on
>> will-it-scale, so not included here. Plan to run more workloads.
>> 2. add tracepoint.
>> 3. To peterz's comment, this will make it impossible to attract tasks to
>> those memory just like other VMA skippings. This is the current
>> implementation, I think we can improve that in the future, but at the
>> moment it's probabaly better to keep it consistent.
>>
>> v2->v3:
>> 1. add enable_cpuset() based on Mel's suggestion but again I think it's
>> redundant.
>> 2. print out nodemask with %*p.. format in the tracepoint.
>>
>> v3->v4:
>> 1. fix an unsafe dereference of a pointer to content not on ring buffer,
>> namely mem_allowed_ptr in the tracepoint.
>>
>> v4->v5:
>> 1. add BUILD_BUG_ON() in TP_fast_assign() to guard against future
>> changes (particularly in size) in nodemask_t.
>>
>> Libo Chen (2):
>> sched/numa: Skip VMA scanning on memory pinned to one NUMA node via
>> cpuset.mems
>> sched/numa: Add tracepoint that tracks the skipping of numa balancing
>> due to cpuset memory pinning
>>
>> include/trace/events/sched.h | 33 +++++++++++++++++++++++++++++++++
>> kernel/sched/fair.c | 9 +++++++++
>> 2 files changed, 42 insertions(+)
>>
> Hello Libo,
>
>
> For some reason I am not able to apply this patch. I am trying to test the boot warning[1].
>
> I am trying to apply on top of next-20250423. Below is the error. Am I missing anything?
>
> [1]: https://urldefense.com/v3/__https://lore.kernel.org/all/20250422205740.02c4893a@canb.auug.org.au/__;!!ACWV5N9M2RV99hQ!IQpY9WDL1O3ppDekb1PpaTYJ98ehOXL6dNIkx02MPN84bCieT18zCh7WSouHctEGpwG2rtpZB42l7b5mkMFb$
> Error:
>
> git am -i v5_20250423_libo_chen_sched_numa_skip_vma_scanning_on_memory_pinned_to_one_numa_node_via_cpuset_mems.mbx
> Commit Body is:
> --------------------------
> sched/numa: Skip VMA scanning on memory pinned to one NUMA node via cpuset.mems
>
> When the memory of the current task is pinned to one NUMA node by cgroup,
> there is no point in continuing the rest of VMA scanning and hinting page
> faults as they will just be overhead. With this change, there will be no
> more unnecessary PTE updates or page faults in this scenario.
>
> We have seen up to a 6x improvement on a typical java workload running on
> VMs with memory and CPU pinned to one NUMA node via cpuset in a two-socket
> AARCH64 system. With the same pinning, on a 18-cores-per-socket Intel
> platform, we have seen 20% improvment in a microbench that creates a
> 30-vCPU selftest KVM guest with 4GB memory, where each vCPU reads 4KB
> pages in a fixed number of loops.
>
> Signed-off-by: Libo Chen <libo.chen@oracle.com>
> Tested-by: Chen Yu <yu.c.chen@intel.com>
> Tested-by: K Prateek Nayak <kprateek.nayak@amd.com>
> --------------------------
> Apply? [y]es/[n]o/[e]dit/[v]iew patch/[a]ccept all: a
> Applying: sched/numa: Skip VMA scanning on memory pinned to one NUMA node via cpuset.mems
> error: patch failed: kernel/sched/fair.c:3329
> error: kernel/sched/fair.c: patch does not apply
> Patch failed at 0001 sched/numa: Skip VMA scanning on memory pinned to one NUMA node via cpuset.mems
>
>
Hi Venkat,
I just did git am -i t.mbox on top of next-20250423. Not sure why, but the
second patch was ahead of the first in the apply order. Have you made sure
the second patch was not applied before the first one?
- Libo
> Regards,
>
> Venkat.
>
>
>
* Re: [PATCH v5 0/2] sched/numa: Skip VMA scanning on memory pinned to one NUMA node via cpuset.mems
2025-04-24 2:45 [PATCH v5 0/2] sched/numa: Skip VMA scanning on memory pinned to one NUMA node via cpuset.mems Libo Chen
` (3 preceding siblings ...)
2025-04-24 7:05 ` Venkat Rao Bagalkote
@ 2025-04-24 7:52 ` Aithal, Srikanth
2025-04-24 9:50 ` Venkat Rao Bagalkote
5 siblings, 0 replies; 9+ messages in thread
From: Aithal, Srikanth @ 2025-04-24 7:52 UTC (permalink / raw)
To: Libo Chen, akpm, rostedt, peterz, mgorman, mingo, juri.lelli,
vincent.guittot, tj, llong
Cc: venkat88, kprateek.nayak, raghavendra.kt, yu.c.chen, tim.c.chen,
vineethr, chris.hyser, daniel.m.jordan, lorenzo.stoakes, mkoutny,
linux-mm, cgroups, linux-kernel
On 4/24/2025 8:15 AM, Libo Chen wrote:
> v1->v2:
> 1. add perf improvment numbers in commit log. Yet to find perf diff on
> will-it-scale, so not included here. Plan to run more workloads.
> 2. add tracepoint.
> 3. To peterz's comment, this will make it impossible to attract tasks to
> those memory just like other VMA skippings. This is the current
> implementation, I think we can improve that in the future, but at the
> moment it's probabaly better to keep it consistent.
>
> v2->v3:
> 1. add enable_cpuset() based on Mel's suggestion but again I think it's
> redundant.
> 2. print out nodemask with %*p.. format in the tracepoint.
>
> v3->v4:
> 1. fix an unsafe dereference of a pointer to content not on ring buffer,
> namely mem_allowed_ptr in the tracepoint.
>
> v4->v5:
> 1. add BUILD_BUG_ON() in TP_fast_assign() to guard against future
> changes (particularly in size) in nodemask_t.
>
> Libo Chen (2):
> sched/numa: Skip VMA scanning on memory pinned to one NUMA node via
> cpuset.mems
> sched/numa: Add tracepoint that tracks the skipping of numa balancing
> due to cpuset memory pinning
>
> include/trace/events/sched.h | 33 +++++++++++++++++++++++++++++++++
> kernel/sched/fair.c | 9 +++++++++
> 2 files changed, 42 insertions(+)
>
Tested on top of next-20250424. The boot warning[1] is fixed with this
version.
Tested-by: Srikanth Aithal <sraithal@amd.com>
[1]: https://lore.kernel.org/all/20250422205740.02c4893a@canb.auug.org.au/
* Re: [PATCH v5 0/2] sched/numa: Skip VMA scanning on memory pinned to one NUMA node via cpuset.mems
2025-04-24 7:46 ` Libo Chen
@ 2025-04-24 9:47 ` Venkat Rao Bagalkote
0 siblings, 0 replies; 9+ messages in thread
From: Venkat Rao Bagalkote @ 2025-04-24 9:47 UTC (permalink / raw)
To: Libo Chen, akpm, rostedt, peterz, mgorman, mingo, juri.lelli,
vincent.guittot, tj, llong
Cc: sraithal, kprateek.nayak, raghavendra.kt, yu.c.chen, tim.c.chen,
vineethr, chris.hyser, daniel.m.jordan, lorenzo.stoakes, mkoutny,
linux-mm, cgroups, linux-kernel
On 24/04/25 1:16 pm, Libo Chen wrote:
>
> On 4/24/25 00:05, Venkat Rao Bagalkote wrote:
>> On 24/04/25 8:15 am, Libo Chen wrote:
>>> v1->v2:
>>> 1. add perf improvment numbers in commit log. Yet to find perf diff on
>>> will-it-scale, so not included here. Plan to run more workloads.
>>> 2. add tracepoint.
>>> 3. To peterz's comment, this will make it impossible to attract tasks to
>>> those memory just like other VMA skippings. This is the current
>>> implementation, I think we can improve that in the future, but at the
>>> moment it's probabaly better to keep it consistent.
>>>
>>> v2->v3:
>>> 1. add enable_cpuset() based on Mel's suggestion but again I think it's
>>> redundant.
>>> 2. print out nodemask with %*p.. format in the tracepoint.
>>>
>>> v3->v4:
>>> 1. fix an unsafe dereference of a pointer to content not on ring buffer,
>>> namely mem_allowed_ptr in the tracepoint.
>>>
>>> v4->v5:
>>> 1. add BUILD_BUG_ON() in TP_fast_assign() to guard against future
>>> changes (particularly in size) in nodemask_t.
>>>
>>> Libo Chen (2):
>>> sched/numa: Skip VMA scanning on memory pinned to one NUMA node via
>>> cpuset.mems
>>> sched/numa: Add tracepoint that tracks the skipping of numa balancing
>>> due to cpuset memory pinning
>>>
>>> include/trace/events/sched.h | 33 +++++++++++++++++++++++++++++++++
>>> kernel/sched/fair.c | 9 +++++++++
>>> 2 files changed, 42 insertions(+)
>>>
>> Hello Libo,
>>
>>
>> For some reason I am not able to apply this patch. I am trying to test the boot warning[1].
>>
>> I am trying to apply on top of next-20250423. Below is the error. Am I missing anything?
>>
>> [1]: https://lore.kernel.org/all/20250422205740.02c4893a@canb.auug.org.au/
>> Error:
>>
>> git am -i v5_20250423_libo_chen_sched_numa_skip_vma_scanning_on_memory_pinned_to_one_numa_node_via_cpuset_mems.mbx
>> Commit Body is:
>> --------------------------
>> sched/numa: Skip VMA scanning on memory pinned to one NUMA node via cpuset.mems
>>
>> When the memory of the current task is pinned to one NUMA node by cgroup,
>> there is no point in continuing the rest of VMA scanning and hinting page
>> faults as they will just be overhead. With this change, there will be no
>> more unnecessary PTE updates or page faults in this scenario.
>>
>> We have seen up to a 6x improvement on a typical java workload running on
>> VMs with memory and CPU pinned to one NUMA node via cpuset in a two-socket
>> AARCH64 system. With the same pinning, on a 18-cores-per-socket Intel
>> platform, we have seen 20% improvment in a microbench that creates a
>> 30-vCPU selftest KVM guest with 4GB memory, where each vCPU reads 4KB
>> pages in a fixed number of loops.
>>
>> Signed-off-by: Libo Chen <libo.chen@oracle.com>
>> Tested-by: Chen Yu <yu.c.chen@intel.com>
>> Tested-by: K Prateek Nayak <kprateek.nayak@amd.com>
>> --------------------------
>> Apply? [y]es/[n]o/[e]dit/[v]iew patch/[a]ccept all: a
>> Applying: sched/numa: Skip VMA scanning on memory pinned to one NUMA node via cpuset.mems
>> error: patch failed: kernel/sched/fair.c:3329
>> error: kernel/sched/fair.c: patch does not apply
>> Patch failed at 0001 sched/numa: Skip VMA scanning on memory pinned to one NUMA node via cpuset.mems
>>
>>
> Hi Venkat,
>
> I just did git am -i t.mbox on top of next-20250423, not sure why but the second patch was ahead of the
> first patch in apply order, have you made sure the second patch was not applied before the first one?
>
> - Libo
Hi Libo,
Apologies! I cloned a fresh tree and the patches applied cleanly this
time, so please ignore my earlier mail.
Regards,
Venkat.
>> Regards,
>>
>> Venkat.
>>
>>
>>
>
* Re: [PATCH v5 0/2] sched/numa: Skip VMA scanning on memory pinned to one NUMA node via cpuset.mems
2025-04-24 2:45 [PATCH v5 0/2] sched/numa: Skip VMA scanning on memory pinned to one NUMA node via cpuset.mems Libo Chen
` (4 preceding siblings ...)
2025-04-24 7:52 ` Aithal, Srikanth
@ 2025-04-24 9:50 ` Venkat Rao Bagalkote
5 siblings, 0 replies; 9+ messages in thread
From: Venkat Rao Bagalkote @ 2025-04-24 9:50 UTC (permalink / raw)
To: Libo Chen, akpm, rostedt, peterz, mgorman, mingo, juri.lelli,
vincent.guittot, tj, llong
Cc: sraithal, kprateek.nayak, raghavendra.kt, yu.c.chen, tim.c.chen,
vineethr, chris.hyser, daniel.m.jordan, lorenzo.stoakes, mkoutny,
linux-mm, cgroups, linux-kernel
On 24/04/25 8:15 am, Libo Chen wrote:
> v1->v2:
> 1. add perf improvment numbers in commit log. Yet to find perf diff on
> will-it-scale, so not included here. Plan to run more workloads.
> 2. add tracepoint.
> 3. To peterz's comment, this will make it impossible to attract tasks to
> those memory just like other VMA skippings. This is the current
> implementation, I think we can improve that in the future, but at the
> moment it's probabaly better to keep it consistent.
>
> v2->v3:
> 1. add enable_cpuset() based on Mel's suggestion but again I think it's
> redundant.
> 2. print out nodemask with %*p.. format in the tracepoint.
>
> v3->v4:
> 1. fix an unsafe dereference of a pointer to content not on ring buffer,
> namely mem_allowed_ptr in the tracepoint.
>
> v4->v5:
> 1. add BUILD_BUG_ON() in TP_fast_assign() to guard against future
> changes (particularly in size) in nodemask_t.
>
> Libo Chen (2):
> sched/numa: Skip VMA scanning on memory pinned to one NUMA node via
> cpuset.mems
> sched/numa: Add tracepoint that tracks the skipping of numa balancing
> due to cpuset memory pinning
>
> include/trace/events/sched.h | 33 +++++++++++++++++++++++++++++++++
> kernel/sched/fair.c | 9 +++++++++
> 2 files changed, 42 insertions(+)
>
Tested the above patch on top of next-20250424 and it fixes the boot
warning on IBM Power server. Hence,
Tested-by: Venkat Rao Bagalkote <venkat88@linux.ibm.com>
Regards,
Venkat.
end of thread, other threads:[~2025-04-24 9:50 UTC | newest]
Thread overview: 9+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2025-04-24 2:45 [PATCH v5 0/2] sched/numa: Skip VMA scanning on memory pinned to one NUMA node via cpuset.mems Libo Chen
2025-04-24 2:45 ` [PATCH v5 1/2] " Libo Chen
2025-04-24 2:45 ` [PATCH v5 2/2] sched/numa: Add tracepoint that tracks the skipping of numa balancing due to cpuset memory pinning Libo Chen
2025-04-24 4:42 ` [PATCH v5 0/2] sched/numa: Skip VMA scanning on memory pinned to one NUMA node via cpuset.mems K Prateek Nayak
2025-04-24 7:05 ` Venkat Rao Bagalkote
2025-04-24 7:46 ` Libo Chen
2025-04-24 9:47 ` Venkat Rao Bagalkote
2025-04-24 7:52 ` Aithal, Srikanth
2025-04-24 9:50 ` Venkat Rao Bagalkote