linux-mm.kvack.org archive mirror
* [PATCH v4 0/2] sched/numa: Skip VMA scanning on memory pinned to one NUMA node via cpuset.mems
@ 2025-04-24  0:01 Libo Chen
  2025-04-24  0:01 ` [PATCH v4 1/2] sched/numa: Skip VMA scanning on memory pinned to one NUMA node via cpuset.mems Libo Chen
  2025-04-24  0:01 ` [PATCH v4 2/2] sched/numa: Add tracepoint that tracks the skipping of numa balancing due to cpuset memory pinning Libo Chen
  0 siblings, 2 replies; 9+ messages in thread
From: Libo Chen @ 2025-04-24  0:01 UTC (permalink / raw)
  To: akpm, rostedt, peterz, mgorman, mingo, juri.lelli,
	vincent.guittot, tj, llong
  Cc: sraithal, venkat88, kprateek.nayak, raghavendra.kt, yu.c.chen,
	tim.c.chen, vineethr, chris.hyser, daniel.m.jordan,
	lorenzo.stoakes, mkoutny, linux-mm, cgroups, linux-kernel

v1->v2:
1. Add perf improvement numbers to the commit log. No perf difference has
been found on will-it-scale yet, so those numbers are not included; more
workloads are planned.
2. Add a tracepoint.
3. Per peterz's comment, this makes it impossible to attract tasks to
that memory, just like the other VMA skippings. That is the current
implementation; I think we can improve it in the future, but at the
moment it is probably better to keep the behavior consistent.

v2->v3:
1. Add a cpusets_enabled() check based on Mel's suggestion, though again
I think it is redundant.
2. Print the nodemask with the %*pbl format in the tracepoint.

v3->v4:
1. Fix an unsafe dereference in the tracepoint of a pointer
(mem_allowed_ptr) to content not stored on the ring buffer.

Libo Chen (2):
  sched/numa: Skip VMA scanning on memory pinned to one NUMA node via
    cpuset.mems
  sched/numa: Add tracepoint that tracks the skipping of numa balancing
    due to cpuset memory pinning

 include/trace/events/sched.h | 31 +++++++++++++++++++++++++++++++
 kernel/sched/fair.c          |  9 +++++++++
 2 files changed, 40 insertions(+)

-- 
2.43.5




* [PATCH v4 1/2] sched/numa: Skip VMA scanning on memory pinned to one NUMA node via cpuset.mems
  2025-04-24  0:01 [PATCH v4 0/2] sched/numa: Skip VMA scanning on memory pinned to one NUMA node via cpuset.mems Libo Chen
@ 2025-04-24  0:01 ` Libo Chen
  2025-04-24  0:01 ` [PATCH v4 2/2] sched/numa: Add tracepoint that tracks the skipping of numa balancing due to cpuset memory pinning Libo Chen
  1 sibling, 0 replies; 9+ messages in thread
From: Libo Chen @ 2025-04-24  0:01 UTC (permalink / raw)
  To: akpm, rostedt, peterz, mgorman, mingo, juri.lelli,
	vincent.guittot, tj, llong
  Cc: sraithal, venkat88, kprateek.nayak, raghavendra.kt, yu.c.chen,
	tim.c.chen, vineethr, chris.hyser, daniel.m.jordan,
	lorenzo.stoakes, mkoutny, linux-mm, cgroups, linux-kernel

When the memory of the current task is pinned to one NUMA node by cgroup,
there is no point in continuing the rest of VMA scanning and hinting page
faults as they will just be overhead. With this change, there will be no
more unnecessary PTE updates or page faults in this scenario.

We have seen up to a 6x improvement on a typical Java workload running on
VMs with memory and CPU pinned to one NUMA node via cpuset on a two-socket
AArch64 system. With the same pinning, on an 18-cores-per-socket Intel
platform, we have seen a 20% improvement in a microbenchmark that creates
a 30-vCPU selftest KVM guest with 4GB of memory, where each vCPU reads 4KB
pages in a fixed number of loops.
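
For context, here is a minimal userspace sketch of how the pinned
configuration this patch targets can be set up. This is not part of the
series; it assumes cgroup v2 is mounted at /sys/fs/cgroup and that a
cpuset-enabled child cgroup named "test" (a hypothetical name) already
exists:

	#include <stdio.h>
	#include <unistd.h>

	static int write_str(const char *path, const char *val)
	{
		FILE *f = fopen(path, "w");

		if (!f)
			return -1;
		fputs(val, f);
		return fclose(f);
	}

	int main(void)
	{
		char pid[16];

		/* Allow memory allocation from NUMA node 0 only. */
		if (write_str("/sys/fs/cgroup/test/cpuset.mems", "0"))
			return 1;

		/* Move this task into the pinned cgroup. */
		snprintf(pid, sizeof(pid), "%d", (int)getpid());
		if (write_str("/sys/fs/cgroup/test/cgroup.procs", pid))
			return 1;

		/*
		 * From here on, task_numa_work() sees
		 * nodes_weight(cpuset_current_mems_allowed) == 1 for this
		 * task and skips the NUMA-balancing VMA scan entirely.
		 */
		pause();
		return 0;
	}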

Signed-off-by: Libo Chen <libo.chen@oracle.com>
Tested-by: Chen Yu <yu.c.chen@intel.com>
---
 kernel/sched/fair.c | 7 +++++++
 1 file changed, 7 insertions(+)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index e43993a4e580..c9903b1b3948 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -3329,6 +3329,13 @@ static void task_numa_work(struct callback_head *work)
 	if (p->flags & PF_EXITING)
 		return;
 
+	/*
+	 * Memory is pinned to only one NUMA node via cpuset.mems, naturally
+	 * no page can be migrated.
+	 */
+	if (cpusets_enabled() && nodes_weight(cpuset_current_mems_allowed) == 1)
+		return;
+
 	if (!mm->numa_next_scan) {
 		mm->numa_next_scan = now +
 			msecs_to_jiffies(sysctl_numa_balancing_scan_delay);
-- 
2.43.5




* [PATCH v4 2/2] sched/numa: Add tracepoint that tracks the skipping of numa balancing due to cpuset memory pinning
  2025-04-24  0:01 [PATCH v4 0/2] sched/numa: Skip VMA scanning on memory pinned to one NUMA node via cpuset.mems Libo Chen
  2025-04-24  0:01 ` [PATCH v4 1/2] sched/numa: Skip VMA scanning on memory pinned to one NUMA node via cpuset.mems Libo Chen
@ 2025-04-24  0:01 ` Libo Chen
  2025-04-24  0:18   ` Steven Rostedt
  1 sibling, 1 reply; 9+ messages in thread
From: Libo Chen @ 2025-04-24  0:01 UTC (permalink / raw)
  To: akpm, rostedt, peterz, mgorman, mingo, juri.lelli,
	vincent.guittot, tj, llong
  Cc: sraithal, venkat88, kprateek.nayak, raghavendra.kt, yu.c.chen,
	tim.c.chen, vineethr, chris.hyser, daniel.m.jordan,
	lorenzo.stoakes, mkoutny, linux-mm, cgroups, linux-kernel

Unlike the sched_skip_vma_numa tracepoint, which tracks skipped VMAs, this
one tracks the task subjected to cpuset.mems pinning and prints out its
allowed memory node mask.
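
With this, a resulting trace line body looks something like the following
(hypothetical task name and IDs, memory allowed on node 0 only):

	comm=java pid=1234 tgid=1234 ngid=0 mem_nodes_allowed=0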

Signed-off-by: Libo Chen <libo.chen@oracle.com>
---
 include/trace/events/sched.h | 31 +++++++++++++++++++++++++++++++
 kernel/sched/fair.c          |  4 +++-
 2 files changed, 34 insertions(+), 1 deletion(-)

diff --git a/include/trace/events/sched.h b/include/trace/events/sched.h
index 8994e97d86c1..91f9dc177dad 100644
--- a/include/trace/events/sched.h
+++ b/include/trace/events/sched.h
@@ -745,6 +745,37 @@ TRACE_EVENT(sched_skip_vma_numa,
 		  __entry->vm_end,
 		  __print_symbolic(__entry->reason, NUMAB_SKIP_REASON))
 );
+
+TRACE_EVENT(sched_skip_cpuset_numa,
+
+	TP_PROTO(struct task_struct *tsk, nodemask_t *mem_allowed_ptr),
+
+	TP_ARGS(tsk, mem_allowed_ptr),
+
+	TP_STRUCT__entry(
+		__array( char,		comm,		TASK_COMM_LEN		)
+		__field( pid_t,		pid					)
+		__field( pid_t,		tgid					)
+		__field( pid_t,		ngid					)
+		__array( unsigned long, mem_allowed, BITS_TO_LONGS(MAX_NUMNODES))
+	),
+
+	TP_fast_assign(
+		memcpy(__entry->comm, tsk->comm, TASK_COMM_LEN);
+		__entry->pid		 = task_pid_nr(tsk);
+		__entry->tgid		 = task_tgid_nr(tsk);
+		__entry->ngid		 = task_numa_group_id(tsk);
+		memcpy(__entry->mem_allowed, mem_allowed_ptr->bits,
+		       sizeof(__entry->mem_allowed));
+	),
+
+	TP_printk("comm=%s pid=%d tgid=%d ngid=%d mem_nodes_allowed=%*pbl",
+		  __entry->comm,
+		  __entry->pid,
+		  __entry->tgid,
+		  __entry->ngid,
+		  MAX_NUMNODES, __entry->mem_allowed)
+);
 #endif /* CONFIG_NUMA_BALANCING */
 
 /*
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index c9903b1b3948..cc892961ce15 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -3333,8 +3333,10 @@ static void task_numa_work(struct callback_head *work)
 	 * Memory is pinned to only one NUMA node via cpuset.mems, naturally
 	 * no page can be migrated.
 	 */
-	if (cpusets_enabled() && nodes_weight(cpuset_current_mems_allowed) == 1)
+	if (cpusets_enabled() && nodes_weight(cpuset_current_mems_allowed) == 1) {
+		trace_sched_skip_cpuset_numa(current, &cpuset_current_mems_allowed);
 		return;
+	}
 
 	if (!mm->numa_next_scan) {
 		mm->numa_next_scan = now +
-- 
2.43.5




* Re: [PATCH v4 2/2] sched/numa: Add tracepoint that tracks the skipping of numa balancing due to cpuset memory pinning
  2025-04-24  0:01 ` [PATCH v4 2/2] sched/numa: Add tracepoint that tracks the skipping of numa balancing due to cpuset memory pinning Libo Chen
@ 2025-04-24  0:18   ` Steven Rostedt
  2025-04-24  0:36     ` Libo Chen
  0 siblings, 1 reply; 9+ messages in thread
From: Steven Rostedt @ 2025-04-24  0:18 UTC (permalink / raw)
  To: Libo Chen
  Cc: akpm, peterz, mgorman, mingo, juri.lelli, vincent.guittot, tj,
	llong, sraithal, venkat88, kprateek.nayak, raghavendra.kt,
	yu.c.chen, tim.c.chen, vineethr, chris.hyser, daniel.m.jordan,
	lorenzo.stoakes, mkoutny, linux-mm, cgroups, linux-kernel

On Wed, 23 Apr 2025 17:01:46 -0700
Libo Chen <libo.chen@oracle.com> wrote:

> +++ b/include/trace/events/sched.h
> @@ -745,6 +745,37 @@ TRACE_EVENT(sched_skip_vma_numa,
>  		  __entry->vm_end,
>  		  __print_symbolic(__entry->reason, NUMAB_SKIP_REASON))
>  );
> +
> +TRACE_EVENT(sched_skip_cpuset_numa,
> +
> +	TP_PROTO(struct task_struct *tsk, nodemask_t *mem_allowed_ptr),
> +
> +	TP_ARGS(tsk, mem_allowed_ptr),
> +
> +	TP_STRUCT__entry(
> +		__array( char,		comm,		TASK_COMM_LEN		)
> +		__field( pid_t,		pid					)
> +		__field( pid_t,		tgid					)
> +		__field( pid_t,		ngid					)
> +		__array( unsigned long, mem_allowed, BITS_TO_LONGS(MAX_NUMNODES))
> +	),
> +
> +	TP_fast_assign(
> +		memcpy(__entry->comm, tsk->comm, TASK_COMM_LEN);
> +		__entry->pid		 = task_pid_nr(tsk);
> +		__entry->tgid		 = task_tgid_nr(tsk);
> +		__entry->ngid		 = task_numa_group_id(tsk);
> +		memcpy(__entry->mem_allowed, mem_allowed_ptr->bits,
> +		       sizeof(__entry->mem_allowed));

Is mem_allowed_ptr->bits guaranteed to be BITS_TO_LONGS(MAX_NUMNODES)
longs in size? If not, the memcpy will read beyond that size.

-- Steve


> +	),
> +
> +	TP_printk("comm=%s pid=%d tgid=%d ngid=%d mem_nodes_allowed=%*pbl",
> +		  __entry->comm,
> +		  __entry->pid,
> +		  __entry->tgid,
> +		  __entry->ngid,
> +		  MAX_NUMNODES, __entry->mem_allowed)
> +);



* Re: [PATCH v4 2/2] sched/numa: Add tracepoint that tracks the skipping of numa balancing due to cpuset memory pinning
  2025-04-24  0:18   ` Steven Rostedt
@ 2025-04-24  0:36     ` Libo Chen
  2025-04-24  1:01       ` Steven Rostedt
  0 siblings, 1 reply; 9+ messages in thread
From: Libo Chen @ 2025-04-24  0:36 UTC (permalink / raw)
  To: Steven Rostedt
  Cc: akpm, peterz, mgorman, mingo, juri.lelli, vincent.guittot, tj,
	llong, sraithal, venkat88, kprateek.nayak, raghavendra.kt,
	yu.c.chen, tim.c.chen, vineethr, chris.hyser, daniel.m.jordan,
	lorenzo.stoakes, mkoutny, linux-mm, cgroups, linux-kernel



On 4/23/25 17:18, Steven Rostedt wrote:
> On Wed, 23 Apr 2025 17:01:46 -0700
> Libo Chen <libo.chen@oracle.com> wrote:
> 
>> +++ b/include/trace/events/sched.h
>> @@ -745,6 +745,37 @@ TRACE_EVENT(sched_skip_vma_numa,
>>  		  __entry->vm_end,
>>  		  __print_symbolic(__entry->reason, NUMAB_SKIP_REASON))
>>  );
>> +
>> +TRACE_EVENT(sched_skip_cpuset_numa,
>> +
>> +	TP_PROTO(struct task_struct *tsk, nodemask_t *mem_allowed_ptr),
>> +
>> +	TP_ARGS(tsk, mem_allowed_ptr),
>> +
>> +	TP_STRUCT__entry(
>> +		__array( char,		comm,		TASK_COMM_LEN		)
>> +		__field( pid_t,		pid					)
>> +		__field( pid_t,		tgid					)
>> +		__field( pid_t,		ngid					)
>> +		__array( unsigned long, mem_allowed, BITS_TO_LONGS(MAX_NUMNODES))
>> +	),
>> +
>> +	TP_fast_assign(
>> +		memcpy(__entry->comm, tsk->comm, TASK_COMM_LEN);
>> +		__entry->pid		 = task_pid_nr(tsk);
>> +		__entry->tgid		 = task_tgid_nr(tsk);
>> +		__entry->ngid		 = task_numa_group_id(tsk);
>> +		memcpy(__entry->mem_allowed, mem_allowed_ptr->bits,
>> +		       sizeof(__entry->mem_allowed));
> 
> Is mem_allowed_ptr->bits guaranteed to be BITS_TO_LONGS(MAX_NUMNODES)
> longs in size? If not, the memcpy will read beyond that size.
> 

Yes, evidence can be found in the definitions of nodemask_t and DECLARE_BITMAP:

// include/linux/nodemask_types.h 
typedef struct { DECLARE_BITMAP(bits, MAX_NUMNODES); } nodemask_t;

// include/linux/types.h
#define DECLARE_BITMAP(name,bits) \
	unsigned long name[BITS_TO_LONGS(bits)]
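
For a concrete example, assume MAX_NUMNODES == 1024 on a 64-bit build
(illustrative values only). Then BITS_TO_LONGS(1024) == 16 and the
typedef expands to:

// bits[] is exactly BITS_TO_LONGS(MAX_NUMNODES) longs
typedef struct { unsigned long bits[16]; } nodemask_t;

which matches the size of the __array() field in the tracepoint, so the
memcpy stays within bounds.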



Thanks,
Libo
> -- Steve
> 
> 
>> +	),
>> +
>> +	TP_printk("comm=%s pid=%d tgid=%d ngid=%d mem_nodes_allowed=%*pbl",
>> +		  __entry->comm,
>> +		  __entry->pid,
>> +		  __entry->tgid,
>> +		  __entry->ngid,
>> +		  MAX_NUMNODES, __entry->mem_allowed)
>> +);
> 




* Re: [PATCH v4 2/2] sched/numa: Add tracepoint that tracks the skipping of numa balancing due to cpuset memory pinning
  2025-04-24  0:36     ` Libo Chen
@ 2025-04-24  1:01       ` Steven Rostedt
  2025-04-24  1:12         ` Libo Chen
  0 siblings, 1 reply; 9+ messages in thread
From: Steven Rostedt @ 2025-04-24  1:01 UTC (permalink / raw)
  To: Libo Chen
  Cc: akpm, peterz, mgorman, mingo, juri.lelli, vincent.guittot, tj,
	llong, sraithal, venkat88, kprateek.nayak, raghavendra.kt,
	yu.c.chen, tim.c.chen, vineethr, chris.hyser, daniel.m.jordan,
	lorenzo.stoakes, mkoutny, linux-mm, cgroups, linux-kernel

On Wed, 23 Apr 2025 17:36:30 -0700
Libo Chen <libo.chen@oracle.com> wrote:

> >> +	TP_fast_assign(
> >> +		memcpy(__entry->comm, tsk->comm, TASK_COMM_LEN);
> >> +		__entry->pid		 = task_pid_nr(tsk);
> >> +		__entry->tgid		 = task_tgid_nr(tsk);
> >> +		__entry->ngid		 = task_numa_group_id(tsk);
> >> +		memcpy(__entry->mem_allowed, mem_allowed_ptr->bits,
> >> +		       sizeof(__entry->mem_allowed));  
> > 
> > Is mem_allowed_ptr->bits guaranteed to be BITS_TO_LONGS(MAX_NUMNODES)
> > longs in size? If not, the memcpy will read beyond that size.
> >   
> 
> Yes, evidence can be found in the definitions of nodemask_t and DECLARE_BITMAP:
> 
> // include/linux/nodemask_types.h 
> typedef struct { DECLARE_BITMAP(bits, MAX_NUMNODES); } nodemask_t;
> 
> // include/linux/types.h
> #define DECLARE_BITMAP(name,bits) \
> 	unsigned long name[BITS_TO_LONGS(bits)]
> 

Hmm, I wonder then if we should add in TP_fast_assign():

	BUILD_BUG_ON(sizeof(nodemask_t) != BITS_TO_LONGS(MAX_NUMNODES) * sizeof(long));
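
Something like this, with the check placed first (an untested sketch of
the same TP_fast_assign):

	TP_fast_assign(
		/* Catch any change to the nodemask_t layout at build time. */
		BUILD_BUG_ON(sizeof(nodemask_t) !=
			     BITS_TO_LONGS(MAX_NUMNODES) * sizeof(long));
		memcpy(__entry->comm, tsk->comm, TASK_COMM_LEN);
		__entry->pid	= task_pid_nr(tsk);
		__entry->tgid	= task_tgid_nr(tsk);
		__entry->ngid	= task_numa_group_id(tsk);
		memcpy(__entry->mem_allowed, mem_allowed_ptr->bits,
		       sizeof(__entry->mem_allowed));
	),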

-- Steve



* Re: [PATCH v4 2/2] sched/numa: Add tracepoint that tracks the skipping of numa balancing due to cpuset memory pinning
  2025-04-24  1:01       ` Steven Rostedt
@ 2025-04-24  1:12         ` Libo Chen
  2025-04-24  1:33           ` Steven Rostedt
  0 siblings, 1 reply; 9+ messages in thread
From: Libo Chen @ 2025-04-24  1:12 UTC (permalink / raw)
  To: Steven Rostedt
  Cc: akpm, peterz, mgorman, mingo, juri.lelli, vincent.guittot, tj,
	llong, sraithal, venkat88, kprateek.nayak, raghavendra.kt,
	yu.c.chen, tim.c.chen, vineethr, chris.hyser, daniel.m.jordan,
	lorenzo.stoakes, mkoutny, linux-mm, cgroups, linux-kernel



On 4/23/25 18:01, Steven Rostedt wrote:
> On Wed, 23 Apr 2025 17:36:30 -0700
> Libo Chen <libo.chen@oracle.com> wrote:
> 
>>>> +	TP_fast_assign(
>>>> +		memcpy(__entry->comm, tsk->comm, TASK_COMM_LEN);
>>>> +		__entry->pid		 = task_pid_nr(tsk);
>>>> +		__entry->tgid		 = task_tgid_nr(tsk);
>>>> +		__entry->ngid		 = task_numa_group_id(tsk);
>>>> +		memcpy(__entry->mem_allowed, mem_allowed_ptr->bits,
>>>> +		       sizeof(__entry->mem_allowed));  
>>>
>>> Is mem_allowed_ptr->bits guaranteed to be BITS_TO_LONGS(MAX_NUMNODES)
>>> longs in size? If not, the memcpy will read beyond that size.
>>>   
>>
>> Yes, evidence can be found in the definitions of nodemask_t and DECLARE_BITMAP:
>>
>> // include/linux/nodemask_types.h 
>> typedef struct { DECLARE_BITMAP(bits, MAX_NUMNODES); } nodemask_t;
>>
>> // include/linux/types.h
>> #define DECLARE_BITMAP(name,bits) \
>> 	unsigned long name[BITS_TO_LONGS(bits)]
>>
> 
> Hmm, I wonder then if we should add in TP_fast_assign():
> 
> 	BUILD_BUG_ON(sizeof(nodemask_t) != BITS_TO_LONGS(MAX_NUMNODES) * sizeof(long));
> 

to guard against potential changes in the nodemask_t definition?


 
> -- Steve




* Re: [PATCH v4 2/2] sched/numa: Add tracepoint that tracks the skipping of numa balancing due to cpuset memory pinning
  2025-04-24  1:12         ` Libo Chen
@ 2025-04-24  1:33           ` Steven Rostedt
  2025-04-24  1:41             ` Libo Chen
  0 siblings, 1 reply; 9+ messages in thread
From: Steven Rostedt @ 2025-04-24  1:33 UTC (permalink / raw)
  To: Libo Chen
  Cc: akpm, peterz, mgorman, mingo, juri.lelli, vincent.guittot, tj,
	llong, sraithal, venkat88, kprateek.nayak, raghavendra.kt,
	yu.c.chen, tim.c.chen, vineethr, chris.hyser, daniel.m.jordan,
	lorenzo.stoakes, mkoutny, linux-mm, cgroups, linux-kernel

On Wed, 23 Apr 2025 18:12:55 -0700
Libo Chen <libo.chen@oracle.com> wrote:

> > Hmm, I wonder then if we should add in TP_fast_assign():
> > 
> > 	BUILD_BUG_ON(sizeof(nodemask_t) != BITS_TO_LONGS(MAX_NUMNODES) * sizeof(long));
> >   
> 
> to guard against potential changes in the nodemask_t definition?

Correct.

Whenever there's an implicit dependency like this, where a change elsewhere
could cause a bug in the kernel, it's always better to have a build-time
check that catches it before it becomes an issue.

-- Steve



* Re: [PATCH v4 2/2] sched/numa: Add tracepoint that tracks the skipping of numa balancing due to cpuset memory pinning
  2025-04-24  1:33           ` Steven Rostedt
@ 2025-04-24  1:41             ` Libo Chen
  0 siblings, 0 replies; 9+ messages in thread
From: Libo Chen @ 2025-04-24  1:41 UTC (permalink / raw)
  To: Steven Rostedt
  Cc: akpm, peterz, mgorman, mingo, juri.lelli, vincent.guittot, tj,
	llong, sraithal, venkat88, kprateek.nayak, raghavendra.kt,
	yu.c.chen, tim.c.chen, vineethr, chris.hyser, daniel.m.jordan,
	lorenzo.stoakes, mkoutny, linux-mm, cgroups, linux-kernel



On 4/23/25 18:33, Steven Rostedt wrote:
> On Wed, 23 Apr 2025 18:12:55 -0700
> Libo Chen <libo.chen@oracle.com> wrote:
> 
>>> Hmm, I wonder then if we should add in TP_fast_assign():
>>>
>>> 	BUILD_BUG_ON(sizeof(nodemask_t) != BITS_TO_LONGS(MAX_NUMNODES) * sizeof(long));
>>>   
>>
>> to guard against potential changes in the nodemask_t definition?
> 
> Correct.
> 
> Whenever there's an implicit dependency like this, where a change elsewhere
> could cause a bug in the kernel, it's always better to have a build-time
> check that catches it before it becomes an issue.
> 

Okay, that's reasonable. I will add it~





end of thread, other threads:[~2025-04-24  1:42 UTC | newest]

Thread overview: 9+ messages
2025-04-24  0:01 [PATCH v4 0/2] sched/numa: Skip VMA scanning on memory pinned to one NUMA node via cpuset.mems Libo Chen
2025-04-24  0:01 ` [PATCH v4 1/2] sched/numa: Skip VMA scanning on memory pinned to one NUMA node via cpuset.mems Libo Chen
2025-04-24  0:01 ` [PATCH v4 2/2] sched/numa: Add tracepoint that tracks the skipping of numa balancing due to cpuset memory pinning Libo Chen
2025-04-24  0:18   ` Steven Rostedt
2025-04-24  0:36     ` Libo Chen
2025-04-24  1:01       ` Steven Rostedt
2025-04-24  1:12         ` Libo Chen
2025-04-24  1:33           ` Steven Rostedt
2025-04-24  1:41             ` Libo Chen
