linux-mm.kvack.org archive mirror
* [RFC PATCH V1 0/2] sched/numa: Disjoint set vma scan improvements
@ 2023-05-03  2:05 Raghavendra K T
  2023-05-03  2:05 ` [RFC PATCH V1 1/2] sched/numa: Introduce per vma scan counter Raghavendra K T
  2023-05-03  2:05 ` [RFC PATCH V1 2/2] sched/numa: Introduce per vma numa_scan_seq Raghavendra K T
  0 siblings, 2 replies; 4+ messages in thread
From: Raghavendra K T @ 2023-05-03  2:05 UTC (permalink / raw)
  To: linux-kernel, linux-mm
  Cc: Ingo Molnar, Peter Zijlstra, Mel Gorman, Andrew Morton,
	David Hildenbrand, rppt, Juri Lelli, Vincent Guittot,
	Bharata B Rao, Raghavendra K T

With the numa scan enhancements [1], only the threads which had previously
accessed a vma are allowed to scan it.
While this has significantly reduced system time overhead, there are corner
cases which genuinely need some relaxation, e.g., the concern raised by
PeterZ that unfairness amongst the threads belonging to disjoint sets of VMAs
can potentially amplify the side effects of vma regions belonging to some of
the tasks being left unscanned.

Currently that is handled by allowing first two scans at mm level
(mm->numa_scan_seq) unconditionally.

One of the tests that exercises a similar side effect is numa01_THREAD_ALLOC,
where allocation happens in the main thread and the memory is divided into
24MB chunks to be continuously bzeroed.

(This variant is run by default in LKP tests, while mmtests runs numa01 by
default, where each thread operates on the full 3GB region.)

So to address this issue, the proposal here is to:
1) Have a per vma scan counter that gets incremented for every successful scan
(which potentially scans 256MB, or sysctl_scan_size)
2) Scan unconditionally for the first few times (to be precise, half of the
  window calculated for normal scanning otherwise)
3) Reset the counter at the vma level when the whole mm has been scanned
(this needs remembering mm->numa_scan_seq per vma)

With this patch I am seeing a good improvement in the numa01_THREAD_ALLOC
case, but please note that [1] drastically decreased system time when the
benchmarks were run; this patch adds back some of that system time.

Your comments/ideas are welcome.

Result:
SUT: Milan w/ 2 numa nodes 256 cpus

Manual run of numa01_THREAD_ALLOC
Base 11-apr-next
                        w/numascan     w/o numascan   numascan+patch

real                    1m33.579s      1m2.042s       1m11.738s
user                    280m46.032s    213m38.647s    231m40.226s
sys                     0m18.061s      6m54.963s      4m43.174s

numa_hit                5813057        6166060        6146064
numa_local              5812546        6165471        6145573
numa_other              511            589            491
numa_pte_updates        0              2098276        1248398
numa_hint_faults        10             1768382        982034
numa_hint_faults_local  10             981824         625424
numa_pages_migrated     0              786558         356604

Below is the mmtest kernbench and autonuma performance

kernbench
===========
Base 11-apr-next
			w/numascan      	w/o numascan    	numascan+patch

Amean     user-256    23873.01 (   0.00%)    23688.21 *   0.77%*    23948.47 *  -0.32%*
Amean     syst-256     4990.73 (   0.00%)     5113.32 *  -2.46%*     4800.86 *   3.80%*
Amean     elsp-256      150.67 (   0.00%)      150.52 *   0.10%*      150.63 *   0.03%*

Duration User       71628.53    71074.04    71855.31
Duration System     14985.61    15354.33    14416.72
Duration Elapsed      472.69      473.24      473.72

Ops NUMA alloc hit                1739476674.00  1739443601.00  1739591558.00
Ops NUMA alloc local              1739534231.00  1739519795.00  1739647666.00
Ops NUMA base-page range updates      485073.00      673766.00      733129.00
Ops NUMA PTE updates                  485073.00      673766.00      733129.00
Ops NUMA hint faults                  107776.00      181920.00      186250.00
Ops NUMA hint local faults %            1789.00        6165.00       10889.00
Ops NUMA hint local percent                1.66           3.39           5.85
Ops NUMA pages migrated               105987.00      175755.00      175356.00
Ops AutoNUMA cost                        544.29         917.66         939.71

autonumabench
===============
					 w/numascan      	w/o numascan    	numascan+patch
Amean     syst-NUMA01                   33.10 (   0.00%)      571.68 *-1627.21%*      219.51 *-563.21%*
Amean     syst-NUMA01_THREADLOCAL        0.23 (   0.00%)        0.22 *   4.38%*        0.22 *   5.00%*
Amean     syst-NUMA02                    0.81 (   0.00%)        0.75 *   7.76%*        0.76 *   6.00%*
Amean     syst-NUMA02_SMT                0.68 (   0.00%)        0.73 *  -7.79%*        0.65 *   3.58%*
Amean     elsp-NUMA01                  299.71 (   0.00%)      333.24 * -11.19%*      329.60 *  -9.97%*
Amean     elsp-NUMA01_THREADLOCAL        1.06 (   0.00%)        1.06 *   0.00%*        1.06 *  -0.68%*
Amean     elsp-NUMA02                    3.29 (   0.00%)        3.23 *   1.95%*        3.18 *   3.51%*
Amean     elsp-NUMA02_SMT                3.75 (   0.00%)        3.38 *   9.86%*        3.79 *  -0.95%*

Duration User      321693.29   437210.09   376657.80
Duration System       244.25     4014.23     1548.57
Duration Elapsed     2165.83     2395.53     2373.46


Ops NUMA alloc hit                  49608099.00    62272320.00    55815229.00
Ops NUMA alloc local                49585747.00    62236996.00    55812601.00
Ops NUMA base-page range updates        1571.00   202868357.00    96006221.00
Ops NUMA PTE updates                    1571.00   202868357.00    96006221.00
Ops NUMA hint faults                    1203.00   204902318.00    97246909.00
Ops NUMA hint local faults %             981.00   187233695.00    81136933.00
Ops NUMA hint local percent               81.55          91.38          83.43
Ops NUMA pages migrated                  222.00    10011134.00     6060787.00
Ops AutoNUMA cost                          6.03     1026121.88      487021.74

Notes: implementations considered/tried
1) Limit the disjoint set vma scans to 4 (hardcoded) = 1GB per whole mm scan
2) The current PID reset window of 4 * sysctl_scan_delay is changed to
8 * sysctl_scan_delay (to ensure some random overlap over time in scanning)


links:
[1] https://lore.kernel.org/lkml/cover.1677672277.git.raghavendra.kt@amd.com/T/#t

Raghavendra K T (2):
  sched/numa: Introduce per vma scan counter
  sched/numa: Introduce per vma numa_scan_seq

 include/linux/mm_types.h |  2 ++
 kernel/sched/fair.c      | 44 +++++++++++++++++++++++++++++++++++++---
 2 files changed, 43 insertions(+), 3 deletions(-)

-- 
2.34.1


^ permalink raw reply	[flat|nested] 4+ messages in thread

* [RFC PATCH V1 1/2] sched/numa: Introduce per vma scan counter
  2023-05-03  2:05 [RFC PATCH V1 0/2] sched/numa: Disjoint set vma scan improvements Raghavendra K T
@ 2023-05-03  2:05 ` Raghavendra K T
  2023-05-03 17:42   ` Raghavendra K T
  2023-05-03  2:05 ` [RFC PATCH V1 2/2] sched/numa: Introduce per vma numa_scan_seq Raghavendra K T
  1 sibling, 1 reply; 4+ messages in thread
From: Raghavendra K T @ 2023-05-03  2:05 UTC (permalink / raw)
  To: linux-kernel, linux-mm
  Cc: Ingo Molnar, Peter Zijlstra, Mel Gorman, Andrew Morton,
	David Hildenbrand, rppt, Juri Lelli, Vincent Guittot,
	Bharata B Rao, Raghavendra K T

With the recent numa scan enhancements, only the tasks which had
previously accessed a vma are allowed to scan it.

While this has significantly reduced system time overhead, there are
corner cases which genuinely need some relaxation, e.g., the concern
raised by PeterZ that unfairness amongst the threads belonging to
disjoint sets of VMAs can potentially amplify the side effects of vma
regions belonging to some of the tasks being left unscanned.

To address this, allow scanning for the first few times with a per vma
counter.

Signed-off-by: Raghavendra K T <raghavendra.kt@amd.com>
---
 include/linux/mm_types.h |  1 +
 kernel/sched/fair.c      | 30 +++++++++++++++++++++++++++---
 2 files changed, 28 insertions(+), 3 deletions(-)

diff --git a/include/linux/mm_types.h b/include/linux/mm_types.h
index 3fc9e680f174..f66e6b4e0620 100644
--- a/include/linux/mm_types.h
+++ b/include/linux/mm_types.h
@@ -479,6 +479,7 @@ struct vma_numab_state {
 	unsigned long next_scan;
 	unsigned long next_pid_reset;
 	unsigned long access_pids[2];
+	unsigned int scan_counter;
 };
 
 /*
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index a29ca11bead2..3c50dc3893eb 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -2928,19 +2928,38 @@ static void reset_ptenuma_scan(struct task_struct *p)
 	p->mm->numa_scan_offset = 0;
 }
 
+/* Scan 1GB or 4 * scan_size */
+#define VMA_DISJOINT_SET_ACCESS_THRESH		4U
+
 static bool vma_is_accessed(struct vm_area_struct *vma)
 {
 	unsigned long pids;
+	unsigned int windows;
+	unsigned int scan_size = READ_ONCE(sysctl_numa_balancing_scan_size);
+
+	if (scan_size < MAX_SCAN_WINDOW)
+		windows = MAX_SCAN_WINDOW / scan_size;
+
+	/* Allow only half of the windows for disjoint set cases */
+	windows /= 2;
+
+	windows = max(VMA_DISJOINT_SET_ACCESS_THRESH, windows);
+
 	/*
-	 * Allow unconditional access first two times, so that all the (pages)
-	 * of VMAs get prot_none fault introduced irrespective of accesses.
+	 * Make sure to allow scanning of disjoint vma set for the first
+	 * few times.
+	 * OR At mm level allow unconditional access first two times, so that
+	 * all the (pages) of VMAs get prot_none fault introduced irrespective
+	 * of accesses.
 	 * This is also done to avoid any side effect of task scanning
 	 * amplifying the unfairness of disjoint set of VMAs' access.
 	 */
-	if (READ_ONCE(current->mm->numa_scan_seq) < 2)
+	if (READ_ONCE(vma->numab_state->scan_counter) < windows ||
+		READ_ONCE(current->mm->numa_scan_seq) < 2)
 		return true;
 
 	pids = vma->numab_state->access_pids[0] | vma->numab_state->access_pids[1];
+
 	return test_bit(hash_32(current->pid, ilog2(BITS_PER_LONG)), &pids);
 }
 
@@ -3058,6 +3077,8 @@ static void task_numa_work(struct callback_head *work)
 			/* Reset happens after 4 times scan delay of scan start */
 			vma->numab_state->next_pid_reset =  vma->numab_state->next_scan +
 				msecs_to_jiffies(VMA_PID_RESET_PERIOD);
+
+			WRITE_ONCE(vma->numab_state->scan_counter, 0);
 		}
 
 		/*
@@ -3084,6 +3105,9 @@ static void task_numa_work(struct callback_head *work)
 			vma->numab_state->access_pids[1] = 0;
 		}
 
+		WRITE_ONCE(vma->numab_state->scan_counter,
+				READ_ONCE(vma->numab_state->scan_counter) + 1);
+
 		do {
 			start = max(start, vma->vm_start);
 			end = ALIGN(start + (pages << PAGE_SHIFT), HPAGE_SIZE);
-- 
2.34.1



* [RFC PATCH V1 2/2] sched/numa: Introduce per vma numa_scan_seq
  2023-05-03  2:05 [RFC PATCH V1 0/2] sched/numa: Disjoint set vma scan improvements Raghavendra K T
  2023-05-03  2:05 ` [RFC PATCH V1 1/2] sched/numa: Introduce per vma scan counter Raghavendra K T
@ 2023-05-03  2:05 ` Raghavendra K T
  1 sibling, 0 replies; 4+ messages in thread
From: Raghavendra K T @ 2023-05-03  2:05 UTC (permalink / raw)
  To: linux-kernel, linux-mm
  Cc: Ingo Molnar, Peter Zijlstra, Mel Gorman, Andrew Morton,
	David Hildenbrand, rppt, Juri Lelli, Vincent Guittot,
	Bharata B Rao, Raghavendra K T

A per vma scan counter was introduced to aid disjoint set vma
scanning in corner cases. But that counter needs regular resetting.

The reset is achieved after a full round of mm scanning, using a per vma
vma_scan_seq that follows mm->numa_scan_seq.

Result: with this patch series we recover mmtests'
numa01_THREAD_ALLOC performance as below

Base 11-apr-next
        w/numascan      w/o numascan    numascan+patch

real    1m33.579s       1m2.042s        1m11.738s
user    280m46.032s     213m38.647s     231m40.226s
sys     0m18.061s       6m54.963s       4m43.174s

In summary: it adds back some system overhead of scanning disjoint
vma sets, but we are still at a huge advantage w.r.t. the base kernel.

Signed-off-by: Raghavendra K T <raghavendra.kt@amd.com>
---
 include/linux/mm_types.h |  1 +
 kernel/sched/fair.c      | 18 ++++++++++++++++--
 2 files changed, 17 insertions(+), 2 deletions(-)

diff --git a/include/linux/mm_types.h b/include/linux/mm_types.h
index f66e6b4e0620..9c0fc83118da 100644
--- a/include/linux/mm_types.h
+++ b/include/linux/mm_types.h
@@ -479,6 +479,7 @@ struct vma_numab_state {
 	unsigned long next_scan;
 	unsigned long next_pid_reset;
 	unsigned long access_pids[2];
+	unsigned int vma_scan_seq;
 	unsigned int scan_counter;
 };
 
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 3c50dc3893eb..dc011a2a31ac 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -2935,6 +2935,7 @@ static bool vma_is_accessed(struct vm_area_struct *vma)
 {
 	unsigned long pids;
 	unsigned int windows;
+	unsigned int mm_seq, vma_seq;
 	unsigned int scan_size = READ_ONCE(sysctl_numa_balancing_scan_size);
 
 	if (scan_size < MAX_SCAN_WINDOW)
@@ -2945,6 +2946,18 @@ static bool vma_is_accessed(struct vm_area_struct *vma)
 
 	windows = max(VMA_DISJOINT_SET_ACCESS_THRESH, windows);
 
+	mm_seq = READ_ONCE(current->mm->numa_scan_seq);
+	vma_seq = READ_ONCE(vma->numab_state->vma_scan_seq);
+
+	if (vma_seq != mm_seq) {
+	/*
+	 * One more round of whole mm scan was done. Reset the vma scan_counter
+	 * and sync per vma numa_scan_seq.
+	 */
+		WRITE_ONCE(vma->numab_state->vma_scan_seq,
+					READ_ONCE(current->mm->numa_scan_seq));
+		WRITE_ONCE(vma->numab_state->scan_counter, 0);
+	}
 	/*
 	 * Make sure to allow scanning of disjoint vma set for the first
 	 * few times.
@@ -2954,8 +2967,7 @@ static bool vma_is_accessed(struct vm_area_struct *vma)
 	 * This is also done to avoid any side effect of task scanning
 	 * amplifying the unfairness of disjoint set of VMAs' access.
 	 */
-	if (READ_ONCE(vma->numab_state->scan_counter) < windows ||
-		READ_ONCE(current->mm->numa_scan_seq) < 2)
+	if (READ_ONCE(vma->numab_state->scan_counter) < windows || mm_seq < 2)
 		return true;
 
 	pids = vma->numab_state->access_pids[0] | vma->numab_state->access_pids[1];
@@ -3078,6 +3090,8 @@ static void task_numa_work(struct callback_head *work)
 			vma->numab_state->next_pid_reset =  vma->numab_state->next_scan +
 				msecs_to_jiffies(VMA_PID_RESET_PERIOD);
 
+			WRITE_ONCE(vma->numab_state->vma_scan_seq,
+					READ_ONCE(current->mm->numa_scan_seq));
 			WRITE_ONCE(vma->numab_state->scan_counter, 0);
 		}
 
-- 
2.34.1



* Re: [RFC PATCH V1 1/2] sched/numa: Introduce per vma scan counter
  2023-05-03  2:05 ` [RFC PATCH V1 1/2] sched/numa: Introduce per vma scan counter Raghavendra K T
@ 2023-05-03 17:42   ` Raghavendra K T
  0 siblings, 0 replies; 4+ messages in thread
From: Raghavendra K T @ 2023-05-03 17:42 UTC (permalink / raw)
  To: linux-kernel, linux-mm
  Cc: Ingo Molnar, Peter Zijlstra, Mel Gorman, Andrew Morton,
	David Hildenbrand, rppt, Juri Lelli, Vincent Guittot,
	Bharata B Rao

On 5/3/2023 7:35 AM, Raghavendra K T wrote:
> With the recent numa scan enhancements, only the tasks which had
> previously accessed a vma are allowed to scan it.
> 
> While this has significantly reduced system time overhead, there are
> corner cases which genuinely need some relaxation, e.g., the concern
> raised by PeterZ that unfairness amongst the threads belonging to
> disjoint sets of VMAs can potentially amplify the side effects of vma
> regions belonging to some of the tasks being left unscanned.
> 
> To address this, allow scanning for first few times with a per vma
> counter.
> 
> Signed-off-by: Raghavendra K T <raghavendra.kt@amd.com>
> ---

Some clarification:
the base was linux-next-20230411 (because I have some issue with
linux-next-20230425 onwards and the linux master branch, which I am still
digging into).


>   include/linux/mm_types.h |  1 +
>   kernel/sched/fair.c      | 30 +++++++++++++++++++++++++++---
>   2 files changed, 28 insertions(+), 3 deletions(-)
> 
> diff --git a/include/linux/mm_types.h b/include/linux/mm_types.h
> index 3fc9e680f174..f66e6b4e0620 100644
> --- a/include/linux/mm_types.h
> +++ b/include/linux/mm_types.h
> @@ -479,6 +479,7 @@ struct vma_numab_state {
>   	unsigned long next_scan;
>   	unsigned long next_pid_reset;
>   	unsigned long access_pids[2];
> +	unsigned int scan_counter;
>   };
>   
>   /*
> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> index a29ca11bead2..3c50dc3893eb 100644
> --- a/kernel/sched/fair.c
> +++ b/kernel/sched/fair.c
> @@ -2928,19 +2928,38 @@ static void reset_ptenuma_scan(struct task_struct *p)
>   	p->mm->numa_scan_offset = 0;
>   }
>   
> +/* Scan 1GB or 4 * scan_size */
> +#define VMA_DISJOINT_SET_ACCESS_THRESH		4U
> +
>   static bool vma_is_accessed(struct vm_area_struct *vma)
>   {
>   	unsigned long pids;
> +	unsigned int windows;

Missed "windows = 0" while splitting the patch;
it will be corrected in the next posting.

/me Remembered after kernel test robot noticed
[...]


^ permalink raw reply	[flat|nested] 4+ messages in thread
