From: Raghavendra K T <raghavendra.kt@amd.com>
To: Mel Gorman <mgorman@techsingularity.net>,
Peter Zijlstra <peterz@infradead.org>
Cc: K Prateek Nayak <kprateek.nayak@amd.com>,
Bharata B Rao <bharata@amd.com>, Ingo Molnar <mingo@redhat.com>,
LKML <linux-kernel@vger.kernel.org>,
Linux-MM <linux-mm@kvack.org>
Subject: Re: [PATCH 6/6] sched/numa: Complete scanning of inactive VMAs when there is no alternative
Date: Tue, 10 Oct 2023 17:12:17 +0530
Message-ID: <18236c47-8673-c6ed-0f1b-3e1d1ae4f8de@amd.com>
In-Reply-To: <20231010083143.19593-7-mgorman@techsingularity.net>
On 10/10/2023 2:01 PM, Mel Gorman wrote:
> VMAs are skipped if there is no recent fault activity but this represents
> a chicken-and-egg problem as there may be no fault activity if the PTEs
> are never updated to trap NUMA hints. There is an indirect reliance on
> scanning to be forced early in the lifetime of a task but this may fail
> to detect changes in phase behaviour. Force inactive VMAs to be scanned
> when all other eligible VMAs have been updated within the same scan
> sequence.
>
> Test results in general look good with some changes in performance, both
> negative and positive, depending on whether the additional scanning and
> faulting was beneficial or not to the workload. The autonuma benchmark
> workload NUMA01_THREADLOCAL was picked for closer examination. The workload
> creates two processes with numerous threads and thread-local storage that
> is zero-filled in a loop. It exercises the corner case where unrelated
> threads may skip VMAs that are thread-local to another thread and still
> has some VMAs that are inactive while the workload executes.
>
> The VMA skipping activity frequency with and without the patch is as
> follows;
>
> 6.6.0-rc2-sched-numabtrace-v1
> 649 reason=scan_delay
> 9094 reason=unsuitable
> 48915 reason=shared_ro
> 143919 reason=inaccessible
> 193050 reason=pid_inactive
>
> 6.6.0-rc2-sched-numabselective-v1
> 146 reason=seq_completed
> 622 reason=ignore_pid_inactive
> 624 reason=scan_delay
> 6570 reason=unsuitable
> 16101 reason=shared_ro
> 27608 reason=inaccessible
> 41939 reason=pid_inactive
>
> Note that with the patch applied, the PID activity is ignored
> (ignore_pid_inactive) to ensure a VMA with some activity is completely
> scanned. In addition, a small number of VMAs are scanned when no other
> eligible VMA is available during a single scan window (seq_completed).
> The number of times a VMA is skipped due to no PID activity from the
> scanning task (pid_inactive) drops dramatically. It is expected that
> this will increase the number of PTEs updated for NUMA hinting faults
> as well as hinting faults but these represent PTEs that would otherwise
> have been missed. The tradeoff is scan+fault overhead versus improving
> locality due to migration.
>
> On a 2-socket Cascade Lake test machine, the time to complete the
> workload is as follows;
>
> 6.6.0-rc2 6.6.0-rc2
> sched-numabtrace-v1 sched-numabselective-v1
> Min elsp-NUMA01_THREADLOCAL 174.22 ( 0.00%) 117.64 ( 32.48%)
> Amean elsp-NUMA01_THREADLOCAL 175.68 ( 0.00%) 123.34 * 29.79%*
> Stddev elsp-NUMA01_THREADLOCAL 1.20 ( 0.00%) 4.06 (-238.20%)
> CoeffVar elsp-NUMA01_THREADLOCAL 0.68 ( 0.00%) 3.29 (-381.70%)
> Max elsp-NUMA01_THREADLOCAL 177.18 ( 0.00%) 128.03 ( 27.74%)
>
> The time to complete the workload is reduced by almost 30%
>
> 6.6.0-rc2 6.6.0-rc2
> sched-numabtrace-v1 sched-numabselective-v1
> Duration User 91201.80 63506.64
> Duration System 2015.53 1819.78
> Duration Elapsed 1234.77 868.37
>
> In this specific case, system CPU time was not increased but it's not
> universally true.
>
> From vmstat, the NUMA scanning and fault activity is as follows;
>
> 6.6.0-rc2 6.6.0-rc2
> sched-numabtrace-v1 sched-numabselective-v1
> Ops NUMA base-page range updates 64272.00 26374386.00
> Ops NUMA PTE updates 36624.00 55538.00
> Ops NUMA PMD updates 54.00 51404.00
> Ops NUMA hint faults 15504.00 75786.00
> Ops NUMA hint local faults % 14860.00 56763.00
> Ops NUMA hint local percent 95.85 74.90
> Ops NUMA pages migrated 1629.00 6469222.00
>
> Both the number of PTE updates and hint faults are dramatically
> increased. While this is superficially unfortunate, it represents
> ranges that were simply skipped without the patch. As a result
> of the scanning and hinting faults, many more pages were also
> migrated but as the time to completion is reduced, the overhead
> is offset by the gain.
>
> Signed-off-by: Mel Gorman <mgorman@techsingularity.net>
> ---
> include/linux/mm_types.h | 6 +++
> include/linux/sched/numa_balancing.h | 1 +
> include/trace/events/sched.h | 3 +-
> kernel/sched/fair.c | 55 ++++++++++++++++++++++++++--
> 4 files changed, 61 insertions(+), 4 deletions(-)
>
> diff --git a/include/linux/mm_types.h b/include/linux/mm_types.h
> index 8cb1dec3e358..a123c1a58617 100644
> --- a/include/linux/mm_types.h
> +++ b/include/linux/mm_types.h
> @@ -578,6 +578,12 @@ struct vma_numab_state {
> * VMA_PID_RESET_PERIOD
> * jiffies.
> */
> + int prev_scan_seq; /* MM scan sequence ID when
> + * the VMA was last completely
> + * scanned. A VMA is not
> + * eligible for scanning if
> + * prev_scan_seq == numa_scan_seq
> + */
> };
>
> /*
> diff --git a/include/linux/sched/numa_balancing.h b/include/linux/sched/numa_balancing.h
> index 7dcc0bdfddbb..b69afb8630db 100644
> --- a/include/linux/sched/numa_balancing.h
> +++ b/include/linux/sched/numa_balancing.h
> @@ -22,6 +22,7 @@ enum numa_vmaskip_reason {
> NUMAB_SKIP_SCAN_DELAY,
> NUMAB_SKIP_PID_INACTIVE,
> NUMAB_SKIP_IGNORE_PID,
> + NUMAB_SKIP_SEQ_COMPLETED,
> };
>
> #ifdef CONFIG_NUMA_BALANCING
> diff --git a/include/trace/events/sched.h b/include/trace/events/sched.h
> index 27b51c81b106..010ba1b7cb0e 100644
> --- a/include/trace/events/sched.h
> +++ b/include/trace/events/sched.h
> @@ -671,7 +671,8 @@ DEFINE_EVENT(sched_numa_pair_template, sched_swap_numa,
> EM( NUMAB_SKIP_INACCESSIBLE, "inaccessible" ) \
> EM( NUMAB_SKIP_SCAN_DELAY, "scan_delay" ) \
> EM( NUMAB_SKIP_PID_INACTIVE, "pid_inactive" ) \
> - EMe(NUMAB_SKIP_IGNORE_PID, "ignore_pid_inactive" )
> + EM( NUMAB_SKIP_IGNORE_PID, "ignore_pid_inactive" ) \
> + EMe(NUMAB_SKIP_SEQ_COMPLETED, "seq_completed" )
>
> /* Redefine for export. */
> #undef EM
> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> index 150f01948ec6..72ef60f394ba 100644
> --- a/kernel/sched/fair.c
> +++ b/kernel/sched/fair.c
> @@ -3175,6 +3175,8 @@ static void task_numa_work(struct callback_head *work)
> unsigned long nr_pte_updates = 0;
> long pages, virtpages;
> struct vma_iterator vmi;
> + bool vma_pids_skipped;
> + bool vma_pids_forced = false;
>
> SCHED_WARN_ON(p != container_of(work, struct task_struct, numa_work));
>
> @@ -3217,7 +3219,6 @@ static void task_numa_work(struct callback_head *work)
> */
> p->node_stamp += 2 * TICK_NSEC;
>
> - start = mm->numa_scan_offset;
> pages = sysctl_numa_balancing_scan_size;
> pages <<= 20 - PAGE_SHIFT; /* MB in pages */
> virtpages = pages * 8; /* Scan up to this much virtual space */
> @@ -3227,6 +3228,16 @@ static void task_numa_work(struct callback_head *work)
>
> if (!mmap_read_trylock(mm))
> return;
> +
> + /*
> + * VMAs are skipped if the current PID has not trapped a fault within
> + * the VMA recently. Allow scanning to be forced if there is no
> + * suitable VMA remaining.
> + */
> + vma_pids_skipped = false;
> +
> +retry_pids:
> + start = mm->numa_scan_offset;
> vma_iter_init(&vmi, mm, start);
> vma = vma_next(&vmi);
> if (!vma) {
> @@ -3277,6 +3288,13 @@ static void task_numa_work(struct callback_head *work)
> /* Reset happens after 4 times scan delay of scan start */
> vma->numab_state->pids_active_reset = vma->numab_state->next_scan +
> msecs_to_jiffies(VMA_PID_RESET_PERIOD);
> +
> + /*
> + * Ensure prev_scan_seq does not match numa_scan_seq
> + * to prevent VMAs being skipped prematurely on the
> + * first scan.
> + */
> + vma->numab_state->prev_scan_seq = mm->numa_scan_seq - 1;
nit:
Perhaps vma->numab_state->prev_scan_seq = -1 would also have worked, but
it does not matter.
> }