From: Bharata B Rao <bharata@amd.com>
To: Raghavendra K T <raghavendra.kt@amd.com>,
	linux-kernel@vger.kernel.org, linux-mm@kvack.org
Cc: Ingo Molnar <mingo@redhat.com>,
	Peter Zijlstra <peterz@infradead.org>,
	Mel Gorman <mgorman@suse.de>,
	Andrew Morton <akpm@linux-foundation.org>,
	David Hildenbrand <david@redhat.com>,
	rppt@kernel.org, Juri Lelli <juri.lelli@redhat.com>,
	Vincent Guittot <vincent.guittot@linaro.org>,
	Aithal Srikanth <sraithal@amd.com>,
	kernel test robot <oliver.sang@intel.com>
Subject: Re: [RFC PATCH V2 1/1] sched/numa: Fix disjoint set vma scan regression
Date: Fri, 19 May 2023 13:26:43 +0530	[thread overview]
Message-ID: <226b9290-deea-53f5-54b3-42ee52c67e1d@amd.com>
In-Reply-To: <b0a8f3490b491d4fd003c3e0493e940afaea5f2c.1684228065.git.raghavendra.kt@amd.com>

On 16-May-23 2:49 PM, Raghavendra K T wrote:
> With the numa scan enhancements [1], only threads that had previously
> accessed a VMA are allowed to scan it.
> 
> While this significantly reduced system time overhead, there are corner
> cases which genuinely need some relaxation. For example:
> 
> 1) A concern raised by PeterZ: if the VMAs of tasks form N disjoint
> partition sets, unfairness in deciding which threads may scan could
> amplify the side effect of some VMAs being left unscanned, e.g. a VMA
> whose accessing threads rarely get to run the scan work never receives
> prot_none faults.
> 
> 2) The LKP numa01 benchmark regression reported below.
> 
> Currently this is handled by allowing the first two scans unconditionally,
> as indicated by mm->numa_scan_seq. This is imprecise, since for some
> benchmarks VMA scanning might itself start only at numa_scan_seq > 2.
> 
> Solution:
> Allow unconditional scanning of a task's VMAs, with the number of such
> scans depending on the VMA size. This is achieved by maintaining a
> per-VMA scan counter, where
> 
> f(allowed_to_scan) = f(scan_counter < vma_size / scan_size)
> 
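
To make the threshold arithmetic concrete, here is my condensed model of
the gating logic (illustrative only, not the patch itself; it assumes the
default numa_balancing_scan_size_mb of 256, and allowed_to_scan() is just
a standalone sketch, not a kernel function):

	#include <stdbool.h>

	/*
	 * Sketch: how many unconditional scans a VMA gets. Mirrors the
	 * vma_is_accessed() threshold in the hunk below, with the sysctl
	 * value hardcoded.
	 */
	static bool allowed_to_scan(unsigned long vma_bytes,
				    unsigned int scan_counter)
	{
		unsigned int scan_size = 256;		 /* MB; default sysctl */
		unsigned int vma_size = vma_bytes >> 20; /* VMA size in MB */
		unsigned int scan_threshold = vma_size / scan_size;

		/*
		 * Half the scans needed to cover the VMA, minimum one:
		 * a 3GB VMA gives 3072 / 256 = 12, so 1 + (12 >> 1) = 7
		 * unconditional scans before the pid-access check gates.
		 */
		scan_threshold = 1 + (scan_threshold >> 1);

		return scan_counter <= scan_threshold;
	}
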
> Fixes: fc137c0ddab2 ("sched/numa: enhance vma scanning logic")
> 
> Result:
> numa01_THREAD_ALLOC result on 6.4.0-rc1 (which has the numascan enhancement):
>                 base-numascan           base                    base+fix
> real            1m3.025s                1m24.163s               1m3.551s
> user            213m44.232s             251m3.638s              219m55.662s
> sys             6m26.598s               0m13.056s               2m35.767s
> 
> numa_hit                5478165         4395752         4907431
> numa_local              5478103         4395366         4907044
> numa_other                   62             386             387
> numa_pte_updates        1989274           11606         1265014
> numa_hint_faults        1756059             515         1135804
> numa_hint_faults_local   971500             486          558076
> numa_pages_migrated      784211              29          577728
> 
> Summary: The regression in base is recovered by allowing scanning as
> required: base avoids scan overhead (low sys time) at the cost of locality
> (high real time), while base+fix restores near base-numascan real time
> with well under half its system time.
> 
> [1] https://lore.kernel.org/lkml/cover.1677672277.git.raghavendra.kt@amd.com/T/#t
> 
> Reported-by: Aithal Srikanth <sraithal@amd.com>
> Reported-by: kernel test robot <oliver.sang@intel.com>
> Closes: https://lore.kernel.org/lkml/db995c11-08ba-9abf-812f-01407f70a5d4@amd.com/T/
> Signed-off-by: Raghavendra K T <raghavendra.kt@amd.com>
> ---
>  include/linux/mm_types.h |  1 +
>  kernel/sched/fair.c      | 41 ++++++++++++++++++++++++++++++++--------
>  2 files changed, 34 insertions(+), 8 deletions(-)
> 
> diff --git a/include/linux/mm_types.h b/include/linux/mm_types.h
> index 306a3d1a0fa6..992e460a713e 100644
> --- a/include/linux/mm_types.h
> +++ b/include/linux/mm_types.h
> @@ -479,6 +479,7 @@ struct vma_numab_state {
>  	unsigned long next_scan;
>  	unsigned long next_pid_reset;
>  	unsigned long access_pids[2];
> +	unsigned int scan_counter;
>  };
>  
>  /*
> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> index 373ff5f55884..2c3e17e7fc2f 100644
> --- a/kernel/sched/fair.c
> +++ b/kernel/sched/fair.c
> @@ -2931,20 +2931,34 @@ static void reset_ptenuma_scan(struct task_struct *p)
>  static bool vma_is_accessed(struct vm_area_struct *vma)
>  {
>  	unsigned long pids;
> +	unsigned int vma_size;
> +	unsigned int scan_threshold;
> +	unsigned int scan_size;
> +
> +	pids = vma->numab_state->access_pids[0] | vma->numab_state->access_pids[1];
> +
> +	if (test_bit(hash_32(current->pid, ilog2(BITS_PER_LONG)), &pids))
> +		return true;
> +
> +	scan_size = READ_ONCE(sysctl_numa_balancing_scan_size);
> +	/* vma size in MB */
> +	vma_size = (vma->vm_end - vma->vm_start) >> 20;
> +
> +	/* Total scans needed to cover VMA */
> +	scan_threshold = (vma_size / scan_size);
> +
>  	/*
> -	 * Allow unconditional access first two times, so that all the (pages)
> -	 * of VMAs get prot_none fault introduced irrespective of accesses.
> +	 * Allow the scanning of half of disjoint set's VMA to induce
> +	 * prot_none fault irrespective of accesses.
>  	 * This is also done to avoid any side effect of task scanning
>  	 * amplifying the unfairness of disjoint set of VMAs' access.
>  	 */
> -	if (READ_ONCE(current->mm->numa_scan_seq) < 2)
> -		return true;
> -
> -	pids = vma->numab_state->access_pids[0] | vma->numab_state->access_pids[1];
> -	return test_bit(hash_32(current->pid, ilog2(BITS_PER_LONG)), &pids);
> +	scan_threshold = 1 + (scan_threshold >> 1);
> +	return (READ_ONCE(vma->numab_state->scan_counter) <= scan_threshold);
>  }
>  
> -#define VMA_PID_RESET_PERIOD (4 * sysctl_numa_balancing_scan_delay)
> +#define VMA_PID_RESET_PERIOD		(4 * sysctl_numa_balancing_scan_delay)
> +#define DISJOINT_VMA_SCAN_RENEW_THRESH	16
>  
>  /*
>   * The expensive part of numa migration is done from task_work context.
> @@ -3058,6 +3072,8 @@ static void task_numa_work(struct callback_head *work)
>  			/* Reset happens after 4 times scan delay of scan start */
>  			vma->numab_state->next_pid_reset =  vma->numab_state->next_scan +
>  				msecs_to_jiffies(VMA_PID_RESET_PERIOD);
> +
> +			WRITE_ONCE(vma->numab_state->scan_counter, 0);
>  		}
>  
>  		/*
> @@ -3068,6 +3084,13 @@ static void task_numa_work(struct callback_head *work)
>  						vma->numab_state->next_scan))
>  			continue;
>  
> +		/*
> +		 * For long running tasks, renew the disjoint vma scanning
> +		 * periodically.
> +		 */
> +		if (mm->numa_scan_seq && !(mm->numa_scan_seq % DISJOINT_VMA_SCAN_RENEW_THRESH))

Don't you need a READ_ONCE() accessor for mm->numa_scan_seq?
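
Something like the below (untested, just to illustrate) would also avoid
reading the sequence twice:

	unsigned int seq = READ_ONCE(mm->numa_scan_seq);

	if (seq && !(seq % DISJOINT_VMA_SCAN_RENEW_THRESH))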

Regards,
Bharata.


Thread overview: 5+ messages
2023-05-16  9:19 [RFC PATCH V2 0/1] " Raghavendra K T
2023-05-16  9:19 ` [RFC PATCH V2 1/1] " Raghavendra K T
2023-05-19  7:56   ` Bharata B Rao [this message]
2023-05-19 12:05     ` Raghavendra K T
2023-05-26  1:45   ` kernel test robot
