Re: [RFC PATCH V1 0/6] sched/numa: Enhance disjoint VMA scanning

From: Raghavendra K T <raghavendra.kt@amd.com>
To: linux-kernel@vger.kernel.org, linux-mm@kvack.org
Cc: Ingo Molnar <mingo@redhat.com>,
	Peter Zijlstra <peterz@infradead.org>,
	Mel Gorman <mgorman@suse.de>,
	Andrew Morton <akpm@linux-foundation.org>,
	David Hildenbrand <david@redhat.com>,
	rppt@kernel.org, Juri Lelli <juri.lelli@redhat.com>,
	Vincent Guittot <vincent.guittot@linaro.org>,
	Bharata B Rao <bharata@amd.com>,
	Aithal Srikanth <sraithal@amd.com>,
	kernel test robot <oliver.sang@intel.com>,
	Sapkal Swapnil <Swapnil.Sapkal@amd.com>,
	K Prateek Nayak <kprateek.nayak@amd.com>
Subject: Re: [RFC PATCH V1 0/6] sched/numa: Enhance disjoint VMA scanning
Date: Tue, 19 Sep 2023 12:00:01 +0530
Message-ID: <719f0729-d28f-d12f-cff4-ab8115861d30@amd.com>
In-Reply-To: <cover.1693287931.git.raghavendra.kt@amd.com>

On 8/29/2023 11:36 AM, Raghavendra K T wrote:
> Since commit fc137c0ddab2 ("sched/numa: enhance vma scanning logic") [1]
> VMA scanning is allowed if:
> 1) The task has accessed the VMA.
>   Rationale: Reduce overhead for tasks that have not touched the VMA,
>   and filter out unnecessary scanning.
> 
> 2) Early phase of the VMA scan where mm->numa_scan_seq is less than 2.
>   Rationale: Understand the initial characteristics of VMAs and
>   prevent VMA scanning unfairness.
> 
> While that reduces scanning overhead most of the time, there are some
> corner cases associated with it.
> 
> These were found in an internal LKP run and also reported in [2]. There was
> an earlier attempt to fix them:
> 
> Link: https://lore.kernel.org/linux-mm/cover.1685506205.git.raghavendra.kt@amd.com/T/
> 
> This is an entirely different series, written after Mel's feedback, that
> addresses the issue and continues the work of enhancing VMA scanning for
> NUMA balancing.
> 
> Problem statement (Disjoint VMA set):
> ======================================
> Let's look at some of the corner cases, using the example below of tasks and
> their access patterns.
> 
> Consider N tasks (threads) of a process.
> Set1 tasks accessing vma_x (group of VMAs)
> Set2 tasks accessing vma_y (group of VMAs)
> 
>               Set1                        Set2
>       --------------------       ----------------------
>       | task_1..task_n/2 |       | task_n/2+1..task_n |
>       --------------------       ----------------------
>                |                           |
>                V                           V
>       --------------------       ----------------------
>       |      vma_x       |       |       vma_y        |
>       --------------------       ----------------------
> 
> Corner cases:
> (a) Out of N tasks, not all of them get a fair opportunity to scan (PeterZ).
> Suppose Set1 tasks get more opportunities to scan (perhaps because of the
> activity pattern of the tasks, or for other reasons in the current design) in
> the above example; then vma_x gets scanned more often than vma_y.
> 
> An experiment that illustrates this unfairness is described here:
> Link: https://lore.kernel.org/lkml/c730dee0-a711-8a8e-3eb1-1bfdd21e6add@amd.com/
> 
> (b) Sizes of vmas can differ.
> Suppose size of vma_y is far greater than the size of vma_x, then a bigger
> portion of vma_y can potentially be left unscanned since scanning is bounded
> by scan_size of 256MB (default) for each iteration.
> 
> (c) Highly active threads trap a few VMAs frequently, and some VMAs that have
> not been accessed for a long time can potentially be starved of scanning
> indefinitely (Mel). There may then not be enough hints/details about those
> VMAs if they are needed later for migration.
> 
> (d) Allocation of memory in some specific manner (Mel).
> For example, suppose a main thread allocates memory and then is not active.
> When other threads try to act upon that memory, they may not have many hints
> about it if the corresponding VMA was not scanned.
> 
> (e) VMAs that are created after two full scans of mm (mm->numa_scan_seq > 2)
> will never get scanned. (Observed rarely but very much possible depending on
> workload behaviour).
> 
> On top of this, a combination of some of the above (e.g., (a) and (b)) can
> potentially amplify/worsen the side effects.
> 
> This patchset tries to address the above issues by enhancing the
> unconditional VMA scanning logic.
> 
> High level ideas:
> =================
> Idea-1) Depending on vma_size, populate a per-VMA vma_scan_select value,
> decrement it, and force a scan when it hits zero (Mel).
> The vma_scan_select value is repopulated when it hits zero.
> 
> This is what the VMA scanning phases look like after implementation:
> 
> |<---p1--->|<-----p2----->|<-----p2----->|...
> 
> Algorithm:
> p1: New VMA; in the initial phase, do not scan until scan_delay.
> 
> p2: Allow scanning if the task has accessed the VMA or vma_scan_select has
>   hit zero.
> 
> Reinitialize vma_scan_select and repeat p2.
> 
> pros/cons:
> +  : Rate limiting is built into the approach.
> +  : vma_size is taken into account for scanning.
> +/-: Scanning continues forever.
> -  : Changes in VMA size are accounted for only after a force scan, i.e.,
>     vma_scan_select is repopulated only after vma_scan_select hits zero.
> 
> Idea-1 can potentially cover all the issues mentioned above.
> 
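
The Idea-1 countdown can be sketched as a small userspace model. This is only an illustration under stated assumptions, not the code from the series: struct vma_model, scan_select_init() and vma_should_scan() are hypothetical names, and the mapping from vma_size to the initial vma_scan_select value (here, one step per scan_size chunk) is a placeholder, since the cover letter does not spell it out. Phase p1 (scan_delay) is omitted.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical userspace model of a VMA; not a kernel structure. */
struct vma_model {
	size_t size_bytes;        /* current size of the VMA */
	unsigned int scan_select; /* countdown to the next forced scan */
};

/*
 * Placeholder policy (an assumption): one countdown step per scan_size
 * chunk of the VMA, with a minimum of 1.
 */
static unsigned int scan_select_init(size_t vma_size, size_t scan_size)
{
	assert(scan_size > 0);
	size_t chunks = (vma_size + scan_size - 1) / scan_size;
	return chunks ? (unsigned int)chunks : 1;
}

/*
 * Phase p2: allow the scan if the task accessed the VMA; otherwise
 * decrement the countdown and force a scan when it hits zero. The
 * countdown is repopulated only at that point, so a size change is
 * picked up only after a forced scan.
 */
static bool vma_should_scan(struct vma_model *vma, bool task_accessed,
			    size_t scan_size)
{
	if (task_accessed)
		return true;

	if (vma->scan_select > 0)
		vma->scan_select--;

	if (vma->scan_select == 0) {
		vma->scan_select = scan_select_init(vma->size_bytes, scan_size);
		return true;
	}
	return false;
}
```

In this model, a 512MB VMA with a 256MB scan_size gets a forced scan on every second pass by a task that has not accessed it, while a task that has accessed it is allowed to scan without touching the countdown.
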
> Idea-2) Take bitmask_weight of latest access_pids value (suggested by Bharata).
> If number of tasks accessing vma is >= 1, unconditionally allow scanning.
> 
> Idea-3) Take bitmask_weight of the access_pid history of the VMA. If the
> number of tasks accessing the VMA is > THRESHOLD (=3), unconditionally allow
> scanning.
> 
> Rationale (Idea-2,3): Do not miss out scanning of critical VMAs.
> 
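
Ideas 2 and 3 both reduce to a population count over PID bitmasks. The following is a minimal userspace sketch, assuming each set bit stands for one (hashed) accessing task; weight(), recent_access_allows_scan() and history_allows_scan() are hypothetical names, and the representation of the history as an array of per-window masks is an assumption made for illustration.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define HISTORY_THRESHOLD 3u /* THRESHOLD (=3) from Idea-3 */

/* Portable population count: clears the lowest set bit per iteration. */
static unsigned int weight(unsigned long mask)
{
	unsigned int n = 0;

	while (mask) {
		mask &= mask - 1;
		n++;
	}
	return n;
}

/* Idea-2: allow scanning if at least one task is in the latest mask. */
static bool recent_access_allows_scan(unsigned long latest_pids)
{
	return weight(latest_pids) >= 1;
}

/*
 * Idea-3: OR the per-window masks together and allow scanning if more
 * than HISTORY_THRESHOLD distinct bits are set across the history.
 */
static bool history_allows_scan(const unsigned long *pid_history,
				size_t nr_windows)
{
	assert(pid_history != NULL || nr_windows == 0);
	unsigned long all = 0;

	for (size_t i = 0; i < nr_windows; i++)
		all |= pid_history[i];

	return weight(all) > HISTORY_THRESHOLD;
}
```

Because the bits are assumed to be hashed PID slots, two tasks can collide on one bit, so the weight is a lower bound on the number of accessing tasks rather than an exact count.
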
> Idea-4) Have a per-VMA vma_scan_seq. Allow unconditional scanning until
> vma_scan_seq reaches a value proportional (or equal) to vma_size/scan_size.
> This is complementary to Idea-1.
> 
> This is what the VMA scanning phases look like after implementation:
> 
> |<--p1--->|<-----p2----->|<-----p3----->|<-----p4----->...||<-----p2----->|<-----p3----->|<-----p4-----> ...||
>                                                          RESET                                               RESET
> Algorithm:
> p1: New VMA; in the initial phase, do not scan until scan_delay.
> 
> p2: Allow scanning if the task has accessed the VMA, or until vma_scan_seq
>   reaches f(vma_size/scan_size), e.g., f = 1/2 * vma_size/scan_size.
> 
> p3: Allow scanning if the task has accessed the VMA, or until vma_scan_seq
>   reaches f(vma_size/scan_size), in a rate-limited manner. This is an
>   optional phase.
> 
> p4: Allow scanning iff task has accessed VMA.
> 
> Reset after p4 (optional).
> 
> Repeat p2, p3, p4.
> 
> Motivation: Allow aggressive scanning in the beginning, followed by
> rate-limited scanning, and then completely disallow scanning to avoid
> unnecessary scanning. The reset time could be a function of scan_delay,
> chosen long enough to help long-running tasks forget history and start afresh.
> 
> +  : Rate limiting needs to be taken care of separately if needed.
> +/-: Scanning continues only if RESET of vma_scan_seq is implemented.
> +  : Changes in VMA size are taken care of in every scan.
> 
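
The p2 and p4 decisions of Idea-4 can be sketched in the same style. This is a simplified model under assumptions: it uses f = 1/2 * vma_size/scan_size from the example above, reads the p2 condition as "scan unconditionally while vma_scan_seq is below the limit", and leaves out p1, the optional rate-limited p3 and the RESET step. The function names are hypothetical.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Limit for unconditional scanning: f = 1/2 * vma_size/scan_size. */
static unsigned long scan_seq_limit(size_t vma_size, size_t scan_size)
{
	assert(scan_size > 0);
	return (unsigned long)((vma_size / scan_size) / 2);
}

/*
 * p2: scan unconditionally while vma_scan_seq is below the limit.
 * p4: once the limit is reached, scan only if the task accessed the VMA.
 * The limit is recomputed on every call, so a change in VMA size takes
 * effect at the next scan.
 */
static bool vma_allow_scan(unsigned long vma_scan_seq, size_t vma_size,
			   size_t scan_size, bool task_accessed)
{
	if (task_accessed)
		return true;

	return vma_scan_seq < scan_seq_limit(vma_size, scan_size);
}
```

For a 1GB VMA and a 256MB scan_size the limit is 2, so the first two scans are unconditional and later ones require an access by the scanning task.
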
> The current patch series implements Ideas 1, 2 and 3, plus the extension of
> the access PID history idea from PeterZ.
> 
> Results:
> ========
> Base: 6.5.0-rc6+ (4853c74bd7ab)
> SUT: Milan w/ 2 numa nodes 256 cpus
> 
> mmtest numa01_THREAD_ALLOC manual run:
> 
>                         base            patched
> real                    1m22.758s       1m9.200s
> user                    249m49.540s     229m30.039s
> sys                     0m25.040s       3m10.451s
> 
> numa_pte_updates        6985            1573363
> numa_hint_faults        2705            1022623
> numa_hint_faults_local  2279            389633
> numa_pages_migrated     426             632990
> 
> kernbench
>                         base                 patched
> Amean     user-256    21989.09 (   0.00%)    21677.36 *   1.42%*
> Amean     syst-256    10171.34 (   0.00%)    10818.28 *  -6.36%*
> Amean     elsp-256      166.81 (   0.00%)      168.40 *  -0.95%*
> 
> Duration User       65973.18    65038.00
> Duration System     30538.92    32478.59
> Duration Elapsed      529.52      533.09
> 
> Ops NUMA PTE updates                  976844.00      962680.00
> Ops NUMA hint faults                  226763.00      245620.00
> Ops NUMA pages migrated               220146.00      207025.00
> Ops AutoNUMA cost                       1144.84        1238.77
> 
> Improvements in other benchmarks I have tested:
> Time based:
> Hashjoin    4.21%
> Btree       2.04%
> XSbench     0.36%
> 
> Throughput based:
> Graph500   -3.62%
> Nas.bt      3.69%
> Nas.ft     21.91%
> 
> Note: The VMA scanning improvements [1] have refined scanning so much that
> the system overhead we re-introduce with additional scanning looks glaringly
> high. But if we consider the difference between before [1] and the current
> series, overall scanning overhead is considerably reduced.
> 
> 1. Link: https://lore.kernel.org/lkml/cover.1677672277.git.raghavendra.kt@amd.com/T/#t
> 2. Link: https://lore.kernel.org/lkml/cover.1683033105.git.raghavendra.kt@amd.com/
> 
> Note: The patch description is repeated in some patches so that it does not
> need to be copied from the cover letter again.
> 
> Peter Zijlstra (1):
>    sched/numa: Increase tasks' access history
> 
> Raghavendra K T (5):
>    sched/numa: Move up the access pid reset logic
>    sched/numa: Add disjoint vma unconditional scan logic
>    sched/numa: Remove unconditional scan logic using mm numa_scan_seq
>    sched/numa: Allow recently accessed VMAs to be scanned
>    sched/numa: Allow scanning of shared VMAs
> 
>   include/linux/mm.h       |  12 +++--
>   include/linux/mm_types.h |   5 +-
>   kernel/sched/fair.c      | 109 ++++++++++++++++++++++++++++++++-------
>   3 files changed, 102 insertions(+), 24 deletions(-)
> 

Hello Andrew,

I am resending the patches rebased onto mm-unstable, with the results from
Oliver and Swapnil added, so that the series is ready to merge once we get a
go-ahead / no objection from Mel, Peter ... I am happy to work on any further
suggestions.

Hope that is okay.

Thanks and Regards
- Raghu

Thread overview: 25+ messages
2023-08-29  6:06 Raghavendra K T
2023-08-29  6:06 ` [RFC PATCH V1 1/6] sched/numa: Move up the access pid reset logic Raghavendra K T
2023-08-29  6:06 ` [RFC PATCH V1 2/6] sched/numa: Add disjoint vma unconditional scan logic Raghavendra K T
2023-09-12  7:50   ` kernel test robot
2023-09-13  6:21     ` Raghavendra K T
2023-08-29  6:06 ` [RFC PATCH V1 3/6] sched/numa: Remove unconditional scan logic using mm numa_scan_seq Raghavendra K T
2023-08-29  6:06 ` [RFC PATCH V1 4/6] sched/numa: Increase tasks' access history Raghavendra K T
2023-09-12 14:24   ` kernel test robot
2023-09-13  6:15     ` Raghavendra K T
2023-09-13  7:34       ` Oliver Sang
2023-08-29  6:06 ` [RFC PATCH V1 5/6] sched/numa: Allow recently accessed VMAs to be scanned Raghavendra K T
2023-09-10 15:29   ` kernel test robot
2023-09-11 11:25     ` Raghavendra K T
2023-09-12  2:22       ` Oliver Sang
2023-09-12  6:43         ` Raghavendra K T
2023-08-29  6:06 ` [RFC PATCH V1 6/6] sched/numa: Allow scanning of shared VMAs Raghavendra K T
2023-09-13  5:28 ` [RFC PATCH V1 0/6] sched/numa: Enhance disjoint VMA scanning Swapnil Sapkal
2023-09-13  6:24   ` Raghavendra K T
2023-09-19  6:30 ` Raghavendra K T [this message]
2023-09-19  7:15   ` Ingo Molnar
2023-09-19  8:06     ` Raghavendra K T
2023-09-19  9:28 ` Peter Zijlstra
2023-09-19 16:22   ` Mel Gorman
2023-09-19 19:11     ` Peter Zijlstra
2023-09-20 10:42     ` Raghavendra K T
