Re: [RFC PATCH V3 0/1] sched/numa: Fix disjoint set vma scan regression

From: Raghavendra K T <raghavendra.kt@amd.com>
To: Sapkal Swapnil <Swapnil.Sapkal@amd.com>,
	linux-kernel@vger.kernel.org, linux-mm@kvack.org,
	kernel test robot <oliver.sang@intel.com>
Cc: Ingo Molnar <mingo@redhat.com>,
	Peter Zijlstra <peterz@infradead.org>,
	Mel Gorman <mgorman@suse.de>,
	Andrew Morton <akpm@linux-foundation.org>,
	David Hildenbrand <david@redhat.com>,
	rppt@kernel.org, Juri Lelli <juri.lelli@redhat.com>,
	Vincent Guittot <vincent.guittot@linaro.org>,
	Bharata B Rao <bharata@amd.com>,
	Aithal Srikanth <sraithal@amd.com>
Subject: Re: [RFC PATCH V3 0/1] sched/numa: Fix disjoint set vma scan regression
Date: Thu, 8 Jun 2023 09:31:01 +0530	[thread overview]
Message-ID: <301ffc14-84be-461a-91d8-ff5a97cef981@amd.com> (raw)
In-Reply-To: <53f3872a-4cbf-563a-2658-9222586680da@amd.com>

On 6/7/2023 5:10 PM, Sapkal Swapnil wrote:
> Hello Raghavendra,
> 
> On 5/31/2023 9:55 AM, Raghavendra K T wrote:
>> With the numa scan enhancements [1], only the threads which had previously
>> accessed a vma are allowed to scan it.
>>
>> While this significantly reduced system time overhead, there were corner
>> cases which genuinely need some relaxation, e.g. the concern raised by
>> PeterZ: unfairness amongst threads belonging to disjoint sets of vmas can
>> amplify the side effects, with vma regions belonging to some of the tasks
>> being left unscanned.
>>
>> [1] had handled that issue by allowing first two scans at mm level
>> (mm->numa_scan_seq) unconditionally. But that was not enough.
>>
>> One of the tests that exercises a similar side effect is numa01_THREAD_ALLOC,
>> where allocation is done by the main thread and the memory is divided into
>> chunks of 24MB to be continuously bzeroed (for 128 threads on my machine).
>>
>> This was found in internal LKP run and also reported by [4].
>>
>> While RFC V1 [2] tried to address this issue, the logic relied more on
>> heuristics. RFC V2 [3] was rewritten based on vma_size.
>>
>> The current implementation drops some of the additional logic for long
>> running tasks and revisits some of the usage of READ_ONCE/WRITE_ONCE().
>>
>> The current patch addresses the same issue in a more accurate way as
>> follows:
>>
>> (1) A task that tries to scan a disjoint vma it is not associated with is
>> now allowed to induce prot_none faults. The total number of such
>> unconditional scans allowed per vma is derived from the exact vma size
>> as follows:
>>
>> total scans allowed = 1/2 * vma_size / scan_size.
>>
>> (2) Total scans already done is maintained using a per vma scan counter.
>>
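
A minimal sketch of the arithmetic in (1) and (2) above, not the code from
the patch. The struct, the function names, and the example sizes are
hypothetical; the 256MB scan size and the 3GB vma (128 chunks of 24MB in one
allocation) are assumptions chosen only for illustration.

```c
#include <stdbool.h>
#include <stdint.h>

#define MB ((uint64_t)1 << 20)

/* Hypothetical stand-in for per-vma scan state; not the kernel's struct. */
struct vma_scan_state {
	uint64_t vma_size;   /* bytes covered by the vma */
	uint64_t scans_done; /* unconditional scans already performed */
};

/* total scans allowed = 1/2 * vma_size / scan_size (integer division) */
static uint64_t scans_allowed(uint64_t vma_size, uint64_t scan_size)
{
	if (scan_size == 0)
		return 0;
	return (vma_size / 2) / scan_size;
}

/*
 * Allow one more unconditional scan if the per-vma counter is still below
 * the threshold, and account for it. Returns false once the budget is spent.
 */
static bool try_unconditional_scan(struct vma_scan_state *st,
				   uint64_t scan_size)
{
	if (st->scans_done >= scans_allowed(st->vma_size, scan_size))
		return false;
	st->scans_done++;
	return true;
}
```

Under these assumptions a 3GB vma gets 6 unconditional scans, while any vma
smaller than twice the scan size gets none.
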
>> With the above patch, the reported numa01_THREAD_ALLOC regression is
>> resolved, but please note that with [1] there was a drastic decrease in
>> system time for mmtest numa01, and this patch adds back some of that
>> system time.
>>
>> Summary: the numa scan enhancement patch [1] together with the current
>> patchset improves overall system time by filtering unnecessary numa scans
>> while still retaining necessary scanning in some corner cases which
>> involve disjoint sets of vmas.
>>
>> Your comments/Ideas are welcome.
>>
>> Changes since:
>> RFC V2:
>> 1) Drop reset of scan counter that tried to take care of long running workloads
>> 2) Correct usage of READ_ONCE/WRITE_ONCE (Bharata)
>> 3) Base is 6.4.0-rc2
>>
>> RFC V1:
>> 1) Rewrite entire logic based on actual vma size than heuristics
>> 2) Added Reported-by kernel test robot and internal LKP test
>> 3) Rebased to 6.4.-rc1 (ba0ad6ed89)
>>
>> Result:
>> SUT: Milan w/ 2 numa nodes 256 cpus
>>
>> Run of numa01_THREAD_ALLOC on 6.4.0-rc2 (which has the numascan enhancement)
>>                          base-numascan    base           base+fix
>> real                     1m1.507s         1m23.259s      1m2.632s
>> user                     213m51.336s      251m46.363s    220m35.528s
>> sys                      3m3.397s         0m12.492s      2m41.393s
>>
>> numa_hit                 5615517          4560123        4963875
>> numa_local               5615505          4560024        4963700
>> numa_other               12               99             175
>> numa_pte_updates         1822797          493            1559111
>> numa_hint_faults         1307113          523            1469031
>> numa_hint_faults_local   612617           488            884829
>> numa_pages_migrated      694370           35             584202
>>
>> We can see that the regression in base real time is recovered, but with
>> some additional system time overhead.
>>
>> Below is the mmtest autonuma performance
>>
>> autonumabench
>> ===========
>> (base 6.4.0-rc2 that has numascan enhancement)
>>                                     base-numascan           base               base+fix
>> Amean  syst-NUMA01                 300.46 (   0.00%)    23.97 *  92.02%*    67.18 *  77.64%*
>> Amean  syst-NUMA01_THREADLOCAL       0.20 (   0.00%)     0.22 *  -9.15%*     0.22 *  -9.15%*
>> Amean  syst-NUMA02                   0.70 (   0.00%)     0.71 *  -0.61%*     0.70 *   0.41%*
>> Amean  syst-NUMA02_SMT               0.58 (   0.00%)     0.62 *  -5.38%*     0.61 *  -3.67%*
>> Amean  elsp-NUMA01                 320.92 (   0.00%)   276.13 *  13.96%*   324.11 *  -0.99%*
>> Amean  elsp-NUMA01_THREADLOCAL       1.02 (   0.00%)     1.03 *  -1.83%*     1.03 *  -1.83%*
>> Amean  elsp-NUMA02                   3.16 (   0.00%)     3.93 * -24.20%*     3.14 *   0.81%*
>> Amean  elsp-NUMA02_SMT               3.82 (   0.00%)     3.87 *  -1.27%*     3.44 *   9.90%*
>>
>> Duration User      403532.43   279173.53   359098.23
>> Duration System      2114.31      179.20      481.54
>> Duration Elapsed     2312.20     2004.48     2335.84
>>
>> Ops NUMA alloc hit                  55795455.00    45452739.00    45500387.00
>> Ops NUMA alloc local                55794177.00    45435858.00    45500070.00
>> Ops NUMA base-page range updates   147858285.00       18601.00    42043107.00
>> Ops NUMA PTE updates               147858285.00       18601.00    42043107.00
>> Ops NUMA hint faults               150531983.00       18254.00    42450080.00
>> Ops NUMA hint local faults %       125691825.00       11964.00    32993313.00
>> Ops NUMA hint local percent               83.50          65.54          77.72
>> Ops NUMA pages migrated             13535786.00        2207.00     4654628.00
>> Ops AutoNUMA cost                     753952.10          91.44      212633.14
>>
>> Please note that there is a system time overhead added for numa01, but we
>> still have a very good improvement w.r.t. base without numascan.
>>
> 
> I tested the patch with lkp autonuma benchmark on a dual socket 4th 
> Generation EPYC server (2 X 96C/192T) running in NPS1 mode. Below are 
> the results:
> 
> commit:
>    6.4.0-rc2
>    6.4.0-rc2+patch
> 
>        6.4.0-rc2            6.4.0-rc2+patch
> ---------------- ---------------------------
>           %stddev     %change         %stddev
>               \          |                \
>      501.84           -12.5%     439.14       numa01.seconds
>      228.66            -1.8%     224.44       numa01_THREAD_ALLOC.seconds
>        0.51           +21.6%       0.62       numa02.seconds
>      107.17            +0.0%     107.17       numa02_SMT.seconds
>        2936            -9.1%       2669       elapsed_time
>      794910            +3.7%     824178       system_time
>      474520           -17.5%     391331       user_time
> 
> Tested-by: Swapnil Sapkal <swapnil.sapkal@amd.com>
> 
>> [1] Link: https://lore.kernel.org/lkml/cover.1677672277.git.raghavendra.kt@amd.com/T/#t
>> [2] Link: https://lore.kernel.org/lkml/cover.1683033105.git.raghavendra.kt@amd.com/
>> [3] Link: https://lore.kernel.org/lkml/cover.1684228065.git.raghavendra.kt@amd.com/T/
>> [4] Link: https://lore.kernel.org/lkml/db995c11-08ba-9abf-812f-01407f70a5d4@amd.com/T/
>>
>> Raghavendra K T (1):
>>    sched/numa: Fix disjoint set vma scan regression
>>
>>   include/linux/mm_types.h |  1 +
>>   kernel/sched/fair.c      | 31 ++++++++++++++++++++++++-------
>>   2 files changed, 25 insertions(+), 7 deletions(-)
>>
> -- 
> Thanks and regards,
> Swapnil

Thank you, Swapnil.
This is a reminder that LKP's numa01 is the same test as numa01_THREAD_ALLOC,
which has regained its numbers.

I will also wait to see whether kernel-test-robot sees the issue fixed, and
whether Mel/Peter have any objections/comments on the direction.

Regards
