Re: [RFC PATCH V1 00/13] mm: slowtier page promotion based on PTE A bit

linux-mm.kvack.org archive mirror
 help / color / mirror / Atom feed

From: Raghavendra K T <raghavendra.kt@amd.com>
To: AneeshKumar.KizhakeVeetil@arm.com, Hasan.Maruf@amd.com,
	Michael.Day@amd.com, akpm@linux-foundation.org, bharata@amd.com,
	dave.hansen@intel.com, david@redhat.com,
	dongjoo.linux.dev@gmail.com, feng.tang@intel.com,
	gourry@gourry.net, hannes@cmpxchg.org, honggyu.kim@sk.com,
	hughd@google.com, jhubbard@nvidia.com, jon.grimm@amd.com,
	k.shutemov@gmail.com, kbusch@meta.com, kmanaouil.dev@gmail.com,
	leesuyeon0506@gmail.com, leillc@google.com,
	liam.howlett@oracle.com, linux-kernel@vger.kernel.org,
	linux-mm@kvack.org, mgorman@techsingularity.net,
	mingo@redhat.com, nadav.amit@gmail.com, nphamcs@gmail.com,
	peterz@infradead.org, riel@surriel.com, rientjes@google.com,
	rppt@kernel.org, santosh.shukla@amd.com, shivankg@amd.com,
	shy828301@gmail.com, sj@kernel.org, vbabka@suse.cz,
	weixugc@google.com, willy@infradead.org,
	ying.huang@linux.alibaba.com, ziy@nvidia.com,
	Jonathan.Cameron@huawei.com, alok.rathore@samsung.com
Subject: Re: [RFC PATCH V1 00/13] mm: slowtier page promotion based on PTE A bit
Date: Fri, 21 Mar 2025 00:41:20 +0530	[thread overview]
Message-ID: <d0aedf71-047b-4ee4-9175-a67708a389de@amd.com> (raw)
In-Reply-To: <52b2c1dd-2f4a-42fa-8a40-bd3664e7c56a@amd.com>

On 3/20/2025 2:21 PM, Raghavendra K T wrote:
> On 3/20/2025 4:30 AM, Davidlohr Bueso wrote:
>> On Wed, 19 Mar 2025, Raghavendra K T wrote:
>>
>>> Introduction:
>>> =============
>>> In the current hot page promotion, all the activities including the
>>> process address space scanning, NUMA hint fault handling and page
>>> migration is performed in the process context. i.e., scanning 
>>> overhead is
>>> borne by applications.
>>>
>>> This is RFC V1 patch series to do (slow tier) CXL page promotion.
>>> The approach in this patchset assists/addresses the issue by adding PTE
>>> Accessed bit scanning.
>>>
>>> Scanning is done by a global kernel thread which routinely scans all
>>> the processes' address spaces and checks for accesses by reading the
>>> PTE A bit.
>>>
>>> A separate migration thread migrates/promotes the pages to the toptier
>>> node based on a simple heuristic that uses toptier scan/access 
>>> information
>>> of the mm.
>>>
>>> Additionally based on the feedback for RFC V0 [4], a prctl knob with
>>> a scalar value is provided to control per task scanning.
>>>
>>> Initial results show promising number on a microbenchmark. Soon
>>> will get numbers with real benchmarks and findings (tunings).
>>>
>>> Experiment:
>>> ============
>>> Abench microbenchmark,
>>> - Allocates 8GB/16GB/32GB/64GB of memory on CXL node
>>> - 64 threads created, and each thread randomly accesses pages in 4K
>>>  granularity.
>>> - 512 iterations with a delay of 1 us between two successive iterations.
>>>
>>> SUT: 512 CPU, 2 node 256GB, AMD EPYC.
>>>
>>> 3 runs, command:  abench -m 2 -d 1 -i 512 -s <size>
>>>
>>> Calculates how much time is taken to complete the task, lower is better.
>>> Expectation is CXL node memory is expected to be migrated as fast as
>>> possible.
>>>
>>> Base case: 6.14-rc6    w/ numab mode = 2 (hot page promotion is 
>>> enabled).
>>> patched case: 6.14-rc6 w/ numab mode = 1 (numa balancing is enabled).
>>> we expect daemon to do page promotion.
>>>
>>> Result:
>>> ========
>>>         base NUMAB2                    patched NUMAB1
>>>         time in sec  (%stdev)   time in sec  (%stdev)     %gain
>>> 8GB     134.33       ( 0.19 )        120.52  ( 0.21 )     10.28
>>> 16GB     292.24       ( 0.60 )        275.97  ( 0.18 )      5.56
>>> 32GB     585.06       ( 0.24 )        546.49  ( 0.35 )      6.59
>>> 64GB    1278.98       ( 0.27 )       1205.20  ( 2.29 )      5.76
>>>
>>> Base case: 6.14-rc6    w/ numab mode = 1 (numa balancing is enabled).
>>> patched case: 6.14-rc6 w/ numab mode = 1 (numa balancing is enabled).
>>>         base NUMAB1                    patched NUMAB1
>>>         time in sec  (%stdev)   time in sec  (%stdev)     %gain
>>> 8GB     186.71       ( 0.99 )        120.52  ( 0.21 )     35.45
>>> 16GB     376.09       ( 0.46 )        275.97  ( 0.18 )     26.62
>>> 32GB     744.37       ( 0.71 )        546.49  ( 0.35 )     26.58
>>> 64GB    1534.49       ( 0.09 )       1205.20  ( 2.29 )     21.45
>>
>> Very promising, but a few things. A more fair comparison would be
>> vs kpromoted using the PROT_NONE of NUMAB2. Essentially disregarding
>> the asynchronous migration, and effectively measuring synchronous
>> vs asynchronous scanning overhead and implied semantics. Essentially
>> save the extra kthread and only have a per-NUMA node migrator, which
>> is the common denominator for all these sources of hotness.
> 
> 
> Yes, I agree that fair comparison would be
> 1) kmmscand generating data on pages to be promoted working with
> kpromoted asynchronously migrating
> VS
> 2) NUMAB2 generating data on pages to be migrated integrated with
> kpromoted.
> 
> As Bharata already mentioned, we tried integrating kpromoted with
> kmmscand generated migration list, But kmmscand generates huge amount of
> scanned page data, and need to be organized better so that kpromted can 
> handle the migration effectively.
> 
> (2) We have not tried it yet, will get back on the possibility (and also
> numbers when both are ready).
> 
>>
>> Similarly, while I don't see any users disabling NUMAB1 _and_ enabling
>> this sort of thing, it would be useful to have data on no numa balancing
>> at all. If nothing else, that would measure the effects of the dest
>> node heuristics.
> 
> Last time when I checked, with patch, numbers with NUMAB=0 and NUMAB=1
> was not making much difference in 8GB case because most of the migration 
> was handled by kmmscand. It is because before NUMAB=1 learns and tries
> to migrate, kmmscand would have already migrated.
> 
> But a longer running/ more memory workload may make more difference.
> I will comeback with that number.

                  base NUMAB=2   Patched NUMAB=0
                  time in sec    time in sec
===================================================
8G:              134.33 (0.19)   119.88 ( 0.25)
16G:             292.24 (0.60)   325.06 (11.11)
32G:             585.06 (0.24)   546.15 ( 0.50)
64G:            1278.98 (0.27)  1221.41 ( 1.54)

We can see that numbers have not changed much between NUMAB=1 NUMAB=0 in
patched case.

PS: for 16G there was a bad case where a rare contention happen for lock
for same mm. that we can see from stdev, which should be taken care in
next version.

[...]

next prev parent reply	other threads:[~2025-03-20 19:11 UTC|newest]

Thread overview: 30+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2025-03-19 19:30 Raghavendra K T
2025-03-19 19:30 ` [RFC PATCH V1 01/13] mm: Add kmmscand kernel daemon Raghavendra K T
2025-03-21 16:06   ` Jonathan Cameron
2025-03-24 15:09     ` Raghavendra K T
2025-03-19 19:30 ` [RFC PATCH V1 02/13] mm: Maintain mm_struct list in the system Raghavendra K T
2025-03-19 19:30 ` [RFC PATCH V1 03/13] mm: Scan the mm and create a migration list Raghavendra K T
2025-03-19 19:30 ` [RFC PATCH V1 04/13] mm: Create a separate kernel thread for migration Raghavendra K T
2025-03-21 17:29   ` Jonathan Cameron
2025-03-24 15:17     ` Raghavendra K T
2025-03-19 19:30 ` [RFC PATCH V1 05/13] mm/migration: Migrate accessed folios to toptier node Raghavendra K T
2025-03-19 19:30 ` [RFC PATCH V1 06/13] mm: Add throttling of mm scanning using scan_period Raghavendra K T
2025-03-19 19:30 ` [RFC PATCH V1 07/13] mm: Add throttling of mm scanning using scan_size Raghavendra K T
2025-03-19 19:30 ` [RFC PATCH V1 08/13] mm: Add initial scan delay Raghavendra K T
2025-03-19 19:30 ` [RFC PATCH V1 09/13] mm: Add heuristic to calculate target node Raghavendra K T
2025-03-21 17:42   ` Jonathan Cameron
2025-03-24 16:17     ` Raghavendra K T
2025-03-19 19:30 ` [RFC PATCH V1 10/13] sysfs: Add sysfs support to tune scanning Raghavendra K T
2025-03-19 19:30 ` [RFC PATCH V1 11/13] vmstat: Add vmstat counters Raghavendra K T
2025-03-19 19:30 ` [RFC PATCH V1 12/13] trace/kmmscand: Add tracing of scanning and migration Raghavendra K T
2025-03-19 19:30 ` [RFC PATCH V1 13/13] prctl: Introduce new prctl to control scanning Raghavendra K T
2025-03-19 23:00 ` [RFC PATCH V1 00/13] mm: slowtier page promotion based on PTE A bit Davidlohr Bueso
2025-03-20  8:51   ` Raghavendra K T
2025-03-20 19:11     ` Raghavendra K T [this message]
2025-03-21 20:35       ` Davidlohr Bueso
2025-03-25  6:36         ` Raghavendra K T
2025-03-20 21:50     ` Davidlohr Bueso
2025-03-21  6:48       ` Raghavendra K T
2025-03-21 15:52 ` Jonathan Cameron
     [not found] ` <20250321105309.3521-1-hdanton@sina.com>
2025-03-23 18:14   ` [RFC PATCH V1 09/13] mm: Add heuristic to calculate target node Raghavendra K T
     [not found]   ` <20250324110543.3599-1-hdanton@sina.com>
2025-03-24 14:54     ` Raghavendra K T

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=d0aedf71-047b-4ee4-9175-a67708a389de@amd.com \
    --to=raghavendra.kt@amd.com \
    --cc=AneeshKumar.KizhakeVeetil@arm.com \
    --cc=Hasan.Maruf@amd.com \
    --cc=Jonathan.Cameron@huawei.com \
    --cc=Michael.Day@amd.com \
    --cc=akpm@linux-foundation.org \
    --cc=alok.rathore@samsung.com \
    --cc=bharata@amd.com \
    --cc=dave.hansen@intel.com \
    --cc=david@redhat.com \
    --cc=dongjoo.linux.dev@gmail.com \
    --cc=feng.tang@intel.com \
    --cc=gourry@gourry.net \
    --cc=hannes@cmpxchg.org \
    --cc=honggyu.kim@sk.com \
    --cc=hughd@google.com \
    --cc=jhubbard@nvidia.com \
    --cc=jon.grimm@amd.com \
    --cc=k.shutemov@gmail.com \
    --cc=kbusch@meta.com \
    --cc=kmanaouil.dev@gmail.com \
    --cc=leesuyeon0506@gmail.com \
    --cc=leillc@google.com \
    --cc=liam.howlett@oracle.com \
    --cc=linux-kernel@vger.kernel.org \
    --cc=linux-mm@kvack.org \
    --cc=mgorman@techsingularity.net \
    --cc=mingo@redhat.com \
    --cc=nadav.amit@gmail.com \
    --cc=nphamcs@gmail.com \
    --cc=peterz@infradead.org \
    --cc=riel@surriel.com \
    --cc=rientjes@google.com \
    --cc=rppt@kernel.org \
    --cc=santosh.shukla@amd.com \
    --cc=shivankg@amd.com \
    --cc=shy828301@gmail.com \
    --cc=sj@kernel.org \
    --cc=vbabka@suse.cz \
    --cc=weixugc@google.com \
    --cc=willy@infradead.org \
    --cc=ying.huang@linux.alibaba.com \
    --cc=ziy@nvidia.com \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link

Be sure your reply has a Subject: header at the top and a blank line before the message body.

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox