From: Raghavendra K T <raghavendra.kt@amd.com>
To: AneeshKumar.KizhakeVeetil@arm.com, Hasan.Maruf@amd.com,
Michael.Day@amd.com, akpm@linux-foundation.org, bharata@amd.com,
dave.hansen@intel.com, david@redhat.com,
dongjoo.linux.dev@gmail.com, feng.tang@intel.com,
gourry@gourry.net, hannes@cmpxchg.org, honggyu.kim@sk.com,
hughd@google.com, jhubbard@nvidia.com, jon.grimm@amd.com,
k.shutemov@gmail.com, kbusch@meta.com, kmanaouil.dev@gmail.com,
leesuyeon0506@gmail.com, leillc@google.com,
liam.howlett@oracle.com, linux-kernel@vger.kernel.org,
linux-mm@kvack.org, mgorman@techsingularity.net,
mingo@redhat.com, nadav.amit@gmail.com, nphamcs@gmail.com,
peterz@infradead.org, riel@surriel.com, rientjes@google.com,
rppt@kernel.org, santosh.shukla@amd.com, shivankg@amd.com,
shy828301@gmail.com, sj@kernel.org, vbabka@suse.cz,
weixugc@google.com, willy@infradead.org,
ying.huang@linux.alibaba.com, ziy@nvidia.com,
Jonathan.Cameron@huawei.com, alok.rathore@samsung.com
Subject: Re: [RFC PATCH V1 00/13] mm: slowtier page promotion based on PTE A bit
Date: Fri, 21 Mar 2025 00:41:20 +0530
Message-ID: <d0aedf71-047b-4ee4-9175-a67708a389de@amd.com>
In-Reply-To: <52b2c1dd-2f4a-42fa-8a40-bd3664e7c56a@amd.com>
On 3/20/2025 2:21 PM, Raghavendra K T wrote:
> On 3/20/2025 4:30 AM, Davidlohr Bueso wrote:
>> On Wed, 19 Mar 2025, Raghavendra K T wrote:
>>
>>> Introduction:
>>> =============
>>> In the current hot page promotion, all the activities including the
>>> process address space scanning, NUMA hint fault handling and page
>>> migration are performed in the process context, i.e., the scanning
>>> overhead is borne by applications.
>>>
>>> This is RFC V1 patch series to do (slow tier) CXL page promotion.
>>> The approach in this patchset assists/addresses the issue by adding PTE
>>> Accessed bit scanning.
>>>
>>> Scanning is done by a global kernel thread which routinely scans all
>>> the processes' address spaces and checks for accesses by reading the
>>> PTE A bit.
>>>
>>> A separate migration thread migrates/promotes the pages to the toptier
>>> node based on a simple heuristic that uses toptier scan/access
>>> information of the mm.
>>>
>>> Additionally based on the feedback for RFC V0 [4], a prctl knob with
>>> a scalar value is provided to control per task scanning.
>>>
>>> Initial results show promising numbers on a microbenchmark. Numbers
>>> with real benchmarks and findings (tunings) will follow soon.
>>>
>>> Experiment:
>>> ============
>>> Abench microbenchmark,
>>> - Allocates 8GB/16GB/32GB/64GB of memory on CXL node
>>> - 64 threads created, and each thread randomly accesses pages in 4K
>>> granularity.
>>> - 512 iterations with a delay of 1 us between two successive iterations.
>>>
>>> SUT: 512 CPU, 2 node 256GB, AMD EPYC.
>>>
>>> 3 runs, command: abench -m 2 -d 1 -i 512 -s <size>
>>>
>>> Measures how much time is taken to complete the task; lower is better.
>>> The expectation is that CXL node memory is migrated as fast as
>>> possible.
>>>
>>> Base case:    6.14-rc6 w/ numab mode = 2 (hot page promotion is enabled).
>>> Patched case: 6.14-rc6 w/ numab mode = 1 (numa balancing is enabled).
>>> We expect the daemon to do page promotion.
>>>
>>> Result:
>>> ========
>>>          base NUMAB2             patched NUMAB1
>>>          time in sec (%stdev)    time in sec (%stdev)    %gain
>>>  8GB      134.33 ( 0.19 )         120.52 ( 0.21 )        10.28
>>> 16GB      292.24 ( 0.60 )         275.97 ( 0.18 )         5.56
>>> 32GB      585.06 ( 0.24 )         546.49 ( 0.35 )         6.59
>>> 64GB     1278.98 ( 0.27 )        1205.20 ( 2.29 )         5.76
>>>
>>> Base case: 6.14-rc6 w/ numab mode = 1 (numa balancing is enabled).
>>> patched case: 6.14-rc6 w/ numab mode = 1 (numa balancing is enabled).
>>>          base NUMAB1             patched NUMAB1
>>>          time in sec (%stdev)    time in sec (%stdev)    %gain
>>>  8GB      186.71 ( 0.99 )         120.52 ( 0.21 )        35.45
>>> 16GB      376.09 ( 0.46 )         275.97 ( 0.18 )        26.62
>>> 32GB      744.37 ( 0.71 )         546.49 ( 0.35 )        26.58
>>> 64GB     1534.49 ( 0.09 )        1205.20 ( 2.29 )        21.45
>>
>> Very promising, but a few things. A more fair comparison would be
>> vs kpromoted using the PROT_NONE of NUMAB2. Essentially disregarding
>> the asynchronous migration, and effectively measuring synchronous
>> vs asynchronous scanning overhead and implied semantics. Essentially
>> save the extra kthread and only have a per-NUMA node migrator, which
>> is the common denominator for all these sources of hotness.
>
>
> Yes, I agree that fair comparison would be
> 1) kmmscand generating data on pages to be promoted working with
> kpromoted asynchronously migrating
> VS
> 2) NUMAB2 generating data on pages to be migrated integrated with
> kpromoted.
>
> As Bharata already mentioned, we tried integrating kpromoted with the
> kmmscand-generated migration list. But kmmscand generates a huge amount
> of scanned page data, which needs to be organized better so that
> kpromoted can handle the migration effectively.
>
> (2) We have not tried it yet, will get back on the possibility (and also
> numbers when both are ready).
>
>>
>> Similarly, while I don't see any users disabling NUMAB1 _and_ enabling
>> this sort of thing, it would be useful to have data on no numa balancing
>> at all. If nothing else, that would measure the effects of the dest
>> node heuristics.
>
> Last time I checked, with the patch, the numbers with NUMAB=0 and NUMAB=1
> did not differ much in the 8GB case because most of the migration
> was handled by kmmscand. This is because, before NUMAB=1 learns and tries
> to migrate, kmmscand would have already migrated.
>
> But a longer-running / larger-memory workload may make more of a difference.
> I will come back with those numbers.

          base NUMAB=2            Patched NUMAB=0
          time in sec (%stdev)    time in sec (%stdev)
===================================================
 8G:       134.33 (0.19)           119.88 ( 0.25)
16G:       292.24 (0.60)           325.06 (11.11)
32G:       585.06 (0.24)           546.15 ( 0.50)
64G:      1278.98 (0.27)          1221.41 ( 1.54)

We can see that the numbers have not changed much between NUMAB=1 and
NUMAB=0 in the patched case.

PS: For 16G there was a bad case where a rare lock contention happened
on the same mm, which we can see from the stdev. This should be taken
care of in the next version.
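
For reference, the run times above can be turned into a %gain figure comparable to the one in the cover letter. The sketch below is only an illustration and not part of the patch set. It assumes %gain = (base - patched) / base * 100, which is inferred from the quoted tables (it reproduces their %gain column) rather than stated anywhere in the thread. The names `pct_gain`, `base_numab2` and `patched_numab0` are made up for this example.

```python
# Run times in seconds, copied from the base NUMAB=2 and patched NUMAB=0
# columns of the table above.
base_numab2 = {"8G": 134.33, "16G": 292.24, "32G": 585.06, "64G": 1278.98}
patched_numab0 = {"8G": 119.88, "16G": 325.06, "32G": 546.15, "64G": 1221.41}


def pct_gain(base, patched):
    """Percent reduction in run time relative to base.

    Assumed formula (inferred from the quoted %gain column);
    a negative value means the patched run was slower.
    """
    return (base - patched) / base * 100.0


# Per-size gain, rounded to two decimals like the quoted tables.
gains = {
    size: round(pct_gain(base_numab2[size], patched_numab0[size]), 2)
    for size in base_numab2
}

for size, gain in gains.items():
    print(f"{size:>4}: {gain:7.2f} %")
```

With this formula the 16G row comes out negative, which matches the outlier run described in the PS.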
[...]
Thread overview: 30+ messages
2025-03-19 19:30 Raghavendra K T
2025-03-19 19:30 ` [RFC PATCH V1 01/13] mm: Add kmmscand kernel daemon Raghavendra K T
2025-03-21 16:06 ` Jonathan Cameron
2025-03-24 15:09 ` Raghavendra K T
2025-03-19 19:30 ` [RFC PATCH V1 02/13] mm: Maintain mm_struct list in the system Raghavendra K T
2025-03-19 19:30 ` [RFC PATCH V1 03/13] mm: Scan the mm and create a migration list Raghavendra K T
2025-03-19 19:30 ` [RFC PATCH V1 04/13] mm: Create a separate kernel thread for migration Raghavendra K T
2025-03-21 17:29 ` Jonathan Cameron
2025-03-24 15:17 ` Raghavendra K T
2025-03-19 19:30 ` [RFC PATCH V1 05/13] mm/migration: Migrate accessed folios to toptier node Raghavendra K T
2025-03-19 19:30 ` [RFC PATCH V1 06/13] mm: Add throttling of mm scanning using scan_period Raghavendra K T
2025-03-19 19:30 ` [RFC PATCH V1 07/13] mm: Add throttling of mm scanning using scan_size Raghavendra K T
2025-03-19 19:30 ` [RFC PATCH V1 08/13] mm: Add initial scan delay Raghavendra K T
2025-03-19 19:30 ` [RFC PATCH V1 09/13] mm: Add heuristic to calculate target node Raghavendra K T
2025-03-21 17:42 ` Jonathan Cameron
2025-03-24 16:17 ` Raghavendra K T
2025-03-19 19:30 ` [RFC PATCH V1 10/13] sysfs: Add sysfs support to tune scanning Raghavendra K T
2025-03-19 19:30 ` [RFC PATCH V1 11/13] vmstat: Add vmstat counters Raghavendra K T
2025-03-19 19:30 ` [RFC PATCH V1 12/13] trace/kmmscand: Add tracing of scanning and migration Raghavendra K T
2025-03-19 19:30 ` [RFC PATCH V1 13/13] prctl: Introduce new prctl to control scanning Raghavendra K T
2025-03-19 23:00 ` [RFC PATCH V1 00/13] mm: slowtier page promotion based on PTE A bit Davidlohr Bueso
2025-03-20 8:51 ` Raghavendra K T
2025-03-20 19:11 ` Raghavendra K T [this message]
2025-03-21 20:35 ` Davidlohr Bueso
2025-03-25 6:36 ` Raghavendra K T
2025-03-20 21:50 ` Davidlohr Bueso
2025-03-21 6:48 ` Raghavendra K T
2025-03-21 15:52 ` Jonathan Cameron
[not found] ` <20250321105309.3521-1-hdanton@sina.com>
2025-03-23 18:14 ` [RFC PATCH V1 09/13] mm: Add heuristic to calculate target node Raghavendra K T
[not found] ` <20250324110543.3599-1-hdanton@sina.com>
2025-03-24 14:54 ` Raghavendra K T