* Re: Slow-tier Page Promotion discussion recap and open questions
2024-12-18 4:19 Slow-tier Page Promotion discussion recap and open questions David Rientjes
@ 2024-12-18 14:50 ` Zi Yan
2024-12-19 6:38 ` Shivank Garg
2024-12-18 15:21 ` Nadav Amit
` (3 subsequent siblings)
4 siblings, 1 reply; 18+ messages in thread
From: Zi Yan @ 2024-12-18 14:50 UTC (permalink / raw)
To: David Rientjes
Cc: Aneesh Kumar, David Hildenbrand, John Hubbard, Kirill Shutemov,
Matthew Wilcox, Mel Gorman, Rao, Bharata Bhasker, Rik van Riel,
RaghavendraKT, Wei Xu, Suyeon Lee, Lei Chen, Shukla, Santosh,
Grimm, Jon, sj, shy828301, Liam Howlett, Gregory Price, linux-mm
On 17 Dec 2024, at 23:19, David Rientjes wrote:
> Hi everybody,
>
> We had a very interactive discussion last week led by RaghavendraKT on
> slow-tier page promotion intended for memory tiering platforms, thank
> you! Thanks as well to everybody who attended and provided great
> questions, suggestions, and feedback.
>
> The RFC patch series "mm: slowtier page promotion based on PTE A bit"[1]
> is a proposal to allow for asynchronous page promotion based on memory
> accesses as an alternative to NUMA Balancing based promotions. There was
> widespread interest in this topic and the discussion surfaced multiple
> use cases and requirements, very focused on CXL use cases.
>
<snip>
> ----->o-----
> I asked about offloading the migration to a data mover, such as the PSP
> for AMD, DMA engine, etc and whether that should be treated entirely
> separately as a topic. Bharata said there was a proof-of-concept
> available from AMD that does just that but the initial results were not
> that encouraging.
>
> Zi asked if the DMA engine saturated the link between the slow and fast
> tiers. If we want to offload to a copy engine, we need to verify that
> the throughput is sufficient or we may be better off using idle cpus to
> perform the migration for us.
<snip>
>
> - we likely want to reconsider the single threaded nature of the kthread
> even if only for NUMA purposes
>
Related to using a DMA engine and/or multiple threads for page migration, I
had a patchset accelerating page migration[1] back in 2019. It showed a good
throughput speedup, ~4x when using 16 threads to copy multiple 2MB THPs. I
think it is time to revisit the topic.
[1] https://lore.kernel.org/linux-mm/20190404020046.32741-1-zi.yan@sent.com/
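The multithreaded copy idea can be sketched in userspace terms. This is a
Python model, not kernel code (the patchset used kernel threads/work items);
the function name and chunking scheme here are illustrative only:

```python
# Userspace model of a multithreaded folio copy: split one large copy into
# per-thread chunks, each worker copying a disjoint contiguous range.
import threading

def parallel_copy(src: bytearray, dst: bytearray, nthreads: int) -> None:
    """Copy src into dst using nthreads workers, one contiguous chunk each."""
    assert len(src) == len(dst)
    chunk = (len(src) + nthreads - 1) // nthreads

    def worker(start: int) -> None:
        end = min(start + chunk, len(src))
        dst[start:end] = src[start:end]  # disjoint range per worker

    threads = [threading.Thread(target=worker, args=(i * chunk,))
               for i in range(nthreads)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
```

In the kernel the win comes from engaging multiple memory controllers and
cores on the copy; in this toy model the GIL hides most of that, so it only
demonstrates the partitioning, not the speedup.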
Best Regards,
Yan, Zi
^ permalink raw reply [flat|nested] 18+ messages in thread

* Re: Slow-tier Page Promotion discussion recap and open questions
2024-12-18 14:50 ` Zi Yan
@ 2024-12-19 6:38 ` Shivank Garg
2024-12-30 5:30 ` David Rientjes
0 siblings, 1 reply; 18+ messages in thread
From: Shivank Garg @ 2024-12-19 6:38 UTC (permalink / raw)
To: Zi Yan, David Rientjes
Cc: Aneesh Kumar, David Hildenbrand, John Hubbard, Kirill Shutemov,
Matthew Wilcox, Mel Gorman, Rao, Bharata Bhasker, Rik van Riel,
RaghavendraKT, Wei Xu, Suyeon Lee, Lei Chen, Shukla, Santosh,
Grimm, Jon, sj, shy828301, Liam Howlett, Gregory Price, linux-mm
On 12/18/2024 8:20 PM, Zi Yan wrote:
> On 17 Dec 2024, at 23:19, David Rientjes wrote:
>
>> Hi everybody,
>>
>> We had a very interactive discussion last week led by RaghavendraKT on
>> slow-tier page promotion intended for memory tiering platforms, thank
>> you! Thanks as well to everybody who attended and provided great
>> questions, suggestions, and feedback.
>>
>> The RFC patch series "mm: slowtier page promotion based on PTE A bit"[1]
>> is a proposal to allow for asynchronous page promotion based on memory
>> accesses as an alternative to NUMA Balancing based promotions. There was
>> widespread interest in this topic and the discussion surfaced multiple
>> use cases and requirements, very focused on CXL use cases.
>>
> <snip>
>> ----->o-----
>> I asked about offloading the migration to a data mover, such as the PSP
>> for AMD, DMA engine, etc and whether that should be treated entirely
>> separately as a topic. Bharata said there was a proof-of-concept
>> available from AMD that does just that but the initial results were not
>> that encouraging.
>>
>> Zi asked if the DMA engine saturated the link between the slow and fast
>> tiers. If we want to offload to a copy engine, we need to verify that
>> the throughput is sufficient or we may be better off using idle cpus to
>> perform the migration for us.
>
> <snip>
>>
>> - we likely want to reconsider the single threaded nature of the kthread
>> even if only for NUMA purposes
>>
>
> Related to using DMA engine and/or multi threads for page migration, I had
> a patchset accelerating page migration[1] back in 2019. It showed good
> throughput speedup, ~4x using 16 threads to copy multiple 2MB THP. I think
> it is time to revisit the topic.
>
>
> [1] https://lore.kernel.org/linux-mm/20190404020046.32741-1-zi.yan@sent.com/
Hi All,
I wanted to provide some additional context regarding the AMD DMA offloading
POC mentioned by Bharata:
https://lore.kernel.org/linux-mm/20240614221525.19170-1-shivankg@amd.com
While the initial results weren't as encouraging as hoped, I plan to improve
this in the next versions of the patchset.
The core idea in my RFC patchset is restructuring the folio move operation
to better leverage DMA hardware. Instead of the current folio-by-folio approach:
for_each_folio() {
        copy metadata + content + update PTEs
}

We batch the operations to minimize overhead:

for_each_folio() {
        copy metadata
}
DMA batch copy all content
for_each_folio() {
        update PTEs
}
My experiment showed that folio copy can consume up to 26.6% of total migration
cost when moving data between NUMA nodes. This suggests significant room for
improvement through DMA offloading, particularly for the larger transfers expected
in CXL scenarios.
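Amdahl's law puts a ceiling on what offloading that 26.6% can buy end to end.
A small userspace arithmetic model (illustrative only, not a measurement):

```python
# Back-of-the-envelope bound implied by the 26.6% figure: even if DMA made
# the folio copy free, total migration time shrinks by at most ~1.36x.

def migration_speedup(copy_fraction: float, copy_speedup: float) -> float:
    """Overall speedup when only the copy phase is accelerated."""
    return 1.0 / ((1.0 - copy_fraction) + copy_fraction / copy_speedup)

def offload_upper_bound(copy_fraction: float) -> float:
    """Limit as the copy phase becomes free (fully offloaded and overlapped)."""
    return 1.0 / (1.0 - copy_fraction)
```

The bound also shows why freeing up CPU cycles (rather than raw latency) may
be the more interesting benefit of DMA offload in this range.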
It would be interesting to work on combining these approaches for optimized
page promotion.
Best regards,
Shivank Garg
^ permalink raw reply [flat|nested] 18+ messages in thread

* Re: Slow-tier Page Promotion discussion recap and open questions
2024-12-19 6:38 ` Shivank Garg
@ 2024-12-30 5:30 ` David Rientjes
2024-12-30 17:33 ` Zi Yan
2025-01-06 9:14 ` Shivank Garg
0 siblings, 2 replies; 18+ messages in thread
From: David Rientjes @ 2024-12-30 5:30 UTC (permalink / raw)
To: Shivank Garg
Cc: Zi Yan, Aneesh Kumar, David Hildenbrand, John Hubbard,
Kirill Shutemov, Matthew Wilcox, Mel Gorman, Rao,
Bharata Bhasker, Rik van Riel, RaghavendraKT, Wei Xu, Suyeon Lee,
Lei Chen, Shukla, Santosh, Grimm, Jon, sj, shy828301,
Liam Howlett, Gregory Price, linux-mm
On Thu, 19 Dec 2024, Shivank Garg wrote:
> On 12/18/2024 8:20 PM, Zi Yan wrote:
> > On 17 Dec 2024, at 23:19, David Rientjes wrote:
> >
> >> Hi everybody,
> >>
> >> We had a very interactive discussion last week led by RaghavendraKT on
> >> slow-tier page promotion intended for memory tiering platforms, thank
> >> you! Thanks as well to everybody who attended and provided great
> >> questions, suggestions, and feedback.
> >>
> >> The RFC patch series "mm: slowtier page promotion based on PTE A bit"[1]
> >> is a proposal to allow for asynchronous page promotion based on memory
> >> accesses as an alternative to NUMA Balancing based promotions. There was
> >> widespread interest in this topic and the discussion surfaced multiple
> >> use cases and requirements, very focused on CXL use cases.
> >>
> > <snip>
> >> ----->o-----
> >> I asked about offloading the migration to a data mover, such as the PSP
> >> for AMD, DMA engine, etc and whether that should be treated entirely
> >> separately as a topic. Bharata said there was a proof-of-concept
> >> available from AMD that does just that but the initial results were not
> >> that encouraging.
> >>
> >> Zi asked if the DMA engine saturated the link between the slow and fast
> >> tiers. If we want to offload to a copy engine, we need to verify that
> >> the throughput is sufficient or we may be better off using idle cpus to
> >> perform the migration for us.
> >
> > <snip>
> >>
> >> - we likely want to reconsider the single threaded nature of the kthread
> >> even if only for NUMA purposes
> >>
> >
> > Related to using DMA engine and/or multi threads for page migration, I had
> > a patchset accelerating page migration[1] back in 2019. It showed good
> > throughput speedup, ~4x using 16 threads to copy multiple 2MB THP. I think
> > it is time to revisit the topic.
> >
> >
> > [1] https://lore.kernel.org/linux-mm/20190404020046.32741-1-zi.yan@sent.com/
>
> Hi All,
>
> I wanted to provide some additional context regarding the AMD DMA offloading
> POC mentioned by Bharata:
> https://lore.kernel.org/linux-mm/20240614221525.19170-1-shivankg@amd.com
>
> While the initial results weren't as encouraging as hoped, I plan to improve this
> in next versions of the patchset.
>
> The core idea in my RFC patchset is restructuring the folio move operation
> to better leverage DMA hardware. Instead of the current folio-by-folio approach:
>
> for_each_folio() {
> copy metadata + content + update PTEs
> }
>
> We batch the operations to minimize overhead:
>
> for_each_folio() {
> copy metadata
> }
> DMA batch copy all content
> for_each_folio() {
> update PTEs
> }
>
> My experiment showed that folio copy can consume up to 26.6% of total migration
> cost when moving data between NUMA nodes. This suggests significant room for
> improvement through DMA offloading, particularly for the larger transfers expected
> in CXL scenarios.
>
> It would be interesting work on combining these approaches for optimized page
> promotion.
>
This is very exciting, thanks Shivank and Zi! The reason I brought this
topic up during the session on asynchronous page promotion for memory
tiering was because page migration is likely going to become *much* more
popular and will be in the critical path under system-wide memory
pressure. Hardware assist and any software optimizations that can go
along with it would certainly be very interesting to discuss.
Shivank, do you have an estimated timeline for when that patch series will
be refreshed? Any planned integration with TMPM?
Zi, are you looking to refresh your series and continue discussing page
migration offload? We could set up another Linux MM Alignment Session
topic focused exactly on this and get representatives from the vendors
involved.
Thanks!
^ permalink raw reply [flat|nested] 18+ messages in thread

* Re: Slow-tier Page Promotion discussion recap and open questions
2024-12-30 5:30 ` David Rientjes
@ 2024-12-30 17:33 ` Zi Yan
2025-01-06 9:14 ` Shivank Garg
1 sibling, 0 replies; 18+ messages in thread
From: Zi Yan @ 2024-12-30 17:33 UTC (permalink / raw)
To: David Rientjes, Shivank Garg
Cc: Aneesh Kumar, David Hildenbrand, John Hubbard, Kirill Shutemov,
Matthew Wilcox, Mel Gorman, Rao, Bharata Bhasker, Rik van Riel,
RaghavendraKT, Wei Xu, Suyeon Lee, Lei Chen, Shukla, Santosh,
Grimm, Jon, sj, shy828301, Liam Howlett, Gregory Price, linux-mm,
Kefeng Wang
On Mon Dec 30, 2024 at 12:30 AM EST, David Rientjes wrote:
> On Thu, 19 Dec 2024, Shivank Garg wrote:
>
> > On 12/18/2024 8:20 PM, Zi Yan wrote:
> > > On 17 Dec 2024, at 23:19, David Rientjes wrote:
> > >
> > >> Hi everybody,
> > >>
> > >> We had a very interactive discussion last week led by RaghavendraKT on
> > >> slow-tier page promotion intended for memory tiering platforms, thank
> > >> you! Thanks as well to everybody who attended and provided great
> > >> questions, suggestions, and feedback.
> > >>
> > >> The RFC patch series "mm: slowtier page promotion based on PTE A bit"[1]
> > >> is a proposal to allow for asynchronous page promotion based on memory
> > >> accesses as an alternative to NUMA Balancing based promotions. There was
> > >> widespread interest in this topic and the discussion surfaced multiple
> > >> use cases and requirements, very focused on CXL use cases.
> > >>
> > > <snip>
> > >> ----->o-----
> > >> I asked about offloading the migration to a data mover, such as the PSP
> > >> for AMD, DMA engine, etc and whether that should be treated entirely
> > >> separately as a topic. Bharata said there was a proof-of-concept
> > >> available from AMD that does just that but the initial results were not
> > >> that encouraging.
> > >>
> > >> Zi asked if the DMA engine saturated the link between the slow and fast
> > >> tiers. If we want to offload to a copy engine, we need to verify that
> > >> the throughput is sufficient or we may be better off using idle cpus to
> > >> perform the migration for us.
> > >
> > > <snip>
> > >>
> > >> - we likely want to reconsider the single threaded nature of the kthread
> > >> even if only for NUMA purposes
> > >>
> > >
> > > Related to using DMA engine and/or multi threads for page migration, I had
> > > a patchset accelerating page migration[1] back in 2019. It showed good
> > > throughput speedup, ~4x using 16 threads to copy multiple 2MB THP. I think
> > > it is time to revisit the topic.
> > >
> > >
> > > [1] https://lore.kernel.org/linux-mm/20190404020046.32741-1-zi.yan@sent.com/
> >
> > Hi All,
> >
> > I wanted to provide some additional context regarding the AMD DMA offloading
> > POC mentioned by Bharata:
> > https://lore.kernel.org/linux-mm/20240614221525.19170-1-shivankg@amd.com
> >
> > While the initial results weren't as encouraging as hoped, I plan to improve this
> > in next versions of the patchset.
> >
> > The core idea in my RFC patchset is restructuring the folio move operation
> > to better leverage DMA hardware. Instead of the current folio-by-folio approach:
> >
> > for_each_folio() {
> > copy metadata + content + update PTEs
> > }
> >
> > We batch the operations to minimize overhead:
> >
> > for_each_folio() {
> > copy metadata
> > }
> > DMA batch copy all content
> > for_each_folio() {
> > update PTEs
> > }
> >
> > My experiment showed that folio copy can consume up to 26.6% of total migration
> > cost when moving data between NUMA nodes. This suggests significant room for
> > improvement through DMA offloading, particularly for the larger transfers expected
> > in CXL scenarios.
> >
> > It would be interesting work on combining these approaches for optimized page
> > promotion.
> >
>
> This is very exciting, thanks Shivank and Zi! The reason I brought this
> topic up during the session on asynchronous page promotion for memory
> tiering was because page migration is likely going to become *much* more
> popular and will be in the critical path under system-wide memory
> pressure. Hardware assist and any software optimizations that can go
> along with it would certainly be very interesting to discuss.
>
> Shivank, do you have an estimated timeline for when that patch series will
> be refreshed? Any planned integration with TMPM?
>
> Zi, are you looking to refresh your series and continue discussing page
> migration offload? We could set up another Linux MM Alignment Session
> topic focused exactly on this and get representatives from the vendors
> involved.
Sure. I have recently been redoing the experiments with multiple threads and
am seeing an even larger throughput increase (up to 10x throughput with 32
threads) on NVIDIA Grace CPUs.
Shivank's approach, using MIGRATE_SYNC_NO_COPY, looks simpler
than what I have done, splitting migrate_folio() into two parts[1]. I
am planning to rebuild my multithreaded folio copy patches on top of
Shivank's patches with some modifications. One thing to note is that
MIGRATE_SYNC_NO_COPY was recently removed by Kefeng (cc'd)[2], so I will
need to bring it back.
[1] https://github.com/x-y-z/linux-dev/tree/batched_page_migration_copy-v6.12
[2] https://lore.kernel.org/all/20240524052843.182275-6-wangkefeng.wang@huawei.com/
--
Best Regards,
Yan, Zi
^ permalink raw reply [flat|nested] 18+ messages in thread

* Re: Slow-tier Page Promotion discussion recap and open questions
2024-12-30 5:30 ` David Rientjes
2024-12-30 17:33 ` Zi Yan
@ 2025-01-06 9:14 ` Shivank Garg
1 sibling, 0 replies; 18+ messages in thread
From: Shivank Garg @ 2025-01-06 9:14 UTC (permalink / raw)
To: David Rientjes
Cc: Zi Yan, Aneesh Kumar, David Hildenbrand, John Hubbard,
Kirill Shutemov, Matthew Wilcox, Mel Gorman, Rao,
Bharata Bhasker, Rik van Riel, RaghavendraKT, Wei Xu, Suyeon Lee,
Lei Chen, Shukla, Santosh, Grimm, Jon, sj, shy828301,
Liam Howlett, Gregory Price, linux-mm
On 12/30/2024 11:00 AM, David Rientjes wrote:
> On Thu, 19 Dec 2024, Shivank Garg wrote:
>
>> On 12/18/2024 8:20 PM, Zi Yan wrote:
>>> On 17 Dec 2024, at 23:19, David Rientjes wrote:
>>>
>>>> Hi everybody,
>>>>
>>>> We had a very interactive discussion last week led by RaghavendraKT on
>>>> slow-tier page promotion intended for memory tiering platforms, thank
>>>> you! Thanks as well to everybody who attended and provided great
>>>> questions, suggestions, and feedback.
>>>>
>>>> The RFC patch series "mm: slowtier page promotion based on PTE A bit"[1]
>>>> is a proposal to allow for asynchronous page promotion based on memory
>>>> accesses as an alternative to NUMA Balancing based promotions. There was
>>>> widespread interest in this topic and the discussion surfaced multiple
>>>> use cases and requirements, very focused on CXL use cases.
>>>>
>>> <snip>
>>>> ----->o-----
>>>> I asked about offloading the migration to a data mover, such as the PSP
>>>> for AMD, DMA engine, etc and whether that should be treated entirely
>>>> separately as a topic. Bharata said there was a proof-of-concept
>>>> available from AMD that does just that but the initial results were not
>>>> that encouraging.
>>>>
>>>> Zi asked if the DMA engine saturated the link between the slow and fast
>>>> tiers. If we want to offload to a copy engine, we need to verify that
>>>> the throughput is sufficient or we may be better off using idle cpus to
>>>> perform the migration for us.
>>>
>>> <snip>
>>>>
>>>> - we likely want to reconsider the single threaded nature of the kthread
>>>> even if only for NUMA purposes
>>>>
>>>
>>> Related to using DMA engine and/or multi threads for page migration, I had
>>> a patchset accelerating page migration[1] back in 2019. It showed good
>>> throughput speedup, ~4x using 16 threads to copy multiple 2MB THP. I think
>>> it is time to revisit the topic.
>>>
>>>
>>> [1] https://lore.kernel.org/linux-mm/20190404020046.32741-1-zi.yan@sent.com/
>>
>> Hi All,
>>
>> I wanted to provide some additional context regarding the AMD DMA offloading
>> POC mentioned by Bharata:
>> https://lore.kernel.org/linux-mm/20240614221525.19170-1-shivankg@amd.com
>>
>> While the initial results weren't as encouraging as hoped, I plan to improve this
>> in next versions of the patchset.
>>
>> The core idea in my RFC patchset is restructuring the folio move operation
>> to better leverage DMA hardware. Instead of the current folio-by-folio approach:
>>
>> for_each_folio() {
>> copy metadata + content + update PTEs
>> }
>>
>> We batch the operations to minimize overhead:
>>
>> for_each_folio() {
>> copy metadata
>> }
>> DMA batch copy all content
>> for_each_folio() {
>> update PTEs
>> }
>>
>> My experiment showed that folio copy can consume up to 26.6% of total migration
>> cost when moving data between NUMA nodes. This suggests significant room for
>> improvement through DMA offloading, particularly for the larger transfers expected
>> in CXL scenarios.
>>
>> It would be interesting work on combining these approaches for optimized page
>> promotion.
>>
>
> This is very exciting, thanks Shivank and Zi! The reason I brought this
> topic up during the session on asynchronous page promotion for memory
> tiering was because page migration is likely going to become *much* more
> popular and will be in the critical path under system-wide memory
> pressure. Hardware assist and any software optimizations that can go
> along with it would certainly be very interesting to discuss.
>
> Shivank, do you have an estimated timeline for when that patch series will
> be refreshed? Any planned integration with TMPM?
Hi David,
It's definitely interesting for us to get it working with SDXI.
I'm going to try it out.
Thanks,
Shivank
>
> Zi, are you looking to refresh your series and continue discussing page
> migration offload? We could set up another Linux MM Alignment Session
> topic focused exactly on this and get representatives from the vendors
> involved.
>
> Thanks!
^ permalink raw reply [flat|nested] 18+ messages in thread
* Re: Slow-tier Page Promotion discussion recap and open questions
2024-12-18 4:19 Slow-tier Page Promotion discussion recap and open questions David Rientjes
2024-12-18 14:50 ` Zi Yan
@ 2024-12-18 15:21 ` Nadav Amit
2024-12-20 11:28 ` Raghavendra K T
2024-12-18 19:23 ` SeongJae Park
` (2 subsequent siblings)
4 siblings, 1 reply; 18+ messages in thread
From: Nadav Amit @ 2024-12-18 15:21 UTC (permalink / raw)
To: David Rientjes
Cc: Aneesh Kumar, David Hildenbrand, John Hubbard, Kirill Shutemov,
Matthew Wilcox, Mel Gorman, Rao, Bharata Bhasker, Rik van Riel,
RaghavendraKT, Wei Xu, Suyeon Lee, Lei Chen, Shukla, Santosh,
Grimm, Jon, sj, shy828301, Zi Yan, Liam R . Howlett,
Gregory Price, open list:MEMORY MANAGEMENT
> On 18 Dec 2024, at 6:19, David Rientjes <rientjes@google.com> wrote:
>
> Hi everybody,
>
> We had a very interactive discussion last week led by RaghavendraKT on
> slow-tier page promotion intended for memory tiering platforms, thank
> you! Thanks as well to everybody who attended and provided great
> questions, suggestions, and feedback.
>
> The RFC patch series "mm: slowtier page promotion based on PTE A bit"[1]
> is a proposal to allow for asynchronous page promotion based on memory
> accesses as an alternative to NUMA Balancing based promotions. There was
> widespread interest in this topic and the discussion surfaced multiple
> use cases and requirements, very focused on CXL use cases.
>
Just sharing my 2 cents.
IIUC, the suggested approach has two benefits:
1. Fewer/no page-faults (as A-bit is used to detect usage)
2. Batching
While (2) seems like a win that might be added on top of AutoNUMA, (1)
is more delicate. As indicated in the patch-set, the "exact destination"
is lost. At the same time, the last time I checked, setting the A-bit
wasn't free and cost something like 550 cycles (others saw similar
results [1]).
So considering that an empty page-fault is ~1050 cycles (a 2014 number
Linus measured [2]), there is a question of how big a win this really is...
[1] https://lore.kernel.org/all/20160620000606.GB3194@blaptop/
[2] Google+ post RIP
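The trade-off those two numbers imply can be written down as a per-page cost
model. This is purely illustrative arithmetic built from the cycle counts
quoted above, and it ignores that hint faults are only taken on access while
A-bit scanning is periodic:

```python
# Rough per-page cost model: A-bit scanning pays ~550 cycles per scanned PTE
# per pass, whether or not the page turns out to be hot; the fault-based
# approach pays ~1050 cycles per hint fault, but only on pages actually used.
ABIT_CYCLES = 550    # cost to set/harvest the Accessed bit per PTE
FAULT_CYCLES = 1050  # cost of one empty (hint) page fault

def scan_cost_per_hot_page(hot_fraction: float) -> float:
    """Amortize the scan cost over the pages that were worth promoting."""
    return ABIT_CYCLES / hot_fraction

def scanning_wins(hot_fraction: float) -> bool:
    """Scanning beats hint faults only when enough scanned pages are hot."""
    return scan_cost_per_hot_page(hot_fraction) < FAULT_CYCLES
```

By this crude model, scanning only pays off per hot page once more than
roughly half (550/1050) of the scanned pages are actually hot.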
^ permalink raw reply [flat|nested] 18+ messages in thread

* Re: Slow-tier Page Promotion discussion recap and open questions
2024-12-18 15:21 ` Nadav Amit
@ 2024-12-20 11:28 ` Raghavendra K T
0 siblings, 0 replies; 18+ messages in thread
From: Raghavendra K T @ 2024-12-20 11:28 UTC (permalink / raw)
To: Nadav Amit, David Rientjes
Cc: Aneesh Kumar, David Hildenbrand, John Hubbard, Kirill Shutemov,
Matthew Wilcox, Mel Gorman, Rao, Bharata Bhasker, Rik van Riel,
RaghavendraKT, Wei Xu, Suyeon Lee, Lei Chen, Shukla, Santosh,
Grimm, Jon, sj, shy828301, Zi Yan, Liam R . Howlett,
Gregory Price, open list:MEMORY MANAGEMENT
On 12/18/2024 8:51 PM, Nadav Amit wrote:
>
>
>> On 18 Dec 2024, at 6:19, David Rientjes <rientjes@google.com> wrote:
>>
>> Hi everybody,
>>
>> We had a very interactive discussion last week led by RaghavendraKT on
>> slow-tier page promotion intended for memory tiering platforms, thank
>> you! Thanks as well to everybody who attended and provided great
>> questions, suggestions, and feedback.
>>
>> The RFC patch series "mm: slowtier page promotion based on PTE A bit"[1]
>> is a proposal to allow for asynchronous page promotion based on memory
>> accesses as an alternative to NUMA Balancing based promotions. There was
>> widespread interest in this topic and the discussion surfaced multiple
>> use cases and requirements, very focused on CXL use cases.
>>
>
> Just sharing my 2 cents.
>
> IIUC, the suggested approach has two benefits:
>
> 1. Fewer/no page-faults (as A-bit is used to detect usage)
> 2. Batching
>
> While (2) seems like a win that might be added un top of AutoNUMA, (1)
> is more delicate. As indicated in the patch-set, the "exact destination”
> is lost. At the same time, the last time I checked, the A-bit setting
> wasn’t free and cost something like 550 cycles (others saw similar
> results [1]).
>
> So considering empty page-fault is ~1050 cycles (2014 number Linus
> measured [2]), there is a question how big of a win it is...
>
>
> [1] https://lore.kernel.org/all/20160620000606.GB3194@blaptop/
> [2] Google+ post RIP
Thanks for the feedback. As I noted in the other post, could the hot-VMA
information detected by A-bit scanning be fed into NUMAB=1 scanning?
Regards
- Raghu
^ permalink raw reply [flat|nested] 18+ messages in thread
* Re: Slow-tier Page Promotion discussion recap and open questions
2024-12-18 4:19 Slow-tier Page Promotion discussion recap and open questions David Rientjes
2024-12-18 14:50 ` Zi Yan
2024-12-18 15:21 ` Nadav Amit
@ 2024-12-18 19:23 ` SeongJae Park
2024-12-19 0:56 ` Gregory Price
2024-12-20 11:21 ` Raghavendra K T
4 siblings, 0 replies; 18+ messages in thread
From: SeongJae Park @ 2024-12-18 19:23 UTC (permalink / raw)
To: David Rientjes
Cc: SeongJae Park, Aneesh Kumar, David Hildenbrand, John Hubbard,
Kirill Shutemov, Matthew Wilcox, Mel Gorman, Rao,
Bharata Bhasker, Rik van Riel, RaghavendraKT, Wei Xu, Suyeon Lee,
Lei Chen, Shukla, Santosh, Grimm, Jon, shy828301, Zi Yan,
Liam Howlett, Gregory Price, linux-mm
On Tue, 17 Dec 2024 20:19:56 -0800 (PST) David Rientjes <rientjes@google.com> wrote:
> Hi everybody,
>
> We had a very interactive discussion last week led by RaghavendraKT on
> slow-tier page promotion intended for memory tiering platforms, thank
> you! Thanks as well to everybody who attended and provided great
> questions, suggestions, and feedback.
>
> The RFC patch series "mm: slowtier page promotion based on PTE A bit"[1]
> is a proposal to allow for asynchronous page promotion based on memory
> accesses as an alternative to NUMA Balancing based promotions. There was
> widespread interest in this topic and the discussion surfaced multiple
> use cases and requirements, very focused on CXL use cases.
Thank you for keeping the series and this great summary, David :)
[...]
> ----->o-----
> I followed up on a discussion point early in the talk about whether this
> should be virtual address scanning like the current approach, walking
> mm_struct's, or the alternative approach which would be physical address
> scanning.
>
> Raghu sees this as a fully alternative approach such as what DAMON uses
> that is based on rmap. The only advantage appears to be avoiding
> scanning on top tier memory completely.
IMHO, there could be more advantages to physical address space based
approaches. Easier handling of unmapped pages and short-lived processes, and
applying different access monitoring / promotion policies to different NUMA
nodes (tiers), are some of those off the top of my head.
>
> ----->o-----
> Wei noted there was a lot of similarities between the RFC implementation
> and the MGLRU page walk functionality and whether it would make sense to
> try to converge these together or make more generally useful.
>
> SeongJae noted that if DAMON logic were used for the scanning that we
> could re-use the existing support for controlling the overhead.
Just to clarify: I added this comment since there were concerns around rmap
overhead for physical address space-based monitoring approaches.
[...]
> My takeaways:
[...]
> - I think virtual memory scanning is likely the only viable approach for
> this purpose and we could store state in the underlying struct page,
> similar to NUMA Balancing, but that all scanning should be driven by
> walking the mm_struct's to harvest the Accessed bit
I don't clearly get why you think virtual memory scanning is the only viable
approach. I'm curious whether you have a pros/cons list of virtual vs
physical address based approaches in mind, and whether you'd be willing to
share it.
[...]
> We'll be looking to incorporate this discussion in our upstream Memory
> Tiering Working Group to accelerate alignment and progress on the
> approach.
Thank you again for your efforts on this!
Thanks,
SJ
[...]
^ permalink raw reply [flat|nested] 18+ messages in thread
* Re: Slow-tier Page Promotion discussion recap and open questions
2024-12-18 4:19 Slow-tier Page Promotion discussion recap and open questions David Rientjes
` (2 preceding siblings ...)
2024-12-18 19:23 ` SeongJae Park
@ 2024-12-19 0:56 ` Gregory Price
2024-12-26 1:28 ` Karim Manaouil
2024-12-20 11:21 ` Raghavendra K T
4 siblings, 1 reply; 18+ messages in thread
From: Gregory Price @ 2024-12-19 0:56 UTC (permalink / raw)
To: David Rientjes
Cc: Aneesh Kumar, David Hildenbrand, John Hubbard, Kirill Shutemov,
Matthew Wilcox, Mel Gorman, Rao, Bharata Bhasker, Rik van Riel,
RaghavendraKT, Wei Xu, Suyeon Lee, Lei Chen, Shukla, Santosh,
Grimm, Jon, sj, shy828301, Zi Yan, Liam Howlett, Gregory Price,
linux-mm
On Tue, Dec 17, 2024 at 08:19:56PM -0800, David Rientjes wrote:
> ----->o-----
> Raghu noted the current promotion destination is node 0 by default. Wei
> noted we could get some page owner information to determine things like
> mempolicies or compute the distance between nodes and, if multiple nodes
> have the same distance, choose one of them just as we do for demotions.
>
> Gregory Price noted some downsides to using mempolicies for this based on
> per-task, per-vma, and cross socket policies, so using the kernel's
> memory tiering policies is probably the best way to go about it.
>
Slightly elaborating here:
- In an async context, associating a page with a specific task is not
presently possible (that I know of). The most we know is the last
accessing CPU - maybe - in the page/folio struct. Right now this
is disabled in favor of a timestamp when tiering is enabled.
a process with 2 tasks which have access to the page may not run
on the same socket, so we run the risk of migrating to a bad target.
Best effort here would suggest either socket is fine - since they're
both "fast nodes" - but this requires that we record the last
accessing CPU for a page at identification time.
- Even if you could associate with a particular task, the task and/or
cgroup are not guaranteed to have a socket affinity. Though obviously
if it does, that can be used (just doesn't satisfy default behavior).
Basically just saying we shouldn't depend on this
- per-vma mempolicies are a potential solution, but they're not very
common in the wild - software would have to become numa aware and
utilize mbind() on particular memory regions.
Likewise we shouldn't depend on this either.
- This holds for future mechanisms like CHMU, whose accessing data is
even more abstract (no concept of accessing task / cpu / owner at all)
More generally - in an async scanning context it's presently not
possible to identify the optimal promotion node - and it likely is
not possible without userland hints.
So probably we should just leverage static configuration data (HMAT)
and some basic math to put together a promotion target in a similar
way to how we calculate a demotion target.
Long-winded way of saying that I don't think an optimal solution is
possible, so let's start with suboptimal and get data.
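The "static configuration data plus basic math" approach can be sketched the
same way demotion targets are derived: pick the nearest top-tier node by
(HMAT-style) NUMA distance. A userspace Python model, with a made-up distance
table and node sets for illustration:

```python
# Sketch of distance-based promotion target selection, mirroring how
# demotion targets are computed. The tie-break (lowest node id) is an
# arbitrary illustrative choice; round-robin would also satisfy
# "either socket is fine" for equidistant fast nodes.

def promotion_target(src: int, distance: dict[tuple[int, int], int],
                     fast_nodes: set[int]) -> int:
    """Pick the top-tier node closest to slow node `src`."""
    return min(fast_nodes, key=lambda n: (distance[(src, n)], n))
```

Per the reasoning above this is knowingly suboptimal (it ignores which socket
the accessor runs on), but it needs no per-page ownership information and so
works in the async scanning context.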
> ----->o-----
> My takeaways:
>
> - there is a definite need to separate hot page detection and the
> promotion path since hot pages may be derived from multiple sources,
> including hardware assists in the future
>
> - for the hot page tracking itself, a common abstraction to be used that
> can effectively describe hotness regardless of the backend it is
> deriving its information from would likely be quite useful
>
In a synchronous context (Accessing Task), something like:

        target_node = numa_node_id;  # cpu we're currently operating on
        promote_pagevec(vec, target_node, PROMOTE_DEFER);

where the promotion logic then does something like:

        promote_batch(pagevec, target)

In an asynchronous context (Scanning Task), something like:

        promote_pagevec(vec, -1, PROMOTE_DEFER);

where the promotion logic then does something like:

        for page in pagevec:
                target = memory_tiers_promotion_target(page_to_nid(page))
                promote(folio, target)
Plumbing-wise this can be optimized to identify similarly located
pages into a sub-pagevec and use promote_batch() semantics.
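That sub-pagevec grouping could be sketched as a pass that buckets folios by source node before batch migration. This is a user-space illustration with stand-in types (`struct folio_stub` and the fixed-size vec are invented, not kernel structures):

```c
/* Stand-in for a folio: we only care about its current node here. */
#define MAX_NODES 4

struct folio_stub { int nid; };

/* Bucket a "pagevec" by source node, filling per-node counts.
 * Returns the number of distinct source nodes - i.e. how many
 * batch-migration calls (with shared TLB-shootdown costs) would
 * be needed instead of one per folio. */
int group_by_source_node(const struct folio_stub *vec, int n,
			 int counts[MAX_NODES])
{
	int groups = 0;

	for (int i = 0; i < MAX_NODES; i++)
		counts[i] = 0;
	for (int i = 0; i < n; i++) {
		if (counts[vec[i].nid]++ == 0)
			groups++;
	}
	return groups;
}
```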
My gut says this is the best we're going to get, since async contexts
can't identify accessor locations easily (especially CHMU).
> - I think virtual memory scanning is likely the only viable approach for
Hard disagree. Virtual memory scanning misses an entire class of
memory: unmapped file cache.
https://lore.kernel.org/linux-mm/20241210213744.2968-1-gourry@gourry.net/
> this purpose and we could store state in the underlying struct page,
This is contentious. Look at folio->_last_cpupid for context; we're
already overloading fields in subtle ways to steal a 32-bit area.
> similar to NUMA Balancing, but that all scanning should be driven by
> walking the mm_struct's to harvest the Accessed bit
>
If the goal is to do multi-tenant tiering (i.e. many mm_struct's), then
this scales poorly by design.
Elsewhere, folks agreed that CXL-memory will have HMU-driven hotness
data as the primary mechanism. This is a physical-memory hotness tracking
mechanism that avoids scanning page tables or page structs.
If we think that's the direction it's going, then we shouldn't bother
investing a ton of effort into a virtual-memory driven design as the
primary user. (Sure, support it, but don't dive too much further in)
> - if there is any general pushback on leveraging a kthread for this,
> this would be very good feedback to have early
>
I think for the promotion system, having one or more kthreads based on
promotion pressure is a good idea.
I'm not sure how well this will scale for many-process, high-memory
systems (on a 1TB+ system, scanning 256MB per interval gives very low
accuracy).
Need more data!
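As a rough back-of-the-envelope check on that scaling concern (my arithmetic, not numbers from the thread):

```c
/* Time for one full sweep of the slow tier: total memory divided by
 * the per-pass scan size gives the number of passes, each gated by
 * the scan period. Purely an illustrative estimate. */
unsigned long full_sweep_seconds(unsigned long total_mb,
				 unsigned long scan_mb_per_pass,
				 unsigned long period_ms)
{
	unsigned long passes =
		(total_mb + scan_mb_per_pass - 1) / scan_mb_per_pass;

	return passes * period_ms / 1000;
}
```

With 1TB of slow-tier memory, 256MB per pass, and even the minimum 400ms period, one full sweep takes 4096 passes, roughly 1638 seconds (~27 minutes) - which is what makes the accuracy concern concrete.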
~Gregory
^ permalink raw reply [flat|nested] 18+ messages in thread

* Re: Slow-tier Page Promotion discussion recap and open questions
2024-12-19 0:56 ` Gregory Price
@ 2024-12-26 1:28 ` Karim Manaouil
2024-12-30 5:36 ` David Rientjes
0 siblings, 1 reply; 18+ messages in thread
From: Karim Manaouil @ 2024-12-26 1:28 UTC (permalink / raw)
To: Gregory Price
Cc: David Rientjes, Aneesh Kumar, David Hildenbrand, John Hubbard,
Kirill Shutemov, Matthew Wilcox, Mel Gorman, Rao,
Bharata Bhasker, Rik van Riel, RaghavendraKT, Wei Xu, Suyeon Lee,
Lei Chen, Shukla, Santosh, Grimm, Jon, sj, shy828301, Zi Yan,
Liam Howlett, Gregory Price, linux-mm
On Wed, Dec 18, 2024 at 07:56:19PM -0500, Gregory Price wrote:
> On Tue, Dec 17, 2024 at 08:19:56PM -0800, David Rientjes wrote:
> > ----->o-----
> > Raghu noted the current promotion destination is node 0 by default. Wei
> > noted we could get some page owner information to determine things like
> > mempolicies or compute the distance between nodes and, if multiple nodes
> > have the same distance, choose one of them just as we do for demotions.
> >
> > Gregory Price noted some downsides to using mempolicies for this based on
> > per-task, per-vma, and cross socket policies, so using the kernel's
> > memory tiering policies is probably the best way to go about it.
> >
>
> Slightly elaborating here:
> - In an async context, associating a page with a specific task is not
> presently possible (that I know of). The most we know is the last
> accessing CPU - maybe - in the page/folio struct. Right now this
> is disabled in favor of a timestamp when tiering is enabled.
>
> a process with 2 tasks which have access to the page may not run
> on the same socket, so we run the risk of migrating to a bad target.
> Best effort here would suggest either socket is fine - since they're
> both "fast nodes" - but this requires that we record the last
> accessing CPU for a page at identification time.
>
This can be solved with a two-step migration: first, you promote the
page from CXL to a NUMA node, then you rely on NUMA balancing to
further place the page into the right NUMA node. NUMA hint faults can
still be enabled for pages allocated from NUMA nodes, but not for CXL.
Best
Karim
Edinburgh University
^ permalink raw reply [flat|nested] 18+ messages in thread
* Re: Slow-tier Page Promotion discussion recap and open questions
2024-12-26 1:28 ` Karim Manaouil
@ 2024-12-30 5:36 ` David Rientjes
2024-12-30 6:51 ` Raghavendra K T
2025-01-06 17:02 ` Gregory Price
0 siblings, 2 replies; 18+ messages in thread
From: David Rientjes @ 2024-12-30 5:36 UTC (permalink / raw)
To: Karim Manaouil
Cc: Gregory Price, Aneesh Kumar, David Hildenbrand, John Hubbard,
Kirill Shutemov, Matthew Wilcox, Mel Gorman, Rao,
Bharata Bhasker, Rik van Riel, RaghavendraKT, Wei Xu, Suyeon Lee,
Lei Chen, Shukla, Santosh, Grimm, Jon, sj, shy828301, Zi Yan,
Liam Howlett, Gregory Price, linux-mm
On Thu, 26 Dec 2024, Karim Manaouil wrote:
> On Wed, Dec 18, 2024 at 07:56:19PM -0500, Gregory Price wrote:
> > On Tue, Dec 17, 2024 at 08:19:56PM -0800, David Rientjes wrote:
> > > ----->o-----
> > > Raghu noted the current promotion destination is node 0 by default. Wei
> > > noted we could get some page owner information to determine things like
> > > mempolicies or compute the distance between nodes and, if multiple nodes
> > > have the same distance, choose one of them just as we do for demotions.
> > >
> > > Gregory Price noted some downsides to using mempolicies for this based on
> > > per-task, per-vma, and cross socket policies, so using the kernel's
> > > memory tiering policies is probably the best way to go about it.
> > >
> >
> > Slightly elaborating here:
> > - In an async context, associating a page with a specific task is not
> > presently possible (that I know of). The most we know is the last
> > accessing CPU - maybe - in the page/folio struct. Right now this
> > is disabled in favor of a timestamp when tiering is enabled.
> >
> > a process with 2 tasks which have access to the page may not run
> > on the same socket, so we run the risk of migrating to a bad target.
> > Best effort here would suggest either socket is fine - since they're
> > both "fast nodes" - but this requires that we record the last
> > accessing CPU for a page at identification time.
> >
>
> > This can be solved with a two-step migration: first, you promote the
> > page from CXL to a NUMA node, then you rely on NUMA balancing to
> > further place the page into the right NUMA node. NUMA hint faults can
> > still be enabled for pages allocated from NUMA nodes, but not for CXL.
>
I think it would be a shame to promote to the wrong top-tier NUMA node and
rely on NUMA Balancing to fix it up with yet another migration :/
Since these cpuless memory nodes should have a promotion node associated
with them, which defaults to the latency given to us by the HMAT, can we
make that the default promotion target when memory is accessed? The
"normal mode" for NUMA Balancing could fix this up subsequent to the
promotion, but only if enabled.
Raghu noted in the session that the current patch series only promotes to
node 0 but that choice is only for the RFC. I *assume* that every CXL
memory node will have a standard top-tier node to promote to *or* that we
stash that promotion node information at the time of demotion so memory
comes back to the same node it was demoted from.
Either way, this feels like a solvable problem?
^ permalink raw reply [flat|nested] 18+ messages in thread
* Re: Slow-tier Page Promotion discussion recap and open questions
2024-12-30 5:36 ` David Rientjes
@ 2024-12-30 6:51 ` Raghavendra K T
2025-01-06 17:02 ` Gregory Price
1 sibling, 0 replies; 18+ messages in thread
From: Raghavendra K T @ 2024-12-30 6:51 UTC (permalink / raw)
To: David Rientjes, Karim Manaouil
Cc: Gregory Price, Aneesh Kumar, David Hildenbrand, John Hubbard,
Kirill Shutemov, Matthew Wilcox, Mel Gorman, Rao,
Bharata Bhasker, Rik van Riel, RaghavendraKT, Wei Xu, Suyeon Lee,
Lei Chen, Shukla, Santosh, Grimm, Jon, sj, shy828301, Zi Yan,
Liam Howlett, Gregory Price, linux-mm
On 12/30/2024 11:06 AM, David Rientjes wrote:
> On Thu, 26 Dec 2024, Karim Manaouil wrote:
>
>> On Wed, Dec 18, 2024 at 07:56:19PM -0500, Gregory Price wrote:
>>> On Tue, Dec 17, 2024 at 08:19:56PM -0800, David Rientjes wrote:
>>>> ----->o-----
>>>> Raghu noted the current promotion destination is node 0 by default. Wei
>>>> noted we could get some page owner information to determine things like
>>>> mempolicies or compute the distance between nodes and, if multiple nodes
>>>> have the same distance, choose one of them just as we do for demotions.
>>>>
>>>> Gregory Price noted some downsides to using mempolicies for this based on
>>>> per-task, per-vma, and cross socket policies, so using the kernel's
>>>> memory tiering policies is probably the best way to go about it.
>>>>
>>>
>>> Slightly elaborating here:
>>> - In an async context, associating a page with a specific task is not
>>> presently possible (that I know of). The most we know is the last
>>> accessing CPU - maybe - in the page/folio struct. Right now this
>>> is disabled in favor of a timestamp when tiering is enabled.
>>>
>>> a process with 2 tasks which have access to the page may not run
>>> on the same socket, so we run the risk of migrating to a bad target.
>>> Best effort here would suggest either socket is fine - since they're
>>> both "fast nodes" - but this requires that we record the last
>>> accessing CPU for a page at identification time.
>>>
>>
>> This can be solved with a two-step migration: first, you promote the
>> page from CXL to a NUMA node, then you rely on NUMA balancing to
>> further place the page into the right NUMA node. NUMA hint faults can
>> still be enabled for pages allocated from NUMA nodes, but not for CXL.
>>
>
> I think it would be a shame to promote to the wrong top-tier NUMA node and
> rely on NUMA Balancing to fix it up with yet another migration :/
Agree here. The advantage of promotion is lost, considering the
typical access-time ratio between CXL and a regular node that we have
currently.
>
> Since these cpuless memory nodes should have a promotion node associated
> with them, which defaults to the latency given to us by the HMAT, can we
> make that the default promotion target when memory is accessed? The
> "normal mode" for NUMA Balancing could fix this up subsequent to the
> promotion, but only if enabled.
>
> Raghu noted in the session that the current patch series only promotes to
> node 0 but that choice is only for the RFC. I *assume* that every CXL
> memory node will have a standard top-tier node to promote to *or* that we
> stash that promotion node information at the time of demotion so memory
> comes back to the same node it was demoted from.
>
> Either way, this feels like a solvable problem?
How about sharing the hint between NUMAB mode=1 and the kernel thread?
For example, NUMAB mode=1 needs help identifying hot VMAs to scan
(which is supplied by the kernel thread), whereas the promotion target
is kept at the VMA level as a hint based on hint faults? (Thinking out
loud here.)
Even a top-tier node statically associated with each CXL node might
work, but I need to think more here.
PS: When I ran my experiment with NUMAB mode=1, the benefit of the
kernel thread was intact.
Thanks and Regards
- Raghu
^ permalink raw reply [flat|nested] 18+ messages in thread
* Re: Slow-tier Page Promotion discussion recap and open questions
2024-12-30 5:36 ` David Rientjes
2024-12-30 6:51 ` Raghavendra K T
@ 2025-01-06 17:02 ` Gregory Price
1 sibling, 0 replies; 18+ messages in thread
From: Gregory Price @ 2025-01-06 17:02 UTC (permalink / raw)
To: David Rientjes
Cc: Karim Manaouil, Aneesh Kumar, David Hildenbrand, John Hubbard,
Kirill Shutemov, Matthew Wilcox, Mel Gorman, Rao,
Bharata Bhasker, Rik van Riel, RaghavendraKT, Wei Xu, Suyeon Lee,
Lei Chen, Shukla, Santosh, Grimm, Jon, sj, shy828301, Zi Yan,
Liam Howlett, linux-mm
On Sun, Dec 29, 2024 at 09:36:41PM -0800, David Rientjes wrote:
> On Thu, 26 Dec 2024, Karim Manaouil wrote:
>
> memory node will have a standard top-tier node to promote to *or* that we
> stash that promotion node information at the time of demotion
Not sure stashing this information generalizes the way you want, and
may require space in the folio/page struct that may not be available.
Things also get more complicated the further away from virtual memory
mappings you get, because associations become weaker (physical pages
backing file cache may have weak, ephemeral associations, for example).
We should consider the implications of limiting promotion to one or
more levels in a single proximity domain - and whether the "two step"
migration process is ultimately harmful or not.
Just trying to save ourselves from crawling down a rabbit hole.
~Gregory
^ permalink raw reply [flat|nested] 18+ messages in thread
* Re: Slow-tier Page Promotion discussion recap and open questions
2024-12-18 4:19 Slow-tier Page Promotion discussion recap and open questions David Rientjes
` (3 preceding siblings ...)
2024-12-19 0:56 ` Gregory Price
@ 2024-12-20 11:21 ` Raghavendra K T
2025-01-02 4:44 ` David Rientjes
4 siblings, 1 reply; 18+ messages in thread
From: Raghavendra K T @ 2024-12-20 11:21 UTC (permalink / raw)
To: David Rientjes, Aneesh Kumar, David Hildenbrand, John Hubbard,
Kirill Shutemov, Matthew Wilcox, Mel Gorman, Rao,
Bharata Bhasker, Rik van Riel, RaghavendraKT, Wei Xu, Suyeon Lee,
Lei Chen, Shukla, Santosh, Grimm, Jon, sj, shy828301, Zi Yan,
Liam Howlett, Gregory Price
Cc: linux-mm
On 12/18/2024 9:49 AM, David Rientjes wrote:
> Hi everybody,
>
Hello David,
This is an excellent recap and summary.
Thank you for this summary and also for the opportunity.
Some additions on points I had not elaborated enough during the
discussion (and perhaps some more points along the way).
> We had a very interactive discussion last week led by RaghavendraKT on
> slow-tier page promotion intended for memory tiering platforms, thank
> you! Thanks as well to everybody who attended and provided great
> questions, suggestions, and feedback.
>
> The RFC patch series "mm: slowtier page promotion based on PTE A bit"[1]
> is a proposal to allow for asynchronous page promotion based on memory
> accesses as an alternative to NUMA Balancing based promotions. There was
> widespread interest in this topic and the discussion surfaced multiple
> use cases and requirements, very focused on CXL use cases.
>
> ----->o-----
> Raghu noted that the current approach utilizing NUMA Balancing focuses on
> scan and *migration* in process context, which often gets observed as
> latency spikes. This led to an idea for scanning of the PTE Accessed bit
> and promotion to be handled by a kthread instead. In Raghu's proposal,
> this is called kmmscand. For every mm on the system, the vmas are
> scanned and a migration list is created that feeds into page migration.
>
> To avoid scanning the entire process address space, however, there is a
> per-process scan period and scan size. Scanning the vmas continue while
> still in the scan period. When the scan size is complete, the scanning
> transitions into the migration phase.
>
> High level, the scan period and scan size are adjusted based on the
> accessed folios that were observed in the last scan.
>
> ----->o-----
> I asked if this was really done single threaded, which was confirmed. If
> only a single process has pages on a slow memory tier, for example, then
> flexible tuning of the scan period and size ensures we do not scan
> needlessly. The scan period can be tuned to be more responsive (down to
> 400ms in this proposal) depending on how many accesses we have on the
> last scan; similarly, it can be much less responsive (up to 5s) if memory
> is not found to be accessed.
>
> I also asked if scanning can be disabled entirely, Raghu clarified that
> it cannot be.
>
We have a sysfs tunable (kmmscand/scan_enabled) to enable/disable the
whole scanning at a global level but not at process level granularity.
> Wei Xu asked if the scan period should be interpreted as the minimal
> interval between scans because kmmscand is single threaded and there are
> many processes. Raghu confirmed this is correct, the minimal delay.
> Even if the scan period is 400ms, in reality it could be multiple seconds
> based on load.
>
> Liam Howlett asked how we could have two scans colliding in a time
> segment. Raghu noted if we are able to complete the last scan in less
> time than 400ms, then we have this delay to avoid continuously scanning
> that results in increased cpu overhead. Liam further asked if processes
> opt into a scan or out of the scan, Raghu noted we always scan every
> process. John Hubbard suggested that we have per-process control.
+1 for prctl()
Also I want to add that I will get data on the min and max time
required to finish the entire scan for the current micro-benchmark and
one real workload (such as Redis/RocksDB), so that we can check
whether we are meeting the scanning deadline with a single kthread.
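For reference, the adaptive scan period described in the recap (400ms to 5s, adjusted by how many accessed folios the last scan found) can be sketched roughly as below. The halving/doubling policy and the 25% threshold are my guesses at a minimal version, not the actual kmmscand code; only the 400ms..5s bounds come from the discussion:

```c
/* Bounds taken from the RFC discussion; the adjustment policy
 * itself is an invented minimal example. */
#define SCAN_PERIOD_MIN_MS  400
#define SCAN_PERIOD_MAX_MS 5000

/* Halve the period (more responsive) when the last scan found many
 * accessed folios, double it (less responsive) when it found few. */
unsigned int next_scan_period(unsigned int cur_ms,
			      unsigned long accessed,
			      unsigned long scanned)
{
	/* "Many" here means more than 25% of scanned folios accessed. */
	if (scanned && accessed * 4 > scanned)
		cur_ms /= 2;
	else
		cur_ms *= 2;

	if (cur_ms < SCAN_PERIOD_MIN_MS)
		cur_ms = SCAN_PERIOD_MIN_MS;
	if (cur_ms > SCAN_PERIOD_MAX_MS)
		cur_ms = SCAN_PERIOD_MAX_MS;
	return cur_ms;
}
```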
>
> ----->o-----
> Zi Yan asked a great question about how this would interact with LRU
> information used for page reclaim. The scanning could interfere with
> cold page detection because it manipulates the Accessed bits.
>
> Wei noted that the kernel leverages page_young for this so during
> scanning we need to transfer the Accessed bit information into
> page_young. This is what idle page tracking currently uses to not
> interfere with anything that harvests the Accessed bit. The scan only
> cares about the Accessed bit.
>
> Zi asked how this would be handled if processes are allowed to opt out,
> in other words, if some processes are propagating their Accessed bits to
> page_young and others are not. Wei clarified that for page reclaim, the
> Accessed bit and page_young should both be checked and are treated
> equally.
>
> Wei noted a subtlety here where MGLRU does not currently check
> page_young. Since multiple users of the Accessed bit exist, MGLRU should
> likely check page_young as well.
>
> Bharata B Rao noted this is equivalent to how idle page tracking handles
> this behavior as well as DAMON.
I think not much change is expected here.
>
> ----->o-----
> John Hubbard suggested that this scanning may very well be multi-threaded
> and there's no explicit reason to avoid it. (I didn't bring it up at the
> time, but I think this is required just for NUMA purposes.) Otherwise it
> won't scale well. Raghu noted we have a global mm list, but will think
> about this for future iterations.
>
Ideally, since the kthread is intended to keep hot data in the top
tier, we could easily have one kthread per top-tier node, affined to
the CPUs spanning that node, plus hotplug callbacks.
Here we had a single CPU-less CXL node, so I went ahead with a single
kthread without hotplug callbacks.
I do agree that eventually we need to have one per slow-tier node OR
one per node available in the system.
> ----->o-----
> Raghu noted the current promotion destination is node 0 by default. Wei
> noted we could get some page owner information to determine things like
> mempolicies or compute the distance between nodes and, if multiple nodes
> have the same distance, choose one of them just as we do for demotions.
>
> Gregory Price noted some downsides to using mempolicies for this based on
> per-task, per-vma, and cross socket policies, so using the kernel's
> memory tiering policies is probably the best way to go about it.
>
> ----->o-----
> Wei asked about benchmark results and why migration time was reduced
> given the same amount of memory to migrate. Raghu noted the only
> difference was the migration path, so things like kswapd or page
> allocation did not spend a lot of time trying to reclaim memory for the
> migration to succeed. This can happen if migrating to a nearly full
> target NUMA node.
>
> Raghu also noted that the migration time is not exactly comparable
> between NUMA Balancing and kmmscand. We're also not tracking things like
> timestamp and storing state to migrate after multiple accesses. Zi also
> noted that migrating batched memory has some optimizations especially for
> tlb shootdowns.
+1
>
> ----->o-----
> Wei noted an important point about separating hot page detection and
> promotion, which don't actually need to be coupled at all. This uses
> page table scanning while future support may not need to leverage this at
> all. We'd very much like to avoid multiple promotion solutions for
> different ways to track page hotness.
>
> I strongly supported this because I believe for CXL, at least within the
> next three years, that memory hotness will likely not be derived from
> page table Accessed bit scanning. Zi Yan agreed.
>
> The promotion path may also want to be much less aggressive than on first
> access. Raghu showed many improvements, including handling short lived
> processes, more accurate hot page detection using timestamp, etc.
Some of these TODOs can be implemented in the next version.
>
> ----->o-----
> I asked about offloading the migration to a data mover, such as the PSP
> for AMD, DMA engine, etc and whether that should be treated entirely
> separately as a topic. Bharata said there was a proof-of-concept
> available from AMD that does just that but the initial results were not
> that encouraging.
>
> Zi asked if the DMA engine saturated the link between the slow and fast
> tiers. If we want to offload to a copy engine, we need to verify that
> the throughput is sufficient or we may be better off using idle cpus to
> perform the migration for us.
>
> ----->o-----
> I followed up on a discussion point early in the talk about whether this
> should be virtual address scanning like the current approach, walking
> mm_struct's, or the alternative approach which would be physical address
> scanning.
>
> Raghu sees this as a fully alternative approach such as what DAMON uses
> that is based on rmap. The only advantage appears to be avoiding
> scanning on top tier memory completely.
Having clarity here would help. Both approaches have their own pros
and cons.
We also need to explore using/reusing DAMON, MGLRU, etc. to the extent
possible, depending on the approach.
>
> ----->o-----
> Wei noted there was a lot of similarities between the RFC implementation
> and the MGLRU page walk functionality and whether it would make sense to
> try to converge these together or make more generally useful.
>
+1
> SeongJae noted that if DAMON logic were used for the scanning that we
> could re-use the existing support for controlling the overhead.
+1
>
> John echoed the idea of leveraging the learnings from MGLRU in this,
> additionally for trying to get more use of MGLRU. Wei noted there are
> MGLRU optimizations that we can leverage such as when the pmd Accessed
> bit is clear we don't need to iterate any further for that scan.
>
Agree.
> ----->o-----
> My takeaways:
>
> - the memory tiering discussion that I led at LSF/MM/BPF this year also
> focused on asynchronous memory migration, decoupled from NUMA
> Balancing and I strongly believe this is the right direction
>
Strongly agree.
> - the per-process control seems important and with no obvious downsides
> as John noted, so likely better to ensure that some processes can opt
> out of scanning with a prctl()
>
+1
> - it likely makes sense for MGLRU to also check page_young as Wei noted
> so this deals with the transfer of the Accessed bit to page_young
> evenly for all processes, even when opting out
>
> - we likely want to reconsider the single threaded nature of the kthread
> even if only for NUMA purposes
>
> - using node 0 for all target migrations is only for illustrative
> purposes, this will definitely need to be more thought out such as
> using the kernel's understanding of the memory tiers on the system as
> Gregory pointed out
Agree. I hope with some more brainstorming, we could achieve this.
>
> - we want to ensure that the promotion node is a very reasonable
> destination target, it would be unfortunate to rely on NUMA Balancing
> to then migrate memory again once it's promoted to get the proper
> affinity :)
Strongly agree. Promoting to the wrong top-tier node loses the entire
benefit, given the current ratio of access latency between a remote
node and a CXL node.
>
> - promotion on first access will likely need to be reconsidered, which
> is not even used by NUMA Balancing. We'll likely need to store some
> state to promote memory that is repeatedly being accessed as opposed
> to treating a single access as though the memory must be promoted
>
Just thinking out loud here: how about using first access as a feeder
for an independent hot-page detection module? The current approach is
stateless, i.e., once we determine that a page was accessed, we add it
to the migration list and forget about it.
Can we have this as a feeder for the normal NUMAB algorithm to detect
hot VMAs?
The reason I took this approach is that, if we had to go for
timestamp / access-history based logic, we would need a hashlist,
some finite hash bucket size, etc.
The current microbenchmark, with an 8GB hot region on CXL, already
involved 2 million pages in a very short span.
Either way, this needs some more thought.
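A bounded-memory access-history filter of the kind hinted at above could look roughly like this user-space sketch. The table size, promotion threshold, and recycle-on-collision policy are arbitrary illustrative choices - the point is only that memory stays finite even with millions of candidate pages:

```c
/* Direct-mapped table of per-PFN access counts. On a hash collision
 * the slot is simply recycled, trading accuracy for bounded memory.
 * All sizes/thresholds are invented example values. */
#define TABLE_SIZE 1024
#define PROMOTE_THRESHOLD 2

struct hot_slot {
	unsigned long pfn;
	unsigned int count;
};

static struct hot_slot table[TABLE_SIZE];

/* Record one observed access; return 1 once the page has been seen
 * accessed PROMOTE_THRESHOLD times, i.e. when it qualifies for the
 * migration list. */
int record_access(unsigned long pfn)
{
	struct hot_slot *s = &table[pfn % TABLE_SIZE];

	if (s->pfn != pfn) {	/* empty slot or collision: recycle */
		s->pfn = pfn;
		s->count = 0;
	}
	return ++s->count >= PROMOTE_THRESHOLD;
}
```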
> - there is a definite need to separate hot page detection and the
> promotion path since hot pages may be derived from multiple sources,
> including hardware assists in the future
>
> - for the hot page tracking itself, a common abstraction to be used that
> can effectively describe hotness regardless of the backend it is
> deriving its information from would likely be quite useful
+1
>
> - I think virtual memory scanning is likely the only viable approach for
> this purpose and we could store state in the underlying struct page,
> similar to NUMA Balancing, but that all scanning should be driven by
> walking the mm_struct's to harvest the Accessed bit
>
> - re-using the MGLRU page walk implementation would likely make the
> kmmscand scanning implementation much simpler
>
Will explore this.
> - if there is any general pushback on leveraging a kthread for this,
> this would be very good feedback to have early
>
This was one of the most important pieces of feedback I was looking
for: is there any important reason why we should or should not go with
a kthread?
The only reason I found was the scanning / overall CPU overhead.
So, based on SJ's feedback, I brought the overhead down significantly
compared to the initial implementation (from 434s to 4s in the 8GB
case). This is now comparable, but current NUMAB scanning still has
lower overhead.
> We'll be looking to incorporate this discussion in our upstream Memory
> Tiering Working Group to accelerate alignment and progress on the
> approach.
>
+1
> If you are interested in participating in this series of discussions,
> please let me know in email. Everybody is welcome to participate and
> we'll have summary email threads such as this one to follow-up on the
> mailing lists.
>
> Raghu, do you have plans for your next version of the RFC?
>
> Thanks!
>
Depending on how much radical change is required to the current
implementation, I am hopeful I can come up with the next revision in
mid-to-late January (considering the year-end holidays).
> [1] https://lore.kernel.org/linux-mm/20241201153818.2633616-1-raghavendra.kt@amd.com/T/#t
Thanks and Regards
- Raghu
^ permalink raw reply [flat|nested] 18+ messages in thread

* Re: Slow-tier Page Promotion discussion recap and open questions
2024-12-20 11:21 ` Raghavendra K T
@ 2025-01-02 4:44 ` David Rientjes
2025-01-06 6:29 ` Raghavendra K T
2025-01-08 5:43 ` Raghavendra K T
0 siblings, 2 replies; 18+ messages in thread
From: David Rientjes @ 2025-01-02 4:44 UTC (permalink / raw)
To: Raghavendra K T
Cc: Aneesh Kumar, David Hildenbrand, John Hubbard, Kirill Shutemov,
Matthew Wilcox, Mel Gorman, Rao, Bharata Bhasker, Rik van Riel,
RaghavendraKT, Wei Xu, Suyeon Lee, Lei Chen, Shukla, Santosh,
Grimm, Jon, sj, shy828301, Zi Yan, Liam Howlett, Gregory Price,
linux-mm
On Fri, 20 Dec 2024, Raghavendra K T wrote:
> > I asked if this was really done single threaded, which was confirmed. If
> > only a single process has pages on a slow memory tier, for example, then
> > flexible tuning of the scan period and size ensures we do not scan
> > needlessly. The scan period can be tuned to be more responsive (down to
> > 400ms in this proposal) depending on how many accesses we have on the
> > last scan; similarly, it can be much less responsive (up to 5s) if memory
> > is not found to be accessed.
> >
> > I also asked if scanning can be disabled entirely, Raghu clarified that
> > it cannot be.
> >
>
> We have a sysfs tunable (kmmscand/scan_enabled) to enable/disable the
> whole scanning at a global level but not at process level granularity.
>
Thanks Raghu for the clarification. I think during the discussion
there was a preference to make this multi-threaded so we don't rely on
a single kmmscand thread; perhaps this would be (at minimum) one
kmmscand thread per NUMA node?
> > Wei Xu asked if the scan period should be interpreted as the minimal
> > interval between scans because kmmscand is single threaded and there are
> > many processes. Raghu confirmed this is correct, the minimal delay.
> > Even if the scan period is 400ms, in reality it could be multiple seconds
> > based on load.
> >
> > Liam Howlett asked how we could have two scans colliding in a time
> > segment. Raghu noted if we are able to complete the last scan in less
> > time than 400ms, then we have this delay to avoid continuously scanning
> > that results in increased cpu overhead. Liam further asked if processes
> > opt into a scan or out of the scan, Raghu noted we always scan every
> > process. John Hubbard suggested that we have per-process control.
>
> +1 for prctl()
>
> Also I want to add that, I will get data on:
>
> what is the min and max time required to finish the entire scan for the
> current micro-benchmark and one of the real workload (such as Redis/
> Rocksdb...), so that we can check if we are meeting the deadline of
> scanning with single kthread.
>
Do we want more fine-grained per-process control other than just the
ability to opt out entire processes? There may be situations where we
want to always serve latency-tolerant jobs from CXL extended memory
and never promote their memory, but I also think there will be
processes that fall between the two extremes (latency-critical and
latency-tolerant).
I think careful consideration needs to be given to how we handle
per-process policy for multi-tenant systems that have different levels of
latency sensitivity. If kmmscand becomes the standard way of doing page
promotion in the kernel, the userspace API to inform it of these policy
decisions is going to be key. There have been approaches primarily
driven by BPF that have to solve the same challenge.
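One hypothetical shape for such a per-process policy space, sketched in user-space C: the enum names and thresholds are invented purely for illustration (no such prctl or kernel interface exists), but they show how the policy could be richer than a binary opt-in/opt-out:

```c
/* Invented policy space: two extremes plus intermediate behavior. */
enum promo_policy {
	PROMO_DEFAULT,		/* kernel decides, normal scanning */
	PROMO_NEVER,		/* latency-tolerant: stay on CXL forever */
	PROMO_CONSERVATIVE,	/* promote only after repeated accesses */
	PROMO_AGGRESSIVE,	/* latency-critical: promote on first access */
};

/* Would a page with @nr_accesses observed accesses qualify for
 * promotion under @policy? Thresholds are arbitrary examples. */
int should_promote(enum promo_policy policy, unsigned int nr_accesses)
{
	switch (policy) {
	case PROMO_NEVER:
		return 0;
	case PROMO_AGGRESSIVE:
		return nr_accesses >= 1;
	case PROMO_CONSERVATIVE:
		return nr_accesses >= 4;
	case PROMO_DEFAULT:
	default:
		return nr_accesses >= 2;
	}
}
```

The interesting design question is less the mechanism than who sets the policy - a prctl()-style per-process call, a cgroup attribute, or a BPF hook would each distribute that responsibility differently.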
> > Wei noted an important point about separating hot page detection and
> > promotion, which don't actually need to be coupled at all. This uses
> > page table scanning while future support may not need to leverage this at
> > all. We'd very much like to avoid multiple promotion solutions for
> > different ways to track page hotness.
> >
> > I strongly supported this because I believe for CXL, at least within the
> > next three years, that memory hotness will likely not be derived from
> > page table Accessed bit scanning. Zi Yan agreed.
> >
> > The promotion path may also want to be much less aggressive than on first
> > access. Raghu showed many improvements, including handling short lived
> > processes, more accurate hot page detection using timestamp, etc.
>
> Some of these TODOs can be implemented in next version.
>
Thanks! Are you planning on sending out another RFC patch series soon or
are you interested in publishing this on git.kernel.org or github? There
may be an opportunity for others to send you pull requests into the series
of patches while we discuss.
> > ----->o-----
> > I followed up on a discussion point early in the talk about whether this
> > should be virtual address scanning like the current approach, walking
> > mm_struct's, or the alternative approach which would be physical address
> > scanning.
> >
> > Raghu sees this as a fully alternative approach such as what DAMON uses
> > that is based on rmap. The only advantage appears to be avoiding
> > scanning on top tier memory completely.
>
> Having clarity here would help. Both approaches have their own pros
> and cons.
>
> We also need to explore using/reusing DAMON, MGLRU, etc. to the extent
> possible, based on the approach.
>
Yeah, I definitely think this is a key point to discuss early on. Gregory
had indicated that unmapped file cache is one of the key downsides to
using only virtual memory scanning.
While things like the CHMU are still on the way, I think there's benefit
to making incremental progress from what we currently have available (NUMA
Balancing) before we get there.
^ permalink raw reply [flat|nested] 18+ messages in thread
* Re: Slow-tier Page Promotion discussion recap and open questions
2025-01-02 4:44 ` David Rientjes
@ 2025-01-06 6:29 ` Raghavendra K T
2025-01-08 5:43 ` Raghavendra K T
1 sibling, 0 replies; 18+ messages in thread
From: Raghavendra K T @ 2025-01-06 6:29 UTC (permalink / raw)
To: David Rientjes
Cc: Aneesh Kumar, David Hildenbrand, John Hubbard, Kirill Shutemov,
Matthew Wilcox, Mel Gorman, Rao, Bharata Bhasker, Rik van Riel,
RaghavendraKT, Wei Xu, Suyeon Lee, Lei Chen, Shukla, Santosh,
Grimm, Jon, sj, shy828301, Zi Yan, Liam Howlett, Gregory Price,
linux-mm
On 1/2/2025 10:14 AM, David Rientjes wrote:
> On Fri, 20 Dec 2024, Raghavendra K T wrote:
>
>>> I asked if this was really done single threaded, which was confirmed. If
>>> only a single process has pages on a slow memory tier, for example, then
>>> flexible tuning of the scan period and size ensures we do not scan
>>> needlessly. The scan period can be tuned to be more responsive (down to
>>> 400ms in this proposal) depending on how many accesses we have on the
>>> last scan; similarly, it can be much less responsive (up to 5s) if memory
>>> is not found to be accessed.
>>>
>>> I also asked if scanning can be disabled entirely, Raghu clarified that
>>> it cannot be.
>>>
>>
>> We have a sysfs tunable (kmmscand/scan_enabled) to enable/disable the
>> whole scanning at a global level but not at process level granularity.
>>
>
> Thanks Raghu for the clarification. I think during discussion that there
> was a preference to make this multi-threaded so we didn't rely on a single
> kmmscand thread, perhaps this would be (at minimum) one kmmscand thread
> per NUMA node?
>
Correct. From my side, a bit more thought is needed on:
- whether we need a kthread for CXL nodes too
- how to share the scanning work between kthreads, etc.
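On the scan period discussed above (a 400ms floor up to a 5s ceiling, adjusted by how many accesses the last pass found), the adaptation could be sketched roughly as below. The doubling/halving policy and the exact constants are assumptions for illustration only, not the RFC's actual algorithm:

```c
#include <assert.h>

#define SCAN_PERIOD_MIN_MS	400	/* most responsive interval */
#define SCAN_PERIOD_MAX_MS	5000	/* least responsive interval */

/*
 * Pick the delay before the next scan pass: back off (double) when the
 * last pass found no accessed pages, tighten (halve) when it did, and
 * clamp the result to [SCAN_PERIOD_MIN_MS, SCAN_PERIOD_MAX_MS].
 */
static unsigned int next_scan_period_ms(unsigned int cur_ms,
					unsigned long accessed_pages)
{
	unsigned int next = accessed_pages ? cur_ms / 2 : cur_ms * 2;

	if (next < SCAN_PERIOD_MIN_MS)
		next = SCAN_PERIOD_MIN_MS;
	if (next > SCAN_PERIOD_MAX_MS)
		next = SCAN_PERIOD_MAX_MS;
	return next;
}
```

With one kmmscand thread per NUMA node, each thread could track its own period independently, so an idle slow-tier node backs off to 5s while a busy one stays near 400ms.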
>>> Wei Xu asked if the scan period should be interpreted as the minimal
>>> interval between scans because kmmscand is single threaded and there are
>>> many processes. Raghu confirmed this is correct, the minimal delay.
>>> Even if the scan period is 400ms, in reality it could be multiple seconds
>>> based on load.
>>>
>>> Liam Howlett asked how we could have two scans colliding in a time
>>> segment. Raghu noted if we are able to complete the last scan in less
>>> time than 400ms, then we have this delay to avoid continuously scanning
>>> that results in increased cpu overhead. Liam further asked if processes
>>> opt into a scan or out of the scan, Raghu noted we always scan every
>>> process. John Hubbard suggested that we have per-process control.
>>
>> +1 for prctl()
>>
>> Also I want to add that I will get data on:
>>
>> the min and max time required to finish the entire scan for the current
>> micro-benchmark and one of the real workloads (such as Redis/RocksDB...),
>> so that we can check whether we are meeting the scanning deadline with a
>> single kthread.
>>
>
> Do we want more fine-grained per-process control other than just the
> ability to opt out entire processes? There may be situations where we
> want to always serve latency tolerant jobs from CXL extended memory and
> never care to promote their memory, but I also think there will be
> processes that are between the two extremes (latency critical and latency
> tolerant).
>
> I think careful consideration needs to be given to how we handle
> per-process policy for multi-tenant systems that have different levels of
> latency sensitivity. If kmmscand becomes the standard way of doing page
> promotion in the kernel, the userspace API to inform it of these policy
> decisions is going to be key. There have been approaches where this was
> primarily driven by BPF, which has to solve the same challenge.
>
Very good point.
How do we defer/skip? Should we provide a numeric value expressing a
latency sensitivity range?
This could perhaps be provided along with a normal enable/disable knob.
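To make the numeric-value idea concrete: suppose a hypothetical per-process knob in the range 0..100 (set via something like a prctl() option, which does not exist today), where 0 means fully latency tolerant (skip the task) and 100 means latency critical. One assumed mapping onto the scan-period range, purely for discussion, could be:

```c
#include <assert.h>

#define SENS_MIN	0	/* fully latency tolerant: opt out */
#define SENS_MAX	100	/* latency critical */
#define PERIOD_MIN_MS	400	/* most responsive scan period */
#define PERIOD_MAX_MS	5000	/* least responsive scan period */

/*
 * Map a hypothetical per-process sensitivity value to a scan period.
 * Returns 0 to skip the task entirely (behaves like enable/disable),
 * otherwise linearly interpolates: more sensitive tasks get shorter
 * (more responsive) periods.
 */
static unsigned int task_scan_period_ms(int sensitivity)
{
	if (sensitivity <= SENS_MIN)
		return 0;	/* opt out of scanning altogether */
	if (sensitivity > SENS_MAX)
		sensitivity = SENS_MAX;

	return PERIOD_MAX_MS -
	       (unsigned int)((PERIOD_MAX_MS - PERIOD_MIN_MS) *
			      (long)sensitivity / SENS_MAX);
}
```

This folds the plain enable/disable case into the same interface (value 0), which may be simpler than carrying two separate controls.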
>>> Wei noted an important point about separating hot page detection and
>>> promotion, which don't actually need to be coupled at all. This uses
>>> page table scanning while future support may not need to leverage this at
>>> all. We'd very much like to avoid multiple promotion solutions for
>>> different ways to track page hotness.
>>>
>>> I strongly supported this because I believe for CXL, at least within the
>>> next three years, that memory hotness will likely not be derived from
>>> page table Accessed bit scanning. Zi Yan agreed.
>>>
>>> The promotion path may also want to be much less aggressive than on first
>>> access. Raghu showed many improvements, including handling short lived
>>> processes, more accurate hot page detection using timestamp, etc.
>>
>> Some of these TODOs can be implemented in the next version.
>>
>
> Thanks! Are you planning on sending out another RFC patch series soon or
> are you interested in publishing this on git.kernel.org or github? There
> may be an opportunity for others to send you pull requests into the series
> of patches while we discuss.
>
Good idea, will do. Perhaps I will post the simple changes that are
needed immediately as the next RFC, and also publish on github soon (I
will explore internally how that is done here).
>>> ----->o-----
>>> I followed up on a discussion point early in the talk about whether this
>>> should be virtual address scanning like the current approach, walking
>>> mm_struct's, or the alternative approach which would be physical address
>>> scanning.
>>>
>>> Raghu sees this as a fully alternative approach such as what DAMON uses
>>> that is based on rmap. The only advantage appears to be avoiding
>>> scanning on top tier memory completely.
>>
>> Having clarity here would help. Both approaches have their own pros
>> and cons.
>>
>> We also need to explore using/reusing DAMON, MGLRU, etc. to the extent
>> possible, based on the approach.
>>
>
> Yeah, I definitely think this is a key point to discuss early on. Gregory
> had indicated that unmapped file cache is one of the key downsides to
> using only virtual memory scanning.
>
> While things like the CHMU are still on the way, I think there's benefit
> to making incremental progress from what we currently have available (NUMA
> Balancing) before we get there.
Agree.
Regards
- Raghu
* Re: Slow-tier Page Promotion discussion recap and open questions
2025-01-02 4:44 ` David Rientjes
2025-01-06 6:29 ` Raghavendra K T
@ 2025-01-08 5:43 ` Raghavendra K T
1 sibling, 0 replies; 18+ messages in thread
From: Raghavendra K T @ 2025-01-08 5:43 UTC (permalink / raw)
To: David Rientjes, Raghavendra K T
Cc: Aneesh Kumar, David Hildenbrand, John Hubbard, Kirill Shutemov,
Matthew Wilcox, Mel Gorman, Rao, Bharata Bhasker, Rik van Riel,
Wei Xu, Suyeon Lee, Lei Chen, Shukla, Santosh, Grimm, Jon, sj,
shy828301, Zi Yan, Liam Howlett, Gregory Price, linux-mm,
ligang.bdlg
+ Gang Li
On 1/2/2025 10:14 AM, David Rientjes wrote:
> On Fri, 20 Dec 2024, Raghavendra K T wrote:
[...]
>>> Wei Xu asked if the scan period should be interpreted as the minimal
>>> interval between scans because kmmscand is single threaded and there are
>>> many processes. Raghu confirmed this is correct, the minimal delay.
>>> Even if the scan period is 400ms, in reality it could be multiple seconds
>>> based on load.
>>>
>>> Liam Howlett asked how we could have two scans colliding in a time
>>> segment. Raghu noted if we are able to complete the last scan in less
>>> time than 400ms, then we have this delay to avoid continuously scanning
>>> that results in increased cpu overhead. Liam further asked if processes
>>> opt into a scan or out of the scan, Raghu noted we always scan every
>>> process. John Hubbard suggested that we have per-process control.
>>
>> +1 for prctl()
>>
>> Also I want to add that I will get data on:
>>
>> the min and max time required to finish the entire scan for the current
>> micro-benchmark and one of the real workloads (such as Redis/RocksDB...),
>> so that we can check whether we are meeting the scanning deadline with a
>> single kthread.
>>
>
> Do we want more fine-grained per-process control other than just the
> ability to opt out entire processes? There may be situations where we
> want to always serve latency tolerant jobs from CXL extended memory and
> never care to promote their memory, but I also think there will be
> processes that are between the two extremes (latency critical and latency
> tolerant).
>
> I think careful consideration needs to be given to how we handle
> per-process policy for multi-tenant systems that have different levels of
> latency sensitivity. If kmmscand becomes the standard way of doing page
> promotion in the kernel, the userspace API to inform it of these policy
> decisions is going to be key. There have been approaches where this was
> primarily driven by BPF, which has to solve the same challenge.
>
I came across a prctl() approach for NUMA Balancing by Gang Li,
Link:
https://lore.kernel.org/lkml/20220224075227.27127-1-ligang.bdlg@bytedance.com/T/
I hope there was no strong reason against merging it (just checking,
since it will be almost the same for our case here).
[...]
Regards
- Raghu