* [LSF/MM/BPF TOPIC v2] Unifying sources of page temperature information - what info is actually wanted?
From: Jonathan Cameron @ 2025-03-19 12:47 UTC
To: Raghavendra K T, Bharata B Rao, SeongJae Park, lsf-pc, linux-mm
Cc: Michal Hocko, Dan Williams, linuxarm, Matthew Wilcox,
Johannes Weiner, Gregory Price
Prior to LSFMM, this is an update on where the discussion has gone on-list
since the original proposal back in January (which was buried in the
thread for Ragha's proposal focused on PTE A bit scanning).
v1: https://lore.kernel.org/all/20250131130901.00000dd1@huawei.com/
Note that this combines comments and discussion from many people, and I may
well have summarized things badly or missed key details. If time allows,
I'll post a v3 once people have ripped up this straw man.
Bharata has posted code for one approach and discussion is ongoing:
https://lore.kernel.org/linux-mm/20250306054532.221138-1-bharata@amd.com/
This proposal overlaps with parts of several other proposals (DAMON, access
bit tracking, etc.), but the focus is intended to be more general.
Abstract:
We have:
1) A range of different technologies tracking what may be loosely defined
as the hotness of regions of memory.
2) A set of use cases that care about this data.
Question:
Is it useful or feasible to aggregate the data from the sources (1) to some
layer before providing answers to (2)? What should that layer look like?
What services and abstractions should it provide? Is there commonality in
what those use cases need?
By aggregate I'm not necessarily implying multiple techniques in use at
once, but more that we want one interface driven by whatever solution
is the right balance on a particular system. That balance can be affected
by hardware availability or characteristics of the system or workload.
Note that many of the hotness driven actions are painful (e.g. migration
of hot pages) and for those we need to be very sure it is a good idea
to do anything at all!
My assumption is that in at least some cases the problem will be too hard
to solve in the kernel, but let's consider what we can do.
On to the details:
------------------
Note: I'm ignoring the low-level implementation details of each method
(how they avoid resource exhaustion, tune sampling timing / epoch length,
and choose what is sampled: full scans, random sampling, etc.) as in at
least some cases that's a problem for the lowest, technique-specific level.
Enumerating the cases (thanks to Bharata, Johannes, SJ and others for their
input on this!). Much of this is quoted directly from this thread:
https://lore.kernel.org/all/de31971e-98fc-4baf-8f4f-09d153902e2e@amd.com/
(particularly Bharata's reply to my original questions)
Here is a compilation of available temperature sources and how the
hot/access data is consumed by different subsystems:
PA - Physical address available
VA - Virtual address available
AA - Access time available
NA - Accessing node info available
==================================================
Temperature PA VA AA NA
source
==================================================
PROT_NONE faults Y Y Y Y
--------------------------------------------------
folio_mark_accessed() Y Y Y
--------------------------------------------------
PTE A bit Y Y N* N
--------------------------------------------------
Platform hints Y Y Y Y
(AMD IBS)
--------------------------------------------------
Device hints Y N N N
(CXL HMU)
==================================================
* Some information available from scanning timing.
In all cases, other methods can be applied to fill in the missing data
(rmap walks etc.).
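To make the table concrete, here is a purely illustrative sketch of what a
normalized sample record could look like if we did aggregate these sources.
Nothing like this exists today and every name below is made up; each backend
would fill in only the fields it can observe and flag them as valid,
mirroring the PA/VA/AA/NA columns above:

/* Hypothetical sketch only, not an existing kernel structure. */
#include <stdint.h>

enum hotness_valid {
	HOT_VALID_PA   = 1 << 0,  /* physical address (pfn) */
	HOT_VALID_VA   = 1 << 1,  /* virtual address + owning task */
	HOT_VALID_TIME = 1 << 2,  /* access timestamp */
	HOT_VALID_NID  = 1 << 3,  /* accessing NUMA node */
};

struct hotness_sample {
	uint8_t  valid;          /* bitmask of enum hotness_valid */
	uint8_t  source;         /* which backend produced the sample */
	int32_t  accessing_nid;  /* NUMA node that issued the access */
	uint64_t pfn;            /* page frame number, if PA is known */
	uint64_t vaddr;          /* virtual address, if VA is known */
	uint64_t time_ns;        /* access time, if known */
	int32_t  pid;            /* address-space owner for vaddr, or -1 */
};

A CXL HMU backend would set only the PA flag, whereas a PROT_NONE fault could
set all four, which is exactly the gap-filling problem noted above.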
And here is an attempt to compile how different subsystems
use the above data:
==========================================================================================
Source Subsystem Consumption Activation/Frequency
==========================================================================================
PROT_NONE faults NUMAB NUMAB=1 locality based While task is running,
via process pgtable balancing rate varies on observed
walk NUMAB=2 hot page locality and sysctl knobs.
promotion
==========================================================================================
folio_mark_accessed() FS/filemap/GUP LRU list activation On cache access and unmap
==========================================================================================
PTE A bit via Reclaim:LRU LRU list activation, During memory pressure
rmap walk deactivation/demotion
==========================================================================================
PTE A bit via Reclaim:MGLRU LRU list activation, - During memory pressure
rmap walk and process deactivation/demotion - Continuous sampling (configurable)
pgtable walk for workingset reporting
==========================================================================================
PTE A bit via DAMON LRU activation,
rmap walk hot page promotion,
demotion etc
==========================================================================================
Platform hints NUMAB NUMAB=1 Locality based
(e.g. AMD IBS) balancing and
NUMAB=2 hot page
promotion
==========================================================================================
Device hints NUMAB NUMAB=2 hot page
(e.g. CXL HMU) promotion
==========================================================================================
PG_young / PG_idle ?
==========================================================================================
Technique trade-offs:
Why not just use one method?
- Cost of capture, cost of use.
* Run all the time - aggregate data for stability of hotness.
* Run occasionally to minimize cost.
- Different availability, e.g. IBS might be needed for other things, and
hardware monitors may not be available.
Straw man (based in part on the IBS proposal linked above)
---------------------------------------------------
The different sources start to look similar, but at different levels of the stack.
Taking just tiering promotion as an example, and keeping in mind the golden
rule of tiered memory (put data in the right place to start with if you
can), this is about the cases where you can't: unaware applications,
changing memory pressure, shifting workload mix, etc.
_____________________ __________________
| Sampling techniques | | Hardware units |
| - Access counter, | | CXL HMU etc |
| - Trace based | |_________________|
|_____________________| |
| Hot page
Events |
| |
__________v___________ |
| Events to counts | |
| - hashtable, sketch | |
| etc | |
|______________________| |
| |
Hot page |
| |
___________V______________________V_________
| Hot list - responsible for stability? |
|____________________________________________|
|
Timely hotlist data
| Additional data (process newness, stack location...?)
__________v__________________|___
| Promotion Daemon |
|_________________________________|
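As a concrete (and purely illustrative) example of the "Events to counts" box,
a count-min sketch keeps the memory cost bounded no matter how many distinct
pages generate events, at the price of over-estimating some counts. The width,
depth and hash below are arbitrary choices for the sketch, not a proposed
implementation:

#include <stdint.h>
#include <string.h>

#define CMS_DEPTH 4
#define CMS_WIDTH 4096                 /* must be a power of two here */

struct cms {
	uint32_t count[CMS_DEPTH][CMS_WIDTH];
};

static uint64_t mix64(uint64_t x)      /* splitmix64 finaliser */
{
	x ^= x >> 30; x *= 0xbf58476d1ce4e5b9ULL;
	x ^= x >> 27; x *= 0x94d049bb133111ebULL;
	return x ^ (x >> 31);
}

static void cms_reset(struct cms *s)   /* call at each epoch boundary */
{
	memset(s->count, 0, sizeof(s->count));
}

static void cms_record_access(struct cms *s, uint64_t pfn)
{
	for (int d = 0; d < CMS_DEPTH; d++)
		s->count[d][mix64(pfn + d) & (CMS_WIDTH - 1)]++;
}

/* Never under-estimates: good enough for "is this page plausibly hot?" */
static uint32_t cms_estimate(const struct cms *s, uint64_t pfn)
{
	uint32_t min = UINT32_MAX;

	for (int d = 0; d < CMS_DEPTH; d++) {
		uint32_t c = s->count[d][mix64(pfn + d) & (CMS_WIDTH - 1)];
		if (c < min)
			min = c;
	}
	return min;
}

The hot list layer would then be fed the pages whose estimates cross a
threshold, along with whatever metadata survived from the sample.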
For all paths where data flows down, we probably need control parameters
flowing back the other way; and if we have multiple users of the data
stream, we need to satisfy each of their constraints.
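Again purely as a straw man for what "control parameters flowing back the
other way" could mean (all names invented): each source registers a small ops
structure describing what it can provide, and the aggregation layer merges the
constraints of its consumers (e.g. minimum sample period, union of monitored
regions) and pushes the result back down:

#include <stdbool.h>
#include <stdint.h>

struct hotness_params {
	uint64_t sample_period_ns;   /* how often to sample / scan */
	uint64_t region_start_pfn;   /* restrict monitoring; 0 = everything */
	uint64_t region_nr_pages;
	bool     need_vaddr;         /* a consumer wants VA resolution */
	bool     need_node;          /* a consumer wants the accessing node */
};

struct hotness_source_ops {
	const char *name;
	unsigned int provides;       /* mask of HOT_VALID_* from the sketch above */
	int  (*start)(const struct hotness_params *p);
	void (*stop)(void);
	/* Re-tune a running source when consumer constraints change. */
	int  (*reconfigure)(const struct hotness_params *p);
};

The interesting (and unresolved) part is the merge policy when consumers
disagree, e.g. one wants cheap occasional scans while another wants continuous
sampling.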
SJ has proposed perhaps extending DAMON as a possible interface layer. I have
yet to understand how that works in cases where regions do not provide
a compact representation because the hotness is not contiguous.
An example use case is a hypervisor wanting to migrate data underneath unaware,
cheap VMs. After a system has been running for a while (particularly with hot
pages being migrated, swap, etc.) the hotness map looks much like noise.
Now for the "there be monsters" bit...
---------------------------------------
- Stability of hotness matters and is hard to establish.
Predict a page will remain hot - various heuristics.
a) It is hot, probably stays so? (super hot!)
Sometimes enough to be detected as hot once,
often not.
b) It has been hot a while, probably stays so.
Check the current hot list against the previous one;
only entries present in both are promoted.
This has a problem if the hot list is small compared to
the total count of hot pages. Say the list covers 1% of pages
and 20% are actually hot: each list then samples only ~5% of
the hot pages, so there is a low chance of repeats even among
genuinely hot pages (see the toy simulation after this list).
c) It is hot, let's monitor a while before doing anything.
Measurement technique may change. Maybe cheaper
to monitor 'candidate' pages than all pages
e.g. CXL HMU gives 1000 pages, then we use access bit
sampling to check they are at least accessed N times
in next second.
d) It was hot, we moved it. Did it stay hot?
More useful for identifying when we are thrashing and should
just stop doing anything. Too late to fix this one!
- Some data should be considered hot even when not in use (e.g. stack)
- Use cases interfere, so it can't just be a broadcast model
where hotness information is sent to all users.
- When to stop or start migration / tracking?
a) Detecting bad decisions. After enough bad decisions, is it better
to do nothing?
b) Metadata beyond the counts is useful
https://lore.kernel.org/all/87h64u2xkh.fsf@DESKTOP-5N7EMDA/
Promotion algorithms can need aggregate statistics for a memory
device to decide how much to move.
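To put numbers on the sampling problem in b), here is a toy user-space
simulation using the figures from the example above; it deliberately assumes a
perfect detector (the list only ever contains genuinely hot pages), so reality
is worse:

#include <stdio.h>
#include <stdlib.h>

#define NR_PAGES 1000000   /* total pages */
#define NR_HOT    200000   /* 20% genuinely hot; hot pages are ids 0..NR_HOT-1 */
#define LIST_LEN   10000   /* hot list holds 1% of all pages */

/* One epoch's hot list: LIST_LEN distinct hot pages picked at random. */
static void sample_list(unsigned char *in_list)
{
	for (int i = 0; i < NR_HOT; i++)
		in_list[i] = 0;
	for (int n = 0; n < LIST_LEN; ) {
		int page = rand() % NR_HOT;
		if (!in_list[page]) {
			in_list[page] = 1;
			n++;
		}
	}
}

int main(void)
{
	static unsigned char prev[NR_HOT], cur[NR_HOT];
	long both = 0;

	srand(42);                    /* deterministic toy run */
	sample_list(prev);
	sample_list(cur);
	for (int i = 0; i < NR_HOT; i++)
		both += prev[i] && cur[i];

	/* Expectation is LIST_LEN * LIST_LEN / NR_HOT = 500, i.e. only ~5%
	 * of the list survives the "seen in both epochs" test even though
	 * every page in it really is hot.
	 */
	printf("entries present in both consecutive hot lists: %ld of %d\n",
	       both, LIST_LEN);
	return 0;
}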
As noted above, this may well overlap with other sessions.
One outcome of the discussion so far has been to highlight what I think many
already knew: this is hard!
Jonathan
* Re: [LSF/MM/BPF TOPIC v2] Unifying sources of page temperature information - what info is actually wanted?
From: SeongJae Park @ 2025-03-19 23:50 UTC
To: Jonathan Cameron
Cc: SeongJae Park, Raghavendra K T, Bharata B Rao, lsf-pc, linux-mm,
Michal Hocko, Dan Williams, linuxarm, Matthew Wilcox,
Johannes Weiner, Gregory Price
Hi Jonathan,
On Wed, 19 Mar 2025 12:47:53 +0000 Jonathan Cameron <Jonathan.Cameron@huawei.com> wrote:
[...]
> And here is an attempt to compile how different subsystems
> use the above data:
> ==========================================================================================
> Source Subsystem Consumption Activation/Frequency
> ==========================================================================================
> PROT_NONE faults NUMAB NUMAB=1 locality based While task is running,
> via process pgtable balancing rate varies on observed
> walk NUMAB=2 hot page locality and sysctl knobs.
> promotion
> ==========================================================================================
> folio_mark_accessed() FS/filemap/GUP LRU list activation On cache access and unmap
> ==========================================================================================
> PTE A bit via Reclaim:LRU LRU list activation, During memory pressure
> rmap walk deactivation/demotion
> ==========================================================================================
> PTE A bit via Reclaim:MGLRU LRU list activation, - During memory pressure
> rmap walk and process deactivation/demotion - Continuous sampling (configurable)
> pgtable walk for workingset reporting
> ==========================================================================================
> PTE A bit via DAMON LRU activation,
> rmap walk hot page promotion,
> demotion etc
For the virtual address space monitoring mode, DAMON uses the PTE A bit via
a pgtable walk.
Its activation and frequency are basically set as the user requests.
Activation can be set to be reactive to memory-pressure-like events (using
watermarks). Frequency can be auto-tuned to pursue a target ratio of access
events per snapshot.
[...]
> SJ has proposed perhaps extending DAMON as a possible interface layer. I have
> yet to understand how that works in cases where regions do not provide
> a compact representation because the hotness is not contiguous.
> An example use case is a hypervisor wanting to migrate data underneath unaware,
> cheap VMs. After a system has been running for a while (particularly with hot
> pages being migrated, swap, etc.) the hotness map looks much like noise.
Similar concerns about DAMON's region abstraction were raised for physical
address space monitoring, because there is no deliberate effort to gather
hot pages together (i.e., no locality).
I'd argue there is no deliberate effort to spread the temperature out
either, though. As a result, we can expect some degree of unintentional
bias, and that matches my experience from DAMON use cases in production
environments so far.
Also, in practice, DAMON regions are used in combination with other
information. For example, DAMON-based reclaim checks the PTE A bit of each
page in a DAMON-suggested cold memory region to make the final decision
about whether or not to reclaim it, as MADV_PAGEOUT does.
That is, yes, I agree DAMON's region abstraction may not be a good way to
find a perfect answer to some questions, such as finding the N-th hottest
single page. And it has plenty of room to improve. Nevertheless, even
today's DAMON can give good-enough best-effort answers to questions that
are practical in some cases, such as finding the regions that may contain
the N hottest/coldest pages, while keeping the monitoring overhead fixed as
users request.
Also, please note that there is no reason to restrict DAMON to always
using the regions abstraction. For different use cases and situations,
DAMON is open to being extended with new abstractions. DAMON aims to be a
subsystem not for the DAMON-regions concept, but for data access monitoring
with practical efficiency, and to continue evolving for the environments it
is given.
>
> Now for the "there be monsters bit"...
> ---------------------------------------
>
> - Stability of hotness matters and is hard to establish.
> Predict a page will remain hot - various heuristics.
> a) It is hot, probably stays so? (super hot!)
> Sometimes enough to be detected as hot once,
> often not.
> b) It has been hot a while, probably stays so.
> Check this hot list against previous hot list,
> entries in both needed to promote.
> This has a problem if hotlist is small compared to
> total count of hot pages. Say list is 1%, 20% actually
> hot, low chance of repeats even in hot pages.
> c) It is hot, let's monitor a while before doing anything.
> Measurement technique may change. Maybe cheaper
> to monitor 'candidate' pages than all pages
> e.g. CXL HMU gives 1000 pages, then we use access bit
> sampling to check they are at least accessed N times
> in next second.
> d) It was hot, we moved it. Did it stay hot?
> More useful for identifying when we are thrashing and should
> just stop doing anything. Too late to fix this one!
DAMON provides a sort of b) approach, namely DAMON regions' age, for finding
both hot and cold regions.
> - Some data should be considered hot even when not in use (e.g. stack)
DAMOS filters are for this kind of exception, and the DAMON kernel API is
flexible enough to let callers directly manipulate the region information
based on their special knowledge. We can further optimize the interface for
easier use, of course.
> - Usecases interfere. So it can't just be a broadcast mode
> where hotness information is sent to all users.
> - When to stop, start migration / tracking?
> a) Detecting bad decisions. Enough bad decisions, better to
> do nothing?
> b) Metadata beyond the counts is useful
> https://lore.kernel.org/all/87h64u2xkh.fsf@DESKTOP-5N7EMDA/
> Promotion algorithms can need aggregate statistics for a memory
> device to decide how much to move.
The DAMOS quota goal feature is aimed at this kind of question. It allows
users to set a target metric and value, and to tune the aggressiveness. For
promotions and demotions, I suggested using upper-tier utilization and free
ratio as possible goal metrics, and I'm going to post an implementation for
that soon.
>
> As noted above, this may well overlap with other sessions.
> One outcome of the discussion so far is to highlight what I think many
> already knew. This is hard!
Indeed. Keeping more people on the same page is important and difficult.
Thank you again for your effort, and I'm looking forward to discussing this in more depth!
Thanks,
SJ
>
> Jonathan
* Re: [LSF/MM/BPF TOPIC v2] Unifying sources of page temperature information - what info is actually wanted?
From: Jonathan Cameron @ 2025-03-21 15:30 UTC
To: SeongJae Park
Cc: Raghavendra K T, Bharata B Rao, lsf-pc, linux-mm, Michal Hocko,
Dan Williams, linuxarm, Matthew Wilcox, Johannes Weiner,
Gregory Price
> > And here is an attempt to compile how different subsystems
> > use the above data:
> > ==========================================================================================
> > Source Subsystem Consumption Activation/Frequency
> > ==========================================================================================
> > PROT_NONE faults NUMAB NUMAB=1 locality based While task is running,
> > via process pgtable balancing rate varies on observed
> > walk NUMAB=2 hot page locality and sysctl knobs.
> > promotion
> > ==========================================================================================
> > folio_mark_accessed() FS/filemap/GUP LRU list activation On cache access and unmap
> > ==========================================================================================
> > PTE A bit via Reclaim:LRU LRU list activation, During memory pressure
> > rmap walk deactivation/demotion
> > ==========================================================================================
> > PTE A bit via Reclaim:MGLRU LRU list activation, - During memory pressure
> > rmap walk and process deactivation/demotion - Continuous sampling (configurable)
> > pgtable walk for workingset reporting
> > ==========================================================================================
> > PTE A bit via DAMON LRU activation,
> > rmap walk hot page promotion,
> > demotion etc
>
> For the virtual address space monitoring mode, DAMON uses the PTE A bit via
> a pgtable walk.
>
> Its activation and frequency are basically set as the user requests.
> Activation can be set to be reactive to memory-pressure-like events (using
> watermarks). Frequency can be auto-tuned to pursue a target ratio of access
> events per snapshot.
Thanks. I've added that (in very brief form) to the table in my slides.
> > SJ has proposed perhaps extending Damon as a possible interface layer. I am
> > yet to understand how that works in cases where regions do not provide
> > a compact representation due to lack of contiguity in the hotness.
> > An example usecase is hypervisor wanting to migrate data under unaware,
> > cheap VMs. After a system has been running for a while (particularly with hot
> > pages being migrated, swap etc) the hotness map looks much like noise.
>
> Similar concerns about DAMON's region abstraction were raised for physical
> address space monitoring, because there is no deliberate effort to gather
> hot pages together (i.e., no locality).
>
> I'd argue there is no deliberate effort to spread the temperature out
> either, though. As a result, we can expect some degree of unintentional
> bias, and that matches my experience from DAMON use cases in production
> environments so far.
Whilst I'm not in a position to share the data, as it's not mine :( I've
seen graphs that show that for at least some use cases, even if we have
some contiguity of hotness in the VA space, it looks like noise in PA. So
I think this is a case of 'mileage may vary': DAMON works great sometimes,
but sometimes the spread of the access statistics happens to be wrong.
>
> Also, in practice, DAMON regions are used in combination with other
> information. For example, DAMON-based reclaim checks the PTE A bit of each
> page in a DAMON-suggested cold memory region to make the final decision
> about whether or not to reclaim it, as MADV_PAGEOUT does.
Makes sense. The MADV_PAGEOUT case was one of the motivators for the
suggestion of mixing methods. Here it's effectively DAMON + dense A bit
checking (on candidate pages).
>
> That is, yes, I agree DAMON's region abstraction may not be a good way to
> find a perfect answer to some questions, such as finding the N-th hottest
> single page. And it has plenty of room to improve. Nevertheless, even
> today's DAMON can give good-enough best-effort answers to questions that
> are practical in some cases, such as finding the regions that may contain
> the N hottest/coldest pages, while keeping the monitoring overhead fixed as
> users request.
>
> Also, please note that there is no reason to restrict DAMON to always
> using the regions abstraction. For different use cases and situations,
> DAMON is open to being extended with new abstractions. DAMON aims to be a
> subsystem not for the DAMON-regions concept, but for data access monitoring
> with practical efficiency, and to continue evolving for the environments it
> is given.
Absolutely understood. In my current thinking DAMON sits at a particular
layer in the stack, and there may be one more abstraction on top of it
(e.g. a list of hot/cold pages). It's equally possible that the layers may
fuse and this becomes an aspect of DAMON.
>
> >
> > Now for the "there be monsters bit"...
> > ---------------------------------------
> >
> > - Stability of hotness matters and is hard to establish.
> > Predict a page will remain hot - various heuristics.
> > a) It is hot, probably stays so? (super hot!)
> > Sometimes enough to be detected as hot once,
> > often not.
> > b) It has been hot a while, probably stays so.
> > Check this hot list against previous hot list,
> > entries in both needed to promote.
> > This has a problem if hotlist is small compared to
> > total count of hot pages. Say list is 1%, 20% actually
> > hot, low chance of repeats even in hot pages.
> > c) It is hot, let's monitor a while before doing anything.
> > Measurement technique may change. Maybe cheaper
> > to monitor 'candidate' pages than all pages
> > e.g. CXL HMU gives 1000 pages, then we use access bit
> > sampling to check they are at least accessed N times
> > in next second.
> > d) It was hot, We moved it. Did it stay hot?
> > More useful to identify when we are thrashing and should
> > just stop doing anything. To late to fix this one!
>
> DAMON is providing a sort of b) approach, aka DAMON regions' age, for finding
> both hot and cold regions.
>
> > - Some data should be considered hot even when not in use (e.g. stack)
>
> DAMOS filters are for this kind of exception, and the DAMON kernel API is
> flexible enough to let callers directly manipulate the region information
> based on their special knowledge. We can further optimize the interface for
> easier use, of course.
Nice.
>
> > - Usecases interfere. So it can't just be a broadcast mode
> > where hotness information is sent to all users.
> > - When to stop, start migration / tracking?
> > a) Detecting bad decisions. Enough bad decisions, better to
> > do nothing?
> > b) Metadata beyond the counts is useful
> > https://lore.kernel.org/all/87h64u2xkh.fsf@DESKTOP-5N7EMDA/
> > Promotion algorithms can need aggregate statistics for a memory
> > device to decide how much to move.
>
> The DAMOS quota goal feature is aimed at this kind of question. It allows
> users to set a target metric and value, and to tune the aggressiveness. For
> promotions and demotions, I suggested using upper-tier utilization and free
> ratio as possible goal metrics, and I'm going to post an implementation for
> that soon.
Those are certainly good metrics to consider, but I think we also
definitely need a metric around how beneficial the moves being made
actually are.
That matters more on the promotion path, because promotion interrupts
access to hot data and so will cause a temporary drop in performance /
a latency spike.
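To be concrete about the kind of feedback I mean (a sketch only, no existing
interface implied): re-check a sample of recently promoted pages after a
settle period, using whatever cheap mechanism is to hand (the page_is_hot()
stub below stands in for e.g. PTE A bit sampling), and back off when too few
moves pay off:

#include <stdbool.h>
#include <stdint.h>

struct promo_feedback {
	uint64_t checked;    /* promoted pages re-checked so far */
	uint64_t still_hot;  /* ...of which were still hot */
};

/* Placeholder for whatever re-check the system can afford. */
static bool page_is_hot(uint64_t pfn)
{
	(void)pfn;
	return false;        /* replace with a real check */
}

static void promo_feedback_check(struct promo_feedback *fb, uint64_t pfn)
{
	fb->checked++;
	if (page_is_hot(pfn))
		fb->still_hot++;
}

/* E.g. pause promotion when fewer than 1 in 4 recent moves paid off. */
static bool promo_should_back_off(const struct promo_feedback *fb)
{
	return fb->checked >= 64 && fb->still_hot * 4 < fb->checked;
}

The thresholds are arbitrary; the point is that the promotion daemon gets a
signal about the benefit of its own actions rather than just more raw
temperature data.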
>
> >
> > As noted above, this may well overlap with other sessions.
> > One outcome of the discussion so far is to highlight what I think many
> > already knew. This is hard!
>
> Indeed. Keeping more people on the same page is important and difficult.
> Thank you again for your effort, and I'm looking forward to discussing this in more depth!
>
I'm not sure we'll succeed. This may well be a wild west situation for a while
yet, but hopefully we can slowly converge or at least build some common
parts.
Jonathan
p.s. Heathrow disruption means I'm crossing my fingers on actually getting to
Montreal.
>
> Thanks,
> SJ
>
> >
> > Jonathan
* Re: [LSF/MM/BPF TOPIC v2] Unifying sources of page temperature information - what info is actually wanted?
From: SeongJae Park @ 2025-03-21 17:36 UTC
To: Jonathan Cameron
Cc: SeongJae Park, Raghavendra K T, Bharata B Rao, lsf-pc, linux-mm,
Michal Hocko, Dan Williams, linuxarm, Matthew Wilcox,
Johannes Weiner, Gregory Price
On Fri, 21 Mar 2025 15:30:44 +0000 Jonathan Cameron <Jonathan.Cameron@huawei.com> wrote:
Thank you for your nice comments. I agree with all your points, and am adding
just a few more details below.
[...]
> Whilst I'm not in a position to share the data, as it's not mine :( I've
> seen graphs that show that for at least some use cases, even if we have
> some contiguity of hotness in the VA space, it looks like noise in PA. So
> I think this is a case of 'mileage may vary': DAMON works great sometimes,
> but sometimes the spread of the access statistics happens to be wrong.
100% agree. Your findings and conclusions match mine. Nevertheless, we are
trying to find out why and when it works badly or well, and to make it
better in more cases. So far, we have found that better visualization
methods and DAMON parameter tuning can help. We are therefore adding more
visualization methods and DAMON parameter auto-tuning. Still far from
perfect, but it will keep getting closer to the north star if we, the
community, work together.
[...]
> > > b) Metadata beyond the counts is useful
> > > https://lore.kernel.org/all/87h64u2xkh.fsf@DESKTOP-5N7EMDA/
> > > Promotion algorithms can need aggregate statistics for a memory
> > > device to decide how much to move.
> >
> > The DAMOS quota goal feature is aimed at this kind of question. It allows
> > users to set a target metric and value, and to tune the aggressiveness. For
> > promotions and demotions, I suggested using upper-tier utilization and free
> > ratio as possible goal metrics, and I'm going to post an implementation for
> > that soon.
>
> Those are certainly good metrics to consider, but I think we also
> definitely need a metric around how beneficial the moves being made
> actually are.
>
> That matters more on the promotion path, because promotion interrupts
> access to hot data and so will cause a temporary drop in performance /
> a latency spike.
Good point, and agreed. I think we can, and should, continue developing such
better metrics together.
And I think the DAMOS quota goal is a feature that can easily be used for
prototyping, experimenting with, and hopefully productionizing such new
metrics. The feature is easy to extend with new metrics, and it also
supports setting multiple goals. It also supports users directly feeding
arbitrary input into the feedback loop.
>
> >
> > >
> > > As noted above, this may well overlap with other sessions.
> > > One outcome of the discussion so far is to highlight what I think many
> > > already knew. This is hard!
> >
> > Indeed. Keeping more people on the same page is important and difficult.
> > Thank you for your effort again, and looking forward to discuss in more depth!
> >
>
> I'm not sure we'll succeed. This may well be a wild west situation for a while
> yet, but hopefully we can slowly converge or at least build some common
> parts.
I'm very sure this session will be an important step on that journey :)
>
> Jonathan
>
> p.s. Heathrow disruption means I'm crossing my fingers on actually getting to
> Montreal.
I hope it all goes well for you!
Thanks,
SJ
[...]
* Re: [LSF/MM/BPF TOPIC v2] Unifying sources of page temperature information - what info is actually wanted?
From: Jonathan Cameron @ 2025-04-04 10:39 UTC
To: Raghavendra K T, Bharata B Rao, SeongJae Park, lsf-pc, linux-mm
Cc: Michal Hocko, Dan Williams, Matthew Wilcox, Johannes Weiner,
Gregory Price
On Wed, 19 Mar 2025 12:47:53 +0000
Jonathan Cameron <Jonathan.Cameron@huawei.com> wrote:
Slides as presented at LSF-MM:
https://drive.google.com/file/d/1o9g-Bggg7jJwrkLa90ZyLEW6xPdp2D2a/view?usp=drivesdk
[...]