* [LSF/MM/BPF TOPIC] Overhauling hot page detection and promotion based on PTE A bit scanning
@ 2025-01-23 10:57 Raghavendra K T
From: Raghavendra K T @ 2025-01-23 10:57 UTC (permalink / raw)
To: linux-mm, akpm, lsf-pc, bharata
Cc: gourry, nehagholkar, abhishekd, ying.huang, nphamcs, hannes,
feng.tang, kbusch, Hasan.Maruf, sj, david, willy, k.shutemov,
mgorman, vbabka, hughd, rientjes, shy828301, liam.howlett,
peterz, mingo, nadav.amit, shivankg, ziy, jhubbard,
AneeshKumar.KizhakeVeetil, linux-kernel, jon.grimm,
santosh.shukla, Michael.Day, riel, weixugc, leesuyeon0506,
honggyu.kim, leillc, kmanaouil.dev, rppt, dave.hansen,
raghavendra.kt
Bharata and I would like to propose the following topic for LSFMM.
Topic: Overhauling hot page detection and promotion based on PTE A bit scanning.
In the Linux kernel, hot page information can potentially be obtained from
multiple sources:
a. PROT_NONE faults (NUMA balancing)
b. PTE Access bit (LRU scanning)
c. Hardware provided page hotness info (like AMD IBS)
This information is further used to migrate (or promote) pages from the slow
memory tier to the top tier to increase performance.
In the current hot page promotion mechanism, all the activities, including the
process address space scanning, NUMA hint fault handling and page migration, are
performed in the process context, i.e., the scanning overhead is borne by the
applications.
I had recently posted a patch [1] to improve this in the context of slow-tier
page promotion. Here, scanning is done by a global kernel thread which routinely
scans all the processes' address spaces and checks for accesses by reading the
PTE A bit. The hot pages thus identified are maintained in a list and
subsequently promoted to a default top-tier node. Thus, the approach pushes the
overhead of scanning, NUMA hint faults and migrations off the process context.
The topic was presented in the MM alignment session hosted by David Rientjes [2].
The topic also finds a mention in SeongJae Park's LSFMM proposal [3].
Here is the list of potential discussion points:
1. Other improvements and enhancements to the PTE A bit scanning approach: use of
multiple kernel threads, throttling improvements, promotion policies, per-process
opt-in via prctl, virtual vs physical address based scanning, tuning of the hot
page detection algorithm, etc.
2. Possibility of maintaining a single source of truth for page hotness that
would aggregate hot page information from multiple sources and let other
sub-systems use that info.
3. Discuss how hardware provided hotness info (like AMD IBS) can further aid
promotion. Bharata had posted an RFC [4] on this a while back.
4. Overlap with DAMON and potential reuse.
Links:
[1] https://lore.kernel.org/all/20241201153818.2633616-1-raghavendra.kt@amd.com/
[2] https://lore.kernel.org/linux-mm/20241226012833.rmmbkws4wdhzdht6@ed.ac.uk/T/
[3] https://lore.kernel.org/lkml/Z4XUoWlU-UgRik18@gourry-fedora-PF4VCD3F/T/
[4] https://lore.kernel.org/lkml/20230208073533.715-2-bharata@amd.com/
^ permalink raw reply [flat|nested] 33+ messages in thread

* Re: [LSF/MM/BPF TOPIC] Overhauling hot page detection and promotion based on PTE A bit scanning
From: SeongJae Park @ 2025-01-23 18:20 UTC (permalink / raw)
To: Raghavendra K T
Cc: SeongJae Park, linux-mm, akpm, lsf-pc, bharata, gourry, nehagholkar,
abhishekd, ying.huang, nphamcs, hannes, feng.tang, kbusch, Hasan.Maruf,
david, willy, k.shutemov, mgorman, vbabka, hughd, rientjes, shy828301,
liam.howlett, peterz, mingo, nadav.amit, shivankg, ziy, jhubbard,
AneeshKumar.KizhakeVeetil, linux-kernel, jon.grimm, santosh.shukla,
Michael.Day, riel, weixugc, leesuyeon0506, honggyu.kim, leillc,
kmanaouil.dev, rppt, dave.hansen

Hi Raghavendra,

On Thu, 23 Jan 2025 10:57:21 +0000 Raghavendra K T <raghavendra.kt@amd.com> wrote:

> Bharata and I would like to propose the following topic for LSFMM.
>
> Topic: Overhauling hot page detection and promotion based on PTE A bit scanning.

Thank you for proposing this. I'm interested in this!

> In the Linux kernel, hot page information can potentially be obtained from
> multiple sources:
>
> a. PROT_NONE faults (NUMA balancing)
> b. PTE Access bit (LRU scanning)
> c. Hardware provided page hotness info (like AMD IBS)
>
> This information is further used to migrate (or promote) pages from slow memory
> tier to top tier to increase performance.
>
> In the current hot page promotion mechanism, all the activities including the
> process address space scanning, NUMA hint fault handling and page migration are
> performed in the process context. i.e., scanning overhead is borne by the
> applications.

I understand that you're mentioning only fully in-kernel solutions.
Just for readers' context, SK hynix's HMSDK capacity expansion[1] does the work
in two asynchronous threads (one for promotion and the other for demotion),
using DAMON in the kernel as the core worker, and controlling DAMON from
user space.

> I had recently posted a patch [1] to improve this in the context of slow-tier
> page promotion. Here, Scanning is done by a global kernel thread which routinely
> scans all the processes' address spaces and checks for accesses by reading the
> PTE A bit. The hot pages thus identified are maintained in list and subsequently
> are promoted to a default top-tier node. Thus, the approach pushes overhead of
> scanning, NUMA hint faults and migrations off from process context.
>
> The topic was presented in the MM alignment session hosted by David Rientjes [2].
> The topic also finds a mention in S J Park's LSFMM proposal [3].
>
> Here is the list of potential discussion points:

Great discussion points, thank you. I'm adding how DAMON tries to deal with
some of the points below.

> 1. Other improvements and enhancements to PTE A bit scanning approach. Use of
> multiple kernel threads,

DAMON allows use of multiple kernel threads for different monitoring scopes.
There were also ideas for splitting the monitoring part and the migration-like
system operation part into different threads.

> throttling improvements,

DAMON provides features called "adaptive regions adjustment" and "DAMOS quotas"
for throttling the overheads from access monitoring and migration-like system
operation actions.

> promotion policies,

DAMON's access-aware system operation feature (DAMOS) allows setting this kind
of system operation policy based on the access pattern and additional
information, including page-level information such as anonymousness, the
belonging cgroup, and a page-granular A bit recheck.

> per-process opt-in via prctl,

DAMON allows applying the system operation action to pages belonging to
specific cgroups using a feature called DAMOS filters.
It is not integrated with prctl and would work at cgroup scope, but it may
still be usable. Extending DAMOS filters for the belonging processes may also
be doable.

> virtual vs physical address based scanning,

DAMON supports monitoring of both virtual and physical address spaces. DAMON's
pages migration is currently not supported for virtual address spaces, though I
believe adding the support is not difficult. I'm a bit in favor of the physical
address space, probably because I'm biased to what DAMON currently supports,
but also due to edge cases like promotion of unmapped pages.

> tuning hot page detection algorithm etc.

DAMON requires users to manually tune some important parameters for hot pages
detection. We recently provided a tuning guide[2], and are working on making it
automated. I believe the essential problem is similar across many use cases
regardless of the type of the low-level access check primitive, so I want to
learn if the tuning automation idea can be generally used.

> 2. Possibility of maintaining single source of truth for page hotness that would
> maintain hot page information from multiple sources and let other sub-systems
> use that info.

DAMON is currently using the PTE A bit as the essential access check primitive.
We designed DAMON to be extensible to other access check primitives such as
page faults and AMD IBS-like h/w features. We are now planning to do such an
extension, though it is still in a very early, low-priority planning stage.
DAMON also provides the kernel API.

> 3. Discuss how hardware provided hotness info (like AMD IBS) can further aid
> promotion. Bharata had posted an RFC [4] on this a while back.

Maybe the CXL Hotness Monitoring Unit could also be an interesting thing to
discuss together.

> 4. Overlap with DAMON and potential reuse.

I confess that it seems some of the work might overlap with DAMON to my biased
eyes.
I'm looking forward to attending this session, to make it less biased and more
aligned with people :)

> Links:
>
> [1] https://lore.kernel.org/all/20241201153818.2633616-1-raghavendra.kt@amd.com/
> [2] https://lore.kernel.org/linux-mm/20241226012833.rmmbkws4wdhzdht6@ed.ac.uk/T/
> [3] https://lore.kernel.org/lkml/Z4XUoWlU-UgRik18@gourry-fedora-PF4VCD3F/T/
> [4] https://lore.kernel.org/lkml/20230208073533.715-2-bharata@amd.com/

Again, thank you for proposing this topic, and I hope to see you in Montreal!

[1] https://github.com/skhynix/hmsdk/wiki/Capacity-Expansion
[2] https://lkml.kernel.org/r/20250110185232.54907-1-sj@kernel.org

Thanks,
SJ

^ permalink raw reply [flat|nested] 33+ messages in thread
* Re: [LSF/MM/BPF TOPIC] Overhauling hot page detection and promotion based on PTE A bit scanning 2025-01-23 18:20 ` SeongJae Park @ 2025-01-24 8:54 ` Raghavendra K T 2025-01-24 18:05 ` Jonathan Cameron 0 siblings, 1 reply; 33+ messages in thread From: Raghavendra K T @ 2025-01-24 8:54 UTC (permalink / raw) To: SeongJae Park Cc: linux-mm, akpm, lsf-pc, bharata, gourry, nehagholkar, abhishekd, ying.huang, nphamcs, hannes, feng.tang, kbusch, Hasan.Maruf, david, willy, k.shutemov, mgorman, vbabka, hughd, rientjes, shy828301, liam.howlett, peterz, mingo, nadav.amit, shivankg, ziy, jhubbard, AneeshKumar.KizhakeVeetil, linux-kernel, jon.grimm, santosh.shukla, Michael.Day, riel, weixugc, leesuyeon0506, honggyu.kim, leillc, kmanaouil.dev, rppt, dave.hansen On 1/23/2025 11:50 PM, SeongJae Park wrote: > Hi Raghavendra, > > On Thu, 23 Jan 2025 10:57:21 +0000 Raghavendra K T <raghavendra.kt@amd.com> wrote: > >> Bharata and I would like to propose the following topic for LSFMM. >> >> Topic: Overhauling hot page detection and promotion based on PTE A bit scanning. > > Thank you for proposing this. I'm interested in this! > Thank you. [...] >> virtual vs physical address based scanning, > > DAMON supports both virtual and physical address spaces monitoring. DAMON's > pages migration is currently not supported for virtual address spaces, though I > believe adding the support is not difficult. > Will check this. [...] >> >> 3. Discuss how hardware provided hotness info (like AMD IBS) can further aid >> promotion. Bharata had posted an RFC [4] on this a while back. > > Maybe CXL Hotness Monitoring Unit could also be an interesting thing to discuss > together. > Definitely. >> >> 4. Overlap with DAMON and potential reuse. > > I confess that it seems some of the works might overlap with DAMON to my biased > eyes. I'm looking forward to attend this session, to make it less biased and > more aligned with people :) > Yes. Agree. 
>> >> Links: >> >> [1] https://lore.kernel.org/all/20241201153818.2633616-1-raghavendra.kt@amd.com/ >> [2] https://lore.kernel.org/linux-mm/20241226012833.rmmbkws4wdhzdht6@ed.ac.uk/T/ >> [3] https://lore.kernel.org/lkml/Z4XUoWlU-UgRik18@gourry-fedora-PF4VCD3F/T/ >> [4] https://lore.kernel.org/lkml/20230208073533.715-2-bharata@amd.com/ > > Again, thank you for proposing this topic, and I wish to see you at Montreal! > Same here .. Thank you :) - Raghu ^ permalink raw reply [flat|nested] 33+ messages in thread
* Re: [LSF/MM/BPF TOPIC] Overhauling hot page detection and promotion based on PTE A bit scanning 2025-01-24 8:54 ` Raghavendra K T @ 2025-01-24 18:05 ` Jonathan Cameron 0 siblings, 0 replies; 33+ messages in thread From: Jonathan Cameron @ 2025-01-24 18:05 UTC (permalink / raw) To: Raghavendra K T Cc: SeongJae Park, linux-mm, akpm, lsf-pc, bharata, gourry, nehagholkar, abhishekd, ying.huang, nphamcs, hannes, feng.tang, kbusch, Hasan.Maruf, david, willy, k.shutemov, mgorman, vbabka, hughd, rientjes, shy828301, liam.howlett, peterz, mingo, nadav.amit, shivankg, ziy, jhubbard, AneeshKumar.KizhakeVeetil, linux-kernel, jon.grimm, santosh.shukla, Michael.Day, riel, weixugc, leesuyeon0506, honggyu.kim, leillc, kmanaouil.dev, rppt, dave.hansen On Fri, 24 Jan 2025 14:24:40 +0530 Raghavendra K T <raghavendra.kt@amd.com> wrote: > On 1/23/2025 11:50 PM, SeongJae Park wrote: > > Hi Raghavendra, > > > > On Thu, 23 Jan 2025 10:57:21 +0000 Raghavendra K T <raghavendra.kt@amd.com> wrote: > > > >> Bharata and I would like to propose the following topic for LSFMM. > >> > >> Topic: Overhauling hot page detection and promotion based on PTE A bit scanning. > > > > Thank you for proposing this. I'm interested in this! > > > > Thank you. > > [...] > > >> virtual vs physical address based scanning, > > > > DAMON supports both virtual and physical address spaces monitoring. DAMON's > > pages migration is currently not supported for virtual address spaces, though I > > believe adding the support is not difficult. > > > > Will check this. > > [...] > > >> > >> 3. Discuss how hardware provided hotness info (like AMD IBS) can further aid > >> promotion. Bharata had posted an RFC [4] on this a while back. > > > > Maybe CXL Hotness Monitoring Unit could also be an interesting thing to discuss > > together. > > > > Definitely. Thanks for the shout out SJ. 
I'm definitely interested in this topic from the angle of the hardware hotness
monitoring units (roughly speaking, ones that give you a list of hot pages -
typically by PA). Making sure any solution works for those is perhaps key for
the longer term. It is not entirely clear to me that we are ready yet for a
data aggregation solution, or mixed techniques, but it is definitely
interesting to brainstorm.

Until now my main focus has been on getting infrastructure in place to work out
the lower levels of using a hardware hotness monitoring unit (using QEMU for
now with TCG plugins to get the access data). In general, not stuff I suspect
anyone will want to discuss at LSF/MM, but it perhaps provides insights into
how good the data we might get could be.

Unless the hardware units people build are very capable (and expensive), the
chances are we will have to deal with accuracy limitations that I suspect the
users of this data for migration etc. do not want to explicitly deal with. If
our tracking is coming from multiple sources, we need to deal with differences
in, and potentially estimation of, accuracy. Anything efficient is going to
have some accuracy issues (regions for DAMON, access bit scanning frequency for
your technique, sampling for page fault techniques, data in the wrong place -
access bits will tell you to promote stuff that is always in cache, which is
arguably a waste of time, etc.). I've no idea yet how painful this is going to
be.

Using the different sources to overcome the limitations of each one is
interesting, but likely to be complex and tricky to generalize. Maybe access
bit scanning to detect hottish large-scale regions, then a hardware tracker to
separate out 'hot' from 'warm'. Sounds fun, but far from general! Lots of
problems to solve in this space.

And when we have done that, there is paravirtualizing hardware trackers / other
methods, and application-specific usage of the data (some apps will know better
than the kernel and will want this data, security / side channels etc).
For stretch goals there is even the fun question of hotness monitoring
downstream of interleave, particularly when it's scrambled and not a power of 2
ways. Again, maybe not a general problem, but it will affect data biases. How
much of that we want to hide down in implementations below some general 'give
me hot stuff' is an open question (I'm guessing hide almost everything beyond
controls on data bandwidth).

Jonathan

> >>
> >> 4. Overlap with DAMON and potential reuse.
> >
> > I confess that it seems some of the works might overlap with DAMON to my biased
> > eyes. I'm looking forward to attend this session, to make it less biased and
> > more aligned with people :)
>
> Yes. Agree.
>
> >>
> >> Links:
> >>
> >> [1] https://lore.kernel.org/all/20241201153818.2633616-1-raghavendra.kt@amd.com/
> >> [2] https://lore.kernel.org/linux-mm/20241226012833.rmmbkws4wdhzdht6@ed.ac.uk/T/
> >> [3] https://lore.kernel.org/lkml/Z4XUoWlU-UgRik18@gourry-fedora-PF4VCD3F/T/
> >> [4] https://lore.kernel.org/lkml/20230208073533.715-2-bharata@amd.com/
> >
> > Again, thank you for proposing this topic, and I wish to see you at Montreal!
>
> Same here .. Thank you :)
>
> - Raghu

^ permalink raw reply [flat|nested] 33+ messages in thread
* Re: [LSF/MM/BPF TOPIC] Overhauling hot page detection and promotion based on PTE A bit scanning 2025-01-23 10:57 [LSF/MM/BPF TOPIC] Overhauling hot page detection and promotion based on PTE A bit scanning Raghavendra K T 2025-01-23 18:20 ` SeongJae Park @ 2025-01-24 5:53 ` Hyeonggon Yoo 2025-01-24 9:02 ` Raghavendra K T 2025-02-06 3:14 ` Yuanchu Xie 2025-01-26 2:27 ` Huang, Ying ` (2 subsequent siblings) 4 siblings, 2 replies; 33+ messages in thread From: Hyeonggon Yoo @ 2025-01-24 5:53 UTC (permalink / raw) To: Raghavendra K T, linux-mm, akpm, lsf-pc, bharata Cc: kernel_team, 42.hyeyoo, gourry, nehagholkar, abhishekd, ying.huang, nphamcs, hannes, feng.tang, kbusch, Hasan.Maruf, sj, david, willy, k.shutemov, mgorman, vbabka, hughd, rientjes, shy828301, liam.howlett, peterz, mingo, nadav.amit, shivankg, ziy, jhubbard, AneeshKumar.KizhakeVeetil, linux-kernel, jon.grimm, santosh.shukla, Michael.Day, riel, weixugc, leesuyeon0506, honggyu.kim, leillc, kmanaouil.dev, rppt, dave.hansen, yuanchu On 1/23/2025 7:57 PM, Raghavendra K T wrote: > Bharata and I would like to propose the following topic for LSFMM. > > Topic: Overhauling hot page detection and promotion based on PTE A bit scanning. > > In the Linux kernel, hot page information can potentially be obtained from > multiple sources: > > a. PROT_NONE faults (NUMA balancing) > b. PTE Access bit (LRU scanning) > c. Hardware provided page hotness info (like AMD IBS) > > This information is further used to migrate (or promote) pages from slow memory > tier to top tier to increase performance. > > In the current hot page promotion mechanism, all the activities including the > process address space scanning, NUMA hint fault handling and page migration are > performed in the process context. i.e., scanning overhead is borne by the > applications. > > I had recently posted a patch [1] to improve this in the context of slow-tier > page promotion. 
Here, Scanning is done by a global kernel thread which routinely
> scans all the processes' address spaces and checks for accesses by reading the
> PTE A bit. The hot pages thus identified are maintained in list and subsequently
> are promoted to a default top-tier node. Thus, the approach pushes overhead of
> scanning, NUMA hint faults and migrations off from process context.
>
> The topic was presented in the MM alignment session hosted by David Rientjes [2].
> The topic also finds a mention in S J Park's LSFMM proposal [3].
>
> Here is the list of potential discussion points:
> 1. Other improvements and enhancements to PTE A bit scanning approach. Use of
> multiple kernel threads, throttling improvements, promotion policies, per-process
> opt-in via prctl, virtual vs physical address based scanning, tuning hot page
> detection algorithm etc.

Yuanchu's MGLRU periodic aging series [1] seems quite relevant here; you might
want to look at it. Adding Yuanchu to Cc.

By the way, do you have any reason why you'd prefer opt-in prctl over
per-memcg control?

[1] https://lore.kernel.org/all/20221214225123.2770216-1-yuanchu@google.com/

> 2. Possibility of maintaining single source of truth for page hotness that would
> maintain hot page information from multiple sources and let other sub-systems
> use that info.
>
> 3. Discuss how hardware provided hotness info (like AMD IBS) can further aid
> promotion. Bharata had posted an RFC [4] on this a while back.
>
> 4. Overlap with DAMON and potential reuse.
>
> Links:
>
> [1] https://lore.kernel.org/all/20241201153818.2633616-1-raghavendra.kt@amd.com/
> [2] https://lore.kernel.org/linux-mm/20241226012833.rmmbkws4wdhzdht6@ed.ac.uk/T/
> [3] https://lore.kernel.org/lkml/Z4XUoWlU-UgRik18@gourry-fedora-PF4VCD3F/T/
> [4] https://lore.kernel.org/lkml/20230208073533.715-2-bharata@amd.com/

^ permalink raw reply [flat|nested] 33+ messages in thread
* Re: [LSF/MM/BPF TOPIC] Overhauling hot page detection and promotion based on PTE A bit scanning
From: Raghavendra K T @ 2025-01-24 9:02 UTC (permalink / raw)
To: Hyeonggon Yoo, linux-mm, akpm, lsf-pc, bharata
Cc: kernel_team, 42.hyeyoo, gourry, nehagholkar, abhishekd, ying.huang,
nphamcs, hannes, feng.tang, kbusch, Hasan.Maruf, sj, david, willy,
k.shutemov, mgorman, vbabka, hughd, rientjes, shy828301, liam.howlett,
peterz, mingo, nadav.amit, shivankg, ziy, jhubbard,
AneeshKumar.KizhakeVeetil, linux-kernel, jon.grimm, santosh.shukla,
Michael.Day, riel, weixugc, leesuyeon0506, honggyu.kim, leillc,
kmanaouil.dev, rppt, dave.hansen, yuanchu

On 1/24/2025 11:23 AM, Hyeonggon Yoo wrote:
>
> On 1/23/2025 7:57 PM, Raghavendra K T wrote:
>> Bharata and I would like to propose the following topic for LSFMM.
>>
>> Topic: Overhauling hot page detection and promotion based on PTE A bit
>> scanning.
[...]
>> Here is the list of potential discussion points:
>> 1. Other improvements and enhancements to PTE A bit scanning approach. Use of
>> multiple kernel threads, throttling improvements, promotion policies, per-process
>> opt-in via prctl, virtual vs physical address based scanning, tuning hot page
>> detection algorithm etc.
>
> Yuanchu's MGLRU periodic aging series [1] seems quite relevant here,
> you might want to look at it. adding Yuanchu to Cc.

Thank you for pointing that out.

> By the way, do you have any reason why you'd prefer opt-in prctl
> over per-memcg control?

The opt-in prctl came up in the MM alignment discussion, and I have added
that. per-memcg also definitely makes sense. I am not aware which is the most
used use case, but adding provision for both, with one having priority over
the other, may be the way to go.

The overall point here is to save time by avoiding unnecessary scanning.
I will be adding prctl in the upcoming version to start with.

> [1] https://lore.kernel.org/all/20221214225123.2770216-1-yuanchu@google.com/

^ permalink raw reply [flat|nested] 33+ messages in thread
* Re: [LSF/MM/BPF TOPIC] Overhauling hot page detection and promotion based on PTE A bit scanning 2025-01-24 9:02 ` Raghavendra K T @ 2025-01-27 7:01 ` David Rientjes 2025-01-27 7:11 ` Raghavendra K T 0 siblings, 1 reply; 33+ messages in thread From: David Rientjes @ 2025-01-27 7:01 UTC (permalink / raw) To: Raghavendra K T Cc: Hyeonggon Yoo, linux-mm, akpm, lsf-pc, bharata, kernel_team, 42.hyeyoo, gourry, nehagholkar, abhishekd, ying.huang, nphamcs, hannes, feng.tang, kbusch, Hasan.Maruf, sj, david, willy, k.shutemov, mgorman, vbabka, hughd, shy828301, liam.howlett, peterz, mingo, nadav.amit, shivankg, ziy, jhubbard, AneeshKumar.KizhakeVeetil, linux-kernel, jon.grimm, santosh.shukla, Michael.Day, riel, weixugc, leesuyeon0506, honggyu.kim, leillc, kmanaouil.dev, rppt, dave.hansen, yuanchu On Fri, 24 Jan 2025, Raghavendra K T wrote: > On 1/24/2025 11:23 AM, Hyeonggon Yoo wrote: > > > > > > On 1/23/2025 7:57 PM, Raghavendra K T wrote: > > > Bharata and I would like to propose the following topic for LSFMM. > > > > > > Topic: Overhauling hot page detection and promotion based on PTE A bit > > > scanning. > [...] > > > Here is the list of potential discussion points: > > > 1. Other improvements and enhancements to PTE A bit scanning approach. Use > > > of > > > multiple kernel threads, throttling improvements, promotion policies, > > > per-process > > > opt-in via prctl, virtual vs physical address based scanning, tuning hot > > > page > > > detection algorithm etc. > > > > Yuanchu's MGLRU periodic aging series [1] seems quite relevant here, > > you might want to look at it. adding Yuanchu to Cc. > > Thank you for pointing that. > +1. Yuanchu, do you have ideas for how MGLRU periodic aging and working set can play a role in this? > > By the way, do you have any reason why you'd prefer opt-in prctl > > over per-memcg control? > > > > opt-in prctl came in the MM alignment discussion, and have added that. Are you planning on sending a refresh of that patch series? 
:)

> per-memcg also definitely makes sense. I am not aware which is the most
> used usecase. But adding provision for both with one having more
> priority over other may be the way to go.

I would suggest leveraging prctl() for this as opposed to memcg. I think
making this part of memcg is beyond the scope of what memcg is intended to do,
the limitation of memory resources, similar to the recent discussions on
per-cgroup control for THP.

Additionally, the current memcg configuration of the system may also not be
convenient to use for this purpose, especially if one process should be opted
out in the memcg hierarchy. Requiring users to change how their memcg is
configured just to opt out would be rather unfortunate.

> Overall point here is to save time in unnecessary scanning.
> will be adding prctl in the upcoming version to start with.

Fully agreed.

Thanks very much for proposing this topic, Raghu, I think it will be very
useful to discuss! Looking forward to it!

^ permalink raw reply [flat|nested] 33+ messages in thread
* Re: [LSF/MM/BPF TOPIC] Overhauling hot page detection and promotion based on PTE A bit scanning 2025-01-27 7:01 ` David Rientjes @ 2025-01-27 7:11 ` Raghavendra K T 0 siblings, 0 replies; 33+ messages in thread From: Raghavendra K T @ 2025-01-27 7:11 UTC (permalink / raw) To: David Rientjes Cc: Hyeonggon Yoo, linux-mm, akpm, lsf-pc, bharata, kernel_team, 42.hyeyoo, gourry, nehagholkar, abhishekd, ying.huang, nphamcs, hannes, feng.tang, kbusch, Hasan.Maruf, sj, david, willy, k.shutemov, mgorman, vbabka, hughd, shy828301, liam.howlett, peterz, mingo, nadav.amit, shivankg, ziy, jhubbard, AneeshKumar.KizhakeVeetil, linux-kernel, jon.grimm, santosh.shukla, Michael.Day, riel, weixugc, leesuyeon0506, honggyu.kim, leillc, kmanaouil.dev, rppt, dave.hansen, yuanchu On 1/27/2025 12:31 PM, David Rientjes wrote: > On Fri, 24 Jan 2025, Raghavendra K T wrote: [...] > >>> By the way, do you have any reason why you'd prefer opt-in prctl >>> over per-memcg control? >>> >> >> opt-in prctl came in the MM alignment discussion, and have added that. > > Are you planning on sending a refresh of that patch series? :) Hello David, Current plan is to send by next week. (Because after measuring the per mm latency and overall latency to do full scan, I was thinking to add parallel scanning in next version itself). > >> per-memcg also definitely makes sense. I am not aware which is the most >> used usecase. But adding provision for both with one having more >> priority over other may be the way to go. >> > > I would suggest leveraging prctl() for this as opposed to memcg. I think > making this part of memcg is beyond the scope for what memcg is intended > to do, limitation of memory resources, similar to the recent discussions > on per-cgroup control for THP. > > Additionally, the current memcg configuration of the system may also not > be convenient for using for this purpose, especially if one process should > be opted out in the memcg hierarchy. 
Requiring users to change how their > memcg is configured just to opt out would be rather unfortunate. > >> Overall point here is to save time in unnecessary scanning. >> will be adding prctl in the upcoming version to start with. >> > > Fully agreed. > > Thanks very much for proposing this topic, Raghu, I think it will be very > useful to discuss! Looking forward to it! Thank you. - Raghu ^ permalink raw reply [flat|nested] 33+ messages in thread
* Re: [LSF/MM/BPF TOPIC] Overhauling hot page detection and promotion based on PTE A bit scanning 2025-01-24 5:53 ` Hyeonggon Yoo 2025-01-24 9:02 ` Raghavendra K T @ 2025-02-06 3:14 ` Yuanchu Xie 1 sibling, 0 replies; 33+ messages in thread From: Yuanchu Xie @ 2025-02-06 3:14 UTC (permalink / raw) To: Hyeonggon Yoo, Raghavendra K T, bharata Cc: linux-mm, akpm, lsf-pc, kernel_team, 42.hyeyoo, gourry, nehagholkar, abhishekd, ying.huang, nphamcs, hannes, feng.tang, kbusch, Hasan.Maruf, sj, david, willy, k.shutemov, mgorman, vbabka, hughd, rientjes, shy828301, liam.howlett, peterz, mingo, nadav.amit, shivankg, ziy, jhubbard, AneeshKumar.KizhakeVeetil, linux-kernel, jon.grimm, santosh.shukla, Michael.Day, riel, weixugc, leesuyeon0506, honggyu.kim, leillc, kmanaouil.dev, rppt, dave.hansen, Kinsey Ho On Thu, Jan 23, 2025 at 9:53 PM Hyeonggon Yoo <hyeonggon.yoo@sk.com> wrote: > On 1/23/2025 7:57 PM, Raghavendra K T wrote: > > Bharata and I would like to propose the following topic for LSFMM. > > > > Here is the list of potential discussion points: > > 1. Other improvements and enhancements to PTE A bit scanning approach. Use of > > multiple kernel threads, throttling improvements, promotion policies, per-process > > opt-in via prctl, virtual vs physical address based scanning, tuning hot page > > detection algorithm etc. > > Yuanchu's MGLRU periodic aging series [1] seems quite relevant here, > you might want to look at it. adding Yuanchu to Cc. Thanks for the mention, Hyeonggon Yoo. Working set reporting doesn't aim to promote/demote/reclaim pages, but to show aggregate stats of the memory in access recency. The periodic aging part is optional since client devices wouldn't want a background daemon wasting battery aging lruvecs when nothing is happening. For the server use case, the aging kthread periodically invoke MGLRU aging, which performs the PTE A bit scanning. MGLRU handles unmapped page cache as well for reclaim purposes. Reading through the kmmscand patch series. 
Kmmscand also keeps a list of mm_structs and performs scanning on them, so
given there are many use cases for PTE A bit scanning, this seems like an
opportunity to abstract some of the mm_struct scanning. Code-wise, the A bit
scanners do very similar things, and the MGLRU version has optional
optimizations that reduce the scanning overhead.

I wonder if you have considered migrating pages from the MGLRU young
generation of a remote node, or pages that have remained in the young
generation. Some changes to MGLRU would be necessary in that case.

Also adding Kinsey Ho since he's been looking at page promotion as well.

Thanks,
Yuanchu

^ permalink raw reply [flat|nested] 33+ messages in thread
* Re: [LSF/MM/BPF TOPIC] Overhauling hot page detection and promotion based on PTE A bit scanning 2025-01-23 10:57 [LSF/MM/BPF TOPIC] Overhauling hot page detection and promotion based on PTE A bit scanning Raghavendra K T 2025-01-23 18:20 ` SeongJae Park 2025-01-24 5:53 ` Hyeonggon Yoo @ 2025-01-26 2:27 ` Huang, Ying 2025-01-27 5:11 ` Bharata B Rao 2025-02-07 19:06 ` Davidlohr Bueso 2025-01-31 12:28 ` Jonathan Cameron 2025-04-07 3:13 ` Bharata B Rao 4 siblings, 2 replies; 33+ messages in thread From: Huang, Ying @ 2025-01-26 2:27 UTC (permalink / raw) To: Raghavendra K T Cc: linux-mm, akpm, lsf-pc, bharata, gourry, nehagholkar, abhishekd, nphamcs, hannes, feng.tang, kbusch, Hasan.Maruf, sj, david, willy, k.shutemov, mgorman, vbabka, hughd, rientjes, shy828301, liam.howlett, peterz, mingo, nadav.amit, shivankg, ziy, jhubbard, AneeshKumar.KizhakeVeetil, linux-kernel, jon.grimm, santosh.shukla, Michael.Day, riel, weixugc, leesuyeon0506, honggyu.kim, leillc, kmanaouil.dev, rppt, dave.hansen Hi, Raghavendra, Raghavendra K T <raghavendra.kt@amd.com> writes: > Bharata and I would like to propose the following topic for LSFMM. > > Topic: Overhauling hot page detection and promotion based on PTE A bit scanning. > > In the Linux kernel, hot page information can potentially be obtained from > multiple sources: > > a. PROT_NONE faults (NUMA balancing) > b. PTE Access bit (LRU scanning) > c. Hardware provided page hotness info (like AMD IBS) > > This information is further used to migrate (or promote) pages from slow memory > tier to top tier to increase performance. > > In the current hot page promotion mechanism, all the activities including the > process address space scanning, NUMA hint fault handling and page migration are > performed in the process context. i.e., scanning overhead is borne by the > applications. > > I had recently posted a patch [1] to improve this in the context of slow-tier > page promotion. 
Here, Scanning is done by a global kernel thread which routinely > scans all the processes' address spaces and checks for accesses by reading the > PTE A bit. The hot pages thus identified are maintained in list and subsequently > are promoted to a default top-tier node. Thus, the approach pushes overhead of > scanning, NUMA hint faults and migrations off from process context.

This has been discussed before too. For example, in the following thread

https://lore.kernel.org/all/20200417100633.GU20730@hirez.programming.kicks-ass.net/T/

The drawbacks of asynchronous scanning include:

- The CPU cycles used are not charged properly

- There may be no idle CPU cycles to use

- The scanning CPU may not be close enough to the workload CPUs

It's better to involve Mel and Peter in the discussion for this.

> The topic was presented in the MM alignment session hosted by David Rientjes [2]. > The topic also finds a mention in S J Park's LSFMM proposal [3]. > > Here is the list of potential discussion points: > 1. Other improvements and enhancements to PTE A bit scanning approach. Use of > multiple kernel threads, throttling improvements, promotion policies, per-process > opt-in via prctl, virtual vs physical address based scanning, tuning hot page > detection algorithm etc.

One drawback of physical address based scanning is that it's hard to apply workload-specific policies. For example, a low priority workload may have many relatively hot pages, while a high priority workload has many relatively warm (not so hot) pages. We need to promote the warm pages in the high priority workload, while physical address based scanning may report the hot pages in the low priority workload. Right?

> 2. Possibility of maintaining single source of truth for page hotness that would > maintain hot page information from multiple sources and let other sub-systems > use that info. > > 3. Discuss how hardware provided hotness info (like AMD IBS) can further aid > promotion.
Bharata had posted an RFC [4] on this a while back. > > 4. Overlap with DAMON and potential reuse. > > Links: > > [1] https://lore.kernel.org/all/20241201153818.2633616-1-raghavendra.kt@amd.com/ > [2] https://lore.kernel.org/linux-mm/20241226012833.rmmbkws4wdhzdht6@ed.ac.uk/T/ > [3] https://lore.kernel.org/lkml/Z4XUoWlU-UgRik18@gourry-fedora-PF4VCD3F/T/ > [4] https://lore.kernel.org/lkml/20230208073533.715-2-bharata@amd.com/ > --- Best Regards, Huang, Ying ^ permalink raw reply [flat|nested] 33+ messages in thread
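The policy drawback Ying raises for physical address based scanning can be illustrated with a toy model (plain Python, not kernel code; the workload names, priority weights, and access counts below are invented for illustration). A scanner that only sees physical pages can rank purely by hotness, while a workload-aware policy can weight hotness by priority:

```python
# Toy model of the policy drawback: a physical-address scanner ranks
# purely by hotness, so a low-priority workload's hot pages win over a
# high-priority workload's warm pages.
from dataclasses import dataclass

@dataclass
class Page:
    pfn: int
    owner: str       # owning workload (hypothetical label)
    accesses: int    # A-bit hits observed over the scan window

# Assumed policy weights; a real policy would come from cgroup config.
PRIORITY = {"high-prio": 2.0, "low-prio": 1.0}

def promote_by_hotness(pages, budget):
    """What a pure physical-address scanner can do: rank by accesses only."""
    return sorted(pages, key=lambda p: p.accesses, reverse=True)[:budget]

def promote_by_weighted_hotness(pages, budget):
    """Workload-aware ranking, possible once pages map back to an owner
    (e.g. via struct page -> memcg, as discussed in this thread)."""
    return sorted(pages, key=lambda p: p.accesses * PRIORITY[p.owner],
                  reverse=True)[:budget]

pages = [Page(1, "low-prio", 90), Page(2, "low-prio", 80),
         Page(3, "high-prio", 60), Page(4, "high-prio", 55)]

naive = promote_by_hotness(pages, budget=2)
aware = promote_by_weighted_hotness(pages, budget=2)
print([p.pfn for p in naive])  # the low-priority hot pages win
print([p.pfn for p in aware])  # the high-priority warm pages win
```

With the invented numbers above, the hotness-only ranking promotes only the low-priority workload's pages, which is exactly the failure mode described.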
* Re: [LSF/MM/BPF TOPIC] Overhauling hot page detection and promotion based on PTE A bit scanning 2025-01-26 2:27 ` Huang, Ying @ 2025-01-27 5:11 ` Bharata B Rao 2025-01-27 18:34 ` SeongJae Park 2025-02-07 19:06 ` Davidlohr Bueso 1 sibling, 1 reply; 33+ messages in thread From: Bharata B Rao @ 2025-01-27 5:11 UTC (permalink / raw) To: Huang, Ying, Raghavendra K T Cc: linux-mm, akpm, lsf-pc, gourry, nehagholkar, abhishekd, nphamcs, hannes, feng.tang, kbusch, Hasan.Maruf, sj, david, willy, k.shutemov, mgorman, vbabka, hughd, rientjes, shy828301, liam.howlett, peterz, mingo, nadav.amit, shivankg, ziy, jhubbard, AneeshKumar.KizhakeVeetil, linux-kernel, jon.grimm, santosh.shukla, Michael.Day, riel, weixugc, leesuyeon0506, honggyu.kim, leillc, kmanaouil.dev, rppt, dave.hansen On 26-Jan-25 7:57 AM, Huang, Ying wrote: > Hi, Raghavendra, > > Raghavendra K T <raghavendra.kt@amd.com> writes: > >> Bharata and I would like to propose the following topic for LSFMM. >> >> Topic: Overhauling hot page detection and promotion based on PTE A bit scanning. >> >> In the Linux kernel, hot page information can potentially be obtained from >> multiple sources: >> >> a. PROT_NONE faults (NUMA balancing) >> b. PTE Access bit (LRU scanning) >> c. Hardware provided page hotness info (like AMD IBS) >> >> This information is further used to migrate (or promote) pages from slow memory >> tier to top tier to increase performance. >> >> In the current hot page promotion mechanism, all the activities including the >> process address space scanning, NUMA hint fault handling and page migration are >> performed in the process context. i.e., scanning overhead is borne by the >> applications. >> >> I had recently posted a patch [1] to improve this in the context of slow-tier >> page promotion. Here, Scanning is done by a global kernel thread which routinely >> scans all the processes' address spaces and checks for accesses by reading the >> PTE A bit. 
The hot pages thus identified are maintained in list and subsequently >> are promoted to a default top-tier node. Thus, the approach pushes overhead of >> scanning, NUMA hint faults and migrations off from process context. > > This has been discussed before too. For example, in the following thread > > https://lore.kernel.org/all/20200417100633.GU20730@hirez.programming.kicks-ass.net/T/ Thanks for pointing to this discussion. > > The drawbacks of asynchronous scanning including > > - The CPU cycles used are not charged properly > > - There may be no idle CPU cycles to use > > - The scanning CPU may be not near the workload CPUs enough > > It's better to involve Mel and Peter in the discussion for this. They are CC'ed in this thread and hopefully have insights to share. Charging CPU cycles to the right process has been brought up in other similar contexts. Recent one is from page migration batching and using multiple threads for migration - https://lore.kernel.org/all/CAHbLzkpoKP0fVZP5b10wdzAMDLWysDy7oH0qaUssiUXj80R6bw@mail.gmail.com/ Does it make sense to treat hot page promotion from slow tiers differently compared to locality based balancing? I mean couldn't the charging of this async thread be similar to the cycles spent by other system threads like kcompactd and khugepaged? > >> The topic was presented in the MM alignment session hosted by David Rientjes [2]. >> The topic also finds a mention in S J Park's LSFMM proposal [3]. >> >> Here is the list of potential discussion points: >> 1. Other improvements and enhancements to PTE A bit scanning approach. Use of >> multiple kernel threads, throttling improvements, promotion policies, per-process >> opt-in via prctl, virtual vs physical address based scanning, tuning hot page >> detection algorithm etc. > > One drawback of physical address based scanning is that it's hard to > apply some workload specific policy. 
For example, if a low priority > workload has many relatively hot pages, while a high priority workload > has many relative warm (not so hot) pages. We need to promote the warm > pages in the high priority workload, while physcial address based > scanning may report the hot pages in the low priority workload. Right? Correct. I wonder if DAMON has already devised a scheme to address this. SJ? Regards, Bharata. ^ permalink raw reply [flat|nested] 33+ messages in thread
* Re: [LSF/MM/BPF TOPIC] Overhauling hot page detection and promotion based on PTE A bit scanning 2025-01-27 5:11 ` Bharata B Rao @ 2025-01-27 18:34 ` SeongJae Park 2025-02-07 8:10 ` Huang, Ying 0 siblings, 1 reply; 33+ messages in thread From: SeongJae Park @ 2025-01-27 18:34 UTC (permalink / raw) To: Bharata B Rao Cc: SeongJae Park, Huang, Ying, Raghavendra K T, linux-mm, akpm, lsf-pc, gourry, nehagholkar, abhishekd, nphamcs, hannes, feng.tang, kbusch, Hasan.Maruf, david, willy, k.shutemov, mgorman, vbabka, hughd, rientjes, shy828301, liam.howlett, peterz, mingo, nadav.amit, shivankg, ziy, jhubbard, AneeshKumar.KizhakeVeetil, linux-kernel, jon.grimm, santosh.shukla, Michael.Day, riel, weixugc, leesuyeon0506, honggyu.kim, leillc, kmanaouil.dev, rppt, dave.hansen On Mon, 27 Jan 2025 10:41:07 +0530 Bharata B Rao <bharata@amd.com> wrote: > On 26-Jan-25 7:57 AM, Huang, Ying wrote: > > Hi, Raghavendra, > > > > Raghavendra K T <raghavendra.kt@amd.com> writes: > > > >> Bharata and I would like to propose the following topic for LSFMM. > >> > >> Topic: Overhauling hot page detection and promotion based on PTE A bit scanning. > >> > >> In the Linux kernel, hot page information can potentially be obtained from > >> multiple sources: > >> > >> a. PROT_NONE faults (NUMA balancing) > >> b. PTE Access bit (LRU scanning) > >> c. Hardware provided page hotness info (like AMD IBS) > >> > >> This information is further used to migrate (or promote) pages from slow memory > >> tier to top tier to increase performance. > >> > >> In the current hot page promotion mechanism, all the activities including the > >> process address space scanning, NUMA hint fault handling and page migration are > >> performed in the process context. i.e., scanning overhead is borne by the > >> applications. > >> > >> I had recently posted a patch [1] to improve this in the context of slow-tier > >> page promotion. 
Here, Scanning is done by a global kernel thread which routinely > >> scans all the processes' address spaces and checks for accesses by reading the > >> PTE A bit. The hot pages thus identified are maintained in list and subsequently > >> are promoted to a default top-tier node. Thus, the approach pushes overhead of > >> scanning, NUMA hint faults and migrations off from process context. > > > > This has been discussed before too. For example, in the following thread > > > > https://lore.kernel.org/all/20200417100633.GU20730@hirez.programming.kicks-ass.net/T/ > > Thanks for pointing to this discussion. > > > > > > The drawbacks of asynchronous scanning including > > > > - The CPU cycles used are not charged properly > > > > - There may be no idle CPU cycles to use > > > > - The scanning CPU may be not near the workload CPUs enough > > > > It's better to involve Mel and Peter in the discussion for this. > > They are CC'ed in this thread and hopefully have insights to share. > > Charging CPU cycles to the right process has been brought up in other > similar contexts. Recent one is from page migration batching and using > multiple threads for migration - > https://lore.kernel.org/all/CAHbLzkpoKP0fVZP5b10wdzAMDLWysDy7oH0qaUssiUXj80R6bw@mail.gmail.com/ > > Does it make sense to treat hot page promotion from slow tiers > differently compared to locality based balancing? I mean couldn't the > charging of this async thread be similar to the cycles spent by other > system threads like kcompactd and khugepaged?

I'm on board with this idea.

I agree that fairness is something we need to be aware of. But IMHO, it is something that the async approach can further be advanced for, not a strict blocker for now.

> > > >> The topic was presented in the MM alignment session hosted by David Rientjes [2]. > >> The topic also finds a mention in S J Park's LSFMM proposal [3]. > >> > >> Here is the list of potential discussion points: > >> 1.
Other improvements and enhancements to PTE A bit scanning approach. Use of > >> multiple kernel threads, throttling improvements, promotion policies, per-process > >> opt-in via prctl, virtual vs physical address based scanning, tuning hot page > >> detection algorithm etc. > > > > One drawback of physical address based scanning is that it's hard to > > apply some workload specific policy. For example, if a low priority > > workload has many relatively hot pages, while a high priority workload > > has many relative warm (not so hot) pages. We need to promote the warm > > pages in the high priority workload, while physcial address based > > scanning may report the hot pages in the low priority workload. Right? > > Correct. I wonder if DAMON has already devised a scheme to address this. SJ? Yes, I think DAMOS quotas and DAMOS filters can be used to address this issue. For this case, assuming each workload has its own cgroup, users can add a DAMOS scheme for promotion per workload. The schemes will have different DAMOS quotas based on the workloads' priority. The schemes will also be controlled to do the promotion for pages of the specific workloads using DAMOS filters. For example, below kdamond configuration can be used. 
# damo args damon \ --damos_action migrate_hot 0 --damos_quotas 100ms 1G 1s 0% 100% 100% \ --damos_filter reject none memcg /workloads/high-priority \ \ --damos_action migrate_hot 0 --damos_quotas 10ms 100M 1s 0% 100% 100% \ --damos_filter reject none memcg /workloads/low-priority \ --damos_nr_filters 1 1 --out kdamond.json # damo report damon --input_file ./kdamond.json --damon_params_omit_defaults kdamond 0 context 0 ops: paddr target 0 region [4,294,967,296, 68,577,918,975) (59.868 GiB) intervals: sample 5 ms, aggr 100 ms, update 1 s nr_regions: [10, 1,000] scheme 0 action: migrate_hot to node 0 per aggr interval target access pattern sz: [0 B, max] nr_accesses: [0 %, 18,446,744,073,709,551,616 %] age: [0 ns, max] quotas 100 ms / 1024.000 MiB / 0 B per 1 s priority: sz 0 %, nr_accesses 100 %, age 100 % filter 0 reject none memcg /workloads/high-priority scheme 1 action: migrate_hot to node 0 per aggr interval target access pattern sz: [0 B, max] nr_accesses: [0 %, 18,446,744,073,709,551,616 %] age: [0 ns, max] quotas 10 ms / 100.000 MiB / 0 B per 1 s priority: sz 0 %, nr_accesses 100 %, age 100 % filter 0 reject none memcg /workloads/low-priority Please note that this is just one example based on existing DAMOS features. This may have drawbacks and future optimizations would be possible. Thanks, SJ > > Regards, > Bharata. ^ permalink raw reply [flat|nested] 33+ messages in thread
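The effect of the per-scheme DAMOS quotas in the configuration above can be sketched with a toy model (Python, not DAMON code; the region sizes and quotas below are made-up, scaled-down numbers). Each workload's promotion scheme spends at most its own size quota per interval, so the high-priority scheme can promote more bytes than the low-priority one even if both have hot regions:

```python
# Toy model of per-scheme size quotas: promote the hottest candidate
# regions first, stopping once the scheme's byte quota for this
# interval is spent.
PAGE_SZ = 4096

def run_interval(candidates, quota_bytes):
    """candidates: list of (hotness, nr_pages) region tuples (hypothetical).
    Returns bytes promoted this interval under the given quota."""
    promoted = 0
    for hotness, nr_pages in sorted(candidates, reverse=True):
        size = nr_pages * PAGE_SZ
        if promoted + size > quota_bytes:
            break  # quota exhausted for this interval
        promoted += size
    return promoted

# Invented candidate regions per workload: (hotness, pages).
high_prio = [(10, 64), (8, 128), (3, 512)]
low_prio  = [(9, 16), (7, 8)]

# Quotas mirror the spirit of the damo example (1 GiB vs 100 MiB per
# second), scaled down here to 1 MiB vs 100 KiB per interval.
hi = run_interval(high_prio, quota_bytes=1024 * 1024)
lo = run_interval(low_prio,  quota_bytes=100 * 1024)
print(hi, lo)  # high-priority scheme promotes far more bytes
```

This is only an illustration of the quota mechanism; real DAMOS quotas also include time quotas and prioritization weights, as shown in the configuration above.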
* Re: [LSF/MM/BPF TOPIC] Overhauling hot page detection and promotion based on PTE A bit scanning 2025-01-27 18:34 ` SeongJae Park @ 2025-02-07 8:10 ` Huang, Ying 2025-02-07 9:06 ` Gregory Price 2025-02-07 19:52 ` SeongJae Park 0 siblings, 2 replies; 33+ messages in thread From: Huang, Ying @ 2025-02-07 8:10 UTC (permalink / raw) To: SeongJae Park, Bharata B Rao Cc: Raghavendra K T, linux-mm, akpm, lsf-pc, gourry, nehagholkar, abhishekd, nphamcs, hannes, feng.tang, kbusch, Hasan.Maruf, david, willy, k.shutemov, mgorman, vbabka, hughd, rientjes, shy828301, liam.howlett, peterz, mingo, nadav.amit, shivankg, ziy, jhubbard, AneeshKumar.KizhakeVeetil, linux-kernel, jon.grimm, santosh.shukla, Michael.Day, riel, weixugc, leesuyeon0506, honggyu.kim, leillc, kmanaouil.dev, rppt, dave.hansen SeongJae Park <sj@kernel.org> writes: > On Mon, 27 Jan 2025 10:41:07 +0530 Bharata B Rao <bharata@amd.com> wrote: > >> On 26-Jan-25 7:57 AM, Huang, Ying wrote: >> > Hi, Raghavendra, >> > >> > Raghavendra K T <raghavendra.kt@amd.com> writes: >> > >> >> Bharata and I would like to propose the following topic for LSFMM. >> >> >> >> Topic: Overhauling hot page detection and promotion based on PTE A bit scanning. >> >> >> >> In the Linux kernel, hot page information can potentially be obtained from >> >> multiple sources: >> >> >> >> a. PROT_NONE faults (NUMA balancing) >> >> b. PTE Access bit (LRU scanning) >> >> c. Hardware provided page hotness info (like AMD IBS) >> >> >> >> This information is further used to migrate (or promote) pages from slow memory >> >> tier to top tier to increase performance. >> >> >> >> In the current hot page promotion mechanism, all the activities including the >> >> process address space scanning, NUMA hint fault handling and page migration are >> >> performed in the process context. i.e., scanning overhead is borne by the >> >> applications. >> >> >> >> I had recently posted a patch [1] to improve this in the context of slow-tier >> >> page promotion. 
Here, Scanning is done by a global kernel thread which routinely >> >> scans all the processes' address spaces and checks for accesses by reading the >> >> PTE A bit. The hot pages thus identified are maintained in list and subsequently >> >> are promoted to a default top-tier node. Thus, the approach pushes overhead of >> >> scanning, NUMA hint faults and migrations off from process context. >> > >> > This has been discussed before too. For example, in the following thread >> > >> > https://lore.kernel.org/all/20200417100633.GU20730@hirez.programming.kicks-ass.net/T/ >> >> Thanks for pointing to this discussion. >> >> > >> > The drawbacks of asynchronous scanning including >> > >> > - The CPU cycles used are not charged properly >> > >> > - There may be no idle CPU cycles to use >> > >> > - The scanning CPU may be not near the workload CPUs enough >> > >> > It's better to involve Mel and Peter in the discussion for this. >> >> They are CC'ed in this thread and hopefully have insights to share. >> >> Charging CPU cycles to the right process has been brought up in other >> similar contexts. Recent one is from page migration batching and using >> multiple threads for migration - >> https://lore.kernel.org/all/CAHbLzkpoKP0fVZP5b10wdzAMDLWysDy7oH0qaUssiUXj80R6bw@mail.gmail.com/ >> >> Does it make sense to treat hot page promotion from slow tiers >> differently compared to locality based balancing? I mean couldn't the >> charging of this async thread be similar to the cycles spent by other >> system threads like kcompactd and khugepaged? > > I'm up to this idea. > > I agree the fairness is a thing that we need to aware of. But IMHO, it is > something that the async approach can further be advanced for, not a strict > blocker for now. Personally, I have no objection to async operations in general. However, we may need to find some way to control these async operations instead of adding more and more background kthreads blindly. 
How to charge and constrain the resources used by these async operations is important too. For example, some users may want to bind some async operations on some CPUs. IMHO, we should think about the requirements and possible solutions instead of ignoring the issues. >> >> > >> >> The topic was presented in the MM alignment session hosted by David Rientjes [2]. >> >> The topic also finds a mention in S J Park's LSFMM proposal [3]. >> >> >> >> Here is the list of potential discussion points: >> >> 1. Other improvements and enhancements to PTE A bit scanning approach. Use of >> >> multiple kernel threads, throttling improvements, promotion policies, per-process >> >> opt-in via prctl, virtual vs physical address based scanning, tuning hot page >> >> detection algorithm etc. >> > >> > One drawback of physical address based scanning is that it's hard to >> > apply some workload specific policy. For example, if a low priority >> > workload has many relatively hot pages, while a high priority workload >> > has many relative warm (not so hot) pages. We need to promote the warm >> > pages in the high priority workload, while physcial address based >> > scanning may report the hot pages in the low priority workload. Right? >> >> Correct. I wonder if DAMON has already devised a scheme to address this. SJ? > > Yes, I think DAMOS quotas and DAMOS filters can be used to address this issue. > > For this case, assuming each workload has its own cgroup, users can add a DAMOS > scheme for promotion per workload. The schemes will have different DAMOS > quotas based on the workloads' priority. The schemes will also be controlled > to do the promotion for pages of the specific workloads using DAMOS filters. > > For example, below kdamond configuration can be used. 
> > # damo args damon \ > --damos_action migrate_hot 0 --damos_quotas 100ms 1G 1s 0% 100% 100% \ > --damos_filter reject none memcg /workloads/high-priority \ > \ > --damos_action migrate_hot 0 --damos_quotas 10ms 100M 1s 0% 100% 100% \ > --damos_filter reject none memcg /workloads/low-priority \ > --damos_nr_filters 1 1 --out kdamond.json > # damo report damon --input_file ./kdamond.json --damon_params_omit_defaults > kdamond 0 > context 0 > ops: paddr > target 0 > region [4,294,967,296, 68,577,918,975) (59.868 GiB) > intervals: sample 5 ms, aggr 100 ms, update 1 s > nr_regions: [10, 1,000] > scheme 0 > action: migrate_hot to node 0 per aggr interval > target access pattern > sz: [0 B, max] > nr_accesses: [0 %, 18,446,744,073,709,551,616 %] > age: [0 ns, max] > quotas > 100 ms / 1024.000 MiB / 0 B per 1 s > priority: sz 0 %, nr_accesses 100 %, age 100 % > filter 0 > reject none memcg /workloads/high-priority > scheme 1 > action: migrate_hot to node 0 per aggr interval > target access pattern > sz: [0 B, max] > nr_accesses: [0 %, 18,446,744,073,709,551,616 %] > age: [0 ns, max] > quotas > 10 ms / 100.000 MiB / 0 B per 1 s > priority: sz 0 %, nr_accesses 100 %, age 100 % > filter 0 > reject none memcg /workloads/low-priority > > Please note that this is just one example based on existing DAMOS features. > This may have drawbacks and future optimizations would be possible. IIUC, this is something like, physical address -> struct page -> cgroup -> per-cgroup hot threshold this sounds good to me. Thanks! --- Best Regards, Huang, Ying ^ permalink raw reply [flat|nested] 33+ messages in thread
* Re: [LSF/MM/BPF TOPIC] Overhauling hot page detection and promotion based on PTE A bit scanning 2025-02-07 8:10 ` Huang, Ying @ 2025-02-07 9:06 ` Gregory Price 2025-02-07 19:52 ` SeongJae Park 1 sibling, 0 replies; 33+ messages in thread From: Gregory Price @ 2025-02-07 9:06 UTC (permalink / raw) To: Huang, Ying Cc: SeongJae Park, Bharata B Rao, Raghavendra K T, linux-mm, akpm, lsf-pc, nehagholkar, abhishekd, nphamcs, hannes, feng.tang, kbusch, Hasan.Maruf, david, willy, k.shutemov, mgorman, vbabka, hughd, rientjes, shy828301, liam.howlett, peterz, mingo, nadav.amit, shivankg, ziy, jhubbard, AneeshKumar.KizhakeVeetil, linux-kernel, jon.grimm, santosh.shukla, Michael.Day, riel, weixugc, leesuyeon0506, honggyu.kim, leillc, kmanaouil.dev, rppt, dave.hansen On Fri, Feb 07, 2025 at 04:10:47PM +0800, Huang, Ying wrote: > SeongJae Park <sj@kernel.org> writes: > > > > I agree the fairness is a thing that we need to aware of. But IMHO, it is > > something that the async approach can further be advanced for, not a strict > > blocker for now. > > Personally, I have no objection to async operations in general. > However, we may need to find some way to control these async operations > instead of adding more and more background kthreads blindly. How to > charge and constrain the resources used by these async operations is > important too. For example, some users may want to bind some async > operations on some CPUs. > > IMHO, we should think about the requirements and possible solutions > instead of ignoring the issues. > It also concerns me that most every proposal on async promotion ignores the promotion-node selection problem as if it's a secondary issue. Async systems fundamentally lack accessor-locality information unless it is recorded - and recording this information is expensive and/or heuristically imprecise for memory shared across tasks (two threads in the same process schedule across sockets). 
If we can't agree on a solution to this problem, it undercuts many of these RFCs which often simply hard-code the target node to "0" because it's too hard or too expensive to consider the multi-socket scenario. ~Gregory ^ permalink raw reply [flat|nested] 33+ messages in thread
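The target-node problem Gregory raises can be made concrete with a small sketch (Python, not kernel code; the node topology and recorded accessor lists are invented). An async promoter that recorded accessor locality could pick the top-tier node nearest the dominant accessor; without that record, it can only fall back to a hard-coded default:

```python
# Toy model of promotion-node selection from recorded accessor locality.
from collections import Counter

# Hypothetical topology: accessor node -> nearest top-tier (DRAM) node.
TOPTIER_OF = {0: 0, 1: 1, 2: 0, 3: 1}

def pick_target(access_nodes, default=0):
    """access_nodes: accessor node IDs recorded for one page (may be
    empty). Without recorded locality, an async promoter has nothing
    better than the default -- the hard-coded "0" Gregory mentions."""
    if not access_nodes:
        return default
    dominant, _ = Counter(access_nodes).most_common(1)[0]
    return TOPTIER_OF[dominant]

print(pick_target([1, 1, 3, 0]))  # mostly socket-1 accessors -> node 1
print(pick_target([]))            # no locality info -> falls back to 0
```

The sketch also shows why shared memory is hard: if two threads of one process run on different sockets, the recorded accessor list is mixed and any single "dominant" choice is heuristic.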
* Re: [LSF/MM/BPF TOPIC] Overhauling hot page detection and promotion based on PTE A bit scanning 2025-02-07 8:10 ` Huang, Ying 2025-02-07 9:06 ` Gregory Price @ 2025-02-07 19:52 ` SeongJae Park 1 sibling, 0 replies; 33+ messages in thread From: SeongJae Park @ 2025-02-07 19:52 UTC (permalink / raw) To: Huang, Ying Cc: SeongJae Park, Bharata B Rao, Raghavendra K T, linux-mm, akpm, lsf-pc, gourry, nehagholkar, abhishekd, nphamcs, hannes, feng.tang, kbusch, Hasan.Maruf, david, willy, k.shutemov, mgorman, vbabka, hughd, rientjes, shy828301, liam.howlett, peterz, mingo, nadav.amit, shivankg, ziy, jhubbard, AneeshKumar.KizhakeVeetil, linux-kernel, jon.grimm, santosh.shukla, Michael.Day, riel, weixugc, leesuyeon0506, honggyu.kim, leillc, kmanaouil.dev, rppt, dave.hansen On Fri, 07 Feb 2025 16:10:47 +0800 "Huang, Ying" <ying.huang@linux.alibaba.com> wrote: > SeongJae Park <sj@kernel.org> writes: > > > On Mon, 27 Jan 2025 10:41:07 +0530 Bharata B Rao <bharata@amd.com> wrote: > > > >> On 26-Jan-25 7:57 AM, Huang, Ying wrote: > >> > Hi, Raghavendra, > >> > > >> > Raghavendra K T <raghavendra.kt@amd.com> writes: [...] > >> > The drawbacks of asynchronous scanning including > >> > > >> > - The CPU cycles used are not charged properly > >> > > >> > - There may be no idle CPU cycles to use > >> > > >> > - The scanning CPU may be not near the workload CPUs enough > >> > > >> > It's better to involve Mel and Peter in the discussion for this. > >> > >> They are CC'ed in this thread and hopefully have insights to share. > >> > >> Charging CPU cycles to the right process has been brought up in other > >> similar contexts. Recent one is from page migration batching and using > >> multiple threads for migration - > >> https://lore.kernel.org/all/CAHbLzkpoKP0fVZP5b10wdzAMDLWysDy7oH0qaUssiUXj80R6bw@mail.gmail.com/ > >> > >> Does it make sense to treat hot page promotion from slow tiers > >> differently compared to locality based balancing? 
I mean couldn't the > >> charging of this async thread be similar to the cycles spent by other > >> system threads like kcompactd and khugepaged? > > > > I'm up to this idea. > > > > I agree the fairness is a thing that we need to aware of. But IMHO, it is > > something that the async approach can further be advanced for, not a strict > > blocker for now. > > Personally, I have no objection to async operations in general. > However, we may need to find some way to control these async operations > instead of adding more and more background kthreads blindly. How to > charge and constrain the resources used by these async operations is > important too. For example, some users may want to bind some async > operations on some CPUs. > > IMHO, we should think about the requirements and possible solutions > instead of ignoring the issues.

I agree. For DAMON, we implemented the DAMOS quotas feature for such resource control. We also had a (non-public) discussion about splitting the DAMON thread into separate monitoring and operation-scheme execution parts for finer control. I'm also thinking about applying quotas to the monitoring part's resource consumption. We haven't implemented these ideas yet, though, since the real-world requirements are unclear as of now. We will keep collecting the requirements, and will prioritize them or build another solution as the requirements become clearer.

[...]

> >> > One drawback of physical address based scanning is that it's hard to > >> > apply some workload specific policy. For example, if a low priority > >> > workload has many relatively hot pages, while a high priority workload > >> > has many relative warm (not so hot) pages. We need to promote the warm > >> > pages in the high priority workload, while physcial address based > >> > scanning may report the hot pages in the low priority workload. Right? > >> > >> Correct. I wonder if DAMON has already devised a scheme to address this. SJ?
> > > > Yes, I think DAMOS quotas and DAMOS filters can be used to address this issue. > > > > For this case, assuming each workload has its own cgroup, users can add a DAMOS > > scheme for promotion per workload. The schemes will have different DAMOS > > quotas based on the workloads' priority. The schemes will also be controlled > > to do the promotion for pages of the specific workloads using DAMOS filters. > > > > For example, below kdamond configuration can be used. > > > > # damo args damon \ > > --damos_action migrate_hot 0 --damos_quotas 100ms 1G 1s 0% 100% 100% \ > > --damos_filter reject none memcg /workloads/high-priority \ > > \ > > --damos_action migrate_hot 0 --damos_quotas 10ms 100M 1s 0% 100% 100% \ > > --damos_filter reject none memcg /workloads/low-priority \ > > --damos_nr_filters 1 1 --out kdamond.json > > # damo report damon --input_file ./kdamond.json --damon_params_omit_defaults > > kdamond 0 > > context 0 > > ops: paddr > > target 0 > > region [4,294,967,296, 68,577,918,975) (59.868 GiB) > > intervals: sample 5 ms, aggr 100 ms, update 1 s > > nr_regions: [10, 1,000] > > scheme 0 > > action: migrate_hot to node 0 per aggr interval > > target access pattern > > sz: [0 B, max] > > nr_accesses: [0 %, 18,446,744,073,709,551,616 %] > > age: [0 ns, max] > > quotas > > 100 ms / 1024.000 MiB / 0 B per 1 s > > priority: sz 0 %, nr_accesses 100 %, age 100 % > > filter 0 > > reject none memcg /workloads/high-priority > > scheme 1 > > action: migrate_hot to node 0 per aggr interval > > target access pattern > > sz: [0 B, max] > > nr_accesses: [0 %, 18,446,744,073,709,551,616 %] > > age: [0 ns, max] > > quotas > > 10 ms / 100.000 MiB / 0 B per 1 s > > priority: sz 0 %, nr_accesses 100 %, age 100 % > > filter 0 > > reject none memcg /workloads/low-priority > > > > Please note that this is just one example based on existing DAMOS features. > > This may have drawbacks and future optimizations would be possible. 
IIUC, this is something like, > > physical address -> struct page -> cgroup -> per-cgroup hot threshold You're right. > > this sounds good to me. Thanks! Happy to hear that, and looking forward to continuing to improve it further with you! :) Thanks, SJ > > --- > Best Regards, > Huang, Ying ^ permalink raw reply [flat|nested] 33+ messages in thread
* Re: [LSF/MM/BPF TOPIC] Overhauling hot page detection and promotion based on PTE A bit scanning 2025-01-26 2:27 ` Huang, Ying 2025-01-27 5:11 ` Bharata B Rao @ 2025-02-07 19:06 ` Davidlohr Bueso 2025-03-14 1:56 ` Raghavendra K T 1 sibling, 1 reply; 33+ messages in thread From: Davidlohr Bueso @ 2025-02-07 19:06 UTC (permalink / raw) To: Huang, Ying Cc: Raghavendra K T, linux-mm, akpm, lsf-pc, bharata, gourry, nehagholkar, abhishekd, nphamcs, hannes, feng.tang, kbusch, Hasan.Maruf, sj, david, willy, k.shutemov, mgorman, vbabka, hughd, rientjes, shy828301, liam.howlett, peterz, mingo, nadav.amit, shivankg, ziy, jhubbard, AneeshKumar.KizhakeVeetil, linux-kernel, jon.grimm, santosh.shukla, Michael.Day, riel, weixugc, leesuyeon0506, honggyu.kim, leillc, kmanaouil.dev, rppt, dave.hansen, dongjoo.linux.dev On Sun, 26 Jan 2025, Huang, Ying wrote: >Hi, Raghavendra, > >Raghavendra K T <raghavendra.kt@amd.com> writes: > >> Bharata and I would like to propose the following topic for LSFMM. >> >> Topic: Overhauling hot page detection and promotion based on PTE A bit scanning. >> >> In the Linux kernel, hot page information can potentially be obtained from >> multiple sources: >> >> a. PROT_NONE faults (NUMA balancing) >> b. PTE Access bit (LRU scanning) >> c. Hardware provided page hotness info (like AMD IBS) >> >> This information is further used to migrate (or promote) pages from slow memory >> tier to top tier to increase performance. >> >> In the current hot page promotion mechanism, all the activities including the >> process address space scanning, NUMA hint fault handling and page migration are >> performed in the process context. i.e., scanning overhead is borne by the >> applications. >> >> I had recently posted a patch [1] to improve this in the context of slow-tier >> page promotion. Here, Scanning is done by a global kernel thread which routinely >> scans all the processes' address spaces and checks for accesses by reading the >> PTE A bit. 
The hot pages thus identified are maintained in list and subsequently >> are promoted to a default top-tier node. Thus, the approach pushes overhead of >> scanning, NUMA hint faults and migrations off from process context.

It seems that, overall, a global view of hot memory is what folks are leaning towards. In the past we have discussed an external thread to harvest information from different sources and do the corresponding migration. I think your work is a step in this direction (and shows promising numbers), but I'm not sure if it should be doing the scanning part, as opposed to just receiving the information and migrating (according to some policy based on a wider, system-level view of what is hot; i.e., what the CHMU says is hot might not be so hot to the rest of the system, or, as pointed out below, may be subject to workload-based priorities).

> >This has been discussed before too. For example, in the following thread > >https://lore.kernel.org/all/20200417100633.GU20730@hirez.programming.kicks-ass.net/T/ > >The drawbacks of asynchronous scanning including > >- The CPU cycles used are not charged properly > >- There may be no idle CPU cycles to use > >- The scanning CPU may be not near the workload CPUs enough

One approach we experimented with was doing only the page migration asynchronously, leaving the scanning to the task context, which also knows the destination NUMA node. Results showed that page fault latencies were reduced without affecting benchmark performance. Of course, busy systems are an issue, as the window between servicing the fault and actually making the page available to the user in fast memory is enlarged.

>It's better to involve Mel and Peter in the discussion for this. > >> The topic was presented in the MM alignment session hosted by David Rientjes [2]. >> The topic also finds a mention in S J Park's LSFMM proposal [3]. >> >> Here is the list of potential discussion points: >> 1. Other improvements and enhancements to PTE A bit scanning approach.
Use of >> multiple kernel threads, throttling improvements, promotion policies, per-process >> opt-in via prctl, virtual vs physical address based scanning, tuning hot page >> detection algorithm etc. > >One drawback of physical address based scanning is that it's hard to >apply some workload specific policy. For example, if a low priority >workload has many relatively hot pages while a high priority workload >has many relatively warm (not so hot) pages, we need to promote the warm >pages in the high priority workload, while physical address based >scanning may report the hot pages in the low priority workload. Right? > >> 2. Possibility of maintaining single source of truth for page hotness that would >> maintain hot page information from multiple sources and let other sub-systems >> use that info. >> >> 3. Discuss how hardware provided hotness info (like AMD IBS) can further aid >> promotion. Bharata had posted an RFC [4] on this a while back. >> >> 4. Overlap with DAMON and potential reuse. >> >> Links: >> >> [1] https://lore.kernel.org/all/20241201153818.2633616-1-raghavendra.kt@amd.com/ >> [2] https://lore.kernel.org/linux-mm/20241226012833.rmmbkws4wdhzdht6@ed.ac.uk/T/ >> [3] https://lore.kernel.org/lkml/Z4XUoWlU-UgRik18@gourry-fedora-PF4VCD3F/T/ >> [4] https://lore.kernel.org/lkml/20230208073533.715-2-bharata@amd.com/ >> > >--- >Best Regards, >Huang, Ying > ^ permalink raw reply [flat|nested] 33+ messages in thread
* Re: [LSF/MM/BPF TOPIC] Overhauling hot page detection and promotion based on PTE A bit scanning 2025-02-07 19:06 ` Davidlohr Bueso @ 2025-03-14 1:56 ` Raghavendra K T 2025-03-14 2:12 ` Raghavendra K T 0 siblings, 1 reply; 33+ messages in thread From: Raghavendra K T @ 2025-03-14 1:56 UTC (permalink / raw) To: Huang, Ying, linux-mm, akpm, lsf-pc, bharata, gourry, nehagholkar, abhishekd, nphamcs, hannes, feng.tang, kbusch, Hasan.Maruf, sj, david, willy, k.shutemov, mgorman, vbabka, hughd, rientjes, shy828301, liam.howlett, peterz, mingo, nadav.amit, shivankg, ziy, jhubbard, AneeshKumar.KizhakeVeetil, linux-kernel, jon.grimm, santosh.shukla, Michael.Day, riel, weixugc, leesuyeon0506, honggyu.kim, leillc, kmanaouil.dev, rppt, dave.hansen, dongjoo.linux.dev On 2/8/2025 12:36 AM, Davidlohr Bueso wrote: > On Sun, 26 Jan 2025, Huang, Ying wrote: > >> Hi, Raghavendra, >> >> Raghavendra K T <raghavendra.kt@amd.com> writes: >> >>> Bharata and I would like to propose the following topic for LSFMM. >>> >>> Topic: Overhauling hot page detection and promotion based on PTE A >>> bit scanning. >>> >>> In the Linux kernel, hot page information can potentially be obtained >>> from >>> multiple sources: >>> >>> a. PROT_NONE faults (NUMA balancing) >>> b. PTE Access bit (LRU scanning) >>> c. Hardware provided page hotness info (like AMD IBS) >>> >>> This information is further used to migrate (or promote) pages from >>> slow memory >>> tier to top tier to increase performance. >>> >>> In the current hot page promotion mechanism, all the activities >>> including the >>> process address space scanning, NUMA hint fault handling and page >>> migration are >>> performed in the process context. i.e., scanning overhead is borne by >>> the >>> applications. >>> >>> I had recently posted a patch [1] to improve this in the context of >>> slow-tier >>> page promotion. 
Here, Scanning is done by a global kernel thread >>> which routinely >>> scans all the processes' address spaces and checks for accesses by >>> reading the >>> PTE A bit. The hot pages thus identified are maintained in list and >>> subsequently >>> are promoted to a default top-tier node. Thus, the approach pushes >>> overhead of >>> scanning, NUMA hint faults and migrations off from process context. > > It seems that overall having a global view of hot memory is where folks > are leaning > towards. In the past we have discussed an external thread to harvest > information > from different sources and do the corresponding migration. I think your > work is a > step in this direction (and shows promising numbers), but I'm not sure > if it should > be doing the scanning part, as opposed to just receive the information > and migrate > (according to some policy based on a wider system view of what is hot; > ie: what CHMU > says is hot might not be so hot to the rest of the system, or as is > pointed out > below, workload based, as priorities). > >> >> This has been discussed before too. For example, in the following thread >> >> https://lore.kernel.org/ >> all/20200417100633.GU20730@hirez.programming.kicks-ass.net/T/ >> >> The drawbacks of asynchronous scanning including >> >> - The CPU cycles used are not charged properly >> >> - There may be no idle CPU cycles to use >> >> - The scanning CPU may be not near the workload CPUs enough > > One approach we experimented with was doing only the page migration > asynchronously, > leaving the scanning to the task context, which also knows the dest numa > node. > Results showed that page fault latencies were reduced without affecting > benchmark > performance. Of course busy systems are an issue, as the window between > servicing > the fault and actually making it available to the user in fast memory is > enlarged. > >> It's better to involve Mel and Peter in the discussion for this. 
>> >>> The topic was presented in the MM alignment session hosted by David >>> Rientjes [2]. >>> The topic also finds a mention in S J Park's LSFMM proposal [3]. >>> >>> Here is the list of potential discussion points: >>> 1. Other improvements and enhancements to PTE A bit scanning >>> approach. Use of >>> multiple kernel threads, throttling improvements, promotion policies, >>> per-process >>> opt-in via prctl, virtual vs physical address based scanning, tuning >>> hot page >>> detection algorithm etc. >> >> One drawback of physical address based scanning is that it's hard to >> apply some workload specific policy. For example, if a low priority >> workload has many relatively hot pages while a high priority workload >> has many relatively warm (not so hot) pages, we need to promote the warm >> pages in the high priority workload, while physical address based >> scanning may report the hot pages in the low priority workload. Right? >> >>> 2. Possibility of maintaining single source of truth for page hotness >>> that would >>> maintain hot page information from multiple sources and let other >>> sub-systems >>> use that info. >>> >>> 3. Discuss how hardware provided hotness info (like AMD IBS) can >>> further aid >>> promotion. Bharata had posted an RFC [4] on this a while back. >>> >>> 4. Overlap with DAMON and potential reuse. >>> >>> Links: >>> >>> [1] https://lore.kernel.org/all/20241201153818.2633616-1-raghavendra.kt@amd.com/ >>> [2] https://lore.kernel.org/linux-mm/20241226012833.rmmbkws4wdhzdht6@ed.ac.uk/T/ >>> [3] https://lore.kernel.org/lkml/Z4XUoWlU-UgRik18@gourry-fedora-PF4VCD3F/T/ >>> [4] https://lore.kernel.org/lkml/20230208073533.715-2-bharata@amd.com/ Hello All, Sorry to come back late on this. But after the "Unifying source of page temperature" discussion, I was trying to get one step closer towards that (along with Bharata). (Also, some time was spent on a failed multi-threaded scanning attempt that perhaps needs more time, if it is needed at all.)
I am posting a single patch which is still in a "raw" state (as a reply to this email). I will clean up, split the patch and post early next week. Sending this so there is at least a gist of what is coming before LSFMM. So here is the list of implemented feedback that we can build on further (depending on the consensus). 1. Scanning and migration are separated. A separate migration thread is created. Potential improvements that can be done here: - Have one instance of the migration thread per node. - An API to accept hot pages for promotion from different sources (e.g., IBS / LRU, as Bharata already mentioned) - Throttling control similar to what Huang has done in the NUMAB=2 case - Take both PFN and folio as arguments for migration - Make use of batch migration enhancements - Use of a per-mm migration list for easy lookup and control (using mm_slot; this also helps build towards identifying actually hot pages (two subsequent accesses) rather than single-access ones) 2. Implemented David's (Rientjes) suggestion of having a prctl approach. Currently prctl values can range from 0..10: 0 disables scanning, values >= 1 enable it. In the future the idea is to use this to further control the scan rate. 3. Steve's comment on tracing is incorporated. 4. Davidlohr's reported issue on the patch series is fixed. 5. Very importantly, I do have a basic algorithm that detects the "target node for migration", which was the main pain point for PTE A bit scanning. Algorithm: As part of our scanning, we scan top-tier pages as well. During the scan, the number of pages scanned/accessed that belong to each top-tier/slow-tier node is also recorded. Currently my algorithm chooses the top-tier node that had the maximum pages scanned. But we can build a more complex algorithm using the scanned/accessed data (e.g. decay the last scanned/accessed info; if the current top-tier node becomes nearly full, find the next preferred node, thus using a nodemask/preferred list instead of a single node, etc.).
Potential improvements on the scanning side include using more complex data structures to maintain areas of hot pages, similar to what DAMON does, or reusing some DAMON infrastructure. Thanks and Regards - Raghu ^ permalink raw reply [flat|nested] 33+ messages in thread
* Re: [LSF/MM/BPF TOPIC] Overhauling hot page detection and promotion based on PTE A bit scanning 2025-03-14 1:56 ` Raghavendra K T @ 2025-03-14 2:12 ` Raghavendra K T 0 siblings, 0 replies; 33+ messages in thread From: Raghavendra K T @ 2025-03-14 2:12 UTC (permalink / raw) To: raghavendra.kt Cc: AneeshKumar.KizhakeVeetil, Hasan.Maruf, Michael.Day, abhishekd, akpm, bharata, dave.hansen, david, dongjoo.linux.dev, feng.tang, gourry, hannes, honggyu.kim, hughd, jhubbard, jon.grimm, k.shutemov, kbusch, kmanaouil.dev, leesuyeon0506, leillc, liam.howlett, linux-kernel, linux-mm, lsf-pc, mgorman, mingo, nadav.amit, nehagholkar, nphamcs, peterz, riel, rientjes, rppt, santosh.shukla, shivankg, shy828301, sj, vbabka, weixugc, willy, ying.huang, ziy PTE A bit scanning single patch RFC v1 ---8x--- diff --git a/Documentation/filesystems/proc.rst b/Documentation/filesystems/proc.rst index 09f0aed5a08b..78633cab3f1a 100644 --- a/Documentation/filesystems/proc.rst +++ b/Documentation/filesystems/proc.rst @@ -195,6 +195,7 @@ read the file /proc/PID/status:: VmLib: 1412 kB VmPTE: 20 kb VmSwap: 0 kB + PTEAScanScale: 0 HugetlbPages: 0 kB CoreDumping: 0 THP_enabled: 1 @@ -278,6 +279,7 @@ It's slow but very precise. 
VmPTE size of page table entries VmSwap amount of swap used by anonymous private data (shmem swap usage is not included) + PTEAScanScale Integer representing async PTE A bit scan aggression HugetlbPages size of hugetlb memory portions CoreDumping process's memory is currently being dumped (killing the process may lead to a corrupted core) diff --git a/fs/exec.c b/fs/exec.c index 506cd411f4ac..e76285e4bc73 100644 --- a/fs/exec.c +++ b/fs/exec.c @@ -68,6 +68,7 @@ #include <linux/user_events.h> #include <linux/rseq.h> #include <linux/ksm.h> +#include <linux/kmmscand.h> #include <linux/uaccess.h> #include <asm/mmu_context.h> @@ -266,6 +267,8 @@ static int __bprm_mm_init(struct linux_binprm *bprm) if (err) goto err_ksm; + kmmscand_execve(mm); + /* * Place the stack at the largest stack address the architecture * supports. Later, we'll move this to an appropriate place. We don't @@ -288,6 +291,7 @@ static int __bprm_mm_init(struct linux_binprm *bprm) return 0; err: ksm_exit(mm); + kmmscand_exit(mm); err_ksm: mmap_write_unlock(mm); err_free: diff --git a/fs/proc/task_mmu.c b/fs/proc/task_mmu.c index f02cd362309a..55620a5178fb 100644 --- a/fs/proc/task_mmu.c +++ b/fs/proc/task_mmu.c @@ -79,6 +79,10 @@ void task_mem(struct seq_file *m, struct mm_struct *mm) " kB\nVmPTE:\t", mm_pgtables_bytes(mm) >> 10, 8); SEQ_PUT_DEC(" kB\nVmSwap:\t", swap); seq_puts(m, " kB\n"); +#ifdef CONFIG_KMMSCAND + seq_put_decimal_ull_width(m, "PTEAScanScale:\t", mm->pte_scan_scale, 8); + seq_puts(m, "\n"); +#endif hugetlb_report_usage(m, mm); } #undef SEQ_PUT_DEC diff --git a/include/linux/kmmscand.h b/include/linux/kmmscand.h new file mode 100644 index 000000000000..7021f7d979a6 --- /dev/null +++ b/include/linux/kmmscand.h @@ -0,0 +1,31 @@ +/* SPDX-License-Identifier: GPL-2.0 */ +#ifndef _LINUX_KMMSCAND_H_ +#define _LINUX_KMMSCAND_H_ + +#ifdef CONFIG_KMMSCAND +extern void __kmmscand_enter(struct mm_struct *mm); +extern void __kmmscand_exit(struct mm_struct *mm); + +static inline void
kmmscand_execve(struct mm_struct *mm) +{ + __kmmscand_enter(mm); +} + +static inline void kmmscand_fork(struct mm_struct *mm, struct mm_struct *oldmm) +{ + mm->pte_scan_scale = oldmm->pte_scan_scale; + __kmmscand_enter(mm); +} + +static inline void kmmscand_exit(struct mm_struct *mm) +{ + __kmmscand_exit(mm); +} +#else /* !CONFIG_KMMSCAND */ +static inline void __kmmscand_enter(struct mm_struct *mm) {} +static inline void __kmmscand_exit(struct mm_struct *mm) {} +static inline void kmmscand_execve(struct mm_struct *mm) {} +static inline void kmmscand_fork(struct mm_struct *mm, struct mm_struct *oldmm) {} +static inline void kmmscand_exit(struct mm_struct *mm) {} +#endif +#endif /* _LINUX_KMMSCAND_H_ */ diff --git a/include/linux/mm.h b/include/linux/mm.h index 7b1068ddcbb7..fbd9273f6c65 100644 --- a/include/linux/mm.h +++ b/include/linux/mm.h @@ -682,6 +682,18 @@ struct vm_operations_struct { unsigned long addr); }; +#ifdef CONFIG_KMMSCAND +void count_kmmscand_mm_scans(void); +void count_kmmscand_vma_scans(void); +void count_kmmscand_migadded(void); +void count_kmmscand_migrated(void); +void count_kmmscand_migrate_failed(void); +void count_kmmscand_kzalloc_fail(void); +void count_kmmscand_slowtier(void); +void count_kmmscand_toptier(void); +void count_kmmscand_idlepage(void); +#endif + #ifdef CONFIG_NUMA_BALANCING static inline void vma_numab_state_init(struct vm_area_struct *vma) { diff --git a/include/linux/mm_types.h b/include/linux/mm_types.h index 0234f14f2aa6..7950554f7447 100644 --- a/include/linux/mm_types.h +++ b/include/linux/mm_types.h @@ -1014,6 +1014,14 @@ struct mm_struct { /* numa_scan_seq prevents two threads remapping PTEs. */ int numa_scan_seq; +#endif +#ifdef CONFIG_KMMSCAND + /* Tracks promotion node. XXX: use nodemask */ + int target_node; + + /* Integer representing PTE A bit scan aggression (0-10) */ + unsigned int pte_scan_scale; + #endif /* * An operation with batched TLB flushing is going on. 
Anything diff --git a/include/linux/vm_event_item.h b/include/linux/vm_event_item.h index f70d0958095c..620c1b1c157a 100644 --- a/include/linux/vm_event_item.h +++ b/include/linux/vm_event_item.h @@ -65,6 +65,17 @@ enum vm_event_item { PGPGIN, PGPGOUT, PSWPIN, PSWPOUT, NUMA_HINT_FAULTS_LOCAL, NUMA_PAGE_MIGRATE, #endif +#ifdef CONFIG_KMMSCAND + KMMSCAND_MM_SCANS, + KMMSCAND_VMA_SCANS, + KMMSCAND_MIGADDED, + KMMSCAND_MIGRATED, + KMMSCAND_MIGRATE_FAILED, + KMMSCAND_KZALLOC_FAIL, + KMMSCAND_SLOWTIER, + KMMSCAND_TOPTIER, + KMMSCAND_IDLEPAGE, +#endif #ifdef CONFIG_MIGRATION PGMIGRATE_SUCCESS, PGMIGRATE_FAIL, THP_MIGRATION_SUCCESS, diff --git a/include/trace/events/kmem.h b/include/trace/events/kmem.h index b37eb0a7060f..be1a7188a192 100644 --- a/include/trace/events/kmem.h +++ b/include/trace/events/kmem.h @@ -9,6 +9,96 @@ #include <linux/tracepoint.h> #include <trace/events/mmflags.h> +DECLARE_EVENT_CLASS(kmem_mm_class, + + TP_PROTO(struct mm_struct *mm), + + TP_ARGS(mm), + + TP_STRUCT__entry( + __field( struct mm_struct *, mm ) + ), + + TP_fast_assign( + __entry->mm = mm; + ), + + TP_printk("mm = %p", __entry->mm) +); + +DEFINE_EVENT(kmem_mm_class, kmem_mm_enter, + TP_PROTO(struct mm_struct *mm), + TP_ARGS(mm) +); + +DEFINE_EVENT(kmem_mm_class, kmem_mm_exit, + TP_PROTO(struct mm_struct *mm), + TP_ARGS(mm) +); + +DEFINE_EVENT(kmem_mm_class, kmem_scan_mm_start, + TP_PROTO(struct mm_struct *mm), + TP_ARGS(mm) +); + +TRACE_EVENT(kmem_scan_mm_end, + + TP_PROTO( struct mm_struct *mm, + unsigned long start, + unsigned long total, + unsigned long scan_period, + unsigned long scan_size, + int target_node), + + TP_ARGS(mm, start, total, scan_period, scan_size, target_node), + + TP_STRUCT__entry( + __field( struct mm_struct *, mm ) + __field( unsigned long, start ) + __field( unsigned long, total ) + __field( unsigned long, scan_period ) + __field( unsigned long, scan_size ) + __field( int, target_node ) + ), + + TP_fast_assign( + __entry->mm = mm; + __entry->start = start; + 
__entry->total = total; + __entry->scan_period = scan_period; + __entry->scan_size = scan_size; + __entry->target_node = target_node; + ), + + TP_printk("mm=%p, start = %ld, total = %ld, scan_period = %ld, scan_size = %ld node = %d", + __entry->mm, __entry->start, __entry->total, __entry->scan_period, + __entry->scan_size, __entry->target_node) +); + +TRACE_EVENT(kmem_scan_mm_migrate, + + TP_PROTO(struct mm_struct *mm, + int rc, + int target_node), + + TP_ARGS(mm, rc, target_node), + + TP_STRUCT__entry( + __field( struct mm_struct *, mm ) + __field( int, rc ) + __field( int, target_node ) + ), + + TP_fast_assign( + __entry->mm = mm; + __entry->rc = rc; + __entry->target_node = target_node; + ), + + TP_printk("mm = %p rc = %d node = %d", + __entry->mm, __entry->rc, __entry->target_node) +); + TRACE_EVENT(kmem_cache_alloc, TP_PROTO(unsigned long call_site, diff --git a/include/uapi/linux/prctl.h b/include/uapi/linux/prctl.h index 5c6080680cb2..18face11440a 100644 --- a/include/uapi/linux/prctl.h +++ b/include/uapi/linux/prctl.h @@ -353,4 +353,11 @@ struct prctl_mm_map { */ #define PR_LOCK_SHADOW_STACK_STATUS 76 +/* Set/get PTE A bit scan scale */ +#define PR_SET_PTE_A_SCAN_SCALE 77 +#define PR_GET_PTE_A_SCAN_SCALE 78 +# define PR_PTE_A_SCAN_SCALE_MIN 0 +# define PR_PTE_A_SCAN_SCALE_MAX 10 +# define PR_PTE_A_SCAN_SCALE_DEFAULT 1 + #endif /* _LINUX_PRCTL_H */ diff --git a/kernel/fork.c b/kernel/fork.c index 735405a9c5f3..bfbbacb8ec36 100644 --- a/kernel/fork.c +++ b/kernel/fork.c @@ -85,6 +85,7 @@ #include <linux/user-return-notifier.h> #include <linux/oom.h> #include <linux/khugepaged.h> +#include <linux/kmmscand.h> #include <linux/signalfd.h> #include <linux/uprobes.h> #include <linux/aio.h> @@ -105,6 +106,7 @@ #include <uapi/linux/pidfd.h> #include <linux/pidfs.h> #include <linux/tick.h> +#include <linux/prctl.h> #include <asm/pgalloc.h> #include <linux/uaccess.h> @@ -656,6 +658,8 @@ static __latent_entropy int dup_mmap(struct mm_struct *mm, mm->exec_vm = 
oldmm->exec_vm; mm->stack_vm = oldmm->stack_vm; + kmmscand_fork(mm, oldmm); + /* Use __mt_dup() to efficiently build an identical maple tree. */ retval = __mt_dup(&oldmm->mm_mt, &mm->mm_mt, GFP_KERNEL); if (unlikely(retval)) @@ -1289,6 +1293,9 @@ static struct mm_struct *mm_init(struct mm_struct *mm, struct task_struct *p, init_tlb_flush_pending(mm); #if defined(CONFIG_TRANSPARENT_HUGEPAGE) && !defined(CONFIG_SPLIT_PMD_PTLOCKS) mm->pmd_huge_pte = NULL; +#endif +#ifdef CONFIG_KMMSCAND + mm->pte_scan_scale = PR_PTE_A_SCAN_SCALE_DEFAULT; #endif mm_init_uprobes_state(mm); hugetlb_count_init(mm); @@ -1353,6 +1360,7 @@ static inline void __mmput(struct mm_struct *mm) exit_aio(mm); ksm_exit(mm); khugepaged_exit(mm); /* must run before exit_mmap */ + kmmscand_exit(mm); exit_mmap(mm); mm_put_huge_zero_folio(mm); set_mm_exe_file(mm, NULL); diff --git a/kernel/sys.c b/kernel/sys.c index cb366ff8703a..0518480d8f78 100644 --- a/kernel/sys.c +++ b/kernel/sys.c @@ -2142,6 +2142,19 @@ static int prctl_set_auxv(struct mm_struct *mm, unsigned long addr, return 0; } +#ifdef CONFIG_KMMSCAND +static int prctl_pte_scan_scale_write(unsigned int scale) +{ + scale = clamp(scale, PR_PTE_A_SCAN_SCALE_MIN, PR_PTE_A_SCAN_SCALE_MAX); + current->mm->pte_scan_scale = scale; + return 0; +} + +static unsigned int prctl_pte_scan_scale_read(void) +{ + return current->mm->pte_scan_scale; +} +#endif static int prctl_set_mm(int opt, unsigned long addr, unsigned long arg4, unsigned long arg5) @@ -2811,6 +2824,18 @@ SYSCALL_DEFINE5(prctl, int, option, unsigned long, arg2, unsigned long, arg3, return -EINVAL; error = arch_lock_shadow_stack_status(me, arg2); break; +#ifdef CONFIG_KMMSCAND + case PR_SET_PTE_A_SCAN_SCALE: + if (arg3 || arg4 || arg5) + return -EINVAL; + error = prctl_pte_scan_scale_write((unsigned int) arg2); + break; + case PR_GET_PTE_A_SCAN_SCALE: + if (arg2 || arg3 || arg4 || arg5) + return -EINVAL; + error = prctl_pte_scan_scale_read(); + break; +#endif default: 
trace_task_prctl_unknown(option, arg2, arg3, arg4, arg5); error = -EINVAL; diff --git a/mm/Kconfig b/mm/Kconfig index 1b501db06417..529bf140e1f7 100644 --- a/mm/Kconfig +++ b/mm/Kconfig @@ -783,6 +783,13 @@ config KSM until a program has madvised that an area is MADV_MERGEABLE, and root has set /sys/kernel/mm/ksm/run to 1 (if CONFIG_SYSFS is set). +config KMMSCAND + bool "Enable PTE A bit scanning and Migration" + depends on NUMA_BALANCING + help + Enable PTE A bit scanning of page. CXL pages accessed are migrated to + regular NUMA node (node 0 - default). + config DEFAULT_MMAP_MIN_ADDR int "Low address space to protect from user allocation" depends on MMU diff --git a/mm/Makefile b/mm/Makefile index 850386a67b3e..45e2f8cc8fd6 100644 --- a/mm/Makefile +++ b/mm/Makefile @@ -94,6 +94,7 @@ obj-$(CONFIG_FAIL_PAGE_ALLOC) += fail_page_alloc.o obj-$(CONFIG_MEMTEST) += memtest.o obj-$(CONFIG_MIGRATION) += migrate.o obj-$(CONFIG_NUMA) += memory-tiers.o +obj-$(CONFIG_KMMSCAND) += kmmscand.o obj-$(CONFIG_DEVICE_MIGRATION) += migrate_device.o obj-$(CONFIG_TRANSPARENT_HUGEPAGE) += huge_memory.o khugepaged.o obj-$(CONFIG_PAGE_COUNTER) += page_counter.o diff --git a/mm/kmmscand.c b/mm/kmmscand.c new file mode 100644 index 000000000000..2fc1b46cf512 --- /dev/null +++ b/mm/kmmscand.c @@ -0,0 +1,1505 @@ +// SPDX-License-Identifier: GPL-2.0 +#include <linux/mm.h> +#include <linux/mm_types.h> +#include <linux/sched.h> +#include <linux/sched/mm.h> +#include <linux/mmu_notifier.h> +#include <linux/migrate.h> +#include <linux/rmap.h> +#include <linux/pagewalk.h> +#include <linux/page_ext.h> +#include <linux/page_idle.h> +#include <linux/page_table_check.h> +#include <linux/pagemap.h> +#include <linux/swap.h> +#include <linux/mm_inline.h> +#include <linux/kthread.h> +#include <linux/kmmscand.h> +#include <linux/memory-tiers.h> +#include <linux/mempolicy.h> +#include <linux/string.h> +#include <linux/cleanup.h> +#include <linux/minmax.h> +#include <linux/delay.h> +#include 
<trace/events/kmem.h> + +#include <asm/pgalloc.h> +#include "internal.h" +#include "mm_slot.h" + +static struct task_struct *kmmscand_thread __read_mostly; +static DEFINE_MUTEX(kmmscand_mutex); +extern unsigned int sysctl_numa_balancing_scan_delay; + +/* + * Total VMA size to cover during scan. + * Min: 256MB default: 1GB max: 4GB + */ +#define KMMSCAND_SCAN_SIZE_MIN (256 * 1024 * 1024UL) +#define KMMSCAND_SCAN_SIZE_MAX (8 * 1024 * 1024 * 1024UL) +#define KMMSCAND_SCAN_SIZE (2 * 1024 * 1024 * 1024UL) + +static unsigned long kmmscand_scan_size __read_mostly = KMMSCAND_SCAN_SIZE; + +/* + * Scan period for each mm. + * Min: 400ms default: 2sec Max: 5sec + */ +#define KMMSCAND_SCAN_PERIOD_MAX 5000U +#define KMMSCAND_SCAN_PERIOD_MIN 400U +#define KMMSCAND_SCAN_PERIOD 2000U + +static unsigned int kmmscand_mm_scan_period_ms __read_mostly = KMMSCAND_SCAN_PERIOD; + +/* How long to pause between two scan and migration cycle */ +static unsigned int kmmscand_scan_sleep_ms __read_mostly = 16; + +/* Max number of mms to scan in one scan and migration cycle */ +#define KMMSCAND_MMS_TO_SCAN (4 * 1024UL) +static unsigned long kmmscand_mms_to_scan __read_mostly = KMMSCAND_MMS_TO_SCAN; + +volatile bool kmmscand_scan_enabled = true; +static bool need_wakeup; +static bool migrated_need_wakeup; + +/* How long to pause between two migration cycles */ +static unsigned int kmmmigrate_sleep_ms __read_mostly = 20; + +static struct task_struct *kmmmigrated_thread __read_mostly; +static DEFINE_MUTEX(kmmmigrated_mutex); +static DECLARE_WAIT_QUEUE_HEAD(kmmmigrated_wait); +static unsigned long kmmmigrated_sleep_expire; + +/* mm of the migrating folio entry */ +static struct mm_struct *kmmscand_cur_migrate_mm; + +/* Migration list is manipulated underneath because of mm_exit */ +static bool kmmscand_migration_list_dirty; + +static unsigned long kmmscand_sleep_expire; +#define KMMSCAND_DEFAULT_TARGET_NODE (0) +static int kmmscand_target_node = KMMSCAND_DEFAULT_TARGET_NODE; + +static 
DEFINE_SPINLOCK(kmmscand_mm_lock); +static DEFINE_SPINLOCK(kmmscand_migrate_lock); +static DECLARE_WAIT_QUEUE_HEAD(kmmscand_wait); + +#define KMMSCAND_SLOT_HASH_BITS 10 +static DEFINE_READ_MOSTLY_HASHTABLE(kmmscand_slots_hash, KMMSCAND_SLOT_HASH_BITS); + +static struct kmem_cache *kmmscand_slot_cache __read_mostly; + +struct kmmscand_nodeinfo { + unsigned long nr_scanned; + unsigned long nr_accessed; + int node; + bool is_toptier; +}; + +struct kmmscand_mm_slot { + struct mm_slot slot; + /* Unit: ms. Determines how often mm scan should happen. */ + unsigned int scan_period; + unsigned long next_scan; + /* Tracks how many useful pages were obtained for migration in the last scan */ + unsigned long scan_delta; + /* Determines how much VMA address space is to be covered in the scan */ + unsigned long scan_size; + long address; + volatile bool is_scanned; + int target_node; +}; + +struct kmmscand_scan { + struct list_head mm_head; + struct kmmscand_mm_slot *mm_slot; +}; + +struct kmmscand_scan kmmscand_scan = { + .mm_head = LIST_HEAD_INIT(kmmscand_scan.mm_head), +}; + +struct kmmscand_scanctrl { + struct list_head scan_list; + struct kmmscand_nodeinfo *nodeinfo[MAX_NUMNODES]; +}; + +struct kmmscand_scanctrl kmmscand_scanctrl; + +struct kmmscand_migrate_list { + struct list_head migrate_head; +}; + +struct kmmscand_migrate_list kmmscand_migrate_list = { + .migrate_head = LIST_HEAD_INIT(kmmscand_migrate_list.migrate_head), +}; + +struct kmmscand_migrate_info { + struct list_head migrate_node; + struct mm_struct *mm; + struct folio *folio; + unsigned long address; +}; + +#ifdef CONFIG_SYSFS +static ssize_t scan_sleep_ms_show(struct kobject *kobj, + struct kobj_attribute *attr, + char *buf) +{ + return sysfs_emit(buf, "%u\n", kmmscand_scan_sleep_ms); +} + +static ssize_t scan_sleep_ms_store(struct kobject *kobj, + struct kobj_attribute *attr, + const char *buf, size_t count) +{ + unsigned int msecs; + int err; + + err = kstrtouint(buf, 10, &msecs); + if (err) + return -EINVAL;
+ + kmmscand_scan_sleep_ms = msecs; + kmmscand_sleep_expire = 0; + wake_up_interruptible(&kmmscand_wait); + + return count; +} +static struct kobj_attribute scan_sleep_ms_attr = + __ATTR_RW(scan_sleep_ms); + +static ssize_t mm_scan_period_ms_show(struct kobject *kobj, + struct kobj_attribute *attr, + char *buf) +{ + return sysfs_emit(buf, "%u\n", kmmscand_mm_scan_period_ms); +} + +/* If a value less than MIN or greater than MAX asked for store value is clamped */ +static ssize_t mm_scan_period_ms_store(struct kobject *kobj, + struct kobj_attribute *attr, + const char *buf, size_t count) +{ + unsigned int msecs, stored_msecs; + int err; + + err = kstrtouint(buf, 10, &msecs); + if (err) + return -EINVAL; + + stored_msecs = clamp(msecs, KMMSCAND_SCAN_PERIOD_MIN, KMMSCAND_SCAN_PERIOD_MAX); + + kmmscand_mm_scan_period_ms = stored_msecs; + kmmscand_sleep_expire = 0; + wake_up_interruptible(&kmmscand_wait); + + return count; +} + +static struct kobj_attribute mm_scan_period_ms_attr = + __ATTR_RW(mm_scan_period_ms); + +static ssize_t mms_to_scan_show(struct kobject *kobj, + struct kobj_attribute *attr, + char *buf) +{ + return sysfs_emit(buf, "%lu\n", kmmscand_mms_to_scan); +} + +static ssize_t mms_to_scan_store(struct kobject *kobj, + struct kobj_attribute *attr, + const char *buf, size_t count) +{ + unsigned long val; + int err; + + err = kstrtoul(buf, 10, &val); + if (err) + return -EINVAL; + + kmmscand_mms_to_scan = val; + kmmscand_sleep_expire = 0; + wake_up_interruptible(&kmmscand_wait); + + return count; +} + +static struct kobj_attribute mms_to_scan_attr = + __ATTR_RW(mms_to_scan); + +static ssize_t scan_enabled_show(struct kobject *kobj, + struct kobj_attribute *attr, + char *buf) +{ + return sysfs_emit(buf, "%u\n", kmmscand_scan_enabled ? 
1 : 0); +} + +static ssize_t scan_enabled_store(struct kobject *kobj, + struct kobj_attribute *attr, + const char *buf, size_t count) +{ + unsigned int val; + int err; + + err = kstrtouint(buf, 10, &val); + if (err || val > 1) + return -EINVAL; + + if (val) { + kmmscand_scan_enabled = true; + need_wakeup = true; + } else + kmmscand_scan_enabled = false; + + kmmscand_sleep_expire = 0; + wake_up_interruptible(&kmmscand_wait); + + return count; +} + +static struct kobj_attribute scan_enabled_attr = + __ATTR_RW(scan_enabled); + +static ssize_t target_node_show(struct kobject *kobj, + struct kobj_attribute *attr, + char *buf) +{ + return sysfs_emit(buf, "%u\n", kmmscand_target_node); +} + +static ssize_t target_node_store(struct kobject *kobj, + struct kobj_attribute *attr, + const char *buf, size_t count) +{ + int err, node; + + err = kstrtoint(buf, 10, &node); + if (err) + return -EINVAL; + + kmmscand_sleep_expire = 0; + if (!node_is_toptier(node)) + return -EINVAL; + + kmmscand_target_node = node; + wake_up_interruptible(&kmmscand_wait); + + return count; +} +static struct kobj_attribute target_node_attr = + __ATTR_RW(target_node); + +static struct attribute *kmmscand_attr[] = { + &scan_sleep_ms_attr.attr, + &mm_scan_period_ms_attr.attr, + &mms_to_scan_attr.attr, + &scan_enabled_attr.attr, + &target_node_attr.attr, + NULL, +}; + +struct attribute_group kmmscand_attr_group = { + .attrs = kmmscand_attr, + .name = "kmmscand", +}; +#endif + +void count_kmmscand_mm_scans(void) +{ + count_vm_numa_event(KMMSCAND_MM_SCANS); +} +void count_kmmscand_vma_scans(void) +{ + count_vm_numa_event(KMMSCAND_VMA_SCANS); +} +void count_kmmscand_migadded(void) +{ + count_vm_numa_event(KMMSCAND_MIGADDED); +} +void count_kmmscand_migrated(void) +{ + count_vm_numa_event(KMMSCAND_MIGRATED); +} +void count_kmmscand_migrate_failed(void) +{ + count_vm_numa_event(KMMSCAND_MIGRATE_FAILED); +} +void count_kmmscand_kzalloc_fail(void) +{ + count_vm_numa_event(KMMSCAND_KZALLOC_FAIL); +} +void 
count_kmmscand_slowtier(void) +{ + count_vm_numa_event(KMMSCAND_SLOWTIER); +} +void count_kmmscand_toptier(void) +{ + count_vm_numa_event(KMMSCAND_TOPTIER); +} +void count_kmmscand_idlepage(void) +{ + count_vm_numa_event(KMMSCAND_IDLEPAGE); +} + +static int kmmscand_has_work(void) +{ + return !list_empty(&kmmscand_scan.mm_head); +} + +static int kmmmigrated_has_work(void) +{ + if (!list_empty(&kmmscand_migrate_list.migrate_head)) + return true; + return false; +} + +static bool kmmscand_should_wakeup(void) +{ + bool wakeup = kthread_should_stop() || need_wakeup || + time_after_eq(jiffies, kmmscand_sleep_expire); + if (need_wakeup) + need_wakeup = false; + + return wakeup; +} + +static bool kmmmigrated_should_wakeup(void) +{ + bool wakeup = kthread_should_stop() || migrated_need_wakeup || + time_after_eq(jiffies, kmmmigrated_sleep_expire); + if (migrated_need_wakeup) + migrated_need_wakeup = false; + + return wakeup; +} + +static void kmmscand_wait_work(void) +{ + const unsigned long scan_sleep_jiffies = + msecs_to_jiffies(kmmscand_scan_sleep_ms); + + if (!scan_sleep_jiffies) + return; + + kmmscand_sleep_expire = jiffies + scan_sleep_jiffies; + wait_event_timeout(kmmscand_wait, + kmmscand_should_wakeup(), + scan_sleep_jiffies); + return; +} + +static void kmmmigrated_wait_work(void) +{ + const unsigned long migrate_sleep_jiffies = + msecs_to_jiffies(kmmmigrate_sleep_ms); + + if (!migrate_sleep_jiffies) + return; + + kmmmigrated_sleep_expire = jiffies + migrate_sleep_jiffies; + wait_event_timeout(kmmmigrated_wait, + kmmmigrated_should_wakeup(), + migrate_sleep_jiffies); + return; +} + +static unsigned long get_slowtier_accesed(struct kmmscand_scanctrl *scanctrl) +{ + int node; + unsigned long accessed = 0; + + for_each_node_state(node, N_MEMORY) { + if (!node_is_toptier(node)) + accessed += (scanctrl->nodeinfo[node])->nr_accessed; + } + return accessed; +} + +static inline unsigned long get_nodeinfo_nr_accessed(struct kmmscand_nodeinfo *ni) +{ + return 
ni->nr_accessed; +} + +static inline void set_nodeinfo_nr_accessed(struct kmmscand_nodeinfo *ni, unsigned long val) +{ + ni->nr_accessed = val; +} + +static inline void reset_nodeinfo_nr_accessed(struct kmmscand_nodeinfo *ni) +{ + set_nodeinfo_nr_accessed(ni, 0); +} + +static inline void nodeinfo_nr_accessed_inc(struct kmmscand_nodeinfo *ni) +{ + ni->nr_accessed++; +} + +static inline unsigned long get_nodeinfo_nr_scanned(struct kmmscand_nodeinfo *ni) +{ + return ni->nr_scanned; +} + +static inline void set_nodeinfo_nr_scanned(struct kmmscand_nodeinfo *ni, unsigned long val) +{ + ni->nr_scanned = val; +} + +static inline void reset_nodeinfo_nr_scanned(struct kmmscand_nodeinfo *ni) +{ + set_nodeinfo_nr_scanned(ni, 0); +} + +static inline void nodeinfo_nr_scanned_inc(struct kmmscand_nodeinfo *ni) +{ + ni->nr_scanned++; +} + + +static inline void reset_nodeinfo(struct kmmscand_nodeinfo *ni) +{ + set_nodeinfo_nr_scanned(ni, 0); + set_nodeinfo_nr_accessed(ni, 0); +} + +static void init_one_nodeinfo(struct kmmscand_nodeinfo *ni, int node) +{ + ni->nr_scanned = 0; + ni->nr_accessed = 0; + ni->node = node; + ni->is_toptier = node_is_toptier(node) ? 
true : false; +} + +static struct kmmscand_nodeinfo *alloc_one_nodeinfo(int node) +{ + struct kmmscand_nodeinfo *ni; + + ni = kzalloc(sizeof(*ni), GFP_KERNEL); + + if (!ni) + return NULL; + + init_one_nodeinfo(ni, node); + + return ni; +} + +/* TBD: Handle errors */ +static void init_scanctrl(struct kmmscand_scanctrl *scanctrl) +{ + struct kmmscand_nodeinfo *ni; + int node; + for_each_node_state(node, N_MEMORY) { + ni = alloc_one_nodeinfo(node); + WARN_ON_ONCE(!ni); + scanctrl->nodeinfo[node] = ni; + } + pr_warn("scan ctrl init %d\n", node); +} + +static void reset_scanctrl(struct kmmscand_scanctrl *scanctrl) +{ + int node; + for_each_node_state(node, N_MEMORY) + reset_nodeinfo(scanctrl->nodeinfo[node]); +} + +static bool kmmscand_eligible_srcnid(int nid) +{ + if (!node_is_toptier(nid)) + return true; + + return false; +} + +/* + * Do not know what info to pass in the future to make + * decision on target node. Keep it void * for now. + */ +static int kmmscand_get_target_node(void *data) +{ + return kmmscand_target_node; +} + +static int get_target_node(struct kmmscand_scanctrl *scanctrl) +{ + int node, target_node = -9999; + unsigned long old_accessed = 0; + + for_each_node(node) { + if (get_nodeinfo_nr_scanned(scanctrl->nodeinfo[node]) > old_accessed + && node_is_toptier(node)) { + old_accessed = get_nodeinfo_nr_scanned(scanctrl->nodeinfo[node]); + target_node = node; + } + } + if (target_node == -9999) + target_node = kmmscand_get_target_node(NULL); + + return target_node; +} + +extern bool migrate_balanced_pgdat(struct pglist_data *pgdat, + unsigned long nr_migrate_pages); + +/* XXX: Taken from migrate.c to avoid NUMAB mode=2 and NULL vma checks */ +static int kmmscand_migrate_misplaced_folio_prepare(struct folio *folio, + struct vm_area_struct *vma, int node) +{ + int nr_pages = folio_nr_pages(folio); + pg_data_t *pgdat = NODE_DATA(node); + + if (folio_is_file_lru(folio)) { + /* + * Do not migrate file folios that are mapped in multiple + * processes with
execute permissions as they are probably + * shared libraries. + * + * See folio_likely_mapped_shared() on possible imprecision + * when we cannot easily detect if a folio is shared. + */ + if (vma && (vma->vm_flags & VM_EXEC) && + folio_likely_mapped_shared(folio)) + return -EACCES; + /* + * Do not migrate dirty folios as not all filesystems can move + * dirty folios in MIGRATE_ASYNC mode which is a waste of + * cycles. + */ + if (folio_test_dirty(folio)) + return -EAGAIN; + } + + /* Avoid migrating to a node that is nearly full */ + if (!migrate_balanced_pgdat(pgdat, nr_pages)) { + int z; + + for (z = pgdat->nr_zones - 1; z >= 0; z--) { + if (managed_zone(pgdat->node_zones + z)) + break; + } + + /* + * If there are no managed zones, it should not proceed + * further. + */ + if (z < 0) + return -EAGAIN; + + wakeup_kswapd(pgdat->node_zones + z, 0, + folio_order(folio), ZONE_MOVABLE); + return -EAGAIN; + } + + if (!folio_isolate_lru(folio)) + return -EAGAIN; + + node_stat_mod_folio(folio, NR_ISOLATED_ANON + folio_is_file_lru(folio), + nr_pages); + + return 0; +} + +enum kmmscand_migration_err { + KMMSCAND_NULL_MM = 1, + KMMSCAND_EXITING_MM, + KMMSCAND_INVALID_FOLIO, + KMMSCAND_INVALID_VMA, + KMMSCAND_INELIGIBLE_SRC_NODE, + KMMSCAND_SAME_SRC_DEST_NODE, + KMMSCAND_PTE_NOT_PRESENT, + KMMSCAND_PMD_NOT_PRESENT, + KMMSCAND_NO_PTE_OFFSET_MAP_LOCK, + KMMSCAND_LRU_ISOLATION_ERR, +}; + +static int kmmscand_promote_folio(struct kmmscand_migrate_info *info, int destnid) +{ + unsigned long pfn; + unsigned long address; + struct page *page = NULL; + struct folio *folio; + int ret; + struct mm_struct *mm; + pmd_t *pmd; + pte_t *pte; + spinlock_t *ptl; + pmd_t pmde; + int srcnid; + + if (info->mm == NULL) + return KMMSCAND_NULL_MM; + + if (info->mm == READ_ONCE(kmmscand_cur_migrate_mm) && + READ_ONCE(kmmscand_migration_list_dirty)) { + return KMMSCAND_EXITING_MM; + } + + mm = info->mm; + + folio = info->folio; + + /* Check again if the folio is really valid now */ + if
(folio) { + pfn = folio_pfn(folio); + page = pfn_to_online_page(pfn); + } + + if (!page || PageTail(page) || !folio || !folio_test_lru(folio) || folio_test_unevictable(folio) || + folio_is_zone_device(folio) || !folio_mapped(folio) || folio_likely_mapped_shared(folio)) + return KMMSCAND_INVALID_FOLIO; + + folio_get(folio); + + srcnid = folio_nid(folio); + + /* Do not try to promote pages from regular nodes */ + if (!kmmscand_eligible_srcnid(srcnid)) { + folio_put(folio); + return KMMSCAND_INELIGIBLE_SRC_NODE; + } + + + if (srcnid == destnid) { + folio_put(folio); + return KMMSCAND_SAME_SRC_DEST_NODE; + } + address = info->address; + pmd = pmd_off(mm, address); + pmde = pmdp_get(pmd); + + if (!pmd_present(pmde)) { + folio_put(folio); + return KMMSCAND_PMD_NOT_PRESENT; + } + pte = pte_offset_map_lock(mm, pmd, address, &ptl); + + if (!pte) { + folio_put(folio); + WARN_ON_ONCE(!pte); + return KMMSCAND_NO_PTE_OFFSET_MAP_LOCK; + } + + + ret = kmmscand_migrate_misplaced_folio_prepare(folio, NULL, destnid); + if (ret) { + folio_put(folio); + pte_unmap_unlock(pte, ptl); + return KMMSCAND_LRU_ISOLATION_ERR; + } + + folio_put(folio); + pte_unmap_unlock(pte, ptl); + + return migrate_misplaced_folio(folio, destnid); +} + +static bool folio_idle_clear_pte_refs_one(struct folio *folio, + struct vm_area_struct *vma, + unsigned long addr, + pte_t *ptep) +{ + bool referenced = false; + struct mm_struct *mm = vma->vm_mm; + pmd_t *pmd = pmd_off(mm, addr); + + if (ptep) { + if (ptep_clear_young_notify(vma, addr, ptep)) + referenced = true; + } else if (IS_ENABLED(CONFIG_TRANSPARENT_HUGEPAGE)) { + if (!pmd_present(*pmd)) + WARN_ON_ONCE(1); + if (pmdp_clear_young_notify(vma, addr, pmd)) + referenced = true; + } else { + WARN_ON_ONCE(1); + } + + if (referenced) { + folio_clear_idle(folio); + folio_set_young(folio); + } + + return true; +} + +static void page_idle_clear_pte_refs(struct page *page, pte_t *pte, struct mm_walk *walk) +{ + bool need_lock; + struct folio *folio = 
page_folio(page); + unsigned long address; + + if (!folio_mapped(folio) || !folio_raw_mapping(folio)) + return; + + need_lock = !folio_test_anon(folio) || folio_test_ksm(folio); + if (need_lock && !folio_trylock(folio)) + return; + address = vma_address(walk->vma, page_pgoff(folio, page), compound_nr(page)); + VM_BUG_ON_VMA(address == -EFAULT, walk->vma); + folio_idle_clear_pte_refs_one(folio, walk->vma, address, pte); + + if (need_lock) + folio_unlock(folio); +} + +static int hot_vma_idle_pte_entry(pte_t *pte, + unsigned long addr, + unsigned long next, + struct mm_walk *walk) +{ + struct page *page; + struct folio *folio; + struct mm_struct *mm; + struct vm_area_struct *vma; + struct kmmscand_migrate_info *info; + struct kmmscand_scanctrl *scanctrl = walk->private; + + int srcnid; + + pte_t pteval = ptep_get(pte); + + if (!pte_present(pteval)) + return 1; + + if (pte_none(pteval)) + return 1; + + vma = walk->vma; + mm = vma->vm_mm; + + page = pte_page(*pte); + + page_idle_clear_pte_refs(page, pte, walk); + + folio = page_folio(page); + folio_get(folio); + + if (!folio || folio_is_zone_device(folio) || folio_test_unevictable(folio) + || !folio_mapped(folio) || folio_likely_mapped_shared(folio)) { + folio_put(folio); + return 1; + } + + srcnid = folio_nid(folio); + + if (node_is_toptier(srcnid)) { + scanctrl->nodeinfo[srcnid]->nr_scanned++; + count_kmmscand_toptier(); + } + + if (!folio_test_idle(folio) || folio_test_young(folio) || + mmu_notifier_test_young(mm, addr) || + folio_test_referenced(folio) || pte_young(pteval)) { + + /* TBD: Use helpers */ + scanctrl->nodeinfo[srcnid]->nr_accessed++; + + /* Do not try to promote pages from regular nodes */ + if (!kmmscand_eligible_srcnid(srcnid)) + goto end; + + info = kzalloc(sizeof(struct kmmscand_migrate_info), GFP_NOWAIT); + if (info && scanctrl) { + + count_kmmscand_slowtier(); + info->mm = mm; + info->address = addr; + info->folio = folio; + + /* No need of lock now */ + list_add_tail(&info->migrate_node, 
&scanctrl->scan_list); + + count_kmmscand_migadded(); + } + } else + count_kmmscand_idlepage(); +end: + folio_set_idle(folio); + folio_put(folio); + return 0; +} + +static const struct mm_walk_ops hot_vma_set_idle_ops = { + .pte_entry = hot_vma_idle_pte_entry, + .walk_lock = PGWALK_RDLOCK, +}; + +static void kmmscand_walk_page_vma(struct vm_area_struct *vma, struct kmmscand_scanctrl *scanctrl) +{ + if (!vma_migratable(vma) || !vma_policy_mof(vma) || + is_vm_hugetlb_page(vma) || (vma->vm_flags & VM_MIXEDMAP)) { + return; + } + if (!vma->vm_mm || + (vma->vm_file && (vma->vm_flags & (VM_READ|VM_WRITE)) == (VM_READ))) + return; + + if (!vma_is_accessible(vma)) + return; + + walk_page_vma(vma, &hot_vma_set_idle_ops, scanctrl); +} + +static inline int kmmscand_test_exit(struct mm_struct *mm) +{ + return atomic_read(&mm->mm_users) == 0; +} + +static void kmmscand_cleanup_migration_list(struct mm_struct *mm) +{ + struct kmmscand_migrate_info *info, *tmp; + + spin_lock(&kmmscand_migrate_lock); + if (!list_empty(&kmmscand_migrate_list.migrate_head)) { + if (mm == READ_ONCE(kmmscand_cur_migrate_mm)) { + /* A folio in this mm is being migrated. 
wait */ + WRITE_ONCE(kmmscand_migration_list_dirty, true); + } + + list_for_each_entry_safe(info, tmp, &kmmscand_migrate_list.migrate_head, + migrate_node) { + if (info && (info->mm == mm)) { + info->mm = NULL; + WRITE_ONCE(kmmscand_migration_list_dirty, true); + } + } + } + spin_unlock(&kmmscand_migrate_lock); +} + +static void kmmscand_collect_mm_slot(struct kmmscand_mm_slot *mm_slot) +{ + struct mm_slot *slot = &mm_slot->slot; + struct mm_struct *mm = slot->mm; + + lockdep_assert_held(&kmmscand_mm_lock); + + if (kmmscand_test_exit(mm)) { + /* free mm_slot */ + hash_del(&slot->hash); + list_del(&slot->mm_node); + + kmmscand_cleanup_migration_list(mm); + + mm_slot_free(kmmscand_slot_cache, mm_slot); + mmdrop(mm); + } else { + WARN_ON_ONCE(mm_slot); + mm_slot->is_scanned = false; + } +} + +static void kmmscand_migrate_folio(void) +{ + int ret = 0, dest = -1; + struct mm_struct *oldmm = NULL; + struct kmmscand_migrate_info *info, *tmp; + + spin_lock(&kmmscand_migrate_lock); + + if (!list_empty(&kmmscand_migrate_list.migrate_head)) { + list_for_each_entry_safe(info, tmp, &kmmscand_migrate_list.migrate_head, + migrate_node) { + if (READ_ONCE(kmmscand_migration_list_dirty)) { + kmmscand_migration_list_dirty = false; + list_del(&info->migrate_node); + /* + * Do not try to migrate this entry because mm might have + * vanished underneath. 
+ */ + kfree(info); + spin_unlock(&kmmscand_migrate_lock); + goto dirty_list_handled; + } + + list_del(&info->migrate_node); + /* Note down the mm of folio entry we are migrating */ + WRITE_ONCE(kmmscand_cur_migrate_mm, info->mm); + spin_unlock(&kmmscand_migrate_lock); + + if (info->mm) { + if (oldmm != info->mm) { + if(!mmap_read_trylock(info->mm)) { + dest = kmmscand_get_target_node(NULL); + } else { + dest = READ_ONCE(info->mm->target_node); + mmap_read_unlock(info->mm); + } + oldmm = info->mm; + } + + ret = kmmscand_promote_folio(info, dest); + trace_kmem_scan_mm_migrate(info->mm, ret, dest); + } + + /* TBD: encode migrated count here, currently assume folio_nr_pages */ + if (!ret) + count_kmmscand_migrated(); + else + count_kmmscand_migrate_failed(); + + kfree(info); + + spin_lock(&kmmscand_migrate_lock); + /* Reset mm of folio entry we are migrating */ + WRITE_ONCE(kmmscand_cur_migrate_mm, NULL); + spin_unlock(&kmmscand_migrate_lock); +dirty_list_handled: + cond_resched(); + spin_lock(&kmmscand_migrate_lock); + } + } + spin_unlock(&kmmscand_migrate_lock); +} + +/* + * This is the normal change percentage when old and new delta remain same. + * i.e., either both positive or both zero. + */ +#define SCAN_PERIOD_TUNE_PERCENT 15 + +/* This is to change the scan_period aggressively when deltas are different */ +#define SCAN_PERIOD_CHANGE_SCALE 3 +/* + * XXX: Hack to prevent unmigrated pages coming again and again while scanning. + * Actual fix needs to identify the type of unmigrated pages OR consider migration + * failures in next scan. + */ +#define KMMSCAND_IGNORE_SCAN_THR 100 + +#define SCAN_SIZE_CHANGE_SCALE 1 +/* + * X : Number of useful pages in the last scan. + * Y : Number of useful pages found in current scan. + * Tuning scan_period: + * Initial scan_period is 2s. + * case 1: (X = 0, Y = 0) + * Increase scan_period by SCAN_PERIOD_TUNE_PERCENT. + * case 2: (X = 0, Y > 0) + * Decrease scan_period by (2 << SCAN_PERIOD_CHANGE_SCALE). 
+ * case 3: (X > 0, Y = 0 ) + * Increase scan_period by (2 << SCAN_PERIOD_CHANGE_SCALE). + * case 4: (X > 0, Y > 0) + * Decrease scan_period by SCAN_PERIOD_TUNE_PERCENT. + * Tuning scan_size: + * Initial scan_size is 4GB + * case 1: (X = 0, Y = 0) + * Decrease scan_size by (1 << SCAN_SIZE_CHANGE_SCALE). + * case 2: (X = 0, Y > 0) + * scan_size = KMMSCAND_SCAN_SIZE_MAX + * case 3: (X > 0, Y = 0 ) + * No change + * case 4: (X > 0, Y > 0) + * Increase scan_size by (1 << SCAN_SIZE_CHANGE_SCALE). + */ +static inline void kmmscand_update_mmslot_info(struct kmmscand_mm_slot *mm_slot, + unsigned long total, int target_node) +{ + unsigned int scan_period; + unsigned long now; + unsigned long scan_size; + unsigned long old_scan_delta; + + /* XXX: Hack to get rid of continuously failing/unmigrateable pages */ + if (total < KMMSCAND_IGNORE_SCAN_THR) + total = 0; + + scan_period = mm_slot->scan_period; + scan_size = mm_slot->scan_size; + + old_scan_delta = mm_slot->scan_delta; + + /* + * case 1: old_scan_delta and new delta are similar, (slow) TUNE_PERCENT used. + * case 2: old_scan_delta and new delta are different. (fast) CHANGE_SCALE used. + * TBD: + * 1. Further tune scan_period based on delta between last and current scan delta. + * 2. 
Optimize calculation + */ + if (!old_scan_delta && !total) { + scan_period = (100 + SCAN_PERIOD_TUNE_PERCENT) * scan_period; + scan_period /= 100; + scan_size = scan_size >> SCAN_SIZE_CHANGE_SCALE; + } else if (old_scan_delta && total) { + scan_period = (100 - SCAN_PERIOD_TUNE_PERCENT) * scan_period; + scan_period /= 100; + scan_size = scan_size << SCAN_SIZE_CHANGE_SCALE; + } else if (old_scan_delta && !total) { + scan_period = scan_period << SCAN_PERIOD_CHANGE_SCALE; + } else { + scan_period = scan_period >> SCAN_PERIOD_CHANGE_SCALE; + scan_size = KMMSCAND_SCAN_SIZE_MAX; + } + + scan_period = clamp(scan_period, KMMSCAND_SCAN_PERIOD_MIN, KMMSCAND_SCAN_PERIOD_MAX); + scan_size = clamp(scan_size, KMMSCAND_SCAN_SIZE_MIN, KMMSCAND_SCAN_SIZE_MAX); + + now = jiffies; + mm_slot->next_scan = now + msecs_to_jiffies(scan_period); + mm_slot->scan_period = scan_period; + mm_slot->scan_size = scan_size; + mm_slot->scan_delta = total; + mm_slot->target_node = target_node; +} + +static unsigned long kmmscand_scan_mm_slot(void) +{ + bool next_mm = false; + bool update_mmslot_info = false; + + unsigned int mm_slot_scan_period; + int target_node, mm_slot_target_node, mm_target_node; + unsigned long now; + unsigned long mm_slot_next_scan; + unsigned long mm_slot_scan_size; + unsigned long scanned_size = 0; + unsigned long address; + unsigned long total = 0; + + struct mm_slot *slot; + struct mm_struct *mm; + struct vma_iterator vmi; + struct vm_area_struct *vma = NULL; + struct kmmscand_mm_slot *mm_slot; + + /* Retrieve mm */ + spin_lock(&kmmscand_mm_lock); + + if (kmmscand_scan.mm_slot) { + mm_slot = kmmscand_scan.mm_slot; + slot = &mm_slot->slot; + address = mm_slot->address; + } else { + slot = list_entry(kmmscand_scan.mm_head.next, + struct mm_slot, mm_node); + mm_slot = mm_slot_entry(slot, struct kmmscand_mm_slot, slot); + address = mm_slot->address; + kmmscand_scan.mm_slot = mm_slot; + } + + mm = slot->mm; + mm_slot->is_scanned = true; + mm_slot_next_scan = mm_slot->next_scan; 
+ mm_slot_scan_period = mm_slot->scan_period; + mm_slot_scan_size = mm_slot->scan_size; + mm_slot_target_node = mm_slot->target_node; + spin_unlock(&kmmscand_mm_lock); + + if (unlikely(!mmap_read_trylock(mm))) + goto outerloop_mmap_lock; + + if (unlikely(kmmscand_test_exit(mm))) { + next_mm = true; + goto outerloop; + } + + if (!mm->pte_scan_scale) { + next_mm = true; + goto outerloop; + } + + mm_target_node = READ_ONCE(mm->target_node); + /* XXX: Do we need write lock? */ + if (mm_target_node != mm_slot_target_node) + WRITE_ONCE(mm->target_node, mm_slot_target_node); + + trace_kmem_scan_mm_start(mm); + + now = jiffies; + + if (mm_slot_next_scan && time_before(now, mm_slot_next_scan)) + goto outerloop; + + vma_iter_init(&vmi, mm, address); + vma = vma_next(&vmi); + if (!vma) { + address = 0; + vma_iter_set(&vmi, address); + vma = vma_next(&vmi); + } + + for_each_vma(vmi, vma) { + /* Count the scanned pages here to decide exit */ + kmmscand_walk_page_vma(vma, &kmmscand_scanctrl); + count_kmmscand_vma_scans(); + scanned_size += vma->vm_end - vma->vm_start; + address = vma->vm_end; + + if (scanned_size >= mm_slot_scan_size) { + total = get_slowtier_accesed(&kmmscand_scanctrl); + + /* If we had got accessed pages, ignore the current scan_size threshold */ + if (total > KMMSCAND_IGNORE_SCAN_THR) { + mm_slot_scan_size = KMMSCAND_SCAN_SIZE_MAX; + continue; + } + next_mm = true; + break; + } + + /* Add scanned folios to migration list */ + spin_lock(&kmmscand_migrate_lock); + list_splice_tail_init(&kmmscand_scanctrl.scan_list, &kmmscand_migrate_list.migrate_head); + spin_unlock(&kmmscand_migrate_lock); + } + + if (!vma) + address = 0; + + update_mmslot_info = true; + + count_kmmscand_mm_scans(); + + total = get_slowtier_accesed(&kmmscand_scanctrl); + target_node = get_target_node(&kmmscand_scanctrl); + + mm_target_node = READ_ONCE(mm->target_node); + + /* XXX: Do we need write lock? 
*/ + if (mm_target_node != target_node) + WRITE_ONCE(mm->target_node, target_node); + + reset_scanctrl(&kmmscand_scanctrl); + + if (update_mmslot_info) { + mm_slot->address = address; + kmmscand_update_mmslot_info(mm_slot, total, target_node); + } + + trace_kmem_scan_mm_end(mm, address, total, mm_slot_scan_period, + mm_slot_scan_size, target_node); + +outerloop: + /* exit_mmap will destroy ptes after this */ + mmap_read_unlock(mm); + +outerloop_mmap_lock: + spin_lock(&kmmscand_mm_lock); + VM_BUG_ON(kmmscand_scan.mm_slot != mm_slot); + + /* + * Release the current mm_slot if this mm is about to die, or + * if we scanned all vmas of this mm. + */ + if (unlikely(kmmscand_test_exit(mm)) || !vma || next_mm) { + /* + * Make sure that if mm_users is reaching zero while + * kmmscand runs here, kmmscand_exit will find + * mm_slot not pointing to the exiting mm. + */ + WARN_ON_ONCE(current->rcu_read_lock_nesting < 0); + if (slot->mm_node.next != &kmmscand_scan.mm_head) { + slot = list_entry(slot->mm_node.next, + struct mm_slot, mm_node); + kmmscand_scan.mm_slot = + mm_slot_entry(slot, struct kmmscand_mm_slot, slot); + + } else + kmmscand_scan.mm_slot = NULL; + + WARN_ON_ONCE(current->rcu_read_lock_nesting < 0); + if (kmmscand_test_exit(mm)) { + kmmscand_collect_mm_slot(mm_slot); + WARN_ON_ONCE(current->rcu_read_lock_nesting < 0); + goto end; + } + } + mm_slot->is_scanned = false; +end: + spin_unlock(&kmmscand_mm_lock); + return total; +} + +static void kmmscand_do_scan(void) +{ + unsigned long iter = 0, mms_to_scan; + + mms_to_scan = READ_ONCE(kmmscand_mms_to_scan); + + while (true) { + cond_resched(); + + if (unlikely(kthread_should_stop()) || !READ_ONCE(kmmscand_scan_enabled)) + break; + + if (kmmscand_has_work()) + kmmscand_scan_mm_slot(); + + iter++; + if (iter >= mms_to_scan) + break; + } +} + +static int kmmscand(void *none) +{ + for (;;) { + if (unlikely(kthread_should_stop())) + break; + + kmmscand_do_scan(); + + while (!READ_ONCE(kmmscand_scan_enabled)) { + 
cpu_relax(); + kmmscand_wait_work(); + } + + kmmscand_wait_work(); + } + return 0; +} + +#ifdef CONFIG_SYSFS +extern struct kobject *mm_kobj; +static int __init kmmscand_init_sysfs(struct kobject **kobj) +{ + int err; + + err = sysfs_create_group(*kobj, &kmmscand_attr_group); + if (err) { + pr_err("failed to register kmmscand group\n"); + goto err_kmmscand_attr; + } + + return 0; + +err_kmmscand_attr: + sysfs_remove_group(*kobj, &kmmscand_attr_group); + return err; +} + +static void __init kmmscand_exit_sysfs(struct kobject *kobj) +{ + sysfs_remove_group(kobj, &kmmscand_attr_group); +} +#else +static inline int __init kmmscand_init_sysfs(struct kobject **kobj) +{ + return 0; +} +static inline void __init kmmscand_exit_sysfs(struct kobject *kobj) +{ +} +#endif + +static inline void kmmscand_destroy(void) +{ + kmem_cache_destroy(kmmscand_slot_cache); + kmmscand_exit_sysfs(mm_kobj); +} + +void __kmmscand_enter(struct mm_struct *mm) +{ + struct kmmscand_mm_slot *kmmscand_slot; + struct mm_slot *slot; + unsigned long now; + int wakeup; + + /* __kmmscand_exit() must not run from under us */ + VM_BUG_ON_MM(kmmscand_test_exit(mm), mm); + + kmmscand_slot = mm_slot_alloc(kmmscand_slot_cache); + + if (!kmmscand_slot) + return; + + now = jiffies; + kmmscand_slot->address = 0; + kmmscand_slot->scan_period = kmmscand_mm_scan_period_ms; + kmmscand_slot->scan_size = kmmscand_scan_size; + kmmscand_slot->next_scan = now + + msecs_to_jiffies(sysctl_numa_balancing_scan_delay); + kmmscand_slot->scan_delta = 0; + + slot = &kmmscand_slot->slot; + + spin_lock(&kmmscand_mm_lock); + mm_slot_insert(kmmscand_slots_hash, mm, slot); + + wakeup = list_empty(&kmmscand_scan.mm_head); + list_add_tail(&slot->mm_node, &kmmscand_scan.mm_head); + spin_unlock(&kmmscand_mm_lock); + + mmgrab(mm); + trace_kmem_mm_enter(mm); + if (wakeup) + wake_up_interruptible(&kmmscand_wait); +} + +void __kmmscand_exit(struct mm_struct *mm) +{ + struct kmmscand_mm_slot *mm_slot; + struct mm_slot *slot; + int free = 0, 
serialize = 1; + + trace_kmem_mm_exit(mm); + spin_lock(&kmmscand_mm_lock); + slot = mm_slot_lookup(kmmscand_slots_hash, mm); + mm_slot = mm_slot_entry(slot, struct kmmscand_mm_slot, slot); + if (mm_slot && kmmscand_scan.mm_slot != mm_slot) { + hash_del(&slot->hash); + list_del(&slot->mm_node); + free = 1; + } else if (mm_slot && kmmscand_scan.mm_slot == mm_slot && !mm_slot->is_scanned) { + hash_del(&slot->hash); + list_del(&slot->mm_node); + free = 1; + /* TBD: Set the actual next slot */ + kmmscand_scan.mm_slot = NULL; + } else if (mm_slot && kmmscand_scan.mm_slot == mm_slot && mm_slot->is_scanned) { + serialize = 0; + } + + spin_unlock(&kmmscand_mm_lock); + + if (serialize) + kmmscand_cleanup_migration_list(mm); + + if (free) { + mm_slot_free(kmmscand_slot_cache, mm_slot); + mmdrop(mm); + } else if (mm_slot) { + mmap_write_lock(mm); + mmap_write_unlock(mm); + } +} + +static int start_kmmscand(void) +{ + int err = 0; + + guard(mutex)(&kmmscand_mutex); + + /* Some one already succeeded in starting daemon */ + if (kmmscand_thread) + goto end; + + kmmscand_thread = kthread_run(kmmscand, NULL, "kmmscand"); + if (IS_ERR(kmmscand_thread)) { + pr_err("kmmscand: kthread_run(kmmscand) failed\n"); + err = PTR_ERR(kmmscand_thread); + kmmscand_thread = NULL; + goto end; + } else { + pr_info("kmmscand: Successfully started kmmscand"); + } + + if (!list_empty(&kmmscand_scan.mm_head)) + wake_up_interruptible(&kmmscand_wait); + +end: + return err; +} + +static int stop_kmmscand(void) +{ + int err = 0; + + guard(mutex)(&kmmscand_mutex); + + if (kmmscand_thread) { + kthread_stop(kmmscand_thread); + kmmscand_thread = NULL; + } + + return err; +} +static int kmmmigrated(void *arg) +{ + for (;;) { + WRITE_ONCE(migrated_need_wakeup, false); + if (unlikely(kthread_should_stop())) + break; + if (kmmmigrated_has_work()) + kmmscand_migrate_folio(); + msleep(20); + kmmmigrated_wait_work(); + } + return 0; +} + +static int start_kmmmigrated(void) +{ + int err = 0; + + 
guard(mutex)(&kmmmigrated_mutex); + + /* Someone already succeeded in starting daemon */ + if (kmmmigrated_thread) + goto end; + + kmmmigrated_thread = kthread_run(kmmmigrated, NULL, "kmmmigrated"); + if (IS_ERR(kmmmigrated_thread)) { + pr_err("kmmmigrated: kthread_run(kmmmigrated) failed\n"); + err = PTR_ERR(kmmmigrated_thread); + kmmmigrated_thread = NULL; + goto end; + } else { + pr_info("kmmmigrated: Successfully started kmmmigrated"); + } + + wake_up_interruptible(&kmmmigrated_wait); +end: + return err; +} + +static int stop_kmmmigrated(void) +{ + guard(mutex)(&kmmmigrated_mutex); + kthread_stop(kmmmigrated_thread); + return 0; +} + +static void init_migration_list(void) +{ + INIT_LIST_HEAD(&kmmscand_migrate_list.migrate_head); + INIT_LIST_HEAD(&kmmscand_scanctrl.scan_list); + spin_lock_init(&kmmscand_migrate_lock); + init_waitqueue_head(&kmmscand_wait); + init_waitqueue_head(&kmmmigrated_wait); + init_scanctrl(&kmmscand_scanctrl); +} + +static int __init kmmscand_init(void) +{ + int err; + + kmmscand_slot_cache = KMEM_CACHE(kmmscand_mm_slot, 0); + + if (!kmmscand_slot_cache) { + pr_err("kmmscand: kmem_cache error"); + return -ENOMEM; + } + + err = kmmscand_init_sysfs(&mm_kobj); + + if (err) + goto err_init_sysfs; + + init_migration_list(); + + err = start_kmmscand(); + if (err) + goto err_kmmscand; + + err = start_kmmmigrated(); + if (err) + goto err_kmmmigrated; + + return 0; + +err_kmmmigrated: + stop_kmmmigrated(); + +err_kmmscand: + stop_kmmscand(); +err_init_sysfs: + kmmscand_destroy(); + + return err; +} +subsys_initcall(kmmscand_init); diff --git a/mm/migrate.c b/mm/migrate.c index fb19a18892c8..9d39abc7662a 100644 --- a/mm/migrate.c +++ b/mm/migrate.c @@ -2598,7 +2598,7 @@ SYSCALL_DEFINE6(move_pages, pid_t, pid, unsigned long, nr_pages, * Returns true if this is a safe migration target node for misplaced NUMA * pages. Currently it only checks the watermarks which is crude. 
*/ -static bool migrate_balanced_pgdat(struct pglist_data *pgdat, +bool migrate_balanced_pgdat(struct pglist_data *pgdat, unsigned long nr_migrate_pages) { int z; @@ -2656,7 +2656,7 @@ int migrate_misplaced_folio_prepare(struct folio *folio, * See folio_likely_mapped_shared() on possible imprecision * when we cannot easily detect if a folio is shared. */ - if ((vma->vm_flags & VM_EXEC) && + if ((vma && vma->vm_flags & VM_EXEC) && folio_likely_mapped_shared(folio)) return -EACCES; diff --git a/mm/vmstat.c b/mm/vmstat.c index 16bfe1c694dd..cb21441969c5 100644 --- a/mm/vmstat.c +++ b/mm/vmstat.c @@ -1340,6 +1340,17 @@ const char * const vmstat_text[] = { "numa_hint_faults_local", "numa_pages_migrated", #endif +#ifdef CONFIG_KMMSCAND + "nr_kmmscand_mm_scans", + "nr_kmmscand_vma_scans", + "nr_kmmscand_migadded", + "nr_kmmscand_migrated", + "nr_kmmscand_migrate_failed", + "nr_kmmscand_kzalloc_fail", + "nr_kmmscand_slowtier", + "nr_kmmscand_toptier", + "nr_kmmscand_idlepage", +#endif #ifdef CONFIG_MIGRATION "pgmigrate_success", "pgmigrate_fail", ^ permalink raw reply [flat|nested] 33+ messages in thread
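[Editorial note] The scan_period/scan_size tuning table documented in the comment block of the patch above maps the previous (X) and current (Y) useful-page counts to four cases. A small userspace model may make the heuristic easier to review. The TUNE/SCALE constants, the ignore threshold, and the 2s/4GB starting point come from the patch; the clamp bounds (KMMSCAND_SCAN_PERIOD_{MIN,MAX}, KMMSCAND_SCAN_SIZE_{MIN,MAX}) are not visible in this hunk, so the values below are placeholders, not the kernel's:

```python
# Userspace model of kmmscand_update_mmslot_info()'s tuning heuristic.
# TUNE/SCALE constants mirror the patch; MIN/MAX clamp bounds are
# assumed placeholders since their definitions are not in this hunk.
SCAN_PERIOD_TUNE_PERCENT = 15
SCAN_PERIOD_CHANGE_SCALE = 3
SCAN_SIZE_CHANGE_SCALE = 1
IGNORE_SCAN_THR = 100

PERIOD_MIN_MS, PERIOD_MAX_MS = 100, 60_000          # assumed bounds
SIZE_MIN, SIZE_MAX = 256 << 20, 1 << 40             # assumed: 256MB..1TB

def tune(scan_period_ms, scan_size, old_delta, total):
    """Return (new_period_ms, new_size) from last (X=old_delta) and
    current (Y=total) useful-page counts, following the patch's cases."""
    if total < IGNORE_SCAN_THR:
        # hack from the patch: treat continuously-failing pages as noise
        total = 0
    if not old_delta and not total:
        # case 1: X=0, Y=0 -> back off slowly, shrink scan window
        scan_period_ms = scan_period_ms * (100 + SCAN_PERIOD_TUNE_PERCENT) // 100
        scan_size >>= SCAN_SIZE_CHANGE_SCALE
    elif old_delta and total:
        # case 4: X>0, Y>0 -> speed up slowly, grow scan window
        scan_period_ms = scan_period_ms * (100 - SCAN_PERIOD_TUNE_PERCENT) // 100
        scan_size <<= SCAN_SIZE_CHANGE_SCALE
    elif old_delta and not total:
        # case 3: X>0, Y=0 -> back off aggressively
        scan_period_ms <<= SCAN_PERIOD_CHANGE_SCALE
    else:
        # case 2: X=0, Y>0 -> speed up aggressively, max out scan size
        scan_period_ms >>= SCAN_PERIOD_CHANGE_SCALE
        scan_size = SIZE_MAX
    scan_period_ms = max(PERIOD_MIN_MS, min(scan_period_ms, PERIOD_MAX_MS))
    scan_size = max(SIZE_MIN, min(scan_size, SIZE_MAX))
    return scan_period_ms, scan_size

# Starting point from the patch: 2s period, 4GB scan size; a suddenly
# hot mm (X=0, Y>0) gets its period cut 8x and scan size maxed out.
p, s = tune(2000, 4 << 30, old_delta=0, total=500)
print(p, s == SIZE_MAX)   # prints: 250 True
```

Note that the patch's doc comment phrases the aggressive cases as "by (2 << SCAN_PERIOD_CHANGE_SCALE)" while the code shifts by SCAN_PERIOD_CHANGE_SCALE; the model follows the code.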
* Re: [LSF/MM/BPF TOPIC] Overhauling hot page detection and promotion based on PTE A bit scanning 2025-01-23 10:57 [LSF/MM/BPF TOPIC] Overhauling hot page detection and promotion based on PTE A bit scanning Raghavendra K T ` (2 preceding siblings ...) 2025-01-26 2:27 ` Huang, Ying @ 2025-01-31 12:28 ` Jonathan Cameron 2025-01-31 13:09 ` [LSF/MM/BPF TOPIC] Unifying sources of page temperature information - what info is actually wanted? Jonathan Cameron 2025-02-03 2:23 ` [LSF/MM/BPF TOPIC] Overhauling hot page detection and promotion based on PTE A bit scanning Raghavendra K T 2025-04-07 3:13 ` Bharata B Rao 4 siblings, 2 replies; 33+ messages in thread From: Jonathan Cameron @ 2025-01-31 12:28 UTC (permalink / raw) To: Raghavendra K T Cc: linux-mm, akpm, lsf-pc, bharata, gourry, nehagholkar, abhishekd, ying.huang, nphamcs, hannes, feng.tang, kbusch, Hasan.Maruf, sj, david, willy, k.shutemov, mgorman, vbabka, hughd, rientjes, shy828301, liam.howlett, peterz, mingo, nadav.amit, shivankg, ziy, jhubbard, AneeshKumar.KizhakeVeetil, linux-kernel, jon.grimm, santosh.shukla, Michael.Day, riel, weixugc, leesuyeon0506, honggyu.kim, leillc, kmanaouil.dev, rppt, dave.hansen > Here is the list of potential discussion points: ... > 2. Possibility of maintaining single source of truth for page hotness that would > maintain hot page information from multiple sources and let other sub-systems > use that info. Hi, I was thinking of proposing a separate topic on a single source of hotness, but this question covers it so I'll add some thoughts here instead. I think we are very early, but sharing some experience and thoughts in a session may be useful. What do the other subsystems that want to use a single source of page hotness want to be able to find out? (subject to filters like memory range, process etc) A) How hot is page X? - Is this useful, or too much data? What would use it? * Application optimization maybe. 
Very handy for developing algorithms to do the rest of the options here as an Oracle!
- Provides both the cold and hot end of the scale, but maybe measurement techniques vary and cannot be easily combined. Hard in general to combine multiple sources of truth if aiming for an absolute number.

B) Which pages are super hot?
- Probably those that make the most difference if they are in a slower memory tier.

C) Some pages are hot enough to consider moving?
- This may be good enough to get the key data into the fast memory over time.
- Can combine sources of info as being able to compare precise numbers doesn't matter.

D) Which pages are fairly cold?
- Likewise maybe good enough over time.

E) Which pages are very cold?
- Ideal case for tiering. Swap these with the super hot ones.
- Maybe extra signal for swap / zswap etc

F) Did these hot pages remain hot (and same for cold)?
- This is needed to know when to back off doing things as we have unstable hotness (two-phase applications are a pain for this); sampling a few pages may be fine.

Messy corners:

Temporal aspects.
- If only providing lists of hottest / coldest in the last second, it is very hard to find those that are of a stable temperature. We end up moving very hot data (which is disruptive) and it doesn't stay hot.
- Can reduce that effect by long sampling windows on some measurement approaches (on hardware trackers that can trash accuracy due to resource exhaustion and other subtle effects).
- Bistable / phase-based applications are a pain but perhaps up to higher levels to back off.

My main interest is migrating in tiered systems but good to look at what else would use a common layer.
Mostly I want to know something that is useful to move, and assume convergence over the long term with the best things to move, so to me the ideal layer has the following interface (strawman, so shoot holes in it!):

1) Give me up to X hotish pages from a slow tier (greater than a specific measure of temperature)
2) Give me X coldish pages from a faster tier.
3) I expect to ask again in X seconds so please have some info ready for me!
4) (a path to get an idea of 'unhelpful moves' from earlier iterations - this is bleeding the tiering application into a shared interface though).

If we have multiple subsystems using the data we will need to resolve their conflicting demands to generate good enough data with appropriate overhead. I'd also like a virtualized solution for the case of hardware PA trackers (what I have with CXL Hotness Monitoring Units) and the classic memory pool / stranding avoidance case where the VM is the right entity to make migration decisions. Making that interface convey what the kernel is going to use would be an efficient option. I'd like to hide how the sausage was made from the VM. Jonathan ^ permalink raw reply [flat|nested] 33+ messages in thread
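[Editorial note] The strawman interface above (points 1-4) can be made concrete with a toy in-memory sketch. Every name below is hypothetical and only illustrates the query shapes a unified hotness layer might expose; it is not a proposed kernel API, and the pfn/temperature/tier store stands in for whatever sources (A-bit scans, hint faults, IBS, CXL HMU) would actually feed it:

```python
# Toy model of a unified page-hotness layer queried by tiering.
from dataclasses import dataclass, field

@dataclass
class HotnessLayer:
    # pfn -> (temperature, tier); populated by hotness sources.
    pages: dict = field(default_factory=dict)

    def record(self, pfn, temperature, tier):
        """A source reports a temperature observation for a page."""
        self.pages[pfn] = (temperature, tier)

    def hottest_in_tier(self, tier, limit, min_temp=0):
        """1) 'Give me up to X hotish pages from a slow tier
        (greater than a specific measure of temperature).'"""
        cand = [(t, pfn) for pfn, (t, tr) in self.pages.items()
                if tr == tier and t >= min_temp]
        return [pfn for t, pfn in sorted(cand, reverse=True)[:limit]]

    def coldest_in_tier(self, tier, limit):
        """2) 'Give me X coldish pages from a faster tier.'"""
        cand = [(t, pfn) for pfn, (t, tr) in self.pages.items()
                if tr == tier]
        return [pfn for t, pfn in sorted(cand)[:limit]]

layer = HotnessLayer()
layer.record(0x100, temperature=9, tier="slow")
layer.record(0x101, temperature=1, tier="slow")
layer.record(0x200, temperature=0, tier="fast")
print(layer.hottest_in_tier("slow", limit=1))   # prints: [256]
```

Points 3 (pre-arming the sampler for the next query) and 4 (feedback on unhelpful moves) are deliberately omitted; as noted above, 4 starts to bleed the consumer's policy into the shared layer.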
* [LSF/MM/BPF TOPIC] Unifying sources of page temperature information - what info is actually wanted?
  2025-01-31 12:28           ` Jonathan Cameron
@ 2025-01-31 13:09             ` Jonathan Cameron
  2025-02-05  6:24               ` Bharata B Rao
  2025-02-16  6:49               ` Huang, Ying
  2025-02-03  2:23             ` [LSF/MM/BPF TOPIC] Overhauling hot page detection and promotion based on PTE A bit scanning Raghavendra K T
  1 sibling, 2 replies; 33+ messages in thread
From: Jonathan Cameron @ 2025-01-31 13:09 UTC (permalink / raw)
To: Raghavendra K T
Cc: linux-mm, akpm, lsf-pc, bharata, gourry, nehagholkar, abhishekd,
    ying.huang, nphamcs, hannes, feng.tang, kbusch, Hasan.Maruf, sj,
    david, willy, k.shutemov, mgorman, vbabka, hughd, rientjes,
    shy828301, liam.howlett, peterz, mingo, nadav.amit, shivankg, ziy,
    jhubbard, AneeshKumar.KizhakeVeetil, linux-kernel, jon.grimm,
    santosh.shukla, Michael.Day, riel, weixugc, leesuyeon0506,
    honggyu.kim, leillc, kmanaouil.dev, rppt, dave.hansen

On Fri, 31 Jan 2025 12:28:03 +0000
Jonathan Cameron <Jonathan.Cameron@huawei.com> wrote:

> > Here is the list of potential discussion points:
> ...
>
> > 2. Possibility of maintaining single source of truth for page hotness that would
> >    maintain hot page information from multiple sources and let other sub-systems
> >    use that info.
>
> Hi,
>
> I was thinking of proposing a separate topic on a single source of hotness,
> but this question covers it so I'll add some thoughts here instead.
> I think we are very early, but sharing some experience and thoughts in a
> session may be useful.

Thinking more on this over lunch, I think it is worth calling this out as a
potential session topic in its own right rather than trying to find
time within other sessions. Hence the title change.

I think a session would start with a brief listing of the temperature sources
we have and those on the horizon to motivate what we are unifying, then
discussion to focus on the need for such a unification + requirements
(maybe with a straw man).
>
> What do the other subsystems that want to use a single source of page hotness
> want to be able to find out? (subject to filters like memory range, process etc)
>
> A) How hot is page X?
>  - Is this useful, or too much data? What would use it?
>    * Application optimization maybe. Very handy for developing algorithms
>      to do the rest of the options here as an Oracle!
>  - Provides both the cold and hot end of the scale, but maybe measurement
>    techniques vary and can not be easily combined. Hard in general to combine
>    multiple sources of truth if aiming for an absolute number.
>
> B) Which pages are super hot?
>  - Probably these that make the most difference if they are in a slower memory tier.
>
> C) Some pages are hot enough to consider moving?
>  - This may be good enough to get the key data into the fast memory over time.
>  - Can combine sources of info as being able to compare precise numbers doesn't matter.
>
> D) Which pages are fairly cold?
>  - Likewise maybe good enough over time.
>
> E) Which pages are very cold?
>  - Ideal case for tiering. Swap these with the super hot ones.
>  - Maybe extra signal for swap / zswap etc
>
> F) Did these hot pages remain hot (and same for cold)
>  - This is needed to know when to back off doing things as we have unstable
>    hotness (two phase applications are a pain for this), sampling a few
>    pages may be fine.
>
> Messy corners:
>
> Temporal aspects.
>  - If only providing lists of hottest / coldest in last second, very hard
>    to find those that are of a stable temperature. We end up moving
>    very hot data (which is disruptive) and it doesn't stay hot.
>  - Can reduce that affect by long sampling windows on some measurement approaches
>    (on hardware trackers that can trash accuracy due to resource exhaustion
>    and other subtle effects).
>  - bistable / phase based applications are a pain but perhaps up to higher
>    levels to back off.
>
> My main interest is migrating in tiered systems but good to look at what
> else would use a common layer.
>
> Mostly I want to know something that is useful to move, and assume convergence
> over the long term with the best things to move so to me the ideal layer has
> following interface (strawman so shoot holes in it!):
>
> 1) Give me up to X hotish pages from a slow tier (greater than a specific measure
>    of temperature)
> 2) Give me X coldish pages a faster tier.
> 3) I expect to ask again in X seconds so please have some info ready for me!
> 4) (a path to get an idea of 'unhelpful moves' from earlier iterations - this
>    is bleeding the tiering application into a shared interface though).
>
> If we have multiple subsystems using the data we will need to resolve their
> conflicting demands to generate good enough data with appropriate overhead.
>
> I'd also like a virtualized solution for case of hardware PA trackers (what
> I have with CXL Hotness Monitoring Units) and classic memory pool / stranding
> avoidance case where the VM is the right entity to make migration decisions.
> Making that interface convey what the kernel is going to use would be an
> efficient option. I'd like to hide how the sausage was made from the VM.
>
> Jonathan

^ permalink raw reply	[flat|nested] 33+ messages in thread
* Re: [LSF/MM/BPF TOPIC] Unifying sources of page temperature information - what info is actually wanted? 2025-01-31 13:09 ` [LSF/MM/BPF TOPIC] Unifying sources of page temperature information - what info is actually wanted? Jonathan Cameron @ 2025-02-05 6:24 ` Bharata B Rao 2025-02-05 16:05 ` Johannes Weiner ` (2 more replies) 2025-02-16 6:49 ` Huang, Ying 1 sibling, 3 replies; 33+ messages in thread From: Bharata B Rao @ 2025-02-05 6:24 UTC (permalink / raw) To: Jonathan Cameron, Raghavendra K T Cc: linux-mm, akpm, lsf-pc, gourry, nehagholkar, abhishekd, ying.huang, nphamcs, hannes, feng.tang, kbusch, Hasan.Maruf, sj, david, willy, k.shutemov, mgorman, vbabka, hughd, rientjes, shy828301, liam.howlett, peterz, mingo, nadav.amit, shivankg, ziy, jhubbard, AneeshKumar.KizhakeVeetil, linux-kernel, jon.grimm, santosh.shukla, Michael.Day, riel, weixugc, leesuyeon0506, honggyu.kim, leillc, kmanaouil.dev, rppt, dave.hansen On 31-Jan-25 6:39 PM, Jonathan Cameron wrote: > On Fri, 31 Jan 2025 12:28:03 +0000 > Jonathan Cameron <Jonathan.Cameron@huawei.com> wrote: > >>> Here is the list of potential discussion points: >> ... >> >>> 2. Possibility of maintaining single source of truth for page hotness that would >>> maintain hot page information from multiple sources and let other sub-systems >>> use that info. >> Hi, >> >> I was thinking of proposing a separate topic on a single source of hotness, >> but this question covers it so I'll add some thoughts here instead. >> I think we are very early, but sharing some experience and thoughts in a >> session may be useful. > > Thinking more on this over lunch, I think it is worth calling this out as a > potential session topic in it's own right rather than trying to find > time within other sessions. Hence the title change. 
>
> I think a session would start with a brief listing of the
> temperature sources
> we have and those on the horizon to motivate what we are unifying, then
> discussion to focus on need for such a unification + requirements
> (maybe with a straw man).

Here is a compilation of available temperature sources and how the
hot/access data is consumed by different subsystems:

PA - Physical address available
VA - Virtual address available
AA - Access time available
NA - accessing Node info available

I have left the slot blank for those which I am not sure about.
==================================================
Temperature             PA     VA     AA     NA
source
==================================================
PROT_NONE faults        Y      Y      Y      Y
--------------------------------------------------
folio_mark_accessed()   Y      Y      Y
--------------------------------------------------
PTE A bit               Y      Y      N      N
--------------------------------------------------
Platform hints          Y      Y      Y      Y
(AMD IBS)
--------------------------------------------------
Device hints            Y
(CXL HMU)
==================================================

And here is an attempt to compile how different subsystems
use the above data:
==============================================================
Source                 Subsystem       Consumption
==============================================================
PROT_NONE faults       NUMAB           NUMAB=1 locality based
via process pgtable                    balancing
walk                                   NUMAB=2 hot page
                                       promotion
==============================================================
folio_mark_accessed()  FS/filemap/GUP  LRU list activation
==============================================================
PTE A bit via          Reclaim:LRU     LRU list activation,
rmap walk                              deactivation/demotion
==============================================================
PTE A bit via          Reclaim:MGLRU   LRU list activation,
rmap walk and process                  deactivation/demotion
pgtable walk
==============================================================
PTE A bit via          DAMON           LRU activation,
rmap walk                              hot page promotion,
                                       demotion etc
==============================================================
Platform hints         NUMAB           NUMAB=1 Locality based
(AMD IBS)                              balancing and
                                       NUMAB=2 hot page
                                       promotion
==============================================================
Device hints           NUMAB           NUMAB=2 hot page
                                       promotion
==============================================================

The last two are listed as possibilities.

Feel free to correct/clarify and add more.

Regards,
Bharata.

^ permalink raw reply	[flat|nested] 33+ messages in thread
* Re: [LSF/MM/BPF TOPIC] Unifying sources of page temperature information - what info is actually wanted? 2025-02-05 6:24 ` Bharata B Rao @ 2025-02-05 16:05 ` Johannes Weiner 2025-02-06 6:46 ` SeongJae Park 2025-02-06 15:30 ` Jonathan Cameron 2025-02-07 9:50 ` Matthew Wilcox 2025-02-16 7:04 ` Huang, Ying 2 siblings, 2 replies; 33+ messages in thread From: Johannes Weiner @ 2025-02-05 16:05 UTC (permalink / raw) To: Bharata B Rao Cc: Jonathan Cameron, Raghavendra K T, linux-mm, akpm, lsf-pc, gourry, nehagholkar, abhishekd, ying.huang, nphamcs, feng.tang, kbusch, Hasan.Maruf, sj, david, willy, k.shutemov, mgorman, vbabka, hughd, rientjes, shy828301, liam.howlett, peterz, mingo, nadav.amit, shivankg, ziy, jhubbard, AneeshKumar.KizhakeVeetil, linux-kernel, jon.grimm, santosh.shukla, Michael.Day, riel, weixugc, leesuyeon0506, honggyu.kim, leillc, kmanaouil.dev, rppt, dave.hansen On Wed, Feb 05, 2025 at 11:54:05AM +0530, Bharata B Rao wrote: > On 31-Jan-25 6:39 PM, Jonathan Cameron wrote: > > On Fri, 31 Jan 2025 12:28:03 +0000 > > Jonathan Cameron <Jonathan.Cameron@huawei.com> wrote: > > > >>> Here is the list of potential discussion points: > >> ... > >> > >>> 2. Possibility of maintaining single source of truth for page hotness that would > >>> maintain hot page information from multiple sources and let other sub-systems > >>> use that info. > >> Hi, > >> > >> I was thinking of proposing a separate topic on a single source of hotness, > >> but this question covers it so I'll add some thoughts here instead. > >> I think we are very early, but sharing some experience and thoughts in a > >> session may be useful. > > > > Thinking more on this over lunch, I think it is worth calling this out as a > > potential session topic in it's own right rather than trying to find > > time within other sessions. Hence the title change. 
> >
> > I think a session would start with a brief listing of the temperature sources
> > we have and those on the horizon to motivate what we are unifying, then
> > discussion to focus on need for such a unification + requirements
> > (maybe with a straw man).
>
> Here is a compilation of available temperature sources and how the
> hot/access data is consumed by different subsystems:

This is super useful, thanks for collecting this.

> PA-Physical address available
> VA-Virtual address available
> AA-Access time available
> NA-accessing Node info available
>
> I have left the slot blank for those which I am not sure about.
> ==================================================
> Temperature             PA     VA     AA     NA
> source
> ==================================================
> PROT_NONE faults        Y      Y      Y      Y
> --------------------------------------------------
> folio_mark_accessed()   Y      Y      Y
> --------------------------------------------------

For fma(), the VA info is available in unmap, but usually it isn't -
or doesn't meaningfully exist, as in the case of unmapped buffered IO.

I'd say it's an N.

> PTE A bit               Y      Y      N      N
> --------------------------------------------------
> Platform hints          Y      Y      Y      Y
> (AMD IBS)
> --------------------------------------------------
> Device hints            Y
> (CXL HMU)
> ==================================================

For the following table, it might be useful to add *when* the source
produces this information. Sampling frequency is a likely challenge:
consumers have different requirements, and overhead should be limited
to the minimum required to serve enabled consumers.
Here is an (incomplete) attempt - sorry about the long lines:

> And here is an attempt to compile how different subsystems
> use the above data:
> ==============================================================
> Source                Subsystem       Consumption              Activation/Frequency
> ==============================================================
> PROT_NONE faults      NUMAB           NUMAB=1 locality based   While task is running,
> via process pgtable                   balancing                rate varies on observed
> walk                                  NUMAB=2 hot page         locality and sysctl knobs.
>                                       promotion
> ==============================================================
> folio_mark_accessed() FS/filemap/GUP  LRU list activation      On cache access and unmap
> ==============================================================
> PTE A bit via         Reclaim:LRU     LRU list activation,     During memory pressure
> rmap walk                             deactivation/demotion
> ==============================================================
> PTE A bit via         Reclaim:MGLRU   LRU list activation,     - During memory pressure
> rmap walk and process                 deactivation/demotion    - Continuous sampling (configurable)
> pgtable walk                                                     for workingset reporting
> ==============================================================
> PTE A bit via         DAMON           LRU activation,          Continuous sampling (configurable)?
> rmap walk                             hot page promotion,      (I believe SJ is looking into
>                                       demotion etc             auto-tuning this).
> ==============================================================
> Platform hints        NUMAB           NUMAB=1 Locality based
> (AMD IBS)                             balancing and
>                                       NUMAB=2 hot page
>                                       promotion
> ==============================================================
> Device hints          NUMAB           NUMAB=2 hot page
>                                       promotion
> ==============================================================
> The last two are listed as possibilities.
>
> Feel free to correct/clarify and add more.
>
> Regards,
> Bharata.

^ permalink raw reply	[flat|nested] 33+ messages in thread
* Re: [LSF/MM/BPF TOPIC] Unifying sources of page temperature information - what info is actually wanted? 2025-02-05 16:05 ` Johannes Weiner @ 2025-02-06 6:46 ` SeongJae Park 2025-02-06 15:30 ` Jonathan Cameron 1 sibling, 0 replies; 33+ messages in thread From: SeongJae Park @ 2025-02-06 6:46 UTC (permalink / raw) To: Johannes Weiner Cc: SeongJae Park, Bharata B Rao, Jonathan Cameron, Raghavendra K T, linux-mm, akpm, lsf-pc, gourry, nehagholkar, abhishekd, ying.huang, nphamcs, feng.tang, kbusch, Hasan.Maruf, david, willy, k.shutemov, mgorman, vbabka, hughd, rientjes, shy828301, liam.howlett, peterz, mingo, nadav.amit, shivankg, ziy, jhubbard, AneeshKumar.KizhakeVeetil, linux-kernel, jon.grimm, santosh.shukla, Michael.Day, riel, weixugc, leesuyeon0506, honggyu.kim, leillc, kmanaouil.dev, rppt, dave.hansen On Wed, 5 Feb 2025 11:05:29 -0500 Johannes Weiner <hannes@cmpxchg.org> wrote: > On Wed, Feb 05, 2025 at 11:54:05AM +0530, Bharata B Rao wrote: > > On 31-Jan-25 6:39 PM, Jonathan Cameron wrote: > > > On Fri, 31 Jan 2025 12:28:03 +0000 > > > Jonathan Cameron <Jonathan.Cameron@huawei.com> wrote: [...] > > Here is a compilation of available temperature sources and how the > > hot/access data is consumed by different subsystems: > > This is super useful, thanks for collecting this. Indeed. Thank you Bharata! [...] > For the following table, it might be useful to add *when* the source > produces this information. Sampling frequency is a likely challenge: > consumers have different requirements, and overhead should be limited > to the minimum required to serve enabled consumers. +1 > > Here is an (incomplete) attempt - sorry about the long lines: > > > And here is an attempt to compile how different subsystems > > use the above data: > > ============================================================== > > Source Subsystem Consumption Activation/Frequency [...] 
> > ==============================================================
> > PTE A bit via         DAMON           LRU activation,          Continuous sampling (configurable)?
> > rmap walk                             hot page promotion,      (I believe SJ is looking into
> >                                       demotion etc             auto-tuning this).

You're right. I'm working on auto-tuning of the sampling/aggregation
intervals of DAMON based on its tuning guide theory[1]. Hopefully I will
be able to post an RFC patch series within a couple of weeks.

> > ==============================================================
> > Platform hints        NUMAB           NUMAB=1 Locality based
> > (AMD IBS)                             balancing and
> >                                       NUMAB=2 hot page
> >                                       promotion
> > ==============================================================
> > Device hints          NUMAB           NUMAB=2 hot page
> >                                       promotion
> > ==============================================================
> > The last two are listed as possibilities.

I'm also trying to extend DAMON to use PROT_NONE faults and AMD IBS like
access check sources. Hopefully I will share more details of the plan and
experiment results for the PROT_NONE faults extension by LSFMMBPF.

[1] https://lore.kernel.org/20241202175459.2005526-1-sj@kernel.org

Thanks,
SJ

[...]

^ permalink raw reply	[flat|nested] 33+ messages in thread
* Re: [LSF/MM/BPF TOPIC] Unifying sources of page temperature information - what info is actually wanted? 2025-02-05 16:05 ` Johannes Weiner 2025-02-06 6:46 ` SeongJae Park @ 2025-02-06 15:30 ` Jonathan Cameron 1 sibling, 0 replies; 33+ messages in thread From: Jonathan Cameron @ 2025-02-06 15:30 UTC (permalink / raw) To: Johannes Weiner Cc: Bharata B Rao, Raghavendra K T, linux-mm, akpm, lsf-pc, gourry, nehagholkar, abhishekd, ying.huang, nphamcs, feng.tang, kbusch, Hasan.Maruf, sj, david, willy, k.shutemov, mgorman, vbabka, hughd, rientjes, shy828301, liam.howlett, peterz, mingo, nadav.amit, shivankg, ziy, jhubbard, AneeshKumar.KizhakeVeetil, linux-kernel, jon.grimm, santosh.shukla, Michael.Day, riel, weixugc, leesuyeon0506, honggyu.kim, leillc, kmanaouil.dev, rppt, dave.hansen On Wed, 5 Feb 2025 11:05:29 -0500 Johannes Weiner <hannes@cmpxchg.org> wrote: > On Wed, Feb 05, 2025 at 11:54:05AM +0530, Bharata B Rao wrote: > > On 31-Jan-25 6:39 PM, Jonathan Cameron wrote: > > > On Fri, 31 Jan 2025 12:28:03 +0000 > > > Jonathan Cameron <Jonathan.Cameron@huawei.com> wrote: > > > > > >>> Here is the list of potential discussion points: > > >> ... > > >> > > >>> 2. Possibility of maintaining single source of truth for page hotness that would > > >>> maintain hot page information from multiple sources and let other sub-systems > > >>> use that info. > > >> Hi, > > >> > > >> I was thinking of proposing a separate topic on a single source of hotness, > > >> but this question covers it so I'll add some thoughts here instead. > > >> I think we are very early, but sharing some experience and thoughts in a > > >> session may be useful. > > > > > > Thinking more on this over lunch, I think it is worth calling this out as a > > > potential session topic in it's own right rather than trying to find > > > time within other sessions. Hence the title change. 
> > >
> > > I think a session would start with a brief listing of the temperature sources
> > > we have and those on the horizon to motivate what we are unifying, then
> > > discussion to focus on need for such a unification + requirements
> > > (maybe with a straw man).
> >
> > Here is a compilation of available temperature sources and how the
> > hot/access data is consumed by different subsystems:
>
> This is super useful, thanks for collecting this.

Absolutely agree!

> > PA-Physical address available
> > VA-Virtual address available
> > AA-Access time available
> > NA-accessing Node info available
> >
> > I have left the slot blank for those which I am not sure about.
> > ==================================================
> > Temperature             PA     VA     AA     NA
> > source
> > ==================================================
> > PROT_NONE faults        Y      Y      Y      Y
> > --------------------------------------------------
> > folio_mark_accessed()   Y      Y      Y
> > --------------------------------------------------
>
> For fma(), the VA info is available in unmap, but usually it isn't -
> or doesn't meaningfully exist, as in the case of unmapped buffered IO.
>
> I'd say it's an N.
>
> > PTE A bit               Y      Y      N      N
> > --------------------------------------------------
> > Platform hints          Y      Y      Y      Y
> > (AMD IBS)
> > --------------------------------------------------
> > Device hints            Y
> > (CXL HMU)
> > ==================================================

For the use cases where we have relatively few 'pages' the cost of a reverse
map look up doesn't look to be a problem. Trick is to do it only after we've
done what we can in PA space to cut down on the pages of interest. So maybe
(Y) to reflect that it is indirect. Whether it makes sense to do that before
or after some common layer is an interesting question. That PA/VA mapping
might be out of date anyway by the time we see the data.

>
> For the following table, it might be useful to add *when* the source
> produces this information.
> Sampling frequency is a likely challenge:
> consumers have different requirements, and overhead should be limited
> to the minimum required to serve enabled consumers.
>
> Here is an (incomplete) attempt - sorry about the long lines:
>
> > And here is an attempt to compile how different subsystems
> > use the above data:
> > ==============================================================
> > Source                Subsystem       Consumption              Activation/Frequency
> > ==============================================================
> > PROT_NONE faults      NUMAB           NUMAB=1 locality based   While task is running,
> > via process pgtable                   balancing                rate varies on observed
> > walk                                  NUMAB=2 hot page         locality and sysctl knobs.
> >                                       promotion
> > ==============================================================
> > folio_mark_accessed() FS/filemap/GUP  LRU list activation      On cache access and unmap
> > ==============================================================
> > PTE A bit via         Reclaim:LRU     LRU list activation,     During memory pressure
> > rmap walk                             deactivation/demotion
> > ==============================================================
> > PTE A bit via         Reclaim:MGLRU   LRU list activation,     - During memory pressure
> > rmap walk and process                 deactivation/demotion    - Continuous sampling (configurable)
> > pgtable walk                                                     for workingset reporting
> > ==============================================================
> > PTE A bit via         DAMON           LRU activation,          Continuous sampling (configurable)?
> > rmap walk                             hot page promotion,      (I believe SJ is looking into
> >                                       demotion etc             auto-tuning this).
> > ==============================================================
> > Platform hints        NUMAB           NUMAB=1 Locality based
> > (AMD IBS)                             balancing and
> >                                       NUMAB=2 hot page
> >                                       promotion
> > ==============================================================

Based on the CXL one...

> > Device hints          NUMAB           NUMAB=2 hot page         Continuous sampling, frequency controllable.
> >                                       promotion                Subsampling programmable.
> > ==============================================================
> > The last two are listed as possibilities.
> >
> > Feel free to correct/clarify and add more.

The above covers what the use cases require. Maybe we need to do similar
for the controls needed the other way (frequency already covered).

Filtering:
* Process ID
* Address range (PA / VA)
* Access type (read vs write) may matter for migration cost.

Also frequency is more nuanced perhaps:
- How often to give data (timeliness)
- How much data to give (bandwidth)
- When don't I care (threshold)
- How precise do I want it to be (subsampling etc)

The layering is clearly going to be complex, so maybe addressing each use
case for what info it needs would be helpful? The following is probably
too simplistic.

==================================================================
Usecase                Nature of data
==================================================================
NUMAB=1                Enough hot pages with remote source.
Balancing
==================================================================
NUMAB=2                Enough hot pages in slow memory
Tiering Promotion
==================================================================
NUMAB=2                Enough cold pages in fast memory
Tiering Demotion
==================================================================
LRU list               Specific pages of interest accessed
activation
==================================================================
LRU list               Enough cold pages?
deactivation
==================================================================

Jonathan

> >
> > Regards,
> > Bharata.

^ permalink raw reply	[flat|nested] 33+ messages in thread
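[Editorial note: the filter and frequency controls Jonathan lists could be collected into a single request structure. The following is a purely illustrative C sketch - every name is hypothetical - of one way a consumer might express its demands to a common hotness layer.]

```c
/* Hypothetical sketch of the filter controls discussed above: a sample
 * passes if it falls inside the requested physical range, matches the
 * access type, and exceeds the "when don't I care" threshold. */
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

enum access_type { ACC_READ, ACC_WRITE, ACC_ANY };

struct hotness_filter {
	uint64_t pa_start, pa_end;  /* physical address range of interest */
	enum access_type type;      /* read vs write may matter for migration cost */
	unsigned int threshold;     /* temperature below which we don't care */
};

struct hotness_sample {
	uint64_t pa;
	enum access_type type;
	unsigned int temp;
};

/* Would a common layer report this sample to a consumer with filter f? */
static bool sample_passes(const struct hotness_filter *f,
			  const struct hotness_sample *s)
{
	if (s->pa < f->pa_start || s->pa >= f->pa_end)
		return false;
	if (f->type != ACC_ANY && s->type != f->type)
		return false;
	return s->temp > f->threshold;
}
```

The remaining knobs from the list (timeliness, bandwidth, subsampling precision) would be rate controls on the producer side rather than per-sample predicates, so they are omitted here.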
* Re: [LSF/MM/BPF TOPIC] Unifying sources of page temperature information - what info is actually wanted? 2025-02-05 6:24 ` Bharata B Rao 2025-02-05 16:05 ` Johannes Weiner @ 2025-02-07 9:50 ` Matthew Wilcox 2025-02-16 7:04 ` Huang, Ying 2 siblings, 0 replies; 33+ messages in thread From: Matthew Wilcox @ 2025-02-07 9:50 UTC (permalink / raw) To: Bharata B Rao Cc: Jonathan Cameron, Raghavendra K T, linux-mm, akpm, lsf-pc, gourry, nehagholkar, abhishekd, ying.huang, nphamcs, hannes, feng.tang, kbusch, Hasan.Maruf, sj, david, k.shutemov, mgorman, vbabka, hughd, rientjes, shy828301, liam.howlett, peterz, mingo, nadav.amit, shivankg, ziy, jhubbard, AneeshKumar.KizhakeVeetil, linux-kernel, jon.grimm, santosh.shukla, Michael.Day, riel, weixugc, leesuyeon0506, honggyu.kim, leillc, kmanaouil.dev, rppt, dave.hansen On Wed, Feb 05, 2025 at 11:54:05AM +0530, Bharata B Rao wrote: > Here is a compilation of available temperature sources and how the > hot/access data is consumed by different subsystems: > > PA-Physical address available > VA-Virtual address available > AA-Access time available > NA-accessing Node info available > > I have left the slot blank for those which I am not sure about. 
> ==================================================
> Temperature             PA     VA     AA     NA
> source
> ==================================================
> PROT_NONE faults        Y      Y      Y      Y
> --------------------------------------------------
> folio_mark_accessed()   Y      Y      Y
> --------------------------------------------------
> PTE A bit               Y      Y      N      N
> --------------------------------------------------
> Platform hints          Y      Y      Y      Y
> (AMD IBS)
> --------------------------------------------------
> Device hints            Y
> (CXL HMU)
> ==================================================
>
> And here is an attempt to compile how different subsystems
> use the above data:
> ==============================================================
> Source                Subsystem       Consumption
> ==============================================================
> PROT_NONE faults      NUMAB           NUMAB=1 locality based
> via process pgtable                   balancing
> walk                                  NUMAB=2 hot page
>                                       promotion
> ==============================================================
> folio_mark_accessed() FS/filemap/GUP  LRU list activation
> ==============================================================
> PTE A bit via         Reclaim:LRU     LRU list activation,
> rmap walk                             deactivation/demotion
> ==============================================================
> PTE A bit via         Reclaim:MGLRU   LRU list activation,
> rmap walk and process                 deactivation/demotion
> pgtable walk
> ==============================================================
> PTE A bit via         DAMON           LRU activation,
> rmap walk                             hot page promotion,
>                                       demotion etc
> ==============================================================
> Platform hints        NUMAB           NUMAB=1 Locality based
> (AMD IBS)                             balancing and
>                                       NUMAB=2 hot page
>                                       promotion
> ==============================================================
> Device hints          NUMAB           NUMAB=2 hot page
>                                       promotion
> ==============================================================
> The last two are listed as possibilities.
>
> Feel free to correct/clarify and add more.

There's PG_young / PG_idle as well.
^ permalink raw reply [flat|nested] 33+ messages in thread
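[Editorial note: the PG_young / PG_idle state Willy mentions is already exposed to userspace through the idle page tracking interface, /sys/kernel/mm/page_idle/bitmap (see Documentation/admin-guide/mm/idle_page_tracking.rst): the file is an array of 8-byte words, one bit per page, so word i covers pfns i*64 to i*64+63. A small sketch of the offset arithmetic; the helper names are invented for illustration.]

```c
/* Locating a pfn's bit in /sys/kernel/mm/page_idle/bitmap, whose format
 * is an array of 8-byte words with 64 pages per word. */
#include <assert.h>
#include <stdint.h>

/* Byte offset (for pread/pwrite on the bitmap file) of the word
 * holding 'pfn'. */
static uint64_t idle_bitmap_offset(uint64_t pfn)
{
	return (pfn / 64) * 8;
}

/* Bit within that word: set it (via a write) to mark the page idle,
 * read it back later - a still-set bit means no access cleared it. */
static uint64_t idle_bitmap_bit(uint64_t pfn)
{
	return pfn % 64;
}
```

Reading the bitmap requires root, and only user memory pages are tracked, but it is an existing PTE-A-bit-derived temperature source worth a row in the table.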
* Re: [LSF/MM/BPF TOPIC] Unifying sources of page temperature information - what info is actually wanted? 2025-02-05 6:24 ` Bharata B Rao 2025-02-05 16:05 ` Johannes Weiner 2025-02-07 9:50 ` Matthew Wilcox @ 2025-02-16 7:04 ` Huang, Ying 2 siblings, 0 replies; 33+ messages in thread From: Huang, Ying @ 2025-02-16 7:04 UTC (permalink / raw) To: Bharata B Rao Cc: Jonathan Cameron, Raghavendra K T, linux-mm, akpm, lsf-pc, gourry, nehagholkar, abhishekd, nphamcs, hannes, feng.tang, kbusch, Hasan.Maruf, sj, david, willy, k.shutemov, mgorman, vbabka, hughd, rientjes, shy828301, liam.howlett, peterz, mingo, nadav.amit, shivankg, ziy, jhubbard, AneeshKumar.KizhakeVeetil, linux-kernel, jon.grimm, santosh.shukla, Michael.Day, riel, weixugc, leesuyeon0506, honggyu.kim, leillc, kmanaouil.dev, rppt, dave.hansen, yuanchu Hi, Bharata, Bharata B Rao <bharata@amd.com> writes: > On 31-Jan-25 6:39 PM, Jonathan Cameron wrote: >> On Fri, 31 Jan 2025 12:28:03 +0000 >> Jonathan Cameron <Jonathan.Cameron@huawei.com> wrote: >> >>>> Here is the list of potential discussion points: >>> ... >>> >>>> 2. Possibility of maintaining single source of truth for page hotness that would >>>> maintain hot page information from multiple sources and let other sub-systems >>>> use that info. >>> Hi, >>> >>> I was thinking of proposing a separate topic on a single source of hotness, >>> but this question covers it so I'll add some thoughts here instead. >>> I think we are very early, but sharing some experience and thoughts in a >>> session may be useful. >> Thinking more on this over lunch, I think it is worth calling this >> out as a >> potential session topic in it's own right rather than trying to find >> time within other sessions. Hence the title change. 
>> I think a session would start with a brief listing of the
>> temperature sources
>> we have and those on the horizon to motivate what we are unifying, then
>> discussion to focus on need for such a unification + requirements
>> (maybe with a straw man).
>
> Here is a compilation of available temperature sources and how the
> hot/access data is consumed by different subsystems:

Thanks for your information!

> PA-Physical address available
> VA-Virtual address available
> AA-Access time available
> NA-accessing Node info available
>
> I have left the slot blank for those which I am not sure about.
> ==================================================
> Temperature             PA     VA     AA     NA
> source
> ==================================================
> PROT_NONE faults        Y      Y      Y      Y
> --------------------------------------------------
> folio_mark_accessed()   Y      Y      Y
> --------------------------------------------------
> PTE A bit               Y      Y      N      N

We can get some coarse-grained AA from PTE A bit scanning. That is, the
page is accessed at least once between two rounds of scanning. The AA is
less than the scanning interval. IIUC, similar information is available in
Yuanchu's MGLRU periodic aging series [1].
[1] https://lore.kernel.org/all/20221214225123.2770216-1-yuanchu@google.com/

> --------------------------------------------------
> Platform hints          Y      Y      Y      Y
> (AMD IBS)
> --------------------------------------------------
> Device hints            Y
> (CXL HMU)
> ==================================================
>
> And here is an attempt to compile how different subsystems
> use the above data:
> ==============================================================
> Source                Subsystem       Consumption
> ==============================================================
> PROT_NONE faults      NUMAB           NUMAB=1 locality based
> via process pgtable                   balancing
> walk                                  NUMAB=2 hot page
>                                       promotion
> ==============================================================
> folio_mark_accessed() FS/filemap/GUP  LRU list activation

IIUC, Gregory is working on a patchset to promote unmapped file cache
pages via folio_mark_accessed().

> ==============================================================
> PTE A bit via         Reclaim:LRU     LRU list activation,
> rmap walk                             deactivation/demotion
> ==============================================================
> PTE A bit via         Reclaim:MGLRU   LRU list activation,
> rmap walk and process                 deactivation/demotion
> pgtable walk
> ==============================================================
> PTE A bit via         DAMON           LRU activation,
> rmap walk                             hot page promotion,
>                                       demotion etc
> ==============================================================
> Platform hints        NUMAB           NUMAB=1 Locality based
> (AMD IBS)                             balancing and
>                                       NUMAB=2 hot page
>                                       promotion
> ==============================================================
> Device hints          NUMAB           NUMAB=2 hot page
>                                       promotion
> ==============================================================
> The last two are listed as possibilities.
>
> Feel free to correct/clarify and add more.

---
Best Regards,
Huang, Ying

^ permalink raw reply	[flat|nested] 33+ messages in thread
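[Editorial note: Ying's point about coarse-grained AA can be made concrete. With A-bit scanning every `interval` seconds, observing the bit set in a round only bounds when the access happened; it does not time-stamp it. A minimal sketch, with illustrative names:]

```c
/* Upper bound, in seconds, on how long ago a page was last accessed,
 * given the most recent scan round in which its A bit was seen set and
 * a fixed scanning interval. The access happened somewhere inside the
 * interval preceding that round's scan, hence the +1. */
#include <assert.h>

static unsigned int access_age_bound(unsigned int current_round,
				     unsigned int last_set_round,
				     unsigned int interval_secs)
{
	return (current_round - last_set_round + 1) * interval_secs;
}
```

This is why shortening the scan interval tightens the AA estimate, at the cost of more scanning overhead - the trade-off the auto-tuning discussion above is about.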
* Re: [LSF/MM/BPF TOPIC] Unifying sources of page temperature information - what info is actually wanted? 2025-01-31 13:09 ` [LSF/MM/BPF TOPIC] Unifying sources of page temperature information - what info is actually wanted? Jonathan Cameron 2025-02-05 6:24 ` Bharata B Rao @ 2025-02-16 6:49 ` Huang, Ying 2025-02-17 4:10 ` Bharata B Rao 2025-03-14 14:24 ` Jonathan Cameron 1 sibling, 2 replies; 33+ messages in thread From: Huang, Ying @ 2025-02-16 6:49 UTC (permalink / raw) To: Jonathan Cameron Cc: Raghavendra K T, linux-mm, akpm, lsf-pc, bharata, gourry, nehagholkar, abhishekd, nphamcs, hannes, feng.tang, kbusch, Hasan.Maruf, sj, david, willy, k.shutemov, mgorman, vbabka, hughd, rientjes, shy828301, liam.howlett, peterz, mingo, nadav.amit, shivankg, ziy, jhubbard, AneeshKumar.KizhakeVeetil, linux-kernel, jon.grimm, santosh.shukla, Michael.Day, riel, weixugc, leesuyeon0506, honggyu.kim, leillc, kmanaouil.dev, rppt, dave.hansen Hi, Jonathan, Sorry for late reply. Jonathan Cameron <Jonathan.Cameron@huawei.com> writes: > On Fri, 31 Jan 2025 12:28:03 +0000 > Jonathan Cameron <Jonathan.Cameron@huawei.com> wrote: > >> > Here is the list of potential discussion points: >> ... >> >> > 2. Possibility of maintaining single source of truth for page hotness that would >> > maintain hot page information from multiple sources and let other sub-systems >> > use that info. >> Hi, >> >> I was thinking of proposing a separate topic on a single source of hotness, >> but this question covers it so I'll add some thoughts here instead. >> I think we are very early, but sharing some experience and thoughts in a >> session may be useful. > > Thinking more on this over lunch, I think it is worth calling this out as a > potential session topic in it's own right rather than trying to find > time within other sessions. Hence the title change. 
> > I think a session would start with a brief listing of the temperature sources > we have and those on the horizon to motivate what we are unifying, then > discussion to focus on need for such a unification + requirements > (maybe with a straw man). > >> >> What do the other subsystems that want to use a single source of page hotness >> want to be able to find out? (subject to filters like memory range, process etc) >> >> A) How hot is page X? >> - Is this useful, or too much data? What would use it? >> * Application optimization maybe. Very handy for developing algorithms >> to do the rest of the options here as an Oracle! >> - Provides both the cold and hot end of the scale, but maybe measurement >> techniques vary and can not be easily combined. Hard in general to combine >> multiple sources of truth if aiming for an absolute number. >> >> B) Which pages are super hot? >> - Probably these that make the most difference if they are in a slower memory tier. >> >> C) Some pages are hot enough to consider moving? >> - This may be good enough to get the key data into the fast memory over time. >> - Can combine sources of info as being able to compare precise numbers doesn't matter. >> >> D) Which pages are fairly cold? >> - Likewise maybe good enough over time. >> >> E) Which pages are very cold? >> - Ideal case for tiering. Swap these with the super hot ones. >> - Maybe extra signal for swap / zswap etc >> >> F) Did these hot pages remain hot (and same for cold) >> - This is needed to know when to back off doing things as we have unstable >> hotness (two phase applications are a pain for this), sampling a few >> pages may be fine. >> >> Messy corners: >> >> Temporal aspects. >> - If only providing lists of hottest / coldest in last second, very hard >> to find those that are of a stable temperature. We end up moving >> very hot data (which is disruptive) and it doesn't stay hot. 
>> - Can reduce that effect by long sampling windows on some measurement approaches
>>   (on hardware trackers that can trash accuracy due to resource exhaustion
>>   and other subtle effects).
>> - bistable / phase based applications are a pain but perhaps up to higher
>>   levels to back off.
>>
>> My main interest is migrating in tiered systems but good to look at what
>> else would use a common layer.
>>
>> Mostly I want to know something that is useful to move, and assume convergence
>> over the long term with the best things to move so to me the ideal layer has
>> following interface (strawman so shoot holes in it!):
>>
>> 1) Give me up to X hotish pages from a slow tier (greater than a specific measure
>>    of temperature)

Because the hot pages may be available upon page access (such as a
PROT_NONE page fault), the interface may be "push" style instead of
"pull" style, e.g.,

int register_hot_page_handler(void (*handler)(struct page *hot_page, int temperature));

>> 2) Give me X coldish pages from a faster tier.
>> 3) I expect to ask again in X seconds so please have some info ready for me!
>> 4) (a path to get an idea of 'unhelpful moves' from earlier iterations - this
>>    is bleeding the tiering application into a shared interface though).

In addition to getting a list of hot/cold pages, it's also useful to get
hot/cold statistics of a memory device (NUMA node), e.g., something like
below,

Access frequency    percent
> 1000 Hz           10%
600-1000 Hz         20%
200-600 Hz          50%
1-200 Hz            15%
< 1 Hz              5%

Compared with a hot/cold page list, this may be gotten with lower
overhead and can be useful to tune the promotion/demotion algorithm. At
the same time, a sampled (incomplete) list of hot/cold pages may be
available too.

>> If we have multiple subsystems using the data we will need to resolve their
>> conflicting demands to generate good enough data with appropriate overhead.
>> >> I'd also like a virtualized solution for case of hardware PA trackers (what >> I have with CXL Hotness Monitoring Units) and classic memory pool / stranding >> avoidance case where the VM is the right entity to make migration decisions. >> Making that interface convey what the kernel is going to use would be an >> efficient option. I'd like to hide how the sausage was made from the VM. --- Best Regards, Huang, Ying ^ permalink raw reply [flat|nested] 33+ messages in thread
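Huang Ying's strawman `register_hot_page_handler()` above can be mocked up as a user-space sketch. Only the handler prototype comes from the mail; the fixed-size registry, the `report_hot_page()` notifier, and the example consumer are all hypothetical, and a real kernel version would need locking and an unregister path.

```c
#include <assert.h>

struct page { unsigned long pfn; };   /* stand-in for the kernel's struct page */

typedef void (*hot_page_handler_t)(struct page *hot_page, int temperature);

#define MAX_HANDLERS 4
static hot_page_handler_t handlers[MAX_HANDLERS];
static int nr_handlers;

/* Consumers (LRU, promotion, DAMON, ...) register to be pushed hot pages. */
int register_hot_page_handler(hot_page_handler_t handler)
{
    if (nr_handlers >= MAX_HANDLERS)
        return -1;                    /* would be -ENOSPC in the kernel */
    handlers[nr_handlers++] = handler;
    return 0;
}

/* A temperature source (NUMA hint fault, A bit scanner, CXL HMU, ...)
 * pushes each detected hot page to every registered consumer. */
void report_hot_page(struct page *page, int temperature)
{
    for (int i = 0; i < nr_handlers; i++)
        handlers[i](page, temperature);
}

/* Example consumer: count pages hot enough to be promotion candidates. */
static int promote_candidates;
static void promotion_handler(struct page *page, int temperature)
{
    (void)page;
    if (temperature > 100)            /* made-up threshold */
        promote_candidates++;
}
```

Back pressure, raised later in the thread, could be layered on by letting a handler's return value tell the source whether it wants more pages.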
* Re: [LSF/MM/BPF TOPIC] Unifying sources of page temperature information - what info is actually wanted? 2025-02-16 6:49 ` Huang, Ying @ 2025-02-17 4:10 ` Bharata B Rao 2025-02-17 8:06 ` Huang, Ying 1 sibling, 1 reply; 33+ messages in thread From: Bharata B Rao @ 2025-02-17 4:10 UTC (permalink / raw) To: Huang, Ying, Jonathan Cameron Cc: Raghavendra K T, linux-mm, akpm, lsf-pc, gourry, nehagholkar, abhishekd, nphamcs, hannes, feng.tang, kbusch, Hasan.Maruf, sj, david, willy, k.shutemov, mgorman, vbabka, hughd, rientjes, shy828301, liam.howlett, peterz, mingo, nadav.amit, shivankg, ziy, jhubbard, AneeshKumar.KizhakeVeetil, linux-kernel, jon.grimm, santosh.shukla, Michael.Day, riel, weixugc, leesuyeon0506, honggyu.kim, leillc, kmanaouil.dev, rppt, dave.hansen On 16-Feb-25 12:19 PM, Huang, Ying wrote: >>> >>> 1) Give me up to X hotish pages from a slow tier (greater than a specific measure >>> of temperature) > > Because the hot pages may be available upon page accessing (such PROT_NONE > page fault), the interface may be "push" style instead of "pull" style, > e.g., > > int register_hot_page_handler(void (*handler)(struct page *hot_page, int temperature)); Yes, a push model appears natural to me given that there are producers who are themselves consumers as well. Let's take an example: an access is detected first by DAMON's PTE scan, and both the LRU and hot page promotion subsystems have registered handlers for hot page info. Now if the hot page promotion handler gets called first and it promotes the page, does calling the LRU-registered handler still make sense? Maybe not, I suppose. On the other hand, if the LRU subsystem handler gets called first and it adjusts/modifies the hot page's list position, it would still make sense to invoke the hot page promotion handler to check for possible promotion. Is this how you are envisioning that the different consumers of hot page access info could work/cooperate? Regards, Bharata. 
^ permalink raw reply [flat|nested] 33+ messages in thread
* Re: [LSF/MM/BPF TOPIC] Unifying sources of page temperature information - what info is actually wanted? 2025-02-17 4:10 ` Bharata B Rao @ 2025-02-17 8:06 ` Huang, Ying 0 siblings, 0 replies; 33+ messages in thread From: Huang, Ying @ 2025-02-17 8:06 UTC (permalink / raw) To: Bharata B Rao Cc: Jonathan Cameron, Raghavendra K T, linux-mm, akpm, lsf-pc, gourry, nehagholkar, abhishekd, nphamcs, hannes, kbusch, Hasan.Maruf, sj, david, willy, k.shutemov, mgorman, vbabka, hughd, rientjes, shy828301, liam.howlett, peterz, mingo, nadav.amit, shivankg, ziy, jhubbard, AneeshKumar.KizhakeVeetil, linux-kernel, jon.grimm, santosh.shukla, Michael.Day, riel, weixugc, leesuyeon0506, honggyu.kim, leillc, kmanaouil.dev, rppt, dave.hansen, feng.tang Bharata B Rao <bharata@amd.com> writes: > On 16-Feb-25 12:19 PM, Huang, Ying wrote: >>>> >>>> 1) Give me up to X hotish pages from a slow tier (greater than a specific measure >>>> of temperature) >> Because the hot pages may be available upon page accessing (such >> PROT_NONE >> page fault), the interface may be "push" style instead of "pull" style, >> e.g., >> int register_hot_page_handler(void (*handler)(struct page *hot_page, >> int temperature)); > > Yes, push model appears natural to me given that there are producers > who are themselves consumers as well. > > Let's take an example of access being detected by PTE scan by DAMON > first and LRU and hot page promotion subsystems have registered > handlers for hot page info. > > Now if hot page promotion handler gets called first and if it promotes > the page, calling LRU registered handler still makes sense? May be not > I suppose. > > On the other hand if LRU subsystem handler gets first and it > adjusts/modifies the hot page's list, it would still make sense to > activate the hot page promotion handler to check for possible > promotion. > > Is this how you are envisioning the different consumers of hot page > access info could work/cooperate? 
Sorry, I have no idea what the right behavior is now. It appears hard to coordinate different consumers. In theory, we can promote the hottest pages while activating (in LRU lists) the warm pages. --- Best Regards, Huang, Ying ^ permalink raw reply [flat|nested] 33+ messages in thread
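One way to read the "promote the hottest, activate the warm" idea is as a simple temperature-threshold dispatch between consumers. The thresholds and names below are made-up illustrative values, not anything proposed in the thread.

```c
#include <assert.h>

enum hot_action { ACT_NONE, ACT_ACTIVATE, ACT_PROMOTE };

/* Illustrative thresholds in arbitrary temperature units. */
#define WARM_TEMP 10
#define HOT_TEMP  100

/* The hottest slow-tier pages are promoted across tiers; warm pages
 * (or hot pages already on the top tier) are only activated within
 * their LRU lists; everything else is left alone. */
enum hot_action classify_access(int temperature, int on_slow_tier)
{
    if (temperature >= HOT_TEMP && on_slow_tier)
        return ACT_PROMOTE;
    if (temperature >= WARM_TEMP)
        return ACT_ACTIVATE;
    return ACT_NONE;
}
```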
* Re: [LSF/MM/BPF TOPIC] Unifying sources of page temperature information - what info is actually wanted? 2025-02-16 6:49 ` Huang, Ying 2025-02-17 4:10 ` Bharata B Rao @ 2025-03-14 14:24 ` Jonathan Cameron 2025-03-17 22:34 ` Davidlohr Bueso 1 sibling, 1 reply; 33+ messages in thread From: Jonathan Cameron @ 2025-03-14 14:24 UTC (permalink / raw) To: Huang, Ying Cc: Raghavendra K T, linux-mm, akpm, lsf-pc, bharata, gourry, nehagholkar, abhishekd, nphamcs, hannes, feng.tang, kbusch, Hasan.Maruf, sj, david, willy, k.shutemov, mgorman, vbabka, hughd, rientjes, shy828301, liam.howlett, peterz, mingo, nadav.amit, shivankg, ziy, jhubbard, AneeshKumar.KizhakeVeetil, linux-kernel, jon.grimm, santosh.shukla, Michael.Day, riel, weixugc, leesuyeon0506, honggyu.kim, leillc, kmanaouil.dev, rppt, dave.hansen On Sun, 16 Feb 2025 14:49:50 +0800 "Huang, Ying" <ying.huang@linux.alibaba.com> wrote: > Hi, Jonathan, > > Sorry for late reply. Sorry for even later reply! > > Jonathan Cameron <Jonathan.Cameron@huawei.com> writes: > > > On Fri, 31 Jan 2025 12:28:03 +0000 > > Jonathan Cameron <Jonathan.Cameron@huawei.com> wrote: > > > >> > Here is the list of potential discussion points: > >> ... > >> > >> > 2. Possibility of maintaining single source of truth for page hotness that would > >> > maintain hot page information from multiple sources and let other sub-systems > >> > use that info. > >> Hi, > >> > >> I was thinking of proposing a separate topic on a single source of hotness, > >> but this question covers it so I'll add some thoughts here instead. > >> I think we are very early, but sharing some experience and thoughts in a > >> session may be useful. > > > > Thinking more on this over lunch, I think it is worth calling this out as a > > potential session topic in it's own right rather than trying to find > > time within other sessions. Hence the title change. 
> > > > I think a session would start with a brief listing of the temperature sources > > we have and those on the horizon to motivate what we are unifying, then > > discussion to focus on need for such a unification + requirements > > (maybe with a straw man). > > > >> > >> What do the other subsystems that want to use a single source of page hotness > >> want to be able to find out? (subject to filters like memory range, process etc) > >> > >> A) How hot is page X? > >> - Is this useful, or too much data? What would use it? > >> * Application optimization maybe. Very handy for developing algorithms > >> to do the rest of the options here as an Oracle! > >> - Provides both the cold and hot end of the scale, but maybe measurement > >> techniques vary and can not be easily combined. Hard in general to combine > >> multiple sources of truth if aiming for an absolute number. > >> > >> B) Which pages are super hot? > >> - Probably these that make the most difference if they are in a slower memory tier. > >> > >> C) Some pages are hot enough to consider moving? > >> - This may be good enough to get the key data into the fast memory over time. > >> - Can combine sources of info as being able to compare precise numbers doesn't matter. > >> > >> D) Which pages are fairly cold? > >> - Likewise maybe good enough over time. > >> > >> E) Which pages are very cold? > >> - Ideal case for tiering. Swap these with the super hot ones. > >> - Maybe extra signal for swap / zswap etc > >> > >> F) Did these hot pages remain hot (and same for cold) > >> - This is needed to know when to back off doing things as we have unstable > >> hotness (two phase applications are a pain for this), sampling a few > >> pages may be fine. > >> > >> Messy corners: > >> > >> Temporal aspects. > >> - If only providing lists of hottest / coldest in last second, very hard > >> to find those that are of a stable temperature. We end up moving > >> very hot data (which is disruptive) and it doesn't stay hot. 
> >> - Can reduce that effect by long sampling windows on some measurement approaches > >> (on hardware trackers that can trash accuracy due to resource exhaustion > >> and other subtle effects). > >> - bistable / phase based applications are a pain but perhaps up to higher > >> levels to back off. > >> > >> My main interest is migrating in tiered systems but good to look at what > >> else would use a common layer. > >> > >> Mostly I want to know something that is useful to move, and assume convergence > >> over the long term with the best things to move so to me the ideal layer has > >> following interface (strawman so shoot holes in it!): > >> > >> 1) Give me up to X hotish pages from a slow tier (greater than a specific measure > >> of temperature) > > Because the hot pages may be available upon page accessing (such PROT_NONE > page fault), the interface may be "push" style instead of "pull" style, > e.g., Absolutely agree that might be the approach, but with some form of back pressure, as for at least some approaches it is much cheaper to find a few hot pages than to find lots of them. More complex if you want a few of the very hottest or just hotter than X. > > int register_hot_page_handler(void (*handler)(struct page *hot_page, int temperature)); > > >> 2) Give me X coldish pages from a faster tier. > >> 3) I expect to ask again in X seconds so please have some info ready for me! > >> 4) (a path to get an idea of 'unhelpful moves' from earlier iterations - this > >> is bleeding the tiering application into a shared interface though). > > In addition to getting a list of hot/cold pages, it's also useful to get > hot/cold statistics of a memory device (NUMA node), e.g., something like > below, > > Access frequency percent > > 1000 Hz 10% > 600-1000 Hz 20% > 200-600 Hz 50% > 1-200 Hz 15% > < 1 Hz 5% > > Compared with a hot/cold page list, this may be gotten with lower > overhead and can be useful to tune the promotion/demotion algorithm. 
> At the same time, a sampled (incomplete) list of hot/cold pages may be > available too. I agree it's useful info and 'might' be cheaper to get. Depends on the tracking solution and impacts of sampling approaches. > > >> If we have multiple subsystems using the data we will need to resolve their > >> conflicting demands to generate good enough data with appropriate overhead. > >> > >> I'd also like a virtualized solution for case of hardware PA trackers (what > >> I have with CXL Hotness Monitoring Units) and classic memory pool / stranding > >> avoidance case where the VM is the right entity to make migration decisions. > >> Making that interface convey what the kernel is going to use would be an > >> efficient option. I'd like to hide how the sausage was made from the VM. > > --- > Best Regards, > Huang, Ying ^ permalink raw reply [flat|nested] 33+ messages in thread
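The per-node access-frequency statistics Huang Ying suggests could be summarized along the following lines. The bucket boundaries mirror the example table quoted above; the function and its interface are invented for illustration.

```c
#include <assert.h>

/* Buckets: >1000 Hz, 600-1000 Hz, 200-600 Hz, 1-200 Hz, <1 Hz,
 * matching the example histogram in the mail. */
#define NR_BUCKETS 5
static const double bucket_floor[NR_BUCKETS - 1] = { 1000.0, 600.0, 200.0, 1.0 };

/* Given per-page access frequencies for one NUMA node, fill percent[]
 * with the share of pages falling into each bucket. */
void node_freq_histogram(const double *freq, int n, double *percent)
{
    int count[NR_BUCKETS] = { 0 };

    for (int i = 0; i < n; i++) {
        int b = NR_BUCKETS - 1;                 /* < 1 Hz by default */
        for (int j = 0; j < NR_BUCKETS - 1; j++) {
            if (freq[i] > bucket_floor[j]) {
                b = j;
                break;
            }
        }
        count[b]++;
    }
    for (int b = 0; b < NR_BUCKETS; b++)
        percent[b] = n ? 100.0 * count[b] / n : 0.0;
}
```

Such a summary can be maintained incrementally as samples arrive, which fits the observation that it may be obtainable with lower overhead than a full hot-page list.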
* Re: [LSF/MM/BPF TOPIC] Unifying sources of page temperature information - what info is actually wanted? 2025-03-14 14:24 ` Jonathan Cameron @ 2025-03-17 22:34 ` Davidlohr Bueso 0 siblings, 0 replies; 33+ messages in thread From: Davidlohr Bueso @ 2025-03-17 22:34 UTC (permalink / raw) To: Jonathan Cameron Cc: Huang, Ying, Raghavendra K T, linux-mm, akpm, lsf-pc, bharata, gourry, nehagholkar, abhishekd, nphamcs, hannes, feng.tang, kbusch, Hasan.Maruf, sj, david, willy, k.shutemov, mgorman, vbabka, hughd, rientjes, shy828301, liam.howlett, peterz, mingo, nadav.amit, shivankg, ziy, jhubbard, AneeshKumar.KizhakeVeetil, linux-kernel, jon.grimm, santosh.shukla, Michael.Day, riel, weixugc, leesuyeon0506, honggyu.kim, leillc, kmanaouil.dev, rppt, dave.hansen On Fri, 14 Mar 2025, Jonathan Cameron wrote: >On Sun, 16 Feb 2025 14:49:50 +0800 >"Huang, Ying" <ying.huang@linux.alibaba.com> wrote: >> Because the hot pages may be available upon page accessing (such PROT_NONE >> page fault), the interface may be "push" style instead of "pull" style, >> e.g., > Right, I was also thinking along those lines. Hot pages could be fed right into kpromoted (with the appropriate interface for 'phi' of course), then kicked to do the migration. This already has the frequency and the destination node, so no guessing as to where the page should be placed. So this makes me wonder about kmmscand vs NUMAB=2... should both co-exist? Doubling the scanning overhead, so I think not (albeit non-mapped page cache pages). The original data from kmmscand is with a busted nid selection, but now Raghu has proposed some heuristics, so I am curious what kind of numbers come up in terms of accuracy and performance vs a NUMAB=2 migration offload. >Absolutely agree that might be the approach, but with some form of back pressure >as for at least some approaches it is much cheaper to find a few hot pages >than to find lots of them. More complex if you want a few of the very hottest >or just hotter than X. 
Yeah, also cases like different CXL type3 devices with different access latencies both saying here's what's hot. Thanks, Davidlohr ^ permalink raw reply [flat|nested] 33+ messages in thread
* Re: [LSF/MM/BPF TOPIC] Overhauling hot page detection and promotion based on PTE A bit scanning 2025-01-31 12:28 ` Jonathan Cameron 2025-01-31 13:09 ` [LSF/MM/BPF TOPIC] Unifying sources of page temperature information - what info is actually wanted? Jonathan Cameron @ 2025-02-03 2:23 ` Raghavendra K T 1 sibling, 0 replies; 33+ messages in thread From: Raghavendra K T @ 2025-02-03 2:23 UTC (permalink / raw) To: Jonathan Cameron Cc: linux-mm, akpm, lsf-pc, bharata, gourry, nehagholkar, abhishekd, ying.huang, nphamcs, hannes, feng.tang, kbusch, Hasan.Maruf, sj, david, willy, k.shutemov, mgorman, vbabka, hughd, rientjes, shy828301, liam.howlett, peterz, mingo, nadav.amit, shivankg, ziy, jhubbard, AneeshKumar.KizhakeVeetil, linux-kernel, jon.grimm, santosh.shukla, Michael.Day, riel, weixugc, leesuyeon0506, honggyu.kim, leillc, kmanaouil.dev, rppt, dave.hansen On 1/31/2025 5:58 PM, Jonathan Cameron wrote: > >> Here is the list of potential discussion points: > ... > >> 2. Possibility of maintaining single source of truth for page hotness that would >> maintain hot page information from multiple sources and let other sub-systems >> use that info. > Hi, > > I was thinking of proposing a separate topic on a single source of hotness, > but this question covers it so I'll add some thoughts here instead. > I think we are very early, but sharing some experience and thoughts in a > session may be useful. > > What do the other subsystems that want to use a single source of page hotness > want to be able to find out? (subject to filters like memory range, process etc) > > A) How hot is page X? > - Is this useful, or too much data? What would use it? > * Application optimization maybe. Very handy for developing algorithms > to do the rest of the options here as an Oracle! > - Provides both the cold and hot end of the scale, but maybe measurement > techniques vary and can not be easily combined. 
Hard in general to combine > multiple sources of truth if aiming for an absolute number. > > B) Which pages are super hot? > - Probably those that make the most difference if they are in a slower memory tier. > > C) Some pages are hot enough to consider moving? > - This may be good enough to get the key data into the fast memory over time. > - Can combine sources of info as being able to compare precise numbers doesn't matter. > > D) Which pages are fairly cold? > - Likewise maybe good enough over time. > > E) Which pages are very cold? > - Ideal case for tiering. Swap these with the super hot ones. > - Maybe extra signal for swap / zswap etc > > F) Did these hot pages remain hot (and same for cold) > - This is needed to know when to back off doing things as we have unstable > hotness (two phase applications are a pain for this), sampling a few > pages may be fine. > > Messy corners: > > Temporal aspects. > - If only providing lists of hottest / coldest in last second, very hard > to find those that are of a stable temperature. We end up moving > very hot data (which is disruptive) and it doesn't stay hot. > - Can reduce that effect by long sampling windows on some measurement approaches > (on hardware trackers that can trash accuracy due to resource exhaustion > and other subtle effects). > - bistable / phase based applications are a pain but perhaps up to higher > levels to back off. > > My main interest is migrating in tiered systems but good to look at what > else would use a common layer. > > Mostly I want to know something that is useful to move, and assume convergence > over the long term with the best things to move so to me the ideal layer has > following interface (strawman so shoot holes in it!): > > 1) Give me up to X hotish pages from a slow tier (greater than a specific measure > of temperature) > 2) Give me X coldish pages from a faster tier. 
> 3) I expect to ask again in X seconds so please have some info ready for me! > 4) (a path to get an idea of 'unhelpful moves' from earlier iterations - this > is bleeding the tiering application into a shared interface though). > Hello Jonathan, Thank you for listing all these points in detail. I agree with all the points in general. It is very hard/tricky to find the balance: e.g., if we are slow in finding hot pages with more accuracy, we need to store more information; on the other hand, prematurely moving pages may result in ping-ponging. So maybe we have to settle for moderately accurate information, while at the same time optimizing such that we avoid redundant scans from everybody. Thinking aloud again, apart from the slow-tier optimization potentials listed above, I also hope that we can provide the necessary information for the NUMAB=1 case, so that it gets to know more about hot VMAs (Mel had pointed out long back that identifying hot VMAs helps scanning). > If we have multiple subsystems using the data we will need to resolve their > conflicting demands to generate good enough data with appropriate overhead. > > I'd also like a virtualized solution for case of hardware PA trackers (what > I have with CXL Hotness Monitoring Units) and classic memory pool / stranding > avoidance case where the VM is the right entity to make migration decisions. > Making that interface convey what the kernel is going to use would be an > efficient option. I'd like to hide how the sausage was made from the VM. > Thanks and Regards - Raghu ^ permalink raw reply [flat|nested] 33+ messages in thread
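The balance Raghavendra describes — avoiding ping-ponging from prematurely moved pages without storing too much history — is often handled with a small hysteresis. A sketch, where `PROMOTE_AFTER` and the struct are arbitrary illustrative choices rather than anything from the posted patches:

```c
#include <assert.h>
#include <stdbool.h>

/* A page must look hot in this many consecutive scan rounds before it
 * is considered worth migrating; higher values trade promotion latency
 * for fewer ping-pong migrations. */
#define PROMOTE_AFTER 3

struct hot_state { int streak; };

/* Feed one per-round observation; returns true once the page has been
 * hot long enough that promotion is unlikely to be undone immediately. */
bool should_promote(struct hot_state *s, bool accessed)
{
    s->streak = accessed ? s->streak + 1 : 0;
    return s->streak >= PROMOTE_AFTER;
}
```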
* Re: [LSF/MM/BPF TOPIC] Overhauling hot page detection and promotion based on PTE A bit scanning 2025-01-23 10:57 [LSF/MM/BPF TOPIC] Overhauling hot page detection and promotion based on PTE A bit scanning Raghavendra K T ` (3 preceding siblings ...) 2025-01-31 12:28 ` Jonathan Cameron @ 2025-04-07 3:13 ` Bharata B Rao 4 siblings, 0 replies; 33+ messages in thread From: Bharata B Rao @ 2025-04-07 3:13 UTC (permalink / raw) To: Raghavendra K T, linux-mm, akpm, lsf-pc Cc: gourry, nehagholkar, abhishekd, ying.huang, nphamcs, hannes, feng.tang, kbusch, Hasan.Maruf, sj, david, willy, k.shutemov, mgorman, vbabka, hughd, rientjes, shy828301, liam.howlett, peterz, mingo, nadav.amit, shivankg, ziy, jhubbard, AneeshKumar.KizhakeVeetil, linux-kernel, jon.grimm, santosh.shukla, Michael.Day, riel, weixugc, leesuyeon0506, leillc, kmanaouil.dev, rppt, dave.hansen On 23-Jan-25 4:27 PM, Raghavendra K T wrote: > Bharata and I would like to propose the following topic for LSFMM. > > Topic: Overhauling hot page detection and promotion based on PTE A bit scanning. Slides that were used during LSFMM discussion - https://docs.google.com/presentation/d/1zLyGriEyky_HLJPrrdKdhAS7h5oiGf4tIdGuhGX3fJ8/edit?usp=sharing Regards, Bharata. ^ permalink raw reply [flat|nested] 33+ messages in thread
end of thread, other threads:[~2025-04-07 3:14 UTC | newest] Thread overview: 33+ messages (download: mbox.gz / follow: Atom feed) -- links below jump to the message on this page -- 2025-01-23 10:57 [LSF/MM/BPF TOPIC] Overhauling hot page detection and promotion based on PTE A bit scanning Raghavendra K T 2025-01-23 18:20 ` SeongJae Park 2025-01-24 8:54 ` Raghavendra K T 2025-01-24 18:05 ` Jonathan Cameron 2025-01-24 5:53 ` Hyeonggon Yoo 2025-01-24 9:02 ` Raghavendra K T 2025-01-27 7:01 ` David Rientjes 2025-01-27 7:11 ` Raghavendra K T 2025-02-06 3:14 ` Yuanchu Xie 2025-01-26 2:27 ` Huang, Ying 2025-01-27 5:11 ` Bharata B Rao 2025-01-27 18:34 ` SeongJae Park 2025-02-07 8:10 ` Huang, Ying 2025-02-07 9:06 ` Gregory Price 2025-02-07 19:52 ` SeongJae Park 2025-02-07 19:06 ` Davidlohr Bueso 2025-03-14 1:56 ` Raghavendra K T 2025-03-14 2:12 ` Raghavendra K T 2025-01-31 12:28 ` Jonathan Cameron 2025-01-31 13:09 ` [LSF/MM/BPF TOPIC] Unifying sources of page temperature information - what info is actually wanted? Jonathan Cameron 2025-02-05 6:24 ` Bharata B Rao 2025-02-05 16:05 ` Johannes Weiner 2025-02-06 6:46 ` SeongJae Park 2025-02-06 15:30 ` Jonathan Cameron 2025-02-07 9:50 ` Matthew Wilcox 2025-02-16 7:04 ` Huang, Ying 2025-02-16 6:49 ` Huang, Ying 2025-02-17 4:10 ` Bharata B Rao 2025-02-17 8:06 ` Huang, Ying 2025-03-14 14:24 ` Jonathan Cameron 2025-03-17 22:34 ` Davidlohr Bueso 2025-02-03 2:23 ` [LSF/MM/BPF TOPIC] Overhauling hot page detection and promotion based on PTE A bit scanning Raghavendra K T 2025-04-07 3:13 ` Bharata B Rao