* [Linux Memory Hotness and Promotion] Notes from October 23, 2025
From: David Rientjes @ 2025-11-03 0:41 UTC
To: Davidlohr Bueso, Fan Ni, Gregory Price, Jonathan Cameron,
Joshua Hahn, Raghavendra K T, Rao, Bharata Bhasker,
SeongJae Park, Wei Xu, Xuezheng Chu, Yiannis Nikolakopoulos,
Zi Yan
Cc: linux-mm
Hi everybody,
Here are the notes from the last Linux Memory Hotness and Promotion call
that happened on Thursday, October 23. Thanks to everybody who was
involved!
These notes are intended to bring people up to speed who could not attend
the call as well as keep the conversation going in between meetings.
----->o-----
Ravi Jonnalagadda presented dynamic interleaving slides, co-developed with
Bijan Tabatabai, discussing the current approach of promoting all hot
pages into the DRAM tier and demoting all cold pages. When bandwidth
utilization is high, this saturates the top tier even though bandwidth is
still available on the lower tier. The preference was to demote cold
pages when memory in the top tier is under-utilized and then interleave
hot pages to maximize bandwidth utilization. In Ravi's experimentation,
the saturation threshold has been 3/4 of the top tier's maximum write
bandwidth; if this threshold is not reached, only cold memory is demoted.
Ravi suggested adaptive interleaving of memory to optimize both bandwidth
and capacity utilization. He suggested an approach with a migrator in
kernel space and a calibrator in userspace. The calibrator would monitor
system bandwidth utilization and, by trying different weights, determine
the optimal weights for interleaving the hot pages for the highest
bandwidth. If bandwidth saturation is not hit, only cold pages get
demoted. The migrator reads the target interleave ratio from the
calibrator, rearranges the hot pages accordingly, and demotes cold pages
to the target node. Currently this uses the DAMOS actions migrate_hot
and migrate_cold.
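As a rough illustration of the calibrator/migrator split (a hypothetical
sketch, not Ravi's code; the bandwidth reader, the weight interface, and
all numbers below are stand-ins):

    # Hypothetical userspace calibrator loop. read_top_tier_write_bw_gbps()
    # and set_target_weights() are assumed stand-ins for a PMU-backed
    # counter read and the kernel-side migrator interface, respectively.
    import time

    TOP_TIER_MAX_WRITE_GBPS = 100.0               # assumed platform ceiling
    SATURATION = 0.75 * TOP_TIER_MAX_WRITE_GBPS   # the 3/4 threshold above

    def read_top_tier_write_bw_gbps() -> float:
        return 80.0  # placeholder; replace with a real counter read

    def set_target_weights(dram: int, cxl: int) -> None:
        print(f"target interleave weights: DRAM={dram} CXL={cxl}")

    for _ in range(10):  # one calibration step per second
        if read_top_tier_write_bw_gbps() >= SATURATION:
            # Top tier saturated: interleave hot pages across tiers,
            # probing different weights for the highest total bandwidth.
            set_target_weights(dram=3, cxl=1)
        else:
            # Below the threshold: keep hot pages in DRAM; the migrator
            # only demotes cold pages.
            set_target_weights(dram=1, cxl=0)
        time.sleep(1)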
It was shown how the optimal weights change over time for both the
multiload and MERCI benchmarks. For MERCI, a few results using this
approach were obtained (lower is better):
- Local DRAM
+ Avg Baseline Total Time - 1457.97 ms
+ Memory Footprint
o Node 0 - 20.3 GB
- Static Weighted Interleave
+ Avg Baseline Total Time - 1023.81 ms
+ Memory Footprint
o Node 0 - 10.3 GB
o Node 1 - 10 GB
- Adaptive interleaving
+ Avg Baseline Total Time - 1030.41 ms
+ Memory Footprint
o Node 0 - 7 GB
o Node 1 - 13 GB
Jonathan Cameron asked: if we are using all of the bandwidth for this
benchmark, what is the use of the extra capacity in the top tier? Ravi
said that if there are two applications, one latency bound and the other
bandwidth bound, then we can run both at optimal levels.
Ravi suggested that hotness information need not be used exclusively for
promotion and that there is an advantage in rearranging hot pages based
on weights. He also suggested that a standard subsystem that can provide
bandwidth information would be very useful (with sources such as IBS,
PEBS, and other PMUs). Wei Xu noted this should be resctrl and Jonathan
agreed.
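For reference, resctrl already exposes per-domain bandwidth counts via its
MBM files; a minimal sketch of reading them (assuming resctrl is mounted
at /sys/fs/resctrl with MBM support, and that mon_L3_00 is the domain name
on the machine at hand):

    # Estimate bandwidth from resctrl's monotonically increasing MBM counter.
    import time

    COUNTER = "/sys/fs/resctrl/mon_data/mon_L3_00/mbm_total_bytes"

    def read_bytes() -> int:
        with open(COUNTER) as f:
            return int(f.read())

    before = read_bytes()
    time.sleep(1)
    after = read_bytes()
    print(f"~{(after - before) / (1 << 20):.1f} MiB/s on L3 domain 0")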
Ravi also noted a challenge in that NUMA nodes may not map directly to
DRAM or CXL. CXL nodes can be asymmetric, with different bandwidth and
capacity. Similarly, we would need to differentiate between
direct-attached and fabric-attached bandwidth information.
Asked about the testing methodology, Ravi noted that bandwidth monitoring
is system wide, but the migration and weights are application specific
(per virtual address space).
Wei noted a challenge that we cannot differentiate write bandwidth with
CXL today; this is possible for reads, but not yet for writes. System
wide, however, it would still be possible. Jonathan noted that with
resctrl you can reserve some allocation of bandwidth for a given
application and optimize within that.
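A minimal sketch of Jonathan's point (run as root with resctrl mounted;
the 50% cap, domain ID, and PID below are placeholders):

    # Cap a resctrl group's memory bandwidth via the MB schemata resource,
    # then tune the application's placement within that budget.
    import os

    group = "/sys/fs/resctrl/bw_capped"
    os.makedirs(group, exist_ok=True)

    # Allow the group at most ~50% of memory bandwidth on domain 0.
    with open(os.path.join(group, "schemata"), "w") as f:
        f.write("MB:0=50\n")

    # Assign the workload (placeholder PID) to the capped group.
    with open(os.path.join(group, "tasks"), "w") as f:
        f.write("1234\n")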
Wei asked, given that there will be significant overhead in migration,
why the workloads here are not using hardware interleaving. Ravi
emphasized the need for adaptive tuning, where it was necessary to find
the right weights based on the application's signature; this avoids
restricting the setup to fixed hardware interleaving ratios.
Ravi's slides were attached to the shared drive.
----->o-----
Raghu noted as an update on his patch series that he has finished the
changes previously discussed, but there were performance issues that he
continues to work on.
----->o-----
Shivank noted that he has prepared a presentation on kpromoted with
migration offload to DMA, which we will see at the next instance of the
meeting.
----->o-----
Next meeting will be on Thursday, November 6 at 8:30am PST (UTC-8),
everybody is welcome: https://meet.google.com/jak-ytdx-hnm
NOTE!!! Daylight Saving Time has ended in the United States, so please
check your local time carefully:
Time zones
PST (UTC-8) 8:30am
MST (UTC-7) 9:30am
CST (UTC-6) 10:30am
EST (UTC-5) 11:30am
Rio de Janeiro (UTC-3) 1:30pm
London (UTC) 4:30pm
Berlin (UTC+1) 5:30pm
Moscow (UTC+3) 7:30pm
Dubai (UTC+4) 8:30pm
Mumbai (UTC+5:30) 10:00pm
Singapore (UTC+8) 12:30am Friday
Beijing (UTC+8) 12:30am Friday
Tokyo (UTC+9) 1:30am Friday
Sydney (UTC+11) 3:30am Friday
Auckland (UTC+13) 5:30am Friday
Topics for the next meeting:
- discuss generalized subsystem for providing bandwidth information
independent of the underlying platform, ideally through resctrl,
otherwise utilizing bandwidth information will be challenging
+ preferably this bandwidth monitoring is not per NUMA node but rather
per tier (slow vs. fast)
- similarly, discuss generalized subsystem for providing memory hotness
information
- determine minimal viable upstream opportunity to optimize for tiering
that is extensible for future use cases and optimizations
- Shivank presentation for kpromoted with migration offload to DMA
- update on the latest kmigrated series from Bharata as discussed in the
last meeting and combining all sources of memory hotness
+ discuss performance optimizations achieved by Shivank with migration
offload
- update on Raghu's series after addressing Jonathan's comments
- update on non-temporal stores enlightenment for memory tiering
- enlightening migrate_pages() for hardware assists and how this work
will be charged to userspace
- discuss overall testing and benchmarking methodology for various
approaches as we go along
Please let me know if you'd like to propose additional topics for
discussion, thank you!
* Re: [Linux Memory Hotness and Promotion] Notes from October 23, 2025
From: SeongJae Park @ 2025-11-14 1:42 UTC
To: David Rientjes
Cc: SeongJae Park, Davidlohr Bueso, Fan Ni, Gregory Price,
    Jonathan Cameron, Joshua Hahn, Raghavendra K T, Rao, Bharata Bhasker,
    Wei Xu, Xuezheng Chu, Yiannis Nikolakopoulos, Zi Yan, linux-mm, damon,
    Honggyu Kim, Yunjeong Mun

Cc-ing HMSDK developers and DAMON mailing list.

On Sun, 2 Nov 2025 16:41:19 -0800 (PST) David Rientjes <rientjes@google.com> wrote:

> Hi everybody,
>
> Here are the notes from the last Linux Memory Hotness and Promotion call
> that happened on Thursday, October 23. Thanks to everybody who was
> involved!
>
> These notes are intended to bring people up to speed who could not attend
> the call as well as keep the conversation going in between meetings.

I was unable to join the call due to a conflict. These notes are very
helpful. Thank you for taking and sharing them, David!

>
> ----->o-----
> Ravi Jonnalagadda presented dynamic interleaving slides, co-developed with
> Bijan Tabatabai, discussing the current approach of promoting all hot
> pages into the DRAM tier and demoting all cold pages. When bandwidth
> utilization is high, this saturates the top tier even though bandwidth is
> still available on the lower tier. The preference was to demote cold
> pages when memory in the top tier is under-utilized and then interleave
> hot pages to maximize bandwidth utilization. In Ravi's experimentation,
> the saturation threshold has been 3/4 of the top tier's maximum write
> bandwidth; if this threshold is not reached, only cold memory is demoted.

I had the grateful chance to discuss the above in more detail with Ravi.
Sharing my detailed thoughts here, too.

I agree with the concern. I have also heard similar concerns about
general latency-aware memory tiering approaches from multiple people in
the past.

The memory capacity extension solution of HMSDK [1], which is developed
by SK Hynix, is one good example. To my understanding (please correct me
if I'm wrong), HMSDK provides separate solutions for bandwidth and
capacity expansion. The user should first understand whether their
workload is bandwidth-hungry or capacity-hungry, and select the proper
solution. I suspect the concern from Ravi was one of the reasons.

I also recently developed a DAMON-based memory tiering approach [2] that
implements the main idea of TPP [3]: promoting and demoting hot and cold
pages aiming at a level of the faster node's space utilization. I didn't
see the bandwidth issue in my simple tests of it, but I think the very
same problem applies to both the DAMON-based approach and the original
TPP implementation.

>
> Ravi suggested adaptive interleaving of memory to optimize both bandwidth
> and capacity utilization. He suggested an approach with a migrator in
> kernel space and a calibrator in userspace. The calibrator would monitor
> system bandwidth utilization and, by trying different weights, determine
> the optimal weights for interleaving the hot pages for the highest
> bandwidth. If bandwidth saturation is not hit, only cold pages get
> demoted. The migrator reads the target interleave ratio from the
> calibrator, rearranges the hot pages accordingly, and demotes cold pages
> to the target node.
> Currently this uses the DAMOS actions migrate_hot and migrate_cold.

This implementation makes sense to me, especially if the aimed use case
is specific virtual address spaces. Nevertheless, if a physical address
space based version is also an option, I think there could be yet another
way to achieve the goal (optimizing both bandwidth and capacity).

My idea is to tweak the TPP idea a little bit: migrate pages among NUMA
nodes aiming at a level of both space and bandwidth utilization of the
faster (e.g., DRAM) node. In more detail, do the hot pages promotion and
cold pages demotion for the target level of faster node space
utilization, same as the original TPP idea. But stop the hot pages
promotion if the memory bandwidth consumption of the faster node exceeds
a level. In that case, instead, start demoting _hot_ pages until the
memory bandwidth consumption on the faster node decreases below the limit
level.

I think this idea could easily be prototyped by extending the DAMON-based
TPP implementation [2]. Let me briefly explain the prototyping idea,
assuming the readers are familiar with the DAMON-based TPP
implementation. If you are not familiar with it, please feel free to ask
me questions, or refer to the cover letter [2] of the patch series.

First, add another DAMOS quota goal to the hot pages promotion scheme.
The goal will aim to achieve a high level of memory bandwidth consumption
on the faster node. The target level will be reasonably high, but not too
high, to keep some headroom. So the hot pages promotion scheme will be
activated at the beginning, promote hot pages, and make the faster node's
space and bandwidth utilization increase. But if the memory bandwidth
consumption of the faster node surpasses the target level as a result of
the hot pages promotion or the workload's access pattern change, the hot
pages promotion scheme will become less aggressive and eventually stop.

Second, add another DAMOS scheme to the faster node access monitoring
DAMON context. The new scheme does hot pages demotion with a quota goal
that aims to keep the unused (free, or available) memory bandwidth of the
faster node at a headroom level. This scheme will do nothing at the
beginning, since the faster node may have available (unused) memory
bandwidth above the headroom level. It will start the hot pages demotion
once the faster node's available memory bandwidth becomes less than the
desired headroom level, due to increased loads or the hot pages
promotion. And once the unused memory bandwidth of the faster node
becomes higher than the headroom level as a result of the hot pages
demotion or access pattern change, the hot pages demotion will be
deactivated again.

For example, a change like the below can be made to the simple
DAMON-based TPP implementation [4].
diff --git a/scripts/mem_tier.sh b/scripts/mem_tier.sh
index 9e685751..83757fa9 100644
--- a/scripts/mem_tier.sh
+++ b/scripts/mem_tier.sh
@@ -30,16 +30,25 @@ fi
 "$damo_bin" module stat write enabled N
 "$damo_bin" start \
 	--numa_node 0 --monitoring_intervals_goal 4% 3 5ms 10s \
+	`# demote cold pages for faster node headroom space` \
 	--damos_action migrate_cold 1 --damos_access_rate 0% 0% \
 	--damos_apply_interval 1s \
 	--damos_quota_interval 1s --damos_quota_space 200MB \
 	--damos_quota_goal node_mem_free_bp 0.5% 0 \
 	--damos_filter reject young \
+	`# demote hot pages for faster node headroom bandwidth` \
+	--damos_action migrate_hot 1 --damos_access_rate 5% max \
+	--damos_apply_interval 1s \
+	--damos_quota_interval 1s --damos_quota_space 200MB \
+	--damos_quota_goal node_membw_free_bp 5% 0 \
+	--damos_filter allow young \
 	--numa_node 1 --monitoring_intervals_goal 4% 3 5ms 10s \
+	`# promote hot pages for faster node space/bandwidth high utilization` \
 	--damos_action migrate_hot 0 --damos_access_rate 5% max \
 	--damos_apply_interval 1s \
 	--damos_quota_interval 1s --damos_quota_space 200MB \
 	--damos_quota_goal node_mem_used_bp 99.7% 0 \
+	--damos_quota_goal node_membw_used_bp 95% 0 \
 	--damos_filter allow young \
-	--damos_nr_quota_goals 1 1 --damos_nr_filters 1 1 \
-	--nr_targets 1 1 --nr_schemes 1 1 --nr_ctxs 1 1
+	--damos_nr_quota_goals 1 1 2 --damos_nr_filters 1 1 1 \
+	--nr_targets 1 1 --nr_schemes 2 1 --nr_ctxs 1 1

"node_membw_free_bp" and "node_membw_used_bp" are _imaginary_ DAMOS quota
goal metrics representing the available (unused) or consumed level of
memory bandwidth of a given NUMA node. They are not supported in DAMON
today. If this idea makes sense, we may develop support for the metrics.

But even before the metrics are implemented, we could prototype this for
an early proof of concept by setting the DAMOS quota goals using the
user_input goal metric [5] and running a user-space program that measures
the memory bandwidth of the faster node and feeds it to DAMON via the
DAMON sysfs interface.

Implementing both the memory bandwidth/space utilization monitoring and
the quota auto-tuning logic in user space, and directly adjusting the
quotas of the DAMOS schemes instead of using the quota goals, could also
be an option.

I have no plan to implement the "node_membw_{free,used}_bp" quota goal
metrics or do the user_input based prototyping at the moment. But, as
always, if you want the features and/or are willing to step up for
development of the features, I will be happy to help.

[...]
> Ravi suggested that hotness information need not be used exclusively for
> promotion and that there is an advantage in rearranging hot pages based
> on weights. He also suggested that a standard subsystem that can provide
> bandwidth information would be very useful (with sources such as IBS,
> PEBS, and other PMUs).

If we decide to implement the above per-node memory bandwidth based DAMOS
quota goal metrics, I think this standard subsystem could also be useful
for the implementation.

FYI, users can _estimate_ the memory bandwidth of the system or of
workloads from DAMON's monitoring results snapshots. For example, if
DAMON is seeing a 1 GiB memory region that is consistently being accessed
about 10 times per second, we can estimate it is consuming 10 GiB/s of
memory bandwidth.

The DAMON user-space tool provides this estimated bandwidth per
monitoring results snapshot with the 'damo report access' command. The
DAMON_STAT module, which was recently developed to provide system-wide,
high-level data access patterns in an easy way, also provides this
_estimated_ memory bandwidth usage.
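To make the arithmetic of that estimate concrete (a back-of-the-envelope
sketch with the illustrative numbers above, not real DAMON output):

    # DAMON-style bandwidth estimate for one monitored region.
    region_size_bytes = 1 << 30   # a 1 GiB region
    accesses_per_second = 10      # observed access frequency
    estimated = region_size_bytes * accesses_per_second / (1 << 30)
    print(f"estimated bandwidth: {estimated:.0f} GiB/s")  # -> 10 GiB/s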
[1] https://events.linuxfoundation.org/open-source-summit-korea/program/schedule/
[2] https://lore.kernel.org/all/20250420194030.75838-1-sj@kernel.org/
[3] https://symbioticlab.org/publications/files/tpp:asplos23/tpp-asplos23.pdf
[4] https://github.com/damonitor/damo/blob/v3.0.4/scripts/mem_tier.sh
[5] https://docs.kernel.org/mm/damon/design.html#aim-oriented-feedback-driven-auto-tuning

Thanks,
SJ

[...]
* Re: [Linux Memory Hotness and Promotion] Notes from October 23, 2025
From: Honggyu Kim @ 2025-11-17 11:36 UTC
To: SeongJae Park, David Rientjes
Cc: kernel_team, Davidlohr Bueso, Fan Ni, Gregory Price,
    Jonathan Cameron, Joshua Hahn, Raghavendra K T, Rao, Bharata Bhasker,
    Wei Xu, Xuezheng Chu, Yiannis Nikolakopoulos, Zi Yan, linux-mm, damon,
    Yunjeong Mun

Hi SJ, David, Ravi and all,

On 11/14/2025 10:42 AM, SeongJae Park wrote:
>
> Cc-ing HMSDK developers and DAMON mailing list.

Thanks for sharing with us. AFAIU, the discussion is about whether to use
bandwidth information as a metric for page migration in addition to page
temperature information. In general, I think this makes sense in theory,
but I would like to find some practical workloads that can benefit from
this work. I've left some comments below.

> On Sun, 2 Nov 2025 16:41:19 -0800 (PST) David Rientjes <rientjes@google.com> wrote:
>
>> Hi everybody,
>>
>> Here are the notes from the last Linux Memory Hotness and Promotion call
>> that happened on Thursday, October 23. Thanks to everybody who was
>> involved!
>>
>> These notes are intended to bring people up to speed who could not attend
>> the call as well as keep the conversation going in between meetings.
>
> I was unable to join the call due to a conflict. These notes are very
> helpful. Thank you for taking and sharing them, David!

I also find this summary very helpful. Thanks very much David!

BTW, could anyone please add me and Yunjeong to the cc list? I was aware
of this meeting but wasn't able to join because the meeting time is right
after midnight here in Korea (1:30am). If anyone adds us to the cc list,
we will try to follow the discussion based on the summary.

>>
>> ----->o-----
>> Ravi Jonnalagadda presented dynamic interleaving slides, co-developed with
>> Bijan Tabatabai, discussing the current approach of promoting all hot
>> pages into the DRAM tier and demoting all cold pages. When bandwidth
>> utilization is high, this saturates the top tier even though bandwidth is
>> still available on the lower tier. The preference was to demote cold
>> pages when memory in the top tier is under-utilized and then interleave
>> hot pages to maximize bandwidth utilization. In Ravi's experimentation,
>> the saturation threshold has been 3/4 of the top tier's maximum write
>> bandwidth; if this threshold is not reached, only cold memory is demoted.

I also think that this sounds right.

> I had the grateful chance to discuss the above in more detail with Ravi.
> Sharing my detailed thoughts here, too.
>
> I agree with the concern. I have also heard similar concerns about
> general latency-aware memory tiering approaches from multiple people in
> the past.
>
> The memory capacity extension solution of HMSDK [1], which is developed
> by SK Hynix, is one good example. To my understanding (please correct me
> if I'm wrong), HMSDK provides separate solutions for bandwidth and
> capacity expansion. The user should first understand whether their
> workload is bandwidth-hungry or capacity-hungry, and select the proper
> solution. I suspect the concern from Ravi was one of the reasons.

Yeah, your understanding is correct in HMSDK's case.
> I also recently developed a DAMON-based memory tiering approach [2] that
> implements the main idea of TPP [3]: promoting and demoting hot and cold
> pages aiming at a level of the faster node's space utilization. I didn't
> see the bandwidth issue in my simple tests of it, but I think the very
> same problem applies to both the DAMON-based approach and the original
> TPP implementation.
>
>>
>> Ravi suggested adaptive interleaving of memory to optimize both bandwidth
>> and capacity utilization. He suggested an approach with a migrator in
>> kernel space and a calibrator in userspace. The calibrator would monitor
>> system bandwidth utilization and, by trying different weights, determine
>> the optimal weights for interleaving the hot pages for the highest
>> bandwidth.

I also think that monitoring bandwidth makes sense. We recently released
a tool called bwprof for bandwidth recording and monitoring based on
Intel PCM:
https://github.com/skhynix/hmsdk/blob/hmsdk-v4.0/tools/bwprof/bwprof.cc

This tool can be slightly changed to monitor bandwidth and write it to
some sysfs interface knobs for this purpose.

>> If bandwidth saturation is not hit, only cold pages get
>> demoted. The migrator reads the target interleave ratio from the
>> calibrator, rearranges the hot pages accordingly, and demotes cold pages
>> to the target node. Currently this uses the DAMOS actions migrate_hot
>> and migrate_cold.
>
> This implementation makes sense to me, especially if the aimed use case
> is specific virtual address spaces.

I think the current issues of adaptive weighted interleaving are as
follows.
1. The adaptive interleaving only works for virtual address mode.
2. It scans all pages and redistributes them based on the given weight
   ratios, which limits general usage as of now.

> Nevertheless, if a physical address space
> based version is also an option, I think there could be yet another way
> to achieve the goal (optimizing both bandwidth and capacity).

3. As mentioned above, having a physical address mode is needed, but that
   means scanning the entire physical address space and redistributing
   the pages, which might require too much overhead in practice.

> My idea is to tweak the TPP idea a little bit: migrate pages among NUMA
> nodes aiming at a level of both space and bandwidth utilization of the
> faster (e.g., DRAM) node. In more detail, do the hot pages promotion and
> cold pages demotion for the target level of faster node space
> utilization, same as the original TPP idea. But stop the hot pages
> promotion if the memory bandwidth consumption of the faster node exceeds
> a level. In that case, instead, start demoting _hot_ pages until the
> memory bandwidth consumption on the faster node decreases below the
> limit level.

I also think this approach is more practical because it doesn't scan and
redistribute all pages, while lowering bandwidth pressure on the fast
tier node.

> I think this idea could easily be prototyped by extending the DAMON-based
> TPP implementation [2]. Let me briefly explain the prototyping idea,
> assuming the readers are familiar with the DAMON-based TPP
> implementation. If you are not familiar with it, please feel free to ask
> me questions, or refer to the cover letter [2] of the patch series.
>
> First, add another DAMOS quota goal to the hot pages promotion scheme.
> The goal will aim to achieve a high level of memory bandwidth consumption
> on the faster node. The target level will be reasonably high, but not too
> high, to keep some headroom.
> So the hot pages promotion scheme will be activated at the beginning,
> promote hot pages, and make the faster node's space and bandwidth
> utilization increase. But if the memory bandwidth consumption of the
> faster node surpasses the target level as a result of the hot pages
> promotion or the workload's access pattern change, the hot pages
> promotion scheme will become less aggressive and eventually stop.
>
> Second, add another DAMOS scheme to the faster node access monitoring
> DAMON context. The new scheme does hot pages demotion with a quota goal
> that aims to keep the unused (free, or available) memory bandwidth of the
> faster node at a headroom level. This scheme will do nothing at the
> beginning, since the faster node may have available (unused) memory
> bandwidth above the headroom level. It will start the hot pages demotion
> once the faster node's available memory bandwidth becomes less than the
> desired headroom level, due to increased loads or the hot pages
> promotion. And once the unused memory bandwidth of the faster node
> becomes higher than the headroom level as a result of the hot pages
> demotion or access pattern change, the hot pages demotion will be
> deactivated again.
>
> For example, a change like the below can be made to the simple
> DAMON-based TPP implementation [4].
>
> diff --git a/scripts/mem_tier.sh b/scripts/mem_tier.sh
> index 9e685751..83757fa9 100644
> --- a/scripts/mem_tier.sh
> +++ b/scripts/mem_tier.sh
> @@ -30,16 +30,25 @@ fi
>  "$damo_bin" module stat write enabled N
>  "$damo_bin" start \
>  	--numa_node 0 --monitoring_intervals_goal 4% 3 5ms 10s \
> +	`# demote cold pages for faster node headroom space` \
>  	--damos_action migrate_cold 1 --damos_access_rate 0% 0% \
>  	--damos_apply_interval 1s \
>  	--damos_quota_interval 1s --damos_quota_space 200MB \
>  	--damos_quota_goal node_mem_free_bp 0.5% 0 \
>  	--damos_filter reject young \
> +	`# demote hot pages for faster node headroom bandwidth` \
> +	--damos_action migrate_hot 1 --damos_access_rate 5% max \
> +	--damos_apply_interval 1s \
> +	--damos_quota_interval 1s --damos_quota_space 200MB \
> +	--damos_quota_goal node_membw_free_bp 5% 0 \
> +	--damos_filter allow young \
>  	--numa_node 1 --monitoring_intervals_goal 4% 3 5ms 10s \
> +	`# promote hot pages for faster node space/bandwidth high utilization` \
>  	--damos_action migrate_hot 0 --damos_access_rate 5% max \
>  	--damos_apply_interval 1s \
>  	--damos_quota_interval 1s --damos_quota_space 200MB \
>  	--damos_quota_goal node_mem_used_bp 99.7% 0 \
> +	--damos_quota_goal node_membw_used_bp 95% 0 \
>  	--damos_filter allow young \
> -	--damos_nr_quota_goals 1 1 --damos_nr_filters 1 1 \
> -	--nr_targets 1 1 --nr_schemes 1 1 --nr_ctxs 1 1
> +	--damos_nr_quota_goals 1 1 2 --damos_nr_filters 1 1 1 \
> +	--nr_targets 1 1 --nr_schemes 2 1 --nr_ctxs 1 1
>
> "node_membw_free_bp" and "node_membw_used_bp" are _imaginary_ DAMOS quota
> goal metrics representing the available (unused) or consumed level of
> memory bandwidth of a given NUMA node. They are not supported in DAMON
> today. If this idea makes sense, we may develop support for the metrics.

I think this makes sense.

> But even before the metrics are implemented, we could prototype this for
> an early proof of concept by setting the DAMOS quota goals using the
> user_input goal metric [5] and running a user-space program that measures
> the memory bandwidth of the faster node and feeds it to DAMON via the
> DAMON sysfs interface.
>
> Implementing both the memory bandwidth/space utilization monitoring and
> the quota auto-tuning logic in user space, and directly adjusting the
> quotas of the DAMOS schemes instead of using the quota goals, could also
> be an option.
>
> I have no plan to implement the "node_membw_{free,used}_bp" quota goal
> metrics or do the user_input based prototyping at the moment. But, as
> always, if you want the features and/or are willing to step up for
> development of the features, I will be happy to help.
>
> [...]
>> Ravi suggested that hotness information need not be used exclusively for
>> promotion and that there is an advantage in rearranging hot pages based
>> on weights. He also suggested that a standard subsystem that can provide
>> bandwidth information would be very useful (with sources such as IBS,
>> PEBS, and other PMUs).
>
> If we decide to implement the above per-node memory bandwidth based DAMOS
> quota goal metrics, I think this standard subsystem could also be useful
> for the implementation.

As I mentioned at the top of this mail, I think this work makes sense in
theory, but I would like to find some practical workloads that can
benefit from it. I would be grateful if someone could share practical use
cases on large scale memory systems.

Thanks,
Honggyu

> FYI, users can _estimate_ the memory bandwidth of the system or of
> workloads from DAMON's monitoring results snapshots. For example, if
> DAMON is seeing a 1 GiB memory region that is consistently being accessed
> about 10 times per second, we can estimate it is consuming 10 GiB/s of
> memory bandwidth.
>
> The DAMON user-space tool provides this estimated bandwidth per
> monitoring results snapshot with the 'damo report access' command. The
> DAMON_STAT module, which was recently developed to provide system-wide,
> high-level data access patterns in an easy way, also provides this
> _estimated_ memory bandwidth usage.
>
> [1] https://events.linuxfoundation.org/open-source-summit-korea/program/schedule/
> [2] https://lore.kernel.org/all/20250420194030.75838-1-sj@kernel.org/
> [3] https://symbioticlab.org/publications/files/tpp:asplos23/tpp-asplos23.pdf
> [4] https://github.com/damonitor/damo/blob/v3.0.4/scripts/mem_tier.sh
> [5] https://docs.kernel.org/mm/damon/design.html#aim-oriented-feedback-driven-auto-tuning
>
> Thanks,
> SJ
>
> [...]
* Re: [Linux Memory Hotness and Promotion] Notes from October 23, 2025
From: SeongJae Park @ 2025-11-21 2:27 UTC
To: Honggyu Kim
Cc: SeongJae Park, David Rientjes, kernel_team, Davidlohr Bueso, Fan Ni,
    Gregory Price, Jonathan Cameron, Joshua Hahn, Raghavendra K T, Rao,
    Bharata Bhasker, Wei Xu, Xuezheng Chu, Yiannis Nikolakopoulos, Zi Yan,
    linux-mm, damon, Yunjeong Mun

On Mon, 17 Nov 2025 20:36:59 +0900 Honggyu Kim <honggyu.kim@sk.com> wrote:

> Hi SJ, David, Ravi and all,
>
> On 11/14/2025 10:42 AM, SeongJae Park wrote:
[...]
> > The memory capacity extension solution of HMSDK [1], which is developed
> > by SK Hynix, is one good example. To my understanding (please correct me
> > if I'm wrong), HMSDK provides separate solutions for bandwidth and
> > capacity expansion. The user should first understand whether their
> > workload is bandwidth-hungry or capacity-hungry, and select the proper
> > solution. I suspect the concern from Ravi was one of the reasons.
>
> Yeah, your understanding is correct in HMSDK's case.

Thank you for confirming!

> > I also recently developed a DAMON-based memory tiering approach [2] that
> > implements the main idea of TPP [3]: promoting and demoting hot and cold
> > pages aiming at a level of the faster node's space utilization. I didn't
> > see the bandwidth issue in my simple tests of it, but I think the very
> > same problem applies to both the DAMON-based approach and the original
> > TPP implementation.
> >
> >>
> >> Ravi suggested adaptive interleaving of memory to optimize both bandwidth
> >> and capacity utilization. He suggested an approach with a migrator in
> >> kernel space and a calibrator in userspace. The calibrator would monitor
> >> system bandwidth utilization and, by trying different weights, determine
> >> the optimal weights for interleaving the hot pages for the highest
> >> bandwidth.
>
> I also think that monitoring bandwidth makes sense. We recently released
> a tool called bwprof for bandwidth recording and monitoring based on
> Intel PCM:
> https://github.com/skhynix/hmsdk/blob/hmsdk-v4.0/tools/bwprof/bwprof.cc
>
> This tool can be slightly changed to monitor bandwidth and write it to
> some sysfs interface knobs for this purpose.

Thank you for introducing the tool. I think it can be useful not only for
this case but also for general investigations and optimizations of this
kind of memory system.

> >> If bandwidth saturation is not hit, only cold pages get
> >> demoted. The migrator reads the target interleave ratio from the
> >> calibrator, rearranges the hot pages accordingly, and demotes cold pages
> >> to the target node. Currently this uses the DAMOS actions migrate_hot
> >> and migrate_cold.
> >
> > This implementation makes sense to me, especially if the aimed use case
> > is specific virtual address spaces.
>
> I think the current issues of adaptive weighted interleaving are as
> follows.
> 1. The adaptive interleaving only works for virtual address mode.

This is true. But I don't really think this is an issue, since I have
found no clear physical address mode interleaving use case. Since we have
a clear use case for virtual mode DAMOS-based interleaving, and I have
heard of no problem from that use case, I think "all is well".

By the way, "interleaving" in this context is somewhat confusing to me.
Technically speaking, it is DAMOS_MIGRATE_{HOT,COLD} towards multiple
destination nodes with different weights.
And how it should be implemented on the physical address space (whether
to decide the migration target node of each page based on its physical
address or on its virtual address) was discussed on the patch series for
multiple migration destination nodes, but we haven't found a good answer
so far. That's one of the reasons why physical mode DAMOS migration to
multiple destination nodes is not yet supported.

I understand you are saying it would be nice if Ravi's general idea
(optimizing both bandwidth and capacity) could be implemented not only
for virtual address spaces but also for the physical address space, since
that would be easier for sysadmins? I agree, if so. Please correct me if
I'm getting you wrong, though.

> 2. It scans all pages and redistributes them based on the given weight
>    ratios, which limits general usage as of now.

I think it depends on the detailed usage. In this specific use case, to
my understanding (correct me if I'm wrong, Ravi), the user-space tool
applies interleaving (or, DAMOS_MIGRATE_HOT to multiple destination
nodes) only to hot pages. Hence the scanning for interleaving will be
executed only for DAMON-found hot pages. Also, users may use the DAMOS
quota or similar features to further tune the overhead. Maybe my humble
edit of the original mail confused you about Ravi's implementation
details? Sorry if that's the case.

> > Nevertheless, if a physical address space
> > based version is also an option, I think there could be yet another way
> > to achieve the goal (optimizing both bandwidth and capacity).
>
> 3. As mentioned above, having a physical address mode is needed, but that
>    means scanning the entire physical address space and redistributing
>    the pages, which might require too much overhead in practice.

Same as my comment on your second point above, I think the overhead could
be controlled by adjusting the target page hotness and/or the DAMOS
quota. Or, I might be misreading your opinion. Please feel free to
correct me in that case.

Anyway, my idea is not very different from Ravi's one. It is just a more
simply re-phrased version of it. In essence, I only changed the word
'interleave', whose behavior on the physical address space is not very
clear to me, to 'migrate_hot', and gave a more concrete example using
DAMON user-space tool commands.

> > My idea is to tweak the TPP idea a little bit: migrate pages among NUMA
> > nodes aiming at a level of both space and bandwidth utilization of the
> > faster (e.g., DRAM) node. In more detail, do the hot pages promotion and
> > cold pages demotion for the target level of faster node space
> > utilization, same as the original TPP idea. But stop the hot pages
> > promotion if the memory bandwidth consumption of the faster node exceeds
> > a level. In that case, instead, start demoting _hot_ pages until the
> > memory bandwidth consumption on the faster node decreases below the
> > limit level.
[...]
> As I mentioned at the top of this mail, I think this work makes sense in
> theory,

Glad to get public confirmation that I'm not the only one who sees what I
see :D

> but I would like to find some practical workloads that can benefit
> from it. I would be grateful if someone could share practical use cases
> on large scale memory systems.

Fully agreed. Buildable code is much better than words, and test results
are even better than such code. Nevertheless, I have no good answer on
practical use cases for my idea, for now. I even have no plan to find
them myself at the moment, mainly because I don't have CXL memory to test
with, for now.
So please don't be blocked by me. I will be more than happy to help at
any chance though, as always :)

Thanks,
SJ

[...]