* [Linux Memory Hotness and Promotion] Notes from October 23, 2025
From: David Rientjes @ 2025-11-03  0:41 UTC
  To: Davidlohr Bueso, Fan Ni, Gregory Price, Jonathan Cameron,
	Joshua Hahn, Raghavendra K T, Rao, Bharata Bhasker,
	SeongJae Park, Wei Xu, Xuezheng Chu, Yiannis Nikolakopoulos,
	Zi Yan
  Cc: linux-mm

Hi everybody,

Here are the notes from the last Linux Memory Hotness and Promotion call
that happened on Thursday, October 23.  Thanks to everybody who was
involved!

These notes are intended to bring people up to speed who could not attend 
the call as well as keep the conversation going in between meetings.

----->o-----
Ravi Jonnalagadda presented dynamic interleaving slides, co-developed with
Bijan Tabatabai, discussing the current approach of promoting all hot
pages into the DRAM tier and demoting all cold pages.  If the bandwidth
utilization is high, it will saturate the top tier even though there is
bandwidth available on the lower tier.  The preference was to demote cold
pages when under-utilizing memory in the top tier and then interleave hot
pages to maximize bandwidth utilization.  In Ravi's experimentation, the
saturation threshold has been 3/4 of the top tier's maximum write
bandwidth.  If this threshold is not reached, only cold memory is demoted.

Ravi suggested adaptive interleaving of memory to optimize both bandwidth 
and capacity utilization.  He suggested an approach of a migrator in 
kernel space and a calibrator in userspace.  The calibrator would monitor 
system bandwidth utilization and, using different weights, determine the 
optimal weights for interleaving the hot pages for the highest bandwidth.  
If bandwidth saturation is not hit, only cold pages get demoted.  The 
migrator reads the target interleave ratio from the calibrator, rearranges
the hot pages, and demotes cold pages to the target node.  Currently
this uses DAMOS policies, Migrate_hot and Migrate_cold.
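
As a rough sketch of the calibrator's decision logic above (purely
illustrative Python; the names, thresholds, and measured numbers are made
up, not from Ravi's presentation):

  SATURATION_FRACTION = 0.75  # 3/4 of the top tier's max write bandwidth

  def calibrate(top_bw, top_max_bw, measured_bw_by_weights):
      """Return (weights, demote_cold_only) for the in-kernel migrator."""
      if top_bw < SATURATION_FRACTION * top_max_bw:
          return None, True   # not saturated: demote cold pages only
      best = max(measured_bw_by_weights, key=measured_bw_by_weights.get)
      return best, False      # saturated: rearrange hot pages by 'best'

  print(calibrate(90, 100, {(1, 0): 88, (3, 1): 112, (1, 1): 104}))
  # -> ((3, 1), False): 3:1 interleave of hot pages gave the best bandwidth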

It was shown how the optimal weights change over time for both the 
multiload and MERCI benchmarks.  For MERCI, a few results using this 
approach were obtained (lower is better):

- Local DRAM
  + Avg Baseline Total Time - 1457.97 ms
  + Memory Footprint
    o Node 0 - 20.3 GB
- Static Weighted Interleave
  + Avg Baseline Total Time - 1023.81 ms
  + Memory Footprint
    o Node 0 - 10.3 GB
    o Node 1 - 10 GB
- Adaptive interleaving
  + Avg Baseline Total Time - 1030.41 ms
  + Memory Footprint
    o Node 0 - 7 GB
    o Node 1 - 13 GB

Jonathan Cameron asked: if we are using all of the bandwidth for this
benchmark, then what is the use of the extra capacity in the top tier?
Ravi said if there are two applications, one latency bound and the other
bandwidth bound, then we can run both at optimal levels.

Ravi suggested hotness information need not be used exclusively for 
promotion and that there is an advantage seen in rearranging hot pages 
based on weights.  He also suggested a standard subsystem that can provide 
bandwidth information would be very useful (including IBS, PEBS, and other
PMU sources).  Wei Xu noted this should be resctrl and Jonathan
agreed.

Ravi also noted a challenge where NUMA nodes may not be directly related 
to DRAM or CXL.  CXL nodes can be asymmetric with different bandwidth and 
capacity.  Similarly, we'd need to differentiate between direct attached 
and fabric attached bandwidth information. 

Asked about the methodology for the testing, Ravi noted that bandwidth 
monitoring is system wide but the migration and weights were application 
specific (virtual address space).

Wei noted a challenge: with CXL, read bandwidth can be differentiated, but
we cannot do it for writes today.  System wide, this would still be
possible, however.  Jonathan noted that with resctrl you can reserve some
allocation of bandwidth for a given application and optimize within that.

Wei asked why, given there will be significant overhead in migration, the
workloads here are not using hardware interleaving.  Ravi emphasized the
need for adaptive tuning where it was necessary to find the right weights
based on the application signature; this does not restrict the setup to
hard interleaving ratios.

Ravi's slides were attached to the shared drive.

----->o-----
Raghu noted, as an update to his patch series, that he finished the changes
previously discussed, but there were performance issues that he continues
to work on.

----->o-----
Shivank noted that he has prepared a presentation on kpromoted with
migration offload to DMA, which we will see at the next meeting.

----->o-----
Next meeting will be on Thursday, November 6 at 8:30am PST (UTC-8),
everybody is welcome: https://meet.google.com/jak-ytdx-hnm

NOTE!!!  Daylight Saving Time has ended in the United States, so please
check your local time carefully:

Time zones

PST (UTC-8)		8:30am
MST (UTC-7)		9:30am
CST (UTC-6)		10:30am
EST (UTC-5)		11:30am
Rio de Janeiro (UTC-3)	1:30pm
London (UTC)		4:30pm
Berlin (UTC+1)		5:30pm
Moscow (UTC+3)		7:30pm
Dubai (UTC+4)		8:30pm
Mumbai (UTC+5:30)	10:00pm
Singapore (UTC+8)	12:30am Friday
Beijing (UTC+8)		12:30am Friday
Tokyo (UTC+9)		1:30am Friday
Sydney (UTC+11)		3:30am Friday
Auckland (UTC+13)	5:30am Friday

Topics for the next meeting:

 - discuss generalized subsystem for providing bandwidth information
   independent of the underlying platform, ideally through resctrl,
   otherwise utilizing bandwidth information will be challenging
   + preferably this bandwidth monitoring is not per NUMA node but rather
     per tier (slow and fast)
 - similarly, discuss generalized subsystem for providing memory hotness
   information
 - determine minimal viable upstream opportunity to optimize for tiering
   that is extensible for future use cases and optimizations
 - Shivank presentation for kpromoted with migration offload to DMA
 - update on the latest kmigrated series from Bharata as discussed in the
   last meeting and combining all sources of memory hotness
   + discuss performance optimizations achieved by Shivank with migration
     offload
 - update on Raghu's series after addressing Jonathan's comments
 - update on non-temporal stores enlightenment for memory tiering
 - enlightening migrate_pages() for hardware assists and how this work
   will be charged to userspace
 - discuss overall testing and benchmarking methodology for various
   approaches as we go along

Please let me know if you'd like to propose additional topics for
discussion, thank you!



* Re: [Linux Memory Hotness and Promotion] Notes from October 23, 2025
From: SeongJae Park @ 2025-11-14  1:42 UTC
  To: David Rientjes
  Cc: SeongJae Park, Davidlohr Bueso, Fan Ni, Gregory Price,
	Jonathan Cameron, Joshua Hahn, Raghavendra K T, Rao,
	Bharata Bhasker, Wei Xu, Xuezheng Chu, Yiannis Nikolakopoulos,
	Zi Yan, linux-mm, damon, Honggyu Kim, Yunjeong Mun

Cc-ing HMSDK developers and DAMON mailing list.

On Sun, 2 Nov 2025 16:41:19 -0800 (PST) David Rientjes <rientjes@google.com> wrote:

> Hi everybody,
> 
> Here are the notes from the last Linux Memory Hotness and Promotion call
> that happened on Thursday, October 23.  Thanks to everybody who was
> involved!
> 
> These notes are intended to bring people up to speed who could not attend 
> the call as well as keep the conversation going in between meetings.

I was unable to join the call due to a conflict.  These notes are very
helpful; thank you for taking and sharing them, David!

> 
> ----->o-----
> Ravi Jonnalagadda presented dynamic interleaving slides, co-developed with
> Bijan Tabatabai, discussing the current approach of promoting all hot
> pages into the DRAM tier and demoting all cold pages.  If the bandwidth
> utilization is high, it will saturate the top tier even though there is
> bandwidth available on the lower tier.  The preference was to demote cold
> pages when under-utilizing memory in the top tier and then interleave hot
> pages to maximize bandwidth utilization.  In Ravi's experimentation, the
> saturation threshold has been 3/4 of the top tier's maximum write
> bandwidth.  If this threshold is not reached, only cold memory is demoted.

I had a welcome chance to discuss the above in more detail with Ravi.
Sharing my detailed thoughts here, too.

I agree with the concern.  I have also heard similar concerns about general
latency-aware memory tiering approaches from multiple people in the past.

The memory capacity extension solution of HMSDK [1], which is developed by SK
Hynix, is one good example.  To my understanding (please correct me if I'm
wrong), HMSDK is providing separate solutions for bandwidth and capacity
expansions.  The user should first understand whether their workload is
bandwidth-hungry or capacity-hungry, and select a proper solution.  I suspect
the concern from Ravi was one of the reasons.

I also recently developed a DAMON-based memory tiering approach [2] that
implements the main idea of TPP [3]: promoting and demoting hot and cold
pages aiming at a target level of the faster node's space utilization.  I
didn't see the bandwidth issue in my simple tests of it, but I think the
very same problem applies to both the DAMON-based approach and the original
TPP implementation.

> 
> Ravi suggested adaptive interleaving of memory to optimize both bandwidth 
> and capacity utilization.  He suggested an approach of a migrator in 
> kernel space and a calibrator in userspace.  The calibrator would monitor 
> system bandwidth utilization and, using different weights, determine the 
> optimal weights for interleaving the hot pages for the highest bandwidth.  
> If bandwidth saturation is not hit, only cold pages get demoted.  The 
> migrator reads the target interleave ratio from the calibrator, rearranges
> the hot pages, and demotes cold pages to the target node.  Currently
> this uses DAMOS policies, Migrate_hot and Migrate_cold.

This implementation makes sense to me, especially if the aimed use case is for
specific virtual address spaces.  Nevertheless, if a physical address space
based version is also an option, I think there could be yet another way to
achieve the goal (optimizing both bandwidth and capacity).

My idea is tweaking the TPP idea a little bit: migrate pages among NUMA nodes
aiming at a level of both space and bandwidth utilization of the faster (e.g.,
DRAM) node.  In more detail, do the hot pages promotion and cold pages
demotion for the target level of faster node space utilization, same as the
original TPP idea.  But stop the hot page promotions if the memory bandwidth
consumption of the faster node exceeds a level.  In that case, instead, start
demoting _hot_ pages until the memory bandwidth consumption on the faster node
decreases below the limit level.
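
To make the control flow concrete, here is a minimal sketch of that policy
(the threshold names and numbers are invented for illustration, not part of
the proposal):

  def next_action(space_used_pct, bw_used_pct,
                  space_target_pct=99.5, bw_limit_pct=95.0):
      if bw_used_pct > bw_limit_pct:
          return "demote_hot"    # over the bandwidth limit: demote hot pages
      if space_used_pct < space_target_pct:
          return "promote_hot"   # space left: keep promoting hot pages
      return "demote_cold"       # at the space target: demote cold pages

  print(next_action(space_used_pct=97.0, bw_used_pct=98.0))  # demote_hot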

I think this idea could easily be prototyped by extending the
DAMON-based TPP implementation [2].  Let me briefly explain the prototyping
idea assuming the readers are familiar with the DAMON-based TPP implementation.
If you are not familiar with it, please feel free to ask me questions, or refer
to the cover letter [2] of the patch series.

First, add another DAMOS quota goal for the hot pages promotion scheme.  The
goal will aim to achieve a high level of memory bandwidth consumption on the
faster node.  The target level will be reasonably high, but not too high, so
that some headroom remains.  So the hot pages promotion scheme will be
activated at the beginning, promote hot pages, and increase the faster node's
space and bandwidth utilization.  But if the memory bandwidth consumption of
the faster node surpasses the target level as a result of the hot pages
promotion or the workload's access pattern change, the hot pages promotion
scheme will become less aggressive and eventually stop.

Second, add another DAMOS scheme to the faster node access monitoring DAMON
context.  The new scheme does hot pages demotion with a quota goal that aims
to keep the unused (free, or available) memory bandwidth of the faster node
at a headroom level.  This scheme will do nothing at the beginning since the
faster node may have more available (unused) memory bandwidth than the
headroom level.  The scheme will start the hot pages demotion once the faster
node's available memory bandwidth becomes less than the desired headroom
level, due to increased loads or the hot pages promotion.  And once the
unused memory bandwidth of the faster node becomes higher than the headroom
level as a result of the hot pages demotion or access pattern change, the hot
pages demotion will be deactivated again.

For example, a change like below can be made to the simple DAMON-based TPP
implementation [4].

diff --git a/scripts/mem_tier.sh b/scripts/mem_tier.sh
index 9e685751..83757fa9 100644
--- a/scripts/mem_tier.sh
+++ b/scripts/mem_tier.sh
@@ -30,16 +30,25 @@ fi
 "$damo_bin" module stat write enabled N
 "$damo_bin" start \
        --numa_node 0 --monitoring_intervals_goal 4% 3 5ms 10s \
+               `# demote cold pages for faster node headroom space` \
                --damos_action migrate_cold 1 --damos_access_rate 0% 0% \
                --damos_apply_interval 1s \
                --damos_quota_interval 1s --damos_quota_space 200MB \
                --damos_quota_goal node_mem_free_bp 0.5% 0 \
                --damos_filter reject young \
+               `# demote hot pages for faster node headroom bandwidth` \
+               --damos_action migrate_hot 1 --damos_access_rate 5% max \
+                       --damos_apply_interval 1s \
+                       --damos_quota_interval 1s --damos_quota_space 200MB \
+                       --damos_quota_goal node_membw_free_bp 5% 0 \
+                       --damos_filter allow young \
        --numa_node 1 --monitoring_intervals_goal 4% 3 5ms 10s \
+               `# promote hot pages for faster node space/bandwidth high utilization` \
                --damos_action migrate_hot 0 --damos_access_rate 5% max \
                --damos_apply_interval 1s \
                --damos_quota_interval 1s --damos_quota_space 200MB \
                --damos_quota_goal node_mem_used_bp 99.7% 0 \
+               --damos_quota_goal node_membw_used_bp 95% 0 \
                --damos_filter allow young \
-               --damos_nr_quota_goals 1 1 --damos_nr_filters 1 1 \
-       --nr_targets 1 1 --nr_schemes 1 1 --nr_ctxs 1 1
+               --damos_nr_quota_goals 1 1 2 --damos_nr_filters 1 1 1 \
+       --nr_targets 1 1 --nr_schemes 2 1 --nr_ctxs 1 1

"node_membw_free_bp" and "node_membw_used_bp" are _imaginary_ DAMOS quota goal
metrics representing the available (unused) or consuming level of memory
bandiwdth of a given NUMA node.  Those are imaginery ones that arenot supported
on DAMON of today.  If this idea makes sense, we may develop the support of the
metrics.

But even before the metrics are implemented, we could prototype this as an
early proof of concept by setting the DAMOS quota goals using the user_input
goal metric [5] and running a user-space program that measures the memory
bandwidth of the faster node and feeds it to DAMON using the DAMON sysfs
interface.

Implementing both the memory bandwidth/space utilization monitoring and the
quota auto-tuning logic in user space, and directly adjusting the quotas of
DAMOS schemes instead of using the quota goals, could also be an option.
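
For instance, such a feeder could be as simple as the sketch below.  The
sysfs path layout should be double-checked against the DAMON documentation,
and read_node_bw() is a placeholder for a real bandwidth reader (e.g., a
PMU-based tool):

  import time

  # user_input quota goal of the promotion scheme; indexes are illustrative
  goal = "/sys/kernel/mm/damon/admin/kdamonds/0/contexts/0/schemes/0/" \
         "quota/goals/0"

  def read_node_bw(node):
      return 42_000  # placeholder: node's measured bandwidth, e.g., in MB/s

  while True:  # runs until killed
      with open(goal + "/current_value", "w") as f:
          f.write("%d" % read_node_bw(0))
      time.sleep(1)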

I have no plan to implement the "node_membw_{free,used}_bp" quota goal metrics
or do the user_input based prototyping at the moment.  But, as always, if you
want the features and/or are willing to step up to develop them, I will be
happy to help.

[...]
> Ravi suggested hotness information need not be used exclusively for 
> promotion and that there is an advantage seen in rearranging hot pages 
> based on weights.  He also suggested a standard subsystem that can provide 
> bandwidth information would be very useful (including IBS, PEBS, and other
> PMU sources).

If we decide to implement the above per-node memory bandwidth based DAMOS quota
goal metrics, I think this standard subsystem could also be useful for the
implementation.

FYI, users can _estimate_ memory bandwidth of the system or workloads from
DAMON's monitoring results snapshot.  For example, if DAMON is seeing a 1 GiB
memory region that is consistently being accessed about 10 times per second, we
can estimate it is consuming 10 GiB/s memory bandwidth.
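
Spelled out, the estimate is just the region size times the observed access
frequency (a back-of-the-envelope calculation using the numbers of the
example above):

  region_bytes = 1 << 30   # a 1 GiB region
  accesses_per_sec = 10    # consistently accessed about 10 times per second
  est_gib_per_sec = region_bytes * accesses_per_sec / (1 << 30)
  print(est_gib_per_sec)   # -> 10.0, i.e., ~10 GiB/s estimated bandwidth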

The DAMON user-space tool provides this estimated bandwidth per monitoring
results snapshot with the 'damo report access' command.  The DAMON_STAT
module, which was recently developed to provide system-wide, high-level data
access patterns in an easy way, also provides this _estimated_ memory
bandwidth usage.

[1] https://events.linuxfoundation.org/open-source-summit-korea/program/schedule/
[2] https://lore.kernel.org/all/20250420194030.75838-1-sj@kernel.org/
[3] https://symbioticlab.org/publications/files/tpp:asplos23/tpp-asplos23.pdf
[4] https://github.com/damonitor/damo/blob/v3.0.4/scripts/mem_tier.sh
[5] https://docs.kernel.org/mm/damon/design.html#aim-oriented-feedback-driven-auto-tuning


Thanks,
SJ

[...]



* Re: [Linux Memory Hotness and Promotion] Notes from October 23, 2025
From: Honggyu Kim @ 2025-11-17 11:36 UTC
  To: SeongJae Park, David Rientjes
  Cc: kernel_team, Davidlohr Bueso, Fan Ni, Gregory Price,
	Jonathan Cameron, Joshua Hahn, Raghavendra K T, Rao,
	Bharata Bhasker, Wei Xu, Xuezheng Chu, Yiannis Nikolakopoulos,
	Zi Yan, linux-mm, damon, Yunjeong Mun

Hi SJ, David, Ravi and all,

On 11/14/2025 10:42 AM, SeongJae Park wrote:
> 
> Cc-ing HMSDK developers and DAMON mailing list.

Thanks for sharing with us.

AFAIU, the discussion is about whether to use bandwidth information as a metric
for page migration in addition to page temperature information.

In general, I think this makes sense in theory, but I would like to find out
whether there are practical workloads that can benefit from this work.

I've left some comments below.

> On Sun, 2 Nov 2025 16:41:19 -0800 (PST) David Rientjes <rientjes@google.com> wrote:
> 
>> Hi everybody,
>>
>> Here are the notes from the last Linux Memory Hotness and Promotion call
>> that happened on Thursday, October 23.  Thanks to everybody who was
>> involved!
>>
>> These notes are intended to bring people up to speed who could not attend
>> the call as well as keep the conversation going in between meetings.
> 
> I was unable to join the call due to a conflict.  These notes are very
> helpful; thank you for taking and sharing them, David!

I also find this summary very helpful.  Thanks very much, David!

BTW, could anyone please add me and Yunjeong to the cc list?  I was aware of
this meeting but wasn't able to join because the meeting time is right after
midnight here in Korea (1:30am).

If anyone adds us to the cc list, we will try to follow the discussion based
on the summary.

>>
>> ----->o-----
>> Ravi Jonnalagadda presented dynamic interleaving slides, co-developed with
>> Bijan Tabatabai, discussing the current approach of promoting all hot
>> pages into the DRAM tier and demoting all cold pages.  If the bandwidth
>> utilization is high, it will saturate the top tier even though there is
>> bandwidth available on the lower tier.  The preference was to demote cold
>> pages when under-utilizing memory in the top tier and then interleave hot
>> pages to maximize bandwidth utilization.  In Ravi's experimentation, the
>> saturation threshold has been 3/4 of the top tier's maximum write
>> bandwidth.  If this threshold is not reached, only cold memory is demoted.

I also think that this sounds right.

> I had a welcome chance to discuss the above in more detail with Ravi.
> Sharing my detailed thoughts here, too.
> 
> I agree with the concern.  I have also heard similar concerns about general
> latency-aware memory tiering approaches from multiple people in the past.
> 
> The memory capacity extension solution of HMSDK [1], which is developed by SK
> Hynix, is one good example.  To my understanding (please correct me if I'm
> wrong), HMSDK is providing separate solutions for bandwidth and capacity
> expansions.  The user should first understand whether their workload is
> bandwidth-hungry or capacity-hungry, and select a proper solution.  I suspect
> the concern from Ravi was one of the reasons.

Yeah, your understanding is correct in HMSDK cases.

> I also recently developed a DAMON-based memory tiering approach [2] that
> implements the main idea of TPP [3]: promoting and demoting hot and cold
> pages aiming at a target level of the faster node's space utilization.  I
> didn't see the bandwidth issue in my simple tests of it, but I think the
> very same problem applies to both the DAMON-based approach and the original
> TPP implementation.
> 
>>
>> Ravi suggested adaptive interleaving of memory to optimize both bandwidth
>> and capacity utilization.  He suggested an approach of a migrator in
>> kernel space and a calibrator in userspace.  The calibrator would monitor
>> system bandwidth utilization and, using different weights, determine the
>> optimal weights for interleaving the hot pages for the highest bandwidth.

I also think that monitoring bandwidth makes sense.  We recently released a
tool called bwprof for bandwidth recording and monitoring, based on Intel PCM.
https://github.com/skhynix/hmsdk/blob/hmsdk-v4.0/tools/bwprof/bwprof.cc

This tool can be slightly changed to monitor bandwidth and write it to some
sysfs interface knobs for this purpose.

>> If bandwidth saturation is not hit, only cold pages get demoted.  The
>> migrator reads the target interleave ratio from the calibrator, rearranges
>> the hot pages, and demotes cold pages to the target node.  Currently
>> this uses DAMOS policies, Migrate_hot and Migrate_cold.
> 
> This implementation makes sense to me, especially if the aimed use case is for
> specific virtual address spaces.

I think the current issues of adaptive weighted interleave are as follows.
1. The adaptive interleaving only works in virtual address mode.
2. It scans all pages and redistributes them based on the given weight
    ratios, which limits general usage as of now.

> Nevertheless, if a physical address space
> based version is also an option, I think there could be yet another way to
> achieve the goal (optimizing both bandwidth and capacity).

3. As mentioned above, a physical address mode is needed, but it means
    scanning the entire physical address space and redistributing pages,
    which might require too much overhead in practice.

> My idea is tweaking the TPP idea a little bit: migrate pages among NUMA nodes
> aiming at a level of both space and bandwidth utilization of the faster (e.g.,
> DRAM) node.  In more detail, do the hot pages promotion and cold pages
> demotion for the target level of faster node space utilization, same as the
> original TPP idea.  But stop the hot page promotions if the memory bandwidth
> consumption of the faster node exceeds a level.  In that case, instead, start
> demoting _hot_ pages until the memory bandwidth consumption on the faster node
> decreases below the limit level.

I also think this approach is more practical because it doesn't scan and
redistribute all pages, while still lowering bandwidth pressure on the fast
tier node.

> I think this idea could easily be prototyped by extending the
> DAMON-based TPP implementation [2].  Let me briefly explain the prototyping
> idea assuming the readers are familiar with the DAMON-based TPP implementation.
> If you are not familiar with it, please feel free to ask me questions, or refer
> to the cover letter [2] of the patch series.
> 
> First, add another DAMOS quota goal for the hot pages promotion scheme.  The
> goal will aim to achieve a high level of memory bandwidth consumption on the
> faster node.  The target level will be reasonably high, but not too high, so
> that some headroom remains.  So the hot pages promotion scheme will be
> activated at the beginning, promote hot pages, and increase the faster node's
> space and bandwidth utilization.  But if the memory bandwidth consumption of
> the faster node surpasses the target level as a result of the hot pages
> promotion or the workload's access pattern change, the hot pages promotion
> scheme will become less aggressive and eventually stop.
> 
> Second, add another DAMOS scheme to the faster node access monitoring DAMON
> context.  The new scheme does hot pages demotion with a quota goal that aims
> to keep the unused (free, or available) memory bandwidth of the faster node
> at a headroom level.  This scheme will do nothing at the beginning since the
> faster node may have more available (unused) memory bandwidth than the
> headroom level.  The scheme will start the hot pages demotion once the faster
> node's available memory bandwidth becomes less than the desired headroom
> level, due to increased loads or the hot pages promotion.  And once the
> unused memory bandwidth of the faster node becomes higher than the headroom
> level as a result of the hot pages demotion or access pattern change, the hot
> pages demotion will be deactivated again.
> 
> For example, a change like below can be made to the simple DAMON-based TPP
> implementation [4].
> 
> diff --git a/scripts/mem_tier.sh b/scripts/mem_tier.sh
> index 9e685751..83757fa9 100644
> --- a/scripts/mem_tier.sh
> +++ b/scripts/mem_tier.sh
> @@ -30,16 +30,25 @@ fi
>   "$damo_bin" module stat write enabled N
>   "$damo_bin" start \
>          --numa_node 0 --monitoring_intervals_goal 4% 3 5ms 10s \
> +               `# demote cold pages for faster node headroom space` \
>                  --damos_action migrate_cold 1 --damos_access_rate 0% 0% \
>                  --damos_apply_interval 1s \
>                  --damos_quota_interval 1s --damos_quota_space 200MB \
>                  --damos_quota_goal node_mem_free_bp 0.5% 0 \
>                  --damos_filter reject young \
> +               `# demote hot pages for faster node headroom bandwidth` \
> +               --damos_action migrate_hot 1 --damos_access_rate 5% max \
> +                       --damos_apply_interval 1s \
> +                       --damos_quota_interval 1s --damos_quota_space 200MB \
> +                       --damos_quota_goal node_membw_free_bp 5% 0 \
> +                       --damos_filter allow young \
>          --numa_node 1 --monitoring_intervals_goal 4% 3 5ms 10s \
> +               `# promote hot pages for faster node space/bandwidth high utilization` \
>                  --damos_action migrate_hot 0 --damos_access_rate 5% max \
>                  --damos_apply_interval 1s \
>                  --damos_quota_interval 1s --damos_quota_space 200MB \
>                  --damos_quota_goal node_mem_used_bp 99.7% 0 \
> +               --damos_quota_goal node_membw_used_bp 95% 0 \
>                  --damos_filter allow young \
> -               --damos_nr_quota_goals 1 1 --damos_nr_filters 1 1 \
> -       --nr_targets 1 1 --nr_schemes 1 1 --nr_ctxs 1 1
> +               --damos_nr_quota_goals 1 1 2 --damos_nr_filters 1 1 1 \
> +       --nr_targets 1 1 --nr_schemes 2 1 --nr_ctxs 1 1
> 
> "node_membw_free_bp" and "node_membw_used_bp" are _imaginary_ DAMOS quota goal
> metrics representing the available (unused) or consuming level of memory
> bandiwdth of a given NUMA node.  Those are imaginery ones that arenot supported
> on DAMON of today.  If this idea makes sense, we may develop the support of the
> metrics.

I think this makes sense.

> But even before the metrics are implemented, we could prototype this as an
> early proof of concept by setting the DAMOS quota goals using the user_input
> goal metric [5] and running a user-space program that measures the memory
> bandwidth of the faster node and feeds it to DAMON using the DAMON sysfs
> interface.
> 
> Implementing both the memory bandwidth/space utilization monitoring and the
> quota auto-tuning logic in user space, and directly adjusting the quotas of
> DAMOS schemes instead of using the quota goals, could also be an option.
> 
> I have no plan to implement the "node_membw_{free,used}_bp" quota goal metrics
> or do the user_input based prototyping at the moment.  But, as always, if you
> want the features and/or are willing to step up to develop them, I will be
> happy to help.
> 
> [...]
>> Ravi suggested hotness information need not be used exclusively for
>> promotion and that there is an advantage seen in rearranging hot pages
>> based on weights.  He also suggested a standard subsystem that can provide
>> bandwidth information would be very useful (including IBS, PEBS, and other
>> PMU sources).
> 
> If we decide to implement the above per-node memory bandwidth based DAMOS quota
> goal metrics, I think this standard subsystem could also be useful for the
> implementation.

As I mentioned at the top of this mail, I think this work makes sense in theory
but would like to find some practical workloads that can benefit from this
work.  I would be grateful if someone could share practical use cases on
large-scale memory systems.

Thanks,
Honggyu

> FYI, users can _estimate_ memory bandwidth of the system or workloads from
> DAMON's monitoring results snapshot.  For example, if DAMON is seeing a 1 GiB
> memory region that is consistently being accessed about 10 times per second, we
> can estimate it is consuming 10 GiB/s memory bandwidth.
> 
> The DAMON user-space tool provides this estimated bandwidth per monitoring
> results snapshot with the 'damo report access' command.  The DAMON_STAT
> module, which was recently developed to provide system-wide, high-level data
> access patterns in an easy way, also provides this _estimated_ memory
> bandwidth usage.
> 
> [1] https://events.linuxfoundation.org/open-source-summit-korea/program/schedule/
> [2] https://lore.kernel.org/all/20250420194030.75838-1-sj@kernel.org/
> [3] https://symbioticlab.org/publications/files/tpp:asplos23/tpp-asplos23.pdf
> [4] https://github.com/damonitor/damo/blob/v3.0.4/scripts/mem_tier.sh
> [5] https://docs.kernel.org/mm/damon/design.html#aim-oriented-feedback-driven-auto-tuning
> 
> 
> Thanks,
> SJ
> 
> [...]
> 




* Re: [Linux Memory Hotness and Promotion] Notes from October 23, 2025
From: SeongJae Park @ 2025-11-21  2:27 UTC
  To: Honggyu Kim
  Cc: SeongJae Park, David Rientjes, kernel_team, Davidlohr Bueso,
	Fan Ni, Gregory Price, Jonathan Cameron, Joshua Hahn,
	Raghavendra K T, Rao, Bharata Bhasker, Wei Xu, Xuezheng Chu,
	Yiannis Nikolakopoulos, Zi Yan, linux-mm, damon, Yunjeong Mun

On Mon, 17 Nov 2025 20:36:59 +0900 Honggyu Kim <honggyu.kim@sk.com> wrote:

> Hi SJ, David, Ravi and all,
> 
> On 11/14/2025 10:42 AM, SeongJae Park wrote:
[...]
> > The memory capacity extension solution of HMSDK [1], which is developed by SK
> > Hynix, is one good example.  To my understanding (please correct me if I'm
> > wrong), HMSDK is providing separate solutions for bandwidth and capacity
> > expansions.  The user should first understand whether their workload is
> > bandwidth-hungry or capacity-hungry, and select a proper solution.  I suspect
> > the concern from Ravi was one of the reasons.
> 
> Yeah, your understanding is correct in HMSDK cases.

Thank you for confirming!

> 
> > I also recently developed a DAMON-based memory tiering approach [2] that
> > implements the main idea of TPP [3]: promoting and demoting hot and cold
> > pages aiming at a target level of the faster node's space utilization.  I
> > didn't see the bandwidth issue in my simple tests of it, but I think the
> > very same problem applies to both the DAMON-based approach and the original
> > TPP implementation.
> > 
> >>
> >> Ravi suggested adaptive interleaving of memory to optimize both bandwidth
> >> and capacity utilization.  He suggested an approach of a migrator in
> >> kernel space and a calibrator in userspace.  The calibrator would monitor
> >> system bandwidth utilization and, using different weights, determine the
> >> optimal weights for interleaving the hot pages for the highest bandwidth.
> 
> I also think that monitoring bandwidth makes sense.  We recently released a
> tool called bwprof for bandwidth recording and monitoring, based on Intel PCM.
> https://github.com/skhynix/hmsdk/blob/hmsdk-v4.0/tools/bwprof/bwprof.cc
> 
> This tool can be slightly changed to monitor bandwidth and write it to some
> sysfs interface knobs for this purpose.

Thank you for introducing the tool.  I think it can be useful not only for
this case but also for general investigation and optimization of this kind of
memory system.

> 
> >> If bandwidth saturation is not hit, only cold pages get demoted.  The
> >> migrator reads the target interleave ratio from the calibrator, rearranges
> >> the hot pages, and demotes cold pages to the target node.  Currently
> >> this uses DAMOS policies, Migrate_hot and Migrate_cold.
> > 
> > This implementation makes sense to me, especially if the aimed use case is for
> > specific virtual address spaces.
> 
> I think the current issues of adaptive weighted interleave are as follows.
> 1. The adaptive interleaving only works in virtual address mode.

This is true.  But I don't really think this is an issue, since I have found
no clear physical address mode interleaving use case.  Since we have a clear
use case for virtual mode DAMOS-based interleaving, and I have heard of no
problem from that use case, I think "all is well".

By the way, interleaving in this context is somewhat confusing to me.
Technically speaking, it is DAMOS_MIGRATE_{HOT,COLD} towards multiple
destination nodes with different weights.  And how it should be implemented
on the physical address space (whether to decide the migration target node of
each page based on its physical address or its virtual address) was discussed
on the patch series for multiple migration destination nodes, but we haven't
found a good answer so far.  That's one of the reasons why physical mode
DAMOS migration to multiple destination nodes is not yet supported.

I understand you are saying it would be nice if Ravi's general idea
(optimizing both bandwidth and capacity) could be implemented not only for
virtual address spaces but also for the physical address space, since it
would be easier for sysadmins?  If so, I agree.  Please correct me if I'm
getting you wrong, though.

> 2. It scans all pages and redistributes them based on the given weight
>     ratios, which limits general usage as of now.

I think it depends on the detailed usage.  In this specific use case, to my
understanding (correct me if I'm wrong, Ravi), the user-space tool applies
interleaving (or, DAMOS_MIGRATE_HOT to multiple destination nodes) only for
hot pages.  Hence the scanning for interleaving will be executed only for
DAMON-found hot pages.  Also, users may use the DAMOS quota or similar features
to further tune the overhead.

Maybe my humble edit of the original mail confused you about Ravi's
implementation details?  Sorry if that's the case.

> 
> > Nevertheless, if a physical address space
> > based version is also an option, I think there could be yet another way to
> > achieve the goal (optimizing both bandwidth and capacity).
> 
> 3. As mentioned above, a physical address mode is needed, but it means
>     scanning the entire physical address space and redistributing pages,
>     which might require too much overhead in practice.

As with my reply to your second point above, I think the overhead could be
controlled by adjusting the target page hotness and/or the DAMOS quota.  Or I
might be misreading your opinion; please feel free to correct me in that
case.

Anyway, my idea is not very different from Ravi's.  It is just a more simply
re-phrased version of it.  In essence, I only changed the word 'interleave',
whose behavior on the physical address space is not very clear to me, to
'migrate_hot', and gave a more concrete example using DAMON user-space tool
commands.

> 
> > My idea is tweaking the TPP idea a little bit: migrate pages among NUMA nodes
> > aiming at a level of both space and bandwidth utilization of the faster (e.g.,
> > DRAM) node.  In more detail, do the hot pages promotion and cold pages
> > demotion for the target level of faster node space utilization, same as the
> > original TPP idea.  But stop the hot page promotions if the memory bandwidth
> > consumption of the faster node exceeds a level.  In that case, instead, start
> > demoting _hot_ pages until the memory bandwidth consumption on the faster node
> > decreases below the limit level.
[...]
> As I mentioned at the top of this mail, I think this work makes sense in theory

Glad to get public confirmation that I'm not the only one who sees what I see :D

> but would like to find some practical workloads that can benefit from this
> work.  I would be grateful if someone could share practical use cases on
> large-scale memory systems.

Fully agreed.  Buildable code is much better than words, and test results are
even better than such code.

Nevertheless, I have no good answer for the practical use cases of my idea for
now.  I even have no plan to find one myself at the moment, mainly because I
don't have CXL memory to test with.  So please don't be blocked by me.  I
will be more than happy to help at any chance though, as always :)


Thanks,
SJ

[...]


