From: David Rientjes <rientjes@google.com>
To: Davidlohr Bueso <dave@stgolabs.net>, Fan Ni <nifan.cxl@gmail.com>,
Gregory Price <gourry@gourry.net>,
Jonathan Cameron <Jonathan.Cameron@huawei.com>,
Joshua Hahn <joshua.hahnjy@gmail.com>,
Raghavendra K T <rkodsara@amd.com>,
"Rao, Bharata Bhasker" <bharata@amd.com>,
SeongJae Park <sj@kernel.org>, Wei Xu <weixugc@google.com>,
Xuezheng Chu <xuezhengchu@huawei.com>,
Yiannis Nikolakopoulos <yiannis@zptcorp.com>,
Zi Yan <ziy@nvidia.com>
Cc: linux-mm@kvack.org
Subject: [Linux Memory Hotness and Promotion] Notes from October 23, 2025
Date: Sun, 2 Nov 2025 16:41:19 -0800 (PST)
Message-ID: <d952a84f-332e-8f7a-4816-2c1cbd8f5b00@google.com>
Hi everybody,
Here are the notes from the last Linux Memory Hotness and Promotion call
that happened on Thursday, October 23. Thanks to everybody who was
involved!
These notes are intended to bring up to speed people who could not attend
the call and to keep the conversation going between meetings.
----->o-----
Ravi Jonnalagadda presented dynamic interleaving slides, co-developed with
Bijan Tabatabai, discussing the current approach of promoting all hot
pages into the DRAM tier and demoting all cold pages. When bandwidth
utilization is high, this saturates the top tier even though bandwidth is
still available on the lower tier. The preference was to demote cold
pages while top tier memory is under-utilized and then interleave hot
pages to maximize bandwidth utilization across tiers. In Ravi's
experimentation, the saturation threshold has been 3/4 of the top tier's
maximum write bandwidth; if this threshold is not reached, memory is only
demoted.
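To make the policy concrete, here is a minimal sketch (Python, for
illustration only); the bandwidth figure, node IDs, and the migrate()
stub are assumptions rather than Ravi's implementation, which would act
through DAMOS or move_pages() rather than a placeholder:

    # Sketch of the bandwidth-gated policy described above. Threshold
    # source, node IDs, and migrate() are illustrative assumptions.
    TOP_TIER_MAX_WRITE_BW = 100e9              # bytes/s, platform-specific
    SATURATION = 0.75 * TOP_TIER_MAX_WRITE_BW  # the "3/4" threshold above

    def migrate(pages, target_node):
        """Placeholder for the real mechanism (e.g. DAMOS, move_pages())."""
        print(f"migrating {len(pages)} pages to node {target_node}")

    def tick(write_bw, hot_pages, cold_pages, weights=(1, 1)):
        # Cold pages are always demoted to the lower tier (node 1).
        migrate(cold_pages, target_node=1)
        if write_bw < SATURATION:
            # Top tier not saturated: hot pages go to DRAM as usual.
            migrate(hot_pages, target_node=0)
        else:
            # Top tier saturated: split hot pages across tiers by weight
            # so the lower tier's spare bandwidth is also used.
            w0, w1 = weights
            cut = len(hot_pages) * w0 // (w0 + w1)
            migrate(hot_pages[:cut], target_node=0)
            migrate(hot_pages[cut:], target_node=1)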
Ravi suggested adaptive interleaving of memory to optimize both bandwidth
and capacity utilization. He proposed a migrator in kernel space and a
calibrator in userspace. The calibrator would monitor system bandwidth
utilization and, by trying different weights, determine the optimal
weights for interleaving the hot pages for the highest bandwidth. If
bandwidth saturation is not hit, only cold pages get demoted. The
migrator reads the target interleave ratio from the calibrator,
rearranges the hot pages accordingly, and demotes cold pages to the
target node. Currently this uses the DAMOS migrate_hot and migrate_cold
actions.
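For reference, a userspace calibrator along these lines could steer the
schemes through DAMON's sysfs interface; in this sketch the
kdamond/context/scheme indices and the node IDs are assumptions about one
particular configuration, not part of the presentation:

    # Hedged sketch: pointing DAMOS migrate_hot/migrate_cold schemes at
    # target nodes from userspace. Indices and node IDs are assumed.
    import pathlib

    SCHEMES = pathlib.Path(
        "/sys/kernel/mm/damon/admin/kdamonds/0/contexts/0/schemes")

    def set_scheme(idx, action, target_nid):
        scheme = SCHEMES / str(idx)
        (scheme / "action").write_text(action)      # e.g. "migrate_hot"
        (scheme / "target_nid").write_text(str(target_nid))

    # Scheme 0 moves hot pages toward DRAM (node 0), scheme 1 demotes
    # cold pages to the lower tier (node 1).
    set_scheme(0, "migrate_hot", 0)
    set_scheme(1, "migrate_cold", 1)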
It was shown how the optimal weights change over time for both the
multiload and MERCI benchmarks. For MERCI, a few results using this
approach were obtained (lower is better):
- Local DRAM
+ Avg Baseline Total Time - 1457.97 ms
+ Memory Footprint
o Node 0 - 20.3 GB
- Static Weighted Interleave
+ Avg Baseline Total Time - 1023.81 ms
+ Memory Footprint
o Node 0 - 10.3 GB
o Node 1 - 10 GB
- Adaptive interleaving
+ Avg Baseline Total Time - 1030.41 ms
+ Memory Footprint
o Node 0 - 7 GB
o Node 1 - 13 GB
Jonathan Cameron asked: if all of the bandwidth is being used by this
benchmark, what is the use of the extra capacity in the top tier? Ravi
said that if there are two applications, one latency bound and the other
bandwidth bound, then both can be run at optimal levels.
Ravi suggested that hotness information need not be used exclusively for
promotion and that there is an advantage in rearranging hot pages based
on weights. He also suggested that a standard subsystem providing
bandwidth information would be very useful, covering sources such as IBS,
PEBS, and other PMU counters. Wei Xu noted this should be resctrl and
Jonathan agreed.
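As context for that suggestion, resctrl already exposes per-domain memory
bandwidth monitoring (MBM) counters; a sketch of sampling them follows,
assuming resctrl is mounted at /sys/fs/resctrl on hardware with MBM and
that the L3 domain name matches (it varies with topology):

    # Sketch: estimating memory bandwidth from resctrl's MBM counters.
    import time

    COUNTER = "/sys/fs/resctrl/mon_data/mon_L3_00/mbm_total_bytes"

    def read_bytes():
        with open(COUNTER) as f:
            return int(f.read())

    def bandwidth(interval=1.0):
        before = read_bytes()
        time.sleep(interval)
        return (read_bytes() - before) / interval   # bytes/s

    print(f"domain 0 bandwidth: {bandwidth() / 1e9:.2f} GB/s")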
Ravi also noted a challenge in that NUMA nodes may not map directly to
DRAM or CXL. CXL nodes can be asymmetric, with different bandwidths and
capacities. Similarly, direct attached and fabric attached bandwidth
information would need to be differentiated.
Asked about the testing methodology, Ravi noted that bandwidth monitoring
is system wide, but the migration and weights were application specific
(per virtual address space).
Wei noted a challenge: with CXL we cannot differentiate write bandwidth
today, though we can for reads; system wide, however, this would still be
possible. Jonathan noted that with resctrl you can reserve some
allocation of bandwidth for a given application and optimize within that.
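For reference, the kind of reservation Jonathan described is expressed
through resctrl's schemata file; in the sketch below the group name,
throttle value, and PID are placeholders (Intel MBA takes a percentage,
AMD absolute units of 1/8 GB/s):

    # Sketch: capping an application's memory bandwidth via resctrl MBA.
    import os

    GROUP = "/sys/fs/resctrl/bw_limited"    # hypothetical control group
    os.makedirs(GROUP, exist_ok=True)

    # Throttle this group's memory bandwidth on domain 0 to ~50%.
    with open(os.path.join(GROUP, "schemata"), "w") as f:
        f.write("MB:0=50\n")

    # Place the bandwidth-bound application into the group.
    with open(os.path.join(GROUP, "tasks"), "w") as f:
        f.write("12345")                    # placeholder PID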
Wei asked why, given that there will be significant overhead in
migration, the workloads here are not using hardware interleaving. Ravi
emphasized the need for adaptive tuning, where it was necessary to find
the right weights based on the application's signature; this setup is not
restricted to fixed hardware interleaving ratios.
Ravi's slides were attached to the shared drive.
----->o-----
As an update on his patch series, Raghu noted that he has finished the
changes previously discussed but ran into performance issues, which he
continues to work on.
----->o-----
Shivank noted that he has prepared a presentation on kpromoted with
migration offload to DMA, which we will see at the next meeting.
----->o-----
Next meeting will be on Thursday, November 6 at 8:30am PST (UTC-8),
everybody is welcome: https://meet.google.com/jak-ytdx-hnm
NOTE!!! Daylight Saving Time has ended in the United States, so please
check your local time carefully:
Time zones
PST (UTC-8) 8:30am
MST (UTC-7) 9:30am
CST (UTC-6) 10:30am
EST (UTC-5) 11:30am
Rio de Janeiro (UTC-3) 1:30pm
London (UTC) 4:30pm
Berlin (UTC+1) 5:30pm
Moscow (UTC+3) 7:30pm
Dubai (UTC+4) 8:30pm
Mumbai (UTC+5:30) 10:00pm
Singapore (UTC+8) 12:30am Friday
Beijing (UTC+8) 12:30am Friday
Tokyo (UTC+9) 1:30am Friday
Sydney (UTC+11) 3:30am Friday
Auckland (UTC+13) 5:30am Friday
Topics for the next meeting:
- discuss generalized subsystem for providing bandwidth information
independent of the underlying platform, ideally through resctrl,
otherwise utilizing bandwidth information will be challenging
+ preferably this bandwidth monitoring is not per NUMA node but rather
per tier (slow vs fast memory)
- similarly, discuss generalized subsystem for providing memory hotness
information
- determine minimal viable upstream opportunity to optimize for tiering
that is extensible for future use cases and optimizations
- Shivank presentation for kpromoted with migration offload to DMA
- update on the latest kmigrated series from Bharata as discussed in the
last meeting and combining all sources of memory hotness
+ discuss performance optimizations achieved by Shivank with migration
offload
- update on Raghu's series after addressing Jonathan's comments
- update on non-temporal stores enlightenment for memory tiering
- enlightening migrate_pages() for hardware assists and how this work
will be charged to userspace
- discuss overall testing and benchmarking methodology for various
approaches as we go along
Please let me know if you'd like to propose additional topics for
discussion, thank you!