From: David Rientjes <rientjes@google.com>
To: Davidlohr Bueso <dave@stgolabs.net>, Fan Ni <nifan.cxl@gmail.com>,
Gregory Price <gourry@gourry.net>,
Jonathan Cameron <Jonathan.Cameron@huawei.com>,
Joshua Hahn <joshua.hahnjy@gmail.com>,
Raghavendra K T <rkodsara@amd.com>,
"Rao, Bharata Bhasker" <bharata@amd.com>,
SeongJae Park <sj@kernel.org>, Wei Xu <weixugc@google.com>,
Xuezheng Chu <xuezhengchu@huawei.com>,
Yiannis Nikolakopoulos <yiannis@zptcorp.com>,
Zi Yan <ziy@nvidia.com>
Cc: linux-mm@kvack.org
Subject: [Linux Memory Hotness and Promotion] Notes from October 23, 2025
Date: Sun, 2 Nov 2025 16:41:19 -0800 (PST)
Message-ID: <d952a84f-332e-8f7a-4816-2c1cbd8f5b00@google.com>
Hi everybody,
Here are the notes from the last Linux Memory Hotness and Promotion call
that happened on Thursday, October 23. Thanks to everybody who was
involved!
These notes are intended to bring up to speed people who could not attend
the call and to keep the conversation going between meetings.
----->o-----
Ravi Jonnalagadda presented dynamic interleaving slides, co-developed with
Bijan Tabatabai, discussing the current approach of promoting all hot
pages into the DRAM tier and demoting all cold pages. When bandwidth
utilization is high, this saturates the top tier even though bandwidth is
still available on the lower tier. The preference was to demote cold
pages while top tier memory is under-utilized and then interleave hot
pages to maximize bandwidth utilization across tiers. In Ravi's
experimentation, the saturation threshold has been 3/4 of the top tier's
maximum write bandwidth; if this threshold is not reached, memory is only
demoted.
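To make the policy concrete, here is a minimal sketch (Python, for
illustration only); the bandwidth figure, node IDs, and the migrate()
stub are assumptions rather than Ravi's implementation, which would act
through DAMOS or move_pages() rather than a placeholder:

    # Sketch of the bandwidth-gated policy described above. Threshold
    # source, node IDs, and migrate() are illustrative assumptions.
    TOP_TIER_MAX_WRITE_BW = 100e9              # bytes/s, platform-specific
    SATURATION = 0.75 * TOP_TIER_MAX_WRITE_BW  # the "3/4" threshold above

    def migrate(pages, target_node):
        """Placeholder for the real mechanism (e.g. DAMOS, move_pages())."""
        print(f"migrating {len(pages)} pages to node {target_node}")

    def tick(write_bw, hot_pages, cold_pages, weights=(1, 1)):
        # Cold pages are always demoted to the lower tier (node 1).
        migrate(cold_pages, target_node=1)
        if write_bw < SATURATION:
            # Top tier not saturated: hot pages go to DRAM as usual.
            migrate(hot_pages, target_node=0)
        else:
            # Top tier saturated: split hot pages across tiers by weight
            # so the lower tier's spare bandwidth is also used.
            w0, w1 = weights
            cut = len(hot_pages) * w0 // (w0 + w1)
            migrate(hot_pages[:cut], target_node=0)
            migrate(hot_pages[cut:], target_node=1)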
Ravi suggested adaptive interleaving of memory to optimize both bandwidth
and capacity utilization. He proposed a migrator in kernel space and a
calibrator in userspace. The calibrator would monitor system bandwidth
utilization and, by trying different weights, determine the optimal
weights for interleaving the hot pages for the highest bandwidth. If
bandwidth saturation is not hit, only cold pages get demoted. The
migrator reads the target interleave ratio from the calibrator,
rearranges the hot pages accordingly, and demotes cold pages to the
target node. Currently this uses the DAMOS migrate_hot and migrate_cold
actions.
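For reference, a userspace calibrator along these lines could steer the
schemes through DAMON's sysfs interface; in this sketch the
kdamond/context/scheme indices and the node IDs are assumptions about one
particular configuration, not part of the presentation:

    # Hedged sketch: pointing DAMOS migrate_hot/migrate_cold schemes at
    # target nodes from userspace. Indices and node IDs are assumed.
    import pathlib

    SCHEMES = pathlib.Path(
        "/sys/kernel/mm/damon/admin/kdamonds/0/contexts/0/schemes")

    def set_scheme(idx, action, target_nid):
        scheme = SCHEMES / str(idx)
        (scheme / "action").write_text(action)      # e.g. "migrate_hot"
        (scheme / "target_nid").write_text(str(target_nid))

    # Scheme 0 moves hot pages toward DRAM (node 0), scheme 1 demotes
    # cold pages to the lower tier (node 1).
    set_scheme(0, "migrate_hot", 0)
    set_scheme(1, "migrate_cold", 1)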
It was shown how the optimal weights change over time for both the
multiload and MERCI benchmarks. For MERCI, a few results using this
approach were obtained (lower is better):
- Local DRAM
+ Avg Baseline Total Time - 1457.97 ms
+ Memory Footprint
o Node 0 - 20.3 GB
- Static Weighted Interleave
+ Avg Baseline Total Time - 1023.81 ms
+ Memory Footprint
o Node 0 - 10.3 GB
o Node 1 - 10 GB
- Adaptive interleaving
+ Avg Baseline Total Time - 1030.41 ms
+ Memory Footprint
o Node 0 - 7 GB
o Node 1 - 13 GB
Jonathan Cameron asked: if all of the bandwidth is being used by this
benchmark, what is the use of the extra capacity in the top tier? Ravi
said that if there are two applications, one latency bound and the other
bandwidth bound, then both can be run at optimal levels.
Ravi suggested that hotness information need not be used exclusively for
promotion and that there is an advantage in rearranging hot pages based
on weights. He also suggested that a standard subsystem providing
bandwidth information would be very useful, covering sources such as IBS,
PEBS, and other PMU counters. Wei Xu noted this should be resctrl and
Jonathan agreed.
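As context for that suggestion, resctrl already exposes per-domain memory
bandwidth monitoring (MBM) counters; a sketch of sampling them follows,
assuming resctrl is mounted at /sys/fs/resctrl on hardware with MBM and
that the L3 domain name matches (it varies with topology):

    # Sketch: estimating memory bandwidth from resctrl's MBM counters.
    import time

    COUNTER = "/sys/fs/resctrl/mon_data/mon_L3_00/mbm_total_bytes"

    def read_bytes():
        with open(COUNTER) as f:
            return int(f.read())

    def bandwidth(interval=1.0):
        before = read_bytes()
        time.sleep(interval)
        return (read_bytes() - before) / interval   # bytes/s

    print(f"domain 0 bandwidth: {bandwidth() / 1e9:.2f} GB/s")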
Ravi also noted a challenge in that NUMA nodes may not map directly to
DRAM or CXL. CXL nodes can be asymmetric, with different bandwidths and
capacities. Similarly, direct attached and fabric attached bandwidth
information would need to be differentiated.
Asked about the testing methodology, Ravi noted that bandwidth monitoring
is system wide, but the migration and weights were application specific
(per virtual address space).
Wei noted a challenge: with CXL we cannot differentiate write bandwidth
today, though we can for reads; system wide, however, this would still be
possible. Jonathan noted that with resctrl you can reserve some
allocation of bandwidth for a given application and optimize within that.
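For reference, the kind of reservation Jonathan described is expressed
through resctrl's schemata file; in the sketch below the group name,
throttle value, and PID are placeholders (Intel MBA takes a percentage,
AMD absolute units of 1/8 GB/s):

    # Sketch: capping an application's memory bandwidth via resctrl MBA.
    import os

    GROUP = "/sys/fs/resctrl/bw_limited"    # hypothetical control group
    os.makedirs(GROUP, exist_ok=True)

    # Throttle this group's memory bandwidth on domain 0 to ~50%.
    with open(os.path.join(GROUP, "schemata"), "w") as f:
        f.write("MB:0=50\n")

    # Place the bandwidth-bound application into the group.
    with open(os.path.join(GROUP, "tasks"), "w") as f:
        f.write("12345")                    # placeholder PID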
Wei asked why, given that there will be significant overhead in
migration, the workloads here are not using hardware interleaving. Ravi
emphasized the need for adaptive tuning, where it was necessary to find
the right weights based on the application's signature; this setup is not
restricted to fixed hardware interleaving ratios.
Ravi's slides were attached to the shared drive.
----->o-----
As an update on his patch series, Raghu noted that he has finished the
changes previously discussed but ran into performance issues, which he
continues to work on.
----->o-----
Shivank noted that he has prepared a presentation on kpromoted with
migration offload to DMA, which we will see at the next meeting.
----->o-----
Next meeting will be on Thursday, November 6 at 8:30am PST (UTC-8),
everybody is welcome: https://meet.google.com/jak-ytdx-hnm
NOTE!!! Daylight Saving Time has ended in the United States, so please
check your local time carefully:
Time zones
PST (UTC-8) 8:30am
MST (UTC-7) 9:30am
CST (UTC-6) 10:30am
EST (UTC-5) 11:30am
Rio de Janeiro (UTC-3) 1:30pm
London (UTC) 4:30pm
Berlin (UTC+1) 5:30pm
Moscow (UTC+3) 7:30pm
Dubai (UTC+4) 8:30pm
Mumbai (UTC+5:30) 10:00pm
Singapore (UTC+8) 12:30am Friday
Beijing (UTC+8) 12:30am Friday
Tokyo (UTC+9) 1:30am Friday
Sydney (UTC+11) 3:30am Friday
Auckland (UTC+13) 5:30am Friday
Topics for the next meeting:
- discuss generalized subsystem for providing bandwidth information
independent of the underlying platform, ideally through resctrl,
otherwise utilizing bandwidth information will be challenging
+ preferably this bandwidth monitoring is not per NUMA node but rather
per tier (slow vs fast memory)
- similarly, discuss generalized subsystem for providing memory hotness
information
- determine minimal viable upstream opportunity to optimize for tiering
that is extensible for future use cases and optimizations
- Shivank presentation for kpromoted with migration offload to DMA
- update on the latest kmigrated series from Bharata as discussed in the
last meeting and combining all sources of memory hotness
+ discuss performance optimizations achieved by Shivank with migration
offload
- update on Raghu's series after addressing Jonathan's comments
- update on non-temporal stores enlightenment for memory tiering
- enlightening migrate_pages() for hardware assists and how this work
will be charged to userspace
- discuss overall testing and benchmarking methodology for various
approaches as we go along
Please let me know if you'd like to propose additional topics for
discussion, thank you!