Subject: [Linux Memory Hotness and Promotion] Notes from December 4, 2025
From: David Rientjes @ 2025-12-16 3:16 UTC
To: Davidlohr Bueso, Fan Ni, Gregory Price, Jonathan Cameron,
Joshua Hahn, Raghavendra K T, Rao, Bharata Bhasker,
SeongJae Park, Wei Xu, Xuezheng Chu, Yiannis Nikolakopoulos,
Zi Yan
Cc: linux-mm
Hi everybody,
Here are the notes from the last Linux Memory Hotness and Promotion call
that happened on Thursday, December 4. Thanks to everybody who was
involved!
These notes are intended to bring people who could not attend the call up
to speed, as well as to keep the conversation going between meetings.
----->o-----
Bharata reported that he is ready to post v4 of his series with two main
changes: a per-section indicator for hotness, which reduces the work
required of kmigrated, and the use of folio_mark_accessed(), which shows
good results in initial testing. Bharata is also working on finding the
right redis-memtier configuration to ensure that memory is moved back and
forth for promotion and demotion.
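As a rough illustration of the folio_mark_accessed() change (a sketch
only, not code from Bharata's series; the function name and pfn-based
entry point are assumptions), a hotness sampler could feed observed
accesses into the existing LRU aging like this:

#include <linux/mm.h>
#include <linux/swap.h>

/*
 * Illustrative only: record a software-observed access so the folio
 * ages toward the active LRU like any other reference.  Reference
 * counting and validity checks are elided for brevity.
 */
static void hotness_note_access(unsigned long pfn)
{
	struct folio *folio = pfn_folio(pfn);

	if (folio_test_lru(folio))
		folio_mark_accessed(folio);
}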
On the consolidation of tunables, the current plan is for these to live
under sysfs. There was discussion about starting in debugfs, but the API
should be solid enough by the time of upstream inclusion that this won't
be necessary.
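For reference, exposing such a tunable directly in sysfs is only a few
lines; the sketch below is illustrative only (the attribute name, default
value, and eventual location, e.g. under /sys/kernel/mm/, are assumptions
rather than anything agreed on the call):

#include <linux/kernel.h>
#include <linux/kobject.h>
#include <linux/sysfs.h>

/* Hypothetical tunable and default. */
static unsigned int hotness_scan_period_ms = 1000;

static ssize_t hotness_scan_period_ms_show(struct kobject *kobj,
		struct kobj_attribute *attr, char *buf)
{
	return sysfs_emit(buf, "%u\n", hotness_scan_period_ms);
}

static ssize_t hotness_scan_period_ms_store(struct kobject *kobj,
		struct kobj_attribute *attr, const char *buf, size_t count)
{
	int err = kstrtouint(buf, 10, &hotness_scan_period_ms);

	return err ? err : count;
}

static struct kobj_attribute hotness_scan_period_ms_attr =
	__ATTR_RW(hotness_scan_period_ms);

/* Registered at init with, e.g., sysfs_create_file(mm_kobj,
 * &hotness_scan_period_ms_attr.attr). */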
----->o-----
I pivoted the discussion to single threaded vs multi threaded promotion.
Wei Xu noted that we likely need more than one thread, probably at least
one per NUMA node. He said that memory bandwidth contention was so severe
that promotion throughput was heavily reduced. Hardware assist can help,
and so can memory bandwidth QoS, but another option is multi threaded
promotion. Jonathan Cameron asked if multiple threads were being used for
QoS; Wei said that by moving memory off the low tier we are shifting the
bandwidth pressure elsewhere. The ideal scenario, Wei said, is for
hardware based memory bandwidth QoS to ensure that the promotion threads
can make steady progress.
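As a sketch of what Wei's suggestion of at least one promotion thread per
NUMA node could look like (names and structure below are purely
illustrative, not a proposal from the call), each memory node would get
its own worker created on the node it services:

#include <linux/err.h>
#include <linux/kthread.h>
#include <linux/nodemask.h>
#include <linux/printk.h>
#include <linux/sched.h>

/* Illustrative per-node promotion worker. */
static int promotion_thread_fn(void *data)
{
	int nid = (long)data;

	while (!kthread_should_stop()) {
		/*
		 * Scan nid's lower-tier memory and migrate hot folios
		 * to the fast tier; the actual scan is elided here.
		 */
		pr_debug("promotion scan on node %d\n", nid);
		schedule_timeout_interruptible(HZ);
	}
	return 0;
}

static int __init start_promotion_threads(void)
{
	int nid;

	for_each_node_state(nid, N_MEMORY) {
		struct task_struct *t;

		/* Create the worker on the node it will service. */
		t = kthread_create_on_node(promotion_thread_fn,
					   (void *)(long)nid, nid,
					   "kpromote/%d", nid);
		if (IS_ERR(t))
			return PTR_ERR(t);	/* unwinding elided */
		wake_up_process(t);
	}
	return 0;
}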
Gregory suggested this may be a transient factor; we've had the
discussion before about how aggressive tiering should be to improve
overall latency. Moving things as fast as possible may not always be the
desired effect: the goal is to converge on stability, so you may not
maximize bandwidth to balance memory but rather eventually reach an ideal
steady state. Wei shared that in some cases promotion was limited to less
than 100MB/s as a result of bandwidth saturation. Gregory asked how
realistic this scenario actually is, because the fact that we have gotten
into this situation is already problematic: the only scenario where this
"should" happen is if the DRAM tier bandwidth is already capped and the
amount of hot memory exceeds the DRAM tier capacity, in which case we
would have thrashing. The goal is to ensure this scenario doesn't happen
at all.
Jonathan said boosting the migration threads may not be the ultimate
solution; it may be better to keep the workload that is consuming all of
the bandwidth from being scheduled. Wei believed that a single promotion
thread could still be a bottleneck. Gregory suggested that 100MB/s of
promotion may not be too slow: the goal is to eventually get there, but
not by promoting memory over a very short window -- if we promote as fast
as possible, we could find that the memory does not remain hot, in which
case the promotion may not have been justified.
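To put the 100MB/s figure in perspective, a back-of-the-envelope
calculation (the scan interval below is made up) shows how many base
pages even a saturated promotion budget still moves per interval:

#include <stdio.h>

int main(void)
{
	const long budget_bytes_per_sec = 100L * 1024 * 1024; /* ~100MB/s */
	const long page_size = 4096;
	const long interval_ms = 100;	/* hypothetical scan interval */

	long pages_per_sec = budget_bytes_per_sec / page_size;
	long pages_per_interval = pages_per_sec * interval_ms / 1000;

	/* 100MB/s at 4KiB pages: 25600 pages/s, 2560 per 100ms. */
	printf("%ld pages/s, %ld pages per %ldms interval\n",
	       pages_per_sec, pages_per_interval, interval_ms);
	return 0;
}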
----->o-----
We shifted to discussing workloads that can be used for experimentation.
Gregory suggested we have reached critical mass on the topic, such that
we would really benefit from production data or actual deployments. He
took the action item to do this, although admittedly it may take some
time.
Wei had been working primarily with memcached, although production data
was not imminent.
----->o-----
Next meeting will be on Thursday, December 18 at 8:30am PST (UTC-8),
everybody is welcome: https://meet.google.com/jak-ytdx-hnm
Topics for the next meeting:
- updates on Bharata's RFC v4 with new benchmarks and consolidation of
tunables
- continued discussion on memory overheads used to save the memory
hotness state and the list of promotion targets
- workloads to use as the industry standard beyond just memcached, such as
redis
- later: Gregory's analysis of more production-like workloads
- discuss generalized subsystem for providing bandwidth information
independent of the underlying platform, ideally through resctrl,
otherwise utilizing bandwidth information will be challenging
+ preferably this bandwidth monitoring is not per NUMA node but rather
per slow and fast tier
- similarly, discuss generalized subsystem for providing memory hotness
information
- determine minimal viable upstream opportunity to optimize for tiering
that is extensible for future use cases and optimizations
- update on non-temporal stores enlightenment for memory tiering
- enlightening migrate_pages() for hardware assists and how this work
will be charged to userspace, including for memory compaction
Please let me know if you'd like to propose additional topics for
discussion, thank you!