* [Linux Memory Hotness and Promotion] Notes from November 20, 2025
@ 2025-11-24  3:04 David Rientjes
From: David Rientjes @ 2025-11-24  3:04 UTC
  To: Davidlohr Bueso, Fan Ni, Gregory Price, Jonathan Cameron,
	Joshua Hahn, Raghavendra K T, Rao, Bharata Bhasker,
	SeongJae Park, Wei Xu, Xuezheng Chu, Yiannis Nikolakopoulos,
	Zi Yan
  Cc: linux-mm

Hi everybody,

Here are the notes from the last Linux Memory Hotness and Promotion call
that happened on Thursday, November 20.  Thanks to everybody who was 
involved!

These notes are intended to bring people up to speed who could not attend 
the call as well as keep the conversation going in between meetings.

----->o-----
Bharata updated that he had a set of results for the scenario that
involves promotion upstream; he posted these as a reply to his RFC v3.  Any
feedback on that series or proposed benchmarks to run would be very
useful.  He was also thinking about consolidating all the tunables in
sysfs into a subdirectory rather than having them in the parent directory
for MM.  I suggested this may also start out in debugfs until the APIs
become more stable.

Bharata was also planning on redoing the NUMAB2 support so that it's
cleaner and the page movement ratelimiting and associated logic is
separated, which enables using faults as a source.  He's also planning on
using folio_mark_accessed() as a source of hotness to cover promotion of
unmapped file folios.  He'll be writing a dedicated microbenchmark for
testing this.  He'll also be investigating additional benchmarks for the
overall series as a whole.

----->o-----
Jonathan Cameron asked what the general feel was about the memory 
overhead: currently this tracking requires ~2GB per 1TB.  Wei Xu noted 
that Google is taking a similar approach but with one byte per page in 
page flags.  If just for promotion purposes, we likely don't need eight 
bytes per page.  Even NUMA Balancing does not use eight bytes per page.  
Jonathan said it currently uses 33 bits per page so some shrinkage might 
be possible.  Wei said promotion still requires the per-pfn scan which can 
be expensive.
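
As a rough check on the numbers above, the overhead for each per-page
size mentioned can be worked out directly.  This is only a
back-of-the-envelope sketch: the 4 KiB base page size and the helper name
metadata_bytes() are assumptions for illustration, not anything from the
series.

```python
PAGE_SIZE = 4096   # assumption: 4 KiB base pages
TIB = 1 << 40
GIB = 1 << 30


def metadata_bytes(tracked_bytes, bits_per_page, page_size=PAGE_SIZE):
    """Bytes of hotness metadata needed to track `tracked_bytes` of memory."""
    pages = tracked_bytes // page_size
    # Round the total up to a whole number of bytes.
    return (pages * bits_per_page + 7) // 8


# The three per-page sizes that came up in the discussion.
for label, bits in (("8 bytes/page", 64), ("33 bits/page", 33), ("1 byte/page", 8)):
    overhead = metadata_bytes(TIB, bits)
    print(f"{label:>13}: {overhead / GIB:.3f} GiB per TiB")
```

With these assumptions, eight bytes per page works out to 2 GiB per TiB,
which matches the ~2GB per 1TB figure; 33 bits per page is roughly half of
that and one byte per page is an eighth.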

Wei said there would be one data structure with the information so we can 
do atomic updates on the hot metadata and then there is a much smaller 
data structure that tracks which pages to promote. 
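
To make the two-structure split concrete, here is a toy userspace model.
It is not the design Wei described in any detail: the class name, the
one-byte saturating counter, the threshold and the cap on candidates are
all hypothetical, and a plain Python increment stands in for what would be
an atomic update in the kernel.

```python
import heapq


class HotnessTracker:
    """Hypothetical model: per-page counters plus a small candidate set."""

    def __init__(self, nr_pages, max_candidates, threshold):
        # Large structure: one saturating byte counter per tracked page.
        self.counters = bytearray(nr_pages)
        # Small structure: pfns whose counter has crossed the threshold.
        self.candidates = set()
        self.max_candidates = max_candidates
        self.threshold = threshold

    def record_access(self, pfn):
        # Stand-in for an atomic update of the hot metadata.
        if self.counters[pfn] < 255:
            self.counters[pfn] += 1
        if (self.counters[pfn] >= self.threshold
                and len(self.candidates) < self.max_candidates):
            self.candidates.add(pfn)

    def pick_promotions(self, batch):
        # Take the hottest candidates first, then reset their state.
        picked = heapq.nlargest(batch, self.candidates,
                                key=lambda pfn: self.counters[pfn])
        for pfn in picked:
            self.candidates.discard(pfn)
            self.counters[pfn] = 0
        return picked


tracker = HotnessTracker(nr_pages=1024, max_candidates=4, threshold=3)
for pfn, count in ((10, 5), (20, 3), (30, 1)):
    for _ in range(count):
        tracker.record_access(pfn)

print(sorted(tracker.candidates))
```

Only the counter array scales with the amount of tracked memory; the
candidate set is bounded by max_candidates regardless of how many pages
are tracked.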

Raghu noted that, in discussion with Bharata, it was pointed out that
tracking is only necessary for the low tier since that memory is the only
viable set of pages to promote.  Jonathan noted that may be the majority
of system memory.  The metadata itself is only stored in top tier memory,
which is expensive.

----->o-----
We discussed the benchmarks that we should use for evaluation of all of 
these approaches.  SeongJae noted that he had no specific benchmark in 
mind but we should discuss the access pattern the benchmark should have.  
This should have some temporal access patterns but also have different 
hotness in different locations of memory; secondly, the pattern should 
change during runtime.
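
One way to picture the access pattern described above is a small
synthetic generator.  This is a sketch only: the function name, the
region/phase model and every parameter are assumptions made for
illustration, not a proposed benchmark.

```python
import random
from collections import Counter


def access_stream(nr_pages, nr_regions, phase_len, nr_accesses,
                  hot_fraction=0.9, seed=0):
    """Yield page numbers; one region is hot per phase and the hot region rotates."""
    rng = random.Random(seed)
    region_size = nr_pages // nr_regions
    for i in range(nr_accesses):
        # The hot region changes at every phase boundary (runtime change).
        hot_region = (i // phase_len) % nr_regions
        if rng.random() < hot_fraction:
            # Most accesses land in the currently hot region.
            yield hot_region * region_size + rng.randrange(region_size)
        else:
            # The rest are spread uniformly over all pages.
            yield rng.randrange(nr_pages)


stream = list(access_stream(nr_pages=1024, nr_regions=4,
                            phase_len=1000, nr_accesses=2000))
for phase in range(2):
    window = stream[phase * 1000:(phase + 1) * 1000]
    hits = Counter(page // 256 for page in window)
    print("phase", phase, "hottest region:", hits.most_common(1)[0][0])
```

The hottest region should follow the phase, so a placement policy that
only learns the first phase would be expected to do poorly once the hot
region moves.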

Jonathan said there's been a heavy reliance on memcached but that's not 
ideal because it's too predictable; we actually need the opposite of this.  
I noted that I've had some success running specjbb and redis workloads.  
Redis is interesting because it does not always observe spatial locality.

Yiannis noted one of the challenges with specint is that the duration of
the benchmark itself is not long enough to assess optimal placement logic.
Wei agreed with this; the benchmark would need to run for a long time.
Yiannis further mentioned that these can be used to over-subscribe cores;
however, they induce pressure (and consume bandwidth).

----->o-----
Raghu updated on his patch series to use the LRU gen scan API which
iterates through a single mm; this provides more control over the memory
that is being iterated.  He was working through some issues in the patch
series and may need to reach out to Kinsey for discussion on klruscand.
Jonathan also provided some feedback on the mailing list.

Raghu asked Kinsey if it would be possible to have an API that scanned a 
single mm; Kinsey said yes, this was similar to what was being thought 
about internally.  Raghu said this would be useful for integration.

Wei asked Raghu if his series will integrate the scanning and promotion 
together so that when a page is identified we can promote right away.  
Raghu said this was implemented like NUMAB but does not happen after a 
single access.  There is a separate migration thread.

Jonathan asked if we necessarily care if we lose some information; if 
there is a ton of memory to promote, we can't migrate everything, so do we 
care if some hotness information is actually lost?  Wei suggested that we 
must have a mechanism for promoting the hottest pages, not just hot pages, 
so some amount of history is required.  Jonathan said that if everything 
was insanely hot and we lose some information it would readily reappear 
again.  Raghu's patch series only uses a single bit from page flags, Wei 
suggested extending this.
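
The difference between "hot" and "hottest" can be illustrated with a small
model comparing a single bit per page against a wider counter.  This is a
sketch under assumed parameters: the four-bit width, the halving decay and
the helper names are hypothetical and not taken from Raghu's series.

```python
def record(state, pfn, bits):
    """Saturating increment of a `bits`-wide per-page counter."""
    limit = (1 << bits) - 1
    state[pfn] = min(state.get(pfn, 0) + 1, limit)


def age(state):
    """Halve all counters so stale hotness fades (an assumed decay policy)."""
    for pfn in state:
        state[pfn] >>= 1


def hottest(state, k):
    """Return up to k pfns from hottest to coldest; ties break on pfn."""
    return sorted(state, key=lambda pfn: (-state[pfn], pfn))[:k]


# pfn 7 is accessed far more often than pfns 3 and 5.
accesses = [7] * 12 + [3] * 4 + [5] * 1

one_bit, four_bit = {}, {}
for pfn in accesses:
    record(one_bit, pfn, bits=1)
    record(four_bit, pfn, bits=4)

print(one_bit)               # every touched page looks the same
print(four_bit)              # relative hotness is preserved
print(hottest(four_bit, 1))
```

With one bit, all three pages are indistinguishable, so the pick among
them is arbitrary; with a few more bits the hottest page can be ranked
first, and the decay step bounds how long old history is retained.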

----->o-----
Next meeting will be on Thursday, December 4 at 8:30am PST (UTC-8),
everybody is welcome: https://meet.google.com/jak-ytdx-hnm

Topics for the next meeting:

 - updates on Bharata's RFC v3 with new benchmarks and consolidation of
   tunables
 - continued discussion on memory overheads used to save the memory
   hotness state and the list of promotion targets
 - benchmarks to use as the industry standard beyond just memcached, such
   as redis
 - discuss generalized subsystem for providing bandwidth information
   independent of the underlying platform, ideally through resctrl,
   otherwise utilizing bandwidth information will be challenging
   + preferably this bandwidth monitoring is not per NUMA node but rather
     per slow and fast memory
 - similarly, discuss generalized subsystem for providing memory hotness
   information
 - determine minimal viable upstream opportunity to optimize for tiering
   that is extensible for future use cases and optimizations
 - update on non-temporal stores enlightenment for memory tiering
 - enlightening migrate_pages() for hardware assists and how this work
   will be charged to userspace, including for memory compaction

Please let me know if you'd like to propose additional topics for
discussion, thank you!

