[Linux Memory Hotness and Promotion] Notes from December 18, 2025
From: David Rientjes @ 2025-12-21 4:10 UTC
To: Davidlohr Bueso, Fan Ni, Gregory Price, Jonathan Cameron,
Joshua Hahn, Raghavendra K T, Rao, Bharata Bhasker,
SeongJae Park, Wei Xu, Xuezheng Chu, Yiannis Nikolakopoulos,
Zi Yan
Cc: linux-mm
Hi everybody,
Here are the notes from the last Linux Memory Hotness and Promotion call
that happened on Thursday, December 18. Thanks to everybody who was
involved!
These notes are intended to bring people who could not attend the call up
to speed, as well as to keep the conversation going between meetings.
----->o-----
Raghu provided an update on his progress: he was trying to fit klruscand
into his current approach, but some redesign would be necessary. Since
there is already one approach working, klruscand + pghot, he will be going
slow on this. He was planning on posting the latest set of patches for the
record so that they can be revisited later, mainly focused on Jonathan's
feedback and new optimizations; this was likely to be posted by the end of
the year. Mainline work would continue with klruscand + pghot.
Raghu had a question about klruscand, however: would the latest cleanup
and MGLRU LRU changes proposed on the mailing list affect anything?
Wei said these would not affect klruscand since the core of MGLRU is the
LRU of the page, which is unaffected by those proposed changes.
----->o-----
We moved into discussing memory overheads for storing page hotness,
especially since this storage would be coming from super expensive top
tier memory; we felt it was best to align on this so that we could
determine the minimal viable upstream opportunity for a landing. Gregory
had a short discussion about this at LPC; the current proposal was around
64 bits per tracked page, limited to the CXL memory tier, so the shorthand
would be 2GB of overhead per 1TB of memory tracked. He was interested in
seeing how this would generalize to supporting N tiers, which would be a
minimal viable upstream requirement (HBM, DRAM, CXL tier).
Jonathan suggested that HBM had nothing to do with hotness but was rather
focused only on bandwidth. The consensus of the group was that we still
need to be able to support N tiers.
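For reference, the 2GB per 1TB shorthand falls out of simple arithmetic,
assuming 4KB base pages:

  1TB tracked / 4KB per page          = 268,435,456 (2^28) pages
  2^28 pages * 8 bytes (64 bits) each = 2GB of metadata per 1TB tracked
  2^28 pages * 1 byte  (8 bits)  each = 256MB of metadata per 1TB tracked

That is roughly 0.2% of the tracked tier at 64 bits per page, or about
0.025% at the 8 bits per page starting point discussed later in the call.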
I asked about how this overlaps with NUMAB; there has been a lot of
discussion about NUMAB=2 in this series of meetings, but it's likely
worthwhile also to consider NUMAB=1. Raghu suggested we could update the
VMAs for the DRAM tier in that case and only track the hotness of memory
for those VMAs. I said that would still be operating on the sliding
window.
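For anyone catching up, NUMAB=1 and NUMAB=2 refer to the modes of the
kernel.numa_balancing sysctl (NUMA_BALANCING_NORMAL and
NUMA_BALANCING_MEMORY_TIERING in include/linux/sched/sysctl.h). A rough
sketch of Raghu's suggestion, with a hypothetical helper and not code from
any posted series, might look like:

  #include <linux/mm_types.h>
  #include <linux/sched/sysctl.h>

  static bool should_track_hotness(struct vm_area_struct *vma)
  {
          if (sysctl_numa_balancing_mode & NUMA_BALANCING_MEMORY_TIERING)
                  return true;    /* NUMAB=2: track as today */
          if (sysctl_numa_balancing_mode & NUMA_BALANCING_NORMAL)
                  /* NUMAB=1: DRAM-tier VMAs only (hypothetical helper) */
                  return vma_is_dram_tier_backed(vma);
          return false;
  }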
We aligned that any upstream-landed support must be extensible for
additional memory tiers in the future. Jonathan generalized this by saying
that we need to be able to turn the support off with no overhead. I said
this would be required when virtualizing the lower memory tiers into a
guest, where you may not care to track the hotness for optimal page
placement.
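On the "no overhead when off" point, one obvious way to get there is a
static key, so the disabled case is a patched-out branch. The sketch below
only illustrates that idea with made-up names; it is not from the pghot or
klruscand patches:

  #include <linux/jump_label.h>

  /* off by default; flipped with static_branch_enable() when wanted */
  DEFINE_STATIC_KEY_FALSE(hotness_tracking_key);

  static inline void note_page_access(unsigned long pfn)
  {
          if (!static_branch_unlikely(&hotness_tracking_key))
                  return;                  /* nop-ed branch when tracking is off */
          __note_page_access(pfn);         /* hypothetical slow path */
  }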
I suggested that 2GB per 1TB of tracked memory sounded fine, but it is
likely also the ceiling. Gregory said that colleagues were surprised by the
amount of overhead and so we should discuss this on the mailing list.
Yiannis understood the pushback and said that we should show what this
additional overhead is getting us: we need to demonstrate the value in the
hotness tracking to justify the cost.
Wei said that internally he is using one byte per page for hotness
tracking and this is a simple solution. Jonathan said we could allow a
precision vs cost trade-off: some mechanisms use even less than one byte
per page; they work, but sometimes they promote the wrong thing. We agreed
this could be configurable. Gregory made a good point that for
single-socket systems, for example, we don't need to capture source or
access information; it does drive configuration complexity, however.
Gregory suggested we may want to avoid the accessor information on some
systems and, when we get it wrong, require a double migration to get it
right. He was on board with limiting this to eight bits per page to start
and then adding a precision mode later.
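To make the eight bits per page idea concrete, here is a toy sketch of
what such a map could look like; the layout, names, and aging policy are
purely illustrative assumptions, not the pghot or klruscand
implementation:

  #include <linux/types.h>
  #include <linux/limits.h>

  struct hotness_map {
          u8 *hot;                 /* one byte per tracked page */
          unsigned long base_pfn;  /* first PFN of the tracked (CXL) range */
          unsigned long nr_pages;
  };

  /* saturating counter bumped on each observed access */
  static void hotness_record_access(struct hotness_map *map,
                                    unsigned long pfn)
  {
          unsigned long idx = pfn - map->base_pfn;

          if (pfn < map->base_pfn || idx >= map->nr_pages)
                  return;
          if (map->hot[idx] < U8_MAX)
                  map->hot[idx]++;
  }

  /* periodic aging so stale accesses decay rather than accumulate */
  static void hotness_age(struct hotness_map *map)
  {
          unsigned long i;

          for (i = 0; i < map->nr_pages; i++)
                  map->hot[i] >>= 1;
  }

With 4KB pages this is the 256MB per 1TB figure from the overhead
discussion above; a later precision mode could widen the per-page record.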
----->o-----
The next meeting will be canceled for New Year's Day; we'll come back two
weeks after that. Happy New Year!
Next meeting will be on Thursday, January 15 at 8:30am PST (UTC-8),
everybody is welcome: https://meet.google.com/jak-ytdx-hnm
Topics for the next meeting:
- updates on Bharata's patch series with new benchmarks and consolidation
of tunables
- workloads to use as the industry standard beyond just memcached, such
as redis-memtier
- later: Gregory's analysis of more production-like workloads
- discuss generalized subsystem for providing bandwidth information
independent of the underlying platform, ideally through resctrl,
otherwise utilizing bandwidth information will be challenging
+ preferably this bandwidth monitoring is not per NUMA node but rather
per slow and fast tier
- similarly, discuss generalized subsystem for providing memory hotness
information
- determine minimal viable upstream opportunity to optimize for tiering
that is extensible for future use cases and optimizations
+ extensible for multiple tiers
+ suggestion: limited to 8 bits per page to start, add a precision mode
later
+ limited to 64 bits per page as a ceiling, may be less
+ must be possible to disable with no memory or performance overhead
- update on non-temporal stores enlightenment for memory tiering
- enlightening migrate_pages() for hardware assists and how this work
will be charged to userspace, including for memory compaction
Please let me know if you'd like to propose additional topics for
discussion, thank you!