* [Linux Memory Hotness and Promotion] Notes from October 9, 2025
From: David Rientjes @ 2025-10-12 1:48 UTC (permalink / raw)
To: Davidlohr Bueso, Fan Ni, Gregory Price, Jonathan Cameron,
Joshua Hahn, Raghavendra K T, Rao, Bharata Bhasker,
SeongJae Park, Wei Xu, Xuezheng Chu, Yiannis Nikolakopoulos,
Zi Yan
Cc: linux-mm
Hi everybody,
Here are the notes from the last Linux Memory Hotness and Promotion call
that happened on Thursday, October 9. Thanks to everybody who was
involved!
These notes are intended to bring people up to speed who could not attend
the call as well as keep the conversation going in between meetings.
----->o-----
I relayed some updates from Bharata: he had added the pruning logic to
kpromoted and then addressed Jonathan's review comments. He also spent
some time reviving the earlier approach[1]; the kmigrated approach uses
statically allocated space (32 bits of extended page flags) to store hot
page information. The kmigrated thread was initially too slow; Bharata
has subsequently made improvements to it and integrated all the sources
of memory hotness information into it -- it is now comparable in
functionality to the kpromoted approach. He will be sending the patches
for the kmigrated approach soon.
Bharata provided a slide[2] to compare both kpromoted and kmigrated.
Raghu summarized this: with the kmigrated approach the allocation becomes
very small (reduced memory footprint); the status of the memory is
available in extended page flags. This drastically reduces the amount of
complexity involved.
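As a purely illustrative sketch (not Bharata's actual layout), the 32 bits
of extended page flags might be packed along these lines; the field names
and widths below are hypothetical:

#include <linux/types.h>

/*
 * Hypothetical packing of the 32 bits of extended page flags used to
 * track hotness; field names and widths are illustrative only.
 */
struct hot_page_info {
	u32 last_access	: 16;	/* coarse timestamp of last observed access */
	u32 freq	: 8;	/* saturating access-frequency counter */
	u32 src_nid	: 6;	/* node that reported the access */
	u32 queued	: 1;	/* already queued to the migration thread */
	u32 unused	: 1;
};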
Wei Xu asked if AMD had considered an XArray for storing this information.
Raghu mentioned that there are approaches for both XArray and Maple Tree.
Gregory noted that we'd need to operate on the XArray, which requires a
locking structure, which would not be great. Davidlohr said with Maple
Tree we'd at least have lockless searching. Gregory countered that this
would likely be a write-often scenario rather than a read-mostly one.
Wei noted another approach may be similar to LRU. Rather than a global
data structure, if we want to promote memory per job, the hotness
information might be maintained in an LRU-like data structure (one half is
hot, one half is cold). I noted this would support both per-node and
per-memcg. Gregory asked what would be the cost of aging off the MRU back
onto the LRU.
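As a rough sketch of what that might look like (hypothetical names, not an
actual patch), per-node or per-memcg hotness could be kept on two lists:
an observed access moves a page to the hot half, and an aging step pushes
the tail of the hot half back to the cold half -- which is exactly where
Gregory's cost question applies.

#include <linux/list.h>
#include <linux/spinlock.h>

/*
 * Hypothetical LRU-like hotness lists, loosely mirroring the
 * active/inactive split: one half hot, one half cold.  These could be
 * instantiated per node or per memcg.
 */
struct hotness_lists {
	spinlock_t lock;
	struct list_head hot;	/* recently accessed, promotion candidates */
	struct list_head cold;	/* aged out, ignored by promotion */
};

/* On an observed access (hint fault, IBS sample, ...): move to the hot half. */
static void hotness_record_access(struct hotness_lists *h, struct list_head *entry)
{
	spin_lock(&h->lock);
	list_move(entry, &h->hot);
	spin_unlock(&h->lock);
}

/* Aging: push the tail of the hot list back onto the cold list. */
static void hotness_age_one(struct hotness_lists *h)
{
	spin_lock(&h->lock);
	if (!list_empty(&h->hot))
		list_move_tail(h->hot.prev, &h->cold);
	spin_unlock(&h->lock);
}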
If there are any concerns about pursuing the kmigrated approach based on
the slide comparing kmigrated and kpromoted, we should continue to discuss
this on the upstream mailing list before the next meeting.
----->o-----
Kinsey gave a brief update on the resume fix for klruscand. He had been
experimenting with it; it currently batches the PMD access bit testing and
clearing process for up to 64 PMDs. He had been investigating this and
will likely be removing optimizations that actually harvest 65-66 PMDs.
He plans to send the next revision out within the next couple of weeks,
but this continues to be under experimentation.
----->o-----
Raghu plans to integrate this patch series without waiting for the resume
fix upstream for klruscand. He was incorporating Jonathan's comments and
feedback upstream. He is planning on providing a link to the series; he
wanted to have some rough patches before the next biweekly meeting.
Raghu shouted out Jonathan for all the feedback that he had been providing
on the series upstream; this was very useful.
----->o-----
We shifted to a lengthy discussion on optimizing for memory bandwidth as
opposed to only latency. The question came up of whether kmigrated
should optimize for memory bandwidth and interleaving. Gregory split
this into trigger and mechanism components: the scanning is the mechanism
to do the movement and the trigger is the over-subscription of bandwidth
on the CXL link. For example, if the DRAM tier has 10x the bandwidth of
the CXL tier, then you would want to kick off promotions if the DRAM link
is under-subscribed and the CXL link is over-subscribed relative to that
ratio. This is similar to the weighted interleave work that was done. We
don't need perfect hotness information to do this; once we know it's
over-subscribed, we start promoting what we think is hot. A handful of
hot pages can contribute significantly to this over-subscription if they
are writable. He was unsure whether this should be userland policy.
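As a sketch of that trigger (assuming per-link bandwidth counters are
available from the memory controller, perf, or similar; the thresholds and
names are made up for illustration):

#include <stdbool.h>

/*
 * Ratio-based promotion trigger: no perfect hotness data needed, only
 * relative link utilization.  Counters and capacities are assumed to be
 * provided by something else.
 */
struct link_bw {
	unsigned long cur_mbps;	/* measured bandwidth on this link */
	unsigned long max_mbps;	/* nominal capacity of this link */
};

static bool should_promote(const struct link_bw *dram, const struct link_bw *cxl)
{
	unsigned long dram_util = dram->cur_mbps * 100 / dram->max_mbps;
	unsigned long cxl_util  = cxl->cur_mbps * 100 / cxl->max_mbps;

	/*
	 * Promote when CXL is proportionally hotter than DRAM: DRAM has
	 * headroom and the CXL link is near saturation, so moving what we
	 * believe is hot relieves the over-subscribed link.
	 */
	return dram_util < 80 && cxl_util > 90;
}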
My thought was to avoid the kernel being the home for all policy, but
rather use the memory hotness abstraction in the kernel as the source of
truth for that information (as well as bandwidth). Policy should be left
to userspace, which would ask the kernel to do the migration itself,
including with hardware assist.
Gregory noted that both latency and bandwidth were related; once the
bandwidth is over-subscribed, the latency goes through the roof. The
kernel wouldn't want to stop paying attention to bandwidth. We should
decide whether we're just going to allow this kernel agent in the
background to continue promoting memory to optimize for latency.
We discussed how per-job attributes would play into this. Gregory was
looking at this from the perspective of optimizing for the entire system
rather than per job. If we care about latency, then we have to care about
bandwidth. Gregory suggested two ways of thinking about it:
- we're over-subscribed in DRAM and need to offload some hot memory to
CXL
- minimize the bandwidth to CXL as much as possible because there's
headroom on DRAM
The latter is fundamentally a latency optimization. Once DRAM becomes
over-subscribed, latencies go up so migrating to CXL reduces the average
latency of a random fetch.
Joshua Hahn talked about this at LSF/MM/BPF: a userspace agent to tune
weighted interleave so that when we face bandwidth pressure, we start
allocating more from CXL. The consensus was that this should be done in
userspace, but that we need kernel mechanisms for tuning the job so that
when bandwidth pressure on DRAM is too high we can offload to CXL.
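For the userspace side, a minimal example of shifting the existing
weighted-interleave sysfs knobs when pressure is detected (the node ids,
weights, and the pressure detection itself are assumptions of this sketch):

#include <stdio.h>

/* Write a weight for one node via the weighted interleave sysfs interface. */
static int set_interleave_weight(int nid, int weight)
{
	char path[128];
	FILE *f;

	snprintf(path, sizeof(path),
		 "/sys/kernel/mm/mempolicy/weighted_interleave/node%d", nid);
	f = fopen(path, "w");
	if (!f)
		return -1;
	fprintf(f, "%d\n", weight);
	return fclose(f);
}

int main(void)
{
	/* Hypothetical layout: node 0 is DRAM, node 1 is CXL. */
	/* Under DRAM bandwidth pressure, shift from 4:1 toward 2:1. */
	set_interleave_weight(0, 2);
	set_interleave_weight(1, 1);
	return 0;
}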
----->o-----
Shivank Garg gave a quick update on testing kpromoted patches with
migration offload to DMA. He had been seeing 10-15% performance benefit
with this. His slide showed, for the abench access patterns:

  mean (in us)      offload disabled    offload enabled (DMA 8 channel)
  random               39114520.00         35313111.00
  random repeat        11089918.20          9045263.60
  sequential           27677787.00         24578768.40

We planned on discussing this in more detail in the next meeting.
----->o-----
Next meeting will be on Thursday, October 23 at 8:30am PDT (UTC-7),
everybody is welcome: https://meet.google.com/jak-ytdx-hnm
Topics for the next meeting:
- discussion on how to optimize page placement for bandwidth and not
  simply access latency, based on weighted interleave
+ discussion on the role of userspace to optimize for weighted
interleave and kernel mechanisms to offload to CXL when DRAM
bandwidth is saturated
- update on the latest kmigrated series from Bharata as discussed in the
last meeting and combining all sources of memory hotness
+ discuss performance optimizations achieved by Shivank with migration
offload
- update on the resume fix for klruscand and timelines for sharing
upstream
- update on Raghu's series after addressing Jonathan's comments and next
steps
- update on non-temporal stores enlightenment for memory tiering
- enlightening migrate_pages() for hardware assists and how this work
will be charged to userspace
- discuss proactive demotion interface as an extension to memory.reclaim
- discuss overall testing and benchmarking methodology for various
approaches as we go along
Please let me know if you'd like to propose additional topics for
discussion, thank you!
[1]
https://lore.kernel.org/linux-mm/20250616133931.206626-1-bharata@amd.com/
[2]
https://drive.google.com/file/d/1gJ_geNAu0fzv6kjdM4qdramVTfyY4Pla/view?usp=drive_link&resourcekey=0-qh2DPLK1GX3joy0XSM5KrQ
* Re: [Linux Memory Hotness and Promotion] Notes from October 9, 2025
From: Gregory Price @ 2025-10-17 17:23 UTC (permalink / raw)
To: David Rientjes
Cc: Davidlohr Bueso, Fan Ni, Jonathan Cameron, Joshua Hahn,
Raghavendra K T, Rao, Bharata Bhasker, SeongJae Park, Wei Xu,
Xuezheng Chu, Yiannis Nikolakopoulos, Zi Yan, linux-mm
On Sat, Oct 11, 2025 at 06:48:59PM -0700, David Rientjes wrote:
> Gregory noted that both latency and bandwidth were related; once the
> bandwidth is over-subscribed, the latency goes through the roof. The
> kernel wouldn't want to stop paying attention to bandwidth. We should
> decide whether we're just going to allow this kernel agent in the
> background to continue promoting memory to optimize for latency.
>
> We discussed how per-job attributes would play into this. Gregory was
> looking at this from the perspective of optimizing for the entire system
> rather than per job. If we care about latency, then we have to care about
> bandwidth. Gregory suggested two ways of thinking about it:
>
> - we're over-subscribed in DRAM and need to offload some hot memory to
> CXL
> - minimize the bandwidth to CXL as much as possible because there's
> headroom on DRAM
Making the thoughts here a little more discrete, consider the following
Bandwidth Capacities:
[cpu]---[dram] 300GB/s
|-----[cxl] 30GB/s
On this system we have a 10:1 distribution of bandwidth. We can think
of 5 relevant "System States" with these limits in mind.
[Over-sub DRAM] - CPU is stalling on DRAM access
[cpu]---[dram] 320GB/s - would use more than is available if it could
|-----[cxl] 0GB/s
[Over-sub CXL] - Headroom on DRAM, but latency hit as CXL is hot
[cpu]---[dram] 0-270GB/s
|-----[cxl] 30GB/s
[Balanced] - No Over-sub, CPU isn't stalling
[cpu]---[dram] 0-300GB/s
|-----[cxl] 0-30GB/s
[Under-sub CXL] - Headroom on CXL, DRAM may or may not be at limit
[cpu]---[dram] 0-300GB/s
|-----[cxl] 0-29GB/s
[Full-sub] - Links are fully saturated.
[cpu]---[dram] 320GB/s
|-----[cxl] 32GB/s
----------------------------------------------------------
Minimizing Average Random-Access Latency / Naive Bandwidth
----------------------------------------------------------
In this scenario, you want any given [RANDOM ACCESS] to produce the lowest
latency possible. This is different from [PREDICTABLE ACCESS] patterns.
In our 5 states above, what is the best state transition we can do?
[Over-sub DRAM] -> [Balanced] or [Full sub]
[Over-sub CXL] -> [Balanced] or [Under-sub CXL]
[Balanced] -> [Balanced]
[Under-sub CXL] -> [Balanced] or [Under-sub CXL]
[Full sub] -> [Full sub]
All scenarios trend toward [Balanced], and once we reach
balanced or [Full sub], we start doing nothing, as we recognize
any movement is likely harmful.
-----------------------------------------------------------
Minimizing Average Predictable Access Latency w/o Bandwidth
-----------------------------------------------------------
Let's assume we have perfect knowledge of the following:
- [Chunk A] is hot and on a remote node.
For demotion to matter you actually need future information:
- [Chunk B] on local node IS HOT and ABOUT TO become COLD.
So we will consider demotion as having no effect on BW utilization.
Without bandwidth data, naive promotion produces the following:
[Over-sub DRAM] -> [Over-sub DRAM]
[Over-sub CXL] -> [Over-sub DRAM] or [Balanced] or [Under-sub CXL]
[Balanced] -> [Over-sub DRAM] or [Balanced]
[Under-sub CXL] -> [Over-sub DRAM] or [Balanced] or [Under-sub CXL]
[Full sub] -> [Over-sub DRAM] or [Full sub]
1) All states trend toward [Over-sub DRAM]
2) If you never hit [Over-sub DRAM], you trend toward [Balanced]
3) [Full sub] is now an actively unstable state.
But remember, promotion *drives bandwidth*, so the naive approach
can easily push the system state into any given [Over-sub] scenario.
Overall this system trends towards [Over-sub DRAM], because [Balanced]
and [Full sub] trend toward [Over-sub DRAM].
-----------------------------------------------------------
Minimizing Average Predictable Access Latency w/ Bandwidth
-----------------------------------------------------------
Now let's augment the naive approach with bandwidth data.
[Over-sub DRAM] -> [Balanced] or [Full Sub]
[Over-sub CXL] -> [Balanced] or [Under-sub CXL]
[Balanced] -> [Over-sub DRAM] or [Balanced]
[Under-sub CXL] -> [Balanced] or [Under-sub CXL]
[Full sub] -> [Full sub]
Major difference:
1) [Over-sub DRAM] has an off ramp to [Balanced] and [Full Sub]
2) [Balanced] and [Over-sub DRAM] now bounce off each other.
3) [Full Sub] is now a stable state
So in the most degenerate scenarios (over-subscription and
full-subscription) the system will trend toward stability.
This differs slightly from the naive bandwidth approach:
[Balanced] is no longer a stable state as we're seeking better
immediate latency for hot data. Maybe this is desired, maybe
it's harmful - this comes down to quality of data from some
profile or whatever.
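For concreteness, the same decision expressed as code (purely illustrative,
reusing the 300GB/s : 30GB/s capacities from above; [Balanced] and
[Under-sub CXL] are collapsed since they overlap in the diagrams):

/*
 * Illustrative classification of the system states above from measured
 * bandwidth, plus the bandwidth-aware promotion gate.  A sketch, not an
 * actual implementation.
 */
enum sys_state { OVER_SUB_DRAM, OVER_SUB_CXL, BALANCED, UNDER_SUB_CXL, FULL_SUB };

static enum sys_state classify(unsigned long dram_gbps, unsigned long cxl_gbps)
{
	const unsigned long dram_cap = 300, cxl_cap = 30;
	int dram_full = dram_gbps >= dram_cap;
	int cxl_full  = cxl_gbps >= cxl_cap;

	if (dram_full && cxl_full)
		return FULL_SUB;
	if (dram_full)
		return OVER_SUB_DRAM;
	if (cxl_full)
		return OVER_SUB_CXL;
	/* Balanced and Under-sub CXL overlap above; collapse them here. */
	return BALANCED;
}

/* With bandwidth data, promotion is gated only on DRAM having headroom. */
static int promotion_allowed(enum sys_state s)
{
	return s != OVER_SUB_DRAM && s != FULL_SUB;
}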
-------------------------------------------------------------
</wall of text>
~Gregory