From: Gregory Price <gourry@gourry.net>
To: David Rientjes <rientjes@google.com>
Cc: Davidlohr Bueso <dave@stgolabs.net>, Fan Ni <nifan.cxl@gmail.com>,
Jonathan Cameron <Jonathan.Cameron@huawei.com>,
Joshua Hahn <joshua.hahnjy@gmail.com>,
Raghavendra K T <rkodsara@amd.com>,
"Rao, Bharata Bhasker" <bharata@amd.com>,
SeongJae Park <sj@kernel.org>, Wei Xu <weixugc@google.com>,
Xuezheng Chu <xuezhengchu@huawei.com>,
Yiannis Nikolakopoulos <yiannis@zptcorp.com>,
Zi Yan <ziy@nvidia.com>,
linux-mm@kvack.org
Subject: Re: [Linux Memory Hotness and Promotion] Notes from October 9, 2025
Date: Fri, 17 Oct 2025 13:23:53 -0400
Message-ID: <aPJ7qe_2xznFSUMZ@gourry-fedora-PF4VCD3F>
In-Reply-To: <3a1586b1-4107-06dc-f630-8951cc044c5a@google.com>
On Sat, Oct 11, 2025 at 06:48:59PM -0700, David Rientjes wrote:
> Gregory noted that both latency and bandwidth were related; once the
> bandwidth is over-subscribed, the latency goes through the roof. The
> kernel wouldn't want to stop paying attention to bandwidth. We should
> decide if we're just going to allow this kernel agent in the background
> to continue to promote memory to optimize for latency.
>
> We discussed how per-job attributes would play into this. Gregory was
> looking at this from the perspective of optimizing for the entire system
> rather than per job. If we care about latency, then we have to care about
> bandwidth. Gregory suggested two ways of thinking about it:
>
> - we're over-subscribed in DRAM and need to offload some hot memory to
> CXL
> - minimize the bandwidth to CXL as much as possible because there's
> headroom on DRAM
To make the thoughts here a little more concrete, consider the following
Bandwidth Capacities:
[cpu]---[dram] 300GB/s
|-----[cxl] 30GB/s
On this system we have a 10:1 distribution of bandwidth. We can think
of 5 relevant "System States" with these limits in mind.
[Over-sub DRAM] - CPU is stalling on DRAM access
[cpu]---[dram] 320GB/s - would use more than is available if it could
|-----[cxl] 0GB/s
[Over-sub CXL] - Headroom on DRAM, but latency takes a hit as CXL is hot
[cpu]---[dram] 0-270GB/s
|-----[cxl] 30GB/s
[Balanced] - No Over-sub, CPU isn't stalling
[cpu]---[dram] 0-300GB/s
|-----[cxl] 0-30GB/s
[Under-sub CXL] - Headroom on CXL, DRAM may or may not be at limit
[cpu]---[dram] 0-300GB/s
|-----[cxl] 0-29GB/s
[Full-sub] - Links are fully saturated.
[cpu]---[dram] 320GB/s
|-----[cxl] 32GB/s
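A rough sketch of how the five states above could be told apart from
measured demand. The capacities are the 300GB/s and 30GB/s from the
example. Everything else is an assumption made for illustration: the
function name, treating a link as saturated once demand reaches its
capacity, and the `tolerance` knob used to separate [Balanced] from
[Under-sub CXL] (the ranges listed above overlap, so some tie-break is
needed).

```python
# Capacities from the example above, in GB/s.
DRAM_CAP = 300.0
CXL_CAP = 30.0


def classify(dram_demand, cxl_demand, tolerance=0.1):
    """Map demanded bandwidth (GB/s) to one of the five system states.

    ASSUMPTION: a link counts as saturated once demand reaches capacity,
    and `tolerance` is a hypothetical knob, not something from the thread.
    """
    dram_sat = dram_demand >= DRAM_CAP
    cxl_sat = cxl_demand >= CXL_CAP
    if dram_sat and cxl_sat:
        return "Full-sub"
    if dram_sat:
        return "Over-sub DRAM"
    if cxl_sat:
        return "Over-sub CXL"
    # Neither link is saturated: compare utilization fractions.
    dram_util = dram_demand / DRAM_CAP
    cxl_util = cxl_demand / CXL_CAP
    if dram_util - cxl_util > tolerance:
        return "Under-sub CXL"  # CXL has proportionally more headroom
    return "Balanced"


print(classify(320, 0))   # Over-sub DRAM
print(classify(200, 30))  # Over-sub CXL
print(classify(320, 32))  # Full-sub
```

Demand above capacity (320GB/s on a 300GB/s link) is what the CPU would
use if it could, so in practice it would have to be inferred from stall
counters or similar rather than read directly.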
----------------------------------------------------------
Minimizing Average Random-Access Latency / Naive Bandwidth
----------------------------------------------------------
In this scenario, you want any given [RANDOM ACCESS] to produce the lowest
latency possible. This is different than [PREDICTABLE ACCESS] patterns.
In our 5 states above, what is the best state transition we can do?
[Over-sub DRAM] -> [Balanced] or [Full sub]
[Over-sub CXL] -> [Balanced] or [Under-sub CXL]
[Balanced] -> [Balanced]
[Under-sub CXL] -> [Balanced] or [Under-sub CXL]
[Full sub] -> [Full sub]
All scenarios trend toward [Balanced], and once we reach
[Balanced] or [Full sub], we start doing nothing, as we recognize
any movement is likely harmful.
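The transition list above can be written down as a small table and
checked mechanically. This is only a sketch: the dictionary is copied
from the list, while the names `BALANCE_BW` and `stable_states` are
mine, and "stable" is taken to mean "the only listed successor is the
state itself".

```python
# Transition table copied from the list above.
BALANCE_BW = {
    "Over-sub DRAM": {"Balanced", "Full-sub"},
    "Over-sub CXL": {"Balanced", "Under-sub CXL"},
    "Balanced": {"Balanced"},
    "Under-sub CXL": {"Balanced", "Under-sub CXL"},
    "Full-sub": {"Full-sub"},
}


def stable_states(table):
    """Return the states whose only successor is themselves."""
    return {state for state, nxt in table.items() if nxt == {state}}


print(sorted(stable_states(BALANCE_BW)))  # ['Balanced', 'Full-sub']
```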
-----------------------------------------------------------
Minimizing Average Predictable Access Latency w/o Bandwidth
-----------------------------------------------------------
Let's assume we have perfect knowledge of the following:
- [Chunk A] is hot and on a remote node.
For demotion to matter you actually need future information:
- [Chunk B] on local node IS HOT and ABOUT TO become COLD.
So we will consider demotion as having no effect on BW utilization.
Without bandwidth data, naive promotion produces the following:
[Over-sub DRAM] -> [Over-sub DRAM]
[Over-sub CXL] -> [Over-sub DRAM] or [Balanced] or [Under-sub CXL]
[Balanced] -> [Over-sub DRAM] or [Balanced]
[Under-sub CXL] -> [Over-sub DRAM] or [Balanced] or [Under-sub CXL]
[Full sub] -> [Over-sub DRAM] or [Full sub]
1) All states trend toward [Over-sub DRAM]
2) If you never hit [Over-sub DRAM], you trend toward [Balanced]
3) [Full sub] is now an actively unstable state.
But remember, promotion *drives bandwidth*, so the naive approach
can easily push the system state into any given [Over-sub] scenario.
Overall this system trends towards [Over-sub DRAM], because [Balanced]
and [Full sub] trend toward [Over-sub DRAM].
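To sanity-check the claim that everything drains into [Over-sub DRAM],
here is a sketch that treats the naive-promotion list as a directed
graph and asks which states can be reached from where. The table is
copied from the list above. The helper names `reachable` and `sinks`
are hypothetical, and a "sink" here means a state that cannot be left
once entered.

```python
# Transition table copied from the naive-promotion list above.
NAIVE_PROMOTION = {
    "Over-sub DRAM": {"Over-sub DRAM"},
    "Over-sub CXL": {"Over-sub DRAM", "Balanced", "Under-sub CXL"},
    "Balanced": {"Over-sub DRAM", "Balanced"},
    "Under-sub CXL": {"Over-sub DRAM", "Balanced", "Under-sub CXL"},
    "Full-sub": {"Over-sub DRAM", "Full-sub"},
}


def reachable(table, start):
    """Return every state reachable from `start`, including itself."""
    seen = {start}
    stack = [start]
    while stack:
        for nxt in table[stack.pop()]:
            if nxt not in seen:
                seen.add(nxt)
                stack.append(nxt)
    return seen


def sinks(table):
    """Return the states that can never be left once entered."""
    return {s for s in table if reachable(table, s) == {s}}


print(sinks(NAIVE_PROMOTION))  # {'Over-sub DRAM'}
print(all("Over-sub DRAM" in reachable(NAIVE_PROMOTION, s)
          for s in NAIVE_PROMOTION))  # True
```

Every state can reach [Over-sub DRAM] and nothing leaves it, which is
the graph version of points 1) and 3) above.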
-----------------------------------------------------------
Minimizing Average Predictable Access Latency w/ Bandwidth
-----------------------------------------------------------
Now let's augment the naive approach with bandwidth data.
[Over-sub DRAM] -> [Balanced] or [Full Sub]
[Over-sub CXL] -> [Balanced] or [Under-sub CXL]
[Balanced] -> [Over-sub DRAM] or [Balanced]
[Under-sub CXL] -> [Balanced] or [Under-sub CXL]
[Full sub] -> [Full sub]
Major differences:
1) [Over-sub DRAM] has an off-ramp to [Balanced] and [Full Sub]
2) [Balanced] and [Over-sub DRAM] now bounce off each other.
3) [Full Sub] is now a stable state
So in the most degenerate scenarios (over-subscription and
full-subscription) the system will trend toward stability.
This differs slightly from the naive bandwidth approach:
[Balanced] is no longer a stable state as we're seeking better
immediate latency for hot data. Maybe this is desired, maybe
it's harmful - this comes down to quality of data from some
profile or whatever.
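One way to read the bandwidth-aware table is as a three-way decision
made from observed bandwidth. The sketch below is my interpretation,
not an existing kernel interface: the function name, the action
strings, and the ">= capacity" saturation test are all assumptions,
and the capacities are the 300GB/s and 30GB/s from the example.

```python
# Capacities from the example above, in GB/s.
DRAM_CAP = 300.0
CXL_CAP = 30.0


def next_action(dram_bw, cxl_bw):
    """Pick a migration action from observed bandwidth (GB/s).

    ASSUMPTION: action names and the saturation test are illustrative.
    """
    dram_sat = dram_bw >= DRAM_CAP
    cxl_sat = cxl_bw >= CXL_CAP
    if dram_sat and cxl_sat:
        return "hold"     # [Full sub]: any movement is likely harmful
    if dram_sat:
        return "offload"  # [Over-sub DRAM]: move some hot memory to CXL
    return "promote"      # DRAM has headroom: chase latency for hot data


print(next_action(320, 32))  # hold
print(next_action(320, 0))   # offload
print(next_action(150, 15))  # promote
```

The last branch is what makes [Balanced] unstable: with headroom on
DRAM the policy keeps promoting, and that promotion is what can push
the system back into [Over-sub DRAM].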
-------------------------------------------------------------
</wall of text>
~Gregory