Re: [Linux Memory Hotness and Promotion] Notes from October 9, 2025

From: Gregory Price <gourry@gourry.net>
To: David Rientjes <rientjes@google.com>
Cc: Davidlohr Bueso <dave@stgolabs.net>, Fan Ni <nifan.cxl@gmail.com>,
	Jonathan Cameron <Jonathan.Cameron@huawei.com>,
	Joshua Hahn <joshua.hahnjy@gmail.com>,
	Raghavendra K T <rkodsara@amd.com>,
	"Rao, Bharata Bhasker" <bharata@amd.com>,
	SeongJae Park <sj@kernel.org>, Wei Xu <weixugc@google.com>,
	Xuezheng Chu <xuezhengchu@huawei.com>,
	Yiannis Nikolakopoulos <yiannis@zptcorp.com>,
	Zi Yan <ziy@nvidia.com>,
	linux-mm@kvack.org
Subject: Re: [Linux Memory Hotness and Promotion] Notes from October 9, 2025
Date: Fri, 17 Oct 2025 13:23:53 -0400
Message-ID: <aPJ7qe_2xznFSUMZ@gourry-fedora-PF4VCD3F>
In-Reply-To: <3a1586b1-4107-06dc-f630-8951cc044c5a@google.com>

On Sat, Oct 11, 2025 at 06:48:59PM -0700, David Rientjes wrote:
> Gregory noted that both latency and bandwidth were related; once the
> bandwidth is over-subscribed, the latency goes through the roof.  The
> kernel wouldn't want to stop paying attention to bandwidth.  We should
> decide if we're just going to allow this kernel agent in the background
> to continue to promote memory to optimize for latency.
> 
> We discussed how per-job attributes would play into this.  Gregory was 
> looking at this from the perspective of optimizing for the entire system
> rather than per job.  If we care about latency, then we have to care about 
> bandwidth.  Gregory suggested two ways of thinking about it:
> 
>  - we're over-subscribed in DRAM and need to offload some hot memory to
>    CXL
>  - minimize the bandwidth to CXL as much as possible because there's
>    headroom on DRAM

Making the thoughts here a little more discrete, consider the following:

Bandwidth Capacities:
[cpu]---[dram] 300GB/s
  |-----[cxl]  30GB/s

On this system we have a 10:1 distribution of bandwidth.  We can think
of 5 relevant "System States" with these limits in mind.

[Over-sub DRAM] - CPU is stalling on DRAM access
[cpu]---[dram] 320GB/s - would use more than is available if it could
  |-----[cxl]  0GB/s

[Over-sub CXL]  - Headroom on DRAM, but latency hit as CXL is hot
[cpu]---[dram] 0-270GB/s
  |-----[cxl]  30GB/s

[Balanced]      - No Over-sub, CPU isn't stalling
[cpu]---[dram] 0-300GB/s
  |-----[cxl]  0-30GB/s

[Under-sub CXL] - Headroom on CXL, DRAM may or may not be at limit
[cpu]---[dram] 0-300GB/s
  |-----[cxl]  0-29GB/s

[Full-sub]      - Links are fully saturated.
[cpu]---[dram] 320GB/s
  |-----[cxl]  32GB/s
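
A rough Python sketch of how these labels could be applied to a pair of
demand figures.  Everything here is an assumption for illustration: the
helper name matching_states, the use of "demand" (what the CPU would
consume if it could) as input, and the rule that a link counts as
saturated once demand is at or above its capacity.  Because the state
descriptions overlap, the helper returns every label that fits rather
than exactly one.

```python
# Hypothetical helper; names and the saturation rule are illustrative only.
DRAM_CAP_GBPS = 300  # capacity figures from the example system above
CXL_CAP_GBPS = 30


def matching_states(dram_demand, cxl_demand,
                    dram_cap=DRAM_CAP_GBPS, cxl_cap=CXL_CAP_GBPS):
    """Return every state label whose description fits the given demand.

    Assumption: a link is "saturated" when demand is at or above capacity.
    """
    dram_sat = dram_demand >= dram_cap
    cxl_sat = cxl_demand >= cxl_cap

    states = []
    if dram_sat and cxl_sat:
        states.append("Full-sub")
    elif dram_sat:
        states.append("Over-sub DRAM")
    elif cxl_sat:
        states.append("Over-sub CXL")
    else:
        states.append("Balanced")
    # Under-sub CXL only needs headroom on CXL; DRAM may or may not be at limit.
    if not cxl_sat:
        states.append("Under-sub CXL")
    return states


if __name__ == "__main__":
    for demand in [(320, 0), (200, 30), (150, 10), (320, 32)]:
        print(demand, matching_states(*demand))
```

A real classifier would work from measured utilization and stall data
rather than a known demand figure; the sketch only makes the overlap
between the labels explicit.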

----------------------------------------------------------
Minimizing Average Random-Access Latency / Naive Bandwidth
----------------------------------------------------------
In this scenario, you want any given [RANDOM ACCESS] to produce the lowest
latency possible. This is different from [PREDICTABLE ACCESS] patterns.

In our 5 scenarios above, what is the best state transition we can make?

[Over-sub DRAM] -> [Balanced] or [Full-sub]
[Over-sub CXL]  -> [Balanced] or [Under-sub CXL]
[Balanced]      -> [Balanced]
[Under-sub CXL] -> [Balanced] or [Under-sub CXL]
[Full-sub]      -> [Full-sub]

All scenarios trend toward [Balanced], and once we reach
[Balanced] or [Full-sub], we start doing nothing, as we recognize
any movement is likely harmful.
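
The table above can be written down as a small transition map and
checked mechanically.  This is only a sketch: the dictionary name
NAIVE_BANDWIDTH and the helper stable_states are made up for
illustration, and "stable" is taken to mean "the only listed transition
is back to the same state".

```python
# Transition table copied from the list above; names are hypothetical.
NAIVE_BANDWIDTH = {
    "Over-sub DRAM": {"Balanced", "Full-sub"},
    "Over-sub CXL": {"Balanced", "Under-sub CXL"},
    "Balanced": {"Balanced"},
    "Under-sub CXL": {"Balanced", "Under-sub CXL"},
    "Full-sub": {"Full-sub"},
}


def stable_states(transitions):
    """States whose only possible next state is themselves."""
    return sorted(s for s, nxt in transitions.items() if nxt == {s})


if __name__ == "__main__":
    print(stable_states(NAIVE_BANDWIDTH))
```

Under that definition the stable states come out as [Balanced] and
the fully-subscribed state, matching the "do nothing" cases described
above.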

-----------------------------------------------------------
Minimizing Average Predictable Access Latency w/o Bandwidth
-----------------------------------------------------------
Let's assume we have perfect knowledge of the following:
- [Chunk A] is hot and on a remote node.

For demotion to matter you actually need future information:
- [Chunk B] on local node IS HOT and ABOUT TO become COLD.
So we will consider demotion as having no effect on BW utilization.

Without bandwidth data, naive promotion produces the following:

[Over-sub DRAM] -> [Over-sub DRAM]
[Over-sub CXL]  -> [Over-sub DRAM] or [Balanced] or [Under-sub CXL]
[Balanced]      -> [Over-sub DRAM] or [Balanced]
[Under-sub CXL] -> [Over-sub DRAM] or [Balanced] or [Under-sub CXL]
[Full-sub]      -> [Over-sub DRAM] or [Full-sub]

1) All states trend toward [Over-sub DRAM].
2) If you never reach [Over-sub DRAM], you trend toward [Balanced].
3) [Full-sub] is now an actively unstable state.

But remember, promotion *drives bandwidth*, so the naive approach
can easily push the system state into any given [Over-sub] scenario.

Overall this system trends towards [Over-sub DRAM], because [Balanced]
and [Full-sub] trend toward [Over-sub DRAM].
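
The "everything trends toward [Over-sub DRAM]" claim can be checked with
a simple reachability walk over the naive-promotion table.  This is an
illustrative sketch only; NAIVE_PROMOTION and reachable are hypothetical
names, and the walk says nothing about how likely each transition is,
only whether it is possible.

```python
# Naive-promotion transitions copied from the table above; names are hypothetical.
NAIVE_PROMOTION = {
    "Over-sub DRAM": {"Over-sub DRAM"},
    "Over-sub CXL": {"Over-sub DRAM", "Balanced", "Under-sub CXL"},
    "Balanced": {"Over-sub DRAM", "Balanced"},
    "Under-sub CXL": {"Over-sub DRAM", "Balanced", "Under-sub CXL"},
    "Full-sub": {"Over-sub DRAM", "Full-sub"},
}


def reachable(transitions, start):
    """Set of states reachable from start in one or more transitions."""
    seen = set()
    stack = [start]
    while stack:
        state = stack.pop()
        for nxt in transitions[state]:
            if nxt not in seen:
                seen.add(nxt)
                stack.append(nxt)
    return seen


if __name__ == "__main__":
    for state in NAIVE_PROMOTION:
        print(state, "->", sorted(reachable(NAIVE_PROMOTION, state)))
```

Every state can reach [Over-sub DRAM], and once there the only listed
transition is back to itself, which is why it acts as the sink for the
naive approach.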

-----------------------------------------------------------
Minimizing Average Predictable Access Latency w/ Bandwidth
-----------------------------------------------------------
Now let's augment the naive approach with bandwidth data.

[Over-sub DRAM] -> [Balanced] or [Full-sub]
[Over-sub CXL]  -> [Balanced] or [Under-sub CXL]
[Balanced]      -> [Over-sub DRAM] or [Balanced]
[Under-sub CXL] -> [Balanced] or [Under-sub CXL]
[Full-sub]      -> [Full-sub]

Major differences:
1) [Over-sub DRAM] has an off-ramp to [Balanced] and [Full-sub].
2) [Balanced] and [Over-sub DRAM] now bounce off each other.
3) [Full-sub] is now a stable state.

So in the most degenerate scenarios (over-subscription and
full-subscription) the system will trend toward stability.

This differs slightly from the naive bandwidth approach:
   [Balanced] is no longer a stable state as we're seeking better
   immediate latency for hot data.  Maybe this is desired, maybe
   it's harmful - this comes down to quality of data from some
   profile or whatever.
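
The "bounce" between [Balanced] and [Over-sub DRAM] shows up as a
two-state cycle in the bandwidth-aware table.  A small sketch that finds
such pairs, with BANDWIDTH_AWARE and bounce_pairs being hypothetical
names chosen for illustration:

```python
# Bandwidth-aware transitions copied from the table above; names are hypothetical.
BANDWIDTH_AWARE = {
    "Over-sub DRAM": {"Balanced", "Full-sub"},
    "Over-sub CXL": {"Balanced", "Under-sub CXL"},
    "Balanced": {"Over-sub DRAM", "Balanced"},
    "Under-sub CXL": {"Balanced", "Under-sub CXL"},
    "Full-sub": {"Full-sub"},
}


def bounce_pairs(transitions):
    """Pairs of distinct states that can each transition to the other."""
    pairs = set()
    for a, nxt in transitions.items():
        for b in nxt:
            if a != b and a in transitions[b]:
                pairs.add(tuple(sorted((a, b))))
    return sorted(pairs)


if __name__ == "__main__":
    print(bounce_pairs(BANDWIDTH_AWARE))
```

The only such pair is ([Balanced], [Over-sub DRAM]); whether that
oscillation is acceptable depends on how costly each promotion round is,
which the table alone does not capture.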

-------------------------------------------------------------

</wall of text>

~Gregory
