From: David Rientjes <rientjes@google.com>
To: Davidlohr Bueso <dave@stgolabs.net>, Fan Ni <nifan.cxl@gmail.com>,
Gregory Price <gourry@gourry.net>,
Jonathan Cameron <Jonathan.Cameron@huawei.com>,
Joshua Hahn <joshua.hahnjy@gmail.com>,
Raghavendra K T <rkodsara@amd.com>,
"Rao, Bharata Bhasker" <bharata@amd.com>,
SeongJae Park <sj@kernel.org>, Wei Xu <weixugc@google.com>,
Xuezheng Chu <xuezhengchu@huawei.com>,
Yiannis Nikolakopoulos <yiannis@zptcorp.com>,
Zi Yan <ziy@nvidia.com>
Cc: linux-mm@kvack.org
Subject: [Linux Memory Hotness and Promotion] Notes from January 15, 2026
Date: Sat, 24 Jan 2026 19:35:43 -0800 (PST)
Message-ID: <684fb18e-6367-a043-3ee5-dd435da30b91@google.com>
Hi everybody,
Here are the notes from the last Linux Memory Hotness and Promotion call
that happened on Thursday, January 15. Thanks to everybody who was
involved!
These notes are intended to bring people who could not attend the call up
to speed, as well as to keep the conversation going between meetings.
----->o-----
We started by chatting about benchmarks and workloads that are useful for
evaluating different approaches to native memory tiering support in the
kernel. Wei Xu noted that there has been heavy reliance on memcached and
specjbb which have been useful to evaluate policy decisions. I brought up
the previous discussion in this series of meetings back in November where
Jonathan noted that memcached was not ideal because it's too predictable.
Yiannis reiterated that they've used a mixture of oversubscribed
workloads; it's never a single workload. He questioned how much this
represents real production-like scenarios, however, and wanted the
community to provide direction on how to evaluate this. Jonathan said
oversubscribed, multi-tenant containers would give temporal variation --
he gave the example of webservers that are sometimes busy and sometimes
idle. This may give the dynamics and variability needed to evaluate
different approaches.
Gregory compared this to using microbenchmarks that originally get
scheduled on CXL memory nodes and then finding the hot memory to promote
to top tier, but we can just randomize what pages are actually located on
the CXL device. He suggested that this was more functional testing than
production representative workloads. His plan is to run workloads with
synthetic data and then share that with the group for something that more
closely resembles real-world workloads.
Gregory noted that preventing churn, however, is the hard thing to
actually measure in situations where there is more warm/hot memory than
top tier memory. He suggested monitoring bandwidth stats: you back off if
bandwidth is high across all the devices.
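Gregory's back-off heuristic could be sketched as a simple userspace policy check. This is an illustrative toy only: the device names, the utilization inputs, and the 0.8 threshold are all assumptions, not anything the kernel exposes today.

```python
# Toy sketch of a bandwidth-based back-off policy: promotion is paused
# whenever every device in the tier is already running hot. The 0.8
# threshold and the per-device utilization numbers are assumptions.

def should_promote(utilization_by_device, threshold=0.8):
    """Return False (back off) if bandwidth is high across *all* devices."""
    return not all(u >= threshold for u in utilization_by_device.values())

# One CXL device still has headroom, so promotion proceeds:
print(should_promote({"cxl0": 0.9, "cxl1": 0.6}))   # True
# Every device is saturated, so we back off:
print(should_promote({"cxl0": 0.9, "cxl1": 0.85}))  # False
```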
I further suggested that performance consistency is likely more important
than small slices of time with optimal performance that may turn out to be
inconsistent. We want to avoid always optimizing memory locality only to
take it away later when another workload schedules or spikes in memory
usage. Gregory connected that with general QoS, which has two forms:
limiting the variance and minimizing the variance. Minimizing the
variance can degenerate very quickly; limiting the variance is likely the
goal.
Jonathan said that, today, the guard rail for limiting the variance is
page faulting, which is pretty slow. Gregory said we lack this on
multi-tenant systems because we don't have a sense of reclaim fairness, so
we can't limit the downside of any given workload. We'll need this to
provide a consistent quality of service.
Wei said that when we do promotion, the allocation does not trigger
direct reclaim, so if there is no space on top tier memory we simply
fail the promotion. Promotion itself will not cause this thrashing;
the question is whether we want to aggressively demote to make room for
promotion. He preferred to focus on getting a promotion story upstream
beyond just today's NUMA Balancing support.
Gregory said that the promotion rate is a function of the demotion rate
when capacity is fully used; thus, promotion will not occur if top tier
capacity is fully utilized. Demotion will only occur if new allocation
pressure happens.  So there is a guard rail, but the demotion policy has
to be left to the user.  If there is some amount of proactive demotion,
then that is the possible rate of churn. Capturing this as part of the
story is imperative; we won't be able to sell this without a comprehensive
story.
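Gregory's coupling argument can be made concrete with a toy two-tier model. Everything here (the class, the capacity numbers) is illustrative, not kernel code: promotion fails fast instead of reclaiming, as Wei described, so once the top tier is full, proactive demotion becomes the ceiling on churn.

```python
# Toy two-tier model: promotion succeeds only while the top tier has
# free pages; once full, a page can only be promoted after another is
# demoted, so the proactive demotion rate bounds the churn rate.

class TopTier:
    def __init__(self, capacity):
        self.capacity = capacity
        self.used = 0

    def promote(self):
        """Fail fast instead of triggering reclaim."""
        if self.used >= self.capacity:
            return False
        self.used += 1
        return True

    def demote(self):
        if self.used > 0:
            self.used -= 1

tier = TopTier(capacity=2)
assert tier.promote() and tier.promote()  # fills the tier
assert not tier.promote()                 # full: promotion fails, no reclaim
tier.demote()                             # proactive demotion frees one page
assert tier.promote()                     # ...which is exactly the churn budget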
----->o-----
I suggested that performance consistency is imperative; we don't want to
free a lot of top tier memory and then suddenly promote everything from
the bottom tier, only to find that when we land another workload the
performance of the first workload tanks.  Wei said that we need
to ensure that we only demote memory that matches the coldness definition
that the user asserts.  Gregory noted that, in this example, the original
workload is optimized for some level of consistent upward performance
motion, while the second workload necessarily ends up in the opposite
situation.
There was discussion about per-node memory limits, which have always been
met with resistance upstream.  Getting performance consistency across
hosts for a single workload, regardless of other tenants on the system,
would require a static allocation for each memory tier.  I suggested that we
could proactively demote or avoid promotion of warm memory to ensure that
there isn't transient performance improvement for a customer VM based on
other VMs that were running on the same host. This could be handled with
userspace policy through a memory.reclaim-like interface for demotion.
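As a sketch of what such a userspace policy could look like, the snippet below drives a memory.reclaim-style cgroup file. Note the hedge: cgroup v2's memory.reclaim itself exists today, but a demotion-targeted variant of it is purely an assumption for illustration.

```python
# Hypothetical userspace demotion policy driven through a
# memory.reclaim-like cgroup v2 file. memory.reclaim exists today;
# a demotion-specific variant of it is an assumption.

def request_demotion(cgroup_path, nbytes):
    """Ask the kernel to proactively demote nbytes from this cgroup."""
    with open(f"{cgroup_path}/memory.reclaim", "w") as f:
        f.write(f"{nbytes}\n")

# Usage (requires a real cgroup v2 hierarchy):
# request_demotion("/sys/fs/cgroup/vm-guest-1", 1 << 30)  # demote 1 GiB
```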
----->o-----
Next meeting will be on Thursday, January 29 at 8:30am PST (UTC-8),
everybody is welcome: https://meet.google.com/jak-ytdx-hnm
Topics for the next meeting:
- updates on Bharata's patch series with new benchmarks and consolidation
of tunables
- avoiding noisy neighbor situations especially for cloud workloads based
on the amount of hot memory that may saturate the top tier
- later: Gregory's analysis of more production-like workloads
- discuss generalized subsystem for providing bandwidth information
independent of the underlying platform, ideally through resctrl,
otherwise utilizing bandwidth information will be challenging
  + preferably this bandwidth monitoring is not per NUMA node but rather
    per tier (slow vs. fast)
- similarly, discuss generalized subsystem for providing memory hotness
information
- determine minimal viable upstream opportunity to optimize for tiering
that is extensible for future use cases and optimizations
+ extensible for multiple tiers
+ suggestion: limited to 8 bits per page to start, add a precision mode
later
+ limited to 64 bits per page as a ceiling, may be less
+ must be possible to disable with no memory or performance overhead
- update on non-temporal stores enlightenment for memory tiering
- enlightening migrate_pages() for hardware assists and how this work
will be charged to userspace, including for memory compaction
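As an illustration of the "8 bits per page" suggestion above, a hotness tracker could be as small as a saturating counter that ages over time. The specific increment and decay policy below is an assumption for illustration, not an upstream design.

```python
# Illustrative 8-bit saturating hotness counter per page: accesses bump
# it (capped at 255), and a periodic aging tick halves it so stale pages
# cool off. Increment size and decay policy are assumptions.

HOT_MAX = 255  # ceiling of an 8-bit counter

def on_access(counter):
    return min(counter + 16, HOT_MAX)

def on_aging_tick(counter):
    return counter >> 1  # exponential decay keeps recently-hot pages hot

c = 0
for _ in range(20):                  # a burst of accesses saturates the counter
    c = on_access(c)
assert c == HOT_MAX
c = on_aging_tick(on_aging_tick(c))  # two idle aging periods
assert c == 63
```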
Please let me know if you'd like to propose additional topics for
discussion, thank you!