From: David Rientjes <rientjes@google.com>
To: Davidlohr Bueso <dave@stgolabs.net>, Fan Ni <nifan.cxl@gmail.com>,
Gregory Price <gourry@gourry.net>,
Jonathan Cameron <Jonathan.Cameron@huawei.com>,
Joshua Hahn <joshua.hahnjy@gmail.com>,
Raghavendra K T <rkodsara@amd.com>,
"Rao, Bharata Bhasker" <bharata@amd.com>,
SeongJae Park <sj@kernel.org>, Wei Xu <weixugc@google.com>,
Xuezheng Chu <xuezhengchu@huawei.com>,
Yiannis Nikolakopoulos <yiannis@zptcorp.com>,
Zi Yan <ziy@nvidia.com>
Cc: linux-mm@kvack.org
Subject: [Linux Memory Hotness and Promotion] Notes from January 15, 2026
Date: Sat, 24 Jan 2026 19:35:43 -0800 (PST)
Message-ID: <684fb18e-6367-a043-3ee5-dd435da30b91@google.com>
Hi everybody,
Here are the notes from the last Linux Memory Hotness and Promotion call
that happened on Thursday, January 15. Thanks to everybody who was
involved!
These notes are intended to bring people who could not attend the call up
to speed, as well as to keep the conversation going between meetings.
----->o-----
We started by chatting about benchmarks and workloads that are useful for
evaluating different approaches to native memory tiering support in the
kernel. Wei Xu noted that there has been heavy reliance on memcached and
specjbb which have been useful to evaluate policy decisions. I brought up
the previous discussion in this series of meetings back in November where
Jonathan noted that memcached was not ideal because it's too predictable.
Yiannis reiterated that they've used a mixture of oversubscribed
workloads; it's never a single workload. He questioned how much this
represents real production-like scenarios, however, and wanted the
community to provide direction on how to evaluate this. Jonathan said
oversubscribed, multi-tenant containers would give temporal variation --
he gave the example of webservers that are sometimes busy and sometimes
idle. This may give the dynamics and variability needed to evaluate
different approaches.
Gregory compared this to using microbenchmarks that originally get
scheduled on CXL memory nodes and then finding the hot memory to promote
to top tier, but we can just randomize what pages are actually located on
the CXL device. He suggested that this was more functional testing than
production representative workloads. His plan is to run workloads with
synthetic data and then share that with the group for something that more
closely resembles real-world workloads.
Gregory noted that preventing churn, however, is the hard thing to
actually measure in situations where there is more warm/hot memory than
top tier memory. He suggested monitoring bandwidth stats: you back off if
bandwidth is high across all the devices.
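Gregory's back-off heuristic could be sketched as a simple userspace policy check. This is an illustrative toy only: the device names, the utilization inputs, and the 0.8 threshold are all assumptions, not anything the kernel exposes today.

```python
# Toy sketch of a bandwidth-based back-off policy: promotion is paused
# whenever every device in the tier is already running hot. The 0.8
# threshold and the per-device utilization numbers are assumptions.

def should_promote(utilization_by_device, threshold=0.8):
    """Return False (back off) if bandwidth is high across *all* devices."""
    return not all(u >= threshold for u in utilization_by_device.values())

# One CXL device still has headroom, so promotion proceeds:
print(should_promote({"cxl0": 0.9, "cxl1": 0.6}))   # True
# Every device is saturated, so we back off:
print(should_promote({"cxl0": 0.9, "cxl1": 0.85}))  # False
```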
I further suggested that performance consistency is likely more important
than small slices of time with optimal performance that may turn out to be
inconsistent. We want to avoid always optimizing memory locality only to
take it away later when another workload schedules or spikes in memory
usage. Gregory connected that with general QoS, which has two forms:
limiting the variance and minimizing the variance. Minimizing the
variance can degenerate very quickly; limiting the variance is likely the
goal.
Jonathan said that, today, the guard rail for limiting the variance is
page faulting, which is pretty slow. Gregory said we lack this on
multi-tenant systems because we don't have a sense of reclaim fairness, so
we can't limit the downside of any given workload. We'll need this to
provide a consistent quality of service.
Wei said that when we do promotion, the allocation does not trigger
direct reclaim, so if there is no space on top tier memory we simply
fail the promotion. Promotion itself will not cause this thrashing;
the question is whether we want to aggressively demote to make room for
promotion. He preferred to focus on getting a promotion story upstream
beyond just today's NUMA Balancing support.
Gregory said that the promotion rate is a function of the demotion rate
when capacity is fully used; thus, promotion will not occur if top tier
capacity is fully utilized. Demotion will only occur if new allocation
pressure happens.  So there is a guard rail, but the demotion policy has
to be left to the user.  If there is some amount of proactive demotion,
then that is the possible rate of churn. Capturing this as part of the
story is imperative; we won't be able to sell this without a comprehensive
story.
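Gregory's coupling argument can be made concrete with a toy two-tier model. Everything here (the class, the capacity numbers) is illustrative, not kernel code: promotion fails fast instead of reclaiming, as Wei described, so once the top tier is full, proactive demotion becomes the ceiling on churn.

```python
# Toy two-tier model: promotion succeeds only while the top tier has
# free pages; once full, a page can only be promoted after another is
# demoted, so the proactive demotion rate bounds the churn rate.

class TopTier:
    def __init__(self, capacity):
        self.capacity = capacity
        self.used = 0

    def promote(self):
        """Fail fast instead of triggering reclaim."""
        if self.used >= self.capacity:
            return False
        self.used += 1
        return True

    def demote(self):
        if self.used > 0:
            self.used -= 1

tier = TopTier(capacity=2)
assert tier.promote() and tier.promote()  # fills the tier
assert not tier.promote()                 # full: promotion fails, no reclaim
tier.demote()                             # proactive demotion frees one page
assert tier.promote()                     # ...which is exactly the churn budget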
----->o-----
I suggested that performance consistency is imperative; we don't want to
free a lot of top tier memory and then suddenly promote everything from
the bottom tier, only to find that when we land another workload the
performance of the first workload tanks.  Wei said that we need
to ensure that we only demote memory that matches the coldness definition
that the user asserts.  Gregory noted that, in this example, the original
workload is optimized for some level of consistent upward performance
motion, while the second workload necessarily ends up in the opposite
situation.
There was discussion about per-node memory limits, which have always been
met with resistance upstream.  Getting performance consistency across
hosts for a single workload, regardless of other tenants on the system,
would require a static allocation for each memory tier.  I suggested that we
could proactively demote or avoid promotion of warm memory to ensure that
there isn't transient performance improvement for a customer VM based on
other VMs that were running on the same host. This could be handled with
userspace policy through a memory.reclaim-like interface for demotion.
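As a sketch of what such a userspace policy could look like, the snippet below drives a memory.reclaim-style cgroup file. Note the hedge: cgroup v2's memory.reclaim itself exists today, but a demotion-targeted variant of it is purely an assumption for illustration.

```python
# Hypothetical userspace demotion policy driven through a
# memory.reclaim-like cgroup v2 file. memory.reclaim exists today;
# a demotion-specific variant of it is an assumption.

def request_demotion(cgroup_path, nbytes):
    """Ask the kernel to proactively demote nbytes from this cgroup."""
    with open(f"{cgroup_path}/memory.reclaim", "w") as f:
        f.write(f"{nbytes}\n")

# Usage (requires a real cgroup v2 hierarchy):
# request_demotion("/sys/fs/cgroup/vm-guest-1", 1 << 30)  # demote 1 GiB
```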
----->o-----
Next meeting will be on Thursday, January 29 at 8:30am PST (UTC-8),
everybody is welcome: https://meet.google.com/jak-ytdx-hnm
Topics for the next meeting:
- updates on Bharata's patch series with new benchmarks and consolidation
of tunables
- avoiding noisy neighbor situations especially for cloud workloads based
on the amount of hot memory that may saturate the top tier
- later: Gregory's analysis of more production-like workloads
- discuss generalized subsystem for providing bandwidth information
independent of the underlying platform, ideally through resctrl,
otherwise utilizing bandwidth information will be challenging
  + preferably this bandwidth monitoring is not per NUMA node but rather
    per tier (slow vs. fast)
- similarly, discuss generalized subsystem for providing memory hotness
information
- determine minimal viable upstream opportunity to optimize for tiering
that is extensible for future use cases and optimizations
+ extensible for multiple tiers
+ suggestion: limited to 8 bits per page to start, add a precision mode
later
+ limited to 64 bits per page as a ceiling, may be less
+ must be possible to disable with no memory or performance overhead
- update on non-temporal stores enlightenment for memory tiering
- enlightening migrate_pages() for hardware assists and how this work
will be charged to userspace, including for memory compaction
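As an illustration of the "8 bits per page" suggestion above, a hotness tracker could be as small as a saturating counter that ages over time. The specific increment and decay policy below is an assumption for illustration, not an upstream design.

```python
# Illustrative 8-bit saturating hotness counter per page: accesses bump
# it (capped at 255), and a periodic aging tick halves it so stale pages
# cool off. Increment size and decay policy are assumptions.

HOT_MAX = 255  # ceiling of an 8-bit counter

def on_access(counter):
    return min(counter + 16, HOT_MAX)

def on_aging_tick(counter):
    return counter >> 1  # exponential decay keeps recently-hot pages hot

c = 0
for _ in range(20):                  # a burst of accesses saturates the counter
    c = on_access(c)
assert c == HOT_MAX
c = on_aging_tick(on_aging_tick(c))  # two idle aging periods
assert c == 63
```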
Please let me know if you'd like to propose additional topics for
discussion, thank you!