linux-mm.kvack.org archive mirror
From: David Rientjes <rientjes@google.com>
To: Davidlohr Bueso <dave@stgolabs.net>, Fan Ni <nifan.cxl@gmail.com>,
	 Gregory Price <gourry@gourry.net>,
	 Jonathan Cameron <Jonathan.Cameron@huawei.com>,
	 Joshua Hahn <joshua.hahnjy@gmail.com>,
	Raghavendra K T <rkodsara@amd.com>,
	 "Rao, Bharata Bhasker" <bharata@amd.com>,
	SeongJae Park <sj@kernel.org>,  Wei Xu <weixugc@google.com>,
	Xuezheng Chu <xuezhengchu@huawei.com>,
	 Yiannis Nikolakopoulos <yiannis@zptcorp.com>,
	Zi Yan <ziy@nvidia.com>
Cc: linux-mm@kvack.org
Subject: [Linux Memory Hotness and Promotion] Notes from February 12, 2026
Date: Sat, 14 Feb 2026 20:04:55 -0800 (PST)	[thread overview]
Message-ID: <5146e3b4-751f-ca6a-0bdd-7b1f4d425ff0@google.com> (raw)

Hi everybody,

Here are the notes from the last Linux Memory Hotness and Promotion call
that happened on Thursday, February 12.  Thanks to everybody who was 
involved!

These notes are intended to bring people up to speed who could not attend 
the call as well as keep the conversation going in between meetings.

----->o-----
Bharata updated on the status of v5 of his patch series from a couple 
weeks ago with two modes of operation.  Based on discussion with people 
at LPC, he proposed two mechanisms to track hotness that would have 
vastly different memory footprint requirements, one as little as one byte 
per page.  He posted benchmark numbers over a series of updates upstream.  
He found a clear advantage of hot page promotion when top tier memory was 
not saturated for both of these modes of operation.

However, when there is top-tier memory pressure (when the benchmark 
working set spills over into CXL memory), then we tend to demote and 
promote simultaneously, as expected.  Interestingly, he has not observed 
promotion in these cases actually being helpful; in fact, in some cases it 
actually turns out to hurt.  He asked for feedback on tunings for when 
demotion can actually help this scenario.  Note, however, that in one of 
these benchmarking scenarios the benchmark performed random access.
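The notes say one of Bharata's two tracking modes needs as little as one
byte of state per page, but they do not describe the encoding.  For
illustration only, one common shape for such a tracker is a per-page
saturating counter with periodic exponential decay; everything below
(names, decay policy, threshold) is an assumption, not the series' actual
design:

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>

/* One byte of state per page: a saturating access counter that a
 * periodic scanner halves so that stale accesses age out. */
static uint8_t *hotness;	/* indexed by page frame number */

static void record_access(unsigned long pfn)
{
	if (hotness[pfn] < UINT8_MAX)
		hotness[pfn]++;
}

/* Called periodically: exponential decay of every counter. */
static void age_all(unsigned long nr_pages)
{
	for (unsigned long pfn = 0; pfn < nr_pages; pfn++)
		hotness[pfn] >>= 1;
}

/* A page becomes a promotion candidate once its counter crosses a
 * (tunable, here made-up) threshold. */
static int is_hot(unsigned long pfn, uint8_t threshold)
{
	return hotness[pfn] >= threshold;
}
```

The trade-off such a scheme makes is losing exact recency information in
exchange for a fixed, tiny footprint per page.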

Gregory suggested that we may want to track the number of demotions and 
promotions; when the rate of demotions equals the rate of promotions, we 
should reduce the number of promotions.  Bharata noted that there is rate 
limiting on promotions, but it does not consider the current rate of 
demotions.  In this case, Gregory said, we are seeing promotions drive 
demotions, and there will be an endless loop with these workload 
characteristics that will impact performance.

Gregory's idea was that if we promote 1,000 pages and then a few seconds 
later we demote 1,000 pages, then this is an indication of churn.  If this 
is consistent, we should back off.  In an ideal scenario, we would be 
doing some proactive demotion from top tier so there is room to promote.  
He wanted to see data on the rate of promotions and rate of demotions over 
time for these benchmarks.  This may reveal an indicator that we can use 
as a heuristic for a back off mechanism.
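Gregory's churn heuristic could take roughly the following shape: count
promotions and demotions per sampling window, and if the two rates stay
roughly equal for several consecutive windows, treat the tiers as
thrashing and back off.  This is only a sketch; the struct, window
semantics, and both tunables are invented for illustration:

```c
#include <assert.h>
#include <stdbool.h>

/* Per-window promotion/demotion counts plus a streak counter for how
 * many consecutive windows looked like churn. */
struct tier_stats {
	unsigned long promoted;		/* pages promoted this window */
	unsigned long demoted;		/* pages demoted this window */
	unsigned int churn_windows;	/* consecutive churny windows */
};

#define CHURN_RATIO_PCT		80	/* rates within 80% => churn */
#define CHURN_WINDOWS_MAX	3	/* back off after 3 in a row */

/* Called at the end of each sampling window; returns true when
 * promotion should be throttled. */
static bool should_back_off(struct tier_stats *s)
{
	/* Churny when demotions and promotions are within
	 * CHURN_RATIO_PCT of each other (in both directions). */
	bool churny = s->promoted &&
		      s->demoted * 100 >= s->promoted * CHURN_RATIO_PCT &&
		      s->promoted * 100 >= s->demoted * CHURN_RATIO_PCT;

	if (churny)
		s->churn_windows++;
	else
		s->churn_windows = 0;

	s->promoted = s->demoted = 0;	/* start the next window */
	return s->churn_windows >= CHURN_WINDOWS_MAX;
}
```

Requiring several consecutive churny windows avoids backing off on a
one-time burst, matching Gregory's "if this is consistent" condition.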

I asked whether we need a back-off mechanism or actually a fairness 
mechanism; the worry was that we would find churn in promotion and 
demotion, back off, then start promoting and demoting again at the same 
rate, only to rinse and repeat.  Gregory noted that this will continue 
to consume precious bandwidth on the device so any churn will naturally 
impact the rest of the system.

Wei Xu thought that this was more about the sorting of the cold memory: we 
need to rank page coldness across tiers, because what we promote may not 
necessarily be hotter than what we demote.  One possible approach is 
to use a time dimension where we refuse to demote any memory that is not 
cold enough (like MGLRU's min_ttl).  Gregory noted this is a function of 
the call that we use to promote the memory which, today, is just a call to 
migrate_pages(); we could pass gfp flags to specify what to do if the 
promotion node is out of memory.  If we are calling direct reclaim, that's 
probably an indication that we are not aging off inactive pages and we may 
be forcing demotion of active pages.

Bharata's current approach uses migrate_misplaced_folio(), which 
unfortunately does not give a migration context, although it does wake up 
kswapd if the top tier is near an OOM condition.  No gfp flags are passed, 
however.  We may want to extend the batch function to allow passing its 
own allocation function; the current benchmarks would not be calling into 
direct reclaim today.

Wei noted that we do not have a mechanism today to be able to compare 
relative coldness of memory to make promotion and demotion decisions.  He 
suggested using a time dimension such that the page being demoted is 
guaranteed to not have been accessed in the last interval whereas the page 
being promoted *is* guaranteed to have been accessed in that interval.
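Wei's time-dimension rule can be stated as two predicates over a per-page
last-access timestamp: a page may be demoted only if it has been idle for
at least some interval (in the spirit of MGLRU's min_ttl), and promoted
only if it was accessed within that interval.  The sketch below is
illustrative; the struct and function names are assumptions:

```c
#include <assert.h>
#include <stdbool.h>

/* Per-page timestamp of the last observed access, e.g. in jiffies. */
struct page_info {
	unsigned long last_access;
};

/* Demotion candidates must have been idle for at least min_ttl... */
static bool may_demote(const struct page_info *p, unsigned long now,
		       unsigned long min_ttl)
{
	return now - p->last_access >= min_ttl;
}

/* ...while promotion candidates must have been accessed within it, so
 * anything promoted is strictly more recently used than anything
 * demoted. */
static bool may_promote(const struct page_info *p, unsigned long now,
			unsigned long min_ttl)
{
	return now - p->last_access < min_ttl;
}
```

Because the two predicates partition pages by the same interval, this
gives exactly the guarantee Wei described: promotion and demotion
decisions can never invert relative recency within a window.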

----->o-----
Next meeting will be on Thursday, February 26 at 8:30am PST (UTC-8),
everybody is welcome: https://meet.google.com/jak-ytdx-hnm

Topics for the next meeting:

 - any on-going work for CHMU and a generic hotness interface to leverage
   it
 - RFC v5 of Bharata's patch series for pghot with two modes of operation
 - promotion and demotion thrashing detection and back-off mechanisms
 - LSF/MM/BPF 2026 topics to propose for discussion on hotness, promotion,
   and memory tiering overall
 - Gregory's testing of reclaim fairness with Bharata's changes and first
   posting for an RFC
 - discuss generalized subsystem for providing bandwidth information
   independent of the underlying platform, ideally through resctrl,
   otherwise utilizing bandwidth information will be challenging
   + preferably this bandwidth monitoring is not per NUMA node but rather
     slow and fast
 - determine minimal viable upstream opportunity to optimize for tiering
   that is extensible for future use cases and optimizations
   + extensible for multiple tiers
   + must be possible to disable with no memory or performance overhead
 - update on non-temporal stores enlightenment for memory tiering
 - enlightening migrate_pages() for hardware assists and how this work
   will be charged to userspace, including for memory compaction

Please let me know if you'd like to propose additional topics for
discussion, thank you!

