linux-mm.kvack.org archive mirror
From: David Rientjes <rientjes@google.com>
To: Davidlohr Bueso <dave@stgolabs.net>, Fan Ni <nifan.cxl@gmail.com>,
	 Gregory Price <gourry@gourry.net>,
	 Jonathan Cameron <Jonathan.Cameron@huawei.com>,
	 Joshua Hahn <joshua.hahnjy@gmail.com>,
	Raghavendra K T <rkodsara@amd.com>,
	 "Rao, Bharata Bhasker" <bharata@amd.com>,
	SeongJae Park <sj@kernel.org>,  Wei Xu <weixugc@google.com>,
	Xuezheng Chu <xuezhengchu@huawei.com>,
	 Yiannis Nikolakopoulos <yiannis@zptcorp.com>,
	Zi Yan <ziy@nvidia.com>
Cc: linux-mm@kvack.org
Subject: [Linux Memory Hotness and Promotion] Notes from January 29, 2026
Date: Sun, 1 Feb 2026 18:51:39 -0800 (PST)
Message-ID: <c8bc2dce-d4ec-c16e-8df4-2624c48cfc06@google.com>

Hi everybody,

Here are the notes from the last Linux Memory Hotness and Promotion call
that happened on Thursday, January 29.  Thanks to everybody who was 
involved!

These notes are intended to bring people up to speed who could not attend 
the call as well as keep the conversation going in between meetings.

----->o-----
Bharata gave an update on the status of his series; he recently posted 
RFC v5[1], which includes new pghot support with two modes of operation:

 - the default mode uses one byte per hotness record and tracks the
   frequency and (bucketed) time of access.  Promotion uses a default
   target_nid (=0), which can be changed via a debugfs tunable.
 - a compile-time configurable precision mode (CONFIG_PGHOT_PRECISE)
   tracks frequency, time (at finer granularity), and NID as well.  It
   uses 4 bytes per hotness record and is suitable for systems that have
   multiple nodes in the top tier.
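
As a rough illustration only, the two record sizes could be pictured as 
bit-packed fields along the lines below.  The field widths, macro names, 
and helper names are all hypothetical and are not taken from the RFC; the 
series itself is the reference for the actual layout.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical default record: 3 bits frequency + 5 bits bucketed time. */
#define HOT_FREQ_BITS	3
#define HOT_TIME_BITS	5
#define HOT_FREQ_MAX	((1u << HOT_FREQ_BITS) - 1)
#define HOT_TIME_MASK	((1u << HOT_TIME_BITS) - 1)

_Static_assert(HOT_FREQ_BITS + HOT_TIME_BITS == 8, "record must fit one byte");

static uint8_t hot_pack(unsigned freq, unsigned time_bucket)
{
	if (freq > HOT_FREQ_MAX)
		freq = HOT_FREQ_MAX;		/* saturate instead of wrapping */
	return (uint8_t)((freq << HOT_TIME_BITS) | (time_bucket & HOT_TIME_MASK));
}

static unsigned hot_freq(uint8_t rec) { return rec >> HOT_TIME_BITS; }
static unsigned hot_time(uint8_t rec) { return rec & HOT_TIME_MASK; }

/* Hypothetical precise record: 10 bits NID, 6 bits frequency, 16 bits time. */
#define HOTP_TIME_BITS	16
#define HOTP_FREQ_BITS	6
#define HOTP_NID_BITS	10
#define HOTP_FREQ_MAX	((1u << HOTP_FREQ_BITS) - 1)
#define HOTP_TIME_MASK	((1u << HOTP_TIME_BITS) - 1)
#define HOTP_NID_MASK	((1u << HOTP_NID_BITS) - 1)

static uint32_t hotp_pack(unsigned freq, unsigned time, unsigned nid)
{
	if (freq > HOTP_FREQ_MAX)
		freq = HOTP_FREQ_MAX;		/* saturate instead of wrapping */
	return ((uint32_t)(nid & HOTP_NID_MASK) << (HOTP_TIME_BITS + HOTP_FREQ_BITS)) |
	       ((uint32_t)freq << HOTP_TIME_BITS) |
	       (time & HOTP_TIME_MASK);
}

static unsigned hotp_nid(uint32_t rec)  { return rec >> (HOTP_TIME_BITS + HOTP_FREQ_BITS); }
static unsigned hotp_freq(uint32_t rec) { return (rec >> HOTP_TIME_BITS) & HOTP_FREQ_MAX; }
static unsigned hotp_time(uint32_t rec) { return rec & HOTP_TIME_MASK; }
```

The tradeoff the sketch tries to show: the one-byte form has no room for 
a per-record NID, so a single target_nid applies to every promotion, while 
the 4-byte form can carry a node alongside a finer timestamp.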
 
There are also lots of code cleanups, fixes, and reorganization.  His next 
step is to do some extensive performance benchmarking with additional 
industry standard benchmarks.  This should be in a good place for others 
to test out.

Neither Gregory nor Wei raised any strong objections to this 
direction.

----->o-----
Gregory has testing under way for reclaim fairness.  There was discussion 
that referred back to the previous instance of the meeting about avoiding 
opportunistic promotion.  He had a similar use case so they have been 
testing opportunistic and "fixed share" approaches.  His thought was to 
test with both reclaim fairness and Bharata's series.  I emphasized that 
the customer-observable experience would have different requirements for 
different use cases.

Joshua Hahn went over the current thinking for reclaim fairness as three 
components: set effective memory.low and memory.high based on system-wide 
capacity (in addition to existing memcg tunables); the goal was to ensure 
that some amount of proportional top tier memory was always resident.  
This avoids interfering with other memcgs on the system unnecessarily.  
Additionally, the goal was to make sure that kswapd and reclaim are aware 
of this and can be more proactive.

I asked about the effective memory.high and memory.low being hidden 
from the user; Gregory said there is a single toggle that has a tristate: 
none (default reclaim), fixed share, and opportunistic.  Fixed share is a 
self-policing option; for example, with a 3:1 ratio of top tier to CXL 
memory capacity, the effective share of top tier is calculated as 
3 / (3 + 1) = 75%.  Another goal was to ensure that this would be 
extensible in the future.
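
The fixed share arithmetic can be sketched as below.  This is only an 
illustration of the 3:1 example: the function names are hypothetical, 
capacities are assumed to be in MiB so the products fit in 64 bits, and 
scaling a memcg limit by the share is an assumption about how the share 
might be applied, not something confirmed on the call.

```c
#include <assert.h>
#include <stdint.h>

/* Percentage of total system capacity that sits in the top tier. */
static unsigned top_tier_share_pct(uint64_t top_mib, uint64_t lower_mib)
{
	uint64_t total = top_mib + lower_mib;

	assert(total > 0);
	return (unsigned)(top_mib * 100 / total);
}

/*
 * Hypothetical: the amount of top tier memory a memcg with the given
 * limit would be entitled to if its limit were scaled by the share.
 */
static uint64_t fixed_share_top_mib(uint64_t limit_mib, uint64_t top_mib,
				    uint64_t lower_mib)
{
	uint64_t total = top_mib + lower_mib;

	assert(total > 0);
	return limit_mib * top_mib / total;
}
```

With 768 GiB of top tier and 256 GiB of CXL memory, 
top_tier_share_pct(768 * 1024, 256 * 1024) gives 75, and under the scaling 
assumption a memcg with an 8192 MiB limit would get 6144 MiB of top tier.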

Jonathan asked how this would work for the overall system; Gregory noted 
that either everybody participates or nobody participates.  If you cannot 
use fairness, then the scheduler needs to be more effective.  If a single 
user excludes themselves, then you need to reduce the effective capacity 
for everybody else on the system: does this scale when containers are 
coming and going?  Likely not for the first iteration.

Gregory also suggested one possible mechanism could be to add a tunable to 
a reclaim fairness sysctl that allows userspace to reduce the effective 
capacity of a single tier on its own.  Reclaim would read that value 
directly instead of adding up all the values itself.  I asked if there are 
any per-memcg toggles for this and Gregory said that only the existing 
memory.max and memory.high play a role in this approach.
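
One way to picture the suggested tunable, purely as a sketch: userspace 
writes how much of a tier is set aside for non-participants, and reclaim 
reads that single value when computing the share.  The struct, its field 
names, and the choice to subtract the excluded amount from the top tier 
before computing the share are all assumptions made for illustration; no 
such sysctl exists today.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical view of the capacities reclaim would consult (in MiB). */
struct tier_caps {
	uint64_t top_mib;		/* physical top tier capacity */
	uint64_t lower_mib;		/* physical lower tier (e.g. CXL) capacity */
	uint64_t top_excluded_mib;	/* set by userspace via the assumed tunable */
};

/* Top tier capacity left for participants, clamped at zero. */
static uint64_t effective_top_mib(const struct tier_caps *c)
{
	if (c->top_excluded_mib >= c->top_mib)
		return 0;
	return c->top_mib - c->top_excluded_mib;
}

/* Share of top tier, in percent, after applying the exclusion. */
static unsigned effective_top_share_pct(const struct tier_caps *c)
{
	uint64_t top = effective_top_mib(c);
	uint64_t total = top + c->lower_mib;

	assert(total > 0);
	return (unsigned)(top * 100 / total);
}
```

For a 3072 MiB top tier and 1024 MiB lower tier, excluding nothing keeps 
the share at 75%, while excluding 1024 MiB drops it to 66% (rounded down).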

There are some interesting caveats with memory hotplug but they think they 
have that resolved.

It appears as though reclaim fairness doesn't have any strict dependency 
on Bharata's series; Gregory noted that we want to ensure that no 
mechanism can over-promote, hence the goal is to test these two approaches 
together.

Joshua is working on allocation throttling mechanisms and hopes to 
post the patch series over the next 2-4 weeks.

----->o-----
Next meeting will be on Thursday, February 12 at 8:30am PST (UTC-8),
everybody is welcome: https://meet.google.com/jak-ytdx-hnm

Topics for the next meeting:

 - RFC v5 of Bharata's patch series for pghot with two modes of operation
 - Gregory's testing of reclaim fairness with Bharata's changes
 - discuss generalized subsystem for providing bandwidth information
   independent of the underlying platform, ideally through resctrl,
   otherwise utilizing bandwidth information will be challenging
   + preferably this bandwidth monitoring is not per NUMA node but rather
     slow and fast
 - determine minimal viable upstream opportunity to optimize for tiering
   that is extensible for future use cases and optimizations
   + extensible for multiple tiers
   + must be possible to disable with no memory or performance overhead
 - update on non-temporal stores enlightenment for memory tiering
 - enlightening migrate_pages() for hardware assists and how this work
   will be charged to userspace, including for memory compaction

Please let me know if you'd like to propose additional topics for
discussion, thank you!

[1] 
https://lore.kernel.org/linux-mm/20260129144043.231636-1-bharata@amd.com/

