* [Linux Memory Hotness and Promotion] Notes from January 29, 2026
From: David Rientjes @ 2026-02-02 2:51 UTC (permalink / raw)
To: Davidlohr Bueso, Fan Ni, Gregory Price, Jonathan Cameron,
Joshua Hahn, Raghavendra K T, Rao, Bharata Bhasker,
SeongJae Park, Wei Xu, Xuezheng Chu, Yiannis Nikolakopoulos,
Zi Yan
Cc: linux-mm
Hi everybody,
Here are the notes from the last Linux Memory Hotness and Promotion call
that happened on Thursday, January 29. Thanks to everybody who was
involved!
These notes are intended to bring people up to speed who could not attend
the call as well as keep the conversation going in between meetings.
----->o-----
Bharata gave an update on the status of his series; he recently posted
RFC v5[1], which includes new pghot support with two modes of operation:
- the default mode uses one byte per hotness record and tracks the
frequency and (bucketed) time of access. Promotion uses the default
target_nid (=0), which can also be changed via a debugfs tunable.
- there is a compile-time configurable precision mode
(CONFIG_PGHOT_PRECISE) that tracks frequency, time (at a finer
granularity), and NID. It uses 4 bytes per hotness record, which
makes it suitable for systems with multiple nodes in the top tier.
There are also lots of code cleanups, fixes, and reorganization. His next
step is extensive performance benchmarking with additional
industry-standard benchmarks. The series should now be in a good place
for others to test.
Neither Gregory nor Wei surfaced any strong objections to this direction.
----->o-----
Gregory has testing underway on reclaim fairness. There was discussion
referring back to the previous instance of the meeting about avoiding
opportunistic promotion. He had a similar use case, so they have been
testing both opportunistic and "fixed share" approaches. His plan is to
test with both reclaim fairness and Bharata's series. I emphasized the
customer-observable experience, which has different requirements for
different use cases.
Joshua Hahn went over the current thinking for reclaim fairness as three
components: set effective memory.low and memory.high based on system-wide
capacity (in addition to the existing memcg tunables); ensure that some
proportional amount of top-tier memory is always resident, without
unnecessarily interfering with other memcgs on the system; and make
kswapd and reclaim aware of this so they can be more proactive.
I asked whether the effective memory.high and memory.low would be hidden
from the user; Gregory said there is a single tristate toggle: none
(default reclaim), fixed share, and opportunistic. Fixed share is a
self-policing option; for example, with a 3:1 ratio of top-tier to CXL
memory capacity, the effective top-tier share is calculated as 75%.
Another goal was to ensure that this would be extensible in the future.
Jonathan asked how this would work for the overall system; Gregory noted
that either everybody participates or nobody participates. If you cannot
use fairness, then the scheduler needs to be more effective. If a single
user excludes themselves, then you need to reduce the effective capacity
for everybody else on the system: does this scale when containers are
coming and going? Likely not for the first iteration.
Gregory also suggested one possible mechanism could be to add a tunable to
a reclaim fairness sysctl that allows userspace to reduce the effective
capacity of a single tier on its own. Reclaim would read that value
directly instead of adding up all the values itself. I asked if there are
any per-memcg toggles for this, and Gregory said that only the existing
memory.max and memory.high play a role in this approach.
There are some interesting caveats with memory hotplug but they think they
have that resolved.
It appears as though reclaim fairness has no strict dependency on
Bharata's series; Gregory noted that we want to ensure that no mechanism
can over-promote, hence the goal of testing these two approaches
together.
Joshua is working on allocation throttling mechanisms and hopes to post
the patch series over the next 2-4 weeks.
----->o-----
Next meeting will be on Thursday, February 12 at 8:30am PST (UTC-8),
everybody is welcome: https://meet.google.com/jak-ytdx-hnm
Topics for the next meeting:
- RFC v5 of Bharata's patch series for pghot with two modes of operation
- Gregory's testing of reclaim fairness with Bharata's changes
- discuss generalized subsystem for providing bandwidth information
independent of the underlying platform, ideally through resctrl,
otherwise utilizing bandwidth information will be challenging
+ preferably this bandwidth monitoring is not per NUMA node but rather
slow and fast
- determine minimal viable upstream opportunity to optimize for tiering
that is extensible for future use cases and optimizations
+ extensible for multiple tiers
+ must be possible to disable with no memory or performance overhead
- update on non-temporal stores enlightenment for memory tiering
- enlightening migrate_pages() for hardware assists and how this work
will be charged to userspace, including for memory compaction
Please let me know if you'd like to propose additional topics for
discussion, thank you!
[1]
https://lore.kernel.org/linux-mm/20260129144043.231636-1-bharata@amd.com/