From: David Rientjes <rientjes@google.com>
To: Davidlohr Bueso <dave@stgolabs.net>, Fan Ni <nifan.cxl@gmail.com>,
Gregory Price <gourry@gourry.net>,
Jonathan Cameron <Jonathan.Cameron@huawei.com>,
Joshua Hahn <joshua.hahnjy@gmail.com>,
Raghavendra K T <rkodsara@amd.com>,
"Rao, Bharata Bhasker" <bharata@amd.com>,
SeongJae Park <sj@kernel.org>, Wei Xu <weixugc@google.com>,
Xuezheng Chu <xuezhengchu@huawei.com>,
Yiannis Nikolakopoulos <yiannis@zptcorp.com>,
Zi Yan <ziy@nvidia.com>
Cc: linux-mm@kvack.org
Subject: [Linux Memory Hotness and Promotion] Notes from November 20, 2025
Date: Sun, 23 Nov 2025 19:04:47 -0800 (PST)
Message-ID: <58dcd4db-a923-0d5d-37eb-1a539f1f275d@google.com>
Hi everybody,
Here are the notes from the last Linux Memory Hotness and Promotion call
that happened on Thursday, November 20. Thanks to everybody who was
involved!
These notes are intended to bring people who could not attend the call
up to speed, as well as to keep the conversation going in between
meetings.
----->o-----
Bharata reported that he now has a set of results for the scenario that
involves promotion upstream; he posted these as a reply to his RFC v3.
Any feedback on that series or proposed benchmarks to run would be very
useful. He was also thinking about consolidating all the tunables in
sysfs into a subdirectory rather than having them in the parent
directory for MM. I suggested this may also start out in debugfs until
the APIs become more stable.
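
As an illustration of the debugfs suggestion, here is a minimal sketch
of grouping such tunables under a single debugfs directory; the
directory and tunable names are hypothetical and not taken from
Bharata's series:

	#include <linux/debugfs.h>
	#include <linux/module.h>

	static struct dentry *hotness_dir;
	static u32 promote_rate_limit_mbps = 256;	/* hypothetical tunable */
	static u32 hot_threshold_accesses = 4;		/* hypothetical tunable */

	static int __init hotness_debugfs_init(void)
	{
		/* One subdirectory keeps the ABI easy to change while unstable. */
		hotness_dir = debugfs_create_dir("memory_hotness", NULL);

		debugfs_create_u32("promote_rate_limit_mbps", 0644, hotness_dir,
				   &promote_rate_limit_mbps);
		debugfs_create_u32("hot_threshold_accesses", 0644, hotness_dir,
				   &hot_threshold_accesses);
		return 0;
	}

	static void __exit hotness_debugfs_exit(void)
	{
		debugfs_remove_recursive(hotness_dir);
	}

	module_init(hotness_debugfs_init);
	module_exit(hotness_debugfs_exit);
	MODULE_LICENSE("GPL");

The same layout would translate directly to a sysfs subdirectory once
the ABI is considered stable.
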
Bharata was also planning on reworking the NUMAB2 support so that it is
cleaner and the page movement rate limiting and associated logic are
separated, which enables using faults as a source. He is also planning
on using folio_mark_accessed() as a source of hotness to cover promotion
of unmapped file folios, and he will be writing a dedicated
microbenchmark for testing this. He will also be investigating
additional benchmarks for the overall series as a whole.
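
To make the folio_mark_accessed() idea concrete, here is a rough sketch
of the kind of hook that could be called from folio_mark_accessed();
hotness_record_access() is a hypothetical helper, not an existing kernel
API, and this is not Bharata's actual patch:

	#include <linux/memory-tiers.h>
	#include <linux/mm.h>

	/* Hypothetical hook, called whenever a folio is marked accessed. */
	static void hotness_record_access(struct folio *folio)
	{
		int nid = folio_nid(folio);

		/* Only lower-tier folios are candidates for promotion. */
		if (node_is_toptier(nid))
			return;

		/*
		 * Bump a per-folio/per-pfn counter in the hotness metadata
		 * here; unmapped page cache folios are covered because
		 * folio_mark_accessed() is called on buffered reads and
		 * writes.
		 */
	}

The benefit over fault-based sources is that unmapped page cache
accesses, which never take a NUMA hint fault, still feed the hotness
state.
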
----->o-----
Jonathan Cameron asked what the general feeling was about the memory
overhead: currently this tracking requires ~2GB per 1TB of memory, i.e.
eight bytes per 4KB page. Wei Xu noted that Google is taking a similar
approach but with one byte per page in page flags. If just for promotion
purposes, we likely don't need eight bytes per page; even NUMA Balancing
does not use eight bytes per page. Jonathan said it currently uses 33
bits per page, so some shrinkage might be possible. Wei said promotion
still requires the per-pfn scan, which can be expensive.
Wei said there would be one data structure holding the information, so
that we can do atomic updates on the hot metadata, and then a much
smaller data structure that tracks which pages to promote.
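
For concreteness, the overhead works out to one 8-byte record per 4KB
page: 1TB / 4KB = 256M pages, times eight bytes, is 2GB of metadata.
Here is a hedged sketch of the two-structure split Wei described; the
field layout is entirely hypothetical and not from any posted series:

	#include <linux/types.h>

	/* Large structure: 8 bytes per page, updated atomically by trackers. */
	struct page_hotness {
		u32 access_count;	/* decayed access counter */
		u16 last_epoch;		/* scan epoch of the last observed access */
		u16 source_flags;	/* how the access was seen: fault, scan, ... */
	};

	/* Much smaller structure: only the current promotion candidates. */
	struct promote_candidate {
		unsigned long pfn;
		u32 score;
	};

Shrinking the large per-page record (for example toward Google's one
byte per page) is what the shrinkage discussion above is about.
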
Raghu noted that, in discussion with Bharata, it was pointed out that
the tracking of memory here is only necessary for the lower tier since
that memory is the only viable set of pages to promote. Jonathan noted
that may be the majority of system memory. The metadata itself is only
stored in top-tier memory, which is expensive.
----->o-----
We discussed the benchmarks that we should use for evaluation of all of
these approaches. SeongJae noted that he had no specific benchmark in
mind, but that we should discuss the access pattern the benchmark should
have. It should have some temporal access patterns but also different
hotness in different regions of memory; secondly, the pattern should
change during runtime.
Jonathan said there's been a heavy reliance on memcached but that's not
ideal because it's too predictable; we actually need the opposite of this.
I noted that I've had some success running specjbb and redis workloads.
Redis is interesting because it does not always observe spatial locality.
Yiannis noted one of the challenges with specint is that the duration of
the benchmark itself is not long enough to assess optimal placement logic.
Wei agreed with this; the benchmark would need to run for a long time.
Yiannis further mentioned that these can be used to over-subscribe
cores; however, they induce pressure (and consume bandwidth).
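
As a rough illustration of the access pattern SeongJae described, a
userspace sketch like the following (sizes and phase length are
arbitrary, purely illustrative) keeps one region hot at a time and
shifts the hot region during runtime, forcing placement logic to
re-learn hotness:

	#include <stdlib.h>
	#include <string.h>
	#include <time.h>

	#define BUF_SIZE	(4UL << 30)	/* 4 GiB working set */
	#define REGION_SIZE	(256UL << 20)	/* 256 MiB hot region */
	#define PHASE_SECONDS	60		/* move the hot region every minute */

	int main(void)
	{
		char *buf = malloc(BUF_SIZE);
		unsigned long nregions = BUF_SIZE / REGION_SIZE;
		unsigned long hot = 0;
		time_t phase_start = time(NULL);

		if (!buf)
			return 1;
		memset(buf, 0, BUF_SIZE);	/* fault the whole buffer in */

		for (;;) {
			/* Hammer the currently hot region; the rest stays cold. */
			for (unsigned long i = 0; i < REGION_SIZE; i += 64)
				buf[hot * REGION_SIZE + i]++;

			if (time(NULL) - phase_start >= PHASE_SECONDS) {
				hot = (hot + 1) % nregions;	/* shift hotness */
				phase_start = time(NULL);
			}
		}
	}

Running it for a long time, as Wei suggested, also exposes whether
placement converges and then tracks the phase changes.
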
----->o-----
Raghu gave an update on his patch series to use the LRU gen scan API,
which iterates through a single mm; this provides more control over the
memory that is being iterated. He was working through some issues in the patch
series and may need to reach out to Kinsey for discussion on klruscand.
Jonathan also provided some feedback on the mailing list.
Raghu asked Kinsey if it would be possible to have an API that scanned a
single mm; Kinsey said yes, this was similar to what was being thought
about internally. Raghu said this would be useful for integration.
Wei asked Raghu whether his series would integrate the scanning and
promotion together so that when a page is identified it can be promoted
right away. Raghu said this is implemented like NUMAB, but promotion
does not happen after a single access; there is a separate migration
thread.
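
Here is a hedged sketch of that decoupled shape, where the scan side
only queues candidates and a separate kernel thread does the actual
movement; all names below are hypothetical and this is not claimed to
match Raghu's series:

	#include <linux/kthread.h>
	#include <linux/list.h>
	#include <linux/sched.h>
	#include <linux/slab.h>
	#include <linux/spinlock.h>

	struct promote_work {
		struct list_head node;
		unsigned long pfn;	/* candidate identified by the scan */
	};

	static LIST_HEAD(promote_list);
	static DEFINE_SPINLOCK(promote_lock);

	/* Called from the per-mm scan; cheap, no migration happens here. */
	static void queue_for_promotion(unsigned long pfn)
	{
		struct promote_work *w = kmalloc(sizeof(*w), GFP_ATOMIC);

		if (!w)
			return;
		w->pfn = pfn;
		spin_lock(&promote_lock);
		list_add_tail(&w->node, &promote_list);
		spin_unlock(&promote_lock);
	}

	/* Separate migration thread: drains the list, promotes in batches. */
	static int promote_kthread(void *unused)
	{
		while (!kthread_should_stop()) {
			LIST_HEAD(batch);

			spin_lock(&promote_lock);
			list_splice_init(&promote_list, &batch);
			spin_unlock(&promote_lock);

			/*
			 * Isolate the folios behind these pfns and hand them
			 * to migrate_pages() targeting the top tier, subject
			 * to rate limiting, then free the work items; elided
			 * here.
			 */

			schedule_timeout_interruptible(HZ);
		}
		return 0;
	}

This also shows where the rate limiting Bharata mentioned naturally
lives: in the thread that drains the list, not in the scan path.
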
Jonathan asked if we necessarily care if we lose some information; if
there is a ton of memory to promote, we can't migrate everything, so do we
care if some hotness information is actually lost? Wei suggested that we
must have a mechanism for promoting the hottest pages, not just hot pages,
so some amount of history is required. Jonathan said that if everything
were insanely hot and we lost some information, it would readily
reappear again. Raghu's patch series only uses a single bit from page
flags; Wei suggested extending this.
----->o-----
Next meeting will be on Thursday, December 4 at 8:30am PST (UTC-8),
everybody is welcome: https://meet.google.com/jak-ytdx-hnm
Topics for the next meeting:
- updates on Bharata's RFC v3 with new benchmarks and consolidation of
tunables
- continued discussion on memory overheads used to save the memory
hotness state and the list of promotion targets
- benchmarks to use as the industry standard beyond just memcached, such
as redis
- discuss generalized subsystem for providing bandwidth information
independent of the underlying platform, ideally through resctrl,
otherwise utilizing bandwidth information will be challenging
+ preferably this bandwidth monitoring is not per NUMA node but rather
  per slow and fast tier
- similarly, discuss generalized subsystem for providing memory hotness
information
- determine minimal viable upstream opportunity to optimize for tiering
that is extensible for future use cases and optimizations
- update on non-temporal stores enlightenment for memory tiering
- enlightening migrate_pages() for hardware assists and how this work
will be charged to userspace, including for memory compaction
Please let me know if you'd like to propose additional topics for
discussion, thank you!