From: David Rientjes <rientjes@google.com>
To: Davidlohr Bueso <dave@stgolabs.net>, Fan Ni <nifan.cxl@gmail.com>,
Gregory Price <gourry@gourry.net>,
Jonathan Cameron <Jonathan.Cameron@huawei.com>,
Joshua Hahn <joshua.hahnjy@gmail.com>,
Raghavendra K T <rkodsara@amd.com>,
"Rao, Bharata Bhasker" <bharata@amd.com>,
SeongJae Park <sj@kernel.org>, Wei Xu <weixugc@google.com>,
Xuezheng Chu <xuezhengchu@huawei.com>,
Yiannis Nikolakopoulos <yiannis@zptcorp.com>,
Zi Yan <ziy@nvidia.com>
Cc: linux-mm@kvack.org
Subject: [Linux Memory Hotness and Promotion] Notes from November 6, 2025
Date: Sat, 8 Nov 2025 18:51:07 -0800 (PST)
Message-ID: <8ff2fd10-c9ac-4912-cf56-7ecd4afd2770@google.com>
Hi everybody,
Here are the notes from the last Linux Memory Hotness and Promotion call
that happened on Thursday, November 6. Thanks to everybody who was
involved!
These notes are intended to bring people who could not attend the call up
to speed, as well as to keep the conversation going in between meetings.
----->o-----
I read off updates from both Bharata and Raghu on the latest revisions of
their respective series.
Bharata is in the process of getting the next revision of his series ready
to send and is collecting microbenchmark results to share. The revision is
intended to go out over the weekend, or Monday at the latest.
Raghu reported that he was able to root-cause a performance issue in
kscand based on feedback received from Jonathan; he has not yet posted
that series upstream. He has also made some early progress on LRU-based
scanning and will have updates for the next meeting.
----->o-----
Shivank presented slides that were added to the shared drive.
He walked through his methodology for microbenchmark testing: the kernel
promotes hot pages to the local NUMA node and the benchmark completes
faster when more hot pages are migrated early. He intended to answer the
question of whether kpromoted with DMA offload will shorten the benchmark
completion time.
To emulate this environment, he created a memoryless NUMA node to act as
slow-tier memory and then tested kpromoted promoting from Node 2 to Node 1
(remote vs local NUMA latency). This environment is close to real-world
CXL systems in terms of memory access latency.
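As an aside for anyone wanting to reproduce this kind of remote-vs-local
measurement, a minimal userspace latency probe might look like the
following. This is a sketch only, not Shivank's actual harness; the node
numbers and working-set size are assumptions:

/* latency_probe.c: compare dependent-load latency on two NUMA nodes.
 * Build: gcc -O2 latency_probe.c -lnuma -o latency_probe
 * Node numbers and working-set size are illustrative assumptions.
 */
#include <numa.h>
#include <stdio.h>
#include <stdlib.h>
#include <time.h>

#define NSTEPS (1UL << 24)

static volatile size_t sink;

/* Pointer-chase through memory bound to 'node'; dependent loads keep
 * the hardware prefetcher from hiding the access latency.
 */
static double chase_ns(int node, size_t sz)
{
	size_t n = sz / sizeof(size_t), i, next = 0;
	size_t *buf = numa_alloc_onnode(sz, node);
	struct timespec t0, t1;

	if (!buf) {
		fprintf(stderr, "numa_alloc_onnode failed\n");
		exit(1);
	}

	/* Sattolo's algorithm: a random single-cycle permutation. */
	for (i = 0; i < n; i++)
		buf[i] = i;
	for (i = n - 1; i > 0; i--) {
		size_t j = rand() % i, tmp = buf[i];
		buf[i] = buf[j];
		buf[j] = tmp;
	}

	clock_gettime(CLOCK_MONOTONIC, &t0);
	for (i = 0; i < NSTEPS; i++)
		next = buf[next];	/* each load depends on the last */
	clock_gettime(CLOCK_MONOTONIC, &t1);
	sink = next;			/* keep the loop from being elided */

	numa_free(buf, sz);
	return ((t1.tv_sec - t0.tv_sec) * 1e9 +
		(t1.tv_nsec - t0.tv_nsec)) / NSTEPS;
}

int main(void)
{
	size_t sz = 256UL << 20;	/* 256MB working set (assumption) */

	if (numa_available() < 0) {
		fprintf(stderr, "NUMA not supported\n");
		return 1;
	}
	/* Node 1 = local fast tier, Node 2 = emulated slow tier (assumed). */
	printf("node 1: %.1f ns/load\n", chase_ns(1, sz));
	printf("node 2: %.1f ns/load\n", chase_ns(2, sz));
	return 0;
}

The pointer chase defeats the prefetcher, so the per-load time
approximates the raw access latency of each node.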
Gregory noted that since this setup uses the cross-socket link, we need to
account for its bandwidth limitations; the way latency climbs as bandwidth
becomes pressured will not look the same as on a real CXL device.
Results for microbenchmark completion time with and without DMA offload
(8-channel DMA), mean in microseconds:

abench access pattern    Offload disabled    Offload enabled    Improvement (%)
random                        39114520.00        35313111.00               9.72
random repeat                 11089918.20         9045263.60              18.44
sequential                    27677787.00        24576768.40              11.20
The overall conclusion was that DMA offloading reduces abench completion
time by 9-18% (lower is better).
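(Sanity check against the table: for the random pattern,
(39114520.00 - 35313111.00) / 39114520.00 = 0.0972, i.e. the reported
9.72% improvement.)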
Gregory asked how there could be any improvement here if there will be
promotions from Node 2 -> 1, but also demotions from Node 1 -> 2. Shivank
noted there was no memory pressure on Node 1, which may not be realistic
(there will always be demotions).
Yiannis noted it would be better to include numbers on promotions and
demotions. Gregory asked for memory usage before and after on each node.
Wei Xu noted that you may not want to saturate the bandwidth with
promotions but rather leave the majority of bandwidth for organic memory
accesses. He also noted that since we are using THPs here, a single hot
access may require migrating an entire 2MB THP (512 4KB pages), which may
not be ideal. Shivank noted that this doesn't necessarily require THP;
improvements were observed even when batch-migrating 100 or 200 4KB pages.
----->o-----
For DMA offloading of migrate_pages(), Shivank opined that SDXI may be a
better fit, but he lacks the hardware needed to evaluate it. He also noted
an upstream series being worked on by Nvidia that offloads the copy to
multiple threads on idle CPU cores.
In the new batched design for migrate_folios_batch_move(), the batch copy
of folio data goes through a migrator that is configurable at runtime via
sysfs. For example, static_call(_folios_copy) allows for pluggable
migrators (default CPU copy, multi-threaded CPU copy, DMA offload, etc.).
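For those unfamiliar with the mechanism, a rough sketch of what a
static_call()-based migrator dispatch can look like follows; the names,
copy-function signature, and sysfs hook here are illustrative assumptions,
not the actual interface from the patch series:

/* Hedged sketch: names and signature are assumptions, not the actual
 * interface from Shivank's series.
 */
#include <linux/list.h>
#include <linux/static_call.h>

/* Default migrator: copy each folio's data on the calling CPU. */
static void folios_copy_cpu(struct list_head *dst_folios,
			    struct list_head *src_folios)
{
	/* ... per-folio copy loop ... */
}

/* Hypothetical DMA-backed implementation, provided elsewhere. */
extern void folios_copy_dma(struct list_head *dst_folios,
			    struct list_head *src_folios);

DEFINE_STATIC_CALL(_folios_copy, folios_copy_cpu);

/* Hot path in migrate_folios_batch_move(): static_call() patches the
 * call site directly, so there is no indirect branch in the common case.
 */
static void migrate_folios_copy(struct list_head *dst, struct list_head *src)
{
	static_call(_folios_copy)(dst, src);
}

/* Runtime switch, e.g. invoked from a sysfs store handler: */
static void select_dma_migrator(void)
{
	static_call_update(_folios_copy, &folios_copy_dma);
}

The appeal of static_call() over a plain function pointer is that
switching migrators is rare while the copy path is hot, so paying a
text-patching cost at update time to get a direct call on every migration
is a good trade.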
I suggested that this would be valuable even beyond CXL use cases; for
example, being able to offload most of the copying done by memory
compaction would be generally useful. Shivank said the patch series is
currently at v3 and is receiving feedback from Jonathan. There is ongoing
discussion about the right interface for configuring the migrator and its
parameters.
Shivank is currently benchmarking with move_pages(2) directly, which uses
all of this machinery under the hood; he sees the most benefit from
batched migration. Gregory noted that reclaim also does batched migration,
so the work is generally useful.
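For reference, the userspace side of such a benchmark can be as small as
the sketch below; the target node, batch size, and use of the calling
process are assumptions for illustration:

/* move_batch.c: batch-migrate pages between NUMA nodes via move_pages(2).
 * Build: gcc -O2 move_batch.c -lnuma -o move_batch
 */
#include <numaif.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <sys/mman.h>
#include <unistd.h>

#define BATCH 512			/* pages migrated per syscall */

int main(void)
{
	long page_sz = sysconf(_SC_PAGESIZE);
	size_t len = BATCH * page_sz;
	void *pages[BATCH];
	int nodes[BATCH], status[BATCH], i;

	char *buf = mmap(NULL, len, PROT_READ | PROT_WRITE,
			 MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
	if (buf == MAP_FAILED) {
		perror("mmap");
		return 1;
	}
	memset(buf, 1, len);		/* fault all pages in */

	for (i = 0; i < BATCH; i++) {
		pages[i] = buf + i * page_sz;
		nodes[i] = 1;		/* assumed fast-tier target node */
	}

	/* One call migrates the whole batch; status[i] reports where each
	 * page ended up, or a negative errno on per-page failure.
	 */
	if (move_pages(0 /* self */, BATCH, pages, nodes, status,
		       MPOL_MF_MOVE) < 0) {
		perror("move_pages");
		return 1;
	}
	printf("page 0 is now on node %d\n", status[0]);
	munmap(buf, len);
	return 0;
}

Batching the pages into a single call amortizes the syscall and TLB-flush
overhead, which is the same effect the in-kernel batched move path is
after.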
----->o-----
Next meeting will be on Thursday, November 20 at 8:30am PST (UTC-8),
everybody is welcome: https://meet.google.com/jak-ytdx-hnm
Topics for the next meeting:
- updates on Bharata's series sent out between meetings with
microbenchmarks
- updates on Raghu's series for kscand and integration of LRU based
scanning
- discuss a generalized subsystem for providing bandwidth information
independent of the underlying platform, ideally through resctrl;
otherwise, utilizing bandwidth information will be challenging
+ preferably this bandwidth monitoring is per tier (slow vs fast)
rather than per NUMA node
- similarly, discuss generalized subsystem for providing memory hotness
information
- determine minimal viable upstream opportunity to optimize for tiering
that is extensible for future use cases and optimizations
- update on non-temporal stores enlightenment for memory tiering
- enlightening migrate_pages() for hardware assists and how this work
will be charged to userspace, including for memory compaction
- discuss overall testing and benchmarking methodology for various
approaches as we go along
Please let me know if you'd like to propose additional topics for
discussion, thank you!