From: David Rientjes <rientjes@google.com>
To: Davidlohr Bueso <dave@stgolabs.net>, Fan Ni <nifan.cxl@gmail.com>,
Gregory Price <gourry@gourry.net>,
Jonathan Cameron <Jonathan.Cameron@huawei.com>,
Joshua Hahn <joshua.hahnjy@gmail.com>,
Raghavendra K T <rkodsara@amd.com>,
"Rao, Bharata Bhasker" <bharata@amd.com>,
SeongJae Park <sj@kernel.org>, Wei Xu <weixugc@google.com>,
Xuezheng Chu <xuezhengchu@huawei.com>,
Yiannis Nikolakopoulos <yiannis@zptcorp.com>,
Zi Yan <ziy@nvidia.com>
Cc: linux-mm@kvack.org
Subject: [Linux Memory Hotness and Promotion] Notes from November 6, 2025
Date: Sat, 8 Nov 2025 18:51:07 -0800 (PST)
Message-ID: <8ff2fd10-c9ac-4912-cf56-7ecd4afd2770@google.com>
Hi everybody,
Here are the notes from the last Linux Memory Hotness and Promotion call
that happened on Thursday, November 6. Thanks to everybody who was
involved!
These notes are intended to bring people who could not attend the call up
to speed, as well as to keep the conversation going in between meetings.
----->o-----
I read off updates from both Bharata and Raghu on the latest revisions of
their respective series.
Bharata is in the process of getting the next revision of his series ready
to send and is collecting microbenchmark results to share. The revision is
intended to go out over the weekend, or Monday at the latest.
Raghu reported that he was able to root-cause a performance issue in
kscand based on feedback received from Jonathan; he has not yet posted
that series upstream. He has also made some early progress on LRU-based
scanning and will have updates for the next meeting.
----->o-----
Shivank presented slides that were added to the shared drive.
He walked through his methodology for microbenchmark testing: the kernel
promotes hot pages to the local NUMA node and the benchmark completes
faster when more hot pages are migrated early. He intended to answer the
question of whether kpromoted with DMA offload will shorten the benchmark
completion time.
To emulate this environment, he created a memoryless NUMA node to act as
slow-tier memory and then tested kpromoted promoting from Node 2 to Node 1
(remote vs local NUMA latency). This environment is close to real-world
CXL systems in terms of memory access latency.
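As an aside for anyone wanting to reproduce this kind of remote-vs-local
measurement, a minimal userspace latency probe might look like the
following. This is a sketch only, not Shivank's actual harness; the node
numbers and working-set size are assumptions:

/* latency_probe.c: compare dependent-load latency on two NUMA nodes.
 * Build: gcc -O2 latency_probe.c -lnuma -o latency_probe
 * Node numbers and working-set size are illustrative assumptions.
 */
#include <numa.h>
#include <stdio.h>
#include <stdlib.h>
#include <time.h>

#define NSTEPS (1UL << 24)

static volatile size_t sink;

/* Pointer-chase through memory bound to 'node'; dependent loads keep
 * the hardware prefetcher from hiding the access latency.
 */
static double chase_ns(int node, size_t sz)
{
	size_t n = sz / sizeof(size_t), i, next = 0;
	size_t *buf = numa_alloc_onnode(sz, node);
	struct timespec t0, t1;

	if (!buf) {
		fprintf(stderr, "numa_alloc_onnode failed\n");
		exit(1);
	}

	/* Sattolo's algorithm: a random single-cycle permutation. */
	for (i = 0; i < n; i++)
		buf[i] = i;
	for (i = n - 1; i > 0; i--) {
		size_t j = rand() % i, tmp = buf[i];
		buf[i] = buf[j];
		buf[j] = tmp;
	}

	clock_gettime(CLOCK_MONOTONIC, &t0);
	for (i = 0; i < NSTEPS; i++)
		next = buf[next];	/* each load depends on the last */
	clock_gettime(CLOCK_MONOTONIC, &t1);
	sink = next;			/* keep the loop from being elided */

	numa_free(buf, sz);
	return ((t1.tv_sec - t0.tv_sec) * 1e9 +
		(t1.tv_nsec - t0.tv_nsec)) / NSTEPS;
}

int main(void)
{
	size_t sz = 256UL << 20;	/* 256MB working set (assumption) */

	if (numa_available() < 0) {
		fprintf(stderr, "NUMA not supported\n");
		return 1;
	}
	/* Node 1 = local fast tier, Node 2 = emulated slow tier (assumed). */
	printf("node 1: %.1f ns/load\n", chase_ns(1, sz));
	printf("node 2: %.1f ns/load\n", chase_ns(2, sz));
	return 0;
}

The pointer chase defeats the prefetcher, so the per-load time
approximates the raw access latency of each node.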
Gregory noted that since this setup uses the cross-socket link, we need to
account for its bandwidth limitations; the way latency climbs as bandwidth
becomes pressured will not look the same as on a real CXL device.
Results for microbenchmark completion time with and without DMA offload
(8-channel DMA), mean in microseconds:

abench access pattern    Offload disabled    Offload enabled    Improvement (%)
random                        39114520.00        35313111.00               9.72
random repeat                 11089918.20         9045263.60              18.44
sequential                    27677787.00        24576768.40              11.20
The overall conclusion was that DMA offloading reduces abench completion
time by 9-18% (lower is better).
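(Sanity check against the table: for the random pattern,
(39114520.00 - 35313111.00) / 39114520.00 = 0.0972, i.e. the reported
9.72% improvement.)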
Gregory asked how there could be any improvement here if there will be
promotions from Node 2 -> 1, but also demotions from Node 1 -> 2. Shivank
noted there was no memory pressure on Node 1, which may not be realistic
(there will always be demotions).
Yiannis noted it would be better to include numbers on promotions and
demotions. Gregory asked for memory usage before and after on each node.
Wei Xu noted that you may not want to saturate the bandwidth with
promotions but rather leave the majority of bandwidth for organic memory
accesses. He also noted that since we are using THPs here, a single hot
access may require migrating an entire 2MB THP (512 4KB pages), which may
not be ideal. Shivank noted that this doesn't necessarily require THP;
improvements were observed even when batch-migrating 100 or 200 4KB pages.
----->o-----
For DMA offloading of migrate_pages(), Shivank opined that SDXI may be a
better fit, but he lacks the hardware needed to evaluate it. He also noted
an upstream series being worked on by Nvidia that offloads the copy to
multiple threads on idle CPU cores.
In the new batched design for migrate_folios_batch_move(), the batch copy
of folio data goes through a migrator that is configurable at runtime via
sysfs. For example, static_call(_folios_copy) allows for pluggable
migrators (default CPU copy, multi-threaded CPU copy, DMA offload, etc.).
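For those unfamiliar with the mechanism, a rough sketch of what a
static_call()-based migrator dispatch can look like follows; the names,
copy-function signature, and sysfs hook here are illustrative assumptions,
not the actual interface from the patch series:

/* Hedged sketch: names and signature are assumptions, not the actual
 * interface from Shivank's series.
 */
#include <linux/list.h>
#include <linux/static_call.h>

/* Default migrator: copy each folio's data on the calling CPU. */
static void folios_copy_cpu(struct list_head *dst_folios,
			    struct list_head *src_folios)
{
	/* ... per-folio copy loop ... */
}

/* Hypothetical DMA-backed implementation, provided elsewhere. */
extern void folios_copy_dma(struct list_head *dst_folios,
			    struct list_head *src_folios);

DEFINE_STATIC_CALL(_folios_copy, folios_copy_cpu);

/* Hot path in migrate_folios_batch_move(): static_call() patches the
 * call site directly, so there is no indirect branch in the common case.
 */
static void migrate_folios_copy(struct list_head *dst, struct list_head *src)
{
	static_call(_folios_copy)(dst, src);
}

/* Runtime switch, e.g. invoked from a sysfs store handler: */
static void select_dma_migrator(void)
{
	static_call_update(_folios_copy, &folios_copy_dma);
}

The appeal of static_call() over a plain function pointer is that
switching migrators is rare while the copy path is hot, so paying a
text-patching cost at update time to get a direct call on every migration
is a good trade.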
I suggested that this would be valuable even beyond CXL use cases; for
example, being able to offload most of the copying done by memory
compaction would be generally useful. Shivank said the patch series is
currently at v3 and is receiving feedback from Jonathan. There is ongoing
discussion about the right interface for configuring the migrator and its
parameters.
Shivank is currently benchmarking with move_pages(2) directly, which uses
all of this machinery under the hood; he sees the most benefit from
batched migration. Gregory noted that reclaim also does batched migration,
so the work is generally useful.
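For reference, the userspace side of such a benchmark can be as small as
the sketch below; the target node, batch size, and use of the calling
process are assumptions for illustration:

/* move_batch.c: batch-migrate pages between NUMA nodes via move_pages(2).
 * Build: gcc -O2 move_batch.c -lnuma -o move_batch
 */
#include <numaif.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <sys/mman.h>
#include <unistd.h>

#define BATCH 512			/* pages migrated per syscall */

int main(void)
{
	long page_sz = sysconf(_SC_PAGESIZE);
	size_t len = BATCH * page_sz;
	void *pages[BATCH];
	int nodes[BATCH], status[BATCH], i;

	char *buf = mmap(NULL, len, PROT_READ | PROT_WRITE,
			 MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
	if (buf == MAP_FAILED) {
		perror("mmap");
		return 1;
	}
	memset(buf, 1, len);		/* fault all pages in */

	for (i = 0; i < BATCH; i++) {
		pages[i] = buf + i * page_sz;
		nodes[i] = 1;		/* assumed fast-tier target node */
	}

	/* One call migrates the whole batch; status[i] reports where each
	 * page ended up, or a negative errno on per-page failure.
	 */
	if (move_pages(0 /* self */, BATCH, pages, nodes, status,
		       MPOL_MF_MOVE) < 0) {
		perror("move_pages");
		return 1;
	}
	printf("page 0 is now on node %d\n", status[0]);
	munmap(buf, len);
	return 0;
}

Batching the pages into a single call amortizes the syscall and TLB-flush
overhead, which is the same effect the in-kernel batched move path is
after.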
----->o-----
Next meeting will be on Thursday, November 20 at 8:30am PST (UTC-8),
everybody is welcome: https://meet.google.com/jak-ytdx-hnm
Topics for the next meeting:
- updates on Bharata's series sent out between meetings with
microbenchmarks
- updates on Raghu's series for kscand and integration of LRU based
scanning
- discuss a generalized subsystem for providing bandwidth information
independent of the underlying platform, ideally through resctrl;
otherwise, utilizing bandwidth information will be challenging
+ preferably this bandwidth monitoring is per tier (slow vs fast)
rather than per NUMA node
- similarly, discuss generalized subsystem for providing memory hotness
information
- determine minimal viable upstream opportunity to optimize for tiering
that is extensible for future use cases and optimizations
- update on non-temporal stores enlightenment for memory tiering
- enlightening migrate_pages() for hardware assists and how this work
will be charged to userspace, including for memory compaction
- discuss overall testing and benchmarking methodology for various
approaches as we go along
Please let me know if you'd like to propose additional topics for
discussion, thank you!