[Linux Memory Hotness and Promotion] Notes from April 9, 2026

From: David Rientjes <rientjes@google.com>
To: Davidlohr Bueso <dave@stgolabs.net>, Fan Ni <nifan.cxl@gmail.com>,
	 Gregory Price <gourry@gourry.net>,
	 Jonathan Cameron <Jonathan.Cameron@huawei.com>,
	 Joshua Hahn <joshua.hahnjy@gmail.com>,
	Raghavendra K T <rkodsara@amd.com>,
	 "Rao, Bharata Bhasker" <bharata@amd.com>,
	SeongJae Park <sj@kernel.org>,  Wei Xu <weixugc@google.com>,
	Xuezheng Chu <xuezhengchu@huawei.com>,
	 Yiannis Nikolakopoulos <yiannis@zptcorp.com>,
	Zi Yan <ziy@nvidia.com>
Cc: linux-mm@kvack.org
Subject: [Linux Memory Hotness and Promotion] Notes from April 9, 2026
Date: Sun, 12 Apr 2026 17:30:20 -0700 (PDT)
Message-ID: <4b9961f6-8571-1d45-6a67-2c9896ac04ef@google.com>

Hi everybody,

Here are the notes from the last Linux Memory Hotness and Promotion call
that happened on Thursday, April 9.  Thanks to everybody who was involved!

These notes are intended to bring people up to speed who could not attend 
the call as well as keep the conversation going in between meetings.

----->o-----
Shivank updated offline that he is working on addressing the v4 review 
feedback for his patch series and that he posted a compaction benchmark 
result on the v4 thread showing DMA offload freeing ~6% more CPU cycles 
for competing workloads on a busy system.

----->o-----
Bharata updated on the status of v6 of his patch series.  He has addressed 
all of the review comments and is getting ready to post v7.  This series 
will drop the RFC tag.  It will also include another source of page 
hotness information: the IBS based memory profiler.  This is a new 
instance of IBS that is available on Zen6 and later.  Earlier revisions 
were using the standard IBS subsystem and there was an open question about 
how this would be shared with the Linux perf subsystem.

He is also working on migration from non-process context: in this case 
there is no access to VMAs or VMA flags.  This poses a limitation for 
shared executable pages and prevents them from getting promoted/migrated.  
He was specifically looking at migrate_misplaced_folio_prepare(), which 
operates on a folio; with traditional NUMA Balancing it also has a VMA 
because it runs in process context.  The VMA will not be there in the 
asynchronous promotion path through kmigrated.  In process context, you 
would be able to check if VM_EXEC is set or if the folio is mapped shared.
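
As a rough illustration of why the missing VMA matters, here is a 
user-space model of the check described above.  It is not kernel code: 
every name prefixed with model_ is a hypothetical stand-in, the mapcount 
test is a simplification of "mapped shared", and the verdict returned for 
the no-VMA case is only an assumption about how such a path could report 
that it cannot decide.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-ins for illustration; not the kernel's definitions. */
#define MODEL_VM_EXEC 0x4UL

struct model_vma {
	unsigned long vm_flags;
};

struct model_folio {
	bool file_backed;	/* page cache folio rather than anonymous */
	int mapcount;		/* number of mappings of this folio */
};

enum model_verdict {
	MODEL_PROMOTE_OK,
	MODEL_SKIP_SHARED_EXEC,	/* probably a shared library: leave in place */
	MODEL_UNKNOWN_NO_VMA,	/* cannot tell without VMA flags */
};

/* Simplified notion of "mapped shared": more than one mapping. */
static bool model_folio_mapped_shared(const struct model_folio *folio)
{
	return folio->mapcount > 1;
}

/*
 * With a VMA (process context) both conditions can be tested.  Without
 * one (asynchronous path, vma == NULL) only the folio side is visible,
 * so a shared file folio cannot be classified as executable or not.
 */
enum model_verdict model_promotion_check(const struct model_folio *folio,
					 const struct model_vma *vma)
{
	assert(folio != NULL);

	if (!folio->file_backed || !model_folio_mapped_shared(folio))
		return MODEL_PROMOTE_OK;
	if (vma == NULL)
		return MODEL_UNKNOWN_NO_VMA;
	if (vma->vm_flags & MODEL_VM_EXEC)
		return MODEL_SKIP_SHARED_EXEC;
	return MODEL_PROMOTE_OK;
}
```

Calling model_promotion_check() with a NULL vma for a shared file folio 
returns MODEL_UNKNOWN_NO_VMA, which is the situation the asynchronous 
path would need to resolve in some other way.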

The v7 of this series should be available within two weeks.

----->o-----
Joshua updated on tier-aware memcg limits and suggested that v2 is going 
to look very different from v1.  This is a byproduct of being the first 
project to bring the concept of tiering to memcg, which has required a lot 
of prerequisite work.  There is an awkward interaction between per-cpu 
stock and limit checking for top tier memory.  He started looking into how 
stock could work with different page counter metrics.

Wei suggested treating all the stock as top tier and this should be the 
default location for where the memory originates from.  Joshua tried to 
rework the stock mechanism which is per-memcg per cpu but this is now 
pushed to the page counter level where each page counter has its own 
stock.  There are other tangential benefits to doing it that way, 
including for non-tiered users.  We will have two different page counters: 
one for top tier and one for low tier; this is not user visible, however.  
Instead of passing a number of pages to charge, we'd either pass a folio 
or an indication if the node is top tier or not.  This also requires 
converting all the memcg stat items to lruvec stat items.
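
To make the "each page counter has its own stock" idea concrete, here is 
a minimal single-threaded user-space sketch.  It is not the proposed 
patch: the model_ names, the batch size of 64 and the drain-on-failure 
policy are all assumptions made for illustration, and the real stock is 
per cpu.  A boolean stands in for "an indication if the node is top tier 
or not".

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical batch size for refilling a counter's stock. */
#define MODEL_STOCK_BATCH 64UL

struct model_page_counter {
	unsigned long usage;	/* pages charged, including stocked pages */
	unsigned long max;	/* limit for this counter */
	unsigned long stock;	/* pre-charged pages not yet handed out */
};

/* One counter per tier; in this model neither is user visible. */
struct model_memcg {
	struct model_page_counter top_tier;
	struct model_page_counter low_tier;
};

static bool model_counter_try_charge(struct model_page_counter *c,
				     unsigned long nr_pages)
{
	if (nr_pages > c->max - c->usage)
		return false;
	c->usage += nr_pages;
	return true;
}

/* Charge nr_pages to the counter selected by the tier indication. */
bool model_charge(struct model_memcg *memcg, unsigned long nr_pages,
		  bool top_tier)
{
	struct model_page_counter *c =
		top_tier ? &memcg->top_tier : &memcg->low_tier;
	unsigned long batch =
		nr_pages > MODEL_STOCK_BATCH ? nr_pages : MODEL_STOCK_BATCH;

	assert(nr_pages > 0);

	/* Fast path: serve the request from this counter's own stock. */
	if (c->stock >= nr_pages) {
		c->stock -= nr_pages;
		return true;
	}

	/* Slow path: charge a whole batch and keep the excess as stock. */
	if (model_counter_try_charge(c, batch)) {
		c->stock += batch - nr_pages;
		return true;
	}

	/* Near the limit: return the stock, then try the exact amount. */
	c->usage -= c->stock;
	c->stock = 0;
	return model_counter_try_charge(c, nr_pages);
}
```

Because the stock lives in the counter, charges against top_tier never 
consume or refill the low_tier stock, so the limit check for each tier 
only ever sees pages that belong to that tier.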

Joshua noted that there are configurations where kernel memory would come 
from lower tiers if set as ZONE_NORMAL.  In very stressed situations, he 
has observed socket memory getting demoted to the low tier.

The page counter addition would be sent out soon and then we can decide 
how to manage stock for top tier memory.

----->o-----
Yiannis updated that he was looking into non-temporal stores for memory 
tiering.  He's prepared a follow-up to his previous patch series that was 
shared with this group; it should be posted upstream by Monday.  
Preferably this would include performance numbers to share.

He is slightly concerned about the duplication of arch/x86 code that is 
called into for memory copy from the migrate_pages() path.  The next 
proposal may not be the cleanest implementation but he was still looking 
to solicit upstream feedback.

Bharata asked if the non-temporal store work is happening in parallel to 
Shivank's work for DMA offload.  Yiannis looked into the first version of 
Shivank's series but hasn't looked recently.  The goal was to get 
non-temporal store feedback even independent of other work happening.

Bharata noted that he was doing experiments for non-temporal writes in the 
page clearing path.  This shows promising throughput results with 
handwritten benchmarks but when running upstream benchmarks the gain 
was not as significant.  Yiannis noted that his main motivation was for 
compression backends.  Wei noted that using non-temporal writes should 
reduce bandwidth consumption to the device.
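
For readers less familiar with non-temporal stores, here is a small 
user-space sketch of a copy loop that uses them.  It is not the arch/x86 
kernel code discussed above: it uses the SSE2 compiler intrinsics, the 
function name is hypothetical, and it falls back to memcpy() when SSE2 is 
not available.  Non-temporal stores typically bypass the cache and avoid 
reading the destination line before overwriting it, which is the usual 
reasoning behind the expected bandwidth saving.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#if defined(__SSE2__)
#include <emmintrin.h>
#endif

/*
 * Copy len bytes from src to dst using non-temporal (streaming) stores
 * where SSE2 is available.  dst must be 16-byte aligned and len must be
 * a multiple of 16; src may be unaligned.
 */
void model_copy_nontemporal(void *dst, const void *src, size_t len)
{
	assert(((uintptr_t)dst % 16) == 0);
	assert(len % 16 == 0);

#if defined(__SSE2__)
	__m128i *d = (__m128i *)dst;
	const __m128i *s = (const __m128i *)src;
	size_t i;

	for (i = 0; i < len / 16; i++) {
		/* Regular load from the source ... */
		__m128i v = _mm_loadu_si128(s + i);
		/* ... and a streaming store to the destination. */
		_mm_stream_si128(d + i, v);
	}
	/* Order the streaming stores before any later stores. */
	_mm_sfence();
#else
	/* Portable fallback: an ordinary cached copy. */
	memcpy(dst, src, len);
#endif
}
```

The result is byte-for-byte the same as memcpy(); the difference is only 
in how the destination lines travel through the cache hierarchy, so any 
benefit has to be measured on the target hardware.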

----->o-----
Next meeting will be on Thursday, April 23 at 8:30am PDT (UTC-7),
everybody is welcome: https://meet.google.com/jak-ytdx-hnm

Topics for the next meeting:

 - upcoming non-RFC v7 of Bharata's patch series, including new IBS
   hotness data separated from the general IBS subsystem
 - v4 of Shivank's series for enlightening migrate_pages() for hardware
   assists and how this work will be charged to userspace, including for
   memory compaction
 - v2 of tier-aware memcg limits, including new page counters and rework
   to pass folios into the charge path
 - Yiannis's patch series for non-temporal stores support
 - discuss generalized subsystem for providing bandwidth information
   independent of the underlying platform, ideally through resctrl,
   otherwise utilizing bandwidth information will be challenging
   + preferably this bandwidth monitoring is not per NUMA node but rather
     slow and fast
 - later: testing of tier-aware memcg limits with Bharata's changes once
   tier-aware memcg limits are stable and further along

Please let me know if you'd like to propose additional topics for
discussion, thank you!
