From: Balbir Singh <balbirs@nvidia.com>
To: Wei Xu <weixugc@google.com>, David Rientjes <rientjes@google.com>,
Bharata B Rao <bharata@amd.com>
Cc: Gregory Price <gourry@gourry.net>,
Matthew Wilcox <willy@infradead.org>,
linux-kernel@vger.kernel.org, linux-mm@kvack.org,
Jonathan.Cameron@huawei.com, dave.hansen@intel.com,
hannes@cmpxchg.org, mgorman@techsingularity.net,
mingo@redhat.com, peterz@infradead.org, raghavendra.kt@amd.com,
riel@surriel.com, sj@kernel.org, ying.huang@linux.alibaba.com,
ziy@nvidia.com, dave@stgolabs.net, nifan.cxl@gmail.com,
xuezhengchu@huawei.com, yiannis@zptcorp.com,
akpm@linux-foundation.org, david@redhat.com, byungchul@sk.com,
kinseyho@google.com, joshua.hahnjy@gmail.com, yuanchu@google.com,
alok.rathore@samsung.com
Subject: Re: [RFC PATCH v2 0/8] mm: Hot page tracking and promotion infrastructure
Date: Wed, 17 Sep 2025 13:20:49 +1000
Message-ID: <71ac5779-d535-4b0f-bf8d-7a60bf6a6ecf@nvidia.com>
In-Reply-To: <CAAPL-u-d6taxKZuhTe=T-0i2gdoDYSSqOeSVi3JmFt_dDbU4cQ@mail.gmail.com>
On 9/17/25 10:30, Wei Xu wrote:
> On Tue, Sep 16, 2025 at 12:45 PM David Rientjes <rientjes@google.com> wrote:
>>
>> On Wed, 10 Sep 2025, Gregory Price wrote:
>>
>>> On Wed, Sep 10, 2025 at 04:39:16PM +0100, Matthew Wilcox wrote:
>>>> On Wed, Sep 10, 2025 at 08:16:45PM +0530, Bharata B Rao wrote:
>>>>> This patchset introduces a new subsystem for hot page tracking
>>>>> and promotion (pghot) that consolidates memory access information
>>>>> from various sources and enables centralized promotion of hot
>>>>> pages across memory tiers.
>>>>
>>>> Just to be clear, I continue to believe this is a terrible idea and we
>>>> should not do this. If systems will be built with CXL (and given the
>>>> horrendous performance, I cannot see why they would be), the kernel
>>>> should not be migrating memory around like this.
>>>
>>> I've been considering this problem from the opposite approach since LSFMM.
>>>
>>> Rather than decide how to move stuff around, what if instead we just
>>> decide never to put certain classes of memory on CXL? Right now, so
>>> long as CXL is in the page allocator, it's the wild west - any page can
>>> end up anywhere.
>>>
>>> I have enough data now from ZONE_MOVABLE-only CXL deployments on real
>>> workloads to show local CXL expansion is valuable and performant enough
>>> to be worth deploying - but the key piece for me is that ZONE_MOVABLE
>>> disallows GFP_KERNEL. For example: this keeps SLAB meta-data out of
>>> CXL, but allows any given user-driven page allocation (including page
>>> cache, file, and anon mappings) to land there.
>>>
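As an aside, the reason the ZONE_MOVABLE restriction gives that guarantee
is the zone selection rule: only allocations carrying both the movable and
highmem-capable GFP bits can be served from ZONE_MOVABLE, and GFP_KERNEL
carries neither. A tiny standalone model of that rule (illustrative flag
values, not the kernel's actual GFP_ZONE_TABLE encoding):

#include <stdio.h>

#define MY_GFP_HIGHMEM	0x1u	/* illustrative bits, not the kernel's */
#define MY_GFP_MOVABLE	0x2u

enum my_zone { MY_ZONE_NORMAL, MY_ZONE_MOVABLE };

static enum my_zone pick_zone(unsigned int gfp)
{
	/* ZONE_MOVABLE is opted into only by movable + highmem allocations */
	if ((gfp & MY_GFP_HIGHMEM) && (gfp & MY_GFP_MOVABLE))
		return MY_ZONE_MOVABLE;	/* user pages: anon, page cache */
	return MY_ZONE_NORMAL;		/* GFP_KERNEL, slab metadata, ... */
}

int main(void)
{
	unsigned int gfp_kernel = 0;
	unsigned int gfp_highuser_movable = MY_GFP_HIGHMEM | MY_GFP_MOVABLE;

	printf("GFP_KERNEL           -> %s\n",
	       pick_zone(gfp_kernel) == MY_ZONE_MOVABLE ?
	       "ZONE_MOVABLE" : "ZONE_NORMAL");
	printf("GFP_HIGHUSER_MOVABLE -> %s\n",
	       pick_zone(gfp_highuser_movable) == MY_ZONE_MOVABLE ?
	       "ZONE_MOVABLE" : "ZONE_NORMAL");
	return 0;
}
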
>>
>> This is similar to our use case, although the direct allocation can be
>> controlled by cpusets or mempolicies as needed depending on the memory
>> access latency required for the workload; nothing new there, though: it's
>> the same argument as NUMA in general, and the abstraction of these far
>> memory nodes as separate NUMA nodes makes this very straightforward.
>>
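For completeness, the mempolicy control referred to above is the
set_mempolicy(2)/mbind(2) interface. A minimal sketch that binds a
latency-sensitive task's allocations to node 0 (assumed here to be the
local DRAM node; the node number is purely illustrative), built with
-lnuma:

#include <numaif.h>	/* set_mempolicy(), MPOL_BIND; link with -lnuma */
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

int main(void)
{
	/* Bit 0 set: allow allocations from node 0 (fast DRAM) only. */
	unsigned long nodemask = 1UL << 0;

	if (set_mempolicy(MPOL_BIND, &nodemask, 8 * sizeof(nodemask)) != 0) {
		perror("set_mempolicy");
		return 1;
	}

	/* Everything this task faults in from here on comes from node 0,
	 * so none of its pages can land on a far-memory (e.g. CXL) node. */
	size_t sz = 64UL << 20;
	char *buf = malloc(sz);
	if (!buf)
		return 1;
	memset(buf, 0xa5, sz);
	free(buf);
	return 0;
}
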
>>> I'm hoping to share some of this data in the coming months.
>>>
>>> I've yet to see any strong indication that a complex hotness/movement
>>> system is warranted (yet) - but that may simply be because we have
>>> local cards with no switching involved. So far LRU-based promotion and
>>> demotion has been sufficient.
>>>
>>
>> To me, this is a key point. As we've discussed in meetings, we're in the
>> early days here. The CHMU does provide a lot of flexibility, both to
>> create very good and very bad hotness trackers. But I think the key point
>> is that we have multiple sources of hotness information depending on the
>> platform, and some of these sources only make sense for the kernel (or a
>> BPF offload) to maintain as the source of truth. Some of these sources
>> will be clear-on-read, so only one entity can serve as the source of
>> truth for page hotness.
>>
>> I've been pretty focused on the promotion story here rather than demotion
>> because of how responsive it needs to be. Harvesting the page table
>> accessed bits or waiting on a sliding window through NUMA Balancing (even
>> NUMAB=2) is not as responsive as needed for very fast promotion to top
>> tier memory, hence things like the CHMU (or PEBS or IBS etc).
>>
>> A few things that I think we need to discuss and align on:
>>
>> - the kernel as the source of truth for all memory hotness information,
>> which can then be abstracted and used for multiple downstream purposes,
>> memory tiering only being one of them
>>
>> - the long-term plan for NUMAB=2 and memory tiering support in the kernel
>> in general: are we planning on supporting this through NUMA hint faults
>> forever, despite their drawbacks (too slow, too much overhead for KVM)?
>>
>> - the role of the kernel vs userspace in driving the memory migration;
>> lots of discussion on hardware assists that can be leveraged for memory
>> migration, but today the balancing is driven in process context. The
>> kthread as the driver of migration is an argument that has yet to be
>> fully sold, but it is where a number of companies are currently looking.
>>
>> There's also some feature support possible with these CXL memory
>> expansion devices, which have started to pop up in labs, that can
>> drastically reduce overall TCO. Perhaps Wei Xu, cc'd, will be able to
>> chime in as well.
>>
>> This topic seems due for an alignment session as well, so I will look to
>> get that scheduled in the coming weeks if people are up for it.
>
> Our experience is that workloads in hyperscale data centers such as
> Google often have significant cold memory. Offloading this to CXL memory
> devices, backed by cheaper, lower-performance media (e.g. DRAM with
> hardware compression), can be a practical approach to reduce overall
> TCO. Page promotion and demotion are then critical for such a tiered
> memory system.
>
> A kernel thread to drive hot page collection and promotion seems
> logical, especially since hot page data from new sources (e.g. CHMU)
> are collected outside the process execution context and in the form of
> physical addresses.
>
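To make the shape of such a worker concrete, here is a rough userspace
model of a "drain the hottest PFNs in batches" loop. All names, numbers
and the ordering policy are illustrative only, not the pghot/kpromoted
implementation:

#include <stdio.h>
#include <stdlib.h>

struct hot_rec {
	unsigned long pfn;	/* physical frame reported by the source */
	unsigned int nid;	/* node the PFN currently resides on */
	unsigned int count;	/* accesses accumulated this window */
};

/* Sort hottest first. */
static int hotter_first(const void *a, const void *b)
{
	const struct hot_rec *x = a, *y = b;

	return (int)y->count - (int)x->count;
}

/* One pass of the model worker: "promote" the hottest batch. */
static void drain_batch(struct hot_rec *tab, size_t n, size_t batch)
{
	qsort(tab, n, sizeof(*tab), hotter_first);
	if (batch > n)
		batch = n;
	for (size_t i = 0; i < batch; i++)
		printf("promote pfn 0x%lx from node %u (count %u)\n",
		       tab[i].pfn, tab[i].nid, tab[i].count);
}

int main(void)
{
	/* Pretend these records came from CHMU/IBS/PEBS, keyed by PFN. */
	struct hot_rec tab[] = {
		{ 0x12340, 2,  3 }, { 0x12341, 2, 17 },
		{ 0x55000, 2,  9 }, { 0x55001, 2,  1 },
	};

	drain_batch(tab, sizeof(tab) / sizeof(tab[0]), 2);
	return 0;
}
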
> I do agree that we need to balance the complexity and benefits of any
> new data structures for hotness tracking.
I think there is a mismatch between the tiering structure and
the patches. See the example in the memory tiering code:
/*
 * ...
 * Example 3:
 *
 * Node 0 is CPU + DRAM nodes, Node 1 is HBM node, node 2 is PMEM node.
 *
 * node distances:
 * node   0    1    2
 *    0  10   20   30
 *    1  20   10   40
 *    2  30   40   10
 *
 * memory_tiers0 = 1
 * memory_tiers1 = 0
 * memory_tiers2 = 2
 * ..
 */
The topmost tier need not be DRAM; in the example above the topmost tier
(memory_tiers0) is the HBM node 1, not the DRAM node 0. Yet patch 3 states:
"
[..]
* kpromoted is a kernel thread that runs on each toptier node and
* promotes pages from max_heap.
"
Also, there is no data in the cover letter indicating which workloads
benefit from migration to the top tier, and by how much.
Balbir