From: Jonathan Cameron <jonathan.cameron@huawei.com>
To: Yiannis Nikolakopoulos <yiannis.nikolakop@gmail.com>
Cc: Wei Xu <weixugc@google.com>, David Rientjes <rientjes@google.com>,
Gregory Price <gourry@gourry.net>,
Matthew Wilcox <willy@infradead.org>,
Bharata B Rao <bharata@amd.com>, <linux-kernel@vger.kernel.org>,
<linux-mm@kvack.org>, <dave.hansen@intel.com>,
<hannes@cmpxchg.org>, <mgorman@techsingularity.net>,
<mingo@redhat.com>, <peterz@infradead.org>,
<raghavendra.kt@amd.com>, <riel@surriel.com>, <sj@kernel.org>,
<ying.huang@linux.alibaba.com>, <ziy@nvidia.com>,
<dave@stgolabs.net>, <nifan.cxl@gmail.com>,
<xuezhengchu@huawei.com>, <akpm@linux-foundation.org>,
<david@redhat.com>, <byungchul@sk.com>, <kinseyho@google.com>,
<joshua.hahnjy@gmail.com>, <yuanchu@google.com>,
<balbirs@nvidia.com>, <alok.rathore@samsung.com>,
<yiannis@zptcorp.com>, Adam Manzanares <a.manzanares@samsung.com>
Subject: Re: [RFC PATCH v2 0/8] mm: Hot page tracking and promotion infrastructure
Date: Thu, 25 Sep 2025 16:00:58 +0100
Message-ID: <20250925160058.00002645@huawei.com>
In-Reply-To: <5A7E0646-0324-4463-8D93-A1105C715EB3@gmail.com>
On Thu, 25 Sep 2025 16:03:46 +0200
Yiannis Nikolakopoulos <yiannis.nikolakop@gmail.com> wrote:
Hi Yiannis,
> > On 17 Sep 2025, at 18:49, Jonathan Cameron <jonathan.cameron@huawei.com> wrote:
> >
> > On Tue, 16 Sep 2025 17:30:46 -0700
> > Wei Xu <weixugc@google.com> wrote:
> >
> >> On Tue, Sep 16, 2025 at 12:45 PM David Rientjes <rientjes@google.com> wrote:
> >>>
> >>> On Wed, 10 Sep 2025, Gregory Price wrote:
> >>>
> >>>> On Wed, Sep 10, 2025 at 04:39:16PM +0100, Matthew Wilcox wrote:
> >>>>> On Wed, Sep 10, 2025 at 08:16:45PM +0530, Bharata B Rao wrote:
> >>>>>> This patchset introduces a new subsystem for hot page tracking
> >>>>>> and promotion (pghot) that consolidates memory access information
> >>>>>> from various sources and enables centralized promotion of hot
> >>>>>> pages across memory tiers.
> >>>>>
> >>>>> Just to be clear, I continue to believe this is a terrible idea and we
> >>>>> should not do this. If systems will be built with CXL (and given the
> >>>>> horrendous performance, I cannot see why they would be), the kernel
> >>>>> should not be migrating memory around like this.
> >>>>
> >>>> I've been considering this problem from the opposite approach since LSFMM.
> >>>>
> >>>> Rather than decide how to move stuff around, what if instead we just
> >>>> decide not to ever put certain classes of memory on CXL. Right now, so
> >>>> long as CXL is in the page allocator, it's the wild west - any page can
> >>>> end up anywhere.
> >>>>
> >>>> I have enough data now from ZONE_MOVABLE-only CXL deployments on real
> >>>> workloads to show local CXL expansion is valuable and performant enough
> >>>> to be worth deploying - but the key piece for me is that ZONE_MOVABLE
> >>>> disallows GFP_KERNEL. For example: this keeps SLAB meta-data out of
> >>>> CXL, but allows any given user-driven page allocation (including page
> >>>> cache, file, and anon mappings) to land there.
> >>>>
> >>>
> [snip]
> >>> There's also some feature support that is possible with these CXL memory
> >>> expansion devices that have started to pop up in labs that can also
> >>> drastically reduce overall TCO. Perhaps Wei Xu, cc'd, will be able to
> >>> chime in as well.
> >>>
> >>> This topic seems due for an alignment session as well, so will look to get
> >>> that scheduled in the coming weeks if people are up for it.
> >>
> >> Our experience is that workloads in hyperscale data centers such as
> >> Google often have significant cold memory. Offloading this to CXL memory
> >> devices, backed by cheaper, lower-performance media (e.g. DRAM with
> >> hardware compression), can be a practical approach to reduce overall
> >> TCO. Page promotion and demotion are then critical for such a tiered
> >> memory system.
> >
> > For the hardware compression devices how are you dealing with capacity variation
> > / overcommit?
> I understand that this is indeed one of the key questions from the upstream
> kernel’s perspective.
> So, I am jumping in to answer w.r.t. what we do in ZeroPoint; obviously I
> cannot speak for other solutions/deployments. However, our HW interface follows
> existing open specifications from OCP [1], so what I am describing below is
> more widely applicable.
>
> At a very high level, the way our HW works is that the DPA is indeed
> overcommitted. Then, there is a control plane over CXL.io (PCIe) which
> exposes the real remaining capacity, as well as some configurable
> MSI-X interrupts that raise warnings when the capacity crosses over
> certain configurable thresholds.
>
> Last year I presented this interface in LSF/MM [2]. Based on the feedback I
> got there, we have an early prototype that acts as the *last* memory tier
> before reclaim (kind of "compressed tier in lieu of discard" as was
> suggested to me by Dan).
>
> What is different from standard tiering is that the control plane is
> checked on demotion to make sure there is still capacity left. If not, the
> demotion fails. While this seems stable so far, a missing piece is to
> ensure that this tier is mainly written by demotions and not arbitrary kernel
> allocations (at least as a starting point). I want to explore how mempolicies
> can help there, or something of the sort that Gregory described.
>
> This early prototype still needs quite some work in order to find the right
> abstractions. Hopefully, I will be able to push an RFC in the near future
> (a couple of months).
>
> > Whilst there have been some discussions on that, without a backing store
> > of flash or similar it seems challenging to use compressed memory in a
> > tiering system (i.e. as 'normalish' memory) unless you don't mind
> > occasionally and unexpectedly running out of memory (in nasty async ways
> > as dirty cache lines get written back).
> There are several things that may be done on the device side. For now, I
> think the kernel should be unaware of these. But with what I described
> above, the goal is to have the capacity thresholds configured in a way
> that we can absorb the occasional dirty cache lines that are written back.
In the worst case they are far from occasional. It's not hard to imagine a
malicious program that ensures all the L3 in a system (say 256MiB+) is full
of cache lines from the far compressed memory, all of which are changed in a
fashion that makes the allocation much less compressible. If you are doing
compression at cache-line granularity that's not so bad, because only a
256MiB margin would be needed. If the system in question is doing
larger-block compression, say 4KiB blocks, then we have a 64x
write-amplification multiplier. If the malicious program is streaming over
memory, the lines evicted as new lines are fetched have all been made much
less compressible before they are written back.
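Back-of-the-envelope, assuming 64-byte cache lines and the illustrative
256MiB / 4KiB figures above (not measurements), the margin you would have to
hold against this looks something like:

#include <stdio.h>

int main(void)
{
	/* Illustrative figures from above, not measurements. */
	unsigned long long l3_bytes = 256ULL << 20; /* dirty far-memory lines resident in L3 */
	unsigned long long line = 64;               /* cache line size */
	unsigned long long block = 4096;            /* device compression block size */

	/* Each dirtied line can force its whole block to be re-compressed
	 * and, in the worst case, stored nearly uncompressed. */
	unsigned long long amplification = block / line;
	unsigned long long margin = (l3_bytes / line) * block;

	printf("write amplification: %llux\n", amplification);          /* 64x */
	printf("worst-case capacity margin: %llu MiB\n", margin >> 20); /* 16384 MiB */
	return 0;
}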
Add an accelerator (say DPDK or other zero-copy into userspace buffers) into
the mix and you have a mess. You'll need to be extremely careful about what
goes into this compressed memory, or hold enormous buffer capacity against
fast changes in compressibility.
The key point is that all software is potentially malicious (sometimes
accidentally so ;)
Now, if we can put this into a special pool where it is acceptable to drop
the writes and return poison (so the application crashes), then that may be
fine. Or block writes entirely: running compressed memory as read-only CoW
is one way to avoid this problem.
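As a purely hypothetical userspace analogy of that last idea (a real
implementation would live in the kernel's fault and migration paths, not in
a SIGSEGV handler): map the 'compressed tier' read-only and have the first
write to a page copy it into ordinary anonymous memory, so writes never land
on the compressed device:

#define _GNU_SOURCE
#include <signal.h>
#include <string.h>
#include <sys/mman.h>
#include <unistd.h>

static long page_size;

/* On the first write to a read-only "tier" page, copy its contents into a
 * fresh writable anonymous page at the same address, then let the faulting
 * store retry. Error handling omitted for brevity. */
static void write_fault(int sig, siginfo_t *si, void *uc)
{
	static char copy[1 << 16];	/* big enough for common page sizes */
	char *page = (char *)((unsigned long)si->si_addr &
			      ~((unsigned long)page_size - 1));
	(void)sig; (void)uc;

	memcpy(copy, page, page_size);		/* read old (compressed) data */
	mmap(page, page_size, PROT_READ | PROT_WRITE,
	     MAP_PRIVATE | MAP_ANONYMOUS | MAP_FIXED, -1, 0);
	memcpy(page, copy, page_size);		/* write lands in normal DRAM */
}

int main(void)
{
	struct sigaction sa = { .sa_sigaction = write_fault,
				.sa_flags = SA_SIGINFO };
	char *tier;

	page_size = sysconf(_SC_PAGESIZE);
	sigaction(SIGSEGV, &sa, NULL);

	/* Stand-in for memory backed by the compressed device. */
	tier = mmap(NULL, page_size, PROT_READ,
		    MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
	tier[0] = 1;	/* faults, page is copied out, store is retried */
	return tier[0] == 1 ? 0 : 1;
}

Obviously for a real tier you would migrate the page to near DRAM rather
than just remapping anonymous memory, but it shows why a read-only mapping
stops in-place writes from collapsing compressibility.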
> >
> > Or do you mean zswap type use with a hardware offload of the actual
> > compression?
> I would categorize this as a completely different discussion (and product
> line for us).
>
> [1] https://www.opencompute.org/documents/hyperscale-tiered-memory-expander-specification-for-compute-express-link-cxl-1-pdf
> [2] https://www.youtube.com/watch?v=tXWEbaJmZ_s
>
> Thanks,
> Yiannis
>
> PS: Sending from a personal email address to avoid issues with
> confidentiality footers of the corporate domain.
Thread overview: 53+ messages
2025-09-10 14:46 Bharata B Rao
2025-09-10 14:46 ` [RFC PATCH v2 1/8] mm: migrate: Allow misplaced migration without VMA too Bharata B Rao
2025-09-10 14:46 ` [RFC PATCH v2 2/8] migrate: implement migrate_misplaced_folios_batch Bharata B Rao
2025-10-03 10:36 ` Jonathan Cameron
2025-10-03 11:02 ` Bharata B Rao
2025-09-10 14:46 ` [RFC PATCH v2 3/8] mm: Hot page tracking and promotion Bharata B Rao
2025-10-03 11:17 ` Jonathan Cameron
2025-10-06 4:13 ` Bharata B Rao
2025-09-10 14:46 ` [RFC PATCH v2 4/8] x86: ibs: In-kernel IBS driver for memory access profiling Bharata B Rao
2025-10-03 12:19 ` Jonathan Cameron
2025-10-06 4:28 ` Bharata B Rao
2025-09-10 14:46 ` [RFC PATCH v2 5/8] x86: ibs: Enable IBS profiling for memory accesses Bharata B Rao
2025-10-03 12:22 ` Jonathan Cameron
2025-09-10 14:46 ` [RFC PATCH v2 6/8] mm: mglru: generalize page table walk Bharata B Rao
2025-09-10 14:46 ` [RFC PATCH v2 7/8] mm: klruscand: use mglru scanning for page promotion Bharata B Rao
2025-10-03 12:30 ` Jonathan Cameron
2025-09-10 14:46 ` [RFC PATCH v2 8/8] mm: sched: Move hot page promotion from NUMAB=2 to kpromoted Bharata B Rao
2025-10-03 12:38 ` Jonathan Cameron
2025-10-06 5:57 ` Bharata B Rao
2025-10-06 9:53 ` Jonathan Cameron
2025-09-10 15:39 ` [RFC PATCH v2 0/8] mm: Hot page tracking and promotion infrastructure Matthew Wilcox
2025-09-10 16:01 ` Gregory Price
2025-09-16 19:45 ` David Rientjes
2025-09-16 22:02 ` Gregory Price
2025-09-17 0:30 ` Wei Xu
2025-09-17 3:20 ` Balbir Singh
2025-09-17 4:15 ` Bharata B Rao
2025-09-17 16:49 ` Jonathan Cameron
2025-09-25 14:03 ` Yiannis Nikolakopoulos
2025-09-25 14:41 ` Gregory Price
2025-10-16 11:48 ` Yiannis Nikolakopoulos
2025-09-25 15:00 ` Jonathan Cameron [this message]
2025-09-25 15:08 ` Gregory Price
2025-09-25 15:18 ` Gregory Price
2025-09-25 15:24 ` Jonathan Cameron
2025-09-25 16:06 ` Gregory Price
2025-09-25 17:23 ` Jonathan Cameron
2025-09-25 19:02 ` Gregory Price
2025-10-01 7:22 ` Gregory Price
2025-10-17 9:53 ` Yiannis Nikolakopoulos
2025-10-17 14:15 ` Gregory Price
2025-10-17 14:36 ` Jonathan Cameron
2025-10-17 14:59 ` Gregory Price
2025-10-20 14:05 ` Jonathan Cameron
2025-10-21 18:52 ` Gregory Price
2025-10-21 18:57 ` Gregory Price
2025-10-22 9:09 ` Jonathan Cameron
2025-10-22 15:05 ` Gregory Price
2025-10-23 15:29 ` Jonathan Cameron
2025-10-16 16:16 ` Yiannis Nikolakopoulos
2025-10-20 14:23 ` Jonathan Cameron
2025-10-20 15:05 ` Gregory Price
2025-10-08 17:59 ` Vinicius Petrucci