From: Yiannis Nikolakopoulos <yiannis.nikolakop@gmail.com>
To: Gregory Price <gourry@gourry.net>
Cc: Jonathan Cameron <jonathan.cameron@huawei.com>,
Wei Xu <weixugc@google.com>,
David Rientjes <rientjes@google.com>,
Matthew Wilcox <willy@infradead.org>,
Bharata B Rao <bharata@amd.com>,
linux-kernel@vger.kernel.org, linux-mm@kvack.org,
dave.hansen@intel.com, hannes@cmpxchg.org,
mgorman@techsingularity.net, mingo@redhat.com,
peterz@infradead.org, raghavendra.kt@amd.com, riel@surriel.com,
sj@kernel.org, ying.huang@linux.alibaba.com, ziy@nvidia.com,
dave@stgolabs.net, nifan.cxl@gmail.com, xuezhengchu@huawei.com,
akpm@linux-foundation.org, david@redhat.com, byungchul@sk.com,
kinseyho@google.com, joshua.hahnjy@gmail.com,
yuanchu@google.com, balbirs@nvidia.com,
alok.rathore@samsung.com, yiannis@zptcorp.com,
Adam Manzanares <a.manzanares@samsung.com>
Subject: Re: [RFC PATCH v2 0/8] mm: Hot page tracking and promotion infrastructure
Date: Thu, 16 Oct 2025 13:48:00 +0200 [thread overview]
Message-ID: <CAOi6=wTXSTgTB+KUpn+LOUGvXg4UeEz-DN0mh-LjChn3g8YiHA@mail.gmail.com> (raw)
In-Reply-To: <aNVUj0s30rrXEh4C@gourry-fedora-PF4VCD3F>
Hi Gregory,
Thanks for all the feedback. I am finally getting some time to come
back to this.
On Thu, Sep 25, 2025 at 4:41 PM Gregory Price <gourry@gourry.net> wrote:
>
> On Thu, Sep 25, 2025 at 04:03:46PM +0200, Yiannis Nikolakopoulos wrote:
> > >
> > > For the hardware compression devices how are you dealing with capacity variation
> > > / overcommit?
> ...
> > What is different from standard tiering is that the control plane is
> > checked on demotion to make sure there is still capacity left. If not, the
> > demotion fails. While this seems stable so far, a missing piece is to
> > ensure that this tier is mainly written by demotions and not arbitrary kernel
> > allocations (at least as a starting point). I want to explore how mempolicies
> > can help there, or something of the sort that Gregory described.
> >
>
> Writing back the description as i understand it:
>
> 1) The intent is to only have this memory allocable via demotion
> (i.e. no fault or direct allocation from userland possible)
Yes, that looks to me like the "safe" way to begin with. In theory
you could have userland apps/middleware that are aware of this memory
and its quirks and are OK to use it, but I guess we can leave that for
later; it feels like it could be provided by a separate driver.
>
> 2) The intent is to still have this memory accessible directly (DMA),
> while compressed, not trigger a fault/promotion on access
> (i.e. no zswap faults)
Correct. One of the big advantages of CXL.mem is the cache-line access
granularity and our customers don't want to lose that.
>
> 3) The intent is to have an external monitoring software handle
> outrunning run-away decompression/hotness by promoting that data.
External monitoring is not strictly necessary; it could e.g. be an
additional source of input to the kpromote/kmigrate solution.
>
> So basically we want a zswap-like interface for allocation, but to
If by "zswap-like interface" you mean something that can reject the
demote (or the store, in zswap semantics), then yes.
I just want to be careful when comparing with zswap.
> retain the `struct page` in page tables such that no faults are incurred
> on access. Then if the page becomes hot, depend on some kind of HMU
> tiering system to get it off the device.
Correct.
>
> I think we all understand there's some bear we have to outrun to deal
> with problem #3 - and many of us are skeptical that the bear won't catch
> up with our pants down. Let's ignore this for the moment.
Agreed.
>
> If such a device's memory is added to the default page allocator, then
> the question becomes one of *isolation* - such that the kernel will
> provide some "Capital-G Guarantee" that certain NUMA nodes will NEVER
> be used except under very explicit scenarios.
>
> There are only 3 mechanisms with which to restrict this (presently):
>
> 1) ZONE membership (to disallow GFP_KERNEL)
> 2) cgroups->cpusets->mems_allowed
> 3) task/vma mempolicy
> (obvious #4: Don't put it in the default page allocator)
>
> cpusets and mempolicy are not sufficient to provide full isolation
> - cgroups have the opposite hierarchical relationship than desired.
> The parent cgroup will lock out all children cgroups from using nodes
> not present in the parent mems_allowed. e.g. if you lock out access
> from the root cgroup, no cgroup on the entire system is eligible to
> allocate the memory. If you don't lock out the root cgroup - any root
> cgroup task is eligible. This isn't tractable.
>
> - task/vma mempolicy gets ignored in many cases and is closer to a
> suggestion than enforceable. It's also subject to rebinding as a
> task's cgroups.cpuset.mems_allowed changes.
>
> I haven't read up enough on ZONE_DEVICE to understand the implications
> of membership there, but have you explored this as an option? I don't
> see the work i'm doing intersecting well with your efforts - except
> maybe on the vmscan.c work around allocation on demotion.
Thanks for the very helpful breakdown. Your take on #2 & #3 seems
reasonable. About #1, I've skimmed through the rest of the thread and
I'll continue addressing your responses there.
Yiannis
>
> The work i'm doing is more aligned with - hey, filesystems are a global
> resource, why are we using cgroup/task/vma policies to dictate whether a
> filesystem's cache is eligible to land in remote nodes? i.e. drawing
> better boundaries and controls around what can land in some set of
> remote nodes "by default". You're looking for *strong isolation*
> controls, which implies a different kind of allocator interface.
>
> ~Gregory