From: Gregory Price <gourry@gourry.net>
To: Jonathan Cameron <jonathan.cameron@huawei.com>
Cc: Yiannis Nikolakopoulos <yiannis.nikolakop@gmail.com>,
Wei Xu <weixugc@google.com>, David Rientjes <rientjes@google.com>,
Matthew Wilcox <willy@infradead.org>,
Bharata B Rao <bharata@amd.com>,
linux-kernel@vger.kernel.org, linux-mm@kvack.org,
dave.hansen@intel.com, hannes@cmpxchg.org,
mgorman@techsingularity.net, mingo@redhat.com,
peterz@infradead.org, raghavendra.kt@amd.com, riel@surriel.com,
sj@kernel.org, ying.huang@linux.alibaba.com, ziy@nvidia.com,
dave@stgolabs.net, nifan.cxl@gmail.com, xuezhengchu@huawei.com,
akpm@linux-foundation.org, david@redhat.com, byungchul@sk.com,
kinseyho@google.com, joshua.hahnjy@gmail.com, yuanchu@google.com,
balbirs@nvidia.com, alok.rathore@samsung.com,
yiannis@zptcorp.com, Adam Manzanares <a.manzanares@samsung.com>
Subject: Re: [RFC PATCH v2 0/8] mm: Hot page tracking and promotion infrastructure
Date: Wed, 1 Oct 2025 03:22:43 -0400
Message-ID: <aNzWwz5OYLOjwjLv@gourry-fedora-PF4VCD3F>
In-Reply-To: <aNWRuKGurAntxhxG@gourry-fedora-PF4VCD3F>
On Thu, Sep 25, 2025 at 03:02:16PM -0400, Gregory Price wrote:
> On Thu, Sep 25, 2025 at 06:23:08PM +0100, Jonathan Cameron wrote:
> > On Thu, 25 Sep 2025 12:06:28 -0400
> > Gregory Price <gourry@gourry.net> wrote:
> >
> > > It feels much more natural to put this as a zswap/zram backend.
> > >
> > Agreed. I currently see two paths that are generic (ish).
> >
> > 1. zswap route - faulting as you describe on writes.
>
> aaaaaaaaaaaaaaaaaaaaaaah but therein lies the rub
>
> The interposition point for zswap/zram is the PTE present bit being
> hacked off to generate access faults.
>
I went digging around a bit.

It's worse than that: the PTE is also used to store the swap entry ID,
so you can't just use a swap backend and keep the mapping. It's just
not a compatible abstraction - so as a zswap backend this is DOA.

Even if you could figure out a way to re-use the abstraction and just
take a hard fault to bring the page back in as read-only, you lose the
swap entry on fault. Reconciling the differences between this
interface and swap gets nasty at that point.
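
To illustrate the point about the PTE doing double duty, here is a minimal userspace sketch. It is not kernel code: the `toy_*` names and the bit layout (bit 0 as the present bit, the rest as payload) are assumptions made up for illustration and do not match any real architecture's encoding. It only shows that a single PTE word holds either a page frame number or a swap entry, never both.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Toy PTE: bit 0 is "present"; the upper bits hold EITHER a page frame
 * number (present) OR a swap entry (not present). Hypothetical layout. */
typedef uint64_t toy_pte_t;

#define TOY_PTE_PRESENT 0x1ULL
#define TOY_PTE_SHIFT   1

/* Build a present PTE that maps a page frame. */
static inline toy_pte_t toy_mk_present(uint64_t pfn)
{
	assert((pfn >> 63) == 0);	/* payload must fit above the flag bit */
	return (pfn << TOY_PTE_SHIFT) | TOY_PTE_PRESENT;
}

/* Build a non-present PTE; the payload bits now carry the swap entry,
 * so the previous page frame number is gone. */
static inline toy_pte_t toy_mk_swap(uint64_t swap_entry)
{
	assert((swap_entry >> 63) == 0);
	return swap_entry << TOY_PTE_SHIFT;
}

static inline bool toy_pte_present(toy_pte_t pte)
{
	return (pte & TOY_PTE_PRESENT) != 0;
}

/* Returns the pfn if present, otherwise the swap entry. */
static inline uint64_t toy_pte_payload(toy_pte_t pte)
{
	return pte >> TOY_PTE_SHIFT;
}
```

In this model, replacing a present PTE with `toy_mk_swap()` overwrites the page frame number, and faulting the page back in overwrites the swap entry in turn, which is the incompatibility described above.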
So here's a fun proposal. I'm not sure of how NUMA nodes for devices
get determined -
1. Carve out an explicit proximity domain (NUMA node) for the compressed
region via SRAT.
https://docs.kernel.org/driver-api/cxl/platform/acpi/srat.html
2. Make sure this proximity domain (NUMA node) has separate data in the
HMAT so it can be an explicit demotion target for higher tiers
https://docs.kernel.org/driver-api/cxl/platform/acpi/hmat.html
3. Create a node-to-zone-allocator registration and retrieval function
device_folio_alloc = nid_to_alloc(nid, DEVICE_FOLIO_ALLOC)
4. Create a DAX extension that registers the above allocator interface
5. In `alloc_migration_target()` (mm/migrate.c):
   Since nid is not a valid buddy-allocator target, every allocation
   attempt here will fail. So we can simply append the following to
   the bottom:
       device_folio_alloc = nid_to_alloc(nid, DEVICE_FOLIO_ALLOC);
       if (device_folio_alloc)
               folio = device_folio_alloc(...);
       return folio;
6. in `struct migration_target_control` add a new .no_writable value
- This will say the new mapping replacements should have the
writable bit chopped off.
7. On write-fault, extend mm/memory.c:do_numa_page to detect this
and simply promote the page to allow writes. Write faults will
be expensive, but you'll have pretty strong guarantees around
not unexpectedly running out of space.
You can then loosen the .no_writable restriction with settings if
you have high confidence that your system will not outrun your ability
to promote/evict/whatever if device memory becomes hot.
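
The flow in steps 3 to 7 can be sketched in userspace C as follows. This is an illustrative model under stated assumptions, not kernel code and not a proposed patch: every `toy_*` name, the fixed-size folio pool, the choice of node 0 as the only buddy-backed node, and the per-folio `writable` flag are hypothetical stand-ins. The real `alloc_migration_target()` and `struct migration_target_control` have different signatures and fields.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define TOY_MAX_NODES 8
#define TOY_POOL_SIZE 16

/* Stand-in for struct folio (hypothetical). */
struct toy_folio {
	int nid;
	bool writable;	/* do mappings of this folio keep the write bit? */
};

typedef struct toy_folio *(*toy_folio_alloc_fn)(int nid);

/* Step 3: node-to-allocator registration and retrieval. */
static toy_folio_alloc_fn toy_node_alloc[TOY_MAX_NODES];

static int toy_register_alloc(int nid, toy_folio_alloc_fn fn)
{
	if (nid < 0 || nid >= TOY_MAX_NODES || toy_node_alloc[nid])
		return -1;
	toy_node_alloc[nid] = fn;
	return 0;
}

static toy_folio_alloc_fn toy_nid_to_alloc(int nid)
{
	if (nid < 0 || nid >= TOY_MAX_NODES)
		return NULL;
	return toy_node_alloc[nid];
}

/* Backing store shared by both toy allocators. */
static struct toy_folio toy_pool[TOY_POOL_SIZE];
static size_t toy_pool_used;

static struct toy_folio *toy_pool_get(int nid)
{
	struct toy_folio *folio;

	if (toy_pool_used >= TOY_POOL_SIZE)
		return NULL;
	folio = &toy_pool[toy_pool_used++];
	folio->nid = nid;
	folio->writable = true;
	return folio;
}

/* Stand-in for the buddy allocator: only node 0 is a normal node here. */
static struct toy_folio *toy_buddy_alloc(int nid)
{
	return nid == 0 ? toy_pool_get(nid) : NULL;
}

/* Step 4: the allocator a DAX-like driver would register for its node. */
static struct toy_folio *toy_device_alloc(int nid)
{
	return toy_pool_get(nid);
}

/* Step 6: stand-in for migration_target_control with the proposed flag. */
struct toy_mtc {
	int nid;
	bool no_writable;
};

/* Step 5: try the buddy path first, then fall back to the device allocator. */
static struct toy_folio *toy_alloc_migration_target(const struct toy_mtc *mtc)
{
	struct toy_folio *folio = toy_buddy_alloc(mtc->nid);

	if (!folio) {
		toy_folio_alloc_fn fn = toy_nid_to_alloc(mtc->nid);

		if (fn)
			folio = fn(mtc->nid);
	}
	if (folio && mtc->no_writable)
		folio->writable = false;
	return folio;
}

/* Step 7: a write fault on a write-protected folio promotes it to node 0. */
static struct toy_folio *toy_write_fault(struct toy_folio *folio)
{
	struct toy_mtc promote = { .nid = 0, .no_writable = false };

	assert(folio != NULL);
	if (folio->writable)
		return folio;
	return toy_alloc_migration_target(&promote);
}
```

A caller would register `toy_device_alloc` for the device node, demote with `.no_writable = true`, and call `toy_write_fault()` when a write hits the demoted folio. The sketch does not model unmapping, data copy, or freeing the old folio.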
The only thing I don't know off hand is how shared pages will work in
this setup. For VMAs with a mapping that exist at demotion time, this
all works wonderfully - less so if the mapping doesn't exist or a new
VMA is created after a demotion has occurred.
I don't know what will happen there.
I think this would also sate the desire for a "separate CXL allocator"
for integration into other paths as well.
~Gregory