From: Yiannis Nikolakopoulos <yiannis.nikolakop@gmail.com>
To: Gregory Price <gourry@gourry.net>
Cc: Jonathan Cameron <jonathan.cameron@huawei.com>,
	Wei Xu <weixugc@google.com>,
	 David Rientjes <rientjes@google.com>,
	Matthew Wilcox <willy@infradead.org>,
	 Bharata B Rao <bharata@amd.com>,
	linux-kernel@vger.kernel.org, linux-mm@kvack.org,
	 dave.hansen@intel.com, hannes@cmpxchg.org,
	mgorman@techsingularity.net,  mingo@redhat.com,
	peterz@infradead.org, raghavendra.kt@amd.com,  riel@surriel.com,
	sj@kernel.org, ying.huang@linux.alibaba.com, ziy@nvidia.com,
	 dave@stgolabs.net, nifan.cxl@gmail.com, xuezhengchu@huawei.com,
	 akpm@linux-foundation.org, david@redhat.com, byungchul@sk.com,
	 kinseyho@google.com, joshua.hahnjy@gmail.com,
	yuanchu@google.com,  balbirs@nvidia.com,
	alok.rathore@samsung.com, yiannis@zptcorp.com,
	 Adam Manzanares <a.manzanares@samsung.com>
Subject: Re: [RFC PATCH v2 0/8] mm: Hot page tracking and promotion infrastructure
Date: Fri, 17 Oct 2025 11:53:31 +0200
Message-ID: <CAOi6=wTsY=EWt=yQ_7QJONsJpTM_3HKp0c42FKaJ8iJ2q8-n+w@mail.gmail.com>
In-Reply-To: <aNzWwz5OYLOjwjLv@gourry-fedora-PF4VCD3F>

On Wed, Oct 1, 2025 at 9:22 AM Gregory Price <gourry@gourry.net> wrote:
>
> On Thu, Sep 25, 2025 at 03:02:16PM -0400, Gregory Price wrote:
> > On Thu, Sep 25, 2025 at 06:23:08PM +0100, Jonathan Cameron wrote:
> > > On Thu, 25 Sep 2025 12:06:28 -0400
> > > Gregory Price <gourry@gourry.net> wrote:
> > >
> > > > It feels much more natural to put this as a zswap/zram backend.
> > > >
> > > Agreed.  I currently see two paths that are generic (ish).
> > >
> > > 1. zswap route - faulting as you describe on writes.
> >
> > aaaaaaaaaaaaaaaaaaaaaaah but therein lies the rub
> >
> > The interposition point for zswap/zram is the PTE present bit being
> > hacked off to generate access faults.
> >
>
> I went digging around a bit.
>
> Not only this, but the PTE is used to store the swap entry ID, so you
> can't just use a swap backend and keep the mapping. It's just not a
> compatible abstraction - so as a zswap-backend this is DOA.
>
> Even if you could figure out a way to re-use the abstraction and just
> take a hard-fault to fault it back in as read-only, you lose the swap
> entry on fault.  That just gets nasty trying to reconcile the
> differences between this interface and swap at that point.
>
> So here's a fun proposal.  I'm not sure of how NUMA nodes for devices
> get determined -
>
> 1. Carve out an explicit proximity domain (NUMA node) for the compressed
>    region via SRAT.
>    https://docs.kernel.org/driver-api/cxl/platform/acpi/srat.html
>
> 2. Make sure this proximity domain (NUMA node) has separate data in the
>    HMAT so it can be an explicit demotion target for higher tiers
>    https://docs.kernel.org/driver-api/cxl/platform/acpi/hmat.html
This makes sense. I've used a dirty hardcoding trick in my prototype
so that my node is always the last target. I'll have a look at how to
do this properly.
>
> 3. Create a node-to-zone-allocator registration and retrieval function
>    device_folio_alloc = nid_to_alloc(nid)
>
> 4. Create a DAX extension that registers the above allocator interface
>
> 5. in `alloc_migration_target()` mm/migrate.c
>    Since nid is not a valid buddy-allocator target, everything here
>    will fail.  So we can simply append the following to the bottom
>
>    device_folio_alloc = nid_to_alloc(nid, DEVICE_FOLIO_ALLOC);
>    if (device_folio_alloc)
>        folio = device_folio_alloc(...)
>    return folio;
In my current prototype, alloc_migration_target() was working (naively).
Steps 3, 4 and 5 seem like an interesting thing to try after all this
discussion.
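
A minimal userspace sketch of how steps 3 to 5 could fit together,
for illustration only. It is not kernel code: struct folio_model,
register_node_alloc(), buddy_alloc_model(), MAX_NODES and
FIRST_DEVICE_NODE are hypothetical names, and nid_to_alloc() is
modeled with a single argument. The point is only the fallback order:
try the normal allocator first, then the allocator registered for
the node.

```c
#include <stddef.h>
#include <stdlib.h>

#define MAX_NODES          8   /* hypothetical; stands in for MAX_NUMNODES */
#define FIRST_DEVICE_NODE  2   /* hypothetical: nodes >= 2 have no buddy memory */

/* Stand-in for struct folio; it only records the backing node. */
struct folio_model {
    int nid;
};

/* Hypothetical callback type for a device-provided folio allocator. */
typedef struct folio_model *(*device_folio_alloc_t)(int nid);

static device_folio_alloc_t node_alloc[MAX_NODES];

/* Steps 3/4: a driver (e.g. a DAX extension) registers its allocator. */
static int register_node_alloc(int nid, device_folio_alloc_t fn)
{
    if (nid < 0 || nid >= MAX_NODES || !fn || node_alloc[nid])
        return -1;
    node_alloc[nid] = fn;
    return 0;
}

/* Step 3: retrieval; NULL means no allocator is registered for nid. */
static device_folio_alloc_t nid_to_alloc(int nid)
{
    if (nid < 0 || nid >= MAX_NODES)
        return NULL;
    return node_alloc[nid];
}

static struct folio_model *new_folio(int nid)
{
    struct folio_model *folio = malloc(sizeof(*folio));

    if (folio)
        folio->nid = nid;
    return folio;
}

/* Models the buddy allocator: it fails for device-only nodes. */
static struct folio_model *buddy_alloc_model(int nid)
{
    if (nid < 0 || nid >= FIRST_DEVICE_NODE)
        return NULL;
    return new_folio(nid);
}

/* Example allocator a device driver might register. */
static struct folio_model *example_device_alloc(int nid)
{
    return new_folio(nid);
}

/* Step 5: fall back to the registered allocator when buddy fails. */
static struct folio_model *alloc_migration_target_model(int nid)
{
    struct folio_model *folio = buddy_alloc_model(nid);
    device_folio_alloc_t device_folio_alloc;

    if (folio)
        return folio;

    device_folio_alloc = nid_to_alloc(nid);
    if (device_folio_alloc)
        folio = device_folio_alloc(nid);
    return folio;
}
```

In this model, a migration to node 3 fails until
register_node_alloc(3, example_device_alloc) has been called, while
nodes 0 and 1 keep using the buddy path unchanged.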
>
> 6. in `struct migration_target_control` add a new .no_writable value
>    - This will say the new mapping replacements should have the
>      writable bit chopped off.
>
> 7. On write-fault, extend mm/memory.c:do_numa_page to detect this
>    and simply promote the page to allow writes.  Write faults will
>    be expensive, but you'll have pretty strong guarantees around
>    not unexpectedly running out of space.
>
>    You can then loosen the .no_writable restriction with settings if
>    you have high confidence that your system will outrun your ability
>    to promote/evict/whatever if device memory becomes hot.
That looks modular enough to let me test both the writable and the
no_writable variants and compare them.
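
A small userspace model of steps 6 and 7 as I understand them, again
only a sketch. struct mtc_model, struct pte_model, demote_model() and
write_fault_model() are hypothetical names; only the .no_writable
field comes from the proposal. The model also assumes every
non-writable mapping was made so by the demotion, which a real
implementation could not assume.

```c
#include <stdbool.h>

/* Stand-in for struct migration_target_control with the new field. */
struct mtc_model {
    int nid;            /* demotion target node */
    bool no_writable;   /* step 6: strip the writable bit on remap */
};

/* Stand-in for one mapping of a page: where it lives, and if writable. */
struct pte_model {
    int nid;
    bool writable;
};

/* Demotion: move the page and optionally remap it read-only. */
static void demote_model(struct pte_model *pte, const struct mtc_model *mtc)
{
    pte->nid = mtc->nid;
    if (mtc->no_writable)
        pte->writable = false;
}

/*
 * Step 7: a write to a read-only demoted page faults and the page is
 * promoted to promote_nid before the write proceeds.
 * Returns true if a promotion happened.
 */
static bool write_fault_model(struct pte_model *pte, int promote_nid)
{
    if (pte->writable)
        return false;   /* no fault, write goes to the current node */

    pte->nid = promote_nid;
    pte->writable = true;
    return true;
}
```

With no_writable set, the first write after a demotion promotes the
page; with it cleared, the write lands on the device node directly,
which is the comparison I want to run.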
>
> The only thing I don't know off hand is how shared pages will work in
> this setup.  For VMAs with a mapping that exist at demotion time, this
> all works wonderfully - less so if the mapping doesn't exist or a new
> VMA is created after a demotion has occurred.
I'll keep that in mind.
>
> I don't know what will happen there.
>
> I think this would also sate the desire for a "separate CXL allocator"
> for integration into other paths as well.
>
> ~Gregory
Thanks a lot for all the discussion and the input. I can move my
prototype in this direction and will get back with what I've
learned and an RFC if it makes sense. Please keep me in the loop in
any related discussions.

Best,
/Yiannis

