linux-mm.kvack.org archive mirror
From: Alistair Popple <apopple@nvidia.com>
To: Gregory Price <gourry@gourry.net>
Cc: lsf-pc@lists.linux-foundation.org, linux-kernel@vger.kernel.org,
	 linux-cxl@vger.kernel.org, cgroups@vger.kernel.org,
	linux-mm@kvack.org,  linux-trace-kernel@vger.kernel.org,
	damon@lists.linux.dev, kernel-team@meta.com,
	 gregkh@linuxfoundation.org, rafael@kernel.org, dakr@kernel.org,
	dave@stgolabs.net,  jonathan.cameron@huawei.com,
	dave.jiang@intel.com, alison.schofield@intel.com,
	 vishal.l.verma@intel.com, ira.weiny@intel.com,
	dan.j.williams@intel.com,  longman@redhat.com,
	akpm@linux-foundation.org, david@kernel.org,
	 lorenzo.stoakes@oracle.com, Liam.Howlett@oracle.com,
	vbabka@suse.cz, rppt@kernel.org,  surenb@google.com,
	mhocko@suse.com, osalvador@suse.de, ziy@nvidia.com,
	 matthew.brost@intel.com, joshua.hahnjy@gmail.com,
	rakie.kim@sk.com, byungchul@sk.com,
	 ying.huang@linux.alibaba.com, axelrasmussen@google.com,
	yuanchu@google.com, weixugc@google.com,  yury.norov@gmail.com,
	linux@rasmusvillemoes.dk, mhiramat@kernel.org,
	 mathieu.desnoyers@efficios.com, tj@kernel.org,
	hannes@cmpxchg.org, mkoutny@suse.com,  jackmanb@google.com,
	sj@kernel.org, baolin.wang@linux.alibaba.com, npache@redhat.com,
	 ryan.roberts@arm.com, dev.jain@arm.com, baohua@kernel.org,
	lance.yang@linux.dev,  muchun.song@linux.dev,
	xu.xin16@zte.com.cn, chengming.zhou@linux.dev, jannh@google.com,
	 linmiaohe@huawei.com, nao.horiguchi@gmail.com, pfalcato@suse.de,
	rientjes@google.com,  shakeel.butt@linux.dev, riel@surriel.com,
	harry.yoo@oracle.com, cl@gentwo.org,  roman.gushchin@linux.dev,
	chrisl@kernel.org, kasong@tencent.com, shikemeng@huaweicloud.com,
	 nphamcs@gmail.com, bhe@redhat.com, zhengqi.arch@bytedance.com,
	terry.bowman@amd.com
Subject: Re: [LSF/MM/BPF TOPIC][RFC PATCH v4 00/27] Private Memory Nodes (w/ Compressed RAM)
Date: Thu, 26 Feb 2026 14:27:24 +1100
Message-ID: <a6izpi2wlqro72erhbvxhlx2lwdnae7my3ghfs6t33ivtixo4h@bi2u4x6qv7ul>
In-Reply-To: <aZ3BEn_73Rk8Fn7L@gourry-fedora-PF4VCD3F>

On 2026-02-25 at 02:17 +1100, Gregory Price <gourry@gourry.net> wrote...
> On Tue, Feb 24, 2026 at 05:19:11PM +1100, Alistair Popple wrote:
> > On 2026-02-22 at 19:48 +1100, Gregory Price <gourry@gourry.net> wrote...
> > 
> > Based on our discussion at LPC I believe one of the primary motivators here was
> > to re-use the existing mm buddy allocator rather than writing your own. I remain
> > to be convinced that alone is justification enough for doing all this - DRM for
> > example already has quite a nice standalone buddy allocator (drm_buddy.c) that
> > could presumably be used, or adapted for use, by any device driver.
> >
> > The interesting part of this series (which I have skimmed but not read in
> > detail) is how device memory gets exposed to userspace - this is something that
> > existing ZONE_DEVICE implementations don't address, instead leaving it up to
> > drivers and associated userspace stacks to deal with allocation, migration, etc.
> > 
> 
> I agree that buddy-access alone is insufficient justification - it started
> off that way - but if you want mempolicy/NUMA UAPI access, it turns into
> "re-use all of MM" - and that means using the buddy.
> 
> I also expected ZONE_DEVICE vs NODE_DATA to be the primary discussion.
> 
> I raise replacing it as a thought experiment, not as the proposal.
> 
> The idea that drm/ is going to switch to private nodes is outside the
> realm of reality, but part of that is because of years of infrastructure
> built on the assumption that re-using mm/ is infeasible.
> 
> But let's talk about DEVICE_COHERENT
> 
> ---
> 
> DEVICE_COHERENT is the odd-man out among ZONE_DEVICE modes. The others
> use softleaf entries and don't allow direct mappings.

I think you have this around the wrong way - DEVICE_PRIVATE is the odd one out as
it is the one ZONE_DEVICE page type that uses softleaf entries and doesn't
allow direct mappings. Every other type of ZONE_DEVICE page allows for direct
mappings.

> (DEVICE_PRIVATE sort of does if you squint, but you can also view that
>  a bit like PROT_NONE or read-only controls to force migrations).
> 
> If you take DEVICE_COHERENT and:
> 
> - Move pgmap out of the struct page (page_ext, NODE_DATA, etc) to free
>   the LRU list_head
> - Put pages in the buddy (free lists, watermarks, managed_pages) or add
>   pgmap->device_alloc() at every allocation callsite / buddy hook
> - Add LRU support (aging, reclaim, compaction)
> - Add isolated gating (new GFP flag and adjusted zonelist filtering)
> - Add new dev_pagemap_ops callbacks for the various mm/ features
> - Audit every folio_is_zone_device() to distinguish zone device modes
> 
> ... you've built N_MEMORY_PRIVATE inside ZONE_DEVICE. Except now
> page_zone(page) returns ZONE_DEVICE - so you inherit the wrong
> defaults at every existing ZONE_DEVICE check. 
> 
> Skip-sites become things to opt-out of instead of opting into.
> 
> You just end up with
> 
> if (folio_is_zone_device(folio))
>     if (folio_is_my_special_zone_device())
>     else ....
> 
> and this just generalizes to
> 
> if (folio_is_private_managed(folio))
>     folio_managed_my_hooked_operation()

I don't quite get this - couldn't you just as easily do:

if (folio_is_zone_device(folio))
     folio_device_my_hooked_operation()

Where folio_device_my_hooked_operation() is just:

if (pgmap->ops->my_hooked_operation)
	pgmap->ops->my_hooked_operation();
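
To make that dispatch pattern concrete, here's a minimal userspace sketch.
All names here (demo_pagemap, demo_my_hooked_operation, the int return) are
made up for illustration and don't match the real dev_pagemap_ops layout -
the point is just that one optional callback keeps generic code oblivious to
the device type:

```c
#include <stddef.h>

/* Hypothetical ops table: every hook is optional, as in dev_pagemap_ops. */
struct demo_pagemap_ops {
	int (*my_hooked_operation)(void *folio);
};

struct demo_pagemap {
	const struct demo_pagemap_ops *ops;
};

/*
 * Dispatch helper: falls back to a harmless default when the driver
 * installs no hook, so callers never test the device type directly.
 */
static int demo_my_hooked_operation(const struct demo_pagemap *pgmap,
				    void *folio)
{
	if (pgmap && pgmap->ops && pgmap->ops->my_hooked_operation)
		return pgmap->ops->my_hooked_operation(folio);
	return 0;	/* default: nothing to do */
}

/* A hook one driver might install. */
static int demo_hook(void *folio)
{
	(void)folio;
	return 42;
}

static const struct demo_pagemap_ops demo_ops = {
	.my_hooked_operation = demo_hook,
};
```

Drivers that don't care simply leave the pointer NULL and get the default.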

> So you get the same code, but have added more complexity to ZONE_DEVICE.

Don't you still have to add code to hook every operation you care about for your
private managed nodes?

> I don't think that's needed if we just recognize ZONE is the wrong
> abstraction to be operating on.
> 
> Honestly, even ZONE_MOVABLE becomes pointless with N_MEMORY_PRIVATE
> if you disallow longterm pinning - because the managing service handles
> allocations (it has to inject GFP_PRIVATE to get access) or selectively
> enables the mm/ services it knows are safe (mempolicy).
> 
> Even if you allow longterm pinning, if your service controls what does
> the pinning it can still be reclaimable - just manually (killing
> processes) instead of letting hotplug do it via migration.
> 
> If your service only allocates movable pages - your ZONE_NORMAL is
> effectively ZONE_MOVABLE.  

This is interesting - it sounds like the conclusion of this is that ZONE_* is
just a bad abstraction and should be replaced with something else, maybe
something like this?

And FWIW I'm not tied to ZONE_DEVICE as being a good abstraction, it's just
what we seem to have today for determining page types. It almost sounds like what
we want is just a bunch of hooks that can be associated with a range of pages,
and then you just get rid of ZONE_DEVICE and instead install hooks appropriate
for each page a driver manages. I have to think more about that though, this
is just what popped into my head when you started saying ZONE_MOVABLE could also
disappear :-)
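
Thinking out loud about what "hooks per range of pages" could even look like:
a sorted table of [start_pfn, end_pfn) entries carrying an ops pointer,
resolved by binary search. This is a pure userspace sketch with hypothetical
names (pfn_range, range_ops) - nothing like an existing kernel interface:

```c
#include <stddef.h>

/* Hypothetical per-range callbacks a driver would install. */
struct range_ops {
	int id;		/* stand-in for real hook pointers */
};

struct pfn_range {
	unsigned long start_pfn;	/* inclusive */
	unsigned long end_pfn;		/* exclusive */
	const struct range_ops *ops;
};

/*
 * Resolve a PFN to its installed hooks, if any.  The table is assumed
 * sorted by start_pfn with no overlaps; ordinary memory returns NULL.
 */
static const struct range_ops *
pfn_range_lookup(const struct pfn_range *ranges, size_t nr, unsigned long pfn)
{
	size_t lo = 0, hi = nr;

	while (lo < hi) {
		size_t mid = lo + (hi - lo) / 2;

		if (pfn < ranges[mid].start_pfn)
			hi = mid;
		else if (pfn >= ranges[mid].end_pfn)
			lo = mid + 1;
		else
			return ranges[mid].ops;	/* inside this range */
	}
	return NULL;	/* no special hooks: plain old memory */
}
```

No zones, no page flags - just "does this PFN have hooks or not". A real
version would obviously need locking and an O(1)-ish lookup, but the shape is
the point.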

> In some cases we use ZONE_MOVABLE to prevent the kernel from allocating
> memory onto devices (like CXL).  This means struct page is forced to
> take up DRAM or use memmap_on_memory - meaning you lose high-value
> capacity or sacrifice contiguity (less huge page support).

One of the other reasons is to prevent long term pinning. But I think that's a
conversation that warrants a whole separate thread.

> This entire problem can evaporate if you can just use ZONE_NORMAL.
> 
> There are a lot of benefits to just re-using the buddy like this.
> 
> Zones are the wrong abstraction and cause more problems.
> 
> > >   free_folio           - mirrors ZONE_DEVICE's
> > >   folio_split          - mirrors ZONE_DEVICE's
> > >   migrate_to           - ... same as ZONE_DEVICE
> > >   handle_fault         - mirrors the ZONE_DEVICE ...
> > >   memory_failure       - parallels memory_failure_dev_pagemap(),
> > 
> > One does not have to squint too hard to see that the above is not so different
> > from what ZONE_DEVICE provides today via dev_pagemap_ops(). So I think
> > it would be worth outlining why the existing ZONE_DEVICE mechanism can't be
> > extended to provide these kinds of services.
> > 
> > This seems to add a bunch of code just to use NODE_DATA instead of page->pgmap,
> > without really explaining why just extending dev_pagemap_ops wouldn't work. The
> > obvious reason is that if you want to support things like reclaim, compaction,
> > etc. these pages need to be on the LRU, which is a little bit hard when that
> > field is also used by the pgmap pointer for ZONE_DEVICE pages.
> > 
> 
> You don't have to squint because it was deliberate :]

Nice.

> The callback similarity is the feature - they're the same logical
> operations.  The difference is the direction of the defaults.
> 
> Extending ZONE_DEVICE into these areas requires the same set of hooks,
> plus distinguishing "old ZONE_DEVICE" from "new ZONE_DEVICE".
> 
> Where there are new injection sites, it's because ZONE_DEVICE opts
> out of ever touching that code in some other silently implied way.

Yeah, I hate that aspect of ZONE_DEVICE. There are far too many places where we
"prove" you can't have a ZONE_DEVICE page because of ad-hoc "reasons". Usually
they take the form of "it's not on the LRU", or "it's not an anonymous page and
this isn't DAX", etc.

> For example, reclaim/compaction doesn't run because ZONE_DEVICE doesn't
> add to managed_pages (among other reasons).

And people can't even agree on the reasons. I would argue the primary reason
reclaim/compaction doesn't run is that it can't even find the pages, since they
are not on the LRU. But everyone is equally correct.

> You'd have to go figure out how to hack those things into ZONE_DEVICE 
> *and then* opt every *other* ZONE_DEVICE mode *back out*.
> 
> So you still end up with something like this anyway:
> 
> static inline bool folio_managed_handle_fault(struct folio *folio,
>                                               struct vm_fault *vmf,
>                                               enum pgtable_level level,
>                                               vm_fault_t *ret)
> {
>         /* Zone device pages use swap entries; handled in do_swap_page */
>         if (folio_is_zone_device(folio))
>                 return false;
> 
>         if (folio_is_private_node(folio))
> 		...
>         return false;
> }
> 
> 
> > example page_ext could be used.  Or I hear struct page may go away in place of
> > folios any day now, so maybe that gives us space for both :-)
> > 
> 
> If NUMA is the interface we want, then NODE_DATA is the right direction
> regardless of struct page's future or what zone it lives in.
> 
> There's no reason to keep per-page pgmap w/ device-to-node mappings.

In reality I suspect that's already the case today. I'm not sure we need
per-page pgmap.

> You can have one driver manage multiple devices with the same numa node
> if it uses the same owner context (PFN already differentiates devices).
> 
> The existing code allows for this.
> 
> > The above also looks pretty similar to the existing ZONE_DEVICE methods for
> > doing this which is another reason to argue for just building up the feature set
> > of the existing boondoggle rather than adding another thingymebob.
> >
> > It seems the key thing we are looking for is:
> > 
> > 1) A userspace API to allocate/manage device memory (ie. move_pages(), mbind(),
> > etc.)
> > 
> > 2) Allowing reclaim/LRU list processing of device memory.
> > 
> > From my perspective both of these are interesting and I look forward to the
> > discussion (hopefully I can make it to LSFMM). Mostly I'm interested in the
> > implementation as this does on the surface seem to sprinkle around and duplicate
> > a lot of hooks similar to what ZONE_DEVICE already provides.
> > 
> 
> On (1): ZONE_DEVICE NUMA UAPI is harder than it looks from the surface

Ok, I will admit I've only been hovering on the surface so need to give this
some more thought. Everything you've written below makes sense and is definitely
food for thought. Thanks.

 - Alistair

> Much of the kernel mm/ infrastructure is written on top of the buddy and
> expects N_MEMORY to be the sole arbiter of "Where to Acquire Pages".
> 
> Mempolicy depends on:
>    - Buddy support or a new alloc hook around the buddy
> 
>    - Migration support (mbind() after allocation migrates)
>      - Migration also deeply assumes buddy and LRU support
> 
>    - Changing validations on node states
>      - mempolicy checks N_MEMORY membership, so you have to hack
>        N_MEMORY onto ZONE_DEVICE
>        (or teach it about a new node state... N_MEMORY_PRIVATE)
> 
> 
> Getting mempolicy to work with N_MEMORY_PRIVATE amounts to adding 2
> lines of code in vma_alloc_folio_noprof:
> 
> struct folio *vma_alloc_folio_noprof(gfp_t gfp, int order,
>                                      struct vm_area_struct *vma,
> 				     unsigned long addr)
> {
>         if (pol->flags & MPOL_F_PRIVATE)
>                 gfp |= __GFP_PRIVATE;
> 
>         folio = folio_alloc_mpol_noprof(gfp, order, pol, ilx, numa_node_id());
> 	/* Woo! I faulted a DEVICE PAGE! */
> }
> 
> But this requires the pages to be managed by the buddy.
> 
> The rest of the mempolicy support is around keeping sane nodemasks when
> things like cpuset.mems rebinds occur and validating you don't end up
> with private nodes that don't support mempolicy in your nodemask.
> 
> You have to do all of this anyway, but with the added bonus of fighting
> with the overloaded nature of ZONE_DEVICE at every step.
> 
> ==========
> 
> On (2): Assume you solve LRU. 
> 
> ZONE_DEVICE has no free lists, managed_pages, or watermarks.
> 
> kswapd can't run, compaction has no targets, vmscan's pressure model
> doesn't function.  These all come for free when the pages are
> buddy-managed on a real zone.  Why re-invent the wheel?
> 
> ==========
> 
> So you really have two options here:
> 
> a) Put pages in the buddy, or
> 
> b) Add pgmap->device_alloc() callbacks at every allocation site that
>    could target a node:
>      - vma_alloc_folio
>      - alloc_migration_target
>      - alloc_demote_folio
>      - alloc_pages_node
>      - alloc_contig_pages
>      - list goes on
> 
> Or more likely - hooking get_page_from_freelist.  Which at that
> point... just use the buddy?  You're already deep in the hot path.
> 
> > 
> > For basic allocation I agree this is the case. But there's no reason some device
> > allocator library couldn't be written. Or in fact as pointed out above reuse the
> > already existing one in drm_buddy.c. So I would be interested to hear arguments
> > for why allocation has to be done by the mm allocator and/or why an allocation
> > library wouldn't work here, given DRM already has them.
> > 
> 
> Using the buddy underpins the rest of mm/ services we want to re-use.
> 
> That's basically it.  Otherwise you have to inject hooks into every
> surface that touches the buddy...
> 
> ... or in the buddy (get_page_from_freelist), at which point why not
> just use the buddy?
> 
> ~Gregory

