From: Gregory Price <gourry@gourry.net>
To: "David Hildenbrand (Arm)" <david@kernel.org>
Cc: lsf-pc@lists.linux-foundation.org, linux-kernel@vger.kernel.org,
	linux-cxl@vger.kernel.org, cgroups@vger.kernel.org,
	linux-mm@kvack.org, linux-trace-kernel@vger.kernel.org,
	damon@lists.linux.dev, kernel-team@meta.com,
	gregkh@linuxfoundation.org, rafael@kernel.org, dakr@kernel.org,
	dave@stgolabs.net, jonathan.cameron@huawei.com,
	dave.jiang@intel.com, alison.schofield@intel.com,
	vishal.l.verma@intel.com, ira.weiny@intel.com,
	dan.j.williams@intel.com, longman@redhat.com,
	akpm@linux-foundation.org, lorenzo.stoakes@oracle.com,
	Liam.Howlett@oracle.com, vbabka@suse.cz, rppt@kernel.org,
	surenb@google.com, mhocko@suse.com, osalvador@suse.de,
	ziy@nvidia.com, matthew.brost@intel.com, joshua.hahnjy@gmail.com,
	rakie.kim@sk.com, byungchul@sk.com, ying.huang@linux.alibaba.com,
	apopple@nvidia.com, axelrasmussen@google.com, yuanchu@google.com,
	weixugc@google.com, yury.norov@gmail.com,
	linux@rasmusvillemoes.dk, mhiramat@kernel.org,
	mathieu.desnoyers@efficios.com, tj@kernel.org,
	hannes@cmpxchg.org, mkoutny@suse.com, jackmanb@google.com,
	sj@kernel.org, baolin.wang@linux.alibaba.com, npache@redhat.com,
	ryan.roberts@arm.com, dev.jain@arm.com, baohua@kernel.org,
	lance.yang@linux.dev, muchun.song@linux.dev, xu.xin16@zte.com.cn,
	chengming.zhou@linux.dev, jannh@google.com, linmiaohe@huawei.com,
	nao.horiguchi@gmail.com, pfalcato@suse.de, rientjes@google.com,
	shakeel.butt@linux.dev, riel@surriel.com, harry.yoo@oracle.com,
	cl@gentwo.org, roman.gushchin@linux.dev, chrisl@kernel.org,
	kasong@tencent.com, shikemeng@huaweicloud.com, nphamcs@gmail.com,
	bhe@redhat.com, zhengqi.arch@bytedance.com, terry.bowman@amd.com
Subject: Re: [LSF/MM/BPF TOPIC][RFC PATCH v4 00/27] Private Memory Nodes (w/ Compressed RAM)
Date: Mon, 13 Apr 2026 13:05:19 -0400	[thread overview]
Message-ID: <ad0iT4UWka3gMUpu@gourry-fedora-PF4VCD3F> (raw)
In-Reply-To: <2608a03b-72bb-4033-8e6f-a439502b5573@kernel.org>

On Mon, Apr 13, 2026 at 03:11:12PM +0200, David Hildenbrand (Arm) wrote:
> > Normally cloud-hypervisor VMs with virtio-net can't be subject to KSM
> > because the entire boot region gets marked shared.  
> 
> What exactly do you mean with "mark shared". Do you mean, that "shared
> memory" is used in the hypervisor for all boot memory?
> 

Sorry, meant MAP_SHARED.  But yes, in some setups the hypervisor simply
makes a memfd with the entire main memory region MAP_SHARED.

This is because the virtio-net device / network stack does GFP_KERNEL
allocations and then pins them on the host to allow zero-copy - so all
of ZONE_NORMAL is a valid target.

(At least that's my best understanding of the entire setup).

> 
> You mean, in the VM, memory usable by virtio-net can only be consumed
> from a dedicated physical memory region, and that region would be a
> separate node?
>

Correct - it does require teaching the network stack NUMA awareness.

I was surprised by how little code this required, though I can't be
100% sure of its correctness since networking isn't my normal space.

Alternatively you could imagine this as a real device bringing its own
dedicated memory for network buffers, and then telling the network
stack "Hey, prefer this node over normal kernel allocations".

What I'd been hacking on was cobbled together with memfd + SRAT bits to
bring up a private node statically and then have the device claim it -
but this is just a proof of concept.  A proper implementation would be
extending virtio-net to report a dedicated EFI_RESERVED region.

> > 
> > I see you saw below that one of the extensions is removing the nodes
> > from the fallback list.  That is part one, but it's insufficient to
> > prevent complete leakage (someone might iterate over the nodes-possible
> > list and try migrating memory).
> 
> Which code would do that?
> 

There are many callers of for_each_node() throughout the system.

But one concrete example:

int alloc_shrinker_info(struct mem_cgroup *memcg)
{
... snip ...
  for_each_node(nid) {
    struct shrinker_info *info = kvzalloc_node(sizeof(*info) + array_size,
                                               GFP_KERNEL, nid);
... snip ...
}

If you disallow fallbacks in this scenario, this allocation always fails.

This partially answers your question about slub fallback allocations:
there are slab allocations like this that depend on fallbacks (more
on this explicitly below).

> > Basically the only isolation mechanism we have today is ZONE_DEVICE.
> > 
> > Either via mbind and friends, or even just the driver itself managing it
> > directly via alloc_pages_node() and exposing some userland interface.
> 
> Would mbind() work here? I thought mbind() would not suddenly give
> access to some ZONE_DEVICE memory.
>

Sorry, these were orthogonal thoughts.

1) We don't have such a mechanism. ZONE_DEVICE's preferred mechanism is
   setting up explicit migrations via migrate_device.c

2) mbind / alloc_pages_node would only work for private nodes.

   Extending ZONE_DEVICE to enable mbind() would be an extreme lift,
   as the kernel makes a lot of assumptions about folio->lru.

   This is why I went the node route in the first place.

> > 
> > in the NP_OPS_MIGRATION patch, this gets covered.
> 
> Right, but I am not sure if NP_OPS_MIGRATION is really the right
> approach for that. Have to think about that.
>

So, OPS is a bit misleading, but it's the closest I came to an
existing pattern.  OPS does not necessarily need to imply callbacks.

I've been trying to minimize the patch set and I'm starting to think
the MVP may actually be able to do away with the private_ops structure
for a basic migration+mempolicy example by simply teaching some services
(migrate.c, mempolicy.c) how/when to inject __GFP_PRIVATE.

The mempolicy.c patch already does this, but migrate.c does not - I
haven't figured out the right pattern for that yet.

> > 1) as you note, removing it from the default bitmaps, which is actually
> >    hard.  You can't remove it from the possible-node bitmap, so that
> >    just seemed non-tractable.
> 
> What about making people use a different set of bitmaps here? Quite some
> work, but maybe that's the right direction given that we'll now treat
> some nodes differently.
>

It's an option, although it is fragile.  That means having to police
all future users of possible-nodes, for_each_node(), and so on.

I've been erring on the side of "not fragile", but I'm open to rework.

> > 
> > 2) __GFP_THISNODE actually means (among other things) "don't fallback".
> >    And, in fact, there are some hotplug-time allocations that occur in
> >    SLAB (pglist_data) that target the private node that *must* fallback
> >    to successfully allocate for successful kernel operation.
> 
> 
> Can you point me at the code?
>

There is actually a comment in slub.c that addresses this directly:

static int slab_mem_going_online_callback(int nid)
{
... snip ...
	/*
	 * XXX: kmem_cache_alloc_node will fallback to other nodes
	 *      since memory is not yet available from the node that
	 *      is brought up.
	 */
	n = kmem_cache_alloc(kmem_cache_node, GFP_KERNEL);
... snip ...
}

Slab basically acknowledges the behavior is required on existing nodes
and just falls back immediately for the "going online" path.

Other specific calls in the hotplug path:

  mm/sparse.c:           kzalloc_node(size, GFP_KERNEL, nid)
  mm/sparse-vmemmap.c:   alloc_pages_node(nid, GFP_KERNEL|...)
  mm/slub.c:             kmalloc_node(sizeof(*barn), GFP_KERNEL, nid)

There are quite a number of callers to kmem_cache_alloc_node() that
would have to be individually audited.

And some non-slab interfaces examples as well:
	alloc_shrinker_info
	alloc_node_nr_active

I've been looking at this for a while, but I'm starting to think trying
to touch all this surface area is simply too fragile compared to just
letting normal memory be a fallback for private nodes and adding:

      __GFP_PRIVATE   - unlocks private nodes, but allows fallback
#define GFP_PRIVATE   (__GFP_PRIVATE | __GFP_THISNODE) - only this node

__GFP_PRIVATE vs GFP_PRIVATE is then just a matter of use case.

For mbind() it probably makes sense we'd use GFP_PRIVATE - either it
succeeds or it OOMs.

> > The flexibility is kind of the point :]
> 
> Yeah, but it would be interesting which minimal support we would need to
> just let some special memory be managed by the kernel, allowing mbind()
> users to use it, but not have any other fallback allocations end up on it.
> 
> Something very basic, on which we could build additional functionality.
> 

I actually have a simplistic CXL driver that does exactly this:
https://github.com/gourryinverse/linux/blob/072ecf7cbebd9871e76c0b52fd99aa1321405a59/drivers/cxl/type3_drivers/cxl_mempolicy/mempolicy.c#L65

We have to support migration because mbind can migrate on bind if the
VMA already has memory - but all this means is the migrate interfaces
are live - not that the kernel actually uses them.

So mbind requires (NP_OPS_MIGRATION | NP_OPS_MEMPOLICY).

All these flags say is:
   - move_pages() syscalls can accept these nodes
   - migrate_pages() function calls can accept these nodes
   - mempolicy.c nodemasks allow the nodes (should restrict to mbind)
   - vma's with these nodes now inject __GFP_PRIVATE on fault

All other services (reclaim, compaction, khugepaged, etc) do not scan
these nodes and do not know about __GFP_PRIVATE, so they never see
private node folios and can't allocate from the node.

In this example, all migrate_to() really does is inject __GFP_THISNODE,
but I've been thinking about whether we can just do this in migrate.c
and leave implementing the .ops to a user that requires it.

But otherwise "it just works".

One note here though - OOM conditions and allocation failures are not
intuitive, especially when THP/non-order-0 allocations are involved.

But that might just mean this minimal setup should only allow order-0
allocations - which is fiiiiiiiiiiiiiine :P.

-----------------

For basic examples:

I've implemented 4 examples to consider building on:

  1) CXL mempolicy driver:
     https://github.com/gourryinverse/linux/blob/072ecf7cbebd9871e76c0b52fd99aa1321405a59/drivers/cxl/type3_drivers/cxl_mempolicy/mempolicy.c#L65

     As described above

  2) Virtio-net / CXL.mem Network Card
     (Not published yet)

     This doesn't require any ops at all - the plumbing happens entirely
     inside the kernel.  I onlined the node with an SRAT hack and no ops
     structure at all associated with the device (just set node affinity
     to the pcie_dev and plumbed it through the network stack).

     A proper implementation would have virtio-net register its own
     reserved memory region and online it during probe.
  
  3) Accelerator
     (Not published yet)

     I have converted an open-source but out-of-tree GPU driver that
     uses NUMA nodes over to private nodes.  This required:
            NP_OPS_MIGRATION
            NP_OPS_MEMPOLICY

     The pattern is very similar to the CXL mempolicy driver, except
     that the driver had alloc_pages_node() calls that needed to have
     __GFP_PRIVATE added to ensure allocations landed on the device.


  4) CXL Compressed RAM driver:
     https://github.com/gourryinverse/linux/blob/55c06eb6bced58132d9001e318f2958e8ac80614/mm/cram.c#L340
     needs pretty much everything - it's "normal memory" with access
     rules, so the driver isn't really in the management lifecycle.

     In this example - the only way to allocate memory on the node is
     via demotion.  This allows us to close off the device to new
     allocations if the hardware reports low memory but the OS perceives
     the device to still have free memory.

     Which is a cool example:  The driver just sets up the node with
     certain attributes and then lets the kernel deal with it.


I have started compacting the _OPS_* flags related to reclaim into a
single NP_OPS_RECLAIM flag while testing with this.  Really I've come
around to thinking many mm/ services need to be taken as a package,
not fully piecemeal.

The tl;dr: Once you cede some control over to the kernel, you're
very close to ceding ALL control, but you still get some control
over how/when allocations on the node can be made.


It is important to note that even if we don't expose callbacks, we do
still need a modicum of node filtering in some places that still use
for_each_node() (vmscan.c, compaction.c, oom_kill.c, etc).

These are basically all the places ZONE_DEVICE *implicitly* opts itself
out of by having managed_pages=0.  We have to make those situations
explicit - but that doesn't mean we need callbacks.

> > 
> > I would simply state: "That depends on the memory device"
> 
> Let's keep it very simple: just some memory that you mbind(), and you
> only want the mbind() user to make use of that memory.
> 
> What would be the minimal set of hooks to guarantee that.
> 

If you want the mbind contract to stay intact:

   NP_OPS_MIGRATION (mbind can generate migrations)
   NP_OPS_MEMPOLICY (this just tells mempolicy.c to allow the node)

The set of callbacks required should be exactly 0 (assuming we teach
migrate.c to inject __GFP_PRIVATE like we have in mempolicy.c).

If your device requires some special notification on allocation, free,
or migration to/from the node, you need:

   ops.free_folio(folio)
   ops.migrate_to(folios, nid, mode, reason, nr_success)
   ops.migrate_folio(src_folio, dst_folio)

The free path is the tricky one to get right.  You can imagine:

   buf = malloc(...);
   mbind(buf, private_node);
   memset(buf, 0x42, ...);
   ioctl(driver, CHECK_OUT_THIS_DATA, buf); 
   exit(0);

The task dies and frees the pages back to the buddy - the question is
whether the 4-5 free_folio paths (put_folio, put_unref_folios, etc) can
all eat an ops.free_folio() callback to inform the driver the memory has
been freed.

In practice - this worked on my accelerator and compressed examples, but
I can't say it's 100% safe in all contexts.  The free path needs more
scrutiny.

> For example, I assume compaction could just be supported for such
> memory? Similarly, longterm-pinning.
> 
> For some of the other hooks it's rather unclear how they would affect
> the very simple mbind() rule. What is the effect of demotion or NUMA
> balancing?
> 
> I'm afraid we're making things too complicated here or it might be the
> wrong abstraction, if i cannot even figure out how to make the simplest
> use case work.
> 
> Maybe I'm wrong :)
>

Actually, quite the opposite:  None of that should be engaged by
default.  In our above example:

   NP_OPS_MIGRATION | NP_OPS_MEMPOLICY

All this should say is that migration and mempolicy are supported - not
that anything in the kernel that uses migration will suddenly operate on
that memory.

So:  Compaction, Longterm Pin, NUMA balancing, Demotion - etc - all of
these do not ever operate on this memory by default.  Your device driver
or service would have to specifically opt-in to those services and must
be capable of dealing with the implications of that.

---

kind of neat aside:

You can hotplug private ZONE_NORMAL without NP_OPS_LONGTERM_PIN and as
long as the driver/service controls the type/lifetime of allocations,
the node can remain hot-unpluggable in the future.

e.g. if the service only ever makes movable allocations, the lack
of NP_OPS_LONGTERM_PIN prevents those pages from being pinned.  If you
add NP_OPS_MIGRATION - the attempt to pin will cause migration :]

~Gregory


