From: "David Hildenbrand (Arm)" <david@kernel.org>
To: Frank van der Linden <fvdl@google.com>,
Gregory Price <gourry@gourry.net>
Cc: lsf-pc@lists.linux-foundation.org, linux-kernel@vger.kernel.org,
linux-cxl@vger.kernel.org, cgroups@vger.kernel.org,
linux-mm@kvack.org, linux-trace-kernel@vger.kernel.org,
damon@lists.linux.dev, kernel-team@meta.com,
gregkh@linuxfoundation.org, rafael@kernel.org, dakr@kernel.org,
dave@stgolabs.net, jonathan.cameron@huawei.com,
dave.jiang@intel.com, alison.schofield@intel.com,
vishal.l.verma@intel.com, ira.weiny@intel.com,
dan.j.williams@intel.com, longman@redhat.com,
akpm@linux-foundation.org, lorenzo.stoakes@oracle.com,
Liam.Howlett@oracle.com, vbabka@suse.cz, rppt@kernel.org,
surenb@google.com, mhocko@suse.com, osalvador@suse.de,
ziy@nvidia.com, matthew.brost@intel.com, joshua.hahnjy@gmail.com,
rakie.kim@sk.com, byungchul@sk.com, ying.huang@linux.alibaba.com,
apopple@nvidia.com, axelrasmussen@google.com, yuanchu@google.com,
weixugc@google.com, yury.norov@gmail.com,
linux@rasmusvillemoes.dk, mhiramat@kernel.org,
mathieu.desnoyers@efficios.com, tj@kernel.org,
hannes@cmpxchg.org, mkoutny@suse.com, jackmanb@google.com,
sj@kernel.org, baolin.wang@linux.alibaba.com, npache@redhat.com,
ryan.roberts@arm.com, dev.jain@arm.com, baohua@kernel.org,
lance.yang@linux.dev, muchun.song@linux.dev, xu.xin16@zte.com.cn,
chengming.zhou@linux.dev, jannh@google.com, linmiaohe@huawei.com,
nao.horiguchi@gmail.com, pfalcato@suse.de, rientjes@google.com,
shakeel.butt@linux.dev, riel@surriel.com, harry.yoo@oracle.com,
cl@gentwo.org, roman.gushchin@linux.dev, chrisl@kernel.org,
kasong@tencent.com, shikemeng@huaweicloud.com, nphamcs@gmail.com,
bhe@redhat.com, zhengqi.arch@bytedance.com, terry.bowman@amd.com
Subject: Re: [LSF/MM/BPF TOPIC][RFC PATCH v4 00/27] Private Memory Nodes (w/ Compressed RAM)
Date: Fri, 17 Apr 2026 11:39:33 +0200 [thread overview]
Message-ID: <3d077c4c-6abe-47f6-a4df-3e853fbcc551@kernel.org> (raw)
In-Reply-To: <CAPTztWajm_JLpp9BjRcX=h72r25ELrXeGkOXVachybBxLJGS=g@mail.gmail.com>
On 4/15/26 21:47, Frank van der Linden wrote:
> On Wed, Apr 15, 2026 at 8:18 AM Gregory Price <gourry@gourry.net> wrote:
>>
>> On Wed, Apr 15, 2026 at 11:49:59AM +0200, David Hildenbrand (Arm) wrote:
>>
>> As a preface - the current RFC was informed by ZONE_DEVICE patterns.
>>
>> I think that was useful as a way to find existing friction points - but
>> ultimately wrong for this new interface.
>>
>> I don't think an ops struct here is the right design, and I think there
>> are only a few patterns that actually make sense for device memory using
>> nodes this way.
>>
>> So there's going to be a *major* contraction in the complexity of this
>> patch series (hopefully I'll have something next week), and much of what
>> you point out below is already in-flight.
>>
>>>
>> ... snip ...
>>>
>>> A related series proposed some MEM_READ/WRITE backend requests [1]
>>>
>>> [1] https://lists.nongnu.org/archive/html/qemu-devel/2024-09/msg02693.html
>>>
>>
>> Oh interesting, thank you for the reference here.
>>
>>>
>>> Something else people were discussing in the past was to physically
>>> limit the area where virtio queues could be placed.
>>>
>>
>> That is functionally what I did - the idea was pretty simple, just have
>> a separate memfd/node dedicated for the queues:
>>
>> guest_memory = memfd(MAP_PRIVATE)
>> net_memory = memfd(MAP_SHARED)
>>
>> And boom, you get what you want.
>>
>> So yeah, "it works" - but there are likely other ways to do this too, and
>> as you note re: compatibility, I'm not sure virtio actually wants this,
>> but it's a nice proof-of-concept for a network device on the host that
>> carries its own memory.
>>
>> I'll try to post my hack as an example with the next RFC version, as I
>> think it's informative.
>>
>>>
>>> But that's a different "fallback" problem, no?
>>>
>>> You want allocations that target the "special node" to fallback to
>>> *other* nodes, but not other allocations to fallback to *this special* node.
>>>
>> ... snip - slight reordering to put thoughts together ...
>>>
>>> Needs a second thought regarding fallback logic I raised above.
>>>
>>> What I think would have to be audited is the usage of __GFP_THISNODE by
>>> kernel allocations, where we would not actually want to allocate from
>>> this private node.
>>>
>>
>> This is fair, and a re-visit is absolutely warranted.
>>
>> Re-examining the quick audit from my last response suggests I should
>> never have seen leakage in those cases, but the fallbacks are needed.
>>
>> So yes, this all requires a second look (and a third, and a ninth).
>>
>> I'm not married to __GFP_PRIVATE, but it has been reliable for me.
>>
>>> Maybe we could just outright refuse *any* non-user (movable) allocations
>>> that target the node, even with __GFP_THISNODE.
>>>
>>> Because, why would we want kernel allocations to even end up on a
>>> private node that is supposed to only be consumed by user space? Or
>>> which use cases are there where we would want to place kernel
>>> allocations on there?
>>>
>>
>> As a start, maybe? But as a permanent invariant? I would wonder whether
>> the decision here would lock us into a design.
>>
>> But then - this is all kernel internal, so I think it would be feasible
>> to change this out from under users without backward compatibility pain.
>>
>> So far I have done my best to avoid changing any userland interfaces in
>> a way that would fundamentally change the contracts. If anything
>> private-node-related other than the node's `has_memory_private` attribute
>> leaks into userland, someone messed up.
>>
>> So... I think that's reasonable.
>>
>>>
>>> I assume you will be at LSF/MM? Would be good to discuss some of that in
>>> person.
>>>
>>
>> Yes, looking forward to it :]
>>
>>
>>>
>>>
>>> Again, I am not sure about compaction and khugepaged. All we want to
>>> guarantee is that our memory does not leave the private node.
>>>
>>> That doesn't require any __GFP_PRIVATE magic, just enlightening these
>>> subsystems that private nodes must use __GFP_THISNODE and must not leak
>>> to other nodes.
>>
>> This is where specific use-cases matter.
>>
>> In the compressed memory example - the device doesn't care about memory
>> leaving - but it cares about memory arriving *and being modified*.
>> (more on this in your next question)
>>
>> So I'm not convinced *all possible devices* would always want to support
>> move_pages(), mbind(), and set_mempolicy().
>>
>> But, I do want to give this serious thought, and I agree the absolute
>> minimal patch set could just be the fallback control mechanism and
>> mm/ component filters/audit on __GFP_*.
>>
>>
>>>
>>> I'm missing why these are even opt-in. What's the problem with allowing
>>> mbind and mempolicy to use these nodes in some of your drivers?
>>>
>>
>> First:
>>
>> In my latest working branch these two flags have been folded into just
>> _OPS_MEMPOLICY and any other migration interaction is just handled by
>> filtering with the GFP flag.
>>
>>
>> on always allowing mbind and mempolicy vs opt-in
>> ---
>>
>> A proper compressed memory solution should not allow mbind/mempolicy.
>>
>> Compressed memory is different from normal memory - the kernel can
>> perceive free memory (many unused struct pages in the buddy) when the
>> device knows there's none left (the physical capacity is actually full).
>>
>> Any form of write to a compressed memory device is essentially a
>> dangerous condition (OOMs = poison, not oom_kill()).
>>
>> So you need two controls: allocation and (userland) write protection.
>> I implemented these via:
>> - Demotion-only (allocations only happen in reclaim path)
>> - Write-protecting the entire node
>>
>> (I fully accept that a write-protection extension here might be a bridge
>> too far, but please stick with me for the sake of exploration).
>>
>>
>> There's a serious argument to limit these devices to using an mbind
>> pattern, but I wanted to make a full-on attempt to integrate this device
>> into the demotion path as a transparent tier (kinda like zswap).
>>
>> I could not square write-protection with mempolicy, so I had to make
>> them both optional and mutually exclusive.
>>
>> If you limit the device to mbind interactions, you do limit what can
>> crash - but this forces userland software to be less portable by design:
>>
>> - am I running on a system where this device is present?
>> - is that device exposing its memory on a node?
>> - which node?
>> - what memory can I put on that node? (can you prevent a process from
>> putting libc on that node?)
>> - how much compression headroom is left on the device?
>> - can I safely write to this virtual address?
>> - should I write-protect compressed VMAs? can I handle those faults?
>> - many more
>>
>> That sounds a lot like re-implementing a bunch of mm/ in userland, and
>> that's exactly where we were at with DAX. We know this pattern failed.
>>
>> I very much want to avoid repeating these mistakes, so I'm trying to
>> find a path forward here that results in transparent usage of this
>> memory.
>>
>>
>>> I also have some questions about longterm pinnings, but that's better
>>> discussed in person :)
>>>
>>
>> The longterm pin extension came from auditing existing ZONE_DEVICE
>> filters.
>>
>> tl;dr: it was an informative mechanism, but it probably should be
>> dropped - it makes no sense (it's device memory; pinnings mean nothing?).
>>
>>
>>>
>>> Right, that's rather invasive.
>>>
>>
>> Yeah, I'm trying to avoid it, and the answer may actually just exist in
>> the task-death and VMA cleanup path rather than the folio-free path.
>>
>> From what I've seen of accelerator drivers that implement this, when you
>> inform the driver of a memory region associated with a task, the driver
>> should have a mechanism to take references on that VMA (or something like
>> it) - so that when the task dies, the driver is notified as the VMA is
>> cleaned up.
>>
>> This probably exists - I just haven't gotten there yet.
>>
>> ~Gregory
>
> This has been a really great discussion. I just wanted to add a few
> points that I think I have mentioned in other forums, but not here.
>
> In essence, this is a discussion about memory properties and the level
> at which they should be dealt with. Right now there are basically 3
> levels: pageblocks, zones and nodes. While these levels exist for good
> reasons, they also sometimes lead to issues. There's duplication of
> functionality. MIGRATE_CMA and ZONE_MOVABLE both implement the same
> basic property, but at different levels (attempts have been made to
> merge them, but it didn't work out). There's also memory with clashing
> properties inhabiting the same data structure: LRUs. Having strictly
> movable memory on the same LRU as unmovable memory is a mismatch. It
> leads to the well-known problem that reclaim done on behalf of an
> unmovable allocation attempt can be entirely pointless in the face of
> large amounts of ZONE_MOVABLE or MIGRATE_CMA memory: the anon LRU will
> be chock full of movable-only pages. Reclaiming them is useless for
> your allocation, and skipping them leads to locking up the system
> because you're holding the LRU lock for a long time.
>
> So, looking at having some properties set at the node level makes
> sense to me even in the non-device case. But perhaps that is out of
> scope for the initial discussion.
>
> One use case that seems like a good match for private nodes is guest
> memory. Guest memory is special enough to want to allocate / maintain
> it separately, which is acknowledged by the introduction of
> guest_memfd.
Yes. There is now an interface to configure mbind() for guest_memfd. So
with that and some tweaks, maybe that ... would just work, if we get the
mbind() interaction right?
--
Cheers,
David
Thread overview: 56+ messages
2026-02-22 8:48 Gregory Price
2026-02-22 8:48 ` [RFC PATCH v4 01/27] numa: introduce N_MEMORY_PRIVATE node state Gregory Price
2026-02-22 8:48 ` [RFC PATCH v4 02/27] mm,cpuset: gate allocations from N_MEMORY_PRIVATE behind __GFP_PRIVATE Gregory Price
2026-02-22 8:48 ` [RFC PATCH v4 03/27] mm/page_alloc: add numa_zone_allowed() and wire it up Gregory Price
2026-02-22 8:48 ` [RFC PATCH v4 04/27] mm/page_alloc: Add private node handling to build_zonelists Gregory Price
2026-02-22 8:48 ` [RFC PATCH v4 05/27] mm: introduce folio_is_private_managed() unified predicate Gregory Price
2026-02-22 8:48 ` [RFC PATCH v4 06/27] mm/mlock: skip mlock for managed-memory folios Gregory Price
2026-02-22 8:48 ` [RFC PATCH v4 07/27] mm/madvise: skip madvise " Gregory Price
2026-02-22 8:48 ` [RFC PATCH v4 08/27] mm/ksm: skip KSM " Gregory Price
2026-02-22 8:48 ` [RFC PATCH v4 09/27] mm/khugepaged: skip private node folios when trying to collapse Gregory Price
2026-02-22 8:48 ` [RFC PATCH v4 10/27] mm/swap: add free_folio callback for folio release cleanup Gregory Price
2026-02-22 8:48 ` [RFC PATCH v4 11/27] mm/huge_memory.c: add private node folio split notification callback Gregory Price
2026-02-22 8:48 ` [RFC PATCH v4 12/27] mm/migrate: NP_OPS_MIGRATION - support private node user migration Gregory Price
2026-02-22 8:48 ` [RFC PATCH v4 13/27] mm/mempolicy: NP_OPS_MEMPOLICY - support private node mempolicy Gregory Price
2026-02-22 8:48 ` [RFC PATCH v4 14/27] mm/memory-tiers: NP_OPS_DEMOTION - support private node demotion Gregory Price
2026-02-22 8:48 ` [RFC PATCH v4 15/27] mm/mprotect: NP_OPS_PROTECT_WRITE - gate PTE/PMD write-upgrades Gregory Price
2026-02-22 8:48 ` [RFC PATCH v4 16/27] mm: NP_OPS_RECLAIM - private node reclaim participation Gregory Price
2026-02-22 8:48 ` [RFC PATCH v4 17/27] mm/oom: NP_OPS_OOM_ELIGIBLE - private node OOM participation Gregory Price
2026-02-22 8:48 ` [RFC PATCH v4 18/27] mm/memory: NP_OPS_NUMA_BALANCING - private node NUMA balancing Gregory Price
2026-02-22 8:48 ` [RFC PATCH v4 19/27] mm/compaction: NP_OPS_COMPACTION - private node compaction support Gregory Price
2026-02-22 8:48 ` [RFC PATCH v4 20/27] mm/gup: NP_OPS_LONGTERM_PIN - private node longterm pin support Gregory Price
2026-02-22 8:48 ` [RFC PATCH v4 21/27] mm/memory-failure: add memory_failure callback to node_private_ops Gregory Price
2026-02-22 8:48 ` [RFC PATCH v4 22/27] mm/memory_hotplug: add add_private_memory_driver_managed() Gregory Price
2026-02-22 8:48 ` [RFC PATCH v4 23/27] mm/cram: add compressed ram memory management subsystem Gregory Price
2026-02-22 8:48 ` [RFC PATCH v4 24/27] cxl/core: Add cxl_sysram region type Gregory Price
2026-02-22 8:48 ` [RFC PATCH v4 25/27] cxl/core: Add private node support to cxl_sysram Gregory Price
2026-02-22 8:48 ` [RFC PATCH v4 26/27] cxl: add cxl_mempolicy sample PCI driver Gregory Price
2026-02-22 8:48 ` [RFC PATCH v4 27/27] cxl: add cxl_compression " Gregory Price
2026-02-23 13:07 ` [LSF/MM/BPF TOPIC][RFC PATCH v4 00/27] Private Memory Nodes (w/ Compressed RAM) David Hildenbrand (Arm)
2026-02-23 14:54 ` Gregory Price
2026-02-23 16:08 ` Gregory Price
2026-03-17 13:05 ` David Hildenbrand (Arm)
2026-03-19 14:29 ` Gregory Price
2026-02-24 6:19 ` Alistair Popple
2026-02-24 15:17 ` Gregory Price
2026-02-24 16:54 ` Gregory Price
2026-02-25 22:21 ` Matthew Brost
2026-02-25 23:58 ` Gregory Price
2026-02-26 3:27 ` Alistair Popple
2026-02-26 5:54 ` Gregory Price
2026-02-26 22:49 ` Gregory Price
2026-03-03 20:36 ` Gregory Price
2026-02-25 12:40 ` Alejandro Lucero Palau
2026-02-25 14:43 ` Gregory Price
2026-03-17 13:25 ` David Hildenbrand (Arm)
2026-03-19 15:09 ` Gregory Price
2026-04-13 13:11 ` David Hildenbrand (Arm)
2026-04-13 17:05 ` Gregory Price
2026-04-15 9:49 ` David Hildenbrand (Arm)
2026-04-15 15:17 ` Gregory Price
2026-04-15 19:47 ` Frank van der Linden
2026-04-16 1:24 ` Gregory Price
2026-04-17 9:50 ` David Hildenbrand (Arm)
2026-04-16 20:23 ` Gregory Price
2026-04-17 9:39 ` David Hildenbrand (Arm) [this message]
2026-04-17 9:37 ` David Hildenbrand (Arm)