From: Alistair Popple <apopple@nvidia.com>
To: Gregory Price <gourry@gourry.net>
Cc: linux-mm@kvack.org, kernel-team@meta.com,
linux-cxl@vger.kernel.org, linux-kernel@vger.kernel.org,
nvdimm@lists.linux.dev, linux-fsdevel@vger.kernel.org,
cgroups@vger.kernel.org, dave@stgolabs.net,
jonathan.cameron@huawei.com, dave.jiang@intel.com,
alison.schofield@intel.com, vishal.l.verma@intel.com,
ira.weiny@intel.com, dan.j.williams@intel.com,
longman@redhat.com, akpm@linux-foundation.org, david@redhat.com,
lorenzo.stoakes@oracle.com, Liam.Howlett@oracle.com,
vbabka@suse.cz, rppt@kernel.org, surenb@google.com,
mhocko@suse.com, osalvador@suse.de, ziy@nvidia.com,
matthew.brost@intel.com, joshua.hahnjy@gmail.com,
rakie.kim@sk.com, byungchul@sk.com, ying.huang@linux.alibaba.com,
mingo@redhat.com, peterz@infradead.org, juri.lelli@redhat.com,
vincent.guittot@linaro.org, dietmar.eggemann@arm.com,
rostedt@goodmis.org, bsegall@google.com, mgorman@suse.de,
vschneid@redhat.com, tj@kernel.org, hannes@cmpxchg.org,
mkoutny@suse.com, kees@kernel.org, muchun.song@linux.dev,
roman.gushchin@linux.dev, shakeel.butt@linux.dev,
rientjes@google.com, jackmanb@google.com, cl@gentwo.org,
harry.yoo@oracle.com, axelrasmussen@google.com,
yuanchu@google.com, weixugc@google.com,
zhengqi.arch@bytedance.com, yosry.ahmed@linux.dev,
nphamcs@gmail.com, chengming.zhou@linux.dev,
fabio.m.de.francesco@linux.intel.com, rrichter@amd.com,
ming.li@zohomail.com, usamaarif642@gmail.com,
brauner@kernel.org, oleg@redhat.com, namcao@linutronix.de,
escape@linux.alibaba.com, dongjoo.seo1@samsung.com
Subject: Re: [RFC LPC2026 PATCH v2 00/11] Specific Purpose Memory NUMA Nodes
Date: Thu, 27 Nov 2025 16:03:24 +1100 [thread overview]
Message-ID: <ti434m4sbveft6jw4zqrzzis47ycjupgiw5csj2wxmcac74xva@xyvx3ebqoqhu> (raw)
In-Reply-To: <aSR5l_fuONlCws8i@gourry-fedora-PF4VCD3F>
On 2025-11-25 at 02:28 +1100, Gregory Price <gourry@gourry.net> wrote...
> On Mon, Nov 24, 2025 at 10:09:37AM +1100, Alistair Popple wrote:
> > On 2025-11-22 at 08:07 +1100, Gregory Price <gourry@gourry.net> wrote...
> > > On Tue, Nov 18, 2025 at 06:02:02PM +1100, Alistair Popple wrote:
> > > >
> >
> > There are multiple types here (DEVICE_PRIVATE and DEVICE_COHERENT). The former
> > is mostly irrelevant for this discussion but I'm including the descriptions here
> > for completeness.
> >
>
> I appreciate you taking the time here. I'll maybe try to look at
> updating the docs as this evolves.
I believe the DEVICE_PRIVATE bit is documented here
https://www.kernel.org/doc/Documentation/vm/hmm.rst, but if there is anything
there that you think needs improvement I'd be happy to take a look or review
updates. I'm not sure that it was updated for DEVICE_COHERENT though.
> > > But I could imagine an (overly simplistic) pattern with SPM Nodes:
> > >
> > > fd = open("/dev/gpu_mem", ...)
> > > buf = mmap(fd, ...)
> > > buf[0]
> > > 1) driver takes the fault
> > > 2) driver calls alloc_page(..., gpu_node, GFP_SPM_NODE)
> > > 3) driver manages any special page table masks
> > > Like marking pages RO/RW to manage ownership.
> >
> > Of course as an aside this needs to match the CPU PTE logic (this is what
> > hmm_range_fault() is primarily used for).
> >
>
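Just to make that pattern concrete, the driver-side fault path could look
something like the sketch below. This is purely illustrative - GFP_SPM_NODE is
the flag proposed in this series, but how it combines with other GFP flags,
the spm_dev_nid node id and the map-read-only-first policy are all assumptions
on my part:

/* Sketch only; needs linux/mm.h and linux/gfp.h. */
static int spm_dev_nid;         /* hypothetical: the device's SPM NUMA node */

static vm_fault_t spm_dev_fault(struct vm_fault *vmf)
{
        struct vm_area_struct *vma = vmf->vma;
        struct page *page;

        /*
         * Allocate only from the SPM node. GFP_SPM_NODE is the flag from
         * patch 03; assuming here it is OR-ed in like any other modifier.
         */
        page = alloc_pages_node(spm_dev_nid,
                                GFP_HIGHUSER_MOVABLE | __GFP_THISNODE |
                                GFP_SPM_NODE, 0);
        if (!page)
                return VM_FAULT_OOM;

        /*
         * Driver-managed protection: map read-only first and upgrade in a
         * .pfn_mkwrite handler. Assumes the vma was marked VM_PFNMAP in mmap.
         */
        return vmf_insert_pfn_prot(vma, vmf->address, page_to_pfn(page),
                                   vm_get_page_prot(vma->vm_flags & ~VM_WRITE));
}

static const struct vm_operations_struct spm_dev_vm_ops = {
        .fault = spm_dev_fault,
};

(Note a pfn-mapped page like this isn't on the LRU, so this sketch only covers
the allocation side, not the reclaim/compaction integration discussed below.)
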
> This is actually the most interesting part of the series for me. I'm using
> a compressed memory device as a stand-in for a memory type that requires
> special page table entries (RO) to avoid compression ratios tanking
> (resulting, eventually, in an MCE as there's no way to slow things down).
>
> You can somewhat "Get there from here" through device coherent
> ZONE_DEVICE, but you still don't have access to basic services like
> compaction and reclaim - which you absolutely do want for such a memory
> type (for the same reasons we groom zswap and zram).
>
> I wonder if we can even re-use the hmm interfaces for SPM nodes to make
> managing special page table policies easier as well. That seems
> promising.
It might depend on what exactly you're looking to do - HMM is really two parts,
one for mirroring page tables and another for allowing special non-present PTEs
to be set up that map a dummy ZONE_DEVICE struct page and notify the driver when
the CPU attempts access.
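
For the mirroring part the documented pattern is roughly the sketch below.
device_map_page() is a made-up stand-in for programming the GPU page tables,
and a real driver also needs to hold its own lock across the programming and
the retry check:

/* Sketch only; needs linux/hmm.h and linux/mmu_notifier.h. */
static int mirror_one_page(struct mm_struct *mm,
                           struct mmu_interval_notifier *mni,
                           unsigned long addr)
{
        unsigned long hmm_pfn;
        struct hmm_range range = {
                .notifier       = mni,
                .start          = addr,
                .end            = addr + PAGE_SIZE,
                .hmm_pfns       = &hmm_pfn,
                .default_flags  = HMM_PFN_REQ_FAULT | HMM_PFN_REQ_WRITE,
        };
        int ret;

again:
        range.notifier_seq = mmu_interval_read_begin(mni);
        mmap_read_lock(mm);
        ret = hmm_range_fault(&range);
        mmap_read_unlock(mm);
        if (ret == -EBUSY)
                goto again;
        if (ret)
                return ret;

        /* device_map_page() stands in for programming the GPU page tables. */
        device_map_page(addr, page_to_pfn(hmm_pfn_to_page(hmm_pfn)));

        /* If the range was invalidated while we were programming, redo it. */
        if (mmu_interval_read_retry(mni, range.notifier_seq))
                goto again;
        return 0;
}
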
> I said this during LSFMM: Without isolation, "memory policy" is really
> just a suggestion. What we're describing here is all predicated on
> isolation work, and all of a sudden much clearer examples of managing
> memory on NUMA boundaries start to make a little more sense.
I very much agree with the views on memory policy that you shared in one of the
other threads. I don't think it is adequate for providing isolation, and I agree
that the isolation (and the degree of isolation) is the interesting bit of the
work here, at least for now.
>
> > > 4) driver sends the gpu the (mapping_id, pfn, index) information
> > > so that gpu can map the region in its page tables.
> >
> > On coherent systems this often just uses HW address translation services
> > (ATS), although I think the specific implementation of how page-tables are
> > mirrored/shared is orthogonal to this.
> >
>
> Yeah, this part is completely foreign to me; I just presume there's some
> way to tell the GPU how to reconstruct the virtually contiguous setup.
> That mechanism would be entirely reusable here (I assume).
>
> > This is roughly how things work with DEVICE_PRIVATE/COHERENT memory today,
> > except in the case of DEVICE_PRIVATE in step (5) above. In that case the page is
> > mapped as a non-present special swap entry that triggers a driver callback due
> > to the lack of cache coherence.
> >
>
> Btw, just an aside, Lorenzo is moving to rename these entries to
> softleaf (software-leaf) entries. I think you'll find it welcome.
> https://lore.kernel.org/linux-mm/c879383aac77d96a03e4d38f7daba893cd35fc76.1762812360.git.lorenzo.stoakes@oracle.com/
>
> > > Driver doesn't have to do much in the way of allocation management.
> > >
> > > This is probably less compelling since you don't want general purpose
> > > services like reclaim, migration, compaction, tiering, etc.
> >
> > On at least some of our systems I'm told we do want this, hence my interest
> > here. Currently we have systems not using DEVICE_COHERENT and instead just
> > onlining everything as normal system managed memory in order to get reclaim
> > and tiering. Of course then people complain that it's managed as normal system
> > memory and non-GPU related things (ie. page-cache) end up in what's viewed as
> > special purpose memory.
> >
>
> Ok, so now this gets interesting then. I don't understand how this
> makes sense (not saying it doesn't, I simply don't understand).
>
> I would presume that under no circumstance do you want device memory to
> just suddenly disappear without some coordination from the driver.
>
> Whether it's compaction or reclaim, you have some thread that's going to
> migrate a virtual mapping from HPA(A) to HPA(B) and HPA(B) may or may not
> even map to the same memory device.
>
> That thread may not even run in the context of a task which accesses
> GPU memory (although I think we could enforce that on top of SPM
> nodes, but the devil is in the details).
>
> Maybe that "all magically works" because of the ATS described above?
Pretty much - both ATS and hmm_range_fault() are, conceptually at least, just
methods of sharing/mirroring the CPU page table to a device. So in your example
above, if a thread were to migrate a mapping from one page to another this
"black magic" would keep everything in sync. E.g. for hmm_range_fault() the
driver gets an mmu_notifier callback saying the virtual mapping no longer points
to HPA(A). If it needs to find the new mapping to HPA(B) it can look it up using
hmm_range_fault() and program its page tables with the new mapping.
At a sufficiently high level ATS is just a HW-implemented equivalent of this.
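
The notifier side of that looks roughly like the sketch below - gpu_mirror and
gpu_unmap_range() are made-up names, and a real callback also has to honour
mmu_notifier_range_blockable() before taking any sleeping locks:

/* Sketch only; needs linux/mmu_notifier.h. */
struct gpu_mirror {
        struct mmu_interval_notifier notifier;
        /* ... device page table state ... */
};

static bool gpu_mirror_invalidate(struct mmu_interval_notifier *mni,
                                  const struct mmu_notifier_range *range,
                                  unsigned long cur_seq)
{
        struct gpu_mirror *mirror = container_of(mni, struct gpu_mirror,
                                                 notifier);

        /* Make any concurrent hmm_range_fault() loop retry. */
        mmu_interval_set_seq(mni, cur_seq);

        /* HPA(A) is going away: drop the device mappings for [start, end). */
        gpu_unmap_range(mirror, range->start, range->end);
        return true;
}

static const struct mmu_interval_notifier_ops gpu_mirror_ops = {
        .invalidate = gpu_mirror_invalidate,
};

Once the invalidate callback returns the migration can proceed, and the new
mapping to HPA(B) gets picked up by re-running the hmm_range_fault() loop.
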
> I suppose this assumes you have some kind of unified memory view between
> host and device memory? Are there docs here you can point me at that
> might explain this wizardry? (Sincerely, this is fascinating)
Right - it's all predicated on the host and device sharing the same view of the
virtual address space. I'm not aware of any good docs on this, but I will be at
LPC so would be happy to have a discussion there.
> > > The value is clearly that you get to manage GPU memory like any other
> > > memory, but without worry that other parts of the system will touch it.
> > >
> > > I'm much more focused on the "I have memory that is otherwise general
> > > purpose, and wants services like reclaim and compaction, but I want
> > > strong controls over how things can land there in the first place".
> >
> > So maybe there is some overlap here - what I have is memory that we want
> > managed much like normal memory but with strong controls over what it can be
> > used for (ie. just for tasks utilising the processing element on the accelerator).
> >
>
> I think it might be great if we could discuss this in a bit more depth,
> as I've already been considering very mild refactors to reclaim to
> enable a driver to engage it with an SPM node as the only shrink target.
Absolutely! Looking forward to an in-person discussion.
- Alistair
> This all becomes much more complicated due to per-memcg LRUs and such.
>
> All that said, I'm focused on the isolation / allocation pieces first.
> If that can't be agreed upon, the rest isn't worth exploring.
>
> I do have a mild extension to mempolicy that allows mbind() to hit an
> SPM node as an example as well. I'll discuss this in the response to
> David's thread, as he had some related questions about the GFP flag.
>
> ~Gregory
>
Thread overview: 29+ messages
2025-11-12 19:29 Gregory Price
2025-11-12 19:29 ` [RFC PATCH v2 01/11] mm: constify oom_control, scan_control, and alloc_context nodemask Gregory Price
2025-11-12 19:29 ` [RFC PATCH v2 02/11] mm: change callers of __cpuset_zone_allowed to cpuset_zone_allowed Gregory Price
2025-11-12 19:29 ` [RFC PATCH v2 03/11] gfp: Add GFP_SPM_NODE for Specific Purpose Memory (SPM) allocations Gregory Price
2025-11-12 19:29 ` [RFC PATCH v2 04/11] memory-tiers: Introduce SysRAM and Specific Purpose Memory Nodes Gregory Price
2025-11-12 19:29 ` [RFC PATCH v2 05/11] mm: restrict slub, oom, compaction, and page_alloc to sysram by default Gregory Price
2025-11-12 19:29 ` [RFC PATCH v2 06/11] mm,cpusets: rename task->mems_allowed to task->sysram_nodes Gregory Price
2025-11-12 19:29 ` [RFC PATCH v2 07/11] cpuset: introduce cpuset.mems.sysram Gregory Price
2025-11-12 19:29 ` [RFC PATCH v2 08/11] mm/memory_hotplug: add MHP_SPM_NODE flag Gregory Price
2025-11-13 14:58 ` [PATCH] memory-tiers: multi-definition fixup Gregory Price
2025-11-13 16:37 ` kernel test robot
2025-11-12 19:29 ` [RFC PATCH v2 09/11] drivers/dax: add spm_node bit to dev_dax Gregory Price
2025-11-12 19:29 ` [RFC PATCH v2 10/11] drivers/cxl: add spm_node bit to cxl region Gregory Price
2025-11-12 19:29 ` [RFC PATCH v2 11/11] [HACK] mm/zswap: compressed ram integration example Gregory Price
2025-11-18 7:02 ` [RFC LPC2026 PATCH v2 00/11] Specific Purpose Memory NUMA Nodes Alistair Popple
2025-11-18 10:36 ` Gregory Price
2025-11-21 21:07 ` Gregory Price
2025-11-23 23:09 ` Alistair Popple
2025-11-24 15:28 ` Gregory Price
2025-11-27 5:03 ` Alistair Popple [this message]
2025-11-24 9:19 ` David Hildenbrand (Red Hat)
2025-11-24 18:06 ` Gregory Price
2025-11-25 14:09 ` Kiryl Shutsemau
2025-11-25 15:05 ` Gregory Price
2025-11-27 5:12 ` Alistair Popple
2025-11-26 3:23 ` Balbir Singh
2025-11-26 8:29 ` Gregory Price
2025-12-03 4:36 ` Balbir Singh
2025-12-03 5:25 ` Gregory Price