From: Alistair Popple <apopple@nvidia.com>
To: Gregory Price <gourry@gourry.net>
Cc: linux-mm@kvack.org, kernel-team@meta.com,
	linux-cxl@vger.kernel.org,  linux-kernel@vger.kernel.org,
	nvdimm@lists.linux.dev, linux-fsdevel@vger.kernel.org,
	 cgroups@vger.kernel.org, dave@stgolabs.net,
	jonathan.cameron@huawei.com,  dave.jiang@intel.com,
	alison.schofield@intel.com, vishal.l.verma@intel.com,
	 ira.weiny@intel.com, dan.j.williams@intel.com,
	longman@redhat.com,  akpm@linux-foundation.org, david@redhat.com,
	lorenzo.stoakes@oracle.com,  Liam.Howlett@oracle.com,
	vbabka@suse.cz, rppt@kernel.org, surenb@google.com,
	 mhocko@suse.com, osalvador@suse.de, ziy@nvidia.com,
	matthew.brost@intel.com,  joshua.hahnjy@gmail.com,
	rakie.kim@sk.com, byungchul@sk.com, ying.huang@linux.alibaba.com,
	 mingo@redhat.com, peterz@infradead.org, juri.lelli@redhat.com,
	 vincent.guittot@linaro.org, dietmar.eggemann@arm.com,
	rostedt@goodmis.org,  bsegall@google.com, mgorman@suse.de,
	vschneid@redhat.com, tj@kernel.org,  hannes@cmpxchg.org,
	mkoutny@suse.com, kees@kernel.org, muchun.song@linux.dev,
	 roman.gushchin@linux.dev, shakeel.butt@linux.dev,
	rientjes@google.com, jackmanb@google.com,  cl@gentwo.org,
	harry.yoo@oracle.com, axelrasmussen@google.com,
	 yuanchu@google.com, weixugc@google.com,
	zhengqi.arch@bytedance.com,  yosry.ahmed@linux.dev,
	nphamcs@gmail.com, chengming.zhou@linux.dev,
	 fabio.m.de.francesco@linux.intel.com, rrichter@amd.com,
	ming.li@zohomail.com, usamaarif642@gmail.com,
	 brauner@kernel.org, oleg@redhat.com, namcao@linutronix.de,
	escape@linux.alibaba.com,  dongjoo.seo1@samsung.com
Subject: Re: [RFC LPC2026 PATCH v2 00/11] Specific Purpose Memory NUMA Nodes
Date: Mon, 24 Nov 2025 10:09:37 +1100	[thread overview]
Message-ID: <c5enwlaui37lm4uxlsjbuhesy6hfwwqbxzzs77zn7kmsceojv3@f6tquznpmizu> (raw)
In-Reply-To: <aSDUl7kU73LJR78g@gourry-fedora-PF4VCD3F>

On 2025-11-22 at 08:07 +1100, Gregory Price <gourry@gourry.net> wrote...
> On Tue, Nov 18, 2025 at 06:02:02PM +1100, Alistair Popple wrote:
> > 
> > I'm interested in the contrast with zone_device, and in particular why
> > device_coherent memory doesn't end up being a good fit for this.
> > 
> > > - Why mempolicy.c and cpusets as-is are insufficient
> > > - SPM types seeking this form of interface (Accelerator, Compression)
> > 
> > I'm sure you can guess my interest is in GPUs, which also have memory that some
> > people consider should only be used for specific purposes :-) Currently our
> > coherent GPUs online this as a normal NUMA node, for which we have also
> > generally found mempolicy, cpusets, etc. inadequate, so it will be interesting
> > to hear what shortcomings you have been running into (I'm less familiar with
> > the Compression cases you talk about here though).
> > 
> 
> after some thought, talks, and doc readings, it seems like the
> zone_device setups don't allow the CPU to map the devmem into page
> tables, and instead depend on migrate_device logic (unless the docs are
> out of sync with the code these days).  That's at least what's described
> in hmm and migrate_device.

There are multiple types here (DEVICE_PRIVATE and DEVICE_COHERENT). The former
is mostly irrelevant for this discussion, but I'm including the descriptions
here for completeness. You are correct in saying that the only way either of
these currently gets mapped into the page tables is via explicit migration of
memory to ZONE_DEVICE by a driver. There is also a corner case for first-touch
handling which allows drivers to establish mappings to zero pages on a device
if the page hasn't been populated previously on the CPU.
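
For reference, registering that memory is done via memremap_pages(). A minimal
sketch, assuming the driver already owns the physical range (the my_* names are
placeholders, not an existing API):

  #include <linux/memremap.h>

  static void my_page_free(struct page *page)
  {
          /* Hand the page back to the driver's own allocator. */
  }

  static const struct dev_pagemap_ops my_pgmap_ops = {
          .page_free = my_page_free,
          /* .migrate_to_ram is only required for DEVICE_PRIVATE */
  };

  static int my_register_devmem(struct dev_pagemap *pgmap,
                                struct resource *res, int nid)
  {
          void *addr;

          pgmap->type = MEMORY_DEVICE_COHERENT; /* or MEMORY_DEVICE_PRIVATE */
          pgmap->range.start = res->start;
          pgmap->range.end = res->end;
          pgmap->nr_range = 1;
          pgmap->ops = &my_pgmap_ops;

          /*
           * Creates struct pages for the device memory. They live in
           * ZONE_DEVICE, not on any buddy free list, so the driver has
           * to hand them out itself.
           */
          addr = memremap_pages(pgmap, nid);
          return IS_ERR(addr) ? PTR_ERR(addr) : 0;
  }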

These pages can, in some sense at least, be mapped on the CPU. DEVICE_COHERENT
pages are mapped normally (ie. the CPU can access them directly), whereas
DEVICE_PRIVATE pages are mapped using special swap entries so drivers can
emulate coherence by migrating pages back. The latter is used by devices
without coherent interconnects (ie. PCIe), whereas the former could be used by
eg. CXL.
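
The "migrating pages back" part is what the dev_pagemap_ops->migrate_to_ram()
callback does when the CPU faults on one of those swap entries. Very roughly,
for a single order-0 page, it looks like the sketch below (following the
pattern in Documentation/mm/hmm.rst and lib/test_hmm.c; the device-side copy
and most error handling are elided, and my_owner is a placeholder matching
pgmap->owner):

  static vm_fault_t my_migrate_to_ram(struct vm_fault *vmf)
  {
          unsigned long src = 0, dst = 0;
          struct migrate_vma args = {
                  .vma         = vmf->vma,
                  .start       = vmf->address,
                  .end         = vmf->address + PAGE_SIZE,
                  .src         = &src,
                  .dst         = &dst,
                  .fault_page  = vmf->page,
                  .pgmap_owner = my_owner,
                  .flags       = MIGRATE_VMA_SELECT_DEVICE_PRIVATE,
          };
          struct page *dpage;

          if (migrate_vma_setup(&args))
                  return VM_FAULT_SIGBUS;

          if (src & MIGRATE_PFN_MIGRATE) {
                  dpage = alloc_page(GFP_HIGHUSER);
                  if (dpage) {
                          /* ... DMA the device page's contents into dpage ... */
                          lock_page(dpage);
                          dst = migrate_pfn(page_to_pfn(dpage));
                  }
          }

          migrate_vma_pages(&args);
          migrate_vma_finalize(&args);
          return 0;
  }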

> Assuming this is out of date and ZONE_DEVICE memory is mappable into
> page tables, assuming you want sparse allocation, ZONE_DEVICE seems to
> suggest you at least have to re-implement the buddy logic (which isn't
> that tall of an ask).

That's basically what happens - GPU drivers need memory allocation and
therefore re-implement some form of memory allocator. Agreed that just being
able to reuse the buddy logic probably isn't that compelling on its own, and
isn't really of interest (hence some of my original questions on what this is
about).

> But I could imagine an (overly simplistic) pattern with SPM Nodes:
> 
> fd = open("/dev/gpu_mem", ...)
> buf = mmap(fd, ...)
> buf[0] 
>    1) driver takes the fault
>    2) driver calls alloc_page(..., gpu_node, GFP_SPM_NODE)
>    3) driver manages any special page table masks
>       Like marking pages RO/RW to manage ownership.

Of course, as an aside, this needs to match the CPU PTEs (which is what
hmm_range_fault() is primarily used for).
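
(For anyone unfamiliar, the mirroring loop looks roughly like the below; a
sketch based on Documentation/mm/hmm.rst, with my_notifier/my_mm as
placeholders and the invalidation callback omitted.)

  unsigned long pfns[1];
  struct hmm_range range = {
          .notifier      = &my_notifier,   /* mmu_interval_notifier */
          .start         = addr,
          .end           = addr + PAGE_SIZE,
          .hmm_pfns      = pfns,
          .default_flags = HMM_PFN_REQ_FAULT | HMM_PFN_REQ_WRITE,
  };
  int ret;

  do {
          range.notifier_seq = mmu_interval_read_begin(&my_notifier);
          mmap_read_lock(my_mm);
          ret = hmm_range_fault(&range);
          mmap_read_unlock(my_mm);
          if (ret == -EBUSY)
                  continue;
          if (ret)
                  return ret;
          /*
           * Program the device page tables from pfns[] under the driver
           * lock, re-checking mmu_interval_read_retry() first.
           */
  } while (mmu_interval_read_retry(&my_notifier, range.notifier_seq));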

>    4) driver sends the gpu the (mapping_id, pfn, index) information
>       so that gpu can map the region in its page tables.

On coherent systems this often just uses HW address translation services
(ATS), although I think the specific implementation of how page-tables are
mirrored/shared is orthogonal to this.

>    5) since the memory is cache coherent, gpu and cpu are free to
>       operate directly on the pages without any additional magic
>       (except typical concurrency controls).

This is roughly how things work with DEVICE_PRIVATE/COHERENT memory today,
except in the case of DEVICE_PRIVATE in step (5) above. In that case the page is
mapped as a non-present special swap entry that triggers a driver callback due
to the lack of cache coherence.
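
To make the quoted steps (1)-(3) concrete, a driver fault handler under this
proposal might look something like the sketch below. GFP_SPM_NODE is the flag
introduced by this series; everything else (gpu_mem_fault, struct gpu_dev,
gpu->nid) is hypothetical driver code, not an existing API. Steps (4) and (5)
would then happen on the device side.

  static vm_fault_t gpu_mem_fault(struct vm_fault *vmf)
  {
          struct gpu_dev *gpu = vmf->vma->vm_private_data;
          struct page *page;

          /*
           * (2) Allocate from the SPM node; without GFP_SPM_NODE the
           * allocator would skip this node for ordinary allocations.
           */
          page = alloc_pages_node(gpu->nid, GFP_HIGHUSER | GFP_SPM_NODE, 0);
          if (!page)
                  return VM_FAULT_OOM;

          /*
           * (3) Insert into the CPU page tables; protection bits come
           * from the VMA, which the driver controls at mmap time.
           */
          return vmf_insert_page(vmf->vma, vmf->address, page);
  }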

> Driver doesn't have to do much in the way of allocation management.
> 
> This is probably less compelling since you don't want general purpose
> services like reclaim, migration, compaction, tiering, etc.

On at least some of our systems I'm told we do want this, hence my interest
here. Currently we have systems not using DEVICE_COHERENT and instead just
onlining everything as normal system-managed memory in order to get reclaim
and tiering. Of course people then complain that it's managed as normal system
memory and non-GPU related things (ie. page-cache) end up in what's viewed as
special purpose memory.
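
(For memory onlined via the dax/kmem-style path, that onlining boils down to
an add_memory_driver_managed() call; a before/after sketch using the
MHP_SPM_NODE flag from patch 08/11 of this series, with gpu/res as placeholder
driver state:)

  /*
   * Today: the device memory becomes a plain NUMA node, eligible for
   * page-cache, reclaim, tiering, ... like any other node.
   */
  rc = add_memory_driver_managed(gpu->nid, res->start, resource_size(res),
                                 "System RAM (gpu)", MHP_MERGE_RESOURCE);

  /*
   * With this series: same thing, but the node is marked SPM so only
   * GFP_SPM_NODE allocations may land on it.
   */
  rc = add_memory_driver_managed(gpu->nid, res->start, resource_size(res),
                                 "System RAM (gpu)",
                                 MHP_MERGE_RESOURCE | MHP_SPM_NODE);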

> The value is clearly that you get to manage GPU memory like any other
> memory, but without worry that other parts of the system will touch it.
> 
> I'm much more focused on the "I have memory that is otherwise general
> purpose, and wants services like reclaim and compaction, but I want
> strong controls over how things can land there in the first place".

So maybe there is some overlap here - what I have is memory that we want
managed much like normal memory, but with strong controls over what it can be
used for (ie. just for tasks utilising the processing element on the
accelerator).

 - Alistair

> ~Gregory
> 


Thread overview: 29+ messages
2025-11-12 19:29 Gregory Price
2025-11-12 19:29 ` [RFC PATCH v2 01/11] mm: constify oom_control, scan_control, and alloc_context nodemask Gregory Price
2025-11-12 19:29 ` [RFC PATCH v2 02/11] mm: change callers of __cpuset_zone_allowed to cpuset_zone_allowed Gregory Price
2025-11-12 19:29 ` [RFC PATCH v2 03/11] gfp: Add GFP_SPM_NODE for Specific Purpose Memory (SPM) allocations Gregory Price
2025-11-12 19:29 ` [RFC PATCH v2 04/11] memory-tiers: Introduce SysRAM and Specific Purpose Memory Nodes Gregory Price
2025-11-12 19:29 ` [RFC PATCH v2 05/11] mm: restrict slub, oom, compaction, and page_alloc to sysram by default Gregory Price
2025-11-12 19:29 ` [RFC PATCH v2 06/11] mm,cpusets: rename task->mems_allowed to task->sysram_nodes Gregory Price
2025-11-12 19:29 ` [RFC PATCH v2 07/11] cpuset: introduce cpuset.mems.sysram Gregory Price
2025-11-12 19:29 ` [RFC PATCH v2 08/11] mm/memory_hotplug: add MHP_SPM_NODE flag Gregory Price
2025-11-13 14:58   ` [PATCH] memory-tiers: multi-definition fixup Gregory Price
2025-11-13 16:37     ` kernel test robot
2025-11-12 19:29 ` [RFC PATCH v2 09/11] drivers/dax: add spm_node bit to dev_dax Gregory Price
2025-11-12 19:29 ` [RFC PATCH v2 10/11] drivers/cxl: add spm_node bit to cxl region Gregory Price
2025-11-12 19:29 ` [RFC PATCH v2 11/11] [HACK] mm/zswap: compressed ram integration example Gregory Price
2025-11-18  7:02 ` [RFC LPC2026 PATCH v2 00/11] Specific Purpose Memory NUMA Nodes Alistair Popple
2025-11-18 10:36   ` Gregory Price
2025-11-21 21:07   ` Gregory Price
2025-11-23 23:09     ` Alistair Popple [this message]
2025-11-24 15:28       ` Gregory Price
2025-11-27  5:03         ` Alistair Popple
2025-11-24  9:19 ` David Hildenbrand (Red Hat)
2025-11-24 18:06   ` Gregory Price
2025-11-25 14:09 ` Kiryl Shutsemau
2025-11-25 15:05   ` Gregory Price
2025-11-27  5:12     ` Alistair Popple
2025-11-26  3:23 ` Balbir Singh
2025-11-26  8:29   ` Gregory Price
2025-12-03  4:36     ` Balbir Singh
2025-12-03  5:25       ` Gregory Price
