Re: [RFC LPC2026 PATCH v2 00/11] Specific Purpose Memory NUMA Nodes

linux-mm.kvack.org archive mirror
 help / color / mirror / Atom feed

From: Gregory Price <gourry@gourry.net>
To: Alistair Popple <apopple@nvidia.com>
Cc: linux-mm@kvack.org, kernel-team@meta.com,
	linux-cxl@vger.kernel.org, linux-kernel@vger.kernel.org,
	nvdimm@lists.linux.dev, linux-fsdevel@vger.kernel.org,
	cgroups@vger.kernel.org, dave@stgolabs.net,
	jonathan.cameron@huawei.com, dave.jiang@intel.com,
	alison.schofield@intel.com, vishal.l.verma@intel.com,
	ira.weiny@intel.com, dan.j.williams@intel.com,
	longman@redhat.com, akpm@linux-foundation.org, david@redhat.com,
	lorenzo.stoakes@oracle.com, Liam.Howlett@oracle.com,
	vbabka@suse.cz, rppt@kernel.org, surenb@google.com,
	mhocko@suse.com, osalvador@suse.de, ziy@nvidia.com,
	matthew.brost@intel.com, joshua.hahnjy@gmail.com,
	rakie.kim@sk.com, byungchul@sk.com, ying.huang@linux.alibaba.com,
	mingo@redhat.com, peterz@infradead.org, juri.lelli@redhat.com,
	vincent.guittot@linaro.org, dietmar.eggemann@arm.com,
	rostedt@goodmis.org, bsegall@google.com, mgorman@suse.de,
	vschneid@redhat.com, tj@kernel.org, hannes@cmpxchg.org,
	mkoutny@suse.com, kees@kernel.org, muchun.song@linux.dev,
	roman.gushchin@linux.dev, shakeel.butt@linux.dev,
	rientjes@google.com, jackmanb@google.com, cl@gentwo.org,
	harry.yoo@oracle.com, axelrasmussen@google.com,
	yuanchu@google.com, weixugc@google.com,
	zhengqi.arch@bytedance.com, yosry.ahmed@linux.dev,
	nphamcs@gmail.com, chengming.zhou@linux.dev,
	fabio.m.de.francesco@linux.intel.com, rrichter@amd.com,
	ming.li@zohomail.com, usamaarif642@gmail.com, brauner@kernel.org,
	oleg@redhat.com, namcao@linutronix.de, escape@linux.alibaba.com,
	dongjoo.seo1@samsung.com
Subject: Re: [RFC LPC2026 PATCH v2 00/11] Specific Purpose Memory NUMA Nodes
Date: Tue, 18 Nov 2025 04:36:47 -0600	[thread overview]
Message-ID: <aRxMP_wDRxJIhIiB@gourry-fedora-PF4VCD3F> (raw)
In-Reply-To: <aktv2ivkrvtrox6nvcpxsnq6sagxnmj4yymelgkst6pazzpogo@aexnxfcklg75>

On Tue, Nov 18, 2025 at 06:02:02PM +1100, Alistair Popple wrote:
> On 2025-11-13 at 06:29 +1100, Gregory Price <gourry@gourry.net> wrote...
> > - Why? (In short: shunting to DAX is a failed pattern for users)
> > - Other designs I considered (mempolicy, cpusets, zone_device)
> 
> I'm interested in the contrast with zone_device, and in particular why
> device_coherent memory doesn't end up being a good fit for this.
>

I did consider zone_device briefly, but if you want sparse allocation
you end up essentially re-implementing some form of buddy allocator.

That seemed less then ideal, to say the least.

Additionally, pgmap use precludes these pages from using LRU/Reclaim,
and some devices may very well be compatible with such patterns.

(I think compression will be, but it still needs work)

> > - Why mempolicy.c and cpusets as-is are insufficient
> > - SPM types seeking this form of interface (Accelerator, Compression)
> 
> I'm sure you can guess my interest is in GPUs which also have memory some people
> consider should only be used for specific purposes :-) Currently our coherent
> GPUs online this as a normal NUMA noode, for which we have also generally
> found mempolicy, cpusets, etc. inadequate as well, so it will be interesting to
> hear what short comings you have been running into (I'm less familiar with the
> Compression cases you talk about here though).
> 

The TL;DR:

cpusets as-designed doesn't really allow the concept of "Nothing can
access XYZ node except specific things" because this would involve
removing a node from the root cpusets.mems - and that can't be loosened.

mempolicy is more of a suggestion and can be completely overridden. It
is entirely ignored by things like demotion/reclaim/etc.

I plan to discuss a bit of the specifics at LPC, but a lot of this stems
from the zone-iteration logic in page_alloc.c and the rather... ermm...
"complex" nature of how mempolicy and cpusets interacts with each other.

I may add some additional notes on this thread prior to LPC given that
time may be too short to get into the nasty bits in the session.

> > - Platform extensions that would be nice to see (SPM-only Bits)
> > 
> > Open Questions
> > - Single SPM nodemask, or multiple based on features?
> > - Apply SPM/SysRAM bit on-boot only or at-hotplug?
> > - Allocate extra "possible" NUMA nodes for flexbility?
> 
> I guess this might make hotplug easier? Particularly in cases where FW hasn't
> created the nodes.
>

In cases where you need to reach back to the device for some signal, you
likely need to have the driver for that device manage the alloc/free
patterns - so this may (or may not) generalize to 1-device-per-node.

In the scenario where you want some flexibility in managing regions,
this may require multiple nodes for device.  Maybe one device provides
multiple types of memory - you want those on separate nodes.

This doesn't seem like something you need to solve right away, just
something for folks to consider.

> > - Should SPM Nodes be zone-restricted? (MOVABLE only?)
> 
> For device based memory I think so - otherwise you can never gurantee devices
> can be removed or drivers (if required to access the memory) can be unbound as
> you can't migrate things off the memory.
> 

Zones in this scenario are bit of a square-peg/round-hole.  Forcing
everything in ZONE_MOVABLE means you can't do page pinning or things
like 1GB gigantic pages.  But the device driver should be capable of
managing hotplug anyway, so what's the point of ZONE_MOVABLE? :shrug:

> > The ZSwap example demonstrates this with the `mt_spm_nodemask`.  This
> > hack treats all spm nodes as-if they are compressed memory nodes, and
> > we bypass the software compression logic in zswap in favor of simply
> > copying memory directly to the allocated page.  In a real design
> 
> So in your example (I get it's a hack) is the main advantage that you can use
> all the same memory allocation policies (eg. cgroups) when needing to allocate
> the pages? Given this is ZSwap I guess these pages would never be mapped
> directly into user-space but would anything in the design prevent that? 

This is, in-fact, the long term intent. As long as the device can manage
inline decompression with reasonable latencies, there's no reason you
shouldn't be able to leave the pages mapped Read-Only in user-space.

The driver would be responsible for migrating on write-fault, similar to
a NUMA Hint Fault on the existing transparent page placement system.

> For example could a driver say allocate SPM memory and then explicitly
> migrate an existing page to it?

You might even extend migrate_pages with a new flag that simply drops
the write-able flag from the page table mapping and abstract that entire
complexity out of the driver :]

~Gregory

next prev parent reply	other threads:[~2025-11-18 10:37 UTC|newest]

Thread overview: 34+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2025-11-12 19:29 Gregory Price
2025-11-12 19:29 ` [RFC PATCH v2 01/11] mm: constify oom_control, scan_control, and alloc_context nodemask Gregory Price
2025-12-15  6:11   ` Balbir Singh
2025-11-12 19:29 ` [RFC PATCH v2 02/11] mm: change callers of __cpuset_zone_allowed to cpuset_zone_allowed Gregory Price
2025-12-15  6:14   ` Balbir Singh
2025-12-15 12:38     ` Gregory Price
2025-11-12 19:29 ` [RFC PATCH v2 03/11] gfp: Add GFP_SPM_NODE for Specific Purpose Memory (SPM) allocations Gregory Price
2025-11-12 19:29 ` [RFC PATCH v2 04/11] memory-tiers: Introduce SysRAM and Specific Purpose Memory Nodes Gregory Price
2025-11-12 19:29 ` [RFC PATCH v2 05/11] mm: restrict slub, oom, compaction, and page_alloc to sysram by default Gregory Price
2025-11-12 19:29 ` [RFC PATCH v2 06/11] mm,cpusets: rename task->mems_allowed to task->sysram_nodes Gregory Price
2025-11-12 19:29 ` [RFC PATCH v2 07/11] cpuset: introduce cpuset.mems.sysram Gregory Price
2025-11-12 19:29 ` [RFC PATCH v2 08/11] mm/memory_hotplug: add MHP_SPM_NODE flag Gregory Price
2025-11-13 14:58   ` [PATCH] memory-tiers: multi-definition fixup Gregory Price
2025-11-13 16:37     ` kernel test robot
2026-01-15  2:38     ` [PATCH] dax/kmem: add build config for protected dax memory blocks Gregory Price
2025-11-12 19:29 ` [RFC PATCH v2 09/11] drivers/dax: add spm_node bit to dev_dax Gregory Price
2025-11-12 19:29 ` [RFC PATCH v2 10/11] drivers/cxl: add spm_node bit to cxl region Gregory Price
2025-11-12 19:29 ` [RFC PATCH v2 11/11] [HACK] mm/zswap: compressed ram integration example Gregory Price
2025-11-18  7:02 ` [RFC LPC2026 PATCH v2 00/11] Specific Purpose Memory NUMA Nodes Alistair Popple
2025-11-18 10:36   ` Gregory Price [this message]
2025-11-21 21:07   ` Gregory Price
2025-11-23 23:09     ` Alistair Popple
2025-11-24 15:28       ` Gregory Price
2025-11-27  5:03         ` Alistair Popple
2025-11-24  9:19 ` David Hildenbrand (Red Hat)
2025-11-24 18:06   ` Gregory Price
2025-12-10 23:29     ` Yiannis Nikolakopoulos
2025-11-25 14:09 ` Kiryl Shutsemau
2025-11-25 15:05   ` Gregory Price
2025-11-27  5:12     ` Alistair Popple
2025-11-26  3:23 ` Balbir Singh
2025-11-26  8:29   ` Gregory Price
2025-12-03  4:36     ` Balbir Singh
2025-12-03  5:25       ` Gregory Price

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=aRxMP_wDRxJIhIiB@gourry-fedora-PF4VCD3F \
    --to=gourry@gourry.net \
    --cc=Liam.Howlett@oracle.com \
    --cc=akpm@linux-foundation.org \
    --cc=alison.schofield@intel.com \
    --cc=apopple@nvidia.com \
    --cc=axelrasmussen@google.com \
    --cc=brauner@kernel.org \
    --cc=bsegall@google.com \
    --cc=byungchul@sk.com \
    --cc=cgroups@vger.kernel.org \
    --cc=chengming.zhou@linux.dev \
    --cc=cl@gentwo.org \
    --cc=dan.j.williams@intel.com \
    --cc=dave.jiang@intel.com \
    --cc=dave@stgolabs.net \
    --cc=david@redhat.com \
    --cc=dietmar.eggemann@arm.com \
    --cc=dongjoo.seo1@samsung.com \
    --cc=escape@linux.alibaba.com \
    --cc=fabio.m.de.francesco@linux.intel.com \
    --cc=hannes@cmpxchg.org \
    --cc=harry.yoo@oracle.com \
    --cc=ira.weiny@intel.com \
    --cc=jackmanb@google.com \
    --cc=jonathan.cameron@huawei.com \
    --cc=joshua.hahnjy@gmail.com \
    --cc=juri.lelli@redhat.com \
    --cc=kees@kernel.org \
    --cc=kernel-team@meta.com \
    --cc=linux-cxl@vger.kernel.org \
    --cc=linux-fsdevel@vger.kernel.org \
    --cc=linux-kernel@vger.kernel.org \
    --cc=linux-mm@kvack.org \
    --cc=longman@redhat.com \
    --cc=lorenzo.stoakes@oracle.com \
    --cc=matthew.brost@intel.com \
    --cc=mgorman@suse.de \
    --cc=mhocko@suse.com \
    --cc=ming.li@zohomail.com \
    --cc=mingo@redhat.com \
    --cc=mkoutny@suse.com \
    --cc=muchun.song@linux.dev \
    --cc=namcao@linutronix.de \
    --cc=nphamcs@gmail.com \
    --cc=nvdimm@lists.linux.dev \
    --cc=oleg@redhat.com \
    --cc=osalvador@suse.de \
    --cc=peterz@infradead.org \
    --cc=rakie.kim@sk.com \
    --cc=rientjes@google.com \
    --cc=roman.gushchin@linux.dev \
    --cc=rostedt@goodmis.org \
    --cc=rppt@kernel.org \
    --cc=rrichter@amd.com \
    --cc=shakeel.butt@linux.dev \
    --cc=surenb@google.com \
    --cc=tj@kernel.org \
    --cc=usamaarif642@gmail.com \
    --cc=vbabka@suse.cz \
    --cc=vincent.guittot@linaro.org \
    --cc=vishal.l.verma@intel.com \
    --cc=vschneid@redhat.com \
    --cc=weixugc@google.com \
    --cc=ying.huang@linux.alibaba.com \
    --cc=yosry.ahmed@linux.dev \
    --cc=yuanchu@google.com \
    --cc=zhengqi.arch@bytedance.com \
    --cc=ziy@nvidia.com \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link

Be sure your reply has a Subject: header at the top and a blank line before the message body.

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox