From: Gregory Price <gourry@gourry.net>
To: Kiryl Shutsemau <kirill@shutemov.name>
Cc: linux-mm@kvack.org, kernel-team@meta.com,
	linux-cxl@vger.kernel.org, linux-kernel@vger.kernel.org,
	nvdimm@lists.linux.dev, linux-fsdevel@vger.kernel.org,
	cgroups@vger.kernel.org, dave@stgolabs.net,
	jonathan.cameron@huawei.com, dave.jiang@intel.com,
	alison.schofield@intel.com, vishal.l.verma@intel.com,
	ira.weiny@intel.com, dan.j.williams@intel.com,
	longman@redhat.com, akpm@linux-foundation.org, david@redhat.com,
	lorenzo.stoakes@oracle.com, Liam.Howlett@oracle.com,
	vbabka@suse.cz, rppt@kernel.org, surenb@google.com,
	mhocko@suse.com, osalvador@suse.de, ziy@nvidia.com,
	matthew.brost@intel.com, joshua.hahnjy@gmail.com,
	rakie.kim@sk.com, byungchul@sk.com, ying.huang@linux.alibaba.com,
	apopple@nvidia.com, mingo@redhat.com, peterz@infradead.org,
	juri.lelli@redhat.com, vincent.guittot@linaro.org,
	dietmar.eggemann@arm.com, rostedt@goodmis.org,
	bsegall@google.com, mgorman@suse.de, vschneid@redhat.com,
	tj@kernel.org, hannes@cmpxchg.org, mkoutny@suse.com,
	kees@kernel.org, muchun.song@linux.dev, roman.gushchin@linux.dev,
	shakeel.butt@linux.dev, rientjes@google.com, jackmanb@google.com,
	cl@gentwo.org, harry.yoo@oracle.com, axelrasmussen@google.com,
	yuanchu@google.com, weixugc@google.com,
	zhengqi.arch@bytedance.com, yosry.ahmed@linux.dev,
	nphamcs@gmail.com, chengming.zhou@linux.dev,
	fabio.m.de.francesco@linux.intel.com, rrichter@amd.com,
	ming.li@zohomail.com, usamaarif642@gmail.com, brauner@kernel.org,
	oleg@redhat.com, namcao@linutronix.de, escape@linux.alibaba.com,
	dongjoo.seo1@samsung.com
Subject: Re: [RFC LPC2026 PATCH v2 00/11] Specific Purpose Memory NUMA Nodes
Date: Tue, 25 Nov 2025 10:05:21 -0500
Message-ID: <aSXFseE5FMx-YzqX@gourry-fedora-PF4VCD3F>
In-Reply-To: <h7vt26ek4wzrls6twsveinxz7aarwqtkhydbgvihsm7xzsjiuz@yk2dltuf2eoh>

On Tue, Nov 25, 2025 at 02:09:39PM +0000, Kiryl Shutsemau wrote:
> On Wed, Nov 12, 2025 at 02:29:16PM -0500, Gregory Price wrote:
> > With this set, we aim to enable allocation of "special purpose memory"
> > with the page allocator (mm/page_alloc.c) without exposing the same
> > memory as "System RAM".  Unless a non-userland component requests it,
> > and does so with the GFP_SPM_NODE flag, memory on these nodes cannot
> > be allocated.
> 
> How special is "special purpose memory"? If the only difference is a
> latency/bandwidth discrepancy compared to "System RAM", I don't believe
> it deserves this designation.
> 

That is not the only discrepancy, but it can certainly be one of them.

I do think, at a certain latency/bandwidth level, memory becomes
"Specific Purpose" - because the performance implications become so
dramatic that you cannot allow just anything to land there.

In my head, I've been thinking about this list:

1) Plain old memory (<100ns)
2) Kinda slower, but basically still memory (100-300ns)
3) Slow Memory (>300ns, up to 2-3us loaded latencies)
4) Types 1-3, but with a special feature (such as compression)
5) Coherent Accelerator Memory (various interconnects now exist)
6) Non-coherent Shared Memory and PMEM (FAMFS, Optane, etc)

Originally I was considering [3,4], but given Alistair's comments I am
also thinking about [5], since apparently some accelerators already
toss their memory into the page allocator for management.


Re: Slow memory --

   Think >500-700ns cache line fetches, or 1-2us loaded.

   It's still "basically just memory", but the scenarios in which
   you can use it transparently shrink significantly.  If you can
   control what lands there, and how, with good policy, this can
   still be a boon compared to hitting I/O.

   But you still want things like reclaim and compaction to run
   on this memory, and you still want buddy allocation of it.

Re: Compression

  This is a class of memory device which presents "usable memory"
  but which carries stipulations around its use.

  The compressed case is the example I use in this set.  There is an
  inline compression mechanism on the device.  If the compression ratio
  drops too low, writes can get dropped, resulting in memory poison.

  We could solve this kind of problem by only allowing allocation via
  demotion and hacking off the Write-bit in the PTE.  This provides the
  interposition needed to fend off compression ratio issues.
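
  As a rough sketch (the spm_map_demoted() name is hypothetical, not
  the patch set's code), the demotion path could write-protect the PTE
  with the existing helpers, so reads stay on the fast path and the
  first write traps into the kernel:

     /* Sketch: map a page demoted to a compressed SPM node read-only. */
     static void spm_map_demoted(struct vm_area_struct *vma,
                                 unsigned long addr, pte_t *ptep)
     {
             pte_t pte = ptep_get(ptep);

             /* Reads proceed unimpeded; writes fault, letting the
              * kernel check device headroom before allowing them. */
             set_pte_at(vma->vm_mm, addr, ptep, pte_wrprotect(pte));
     }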

  But... it's basically still "just memory" - you can even leave it
  mapped in the CPU page tables and allow userland to read unimpeded.

  In fact, we even want things like compaction and reclaim to run here.
  That cannot be done *unless* this memory is in the page allocator;
  keeping it out would basically necessitate reimplementing all the
  core services the kernel provides.

Re: Accelerators

  Alistair has described accelerators onlining their memory as NUMA
  nodes as an existing pattern (though apparently not in-tree, as far
  as I can see).

  General consensus is "don't do this" - and it should be obvious
  why.  Memory pressure can cause non-workload memory to spill to
  these NUMA nodes as fallback allocation targets.

  But if we had a strong isolation mechanism, this could be supported.
  I'm not convinced this kind of memory actually needs core services
  like reclaim, so I will wait to see those arguments/data before I
  conclude whether the idea is sound.


>
> I am not in favor of the new GFP flag approach. To me, this indicates
> that our infrastructure surrounding nodemasks is lacking. I believe we
> would benefit more by improving it rather than simply adding a GFP flag
> on top.
> 

The core of this series is not the GFP flag; it is the splitting of
(cpuset.mems_allowed) into (cpuset.mems_allowed, cpuset.sysram_nodes).

That is the nodemask infrastructure improvement.  The GFP flag is one
mechanism for loosening the validation logic from limiting allocations
to (sysram_nodes) to allowing all nodes present in (mems_allowed).
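
In rough shape, the loosened check looks something like this (the
parameters are illustrative, not the literal patch):

   /*
    * Illustrative only: ordinary allocations validate against the
    * sysram mask; GFP_SPM_NODE widens the check to mems_allowed.
    */
   static bool spm_node_allowed(int nid, gfp_t gfp_mask,
                                const nodemask_t *sysram_nodes,
                                const nodemask_t *mems_allowed)
   {
           if (gfp_mask & GFP_SPM_NODE)
                   return node_isset(nid, *mems_allowed);

           return node_isset(nid, *sysram_nodes);
   }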

> While I am not an expert in NUMA, it appears that the approach with
> default and opt-in NUMA nodes could be generally useful. Like,
> introduce a system-wide default NUMA nodemask that is a subset of all
> possible nodes.

This patch set does that (cpuset.sysram_nodes and mt_sysram_nodemask)

> This way, users can request the "special" nodes by using
> a wider mask than the default.
> 

I described in my response to David that this is possible, but that it
creates extreme tripping hazards for a large swath of existing software.

snippet
'''
Simple answer:  We can choose how hard this guardrail is to break.

This initial attempt makes it "Hard":
   You cannot "accidentally" allocate SPM, the call must be explicit.

Removing the GFP would work, and make it "Easier" to access SPM memory.

This would allow a trivial 

   mbind(range, SPM_NODE_ID)

Which is great, but is also an incredible tripping hazard:

   numactl --interleave --all

and in kernel land:

   __alloc_pages_noprof(..., nodes[N_MEMORY])

These would now instantly be exposed to SPM node memory.
'''

There are many places that use these patterns already.

But at the end of the day, it is a matter of preference: we can choose
to do that.

> cpusets should allow setting both default and possible masks in a
> hierarchical manner, where a child's default/possible mask cannot be
> wider than the parent's possible mask, and its default is not wider
> than its own possible mask.
> 

This patch set implements exactly what you describe:
   sysram_nodes = default
   mems_allowed = possible

> > Userspace-driven allocations are restricted by the sysram_nodes mask;
> > nothing in userspace can explicitly request memory from SPM nodes.
> > 
> > Instead, the intent is to create new components which understand memory
> > features and register those nodes with those components. This abstracts
> > the hardware complexity away from userland while also not requiring new
> > memory innovations to carry entirely new allocators.
> 
> I don't see how it is a positive. It seems to be a negative side-effect
> of GFP being a leaky abstraction.
> 

It's a matter of applying an isolation mechanism and then punching an
explicit hole in it.  As it stands right now, GFP is "leaky" in that
there are, basically, no walls.  Reclaim even ignored cpuset controls
until recently, and the page_alloc code explicitly says to ignore
cpusets when in interrupt context.

The core of the proposal here is to provide a strong isolation mechanism
and then allow punching explicit holes in it.  The GFP flag is one
pattern; I'm open to others.
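
For illustration, an in-kernel component that understands the SPM node
would opt in explicitly (spm_nid here stands in for a known SPM node id):

   /* Nothing lands on the node without the explicit GFP_SPM_NODE. */
   struct page *page = alloc_pages_node(spm_nid,
                                        GFP_KERNEL | GFP_SPM_NODE |
                                        __GFP_THISNODE, 0);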

~Gregory

