Re: [RFC LPC2026 PATCH v2 00/11] Specific Purpose Memory NUMA Nodes

linux-mm.kvack.org archive mirror
 help / color / mirror / Atom feed

From: Alistair Popple <apopple@nvidia.com>
To: Gregory Price <gourry@gourry.net>
Cc: Kiryl Shutsemau <kirill@shutemov.name>,
	linux-mm@kvack.org,  kernel-team@meta.com,
	linux-cxl@vger.kernel.org, linux-kernel@vger.kernel.org,
	 nvdimm@lists.linux.dev, linux-fsdevel@vger.kernel.org,
	cgroups@vger.kernel.org,  dave@stgolabs.net,
	jonathan.cameron@huawei.com, dave.jiang@intel.com,
	 alison.schofield@intel.com, vishal.l.verma@intel.com,
	ira.weiny@intel.com,  dan.j.williams@intel.com,
	longman@redhat.com, akpm@linux-foundation.org, david@redhat.com,
	 lorenzo.stoakes@oracle.com, Liam.Howlett@oracle.com,
	vbabka@suse.cz, rppt@kernel.org,  surenb@google.com,
	mhocko@suse.com, osalvador@suse.de, ziy@nvidia.com,
	 matthew.brost@intel.com, joshua.hahnjy@gmail.com,
	rakie.kim@sk.com, byungchul@sk.com,
	 ying.huang@linux.alibaba.com, mingo@redhat.com,
	peterz@infradead.org, juri.lelli@redhat.com,
	 vincent.guittot@linaro.org, dietmar.eggemann@arm.com,
	rostedt@goodmis.org,  bsegall@google.com, mgorman@suse.de,
	vschneid@redhat.com, tj@kernel.org,  hannes@cmpxchg.org,
	mkoutny@suse.com, kees@kernel.org, muchun.song@linux.dev,
	 roman.gushchin@linux.dev, shakeel.butt@linux.dev,
	rientjes@google.com, jackmanb@google.com,  cl@gentwo.org,
	harry.yoo@oracle.com, axelrasmussen@google.com,
	 yuanchu@google.com, weixugc@google.com,
	zhengqi.arch@bytedance.com,  yosry.ahmed@linux.dev,
	nphamcs@gmail.com, chengming.zhou@linux.dev,
	 fabio.m.de.francesco@linux.intel.com, rrichter@amd.com,
	ming.li@zohomail.com, usamaarif642@gmail.com,
	 brauner@kernel.org, oleg@redhat.com, namcao@linutronix.de,
	escape@linux.alibaba.com,  dongjoo.seo1@samsung.com
Subject: Re: [RFC LPC2026 PATCH v2 00/11] Specific Purpose Memory NUMA Nodes
Date: Thu, 27 Nov 2025 16:12:05 +1100	[thread overview]
Message-ID: <icora3w7wfisv2vtdc5w3w4kum2wbwqx2fmnxrrjo4tp7hgvem@jmb35qkh5ylx> (raw)
In-Reply-To: <aSXFseE5FMx-YzqX@gourry-fedora-PF4VCD3F>

On 2025-11-26 at 02:05 +1100, Gregory Price <gourry@gourry.net> wrote...
> On Tue, Nov 25, 2025 at 02:09:39PM +0000, Kiryl Shutsemau wrote:
> > On Wed, Nov 12, 2025 at 02:29:16PM -0500, Gregory Price wrote:
> > > With this set, we aim to enable allocation of "special purpose memory"
> > > with the page allocator (mm/page_alloc.c) without exposing the same
> > > memory as "System RAM".  Unless a non-userland component, and does so
> > > with the GFP_SPM_NODE flag, memory on these nodes cannot be allocated.
> > 
> > How special is "special purpose memory"? If the only difference is a
> > latency/bandwidth discrepancy compared to "System RAM", I don't believe
> > it deserves this designation.
> > 
> 
> That is not the only discrepancy, but it can certainly be one of them.
> 
> I do think, at a certain latency/bandwidth level, memory becomes
> "Specific Purpose" - because the performance implications become so
> dramatic that you cannot allow just anything to land there.
> 
> In my head, I've been thinking about this list
> 
> 1) Plain old memory (<100ns)
> 2) Kinda slower, but basically still memory (100-300ns)
> 3) Slow Memory (>300ns, up to 2-3us loaded latencies)
> 4) Types 1-3, but with a special feature (Such as compression)
> 5) Coherent Accelerator Memory (various interconnects now exist)
> 6) Non-coherent Shared Memory and PMEM (FAMFS, Optane, etc)
> 
> Originally I was considering [3,4], but with Alistar's comments I am
> also thinking about [5] since apparently some accelerators already
> toss their memory into the page allocator for management.

Thanks.

> Re: Slow memory --
> 
>    Think >500-700ns cache line fetches, or 1-2us loaded.
> 
>    It's still "Basically just memory", but the scenarios in which
>    you can use it transparently shrink significantly.  If you can
>    control what and how things can land there with good policy,
>    this can still be a boon compared to hitting I/O.
> 
>    But you still want things like reclaim and compaction to run
>    on this memory, and you still want buddy-allocation of this memory.
> 
> Re: Compression
> 
>   This is a class of memory device which presents "usable memory"
>   but which carries stipulations around its use.
> 
>   The compressed case is the example I use in this set.  There is an
>   inline compression mechanism on the device.  If the compression ratio
>   drops to low, writes can get dropped resulting in memory poison.
> 
>   We could solve this kind of problem only allowing allocation via
>   demotion and hack off the Write-bit in the PTE. This provides the
>   interposition needed to fend-off compression ratio issues.
> 
>   But... it's basically still "just memory" - you can even leave it
>   mapped in the CPU page tables and allow userland to read unimpeded.
> 
>   In fact, we even want things like compaction and reclaim to run here.
>   This cannot be done *unless* this memory is in the page allocator,
>   and basically necessitates reimplementing all the core services the
>   kernel provides.
> 
> Re: Accelerators
> 
>   Alistair has described accelerators onlining their memory as NUMA
>   nodes being an existing pattern (apparently not in-tree as far as I
>   can see, though).

Yeah, sadly not yet :-( Hopefully "soon". Although onlining the memory doesn't
have much driver involvement as the GPU memory all just appears in the ACPI
tables as a CPU-less memory node anyway (which is why it ended up being easy for
people to toss it into the page allocator).

>   General consensus is "don't do this" - and it should be obvious
>   why.  Memory pressure can cause non-workload memory to spill to
>   these NUMA nodes as fallback allocation targets.

Indeed, this is a common complaint when people have done this.

>   But if we had a strong isolation mechanism, this could be supported.
>   I'm not convinced this kind of memory actually needs core services
>   like reclaim, so I will wait to see those arguments/data before I
>   conclude whether the idea is sound.

Sounds reasonable, I don't have strong arugments either way at the moment so
will see if we can gather some data.

> 
> 
> >
> > I am not in favor of the new GFP flag approach. To me, this indicates
> > that our infrastructure surrounding nodemasks is lacking. I believe we
> > would benefit more by improving it rather than simply adding a GFP flag
> > on top.
> > 
> 
> The core of this series is not the GFP flag, it is the splitting of
> (cpuset.mems_allowed) into (cpuset.mems_allowed, cpuset.sysram_nodes)
> 
> That is the nodemask infrastructure improvement.  The GFP flag is one
> mechanism of loosening the validation logic from limiting allocations
> from (sysram_nodes) to including all nodes present in (mems_allowed).
> 
> > While I am not an expert in NUMA, it appears that the approach with
> > default and opt-in NUMA nodes could be generally useful. Like,
> > introduce a system-wide default NUMA nodemask that is a subset of all
> > possible nodes.
> 
> This patch set does that (cpuset.sysram_nodes and mt_sysram_nodemask)
> 
> > This way, users can request the "special" nodes by using
> > a wider mask than the default.
> > 
> 
> I describe in the response to David that this is possible, but creates
> extreme tripping hazards for a large swath of existing software.
> 
> snippet
> '''
> Simple answer:  We can choose how hard this guardrail is to break.
> 
> This initial attempt makes it "Hard":
>    You cannot "accidentally" allocate SPM, the call must be explicit.
> 
> Removing the GFP would work, and make it "Easier" to access SPM memory.
> 
> This would allow a trivial 
> 
>    mbind(range, SPM_NODE_ID)
> 
> Which is great, but is also an incredible tripping hazard:
> 
>    numactl --interleave --all
> 
> and in kernel land:
> 
>    __alloc_pages_noprof(..., nodes[N_MEMORY])
> 
> These will now instantly be subject to SPM node memory.
> '''
> 
> There are many places that use these patterns already.
> 
> But at the end of the day, it is preference: we can choose to do that.
> 
> > cpusets should allow to set both default and possible masks in a
> > hierarchical manner where a child's default/possible mask cannot be
> > wider than the parent's possible mask and default is not wider that
> > own possible.
> > 
> 
> This patch set implements exactly what you describe:
>    sysram_nodes = default
>    mems_allowed = possible
> 
> > > Userspace-driven allocations are restricted by the sysram_nodes mask,
> > > nothing in userspace can explicitly request memory from SPM nodes.
> > > 
> > > Instead, the intent is to create new components which understand memory
> > > features and register those nodes with those components. This abstracts
> > > the hardware complexity away from userland while also not requiring new
> > > memory innovations to carry entirely new allocators.
> > 
> > I don't see how it is a positive. It seems to be negative side-effect of
> > GFP being a leaky abstraction.
> > 
> 
> It's a matter of applying an isolation mechanism and then punching an
> explicit hole in it.  As it is right now, GFP is "leaky" in that there
> are, basically, no walls.  Reclaim even ignored cpuset controls until
> recently, and the page_alloc code even says to ignore cpuset when 
> in an interrupt context.
> 
> The core of the proposal here is to provide a strong isolation mechanism
> and then allow punching explicit holes in it.  The GFP flag is one
> pattern, I'm open to others.
> 
> ~Gregory

next prev parent reply	other threads:[~2025-11-27  5:12 UTC|newest]

Thread overview: 34+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2025-11-12 19:29 Gregory Price
2025-11-12 19:29 ` [RFC PATCH v2 01/11] mm: constify oom_control, scan_control, and alloc_context nodemask Gregory Price
2025-12-15  6:11   ` Balbir Singh
2025-11-12 19:29 ` [RFC PATCH v2 02/11] mm: change callers of __cpuset_zone_allowed to cpuset_zone_allowed Gregory Price
2025-12-15  6:14   ` Balbir Singh
2025-12-15 12:38     ` Gregory Price
2025-11-12 19:29 ` [RFC PATCH v2 03/11] gfp: Add GFP_SPM_NODE for Specific Purpose Memory (SPM) allocations Gregory Price
2025-11-12 19:29 ` [RFC PATCH v2 04/11] memory-tiers: Introduce SysRAM and Specific Purpose Memory Nodes Gregory Price
2025-11-12 19:29 ` [RFC PATCH v2 05/11] mm: restrict slub, oom, compaction, and page_alloc to sysram by default Gregory Price
2025-11-12 19:29 ` [RFC PATCH v2 06/11] mm,cpusets: rename task->mems_allowed to task->sysram_nodes Gregory Price
2025-11-12 19:29 ` [RFC PATCH v2 07/11] cpuset: introduce cpuset.mems.sysram Gregory Price
2025-11-12 19:29 ` [RFC PATCH v2 08/11] mm/memory_hotplug: add MHP_SPM_NODE flag Gregory Price
2025-11-13 14:58   ` [PATCH] memory-tiers: multi-definition fixup Gregory Price
2025-11-13 16:37     ` kernel test robot
2026-01-15  2:38     ` [PATCH] dax/kmem: add build config for protected dax memory blocks Gregory Price
2025-11-12 19:29 ` [RFC PATCH v2 09/11] drivers/dax: add spm_node bit to dev_dax Gregory Price
2025-11-12 19:29 ` [RFC PATCH v2 10/11] drivers/cxl: add spm_node bit to cxl region Gregory Price
2025-11-12 19:29 ` [RFC PATCH v2 11/11] [HACK] mm/zswap: compressed ram integration example Gregory Price
2025-11-18  7:02 ` [RFC LPC2026 PATCH v2 00/11] Specific Purpose Memory NUMA Nodes Alistair Popple
2025-11-18 10:36   ` Gregory Price
2025-11-21 21:07   ` Gregory Price
2025-11-23 23:09     ` Alistair Popple
2025-11-24 15:28       ` Gregory Price
2025-11-27  5:03         ` Alistair Popple
2025-11-24  9:19 ` David Hildenbrand (Red Hat)
2025-11-24 18:06   ` Gregory Price
2025-12-10 23:29     ` Yiannis Nikolakopoulos
2025-11-25 14:09 ` Kiryl Shutsemau
2025-11-25 15:05   ` Gregory Price
2025-11-27  5:12     ` Alistair Popple [this message]
2025-11-26  3:23 ` Balbir Singh
2025-11-26  8:29   ` Gregory Price
2025-12-03  4:36     ` Balbir Singh
2025-12-03  5:25       ` Gregory Price

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=icora3w7wfisv2vtdc5w3w4kum2wbwqx2fmnxrrjo4tp7hgvem@jmb35qkh5ylx \
    --to=apopple@nvidia.com \
    --cc=Liam.Howlett@oracle.com \
    --cc=akpm@linux-foundation.org \
    --cc=alison.schofield@intel.com \
    --cc=axelrasmussen@google.com \
    --cc=brauner@kernel.org \
    --cc=bsegall@google.com \
    --cc=byungchul@sk.com \
    --cc=cgroups@vger.kernel.org \
    --cc=chengming.zhou@linux.dev \
    --cc=cl@gentwo.org \
    --cc=dan.j.williams@intel.com \
    --cc=dave.jiang@intel.com \
    --cc=dave@stgolabs.net \
    --cc=david@redhat.com \
    --cc=dietmar.eggemann@arm.com \
    --cc=dongjoo.seo1@samsung.com \
    --cc=escape@linux.alibaba.com \
    --cc=fabio.m.de.francesco@linux.intel.com \
    --cc=gourry@gourry.net \
    --cc=hannes@cmpxchg.org \
    --cc=harry.yoo@oracle.com \
    --cc=ira.weiny@intel.com \
    --cc=jackmanb@google.com \
    --cc=jonathan.cameron@huawei.com \
    --cc=joshua.hahnjy@gmail.com \
    --cc=juri.lelli@redhat.com \
    --cc=kees@kernel.org \
    --cc=kernel-team@meta.com \
    --cc=kirill@shutemov.name \
    --cc=linux-cxl@vger.kernel.org \
    --cc=linux-fsdevel@vger.kernel.org \
    --cc=linux-kernel@vger.kernel.org \
    --cc=linux-mm@kvack.org \
    --cc=longman@redhat.com \
    --cc=lorenzo.stoakes@oracle.com \
    --cc=matthew.brost@intel.com \
    --cc=mgorman@suse.de \
    --cc=mhocko@suse.com \
    --cc=ming.li@zohomail.com \
    --cc=mingo@redhat.com \
    --cc=mkoutny@suse.com \
    --cc=muchun.song@linux.dev \
    --cc=namcao@linutronix.de \
    --cc=nphamcs@gmail.com \
    --cc=nvdimm@lists.linux.dev \
    --cc=oleg@redhat.com \
    --cc=osalvador@suse.de \
    --cc=peterz@infradead.org \
    --cc=rakie.kim@sk.com \
    --cc=rientjes@google.com \
    --cc=roman.gushchin@linux.dev \
    --cc=rostedt@goodmis.org \
    --cc=rppt@kernel.org \
    --cc=rrichter@amd.com \
    --cc=shakeel.butt@linux.dev \
    --cc=surenb@google.com \
    --cc=tj@kernel.org \
    --cc=usamaarif642@gmail.com \
    --cc=vbabka@suse.cz \
    --cc=vincent.guittot@linaro.org \
    --cc=vishal.l.verma@intel.com \
    --cc=vschneid@redhat.com \
    --cc=weixugc@google.com \
    --cc=ying.huang@linux.alibaba.com \
    --cc=yosry.ahmed@linux.dev \
    --cc=yuanchu@google.com \
    --cc=zhengqi.arch@bytedance.com \
    --cc=ziy@nvidia.com \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link

Be sure your reply has a Subject: header at the top and a blank line before the message body.

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox