[RFC LPC2026 PATCH v2 00/11] Specific Purpose Memory NUMA Nodes

linux-mm.kvack.org archive mirror
 help / color / mirror / Atom feed

* [RFC LPC2026 PATCH v2 00/11] Specific Purpose Memory NUMA Nodes
@ 2025-11-12 19:29 Gregory Price
  2025-11-12 19:29 ` [RFC PATCH v2 01/11] mm: constify oom_control, scan_control, and alloc_context nodemask Gregory Price
                   ` (14 more replies)
  0 siblings, 15 replies; 34+ messages in thread
From: Gregory Price @ 2025-11-12 19:29 UTC (permalink / raw)
  To: linux-mm
  Cc: kernel-team, linux-cxl, linux-kernel, nvdimm, linux-fsdevel,
	cgroups, dave, jonathan.cameron, dave.jiang, alison.schofield,
	vishal.l.verma, ira.weiny, dan.j.williams, longman, akpm, david,
	lorenzo.stoakes, Liam.Howlett, vbabka, rppt, surenb, mhocko,
	osalvador, ziy, matthew.brost, joshua.hahnjy, rakie.kim,
	byungchul, gourry, ying.huang, apopple, mingo, peterz,
	juri.lelli, vincent.guittot, dietmar.eggemann, rostedt, bsegall,
	mgorman, vschneid, tj, hannes, mkoutny, kees, muchun.song,
	roman.gushchin, shakeel.butt, rientjes, jackmanb, cl, harry.yoo,
	axelrasmussen, yuanchu, weixugc, zhengqi.arch, yosry.ahmed,
	nphamcs, chengming.zhou, fabio.m.de.francesco, rrichter, ming.li,
	usamaarif642, brauner, oleg, namcao, escape, dongjoo.seo1

This is a code RFC for discussion related to

"Mempolicy is dead, long live memory policy!"
https://lpc.events/event/19/contributions/2143/

base-commit: 24172e0d79900908cf5ebf366600616d29c9b417
(version notes at end)

At LSF 2026, I plan to discuss:
- Why? (In short: shunting to DAX is a failed pattern for users)
- Other designs I considered (mempolicy, cpusets, zone_device)
- Why mempolicy.c and cpusets as-is are insufficient
- SPM types seeking this form of interface (Accelerator, Compression)
- Platform extensions that would be nice to see (SPM-only Bits)

Open Questions
- Single SPM nodemask, or multiple based on features?
- Apply SPM/SysRAM bit on-boot only or at-hotplug?
- Allocate extra "possible" NUMA nodes for flexbility?
- Should SPM Nodes be zone-restricted? (MOVABLE only?)
- How to handle things like reclaim and compaction on these nodes.

With this set, we aim to enable allocation of "special purpose memory"
with the page allocator (mm/page_alloc.c) without exposing the same
memory as "System RAM".  Unless a non-userland component, and does so
with the GFP_SPM_NODE flag, memory on these nodes cannot be allocated.

This isolation mechanism is a requirement for memory policies which
depend on certain sets of memory never being used outside special
interfaces (such as a specific mm/component or driver).

We present an example of using this mechanism within ZSWAP, as-if
a "compressed memory node" was present.  How to describe the features
of memory present on nodes is left up to comment here and at LPC '26.

Userspace-driven allocations are restricted by the sysram_nodes mask,
nothing in userspace can explicitly request memory from SPM nodes.

Instead, the intent is to create new components which understand memory
features and register those nodes with those components. This abstracts
the hardware complexity away from userland while also not requiring new
memory innovations to carry entirely new allocators.

The ZSwap example demonstrates this with the `mt_spm_nodemask`.  This
hack treats all spm nodes as-if they are compressed memory nodes, and
we bypass the software compression logic in zswap in favor of simply
copying memory directly to the allocated page.  In a real design

There are 4 major changes in this set:

1) Introducing mt_sysram_nodelist in mm/memory-tiers.c which denotes
   the set of nodes which are eligible for use as normal system ram

   Some existing users now pass mt_sysram_nodelist into the page
   allocator instead of NULL, but passing a NULL pointer in will simply
   have it replaced by mt_sysram_nodelist anyway.  Should a fully NULL
   pointer still make it to the page allocator, without GFP_SPM_NODE
   SPM node zones will simply be skipped.

   mt_sysram_nodelist is always guaranteed to contain the N_MEMORY nodes
   present during __init, but if empty the use of mt_sysram_nodes()
   will return a NULL to preserve current behavior.

2) The addition of `cpuset.mems.sysram` which restricts allocations to
   `mt_sysram_nodes` unless GFP_SPM_NODE is used.

   SPM Nodes are still allowed in cpuset.mems.allowed and effective.

   This is done to allow separate control over sysram and SPM node sets
   by cgroups while maintaining the existing hierarchical rules.

   current cpuset configuration
   cpuset.mems_allowed
    |.mems_effective         < (mems_allowed ∩ parent.mems_effective)
    |->tasks.mems_allowed    < cpuset.mems_effective

   new cpuset configuration
   cpuset.mems_allowed
    |.mems_effective         < (mems_allowed ∩ parent.mems_effective)
    |.sysram_nodes           < (mems_effective ∩ default_sys_nodemask)
    |->task.sysram_nodes     < cpuset.sysram_nodes

   This means mems_allowed still restricts all node usage in any given
   task context, which is the existing behavior.

3) Addition of MHP_SPM_NODE flag to instruct memory_hotplug.c that the
   capacity being added should mark the node as an SPM Node. 

   A node is either SysRAM or SPM - never both.  Attempting to add
   incompatible memory to a node results in hotplug failure.

   DAX and CXL are made aware of the bit and have `spm_node` bits added
   to their relevant subsystems.

4) Adding GFP_SPM_NODE - which allows page_alloc.c to request memory
   from the provided node or nodemask.  It changes the behavior of
   the cpuset mems_allowed and mt_node_allowed() checks.

v1->v2:
- naming improvements
    default_node -> sysram_node
    protected    -> spm (Specific Purpose Memory)
- add missing constify patch
- add patch to update callers of __cpuset_zone_allowed
- add additional logic to the mm sysram_nodes patch
- fix bot build issues (ifdef config builds)
- fix out-of-tree driver build issues (function renames)
- change compressed_nodelist to spm_nodelist
- add latch mechanism for sysram/spm nodes (Dan Williams)
  this drops some extra memory-hotplug logic which is nice
v1: https://lore.kernel.org/linux-mm/20251107224956.477056-1-gourry@gourry.net/

Gregory Price (11):
  mm: constify oom_control, scan_control, and alloc_context nodemask
  mm: change callers of __cpuset_zone_allowed to cpuset_zone_allowed
  gfp: Add GFP_SPM_NODE for Specific Purpose Memory (SPM) allocations
  memory-tiers: Introduce SysRAM and Specific Purpose Memory Nodes
  mm: restrict slub, oom, compaction, and page_alloc to sysram by
    default
  mm,cpusets: rename task->mems_allowed to task->sysram_nodes
  cpuset: introduce cpuset.mems.sysram
  mm/memory_hotplug: add MHP_SPM_NODE flag
  drivers/dax: add spm_node bit to dev_dax
  drivers/cxl: add spm_node bit to cxl region
  [HACK] mm/zswap: compressed ram integration example

 drivers/cxl/core/region.c       |  30 ++++++
 drivers/cxl/cxl.h               |   2 +
 drivers/dax/bus.c               |  39 ++++++++
 drivers/dax/bus.h               |   1 +
 drivers/dax/cxl.c               |   1 +
 drivers/dax/dax-private.h       |   1 +
 drivers/dax/kmem.c              |   2 +
 fs/proc/array.c                 |   2 +-
 include/linux/cpuset.h          |  62 +++++++------
 include/linux/gfp_types.h       |   5 +
 include/linux/memory-tiers.h    |  47 ++++++++++
 include/linux/memory_hotplug.h  |  10 ++
 include/linux/mempolicy.h       |   2 +-
 include/linux/mm.h              |   4 +-
 include/linux/mmzone.h          |   6 +-
 include/linux/oom.h             |   2 +-
 include/linux/sched.h           |   6 +-
 include/linux/swap.h            |   2 +-
 init/init_task.c                |   2 +-
 kernel/cgroup/cpuset-internal.h |   8 ++
 kernel/cgroup/cpuset-v1.c       |   7 ++
 kernel/cgroup/cpuset.c          | 158 ++++++++++++++++++++------------
 kernel/fork.c                   |   2 +-
 kernel/sched/fair.c             |   4 +-
 mm/compaction.c                 |  10 +-
 mm/hugetlb.c                    |   8 +-
 mm/internal.h                   |   2 +-
 mm/memcontrol.c                 |   3 +-
 mm/memory-tiers.c               |  66 ++++++++++++-
 mm/memory_hotplug.c             |   7 ++
 mm/mempolicy.c                  |  34 +++----
 mm/migrate.c                    |   4 +-
 mm/mmzone.c                     |   5 +-
 mm/oom_kill.c                   |  11 ++-
 mm/page_alloc.c                 |  57 +++++++-----
 mm/show_mem.c                   |  11 ++-
 mm/slub.c                       |  15 ++-
 mm/vmscan.c                     |   6 +-
 mm/zswap.c                      |  66 ++++++++++++-
 39 files changed, 532 insertions(+), 178 deletions(-)

-- 
2.51.1

^ permalink raw reply	[flat|nested] 34+ messages in thread

* [RFC PATCH v2 01/11] mm: constify oom_control, scan_control, and alloc_context nodemask
  2025-11-12 19:29 [RFC LPC2026 PATCH v2 00/11] Specific Purpose Memory NUMA Nodes Gregory Price
@ 2025-11-12 19:29 ` Gregory Price
  2025-12-15  6:11   ` Balbir Singh
  2025-11-12 19:29 ` [RFC PATCH v2 02/11] mm: change callers of __cpuset_zone_allowed to cpuset_zone_allowed Gregory Price
                   ` (13 subsequent siblings)
  14 siblings, 1 reply; 34+ messages in thread
From: Gregory Price @ 2025-11-12 19:29 UTC (permalink / raw)
  To: linux-mm
  Cc: kernel-team, linux-cxl, linux-kernel, nvdimm, linux-fsdevel,
	cgroups, dave, jonathan.cameron, dave.jiang, alison.schofield,
	vishal.l.verma, ira.weiny, dan.j.williams, longman, akpm, david,
	lorenzo.stoakes, Liam.Howlett, vbabka, rppt, surenb, mhocko,
	osalvador, ziy, matthew.brost, joshua.hahnjy, rakie.kim,
	byungchul, gourry, ying.huang, apopple, mingo, peterz,
	juri.lelli, vincent.guittot, dietmar.eggemann, rostedt, bsegall,
	mgorman, vschneid, tj, hannes, mkoutny, kees, muchun.song,
	roman.gushchin, shakeel.butt, rientjes, jackmanb, cl, harry.yoo,
	axelrasmussen, yuanchu, weixugc, zhengqi.arch, yosry.ahmed,
	nphamcs, chengming.zhou, fabio.m.de.francesco, rrichter, ming.li,
	usamaarif642, brauner, oleg, namcao, escape, dongjoo.seo1

The nodemasks in these structures may come from a variety of sources,
including tasks and cpusets - and should never be modified by any code
when being passed around inside another context.

Signed-off-by: Gregory Price <gourry@gourry.net>
---
 include/linux/cpuset.h | 4 ++--
 include/linux/mm.h     | 4 ++--
 include/linux/mmzone.h | 6 +++---
 include/linux/oom.h    | 2 +-
 include/linux/swap.h   | 2 +-
 kernel/cgroup/cpuset.c | 2 +-
 mm/internal.h          | 2 +-
 mm/mmzone.c            | 5 +++--
 mm/page_alloc.c        | 4 ++--
 mm/show_mem.c          | 9 ++++++---
 mm/vmscan.c            | 6 +++---
 11 files changed, 25 insertions(+), 21 deletions(-)

diff --git a/include/linux/cpuset.h b/include/linux/cpuset.h
index 2ddb256187b5..548eaf7ef8d0 100644
--- a/include/linux/cpuset.h
+++ b/include/linux/cpuset.h
@@ -80,7 +80,7 @@ extern bool cpuset_cpu_is_isolated(int cpu);
 extern nodemask_t cpuset_mems_allowed(struct task_struct *p);
 #define cpuset_current_mems_allowed (current->mems_allowed)
 void cpuset_init_current_mems_allowed(void);
-int cpuset_nodemask_valid_mems_allowed(nodemask_t *nodemask);
+int cpuset_nodemask_valid_mems_allowed(const nodemask_t *nodemask);
 
 extern bool cpuset_current_node_allowed(int node, gfp_t gfp_mask);
 
@@ -219,7 +219,7 @@ static inline nodemask_t cpuset_mems_allowed(struct task_struct *p)
 #define cpuset_current_mems_allowed (node_states[N_MEMORY])
 static inline void cpuset_init_current_mems_allowed(void) {}
 
-static inline int cpuset_nodemask_valid_mems_allowed(nodemask_t *nodemask)
+static inline int cpuset_nodemask_valid_mems_allowed(const nodemask_t *nodemask)
 {
 	return 1;
 }
diff --git a/include/linux/mm.h b/include/linux/mm.h
index d16b33bacc32..1a874917eae6 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -3343,7 +3343,7 @@ extern int __meminit early_pfn_to_nid(unsigned long pfn);
 extern void mem_init(void);
 extern void __init mmap_init(void);
 
-extern void __show_mem(unsigned int flags, nodemask_t *nodemask, int max_zone_idx);
+extern void __show_mem(unsigned int flags, const nodemask_t *nodemask, int max_zone_idx);
 static inline void show_mem(void)
 {
 	__show_mem(0, NULL, MAX_NR_ZONES - 1);
@@ -3353,7 +3353,7 @@ extern void si_meminfo(struct sysinfo * val);
 extern void si_meminfo_node(struct sysinfo *val, int nid);
 
 extern __printf(3, 4)
-void warn_alloc(gfp_t gfp_mask, nodemask_t *nodemask, const char *fmt, ...);
+void warn_alloc(gfp_t gfp_mask, const nodemask_t *nodemask, const char *fmt, ...);
 
 extern void setup_per_cpu_pageset(void);
 
diff --git a/include/linux/mmzone.h b/include/linux/mmzone.h
index 7fb7331c5725..5c96b2c52817 100644
--- a/include/linux/mmzone.h
+++ b/include/linux/mmzone.h
@@ -1725,7 +1725,7 @@ static inline int zonelist_node_idx(const struct zoneref *zoneref)
 
 struct zoneref *__next_zones_zonelist(struct zoneref *z,
 					enum zone_type highest_zoneidx,
-					nodemask_t *nodes);
+					const nodemask_t *nodes);
 
 /**
  * next_zones_zonelist - Returns the next zone at or below highest_zoneidx within the allowed nodemask using a cursor within a zonelist as a starting point
@@ -1744,7 +1744,7 @@ struct zoneref *__next_zones_zonelist(struct zoneref *z,
  */
 static __always_inline struct zoneref *next_zones_zonelist(struct zoneref *z,
 					enum zone_type highest_zoneidx,
-					nodemask_t *nodes)
+					const nodemask_t *nodes)
 {
 	if (likely(!nodes && zonelist_zone_idx(z) <= highest_zoneidx))
 		return z;
@@ -1770,7 +1770,7 @@ static __always_inline struct zoneref *next_zones_zonelist(struct zoneref *z,
  */
 static inline struct zoneref *first_zones_zonelist(struct zonelist *zonelist,
 					enum zone_type highest_zoneidx,
-					nodemask_t *nodes)
+					const nodemask_t *nodes)
 {
 	return next_zones_zonelist(zonelist->_zonerefs,
 							highest_zoneidx, nodes);
diff --git a/include/linux/oom.h b/include/linux/oom.h
index 7b02bc1d0a7e..00da05d227e6 100644
--- a/include/linux/oom.h
+++ b/include/linux/oom.h
@@ -30,7 +30,7 @@ struct oom_control {
 	struct zonelist *zonelist;
 
 	/* Used to determine mempolicy */
-	nodemask_t *nodemask;
+	const nodemask_t *nodemask;
 
 	/* Memory cgroup in which oom is invoked, or NULL for global oom */
 	struct mem_cgroup *memcg;
diff --git a/include/linux/swap.h b/include/linux/swap.h
index e818fbade1e2..f5154499bafd 100644
--- a/include/linux/swap.h
+++ b/include/linux/swap.h
@@ -381,7 +381,7 @@ extern void swap_setup(void);
 /* linux/mm/vmscan.c */
 extern unsigned long zone_reclaimable_pages(struct zone *zone);
 extern unsigned long try_to_free_pages(struct zonelist *zonelist, int order,
-					gfp_t gfp_mask, nodemask_t *mask);
+					gfp_t gfp_mask, const nodemask_t *mask);
 
 #define MEMCG_RECLAIM_MAY_SWAP (1 << 1)
 #define MEMCG_RECLAIM_PROACTIVE (1 << 2)
diff --git a/kernel/cgroup/cpuset.c b/kernel/cgroup/cpuset.c
index 52468d2c178a..cd3e2ae83d70 100644
--- a/kernel/cgroup/cpuset.c
+++ b/kernel/cgroup/cpuset.c
@@ -4238,7 +4238,7 @@ nodemask_t cpuset_mems_allowed(struct task_struct *tsk)
  *
  * Are any of the nodes in the nodemask allowed in current->mems_allowed?
  */
-int cpuset_nodemask_valid_mems_allowed(nodemask_t *nodemask)
+int cpuset_nodemask_valid_mems_allowed(const nodemask_t *nodemask)
 {
 	return nodes_intersects(*nodemask, current->mems_allowed);
 }
diff --git a/mm/internal.h b/mm/internal.h
index 1561fc2ff5b8..464e60dd7ba1 100644
--- a/mm/internal.h
+++ b/mm/internal.h
@@ -587,7 +587,7 @@ void page_alloc_sysctl_init(void);
  */
 struct alloc_context {
 	struct zonelist *zonelist;
-	nodemask_t *nodemask;
+	const nodemask_t *nodemask;
 	struct zoneref *preferred_zoneref;
 	int migratetype;
 
diff --git a/mm/mmzone.c b/mm/mmzone.c
index 0c8f181d9d50..59dc3f2076a6 100644
--- a/mm/mmzone.c
+++ b/mm/mmzone.c
@@ -43,7 +43,8 @@ struct zone *next_zone(struct zone *zone)
 	return zone;
 }
 
-static inline int zref_in_nodemask(struct zoneref *zref, nodemask_t *nodes)
+static inline int zref_in_nodemask(struct zoneref *zref,
+				   const nodemask_t *nodes)
 {
 #ifdef CONFIG_NUMA
 	return node_isset(zonelist_node_idx(zref), *nodes);
@@ -55,7 +56,7 @@ static inline int zref_in_nodemask(struct zoneref *zref, nodemask_t *nodes)
 /* Returns the next zone at or below highest_zoneidx in a zonelist */
 struct zoneref *__next_zones_zonelist(struct zoneref *z,
 					enum zone_type highest_zoneidx,
-					nodemask_t *nodes)
+					const nodemask_t *nodes)
 {
 	/*
 	 * Find the next suitable zone to use for the allocation.
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index 600d9e981c23..fd5401fb5e00 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -3924,7 +3924,7 @@ get_page_from_freelist(gfp_t gfp_mask, unsigned int order, int alloc_flags,
 	return NULL;
 }
 
-static void warn_alloc_show_mem(gfp_t gfp_mask, nodemask_t *nodemask)
+static void warn_alloc_show_mem(gfp_t gfp_mask, const nodemask_t *nodemask)
 {
 	unsigned int filter = SHOW_MEM_FILTER_NODES;
 
@@ -3943,7 +3943,7 @@ static void warn_alloc_show_mem(gfp_t gfp_mask, nodemask_t *nodemask)
 	__show_mem(filter, nodemask, gfp_zone(gfp_mask));
 }
 
-void warn_alloc(gfp_t gfp_mask, nodemask_t *nodemask, const char *fmt, ...)
+void warn_alloc(gfp_t gfp_mask, const nodemask_t *nodemask, const char *fmt, ...)
 {
 	struct va_format vaf;
 	va_list args;
diff --git a/mm/show_mem.c b/mm/show_mem.c
index 3a4b5207635d..24685b5c6dcf 100644
--- a/mm/show_mem.c
+++ b/mm/show_mem.c
@@ -116,7 +116,8 @@ void si_meminfo_node(struct sysinfo *val, int nid)
  * Determine whether the node should be displayed or not, depending on whether
  * SHOW_MEM_FILTER_NODES was passed to show_free_areas().
  */
-static bool show_mem_node_skip(unsigned int flags, int nid, nodemask_t *nodemask)
+static bool show_mem_node_skip(unsigned int flags, int nid,
+			       const nodemask_t *nodemask)
 {
 	if (!(flags & SHOW_MEM_FILTER_NODES))
 		return false;
@@ -177,7 +178,8 @@ static bool node_has_managed_zones(pg_data_t *pgdat, int max_zone_idx)
  * SHOW_MEM_FILTER_NODES: suppress nodes that are not allowed by current's
  *   cpuset.
  */
-static void show_free_areas(unsigned int filter, nodemask_t *nodemask, int max_zone_idx)
+static void show_free_areas(unsigned int filter, const nodemask_t *nodemask,
+			    int max_zone_idx)
 {
 	unsigned long free_pcp = 0;
 	int cpu, nid;
@@ -399,7 +401,8 @@ static void show_free_areas(unsigned int filter, nodemask_t *nodemask, int max_z
 	show_swap_cache_info();
 }
 
-void __show_mem(unsigned int filter, nodemask_t *nodemask, int max_zone_idx)
+void __show_mem(unsigned int filter, const nodemask_t *nodemask,
+		int max_zone_idx)
 {
 	unsigned long total = 0, reserved = 0, highmem = 0;
 	struct zone *zone;
diff --git a/mm/vmscan.c b/mm/vmscan.c
index b2fc8b626d3d..03e7f5206ad9 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -80,7 +80,7 @@ struct scan_control {
 	 * Nodemask of nodes allowed by the caller. If NULL, all nodes
 	 * are scanned.
 	 */
-	nodemask_t	*nodemask;
+	const nodemask_t *nodemask;
 
 	/*
 	 * The memory cgroup that hit its limit and as a result is the
@@ -6530,7 +6530,7 @@ static bool allow_direct_reclaim(pg_data_t *pgdat)
  * happens, the page allocator should not consider triggering the OOM killer.
  */
 static bool throttle_direct_reclaim(gfp_t gfp_mask, struct zonelist *zonelist,
-					nodemask_t *nodemask)
+				    const nodemask_t *nodemask)
 {
 	struct zoneref *z;
 	struct zone *zone;
@@ -6610,7 +6610,7 @@ static bool throttle_direct_reclaim(gfp_t gfp_mask, struct zonelist *zonelist,
 }
 
 unsigned long try_to_free_pages(struct zonelist *zonelist, int order,
-				gfp_t gfp_mask, nodemask_t *nodemask)
+				gfp_t gfp_mask, const nodemask_t *nodemask)
 {
 	unsigned long nr_reclaimed;
 	struct scan_control sc = {
-- 
2.51.1



^ permalink raw reply	[flat|nested] 34+ messages in thread

* [RFC PATCH v2 02/11] mm: change callers of __cpuset_zone_allowed to cpuset_zone_allowed
  2025-11-12 19:29 [RFC LPC2026 PATCH v2 00/11] Specific Purpose Memory NUMA Nodes Gregory Price
  2025-11-12 19:29 ` [RFC PATCH v2 01/11] mm: constify oom_control, scan_control, and alloc_context nodemask Gregory Price
@ 2025-11-12 19:29 ` Gregory Price
  2025-12-15  6:14   ` Balbir Singh
  2025-11-12 19:29 ` [RFC PATCH v2 03/11] gfp: Add GFP_SPM_NODE for Specific Purpose Memory (SPM) allocations Gregory Price
                   ` (12 subsequent siblings)
  14 siblings, 1 reply; 34+ messages in thread
From: Gregory Price @ 2025-11-12 19:29 UTC (permalink / raw)
  To: linux-mm
  Cc: kernel-team, linux-cxl, linux-kernel, nvdimm, linux-fsdevel,
	cgroups, dave, jonathan.cameron, dave.jiang, alison.schofield,
	vishal.l.verma, ira.weiny, dan.j.williams, longman, akpm, david,
	lorenzo.stoakes, Liam.Howlett, vbabka, rppt, surenb, mhocko,
	osalvador, ziy, matthew.brost, joshua.hahnjy, rakie.kim,
	byungchul, gourry, ying.huang, apopple, mingo, peterz,
	juri.lelli, vincent.guittot, dietmar.eggemann, rostedt, bsegall,
	mgorman, vschneid, tj, hannes, mkoutny, kees, muchun.song,
	roman.gushchin, shakeel.butt, rientjes, jackmanb, cl, harry.yoo,
	axelrasmussen, yuanchu, weixugc, zhengqi.arch, yosry.ahmed,
	nphamcs, chengming.zhou, fabio.m.de.francesco, rrichter, ming.li,
	usamaarif642, brauner, oleg, namcao, escape, dongjoo.seo1

All current callers of __cpuset_zone_allowed() presently check if
cpusets_enabled() is true first - which is the first check of the
cpuset_zone_allowed() function.

Signed-off-by: Gregory Price <gourry@gourry.net>
---
 mm/compaction.c |  7 +++----
 mm/page_alloc.c | 19 ++++++++-----------
 2 files changed, 11 insertions(+), 15 deletions(-)

diff --git a/mm/compaction.c b/mm/compaction.c
index 1e8f8eca318c..d2176935d3dd 100644
--- a/mm/compaction.c
+++ b/mm/compaction.c
@@ -2829,10 +2829,9 @@ enum compact_result try_to_compact_pages(gfp_t gfp_mask, unsigned int order,
 					ac->highest_zoneidx, ac->nodemask) {
 		enum compact_result status;
 
-		if (cpusets_enabled() &&
-			(alloc_flags & ALLOC_CPUSET) &&
-			!__cpuset_zone_allowed(zone, gfp_mask))
-				continue;
+		if ((alloc_flags & ALLOC_CPUSET) &&
+		    !cpuset_zone_allowed(zone, gfp_mask))
+			continue;
 
 		if (prio > MIN_COMPACT_PRIORITY
 					&& compaction_deferred(zone, order)) {
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index fd5401fb5e00..bcaf1125d109 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -3750,10 +3750,9 @@ get_page_from_freelist(gfp_t gfp_mask, unsigned int order, int alloc_flags,
 		struct page *page;
 		unsigned long mark;
 
-		if (cpusets_enabled() &&
-			(alloc_flags & ALLOC_CPUSET) &&
-			!__cpuset_zone_allowed(zone, gfp_mask))
-				continue;
+		if ((alloc_flags & ALLOC_CPUSET) &&
+		    !cpuset_zone_allowed(zone, gfp_mask))
+			continue;
 		/*
 		 * When allocating a page cache page for writing, we
 		 * want to get it from a node that is within its dirty
@@ -4553,10 +4552,9 @@ should_reclaim_retry(gfp_t gfp_mask, unsigned order,
 		unsigned long min_wmark = min_wmark_pages(zone);
 		bool wmark;
 
-		if (cpusets_enabled() &&
-			(alloc_flags & ALLOC_CPUSET) &&
-			!__cpuset_zone_allowed(zone, gfp_mask))
-				continue;
+		if ((alloc_flags & ALLOC_CPUSET) &&
+		    !cpuset_zone_allowed(zone, gfp_mask))
+			continue;
 
 		available = reclaimable = zone_reclaimable_pages(zone);
 		available += zone_page_state_snapshot(zone, NR_FREE_PAGES);
@@ -5052,10 +5050,9 @@ unsigned long alloc_pages_bulk_noprof(gfp_t gfp, int preferred_nid,
 	for_next_zone_zonelist_nodemask(zone, z, ac.highest_zoneidx, ac.nodemask) {
 		unsigned long mark;
 
-		if (cpusets_enabled() && (alloc_flags & ALLOC_CPUSET) &&
-		    !__cpuset_zone_allowed(zone, gfp)) {
+		if ((alloc_flags & ALLOC_CPUSET) &&
+		    !cpuset_zone_allowed(zone, gfp))
 			continue;
-		}
 
 		if (nr_online_nodes > 1 && zone != zonelist_zone(ac.preferred_zoneref) &&
 		    zone_to_nid(zone) != zonelist_node_idx(ac.preferred_zoneref)) {
-- 
2.51.1



^ permalink raw reply	[flat|nested] 34+ messages in thread

* [RFC PATCH v2 03/11] gfp: Add GFP_SPM_NODE for Specific Purpose Memory (SPM) allocations
  2025-11-12 19:29 [RFC LPC2026 PATCH v2 00/11] Specific Purpose Memory NUMA Nodes Gregory Price
  2025-11-12 19:29 ` [RFC PATCH v2 01/11] mm: constify oom_control, scan_control, and alloc_context nodemask Gregory Price
  2025-11-12 19:29 ` [RFC PATCH v2 02/11] mm: change callers of __cpuset_zone_allowed to cpuset_zone_allowed Gregory Price
@ 2025-11-12 19:29 ` Gregory Price
  2025-11-12 19:29 ` [RFC PATCH v2 04/11] memory-tiers: Introduce SysRAM and Specific Purpose Memory Nodes Gregory Price
                   ` (11 subsequent siblings)
  14 siblings, 0 replies; 34+ messages in thread
From: Gregory Price @ 2025-11-12 19:29 UTC (permalink / raw)
  To: linux-mm
  Cc: kernel-team, linux-cxl, linux-kernel, nvdimm, linux-fsdevel,
	cgroups, dave, jonathan.cameron, dave.jiang, alison.schofield,
	vishal.l.verma, ira.weiny, dan.j.williams, longman, akpm, david,
	lorenzo.stoakes, Liam.Howlett, vbabka, rppt, surenb, mhocko,
	osalvador, ziy, matthew.brost, joshua.hahnjy, rakie.kim,
	byungchul, gourry, ying.huang, apopple, mingo, peterz,
	juri.lelli, vincent.guittot, dietmar.eggemann, rostedt, bsegall,
	mgorman, vschneid, tj, hannes, mkoutny, kees, muchun.song,
	roman.gushchin, shakeel.butt, rientjes, jackmanb, cl, harry.yoo,
	axelrasmussen, yuanchu, weixugc, zhengqi.arch, yosry.ahmed,
	nphamcs, chengming.zhou, fabio.m.de.francesco, rrichter, ming.li,
	usamaarif642, brauner, oleg, namcao, escape, dongjoo.seo1

GFP_SPM_NODE changes the nodemask checks in the page allocator to include
the full set memory nodes, rather than just SysRAM nodes.

Signed-off-by: Gregory Price <gourry@gourry.net>
---
 include/linux/gfp_types.h | 5 +++++
 1 file changed, 5 insertions(+)

diff --git a/include/linux/gfp_types.h b/include/linux/gfp_types.h
index 65db9349f905..525ae891420e 100644
--- a/include/linux/gfp_types.h
+++ b/include/linux/gfp_types.h
@@ -58,6 +58,7 @@ enum {
 #ifdef CONFIG_SLAB_OBJ_EXT
 	___GFP_NO_OBJ_EXT_BIT,
 #endif
+	___GFP_SPM_NODE_BIT,
 	___GFP_LAST_BIT
 };
 
@@ -103,6 +104,7 @@ enum {
 #else
 #define ___GFP_NO_OBJ_EXT       0
 #endif
+#define ___GFP_SPM_NODE		BIT(___GFP_SPM_NODE_BIT)
 
 /*
  * Physical address zone modifiers (see linux/mmzone.h - low four bits)
@@ -145,6 +147,8 @@ enum {
  * %__GFP_ACCOUNT causes the allocation to be accounted to kmemcg.
  *
  * %__GFP_NO_OBJ_EXT causes slab allocation to have no object extension.
+ *
+ * %__GFP_SPM_NODE allows the use of Specific Purpose Memory Nodes
  */
 #define __GFP_RECLAIMABLE ((__force gfp_t)___GFP_RECLAIMABLE)
 #define __GFP_WRITE	((__force gfp_t)___GFP_WRITE)
@@ -152,6 +156,7 @@ enum {
 #define __GFP_THISNODE	((__force gfp_t)___GFP_THISNODE)
 #define __GFP_ACCOUNT	((__force gfp_t)___GFP_ACCOUNT)
 #define __GFP_NO_OBJ_EXT   ((__force gfp_t)___GFP_NO_OBJ_EXT)
+#define __GFP_SPM_NODE	((__force gfp_t)___GFP_SPM_NODE)
 
 /**
  * DOC: Watermark modifiers
-- 
2.51.1



^ permalink raw reply	[flat|nested] 34+ messages in thread

* [RFC PATCH v2 04/11] memory-tiers: Introduce SysRAM and Specific Purpose Memory Nodes
  2025-11-12 19:29 [RFC LPC2026 PATCH v2 00/11] Specific Purpose Memory NUMA Nodes Gregory Price
                   ` (2 preceding siblings ...)
  2025-11-12 19:29 ` [RFC PATCH v2 03/11] gfp: Add GFP_SPM_NODE for Specific Purpose Memory (SPM) allocations Gregory Price
@ 2025-11-12 19:29 ` Gregory Price
  2025-11-12 19:29 ` [RFC PATCH v2 05/11] mm: restrict slub, oom, compaction, and page_alloc to sysram by default Gregory Price
                   ` (10 subsequent siblings)
  14 siblings, 0 replies; 34+ messages in thread
From: Gregory Price @ 2025-11-12 19:29 UTC (permalink / raw)
  To: linux-mm
  Cc: kernel-team, linux-cxl, linux-kernel, nvdimm, linux-fsdevel,
	cgroups, dave, jonathan.cameron, dave.jiang, alison.schofield,
	vishal.l.verma, ira.weiny, dan.j.williams, longman, akpm, david,
	lorenzo.stoakes, Liam.Howlett, vbabka, rppt, surenb, mhocko,
	osalvador, ziy, matthew.brost, joshua.hahnjy, rakie.kim,
	byungchul, gourry, ying.huang, apopple, mingo, peterz,
	juri.lelli, vincent.guittot, dietmar.eggemann, rostedt, bsegall,
	mgorman, vschneid, tj, hannes, mkoutny, kees, muchun.song,
	roman.gushchin, shakeel.butt, rientjes, jackmanb, cl, harry.yoo,
	axelrasmussen, yuanchu, weixugc, zhengqi.arch, yosry.ahmed,
	nphamcs, chengming.zhou, fabio.m.de.francesco, rrichter, ming.li,
	usamaarif642, brauner, oleg, namcao, escape, dongjoo.seo1

Create Memory Node "types" (SysRAM and Specific Purpose) which can be
set at memory hotplug time.

SysRAM nodes present at __init time are added to the mt_sysram_nodelist
and memory hotplug will decide whether hotplugged nodes will be placed
in mt_sysram_nodelist or mt_spm_nodelist.

SPM nodes are not included in demotion targets.

Setting a node type is permanent and cannot be switched once set, this
prevents type-change race conditions on the global mt_sysram_nodelist.

Signed-off-by: Gregory Price <gourry@gourry.net>
---
 include/linux/memory-tiers.h | 47 +++++++++++++++++++++++++
 mm/memory-tiers.c            | 66 ++++++++++++++++++++++++++++++++++--
 2 files changed, 111 insertions(+), 2 deletions(-)

diff --git a/include/linux/memory-tiers.h b/include/linux/memory-tiers.h
index 7a805796fcfd..59443cbfaec3 100644
--- a/include/linux/memory-tiers.h
+++ b/include/linux/memory-tiers.h
@@ -35,10 +35,44 @@ struct memory_dev_type {
 
 struct access_coordinate;
 
+enum {
+	MT_NODE_TYPE_SYSRAM,
+	MT_NODE_TYPE_SPM
+};
+
 #ifdef CONFIG_NUMA
 extern bool numa_demotion_enabled;
 extern struct memory_dev_type *default_dram_type;
 extern nodemask_t default_dram_nodes;
+extern nodemask_t mt_sysram_nodelist;
+extern nodemask_t mt_spm_nodelist;
+static inline nodemask_t *mt_sysram_nodemask(void)
+{
+	if (nodes_empty(mt_sysram_nodelist))
+		return NULL;
+	return &mt_sysram_nodelist;
+}
+static inline void mt_nodemask_sysram_mask(nodemask_t *dst, nodemask_t *mask)
+{
+	/* If the sysram filter isn't available, this allows all */
+	if (nodes_empty(mt_sysram_nodelist)) {
+		nodes_or(*dst, *mask, NODE_MASK_NONE);
+		return;
+	}
+	nodes_and(*dst, *mask, mt_sysram_nodelist);
+}
+static inline bool mt_node_is_sysram(int nid)
+{
+	/* if sysram filter isn't setup, this allows all */
+	return nodes_empty(mt_sysram_nodelist) ||
+	       node_isset(nid, mt_sysram_nodelist);
+}
+static inline bool mt_node_allowed(int nid, gfp_t gfp_mask)
+{
+	if (gfp_mask & __GFP_SPM_NODE)
+		return true;
+	return mt_node_is_sysram(nid);
+}
 struct memory_dev_type *alloc_memory_type(int adistance);
 void put_memory_type(struct memory_dev_type *memtype);
 void init_node_memory_type(int node, struct memory_dev_type *default_type);
@@ -73,11 +107,19 @@ static inline bool node_is_toptier(int node)
 }
 #endif
 
+int mt_set_node_type(int node, int type);
+
 #else
 
 #define numa_demotion_enabled	false
 #define default_dram_type	NULL
 #define default_dram_nodes	NODE_MASK_NONE
+#define mt_sysram_nodelist	NODE_MASK_NONE
+#define mt_spm_nodelist		NODE_MASK_NONE
+static inline nodemask_t *mt_sysram_nodemask(void) { return NULL; }
+static inline void mt_nodemask_sysram_mask(nodemask_t *dst, nodemask_t *mask) {}
+static inline bool mt_node_is_sysram(int nid) { return true; }
+static inline bool mt_node_allowed(int nid, gfp_t gfp_mask) { return true; }
 /*
  * CONFIG_NUMA implementation returns non NULL error.
  */
@@ -151,5 +193,10 @@ static inline struct memory_dev_type *mt_find_alloc_memory_type(int adist,
 static inline void mt_put_memory_types(struct list_head *memory_types)
 {
 }
+
+int mt_set_node_type(int node, int type)
+{
+	return 0;
+}
 #endif	/* CONFIG_NUMA */
 #endif  /* _LINUX_MEMORY_TIERS_H */
diff --git a/mm/memory-tiers.c b/mm/memory-tiers.c
index 0ea5c13f10a2..dd6cfaa4c667 100644
--- a/mm/memory-tiers.c
+++ b/mm/memory-tiers.c
@@ -44,7 +44,15 @@ static LIST_HEAD(memory_tiers);
 static LIST_HEAD(default_memory_types);
 static struct node_memory_type_map node_memory_types[MAX_NUMNODES];
 struct memory_dev_type *default_dram_type;
-nodemask_t default_dram_nodes __initdata = NODE_MASK_NONE;
+
+/* default_dram_nodes is the list of nodes with both CPUs and RAM */
+nodemask_t default_dram_nodes = NODE_MASK_NONE;
+
+/* mt_sysram_nodelist is the list of nodes with SysramRAM */
+nodemask_t mt_sysram_nodelist = NODE_MASK_NONE;
+
+/* mt_spm_nodelist is the list of nodes with Specific Purpose Memory */
+nodemask_t mt_spm_nodelist = NODE_MASK_NONE;
 
 static const struct bus_type memory_tier_subsys = {
 	.name = "memory_tiering",
@@ -427,6 +435,14 @@ static void establish_demotion_targets(void)
 	disable_all_demotion_targets();
 
 	for_each_node_state(node, N_MEMORY) {
+		/*
+		 * If this is not a sysram node, direct-demotion is not allowed
+		 * and must be managed by special logic that understands the
+		 * memory features of that particular node.
+		 */
+		if (!node_isset(node, mt_sysram_nodelist))
+			continue;
+
 		best_distance = -1;
 		nd = &node_demotion[node];
 
@@ -457,7 +473,8 @@ static void establish_demotion_targets(void)
 				break;
 
 			distance = node_distance(node, target);
-			if (distance == best_distance || best_distance == -1) {
+			if ((distance == best_distance || best_distance == -1) &&
+			    node_isset(target, mt_sysram_nodelist)) {
 				best_distance = distance;
 				node_set(target, nd->preferred);
 			} else {
@@ -689,6 +706,48 @@ void mt_put_memory_types(struct list_head *memory_types)
 }
 EXPORT_SYMBOL_GPL(mt_put_memory_types);
 
+/**
+ * mt_set_node_type() - Set a NUMA Node's Memory type.
+ * @node: The node type to set
+ * @type: The type to set
+ *
+ * This is a one-way setting, once a type is assigned it cannot be cleared
+ * without resetting the system.  This is to avoid race conditions associated
+ * with moving nodes from one type to another during memory hotplug.
+ *
+ * Once a node is added as a SysRAM node, it will be used by default in
+ * the page allocator as a valid target when the calling does not provide
+ * a node or nodemask.  This is safe as the page allocator iterates through
+ * zones and uses this nodemask to filter zones - if a node is present but
+ * has no zones the node is ignored.
+ *
+ * Return: 0 if the node type is set successfully (or it's already set)
+ *         -EBUSY if the node has a different type already
+ *         -ENODEV if the type is invalid
+ */
+int mt_set_node_type(int node, int type)
+{
+	int err;
+
+	mutex_lock(&memory_tier_lock);
+	if (type == MT_NODE_TYPE_SYSRAM)
+		err = node_isset(node, mt_spm_nodelist) ? -EBUSY : 0;
+	else if (type == MT_NODE_TYPE_SPM)
+		err = node_isset(node, mt_sysram_nodelist) ? -EBUSY : 0;
+	if (err)
+		goto out;
+
+	if (type == MT_NODE_TYPE_SYSRAM)
+		node_set(node, mt_sysram_nodelist);
+	else if (type == MT_NODE_TYPE_SPM)
+		node_set(node, mt_spm_nodelist);
+	else
+		err = -ENODEV;
+out:
+	mutex_unlock(&memory_tier_lock);
+	return err;
+}
+
 /*
  * This is invoked via `late_initcall()` to initialize memory tiers for
  * memory nodes, both with and without CPUs. After the initialization of
@@ -922,6 +981,9 @@ static int __init memory_tier_init(void)
 	nodes_and(default_dram_nodes, node_states[N_MEMORY],
 		  node_states[N_CPU]);
 
+	/* Record all nodes with non-hotplugged memory as default SYSRAM nodes */
+	mt_sysram_nodelist = node_states[N_MEMORY];
+
 	hotplug_node_notifier(memtier_hotplug_callback, MEMTIER_HOTPLUG_PRI);
 	return 0;
 }
-- 
2.51.1



^ permalink raw reply	[flat|nested] 34+ messages in thread

* [RFC PATCH v2 05/11] mm: restrict slub, oom, compaction, and page_alloc to sysram by default
  2025-11-12 19:29 [RFC LPC2026 PATCH v2 00/11] Specific Purpose Memory NUMA Nodes Gregory Price
                   ` (3 preceding siblings ...)
  2025-11-12 19:29 ` [RFC PATCH v2 04/11] memory-tiers: Introduce SysRAM and Specific Purpose Memory Nodes Gregory Price
@ 2025-11-12 19:29 ` Gregory Price
  2025-11-12 19:29 ` [RFC PATCH v2 06/11] mm,cpusets: rename task->mems_allowed to task->sysram_nodes Gregory Price
                   ` (9 subsequent siblings)
  14 siblings, 0 replies; 34+ messages in thread
From: Gregory Price @ 2025-11-12 19:29 UTC (permalink / raw)
  To: linux-mm
  Cc: kernel-team, linux-cxl, linux-kernel, nvdimm, linux-fsdevel,
	cgroups, dave, jonathan.cameron, dave.jiang, alison.schofield,
	vishal.l.verma, ira.weiny, dan.j.williams, longman, akpm, david,
	lorenzo.stoakes, Liam.Howlett, vbabka, rppt, surenb, mhocko,
	osalvador, ziy, matthew.brost, joshua.hahnjy, rakie.kim,
	byungchul, gourry, ying.huang, apopple, mingo, peterz,
	juri.lelli, vincent.guittot, dietmar.eggemann, rostedt, bsegall,
	mgorman, vschneid, tj, hannes, mkoutny, kees, muchun.song,
	roman.gushchin, shakeel.butt, rientjes, jackmanb, cl, harry.yoo,
	axelrasmussen, yuanchu, weixugc, zhengqi.arch, yosry.ahmed,
	nphamcs, chengming.zhou, fabio.m.de.francesco, rrichter, ming.li,
	usamaarif642, brauner, oleg, namcao, escape, dongjoo.seo1

Restrict page allocation and zone iteration behavior in mm to skip
SPM Nodes via cpusets, or mt_sysram_nodelist when cpusets is disabled.

This constrains core users of nodemasks to the mt_sysram_nodelist, which
is guaranteed to at least contain the set of nodes with sysram memory
blocks present at boot (or NULL if NUMA is compiled out).

If the sysram nodelist is empty (something in memory-tiers broken),
return NULL, which still allows all zones to be iterated.

Signed-off-by: Gregory Price <gourry@gourry.net>
---
 mm/compaction.c |  3 +++
 mm/oom_kill.c   |  5 ++++-
 mm/page_alloc.c | 18 ++++++++++++++----
 mm/slub.c       | 15 ++++++++++++---
 4 files changed, 33 insertions(+), 8 deletions(-)

diff --git a/mm/compaction.c b/mm/compaction.c
index d2176935d3dd..7b73179d1fbf 100644
--- a/mm/compaction.c
+++ b/mm/compaction.c
@@ -13,6 +13,7 @@
 #include <linux/migrate.h>
 #include <linux/compaction.h>
 #include <linux/mm_inline.h>
+#include <linux/memory-tiers.h>
 #include <linux/sched/signal.h>
 #include <linux/backing-dev.h>
 #include <linux/sysctl.h>
@@ -2832,6 +2833,8 @@ enum compact_result try_to_compact_pages(gfp_t gfp_mask, unsigned int order,
 		if ((alloc_flags & ALLOC_CPUSET) &&
 		    !cpuset_zone_allowed(zone, gfp_mask))
 			continue;
+		else if (!mt_node_allowed(zone_to_nid(zone), gfp_mask))
+			continue;
 
 		if (prio > MIN_COMPACT_PRIORITY
 					&& compaction_deferred(zone, order)) {
diff --git a/mm/oom_kill.c b/mm/oom_kill.c
index c145b0feecc1..386b4ceeaeb8 100644
--- a/mm/oom_kill.c
+++ b/mm/oom_kill.c
@@ -34,6 +34,7 @@
 #include <linux/export.h>
 #include <linux/notifier.h>
 #include <linux/memcontrol.h>
+#include <linux/memory-tiers.h>
 #include <linux/mempolicy.h>
 #include <linux/security.h>
 #include <linux/ptrace.h>
@@ -1118,6 +1119,8 @@ EXPORT_SYMBOL_GPL(unregister_oom_notifier);
 bool out_of_memory(struct oom_control *oc)
 {
 	unsigned long freed = 0;
+	if (!oc->nodemask)
+		oc->nodemask = mt_sysram_nodemask();
 
 	if (oom_killer_disabled)
 		return false;
@@ -1154,7 +1157,7 @@ bool out_of_memory(struct oom_control *oc)
 	 */
 	oc->constraint = constrained_alloc(oc);
 	if (oc->constraint != CONSTRAINT_MEMORY_POLICY)
-		oc->nodemask = NULL;
+		oc->nodemask = mt_sysram_nodemask();
 	check_panic_on_oom(oc);
 
 	if (!is_memcg_oom(oc) && sysctl_oom_kill_allocating_task &&
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index bcaf1125d109..2ea6a50f6079 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -34,6 +34,7 @@
 #include <linux/cpuset.h>
 #include <linux/pagevec.h>
 #include <linux/memory_hotplug.h>
+#include <linux/memory-tiers.h>
 #include <linux/nodemask.h>
 #include <linux/vmstat.h>
 #include <linux/fault-inject.h>
@@ -3753,6 +3754,8 @@ get_page_from_freelist(gfp_t gfp_mask, unsigned int order, int alloc_flags,
 		if ((alloc_flags & ALLOC_CPUSET) &&
 		    !cpuset_zone_allowed(zone, gfp_mask))
 			continue;
+		else if (!mt_node_allowed(zone_to_nid(zone), gfp_mask))
+			continue;
 		/*
 		 * When allocating a page cache page for writing, we
 		 * want to get it from a node that is within its dirty
@@ -4555,6 +4558,8 @@ should_reclaim_retry(gfp_t gfp_mask, unsigned order,
 		if ((alloc_flags & ALLOC_CPUSET) &&
 		    !cpuset_zone_allowed(zone, gfp_mask))
 			continue;
+		else if (!mt_node_allowed(zone_to_nid(zone), gfp_mask))
+			continue;
 
 		available = reclaimable = zone_reclaimable_pages(zone);
 		available += zone_page_state_snapshot(zone, NR_FREE_PAGES);
@@ -4608,7 +4613,7 @@ check_retry_cpuset(int cpuset_mems_cookie, struct alloc_context *ac)
 	 */
 	if (cpusets_enabled() && ac->nodemask &&
 			!cpuset_nodemask_valid_mems_allowed(ac->nodemask)) {
-		ac->nodemask = NULL;
+		ac->nodemask = mt_sysram_nodemask();
 		return true;
 	}
 
@@ -4792,7 +4797,7 @@ __alloc_pages_slowpath(gfp_t gfp_mask, unsigned int order,
 	 * user oriented.
 	 */
 	if (!(alloc_flags & ALLOC_CPUSET) || reserve_flags) {
-		ac->nodemask = NULL;
+		ac->nodemask = mt_sysram_nodemask();
 		ac->preferred_zoneref = first_zones_zonelist(ac->zonelist,
 					ac->highest_zoneidx, ac->nodemask);
 	}
@@ -4944,7 +4949,8 @@ static inline bool prepare_alloc_pages(gfp_t gfp_mask, unsigned int order,
 			ac->nodemask = &cpuset_current_mems_allowed;
 		else
 			*alloc_flags |= ALLOC_CPUSET;
-	}
+	} else if (!ac->nodemask) /* sysram_nodes may be NULL during __init */
+		ac->nodemask = mt_sysram_nodemask();
 
 	might_alloc(gfp_mask);
 
@@ -5053,6 +5059,8 @@ unsigned long alloc_pages_bulk_noprof(gfp_t gfp, int preferred_nid,
 		if ((alloc_flags & ALLOC_CPUSET) &&
 		    !cpuset_zone_allowed(zone, gfp))
 			continue;
+		else if (!mt_node_allowed(zone_to_nid(zone), gfp))
+			continue;
 
 		if (nr_online_nodes > 1 && zone != zonelist_zone(ac.preferred_zoneref) &&
 		    zone_to_nid(zone) != zonelist_node_idx(ac.preferred_zoneref)) {
@@ -5187,8 +5195,10 @@ struct page *__alloc_frozen_pages_noprof(gfp_t gfp, unsigned int order,
 	/*
 	 * Restore the original nodemask if it was potentially replaced with
 	 * &cpuset_current_mems_allowed to optimize the fast-path attempt.
+	 *
+	 * If not set, default to sysram nodes.
 	 */
-	ac.nodemask = nodemask;
+	ac.nodemask = nodemask ? nodemask : mt_sysram_nodemask();
 
 	page = __alloc_pages_slowpath(alloc_gfp, order, &ac);
 
diff --git a/mm/slub.c b/mm/slub.c
index 1bf65c421325..c857db97c6a0 100644
--- a/mm/slub.c
+++ b/mm/slub.c
@@ -28,6 +28,7 @@
 #include <linux/cpu.h>
 #include <linux/cpuset.h>
 #include <linux/mempolicy.h>
+#include <linux/memory-tiers.h>
 #include <linux/ctype.h>
 #include <linux/stackdepot.h>
 #include <linux/debugobjects.h>
@@ -3576,11 +3577,19 @@ static struct slab *get_any_partial(struct kmem_cache *s,
 		zonelist = node_zonelist(mempolicy_slab_node(), pc->flags);
 		for_each_zone_zonelist(zone, z, zonelist, highest_zoneidx) {
 			struct kmem_cache_node *n;
+			int nid = zone_to_nid(zone);
+			bool allowed;
 
-			n = get_node(s, zone_to_nid(zone));
+			n = get_node(s, nid);
+			if (!n)
+				continue;
+
+			if (cpusets_enabled())
+				allowed = __cpuset_zone_allowed(zone, pc->flags);
+			else
+				allowed = mt_node_allowed(nid, pc->flags);
 
-			if (n && cpuset_zone_allowed(zone, pc->flags) &&
-					n->nr_partial > s->min_partial) {
+			if (allowed && (n->nr_partial > s->min_partial)) {
 				slab = get_partial_node(s, n, pc);
 				if (slab) {
 					/*
-- 
2.51.1



^ permalink raw reply	[flat|nested] 34+ messages in thread

* [RFC PATCH v2 06/11] mm,cpusets: rename task->mems_allowed to task->sysram_nodes
  2025-11-12 19:29 [RFC LPC2026 PATCH v2 00/11] Specific Purpose Memory NUMA Nodes Gregory Price
                   ` (4 preceding siblings ...)
  2025-11-12 19:29 ` [RFC PATCH v2 05/11] mm: restrict slub, oom, compaction, and page_alloc to sysram by default Gregory Price
@ 2025-11-12 19:29 ` Gregory Price
  2025-11-12 19:29 ` [RFC PATCH v2 07/11] cpuset: introduce cpuset.mems.sysram Gregory Price
                   ` (8 subsequent siblings)
  14 siblings, 0 replies; 34+ messages in thread
From: Gregory Price @ 2025-11-12 19:29 UTC (permalink / raw)
  To: linux-mm
  Cc: kernel-team, linux-cxl, linux-kernel, nvdimm, linux-fsdevel,
	cgroups, dave, jonathan.cameron, dave.jiang, alison.schofield,
	vishal.l.verma, ira.weiny, dan.j.williams, longman, akpm, david,
	lorenzo.stoakes, Liam.Howlett, vbabka, rppt, surenb, mhocko,
	osalvador, ziy, matthew.brost, joshua.hahnjy, rakie.kim,
	byungchul, gourry, ying.huang, apopple, mingo, peterz,
	juri.lelli, vincent.guittot, dietmar.eggemann, rostedt, bsegall,
	mgorman, vschneid, tj, hannes, mkoutny, kees, muchun.song,
	roman.gushchin, shakeel.butt, rientjes, jackmanb, cl, harry.yoo,
	axelrasmussen, yuanchu, weixugc, zhengqi.arch, yosry.ahmed,
	nphamcs, chengming.zhou, fabio.m.de.francesco, rrichter, ming.li,
	usamaarif642, brauner, oleg, namcao, escape, dongjoo.seo1

task->mems_allowed actually contains the value of cpuset.effective_mems

The value of cpuset.mems.effective is the intersection of mems_allowed
and the cpuset's parent's mems.effective.  This creates a confusing
naming scheme between references to task->mems_allowed, and cpuset
mems_allowed and effective_mems.

With the intent of making this nodemask only contain SystemRAM Nodes
(i.e. omitting Specific Purpose Memory Nodes), rename task->mems_allowed
to task->sysram_nodes.  This accomplishes two things:

1) Detach task->mems_allowed and cpuset.mems_allowed naming scheme,
   making it slightly clearer that these may contain different values.

2) To enable cpusets.mems_allowed to contain SPM Nodes, letting a cgroup
   still control whether SPM nodes are "allowed" for that context, even
   if these nodes are not reachable through existing means.

Signed-off-by: Gregory Price <gourry@gourry.net>
---
 fs/proc/array.c           |  2 +-
 include/linux/cpuset.h    | 54 ++++++++++++++-------------
 include/linux/mempolicy.h |  2 +-
 include/linux/sched.h     |  6 +--
 init/init_task.c          |  2 +-
 kernel/cgroup/cpuset.c    | 78 +++++++++++++++++++--------------------
 kernel/fork.c             |  2 +-
 kernel/sched/fair.c       |  4 +-
 mm/hugetlb.c              |  8 ++--
 mm/mempolicy.c            | 28 +++++++-------
 mm/oom_kill.c             |  6 +--
 mm/page_alloc.c           | 16 ++++----
 mm/show_mem.c             |  2 +-
 13 files changed, 106 insertions(+), 104 deletions(-)

diff --git a/fs/proc/array.c b/fs/proc/array.c
index 2ae63189091e..61ee857a5caf 100644
--- a/fs/proc/array.c
+++ b/fs/proc/array.c
@@ -456,7 +456,7 @@ int proc_pid_status(struct seq_file *m, struct pid_namespace *ns,
 	task_cap(m, task);
 	task_seccomp(m, task);
 	task_cpus_allowed(m, task);
-	cpuset_task_status_allowed(m, task);
+	cpuset_task_status_sysram(m, task);
 	task_context_switch_counts(m, task);
 	arch_proc_pid_thread_features(m, task);
 	return 0;
diff --git a/include/linux/cpuset.h b/include/linux/cpuset.h
index 548eaf7ef8d0..9baaf19431b5 100644
--- a/include/linux/cpuset.h
+++ b/include/linux/cpuset.h
@@ -23,14 +23,14 @@
 /*
  * Static branch rewrites can happen in an arbitrary order for a given
  * key. In code paths where we need to loop with read_mems_allowed_begin() and
- * read_mems_allowed_retry() to get a consistent view of mems_allowed, we need
- * to ensure that begin() always gets rewritten before retry() in the
+ * read_mems_allowed_retry() to get a consistent view of task->sysram_nodes, we
+ * need to ensure that begin() always gets rewritten before retry() in the
  * disabled -> enabled transition. If not, then if local irqs are disabled
  * around the loop, we can deadlock since retry() would always be
- * comparing the latest value of the mems_allowed seqcount against 0 as
+ * comparing the latest value of the sysram_nodes seqcount against 0 as
  * begin() still would see cpusets_enabled() as false. The enabled -> disabled
  * transition should happen in reverse order for the same reasons (want to stop
- * looking at real value of mems_allowed.sequence in retry() first).
+ * looking at real value of sysram_nodes.sequence in retry() first).
  */
 extern struct static_key_false cpusets_pre_enable_key;
 extern struct static_key_false cpusets_enabled_key;
@@ -78,9 +78,10 @@ extern void cpuset_cpus_allowed(struct task_struct *p, struct cpumask *mask);
 extern bool cpuset_cpus_allowed_fallback(struct task_struct *p);
 extern bool cpuset_cpu_is_isolated(int cpu);
 extern nodemask_t cpuset_mems_allowed(struct task_struct *p);
-#define cpuset_current_mems_allowed (current->mems_allowed)
-void cpuset_init_current_mems_allowed(void);
-int cpuset_nodemask_valid_mems_allowed(const nodemask_t *nodemask);
+#define cpuset_current_sysram_nodes (current->sysram_nodes)
+#define cpuset_current_mems_allowed (cpuset_current_sysram_nodes)
+void cpuset_init_current_sysram_nodes(void);
+int cpuset_nodemask_valid_sysram_nodes(const nodemask_t *nodemask);
 
 extern bool cpuset_current_node_allowed(int node, gfp_t gfp_mask);
 
@@ -96,7 +97,7 @@ static inline bool cpuset_zone_allowed(struct zone *z, gfp_t gfp_mask)
 	return true;
 }
 
-extern int cpuset_mems_allowed_intersects(const struct task_struct *tsk1,
+extern int cpuset_sysram_nodes_intersects(const struct task_struct *tsk1,
 					  const struct task_struct *tsk2);
 
 #ifdef CONFIG_CPUSETS_V1
@@ -111,8 +112,8 @@ extern void __cpuset_memory_pressure_bump(void);
 static inline void cpuset_memory_pressure_bump(void) { }
 #endif
 
-extern void cpuset_task_status_allowed(struct seq_file *m,
-					struct task_struct *task);
+extern void cpuset_task_status_sysram(struct seq_file *m,
+				      struct task_struct *task);
 extern int proc_cpuset_show(struct seq_file *m, struct pid_namespace *ns,
 			    struct pid *pid, struct task_struct *tsk);
 
@@ -128,12 +129,12 @@ extern bool current_cpuset_is_being_rebound(void);
 extern void dl_rebuild_rd_accounting(void);
 extern void rebuild_sched_domains(void);
 
-extern void cpuset_print_current_mems_allowed(void);
+extern void cpuset_print_current_sysram_nodes(void);
 extern void cpuset_reset_sched_domains(void);
 
 /*
- * read_mems_allowed_begin is required when making decisions involving
- * mems_allowed such as during page allocation. mems_allowed can be updated in
+ * read_mems_allowed_begin is required when making decisions involving a task's
+ * sysram_nodes such as during page allocation. sysram_nodes can be updated in
  * parallel and depending on the new value an operation can fail potentially
  * causing process failure. A retry loop with read_mems_allowed_begin and
  * read_mems_allowed_retry prevents these artificial failures.
@@ -143,13 +144,13 @@ static inline unsigned int read_mems_allowed_begin(void)
 	if (!static_branch_unlikely(&cpusets_pre_enable_key))
 		return 0;
 
-	return read_seqcount_begin(&current->mems_allowed_seq);
+	return read_seqcount_begin(&current->sysram_nodes_seq);
 }
 
 /*
  * If this returns true, the operation that took place after
  * read_mems_allowed_begin may have failed artificially due to a concurrent
- * update of mems_allowed. It is up to the caller to retry the operation if
+ * update of sysram_nodes. It is up to the caller to retry the operation if
  * appropriate.
  */
 static inline bool read_mems_allowed_retry(unsigned int seq)
@@ -157,7 +158,7 @@ static inline bool read_mems_allowed_retry(unsigned int seq)
 	if (!static_branch_unlikely(&cpusets_enabled_key))
 		return false;
 
-	return read_seqcount_retry(&current->mems_allowed_seq, seq);
+	return read_seqcount_retry(&current->sysram_nodes_seq, seq);
 }
 
 static inline void set_mems_allowed(nodemask_t nodemask)
@@ -166,9 +167,9 @@ static inline void set_mems_allowed(nodemask_t nodemask)
 
 	task_lock(current);
 	local_irq_save(flags);
-	write_seqcount_begin(&current->mems_allowed_seq);
-	current->mems_allowed = nodemask;
-	write_seqcount_end(&current->mems_allowed_seq);
+	write_seqcount_begin(&current->sysram_nodes_seq);
+	current->sysram_nodes = nodemask;
+	write_seqcount_end(&current->sysram_nodes_seq);
 	local_irq_restore(flags);
 	task_unlock(current);
 }
@@ -216,10 +217,11 @@ static inline nodemask_t cpuset_mems_allowed(struct task_struct *p)
 	return node_possible_map;
 }
 
-#define cpuset_current_mems_allowed (node_states[N_MEMORY])
-static inline void cpuset_init_current_mems_allowed(void) {}
+#define cpuset_current_sysram_nodes (node_states[N_MEMORY])
+#define cpuset_current_mems_allowed (cpuset_current_sysram_nodes)
+static inline void cpuset_init_current_sysram_nodes(void) {}
 
-static inline int cpuset_nodemask_valid_mems_allowed(const nodemask_t *nodemask)
+static inline int cpuset_nodemask_valid_sysram_nodes(const nodemask_t *nodemask)
 {
 	return 1;
 }
@@ -234,7 +236,7 @@ static inline bool cpuset_zone_allowed(struct zone *z, gfp_t gfp_mask)
 	return true;
 }
 
-static inline int cpuset_mems_allowed_intersects(const struct task_struct *tsk1,
+static inline int cpuset_sysram_nodes_intersects(const struct task_struct *tsk1,
 						 const struct task_struct *tsk2)
 {
 	return 1;
@@ -242,8 +244,8 @@ static inline int cpuset_mems_allowed_intersects(const struct task_struct *tsk1,
 
 static inline void cpuset_memory_pressure_bump(void) {}
 
-static inline void cpuset_task_status_allowed(struct seq_file *m,
-						struct task_struct *task)
+static inline void cpuset_task_status_sysram(struct seq_file *m,
+					     struct task_struct *task)
 {
 }
 
@@ -276,7 +278,7 @@ static inline void cpuset_reset_sched_domains(void)
 	partition_sched_domains(1, NULL, NULL);
 }
 
-static inline void cpuset_print_current_mems_allowed(void)
+static inline void cpuset_print_current_sysram_nodes(void)
 {
 }
 
diff --git a/include/linux/mempolicy.h b/include/linux/mempolicy.h
index 0fe96f3ab3ef..f9a2b1bed3fa 100644
--- a/include/linux/mempolicy.h
+++ b/include/linux/mempolicy.h
@@ -52,7 +52,7 @@ struct mempolicy {
 	int home_node;		/* Home node to use for MPOL_BIND and MPOL_PREFERRED_MANY */
 
 	union {
-		nodemask_t cpuset_mems_allowed;	/* relative to these nodes */
+		nodemask_t cpuset_sysram_nodes;	/* relative to these nodes */
 		nodemask_t user_nodemask;	/* nodemask passed by user */
 	} w;
 };
diff --git a/include/linux/sched.h b/include/linux/sched.h
index b469878de25c..ad2d0cb00772 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -1223,7 +1223,7 @@ struct task_struct {
 	u64				parent_exec_id;
 	u64				self_exec_id;
 
-	/* Protection against (de-)allocation: mm, files, fs, tty, keyrings, mems_allowed, mempolicy: */
+	/* Protection against (de-)allocation: mm, files, fs, tty, keyrings, sysram_nodes, mempolicy: */
 	spinlock_t			alloc_lock;
 
 	/* Protection of the PI data structures: */
@@ -1314,9 +1314,9 @@ struct task_struct {
 #endif
 #ifdef CONFIG_CPUSETS
 	/* Protected by ->alloc_lock: */
-	nodemask_t			mems_allowed;
+	nodemask_t			sysram_nodes;
 	/* Sequence number to catch updates: */
-	seqcount_spinlock_t		mems_allowed_seq;
+	seqcount_spinlock_t		sysram_nodes_seq;
 	int				cpuset_mem_spread_rotor;
 #endif
 #ifdef CONFIG_CGROUPS
diff --git a/init/init_task.c b/init/init_task.c
index a55e2189206f..857a5978d403 100644
--- a/init/init_task.c
+++ b/init/init_task.c
@@ -173,7 +173,7 @@ struct task_struct init_task __aligned(L1_CACHE_BYTES) = {
 	.trc_blkd_node = LIST_HEAD_INIT(init_task.trc_blkd_node),
 #endif
 #ifdef CONFIG_CPUSETS
-	.mems_allowed_seq = SEQCNT_SPINLOCK_ZERO(init_task.mems_allowed_seq,
+	.sysram_nodes_seq = SEQCNT_SPINLOCK_ZERO(init_task.sysram_nodes_seq,
 						 &init_task.alloc_lock),
 #endif
 #ifdef CONFIG_RT_MUTEXES
diff --git a/kernel/cgroup/cpuset.c b/kernel/cgroup/cpuset.c
index cd3e2ae83d70..f0c59621a7f2 100644
--- a/kernel/cgroup/cpuset.c
+++ b/kernel/cgroup/cpuset.c
@@ -240,7 +240,7 @@ static struct cpuset top_cpuset = {
  * If a task is only holding callback_lock, then it has read-only
  * access to cpusets.
  *
- * Now, the task_struct fields mems_allowed and mempolicy may be changed
+ * Now, the task_struct fields sysram_nodes and mempolicy may be changed
  * by other task, we use alloc_lock in the task_struct fields to protect
  * them.
  *
@@ -2678,11 +2678,11 @@ static void schedule_flush_migrate_mm(void)
 }
 
 /*
- * cpuset_change_task_nodemask - change task's mems_allowed and mempolicy
+ * cpuset_change_task_nodemask - change task's sysram_nodes and mempolicy
  * @tsk: the task to change
  * @newmems: new nodes that the task will be set
  *
- * We use the mems_allowed_seq seqlock to safely update both tsk->mems_allowed
+ * We use the sysram_nodes_seq seqlock to safely update both tsk->sysram_nodes
  * and rebind an eventual tasks' mempolicy. If the task is allocating in
  * parallel, it might temporarily see an empty intersection, which results in
  * a seqlock check and retry before OOM or allocation failure.
@@ -2693,13 +2693,13 @@ static void cpuset_change_task_nodemask(struct task_struct *tsk,
 	task_lock(tsk);
 
 	local_irq_disable();
-	write_seqcount_begin(&tsk->mems_allowed_seq);
+	write_seqcount_begin(&tsk->sysram_nodes_seq);
 
-	nodes_or(tsk->mems_allowed, tsk->mems_allowed, *newmems);
+	nodes_or(tsk->sysram_nodes, tsk->sysram_nodes, *newmems);
 	mpol_rebind_task(tsk, newmems);
-	tsk->mems_allowed = *newmems;
+	tsk->sysram_nodes = *newmems;
 
-	write_seqcount_end(&tsk->mems_allowed_seq);
+	write_seqcount_end(&tsk->sysram_nodes_seq);
 	local_irq_enable();
 
 	task_unlock(tsk);
@@ -2709,9 +2709,9 @@ static void *cpuset_being_rebound;
 
 /**
  * cpuset_update_tasks_nodemask - Update the nodemasks of tasks in the cpuset.
- * @cs: the cpuset in which each task's mems_allowed mask needs to be changed
+ * @cs: the cpuset in which each task's sysram_nodes mask needs to be changed
  *
- * Iterate through each task of @cs updating its mems_allowed to the
+ * Iterate through each task of @cs updating its sysram_nodes to the
  * effective cpuset's.  As this function is called with cpuset_mutex held,
  * cpuset membership stays stable.
  */
@@ -3763,7 +3763,7 @@ static void cpuset_fork(struct task_struct *task)
 			return;
 
 		set_cpus_allowed_ptr(task, current->cpus_ptr);
-		task->mems_allowed = current->mems_allowed;
+		task->sysram_nodes = current->sysram_nodes;
 		return;
 	}
 
@@ -4205,9 +4205,9 @@ bool cpuset_cpus_allowed_fallback(struct task_struct *tsk)
 	return changed;
 }
 
-void __init cpuset_init_current_mems_allowed(void)
+void __init cpuset_init_current_sysram_nodes(void)
 {
-	nodes_setall(current->mems_allowed);
+	nodes_setall(current->sysram_nodes);
 }
 
 /**
@@ -4233,14 +4233,14 @@ nodemask_t cpuset_mems_allowed(struct task_struct *tsk)
 }
 
 /**
- * cpuset_nodemask_valid_mems_allowed - check nodemask vs. current mems_allowed
+ * cpuset_nodemask_valid_sysram_nodes - check nodemask vs. current sysram_nodes
  * @nodemask: the nodemask to be checked
  *
- * Are any of the nodes in the nodemask allowed in current->mems_allowed?
+ * Are any of the nodes in the nodemask allowed in current->sysram_nodes?
  */
-int cpuset_nodemask_valid_mems_allowed(const nodemask_t *nodemask)
+int cpuset_nodemask_valid_sysram_nodes(const nodemask_t *nodemask)
 {
-	return nodes_intersects(*nodemask, current->mems_allowed);
+	return nodes_intersects(*nodemask, current->sysram_nodes);
 }
 
 /*
@@ -4262,7 +4262,7 @@ static struct cpuset *nearest_hardwall_ancestor(struct cpuset *cs)
  * @gfp_mask: memory allocation flags
  *
  * If we're in interrupt, yes, we can always allocate.  If @node is set in
- * current's mems_allowed, yes.  If it's not a __GFP_HARDWALL request and this
+ * current's sysram_nodes, yes.  If it's not a __GFP_HARDWALL request and this
  * node is set in the nearest hardwalled cpuset ancestor to current's cpuset,
  * yes.  If current has access to memory reserves as an oom victim, yes.
  * Otherwise, no.
@@ -4276,7 +4276,7 @@ static struct cpuset *nearest_hardwall_ancestor(struct cpuset *cs)
  * Scanning up parent cpusets requires callback_lock.  The
  * __alloc_pages() routine only calls here with __GFP_HARDWALL bit
  * _not_ set if it's a GFP_KERNEL allocation, and all nodes in the
- * current tasks mems_allowed came up empty on the first pass over
+ * current tasks sysram_nodes came up empty on the first pass over
  * the zonelist.  So only GFP_KERNEL allocations, if all nodes in the
  * cpuset are short of memory, might require taking the callback_lock.
  *
@@ -4304,7 +4304,7 @@ bool cpuset_current_node_allowed(int node, gfp_t gfp_mask)
 
 	if (in_interrupt())
 		return true;
-	if (node_isset(node, current->mems_allowed))
+	if (node_isset(node, current->sysram_nodes))
 		return true;
 	/*
 	 * Allow tasks that have access to memory reserves because they have
@@ -4375,13 +4375,13 @@ bool cpuset_node_allowed(struct cgroup *cgroup, int nid)
  * certain page cache or slab cache pages such as used for file
  * system buffers and inode caches, then instead of starting on the
  * local node to look for a free page, rather spread the starting
- * node around the tasks mems_allowed nodes.
+ * node around the tasks sysram_nodes nodes.
  *
  * We don't have to worry about the returned node being offline
  * because "it can't happen", and even if it did, it would be ok.
  *
  * The routines calling guarantee_online_mems() are careful to
- * only set nodes in task->mems_allowed that are online.  So it
+ * only set nodes in task->sysram_nodes that are online.  So it
  * should not be possible for the following code to return an
  * offline node.  But if it did, that would be ok, as this routine
  * is not returning the node where the allocation must be, only
@@ -4392,7 +4392,7 @@ bool cpuset_node_allowed(struct cgroup *cgroup, int nid)
  */
 static int cpuset_spread_node(int *rotor)
 {
-	return *rotor = next_node_in(*rotor, current->mems_allowed);
+	return *rotor = next_node_in(*rotor, current->sysram_nodes);
 }
 
 /**
@@ -4402,35 +4402,35 @@ int cpuset_mem_spread_node(void)
 {
 	if (current->cpuset_mem_spread_rotor == NUMA_NO_NODE)
 		current->cpuset_mem_spread_rotor =
-			node_random(&current->mems_allowed);
+			node_random(&current->sysram_nodes);
 
 	return cpuset_spread_node(&current->cpuset_mem_spread_rotor);
 }
 
 /**
- * cpuset_mems_allowed_intersects - Does @tsk1's mems_allowed intersect @tsk2's?
+ * cpuset_sysram_nodes_intersects - Does @tsk1's sysram_nodes intersect @tsk2's?
  * @tsk1: pointer to task_struct of some task.
  * @tsk2: pointer to task_struct of some other task.
  *
- * Description: Return true if @tsk1's mems_allowed intersects the
- * mems_allowed of @tsk2.  Used by the OOM killer to determine if
+ * Description: Return true if @tsk1's sysram_nodes intersects the
+ * sysram_nodes of @tsk2.  Used by the OOM killer to determine if
  * one of the task's memory usage might impact the memory available
  * to the other.
  **/
 
-int cpuset_mems_allowed_intersects(const struct task_struct *tsk1,
+int cpuset_sysram_nodes_intersects(const struct task_struct *tsk1,
 				   const struct task_struct *tsk2)
 {
-	return nodes_intersects(tsk1->mems_allowed, tsk2->mems_allowed);
+	return nodes_intersects(tsk1->sysram_nodes, tsk2->sysram_nodes);
 }
 
 /**
- * cpuset_print_current_mems_allowed - prints current's cpuset and mems_allowed
+ * cpuset_print_current_sysram_nodes - prints current's cpuset and sysram_nodes
  *
  * Description: Prints current's name, cpuset name, and cached copy of its
- * mems_allowed to the kernel log.
+ * sysram_nodes to the kernel log.
  */
-void cpuset_print_current_mems_allowed(void)
+void cpuset_print_current_sysram_nodes(void)
 {
 	struct cgroup *cgrp;
 
@@ -4439,17 +4439,17 @@ void cpuset_print_current_mems_allowed(void)
 	cgrp = task_cs(current)->css.cgroup;
 	pr_cont(",cpuset=");
 	pr_cont_cgroup_name(cgrp);
-	pr_cont(",mems_allowed=%*pbl",
-		nodemask_pr_args(&current->mems_allowed));
+	pr_cont(",sysram_nodes=%*pbl",
+		nodemask_pr_args(&current->sysram_nodes));
 
 	rcu_read_unlock();
 }
 
-/* Display task mems_allowed in /proc/<pid>/status file. */
-void cpuset_task_status_allowed(struct seq_file *m, struct task_struct *task)
+/* Display task sysram_nodes in /proc/<pid>/status file. */
+void cpuset_task_status_sysram(struct seq_file *m, struct task_struct *task)
 {
-	seq_printf(m, "Mems_allowed:\t%*pb\n",
-		   nodemask_pr_args(&task->mems_allowed));
-	seq_printf(m, "Mems_allowed_list:\t%*pbl\n",
-		   nodemask_pr_args(&task->mems_allowed));
+	seq_printf(m, "Sysram_nodes:\t%*pb\n",
+		   nodemask_pr_args(&task->sysram_nodes));
+	seq_printf(m, "Sysram_nodes_list:\t%*pbl\n",
+		   nodemask_pr_args(&task->sysram_nodes));
 }
diff --git a/kernel/fork.c b/kernel/fork.c
index 3da0f08615a9..9ca2b59d7f0e 100644
--- a/kernel/fork.c
+++ b/kernel/fork.c
@@ -2120,7 +2120,7 @@ __latent_entropy struct task_struct *copy_process(
 #endif
 #ifdef CONFIG_CPUSETS
 	p->cpuset_mem_spread_rotor = NUMA_NO_NODE;
-	seqcount_spinlock_init(&p->mems_allowed_seq, &p->alloc_lock);
+	seqcount_spinlock_init(&p->sysram_nodes_seq, &p->alloc_lock);
 #endif
 #ifdef CONFIG_TRACE_IRQFLAGS
 	memset(&p->irqtrace, 0, sizeof(p->irqtrace));
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 5b752324270b..667c53fc3954 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -3317,8 +3317,8 @@ static void task_numa_work(struct callback_head *work)
 	 * Memory is pinned to only one NUMA node via cpuset.mems, naturally
 	 * no page can be migrated.
 	 */
-	if (cpusets_enabled() && nodes_weight(cpuset_current_mems_allowed) == 1) {
-		trace_sched_skip_cpuset_numa(current, &cpuset_current_mems_allowed);
+	if (cpusets_enabled() && nodes_weight(cpuset_current_sysram_nodes) == 1) {
+		trace_sched_skip_cpuset_numa(current, &cpuset_current_sysram_nodes);
 		return;
 	}
 
diff --git a/mm/hugetlb.c b/mm/hugetlb.c
index 0455119716ec..0d16890c1a4f 100644
--- a/mm/hugetlb.c
+++ b/mm/hugetlb.c
@@ -2366,7 +2366,7 @@ static nodemask_t *policy_mbind_nodemask(gfp_t gfp)
 	 */
 	if (mpol->mode == MPOL_BIND &&
 		(apply_policy_zone(mpol, gfp_zone(gfp)) &&
-		 cpuset_nodemask_valid_mems_allowed(&mpol->nodes)))
+		 cpuset_nodemask_valid_sysram_nodes(&mpol->nodes)))
 		return &mpol->nodes;
 #endif
 	return NULL;
@@ -2389,9 +2389,9 @@ static int gather_surplus_pages(struct hstate *h, long delta)
 
 	mbind_nodemask = policy_mbind_nodemask(htlb_alloc_mask(h));
 	if (mbind_nodemask)
-		nodes_and(alloc_nodemask, *mbind_nodemask, cpuset_current_mems_allowed);
+		nodes_and(alloc_nodemask, *mbind_nodemask, cpuset_current_sysram_nodes);
 	else
-		alloc_nodemask = cpuset_current_mems_allowed;
+		alloc_nodemask = cpuset_current_sysram_nodes;
 
 	lockdep_assert_held(&hugetlb_lock);
 	needed = (h->resv_huge_pages + delta) - h->free_huge_pages;
@@ -5084,7 +5084,7 @@ static unsigned int allowed_mems_nr(struct hstate *h)
 	gfp_t gfp_mask = htlb_alloc_mask(h);
 
 	mbind_nodemask = policy_mbind_nodemask(gfp_mask);
-	for_each_node_mask(node, cpuset_current_mems_allowed) {
+	for_each_node_mask(node, cpuset_current_sysram_nodes) {
 		if (!mbind_nodemask || node_isset(node, *mbind_nodemask))
 			nr += array[node];
 	}
diff --git a/mm/mempolicy.c b/mm/mempolicy.c
index eb83cff7db8c..735dabb9c50c 100644
--- a/mm/mempolicy.c
+++ b/mm/mempolicy.c
@@ -396,7 +396,7 @@ static int mpol_new_preferred(struct mempolicy *pol, const nodemask_t *nodes)
  * any, for the new policy.  mpol_new() has already validated the nodes
  * parameter with respect to the policy mode and flags.
  *
- * Must be called holding task's alloc_lock to protect task's mems_allowed
+ * Must be called holding task's alloc_lock to protect task's sysram_nodes
  * and mempolicy.  May also be called holding the mmap_lock for write.
  */
 static int mpol_set_nodemask(struct mempolicy *pol,
@@ -414,7 +414,7 @@ static int mpol_set_nodemask(struct mempolicy *pol,
 
 	/* Check N_MEMORY */
 	nodes_and(nsc->mask1,
-		  cpuset_current_mems_allowed, node_states[N_MEMORY]);
+		  cpuset_current_sysram_nodes, node_states[N_MEMORY]);
 
 	VM_BUG_ON(!nodes);
 
@@ -426,7 +426,7 @@ static int mpol_set_nodemask(struct mempolicy *pol,
 	if (mpol_store_user_nodemask(pol))
 		pol->w.user_nodemask = *nodes;
 	else
-		pol->w.cpuset_mems_allowed = cpuset_current_mems_allowed;
+		pol->w.cpuset_sysram_nodes = cpuset_current_sysram_nodes;
 
 	ret = mpol_ops[pol->mode].create(pol, &nsc->mask2);
 	return ret;
@@ -501,9 +501,9 @@ static void mpol_rebind_nodemask(struct mempolicy *pol, const nodemask_t *nodes)
 	else if (pol->flags & MPOL_F_RELATIVE_NODES)
 		mpol_relative_nodemask(&tmp, &pol->w.user_nodemask, nodes);
 	else {
-		nodes_remap(tmp, pol->nodes, pol->w.cpuset_mems_allowed,
+		nodes_remap(tmp, pol->nodes, pol->w.cpuset_sysram_nodes,
 								*nodes);
-		pol->w.cpuset_mems_allowed = *nodes;
+		pol->w.cpuset_sysram_nodes = *nodes;
 	}
 
 	if (nodes_empty(tmp))
@@ -515,14 +515,14 @@ static void mpol_rebind_nodemask(struct mempolicy *pol, const nodemask_t *nodes)
 static void mpol_rebind_preferred(struct mempolicy *pol,
 						const nodemask_t *nodes)
 {
-	pol->w.cpuset_mems_allowed = *nodes;
+	pol->w.cpuset_sysram_nodes = *nodes;
 }
 
 /*
  * mpol_rebind_policy - Migrate a policy to a different set of nodes
  *
  * Per-vma policies are protected by mmap_lock. Allocations using per-task
- * policies are protected by task->mems_allowed_seq to prevent a premature
+ * policies are protected by task->sysram_nodes_seq to prevent a premature
  * OOM/allocation failure due to parallel nodemask modification.
  */
 static void mpol_rebind_policy(struct mempolicy *pol, const nodemask_t *newmask)
@@ -530,7 +530,7 @@ static void mpol_rebind_policy(struct mempolicy *pol, const nodemask_t *newmask)
 	if (!pol || pol->mode == MPOL_LOCAL)
 		return;
 	if (!mpol_store_user_nodemask(pol) &&
-	    nodes_equal(pol->w.cpuset_mems_allowed, *newmask))
+	    nodes_equal(pol->w.cpuset_sysram_nodes, *newmask))
 		return;
 
 	mpol_ops[pol->mode].rebind(pol, newmask);
@@ -1086,7 +1086,7 @@ static long do_get_mempolicy(int *policy, nodemask_t *nmask,
 			return -EINVAL;
 		*policy = 0;	/* just so it's initialized */
 		task_lock(current);
-		*nmask  = cpuset_current_mems_allowed;
+		*nmask  = cpuset_current_sysram_nodes;
 		task_unlock(current);
 		return 0;
 	}
@@ -2029,7 +2029,7 @@ static unsigned int weighted_interleave_nodes(struct mempolicy *policy)
 	unsigned int cpuset_mems_cookie;
 
 retry:
-	/* to prevent miscount use tsk->mems_allowed_seq to detect rebind */
+	/* to prevent miscount use tsk->sysram_nodes_seq to detect rebind */
 	cpuset_mems_cookie = read_mems_allowed_begin();
 	node = current->il_prev;
 	if (!current->il_weight || !node_isset(node, policy->nodes)) {
@@ -2051,7 +2051,7 @@ static unsigned int interleave_nodes(struct mempolicy *policy)
 	unsigned int nid;
 	unsigned int cpuset_mems_cookie;
 
-	/* to prevent miscount, use tsk->mems_allowed_seq to detect rebind */
+	/* to prevent miscount, use tsk->sysram_nodes_seq to detect rebind */
 	do {
 		cpuset_mems_cookie = read_mems_allowed_begin();
 		nid = next_node_in(current->il_prev, policy->nodes);
@@ -2118,7 +2118,7 @@ static unsigned int read_once_policy_nodemask(struct mempolicy *pol,
 	/*
 	 * barrier stabilizes the nodemask locally so that it can be iterated
 	 * over safely without concern for changes. Allocators validate node
-	 * selection does not violate mems_allowed, so this is safe.
+	 * selection does not violate sysram_nodes, so this is safe.
 	 */
 	barrier();
 	memcpy(mask, &pol->nodes, sizeof(nodemask_t));
@@ -2210,7 +2210,7 @@ static nodemask_t *policy_nodemask(gfp_t gfp, struct mempolicy *pol,
 	case MPOL_BIND:
 		/* Restrict to nodemask (but not on lower zones) */
 		if (apply_policy_zone(pol, gfp_zone(gfp)) &&
-		    cpuset_nodemask_valid_mems_allowed(&pol->nodes))
+		    cpuset_nodemask_valid_sysram_nodes(&pol->nodes))
 			nodemask = &pol->nodes;
 		if (pol->home_node != NUMA_NO_NODE)
 			*nid = pol->home_node;
@@ -2738,7 +2738,7 @@ int vma_dup_policy(struct vm_area_struct *src, struct vm_area_struct *dst)
 /*
  * If mpol_dup() sees current->cpuset == cpuset_being_rebound, then it
  * rebinds the mempolicy its copying by calling mpol_rebind_policy()
- * with the mems_allowed returned by cpuset_mems_allowed().  This
+ * with the sysram_nodes returned by cpuset_mems_allowed().  This
  * keeps mempolicies cpuset relative after its cpuset moves.  See
  * further kernel/cpuset.c update_nodemask().
  *
diff --git a/mm/oom_kill.c b/mm/oom_kill.c
index 386b4ceeaeb8..9d13580c21ef 100644
--- a/mm/oom_kill.c
+++ b/mm/oom_kill.c
@@ -110,7 +110,7 @@ static bool oom_cpuset_eligible(struct task_struct *start,
 			 * This is not a mempolicy constrained oom, so only
 			 * check the mems of tsk's cpuset.
 			 */
-			ret = cpuset_mems_allowed_intersects(current, tsk);
+			ret = cpuset_sysram_nodes_intersects(current, tsk);
 		}
 		if (ret)
 			break;
@@ -300,7 +300,7 @@ static enum oom_constraint constrained_alloc(struct oom_control *oc)
 
 	if (cpuset_limited) {
 		oc->totalpages = total_swap_pages;
-		for_each_node_mask(nid, cpuset_current_mems_allowed)
+		for_each_node_mask(nid, cpuset_current_sysram_nodes)
 			oc->totalpages += node_present_pages(nid);
 		return CONSTRAINT_CPUSET;
 	}
@@ -451,7 +451,7 @@ static void dump_oom_victim(struct oom_control *oc, struct task_struct *victim)
 	pr_info("oom-kill:constraint=%s,nodemask=%*pbl",
 			oom_constraint_text[oc->constraint],
 			nodemask_pr_args(oc->nodemask));
-	cpuset_print_current_mems_allowed();
+	cpuset_print_current_sysram_nodes();
 	mem_cgroup_print_oom_context(oc->memcg, victim);
 	pr_cont(",task=%s,pid=%d,uid=%d\n", victim->comm, victim->pid,
 		from_kuid(&init_user_ns, task_uid(victim)));
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index 2ea6a50f6079..e1257cb7aea4 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -3964,7 +3964,7 @@ void warn_alloc(gfp_t gfp_mask, const nodemask_t *nodemask, const char *fmt, ...
 			nodemask_pr_args(nodemask));
 	va_end(args);
 
-	cpuset_print_current_mems_allowed();
+	cpuset_print_current_sysram_nodes();
 	pr_cont("\n");
 	dump_stack();
 	warn_alloc_show_mem(gfp_mask, nodemask);
@@ -4601,7 +4601,7 @@ static inline bool
 check_retry_cpuset(int cpuset_mems_cookie, struct alloc_context *ac)
 {
 	/*
-	 * It's possible that cpuset's mems_allowed and the nodemask from
+	 * It's possible that cpuset's sysram_nodes and the nodemask from
 	 * mempolicy don't intersect. This should be normally dealt with by
 	 * policy_nodemask(), but it's possible to race with cpuset update in
 	 * such a way the check therein was true, and then it became false
@@ -4612,13 +4612,13 @@ check_retry_cpuset(int cpuset_mems_cookie, struct alloc_context *ac)
 	 * caller can deal with a violated nodemask.
 	 */
 	if (cpusets_enabled() && ac->nodemask &&
-			!cpuset_nodemask_valid_mems_allowed(ac->nodemask)) {
+			!cpuset_nodemask_valid_sysram_nodes(ac->nodemask)) {
 		ac->nodemask = mt_sysram_nodemask();
 		return true;
 	}
 
 	/*
-	 * When updating a task's mems_allowed or mempolicy nodemask, it is
+	 * When updating a task's sysram_nodes or mempolicy nodemask, it is
 	 * possible to race with parallel threads in such a way that our
 	 * allocation can fail while the mask is being updated. If we are about
 	 * to fail, check if the cpuset changed during allocation and if so,
@@ -4702,7 +4702,7 @@ __alloc_pages_slowpath(gfp_t gfp_mask, unsigned int order,
 	if (cpusets_insane_config() && (gfp_mask & __GFP_HARDWALL)) {
 		struct zoneref *z = first_zones_zonelist(ac->zonelist,
 					ac->highest_zoneidx,
-					&cpuset_current_mems_allowed);
+					&cpuset_current_sysram_nodes);
 		if (!zonelist_zone(z))
 			goto nopage;
 	}
@@ -4946,7 +4946,7 @@ static inline bool prepare_alloc_pages(gfp_t gfp_mask, unsigned int order,
 		 * to the current task context. It means that any node ok.
 		 */
 		if (in_task() && !ac->nodemask)
-			ac->nodemask = &cpuset_current_mems_allowed;
+			ac->nodemask = &cpuset_current_sysram_nodes;
 		else
 			*alloc_flags |= ALLOC_CPUSET;
 	} else if (!ac->nodemask) /* sysram_nodes may be NULL during __init */
@@ -5194,7 +5194,7 @@ struct page *__alloc_frozen_pages_noprof(gfp_t gfp, unsigned int order,
 
 	/*
 	 * Restore the original nodemask if it was potentially replaced with
-	 * &cpuset_current_mems_allowed to optimize the fast-path attempt.
+	 * &cpuset_current_sysram_nodes to optimize the fast-path attempt.
 	 *
 	 * If not set, default to sysram nodes.
 	 */
@@ -5819,7 +5819,7 @@ build_all_zonelists_init(void)
 		per_cpu_pages_init(&per_cpu(boot_pageset, cpu), &per_cpu(boot_zonestats, cpu));
 
 	mminit_verify_zonelist();
-	cpuset_init_current_mems_allowed();
+	cpuset_init_current_sysram_nodes();
 }
 
 /*
diff --git a/mm/show_mem.c b/mm/show_mem.c
index 24685b5c6dcf..ca7b6872c3d8 100644
--- a/mm/show_mem.c
+++ b/mm/show_mem.c
@@ -128,7 +128,7 @@ static bool show_mem_node_skip(unsigned int flags, int nid,
 	 * have to be precise here.
 	 */
 	if (!nodemask)
-		nodemask = &cpuset_current_mems_allowed;
+		nodemask = &cpuset_current_sysram_nodes;
 
 	return !node_isset(nid, *nodemask);
 }
-- 
2.51.1



^ permalink raw reply	[flat|nested] 34+ messages in thread

* [RFC PATCH v2 07/11] cpuset: introduce cpuset.mems.sysram
  2025-11-12 19:29 [RFC LPC2026 PATCH v2 00/11] Specific Purpose Memory NUMA Nodes Gregory Price
                   ` (5 preceding siblings ...)
  2025-11-12 19:29 ` [RFC PATCH v2 06/11] mm,cpusets: rename task->mems_allowed to task->sysram_nodes Gregory Price
@ 2025-11-12 19:29 ` Gregory Price
  2025-11-12 19:29 ` [RFC PATCH v2 08/11] mm/memory_hotplug: add MHP_SPM_NODE flag Gregory Price
                   ` (7 subsequent siblings)
  14 siblings, 0 replies; 34+ messages in thread
From: Gregory Price @ 2025-11-12 19:29 UTC (permalink / raw)
  To: linux-mm
  Cc: kernel-team, linux-cxl, linux-kernel, nvdimm, linux-fsdevel,
	cgroups, dave, jonathan.cameron, dave.jiang, alison.schofield,
	vishal.l.verma, ira.weiny, dan.j.williams, longman, akpm, david,
	lorenzo.stoakes, Liam.Howlett, vbabka, rppt, surenb, mhocko,
	osalvador, ziy, matthew.brost, joshua.hahnjy, rakie.kim,
	byungchul, gourry, ying.huang, apopple, mingo, peterz,
	juri.lelli, vincent.guittot, dietmar.eggemann, rostedt, bsegall,
	mgorman, vschneid, tj, hannes, mkoutny, kees, muchun.song,
	roman.gushchin, shakeel.butt, rientjes, jackmanb, cl, harry.yoo,
	axelrasmussen, yuanchu, weixugc, zhengqi.arch, yosry.ahmed,
	nphamcs, chengming.zhou, fabio.m.de.francesco, rrichter, ming.li,
	usamaarif642, brauner, oleg, namcao, escape, dongjoo.seo1

mems_sysram contains only SystemRAM nodes (omitting SPM Nodes). The
nodelist is effectively intersect(effective_mems, mt_sysram_nodelist).

When checking mems_allowed, check for GFP_SPM_NODE to determine if
the check should be made against mems_sysram or mems_allowed, since
mems_sysram only contains sysram nodes.

This omits "Specific Purpose Memory Nodes" from default mems_allowed
checks, making those nodes unreachable via "normal" allocation paths
(page faults, mempolicies, etc).

Signed-off-by: Gregory Price <gourry@gourry.net>
---
 include/linux/cpuset.h          |  8 ++--
 kernel/cgroup/cpuset-internal.h |  8 ++++
 kernel/cgroup/cpuset-v1.c       |  7 +++
 kernel/cgroup/cpuset.c          | 84 ++++++++++++++++++++++++---------
 mm/memcontrol.c                 |  3 +-
 mm/mempolicy.c                  |  6 +--
 mm/migrate.c                    |  4 +-
 7 files changed, 88 insertions(+), 32 deletions(-)

diff --git a/include/linux/cpuset.h b/include/linux/cpuset.h
index 9baaf19431b5..375bf446b66e 100644
--- a/include/linux/cpuset.h
+++ b/include/linux/cpuset.h
@@ -77,7 +77,7 @@ extern void cpuset_unlock(void);
 extern void cpuset_cpus_allowed(struct task_struct *p, struct cpumask *mask);
 extern bool cpuset_cpus_allowed_fallback(struct task_struct *p);
 extern bool cpuset_cpu_is_isolated(int cpu);
-extern nodemask_t cpuset_mems_allowed(struct task_struct *p);
+extern nodemask_t cpuset_sysram_nodes_allowed(struct task_struct *p);
 #define cpuset_current_sysram_nodes (current->sysram_nodes)
 #define cpuset_current_mems_allowed (cpuset_current_sysram_nodes)
 void cpuset_init_current_sysram_nodes(void);
@@ -174,7 +174,7 @@ static inline void set_mems_allowed(nodemask_t nodemask)
 	task_unlock(current);
 }
 
-extern bool cpuset_node_allowed(struct cgroup *cgroup, int nid);
+extern bool cpuset_sysram_node_allowed(struct cgroup *cgroup, int nid);
 #else /* !CONFIG_CPUSETS */
 
 static inline bool cpusets_enabled(void) { return false; }
@@ -212,7 +212,7 @@ static inline bool cpuset_cpu_is_isolated(int cpu)
 	return false;
 }
 
-static inline nodemask_t cpuset_mems_allowed(struct task_struct *p)
+static inline nodemask_t cpuset_sysram_nodes_allowed(struct task_struct *p)
 {
 	return node_possible_map;
 }
@@ -296,7 +296,7 @@ static inline bool read_mems_allowed_retry(unsigned int seq)
 	return false;
 }
 
-static inline bool cpuset_node_allowed(struct cgroup *cgroup, int nid)
+static inline bool cpuset_sysram_node_allowed(struct cgroup *cgroup, int nid)
 {
 	return true;
 }
diff --git a/kernel/cgroup/cpuset-internal.h b/kernel/cgroup/cpuset-internal.h
index 337608f408ce..64e48fe040ed 100644
--- a/kernel/cgroup/cpuset-internal.h
+++ b/kernel/cgroup/cpuset-internal.h
@@ -53,6 +53,7 @@ typedef enum {
 	FILE_MEMORY_MIGRATE,
 	FILE_CPULIST,
 	FILE_MEMLIST,
+	FILE_MEMS_SYSRAM,
 	FILE_EFFECTIVE_CPULIST,
 	FILE_EFFECTIVE_MEMLIST,
 	FILE_SUBPARTS_CPULIST,
@@ -104,6 +105,13 @@ struct cpuset {
 	cpumask_var_t effective_cpus;
 	nodemask_t effective_mems;
 
+	/*
+	 * SystemRAM Memory Nodes for tasks.
+	 * This is the intersection of effective_mems and mt_sysram_nodelist.
+	 * Tasks will have their sysram_nodes set to this value.
+	 */
+	nodemask_t mems_sysram;
+
 	/*
 	 * Exclusive CPUs dedicated to current cgroup (default hierarchy only)
 	 *
diff --git a/kernel/cgroup/cpuset-v1.c b/kernel/cgroup/cpuset-v1.c
index 12e76774c75b..c58215d7230e 100644
--- a/kernel/cgroup/cpuset-v1.c
+++ b/kernel/cgroup/cpuset-v1.c
@@ -293,6 +293,7 @@ void cpuset1_hotplug_update_tasks(struct cpuset *cs,
 	cpumask_copy(cs->effective_cpus, new_cpus);
 	cs->mems_allowed = *new_mems;
 	cs->effective_mems = *new_mems;
+	cpuset_update_tasks_nodemask(cs);
 	cpuset_callback_unlock_irq();
 
 	/*
@@ -532,6 +533,12 @@ struct cftype cpuset1_files[] = {
 		.private = FILE_EFFECTIVE_MEMLIST,
 	},
 
+	{
+		.name = "mems_sysram",
+		.seq_show = cpuset_common_seq_show,
+		.private = FILE_MEMS_SYSRAM,
+	},
+
 	{
 		.name = "cpu_exclusive",
 		.read_u64 = cpuset_read_u64,
diff --git a/kernel/cgroup/cpuset.c b/kernel/cgroup/cpuset.c
index f0c59621a7f2..e08b59a0cf99 100644
--- a/kernel/cgroup/cpuset.c
+++ b/kernel/cgroup/cpuset.c
@@ -29,6 +29,7 @@
 #include <linux/mempolicy.h>
 #include <linux/mm.h>
 #include <linux/memory.h>
+#include <linux/memory-tiers.h>
 #include <linux/export.h>
 #include <linux/rcupdate.h>
 #include <linux/sched.h>
@@ -428,11 +429,11 @@ static void guarantee_active_cpus(struct task_struct *tsk,
  *
  * Call with callback_lock or cpuset_mutex held.
  */
-static void guarantee_online_mems(struct cpuset *cs, nodemask_t *pmask)
+static void guarantee_online_sysram_nodes(struct cpuset *cs, nodemask_t *pmask)
 {
-	while (!nodes_intersects(cs->effective_mems, node_states[N_MEMORY]))
+	while (!nodes_intersects(cs->mems_sysram, node_states[N_MEMORY]))
 		cs = parent_cs(cs);
-	nodes_and(*pmask, cs->effective_mems, node_states[N_MEMORY]);
+	nodes_and(*pmask, cs->mems_sysram, node_states[N_MEMORY]);
 }
 
 /**
@@ -2723,7 +2724,7 @@ void cpuset_update_tasks_nodemask(struct cpuset *cs)
 
 	cpuset_being_rebound = cs;		/* causes mpol_dup() rebind */
 
-	guarantee_online_mems(cs, &newmems);
+	guarantee_online_sysram_nodes(cs, &newmems);
 
 	/*
 	 * The mpol_rebind_mm() call takes mmap_lock, which we couldn't
@@ -2748,7 +2749,7 @@ void cpuset_update_tasks_nodemask(struct cpuset *cs)
 
 		migrate = is_memory_migrate(cs);
 
-		mpol_rebind_mm(mm, &cs->mems_allowed);
+		mpol_rebind_mm(mm, &cs->mems_sysram);
 		if (migrate)
 			cpuset_migrate_mm(mm, &cs->old_mems_allowed, &newmems);
 		else
@@ -2808,6 +2809,7 @@ static void update_nodemasks_hier(struct cpuset *cs, nodemask_t *new_mems)
 
 		spin_lock_irq(&callback_lock);
 		cp->effective_mems = *new_mems;
+		mt_nodemask_sysram_mask(&cp->mems_sysram, &cp->effective_mems);
 		spin_unlock_irq(&callback_lock);
 
 		WARN_ON(!is_in_v2_mode() &&
@@ -3234,11 +3236,11 @@ static void cpuset_attach(struct cgroup_taskset *tset)
 	 * by skipping the task iteration and update.
 	 */
 	if (cpuset_v2() && !cpus_updated && !mems_updated) {
-		cpuset_attach_nodemask_to = cs->effective_mems;
+		cpuset_attach_nodemask_to = cs->mems_sysram;
 		goto out;
 	}
 
-	guarantee_online_mems(cs, &cpuset_attach_nodemask_to);
+	guarantee_online_sysram_nodes(cs, &cpuset_attach_nodemask_to);
 
 	cgroup_taskset_for_each(task, css, tset)
 		cpuset_attach_task(cs, task);
@@ -3249,7 +3251,7 @@ static void cpuset_attach(struct cgroup_taskset *tset)
 	 * if there is no change in effective_mems and CS_MEMORY_MIGRATE is
 	 * not set.
 	 */
-	cpuset_attach_nodemask_to = cs->effective_mems;
+	cpuset_attach_nodemask_to = cs->mems_sysram;
 	if (!is_memory_migrate(cs) && !mems_updated)
 		goto out;
 
@@ -3371,6 +3373,9 @@ int cpuset_common_seq_show(struct seq_file *sf, void *v)
 	case FILE_EFFECTIVE_MEMLIST:
 		seq_printf(sf, "%*pbl\n", nodemask_pr_args(&cs->effective_mems));
 		break;
+	case FILE_MEMS_SYSRAM:
+		seq_printf(sf, "%*pbl\n", nodemask_pr_args(&cs->mems_sysram));
+		break;
 	case FILE_EXCLUSIVE_CPULIST:
 		seq_printf(sf, "%*pbl\n", cpumask_pr_args(cs->exclusive_cpus));
 		break;
@@ -3482,6 +3487,12 @@ static struct cftype dfl_files[] = {
 		.private = FILE_EFFECTIVE_MEMLIST,
 	},
 
+	{
+		.name = "mems.sysram",
+		.seq_show = cpuset_common_seq_show,
+		.private = FILE_MEMS_SYSRAM,
+	},
+
 	{
 		.name = "cpus.partition",
 		.seq_show = cpuset_partition_show,
@@ -3585,6 +3596,7 @@ static int cpuset_css_online(struct cgroup_subsys_state *css)
 	if (is_in_v2_mode()) {
 		cpumask_copy(cs->effective_cpus, parent->effective_cpus);
 		cs->effective_mems = parent->effective_mems;
+		mt_nodemask_sysram_mask(&cs->mems_sysram, &cs->effective_mems);
 	}
 	spin_unlock_irq(&callback_lock);
 
@@ -3616,6 +3628,7 @@ static int cpuset_css_online(struct cgroup_subsys_state *css)
 	spin_lock_irq(&callback_lock);
 	cs->mems_allowed = parent->mems_allowed;
 	cs->effective_mems = parent->mems_allowed;
+	mt_nodemask_sysram_mask(&cs->mems_sysram, &cs->effective_mems);
 	cpumask_copy(cs->cpus_allowed, parent->cpus_allowed);
 	cpumask_copy(cs->effective_cpus, parent->cpus_allowed);
 	spin_unlock_irq(&callback_lock);
@@ -3769,7 +3782,7 @@ static void cpuset_fork(struct task_struct *task)
 
 	/* CLONE_INTO_CGROUP */
 	mutex_lock(&cpuset_mutex);
-	guarantee_online_mems(cs, &cpuset_attach_nodemask_to);
+	guarantee_online_sysram_nodes(cs, &cpuset_attach_nodemask_to);
 	cpuset_attach_task(cs, task);
 
 	dec_attach_in_progress_locked(cs);
@@ -3818,7 +3831,8 @@ int __init cpuset_init(void)
 	cpumask_setall(top_cpuset.effective_xcpus);
 	cpumask_setall(top_cpuset.exclusive_cpus);
 	nodes_setall(top_cpuset.effective_mems);
-
+	mt_nodemask_sysram_mask(&top_cpuset.mems_sysram,
+				&top_cpuset.effective_mems);
 	fmeter_init(&top_cpuset.fmeter);
 	INIT_LIST_HEAD(&remote_children);
 
@@ -3848,6 +3862,7 @@ hotplug_update_tasks(struct cpuset *cs,
 	spin_lock_irq(&callback_lock);
 	cpumask_copy(cs->effective_cpus, new_cpus);
 	cs->effective_mems = *new_mems;
+	mt_nodemask_sysram_mask(&cs->mems_sysram, &cs->effective_mems);
 	spin_unlock_irq(&callback_lock);
 
 	if (cpus_updated)
@@ -4039,6 +4054,8 @@ static void cpuset_handle_hotplug(void)
 		if (!on_dfl)
 			top_cpuset.mems_allowed = new_mems;
 		top_cpuset.effective_mems = new_mems;
+		mt_nodemask_sysram_mask(&top_cpuset.mems_sysram,
+					&top_cpuset.effective_mems);
 		spin_unlock_irq(&callback_lock);
 		cpuset_update_tasks_nodemask(&top_cpuset);
 	}
@@ -4109,6 +4126,8 @@ void __init cpuset_init_smp(void)
 
 	cpumask_copy(top_cpuset.effective_cpus, cpu_active_mask);
 	top_cpuset.effective_mems = node_states[N_MEMORY];
+	mt_nodemask_sysram_mask(&top_cpuset.mems_sysram,
+				&top_cpuset.effective_mems);
 
 	hotplug_node_notifier(cpuset_track_online_nodes, CPUSET_CALLBACK_PRI);
 
@@ -4205,14 +4224,18 @@ bool cpuset_cpus_allowed_fallback(struct task_struct *tsk)
 	return changed;
 }
 
+/*
+ * At this point in time, no hotplug nodes can have been added, so just set
+ * the sysram_nodes of the init task to the set of N_MEMORY nodes.
+ */
 void __init cpuset_init_current_sysram_nodes(void)
 {
-	nodes_setall(current->sysram_nodes);
+	current->sysram_nodes = node_states[N_MEMORY];
 }
 
 /**
- * cpuset_mems_allowed - return mems_allowed mask from a tasks cpuset.
- * @tsk: pointer to task_struct from which to obtain cpuset->mems_allowed.
+ * cpuset_sysram_nodes_allowed - return mems_sysram mask from a tasks cpuset.
+ * @tsk: pointer to task_struct from which to obtain cpuset->mems_sysram.
  *
  * Description: Returns the nodemask_t mems_allowed of the cpuset
  * attached to the specified @tsk.  Guaranteed to return some non-empty
@@ -4220,13 +4243,13 @@ void __init cpuset_init_current_sysram_nodes(void)
  * tasks cpuset.
  **/
 
-nodemask_t cpuset_mems_allowed(struct task_struct *tsk)
+nodemask_t cpuset_sysram_nodes_allowed(struct task_struct *tsk)
 {
 	nodemask_t mask;
 	unsigned long flags;
 
 	spin_lock_irqsave(&callback_lock, flags);
-	guarantee_online_mems(task_cs(tsk), &mask);
+	guarantee_online_sysram_nodes(task_cs(tsk), &mask);
 	spin_unlock_irqrestore(&callback_lock, flags);
 
 	return mask;
@@ -4295,17 +4318,30 @@ static struct cpuset *nearest_hardwall_ancestor(struct cpuset *cs)
  *	tsk_is_oom_victim   - any node ok
  *	GFP_KERNEL   - any node in enclosing hardwalled cpuset ok
  *	GFP_USER     - only nodes in current tasks mems allowed ok.
+ *	GFP_SPM_NODE - allow specific purpose memory nodes in mems_allowed
  */
 bool cpuset_current_node_allowed(int node, gfp_t gfp_mask)
 {
 	struct cpuset *cs;		/* current cpuset ancestors */
 	bool allowed;			/* is allocation in zone z allowed? */
 	unsigned long flags;
+	bool sp_node = gfp_mask & __GFP_SPM_NODE;
 
+	/* Only SysRAM nodes are valid in interrupt context */
 	if (in_interrupt())
-		return true;
-	if (node_isset(node, current->sysram_nodes))
-		return true;
+		return (!sp_node || node_isset(node, mt_sysram_nodelist));
+
+	if (sp_node) {
+		rcu_read_lock();
+		cs = task_cs(current);
+		allowed = node_isset(node, cs->mems_allowed);
+		rcu_read_unlock();
+	} else
+		allowed = node_isset(node, current->sysram_nodes);
+
+	if (allowed)
+		return allowed;
+
 	/*
 	 * Allow tasks that have access to memory reserves because they have
 	 * been OOM killed to get memory anywhere.
@@ -4324,11 +4360,15 @@ bool cpuset_current_node_allowed(int node, gfp_t gfp_mask)
 	cs = nearest_hardwall_ancestor(task_cs(current));
 	allowed = node_isset(node, cs->mems_allowed);
 
+	/* If not a SP Node allocation, restrict to sysram nodes */
+	if (!sp_node && !nodes_empty(mt_sysram_nodelist))
+		allowed &= node_isset(node, mt_sysram_nodelist);
+
 	spin_unlock_irqrestore(&callback_lock, flags);
 	return allowed;
 }
 
-bool cpuset_node_allowed(struct cgroup *cgroup, int nid)
+bool cpuset_sysram_node_allowed(struct cgroup *cgroup, int nid)
 {
 	struct cgroup_subsys_state *css;
 	struct cpuset *cs;
@@ -4347,7 +4387,7 @@ bool cpuset_node_allowed(struct cgroup *cgroup, int nid)
 		return true;
 
 	/*
-	 * Normally, accessing effective_mems would require the cpuset_mutex
+	 * Normally, accessing mems_sysram would require the cpuset_mutex
 	 * or callback_lock - but node_isset is atomic and the reference
 	 * taken via cgroup_get_e_css is sufficient to protect css.
 	 *
@@ -4359,7 +4399,7 @@ bool cpuset_node_allowed(struct cgroup *cgroup, int nid)
 	 * cannot make strong isolation guarantees, so this is acceptable.
 	 */
 	cs = container_of(css, struct cpuset, css);
-	allowed = node_isset(nid, cs->effective_mems);
+	allowed = node_isset(nid, cs->mems_sysram);
 	css_put(css);
 	return allowed;
 }
@@ -4380,7 +4420,7 @@ bool cpuset_node_allowed(struct cgroup *cgroup, int nid)
  * We don't have to worry about the returned node being offline
  * because "it can't happen", and even if it did, it would be ok.
  *
- * The routines calling guarantee_online_mems() are careful to
+ * The routines calling guarantee_online_sysram_nodes() are careful to
  * only set nodes in task->sysram_nodes that are online.  So it
  * should not be possible for the following code to return an
  * offline node.  But if it did, that would be ok, as this routine
diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index 4deda33625f4..7cac7ff013a7 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -5599,5 +5599,6 @@ subsys_initcall(mem_cgroup_swap_init);
 
 bool mem_cgroup_node_allowed(struct mem_cgroup *memcg, int nid)
 {
-	return memcg ? cpuset_node_allowed(memcg->css.cgroup, nid) : true;
+	return memcg ? cpuset_sysram_node_allowed(memcg->css.cgroup, nid) :
+		       true;
 }
diff --git a/mm/mempolicy.c b/mm/mempolicy.c
index 735dabb9c50c..e1e8a1f3e1a2 100644
--- a/mm/mempolicy.c
+++ b/mm/mempolicy.c
@@ -1831,14 +1831,14 @@ static int kernel_migrate_pages(pid_t pid, unsigned long maxnode,
 	}
 	rcu_read_unlock();
 
-	task_nodes = cpuset_mems_allowed(task);
+	task_nodes = cpuset_sysram_nodes_allowed(task);
 	/* Is the user allowed to access the target nodes? */
 	if (!nodes_subset(*new, task_nodes) && !capable(CAP_SYS_NICE)) {
 		err = -EPERM;
 		goto out_put;
 	}
 
-	task_nodes = cpuset_mems_allowed(current);
+	task_nodes = cpuset_sysram_nodes_allowed(current);
 	nodes_and(*new, *new, task_nodes);
 	if (nodes_empty(*new))
 		goto out_put;
@@ -2763,7 +2763,7 @@ struct mempolicy *__mpol_dup(struct mempolicy *old)
 		*new = *old;
 
 	if (current_cpuset_is_being_rebound()) {
-		nodemask_t mems = cpuset_mems_allowed(current);
+		nodemask_t mems = cpuset_sysram_nodes_allowed(current);
 		mpol_rebind_policy(new, &mems);
 	}
 	atomic_set(&new->refcnt, 1);
diff --git a/mm/migrate.c b/mm/migrate.c
index c0e9f15be2a2..c612f05d23db 100644
--- a/mm/migrate.c
+++ b/mm/migrate.c
@@ -2526,7 +2526,7 @@ static struct mm_struct *find_mm_struct(pid_t pid, nodemask_t *mem_nodes)
 	 */
 	if (!pid) {
 		mmget(current->mm);
-		*mem_nodes = cpuset_mems_allowed(current);
+		*mem_nodes = cpuset_sysram_nodes_allowed(current);
 		return current->mm;
 	}
 
@@ -2547,7 +2547,7 @@ static struct mm_struct *find_mm_struct(pid_t pid, nodemask_t *mem_nodes)
 	mm = ERR_PTR(security_task_movememory(task));
 	if (IS_ERR(mm))
 		goto out;
-	*mem_nodes = cpuset_mems_allowed(task);
+	*mem_nodes = cpuset_sysram_nodes_allowed(task);
 	mm = get_task_mm(task);
 out:
 	put_task_struct(task);
-- 
2.51.1



^ permalink raw reply	[flat|nested] 34+ messages in thread

* [RFC PATCH v2 08/11] mm/memory_hotplug: add MHP_SPM_NODE flag
  2025-11-12 19:29 [RFC LPC2026 PATCH v2 00/11] Specific Purpose Memory NUMA Nodes Gregory Price
                   ` (6 preceding siblings ...)
  2025-11-12 19:29 ` [RFC PATCH v2 07/11] cpuset: introduce cpuset.mems.sysram Gregory Price
@ 2025-11-12 19:29 ` Gregory Price
  2025-11-13 14:58   ` [PATCH] memory-tiers: multi-definition fixup Gregory Price
  2025-11-12 19:29 ` [RFC PATCH v2 09/11] drivers/dax: add spm_node bit to dev_dax Gregory Price
                   ` (6 subsequent siblings)
  14 siblings, 1 reply; 34+ messages in thread
From: Gregory Price @ 2025-11-12 19:29 UTC (permalink / raw)
  To: linux-mm
  Cc: kernel-team, linux-cxl, linux-kernel, nvdimm, linux-fsdevel,
	cgroups, dave, jonathan.cameron, dave.jiang, alison.schofield,
	vishal.l.verma, ira.weiny, dan.j.williams, longman, akpm, david,
	lorenzo.stoakes, Liam.Howlett, vbabka, rppt, surenb, mhocko,
	osalvador, ziy, matthew.brost, joshua.hahnjy, rakie.kim,
	byungchul, gourry, ying.huang, apopple, mingo, peterz,
	juri.lelli, vincent.guittot, dietmar.eggemann, rostedt, bsegall,
	mgorman, vschneid, tj, hannes, mkoutny, kees, muchun.song,
	roman.gushchin, shakeel.butt, rientjes, jackmanb, cl, harry.yoo,
	axelrasmussen, yuanchu, weixugc, zhengqi.arch, yosry.ahmed,
	nphamcs, chengming.zhou, fabio.m.de.francesco, rrichter, ming.li,
	usamaarif642, brauner, oleg, namcao, escape, dongjoo.seo1

Add support for Specific Purpose Memory (SPM) NUMA nodes.

A SPM node is managed by the page allocator, but can only allocated
by using the __GFP_SP_NODE flag with an appropriate nodemask.

Check/Set the node type (SysRAM vs SPM) at hotplug time.
Disallow SPM from being added to SysRAM nodes and vice-versa.

This prevents normal allocation paths (page faults, kmalloc, etc)
from being directly exposed to these memories, and provides a clear
integration point for buddy-allocation of SPM memory.

Signed-off-by: Gregory Price <gourry@gourry.net>
---
 include/linux/memory_hotplug.h | 10 ++++++++++
 mm/memory_hotplug.c            |  7 +++++++
 2 files changed, 17 insertions(+)

diff --git a/include/linux/memory_hotplug.h b/include/linux/memory_hotplug.h
index 23f038a16231..a50c467951ba 100644
--- a/include/linux/memory_hotplug.h
+++ b/include/linux/memory_hotplug.h
@@ -74,6 +74,16 @@ typedef int __bitwise mhp_t;
  * helpful in low-memory situations.
  */
 #define MHP_OFFLINE_INACCESSIBLE	((__force mhp_t)BIT(3))
+/*
+ * The hotplugged memory can only be added to a "Specific Purpose Memory"
+ * NUMA node.  SPM Nodes are not generally accessible by the page allocator
+ * by way of userland configuration - as most nodemask interfaces
+ * (mempolicy, cpusets) restrict nodes to SysRAM nodes.
+ *
+ * Hotplugging SPM into a SysRAM Node results in -EINVAL.
+ * Hotplugging SysRAM into a SPM Node results in -EINVAL.
+ */
+#define MHP_SPM_NODE	((__force mhp_t)BIT(4))
 
 /*
  * Extended parameters for memory hotplug:
diff --git a/mm/memory_hotplug.c b/mm/memory_hotplug.c
index 0be83039c3b5..488cdd8e5f6f 100644
--- a/mm/memory_hotplug.c
+++ b/mm/memory_hotplug.c
@@ -20,6 +20,7 @@
 #include <linux/memory.h>
 #include <linux/memremap.h>
 #include <linux/memory_hotplug.h>
+#include <linux/memory-tiers.h>
 #include <linux/vmalloc.h>
 #include <linux/ioport.h>
 #include <linux/delay.h>
@@ -1529,6 +1530,12 @@ int add_memory_resource(int nid, struct resource *res, mhp_t mhp_flags)
 
 	mem_hotplug_begin();
 
+	/* Set the NUMA node type and bail out if the type is wrong */
+	ret = mt_set_node_type(nid, (mhp_flags & MHP_SPM_NODE) ?
+				    MT_NODE_TYPE_SPM : MT_NODE_TYPE_SYSRAM);
+	if (ret)
+		goto error_mem_hotplug_end;
+
 	if (IS_ENABLED(CONFIG_ARCH_KEEP_MEMBLOCK)) {
 		if (res->flags & IORESOURCE_SYSRAM_DRIVER_MANAGED)
 			memblock_flags = MEMBLOCK_DRIVER_MANAGED;
-- 
2.51.1



^ permalink raw reply	[flat|nested] 34+ messages in thread

* [RFC PATCH v2 09/11] drivers/dax: add spm_node bit to dev_dax
  2025-11-12 19:29 [RFC LPC2026 PATCH v2 00/11] Specific Purpose Memory NUMA Nodes Gregory Price
                   ` (7 preceding siblings ...)
  2025-11-12 19:29 ` [RFC PATCH v2 08/11] mm/memory_hotplug: add MHP_SPM_NODE flag Gregory Price
@ 2025-11-12 19:29 ` Gregory Price
  2025-11-12 19:29 ` [RFC PATCH v2 10/11] drivers/cxl: add spm_node bit to cxl region Gregory Price
                   ` (5 subsequent siblings)
  14 siblings, 0 replies; 34+ messages in thread
From: Gregory Price @ 2025-11-12 19:29 UTC (permalink / raw)
  To: linux-mm
  Cc: kernel-team, linux-cxl, linux-kernel, nvdimm, linux-fsdevel,
	cgroups, dave, jonathan.cameron, dave.jiang, alison.schofield,
	vishal.l.verma, ira.weiny, dan.j.williams, longman, akpm, david,
	lorenzo.stoakes, Liam.Howlett, vbabka, rppt, surenb, mhocko,
	osalvador, ziy, matthew.brost, joshua.hahnjy, rakie.kim,
	byungchul, gourry, ying.huang, apopple, mingo, peterz,
	juri.lelli, vincent.guittot, dietmar.eggemann, rostedt, bsegall,
	mgorman, vschneid, tj, hannes, mkoutny, kees, muchun.song,
	roman.gushchin, shakeel.butt, rientjes, jackmanb, cl, harry.yoo,
	axelrasmussen, yuanchu, weixugc, zhengqi.arch, yosry.ahmed,
	nphamcs, chengming.zhou, fabio.m.de.francesco, rrichter, ming.li,
	usamaarif642, brauner, oleg, namcao, escape, dongjoo.seo1

This bit is used by dax/kmem to determine whether to set the
MHP_SPM_NODE flags, which determines whether the hotplug memory
is SysRAM or Specific Purpose Memory.

Signed-off-by: Gregory Price <gourry@gourry.net>
---
 drivers/dax/bus.c         | 39 +++++++++++++++++++++++++++++++++++++++
 drivers/dax/bus.h         |  1 +
 drivers/dax/dax-private.h |  1 +
 drivers/dax/kmem.c        |  2 ++
 4 files changed, 43 insertions(+)

diff --git a/drivers/dax/bus.c b/drivers/dax/bus.c
index fde29e0ad68b..b0de43854112 100644
--- a/drivers/dax/bus.c
+++ b/drivers/dax/bus.c
@@ -1361,6 +1361,43 @@ static ssize_t memmap_on_memory_store(struct device *dev,
 }
 static DEVICE_ATTR_RW(memmap_on_memory);
 
+static ssize_t spm_node_show(struct device *dev,
+			     struct device_attribute *attr, char *buf)
+{
+	struct dev_dax *dev_dax = to_dev_dax(dev);
+
+	return sysfs_emit(buf, "%d\n", dev_dax->spm_node);
+}
+
+static ssize_t spm_node_store(struct device *dev,
+			      struct device_attribute *attr,
+			      const char *buf, size_t len)
+{
+	struct dev_dax *dev_dax = to_dev_dax(dev);
+	bool val;
+	int rc;
+
+	rc = kstrtobool(buf, &val);
+	if (rc)
+		return rc;
+
+	rc = down_write_killable(&dax_dev_rwsem);
+	if (rc)
+		return rc;
+
+	if (dev_dax->spm_node != val && dev->driver &&
+	    to_dax_drv(dev->driver)->type == DAXDRV_KMEM_TYPE) {
+		up_write(&dax_dev_rwsem);
+		return -EBUSY;
+	}
+
+	dev_dax->spm_node = val;
+	up_write(&dax_dev_rwsem);
+
+	return len;
+}
+static DEVICE_ATTR_RW(spm_node);
+
 static umode_t dev_dax_visible(struct kobject *kobj, struct attribute *a, int n)
 {
 	struct device *dev = container_of(kobj, struct device, kobj);
@@ -1388,6 +1425,7 @@ static struct attribute *dev_dax_attributes[] = {
 	&dev_attr_resource.attr,
 	&dev_attr_numa_node.attr,
 	&dev_attr_memmap_on_memory.attr,
+	&dev_attr_spm_node.attr,
 	NULL,
 };
 
@@ -1494,6 +1532,7 @@ static struct dev_dax *__devm_create_dev_dax(struct dev_dax_data *data)
 	ida_init(&dev_dax->ida);
 
 	dev_dax->memmap_on_memory = data->memmap_on_memory;
+	dev_dax->spm_node = data->spm_node;
 
 	inode = dax_inode(dax_dev);
 	dev->devt = inode->i_rdev;
diff --git a/drivers/dax/bus.h b/drivers/dax/bus.h
index cbbf64443098..51ed961b6a3c 100644
--- a/drivers/dax/bus.h
+++ b/drivers/dax/bus.h
@@ -24,6 +24,7 @@ struct dev_dax_data {
 	resource_size_t size;
 	int id;
 	bool memmap_on_memory;
+	bool spm_node;
 };
 
 struct dev_dax *devm_create_dev_dax(struct dev_dax_data *data);
diff --git a/drivers/dax/dax-private.h b/drivers/dax/dax-private.h
index 0867115aeef2..3d1b1f996383 100644
--- a/drivers/dax/dax-private.h
+++ b/drivers/dax/dax-private.h
@@ -89,6 +89,7 @@ struct dev_dax {
 	struct device dev;
 	struct dev_pagemap *pgmap;
 	bool memmap_on_memory;
+	bool spm_node;
 	int nr_range;
 	struct dev_dax_range *ranges;
 };
diff --git a/drivers/dax/kmem.c b/drivers/dax/kmem.c
index c036e4d0b610..3c3dd1cd052c 100644
--- a/drivers/dax/kmem.c
+++ b/drivers/dax/kmem.c
@@ -169,6 +169,8 @@ static int dev_dax_kmem_probe(struct dev_dax *dev_dax)
 		mhp_flags = MHP_NID_IS_MGID;
 		if (dev_dax->memmap_on_memory)
 			mhp_flags |= MHP_MEMMAP_ON_MEMORY;
+		if (dev_dax->spm_node)
+			mhp_flags |= MHP_SPM_NODE;
 
 		/*
 		 * Ensure that future kexec'd kernels will not treat
-- 
2.51.1



^ permalink raw reply	[flat|nested] 34+ messages in thread

* [RFC PATCH v2 10/11] drivers/cxl: add spm_node bit to cxl region
  2025-11-12 19:29 [RFC LPC2026 PATCH v2 00/11] Specific Purpose Memory NUMA Nodes Gregory Price
                   ` (8 preceding siblings ...)
  2025-11-12 19:29 ` [RFC PATCH v2 09/11] drivers/dax: add spm_node bit to dev_dax Gregory Price
@ 2025-11-12 19:29 ` Gregory Price
  2025-11-12 19:29 ` [RFC PATCH v2 11/11] [HACK] mm/zswap: compressed ram integration example Gregory Price
                   ` (4 subsequent siblings)
  14 siblings, 0 replies; 34+ messages in thread
From: Gregory Price @ 2025-11-12 19:29 UTC (permalink / raw)
  To: linux-mm
  Cc: kernel-team, linux-cxl, linux-kernel, nvdimm, linux-fsdevel,
	cgroups, dave, jonathan.cameron, dave.jiang, alison.schofield,
	vishal.l.verma, ira.weiny, dan.j.williams, longman, akpm, david,
	lorenzo.stoakes, Liam.Howlett, vbabka, rppt, surenb, mhocko,
	osalvador, ziy, matthew.brost, joshua.hahnjy, rakie.kim,
	byungchul, gourry, ying.huang, apopple, mingo, peterz,
	juri.lelli, vincent.guittot, dietmar.eggemann, rostedt, bsegall,
	mgorman, vschneid, tj, hannes, mkoutny, kees, muchun.song,
	roman.gushchin, shakeel.butt, rientjes, jackmanb, cl, harry.yoo,
	axelrasmussen, yuanchu, weixugc, zhengqi.arch, yosry.ahmed,
	nphamcs, chengming.zhou, fabio.m.de.francesco, rrichter, ming.li,
	usamaarif642, brauner, oleg, namcao, escape, dongjoo.seo1

Add spm_node bit to cxl region, forward it to the dax device.

This allows auto-hotplug to occur without an intermediate udev
step to poke the DAX device spm_node bit.

Signed-off-by: Gregory Price <gourry@gourry.net>
---
 drivers/cxl/core/region.c | 30 ++++++++++++++++++++++++++++++
 drivers/cxl/cxl.h         |  2 ++
 drivers/dax/cxl.c         |  1 +
 3 files changed, 33 insertions(+)

diff --git a/drivers/cxl/core/region.c b/drivers/cxl/core/region.c
index b06fee1978ba..3348b09dfe9a 100644
--- a/drivers/cxl/core/region.c
+++ b/drivers/cxl/core/region.c
@@ -754,6 +754,35 @@ static ssize_t size_show(struct device *dev, struct device_attribute *attr,
 }
 static DEVICE_ATTR_RW(size);
 
+static ssize_t spm_node_show(struct device *dev,
+			     struct device_attribute *attr, char *buf)
+{
+	struct cxl_region *cxlr = to_cxl_region(dev);
+
+	return sysfs_emit(buf, "%d\n", cxlr->spm_node);
+}
+
+static ssize_t spm_node_store(struct device *dev,
+				      struct device_attribute *attr,
+				      const char *buf, size_t len)
+{
+	struct cxl_region *cxlr = to_cxl_region(dev);
+	bool val;
+	int rc;
+
+	rc = kstrtobool(buf, &val);
+	if (rc)
+		return rc;
+
+	ACQUIRE(rwsem_write_kill, rwsem)(&cxl_rwsem.region);
+	if ((rc = ACQUIRE_ERR(rwsem_read_intr, &rwsem)))
+		return rc;
+
+	cxlr->spm_node = val;
+	return len;
+}
+static DEVICE_ATTR_RW(spm_node);
+
 static struct attribute *cxl_region_attrs[] = {
 	&dev_attr_uuid.attr,
 	&dev_attr_commit.attr,
@@ -762,6 +791,7 @@ static struct attribute *cxl_region_attrs[] = {
 	&dev_attr_resource.attr,
 	&dev_attr_size.attr,
 	&dev_attr_mode.attr,
+	&dev_attr_spm_node.attr,
 	NULL,
 };
 
diff --git a/drivers/cxl/cxl.h b/drivers/cxl/cxl.h
index 231ddccf8977..ba7cde06dfd3 100644
--- a/drivers/cxl/cxl.h
+++ b/drivers/cxl/cxl.h
@@ -530,6 +530,7 @@ enum cxl_partition_mode {
  * @coord: QoS access coordinates for the region
  * @node_notifier: notifier for setting the access coordinates to node
  * @adist_notifier: notifier for calculating the abstract distance of node
+ * @spm_node: memory can only be added to specific purpose NUMA nodes
  */
 struct cxl_region {
 	struct device dev;
@@ -543,6 +544,7 @@ struct cxl_region {
 	struct access_coordinate coord[ACCESS_COORDINATE_MAX];
 	struct notifier_block node_notifier;
 	struct notifier_block adist_notifier;
+	bool spm_node;
 };
 
 struct cxl_nvdimm_bridge {
diff --git a/drivers/dax/cxl.c b/drivers/dax/cxl.c
index 13cd94d32ff7..968d23fc19ed 100644
--- a/drivers/dax/cxl.c
+++ b/drivers/dax/cxl.c
@@ -27,6 +27,7 @@ static int cxl_dax_region_probe(struct device *dev)
 		.id = -1,
 		.size = range_len(&cxlr_dax->hpa_range),
 		.memmap_on_memory = true,
+		.spm_node = cxlr->spm_node,
 	};
 
 	return PTR_ERR_OR_ZERO(devm_create_dev_dax(&data));
-- 
2.51.1



^ permalink raw reply	[flat|nested] 34+ messages in thread

* [RFC PATCH v2 11/11] [HACK] mm/zswap: compressed ram integration example
  2025-11-12 19:29 [RFC LPC2026 PATCH v2 00/11] Specific Purpose Memory NUMA Nodes Gregory Price
                   ` (9 preceding siblings ...)
  2025-11-12 19:29 ` [RFC PATCH v2 10/11] drivers/cxl: add spm_node bit to cxl region Gregory Price
@ 2025-11-12 19:29 ` Gregory Price
  2025-11-18  7:02 ` [RFC LPC2026 PATCH v2 00/11] Specific Purpose Memory NUMA Nodes Alistair Popple
                   ` (3 subsequent siblings)
  14 siblings, 0 replies; 34+ messages in thread
From: Gregory Price @ 2025-11-12 19:29 UTC (permalink / raw)
  To: linux-mm
  Cc: kernel-team, linux-cxl, linux-kernel, nvdimm, linux-fsdevel,
	cgroups, dave, jonathan.cameron, dave.jiang, alison.schofield,
	vishal.l.verma, ira.weiny, dan.j.williams, longman, akpm, david,
	lorenzo.stoakes, Liam.Howlett, vbabka, rppt, surenb, mhocko,
	osalvador, ziy, matthew.brost, joshua.hahnjy, rakie.kim,
	byungchul, gourry, ying.huang, apopple, mingo, peterz,
	juri.lelli, vincent.guittot, dietmar.eggemann, rostedt, bsegall,
	mgorman, vschneid, tj, hannes, mkoutny, kees, muchun.song,
	roman.gushchin, shakeel.butt, rientjes, jackmanb, cl, harry.yoo,
	axelrasmussen, yuanchu, weixugc, zhengqi.arch, yosry.ahmed,
	nphamcs, chengming.zhou, fabio.m.de.francesco, rrichter, ming.li,
	usamaarif642, brauner, oleg, namcao, escape, dongjoo.seo1

Here is an example of how you might use a SPM memory node.

If there is compressed ram available (in this case, a bit present
in mt_spm_nodelist), we skip the entire software compression process
and memcpy directly to a compressed memory folio, and store the newly
allocated compressed memory page as the zswap entry->handle.

On decompress we do the opposite: copy directly from the stored
page to the destination, and free the compressed memory page.

Note: We do not integrate any compressed memory device checks at
this point because this is a stand-in to demonstrate how the SPM
node allocation mechanism works.

See the "TODO" comment in `zswap_compress_direct()` for more details

In reality, we would want to move this mechanism out of zswap into
its own component (cram.c?), and enable a more direct migrate_page()
call that actually re-maps the page read-only into any mappings, and
then provides a write-fault handler which promotes the page on write.

(Similar to a NUMA Hint Fault, but only on write-access)

This prevents any run-away compression ratio failures, since the
compression ratio would be checked on allocation, rather than allowed
to silently decrease on writes until the device becomes unstable.

Signed-off-by: Gregory Price <gourry@gourry.net>
---
 mm/zswap.c | 66 +++++++++++++++++++++++++++++++++++++++++++++++++++++-
 1 file changed, 65 insertions(+), 1 deletion(-)

diff --git a/mm/zswap.c b/mm/zswap.c
index c1af782e54ec..e6f48a4e90f1 100644
--- a/mm/zswap.c
+++ b/mm/zswap.c
@@ -25,6 +25,7 @@
 #include <linux/scatterlist.h>
 #include <linux/mempolicy.h>
 #include <linux/mempool.h>
+#include <linux/memory-tiers.h>
 #include <crypto/acompress.h>
 #include <linux/zswap.h>
 #include <linux/mm_types.h>
@@ -191,6 +192,7 @@ struct zswap_entry {
 	swp_entry_t swpentry;
 	unsigned int length;
 	bool referenced;
+	bool direct;
 	struct zswap_pool *pool;
 	unsigned long handle;
 	struct obj_cgroup *objcg;
@@ -717,7 +719,8 @@ static void zswap_entry_cache_free(struct zswap_entry *entry)
 static void zswap_entry_free(struct zswap_entry *entry)
 {
 	zswap_lru_del(&zswap_list_lru, entry);
-	zs_free(entry->pool->zs_pool, entry->handle);
+	if (!entry->direct)
+		zs_free(entry->pool->zs_pool, entry->handle);
 	zswap_pool_put(entry->pool);
 	if (entry->objcg) {
 		obj_cgroup_uncharge_zswap(entry->objcg, entry->length);
@@ -851,6 +854,43 @@ static void acomp_ctx_put_unlock(struct crypto_acomp_ctx *acomp_ctx)
 	mutex_unlock(&acomp_ctx->mutex);
 }
 
+static struct page *zswap_compress_direct(struct page *src,
+					  struct zswap_entry *entry)
+{
+	int nid = first_node(mt_spm_nodelist);
+	struct page *dst;
+	gfp_t gfp;
+
+	if (nid == NUMA_NO_NODE)
+		return NULL;
+
+	gfp = GFP_NOWAIT | __GFP_NORETRY | __GFP_HIGHMEM | __GFP_MOVABLE |
+	      __GFP_SPM_NODE;
+	dst = __alloc_pages(gfp, 0, nid, &mt_spm_nodelist);
+	if (!dst)
+		return NULL;
+
+	/*
+	 * TODO: check that the page is safe to use
+	 *
+	 * In a real implementation, we would not be using ZSWAP to demonstrate this
+	 * and instead would implement a new component (compressed_ram, cram.c?)
+	 *
+	 * At this point we would check via some callback that the device's memory
+	 * is actually safe to use - and if not, free the page (without writing to
+	 * it), and kick off kswapd for that node to make room.
+	 *
+	 * Alternatively, if the compressed memory device(s) report a watermark
+	 * crossing via interrupt, a flag can be set that is checked here rather
+	 * that calling back into a device driver.
+	 *
+	 * In this case, we're testing with normal memory, so the memory is always
+	 * safe to use (i.e. no compression ratio to worry about).
+	 */
+	copy_mc_highpage(dst, src);
+	return dst;
+}
+
 static bool zswap_compress(struct page *page, struct zswap_entry *entry,
 			   struct zswap_pool *pool)
 {
@@ -862,6 +902,19 @@ static bool zswap_compress(struct page *page, struct zswap_entry *entry,
 	gfp_t gfp;
 	u8 *dst;
 	bool mapped = false;
+	struct page *zpage;
+
+	/* Try to shunt directly to compressed ram */
+	if (!nodes_empty(mt_spm_nodelist)) {
+		zpage = zswap_compress_direct(page, entry);
+		if (zpage) {
+			entry->handle = (unsigned long)zpage;
+			entry->length = PAGE_SIZE;
+			entry->direct = true;
+			return true;
+		}
+		/* otherwise fallback to normal zswap */
+	}
 
 	acomp_ctx = acomp_ctx_get_cpu_lock(pool);
 	dst = acomp_ctx->buffer;
@@ -939,6 +992,16 @@ static bool zswap_decompress(struct zswap_entry *entry, struct folio *folio)
 	int decomp_ret = 0, dlen = PAGE_SIZE;
 	u8 *src, *obj;
 
+	/* compressed ram page */
+	if (entry->direct) {
+		struct page *src = (struct page *)entry->handle;
+		struct folio *zfolio = page_folio(src);
+
+		memcpy_folio(folio, 0, zfolio, 0, PAGE_SIZE);
+		__free_page(src);
+		goto direct_done;
+	}
+
 	acomp_ctx = acomp_ctx_get_cpu_lock(pool);
 	obj = zs_obj_read_begin(pool->zs_pool, entry->handle, acomp_ctx->buffer);
 
@@ -972,6 +1035,7 @@ static bool zswap_decompress(struct zswap_entry *entry, struct folio *folio)
 	zs_obj_read_end(pool->zs_pool, entry->handle, obj);
 	acomp_ctx_put_unlock(acomp_ctx);
 
+direct_done:
 	if (!decomp_ret && dlen == PAGE_SIZE)
 		return true;
 
-- 
2.51.1



^ permalink raw reply	[flat|nested] 34+ messages in thread

* [PATCH] memory-tiers: multi-definition fixup
  2025-11-12 19:29 ` [RFC PATCH v2 08/11] mm/memory_hotplug: add MHP_SPM_NODE flag Gregory Price
@ 2025-11-13 14:58   ` Gregory Price
  2025-11-13 16:37     ` kernel test robot
  2026-01-15  2:38     ` [PATCH] dax/kmem: add build config for protected dax memory blocks Gregory Price
  0 siblings, 2 replies; 34+ messages in thread
From: Gregory Price @ 2025-11-13 14:58 UTC (permalink / raw)
  To: linux-mm; +Cc: kernel-team, linux-kernel

mt_set_node_type should be static

Signed-off-by: Gregory Price <gourry@gourry.net>
---
 include/linux/memory-tiers.h | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/include/linux/memory-tiers.h b/include/linux/memory-tiers.h
index 59443cbfaec3..aed9dc9e0c82 100644
--- a/include/linux/memory-tiers.h
+++ b/include/linux/memory-tiers.h
@@ -194,7 +194,7 @@ static inline void mt_put_memory_types(struct list_head *memory_types)
 {
 }
 
-int mt_set_node_type(int node, int type)
+static int mt_set_node_type(int node, int type)
 {
 	return 0;
 }
-- 
2.51.1



^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: [PATCH] memory-tiers: multi-definition fixup
  2025-11-13 14:58   ` [PATCH] memory-tiers: multi-definition fixup Gregory Price
@ 2025-11-13 16:37     ` kernel test robot
  2026-01-15  2:38     ` [PATCH] dax/kmem: add build config for protected dax memory blocks Gregory Price
  1 sibling, 0 replies; 34+ messages in thread
From: kernel test robot @ 2025-11-13 16:37 UTC (permalink / raw)
  To: Gregory Price, linux-mm; +Cc: oe-kbuild-all, kernel-team, linux-kernel

Hi Gregory,

kernel test robot noticed the following build warnings:



url:    https://github.com/intel-lab-lkp/linux/commits/UPDATE-20251113-230036/Gregory-Price/mm-constify-oom_control-scan_control-and-alloc_context-nodemask/20251113-033247
base:   the 8th patch of https://lore.kernel.org/r/20251112192936.2574429-9-gourry%40gourry.net
patch link:    https://lore.kernel.org/r/20251113145815.2926823-1-gourry%40gourry.net
patch subject: [PATCH] memory-tiers: multi-definition fixup
config: m68k-allnoconfig (https://download.01.org/0day-ci/archive/20251114/202511140039.XVfj2ju0-lkp@intel.com/config)
compiler: m68k-linux-gcc (GCC) 15.1.0
reproduce (this is a W=1 build): (https://download.01.org/0day-ci/archive/20251114/202511140039.XVfj2ju0-lkp@intel.com/reproduce)

If you fix the issue in a separate patch/commit (i.e. not just a new version of
the same patch/commit), kindly add following tags
| Reported-by: kernel test robot <lkp@intel.com>
| Closes: https://lore.kernel.org/oe-kbuild-all/202511140039.XVfj2ju0-lkp@intel.com/

All warnings (new ones prefixed by >>):

   In file included from mm/oom_kill.c:37:
>> include/linux/memory-tiers.h:197:12: warning: 'mt_set_node_type' defined but not used [-Wunused-function]
     197 | static int mt_set_node_type(int node, int type)
         |            ^~~~~~~~~~~~~~~~


vim +/mt_set_node_type +197 include/linux/memory-tiers.h

   196	
 > 197	static int mt_set_node_type(int node, int type)

-- 
0-DAY CI Kernel Test Service
https://github.com/intel/lkp-tests/wiki


^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: [RFC LPC2026 PATCH v2 00/11] Specific Purpose Memory NUMA Nodes
  2025-11-12 19:29 [RFC LPC2026 PATCH v2 00/11] Specific Purpose Memory NUMA Nodes Gregory Price
                   ` (10 preceding siblings ...)
  2025-11-12 19:29 ` [RFC PATCH v2 11/11] [HACK] mm/zswap: compressed ram integration example Gregory Price
@ 2025-11-18  7:02 ` Alistair Popple
  2025-11-18 10:36   ` Gregory Price
  2025-11-21 21:07   ` Gregory Price
  2025-11-24  9:19 ` David Hildenbrand (Red Hat)
                   ` (2 subsequent siblings)
  14 siblings, 2 replies; 34+ messages in thread
From: Alistair Popple @ 2025-11-18  7:02 UTC (permalink / raw)
  To: Gregory Price
  Cc: linux-mm, kernel-team, linux-cxl, linux-kernel, nvdimm,
	linux-fsdevel, cgroups, dave, jonathan.cameron, dave.jiang,
	alison.schofield, vishal.l.verma, ira.weiny, dan.j.williams,
	longman, akpm, david, lorenzo.stoakes, Liam.Howlett, vbabka,
	rppt, surenb, mhocko, osalvador, ziy, matthew.brost,
	joshua.hahnjy, rakie.kim, byungchul, ying.huang, mingo, peterz,
	juri.lelli, vincent.guittot, dietmar.eggemann, rostedt, bsegall,
	mgorman, vschneid, tj, hannes, mkoutny, kees, muchun.song,
	roman.gushchin, shakeel.butt, rientjes, jackmanb, cl, harry.yoo,
	axelrasmussen, yuanchu, weixugc, zhengqi.arch, yosry.ahmed,
	nphamcs, chengming.zhou, fabio.m.de.francesco, rrichter, ming.li,
	usamaarif642, brauner, oleg, namcao, escape, dongjoo.seo1

On 2025-11-13 at 06:29 +1100, Gregory Price <gourry@gourry.net> wrote...
> This is a code RFC for discussion related to
> 
> "Mempolicy is dead, long live memory policy!"
> https://lpc.events/event/19/contributions/2143/
> 
> base-commit: 24172e0d79900908cf5ebf366600616d29c9b417
> (version notes at end)
> 
> At LSF 2026, I plan to discuss:

Excellent! This all sounds quite interesting to me at least so I've added my two
cents here but looking forward to discussing at LPC.

> - Why? (In short: shunting to DAX is a failed pattern for users)
> - Other designs I considered (mempolicy, cpusets, zone_device)

I'm interested in the contrast with zone_device, and in particular why
device_coherent memory doesn't end up being a good fit for this.

> - Why mempolicy.c and cpusets as-is are insufficient
> - SPM types seeking this form of interface (Accelerator, Compression)

I'm sure you can guess my interest is in GPUs which also have memory some people
consider should only be used for specific purposes :-) Currently our coherent
GPUs online this as a normal NUMA noode, for which we have also generally
found mempolicy, cpusets, etc. inadequate as well, so it will be interesting to
hear what short comings you have been running into (I'm less familiar with the
Compression cases you talk about here though).

> - Platform extensions that would be nice to see (SPM-only Bits)
> 
> Open Questions
> - Single SPM nodemask, or multiple based on features?
> - Apply SPM/SysRAM bit on-boot only or at-hotplug?
> - Allocate extra "possible" NUMA nodes for flexbility?

I guess this might make hotplug easier? Particularly in cases where FW hasn't
created the nodes.

> - Should SPM Nodes be zone-restricted? (MOVABLE only?)

For device based memory I think so - otherwise you can never gurantee devices
can be removed or drivers (if required to access the memory) can be unbound as
you can't migrate things off the memory.

> - How to handle things like reclaim and compaction on these nodes.
> 
> 
> With this set, we aim to enable allocation of "special purpose memory"
> with the page allocator (mm/page_alloc.c) without exposing the same
> memory as "System RAM".  Unless a non-userland component, and does so
> with the GFP_SPM_NODE flag, memory on these nodes cannot be allocated.
> 
> This isolation mechanism is a requirement for memory policies which
> depend on certain sets of memory never being used outside special
> interfaces (such as a specific mm/component or driver).
> 
> We present an example of using this mechanism within ZSWAP, as-if
> a "compressed memory node" was present.  How to describe the features
> of memory present on nodes is left up to comment here and at LPC '26.
> 
> Userspace-driven allocations are restricted by the sysram_nodes mask,
> nothing in userspace can explicitly request memory from SPM nodes.
> 
> Instead, the intent is to create new components which understand memory
> features and register those nodes with those components. This abstracts
> the hardware complexity away from userland while also not requiring new
> memory innovations to carry entirely new allocators.
> 
> The ZSwap example demonstrates this with the `mt_spm_nodemask`.  This
> hack treats all spm nodes as-if they are compressed memory nodes, and
> we bypass the software compression logic in zswap in favor of simply
> copying memory directly to the allocated page.  In a real design

So in your example (I get it's a hack) is the main advantage that you can use
all the same memory allocation policies (eg. cgroups) when needing to allocate
the pages? Given this is ZSwap I guess these pages would never be mapped
directly into user-space but would anything in the design prevent that? For
example could a driver say allocate SPM memory and then explicitly migrate an
existing page to it?

> There are 4 major changes in this set:
> 
> 1) Introducing mt_sysram_nodelist in mm/memory-tiers.c which denotes
>    the set of nodes which are eligible for use as normal system ram
> 
>    Some existing users now pass mt_sysram_nodelist into the page
>    allocator instead of NULL, but passing a NULL pointer in will simply
>    have it replaced by mt_sysram_nodelist anyway.  Should a fully NULL
>    pointer still make it to the page allocator, without GFP_SPM_NODE
>    SPM node zones will simply be skipped.
> 
>    mt_sysram_nodelist is always guaranteed to contain the N_MEMORY nodes
>    present during __init, but if empty the use of mt_sysram_nodes()
>    will return a NULL to preserve current behavior.
> 
> 
> 2) The addition of `cpuset.mems.sysram` which restricts allocations to
>    `mt_sysram_nodes` unless GFP_SPM_NODE is used.
> 
>    SPM Nodes are still allowed in cpuset.mems.allowed and effective.
> 
>    This is done to allow separate control over sysram and SPM node sets
>    by cgroups while maintaining the existing hierarchical rules.
> 
>    current cpuset configuration
>    cpuset.mems_allowed
>     |.mems_effective         < (mems_allowed ∩ parent.mems_effective)
>     |->tasks.mems_allowed    < cpuset.mems_effective
> 
>    new cpuset configuration
>    cpuset.mems_allowed
>     |.mems_effective         < (mems_allowed ∩ parent.mems_effective)
>     |.sysram_nodes           < (mems_effective ∩ default_sys_nodemask)
>     |->task.sysram_nodes     < cpuset.sysram_nodes
> 
>    This means mems_allowed still restricts all node usage in any given
>    task context, which is the existing behavior.
> 
> 3) Addition of MHP_SPM_NODE flag to instruct memory_hotplug.c that the
>    capacity being added should mark the node as an SPM Node. 
> 
>    A node is either SysRAM or SPM - never both.  Attempting to add
>    incompatible memory to a node results in hotplug failure.
> 
>    DAX and CXL are made aware of the bit and have `spm_node` bits added
>    to their relevant subsystems.
> 
> 4) Adding GFP_SPM_NODE - which allows page_alloc.c to request memory
>    from the provided node or nodemask.  It changes the behavior of
>    the cpuset mems_allowed and mt_node_allowed() checks.
> 
> v1->v2:
> - naming improvements
>     default_node -> sysram_node
>     protected    -> spm (Specific Purpose Memory)
> - add missing constify patch
> - add patch to update callers of __cpuset_zone_allowed
> - add additional logic to the mm sysram_nodes patch
> - fix bot build issues (ifdef config builds)
> - fix out-of-tree driver build issues (function renames)
> - change compressed_nodelist to spm_nodelist
> - add latch mechanism for sysram/spm nodes (Dan Williams)
>   this drops some extra memory-hotplug logic which is nice
> v1: https://lore.kernel.org/linux-mm/20251107224956.477056-1-gourry@gourry.net/
> 
> Gregory Price (11):
>   mm: constify oom_control, scan_control, and alloc_context nodemask
>   mm: change callers of __cpuset_zone_allowed to cpuset_zone_allowed
>   gfp: Add GFP_SPM_NODE for Specific Purpose Memory (SPM) allocations
>   memory-tiers: Introduce SysRAM and Specific Purpose Memory Nodes
>   mm: restrict slub, oom, compaction, and page_alloc to sysram by
>     default
>   mm,cpusets: rename task->mems_allowed to task->sysram_nodes
>   cpuset: introduce cpuset.mems.sysram
>   mm/memory_hotplug: add MHP_SPM_NODE flag
>   drivers/dax: add spm_node bit to dev_dax
>   drivers/cxl: add spm_node bit to cxl region
>   [HACK] mm/zswap: compressed ram integration example
> 
>  drivers/cxl/core/region.c       |  30 ++++++
>  drivers/cxl/cxl.h               |   2 +
>  drivers/dax/bus.c               |  39 ++++++++
>  drivers/dax/bus.h               |   1 +
>  drivers/dax/cxl.c               |   1 +
>  drivers/dax/dax-private.h       |   1 +
>  drivers/dax/kmem.c              |   2 +
>  fs/proc/array.c                 |   2 +-
>  include/linux/cpuset.h          |  62 +++++++------
>  include/linux/gfp_types.h       |   5 +
>  include/linux/memory-tiers.h    |  47 ++++++++++
>  include/linux/memory_hotplug.h  |  10 ++
>  include/linux/mempolicy.h       |   2 +-
>  include/linux/mm.h              |   4 +-
>  include/linux/mmzone.h          |   6 +-
>  include/linux/oom.h             |   2 +-
>  include/linux/sched.h           |   6 +-
>  include/linux/swap.h            |   2 +-
>  init/init_task.c                |   2 +-
>  kernel/cgroup/cpuset-internal.h |   8 ++
>  kernel/cgroup/cpuset-v1.c       |   7 ++
>  kernel/cgroup/cpuset.c          | 158 ++++++++++++++++++++------------
>  kernel/fork.c                   |   2 +-
>  kernel/sched/fair.c             |   4 +-
>  mm/compaction.c                 |  10 +-
>  mm/hugetlb.c                    |   8 +-
>  mm/internal.h                   |   2 +-
>  mm/memcontrol.c                 |   3 +-
>  mm/memory-tiers.c               |  66 ++++++++++++-
>  mm/memory_hotplug.c             |   7 ++
>  mm/mempolicy.c                  |  34 +++----
>  mm/migrate.c                    |   4 +-
>  mm/mmzone.c                     |   5 +-
>  mm/oom_kill.c                   |  11 ++-
>  mm/page_alloc.c                 |  57 +++++++-----
>  mm/show_mem.c                   |  11 ++-
>  mm/slub.c                       |  15 ++-
>  mm/vmscan.c                     |   6 +-
>  mm/zswap.c                      |  66 ++++++++++++-
>  39 files changed, 532 insertions(+), 178 deletions(-)
> 
> -- 
> 2.51.1
> 


^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: [RFC LPC2026 PATCH v2 00/11] Specific Purpose Memory NUMA Nodes
  2025-11-18  7:02 ` [RFC LPC2026 PATCH v2 00/11] Specific Purpose Memory NUMA Nodes Alistair Popple
@ 2025-11-18 10:36   ` Gregory Price
  2025-11-21 21:07   ` Gregory Price
  1 sibling, 0 replies; 34+ messages in thread
From: Gregory Price @ 2025-11-18 10:36 UTC (permalink / raw)
  To: Alistair Popple
  Cc: linux-mm, kernel-team, linux-cxl, linux-kernel, nvdimm,
	linux-fsdevel, cgroups, dave, jonathan.cameron, dave.jiang,
	alison.schofield, vishal.l.verma, ira.weiny, dan.j.williams,
	longman, akpm, david, lorenzo.stoakes, Liam.Howlett, vbabka,
	rppt, surenb, mhocko, osalvador, ziy, matthew.brost,
	joshua.hahnjy, rakie.kim, byungchul, ying.huang, mingo, peterz,
	juri.lelli, vincent.guittot, dietmar.eggemann, rostedt, bsegall,
	mgorman, vschneid, tj, hannes, mkoutny, kees, muchun.song,
	roman.gushchin, shakeel.butt, rientjes, jackmanb, cl, harry.yoo,
	axelrasmussen, yuanchu, weixugc, zhengqi.arch, yosry.ahmed,
	nphamcs, chengming.zhou, fabio.m.de.francesco, rrichter, ming.li,
	usamaarif642, brauner, oleg, namcao, escape, dongjoo.seo1

On Tue, Nov 18, 2025 at 06:02:02PM +1100, Alistair Popple wrote:
> On 2025-11-13 at 06:29 +1100, Gregory Price <gourry@gourry.net> wrote...
> > - Why? (In short: shunting to DAX is a failed pattern for users)
> > - Other designs I considered (mempolicy, cpusets, zone_device)
> 
> I'm interested in the contrast with zone_device, and in particular why
> device_coherent memory doesn't end up being a good fit for this.
>

I did consider zone_device briefly, but if you want sparse allocation
you end up essentially re-implementing some form of buddy allocator.

That seemed less then ideal, to say the least.

Additionally, pgmap use precludes these pages from using LRU/Reclaim,
and some devices may very well be compatible with such patterns.

(I think compression will be, but it still needs work)

> > - Why mempolicy.c and cpusets as-is are insufficient
> > - SPM types seeking this form of interface (Accelerator, Compression)
> 
> I'm sure you can guess my interest is in GPUs which also have memory some people
> consider should only be used for specific purposes :-) Currently our coherent
> GPUs online this as a normal NUMA noode, for which we have also generally
> found mempolicy, cpusets, etc. inadequate as well, so it will be interesting to
> hear what short comings you have been running into (I'm less familiar with the
> Compression cases you talk about here though).
> 

The TL;DR:

cpusets as-designed doesn't really allow the concept of "Nothing can
access XYZ node except specific things" because this would involve
removing a node from the root cpusets.mems - and that can't be loosened.

mempolicy is more of a suggestion and can be completely overridden. It
is entirely ignored by things like demotion/reclaim/etc.

I plan to discuss a bit of the specifics at LPC, but a lot of this stems
from the zone-iteration logic in page_alloc.c and the rather... ermm...
"complex" nature of how mempolicy and cpusets interacts with each other.

I may add some additional notes on this thread prior to LPC given that
time may be too short to get into the nasty bits in the session.

> > - Platform extensions that would be nice to see (SPM-only Bits)
> > 
> > Open Questions
> > - Single SPM nodemask, or multiple based on features?
> > - Apply SPM/SysRAM bit on-boot only or at-hotplug?
> > - Allocate extra "possible" NUMA nodes for flexbility?
> 
> I guess this might make hotplug easier? Particularly in cases where FW hasn't
> created the nodes.
>

In cases where you need to reach back to the device for some signal, you
likely need to have the driver for that device manage the alloc/free
patterns - so this may (or may not) generalize to 1-device-per-node.

In the scenario where you want some flexibility in managing regions,
this may require multiple nodes for device.  Maybe one device provides
multiple types of memory - you want those on separate nodes.

This doesn't seem like something you need to solve right away, just
something for folks to consider.

> > - Should SPM Nodes be zone-restricted? (MOVABLE only?)
> 
> For device based memory I think so - otherwise you can never gurantee devices
> can be removed or drivers (if required to access the memory) can be unbound as
> you can't migrate things off the memory.
> 

Zones in this scenario are bit of a square-peg/round-hole.  Forcing
everything in ZONE_MOVABLE means you can't do page pinning or things
like 1GB gigantic pages.  But the device driver should be capable of
managing hotplug anyway, so what's the point of ZONE_MOVABLE? :shrug:

> > The ZSwap example demonstrates this with the `mt_spm_nodemask`.  This
> > hack treats all spm nodes as-if they are compressed memory nodes, and
> > we bypass the software compression logic in zswap in favor of simply
> > copying memory directly to the allocated page.  In a real design
> 
> So in your example (I get it's a hack) is the main advantage that you can use
> all the same memory allocation policies (eg. cgroups) when needing to allocate
> the pages? Given this is ZSwap I guess these pages would never be mapped
> directly into user-space but would anything in the design prevent that? 

This is, in-fact, the long term intent. As long as the device can manage
inline decompression with reasonable latencies, there's no reason you
shouldn't be able to leave the pages mapped Read-Only in user-space.

The driver would be responsible for migrating on write-fault, similar to
a NUMA Hint Fault on the existing transparent page placement system.

> For example could a driver say allocate SPM memory and then explicitly
> migrate an existing page to it?

You might even extend migrate_pages with a new flag that simply drops
the write-able flag from the page table mapping and abstract that entire
complexity out of the driver :]

~Gregory

^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: [RFC LPC2026 PATCH v2 00/11] Specific Purpose Memory NUMA Nodes
  2025-11-18  7:02 ` [RFC LPC2026 PATCH v2 00/11] Specific Purpose Memory NUMA Nodes Alistair Popple
  2025-11-18 10:36   ` Gregory Price
@ 2025-11-21 21:07   ` Gregory Price
  2025-11-23 23:09     ` Alistair Popple
  1 sibling, 1 reply; 34+ messages in thread
From: Gregory Price @ 2025-11-21 21:07 UTC (permalink / raw)
  To: Alistair Popple
  Cc: linux-mm, kernel-team, linux-cxl, linux-kernel, nvdimm,
	linux-fsdevel, cgroups, dave, jonathan.cameron, dave.jiang,
	alison.schofield, vishal.l.verma, ira.weiny, dan.j.williams,
	longman, akpm, david, lorenzo.stoakes, Liam.Howlett, vbabka,
	rppt, surenb, mhocko, osalvador, ziy, matthew.brost,
	joshua.hahnjy, rakie.kim, byungchul, ying.huang, mingo, peterz,
	juri.lelli, vincent.guittot, dietmar.eggemann, rostedt, bsegall,
	mgorman, vschneid, tj, hannes, mkoutny, kees, muchun.song,
	roman.gushchin, shakeel.butt, rientjes, jackmanb, cl, harry.yoo,
	axelrasmussen, yuanchu, weixugc, zhengqi.arch, yosry.ahmed,
	nphamcs, chengming.zhou, fabio.m.de.francesco, rrichter, ming.li,
	usamaarif642, brauner, oleg, namcao, escape, dongjoo.seo1

On Tue, Nov 18, 2025 at 06:02:02PM +1100, Alistair Popple wrote:
> 
> I'm interested in the contrast with zone_device, and in particular why
> device_coherent memory doesn't end up being a good fit for this.
> 
> > - Why mempolicy.c and cpusets as-is are insufficient
> > - SPM types seeking this form of interface (Accelerator, Compression)
> 
> I'm sure you can guess my interest is in GPUs which also have memory some people
> consider should only be used for specific purposes :-) Currently our coherent
> GPUs online this as a normal NUMA noode, for which we have also generally
> found mempolicy, cpusets, etc. inadequate as well, so it will be interesting to
> hear what short comings you have been running into (I'm less familiar with the
> Compression cases you talk about here though).
> 

after some thought, talks, and doc readings it seems like the
zone_device setups don't allow the CPU to map the devmem into page
tables, and instead depends on migrate_device logic (unless the docs are
out of sync with the code these days).  That's at least what's described
in hmm and migrate_device.  

Assuming this is out of date and ZONE_DEVICE memory is mappable into
page tables, assuming you want sparse allocation, ZONE_DEVICE seems to
suggest you at least have to re-implement the buddy logic (which isn't
that tall of an ask).

But I could imagine an (overly simplistic) pattern with SPM Nodes:

fd = open("/dev/gpu_mem", ...)
buf = mmap(fd, ...)
buf[0] 
   1) driver takes the fault
   2) driver calls alloc_page(..., gpu_node, GFP_SPM_NODE)
   3) driver manages any special page table masks
      Like marking pages RO/RW to manage ownership.
   4) driver sends the gpu the (mapping_id, pfn, index) information
      so that gpu can map the region in its page tables.
   5) since the memory is cache coherent, gpu and cpu are free to
      operate directly on the pages without any additional magic
      (except typical concurrency controls).

Driver doesn't have to do much in the way of allocationg management.

This is probably less compelling since you don't want general purposes
services like reclaim, migration, compaction, tiering - etc.  

The value is clearly that you get to manage GPU memory like any other
memory, but without worry that other parts of the system will touch it.

I'm much more focused on the "I have memory that is otherwise general
purpose, and wants services like reclaim and compaction, but I want
strong controls over how things can land there in the first place".

~Gregory

^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: [RFC LPC2026 PATCH v2 00/11] Specific Purpose Memory NUMA Nodes
  2025-11-21 21:07   ` Gregory Price
@ 2025-11-23 23:09     ` Alistair Popple
  2025-11-24 15:28       ` Gregory Price
  0 siblings, 1 reply; 34+ messages in thread
From: Alistair Popple @ 2025-11-23 23:09 UTC (permalink / raw)
  To: Gregory Price
  Cc: linux-mm, kernel-team, linux-cxl, linux-kernel, nvdimm,
	linux-fsdevel, cgroups, dave, jonathan.cameron, dave.jiang,
	alison.schofield, vishal.l.verma, ira.weiny, dan.j.williams,
	longman, akpm, david, lorenzo.stoakes, Liam.Howlett, vbabka,
	rppt, surenb, mhocko, osalvador, ziy, matthew.brost,
	joshua.hahnjy, rakie.kim, byungchul, ying.huang, mingo, peterz,
	juri.lelli, vincent.guittot, dietmar.eggemann, rostedt, bsegall,
	mgorman, vschneid, tj, hannes, mkoutny, kees, muchun.song,
	roman.gushchin, shakeel.butt, rientjes, jackmanb, cl, harry.yoo,
	axelrasmussen, yuanchu, weixugc, zhengqi.arch, yosry.ahmed,
	nphamcs, chengming.zhou, fabio.m.de.francesco, rrichter, ming.li,
	usamaarif642, brauner, oleg, namcao, escape, dongjoo.seo1

On 2025-11-22 at 08:07 +1100, Gregory Price <gourry@gourry.net> wrote...
> On Tue, Nov 18, 2025 at 06:02:02PM +1100, Alistair Popple wrote:
> > 
> > I'm interested in the contrast with zone_device, and in particular why
> > device_coherent memory doesn't end up being a good fit for this.
> > 
> > > - Why mempolicy.c and cpusets as-is are insufficient
> > > - SPM types seeking this form of interface (Accelerator, Compression)
> > 
> > I'm sure you can guess my interest is in GPUs which also have memory some people
> > consider should only be used for specific purposes :-) Currently our coherent
> > GPUs online this as a normal NUMA noode, for which we have also generally
> > found mempolicy, cpusets, etc. inadequate as well, so it will be interesting to
> > hear what short comings you have been running into (I'm less familiar with the
> > Compression cases you talk about here though).
> > 
> 
> after some thought, talks, and doc readings it seems like the
> zone_device setups don't allow the CPU to map the devmem into page
> tables, and instead depends on migrate_device logic (unless the docs are
> out of sync with the code these days).  That's at least what's described
> in hmm and migrate_device.  

There are multiple types here (DEVICE_PRIVATE and DEVICE_COHERENT). The former
is mostly irrelevant for this discussion but I'm including the descriptions here
for completeness. You are correct in saying that the only way either of these
currently get mapped into the page tables is via explicit migration of memory
to ZONE_DEVICE by a driver. There is also a corner case for first touch handling
which allows drivers to establish mappings to zero pages on a device if the page
hasn't been populated previously on the CPU.

These pages can, in some sense at least, be mapped on the CPU. DEVICE_COHERENT
pages are mapped normally (ie. CPU can access these directly) where as
DEVICE_PRIVATE pages are mapped using special swap entries so drivers can
emulate coherence by migrating pages back. This is used by devices without
coherent interconnects (ie. PCIe) where as the former could be used by eg. CXL.

> Assuming this is out of date and ZONE_DEVICE memory is mappable into
> page tables, assuming you want sparse allocation, ZONE_DEVICE seems to
> suggest you at least have to re-implement the buddy logic (which isn't
> that tall of an ask).

That's basically what happens - GPU drivers need memory allocation and therefore
re-implement some form of memory allocator. Agree that just being able to
reuse the buddy logic probably isn't that compelling though and isn't really of
interest (hence some of my original questions on what this is about).

> But I could imagine an (overly simplistic) pattern with SPM Nodes:
> 
> fd = open("/dev/gpu_mem", ...)
> buf = mmap(fd, ...)
> buf[0] 
>    1) driver takes the fault
>    2) driver calls alloc_page(..., gpu_node, GFP_SPM_NODE)
>    3) driver manages any special page table masks
>       Like marking pages RO/RW to manage ownership.

Of course as an aside this needs to match the CPU PTEs logic (this what
hmm_range_fault() is primarily used for).

>    4) driver sends the gpu the (mapping_id, pfn, index) information
>       so that gpu can map the region in its page tables.

On coherent systems this often just uses HW address translation services
(ATS), although I think the specific implementation of how page-tables are
mirrored/shared is orthogonal to this.

>    5) since the memory is cache coherent, gpu and cpu are free to
>       operate directly on the pages without any additional magic
>       (except typical concurrency controls).

This is roughly how things work with DEVICE_PRIVATE/COHERENT memory today,
except in the case of DEVICE_PRIVATE in step (5) above. In that case the page is
mapped as a non-present special swap entry that triggers a driver callback due
to the lack of cache coherence.

> Driver doesn't have to do much in the way of allocationg management.
> 
> This is probably less compelling since you don't want general purposes
> services like reclaim, migration, compaction, tiering - etc.  

On at least some of our systems I'm told we do want this, hence my interest
here. Currently we have systems not using DEVICE_COHERENT and instead just
onlining everything as normal system managed memory in order to get reclaim
and tiering. Of course then people complain that it's managed as normal system
memory and non-GPU related things (ie. page-cache) end up in what's viewed as
special purpose memory.

> The value is clearly that you get to manage GPU memory like any other
> memory, but without worry that other parts of the system will touch it.
> 
> I'm much more focused on the "I have memory that is otherwise general
> purpose, and wants services like reclaim and compaction, but I want
> strong controls over how things can land there in the first place".

So maybe there is some overlap here - what I have is memoy that we want managed
much like normal memory but with strong controls over what it can be used for
(ie. just for tasks utilising the processing element on the accelerator).

 - Alistair

> ~Gregory
> 

^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: [RFC LPC2026 PATCH v2 00/11] Specific Purpose Memory NUMA Nodes
  2025-11-12 19:29 [RFC LPC2026 PATCH v2 00/11] Specific Purpose Memory NUMA Nodes Gregory Price
                   ` (11 preceding siblings ...)
  2025-11-18  7:02 ` [RFC LPC2026 PATCH v2 00/11] Specific Purpose Memory NUMA Nodes Alistair Popple
@ 2025-11-24  9:19 ` David Hildenbrand (Red Hat)
  2025-11-24 18:06   ` Gregory Price
  2025-11-25 14:09 ` Kiryl Shutsemau
  2025-11-26  3:23 ` Balbir Singh
  14 siblings, 1 reply; 34+ messages in thread
From: David Hildenbrand (Red Hat) @ 2025-11-24  9:19 UTC (permalink / raw)
  To: Gregory Price, linux-mm
  Cc: kernel-team, linux-cxl, linux-kernel, nvdimm, linux-fsdevel,
	cgroups, dave, jonathan.cameron, dave.jiang, alison.schofield,
	vishal.l.verma, ira.weiny, dan.j.williams, longman, akpm,
	lorenzo.stoakes, Liam.Howlett, vbabka, rppt, surenb, mhocko,
	osalvador, ziy, matthew.brost, joshua.hahnjy, rakie.kim,
	byungchul, ying.huang, apopple, mingo, peterz, juri.lelli,
	vincent.guittot, dietmar.eggemann, rostedt, bsegall, mgorman,
	vschneid, tj, hannes, mkoutny, kees, muchun.song, roman.gushchin,
	shakeel.butt, rientjes, jackmanb, cl, harry.yoo, axelrasmussen,
	yuanchu, weixugc, zhengqi.arch, yosry.ahmed, nphamcs,
	chengming.zhou, fabio.m.de.francesco, rrichter, ming.li,
	usamaarif642, brauner, oleg, namcao, escape, dongjoo.seo1

[...]

> 2) The addition of `cpuset.mems.sysram` which restricts allocations to
>     `mt_sysram_nodes` unless GFP_SPM_NODE is used.
> 
>     SPM Nodes are still allowed in cpuset.mems.allowed and effective.
> 
>     This is done to allow separate control over sysram and SPM node sets
>     by cgroups while maintaining the existing hierarchical rules.
> 
>     current cpuset configuration
>     cpuset.mems_allowed
>      |.mems_effective         < (mems_allowed ∩ parent.mems_effective)
>      |->tasks.mems_allowed    < cpuset.mems_effective
> 
>     new cpuset configuration
>     cpuset.mems_allowed
>      |.mems_effective         < (mems_allowed ∩ parent.mems_effective)
>      |.sysram_nodes           < (mems_effective ∩ default_sys_nodemask)
>      |->task.sysram_nodes     < cpuset.sysram_nodes
> 
>     This means mems_allowed still restricts all node usage in any given
>     task context, which is the existing behavior.
> 
> 3) Addition of MHP_SPM_NODE flag to instruct memory_hotplug.c that the
>     capacity being added should mark the node as an SPM Node.

Sounds a bit like the wrong interface for configuring this. This smells 
like a per-node setting that should be configured before hotplugging any 
memory.

> 
>     A node is either SysRAM or SPM - never both.  Attempting to add
>     incompatible memory to a node results in hotplug failure.
> 
>     DAX and CXL are made aware of the bit and have `spm_node` bits added
>     to their relevant subsystems.
> 
> 4) Adding GFP_SPM_NODE - which allows page_alloc.c to request memory
>     from the provided node or nodemask.  It changes the behavior of
>     the cpuset mems_allowed and mt_node_allowed() checks.

I wonder why that is required. Couldn't we disallow allocation from one 
of these special nodes as default, and only allow it if someone 
explicitly passes in the node for allocation?

What's the problem with that?

-- 
Cheers

David


^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: [RFC LPC2026 PATCH v2 00/11] Specific Purpose Memory NUMA Nodes
  2025-11-23 23:09     ` Alistair Popple
@ 2025-11-24 15:28       ` Gregory Price
  2025-11-27  5:03         ` Alistair Popple
  0 siblings, 1 reply; 34+ messages in thread
From: Gregory Price @ 2025-11-24 15:28 UTC (permalink / raw)
  To: Alistair Popple
  Cc: linux-mm, kernel-team, linux-cxl, linux-kernel, nvdimm,
	linux-fsdevel, cgroups, dave, jonathan.cameron, dave.jiang,
	alison.schofield, vishal.l.verma, ira.weiny, dan.j.williams,
	longman, akpm, david, lorenzo.stoakes, Liam.Howlett, vbabka,
	rppt, surenb, mhocko, osalvador, ziy, matthew.brost,
	joshua.hahnjy, rakie.kim, byungchul, ying.huang, mingo, peterz,
	juri.lelli, vincent.guittot, dietmar.eggemann, rostedt, bsegall,
	mgorman, vschneid, tj, hannes, mkoutny, kees, muchun.song,
	roman.gushchin, shakeel.butt, rientjes, jackmanb, cl, harry.yoo,
	axelrasmussen, yuanchu, weixugc, zhengqi.arch, yosry.ahmed,
	nphamcs, chengming.zhou, fabio.m.de.francesco, rrichter, ming.li,
	usamaarif642, brauner, oleg, namcao, escape, dongjoo.seo1

On Mon, Nov 24, 2025 at 10:09:37AM +1100, Alistair Popple wrote:
> On 2025-11-22 at 08:07 +1100, Gregory Price <gourry@gourry.net> wrote...
> > On Tue, Nov 18, 2025 at 06:02:02PM +1100, Alistair Popple wrote:
> > > 
> 
> There are multiple types here (DEVICE_PRIVATE and DEVICE_COHERENT). The former
> is mostly irrelevant for this discussion but I'm including the descriptions here
> for completeness.
> 

I appreciate you taking the time here.  I'll maybe try to look at
updating the docs as this evolves.

> > But I could imagine an (overly simplistic) pattern with SPM Nodes:
> > 
> > fd = open("/dev/gpu_mem", ...)
> > buf = mmap(fd, ...)
> > buf[0] 
> >    1) driver takes the fault
> >    2) driver calls alloc_page(..., gpu_node, GFP_SPM_NODE)
> >    3) driver manages any special page table masks
> >       Like marking pages RO/RW to manage ownership.
> 
> Of course as an aside this needs to match the CPU PTEs logic (this what
> hmm_range_fault() is primarily used for).
>

This is actually the most interesting part of series for me.  I'm using
a compressed memory device as a stand-in for a memory type that requires
special page table entries (RO) to avoid compression ratios tanking
(resulting, eventually, in a MCE as there's no way to slow things down).

You can somewhat "Get there from here" through device coherent
ZONE_DEVICE, but you still don't have access to basic services like
compaction and reclaim - which you absolutely do want for such a memory
type (for the same reasons we groom zswap and zram).

I wonder if we can even re-use the hmm interfaces for SPM nodes to make
managing special page table policies easier as well.  That seems
promising.

I said this during LSFMM: Without isolation, "memory policy" is really
just a suggestion.  What we're describing here is all predicated on
isolation work, and all of a sudden much clearer examples of managing
memory on NUMA boundaries starts to make a little more sense.

> >    4) driver sends the gpu the (mapping_id, pfn, index) information
> >       so that gpu can map the region in its page tables.
> 
> On coherent systems this often just uses HW address translation services
> (ATS), although I think the specific implementation of how page-tables are
> mirrored/shared is orthogonal to this.
>

Yeah this part is completely foreign to me, I just presume there's some
way to tell the GPU how to recontruct the virtually contiguous setup.

That mechanism would be entirely reusable here (I assume).

> This is roughly how things work with DEVICE_PRIVATE/COHERENT memory today,
> except in the case of DEVICE_PRIVATE in step (5) above. In that case the page is
> mapped as a non-present special swap entry that triggers a driver callback due
> to the lack of cache coherence.
> 

Btw, just an aside, Lorenzo is moving to rename these entries to
softleaf (software-leaf) entries. I think you'll find it welcome.
https://lore.kernel.org/linux-mm/c879383aac77d96a03e4d38f7daba893cd35fc76.1762812360.git.lorenzo.stoakes@oracle.com/

> > Driver doesn't have to do much in the way of allocationg management.
> > 
> > This is probably less compelling since you don't want general purposes
> > services like reclaim, migration, compaction, tiering - etc.  
> 
> On at least some of our systems I'm told we do want this, hence my interest
> here. Currently we have systems not using DEVICE_COHERENT and instead just
> onlining everything as normal system managed memory in order to get reclaim
> and tiering. Of course then people complain that it's managed as normal system
> memory and non-GPU related things (ie. page-cache) end up in what's viewed as
> special purpose memory.
> 

Ok, so now this gets interesting then.  I don't understand how this
makes sense (not saying it doesn't, I simply don't understand).

I would presume that under no circumstance do you want device memory to
just suddenly disappear without some coordination from the driver.

Whether it's compaction or reclaim, you have some thread that's going to
migrate a virtual mapping from HPA(A) to HPA(B) and HPA(B) may or may not
even map to the same memory device.

That thread may not even be called in the context of a thread which
accesses GPU memory (although, I think we could enforce that on top
of SPM nodes, but devil is in the details).

Maybe that "all magically works" because of the ATS described above?

I suppose this assumes you have some kind of unified memory view between
host and device memory?  Are there docs here you can point me at that
might explain this wizardry?  (Sincerely, this is fascinating)

> > The value is clearly that you get to manage GPU memory like any other
> > memory, but without worry that other parts of the system will touch it.
> > 
> > I'm much more focused on the "I have memory that is otherwise general
> > purpose, and wants services like reclaim and compaction, but I want
> > strong controls over how things can land there in the first place".
> 
> So maybe there is some overlap here - what I have is memoy that we want managed
> much like normal memory but with strong controls over what it can be used for
> (ie. just for tasks utilising the processing element on the accelerator).
> 

I think it might be great if we could discuss this a bit more in-depth,
as i've already been considering very mild refactors to reclaim to
enable a driver to engage it with an SPM node as the only shrink target.

This all becomes much more complicated due to per-memcg LRUs and such.

All that said, I'm focused on the isolation / allocation pieces first.
If that can't be agreed upon, the rest isn't worth exploring.

I do have a mild extension to mempolicy that allows mbind() to hit an
SPM node as an example as well.  I'll discuss this in the response to
David's thread, as he had some related questions about the GFP flag.

~Gregory

^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: [RFC LPC2026 PATCH v2 00/11] Specific Purpose Memory NUMA Nodes
  2025-11-24  9:19 ` David Hildenbrand (Red Hat)
@ 2025-11-24 18:06   ` Gregory Price
  2025-12-10 23:29     ` Yiannis Nikolakopoulos
  0 siblings, 1 reply; 34+ messages in thread
From: Gregory Price @ 2025-11-24 18:06 UTC (permalink / raw)
  To: David Hildenbrand (Red Hat)
  Cc: linux-mm, kernel-team, linux-cxl, linux-kernel, nvdimm,
	linux-fsdevel, cgroups, dave, jonathan.cameron, dave.jiang,
	alison.schofield, vishal.l.verma, ira.weiny, dan.j.williams,
	longman, akpm, lorenzo.stoakes, Liam.Howlett, vbabka, rppt,
	surenb, mhocko, osalvador, ziy, matthew.brost, joshua.hahnjy,
	rakie.kim, byungchul, ying.huang, apopple, mingo, peterz,
	juri.lelli, vincent.guittot, dietmar.eggemann, rostedt, bsegall,
	mgorman, vschneid, tj, hannes, mkoutny, kees, muchun.song,
	roman.gushchin, shakeel.butt, rientjes, jackmanb, cl, harry.yoo,
	axelrasmussen, yuanchu, weixugc, zhengqi.arch, yosry.ahmed,
	nphamcs, chengming.zhou, fabio.m.de.francesco, rrichter, ming.li,
	usamaarif642, brauner, oleg, namcao, escape, dongjoo.seo1

On Mon, Nov 24, 2025 at 10:19:37AM +0100, David Hildenbrand (Red Hat) wrote:
> [...]
> 

Apologies in advance for the wall of text, both of your questions really
do cut to the core of the series.  The first (SPM nodes) is basically a
plumbing problem I haven't had time to address pre-LPC, the second (GFP)
is actually a design decision that is definitely up in the air.

So consider this a dump of everything I wouldn't have had time to cover
in the LPC session.

> > 3) Addition of MHP_SPM_NODE flag to instruct memory_hotplug.c that the
> >     capacity being added should mark the node as an SPM Node.
> 
> Sounds a bit like the wrong interface for configuring this. This smells like
> a per-node setting that should be configured before hotplugging any memory.
> 

Assuming you're specifically talking about the MHP portion of this.

I agree, and I think the plumbing ultimately goes through acpi and
kernel configs.  This was my shortest path to demonstrate a functional
prototype by LPC.

I think the most likely option simply reserving additional NUMA nodes for
hotpluggable regions based on a Kconfig setting.

I think the real setup process should look like follows:

1. At __init time, Linux reserves additional SPM nodes based on some
   configuration (build? runtime? etc)

   Essentially create:  nodes[N_SPM]

2. At SPM setup time, a driver registers an "Abstract Type" with
   mm/memory_tiers.c  which maps SPM->Type.

   This gives the core some management callback infrastructure without
   polluting the core with device specific nonsense.

   This also gives the driver a change to define things like SLIT
   distances for those nodes, which otherwise won't exist.

3. At hotplug time, memory-hotplug.c should only have to flip a bit
   in `mt_sysram_nodes` if NID is not in nodes[N_SPM].  That logic
   is still there to ensure the base filtering works as intended.

I haven't quite figured out how to plumb out nodes[N_SPM] as described
above, but I did figure out how to demonstrate roughly the same effect
through memory-hotplug.c - hopefully that much is clear.

The problem with the above plan, is whether that "Makes sense" according
to ACPI specs and friends.

This operates in "Ambiguity Land", which is uncomfortable.

======== How Linux ingests ACPI Tables to make NUMA nodes =======
For the sake of completeness:

NUMA nodes are "marked as possible" primarily via entries in the ACPI
SRAT (Static Resource Affinity Table).
https://docs.kernel.org/driver-api/cxl/platform/acpi/srat.html

        Subtable Type : 01 [Memory Affinity]
               Length : 28
     Proximity Domain : 00000001          <- NUMA Node 1

A proximity domain (PXM) is simply a logical grouping of components
according to the OSPM.  Linux takes PXMs and maps them to NUMA nodes.

In most cases (NR_PXM == NR_NODES), but not always.  For example, if
the CXL Early Detection Table (CEDT) describes a CXL memory region for
which there is no SRAT entry, Linux reserves a "Fake PXM" id and
marks that ID as a "possible" NUMA node.

= drivers/acpi/numa/srat.c

int __init acpi_numa_init(void)
{
...
        /* fake_pxm is the next unused PXM value after SRAT parsing */
        for (i = 0, fake_pxm = -1; i < MAX_NUMNODES; i++) {
                if (node_to_pxm_map[i] > fake_pxm)
                        fake_pxm = node_to_pxm_map[i];
        }
        last_real_pxm = fake_pxm;
        fake_pxm++;
        acpi_table_parse_cedt(ACPI_CEDT_TYPE_CFMWS, acpi_parse_cfmws,
                              &fake_pxm);
...
}

static int __init acpi_parse_cfmws(union acpi_subtable_headers *header,
                                   void *arg, const unsigned long table_end)
{
...
        /* No SRAT description. Create a new node. */
        node = acpi_map_pxm_to_node(*fake_pxm);
...
        node_set(node, numa_nodes_parsed);   <- this is used to set N_POSSIBLE
}

Here's where we get into "Specification Ambiguity"

The ACPI spec does not limit (as far as I can see) a memory region from
being associated with multiple proximity domains (NUMA nodes).

Therefore, the OSPM could actually report it multiple times in the SRAT
in order to reserve multiple NUMA node possiblities for the same device.

A further extention to ACPI could be used to mark such Memory PXMs as
"Specific Purpose" - similar to the EFI_MEMORY_SP bit used to mark
memory regions as "Soft Reserved".

(this would probably break quite a lot of existing linux code, which
 a quick browse around gives you the sense that there's an assumption
 a given page can only be affiliated with one possible numa node)

But Linux could also utilize build or runtime settings to add additional
nodes which are reserved for SPM use - but are otherwise left out of
all the default maps.  This at least seems reasonable.

Note: N_POSSIBLE nodes is set at __init time, and is more or less
expected to never change.  It's probably preferable to work with this
restriction, rather than to try to change it.  Many race conditions.

<skippable wall>
================= Spec nonsense for reference ====================
(ACPI 6.5 Spec)
5.2.16.2 Memory Affinity Structure
The Memory Affinity structure provides the following topology information statically to the operating system:
• The association between a memory range and the proximity domain to which it belongs
• Information about whether the memory range can be hot-plugged.

5.2.19 Maximum System Characteristics Table (MSCT)
This section describes the format of the Maximum System Characteristic Table (MSCT), which provides OSPM with
information characteristics of a system’s maximum topology capabilities. If the system maximum topology is not
known up front at boot time, then this table is not present. OSPM will use information provided by the MSCT only
when the System Resource Affinity Table (SRAT) exists. The MSCT must contain all proximity and clock domains
defined in the SRAT.

-- field: Maximum Number of Proximity Domains
   Indicates the maximum number of Proximity Domains ever possible in the system.

   In theory an OSPM could make (MAX_NODES > (NR_NODES in SRAT)) and
   that delta could be used to indicate the presense of SPM nodes.

   This doesn't solve the SLIT PXM distance problem.

6.2.14 _PXM (Proximity)
This optional object is used to describe proximity domain associations within a machine. _PXM evaluates to an integer
that identifies a device as belonging to a Proximity Domain defined in the System Resource Affinity Table (SRAT).
OSPM assumes that two devices in the same proximity domain are tightly coupled.

17.2.1 System Resource Affinity Table Definition
The optional System Resource Affinity Table (SRAT) provides the boot time description of the processor and memory
ranges belonging to a system locality. OSPM will consume the SRAT only at boot time. For any devices not in the
SRAT, OSPM should use _PXM (Proximity) for them or their ancestors that are hot-added into the system after boot
up.

The SRAT describes the system locality that all processors and memory present in a system belong to at system boot.
This includes memory that can be hot-added (that is memory that can be added to the system while it is running,
without requiring a reboot). OSPM can use this information to optimize the performance of NUMA architecture
systems. For example, OSPM could utilize this information to optimize allocation of memory resources and the
scheduling of software threads.

=============================================================
</skippable wall>

So TL;DR: Yes, I agree, this logic should __init time configured, but
while we work on that plumbing, the memory-hotplug.c interface can be
used to unblock exploratory work (such as Alistair's GPU interests).

> > 4) Adding GFP_SPM_NODE - which allows page_alloc.c to request memory
> >     from the provided node or nodemask.  It changes the behavior of
> >     the cpuset mems_allowed and mt_node_allowed() checks.
> 
> I wonder why that is required. Couldn't we disallow allocation from one of
> these special nodes as default, and only allow it if someone explicitly
> passes in the node for allocation?
> 
> What's the problem with that?
> 

Simple answer:  We can choose how hard this guardrail is to break.

This initial attempt makes it "Hard":
   You cannot "accidentally" allocate SPM, the call must be explicit.

Removing the GFP would work, and make it "Easier" to access SPM memory.
(There would be other adjustments needed, but the idea is the same).

To do this you would revert the mems_allowed check changes in cpuset
to check mems_allowed always (instead of sysram_nodes).

This would allow a trivial 

   mbind(range, SPM_NODE_ID)

Which is great, but is also an incredible tripping hazard:

   numactl --interleave --all

and in kernel land:

   __alloc_pages_noprof(..., nodes[N_MEMORY])

These will now instantly be subject to SPM node memory.

The first pass leverages the GFP flag to make all these tripping hazards
disappear.  You can pass a completely garbage nodemask into the page
allocator and still rest assured that you won't touch SPM nodes.

So TL;DR: "What do we want here?" (if anything at all)

For completeness, here are the page_alloc/cpuset/mempolicy interactions
which lead me to a GFP flag as the "loosening mechanism" for the filter,
rather than allowing any nodemask to "just work".

Apologies again for the wall of text here, essentially dumping
~6 months of research and prototyping.

====================
There are basically 3 components which interact with each other:

   1) the page allocator nodemask / zone logic
   2) cpuset.mems_allowed
   3) mempolicy (task, vma)

   and now:

   4) GFP_SPM_NODE

=== 1) the page allocator nodemask and zone iteration logic

   - page allocator uses prepare_alloc_pages() to decide what
     alloc_context.nodemask will contain

   - nodemask can be NULL or a set of nodes.

   - for_zone() iteration logic will iterate all zones if mask=NULL
     Otherwise, it skips zones on nodes not present in the mask

   - the value of alloc_context.nodemask may change
     for example it may end up loosened if in an interrupt context or
     if reclaim/compaction/fallbacks are invoked.

Some issues might be obvious:

   It would be bad, for example, for an interrupt to have its allocation
   context loosened to nodes[N_MEMORY] and end up allocating SPM memory

   Capturing all of these scenarios would be very difficult if not
   impossible.

The page allocator does an initial filtering of nodes if nodemask=NULL,
or it defers the filter operation to the allocation logic if a nodemask
is present (or we're in a interrupt context).

static inline bool prepare_alloc_pages(gfp_t gfp_mask, unsigned int order,
                int preferred_nid, nodemask_t *nodemask,
                struct alloc_context *ac, gfp_t *alloc_gfp,
                unsigned int *alloc_flags)
{
...
        ac->nodemask = nodemask;
        if (cpuset_enabled()) {
	...
                if (in_task() && !ac->nodemask)
                        ac->nodemask = &cpuset_current_mems_allowed;
			               ^^^^ current_task.mems_allowed
                else 
                        *alloc_flags |= ALLOC_CPUSET;
			^^^ apply cpuset check during allocation instead
        }
}

Note here: If cpuset is not enabled, we don't filter!
           patch 05/11 uses mt_sysram_nodes to filter in that scenario

In the actual allocation logic, we use this nodemask (or cpusets) to
filter out unwanted nodes.

static struct page *
get_page_from_freelist(gfp_t gfp_mask, unsigned int order, int alloc_flags,
                                                const struct alloc_context *ac)
{
        z = ac->preferred_zoneref;
        for_next_zone_zonelist_nodemask(zone, z, ac->highest_zoneidx,
                                        ac->nodemask) {
		^ if nodemask=NULL - iterates ALL zones in all nodes ^
	...
                if (cpuset_enabled() &&
                        (alloc_flags & ALLOC_CPUSET) &&
                        !__cpuset_zone_allowed(zone, gfp_mask))
                                continue;
 		^^^^^^^^  Skip zone if not in mems_allowed ^^^^^^^^^

Of course we could change the page allocator logic more explicitly
to support this kind of scenario.

For example:

   We might add alloc_spm_pages() which checks mems_allowed instead
   of sysram_nodes.

I tried this, and the code duplication and spaghetti it resulted in
was embarassing.  It did work, but adding hundreds of lines to
page_alloc.c, with the risk of breaking something just lead me to
quickly disgarded it. 

It also just bluntly made using SPM memory worse - you just want to
call alloc_pages(nodemask) and be done with it.

This is what lead me to focus on modifying cpuset.mems_allowed and
add global filter logic when cpusets is disabled.

=== 2) cpuset.mems

   - cpuset.mems_allowed is the "primary filter" for most allocations

   - if cpusets is not enabled, basically all nodes are "allowed"

   - cpuset.mems_allowed is an *inherited value*

     child cgroups are restricted by the parent's mems_allowed
     cpuset.effective_mems is the actual nodemask filter.

cpuset.mems_allowed as-is cannot both restrict *AND* allow SPM nodes.

See the filtering functions above:

   If you remove an SPM node from root_cgroup.cpuset.mems_allowed
   to all of its children from using it, you effectively prevent
   ANYTHING from using it:  The node is simply not allowed.

   Since all tasks operate from within a the root context or its
   children - you can never "Allow" the node.

   If you don't remove the SPM node from the root cgroup, you aren't
   preventing tasks in the root cgroup from accessing the memory.

I chose to break mems_allowed into (mems_allowed, sysram_nodes) to:

  a) create simple nodemask=NULL default nodemask filters:
     mt_sysram_nodes, cpuset.sysram_nodes, task.sysram_nodes

  b) Leverage the existing cpuset filtering mechanism in
     mems_allowed() checks

  c) Simplify the non-cpuset filter mechanism to a 2-line change
     in page_alloc.c -- from Patch 04/11:

@@ -3753,6 +3754,8 @@ get_page_from_freelist(gfp_t gfp_mask, unsigned int order, int alloc_flags,
 		if ((alloc_flags & ALLOC_CPUSET) &&
 		    !cpuset_zone_allowed(zone, gfp_mask))
 			continue;
+		else if (!mt_node_allowed(zone_to_nid(zone), gfp_mask))
+			continue;

page_alloc.c changes are much cleaner and easy to understand this way

=== 3) mempolicy

   - mempolicy allows you change the task or vma node-policy, separate
     from (but restricted by) cpuset.mems

   - there are some policies like interleave which provide (ALL) options
     which create, basically, a nodemask=nodes[N_MEMORY] scenario.

   - This is entirely controllable via userspace.

   - There exists a lot of software out there which makes use of this
     interface via numactl syscalls (set_mempolicy, mbind, etc)

   - There is a global "default" mempolicy which is leveraged when
     task->mempolicy=NULL or vma->vm_policy=NULL.

     The default policy is essentially "Allocate from local node, but
     fallback to any possible node as-needed"

During my initial explorations I started by looking at whether a filter
function could be implemented via the global policy.

It should be somewhat obvious this falls apart completely as soon as you
find the page allocator actually filters using cpusets.

So mempolicies are dead as a candidate for any real isolation mechanism.
It is nothing more than a suggestion at best, and is actually explicitly
ignored by things like reclaim.

   (cough: Mempolicy is dead, long live Memory Policy)

I was also very worried about introducing an SPM Node solution which
presented as an isolation mechanism... which then immediately crashed
and burned when deployed by anyone already using numactl.

I have since, however, been experimenting with how you might enable
mempolicy to include SPM nodes more explicitly (with the GFP flag).

(attached at the end, completely untested, just conceptual).

=== 4) GFP_SPM_NODE

Once the filtering functions are in place (sysram_nodes), we've hit
a point where absolutely nothing can actually touch those nodes at all.

So that was requirement #1... but of course we do actually want to
allocate this memory, that's the point.  But now we have a choice...

If a node is present in the nodemask, we can:

   1) filter it based on sysram_nodes
      a) cpuset.sysram, or
      b) mt_sysram_nodes

   or

   2) filter it based on mems_allowed
      a) cpuset.effective_mems, or
      b) nodes[N_MEMORY]

The first choice is "Hard Guardrails" - it requires both an explict mask
AND the GFP flag to reach SPM memory.

The second choice is "Soft Guardrails" - more or less any nodemask is
allowed, and we trust the callers to be sane.

The cpuset filter functions already had gfp argument by the way:

bool cpuset_current_node_allowed(int node, gfp_t gfp_mask) {...}

I chose the former for the first pass due to the mempolicy section
above.  If someone has an idea of how to apply this filtering logic
WITHOUT the GFP flag - I am absolutely welcome to suggestions.

My only other idea was separate alloc_spm_pages() interfaces, and that
just felt bad.

~Gregory

---------------  mempolicy extension ----------

mempolicy: add MPOL_F_SPM_NODE

Add a way for mempolicies to access SPM nodes.

Require MPOL_F_STATIC_NODES to prevent the policy mask from being
remapped onto other nodes.

Note: This doesn't work as-is because mempolicies are restricted by
cpuset.sysram_nodes instead of cpuset.mems_allowed, so the nodemask
will be rejected.  This can be changed in the new/rebind mempolicy
interfaces.

Signed-off-by: Gregory Price

diff --git a/include/uapi/linux/mempolicy.h b/include/uapi/linux/mempolicy.h
index 8fbbe613611a..c26aa8fb56d3 100644
--- a/include/uapi/linux/mempolicy.h
+++ b/include/uapi/linux/mempolicy.h
@@ -31,6 +31,7 @@ enum {
 #define MPOL_F_STATIC_NODES    (1 << 15)
 #define MPOL_F_RELATIVE_NODES  (1 << 14)
 #define MPOL_F_NUMA_BALANCING  (1 << 13) /* Optimize with NUMA balancing if possible */
+#define MPOL_F_SPM_NODE        (1 << 12) /* Nodemask contains SPM Nodes */

 /*
  * MPOL_MODE_FLAGS is the union of all possible optional mode flags passed to
diff --git a/mm/memory.c b/mm/memory.c
index b59ae7ce42eb..7097d7045954 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -3459,8 +3459,14 @@ static gfp_t __get_fault_gfp_mask(struct vm_area_struct *vma)
 {
        struct file *vm_file = vma->vm_file;

-       if (vm_file)
-               return mapping_gfp_mask(vm_file->f_mapping) | __GFP_FS | __GFP_IO;
+       if (vm_file) {
+               gfp_t gfp;
+               gfp = mapping_gfp_mask(vm_file->f_mapping) | __GFP_FS | __GFP_IO;
+               if (vma->vm_policy)
+                       gfp |= (vma->vm_policy->flags & MPOL_F_SPM_NODE) ?
+                               __GFP_SPM_NODE : 0;
+               return gfp;
+       }

        /*
         * Special mappings (e.g. VDSO) do not have any file so fake
diff --git a/mm/mempolicy.c b/mm/mempolicy.c
index e1e8a1f3e1a2..2b4d23983ef8 100644
--- a/mm/mempolicy.c
+++ b/mm/mempolicy.c
@@ -1652,6 +1652,8 @@ static inline int sanitize_mpol_flags(int *mode, unsigned short *flags)
                return -EINVAL;
        if ((*flags & MPOL_F_STATIC_NODES) && (*flags & MPOL_F_RELATIVE_NODES))
                return -EINVAL;
+       if ((*flags & MPOL_F_SPM_NODE) && !(*flags & MPOL_F_STATIC_NODES))
+               return -EINVAL;
        if (*flags & MPOL_F_NUMA_BALANCING) {
                if (*mode == MPOL_BIND || *mode == MPOL_PREFERRED_MANY)
                        *flags |= (MPOL_F_MOF | MPOL_F_MORON);

^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: [RFC LPC2026 PATCH v2 00/11] Specific Purpose Memory NUMA Nodes
  2025-11-12 19:29 [RFC LPC2026 PATCH v2 00/11] Specific Purpose Memory NUMA Nodes Gregory Price
                   ` (12 preceding siblings ...)
  2025-11-24  9:19 ` David Hildenbrand (Red Hat)
@ 2025-11-25 14:09 ` Kiryl Shutsemau
  2025-11-25 15:05   ` Gregory Price
  2025-11-26  3:23 ` Balbir Singh
  14 siblings, 1 reply; 34+ messages in thread
From: Kiryl Shutsemau @ 2025-11-25 14:09 UTC (permalink / raw)
  To: Gregory Price
  Cc: linux-mm, kernel-team, linux-cxl, linux-kernel, nvdimm,
	linux-fsdevel, cgroups, dave, jonathan.cameron, dave.jiang,
	alison.schofield, vishal.l.verma, ira.weiny, dan.j.williams,
	longman, akpm, david, lorenzo.stoakes, Liam.Howlett, vbabka,
	rppt, surenb, mhocko, osalvador, ziy, matthew.brost,
	joshua.hahnjy, rakie.kim, byungchul, ying.huang, apopple, mingo,
	peterz, juri.lelli, vincent.guittot, dietmar.eggemann, rostedt,
	bsegall, mgorman, vschneid, tj, hannes, mkoutny, kees,
	muchun.song, roman.gushchin, shakeel.butt, rientjes, jackmanb,
	cl, harry.yoo, axelrasmussen, yuanchu, weixugc, zhengqi.arch,
	yosry.ahmed, nphamcs, chengming.zhou, fabio.m.de.francesco,
	rrichter, ming.li, usamaarif642, brauner, oleg, namcao, escape,
	dongjoo.seo1

On Wed, Nov 12, 2025 at 02:29:16PM -0500, Gregory Price wrote:
> With this set, we aim to enable allocation of "special purpose memory"
> with the page allocator (mm/page_alloc.c) without exposing the same
> memory as "System RAM".  Unless a non-userland component, and does so
> with the GFP_SPM_NODE flag, memory on these nodes cannot be allocated.

How special is "special purpose memory"? If the only difference is a
latency/bandwidth discrepancy compared to "System RAM", I don't believe
it deserves this designation.

I am not in favor of the new GFP flag approach. To me, this indicates
that our infrastructure surrounding nodemasks is lacking. I believe we
would benefit more by improving it rather than simply adding a GFP flag
on top.

While I am not an expert in NUMA, it appears that the approach with
default and opt-in NUMA nodes could be generally useful. Like,
introduce a system-wide default NUMA nodemask that is a subset of all
possible nodes. This way, users can request the "special" nodes by using
a wider mask than the default.

cpusets should allow to set both default and possible masks in a
hierarchical manner where a child's default/possible mask cannot be
wider than the parent's possible mask and default is not wider that
own possible.

> Userspace-driven allocations are restricted by the sysram_nodes mask,
> nothing in userspace can explicitly request memory from SPM nodes.
> 
> Instead, the intent is to create new components which understand memory
> features and register those nodes with those components. This abstracts
> the hardware complexity away from userland while also not requiring new
> memory innovations to carry entirely new allocators.

I don't see how it is a positive. It seems to be negative side-effect of
GFP being a leaky abstraction.

-- 
  Kiryl Shutsemau / Kirill A. Shutemov

^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: [RFC LPC2026 PATCH v2 00/11] Specific Purpose Memory NUMA Nodes
  2025-11-25 14:09 ` Kiryl Shutsemau
@ 2025-11-25 15:05   ` Gregory Price
  2025-11-27  5:12     ` Alistair Popple
  0 siblings, 1 reply; 34+ messages in thread
From: Gregory Price @ 2025-11-25 15:05 UTC (permalink / raw)
  To: Kiryl Shutsemau
  Cc: linux-mm, kernel-team, linux-cxl, linux-kernel, nvdimm,
	linux-fsdevel, cgroups, dave, jonathan.cameron, dave.jiang,
	alison.schofield, vishal.l.verma, ira.weiny, dan.j.williams,
	longman, akpm, david, lorenzo.stoakes, Liam.Howlett, vbabka,
	rppt, surenb, mhocko, osalvador, ziy, matthew.brost,
	joshua.hahnjy, rakie.kim, byungchul, ying.huang, apopple, mingo,
	peterz, juri.lelli, vincent.guittot, dietmar.eggemann, rostedt,
	bsegall, mgorman, vschneid, tj, hannes, mkoutny, kees,
	muchun.song, roman.gushchin, shakeel.butt, rientjes, jackmanb,
	cl, harry.yoo, axelrasmussen, yuanchu, weixugc, zhengqi.arch,
	yosry.ahmed, nphamcs, chengming.zhou, fabio.m.de.francesco,
	rrichter, ming.li, usamaarif642, brauner, oleg, namcao, escape,
	dongjoo.seo1

On Tue, Nov 25, 2025 at 02:09:39PM +0000, Kiryl Shutsemau wrote:
> On Wed, Nov 12, 2025 at 02:29:16PM -0500, Gregory Price wrote:
> > With this set, we aim to enable allocation of "special purpose memory"
> > with the page allocator (mm/page_alloc.c) without exposing the same
> > memory as "System RAM".  Unless a non-userland component, and does so
> > with the GFP_SPM_NODE flag, memory on these nodes cannot be allocated.
> 
> How special is "special purpose memory"? If the only difference is a
> latency/bandwidth discrepancy compared to "System RAM", I don't believe
> it deserves this designation.
> 

That is not the only discrepancy, but it can certainly be one of them.

I do think, at a certain latency/bandwidth level, memory becomes
"Specific Purpose" - because the performance implications become so
dramatic that you cannot allow just anything to land there.

In my head, I've been thinking about this list

1) Plain old memory (<100ns)
2) Kinda slower, but basically still memory (100-300ns)
3) Slow Memory (>300ns, up to 2-3us loaded latencies)
4) Types 1-3, but with a special feature (Such as compression)
5) Coherent Accelerator Memory (various interconnects now exist)
6) Non-coherent Shared Memory and PMEM (FAMFS, Optane, etc)

Originally I was considering [3,4], but with Alistar's comments I am
also thinking about [5] since apparently some accelerators already
toss their memory into the page allocator for management.

Re: Slow memory --

   Think >500-700ns cache line fetches, or 1-2us loaded.

   It's still "Basically just memory", but the scenarios in which
   you can use it transparently shrink significantly.  If you can
   control what and how things can land there with good policy,
   this can still be a boon compared to hitting I/O.

   But you still want things like reclaim and compaction to run
   on this memory, and you still want buddy-allocation of this memory.

Re: Compression

  This is a class of memory device which presents "usable memory"
  but which carries stipulations around its use.

  The compressed case is the example I use in this set.  There is an
  inline compression mechanism on the device.  If the compression ratio
  drops to low, writes can get dropped resulting in memory poison.

  We could solve this kind of problem only allowing allocation via
  demotion and hack off the Write-bit in the PTE. This provides the
  interposition needed to fend-off compression ratio issues.

  But... it's basically still "just memory" - you can even leave it
  mapped in the CPU page tables and allow userland to read unimpeded.

  In fact, we even want things like compaction and reclaim to run here.
  This cannot be done *unless* this memory is in the page allocator,
  and basically necessitates reimplementing all the core services the
  kernel provides.

Re: Accelerators

  Alistair has described accelerators onlining their memory as NUMA
  nodes being an existing pattern (apparently not in-tree as far as I
  can see, though).

  General consensus is "don't do this" - and it should be obvious
  why.  Memory pressure can cause non-workload memory to spill to
  these NUMA nodes as fallback allocation targets.

  But if we had a strong isolation mechanism, this could be supported.
  I'm not convinced this kind of memory actually needs core services
  like reclaim, so I will wait to see those arguments/data before I
  conclude whether the idea is sound.

>
> I am not in favor of the new GFP flag approach. To me, this indicates
> that our infrastructure surrounding nodemasks is lacking. I believe we
> would benefit more by improving it rather than simply adding a GFP flag
> on top.
> 

The core of this series is not the GFP flag, it is the splitting of
(cpuset.mems_allowed) into (cpuset.mems_allowed, cpuset.sysram_nodes)

That is the nodemask infrastructure improvement.  The GFP flag is one
mechanism of loosening the validation logic from limiting allocations
from (sysram_nodes) to including all nodes present in (mems_allowed).

> While I am not an expert in NUMA, it appears that the approach with
> default and opt-in NUMA nodes could be generally useful. Like,
> introduce a system-wide default NUMA nodemask that is a subset of all
> possible nodes.

This patch set does that (cpuset.sysram_nodes and mt_sysram_nodemask)

> This way, users can request the "special" nodes by using
> a wider mask than the default.
> 

I describe in the response to David that this is possible, but creates
extreme tripping hazards for a large swath of existing software.

snippet
'''
Simple answer:  We can choose how hard this guardrail is to break.

This initial attempt makes it "Hard":
   You cannot "accidentally" allocate SPM, the call must be explicit.

Removing the GFP would work, and make it "Easier" to access SPM memory.

This would allow a trivial 

   mbind(range, SPM_NODE_ID)

Which is great, but is also an incredible tripping hazard:

   numactl --interleave --all

and in kernel land:

   __alloc_pages_noprof(..., nodes[N_MEMORY])

These will now instantly be subject to SPM node memory.
'''

There are many places that use these patterns already.

But at the end of the day, it is preference: we can choose to do that.

> cpusets should allow to set both default and possible masks in a
> hierarchical manner where a child's default/possible mask cannot be
> wider than the parent's possible mask and default is not wider that
> own possible.
> 

This patch set implements exactly what you describe:
   sysram_nodes = default
   mems_allowed = possible

> > Userspace-driven allocations are restricted by the sysram_nodes mask,
> > nothing in userspace can explicitly request memory from SPM nodes.
> > 
> > Instead, the intent is to create new components which understand memory
> > features and register those nodes with those components. This abstracts
> > the hardware complexity away from userland while also not requiring new
> > memory innovations to carry entirely new allocators.
> 
> I don't see how it is a positive. It seems to be negative side-effect of
> GFP being a leaky abstraction.
> 

It's a matter of applying an isolation mechanism and then punching an
explicit hole in it.  As it is right now, GFP is "leaky" in that there
are, basically, no walls.  Reclaim even ignored cpuset controls until
recently, and the page_alloc code even says to ignore cpuset when 
in an interrupt context.

The core of the proposal here is to provide a strong isolation mechanism
and then allow punching explicit holes in it.  The GFP flag is one
pattern, I'm open to others.

~Gregory

^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: [RFC LPC2026 PATCH v2 00/11] Specific Purpose Memory NUMA Nodes
  2025-11-12 19:29 [RFC LPC2026 PATCH v2 00/11] Specific Purpose Memory NUMA Nodes Gregory Price
                   ` (13 preceding siblings ...)
  2025-11-25 14:09 ` Kiryl Shutsemau
@ 2025-11-26  3:23 ` Balbir Singh
  2025-11-26  8:29   ` Gregory Price
  14 siblings, 1 reply; 34+ messages in thread
From: Balbir Singh @ 2025-11-26  3:23 UTC (permalink / raw)
  To: Gregory Price, linux-mm
  Cc: kernel-team, linux-cxl, linux-kernel, nvdimm, linux-fsdevel,
	cgroups, dave, jonathan.cameron, dave.jiang, alison.schofield,
	vishal.l.verma, ira.weiny, dan.j.williams, longman, akpm, david,
	lorenzo.stoakes, Liam.Howlett, vbabka, rppt, surenb, mhocko,
	osalvador, ziy, matthew.brost, joshua.hahnjy, rakie.kim,
	byungchul, ying.huang, apopple, mingo, peterz, juri.lelli,
	vincent.guittot, dietmar.eggemann, rostedt, bsegall, mgorman,
	vschneid, tj, hannes, mkoutny, kees, muchun.song, roman.gushchin,
	shakeel.butt, rientjes, jackmanb, cl, harry.yoo, axelrasmussen,
	yuanchu, weixugc, zhengqi.arch, yosry.ahmed, nphamcs,
	chengming.zhou, fabio.m.de.francesco, rrichter, ming.li,
	usamaarif642, brauner, oleg, namcao, escape, dongjoo.seo1

On 11/13/25 06:29, Gregory Price wrote:
> This is a code RFC for discussion related to
> 
> "Mempolicy is dead, long live memory policy!"
> https://lpc.events/event/19/contributions/2143/
> 

:)

I am trying to read through your series, but in the past I tried
https://lwn.net/Articles/720380/

> base-commit: 24172e0d79900908cf5ebf366600616d29c9b417
> (version notes at end)
> 
> At LSF 2026, I plan to discuss:
> - Why? (In short: shunting to DAX is a failed pattern for users)
> - Other designs I considered (mempolicy, cpusets, zone_device)
> - Why mempolicy.c and cpusets as-is are insufficient
> - SPM types seeking this form of interface (Accelerator, Compression)
> - Platform extensions that would be nice to see (SPM-only Bits)
> 
> Open Questions
> - Single SPM nodemask, or multiple based on features?
> - Apply SPM/SysRAM bit on-boot only or at-hotplug?
> - Allocate extra "possible" NUMA nodes for flexbility?
> - Should SPM Nodes be zone-restricted? (MOVABLE only?)
> - How to handle things like reclaim and compaction on these nodes.
> 
> 
> With this set, we aim to enable allocation of "special purpose memory"
> with the page allocator (mm/page_alloc.c) without exposing the same
> memory as "System RAM".  Unless a non-userland component, and does so
> with the GFP_SPM_NODE flag, memory on these nodes cannot be allocated.
> 
> This isolation mechanism is a requirement for memory policies which
> depend on certain sets of memory never being used outside special
> interfaces (such as a specific mm/component or driver).
> 
> We present an example of using this mechanism within ZSWAP, as-if
> a "compressed memory node" was present.  How to describe the features
> of memory present on nodes is left up to comment here and at LPC '26.
> 
> Userspace-driven allocations are restricted by the sysram_nodes mask,
> nothing in userspace can explicitly request memory from SPM nodes.
> 
> Instead, the intent is to create new components which understand memory
> features and register those nodes with those components. This abstracts
> the hardware complexity away from userland while also not requiring new
> memory innovations to carry entirely new allocators.
> 
> The ZSwap example demonstrates this with the `mt_spm_nodemask`.  This
> hack treats all spm nodes as-if they are compressed memory nodes, and
> we bypass the software compression logic in zswap in favor of simply
> copying memory directly to the allocated page.  In a real design
> 
> There are 4 major changes in this set:
> 
> 1) Introducing mt_sysram_nodelist in mm/memory-tiers.c which denotes
>    the set of nodes which are eligible for use as normal system ram
> 
>    Some existing users now pass mt_sysram_nodelist into the page
>    allocator instead of NULL, but passing a NULL pointer in will simply
>    have it replaced by mt_sysram_nodelist anyway.  Should a fully NULL
>    pointer still make it to the page allocator, without GFP_SPM_NODE
>    SPM node zones will simply be skipped.
> 
>    mt_sysram_nodelist is always guaranteed to contain the N_MEMORY nodes
>    present during __init, but if empty the use of mt_sysram_nodes()
>    will return a NULL to preserve current behavior.
> 
> 
> 2) The addition of `cpuset.mems.sysram` which restricts allocations to
>    `mt_sysram_nodes` unless GFP_SPM_NODE is used.
> 
>    SPM Nodes are still allowed in cpuset.mems.allowed and effective.
> 
>    This is done to allow separate control over sysram and SPM node sets
>    by cgroups while maintaining the existing hierarchical rules.
> 
>    current cpuset configuration
>    cpuset.mems_allowed
>     |.mems_effective         < (mems_allowed ∩ parent.mems_effective)
>     |->tasks.mems_allowed    < cpuset.mems_effective
> 
>    new cpuset configuration
>    cpuset.mems_allowed
>     |.mems_effective         < (mems_allowed ∩ parent.mems_effective)
>     |.sysram_nodes           < (mems_effective ∩ default_sys_nodemask)
>     |->task.sysram_nodes     < cpuset.sysram_nodes
> 
>    This means mems_allowed still restricts all node usage in any given
>    task context, which is the existing behavior.
> 
> 3) Addition of MHP_SPM_NODE flag to instruct memory_hotplug.c that the
>    capacity being added should mark the node as an SPM Node. 
> 
>    A node is either SysRAM or SPM - never both.  Attempting to add
>    incompatible memory to a node results in hotplug failure.
> 
>    DAX and CXL are made aware of the bit and have `spm_node` bits added
>    to their relevant subsystems.
> 
> 4) Adding GFP_SPM_NODE - which allows page_alloc.c to request memory
>    from the provided node or nodemask.  It changes the behavior of
>    the cpuset mems_allowed and mt_node_allowed() checks.
> 
> v1->v2:
> - naming improvements
>     default_node -> sysram_node
>     protected    -> spm (Specific Purpose Memory)
> - add missing constify patch
> - add patch to update callers of __cpuset_zone_allowed
> - add additional logic to the mm sysram_nodes patch
> - fix bot build issues (ifdef config builds)
> - fix out-of-tree driver build issues (function renames)
> - change compressed_nodelist to spm_nodelist
> - add latch mechanism for sysram/spm nodes (Dan Williams)
>   this drops some extra memory-hotplug logic which is nice
> v1: https://lore.kernel.org/linux-mm/20251107224956.477056-1-gourry@gourry.net/
> 
> Gregory Price (11):
>   mm: constify oom_control, scan_control, and alloc_context nodemask
>   mm: change callers of __cpuset_zone_allowed to cpuset_zone_allowed
>   gfp: Add GFP_SPM_NODE for Specific Purpose Memory (SPM) allocations
>   memory-tiers: Introduce SysRAM and Specific Purpose Memory Nodes
>   mm: restrict slub, oom, compaction, and page_alloc to sysram by
>     default
>   mm,cpusets: rename task->mems_allowed to task->sysram_nodes
>   cpuset: introduce cpuset.mems.sysram
>   mm/memory_hotplug: add MHP_SPM_NODE flag
>   drivers/dax: add spm_node bit to dev_dax
>   drivers/cxl: add spm_node bit to cxl region
>   [HACK] mm/zswap: compressed ram integration example
> 
>  drivers/cxl/core/region.c       |  30 ++++++
>  drivers/cxl/cxl.h               |   2 +
>  drivers/dax/bus.c               |  39 ++++++++
>  drivers/dax/bus.h               |   1 +
>  drivers/dax/cxl.c               |   1 +
>  drivers/dax/dax-private.h       |   1 +
>  drivers/dax/kmem.c              |   2 +
>  fs/proc/array.c                 |   2 +-
>  include/linux/cpuset.h          |  62 +++++++------
>  include/linux/gfp_types.h       |   5 +
>  include/linux/memory-tiers.h    |  47 ++++++++++
>  include/linux/memory_hotplug.h  |  10 ++
>  include/linux/mempolicy.h       |   2 +-
>  include/linux/mm.h              |   4 +-
>  include/linux/mmzone.h          |   6 +-
>  include/linux/oom.h             |   2 +-
>  include/linux/sched.h           |   6 +-
>  include/linux/swap.h            |   2 +-
>  init/init_task.c                |   2 +-
>  kernel/cgroup/cpuset-internal.h |   8 ++
>  kernel/cgroup/cpuset-v1.c       |   7 ++
>  kernel/cgroup/cpuset.c          | 158 ++++++++++++++++++++------------
>  kernel/fork.c                   |   2 +-
>  kernel/sched/fair.c             |   4 +-
>  mm/compaction.c                 |  10 +-
>  mm/hugetlb.c                    |   8 +-
>  mm/internal.h                   |   2 +-
>  mm/memcontrol.c                 |   3 +-
>  mm/memory-tiers.c               |  66 ++++++++++++-
>  mm/memory_hotplug.c             |   7 ++
>  mm/mempolicy.c                  |  34 +++----
>  mm/migrate.c                    |   4 +-
>  mm/mmzone.c                     |   5 +-
>  mm/oom_kill.c                   |  11 ++-
>  mm/page_alloc.c                 |  57 +++++++-----
>  mm/show_mem.c                   |  11 ++-
>  mm/slub.c                       |  15 ++-
>  mm/vmscan.c                     |   6 +-
>  mm/zswap.c                      |  66 ++++++++++++-
>  39 files changed, 532 insertions(+), 178 deletions(-)
> 

Balbir


^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: [RFC LPC2026 PATCH v2 00/11] Specific Purpose Memory NUMA Nodes
  2025-11-26  3:23 ` Balbir Singh
@ 2025-11-26  8:29   ` Gregory Price
  2025-12-03  4:36     ` Balbir Singh
  0 siblings, 1 reply; 34+ messages in thread
From: Gregory Price @ 2025-11-26  8:29 UTC (permalink / raw)
  To: Balbir Singh
  Cc: linux-mm, kernel-team, linux-cxl, linux-kernel, nvdimm,
	linux-fsdevel, cgroups, dave, jonathan.cameron, dave.jiang,
	alison.schofield, vishal.l.verma, ira.weiny, dan.j.williams,
	longman, akpm, david, lorenzo.stoakes, Liam.Howlett, vbabka,
	rppt, surenb, mhocko, osalvador, ziy, matthew.brost,
	joshua.hahnjy, rakie.kim, byungchul, ying.huang, apopple, mingo,
	peterz, juri.lelli, vincent.guittot, dietmar.eggemann, rostedt,
	bsegall, mgorman, vschneid, tj, hannes, mkoutny, kees,
	muchun.song, roman.gushchin, shakeel.butt, rientjes, jackmanb,
	cl, harry.yoo, axelrasmussen, yuanchu, weixugc, zhengqi.arch,
	yosry.ahmed, nphamcs, chengming.zhou, fabio.m.de.francesco,
	rrichter, ming.li, usamaarif642, brauner, oleg, namcao, escape,
	dongjoo.seo1

On Wed, Nov 26, 2025 at 02:23:23PM +1100, Balbir Singh wrote:
> On 11/13/25 06:29, Gregory Price wrote:
> > This is a code RFC for discussion related to
> > 
> > "Mempolicy is dead, long live memory policy!"
> > https://lpc.events/event/19/contributions/2143/
> > 
> 
> :)
> 
> I am trying to read through your series, but in the past I tried
> https://lwn.net/Articles/720380/
> 

This is very interesting, I gave the whole RFC a read and it seems you
were working from the same conclusion ~8 years ago - that NUMA just
plainly "Feels like the correct abstraction".

First, thank you, the read-through here filled in some holes regarding
HMM-CDM for me.  If you have developed any other recent opinions on the
use of HMM-CDM vs NUMA-CDM, your experience is most welcome.


Some observations:

1) You implemented what amounts to N_SPM_NODES 

   - I find it funny we separately came to the same conclusion. I had
     not seen your series while researching this, that should be an
     instructive history lesson for readers.

   - N_SPM_NODES probably dictates some kind of input from ACPI table
     extension, drivers input (like my MHP flag), or kernel configs
     (build/init) to make sense.

   - I discussed in my note to David that this is probably the right
     way to go about doing it. I think N_MEMORY can still be set, if
     a new global-default-node policy is created.

   - cpuset/global sysram_nodes masks in this set are that policy.


2) You bring up the concept of NUMA node attributes

   - I have privately discussed this concept with MM folks, but had
     not come around to formalize this.  It seems a natural extension.

   - I wasn't sure whether such a thing would end up in memory-tiers.c
     or somehow abstracted otherwise.  We definitely do not want node
     attributes to imply infinite N_XXXXX masks.


3) You attacked the problem from the zone iteration mechanism as the
   primary allocation filter - while I used cpusets and basically
   implemented a new in-kernel policy (sysram_nodes)

   - I chose not to take that route (omitting these nodes from N_MEMORY)
     precisely because it would require making changes all over the
     kernel for components that may want to use the memory which
     leverage N_MEMORY for zone iteration.

   - Instead, I can see either per-component policies (reclaim->nodes)
     or a global policy that covers all of those components (similar to
     my sysram_nodes).  Drivers would then be responsible to register
     their hotplugged memory nodes with those components accordingly.

   - My mechanism requires a GFP flag to punch a hole in the isolation,
     while yours depends on the fact that page_alloc uses N_MEMORY if
     nodemask is not provided.  I can see an argument for going that
     route instead of the sysram_nodes policy, but I also understand
     why removing them from N_MEMORY causes issues (how do you opt these
     nodes into core services like kswapd and such).

     Interesting discussions to be had.


4)   Many commenters tried pushing mempolicy as the place to do this.
     We both independently came to the conclusion that 

   - mempolicy is at best an insufficient mechanism for isolation due
     to the way the rest of the system is designed (cpusets, zones)

   - at worst, actually harmful because it leads kernel developers to
     believe users view mempolicy APIs as reasonable. They don't.
     In my experience it's viewed as:
         - too complicated (SW doesn't want to know about HW)
         - useless (it's not even respected by reclaim)
         - actively harmful (it makes your code less portable)
	 - "The only thing we have"

Your RFC has the same concerns expressed that I have seen over past
few years in Device-Memory development groups... except that the general
consensus was (in 2017) that these devices were not commodity hardware
the kernel needs a general abstraction (NUMA) to support.

"Push the complexity to userland" (mempolicy), and
"Make the driver manage it." (hmm/zone_device)

Have been the prevailing opinions as a result.

From where I sit, this depends on the assumption that anyone using such
systems is presumed to be sophisticated and empowered enough to accept
that complexity.  This is just quite bluntly no longer the case.

GPUs, unified memory, and coherent interconnects have all become
commodity hardware in the data center, and the "users" here are
infrastructure-as-a-service folks that want these systems to be
some definition of fungible.

~Gregory


^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: [RFC LPC2026 PATCH v2 00/11] Specific Purpose Memory NUMA Nodes
  2025-11-24 15:28       ` Gregory Price
@ 2025-11-27  5:03         ` Alistair Popple
  0 siblings, 0 replies; 34+ messages in thread
From: Alistair Popple @ 2025-11-27  5:03 UTC (permalink / raw)
  To: Gregory Price
  Cc: linux-mm, kernel-team, linux-cxl, linux-kernel, nvdimm,
	linux-fsdevel, cgroups, dave, jonathan.cameron, dave.jiang,
	alison.schofield, vishal.l.verma, ira.weiny, dan.j.williams,
	longman, akpm, david, lorenzo.stoakes, Liam.Howlett, vbabka,
	rppt, surenb, mhocko, osalvador, ziy, matthew.brost,
	joshua.hahnjy, rakie.kim, byungchul, ying.huang, mingo, peterz,
	juri.lelli, vincent.guittot, dietmar.eggemann, rostedt, bsegall,
	mgorman, vschneid, tj, hannes, mkoutny, kees, muchun.song,
	roman.gushchin, shakeel.butt, rientjes, jackmanb, cl, harry.yoo,
	axelrasmussen, yuanchu, weixugc, zhengqi.arch, yosry.ahmed,
	nphamcs, chengming.zhou, fabio.m.de.francesco, rrichter, ming.li,
	usamaarif642, brauner, oleg, namcao, escape, dongjoo.seo1

On 2025-11-25 at 02:28 +1100, Gregory Price <gourry@gourry.net> wrote...
> On Mon, Nov 24, 2025 at 10:09:37AM +1100, Alistair Popple wrote:
> > On 2025-11-22 at 08:07 +1100, Gregory Price <gourry@gourry.net> wrote...
> > > On Tue, Nov 18, 2025 at 06:02:02PM +1100, Alistair Popple wrote:
> > > > 
> > 
> > There are multiple types here (DEVICE_PRIVATE and DEVICE_COHERENT). The former
> > is mostly irrelevant for this discussion but I'm including the descriptions here
> > for completeness.
> > 
> 
> I appreciate you taking the time here.  I'll maybe try to look at
> updating the docs as this evolves.

I believe the DEVICE_PRIVATE bit is documented here
https://www.kernel.org/doc/Documentation/vm/hmm.rst , but if there is anything
there that you think needs improvement I'd be happy to look or review. I'm not
sure if that was updated for DEVICE_COHERENT though.

> > > But I could imagine an (overly simplistic) pattern with SPM Nodes:
> > > 
> > > fd = open("/dev/gpu_mem", ...)
> > > buf = mmap(fd, ...)
> > > buf[0] 
> > >    1) driver takes the fault
> > >    2) driver calls alloc_page(..., gpu_node, GFP_SPM_NODE)
> > >    3) driver manages any special page table masks
> > >       Like marking pages RO/RW to manage ownership.
> > 
> > Of course as an aside this needs to match the CPU PTEs logic (this what
> > hmm_range_fault() is primarily used for).
> >
> 
> This is actually the most interesting part of series for me.  I'm using
> a compressed memory device as a stand-in for a memory type that requires
> special page table entries (RO) to avoid compression ratios tanking
> (resulting, eventually, in a MCE as there's no way to slow things down).
> 
> You can somewhat "Get there from here" through device coherent
> ZONE_DEVICE, but you still don't have access to basic services like
> compaction and reclaim - which you absolutely do want for such a memory
> type (for the same reasons we groom zswap and zram).
> 
> I wonder if we can even re-use the hmm interfaces for SPM nodes to make
> managing special page table policies easier as well.  That seems
> promising.

It might depend on what exactly you're looking to do - HMM is really too parts,
one for mirroring page tables and another for allowing special non-present PTEs
to be setup to map a dummy ZONE_DEVICE struct page that notifies a driver when
the CPU attempts access.

> I said this during LSFMM: Without isolation, "memory policy" is really
> just a suggestion.  What we're describing here is all predicated on
> isolation work, and all of a sudden much clearer examples of managing
> memory on NUMA boundaries starts to make a little more sense.

I very much agree with the views of memory policy that you shared in one of the
other threads. I don't think it is adequate for providing isolation, and agree
the isolation (and degree of isolation) is the interesting bit of the work here,
at least for now.

> 
> > >    4) driver sends the gpu the (mapping_id, pfn, index) information
> > >       so that gpu can map the region in its page tables.
> > 
> > On coherent systems this often just uses HW address translation services
> > (ATS), although I think the specific implementation of how page-tables are
> > mirrored/shared is orthogonal to this.
> >
> 
> Yeah this part is completely foreign to me, I just presume there's some
> way to tell the GPU how to recontruct the virtually contiguous setup.
> That mechanism would be entirely reusable here (I assume).
> 
> > This is roughly how things work with DEVICE_PRIVATE/COHERENT memory today,
> > except in the case of DEVICE_PRIVATE in step (5) above. In that case the page is
> > mapped as a non-present special swap entry that triggers a driver callback due
> > to the lack of cache coherence.
> > 
> 
> Btw, just an aside, Lorenzo is moving to rename these entries to
> softleaf (software-leaf) entries. I think you'll find it welcome.
> https://lore.kernel.org/linux-mm/c879383aac77d96a03e4d38f7daba893cd35fc76.1762812360.git.lorenzo.stoakes@oracle.com/
> 
> > > Driver doesn't have to do much in the way of allocationg management.
> > > 
> > > This is probably less compelling since you don't want general purposes
> > > services like reclaim, migration, compaction, tiering - etc.  
> > 
> > On at least some of our systems I'm told we do want this, hence my interest
> > here. Currently we have systems not using DEVICE_COHERENT and instead just
> > onlining everything as normal system managed memory in order to get reclaim
> > and tiering. Of course then people complain that it's managed as normal system
> > memory and non-GPU related things (ie. page-cache) end up in what's viewed as
> > special purpose memory.
> > 
> 
> Ok, so now this gets interesting then.  I don't understand how this
> makes sense (not saying it doesn't, I simply don't understand).
> 
> I would presume that under no circumstance do you want device memory to
> just suddenly disappear without some coordination from the driver.
> 
> Whether it's compaction or reclaim, you have some thread that's going to
> migrate a virtual mapping from HPA(A) to HPA(B) and HPA(B) may or may not
> even map to the same memory device.
> 
> That thread may not even be called in the context of a thread which
> accesses GPU memory (although, I think we could enforce that on top
> of SPM nodes, but devil is in the details).
> 
> Maybe that "all magically works" because of the ATS described above?

Pretty much - both ATS and hmm_range_fault() are, conceptually at least, just
methods of sharing/mirroring the CPU page table to a device. So in your example
above if a thread was to migrate a mapping from one page to another this "black
magic" would keep everything in sync. Eg. For hmm_range_fault() the driver
gets a mmu_notifier callback saying the virtual mapping no longer points to
HPA(A). If it needs to find the new mapping to HPA(B) it can look it up using
hmm_range_fault() and program it's page tables with the new mapping.

At a sufficiently high level ATS is just a HW implemented equivalence of this.

> I suppose this assumes you have some kind of unified memory view between
> host and device memory?  Are there docs here you can point me at that
> might explain this wizardry?  (Sincerely, this is fascinating)

Right - it's all predicated on the host and device sharing the same view of the
virtual address space. I'm not sure of any good docs on this, but I will be at
LPC so would be happy to have a discussion there.

> > > The value is clearly that you get to manage GPU memory like any other
> > > memory, but without worry that other parts of the system will touch it.
> > > 
> > > I'm much more focused on the "I have memory that is otherwise general
> > > purpose, and wants services like reclaim and compaction, but I want
> > > strong controls over how things can land there in the first place".
> > 
> > So maybe there is some overlap here - what I have is memoy that we want managed
> > much like normal memory but with strong controls over what it can be used for
> > (ie. just for tasks utilising the processing element on the accelerator).
> > 
> 
> I think it might be great if we could discuss this a bit more in-depth,
> as i've already been considering very mild refactors to reclaim to
> enable a driver to engage it with an SPM node as the only shrink target.

Absolutely! Looking forward to an in-person discussion.

 - Alistair

> This all becomes much more complicated due to per-memcg LRUs and such.
> 
> All that said, I'm focused on the isolation / allocation pieces first.
> If that can't be agreed upon, the rest isn't worth exploring.
> 
> I do have a mild extension to mempolicy that allows mbind() to hit an
> SPM node as an example as well.  I'll discuss this in the response to
> David's thread, as he had some related questions about the GFP flag.
> 
> ~Gregory
> 


^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: [RFC LPC2026 PATCH v2 00/11] Specific Purpose Memory NUMA Nodes
  2025-11-25 15:05   ` Gregory Price
@ 2025-11-27  5:12     ` Alistair Popple
  0 siblings, 0 replies; 34+ messages in thread
From: Alistair Popple @ 2025-11-27  5:12 UTC (permalink / raw)
  To: Gregory Price
  Cc: Kiryl Shutsemau, linux-mm, kernel-team, linux-cxl, linux-kernel,
	nvdimm, linux-fsdevel, cgroups, dave, jonathan.cameron,
	dave.jiang, alison.schofield, vishal.l.verma, ira.weiny,
	dan.j.williams, longman, akpm, david, lorenzo.stoakes,
	Liam.Howlett, vbabka, rppt, surenb, mhocko, osalvador, ziy,
	matthew.brost, joshua.hahnjy, rakie.kim, byungchul, ying.huang,
	mingo, peterz, juri.lelli, vincent.guittot, dietmar.eggemann,
	rostedt, bsegall, mgorman, vschneid, tj, hannes, mkoutny, kees,
	muchun.song, roman.gushchin, shakeel.butt, rientjes, jackmanb,
	cl, harry.yoo, axelrasmussen, yuanchu, weixugc, zhengqi.arch,
	yosry.ahmed, nphamcs, chengming.zhou, fabio.m.de.francesco,
	rrichter, ming.li, usamaarif642, brauner, oleg, namcao, escape,
	dongjoo.seo1

On 2025-11-26 at 02:05 +1100, Gregory Price <gourry@gourry.net> wrote...
> On Tue, Nov 25, 2025 at 02:09:39PM +0000, Kiryl Shutsemau wrote:
> > On Wed, Nov 12, 2025 at 02:29:16PM -0500, Gregory Price wrote:
> > > With this set, we aim to enable allocation of "special purpose memory"
> > > with the page allocator (mm/page_alloc.c) without exposing the same
> > > memory as "System RAM".  Unless a non-userland component, and does so
> > > with the GFP_SPM_NODE flag, memory on these nodes cannot be allocated.
> > 
> > How special is "special purpose memory"? If the only difference is a
> > latency/bandwidth discrepancy compared to "System RAM", I don't believe
> > it deserves this designation.
> > 
> 
> That is not the only discrepancy, but it can certainly be one of them.
> 
> I do think, at a certain latency/bandwidth level, memory becomes
> "Specific Purpose" - because the performance implications become so
> dramatic that you cannot allow just anything to land there.
> 
> In my head, I've been thinking about this list
> 
> 1) Plain old memory (<100ns)
> 2) Kinda slower, but basically still memory (100-300ns)
> 3) Slow Memory (>300ns, up to 2-3us loaded latencies)
> 4) Types 1-3, but with a special feature (Such as compression)
> 5) Coherent Accelerator Memory (various interconnects now exist)
> 6) Non-coherent Shared Memory and PMEM (FAMFS, Optane, etc)
> 
> Originally I was considering [3,4], but with Alistar's comments I am
> also thinking about [5] since apparently some accelerators already
> toss their memory into the page allocator for management.

Thanks.

> Re: Slow memory --
> 
>    Think >500-700ns cache line fetches, or 1-2us loaded.
> 
>    It's still "Basically just memory", but the scenarios in which
>    you can use it transparently shrink significantly.  If you can
>    control what and how things can land there with good policy,
>    this can still be a boon compared to hitting I/O.
> 
>    But you still want things like reclaim and compaction to run
>    on this memory, and you still want buddy-allocation of this memory.
> 
> Re: Compression
> 
>   This is a class of memory device which presents "usable memory"
>   but which carries stipulations around its use.
> 
>   The compressed case is the example I use in this set.  There is an
>   inline compression mechanism on the device.  If the compression ratio
>   drops to low, writes can get dropped resulting in memory poison.
> 
>   We could solve this kind of problem only allowing allocation via
>   demotion and hack off the Write-bit in the PTE. This provides the
>   interposition needed to fend-off compression ratio issues.
> 
>   But... it's basically still "just memory" - you can even leave it
>   mapped in the CPU page tables and allow userland to read unimpeded.
> 
>   In fact, we even want things like compaction and reclaim to run here.
>   This cannot be done *unless* this memory is in the page allocator,
>   and basically necessitates reimplementing all the core services the
>   kernel provides.
> 
> Re: Accelerators
> 
>   Alistair has described accelerators onlining their memory as NUMA
>   nodes being an existing pattern (apparently not in-tree as far as I
>   can see, though).

Yeah, sadly not yet :-( Hopefully "soon". Although onlining the memory doesn't
have much driver involvement as the GPU memory all just appears in the ACPI
tables as a CPU-less memory node anyway (which is why it ended up being easy for
people to toss it into the page allocator).

>   General consensus is "don't do this" - and it should be obvious
>   why.  Memory pressure can cause non-workload memory to spill to
>   these NUMA nodes as fallback allocation targets.

Indeed, this is a common complaint when people have done this.

>   But if we had a strong isolation mechanism, this could be supported.
>   I'm not convinced this kind of memory actually needs core services
>   like reclaim, so I will wait to see those arguments/data before I
>   conclude whether the idea is sound.

Sounds reasonable, I don't have strong arugments either way at the moment so
will see if we can gather some data.

> 
> 
> >
> > I am not in favor of the new GFP flag approach. To me, this indicates
> > that our infrastructure surrounding nodemasks is lacking. I believe we
> > would benefit more by improving it rather than simply adding a GFP flag
> > on top.
> > 
> 
> The core of this series is not the GFP flag, it is the splitting of
> (cpuset.mems_allowed) into (cpuset.mems_allowed, cpuset.sysram_nodes)
> 
> That is the nodemask infrastructure improvement.  The GFP flag is one
> mechanism of loosening the validation logic from limiting allocations
> from (sysram_nodes) to including all nodes present in (mems_allowed).
> 
> > While I am not an expert in NUMA, it appears that the approach with
> > default and opt-in NUMA nodes could be generally useful. Like,
> > introduce a system-wide default NUMA nodemask that is a subset of all
> > possible nodes.
> 
> This patch set does that (cpuset.sysram_nodes and mt_sysram_nodemask)
> 
> > This way, users can request the "special" nodes by using
> > a wider mask than the default.
> > 
> 
> I describe in the response to David that this is possible, but creates
> extreme tripping hazards for a large swath of existing software.
> 
> snippet
> '''
> Simple answer:  We can choose how hard this guardrail is to break.
> 
> This initial attempt makes it "Hard":
>    You cannot "accidentally" allocate SPM, the call must be explicit.
> 
> Removing the GFP would work, and make it "Easier" to access SPM memory.
> 
> This would allow a trivial 
> 
>    mbind(range, SPM_NODE_ID)
> 
> Which is great, but is also an incredible tripping hazard:
> 
>    numactl --interleave --all
> 
> and in kernel land:
> 
>    __alloc_pages_noprof(..., nodes[N_MEMORY])
> 
> These will now instantly be subject to SPM node memory.
> '''
> 
> There are many places that use these patterns already.
> 
> But at the end of the day, it is preference: we can choose to do that.
> 
> > cpusets should allow to set both default and possible masks in a
> > hierarchical manner where a child's default/possible mask cannot be
> > wider than the parent's possible mask and default is not wider that
> > own possible.
> > 
> 
> This patch set implements exactly what you describe:
>    sysram_nodes = default
>    mems_allowed = possible
> 
> > > Userspace-driven allocations are restricted by the sysram_nodes mask,
> > > nothing in userspace can explicitly request memory from SPM nodes.
> > > 
> > > Instead, the intent is to create new components which understand memory
> > > features and register those nodes with those components. This abstracts
> > > the hardware complexity away from userland while also not requiring new
> > > memory innovations to carry entirely new allocators.
> > 
> > I don't see how it is a positive. It seems to be negative side-effect of
> > GFP being a leaky abstraction.
> > 
> 
> It's a matter of applying an isolation mechanism and then punching an
> explicit hole in it.  As it is right now, GFP is "leaky" in that there
> are, basically, no walls.  Reclaim even ignored cpuset controls until
> recently, and the page_alloc code even says to ignore cpuset when 
> in an interrupt context.
> 
> The core of the proposal here is to provide a strong isolation mechanism
> and then allow punching explicit holes in it.  The GFP flag is one
> pattern, I'm open to others.
> 
> ~Gregory


^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: [RFC LPC2026 PATCH v2 00/11] Specific Purpose Memory NUMA Nodes
  2025-11-26  8:29   ` Gregory Price
@ 2025-12-03  4:36     ` Balbir Singh
  2025-12-03  5:25       ` Gregory Price
  0 siblings, 1 reply; 34+ messages in thread
From: Balbir Singh @ 2025-12-03  4:36 UTC (permalink / raw)
  To: Gregory Price
  Cc: linux-mm, kernel-team, linux-cxl, linux-kernel, nvdimm,
	linux-fsdevel, cgroups, dave, jonathan.cameron, dave.jiang,
	alison.schofield, vishal.l.verma, ira.weiny, dan.j.williams,
	longman, akpm, david, lorenzo.stoakes, Liam.Howlett, vbabka,
	rppt, surenb, mhocko, osalvador, ziy, matthew.brost,
	joshua.hahnjy, rakie.kim, byungchul, ying.huang, apopple, mingo,
	peterz, juri.lelli, vincent.guittot, dietmar.eggemann, rostedt,
	bsegall, mgorman, vschneid, tj, hannes, mkoutny, kees,
	muchun.song, roman.gushchin, shakeel.butt, rientjes, jackmanb,
	cl, harry.yoo, axelrasmussen, yuanchu, weixugc, zhengqi.arch,
	yosry.ahmed, nphamcs, chengming.zhou, fabio.m.de.francesco,
	rrichter, ming.li, usamaarif642, brauner, oleg, namcao, escape,
	dongjoo.seo1

On 11/26/25 19:29, Gregory Price wrote:
> On Wed, Nov 26, 2025 at 02:23:23PM +1100, Balbir Singh wrote:
>> On 11/13/25 06:29, Gregory Price wrote:
>>> This is a code RFC for discussion related to
>>>
>>> "Mempolicy is dead, long live memory policy!"
>>> https://lpc.events/event/19/contributions/2143/
>>>
>>
>> :)
>>
>> I am trying to read through your series, but in the past I tried
>> https://lwn.net/Articles/720380/
>>
> 
> This is very interesting, I gave the whole RFC a read and it seems you
> were working from the same conclusion ~8 years ago - that NUMA just
> plainly "Feels like the correct abstraction".
> 
> First, thank you, the read-through here filled in some holes regarding
> HMM-CDM for me.  If you have developed any other recent opinions on the
> use of HMM-CDM vs NUMA-CDM, your experience is most welcome.
> 

Sorry for the delay in responding, I've not yet read through your series

> 
> Some observations:
> 
> 1) You implemented what amounts to N_SPM_NODES 
> 
>    - I find it funny we separately came to the same conclusion. I had
>      not seen your series while researching this, that should be an
>      instructive history lesson for readers.
> 
>    - N_SPM_NODES probably dictates some kind of input from ACPI table
>      extension, drivers input (like my MHP flag), or kernel configs
>      (build/init) to make sense.
> 
>    - I discussed in my note to David that this is probably the right
>      way to go about doing it. I think N_MEMORY can still be set, if
>      a new global-default-node policy is created.
> 

I still think N_MEMORY as a flag should mean something different from
N_SPM_NODE_MEMORY because their characteristics are different

>    - cpuset/global sysram_nodes masks in this set are that policy.
> 
> 
> 2) You bring up the concept of NUMA node attributes
> 
>    - I have privately discussed this concept with MM folks, but had
>      not come around to formalize this.  It seems a natural extension.
> 
>    - I wasn't sure whether such a thing would end up in memory-tiers.c
>      or somehow abstracted otherwise.  We definitely do not want node
>      attributes to imply infinite N_XXXXX masks.

I have to think about this some more

> 
> 
> 3) You attacked the problem from the zone iteration mechanism as the
>    primary allocation filter - while I used cpusets and basically
>    implemented a new in-kernel policy (sysram_nodes)
> 
>    - I chose not to take that route (omitting these nodes from N_MEMORY)
>      precisely because it would require making changes all over the
>      kernel for components that may want to use the memory which
>      leverage N_MEMORY for zone iteration.
> 
>    - Instead, I can see either per-component policies (reclaim->nodes)
>      or a global policy that covers all of those components (similar to
>      my sysram_nodes).  Drivers would then be responsible to register
>      their hotplugged memory nodes with those components accordingly.
> 

To me node zonelists provide the right abstraction of where to allocate from
and how to fallback as needed. I'll read your patches to figure out how your
approach is different. I wanted the isolation at allocation time

>    - My mechanism requires a GFP flag to punch a hole in the isolation,
>      while yours depends on the fact that page_alloc uses N_MEMORY if
>      nodemask is not provided.  I can see an argument for going that
>      route instead of the sysram_nodes policy, but I also understand
>      why removing them from N_MEMORY causes issues (how do you opt these
>      nodes into core services like kswapd and such).
> 
>      Interesting discussions to be had.


Yes, we should look at the pros and cons. To be honest, I'd wouldn't be 
opposed to having kswapd and reclaim look different for these nodes, it
would also mean that we'd need pagecache hooks if we want page cache on
these nodes. Everything else, including move_pages() should just work.

> 
> 
> 4)   Many commenters tried pushing mempolicy as the place to do this.
>      We both independently came to the conclusion that 
> 
>    - mempolicy is at best an insufficient mechanism for isolation due
>      to the way the rest of the system is designed (cpusets, zones)
> 
>    - at worst, actually harmful because it leads kernel developers to
>      believe users view mempolicy APIs as reasonable. They don't.
>      In my experience it's viewed as:
>          - too complicated (SW doesn't want to know about HW)
>          - useless (it's not even respected by reclaim)
>          - actively harmful (it makes your code less portable)
> 	 - "The only thing we have"
> 
> Your RFC has the same concerns expressed that I have seen over past
> few years in Device-Memory development groups... except that the general
> consensus was (in 2017) that these devices were not commodity hardware
> the kernel needs a general abstraction (NUMA) to support.
> 
> "Push the complexity to userland" (mempolicy), and
> "Make the driver manage it." (hmm/zone_device)
> 
Yep

> Have been the prevailing opinions as a result.
> 
> From where I sit, this depends on the assumption that anyone using such
> systems is presumed to be sophisticated and empowered enough to accept
> that complexity.  This is just quite bluntly no longer the case.
> 
> GPUs, unified memory, and coherent interconnects have all become
> commodity hardware in the data center, and the "users" here are
> infrastructure-as-a-service folks that want these systems to be
> some definition of fungible.
> 

I also think the absence of better integration makes memory management harder

Balbir


^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: [RFC LPC2026 PATCH v2 00/11] Specific Purpose Memory NUMA Nodes
  2025-12-03  4:36     ` Balbir Singh
@ 2025-12-03  5:25       ` Gregory Price
  0 siblings, 0 replies; 34+ messages in thread
From: Gregory Price @ 2025-12-03  5:25 UTC (permalink / raw)
  To: Balbir Singh
  Cc: linux-mm, kernel-team, linux-cxl, linux-kernel, nvdimm,
	linux-fsdevel, cgroups, dave, jonathan.cameron, dave.jiang,
	alison.schofield, vishal.l.verma, ira.weiny, dan.j.williams,
	longman, akpm, david, lorenzo.stoakes, Liam.Howlett, vbabka,
	rppt, surenb, mhocko, osalvador, ziy, matthew.brost,
	joshua.hahnjy, rakie.kim, byungchul, ying.huang, apopple, mingo,
	peterz, juri.lelli, vincent.guittot, dietmar.eggemann, rostedt,
	bsegall, mgorman, vschneid, tj, hannes, mkoutny, kees,
	muchun.song, roman.gushchin, shakeel.butt, rientjes, jackmanb,
	cl, harry.yoo, axelrasmussen, yuanchu, weixugc, zhengqi.arch,
	yosry.ahmed, nphamcs, chengming.zhou, fabio.m.de.francesco,
	rrichter, ming.li, usamaarif642, brauner, oleg, namcao, escape,
	dongjoo.seo1

On Wed, Dec 03, 2025 at 03:36:33PM +1100, Balbir Singh wrote:
> >    - I discussed in my note to David that this is probably the right
> >      way to go about doing it. I think N_MEMORY can still be set, if
> >      a new global-default-node policy is created.
> > 
> 
> I still think N_MEMORY as a flag should mean something different from
> N_SPM_NODE_MEMORY because their characteristics are different
> 
... snip ...  (I agree, see later)

> >    - Instead, I can see either per-component policies (reclaim->nodes)
> >      or a global policy that covers all of those components (similar to
> >      my sysram_nodes).  Drivers would then be responsible to register
> >      their hotplugged memory nodes with those components accordingly.
> > 
> 
> To me node zonelists provide the right abstraction of where to allocate from
> and how to fallback as needed. I'll read your patches to figure out how your
> approach is different. I wanted the isolation at allocation time
>
... snip ... (I agree, see later)

> 
> Yes, we should look at the pros and cons. To be honest, I'd wouldn't be 
> opposed to having kswapd and reclaim look different for these nodes, it
> would also mean that we'd need pagecache hooks if we want page cache on
> these nodes. Everything else, including move_pages() should just work.
> 

Basically my series does (roughly) the same as yours, but adds the
cpusets controls and a GFP flag.  The MHP extention should ultimately
be converted to N_SPM_NODE_MEMORY (or whatever we decide to name it).

After some more time to think, I think we want all of it.

- N_SPM_NODE_MEMORY (or whatever we call it) handles filtering out
  SPM at allocation time by default and protects all current users
  of N_MEMORY from exposure to SPM.

- cpusets controls allow userland isolation control and a default sysram
  mask (I think cpusets.sysram_nodes doesn't even need to be exposed via
  sysfs to be honest).  cpusets fix is needed due to task->mems_allowed
  being used as a default nodemask on systems using cgroups/cpusets.

- GFP_SP_NODE protects against someone doing something like:
      get_page_from_freelist(..., node_states[N_POSSIBLE])
      or
      numactl --interleave --all ./my_program

  While providing a way to punch an explicit hole in the isolation
  (GFP_SP_NODE means "Use N_SPM_NODE_MEMORY instead of N_MEMORY")

  This could be argued against so long as we restrict mempolicy.c
  to N_MEMORY nodes (to avoid `--interleave --all` issues), but this
  limitation may not be preferable.

  My concern is for breaking existing userland software that happens
  to run on a system with SPM - but you can probably imagine many more
  bad scenarios.

~Gregory


^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: [RFC LPC2026 PATCH v2 00/11] Specific Purpose Memory NUMA Nodes
  2025-11-24 18:06   ` Gregory Price
@ 2025-12-10 23:29     ` Yiannis Nikolakopoulos
  0 siblings, 0 replies; 34+ messages in thread
From: Yiannis Nikolakopoulos @ 2025-12-10 23:29 UTC (permalink / raw)
  To: Gregory Price
  Cc: David Hildenbrand (Red Hat),
	linux-mm, kernel-team, linux-cxl, linux-kernel, nvdimm,
	linux-fsdevel, cgroups, dave, jonathan.cameron, dave.jiang,
	alison.schofield, vishal.l.verma, ira.weiny, dan.j.williams,
	longman, akpm, lorenzo.stoakes, Liam.Howlett, vbabka, rppt,
	surenb, mhocko, osalvador, ziy, matthew.brost, joshua.hahnjy,
	rakie.kim, byungchul, ying.huang, apopple, mingo, peterz,
	juri.lelli, vincent.guittot, dietmar.eggemann, rostedt, bsegall,
	mgorman, vschneid, tj, hannes, mkoutny, kees, muchun.song,
	roman.gushchin, shakeel.butt, rientjes, jackmanb, cl, harry.yoo,
	axelrasmussen, yuanchu, weixugc, zhengqi.arch, yosry.ahmed,
	nphamcs, chengming.zhou, fabio.m.de.francesco, rrichter, ming.li,
	usamaarif642, brauner, oleg, namcao, escape, dongjoo.seo1

Just managed to go through the series and I think there are very good
ideas here. It seems to cover the isolation requirements that are
needed for the devices with inline compression.
As an RFC I can try to build something on top of it and test it more.

I hope we find the right abstractions for this to move forward.

On Tue, Nov 25, 2025 at 6:58 AM Gregory Price <gourry@gourry.net> wrote:
>
> On Mon, Nov 24, 2025 at 10:19:37AM +0100, David Hildenbrand (Red Hat) wrote:
> > [...]
> >
>
> Apologies in advance for the wall of text, both of your questions really
> do cut to the core of the series.  The first (SPM nodes) is basically a
> plumbing problem I haven't had time to address pre-LPC, the second (GFP)
> is actually a design decision that is definitely up in the air.
>
> So consider this a dump of everything I wouldn't have had time to cover
> in the LPC session.
>
> > > 3) Addition of MHP_SPM_NODE flag to instruct memory_hotplug.c that the
> > >     capacity being added should mark the node as an SPM Node.
> >
> > Sounds a bit like the wrong interface for configuring this. This smells like
> > a per-node setting that should be configured before hotplugging any memory.
> >
>
> Assuming you're specifically talking about the MHP portion of this.
>
> I agree, and I think the plumbing ultimately goes through acpi and
> kernel configs.  This was my shortest path to demonstrate a functional
> prototype by LPC.
>
> I think the most likely option simply reserving additional NUMA nodes for
> hotpluggable regions based on a Kconfig setting.
>
> I think the real setup process should look like follows:
>
> 1. At __init time, Linux reserves additional SPM nodes based on some
>    configuration (build? runtime? etc)
>
>    Essentially create:  nodes[N_SPM]
>
> 2. At SPM setup time, a driver registers an "Abstract Type" with
>    mm/memory_tiers.c  which maps SPM->Type.
>
>    This gives the core some management callback infrastructure without
>    polluting the core with device specific nonsense.
>
>    This also gives the driver a change to define things like SLIT
>    distances for those nodes, which otherwise won't exist.
>
> 3. At hotplug time, memory-hotplug.c should only have to flip a bit
>    in `mt_sysram_nodes` if NID is not in nodes[N_SPM].  That logic
>    is still there to ensure the base filtering works as intended.
>
>
> I haven't quite figured out how to plumb out nodes[N_SPM] as described
> above, but I did figure out how to demonstrate roughly the same effect
> through memory-hotplug.c - hopefully that much is clear.
>
> The problem with the above plan, is whether that "Makes sense" according
> to ACPI specs and friends.
>
> This operates in "Ambiguity Land", which is uncomfortable.
What you describe in a high level above makes sense. And while I agree
that ACPI seems like a good layer for this, it could take a while for
things to converge. At the same time different vendors might do things
differently (unsurprisingly I guess...). For example, it would not be
an absurd idea that the "specialness" of the device (e.g. compression)
appears as a vendor specific capability in CXL. So, it would make
sense to allow specific device drivers to set the respective node as
SPM (as I understood you suggest above, right?)

Finally, going back to the isolation, I'm curious to see if this
covers GPU use cases as Alistair brought up or HBMs in general. Maybe
there could be synergies with the HBM related talk in the device MC?

Best,
/Yiannis


^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: [RFC PATCH v2 01/11] mm: constify oom_control, scan_control, and alloc_context nodemask
  2025-11-12 19:29 ` [RFC PATCH v2 01/11] mm: constify oom_control, scan_control, and alloc_context nodemask Gregory Price
@ 2025-12-15  6:11   ` Balbir Singh
  0 siblings, 0 replies; 34+ messages in thread
From: Balbir Singh @ 2025-12-15  6:11 UTC (permalink / raw)
  To: Gregory Price, linux-mm
  Cc: kernel-team, linux-cxl, linux-kernel, nvdimm, linux-fsdevel,
	cgroups, dave, jonathan.cameron, dave.jiang, alison.schofield,
	vishal.l.verma, ira.weiny, dan.j.williams, longman, akpm, david,
	lorenzo.stoakes, Liam.Howlett, vbabka, rppt, surenb, mhocko,
	osalvador, ziy, matthew.brost, joshua.hahnjy, rakie.kim,
	byungchul, ying.huang, apopple, mingo, peterz, juri.lelli,
	vincent.guittot, dietmar.eggemann, rostedt, bsegall, mgorman,
	vschneid, tj, hannes, mkoutny, kees, muchun.song, roman.gushchin,
	shakeel.butt, rientjes, jackmanb, cl, harry.yoo, axelrasmussen,
	yuanchu, weixugc, zhengqi.arch, yosry.ahmed, nphamcs,
	chengming.zhou, fabio.m.de.francesco, rrichter, ming.li,
	usamaarif642, brauner, oleg, namcao, escape, dongjoo.seo1

On 11/13/25 06:29, Gregory Price wrote:
> The nodemasks in these structures may come from a variety of sources,
> including tasks and cpusets - and should never be modified by any code
> when being passed around inside another context.
> 
> Signed-off-by: Gregory Price <gourry@gourry.net>
> ---
>  include/linux/cpuset.h | 4 ++--
>  include/linux/mm.h     | 4 ++--
>  include/linux/mmzone.h | 6 +++---
>  include/linux/oom.h    | 2 +-
>  include/linux/swap.h   | 2 +-
>  kernel/cgroup/cpuset.c | 2 +-
>  mm/internal.h          | 2 +-
>  mm/mmzone.c            | 5 +++--
>  mm/page_alloc.c        | 4 ++--
>  mm/show_mem.c          | 9 ++++++---
>  mm/vmscan.c            | 6 +++---
>  11 files changed, 25 insertions(+), 21 deletions(-)
> 
> diff --git a/include/linux/cpuset.h b/include/linux/cpuset.h
> index 2ddb256187b5..548eaf7ef8d0 100644
> --- a/include/linux/cpuset.h
> +++ b/include/linux/cpuset.h
> @@ -80,7 +80,7 @@ extern bool cpuset_cpu_is_isolated(int cpu);
>  extern nodemask_t cpuset_mems_allowed(struct task_struct *p);
>  #define cpuset_current_mems_allowed (current->mems_allowed)
>  void cpuset_init_current_mems_allowed(void);
> -int cpuset_nodemask_valid_mems_allowed(nodemask_t *nodemask);
> +int cpuset_nodemask_valid_mems_allowed(const nodemask_t *nodemask);
>  
>  extern bool cpuset_current_node_allowed(int node, gfp_t gfp_mask);
>  
> @@ -219,7 +219,7 @@ static inline nodemask_t cpuset_mems_allowed(struct task_struct *p)
>  #define cpuset_current_mems_allowed (node_states[N_MEMORY])
>  static inline void cpuset_init_current_mems_allowed(void) {}

The cleanup looks good

Acked-by: Balbir Singh <balbirs@nvidia.com>


^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: [RFC PATCH v2 02/11] mm: change callers of __cpuset_zone_allowed to cpuset_zone_allowed
  2025-11-12 19:29 ` [RFC PATCH v2 02/11] mm: change callers of __cpuset_zone_allowed to cpuset_zone_allowed Gregory Price
@ 2025-12-15  6:14   ` Balbir Singh
  2025-12-15 12:38     ` Gregory Price
  0 siblings, 1 reply; 34+ messages in thread
From: Balbir Singh @ 2025-12-15  6:14 UTC (permalink / raw)
  To: Gregory Price, linux-mm
  Cc: kernel-team, linux-cxl, linux-kernel, nvdimm, linux-fsdevel,
	cgroups, dave, jonathan.cameron, dave.jiang, alison.schofield,
	vishal.l.verma, ira.weiny, dan.j.williams, longman, akpm, david,
	lorenzo.stoakes, Liam.Howlett, vbabka, rppt, surenb, mhocko,
	osalvador, ziy, matthew.brost, joshua.hahnjy, rakie.kim,
	byungchul, ying.huang, apopple, mingo, peterz, juri.lelli,
	vincent.guittot, dietmar.eggemann, rostedt, bsegall, mgorman,
	vschneid, tj, hannes, mkoutny, kees, muchun.song, roman.gushchin,
	shakeel.butt, rientjes, jackmanb, cl, harry.yoo, axelrasmussen,
	yuanchu, weixugc, zhengqi.arch, yosry.ahmed, nphamcs,
	chengming.zhou, fabio.m.de.francesco, rrichter, ming.li,
	usamaarif642, brauner, oleg, namcao, escape, dongjoo.seo1

On 11/13/25 06:29, Gregory Price wrote:
> All current callers of __cpuset_zone_allowed() presently check if
> cpusets_enabled() is true first - which is the first check of the
> cpuset_zone_allowed() function.
> 
> Signed-off-by: Gregory Price <gourry@gourry.net>
> ---
>  mm/compaction.c |  7 +++----
>  mm/page_alloc.c | 19 ++++++++-----------
>  2 files changed, 11 insertions(+), 15 deletions(-)
> 
> diff --git a/mm/compaction.c b/mm/compaction.c
> index 1e8f8eca318c..d2176935d3dd 100644
> --- a/mm/compaction.c
> +++ b/mm/compaction.c
> @@ -2829,10 +2829,9 @@ enum compact_result try_to_compact_pages(gfp_t gfp_mask, unsigned int order,
>  					ac->highest_zoneidx, ac->nodemask) {
>  		enum compact_result status;
>  
> -		if (cpusets_enabled() &&
> -			(alloc_flags & ALLOC_CPUSET) &&
> -			!__cpuset_zone_allowed(zone, gfp_mask))
> -				continue;
> +		if ((alloc_flags & ALLOC_CPUSET) &&
> +		    !cpuset_zone_allowed(zone, gfp_mask))
> +			continue;
>  

Shouldn't this become one inline helper -- alloc_flags and cpuset_zone_allowed.

Balbir
<snip>


^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: [RFC PATCH v2 02/11] mm: change callers of __cpuset_zone_allowed to cpuset_zone_allowed
  2025-12-15  6:14   ` Balbir Singh
@ 2025-12-15 12:38     ` Gregory Price
  0 siblings, 0 replies; 34+ messages in thread
From: Gregory Price @ 2025-12-15 12:38 UTC (permalink / raw)
  To: Balbir Singh
  Cc: linux-mm, kernel-team, linux-cxl, linux-kernel, nvdimm,
	linux-fsdevel, cgroups, dave, jonathan.cameron, dave.jiang,
	alison.schofield, vishal.l.verma, ira.weiny, dan.j.williams,
	longman, akpm, david, lorenzo.stoakes, Liam.Howlett, vbabka,
	rppt, surenb, mhocko, osalvador, ziy, matthew.brost,
	joshua.hahnjy, rakie.kim, byungchul, ying.huang, apopple, mingo,
	peterz, juri.lelli, vincent.guittot, dietmar.eggemann, rostedt,
	bsegall, mgorman, vschneid, tj, hannes, mkoutny, kees,
	muchun.song, roman.gushchin, shakeel.butt, rientjes, jackmanb,
	cl, harry.yoo, axelrasmussen, yuanchu, weixugc, zhengqi.arch,
	yosry.ahmed, nphamcs, chengming.zhou, fabio.m.de.francesco,
	rrichter, ming.li, usamaarif642, brauner, oleg, namcao, escape,
	dongjoo.seo1

On Mon, Dec 15, 2025 at 05:14:07PM +1100, Balbir Singh wrote:
> On 11/13/25 06:29, Gregory Price wrote:
> > @@ -2829,10 +2829,9 @@ enum compact_result try_to_compact_pages(gfp_t gfp_mask, unsigned int order,
> >  					ac->highest_zoneidx, ac->nodemask) {
> >  		enum compact_result status;
> >  
> > -		if (cpusets_enabled() &&
> > -			(alloc_flags & ALLOC_CPUSET) &&
> > -			!__cpuset_zone_allowed(zone, gfp_mask))
> > -				continue;
> > +		if ((alloc_flags & ALLOC_CPUSET) &&
> > +		    !cpuset_zone_allowed(zone, gfp_mask))
> > +			continue;
> >  
> 
> Shouldn't this become one inline helper -- alloc_flags and cpuset_zone_allowed.
>

I actually went back and took a look at this code and I think there was
a corner case I missed by re-ordering cpusets_enabled and ALLOC_CPUSET
when the GFP flag was added.

I will take another look here and see if it can't be fully abstracted
into a helper, but i remember thinking to myself "Damn, i have to open
code this to deal with the cpusets_disabled case".

Will double check on next version.

> Balbir
> <snip>


^ permalink raw reply	[flat|nested] 34+ messages in thread

* [PATCH] dax/kmem: add build config for protected dax memory blocks
  2025-11-13 14:58   ` [PATCH] memory-tiers: multi-definition fixup Gregory Price
  2025-11-13 16:37     ` kernel test robot
@ 2026-01-15  2:38     ` Gregory Price
  1 sibling, 0 replies; 34+ messages in thread
From: Gregory Price @ 2026-01-15  2:38 UTC (permalink / raw)
  To: linux-mm; +Cc: kernel-team, linux-kernel, Dan Williams

Since this protection may break userspace tools, it should
be an opt-in until those tools have time to update to the
new daxN.M/hotplug interface instead of memory blocks.

Suggested-by: Dan Williams <dan.j.williams@intel.com>
Signed-off-by: Gregory Price <gourry@gourry.net>
---
 drivers/dax/Kconfig | 18 ++++++++++++++++++
 drivers/dax/kmem.c  | 29 ++++++++++++++++++++---------
 2 files changed, 38 insertions(+), 9 deletions(-)

diff --git a/drivers/dax/Kconfig b/drivers/dax/Kconfig
index d656e4c0eb84..cc13c22eb8f8 100644
--- a/drivers/dax/Kconfig
+++ b/drivers/dax/Kconfig
@@ -78,4 +78,22 @@ config DEV_DAX_KMEM
 
 	  Say N if unsure.
 
+config DEV_DAX_KMEM_PROTECTED
+	bool "Protect DAX_KMEM memory blocks being changed"
+	depends on DEV_DAX_KMEM
+	default n
+	help
+	  Prevents actions from outside the KMEM DAX driver from changing
+	  DAX KMEM memory block states. For example, the memory block
+	  sysfs functions (online, state) will return -EBUSY, and normal
+	  calls to memory_hotplug functions from other drivers and kernel
+	  sources will fail.
+
+	  This may break existing memory block management patterns that
+	  depend on offlining DAX KMEM blocks from userland before unbinding
+	  the driver.  Use this only if your tools have been updated to use
+	  the daxN.M/hotplug interface.
+
+	  Say N if unsure.
+
 endif
diff --git a/drivers/dax/kmem.c b/drivers/dax/kmem.c
index f3562f65376c..094b8a51099e 100644
--- a/drivers/dax/kmem.c
+++ b/drivers/dax/kmem.c
@@ -184,6 +184,21 @@ static int dax_kmem_memory_notifier_cb(struct notifier_block *nb,
 	return NOTIFY_BAD;
 }
 
+static int dax_kmem_register_notifier(struct dax_kmem_data *data)
+{
+	if (!IS_ENABLED(DEV_DAX_KMEM_PROTECTED))
+		return 0;
+	data->mem_nb.notifier_call = dax_kmem_memory_notifier_cb;
+	return register_memory_notifier(&data->mem_nb);
+}
+
+static void dax_kmem_unregister_notifier(struct dax_kmem_data *data)
+{
+	if (!IS_ENABLED(DEV_DAX_KMEM_PROTECTED))
+		return;
+	unregister_memory_notifier(&data->mem_nb);
+}
+
 /**
  * dax_kmem_do_hotplug - hotplug memory for dax kmem device
  * @dev_dax: the dev_dax instance
@@ -563,13 +578,9 @@ static int dev_dax_kmem_probe(struct dev_dax *dev_dax)
 	if (rc < 0)
 		goto err_resources;
 
-	/* Register memory notifier to block external operations */
-	data->mem_nb.notifier_call = dax_kmem_memory_notifier_cb;
-	rc = register_memory_notifier(&data->mem_nb);
-	if (rc) {
-		dev_warn(dev, "failed to register memory notifier\n");
+	rc = dax_kmem_register_notifier(data);
+	if (rc)
 		goto err_notifier;
-	}
 
 	/*
 	 * Hotplug using the system default policy - this preserves backwards
@@ -595,7 +606,7 @@ static int dev_dax_kmem_probe(struct dev_dax *dev_dax)
 	return 0;
 
 err_hotplug:
-	unregister_memory_notifier(&data->mem_nb);
+	dax_kmem_unregister_notifier(data);
 err_notifier:
 	dax_kmem_cleanup_resources(dev_dax, data);
 err_resources:
@@ -619,7 +630,7 @@ static void dev_dax_kmem_remove(struct dev_dax *dev_dax)
 
 	device_remove_file(dev, &dev_attr_hotplug);
 	dax_kmem_cleanup_resources(dev_dax, data);
-	unregister_memory_notifier(&data->mem_nb);
+	dax_kmem_unregister_notifier(data);
 	memory_group_unregister(data->mgid);
 	kfree(data->res_name);
 	kfree(data);
@@ -640,7 +651,7 @@ static void dev_dax_kmem_remove(struct dev_dax *dev_dax)
 	struct dax_kmem_data *data = dev_get_drvdata(dev);
 
 	device_remove_file(dev, &dev_attr_hotplug);
-	unregister_memory_notifier(&data->mem_nb);
+	dax_kmem_unregister_notifier(data);
 
 	/*
 	 * Without hotremove purposely leak the request_mem_region() for the
-- 
2.52.0



^ permalink raw reply	[flat|nested] 34+ messages in thread

end of thread, other threads:[~2026-01-15  2:39 UTC | newest]

Thread overview: 34+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2025-11-12 19:29 [RFC LPC2026 PATCH v2 00/11] Specific Purpose Memory NUMA Nodes Gregory Price
2025-11-12 19:29 ` [RFC PATCH v2 01/11] mm: constify oom_control, scan_control, and alloc_context nodemask Gregory Price
2025-12-15  6:11   ` Balbir Singh
2025-11-12 19:29 ` [RFC PATCH v2 02/11] mm: change callers of __cpuset_zone_allowed to cpuset_zone_allowed Gregory Price
2025-12-15  6:14   ` Balbir Singh
2025-12-15 12:38     ` Gregory Price
2025-11-12 19:29 ` [RFC PATCH v2 03/11] gfp: Add GFP_SPM_NODE for Specific Purpose Memory (SPM) allocations Gregory Price
2025-11-12 19:29 ` [RFC PATCH v2 04/11] memory-tiers: Introduce SysRAM and Specific Purpose Memory Nodes Gregory Price
2025-11-12 19:29 ` [RFC PATCH v2 05/11] mm: restrict slub, oom, compaction, and page_alloc to sysram by default Gregory Price
2025-11-12 19:29 ` [RFC PATCH v2 06/11] mm,cpusets: rename task->mems_allowed to task->sysram_nodes Gregory Price
2025-11-12 19:29 ` [RFC PATCH v2 07/11] cpuset: introduce cpuset.mems.sysram Gregory Price
2025-11-12 19:29 ` [RFC PATCH v2 08/11] mm/memory_hotplug: add MHP_SPM_NODE flag Gregory Price
2025-11-13 14:58   ` [PATCH] memory-tiers: multi-definition fixup Gregory Price
2025-11-13 16:37     ` kernel test robot
2026-01-15  2:38     ` [PATCH] dax/kmem: add build config for protected dax memory blocks Gregory Price
2025-11-12 19:29 ` [RFC PATCH v2 09/11] drivers/dax: add spm_node bit to dev_dax Gregory Price
2025-11-12 19:29 ` [RFC PATCH v2 10/11] drivers/cxl: add spm_node bit to cxl region Gregory Price
2025-11-12 19:29 ` [RFC PATCH v2 11/11] [HACK] mm/zswap: compressed ram integration example Gregory Price
2025-11-18  7:02 ` [RFC LPC2026 PATCH v2 00/11] Specific Purpose Memory NUMA Nodes Alistair Popple
2025-11-18 10:36   ` Gregory Price
2025-11-21 21:07   ` Gregory Price
2025-11-23 23:09     ` Alistair Popple
2025-11-24 15:28       ` Gregory Price
2025-11-27  5:03         ` Alistair Popple
2025-11-24  9:19 ` David Hildenbrand (Red Hat)
2025-11-24 18:06   ` Gregory Price
2025-12-10 23:29     ` Yiannis Nikolakopoulos
2025-11-25 14:09 ` Kiryl Shutsemau
2025-11-25 15:05   ` Gregory Price
2025-11-27  5:12     ` Alistair Popple
2025-11-26  3:23 ` Balbir Singh
2025-11-26  8:29   ` Gregory Price
2025-12-03  4:36     ` Balbir Singh
2025-12-03  5:25       ` Gregory Price

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox