* [RFC LPC2026 PATCH 0/9] Protected Memory NUMA Nodes
@ 2025-11-07 22:49 Gregory Price
From: Gregory Price @ 2025-11-07 22:49 UTC (permalink / raw)
To: linux-mm
Cc: linux-cxl, linux-kernel, nvdimm, linux-fsdevel, cgroups, dave,
jonathan.cameron, dave.jiang, alison.schofield, vishal.l.verma,
ira.weiny, dan.j.williams, longman, akpm, david, lorenzo.stoakes,
Liam.Howlett, vbabka, rppt, surenb, mhocko, osalvador, ziy,
matthew.brost, joshua.hahnjy, rakie.kim, byungchul, gourry,
ying.huang, apopple, mingo, peterz, juri.lelli, vincent.guittot,
dietmar.eggemann, rostedt, bsegall, mgorman, vschneid, tj,
hannes, mkoutny, kees, muchun.song, roman.gushchin, shakeel.butt,
rientjes, jackmanb, cl, harry.yoo, axelrasmussen, yuanchu,
weixugc, zhengqi.arch, yosry.ahmed, nphamcs, chengming.zhou,
fabio.m.de.francesco, rrichter, ming.li, usamaarif642, brauner,
oleg, namcao, escape, dongjoo.seo1
Author Note
-----------
This is a code RFC for discussion related to
"Mempolicy is dead, long live memory policy!"
https://lpc.events/event/19/contributions/2143/
Given the subtlety of some of these changes, and the upcoming holidays,
I wanted to publish this well ahead of time for discussion. This is
the baseline patch set that underpins a new kind of mempolicy based
on NUMA node memory features - features which can be defined by the
components adding memory to such NUMA nodes.
Included is an example of a Compressed Memory Node, and how compressed
RAM could be managed by zswap. Compressed memory is its own rabbit
hole - I recommend not getting hung up on the example.
The core discussion should be around whether such a "Protected Node"
based system is reasonable - and whether there are sufficient potential
users to warrant support.
Also please do not get hung up on naming. "Protected" just means
"Not-System-RAM". If you see "Default", just assume "System RAM".
base-commit: 1c353dc8d962de652bc7ad2ba2e63f553331391c
-----------
With this set, we aim to enable allocation of "special purpose memory"
through the page allocator (mm/page_alloc.c) without exposing the same
memory as "Typical System RAM". Unless a non-userland component
explicitly asks for the node, and does so with the GFP_PROTECTED flag,
memory on that node cannot be "accidentally" used as normal RAM.
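
For illustration only, a minimal sketch of what an in-kernel consumer of a
protected node might look like (the helper below is hypothetical and not part
of this series; only __GFP_PROTECTED is introduced by patch 1):

  #include <linux/gfp.h>

  /* Hypothetical consumer: allocate one page from a known protected node. */
  static struct page *alloc_from_protected_node(int protected_nid)
  {
          /*
           * Without __GFP_PROTECTED the nodemask/cpuset checks reject
           * protected nodes, so the request could not land there.
           */
          gfp_t gfp = GFP_KERNEL | __GFP_THISNODE | __GFP_PROTECTED;

          return alloc_pages_node(protected_nid, gfp, 0);
  }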
We present an example of using this mechanism within ZSWAP, as if
a "compressed memory node" were present. How to describe the features
of memory present on a node is left open to comment here and at LPC '26.
Important Note: Since userspace interfaces are restricted by the
default_node mask (sysram), nothing in userspace can explicitly
request memory from protected nodes. Instead, the intent is to
create new components which understand different node features,
abstracting the hardware complexity away from userland.
The ZSWAP example demonstrates this with `mt_compressed_nodemask`,
which is simply a hack to demonstrate the idea.
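
Roughly, a consumer might pick a target node like this (sketch only - it
assumes mt_compressed_nodemask is a plain nodemask_t exported by
memory-tiers, which is only approximately what the hack patch does):

  #include <linux/nodemask.h>

  extern nodemask_t mt_compressed_nodemask;   /* from the zswap hack patch */

  /* Pick the first compressed-memory node, or NUMA_NO_NODE if none exist. */
  static int pick_compressed_node(void)
  {
          int nid = first_node(mt_compressed_nodemask);

          return nid < MAX_NUMNODES ? nid : NUMA_NO_NODE;
  }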
There are 4 major changes in this set:
1) Introducing default_sysram_nodes in mm/memory-tiers.c, which denotes
the set of default nodes eligible for use as normal sysram.
Some existing users now pass default_sysram_nodes into the page
allocator instead of NULL, but passing a NULL pointer in will simply
have it replaced by default_sysram_nodes anyway.
default_sysram_nodes is always guaranteed to contain the N_MEMORY
nodes that were present at boot time, and so it can never be empty.
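
To illustrate the intended calling convention (the wrapper name below is
made up; default_sysram_nodes comes from patch 2):

  #include <linux/gfp.h>
  #include <linux/memory-tiers.h>

  /* Allocate a folio constrained to sysram nodes; passing NULL behaves the same. */
  static struct folio *sysram_folio_alloc(gfp_t gfp, unsigned int order, int nid)
  {
          /* default_sysram_nodes evaluates to NULL very early during boot. */
          return __folio_alloc(gfp, order, nid, default_sysram_nodes);
  }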
2) The addition of `cpuset.mems.default` which restricts cgroups to
using `default_sysram_nodes` by default, while allowing non-sysram
nodes into mems_effective (mems_allowed).
This is done to allow separate control over sysram and protected node
sets by cgroups while maintaining the hierarchical rules.
current cpuset configuration
cpuset.mems_allowed
|.mems_effective < (mems_allowed ∩ parent.mems_effective)
|->tasks.mems_allowed < cpuset.mems_effective
new cpuset configuration
cpuset.mems_allowed
|.mems_effective < (mems_allowed ∩ parent.mems_effective)
|.mems_default < (mems_effective ∩ default_sys_nodemask)
|->task.mems_default < cpuset.mems_default - (note renamed)
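
A sketch of the computation behind the new level (the series references a
cpuset_update_mems_default() helper; the body here is inferred from the
open-coded nodes_and() calls in patch 5):

  /* mems_default = effective_mems ∩ default_sysram_nodelist (when known). */
  static void cpuset_update_mems_default(struct cpuset *cs)
  {
          if (!nodes_empty(default_sysram_nodelist))
                  nodes_and(cs->mems_default, cs->effective_mems,
                            default_sysram_nodelist);
  }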
3) Addition of MHP_PROTECTED_MEMORY flag to denote to memory-hotplug
that the memory capacity being added should mark the node as a
protected memory node. A node is either SysRAM or Protected, and
cannot contain both (adding protected to an existing SysRAM node
will result in EINVAL).
DAX and CXL are made aware of the bit and have `protected_memory`
bits added to their relevant subsystems.
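
A sketch of the driver side (add_memory_driver_managed() is existing API;
the resource name and flag combination here are placeholders):

  #include <linux/memory_hotplug.h>

  /* Hotplug a range so its node is marked protected rather than sysram. */
  static int add_protected_range(int nid, u64 start, u64 size)
  {
          return add_memory_driver_managed(nid, start, size,
                                           "System RAM (protected)",
                                           MHP_MERGE_RESOURCE |
                                           MHP_PROTECTED_MEMORY);
  }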
4) Adding GFP_PROTECTED - which allows page_alloc.c to request memory
from the provided node or nodemask. It changes the behavior of
the cpuset mems_allowed check.
Some additional work is probably needed here to restrict
non-cgroup kernels.
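
The intent of the check is roughly the following (a sketch, not the exact
code from this series; field names follow the cpuset patches):

  /* Select which nodemask a cpuset-constrained allocation is checked against. */
  static bool node_allowed_for_gfp(struct cpuset *cs, int nid, gfp_t gfp)
  {
          const nodemask_t *mask = (gfp & __GFP_PROTECTED) ?
                                   &cs->mems_allowed : &current->mems_default;

          return node_isset(nid, *mask);
  }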
Gregory Price (9):
gfp: Add GFP_PROTECTED for protected-node allocations
memory-tiers: create default_sysram_nodes
mm: default slub, oom_kill, compaction, and page_alloc to sysram
mm,cpusets: rename task->mems_allowed to task->mems_default
cpuset: introduce cpuset.mems.default
mm/memory_hotplug: add MHP_PROTECTED_MEMORY flag
drivers/dax: add protected memory bit to dev_dax
drivers/cxl: add protected_memory bit to cxl region
[HACK] mm/zswap: compressed ram integration example
drivers/cxl/core/region.c | 30 ++++++
drivers/cxl/cxl.h | 2 +
drivers/dax/bus.c | 39 ++++++++
drivers/dax/bus.h | 1 +
drivers/dax/cxl.c | 1 +
drivers/dax/dax-private.h | 1 +
drivers/dax/kmem.c | 2 +
fs/proc/array.c | 2 +-
include/linux/cpuset.h | 52 +++++------
include/linux/gfp_types.h | 3 +
include/linux/memory-tiers.h | 4 +
include/linux/memory_hotplug.h | 10 ++
include/linux/mempolicy.h | 2 +-
include/linux/sched.h | 6 +-
init/init_task.c | 2 +-
kernel/cgroup/cpuset-internal.h | 8 ++
kernel/cgroup/cpuset-v1.c | 7 ++
kernel/cgroup/cpuset.c | 157 +++++++++++++++++++++-----------
kernel/fork.c | 2 +-
kernel/sched/fair.c | 4 +-
mm/hugetlb.c | 8 +-
mm/memcontrol.c | 2 +-
mm/memory-tiers.c | 25 ++++-
mm/memory_hotplug.c | 25 +++++
mm/mempolicy.c | 34 +++----
mm/migrate.c | 4 +-
mm/oom_kill.c | 11 ++-
mm/page_alloc.c | 28 +++---
mm/show_mem.c | 2 +-
mm/slub.c | 4 +-
mm/vmscan.c | 2 +-
mm/zswap.c | 65 ++++++++++++-
32 files changed, 411 insertions(+), 134 deletions(-)
--
2.51.1
* [RFC PATCH 1/9] gfp: Add GFP_PROTECTED for protected-node allocations
From: Gregory Price @ 2025-11-07 22:49 UTC (permalink / raw)
GFP_PROTECTED changes the nodemask checks when ALLOC_CPUSET
is set in the page allocator to check the full set of nodes
in cpuset->mems_allowed rather than just sysram nodes in
task->mems_default.
Signed-off-by: Gregory Price <gourry@gourry.net>
---
include/linux/gfp_types.h | 3 +++
1 file changed, 3 insertions(+)
diff --git a/include/linux/gfp_types.h b/include/linux/gfp_types.h
index 65db9349f905..2c0c250ade3a 100644
--- a/include/linux/gfp_types.h
+++ b/include/linux/gfp_types.h
@@ -58,6 +58,7 @@ enum {
#ifdef CONFIG_SLAB_OBJ_EXT
___GFP_NO_OBJ_EXT_BIT,
#endif
+ ___GFP_PROTECTED_BIT,
___GFP_LAST_BIT
};
@@ -103,6 +104,7 @@ enum {
#else
#define ___GFP_NO_OBJ_EXT 0
#endif
+#define ___GFP_PROTECTED BIT(___GFP_PROTECTED_BIT)
/*
* Physical address zone modifiers (see linux/mmzone.h - low four bits)
@@ -115,6 +117,7 @@ enum {
#define __GFP_HIGHMEM ((__force gfp_t)___GFP_HIGHMEM)
#define __GFP_DMA32 ((__force gfp_t)___GFP_DMA32)
#define __GFP_MOVABLE ((__force gfp_t)___GFP_MOVABLE) /* ZONE_MOVABLE allowed */
+#define __GFP_PROTECTED ((__force gfp_t)___GFP_PROTECTED) /* Protected nodes allowed */
#define GFP_ZONEMASK (__GFP_DMA|__GFP_HIGHMEM|__GFP_DMA32|__GFP_MOVABLE)
/**
--
2.51.1
* [RFC PATCH 2/9] memory-tiers: create default_sysram_nodes
From: Gregory Price @ 2025-11-07 22:49 UTC (permalink / raw)
Record the set of memory nodes present at __init time, so that hotplug
memory nodes can choose whether to expose themselves to the page
allocator at hotplug time.
Do not include non-sysram nodes in demotion targets.
Signed-off-by: Gregory Price <gourry@gourry.net>
---
include/linux/memory-tiers.h | 3 +++
mm/memory-tiers.c | 22 ++++++++++++++++++++--
2 files changed, 23 insertions(+), 2 deletions(-)
diff --git a/include/linux/memory-tiers.h b/include/linux/memory-tiers.h
index 7a805796fcfd..3d3f3687d134 100644
--- a/include/linux/memory-tiers.h
+++ b/include/linux/memory-tiers.h
@@ -39,6 +39,9 @@ struct access_coordinate;
extern bool numa_demotion_enabled;
extern struct memory_dev_type *default_dram_type;
extern nodemask_t default_dram_nodes;
+extern nodemask_t default_sysram_nodelist;
+#define default_sysram_nodes (nodes_empty(default_sysram_nodelist) ? NULL : \
+ &default_sysram_nodelist)
struct memory_dev_type *alloc_memory_type(int adistance);
void put_memory_type(struct memory_dev_type *memtype);
void init_node_memory_type(int node, struct memory_dev_type *default_type);
diff --git a/mm/memory-tiers.c b/mm/memory-tiers.c
index 0ea5c13f10a2..b2ee4f73ad54 100644
--- a/mm/memory-tiers.c
+++ b/mm/memory-tiers.c
@@ -44,7 +44,12 @@ static LIST_HEAD(memory_tiers);
static LIST_HEAD(default_memory_types);
static struct node_memory_type_map node_memory_types[MAX_NUMNODES];
struct memory_dev_type *default_dram_type;
-nodemask_t default_dram_nodes __initdata = NODE_MASK_NONE;
+
+/* default_dram_nodes is the list of nodes with both CPUs and RAM */
+nodemask_t default_dram_nodes = NODE_MASK_NONE;
+
+/* default_sysram_nodelist is the list of nodes with RAM at __init time */
+nodemask_t default_sysram_nodelist = NODE_MASK_NONE;
static const struct bus_type memory_tier_subsys = {
.name = "memory_tiering",
@@ -427,6 +432,14 @@ static void establish_demotion_targets(void)
disable_all_demotion_targets();
for_each_node_state(node, N_MEMORY) {
+ /*
+ * If this is not a sysram node, direct-demotion is not allowed
+ * and must be managed by special logic that understands the
+ * memory features of that particular node.
+ */
+ if (!node_isset(node, default_sysram_nodelist))
+ continue;
+
best_distance = -1;
nd = &node_demotion[node];
@@ -457,7 +470,8 @@ static void establish_demotion_targets(void)
break;
distance = node_distance(node, target);
- if (distance == best_distance || best_distance == -1) {
+ if ((distance == best_distance || best_distance == -1) &&
+ node_isset(target, default_sysram_nodelist)) {
best_distance = distance;
node_set(target, nd->preferred);
} else {
@@ -812,6 +826,7 @@ int mt_perf_to_adistance(struct access_coordinate *perf, int *adist)
}
EXPORT_SYMBOL_GPL(mt_perf_to_adistance);
+
/**
* register_mt_adistance_algorithm() - Register memory tiering abstract distance algorithm
* @nb: The notifier block which describe the algorithm
@@ -922,6 +937,9 @@ static int __init memory_tier_init(void)
nodes_and(default_dram_nodes, node_states[N_MEMORY],
node_states[N_CPU]);
+ /* Record all nodes with non-hotplugged memory as default SYSRAM nodes */
+ default_sysram_nodelist = node_states[N_MEMORY];
+
hotplug_node_notifier(memtier_hotplug_callback, MEMTIER_HOTPLUG_PRI);
return 0;
}
--
2.51.1
* [RFC PATCH 3/9] mm: default slub, oom_kill, compaction, and page_alloc to sysram
From: Gregory Price @ 2025-11-07 22:49 UTC (permalink / raw)
Constrain core users of nodemasks to default_sysram_nodes,
which is guaranteed to either be NULL or contain the set of nodes
with sysram memory blocks.
Signed-off-by: Gregory Price <gourry@gourry.net>
---
mm/oom_kill.c | 5 ++++-
mm/page_alloc.c | 12 ++++++++----
mm/slub.c | 4 +++-
3 files changed, 15 insertions(+), 6 deletions(-)
diff --git a/mm/oom_kill.c b/mm/oom_kill.c
index c145b0feecc1..e0b6137835b2 100644
--- a/mm/oom_kill.c
+++ b/mm/oom_kill.c
@@ -34,6 +34,7 @@
#include <linux/export.h>
#include <linux/notifier.h>
#include <linux/memcontrol.h>
+#include <linux/memory-tiers.h>
#include <linux/mempolicy.h>
#include <linux/security.h>
#include <linux/ptrace.h>
@@ -1118,6 +1119,8 @@ EXPORT_SYMBOL_GPL(unregister_oom_notifier);
bool out_of_memory(struct oom_control *oc)
{
unsigned long freed = 0;
+ if (!oc->nodemask)
+ oc->nodemask = default_sysram_nodes;
if (oom_killer_disabled)
return false;
@@ -1154,7 +1157,7 @@ bool out_of_memory(struct oom_control *oc)
*/
oc->constraint = constrained_alloc(oc);
if (oc->constraint != CONSTRAINT_MEMORY_POLICY)
- oc->nodemask = NULL;
+ oc->nodemask = default_sysram_nodes;
check_panic_on_oom(oc);
if (!is_memcg_oom(oc) && sysctl_oom_kill_allocating_task &&
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index fd5401fb5e00..18213eacf974 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -34,6 +34,7 @@
#include <linux/cpuset.h>
#include <linux/pagevec.h>
#include <linux/memory_hotplug.h>
+#include <linux/memory-tiers.h>
#include <linux/nodemask.h>
#include <linux/vmstat.h>
#include <linux/fault-inject.h>
@@ -4610,7 +4611,7 @@ check_retry_cpuset(int cpuset_mems_cookie, struct alloc_context *ac)
*/
if (cpusets_enabled() && ac->nodemask &&
!cpuset_nodemask_valid_mems_allowed(ac->nodemask)) {
- ac->nodemask = NULL;
+ ac->nodemask = default_sysram_nodes;
return true;
}
@@ -4794,7 +4795,7 @@ __alloc_pages_slowpath(gfp_t gfp_mask, unsigned int order,
* user oriented.
*/
if (!(alloc_flags & ALLOC_CPUSET) || reserve_flags) {
- ac->nodemask = NULL;
+ ac->nodemask = default_sysram_nodes;
ac->preferred_zoneref = first_zones_zonelist(ac->zonelist,
ac->highest_zoneidx, ac->nodemask);
}
@@ -4946,7 +4947,8 @@ static inline bool prepare_alloc_pages(gfp_t gfp_mask, unsigned int order,
ac->nodemask = &cpuset_current_mems_allowed;
else
*alloc_flags |= ALLOC_CPUSET;
- }
+ } else if (!ac->nodemask) /* sysram_nodes may be NULL during __init */
+ ac->nodemask = default_sysram_nodes;
might_alloc(gfp_mask);
@@ -5190,8 +5192,10 @@ struct page *__alloc_frozen_pages_noprof(gfp_t gfp, unsigned int order,
/*
* Restore the original nodemask if it was potentially replaced with
* &cpuset_current_mems_allowed to optimize the fast-path attempt.
+ *
+ * If not set, default to sysram nodes.
*/
- ac.nodemask = nodemask;
+ ac.nodemask = nodemask ? nodemask : default_sysram_nodes;
page = __alloc_pages_slowpath(alloc_gfp, order, &ac);
diff --git a/mm/slub.c b/mm/slub.c
index d4367f25b20d..b8358a961c4c 100644
--- a/mm/slub.c
+++ b/mm/slub.c
@@ -28,6 +28,7 @@
#include <linux/cpu.h>
#include <linux/cpuset.h>
#include <linux/mempolicy.h>
+#include <linux/memory-tiers.h>
#include <linux/ctype.h>
#include <linux/stackdepot.h>
#include <linux/debugobjects.h>
@@ -3570,7 +3571,8 @@ static struct slab *get_any_partial(struct kmem_cache *s,
do {
cpuset_mems_cookie = read_mems_allowed_begin();
zonelist = node_zonelist(mempolicy_slab_node(), pc->flags);
- for_each_zone_zonelist(zone, z, zonelist, highest_zoneidx) {
+ for_each_zone_zonelist_nodemask(zone, z, zonelist, highest_zoneidx,
+ default_sysram_nodes) {
struct kmem_cache_node *n;
n = get_node(s, zone_to_nid(zone));
--
2.51.1
* [RFC PATCH 4/9] mm,cpusets: rename task->mems_allowed to task->mems_default
From: Gregory Price @ 2025-11-07 22:49 UTC (permalink / raw)
task->mems_allowed actually contains the value of cpuset.effective_mems.
The value of cpuset.mems.effective is the intersection of mems_allowed
and the cpuset's parent's mems.effective. This creates a confusing
naming scheme between references to task->mems_allowed and the cpuset
fields mems_allowed and effective_mems.
Rename task->mems_allowed to task->mems_default for two reasons.
1) To decouple task->mems_allowed from the cpuset.mems_allowed naming
scheme and make it clear the two fields may contain different values.
2) To enable mems_allowed to contain memory nodes which may not be
present in effective_mems due to being "Special Purpose" nodes
which require explicit GFP flags to allocate from (implemented
in a future patch in this series).
Signed-off-by: Gregory Price <gourry@gourry.net>
---
fs/proc/array.c | 2 +-
include/linux/cpuset.h | 44 +++++++++++-----------
include/linux/mempolicy.h | 2 +-
include/linux/sched.h | 6 +--
init/init_task.c | 2 +-
kernel/cgroup/cpuset.c | 78 +++++++++++++++++++--------------------
kernel/fork.c | 2 +-
kernel/sched/fair.c | 4 +-
mm/hugetlb.c | 8 ++--
mm/mempolicy.c | 28 +++++++-------
mm/oom_kill.c | 6 +--
mm/page_alloc.c | 16 ++++----
mm/show_mem.c | 2 +-
mm/vmscan.c | 2 +-
14 files changed, 101 insertions(+), 101 deletions(-)
diff --git a/fs/proc/array.c b/fs/proc/array.c
index 2ae63189091e..3929d7cf65d5 100644
--- a/fs/proc/array.c
+++ b/fs/proc/array.c
@@ -456,7 +456,7 @@ int proc_pid_status(struct seq_file *m, struct pid_namespace *ns,
task_cap(m, task);
task_seccomp(m, task);
task_cpus_allowed(m, task);
- cpuset_task_status_allowed(m, task);
+ cpuset_task_status_default(m, task);
task_context_switch_counts(m, task);
arch_proc_pid_thread_features(m, task);
return 0;
diff --git a/include/linux/cpuset.h b/include/linux/cpuset.h
index 548eaf7ef8d0..4db08c580cc3 100644
--- a/include/linux/cpuset.h
+++ b/include/linux/cpuset.h
@@ -23,14 +23,14 @@
/*
* Static branch rewrites can happen in an arbitrary order for a given
* key. In code paths where we need to loop with read_mems_allowed_begin() and
- * read_mems_allowed_retry() to get a consistent view of mems_allowed, we need
+ * read_mems_allowed_retry() to get a consistent view of mems_default, we need
* to ensure that begin() always gets rewritten before retry() in the
* disabled -> enabled transition. If not, then if local irqs are disabled
* around the loop, we can deadlock since retry() would always be
- * comparing the latest value of the mems_allowed seqcount against 0 as
+ * comparing the latest value of the mems_default seqcount against 0 as
* begin() still would see cpusets_enabled() as false. The enabled -> disabled
* transition should happen in reverse order for the same reasons (want to stop
- * looking at real value of mems_allowed.sequence in retry() first).
+ * looking at real value of mems_default.sequence in retry() first).
*/
extern struct static_key_false cpusets_pre_enable_key;
extern struct static_key_false cpusets_enabled_key;
@@ -78,9 +78,9 @@ extern void cpuset_cpus_allowed(struct task_struct *p, struct cpumask *mask);
extern bool cpuset_cpus_allowed_fallback(struct task_struct *p);
extern bool cpuset_cpu_is_isolated(int cpu);
extern nodemask_t cpuset_mems_allowed(struct task_struct *p);
-#define cpuset_current_mems_allowed (current->mems_allowed)
-void cpuset_init_current_mems_allowed(void);
-int cpuset_nodemask_valid_mems_allowed(const nodemask_t *nodemask);
+#define cpuset_current_mems_default (current->mems_default)
+void cpuset_init_current_mems_default(void);
+int cpuset_nodemask_valid_mems_default(const nodemask_t *nodemask);
extern bool cpuset_current_node_allowed(int node, gfp_t gfp_mask);
@@ -96,7 +96,7 @@ static inline bool cpuset_zone_allowed(struct zone *z, gfp_t gfp_mask)
return true;
}
-extern int cpuset_mems_allowed_intersects(const struct task_struct *tsk1,
+extern int cpuset_mems_default_intersects(const struct task_struct *tsk1,
const struct task_struct *tsk2);
#ifdef CONFIG_CPUSETS_V1
@@ -111,7 +111,7 @@ extern void __cpuset_memory_pressure_bump(void);
static inline void cpuset_memory_pressure_bump(void) { }
#endif
-extern void cpuset_task_status_allowed(struct seq_file *m,
+extern void cpuset_task_status_default(struct seq_file *m,
struct task_struct *task);
extern int proc_cpuset_show(struct seq_file *m, struct pid_namespace *ns,
struct pid *pid, struct task_struct *tsk);
@@ -128,12 +128,12 @@ extern bool current_cpuset_is_being_rebound(void);
extern void dl_rebuild_rd_accounting(void);
extern void rebuild_sched_domains(void);
-extern void cpuset_print_current_mems_allowed(void);
+extern void cpuset_print_current_mems_default(void);
extern void cpuset_reset_sched_domains(void);
/*
* read_mems_allowed_begin is required when making decisions involving
- * mems_allowed such as during page allocation. mems_allowed can be updated in
+ * mems_default such as during page allocation. mems_default can be updated in
* parallel and depending on the new value an operation can fail potentially
* causing process failure. A retry loop with read_mems_allowed_begin and
* read_mems_allowed_retry prevents these artificial failures.
@@ -143,13 +143,13 @@ static inline unsigned int read_mems_allowed_begin(void)
if (!static_branch_unlikely(&cpusets_pre_enable_key))
return 0;
- return read_seqcount_begin(&current->mems_allowed_seq);
+ return read_seqcount_begin(&current->mems_default_seq);
}
/*
* If this returns true, the operation that took place after
* read_mems_allowed_begin may have failed artificially due to a concurrent
- * update of mems_allowed. It is up to the caller to retry the operation if
+ * update of mems_default. It is up to the caller to retry the operation if
* appropriate.
*/
static inline bool read_mems_allowed_retry(unsigned int seq)
@@ -157,7 +157,7 @@ static inline bool read_mems_allowed_retry(unsigned int seq)
if (!static_branch_unlikely(&cpusets_enabled_key))
return false;
- return read_seqcount_retry(&current->mems_allowed_seq, seq);
+ return read_seqcount_retry(&current->mems_default_seq, seq);
}
static inline void set_mems_allowed(nodemask_t nodemask)
@@ -166,9 +166,9 @@ static inline void set_mems_allowed(nodemask_t nodemask)
task_lock(current);
local_irq_save(flags);
- write_seqcount_begin(&current->mems_allowed_seq);
- current->mems_allowed = nodemask;
- write_seqcount_end(&current->mems_allowed_seq);
+ write_seqcount_begin(&current->mems_default_seq);
+ current->mems_default = nodemask;
+ write_seqcount_end(&current->mems_default_seq);
local_irq_restore(flags);
task_unlock(current);
}
@@ -216,10 +216,10 @@ static inline nodemask_t cpuset_mems_allowed(struct task_struct *p)
return node_possible_map;
}
-#define cpuset_current_mems_allowed (node_states[N_MEMORY])
-static inline void cpuset_init_current_mems_allowed(void) {}
+#define cpuset_current_mems_default (node_states[N_MEMORY])
+static inline void cpuset_init_current_mems_default(void) {}
-static inline int cpuset_nodemask_valid_mems_allowed(const nodemask_t *nodemask)
+static inline int cpuset_nodemask_valid_mems_default(const nodemask_t *nodemask)
{
return 1;
}
@@ -234,7 +234,7 @@ static inline bool cpuset_zone_allowed(struct zone *z, gfp_t gfp_mask)
return true;
}
-static inline int cpuset_mems_allowed_intersects(const struct task_struct *tsk1,
+static inline int cpuset_mems_default_intersects(const struct task_struct *tsk1,
const struct task_struct *tsk2)
{
return 1;
@@ -242,7 +242,7 @@ static inline int cpuset_mems_allowed_intersects(const struct task_struct *tsk1,
static inline void cpuset_memory_pressure_bump(void) {}
-static inline void cpuset_task_status_allowed(struct seq_file *m,
+static inline void cpuset_task_status_default(struct seq_file *m,
struct task_struct *task)
{
}
@@ -276,7 +276,7 @@ static inline void cpuset_reset_sched_domains(void)
partition_sched_domains(1, NULL, NULL);
}
-static inline void cpuset_print_current_mems_allowed(void)
+static inline void cpuset_print_current_mems_default(void)
{
}
diff --git a/include/linux/mempolicy.h b/include/linux/mempolicy.h
index 0fe96f3ab3ef..f1a6ab8ac383 100644
--- a/include/linux/mempolicy.h
+++ b/include/linux/mempolicy.h
@@ -52,7 +52,7 @@ struct mempolicy {
int home_node; /* Home node to use for MPOL_BIND and MPOL_PREFERRED_MANY */
union {
- nodemask_t cpuset_mems_allowed; /* relative to these nodes */
+ nodemask_t cpuset_mems_default; /* relative to these nodes */
nodemask_t user_nodemask; /* nodemask passed by user */
} w;
};
diff --git a/include/linux/sched.h b/include/linux/sched.h
index b469878de25c..e7030c0dfc60 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -1223,7 +1223,7 @@ struct task_struct {
u64 parent_exec_id;
u64 self_exec_id;
- /* Protection against (de-)allocation: mm, files, fs, tty, keyrings, mems_allowed, mempolicy: */
+ /* Protection against (de-)allocation: mm, files, fs, tty, keyrings, mems_default, mempolicy: */
spinlock_t alloc_lock;
/* Protection of the PI data structures: */
@@ -1314,9 +1314,9 @@ struct task_struct {
#endif
#ifdef CONFIG_CPUSETS
/* Protected by ->alloc_lock: */
- nodemask_t mems_allowed;
+ nodemask_t mems_default;
/* Sequence number to catch updates: */
- seqcount_spinlock_t mems_allowed_seq;
+ seqcount_spinlock_t mems_default_seq;
int cpuset_mem_spread_rotor;
#endif
#ifdef CONFIG_CGROUPS
diff --git a/init/init_task.c b/init/init_task.c
index a55e2189206f..6aaeb25327af 100644
--- a/init/init_task.c
+++ b/init/init_task.c
@@ -173,7 +173,7 @@ struct task_struct init_task __aligned(L1_CACHE_BYTES) = {
.trc_blkd_node = LIST_HEAD_INIT(init_task.trc_blkd_node),
#endif
#ifdef CONFIG_CPUSETS
- .mems_allowed_seq = SEQCNT_SPINLOCK_ZERO(init_task.mems_allowed_seq,
+ .mems_default_seq = SEQCNT_SPINLOCK_ZERO(init_task.mems_default_seq,
&init_task.alloc_lock),
#endif
#ifdef CONFIG_RT_MUTEXES
diff --git a/kernel/cgroup/cpuset.c b/kernel/cgroup/cpuset.c
index cd3e2ae83d70..b05c07489a4d 100644
--- a/kernel/cgroup/cpuset.c
+++ b/kernel/cgroup/cpuset.c
@@ -240,7 +240,7 @@ static struct cpuset top_cpuset = {
* If a task is only holding callback_lock, then it has read-only
* access to cpusets.
*
- * Now, the task_struct fields mems_allowed and mempolicy may be changed
+ * Now, the task_struct fields mems_default and mempolicy may be changed
* by other task, we use alloc_lock in the task_struct fields to protect
* them.
*
@@ -2678,11 +2678,11 @@ static void schedule_flush_migrate_mm(void)
}
/*
- * cpuset_change_task_nodemask - change task's mems_allowed and mempolicy
+ * cpuset_change_task_nodemask - change task's mems_default and mempolicy
* @tsk: the task to change
* @newmems: new nodes that the task will be set
*
- * We use the mems_allowed_seq seqlock to safely update both tsk->mems_allowed
+ * We use the mems_default_seq seqlock to safely update both tsk->mems_default
* and rebind an eventual tasks' mempolicy. If the task is allocating in
* parallel, it might temporarily see an empty intersection, which results in
* a seqlock check and retry before OOM or allocation failure.
@@ -2693,13 +2693,13 @@ static void cpuset_change_task_nodemask(struct task_struct *tsk,
task_lock(tsk);
local_irq_disable();
- write_seqcount_begin(&tsk->mems_allowed_seq);
+ write_seqcount_begin(&tsk->mems_default_seq);
- nodes_or(tsk->mems_allowed, tsk->mems_allowed, *newmems);
+ nodes_or(tsk->mems_default, tsk->mems_default, *newmems);
mpol_rebind_task(tsk, newmems);
- tsk->mems_allowed = *newmems;
+ tsk->mems_default = *newmems;
- write_seqcount_end(&tsk->mems_allowed_seq);
+ write_seqcount_end(&tsk->mems_default_seq);
local_irq_enable();
task_unlock(tsk);
@@ -2709,9 +2709,9 @@ static void *cpuset_being_rebound;
/**
* cpuset_update_tasks_nodemask - Update the nodemasks of tasks in the cpuset.
- * @cs: the cpuset in which each task's mems_allowed mask needs to be changed
+ * @cs: the cpuset in which each task's mems_default mask needs to be changed
*
- * Iterate through each task of @cs updating its mems_allowed to the
+ * Iterate through each task of @cs updating its mems_default to the
* effective cpuset's. As this function is called with cpuset_mutex held,
* cpuset membership stays stable.
*/
@@ -3763,7 +3763,7 @@ static void cpuset_fork(struct task_struct *task)
return;
set_cpus_allowed_ptr(task, current->cpus_ptr);
- task->mems_allowed = current->mems_allowed;
+ task->mems_default = current->mems_default;
return;
}
@@ -4205,9 +4205,9 @@ bool cpuset_cpus_allowed_fallback(struct task_struct *tsk)
return changed;
}
-void __init cpuset_init_current_mems_allowed(void)
+void __init cpuset_init_current_mems_default(void)
{
- nodes_setall(current->mems_allowed);
+ nodes_setall(current->mems_default);
}
/**
@@ -4233,14 +4233,14 @@ nodemask_t cpuset_mems_allowed(struct task_struct *tsk)
}
/**
- * cpuset_nodemask_valid_mems_allowed - check nodemask vs. current mems_allowed
+ * cpuset_nodemask_valid_mems_default - check nodemask vs. current mems_default
* @nodemask: the nodemask to be checked
*
- * Are any of the nodes in the nodemask allowed in current->mems_allowed?
+ * Are any of the nodes in the nodemask allowed in current->mems_default?
*/
-int cpuset_nodemask_valid_mems_allowed(const nodemask_t *nodemask)
+int cpuset_nodemask_valid_mems_default(const nodemask_t *nodemask)
{
- return nodes_intersects(*nodemask, current->mems_allowed);
+ return nodes_intersects(*nodemask, current->mems_default);
}
/*
@@ -4262,7 +4262,7 @@ static struct cpuset *nearest_hardwall_ancestor(struct cpuset *cs)
* @gfp_mask: memory allocation flags
*
* If we're in interrupt, yes, we can always allocate. If @node is set in
- * current's mems_allowed, yes. If it's not a __GFP_HARDWALL request and this
+ * current's mems_default, yes. If it's not a __GFP_HARDWALL request and this
* node is set in the nearest hardwalled cpuset ancestor to current's cpuset,
* yes. If current has access to memory reserves as an oom victim, yes.
* Otherwise, no.
@@ -4276,7 +4276,7 @@ static struct cpuset *nearest_hardwall_ancestor(struct cpuset *cs)
* Scanning up parent cpusets requires callback_lock. The
* __alloc_pages() routine only calls here with __GFP_HARDWALL bit
* _not_ set if it's a GFP_KERNEL allocation, and all nodes in the
- * current tasks mems_allowed came up empty on the first pass over
+ * current tasks mems_default came up empty on the first pass over
* the zonelist. So only GFP_KERNEL allocations, if all nodes in the
* cpuset are short of memory, might require taking the callback_lock.
*
@@ -4304,7 +4304,7 @@ bool cpuset_current_node_allowed(int node, gfp_t gfp_mask)
if (in_interrupt())
return true;
- if (node_isset(node, current->mems_allowed))
+ if (node_isset(node, current->mems_default))
return true;
/*
* Allow tasks that have access to memory reserves because they have
@@ -4375,13 +4375,13 @@ bool cpuset_node_allowed(struct cgroup *cgroup, int nid)
* certain page cache or slab cache pages such as used for file
* system buffers and inode caches, then instead of starting on the
* local node to look for a free page, rather spread the starting
- * node around the tasks mems_allowed nodes.
+ * node around the tasks mems_default nodes.
*
* We don't have to worry about the returned node being offline
* because "it can't happen", and even if it did, it would be ok.
*
* The routines calling guarantee_online_mems() are careful to
- * only set nodes in task->mems_allowed that are online. So it
+ * only set nodes in task->mems_default that are online. So it
* should not be possible for the following code to return an
* offline node. But if it did, that would be ok, as this routine
* is not returning the node where the allocation must be, only
@@ -4392,7 +4392,7 @@ bool cpuset_node_allowed(struct cgroup *cgroup, int nid)
*/
static int cpuset_spread_node(int *rotor)
{
- return *rotor = next_node_in(*rotor, current->mems_allowed);
+ return *rotor = next_node_in(*rotor, current->mems_default);
}
/**
@@ -4402,35 +4402,35 @@ int cpuset_mem_spread_node(void)
{
if (current->cpuset_mem_spread_rotor == NUMA_NO_NODE)
current->cpuset_mem_spread_rotor =
- node_random(&current->mems_allowed);
+ node_random(&current->mems_default);
return cpuset_spread_node(¤t->cpuset_mem_spread_rotor);
}
/**
- * cpuset_mems_allowed_intersects - Does @tsk1's mems_allowed intersect @tsk2's?
+ * cpuset_mems_default_intersects - Does @tsk1's mems_default intersect @tsk2's?
* @tsk1: pointer to task_struct of some task.
* @tsk2: pointer to task_struct of some other task.
*
- * Description: Return true if @tsk1's mems_allowed intersects the
- * mems_allowed of @tsk2. Used by the OOM killer to determine if
+ * Description: Return true if @tsk1's mems_default intersects the
+ * mems_default of @tsk2. Used by the OOM killer to determine if
* one of the task's memory usage might impact the memory available
* to the other.
**/
-int cpuset_mems_allowed_intersects(const struct task_struct *tsk1,
+int cpuset_mems_default_intersects(const struct task_struct *tsk1,
const struct task_struct *tsk2)
{
- return nodes_intersects(tsk1->mems_allowed, tsk2->mems_allowed);
+ return nodes_intersects(tsk1->mems_default, tsk2->mems_default);
}
/**
- * cpuset_print_current_mems_allowed - prints current's cpuset and mems_allowed
+ * cpuset_print_current_mems_default - prints current's cpuset and mems_default
*
* Description: Prints current's name, cpuset name, and cached copy of its
- * mems_allowed to the kernel log.
+ * mems_default to the kernel log.
*/
-void cpuset_print_current_mems_allowed(void)
+void cpuset_print_current_mems_default(void)
{
struct cgroup *cgrp;
@@ -4439,17 +4439,17 @@ void cpuset_print_current_mems_allowed(void)
cgrp = task_cs(current)->css.cgroup;
pr_cont(",cpuset=");
pr_cont_cgroup_name(cgrp);
- pr_cont(",mems_allowed=%*pbl",
- nodemask_pr_args(&current->mems_allowed));
+ pr_cont(",mems_default=%*pbl",
+ nodemask_pr_args(&current->mems_default));
rcu_read_unlock();
}
-/* Display task mems_allowed in /proc/<pid>/status file. */
-void cpuset_task_status_allowed(struct seq_file *m, struct task_struct *task)
+/* Display task mems_default in /proc/<pid>/status file. */
+void cpuset_task_status_default(struct seq_file *m, struct task_struct *task)
{
- seq_printf(m, "Mems_allowed:\t%*pb\n",
- nodemask_pr_args(&task->mems_allowed));
- seq_printf(m, "Mems_allowed_list:\t%*pbl\n",
- nodemask_pr_args(&task->mems_allowed));
+ seq_printf(m, "Mems_default:\t%*pb\n",
+ nodemask_pr_args(&task->mems_default));
+ seq_printf(m, "Mems_default_list:\t%*pbl\n",
+ nodemask_pr_args(&task->mems_default));
}
diff --git a/kernel/fork.c b/kernel/fork.c
index 3da0f08615a9..26e4056ca9ac 100644
--- a/kernel/fork.c
+++ b/kernel/fork.c
@@ -2120,7 +2120,7 @@ __latent_entropy struct task_struct *copy_process(
#endif
#ifdef CONFIG_CPUSETS
p->cpuset_mem_spread_rotor = NUMA_NO_NODE;
- seqcount_spinlock_init(&p->mems_allowed_seq, &p->alloc_lock);
+ seqcount_spinlock_init(&p->mems_default_seq, &p->alloc_lock);
#endif
#ifdef CONFIG_TRACE_IRQFLAGS
memset(&p->irqtrace, 0, sizeof(p->irqtrace));
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 25970dbbb279..e50d79ba7ce9 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -3317,8 +3317,8 @@ static void task_numa_work(struct callback_head *work)
* Memory is pinned to only one NUMA node via cpuset.mems, naturally
* no page can be migrated.
*/
- if (cpusets_enabled() && nodes_weight(cpuset_current_mems_allowed) == 1) {
- trace_sched_skip_cpuset_numa(current, &cpuset_current_mems_allowed);
+ if (cpusets_enabled() && nodes_weight(cpuset_current_mems_default) == 1) {
+ trace_sched_skip_cpuset_numa(current, &cpuset_current_mems_default);
return;
}
diff --git a/mm/hugetlb.c b/mm/hugetlb.c
index 0455119716ec..7925a6973d09 100644
--- a/mm/hugetlb.c
+++ b/mm/hugetlb.c
@@ -2366,7 +2366,7 @@ static nodemask_t *policy_mbind_nodemask(gfp_t gfp)
*/
if (mpol->mode == MPOL_BIND &&
(apply_policy_zone(mpol, gfp_zone(gfp)) &&
- cpuset_nodemask_valid_mems_allowed(&mpol->nodes)))
+ cpuset_nodemask_valid_mems_default(&mpol->nodes)))
return &mpol->nodes;
#endif
return NULL;
@@ -2389,9 +2389,9 @@ static int gather_surplus_pages(struct hstate *h, long delta)
mbind_nodemask = policy_mbind_nodemask(htlb_alloc_mask(h));
if (mbind_nodemask)
- nodes_and(alloc_nodemask, *mbind_nodemask, cpuset_current_mems_allowed);
+ nodes_and(alloc_nodemask, *mbind_nodemask, cpuset_current_mems_default);
else
- alloc_nodemask = cpuset_current_mems_allowed;
+ alloc_nodemask = cpuset_current_mems_default;
lockdep_assert_held(&hugetlb_lock);
needed = (h->resv_huge_pages + delta) - h->free_huge_pages;
@@ -5084,7 +5084,7 @@ static unsigned int allowed_mems_nr(struct hstate *h)
gfp_t gfp_mask = htlb_alloc_mask(h);
mbind_nodemask = policy_mbind_nodemask(gfp_mask);
- for_each_node_mask(node, cpuset_current_mems_allowed) {
+ for_each_node_mask(node, cpuset_current_mems_default) {
if (!mbind_nodemask || node_isset(node, *mbind_nodemask))
nr += array[node];
}
diff --git a/mm/mempolicy.c b/mm/mempolicy.c
index eb83cff7db8c..6225d4d23010 100644
--- a/mm/mempolicy.c
+++ b/mm/mempolicy.c
@@ -396,7 +396,7 @@ static int mpol_new_preferred(struct mempolicy *pol, const nodemask_t *nodes)
* any, for the new policy. mpol_new() has already validated the nodes
* parameter with respect to the policy mode and flags.
*
- * Must be called holding task's alloc_lock to protect task's mems_allowed
+ * Must be called holding task's alloc_lock to protect task's mems_default
* and mempolicy. May also be called holding the mmap_lock for write.
*/
static int mpol_set_nodemask(struct mempolicy *pol,
@@ -414,7 +414,7 @@ static int mpol_set_nodemask(struct mempolicy *pol,
/* Check N_MEMORY */
nodes_and(nsc->mask1,
- cpuset_current_mems_allowed, node_states[N_MEMORY]);
+ cpuset_current_mems_default, node_states[N_MEMORY]);
VM_BUG_ON(!nodes);
@@ -426,7 +426,7 @@ static int mpol_set_nodemask(struct mempolicy *pol,
if (mpol_store_user_nodemask(pol))
pol->w.user_nodemask = *nodes;
else
- pol->w.cpuset_mems_allowed = cpuset_current_mems_allowed;
+ pol->w.cpuset_mems_default = cpuset_current_mems_default;
ret = mpol_ops[pol->mode].create(pol, &nsc->mask2);
return ret;
@@ -501,9 +501,9 @@ static void mpol_rebind_nodemask(struct mempolicy *pol, const nodemask_t *nodes)
else if (pol->flags & MPOL_F_RELATIVE_NODES)
mpol_relative_nodemask(&tmp, &pol->w.user_nodemask, nodes);
else {
- nodes_remap(tmp, pol->nodes, pol->w.cpuset_mems_allowed,
+ nodes_remap(tmp, pol->nodes, pol->w.cpuset_mems_default,
*nodes);
- pol->w.cpuset_mems_allowed = *nodes;
+ pol->w.cpuset_mems_default = *nodes;
}
if (nodes_empty(tmp))
@@ -515,14 +515,14 @@ static void mpol_rebind_nodemask(struct mempolicy *pol, const nodemask_t *nodes)
static void mpol_rebind_preferred(struct mempolicy *pol,
const nodemask_t *nodes)
{
- pol->w.cpuset_mems_allowed = *nodes;
+ pol->w.cpuset_mems_default = *nodes;
}
/*
* mpol_rebind_policy - Migrate a policy to a different set of nodes
*
* Per-vma policies are protected by mmap_lock. Allocations using per-task
- * policies are protected by task->mems_allowed_seq to prevent a premature
+ * policies are protected by task->mems_default_seq to prevent a premature
* OOM/allocation failure due to parallel nodemask modification.
*/
static void mpol_rebind_policy(struct mempolicy *pol, const nodemask_t *newmask)
@@ -530,7 +530,7 @@ static void mpol_rebind_policy(struct mempolicy *pol, const nodemask_t *newmask)
if (!pol || pol->mode == MPOL_LOCAL)
return;
if (!mpol_store_user_nodemask(pol) &&
- nodes_equal(pol->w.cpuset_mems_allowed, *newmask))
+ nodes_equal(pol->w.cpuset_mems_default, *newmask))
return;
mpol_ops[pol->mode].rebind(pol, newmask);
@@ -1086,7 +1086,7 @@ static long do_get_mempolicy(int *policy, nodemask_t *nmask,
return -EINVAL;
*policy = 0; /* just so it's initialized */
task_lock(current);
- *nmask = cpuset_current_mems_allowed;
+ *nmask = cpuset_current_mems_default;
task_unlock(current);
return 0;
}
@@ -2029,7 +2029,7 @@ static unsigned int weighted_interleave_nodes(struct mempolicy *policy)
unsigned int cpuset_mems_cookie;
retry:
- /* to prevent miscount use tsk->mems_allowed_seq to detect rebind */
+ /* to prevent miscount use tsk->mems_default_seq to detect rebind */
cpuset_mems_cookie = read_mems_allowed_begin();
node = current->il_prev;
if (!current->il_weight || !node_isset(node, policy->nodes)) {
@@ -2051,7 +2051,7 @@ static unsigned int interleave_nodes(struct mempolicy *policy)
unsigned int nid;
unsigned int cpuset_mems_cookie;
- /* to prevent miscount, use tsk->mems_allowed_seq to detect rebind */
+ /* to prevent miscount, use tsk->mems_default_seq to detect rebind */
do {
cpuset_mems_cookie = read_mems_allowed_begin();
nid = next_node_in(current->il_prev, policy->nodes);
@@ -2118,7 +2118,7 @@ static unsigned int read_once_policy_nodemask(struct mempolicy *pol,
/*
* barrier stabilizes the nodemask locally so that it can be iterated
* over safely without concern for changes. Allocators validate node
- * selection does not violate mems_allowed, so this is safe.
+ * selection does not violate mems_default, so this is safe.
*/
barrier();
memcpy(mask, &pol->nodes, sizeof(nodemask_t));
@@ -2210,7 +2210,7 @@ static nodemask_t *policy_nodemask(gfp_t gfp, struct mempolicy *pol,
case MPOL_BIND:
/* Restrict to nodemask (but not on lower zones) */
if (apply_policy_zone(pol, gfp_zone(gfp)) &&
- cpuset_nodemask_valid_mems_allowed(&pol->nodes))
+ cpuset_nodemask_valid_mems_default(&pol->nodes))
nodemask = &pol->nodes;
if (pol->home_node != NUMA_NO_NODE)
*nid = pol->home_node;
@@ -2738,7 +2738,7 @@ int vma_dup_policy(struct vm_area_struct *src, struct vm_area_struct *dst)
/*
* If mpol_dup() sees current->cpuset == cpuset_being_rebound, then it
* rebinds the mempolicy its copying by calling mpol_rebind_policy()
- * with the mems_allowed returned by cpuset_mems_allowed(). This
+ * with the mems_default returned by cpuset_mems_allowed(). This
* keeps mempolicies cpuset relative after its cpuset moves. See
* further kernel/cpuset.c update_nodemask().
*
diff --git a/mm/oom_kill.c b/mm/oom_kill.c
index e0b6137835b2..a8f1f086d6a2 100644
--- a/mm/oom_kill.c
+++ b/mm/oom_kill.c
@@ -110,7 +110,7 @@ static bool oom_cpuset_eligible(struct task_struct *start,
* This is not a mempolicy constrained oom, so only
* check the mems of tsk's cpuset.
*/
- ret = cpuset_mems_allowed_intersects(current, tsk);
+ ret = cpuset_mems_default_intersects(current, tsk);
}
if (ret)
break;
@@ -300,7 +300,7 @@ static enum oom_constraint constrained_alloc(struct oom_control *oc)
if (cpuset_limited) {
oc->totalpages = total_swap_pages;
- for_each_node_mask(nid, cpuset_current_mems_allowed)
+ for_each_node_mask(nid, cpuset_current_mems_default)
oc->totalpages += node_present_pages(nid);
return CONSTRAINT_CPUSET;
}
@@ -451,7 +451,7 @@ static void dump_oom_victim(struct oom_control *oc, struct task_struct *victim)
pr_info("oom-kill:constraint=%s,nodemask=%*pbl",
oom_constraint_text[oc->constraint],
nodemask_pr_args(oc->nodemask));
- cpuset_print_current_mems_allowed();
+ cpuset_print_current_mems_default();
mem_cgroup_print_oom_context(oc->memcg, victim);
pr_cont(",task=%s,pid=%d,uid=%d\n", victim->comm, victim->pid,
from_kuid(&init_user_ns, task_uid(victim)));
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index 18213eacf974..a0c27fbb24bd 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -3963,7 +3963,7 @@ void warn_alloc(gfp_t gfp_mask, const nodemask_t *nodemask, const char *fmt, ...
nodemask_pr_args(nodemask));
va_end(args);
- cpuset_print_current_mems_allowed();
+ cpuset_print_current_mems_default();
pr_cont("\n");
dump_stack();
warn_alloc_show_mem(gfp_mask, nodemask);
@@ -4599,7 +4599,7 @@ static inline bool
check_retry_cpuset(int cpuset_mems_cookie, struct alloc_context *ac)
{
/*
- * It's possible that cpuset's mems_allowed and the nodemask from
+ * It's possible that cpuset's mems_default and the nodemask from
* mempolicy don't intersect. This should be normally dealt with by
* policy_nodemask(), but it's possible to race with cpuset update in
* such a way the check therein was true, and then it became false
@@ -4610,13 +4610,13 @@ check_retry_cpuset(int cpuset_mems_cookie, struct alloc_context *ac)
* caller can deal with a violated nodemask.
*/
if (cpusets_enabled() && ac->nodemask &&
- !cpuset_nodemask_valid_mems_allowed(ac->nodemask)) {
+ !cpuset_nodemask_valid_mems_default(ac->nodemask)) {
ac->nodemask = default_sysram_nodes;
return true;
}
/*
- * When updating a task's mems_allowed or mempolicy nodemask, it is
+ * When updating a task's mems_default or mempolicy nodemask, it is
* possible to race with parallel threads in such a way that our
* allocation can fail while the mask is being updated. If we are about
* to fail, check if the cpuset changed during allocation and if so,
@@ -4700,7 +4700,7 @@ __alloc_pages_slowpath(gfp_t gfp_mask, unsigned int order,
if (cpusets_insane_config() && (gfp_mask & __GFP_HARDWALL)) {
struct zoneref *z = first_zones_zonelist(ac->zonelist,
ac->highest_zoneidx,
- &cpuset_current_mems_allowed);
+ &cpuset_current_mems_default);
if (!zonelist_zone(z))
goto nopage;
}
@@ -4944,7 +4944,7 @@ static inline bool prepare_alloc_pages(gfp_t gfp_mask, unsigned int order,
* to the current task context. It means that any node ok.
*/
if (in_task() && !ac->nodemask)
- ac->nodemask = &cpuset_current_mems_allowed;
+ ac->nodemask = &cpuset_current_mems_default;
else
*alloc_flags |= ALLOC_CPUSET;
} else if (!ac->nodemask) /* sysram_nodes may be NULL during __init */
@@ -5191,7 +5191,7 @@ struct page *__alloc_frozen_pages_noprof(gfp_t gfp, unsigned int order,
/*
* Restore the original nodemask if it was potentially replaced with
- * &cpuset_current_mems_allowed to optimize the fast-path attempt.
+ * &cpuset_current_mems_default to optimize the fast-path attempt.
*
* If not set, default to sysram nodes.
*/
@@ -5816,7 +5816,7 @@ build_all_zonelists_init(void)
per_cpu_pages_init(&per_cpu(boot_pageset, cpu), &per_cpu(boot_zonestats, cpu));
mminit_verify_zonelist();
- cpuset_init_current_mems_allowed();
+ cpuset_init_current_mems_default();
}
/*
diff --git a/mm/show_mem.c b/mm/show_mem.c
index 24685b5c6dcf..45dd35cae3fb 100644
--- a/mm/show_mem.c
+++ b/mm/show_mem.c
@@ -128,7 +128,7 @@ static bool show_mem_node_skip(unsigned int flags, int nid,
* have to be precise here.
*/
if (!nodemask)
- nodemask = &cpuset_current_mems_allowed;
+ nodemask = &cpuset_current_mems_default;
return !node_isset(nid, *nodemask);
}
diff --git a/mm/vmscan.c b/mm/vmscan.c
index 03e7f5206ad9..d7aa220b2707 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -355,7 +355,7 @@ static bool can_demote(int nid, struct scan_control *sc,
if (demotion_nid == NUMA_NO_NODE)
return false;
- /* If demotion node isn't in the cgroup's mems_allowed, fall back */
+ /* If demotion node isn't in the cgroup's mems_default, fall back */
return mem_cgroup_node_allowed(memcg, demotion_nid);
}
--
2.51.1
* [RFC PATCH 5/9] cpuset: introduce cpuset.mems.default
From: Gregory Price @ 2025-11-07 22:49 UTC (permalink / raw)
mems_default is intersect(effective_mems, default_sysram_nodes). This
allows hotplugged memory nodes to be marked "protected". A protected
node's memory is not default-allocable via standard methods (basic
page faults, mempolicies, etc.).
When checking node_allowed, check for GFP_PROTECTED to determine if
the check should be made against mems_default or mems_allowed, since
mems_default only contains sysram nodes.
Signed-off-by: Gregory Price <gourry@gourry.net>
---
include/linux/cpuset.h | 8 ++--
kernel/cgroup/cpuset-internal.h | 8 ++++
kernel/cgroup/cpuset-v1.c | 7 +++
kernel/cgroup/cpuset.c | 83 ++++++++++++++++++++++++++-------
mm/memcontrol.c | 2 +-
mm/mempolicy.c | 8 ++--
mm/migrate.c | 4 +-
7 files changed, 93 insertions(+), 27 deletions(-)
diff --git a/include/linux/cpuset.h b/include/linux/cpuset.h
index 4db08c580cc3..7f683e4cf6c3 100644
--- a/include/linux/cpuset.h
+++ b/include/linux/cpuset.h
@@ -77,7 +77,7 @@ extern void cpuset_unlock(void);
extern void cpuset_cpus_allowed(struct task_struct *p, struct cpumask *mask);
extern bool cpuset_cpus_allowed_fallback(struct task_struct *p);
extern bool cpuset_cpu_is_isolated(int cpu);
-extern nodemask_t cpuset_mems_allowed(struct task_struct *p);
+extern nodemask_t cpuset_mems_default(struct task_struct *p);
#define cpuset_current_mems_default (current->mems_default)
void cpuset_init_current_mems_default(void);
int cpuset_nodemask_valid_mems_default(const nodemask_t *nodemask);
@@ -173,7 +173,7 @@ static inline void set_mems_allowed(nodemask_t nodemask)
task_unlock(current);
}
-extern bool cpuset_node_allowed(struct cgroup *cgroup, int nid);
+extern bool cpuset_node_default(struct cgroup *cgroup, int nid);
#else /* !CONFIG_CPUSETS */
static inline bool cpusets_enabled(void) { return false; }
@@ -211,7 +211,7 @@ static inline bool cpuset_cpu_is_isolated(int cpu)
return false;
}
-static inline nodemask_t cpuset_mems_allowed(struct task_struct *p)
+static inline nodemask_t cpuset_mems_default(struct task_struct *p)
{
return node_possible_map;
}
@@ -294,7 +294,7 @@ static inline bool read_mems_allowed_retry(unsigned int seq)
return false;
}
-static inline bool cpuset_node_allowed(struct cgroup *cgroup, int nid)
+static inline bool cpuset_node_default(struct cgroup *cgroup, int nid)
{
return true;
}
diff --git a/kernel/cgroup/cpuset-internal.h b/kernel/cgroup/cpuset-internal.h
index 337608f408ce..6978e04477b2 100644
--- a/kernel/cgroup/cpuset-internal.h
+++ b/kernel/cgroup/cpuset-internal.h
@@ -55,6 +55,7 @@ typedef enum {
FILE_MEMLIST,
FILE_EFFECTIVE_CPULIST,
FILE_EFFECTIVE_MEMLIST,
+ FILE_MEMS_DEFAULT,
FILE_SUBPARTS_CPULIST,
FILE_EXCLUSIVE_CPULIST,
FILE_EFFECTIVE_XCPULIST,
@@ -104,6 +105,13 @@ struct cpuset {
cpumask_var_t effective_cpus;
nodemask_t effective_mems;
+ /*
+ * Default Memory Nodes for tasks.
+ * This is the intersection of effective_mems and default_sysram_nodes.
+ * Tasks will have their mems_default set to this value.
+ */
+ nodemask_t mems_default;
+
/*
* Exclusive CPUs dedicated to current cgroup (default hierarchy only)
*
diff --git a/kernel/cgroup/cpuset-v1.c b/kernel/cgroup/cpuset-v1.c
index 12e76774c75b..a06f2b032e0d 100644
--- a/kernel/cgroup/cpuset-v1.c
+++ b/kernel/cgroup/cpuset-v1.c
@@ -293,6 +293,7 @@ void cpuset1_hotplug_update_tasks(struct cpuset *cs,
cpumask_copy(cs->effective_cpus, new_cpus);
cs->mems_allowed = *new_mems;
cs->effective_mems = *new_mems;
+ cpuset_update_mems_default(cs);
cpuset_callback_unlock_irq();
/*
@@ -532,6 +533,12 @@ struct cftype cpuset1_files[] = {
.private = FILE_EFFECTIVE_MEMLIST,
},
+ {
+ .name = "mems_default",
+ .seq_show = cpuset_common_seq_show,
+ .private = FILE_MEMS_DEFAULT,
+ },
+
{
.name = "cpu_exclusive",
.read_u64 = cpuset_read_u64,
diff --git a/kernel/cgroup/cpuset.c b/kernel/cgroup/cpuset.c
index b05c07489a4d..ea5ca1a05cf5 100644
--- a/kernel/cgroup/cpuset.c
+++ b/kernel/cgroup/cpuset.c
@@ -29,6 +29,7 @@
#include <linux/mempolicy.h>
#include <linux/mm.h>
#include <linux/memory.h>
+#include <linux/memory-tiers.h>
#include <linux/export.h>
#include <linux/rcupdate.h>
#include <linux/sched.h>
@@ -430,9 +431,9 @@ static void guarantee_active_cpus(struct task_struct *tsk,
*/
static void guarantee_online_mems(struct cpuset *cs, nodemask_t *pmask)
{
- while (!nodes_intersects(cs->effective_mems, node_states[N_MEMORY]))
+ while (!nodes_intersects(cs->mems_default, node_states[N_MEMORY]))
cs = parent_cs(cs);
- nodes_and(*pmask, cs->effective_mems, node_states[N_MEMORY]);
+ nodes_and(*pmask, cs->mems_default, node_states[N_MEMORY]);
}
/**
@@ -2748,7 +2749,7 @@ void cpuset_update_tasks_nodemask(struct cpuset *cs)
migrate = is_memory_migrate(cs);
- mpol_rebind_mm(mm, &cs->mems_allowed);
+ mpol_rebind_mm(mm, &cs->mems_default);
if (migrate)
cpuset_migrate_mm(mm, &cs->old_mems_allowed, &newmems);
else
@@ -2808,6 +2809,9 @@ static void update_nodemasks_hier(struct cpuset *cs, nodemask_t *new_mems)
spin_lock_irq(&callback_lock);
cp->effective_mems = *new_mems;
+ if (!nodes_empty(default_sysram_nodelist))
+ nodes_and(cp->mems_default, cp->effective_mems,
+ default_sysram_nodelist);
spin_unlock_irq(&callback_lock);
WARN_ON(!is_in_v2_mode() &&
@@ -3234,7 +3238,7 @@ static void cpuset_attach(struct cgroup_taskset *tset)
* by skipping the task iteration and update.
*/
if (cpuset_v2() && !cpus_updated && !mems_updated) {
- cpuset_attach_nodemask_to = cs->effective_mems;
+ cpuset_attach_nodemask_to = cs->mems_default;
goto out;
}
@@ -3249,7 +3253,7 @@ static void cpuset_attach(struct cgroup_taskset *tset)
* if there is no change in effective_mems and CS_MEMORY_MIGRATE is
* not set.
*/
- cpuset_attach_nodemask_to = cs->effective_mems;
+ cpuset_attach_nodemask_to = cs->mems_default;
if (!is_memory_migrate(cs) && !mems_updated)
goto out;
@@ -3371,6 +3375,9 @@ int cpuset_common_seq_show(struct seq_file *sf, void *v)
case FILE_EFFECTIVE_MEMLIST:
seq_printf(sf, "%*pbl\n", nodemask_pr_args(&cs->effective_mems));
break;
+ case FILE_MEMS_DEFAULT:
+ seq_printf(sf, "%*pbl\n", nodemask_pr_args(&cs->mems_default));
+ break;
case FILE_EXCLUSIVE_CPULIST:
seq_printf(sf, "%*pbl\n", cpumask_pr_args(cs->exclusive_cpus));
break;
@@ -3482,6 +3489,12 @@ static struct cftype dfl_files[] = {
.private = FILE_EFFECTIVE_MEMLIST,
},
+ {
+ .name = "mems.default",
+ .seq_show = cpuset_common_seq_show,
+ .private = FILE_MEMS_DEFAULT,
+ },
+
{
.name = "cpus.partition",
.seq_show = cpuset_partition_show,
@@ -3585,6 +3598,9 @@ static int cpuset_css_online(struct cgroup_subsys_state *css)
if (is_in_v2_mode()) {
cpumask_copy(cs->effective_cpus, parent->effective_cpus);
cs->effective_mems = parent->effective_mems;
+ if (!nodes_empty(default_sysram_nodelist))
+ nodes_and(cs->mems_default, cs->effective_mems,
+ default_sysram_nodelist);
}
spin_unlock_irq(&callback_lock);
@@ -3616,6 +3632,9 @@ static int cpuset_css_online(struct cgroup_subsys_state *css)
spin_lock_irq(&callback_lock);
cs->mems_allowed = parent->mems_allowed;
cs->effective_mems = parent->mems_allowed;
+ if (!nodes_empty(default_sysram_nodelist))
+ nodes_and(cs->mems_default, cs->effective_mems,
+ default_sysram_nodelist);
cpumask_copy(cs->cpus_allowed, parent->cpus_allowed);
cpumask_copy(cs->effective_cpus, parent->cpus_allowed);
spin_unlock_irq(&callback_lock);
@@ -3818,6 +3837,9 @@ int __init cpuset_init(void)
cpumask_setall(top_cpuset.effective_xcpus);
cpumask_setall(top_cpuset.exclusive_cpus);
nodes_setall(top_cpuset.effective_mems);
+ if (!nodes_empty(default_sysram_nodelist))
+ nodes_and(top_cpuset.mems_default, top_cpuset.effective_mems,
+ default_sysram_nodelist);
fmeter_init(&top_cpuset.fmeter);
INIT_LIST_HEAD(&remote_children);
@@ -3848,6 +3870,9 @@ hotplug_update_tasks(struct cpuset *cs,
spin_lock_irq(&callback_lock);
cpumask_copy(cs->effective_cpus, new_cpus);
cs->effective_mems = *new_mems;
+ if (!nodes_empty(default_sysram_nodelist))
+ nodes_and(cs->mems_default, cs->effective_mems,
+ default_sysram_nodelist);
spin_unlock_irq(&callback_lock);
if (cpus_updated)
@@ -4039,6 +4064,10 @@ static void cpuset_handle_hotplug(void)
if (!on_dfl)
top_cpuset.mems_allowed = new_mems;
top_cpuset.effective_mems = new_mems;
+ if (!nodes_empty(default_sysram_nodelist))
+ nodes_and(top_cpuset.mems_default,
+ top_cpuset.effective_mems,
+ default_sysram_nodelist);
spin_unlock_irq(&callback_lock);
cpuset_update_tasks_nodemask(&top_cpuset);
}
@@ -4109,6 +4138,9 @@ void __init cpuset_init_smp(void)
cpumask_copy(top_cpuset.effective_cpus, cpu_active_mask);
top_cpuset.effective_mems = node_states[N_MEMORY];
+ if (!nodes_empty(default_sysram_nodelist))
+ nodes_and(top_cpuset.mems_default, top_cpuset.effective_mems,
+ default_sysram_nodelist);
hotplug_node_notifier(cpuset_track_online_nodes, CPUSET_CALLBACK_PRI);
@@ -4205,22 +4237,27 @@ bool cpuset_cpus_allowed_fallback(struct task_struct *tsk)
return changed;
}
+/*
+ * At this point in time, no hotplug nodes can have been added, so just set
+ * the mems_default of the init task to the set of N_MEMORY nodes.
+ */
void __init cpuset_init_current_mems_default(void)
{
- nodes_setall(current->mems_default);
+ nodes_clear(current->mems_default);
+ nodes_or(current->mems_default, current->mems_default, node_states[N_MEMORY]);
}
/**
- * cpuset_mems_allowed - return mems_allowed mask from a tasks cpuset.
- * @tsk: pointer to task_struct from which to obtain cpuset->mems_allowed.
+ * cpuset_mems_default - return mems_default mask from a tasks cpuset.
+ * @tsk: pointer to task_struct from which to obtain cpuset->mems_default.
*
- * Description: Returns the nodemask_t mems_allowed of the cpuset
+ * Description: Returns the nodemask_t mems_default of the cpuset
* attached to the specified @tsk. Guaranteed to return some non-empty
* subset of node_states[N_MEMORY], even if this means going outside the
* tasks cpuset.
**/
-nodemask_t cpuset_mems_allowed(struct task_struct *tsk)
+nodemask_t cpuset_mems_default(struct task_struct *tsk)
{
nodemask_t mask;
unsigned long flags;
@@ -4295,17 +4332,29 @@ static struct cpuset *nearest_hardwall_ancestor(struct cpuset *cs)
* tsk_is_oom_victim - any node ok
* GFP_KERNEL - any node in enclosing hardwalled cpuset ok
* GFP_USER - only nodes in current tasks mems allowed ok.
+ * GFP_PROTECTED - allow non-sysram nodes in mems_allowed
*/
bool cpuset_current_node_allowed(int node, gfp_t gfp_mask)
{
struct cpuset *cs; /* current cpuset ancestors */
bool allowed; /* is allocation in zone z allowed? */
unsigned long flags;
+ bool protected_node = gfp_mask & __GFP_PROTECTED;
if (in_interrupt())
return true;
- if (node_isset(node, current->mems_default))
- return true;
+
+ if (protected_node) {
+ rcu_read_lock();
+ cs = task_cs(current);
+ allowed = node_isset(node, cs->mems_allowed);
+ rcu_read_unlock();
+ } else {
+ allowed = node_isset(node, current->mems_default);
+ }
+
+ if (allowed)
+ return allowed;
+
/*
* Allow tasks that have access to memory reserves because they have
* been OOM killed to get memory anywhere.
@@ -4322,13 +4371,15 @@ bool cpuset_current_node_allowed(int node, gfp_t gfp_mask)
spin_lock_irqsave(&callback_lock, flags);
cs = nearest_hardwall_ancestor(task_cs(current));
- allowed = node_isset(node, cs->mems_allowed);
+ allowed = node_isset(node, cs->mems_allowed); /* include protected */
+ if (!protected_node && !nodes_empty(default_sysram_nodelist))
+ allowed &= node_isset(node, default_sysram_nodelist);
spin_unlock_irqrestore(&callback_lock, flags);
return allowed;
}
-bool cpuset_node_allowed(struct cgroup *cgroup, int nid)
+bool cpuset_node_default(struct cgroup *cgroup, int nid)
{
struct cgroup_subsys_state *css;
struct cpuset *cs;
@@ -4347,7 +4398,7 @@ bool cpuset_node_allowed(struct cgroup *cgroup, int nid)
return true;
/*
- * Normally, accessing effective_mems would require the cpuset_mutex
+ * Normally, accessing mems_default would require the cpuset_mutex
* or callback_lock - but node_isset is atomic and the reference
* taken via cgroup_get_e_css is sufficient to protect css.
*
@@ -4359,7 +4410,7 @@ bool cpuset_node_allowed(struct cgroup *cgroup, int nid)
* cannot make strong isolation guarantees, so this is acceptable.
*/
cs = container_of(css, struct cpuset, css);
- allowed = node_isset(nid, cs->effective_mems);
+ allowed = node_isset(nid, cs->mems_default);
css_put(css);
return allowed;
}
diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index 4deda33625f4..a25584cb281e 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -5599,5 +5599,5 @@ subsys_initcall(mem_cgroup_swap_init);
bool mem_cgroup_node_allowed(struct mem_cgroup *memcg, int nid)
{
- return memcg ? cpuset_node_allowed(memcg->css.cgroup, nid) : true;
+ return memcg ? cpuset_node_default(memcg->css.cgroup, nid) : true;
}
diff --git a/mm/mempolicy.c b/mm/mempolicy.c
index 6225d4d23010..5360333dc06d 100644
--- a/mm/mempolicy.c
+++ b/mm/mempolicy.c
@@ -1831,14 +1831,14 @@ static int kernel_migrate_pages(pid_t pid, unsigned long maxnode,
}
rcu_read_unlock();
- task_nodes = cpuset_mems_allowed(task);
+ task_nodes = cpuset_mems_default(task);
/* Is the user allowed to access the target nodes? */
if (!nodes_subset(*new, task_nodes) && !capable(CAP_SYS_NICE)) {
err = -EPERM;
goto out_put;
}
- task_nodes = cpuset_mems_allowed(current);
+ task_nodes = cpuset_mems_default(current);
nodes_and(*new, *new, task_nodes);
if (nodes_empty(*new))
goto out_put;
@@ -2738,7 +2738,7 @@ int vma_dup_policy(struct vm_area_struct *src, struct vm_area_struct *dst)
/*
* If mpol_dup() sees current->cpuset == cpuset_being_rebound, then it
* rebinds the mempolicy its copying by calling mpol_rebind_policy()
- * with the mems_default returned by cpuset_mems_allowed(). This
+ * with the mems_default returned by cpuset_mems_default(). This
* keeps mempolicies cpuset relative after its cpuset moves. See
* further kernel/cpuset.c update_nodemask().
*
@@ -2763,7 +2763,7 @@ struct mempolicy *__mpol_dup(struct mempolicy *old)
*new = *old;
if (current_cpuset_is_being_rebound()) {
- nodemask_t mems = cpuset_mems_allowed(current);
+ nodemask_t mems = cpuset_mems_default(current);
mpol_rebind_policy(new, &mems);
}
atomic_set(&new->refcnt, 1);
diff --git a/mm/migrate.c b/mm/migrate.c
index c0e9f15be2a2..f9a910b43a9f 100644
--- a/mm/migrate.c
+++ b/mm/migrate.c
@@ -2526,7 +2526,7 @@ static struct mm_struct *find_mm_struct(pid_t pid, nodemask_t *mem_nodes)
*/
if (!pid) {
mmget(current->mm);
- *mem_nodes = cpuset_mems_allowed(current);
+ *mem_nodes = cpuset_mems_default(current);
return current->mm;
}
@@ -2547,7 +2547,7 @@ static struct mm_struct *find_mm_struct(pid_t pid, nodemask_t *mem_nodes)
mm = ERR_PTR(security_task_movememory(task));
if (IS_ERR(mm))
goto out;
- *mem_nodes = cpuset_mems_allowed(task);
+ *mem_nodes = cpuset_mems_default(task);
mm = get_task_mm(task);
out:
put_task_struct(task);
--
2.51.1
* [RFC PATCH 6/9] mm/memory_hotplug: add MHP_PROTECTED_MEMORY flag
2025-11-07 22:49 [RFC LPC2026 PATCH 0/9] Protected Memory NUMA Nodes Gregory Price
` (4 preceding siblings ...)
2025-11-07 22:49 ` [RFC PATCH 5/9] cpuset: introduce cpuset.mems.default Gregory Price
@ 2025-11-07 22:49 ` Gregory Price
2025-11-07 22:49 ` [RFC PATCH 7/9] drivers/dax: add protected memory bit to dev_dax Gregory Price
` (2 subsequent siblings)
8 siblings, 0 replies; 10+ messages in thread
From: Gregory Price @ 2025-11-07 22:49 UTC (permalink / raw)
To: linux-mm
Cc: linux-cxl, linux-kernel, nvdimm, linux-fsdevel, cgroups, dave,
jonathan.cameron, dave.jiang, alison.schofield, vishal.l.verma,
ira.weiny, dan.j.williams, longman, akpm, david, lorenzo.stoakes,
Liam.Howlett, vbabka, rppt, surenb, mhocko, osalvador, ziy,
matthew.brost, joshua.hahnjy, rakie.kim, byungchul, gourry,
ying.huang, apopple, mingo, peterz, juri.lelli, vincent.guittot,
dietmar.eggemann, rostedt, bsegall, mgorman, vschneid, tj,
hannes, mkoutny, kees, muchun.song, roman.gushchin, shakeel.butt,
rientjes, jackmanb, cl, harry.yoo, axelrasmussen, yuanchu,
weixugc, zhengqi.arch, yosry.ahmed, nphamcs, chengming.zhou,
fabio.m.de.francesco, rrichter, ming.li, usamaarif642, brauner,
oleg, namcao, escape, dongjoo.seo1
Add support for protected memory blocks/nodes: an MHP flag signals to
memory_hotplug that a given memory block is considered "protected".
A protected memory block/node is not exposed as System RAM by default
via default_sysram_nodes. Protected memory cannot be added to sysram
nodes, and non-protected memory cannot be added to protected nodes.
This keeps these memory blocks protected from allocation by general
actions (page faults, demotion, etc.) unless an explicit,
memory-tier-aware integration point requests them.
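For illustration only, a driver hotplugging protected capacity might end
up doing something along these lines (a minimal sketch assuming the
add_memory_driver_managed() path that dax/kmem uses; the resource name,
variables, and error handling here are abbreviated and hypothetical):

	mhp_t mhp_flags = MHP_NID_IS_MGID | MHP_PROTECTED_MEMORY;
	int rc;

	/* mgid/start/size come from the driver's own range bookkeeping */
	rc = add_memory_driver_managed(mgid, start, size,
				       "System RAM (protected)", mhp_flags);
	if (rc == -EINVAL) {
		/* node already holds memory blocks of the other protection type */
		return rc;
	}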
Signed-off-by: Gregory Price <gourry@gourry.net>
---
include/linux/memory_hotplug.h | 10 ++++++++++
mm/memory_hotplug.c | 23 +++++++++++++++++++++++
2 files changed, 33 insertions(+)
diff --git a/include/linux/memory_hotplug.h b/include/linux/memory_hotplug.h
index 23f038a16231..89f4e5b7054d 100644
--- a/include/linux/memory_hotplug.h
+++ b/include/linux/memory_hotplug.h
@@ -74,6 +74,16 @@ typedef int __bitwise mhp_t;
* helpful in low-memory situations.
*/
#define MHP_OFFLINE_INACCESSIBLE ((__force mhp_t)BIT(3))
+/*
* The hotplugged memory can only be added to a NUMA node which is
* not in default_sysram_nodes. This prevents the node from being made
* accessible to the page allocator (mm/page_alloc.c) by way of userland
* configuration.
*
* Attempting to hotplug protected memory into a node in default_sysram_nodes
* will result in -EINVAL, and attempting to hotplug non-protected memory
* into a protected memory node will also result in -EINVAL.
*/
+#define MHP_PROTECTED_MEMORY ((__force mhp_t)BIT(4))
/*
* Extended parameters for memory hotplug:
diff --git a/mm/memory_hotplug.c b/mm/memory_hotplug.c
index 0be83039c3b5..ceab56b7231d 100644
--- a/mm/memory_hotplug.c
+++ b/mm/memory_hotplug.c
@@ -20,6 +20,7 @@
#include <linux/memory.h>
#include <linux/memremap.h>
#include <linux/memory_hotplug.h>
+#include <linux/memory-tiers.h>
#include <linux/vmalloc.h>
#include <linux/ioport.h>
#include <linux/delay.h>
@@ -1506,6 +1507,7 @@ int add_memory_resource(int nid, struct resource *res, mhp_t mhp_flags)
struct memory_group *group = NULL;
u64 start, size;
bool new_node = false;
+ bool node_has_blocks, protected_mem, node_is_sysram;
int ret;
start = res->start;
@@ -1529,6 +1531,19 @@ int add_memory_resource(int nid, struct resource *res, mhp_t mhp_flags)
mem_hotplug_begin();
+ /*
+ * If the NUMA node already has memory blocks, then we can only allow
+ * additional memory blocks of the same protection type (protected or
+ * un-protected). Online/offline does not matter at this point.
+ */
+ node_has_blocks = node_has_memory_blocks(nid);
+ protected_mem = !!(mhp_flags & MHP_PROTECTED_MEMORY);
+ node_is_sysram = node_isset(nid, *default_sysram_nodes);
+ if (node_has_blocks && (protected_mem == node_is_sysram)) {
+ ret = -EINVAL;
+ goto error_mem_hotplug_end;
+ }
+
if (IS_ENABLED(CONFIG_ARCH_KEEP_MEMBLOCK)) {
if (res->flags & IORESOURCE_SYSRAM_DRIVER_MANAGED)
memblock_flags = MEMBLOCK_DRIVER_MANAGED;
@@ -1574,6 +1589,10 @@ int add_memory_resource(int nid, struct resource *res, mhp_t mhp_flags)
register_memory_blocks_under_node_hotplug(nid, PFN_DOWN(start),
PFN_UP(start + size - 1));
+ /* At this point if not protected, we can add node to sysram nodes */
+ if (!(mhp_flags & MHP_PROTECTED_MEMORY))
+ node_set(nid, *default_sysram_nodes);
+
/* create new memmap entry */
if (!strcmp(res->name, "System RAM"))
firmware_map_add_hotplug(start, start + size, "System RAM");
@@ -2274,6 +2293,10 @@ static int try_remove_memory(u64 start, u64 size)
if (nid != NUMA_NO_NODE)
try_offline_node(nid);
+ /* If no more memblocks, remove node from default sysram nodemask */
+ if (!node_has_memory_blocks(nid))
+ node_clear(nid, *default_sysram_nodes);
+
mem_hotplug_done();
return 0;
}
--
2.51.1
* [RFC PATCH 7/9] drivers/dax: add protected memory bit to dev_dax
2025-11-07 22:49 [RFC LPC2026 PATCH 0/9] Protected Memory NUMA Nodes Gregory Price
` (5 preceding siblings ...)
2025-11-07 22:49 ` [RFC PATCH 6/9] mm/memory_hotplug: add MHP_PROTECTED_MEMORY flag Gregory Price
@ 2025-11-07 22:49 ` Gregory Price
2025-11-07 22:49 ` [RFC PATCH 8/9] drivers/cxl: add protected_memory bit to cxl region Gregory Price
2025-11-07 22:49 ` [RFC PATCH 9/9] [HACK] mm/zswap: compressed ram integration example Gregory Price
8 siblings, 0 replies; 10+ messages in thread
From: Gregory Price @ 2025-11-07 22:49 UTC (permalink / raw)
To: linux-mm
Cc: linux-cxl, linux-kernel, nvdimm, linux-fsdevel, cgroups, dave,
jonathan.cameron, dave.jiang, alison.schofield, vishal.l.verma,
ira.weiny, dan.j.williams, longman, akpm, david, lorenzo.stoakes,
Liam.Howlett, vbabka, rppt, surenb, mhocko, osalvador, ziy,
matthew.brost, joshua.hahnjy, rakie.kim, byungchul, gourry,
ying.huang, apopple, mingo, peterz, juri.lelli, vincent.guittot,
dietmar.eggemann, rostedt, bsegall, mgorman, vschneid, tj,
hannes, mkoutny, kees, muchun.song, roman.gushchin, shakeel.butt,
rientjes, jackmanb, cl, harry.yoo, axelrasmussen, yuanchu,
weixugc, zhengqi.arch, yosry.ahmed, nphamcs, chengming.zhou,
fabio.m.de.francesco, rrichter, ming.li, usamaarif642, brauner,
oleg, namcao, escape, dongjoo.seo1
This bit is used by dax/kmem to determine whether to set the
MHP_PROTECTED_MEMORY flag, which controls whether the hotplugged
memory is restricted to a protected memory NUMA node.
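As a rough, hypothetical sketch (not part of this patch), a non-CXL driver
could request a protected dev_dax the same way the CXL patch later in this
series does - by setting the new field before calling devm_create_dev_dax().
The dax_region/range variables below are assumed to already exist in the
driver:

	struct dev_dax_data data = {
		.dax_region = dax_region,
		.id = -1,
		.size = range_len(&range),
		.memmap_on_memory = true,
		.protected_memory = true,	/* field added by this patch */
	};

	return PTR_ERR_OR_ZERO(devm_create_dev_dax(&data));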
Signed-off-by: Gregory Price <gourry@gourry.net>
---
drivers/dax/bus.c | 39 +++++++++++++++++++++++++++++++++++++++
drivers/dax/bus.h | 1 +
drivers/dax/dax-private.h | 1 +
drivers/dax/kmem.c | 2 ++
4 files changed, 43 insertions(+)
diff --git a/drivers/dax/bus.c b/drivers/dax/bus.c
index fde29e0ad68b..4321e80276f0 100644
--- a/drivers/dax/bus.c
+++ b/drivers/dax/bus.c
@@ -1361,6 +1361,43 @@ static ssize_t memmap_on_memory_store(struct device *dev,
}
static DEVICE_ATTR_RW(memmap_on_memory);
+static ssize_t protected_memory_show(struct device *dev,
+ struct device_attribute *attr, char *buf)
+{
+ struct dev_dax *dev_dax = to_dev_dax(dev);
+
+ return sysfs_emit(buf, "%d\n", dev_dax->protected_memory);
+}
+
+static ssize_t protected_memory_store(struct device *dev,
+ struct device_attribute *attr,
+ const char *buf, size_t len)
+{
+ struct dev_dax *dev_dax = to_dev_dax(dev);
+ bool val;
+ int rc;
+
+ rc = kstrtobool(buf, &val);
+ if (rc)
+ return rc;
+
+ rc = down_write_killable(&dax_dev_rwsem);
+ if (rc)
+ return rc;
+
+ if (dev_dax->protected_memory != val && dev->driver &&
+ to_dax_drv(dev->driver)->type == DAXDRV_KMEM_TYPE) {
+ up_write(&dax_dev_rwsem);
+ return -EBUSY;
+ }
+
+ dev_dax->protected_memory = val;
+ up_write(&dax_dev_rwsem);
+
+ return len;
+}
+static DEVICE_ATTR_RW(protected_memory);
+
static umode_t dev_dax_visible(struct kobject *kobj, struct attribute *a, int n)
{
struct device *dev = container_of(kobj, struct device, kobj);
@@ -1388,6 +1425,7 @@ static struct attribute *dev_dax_attributes[] = {
&dev_attr_resource.attr,
&dev_attr_numa_node.attr,
&dev_attr_memmap_on_memory.attr,
+ &dev_attr_protected_memory.attr,
NULL,
};
@@ -1494,6 +1532,7 @@ static struct dev_dax *__devm_create_dev_dax(struct dev_dax_data *data)
ida_init(&dev_dax->ida);
dev_dax->memmap_on_memory = data->memmap_on_memory;
+ dev_dax->protected_memory = data->protected_memory;
inode = dax_inode(dax_dev);
dev->devt = inode->i_rdev;
diff --git a/drivers/dax/bus.h b/drivers/dax/bus.h
index cbbf64443098..0a885bf9839f 100644
--- a/drivers/dax/bus.h
+++ b/drivers/dax/bus.h
@@ -24,6 +24,7 @@ struct dev_dax_data {
resource_size_t size;
int id;
bool memmap_on_memory;
+ bool protected_memory;
};
struct dev_dax *devm_create_dev_dax(struct dev_dax_data *data);
diff --git a/drivers/dax/dax-private.h b/drivers/dax/dax-private.h
index 0867115aeef2..605b7ed87ffe 100644
--- a/drivers/dax/dax-private.h
+++ b/drivers/dax/dax-private.h
@@ -89,6 +89,7 @@ struct dev_dax {
struct device dev;
struct dev_pagemap *pgmap;
bool memmap_on_memory;
+ bool protected_memory;
int nr_range;
struct dev_dax_range *ranges;
};
diff --git a/drivers/dax/kmem.c b/drivers/dax/kmem.c
index c036e4d0b610..140c6cb0ac88 100644
--- a/drivers/dax/kmem.c
+++ b/drivers/dax/kmem.c
@@ -169,6 +169,8 @@ static int dev_dax_kmem_probe(struct dev_dax *dev_dax)
mhp_flags = MHP_NID_IS_MGID;
if (dev_dax->memmap_on_memory)
mhp_flags |= MHP_MEMMAP_ON_MEMORY;
+ if (dev_dax->protected_memory)
+ mhp_flags |= MHP_PROTECTED_MEMORY;
/*
* Ensure that future kexec'd kernels will not treat
--
2.51.1
* [RFC PATCH 8/9] drivers/cxl: add protected_memory bit to cxl region
2025-11-07 22:49 [RFC LPC2026 PATCH 0/9] Protected Memory NUMA Nodes Gregory Price
` (6 preceding siblings ...)
2025-11-07 22:49 ` [RFC PATCH 7/9] drivers/dax: add protected memory bit to dev_dax Gregory Price
@ 2025-11-07 22:49 ` Gregory Price
2025-11-07 22:49 ` [RFC PATCH 9/9] [HACK] mm/zswap: compressed ram integration example Gregory Price
8 siblings, 0 replies; 10+ messages in thread
From: Gregory Price @ 2025-11-07 22:49 UTC (permalink / raw)
To: linux-mm
Cc: linux-cxl, linux-kernel, nvdimm, linux-fsdevel, cgroups, dave,
jonathan.cameron, dave.jiang, alison.schofield, vishal.l.verma,
ira.weiny, dan.j.williams, longman, akpm, david, lorenzo.stoakes,
Liam.Howlett, vbabka, rppt, surenb, mhocko, osalvador, ziy,
matthew.brost, joshua.hahnjy, rakie.kim, byungchul, gourry,
ying.huang, apopple, mingo, peterz, juri.lelli, vincent.guittot,
dietmar.eggemann, rostedt, bsegall, mgorman, vschneid, tj,
hannes, mkoutny, kees, muchun.song, roman.gushchin, shakeel.butt,
rientjes, jackmanb, cl, harry.yoo, axelrasmussen, yuanchu,
weixugc, zhengqi.arch, yosry.ahmed, nphamcs, chengming.zhou,
fabio.m.de.francesco, rrichter, ming.li, usamaarif642, brauner,
oleg, namcao, escape, dongjoo.seo1
Add a protected_memory bit to the CXL region. The setting is
subsequently forwarded to the dax device the region creates. This
allows the auto-hotplug process to occur without an intermediate step
in which udev must explicitly poke the DAX device's protected_memory
bit before onlining.
Signed-off-by: Gregory Price <gourry@gourry.net>
---
drivers/cxl/core/region.c | 30 ++++++++++++++++++++++++++++++
drivers/cxl/cxl.h | 2 ++
drivers/dax/cxl.c | 1 +
3 files changed, 33 insertions(+)
diff --git a/drivers/cxl/core/region.c b/drivers/cxl/core/region.c
index b06fee1978ba..a0e28821961c 100644
--- a/drivers/cxl/core/region.c
+++ b/drivers/cxl/core/region.c
@@ -754,6 +754,35 @@ static ssize_t size_show(struct device *dev, struct device_attribute *attr,
}
static DEVICE_ATTR_RW(size);
+static ssize_t protected_memory_show(struct device *dev,
+ struct device_attribute *attr, char *buf)
+{
+ struct cxl_region *cxlr = to_cxl_region(dev);
+
+ return sysfs_emit(buf, "%d\n", cxlr->protected_memory);
+}
+
+static ssize_t protected_memory_store(struct device *dev,
+ struct device_attribute *attr,
+ const char *buf, size_t len)
+{
+ struct cxl_region *cxlr = to_cxl_region(dev);
+ bool val;
+ int rc;
+
+ rc = kstrtobool(buf, &val);
+ if (rc)
+ return rc;
+
+ ACQUIRE(rwsem_write_kill, rwsem)(&cxl_rwsem.region);
+ if ((rc = ACQUIRE_ERR(rwsem_write_kill, &rwsem)))
+ return rc;
+
+ cxlr->protected_memory = val;
+ return len;
+}
+static DEVICE_ATTR_RW(protected_memory);
+
static struct attribute *cxl_region_attrs[] = {
&dev_attr_uuid.attr,
&dev_attr_commit.attr,
@@ -762,6 +791,7 @@ static struct attribute *cxl_region_attrs[] = {
&dev_attr_resource.attr,
&dev_attr_size.attr,
&dev_attr_mode.attr,
+ &dev_attr_protected_memory.attr,
NULL,
};
diff --git a/drivers/cxl/cxl.h b/drivers/cxl/cxl.h
index 231ddccf8977..0ff4898224ba 100644
--- a/drivers/cxl/cxl.h
+++ b/drivers/cxl/cxl.h
@@ -530,6 +530,7 @@ enum cxl_partition_mode {
* @coord: QoS access coordinates for the region
* @node_notifier: notifier for setting the access coordinates to node
* @adist_notifier: notifier for calculating the abstract distance of node
+ * @protected_memory: mark region memory as protected from kernel allocation
*/
struct cxl_region {
struct device dev;
@@ -543,6 +544,7 @@ struct cxl_region {
struct access_coordinate coord[ACCESS_COORDINATE_MAX];
struct notifier_block node_notifier;
struct notifier_block adist_notifier;
+ bool protected_memory;
};
struct cxl_nvdimm_bridge {
diff --git a/drivers/dax/cxl.c b/drivers/dax/cxl.c
index 13cd94d32ff7..a4232a5507b5 100644
--- a/drivers/dax/cxl.c
+++ b/drivers/dax/cxl.c
@@ -27,6 +27,7 @@ static int cxl_dax_region_probe(struct device *dev)
.id = -1,
.size = range_len(&cxlr_dax->hpa_range),
.memmap_on_memory = true,
+ .protected_memory = cxlr->protected_memory,
};
return PTR_ERR_OR_ZERO(devm_create_dev_dax(&data));
--
2.51.1
* [RFC PATCH 9/9] [HACK] mm/zswap: compressed ram integration example
2025-11-07 22:49 [RFC LPC2026 PATCH 0/9] Protected Memory NUMA Nodes Gregory Price
` (7 preceding siblings ...)
2025-11-07 22:49 ` [RFC PATCH 8/9] drivers/cxl: add protected_memory bit to cxl region Gregory Price
@ 2025-11-07 22:49 ` Gregory Price
8 siblings, 0 replies; 10+ messages in thread
From: Gregory Price @ 2025-11-07 22:49 UTC (permalink / raw)
To: linux-mm
Cc: linux-cxl, linux-kernel, nvdimm, linux-fsdevel, cgroups, dave,
jonathan.cameron, dave.jiang, alison.schofield, vishal.l.verma,
ira.weiny, dan.j.williams, longman, akpm, david, lorenzo.stoakes,
Liam.Howlett, vbabka, rppt, surenb, mhocko, osalvador, ziy,
matthew.brost, joshua.hahnjy, rakie.kim, byungchul, gourry,
ying.huang, apopple, mingo, peterz, juri.lelli, vincent.guittot,
dietmar.eggemann, rostedt, bsegall, mgorman, vschneid, tj,
hannes, mkoutny, kees, muchun.song, roman.gushchin, shakeel.butt,
rientjes, jackmanb, cl, harry.yoo, axelrasmussen, yuanchu,
weixugc, zhengqi.arch, yosry.ahmed, nphamcs, chengming.zhou,
fabio.m.de.francesco, rrichter, ming.li, usamaarif642, brauner,
oleg, namcao, escape, dongjoo.seo1
Here is an example of how you might use a protected memory node.
We hack an mt_compressed_nodelist into memory-tiers.c as a stand-in
for a proper compressed-ram component, and use that nodelist to
determine whether compressed ram is available in the zswap_compress
function.
If compressed ram is available, we skip the software compression
process entirely and instead memcpy the page directly into a folio
allocated from the compressed memory node, storing the newly allocated
compressed memory page as the zswap entry->handle.
On decompress we do the opposite: copy directly from the stored
compressed page to the new destination, and free the compressed
memory page.
Note: We do not integrate any compressed memory device checks at
this point because this is a stand-in to demonstrate how the protected
node allocation mechanism works. See the "TODO" comment in
`zswap_compress_direct()` for more details on how that would work.
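To make that TODO a bit more concrete, here is a minimal, purely
illustrative sketch of the watermark-flag variant mentioned in the comment
(the cram_* names are invented for this example): the compressed-memory
driver flips a per-node flag from its interrupt/notifier path, and
zswap_compress_direct() would check it before allocating from the node.

	/* Hypothetical per-node "device saturated" flags */
	static atomic_t cram_node_saturated[MAX_NUMNODES];

	/* Called by the compressed-memory driver on a watermark crossing */
	void cram_report_watermark(int nid, bool saturated)
	{
		atomic_set(&cram_node_saturated[nid], saturated ? 1 : 0);
	}

	/* Checked in zswap_compress_direct() before allocating from @nid */
	static bool cram_node_usable(int nid)
	{
		return !atomic_read(&cram_node_saturated[nid]);
	}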
In reality, we would want to move this mechanism out of zswap into
its own component (cram.c?), and enable a more direct migrate_page()
call that re-maps the page read-only in any existing mappings and
provides a write-fault handler which promotes the page on write.
This prevents runaway compression-ratio failures, since the
compression ratio would be checked at allocation time rather than
allowed to silently degrade on writes until the device becomes
unstable.
Signed-off-by: Gregory Price <gourry@gourry.net>
---
include/linux/memory-tiers.h | 1 +
mm/memory-tiers.c | 3 ++
mm/memory_hotplug.c | 2 ++
mm/zswap.c | 65 +++++++++++++++++++++++++++++++++++-
4 files changed, 70 insertions(+), 1 deletion(-)
diff --git a/include/linux/memory-tiers.h b/include/linux/memory-tiers.h
index 3d3f3687d134..ff2ab7990e8f 100644
--- a/include/linux/memory-tiers.h
+++ b/include/linux/memory-tiers.h
@@ -42,6 +42,7 @@ extern nodemask_t default_dram_nodes;
extern nodemask_t default_sysram_nodelist;
#define default_sysram_nodes (nodes_empty(default_sysram_nodelist) ? NULL : \
&default_sysram_nodelist)
+extern nodemask_t mt_compressed_nodelist;
struct memory_dev_type *alloc_memory_type(int adistance);
void put_memory_type(struct memory_dev_type *memtype);
void init_node_memory_type(int node, struct memory_dev_type *default_type);
diff --git a/mm/memory-tiers.c b/mm/memory-tiers.c
index b2ee4f73ad54..907635611f17 100644
--- a/mm/memory-tiers.c
+++ b/mm/memory-tiers.c
@@ -51,6 +51,9 @@ nodemask_t default_dram_nodes = NODE_MASK_NONE;
/* default_sysram_nodelist is the list of nodes with RAM at __init time */
nodemask_t default_sysram_nodelist = NODE_MASK_NONE;
+/* compressed memory nodes */
+nodemask_t mt_compressed_nodelist = NODE_MASK_NONE;
+
static const struct bus_type memory_tier_subsys = {
.name = "memory_tiering",
.dev_name = "memory_tier",
diff --git a/mm/memory_hotplug.c b/mm/memory_hotplug.c
index ceab56b7231d..8fcd894de93c 100644
--- a/mm/memory_hotplug.c
+++ b/mm/memory_hotplug.c
@@ -1592,6 +1592,8 @@ int add_memory_resource(int nid, struct resource *res, mhp_t mhp_flags)
/* At this point if not protected, we can add node to sysram nodes */
if (!(mhp_flags & MHP_PROTECTED_MEMORY))
node_set(nid, *default_sysram_nodes);
+ else /* HACK: We would create a proper interface for something like this */
+ node_set(nid, mt_compressed_nodelist);
/* create new memmap entry */
if (!strcmp(res->name, "System RAM"))
diff --git a/mm/zswap.c b/mm/zswap.c
index c1af782e54ec..09010ba2440c 100644
--- a/mm/zswap.c
+++ b/mm/zswap.c
@@ -25,6 +25,7 @@
#include <linux/scatterlist.h>
#include <linux/mempolicy.h>
#include <linux/mempool.h>
+#include <linux/memory-tiers.h>
#include <crypto/acompress.h>
#include <linux/zswap.h>
#include <linux/mm_types.h>
@@ -191,6 +192,7 @@ struct zswap_entry {
swp_entry_t swpentry;
unsigned int length;
bool referenced;
+ bool direct;
struct zswap_pool *pool;
unsigned long handle;
struct obj_cgroup *objcg;
@@ -717,7 +719,8 @@ static void zswap_entry_cache_free(struct zswap_entry *entry)
static void zswap_entry_free(struct zswap_entry *entry)
{
zswap_lru_del(&zswap_list_lru, entry);
- zs_free(entry->pool->zs_pool, entry->handle);
+ if (!entry->direct)
+ zs_free(entry->pool->zs_pool, entry->handle);
zswap_pool_put(entry->pool);
if (entry->objcg) {
obj_cgroup_uncharge_zswap(entry->objcg, entry->length);
@@ -851,6 +854,43 @@ static void acomp_ctx_put_unlock(struct crypto_acomp_ctx *acomp_ctx)
mutex_unlock(&acomp_ctx->mutex);
}
+static struct page *zswap_compress_direct(struct page *src,
+ struct zswap_entry *entry)
+{
+ int nid = first_node(mt_compressed_nodelist);
+ struct page *dst;
+ gfp_t gfp;
+
+ if (nid >= MAX_NUMNODES)
+ return NULL;
+
+ gfp = GFP_NOWAIT | __GFP_NORETRY | __GFP_HIGHMEM | __GFP_MOVABLE |
+ __GFP_PROTECTED;
+ dst = __alloc_pages(gfp, 0, nid, &mt_compressed_nodelist);
+ if (!dst)
+ return NULL;
+
+ /*
+ * TODO: check that the page is safe to use
+ *
+ * In a real implementation, we would not be using ZSWAP to demonstrate this
+ * and instead would implement a new component (compressed_ram, cram.c?)
+ *
+ * At this point we would check via some callback that the device's memory
+ * is actually safe to use - and if not, free the page (without writing to
+ * it), and kick off kswapd for that node to make room.
+ *
+ * Alternatively, if the compressed memory device(s) report a watermark
+ * crossing via interrupt, a flag can be set that is checked here rather
+ * that calling back into a device driver.
+ *
+ * In this case, we're testing with normal memory, so the memory is always
+ * safe to use (i.e. no compression ratio to worry about).
+ */
+ copy_mc_highpage(dst, src);
+ return dst;
+}
+
static bool zswap_compress(struct page *page, struct zswap_entry *entry,
struct zswap_pool *pool)
{
@@ -862,6 +902,19 @@ static bool zswap_compress(struct page *page, struct zswap_entry *entry,
gfp_t gfp;
u8 *dst;
bool mapped = false;
+ struct page *zpage;
+
+ entry->direct = false;
+
+ /* Try to shunt directly to compressed ram */
+ if (!nodes_empty(mt_compressed_nodelist)) {
+ zpage = zswap_compress_direct(page, entry);
+ if (zpage) {
+ entry->handle = (unsigned long)zpage;
+ entry->length = PAGE_SIZE;
+ entry->direct = true;
+ return true;
+ }
+ /* otherwise fallback to normal zswap */
+ }
acomp_ctx = acomp_ctx_get_cpu_lock(pool);
dst = acomp_ctx->buffer;
@@ -939,6 +992,15 @@ static bool zswap_decompress(struct zswap_entry *entry, struct folio *folio)
int decomp_ret = 0, dlen = PAGE_SIZE;
u8 *src, *obj;
+ /* compressed ram page */
+ if (entry->direct) {
+ struct page *zpage = (struct page *)entry->handle;
+ struct folio *zfolio = page_folio(zpage);
+ memcpy_folio(folio, 0, zfolio, 0, PAGE_SIZE);
+ __free_page(zpage);
+ goto direct_done;
+ }
+
acomp_ctx = acomp_ctx_get_cpu_lock(pool);
obj = zs_obj_read_begin(pool->zs_pool, entry->handle, acomp_ctx->buffer);
@@ -972,6 +1034,7 @@ static bool zswap_decompress(struct zswap_entry *entry, struct folio *folio)
zs_obj_read_end(pool->zs_pool, entry->handle, obj);
acomp_ctx_put_unlock(acomp_ctx);
+direct_done:
if (!decomp_ret && dlen == PAGE_SIZE)
return true;
--
2.51.1
Thread overview: 10+ messages
2025-11-07 22:49 [RFC LPC2026 PATCH 0/9] Protected Memory NUMA Nodes Gregory Price
2025-11-07 22:49 ` [RFC PATCH 1/9] gfp: Add GFP_PROTECTED for protected-node allocations Gregory Price
2025-11-07 22:49 ` [RFC PATCH 2/9] memory-tiers: create default_sysram_nodes Gregory Price
2025-11-07 22:49 ` [RFC PATCH 3/9] mm: default slub, oom_kill, compaction, and page_alloc to sysram Gregory Price
2025-11-07 22:49 ` [RFC PATCH 4/9] mm,cpusets: rename task->mems_allowed to task->mems_default Gregory Price
2025-11-07 22:49 ` [RFC PATCH 5/9] cpuset: introduce cpuset.mems.default Gregory Price
2025-11-07 22:49 ` [RFC PATCH 6/9] mm/memory_hotplug: add MHP_PROTECTED_MEMORY flag Gregory Price
2025-11-07 22:49 ` [RFC PATCH 7/9] drivers/dax: add protected memory bit to dev_dax Gregory Price
2025-11-07 22:49 ` [RFC PATCH 8/9] drivers/cxl: add protected_memory bit to cxl region Gregory Price
2025-11-07 22:49 ` [RFC PATCH 9/9] [HACK] mm/zswap: compressed ram integration example Gregory Price