linux-mm.kvack.org archive mirror
* [RFC LPC2026 PATCH v2 00/11] Specific Purpose Memory NUMA Nodes
@ 2025-11-12 19:29 Gregory Price
  2025-11-12 19:29 ` [RFC PATCH v2 01/11] mm: constify oom_control, scan_control, and alloc_context nodemask Gregory Price
                   ` (14 more replies)
  0 siblings, 15 replies; 35+ messages in thread
From: Gregory Price @ 2025-11-12 19:29 UTC (permalink / raw)
  To: linux-mm
  Cc: kernel-team, linux-cxl, linux-kernel, nvdimm, linux-fsdevel,
	cgroups, dave, jonathan.cameron, dave.jiang, alison.schofield,
	vishal.l.verma, ira.weiny, dan.j.williams, longman, akpm, david,
	lorenzo.stoakes, Liam.Howlett, vbabka, rppt, surenb, mhocko,
	osalvador, ziy, matthew.brost, joshua.hahnjy, rakie.kim,
	byungchul, gourry, ying.huang, apopple, mingo, peterz,
	juri.lelli, vincent.guittot, dietmar.eggemann, rostedt, bsegall,
	mgorman, vschneid, tj, hannes, mkoutny, kees, muchun.song,
	roman.gushchin, shakeel.butt, rientjes, jackmanb, cl, harry.yoo,
	axelrasmussen, yuanchu, weixugc, zhengqi.arch, yosry.ahmed,
	nphamcs, chengming.zhou, fabio.m.de.francesco, rrichter, ming.li,
	usamaarif642, brauner, oleg, namcao, escape, dongjoo.seo1

This is a code RFC for discussion related to

"Mempolicy is dead, long live memory policy!"
https://lpc.events/event/19/contributions/2143/

base-commit: 24172e0d79900908cf5ebf366600616d29c9b417
(version notes at end)

At LSF 2026, I plan to discuss:
- Why? (In short: shunting to DAX is a failed pattern for users)
- Other designs I considered (mempolicy, cpusets, zone_device)
- Why mempolicy.c and cpusets as-is are insufficient
- SPM types seeking this form of interface (Accelerator, Compression)
- Platform extensions that would be nice to see (SPM-only Bits)

Open Questions
- Single SPM nodemask, or multiple based on features?
- Apply SPM/SysRAM bit on-boot only or at-hotplug?
- Allocate extra "possible" NUMA nodes for flexibility?
- Should SPM nodes be zone-restricted (MOVABLE only)?
- How should reclaim and compaction be handled on these nodes?


With this set, we aim to enable allocation of "special purpose memory"
with the page allocator (mm/page_alloc.c) without exposing the same
memory as "System RAM".  Unless a non-userland component requests it,
and does so with the GFP_SPM_NODE flag, memory on these nodes cannot
be allocated.

This isolation mechanism is a requirement for memory policies which
depend on certain sets of memory never being used outside special
interfaces (such as a specific mm/component or driver).

We present an example of using this mechanism within ZSWAP, as if
a "compressed memory node" were present.  How to describe the features
of memory present on these nodes is left open for comment here and at
LPC '26.

Userspace-driven allocations are restricted by the sysram_nodes mask;
nothing in userspace can explicitly request memory from SPM nodes.
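The allocation-time filtering can be sketched as follows.  This is
illustrative userspace C, not the actual kernel code: GFP_SPM_NODE here
is a stand-in bit, and node_is_spm[] stands in for the real per-node
SPM state tracked by the series.

```c
#include <assert.h>
#include <stdbool.h>

/* Illustrative sketch only: GFP_SPM_NODE and node_is_spm[] stand in
 * for the real gfp flag and per-node SPM state in the series. */
#define GFP_SPM_NODE (1u << 0)

static bool node_is_spm[8] = { [4] = true };	/* pretend node 4 is SPM */

/* Without GFP_SPM_NODE, zones on SPM nodes are simply skipped. */
static bool spm_node_allowed(int nid, unsigned int gfp_flags)
{
	return !node_is_spm[nid] || (gfp_flags & GFP_SPM_NODE);
}
```

A normal (userspace-driven) allocation never carries GFP_SPM_NODE, so
node 4 above is unreachable; a kernel component that registered for the
node passes the flag explicitly.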

Instead, the intent is to create new components which understand memory
features and register those nodes with those components. This abstracts
the hardware complexity away from userland while also not requiring new
memory innovations to carry entirely new allocators.

The ZSwap example demonstrates this with the `mt_spm_nodemask`.  This
hack treats all SPM nodes as if they were compressed memory nodes, and
we bypass the software compression logic in zswap in favor of simply
copying memory directly to the allocated page.  In a real design, the
compression would be handled by the device rather than in software.

There are 4 major changes in this set:

1) Introducing mt_sysram_nodelist in mm/memory-tiers.c which denotes
   the set of nodes which are eligible for use as normal system ram

   Some existing users now pass mt_sysram_nodelist into the page
   allocator instead of NULL, but a NULL pointer will simply be
   replaced by mt_sysram_nodelist anyway.  Should a NULL pointer
   still make it to the page allocator, SPM node zones will simply
   be skipped unless GFP_SPM_NODE is set.

   mt_sysram_nodelist is always guaranteed to contain the N_MEMORY nodes
   present during __init, but if it is empty, mt_sysram_nodes() returns
   NULL to preserve current behavior.


2) The addition of `cpuset.mems.sysram` which restricts allocations to
   `mt_sysram_nodes` unless GFP_SPM_NODE is used.

   SPM Nodes are still allowed in cpuset.mems.allowed and effective.

   This is done to allow separate control over sysram and SPM node sets
   by cgroups while maintaining the existing hierarchical rules.

   current cpuset configuration
   cpuset.mems_allowed
    |.mems_effective         < (mems_allowed ∩ parent.mems_effective)
    |->tasks.mems_allowed    < cpuset.mems_effective

   new cpuset configuration
   cpuset.mems_allowed
    |.mems_effective         < (mems_allowed ∩ parent.mems_effective)
    |.sysram_nodes           < (mems_effective ∩ default_sys_nodemask)
    |->task.sysram_nodes     < cpuset.sysram_nodes

   This means mems_allowed still restricts all node usage in any given
   task context, which is the existing behavior.
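   The intersection rules above can be sketched with plain bitmasks.
   Hypothetical helpers for illustration only; the kernel uses
   nodemask_t and walks the cpuset hierarchy.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical sketch: a uint64_t of node bits stands in for nodemask_t. */
typedef uint64_t nodemask_sketch;

/* mems_effective = mems_allowed ∩ parent.mems_effective */
static nodemask_sketch mems_effective(nodemask_sketch allowed,
				      nodemask_sketch parent_effective)
{
	return allowed & parent_effective;
}

/* sysram_nodes = mems_effective ∩ default sysram nodemask */
static nodemask_sketch sysram_nodes(nodemask_sketch effective,
				    nodemask_sketch sysram_default)
{
	return effective & sysram_default;
}
```

   With nodes 0-3 as sysram and node 4 as SPM, a cpuset allowed all
   five nodes keeps node 4 in mems_effective (so GFP_SPM_NODE users can
   reach it) while sysram_nodes drops it for ordinary allocations.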

3) Addition of MHP_SPM_NODE flag to instruct memory_hotplug.c that the
   capacity being added should mark the node as an SPM Node. 

   A node is either SysRAM or SPM - never both.  Attempting to add
   incompatible memory to a node results in hotplug failure.
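   The "never both" rule amounts to a per-node latch.  A minimal
   sketch follows; the helper name and return value are illustrative,
   not the actual series code, which fails the hotplug itself.

```c
#include <assert.h>
#include <stdbool.h>

/* Illustrative per-node latch: the first hotplug fixes a node's type,
 * and adding memory of the other type is rejected. */
enum node_kind { NODE_UNLATCHED, NODE_SYSRAM, NODE_SPM };

static enum node_kind node_kind[8];

/* Returns 0 on success, -1 if the node is already latched to the
 * other type. */
static int latch_node_kind(int nid, bool spm)
{
	enum node_kind want = spm ? NODE_SPM : NODE_SYSRAM;

	if (node_kind[nid] == NODE_UNLATCHED)
		node_kind[nid] = want;
	return node_kind[nid] == want ? 0 : -1;
}
```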

   DAX and CXL are made aware of the bit and have `spm_node` bits added
   to their relevant subsystems.

4) Adding GFP_SPM_NODE - which allows page_alloc.c to request memory
   from the provided node or nodemask.  It changes the behavior of
   the cpuset mems_allowed and mt_node_allowed() checks.

v1->v2:
- naming improvements
    default_node -> sysram_node
    protected    -> spm (Specific Purpose Memory)
- add missing constify patch
- add patch to update callers of __cpuset_zone_allowed
- add additional logic to the mm sysram_nodes patch
- fix bot build issues (ifdef config builds)
- fix out-of-tree driver build issues (function renames)
- change compressed_nodelist to spm_nodelist
- add latch mechanism for sysram/spm nodes (Dan Williams)
  this drops some extra memory-hotplug logic which is nice
v1: https://lore.kernel.org/linux-mm/20251107224956.477056-1-gourry@gourry.net/

Gregory Price (11):
  mm: constify oom_control, scan_control, and alloc_context nodemask
  mm: change callers of __cpuset_zone_allowed to cpuset_zone_allowed
  gfp: Add GFP_SPM_NODE for Specific Purpose Memory (SPM) allocations
  memory-tiers: Introduce SysRAM and Specific Purpose Memory Nodes
  mm: restrict slub, oom, compaction, and page_alloc to sysram by
    default
  mm,cpusets: rename task->mems_allowed to task->sysram_nodes
  cpuset: introduce cpuset.mems.sysram
  mm/memory_hotplug: add MHP_SPM_NODE flag
  drivers/dax: add spm_node bit to dev_dax
  drivers/cxl: add spm_node bit to cxl region
  [HACK] mm/zswap: compressed ram integration example

 drivers/cxl/core/region.c       |  30 ++++++
 drivers/cxl/cxl.h               |   2 +
 drivers/dax/bus.c               |  39 ++++++++
 drivers/dax/bus.h               |   1 +
 drivers/dax/cxl.c               |   1 +
 drivers/dax/dax-private.h       |   1 +
 drivers/dax/kmem.c              |   2 +
 fs/proc/array.c                 |   2 +-
 include/linux/cpuset.h          |  62 +++++++------
 include/linux/gfp_types.h       |   5 +
 include/linux/memory-tiers.h    |  47 ++++++++++
 include/linux/memory_hotplug.h  |  10 ++
 include/linux/mempolicy.h       |   2 +-
 include/linux/mm.h              |   4 +-
 include/linux/mmzone.h          |   6 +-
 include/linux/oom.h             |   2 +-
 include/linux/sched.h           |   6 +-
 include/linux/swap.h            |   2 +-
 init/init_task.c                |   2 +-
 kernel/cgroup/cpuset-internal.h |   8 ++
 kernel/cgroup/cpuset-v1.c       |   7 ++
 kernel/cgroup/cpuset.c          | 158 ++++++++++++++++++++------------
 kernel/fork.c                   |   2 +-
 kernel/sched/fair.c             |   4 +-
 mm/compaction.c                 |  10 +-
 mm/hugetlb.c                    |   8 +-
 mm/internal.h                   |   2 +-
 mm/memcontrol.c                 |   3 +-
 mm/memory-tiers.c               |  66 ++++++++++++-
 mm/memory_hotplug.c             |   7 ++
 mm/mempolicy.c                  |  34 +++----
 mm/migrate.c                    |   4 +-
 mm/mmzone.c                     |   5 +-
 mm/oom_kill.c                   |  11 ++-
 mm/page_alloc.c                 |  57 +++++++-----
 mm/show_mem.c                   |  11 ++-
 mm/slub.c                       |  15 ++-
 mm/vmscan.c                     |   6 +-
 mm/zswap.c                      |  66 ++++++++++++-
 39 files changed, 532 insertions(+), 178 deletions(-)

-- 
2.51.1



* [PATCH v2 5/5] dax/kmem: add memory notifier to block external state changes
@ 2026-01-14 23:50 Gregory Price
  2026-01-15  2:42 ` [PATCH] dax/kmem: add build config for protected dax memory blocks Gregory Price
  0 siblings, 1 reply; 35+ messages in thread
From: Gregory Price @ 2026-01-14 23:50 UTC (permalink / raw)
  To: linux-mm
  Cc: linux-cxl, nvdimm, linux-kernel, virtualization, kernel-team,
	dan.j.williams, vishal.l.verma, dave.jiang, david, mst, jasowang,
	xuanzhuo, eperezma, osalvador, akpm

Add a memory notifier to prevent external operations from changing the
online/offline state of memory blocks managed by dax_kmem. This ensures
state changes only occur through the driver's hotplug sysfs interface,
providing consistent state tracking and preventing races with auto-online
policies or direct memory block sysfs manipulation.

The goal of this is to prevent `daxN.M/hotplug` from becoming
inconsistent with the state of the memory blocks it owns.

The notifier uses a transition protocol with memory barriers:
  - Before initiating a state change, set target_state then in_transition
  - Use barrier to ensure target_state is visible before in_transition
  - The notifier checks in_transition, then uses barrier before reading
    target_state to ensure proper ordering on weakly-ordered architectures
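
The release/acquire pairing above can be sketched with C11 atomics in
userspace (the patch itself uses smp_store_release()/smp_load_acquire();
the function names here are illustrative):

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>

static int target_state;
static _Atomic bool in_transition;

/* Writer: publish target_state first, then set in_transition with
 * release semantics so a reader that observes in_transition == true
 * cannot observe a stale target_state. */
static void start_transition(int target)
{
	target_state = target;
	atomic_store_explicit(&in_transition, true, memory_order_release);
}

/* Reader (the notifier): acquire-load in_transition; if set, the
 * release/acquire pair guarantees target_state is up to date. */
static bool read_transition(int *target)
{
	if (!atomic_load_explicit(&in_transition, memory_order_acquire))
		return false;
	*target = target_state;
	return true;
}
```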

The notifier callback:
  - Returns NOTIFY_DONE for non-overlapping memory (not our concern)
  - Returns NOTIFY_BAD if in_transition is false (block external ops)
  - Validates the memory event matches target_state (MEM_GOING_ONLINE
    for online operations, MEM_GOING_OFFLINE for offline/unplug)
  - Returns NOTIFY_OK only for driver-initiated operations with matching
    target_state

This prevents scenarios where:
  - Users manually change memory state via /sys/devices/system/memory/
  - Other kernel subsystems interfere with driver-managed memory state
    (may be important for regions trying to preserve hot-unpluggability)

Signed-off-by: Gregory Price <gourry@gourry.net>
---
 drivers/dax/kmem.c | 157 ++++++++++++++++++++++++++++++++++++++++++++-
 1 file changed, 154 insertions(+), 3 deletions(-)

diff --git a/drivers/dax/kmem.c b/drivers/dax/kmem.c
index c222ae9d675d..f3562f65376c 100644
--- a/drivers/dax/kmem.c
+++ b/drivers/dax/kmem.c
@@ -53,6 +53,9 @@ struct dax_kmem_data {
 	struct dev_dax *dev_dax;
 	int state;
 	struct mutex lock; /* protects hotplug state transitions */
+	bool in_transition;
+	int target_state;
+	struct notifier_block mem_nb;
 	struct resource *res[];
 };
 
@@ -71,6 +74,116 @@ static void kmem_put_memory_types(void)
 	mt_put_memory_types(&kmem_memory_types);
 }
 
+/**
+ * dax_kmem_start_transition - begin a driver-initiated state transition
+ * @data: the dax_kmem_data structure
+ * @target: the target state (MMOP_ONLINE, MMOP_ONLINE_MOVABLE, or MMOP_OFFLINE)
+ *
+ * Sets up state for a driver-initiated memory operation. The memory notifier
+ * will only allow operations that match this target state while in transition.
+ * Uses store-release to ensure target_state is visible before in_transition.
+ */
+static void dax_kmem_start_transition(struct dax_kmem_data *data, int target)
+{
+	data->target_state = target;
+	smp_store_release(&data->in_transition, true);
+}
+
+/**
+ * dax_kmem_end_transition - end a driver-initiated state transition
+ * @data: the dax_kmem_data structure
+ *
+ * Clears the in_transition flag after a state change completes or aborts.
+ */
+static void dax_kmem_end_transition(struct dax_kmem_data *data)
+{
+	WRITE_ONCE(data->in_transition, false);
+}
+
+/**
+ * dax_kmem_overlaps_range - check if a memory range overlaps with this device
+ * @data: the dax_kmem_data structure
+ * @start: start physical address of the range to check
+ * @size: size of the range to check
+ *
+ * Returns true if the range overlaps with any of the device's memory ranges.
+ */
+static bool dax_kmem_overlaps_range(struct dax_kmem_data *data,
+				    u64 start, u64 size)
+{
+	struct dev_dax *dev_dax = data->dev_dax;
+	int i;
+
+	for (i = 0; i < dev_dax->nr_range; i++) {
+		struct range range;
+		struct range check = DEFINE_RANGE(start, start + size - 1);
+
+		if (dax_kmem_range(dev_dax, i, &range))
+			continue;
+
+		if (!data->res[i])
+			continue;
+
+		if (range_overlaps(&range, &check))
+			return true;
+	}
+	return false;
+}
+
+/**
+ * dax_kmem_memory_notifier_cb - memory notifier callback for dax kmem
+ * @nb: the notifier block (embedded in dax_kmem_data)
+ * @action: the memory event (MEM_GOING_ONLINE, MEM_GOING_OFFLINE, etc.)
+ * @arg: pointer to memory_notify structure
+ *
+ * This callback prevents external operations (e.g., from sysfs or auto-online
+ * policies) on memory blocks managed by dax_kmem. Only operations initiated
+ * by the driver itself (via the hotplug sysfs interface) are allowed.
+ *
+ * Returns NOTIFY_OK to allow the operation, NOTIFY_BAD to block it,
+ * or NOTIFY_DONE if the memory doesn't belong to this device.
+ */
+static int dax_kmem_memory_notifier_cb(struct notifier_block *nb,
+				       unsigned long action, void *arg)
+{
+	struct dax_kmem_data *data = container_of(nb, struct dax_kmem_data,
+						  mem_nb);
+	struct memory_notify *mhp = arg;
+	const u64 start = PFN_PHYS(mhp->start_pfn);
+	const u64 size = PFN_PHYS(mhp->nr_pages);
+
+	/* Only interested in going online/offline events */
+	if (action != MEM_GOING_ONLINE && action != MEM_GOING_OFFLINE)
+		return NOTIFY_DONE;
+
+	/* Check if this memory belongs to our device */
+	if (!dax_kmem_overlaps_range(data, start, size))
+		return NOTIFY_DONE;
+
+	/*
+	 * Block all operations unless we're in a driver-initiated transition.
+	 * When in_transition is set, only allow operations that match our
+	 * target_state to prevent races with external operations.
+	 *
+	 * Use load-acquire to pair with the store-release in
+	 * dax_kmem_start_transition(), ensuring target_state is visible.
+	 */
+	if (!smp_load_acquire(&data->in_transition))
+		return NOTIFY_BAD;
+
+	/* Online operations expect MEM_GOING_ONLINE */
+	if (action == MEM_GOING_ONLINE &&
+	    (data->target_state == MMOP_ONLINE ||
+	     data->target_state == MMOP_ONLINE_MOVABLE))
+		return NOTIFY_OK;
+
+	/* Offline/hotremove operations expect MEM_GOING_OFFLINE */
+	if (action == MEM_GOING_OFFLINE && data->target_state == MMOP_OFFLINE)
+		return NOTIFY_OK;
+
+	return NOTIFY_BAD;
+}
+
 /**
  * dax_kmem_do_hotplug - hotplug memory for dax kmem device
  * @dev_dax: the dev_dax instance
@@ -325,11 +438,27 @@ static ssize_t hotplug_store(struct device *dev, struct device_attribute *attr,
 	if (data->state == online_type)
 		return len;
 
+	/*
+	 * Start transition with target_state for the notifier.
+	 * For unplug, use MMOP_OFFLINE since memory goes offline before removal.
+	 */
+	if (online_type == DAX_KMEM_UNPLUGGED || online_type == MMOP_OFFLINE)
+		dax_kmem_start_transition(data, MMOP_OFFLINE);
+	else
+		dax_kmem_start_transition(data, online_type);
+
 	if (online_type == DAX_KMEM_UNPLUGGED) {
+		int expected = 0;
+
+		for (rc = 0; rc < dev_dax->nr_range; rc++)
+			if (data->res[rc])
+				expected++;
+
 		rc = dax_kmem_do_hotremove(dev_dax, data);
-		if (rc < 0) {
+		dax_kmem_end_transition(data);
+		if (rc < expected) {
 			dev_warn(dev, "hotplug state is inconsistent\n");
-			return rc;
+			return rc == 0 ? -EBUSY : -EIO;
 		}
 		data->state = DAX_KMEM_UNPLUGGED;
 		return len;
@@ -339,10 +468,14 @@ static ssize_t hotplug_store(struct device *dev, struct device_attribute *attr,
 	 * online_type is MMOP_ONLINE or MMOP_ONLINE_MOVABLE
 	 * Cannot switch between online types without unplugging first
 	 */
-	if (data->state == MMOP_ONLINE || data->state == MMOP_ONLINE_MOVABLE)
+	if (data->state == MMOP_ONLINE || data->state == MMOP_ONLINE_MOVABLE) {
+		dax_kmem_end_transition(data);
 		return -EBUSY;
+	}
 
 	rc = dax_kmem_do_hotplug(dev_dax, data, online_type);
+	dax_kmem_end_transition(data);
+
 	if (rc < 0)
 		return rc;
 
@@ -430,13 +563,26 @@ static int dev_dax_kmem_probe(struct dev_dax *dev_dax)
 	if (rc < 0)
 		goto err_resources;
 
+	/* Register memory notifier to block external operations */
+	data->mem_nb.notifier_call = dax_kmem_memory_notifier_cb;
+	rc = register_memory_notifier(&data->mem_nb);
+	if (rc) {
+		dev_warn(dev, "failed to register memory notifier\n");
+		goto err_notifier;
+	}
+
 	/*
 	 * Hotplug using the system default policy - this preserves backwards
 	 * for existing users who rely on the default auto-online behavior.
+	 *
+	 * Start transition with resolved system default since the notifier
+	 * validates the operation type matches.
 	 */
 	online_type = mhp_get_default_online_type();
 	if (online_type != MMOP_OFFLINE) {
+		dax_kmem_start_transition(data, online_type);
 		rc = dax_kmem_do_hotplug(dev_dax, data, online_type);
+		dax_kmem_end_transition(data);
 		if (rc < 0)
 			goto err_hotplug;
 		data->state = online_type;
@@ -449,6 +595,8 @@ static int dev_dax_kmem_probe(struct dev_dax *dev_dax)
 	return 0;
 
 err_hotplug:
+	unregister_memory_notifier(&data->mem_nb);
+err_notifier:
 	dax_kmem_cleanup_resources(dev_dax, data);
 err_resources:
 	dev_set_drvdata(dev, NULL);
@@ -471,6 +619,7 @@ static void dev_dax_kmem_remove(struct dev_dax *dev_dax)
 
 	device_remove_file(dev, &dev_attr_hotplug);
 	dax_kmem_cleanup_resources(dev_dax, data);
+	unregister_memory_notifier(&data->mem_nb);
 	memory_group_unregister(data->mgid);
 	kfree(data->res_name);
 	kfree(data);
@@ -488,8 +637,10 @@ static void dev_dax_kmem_remove(struct dev_dax *dev_dax)
 static void dev_dax_kmem_remove(struct dev_dax *dev_dax)
 {
 	struct device *dev = &dev_dax->dev;
+	struct dax_kmem_data *data = dev_get_drvdata(dev);
 
 	device_remove_file(dev, &dev_attr_hotplug);
+	unregister_memory_notifier(&data->mem_nb);
 
 	/*
 	 * Without hotremove purposely leak the request_mem_region() for the
-- 
2.52.0




Thread overview: 35+ messages
2025-11-12 19:29 [RFC LPC2026 PATCH v2 00/11] Specific Purpose Memory NUMA Nodes Gregory Price
2025-11-12 19:29 ` [RFC PATCH v2 01/11] mm: constify oom_control, scan_control, and alloc_context nodemask Gregory Price
2025-12-15  6:11   ` Balbir Singh
2025-11-12 19:29 ` [RFC PATCH v2 02/11] mm: change callers of __cpuset_zone_allowed to cpuset_zone_allowed Gregory Price
2025-12-15  6:14   ` Balbir Singh
2025-12-15 12:38     ` Gregory Price
2025-11-12 19:29 ` [RFC PATCH v2 03/11] gfp: Add GFP_SPM_NODE for Specific Purpose Memory (SPM) allocations Gregory Price
2025-11-12 19:29 ` [RFC PATCH v2 04/11] memory-tiers: Introduce SysRAM and Specific Purpose Memory Nodes Gregory Price
2025-11-12 19:29 ` [RFC PATCH v2 05/11] mm: restrict slub, oom, compaction, and page_alloc to sysram by default Gregory Price
2025-11-12 19:29 ` [RFC PATCH v2 06/11] mm,cpusets: rename task->mems_allowed to task->sysram_nodes Gregory Price
2025-11-12 19:29 ` [RFC PATCH v2 07/11] cpuset: introduce cpuset.mems.sysram Gregory Price
2025-11-12 19:29 ` [RFC PATCH v2 08/11] mm/memory_hotplug: add MHP_SPM_NODE flag Gregory Price
2025-11-13 14:58   ` [PATCH] memory-tiers: multi-definition fixup Gregory Price
2025-11-13 16:37     ` kernel test robot
2026-01-15  2:38     ` [PATCH] dax/kmem: add build config for protected dax memory blocks Gregory Price
2025-11-12 19:29 ` [RFC PATCH v2 09/11] drivers/dax: add spm_node bit to dev_dax Gregory Price
2025-11-12 19:29 ` [RFC PATCH v2 10/11] drivers/cxl: add spm_node bit to cxl region Gregory Price
2025-11-12 19:29 ` [RFC PATCH v2 11/11] [HACK] mm/zswap: compressed ram integration example Gregory Price
2025-11-18  7:02 ` [RFC LPC2026 PATCH v2 00/11] Specific Purpose Memory NUMA Nodes Alistair Popple
2025-11-18 10:36   ` Gregory Price
2025-11-21 21:07   ` Gregory Price
2025-11-23 23:09     ` Alistair Popple
2025-11-24 15:28       ` Gregory Price
2025-11-27  5:03         ` Alistair Popple
2025-11-24  9:19 ` David Hildenbrand (Red Hat)
2025-11-24 18:06   ` Gregory Price
2025-12-10 23:29     ` Yiannis Nikolakopoulos
2025-11-25 14:09 ` Kiryl Shutsemau
2025-11-25 15:05   ` Gregory Price
2025-11-27  5:12     ` Alistair Popple
2025-11-26  3:23 ` Balbir Singh
2025-11-26  8:29   ` Gregory Price
2025-12-03  4:36     ` Balbir Singh
2025-12-03  5:25       ` Gregory Price
2026-01-14 23:50 [PATCH v2 5/5] dax/kmem: add memory notifier to block external state changes Gregory Price
2026-01-15  2:42 ` [PATCH] dax/kmem: add build config for protected dax memory blocks Gregory Price
