linux-mm.kvack.org archive mirror
* [RFC PATCH v3 0/8] mm,numa: N_PRIVATE node isolation for device-managed memory
@ 2026-01-08 20:37 Gregory Price
  2026-01-08 20:37 ` [RFC PATCH v3 1/8] numa,memory_hotplug: create N_PRIVATE (Private Nodes) Gregory Price
                   ` (7 more replies)
  0 siblings, 8 replies; 12+ messages in thread
From: Gregory Price @ 2026-01-08 20:37 UTC (permalink / raw)
  To: linux-mm, cgroups, linux-cxl
  Cc: linux-doc, linux-kernel, linux-fsdevel, kernel-team, longman, tj,
	hannes, mkoutny, corbet, gregkh, rafael, dakr, dave,
	jonathan.cameron, dave.jiang, alison.schofield, vishal.l.verma,
	ira.weiny, dan.j.williams, akpm, vbabka, surenb, mhocko,
	jackmanb, ziy, david, lorenzo.stoakes, Liam.Howlett, rppt,
	axelrasmussen, yuanchu, weixugc, yury.norov, linux, rientjes,
	shakeel.butt, chrisl, kasong, shikemeng, nphamcs, bhe, baohua,
	yosry.ahmed, chengming.zhou, roman.gushchin, muchun.song,
	osalvador, matthew.brost, joshua.hahnjy, rakie.kim, byungchul,
	gourry, ying.huang, apopple, cl, harry.yoo, zhengqi.arch

This series introduces N_PRIVATE, a new node state for memory nodes 
whose memory is not intended for general system consumption.  Today,
device drivers (CXL, accelerators, etc.) hotplug their memory to access
mm/ services like page allocation and reclaim, but this exposes general
workloads to memory with different characteristics and reliability
guarantees than system RAM.

N_PRIVATE provides isolation by default while enabling explicit access
via __GFP_THISNODE for subsystems that understand how to manage these
specialized memory regions.

Motivation
==========

Several emerging memory technologies require kernel memory management
services but should not be used for general allocations:

  - CXL Compressed RAM (CRAM): Hardware-compressed memory where the
    effective capacity depends on data compressibility.  Uncontrolled
    use risks capacity exhaustion when compression ratios degrade.

  - Accelerator Memory: GPU/TPU-attached memory optimized for specific
    access patterns that are not intended for general allocation.

  - Tiered Memory: Memory intended only as a demotion target, not for
    initial allocations.

Currently, these devices either avoid hotplugging entirely (losing mm/
services) or hotplug as regular N_MEMORY (risking reliability issues).
N_PRIVATE solves this by creating an isolated node class.

Design
======

The series introduces:

  1. N_PRIVATE node state (mutually exclusive with N_MEMORY)
  2. private_memtype enum for policy-based access control
  3. cpuset.mems.sysram for user-visible isolation
  4. Integration points for subsystems (zswap demonstrated)
  5. A cxl private_region example to demonstrate full plumbing

Private Memory Types (private_memtype)
======================================

The private_memtype enum defines policy bits that control how different
kernel subsystems may access private nodes:

  enum private_memtype {
      NODE_MEM_NOTYPE,      /* No type assigned (invalid state) */
      NODE_MEM_ZSWAP,       /* Swap compression target */
      NODE_MEM_COMPRESSED,  /* General compressed RAM */
      NODE_MEM_ACCELERATOR, /* Accelerator-attached memory */
      NODE_MEM_DEMOTE_ONLY, /* Memory-tier demotion target only */
      NODE_MAX_MEMTYPE,
  };

These types serve as policy hints for subsystems:

NODE_MEM_ZSWAP
--------------
Nodes with this type are registered as zswap compression targets.  When
zswap compresses a page, it can allocate directly from ZSWAP-typed nodes
using __GFP_THISNODE, bypassing software compression if the device
provides hardware compression.

Example flow:
  1. CXL device creates private_region with type=zswap
  2. Driver calls node_register_private() with NODE_MEM_ZSWAP
  3. zswap_add_direct_node() registers the node as a compression target
  4. On swap-out, zswap allocates from the private node
  5. page_allocated() callback validates compression ratio headroom
  6. page_freed() callback zeros pages to improve device compression

Prototype Note:
  This patch set does not implement compression ratio validation, as
  that requires an actual device to provide some kind of counter and/or
  interrupt indicating when allocations are safe.  The callbacks are
  left as stubs with TODOs for device vendors to pick up the next step
  (we'll continue with a QEMU example if reception is positive).

  For now, validation always succeeds because compressed capacity
  equals real capacity.
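
As a rough sketch of steps 2 and 3 above, a driver-side registration
could look like the following (the cram_* names are hypothetical;
node_register_private() and struct private_node_ops are the patch 1
interfaces):

  static int cram_page_allocated(struct page *page, void *data)
  {
      /* TODO: consult device counters for compression-ratio headroom */
      return 0;                   /* prototype: always safe to use */
  }

  static void cram_page_freed(struct page *page, void *data)
  {
      /* zero freed pages so the device recovers headroom (step 6) */
      clear_highpage(page);
  }

  static struct private_node_ops cram_ops = {
      .memtype        = NODE_MEM_ZSWAP,
      .page_allocated = cram_page_allocated,
      .page_freed     = cram_page_freed,
  };

  static int cram_node_bringup(int nid, resource_size_t start,
                               resource_size_t end)
  {
      cram_ops.res_start = start;
      cram_ops.res_end   = end;   /* inclusive */

      /* must happen before the region's memory is hotplugged onto nid */
      return node_register_private(nid, &cram_ops);
  }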

NODE_MEM_COMPRESSED (CRAM)
--------------------------
For general compressed RAM devices.  Unlike ZSWAP nodes, CRAM nodes
could be exposed to subsystems that understand compression semantics:

  - vmscan: Could prefer demoting pages to CRAM nodes before swap
  - memory-tiering: Could place CRAM between DRAM and persistent memory
  - zram: Could use as backing store instead of or alongside zswap

Such a component (mm/cram.c) would differ from zswap or zram by allowing
the compressed pages to remain mapped Read-Only in the page table.

NODE_MEM_ACCELERATOR
--------------------
For GPU/TPU/accelerator-attached memory.  Policy implications:

  - Default allocations: Never (isolated from general page_alloc)
  - GPU drivers: Explicit allocation via __GFP_THISNODE
  - NUMA balancing: Excluded from automatic migration
  - Memory tiering: Not a demotion target

Some GPU vendors want their memory managed via NUMA nodes, but do not
want fallback or migration allocations landing on it.  This enables
that pattern.

mm/mempolicy.c could be extended to allow N_PRIVATE nodes of this type
if the intent is per-vma access to accelerator memory (e.g. via mbind),
but this is omitted from this series for now to limit userland exposure
until first-class examples are provided.

NODE_MEM_DEMOTE_ONLY
--------------------
For memory intended exclusively as a demotion target in memory tiering:

  - page_alloc: Never allocates initially (slab, page faults, etc.)
  - vmscan/reclaim: Valid demotion target during memory pressure
  - memory-tiering: Allow hotness monitoring/promotion for this region

This enables "cold storage" tiers using slower/cheaper memory (CXL-
attached DRAM, persistent memory in volatile mode) without the memory
appearing in allocation fast paths.

This has the added benefit of restricting memory placement on these
nodes to movable allocations only (with all the normal caveats around
page pinning).

Subsystem Integration Points
============================

The private_node_ops structure provides callbacks for integration:

  struct private_node_ops {
      struct list_head list;
      resource_size_t res_start;
      resource_size_t res_end;
      enum private_memtype memtype;
      int (*page_allocated)(struct page *page, void *data);
      void (*page_freed)(struct page *page, void *data);
      void *data;
  };

page_allocated(): Called after allocation, returns 0 to accept or
-ENOSPC/-ENODEV to reject (caller retries elsewhere).  Enables:
  - Compression ratio enforcement for CRAM/zswap
  - Capacity tracking for accelerator memory
  - Rate limiting for demotion targets

page_freed(): Called on free, enables:
  - Zeroing for compression ratio recovery
  - Capacity accounting updates
  - Device-specific cleanup
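
As an illustration of the capacity-tracking case, a driver could
implement the callback pair roughly as below (the accel_* names and
the page budget passed via @data are made up for this example):

  static atomic_long_t accel_pages_in_use = ATOMIC_LONG_INIT(0);

  static int accel_page_allocated(struct page *page, void *data)
  {
      long budget = (long)(uintptr_t)data;  /* page budget from driver */

      if (atomic_long_inc_return(&accel_pages_in_use) > budget) {
          atomic_long_dec(&accel_pages_in_use);
          return -ENOSPC;   /* caller frees the page, retries elsewhere */
      }
      return 0;
  }

  static void accel_page_freed(struct page *page, void *data)
  {
      atomic_long_dec(&accel_pages_in_use);
  }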

Isolation Enforcement
=====================

The series modifies core allocators to respect N_PRIVATE isolation:

  - page_alloc: Constrains zone iteration to cpuset.mems.sysram
  - slub: Allocates only from N_MEMORY nodes
  - compaction: Skips N_PRIVATE nodes
  - mempolicy: Uses sysram_nodes for policy evaluation

__GFP_THISNODE bypasses isolation, enabling explicit access:

  page = alloc_pages_node(private_nid, GFP_KERNEL | __GFP_THISNODE, 0);

This pattern is used by zswap, and would be used by other subsystems
that explicitly opt into private node access.
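
A consumer that opts in could pair that allocation with the patch 1
callbacks roughly as follows (the helper names are illustrative; in
this series the consuming subsystem is expected to invoke the
callbacks itself):

  static struct page *private_alloc_validated(int private_nid)
  {
      struct page *page;

      page = alloc_pages_node(private_nid,
                              GFP_KERNEL | __GFP_THISNODE, 0);
      if (!page)
          return NULL;

      /* let the driver reject the page (e.g. no compression headroom) */
      if (node_private_allocated(page)) {
          __free_pages(page, 0);
          return NULL;          /* caller falls back to another node */
      }
      return page;
  }

  static void private_free_page(struct page *page)
  {
      node_private_freed(page);
      __free_pages(page, 0);
  }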

User-Visible Changes
====================

cpuset gains cpuset.mems.sysram (read-only), which shows the
intersection of effective_mems and node_states[N_MEMORY].

ABI: /proc/<pid>/status Mems_allowed shows sysram nodes only.

Drivers create private regions via sysfs:
  echo region0 > /sys/bus/cxl/.../create_private_region
  echo zswap > /sys/bus/cxl/.../region0/private_type
  echo 1 > /sys/bus/cxl/.../region0/commit

Series Organization
===================

Patch 1: numa,memory_hotplug: create N_PRIVATE (Private Nodes)
  Core infrastructure: N_PRIVATE node state, node_register_private(),
  private_memtype enum, and private_node_ops registration.

Patch 2: mm: constify oom_control, scan_control, and alloc_context 
nodemask
  Preparatory cleanup for enforcing that nodemasks don't change.

Patch 3: mm: restrict slub, compaction, and page_alloc to sysram
  Enforce N_MEMORY-only allocation for general paths.

Patch 4: cpuset: introduce cpuset.mems.sysram
  User-visible isolation via cpuset interface.

Patch 5: Documentation/admin-guide/cgroups: update docs for mems_allowed
  Document the new behavior and sysram_nodes.

Patch 6: drivers/cxl/core/region: add private_region
  CXL infrastructure for private regions.

Patch 7: mm/zswap: compressed ram direct integration
  Zswap integration demonstrating direct hardware compression.

Patch 8: drivers/cxl: add zswap private_region type
  Complete example: CXL region as zswap compression target.

Future Work
===========

This series provides the foundation.  Planned follow-ups include:

  - CRAM integration with vmscan for smart demotion
  - ACCELERATOR type for GPU memory management
  - Memory-tiering integration with DEMOTE_ONLY nodes

Testing
=======

All patches build cleanly.  Tested with:
  - CXL QEMU emulation with private regions
  - Zswap stress tests with private compression targets
  - Cpuset verification of mems.sysram isolation


Gregory Price (8):
  numa,memory_hotplug: create N_PRIVATE (Private Nodes)
  mm: constify oom_control, scan_control, and alloc_context nodemask
  mm: restrict slub, compaction, and page_alloc to sysram
  cpuset: introduce cpuset.mems.sysram
  Documentation/admin-guide/cgroups: update docs for mems_allowed
  drivers/cxl/core/region: add private_region
  mm/zswap: compressed ram direct integration
  drivers/cxl: add zswap private_region type

 .../admin-guide/cgroup-v1/cpusets.rst         |  19 +-
 Documentation/admin-guide/cgroup-v2.rst       |  26 ++-
 Documentation/filesystems/proc.rst            |   2 +-
 drivers/base/node.c                           | 199 ++++++++++++++++++
 drivers/cxl/core/Makefile                     |   1 +
 drivers/cxl/core/core.h                       |   4 +
 drivers/cxl/core/port.c                       |   4 +
 drivers/cxl/core/private_region/Makefile      |  12 ++
 .../cxl/core/private_region/private_region.c  | 129 ++++++++++++
 .../cxl/core/private_region/private_region.h  |  14 ++
 drivers/cxl/core/private_region/zswap.c       | 127 +++++++++++
 drivers/cxl/core/region.c                     |  63 +++++-
 drivers/cxl/cxl.h                             |  22 ++
 include/linux/cpuset.h                        |  24 ++-
 include/linux/gfp.h                           |   6 +
 include/linux/mm.h                            |   4 +-
 include/linux/mmzone.h                        |   6 +-
 include/linux/node.h                          |  60 ++++++
 include/linux/nodemask.h                      |   1 +
 include/linux/oom.h                           |   2 +-
 include/linux/swap.h                          |   2 +-
 include/linux/zswap.h                         |   5 +
 kernel/cgroup/cpuset-internal.h               |   8 +
 kernel/cgroup/cpuset-v1.c                     |   8 +
 kernel/cgroup/cpuset.c                        |  98 ++++++---
 mm/compaction.c                               |   6 +-
 mm/internal.h                                 |   2 +-
 mm/memcontrol.c                               |   2 +-
 mm/memory_hotplug.c                           |   2 +-
 mm/mempolicy.c                                |   6 +-
 mm/migrate.c                                  |   4 +-
 mm/mmzone.c                                   |   5 +-
 mm/page_alloc.c                               |  31 +--
 mm/show_mem.c                                 |   9 +-
 mm/slub.c                                     |   8 +-
 mm/vmscan.c                                   |   6 +-
 mm/zswap.c                                    | 106 +++++++++-
 37 files changed, 942 insertions(+), 91 deletions(-)
 create mode 100644 drivers/cxl/core/private_region/Makefile
 create mode 100644 drivers/cxl/core/private_region/private_region.c
 create mode 100644 drivers/cxl/core/private_region/private_region.h
 create mode 100644 drivers/cxl/core/private_region/zswap.c
---
base-commit: 803dd4b1159cf9864be17aab8a17653e6ecbbbb6

-- 
2.52.0



* [RFC PATCH v3 1/8] numa,memory_hotplug: create N_PRIVATE (Private Nodes)
  2026-01-08 20:37 [RFC PATCH v3 0/8] mm,numa: N_PRIVATE node isolation for device-managed memory Gregory Price
@ 2026-01-08 20:37 ` Gregory Price
  2026-01-08 20:37 ` [RFC PATCH v3 2/8] mm: constify oom_control, scan_control, and alloc_context nodemask Gregory Price
                   ` (6 subsequent siblings)
  7 siblings, 0 replies; 12+ messages in thread
From: Gregory Price @ 2026-01-08 20:37 UTC (permalink / raw)
  To: linux-mm, cgroups, linux-cxl
  Cc: linux-doc, linux-kernel, linux-fsdevel, kernel-team, longman, tj,
	hannes, mkoutny, corbet, gregkh, rafael, dakr, dave,
	jonathan.cameron, dave.jiang, alison.schofield, vishal.l.verma,
	ira.weiny, dan.j.williams, akpm, vbabka, surenb, mhocko,
	jackmanb, ziy, david, lorenzo.stoakes, Liam.Howlett, rppt,
	axelrasmussen, yuanchu, weixugc, yury.norov, linux, rientjes,
	shakeel.butt, chrisl, kasong, shikemeng, nphamcs, bhe, baohua,
	yosry.ahmed, chengming.zhou, roman.gushchin, muchun.song,
	osalvador, matthew.brost, joshua.hahnjy, rakie.kim, byungchul,
	gourry, ying.huang, apopple, cl, harry.yoo, zhengqi.arch,
	Balbir Singh

N_MEMORY nodes are intended to contain general System RAM.  Today, some
device drivers hotplug their memory (marked Specific Purpose or Reserved)
to get access to mm/ services, but don't intend it for general consumption.

This creates reliability issues as there are no isolation guarantees.

Create N_PRIVATE for memory nodes whose memory is not intended for
general consumption.  This state is mutually exclusive with N_MEMORY.

This will allow existing service code (like page_alloc.c) to manage
N_PRIVATE nodes without exposing N_MEMORY users to that memory.

Add `node_register_private()` for device drivers to call to mark a
node as private prior to hotplugging memory.  This fails if the node
already has N_MEMORY set.

Private nodes must have a memory type so that multiple drivers trying
to online private memory onto the same node are warned when a type
conflict occurs.

Suggested-by: David Hildenbrand <david@kernel.org>
Suggested-by: Balbir Singh <bsingharora@gmail.com>
Signed-off-by: Gregory Price <gourry@gourry.net>
---
 drivers/base/node.c      | 199 +++++++++++++++++++++++++++++++++++++++
 include/linux/node.h     |  60 ++++++++++++
 include/linux/nodemask.h |   1 +
 mm/memory_hotplug.c      |   2 +-
 4 files changed, 261 insertions(+), 1 deletion(-)

diff --git a/drivers/base/node.c b/drivers/base/node.c
index 00cf4532f121..b503782ea109 100644
--- a/drivers/base/node.c
+++ b/drivers/base/node.c
@@ -861,6 +861,193 @@ void register_memory_blocks_under_node_hotplug(int nid, unsigned long start_pfn,
 			   (void *)&nid, register_mem_block_under_node_hotplug);
 	return;
 }
+
+static enum private_memtype *private_nodes;
+/* Per-node list of private node operations callbacks */
+static struct list_head private_node_ops_list[MAX_NUMNODES];
+static DEFINE_MUTEX(private_node_ops_lock);
+static bool private_node_ops_initialized;
+
+/*
+ * Note: private_node_ops_list is initialized in node_dev_init() before
+ * any calls to node_register_private() can occur.
+ */
+
+/**
+ * node_register_private - Mark a node as private and register ops
+ * @nid: Node identifier
+ * @ops: Callback operations structure (required, but callbacks may be NULL)
+ *
+ * Mark a node as private and register the given ops structure. The ops
+ * structure must have res_start and res_end set to the physical address
+ * range covered by this registration, and memtype set to the private
+ * memory type. Multiple registrations for the same node are allowed as
+ * long as they have the same memtype.
+ *
+ * Returns 0 on success, negative error code on failure.
+ */
+int node_register_private(int nid, struct private_node_ops *ops)
+{
+	int rc = 0;
+	enum private_memtype ctype;
+	enum private_memtype type;
+
+	if (!ops)
+		return -EINVAL;
+
+	type = ops->memtype;
+
+	if (!node_possible(nid) || !private_nodes || type >= NODE_MAX_MEMTYPE)
+		return -EINVAL;
+
+	/* Validate resource bounds */
+	if (ops->res_start > ops->res_end)
+		return -EINVAL;
+
+	mutex_lock(&private_node_ops_lock);
+
+	/* hotplug lock must be held while checking online/node state */
+	mem_hotplug_begin();
+
+	/*
+	 * N_PRIVATE and N_MEMORY are mutually exclusive. Fail if the node
+	 * already has N_MEMORY set, regardless of online state.
+	 */
+	if (node_state(nid, N_MEMORY)) {
+		rc = -EBUSY;
+		goto out;
+	}
+
+	ctype = private_nodes[nid];
+	if (ctype > NODE_MEM_NOTYPE && ctype != type) {
+		rc = -EINVAL;
+		goto out;
+	}
+
+	/* Initialize the ops list entry and add to the node's list */
+	INIT_LIST_HEAD(&ops->list);
+	list_add_tail_rcu(&ops->list, &private_node_ops_list[nid]);
+
+	private_nodes[nid] = type;
+	node_set_state(nid, N_PRIVATE);
+out:
+	mem_hotplug_done();
+	mutex_unlock(&private_node_ops_lock);
+	return rc;
+}
+EXPORT_SYMBOL_GPL(node_register_private);
+
+/**
+ * node_unregister_private - Unregister ops and potentially unmark node as private
+ * @nid: Node identifier
+ * @ops: Callback operations structure to remove
+ *
+ * Remove the given ops structure from the node's ops list. If this is
+ * the last ops structure for the node and the node is offline, the
+ * node is unmarked as private.
+ */
+void node_unregister_private(int nid, struct private_node_ops *ops)
+{
+	if (!node_possible(nid) || !private_nodes || !ops)
+		return;
+
+	mutex_lock(&private_node_ops_lock);
+	mem_hotplug_begin();
+
+	list_del_rcu(&ops->list);
+	/* If list is now empty, clear private state */
+	if (list_empty(&private_node_ops_list[nid])) {
+		private_nodes[nid] = NODE_MEM_NOTYPE;
+		node_clear_state(nid, N_PRIVATE);
+	}
+
+	mem_hotplug_done();
+	mutex_unlock(&private_node_ops_lock);
+	synchronize_rcu();
+}
+EXPORT_SYMBOL_GPL(node_unregister_private);
+
+/**
+ * node_private_allocated - Validate a page allocation from a private node
+ * @page: The allocated page
+ *
+ * Find the ops structure whose region contains the page's physical address
+ * and call its page_allocated callback if one is registered.
+ *
+ * Returns:
+ *   0 if the callback succeeds or no callback is registered for this region
+ *   -ENXIO if the page is not found in any registered region
+ *   Other negative error code if the callback indicates the page is not safe
+ */
+int node_private_allocated(struct page *page)
+{
+	struct private_node_ops *ops;
+	phys_addr_t page_phys;
+	int nid = page_to_nid(page);
+	int ret = -ENXIO;
+
+	if (!node_possible(nid) || nid >= MAX_NUMNODES)
+		return -ENXIO;
+
+	if (!private_node_ops_initialized)
+		return -ENXIO;
+
+	page_phys = page_to_phys(page);
+
+	/*
+	 * Use RCU to safely traverse the list without holding locks.
+	 * Writers use list_add_tail_rcu/list_del_rcu with synchronize_rcu()
+	 * to ensure safe concurrent access.
+	 */
+	rcu_read_lock();
+	list_for_each_entry_rcu(ops, &private_node_ops_list[nid], list) {
+		if (page_phys >= ops->res_start && page_phys <= ops->res_end) {
+			if (ops->page_allocated)
+				ret = ops->page_allocated(page, ops->data);
+			else
+				ret = 0;
+			break;
+		}
+	}
+	rcu_read_unlock();
+
+	return ret;
+}
+EXPORT_SYMBOL_GPL(node_private_allocated);
+
+/**
+ * node_private_freed - Notify that a page from a private node is being freed
+ * @page: The page being freed
+ *
+ * Find the ops structure whose region contains the page's physical address
+ * and call its page_freed callback if one is registered.
+ */
+void node_private_freed(struct page *page)
+{
+	struct private_node_ops *ops;
+	phys_addr_t page_phys;
+	int nid = page_to_nid(page);
+
+	if (!node_possible(nid) || nid >= MAX_NUMNODES)
+		return;
+
+	if (!private_node_ops_initialized)
+		return;
+
+	page_phys = page_to_phys(page);
+
+	rcu_read_lock();
+	list_for_each_entry_rcu(ops, &private_node_ops_list[nid], list) {
+		if (page_phys >= ops->res_start && page_phys <= ops->res_end) {
+			if (ops->page_freed)
+				ops->page_freed(page, ops->data);
+			break;
+		}
+	}
+	rcu_read_unlock();
+}
+EXPORT_SYMBOL_GPL(node_private_freed);
+
 #endif /* CONFIG_MEMORY_HOTPLUG */
 
 /**
@@ -959,6 +1146,7 @@ static struct node_attr node_state_attr[] = {
 	[N_HIGH_MEMORY] = _NODE_ATTR(has_high_memory, N_HIGH_MEMORY),
 #endif
 	[N_MEMORY] = _NODE_ATTR(has_memory, N_MEMORY),
+	[N_PRIVATE] = _NODE_ATTR(has_private_memory, N_PRIVATE),
 	[N_CPU] = _NODE_ATTR(has_cpu, N_CPU),
 	[N_GENERIC_INITIATOR] = _NODE_ATTR(has_generic_initiator,
 					   N_GENERIC_INITIATOR),
@@ -972,6 +1160,7 @@ static struct attribute *node_state_attrs[] = {
 	&node_state_attr[N_HIGH_MEMORY].attr.attr,
 #endif
 	&node_state_attr[N_MEMORY].attr.attr,
+	&node_state_attr[N_PRIVATE].attr.attr,
 	&node_state_attr[N_CPU].attr.attr,
 	&node_state_attr[N_GENERIC_INITIATOR].attr.attr,
 	NULL
@@ -1007,5 +1196,15 @@ void __init node_dev_init(void)
 			panic("%s() failed to add node: %d\n", __func__, ret);
 	}
 
+	private_nodes = kzalloc(sizeof(enum private_memtype) * MAX_NUMNODES,
+				GFP_KERNEL);
+	if (!private_nodes)
+		pr_warn("Failed to allocate private_nodes, private node support disabled\n");
+
+	/* Initialize private node ops lists */
+	for (i = 0; i < MAX_NUMNODES; i++)
+		INIT_LIST_HEAD(&private_node_ops_list[i]);
+	private_node_ops_initialized = true;
+
 	register_memory_blocks_under_nodes();
 }
diff --git a/include/linux/node.h b/include/linux/node.h
index 0269b064ba65..53a9fb63b60e 100644
--- a/include/linux/node.h
+++ b/include/linux/node.h
@@ -62,6 +62,47 @@ enum cache_mode {
 	NODE_CACHE_ADDR_MODE_EXTENDED_LINEAR,
 };
 
+enum private_memtype {
+	NODE_MEM_NOTYPE,
+	NODE_MEM_ZSWAP,
+	NODE_MEM_COMPRESSED,
+	NODE_MEM_ACCELERATOR,
+	NODE_MEM_DEMOTE_ONLY,
+	NODE_MAX_MEMTYPE,
+};
+
+/**
+ * struct private_node_ops - Callbacks for private node operations
+ * @list: List node for per-node ops list
+ * @res_start: Start physical address of the memory region
+ * @res_end: End physical address of the memory region (inclusive)
+ * @memtype: Private node memory type for this region
+ * @page_allocated: Called after a page is allocated from this region
+ *                  to validate that the page is safe to use. Returns 0
+ *                  on success, negative error code on failure. If this
+ *                  returns an error, the caller should free the page
+ *                  and try another node. May be NULL if no validation
+ *                  is needed.
+ * @page_freed: Called when a page from this region is being freed.
+ *              Allows the driver to update its internal tracking.
+ *              May be NULL if no notification is needed.
+ * @data: Driver-private data passed to callbacks
+ *
+ * Multiple drivers may register ops for a single private node. Each
+ * registration covers a specific physical memory region. When a page
+ * is allocated, the appropriate ops structure is found by matching
+ * the page's physical address against the registered regions.
+ */
+struct private_node_ops {
+	struct list_head list;
+	resource_size_t res_start;
+	resource_size_t res_end;
+	enum private_memtype memtype;
+	int (*page_allocated)(struct page *page, void *data);
+	void (*page_freed)(struct page *page, void *data);
+	void *data;
+};
+
 /**
  * struct node_cache_attrs - system memory caching attributes
  *
@@ -121,6 +162,10 @@ extern struct node *node_devices[];
 #if defined(CONFIG_MEMORY_HOTPLUG) && defined(CONFIG_NUMA)
 void register_memory_blocks_under_node_hotplug(int nid, unsigned long start_pfn,
 					       unsigned long end_pfn);
+int node_register_private(int nid, struct private_node_ops *ops);
+void node_unregister_private(int nid, struct private_node_ops *ops);
+int node_private_allocated(struct page *page);
+void node_private_freed(struct page *page);
 #else
 static inline void register_memory_blocks_under_node_hotplug(int nid,
 							     unsigned long start_pfn,
@@ -130,6 +175,21 @@ static inline void register_memory_blocks_under_node_hotplug(int nid,
 static inline void register_memory_blocks_under_nodes(void)
 {
 }
+static inline int node_register_private(int nid, struct private_node_ops *ops)
+{
+	return -ENODEV;
+}
+static inline void node_unregister_private(int nid,
+					   struct private_node_ops *ops)
+{
+}
+static inline int node_private_allocated(struct page *page)
+{
+	return -ENXIO;
+}
+static inline void node_private_freed(struct page *page)
+{
+}
 #endif
 
 struct node_notify {
diff --git a/include/linux/nodemask.h b/include/linux/nodemask.h
index bd38648c998d..dac250c6f1a9 100644
--- a/include/linux/nodemask.h
+++ b/include/linux/nodemask.h
@@ -391,6 +391,7 @@ enum node_states {
 	N_HIGH_MEMORY = N_NORMAL_MEMORY,
 #endif
 	N_MEMORY,		/* The node has memory(regular, high, movable) */
+	N_PRIVATE,		/* The node's memory is private */
 	N_CPU,		/* The node has one or more cpus */
 	N_GENERIC_INITIATOR,	/* The node has one or more Generic Initiators */
 	NR_NODE_STATES
diff --git a/mm/memory_hotplug.c b/mm/memory_hotplug.c
index 389989a28abe..57463fcb4021 100644
--- a/mm/memory_hotplug.c
+++ b/mm/memory_hotplug.c
@@ -1207,7 +1207,7 @@ int online_pages(unsigned long pfn, unsigned long nr_pages,
 	online_pages_range(pfn, nr_pages);
 	adjust_present_page_count(pfn_to_page(pfn), group, nr_pages);
 
-	if (node_arg.nid >= 0)
+	if (node_arg.nid >= 0 && !node_state(nid, N_PRIVATE))
 		node_set_state(nid, N_MEMORY);
 	if (need_zonelists_rebuild)
 		build_all_zonelists(NULL);
-- 
2.52.0




* [RFC PATCH v3 2/8] mm: constify oom_control, scan_control, and alloc_context nodemask
  2026-01-08 20:37 [RFC PATCH v3 0/8] mm,numa: N_PRIVATE node isolation for device-managed memory Gregory Price
  2026-01-08 20:37 ` [RFC PATCH v3 1/8] numa,memory_hotplug: create N_PRIVATE (Private Nodes) Gregory Price
@ 2026-01-08 20:37 ` Gregory Price
  2026-01-08 20:37 ` [RFC PATCH v3 3/8] mm: restrict slub, compaction, and page_alloc to sysram Gregory Price
                   ` (5 subsequent siblings)
  7 siblings, 0 replies; 12+ messages in thread
From: Gregory Price @ 2026-01-08 20:37 UTC (permalink / raw)
  To: linux-mm, cgroups, linux-cxl
  Cc: linux-doc, linux-kernel, linux-fsdevel, kernel-team, longman, tj,
	hannes, mkoutny, corbet, gregkh, rafael, dakr, dave,
	jonathan.cameron, dave.jiang, alison.schofield, vishal.l.verma,
	ira.weiny, dan.j.williams, akpm, vbabka, surenb, mhocko,
	jackmanb, ziy, david, lorenzo.stoakes, Liam.Howlett, rppt,
	axelrasmussen, yuanchu, weixugc, yury.norov, linux, rientjes,
	shakeel.butt, chrisl, kasong, shikemeng, nphamcs, bhe, baohua,
	yosry.ahmed, chengming.zhou, roman.gushchin, muchun.song,
	osalvador, matthew.brost, joshua.hahnjy, rakie.kim, byungchul,
	gourry, ying.huang, apopple, cl, harry.yoo, zhengqi.arch

The nodemasks in these structures may come from a variety of sources,
including tasks and cpusets, and should never be modified by the code
they are passed into.

Signed-off-by: Gregory Price <gourry@gourry.net>
---
 include/linux/cpuset.h | 4 ++--
 include/linux/mm.h     | 4 ++--
 include/linux/mmzone.h | 6 +++---
 include/linux/oom.h    | 2 +-
 include/linux/swap.h   | 2 +-
 kernel/cgroup/cpuset.c | 2 +-
 mm/internal.h          | 2 +-
 mm/mmzone.c            | 5 +++--
 mm/page_alloc.c        | 4 ++--
 mm/show_mem.c          | 9 ++++++---
 mm/vmscan.c            | 6 +++---
 11 files changed, 25 insertions(+), 21 deletions(-)

diff --git a/include/linux/cpuset.h b/include/linux/cpuset.h
index 631577384677..fe4f29624117 100644
--- a/include/linux/cpuset.h
+++ b/include/linux/cpuset.h
@@ -81,7 +81,7 @@ extern bool cpuset_cpu_is_isolated(int cpu);
 extern nodemask_t cpuset_mems_allowed(struct task_struct *p);
 #define cpuset_current_mems_allowed (current->mems_allowed)
 void cpuset_init_current_mems_allowed(void);
-int cpuset_nodemask_valid_mems_allowed(nodemask_t *nodemask);
+int cpuset_nodemask_valid_mems_allowed(const nodemask_t *nodemask);
 
 extern bool cpuset_current_node_allowed(int node, gfp_t gfp_mask);
 
@@ -226,7 +226,7 @@ static inline nodemask_t cpuset_mems_allowed(struct task_struct *p)
 #define cpuset_current_mems_allowed (node_states[N_MEMORY])
 static inline void cpuset_init_current_mems_allowed(void) {}
 
-static inline int cpuset_nodemask_valid_mems_allowed(nodemask_t *nodemask)
+static inline int cpuset_nodemask_valid_mems_allowed(const nodemask_t *nodemask)
 {
 	return 1;
 }
diff --git a/include/linux/mm.h b/include/linux/mm.h
index 45dfb2f2883c..dd4f5d49f638 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -3572,7 +3572,7 @@ extern int __meminit early_pfn_to_nid(unsigned long pfn);
 extern void mem_init(void);
 extern void __init mmap_init(void);
 
-extern void __show_mem(unsigned int flags, nodemask_t *nodemask, int max_zone_idx);
+extern void __show_mem(unsigned int flags, const nodemask_t *nodemask, int max_zone_idx);
 static inline void show_mem(void)
 {
 	__show_mem(0, NULL, MAX_NR_ZONES - 1);
@@ -3582,7 +3582,7 @@ extern void si_meminfo(struct sysinfo * val);
 extern void si_meminfo_node(struct sysinfo *val, int nid);
 
 extern __printf(3, 4)
-void warn_alloc(gfp_t gfp_mask, nodemask_t *nodemask, const char *fmt, ...);
+void warn_alloc(gfp_t gfp_mask, const nodemask_t *nodemask, const char *fmt, ...);
 
 extern void setup_per_cpu_pageset(void);
 
diff --git a/include/linux/mmzone.h b/include/linux/mmzone.h
index 6a7db0fee54a..7f94d67ffac4 100644
--- a/include/linux/mmzone.h
+++ b/include/linux/mmzone.h
@@ -1721,7 +1721,7 @@ static inline int zonelist_node_idx(const struct zoneref *zoneref)
 
 struct zoneref *__next_zones_zonelist(struct zoneref *z,
 					enum zone_type highest_zoneidx,
-					nodemask_t *nodes);
+					const nodemask_t *nodes);
 
 /**
  * next_zones_zonelist - Returns the next zone at or below highest_zoneidx within the allowed nodemask using a cursor within a zonelist as a starting point
@@ -1740,7 +1740,7 @@ struct zoneref *__next_zones_zonelist(struct zoneref *z,
  */
 static __always_inline struct zoneref *next_zones_zonelist(struct zoneref *z,
 					enum zone_type highest_zoneidx,
-					nodemask_t *nodes)
+					const nodemask_t *nodes)
 {
 	if (likely(!nodes && zonelist_zone_idx(z) <= highest_zoneidx))
 		return z;
@@ -1766,7 +1766,7 @@ static __always_inline struct zoneref *next_zones_zonelist(struct zoneref *z,
  */
 static inline struct zoneref *first_zones_zonelist(struct zonelist *zonelist,
 					enum zone_type highest_zoneidx,
-					nodemask_t *nodes)
+					const nodemask_t *nodes)
 {
 	return next_zones_zonelist(zonelist->_zonerefs,
 							highest_zoneidx, nodes);
diff --git a/include/linux/oom.h b/include/linux/oom.h
index 7b02bc1d0a7e..00da05d227e6 100644
--- a/include/linux/oom.h
+++ b/include/linux/oom.h
@@ -30,7 +30,7 @@ struct oom_control {
 	struct zonelist *zonelist;
 
 	/* Used to determine mempolicy */
-	nodemask_t *nodemask;
+	const nodemask_t *nodemask;
 
 	/* Memory cgroup in which oom is invoked, or NULL for global oom */
 	struct mem_cgroup *memcg;
diff --git a/include/linux/swap.h b/include/linux/swap.h
index 62fc7499b408..1569f3f4773b 100644
--- a/include/linux/swap.h
+++ b/include/linux/swap.h
@@ -370,7 +370,7 @@ extern void swap_setup(void);
 /* linux/mm/vmscan.c */
 extern unsigned long zone_reclaimable_pages(struct zone *zone);
 extern unsigned long try_to_free_pages(struct zonelist *zonelist, int order,
-					gfp_t gfp_mask, nodemask_t *mask);
+					gfp_t gfp_mask, const nodemask_t *mask);
 
 #define MEMCG_RECLAIM_MAY_SWAP (1 << 1)
 #define MEMCG_RECLAIM_PROACTIVE (1 << 2)
diff --git a/kernel/cgroup/cpuset.c b/kernel/cgroup/cpuset.c
index 289fb1a72550..a3ade9d5968b 100644
--- a/kernel/cgroup/cpuset.c
+++ b/kernel/cgroup/cpuset.c
@@ -4326,7 +4326,7 @@ nodemask_t cpuset_mems_allowed(struct task_struct *tsk)
  *
  * Are any of the nodes in the nodemask allowed in current->mems_allowed?
  */
-int cpuset_nodemask_valid_mems_allowed(nodemask_t *nodemask)
+int cpuset_nodemask_valid_mems_allowed(const nodemask_t *nodemask)
 {
 	return nodes_intersects(*nodemask, current->mems_allowed);
 }
diff --git a/mm/internal.h b/mm/internal.h
index 6dc83c243120..50d32055b544 100644
--- a/mm/internal.h
+++ b/mm/internal.h
@@ -587,7 +587,7 @@ void page_alloc_sysctl_init(void);
  */
 struct alloc_context {
 	struct zonelist *zonelist;
-	nodemask_t *nodemask;
+	const nodemask_t *nodemask;
 	struct zoneref *preferred_zoneref;
 	int migratetype;
 
diff --git a/mm/mmzone.c b/mm/mmzone.c
index 0c8f181d9d50..59dc3f2076a6 100644
--- a/mm/mmzone.c
+++ b/mm/mmzone.c
@@ -43,7 +43,8 @@ struct zone *next_zone(struct zone *zone)
 	return zone;
 }
 
-static inline int zref_in_nodemask(struct zoneref *zref, nodemask_t *nodes)
+static inline int zref_in_nodemask(struct zoneref *zref,
+				   const nodemask_t *nodes)
 {
 #ifdef CONFIG_NUMA
 	return node_isset(zonelist_node_idx(zref), *nodes);
@@ -55,7 +56,7 @@ static inline int zref_in_nodemask(struct zoneref *zref, nodemask_t *nodes)
 /* Returns the next zone at or below highest_zoneidx in a zonelist */
 struct zoneref *__next_zones_zonelist(struct zoneref *z,
 					enum zone_type highest_zoneidx,
-					nodemask_t *nodes)
+					const nodemask_t *nodes)
 {
 	/*
 	 * Find the next suitable zone to use for the allocation.
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index ecb2646b57ba..bb89d81aa68c 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -3988,7 +3988,7 @@ get_page_from_freelist(gfp_t gfp_mask, unsigned int order, int alloc_flags,
 	return NULL;
 }
 
-static void warn_alloc_show_mem(gfp_t gfp_mask, nodemask_t *nodemask)
+static void warn_alloc_show_mem(gfp_t gfp_mask, const nodemask_t *nodemask)
 {
 	unsigned int filter = SHOW_MEM_FILTER_NODES;
 
@@ -4008,7 +4008,7 @@ static void warn_alloc_show_mem(gfp_t gfp_mask, nodemask_t *nodemask)
 	mem_cgroup_show_protected_memory(NULL);
 }
 
-void warn_alloc(gfp_t gfp_mask, nodemask_t *nodemask, const char *fmt, ...)
+void warn_alloc(gfp_t gfp_mask, const nodemask_t *nodemask, const char *fmt, ...)
 {
 	struct va_format vaf;
 	va_list args;
diff --git a/mm/show_mem.c b/mm/show_mem.c
index 3a4b5207635d..24685b5c6dcf 100644
--- a/mm/show_mem.c
+++ b/mm/show_mem.c
@@ -116,7 +116,8 @@ void si_meminfo_node(struct sysinfo *val, int nid)
  * Determine whether the node should be displayed or not, depending on whether
  * SHOW_MEM_FILTER_NODES was passed to show_free_areas().
  */
-static bool show_mem_node_skip(unsigned int flags, int nid, nodemask_t *nodemask)
+static bool show_mem_node_skip(unsigned int flags, int nid,
+			       const nodemask_t *nodemask)
 {
 	if (!(flags & SHOW_MEM_FILTER_NODES))
 		return false;
@@ -177,7 +178,8 @@ static bool node_has_managed_zones(pg_data_t *pgdat, int max_zone_idx)
  * SHOW_MEM_FILTER_NODES: suppress nodes that are not allowed by current's
  *   cpuset.
  */
-static void show_free_areas(unsigned int filter, nodemask_t *nodemask, int max_zone_idx)
+static void show_free_areas(unsigned int filter, const nodemask_t *nodemask,
+			    int max_zone_idx)
 {
 	unsigned long free_pcp = 0;
 	int cpu, nid;
@@ -399,7 +401,8 @@ static void show_free_areas(unsigned int filter, nodemask_t *nodemask, int max_z
 	show_swap_cache_info();
 }
 
-void __show_mem(unsigned int filter, nodemask_t *nodemask, int max_zone_idx)
+void __show_mem(unsigned int filter, const nodemask_t *nodemask,
+		int max_zone_idx)
 {
 	unsigned long total = 0, reserved = 0, highmem = 0;
 	struct zone *zone;
diff --git a/mm/vmscan.c b/mm/vmscan.c
index 7c962ee7819f..23f68e754738 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -80,7 +80,7 @@ struct scan_control {
 	 * Nodemask of nodes allowed by the caller. If NULL, all nodes
 	 * are scanned.
 	 */
-	nodemask_t	*nodemask;
+	const nodemask_t *nodemask;
 
 	/*
 	 * The memory cgroup that hit its limit and as a result is the
@@ -6502,7 +6502,7 @@ static bool allow_direct_reclaim(pg_data_t *pgdat)
  * happens, the page allocator should not consider triggering the OOM killer.
  */
 static bool throttle_direct_reclaim(gfp_t gfp_mask, struct zonelist *zonelist,
-					nodemask_t *nodemask)
+				    const nodemask_t *nodemask)
 {
 	struct zoneref *z;
 	struct zone *zone;
@@ -6582,7 +6582,7 @@ static bool throttle_direct_reclaim(gfp_t gfp_mask, struct zonelist *zonelist,
 }
 
 unsigned long try_to_free_pages(struct zonelist *zonelist, int order,
-				gfp_t gfp_mask, nodemask_t *nodemask)
+				gfp_t gfp_mask, const nodemask_t *nodemask)
 {
 	unsigned long nr_reclaimed;
 	struct scan_control sc = {
-- 
2.52.0




* [RFC PATCH v3 3/8] mm: restrict slub, compaction, and page_alloc to sysram
  2026-01-08 20:37 [RFC PATCH v3 0/8] mm,numa: N_PRIVATE node isolation for device-managed memory Gregory Price
  2026-01-08 20:37 ` [RFC PATCH v3 1/8] numa,memory_hotplug: create N_PRIVATE (Private Nodes) Gregory Price
  2026-01-08 20:37 ` [RFC PATCH v3 2/8] mm: constify oom_control, scan_control, and alloc_context nodemask Gregory Price
@ 2026-01-08 20:37 ` Gregory Price
  2026-01-08 20:37 ` [RFC PATCH v3 4/8] cpuset: introduce cpuset.mems.sysram Gregory Price
                   ` (4 subsequent siblings)
  7 siblings, 0 replies; 12+ messages in thread
From: Gregory Price @ 2026-01-08 20:37 UTC (permalink / raw)
  To: linux-mm, cgroups, linux-cxl
  Cc: linux-doc, linux-kernel, linux-fsdevel, kernel-team, longman, tj,
	hannes, mkoutny, corbet, gregkh, rafael, dakr, dave,
	jonathan.cameron, dave.jiang, alison.schofield, vishal.l.verma,
	ira.weiny, dan.j.williams, akpm, vbabka, surenb, mhocko,
	jackmanb, ziy, david, lorenzo.stoakes, Liam.Howlett, rppt,
	axelrasmussen, yuanchu, weixugc, yury.norov, linux, rientjes,
	shakeel.butt, chrisl, kasong, shikemeng, nphamcs, bhe, baohua,
	yosry.ahmed, chengming.zhou, roman.gushchin, muchun.song,
	osalvador, matthew.brost, joshua.hahnjy, rakie.kim, byungchul,
	gourry, ying.huang, apopple, cl, harry.yoo, zhengqi.arch

Restrict page allocation and zone iteration to N_MEMORY nodes via
cpusets - or node_states[N_MEMORY] when cpusets is disabled.

__GFP_THISNODE allows N_PRIVATE nodes to be used explicitly (all
nodes become valid targets with __GFP_THISNODE).

This constrains core users of nodemasks to node_states[N_MEMORY],
which is guaranteed to at least contain the set of nodes with sysram
memory blocks present at boot.

Signed-off-by: Gregory Price <gourry@gourry.net>
---
 include/linux/gfp.h |  6 ++++++
 mm/compaction.c     |  6 ++----
 mm/page_alloc.c     | 27 ++++++++++++++++-----------
 mm/slub.c           |  8 ++++++--
 4 files changed, 30 insertions(+), 17 deletions(-)

diff --git a/include/linux/gfp.h b/include/linux/gfp.h
index b155929af5b1..0b6cdef7a232 100644
--- a/include/linux/gfp.h
+++ b/include/linux/gfp.h
@@ -321,6 +321,7 @@ struct folio *folio_alloc_mpol_noprof(gfp_t gfp, unsigned int order,
 		struct mempolicy *mpol, pgoff_t ilx, int nid);
 struct folio *vma_alloc_folio_noprof(gfp_t gfp, int order, struct vm_area_struct *vma,
 		unsigned long addr);
+bool numa_zone_allowed(int alloc_flags, struct zone *zone, gfp_t gfp_mask);
 #else
 static inline struct page *alloc_pages_noprof(gfp_t gfp_mask, unsigned int order)
 {
@@ -337,6 +338,11 @@ static inline struct folio *folio_alloc_mpol_noprof(gfp_t gfp, unsigned int orde
 }
 #define vma_alloc_folio_noprof(gfp, order, vma, addr)		\
 	folio_alloc_noprof(gfp, order)
+static inline bool numa_zone_allowed(int alloc_flags, struct zone *zone,
+				     gfp_t gfp_mask)
+{
+	return true;
+}
 #endif
 
 #define alloc_pages(...)			alloc_hooks(alloc_pages_noprof(__VA_ARGS__))
diff --git a/mm/compaction.c b/mm/compaction.c
index 1e8f8eca318c..63ef9803607f 100644
--- a/mm/compaction.c
+++ b/mm/compaction.c
@@ -2829,10 +2829,8 @@ enum compact_result try_to_compact_pages(gfp_t gfp_mask, unsigned int order,
 					ac->highest_zoneidx, ac->nodemask) {
 		enum compact_result status;
 
-		if (cpusets_enabled() &&
-			(alloc_flags & ALLOC_CPUSET) &&
-			!__cpuset_zone_allowed(zone, gfp_mask))
-				continue;
+		if (!numa_zone_allowed(alloc_flags, zone, gfp_mask))
+			continue;
 
 		if (prio > MIN_COMPACT_PRIORITY
 					&& compaction_deferred(zone, order)) {
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index bb89d81aa68c..76b12cef7dfc 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -3723,6 +3723,16 @@ static bool zone_allows_reclaim(struct zone *local_zone, struct zone *zone)
 	return node_distance(zone_to_nid(local_zone), zone_to_nid(zone)) <=
 				node_reclaim_distance;
 }
+bool numa_zone_allowed(int alloc_flags, struct zone *zone, gfp_t gfp_mask)
+{
+	/* If cpusets is being used, check mems_allowed or sysram_nodes */
+	if (cpusets_enabled() && (alloc_flags & ALLOC_CPUSET))
+		return cpuset_zone_allowed(zone, gfp_mask);
+
+	/* Otherwise only allow N_PRIVATE if __GFP_THISNODE is present */
+	return (gfp_mask & __GFP_THISNODE) ||
+		node_isset(zone_to_nid(zone), node_states[N_MEMORY]);
+}
 #else	/* CONFIG_NUMA */
 static bool zone_allows_reclaim(struct zone *local_zone, struct zone *zone)
 {
@@ -3814,10 +3824,9 @@ get_page_from_freelist(gfp_t gfp_mask, unsigned int order, int alloc_flags,
 		struct page *page;
 		unsigned long mark;
 
-		if (cpusets_enabled() &&
-			(alloc_flags & ALLOC_CPUSET) &&
-			!__cpuset_zone_allowed(zone, gfp_mask))
-				continue;
+		if (!numa_zone_allowed(alloc_flags, zone, gfp_mask))
+			continue;
+
 		/*
 		 * When allocating a page cache page for writing, we
 		 * want to get it from a node that is within its dirty
@@ -4618,10 +4627,8 @@ should_reclaim_retry(gfp_t gfp_mask, unsigned order,
 		unsigned long min_wmark = min_wmark_pages(zone);
 		bool wmark;
 
-		if (cpusets_enabled() &&
-			(alloc_flags & ALLOC_CPUSET) &&
-			!__cpuset_zone_allowed(zone, gfp_mask))
-				continue;
+		if (!numa_zone_allowed(alloc_flags, zone, gfp_mask))
+			continue;
 
 		available = reclaimable = zone_reclaimable_pages(zone);
 		available += zone_page_state_snapshot(zone, NR_FREE_PAGES);
@@ -5131,10 +5138,8 @@ unsigned long alloc_pages_bulk_noprof(gfp_t gfp, int preferred_nid,
 	for_next_zone_zonelist_nodemask(zone, z, ac.highest_zoneidx, ac.nodemask) {
 		unsigned long mark;
 
-		if (cpusets_enabled() && (alloc_flags & ALLOC_CPUSET) &&
-		    !__cpuset_zone_allowed(zone, gfp)) {
+		if (!numa_zone_allowed(alloc_flags, zone, gfp))
 			continue;
-		}
 
 		if (nr_online_nodes > 1 && zone != zonelist_zone(ac.preferred_zoneref) &&
 		    zone_to_nid(zone) != zonelist_node_idx(ac.preferred_zoneref)) {
diff --git a/mm/slub.c b/mm/slub.c
index 861592ac5425..adebbddc48f6 100644
--- a/mm/slub.c
+++ b/mm/slub.c
@@ -3594,9 +3594,13 @@ static struct slab *get_any_partial(struct kmem_cache *s,
 			struct kmem_cache_node *n;
 
 			n = get_node(s, zone_to_nid(zone));
+			if (!n)
+				continue;
+
+			if (!numa_zone_allowed(ALLOC_CPUSET, zone, pc->flags))
+				continue;
 
-			if (n && cpuset_zone_allowed(zone, pc->flags) &&
-					n->nr_partial > s->min_partial) {
+			if (n->nr_partial > s->min_partial) {
 				slab = get_partial_node(s, n, pc);
 				if (slab) {
 					/*
-- 
2.52.0




* [RFC PATCH v3 4/8] cpuset: introduce cpuset.mems.sysram
  2026-01-08 20:37 [RFC PATCH v3 0/8] mm,numa: N_PRIVATE node isolation for device-managed memory Gregory Price
                   ` (2 preceding siblings ...)
  2026-01-08 20:37 ` [RFC PATCH v3 3/8] mm: restrict slub, compaction, and page_alloc to sysram Gregory Price
@ 2026-01-08 20:37 ` Gregory Price
  2026-01-08 20:37 ` [RFC PATCH v3 5/8] Documentation/admin-guide/cgroups: update docs for mems_allowed Gregory Price
                   ` (3 subsequent siblings)
  7 siblings, 0 replies; 12+ messages in thread
From: Gregory Price @ 2026-01-08 20:37 UTC (permalink / raw)
  To: linux-mm, cgroups, linux-cxl
  Cc: linux-doc, linux-kernel, linux-fsdevel, kernel-team, longman, tj,
	hannes, mkoutny, corbet, gregkh, rafael, dakr, dave,
	jonathan.cameron, dave.jiang, alison.schofield, vishal.l.verma,
	ira.weiny, dan.j.williams, akpm, vbabka, surenb, mhocko,
	jackmanb, ziy, david, lorenzo.stoakes, Liam.Howlett, rppt,
	axelrasmussen, yuanchu, weixugc, yury.norov, linux, rientjes,
	shakeel.butt, chrisl, kasong, shikemeng, nphamcs, bhe, baohua,
	yosry.ahmed, chengming.zhou, roman.gushchin, muchun.song,
	osalvador, matthew.brost, joshua.hahnjy, rakie.kim, byungchul,
	gourry, ying.huang, apopple, cl, harry.yoo, zhengqi.arch

mems_sysram contains only SystemRAM nodes (omitting Private Nodes).

The nodemask is intersect(effective_mems, node_states[N_MEMORY]).

When checking mems_allowed, check for __GFP_THISNODE to determine if
the check should be made against sysram_nodes or mems_allowed.

This omits Private Nodes (N_PRIVATE) from default mems_allowed checks,
making those nodes unreachable via normal allocation paths (page
faults, mempolicies, etc).

Signed-off-by: Gregory Price <gourry@gourry.net>
---
 include/linux/cpuset.h          | 20 +++++--
 kernel/cgroup/cpuset-internal.h |  8 +++
 kernel/cgroup/cpuset-v1.c       |  8 +++
 kernel/cgroup/cpuset.c          | 96 +++++++++++++++++++++++++--------
 mm/memcontrol.c                 |  2 +-
 mm/mempolicy.c                  |  6 +--
 mm/migrate.c                    |  4 +-
 7 files changed, 112 insertions(+), 32 deletions(-)

diff --git a/include/linux/cpuset.h b/include/linux/cpuset.h
index fe4f29624117..1ae09ec0fcb7 100644
--- a/include/linux/cpuset.h
+++ b/include/linux/cpuset.h
@@ -174,7 +174,9 @@ static inline void set_mems_allowed(nodemask_t nodemask)
 	task_unlock(current);
 }
 
-extern void cpuset_nodes_allowed(struct cgroup *cgroup, nodemask_t *mask);
+extern void cpuset_sysram_nodes_allowed(struct cgroup *cgroup,
+					nodemask_t *mask);
+extern nodemask_t cpuset_sysram_nodemask(struct task_struct *p);
 #else /* !CONFIG_CPUSETS */
 
 static inline bool cpusets_enabled(void) { return false; }
@@ -218,7 +220,13 @@ static inline bool cpuset_cpu_is_isolated(int cpu)
 	return false;
 }
 
-static inline nodemask_t cpuset_mems_allowed(struct task_struct *p)
+static inline void cpuset_sysram_nodes_allowed(struct cgroup *cgroup,
+					       nodemask_t *mask)
+{
+	nodes_copy(*mask, node_possible_map);
+}
+
+static inline nodemask_t cpuset_sysram_nodemask(struct task_struct *p)
 {
 	return node_possible_map;
 }
@@ -301,10 +309,16 @@ static inline bool read_mems_allowed_retry(unsigned int seq)
 	return false;
 }
 
-static inline void cpuset_nodes_allowed(struct cgroup *cgroup, nodemask_t *mask)
+static inline void cpuset_sysram_nodes_allowed(struct cgroup *cgroup,
+					       nodemask_t *mask)
 {
 	nodes_copy(*mask, node_states[N_MEMORY]);
 }
+
+static inline nodemask_t cpuset_sysram_nodemask(struct task_struct *p)
+{
+	return node_states[N_MEMORY];
+}
 #endif /* !CONFIG_CPUSETS */
 
 #endif /* _LINUX_CPUSET_H */
diff --git a/kernel/cgroup/cpuset-internal.h b/kernel/cgroup/cpuset-internal.h
index 01976c8e7d49..4764aaef585f 100644
--- a/kernel/cgroup/cpuset-internal.h
+++ b/kernel/cgroup/cpuset-internal.h
@@ -53,6 +53,7 @@ typedef enum {
 	FILE_MEMORY_MIGRATE,
 	FILE_CPULIST,
 	FILE_MEMLIST,
+	FILE_MEMS_SYSRAM,
 	FILE_EFFECTIVE_CPULIST,
 	FILE_EFFECTIVE_MEMLIST,
 	FILE_SUBPARTS_CPULIST,
@@ -104,6 +105,13 @@ struct cpuset {
 	cpumask_var_t effective_cpus;
 	nodemask_t effective_mems;
 
+	/*
+	 * SystemRAM Memory Nodes for tasks.
+	 * This is the intersection of effective_mems and node_states[N_MEMORY].
+	 * Tasks will have their sysram_nodes set to this value.
+	 */
+	nodemask_t mems_sysram;
+
 	/*
 	 * Exclusive CPUs dedicated to current cgroup (default hierarchy only)
 	 *
diff --git a/kernel/cgroup/cpuset-v1.c b/kernel/cgroup/cpuset-v1.c
index 12e76774c75b..45b74181effd 100644
--- a/kernel/cgroup/cpuset-v1.c
+++ b/kernel/cgroup/cpuset-v1.c
@@ -293,6 +293,8 @@ void cpuset1_hotplug_update_tasks(struct cpuset *cs,
 	cpumask_copy(cs->effective_cpus, new_cpus);
 	cs->mems_allowed = *new_mems;
 	cs->effective_mems = *new_mems;
+	nodes_and(cs->mems_sysram, cs->effective_mems, node_states[N_MEMORY]);
+	cpuset_update_tasks_nodemask(cs);
 	cpuset_callback_unlock_irq();
 
 	/*
@@ -532,6 +534,12 @@ struct cftype cpuset1_files[] = {
 		.private = FILE_EFFECTIVE_MEMLIST,
 	},
 
+	{
+		.name = "mems_sysram",
+		.seq_show = cpuset_common_seq_show,
+		.private = FILE_MEMS_SYSRAM,
+	},
+
 	{
 		.name = "cpu_exclusive",
 		.read_u64 = cpuset_read_u64,
diff --git a/kernel/cgroup/cpuset.c b/kernel/cgroup/cpuset.c
index a3ade9d5968b..4c213a2ea7ac 100644
--- a/kernel/cgroup/cpuset.c
+++ b/kernel/cgroup/cpuset.c
@@ -29,6 +29,7 @@
 #include <linux/mempolicy.h>
 #include <linux/mm.h>
 #include <linux/memory.h>
+#include <linux/memory-tiers.h>
 #include <linux/export.h>
 #include <linux/rcupdate.h>
 #include <linux/sched.h>
@@ -454,11 +455,11 @@ static void guarantee_active_cpus(struct task_struct *tsk,
  *
  * Call with callback_lock or cpuset_mutex held.
  */
-static void guarantee_online_mems(struct cpuset *cs, nodemask_t *pmask)
+static void guarantee_online_sysram_nodes(struct cpuset *cs, nodemask_t *pmask)
 {
-	while (!nodes_intersects(cs->effective_mems, node_states[N_MEMORY]))
+	while (!nodes_intersects(cs->mems_sysram, node_states[N_MEMORY]))
 		cs = parent_cs(cs);
-	nodes_and(*pmask, cs->effective_mems, node_states[N_MEMORY]);
+	nodes_and(*pmask, cs->mems_sysram, node_states[N_MEMORY]);
 }
 
 /**
@@ -2791,7 +2792,7 @@ void cpuset_update_tasks_nodemask(struct cpuset *cs)
 
 	cpuset_being_rebound = cs;		/* causes mpol_dup() rebind */
 
-	guarantee_online_mems(cs, &newmems);
+	guarantee_online_sysram_nodes(cs, &newmems);
 
 	/*
 	 * The mpol_rebind_mm() call takes mmap_lock, which we couldn't
@@ -2816,7 +2817,7 @@ void cpuset_update_tasks_nodemask(struct cpuset *cs)
 
 		migrate = is_memory_migrate(cs);
 
-		mpol_rebind_mm(mm, &cs->mems_allowed);
+		mpol_rebind_mm(mm, &cs->mems_sysram);
 		if (migrate)
 			cpuset_migrate_mm(mm, &cs->old_mems_allowed, &newmems);
 		else
@@ -2876,6 +2877,8 @@ static void update_nodemasks_hier(struct cpuset *cs, nodemask_t *new_mems)
 
 		spin_lock_irq(&callback_lock);
 		cp->effective_mems = *new_mems;
+		nodes_and(cp->mems_sysram, cp->effective_mems,
+			  node_states[N_MEMORY]);
 		spin_unlock_irq(&callback_lock);
 
 		WARN_ON(!is_in_v2_mode() &&
@@ -3304,11 +3307,11 @@ static void cpuset_attach(struct cgroup_taskset *tset)
 	 * by skipping the task iteration and update.
 	 */
 	if (cpuset_v2() && !cpus_updated && !mems_updated) {
-		cpuset_attach_nodemask_to = cs->effective_mems;
+		cpuset_attach_nodemask_to = cs->mems_sysram;
 		goto out;
 	}
 
-	guarantee_online_mems(cs, &cpuset_attach_nodemask_to);
+	guarantee_online_sysram_nodes(cs, &cpuset_attach_nodemask_to);
 
 	cgroup_taskset_for_each(task, css, tset)
 		cpuset_attach_task(cs, task);
@@ -3319,7 +3322,7 @@ static void cpuset_attach(struct cgroup_taskset *tset)
 	 * if there is no change in effective_mems and CS_MEMORY_MIGRATE is
 	 * not set.
 	 */
-	cpuset_attach_nodemask_to = cs->effective_mems;
+	cpuset_attach_nodemask_to = cs->mems_sysram;
 	if (!is_memory_migrate(cs) && !mems_updated)
 		goto out;
 
@@ -3441,6 +3444,9 @@ int cpuset_common_seq_show(struct seq_file *sf, void *v)
 	case FILE_EFFECTIVE_MEMLIST:
 		seq_printf(sf, "%*pbl\n", nodemask_pr_args(&cs->effective_mems));
 		break;
+	case FILE_MEMS_SYSRAM:
+		seq_printf(sf, "%*pbl\n", nodemask_pr_args(&cs->mems_sysram));
+		break;
 	case FILE_EXCLUSIVE_CPULIST:
 		seq_printf(sf, "%*pbl\n", cpumask_pr_args(cs->exclusive_cpus));
 		break;
@@ -3552,6 +3558,12 @@ static struct cftype dfl_files[] = {
 		.private = FILE_EFFECTIVE_MEMLIST,
 	},
 
+	{
+		.name = "mems.sysram",
+		.seq_show = cpuset_common_seq_show,
+		.private = FILE_MEMS_SYSRAM,
+	},
+
 	{
 		.name = "cpus.partition",
 		.seq_show = cpuset_partition_show,
@@ -3654,6 +3666,8 @@ static int cpuset_css_online(struct cgroup_subsys_state *css)
 	if (is_in_v2_mode()) {
 		cpumask_copy(cs->effective_cpus, parent->effective_cpus);
 		cs->effective_mems = parent->effective_mems;
+		nodes_and(cs->mems_sysram, cs->effective_mems,
+			  node_states[N_MEMORY]);
 	}
 	spin_unlock_irq(&callback_lock);
 
@@ -3685,6 +3699,8 @@ static int cpuset_css_online(struct cgroup_subsys_state *css)
 	spin_lock_irq(&callback_lock);
 	cs->mems_allowed = parent->mems_allowed;
 	cs->effective_mems = parent->mems_allowed;
+	nodes_and(cs->mems_sysram, cs->effective_mems,
+		  node_states[N_MEMORY]);
 	cpumask_copy(cs->cpus_allowed, parent->cpus_allowed);
 	cpumask_copy(cs->effective_cpus, parent->cpus_allowed);
 	spin_unlock_irq(&callback_lock);
@@ -3838,7 +3854,7 @@ static void cpuset_fork(struct task_struct *task)
 
 	/* CLONE_INTO_CGROUP */
 	mutex_lock(&cpuset_mutex);
-	guarantee_online_mems(cs, &cpuset_attach_nodemask_to);
+	guarantee_online_sysram_nodes(cs, &cpuset_attach_nodemask_to);
 	cpuset_attach_task(cs, task);
 
 	dec_attach_in_progress_locked(cs);
@@ -3887,7 +3903,8 @@ int __init cpuset_init(void)
 	cpumask_setall(top_cpuset.effective_xcpus);
 	cpumask_setall(top_cpuset.exclusive_cpus);
 	nodes_setall(top_cpuset.effective_mems);
-
+	nodes_and(top_cpuset.mems_sysram, top_cpuset.effective_mems,
+		  node_states[N_MEMORY]);
 	fmeter_init(&top_cpuset.fmeter);
 
 	BUG_ON(!alloc_cpumask_var(&cpus_attach, GFP_KERNEL));
@@ -3916,6 +3933,7 @@ hotplug_update_tasks(struct cpuset *cs,
 	spin_lock_irq(&callback_lock);
 	cpumask_copy(cs->effective_cpus, new_cpus);
 	cs->effective_mems = *new_mems;
+	nodes_and(cs->mems_sysram, cs->effective_mems, node_states[N_MEMORY]);
 	spin_unlock_irq(&callback_lock);
 
 	if (cpus_updated)
@@ -4064,7 +4082,15 @@ static void cpuset_handle_hotplug(void)
 
 	/* fetch the available cpus/mems and find out which changed how */
 	cpumask_copy(&new_cpus, cpu_active_mask);
-	new_mems = node_states[N_MEMORY];
+
+	/*
+	 * Effective mems is union(N_MEMORY, N_PRIVATE), this allows
+	 * control over N_PRIVATE node usage from cgroups while
+	 * mems.sysram prevents N_PRIVATE nodes from being used
+	 * without __GFP_THISNODE.
+	 */
+	nodes_clear(new_mems);
+	nodes_or(new_mems, node_states[N_MEMORY], node_states[N_PRIVATE]);
 
 	/*
 	 * If subpartitions_cpus is populated, it is likely that the check
@@ -4106,6 +4132,8 @@ static void cpuset_handle_hotplug(void)
 		if (!on_dfl)
 			top_cpuset.mems_allowed = new_mems;
 		top_cpuset.effective_mems = new_mems;
+		nodes_and(top_cpuset.mems_sysram, top_cpuset.effective_mems,
+			  node_states[N_MEMORY]);
 		spin_unlock_irq(&callback_lock);
 		cpuset_update_tasks_nodemask(&top_cpuset);
 	}
@@ -4176,6 +4204,7 @@ void __init cpuset_init_smp(void)
 
 	cpumask_copy(top_cpuset.effective_cpus, cpu_active_mask);
 	top_cpuset.effective_mems = node_states[N_MEMORY];
+	top_cpuset.mems_sysram = node_states[N_MEMORY];
 
 	hotplug_node_notifier(cpuset_track_online_nodes, CPUSET_CALLBACK_PRI);
 
@@ -4293,14 +4322,18 @@ bool cpuset_cpus_allowed_fallback(struct task_struct *tsk)
 	return changed;
 }
 
+/*
+ * At this point in time, no hotplug nodes can have been added, so just set
+ * the sysram_nodes of the init task to the set of N_MEMORY nodes.
+ */
 void __init cpuset_init_current_mems_allowed(void)
 {
-	nodes_setall(current->mems_allowed);
+	current->mems_allowed = node_states[N_MEMORY];
 }
 
 /**
- * cpuset_mems_allowed - return mems_allowed mask from a tasks cpuset.
- * @tsk: pointer to task_struct from which to obtain cpuset->mems_allowed.
+ * cpuset_sysram_nodemask - return mems_sysram mask from a tasks cpuset.
+ * @tsk: pointer to task_struct from which to obtain cpuset->mems_sysram.
  *
  * Description: Returns the nodemask_t mems_allowed of the cpuset
  * attached to the specified @tsk.  Guaranteed to return some non-empty
@@ -4308,13 +4341,13 @@ void __init cpuset_init_current_mems_allowed(void)
  * tasks cpuset.
  **/
 
-nodemask_t cpuset_mems_allowed(struct task_struct *tsk)
+nodemask_t cpuset_sysram_nodemask(struct task_struct *tsk)
 {
 	nodemask_t mask;
 	unsigned long flags;
 
 	spin_lock_irqsave(&callback_lock, flags);
-	guarantee_online_mems(task_cs(tsk), &mask);
+	guarantee_online_sysram_nodes(task_cs(tsk), &mask);
 	spin_unlock_irqrestore(&callback_lock, flags);
 
 	return mask;
@@ -4383,17 +4416,30 @@ static struct cpuset *nearest_hardwall_ancestor(struct cpuset *cs)
  *	tsk_is_oom_victim   - any node ok
  *	GFP_KERNEL   - any node in enclosing hardwalled cpuset ok
  *	GFP_USER     - only nodes in current tasks mems allowed ok.
+ *	GFP_THISNODE - allows private memory nodes in mems_allowed
  */
 bool cpuset_current_node_allowed(int node, gfp_t gfp_mask)
 {
 	struct cpuset *cs;		/* current cpuset ancestors */
 	bool allowed;			/* is allocation in zone z allowed? */
 	unsigned long flags;
+	bool private_nodes = gfp_mask & __GFP_THISNODE;
 
+	/* Only SysRAM nodes are valid in interrupt context */
 	if (in_interrupt())
-		return true;
-	if (node_isset(node, current->mems_allowed))
-		return true;
+		return node_isset(node, node_states[N_MEMORY]);
+
+	if (private_nodes) {
+		rcu_read_lock();
+		cs = task_cs(current);
+		allowed = node_isset(node, cs->effective_mems);
+		rcu_read_unlock();
+	} else
+		allowed = node_isset(node, current->mems_allowed);
+
+	if (allowed)
+		return allowed;
+
 	/*
 	 * Allow tasks that have access to memory reserves because they have
 	 * been OOM killed to get memory anywhere.
@@ -4412,6 +4458,10 @@ bool cpuset_current_node_allowed(int node, gfp_t gfp_mask)
 	cs = nearest_hardwall_ancestor(task_cs(current));
 	allowed = node_isset(node, cs->mems_allowed);
 
+	/* If not allowing private node allocation, restrict to sysram nodes */
+	if (!private_nodes)
+		allowed &= node_isset(node, node_states[N_MEMORY]);
+
 	spin_unlock_irqrestore(&callback_lock, flags);
 	return allowed;
 }
@@ -4434,7 +4484,7 @@ bool cpuset_current_node_allowed(int node, gfp_t gfp_mask)
  * online due to hot plugins. Callers should check the mask for validity on
  * return based on its subsequent use.
  **/
-void cpuset_nodes_allowed(struct cgroup *cgroup, nodemask_t *mask)
+void cpuset_sysram_nodes_allowed(struct cgroup *cgroup, nodemask_t *mask)
 {
 	struct cgroup_subsys_state *css;
 	struct cpuset *cs;
@@ -4457,16 +4507,16 @@ void cpuset_nodes_allowed(struct cgroup *cgroup, nodemask_t *mask)
 
 	/*
 	 * The reference taken via cgroup_get_e_css is sufficient to
-	 * protect css, but it does not imply safe accesses to effective_mems.
+	 * protect css, but it does not imply safe accesses to mems_sysram.
 	 *
-	 * Normally, accessing effective_mems would require the cpuset_mutex
+	 * Normally, accessing mems_sysram would require the cpuset_mutex
 	 * or callback_lock - but the correctness of this information is stale
 	 * immediately after the query anyway. We do not acquire the lock
 	 * during this process to save lock contention in exchange for racing
 	 * against mems_allowed rebinds.
 	 */
 	cs = container_of(css, struct cpuset, css);
-	nodes_copy(*mask, cs->effective_mems);
+	nodes_copy(*mask, cs->mems_sysram);
 	css_put(css);
 }
 
diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index 7fbe9565cd06..2df7168edca0 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -5610,7 +5610,7 @@ void mem_cgroup_node_filter_allowed(struct mem_cgroup *memcg, nodemask_t *mask)
 	 * in effective_mems and hot-unpluging of nodes, inaccurate allowed
 	 * mask is acceptable.
 	 */
-	cpuset_nodes_allowed(memcg->css.cgroup, &allowed);
+	cpuset_sysram_nodes_allowed(memcg->css.cgroup, &allowed);
 	nodes_and(*mask, *mask, allowed);
 }
 
diff --git a/mm/mempolicy.c b/mm/mempolicy.c
index 76da50425712..760b5b6b4ae6 100644
--- a/mm/mempolicy.c
+++ b/mm/mempolicy.c
@@ -1901,14 +1901,14 @@ static int kernel_migrate_pages(pid_t pid, unsigned long maxnode,
 	}
 	rcu_read_unlock();
 
-	task_nodes = cpuset_mems_allowed(task);
+	task_nodes = cpuset_sysram_nodemask(task);
 	/* Is the user allowed to access the target nodes? */
 	if (!nodes_subset(*new, task_nodes) && !capable(CAP_SYS_NICE)) {
 		err = -EPERM;
 		goto out_put;
 	}
 
-	task_nodes = cpuset_mems_allowed(current);
+	task_nodes = cpuset_sysram_nodemask(current);
 	nodes_and(*new, *new, task_nodes);
 	if (nodes_empty(*new))
 		goto out_put;
@@ -2833,7 +2833,7 @@ struct mempolicy *__mpol_dup(struct mempolicy *old)
 		*new = *old;
 
 	if (current_cpuset_is_being_rebound()) {
-		nodemask_t mems = cpuset_mems_allowed(current);
+		nodemask_t mems = cpuset_sysram_nodemask(current);
 		mpol_rebind_policy(new, &mems);
 	}
 	atomic_set(&new->refcnt, 1);
diff --git a/mm/migrate.c b/mm/migrate.c
index 5169f9717f60..0ad893bf862b 100644
--- a/mm/migrate.c
+++ b/mm/migrate.c
@@ -2534,7 +2534,7 @@ static struct mm_struct *find_mm_struct(pid_t pid, nodemask_t *mem_nodes)
 	 */
 	if (!pid) {
 		mmget(current->mm);
-		*mem_nodes = cpuset_mems_allowed(current);
+		*mem_nodes = cpuset_sysram_nodemask(current);
 		return current->mm;
 	}
 
@@ -2555,7 +2555,7 @@ static struct mm_struct *find_mm_struct(pid_t pid, nodemask_t *mem_nodes)
 	mm = ERR_PTR(security_task_movememory(task));
 	if (IS_ERR(mm))
 		goto out;
-	*mem_nodes = cpuset_mems_allowed(task);
+	*mem_nodes = cpuset_sysram_nodemask(task);
 	mm = get_task_mm(task);
 out:
 	put_task_struct(task);
-- 
2.52.0



^ permalink raw reply	[flat|nested] 12+ messages in thread

* [RFC PATCH v3 5/8] Documentation/admin-guide/cgroups: update docs for mems_allowed
  2026-01-08 20:37 [RFC PATCH v3 0/8] mm,numa: N_PRIVATE node isolation for device-managed memory Gregory Price
                   ` (3 preceding siblings ...)
  2026-01-08 20:37 ` [RFC PATCH v3 4/8] cpuset: introduce cpuset.mems.sysram Gregory Price
@ 2026-01-08 20:37 ` Gregory Price
  2026-01-08 20:37 ` [RFC PATCH v3 6/8] drivers/cxl/core/region: add private_region Gregory Price
                   ` (2 subsequent siblings)
  7 siblings, 0 replies; 12+ messages in thread
From: Gregory Price @ 2026-01-08 20:37 UTC (permalink / raw)
  To: linux-mm, cgroups, linux-cxl
  Cc: linux-doc, linux-kernel, linux-fsdevel, kernel-team, longman, tj,
	hannes, mkoutny, corbet, gregkh, rafael, dakr, dave,
	jonathan.cameron, dave.jiang, alison.schofield, vishal.l.verma,
	ira.weiny, dan.j.williams, akpm, vbabka, surenb, mhocko,
	jackmanb, ziy, david, lorenzo.stoakes, Liam.Howlett, rppt,
	axelrasmussen, yuanchu, weixugc, yury.norov, linux, rientjes,
	shakeel.butt, chrisl, kasong, shikemeng, nphamcs, bhe, baohua,
	yosry.ahmed, chengming.zhou, roman.gushchin, muchun.song,
	osalvador, matthew.brost, joshua.hahnjy, rakie.kim, byungchul,
	gourry, ying.huang, apopple, cl, harry.yoo, zhengqi.arch

Add new documentation for mems_allowed and sysram_nodes: mems_allowed
may contain the union of N_MEMORY and N_PRIVATE nodes, while
sysram_nodes may only contain a subset of N_MEMORY nodes.

cpuset.mems.sysram is a new read-only ABI which reports the list of
N_MEMORY nodes the cpuset is allowed to use, while cpuset.mems and
cpuset.mems.effective may also contain N_PRIVATE nodes.

Signed-off-by: Gregory Price <gourry@gourry.net>
---
 .../admin-guide/cgroup-v1/cpusets.rst         | 19 +++++++++++---
 Documentation/admin-guide/cgroup-v2.rst       | 26 +++++++++++++++++--
 Documentation/filesystems/proc.rst            |  2 +-
 3 files changed, 40 insertions(+), 7 deletions(-)

diff --git a/Documentation/admin-guide/cgroup-v1/cpusets.rst b/Documentation/admin-guide/cgroup-v1/cpusets.rst
index c7909e5ac136..6d326056f7b4 100644
--- a/Documentation/admin-guide/cgroup-v1/cpusets.rst
+++ b/Documentation/admin-guide/cgroup-v1/cpusets.rst
@@ -158,21 +158,26 @@ new system calls are added for cpusets - all support for querying and
 modifying cpusets is via this cpuset file system.
 
 The /proc/<pid>/status file for each task has four added lines,
-displaying the task's cpus_allowed (on which CPUs it may be scheduled)
-and mems_allowed (on which Memory Nodes it may obtain memory),
-in the two formats seen in the following example::
+displaying the task's cpus_allowed (on which CPUs it may be scheduled),
+and mems_allowed (on which SystemRAM nodes it may obtain memory),
+in the formats seen in the following example::
 
   Cpus_allowed:   ffffffff,ffffffff,ffffffff,ffffffff
   Cpus_allowed_list:      0-127
   Mems_allowed:   ffffffff,ffffffff
   Mems_allowed_list:      0-63
 
+Note that Mems_allowed only shows SystemRAM nodes (N_MEMORY), not
+Private Nodes.  Private Nodes may be accessible via __GFP_THISNODE
+allocations if they appear in the task's cpuset.effective_mems.
+
 Each cpuset is represented by a directory in the cgroup file system
 containing (on top of the standard cgroup files) the following
 files describing that cpuset:
 
  - cpuset.cpus: list of CPUs in that cpuset
  - cpuset.mems: list of Memory Nodes in that cpuset
+ - cpuset.mems.sysram: read-only list of SystemRAM nodes (excludes Private Nodes)
  - cpuset.memory_migrate flag: if set, move pages to cpusets nodes
  - cpuset.cpu_exclusive flag: is cpu placement exclusive?
  - cpuset.mem_exclusive flag: is memory placement exclusive?
@@ -227,7 +232,9 @@ nodes with memory--using the cpuset_track_online_nodes() hook.
 
 The cpuset.effective_cpus and cpuset.effective_mems files are
 normally read-only copies of cpuset.cpus and cpuset.mems files
-respectively.  If the cpuset cgroup filesystem is mounted with the
+respectively.  The cpuset.effective_mems file may include both
+regular SystemRAM nodes (N_MEMORY) and Private Nodes (N_PRIVATE).
+If the cpuset cgroup filesystem is mounted with the
 special "cpuset_v2_mode" option, the behavior of these files will become
 similar to the corresponding files in cpuset v2.  In other words, hotplug
 events will not change cpuset.cpus and cpuset.mems.  Those events will
@@ -236,6 +243,10 @@ the actual cpus and memory nodes that are currently used by this cpuset.
 See Documentation/admin-guide/cgroup-v2.rst for more information about
 cpuset v2 behavior.
 
+The cpuset.mems.sysram file shows only the SystemRAM nodes (N_MEMORY)
+from cpuset.effective_mems, excluding any Private Nodes. This
+represents the nodes available for general memory allocation.
+
 
 1.4 What are exclusive cpusets ?
 --------------------------------
diff --git a/Documentation/admin-guide/cgroup-v2.rst b/Documentation/admin-guide/cgroup-v2.rst
index 7f5b59d95fce..6af54efb84a2 100644
--- a/Documentation/admin-guide/cgroup-v2.rst
+++ b/Documentation/admin-guide/cgroup-v2.rst
@@ -2530,8 +2530,11 @@ Cpuset Interface Files
 	cpuset-enabled cgroups.
 
 	It lists the onlined memory nodes that are actually granted to
-	this cgroup by its parent. These memory nodes are allowed to
-	be used by tasks within the current cgroup.
+	this cgroup by its parent.  This includes both regular SystemRAM
+	nodes (N_MEMORY) and Private Nodes (N_PRIVATE) that provide
+	device-specific memory not intended for general consumption.
+	Tasks within this cgroup may access Private Nodes using explicit
+	__GFP_THISNODE allocations if the node is in this mask.
 
 	If "cpuset.mems" is empty, it shows all the memory nodes from the
 	parent cgroup that will be available to be used by this cgroup.
@@ -2541,6 +2544,25 @@ Cpuset Interface Files
 
 	Its value will be affected by memory nodes hotplug events.
 
+  cpuset.mems.sysram
+	A read-only multiple values file which exists on all
+	cpuset-enabled cgroups.
+
+	It lists the SystemRAM nodes (N_MEMORY) that are available for
+	general memory allocation by tasks within this cgroup.  This is
+	a subset of "cpuset.mems.effective" that excludes Private Nodes.
+
+	Normal page allocations are restricted to nodes in this mask.
+	The kernel page allocator, slab allocator, and compaction only
+	consider SystemRAM nodes when allocating memory for tasks.
+
+	Private Nodes are excluded from this mask because their memory
+	is managed by device drivers for specific purposes (e.g., CXL
+	compressed memory, accelerator memory) and should not be used
+	for general allocations.
+
+	Its value will be affected by memory nodes hotplug events.
+
   cpuset.cpus.exclusive
 	A read-write multiple values file which exists on non-root
 	cpuset-enabled cgroups.
diff --git a/Documentation/filesystems/proc.rst b/Documentation/filesystems/proc.rst
index c92e95e28047..68f3d8ffc03b 100644
--- a/Documentation/filesystems/proc.rst
+++ b/Documentation/filesystems/proc.rst
@@ -294,7 +294,7 @@ It's slow but very precise.
  Cpus_active_mm              mask of CPUs on which this process has an active
                              memory context
  Cpus_active_mm_list         Same as previous, but in "list format"
- Mems_allowed                mask of memory nodes allowed to this process
+ Mems_allowed                mask of SystemRAM nodes for general allocations
  Mems_allowed_list           Same as previous, but in "list format"
  voluntary_ctxt_switches     number of voluntary context switches
  nonvoluntary_ctxt_switches  number of non voluntary context switches
-- 
2.52.0



^ permalink raw reply	[flat|nested] 12+ messages in thread

* [RFC PATCH v3 6/8] drivers/cxl/core/region: add private_region
  2026-01-08 20:37 [RFC PATCH v3 0/8] mm,numa: N_PRIVATE node isolation for device-managed memory Gregory Price
                   ` (4 preceding siblings ...)
  2026-01-08 20:37 ` [RFC PATCH v3 5/8] Documentation/admin-guide/cgroups: update docs for mems_allowed Gregory Price
@ 2026-01-08 20:37 ` Gregory Price
  2026-01-08 20:37 ` [RFC PATCH v3 7/8] mm/zswap: compressed ram direct integration Gregory Price
  2026-01-08 20:37 ` [RFC PATCH v3 8/8] drivers/cxl: add zswap private_region type Gregory Price
  7 siblings, 0 replies; 12+ messages in thread
From: Gregory Price @ 2026-01-08 20:37 UTC (permalink / raw)
  To: linux-mm, cgroups, linux-cxl
  Cc: linux-doc, linux-kernel, linux-fsdevel, kernel-team, longman, tj,
	hannes, mkoutny, corbet, gregkh, rafael, dakr, dave,
	jonathan.cameron, dave.jiang, alison.schofield, vishal.l.verma,
	ira.weiny, dan.j.williams, akpm, vbabka, surenb, mhocko,
	jackmanb, ziy, david, lorenzo.stoakes, Liam.Howlett, rppt,
	axelrasmussen, yuanchu, weixugc, yury.norov, linux, rientjes,
	shakeel.butt, chrisl, kasong, shikemeng, nphamcs, bhe, baohua,
	yosry.ahmed, chengming.zhou, roman.gushchin, muchun.song,
	osalvador, matthew.brost, joshua.hahnjy, rakie.kim, byungchul,
	gourry, ying.huang, apopple, cl, harry.yoo, zhengqi.arch

A private_region is just a RAM region which attempts to set the
target_node to N_PRIVATE before continuing to create a DAX device and
subsequently hotplugging memory onto the system.

A CXL device driver would create a private_region with the intent to
manage how the memory can be used more granularly than typical SystemRAM.

This patch adds the infrastructure for private memory regions. The code
is added in a separate directory to keep private region types organized.

usage:
    echo regionN > decoderX.Y/create_private_region
    echo type    > regionN/private_type

Signed-off-by: Gregory Price <gourry@gourry.net>
---
 drivers/cxl/core/Makefile                     |   1 +
 drivers/cxl/core/core.h                       |   4 +
 drivers/cxl/core/port.c                       |   4 +
 drivers/cxl/core/private_region/Makefile      |   9 ++
 .../cxl/core/private_region/private_region.c  | 119 ++++++++++++++++++
 .../cxl/core/private_region/private_region.h  |  10 ++
 drivers/cxl/core/region.c                     |  63 ++++++++--
 drivers/cxl/cxl.h                             |  20 +++
 8 files changed, 219 insertions(+), 11 deletions(-)
 create mode 100644 drivers/cxl/core/private_region/Makefile
 create mode 100644 drivers/cxl/core/private_region/private_region.c
 create mode 100644 drivers/cxl/core/private_region/private_region.h

diff --git a/drivers/cxl/core/Makefile b/drivers/cxl/core/Makefile
index 5ad8fef210b5..2dd882a52609 100644
--- a/drivers/cxl/core/Makefile
+++ b/drivers/cxl/core/Makefile
@@ -17,6 +17,7 @@ cxl_core-y += cdat.o
 cxl_core-y += ras.o
 cxl_core-$(CONFIG_TRACING) += trace.o
 cxl_core-$(CONFIG_CXL_REGION) += region.o
+obj-$(CONFIG_CXL_REGION) += private_region/
 cxl_core-$(CONFIG_CXL_MCE) += mce.o
 cxl_core-$(CONFIG_CXL_FEATURES) += features.o
 cxl_core-$(CONFIG_CXL_EDAC_MEM_FEATURES) += edac.o
diff --git a/drivers/cxl/core/core.h b/drivers/cxl/core/core.h
index 1fb66132b777..159f92e4bea1 100644
--- a/drivers/cxl/core/core.h
+++ b/drivers/cxl/core/core.h
@@ -21,6 +21,7 @@ enum cxl_detach_mode {
 #ifdef CONFIG_CXL_REGION
 extern struct device_attribute dev_attr_create_pmem_region;
 extern struct device_attribute dev_attr_create_ram_region;
+extern struct device_attribute dev_attr_create_private_region;
 extern struct device_attribute dev_attr_delete_region;
 extern struct device_attribute dev_attr_region;
 extern const struct device_type cxl_pmem_region_type;
@@ -30,6 +31,9 @@ extern const struct device_type cxl_region_type;
 int cxl_decoder_detach(struct cxl_region *cxlr,
 		       struct cxl_endpoint_decoder *cxled, int pos,
 		       enum cxl_detach_mode mode);
+int devm_cxl_add_dax_region(struct cxl_region *cxlr);
+struct cxl_region *to_cxl_region(struct device *dev);
+extern struct device_attribute dev_attr_private_type;
 
 #define CXL_REGION_ATTR(x) (&dev_attr_##x.attr)
 #define CXL_REGION_TYPE(x) (&cxl_region_type)
diff --git a/drivers/cxl/core/port.c b/drivers/cxl/core/port.c
index fef3aa0c6680..aedecb83e59b 100644
--- a/drivers/cxl/core/port.c
+++ b/drivers/cxl/core/port.c
@@ -333,6 +333,7 @@ static struct attribute *cxl_decoder_root_attrs[] = {
 	&dev_attr_qos_class.attr,
 	SET_CXL_REGION_ATTR(create_pmem_region)
 	SET_CXL_REGION_ATTR(create_ram_region)
+	SET_CXL_REGION_ATTR(create_private_region)
 	SET_CXL_REGION_ATTR(delete_region)
 	NULL,
 };
@@ -362,6 +363,9 @@ static umode_t cxl_root_decoder_visible(struct kobject *kobj, struct attribute *
 	if (a == CXL_REGION_ATTR(create_ram_region) && !can_create_ram(cxlrd))
 		return 0;
 
+	if (a == CXL_REGION_ATTR(create_private_region) && !can_create_ram(cxlrd))
+		return 0;
+
 	if (a == CXL_REGION_ATTR(delete_region) &&
 	    !(can_create_pmem(cxlrd) || can_create_ram(cxlrd)))
 		return 0;
diff --git a/drivers/cxl/core/private_region/Makefile b/drivers/cxl/core/private_region/Makefile
new file mode 100644
index 000000000000..d17498129ba6
--- /dev/null
+++ b/drivers/cxl/core/private_region/Makefile
@@ -0,0 +1,9 @@
+# SPDX-License-Identifier: GPL-2.0
+#
+# CXL Private Region type implementations
+#
+
+ccflags-y += -I$(srctree)/drivers/cxl
+
+# Core dispatch and sysfs
+obj-$(CONFIG_CXL_REGION) += private_region.o
diff --git a/drivers/cxl/core/private_region/private_region.c b/drivers/cxl/core/private_region/private_region.c
new file mode 100644
index 000000000000..ead48abb9fc7
--- /dev/null
+++ b/drivers/cxl/core/private_region/private_region.c
@@ -0,0 +1,119 @@
+// SPDX-License-Identifier: GPL-2.0
+/*
+ * CXL Private Region - dispatch and lifecycle management
+ *
+ * This file implements the main registration and unregistration dispatch
+ * for CXL private regions. It handles common initialization and delegates
+ * to type-specific implementations.
+ */
+
+#include <linux/device.h>
+#include <linux/cleanup.h>
+#include "../../cxl.h"
+#include "../core.h"
+#include "private_region.h"
+
+static const char *private_type_to_string(enum cxl_private_region_type type)
+{
+	switch (type) {
+	default:
+		return "";
+	}
+}
+
+static enum cxl_private_region_type string_to_private_type(const char *str)
+{
+	return CXL_PRIVATE_NONE;
+}
+
+static ssize_t private_type_show(struct device *dev,
+				 struct device_attribute *attr, char *buf)
+{
+	struct cxl_region *cxlr = to_cxl_region(dev);
+
+	return sysfs_emit(buf, "%s\n", private_type_to_string(cxlr->private_type));
+}
+
+static ssize_t private_type_store(struct device *dev,
+				  struct device_attribute *attr,
+				  const char *buf, size_t len)
+{
+	struct cxl_region *cxlr = to_cxl_region(dev);
+	struct cxl_region_params *p = &cxlr->params;
+	enum cxl_private_region_type type;
+	ssize_t rc;
+
+	type = string_to_private_type(buf);
+	if (type == CXL_PRIVATE_NONE)
+		return -EINVAL;
+
+	ACQUIRE(rwsem_write_kill, rwsem)(&cxl_rwsem.region);
+	if ((rc = ACQUIRE_ERR(rwsem_write_kill, &rwsem)))
+		return rc;
+
+	/* Can only change type before region is committed */
+	if (p->state >= CXL_CONFIG_COMMIT)
+		return -EBUSY;
+
+	cxlr->private_type = type;
+
+	return len;
+}
+DEVICE_ATTR_RW(private_type);
+
+/*
+ * Register a private CXL region based on its private_type.
+ *
+ * This function is called during commit. It validates the private_type,
+ * initializes the private_ops, and dispatches to the appropriate
+ * registration function which handles memtype, callbacks, and node
+ * registration.
+ */
+int cxl_register_private_region(struct cxl_region *cxlr)
+{
+	int rc = 0;
+
+	if (!cxlr->params.res)
+		return -EINVAL;
+
+	if (cxlr->private_type == CXL_PRIVATE_NONE) {
+		dev_err(&cxlr->dev, "private_type must be set before commit\n");
+		return -EINVAL;
+	}
+
+	/* Initialize the private_ops with region info */
+	cxlr->private_ops.res_start = cxlr->params.res->start;
+	cxlr->private_ops.res_end = cxlr->params.res->end;
+	cxlr->private_ops.data = cxlr;
+
+	/* Call type-specific registration which sets memtype and callbacks */
+	switch (cxlr->private_type) {
+	default:
+		dev_dbg(&cxlr->dev, "unsupported private_type: %d\n",
+			cxlr->private_type);
+		rc = -EINVAL;
+		break;
+	}
+
+	if (!rc)
+		set_bit(CXL_REGION_F_PRIVATE_REGISTERED, &cxlr->flags);
+	return rc;
+}
+
+/*
+ * Unregister a private CXL region.
+ *
+ * This function is called during region reset or device release.
+ * It dispatches to the appropriate type-specific cleanup function.
+ */
+void cxl_unregister_private_region(struct cxl_region *cxlr)
+{
+	if (!test_and_clear_bit(CXL_REGION_F_PRIVATE_REGISTERED, &cxlr->flags))
+		return;
+
+	/* Dispatch to type-specific cleanup */
+	switch (cxlr->private_type) {
+	default:
+		break;
+	}
+}
diff --git a/drivers/cxl/core/private_region/private_region.h b/drivers/cxl/core/private_region/private_region.h
new file mode 100644
index 000000000000..9b34e51d8df4
--- /dev/null
+++ b/drivers/cxl/core/private_region/private_region.h
@@ -0,0 +1,10 @@
+/* SPDX-License-Identifier: GPL-2.0 */
+#ifndef __CXL_PRIVATE_REGION_H__
+#define __CXL_PRIVATE_REGION_H__
+
+struct cxl_region;
+
+int cxl_register_private_region(struct cxl_region *cxlr);
+void cxl_unregister_private_region(struct cxl_region *cxlr);
+
+#endif /* __CXL_PRIVATE_REGION_H__ */
diff --git a/drivers/cxl/core/region.c b/drivers/cxl/core/region.c
index ae899f68551f..c60eef96c0ca 100644
--- a/drivers/cxl/core/region.c
+++ b/drivers/cxl/core/region.c
@@ -15,6 +15,7 @@
 #include <cxlmem.h>
 #include <cxl.h>
 #include "core.h"
+#include "private_region/private_region.h"
 
 /**
  * DOC: cxl core region
@@ -38,8 +39,6 @@
  */
 static nodemask_t nodemask_region_seen = NODE_MASK_NONE;
 
-static struct cxl_region *to_cxl_region(struct device *dev);
-
 #define __ACCESS_ATTR_RO(_level, _name) {				\
 	.attr	= { .name = __stringify(_name), .mode = 0444 },		\
 	.show	= _name##_access##_level##_show,			\
@@ -398,9 +397,6 @@ static int __commit(struct cxl_region *cxlr)
 		return rc;
 
 	rc = cxl_region_decode_commit(cxlr);
-	if (rc)
-		return rc;
-
 	p->state = CXL_CONFIG_COMMIT;
 
 	return 0;
@@ -615,12 +611,17 @@ static ssize_t mode_show(struct device *dev, struct device_attribute *attr,
 	struct cxl_region *cxlr = to_cxl_region(dev);
 	const char *desc;
 
-	if (cxlr->mode == CXL_PARTMODE_RAM)
-		desc = "ram";
-	else if (cxlr->mode == CXL_PARTMODE_PMEM)
+	switch (cxlr->mode) {
+	case CXL_PARTMODE_RAM:
+		desc = cxlr->private ? "private" : "ram";
+		break;
+	case CXL_PARTMODE_PMEM:
 		desc = "pmem";
-	else
+		break;
+	default:
 		desc = "";
+		break;
+	}
 
 	return sysfs_emit(buf, "%s\n", desc);
 }
@@ -772,6 +773,7 @@ static struct attribute *cxl_region_attrs[] = {
 	&dev_attr_size.attr,
 	&dev_attr_mode.attr,
 	&dev_attr_extended_linear_cache_size.attr,
+	&dev_attr_private_type.attr,
 	NULL,
 };
 
@@ -2400,6 +2402,9 @@ static void cxl_region_release(struct device *dev)
 	struct cxl_region *cxlr = to_cxl_region(dev);
 	int id = atomic_read(&cxlrd->region_id);
 
+	/* Ensure private region is cleaned up if not already done */
+	cxl_unregister_private_region(cxlr);
+
 	/*
 	 * Try to reuse the recently idled id rather than the cached
 	 * next id to prevent the region id space from increasing
@@ -2429,7 +2434,7 @@ bool is_cxl_region(struct device *dev)
 }
 EXPORT_SYMBOL_NS_GPL(is_cxl_region, "CXL");
 
-static struct cxl_region *to_cxl_region(struct device *dev)
+struct cxl_region *to_cxl_region(struct device *dev)
 {
 	if (dev_WARN_ONCE(dev, dev->type != &cxl_region_type,
 			  "not a cxl_region device\n"))
@@ -2638,6 +2643,13 @@ static ssize_t create_ram_region_show(struct device *dev,
 	return __create_region_show(to_cxl_root_decoder(dev), buf);
 }
 
+static ssize_t create_private_region_show(struct device *dev,
+					  struct device_attribute *attr,
+					  char *buf)
+{
+	return __create_region_show(to_cxl_root_decoder(dev), buf);
+}
+
 static struct cxl_region *__create_region(struct cxl_root_decoder *cxlrd,
 					  enum cxl_partition_mode mode, int id)
 {
@@ -2698,6 +2710,28 @@ static ssize_t create_ram_region_store(struct device *dev,
 }
 DEVICE_ATTR_RW(create_ram_region);
 
+static ssize_t create_private_region_store(struct device *dev,
+					   struct device_attribute *attr,
+					   const char *buf, size_t len)
+{
+	struct cxl_root_decoder *cxlrd = to_cxl_root_decoder(dev);
+	struct cxl_region *cxlr;
+	int rc, id;
+
+	rc = sscanf(buf, "region%d\n", &id);
+	if (rc != 1)
+		return -EINVAL;
+
+	cxlr = __create_region(cxlrd, CXL_PARTMODE_RAM, id);
+	if (IS_ERR(cxlr))
+		return PTR_ERR(cxlr);
+
+	cxlr->private = true;
+
+	return len;
+}
+DEVICE_ATTR_RW(create_private_region);
+
 static ssize_t region_show(struct device *dev, struct device_attribute *attr,
 			   char *buf)
 {
@@ -3431,7 +3465,7 @@ static void cxlr_dax_unregister(void *_cxlr_dax)
 	device_unregister(&cxlr_dax->dev);
 }
 
-static int devm_cxl_add_dax_region(struct cxl_region *cxlr)
+int devm_cxl_add_dax_region(struct cxl_region *cxlr)
 {
 	struct cxl_dax_region *cxlr_dax;
 	struct device *dev;
@@ -3974,6 +4008,13 @@ static int cxl_region_probe(struct device *dev)
 					p->res->start, p->res->end, cxlr,
 					is_system_ram) > 0)
 			return 0;
+
+
+		if (cxlr->private) {
+			rc = cxl_register_private_region(cxlr);
+			if (rc)
+				return rc;
+		}
 		return devm_cxl_add_dax_region(cxlr);
 	default:
 		dev_dbg(&cxlr->dev, "unsupported region mode: %d\n",
diff --git a/drivers/cxl/cxl.h b/drivers/cxl/cxl.h
index ba17fa86d249..b276956ff88d 100644
--- a/drivers/cxl/cxl.h
+++ b/drivers/cxl/cxl.h
@@ -525,6 +525,20 @@ enum cxl_partition_mode {
  */
 #define CXL_REGION_F_LOCK 2
 
+/*
+ * Indicate that this region has been registered as a private region.
+ * Used to track lifecycle and prevent double-unregistration.
+ */
+#define CXL_REGION_F_PRIVATE_REGISTERED 3
+
+/**
+ * enum cxl_private_region_type - CXL private region types
+ * @CXL_PRIVATE_NONE: No private region type set
+ */
+enum cxl_private_region_type {
+	CXL_PRIVATE_NONE,
+};
+
 /**
  * struct cxl_region - CXL region
  * @dev: This region's device
@@ -534,10 +548,13 @@ enum cxl_partition_mode {
  * @cxl_nvb: nvdimm bridge for coordinating @cxlr_pmem setup / shutdown
  * @cxlr_pmem: (for pmem regions) cached copy of the nvdimm bridge
  * @flags: Region state flags
+ * @private: Region is private (not exposed to system memory)
  * @params: active + config params for the region
  * @coord: QoS access coordinates for the region
  * @node_notifier: notifier for setting the access coordinates to node
  * @adist_notifier: notifier for calculating the abstract distance of node
+ * @private_type: CXL private region type for dispatch (set via sysfs)
+ * @private_ops: private node operations for callbacks (if mode is PRIVATE)
  */
 struct cxl_region {
 	struct device dev;
@@ -547,10 +564,13 @@ struct cxl_region {
 	struct cxl_nvdimm_bridge *cxl_nvb;
 	struct cxl_pmem_region *cxlr_pmem;
 	unsigned long flags;
+	bool private;
 	struct cxl_region_params params;
 	struct access_coordinate coord[ACCESS_COORDINATE_MAX];
 	struct notifier_block node_notifier;
 	struct notifier_block adist_notifier;
+	enum cxl_private_region_type private_type;
+	struct private_node_ops private_ops;
 };
 
 struct cxl_nvdimm_bridge {
-- 
2.52.0



^ permalink raw reply	[flat|nested] 12+ messages in thread

* [RFC PATCH v3 7/8] mm/zswap: compressed ram direct integration
  2026-01-08 20:37 [RFC PATCH v3 0/8] mm,numa: N_PRIVATE node isolation for device-managed memory Gregory Price
                   ` (5 preceding siblings ...)
  2026-01-08 20:37 ` [RFC PATCH v3 6/8] drivers/cxl/core/region: add private_region Gregory Price
@ 2026-01-08 20:37 ` Gregory Price
  2026-01-09 16:00   ` Yosry Ahmed
  2026-01-08 20:37 ` [RFC PATCH v3 8/8] drivers/cxl: add zswap private_region type Gregory Price
  7 siblings, 1 reply; 12+ messages in thread
From: Gregory Price @ 2026-01-08 20:37 UTC (permalink / raw)
  To: linux-mm, cgroups, linux-cxl
  Cc: linux-doc, linux-kernel, linux-fsdevel, kernel-team, longman, tj,
	hannes, mkoutny, corbet, gregkh, rafael, dakr, dave,
	jonathan.cameron, dave.jiang, alison.schofield, vishal.l.verma,
	ira.weiny, dan.j.williams, akpm, vbabka, surenb, mhocko,
	jackmanb, ziy, david, lorenzo.stoakes, Liam.Howlett, rppt,
	axelrasmussen, yuanchu, weixugc, yury.norov, linux, rientjes,
	shakeel.butt, chrisl, kasong, shikemeng, nphamcs, bhe, baohua,
	yosry.ahmed, chengming.zhou, roman.gushchin, muchun.song,
	osalvador, matthew.brost, joshua.hahnjy, rakie.kim, byungchul,
	gourry, ying.huang, apopple, cl, harry.yoo, zhengqi.arch

If a private zswap-node is available, skip the entire software
compression process and memcpy directly to a compressed memory
folio, and store the newly allocated compressed memory page as
the zswap entry->handle.

On decompress we do the opposite: copy directly from the stored
page to the destination, and free the compressed memory page.

The driver callback is responsible for preventing run-away
compression ratio failures by checking that the allocated page is
safe to use (i.e. a compression ratio limit hasn't been crossed).

Signed-off-by: Gregory Price <gourry@gourry.net>
---
 include/linux/zswap.h |   5 ++
 mm/zswap.c            | 106 +++++++++++++++++++++++++++++++++++++++++-
 2 files changed, 109 insertions(+), 2 deletions(-)

diff --git a/include/linux/zswap.h b/include/linux/zswap.h
index 30c193a1207e..4b52fe447e7e 100644
--- a/include/linux/zswap.h
+++ b/include/linux/zswap.h
@@ -35,6 +35,8 @@ void zswap_lruvec_state_init(struct lruvec *lruvec);
 void zswap_folio_swapin(struct folio *folio);
 bool zswap_is_enabled(void);
 bool zswap_never_enabled(void);
+void zswap_add_direct_node(int nid);
+void zswap_remove_direct_node(int nid);
 #else
 
 struct zswap_lruvec_state {};
@@ -69,6 +71,9 @@ static inline bool zswap_never_enabled(void)
 	return true;
 }
 
+static inline void zswap_add_direct_node(int nid) {}
+static inline void zswap_remove_direct_node(int nid) {}
+
 #endif
 
 #endif /* _LINUX_ZSWAP_H */
diff --git a/mm/zswap.c b/mm/zswap.c
index de8858ff1521..aada588c957e 100644
--- a/mm/zswap.c
+++ b/mm/zswap.c
@@ -35,6 +35,7 @@
 #include <linux/workqueue.h>
 #include <linux/list_lru.h>
 #include <linux/zsmalloc.h>
+#include <linux/node.h>
 
 #include "swap.h"
 #include "internal.h"
@@ -190,6 +191,7 @@ struct zswap_entry {
 	swp_entry_t swpentry;
 	unsigned int length;
 	bool referenced;
+	bool direct;
 	struct zswap_pool *pool;
 	unsigned long handle;
 	struct obj_cgroup *objcg;
@@ -199,6 +201,20 @@ struct zswap_entry {
 static struct xarray *zswap_trees[MAX_SWAPFILES];
 static unsigned int nr_zswap_trees[MAX_SWAPFILES];
 
+/* Nodemask for compressed RAM nodes used by zswap_compress_direct */
+static nodemask_t zswap_direct_nodes = NODE_MASK_NONE;
+
+void zswap_add_direct_node(int nid)
+{
+	node_set(nid, zswap_direct_nodes);
+}
+
+void zswap_remove_direct_node(int nid)
+{
+	if (!node_online(nid))
+		node_clear(nid, zswap_direct_nodes);
+}
+
 /* RCU-protected iteration */
 static LIST_HEAD(zswap_pools);
 /* protects zswap_pools list modification */
@@ -716,7 +732,13 @@ static void zswap_entry_cache_free(struct zswap_entry *entry)
 static void zswap_entry_free(struct zswap_entry *entry)
 {
 	zswap_lru_del(&zswap_list_lru, entry);
-	zs_free(entry->pool->zs_pool, entry->handle);
+	if (entry->direct) {
+		struct page *page = (struct page *)entry->handle;
+
+		node_private_freed(page);
+		__free_page(page);
+	} else
+		zs_free(entry->pool->zs_pool, entry->handle);
 	zswap_pool_put(entry->pool);
 	if (entry->objcg) {
 		obj_cgroup_uncharge_zswap(entry->objcg, entry->length);
@@ -849,6 +871,58 @@ static void acomp_ctx_put_unlock(struct crypto_acomp_ctx *acomp_ctx)
 	mutex_unlock(&acomp_ctx->mutex);
 }
 
+static struct page *zswap_compress_direct(struct page *src,
+					  struct zswap_entry *entry)
+{
+	int nid;
+	struct page *dst;
+	gfp_t gfp;
+	nodemask_t tried_nodes = NODE_MASK_NONE;
+
+	if (nodes_empty(zswap_direct_nodes))
+		return NULL;
+
+	gfp = GFP_NOWAIT | __GFP_NORETRY | __GFP_HIGHMEM | __GFP_MOVABLE |
+	      __GFP_THISNODE;
+
+	for_each_node_mask(nid, zswap_direct_nodes) {
+		int ret;
+
+		/* Skip nodes we've already tried and failed */
+		if (node_isset(nid, tried_nodes))
+			continue;
+
+		dst = __alloc_pages(gfp, 0, nid, &zswap_direct_nodes);
+		if (!dst)
+			continue;
+
+		/*
+		 * Check with the device driver that this page is safe to use.
+		 * If the device reports an error (e.g., compression ratio is
+		 * too low and the page can't safely store data), free the page
+		 * and try another node.
+		 */
+		ret = node_private_allocated(dst);
+		if (ret) {
+			__free_page(dst);
+			node_set(nid, tried_nodes);
+			continue;
+		}
+
+		goto found;
+	}
+
+	return NULL;
+
+found:
+	/* If we fail to copy at this point just fallback */
+	if (copy_mc_highpage(dst, src)) {
+		__free_page(dst);
+		dst = NULL;
+	}
+	return dst;
+}
+
 static bool zswap_compress(struct page *page, struct zswap_entry *entry,
 			   struct zswap_pool *pool)
 {
@@ -860,6 +934,17 @@ static bool zswap_compress(struct page *page, struct zswap_entry *entry,
 	gfp_t gfp;
 	u8 *dst;
 	bool mapped = false;
+	struct page *zpage;
+
+	/* Try to shunt directly to compressed ram */
+	zpage = zswap_compress_direct(page, entry);
+	if (zpage) {
+		entry->handle = (unsigned long)zpage;
+		entry->length = PAGE_SIZE;
+		entry->direct = true;
+		return true;
+	}
+	/* otherwise fallback to normal zswap */
 
 	acomp_ctx = acomp_ctx_get_cpu_lock(pool);
 	dst = acomp_ctx->buffer;
@@ -913,6 +998,7 @@ static bool zswap_compress(struct page *page, struct zswap_entry *entry,
 	zs_obj_write(pool->zs_pool, handle, dst, dlen);
 	entry->handle = handle;
 	entry->length = dlen;
+	entry->direct = false;
 
 unlock:
 	if (mapped)
@@ -936,6 +1022,15 @@ static bool zswap_decompress(struct zswap_entry *entry, struct folio *folio)
 	int decomp_ret = 0, dlen = PAGE_SIZE;
 	u8 *src, *obj;
 
+	/* compressed ram page */
+	if (entry->direct) {
+		struct page *src = (struct page *)entry->handle;
+		struct folio *zfolio = page_folio(src);
+
+		memcpy_folio(folio, 0, zfolio, 0, PAGE_SIZE);
+		goto direct_done;
+	}
+
 	acomp_ctx = acomp_ctx_get_cpu_lock(pool);
 	obj = zs_obj_read_begin(pool->zs_pool, entry->handle, acomp_ctx->buffer);
 
@@ -969,6 +1064,7 @@ static bool zswap_decompress(struct zswap_entry *entry, struct folio *folio)
 	zs_obj_read_end(pool->zs_pool, entry->handle, obj);
 	acomp_ctx_put_unlock(acomp_ctx);
 
+direct_done:
 	if (!decomp_ret && dlen == PAGE_SIZE)
 		return true;
 
@@ -1483,7 +1579,13 @@ static bool zswap_store_page(struct page *page,
 	return true;
 
 store_failed:
-	zs_free(pool->zs_pool, entry->handle);
+	if (entry->direct) {
+		struct page *freepage = (struct page *)entry->handle;
+
+		node_private_freed(freepage);
+		__free_page(freepage);
+	} else
+		zs_free(pool->zs_pool, entry->handle);
 compress_failed:
 	zswap_entry_cache_free(entry);
 	return false;
-- 
2.52.0



^ permalink raw reply	[flat|nested] 12+ messages in thread

* [RFC PATCH v3 8/8] drivers/cxl: add zswap private_region type
  2026-01-08 20:37 [RFC PATCH v3 0/8] mm,numa: N_PRIVATE node isolation for device-managed memory Gregory Price
                   ` (6 preceding siblings ...)
  2026-01-08 20:37 ` [RFC PATCH v3 7/8] mm/zswap: compressed ram direct integration Gregory Price
@ 2026-01-08 20:37 ` Gregory Price
  7 siblings, 0 replies; 12+ messages in thread
From: Gregory Price @ 2026-01-08 20:37 UTC (permalink / raw)
  To: linux-mm, cgroups, linux-cxl
  Cc: linux-doc, linux-kernel, linux-fsdevel, kernel-team, longman, tj,
	hannes, mkoutny, corbet, gregkh, rafael, dakr, dave,
	jonathan.cameron, dave.jiang, alison.schofield, vishal.l.verma,
	ira.weiny, dan.j.williams, akpm, vbabka, surenb, mhocko,
	jackmanb, ziy, david, lorenzo.stoakes, Liam.Howlett, rppt,
	axelrasmussen, yuanchu, weixugc, yury.norov, linux, rientjes,
	shakeel.butt, chrisl, kasong, shikemeng, nphamcs, bhe, baohua,
	yosry.ahmed, chengming.zhou, roman.gushchin, muchun.song,
	osalvador, matthew.brost, joshua.hahnjy, rakie.kim, byungchul,
	gourry, ying.huang, apopple, cl, harry.yoo, zhengqi.arch

Add a sample zswap region type, which registers itself as a valid
target node with mm/zswap.  Zswap will call back into the driver on new
page allocations and frees.

On cxl_zswap_page_allocated(), we would check whether the worst-case vs
current compression ratio makes it safe to allow new writes.

On cxl_zswap_page_freed(), zero the page to adjust the ratio down.

A device driver registering a zswap private region would need to provide
an indicator to this component of whether to allow new allocations - this
would probably be done via an interrupt setting a bit which says the
compression ratio has reached some conservative threshold.
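
A purely illustrative sketch of that flow follows (the interrupt handler
and the flag are hypothetical; only the cxl_zswap_page_allocated()
callback signature comes from this patch):

	/* hypothetical driver-side flag, set from a threshold interrupt */
	static atomic_t cram_ratio_exceeded = ATOMIC_INIT(0);

	static irqreturn_t cram_threshold_irq(int irq, void *data)
	{
		atomic_set(&cram_ratio_exceeded, 1);
		return IRQ_HANDLED;
	}

	/* simplified form of the callback added by this patch */
	static int cxl_zswap_page_allocated(struct page *page, void *data)
	{
		return atomic_read(&cram_ratio_exceeded) ? -ENOSPC : 0;
	}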

Signed-off-by: Gregory Price <gourry@gourry.net>
---
 drivers/cxl/core/private_region/Makefile      |   3 +
 .../cxl/core/private_region/private_region.c  |  10 ++
 .../cxl/core/private_region/private_region.h  |   4 +
 drivers/cxl/core/private_region/zswap.c       | 127 ++++++++++++++++++
 drivers/cxl/cxl.h                             |   2 +
 5 files changed, 146 insertions(+)
 create mode 100644 drivers/cxl/core/private_region/zswap.c

diff --git a/drivers/cxl/core/private_region/Makefile b/drivers/cxl/core/private_region/Makefile
index d17498129ba6..ba495cd3f89f 100644
--- a/drivers/cxl/core/private_region/Makefile
+++ b/drivers/cxl/core/private_region/Makefile
@@ -7,3 +7,6 @@ ccflags-y += -I$(srctree)/drivers/cxl
 
 # Core dispatch and sysfs
 obj-$(CONFIG_CXL_REGION) += private_region.o
+
+# Type-specific implementations
+obj-$(CONFIG_CXL_REGION) += zswap.o
diff --git a/drivers/cxl/core/private_region/private_region.c b/drivers/cxl/core/private_region/private_region.c
index ead48abb9fc7..da5fb3d264e1 100644
--- a/drivers/cxl/core/private_region/private_region.c
+++ b/drivers/cxl/core/private_region/private_region.c
@@ -16,6 +16,8 @@
 static const char *private_type_to_string(enum cxl_private_region_type type)
 {
 	switch (type) {
+	case CXL_PRIVATE_ZSWAP:
+		return "zswap";
 	default:
 		return "";
 	}
@@ -23,6 +25,8 @@ static const char *private_type_to_string(enum cxl_private_region_type type)
 
 static enum cxl_private_region_type string_to_private_type(const char *str)
 {
+	if (sysfs_streq(str, "zswap"))
+		return CXL_PRIVATE_ZSWAP;
 	return CXL_PRIVATE_NONE;
 }
 
@@ -88,6 +92,9 @@ int cxl_register_private_region(struct cxl_region *cxlr)
 
 	/* Call type-specific registration which sets memtype and callbacks */
 	switch (cxlr->private_type) {
+	case CXL_PRIVATE_ZSWAP:
+		rc = cxl_register_zswap_region(cxlr);
+		break;
 	default:
 		dev_dbg(&cxlr->dev, "unsupported private_type: %d\n",
 			cxlr->private_type);
@@ -113,6 +120,9 @@ void cxl_unregister_private_region(struct cxl_region *cxlr)
 
 	/* Dispatch to type-specific cleanup */
 	switch (cxlr->private_type) {
+	case CXL_PRIVATE_ZSWAP:
+		cxl_unregister_zswap_region(cxlr);
+		break;
 	default:
 		break;
 	}
diff --git a/drivers/cxl/core/private_region/private_region.h b/drivers/cxl/core/private_region/private_region.h
index 9b34e51d8df4..84d43238dbe1 100644
--- a/drivers/cxl/core/private_region/private_region.h
+++ b/drivers/cxl/core/private_region/private_region.h
@@ -7,4 +7,8 @@ struct cxl_region;
 int cxl_register_private_region(struct cxl_region *cxlr);
 void cxl_unregister_private_region(struct cxl_region *cxlr);
 
+/* Type-specific registration functions - called from region.c dispatch */
+int cxl_register_zswap_region(struct cxl_region *cxlr);
+void cxl_unregister_zswap_region(struct cxl_region *cxlr);
+
 #endif /* __CXL_PRIVATE_REGION_H__ */
diff --git a/drivers/cxl/core/private_region/zswap.c b/drivers/cxl/core/private_region/zswap.c
new file mode 100644
index 000000000000..c213abe2fad7
--- /dev/null
+++ b/drivers/cxl/core/private_region/zswap.c
@@ -0,0 +1,127 @@
+// SPDX-License-Identifier: GPL-2.0
+/*
+ * CXL Private Region - zswap type implementation
+ *
+ * This file implements the zswap private region type for CXL devices.
+ * It handles registration/unregistration of CXL regions as zswap
+ * compressed memory targets.
+ */
+
+#include <linux/device.h>
+#include <linux/highmem.h>
+#include <linux/node.h>
+#include <linux/zswap.h>
+#include <linux/memory_hotplug.h>
+#include "../../cxl.h"
+#include "../core.h"
+#include "private_region.h"
+
+/*
+ * CXL zswap region page_allocated callback
+ *
+ * This callback is invoked by zswap when a page is allocated from a private
+ * node to validate that the page is safe to use. For a real compressed memory
+ * device, this would check the device's compression ratio and return an error
+ * if the page cannot safely store data.
+ *
+ * Currently this is a placeholder that always succeeds. A real implementation
+ * would query the device hardware to determine if sufficient compression
+ * headroom exists.
+ */
+static int cxl_zswap_page_allocated(struct page *page, void *data)
+{
+	struct cxl_region *cxlr = data;
+
+	/*
+	 * TODO: Query the CXL device to check if this page allocation is safe.
+	 *
+	 * A real compressed memory device would track its compression ratio
+	 * and report whether it has headroom to accept new data. If the
+	 * compression ratio is too low (device is near capacity), this should
+	 * return -ENOSPC to tell zswap to try another node.
+	 *
+	 * For now, always succeed since we're testing with regular memory.
+	 */
+	dev_dbg(&cxlr->dev, "page_allocated callback for nid %d\n",
+		page_to_nid(page));
+
+	return 0;
+}
+
+/*
+ * CXL zswap region page_freed callback
+ *
+ * This callback is invoked when a page from a private node is being freed.
+ * We zero the page before returning it to the allocator so that the compressed
+ * memory device can reclaim capacity - zeroed pages achieve excellent
+ * compression ratios.
+ */
+static void cxl_zswap_page_freed(struct page *page, void *data)
+{
+	struct cxl_region *cxlr = data;
+
+	/*
+	 * Zero the page to improve the device's compression ratio.
+	 * Zeroed pages compress extremely well, reclaiming device capacity.
+	 */
+	clear_highpage(page);
+
+	dev_dbg(&cxlr->dev, "page_freed callback for nid %d\n",
+		page_to_nid(page));
+}
+
+/*
+ * Unregister a zswap region from the zswap subsystem.
+ *
+ * This function removes the node from zswap direct nodes and unregisters
+ * the private node operations.
+ */
+void cxl_unregister_zswap_region(struct cxl_region *cxlr)
+{
+	int nid;
+
+	if (!cxlr->private ||
+	    cxlr->private_ops.memtype != NODE_MEM_ZSWAP)
+		return;
+
+	if (!cxlr->params.res)
+		return;
+
+	nid = phys_to_target_node(cxlr->params.res->start);
+
+	zswap_remove_direct_node(nid);
+	node_unregister_private(nid, &cxlr->private_ops);
+
+	dev_dbg(&cxlr->dev, "unregistered zswap region for nid %d\n", nid);
+}
+
+/*
+ * Register a zswap region with the zswap subsystem.
+ *
+ * This function sets up the memtype, page_allocated callback, and
+ * registers the node with zswap as a direct compression target.
+ * The caller is responsible for adding the dax region after this succeeds.
+ */
+int cxl_register_zswap_region(struct cxl_region *cxlr)
+{
+	int nid, rc;
+
+	if (!cxlr->private || !cxlr->params.res)
+		return -EINVAL;
+
+	nid = phys_to_target_node(cxlr->params.res->start);
+
+	/* Register with node subsystem as zswap memory */
+	cxlr->private_ops.memtype = NODE_MEM_ZSWAP;
+	cxlr->private_ops.page_allocated = cxl_zswap_page_allocated;
+	cxlr->private_ops.page_freed = cxl_zswap_page_freed;
+	rc = node_register_private(nid, &cxlr->private_ops);
+	if (rc)
+		return rc;
+
+	/* Register this node with zswap as a direct compression target */
+	zswap_add_direct_node(nid);
+
+	dev_dbg(&cxlr->dev, "registered zswap region for nid %d\n", nid);
+	return 0;
+}
diff --git a/drivers/cxl/cxl.h b/drivers/cxl/cxl.h
index b276956ff88d..89d8ae4e796c 100644
--- a/drivers/cxl/cxl.h
+++ b/drivers/cxl/cxl.h
@@ -534,9 +534,11 @@ enum cxl_partition_mode {
 /**
  * enum cxl_private_region_type - CXL private region types
  * @CXL_PRIVATE_NONE: No private region type set
+ * @CXL_PRIVATE_ZSWAP: Region used for zswap compressed memory
  */
 enum cxl_private_region_type {
 	CXL_PRIVATE_NONE,
+	CXL_PRIVATE_ZSWAP,
 };
 
 /**
-- 
2.52.0



^ permalink raw reply	[flat|nested] 12+ messages in thread

* Re: [RFC PATCH v3 7/8] mm/zswap: compressed ram direct integration
  2026-01-08 20:37 ` [RFC PATCH v3 7/8] mm/zswap: compressed ram direct integration Gregory Price
@ 2026-01-09 16:00   ` Yosry Ahmed
  2026-01-09 17:03     ` Gregory Price
  2026-01-09 21:40     ` Gregory Price
  0 siblings, 2 replies; 12+ messages in thread
From: Yosry Ahmed @ 2026-01-09 16:00 UTC (permalink / raw)
  To: Gregory Price
  Cc: linux-mm, cgroups, linux-cxl, linux-doc, linux-kernel,
	linux-fsdevel, kernel-team, longman, tj, hannes, mkoutny, corbet,
	gregkh, rafael, dakr, dave, jonathan.cameron, dave.jiang,
	alison.schofield, vishal.l.verma, ira.weiny, dan.j.williams,
	akpm, vbabka, surenb, mhocko, jackmanb, ziy, david,
	lorenzo.stoakes, Liam.Howlett, rppt, axelrasmussen, yuanchu,
	weixugc, yury.norov, linux, rientjes, shakeel.butt, chrisl,
	kasong, shikemeng, nphamcs, bhe, baohua, chengming.zhou,
	roman.gushchin, muchun.song, osalvador, matthew.brost,
	joshua.hahnjy, rakie.kim, byungchul, ying.huang, apopple, cl,
	harry.yoo, zhengqi.arch

On Thu, Jan 08, 2026 at 03:37:54PM -0500, Gregory Price wrote:
> If a private zswap-node is available, skip the entire software
> compression process and memcpy directly to a compressed memory
> folio, and store the newly allocated compressed memory page as
> the zswap entry->handle.
> 
> On decompress we do the opposite: copy directly from the stored
> page to the destination, and free the compressed memory page.
> 
> The driver callback is responsible for preventing run-away
> compression ratio failures by checking that the allocated page is
> safe to use (i.e. a compression ratio limit hasn't been crossed).
> 
> Signed-off-by: Gregory Price <gourry@gourry.net>

Hi Gregory,

Thanks for sending this, I have a lot of questions/comments below, but
at a high level I am trying to understand the benefit of using a
compressed node for zswap rather than as a second tier.

If the memory is byte-addressable, using it as a second tier makes it
directly accessible without page faults, so the access latency is much
better than a swapped out page in zswap.

Are there some HW limitations that allow a node to be used as a backend
for zswap but not a second tier?

Or is the idea to make promotions from compressed memory to normal
memory fault-driver instead of relying on page hotness?

I also think there are some design decisions that need to be made before
we commit to this, see the comments below for more.

> ---
>  include/linux/zswap.h |   5 ++
>  mm/zswap.c            | 106 +++++++++++++++++++++++++++++++++++++++++-
>  2 files changed, 109 insertions(+), 2 deletions(-)
> 
> diff --git a/include/linux/zswap.h b/include/linux/zswap.h
> index 30c193a1207e..4b52fe447e7e 100644
> --- a/include/linux/zswap.h
> +++ b/include/linux/zswap.h
> @@ -35,6 +35,8 @@ void zswap_lruvec_state_init(struct lruvec *lruvec);
>  void zswap_folio_swapin(struct folio *folio);
>  bool zswap_is_enabled(void);
>  bool zswap_never_enabled(void);
> +void zswap_add_direct_node(int nid);
> +void zswap_remove_direct_node(int nid);
>  #else
>  
>  struct zswap_lruvec_state {};
> @@ -69,6 +71,9 @@ static inline bool zswap_never_enabled(void)
>  	return true;
>  }
>  
> +static inline void zswap_add_direct_node(int nid) {}
> +static inline void zswap_remove_direct_node(int nid) {}
> +
>  #endif
>  
>  #endif /* _LINUX_ZSWAP_H */
> diff --git a/mm/zswap.c b/mm/zswap.c
> index de8858ff1521..aada588c957e 100644
> --- a/mm/zswap.c
> +++ b/mm/zswap.c
> @@ -35,6 +35,7 @@
>  #include <linux/workqueue.h>
>  #include <linux/list_lru.h>
>  #include <linux/zsmalloc.h>
> +#include <linux/node.h>
>  
>  #include "swap.h"
>  #include "internal.h"
> @@ -190,6 +191,7 @@ struct zswap_entry {
>  	swp_entry_t swpentry;
>  	unsigned int length;
>  	bool referenced;
> +	bool direct;
>  	struct zswap_pool *pool;
>  	unsigned long handle;
>  	struct obj_cgroup *objcg;
> @@ -199,6 +201,20 @@ struct zswap_entry {
>  static struct xarray *zswap_trees[MAX_SWAPFILES];
>  static unsigned int nr_zswap_trees[MAX_SWAPFILES];
>  
> +/* Nodemask for compressed RAM nodes used by zswap_compress_direct */
> +static nodemask_t zswap_direct_nodes = NODE_MASK_NONE;
> +
> +void zswap_add_direct_node(int nid)
> +{
> +	node_set(nid, zswap_direct_nodes);
> +}
> +
> +void zswap_remove_direct_node(int nid)
> +{
> +	if (!node_online(nid))
> +		node_clear(nid, zswap_direct_nodes);
> +}
> +
>  /* RCU-protected iteration */
>  static LIST_HEAD(zswap_pools);
>  /* protects zswap_pools list modification */
> @@ -716,7 +732,13 @@ static void zswap_entry_cache_free(struct zswap_entry *entry)
>  static void zswap_entry_free(struct zswap_entry *entry)
>  {
>  	zswap_lru_del(&zswap_list_lru, entry);
> -	zs_free(entry->pool->zs_pool, entry->handle);
> +	if (entry->direct) {
> +		struct page *page = (struct page *)entry->handle;

Would it be cleaner to add a union in zswap_entry that has entry->handle
and entry->page?
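
Something like this, purely for illustration (field layout taken from
the hunk above; the union itself is hypothetical, not an existing
definition):

	struct zswap_entry {
		swp_entry_t swpentry;
		unsigned int length;
		bool referenced;
		bool direct;	/* true: page below, false: zsmalloc handle */
		struct zswap_pool *pool;
		union {
			unsigned long handle;	/* zsmalloc handle */
			struct page *page;	/* compressed-RAM page */
		};
		struct obj_cgroup *objcg;
		/* ... remaining fields unchanged ... */
	};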

> +
> +		node_private_freed(page);
> +		__free_page(page);
> +	} else
> +		zs_free(entry->pool->zs_pool, entry->handle);
>  	zswap_pool_put(entry->pool);
>  	if (entry->objcg) {
>  		obj_cgroup_uncharge_zswap(entry->objcg, entry->length);
> @@ -849,6 +871,58 @@ static void acomp_ctx_put_unlock(struct crypto_acomp_ctx *acomp_ctx)
>  	mutex_unlock(&acomp_ctx->mutex);
>  }
>  
> +static struct page *zswap_compress_direct(struct page *src,
> +					  struct zswap_entry *entry)
> +{
> +	int nid;
> +	struct page *dst;
> +	gfp_t gfp;
> +	nodemask_t tried_nodes = NODE_MASK_NONE;
> +
> +	if (nodes_empty(zswap_direct_nodes))
> +		return NULL;
> +
> +	gfp = GFP_NOWAIT | __GFP_NORETRY | __GFP_HIGHMEM | __GFP_MOVABLE |
> +	      __GFP_THISNODE;
> +
> +	for_each_node_mask(nid, zswap_direct_nodes) {
> +		int ret;
> +
> +		/* Skip nodes we've already tried and failed */
> +		if (node_isset(nid, tried_nodes))
> +			continue;

Why do we need this? Does for_each_node_mask() iterate each node more
than once?

> +
> +		dst = __alloc_pages(gfp, 0, nid, &zswap_direct_nodes);
> +		if (!dst)
> +			continue;
> +
> +		/*
> +		 * Check with the device driver that this page is safe to use.
> +		 * If the device reports an error (e.g., compression ratio is
> +		 * too low and the page can't safely store data), free the page
> +		 * and try another node.
> +		 */
> +		ret = node_private_allocated(dst);
> +		if (ret) {
> +			__free_page(dst);
> +			node_set(nid, tried_nodes);
> +			continue;
> +		}

I think we can drop the 'found' label by moving things around; would
this be simpler?

	for_each_node_mask(..) {
		...
		ret = node_private_allocated(dst);
		if (!ret)
			break;

		__free_page(dst);
		dst = NULL;
	}

	if (!dst)
		return NULL;

	if (copy_mc_highpage(..) {
		..
	}
	return dst;
		

> +
> +		goto found;
> +	}
> +
> +	return NULL;
> +
> +found:
> +	/* If we fail to copy at this point just fallback */
> +	if (copy_mc_highpage(dst, src)) {
> +		__free_page(dst);
> +		dst = NULL;
> +	}
> +	return dst;
> +}
> +

So the CXL code tells zswap what nodes are usable, then zswap tries
getting a page from these nodes and checking them using APIs provided by
the CXL code.

Wouldn't it be a better abstraction if the nodemask lived in the CXL
code and an API was exposed to zswap just to allocate a page to copy to?
Or we can abstract the copy as well and provide an API that directly
tries to copy the page to the compressible node.

IOW move zswap_compress_direct() (probably under a different name?) and
zswap_direct_nodes into CXL code since it's not really zswap logic.

Also, I am not sure if the zswap_compress_direct() call and check would
introduce any latency, since almost all existing callers will pay for it
without benefiting.

If we move the function into CXL code, we could probably have an inline
wrapper in a header with a static key guarding it to make sure there is
no overhead for existing users.
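
A rough sketch of that shape (every name below is made up for
illustration; none of this is an existing API):

	/* hypothetical cram.h, owned by the compressed-RAM driver side */
	DECLARE_STATIC_KEY_FALSE(cram_direct_enabled);

	struct page *__cram_store_page(struct page *src);	/* out of line */

	static inline struct page *cram_store_page(struct page *src)
	{
		/* patched to a no-op branch while no direct node is registered */
		if (!static_branch_unlikely(&cram_direct_enabled))
			return NULL;
		return __cram_store_page(src);
	}

	/* hypothetical cram.c, where the nodemask would live instead of zswap */
	DEFINE_STATIC_KEY_FALSE(cram_direct_enabled);
	static nodemask_t cram_direct_nodes = NODE_MASK_NONE;

	void cram_add_direct_node(int nid)
	{
		node_set(nid, cram_direct_nodes);
		static_branch_enable(&cram_direct_enabled);
	}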

>  static bool zswap_compress(struct page *page, struct zswap_entry *entry,
>  			   struct zswap_pool *pool)
>  {
> @@ -860,6 +934,17 @@ static bool zswap_compress(struct page *page, struct zswap_entry *entry,
>  	gfp_t gfp;
>  	u8 *dst;
>  	bool mapped = false;
> +	struct page *zpage;
> +
> +	/* Try to shunt directly to compressed ram */
> +	zpage = zswap_compress_direct(page, entry);
> +	if (zpage) {
> +		entry->handle = (unsigned long)zpage;
> +		entry->length = PAGE_SIZE;
> +		entry->direct = true;
> +		return true;
> +	}

I don't think this works. Setting entry->length = PAGE_SIZE will cause a
few problems, off the top of my head:

1. An entire page of memory will be charged to the memcg, so swapping
out the page won't reduce the memcg usage, which will cause thrashing
(reclaim with no progress when hitting the limit).

Ideally we'd get the compressed length from HW and record it here to
charge it appropriately, but I am not sure how we actually want to
charge memory on a compressed node. Do we charge the compressed size as
normal memory? Does it need separate charging and a separate limit?

There are design discussions to be had before we commit to something.

2. The page will be incorrectly counted in
zswap_stored_incompressible_pages.

Aside from that, zswap_total_pages() will be wrong now, as it gets the
pool size from zsmalloc and these pages are not allocated from zsmalloc.
This is used when checking the pool limits and is exposed in stats.

> +	/* otherwise fallback to normal zswap */
>  
>  	acomp_ctx = acomp_ctx_get_cpu_lock(pool);
>  	dst = acomp_ctx->buffer;
> @@ -913,6 +998,7 @@ static bool zswap_compress(struct page *page, struct zswap_entry *entry,
>  	zs_obj_write(pool->zs_pool, handle, dst, dlen);
>  	entry->handle = handle;
>  	entry->length = dlen;
> +	entry->direct = false;
>  
>  unlock:
>  	if (mapped)
> @@ -936,6 +1022,15 @@ static bool zswap_decompress(struct zswap_entry *entry, struct folio *folio)
>  	int decomp_ret = 0, dlen = PAGE_SIZE;
>  	u8 *src, *obj;
>  
> +	/* compressed ram page */
> +	if (entry->direct) {
> +		struct page *src = (struct page *)entry->handle;
> +		struct folio *zfolio = page_folio(src);
> +
> +		memcpy_folio(folio, 0, zfolio, 0, PAGE_SIZE);

Why are we using memcpy_folio() here but copy_mc_highpage() on the
compression path? Are they equivalent?

> +		goto direct_done;
> +	}
> +
>  	acomp_ctx = acomp_ctx_get_cpu_lock(pool);
>  	obj = zs_obj_read_begin(pool->zs_pool, entry->handle, acomp_ctx->buffer);
>  
> @@ -969,6 +1064,7 @@ static bool zswap_decompress(struct zswap_entry *entry, struct folio *folio)
>  	zs_obj_read_end(pool->zs_pool, entry->handle, obj);
>  	acomp_ctx_put_unlock(acomp_ctx);
>  
> +direct_done:
>  	if (!decomp_ret && dlen == PAGE_SIZE)
>  		return true;
>  
> @@ -1483,7 +1579,13 @@ static bool zswap_store_page(struct page *page,
>  	return true;
>  
>  store_failed:
> -	zs_free(pool->zs_pool, entry->handle);
> +	if (entry->direct) {
> +		struct page *freepage = (struct page *)entry->handle;
> +
> +		node_private_freed(freepage);
> +		__free_page(freepage);
> +	} else
> +		zs_free(pool->zs_pool, entry->handle);

This code is repeated in zswap_entry_free(); we should probably wrap it
in a helper that frees the private page or the zsmalloc entry based on
entry->direct.
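
For example, something like this (the helper name is made up; the logic
is lifted from the two call sites in this patch):

	static void zswap_entry_free_storage(struct zswap_entry *entry)
	{
		if (entry->direct) {
			struct page *page = (struct page *)entry->handle;

			node_private_freed(page);
			__free_page(page);
		} else {
			zs_free(entry->pool->zs_pool, entry->handle);
		}
	}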

>  compress_failed:
>  	zswap_entry_cache_free(entry);
>  	return false;
> -- 
> 2.52.0
> 


^ permalink raw reply	[flat|nested] 12+ messages in thread

* Re: [RFC PATCH v3 7/8] mm/zswap: compressed ram direct integration
  2026-01-09 16:00   ` Yosry Ahmed
@ 2026-01-09 17:03     ` Gregory Price
  2026-01-09 21:40     ` Gregory Price
  1 sibling, 0 replies; 12+ messages in thread
From: Gregory Price @ 2026-01-09 17:03 UTC (permalink / raw)
  To: Yosry Ahmed
  Cc: linux-mm, cgroups, linux-cxl, linux-doc, linux-kernel,
	linux-fsdevel, kernel-team, longman, tj, hannes, mkoutny, corbet,
	gregkh, rafael, dakr, dave, jonathan.cameron, dave.jiang,
	alison.schofield, vishal.l.verma, ira.weiny, dan.j.williams,
	akpm, vbabka, surenb, mhocko, jackmanb, ziy, david,
	lorenzo.stoakes, Liam.Howlett, rppt, axelrasmussen, yuanchu,
	weixugc, yury.norov, linux, rientjes, shakeel.butt, chrisl,
	kasong, shikemeng, nphamcs, bhe, baohua, chengming.zhou,
	roman.gushchin, muchun.song, osalvador, matthew.brost,
	joshua.hahnjy, rakie.kim, byungchul, ying.huang, apopple, cl,
	harry.yoo, zhengqi.arch

On Fri, Jan 09, 2026 at 04:00:00PM +0000, Yosry Ahmed wrote:
> On Thu, Jan 08, 2026 at 03:37:54PM -0500, Gregory Price wrote:
> > If a private zswap-node is available, skip the entire software
> > compression process and memcpy directly to a compressed memory
> > folio, and store the newly allocated compressed memory page as
> > the zswap entry->handle.
> > 
> > On decompress we do the opposite: copy directly from the stored
> > page to the destination, and free the compressed memory page.
> > 
> > The driver callback is responsible for preventing run-away
> > compression ratio failures by checking that the allocated page is
> > safe to use (i.e. a compression ratio limit hasn't been crossed).
> > 
> > Signed-off-by: Gregory Price <gourry@gourry.net>
> 
> Hi Gregory,
> 
> Thanks for sending this, I have a lot of questions/comments below, but
> from a high-level I am trying to understand the benefit of using a
> compressed node for zswap rather than as a second tier.
>

Don't think too hard about it - this is a stepping stone until we figure
out the cram.c usage pattern.

Unrestricted write access to compressed ram is a reliability issue, so:
  - zswap restricts both read and write.
  - a cram.c service would restrict write but leave pages mapped read-only

Have to step away, will come back to the rest of the feedback a bit
later, thank you for the review.

~Gregory


^ permalink raw reply	[flat|nested] 12+ messages in thread

* Re: [RFC PATCH v3 7/8] mm/zswap: compressed ram direct integration
  2026-01-09 16:00   ` Yosry Ahmed
  2026-01-09 17:03     ` Gregory Price
@ 2026-01-09 21:40     ` Gregory Price
  1 sibling, 0 replies; 12+ messages in thread
From: Gregory Price @ 2026-01-09 21:40 UTC (permalink / raw)
  To: Yosry Ahmed
  Cc: linux-mm, cgroups, linux-cxl, linux-doc, linux-kernel,
	linux-fsdevel, kernel-team, longman, tj, hannes, mkoutny, corbet,
	gregkh, rafael, dakr, dave, jonathan.cameron, dave.jiang,
	alison.schofield, vishal.l.verma, ira.weiny, dan.j.williams,
	akpm, vbabka, surenb, mhocko, jackmanb, ziy, david,
	lorenzo.stoakes, Liam.Howlett, rppt, axelrasmussen, yuanchu,
	weixugc, yury.norov, linux, rientjes, shakeel.butt, chrisl,
	kasong, shikemeng, nphamcs, bhe, baohua, chengming.zhou,
	roman.gushchin, muchun.song, osalvador, matthew.brost,
	joshua.hahnjy, rakie.kim, byungchul, ying.huang, apopple, cl,
	harry.yoo, zhengqi.arch

On Fri, Jan 09, 2026 at 04:00:00PM +0000, Yosry Ahmed wrote:
> On Thu, Jan 08, 2026 at 03:37:54PM -0500, Gregory Price wrote:
> 
> If the memory is byte-addressable, using it as a second tier makes it
> directly accessible without page faults, so the access latency is much
> better than a swapped out page in zswap.
> 
> Are there some HW limitations that allow a node to be used as a backend
> for zswap but not a second tier?
>

Coming back around - presumably any compressed node capable of hosting a
proper tier would be compatible with zswap, but you might have hardware
which is sufficiently slow (slower than dram, faster than storage) that
using it as a proper tier may be less efficient than incurring faults.

The standard I've been using is 500ns+ cacheline fetches, but this is
somewhat arbitrary.  Even 500ns might be better than accessing multi-us
storage, but then when you add compression you might hit 600ns-1us.

This is beside the point - apologies for the wall of text below, feel
free to skip the next section.  I'm writing out what hardware-specific
details I can share for the sake of completeness.


Some hardware details
=====================
The way every proposed piece of compressed memory hardware I have seen
would operate is essentially by lying about its capacity to the
operating system - and then providing mechanisms to determine when the
compression ratio is dropping to dangerous levels.

Hardware Says : 8GB
Hardware Has  : 1GB
Node Capacity : 8GB

The capacity numbers are static.  Even with hotplug, they must be
considered static - because the runtime compression ratio can change.

If the device fails to achieve a 4:1 compression ratio and real usage
starts to exceed real capacity, the system will fail (dropped writes,
poisons, machine checks, etc).

We can mitigate this with strong write-controls and querying the device
for compression ratio data prior to actually migrating a page. 
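
For concreteness, the driver-side safety check boils down to something
like this (all names and the fixed-point ratio encoding here are made up):

/*
 * Safe only while the bytes actually stored, divided by the achieved
 * compression ratio, still fit in the real media behind the node.
 *
 * e.g. 4GB in use at a 2.50:1 achieved ratio needs ~1.6GB of real media,
 * which is unsafe on a 1GB device.
 */
static bool cram_alloc_is_safe(u64 bytes_in_use, u64 real_capacity,
			       u64 ratio_x100)
{
	return div64_u64(bytes_in_use * 100, ratio_x100) <= real_capacity;
}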

Why Zswap to start
==================
Zswap is an existing, clean read and write control path:
   - We fault on all accesses.
   - It otherwise uses system memory under the hood (kmalloc)

I decided to use zswap as a proving ground for the concept.  While the
design in this patch is simplistic (and as you suggest below, can
clearly be improved), it demonstrates the entire concept:

on demotion:
- allocate a page from private memory
- ask the driver if it's safe to use
- if safe -> migrate
  if unsafe -> fallback

on memory access:
- "promote" to a real page
- inform the driver the page has been released (zero or discard)
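
In rough code, the demotion side of that is just (helper name made up,
node_private_allocated()/node_private_freed() are the callbacks from this
series, error handling trimmed):

static struct page *try_demote_to_private(struct page *src, int nid)
{
	/* allocate strictly from the private node - no fallback */
	struct page *dst = alloc_pages_node(nid, __GFP_THISNODE | GFP_NOWAIT, 0);

	if (!dst)
		return NULL;

	/* driver says the ratio limit has been crossed -> caller falls back */
	if (node_private_allocated(dst)) {
		__free_page(dst);
		return NULL;
	}

	/* MC-safe copy; on failure release the page back to the driver */
	if (copy_mc_highpage(dst, src)) {
		node_private_freed(dst);
		__free_page(dst);
		return NULL;
	}

	return dst;
}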

As you point out, the real value in byte-accessible memory is leaving
the memory mapped; the only difference between cram.c and zswap.c in the
above pattern would be:

on demotion:
- allocate a page from private memory
- ask the driver if it's safe to use
- if safe -> migrate and remap the page as RO in page tables
  if unsafe
     -> trigger reclaim on cram node
     -> fallback to another demotion

on *write* access:
- promote to real page
- clean up the compressed page

> Or is the idea to make promotions from compressed memory to normal
> memory fault-driven instead of relying on page hotness?
> 
> I also think there are some design decisions that need to be made before
> we commit to this, see the comments below for more.
>

100% agreed, I'm absolutely not locked into a design, this just gets the
ball rolling :].

> >  /* RCU-protected iteration */
> >  static LIST_HEAD(zswap_pools);
> >  /* protects zswap_pools list modification */
> > @@ -716,7 +732,13 @@ static void zswap_entry_cache_free(struct zswap_entry *entry)
> >  static void zswap_entry_free(struct zswap_entry *entry)
> >  {
> >  	zswap_lru_del(&zswap_list_lru, entry);
> > -	zs_free(entry->pool->zs_pool, entry->handle);
> > +	if (entry->direct) {
> > +		struct page *page = (struct page *)entry->handle;
> 
> Would it be cleaner to add a union in zswap_entry that has entry->handle
> and entry->page?
> 

Absolutely. Ack.
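
Something like this in zswap_entry (sketch, exact placement TBD):

	struct zswap_entry {
		/* ... */
		union {
			unsigned long handle;	/* zsmalloc handle */
			struct page *page;	/* direct compressed-ram page */
		};
		bool direct;
		/* ... */
	};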

> > +		/* Skip nodes we've already tried and failed */
> > +		if (node_isset(nid, tried_nodes))
> > +			continue;
> 
> Why do we need this? Does for_each_node_mask() iterate each node more
> than once?
>

This is just me being stupid, I will clean this up.  I think I wrote
this when I was using a _next nodemask variant that can loop around and
just left this in when I got it working.

> I think we can drop the 'found' label by moving things around, would
> this be simpler?
> 	for_each_node_mask(..) {
> 		...
> 		ret = node_private_allocated(dst);
> 		if (!ret)
> 			break;
> 
> 		__free_page(dst);
> 		dst = NULL;
> 	}
> 

ack, thank you.

> So the CXL code tells zswap what nodes are usable, then zswap tries
> getting a page from these nodes and checking them using APIs provided by
> the CXL code.
> 
> Wouldn't it be a better abstraction if the nodemask lived in the CXL
> code and an API was exposed to zswap just to allocate a page to copy to?
> Or we can abstract the copy as well and provide an API that directly
> tries to copy the page to the compressible node.
>
> IOW move zswap_compress_direct() (probably under a different name?) and
> zswap_direct_nodes into CXL code since it's not really zswap logic.
> 
> Also, I am not sure if the zswap_compress_direct() call and check would
> introduce any latency, since almost all existing callers will pay for it
> without benefiting.
> 
> If we move the function into CXL code, we could probably have an inline
> wrapper in a header with a static key guarding it to make there is no
> overhead for existing users.
> 


CXL is also the wrong place to put it - cxl is just one potential
source of such a node.  We'd want that abstracted...

So this looks like a good use of memory-tiers.c - do the dispatch there
and have it set static branches for various features on node registration.

struct page *mt_migrate_page_to(NODE_TYPE, src, &size);
-> on success, return the dst page and the size of the page on hardware
   (the returned size would address your accounting notes below)

Then have the migrate function in mt do all the node_private callbacks.

So that would limit the zswap internal change to

if (zswap_node_check()) { /* static branch check */
    cpage = mt_migrate_page_to(NODE_PRIVATE_ZSWAP, src, &size);
    if (cpage) {
        entry->page_handle = cpage;
        entry->length = size;
        entry->direct = true;
        return true;
    }
}
/* Fallthrough */

ack. this is all great, thank you.

... snip ...
> > entry->length = size
>
> I don't think this works. Setting entry->length = PAGE_SIZE will cause a
> few problems, off the top of my head:
> 
> 1. An entire page of memory will be charged to the memcg, so swapping
> out the page won't reduce the memcg usage, which will cause thrashing
> (reclaim with no progress when hitting the limit).
>
> Ideally we'd get the compressed length from HW and record it here to
> charge it appropriately, but I am not sure how we actually want to
> charge memory on a compressed node. Do we charge the compressed size as
> normal memory? Does it need separate charging and a separate limit?
> 
> There are design discussions to be had before we commit to something.

I have a feeling tracking individual page usage would be way too
granular / inefficient, but I will consult with some folks on whether
this can be queried.  If so, we can add a way to get that info.

node_private_page_size(page) -> returns device reported page size.

or work it directly into the migrate() call like above

--- assuming there isn't a way and we have to deal with fuzzy math ---

The goal should definitely be to leave the charging statistics the same
from the perspective of services - i.e. zswap should charge a whole page,
because according to the OS it just used a whole page.

What this would mean is memcg would have to work with fuzzy data.
If 1GB is charged and the compression ratio is 4:1, reclaim should
operate (by way of callback) like it has used 256MB.

I think this is the best you can do without tracking individual pages.
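
i.e. something like (made-up helper, ratio as fixed-point x100 just for
illustration):

	/* 1GB charged at a device-reported 4.00:1 ratio -> treat as 256MB */
	static u64 cram_effective_bytes(u64 charged_bytes, u64 ratio_x100)
	{
		return div64_u64(charged_bytes * 100, ratio_x100);
	}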

> 
> 2. The page will be incorrectly counted in
> zswap_stored_incompressible_pages.
> 

If we can track individual page size, then we can fix that.

If we can't, then we'd need zswap_stored_direct_pages and to do the
accounting a bit differently.  Probably want direct_pages accounting
anyway, so I might just add that.
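
e.g. roughly (counter only; wiring into the store/free paths and the
stats/limit checks not shown):

	static atomic_long_t zswap_stored_direct_pages = ATOMIC_LONG_INIT(0);

	/*
	 * Incremented on a successful direct store, decremented when a
	 * direct entry is freed, and added wherever the zsmalloc pool
	 * size is currently consulted.
	 */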

> Aside from that, zswap_total_pages() will be wrong now, as it gets the
> pool size from zsmalloc and these pages are not allocated from zsmalloc.
> This is used when checking the pool limits and is exposed in stats.
>

This is ignorance of zswap on my part, and yeah good point.  Will look
into this accounting a little more.

> > +		memcpy_folio(folio, 0, zfolio, 0, PAGE_SIZE);
> 
> Why are we using memcpy_folio() here but copy_mc_highpage() on the
> compression path? Are they equivalent?
> 

both are in include/linux/highmem.h

I was avoiding page->folio conversions in the compression path because
I had a struct page already.

tl;dr: I'm still looking for the "right" way to do this.  I originally
had a "HACK:" tag here but it seems I dropped it prematurely.

(I also think this code can be pushed into mt_ or callbacks)

> > +	if (entry->direct) {
> > +		struct page *freepage = (struct page *)entry->handle;
> > +
> > +		node_private_freed(freepage);
> > +		__free_page(freepage);
> > +	} else
> > +		zs_free(pool->zs_pool, entry->handle);
> 
> This code is repeated in zswap_entry_free(), we should probably wrap it
> in a helper that frees the private page or the zsmalloc entry based on
> entry->direct.
>

ack.

Thank you again for taking a look, this has been enlightening.  Good
takeaways for the rest of the N_PRIVATE design.

I think we can minimize zswap changes even further given this.

~Gregory


^ permalink raw reply	[flat|nested] 12+ messages in thread

end of thread, other threads:[~2026-01-09 21:40 UTC | newest]

Thread overview: 12+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2026-01-08 20:37 [RFC PATCH v3 0/8] mm,numa: N_PRIVATE node isolation for device-managed memory Gregory Price
2026-01-08 20:37 ` [RFC PATCH v3 1/8] numa,memory_hotplug: create N_PRIVATE (Private Nodes) Gregory Price
2026-01-08 20:37 ` [RFC PATCH v3 2/8] mm: constify oom_control, scan_control, and alloc_context nodemask Gregory Price
2026-01-08 20:37 ` [RFC PATCH v3 3/8] mm: restrict slub, compaction, and page_alloc to sysram Gregory Price
2026-01-08 20:37 ` [RFC PATCH v3 4/8] cpuset: introduce cpuset.mems.sysram Gregory Price
2026-01-08 20:37 ` [RFC PATCH v3 5/8] Documentation/admin-guide/cgroups: update docs for mems_allowed Gregory Price
2026-01-08 20:37 ` [RFC PATCH v3 6/8] drivers/cxl/core/region: add private_region Gregory Price
2026-01-08 20:37 ` [RFC PATCH v3 7/8] mm/zswap: compressed ram direct integration Gregory Price
2026-01-09 16:00   ` Yosry Ahmed
2026-01-09 17:03     ` Gregory Price
2026-01-09 21:40     ` Gregory Price
2026-01-08 20:37 ` [RFC PATCH v3 8/8] drivers/cxl: add zswap private_region type Gregory Price

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox