From: Balbir Singh <balbirs@nvidia.com>
To: Gregory Price <gourry@gourry.net>,
linux-mm@kvack.org, cgroups@vger.kernel.org,
linux-cxl@vger.kernel.org
Cc: linux-doc@vger.kernel.org, linux-kernel@vger.kernel.org,
linux-fsdevel@vger.kernel.org, kernel-team@meta.com,
longman@redhat.com, tj@kernel.org, hannes@cmpxchg.org,
mkoutny@suse.com, corbet@lwn.net, gregkh@linuxfoundation.org,
rafael@kernel.org, dakr@kernel.org, dave@stgolabs.net,
jonathan.cameron@huawei.com, dave.jiang@intel.com,
alison.schofield@intel.com, vishal.l.verma@intel.com,
ira.weiny@intel.com, dan.j.williams@intel.com,
akpm@linux-foundation.org, vbabka@suse.cz, surenb@google.com,
mhocko@suse.com, jackmanb@google.com, ziy@nvidia.com,
david@kernel.org, lorenzo.stoakes@oracle.com,
Liam.Howlett@oracle.com, rppt@kernel.org,
axelrasmussen@google.com, yuanchu@google.com, weixugc@google.com,
yury.norov@gmail.com, linux@rasmusvillemoes.dk,
rientjes@google.com, shakeel.butt@linux.dev, chrisl@kernel.org,
kasong@tencent.com, shikemeng@huaweicloud.com, nphamcs@gmail.com,
bhe@redhat.com, baohua@kernel.org, yosry.ahmed@linux.dev,
chengming.zhou@linux.dev, roman.gushchin@linux.dev,
muchun.song@linux.dev, osalvador@suse.de,
matthew.brost@intel.com, joshua.hahnjy@gmail.com,
rakie.kim@sk.com, byungchul@sk.com, ying.huang@linux.alibaba.com,
apopple@nvidia.com, cl@gentwo.org, harry.yoo@oracle.com,
zhengqi.arch@bytedance.com
Subject: Re: [RFC PATCH v3 0/8] mm,numa: N_PRIVATE node isolation for device-managed memory
Date: Mon, 12 Jan 2026 22:12:23 +1100
Message-ID: <6604d787-1744-4acf-80c0-e428fee1677e@nvidia.com>
In-Reply-To: <20260108203755.1163107-1-gourry@gourry.net>

On 1/9/26 06:37, Gregory Price wrote:
> This series introduces N_PRIVATE, a new node state for memory nodes
> whose memory is not intended for general system consumption. Today,
> device drivers (CXL, accelerators, etc.) hotplug their memory to access
> mm/ services like page allocation and reclaim, but this exposes general
> workloads to memory with different characteristics and reliability
> guarantees than system RAM.
>
> N_PRIVATE provides isolation by default while enabling explicit access
> via __GFP_THISNODE for subsystems that understand how to manage these
> specialized memory regions.
>
I assume each class of N_PRIVATE is a separate set of NUMA nodes; could these
be real or virtual memory nodes?
> Motivation
> ==========
>
> Several emerging memory technologies require kernel memory management
> services but should not be used for general allocations:
>
> - CXL Compressed RAM (CRAM): Hardware-compressed memory where the
> effective capacity depends on data compressibility. Uncontrolled
> use risks capacity exhaustion when compression ratios degrade.
>
> - Accelerator Memory: GPU/TPU-attached memory optimized for specific
> access patterns that are not intended for general allocation.
>
> - Tiered Memory: Memory intended only as a demotion target, not for
> initial allocations.
>
> Currently, these devices either avoid hotplugging entirely (losing mm/
> services) or hotplug as regular N_MEMORY (risking reliability issues).
> N_PRIVATE solves this by creating an isolated node class.
>
> Design
> ======
>
> The series introduces:
>
> 1. N_PRIVATE node state (mutually exclusive with N_MEMORY)
We should call it N_PRIVATE_MEMORY.
> 2. private_memtype enum for policy-based access control
> 3. cpuset.mems.sysram for user-visible isolation
> 4. Integration points for subsystems (zswap demonstrated)
> 5. A cxl private_region example to demonstrate full plumbing
>
> Private Memory Types (private_memtype)
> ======================================
>
> The private_memtype enum defines policy bits that control how different
> kernel subsystems may access private nodes:
>
> enum private_memtype {
>	NODE_MEM_NOTYPE,	/* No type assigned (invalid state) */
>	NODE_MEM_ZSWAP,		/* Swap compression target */
>	NODE_MEM_COMPRESSED,	/* General compressed RAM */
>	NODE_MEM_ACCELERATOR,	/* Accelerator-attached memory */
>	NODE_MEM_DEMOTE_ONLY,	/* Memory-tier demotion target only */
>	NODE_MAX_MEMTYPE,
> };
>
> These types serve as policy hints for subsystems:
>
Do these nodes have fallback(s)? Are these nodes prone to OOM when memory is
exhausted in one class of N_PRIVATE node(s)?
What about page cache allocation from these nodes? Since default allocations
never use them, a file system would need to do additional work to allocate
on them, if there was ever a desire to use them. Would memory migration work
between N_PRIVATE and N_MEMORY using move_pages()?
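Just to make that last question concrete: would something like the following
succeed from userspace? (My own illustration, not code from this series;
private_nid is an N_PRIVATE node and addr points to a page currently resident
on an N_MEMORY node.)

	#include <numaif.h>	/* move_pages(2), MPOL_MF_MOVE; link with -lnuma */

	static long try_move_to_private(void *addr, int private_nid)
	{
		void *pages[1]  = { addr };
		int   nodes[1]  = { private_nid };
		int   status[1] = { 0 };

		/* Does this count as an explicit opt-in, or is it rejected? */
		return move_pages(0, 1, pages, nodes, status, MPOL_MF_MOVE);
	}
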
> NODE_MEM_ZSWAP
> --------------
> Nodes with this type are registered as zswap compression targets. When
> zswap compresses a page, it can allocate directly from ZSWAP-typed nodes
> using __GFP_THISNODE, bypassing software compression if the device
> provides hardware compression.
>
> Example flow:
> 1. CXL device creates private_region with type=zswap
> 2. Driver calls node_register_private() with NODE_MEM_ZSWAP
> 3. zswap_add_direct_node() registers the node as a compression target
> 4. On swap-out, zswap allocates from the private node
> 5. page_allocated() callback validates compression ratio headroom
> 6. page_freed() callback zeros pages to improve device compression
>
> Prototype Note:
> This patch set does not actually do compression ratio validation, as
> this requires an actual device to provide some kind of counter and/or
> interrupt to denote when allocations are safe. The callbacks are
> left as stubs with TODOs for device vendors to pick up the next step
> (we'll continue with a QEMU example if reception is positive).
>
> For now, this always succeeds because compressed=real capacity.
>
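For steps 2 and 3 above, this is roughly what I expect the driver side to look
like -- please correct me if I have the shape wrong. The signature of
node_register_private() is my guess from the cover letter, res/nid/ret are
placeholders for the driver's resource range and node id, and the callbacks
are the same kind of stubs the prototype note below describes:

	static int example_page_allocated(struct page *page, void *data)
	{
		/* Prototype: always accept; a real device would check headroom */
		return 0;
	}

	static void example_page_freed(struct page *page, void *data)
	{
		/* Zero the page so the device can recover compression ratio */
		clear_highpage(page);
	}

	static struct private_node_ops example_ops = {
		.memtype	= NODE_MEM_ZSWAP,
		.page_allocated	= example_page_allocated,
		.page_freed	= example_page_freed,
	};

	/* in the region-commit path, after the memory has been hotplugged */
	example_ops.res_start = res->start;
	example_ops.res_end   = res->end;
	ret = node_register_private(nid, &example_ops);
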
> NODE_MEM_COMPRESSED (CRAM)
> --------------------------
> For general compressed RAM devices. Unlike ZSWAP nodes, CRAM nodes
> could be exposed to subsystems that understand compression semantics:
>
> - vmscan: Could prefer demoting pages to CRAM nodes before swap
> - memory-tiering: Could place CRAM between DRAM and persistent memory
> - zram: Could use as backing store instead of or alongside zswap
>
> Such a component (mm/cram.c) would differ from zswap or zram by allowing
> the compressed pages to remain mapped Read-Only in the page table.
>
> NODE_MEM_ACCELERATOR
> --------------------
> For GPU/TPU/accelerator-attached memory. Policy implications:
>
> - Default allocations: Never (isolated from general page_alloc)
> - GPU drivers: Explicit allocation via __GFP_THISNODE
> - NUMA balancing: Excluded from automatic migration
> - Memory tiering: Not a demotion target
>
> Some GPU vendors want management of their memory via NUMA nodes, but
> don't want fallback or migration allocations to occur. This enables
> that pattern.
>
> mm/mempolicy.c could be used to allow for N_PRIVATE nodes of this type
> if the intent is per-vma access to accelerator memory (e.g. via mbind)
> but this is omitted from this series for now to limit userland
> exposure until first-class examples are provided.
>
> NODE_MEM_DEMOTE_ONLY
> --------------------
> For memory intended exclusively as a demotion target in memory tiering:
>
> - page_alloc: Never allocates initially (slab, page faults, etc.)
> - vmscan/reclaim: Valid demotion target during memory pressure
> - memory-tiering: Allow hotness monitoring/promotion for this region
>
> This enables "cold storage" tiers using slower/cheaper memory (CXL-
> attached DRAM, persistent memory in volatile mode) without the memory
> appearing in allocation fast paths.
>
> This also has the added benefit of enforcing that memory placement on
> these nodes uses movable allocations only (with all the normal caveats
> around page pinning).
>
> Subsystem Integration Points
> ============================
>
> The private_node_ops structure provides callbacks for integration:
>
> struct private_node_ops {
>	struct list_head list;
>	resource_size_t res_start;
>	resource_size_t res_end;
>	enum private_memtype memtype;
>	int (*page_allocated)(struct page *page, void *data);
>	void (*page_freed)(struct page *page, void *data);
>	void *data;
> };
>
> page_allocated(): Called after allocation, returns 0 to accept or
> -ENOSPC/-ENODEV to reject (caller retries elsewhere). Enables:
> - Compression ratio enforcement for CRAM/zswap
> - Capacity tracking for accelerator memory
> - Rate limiting for demotion targets
>
> page_freed(): Called on free, enables:
> - Zeroing for compression ratio recovery
> - Capacity accounting updates
> - Device-specific cleanup
>
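On the callback contract: is something like the sketch below (a fleshed-out
version of the stubs further up) what you have in mind for a device that does
capacity tracking? My sketch only -- my_dev, used_pages and safe_limit are
invented for illustration:

	struct my_dev {
		atomic64_t	used_pages;
		s64		safe_limit;	/* from device-reported headroom */
	};

	static int my_page_allocated(struct page *page, void *data)
	{
		struct my_dev *dev = data;

		if (atomic64_inc_return(&dev->used_pages) > dev->safe_limit) {
			atomic64_dec(&dev->used_pages);
			return -ENOSPC;		/* caller retries elsewhere */
		}
		return 0;
	}

	static void my_page_freed(struct page *page, void *data)
	{
		struct my_dev *dev = data;

		clear_highpage(page);		/* compression ratio recovery */
		atomic64_dec(&dev->used_pages);
	}
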
> Isolation Enforcement
> =====================
>
> The series modifies core allocators to respect N_PRIVATE isolation:
>
> - page_alloc: Constrains zone iteration to cpuset.mems.sysram
> - slub: Allocates only from N_MEMORY nodes
> - compaction: Skips N_PRIVATE nodes
> - mempolicy: Uses sysram_nodes for policy evaluation
>
> __GFP_THISNODE bypasses isolation, enabling explicit access:
>
> page = alloc_pages_node(private_nid, GFP_KERNEL | __GFP_THISNODE, 0);
>
> This pattern is used by zswap, and would be used by other subsystems
> that explicitly opt into private node access.
>
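And for other subsystems opting in, I assume the usage pattern ends up looking
roughly like this (again my sketch, not from the series): try the private node
with __GFP_THISNODE and fall back to sysram if the node is exhausted or its
page_allocated() callback rejects the page:

	static struct page *alloc_private_or_sysram(int private_nid, gfp_t gfp,
						    unsigned int order)
	{
		struct page *page;

		/* explicit opt-in: __GFP_THISNODE bypasses N_PRIVATE isolation */
		page = alloc_pages_node(private_nid,
					gfp | __GFP_THISNODE | __GFP_NOWARN, order);
		if (page)
			return page;

		/* private node full or rejected the page: fall back to sysram */
		return alloc_pages(gfp, order);
	}
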
> User-Visible Changes
> ====================
>
> cpuset gains cpuset.mems.sysram (read-only), which shows the N_MEMORY nodes.
>
> ABI: /proc/<pid>/status Mems_allowed shows sysram nodes only.
>
> Drivers create private regions via sysfs:
> echo region0 > /sys/bus/cxl/.../create_private_region
> echo zswap > /sys/bus/cxl/.../region0/private_type
> echo 1 > /sys/bus/cxl/.../region0/commit
>
> Series Organization
> ===================
>
> Patch 1: numa,memory_hotplug: create N_PRIVATE (Private Nodes)
> Core infrastructure: N_PRIVATE node state, node_mark_private(),
> private_memtype enum, and private_node_ops registration.
>
> Patch 2: mm: constify oom_control, scan_control, and alloc_context
> nodemask
> Preparatory cleanup for enforcing that nodemasks don't change.
>
> Patch 3: mm: restrict slub, compaction, and page_alloc to sysram
> Enforce N_MEMORY-only allocation for general paths.
>
> Patch 4: cpuset: introduce cpuset.mems.sysram
> User-visible isolation via cpuset interface.
>
> Patch 5: Documentation/admin-guide/cgroups: update docs for mems_allowed
> Document the new behavior and sysram_nodes.
>
> Patch 6: drivers/cxl/core/region: add private_region
> CXL infrastructure for private regions.
>
> Patch 7: mm/zswap: compressed ram direct integration
> Zswap integration demonstrating direct hardware compression.
>
> Patch 8: drivers/cxl: add zswap private_region type
> Complete example: CXL region as zswap compression target.
>
> Future Work
> ===========
>
> This series provides the foundation. Planned follow-ups include:
>
> - CRAM integration with vmscan for smart demotion
> - ACCELERATOR type for GPU memory management
> - Memory-tiering integration with DEMOTE_ONLY nodes
>
> Testing
> =======
>
> All patches build cleanly. Tested with:
> - CXL QEMU emulation with private regions
> - Zswap stress tests with private compression targets
> - Cpuset verification of mems.sysram isolation
>
>
> Gregory Price (8):
> numa,memory_hotplug: create N_PRIVATE (Private Nodes)
> mm: constify oom_control, scan_control, and alloc_context nodemask
> mm: restrict slub, compaction, and page_alloc to sysram
> cpuset: introduce cpuset.mems.sysram
> Documentation/admin-guide/cgroups: update docs for mems_allowed
> drivers/cxl/core/region: add private_region
> mm/zswap: compressed ram direct integration
> drivers/cxl: add zswap private_region type
>
> .../admin-guide/cgroup-v1/cpusets.rst | 19 +-
> Documentation/admin-guide/cgroup-v2.rst | 26 ++-
> Documentation/filesystems/proc.rst | 2 +-
> drivers/base/node.c | 199 ++++++++++++++++++
> drivers/cxl/core/Makefile | 1 +
> drivers/cxl/core/core.h | 4 +
> drivers/cxl/core/port.c | 4 +
> drivers/cxl/core/private_region/Makefile | 12 ++
> .../cxl/core/private_region/private_region.c | 129 ++++++++++++
> .../cxl/core/private_region/private_region.h | 14 ++
> drivers/cxl/core/private_region/zswap.c | 127 +++++++++++
> drivers/cxl/core/region.c | 63 +++++-
> drivers/cxl/cxl.h | 22 ++
> include/linux/cpuset.h | 24 ++-
> include/linux/gfp.h | 6 +
> include/linux/mm.h | 4 +-
> include/linux/mmzone.h | 6 +-
> include/linux/node.h | 60 ++++++
> include/linux/nodemask.h | 1 +
> include/linux/oom.h | 2 +-
> include/linux/swap.h | 2 +-
> include/linux/zswap.h | 5 +
> kernel/cgroup/cpuset-internal.h | 8 +
> kernel/cgroup/cpuset-v1.c | 8 +
> kernel/cgroup/cpuset.c | 98 ++++++---
> mm/compaction.c | 6 +-
> mm/internal.h | 2 +-
> mm/memcontrol.c | 2 +-
> mm/memory_hotplug.c | 2 +-
> mm/mempolicy.c | 6 +-
> mm/migrate.c | 4 +-
> mm/mmzone.c | 5 +-
> mm/page_alloc.c | 31 +--
> mm/show_mem.c | 9 +-
> mm/slub.c | 8 +-
> mm/vmscan.c | 6 +-
> mm/zswap.c | 106 +++++++++-
> 37 files changed, 942 insertions(+), 91 deletions(-)
> create mode 100644 drivers/cxl/core/private_region/Makefile
> create mode 100644 drivers/cxl/core/private_region/private_region.c
> create mode 100644 drivers/cxl/core/private_region/private_region.h
> create mode 100644 drivers/cxl/core/private_region/zswap.c
> ---
> base-commit: 803dd4b1159cf9864be17aab8a17653e6ecbbbb6
>
Thanks,
Balbir