linux-mm.kvack.org archive mirror
From: Balbir Singh <balbirs@nvidia.com>
To: Gregory Price <gourry@gourry.net>,
	linux-mm@kvack.org, cgroups@vger.kernel.org,
	linux-cxl@vger.kernel.org
Cc: linux-doc@vger.kernel.org, linux-kernel@vger.kernel.org,
	linux-fsdevel@vger.kernel.org, kernel-team@meta.com,
	longman@redhat.com, tj@kernel.org, hannes@cmpxchg.org,
	mkoutny@suse.com, corbet@lwn.net, gregkh@linuxfoundation.org,
	rafael@kernel.org, dakr@kernel.org, dave@stgolabs.net,
	jonathan.cameron@huawei.com, dave.jiang@intel.com,
	alison.schofield@intel.com, vishal.l.verma@intel.com,
	ira.weiny@intel.com, dan.j.williams@intel.com,
	akpm@linux-foundation.org, vbabka@suse.cz, surenb@google.com,
	mhocko@suse.com, jackmanb@google.com, ziy@nvidia.com,
	david@kernel.org, lorenzo.stoakes@oracle.com,
	Liam.Howlett@oracle.com, rppt@kernel.org,
	axelrasmussen@google.com, yuanchu@google.com, weixugc@google.com,
	yury.norov@gmail.com, linux@rasmusvillemoes.dk,
	rientjes@google.com, shakeel.butt@linux.dev, chrisl@kernel.org,
	kasong@tencent.com, shikemeng@huaweicloud.com, nphamcs@gmail.com,
	bhe@redhat.com, baohua@kernel.org, yosry.ahmed@linux.dev,
	chengming.zhou@linux.dev, roman.gushchin@linux.dev,
	muchun.song@linux.dev, osalvador@suse.de,
	matthew.brost@intel.com, joshua.hahnjy@gmail.com,
	rakie.kim@sk.com, byungchul@sk.com, ying.huang@linux.alibaba.com,
	apopple@nvidia.com, cl@gentwo.org, harry.yoo@oracle.com,
	zhengqi.arch@bytedance.com
Subject: Re: [RFC PATCH v3 0/8] mm,numa: N_PRIVATE node isolation for device-managed memory
Date: Mon, 12 Jan 2026 22:12:23 +1100	[thread overview]
Message-ID: <6604d787-1744-4acf-80c0-e428fee1677e@nvidia.com> (raw)
In-Reply-To: <20260108203755.1163107-1-gourry@gourry.net>

On 1/9/26 06:37, Gregory Price wrote:
> This series introduces N_PRIVATE, a new node state for memory nodes 
> whose memory is not intended for general system consumption.  Today,
> device drivers (CXL, accelerators, etc.) hotplug their memory to access
> mm/ services like page allocation and reclaim, but this exposes general
> workloads to memory with different characteristics and reliability
> guarantees than system RAM.
> 
> N_PRIVATE provides isolation by default while enabling explicit access
> via __GFP_THISNODE for subsystems that understand how to manage these
> specialized memory regions.
> 

I assume each class of N_PRIVATE corresponds to a separate set of NUMA nodes;
could these be real or virtual memory nodes?

> Motivation
> ==========
> 
> Several emerging memory technologies require kernel memory management
> services but should not be used for general allocations:
> 
>   - CXL Compressed RAM (CRAM): Hardware-compressed memory where the
>     effective capacity depends on data compressibility.  Uncontrolled
>     use risks capacity exhaustion when compression ratios degrade.
> 
>   - Accelerator Memory: GPU/TPU-attached memory optimized for specific
>     access patterns that are not intended for general allocation.
> 
>   - Tiered Memory: Memory intended only as a demotion target, not for
>     initial allocations.
> 
> Currently, these devices either avoid hotplugging entirely (losing mm/
> services) or hotplug as regular N_MEMORY (risking reliability issues).
> N_PRIVATE solves this by creating an isolated node class.
> 
> Design
> ======
> 
> The series introduces:
> 
>   1. N_PRIVATE node state (mutually exclusive with N_MEMORY)

We should call it N_PRIVATE_MEMORY

>   2. private_memtype enum for policy-based access control
>   3. cpuset.mems.sysram for user-visible isolation
>   4. Integration points for subsystems (zswap demonstrated)
>   5. A cxl private_region example to demonstrate full plumbing
> 
> Private Memory Types (private_memtype)
> ======================================
> 
> The private_memtype enum defines policy bits that control how different
> kernel subsystems may access private nodes:
> 
>   enum private_memtype {
>       NODE_MEM_NOTYPE,      /* No type assigned (invalid state) */
>       NODE_MEM_ZSWAP,       /* Swap compression target */
>       NODE_MEM_COMPRESSED,  /* General compressed RAM */
>       NODE_MEM_ACCELERATOR, /* Accelerator-attached memory */
>       NODE_MEM_DEMOTE_ONLY, /* Memory-tier demotion target only */
>       NODE_MAX_MEMTYPE,
>   };
> 
> These types serve as policy hints for subsystems:
> 

Do these nodes have fallback(s)? Are these nodes prone to OOM when memory is exhausted
in one class of N_PRIVATE node(s)?


What about page cache allocations from these nodes? Since default allocations
never use them, a filesystem would need to do additional work to allocate
on them, if there were ever a desire to use them. Would memory
migration work between N_PRIVATE and N_MEMORY using move_pages()?


> NODE_MEM_ZSWAP
> --------------
> Nodes with this type are registered as zswap compression targets.  When
> zswap compresses a page, it can allocate directly from ZSWAP-typed nodes
> using __GFP_THISNODE, bypassing software compression if the device
> provides hardware compression.
> 
> Example flow:
>   1. CXL device creates private_region with type=zswap
>   2. Driver calls node_register_private() with NODE_MEM_ZSWAP
>   3. zswap_add_direct_node() registers the node as a compression target
>   4. On swap-out, zswap allocates from the private node
>   5. page_allocated() callback validates compression ratio headroom
>   6. page_freed() callback zeros pages to improve device compression
> 
> Prototype Note:
>   This patch set does not actually do compression ratio validation, as
>   this requires an actual device to provide some kind of counter and/or
>   interrupt to denote when allocations are safe.  The callbacks are
>   left as stubs with TODOs for device vendors to pick up the next step
>   (we'll continue with a QEMU example if reception is positive).
> 
>   For now, this always succeeds because compressed=real capacity.
> 
> NODE_MEM_COMPRESSED (CRAM)
> --------------------------
> For general compressed RAM devices.  Unlike ZSWAP nodes, CRAM nodes
> could be exposed to subsystems that understand compression semantics:
> 
>   - vmscan: Could prefer demoting pages to CRAM nodes before swap
>   - memory-tiering: Could place CRAM between DRAM and persistent memory
>   - zram: Could use as backing store instead of or alongside zswap
> 
> Such a component (mm/cram.c) would differ from zswap or zram by allowing
> the compressed pages to remain mapped Read-Only in the page table.
> 
> NODE_MEM_ACCELERATOR
> --------------------
> For GPU/TPU/accelerator-attached memory.  Policy implications:
> 
>   - Default allocations: Never (isolated from general page_alloc)
>   - GPU drivers: Explicit allocation via __GFP_THISNODE
>   - NUMA balancing: Excluded from automatic migration
>   - Memory tiering: Not a demotion target
> 
> Some GPU vendors want management of their memory via NUMA nodes, but
> don't want fallback or migration allocations to occur.  This enables
> that pattern.
> 
> mm/mempolicy.c could be used to allow for N_PRIVATE nodes of this type
> if the intent is per-vma access to accelerator memory (e.g. via mbind),
> but this is omitted from this series for now to limit userland
> exposure until first-class examples are provided.
> 
> NODE_MEM_DEMOTE_ONLY
> --------------------
> For memory intended exclusively as a demotion target in memory tiering:
> 
>   - page_alloc: Never allocates initially (slab, page faults, etc.)
>   - vmscan/reclaim: Valid demotion target during memory pressure
>   - memory-tiering: Allow hotness monitoring/promotion for this region
> 
> This enables "cold storage" tiers using slower/cheaper memory (CXL-
> attached DRAM, persistent memory in volatile mode) without the memory
> appearing in allocation fast paths.
> 
> This also has the added benefit of restricting memory placement on
> these nodes to movable allocations only (with all the normal caveats
> around page pinning).
> 
> Subsystem Integration Points
> ============================
> 
> The private_node_ops structure provides callbacks for integration:
> 
>   struct private_node_ops {
>       struct list_head list;
>       resource_size_t res_start;
>       resource_size_t res_end;
>       enum private_memtype memtype;
>       int (*page_allocated)(struct page *page, void *data);
>       void (*page_freed)(struct page *page, void *data);
>       void *data;
>   };
> 
> page_allocated(): Called after allocation, returns 0 to accept or
> -ENOSPC/-ENODEV to reject (caller retries elsewhere).  Enables:
>   - Compression ratio enforcement for CRAM/zswap
>   - Capacity tracking for accelerator memory
>   - Rate limiting for demotion targets
> 
> page_freed(): Called on free, enables:
>   - Zeroing for compression ratio recovery
>   - Capacity accounting updates
>   - Device-specific cleanup
> 
> Isolation Enforcement
> =====================
> 
> The series modifies core allocators to respect N_PRIVATE isolation:
> 
>   - page_alloc: Constrains zone iteration to cpuset.mems.sysram
>   - slub: Allocates only from N_MEMORY nodes
>   - compaction: Skips N_PRIVATE nodes
>   - mempolicy: Uses sysram_nodes for policy evaluation
> 
> __GFP_THISNODE bypasses isolation, enabling explicit access:
> 
>   page = alloc_pages_node(private_nid, GFP_KERNEL | __GFP_THISNODE, 0);
> 
> This pattern is used by zswap, and would be used by other subsystems
> that explicitly opt into private node access.
> 
> User-Visible Changes
> ====================
> 
> cpuset gains cpuset.mems.sysram (read-only), which shows N_MEMORY nodes.
> 
> ABI: /proc/<pid>/status Mems_allowed shows sysram nodes only.
> 
> Drivers create private regions via sysfs:
>   echo region0 > /sys/bus/cxl/.../create_private_region
>   echo zswap > /sys/bus/cxl/.../region0/private_type
>   echo 1 > /sys/bus/cxl/.../region0/commit
> 
> Series Organization
> ===================
> 
> Patch 1: numa,memory_hotplug: create N_PRIVATE (Private Nodes)
>   Core infrastructure: N_PRIVATE node state, node_mark_private(),
>   private_memtype enum, and private_node_ops registration.
> 
> Patch 2: mm: constify oom_control, scan_control, and alloc_context 
> nodemask
>   Preparatory cleanup for enforcing that nodemasks don't change.
> 
> Patch 3: mm: restrict slub, compaction, and page_alloc to sysram
>   Enforce N_MEMORY-only allocation for general paths.
> 
> Patch 4: cpuset: introduce cpuset.mems.sysram
>   User-visible isolation via cpuset interface.
> 
> Patch 5: Documentation/admin-guide/cgroups: update docs for mems_allowed
>   Document the new behavior and sysram_nodes.
> 
> Patch 6: drivers/cxl/core/region: add private_region
>   CXL infrastructure for private regions.
> 
> Patch 7: mm/zswap: compressed ram direct integration
>   Zswap integration demonstrating direct hardware compression.
> 
> Patch 8: drivers/cxl: add zswap private_region type
>   Complete example: CXL region as zswap compression target.
> 
> Future Work
> ===========
> 
> This series provides the foundation.  Planned follow-ups include:
> 
>   - CRAM integration with vmscan for smart demotion
>   - ACCELERATOR type for GPU memory management
>   - Memory-tiering integration with DEMOTE_ONLY nodes
> 
> Testing
> =======
> 
> All patches build cleanly.  Tested with:
>   - CXL QEMU emulation with private regions
>   - Zswap stress tests with private compression targets
>   - Cpuset verification of mems.sysram isolation
> 
> 
> Gregory Price (8):
>   numa,memory_hotplug: create N_PRIVATE (Private Nodes)
>   mm: constify oom_control, scan_control, and alloc_context nodemask
>   mm: restrict slub, compaction, and page_alloc to sysram
>   cpuset: introduce cpuset.mems.sysram
>   Documentation/admin-guide/cgroups: update docs for mems_allowed
>   drivers/cxl/core/region: add private_region
>   mm/zswap: compressed ram direct integration
>   drivers/cxl: add zswap private_region type
> 
>  .../admin-guide/cgroup-v1/cpusets.rst         |  19 +-
>  Documentation/admin-guide/cgroup-v2.rst       |  26 ++-
>  Documentation/filesystems/proc.rst            |   2 +-
>  drivers/base/node.c                           | 199 ++++++++++++++++++
>  drivers/cxl/core/Makefile                     |   1 +
>  drivers/cxl/core/core.h                       |   4 +
>  drivers/cxl/core/port.c                       |   4 +
>  drivers/cxl/core/private_region/Makefile      |  12 ++
>  .../cxl/core/private_region/private_region.c  | 129 ++++++++++++
>  .../cxl/core/private_region/private_region.h  |  14 ++
>  drivers/cxl/core/private_region/zswap.c       | 127 +++++++++++
>  drivers/cxl/core/region.c                     |  63 +++++-
>  drivers/cxl/cxl.h                             |  22 ++
>  include/linux/cpuset.h                        |  24 ++-
>  include/linux/gfp.h                           |   6 +
>  include/linux/mm.h                            |   4 +-
>  include/linux/mmzone.h                        |   6 +-
>  include/linux/node.h                          |  60 ++++++
>  include/linux/nodemask.h                      |   1 +
>  include/linux/oom.h                           |   2 +-
>  include/linux/swap.h                          |   2 +-
>  include/linux/zswap.h                         |   5 +
>  kernel/cgroup/cpuset-internal.h               |   8 +
>  kernel/cgroup/cpuset-v1.c                     |   8 +
>  kernel/cgroup/cpuset.c                        |  98 ++++++---
>  mm/compaction.c                               |   6 +-
>  mm/internal.h                                 |   2 +-
>  mm/memcontrol.c                               |   2 +-
>  mm/memory_hotplug.c                           |   2 +-
>  mm/mempolicy.c                                |   6 +-
>  mm/migrate.c                                  |   4 +-
>  mm/mmzone.c                                   |   5 +-
>  mm/page_alloc.c                               |  31 +--
>  mm/show_mem.c                                 |   9 +-
>  mm/slub.c                                     |   8 +-
>  mm/vmscan.c                                   |   6 +-
>  mm/zswap.c                                    | 106 +++++++++-
>  37 files changed, 942 insertions(+), 91 deletions(-)
>  create mode 100644 drivers/cxl/core/private_region/Makefile
>  create mode 100644 drivers/cxl/core/private_region/private_region.c
>  create mode 100644 drivers/cxl/core/private_region/private_region.h
>  create mode 100644 drivers/cxl/core/private_region/zswap.c
> ---
> base-commit: 803dd4b1159cf9864be17aab8a17653e6ecbbbb6
> 

Thanks,
Balbir




Thread overview: 16+ messages
2026-01-08 20:37 Gregory Price
2026-01-08 20:37 ` [RFC PATCH v3 1/8] numa,memory_hotplug: create N_PRIVATE (Private Nodes) Gregory Price
2026-01-08 20:37 ` [RFC PATCH v3 2/8] mm: constify oom_control, scan_control, and alloc_context nodemask Gregory Price
2026-01-08 20:37 ` [RFC PATCH v3 3/8] mm: restrict slub, compaction, and page_alloc to sysram Gregory Price
2026-01-08 20:37 ` [RFC PATCH v3 4/8] cpuset: introduce cpuset.mems.sysram Gregory Price
2026-01-08 20:37 ` [RFC PATCH v3 5/8] Documentation/admin-guide/cgroups: update docs for mems_allowed Gregory Price
2026-01-12 14:30   ` Michal Koutný
2026-01-12 15:25     ` Gregory Price
2026-01-08 20:37 ` [RFC PATCH v3 6/8] drivers/cxl/core/region: add private_region Gregory Price
2026-01-08 20:37 ` [RFC PATCH v3 7/8] mm/zswap: compressed ram direct integration Gregory Price
2026-01-09 16:00   ` Yosry Ahmed
2026-01-09 17:03     ` Gregory Price
2026-01-09 21:40     ` Gregory Price
2026-01-08 20:37 ` [RFC PATCH v3 8/8] drivers/cxl: add zswap private_region type Gregory Price
2026-01-12 11:12 ` Balbir Singh [this message]
2026-01-12 14:36   ` [RFC PATCH v3 0/8] mm,numa: N_PRIVATE node isolation for device-managed memory Gregory Price
