* [LSF/MM/BPF TOPIC][RFC PATCH v4 00/27] Private Memory Nodes (w/ Compressed RAM)
@ 2026-02-22 8:48 Gregory Price
2026-02-22 8:48 ` [RFC PATCH v4 01/27] numa: introduce N_MEMORY_PRIVATE node state Gregory Price
` (26 more replies)
0 siblings, 27 replies; 28+ messages in thread
From: Gregory Price @ 2026-02-22 8:48 UTC (permalink / raw)
To: lsf-pc
Cc: linux-kernel, linux-cxl, cgroups, linux-mm, linux-trace-kernel,
damon, kernel-team, gregkh, rafael, dakr, dave, jonathan.cameron,
dave.jiang, alison.schofield, vishal.l.verma, ira.weiny,
dan.j.williams, longman, akpm, david, lorenzo.stoakes,
Liam.Howlett, vbabka, rppt, surenb, mhocko, osalvador, ziy,
matthew.brost, joshua.hahnjy, rakie.kim, byungchul, gourry,
ying.huang, apopple, axelrasmussen, yuanchu, weixugc, yury.norov,
linux, mhiramat, mathieu.desnoyers, tj, hannes, mkoutny,
jackmanb, sj, baolin.wang, npache, ryan.roberts, dev.jain,
baohua, lance.yang, muchun.song, xu.xin16, chengming.zhou, jannh,
linmiaohe, nao.horiguchi, pfalcato, rientjes, shakeel.butt, riel,
harry.yoo, cl, roman.gushchin, chrisl, kasong, shikemeng,
nphamcs, bhe, zhengqi.arch, terry.bowman
Topic type: MM
Presenter: Gregory Price <gourry@gourry.net>
This series introduces N_MEMORY_PRIVATE, a NUMA node state for memory
managed by the buddy allocator but excluded from normal allocations.
I present it with an end-to-end Compressed RAM service (mm/cram.c)
that would otherwise not be possible (or would be considerably more
difficult, be device-specific, and add to the ZONE_DEVICE boondoggle).
TL;DR
===
N_MEMORY_PRIVATE is all about isolating NUMA nodes and then punching
explicit holes in that isolation to do useful things we couldn't do
before without re-implementing entire portions of mm/ in a driver.
/* This is my memory. There are many like it, but this one is mine. */
rc = add_private_memory_driver_managed(nid, start, size, name, flags,
online_type, private_context);
page = alloc_pages_node(nid, __GFP_PRIVATE, 0);
/* Ok but I want to do something useful with it */
static const struct node_private_ops ops = {
.migrate_to = my_migrate_to,
.folio_migrate = my_folio_migrate,
.flags = NP_OPS_MIGRATION | NP_OPS_MEMPOLICY,
};
node_private_set_ops(nid, &ops);
/* And now I can use mempolicy with my memory */
buf = mmap(...);
mbind(buf, len, mode, private_node, ...);
buf[0] = 0xdeadbeef; /* Faults onto private node */
/* And to be clear, no one else gets my memory */
buf2 = malloc(4096); /* Standard allocation */
buf2[0] = 0xdeadbeef; /* Can never land on private node */
/* But I can choose to migrate it to the private node */
move_pages(0, 1, &buf, &private_node, NULL, ...);
/* And more fun things like this */
Patchwork
===
A fully working branch based on cxl/next can be found here:
https://github.com/gourryinverse/linux/tree/private_compression
A QEMU device which can inject high/low interrupts can be found here:
https://github.com/gourryinverse/qemu/tree/compressed_cxl_clean
The additional patches on these branches are CXL and DAX driver
housecleaning that is only tangentially relevant to this RFC, so I've
omitted them here to keep this posting somewhat clean.
Those patches should (hopefully) be going upstream anyway.
Patches 1-22: Core Private Node Infrastructure
Patch 1: Introduce N_MEMORY_PRIVATE scaffolding
Patch 2: Introduce __GFP_PRIVATE
Patch 3: Apply allocation isolation mechanisms
Patch 4: Add N_MEMORY nodes to private fallback lists
Patches 5-9: Filter operations not yet supported (mlock, madvise, KSM, khugepaged)
Patch 10: free_folio callback
Patch 11: split_folio callback
Patches 12-20: mm/ service opt-ins:
Migration, Mempolicy, Demotion, Write Protect,
Reclaim, OOM, NUMA Balancing, Compaction,
LongTerm Pinning
Patch 21: memory_failure callback
Patch 22: Memory hotplug plumbing for private nodes
Patch 23: mm/cram -- Compressed RAM Management
Patches 24-27: CXL Driver examples
Sysram Regions with Private node support
Basic Driver Example: (MIGRATION | MEMPOLICY)
Compression Driver Example (Generic)
Background
===
Today, drivers that want mm-like services on non-general-purpose
memory either use ZONE_DEVICE (self-managed memory) or hotplug into
N_MEMORY and accept the risk of uncontrolled allocation.
Neither option provides what we really want - the ability to:
1) selectively participate in mm/ subsystems, while
2) isolating that memory from general purpose use.
Some device-attached memory cannot be managed as fully general-purpose
system RAM. CXL devices with inline compression, for example, may
corrupt data or crash the machine if the compression ratio drops
below a threshold -- we simply run out of physical memory.
This is a hard problem to solve: how does an operating system deal
with a device that basically lies about how much capacity it has?
(We'll discuss that in the CRAM section)
Core Proposal: N_MEMORY_PRIVATE
===
Introduce N_MEMORY_PRIVATE, a NUMA node state for memory managed by
the buddy allocator, but excluded from normal allocation paths.
Private nodes:
- Are filtered from zonelist fallback: all existing callers to
get_page_from_freelist cannot reach these nodes through any
normal fallback mechanism.
- Filter allocation requests on __GFP_PRIVATE
numa_zone_alloc_allowed() excludes them otherwise.
Applies to systems with and without cpusets.
GFP_PRIVATE is (__GFP_PRIVATE | __GFP_THISNODE).
Services use it when they need to allocate specifically from
a private node (e.g., CRAM allocating a destination folio).
No existing allocator path sets __GFP_PRIVATE, so private nodes
are unreachable by default.
- Use standard struct page / folio. No ZONE_DEVICE, no pgmap,
no struct page metadata limitations.
- Use a node-scoped metadata structure to accomplish filtering
and callback support.
- May participate in the buddy allocator, reclaim, compaction,
and LRU like normal memory, gated by an opt-in set of flags.
The key abstraction is node_private_ops: a per-node callback table
registered by a driver or service.
Each callback is individually gated by an NP_OPS_* capability flag.
A driver opts in only to the mm/ operations it needs.
It is similar to ZONE_DEVICE's pgmap at a node granularity.
In fact...
Re-use of ZONE_DEVICE Hooks
===
The callback insertion points deliberately mirror existing ZONE_DEVICE
hooks to minimize the surface area of the mechanism.
I believe this could subsume most DEVICE_COHERENT users, and greatly
simplify the device-managed memory development process (no more
per-driver allocator and migration code).
(Also it's just "So Fresh, So Clean").
The base set of callbacks introduced includes the following (a rough
sketch of the full table follows this list):
free_folio - mirrors ZONE_DEVICE's
free_zone_device_page() hook in
__folio_put() / folios_put_refs()
folio_split - called when a huge folio on the private
node is split up
migrate_to - demote_folio_list() custom demotion (same
site as ZONE_DEVICE demotion rejection)
folio_migrate - called when private node folio is moved to
another location (e.g. compaction)
handle_fault - mirrors the ZONE_DEVICE fault dispatch in
handle_pte_fault() (do_wp_page path)
reclaim_policy - called by reclaim to let a driver own the
boost lifecycle (so the driver can drive node reclaim)
memory_failure - parallels memory_failure_dev_pagemap(),
but for online pages that enter the normal
hwpoison path
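For orientation, here is a rough sketch of what a fully-populated ops
table might look like once all of the above are wired up. The signatures
are guesses pieced together from the call sites quoted in this letter
(only handle_fault's shape is actually shown, in
folio_managed_handle_fault() below); the series itself starts from a
bare flags field in patch 1 and adds callbacks one patch at a time:

struct node_private_ops {
	unsigned long flags;	/* NP_OPS_* capability bits */

	/* Signatures below are illustrative guesses, not the patch definitions */
	void (*free_folio)(struct folio *folio);
	void (*folio_split)(struct folio *folio, struct folio *new_folio);
	int (*migrate_to)(struct list_head *folios, int target_nid);
	void (*folio_migrate)(struct folio *src, struct folio *dst);
	vm_fault_t (*handle_fault)(struct vm_fault *vmf);
	void (*reclaim_policy)(int nid, bool under_pressure);
	int (*memory_failure)(struct page *page, int flags);
};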
At skip sites (mlock, madvise, KSM, user migration), a unified
folio_is_private_managed() predicate covers both ZONE_DEVICE and
N_MEMORY_PRIVATE folios, consolidating existing zone_device checks
with private node checks rather than adding new ones.
static inline bool folio_is_private_managed(struct folio *folio)
{
return folio_is_zone_device(folio) ||
folio_is_private_node(folio);
}
Most integration points become a one-line swap:
- if (folio_is_zone_device(folio))
+ if (unlikely(folio_is_private_managed(folio)))
Where a one-line swap is insufficient, the integration is kept as
close to the existing zone_device handling as possible, rather than
simply adding more call sites on top of it:
static inline bool folio_managed_handle_fault(struct folio *folio,
struct vm_fault *vmf, vm_fault_t *ret)
{
/* Zone device pages use swap entries; handled in do_swap_page */
if (folio_is_zone_device(folio))
return false;
if (folio_is_private_node(folio)) {
const struct node_private_ops *ops = folio_node_private_ops(folio);
if (ops && ops->handle_fault) {
*ret = ops->handle_fault(vmf);
return true;
}
}
return false;
}
Flag-gated behavior (NP_OPS_*) controls:
===
We use OPS flags to denote what mm/ services we want to allow on our
private node. I've plumbed these through so far:
NP_OPS_MIGRATION - Node supports migration
NP_OPS_MEMPOLICY - Node supports mempolicy actions
NP_OPS_DEMOTION - Node appears in demotion target lists
NP_OPS_PROTECT_WRITE - Node memory is read-only (wrprotect)
NP_OPS_RECLAIM - Node supports reclaim
NP_OPS_NUMA_BALANCING - Node supports numa balancing
NP_OPS_COMPACTION - Node supports compaction
NP_OPS_LONGTERM_PIN - Node supports longterm pinning
NP_OPS_OOM_ELIGIBLE - (MIGRATION | DEMOTION), node is reachable
as normal system ram storage, so it should
be considered in OOM pressure calculations.
I wasn't quite sure how to classify ksm, khugepaged, madvise, and
mlock - so I have omitted those for now.
Most hooks are straightforward.
Including a node as a demotion-eligible target was as simple as:
static void establish_demotion_targets(void)
{
..... snip .....
/*
* Include private nodes that have opted in to demotion
* via NP_OPS_DEMOTION. Such a node may provide a custom
* migrate_to callback for the actual demotion.
*/
all_memory = node_states[N_MEMORY];
for_each_node_state(node, N_MEMORY_PRIVATE) {
if (node_private_has_flag(node, NP_OPS_DEMOTION))
node_set(node, all_memory);
}
..... snip .....
}
The Migration and Mempolicy support are the two most complex pieces,
and most useful things are built on top of Migration (meaning the
remaining implementations are usually simple).
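To give a feel for what "custom demotion" means in practice, here is a
hedged sketch of the kind of dispatch the NP_OPS_DEMOTION path enables
at the demote_folio_list() hook point. node_private_target_ops() is a
made-up helper used purely for illustration, and the migrate_to
signature is a guess -- the real plumbing (including the temporary
refcount on node_private described in patch 1) is in patches 12 and 14:

/* Illustrative only: demote @folios to a private target node */
static unsigned int demote_to_private_node(struct list_head *folios,
					   int target_nid)
{
	const struct node_private_ops *ops;

	if (!node_private_has_flag(target_nid, NP_OPS_DEMOTION))
		return 0;	/* node did not opt in to demotion */

	ops = node_private_target_ops(target_nid);	/* hypothetical helper */
	if (!ops || !ops->migrate_to)
		return 0;

	/* Driver-owned demotion; may throttle or refuse (backpressure) */
	return ops->migrate_to(folios, target_nid);
}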
Private Node Hotplug Lifecycle
===
Registration follows a strict order enforced by
add_private_memory_driver_managed():
1. Driver calls add_private_memory_driver_managed(nid, start,
size, resource_name, mhp_flags, online_type, &np).
2. node_private_register(nid, &np) stores the driver's
node_private in pgdat and sets pgdat->private. N_MEMORY and
N_MEMORY_PRIVATE are mutually exclusive -- registration fails
with -EBUSY if the node already has N_MEMORY set.
Only one driver may register per private node.
3. Memory is hotplugged via __add_memory_driver_managed().
When online_pages() runs, it checks pgdat->private and sets
N_MEMORY_PRIVATE instead of N_MEMORY.
Zonelist construction gives private nodes a self-only NOFALLBACK
list and an N_MEMORY fallback list (so kernel/slab allocations on
behalf of private node work can fall back to DRAM).
4. kswapd and kcompactd are NOT started for private nodes. The
owning service is responsible for driving reclaim if needed
(e.g., CRAM uses watermark_boost to wake kswapd on demand).
Teardown is the reverse:
1. Driver calls offline_and_remove_private_memory(nid, start,
size).
2. offline_pages() offlines the memory. When the last block is
offlined, N_MEMORY_PRIVATE is cleared automatically.
3. node_private_unregister() clears pgdat->node_private and
drops the refcount. It refuses to unregister (-EBUSY) if
N_MEMORY_PRIVATE is still set (other memory ranges remain).
The driver is responsible for ensuring memory is hot-unpluggable
before teardown. The service must ensure all memory is cleaned
up before hot-unplug - or the service must support migration (so
memory_hotplug.c can evacuate the memory itself).
In the CRAM example, the service supports migration, so memory
hot-unplug can remove memory without any special infrastructure.
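Putting both halves together, a driver's bring-up and teardown reduce
to a handful of calls. A condensed sketch (error handling omitted;
my_ops is the ops table from the TL;DR above, and the resource name,
MHP_NONE, MMOP_ONLINE, and the clear_ops ordering are placeholder
choices, not prescriptions from the series):

static struct node_private my_np;	/* driver-owned; .owner may point at driver context */

static int my_bringup(int nid, u64 start, u64 size)
{
	int rc;

	/* Steps 1-3: register np, hotplug, online as N_MEMORY_PRIVATE */
	rc = add_private_memory_driver_managed(nid, start, size,
					       "my-private-mem", MHP_NONE,
					       MMOP_ONLINE, &my_np);
	if (rc)
		return rc;

	/* Opt in to only the mm/ services this node should participate in */
	return node_private_set_ops(nid, &my_ops);
}

static void my_teardown(int nid, u64 start, u64 size)
{
	/* Reverse order: offline/remove memory, then drop ops and registration */
	offline_and_remove_private_memory(nid, start, size);
	node_private_clear_ops(nid, &my_ops);
	node_private_unregister(nid);
}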
Application: Compressed RAM (mm/cram)
===
Compressed RAM has a serious design issue: its capacity is a lie.
A compression device reports more capacity than it physically has.
If workloads write faster than the OS can reclaim from the device,
we run out of real backing store and corrupt data or crash.
I call this problem "Trying to Outrun a Bear":
i.e., this is only stable as long as we stay ahead of the pressure.
We don't want to design a system where stability depends on outrunning
a bear - I am slow and do not know where to acquire bear spray.
Fun fact: Grizzly bears have a top-speed of 56-64 km/h.
Unfun Fact: Humans typically top out at ~24 km/h.
This MVP takes a conservative position:
all compressed memory is mapped read-only.
- Folios reach the private node only via reclaim (demotion)
- migrate_to implements custom demotion with backpressure.
- fixup_migration_pte write-protects PTEs on arrival.
- wrprotect hooks prevent silent upgrades
- handle_fault promotes folios back to DRAM on write.
- free_folio scrubs stale data before buddy free.
Because pages are read-only, writes can never cause runaway
compression ratio loss behind the allocator's back. Every write
goes through handle_fault, which promotes the folio to DRAM first.
The device only ever sees net compression (demotion in) and explicit
decompression (promotion out via fault or reclaim), and has a much
wider timeframe to respond to poor compression scenarios.
That means there's no bear to outrun. The bears are safely asleep in
their bear den, and even if they show up we have a bear-proof cage.
The backpressure system is our bear-proof cage: the driver reports real
device utilization (generalized via watermark_boost on the private
node's zone), and CRAM throttles demotion when capacity is tight.
If compression ratios are bad, we stop demoting pages and start
evicting pages aggressively.
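As a sketch of how simple that throttle can be (this is not the actual
mm/cram.c code -- the function name and the assumption that the private
node's memory sits in ZONE_NORMAL are illustrative):

/* Illustrative: refuse new demotions when the private node is tight */
static bool cram_accept_demotion(int nid, unsigned int nr_pages)
{
	struct zone *zone = &NODE_DATA(nid)->node_zones[ZONE_NORMAL];
	unsigned long free = zone_page_state(zone, NR_FREE_PAGES);

	/* high_wmark_pages() already includes the driver-driven watermark_boost */
	return free > high_wmark_pages(zone) + nr_pages;
}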
The service as designed is ~350 functional lines of code because it
re-uses mm/ services:
- Existing reclaim/vmscan code handles demotion.
- Existing migration code handles migration to/from.
- Existing page fault handling dispatches faults.
The driver contains all the CXL nastiness core developers don't want
anything to do with - no vendor logic touches mm/ internals.
Future CRAM : Loosening the read-only constraint
===
The read-only model is safe but conservative. For workloads where
compressed pages are occasionally written, the promotion fault adds
latency. A future optimization could allow a tunable fraction of
compressed pages to be mapped writable, accepting some risk of
write-driven decompression in exchange for lower overhead.
The private node ops make this straightforward:
- Adjust fixup_migration_pte to selectively skip
write-protection.
- Use the backpressure system to either revoke writable mappings,
deny additional demotions, or evict when device pressure rises.
This comes at a mild memory overhead: 32MB of DRAM per 1TB of CRAM.
(1 bit per 4KB page).
This is not proposed here, but it should be somewhat trivial.
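For reference, the overhead arithmetic, and a sketch of how such a
per-node "writable" bitmap might be sized (not part of this series;
the function name is illustrative):

/*
 * 1 TiB / 4 KiB = 2^28 pages -> 2^28 bits = 2^25 bytes = 32 MiB of
 * bitmap per 1 TiB of CRAM (one "is this page mapped writable?" bit
 * per base page).
 */
static unsigned long *cram_alloc_writable_bitmap(u64 bytes)
{
	unsigned long nr_pages = bytes >> PAGE_SHIFT;

	/* ~32 MiB per 1 TiB, so use vzalloc rather than kmalloc */
	return vzalloc(BITS_TO_LONGS(nr_pages) * sizeof(unsigned long));
}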
Discussion Topics
===
0. Obviously I've included the set as an RFC; please rip it apart.
1. Is N_MEMORY_PRIVATE the right isolation abstraction, or should
this extend ZONE_DEVICE? Prior feedback pushed away from new
ZONE logic, but this will likely be debated further.
My comments on this:
ZONE_DEVICE requires re-implementing every service you want to
provide to your device memory, including basic allocation.
Private nodes use real struct pages with no metadata
limitations, participate in the buddy allocator, and get NUMA
topology for free.
2. Can this subsume ZONE_DEVICE COHERENT users? The architecture
was designed with this in mind, but it is only a thought experiment.
3. Is a dedicated mm/ service (cram) the right place for compressed
memory management, or should this be purely driver-side until
more devices exist?
I wrote it this way because I foresee more "innovation" in the
compressed RAM space given current... uh... "Market Conditions".
I don't see CRAM being CXL-specific, though the only solutions I've
seen have been CXL. Nothing is stopping someone from soldering such
memory directly to a PCB.
4. Where is your hardware-backed data that shows this works?
I should have some by conference time.
Thanks for reading
Gregory (Gourry)
Gregory Price (27):
numa: introduce N_MEMORY_PRIVATE node state
mm,cpuset: gate allocations from N_MEMORY_PRIVATE behind __GFP_PRIVATE
mm/page_alloc: add numa_zone_allowed() and wire it up
mm/page_alloc: Add private node handling to build_zonelists
mm: introduce folio_is_private_managed() unified predicate
mm/mlock: skip mlock for managed-memory folios
mm/madvise: skip madvise for managed-memory folios
mm/ksm: skip KSM for managed-memory folios
mm/khugepaged: skip private node folios when trying to collapse.
mm/swap: add free_folio callback for folio release cleanup
mm/huge_memory.c: add private node folio split notification callback
mm/migrate: NP_OPS_MIGRATION - support private node user migration
mm/mempolicy: NP_OPS_MEMPOLICY - support private node mempolicy
mm/memory-tiers: NP_OPS_DEMOTION - support private node demotion
mm/mprotect: NP_OPS_PROTECT_WRITE - gate PTE/PMD write-upgrades
mm: NP_OPS_RECLAIM - private node reclaim participation
mm/oom: NP_OPS_OOM_ELIGIBLE - private node OOM participation
mm/memory: NP_OPS_NUMA_BALANCING - private node NUMA balancing
mm/compaction: NP_OPS_COMPACTION - private node compaction support
mm/gup: NP_OPS_LONGTERM_PIN - private node longterm pin support
mm/memory-failure: add memory_failure callback to node_private_ops
mm/memory_hotplug: add add_private_memory_driver_managed()
mm/cram: add compressed ram memory management subsystem
cxl/core: Add cxl_sysram region type
cxl/core: Add private node support to cxl_sysram
cxl: add cxl_mempolicy sample PCI driver
cxl: add cxl_compression PCI driver
drivers/base/node.c | 250 +++-
drivers/cxl/Kconfig | 2 +
drivers/cxl/Makefile | 2 +
drivers/cxl/core/Makefile | 1 +
drivers/cxl/core/core.h | 4 +
drivers/cxl/core/port.c | 2 +
drivers/cxl/core/region_sysram.c | 381 ++++++
drivers/cxl/cxl.h | 53 +
drivers/cxl/type3_drivers/Kconfig | 3 +
drivers/cxl/type3_drivers/Makefile | 3 +
.../cxl/type3_drivers/cxl_compression/Kconfig | 20 +
.../type3_drivers/cxl_compression/Makefile | 4 +
.../cxl_compression/compression.c | 1025 +++++++++++++++++
.../cxl/type3_drivers/cxl_mempolicy/Kconfig | 16 +
.../cxl/type3_drivers/cxl_mempolicy/Makefile | 4 +
.../type3_drivers/cxl_mempolicy/mempolicy.c | 297 +++++
include/linux/cpuset.h | 9 -
include/linux/cram.h | 66 ++
include/linux/gfp_types.h | 15 +-
include/linux/memory-tiers.h | 9 +
include/linux/memory_hotplug.h | 11 +
include/linux/migrate.h | 17 +-
include/linux/mm.h | 22 +
include/linux/mmzone.h | 16 +
include/linux/node_private.h | 532 +++++++++
include/linux/nodemask.h | 1 +
include/trace/events/mmflags.h | 4 +-
include/uapi/linux/mempolicy.h | 1 +
kernel/cgroup/cpuset.c | 49 +-
mm/Kconfig | 10 +
mm/Makefile | 1 +
mm/compaction.c | 32 +-
mm/cram.c | 508 ++++++++
mm/damon/paddr.c | 3 +
mm/huge_memory.c | 23 +-
mm/hugetlb.c | 2 +-
mm/internal.h | 226 +++-
mm/khugepaged.c | 7 +-
mm/ksm.c | 9 +-
mm/madvise.c | 5 +-
mm/memory-failure.c | 15 +
mm/memory-tiers.c | 46 +-
mm/memory.c | 26 +
mm/memory_hotplug.c | 122 +-
mm/mempolicy.c | 69 +-
mm/migrate.c | 63 +-
mm/mlock.c | 5 +-
mm/mprotect.c | 4 +-
mm/oom_kill.c | 52 +-
mm/page_alloc.c | 79 +-
mm/rmap.c | 4 +-
mm/slub.c | 3 +-
mm/swap.c | 21 +-
mm/vmscan.c | 55 +-
54 files changed, 4057 insertions(+), 152 deletions(-)
create mode 100644 drivers/cxl/core/region_sysram.c
create mode 100644 drivers/cxl/type3_drivers/Kconfig
create mode 100644 drivers/cxl/type3_drivers/Makefile
create mode 100644 drivers/cxl/type3_drivers/cxl_compression/Kconfig
create mode 100644 drivers/cxl/type3_drivers/cxl_compression/Makefile
create mode 100644 drivers/cxl/type3_drivers/cxl_compression/compression.c
create mode 100644 drivers/cxl/type3_drivers/cxl_mempolicy/Kconfig
create mode 100644 drivers/cxl/type3_drivers/cxl_mempolicy/Makefile
create mode 100644 drivers/cxl/type3_drivers/cxl_mempolicy/mempolicy.c
create mode 100644 include/linux/cram.h
create mode 100644 include/linux/node_private.h
create mode 100644 mm/cram.c
--
2.53.0
* [RFC PATCH v4 01/27] numa: introduce N_MEMORY_PRIVATE node state
2026-02-22 8:48 [LSF/MM/BPF TOPIC][RFC PATCH v4 00/27] Private Memory Nodes (w/ Compressed RAM) Gregory Price
@ 2026-02-22 8:48 ` Gregory Price
2026-02-22 8:48 ` [RFC PATCH v4 02/27] mm,cpuset: gate allocations from N_MEMORY_PRIVATE behind __GFP_PRIVATE Gregory Price
` (25 subsequent siblings)
26 siblings, 0 replies; 28+ messages in thread
From: Gregory Price @ 2026-02-22 8:48 UTC (permalink / raw)
To: lsf-pc
Cc: linux-kernel, linux-cxl, cgroups, linux-mm, linux-trace-kernel,
damon, kernel-team, gregkh, rafael, dakr, dave, jonathan.cameron,
dave.jiang, alison.schofield, vishal.l.verma, ira.weiny,
dan.j.williams, longman, akpm, david, lorenzo.stoakes,
Liam.Howlett, vbabka, rppt, surenb, mhocko, osalvador, ziy,
matthew.brost, joshua.hahnjy, rakie.kim, byungchul, gourry,
ying.huang, apopple, axelrasmussen, yuanchu, weixugc, yury.norov,
linux, mhiramat, mathieu.desnoyers, tj, hannes, mkoutny,
jackmanb, sj, baolin.wang, npache, ryan.roberts, dev.jain,
baohua, lance.yang, muchun.song, xu.xin16, chengming.zhou, jannh,
linmiaohe, nao.horiguchi, pfalcato, rientjes, shakeel.butt, riel,
harry.yoo, cl, roman.gushchin, chrisl, kasong, shikemeng,
nphamcs, bhe, zhengqi.arch, terry.bowman
N_MEMORY nodes are intended to contain general System RAM. Today, some
device drivers hotplug their memory (marked Specific Purpose or Reserved)
to get access to mm/ services, but don't intend it for general consumption.
Create N_MEMORY_PRIVATE for memory nodes whose memory is not intended for
general consumption. This state is mutually exclusive with N_MEMORY.
Add the node_private infrastructure for N_MEMORY_PRIVATE nodes:
- struct node_private: Per-node container stored in NODE_DATA(nid),
holding driver callbacks (ops), owner, and refcount.
- struct node_private_ops: Initial structure with void *reserved
placeholder and flags field. Callbacks will be added by subsequent
commits as each consumer is wired up.
- folio_is_private_node() / page_is_private_node(): check if a
folio/page resides on a private node.
- folio_node_private_ops() / node_private_flags(): retrieve the ops
vtable or flags for a folio's node.
- Registration API: node_private_register()/unregister() for drivers
to register callbacks for private nodes. Only one driver callback
can be registered per node - attempting to register different ops
returns -EBUSY.
- sysfs attribute exposing N_MEMORY_PRIVATE node state.
Zonelist construction changes for private nodes are deferred to a
subsequent commit.
Signed-off-by: Gregory Price <gourry@gourry.net>
---
drivers/base/node.c | 197 ++++++++++++++++++++++++++++++++
include/linux/mmzone.h | 4 +
include/linux/node_private.h | 210 +++++++++++++++++++++++++++++++++++
include/linux/nodemask.h | 1 +
4 files changed, 412 insertions(+)
create mode 100644 include/linux/node_private.h
diff --git a/drivers/base/node.c b/drivers/base/node.c
index 00cf4532f121..646dc48a23b5 100644
--- a/drivers/base/node.c
+++ b/drivers/base/node.c
@@ -22,6 +22,7 @@
#include <linux/swap.h>
#include <linux/slab.h>
#include <linux/memblock.h>
+#include <linux/node_private.h>
static const struct bus_type node_subsys = {
.name = "node",
@@ -861,6 +862,198 @@ void register_memory_blocks_under_node_hotplug(int nid, unsigned long start_pfn,
(void *)&nid, register_mem_block_under_node_hotplug);
return;
}
+
+static DEFINE_MUTEX(node_private_lock);
+static bool node_private_initialized;
+
+/**
+ * node_private_register - Register a private node
+ * @nid: Node identifier
+ * @np: The node_private structure (driver-allocated, driver-owned)
+ *
+ * Register a driver for a private node. Only one driver can register
+ * per node. If another driver has already registered (with different np),
+ * -EBUSY is returned. Re-registration with the same np is allowed.
+ *
+ * The driver owns the node_private memory and must ensure it remains valid
+ * until refcount reaches 0 after node_private_unregister().
+ *
+ * Returns 0 on success, negative errno on failure.
+ */
+int node_private_register(int nid, struct node_private *np)
+{
+ struct node_private *existing;
+ pg_data_t *pgdat;
+ int ret = 0;
+
+ if (!np || !node_possible(nid))
+ return -EINVAL;
+
+ if (!node_private_initialized)
+ return -ENODEV;
+
+ mutex_lock(&node_private_lock);
+ mem_hotplug_begin();
+
+ /* N_MEMORY_PRIVATE and N_MEMORY are mutually exclusive */
+ if (node_state(nid, N_MEMORY)) {
+ ret = -EBUSY;
+ goto out;
+ }
+
+ pgdat = NODE_DATA(nid);
+ existing = rcu_dereference_protected(pgdat->node_private,
+ lockdep_is_held(&node_private_lock));
+
+ /* Only one source may register this node */
+ if (existing) {
+ if (existing != np) {
+ ret = -EBUSY;
+ goto out;
+ }
+ goto out;
+ }
+
+ refcount_set(&np->refcount, 1);
+ init_completion(&np->released);
+
+ rcu_assign_pointer(pgdat->node_private, np);
+ pgdat->private = true;
+
+out:
+ mem_hotplug_done();
+ mutex_unlock(&node_private_lock);
+ return ret;
+}
+EXPORT_SYMBOL_GPL(node_private_register);
+
+/**
+ * node_private_set_ops - Set service callbacks on a registered private node
+ * @nid: Node identifier
+ * @ops: Service callbacks and flags (driver-owned, must outlive registration)
+ *
+ * Validates flag dependencies and sets the ops on the node's node_private.
+ * The node must already be registered via node_private_register().
+ *
+ * Returns 0 on success, -EINVAL for invalid flag combinations,
+ * -ENODEV if no node_private is registered on @nid.
+ */
+int node_private_set_ops(int nid, const struct node_private_ops *ops)
+{
+ struct node_private *np;
+ int ret = 0;
+
+ if (!ops)
+ return -EINVAL;
+
+ if (!node_possible(nid))
+ return -EINVAL;
+
+ mutex_lock(&node_private_lock);
+ np = rcu_dereference_protected(NODE_DATA(nid)->node_private,
+ lockdep_is_held(&node_private_lock));
+ if (!np)
+ ret = -ENODEV;
+ else
+ np->ops = ops;
+ mutex_unlock(&node_private_lock);
+ return ret;
+}
+EXPORT_SYMBOL_GPL(node_private_set_ops);
+
+/**
+ * node_private_clear_ops - Clear service callbacks from a private node
+ * @nid: Node identifier
+ * @ops: Expected ops pointer (must match current ops)
+ *
+ * Clears the ops only if @ops matches the currently registered ops,
+ * preventing one service from accidentally clearing another's callbacks.
+ *
+ * Returns 0 on success, -ENODEV if no node_private is registered,
+ * -EINVAL if @ops does not match.
+ */
+int node_private_clear_ops(int nid, const struct node_private_ops *ops)
+{
+ struct node_private *np;
+ int ret = 0;
+
+ if (!node_possible(nid))
+ return -EINVAL;
+
+ mutex_lock(&node_private_lock);
+ np = rcu_dereference_protected(NODE_DATA(nid)->node_private,
+ lockdep_is_held(&node_private_lock));
+ if (!np)
+ ret = -ENODEV;
+ else if (np->ops != ops)
+ ret = -EINVAL;
+ else
+ np->ops = NULL;
+ mutex_unlock(&node_private_lock);
+ return ret;
+}
+EXPORT_SYMBOL_GPL(node_private_clear_ops);
+
+/**
+ * node_private_unregister - Unregister a private node
+ * @nid: Node identifier
+ *
+ * Unregister the driver from a private node. Only succeeds if all memory
+ * has been offlined and the node is no longer N_MEMORY_PRIVATE.
+ * When successful, drops the refcount to 0 indicating the driver can
+ * free its context.
+ *
+ * N_MEMORY_PRIVATE state is cleared by offline_pages() when the last
+ * memory is offlined, not by this function.
+ *
+ * Return: 0 if unregistered, -EBUSY if N_MEMORY_PRIVATE is still set
+ * (other memory blocks remain on this node).
+ */
+int node_private_unregister(int nid)
+{
+ struct node_private *np;
+ pg_data_t *pgdat;
+
+ if (!node_possible(nid))
+ return 0;
+
+ mutex_lock(&node_private_lock);
+ mem_hotplug_begin();
+
+ pgdat = NODE_DATA(nid);
+ np = rcu_dereference_protected(pgdat->node_private,
+ lockdep_is_held(&node_private_lock));
+ if (!np) {
+ mem_hotplug_done();
+ mutex_unlock(&node_private_lock);
+ return 0;
+ }
+
+ /*
+ * Only unregister if all memory is offline and N_MEMORY_PRIVATE is
+ * cleared. N_MEMORY_PRIVATE is cleared by offline_pages() when the
+ * last memory block is offlined.
+ */
+ if (node_state(nid, N_MEMORY_PRIVATE)) {
+ mem_hotplug_done();
+ mutex_unlock(&node_private_lock);
+ return -EBUSY;
+ }
+
+ rcu_assign_pointer(pgdat->node_private, NULL);
+ pgdat->private = false;
+
+ mem_hotplug_done();
+ mutex_unlock(&node_private_lock);
+
+ synchronize_rcu();
+
+ if (!refcount_dec_and_test(&np->refcount))
+ wait_for_completion(&np->released);
+ return 0;
+}
+EXPORT_SYMBOL_GPL(node_private_unregister);
+
#endif /* CONFIG_MEMORY_HOTPLUG */
/**
@@ -959,6 +1152,7 @@ static struct node_attr node_state_attr[] = {
[N_HIGH_MEMORY] = _NODE_ATTR(has_high_memory, N_HIGH_MEMORY),
#endif
[N_MEMORY] = _NODE_ATTR(has_memory, N_MEMORY),
+ [N_MEMORY_PRIVATE] = _NODE_ATTR(has_private_memory, N_MEMORY_PRIVATE),
[N_CPU] = _NODE_ATTR(has_cpu, N_CPU),
[N_GENERIC_INITIATOR] = _NODE_ATTR(has_generic_initiator,
N_GENERIC_INITIATOR),
@@ -972,6 +1166,7 @@ static struct attribute *node_state_attrs[] = {
&node_state_attr[N_HIGH_MEMORY].attr.attr,
#endif
&node_state_attr[N_MEMORY].attr.attr,
+ &node_state_attr[N_MEMORY_PRIVATE].attr.attr,
&node_state_attr[N_CPU].attr.attr,
&node_state_attr[N_GENERIC_INITIATOR].attr.attr,
NULL
@@ -1007,5 +1202,7 @@ void __init node_dev_init(void)
panic("%s() failed to add node: %d\n", __func__, ret);
}
+ node_private_initialized = true;
+
register_memory_blocks_under_nodes();
}
diff --git a/include/linux/mmzone.h b/include/linux/mmzone.h
index b01cb1e49896..992eb1c5a2c6 100644
--- a/include/linux/mmzone.h
+++ b/include/linux/mmzone.h
@@ -25,6 +25,8 @@
#include <linux/zswap.h>
#include <asm/page.h>
+struct node_private;
+
/* Free memory management - zoned buddy allocator. */
#ifndef CONFIG_ARCH_FORCE_MAX_ORDER
#define MAX_PAGE_ORDER 10
@@ -1514,6 +1516,8 @@ typedef struct pglist_data {
atomic_long_t vm_stat[NR_VM_NODE_STAT_ITEMS];
#ifdef CONFIG_NUMA
struct memory_tier __rcu *memtier;
+ struct node_private __rcu *node_private;
+ bool private;
#endif
#ifdef CONFIG_MEMORY_FAILURE
struct memory_failure_stats mf_stats;
diff --git a/include/linux/node_private.h b/include/linux/node_private.h
new file mode 100644
index 000000000000..6a70ec39d569
--- /dev/null
+++ b/include/linux/node_private.h
@@ -0,0 +1,210 @@
+/* SPDX-License-Identifier: GPL-2.0 */
+#ifndef _LINUX_NODE_PRIVATE_H
+#define _LINUX_NODE_PRIVATE_H
+
+#include <linux/completion.h>
+#include <linux/mm.h>
+#include <linux/nodemask.h>
+#include <linux/rcupdate.h>
+#include <linux/refcount.h>
+
+struct page;
+struct vm_area_struct;
+struct vm_fault;
+
+/**
+ * struct node_private_ops - Callbacks for private node services
+ *
+ * Services register these callbacks to intercept MM operations that affect
+ * their private nodes.
+ *
+ * Flag bits control which MM subsystems may operate on folios on this node.
+ *
+ * The pgdat->node_private pointer is RCU-protected. Callbacks fall into
+ * three categories based on their calling context:
+ *
+ * Folio-referenced callbacks (RCU released before callback):
+ * The caller holds a reference to a folio on the private node, which
+ * pins the node's memory online and prevents node_private teardown.
+ *
+ * Refcounted callbacks (RCU released before callback):
+ * The caller has no folio on the private node (e.g., folios are on a
+ * source node being migrated TO this node). A temporary refcount is
+ * taken on node_private under rcu_read_lock to keep the structure (and
+ * the service module) alive across the callback. node_private_unregister
+ * waits for all temporary references to drain before returning.
+ *
+ * Non-folio callbacks (rcu_read_lock held during callback):
+ * No folio reference exists, so rcu_read_lock is held across the
+ * callback to prevent node_private from being freed.
+ * These callbacks MUST NOT sleep.
+ *
+ * @flags: Operation exclusion flags (NP_OPS_* constants).
+ *
+ */
+struct node_private_ops {
+ unsigned long flags;
+};
+
+/**
+ * struct node_private - Per-node container for N_MEMORY_PRIVATE nodes
+ *
+ * This structure is allocated by the driver and passed to node_private_register().
+ * The driver owns the memory and must ensure it remains valid until after
+ * node_private_unregister() returns with the reference count dropped to 0.
+ *
+ * @owner: Opaque driver identifier
+ * @refcount: Reference count (1 = registered; temporary refs for non-folio
+ * callbacks that may sleep; 0 = fully released)
+ * @released: Signaled when refcount drops to 0; unregister waits on this
+ * @ops: Service callbacks and exclusion flags (NULL until service registers)
+ */
+struct node_private {
+ void *owner;
+ refcount_t refcount;
+ struct completion released;
+ const struct node_private_ops *ops;
+};
+
+#ifdef CONFIG_NUMA
+
+#include <linux/mmzone.h>
+
+/**
+ * folio_is_private_node - Check if folio is on an N_MEMORY_PRIVATE node
+ * @folio: The folio to check
+ *
+ * Returns true if the folio resides on a private node.
+ */
+static inline bool folio_is_private_node(struct folio *folio)
+{
+ return node_state(folio_nid(folio), N_MEMORY_PRIVATE);
+}
+
+/**
+ * page_is_private_node - Check if page is on an N_MEMORY_PRIVATE node
+ * @page: The page to check
+ *
+ * Returns true if the page resides on a private node.
+ */
+static inline bool page_is_private_node(struct page *page)
+{
+ return node_state(page_to_nid(page), N_MEMORY_PRIVATE);
+}
+
+static inline const struct node_private_ops *
+folio_node_private_ops(struct folio *folio)
+{
+ const struct node_private_ops *ops;
+ struct node_private *np;
+
+ rcu_read_lock();
+ np = rcu_dereference(NODE_DATA(folio_nid(folio))->node_private);
+ ops = np ? np->ops : NULL;
+ rcu_read_unlock();
+
+ return ops;
+}
+
+static inline unsigned long node_private_flags(int nid)
+{
+ struct node_private *np;
+ unsigned long flags;
+
+ rcu_read_lock();
+ np = rcu_dereference(NODE_DATA(nid)->node_private);
+ flags = (np && np->ops) ? np->ops->flags : 0;
+ rcu_read_unlock();
+
+ return flags;
+}
+
+static inline bool folio_private_flags(struct folio *f, unsigned long flag)
+{
+ return node_private_flags(folio_nid(f)) & flag;
+}
+
+static inline bool node_private_has_flag(int nid, unsigned long flag)
+{
+ return node_private_flags(nid) & flag;
+}
+
+static inline bool zone_private_flags(struct zone *z, unsigned long flag)
+{
+ return node_private_flags(zone_to_nid(z)) & flag;
+}
+
+#else /* !CONFIG_NUMA */
+
+static inline bool folio_is_private_node(struct folio *folio)
+{
+ return false;
+}
+
+static inline bool page_is_private_node(struct page *page)
+{
+ return false;
+}
+
+static inline const struct node_private_ops *
+folio_node_private_ops(struct folio *folio)
+{
+ return NULL;
+}
+
+static inline unsigned long node_private_flags(int nid)
+{
+ return 0;
+}
+
+static inline bool folio_private_flags(struct folio *f, unsigned long flag)
+{
+ return false;
+}
+
+static inline bool node_private_has_flag(int nid, unsigned long flag)
+{
+ return false;
+}
+
+static inline bool zone_private_flags(struct zone *z, unsigned long flag)
+{
+ return false;
+}
+
+#endif /* CONFIG_NUMA */
+
+#if defined(CONFIG_NUMA) && defined(CONFIG_MEMORY_HOTPLUG)
+
+int node_private_register(int nid, struct node_private *np);
+int node_private_unregister(int nid);
+int node_private_set_ops(int nid, const struct node_private_ops *ops);
+int node_private_clear_ops(int nid, const struct node_private_ops *ops);
+
+#else /* !CONFIG_NUMA || !CONFIG_MEMORY_HOTPLUG */
+
+static inline int node_private_register(int nid, struct node_private *np)
+{
+ return -ENODEV;
+}
+
+static inline int node_private_unregister(int nid)
+{
+ return 0;
+}
+
+static inline int node_private_set_ops(int nid,
+ const struct node_private_ops *ops)
+{
+ return -ENODEV;
+}
+
+static inline int node_private_clear_ops(int nid,
+ const struct node_private_ops *ops)
+{
+ return -ENODEV;
+}
+
+#endif /* CONFIG_NUMA && CONFIG_MEMORY_HOTPLUG */
+
+#endif /* _LINUX_NODE_PRIVATE_H */
diff --git a/include/linux/nodemask.h b/include/linux/nodemask.h
index bd38648c998d..c9bcfd5a9a06 100644
--- a/include/linux/nodemask.h
+++ b/include/linux/nodemask.h
@@ -391,6 +391,7 @@ enum node_states {
N_HIGH_MEMORY = N_NORMAL_MEMORY,
#endif
N_MEMORY, /* The node has memory(regular, high, movable) */
+ N_MEMORY_PRIVATE, /* The node's memory is private */
N_CPU, /* The node has one or more cpus */
N_GENERIC_INITIATOR, /* The node has one or more Generic Initiators */
NR_NODE_STATES
--
2.53.0
* [RFC PATCH v4 02/27] mm,cpuset: gate allocations from N_MEMORY_PRIVATE behind __GFP_PRIVATE
2026-02-22 8:48 [LSF/MM/BPF TOPIC][RFC PATCH v4 00/27] Private Memory Nodes (w/ Compressed RAM) Gregory Price
2026-02-22 8:48 ` [RFC PATCH v4 01/27] numa: introduce N_MEMORY_PRIVATE node state Gregory Price
@ 2026-02-22 8:48 ` Gregory Price
2026-02-22 8:48 ` [RFC PATCH v4 03/27] mm/page_alloc: add numa_zone_allowed() and wire it up Gregory Price
` (24 subsequent siblings)
26 siblings, 0 replies; 28+ messages in thread
From: Gregory Price @ 2026-02-22 8:48 UTC (permalink / raw)
To: lsf-pc
Cc: linux-kernel, linux-cxl, cgroups, linux-mm, linux-trace-kernel,
damon, kernel-team, gregkh, rafael, dakr, dave, jonathan.cameron,
dave.jiang, alison.schofield, vishal.l.verma, ira.weiny,
dan.j.williams, longman, akpm, david, lorenzo.stoakes,
Liam.Howlett, vbabka, rppt, surenb, mhocko, osalvador, ziy,
matthew.brost, joshua.hahnjy, rakie.kim, byungchul, gourry,
ying.huang, apopple, axelrasmussen, yuanchu, weixugc, yury.norov,
linux, mhiramat, mathieu.desnoyers, tj, hannes, mkoutny,
jackmanb, sj, baolin.wang, npache, ryan.roberts, dev.jain,
baohua, lance.yang, muchun.song, xu.xin16, chengming.zhou, jannh,
linmiaohe, nao.horiguchi, pfalcato, rientjes, shakeel.butt, riel,
harry.yoo, cl, roman.gushchin, chrisl, kasong, shikemeng,
nphamcs, bhe, zhengqi.arch, terry.bowman
N_MEMORY_PRIVATE nodes hold device-managed memory that should not be
used for general allocations. Without a gating mechanism, any allocation
could land on a private node if it appears in the task's mems_allowed.
Introduce __GFP_PRIVATE that explicitly opts in to allocation from
N_MEMORY_PRIVATE nodes.
Add the GFP_PRIVATE compound mask (__GFP_PRIVATE | __GFP_THISNODE)
for callers that explicitly target private nodes to help prevent
fallback allocations from DRAM.
Update cpuset_current_node_allowed() to filter out N_MEMORY_PRIVATE
nodes unless __GFP_PRIVATE is set.
In interrupt context, only N_MEMORY nodes are valid.
Update cpuset_handle_hotplug() to include N_MEMORY_PRIVATE nodes in
the effective mems set, allowing cgroup-level control over private
node access.
Signed-off-by: Gregory Price <gourry@gourry.net>
---
include/linux/gfp_types.h | 15 +++++++++++++--
include/trace/events/mmflags.h | 4 ++--
kernel/cgroup/cpuset.c | 32 ++++++++++++++++++++++++++++----
3 files changed, 43 insertions(+), 8 deletions(-)
diff --git a/include/linux/gfp_types.h b/include/linux/gfp_types.h
index 3de43b12209e..ac375f9a0fc2 100644
--- a/include/linux/gfp_types.h
+++ b/include/linux/gfp_types.h
@@ -33,7 +33,7 @@ enum {
___GFP_IO_BIT,
___GFP_FS_BIT,
___GFP_ZERO_BIT,
- ___GFP_UNUSED_BIT, /* 0x200u unused */
+ ___GFP_PRIVATE_BIT,
___GFP_DIRECT_RECLAIM_BIT,
___GFP_KSWAPD_RECLAIM_BIT,
___GFP_WRITE_BIT,
@@ -69,7 +69,7 @@ enum {
#define ___GFP_IO BIT(___GFP_IO_BIT)
#define ___GFP_FS BIT(___GFP_FS_BIT)
#define ___GFP_ZERO BIT(___GFP_ZERO_BIT)
-/* 0x200u unused */
+#define ___GFP_PRIVATE BIT(___GFP_PRIVATE_BIT)
#define ___GFP_DIRECT_RECLAIM BIT(___GFP_DIRECT_RECLAIM_BIT)
#define ___GFP_KSWAPD_RECLAIM BIT(___GFP_KSWAPD_RECLAIM_BIT)
#define ___GFP_WRITE BIT(___GFP_WRITE_BIT)
@@ -139,6 +139,11 @@ enum {
* %__GFP_ACCOUNT causes the allocation to be accounted to kmemcg.
*
* %__GFP_NO_OBJ_EXT causes slab allocation to have no object extension.
+ *
+ * %__GFP_PRIVATE allows allocation from N_MEMORY_PRIVATE nodes (e.g., compressed
+ * memory, accelerator memory). Without this flag, allocations are restricted
+ * to N_MEMORY nodes only. Used by migration/demotion paths when explicitly
+ * targeting private nodes.
*/
#define __GFP_RECLAIMABLE ((__force gfp_t)___GFP_RECLAIMABLE)
#define __GFP_WRITE ((__force gfp_t)___GFP_WRITE)
@@ -146,6 +151,7 @@ enum {
#define __GFP_THISNODE ((__force gfp_t)___GFP_THISNODE)
#define __GFP_ACCOUNT ((__force gfp_t)___GFP_ACCOUNT)
#define __GFP_NO_OBJ_EXT ((__force gfp_t)___GFP_NO_OBJ_EXT)
+#define __GFP_PRIVATE ((__force gfp_t)___GFP_PRIVATE)
/**
* DOC: Watermark modifiers
@@ -367,6 +373,10 @@ enum {
* available and will not wake kswapd/kcompactd on failure. The _LIGHT
* version does not attempt reclaim/compaction at all and is by default used
* in page fault path, while the non-light is used by khugepaged.
+ *
+ * %GFP_PRIVATE adds %__GFP_THISNODE by default to prevent any fallback
+ * allocations to other nodes, given that the caller was already attempting
+ * to access driver-managed memory explicitly.
*/
#define GFP_ATOMIC (__GFP_HIGH|__GFP_KSWAPD_RECLAIM)
#define GFP_KERNEL (__GFP_RECLAIM | __GFP_IO | __GFP_FS)
@@ -382,5 +392,6 @@ enum {
#define GFP_TRANSHUGE_LIGHT ((GFP_HIGHUSER_MOVABLE | __GFP_COMP | \
__GFP_NOMEMALLOC | __GFP_NOWARN) & ~__GFP_RECLAIM)
#define GFP_TRANSHUGE (GFP_TRANSHUGE_LIGHT | __GFP_DIRECT_RECLAIM)
+#define GFP_PRIVATE (__GFP_PRIVATE | __GFP_THISNODE)
#endif /* __LINUX_GFP_TYPES_H */
diff --git a/include/trace/events/mmflags.h b/include/trace/events/mmflags.h
index a6e5a44c9b42..f042cd848451 100644
--- a/include/trace/events/mmflags.h
+++ b/include/trace/events/mmflags.h
@@ -37,7 +37,8 @@
TRACE_GFP_EM(HARDWALL) \
TRACE_GFP_EM(THISNODE) \
TRACE_GFP_EM(ACCOUNT) \
- TRACE_GFP_EM(ZEROTAGS)
+ TRACE_GFP_EM(ZEROTAGS) \
+ TRACE_GFP_EM(PRIVATE)
#ifdef CONFIG_KASAN_HW_TAGS
# define TRACE_GFP_FLAGS_KASAN \
@@ -73,7 +74,6 @@
TRACE_GFP_FLAGS
/* Just in case these are ever used */
-TRACE_DEFINE_ENUM(___GFP_UNUSED_BIT);
TRACE_DEFINE_ENUM(___GFP_LAST_BIT);
#define gfpflag_string(flag) {(__force unsigned long)flag, #flag}
diff --git a/kernel/cgroup/cpuset.c b/kernel/cgroup/cpuset.c
index 473aa9261e16..1a597f0c7c6c 100644
--- a/kernel/cgroup/cpuset.c
+++ b/kernel/cgroup/cpuset.c
@@ -444,21 +444,32 @@ static void guarantee_active_cpus(struct task_struct *tsk,
}
/*
- * Return in *pmask the portion of a cpusets's mems_allowed that
+ * Return in *pmask the portion of a cpuset's mems_allowed that
* are online, with memory. If none are online with memory, walk
* up the cpuset hierarchy until we find one that does have some
* online mems. The top cpuset always has some mems online.
*
* One way or another, we guarantee to return some non-empty subset
- * of node_states[N_MEMORY].
+ * of node_states[N_MEMORY]. N_MEMORY_PRIVATE nodes from the
+ * original cpuset are preserved, but only N_MEMORY nodes are
+ * pulled from ancestors.
*
* Call with callback_lock or cpuset_mutex held.
*/
static void guarantee_online_mems(struct cpuset *cs, nodemask_t *pmask)
{
+ struct cpuset *orig_cs = cs;
+ int nid;
+
while (!nodes_intersects(cs->effective_mems, node_states[N_MEMORY]))
cs = parent_cs(cs);
+
nodes_and(*pmask, cs->effective_mems, node_states[N_MEMORY]);
+
+ for_each_node_state(nid, N_MEMORY_PRIVATE) {
+ if (node_isset(nid, orig_cs->effective_mems))
+ node_set(nid, *pmask);
+ }
}
/**
@@ -4075,7 +4086,9 @@ static void cpuset_handle_hotplug(void)
/* fetch the available cpus/mems and find out which changed how */
cpumask_copy(&new_cpus, cpu_active_mask);
- new_mems = node_states[N_MEMORY];
+
+ /* Include N_MEMORY_PRIVATE so cpuset controls access the same way */
+ nodes_or(new_mems, node_states[N_MEMORY], node_states[N_MEMORY_PRIVATE]);
/*
* If subpartitions_cpus is populated, it is likely that the check
@@ -4488,10 +4501,21 @@ bool cpuset_node_allowed(struct cgroup *cgroup, int nid)
* __alloc_pages() will include all nodes. If the slab allocator
* is passed an offline node, it will fall back to the local node.
* See kmem_cache_alloc_node().
+ *
+ *
+ * Private nodes aren't eligible for these allocations, so skip them.
+ * guarantee_online_mems guarantees at least one N_MEMORY node is set.
*/
static int cpuset_spread_node(int *rotor)
{
- return *rotor = next_node_in(*rotor, current->mems_allowed);
+ int node;
+
+ do {
+ node = next_node_in(*rotor, current->mems_allowed);
+ *rotor = node;
+ } while (node_state(node, N_MEMORY_PRIVATE));
+
+ return node;
}
/**
--
2.53.0
* [RFC PATCH v4 03/27] mm/page_alloc: add numa_zone_allowed() and wire it up
2026-02-22 8:48 [LSF/MM/BPF TOPIC][RFC PATCH v4 00/27] Private Memory Nodes (w/ Compressed RAM) Gregory Price
2026-02-22 8:48 ` [RFC PATCH v4 01/27] numa: introduce N_MEMORY_PRIVATE node state Gregory Price
2026-02-22 8:48 ` [RFC PATCH v4 02/27] mm,cpuset: gate allocations from N_MEMORY_PRIVATE behind __GFP_PRIVATE Gregory Price
@ 2026-02-22 8:48 ` Gregory Price
2026-02-22 8:48 ` [RFC PATCH v4 04/27] mm/page_alloc: Add private node handling to build_zonelists Gregory Price
` (23 subsequent siblings)
26 siblings, 0 replies; 28+ messages in thread
From: Gregory Price @ 2026-02-22 8:48 UTC (permalink / raw)
To: lsf-pc
Cc: linux-kernel, linux-cxl, cgroups, linux-mm, linux-trace-kernel,
damon, kernel-team, gregkh, rafael, dakr, dave, jonathan.cameron,
dave.jiang, alison.schofield, vishal.l.verma, ira.weiny,
dan.j.williams, longman, akpm, david, lorenzo.stoakes,
Liam.Howlett, vbabka, rppt, surenb, mhocko, osalvador, ziy,
matthew.brost, joshua.hahnjy, rakie.kim, byungchul, gourry,
ying.huang, apopple, axelrasmussen, yuanchu, weixugc, yury.norov,
linux, mhiramat, mathieu.desnoyers, tj, hannes, mkoutny,
jackmanb, sj, baolin.wang, npache, ryan.roberts, dev.jain,
baohua, lance.yang, muchun.song, xu.xin16, chengming.zhou, jannh,
linmiaohe, nao.horiguchi, pfalcato, rientjes, shakeel.butt, riel,
harry.yoo, cl, roman.gushchin, chrisl, kasong, shikemeng,
nphamcs, bhe, zhengqi.arch, terry.bowman
Various locations in mm/ open-code cpuset filtering with:
cpusets_enabled() && ALLOC_CPUSET && !__cpuset_zone_allowed()
This pattern does not account for N_MEMORY_PRIVATE nodes on systems
without cpusets, so private-node zones can leak into allocation
paths that should only see general-purpose memory.
Add numa_zone_alloc_allowed(), which consolidates zone filtering. It
gates N_MEMORY_PRIVATE zones behind __GFP_PRIVATE unconditionally, and
additionally checks cpuset membership when cpusets are enabled.
Replace the open-coded patterns in mm/ with the new helper.
Signed-off-by: Gregory Price <gourry@gourry.net>
---
mm/compaction.c | 6 ++----
mm/hugetlb.c | 2 +-
mm/internal.h | 7 +++++++
mm/page_alloc.c | 31 ++++++++++++++++++++-----------
mm/slub.c | 3 ++-
5 files changed, 32 insertions(+), 17 deletions(-)
diff --git a/mm/compaction.c b/mm/compaction.c
index 1e8f8eca318c..6a65145b03d8 100644
--- a/mm/compaction.c
+++ b/mm/compaction.c
@@ -2829,10 +2829,8 @@ enum compact_result try_to_compact_pages(gfp_t gfp_mask, unsigned int order,
ac->highest_zoneidx, ac->nodemask) {
enum compact_result status;
- if (cpusets_enabled() &&
- (alloc_flags & ALLOC_CPUSET) &&
- !__cpuset_zone_allowed(zone, gfp_mask))
- continue;
+ if (!numa_zone_alloc_allowed(alloc_flags, zone, gfp_mask))
+ continue;
if (prio > MIN_COMPACT_PRIORITY
&& compaction_deferred(zone, order)) {
diff --git a/mm/hugetlb.c b/mm/hugetlb.c
index 51273baec9e5..f2b914ab5910 100644
--- a/mm/hugetlb.c
+++ b/mm/hugetlb.c
@@ -1353,7 +1353,7 @@ static struct folio *dequeue_hugetlb_folio_nodemask(struct hstate *h, gfp_t gfp_
for_each_zone_zonelist_nodemask(zone, z, zonelist, gfp_zone(gfp_mask), nmask) {
struct folio *folio;
- if (!cpuset_zone_allowed(zone, gfp_mask))
+ if (!numa_zone_alloc_allowed(ALLOC_CPUSET, zone, gfp_mask))
continue;
/*
* no need to ask again on the same node. Pool is node rather than
diff --git a/mm/internal.h b/mm/internal.h
index 23ee14790227..97023748e6a9 100644
--- a/mm/internal.h
+++ b/mm/internal.h
@@ -1206,6 +1206,8 @@ extern int node_reclaim_mode;
extern int node_reclaim(struct pglist_data *, gfp_t, unsigned int);
extern int find_next_best_node(int node, nodemask_t *used_node_mask);
+extern bool numa_zone_alloc_allowed(int alloc_flags, struct zone *zone,
+ gfp_t gfp_mask);
#else
#define node_reclaim_mode 0
@@ -1218,6 +1220,11 @@ static inline int find_next_best_node(int node, nodemask_t *used_node_mask)
{
return NUMA_NO_NODE;
}
+static inline bool numa_zone_alloc_allowed(int alloc_flags, struct zone *zone,
+ gfp_t gfp_mask)
+{
+ return true;
+}
#endif
static inline bool node_reclaim_enabled(void)
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index 2facee0805da..47f2619d3840 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -3690,6 +3690,21 @@ static bool zone_allows_reclaim(struct zone *local_zone, struct zone *zone)
return node_distance(zone_to_nid(local_zone), zone_to_nid(zone)) <=
node_reclaim_distance;
}
+
+/* Returns true if allocation from this zone is permitted */
+bool numa_zone_alloc_allowed(int alloc_flags, struct zone *zone, gfp_t gfp_mask)
+{
+ /* Gate N_MEMORY_PRIVATE zones behind __GFP_PRIVATE */
+ if (!(gfp_mask & __GFP_PRIVATE) &&
+ node_state(zone_to_nid(zone), N_MEMORY_PRIVATE))
+ return false;
+
+ /* If cpusets is being used, check mems_allowed */
+ if (cpusets_enabled() && (alloc_flags & ALLOC_CPUSET))
+ return cpuset_zone_allowed(zone, gfp_mask);
+
+ return true;
+}
#else /* CONFIG_NUMA */
static bool zone_allows_reclaim(struct zone *local_zone, struct zone *zone)
{
@@ -3781,10 +3796,8 @@ get_page_from_freelist(gfp_t gfp_mask, unsigned int order, int alloc_flags,
struct page *page;
unsigned long mark;
- if (cpusets_enabled() &&
- (alloc_flags & ALLOC_CPUSET) &&
- !__cpuset_zone_allowed(zone, gfp_mask))
- continue;
+ if (!numa_zone_alloc_allowed(alloc_flags, zone, gfp_mask))
+ continue;
/*
* When allocating a page cache page for writing, we
* want to get it from a node that is within its dirty
@@ -4585,10 +4598,8 @@ should_reclaim_retry(gfp_t gfp_mask, unsigned order,
unsigned long min_wmark = min_wmark_pages(zone);
bool wmark;
- if (cpusets_enabled() &&
- (alloc_flags & ALLOC_CPUSET) &&
- !__cpuset_zone_allowed(zone, gfp_mask))
- continue;
+ if (!numa_zone_alloc_allowed(alloc_flags, zone, gfp_mask))
+ continue;
available = reclaimable = zone_reclaimable_pages(zone);
available += zone_page_state_snapshot(zone, NR_FREE_PAGES);
@@ -5084,10 +5095,8 @@ unsigned long alloc_pages_bulk_noprof(gfp_t gfp, int preferred_nid,
for_next_zone_zonelist_nodemask(zone, z, ac.highest_zoneidx, ac.nodemask) {
unsigned long mark;
- if (cpusets_enabled() && (alloc_flags & ALLOC_CPUSET) &&
- !__cpuset_zone_allowed(zone, gfp)) {
+ if (!numa_zone_alloc_allowed(alloc_flags, zone, gfp))
continue;
- }
if (nr_online_nodes > 1 && zone != zonelist_zone(ac.preferred_zoneref) &&
zone_to_nid(zone) != zonelist_node_idx(ac.preferred_zoneref)) {
diff --git a/mm/slub.c b/mm/slub.c
index 861592ac5425..e4bd6ede81d1 100644
--- a/mm/slub.c
+++ b/mm/slub.c
@@ -3595,7 +3595,8 @@ static struct slab *get_any_partial(struct kmem_cache *s,
n = get_node(s, zone_to_nid(zone));
- if (n && cpuset_zone_allowed(zone, pc->flags) &&
+ if (n && numa_zone_alloc_allowed(ALLOC_CPUSET, zone,
+ pc->flags) &&
n->nr_partial > s->min_partial) {
slab = get_partial_node(s, n, pc);
if (slab) {
--
2.53.0
* [RFC PATCH v4 04/27] mm/page_alloc: Add private node handling to build_zonelists
2026-02-22 8:48 [LSF/MM/BPF TOPIC][RFC PATCH v4 00/27] Private Memory Nodes (w/ Compressed RAM) Gregory Price
` (2 preceding siblings ...)
2026-02-22 8:48 ` [RFC PATCH v4 03/27] mm/page_alloc: add numa_zone_allowed() and wire it up Gregory Price
@ 2026-02-22 8:48 ` Gregory Price
2026-02-22 8:48 ` [RFC PATCH v4 05/27] mm: introduce folio_is_private_managed() unified predicate Gregory Price
` (22 subsequent siblings)
26 siblings, 0 replies; 28+ messages in thread
From: Gregory Price @ 2026-02-22 8:48 UTC (permalink / raw)
To: lsf-pc
Cc: linux-kernel, linux-cxl, cgroups, linux-mm, linux-trace-kernel,
damon, kernel-team, gregkh, rafael, dakr, dave, jonathan.cameron,
dave.jiang, alison.schofield, vishal.l.verma, ira.weiny,
dan.j.williams, longman, akpm, david, lorenzo.stoakes,
Liam.Howlett, vbabka, rppt, surenb, mhocko, osalvador, ziy,
matthew.brost, joshua.hahnjy, rakie.kim, byungchul, gourry,
ying.huang, apopple, axelrasmussen, yuanchu, weixugc, yury.norov,
linux, mhiramat, mathieu.desnoyers, tj, hannes, mkoutny,
jackmanb, sj, baolin.wang, npache, ryan.roberts, dev.jain,
baohua, lance.yang, muchun.song, xu.xin16, chengming.zhou, jannh,
linmiaohe, nao.horiguchi, pfalcato, rientjes, shakeel.butt, riel,
harry.yoo, cl, roman.gushchin, chrisl, kasong, shikemeng,
nphamcs, bhe, zhengqi.arch, terry.bowman
N_MEMORY fallback lists should not include N_MEMORY_PRIVATE nodes: at
worst this would allow allocations to land on them in some scenarios,
and at best it causes needless iteration over nodes that aren't eligible.
Private node primary fallback lists do include N_MEMORY nodes so
kernel/slab allocations made on behalf of the private node can
fall back to DRAM when __GFP_PRIVATE is not set.
The nofallback list contains only the node's own zones, restricting
__GFP_THISNODE allocations to the private node.
Signed-off-by: Gregory Price <gourry@gourry.net>
---
mm/page_alloc.c | 20 ++++++++++++++++++++
1 file changed, 20 insertions(+)
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index 47f2619d3840..5a1b35421d78 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -5683,6 +5683,26 @@ static void build_zonelists(pg_data_t *pgdat)
local_node = pgdat->node_id;
prev_node = local_node;
+ /*
+ * Private nodes need N_MEMORY nodes as fallback for kernel allocations
+ * (e.g., slab objects allocated on behalf of this node).
+ */
+ if (node_state(local_node, N_MEMORY_PRIVATE)) {
+ node_order[nr_nodes++] = local_node;
+ node_set(local_node, used_mask);
+
+ while ((node = find_next_best_node(local_node, &used_mask)) >= 0)
+ node_order[nr_nodes++] = node;
+
+ build_zonelists_in_node_order(pgdat, node_order, nr_nodes);
+ build_thisnode_zonelists(pgdat);
+ pr_info("Fallback order for Node %d (private):", local_node);
+ for (node = 0; node < nr_nodes; node++)
+ pr_cont(" %d", node_order[node]);
+ pr_cont("\n");
+ return;
+ }
+
memset(node_order, 0, sizeof(node_order));
while ((node = find_next_best_node(local_node, &used_mask)) >= 0) {
/*
--
2.53.0
* [RFC PATCH v4 05/27] mm: introduce folio_is_private_managed() unified predicate
2026-02-22 8:48 [LSF/MM/BPF TOPIC][RFC PATCH v4 00/27] Private Memory Nodes (w/ Compressed RAM) Gregory Price
` (3 preceding siblings ...)
2026-02-22 8:48 ` [RFC PATCH v4 04/27] mm/page_alloc: Add private node handling to build_zonelists Gregory Price
@ 2026-02-22 8:48 ` Gregory Price
2026-02-22 8:48 ` [RFC PATCH v4 06/27] mm/mlock: skip mlock for managed-memory folios Gregory Price
` (21 subsequent siblings)
26 siblings, 0 replies; 28+ messages in thread
From: Gregory Price @ 2026-02-22 8:48 UTC (permalink / raw)
To: lsf-pc
Cc: linux-kernel, linux-cxl, cgroups, linux-mm, linux-trace-kernel,
damon, kernel-team, gregkh, rafael, dakr, dave, jonathan.cameron,
dave.jiang, alison.schofield, vishal.l.verma, ira.weiny,
dan.j.williams, longman, akpm, david, lorenzo.stoakes,
Liam.Howlett, vbabka, rppt, surenb, mhocko, osalvador, ziy,
matthew.brost, joshua.hahnjy, rakie.kim, byungchul, gourry,
ying.huang, apopple, axelrasmussen, yuanchu, weixugc, yury.norov,
linux, mhiramat, mathieu.desnoyers, tj, hannes, mkoutny,
jackmanb, sj, baolin.wang, npache, ryan.roberts, dev.jain,
baohua, lance.yang, muchun.song, xu.xin16, chengming.zhou, jannh,
linmiaohe, nao.horiguchi, pfalcato, rientjes, shakeel.butt, riel,
harry.yoo, cl, roman.gushchin, chrisl, kasong, shikemeng,
nphamcs, bhe, zhengqi.arch, terry.bowman
Multiple mm/ subsystems already skip operations for ZONE_DEVICE folios,
and N_MEMORY_PRIVATE folios need to be skipped at the same checkpoints.
Add folio_is_private_managed() as a unified predicate that returns true
for folios on N_MEMORY_PRIVATE nodes or in ZONE_DEVICE.
This predicate replaces folio_is_zone_device at skip sites where both
folio types should be excluded from an MM operation.
At some sites, separate zone_device and private_node checks remain more
appropriate because the required handling for the two fundamentally differs.
The !CONFIG_NUMA stubs fall through to folio_is_zone_device() only,
preserving existing behavior when NUMA is disabled.
Signed-off-by: Gregory Price <gourry@gourry.net>
---
include/linux/node_private.h | 20 ++++++++++++++++++++
1 file changed, 20 insertions(+)
diff --git a/include/linux/node_private.h b/include/linux/node_private.h
index 6a70ec39d569..7687a4cf990c 100644
--- a/include/linux/node_private.h
+++ b/include/linux/node_private.h
@@ -92,6 +92,16 @@ static inline bool page_is_private_node(struct page *page)
return node_state(page_to_nid(page), N_MEMORY_PRIVATE);
}
+static inline bool folio_is_private_managed(struct folio *folio)
+{
+ return folio_is_zone_device(folio) || folio_is_private_node(folio);
+}
+
+static inline bool page_is_private_managed(struct page *page)
+{
+ return folio_is_private_managed(page_folio(page));
+}
+
static inline const struct node_private_ops *
folio_node_private_ops(struct folio *folio)
{
@@ -146,6 +156,16 @@ static inline bool page_is_private_node(struct page *page)
return false;
}
+static inline bool folio_is_private_managed(struct folio *folio)
+{
+ return folio_is_zone_device(folio);
+}
+
+static inline bool page_is_private_managed(struct page *page)
+{
+ return folio_is_private_managed(page_folio(page));
+}
+
static inline const struct node_private_ops *
folio_node_private_ops(struct folio *folio)
{
--
2.53.0
^ permalink raw reply [flat|nested] 28+ messages in thread
* [RFC PATCH v4 06/27] mm/mlock: skip mlock for managed-memory folios
2026-02-22 8:48 [LSF/MM/BPF TOPIC][RFC PATCH v4 00/27] Private Memory Nodes (w/ Compressed RAM) Gregory Price
` (4 preceding siblings ...)
2026-02-22 8:48 ` [RFC PATCH v4 05/27] mm: introduce folio_is_private_managed() unified predicate Gregory Price
@ 2026-02-22 8:48 ` Gregory Price
2026-02-22 8:48 ` [RFC PATCH v4 07/27] mm/madvise: skip madvise " Gregory Price
` (20 subsequent siblings)
26 siblings, 0 replies; 28+ messages in thread
From: Gregory Price @ 2026-02-22 8:48 UTC (permalink / raw)
To: lsf-pc
Cc: linux-kernel, linux-cxl, cgroups, linux-mm, linux-trace-kernel,
damon, kernel-team, gregkh, rafael, dakr, dave, jonathan.cameron,
dave.jiang, alison.schofield, vishal.l.verma, ira.weiny,
dan.j.williams, longman, akpm, david, lorenzo.stoakes,
Liam.Howlett, vbabka, rppt, surenb, mhocko, osalvador, ziy,
matthew.brost, joshua.hahnjy, rakie.kim, byungchul, gourry,
ying.huang, apopple, axelrasmussen, yuanchu, weixugc, yury.norov,
linux, mhiramat, mathieu.desnoyers, tj, hannes, mkoutny,
jackmanb, sj, baolin.wang, npache, ryan.roberts, dev.jain,
baohua, lance.yang, muchun.song, xu.xin16, chengming.zhou, jannh,
linmiaohe, nao.horiguchi, pfalcato, rientjes, shakeel.butt, riel,
harry.yoo, cl, roman.gushchin, chrisl, kasong, shikemeng,
nphamcs, bhe, zhengqi.arch, terry.bowman
Private node folios are managed by device drivers and should not be
mlocked. The existing folio_is_zone_device check is already correctly
placed to handle this - simply extend it for private nodes.
Signed-off-by: Gregory Price <gourry@gourry.net>
---
mm/mlock.c | 5 +++--
1 file changed, 3 insertions(+), 2 deletions(-)
diff --git a/mm/mlock.c b/mm/mlock.c
index 2f699c3497a5..c56159253e45 100644
--- a/mm/mlock.c
+++ b/mm/mlock.c
@@ -25,6 +25,7 @@
#include <linux/memcontrol.h>
#include <linux/mm_inline.h>
#include <linux/secretmem.h>
+#include <linux/node_private.h>
#include "internal.h"
@@ -366,7 +367,7 @@ static int mlock_pte_range(pmd_t *pmd, unsigned long addr,
if (is_huge_zero_pmd(*pmd))
goto out;
folio = pmd_folio(*pmd);
- if (folio_is_zone_device(folio))
+ if (unlikely(folio_is_private_managed(folio)))
goto out;
if (vma->vm_flags & VM_LOCKED)
mlock_folio(folio);
@@ -386,7 +387,7 @@ static int mlock_pte_range(pmd_t *pmd, unsigned long addr,
if (!pte_present(ptent))
continue;
folio = vm_normal_folio(vma, addr, ptent);
- if (!folio || folio_is_zone_device(folio))
+ if (!folio || unlikely(folio_is_private_managed(folio)))
continue;
step = folio_mlock_step(folio, pte, addr, end);
--
2.53.0
^ permalink raw reply [flat|nested] 28+ messages in thread
* [RFC PATCH v4 07/27] mm/madvise: skip madvise for managed-memory folios
2026-02-22 8:48 [LSF/MM/BPF TOPIC][RFC PATCH v4 00/27] Private Memory Nodes (w/ Compressed RAM) Gregory Price
` (5 preceding siblings ...)
2026-02-22 8:48 ` [RFC PATCH v4 06/27] mm/mlock: skip mlock for managed-memory folios Gregory Price
@ 2026-02-22 8:48 ` Gregory Price
2026-02-22 8:48 ` [RFC PATCH v4 08/27] mm/ksm: skip KSM " Gregory Price
` (19 subsequent siblings)
26 siblings, 0 replies; 28+ messages in thread
From: Gregory Price @ 2026-02-22 8:48 UTC (permalink / raw)
To: lsf-pc
Cc: linux-kernel, linux-cxl, cgroups, linux-mm, linux-trace-kernel,
damon, kernel-team, gregkh, rafael, dakr, dave, jonathan.cameron,
dave.jiang, alison.schofield, vishal.l.verma, ira.weiny,
dan.j.williams, longman, akpm, david, lorenzo.stoakes,
Liam.Howlett, vbabka, rppt, surenb, mhocko, osalvador, ziy,
matthew.brost, joshua.hahnjy, rakie.kim, byungchul, gourry,
ying.huang, apopple, axelrasmussen, yuanchu, weixugc, yury.norov,
linux, mhiramat, mathieu.desnoyers, tj, hannes, mkoutny,
jackmanb, sj, baolin.wang, npache, ryan.roberts, dev.jain,
baohua, lance.yang, muchun.song, xu.xin16, chengming.zhou, jannh,
linmiaohe, nao.horiguchi, pfalcato, rientjes, shakeel.butt, riel,
harry.yoo, cl, roman.gushchin, chrisl, kasong, shikemeng,
nphamcs, bhe, zhengqi.arch, terry.bowman
Private node folios are managed by device drivers and should not be
subject to madvise cold/pageout/free operations that would interfere
with the driver's memory management.
Extend the existing zone_device check to cover private nodes.
Signed-off-by: Gregory Price <gourry@gourry.net>
---
mm/madvise.c | 5 +++--
1 file changed, 3 insertions(+), 2 deletions(-)
diff --git a/mm/madvise.c b/mm/madvise.c
index b617b1be0f53..3aac105e840b 100644
--- a/mm/madvise.c
+++ b/mm/madvise.c
@@ -32,6 +32,7 @@
#include <linux/leafops.h>
#include <linux/shmem_fs.h>
#include <linux/mmu_notifier.h>
+#include <linux/node_private.h>
#include <asm/tlb.h>
@@ -475,7 +476,7 @@ static int madvise_cold_or_pageout_pte_range(pmd_t *pmd,
continue;
folio = vm_normal_folio(vma, addr, ptent);
- if (!folio || folio_is_zone_device(folio))
+ if (!folio || unlikely(folio_is_private_managed(folio)))
continue;
/*
@@ -704,7 +705,7 @@ static int madvise_free_pte_range(pmd_t *pmd, unsigned long addr,
}
folio = vm_normal_folio(vma, addr, ptent);
- if (!folio || folio_is_zone_device(folio))
+ if (!folio || unlikely(folio_is_private_managed(folio)))
continue;
/*
--
2.53.0
^ permalink raw reply [flat|nested] 28+ messages in thread
* [RFC PATCH v4 08/27] mm/ksm: skip KSM for managed-memory folios
2026-02-22 8:48 [LSF/MM/BPF TOPIC][RFC PATCH v4 00/27] Private Memory Nodes (w/ Compressed RAM) Gregory Price
` (6 preceding siblings ...)
2026-02-22 8:48 ` [RFC PATCH v4 07/27] mm/madvise: skip madvise " Gregory Price
@ 2026-02-22 8:48 ` Gregory Price
2026-02-22 8:48 ` [RFC PATCH v4 09/27] mm/khugepaged: skip private node folios when trying to collapse Gregory Price
` (18 subsequent siblings)
26 siblings, 0 replies; 28+ messages in thread
From: Gregory Price @ 2026-02-22 8:48 UTC (permalink / raw)
To: lsf-pc
Cc: linux-kernel, linux-cxl, cgroups, linux-mm, linux-trace-kernel,
damon, kernel-team, gregkh, rafael, dakr, dave, jonathan.cameron,
dave.jiang, alison.schofield, vishal.l.verma, ira.weiny,
dan.j.williams, longman, akpm, david, lorenzo.stoakes,
Liam.Howlett, vbabka, rppt, surenb, mhocko, osalvador, ziy,
matthew.brost, joshua.hahnjy, rakie.kim, byungchul, gourry,
ying.huang, apopple, axelrasmussen, yuanchu, weixugc, yury.norov,
linux, mhiramat, mathieu.desnoyers, tj, hannes, mkoutny,
jackmanb, sj, baolin.wang, npache, ryan.roberts, dev.jain,
baohua, lance.yang, muchun.song, xu.xin16, chengming.zhou, jannh,
linmiaohe, nao.horiguchi, pfalcato, rientjes, shakeel.butt, riel,
harry.yoo, cl, roman.gushchin, chrisl, kasong, shikemeng,
nphamcs, bhe, zhengqi.arch, terry.bowman
Private node folios should not participate in KSM merging by default.
The driver manages the memory lifecycle and KSM's page sharing can
interfere with driver operations.
Extend the existing zone_device checks in get_mergeable_page and
ksm_next_page_pmd_entry to cover private node folios as well.
Signed-off-by: Gregory Price <gourry@gourry.net>
---
mm/ksm.c | 9 ++++++---
1 file changed, 6 insertions(+), 3 deletions(-)
diff --git a/mm/ksm.c b/mm/ksm.c
index 2d89a7c8b4eb..c48e95a6fff9 100644
--- a/mm/ksm.c
+++ b/mm/ksm.c
@@ -40,6 +40,7 @@
#include <linux/oom.h>
#include <linux/numa.h>
#include <linux/pagewalk.h>
+#include <linux/node_private.h>
#include <asm/tlbflush.h>
#include "internal.h"
@@ -808,7 +809,7 @@ static struct page *get_mergeable_page(struct ksm_rmap_item *rmap_item)
folio = folio_walk_start(&fw, vma, addr, 0);
if (folio) {
- if (!folio_is_zone_device(folio) &&
+ if (!folio_is_private_managed(folio) &&
folio_test_anon(folio)) {
folio_get(folio);
page = fw.page;
@@ -2521,7 +2522,8 @@ static int ksm_next_page_pmd_entry(pmd_t *pmdp, unsigned long addr, unsigned lon
goto not_found_unlock;
folio = page_folio(page);
- if (folio_is_zone_device(folio) || !folio_test_anon(folio))
+ if (unlikely(folio_is_private_managed(folio)) ||
+ !folio_test_anon(folio))
goto not_found_unlock;
page += ((addr & (PMD_SIZE - 1)) >> PAGE_SHIFT);
@@ -2545,7 +2547,8 @@ static int ksm_next_page_pmd_entry(pmd_t *pmdp, unsigned long addr, unsigned lon
continue;
folio = page_folio(page);
- if (folio_is_zone_device(folio) || !folio_test_anon(folio))
+ if (unlikely(folio_is_private_managed(folio)) ||
+ !folio_test_anon(folio))
continue;
goto found_unlock;
}
--
2.53.0
^ permalink raw reply [flat|nested] 28+ messages in thread
* [RFC PATCH v4 09/27] mm/khugepaged: skip private node folios when trying to collapse.
2026-02-22 8:48 [LSF/MM/BPF TOPIC][RFC PATCH v4 00/27] Private Memory Nodes (w/ Compressed RAM) Gregory Price
` (7 preceding siblings ...)
2026-02-22 8:48 ` [RFC PATCH v4 08/27] mm/ksm: skip KSM " Gregory Price
@ 2026-02-22 8:48 ` Gregory Price
2026-02-22 8:48 ` [RFC PATCH v4 10/27] mm/swap: add free_folio callback for folio release cleanup Gregory Price
` (17 subsequent siblings)
26 siblings, 0 replies; 28+ messages in thread
From: Gregory Price @ 2026-02-22 8:48 UTC (permalink / raw)
To: lsf-pc
Cc: linux-kernel, linux-cxl, cgroups, linux-mm, linux-trace-kernel,
damon, kernel-team, gregkh, rafael, dakr, dave, jonathan.cameron,
dave.jiang, alison.schofield, vishal.l.verma, ira.weiny,
dan.j.williams, longman, akpm, david, lorenzo.stoakes,
Liam.Howlett, vbabka, rppt, surenb, mhocko, osalvador, ziy,
matthew.brost, joshua.hahnjy, rakie.kim, byungchul, gourry,
ying.huang, apopple, axelrasmussen, yuanchu, weixugc, yury.norov,
linux, mhiramat, mathieu.desnoyers, tj, hannes, mkoutny,
jackmanb, sj, baolin.wang, npache, ryan.roberts, dev.jain,
baohua, lance.yang, muchun.song, xu.xin16, chengming.zhou, jannh,
linmiaohe, nao.horiguchi, pfalcato, rientjes, shakeel.butt, riel,
harry.yoo, cl, roman.gushchin, chrisl, kasong, shikemeng,
nphamcs, bhe, zhengqi.arch, terry.bowman
A collapse operation allocates a new large folio and migrates the
smaller folios into it. This is an issue for private nodes:
1. The private node service may not support migration
2. Collapse may promote pages from the private node to a local node,
which may result in an LRU inversion that defeats memory tiering.
Handle this just like zone_device for now.
It may be possible to support this later for some private node services
that report explicit support for collapse (and migration).
Signed-off-by: Gregory Price <gourry@gourry.net>
---
mm/khugepaged.c | 7 ++++---
1 file changed, 4 insertions(+), 3 deletions(-)
diff --git a/mm/khugepaged.c b/mm/khugepaged.c
index 97d1b2824386..36f6bc5da53c 100644
--- a/mm/khugepaged.c
+++ b/mm/khugepaged.c
@@ -21,6 +21,7 @@
#include <linux/shmem_fs.h>
#include <linux/dax.h>
#include <linux/ksm.h>
+#include <linux/node_private.h>
#include <linux/pgalloc.h>
#include <asm/tlb.h>
@@ -571,7 +572,7 @@ static int __collapse_huge_page_isolate(struct vm_area_struct *vma,
goto out;
}
page = vm_normal_page(vma, addr, pteval);
- if (unlikely(!page) || unlikely(is_zone_device_page(page))) {
+ if (unlikely(!page) || unlikely(page_is_private_managed(page))) {
result = SCAN_PAGE_NULL;
goto out;
}
@@ -1323,7 +1324,7 @@ static int hpage_collapse_scan_pmd(struct mm_struct *mm,
}
page = vm_normal_page(vma, addr, pteval);
- if (unlikely(!page) || unlikely(is_zone_device_page(page))) {
+ if (unlikely(!page) || unlikely(page_is_private_managed(page))) {
result = SCAN_PAGE_NULL;
goto out_unmap;
}
@@ -1575,7 +1576,7 @@ int collapse_pte_mapped_thp(struct mm_struct *mm, unsigned long addr,
}
page = vm_normal_page(vma, addr, ptent);
- if (WARN_ON_ONCE(page && is_zone_device_page(page)))
+ if (WARN_ON_ONCE(page && page_is_private_managed(page)))
page = NULL;
/*
* Note that uprobe, debugger, or MAP_PRIVATE may change the
--
2.53.0
^ permalink raw reply [flat|nested] 28+ messages in thread
* [RFC PATCH v4 10/27] mm/swap: add free_folio callback for folio release cleanup
2026-02-22 8:48 [LSF/MM/BPF TOPIC][RFC PATCH v4 00/27] Private Memory Nodes (w/ Compressed RAM) Gregory Price
` (8 preceding siblings ...)
2026-02-22 8:48 ` [RFC PATCH v4 09/27] mm/khugepaged: skip private node folios when trying to collapse Gregory Price
@ 2026-02-22 8:48 ` Gregory Price
2026-02-22 8:48 ` [RFC PATCH v4 11/27] mm/huge_memory.c: add private node folio split notification callback Gregory Price
` (16 subsequent siblings)
26 siblings, 0 replies; 28+ messages in thread
From: Gregory Price @ 2026-02-22 8:48 UTC (permalink / raw)
To: lsf-pc
Cc: linux-kernel, linux-cxl, cgroups, linux-mm, linux-trace-kernel,
damon, kernel-team, gregkh, rafael, dakr, dave, jonathan.cameron,
dave.jiang, alison.schofield, vishal.l.verma, ira.weiny,
dan.j.williams, longman, akpm, david, lorenzo.stoakes,
Liam.Howlett, vbabka, rppt, surenb, mhocko, osalvador, ziy,
matthew.brost, joshua.hahnjy, rakie.kim, byungchul, gourry,
ying.huang, apopple, axelrasmussen, yuanchu, weixugc, yury.norov,
linux, mhiramat, mathieu.desnoyers, tj, hannes, mkoutny,
jackmanb, sj, baolin.wang, npache, ryan.roberts, dev.jain,
baohua, lance.yang, muchun.song, xu.xin16, chengming.zhou, jannh,
linmiaohe, nao.horiguchi, pfalcato, rientjes, shakeel.butt, riel,
harry.yoo, cl, roman.gushchin, chrisl, kasong, shikemeng,
nphamcs, bhe, zhengqi.arch, terry.bowman
When a folio's refcount drops to zero, the service may need to perform
cleanup before the page returns to the buddy allocator (e.g. zeroing
pages to scrub stale compressed data and release its compressed backing).
Add folio_managed_on_free() to wrap both the zone_device and private
node handling, since the check sits at the same point in the free path.
One difference between zone_device and private node folios:
- private node services may either take a reference and return true
("handled"), or return false to let the folio continue to the buddy.
- zone_device folios are always fully handled by free_zone_device_folio()
(the wrapper always returns true).
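As a minimal sketch of the service side of this contract (cram_free_folio
and cram_wants_to_keep are hypothetical names, not part of this series;
the callback must not sleep):
	static bool cram_free_folio(struct folio *folio)
	{
		/* Scrub stale data before the page can be reused. */
		folio_zero_range(folio, 0, folio_size(folio));

		/* Hypothetical: the service decides to hold the folio back. */
		if (cram_wants_to_keep(folio)) {
			folio_get(folio);	/* keep it out of the buddy */
			return true;		/* handled, skip the normal free */
		}
		return false;	/* fall through to the normal buddy free path */
	}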
Signed-off-by: Gregory Price <gourry@gourry.net>
---
include/linux/node_private.h | 6 ++++++
mm/internal.h | 30 ++++++++++++++++++++++++++++++
mm/swap.c | 21 ++++++++++-----------
3 files changed, 46 insertions(+), 11 deletions(-)
diff --git a/include/linux/node_private.h b/include/linux/node_private.h
index 7687a4cf990c..09ea7c4cb13c 100644
--- a/include/linux/node_private.h
+++ b/include/linux/node_private.h
@@ -39,10 +39,16 @@ struct vm_fault;
* callback to prevent node_private from being freed.
* These callbacks MUST NOT sleep.
*
+ * @free_folio: Called when a folio refcount drops to 0
+ * [folio-referenced callback]
+ * Returns: true if handled (skip return to buddy)
+ * false if no op (return to buddy)
+ *
* @flags: Operation exclusion flags (NP_OPS_* constants).
*
*/
struct node_private_ops {
+ bool (*free_folio)(struct folio *folio);
unsigned long flags;
};
diff --git a/mm/internal.h b/mm/internal.h
index 97023748e6a9..658da41cdb8e 100644
--- a/mm/internal.h
+++ b/mm/internal.h
@@ -1412,6 +1412,36 @@ int numa_migrate_check(struct folio *folio, struct vm_fault *vmf,
void free_zone_device_folio(struct folio *folio);
int migrate_device_coherent_folio(struct folio *folio);
+/**
+ * folio_managed_on_free - Notify managed-memory service that folio
+ * refcount reached zero.
+ * @folio: the folio being freed
+ *
+ * Returns true if the folio is fully handled (zone_device -- caller
+ * must return immediately). Returns false if the callback ran but
+ * the folio should continue through the normal free path
+ * (private_node -- pages go back to buddy).
+ *
+ * Returns false for normal folios (no-op).
+ */
+static inline bool folio_managed_on_free(struct folio *folio)
+{
+ if (folio_is_zone_device(folio)) {
+ free_zone_device_folio(folio);
+ return true;
+ }
+ if (folio_is_private_node(folio)) {
+ const struct node_private_ops *ops =
+ folio_node_private_ops(folio);
+
+ if (ops && ops->free_folio) {
+ if (ops->free_folio(folio))
+ return true;
+ }
+ }
+ return false;
+}
+
struct vm_struct *__get_vm_area_node(unsigned long size,
unsigned long align, unsigned long shift,
unsigned long vm_flags, unsigned long start,
diff --git a/mm/swap.c b/mm/swap.c
index 2260dcd2775e..dca306e1ae6d 100644
--- a/mm/swap.c
+++ b/mm/swap.c
@@ -37,6 +37,7 @@
#include <linux/page_idle.h>
#include <linux/local_lock.h>
#include <linux/buffer_head.h>
+#include <linux/node_private.h>
#include "internal.h"
@@ -96,10 +97,9 @@ static void page_cache_release(struct folio *folio)
void __folio_put(struct folio *folio)
{
- if (unlikely(folio_is_zone_device(folio))) {
- free_zone_device_folio(folio);
- return;
- }
+ if (unlikely(folio_is_private_managed(folio)))
+ if (folio_managed_on_free(folio))
+ return;
if (folio_test_hugetlb(folio)) {
free_huge_folio(folio);
@@ -961,19 +961,18 @@ void folios_put_refs(struct folio_batch *folios, unsigned int *refs)
if (is_huge_zero_folio(folio))
continue;
- if (folio_is_zone_device(folio)) {
+ if (!folio_ref_sub_and_test(folio, nr_refs))
+ continue;
+
+ if (unlikely(folio_is_private_managed(folio))) {
if (lruvec) {
unlock_page_lruvec_irqrestore(lruvec, flags);
lruvec = NULL;
}
- if (folio_ref_sub_and_test(folio, nr_refs))
- free_zone_device_folio(folio);
- continue;
+ if (folio_managed_on_free(folio))
+ continue;
}
- if (!folio_ref_sub_and_test(folio, nr_refs))
- continue;
-
/* hugetlb has its own memcg */
if (folio_test_hugetlb(folio)) {
if (lruvec) {
--
2.53.0
^ permalink raw reply [flat|nested] 28+ messages in thread
* [RFC PATCH v4 11/27] mm/huge_memory.c: add private node folio split notification callback
2026-02-22 8:48 [LSF/MM/BPF TOPIC][RFC PATCH v4 00/27] Private Memory Nodes (w/ Compressed RAM) Gregory Price
` (9 preceding siblings ...)
2026-02-22 8:48 ` [RFC PATCH v4 10/27] mm/swap: add free_folio callback for folio release cleanup Gregory Price
@ 2026-02-22 8:48 ` Gregory Price
2026-02-22 8:48 ` [RFC PATCH v4 12/27] mm/migrate: NP_OPS_MIGRATION - support private node user migration Gregory Price
` (15 subsequent siblings)
26 siblings, 0 replies; 28+ messages in thread
From: Gregory Price @ 2026-02-22 8:48 UTC (permalink / raw)
To: lsf-pc
Cc: linux-kernel, linux-cxl, cgroups, linux-mm, linux-trace-kernel,
damon, kernel-team, gregkh, rafael, dakr, dave, jonathan.cameron,
dave.jiang, alison.schofield, vishal.l.verma, ira.weiny,
dan.j.williams, longman, akpm, david, lorenzo.stoakes,
Liam.Howlett, vbabka, rppt, surenb, mhocko, osalvador, ziy,
matthew.brost, joshua.hahnjy, rakie.kim, byungchul, gourry,
ying.huang, apopple, axelrasmussen, yuanchu, weixugc, yury.norov,
linux, mhiramat, mathieu.desnoyers, tj, hannes, mkoutny,
jackmanb, sj, baolin.wang, npache, ryan.roberts, dev.jain,
baohua, lance.yang, muchun.song, xu.xin16, chengming.zhou, jannh,
linmiaohe, nao.horiguchi, pfalcato, rientjes, shakeel.butt, riel,
harry.yoo, cl, roman.gushchin, chrisl, kasong, shikemeng,
nphamcs, bhe, zhengqi.arch, terry.bowman
Some private node services may need to update internal metadata when
a THP folio is split. ZONE_DEVICE already has a split callback via
pgmap->ops; private nodes can provide the same capability.
Add this as an optional callback in the ops struct, and consolidate
zone_device and private node dispatch behind a single
folio_managed_split_cb() wrapper.
Wire this into __folio_split() where the zone_device check was made.
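A sketch of a service-side handler (cram_folio_split and cram_meta_split
are hypothetical; all a compression service likely needs to do here is
re-key its PFN-indexed metadata):
	static void cram_folio_split(struct folio *folio, struct folio *new_folio)
	{
		/* Final notification for the original folio: nothing left to do. */
		if (!new_folio)
			return;

		/* Hypothetical: carry PFN-indexed metadata over to the new folio. */
		cram_meta_split(folio_pfn(folio), folio_pfn(new_folio));
	}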
Signed-off-by: Gregory Price <gourry@gourry.net>
---
include/linux/node_private.h | 33 +++++++++++++++++++++++++++++++++
mm/huge_memory.c | 6 ++++--
2 files changed, 37 insertions(+), 2 deletions(-)
diff --git a/include/linux/node_private.h b/include/linux/node_private.h
index 09ea7c4cb13c..f9dd2d25c8a5 100644
--- a/include/linux/node_private.h
+++ b/include/linux/node_private.h
@@ -3,6 +3,7 @@
#define _LINUX_NODE_PRIVATE_H
#include <linux/completion.h>
+#include <linux/memremap.h>
#include <linux/mm.h>
#include <linux/nodemask.h>
#include <linux/rcupdate.h>
@@ -44,11 +45,19 @@ struct vm_fault;
* Returns: true if handled (skip return to buddy)
* false if no op (return to buddy)
*
+ * @folio_split: Notification that a folio on this private node is being split.
+ * [folio-referenced callback]
+ * Called from the folio split path via folio_managed_split_cb().
+ * @folio is the original folio; @new_folio is the newly created folio,
+ * or NULL when called for the final (original) folio after all sub-folios
+ * have been split off.
+ *
* @flags: Operation exclusion flags (NP_OPS_* constants).
*
*/
struct node_private_ops {
bool (*free_folio)(struct folio *folio);
+ void (*folio_split)(struct folio *folio, struct folio *new_folio);
unsigned long flags;
};
@@ -150,6 +159,24 @@ static inline bool zone_private_flags(struct zone *z, unsigned long flag)
return node_private_flags(zone_to_nid(z)) & flag;
}
+static inline void node_private_split_cb(struct folio *folio,
+ struct folio *new_folio)
+{
+ const struct node_private_ops *ops = folio_node_private_ops(folio);
+
+ if (ops && ops->folio_split)
+ ops->folio_split(folio, new_folio);
+}
+
+static inline void folio_managed_split_cb(struct folio *original_folio,
+ struct folio *new_folio)
+{
+ if (folio_is_zone_device(original_folio))
+ zone_device_private_split_cb(original_folio, new_folio);
+ else if (folio_is_private_node(original_folio))
+ node_private_split_cb(original_folio, new_folio);
+}
+
#else /* !CONFIG_NUMA */
static inline bool folio_is_private_node(struct folio *folio)
@@ -198,6 +225,12 @@ static inline bool zone_private_flags(struct zone *z, unsigned long flag)
return false;
}
+static inline void folio_managed_split_cb(struct folio *original_folio,
+ struct folio *new_folio)
+{
+ if (folio_is_zone_device(original_folio))
+ zone_device_private_split_cb(original_folio, new_folio);
+}
#endif /* CONFIG_NUMA */
#if defined(CONFIG_NUMA) && defined(CONFIG_MEMORY_HOTPLUG)
diff --git a/mm/huge_memory.c b/mm/huge_memory.c
index 40cf59301c21..2ecae494291a 100644
--- a/mm/huge_memory.c
+++ b/mm/huge_memory.c
@@ -24,6 +24,7 @@
#include <linux/freezer.h>
#include <linux/mman.h>
#include <linux/memremap.h>
+#include <linux/node_private.h>
#include <linux/pagemap.h>
#include <linux/debugfs.h>
#include <linux/migrate.h>
@@ -3850,7 +3851,7 @@ static int __folio_freeze_and_split_unmapped(struct folio *folio, unsigned int n
next = folio_next(new_folio);
- zone_device_private_split_cb(folio, new_folio);
+ folio_managed_split_cb(folio, new_folio);
folio_ref_unfreeze(new_folio,
folio_cache_ref_count(new_folio) + 1);
@@ -3889,7 +3890,8 @@ static int __folio_freeze_and_split_unmapped(struct folio *folio, unsigned int n
folio_put_refs(new_folio, nr_pages);
}
- zone_device_private_split_cb(folio, NULL);
+ folio_managed_split_cb(folio, NULL);
+
/*
* Unfreeze @folio only after all page cache entries, which
* used to point to it, have been updated with new folios.
--
2.53.0
^ permalink raw reply [flat|nested] 28+ messages in thread
* [RFC PATCH v4 12/27] mm/migrate: NP_OPS_MIGRATION - support private node user migration
2026-02-22 8:48 [LSF/MM/BPF TOPIC][RFC PATCH v4 00/27] Private Memory Nodes (w/ Compressed RAM) Gregory Price
` (10 preceding siblings ...)
2026-02-22 8:48 ` [RFC PATCH v4 11/27] mm/huge_memory.c: add private node folio split notification callback Gregory Price
@ 2026-02-22 8:48 ` Gregory Price
2026-02-22 8:48 ` [RFC PATCH v4 13/27] mm/mempolicy: NP_OPS_MEMPOLICY - support private node mempolicy Gregory Price
` (14 subsequent siblings)
26 siblings, 0 replies; 28+ messages in thread
From: Gregory Price @ 2026-02-22 8:48 UTC (permalink / raw)
To: lsf-pc
Cc: linux-kernel, linux-cxl, cgroups, linux-mm, linux-trace-kernel,
damon, kernel-team, gregkh, rafael, dakr, dave, jonathan.cameron,
dave.jiang, alison.schofield, vishal.l.verma, ira.weiny,
dan.j.williams, longman, akpm, david, lorenzo.stoakes,
Liam.Howlett, vbabka, rppt, surenb, mhocko, osalvador, ziy,
matthew.brost, joshua.hahnjy, rakie.kim, byungchul, gourry,
ying.huang, apopple, axelrasmussen, yuanchu, weixugc, yury.norov,
linux, mhiramat, mathieu.desnoyers, tj, hannes, mkoutny,
jackmanb, sj, baolin.wang, npache, ryan.roberts, dev.jain,
baohua, lance.yang, muchun.song, xu.xin16, chengming.zhou, jannh,
linmiaohe, nao.horiguchi, pfalcato, rientjes, shakeel.butt, riel,
harry.yoo, cl, roman.gushchin, chrisl, kasong, shikemeng,
nphamcs, bhe, zhengqi.arch, terry.bowman
Private node services may want to support user-driven migration
(migrate_pages syscall, mbind) to allow data movement between regular
and private nodes.
ZONE_DEVICE always rejects user migration, but private nodes should
be able to opt in.
Add an NP_OPS_MIGRATION flag, folio_managed_allows_migrate() /
folio_managed_allows_user_migrate() predicates, and a
migrate_folios_to_node() wrapper that dispatches migration requests to
the node's migrate_to callback for private destinations. Private nodes
that set the flag must provide both migrate_to and folio_migrate
callbacks for driver-managed migration.
In migrate_to_node(), allow __GFP_PRIVATE when the destination node
supports NP_OPS_MIGRATION, enabling the migrate_pages syscall to target
private nodes.
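A rough sketch of a migrate_to implementation that simply reuses the
generic migration path (cram_migrate_to is hypothetical; adding
__GFP_PRIVATE to the target control is an assumption based on the
earlier patches, and the folio_migrate companion callback the flag
requires is omitted here):
	static int cram_migrate_to(struct list_head *folios, int nid,
				   enum migrate_mode mode,
				   enum migrate_reason reason,
				   unsigned int *nr_succeeded)
	{
		struct migration_target_control mtc = {
			.nid		= nid,
			/* __GFP_PRIVATE unlocks allocation on the private node. */
			.gfp_mask	= GFP_HIGHUSER_MOVABLE | __GFP_THISNODE |
					  __GFP_PRIVATE,
			.reason		= reason,
		};

		/* Same return semantics as migrate_pages(): 0, >0 failures, <0 error. */
		return migrate_pages(folios, alloc_migration_target, NULL,
				     (unsigned long)&mtc, mode, reason, nr_succeeded);
	}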
Signed-off-by: Gregory Price <gourry@gourry.net>
---
drivers/base/node.c | 4 ++
include/linux/migrate.h | 10 +++
include/linux/node_private.h | 122 +++++++++++++++++++++++++++++++++++
mm/damon/paddr.c | 3 +
mm/internal.h | 24 +++++++
mm/mempolicy.c | 10 +--
mm/migrate.c | 49 ++++++++++----
mm/rmap.c | 4 +-
8 files changed, 206 insertions(+), 20 deletions(-)
diff --git a/drivers/base/node.c b/drivers/base/node.c
index 646dc48a23b5..e587f5781135 100644
--- a/drivers/base/node.c
+++ b/drivers/base/node.c
@@ -949,6 +949,10 @@ int node_private_set_ops(int nid, const struct node_private_ops *ops)
if (!node_possible(nid))
return -EINVAL;
+ if ((ops->flags & NP_OPS_MIGRATION) &&
+ (!ops->migrate_to || !ops->folio_migrate))
+ return -EINVAL;
+
mutex_lock(&node_private_lock);
np = rcu_dereference_protected(NODE_DATA(nid)->node_private,
lockdep_is_held(&node_private_lock));
diff --git a/include/linux/migrate.h b/include/linux/migrate.h
index 26ca00c325d9..7b2da3875ff2 100644
--- a/include/linux/migrate.h
+++ b/include/linux/migrate.h
@@ -71,6 +71,9 @@ void folio_migrate_flags(struct folio *newfolio, struct folio *folio);
int folio_migrate_mapping(struct address_space *mapping,
struct folio *newfolio, struct folio *folio, int extra_count);
int set_movable_ops(const struct movable_operations *ops, enum pagetype type);
+int migrate_folios_to_node(struct list_head *folios, int nid,
+ enum migrate_mode mode,
+ enum migrate_reason reason);
#else
@@ -96,6 +99,13 @@ static inline int set_movable_ops(const struct movable_operations *ops, enum pag
{
return -ENOSYS;
}
+static inline int migrate_folios_to_node(struct list_head *folios,
+ int nid,
+ enum migrate_mode mode,
+ enum migrate_reason reason)
+{
+ return -ENOSYS;
+}
#endif /* CONFIG_MIGRATION */
diff --git a/include/linux/node_private.h b/include/linux/node_private.h
index f9dd2d25c8a5..0c5be1ee6e60 100644
--- a/include/linux/node_private.h
+++ b/include/linux/node_private.h
@@ -4,6 +4,7 @@
#include <linux/completion.h>
#include <linux/memremap.h>
+#include <linux/migrate_mode.h>
#include <linux/mm.h>
#include <linux/nodemask.h>
#include <linux/rcupdate.h>
@@ -52,15 +53,40 @@ struct vm_fault;
* or NULL when called for the final (original) folio after all sub-folios
* have been split off.
*
+ * @migrate_to: Migrate folios TO this node.
+ * [refcounted callback]
+ * Returns: 0 on full success, >0 = number of folios that failed to
+ * migrate, <0 = error. Matches migrate_pages() semantics.
+ * @nr_succeeded is set to the number of successfully migrated
+ * folios (may be NULL if caller doesn't need it).
+ *
+ * @folio_migrate: Post-migration notification that a folio on this private node
+ * changed physical location (on the same node or a different node).
+ * [folio-referenced callback]
+ * Called from migrate_folio_move() after data has been copied but before
+ * migration entries are replaced with real PTEs. Both @src and @dst are
+ * locked. Faults block in migration_entry_wait() until
+ * remove_migration_ptes() runs, so the service can safely update
+ * PFN-based metadata (compression tables, device page tables, DMA
+ * mappings, etc.) before any access through the page tables.
+ *
* @flags: Operation exclusion flags (NP_OPS_* constants).
*
*/
struct node_private_ops {
bool (*free_folio)(struct folio *folio);
void (*folio_split)(struct folio *folio, struct folio *new_folio);
+ int (*migrate_to)(struct list_head *folios, int nid,
+ enum migrate_mode mode,
+ enum migrate_reason reason,
+ unsigned int *nr_succeeded);
+ void (*folio_migrate)(struct folio *src, struct folio *dst);
unsigned long flags;
};
+/* Allow user/kernel migration; requires migrate_to and folio_migrate */
+#define NP_OPS_MIGRATION BIT(0)
+
/**
* struct node_private - Per-node container for N_MEMORY_PRIVATE nodes
*
@@ -177,6 +203,81 @@ static inline void folio_managed_split_cb(struct folio *original_folio,
node_private_split_cb(original_folio, new_folio);
}
+#ifdef CONFIG_MEMORY_HOTPLUG
+static inline int folio_managed_allows_user_migrate(struct folio *folio)
+{
+ if (folio_is_zone_device(folio))
+ return -ENOENT;
+ return node_private_has_flag(folio_nid(folio), NP_OPS_MIGRATION) ?
+ folio_nid(folio) : -ENOENT;
+}
+
+/**
+ * folio_managed_allows_migrate - Check if a managed folio supports migration
+ * @folio: The folio to check
+ *
+ * Returns true if the folio can be migrated. For zone_device folios, only
+ * device_private and device_coherent support migration. For private node
+ * folios, migration requires NP_OPS_MIGRATION. Normal folios always
+ * return true.
+ */
+static inline bool folio_managed_allows_migrate(struct folio *folio)
+{
+ if (folio_is_zone_device(folio))
+ return folio_is_device_private(folio) ||
+ folio_is_device_coherent(folio);
+ if (folio_is_private_node(folio))
+ return folio_private_flags(folio, NP_OPS_MIGRATION);
+ return true;
+}
+
+/**
+ * node_private_migrate_to - Attempt service-specific migration to a private node
+ * @folios: list of folios to migrate (may sleep)
+ * @nid: target node
+ * @mode: migration mode (MIGRATE_ASYNC, MIGRATE_SYNC, etc.)
+ * @reason: migration reason (MR_DEMOTION, MR_SYSCALL, etc.)
+ * @nr_succeeded: optional output for number of successfully migrated folios
+ *
+ * If @nid is an N_MEMORY_PRIVATE node with a migrate_to callback,
+ * invokes the callback and returns the result with migrate_pages()
+ * semantics (0 = full success, >0 = failure count, <0 = error).
+ * Returns -ENODEV if the node is not private or the service is being
+ * torn down.
+ *
+ * The source folios are on other nodes, so they do not pin the target
+ * node's node_private. A temporary refcount is taken under rcu_read_lock
+ * to keep node_private (and the service module) alive across the callback.
+ */
+static inline int node_private_migrate_to(struct list_head *folios, int nid,
+ enum migrate_mode mode,
+ enum migrate_reason reason,
+ unsigned int *nr_succeeded)
+{
+ int (*fn)(struct list_head *, int, enum migrate_mode,
+ enum migrate_reason, unsigned int *);
+ struct node_private *np;
+ int ret;
+
+ rcu_read_lock();
+ np = rcu_dereference(NODE_DATA(nid)->node_private);
+ if (!np || !np->ops || !np->ops->migrate_to ||
+ !refcount_inc_not_zero(&np->refcount)) {
+ rcu_read_unlock();
+ return -ENODEV;
+ }
+ fn = np->ops->migrate_to;
+ rcu_read_unlock();
+
+ ret = fn(folios, nid, mode, reason, nr_succeeded);
+
+ if (refcount_dec_and_test(&np->refcount))
+ complete(&np->released);
+
+ return ret;
+}
+#endif /* CONFIG_MEMORY_HOTPLUG */
+
#else /* !CONFIG_NUMA */
static inline bool folio_is_private_node(struct folio *folio)
@@ -242,6 +343,27 @@ int node_private_clear_ops(int nid, const struct node_private_ops *ops);
#else /* !CONFIG_NUMA || !CONFIG_MEMORY_HOTPLUG */
+static inline int folio_managed_allows_user_migrate(struct folio *folio)
+{
+ return -ENOENT;
+}
+
+static inline bool folio_managed_allows_migrate(struct folio *folio)
+{
+ if (folio_is_zone_device(folio))
+ return folio_is_device_private(folio) ||
+ folio_is_device_coherent(folio);
+ return true;
+}
+
+static inline int node_private_migrate_to(struct list_head *folios, int nid,
+ enum migrate_mode mode,
+ enum migrate_reason reason,
+ unsigned int *nr_succeeded)
+{
+ return -ENODEV;
+}
+
static inline int node_private_register(int nid, struct node_private *np)
{
return -ENODEV;
diff --git a/mm/damon/paddr.c b/mm/damon/paddr.c
index 07a8aead439e..532b8e2c62b0 100644
--- a/mm/damon/paddr.c
+++ b/mm/damon/paddr.c
@@ -277,6 +277,9 @@ static unsigned long damon_pa_migrate(struct damon_region *r,
else
*sz_filter_passed += folio_size(folio) / addr_unit;
+ if (!folio_managed_allows_migrate(folio))
+ goto put_folio;
+
if (!folio_isolate_lru(folio))
goto put_folio;
list_add(&folio->lru, &folio_list);
diff --git a/mm/internal.h b/mm/internal.h
index 658da41cdb8e..6ab4679fe943 100644
--- a/mm/internal.h
+++ b/mm/internal.h
@@ -1442,6 +1442,30 @@ static inline bool folio_managed_on_free(struct folio *folio)
return false;
}
+/**
+ * folio_managed_migrate_notify - Notify service that a folio changed location
+ * @src: the old folio (about to be freed)
+ * @dst: the new folio (data already copied, migration entries still in place)
+ *
+ * Called from migrate_folio_move() after data has been copied but before
+ * remove_migration_ptes() installs real PTEs pointing to @dst. While
+ * migration entries are in place, faults block in migration_entry_wait(),
+ * so the service can safely update PFN-based metadata before any access
+ * through the page tables. Both @src and @dst are locked.
+ */
+static inline void folio_managed_migrate_notify(struct folio *src,
+ struct folio *dst)
+{
+ const struct node_private_ops *ops;
+
+ if (!folio_is_private_node(src))
+ return;
+
+ ops = folio_node_private_ops(src);
+ if (ops && ops->folio_migrate)
+ ops->folio_migrate(src, dst);
+}
+
struct vm_struct *__get_vm_area_node(unsigned long size,
unsigned long align, unsigned long shift,
unsigned long vm_flags, unsigned long start,
diff --git a/mm/mempolicy.c b/mm/mempolicy.c
index 68a98ba57882..2b0f9762d171 100644
--- a/mm/mempolicy.c
+++ b/mm/mempolicy.c
@@ -111,6 +111,7 @@
#include <linux/mmu_notifier.h>
#include <linux/printk.h>
#include <linux/leafops.h>
+#include <linux/node_private.h>
#include <linux/gcd.h>
#include <asm/tlbflush.h>
@@ -1282,11 +1283,6 @@ static long migrate_to_node(struct mm_struct *mm, int source, int dest,
LIST_HEAD(pagelist);
long nr_failed;
long err = 0;
- struct migration_target_control mtc = {
- .nid = dest,
- .gfp_mask = GFP_HIGHUSER_MOVABLE | __GFP_THISNODE,
- .reason = MR_SYSCALL,
- };
nodes_clear(nmask);
node_set(source, nmask);
@@ -1311,8 +1307,8 @@ static long migrate_to_node(struct mm_struct *mm, int source, int dest,
mmap_read_unlock(mm);
if (!list_empty(&pagelist)) {
- err = migrate_pages(&pagelist, alloc_migration_target, NULL,
- (unsigned long)&mtc, MIGRATE_SYNC, MR_SYSCALL, NULL);
+ err = migrate_folios_to_node(&pagelist, dest, MIGRATE_SYNC,
+ MR_SYSCALL);
if (err)
putback_movable_pages(&pagelist);
}
diff --git a/mm/migrate.c b/mm/migrate.c
index 5169f9717f60..a54d4af04df3 100644
--- a/mm/migrate.c
+++ b/mm/migrate.c
@@ -43,6 +43,7 @@
#include <linux/sched/sysctl.h>
#include <linux/memory-tiers.h>
#include <linux/pagewalk.h>
+#include <linux/node_private.h>
#include <asm/tlbflush.h>
@@ -1387,6 +1388,8 @@ static int migrate_folio_move(free_folio_t put_new_folio, unsigned long private,
if (old_page_state & PAGE_WAS_MLOCKED)
lru_add_drain();
+ folio_managed_migrate_notify(src, dst);
+
if (old_page_state & PAGE_WAS_MAPPED)
remove_migration_ptes(src, dst, 0);
@@ -2165,6 +2168,7 @@ int migrate_pages(struct list_head *from, new_folio_t get_new_folio,
return rc_gather;
}
+EXPORT_SYMBOL_GPL(migrate_pages);
struct folio *alloc_migration_target(struct folio *src, unsigned long private)
{
@@ -2204,6 +2208,31 @@ struct folio *alloc_migration_target(struct folio *src, unsigned long private)
return __folio_alloc(gfp_mask, order, nid, mtc->nmask);
}
+EXPORT_SYMBOL_GPL(alloc_migration_target);
+
+static int __migrate_folios_to_node(struct list_head *folios, int nid,
+ enum migrate_mode mode,
+ enum migrate_reason reason)
+{
+ struct migration_target_control mtc = {
+ .nid = nid,
+ .gfp_mask = GFP_HIGHUSER_MOVABLE | __GFP_THISNODE,
+ .reason = reason,
+ };
+
+ return migrate_pages(folios, alloc_migration_target, NULL,
+ (unsigned long)&mtc, mode, reason, NULL);
+}
+
+int migrate_folios_to_node(struct list_head *folios, int nid,
+ enum migrate_mode mode,
+ enum migrate_reason reason)
+{
+ if (node_state(nid, N_MEMORY_PRIVATE))
+ return node_private_migrate_to(folios, nid, mode,
+ reason, NULL);
+ return __migrate_folios_to_node(folios, nid, mode, reason);
+}
#ifdef CONFIG_NUMA
@@ -2221,14 +2250,8 @@ static int store_status(int __user *status, int start, int value, int nr)
static int do_move_pages_to_node(struct list_head *pagelist, int node)
{
int err;
- struct migration_target_control mtc = {
- .nid = node,
- .gfp_mask = GFP_HIGHUSER_MOVABLE | __GFP_THISNODE,
- .reason = MR_SYSCALL,
- };
- err = migrate_pages(pagelist, alloc_migration_target, NULL,
- (unsigned long)&mtc, MIGRATE_SYNC, MR_SYSCALL, NULL);
+ err = migrate_folios_to_node(pagelist, node, MIGRATE_SYNC, MR_SYSCALL);
if (err)
putback_movable_pages(pagelist);
return err;
@@ -2240,7 +2263,7 @@ static int __add_folio_for_migration(struct folio *folio, int node,
if (is_zero_folio(folio) || is_huge_zero_folio(folio))
return -EFAULT;
- if (folio_is_zone_device(folio))
+ if (!folio_managed_allows_migrate(folio))
return -ENOENT;
if (folio_nid(folio) == node)
@@ -2364,7 +2387,8 @@ static int do_pages_move(struct mm_struct *mm, nodemask_t task_nodes,
err = -ENODEV;
if (node < 0 || node >= MAX_NUMNODES)
goto out_flush;
- if (!node_state(node, N_MEMORY))
+ if (!node_state(node, N_MEMORY) &&
+ !node_state(node, N_MEMORY_PRIVATE))
goto out_flush;
err = -EACCES;
@@ -2449,8 +2473,8 @@ static void do_pages_stat_array(struct mm_struct *mm, unsigned long nr_pages,
if (folio) {
if (is_zero_folio(folio) || is_huge_zero_folio(folio))
err = -EFAULT;
- else if (folio_is_zone_device(folio))
- err = -ENOENT;
+ else if (unlikely(folio_is_private_managed(folio)))
+ err = folio_managed_allows_user_migrate(folio);
else
err = folio_nid(folio);
folio_walk_end(&fw, vma);
@@ -2660,6 +2684,9 @@ int migrate_misplaced_folio_prepare(struct folio *folio,
int nr_pages = folio_nr_pages(folio);
pg_data_t *pgdat = NODE_DATA(node);
+ if (!folio_managed_allows_migrate(folio))
+ return -ENOENT;
+
if (folio_is_file_lru(folio)) {
/*
* Do not migrate file folios that are mapped in multiple
diff --git a/mm/rmap.c b/mm/rmap.c
index f955f02d570e..805f9ceb82f3 100644
--- a/mm/rmap.c
+++ b/mm/rmap.c
@@ -72,6 +72,7 @@
#include <linux/backing-dev.h>
#include <linux/page_idle.h>
#include <linux/memremap.h>
+#include <linux/node_private.h>
#include <linux/userfaultfd_k.h>
#include <linux/mm_inline.h>
#include <linux/oom.h>
@@ -2616,8 +2617,7 @@ void try_to_migrate(struct folio *folio, enum ttu_flags flags)
TTU_SYNC | TTU_BATCH_FLUSH)))
return;
- if (folio_is_zone_device(folio) &&
- (!folio_is_device_private(folio) && !folio_is_device_coherent(folio)))
+ if (!folio_managed_allows_migrate(folio))
return;
/*
--
2.53.0
^ permalink raw reply [flat|nested] 28+ messages in thread
* [RFC PATCH v4 13/27] mm/mempolicy: NP_OPS_MEMPOLICY - support private node mempolicy
2026-02-22 8:48 [LSF/MM/BPF TOPIC][RFC PATCH v4 00/27] Private Memory Nodes (w/ Compressed RAM) Gregory Price
` (11 preceding siblings ...)
2026-02-22 8:48 ` [RFC PATCH v4 12/27] mm/migrate: NP_OPS_MIGRATION - support private node user migration Gregory Price
@ 2026-02-22 8:48 ` Gregory Price
2026-02-22 8:48 ` [RFC PATCH v4 14/27] mm/memory-tiers: NP_OPS_DEMOTION - support private node demotion Gregory Price
` (13 subsequent siblings)
26 siblings, 0 replies; 28+ messages in thread
From: Gregory Price @ 2026-02-22 8:48 UTC (permalink / raw)
To: lsf-pc
Cc: linux-kernel, linux-cxl, cgroups, linux-mm, linux-trace-kernel,
damon, kernel-team, gregkh, rafael, dakr, dave, jonathan.cameron,
dave.jiang, alison.schofield, vishal.l.verma, ira.weiny,
dan.j.williams, longman, akpm, david, lorenzo.stoakes,
Liam.Howlett, vbabka, rppt, surenb, mhocko, osalvador, ziy,
matthew.brost, joshua.hahnjy, rakie.kim, byungchul, gourry,
ying.huang, apopple, axelrasmussen, yuanchu, weixugc, yury.norov,
linux, mhiramat, mathieu.desnoyers, tj, hannes, mkoutny,
jackmanb, sj, baolin.wang, npache, ryan.roberts, dev.jain,
baohua, lance.yang, muchun.song, xu.xin16, chengming.zhou, jannh,
linmiaohe, nao.horiguchi, pfalcato, rientjes, shakeel.butt, riel,
harry.yoo, cl, roman.gushchin, chrisl, kasong, shikemeng,
nphamcs, bhe, zhengqi.arch, terry.bowman
Some private nodes want userland to directly allocate from the node
via set_mempolicy() and mbind() - but don't want that node as normal
allocable system memory in the fallback lists.
Add NP_OPS_MEMPOLICY flag requiring NP_OPS_MIGRATION (since mbind can
drive migrations). Only allow private nodes in policy nodemasks if
all private nodes in the mask support NP_OPS_MEMPOLICY. This prevents
__GFP_PRIVATE from unlocking nodes without NP_OPS_MEMPOLICY support.
Add __GFP_PRIVATE to mempolicy migration sites so moves to opted-in
private nodes succeed.
Update the sysfs "has_memory" attribute to include N_MEMORY_PRIVATE
nodes with NP_OPS_MEMPOLICY set, allowing existing numactl userland
tools to work without modification.
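From userland the nodemask rule is directly visible; a minimal sketch
(the node numbers are made up, and it assumes the private node is in the
task's cpuset but its service did not set NP_OPS_MEMPOLICY):
	#include <errno.h>
	#include <stdio.h>
	#include <numaif.h>

	int main(void)
	{
		/* Node 0: DRAM.  Node 2: private node without NP_OPS_MEMPOLICY. */
		unsigned long mask = (1UL << 0) | (1UL << 2);

		/* Mixing in an unsupported private node is rejected outright. */
		if (set_mempolicy(MPOL_BIND, &mask, sizeof(mask) * 8) < 0)
			printf("set_mempolicy: errno=%d (expect EINVAL)\n", errno);
		return 0;
	}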
Signed-off-by: Gregory Price <gourry@gourry.net>
---
drivers/base/node.c | 22 +++++++++++++-
include/linux/node_private.h | 40 +++++++++++++++++++++++++
include/uapi/linux/mempolicy.h | 1 +
mm/mempolicy.c | 54 ++++++++++++++++++++++++++++++----
mm/page_alloc.c | 5 ++++
5 files changed, 116 insertions(+), 6 deletions(-)
diff --git a/drivers/base/node.c b/drivers/base/node.c
index e587f5781135..c08b5a948779 100644
--- a/drivers/base/node.c
+++ b/drivers/base/node.c
@@ -953,6 +953,10 @@ int node_private_set_ops(int nid, const struct node_private_ops *ops)
(!ops->migrate_to || !ops->folio_migrate))
return -EINVAL;
+ if ((ops->flags & NP_OPS_MEMPOLICY) &&
+ !(ops->flags & NP_OPS_MIGRATION))
+ return -EINVAL;
+
mutex_lock(&node_private_lock);
np = rcu_dereference_protected(NODE_DATA(nid)->node_private,
lockdep_is_held(&node_private_lock));
@@ -1145,6 +1149,21 @@ static ssize_t show_node_state(struct device *dev,
nodemask_pr_args(&node_states[na->state]));
}
+/* has_memory includes N_MEMORY + N_MEMORY_PRIVATE that support mempolicy. */
+static ssize_t show_has_memory(struct device *dev,
+ struct device_attribute *attr, char *buf)
+{
+ nodemask_t mask = node_states[N_MEMORY];
+ int nid;
+
+ for_each_node_state(nid, N_MEMORY_PRIVATE) {
+ if (node_private_has_flag(nid, NP_OPS_MEMPOLICY))
+ node_set(nid, mask);
+ }
+
+ return sysfs_emit(buf, "%*pbl\n", nodemask_pr_args(&mask));
+}
+
#define _NODE_ATTR(name, state) \
{ __ATTR(name, 0444, show_node_state, NULL), state }
@@ -1155,7 +1174,8 @@ static struct node_attr node_state_attr[] = {
#ifdef CONFIG_HIGHMEM
[N_HIGH_MEMORY] = _NODE_ATTR(has_high_memory, N_HIGH_MEMORY),
#endif
- [N_MEMORY] = _NODE_ATTR(has_memory, N_MEMORY),
+ [N_MEMORY] = { __ATTR(has_memory, 0444, show_has_memory, NULL),
+ N_MEMORY },
[N_MEMORY_PRIVATE] = _NODE_ATTR(has_private_memory, N_MEMORY_PRIVATE),
[N_CPU] = _NODE_ATTR(has_cpu, N_CPU),
[N_GENERIC_INITIATOR] = _NODE_ATTR(has_generic_initiator,
diff --git a/include/linux/node_private.h b/include/linux/node_private.h
index 0c5be1ee6e60..e9b58afa366b 100644
--- a/include/linux/node_private.h
+++ b/include/linux/node_private.h
@@ -86,6 +86,8 @@ struct node_private_ops {
/* Allow user/kernel migration; requires migrate_to and folio_migrate */
#define NP_OPS_MIGRATION BIT(0)
+/* Allow mempolicy-directed allocation and mbind migration to this node */
+#define NP_OPS_MEMPOLICY BIT(1)
/**
* struct node_private - Per-node container for N_MEMORY_PRIVATE nodes
@@ -276,6 +278,34 @@ static inline int node_private_migrate_to(struct list_head *folios, int nid,
return ret;
}
+
+static inline bool node_mpol_eligible(int nid)
+{
+ bool ret;
+
+ if (!node_state(nid, N_MEMORY_PRIVATE))
+ return node_state(nid, N_MEMORY);
+
+ rcu_read_lock();
+ ret = node_private_has_flag(nid, NP_OPS_MEMPOLICY);
+ rcu_read_unlock();
+ return ret;
+}
+
+static inline bool nodes_private_mpol_allowed(const nodemask_t *nodes)
+{
+ int nid;
+ bool eligible = false;
+
+ for_each_node_mask(nid, *nodes) {
+ if (!node_state(nid, N_MEMORY_PRIVATE))
+ continue;
+ if (!node_mpol_eligible(nid))
+ return false;
+ eligible = true;
+ }
+ return eligible;
+}
#endif /* CONFIG_MEMORY_HOTPLUG */
#else /* !CONFIG_NUMA */
@@ -364,6 +394,16 @@ static inline int node_private_migrate_to(struct list_head *folios, int nid,
return -ENODEV;
}
+static inline bool node_mpol_eligible(int nid)
+{
+ return false;
+}
+
+static inline bool nodes_private_mpol_allowed(const nodemask_t *nodes)
+{
+ return false;
+}
+
static inline int node_private_register(int nid, struct node_private *np)
{
return -ENODEV;
diff --git a/include/uapi/linux/mempolicy.h b/include/uapi/linux/mempolicy.h
index 8fbbe613611a..b606eae983c8 100644
--- a/include/uapi/linux/mempolicy.h
+++ b/include/uapi/linux/mempolicy.h
@@ -64,6 +64,7 @@ enum {
#define MPOL_F_SHARED (1 << 0) /* identify shared policies */
#define MPOL_F_MOF (1 << 3) /* this policy wants migrate on fault */
#define MPOL_F_MORON (1 << 4) /* Migrate On protnone Reference On Node */
+#define MPOL_F_PRIVATE (1 << 5) /* policy targets private node; use __GFP_PRIVATE */
/*
* Enabling zone reclaim means the page allocator will attempt to fulfill
diff --git a/mm/mempolicy.c b/mm/mempolicy.c
index 2b0f9762d171..8ac014950e88 100644
--- a/mm/mempolicy.c
+++ b/mm/mempolicy.c
@@ -406,8 +406,6 @@ static int mpol_new_preferred(struct mempolicy *pol, const nodemask_t *nodes)
static int mpol_set_nodemask(struct mempolicy *pol,
const nodemask_t *nodes, struct nodemask_scratch *nsc)
{
- int ret;
-
/*
* Default (pol==NULL) resp. local memory policies are not a
* subject of any remapping. They also do not need any special
@@ -416,9 +414,12 @@ static int mpol_set_nodemask(struct mempolicy *pol,
if (!pol || pol->mode == MPOL_LOCAL)
return 0;
- /* Check N_MEMORY */
+ /* Check N_MEMORY and N_MEMORY_PRIVATE */
nodes_and(nsc->mask1,
cpuset_current_mems_allowed, node_states[N_MEMORY]);
+ nodes_and(nsc->mask2, cpuset_current_mems_allowed,
+ node_states[N_MEMORY_PRIVATE]);
+ nodes_or(nsc->mask1, nsc->mask1, nsc->mask2);
VM_BUG_ON(!nodes);
@@ -432,8 +433,13 @@ static int mpol_set_nodemask(struct mempolicy *pol,
else
pol->w.cpuset_mems_allowed = cpuset_current_mems_allowed;
- ret = mpol_ops[pol->mode].create(pol, &nsc->mask2);
- return ret;
+ /* All private nodes in the mask must have NP_OPS_MEMPOLICY. */
+ if (nodes_private_mpol_allowed(&nsc->mask2))
+ pol->flags |= MPOL_F_PRIVATE;
+ else if (nodes_intersects(nsc->mask2, node_states[N_MEMORY_PRIVATE]))
+ return -EINVAL;
+
+ return mpol_ops[pol->mode].create(pol, &nsc->mask2);
}
/*
@@ -500,6 +506,7 @@ static void mpol_rebind_default(struct mempolicy *pol, const nodemask_t *nodes)
static void mpol_rebind_nodemask(struct mempolicy *pol, const nodemask_t *nodes)
{
nodemask_t tmp;
+ int nid;
if (pol->flags & MPOL_F_STATIC_NODES)
nodes_and(tmp, pol->w.user_nodemask, *nodes);
@@ -514,6 +521,21 @@ static void mpol_rebind_nodemask(struct mempolicy *pol, const nodemask_t *nodes)
if (nodes_empty(tmp))
tmp = *nodes;
+ /*
+ * Drop private nodes that don't have mempolicy support.
+ * cpusets guarantees at least one N_MEMORY node in effective_mems
+ * and mems_allowed, so dropping private nodes here is safe.
+ */
+ for_each_node_mask(nid, tmp) {
+ if (node_state(nid, N_MEMORY_PRIVATE) &&
+ !node_private_has_flag(nid, NP_OPS_MEMPOLICY))
+ node_clear(nid, tmp);
+ }
+ if (nodes_intersects(tmp, node_states[N_MEMORY_PRIVATE]))
+ pol->flags |= MPOL_F_PRIVATE;
+ else
+ pol->flags &= ~MPOL_F_PRIVATE;
+
pol->nodes = tmp;
}
@@ -661,6 +683,9 @@ static void queue_folios_pmd(pmd_t *pmd, struct mm_walk *walk)
}
if (!queue_folio_required(folio, qp))
return;
+ if (folio_is_private_node(folio) &&
+ !folio_private_flags(folio, NP_OPS_MIGRATION))
+ return;
if (!(qp->flags & (MPOL_MF_MOVE | MPOL_MF_MOVE_ALL)) ||
!vma_migratable(walk->vma) ||
!migrate_folio_add(folio, qp->pagelist, qp->flags))
@@ -717,6 +742,9 @@ static int queue_folios_pte_range(pmd_t *pmd, unsigned long addr,
folio = vm_normal_folio(vma, addr, ptent);
if (!folio || folio_is_zone_device(folio))
continue;
+ if (folio_is_private_node(folio) &&
+ !folio_private_flags(folio, NP_OPS_MIGRATION))
+ continue;
if (folio_test_large(folio) && max_nr != 1)
nr = folio_pte_batch(folio, pte, ptent, max_nr);
/*
@@ -1451,6 +1479,9 @@ static struct folio *alloc_migration_target_by_mpol(struct folio *src,
else
gfp = GFP_HIGHUSER_MOVABLE | __GFP_RETRY_MAYFAIL | __GFP_COMP;
+ if (pol->flags & MPOL_F_PRIVATE)
+ gfp |= __GFP_PRIVATE;
+
return folio_alloc_mpol(gfp, order, pol, ilx, nid);
}
#else
@@ -2280,6 +2311,15 @@ static nodemask_t *policy_nodemask(gfp_t gfp, struct mempolicy *pol,
nodemask = &pol->nodes;
if (pol->home_node != NUMA_NO_NODE)
*nid = pol->home_node;
+ else if ((pol->flags & MPOL_F_PRIVATE) &&
+ !node_isset(*nid, pol->nodes)) {
+ /*
+ * Private nodes are not in N_MEMORY nodes' zonelists.
+ * When the preferred nid (usually numa_node_id()) can't
+ * reach the policy nodes, start from a policy node.
+ */
+ *nid = first_node(pol->nodes);
+ }
/*
* __GFP_THISNODE shouldn't even be used with the bind policy
* because we might easily break the expectation to stay on the
@@ -2533,6 +2573,10 @@ struct folio *vma_alloc_folio_noprof(gfp_t gfp, int order, struct vm_area_struct
gfp |= __GFP_NOWARN;
pol = get_vma_policy(vma, addr, order, &ilx);
+
+ if (pol->flags & MPOL_F_PRIVATE)
+ gfp |= __GFP_PRIVATE;
+
folio = folio_alloc_mpol_noprof(gfp, order, pol, ilx, numa_node_id());
mpol_cond_put(pol);
return folio;
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index 5a1b35421d78..ec6c1f8e85d8 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -3849,8 +3849,13 @@ get_page_from_freelist(gfp_t gfp_mask, unsigned int order, int alloc_flags,
* if another process has NUMA bindings and is causing
* kswapd wakeups on only some nodes. Avoid accidental
* "node_reclaim_mode"-like behavior in this case.
+ *
+ * Nodes without kswapd (some private nodes) are never
+ * skipped - skipping them would cause some mempolicies to
+ * silently fall back to DRAM even when the node is eligible.
*/
if (skip_kswapd_nodes &&
+ zone->zone_pgdat->kswapd &&
!waitqueue_active(&zone->zone_pgdat->kswapd_wait)) {
skipped_kswapd_nodes = true;
continue;
--
2.53.0
^ permalink raw reply [flat|nested] 28+ messages in thread
* [RFC PATCH v4 14/27] mm/memory-tiers: NP_OPS_DEMOTION - support private node demotion
2026-02-22 8:48 [LSF/MM/BPF TOPIC][RFC PATCH v4 00/27] Private Memory Nodes (w/ Compressed RAM) Gregory Price
` (12 preceding siblings ...)
2026-02-22 8:48 ` [RFC PATCH v4 13/27] mm/mempolicy: NP_OPS_MEMPOLICY - support private node mempolicy Gregory Price
@ 2026-02-22 8:48 ` Gregory Price
2026-02-22 8:48 ` [RFC PATCH v4 15/27] mm/mprotect: NP_OPS_PROTECT_WRITE - gate PTE/PMD write-upgrades Gregory Price
` (12 subsequent siblings)
26 siblings, 0 replies; 28+ messages in thread
From: Gregory Price @ 2026-02-22 8:48 UTC (permalink / raw)
To: lsf-pc
Cc: linux-kernel, linux-cxl, cgroups, linux-mm, linux-trace-kernel,
damon, kernel-team, gregkh, rafael, dakr, dave, jonathan.cameron,
dave.jiang, alison.schofield, vishal.l.verma, ira.weiny,
dan.j.williams, longman, akpm, david, lorenzo.stoakes,
Liam.Howlett, vbabka, rppt, surenb, mhocko, osalvador, ziy,
matthew.brost, joshua.hahnjy, rakie.kim, byungchul, gourry,
ying.huang, apopple, axelrasmussen, yuanchu, weixugc, yury.norov,
linux, mhiramat, mathieu.desnoyers, tj, hannes, mkoutny,
jackmanb, sj, baolin.wang, npache, ryan.roberts, dev.jain,
baohua, lance.yang, muchun.song, xu.xin16, chengming.zhou, jannh,
linmiaohe, nao.horiguchi, pfalcato, rientjes, shakeel.butt, riel,
harry.yoo, cl, roman.gushchin, chrisl, kasong, shikemeng,
nphamcs, bhe, zhengqi.arch, terry.bowman
The memory-tier subsystem needs to know which private nodes should
appear as demotion targets.
Add NP_OPS_DEMOTION (BIT(2)):
Node can be added as a demotion target by memory-tiers.
Add demotion backpressure support so private nodes can reject
new demotions cleanly, allowing vmscan to fall back to swap.
In the demotion path, try demotion to private nodes individually,
then clear private nodes from the demotion target mask until a
non-private node is found, then fall back to the remaining mask.
This prevents LRU inversion while still allowing forward progress.
This is the closest match to the current behavior without making
private nodes inaccessible or preventing forward progress. We
should probably completely re-do the demotion logic to allow less
fallback and kick kswapd instead - right now we induce LRU
inversions by simply falling back to any node in the demotion list.
Add memory_tier_refresh_demotion() export for services to trigger
re-evaluation of demotion targets after changing their flags.
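As a sketch of the intended service flow (cram_demotion_ops,
cram_migrate_to, cram_folio_migrate, and cram_enable_demotion are
hypothetical stand-ins; the migration callbacks are assumed to exist
because NP_OPS_MIGRATION requires them):
	static const struct node_private_ops cram_demotion_ops = {
		.migrate_to	= cram_migrate_to,	/* hypothetical */
		.folio_migrate	= cram_folio_migrate,	/* hypothetical */
		.flags		= NP_OPS_MIGRATION | NP_OPS_DEMOTION,
	};

	static int cram_enable_demotion(int nid)
	{
		int rc;

		/* Advertise the node as a demotion target... */
		rc = node_private_set_ops(nid, &cram_demotion_ops);
		if (rc)
			return rc;

		/* ...then have memory-tiers rebuild the demotion targets. */
		memory_tier_refresh_demotion();
		return 0;
	}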
Signed-off-by: Gregory Price <gourry@gourry.net>
---
include/linux/memory-tiers.h | 9 +++++++
include/linux/node_private.h | 22 +++++++++++++++++
mm/internal.h | 7 ++++++
mm/memory-tiers.c | 46 ++++++++++++++++++++++++++++++++----
mm/page_alloc.c | 12 +++++++---
mm/vmscan.c | 30 ++++++++++++++++++++++-
6 files changed, 117 insertions(+), 9 deletions(-)
diff --git a/include/linux/memory-tiers.h b/include/linux/memory-tiers.h
index 3e1159f6762c..e1476432e359 100644
--- a/include/linux/memory-tiers.h
+++ b/include/linux/memory-tiers.h
@@ -58,6 +58,7 @@ struct memory_dev_type *mt_get_memory_type(int adist);
int next_demotion_node(int node);
void node_get_allowed_targets(pg_data_t *pgdat, nodemask_t *targets);
bool node_is_toptier(int node);
+void memory_tier_refresh_demotion(void);
#else
static inline int next_demotion_node(int node)
{
@@ -73,6 +74,10 @@ static inline bool node_is_toptier(int node)
{
return true;
}
+
+static inline void memory_tier_refresh_demotion(void)
+{
+}
#endif
#else
@@ -106,6 +111,10 @@ static inline bool node_is_toptier(int node)
return true;
}
+static inline void memory_tier_refresh_demotion(void)
+{
+}
+
static inline int register_mt_adistance_algorithm(struct notifier_block *nb)
{
return 0;
diff --git a/include/linux/node_private.h b/include/linux/node_private.h
index e9b58afa366b..e254e36056cd 100644
--- a/include/linux/node_private.h
+++ b/include/linux/node_private.h
@@ -88,6 +88,8 @@ struct node_private_ops {
#define NP_OPS_MIGRATION BIT(0)
/* Allow mempolicy-directed allocation and mbind migration to this node */
#define NP_OPS_MEMPOLICY BIT(1)
+/* Node participates as a demotion target in memory-tiers */
+#define NP_OPS_DEMOTION BIT(2)
/**
* struct node_private - Per-node container for N_MEMORY_PRIVATE nodes
@@ -101,12 +103,14 @@ struct node_private_ops {
* callbacks that may sleep; 0 = fully released)
* @released: Signaled when refcount drops to 0; unregister waits on this
* @ops: Service callbacks and exclusion flags (NULL until service registers)
+ * @migration_blocked: Service signals migrations should pause
*/
struct node_private {
void *owner;
refcount_t refcount;
struct completion released;
const struct node_private_ops *ops;
+ bool migration_blocked;
};
#ifdef CONFIG_NUMA
@@ -306,6 +310,19 @@ static inline bool nodes_private_mpol_allowed(const nodemask_t *nodes)
}
return eligible;
}
+
+static inline bool node_private_migration_blocked(int nid)
+{
+ struct node_private *np;
+ bool blocked;
+
+ rcu_read_lock();
+ np = rcu_dereference(NODE_DATA(nid)->node_private);
+ blocked = np && READ_ONCE(np->migration_blocked);
+ rcu_read_unlock();
+
+ return blocked;
+}
#endif /* CONFIG_MEMORY_HOTPLUG */
#else /* !CONFIG_NUMA */
@@ -404,6 +421,11 @@ static inline bool nodes_private_mpol_allowed(const nodemask_t *nodes)
return false;
}
+static inline bool node_private_migration_blocked(int nid)
+{
+ return false;
+}
+
static inline int node_private_register(int nid, struct node_private *np)
{
return -ENODEV;
diff --git a/mm/internal.h b/mm/internal.h
index 6ab4679fe943..5950e20d4023 100644
--- a/mm/internal.h
+++ b/mm/internal.h
@@ -1206,6 +1206,8 @@ extern int node_reclaim_mode;
extern int node_reclaim(struct pglist_data *, gfp_t, unsigned int);
extern int find_next_best_node(int node, nodemask_t *used_node_mask);
+extern int find_next_best_node_in(int node, nodemask_t *used_node_mask,
+ const nodemask_t *candidates);
extern bool numa_zone_alloc_allowed(int alloc_flags, struct zone *zone,
gfp_t gfp_mask);
#else
@@ -1220,6 +1222,11 @@ static inline int find_next_best_node(int node, nodemask_t *used_node_mask)
{
return NUMA_NO_NODE;
}
+static inline int find_next_best_node_in(int node, nodemask_t *used_node_mask,
+ const nodemask_t *candidates)
+{
+ return NUMA_NO_NODE;
+}
static inline bool numa_zone_alloc_allowed(int alloc_flags, struct zone *zone,
gfp_t gfp_mask)
{
diff --git a/mm/memory-tiers.c b/mm/memory-tiers.c
index 9c742e18e48f..434190fdc078 100644
--- a/mm/memory-tiers.c
+++ b/mm/memory-tiers.c
@@ -3,6 +3,7 @@
#include <linux/lockdep.h>
#include <linux/sysfs.h>
#include <linux/kobject.h>
+#include <linux/node_private.h>
#include <linux/memory.h>
#include <linux/memory-tiers.h>
#include <linux/notifier.h>
@@ -380,6 +381,8 @@ static void disable_all_demotion_targets(void)
if (memtier)
memtier->lower_tier_mask = NODE_MASK_NONE;
}
+ for_each_node_state(node, N_MEMORY_PRIVATE)
+ node_demotion[node].preferred = NODE_MASK_NONE;
/*
* Ensure that the "disable" is visible across the system.
* Readers will see either a combination of before+disable
@@ -421,6 +424,7 @@ static void establish_demotion_targets(void)
int target = NUMA_NO_NODE, node;
int distance, best_distance;
nodemask_t tier_nodes, lower_tier;
+ nodemask_t all_memory;
lockdep_assert_held_once(&memory_tier_lock);
@@ -429,6 +433,13 @@ static void establish_demotion_targets(void)
disable_all_demotion_targets();
+ /* Include private nodes that have opted in to demotion. */
+ all_memory = node_states[N_MEMORY];
+ for_each_node_state(node, N_MEMORY_PRIVATE) {
+ if (node_private_has_flag(node, NP_OPS_DEMOTION))
+ node_set(node, all_memory);
+ }
+
for_each_node_state(node, N_MEMORY) {
best_distance = -1;
nd = &node_demotion[node];
@@ -442,12 +453,12 @@ static void establish_demotion_targets(void)
memtier = list_next_entry(memtier, list);
tier_nodes = get_memtier_nodemask(memtier);
/*
- * find_next_best_node, use 'used' nodemask as a skip list.
+ * find_next_best_node_in, use 'used' nodemask as a skip list.
* Add all memory nodes except the selected memory tier
* nodelist to skip list so that we find the best node from the
* memtier nodelist.
*/
- nodes_andnot(tier_nodes, node_states[N_MEMORY], tier_nodes);
+ nodes_andnot(tier_nodes, all_memory, tier_nodes);
/*
* Find all the nodes in the memory tier node list of same best distance.
@@ -455,7 +466,8 @@ static void establish_demotion_targets(void)
* in the preferred mask when allocating pages during demotion.
*/
do {
- target = find_next_best_node(node, &tier_nodes);
+ target = find_next_best_node_in(node, &tier_nodes,
+ &all_memory);
if (target == NUMA_NO_NODE)
break;
@@ -495,7 +507,7 @@ static void establish_demotion_targets(void)
* allocation to a set of nodes that is closer the above selected
* preferred node.
*/
- lower_tier = node_states[N_MEMORY];
+ lower_tier = all_memory;
list_for_each_entry(memtier, &memory_tiers, list) {
/*
* Keep removing current tier from lower_tier nodes,
@@ -542,7 +554,7 @@ static struct memory_tier *set_node_memory_tier(int node)
lockdep_assert_held_once(&memory_tier_lock);
- if (!node_state(node, N_MEMORY))
+ if (!node_state(node, N_MEMORY) && !node_state(node, N_MEMORY_PRIVATE))
return ERR_PTR(-EINVAL);
mt_calc_adistance(node, &adist);
@@ -865,6 +877,30 @@ int mt_calc_adistance(int node, int *adist)
}
EXPORT_SYMBOL_GPL(mt_calc_adistance);
+/**
+ * memory_tier_refresh_demotion() - Re-establish demotion targets
+ *
+ * Called by services after registering or unregistering ops->migrate_to on
+ * a private node, so that establish_demotion_targets() picks up the change.
+ */
+void memory_tier_refresh_demotion(void)
+{
+ int nid;
+
+ mutex_lock(&memory_tier_lock);
+ /*
+ * Ensure private nodes are registered with a tier, otherwise
+ * they won't show up in any node's demotion targets nodemask.
+ */
+ for_each_node_state(nid, N_MEMORY_PRIVATE) {
+ if (!__node_get_memory_tier(nid))
+ set_node_memory_tier(nid);
+ }
+ establish_demotion_targets();
+ mutex_unlock(&memory_tier_lock);
+}
+EXPORT_SYMBOL_GPL(memory_tier_refresh_demotion);
+
static int __meminit memtier_hotplug_callback(struct notifier_block *self,
unsigned long action, void *_arg)
{
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index ec6c1f8e85d8..e272dfdc6b00 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -5589,7 +5589,8 @@ static int node_load[MAX_NUMNODES];
*
* Return: node id of the found node or %NUMA_NO_NODE if no node is found.
*/
-int find_next_best_node(int node, nodemask_t *used_node_mask)
+int find_next_best_node_in(int node, nodemask_t *used_node_mask,
+ const nodemask_t *candidates)
{
int n, val;
int min_val = INT_MAX;
@@ -5599,12 +5600,12 @@ int find_next_best_node(int node, nodemask_t *used_node_mask)
* Use the local node if we haven't already, but for memoryless local
* node, we should skip it and fall back to other nodes.
*/
- if (!node_isset(node, *used_node_mask) && node_state(node, N_MEMORY)) {
+ if (!node_isset(node, *used_node_mask) && node_isset(node, *candidates)) {
node_set(node, *used_node_mask);
return node;
}
- for_each_node_state(n, N_MEMORY) {
+ for_each_node_mask(n, *candidates) {
/* Don't want a node to appear more than once */
if (node_isset(n, *used_node_mask))
@@ -5636,6 +5637,11 @@ int find_next_best_node(int node, nodemask_t *used_node_mask)
return best_node;
}
+int find_next_best_node(int node, nodemask_t *used_node_mask)
+{
+ return find_next_best_node_in(node, used_node_mask,
+ &node_states[N_MEMORY]);
+}
/*
* Build zonelists ordered by node and zones within node.
diff --git a/mm/vmscan.c b/mm/vmscan.c
index 6113be4d3519..0f534428ea88 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -58,6 +58,7 @@
#include <linux/random.h>
#include <linux/mmu_notifier.h>
#include <linux/parser.h>
+#include <linux/node_private.h>
#include <asm/tlbflush.h>
#include <asm/div64.h>
@@ -355,6 +356,10 @@ static bool can_demote(int nid, struct scan_control *sc,
if (demotion_nid == NUMA_NO_NODE)
return false;
+ /* Don't demote when the target's service signals backpressure */
+ if (node_private_migration_blocked(demotion_nid))
+ return false;
+
/* If demotion node isn't in the cgroup's mems_allowed, fall back */
return mem_cgroup_node_allowed(memcg, demotion_nid);
}
@@ -1022,8 +1027,10 @@ static unsigned int demote_folio_list(struct list_head *demote_folios,
struct pglist_data *pgdat)
{
int target_nid = next_demotion_node(pgdat->node_id);
- unsigned int nr_succeeded;
+ int first_nid = target_nid;
+ unsigned int nr_succeeded = 0;
nodemask_t allowed_mask;
+ int ret;
struct migration_target_control mtc = {
/*
@@ -1046,6 +1053,27 @@ static unsigned int demote_folio_list(struct list_head *demote_folios,
node_get_allowed_targets(pgdat, &allowed_mask);
+ /* Try private node targets until we find non-private node */
+ while (node_state(target_nid, N_MEMORY_PRIVATE)) {
+ unsigned int nr = 0;
+
+ ret = node_private_migrate_to(demote_folios, target_nid,
+ MIGRATE_ASYNC, MR_DEMOTION,
+ &nr);
+ nr_succeeded += nr;
+ if (ret == 0 || list_empty(demote_folios))
+ return nr_succeeded;
+
+ target_nid = next_node_in(target_nid, allowed_mask);
+ if (target_nid == first_nid)
+ return nr_succeeded;
+ if (!node_state(target_nid, N_MEMORY_PRIVATE))
+ break;
+ }
+
+ /* target_nid is a non-private node; use standard migration */
+ mtc.nid = target_nid;
+
/* Demotion ignores all cpuset and mempolicy settings */
migrate_pages(demote_folios, alloc_demote_folio, NULL,
(unsigned long)&mtc, MIGRATE_ASYNC, MR_DEMOTION,
--
2.53.0
^ permalink raw reply [flat|nested] 28+ messages in thread
* [RFC PATCH v4 15/27] mm/mprotect: NP_OPS_PROTECT_WRITE - gate PTE/PMD write-upgrades
2026-02-22 8:48 [LSF/MM/BPF TOPIC][RFC PATCH v4 00/27] Private Memory Nodes (w/ Compressed RAM) Gregory Price
` (13 preceding siblings ...)
2026-02-22 8:48 ` [RFC PATCH v4 14/27] mm/memory-tiers: NP_OPS_DEMOTION - support private node demotion Gregory Price
@ 2026-02-22 8:48 ` Gregory Price
2026-02-22 8:48 ` [RFC PATCH v4 16/27] mm: NP_OPS_RECLAIM - private node reclaim participation Gregory Price
` (11 subsequent siblings)
26 siblings, 0 replies; 28+ messages in thread
From: Gregory Price @ 2026-02-22 8:48 UTC (permalink / raw)
To: lsf-pc
Cc: linux-kernel, linux-cxl, cgroups, linux-mm, linux-trace-kernel,
damon, kernel-team, gregkh, rafael, dakr, dave, jonathan.cameron,
dave.jiang, alison.schofield, vishal.l.verma, ira.weiny,
dan.j.williams, longman, akpm, david, lorenzo.stoakes,
Liam.Howlett, vbabka, rppt, surenb, mhocko, osalvador, ziy,
matthew.brost, joshua.hahnjy, rakie.kim, byungchul, gourry,
ying.huang, apopple, axelrasmussen, yuanchu, weixugc, yury.norov,
linux, mhiramat, mathieu.desnoyers, tj, hannes, mkoutny,
jackmanb, sj, baolin.wang, npache, ryan.roberts, dev.jain,
baohua, lance.yang, muchun.song, xu.xin16, chengming.zhou, jannh,
linmiaohe, nao.horiguchi, pfalcato, rientjes, shakeel.butt, riel,
harry.yoo, cl, roman.gushchin, chrisl, kasong, shikemeng,
nphamcs, bhe, zhengqi.arch, terry.bowman
Services that intercept write faults (e.g., for promotion tracking)
need PTEs to stay read-only. This requires preventing mprotect
from silently upgrading the PTE, which would bypass the service's
handle_fault callback.
Add NP_OPS_PROTECT_WRITE and folio_managed_wrprotect().
In change_pte_range() and change_huge_pmd(), suppress PTE write-upgrade
when MM_CP_TRY_CHANGE_WRITABLE is set and the folio is write-protected.
In handle_pte_fault() and do_huge_pmd_wp_page(), dispatch to the node's
ops->handle_fault callback when set, allowing the service to handle write
faults with promotion or other custom logic.
NP_OPS_MEMPOLICY is incompatible with NP_OPS_PROTECT_WRITE to avoid the
footgun of binding a writable VMA to a write-protected node.
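A minimal sketch of the service side, for illustration only (my_handle_fault,
my_promote_folio, my_migrate_to and my_folio_migrate are hypothetical names,
not part of this series):

static vm_fault_t my_handle_fault(struct folio *folio, struct vm_fault *vmf,
                                  enum pgtable_level level)
{
        /* The callback owns the PTL and must release it on every path */
        if (level == PGTABLE_LEVEL_PTE)
                pte_unmap_unlock(vmf->pte, vmf->ptl);
        else
                spin_unlock(vmf->ptl);

        /*
         * Promote the folio off the write-protected node, then let the
         * faulting instruction retry against the new, writable mapping.
         */
        my_promote_folio(folio);
        return 0;
}

static const struct node_private_ops my_ops = {
        .migrate_to    = my_migrate_to,
        .folio_migrate = my_folio_migrate,
        .handle_fault  = my_handle_fault,
        /* NP_OPS_MEMPOLICY is deliberately absent; set_ops rejects the combo */
        .flags         = NP_OPS_MIGRATION | NP_OPS_PROTECT_WRITE,
};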
Signed-off-by: Gregory Price <gourry@gourry.net>
---
drivers/base/node.c | 4 ++
include/linux/node_private.h | 22 ++++++++
mm/huge_memory.c | 17 ++++++-
mm/internal.h | 99 ++++++++++++++++++++++++++++++++++++
mm/memory.c | 15 ++++++
mm/migrate.c | 14 +----
mm/mprotect.c | 4 +-
7 files changed, 159 insertions(+), 16 deletions(-)
diff --git a/drivers/base/node.c b/drivers/base/node.c
index c08b5a948779..a4955b9b5b93 100644
--- a/drivers/base/node.c
+++ b/drivers/base/node.c
@@ -957,6 +957,10 @@ int node_private_set_ops(int nid, const struct node_private_ops *ops)
!(ops->flags & NP_OPS_MIGRATION))
return -EINVAL;
+ if ((ops->flags & NP_OPS_MEMPOLICY) &&
+ (ops->flags & NP_OPS_PROTECT_WRITE))
+ return -EINVAL;
+
mutex_lock(&node_private_lock);
np = rcu_dereference_protected(NODE_DATA(nid)->node_private,
lockdep_is_held(&node_private_lock));
diff --git a/include/linux/node_private.h b/include/linux/node_private.h
index e254e36056cd..27d6e5d84e61 100644
--- a/include/linux/node_private.h
+++ b/include/linux/node_private.h
@@ -70,6 +70,24 @@ struct vm_fault;
* PFN-based metadata (compression tables, device page tables, DMA
* mappings, etc.) before any access through the page tables.
*
+ * @handle_fault: Handle fault on folio on this private node.
+ * [folio-referenced callback, PTL held on entry]
+ *
+ * Called from handle_pte_fault() (PTE level) or do_huge_pmd_wp_page()
+ * (PMD level) after lock acquisition and entry verification.
+ * @folio is the faulting folio, @level indicates the page table level.
+ *
+ * For PGTABLE_LEVEL_PTE: vmf->pte is mapped and vmf->ptl is the
+ * PTE lock. Release via pte_unmap_unlock(vmf->pte, vmf->ptl).
+ *
+ * For PGTABLE_LEVEL_PMD: vmf->pte is NULL and vmf->ptl is the
+ * PMD lock. Release via spin_unlock(vmf->ptl).
+ *
+ * The callback MUST release PTL on ALL paths.
+ * The caller will NOT touch the page table entry after this returns.
+ *
+ * Returns: vm_fault_t result (0, VM_FAULT_RETRY, etc.)
+ *
* @flags: Operation exclusion flags (NP_OPS_* constants).
*
*/
@@ -81,6 +99,8 @@ struct node_private_ops {
enum migrate_reason reason,
unsigned int *nr_succeeded);
void (*folio_migrate)(struct folio *src, struct folio *dst);
+ vm_fault_t (*handle_fault)(struct folio *folio, struct vm_fault *vmf,
+ enum pgtable_level level);
unsigned long flags;
};
@@ -90,6 +110,8 @@ struct node_private_ops {
#define NP_OPS_MEMPOLICY BIT(1)
/* Node participates as a demotion target in memory-tiers */
#define NP_OPS_DEMOTION BIT(2)
+/* Prevent mprotect/NUMA from upgrading PTEs to writable on this node */
+#define NP_OPS_PROTECT_WRITE BIT(3)
/**
* struct node_private - Per-node container for N_MEMORY_PRIVATE nodes
diff --git a/mm/huge_memory.c b/mm/huge_memory.c
index 2ecae494291a..d9ba6593244d 100644
--- a/mm/huge_memory.c
+++ b/mm/huge_memory.c
@@ -2063,12 +2063,14 @@ vm_fault_t do_huge_pmd_wp_page(struct vm_fault *vmf)
struct page *page;
unsigned long haddr = vmf->address & HPAGE_PMD_MASK;
pmd_t orig_pmd = vmf->orig_pmd;
+ vm_fault_t ret;
+
vmf->ptl = pmd_lockptr(vma->vm_mm, vmf->pmd);
VM_BUG_ON_VMA(!vma->anon_vma, vma);
if (is_huge_zero_pmd(orig_pmd)) {
- vm_fault_t ret = do_huge_zero_wp_pmd(vmf);
+ ret = do_huge_zero_wp_pmd(vmf);
if (!(ret & VM_FAULT_FALLBACK))
return ret;
@@ -2088,6 +2090,13 @@ vm_fault_t do_huge_pmd_wp_page(struct vm_fault *vmf)
folio = page_folio(page);
VM_BUG_ON_PAGE(!PageHead(page), page);
+ /* Private-managed write-protect: let the service handle the fault */
+ if (unlikely(folio_is_private_managed(folio))) {
+ if (folio_managed_handle_fault(folio, vmf,
+ PGTABLE_LEVEL_PMD, &ret))
+ return ret;
+ }
+
/* Early check when only holding the PT lock. */
if (PageAnonExclusive(page))
goto reuse;
@@ -2633,7 +2642,8 @@ int change_huge_pmd(struct mmu_gather *tlb, struct vm_area_struct *vma,
/* See change_pte_range(). */
if ((cp_flags & MM_CP_TRY_CHANGE_WRITABLE) && !pmd_write(entry) &&
- can_change_pmd_writable(vma, addr, entry))
+ can_change_pmd_writable(vma, addr, entry) &&
+ !folio_managed_wrprotect(pmd_folio(entry)))
entry = pmd_mkwrite(entry, vma);
ret = HPAGE_PMD_NR;
@@ -4943,6 +4953,9 @@ void remove_migration_pmd(struct page_vma_mapped_walk *pvmw, struct page *new)
if (folio_test_dirty(folio) && softleaf_is_migration_dirty(entry))
pmde = pmd_mkdirty(pmde);
+ if (folio_managed_wrprotect(folio))
+ pmde = pmd_wrprotect(pmde);
+
if (folio_is_device_private(folio)) {
swp_entry_t entry;
diff --git a/mm/internal.h b/mm/internal.h
index 5950e20d4023..ae4ff86e8dc6 100644
--- a/mm/internal.h
+++ b/mm/internal.h
@@ -11,6 +11,7 @@
#include <linux/khugepaged.h>
#include <linux/mm.h>
#include <linux/mm_inline.h>
+#include <linux/node_private.h>
#include <linux/pagemap.h>
#include <linux/pagewalk.h>
#include <linux/rmap.h>
@@ -18,6 +19,7 @@
#include <linux/leafops.h>
#include <linux/swap_cgroup.h>
#include <linux/tracepoint-defs.h>
+#include <linux/node_private.h>
/* Internal core VMA manipulation functions. */
#include "vma.h"
@@ -1449,6 +1451,103 @@ static inline bool folio_managed_on_free(struct folio *folio)
return false;
}
+/*
+ * folio_managed_handle_fault - Dispatch fault on managed-memory folio
+ * @folio: the faulting folio (must not be NULL)
+ * @vmf: the vm_fault descriptor (PTL held: vmf->ptl locked)
+ * @level: page table level (PGTABLE_LEVEL_PTE or PGTABLE_LEVEL_PMD)
+ * @ret: output fault result if handled
+ *
+ * Called with PTL held. If a handle_fault callback exists, it is invoked
+ * with PTL still held. The callback is responsible for releasing PTL on
+ * all paths.
+ *
+ * Returns true if the service handled the fault (PTL released by callback,
+ * caller returns *ret). Returns false if no handler exists (PTL still held,
+ * caller continues with normal fault handling).
+ */
+static inline bool folio_managed_handle_fault(struct folio *folio,
+ struct vm_fault *vmf,
+ enum pgtable_level level,
+ vm_fault_t *ret)
+{
+ /* Zone device pages use swap entries; handled in do_swap_page */
+ if (folio_is_zone_device(folio))
+ return false;
+
+ if (folio_is_private_node(folio)) {
+ const struct node_private_ops *ops =
+ folio_node_private_ops(folio);
+
+ if (ops && ops->handle_fault) {
+ *ret = ops->handle_fault(folio, vmf, level);
+ return true;
+ }
+ }
+ return false;
+}
+
+/**
+ * folio_managed_wrprotect - Should this folio's mappings stay write-protected?
+ * @folio: the folio to check
+ *
+ * Returns true if the folio is on a private node with NP_OPS_PROTECT_WRITE,
+ * meaning page table entries (PTE or PMD) should not be made writable.
+ * Write faults are intercepted by the service's handle_fault callback
+ * to promote the folio to DRAM.
+ *
+ * Used by:
+ * - change_pte_range() / change_huge_pmd(): prevent mprotect write-upgrade
+ * - remove_migration_pte() / remove_migration_pmd(): strip write after migration
+ * - do_huge_pmd_wp_page(): dispatch to fault handler instead of reuse
+ */
+static inline bool folio_managed_wrprotect(struct folio *folio)
+{
+ return unlikely(folio_is_private_node(folio) &&
+ folio_private_flags(folio, NP_OPS_PROTECT_WRITE));
+}
+
+/**
+ * folio_managed_fixup_migration_pte - Fixup PTE after migration for
+ * managed memory pages.
+ * @new: the destination page
+ * @pte: the PTE being installed (normal PTE built by caller)
+ * @old_pte: the original PTE (before migration, for swap entry flags)
+ * @vma: the VMA
+ *
+ * For MEMORY_DEVICE_PRIVATE pages: replaces the PTE with a device-private
+ * swap entry, preserving soft_dirty and uffd_wp from old_pte.
+ *
+ * For N_MEMORY_PRIVATE pages with NP_OPS_PROTECT_WRITE: strips the write
+ * bit so the next write triggers the fault handler for promotion.
+ *
+ * For normal pages: returns pte unmodified.
+ */
+static inline pte_t folio_managed_fixup_migration_pte(struct page *new,
+ pte_t pte,
+ pte_t old_pte,
+ struct vm_area_struct *vma)
+{
+ if (unlikely(is_device_private_page(new))) {
+ softleaf_t entry;
+
+ if (pte_write(pte))
+ entry = make_writable_device_private_entry(
+ page_to_pfn(new));
+ else
+ entry = make_readable_device_private_entry(
+ page_to_pfn(new));
+ pte = softleaf_to_pte(entry);
+ if (pte_swp_soft_dirty(old_pte))
+ pte = pte_swp_mksoft_dirty(pte);
+ if (pte_swp_uffd_wp(old_pte))
+ pte = pte_swp_mkuffd_wp(pte);
+ } else if (folio_managed_wrprotect(page_folio(new))) {
+ pte = pte_wrprotect(pte);
+ }
+ return pte;
+}
+
/**
* folio_managed_migrate_notify - Notify service that a folio changed location
* @src: the old folio (about to be freed)
diff --git a/mm/memory.c b/mm/memory.c
index 2a55edc48a65..0f78988befef 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -6079,6 +6079,10 @@ static vm_fault_t do_numa_page(struct vm_fault *vmf)
* Make it present again, depending on how arch implements
* non-accessible ptes, some can allow access by kernel mode.
*/
+ if (unlikely(folio && folio_managed_wrprotect(folio))) {
+ writable = false;
+ ignore_writable = true;
+ }
if (folio && folio_test_large(folio))
numa_rebuild_large_mapping(vmf, vma, folio, pte, ignore_writable,
pte_write_upgrade);
@@ -6228,6 +6232,7 @@ static void fix_spurious_fault(struct vm_fault *vmf,
*/
static vm_fault_t handle_pte_fault(struct vm_fault *vmf)
{
+ struct folio *folio;
pte_t entry;
if (unlikely(pmd_none(*vmf->pmd))) {
@@ -6284,6 +6289,16 @@ static vm_fault_t handle_pte_fault(struct vm_fault *vmf)
update_mmu_tlb(vmf->vma, vmf->address, vmf->pte);
goto unlock;
}
+
+ folio = vm_normal_folio(vmf->vma, vmf->address, entry);
+ if (unlikely(folio && folio_is_private_managed(folio))) {
+ vm_fault_t fault_ret;
+
+ if (folio_managed_handle_fault(folio, vmf, PGTABLE_LEVEL_PTE,
+ &fault_ret))
+ return fault_ret;
+ }
+
if (vmf->flags & (FAULT_FLAG_WRITE|FAULT_FLAG_UNSHARE)) {
if (!pte_write(entry))
return do_wp_page(vmf);
diff --git a/mm/migrate.c b/mm/migrate.c
index a54d4af04df3..f632e8b03504 100644
--- a/mm/migrate.c
+++ b/mm/migrate.c
@@ -398,19 +398,7 @@ static bool remove_migration_pte(struct folio *folio,
if (folio_test_anon(folio) && !softleaf_is_migration_read(entry))
rmap_flags |= RMAP_EXCLUSIVE;
- if (unlikely(is_device_private_page(new))) {
- if (pte_write(pte))
- entry = make_writable_device_private_entry(
- page_to_pfn(new));
- else
- entry = make_readable_device_private_entry(
- page_to_pfn(new));
- pte = softleaf_to_pte(entry);
- if (pte_swp_soft_dirty(old_pte))
- pte = pte_swp_mksoft_dirty(pte);
- if (pte_swp_uffd_wp(old_pte))
- pte = pte_swp_mkuffd_wp(pte);
- }
+ pte = folio_managed_fixup_migration_pte(new, pte, old_pte, vma);
#ifdef CONFIG_HUGETLB_PAGE
if (folio_test_hugetlb(folio)) {
diff --git a/mm/mprotect.c b/mm/mprotect.c
index 283889e4f1ce..830be609bc24 100644
--- a/mm/mprotect.c
+++ b/mm/mprotect.c
@@ -30,6 +30,7 @@
#include <linux/mm_inline.h>
#include <linux/pgtable.h>
#include <linux/userfaultfd_k.h>
+#include <linux/node_private.h>
#include <uapi/linux/mman.h>
#include <asm/cacheflush.h>
#include <asm/mmu_context.h>
@@ -290,7 +291,8 @@ static long change_pte_range(struct mmu_gather *tlb,
* COW or special handling is required.
*/
if ((cp_flags & MM_CP_TRY_CHANGE_WRITABLE) &&
- !pte_write(ptent))
+ !pte_write(ptent) &&
+ !(folio && folio_managed_wrprotect(folio)))
set_write_prot_commit_flush_ptes(vma, folio, page,
addr, pte, oldpte, ptent, nr_ptes, tlb);
else
--
2.53.0
^ permalink raw reply [flat|nested] 28+ messages in thread
* [RFC PATCH v4 16/27] mm: NP_OPS_RECLAIM - private node reclaim participation
2026-02-22 8:48 [LSF/MM/BPF TOPIC][RFC PATCH v4 00/27] Private Memory Nodes (w/ Compressed RAM) Gregory Price
` (14 preceding siblings ...)
2026-02-22 8:48 ` [RFC PATCH v4 15/27] mm/mprotect: NP_OPS_PROTECT_WRITE - gate PTE/PMD write-upgrades Gregory Price
@ 2026-02-22 8:48 ` Gregory Price
2026-02-22 8:48 ` [RFC PATCH v4 17/27] mm/oom: NP_OPS_OOM_ELIGIBLE - private node OOM participation Gregory Price
` (10 subsequent siblings)
26 siblings, 0 replies; 28+ messages in thread
From: Gregory Price @ 2026-02-22 8:48 UTC (permalink / raw)
To: lsf-pc
Cc: linux-kernel, linux-cxl, cgroups, linux-mm, linux-trace-kernel,
damon, kernel-team, gregkh, rafael, dakr, dave, jonathan.cameron,
dave.jiang, alison.schofield, vishal.l.verma, ira.weiny,
dan.j.williams, longman, akpm, david, lorenzo.stoakes,
Liam.Howlett, vbabka, rppt, surenb, mhocko, osalvador, ziy,
matthew.brost, joshua.hahnjy, rakie.kim, byungchul, gourry,
ying.huang, apopple, axelrasmussen, yuanchu, weixugc, yury.norov,
linux, mhiramat, mathieu.desnoyers, tj, hannes, mkoutny,
jackmanb, sj, baolin.wang, npache, ryan.roberts, dev.jain,
baohua, lance.yang, muchun.song, xu.xin16, chengming.zhou, jannh,
linmiaohe, nao.horiguchi, pfalcato, rientjes, shakeel.butt, riel,
harry.yoo, cl, roman.gushchin, chrisl, kasong, shikemeng,
nphamcs, bhe, zhengqi.arch, terry.bowman
Private node services that drive kswapd via watermark_boost need
control over the reclaim policy. There are three problems:
1) Boosted reclaim suppresses may_swap and may_writepage. When
demotion is not possible, swap is the only eviction path, so kswapd
cannot make progress and pages are stranded.
2) __setup_per_zone_wmarks() unconditionally zeros watermark_boost,
killing the service's pressure signal.
3) Not all private nodes want reclaim to touch their pages.
Add a reclaim_policy callback to struct node_private_ops and a
struct node_reclaim_policy with:
- active: set by the helper when a callback was invoked
- may_swap: allow swap writeback during boosted reclaim
- may_writepage: allow writepage during boosted reclaim
- managed_watermarks: service owns watermark_boost lifecycle
We do not allow disabling swap/writepage, as core MM may have
explicitly enabled them on a non-boosted pass.
We only allow enabling swap/writepage, so that the suppression during
a boost can be overridden. This allows a device to force evictions
even when the system otherwise would not perceive pressure.
This is important for a service like compressed RAM, as device capacity
may differ from reported capacity, and the device may want to relieve real
pressure (a poor compression ratio) as opposed to perceived pressure
(i.e. how many pages are in use).
Add zone_reclaim_allowed() to filter private nodes that have not
opted into reclaim.
Regular nodes fall through to cpuset_zone_allowed() unchanged.
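As a rough sketch of the intended callback (my_reclaim_policy is a
hypothetical name, registered via .reclaim_policy with NP_OPS_RECLAIM set
in .flags):

static void my_reclaim_policy(int nid, struct node_reclaim_policy *policy)
{
        /* Runs under rcu_read_lock(); must not sleep */
        policy->may_swap = true;           /* swap is our only eviction path */
        policy->may_writepage = true;      /* allow writeback while boosted */
        policy->managed_watermarks = true; /* we raise and clear watermark_boost */
}

The active field is filled in by the node_private_reclaim_policy() helper,
not by the callback.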
Signed-off-by: Gregory Price <gourry@gourry.net>
---
include/linux/node_private.h | 28 ++++++++++++++++++++++++++++
mm/internal.h | 36 ++++++++++++++++++++++++++++++++++++
mm/page_alloc.c | 11 ++++++++++-
mm/vmscan.c | 25 +++++++++++++++++++++++--
4 files changed, 97 insertions(+), 3 deletions(-)
diff --git a/include/linux/node_private.h b/include/linux/node_private.h
index 27d6e5d84e61..34be52383255 100644
--- a/include/linux/node_private.h
+++ b/include/linux/node_private.h
@@ -14,6 +14,24 @@ struct page;
struct vm_area_struct;
struct vm_fault;
+/**
+ * struct node_reclaim_policy - Reclaim policy overrides for private nodes
+ * @active: set by node_private_reclaim_policy() when a callback was invoked
+ * @may_swap: allow swap writeback during boosted reclaim
+ * @may_writepage: allow writepage during boosted reclaim
+ * @managed_watermarks: service owns watermark_boost lifecycle; kswapd must
+ * not clear it after boosted reclaim
+ *
+ * Passed to the reclaim_policy callback so each private node service can
+ * inject its own reclaim policy before kswapd runs boosted reclaim.
+ */
+struct node_reclaim_policy {
+ bool active;
+ bool may_swap;
+ bool may_writepage;
+ bool managed_watermarks;
+};
+
/**
* struct node_private_ops - Callbacks for private node services
*
@@ -88,6 +106,13 @@ struct vm_fault;
*
* Returns: vm_fault_t result (0, VM_FAULT_RETRY, etc.)
*
+ * @reclaim_policy: Configure reclaim policy for boosted reclaim.
+ * [called holding rcu_read_lock, MUST NOT sleep]
+ * Called by kswapd before boosted reclaim to let the service override
+ * may_swap / may_writepage. If provided, the service also owns the
+ * watermark_boost lifecycle (kswapd will not clear it).
+ * If NULL, normal boost policy applies.
+ *
* @flags: Operation exclusion flags (NP_OPS_* constants).
*
*/
@@ -101,6 +126,7 @@ struct node_private_ops {
void (*folio_migrate)(struct folio *src, struct folio *dst);
vm_fault_t (*handle_fault)(struct folio *folio, struct vm_fault *vmf,
enum pgtable_level level);
+ void (*reclaim_policy)(int nid, struct node_reclaim_policy *policy);
unsigned long flags;
};
@@ -112,6 +138,8 @@ struct node_private_ops {
#define NP_OPS_DEMOTION BIT(2)
/* Prevent mprotect/NUMA from upgrading PTEs to writable on this node */
#define NP_OPS_PROTECT_WRITE BIT(3)
+/* Kernel reclaim (kswapd, direct reclaim, OOM) operates on this node */
+#define NP_OPS_RECLAIM BIT(4)
/**
* struct node_private - Per-node container for N_MEMORY_PRIVATE nodes
diff --git a/mm/internal.h b/mm/internal.h
index ae4ff86e8dc6..db32cb2d7a29 100644
--- a/mm/internal.h
+++ b/mm/internal.h
@@ -1572,6 +1572,42 @@ static inline void folio_managed_migrate_notify(struct folio *src,
ops->folio_migrate(src, dst);
}
+/**
+ * node_private_reclaim_policy - invoke the service's reclaim policy callback
+ * @nid: NUMA node id
+ * @policy: reclaim policy struct to fill in
+ *
+ * Called by kswapd before boosted reclaim. Zeroes @policy, then if the
+ * private node service provides a reclaim_policy callback, invokes it
+ * and sets policy->active to true.
+ */
+#ifdef CONFIG_NUMA
+static inline void node_private_reclaim_policy(int nid,
+ struct node_reclaim_policy *policy)
+{
+ struct node_private *np;
+
+ memset(policy, 0, sizeof(*policy));
+
+ if (!node_state(nid, N_MEMORY_PRIVATE))
+ return;
+
+ rcu_read_lock();
+ np = rcu_dereference(NODE_DATA(nid)->node_private);
+ if (np && np->ops && np->ops->reclaim_policy) {
+ np->ops->reclaim_policy(nid, policy);
+ policy->active = true;
+ }
+ rcu_read_unlock();
+}
+#else
+static inline void node_private_reclaim_policy(int nid,
+ struct node_reclaim_policy *policy)
+{
+ memset(policy, 0, sizeof(*policy));
+}
+#endif
+
struct vm_struct *__get_vm_area_node(unsigned long size,
unsigned long align, unsigned long shift,
unsigned long vm_flags, unsigned long start,
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index e272dfdc6b00..9692048ab5fb 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -55,6 +55,7 @@
#include <linux/delayacct.h>
#include <linux/cacheinfo.h>
#include <linux/pgalloc_tag.h>
+#include <linux/node_private.h>
#include <asm/div64.h>
#include "internal.h"
#include "shuffle.h"
@@ -6437,6 +6438,8 @@ static void __setup_per_zone_wmarks(void)
unsigned long lowmem_pages = 0;
struct zone *zone;
unsigned long flags;
+ struct node_reclaim_policy rp;
+ int prev_nid = NUMA_NO_NODE;
/* Calculate total number of !ZONE_HIGHMEM and !ZONE_MOVABLE pages */
for_each_zone(zone) {
@@ -6446,6 +6449,7 @@ static void __setup_per_zone_wmarks(void)
for_each_zone(zone) {
u64 tmp;
+ int nid = zone_to_nid(zone);
spin_lock_irqsave(&zone->lock, flags);
tmp = (u64)pages_min * zone_managed_pages(zone);
@@ -6482,7 +6486,12 @@ static void __setup_per_zone_wmarks(void)
mult_frac(zone_managed_pages(zone),
watermark_scale_factor, 10000));
- zone->watermark_boost = 0;
+ if (nid != prev_nid) {
+ node_private_reclaim_policy(nid, &rp);
+ prev_nid = nid;
+ }
+ if (!rp.managed_watermarks)
+ zone->watermark_boost = 0;
zone->_watermark[WMARK_LOW] = min_wmark_pages(zone) + tmp;
zone->_watermark[WMARK_HIGH] = low_wmark_pages(zone) + tmp;
zone->_watermark[WMARK_PROMO] = high_wmark_pages(zone) + tmp;
diff --git a/mm/vmscan.c b/mm/vmscan.c
index 0f534428ea88..07de666c1276 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -73,6 +73,13 @@
#define CREATE_TRACE_POINTS
#include <trace/events/vmscan.h>
+static inline bool zone_reclaim_allowed(struct zone *zone, gfp_t gfp_mask)
+{
+ if (node_state(zone_to_nid(zone), N_MEMORY_PRIVATE))
+ return zone_private_flags(zone, NP_OPS_RECLAIM);
+ return cpuset_zone_allowed(zone, gfp_mask);
+}
+
struct scan_control {
/* How many pages shrink_list() should reclaim */
unsigned long nr_to_reclaim;
@@ -6274,7 +6281,7 @@ static void shrink_zones(struct zonelist *zonelist, struct scan_control *sc)
* to global LRU.
*/
if (!cgroup_reclaim(sc)) {
- if (!cpuset_zone_allowed(zone,
+ if (!zone_reclaim_allowed(zone,
GFP_KERNEL | __GFP_HARDWALL))
continue;
@@ -6992,6 +6999,7 @@ static int balance_pgdat(pg_data_t *pgdat, int order, int highest_zoneidx)
unsigned long zone_boosts[MAX_NR_ZONES] = { 0, };
bool boosted;
struct zone *zone;
+ struct node_reclaim_policy policy;
struct scan_control sc = {
.gfp_mask = GFP_KERNEL,
.order = order,
@@ -7016,6 +7024,9 @@ static int balance_pgdat(pg_data_t *pgdat, int order, int highest_zoneidx)
}
boosted = nr_boost_reclaim;
+ /* Query/cache private node reclaim policy once per balance() */
+ node_private_reclaim_policy(pgdat->node_id, &policy);
+
restart:
set_reclaim_active(pgdat, highest_zoneidx);
sc.priority = DEF_PRIORITY;
@@ -7083,6 +7094,12 @@ static int balance_pgdat(pg_data_t *pgdat, int order, int highest_zoneidx)
sc.may_writepage = !laptop_mode && !nr_boost_reclaim;
sc.may_swap = !nr_boost_reclaim;
+ /* Private nodes may enable swap/writepage when using boost */
+ if (policy.active) {
+ sc.may_swap |= policy.may_swap;
+ sc.may_writepage |= policy.may_writepage;
+ }
+
/*
* Do some background aging, to give pages a chance to be
* referenced before reclaiming. All pages are rotated
@@ -7176,6 +7193,10 @@ static int balance_pgdat(pg_data_t *pgdat, int order, int highest_zoneidx)
if (!zone_boosts[i])
continue;
+ /* Some private nodes may own the boost lifecycle */
+ if (policy.managed_watermarks)
+ continue;
+
/* Increments are under the zone lock */
zone = pgdat->node_zones + i;
spin_lock_irqsave(&zone->lock, flags);
@@ -7406,7 +7427,7 @@ void wakeup_kswapd(struct zone *zone, gfp_t gfp_flags, int order,
if (!managed_zone(zone))
return;
- if (!cpuset_zone_allowed(zone, gfp_flags))
+ if (!zone_reclaim_allowed(zone, gfp_flags))
return;
pgdat = zone->zone_pgdat;
--
2.53.0
^ permalink raw reply [flat|nested] 28+ messages in thread
* [RFC PATCH v4 17/27] mm/oom: NP_OPS_OOM_ELIGIBLE - private node OOM participation
2026-02-22 8:48 [LSF/MM/BPF TOPIC][RFC PATCH v4 00/27] Private Memory Nodes (w/ Compressed RAM) Gregory Price
` (15 preceding siblings ...)
2026-02-22 8:48 ` [RFC PATCH v4 16/27] mm: NP_OPS_RECLAIM - private node reclaim participation Gregory Price
@ 2026-02-22 8:48 ` Gregory Price
2026-02-22 8:48 ` [RFC PATCH v4 18/27] mm/memory: NP_OPS_NUMA_BALANCING - private node NUMA balancing Gregory Price
` (9 subsequent siblings)
26 siblings, 0 replies; 28+ messages in thread
From: Gregory Price @ 2026-02-22 8:48 UTC (permalink / raw)
To: lsf-pc
Cc: linux-kernel, linux-cxl, cgroups, linux-mm, linux-trace-kernel,
damon, kernel-team, gregkh, rafael, dakr, dave, jonathan.cameron,
dave.jiang, alison.schofield, vishal.l.verma, ira.weiny,
dan.j.williams, longman, akpm, david, lorenzo.stoakes,
Liam.Howlett, vbabka, rppt, surenb, mhocko, osalvador, ziy,
matthew.brost, joshua.hahnjy, rakie.kim, byungchul, gourry,
ying.huang, apopple, axelrasmussen, yuanchu, weixugc, yury.norov,
linux, mhiramat, mathieu.desnoyers, tj, hannes, mkoutny,
jackmanb, sj, baolin.wang, npache, ryan.roberts, dev.jain,
baohua, lance.yang, muchun.song, xu.xin16, chengming.zhou, jannh,
linmiaohe, nao.horiguchi, pfalcato, rientjes, shakeel.butt, riel,
harry.yoo, cl, roman.gushchin, chrisl, kasong, shikemeng,
nphamcs, bhe, zhengqi.arch, terry.bowman
The OOM killer must know whether killing a task can actually free
memory such that pressure is reduced.
A private node only contributes to relieving pressure if it participates
in both reclaim and demotion. Without this check, the OOM killer may
select an undeserving victim.
Introduce NP_OPS_OOM_ELIGIBLE and helpers node_oom_eligible() and
zone_oom_eligible().
Replace cpuset_mems_allowed_intersects() in oom_cpuset_eligible()
with oom_mems_intersect() that iterates N_MEMORY nodes and skips
ineligible private nodes.
Update constrained_alloc() to use zone_oom_eligible() for constraint
detection and node_oom_eligible() to exclude ineligible nodes from
totalpages accounting.
Remove cpuset_mems_allowed_intersects() as it has no remaining callers.
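For illustration, a service that wants its node counted by the OOM killer
opts in to both bits (callback names hypothetical):

static const struct node_private_ops my_ops = {
        .migrate_to    = my_migrate_to,
        .folio_migrate = my_folio_migrate,
        /* NP_OPS_OOM_ELIGIBLE == NP_OPS_RECLAIM | NP_OPS_DEMOTION */
        .flags         = NP_OPS_MIGRATION | NP_OPS_RECLAIM | NP_OPS_DEMOTION,
};

Setting only one of NP_OPS_RECLAIM or NP_OPS_DEMOTION leaves the node out of
OOM accounting and victim selection.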
Signed-off-by: Gregory Price <gourry@gourry.net>
---
include/linux/cpuset.h | 9 -------
include/linux/node_private.h | 3 +++
kernel/cgroup/cpuset.c | 17 ------------
mm/oom_kill.c | 52 ++++++++++++++++++++++++++++++++----
4 files changed, 50 insertions(+), 31 deletions(-)
diff --git a/include/linux/cpuset.h b/include/linux/cpuset.h
index 7b2f3f6b68a9..53ccfb00b277 100644
--- a/include/linux/cpuset.h
+++ b/include/linux/cpuset.h
@@ -97,9 +97,6 @@ static inline bool cpuset_zone_allowed(struct zone *z, gfp_t gfp_mask)
return true;
}
-extern int cpuset_mems_allowed_intersects(const struct task_struct *tsk1,
- const struct task_struct *tsk2);
-
#ifdef CONFIG_CPUSETS_V1
#define cpuset_memory_pressure_bump() \
do { \
@@ -241,12 +238,6 @@ static inline bool cpuset_zone_allowed(struct zone *z, gfp_t gfp_mask)
return true;
}
-static inline int cpuset_mems_allowed_intersects(const struct task_struct *tsk1,
- const struct task_struct *tsk2)
-{
- return 1;
-}
-
static inline void cpuset_memory_pressure_bump(void) {}
static inline void cpuset_task_status_allowed(struct seq_file *m,
diff --git a/include/linux/node_private.h b/include/linux/node_private.h
index 34be52383255..34d862f09e24 100644
--- a/include/linux/node_private.h
+++ b/include/linux/node_private.h
@@ -141,6 +141,9 @@ struct node_private_ops {
/* Kernel reclaim (kswapd, direct reclaim, OOM) operates on this node */
#define NP_OPS_RECLAIM BIT(4)
+/* Private node is OOM-eligible: reclaim can run and pages can be demoted here */
+#define NP_OPS_OOM_ELIGIBLE (NP_OPS_RECLAIM | NP_OPS_DEMOTION)
+
/**
* struct node_private - Per-node container for N_MEMORY_PRIVATE nodes
*
diff --git a/kernel/cgroup/cpuset.c b/kernel/cgroup/cpuset.c
index 1a597f0c7c6c..29789d544fd5 100644
--- a/kernel/cgroup/cpuset.c
+++ b/kernel/cgroup/cpuset.c
@@ -4530,23 +4530,6 @@ int cpuset_mem_spread_node(void)
return cpuset_spread_node(¤t->cpuset_mem_spread_rotor);
}
-/**
- * cpuset_mems_allowed_intersects - Does @tsk1's mems_allowed intersect @tsk2's?
- * @tsk1: pointer to task_struct of some task.
- * @tsk2: pointer to task_struct of some other task.
- *
- * Description: Return true if @tsk1's mems_allowed intersects the
- * mems_allowed of @tsk2. Used by the OOM killer to determine if
- * one of the task's memory usage might impact the memory available
- * to the other.
- **/
-
-int cpuset_mems_allowed_intersects(const struct task_struct *tsk1,
- const struct task_struct *tsk2)
-{
- return nodes_intersects(tsk1->mems_allowed, tsk2->mems_allowed);
-}
-
/**
* cpuset_print_current_mems_allowed - prints current's cpuset and mems_allowed
*
diff --git a/mm/oom_kill.c b/mm/oom_kill.c
index 5eb11fbba704..cd0d65ccd1e8 100644
--- a/mm/oom_kill.c
+++ b/mm/oom_kill.c
@@ -74,7 +74,45 @@ static inline bool is_memcg_oom(struct oom_control *oc)
return oc->memcg != NULL;
}
+/* Private nodes are only eligible if they support both reclaim and demotion */
+static inline bool node_oom_eligible(int nid)
+{
+ if (!node_state(nid, N_MEMORY_PRIVATE))
+ return true;
+ return (node_private_flags(nid) & NP_OPS_OOM_ELIGIBLE) ==
+ NP_OPS_OOM_ELIGIBLE;
+}
+
+static inline bool zone_oom_eligible(struct zone *zone, gfp_t gfp_mask)
+{
+ if (!node_oom_eligible(zone_to_nid(zone)))
+ return false;
+ return cpuset_zone_allowed(zone, gfp_mask);
+}
+
#ifdef CONFIG_NUMA
+/*
+ * Killing a task can only relieve system pressure if freed memory can be
+ * demoted there and reclaim can operate on the node's pages, so we
+ * omit private nodes that aren't eligible.
+ */
+static bool oom_mems_intersect(const struct task_struct *tsk1,
+ const struct task_struct *tsk2)
+{
+ int nid;
+
+ for_each_node_state(nid, N_MEMORY) {
+ if (!node_isset(nid, tsk1->mems_allowed))
+ continue;
+ if (!node_isset(nid, tsk2->mems_allowed))
+ continue;
+ if (!node_oom_eligible(nid))
+ continue;
+ return true;
+ }
+ return false;
+}
+
/**
* oom_cpuset_eligible() - check task eligibility for kill
* @start: task struct of which task to consider
@@ -107,9 +145,10 @@ static bool oom_cpuset_eligible(struct task_struct *start,
} else {
/*
* This is not a mempolicy constrained oom, so only
- * check the mems of tsk's cpuset.
+ * check the mems of tsk's cpuset, excluding private
+ * nodes that do not participate in kernel reclaim.
*/
- ret = cpuset_mems_allowed_intersects(current, tsk);
+ ret = oom_mems_intersect(current, tsk);
}
if (ret)
break;
@@ -291,16 +330,19 @@ static enum oom_constraint constrained_alloc(struct oom_control *oc)
return CONSTRAINT_MEMORY_POLICY;
}
- /* Check this allocation failure is caused by cpuset's wall function */
+ /* Check this allocation failure is caused by cpuset or private node constraints */
for_each_zone_zonelist_nodemask(zone, z, oc->zonelist,
highest_zoneidx, oc->nodemask)
- if (!cpuset_zone_allowed(zone, oc->gfp_mask))
+ if (!zone_oom_eligible(zone, oc->gfp_mask))
cpuset_limited = true;
if (cpuset_limited) {
oc->totalpages = total_swap_pages;
- for_each_node_mask(nid, cpuset_current_mems_allowed)
+ for_each_node_mask(nid, cpuset_current_mems_allowed) {
+ if (!node_oom_eligible(nid))
+ continue;
oc->totalpages += node_present_pages(nid);
+ }
return CONSTRAINT_CPUSET;
}
return CONSTRAINT_NONE;
--
2.53.0
^ permalink raw reply [flat|nested] 28+ messages in thread
* [RFC PATCH v4 18/27] mm/memory: NP_OPS_NUMA_BALANCING - private node NUMA balancing
2026-02-22 8:48 [LSF/MM/BPF TOPIC][RFC PATCH v4 00/27] Private Memory Nodes (w/ Compressed RAM) Gregory Price
` (16 preceding siblings ...)
2026-02-22 8:48 ` [RFC PATCH v4 17/27] mm/oom: NP_OPS_OOM_ELIGIBLE - private node OOM participation Gregory Price
@ 2026-02-22 8:48 ` Gregory Price
2026-02-22 8:48 ` [RFC PATCH v4 19/27] mm/compaction: NP_OPS_COMPACTION - private node compaction support Gregory Price
` (8 subsequent siblings)
26 siblings, 0 replies; 28+ messages in thread
From: Gregory Price @ 2026-02-22 8:48 UTC (permalink / raw)
To: lsf-pc
Cc: linux-kernel, linux-cxl, cgroups, linux-mm, linux-trace-kernel,
damon, kernel-team, gregkh, rafael, dakr, dave, jonathan.cameron,
dave.jiang, alison.schofield, vishal.l.verma, ira.weiny,
dan.j.williams, longman, akpm, david, lorenzo.stoakes,
Liam.Howlett, vbabka, rppt, surenb, mhocko, osalvador, ziy,
matthew.brost, joshua.hahnjy, rakie.kim, byungchul, gourry,
ying.huang, apopple, axelrasmussen, yuanchu, weixugc, yury.norov,
linux, mhiramat, mathieu.desnoyers, tj, hannes, mkoutny,
jackmanb, sj, baolin.wang, npache, ryan.roberts, dev.jain,
baohua, lance.yang, muchun.song, xu.xin16, chengming.zhou, jannh,
linmiaohe, nao.horiguchi, pfalcato, rientjes, shakeel.butt, riel,
harry.yoo, cl, roman.gushchin, chrisl, kasong, shikemeng,
nphamcs, bhe, zhengqi.arch, terry.bowman
Not all private nodes may wish to engage in NUMA balancing faults.
Add the NP_OPS_NUMA_BALANCING flag (BIT(5)) as an opt-in method.
Introduce folio_managed_allows_numa() helper:
- ZONE_DEVICE folios always return false (never NUMA-scanned)
- NP_OPS_NUMA_BALANCING filters for private nodes
In do_numa_page(), if a private-node folio with NP_OPS_PROTECT_WRITE
is still on its node after a failed/skipped migration, enforce
write-protection so the next write triggers handle_fault.
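Opting in looks roughly like the following (callback names hypothetical);
node_private_set_ops() rejects NP_OPS_NUMA_BALANCING without
NP_OPS_MIGRATION, since balancing must be able to migrate folios off the
node:

static const struct node_private_ops my_ops = {
        .migrate_to    = my_migrate_to,
        .folio_migrate = my_folio_migrate,
        .flags         = NP_OPS_MIGRATION | NP_OPS_NUMA_BALANCING,
};

node_private_set_ops(nid, &my_ops);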
Signed-off-by: Gregory Price <gourry@gourry.net>
---
drivers/base/node.c | 4 ++++
include/linux/node_private.h | 16 ++++++++++++++++
mm/memory.c | 11 +++++++++++
mm/mempolicy.c | 5 ++++-
4 files changed, 35 insertions(+), 1 deletion(-)
diff --git a/drivers/base/node.c b/drivers/base/node.c
index a4955b9b5b93..88aaac45e814 100644
--- a/drivers/base/node.c
+++ b/drivers/base/node.c
@@ -961,6 +961,10 @@ int node_private_set_ops(int nid, const struct node_private_ops *ops)
(ops->flags & NP_OPS_PROTECT_WRITE))
return -EINVAL;
+ if ((ops->flags & NP_OPS_NUMA_BALANCING) &&
+ !(ops->flags & NP_OPS_MIGRATION))
+ return -EINVAL;
+
mutex_lock(&node_private_lock);
np = rcu_dereference_protected(NODE_DATA(nid)->node_private,
lockdep_is_held(&node_private_lock));
diff --git a/include/linux/node_private.h b/include/linux/node_private.h
index 34d862f09e24..5ac60db1f044 100644
--- a/include/linux/node_private.h
+++ b/include/linux/node_private.h
@@ -140,6 +140,8 @@ struct node_private_ops {
#define NP_OPS_PROTECT_WRITE BIT(3)
/* Kernel reclaim (kswapd, direct reclaim, OOM) operates on this node */
#define NP_OPS_RECLAIM BIT(4)
+/* Allow NUMA balancing to scan and migrate folios on this node */
+#define NP_OPS_NUMA_BALANCING BIT(5)
/* Private node is OOM-eligible: reclaim can run and pages can be demoted here */
#define NP_OPS_OOM_ELIGIBLE (NP_OPS_RECLAIM | NP_OPS_DEMOTION)
@@ -263,6 +265,15 @@ static inline void folio_managed_split_cb(struct folio *original_folio,
}
#ifdef CONFIG_MEMORY_HOTPLUG
+static inline bool folio_managed_allows_numa(struct folio *folio)
+{
+ if (!folio_is_private_managed(folio))
+ return true;
+ if (folio_is_zone_device(folio))
+ return false;
+ return folio_private_flags(folio, NP_OPS_NUMA_BALANCING);
+}
+
static inline int folio_managed_allows_user_migrate(struct folio *folio)
{
if (folio_is_zone_device(folio))
@@ -443,6 +454,11 @@ int node_private_clear_ops(int nid, const struct node_private_ops *ops);
#else /* !CONFIG_NUMA || !CONFIG_MEMORY_HOTPLUG */
+static inline bool folio_managed_allows_numa(struct folio *folio)
+{
+ return !folio_is_zone_device(folio);
+}
+
static inline int folio_managed_allows_user_migrate(struct folio *folio)
{
return -ENOENT;
diff --git a/mm/memory.c b/mm/memory.c
index 0f78988befef..88a581baae40 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -78,6 +78,7 @@
#include <linux/sched/sysctl.h>
#include <linux/pgalloc.h>
#include <linux/uaccess.h>
+#include <linux/node_private.h>
#include <trace/events/kmem.h>
@@ -6041,6 +6042,12 @@ static vm_fault_t do_numa_page(struct vm_fault *vmf)
if (!folio || folio_is_zone_device(folio))
goto out_map;
+ /*
+ * We do not need to check private-node folios here because the private
+ * memory service either never opted in to NUMA balancing, or it did
+ * and we need to restore private PTE controls on the failure path.
+ */
+
nid = folio_nid(folio);
nr_pages = folio_nr_pages(folio);
@@ -6078,6 +6085,10 @@ static vm_fault_t do_numa_page(struct vm_fault *vmf)
/*
* Make it present again, depending on how arch implements
* non-accessible ptes, some can allow access by kernel mode.
+ *
+ * If the folio is still on a private node with NP_OPS_PROTECT_WRITE,
+ * enforce write-protection so the next write triggers handle_fault.
+ * This covers migration-failed and migration-skipped paths.
*/
if (unlikely(folio && folio_managed_wrprotect(folio))) {
writable = false;
diff --git a/mm/mempolicy.c b/mm/mempolicy.c
index 8ac014950e88..8a3a9916ab59 100644
--- a/mm/mempolicy.c
+++ b/mm/mempolicy.c
@@ -861,7 +861,10 @@ bool folio_can_map_prot_numa(struct folio *folio, struct vm_area_struct *vma,
{
int nid;
- if (!folio || folio_is_zone_device(folio) || folio_test_ksm(folio))
+ if (!folio || folio_test_ksm(folio))
+ return false;
+
+ if (unlikely(!folio_managed_allows_numa(folio)))
return false;
/* Also skip shared copy-on-write folios */
--
2.53.0
^ permalink raw reply [flat|nested] 28+ messages in thread
* [RFC PATCH v4 19/27] mm/compaction: NP_OPS_COMPACTION - private node compaction support
2026-02-22 8:48 [LSF/MM/BPF TOPIC][RFC PATCH v4 00/27] Private Memory Nodes (w/ Compressed RAM) Gregory Price
` (17 preceding siblings ...)
2026-02-22 8:48 ` [RFC PATCH v4 18/27] mm/memory: NP_OPS_NUMA_BALANCING - private node NUMA balancing Gregory Price
@ 2026-02-22 8:48 ` Gregory Price
2026-02-22 8:48 ` [RFC PATCH v4 20/27] mm/gup: NP_OPS_LONGTERM_PIN - private node longterm pin support Gregory Price
` (7 subsequent siblings)
26 siblings, 0 replies; 28+ messages in thread
From: Gregory Price @ 2026-02-22 8:48 UTC (permalink / raw)
To: lsf-pc
Cc: linux-kernel, linux-cxl, cgroups, linux-mm, linux-trace-kernel,
damon, kernel-team, gregkh, rafael, dakr, dave, jonathan.cameron,
dave.jiang, alison.schofield, vishal.l.verma, ira.weiny,
dan.j.williams, longman, akpm, david, lorenzo.stoakes,
Liam.Howlett, vbabka, rppt, surenb, mhocko, osalvador, ziy,
matthew.brost, joshua.hahnjy, rakie.kim, byungchul, gourry,
ying.huang, apopple, axelrasmussen, yuanchu, weixugc, yury.norov,
linux, mhiramat, mathieu.desnoyers, tj, hannes, mkoutny,
jackmanb, sj, baolin.wang, npache, ryan.roberts, dev.jain,
baohua, lance.yang, muchun.song, xu.xin16, chengming.zhou, jannh,
linmiaohe, nao.horiguchi, pfalcato, rientjes, shakeel.butt, riel,
harry.yoo, cl, roman.gushchin, chrisl, kasong, shikemeng,
nphamcs, bhe, zhengqi.arch, terry.bowman
Private node zones should not be compacted unless the service explicitly
opts in - as compaction requires migration and services may have
PFN-based metadata that needs updating.
Add a folio_migrate callback which fires from migrate_folio_move() for
each relocated folio before faults are unblocked.
Add zone_supports_compaction() which returns true for normal zones and
checks NP_OPS_COMPACTION for N_MEMORY_PRIVATE zones.
Filter three direct compaction zone loops:
- compaction_zonelist_suitable() (reclaimer eligibility)
- try_to_compact_pages() (direct compaction)
- compact_node() (proactive/manual compaction)
kcompactd paths are intentionally unfiltered -- the service is
responsible for starting kcompactd on its node.
NP_OPS_COMPACTION requires NP_OPS_MIGRATION.
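An illustrative opt-in sequence (callback names hypothetical; kcompactd_run()
is shown only to make the "service starts kcompactd" point concrete, a real
service may wire this up differently):

static const struct node_private_ops my_ops = {
        .migrate_to    = my_migrate_to,
        .folio_migrate = my_folio_migrate,
        .flags         = NP_OPS_MIGRATION | NP_OPS_COMPACTION,
};

node_private_set_ops(nid, &my_ops);
kcompactd_run(nid);   /* not started automatically for private nodes */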
Signed-off-by: Gregory Price <gourry@gourry.net>
---
drivers/base/node.c | 4 ++++
include/linux/node_private.h | 2 ++
mm/compaction.c | 26 ++++++++++++++++++++++++++
3 files changed, 32 insertions(+)
diff --git a/drivers/base/node.c b/drivers/base/node.c
index 88aaac45e814..da523aca18fa 100644
--- a/drivers/base/node.c
+++ b/drivers/base/node.c
@@ -965,6 +965,10 @@ int node_private_set_ops(int nid, const struct node_private_ops *ops)
!(ops->flags & NP_OPS_MIGRATION))
return -EINVAL;
+ if ((ops->flags & NP_OPS_COMPACTION) &&
+ !(ops->flags & NP_OPS_MIGRATION))
+ return -EINVAL;
+
mutex_lock(&node_private_lock);
np = rcu_dereference_protected(NODE_DATA(nid)->node_private,
lockdep_is_held(&node_private_lock));
diff --git a/include/linux/node_private.h b/include/linux/node_private.h
index 5ac60db1f044..fe0336773ddb 100644
--- a/include/linux/node_private.h
+++ b/include/linux/node_private.h
@@ -142,6 +142,8 @@ struct node_private_ops {
#define NP_OPS_RECLAIM BIT(4)
/* Allow NUMA balancing to scan and migrate folios on this node */
#define NP_OPS_NUMA_BALANCING BIT(5)
+/* Allow compaction to run on the node. Service must start kcompactd. */
+#define NP_OPS_COMPACTION BIT(6)
/* Private node is OOM-eligible: reclaim can run and pages can be demoted here */
#define NP_OPS_OOM_ELIGIBLE (NP_OPS_RECLAIM | NP_OPS_DEMOTION)
diff --git a/mm/compaction.c b/mm/compaction.c
index 6a65145b03d8..d8532b957ec6 100644
--- a/mm/compaction.c
+++ b/mm/compaction.c
@@ -24,9 +24,26 @@
#include <linux/page_owner.h>
#include <linux/psi.h>
#include <linux/cpuset.h>
+#include <linux/node_private.h>
#include "internal.h"
#ifdef CONFIG_COMPACTION
+
+/*
+ * Private node zones require NP_OPS_COMPACTION to opt in. Normal zones
+ * always support compaction.
+ */
+static inline bool zone_supports_compaction(struct zone *zone)
+{
+#ifdef CONFIG_NUMA
+ if (!node_state(zone_to_nid(zone), N_MEMORY_PRIVATE))
+ return true;
+ return zone_private_flags(zone, NP_OPS_COMPACTION);
+#else
+ return true;
+#endif
+}
+
/*
* Fragmentation score check interval for proactive compaction purposes.
*/
@@ -2443,6 +2460,9 @@ bool compaction_zonelist_suitable(struct alloc_context *ac, int order,
ac->highest_zoneidx, ac->nodemask) {
unsigned long available;
+ if (!zone_supports_compaction(zone))
+ continue;
+
/*
* Do not consider all the reclaimable memory because we do not
* want to trash just for a single high order allocation which
@@ -2832,6 +2852,9 @@ enum compact_result try_to_compact_pages(gfp_t gfp_mask, unsigned int order,
if (!numa_zone_alloc_allowed(alloc_flags, zone, gfp_mask))
continue;
+ if (!zone_supports_compaction(zone))
+ continue;
+
if (prio > MIN_COMPACT_PRIORITY
&& compaction_deferred(zone, order)) {
rc = max_t(enum compact_result, COMPACT_DEFERRED, rc);
@@ -2906,6 +2929,9 @@ static int compact_node(pg_data_t *pgdat, bool proactive)
if (!populated_zone(zone))
continue;
+ if (!zone_supports_compaction(zone))
+ continue;
+
if (fatal_signal_pending(current))
return -EINTR;
--
2.53.0
^ permalink raw reply [flat|nested] 28+ messages in thread
* [RFC PATCH v4 20/27] mm/gup: NP_OPS_LONGTERM_PIN - private node longterm pin support
2026-02-22 8:48 [LSF/MM/BPF TOPIC][RFC PATCH v4 00/27] Private Memory Nodes (w/ Compressed RAM) Gregory Price
` (18 preceding siblings ...)
2026-02-22 8:48 ` [RFC PATCH v4 19/27] mm/compaction: NP_OPS_COMPACTION - private node compaction support Gregory Price
@ 2026-02-22 8:48 ` Gregory Price
2026-02-22 8:48 ` [RFC PATCH v4 21/27] mm/memory-failure: add memory_failure callback to node_private_ops Gregory Price
` (6 subsequent siblings)
26 siblings, 0 replies; 28+ messages in thread
From: Gregory Price @ 2026-02-22 8:48 UTC (permalink / raw)
To: lsf-pc
Cc: linux-kernel, linux-cxl, cgroups, linux-mm, linux-trace-kernel,
damon, kernel-team, gregkh, rafael, dakr, dave, jonathan.cameron,
dave.jiang, alison.schofield, vishal.l.verma, ira.weiny,
dan.j.williams, longman, akpm, david, lorenzo.stoakes,
Liam.Howlett, vbabka, rppt, surenb, mhocko, osalvador, ziy,
matthew.brost, joshua.hahnjy, rakie.kim, byungchul, gourry,
ying.huang, apopple, axelrasmussen, yuanchu, weixugc, yury.norov,
linux, mhiramat, mathieu.desnoyers, tj, hannes, mkoutny,
jackmanb, sj, baolin.wang, npache, ryan.roberts, dev.jain,
baohua, lance.yang, muchun.song, xu.xin16, chengming.zhou, jannh,
linmiaohe, nao.horiguchi, pfalcato, rientjes, shakeel.butt, riel,
harry.yoo, cl, roman.gushchin, chrisl, kasong, shikemeng,
nphamcs, bhe, zhengqi.arch, terry.bowman
Private node folios should not be longterm-pinnable by default.
A pinned folio is frozen in place (no migration, compaction, or
reclaim), so the service loses control for the duration of the pin.
Some services may depend on hot-unpluggability and must disallow
longterm pinning. Others (accelerators with shared CPU-device state)
need pinning to work.
Add an NP_OPS_LONGTERM_PIN flag for services to opt in with. Hook into
folio_is_longterm_pinnable() in mm.h, which all GUP callers use, via an
out-of-line helper, node_private_allows_longterm_pin(), called only for
N_MEMORY_PRIVATE nodes.
Without the flag: folio_is_longterm_pinnable() returns false, migration
fails (no __GFP_PRIVATE in GFP mask) and pin_user_pages(FOLL_LONGTERM)
returns -ENOMEM.
With the flag: pin succeeds and the folio stays on the private node.
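A sketch of the caller-visible behavior, using one of the FOLL_LONGTERM GUP
paths (function and variable names are local to this example):

static int my_try_longterm_pin(unsigned long addr)
{
        struct page *page;
        int pinned;

        pinned = pin_user_pages_fast(addr, 1, FOLL_WRITE | FOLL_LONGTERM,
                                     &page);
        /*
         * Without NP_OPS_LONGTERM_PIN: folio_is_longterm_pinnable() returns
         * false, migration off the private node fails, pinned == -ENOMEM.
         * With the flag: pinned == 1 and the folio stays where it is.
         */
        if (pinned == 1)
                unpin_user_page(page);
        return pinned;
}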
Signed-off-by: Gregory Price <gourry@gourry.net>
---
drivers/base/node.c | 15 +++++++++++++++
include/linux/mm.h | 22 ++++++++++++++++++++++
include/linux/node_private.h | 2 ++
3 files changed, 39 insertions(+)
diff --git a/drivers/base/node.c b/drivers/base/node.c
index da523aca18fa..5d2487fd54f4 100644
--- a/drivers/base/node.c
+++ b/drivers/base/node.c
@@ -866,6 +866,21 @@ void register_memory_blocks_under_node_hotplug(int nid, unsigned long start_pfn,
static DEFINE_MUTEX(node_private_lock);
static bool node_private_initialized;
+/**
+ * node_private_allows_longterm_pin - Check if a private node allows longterm pinning
+ * @nid: Node identifier
+ *
+ * Out-of-line helper for folio_is_longterm_pinnable() since mm.h cannot
+ * include node_private.h (circular dependency).
+ *
+ * Returns true if the node has NP_OPS_LONGTERM_PIN set.
+ */
+bool node_private_allows_longterm_pin(int nid)
+{
+ return node_private_has_flag(nid, NP_OPS_LONGTERM_PIN);
+}
+EXPORT_SYMBOL_GPL(node_private_allows_longterm_pin);
+
/**
* node_private_register - Register a private node
* @nid: Node identifier
diff --git a/include/linux/mm.h b/include/linux/mm.h
index fb1819ad42c3..9088fd08aeb9 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -2192,6 +2192,13 @@ static inline bool is_zero_folio(const struct folio *folio)
/* MIGRATE_CMA and ZONE_MOVABLE do not allow pin folios */
#ifdef CONFIG_MIGRATION
+
+#ifdef CONFIG_NUMA
+bool node_private_allows_longterm_pin(int nid);
+#else
+static inline bool node_private_allows_longterm_pin(int nid) { return false; }
+#endif
+
static inline bool folio_is_longterm_pinnable(struct folio *folio)
{
#ifdef CONFIG_CMA
@@ -2215,6 +2222,21 @@ static inline bool folio_is_longterm_pinnable(struct folio *folio)
if (folio_is_fsdax(folio))
return false;
+ /*
+ * Private node folios are not longterm pinnable by default.
+ * Services that support pinning opt in via NP_OPS_LONGTERM_PIN.
+ * node_private_allows_longterm_pin() is out-of-line because
+ * node_private.h includes mm.h (circular dependency).
+ *
+ * Guarded by CONFIG_NUMA because on !CONFIG_NUMA the single-node
+ * node_state() stub returns true for node 0, which would make
+ * all folios non-pinnable via the false-returning stub.
+ */
+#ifdef CONFIG_NUMA
+ if (node_state(folio_nid(folio), N_MEMORY_PRIVATE))
+ return node_private_allows_longterm_pin(folio_nid(folio));
+#endif
+
/* Otherwise, non-movable zone folios can be pinned. */
return !folio_is_zone_movable(folio);
diff --git a/include/linux/node_private.h b/include/linux/node_private.h
index fe0336773ddb..7a7438fb9eda 100644
--- a/include/linux/node_private.h
+++ b/include/linux/node_private.h
@@ -144,6 +144,8 @@ struct node_private_ops {
#define NP_OPS_NUMA_BALANCING BIT(5)
/* Allow compaction to run on the node. Service must start kcompactd. */
#define NP_OPS_COMPACTION BIT(6)
+/* Allow longterm DMA pinning (RDMA, VFIO, etc.) of folios on this node */
+#define NP_OPS_LONGTERM_PIN BIT(7)
/* Private node is OOM-eligible: reclaim can run and pages can be demoted here */
#define NP_OPS_OOM_ELIGIBLE (NP_OPS_RECLAIM | NP_OPS_DEMOTION)
--
2.53.0
^ permalink raw reply [flat|nested] 28+ messages in thread
* [RFC PATCH v4 21/27] mm/memory-failure: add memory_failure callback to node_private_ops
2026-02-22 8:48 [LSF/MM/BPF TOPIC][RFC PATCH v4 00/27] Private Memory Nodes (w/ Compressed RAM) Gregory Price
` (19 preceding siblings ...)
2026-02-22 8:48 ` [RFC PATCH v4 20/27] mm/gup: NP_OPS_LONGTERM_PIN - private node longterm pin support Gregory Price
@ 2026-02-22 8:48 ` Gregory Price
2026-02-22 8:48 ` [RFC PATCH v4 22/27] mm/memory_hotplug: add add_private_memory_driver_managed() Gregory Price
` (5 subsequent siblings)
26 siblings, 0 replies; 28+ messages in thread
From: Gregory Price @ 2026-02-22 8:48 UTC (permalink / raw)
To: lsf-pc
Cc: linux-kernel, linux-cxl, cgroups, linux-mm, linux-trace-kernel,
damon, kernel-team, gregkh, rafael, dakr, dave, jonathan.cameron,
dave.jiang, alison.schofield, vishal.l.verma, ira.weiny,
dan.j.williams, longman, akpm, david, lorenzo.stoakes,
Liam.Howlett, vbabka, rppt, surenb, mhocko, osalvador, ziy,
matthew.brost, joshua.hahnjy, rakie.kim, byungchul, gourry,
ying.huang, apopple, axelrasmussen, yuanchu, weixugc, yury.norov,
linux, mhiramat, mathieu.desnoyers, tj, hannes, mkoutny,
jackmanb, sj, baolin.wang, npache, ryan.roberts, dev.jain,
baohua, lance.yang, muchun.song, xu.xin16, chengming.zhou, jannh,
linmiaohe, nao.horiguchi, pfalcato, rientjes, shakeel.butt, riel,
harry.yoo, cl, roman.gushchin, chrisl, kasong, shikemeng,
nphamcs, bhe, zhengqi.arch, terry.bowman
Add a void memory_failure notification callback to struct
node_private_ops so services managing N_MEMORY_PRIVATE nodes are notified
when a page on their node experiences a hardware error.
The callback is notification only -- the kernel always proceeds with
standard hwpoison handling for online pages.
The notification hook fires after TestSetPageHWPoison succeeds and
before get_hwpoison_page(), giving the service a chance to clean up.
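For example, a notification-only handler might look like this sketch
(my_account_poisoned_pfn is a hypothetical service helper):

static void my_memory_failure(struct folio *folio, unsigned long pfn,
                              int mf_flags)
{
        /*
         * Bookkeeping only: record the bad PFN so the service never hands
         * it out again. Hwpoison handling continues in the core kernel
         * after this returns.
         */
        my_account_poisoned_pfn(pfn);
}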
Signed-off-by: Gregory Price <gourry@gourry.net>
---
include/linux/node_private.h | 6 ++++++
mm/internal.h | 16 ++++++++++++++++
mm/memory-failure.c | 15 +++++++++++++++
3 files changed, 37 insertions(+)
diff --git a/include/linux/node_private.h b/include/linux/node_private.h
index 7a7438fb9eda..d2669f68ac20 100644
--- a/include/linux/node_private.h
+++ b/include/linux/node_private.h
@@ -113,6 +113,10 @@ struct node_reclaim_policy {
* watermark_boost lifecycle (kswapd will not clear it).
* If NULL, normal boost policy applies.
*
+ * @memory_failure: Notification of hardware error on a page on this node.
+ * [folio-referenced callback]
+ * Notification only, kernel always handles the failure.
+ *
* @flags: Operation exclusion flags (NP_OPS_* constants).
*
*/
@@ -127,6 +131,8 @@ struct node_private_ops {
vm_fault_t (*handle_fault)(struct folio *folio, struct vm_fault *vmf,
enum pgtable_level level);
void (*reclaim_policy)(int nid, struct node_reclaim_policy *policy);
+ void (*memory_failure)(struct folio *folio, unsigned long pfn,
+ int mf_flags);
unsigned long flags;
};
diff --git a/mm/internal.h b/mm/internal.h
index db32cb2d7a29..64467ca774f1 100644
--- a/mm/internal.h
+++ b/mm/internal.h
@@ -1608,6 +1608,22 @@ static inline void node_private_reclaim_policy(int nid,
}
#endif
+static inline void folio_managed_memory_failure(struct folio *folio,
+ unsigned long pfn,
+ int mf_flags)
+{
+ /* Zone device pages handle memory failure via dev_pagemap_ops */
+ if (folio_is_zone_device(folio))
+ return;
+ if (folio_is_private_node(folio)) {
+ const struct node_private_ops *ops =
+ folio_node_private_ops(folio);
+
+ if (ops && ops->memory_failure)
+ ops->memory_failure(folio, pfn, mf_flags);
+ }
+}
+
struct vm_struct *__get_vm_area_node(unsigned long size,
unsigned long align, unsigned long shift,
unsigned long vm_flags, unsigned long start,
diff --git a/mm/memory-failure.c b/mm/memory-failure.c
index c80c2907da33..79c91d44ec1e 100644
--- a/mm/memory-failure.c
+++ b/mm/memory-failure.c
@@ -2379,6 +2379,15 @@ int memory_failure(unsigned long pfn, int flags)
goto unlock_mutex;
}
+ /*
+ * Notify private-node services about the hardware error so they
+ * can update internal tracking (e.g., CXL poison lists, stop
+ * demoting to failing DIMMs). This is notification only -- the
+ * kernel proceeds with standard hwpoison handling regardless.
+ */
+ if (unlikely(page_is_private_managed(p)))
+ folio_managed_memory_failure(page_folio(p), pfn, flags);
+
/*
* We need/can do nothing about count=0 pages.
* 1) it's a free page, and therefore in safe hand:
@@ -2825,6 +2834,12 @@ static int soft_offline_in_use_page(struct page *page)
return 0;
}
+ if (!folio_managed_allows_migrate(folio)) {
+ pr_info("%#lx: cannot migrate private node folio\n", pfn);
+ folio_put(folio);
+ return -EBUSY;
+ }
+
isolated = isolate_folio_to_list(folio, &pagelist);
/*
--
2.53.0
^ permalink raw reply [flat|nested] 28+ messages in thread
* [RFC PATCH v4 22/27] mm/memory_hotplug: add add_private_memory_driver_managed()
2026-02-22 8:48 [LSF/MM/BPF TOPIC][RFC PATCH v4 00/27] Private Memory Nodes (w/ Compressed RAM) Gregory Price
` (20 preceding siblings ...)
2026-02-22 8:48 ` [RFC PATCH v4 21/27] mm/memory-failure: add memory_failure callback to node_private_ops Gregory Price
@ 2026-02-22 8:48 ` Gregory Price
2026-02-22 8:48 ` [RFC PATCH v4 23/27] mm/cram: add compressed ram memory management subsystem Gregory Price
` (4 subsequent siblings)
26 siblings, 0 replies; 28+ messages in thread
From: Gregory Price @ 2026-02-22 8:48 UTC (permalink / raw)
To: lsf-pc
Cc: linux-kernel, linux-cxl, cgroups, linux-mm, linux-trace-kernel,
damon, kernel-team, gregkh, rafael, dakr, dave, jonathan.cameron,
dave.jiang, alison.schofield, vishal.l.verma, ira.weiny,
dan.j.williams, longman, akpm, david, lorenzo.stoakes,
Liam.Howlett, vbabka, rppt, surenb, mhocko, osalvador, ziy,
matthew.brost, joshua.hahnjy, rakie.kim, byungchul, gourry,
ying.huang, apopple, axelrasmussen, yuanchu, weixugc, yury.norov,
linux, mhiramat, mathieu.desnoyers, tj, hannes, mkoutny,
jackmanb, sj, baolin.wang, npache, ryan.roberts, dev.jain,
baohua, lance.yang, muchun.song, xu.xin16, chengming.zhou, jannh,
linmiaohe, nao.horiguchi, pfalcato, rientjes, shakeel.butt, riel,
harry.yoo, cl, roman.gushchin, chrisl, kasong, shikemeng,
nphamcs, bhe, zhengqi.arch, terry.bowman
Add a new function for drivers to hotplug memory as N_MEMORY_PRIVATE.
This function combines node_private_region_register() with
__add_memory_driver_managed() to ensure proper ordering:
1. Register the private region first (sets private node context)
2. Then hotplug the memory (sets N_MEMORY_PRIVATE)
3. On failure, unregister the private region to avoid leaving the
node in an inconsistent state.
When the last of the node's memory is removed, hotplug also removes the
private node context. If migration is not supported but folios on the
node are still online at hot-unplug time, a warning fires (likely a bug
in the driver).
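For illustration, a hedged sketch of the add/remove pairing a driver
might use (the "mydrv" naming and the statically allocated node_private
are placeholders):
static struct node_private my_np;	/* driver-owned, .owner set at init */
/* Bring-up: registers the private context, then hotplugs the range */
rc = add_private_memory_driver_managed(nid, start, size,
				       "System RAM (mydrv)", 0,
				       MMOP_ONLINE_MOVABLE, &my_np);
/* Tear-down counterpart: offline/remove, then drop the private context.
 * -EBUSY: memory removed, but other private ranges still back the node. */
rc = offline_and_remove_private_memory(nid, start, size);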
Signed-off-by: Gregory Price <gourry@gourry.net>
---
include/linux/memory_hotplug.h | 11 +++
include/linux/mmzone.h | 12 ++++
mm/memory_hotplug.c | 122 ++++++++++++++++++++++++++++++---
3 files changed, 135 insertions(+), 10 deletions(-)
diff --git a/include/linux/memory_hotplug.h b/include/linux/memory_hotplug.h
index 1f19f08552ea..e5abade9450a 100644
--- a/include/linux/memory_hotplug.h
+++ b/include/linux/memory_hotplug.h
@@ -293,6 +293,7 @@ extern int offline_pages(unsigned long start_pfn, unsigned long nr_pages,
extern int remove_memory(u64 start, u64 size);
extern void __remove_memory(u64 start, u64 size);
extern int offline_and_remove_memory(u64 start, u64 size);
+extern int offline_and_remove_private_memory(int nid, u64 start, u64 size);
#else
static inline void try_offline_node(int nid) {}
@@ -309,6 +310,12 @@ static inline int remove_memory(u64 start, u64 size)
}
static inline void __remove_memory(u64 start, u64 size) {}
+
+static inline int offline_and_remove_private_memory(int nid, u64 start,
+ u64 size)
+{
+ return -EOPNOTSUPP;
+}
#endif /* CONFIG_MEMORY_HOTREMOVE */
#ifdef CONFIG_MEMORY_HOTPLUG
@@ -326,6 +333,10 @@ int __add_memory_driver_managed(int nid, u64 start, u64 size,
extern int add_memory_driver_managed(int nid, u64 start, u64 size,
const char *resource_name,
mhp_t mhp_flags);
+int add_private_memory_driver_managed(int nid, u64 start, u64 size,
+ const char *resource_name,
+ mhp_t mhp_flags, enum mmop online_type,
+ struct node_private *np);
extern void move_pfn_range_to_zone(struct zone *zone, unsigned long start_pfn,
unsigned long nr_pages,
struct vmem_altmap *altmap, int migratetype,
diff --git a/include/linux/mmzone.h b/include/linux/mmzone.h
index 992eb1c5a2c6..cc532b67ad3f 100644
--- a/include/linux/mmzone.h
+++ b/include/linux/mmzone.h
@@ -1524,6 +1524,18 @@ typedef struct pglist_data {
#endif
} pg_data_t;
+#ifdef CONFIG_NUMA
+static inline bool pgdat_is_private(pg_data_t *pgdat)
+{
+ return pgdat->private;
+}
+#else
+static inline bool pgdat_is_private(pg_data_t *pgdat)
+{
+ return false;
+}
+#endif
+
#define node_present_pages(nid) (NODE_DATA(nid)->node_present_pages)
#define node_spanned_pages(nid) (NODE_DATA(nid)->node_spanned_pages)
diff --git a/mm/memory_hotplug.c b/mm/memory_hotplug.c
index d2dc527bd5b0..9d72f44a30dc 100644
--- a/mm/memory_hotplug.c
+++ b/mm/memory_hotplug.c
@@ -36,6 +36,7 @@
#include <linux/rmap.h>
#include <linux/module.h>
#include <linux/node.h>
+#include <linux/node_private.h>
#include <asm/tlbflush.h>
@@ -1173,8 +1174,7 @@ int online_pages(unsigned long pfn, unsigned long nr_pages,
move_pfn_range_to_zone(zone, pfn, nr_pages, NULL, MIGRATE_MOVABLE,
true);
- if (!node_state(nid, N_MEMORY)) {
- /* Adding memory to the node for the first time */
+ if (!node_state(nid, N_MEMORY) && !node_state(nid, N_MEMORY_PRIVATE)) {
node_arg.nid = nid;
ret = node_notify(NODE_ADDING_FIRST_MEMORY, &node_arg);
ret = notifier_to_errno(ret);
@@ -1208,8 +1208,12 @@ int online_pages(unsigned long pfn, unsigned long nr_pages,
online_pages_range(pfn, nr_pages);
adjust_present_page_count(pfn_to_page(pfn), group, nr_pages);
- if (node_arg.nid >= 0)
- node_set_state(nid, N_MEMORY);
+ if (node_arg.nid >= 0) {
+ if (pgdat_is_private(NODE_DATA(nid)))
+ node_set_state(nid, N_MEMORY_PRIVATE);
+ else
+ node_set_state(nid, N_MEMORY);
+ }
if (need_zonelists_rebuild)
build_all_zonelists(NULL);
@@ -1227,8 +1231,14 @@ int online_pages(unsigned long pfn, unsigned long nr_pages,
/* reinitialise watermarks and update pcp limits */
init_per_zone_wmark_min();
- kswapd_run(nid);
- kcompactd_run(nid);
+ /*
+ * Don't start reclaim/compaction daemons for private nodes.
+ * Private node services will decide whether to start these services.
+ */
+ if (!pgdat_is_private(NODE_DATA(nid))) {
+ kswapd_run(nid);
+ kcompactd_run(nid);
+ }
if (node_arg.nid >= 0)
/* First memory added successfully. Notify consumers. */
@@ -1722,6 +1732,54 @@ int add_memory_driver_managed(int nid, u64 start, u64 size,
}
EXPORT_SYMBOL_GPL(add_memory_driver_managed);
+/**
+ * add_private_memory_driver_managed - add driver-managed N_MEMORY_PRIVATE memory
+ * @nid: NUMA node ID (or memory group ID when MHP_NID_IS_MGID is set)
+ * @start: Start physical address
+ * @size: Size in bytes
+ * @resource_name: "System RAM ($DRIVER)" format
+ * @mhp_flags: Memory hotplug flags
+ * @online_type: MMOP_* online type
+ * @np: Driver-owned node_private structure (owner, refcount)
+ *
+ * Registers node_private first, then hotplugs the memory.
+ *
+ * On failure, unregisters the node_private.
+ */
+int add_private_memory_driver_managed(int nid, u64 start, u64 size,
+ const char *resource_name,
+ mhp_t mhp_flags, enum mmop online_type,
+ struct node_private *np)
+{
+ struct memory_group *group;
+ int real_nid = nid;
+ int rc;
+
+ if (!np)
+ return -EINVAL;
+
+ if (mhp_flags & MHP_NID_IS_MGID) {
+ group = memory_group_find_by_id(nid);
+ if (!group)
+ return -EINVAL;
+ real_nid = group->nid;
+ }
+
+ rc = node_private_register(real_nid, np);
+ if (rc)
+ return rc;
+
+ rc = __add_memory_driver_managed(nid, start, size, resource_name,
+ mhp_flags, online_type);
+ if (rc) {
+ node_private_unregister(real_nid);
+ return rc;
+ }
+
+ return 0;
+}
+EXPORT_SYMBOL_GPL(add_private_memory_driver_managed);
+
/*
* Platforms should define arch_get_mappable_range() that provides
* maximum possible addressable physical memory range for which the
@@ -1872,6 +1930,15 @@ static void do_migrate_range(unsigned long start_pfn, unsigned long end_pfn)
goto put_folio;
}
+ /* Private nodes w/o migration must ensure folios are offline */
+ if (folio_is_private_node(folio) &&
+ !folio_private_flags(folio, NP_OPS_MIGRATION)) {
+ WARN_ONCE(1, "hot-unplug on non-migratable node %d pfn %lx\n",
+ folio_nid(folio), pfn);
+ pfn = folio_pfn(folio) + folio_nr_pages(folio) - 1;
+ goto put_folio;
+ }
+
if (!isolate_folio_to_list(folio, &source)) {
if (__ratelimit(&migrate_rs)) {
pr_warn("failed to isolate pfn %lx\n",
@@ -2014,8 +2081,8 @@ int offline_pages(unsigned long start_pfn, unsigned long nr_pages,
/*
* Check whether the node will have no present pages after we offline
- * 'nr_pages' more. If so, we know that the node will become empty, and
- * so we will clear N_MEMORY for it.
+ * 'nr_pages' more. If so, send pre-notification for last memory removal.
+ * We will clear N_MEMORY(_PRIVATE) if this is the case.
*/
if (nr_pages >= pgdat->node_present_pages) {
node_arg.nid = node;
@@ -2108,8 +2175,12 @@ int offline_pages(unsigned long start_pfn, unsigned long nr_pages,
* Make sure to mark the node as memory-less before rebuilding the zone
* list. Otherwise this node would still appear in the fallback lists.
*/
- if (node_arg.nid >= 0)
- node_clear_state(node, N_MEMORY);
+ if (node_arg.nid >= 0) {
+ if (node_state(node, N_MEMORY))
+ node_clear_state(node, N_MEMORY);
+ else if (node_state(node, N_MEMORY_PRIVATE))
+ node_clear_state(node, N_MEMORY_PRIVATE);
+ }
if (!populated_zone(zone)) {
zone_pcp_reset(zone);
build_all_zonelists(NULL);
@@ -2461,4 +2532,35 @@ int offline_and_remove_memory(u64 start, u64 size)
return rc;
}
EXPORT_SYMBOL_GPL(offline_and_remove_memory);
+
+/**
+ * offline_and_remove_private_memory - offline, remove, and unregister private memory
+ * @nid: NUMA node ID of the private memory
+ * @start: Start physical address
+ * @size: Size in bytes
+ *
+ * Counterpart to add_private_memory_driver_managed(). Offlines and removes
+ * the memory range, then attempts to unregister the node_private.
+ *
+ * offline_and_remove_memory() clears N_MEMORY_PRIVATE when the last block
+ * is offlined, which allows node_private_unregister() to clear the
+ * pgdat->node_private pointer. If other private memory ranges remain on
+ * the node, node_private_unregister() returns -EBUSY (N_MEMORY_PRIVATE
+ * is still set) and the node_private remains registered.
+ *
+ * Return: 0 on full success (memory removed and node_private unregistered),
+ * -EBUSY if memory was removed but node still has other private memory,
+ * other negative error code if offline/remove failed.
+ */
+int offline_and_remove_private_memory(int nid, u64 start, u64 size)
+{
+ int rc;
+
+ rc = offline_and_remove_memory(start, size);
+ if (rc)
+ return rc;
+
+ return node_private_unregister(nid);
+}
+EXPORT_SYMBOL_GPL(offline_and_remove_private_memory);
#endif /* CONFIG_MEMORY_HOTREMOVE */
--
2.53.0
^ permalink raw reply [flat|nested] 28+ messages in thread
* [RFC PATCH v4 23/27] mm/cram: add compressed ram memory management subsystem
2026-02-22 8:48 [LSF/MM/BPF TOPIC][RFC PATCH v4 00/27] Private Memory Nodes (w/ Compressed RAM) Gregory Price
` (21 preceding siblings ...)
2026-02-22 8:48 ` [RFC PATCH v4 22/27] mm/memory_hotplug: add add_private_memory_driver_managed() Gregory Price
@ 2026-02-22 8:48 ` Gregory Price
2026-02-22 8:48 ` [RFC PATCH v4 24/27] cxl/core: Add cxl_sysram region type Gregory Price
` (3 subsequent siblings)
26 siblings, 0 replies; 28+ messages in thread
From: Gregory Price @ 2026-02-22 8:48 UTC (permalink / raw)
To: lsf-pc
Cc: linux-kernel, linux-cxl, cgroups, linux-mm, linux-trace-kernel,
damon, kernel-team, gregkh, rafael, dakr, dave, jonathan.cameron,
dave.jiang, alison.schofield, vishal.l.verma, ira.weiny,
dan.j.williams, longman, akpm, david, lorenzo.stoakes,
Liam.Howlett, vbabka, rppt, surenb, mhocko, osalvador, ziy,
matthew.brost, joshua.hahnjy, rakie.kim, byungchul, gourry,
ying.huang, apopple, axelrasmussen, yuanchu, weixugc, yury.norov,
linux, mhiramat, mathieu.desnoyers, tj, hannes, mkoutny,
jackmanb, sj, baolin.wang, npache, ryan.roberts, dev.jain,
baohua, lance.yang, muchun.song, xu.xin16, chengming.zhou, jannh,
linmiaohe, nao.horiguchi, pfalcato, rientjes, shakeel.butt, riel,
harry.yoo, cl, roman.gushchin, chrisl, kasong, shikemeng,
nphamcs, bhe, zhengqi.arch, terry.bowman
Add the CRAM (Compressed RAM) subsystem that manages folios demoted
to N_MEMORY_PRIVATE nodes via the standard kernel LRU.
We limit entry into CRAM to demotion so that drivers have a way to
close off access, which allows the system to stabilize under memory
pressure (the device can run out of real memory when compression ratios
drop too far).
We use write-protection to prevent unbounded writes to compressed
memory pages; such writes can cause runaway compression-ratio loss with
no reliable way to prevent the degenerate case (cascading poisons).
CRAM provides the bridge between the mm/ private node infrastructure
and compressed memory hardware. Folios are aged by kswapd on the
private node and reclaimed to swap when the device signals pressure.
Write faults trigger promotion back to regular DRAM via the
ops->handle_fault callback.
Device pressure is communicated via watermark_boost on the private
node's zone.
CRAM registers node_private_ops with:
- handle_fault: promotes folio back to DRAM on write
- migrate_to: custom demotion to the CRAM node
- folio_migrate: (no-op)
- free_folio: zeroes pages on free to scrub stale data
- reclaim_policy: provides mayswap/writeback/boost overrides
- flags: NP_OPS_MIGRATION | NP_OPS_DEMOTION |
NP_OPS_NUMA_BALANCING | NP_OPS_PROTECT_WRITE |
NP_OPS_RECLAIM
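A rough sketch of how a device driver might hook into CRAM (the flush
callback body and the pressure source are placeholders; only the
cram.h entry points come from this patch):
static int my_flush_cb(struct folio *folio, void *private)
{
	/* 0: resolved (return page to buddy), 1: deferred (driver holds
	 * a ref), 2: buffer full (core zeroes the page locally) */
	return my_device_queue_flush(folio, private);	/* hypothetical */
}
rc = cram_register_private_node(nid, &my_dev, my_flush_cb, &my_dev);
/* Device signals pressure in [0, CRAM_PRESSURE_MAX]; at max, new
 * demotions into the node are blocked. */
cram_set_pressure(nid, 750);
cram_clear_pressure(nid);
cram_unregister_private_node(nid);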
Signed-off-by: Gregory Price <gourry@gourry.net>
---
include/linux/cram.h | 66 ++++++
mm/Kconfig | 10 +
mm/Makefile | 1 +
mm/cram.c | 508 +++++++++++++++++++++++++++++++++++++++++++
4 files changed, 585 insertions(+)
create mode 100644 include/linux/cram.h
create mode 100644 mm/cram.c
diff --git a/include/linux/cram.h b/include/linux/cram.h
new file mode 100644
index 000000000000..a3c10362fd4f
--- /dev/null
+++ b/include/linux/cram.h
@@ -0,0 +1,66 @@
+/* SPDX-License-Identifier: GPL-2.0 */
+#ifndef _LINUX_CRAM_H
+#define _LINUX_CRAM_H
+
+#include <linux/mm_types.h>
+
+struct folio;
+struct list_head;
+struct vm_fault;
+
+#define CRAM_PRESSURE_MAX 1000
+
+/**
+ * cram_flush_cb_t - Driver callback invoked when a folio on a private node
+ * is freed (refcount reaches zero).
+ * @folio: the folio being freed
+ * @private: opaque driver data passed at registration
+ *
+ * Return:
+ * 0: Flush resolved -- page should return to buddy allocator (e.g., flush
+ * record bit was set, meaning this free is from our own flush resolution)
+ * 1: Page deferred -- driver took a reference, page will be flushed later.
+ * Do NOT return to buddy allocator.
+ * 2: Buffer full -- caller should zero the page and return to buddy.
+ */
+typedef int (*cram_flush_cb_t)(struct folio *folio, void *private);
+
+#ifdef CONFIG_CRAM
+
+int cram_register_private_node(int nid, void *owner,
+ cram_flush_cb_t flush_cb, void *flush_data);
+int cram_unregister_private_node(int nid);
+int cram_unpurge(int nid);
+void cram_set_pressure(int nid, unsigned int pressure);
+void cram_clear_pressure(int nid);
+
+#else /* !CONFIG_CRAM */
+
+static inline int cram_register_private_node(int nid, void *owner,
+ cram_flush_cb_t flush_cb,
+ void *flush_data)
+{
+ return -ENODEV;
+}
+
+static inline int cram_unregister_private_node(int nid)
+{
+ return -ENODEV;
+}
+
+static inline int cram_unpurge(int nid)
+{
+ return -ENODEV;
+}
+
+static inline void cram_set_pressure(int nid, unsigned int pressure)
+{
+}
+
+static inline void cram_clear_pressure(int nid)
+{
+}
+
+#endif /* CONFIG_CRAM */
+
+#endif /* _LINUX_CRAM_H */
diff --git a/mm/Kconfig b/mm/Kconfig
index bd0ea5454af8..054462b954d8 100644
--- a/mm/Kconfig
+++ b/mm/Kconfig
@@ -662,6 +662,16 @@ config MIGRATION
config DEVICE_MIGRATION
def_bool MIGRATION && ZONE_DEVICE
+config CRAM
+ bool "Compressed RAM - private node memory management"
+ depends on NUMA
+ depends on MIGRATION
+ depends on MEMORY_HOTPLUG
+ help
+ Enables management of N_MEMORY_PRIVATE nodes for compressed RAM
+ and similar use cases. Provides demotion, promotion, and lifecycle
+ management for private memory nodes.
+
config ARCH_ENABLE_HUGEPAGE_MIGRATION
bool
diff --git a/mm/Makefile b/mm/Makefile
index 2d0570a16e5b..0e1421512643 100644
--- a/mm/Makefile
+++ b/mm/Makefile
@@ -98,6 +98,7 @@ obj-$(CONFIG_MEMTEST) += memtest.o
obj-$(CONFIG_MIGRATION) += migrate.o
obj-$(CONFIG_NUMA) += memory-tiers.o
obj-$(CONFIG_DEVICE_MIGRATION) += migrate_device.o
+obj-$(CONFIG_CRAM) += cram.o
obj-$(CONFIG_TRANSPARENT_HUGEPAGE) += huge_memory.o khugepaged.o
obj-$(CONFIG_PAGE_COUNTER) += page_counter.o
obj-$(CONFIG_LIVEUPDATE) += memfd_luo.o
diff --git a/mm/cram.c b/mm/cram.c
new file mode 100644
index 000000000000..6709e61f5b9d
--- /dev/null
+++ b/mm/cram.c
@@ -0,0 +1,508 @@
+// SPDX-License-Identifier: GPL-2.0
+/*
+ * mm/cram.c - Compressed RAM / private node memory management
+ *
+ * Copyright 2026 Meta Technologies Inc.
+ * Author: Gregory Price <gourry@gourry.net>
+ *
+ * Manages folios demoted to N_MEMORY_PRIVATE nodes via the standard kernel
+ * LRU. Folios are aged by kswapd on the private node and reclaimed to swap
+ * (demotion is suppressed for private nodes). Write faults trigger promotion
+ * back to regular DRAM via the ops->handle_fault callback.
+ *
+ * All reclaim/demotion uses the standard vmscan infrastructure. Device pressure
+ * is communicated via watermark_boost on the private node's zone.
+ */
+
+#include <linux/atomic.h>
+#include <linux/cpuset.h>
+#include <linux/cram.h>
+#include <linux/errno.h>
+#include <linux/gfp.h>
+#include <linux/jiffies.h>
+#include <linux/highmem.h>
+#include <linux/memory-tiers.h>
+#include <linux/list.h>
+#include <linux/migrate.h>
+#include <linux/mm.h>
+#include <linux/huge_mm.h>
+#include <linux/mmzone.h>
+#include <linux/mutex.h>
+#include <linux/nodemask.h>
+#include <linux/node_private.h>
+#include <linux/pagemap.h>
+#include <linux/rcupdate.h>
+#include <linux/refcount.h>
+#include <linux/swap.h>
+
+#include "internal.h"
+
+struct cram_node {
+ void *owner;
+ bool purged; /* node is being torn down */
+ unsigned int pressure;
+ refcount_t refcount;
+ cram_flush_cb_t flush_cb; /* optional driver flush callback */
+ void *flush_data; /* opaque data for flush_cb */
+};
+
+static struct cram_node *cram_nodes[MAX_NUMNODES];
+static DEFINE_MUTEX(cram_mutex);
+
+static inline bool cram_valid_nid(int nid)
+{
+ return nid >= 0 && nid < MAX_NUMNODES;
+}
+
+static inline struct cram_node *get_cram_node(int nid)
+{
+ struct cram_node *cn;
+
+ if (!cram_valid_nid(nid))
+ return NULL;
+
+ rcu_read_lock();
+ cn = rcu_dereference(cram_nodes[nid]);
+ if (cn && !refcount_inc_not_zero(&cn->refcount))
+ cn = NULL;
+ rcu_read_unlock();
+
+ return cn;
+}
+
+static inline void put_cram_node(struct cram_node *cn)
+{
+ if (cn)
+ refcount_dec(&cn->refcount);
+}
+
+static void cram_zero_folio(struct folio *folio)
+{
+ unsigned int i, nr = folio_nr_pages(folio);
+
+ if (want_init_on_free())
+ return;
+
+ for (i = 0; i < nr; i++)
+ clear_highpage(folio_page(folio, i));
+}
+
+static bool cram_free_folio_cb(struct folio *folio)
+{
+ int nid = folio_nid(folio);
+ struct cram_node *cn;
+ int ret;
+
+ cn = get_cram_node(nid);
+ if (!cn)
+ goto zero_and_free;
+
+ if (!cn->flush_cb)
+ goto zero_and_free_put;
+
+ ret = cn->flush_cb(folio, cn->flush_data);
+ put_cram_node(cn);
+
+ switch (ret) {
+ case 0:
+ /* Flush resolved: return to buddy (already zeroed by device) */
+ return false;
+ case 1:
+ /* Deferred: driver holds a ref, do not free to buddy */
+ return true;
+ case 2:
+ default:
+ /* Buffer full or unknown: zero locally, return to buddy */
+ goto zero_and_free;
+ }
+
+zero_and_free_put:
+ put_cram_node(cn);
+zero_and_free:
+ cram_zero_folio(folio);
+ return false;
+}
+
+static struct folio *alloc_cram_folio(struct folio *src, unsigned long private)
+{
+ int nid = (int)private;
+ unsigned int order = folio_order(src);
+ gfp_t gfp = GFP_PRIVATE | __GFP_KSWAPD_RECLAIM |
+ __GFP_HIGHMEM | __GFP_MOVABLE |
+ __GFP_NOWARN | __GFP_NORETRY;
+
+ /* Stop allocating if backpressure fired mid-batch */
+ if (node_private_migration_blocked(nid))
+ return NULL;
+
+ if (order)
+ gfp |= __GFP_COMP;
+
+ return __folio_alloc_node(gfp, order, nid);
+}
+
+static void cram_put_new_folio(struct folio *folio, unsigned long private)
+{
+ cram_zero_folio(folio);
+ folio_put(folio);
+}
+
+/*
+ * Allocate a DRAM folio for promotion out of a private node.
+ *
+ * Unlike alloc_migration_target(), this does NOT strip __GFP_RECLAIM for
+ * large folios, the generic helper does that because THP allocations are
+ * opportunistic, but promotion from a private node is mandatory: the page
+ * MUST move to DRAM or the process cannot make forward progress.
+ *
+ * __GFP_RETRY_MAYFAIL tells the allocator to try hard (multiple reclaim
+ * rounds, wait for writeback) before giving up.
+ */
+static struct folio *alloc_cram_promote_folio(struct folio *src,
+ unsigned long private)
+{
+ int nid = (int)private;
+ unsigned int order = folio_order(src);
+ gfp_t gfp = GFP_HIGHUSER_MOVABLE | __GFP_RETRY_MAYFAIL;
+
+ if (order)
+ gfp |= __GFP_COMP;
+
+ return __folio_alloc(gfp, order, nid, NULL);
+}
+
+static int cram_migrate_to(struct list_head *demote_folios, int to_nid,
+ enum migrate_mode mode,
+ enum migrate_reason reason,
+ unsigned int *nr_succeeded)
+{
+ struct cram_node *cn;
+ unsigned int nr_success = 0;
+ int ret = 0;
+
+ cn = get_cram_node(to_nid);
+ if (!cn)
+ return -ENODEV;
+
+ if (cn->purged) {
+ ret = -ENODEV;
+ goto out;
+ }
+
+ /* Block new demotions at maximum pressure */
+ if (READ_ONCE(cn->pressure) >= CRAM_PRESSURE_MAX) {
+ ret = -ENOSPC;
+ goto out;
+ }
+
+ ret = migrate_pages(demote_folios, alloc_cram_folio, cram_put_new_folio,
+ (unsigned long)to_nid, mode, reason,
+ &nr_success);
+
+ /*
+ * migrate_folio_move() calls folio_add_lru() for each migrated
+ * folio, but that only adds the folio to a per-CPU batch,
+ * PG_lru is not set until the batch is drained. Drain now so
+ * that cram_fault() can isolate these folios immediately.
+ *
+ * Use lru_add_drain_all() because migrate_pages() may process
+ * folios across CPUs, and the local drain might miss batches
+ * filled on other CPUs.
+ */
+ if (nr_success)
+ lru_add_drain_all();
+out:
+ put_cram_node(cn);
+ if (nr_succeeded)
+ *nr_succeeded = nr_success;
+ return ret;
+}
+
+static void cram_release_ptl(struct vm_fault *vmf, enum pgtable_level level)
+{
+ if (level == PGTABLE_LEVEL_PTE)
+ pte_unmap_unlock(vmf->pte, vmf->ptl);
+ else
+ spin_unlock(vmf->ptl);
+}
+
+static vm_fault_t cram_fault(struct folio *folio, struct vm_fault *vmf,
+ enum pgtable_level level)
+{
+ struct folio *f, *f2;
+ struct cram_node *cn;
+ unsigned int nr_succeeded = 0;
+ int nid;
+ LIST_HEAD(folios);
+
+ nid = folio_nid(folio);
+
+ cn = get_cram_node(nid);
+ if (!cn) {
+ cram_release_ptl(vmf, level);
+ return 0;
+ }
+
+ /*
+ * Isolate from LRU while holding PTL. This serializes against
+ * other CPUs faulting on the same folio: only one CPU can clear
+ * PG_lru under the PTL, and it proceeds to migration. Other
+ * CPUs find the folio already isolated and bail out, preventing
+ * the refcount pile-up that causes migrate_pages() to fail with
+ * -EAGAIN.
+ *
+ * No explicit folio_get() is needed: the page table entry holds
+ * a reference (we still hold PTL), and folio_isolate_lru() takes
+ * its own reference. This matches do_numa_page()'s pattern.
+ *
+ * PG_lru should already be set: cram_migrate_to() drains per-CPU
+ * LRU batches after migration, and the failure path below
+ * drains after putback.
+ */
+ if (!folio_isolate_lru(folio)) {
+ put_cram_node(cn);
+ cram_release_ptl(vmf, level);
+ cond_resched();
+ return 0;
+ }
+
+ /* Folio isolated, release PTL, proceed to migration */
+ cram_release_ptl(vmf, level);
+
+ node_stat_mod_folio(folio,
+ NR_ISOLATED_ANON + folio_is_file_lru(folio),
+ folio_nr_pages(folio));
+ list_add(&folio->lru, &folios);
+
+ migrate_pages(&folios, alloc_cram_promote_folio, NULL,
+ (unsigned long)numa_node_id(),
+ MIGRATE_SYNC, MR_NUMA_MISPLACED, &nr_succeeded);
+
+ /* Put failed folios back on LRU; retry on next fault */
+ list_for_each_entry_safe(f, f2, &folios, lru) {
+ list_del(&f->lru);
+ node_stat_mod_folio(f,
+ NR_ISOLATED_ANON + folio_is_file_lru(f),
+ -folio_nr_pages(f));
+ folio_putback_lru(f);
+ }
+
+ /*
+ * If migration failed, folio_putback_lru() batched the folio
+ * into this CPU's per-CPU LRU cache (PG_lru not yet set).
+ * Drain now so the folio is immediately visible on the LRU,
+ * the next fault can then isolate it without an IPI storm
+ * via lru_add_drain_all().
+ *
+ * Return VM_FAULT_RETRY after releasing the fault lock so the
+ * arch handler retries from scratch. Without this, returning 0
+ * causes a tight livelock: the process immediately re-faults on
+ * the same write-protected entry, alloc fails again, and
+ * VM_FAULT_OOM eventually leaks out through a stale path.
+ * VM_FAULT_RETRY gives the system breathing room to reclaim.
+ */
+ if (!nr_succeeded) {
+ lru_add_drain();
+ cond_resched();
+ put_cram_node(cn);
+ release_fault_lock(vmf);
+ return VM_FAULT_RETRY;
+ }
+
+ cond_resched();
+ put_cram_node(cn);
+ return 0;
+}
+
+static void cram_folio_migrate(struct folio *src, struct folio *dst)
+{
+}
+
+static void cram_reclaim_policy(int nid, struct node_reclaim_policy *policy)
+{
+ policy->may_swap = true;
+ policy->may_writepage = true;
+ policy->managed_watermarks = true;
+}
+
+static vm_fault_t cram_handle_fault(struct folio *folio, struct vm_fault *vmf,
+ enum pgtable_level level)
+{
+ return cram_fault(folio, vmf, level);
+}
+
+static const struct node_private_ops cram_ops = {
+ .handle_fault = cram_handle_fault,
+ .migrate_to = cram_migrate_to,
+ .folio_migrate = cram_folio_migrate,
+ .free_folio = cram_free_folio_cb,
+ .reclaim_policy = cram_reclaim_policy,
+ .flags = NP_OPS_MIGRATION | NP_OPS_DEMOTION |
+ NP_OPS_NUMA_BALANCING | NP_OPS_PROTECT_WRITE |
+ NP_OPS_RECLAIM,
+};
+
+int cram_register_private_node(int nid, void *owner,
+ cram_flush_cb_t flush_cb, void *flush_data)
+{
+ struct cram_node *cn;
+ int ret;
+
+ if (!node_state(nid, N_MEMORY_PRIVATE))
+ return -EINVAL;
+
+ mutex_lock(&cram_mutex);
+
+ cn = cram_nodes[nid];
+ if (cn) {
+ if (cn->owner != owner) {
+ mutex_unlock(&cram_mutex);
+ return -EBUSY;
+ }
+ mutex_unlock(&cram_mutex);
+ return 0;
+ }
+
+ cn = kzalloc(sizeof(*cn), GFP_KERNEL);
+ if (!cn) {
+ mutex_unlock(&cram_mutex);
+ return -ENOMEM;
+ }
+
+ cn->owner = owner;
+ cn->pressure = 0;
+ cn->flush_cb = flush_cb;
+ cn->flush_data = flush_data;
+ refcount_set(&cn->refcount, 1);
+
+ ret = node_private_set_ops(nid, &cram_ops);
+ if (ret) {
+ mutex_unlock(&cram_mutex);
+ kfree(cn);
+ return ret;
+ }
+
+ rcu_assign_pointer(cram_nodes[nid], cn);
+
+ /* Start kswapd on the private node for LRU aging and reclaim */
+ kswapd_run(nid);
+
+ mutex_unlock(&cram_mutex);
+
+ /* Now that ops->migrate_to is set, refresh demotion targets */
+ memory_tier_refresh_demotion();
+ return 0;
+}
+EXPORT_SYMBOL_GPL(cram_register_private_node);
+
+int cram_unregister_private_node(int nid)
+{
+ struct cram_node *cn;
+
+ if (!cram_valid_nid(nid))
+ return -EINVAL;
+
+ mutex_lock(&cram_mutex);
+
+ cn = cram_nodes[nid];
+ if (!cn) {
+ mutex_unlock(&cram_mutex);
+ return -ENODEV;
+ }
+
+ kswapd_stop(nid);
+
+ WARN_ON(node_private_clear_ops(nid, &cram_ops));
+ rcu_assign_pointer(cram_nodes[nid], NULL);
+ mutex_unlock(&cram_mutex);
+
+ /* ops->migrate_to cleared, refresh demotion targets */
+ memory_tier_refresh_demotion();
+
+ synchronize_rcu();
+ while (!refcount_dec_if_one(&cn->refcount))
+ cond_resched();
+ kfree(cn);
+ return 0;
+}
+EXPORT_SYMBOL_GPL(cram_unregister_private_node);
+
+int cram_unpurge(int nid)
+{
+ struct cram_node *cn;
+
+ if (!cram_valid_nid(nid))
+ return -EINVAL;
+
+ mutex_lock(&cram_mutex);
+
+ cn = cram_nodes[nid];
+ if (!cn) {
+ mutex_unlock(&cram_mutex);
+ return -ENODEV;
+ }
+
+ cn->purged = false;
+
+ mutex_unlock(&cram_mutex);
+ return 0;
+}
+EXPORT_SYMBOL_GPL(cram_unpurge);
+
+void cram_set_pressure(int nid, unsigned int pressure)
+{
+ struct cram_node *cn;
+ struct node_private *np;
+ struct zone *zone;
+ unsigned long managed, boost;
+
+ cn = get_cram_node(nid);
+ if (!cn)
+ return;
+
+ if (pressure > CRAM_PRESSURE_MAX)
+ pressure = CRAM_PRESSURE_MAX;
+
+ WRITE_ONCE(cn->pressure, pressure);
+
+ rcu_read_lock();
+ np = rcu_dereference(NODE_DATA(nid)->node_private);
+ /* Block demotions only at maximum pressure */
+ if (np)
+ WRITE_ONCE(np->migration_blocked,
+ pressure >= CRAM_PRESSURE_MAX);
+ rcu_read_unlock();
+
+ zone = NULL;
+ for (int i = 0; i < MAX_NR_ZONES; i++) {
+ struct zone *z = &NODE_DATA(nid)->node_zones[i];
+
+ if (zone_managed_pages(z) > 0) {
+ zone = z;
+ break;
+ }
+ }
+ if (!zone) {
+ put_cram_node(cn);
+ return;
+ }
+ managed = zone_managed_pages(zone);
+
+ /* Boost proportional to pressure. 0:no boost, 1000:full managed */
+ boost = (managed * (unsigned long)pressure) / CRAM_PRESSURE_MAX;
+ WRITE_ONCE(zone->watermark_boost, boost);
+
+ if (boost) {
+ set_bit(ZONE_BOOSTED_WATERMARK, &zone->flags);
+ wakeup_kswapd(zone, GFP_KERNEL, 0, ZONE_MOVABLE);
+ }
+
+ put_cram_node(cn);
+}
+EXPORT_SYMBOL_GPL(cram_set_pressure);
+
+void cram_clear_pressure(int nid)
+{
+ cram_set_pressure(nid, 0);
+}
+EXPORT_SYMBOL_GPL(cram_clear_pressure);
--
2.53.0
^ permalink raw reply [flat|nested] 28+ messages in thread
* [RFC PATCH v4 24/27] cxl/core: Add cxl_sysram region type
2026-02-22 8:48 [LSF/MM/BPF TOPIC][RFC PATCH v4 00/27] Private Memory Nodes (w/ Compressed RAM) Gregory Price
` (22 preceding siblings ...)
2026-02-22 8:48 ` [RFC PATCH v4 23/27] mm/cram: add compressed ram memory management subsystem Gregory Price
@ 2026-02-22 8:48 ` Gregory Price
2026-02-22 8:48 ` [RFC PATCH v4 25/27] cxl/core: Add private node support to cxl_sysram Gregory Price
` (2 subsequent siblings)
26 siblings, 0 replies; 28+ messages in thread
From: Gregory Price @ 2026-02-22 8:48 UTC (permalink / raw)
To: lsf-pc
Cc: linux-kernel, linux-cxl, cgroups, linux-mm, linux-trace-kernel,
damon, kernel-team, gregkh, rafael, dakr, dave, jonathan.cameron,
dave.jiang, alison.schofield, vishal.l.verma, ira.weiny,
dan.j.williams, longman, akpm, david, lorenzo.stoakes,
Liam.Howlett, vbabka, rppt, surenb, mhocko, osalvador, ziy,
matthew.brost, joshua.hahnjy, rakie.kim, byungchul, gourry,
ying.huang, apopple, axelrasmussen, yuanchu, weixugc, yury.norov,
linux, mhiramat, mathieu.desnoyers, tj, hannes, mkoutny,
jackmanb, sj, baolin.wang, npache, ryan.roberts, dev.jain,
baohua, lance.yang, muchun.song, xu.xin16, chengming.zhou, jannh,
linmiaohe, nao.horiguchi, pfalcato, rientjes, shakeel.butt, riel,
harry.yoo, cl, roman.gushchin, chrisl, kasong, shikemeng,
nphamcs, bhe, zhengqi.arch, terry.bowman
Add the CXL sysram region for direct memory hotplug of CXL RAM regions.
This region eliminates the intermediate dax_region/dax device layer by
directly performing memory hotplug operations.
Key features:
- Supports memory tier integration for proper NUMA placement
- Uses the CXL_SYSRAM_ONLINE_* Kconfig options for default online type
- Automatically hotplugs memory on probe if online type is configured
- Will be extended to support private memory nodes in the future
The driver registers a sysram_regionN device as a child of the CXL
region, managing the memory hotplug lifecycle through device add/remove.
Signed-off-by: Gregory Price <gourry@gourry.net>
---
drivers/cxl/core/Makefile | 1 +
drivers/cxl/core/core.h | 4 +
drivers/cxl/core/port.c | 2 +
drivers/cxl/core/region_sysram.c | 351 +++++++++++++++++++++++++++++++
drivers/cxl/cxl.h | 48 +++++
5 files changed, 406 insertions(+)
create mode 100644 drivers/cxl/core/region_sysram.c
diff --git a/drivers/cxl/core/Makefile b/drivers/cxl/core/Makefile
index d3ec8aea64c5..d7ce52c50810 100644
--- a/drivers/cxl/core/Makefile
+++ b/drivers/cxl/core/Makefile
@@ -18,6 +18,7 @@ cxl_core-$(CONFIG_TRACING) += trace.o
cxl_core-$(CONFIG_CXL_REGION) += region.o
cxl_core-$(CONFIG_CXL_REGION) += region_dax.o
cxl_core-$(CONFIG_CXL_REGION) += region_pmem.o
+cxl_core-$(CONFIG_CXL_REGION) += region_sysram.o
cxl_core-$(CONFIG_CXL_MCE) += mce.o
cxl_core-$(CONFIG_CXL_FEATURES) += features.o
cxl_core-$(CONFIG_CXL_EDAC_MEM_FEATURES) += edac.o
diff --git a/drivers/cxl/core/core.h b/drivers/cxl/core/core.h
index 6e1f695fd155..973bbcae43f7 100644
--- a/drivers/cxl/core/core.h
+++ b/drivers/cxl/core/core.h
@@ -35,6 +35,7 @@ extern struct device_attribute dev_attr_delete_region;
extern struct device_attribute dev_attr_region;
extern const struct device_type cxl_pmem_region_type;
extern const struct device_type cxl_dax_region_type;
+extern const struct device_type cxl_sysram_type;
extern const struct device_type cxl_region_type;
int cxl_decoder_detach(struct cxl_region *cxlr,
@@ -46,6 +47,7 @@ int cxl_decoder_detach(struct cxl_region *cxlr,
#define SET_CXL_REGION_ATTR(x) (&dev_attr_##x.attr),
#define CXL_PMEM_REGION_TYPE(x) (&cxl_pmem_region_type)
#define CXL_DAX_REGION_TYPE(x) (&cxl_dax_region_type)
+#define CXL_SYSRAM_TYPE(x) (&cxl_sysram_type)
int cxl_region_init(void);
void cxl_region_exit(void);
int cxl_get_poison_by_endpoint(struct cxl_port *port);
@@ -54,6 +56,7 @@ u64 cxl_dpa_to_hpa(struct cxl_region *cxlr, const struct cxl_memdev *cxlmd,
u64 dpa);
int devm_cxl_add_dax_region(struct cxl_region *cxlr, enum dax_driver_type);
int devm_cxl_add_pmem_region(struct cxl_region *cxlr);
+int devm_cxl_add_sysram(struct cxl_region *cxlr, enum mmop online_type);
#else
static inline u64 cxl_dpa_to_hpa(struct cxl_region *cxlr,
@@ -88,6 +91,7 @@ static inline void cxl_region_exit(void)
#define SET_CXL_REGION_ATTR(x)
#define CXL_PMEM_REGION_TYPE(x) NULL
#define CXL_DAX_REGION_TYPE(x) NULL
+#define CXL_SYSRAM_TYPE(x) NULL
#endif
struct cxl_send_command;
diff --git a/drivers/cxl/core/port.c b/drivers/cxl/core/port.c
index 5c82e6f32572..d6e82b3c2b64 100644
--- a/drivers/cxl/core/port.c
+++ b/drivers/cxl/core/port.c
@@ -66,6 +66,8 @@ static int cxl_device_id(const struct device *dev)
return CXL_DEVICE_PMEM_REGION;
if (dev->type == CXL_DAX_REGION_TYPE())
return CXL_DEVICE_DAX_REGION;
+ if (dev->type == CXL_SYSRAM_TYPE())
+ return CXL_DEVICE_SYSRAM;
if (is_cxl_port(dev)) {
if (is_cxl_root(to_cxl_port(dev)))
return CXL_DEVICE_ROOT;
diff --git a/drivers/cxl/core/region_sysram.c b/drivers/cxl/core/region_sysram.c
new file mode 100644
index 000000000000..47a415deb352
--- /dev/null
+++ b/drivers/cxl/core/region_sysram.c
@@ -0,0 +1,351 @@
+// SPDX-License-Identifier: GPL-2.0-only
+/* Copyright(c) 2026 Meta Platforms, Inc. All rights reserved. */
+/*
+ * CXL Sysram Region - Direct memory hotplug for CXL RAM regions
+ *
+ * This interface directly performs memory hotplug for CXL RAM regions,
+ * eliminating the indirection through DAX.
+ */
+
+#include <linux/memory_hotplug.h>
+#include <linux/memory-tiers.h>
+#include <linux/memory.h>
+#include <linux/device.h>
+#include <linux/slab.h>
+#include <linux/mm.h>
+#include <cxlmem.h>
+#include <cxl.h>
+#include "core.h"
+
+static const char *sysram_res_name = "System RAM (CXL)";
+
+/**
+ * cxl_region_find_sysram - Find the sysram device associated with a region
+ * @cxlr: The CXL region
+ *
+ * Finds and returns the sysram child device of a CXL region.
+ * The caller must release the device reference with put_device()
+ * when done with the returned pointer.
+ *
+ * Return: Pointer to cxl_sysram, or NULL if not found
+ */
+struct cxl_sysram *cxl_region_find_sysram(struct cxl_region *cxlr)
+{
+ struct cxl_sysram *sysram;
+ struct device *sdev;
+ char sname[32];
+
+ snprintf(sname, sizeof(sname), "sysram_region%d", cxlr->id);
+ sdev = device_find_child_by_name(&cxlr->dev, sname);
+ if (!sdev)
+ return NULL;
+
+ sysram = to_cxl_sysram(sdev);
+ return sysram;
+}
+EXPORT_SYMBOL_NS_GPL(cxl_region_find_sysram, "CXL");
+
+static int sysram_get_numa_node(struct cxl_region *cxlr)
+{
+ struct cxl_region_params *p = &cxlr->params;
+ int nid;
+
+ nid = phys_to_target_node(p->res->start);
+ if (nid == NUMA_NO_NODE)
+ nid = memory_add_physaddr_to_nid(p->res->start);
+
+ return nid;
+}
+
+static int sysram_hotplug_add(struct cxl_sysram *sysram, enum mmop online_type)
+{
+ struct resource *res;
+ mhp_t mhp_flags;
+ int rc;
+
+ if (sysram->res)
+ return -EBUSY;
+
+ res = request_mem_region(sysram->hpa_range.start,
+ range_len(&sysram->hpa_range),
+ sysram->res_name);
+ if (!res)
+ return -EBUSY;
+
+ sysram->res = res;
+
+ /*
+ * Set flags appropriate for System RAM. Leave ..._BUSY clear
+ * so that add_memory() can add a child resource.
+ */
+ res->flags = IORESOURCE_SYSTEM_RAM;
+
+ mhp_flags = MHP_NID_IS_MGID;
+
+ /*
+ * Ensure that future kexec'd kernels will not treat
+ * this as RAM automatically.
+ */
+ rc = __add_memory_driver_managed(sysram->mgid,
+ sysram->hpa_range.start,
+ range_len(&sysram->hpa_range),
+ sysram_res_name, mhp_flags,
+ online_type);
+ if (rc) {
+ remove_resource(res);
+ kfree(res);
+ sysram->res = NULL;
+ return rc;
+ }
+
+ return 0;
+}
+
+static int sysram_hotplug_remove(struct cxl_sysram *sysram)
+{
+ int rc;
+
+ if (!sysram->res)
+ return 0;
+
+ rc = offline_and_remove_memory(sysram->hpa_range.start,
+ range_len(&sysram->hpa_range));
+ if (rc)
+ return rc;
+
+ if (sysram->res) {
+ remove_resource(sysram->res);
+ kfree(sysram->res);
+ sysram->res = NULL;
+ }
+
+ return 0;
+}
+
+int cxl_sysram_offline_and_remove(struct cxl_sysram *sysram)
+{
+ return sysram_hotplug_remove(sysram);
+}
+EXPORT_SYMBOL_NS_GPL(cxl_sysram_offline_and_remove, "CXL");
+
+static void cxl_sysram_release(struct device *dev)
+{
+ struct cxl_sysram *sysram = to_cxl_sysram(dev);
+
+ if (sysram->res)
+ sysram_hotplug_remove(sysram);
+
+ kfree(sysram->res_name);
+
+ if (sysram->mgid >= 0)
+ memory_group_unregister(sysram->mgid);
+
+ if (sysram->mtype)
+ clear_node_memory_type(sysram->numa_node, sysram->mtype);
+
+ kfree(sysram);
+}
+
+static ssize_t hotplug_store(struct device *dev,
+ struct device_attribute *attr,
+ const char *buf, size_t len)
+{
+ struct cxl_sysram *sysram = to_cxl_sysram(dev);
+ int online_type, rc;
+
+ online_type = mhp_online_type_from_str(buf);
+ if (online_type < 0)
+ return online_type;
+
+ if (online_type == MMOP_OFFLINE)
+ rc = sysram_hotplug_remove(sysram);
+ else
+ rc = sysram_hotplug_add(sysram, online_type);
+
+ if (rc)
+ dev_warn(dev, "hotplug %s failed: %d\n",
+ online_type == MMOP_OFFLINE ? "offline" : "online", rc);
+
+ return rc ? rc : len;
+}
+static DEVICE_ATTR_WO(hotplug);
+
+static struct attribute *cxl_sysram_attrs[] = {
+ &dev_attr_hotplug.attr,
+ NULL
+};
+
+static const struct attribute_group cxl_sysram_attribute_group = {
+ .attrs = cxl_sysram_attrs,
+};
+
+static const struct attribute_group *cxl_sysram_attribute_groups[] = {
+ &cxl_base_attribute_group,
+ &cxl_sysram_attribute_group,
+ NULL
+};
+
+const struct device_type cxl_sysram_type = {
+ .name = "cxl_sysram",
+ .release = cxl_sysram_release,
+ .groups = cxl_sysram_attribute_groups,
+};
+
+static bool is_cxl_sysram(struct device *dev)
+{
+ return dev->type == &cxl_sysram_type;
+}
+
+struct cxl_sysram *to_cxl_sysram(struct device *dev)
+{
+ if (dev_WARN_ONCE(dev, !is_cxl_sysram(dev),
+ "not a cxl_sysram device\n"))
+ return NULL;
+ return container_of(dev, struct cxl_sysram, dev);
+}
+EXPORT_SYMBOL_NS_GPL(to_cxl_sysram, "CXL");
+
+struct device *cxl_sysram_dev(struct cxl_sysram *sysram)
+{
+ return &sysram->dev;
+}
+EXPORT_SYMBOL_NS_GPL(cxl_sysram_dev, "CXL");
+
+static struct lock_class_key cxl_sysram_key;
+
+static enum mmop cxl_sysram_get_default_online_type(void)
+{
+ if (IS_ENABLED(CONFIG_CXL_SYSRAM_ONLINE_TYPE_SYSTEM_DEFAULT))
+ return mhp_get_default_online_type();
+ if (IS_ENABLED(CONFIG_CXL_SYSRAM_ONLINE_TYPE_MOVABLE))
+ return MMOP_ONLINE_MOVABLE;
+ if (IS_ENABLED(CONFIG_CXL_SYSRAM_ONLINE_TYPE_NORMAL))
+ return MMOP_ONLINE;
+ return MMOP_OFFLINE;
+}
+
+static struct cxl_sysram *cxl_sysram_alloc(struct cxl_region *cxlr)
+{
+ struct cxl_sysram *sysram __free(kfree) = NULL;
+ struct device *dev;
+
+ sysram = kzalloc(sizeof(*sysram), GFP_KERNEL);
+ if (!sysram)
+ return ERR_PTR(-ENOMEM);
+
+ sysram->online_type = cxl_sysram_get_default_online_type();
+ sysram->last_hotplug_cmd = MMOP_OFFLINE;
+ sysram->numa_node = -1;
+ sysram->mgid = -1;
+
+ dev = &sysram->dev;
+ sysram->cxlr = cxlr;
+ device_initialize(dev);
+ lockdep_set_class(&dev->mutex, &cxl_sysram_key);
+ device_set_pm_not_required(dev);
+ dev->parent = &cxlr->dev;
+ dev->bus = &cxl_bus_type;
+ dev->type = &cxl_sysram_type;
+
+ return_ptr(sysram);
+}
+
+static void sysram_unregister(void *_sysram)
+{
+ struct cxl_sysram *sysram = _sysram;
+
+ device_unregister(&sysram->dev);
+}
+
+int devm_cxl_add_sysram(struct cxl_region *cxlr, enum mmop online_type)
+{
+ struct cxl_sysram *sysram __free(put_cxl_sysram) = NULL;
+ struct memory_dev_type *mtype;
+ struct range hpa_range;
+ struct device *dev;
+ int adist = MEMTIER_DEFAULT_LOWTIER_ADISTANCE;
+ int numa_node;
+ int rc;
+
+ rc = cxl_region_get_hpa_range(cxlr, &hpa_range);
+ if (rc)
+ return rc;
+
+ hpa_range = memory_block_align_range(&hpa_range);
+ if (hpa_range.start >= hpa_range.end) {
+ dev_warn(&cxlr->dev, "region too small after alignment\n");
+ return -ENOSPC;
+ }
+
+ sysram = cxl_sysram_alloc(cxlr);
+ if (IS_ERR(sysram))
+ return PTR_ERR(sysram);
+
+ sysram->hpa_range = hpa_range;
+
+ sysram->res_name = kasprintf(GFP_KERNEL, "cxl_sysram%d", cxlr->id);
+ if (!sysram->res_name)
+ return -ENOMEM;
+
+ /* Override default online type if caller specified one */
+ if (online_type >= 0)
+ sysram->online_type = online_type;
+
+ dev = &sysram->dev;
+
+ rc = dev_set_name(dev, "sysram_region%d", cxlr->id);
+ if (rc)
+ return rc;
+
+ /* Setup memory tier before adding device */
+ numa_node = sysram_get_numa_node(cxlr);
+ if (numa_node < 0) {
+ dev_warn(&cxlr->dev, "rejecting region with invalid node: %d\n",
+ numa_node);
+ return -EINVAL;
+ }
+ sysram->numa_node = numa_node;
+
+ mt_calc_adistance(numa_node, &adist);
+ mtype = mt_get_memory_type(adist);
+ if (IS_ERR(mtype))
+ return PTR_ERR(mtype);
+ sysram->mtype = mtype;
+
+ init_node_memory_type(numa_node, mtype);
+
+ /* Register memory group for this region */
+ rc = memory_group_register_static(numa_node,
+ PFN_UP(range_len(&hpa_range)));
+ if (rc < 0)
+ return rc;
+ sysram->mgid = rc;
+
+ rc = device_add(dev);
+ if (rc)
+ return rc;
+
+ dev_dbg(&cxlr->dev, "%s: register %s\n", dev_name(dev->parent),
+ dev_name(dev));
+
+ /*
+ * Dynamic capacity regions (DCD) will have memory added later.
+ * For static RAM regions, hotplug the entire range now.
+ */
+ if (cxlr->mode != CXL_PARTMODE_RAM)
+ goto out;
+
+ /* If default online_type is a valid online mode, immediately hotplug */
+ if (sysram->online_type > MMOP_OFFLINE) {
+ rc = sysram_hotplug_add(sysram, sysram->online_type);
+ if (rc)
+ dev_warn(dev, "hotplug failed: %d\n", rc);
+ else
+ sysram->last_hotplug_cmd = sysram->online_type;
+ }
+
+out:
+ return devm_add_action_or_reset(&cxlr->dev, sysram_unregister,
+ no_free_ptr(sysram));
+}
+EXPORT_SYMBOL_NS_GPL(devm_cxl_add_sysram, "CXL");
diff --git a/drivers/cxl/cxl.h b/drivers/cxl/cxl.h
index f899f240f229..8e8342fd4fde 100644
--- a/drivers/cxl/cxl.h
+++ b/drivers/cxl/cxl.h
@@ -607,6 +607,34 @@ struct cxl_dax_region {
enum dax_driver_type dax_driver;
};
+/**
+ * struct cxl_sysram - CXL SysRAM region for system memory hotplug
+ * @dev: device for this sysram
+ * @cxlr: parent cxl_region
+ * @online_type: Default memory online type for new hotplug ops (MMOP_* value)
+ * @last_hotplug_cmd: Last hotplug command submitted (MMOP_* value)
+ * @hpa_range: Host physical address range for the region
+ * @res_name: Resource name for the memory region
+ * @res: Memory resource (set when hotplugged)
+ * @mgid: Memory group id
+ * @mtype: Memory tier type
+ * @numa_node: NUMA node for this memory
+ *
+ * Device that directly performs memory hotplug for CXL RAM regions.
+ */
+struct cxl_sysram {
+ struct device dev;
+ struct cxl_region *cxlr;
+ enum mmop online_type;
+ int last_hotplug_cmd;
+ struct range hpa_range;
+ const char *res_name;
+ struct resource *res;
+ int mgid;
+ struct memory_dev_type *mtype;
+ int numa_node;
+};
+
/**
* struct cxl_port - logical collection of upstream port devices and
* downstream port devices to construct a CXL memory
@@ -807,6 +835,7 @@ DEFINE_FREE(put_cxl_port, struct cxl_port *, if (!IS_ERR_OR_NULL(_T)) put_device
DEFINE_FREE(put_cxl_root_decoder, struct cxl_root_decoder *, if (!IS_ERR_OR_NULL(_T)) put_device(&_T->cxlsd.cxld.dev))
DEFINE_FREE(put_cxl_region, struct cxl_region *, if (!IS_ERR_OR_NULL(_T)) put_device(&_T->dev))
DEFINE_FREE(put_cxl_dax_region, struct cxl_dax_region *, if (!IS_ERR_OR_NULL(_T)) put_device(&_T->dev))
+DEFINE_FREE(put_cxl_sysram, struct cxl_sysram *, if (!IS_ERR_OR_NULL(_T)) put_device(&_T->dev))
int devm_cxl_enumerate_ports(struct cxl_memdev *cxlmd);
void cxl_bus_rescan(void);
@@ -889,6 +918,7 @@ void cxl_destroy_region(struct cxl_region *cxlr);
struct device *cxl_region_dev(struct cxl_region *cxlr);
enum cxl_partition_mode cxl_region_mode(struct cxl_region *cxlr);
int cxl_get_region_range(struct cxl_region *cxlr, struct range *range);
+struct cxl_sysram *cxl_region_find_sysram(struct cxl_region *cxlr);
int cxl_get_committed_regions(struct cxl_memdev *cxlmd,
struct cxl_region **regions, int max_regions);
struct cxl_region *cxl_create_region(struct cxl_root_decoder *cxlrd,
@@ -936,6 +966,7 @@ void cxl_driver_unregister(struct cxl_driver *cxl_drv);
#define CXL_DEVICE_PMEM_REGION 7
#define CXL_DEVICE_DAX_REGION 8
#define CXL_DEVICE_PMU 9
+#define CXL_DEVICE_SYSRAM 10
#define MODULE_ALIAS_CXL(type) MODULE_ALIAS("cxl:t" __stringify(type) "*")
#define CXL_MODALIAS_FMT "cxl:t%d"
@@ -954,6 +985,10 @@ bool is_cxl_pmem_region(struct device *dev);
struct cxl_pmem_region *to_cxl_pmem_region(struct device *dev);
int cxl_add_to_region(struct cxl_endpoint_decoder *cxled);
struct cxl_dax_region *to_cxl_dax_region(struct device *dev);
+struct cxl_sysram *to_cxl_sysram(struct device *dev);
+struct device *cxl_sysram_dev(struct cxl_sysram *sysram);
+int devm_cxl_add_sysram(struct cxl_region *cxlr, enum mmop online_type);
+int cxl_sysram_offline_and_remove(struct cxl_sysram *sysram);
u64 cxl_port_get_spa_cache_alias(struct cxl_port *endpoint, u64 spa);
#else
static inline bool is_cxl_pmem_region(struct device *dev)
@@ -972,6 +1007,19 @@ static inline struct cxl_dax_region *to_cxl_dax_region(struct device *dev)
{
return NULL;
}
+static inline struct cxl_sysram *to_cxl_sysram(struct device *dev)
+{
+ return NULL;
+}
+static inline int devm_cxl_add_sysram(struct cxl_region *cxlr,
+ enum mmop online_type)
+{
+ return -ENXIO;
+}
+static inline int cxl_sysram_offline_and_remove(struct cxl_sysram *sysram)
+{
+ return -ENXIO;
+}
static inline u64 cxl_port_get_spa_cache_alias(struct cxl_port *endpoint,
u64 spa)
{
--
2.53.0
^ permalink raw reply [flat|nested] 28+ messages in thread
* [RFC PATCH v4 25/27] cxl/core: Add private node support to cxl_sysram
2026-02-22 8:48 [LSF/MM/BPF TOPIC][RFC PATCH v4 00/27] Private Memory Nodes (w/ Compressed RAM) Gregory Price
` (23 preceding siblings ...)
2026-02-22 8:48 ` [RFC PATCH v4 24/27] cxl/core: Add cxl_sysram region type Gregory Price
@ 2026-02-22 8:48 ` Gregory Price
2026-02-22 8:48 ` [RFC PATCH v4 26/27] cxl: add cxl_mempolicy sample PCI driver Gregory Price
2026-02-22 8:48 ` [RFC PATCH v4 27/27] cxl: add cxl_compression " Gregory Price
26 siblings, 0 replies; 28+ messages in thread
From: Gregory Price @ 2026-02-22 8:48 UTC (permalink / raw)
To: lsf-pc
Cc: linux-kernel, linux-cxl, cgroups, linux-mm, linux-trace-kernel,
damon, kernel-team, gregkh, rafael, dakr, dave, jonathan.cameron,
dave.jiang, alison.schofield, vishal.l.verma, ira.weiny,
dan.j.williams, longman, akpm, david, lorenzo.stoakes,
Liam.Howlett, vbabka, rppt, surenb, mhocko, osalvador, ziy,
matthew.brost, joshua.hahnjy, rakie.kim, byungchul, gourry,
ying.huang, apopple, axelrasmussen, yuanchu, weixugc, yury.norov,
linux, mhiramat, mathieu.desnoyers, tj, hannes, mkoutny,
jackmanb, sj, baolin.wang, npache, ryan.roberts, dev.jain,
baohua, lance.yang, muchun.song, xu.xin16, chengming.zhou, jannh,
linmiaohe, nao.horiguchi, pfalcato, rientjes, shakeel.butt, riel,
harry.yoo, cl, roman.gushchin, chrisl, kasong, shikemeng,
nphamcs, bhe, zhengqi.arch, terry.bowman
Extend the cxl_sysram region to support N_MEMORY_PRIVATE hotplug
via add_private_memory_driver_managed(). When a caller passes
private=true to devm_cxl_add_sysram(), the memory is registered
as a private node, isolating it from normal allocations and reclaim.
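A minimal sketch of the two call flavors, assuming MMOP_ONLINE_MOVABLE
as the online type:
/* Regular driver-managed System RAM hotplug */
rc = devm_cxl_add_sysram(cxlr, false, MMOP_ONLINE_MOVABLE);
/* N_MEMORY_PRIVATE hotplug: memory isolated from normal allocations */
rc = devm_cxl_add_sysram(cxlr, true, MMOP_ONLINE_MOVABLE);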
Signed-off-by: Gregory Price <gourry@gourry.net>
---
drivers/cxl/core/core.h | 2 +-
drivers/cxl/core/region_sysram.c | 50 +++++++++++++++++++++++++-------
drivers/cxl/cxl.h | 9 ++++--
3 files changed, 48 insertions(+), 13 deletions(-)
diff --git a/drivers/cxl/core/core.h b/drivers/cxl/core/core.h
index 973bbcae43f7..8ca3d6d41fe4 100644
--- a/drivers/cxl/core/core.h
+++ b/drivers/cxl/core/core.h
@@ -56,7 +56,7 @@ u64 cxl_dpa_to_hpa(struct cxl_region *cxlr, const struct cxl_memdev *cxlmd,
u64 dpa);
int devm_cxl_add_dax_region(struct cxl_region *cxlr, enum dax_driver_type);
int devm_cxl_add_pmem_region(struct cxl_region *cxlr);
-int devm_cxl_add_sysram(struct cxl_region *cxlr, enum mmop online_type);
+int devm_cxl_add_sysram(struct cxl_region *cxlr, bool private, enum mmop online_type);
#else
static inline u64 cxl_dpa_to_hpa(struct cxl_region *cxlr,
diff --git a/drivers/cxl/core/region_sysram.c b/drivers/cxl/core/region_sysram.c
index 47a415deb352..77aaa52e7332 100644
--- a/drivers/cxl/core/region_sysram.c
+++ b/drivers/cxl/core/region_sysram.c
@@ -85,12 +85,23 @@ static int sysram_hotplug_add(struct cxl_sysram *sysram, enum mmop online_type)
/*
* Ensure that future kexec'd kernels will not treat
* this as RAM automatically.
+ *
+ * For private regions, use add_private_memory_driver_managed()
+ * to register as N_MEMORY_PRIVATE which isolates the memory from
+ * normal allocations and reclaim.
*/
- rc = __add_memory_driver_managed(sysram->mgid,
- sysram->hpa_range.start,
- range_len(&sysram->hpa_range),
- sysram_res_name, mhp_flags,
- online_type);
+ if (sysram->private)
+ rc = add_private_memory_driver_managed(sysram->mgid,
+ sysram->hpa_range.start,
+ range_len(&sysram->hpa_range),
+ sysram_res_name, mhp_flags,
+ online_type, &sysram->np);
+ else
+ rc = __add_memory_driver_managed(sysram->mgid,
+ sysram->hpa_range.start,
+ range_len(&sysram->hpa_range),
+ sysram_res_name, mhp_flags,
+ online_type);
if (rc) {
remove_resource(res);
kfree(res);
@@ -108,10 +119,23 @@ static int sysram_hotplug_remove(struct cxl_sysram *sysram)
if (!sysram->res)
return 0;
- rc = offline_and_remove_memory(sysram->hpa_range.start,
- range_len(&sysram->hpa_range));
- if (rc)
- return rc;
+ if (sysram->private) {
+ rc = offline_and_remove_private_memory(sysram->numa_node,
+ sysram->hpa_range.start,
+ range_len(&sysram->hpa_range));
+ /*
+ * -EBUSY means memory was removed but node_private_unregister()
+ * could not complete because other regions share the node.
+ * Continue to resource cleanup since the memory is gone.
+ */
+ if (rc && rc != -EBUSY)
+ return rc;
+ } else {
+ rc = offline_and_remove_memory(sysram->hpa_range.start,
+ range_len(&sysram->hpa_range));
+ if (rc)
+ return rc;
+ }
if (sysram->res) {
remove_resource(sysram->res);
@@ -257,7 +281,8 @@ static void sysram_unregister(void *_sysram)
device_unregister(&sysram->dev);
}
-int devm_cxl_add_sysram(struct cxl_region *cxlr, enum mmop online_type)
+int devm_cxl_add_sysram(struct cxl_region *cxlr, bool private,
+ enum mmop online_type)
{
struct cxl_sysram *sysram __free(put_cxl_sysram) = NULL;
struct memory_dev_type *mtype;
@@ -291,6 +316,11 @@ int devm_cxl_add_sysram(struct cxl_region *cxlr, enum mmop online_type)
if (online_type >= 0)
sysram->online_type = online_type;
+ /* Set up private node registration if requested */
+ sysram->private = private;
+ if (private)
+ sysram->np.owner = sysram;
+
dev = &sysram->dev;
rc = dev_set_name(dev, "sysram_region%d", cxlr->id);
diff --git a/drivers/cxl/cxl.h b/drivers/cxl/cxl.h
index 8e8342fd4fde..54e5f9ac59dc 100644
--- a/drivers/cxl/cxl.h
+++ b/drivers/cxl/cxl.h
@@ -10,6 +10,7 @@
#include <linux/bitops.h>
#include <linux/log2.h>
#include <linux/node.h>
+#include <linux/node_private.h>
#include <linux/io.h>
#include <linux/range.h>
#include <linux/dax.h>
@@ -619,6 +620,8 @@ struct cxl_dax_region {
* @mgid: Memory group id
* @mtype: Memory tier type
* @numa_node: NUMA node for this memory
+ * @private: true if this region uses N_MEMORY_PRIVATE hotplug
+ * @np: private node registration state (valid when @private is true)
*
* Device that directly performs memory hotplug for CXL RAM regions.
*/
@@ -633,6 +636,8 @@ struct cxl_sysram {
int mgid;
struct memory_dev_type *mtype;
int numa_node;
+ bool private;
+ struct node_private np;
};
/**
@@ -987,7 +992,7 @@ int cxl_add_to_region(struct cxl_endpoint_decoder *cxled);
struct cxl_dax_region *to_cxl_dax_region(struct device *dev);
struct cxl_sysram *to_cxl_sysram(struct device *dev);
struct device *cxl_sysram_dev(struct cxl_sysram *sysram);
-int devm_cxl_add_sysram(struct cxl_region *cxlr, enum mmop online_type);
+int devm_cxl_add_sysram(struct cxl_region *cxlr, bool private, enum mmop online_type);
int cxl_sysram_offline_and_remove(struct cxl_sysram *sysram);
u64 cxl_port_get_spa_cache_alias(struct cxl_port *endpoint, u64 spa);
#else
@@ -1011,7 +1016,7 @@ static inline struct cxl_sysram *to_cxl_sysram(struct device *dev)
{
return NULL;
}
-static inline int devm_cxl_add_sysram(struct cxl_region *cxlr,
+static inline int devm_cxl_add_sysram(struct cxl_region *cxlr, bool private,
enum mmop online_type)
{
return -ENXIO;
--
2.53.0
^ permalink raw reply [flat|nested] 28+ messages in thread
* [RFC PATCH v4 26/27] cxl: add cxl_mempolicy sample PCI driver
2026-02-22 8:48 [LSF/MM/BPF TOPIC][RFC PATCH v4 00/27] Private Memory Nodes (w/ Compressed RAM) Gregory Price
` (24 preceding siblings ...)
2026-02-22 8:48 ` [RFC PATCH v4 25/27] cxl/core: Add private node support to cxl_sysram Gregory Price
@ 2026-02-22 8:48 ` Gregory Price
2026-02-22 8:48 ` [RFC PATCH v4 27/27] cxl: add cxl_compression " Gregory Price
26 siblings, 0 replies; 28+ messages in thread
From: Gregory Price @ 2026-02-22 8:48 UTC (permalink / raw)
To: lsf-pc
Add a sample CXL type-3 driver that registers device memory as
private-node NUMA memory reachable only via explicit mempolicy
(set_mempolicy / mbind).
Probe flow:
1. Call cxl_pci_type3_probe_init() for standard CXL device setup
2. Look for pre-committed RAM regions; if none exist, create one
using cxl_get_hpa_freespace() + cxl_request_dpa() +
cxl_create_region()
3. Convert the region to sysram via devm_cxl_add_sysram() with
private=true and MMOP_ONLINE_MOVABLE
4. Register node_private_ops with NP_OPS_MIGRATION | NP_OPS_MEMPOLICY
so the node, while excluded from default allocations, can be reached
via explicit mempolicy and userspace page migration
The migrate_to callback uses alloc_migration_target() with
__GFP_THISNODE | __GFP_PRIVATE to keep pages on the target node.
Move struct migration_target_control from mm/internal.h to
include/linux/migrate.h so the driver can use alloc_migration_target()
without depending on mm-internal headers.
Usage:
echo $PCI_DEV > /sys/bus/pci/drivers/cxl_pci/unbind
echo $PCI_DEV > /sys/bus/pci/drivers/cxl_mempolicy/bind
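For illustration, a hypothetical userspace sketch (not part of this
patch) of how a process then places memory on the private node. The
node id is an assumption; on a real system it comes from the driver's
probe message or /sys/devices/system/node:
#include <numaif.h>		/* mbind(), MPOL_BIND; link with -lnuma */
#include <sys/mman.h>
#include <string.h>
#include <stddef.h>
static void *alloc_on_private_node(size_t len, int private_node)
{
	unsigned long nodemask = 1UL << private_node;
	void *buf = mmap(NULL, len, PROT_READ | PROT_WRITE,
			 MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
	if (buf == MAP_FAILED)
		return NULL;
	/*
	 * Without this explicit policy, faults can never land on the
	 * private node; with it, every fault in [buf, buf + len) does.
	 */
	if (mbind(buf, len, MPOL_BIND, &nodemask,
		  sizeof(nodemask) * 8, 0)) {
		munmap(buf, len);
		return NULL;
	}
	memset(buf, 0, len);	/* first touch faults onto the private node */
	return buf;
}
The same region is unreachable from a plain malloc()/mmap() without
the mbind() call, which is the isolation property this sample
demonstrates.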
Signed-off-by: Gregory Price <gourry@gourry.net>
---
drivers/cxl/Kconfig | 2 +
drivers/cxl/Makefile | 2 +
drivers/cxl/type3_drivers/Kconfig | 2 +
drivers/cxl/type3_drivers/Makefile | 2 +
.../cxl/type3_drivers/cxl_mempolicy/Kconfig | 16 +
.../cxl/type3_drivers/cxl_mempolicy/Makefile | 4 +
.../type3_drivers/cxl_mempolicy/mempolicy.c | 297 ++++++++++++++++++
include/linux/migrate.h | 7 +-
mm/internal.h | 7 -
9 files changed, 331 insertions(+), 8 deletions(-)
create mode 100644 drivers/cxl/type3_drivers/Kconfig
create mode 100644 drivers/cxl/type3_drivers/Makefile
create mode 100644 drivers/cxl/type3_drivers/cxl_mempolicy/Kconfig
create mode 100644 drivers/cxl/type3_drivers/cxl_mempolicy/Makefile
create mode 100644 drivers/cxl/type3_drivers/cxl_mempolicy/mempolicy.c
diff --git a/drivers/cxl/Kconfig b/drivers/cxl/Kconfig
index f99aa7274d12..1648cdeaa0c9 100644
--- a/drivers/cxl/Kconfig
+++ b/drivers/cxl/Kconfig
@@ -278,4 +278,6 @@ config CXL_ATL
depends on CXL_REGION
depends on ACPI_PRMT && AMD_NB
+source "drivers/cxl/type3_drivers/Kconfig"
+
endif
diff --git a/drivers/cxl/Makefile b/drivers/cxl/Makefile
index 2caa90fa4bf2..94d2b2233bf8 100644
--- a/drivers/cxl/Makefile
+++ b/drivers/cxl/Makefile
@@ -19,3 +19,5 @@ cxl_acpi-y := acpi.o
cxl_pmem-y := pmem.o security.o
cxl_mem-y := mem.o
cxl_pci-y := pci.o
+
+obj-y += type3_drivers/
diff --git a/drivers/cxl/type3_drivers/Kconfig b/drivers/cxl/type3_drivers/Kconfig
new file mode 100644
index 000000000000..369b21763856
--- /dev/null
+++ b/drivers/cxl/type3_drivers/Kconfig
@@ -0,0 +1,2 @@
+# SPDX-License-Identifier: GPL-2.0
+source "drivers/cxl/type3_drivers/cxl_mempolicy/Kconfig"
diff --git a/drivers/cxl/type3_drivers/Makefile b/drivers/cxl/type3_drivers/Makefile
new file mode 100644
index 000000000000..2b82265ff118
--- /dev/null
+++ b/drivers/cxl/type3_drivers/Makefile
@@ -0,0 +1,2 @@
+# SPDX-License-Identifier: GPL-2.0
+obj-$(CONFIG_CXL_MEMPOLICY) += cxl_mempolicy/
diff --git a/drivers/cxl/type3_drivers/cxl_mempolicy/Kconfig b/drivers/cxl/type3_drivers/cxl_mempolicy/Kconfig
new file mode 100644
index 000000000000..3c45da237b9f
--- /dev/null
+++ b/drivers/cxl/type3_drivers/cxl_mempolicy/Kconfig
@@ -0,0 +1,16 @@
+config CXL_MEMPOLICY
+ tristate "CXL Private Memory with Mempolicy Support"
+ depends on CXL_PCI
+ depends on CXL_REGION
+ depends on NUMA
+ depends on MIGRATION
+ help
+ Minimal driver for CXL memory devices that registers memory as
+ N_MEMORY_PRIVATE with mempolicy support. The memory is isolated
+ from default allocations and can only be reached via explicit
+ mempolicy (set_mempolicy or mbind).
+
+ No compression, no PTE controls: the memory behaves like normal
+ DRAM but is excluded from fallback allocations.
+
+ If unsure say 'n'.
diff --git a/drivers/cxl/type3_drivers/cxl_mempolicy/Makefile b/drivers/cxl/type3_drivers/cxl_mempolicy/Makefile
new file mode 100644
index 000000000000..dfb58fc88ad9
--- /dev/null
+++ b/drivers/cxl/type3_drivers/cxl_mempolicy/Makefile
@@ -0,0 +1,4 @@
+# SPDX-License-Identifier: GPL-2.0
+obj-$(CONFIG_CXL_MEMPOLICY) += cxl_mempolicy.o
+cxl_mempolicy-y := mempolicy.o
+ccflags-y += -I$(srctree)/drivers/cxl
diff --git a/drivers/cxl/type3_drivers/cxl_mempolicy/mempolicy.c b/drivers/cxl/type3_drivers/cxl_mempolicy/mempolicy.c
new file mode 100644
index 000000000000..1c19818eb268
--- /dev/null
+++ b/drivers/cxl/type3_drivers/cxl_mempolicy/mempolicy.c
@@ -0,0 +1,297 @@
+// SPDX-License-Identifier: GPL-2.0-only
+/* Copyright(c) 2026 Meta Platforms, Inc. All rights reserved. */
+/*
+ * CXL Mempolicy Driver
+ *
+ * Minimal driver for CXL memory devices that registers memory as
+ * N_MEMORY_PRIVATE with mempolicy support but no PTE controls. The
+ * memory behaves like normal DRAM but is isolated from default allocations;
+ * it can only be reached via explicit mempolicy (set_mempolicy/mbind).
+ *
+ * Usage:
+ * 1. Unbind device from cxl_pci:
+ * echo $PCI_DEV > /sys/bus/pci/drivers/cxl_pci/unbind
+ * 2. Bind to cxl_mempolicy:
+ * echo $PCI_DEV > /sys/bus/pci/drivers/cxl_mempolicy/bind
+ */
+
+#include <linux/module.h>
+#include <linux/pci.h>
+#include <linux/xarray.h>
+#include <linux/node_private.h>
+#include <linux/migrate.h>
+#include <cxl/mailbox.h>
+#include "cxlmem.h"
+#include "cxl.h"
+
+struct cxl_mempolicy_ctx {
+ struct cxl_region *cxlr;
+ struct cxl_endpoint_decoder *cxled;
+ int nid;
+};
+
+static DEFINE_XARRAY(ctx_xa);
+
+static struct cxl_mempolicy_ctx *memdev_to_ctx(struct cxl_memdev *cxlmd)
+{
+ struct pci_dev *pdev = to_pci_dev(cxlmd->dev.parent);
+
+ return xa_load(&ctx_xa, (unsigned long)pdev);
+}
+
+static int cxl_mempolicy_migrate_to(struct list_head *folios, int nid,
+ enum migrate_mode mode,
+ enum migrate_reason reason,
+ unsigned int *nr_succeeded)
+{
+ struct migration_target_control mtc = {
+ .nid = nid,
+ .gfp_mask = GFP_HIGHUSER_MOVABLE | __GFP_THISNODE |
+ __GFP_PRIVATE,
+ .reason = reason,
+ };
+
+ return migrate_pages(folios, alloc_migration_target, NULL,
+ (unsigned long)&mtc, mode, reason, nr_succeeded);
+}
+
+static void cxl_mempolicy_folio_migrate(struct folio *src, struct folio *dst)
+{
+}
+
+static const struct node_private_ops cxl_mempolicy_ops = {
+ .migrate_to = cxl_mempolicy_migrate_to,
+ .folio_migrate = cxl_mempolicy_folio_migrate,
+ .flags = NP_OPS_MIGRATION | NP_OPS_MEMPOLICY,
+};
+
+static struct cxl_region *create_ram_region(struct cxl_memdev *cxlmd)
+{
+ struct cxl_mempolicy_ctx *ctx = memdev_to_ctx(cxlmd);
+ struct cxl_root_decoder *cxlrd;
+ struct cxl_endpoint_decoder *cxled;
+ struct cxl_region *cxlr;
+ resource_size_t ram_size, avail;
+
+ ram_size = cxl_ram_size(cxlmd->cxlds);
+ if (ram_size == 0) {
+ dev_info(&cxlmd->dev, "no RAM capacity available\n");
+ return ERR_PTR(-ENODEV);
+ }
+
+ ram_size = ALIGN_DOWN(ram_size, SZ_256M);
+ if (ram_size == 0) {
+ dev_info(&cxlmd->dev,
+ "RAM capacity too small (< 256M)\n");
+ return ERR_PTR(-ENOSPC);
+ }
+
+ dev_info(&cxlmd->dev, "creating RAM region for %lld MB\n",
+ ram_size >> 20);
+
+ cxlrd = cxl_get_hpa_freespace(cxlmd, ram_size, &avail);
+ if (IS_ERR(cxlrd)) {
+ dev_err(&cxlmd->dev, "no HPA freespace: %ld\n",
+ PTR_ERR(cxlrd));
+ return ERR_CAST(cxlrd);
+ }
+
+ cxled = cxl_request_dpa(cxlmd, CXL_PARTMODE_RAM, ram_size);
+ if (IS_ERR(cxled)) {
+ dev_err(&cxlmd->dev, "failed to request DPA: %ld\n",
+ PTR_ERR(cxled));
+ cxl_put_root_decoder(cxlrd);
+ return ERR_CAST(cxled);
+ }
+
+ cxlr = cxl_create_region(cxlrd, &cxled, 1);
+ cxl_put_root_decoder(cxlrd);
+ if (IS_ERR(cxlr)) {
+ dev_err(&cxlmd->dev, "failed to create region: %ld\n",
+ PTR_ERR(cxlr));
+ cxl_dpa_free(cxled);
+ return cxlr;
+ }
+
+ ctx->cxled = cxled;
+ dev_info(&cxlmd->dev, "created region %s\n",
+ dev_name(cxl_region_dev(cxlr)));
+ return cxlr;
+}
+
+static int setup_private_node(struct cxl_memdev *cxlmd,
+ struct cxl_region *cxlr)
+{
+ struct cxl_mempolicy_ctx *ctx = memdev_to_ctx(cxlmd);
+ struct range hpa_range;
+ int rc;
+
+ device_release_driver(cxl_region_dev(cxlr));
+
+ rc = devm_cxl_add_sysram(cxlr, true, MMOP_ONLINE_MOVABLE);
+ if (rc) {
+ dev_err(cxl_region_dev(cxlr),
+ "failed to add sysram: %d\n", rc);
+ if (device_attach(cxl_region_dev(cxlr)) < 0)
+ dev_warn(cxl_region_dev(cxlr),
+ "failed to re-attach driver\n");
+ return rc;
+ }
+
+ rc = cxl_get_region_range(cxlr, &hpa_range);
+ if (rc) {
+ dev_err(cxl_region_dev(cxlr),
+ "failed to get region range: %d\n", rc);
+ return rc;
+ }
+
+ ctx->nid = phys_to_target_node(hpa_range.start);
+ if (ctx->nid == NUMA_NO_NODE)
+ ctx->nid = memory_add_physaddr_to_nid(hpa_range.start);
+
+ rc = node_private_set_ops(ctx->nid, &cxl_mempolicy_ops);
+ if (rc) {
+ dev_err(cxl_region_dev(cxlr),
+ "failed to set ops on node %d: %d\n", ctx->nid, rc);
+ ctx->nid = NUMA_NO_NODE;
+ return rc;
+ }
+
+ dev_info(&cxlmd->dev,
+ "node %d registered as private mempolicy memory\n", ctx->nid);
+ return 0;
+}
+
+static int cxl_mempolicy_attach_probe(struct cxl_memdev *cxlmd)
+{
+ struct cxl_region *regions[8];
+ struct cxl_region *cxlr;
+ int nr, i;
+ int rc;
+
+ dev_info(&cxlmd->dev,
+ "cxl_mempolicy attach: looking for regions\n");
+
+ /* Phase 1: look for pre-committed RAM regions */
+ nr = cxl_get_committed_regions(cxlmd, regions, ARRAY_SIZE(regions));
+ for (i = 0; i < nr; i++) {
+ if (cxl_region_mode(regions[i]) != CXL_PARTMODE_RAM) {
+ put_device(cxl_region_dev(regions[i]));
+ continue;
+ }
+
+ cxlr = regions[i];
+ rc = setup_private_node(cxlmd, cxlr);
+ put_device(cxl_region_dev(cxlr));
+ if (rc == 0) {
+ /* Release remaining region references */
+ for (i++; i < nr; i++)
+ put_device(cxl_region_dev(regions[i]));
+ return 0;
+ }
+ }
+
+ /* Phase 2: no committed regions, create one */
+ dev_info(&cxlmd->dev,
+ "no existing regions, creating RAM region\n");
+
+ cxlr = create_ram_region(cxlmd);
+ if (IS_ERR(cxlr)) {
+ rc = PTR_ERR(cxlr);
+ if (rc == -ENODEV) {
+ dev_info(&cxlmd->dev,
+ "no RAM capacity: %d\n", rc);
+ return 0;
+ }
+ return rc;
+ }
+
+ rc = setup_private_node(cxlmd, cxlr);
+ if (rc) {
+ dev_err(&cxlmd->dev,
+ "failed to setup private node: %d\n", rc);
+ return rc;
+ }
+
+ /* Only take ownership of regions we created (Phase 2) */
+ memdev_to_ctx(cxlmd)->cxlr = cxlr;
+
+ return 0;
+}
+
+static const struct cxl_memdev_attach cxl_mempolicy_attach = {
+ .probe = cxl_mempolicy_attach_probe,
+};
+
+static int cxl_mempolicy_probe(struct pci_dev *pdev,
+ const struct pci_device_id *id)
+{
+ struct cxl_mempolicy_ctx *ctx;
+ struct cxl_memdev *cxlmd;
+ int rc;
+
+ dev_info(&pdev->dev, "cxl_mempolicy: probing device\n");
+
+ ctx = devm_kzalloc(&pdev->dev, sizeof(*ctx), GFP_KERNEL);
+ if (!ctx)
+ return -ENOMEM;
+ ctx->nid = NUMA_NO_NODE;
+
+ rc = xa_insert(&ctx_xa, (unsigned long)pdev, ctx, GFP_KERNEL);
+ if (rc)
+ return rc;
+
+ cxlmd = cxl_pci_type3_probe_init(pdev, &cxl_mempolicy_attach);
+ if (IS_ERR(cxlmd)) {
+ xa_erase(&ctx_xa, (unsigned long)pdev);
+ return PTR_ERR(cxlmd);
+ }
+
+ dev_info(&pdev->dev, "cxl_mempolicy: probe complete\n");
+ return 0;
+}
+
+static void cxl_mempolicy_remove(struct pci_dev *pdev)
+{
+ struct cxl_mempolicy_ctx *ctx = xa_erase(&ctx_xa, (unsigned long)pdev);
+
+ dev_info(&pdev->dev, "cxl_mempolicy: removing device\n");
+
+ if (!ctx)
+ return;
+
+ if (ctx->nid != NUMA_NO_NODE)
+ WARN_ON(node_private_clear_ops(ctx->nid, &cxl_mempolicy_ops));
+
+ if (ctx->cxlr) {
+ cxl_destroy_region(ctx->cxlr);
+ ctx->cxlr = NULL;
+ }
+
+ if (ctx->cxled) {
+ cxl_dpa_free(ctx->cxled);
+ ctx->cxled = NULL;
+ }
+}
+
+static const struct pci_device_id cxl_mempolicy_pci_tbl[] = {
+ { PCI_DEVICE(PCI_VENDOR_ID_INTEL, 0x0d93) },
+ { },
+};
+MODULE_DEVICE_TABLE(pci, cxl_mempolicy_pci_tbl);
+
+static struct pci_driver cxl_mempolicy_driver = {
+ .name = KBUILD_MODNAME,
+ .id_table = cxl_mempolicy_pci_tbl,
+ .probe = cxl_mempolicy_probe,
+ .remove = cxl_mempolicy_remove,
+ .driver = {
+ .probe_type = PROBE_PREFER_ASYNCHRONOUS,
+ },
+};
+
+module_pci_driver(cxl_mempolicy_driver);
+
+MODULE_DESCRIPTION("CXL: Private Memory with Mempolicy Support");
+MODULE_LICENSE("GPL v2");
+MODULE_IMPORT_NS("CXL");
diff --git a/include/linux/migrate.h b/include/linux/migrate.h
index 7b2da3875ff2..1f9fb61f3932 100644
--- a/include/linux/migrate.h
+++ b/include/linux/migrate.h
@@ -10,7 +10,12 @@
typedef struct folio *new_folio_t(struct folio *folio, unsigned long private);
typedef void free_folio_t(struct folio *folio, unsigned long private);
-struct migration_target_control;
+struct migration_target_control {
+ int nid; /* preferred node id */
+ nodemask_t *nmask;
+ gfp_t gfp_mask;
+ enum migrate_reason reason;
+};
/**
* struct movable_operations - Driver page migration
diff --git a/mm/internal.h b/mm/internal.h
index 64467ca774f1..85cd11189854 100644
--- a/mm/internal.h
+++ b/mm/internal.h
@@ -1352,13 +1352,6 @@ extern const struct trace_print_flags gfpflag_names[];
void setup_zone_pageset(struct zone *zone);
-struct migration_target_control {
- int nid; /* preferred node id */
- nodemask_t *nmask;
- gfp_t gfp_mask;
- enum migrate_reason reason;
-};
-
/*
* mm/filemap.c
*/
--
2.53.0
* [RFC PATCH v4 27/27] cxl: add cxl_compression PCI driver
From: Gregory Price @ 2026-02-22 8:48 UTC (permalink / raw)
To: lsf-pc
Add a generic CXL type-3 driver for compressed memory controllers.
The driver provides an alternative PCI binding that converts CXL
RAM regions to private-node sysram and registers them with the
CRAM subsystem for transparent demotion/promotion.
Probe flow:
1. cxl_pci_type3_probe_init() for standard CXL device setup
2. Discover/convert auto-RAM regions or create a RAM region
3. Convert to private-node sysram via devm_cxl_add_sysram()
4. Register with CRAM via cram_register_private_node()
Page flush pipeline:
When a CRAM folio is freed, the CRAM free_folio callback buffers
it into a per-CPU, RCU-protected flush buffer so the device-side
zeroing can be batched off the free path.
A periodic kthread swaps the per-CPU buffers under RCU, then sends
batched Sanitize-Zero commands so the device can zero pages.
A flush_record bitmap tracks in-flight pages to avoid re-buffering on
the second free_folio entry after folio_put().
Overflow from full buffers is handled by a per-CPU workqueue fallback.
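A condensed sketch of that two-pass free_folio protocol follows; the
flat in_flight bitmap, the queue_zero_range() stub, and the return
convention (0 = proceed with the free, nonzero = deferred) are
simplifying assumptions standing in for the per-CPU buffering and the
CRAM callback contract implemented in compression.c below:
#include <linux/bitops.h>	/* set_bit(), test_and_clear_bit() */
#include <linux/mm.h>		/* folio_pfn(), folio_get() */
/* Stand-in for writing a DPA range into the per-CPU CCI flush buffer */
static void queue_zero_range(unsigned long idx) { }
static int flush_cb_sketch(struct folio *folio, unsigned long *in_flight,
			   unsigned long base_pfn)
{
	unsigned long idx = folio_pfn(folio) - base_pfn;
	/* Second entry: zeroing completed and folio_put() dropped our ref */
	if (test_and_clear_bit(idx, in_flight))
		return 0;
	/* First entry: pin the folio, mark it in-flight, batch the zero */
	folio_get(folio);
	set_bit(idx, in_flight);
	queue_zero_range(idx);
	return 1;
}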
Watermark interrupts:
MSI-X vector 12 - delivers "Low" watermark interrupts
MSI-X vector 13 - delivers "High" watermark interrupts
This adjusts CRAM pressure:
Low - increases pressure.
High - reduces pressure.
A dynamic watermark mode cycles through four phases with
progressively tighter thresholds; static watermark mode simply
sets pressure to MAX on a low interrupt and to 0 on a high one.
Teardown ordering:
pre_teardown - cram_unregister + retry-loop memory offline
post_teardown - kthread stop, drain all flush buffers via CCI
Usage:
echo $PCI_DEV > /sys/bus/pci/drivers/cxl_pci/unbind
echo $PCI_DEV > /sys/bus/pci/drivers/cxl_compression/bind
Signed-off-by: Gregory Price <gourry@gourry.net>
---
drivers/cxl/type3_drivers/Kconfig | 1 +
drivers/cxl/type3_drivers/Makefile | 1 +
.../cxl/type3_drivers/cxl_compression/Kconfig | 20 +
.../type3_drivers/cxl_compression/Makefile | 4 +
.../cxl_compression/compression.c | 1025 +++++++++++++++++
5 files changed, 1051 insertions(+)
create mode 100644 drivers/cxl/type3_drivers/cxl_compression/Kconfig
create mode 100644 drivers/cxl/type3_drivers/cxl_compression/Makefile
create mode 100644 drivers/cxl/type3_drivers/cxl_compression/compression.c
diff --git a/drivers/cxl/type3_drivers/Kconfig b/drivers/cxl/type3_drivers/Kconfig
index 369b21763856..98f73e46730e 100644
--- a/drivers/cxl/type3_drivers/Kconfig
+++ b/drivers/cxl/type3_drivers/Kconfig
@@ -1,2 +1,3 @@
# SPDX-License-Identifier: GPL-2.0
source "drivers/cxl/type3_drivers/cxl_mempolicy/Kconfig"
+source "drivers/cxl/type3_drivers/cxl_compression/Kconfig"
diff --git a/drivers/cxl/type3_drivers/Makefile b/drivers/cxl/type3_drivers/Makefile
index 2b82265ff118..f5b0766d92af 100644
--- a/drivers/cxl/type3_drivers/Makefile
+++ b/drivers/cxl/type3_drivers/Makefile
@@ -1,2 +1,3 @@
# SPDX-License-Identifier: GPL-2.0
obj-$(CONFIG_CXL_MEMPOLICY) += cxl_mempolicy/
+obj-$(CONFIG_CXL_COMPRESSION) += cxl_compression/
diff --git a/drivers/cxl/type3_drivers/cxl_compression/Kconfig b/drivers/cxl/type3_drivers/cxl_compression/Kconfig
new file mode 100644
index 000000000000..8c891a48b000
--- /dev/null
+++ b/drivers/cxl/type3_drivers/cxl_compression/Kconfig
@@ -0,0 +1,20 @@
+config CXL_COMPRESSION
+ tristate "CXL Compression Memory Driver"
+ depends on CXL_PCI
+ depends on CXL_REGION
+ depends on CRAM
+ help
+ This driver provides an alternative PCI binding for CXL memory
+ devices with compressed memory support. It converts CXL RAM
+ regions to sysram for direct memory hotplug and registers with
+ the CRAM subsystem for transparent compression.
+
+ Page reclamation uses the standard CXL Media Operations Zero
+ command (opcode 0x4402). If the device does not support it,
+ the driver falls back to inline CPU zeroing.
+
+ Usage: First unbind the device from cxl_pci, then bind to
+ cxl_compression. The driver will initialize the CXL device and
+ convert any RAM regions to use direct memory hotplug via sysram.
+
+ If unsure say 'n'.
diff --git a/drivers/cxl/type3_drivers/cxl_compression/Makefile b/drivers/cxl/type3_drivers/cxl_compression/Makefile
new file mode 100644
index 000000000000..46f34809bf74
--- /dev/null
+++ b/drivers/cxl/type3_drivers/cxl_compression/Makefile
@@ -0,0 +1,4 @@
+# SPDX-License-Identifier: GPL-2.0
+obj-$(CONFIG_CXL_COMPRESSION) += cxl_compression.o
+cxl_compression-y := compression.o
+ccflags-y += -I$(srctree)/drivers/cxl
diff --git a/drivers/cxl/type3_drivers/cxl_compression/compression.c b/drivers/cxl/type3_drivers/cxl_compression/compression.c
new file mode 100644
index 000000000000..e4c8b62227e2
--- /dev/null
+++ b/drivers/cxl/type3_drivers/cxl_compression/compression.c
@@ -0,0 +1,1025 @@
+// SPDX-License-Identifier: GPL-2.0-only
+/* Copyright(c) 2026 Meta Platforms, Inc. All rights reserved. */
+/*
+ * CXL Compression Driver
+ *
+ * This driver provides an alternative binding for CXL memory devices that
+ * converts all associated RAM regions to sysram_regions for direct memory
+ * hotplug, bypassing the standard dax region path.
+ *
+ * Page reclamation uses the standard CXL Media Operations Zero command
+ * (opcode 0x4402, class 0x01, subclass 0x01). Watermark interrupts
+ * are delivered via separate MSI-X vectors (12 for lthresh, 13 for
+ * hthresh), injected externally via QMP.
+ *
+ * Usage:
+ * 1. Device initially binds to cxl_pci at boot
+ * 2. Unbind from cxl_pci:
+ * echo $PCI_DEV > /sys/bus/pci/drivers/cxl_pci/unbind
+ * 3. Bind to cxl_compression:
+ * echo $PCI_DEV > /sys/bus/pci/drivers/cxl_compression/bind
+ */
+
+#include <linux/unaligned.h>
+#include <linux/io-64-nonatomic-lo-hi.h>
+#include <linux/module.h>
+#include <linux/delay.h>
+#include <linux/sizes.h>
+#include <linux/mutex.h>
+#include <linux/list.h>
+#include <linux/pci.h>
+#include <linux/io.h>
+#include <linux/interrupt.h>
+#include <linux/bitmap.h>
+#include <linux/highmem.h>
+#include <linux/workqueue.h>
+#include <linux/kthread.h>
+#include <linux/rcupdate.h>
+#include <linux/percpu.h>
+#include <linux/sched.h>
+#include <linux/cram.h>
+#include <linux/memory_hotplug.h>
+#include <linux/xarray.h>
+#include <cxl/mailbox.h>
+#include "cxlmem.h"
+#include "cxl.h"
+
+/*
+ * Per-device compression context lookup.
+ *
+ * pci_set_drvdata() MUST store cxlds because mbox_to_cxlds() uses
+ * dev_get_drvdata() to recover the cxl_dev_state from the mailbox host
+ * device. Storing anything else in pci drvdata breaks every CXL mailbox
+ * command. Use an xarray keyed by pci_dev pointer so that multiple
+ * devices can bind concurrently without colliding.
+ */
+static DEFINE_XARRAY(comp_ctx_xa);
+
+static struct cxl_compression_ctx *pdev_to_comp_ctx(struct pci_dev *pdev)
+{
+ return xa_load(&comp_ctx_xa, (unsigned long)pdev);
+}
+
+#define CXL_MEDIA_OP_OPCODE 0x4402
+#define CXL_MEDIA_OP_CLASS_SANITIZE 0x01
+#define CXL_MEDIA_OP_SUBC_ZERO 0x01
+
+struct cxl_dpa_range {
+ __le64 starting_dpa;
+ __le64 length;
+} __packed;
+
+struct cxl_media_op_input {
+ u8 media_operation_class;
+ u8 media_operation_subclass;
+ __le16 reserved;
+ __le32 dpa_range_count;
+ struct cxl_dpa_range ranges[];
+} __packed;
+
+#define CXL_CT3_MSIX_LTHRESH 12
+#define CXL_CT3_MSIX_HTHRESH 13
+#define CXL_CT3_MSIX_VECTOR_NR 14
+#define CXL_FLUSH_INTERVAL_DEFAULT_MS 1000
+
+static unsigned int flush_buf_size;
+module_param(flush_buf_size, uint, 0444);
+MODULE_PARM_DESC(flush_buf_size,
+ "Max DPA ranges per media ops CCI command (0 = use hw max)");
+
+static unsigned int flush_interval_ms = CXL_FLUSH_INTERVAL_DEFAULT_MS;
+module_param(flush_interval_ms, uint, 0644);
+MODULE_PARM_DESC(flush_interval_ms,
+ "Flush worker interval in ms (default 1000)");
+
+struct cxl_flush_buf {
+ unsigned int count;
+ unsigned int max; /* max ranges per command */
+ struct cxl_media_op_input *cmd; /* pre-allocated CCI payload */
+ struct folio **folios; /* parallel folio tracking */
+};
+
+struct cxl_flush_ctx;
+
+struct cxl_pcpu_flush {
+ struct cxl_flush_buf __rcu *active; /* callback writes here */
+ struct cxl_flush_buf *overflow_spare; /* spare for overflow work */
+ struct work_struct overflow_work; /* per-CPU overflow flush */
+ struct cxl_flush_ctx *ctx; /* backpointer */
+};
+
+/**
+ * struct cxl_flush_ctx - Per-region flush context
+ * @flush_record: two-level bitmap, 1 bit per 4KB page, tracks in-flight ops
+ * @flush_record_pages: number of pages in the flush_record array
+ * @nr_pages: total number of 4KB pages in the region
+ * @base_pfn: starting PFN of the region (for DPA offset calculation)
+ * @buf_max: max DPA ranges per CCI command
+ * @media_ops_supported: true if device supports media operations zero
+ * @pcpu: per-CPU flush state
+ * @kthread_spares: array[nr_cpu_ids] of spare buffers for the kthread
+ * @flush_thread: round-robin kthread
+ * @mbox: pointer to CXL mailbox for sending CCI commands
+ * @dev: device for logging
+ * @nid: NUMA node of the private region
+ */
+struct cxl_flush_ctx {
+ unsigned long **flush_record;
+ unsigned int flush_record_pages;
+ unsigned long nr_pages;
+ unsigned long base_pfn;
+ unsigned int buf_max;
+ bool media_ops_supported;
+ struct cxl_pcpu_flush __percpu *pcpu;
+ struct cxl_flush_buf **kthread_spares;
+ struct task_struct *flush_thread;
+ struct cxl_mailbox *mbox;
+ struct device *dev;
+ int nid;
+};
+
+/* Bits per page-sized bitmap chunk */
+#define FLUSH_RECORD_BITS_PER_PAGE (PAGE_SIZE * BITS_PER_BYTE)
+#define FLUSH_RECORD_SHIFT (PAGE_SHIFT + 3)
+
+static unsigned long **flush_record_alloc(unsigned long nr_bits,
+ unsigned int *nr_pages_out)
+{
+ unsigned int nr_pages = DIV_ROUND_UP(nr_bits, FLUSH_RECORD_BITS_PER_PAGE);
+ unsigned long **pages;
+ unsigned int i;
+
+ pages = kcalloc(nr_pages, sizeof(*pages), GFP_KERNEL);
+ if (!pages)
+ return NULL;
+
+ for (i = 0; i < nr_pages; i++) {
+ pages[i] = (unsigned long *)get_zeroed_page(GFP_KERNEL);
+ if (!pages[i])
+ goto err;
+ }
+
+ *nr_pages_out = nr_pages;
+ return pages;
+
+err:
+ while (i--)
+ free_page((unsigned long)pages[i]);
+ kfree(pages);
+ return NULL;
+}
+
+static void flush_record_free(unsigned long **pages, unsigned int nr_pages)
+{
+ unsigned int i;
+
+ if (!pages)
+ return;
+
+ for (i = 0; i < nr_pages; i++)
+ free_page((unsigned long)pages[i]);
+ kfree(pages);
+}
+
+static inline bool flush_record_test_and_clear(unsigned long **pages,
+ unsigned long idx)
+{
+ return test_and_clear_bit(idx & (FLUSH_RECORD_BITS_PER_PAGE - 1),
+ pages[idx >> FLUSH_RECORD_SHIFT]);
+}
+
+static inline void flush_record_set(unsigned long **pages, unsigned long idx)
+{
+ set_bit(idx & (FLUSH_RECORD_BITS_PER_PAGE - 1),
+ pages[idx >> FLUSH_RECORD_SHIFT]);
+}
+
+static struct cxl_flush_buf *cxl_flush_buf_alloc(unsigned int max, int nid)
+{
+ struct cxl_flush_buf *buf;
+
+ buf = kzalloc_node(sizeof(*buf), GFP_KERNEL, nid);
+ if (!buf)
+ return NULL;
+
+ buf->max = max;
+ buf->cmd = kvzalloc_node(struct_size(buf->cmd, ranges, max),
+ GFP_KERNEL, nid);
+ if (!buf->cmd)
+ goto err_cmd;
+
+ buf->folios = kcalloc_node(max, sizeof(struct folio *),
+ GFP_KERNEL, nid);
+ if (!buf->folios)
+ goto err_folios;
+
+ return buf;
+
+err_folios:
+ kvfree(buf->cmd);
+err_cmd:
+ kfree(buf);
+ return NULL;
+}
+
+static void cxl_flush_buf_free(struct cxl_flush_buf *buf)
+{
+ if (!buf)
+ return;
+ kvfree(buf->cmd);
+ kfree(buf->folios);
+ kfree(buf);
+}
+
+static inline void cxl_flush_buf_reset(struct cxl_flush_buf *buf)
+{
+ buf->count = 0;
+}
+
+static void cxl_flush_buf_send(struct cxl_flush_ctx *ctx,
+ struct cxl_flush_buf *buf)
+{
+ struct cxl_mbox_cmd mbox_cmd;
+ unsigned int count = buf->count;
+ unsigned int i;
+ int rc;
+
+ if (count == 0)
+ return;
+
+ if (!ctx->media_ops_supported) {
+ /* No device support, zero all folios inline */
+ for (i = 0; i < count; i++)
+ folio_zero_range(buf->folios[i], 0,
+ folio_size(buf->folios[i]));
+ goto release;
+ }
+
+ buf->cmd->media_operation_class = CXL_MEDIA_OP_CLASS_SANITIZE;
+ buf->cmd->media_operation_subclass = CXL_MEDIA_OP_SUBC_ZERO;
+ buf->cmd->reserved = 0;
+ buf->cmd->dpa_range_count = cpu_to_le32(count);
+
+ mbox_cmd = (struct cxl_mbox_cmd) {
+ .opcode = CXL_MEDIA_OP_OPCODE,
+ .payload_in = buf->cmd,
+ .size_in = struct_size(buf->cmd, ranges, count),
+ .poll_interval_ms = 1000,
+ .poll_count = 30,
+ };
+
+ rc = cxl_internal_send_cmd(ctx->mbox, &mbox_cmd);
+ if (rc) {
+ dev_warn(ctx->dev,
+ "media ops zero CCI command failed: %d\n", rc);
+
+ /* Zero all folios inline on failure */
+ for (i = 0; i < count; i++)
+ folio_zero_range(buf->folios[i], 0,
+ folio_size(buf->folios[i]));
+ }
+
+release:
+ for (i = 0; i < count; i++)
+ folio_put(buf->folios[i]);
+
+ cxl_flush_buf_reset(buf);
+}
+
+static int cxl_compression_flush_cb(struct folio *folio, void *private)
+{
+ struct cxl_flush_ctx *ctx = private;
+ unsigned long pfn = folio_pfn(folio);
+ unsigned long idx = pfn - ctx->base_pfn;
+ unsigned long nr = folio_nr_pages(folio);
+ struct cxl_pcpu_flush *pcpu;
+ struct cxl_flush_buf *buf;
+ unsigned long flags;
+ unsigned int pos;
+
+ /* Case (a): flush record bit set, resolution from our media op */
+ if (flush_record_test_and_clear(ctx->flush_record, idx))
+ return 0;
+
+ dev_dbg_ratelimited(ctx->dev,
+ "flush_cb: folio pfn=%lx order=%u idx=%lu cpu=%d\n",
+ pfn, folio_order(folio), idx,
+ raw_smp_processor_id());
+
+ local_irq_save(flags);
+ rcu_read_lock();
+
+ pcpu = this_cpu_ptr(ctx->pcpu);
+ buf = rcu_dereference(pcpu->active);
+
+ if (unlikely(!buf || buf->count >= buf->max)) {
+ rcu_read_unlock();
+ local_irq_restore(flags);
+ if (buf)
+ schedule_work_on(raw_smp_processor_id(),
+ &pcpu->overflow_work);
+ return 2;
+ }
+
+ /* Case (b): write DPA range directly into pre-formatted CCI buffer */
+ folio_get(folio);
+ flush_record_set(ctx->flush_record, idx);
+
+ pos = buf->count;
+ buf->folios[pos] = folio;
+ buf->cmd->ranges[pos].starting_dpa = cpu_to_le64((u64)idx * PAGE_SIZE);
+ buf->cmd->ranges[pos].length = cpu_to_le64((u64)nr * PAGE_SIZE);
+ buf->count = pos + 1;
+
+ rcu_read_unlock();
+ local_irq_restore(flags);
+
+ return 1;
+}
+
+static int cxl_flush_kthread_fn(void *data)
+{
+ struct cxl_flush_ctx *ctx = data;
+ struct cxl_flush_buf *dirty;
+ struct cxl_pcpu_flush *pcpu;
+ int cpu;
+ bool any_dirty;
+
+ while (!kthread_should_stop()) {
+ any_dirty = false;
+
+ /* Phase 1: Swap all per-CPU buffers */
+ for_each_possible_cpu(cpu) {
+ struct cxl_flush_buf *spare = ctx->kthread_spares[cpu];
+
+ if (!spare)
+ continue;
+
+ pcpu = per_cpu_ptr(ctx->pcpu, cpu);
+ cxl_flush_buf_reset(spare);
+ dirty = rcu_replace_pointer(pcpu->active, spare, true);
+ ctx->kthread_spares[cpu] = dirty;
+
+ if (dirty && dirty->count > 0) {
+ dev_dbg(ctx->dev,
+ "flush_kthread: cpu=%d has %u dirty ranges\n",
+ cpu, dirty->count);
+ any_dirty = true;
+ }
+ }
+
+ if (!any_dirty)
+ goto sleep;
+
+ /* Phase 2: Single synchronize_rcu for all swaps */
+ synchronize_rcu();
+
+ /* Phase 3: Send CCI commands for dirty buffers */
+ for_each_possible_cpu(cpu) {
+ dirty = ctx->kthread_spares[cpu];
+ if (dirty && dirty->count > 0)
+ cxl_flush_buf_send(ctx, dirty);
+ /* dirty is now clean, stays as kthread_spares[cpu] */
+ }
+
+sleep:
+ schedule_timeout_interruptible(
+ msecs_to_jiffies(flush_interval_ms));
+ }
+
+ return 0;
+}
+
+static void cxl_flush_overflow_work(struct work_struct *work)
+{
+ struct cxl_pcpu_flush *pcpu =
+ container_of(work, struct cxl_pcpu_flush, overflow_work);
+ struct cxl_flush_ctx *ctx = pcpu->ctx;
+ struct cxl_flush_buf *dirty, *spare;
+ unsigned long flags;
+
+ dev_dbg(ctx->dev, "flush_overflow: cpu=%d buffer full, flushing\n",
+ raw_smp_processor_id());
+
+ spare = pcpu->overflow_spare;
+ if (!spare)
+ return;
+
+ cxl_flush_buf_reset(spare);
+
+ local_irq_save(flags);
+ dirty = rcu_replace_pointer(pcpu->active, spare, true);
+ local_irq_restore(flags);
+
+ pcpu->overflow_spare = dirty;
+
+ synchronize_rcu();
+ cxl_flush_buf_send(ctx, dirty);
+}
+
+struct cxl_teardown_ctx {
+ struct cxl_flush_ctx *flush_ctx;
+ struct cxl_sysram *sysram;
+ int nid;
+};
+
+static void cxl_compression_pre_teardown(void *data)
+{
+ struct cxl_teardown_ctx *tctx = data;
+
+ if (!tctx->flush_ctx)
+ return;
+
+ /*
+ * Unregister the CRAM node before memory goes offline.
+ * node_private_clear_ops requires the node_private to still
+ * exist, which is destroyed during memory removal.
+ */
+ cram_unregister_private_node(tctx->nid);
+
+ /*
+ * Offline and remove CXL memory with retry. CXL compressed
+ * memory may have pages pinned by in-flight flush operations;
+ * keep retrying until they complete. Once done, sysram->res
+ * is NULL so the devm sysram_unregister action that follows
+ * will skip the hotplug removal.
+ */
+ if (tctx->sysram) {
+ int rc, retries = 0;
+
+ while (true) {
+ rc = cxl_sysram_offline_and_remove(tctx->sysram);
+ if (!rc)
+ break;
+ if (++retries > 60) {
+ pr_err("cxl_compression: memory offline failed after %d retries, giving up\n",
+ retries);
+ break;
+ }
+ pr_info("cxl_compression: memory offline failed (%d), retrying...\n",
+ rc);
+ msleep(1000);
+ }
+ }
+}
+
+static void cxl_compression_post_teardown(void *data)
+{
+ struct cxl_teardown_ctx *tctx = data;
+ struct cxl_flush_ctx *ctx = tctx->flush_ctx;
+ struct cxl_pcpu_flush *pcpu;
+ struct cxl_flush_buf *buf;
+ int cpu;
+
+ if (!ctx)
+ return;
+
+ /* cram_unregister_private_node already called in pre_teardown */
+
+ if (ctx->flush_thread) {
+ kthread_stop(ctx->flush_thread);
+ ctx->flush_thread = NULL;
+ }
+
+ for_each_possible_cpu(cpu) {
+ pcpu = per_cpu_ptr(ctx->pcpu, cpu);
+ cancel_work_sync(&pcpu->overflow_work);
+ }
+
+ for_each_possible_cpu(cpu) {
+ pcpu = per_cpu_ptr(ctx->pcpu, cpu);
+
+ buf = rcu_dereference_raw(pcpu->active);
+ if (buf && buf->count > 0)
+ cxl_flush_buf_send(ctx, buf);
+
+ if (pcpu->overflow_spare && pcpu->overflow_spare->count > 0)
+ cxl_flush_buf_send(ctx, pcpu->overflow_spare);
+
+ if (ctx->kthread_spares && ctx->kthread_spares[cpu]) {
+ buf = ctx->kthread_spares[cpu];
+ if (buf->count > 0)
+ cxl_flush_buf_send(ctx, buf);
+ }
+ }
+
+ for_each_possible_cpu(cpu) {
+ pcpu = per_cpu_ptr(ctx->pcpu, cpu);
+
+ buf = rcu_dereference_raw(pcpu->active);
+ cxl_flush_buf_free(buf);
+
+ cxl_flush_buf_free(pcpu->overflow_spare);
+
+ if (ctx->kthread_spares)
+ cxl_flush_buf_free(ctx->kthread_spares[cpu]);
+ }
+
+ kfree(ctx->kthread_spares);
+ free_percpu(ctx->pcpu);
+ flush_record_free(ctx->flush_record, ctx->flush_record_pages);
+}
+
+/**
+ * struct cxl_compression_ctx - Per-device context for compression driver
+ * @mbox: CXL mailbox for issuing CCI commands
+ * @pdev: PCI device
+ * @flush_ctx: Flush context for deferred page reclamation
+ * @tctx: Teardown context for devm actions
+ * @sysram: Sysram device for offline+remove in remove path
+ * @nid: NUMA node ID, NUMA_NO_NODE if unset
+ * @cxlmd: The memdev associated with this context
+ * @cxlr: Region created by this driver (NULL if pre-existing)
+ * @cxled: Endpoint decoder with DPA allocated by this driver
+ * @regions_converted: Number of regions successfully converted
+ * @media_ops_supported: Device supports media operations zero (0x4402)
+ */
+struct cxl_compression_ctx {
+ struct cxl_mailbox *mbox;
+ struct pci_dev *pdev;
+ struct cxl_flush_ctx *flush_ctx;
+ struct cxl_teardown_ctx *tctx;
+ struct cxl_sysram *sysram;
+ int nid;
+ struct cxl_memdev *cxlmd;
+ struct cxl_region *cxlr;
+ struct cxl_endpoint_decoder *cxled;
+ int regions_converted;
+ bool media_ops_supported;
+};
+
+/*
+ * Probe whether the device supports Media Operations Zero (0x4402).
+ * Send a zero-count command, a conforming device returns SUCCESS,
+ * a device that doesn't support it returns UNSUPPORTED (-ENXIO).
+ */
+static bool cxl_probe_media_ops_zero(struct cxl_mailbox *mbox,
+ struct device *dev)
+{
+ struct cxl_media_op_input probe = {
+ .media_operation_class = CXL_MEDIA_OP_CLASS_SANITIZE,
+ .media_operation_subclass = CXL_MEDIA_OP_SUBC_ZERO,
+ .dpa_range_count = 0,
+ };
+ struct cxl_mbox_cmd cmd = {
+ .opcode = CXL_MEDIA_OP_OPCODE,
+ .payload_in = &probe,
+ .size_in = sizeof(probe),
+ };
+ int rc;
+
+ rc = cxl_internal_send_cmd(mbox, &cmd);
+ if (rc) {
+ dev_info(dev,
+ "media operations zero not supported (rc=%d), using inline zeroing\n",
+ rc);
+ return false;
+ }
+
+ dev_info(dev, "media operations zero (0x4402) supported\n");
+ return true;
+}
+
+struct cxl_compression_wm_ctx {
+ struct device *dev;
+ int nid;
+};
+
+static irqreturn_t cxl_compression_lthresh_irq(int irq, void *data)
+{
+ struct cxl_compression_wm_ctx *wm = data;
+
+ dev_info(wm->dev, "lthresh watermark: pressuring node %d\n", wm->nid);
+ cram_set_pressure(wm->nid, CRAM_PRESSURE_MAX);
+ return IRQ_HANDLED;
+}
+
+static irqreturn_t cxl_compression_hthresh_irq(int irq, void *data)
+{
+ struct cxl_compression_wm_ctx *wm = data;
+
+ dev_info(wm->dev, "hthresh watermark: resuming node %d\n", wm->nid);
+ cram_set_pressure(wm->nid, 0);
+ return IRQ_HANDLED;
+}
+
+static int convert_region_to_sysram(struct cxl_region *cxlr,
+ struct pci_dev *pdev)
+{
+ struct cxl_compression_ctx *comp_ctx = pdev_to_comp_ctx(pdev);
+ struct device *dev = cxl_region_dev(cxlr);
+ struct cxl_compression_wm_ctx *wm_ctx;
+ struct cxl_teardown_ctx *tctx;
+ struct cxl_flush_ctx *flush_ctx;
+ struct cxl_pcpu_flush *pcpu;
+ resource_size_t region_start, region_size;
+ struct range hpa_range;
+ int nid;
+ int irq;
+ int cpu;
+ int rc;
+
+ if (cxl_region_mode(cxlr) != CXL_PARTMODE_RAM) {
+ dev_dbg(dev, "skipping non-RAM region (mode=%d)\n",
+ cxl_region_mode(cxlr));
+ return 0;
+ }
+
+ dev_info(dev, "converting region to sysram\n");
+
+ rc = devm_cxl_add_sysram(cxlr, true, MMOP_ONLINE_MOVABLE);
+ if (rc) {
+ dev_err(dev, "failed to add sysram region: %d\n", rc);
+ return rc;
+ }
+
+ tctx = devm_kzalloc(dev, sizeof(*tctx), GFP_KERNEL);
+ if (!tctx)
+ return -ENOMEM;
+
+ rc = devm_add_action_or_reset(dev, cxl_compression_post_teardown, tctx);
+ if (rc)
+ return rc;
+
+ /* Find the sysram child device for pre_teardown */
+ comp_ctx->sysram = cxl_region_find_sysram(cxlr);
+ if (comp_ctx->sysram)
+ tctx->sysram = comp_ctx->sysram;
+
+ rc = cxl_get_region_range(cxlr, &hpa_range);
+ if (rc) {
+ dev_err(dev, "failed to get region range: %d\n", rc);
+ return rc;
+ }
+
+ nid = phys_to_target_node(hpa_range.start);
+ if (nid == NUMA_NO_NODE)
+ nid = memory_add_physaddr_to_nid(hpa_range.start);
+
+ region_start = hpa_range.start;
+ region_size = range_len(&hpa_range);
+
+ flush_ctx = devm_kzalloc(dev, sizeof(*flush_ctx), GFP_KERNEL);
+ if (!flush_ctx)
+ return -ENOMEM;
+
+ flush_ctx->base_pfn = PHYS_PFN(region_start);
+ flush_ctx->nr_pages = region_size >> PAGE_SHIFT;
+ flush_ctx->flush_record = flush_record_alloc(flush_ctx->nr_pages,
+ &flush_ctx->flush_record_pages);
+ if (!flush_ctx->flush_record)
+ return -ENOMEM;
+
+ flush_ctx->mbox = comp_ctx->mbox;
+ flush_ctx->dev = dev;
+ flush_ctx->nid = nid;
+ flush_ctx->media_ops_supported = comp_ctx->media_ops_supported;
+
+ /*
+ * Cap buffer at max DPA ranges that fit in one CCI payload.
+ * Header is 8 bytes (struct cxl_media_op_input), each range
+ * is 16 bytes (struct cxl_dpa_range). The module parameter
+ * flush_buf_size can further limit this (0 = use hw max).
+ */
+ flush_ctx->buf_max = (flush_ctx->mbox->payload_size -
+ sizeof(struct cxl_media_op_input)) /
+ sizeof(struct cxl_dpa_range);
+ if (flush_buf_size && flush_buf_size < flush_ctx->buf_max)
+ flush_ctx->buf_max = flush_buf_size;
+ if (flush_ctx->buf_max == 0)
+ flush_ctx->buf_max = 1;
+
+ dev_info(dev,
+ "flush buffer: %u DPA ranges per command (payload %zu bytes, media_ops %s)\n",
+ flush_ctx->buf_max, flush_ctx->mbox->payload_size,
+ flush_ctx->media_ops_supported ? "yes" : "no");
+
+ flush_ctx->pcpu = alloc_percpu(struct cxl_pcpu_flush);
+ if (!flush_ctx->pcpu)
+ return -ENOMEM;
+
+ flush_ctx->kthread_spares = kcalloc(nr_cpu_ids,
+ sizeof(struct cxl_flush_buf *),
+ GFP_KERNEL);
+ if (!flush_ctx->kthread_spares)
+ goto err_pcpu_init;
+
+ for_each_possible_cpu(cpu) {
+ struct cxl_flush_buf *active_buf, *overflow_buf, *spare_buf;
+
+ active_buf = cxl_flush_buf_alloc(flush_ctx->buf_max, nid);
+ if (!active_buf)
+ goto err_pcpu_init;
+
+ overflow_buf = cxl_flush_buf_alloc(flush_ctx->buf_max, nid);
+ if (!overflow_buf) {
+ cxl_flush_buf_free(active_buf);
+ goto err_pcpu_init;
+ }
+
+ spare_buf = cxl_flush_buf_alloc(flush_ctx->buf_max, nid);
+ if (!spare_buf) {
+ cxl_flush_buf_free(active_buf);
+ cxl_flush_buf_free(overflow_buf);
+ goto err_pcpu_init;
+ }
+
+ pcpu = per_cpu_ptr(flush_ctx->pcpu, cpu);
+ pcpu->ctx = flush_ctx;
+ rcu_assign_pointer(pcpu->active, active_buf);
+ pcpu->overflow_spare = overflow_buf;
+ INIT_WORK(&pcpu->overflow_work, cxl_flush_overflow_work);
+
+ flush_ctx->kthread_spares[cpu] = spare_buf;
+ }
+
+ flush_ctx->flush_thread = kthread_create_on_node(
+ cxl_flush_kthread_fn, flush_ctx, nid, "cxl-flush/%d", nid);
+ if (IS_ERR(flush_ctx->flush_thread)) {
+ rc = PTR_ERR(flush_ctx->flush_thread);
+ flush_ctx->flush_thread = NULL;
+ goto err_pcpu_init;
+ }
+ wake_up_process(flush_ctx->flush_thread);
+
+ rc = cram_register_private_node(nid, cxlr,
+ cxl_compression_flush_cb, flush_ctx);
+ if (rc) {
+ dev_err(dev, "failed to register cram node %d: %d\n", nid, rc);
+ goto err_pcpu_init;
+ }
+
+ tctx->flush_ctx = flush_ctx;
+ tctx->nid = nid;
+
+ rc = devm_add_action_or_reset(dev, cxl_compression_pre_teardown, tctx);
+ if (rc)
+ return rc;
+
+ comp_ctx->flush_ctx = flush_ctx;
+ comp_ctx->tctx = tctx;
+ comp_ctx->nid = nid;
+
+ /*
+ * Register watermark IRQ handlers on &pdev->dev for
+ * MSI-X vector 12 (lthresh) and vector 13 (hthresh).
+ */
+ wm_ctx = devm_kzalloc(&pdev->dev, sizeof(*wm_ctx), GFP_KERNEL);
+ if (!wm_ctx)
+ return -ENOMEM;
+
+ wm_ctx->dev = &pdev->dev;
+ wm_ctx->nid = nid;
+
+ irq = pci_irq_vector(pdev, CXL_CT3_MSIX_LTHRESH);
+ if (irq >= 0) {
+ rc = devm_request_threaded_irq(&pdev->dev, irq, NULL,
+ cxl_compression_lthresh_irq,
+ IRQF_ONESHOT,
+ "cxl-lthresh", wm_ctx);
+ if (rc)
+ dev_warn(&pdev->dev,
+ "failed to register lthresh IRQ: %d\n", rc);
+ }
+
+ irq = pci_irq_vector(pdev, CXL_CT3_MSIX_HTHRESH);
+ if (irq >= 0) {
+ rc = devm_request_threaded_irq(&pdev->dev, irq, NULL,
+ cxl_compression_hthresh_irq,
+ IRQF_ONESHOT,
+ "cxl-hthresh", wm_ctx);
+ if (rc)
+ dev_warn(&pdev->dev,
+ "failed to register hthresh IRQ: %d\n", rc);
+ }
+
+ return 0;
+
+err_pcpu_init:
+ if (flush_ctx->flush_thread)
+ kthread_stop(flush_ctx->flush_thread);
+ for_each_possible_cpu(cpu) {
+ struct cxl_flush_buf *buf;
+
+ pcpu = per_cpu_ptr(flush_ctx->pcpu, cpu);
+
+ buf = rcu_dereference_raw(pcpu->active);
+ cxl_flush_buf_free(buf);
+
+ cxl_flush_buf_free(pcpu->overflow_spare);
+
+ if (flush_ctx->kthread_spares)
+ cxl_flush_buf_free(flush_ctx->kthread_spares[cpu]);
+ }
+ kfree(flush_ctx->kthread_spares);
+ free_percpu(flush_ctx->pcpu);
+ flush_record_free(flush_ctx->flush_record, flush_ctx->flush_record_pages);
+ return rc ? rc : -ENOMEM;
+}
+
+static struct cxl_region *create_ram_region(struct cxl_memdev *cxlmd)
+{
+ struct cxl_root_decoder *cxlrd;
+ struct cxl_endpoint_decoder *cxled;
+ struct cxl_region *cxlr;
+ resource_size_t ram_size, avail;
+
+ ram_size = cxl_ram_size(cxlmd->cxlds);
+ if (ram_size == 0) {
+ dev_info(&cxlmd->dev, "no RAM capacity available\n");
+ return ERR_PTR(-ENODEV);
+ }
+
+ ram_size = ALIGN_DOWN(ram_size, SZ_256M);
+ if (ram_size == 0) {
+ dev_info(&cxlmd->dev, "RAM capacity too small (< 256M)\n");
+ return ERR_PTR(-ENOSPC);
+ }
+
+ dev_info(&cxlmd->dev, "creating RAM region for %lld MB\n",
+ ram_size >> 20);
+
+ cxlrd = cxl_get_hpa_freespace(cxlmd, ram_size, &avail);
+ if (IS_ERR(cxlrd)) {
+ dev_err(&cxlmd->dev, "no HPA freespace: %ld\n",
+ PTR_ERR(cxlrd));
+ return ERR_CAST(cxlrd);
+ }
+
+ cxled = cxl_request_dpa(cxlmd, CXL_PARTMODE_RAM, ram_size);
+ if (IS_ERR(cxled)) {
+ dev_err(&cxlmd->dev, "failed to request DPA: %ld\n",
+ PTR_ERR(cxled));
+ cxl_put_root_decoder(cxlrd);
+ return ERR_CAST(cxled);
+ }
+
+ cxlr = cxl_create_region(cxlrd, &cxled, 1);
+ cxl_put_root_decoder(cxlrd);
+ if (IS_ERR(cxlr)) {
+ dev_err(&cxlmd->dev, "failed to create region: %ld\n",
+ PTR_ERR(cxlr));
+ cxl_dpa_free(cxled);
+ return cxlr;
+ }
+
+ dev_info(&cxlmd->dev, "created region %s\n",
+ dev_name(cxl_region_dev(cxlr)));
+ pdev_to_comp_ctx(to_pci_dev(cxlmd->dev.parent))->cxled = cxled;
+ return cxlr;
+}
+
+static int cxl_compression_attach_probe(struct cxl_memdev *cxlmd)
+{
+ struct pci_dev *pdev = to_pci_dev(cxlmd->dev.parent);
+ struct cxl_compression_ctx *comp_ctx = pdev_to_comp_ctx(pdev);
+ struct cxl_region *regions[8];
+ struct cxl_region *cxlr;
+ int nr, i, converted = 0, errors = 0;
+ int rc;
+
+ comp_ctx->cxlmd = cxlmd;
+ comp_ctx->mbox = &cxlmd->cxlds->cxl_mbox;
+
+ /* Probe device for media operations zero support */
+ comp_ctx->media_ops_supported =
+ cxl_probe_media_ops_zero(comp_ctx->mbox,
+ &cxlmd->dev);
+
+ dev_info(&cxlmd->dev, "compression attach: looking for regions\n");
+
+ nr = cxl_get_committed_regions(cxlmd, regions, ARRAY_SIZE(regions));
+ for (i = 0; i < nr; i++) {
+ if (cxl_region_mode(regions[i]) == CXL_PARTMODE_RAM) {
+ rc = convert_region_to_sysram(regions[i], pdev);
+ if (rc)
+ errors++;
+ else
+ converted++;
+ }
+ put_device(cxl_region_dev(regions[i]));
+ }
+
+ if (converted > 0) {
+ dev_info(&cxlmd->dev,
+ "converted %d regions to sysram (%d errors)\n",
+ converted, errors);
+ return errors ? -EIO : 0;
+ }
+
+ dev_info(&cxlmd->dev, "no existing regions, creating RAM region\n");
+
+ cxlr = create_ram_region(cxlmd);
+ if (IS_ERR(cxlr)) {
+ rc = PTR_ERR(cxlr);
+ if (rc == -ENODEV) {
+ dev_info(&cxlmd->dev,
+ "could not create RAM region: %d\n", rc);
+ return 0;
+ }
+ return rc;
+ }
+
+ rc = convert_region_to_sysram(cxlr, pdev);
+ if (rc) {
+ dev_err(&cxlmd->dev,
+ "failed to convert region to sysram: %d\n", rc);
+ return rc;
+ }
+
+ comp_ctx->cxlr = cxlr;
+
+ dev_info(&cxlmd->dev, "created and converted region %s to sysram\n",
+ dev_name(cxl_region_dev(cxlr)));
+
+ return 0;
+}
+
+static const struct cxl_memdev_attach cxl_compression_attach = {
+ .probe = cxl_compression_attach_probe,
+};
+
+static int cxl_compression_probe(struct pci_dev *pdev,
+ const struct pci_device_id *id)
+{
+ struct cxl_compression_ctx *comp_ctx;
+ struct cxl_memdev *cxlmd;
+ int rc;
+
+ dev_info(&pdev->dev, "cxl_compression: probing device\n");
+
+ comp_ctx = devm_kzalloc(&pdev->dev, sizeof(*comp_ctx), GFP_KERNEL);
+ if (!comp_ctx)
+ return -ENOMEM;
+ comp_ctx->nid = NUMA_NO_NODE;
+ comp_ctx->pdev = pdev;
+
+ rc = xa_insert(&comp_ctx_xa, (unsigned long)pdev, comp_ctx, GFP_KERNEL);
+ if (rc)
+ return rc;
+
+ cxlmd = cxl_pci_type3_probe_init(pdev, &cxl_compression_attach);
+ if (IS_ERR(cxlmd)) {
+ xa_erase(&comp_ctx_xa, (unsigned long)pdev);
+ return PTR_ERR(cxlmd);
+ }
+
+ comp_ctx->cxlmd = cxlmd;
+ comp_ctx->mbox = &cxlmd->cxlds->cxl_mbox;
+
+ dev_info(&pdev->dev, "cxl_compression: probe complete\n");
+ return 0;
+}
+
+static void cxl_compression_remove(struct pci_dev *pdev)
+{
+ struct cxl_compression_ctx *comp_ctx = xa_erase(&comp_ctx_xa,
+ (unsigned long)pdev);
+
+ dev_info(&pdev->dev, "cxl_compression: removing device\n");
+
+ if (!comp_ctx || comp_ctx->nid == NUMA_NO_NODE)
+ return;
+
+ /*
+ * Destroy the region, devm actions on the region device handle teardown
+ * in registration-reverse order:
+ * 1. pre_teardown: cram_unregister + retry-loop memory offline
+ * 2. sysram_unregister: device_unregister (sysram->res is NULL
+ * after pre_teardown, so cxl_sysram_release skips hotplug)
+ * 3. post_teardown: kthread stop, flush cleanup
+ *
+ * PCI MMIO is still live so CCI commands in post_teardown work.
+ */
+ if (comp_ctx->cxlr) {
+ cxl_destroy_region(comp_ctx->cxlr);
+ comp_ctx->cxlr = NULL;
+ }
+
+ if (comp_ctx->cxled) {
+ cxl_dpa_free(comp_ctx->cxled);
+ comp_ctx->cxled = NULL;
+ }
+}
+
+static const struct pci_device_id cxl_compression_pci_tbl[] = {
+ { PCI_DEVICE(PCI_VENDOR_ID_INTEL, 0x0d93) },
+ { /* terminate list */ },
+};
+MODULE_DEVICE_TABLE(pci, cxl_compression_pci_tbl);
+
+static struct pci_driver cxl_compression_driver = {
+ .name = KBUILD_MODNAME,
+ .id_table = cxl_compression_pci_tbl,
+ .probe = cxl_compression_probe,
+ .remove = cxl_compression_remove,
+ .driver = {
+ .probe_type = PROBE_PREFER_ASYNCHRONOUS,
+ },
+};
+
+module_pci_driver(cxl_compression_driver);
+
+MODULE_DESCRIPTION("CXL: Compression Memory Driver with SysRAM regions");
+MODULE_LICENSE("GPL v2");
+MODULE_IMPORT_NS("CXL");
--
2.53.0
Thread overview: 28 messages
2026-02-22 8:48 [LSF/MM/BPF TOPIC][RFC PATCH v4 00/27] Private Memory Nodes (w/ Compressed RAM) Gregory Price
2026-02-22 8:48 ` [RFC PATCH v4 01/27] numa: introduce N_MEMORY_PRIVATE node state Gregory Price
2026-02-22 8:48 ` [RFC PATCH v4 02/27] mm,cpuset: gate allocations from N_MEMORY_PRIVATE behind __GFP_PRIVATE Gregory Price
2026-02-22 8:48 ` [RFC PATCH v4 03/27] mm/page_alloc: add numa_zone_allowed() and wire it up Gregory Price
2026-02-22 8:48 ` [RFC PATCH v4 04/27] mm/page_alloc: Add private node handling to build_zonelists Gregory Price
2026-02-22 8:48 ` [RFC PATCH v4 05/27] mm: introduce folio_is_private_managed() unified predicate Gregory Price
2026-02-22 8:48 ` [RFC PATCH v4 06/27] mm/mlock: skip mlock for managed-memory folios Gregory Price
2026-02-22 8:48 ` [RFC PATCH v4 07/27] mm/madvise: skip madvise " Gregory Price
2026-02-22 8:48 ` [RFC PATCH v4 08/27] mm/ksm: skip KSM " Gregory Price
2026-02-22 8:48 ` [RFC PATCH v4 09/27] mm/khugepaged: skip private node folios when trying to collapse Gregory Price
2026-02-22 8:48 ` [RFC PATCH v4 10/27] mm/swap: add free_folio callback for folio release cleanup Gregory Price
2026-02-22 8:48 ` [RFC PATCH v4 11/27] mm/huge_memory.c: add private node folio split notification callback Gregory Price
2026-02-22 8:48 ` [RFC PATCH v4 12/27] mm/migrate: NP_OPS_MIGRATION - support private node user migration Gregory Price
2026-02-22 8:48 ` [RFC PATCH v4 13/27] mm/mempolicy: NP_OPS_MEMPOLICY - support private node mempolicy Gregory Price
2026-02-22 8:48 ` [RFC PATCH v4 14/27] mm/memory-tiers: NP_OPS_DEMOTION - support private node demotion Gregory Price
2026-02-22 8:48 ` [RFC PATCH v4 15/27] mm/mprotect: NP_OPS_PROTECT_WRITE - gate PTE/PMD write-upgrades Gregory Price
2026-02-22 8:48 ` [RFC PATCH v4 16/27] mm: NP_OPS_RECLAIM - private node reclaim participation Gregory Price
2026-02-22 8:48 ` [RFC PATCH v4 17/27] mm/oom: NP_OPS_OOM_ELIGIBLE - private node OOM participation Gregory Price
2026-02-22 8:48 ` [RFC PATCH v4 18/27] mm/memory: NP_OPS_NUMA_BALANCING - private node NUMA balancing Gregory Price
2026-02-22 8:48 ` [RFC PATCH v4 19/27] mm/compaction: NP_OPS_COMPACTION - private node compaction support Gregory Price
2026-02-22 8:48 ` [RFC PATCH v4 20/27] mm/gup: NP_OPS_LONGTERM_PIN - private node longterm pin support Gregory Price
2026-02-22 8:48 ` [RFC PATCH v4 21/27] mm/memory-failure: add memory_failure callback to node_private_ops Gregory Price
2026-02-22 8:48 ` [RFC PATCH v4 22/27] mm/memory_hotplug: add add_private_memory_driver_managed() Gregory Price
2026-02-22 8:48 ` [RFC PATCH v4 23/27] mm/cram: add compressed ram memory management subsystem Gregory Price
2026-02-22 8:48 ` [RFC PATCH v4 24/27] cxl/core: Add cxl_sysram region type Gregory Price
2026-02-22 8:48 ` [RFC PATCH v4 25/27] cxl/core: Add private node support to cxl_sysram Gregory Price
2026-02-22 8:48 ` [RFC PATCH v4 26/27] cxl: add cxl_mempolicy sample PCI driver Gregory Price
2026-02-22 8:48 ` [RFC PATCH v4 27/27] cxl: add cxl_compression " Gregory Price