linux-mm.kvack.org archive mirror
* [RFC PATCH 0/6] Remove device private pages from physical address space
@ 2025-11-28  4:41 Jordan Niethe
  2025-11-28  4:41 ` [RFC PATCH 1/6] mm/hmm: Add flag to track device private PFNs Jordan Niethe
                   ` (10 more replies)
  0 siblings, 11 replies; 26+ messages in thread
From: Jordan Niethe @ 2025-11-28  4:41 UTC (permalink / raw)
  To: linux-mm
  Cc: balbirs, matthew.brost, akpm, linux-kernel, dri-devel, david,
	ziy, apopple, lorenzo.stoakes, lyude, dakr, airlied, simona,
	rcampbell, mpenttil, jgg, willy

Today, when creating device private struct pages, the first step is to
use request_free_mem_region() to get a range of physical address space
large enough to represent the device's memory. This allocated physical
address range is then remapped as device private memory using
memremap_pages().

Needing allocation of physical address space has some problems:

  1) There may be insufficient physical address space to represent the
     device memory. KASLR reducing the physical address space and VM
     configurations with limited physical address space increase the
     likelihood of hitting this, especially as device memory sizes grow.
     This has been observed to prevent device private memory from being
     initialized.

  2) Attempting to add the device private pages to the linear map at
     addresses beyond the actual physical memory causes issues on
     architectures like aarch64, meaning the feature does not work there [0].

This RFC changes device private memory so that it does not require
allocation of physical address space, avoiding these problems.
Instead of using the physical address space, we introduce a "device
private address space" and allocate from there.

A consequence of placing the device private pages outside of the
physical address space is that they no longer have a PFN. However, it is
still necessary to be able to look up a corresponding device private
page from a device private PTE entry, which means that we still require
some way to index into this device private address space. This leads to
the idea of a device private PFN. This is like a PFN but instead of
associating memory in the physical address space with a struct page, it
associates device memory in the device private address space with a
device private struct page.

The problem that then needs to be addressed is how to avoid confusing
these device private PFNs with regular PFNs. It is the inherently
limited usage of the device private pages themselves which makes this
possible. A device private page is only used for userspace mappings, so
we do not need to be concerned with it being used within the mm more
broadly. This means that the only way that the core kernel looks up
these pages is via the page table, where their PTE already indicates if
they refer to a device private page via their swap type, e.g.
SWP_DEVICE_WRITE. We can use this information to determine if the PTE
contains a normal PFN which should be looked up in the page map, or a
device private PFN which should be looked up elsewhere.
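
As a concrete illustration, that lookup reduces to something like the
following (condensed from the pfn_swap_entry_to_page() change in the last
patch; the device private migration entries added later in the series are
handled the same way):

    static inline struct page *pfn_swap_entry_to_page(swp_entry_t entry)
    {
            /* Device private swap types carry a device private PFN. */
            if (is_device_private_entry(entry) ||
                is_device_private_migration_entry(entry))
                    return device_private_entry_to_page(entry);

            /* Everything else still carries a regular PFN. */
            return pfn_to_page(swp_offset_pfn(entry));
    }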

This applies when we are creating PTE entries for device private pages:
because they have their own swap type they already must be handled
separately, so it is a small step to convert them to a device private
PFN now too.

The first part of the series updates callers where device private PFNs
might now be encountered to track this extra state.

The last patch contains the bulk of the work, where we change how we
convert between device private pages and device private PFNs and then use
a new interface for allocating device private pages without the need for
reserving physical address space.

For the purposes of the RFC, changes have been limited to test_hmm.c;
updates to the other drivers will be included in the next revision.

This would include updating existing users of memremap_pages() to use
memremap_device_private_pagemap() instead when allocating device private
pages. This also means they would no longer need to call
request_free_mem_region(). An equivalent of devm_memremap_pages() will
also be necessary.
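
A minimal sketch of the resulting driver flow, following the updated
Documentation/mm/hmm.rst example in the last patch (device_devmem_ops stands
in for the driver's dev_pagemap_ops; error handling abbreviated):

    struct dev_pagemap pagemap = { };

    pagemap.type = MEMORY_DEVICE_PRIVATE;
    pagemap.nr_pages = /* number of device pages */;
    pagemap.ops = &device_devmem_ops;
    pagemap.owner = /* opaque owner pointer */;

    if (memremap_device_private_pagemap(&pagemap))
            /* error */;

    /* pages are reached via device_private_offset_to_page(), not PFNs */

    memunmap_device_private_pagemap(&pagemap);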

Users of the migrate_vma() interface will also need to be updated to be
aware of these device private PFNs.
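
For instance, the lib/test_hmm.c update in this series fills in a destination
entry for migration towards the device roughly like this (condensed; dpage is
the newly allocated device private page):

    *dst = migrate_pfn(device_private_page_to_offset(dpage)) |
           MIGRATE_PFN_DEVICE;
    if (*src & MIGRATE_PFN_WRITE)
            *dst |= MIGRATE_PFN_WRITE;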

By removing the device private pages from the physical address space,
this RFC also opens up the possibility of moving away from tracking
device private memory using struct pages in the future. This is
desirable because on systems with large amounts of memory these device
private struct pages use a significant amount of memory and take a
significant amount of time to initialize.

Testing:
- selftests/mm/hmm-tests on an amd64 VM

[0] https://lore.kernel.org/lkml/CAMj1kXFZ=4hLL1w6iCV5O5uVoVLHAJbc0rr40j24ObenAjXe9w@mail.gmail.com/

Jordan Niethe (6):
  mm/hmm: Add flag to track device private PFNs
  mm/migrate_device: Add migrate PFN flag to track device private PFNs
  mm/page_vma_mapped: Add flags to page_vma_mapped_walk::pfn to track
    device private PFNs
  mm: Add a new swap type for migration entries with device private PFNs
  mm/util: Add flag to track device private PFNs in page snapshots
  mm: Remove device private pages from the physical address space

 Documentation/mm/hmm.rst |   9 +-
 fs/proc/page.c           |   6 +-
 include/linux/hmm.h      |   5 ++
 include/linux/memremap.h |  25 +++++-
 include/linux/migrate.h  |   5 ++
 include/linux/mm.h       |   9 +-
 include/linux/rmap.h     |  33 +++++++-
 include/linux/swap.h     |   8 +-
 include/linux/swapops.h  | 102 +++++++++++++++++++++--
 lib/test_hmm.c           |  66 ++++++++-------
 mm/debug.c               |   9 +-
 mm/hmm.c                 |   2 +-
 mm/memory.c              |   9 +-
 mm/memremap.c            | 174 +++++++++++++++++++++++++++++----------
 mm/migrate.c             |   6 +-
 mm/migrate_device.c      |  44 ++++++----
 mm/mm_init.c             |   8 +-
 mm/mprotect.c            |  21 +++--
 mm/page_vma_mapped.c     |  18 +++-
 mm/pagewalk.c            |   2 +-
 mm/rmap.c                |  68 ++++++++++-----
 mm/util.c                |   8 +-
 mm/vmscan.c              |   2 +-
 23 files changed, 485 insertions(+), 154 deletions(-)


base-commit: e1afacb68573c3cd0a3785c6b0508876cd3423bc
-- 
2.34.1




* [RFC PATCH 1/6] mm/hmm: Add flag to track device private PFNs
  2025-11-28  4:41 [RFC PATCH 0/6] Remove device private pages from physical address space Jordan Niethe
@ 2025-11-28  4:41 ` Jordan Niethe
  2025-11-28 18:36   ` Matthew Brost
  2025-11-28  4:41 ` [RFC PATCH 2/6] mm/migrate_device: Add migrate PFN " Jordan Niethe
                   ` (9 subsequent siblings)
  10 siblings, 1 reply; 26+ messages in thread
From: Jordan Niethe @ 2025-11-28  4:41 UTC (permalink / raw)
  To: linux-mm
  Cc: balbirs, matthew.brost, akpm, linux-kernel, dri-devel, david,
	ziy, apopple, lorenzo.stoakes, lyude, dakr, airlied, simona,
	rcampbell, mpenttil, jgg, willy

A future change will remove device private pages from the physical
address space. This will mean that device private pages no longer have
a normal PFN and must be handled separately.

Prepare for this by adding a HMM_PFN_DEVICE_PRIVATE flag to indicate
that a hmm_pfn contains a PFN for a device private page.
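
For context, the last patch in the series makes hmm_pfn_to_page() honour this
flag (condensed from that patch's include/linux/hmm.h hunk):

    static inline struct page *hmm_pfn_to_page(unsigned long hmm_pfn)
    {
            if (hmm_pfn & HMM_PFN_DEVICE_PRIVATE)
                    return device_private_offset_to_page(hmm_pfn & ~HMM_PFN_FLAGS);

            return pfn_to_page(hmm_pfn & ~HMM_PFN_FLAGS);
    }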

Signed-off-by: Jordan Niethe <jniethe@nvidia.com>
Signed-off-by: Alistair Popple <apopple@nvidia.com>
---
 include/linux/hmm.h | 2 ++
 mm/hmm.c            | 2 +-
 2 files changed, 3 insertions(+), 1 deletion(-)

diff --git a/include/linux/hmm.h b/include/linux/hmm.h
index db75ffc949a7..df571fa75a44 100644
--- a/include/linux/hmm.h
+++ b/include/linux/hmm.h
@@ -23,6 +23,7 @@ struct mmu_interval_notifier;
  * HMM_PFN_WRITE - if the page memory can be written to (requires HMM_PFN_VALID)
  * HMM_PFN_ERROR - accessing the pfn is impossible and the device should
  *                 fail. ie poisoned memory, special pages, no vma, etc
+ * HMM_PFN_DEVICE_PRIVATE - the pfn field contains a DEVICE_PRIVATE pfn.
  * HMM_PFN_P2PDMA - P2P page
  * HMM_PFN_P2PDMA_BUS - Bus mapped P2P transfer
  * HMM_PFN_DMA_MAPPED - Flag preserved on input-to-output transformation
@@ -40,6 +41,7 @@ enum hmm_pfn_flags {
 	HMM_PFN_VALID = 1UL << (BITS_PER_LONG - 1),
 	HMM_PFN_WRITE = 1UL << (BITS_PER_LONG - 2),
 	HMM_PFN_ERROR = 1UL << (BITS_PER_LONG - 3),
+	HMM_PFN_DEVICE_PRIVATE = 1UL << (BITS_PER_LONG - 7),
 	/*
 	 * Sticky flags, carried from input to output,
 	 * don't forget to update HMM_PFN_INOUT_FLAGS
diff --git a/mm/hmm.c b/mm/hmm.c
index 87562914670a..1cff68ade1d4 100644
--- a/mm/hmm.c
+++ b/mm/hmm.c
@@ -262,7 +262,7 @@ static int hmm_vma_handle_pte(struct mm_walk *walk, unsigned long addr,
 		if (is_device_private_entry(entry) &&
 		    page_pgmap(pfn_swap_entry_to_page(entry))->owner ==
 		    range->dev_private_owner) {
-			cpu_flags = HMM_PFN_VALID;
+			cpu_flags = HMM_PFN_VALID | HMM_PFN_DEVICE_PRIVATE;
 			if (is_writable_device_private_entry(entry))
 				cpu_flags |= HMM_PFN_WRITE;
 			new_pfn_flags = swp_offset_pfn(entry) | cpu_flags;
-- 
2.34.1




* [RFC PATCH 2/6] mm/migrate_device: Add migrate PFN flag to track device private PFNs
  2025-11-28  4:41 [RFC PATCH 0/6] Remove device private pages from physical address space Jordan Niethe
  2025-11-28  4:41 ` [RFC PATCH 1/6] mm/hmm: Add flag to track device private PFNs Jordan Niethe
@ 2025-11-28  4:41 ` Jordan Niethe
  2025-11-28  4:41 ` [RFC PATCH 3/6] mm/page_vma_mapped: Add flags to page_vma_mapped_walk::pfn " Jordan Niethe
                   ` (8 subsequent siblings)
  10 siblings, 0 replies; 26+ messages in thread
From: Jordan Niethe @ 2025-11-28  4:41 UTC (permalink / raw)
  To: linux-mm
  Cc: balbirs, matthew.brost, akpm, linux-kernel, dri-devel, david,
	ziy, apopple, lorenzo.stoakes, lyude, dakr, airlied, simona,
	rcampbell, mpenttil, jgg, willy

A future change will remove device private pages from the physical
address space. This will mean that device private pages no longer have
a normal PFN and must be handled separately.

Prepare for this by adding a MIGRATE_PFN_DEVICE flag to indicate
that a migrate pfn contains a PFN for a device private page.
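
For context, the last patch in the series teaches migrate_pfn_to_page() to
honour this flag (condensed from that patch's include/linux/migrate.h hunk):

    static inline struct page *migrate_pfn_to_page(unsigned long mpfn)
    {
            if (!(mpfn & MIGRATE_PFN_VALID))
                    return NULL;

            if (mpfn & MIGRATE_PFN_DEVICE)
                    return device_private_offset_to_page(mpfn >> MIGRATE_PFN_SHIFT);

            return pfn_to_page(mpfn >> MIGRATE_PFN_SHIFT);
    }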

Signed-off-by: Jordan Niethe <jniethe@nvidia.com>
Signed-off-by: Alistair Popple <apopple@nvidia.com>

---

Note: Existing drivers must also be updated in the next revision.
---
 include/linux/migrate.h | 1 +
 lib/test_hmm.c          | 3 ++-
 mm/migrate_device.c     | 5 +++--
 3 files changed, 6 insertions(+), 3 deletions(-)

diff --git a/include/linux/migrate.h b/include/linux/migrate.h
index 1f0ac122c3bf..d8f520dca342 100644
--- a/include/linux/migrate.h
+++ b/include/linux/migrate.h
@@ -125,6 +125,7 @@ static inline int migrate_misplaced_folio(struct folio *folio, int node)
 #define MIGRATE_PFN_VALID	(1UL << 0)
 #define MIGRATE_PFN_MIGRATE	(1UL << 1)
 #define MIGRATE_PFN_WRITE	(1UL << 3)
+#define MIGRATE_PFN_DEVICE	(1UL << 4)
 #define MIGRATE_PFN_SHIFT	6
 
 static inline struct page *migrate_pfn_to_page(unsigned long mpfn)
diff --git a/lib/test_hmm.c b/lib/test_hmm.c
index 83e3d8208a54..0035e1b7beec 100644
--- a/lib/test_hmm.c
+++ b/lib/test_hmm.c
@@ -684,7 +684,8 @@ static void dmirror_migrate_alloc_and_copy(struct migrate_vma *args,
 
 		pr_debug("migrating from sys to dev pfn src: 0x%lx pfn dst: 0x%lx\n",
 			 page_to_pfn(spage), page_to_pfn(dpage));
-		*dst = migrate_pfn(page_to_pfn(dpage));
+		*dst = migrate_pfn(page_to_pfn(dpage)) |
+				   MIGRATE_PFN_DEVICE;
 		if ((*src & MIGRATE_PFN_WRITE) ||
 		    (!spage && args->vma->vm_flags & VM_WRITE))
 			*dst |= MIGRATE_PFN_WRITE;
diff --git a/mm/migrate_device.c b/mm/migrate_device.c
index abd9f6850db6..82f09b24d913 100644
--- a/mm/migrate_device.c
+++ b/mm/migrate_device.c
@@ -148,7 +148,8 @@ static int migrate_vma_collect_pmd(pmd_t *pmdp,
 				goto next;
 
 			mpfn = migrate_pfn(page_to_pfn(page)) |
-					MIGRATE_PFN_MIGRATE;
+					MIGRATE_PFN_MIGRATE |
+					MIGRATE_PFN_DEVICE;
 			if (is_writable_device_private_entry(entry))
 				mpfn |= MIGRATE_PFN_WRITE;
 		} else {
@@ -918,7 +919,7 @@ static unsigned long migrate_device_pfn_lock(unsigned long pfn)
 		return 0;
 	}
 
-	return migrate_pfn(pfn) | MIGRATE_PFN_MIGRATE;
+	return migrate_pfn(pfn) | MIGRATE_PFN_MIGRATE | MIGRATE_PFN_DEVICE;
 }
 
 /**
-- 
2.34.1




* [RFC PATCH 3/6] mm/page_vma_mapped: Add flags to page_vma_mapped_walk::pfn to track device private PFNs
  2025-11-28  4:41 [RFC PATCH 0/6] Remove device private pages from physical address space Jordan Niethe
  2025-11-28  4:41 ` [RFC PATCH 1/6] mm/hmm: Add flag to track device private PFNs Jordan Niethe
  2025-11-28  4:41 ` [RFC PATCH 2/6] mm/migrate_device: Add migrate PFN " Jordan Niethe
@ 2025-11-28  4:41 ` Jordan Niethe
  2025-11-28  4:41 ` [RFC PATCH 4/6] mm: Add a new swap type for migration entries with " Jordan Niethe
                   ` (7 subsequent siblings)
  10 siblings, 0 replies; 26+ messages in thread
From: Jordan Niethe @ 2025-11-28  4:41 UTC (permalink / raw)
  To: linux-mm
  Cc: balbirs, matthew.brost, akpm, linux-kernel, dri-devel, david,
	ziy, apopple, lorenzo.stoakes, lyude, dakr, airlied, simona,
	rcampbell, mpenttil, jgg, willy

A future change will remove device private pages from the physical
address space. This will mean that device private pages no longer have
a normal PFN and must be handled separately.

Prepare for this by modifying page_vma_mapped_walk::pfn to contain flags
as well as a PFN. Introduce a PVMW_PFN_DEVICE_PRIVATE flag to indicate
that a page_vma_mapped_walk::pfn contains a PFN for a device private
page.
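
For context, the last patch in the series extends the conversion helpers so
that a device private folio is encoded with this flag (condensed from that
patch's include/linux/rmap.h hunk):

    static inline unsigned long folio_page_vma_walk_pfn(const struct folio *folio)
    {
            if (folio_is_device_private(folio))
                    return page_vma_walk_pfn(device_private_folio_to_offset(folio)) |
                           PVMW_PFN_DEVICE_PRIVATE;

            return page_vma_walk_pfn(folio_pfn(folio));
    }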

Signed-off-by: Jordan Niethe <jniethe@nvidia.com>
Signed-off-by: Alistair Popple <apopple@nvidia.com>
---
 include/linux/rmap.h | 26 +++++++++++++++++++++++++-
 mm/page_vma_mapped.c |  6 +++---
 mm/rmap.c            |  4 ++--
 mm/vmscan.c          |  2 +-
 4 files changed, 31 insertions(+), 7 deletions(-)

diff --git a/include/linux/rmap.h b/include/linux/rmap.h
index daa92a58585d..79e5c733d9c8 100644
--- a/include/linux/rmap.h
+++ b/include/linux/rmap.h
@@ -939,9 +939,33 @@ struct page_vma_mapped_walk {
 	unsigned int flags;
 };
 
+/* pfn is a device private offset */
+#define PVMW_PFN_DEVICE_PRIVATE	(1UL << 0)
+#define PVMW_PFN_SHIFT		1
+
+static inline unsigned long page_vma_walk_pfn(unsigned long pfn)
+{
+	return (pfn << PVMW_PFN_SHIFT);
+}
+
+static inline unsigned long folio_page_vma_walk_pfn(const struct folio *folio)
+{
+	return page_vma_walk_pfn(folio_pfn(folio));
+}
+
+static inline struct page *page_vma_walk_pfn_to_page(unsigned long pvmw_pfn)
+{
+	return pfn_to_page(pvmw_pfn >> PVMW_PFN_SHIFT);
+}
+
+static inline struct folio *page_vma_walk_pfn_to_folio(unsigned long pvmw_pfn)
+{
+	return page_folio(page_vma_walk_pfn_to_page(pvmw_pfn));
+}
+
 #define DEFINE_FOLIO_VMA_WALK(name, _folio, _vma, _address, _flags)	\
 	struct page_vma_mapped_walk name = {				\
-		.pfn = folio_pfn(_folio),				\
+		.pfn = folio_page_vma_walk_pfn(_folio),			\
 		.nr_pages = folio_nr_pages(_folio),			\
 		.pgoff = folio_pgoff(_folio),				\
 		.vma = _vma,						\
diff --git a/mm/page_vma_mapped.c b/mm/page_vma_mapped.c
index c498a91b6706..9146bd084435 100644
--- a/mm/page_vma_mapped.c
+++ b/mm/page_vma_mapped.c
@@ -133,9 +133,9 @@ static bool check_pte(struct page_vma_mapped_walk *pvmw, unsigned long pte_nr)
 		pfn = pte_pfn(ptent);
 	}
 
-	if ((pfn + pte_nr - 1) < pvmw->pfn)
+	if ((pfn + pte_nr - 1) < (pvmw->pfn >> PVMW_PFN_SHIFT))
 		return false;
-	if (pfn > (pvmw->pfn + pvmw->nr_pages - 1))
+	if (pfn > ((pvmw->pfn >> PVMW_PFN_SHIFT) + pvmw->nr_pages - 1))
 		return false;
 	return true;
 }
@@ -346,7 +346,7 @@ unsigned long page_mapped_in_vma(const struct page *page,
 {
 	const struct folio *folio = page_folio(page);
 	struct page_vma_mapped_walk pvmw = {
-		.pfn = page_to_pfn(page),
+		.pfn = folio_page_vma_walk_pfn(folio),
 		.nr_pages = 1,
 		.vma = vma,
 		.flags = PVMW_SYNC,
diff --git a/mm/rmap.c b/mm/rmap.c
index ac4f783d6ec2..e94500318f92 100644
--- a/mm/rmap.c
+++ b/mm/rmap.c
@@ -1129,7 +1129,7 @@ static bool mapping_wrprotect_range_one(struct folio *folio,
 {
 	struct wrprotect_file_state *state = (struct wrprotect_file_state *)arg;
 	struct page_vma_mapped_walk pvmw = {
-		.pfn		= state->pfn,
+		.pfn		= page_vma_walk_pfn(state->pfn),
 		.nr_pages	= state->nr_pages,
 		.pgoff		= state->pgoff,
 		.vma		= vma,
@@ -1207,7 +1207,7 @@ int pfn_mkclean_range(unsigned long pfn, unsigned long nr_pages, pgoff_t pgoff,
 		      struct vm_area_struct *vma)
 {
 	struct page_vma_mapped_walk pvmw = {
-		.pfn		= pfn,
+		.pfn		= page_vma_walk_pfn(pfn),
 		.nr_pages	= nr_pages,
 		.pgoff		= pgoff,
 		.vma		= vma,
diff --git a/mm/vmscan.c b/mm/vmscan.c
index b2fc8b626d3d..e07ad830e30a 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -4238,7 +4238,7 @@ bool lru_gen_look_around(struct page_vma_mapped_walk *pvmw)
 	pte_t *pte = pvmw->pte;
 	unsigned long addr = pvmw->address;
 	struct vm_area_struct *vma = pvmw->vma;
-	struct folio *folio = pfn_folio(pvmw->pfn);
+	struct folio *folio = page_vma_walk_pfn_to_folio(pvmw->pfn);
 	struct mem_cgroup *memcg = folio_memcg(folio);
 	struct pglist_data *pgdat = folio_pgdat(folio);
 	struct lruvec *lruvec = mem_cgroup_lruvec(memcg, pgdat);
-- 
2.34.1




* [RFC PATCH 4/6] mm: Add a new swap type for migration entries with device private PFNs
  2025-11-28  4:41 [RFC PATCH 0/6] Remove device private pages from physical address space Jordan Niethe
                   ` (2 preceding siblings ...)
  2025-11-28  4:41 ` [RFC PATCH 3/6] mm/page_vma_mapped: Add flags to page_vma_mapped_walk::pfn " Jordan Niethe
@ 2025-11-28  4:41 ` Jordan Niethe
  2025-12-01  2:43   ` Chih-En Lin
  2025-11-28  4:41 ` [RFC PATCH 5/6] mm/util: Add flag to track device private PFNs in page snapshots Jordan Niethe
                   ` (6 subsequent siblings)
  10 siblings, 1 reply; 26+ messages in thread
From: Jordan Niethe @ 2025-11-28  4:41 UTC (permalink / raw)
  To: linux-mm
  Cc: balbirs, matthew.brost, akpm, linux-kernel, dri-devel, david,
	ziy, apopple, lorenzo.stoakes, lyude, dakr, airlied, simona,
	rcampbell, mpenttil, jgg, willy

A future change will remove device private pages from the physical
address space. This will mean that device private pages no longer have
a normal PFN and must be handled separately.

When migrating a device private page, a migration entry is created for
that page; this includes the PFN for that page. Once device private
PFNs exist in a different address space from regular PFNs, we need to be
able to determine which kind of PFN is in the entry so we can associate
it with the correct page.

Introduce new swap types:

  - SWP_MIGRATION_DEVICE_READ
  - SWP_MIGRATION_DEVICE_WRITE
  - SWP_MIGRATION_DEVICE_READ_EXCLUSIVE

These correspond to

  - SWP_MIGRATION_READ
  - SWP_MIGRATION_WRITE
  - SWP_MIGRATION_READ_EXCLUSIVE

except the swap entry contains a device private PFN.

The existing helpers such as is_writable_migration_entry() will still
return true for a SWP_MIGRATION_DEVICE_WRITE entry.

Introduce new helpers such as
is_writable_device_migration_private_entry() to disambiguate between a
SWP_MIGRATION_WRITE and a SWP_MIGRATION_DEVICE_WRITE entry.
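
For example, with this change try_to_migrate_one() selects the writable entry
type based on whether the page is device private (condensed from the
mm/rmap.c hunk below):

    if (is_device_private_page(subpage))
            entry = make_writable_migration_device_private_entry(
                                    page_to_pfn(subpage));
    else
            entry = make_writable_migration_entry(
                                    page_to_pfn(subpage));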

Signed-off-by: Jordan Niethe <jniethe@nvidia.com>
Signed-off-by: Alistair Popple <apopple@nvidia.com>
---
 include/linux/swap.h    |  8 +++-
 include/linux/swapops.h | 87 ++++++++++++++++++++++++++++++++++++++---
 mm/memory.c             |  9 ++++-
 mm/migrate.c            |  2 +-
 mm/migrate_device.c     | 31 ++++++++++-----
 mm/mprotect.c           | 21 +++++++---
 mm/page_vma_mapped.c    |  2 +-
 mm/pagewalk.c           |  3 +-
 mm/rmap.c               | 32 ++++++++++-----
 9 files changed, 161 insertions(+), 34 deletions(-)

diff --git a/include/linux/swap.h b/include/linux/swap.h
index e818fbade1e2..87f14d673979 100644
--- a/include/linux/swap.h
+++ b/include/linux/swap.h
@@ -74,12 +74,18 @@ static inline int current_is_kswapd(void)
  *
  * When a page is mapped by the device for exclusive access we set the CPU page
  * table entries to a special SWP_DEVICE_EXCLUSIVE entry.
+ *
+ * Because device private pages do not use regular PFNs, special migration
+ * entries are also needed.
  */
 #ifdef CONFIG_DEVICE_PRIVATE
-#define SWP_DEVICE_NUM 3
+#define SWP_DEVICE_NUM 6
 #define SWP_DEVICE_WRITE (MAX_SWAPFILES+SWP_HWPOISON_NUM+SWP_MIGRATION_NUM)
 #define SWP_DEVICE_READ (MAX_SWAPFILES+SWP_HWPOISON_NUM+SWP_MIGRATION_NUM+1)
 #define SWP_DEVICE_EXCLUSIVE (MAX_SWAPFILES+SWP_HWPOISON_NUM+SWP_MIGRATION_NUM+2)
+#define SWP_MIGRATION_DEVICE_READ (MAX_SWAPFILES+SWP_HWPOISON_NUM+SWP_MIGRATION_NUM+3)
+#define SWP_MIGRATION_DEVICE_READ_EXCLUSIVE (MAX_SWAPFILES+SWP_HWPOISON_NUM+SWP_MIGRATION_NUM+4)
+#define SWP_MIGRATION_DEVICE_WRITE (MAX_SWAPFILES+SWP_HWPOISON_NUM+SWP_MIGRATION_NUM+5)
 #else
 #define SWP_DEVICE_NUM 0
 #endif
diff --git a/include/linux/swapops.h b/include/linux/swapops.h
index 64ea151a7ae3..7aa3f00e304a 100644
--- a/include/linux/swapops.h
+++ b/include/linux/swapops.h
@@ -196,6 +196,43 @@ static inline bool is_device_exclusive_entry(swp_entry_t entry)
 	return swp_type(entry) == SWP_DEVICE_EXCLUSIVE;
 }
 
+static inline swp_entry_t make_readable_migration_device_private_entry(pgoff_t offset)
+{
+	return swp_entry(SWP_MIGRATION_DEVICE_READ, offset);
+}
+
+static inline swp_entry_t make_writable_migration_device_private_entry(pgoff_t offset)
+{
+	return swp_entry(SWP_MIGRATION_DEVICE_WRITE, offset);
+}
+
+static inline bool is_device_private_migration_entry(swp_entry_t entry)
+{
+	return unlikely(swp_type(entry) == SWP_MIGRATION_DEVICE_READ ||
+			swp_type(entry) == SWP_MIGRATION_DEVICE_READ_EXCLUSIVE ||
+			swp_type(entry) == SWP_MIGRATION_DEVICE_WRITE);
+}
+
+static inline bool is_readable_device_migration_private_entry(swp_entry_t entry)
+{
+	return unlikely(swp_type(entry) == SWP_MIGRATION_DEVICE_READ);
+}
+
+static inline bool is_writable_device_migration_private_entry(swp_entry_t entry)
+{
+	return unlikely(swp_type(entry) == SWP_MIGRATION_DEVICE_WRITE);
+}
+
+static inline swp_entry_t make_device_migration_readable_exclusive_migration_entry(pgoff_t offset)
+{
+	return swp_entry(SWP_MIGRATION_DEVICE_READ_EXCLUSIVE, offset);
+}
+
+static inline bool is_device_migration_readable_exclusive_entry(swp_entry_t entry)
+{
+	return swp_type(entry) == SWP_MIGRATION_DEVICE_READ_EXCLUSIVE;
+}
+
 #else /* CONFIG_DEVICE_PRIVATE */
 static inline swp_entry_t make_readable_device_private_entry(pgoff_t offset)
 {
@@ -217,6 +254,11 @@ static inline bool is_writable_device_private_entry(swp_entry_t entry)
 	return false;
 }
 
+static inline bool is_readable_device_migration_private_entry(swp_entry_t entry)
+{
+	return false;
+}
+
 static inline swp_entry_t make_device_exclusive_entry(pgoff_t offset)
 {
 	return swp_entry(0, 0);
@@ -227,6 +269,36 @@ static inline bool is_device_exclusive_entry(swp_entry_t entry)
 	return false;
 }
 
+static inline swp_entry_t make_readable_migration_device_private_entry(pgoff_t offset)
+{
+	return swp_entry(0, 0);
+}
+
+static inline swp_entry_t make_writable_migration_device_private_entry(pgoff_t offset)
+{
+	return swp_entry(0, 0);
+}
+
+static inline bool is_device_private_migration_entry(swp_entry_t entry)
+{
+	return false;
+}
+
+static inline bool is_writable_device_migration_private_entry(swp_entry_t entry)
+{
+	return false;
+}
+
+static inline swp_entry_t make_device_migration_readable_exclusive_migration_entry(pgoff_t offset)
+{
+	return swp_entry(0, 0);
+}
+
+static inline bool is_device_migration_readable_exclusive_entry(swp_entry_t entry)
+{
+	return false;
+}
+
 #endif /* CONFIG_DEVICE_PRIVATE */
 
 #ifdef CONFIG_MIGRATION
@@ -234,22 +306,26 @@ static inline int is_migration_entry(swp_entry_t entry)
 {
 	return unlikely(swp_type(entry) == SWP_MIGRATION_READ ||
 			swp_type(entry) == SWP_MIGRATION_READ_EXCLUSIVE ||
-			swp_type(entry) == SWP_MIGRATION_WRITE);
+			swp_type(entry) == SWP_MIGRATION_WRITE ||
+			is_device_private_migration_entry(entry));
 }
 
 static inline int is_writable_migration_entry(swp_entry_t entry)
 {
-	return unlikely(swp_type(entry) == SWP_MIGRATION_WRITE);
+	return unlikely(swp_type(entry) == SWP_MIGRATION_WRITE ||
+			is_writable_device_migration_private_entry(entry));
 }
 
 static inline int is_readable_migration_entry(swp_entry_t entry)
 {
-	return unlikely(swp_type(entry) == SWP_MIGRATION_READ);
+	return unlikely(swp_type(entry) == SWP_MIGRATION_READ ||
+			is_readable_device_migration_private_entry(entry));
 }
 
 static inline int is_readable_exclusive_migration_entry(swp_entry_t entry)
 {
-	return unlikely(swp_type(entry) == SWP_MIGRATION_READ_EXCLUSIVE);
+	return unlikely(swp_type(entry) == SWP_MIGRATION_READ_EXCLUSIVE ||
+			is_device_migration_readable_exclusive_entry(entry));
 }
 
 static inline swp_entry_t make_readable_migration_entry(pgoff_t offset)
@@ -525,7 +601,8 @@ static inline bool is_pfn_swap_entry(swp_entry_t entry)
 	BUILD_BUG_ON(SWP_TYPE_SHIFT < SWP_PFN_BITS);
 
 	return is_migration_entry(entry) || is_device_private_entry(entry) ||
-	       is_device_exclusive_entry(entry) || is_hwpoison_entry(entry);
+	       is_device_exclusive_entry(entry) || is_hwpoison_entry(entry) ||
+	       is_device_private_migration_entry(entry);
 }
 
 struct page_vma_mapped_walk;
diff --git a/mm/memory.c b/mm/memory.c
index b59ae7ce42eb..f1ed361434ff 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -962,8 +962,13 @@ copy_nonpresent_pte(struct mm_struct *dst_mm, struct mm_struct *src_mm,
 			 * to be set to read. A previously exclusive entry is
 			 * now shared.
 			 */
-			entry = make_readable_migration_entry(
-							swp_offset(entry));
+			if (is_device_private_migration_entry(entry))
+				entry = make_readable_migration_device_private_entry(
+								swp_offset(entry));
+			else
+				entry = make_readable_migration_entry(
+								swp_offset(entry));
+
 			pte = swp_entry_to_pte(entry);
 			if (pte_swp_soft_dirty(orig_pte))
 				pte = pte_swp_mksoft_dirty(pte);
diff --git a/mm/migrate.c b/mm/migrate.c
index c0e9f15be2a2..3c561d61afba 100644
--- a/mm/migrate.c
+++ b/mm/migrate.c
@@ -495,7 +495,7 @@ void migration_entry_wait(struct mm_struct *mm, pmd_t *pmd,
 		goto out;
 
 	entry = pte_to_swp_entry(pte);
-	if (!is_migration_entry(entry))
+	if (!(is_migration_entry(entry)))
 		goto out;
 
 	migration_entry_wait_on_locked(entry, ptl);
diff --git a/mm/migrate_device.c b/mm/migrate_device.c
index 82f09b24d913..458b5114bb2b 100644
--- a/mm/migrate_device.c
+++ b/mm/migrate_device.c
@@ -235,15 +235,28 @@ static int migrate_vma_collect_pmd(pmd_t *pmdp,
 				folio_mark_dirty(folio);
 
 			/* Setup special migration page table entry */
-			if (mpfn & MIGRATE_PFN_WRITE)
-				entry = make_writable_migration_entry(
-							page_to_pfn(page));
-			else if (anon_exclusive)
-				entry = make_readable_exclusive_migration_entry(
-							page_to_pfn(page));
-			else
-				entry = make_readable_migration_entry(
-							page_to_pfn(page));
+			if (mpfn & MIGRATE_PFN_WRITE) {
+				if (is_device_private_page(page))
+					entry = make_writable_migration_device_private_entry(
+								page_to_pfn(page));
+				else
+					entry = make_writable_migration_entry(
+								page_to_pfn(page));
+			} else if (anon_exclusive) {
+				if (is_device_private_page(page))
+					entry = make_device_migration_readable_exclusive_migration_entry(
+								page_to_pfn(page));
+				else
+					entry = make_readable_exclusive_migration_entry(
+								page_to_pfn(page));
+			} else {
+				if (is_device_private_page(page))
+					entry = make_readable_migration_device_private_entry(
+								page_to_pfn(page));
+				else
+					entry = make_readable_migration_entry(
+								page_to_pfn(page));
+			}
 			if (pte_present(pte)) {
 				if (pte_young(pte))
 					entry = make_migration_entry_young(entry);
diff --git a/mm/mprotect.c b/mm/mprotect.c
index 113b48985834..7d79a0f53bf5 100644
--- a/mm/mprotect.c
+++ b/mm/mprotect.c
@@ -365,11 +365,22 @@ static long change_pte_range(struct mmu_gather *tlb,
 				 * A protection check is difficult so
 				 * just be safe and disable write
 				 */
-				if (folio_test_anon(folio))
-					entry = make_readable_exclusive_migration_entry(
-							     swp_offset(entry));
-				else
-					entry = make_readable_migration_entry(swp_offset(entry));
+				if (!is_writable_device_migration_private_entry(entry)) {
+					if (folio_test_anon(folio))
+						entry = make_readable_exclusive_migration_entry(
+								swp_offset(entry));
+					else
+						entry = make_readable_migration_entry(
+								swp_offset(entry));
+				} else {
+					if (folio_test_anon(folio))
+						entry = make_device_migration_readable_exclusive_migration_entry(
+								swp_offset(entry));
+					else
+						entry = make_readable_migration_device_private_entry(
+								swp_offset(entry));
+				}
+
 				newpte = swp_entry_to_pte(entry);
 				if (pte_swp_soft_dirty(oldpte))
 					newpte = pte_swp_mksoft_dirty(newpte);
diff --git a/mm/page_vma_mapped.c b/mm/page_vma_mapped.c
index 9146bd084435..e9fe747d3df3 100644
--- a/mm/page_vma_mapped.c
+++ b/mm/page_vma_mapped.c
@@ -112,7 +112,7 @@ static bool check_pte(struct page_vma_mapped_walk *pvmw, unsigned long pte_nr)
 			return false;
 		entry = pte_to_swp_entry(ptent);
 
-		if (!is_migration_entry(entry))
+		if (!(is_migration_entry(entry)))
 			return false;
 
 		pfn = swp_offset_pfn(entry);
diff --git a/mm/pagewalk.c b/mm/pagewalk.c
index 9f91cf85a5be..f5c77dda3359 100644
--- a/mm/pagewalk.c
+++ b/mm/pagewalk.c
@@ -1003,7 +1003,8 @@ struct folio *folio_walk_start(struct folio_walk *fw,
 		swp_entry_t entry = pte_to_swp_entry(pte);
 
 		if ((flags & FW_MIGRATION) &&
-		    is_migration_entry(entry)) {
+		    (is_migration_entry(entry) ||
+		     is_device_private_migration_entry(entry))) {
 			page = pfn_swap_entry_to_page(entry);
 			expose_page = false;
 			goto found;
diff --git a/mm/rmap.c b/mm/rmap.c
index e94500318f92..9642a79cbdb4 100644
--- a/mm/rmap.c
+++ b/mm/rmap.c
@@ -2535,15 +2535,29 @@ static bool try_to_migrate_one(struct folio *folio, struct vm_area_struct *vma,
 			 * pte. do_swap_page() will wait until the migration
 			 * pte is removed and then restart fault handling.
 			 */
-			if (writable)
-				entry = make_writable_migration_entry(
-							page_to_pfn(subpage));
-			else if (anon_exclusive)
-				entry = make_readable_exclusive_migration_entry(
-							page_to_pfn(subpage));
-			else
-				entry = make_readable_migration_entry(
-							page_to_pfn(subpage));
+			if (writable) {
+				if (is_device_private_page(subpage))
+					entry = make_writable_migration_device_private_entry(
+								page_to_pfn(subpage));
+				else
+					entry = make_writable_migration_entry(
+								page_to_pfn(subpage));
+			} else if (anon_exclusive) {
+				if (is_device_private_page(subpage))
+					entry = make_device_migration_readable_exclusive_migration_entry(
+								page_to_pfn(subpage));
+				else
+					entry = make_readable_exclusive_migration_entry(
+								page_to_pfn(subpage));
+			} else {
+				if (is_device_private_page(subpage))
+					entry = make_readable_migration_device_private_entry(
+								page_to_pfn(subpage));
+				else
+					entry = make_readable_migration_entry(
+								page_to_pfn(subpage));
+			}
+
 			if (likely(pte_present(pteval))) {
 				if (pte_young(pteval))
 					entry = make_migration_entry_young(entry);
-- 
2.34.1




* [RFC PATCH 5/6] mm/util: Add flag to track device private PFNs in page snapshots
  2025-11-28  4:41 [RFC PATCH 0/6] Remove device private pages from physical address space Jordan Niethe
                   ` (3 preceding siblings ...)
  2025-11-28  4:41 ` [RFC PATCH 4/6] mm: Add a new swap type for migration entries with " Jordan Niethe
@ 2025-11-28  4:41 ` Jordan Niethe
  2025-11-28  4:41 ` [RFC PATCH 6/6] mm: Remove device private pages from the physical address space Jordan Niethe
                   ` (5 subsequent siblings)
  10 siblings, 0 replies; 26+ messages in thread
From: Jordan Niethe @ 2025-11-28  4:41 UTC (permalink / raw)
  To: linux-mm
  Cc: balbirs, matthew.brost, akpm, linux-kernel, dri-devel, david,
	ziy, apopple, lorenzo.stoakes, lyude, dakr, airlied, simona,
	rcampbell, mpenttil, jgg, willy

A future change will remove device private pages from the physical
address space. This will mean that device private pages no longer have
a normal PFN and must be handled separately.

Add a new flag, PAGE_SNAPSHOT_DEVICE_PRIVATE, to track when the pfn of a
page snapshot refers to a device private page.

Signed-off-by: Jordan Niethe <jniethe@nvidia.com>
Signed-off-by: Alistair Popple <apopple@nvidia.com>
---
 fs/proc/page.c     | 6 ++++--
 include/linux/mm.h | 7 ++++---
 mm/util.c          | 3 +++
 3 files changed, 11 insertions(+), 5 deletions(-)

diff --git a/fs/proc/page.c b/fs/proc/page.c
index fc64f23e05e5..c3e88a199c19 100644
--- a/fs/proc/page.c
+++ b/fs/proc/page.c
@@ -192,10 +192,12 @@ u64 stable_page_flags(const struct page *page)
 	         folio_test_large_rmappable(folio)) {
 		/* Note: we indicate any THPs here, not just PMD-sized ones */
 		u |= 1 << KPF_THP;
-	} else if (is_huge_zero_pfn(ps.pfn)) {
+	} else if (!(ps.flags & PAGE_SNAPSHOT_DEVICE_PRIVATE) &&
+		   is_huge_zero_pfn(ps.pfn)) {
 		u |= 1 << KPF_ZERO_PAGE;
 		u |= 1 << KPF_THP;
-	} else if (is_zero_pfn(ps.pfn)) {
+	} else if (!(ps.flags & PAGE_SNAPSHOT_DEVICE_PRIVATE)
+		   && is_zero_pfn(ps.pfn)) {
 		u |= 1 << KPF_ZERO_PAGE;
 	}
 
diff --git a/include/linux/mm.h b/include/linux/mm.h
index 7c79b3369b82..6b8c299a6687 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -4317,9 +4317,10 @@ static inline bool page_pool_page_is_pp(const struct page *page)
 }
 #endif
 
-#define PAGE_SNAPSHOT_FAITHFUL (1 << 0)
-#define PAGE_SNAPSHOT_PG_BUDDY (1 << 1)
-#define PAGE_SNAPSHOT_PG_IDLE  (1 << 2)
+#define PAGE_SNAPSHOT_FAITHFUL		(1 << 0)
+#define PAGE_SNAPSHOT_PG_BUDDY		(1 << 1)
+#define PAGE_SNAPSHOT_PG_IDLE		(1 << 2)
+#define PAGE_SNAPSHOT_DEVICE_PRIVATE	(1 << 3)
 
 struct page_snapshot {
 	struct folio folio_snapshot;
diff --git a/mm/util.c b/mm/util.c
index 8989d5767528..2472b7381b11 100644
--- a/mm/util.c
+++ b/mm/util.c
@@ -1215,6 +1215,9 @@ static void set_ps_flags(struct page_snapshot *ps, const struct folio *folio,
 
 	if (folio_test_idle(folio))
 		ps->flags |= PAGE_SNAPSHOT_PG_IDLE;
+
+	if (is_device_private_page(page))
+		ps->flags |= PAGE_SNAPSHOT_DEVICE_PRIVATE;
 }
 
 /**
-- 
2.34.1




* [RFC PATCH 6/6] mm: Remove device private pages from the physical address space
  2025-11-28  4:41 [RFC PATCH 0/6] Remove device private pages from physical address space Jordan Niethe
                   ` (4 preceding siblings ...)
  2025-11-28  4:41 ` [RFC PATCH 5/6] mm/util: Add flag to track device private PFNs in page snapshots Jordan Niethe
@ 2025-11-28  4:41 ` Jordan Niethe
  2025-11-28 17:51   ` Jason Gunthorpe
  2025-11-28  7:40 ` [RFC PATCH 0/6] Remove device private pages from " David Hildenbrand (Red Hat)
                   ` (4 subsequent siblings)
  10 siblings, 1 reply; 26+ messages in thread
From: Jordan Niethe @ 2025-11-28  4:41 UTC (permalink / raw)
  To: linux-mm
  Cc: balbirs, matthew.brost, akpm, linux-kernel, dri-devel, david,
	ziy, apopple, lorenzo.stoakes, lyude, dakr, airlied, simona,
	rcampbell, mpenttil, jgg, willy

Currently, when creating device private struct pages, the first step is
to use request_free_mem_region() to get a range of physical address
space large enough to represent the device's memory. This allocated
physical address range is then remapped as device private memory using
memremap_pages().

Needing allocation of physical address space has some problems:

  1) There may be insufficient physical address space to represent the
     device memory. KASLR reducing the physical address space and VM
     configurations with limited physical address space increase the
     likelihood of hitting this, especially as device memory increases. This
     has been observed to prevent device private memory from being initialized.

  2) Attempting to add the device private pages to the linear map at
     addresses beyond the actual physical memory causes issues on
     architectures like aarch64, meaning the feature does not work there.

Instead of using the physical address space, introduce a device private
address space and allocate device regions from there to represent the
device private pages.

Introduce a new interface memremap_device_private_pagemap() that
allocates a requested amount of device private address space and creates
the necessary device private pages.

To support this new interface, struct dev_pagemap needs some changes:

  - Add a new dev_pagemap::nr_pages field as an input parameter.
  - Add a new dev_pagemap::pages array to store the device
    private pages.

When using memremap_device_private_pagemap(), rather than passing in
dev_pagemap::nr_range ranges of physical address space in
dev_pagemap::ranges[] to be remapped, dev_pagemap::nr_range will always be 1,
and the device private range that is reserved is returned in
dev_pagemap::range.

Forbid calling memremap_pages() with dev_pagemap::type ==
MEMORY_DEVICE_PRIVATE.

Represent this device private address space using a new
device_private_pgmap_tree maple tree. This tree maps a given device
private address to a struct dev_pagemap, where a specific device private
page may then be looked up in that dev_pagemap::pages array.

Device private address space can be reclaimed and the associated device
private pages freed using the corresponding new
memunmap_device_private_pagemap() interface.

Because the device private pages now live outside the physical address
space, they no longer have a normal PFN. This means that page_to_pfn(),
et al. are no longer meaningful.

Introduce helpers:

  - device_private_page_to_offset()
  - device_private_folio_to_offset()

to take a given device private page / folio and return its offset within
the device private address space (this is essentially a PFN within the
device private address space).
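
Internally, both directions reduce to simple arithmetic against the owning
dev_pagemap (a condensed sketch of the mm/memremap.c helpers below; bounds
and error checks omitted):

    /* device private offset -> struct page, via device_private_pgmap_tree */
    pgmap = mtree_load(&device_private_pgmap_tree, offset << PAGE_SHIFT);
    page = &pgmap->pages[offset - (pgmap->range.start >> PAGE_SHIFT)];

    /* struct page -> device private offset */
    offset = (pgmap->range.start >> PAGE_SHIFT) + (page - pgmap->pages);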

Update the places where we previously converted a device private page to
a PFN to use these new helpers. When we encounter a device private PFN,
instead of looking up its page in the page map, use
device_private_offset_to_page().

Update lib/test_hmm.c to use the new memremap_device_private_pagemap()
interface.

Signed-off-by: Jordan Niethe <jniethe@nvidia.com>
Signed-off-by: Alistair Popple <apopple@nvidia.com>

---

Note: The existing users of memremap_pages() will be updated in the next
revision.
---
 Documentation/mm/hmm.rst |   9 +-
 include/linux/hmm.h      |   3 +
 include/linux/memremap.h |  25 +++++-
 include/linux/migrate.h  |   4 +
 include/linux/mm.h       |   2 +
 include/linux/rmap.h     |   7 ++
 include/linux/swapops.h  |  15 +++-
 lib/test_hmm.c           |  65 ++++++++-------
 mm/debug.c               |   9 +-
 mm/memremap.c            | 174 +++++++++++++++++++++++++++++----------
 mm/migrate.c             |   4 +-
 mm/migrate_device.c      |  14 ++--
 mm/mm_init.c             |   8 +-
 mm/page_vma_mapped.c     |  10 +++
 mm/pagewalk.c            |   3 +-
 mm/rmap.c                |  38 ++++++---
 mm/util.c                |   5 +-
 17 files changed, 282 insertions(+), 113 deletions(-)

diff --git a/Documentation/mm/hmm.rst b/Documentation/mm/hmm.rst
index 7d61b7a8b65b..49a10d3dfb2d 100644
--- a/Documentation/mm/hmm.rst
+++ b/Documentation/mm/hmm.rst
@@ -276,17 +276,12 @@ These can be allocated and freed with::
     struct resource *res;
     struct dev_pagemap pagemap;
 
-    res = request_free_mem_region(&iomem_resource, /* number of bytes */,
-                                  "name of driver resource");
     pagemap.type = MEMORY_DEVICE_PRIVATE;
-    pagemap.range.start = res->start;
-    pagemap.range.end = res->end;
-    pagemap.nr_range = 1;
+    pagemap.nr_pages = /* number of pages */;
     pagemap.ops = &device_devmem_ops;
-    memremap_pages(&pagemap, numa_node_id());
+    memremap_device_private_pagemap(&pagemap, numa_node_id());
 
     memunmap_pages(&pagemap);
-    release_mem_region(pagemap.range.start, range_len(&pagemap.range));
 
 There are also devm_request_free_mem_region(), devm_memremap_pages(),
 devm_memunmap_pages(), and devm_release_mem_region() when the resources can
diff --git a/include/linux/hmm.h b/include/linux/hmm.h
index df571fa75a44..f6e65a6d80ea 100644
--- a/include/linux/hmm.h
+++ b/include/linux/hmm.h
@@ -68,6 +68,9 @@ enum hmm_pfn_flags {
  */
 static inline struct page *hmm_pfn_to_page(unsigned long hmm_pfn)
 {
+	if (hmm_pfn & HMM_PFN_DEVICE_PRIVATE)
+		return device_private_offset_to_page(hmm_pfn & ~HMM_PFN_FLAGS);
+
 	return pfn_to_page(hmm_pfn & ~HMM_PFN_FLAGS);
 }
 
diff --git a/include/linux/memremap.h b/include/linux/memremap.h
index e5951ba12a28..737574209cea 100644
--- a/include/linux/memremap.h
+++ b/include/linux/memremap.h
@@ -38,6 +38,7 @@ struct vmem_altmap {
  * backing the device memory. Doing so simplifies the implementation, but it is
  * important to remember that there are certain points at which the struct page
  * must be treated as an opaque object, rather than a "normal" struct page.
+ * Unlike "normal" struct pages, the page_to_pfn() is invalid.
  *
  * A more complete discussion of unaddressable memory may be found in
  * include/linux/hmm.h and Documentation/mm/hmm.rst.
@@ -120,9 +121,13 @@ struct dev_pagemap_ops {
  * @owner: an opaque pointer identifying the entity that manages this
  *	instance.  Used by various helpers to make sure that no
  *	foreign ZONE_DEVICE memory is accessed.
- * @nr_range: number of ranges to be mapped
- * @range: range to be mapped when nr_range == 1
+ * @nr_range: number of ranges to be mapped. Always == 1 for
+ *	MEMORY_DEVICE_PRIVATE.
+ * @range: range to be mapped when nr_range == 1. Used as an output param for
+ *	MEMORY_DEVICE_PRIVATE.
  * @ranges: array of ranges to be mapped when nr_range > 1
+ * @nr_pages: number of pages requested to be mapped for MEMORY_DEVICE_PRIVATE.
+ * @pages: array of nr_pages initialized for MEMORY_DEVICE_PRIVATE.
  */
 struct dev_pagemap {
 	struct vmem_altmap altmap;
@@ -138,6 +143,8 @@ struct dev_pagemap {
 		struct range range;
 		DECLARE_FLEX_ARRAY(struct range, ranges);
 	};
+	unsigned long nr_pages;
+	struct page *pages;
 };
 
 static inline bool pgmap_has_memory_failure(struct dev_pagemap *pgmap)
@@ -164,6 +171,15 @@ static inline bool folio_is_device_private(const struct folio *folio)
 		folio->pgmap->type == MEMORY_DEVICE_PRIVATE;
 }
 
+struct page *device_private_offset_to_page(unsigned long offset);
+struct page *device_private_entry_to_page(swp_entry_t entry);
+pgoff_t device_private_page_to_offset(const struct page *page);
+
+static inline pgoff_t device_private_folio_to_offset(const struct folio *folio)
+{
+	return device_private_page_to_offset((const struct page *)&folio->page);
+}
+
 static inline bool is_device_private_page(const struct page *page)
 {
 	return IS_ENABLED(CONFIG_DEVICE_PRIVATE) &&
@@ -206,7 +222,12 @@ static inline bool is_fsdax_page(const struct page *page)
 }
 
 #ifdef CONFIG_ZONE_DEVICE
+void __init_zone_device_page(struct page *page, unsigned long pfn,
+					  unsigned long zone_idx, int nid,
+					  struct dev_pagemap *pgmap);
 void zone_device_page_init(struct page *page);
+unsigned long memremap_device_private_pagemap(struct dev_pagemap *pgmap);
+void memunmap_device_private_pagemap(struct dev_pagemap *pgmap);
 void *memremap_pages(struct dev_pagemap *pgmap, int nid);
 void memunmap_pages(struct dev_pagemap *pgmap);
 void *devm_memremap_pages(struct device *dev, struct dev_pagemap *pgmap);
diff --git a/include/linux/migrate.h b/include/linux/migrate.h
index d8f520dca342..d50684dd4ee6 100644
--- a/include/linux/migrate.h
+++ b/include/linux/migrate.h
@@ -132,6 +132,10 @@ static inline struct page *migrate_pfn_to_page(unsigned long mpfn)
 {
 	if (!(mpfn & MIGRATE_PFN_VALID))
 		return NULL;
+
+	if (mpfn & MIGRATE_PFN_DEVICE)
+		return device_private_offset_to_page(mpfn >> MIGRATE_PFN_SHIFT);
+
 	return pfn_to_page(mpfn >> MIGRATE_PFN_SHIFT);
 }
 
diff --git a/include/linux/mm.h b/include/linux/mm.h
index 6b8c299a6687..94d83897ea18 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -1851,6 +1851,8 @@ static inline unsigned long memdesc_section(memdesc_flags_t mdf)
  */
 static inline unsigned long folio_pfn(const struct folio *folio)
 {
+	VM_BUG_ON(folio_is_device_private(folio));
+
 	return page_to_pfn(&folio->page);
 }
 
diff --git a/include/linux/rmap.h b/include/linux/rmap.h
index 79e5c733d9c8..c1561a92864f 100644
--- a/include/linux/rmap.h
+++ b/include/linux/rmap.h
@@ -950,11 +950,18 @@ static inline unsigned long page_vma_walk_pfn(unsigned long pfn)
 
 static inline unsigned long folio_page_vma_walk_pfn(const struct folio *folio)
 {
+	if (folio_is_device_private(folio))
+		return page_vma_walk_pfn(device_private_folio_to_offset(folio)) |
+		       PVMW_PFN_DEVICE_PRIVATE;
+
 	return page_vma_walk_pfn(folio_pfn(folio));
 }
 
 static inline struct page *page_vma_walk_pfn_to_page(unsigned long pvmw_pfn)
 {
+	if (pvmw_pfn & PVMW_PFN_DEVICE_PRIVATE)
+		return device_private_offset_to_page(pvmw_pfn >> PVMW_PFN_SHIFT);
+
 	return pfn_to_page(pvmw_pfn >> PVMW_PFN_SHIFT);
 }
 
diff --git a/include/linux/swapops.h b/include/linux/swapops.h
index 7aa3f00e304a..03271ad98f73 100644
--- a/include/linux/swapops.h
+++ b/include/linux/swapops.h
@@ -565,7 +565,13 @@ static inline int pte_none_mostly(pte_t pte)
 
 static inline struct page *pfn_swap_entry_to_page(swp_entry_t entry)
 {
-	struct page *p = pfn_to_page(swp_offset_pfn(entry));
+	struct page *p;
+
+	if (is_device_private_entry(entry) ||
+	    is_device_private_migration_entry(entry))
+		p = device_private_entry_to_page(entry);
+	else
+		p = pfn_to_page(swp_offset_pfn(entry));
 
 	/*
 	 * Any use of migration entries may only occur while the
@@ -578,8 +584,13 @@ static inline struct page *pfn_swap_entry_to_page(swp_entry_t entry)
 
 static inline struct folio *pfn_swap_entry_folio(swp_entry_t entry)
 {
-	struct folio *folio = pfn_folio(swp_offset_pfn(entry));
+	struct folio *folio;
 
+	if (is_device_private_entry(entry) ||
+	    is_device_private_migration_entry(entry))
+		folio = page_folio(device_private_entry_to_page(entry));
+	else
+		folio = pfn_folio(swp_offset_pfn(entry));
 	/*
 	 * Any use of migration entries may only occur while the
 	 * corresponding folio is locked
diff --git a/lib/test_hmm.c b/lib/test_hmm.c
index 0035e1b7beec..59dae2ec628a 100644
--- a/lib/test_hmm.c
+++ b/lib/test_hmm.c
@@ -495,7 +495,7 @@ static int dmirror_allocate_chunk(struct dmirror_device *mdevice,
 				   struct page **ppage)
 {
 	struct dmirror_chunk *devmem;
-	struct resource *res = NULL;
+	bool device_private = false;
 	unsigned long pfn;
 	unsigned long pfn_first;
 	unsigned long pfn_last;
@@ -508,13 +508,9 @@ static int dmirror_allocate_chunk(struct dmirror_device *mdevice,
 
 	switch (mdevice->zone_device_type) {
 	case HMM_DMIRROR_MEMORY_DEVICE_PRIVATE:
-		res = request_free_mem_region(&iomem_resource, DEVMEM_CHUNK_SIZE,
-					      "hmm_dmirror");
-		if (IS_ERR_OR_NULL(res))
-			goto err_devmem;
-		devmem->pagemap.range.start = res->start;
-		devmem->pagemap.range.end = res->end;
+		device_private = true;
 		devmem->pagemap.type = MEMORY_DEVICE_PRIVATE;
+		devmem->pagemap.nr_pages = DEVMEM_CHUNK_SIZE / PAGE_SIZE;
 		break;
 	case HMM_DMIRROR_MEMORY_DEVICE_COHERENT:
 		devmem->pagemap.range.start = (MINOR(mdevice->cdevice.dev) - 2) ?
@@ -523,13 +519,13 @@ static int dmirror_allocate_chunk(struct dmirror_device *mdevice,
 		devmem->pagemap.range.end = devmem->pagemap.range.start +
 					    DEVMEM_CHUNK_SIZE - 1;
 		devmem->pagemap.type = MEMORY_DEVICE_COHERENT;
+		devmem->pagemap.nr_range = 1;
 		break;
 	default:
 		ret = -EINVAL;
 		goto err_devmem;
 	}
 
-	devmem->pagemap.nr_range = 1;
 	devmem->pagemap.ops = &dmirror_devmem_ops;
 	devmem->pagemap.owner = mdevice;
 
@@ -549,13 +545,20 @@ static int dmirror_allocate_chunk(struct dmirror_device *mdevice,
 		mdevice->devmem_capacity = new_capacity;
 		mdevice->devmem_chunks = new_chunks;
 	}
-	ptr = memremap_pages(&devmem->pagemap, numa_node_id());
-	if (IS_ERR_OR_NULL(ptr)) {
-		if (ptr)
-			ret = PTR_ERR(ptr);
-		else
-			ret = -EFAULT;
-		goto err_release;
+
+	if (device_private) {
+		ret = memremap_device_private_pagemap(&devmem->pagemap);
+		if (ret)
+			goto err_release;
+	} else {
+		ptr = memremap_pages(&devmem->pagemap, numa_node_id());
+		if (IS_ERR_OR_NULL(ptr)) {
+			if (ptr)
+				ret = PTR_ERR(ptr);
+			else
+				ret = -EFAULT;
+			goto err_release;
+		}
 	}
 
 	devmem->mdevice = mdevice;
@@ -565,15 +568,21 @@ static int dmirror_allocate_chunk(struct dmirror_device *mdevice,
 
 	mutex_unlock(&mdevice->devmem_lock);
 
-	pr_info("added new %u MB chunk (total %u chunks, %u MB) PFNs [0x%lx 0x%lx)\n",
+	pr_info("added new %u MB chunk (total %u chunks, %u MB) %sPFNs [0x%lx 0x%lx)\n",
 		DEVMEM_CHUNK_SIZE / (1024 * 1024),
 		mdevice->devmem_count,
 		mdevice->devmem_count * (DEVMEM_CHUNK_SIZE / (1024 * 1024)),
+		device_private ? "device " : "",
 		pfn_first, pfn_last);
 
 	spin_lock(&mdevice->lock);
 	for (pfn = pfn_first; pfn < pfn_last; pfn++) {
-		struct page *page = pfn_to_page(pfn);
+		struct page *page;
+
+		if (device_private)
+			page = device_private_offset_to_page(pfn);
+		else
+			page = pfn_to_page(pfn);
 
 		page->zone_device_data = mdevice->free_pages;
 		mdevice->free_pages = page;
@@ -589,9 +598,6 @@ static int dmirror_allocate_chunk(struct dmirror_device *mdevice,
 
 err_release:
 	mutex_unlock(&mdevice->devmem_lock);
-	if (res && devmem->pagemap.type == MEMORY_DEVICE_PRIVATE)
-		release_mem_region(devmem->pagemap.range.start,
-				   range_len(&devmem->pagemap.range));
 err_devmem:
 	kfree(devmem);
 
@@ -660,8 +666,8 @@ static void dmirror_migrate_alloc_and_copy(struct migrate_vma *args,
 		 */
 		spage = migrate_pfn_to_page(*src);
 		if (WARN(spage && is_zone_device_page(spage),
-		     "page already in device spage pfn: 0x%lx\n",
-		     page_to_pfn(spage)))
+		     "page already in device spage dev pfn: 0x%lx\n",
+		     device_private_page_to_offset(spage)))
 			continue;
 
 		dpage = dmirror_devmem_alloc_page(mdevice);
@@ -683,8 +689,9 @@ static void dmirror_migrate_alloc_and_copy(struct migrate_vma *args,
 		rpage->zone_device_data = dmirror;
 
 		pr_debug("migrating from sys to dev pfn src: 0x%lx pfn dst: 0x%lx\n",
-			 page_to_pfn(spage), page_to_pfn(dpage));
-		*dst = migrate_pfn(page_to_pfn(dpage)) |
+			 page_to_pfn(spage),
+			 device_private_page_to_offset(dpage));
+		*dst = migrate_pfn(device_private_page_to_offset(dpage)) |
 				   MIGRATE_PFN_DEVICE;
 		if ((*src & MIGRATE_PFN_WRITE) ||
 		    (!spage && args->vma->vm_flags & VM_WRITE))
@@ -846,8 +853,8 @@ static vm_fault_t dmirror_devmem_fault_alloc_and_copy(struct migrate_vma *args,
 		dpage = alloc_page_vma(GFP_HIGHUSER_MOVABLE, args->vma, addr);
 		if (!dpage)
 			continue;
-		pr_debug("migrating from dev to sys pfn src: 0x%lx pfn dst: 0x%lx\n",
-			 page_to_pfn(spage), page_to_pfn(dpage));
+		pr_debug("migrating from dev to sys dev pfn src: 0x%lx pfn dst: 0x%lx\n",
+			 device_private_page_to_offset(spage), page_to_pfn(dpage));
 
 		lock_page(dpage);
 		xa_erase(&dmirror->pt, addr >> PAGE_SHIFT);
@@ -1257,10 +1264,10 @@ static void dmirror_device_remove_chunks(struct dmirror_device *mdevice)
 			spin_unlock(&mdevice->lock);
 
 			dmirror_device_evict_chunk(devmem);
-			memunmap_pages(&devmem->pagemap);
 			if (devmem->pagemap.type == MEMORY_DEVICE_PRIVATE)
-				release_mem_region(devmem->pagemap.range.start,
-						   range_len(&devmem->pagemap.range));
+				memunmap_device_private_pagemap(&devmem->pagemap);
+			else
+				memunmap_pages(&devmem->pagemap);
 			kfree(devmem);
 		}
 		mdevice->devmem_count = 0;
diff --git a/mm/debug.c b/mm/debug.c
index 64ddb0c4b4be..81326d96a678 100644
--- a/mm/debug.c
+++ b/mm/debug.c
@@ -77,9 +77,11 @@ static void __dump_folio(struct folio *folio, struct page *page,
 	if (page_mapcount_is_type(mapcount))
 		mapcount = 0;
 
-	pr_warn("page: refcount:%d mapcount:%d mapping:%p index:%#lx pfn:%#lx\n",
+	pr_warn("page: refcount:%d mapcount:%d mapping:%p index:%#lx %spfn:%#lx\n",
 			folio_ref_count(folio), mapcount, mapping,
-			folio->index + idx, pfn);
+			folio->index + idx,
+			folio_is_device_private(folio) ? "device " : "",
+			pfn);
 	if (folio_test_large(folio)) {
 		int pincount = 0;
 
@@ -113,7 +115,8 @@ static void __dump_folio(struct folio *folio, struct page *page,
 	 * inaccuracy here due to racing.
 	 */
 	pr_warn("%sflags: %pGp%s\n", type, &folio->flags,
-		is_migrate_cma_folio(folio, pfn) ? " CMA" : "");
+		(!folio_is_device_private(folio) &&
+		 is_migrate_cma_folio(folio, pfn)) ? " CMA" : "");
 	if (page_has_type(&folio->page))
 		pr_warn("page_type: %x(%s)\n", folio->page.page_type >> 24,
 				page_type_name(folio->page.page_type));
diff --git a/mm/memremap.c b/mm/memremap.c
index 46cb1b0b6f72..eb8dec1e550e 100644
--- a/mm/memremap.c
+++ b/mm/memremap.c
@@ -12,9 +12,12 @@
 #include <linux/types.h>
 #include <linux/wait_bit.h>
 #include <linux/xarray.h>
+#include <linux/maple_tree.h>
 #include "internal.h"
 
 static DEFINE_XARRAY(pgmap_array);
+static struct maple_tree device_private_pgmap_tree =
+	MTREE_INIT(device_private_pgmap_tree, MT_FLAGS_ALLOC_RANGE);
 
 /*
  * The memremap() and memremap_pages() interfaces are alternately used
@@ -113,9 +116,10 @@ void memunmap_pages(struct dev_pagemap *pgmap)
 {
 	int i;
 
+	WARN_ONCE(pgmap->type == MEMORY_DEVICE_PRIVATE, "Type should not be MEMORY_DEVICE_PRIVATE\n");
+
 	percpu_ref_kill(&pgmap->ref);
-	if (pgmap->type != MEMORY_DEVICE_PRIVATE &&
-	    pgmap->type != MEMORY_DEVICE_COHERENT)
+	if (pgmap->type != MEMORY_DEVICE_COHERENT)
 		for (i = 0; i < pgmap->nr_range; i++)
 			percpu_ref_put_many(&pgmap->ref, pfn_len(pgmap, i));
 
@@ -144,7 +148,6 @@ static void dev_pagemap_percpu_release(struct percpu_ref *ref)
 static int pagemap_range(struct dev_pagemap *pgmap, struct mhp_params *params,
 		int range_id, int nid)
 {
-	const bool is_private = pgmap->type == MEMORY_DEVICE_PRIVATE;
 	struct range *range = &pgmap->ranges[range_id];
 	struct dev_pagemap *conflict_pgmap;
 	int error, is_ram;
@@ -190,7 +193,7 @@ static int pagemap_range(struct dev_pagemap *pgmap, struct mhp_params *params,
 	if (error)
 		goto err_pfn_remap;
 
-	if (!mhp_range_allowed(range->start, range_len(range), !is_private)) {
+	if (!mhp_range_allowed(range->start, range_len(range), true)) {
 		error = -EINVAL;
 		goto err_kasan;
 	}
@@ -198,30 +201,19 @@ static int pagemap_range(struct dev_pagemap *pgmap, struct mhp_params *params,
 	mem_hotplug_begin();
 
 	/*
-	 * For device private memory we call add_pages() as we only need to
-	 * allocate and initialize struct page for the device memory. More-
-	 * over the device memory is un-accessible thus we do not want to
-	 * create a linear mapping for the memory like arch_add_memory()
-	 * would do.
-	 *
-	 * For all other device memory types, which are accessible by
-	 * the CPU, we do want the linear mapping and thus use
+	 * All device memory types except device private memory are accessible
+	 * by the CPU, so we want the linear mapping and thus use
 	 * arch_add_memory().
 	 */
-	if (is_private) {
-		error = add_pages(nid, PHYS_PFN(range->start),
-				PHYS_PFN(range_len(range)), params);
-	} else {
-		error = kasan_add_zero_shadow(__va(range->start), range_len(range));
-		if (error) {
-			mem_hotplug_done();
-			goto err_kasan;
-		}
-
-		error = arch_add_memory(nid, range->start, range_len(range),
-					params);
+	error = kasan_add_zero_shadow(__va(range->start), range_len(range));
+	if (error) {
+		mem_hotplug_done();
+		goto err_kasan;
 	}
 
+	error = arch_add_memory(nid, range->start, range_len(range),
+				params);
+
 	if (!error) {
 		struct zone *zone;
 
@@ -248,8 +240,7 @@ static int pagemap_range(struct dev_pagemap *pgmap, struct mhp_params *params,
 	return 0;
 
 err_add_memory:
-	if (!is_private)
-		kasan_remove_zero_shadow(__va(range->start), range_len(range));
+	kasan_remove_zero_shadow(__va(range->start), range_len(range));
 err_kasan:
 	pfnmap_untrack(PHYS_PFN(range->start), range_len(range));
 err_pfn_remap:
@@ -281,22 +272,8 @@ void *memremap_pages(struct dev_pagemap *pgmap, int nid)
 
 	switch (pgmap->type) {
 	case MEMORY_DEVICE_PRIVATE:
-		if (!IS_ENABLED(CONFIG_DEVICE_PRIVATE)) {
-			WARN(1, "Device private memory not supported\n");
-			return ERR_PTR(-EINVAL);
-		}
-		if (!pgmap->ops || !pgmap->ops->migrate_to_ram) {
-			WARN(1, "Missing migrate_to_ram method\n");
-			return ERR_PTR(-EINVAL);
-		}
-		if (!pgmap->ops->page_free) {
-			WARN(1, "Missing page_free method\n");
-			return ERR_PTR(-EINVAL);
-		}
-		if (!pgmap->owner) {
-			WARN(1, "Missing owner\n");
-			return ERR_PTR(-EINVAL);
-		}
+		WARN(1, "Use memremap_device_private_pagemap()\n");
+		return ERR_PTR(-EINVAL);
 		break;
 	case MEMORY_DEVICE_COHERENT:
 		if (!pgmap->ops->page_free) {
@@ -491,3 +468,116 @@ void zone_device_page_init(struct page *page)
 	lock_page(page);
 }
 EXPORT_SYMBOL_GPL(zone_device_page_init);
+
+unsigned long memremap_device_private_pagemap(struct dev_pagemap *pgmap)
+{
+	unsigned long dpfn, dpfn_first, dpfn_last = 0;
+	unsigned long start;
+	int rc;
+
+	if (pgmap->type != MEMORY_DEVICE_PRIVATE) {
+		WARN(1, "Not device private memory\n");
+		return -EINVAL;
+	}
+	if (!IS_ENABLED(CONFIG_DEVICE_PRIVATE)) {
+		WARN(1, "Device private memory not supported\n");
+		return -EINVAL;
+	}
+	if (!pgmap->ops || !pgmap->ops->migrate_to_ram) {
+		WARN(1, "Missing migrate_to_ram method\n");
+		return -EINVAL;
+	}
+	if (!pgmap->ops->page_free) {
+		WARN(1, "Missing page_free method\n");
+		return -EINVAL;
+	}
+	if (!pgmap->owner) {
+		WARN(1, "Missing owner\n");
+		return -EINVAL;
+	}
+
+	pgmap->pages = kzalloc(sizeof(struct page) * pgmap->nr_pages,
+			       GFP_KERNEL);
+	if (!pgmap->pages)
+		return -ENOMEM;
+
+	rc = mtree_alloc_range(&device_private_pgmap_tree, &start, pgmap,
+			       pgmap->nr_pages * PAGE_SIZE, 0,
+			       1ull << MAX_PHYSMEM_BITS, GFP_KERNEL);
+	if (rc < 0)
+		goto err_mtree_alloc;
+
+	pgmap->range.start = start;
+	pgmap->range.end = pgmap->range.start + (pgmap->nr_pages * PAGE_SIZE) - 1;
+	pgmap->nr_range = 1;
+
+	init_completion(&pgmap->done);
+	rc = percpu_ref_init(&pgmap->ref, dev_pagemap_percpu_release, 0,
+		GFP_KERNEL);
+	if (rc < 0)
+		goto err_ref_init;
+
+	dpfn_first = pgmap->range.start >> PAGE_SHIFT;
+	dpfn_last = dpfn_first + (range_len(&pgmap->range) >> PAGE_SHIFT);
+	for (dpfn = dpfn_first; dpfn < dpfn_last; dpfn++) {
+		struct page *page = device_private_offset_to_page(dpfn);
+
+		__init_zone_device_page(page, dpfn, ZONE_DEVICE, numa_node_id(), pgmap);
+		page_folio(page)->pgmap = (void *) pgmap;
+	}
+
+	return 0;
+
+err_ref_init:
+	mtree_erase(&device_private_pgmap_tree, pgmap->range.start);
+err_mtree_alloc:
+	kfree(pgmap->pages);
+	return rc;
+}
+EXPORT_SYMBOL_GPL(memremap_device_private_pagemap);
+
+void memunmap_device_private_pagemap(struct dev_pagemap *pgmap)
+{
+	percpu_ref_kill(&pgmap->ref);
+	wait_for_completion(&pgmap->done);
+	percpu_ref_exit(&pgmap->ref);
+	kfree(pgmap->pages);
+	mtree_erase(&device_private_pgmap_tree, pgmap->range.start);
+}
+EXPORT_SYMBOL_GPL(memunmap_device_private_pagemap);
+
+struct page *device_private_offset_to_page(unsigned long offset)
+{
+	struct dev_pagemap *pgmap;
+
+	pgmap = mtree_load(&device_private_pgmap_tree, offset << PAGE_SHIFT);
+	if (WARN_ON_ONCE(!pgmap))
+		return NULL;
+
+	return &pgmap->pages[offset - (pgmap->range.start >> PAGE_SHIFT)];
+}
+EXPORT_SYMBOL_GPL(device_private_offset_to_page);
+
+struct page *device_private_entry_to_page(swp_entry_t entry)
+{
+	unsigned long offset;
+
+	if (!((is_device_private_entry(entry) ||
+	       (is_device_private_migration_entry(entry))))) {
+		return NULL;
+	}
+
+	offset = swp_offset_pfn(entry);
+
+	return device_private_offset_to_page(offset);
+}
+
+pgoff_t device_private_page_to_offset(const struct page *page)
+{
+	struct dev_pagemap *pgmap = (struct dev_pagemap *) page_pgmap(page);
+
+	VM_BUG_ON_PAGE(!is_device_private_page(page), page);
+
+	return (pgmap->range.start >> PAGE_SHIFT) + ((page - pgmap->pages));
+}
+EXPORT_SYMBOL_GPL(device_private_page_to_offset);
diff --git a/mm/migrate.c b/mm/migrate.c
index 3c561d61afba..76e08fedbf2b 100644
--- a/mm/migrate.c
+++ b/mm/migrate.c
@@ -399,10 +399,10 @@ static bool remove_migration_pte(struct folio *folio,
 		if (unlikely(is_device_private_page(new))) {
 			if (pte_write(pte))
 				entry = make_writable_device_private_entry(
-							page_to_pfn(new));
+							device_private_page_to_offset(new));
 			else
 				entry = make_readable_device_private_entry(
-							page_to_pfn(new));
+							device_private_page_to_offset(new));
 			pte = swp_entry_to_pte(entry);
 			if (pte_swp_soft_dirty(old_pte))
 				pte = pte_swp_mksoft_dirty(pte);
diff --git a/mm/migrate_device.c b/mm/migrate_device.c
index 458b5114bb2b..4579f8e9b759 100644
--- a/mm/migrate_device.c
+++ b/mm/migrate_device.c
@@ -147,7 +147,7 @@ static int migrate_vma_collect_pmd(pmd_t *pmdp,
 			    pgmap->owner != migrate->pgmap_owner)
 				goto next;
 
-			mpfn = migrate_pfn(page_to_pfn(page)) |
+			mpfn = migrate_pfn(device_private_page_to_offset(page)) |
 					MIGRATE_PFN_MIGRATE |
 					MIGRATE_PFN_DEVICE;
 			if (is_writable_device_private_entry(entry))
@@ -238,21 +238,21 @@ static int migrate_vma_collect_pmd(pmd_t *pmdp,
 			if (mpfn & MIGRATE_PFN_WRITE) {
 				if (is_device_private_page(page))
 					entry = make_writable_migration_device_private_entry(
-								page_to_pfn(page));
+								device_private_page_to_offset(page));
 				else
 					entry = make_writable_migration_entry(
 								page_to_pfn(page));
 			} else if (anon_exclusive) {
 				if (is_device_private_page(page))
 					entry = make_device_migration_readable_exclusive_migration_entry(
-								page_to_pfn(page));
+								device_private_page_to_offset(page));
 				else
 					entry = make_readable_exclusive_migration_entry(
 								page_to_pfn(page));
 			} else {
 				if (is_device_private_page(page))
 					entry = make_readable_migration_device_private_entry(
-								page_to_pfn(page));
+								device_private_page_to_offset(page));
 				else
 					entry = make_readable_migration_entry(
 								page_to_pfn(page));
@@ -650,10 +650,10 @@ static void migrate_vma_insert_page(struct migrate_vma *migrate,
 
 		if (vma->vm_flags & VM_WRITE)
 			swp_entry = make_writable_device_private_entry(
-						page_to_pfn(page));
+						device_private_page_to_offset(page));
 		else
 			swp_entry = make_readable_device_private_entry(
-						page_to_pfn(page));
+						device_private_page_to_offset(page));
 		entry = swp_entry_to_pte(swp_entry);
 	} else {
 		if (folio_is_zone_device(folio) &&
@@ -923,7 +923,7 @@ static unsigned long migrate_device_pfn_lock(unsigned long pfn)
 {
 	struct folio *folio;
 
-	folio = folio_get_nontail_page(pfn_to_page(pfn));
+	folio = folio_get_nontail_page(device_private_offset_to_page(pfn));
 	if (!folio)
 		return 0;
 
diff --git a/mm/mm_init.c b/mm/mm_init.c
index 7712d887b696..772025d833f4 100644
--- a/mm/mm_init.c
+++ b/mm/mm_init.c
@@ -1004,9 +1004,9 @@ static void __init memmap_init(void)
 }
 
 #ifdef CONFIG_ZONE_DEVICE
-static void __ref __init_zone_device_page(struct page *page, unsigned long pfn,
-					  unsigned long zone_idx, int nid,
-					  struct dev_pagemap *pgmap)
+void __ref __init_zone_device_page(struct page *page, unsigned long pfn,
+				   unsigned long zone_idx, int nid,
+				   struct dev_pagemap *pgmap)
 {
 
 	__init_single_page(page, pfn, zone_idx, nid);
@@ -1038,7 +1038,7 @@ static void __ref __init_zone_device_page(struct page *page, unsigned long pfn,
 	 * Please note that MEMINIT_HOTPLUG path doesn't clear memmap
 	 * because this is done early in section_activate()
 	 */
-	if (pageblock_aligned(pfn)) {
+	if (pgmap->type != MEMORY_DEVICE_PRIVATE && pageblock_aligned(pfn)) {
 		init_pageblock_migratetype(page, MIGRATE_MOVABLE, false);
 		cond_resched();
 	}
diff --git a/mm/page_vma_mapped.c b/mm/page_vma_mapped.c
index e9fe747d3df3..9911bbe15699 100644
--- a/mm/page_vma_mapped.c
+++ b/mm/page_vma_mapped.c
@@ -104,6 +104,7 @@ static bool map_pte(struct page_vma_mapped_walk *pvmw, pmd_t *pmdvalp,
 static bool check_pte(struct page_vma_mapped_walk *pvmw, unsigned long pte_nr)
 {
 	unsigned long pfn;
+	bool device_private = false;
 	pte_t ptent = ptep_get(pvmw->pte);
 
 	if (pvmw->flags & PVMW_MIGRATION) {
@@ -115,6 +116,9 @@ static bool check_pte(struct page_vma_mapped_walk *pvmw, unsigned long pte_nr)
 		if (!(is_migration_entry(entry)))
 			return false;
 
+		if (is_device_private_migration_entry(entry))
+			device_private = true;
+
 		pfn = swp_offset_pfn(entry);
 	} else if (is_swap_pte(ptent)) {
 		swp_entry_t entry;
@@ -125,6 +129,9 @@ static bool check_pte(struct page_vma_mapped_walk *pvmw, unsigned long pte_nr)
 		    !is_device_exclusive_entry(entry))
 			return false;
 
+		if (is_device_private_entry(entry))
+			device_private = true;
+
 		pfn = swp_offset_pfn(entry);
 	} else {
 		if (!pte_present(ptent))
@@ -133,6 +140,9 @@ static bool check_pte(struct page_vma_mapped_walk *pvmw, unsigned long pte_nr)
 		pfn = pte_pfn(ptent);
 	}
 
+	if ((device_private) ^ !!(pvmw->pfn & PVMW_PFN_DEVICE_PRIVATE))
+		return false;
+
 	if ((pfn + pte_nr - 1) < (pvmw->pfn >> PVMW_PFN_SHIFT))
 		return false;
 	if (pfn > ((pvmw->pfn >> PVMW_PFN_SHIFT) + pvmw->nr_pages - 1))
diff --git a/mm/pagewalk.c b/mm/pagewalk.c
index f5c77dda3359..5970f62bc4b2 100644
--- a/mm/pagewalk.c
+++ b/mm/pagewalk.c
@@ -1003,8 +1003,7 @@ struct folio *folio_walk_start(struct folio_walk *fw,
 		swp_entry_t entry = pte_to_swp_entry(pte);
 
 		if ((flags & FW_MIGRATION) &&
-		    (is_migration_entry(entry) ||
-		     is_device_private_migration_entry(entry))) {
+		    (is_migration_entry(entry))) {
 			page = pfn_swap_entry_to_page(entry);
 			expose_page = false;
 			goto found;
diff --git a/mm/rmap.c b/mm/rmap.c
index 9642a79cbdb4..5aef8223914b 100644
--- a/mm/rmap.c
+++ b/mm/rmap.c
@@ -1873,7 +1873,7 @@ static bool try_to_unmap_one(struct folio *folio, struct vm_area_struct *vma,
 	struct mmu_notifier_range range;
 	enum ttu_flags flags = (enum ttu_flags)(long)arg;
 	unsigned long nr_pages = 1, end_addr;
-	unsigned long pfn;
+	unsigned long nr;
 	unsigned long hsz = 0;
 	int ptes = 0;
 
@@ -1980,13 +1980,20 @@ static bool try_to_unmap_one(struct folio *folio, struct vm_area_struct *vma,
 		 */
 		pteval = ptep_get(pvmw.pte);
 		if (likely(pte_present(pteval))) {
-			pfn = pte_pfn(pteval);
+			nr = pte_pfn(pteval) - folio_pfn(folio);
 		} else {
-			pfn = swp_offset_pfn(pte_to_swp_entry(pteval));
+			swp_entry_t entry = pte_to_swp_entry(pteval);
+
+			if (is_device_private_entry(entry) ||
+			    is_device_private_migration_entry(entry))
+				nr = swp_offset_pfn(entry) - device_private_folio_to_offset(folio);
+			else
+				nr = swp_offset_pfn(entry) - folio_pfn(folio);
+
 			VM_WARN_ON_FOLIO(folio_test_hugetlb(folio), folio);
 		}
 
-		subpage = folio_page(folio, pfn - folio_pfn(folio));
+		subpage = folio_page(folio, nr);
 		address = pvmw.address;
 		anon_exclusive = folio_test_anon(folio) &&
 				 PageAnonExclusive(subpage);
@@ -2300,7 +2307,7 @@ static bool try_to_migrate_one(struct folio *folio, struct vm_area_struct *vma,
 	struct page *subpage;
 	struct mmu_notifier_range range;
 	enum ttu_flags flags = (enum ttu_flags)(long)arg;
-	unsigned long pfn;
+	unsigned long nr;
 	unsigned long hsz = 0;
 
 	/*
@@ -2370,13 +2377,20 @@ static bool try_to_migrate_one(struct folio *folio, struct vm_area_struct *vma,
 		 */
 		pteval = ptep_get(pvmw.pte);
 		if (likely(pte_present(pteval))) {
-			pfn = pte_pfn(pteval);
+			nr = pte_pfn(pteval) - folio_pfn(folio);
 		} else {
-			pfn = swp_offset_pfn(pte_to_swp_entry(pteval));
+			swp_entry_t entry = pte_to_swp_entry(pteval);
+
+			if (is_device_private_entry(entry) ||
+			    is_device_private_migration_entry(entry))
+				nr = swp_offset_pfn(entry) - device_private_folio_to_offset(folio);
+			else
+				nr = swp_offset_pfn(entry) - folio_pfn(folio);
+
 			VM_WARN_ON_FOLIO(folio_test_hugetlb(folio), folio);
 		}
 
-		subpage = folio_page(folio, pfn - folio_pfn(folio));
+		subpage = folio_page(folio, nr);
 		address = pvmw.address;
 		anon_exclusive = folio_test_anon(folio) &&
 				 PageAnonExclusive(subpage);
@@ -2436,7 +2450,7 @@ static bool try_to_migrate_one(struct folio *folio, struct vm_area_struct *vma,
 				folio_mark_dirty(folio);
 			writable = pte_write(pteval);
 		} else if (likely(pte_present(pteval))) {
-			flush_cache_page(vma, address, pfn);
+			flush_cache_page(vma, address, pte_pfn(pteval));
 			/* Nuke the page table entry. */
 			if (should_defer_flush(mm, flags)) {
 				/*
@@ -2538,21 +2552,21 @@ static bool try_to_migrate_one(struct folio *folio, struct vm_area_struct *vma,
 			if (writable) {
 				if (is_device_private_page(subpage))
 					entry = make_writable_migration_device_private_entry(
-								page_to_pfn(subpage));
+								device_private_page_to_offset(subpage));
 				else
 					entry = make_writable_migration_entry(
 								page_to_pfn(subpage));
 			} else if (anon_exclusive) {
 				if (is_device_private_page(subpage))
 					entry = make_device_migration_readable_exclusive_migration_entry(
-								page_to_pfn(subpage));
+								device_private_page_to_offset(subpage));
 				else
 					entry = make_readable_exclusive_migration_entry(
 								page_to_pfn(subpage));
 			} else {
 				if (is_device_private_page(subpage))
 					entry = make_readable_migration_device_private_entry(
-								page_to_pfn(subpage));
+								device_private_page_to_offset(subpage));
 				else
 					entry = make_readable_migration_entry(
 								page_to_pfn(subpage));
diff --git a/mm/util.c b/mm/util.c
index 2472b7381b11..5f2aef804035 100644
--- a/mm/util.c
+++ b/mm/util.c
@@ -1241,7 +1241,10 @@ void snapshot_page(struct page_snapshot *ps, const struct page *page)
 	struct folio *foliop;
 	int loops = 5;
 
-	ps->pfn = page_to_pfn(page);
+	if (is_device_private_page(page))
+		ps->pfn = device_private_page_to_offset(page);
+	else
+		ps->pfn = page_to_pfn(page);
 	ps->flags = PAGE_SNAPSHOT_FAITHFUL;
 
 again:
-- 
2.34.1



^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: [RFC PATCH 0/6] Remove device private pages from physical address space
  2025-11-28  4:41 [RFC PATCH 0/6] Remove device private pages from physical address space Jordan Niethe
                   ` (5 preceding siblings ...)
  2025-11-28  4:41 ` [RFC PATCH 6/6] mm: Remove device private pages from the physical address space Jordan Niethe
@ 2025-11-28  7:40 ` David Hildenbrand (Red Hat)
  2025-11-30 23:33   ` Alistair Popple
  2025-11-28 15:09 ` Matthew Wilcox
                   ` (3 subsequent siblings)
  10 siblings, 1 reply; 26+ messages in thread
From: David Hildenbrand (Red Hat) @ 2025-11-28  7:40 UTC (permalink / raw)
  To: Jordan Niethe, linux-mm
  Cc: balbirs, matthew.brost, akpm, linux-kernel, dri-devel, ziy,
	apopple, lorenzo.stoakes, lyude, dakr, airlied, simona,
	rcampbell, mpenttil, jgg, willy

On 11/28/25 05:41, Jordan Niethe wrote:
> Today, when creating these device private struct pages, the first step
> is to use request_free_mem_region() to get a range of physical address
> space large enough to represent the devices memory. This allocated
> physical address range is then remapped as device private memory using
> memremap_pages.

Just a note that as we are finishing the old release and are about to 
start the merge window (+ there is Thanksgiving), expect few replies to 
non-urgent stuff in the next few weeks.

That said, the proposal is interesting. I recall that Alistair and 
Jason recently discussed removing the need to deal with PFNs
completely for device-private.

Is that the result of these discussions?

-- 
Cheers

David


^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: [RFC PATCH 0/6] Remove device private pages from physical address space
  2025-11-28  4:41 [RFC PATCH 0/6] Remove device private pages from physical address space Jordan Niethe
                   ` (6 preceding siblings ...)
  2025-11-28  7:40 ` [RFC PATCH 0/6] Remove device private pages from " David Hildenbrand (Red Hat)
@ 2025-11-28 15:09 ` Matthew Wilcox
  2025-12-02  1:31   ` Jordan Niethe
  2025-11-28 16:07 ` Mika Penttilä
                   ` (2 subsequent siblings)
  10 siblings, 1 reply; 26+ messages in thread
From: Matthew Wilcox @ 2025-11-28 15:09 UTC (permalink / raw)
  To: Jordan Niethe
  Cc: linux-mm, balbirs, matthew.brost, akpm, linux-kernel, dri-devel,
	david, ziy, apopple, lorenzo.stoakes, lyude, dakr, airlied,
	simona, rcampbell, mpenttil, jgg

On Fri, Nov 28, 2025 at 03:41:40PM +1100, Jordan Niethe wrote:
> A consequence of placing the device private pages outside of the
> physical address space is that they no longer have a PFN. However, it is
> still necessary to be able to look up a corresponding device private
> page from a device private PTE entry, which means that we still require
> some way to index into this device private address space. This leads to
> the idea of a device private PFN. This is like a PFN but instead of

Don't call it a "device private PFN".  That's going to lead to
confusion.  Device private index?  Device memory index?

> By removing the device private pages from the physical address space,
> this RFC also opens up the possibility to moving away from tracking
> device private memory using struct pages in the future. This is
> desirable as on systems with large amounts of memory these device
> private struct pages use a signifiant amount of memory and take a
> significant amount of time to initialize.

I did tell Jerome he was making a huge mistake with his design, but
he forced it in anyway.


^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: [RFC PATCH 0/6] Remove device private pages from physical address space
  2025-11-28  4:41 [RFC PATCH 0/6] Remove device private pages from physical address space Jordan Niethe
                   ` (7 preceding siblings ...)
  2025-11-28 15:09 ` Matthew Wilcox
@ 2025-11-28 16:07 ` Mika Penttilä
  2025-12-02  1:32   ` Jordan Niethe
  2025-11-28 19:22 ` Matthew Brost
  2025-12-02 22:20 ` Balbir Singh
  10 siblings, 1 reply; 26+ messages in thread
From: Mika Penttilä @ 2025-11-28 16:07 UTC (permalink / raw)
  To: Jordan Niethe, linux-mm
  Cc: balbirs, matthew.brost, akpm, linux-kernel, dri-devel, david,
	ziy, apopple, lorenzo.stoakes, lyude, dakr, airlied, simona,
	rcampbell, jgg, willy

Hi Jordan!

On 11/28/25 06:41, Jordan Niethe wrote:

> Today, when creating these device private struct pages, the first step
> is to use request_free_mem_region() to get a range of physical address
> space large enough to represent the devices memory. This allocated
> physical address range is then remapped as device private memory using
> memremap_pages.
>
I just did a quick read through, and liked how it turned out, nice work!

--Mika




^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: [RFC PATCH 6/6] mm: Remove device private pages from the physical address space
  2025-11-28  4:41 ` [RFC PATCH 6/6] mm: Remove device private pages from the physical address space Jordan Niethe
@ 2025-11-28 17:51   ` Jason Gunthorpe
  2025-12-02  2:28     ` Jordan Niethe
  0 siblings, 1 reply; 26+ messages in thread
From: Jason Gunthorpe @ 2025-11-28 17:51 UTC (permalink / raw)
  To: Jordan Niethe
  Cc: linux-mm, balbirs, matthew.brost, akpm, linux-kernel, dri-devel,
	david, ziy, apopple, lorenzo.stoakes, lyude, dakr, airlied,
	simona, rcampbell, mpenttil, willy

On Fri, Nov 28, 2025 at 03:41:46PM +1100, Jordan Niethe wrote:
> Introduce helpers:
> 
>   - device_private_page_to_offset()
>   - device_private_folio_to_offset()
> 
> to take a given device private page / folio and return its offset within
> the device private address space (this is essentially a PFN within the
> device private address space).

It would be nice if we rarely/never needed to see this number space
outside the pte itself or the internal helpers...

Like, I don't think there should be stuff like this:

>  					entry = make_writable_migration_device_private_entry(
> -								page_to_pfn(page));
> +								device_private_page_to_offset(page));

make_writable_migration_device_private_entry() should accept the
struct page as the handle?

If it really is needed I think it should have its own dedicated type
and not be intermixed with normal pfns..
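
For illustration only, a minimal sketch of that kind of dedicated type,
built on the helpers this series introduces (the wrapper type and the
page-taking constructor below are hypothetical, not something the series
currently defines):

	/*
	 * Hypothetical: wrap the device private offset in its own type so
	 * it cannot be passed where a regular pfn is expected, and let the
	 * entry constructor take the page itself.
	 */
	typedef struct {
		unsigned long val;
	} dev_private_offset_t;

	static inline dev_private_offset_t
	page_to_dev_private_offset(struct page *page)
	{
		return (dev_private_offset_t) {
			.val = device_private_page_to_offset(page),
		};
	}

	static inline swp_entry_t
	make_writable_migration_device_private_entry(struct page *page)
	{
		return swp_entry(SWP_MIGRATION_DEVICE_WRITE,
				 page_to_dev_private_offset(page).val);
	}

Call sites such as migrate_vma_collect_pmd() would then hand over the
page directly instead of converting to an offset at each call site.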

Jason


^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: [RFC PATCH 1/6] mm/hmm: Add flag to track device private PFNs
  2025-11-28  4:41 ` [RFC PATCH 1/6] mm/hmm: Add flag to track device private PFNs Jordan Niethe
@ 2025-11-28 18:36   ` Matthew Brost
  2025-12-02  1:20     ` Jordan Niethe
  0 siblings, 1 reply; 26+ messages in thread
From: Matthew Brost @ 2025-11-28 18:36 UTC (permalink / raw)
  To: Jordan Niethe
  Cc: linux-mm, balbirs, akpm, linux-kernel, dri-devel, david, ziy,
	apopple, lorenzo.stoakes, lyude, dakr, airlied, simona,
	rcampbell, mpenttil, jgg, willy

On Fri, Nov 28, 2025 at 03:41:41PM +1100, Jordan Niethe wrote:
> A future change will remove device private pages from the physical
> address space. This will mean that device private pages no longer have
> normal PFN and must be handled separately.
> 
> Prepare for this by adding a HMM_PFN_DEVICE_PRIVATE flag to indicate
> that a hmm_pfn contains a PFN for a device private page.
> 
> Signed-off-by: Jordan Niethe <jniethe@nvidia.com>
> Signed-off-by: Alistair Popple <apopple@nvidia.com>
> ---
>  include/linux/hmm.h | 2 ++
>  mm/hmm.c            | 2 +-
>  2 files changed, 3 insertions(+), 1 deletion(-)
> 
> diff --git a/include/linux/hmm.h b/include/linux/hmm.h
> index db75ffc949a7..df571fa75a44 100644
> --- a/include/linux/hmm.h
> +++ b/include/linux/hmm.h
> @@ -23,6 +23,7 @@ struct mmu_interval_notifier;
>   * HMM_PFN_WRITE - if the page memory can be written to (requires HMM_PFN_VALID)
>   * HMM_PFN_ERROR - accessing the pfn is impossible and the device should
>   *                 fail. ie poisoned memory, special pages, no vma, etc
> + * HMM_PFN_DEVICE_PRIVATE - the pfn field contains a DEVICE_PRIVATE pfn.
>   * HMM_PFN_P2PDMA - P2P page
>   * HMM_PFN_P2PDMA_BUS - Bus mapped P2P transfer
>   * HMM_PFN_DMA_MAPPED - Flag preserved on input-to-output transformation
> @@ -40,6 +41,7 @@ enum hmm_pfn_flags {
>  	HMM_PFN_VALID = 1UL << (BITS_PER_LONG - 1),
>  	HMM_PFN_WRITE = 1UL << (BITS_PER_LONG - 2),
>  	HMM_PFN_ERROR = 1UL << (BITS_PER_LONG - 3),
> +	HMM_PFN_DEVICE_PRIVATE = 1UL << (BITS_PER_LONG - 7),
>  	/*
>  	 * Sticky flags, carried from input to output,
>  	 * don't forget to update HMM_PFN_INOUT_FLAGS
> diff --git a/mm/hmm.c b/mm/hmm.c
> index 87562914670a..1cff68ade1d4 100644
> --- a/mm/hmm.c
> +++ b/mm/hmm.c
> @@ -262,7 +262,7 @@ static int hmm_vma_handle_pte(struct mm_walk *walk, unsigned long addr,
>  		if (is_device_private_entry(entry) &&
>  		    page_pgmap(pfn_swap_entry_to_page(entry))->owner ==
>  		    range->dev_private_owner) {
> -			cpu_flags = HMM_PFN_VALID;
> +			cpu_flags = HMM_PFN_VALID | HMM_PFN_DEVICE_PRIVATE;

I think you’ll need to set this flag in hmm_vma_handle_absent_pmd as
well. That function handles 2M device pages. Support for 2M device
pages, I believe, will be included in the 6.19 PR, but
hmm_vma_handle_absent_pmd is already upstream.
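
If that path mirrors the pte handler quoted above, the fix would
presumably be the same one-liner applied there, roughly (sketch only;
the surrounding code and variable names in hmm_vma_handle_absent_pmd
are assumed, not quoted):

	/* sketch: same treatment as hmm_vma_handle_pte() */
	cpu_flags = HMM_PFN_VALID | HMM_PFN_DEVICE_PRIVATE;
	if (is_writable_device_private_entry(entry))
		cpu_flags |= HMM_PFN_WRITE;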

Matt

>  			if (is_writable_device_private_entry(entry))
>  				cpu_flags |= HMM_PFN_WRITE;
>  			new_pfn_flags = swp_offset_pfn(entry) | cpu_flags;
> -- 
> 2.34.1
> 


^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: [RFC PATCH 0/6] Remove device private pages from physical address space
  2025-11-28  4:41 [RFC PATCH 0/6] Remove device private pages from physical address space Jordan Niethe
                   ` (8 preceding siblings ...)
  2025-11-28 16:07 ` Mika Penttilä
@ 2025-11-28 19:22 ` Matthew Brost
  2025-11-30 23:23   ` Alistair Popple
  2025-12-02 22:20 ` Balbir Singh
  10 siblings, 1 reply; 26+ messages in thread
From: Matthew Brost @ 2025-11-28 19:22 UTC (permalink / raw)
  To: Jordan Niethe
  Cc: linux-mm, balbirs, akpm, linux-kernel, dri-devel, david, ziy,
	apopple, lorenzo.stoakes, lyude, dakr, airlied, simona,
	rcampbell, mpenttil, jgg, willy

On Fri, Nov 28, 2025 at 03:41:40PM +1100, Jordan Niethe wrote:
> Today, when creating these device private struct pages, the first step
> is to use request_free_mem_region() to get a range of physical address
> space large enough to represent the devices memory. This allocated
> physical address range is then remapped as device private memory using
> memremap_pages.
> 
> Needing allocation of physical address space has some problems:
> 
>   1) There may be insufficient physical address space to represent the
>      device memory. KASLR reducing the physical address space and VM
>      configurations with limited physical address space increase the
>      likelihood of hitting this especially as device memory increases. This
>      has been observed to prevent device private from being initialized.  
> 
>   2) Attempting to add the device private pages to the linear map at
>      addresses beyond the actual physical memory causes issues on
>      architectures like aarch64  - meaning the feature does not work there [0].
> 
> This RFC changes device private memory so that it does not require
> allocation of physical address space and these problems are avoided.
> Instead of using the physical address space, we introduce a "device
> private address space" and allocate from there.
> 
> A consequence of placing the device private pages outside of the
> physical address space is that they no longer have a PFN. However, it is
> still necessary to be able to look up a corresponding device private
> page from a device private PTE entry, which means that we still require
> some way to index into this device private address space. This leads to
> the idea of a device private PFN. This is like a PFN but instead of
> associating memory in the physical address space with a struct page, it
> associates device memory in the device private address space with a
> device private struct page.
> 
> The problem that then needs to be addressed is how to avoid confusing
> these device private PFNs with the regular PFNs. It is the inherent
> limited usage of the device private pages themselves which make this
> possible. A device private page is only used for userspace mappings, we
> do not need to be concerned with them being used within the mm more
> broadly. This means that the only way that the core kernel looks up
> these pages is via the page table, where their PTE already indicates if
> they refer to a device private page via their swap type, e.g.
> SWP_DEVICE_WRITE. We can use this information to determine if the PTE
> contains a normal PFN which should be looked up in the page map, or a
> device private PFN which should be looked up elsewhere.
> 
> This applies when we are creating PTE entries for device private pages -
> because they have their own type there are already must be handled
> separately, so it is a small step to convert them to a device private
> PFN now too.
> 
> The first part of the series updates callers where device private PFNs
> might now be encountered to track this extra state.
> 
> The last patch contains the bulk of the work where we change how we
> convert between device private pages to device private PFNs and then use
> a new interface for allocating device private pages without the need for
> reserving physical address space.
> 
> For the purposes of the RFC changes have been limited to test_hmm.c
> updates to the other drivers will be included in the next revision.
> 
> This would include updating existing users of memremap_pages() to use
> memremap_device_private_pagemap() instead to allocate device private
> pages. This also means they would no longer need to call
> request_free_mem_region().  An equivalent of devm_memremap_pages() will
> also be necessary.
> 
> Users of the migrate_vma() interface will also need to be updated to be
> aware these device private PFNs.
> 
> By removing the device private pages from the physical address space,
> this RFC also opens up the possibility to moving away from tracking
> device private memory using struct pages in the future. This is
> desirable as on systems with large amounts of memory these device
> private struct pages use a signifiant amount of memory and take a
> significant amount of time to initialize.

A couple things.

- I’m fairly certain that, briefly looking at this, it will break all
  upstream DRM drivers (AMDKFD, Nouveau, Xe / GPUSVM) that use device
  private pages. I looked into what I think conflicts with Xe / GPUSVM,
  and I believe the impact is fairly minor. I’m happy to help by pulling
  this code and fixing up our side.

- I’m fully on board with eventually moving to something that uses less
  memory than struct page, and I’m happy to coordinate on future changes.

- Before we start coordinating on this patch set, should we hold off until
  the 6.19 cycle, which includes 2M device pages from Balbir [1] (i.e.,
  rebase this series on top of 6.19 once it includes 2M pages)? I suspect
  that, given the scope of this series and Balbir’s, there will be some
  conflicts.

Matt

[1] https://patchwork.freedesktop.org/series/152798/

> 
> Testing:
> - selftests/mm/hmm-tests on an amd64 VM
> 
> [0] https://lore.kernel.org/lkml/CAMj1kXFZ=4hLL1w6iCV5O5uVoVLHAJbc0rr40j24ObenAjXe9w@mail.gmail.com/
> 
> Jordan Niethe (6):
>   mm/hmm: Add flag to track device private PFNs
>   mm/migrate_device: Add migrate PFN flag to track device private PFNs
>   mm/page_vma_mapped: Add flags to page_vma_mapped_walk::pfn to track
>     device private PFNs
>   mm: Add a new swap type for migration entries with device private PFNs
>   mm/util: Add flag to track device private PFNs in page snapshots
>   mm: Remove device private pages from the physical address space
> 
>  Documentation/mm/hmm.rst |   9 +-
>  fs/proc/page.c           |   6 +-
>  include/linux/hmm.h      |   5 ++
>  include/linux/memremap.h |  25 +++++-
>  include/linux/migrate.h  |   5 ++
>  include/linux/mm.h       |   9 +-
>  include/linux/rmap.h     |  33 +++++++-
>  include/linux/swap.h     |   8 +-
>  include/linux/swapops.h  | 102 +++++++++++++++++++++--
>  lib/test_hmm.c           |  66 ++++++++-------
>  mm/debug.c               |   9 +-
>  mm/hmm.c                 |   2 +-
>  mm/memory.c              |   9 +-
>  mm/memremap.c            | 174 +++++++++++++++++++++++++++++----------
>  mm/migrate.c             |   6 +-
>  mm/migrate_device.c      |  44 ++++++----
>  mm/mm_init.c             |   8 +-
>  mm/mprotect.c            |  21 +++--
>  mm/page_vma_mapped.c     |  18 +++-
>  mm/pagewalk.c            |   2 +-
>  mm/rmap.c                |  68 ++++++++++-----
>  mm/util.c                |   8 +-
>  mm/vmscan.c              |   2 +-
>  23 files changed, 485 insertions(+), 154 deletions(-)
> 
> 
> base-commit: e1afacb68573c3cd0a3785c6b0508876cd3423bc
> -- 
> 2.34.1
> 


^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: [RFC PATCH 0/6] Remove device private pages from physical address space
  2025-11-28 19:22 ` Matthew Brost
@ 2025-11-30 23:23   ` Alistair Popple
  2025-12-01  1:51     ` Matthew Brost
  0 siblings, 1 reply; 26+ messages in thread
From: Alistair Popple @ 2025-11-30 23:23 UTC (permalink / raw)
  To: Matthew Brost
  Cc: Jordan Niethe, linux-mm, balbirs, akpm, linux-kernel, dri-devel,
	david, ziy, lorenzo.stoakes, lyude, dakr, airlied, simona,
	rcampbell, mpenttil, jgg, willy

On 2025-11-29 at 06:22 +1100, Matthew Brost <matthew.brost@intel.com> wrote...
> On Fri, Nov 28, 2025 at 03:41:40PM +1100, Jordan Niethe wrote:
> > Today, when creating these device private struct pages, the first step
> > is to use request_free_mem_region() to get a range of physical address
> > space large enough to represent the devices memory. This allocated
> > physical address range is then remapped as device private memory using
> > memremap_pages.
> > 
> > Needing allocation of physical address space has some problems:
> > 
> >   1) There may be insufficient physical address space to represent the
> >      device memory. KASLR reducing the physical address space and VM
> >      configurations with limited physical address space increase the
> >      likelihood of hitting this especially as device memory increases. This
> >      has been observed to prevent device private from being initialized.  
> > 
> >   2) Attempting to add the device private pages to the linear map at
> >      addresses beyond the actual physical memory causes issues on
> >      architectures like aarch64  - meaning the feature does not work there [0].
> > 
> > This RFC changes device private memory so that it does not require
> > allocation of physical address space and these problems are avoided.
> > Instead of using the physical address space, we introduce a "device
> > private address space" and allocate from there.
> > 
> > A consequence of placing the device private pages outside of the
> > physical address space is that they no longer have a PFN. However, it is
> > still necessary to be able to look up a corresponding device private
> > page from a device private PTE entry, which means that we still require
> > some way to index into this device private address space. This leads to
> > the idea of a device private PFN. This is like a PFN but instead of
> > associating memory in the physical address space with a struct page, it
> > associates device memory in the device private address space with a
> > device private struct page.
> > 
> > The problem that then needs to be addressed is how to avoid confusing
> > these device private PFNs with the regular PFNs. It is the inherent
> > limited usage of the device private pages themselves which make this
> > possible. A device private page is only used for userspace mappings, we
> > do not need to be concerned with them being used within the mm more
> > broadly. This means that the only way that the core kernel looks up
> > these pages is via the page table, where their PTE already indicates if
> > they refer to a device private page via their swap type, e.g.
> > SWP_DEVICE_WRITE. We can use this information to determine if the PTE
> > contains a normal PFN which should be looked up in the page map, or a
> > device private PFN which should be looked up elsewhere.
> > 
> > This applies when we are creating PTE entries for device private pages -
> > because they have their own type there are already must be handled
> > separately, so it is a small step to convert them to a device private
> > PFN now too.
> > 
> > The first part of the series updates callers where device private PFNs
> > might now be encountered to track this extra state.
> > 
> > The last patch contains the bulk of the work where we change how we
> > convert between device private pages to device private PFNs and then use
> > a new interface for allocating device private pages without the need for
> > reserving physical address space.
> > 
> > For the purposes of the RFC changes have been limited to test_hmm.c
> > updates to the other drivers will be included in the next revision.
> > 
> > This would include updating existing users of memremap_pages() to use
> > memremap_device_private_pagemap() instead to allocate device private
> > pages. This also means they would no longer need to call
> > request_free_mem_region().  An equivalent of devm_memremap_pages() will
> > also be necessary.
> > 
> > Users of the migrate_vma() interface will also need to be updated to be
> > aware these device private PFNs.
> > 
> > By removing the device private pages from the physical address space,
> > this RFC also opens up the possibility to moving away from tracking
> > device private memory using struct pages in the future. This is
> > desirable as on systems with large amounts of memory these device
> > private struct pages use a signifiant amount of memory and take a
> > significant amount of time to initialize.
> 
> A couple things.
> 
> - I’m fairly certain that, briefly looking at this, it will break all
>   upstream DRM drivers (AMDKFD, Nouveau, Xe / GPUSVM) that use device
>   private pages. I looked into what I think conflicts with Xe / GPUSVM,
>   and I believe the impact is fairly minor. I’m happy to help by pulling
>   this code and fixing up our side.

It most certainly will :-) I think Jordan called that out above but we wanted
to get the design right before spending too much time updating drivers. That
said, I don't think the driver changes should be extensive, but let us know if
you disagree.

> - I’m fully on board with eventually moving to something that uses less
>   memory than struct page, and I’m happy to coordinate on future changes.

Thanks!

> - Before we start coordinating on this patch set, should we hold off until
>   the 6.19 cycle, which includes 2M device pages from Balbir [1] (i.e.,
>   rebase this series on top of 6.19 once it includes 2M pages)? I suspect
>   that, given the scope of this series and Balbir’s, there will be some
>   conflicts.

Our aim here is to get some review of the design and the patches/implementation
for the 6.19 cycle but I agree that this will need to get rebased on top of
Balbir's series.

 - Alistair

> Matt
> 
> [1] https://patchwork.freedesktop.org/series/152798/
> 
> > 
> > Testing:
> > - selftests/mm/hmm-tests on an amd64 VM
> > 
> > [0] https://lore.kernel.org/lkml/CAMj1kXFZ=4hLL1w6iCV5O5uVoVLHAJbc0rr40j24ObenAjXe9w@mail.gmail.com/
> > 
> > Jordan Niethe (6):
> >   mm/hmm: Add flag to track device private PFNs
> >   mm/migrate_device: Add migrate PFN flag to track device private PFNs
> >   mm/page_vma_mapped: Add flags to page_vma_mapped_walk::pfn to track
> >     device private PFNs
> >   mm: Add a new swap type for migration entries with device private PFNs
> >   mm/util: Add flag to track device private PFNs in page snapshots
> >   mm: Remove device private pages from the physical address space
> > 
> >  Documentation/mm/hmm.rst |   9 +-
> >  fs/proc/page.c           |   6 +-
> >  include/linux/hmm.h      |   5 ++
> >  include/linux/memremap.h |  25 +++++-
> >  include/linux/migrate.h  |   5 ++
> >  include/linux/mm.h       |   9 +-
> >  include/linux/rmap.h     |  33 +++++++-
> >  include/linux/swap.h     |   8 +-
> >  include/linux/swapops.h  | 102 +++++++++++++++++++++--
> >  lib/test_hmm.c           |  66 ++++++++-------
> >  mm/debug.c               |   9 +-
> >  mm/hmm.c                 |   2 +-
> >  mm/memory.c              |   9 +-
> >  mm/memremap.c            | 174 +++++++++++++++++++++++++++++----------
> >  mm/migrate.c             |   6 +-
> >  mm/migrate_device.c      |  44 ++++++----
> >  mm/mm_init.c             |   8 +-
> >  mm/mprotect.c            |  21 +++--
> >  mm/page_vma_mapped.c     |  18 +++-
> >  mm/pagewalk.c            |   2 +-
> >  mm/rmap.c                |  68 ++++++++++-----
> >  mm/util.c                |   8 +-
> >  mm/vmscan.c              |   2 +-
> >  23 files changed, 485 insertions(+), 154 deletions(-)
> > 
> > 
> > base-commit: e1afacb68573c3cd0a3785c6b0508876cd3423bc
> > -- 
> > 2.34.1
> > 


^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: [RFC PATCH 0/6] Remove device private pages from physical address space
  2025-11-28  7:40 ` [RFC PATCH 0/6] Remove device private pages from " David Hildenbrand (Red Hat)
@ 2025-11-30 23:33   ` Alistair Popple
  0 siblings, 0 replies; 26+ messages in thread
From: Alistair Popple @ 2025-11-30 23:33 UTC (permalink / raw)
  To: David Hildenbrand (Red Hat)
  Cc: Jordan Niethe, linux-mm, balbirs, matthew.brost, akpm,
	linux-kernel, dri-devel, ziy, lorenzo.stoakes, lyude, dakr,
	airlied, simona, rcampbell, mpenttil, jgg, willy

On 2025-11-28 at 18:40 +1100, "David Hildenbrand (Red Hat)" <david@kernel.org> wrote...
> On 11/28/25 05:41, Jordan Niethe wrote:
> > Today, when creating these device private struct pages, the first step
> > is to use request_free_mem_region() to get a range of physical address
> > space large enough to represent the devices memory. This allocated
> > physical address range is then remapped as device private memory using
> > memremap_pages.
> 
> Just a note that as we are finishing the old release and are about to start
> the merge window (+ there is Thanksgiving), expect few replies to non-urgent
> stuff in the next weeks.

Thanks David! Mostly we just wanted to at least get the RFC out prior to LPC so
I can talk about it there if needed.

> Having that said, the proposal is interesting. I recall that Alistair and
> Jason recently discussed removing the need of dealing with PFNs
> completely for device-private.
> 
> Is that the result of these discussions?

That is certainly something we would like to explore, but this idea mostly
came from a more immediate need: the lack of support on AARCH64, where we
can't just steal random bits of the physical address space (which is
reasonable - the kernel doesn't really "own" the physical memory map after
all), plus the KASLR and VM issues which cause initialisation to fail.

Removing struct pages entirely for at least device private memory is also
something I'd like to explore with this.

> -- 
> Cheers
> 
> David


^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: [RFC PATCH 0/6] Remove device private pages from physical address space
  2025-11-30 23:23   ` Alistair Popple
@ 2025-12-01  1:51     ` Matthew Brost
  2025-12-02  1:40       ` Jordan Niethe
  0 siblings, 1 reply; 26+ messages in thread
From: Matthew Brost @ 2025-12-01  1:51 UTC (permalink / raw)
  To: Alistair Popple
  Cc: Jordan Niethe, linux-mm, balbirs, akpm, linux-kernel, dri-devel,
	david, ziy, lorenzo.stoakes, lyude, dakr, airlied, simona,
	rcampbell, mpenttil, jgg, willy

On Mon, Dec 01, 2025 at 10:23:32AM +1100, Alistair Popple wrote:
> On 2025-11-29 at 06:22 +1100, Matthew Brost <matthew.brost@intel.com> wrote...
> > On Fri, Nov 28, 2025 at 03:41:40PM +1100, Jordan Niethe wrote:
> > > Today, when creating these device private struct pages, the first step
> > > is to use request_free_mem_region() to get a range of physical address
> > > space large enough to represent the devices memory. This allocated
> > > physical address range is then remapped as device private memory using
> > > memremap_pages.
> > > 
> > > Needing allocation of physical address space has some problems:
> > > 
> > >   1) There may be insufficient physical address space to represent the
> > >      device memory. KASLR reducing the physical address space and VM
> > >      configurations with limited physical address space increase the
> > >      likelihood of hitting this especially as device memory increases. This
> > >      has been observed to prevent device private from being initialized.  
> > > 
> > >   2) Attempting to add the device private pages to the linear map at
> > >      addresses beyond the actual physical memory causes issues on
> > >      architectures like aarch64  - meaning the feature does not work there [0].
> > > 
> > > This RFC changes device private memory so that it does not require
> > > allocation of physical address space and these problems are avoided.
> > > Instead of using the physical address space, we introduce a "device
> > > private address space" and allocate from there.
> > > 
> > > A consequence of placing the device private pages outside of the
> > > physical address space is that they no longer have a PFN. However, it is
> > > still necessary to be able to look up a corresponding device private
> > > page from a device private PTE entry, which means that we still require
> > > some way to index into this device private address space. This leads to
> > > the idea of a device private PFN. This is like a PFN but instead of
> > > associating memory in the physical address space with a struct page, it
> > > associates device memory in the device private address space with a
> > > device private struct page.
> > > 
> > > The problem that then needs to be addressed is how to avoid confusing
> > > these device private PFNs with the regular PFNs. It is the inherent
> > > limited usage of the device private pages themselves which make this
> > > possible. A device private page is only used for userspace mappings, we
> > > do not need to be concerned with them being used within the mm more
> > > broadly. This means that the only way that the core kernel looks up
> > > these pages is via the page table, where their PTE already indicates if
> > > they refer to a device private page via their swap type, e.g.
> > > SWP_DEVICE_WRITE. We can use this information to determine if the PTE
> > > contains a normal PFN which should be looked up in the page map, or a
> > > device private PFN which should be looked up elsewhere.
> > > 
> > > This applies when we are creating PTE entries for device private pages -
> > > because they have their own type there are already must be handled
> > > separately, so it is a small step to convert them to a device private
> > > PFN now too.
> > > 
> > > The first part of the series updates callers where device private PFNs
> > > might now be encountered to track this extra state.
> > > 
> > > The last patch contains the bulk of the work where we change how we
> > > convert between device private pages to device private PFNs and then use
> > > a new interface for allocating device private pages without the need for
> > > reserving physical address space.
> > > 
> > > For the purposes of the RFC changes have been limited to test_hmm.c
> > > updates to the other drivers will be included in the next revision.
> > > 
> > > This would include updating existing users of memremap_pages() to use
> > > memremap_device_private_pagemap() instead to allocate device private
> > > pages. This also means they would no longer need to call
> > > request_free_mem_region().  An equivalent of devm_memremap_pages() will
> > > also be necessary.
> > > 
> > > Users of the migrate_vma() interface will also need to be updated to be
> > > aware these device private PFNs.
> > > 
> > > By removing the device private pages from the physical address space,
> > > this RFC also opens up the possibility to moving away from tracking
> > > device private memory using struct pages in the future. This is
> > > desirable as on systems with large amounts of memory these device
> > > private struct pages use a signifiant amount of memory and take a
> > > significant amount of time to initialize.
> > 
> > A couple things.
> > 
> > - I’m fairly certain that, briefly looking at this, it will break all
> >   upstream DRM drivers (AMDKFD, Nouveau, Xe / GPUSVM) that use device
> >   private pages. I looked into what I think conflicts with Xe / GPUSVM,
> >   and I believe the impact is fairly minor. I’m happy to help by pulling
> >   this code and fixing up our side.
> 
> It most certainly will :-) I think Jordan called that out above but we wanted

I don't always read.

> to get the design right before spending too much time updating drivers. That
> said I don't think the driver changes should be extensive, but let us know if
> you disagree.

I did a quick look, and I believe it's pretty minor (e.g., pfn_to_page is
used in a few places for device pages and would need a refactor, etc.).
Maybe a bit more; we will find out, but I'm not too concerned.
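
For example (illustrative only, not an actual GPUSVM hunk), lookups of
device private pages in the drivers would switch helpers much like the
mm/ changes in this series do:

	-	page = pfn_to_page(pfn);
	+	page = device_private_offset_to_page(pfn);

where pfn was taken from a device private swap entry or migrate pfn.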

> 
> > - I’m fully on board with eventually moving to something that uses less
> >   memory than struct page, and I’m happy to coordinate on future changes.
> 
> Thanks!
> 
> > - Before we start coordinating on this patch set, should we hold off until
> >   the 6.19 cycle, which includes 2M device pages from Balbir [1] (i.e.,
> >   rebase this series on top of 6.19 once it includes 2M pages)? I suspect
> >   that, given the scope of this series and Balbir’s, there will be some
> >   conflicts.
> 
> Our aim here is to get some review of the design and the patches/implementation
> for the 6.19 cycle but I agree that this will need to get rebased on top of
> Balbir's series.

+1. Will be on the lookout for the next post, pull it into the 6.19 DRM
tree, and at least test out the Intel stuff + send fixes if needed.

I can enable both of you for Intel CI too; just include the intel-xe list
on the next post and CI will get kicked off, and you can find the results
on Patchwork.

Matt

> 
>  - Alistair
> 
> > Matt
> > 
> > [1] https://patchwork.freedesktop.org/series/152798/
> > 
> > > 
> > > Testing:
> > > - selftests/mm/hmm-tests on an amd64 VM
> > > 
> > > [0] https://lore.kernel.org/lkml/CAMj1kXFZ=4hLL1w6iCV5O5uVoVLHAJbc0rr40j24ObenAjXe9w@mail.gmail.com/
> > > 
> > > Jordan Niethe (6):
> > >   mm/hmm: Add flag to track device private PFNs
> > >   mm/migrate_device: Add migrate PFN flag to track device private PFNs
> > >   mm/page_vma_mapped: Add flags to page_vma_mapped_walk::pfn to track
> > >     device private PFNs
> > >   mm: Add a new swap type for migration entries with device private PFNs
> > >   mm/util: Add flag to track device private PFNs in page snapshots
> > >   mm: Remove device private pages from the physical address space
> > > 
> > >  Documentation/mm/hmm.rst |   9 +-
> > >  fs/proc/page.c           |   6 +-
> > >  include/linux/hmm.h      |   5 ++
> > >  include/linux/memremap.h |  25 +++++-
> > >  include/linux/migrate.h  |   5 ++
> > >  include/linux/mm.h       |   9 +-
> > >  include/linux/rmap.h     |  33 +++++++-
> > >  include/linux/swap.h     |   8 +-
> > >  include/linux/swapops.h  | 102 +++++++++++++++++++++--
> > >  lib/test_hmm.c           |  66 ++++++++-------
> > >  mm/debug.c               |   9 +-
> > >  mm/hmm.c                 |   2 +-
> > >  mm/memory.c              |   9 +-
> > >  mm/memremap.c            | 174 +++++++++++++++++++++++++++++----------
> > >  mm/migrate.c             |   6 +-
> > >  mm/migrate_device.c      |  44 ++++++----
> > >  mm/mm_init.c             |   8 +-
> > >  mm/mprotect.c            |  21 +++--
> > >  mm/page_vma_mapped.c     |  18 +++-
> > >  mm/pagewalk.c            |   2 +-
> > >  mm/rmap.c                |  68 ++++++++++-----
> > >  mm/util.c                |   8 +-
> > >  mm/vmscan.c              |   2 +-
> > >  23 files changed, 485 insertions(+), 154 deletions(-)
> > > 
> > > 
> > > base-commit: e1afacb68573c3cd0a3785c6b0508876cd3423bc
> > > -- 
> > > 2.34.1
> > > 


^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: [RFC PATCH 4/6] mm: Add a new swap type for migration entries with device private PFNs
  2025-11-28  4:41 ` [RFC PATCH 4/6] mm: Add a new swap type for migration entries with " Jordan Niethe
@ 2025-12-01  2:43   ` Chih-En Lin
  2025-12-02  1:42     ` Jordan Niethe
  0 siblings, 1 reply; 26+ messages in thread
From: Chih-En Lin @ 2025-12-01  2:43 UTC (permalink / raw)
  To: Jordan Niethe
  Cc: linux-mm, balbirs, matthew.brost, akpm, linux-kernel, dri-devel,
	david, ziy, apopple, lorenzo.stoakes, lyude, dakr, airlied,
	simona, rcampbell, mpenttil, jgg, willy

On Fri, Nov 28, 2025 at 03:41:44PM +1100, Jordan Niethe wrote:
> A future change will remove device private pages from the physical
> address space. This will mean that device private pages no longer have
> normal PFN and must be handled separately.
> 
> When migrating a device private page a migration entry is created for
> that page - this includes the PFN for that page. Once device private
> PFNs exist in a different address space to regular PFNs we need to be
> able to determine which kind of PFN is in the entry so we can associate
> it with the correct page.
> 
> Introduce new swap types:
> 
>   - SWP_MIGRATION_DEVICE_READ
>   - SWP_MIGRATION_DEVICE_WRITE
>   - SWP_MIGRATION_DEVICE_READ_EXCLUSIVE
> 
> These correspond to
> 
>   - SWP_MIGRATION_READ
>   - SWP_MIGRATION_WRITE
>   - SWP_MIGRATION_READ_EXCLUSIVE
> 
> except the swap entry contains a device private PFN.
> 
> The existing helpers such as is_writable_migration_entry() will still
> return true for a SWP_MIGRATION_DEVICE_WRITE entry.
> 
> Introduce new helpers such as
> is_writable_device_migration_private_entry() to disambiguate between a
> SWP_MIGRATION_WRITE and a SWP_MIGRATION_DEVICE_WRITE entry.
> 
> Signed-off-by: Jordan Niethe <jniethe@nvidia.com>
> Signed-off-by: Alistair Popple <apopple@nvidia.com>
> ---
>  include/linux/swap.h    |  8 +++-
>  include/linux/swapops.h | 87 ++++++++++++++++++++++++++++++++++++++---
>  mm/memory.c             |  9 ++++-
>  mm/migrate.c            |  2 +-
>  mm/migrate_device.c     | 31 ++++++++++-----
>  mm/mprotect.c           | 21 +++++++---
>  mm/page_vma_mapped.c    |  2 +-
>  mm/pagewalk.c           |  3 +-
>  mm/rmap.c               | 32 ++++++++++-----
>  9 files changed, 161 insertions(+), 34 deletions(-)
> 
> diff --git a/include/linux/swap.h b/include/linux/swap.h
> index e818fbade1e2..87f14d673979 100644
> --- a/include/linux/swap.h
> +++ b/include/linux/swap.h
> @@ -74,12 +74,18 @@ static inline int current_is_kswapd(void)
>   *
>   * When a page is mapped by the device for exclusive access we set the CPU page
>   * table entries to a special SWP_DEVICE_EXCLUSIVE entry.
> + *
> + * Because device private pages do not use regular PFNs, special migration
> + * entries are also needed.
>   */
>  #ifdef CONFIG_DEVICE_PRIVATE
> -#define SWP_DEVICE_NUM 3
> +#define SWP_DEVICE_NUM 6
>  #define SWP_DEVICE_WRITE (MAX_SWAPFILES+SWP_HWPOISON_NUM+SWP_MIGRATION_NUM)
>  #define SWP_DEVICE_READ (MAX_SWAPFILES+SWP_HWPOISON_NUM+SWP_MIGRATION_NUM+1)
>  #define SWP_DEVICE_EXCLUSIVE (MAX_SWAPFILES+SWP_HWPOISON_NUM+SWP_MIGRATION_NUM+2)
> +#define SWP_MIGRATION_DEVICE_READ (MAX_SWAPFILES+SWP_HWPOISON_NUM+SWP_MIGRATION_NUM+3)
> +#define SWP_MIGRATION_DEVICE_READ_EXCLUSIVE (MAX_SWAPFILES+SWP_HWPOISON_NUM+SWP_MIGRATION_NUM+4)
> +#define SWP_MIGRATION_DEVICE_WRITE (MAX_SWAPFILES+SWP_HWPOISON_NUM+SWP_MIGRATION_NUM+5)
>  #else
>  #define SWP_DEVICE_NUM 0
>  #endif
> diff --git a/include/linux/swapops.h b/include/linux/swapops.h
> index 64ea151a7ae3..7aa3f00e304a 100644
> --- a/include/linux/swapops.h
> +++ b/include/linux/swapops.h
> @@ -196,6 +196,43 @@ static inline bool is_device_exclusive_entry(swp_entry_t entry)
>  	return swp_type(entry) == SWP_DEVICE_EXCLUSIVE;
>  }
>  
> +static inline swp_entry_t make_readable_migration_device_private_entry(pgoff_t offset)
> +{
> +	return swp_entry(SWP_MIGRATION_DEVICE_READ, offset);
> +}
> +
> +static inline swp_entry_t make_writable_migration_device_private_entry(pgoff_t offset)
> +{
> +	return swp_entry(SWP_MIGRATION_DEVICE_WRITE, offset);
> +}
> +
> +static inline bool is_device_private_migration_entry(swp_entry_t entry)
> +{
> +	return unlikely(swp_type(entry) == SWP_MIGRATION_DEVICE_READ ||
> +			swp_type(entry) == SWP_MIGRATION_DEVICE_READ_EXCLUSIVE ||
> +			swp_type(entry) == SWP_MIGRATION_DEVICE_WRITE);
> +}
> +
> +static inline bool is_readable_device_migration_private_entry(swp_entry_t entry)
> +{
> +	return unlikely(swp_type(entry) == SWP_MIGRATION_DEVICE_READ);
> +}
> +
> +static inline bool is_writable_device_migration_private_entry(swp_entry_t entry)
> +{
> +	return unlikely(swp_type(entry) == SWP_MIGRATION_DEVICE_WRITE);
> +}
> +
> +static inline swp_entry_t make_device_migration_readable_exclusive_migration_entry(pgoff_t offset)
> +{
> +	return swp_entry(SWP_MIGRATION_DEVICE_READ_EXCLUSIVE, offset);
> +}
> +
> +static inline bool is_device_migration_readable_exclusive_entry(swp_entry_t entry)
> +{
> +	return swp_type(entry) == SWP_MIGRATION_DEVICE_READ_EXCLUSIVE;
> +}

The names are inconsistent.

Maybe rename make_device_migration_readable_exclusive_migration_entry() to
make_readable_exclusive_migration_device_private_entry(), and
is_device_migration_readable_exclusive_entry() to
is_readable_exclusive_device_private_migration_entry()?


>  #else /* CONFIG_DEVICE_PRIVATE */
>  static inline swp_entry_t make_readable_device_private_entry(pgoff_t offset)
>  {
> @@ -217,6 +254,11 @@ static inline bool is_writable_device_private_entry(swp_entry_t entry)
>  	return false;
>  }
>  
> +static inline bool is_readable_device_migration_private_entry(swp_entry_t entry)
> +{
> +	return false;
> +}
> +
>  static inline swp_entry_t make_device_exclusive_entry(pgoff_t offset)
>  {
>  	return swp_entry(0, 0);
> @@ -227,6 +269,36 @@ static inline bool is_device_exclusive_entry(swp_entry_t entry)
>  	return false;
>  }
>  
> +static inline swp_entry_t make_readable_migration_device_private_entry(pgoff_t offset)
> +{
> +	return swp_entry(0, 0);
> +}
> +
> +static inline swp_entry_t make_writable_migration_device_private_entry(pgoff_t offset)
> +{
> +	return swp_entry(0, 0);
> +}
> +
> +static inline bool is_device_private_migration_entry(swp_entry_t entry)
> +{
> +	return false;
> +}
> +
> +static inline bool is_writable_device_migration_private_entry(swp_entry_t entry)
> +{
> +	return false;
> +}
> +
> +static inline swp_entry_t make_device_migration_readable_exclusive_migration_entry(pgoff_t offset)
> +{
> +	return swp_entry(0, 0);
> +}
> +
> +static inline bool is_device_migration_readable_exclusive_entry(swp_entry_t entry)
> +{
> +	return false;
> +}
> +
>  #endif /* CONFIG_DEVICE_PRIVATE */
>  
>  #ifdef CONFIG_MIGRATION
> @@ -234,22 +306,26 @@ static inline int is_migration_entry(swp_entry_t entry)
>  {
>  	return unlikely(swp_type(entry) == SWP_MIGRATION_READ ||
>  			swp_type(entry) == SWP_MIGRATION_READ_EXCLUSIVE ||
> -			swp_type(entry) == SWP_MIGRATION_WRITE);
> +			swp_type(entry) == SWP_MIGRATION_WRITE ||
> +			is_device_private_migration_entry(entry));
>  }
>  
>  static inline int is_writable_migration_entry(swp_entry_t entry)
>  {
> -	return unlikely(swp_type(entry) == SWP_MIGRATION_WRITE);
> +	return unlikely(swp_type(entry) == SWP_MIGRATION_WRITE ||
> +			is_writable_device_migration_private_entry(entry));
>  }
>  
>  static inline int is_readable_migration_entry(swp_entry_t entry)
>  {
> -	return unlikely(swp_type(entry) == SWP_MIGRATION_READ);
> +	return unlikely(swp_type(entry) == SWP_MIGRATION_READ ||
> +			is_readable_device_migration_private_entry(entry));
>  }
>  
>  static inline int is_readable_exclusive_migration_entry(swp_entry_t entry)
>  {
> -	return unlikely(swp_type(entry) == SWP_MIGRATION_READ_EXCLUSIVE);
> +	return unlikely(swp_type(entry) == SWP_MIGRATION_READ_EXCLUSIVE ||
> +			is_device_migration_readable_exclusive_entry(entry));
>  }
>  
>  static inline swp_entry_t make_readable_migration_entry(pgoff_t offset)
> @@ -525,7 +601,8 @@ static inline bool is_pfn_swap_entry(swp_entry_t entry)
>  	BUILD_BUG_ON(SWP_TYPE_SHIFT < SWP_PFN_BITS);
>  
>  	return is_migration_entry(entry) || is_device_private_entry(entry) ||
> -	       is_device_exclusive_entry(entry) || is_hwpoison_entry(entry);
> +	       is_device_exclusive_entry(entry) || is_hwpoison_entry(entry) ||
> +	       is_device_private_migration_entry(entry);
>  }
>  
>  struct page_vma_mapped_walk;
> diff --git a/mm/memory.c b/mm/memory.c
> index b59ae7ce42eb..f1ed361434ff 100644
> --- a/mm/memory.c
> +++ b/mm/memory.c
> @@ -962,8 +962,13 @@ copy_nonpresent_pte(struct mm_struct *dst_mm, struct mm_struct *src_mm,
>  			 * to be set to read. A previously exclusive entry is
>  			 * now shared.
>  			 */
> -			entry = make_readable_migration_entry(
> -							swp_offset(entry));
> +			if (is_device_private_migration_entry(entry))
> +				entry = make_readable_migration_device_private_entry(
> +								swp_offset(entry));
> +			else
> +				entry = make_readable_migration_entry(
> +								swp_offset(entry));
> +
>  			pte = swp_entry_to_pte(entry);
>  			if (pte_swp_soft_dirty(orig_pte))
>  				pte = pte_swp_mksoft_dirty(pte);
> diff --git a/mm/migrate.c b/mm/migrate.c
> index c0e9f15be2a2..3c561d61afba 100644
> --- a/mm/migrate.c
> +++ b/mm/migrate.c
> @@ -495,7 +495,7 @@ void migration_entry_wait(struct mm_struct *mm, pmd_t *pmd,
>  		goto out;
>  
>  	entry = pte_to_swp_entry(pte);
> -	if (!is_migration_entry(entry))
> +	if (!(is_migration_entry(entry)))
>  		goto out;
>  
>  	migration_entry_wait_on_locked(entry, ptl);
> diff --git a/mm/migrate_device.c b/mm/migrate_device.c
> index 82f09b24d913..458b5114bb2b 100644
> --- a/mm/migrate_device.c
> +++ b/mm/migrate_device.c
> @@ -235,15 +235,28 @@ static int migrate_vma_collect_pmd(pmd_t *pmdp,
>  				folio_mark_dirty(folio);
>  
>  			/* Setup special migration page table entry */
> -			if (mpfn & MIGRATE_PFN_WRITE)
> -				entry = make_writable_migration_entry(
> -							page_to_pfn(page));
> -			else if (anon_exclusive)
> -				entry = make_readable_exclusive_migration_entry(
> -							page_to_pfn(page));
> -			else
> -				entry = make_readable_migration_entry(
> -							page_to_pfn(page));
> +			if (mpfn & MIGRATE_PFN_WRITE) {
> +				if (is_device_private_page(page))
> +					entry = make_writable_migration_device_private_entry(
> +								page_to_pfn(page));
> +				else
> +					entry = make_writable_migration_entry(
> +								page_to_pfn(page));
> +			} else if (anon_exclusive) {
> +				if (is_device_private_page(page))
> +					entry = make_device_migration_readable_exclusive_migration_entry(
> +								page_to_pfn(page));
> +				else
> +					entry = make_readable_exclusive_migration_entry(
> +								page_to_pfn(page));
> +			} else {
> +				if (is_device_private_page(page))
> +					entry = make_readable_migration_device_private_entry(
> +								page_to_pfn(page));
> +				else
> +					entry = make_readable_migration_entry(
> +								page_to_pfn(page));
> +			}
>  			if (pte_present(pte)) {
>  				if (pte_young(pte))
>  					entry = make_migration_entry_young(entry);
> diff --git a/mm/mprotect.c b/mm/mprotect.c
> index 113b48985834..7d79a0f53bf5 100644
> --- a/mm/mprotect.c
> +++ b/mm/mprotect.c
> @@ -365,11 +365,22 @@ static long change_pte_range(struct mmu_gather *tlb,
>  				 * A protection check is difficult so
>  				 * just be safe and disable write
>  				 */
> -				if (folio_test_anon(folio))
> -					entry = make_readable_exclusive_migration_entry(
> -							     swp_offset(entry));
> -				else
> -					entry = make_readable_migration_entry(swp_offset(entry));
> +				if (!is_writable_device_migration_private_entry(entry)) {
> +					if (folio_test_anon(folio))
> +						entry = make_readable_exclusive_migration_entry(
> +								swp_offset(entry));
> +					else
> +						entry = make_readable_migration_entry(
> +								swp_offset(entry));
> +				} else {
> +					if (folio_test_anon(folio))
> +						entry = make_device_migration_readable_exclusive_migration_entry(
> +								swp_offset(entry));
> +					else
> +						entry = make_readable_migration_device_private_entry(
> +								swp_offset(entry));
> +				}
> +
>  				newpte = swp_entry_to_pte(entry);
>  				if (pte_swp_soft_dirty(oldpte))
>  					newpte = pte_swp_mksoft_dirty(newpte);
> diff --git a/mm/page_vma_mapped.c b/mm/page_vma_mapped.c
> index 9146bd084435..e9fe747d3df3 100644
> --- a/mm/page_vma_mapped.c
> +++ b/mm/page_vma_mapped.c
> @@ -112,7 +112,7 @@ static bool check_pte(struct page_vma_mapped_walk *pvmw, unsigned long pte_nr)
>  			return false;
>  		entry = pte_to_swp_entry(ptent);
>  
> -		if (!is_migration_entry(entry))
> +		if (!(is_migration_entry(entry)))
>  			return false;
>  
>  		pfn = swp_offset_pfn(entry);
> diff --git a/mm/pagewalk.c b/mm/pagewalk.c
> index 9f91cf85a5be..f5c77dda3359 100644
> --- a/mm/pagewalk.c
> +++ b/mm/pagewalk.c
> @@ -1003,7 +1003,8 @@ struct folio *folio_walk_start(struct folio_walk *fw,
>  		swp_entry_t entry = pte_to_swp_entry(pte);
>  
>  		if ((flags & FW_MIGRATION) &&
> -		    is_migration_entry(entry)) {
> +		    (is_migration_entry(entry) ||
> +		     is_device_private_migration_entry(entry))) {
>  			page = pfn_swap_entry_to_page(entry);
>  			expose_page = false;
>  			goto found;
> diff --git a/mm/rmap.c b/mm/rmap.c
> index e94500318f92..9642a79cbdb4 100644
> --- a/mm/rmap.c
> +++ b/mm/rmap.c
> @@ -2535,15 +2535,29 @@ static bool try_to_migrate_one(struct folio *folio, struct vm_area_struct *vma,
>  			 * pte. do_swap_page() will wait until the migration
>  			 * pte is removed and then restart fault handling.
>  			 */
> -			if (writable)
> -				entry = make_writable_migration_entry(
> -							page_to_pfn(subpage));
> -			else if (anon_exclusive)
> -				entry = make_readable_exclusive_migration_entry(
> -							page_to_pfn(subpage));
> -			else
> -				entry = make_readable_migration_entry(
> -							page_to_pfn(subpage));
> +			if (writable) {
> +				if (is_device_private_page(subpage))
> +					entry = make_writable_migration_device_private_entry(
> +								page_to_pfn(subpage));
> +				else
> +					entry = make_writable_migration_entry(
> +								page_to_pfn(subpage));
> +			} else if (anon_exclusive) {
> +				if (is_device_private_page(subpage))
> +					entry = make_device_migration_readable_exclusive_migration_entry(
> +								page_to_pfn(subpage));
> +				else
> +					entry = make_readable_exclusive_migration_entry(
> +								page_to_pfn(subpage));
> +			} else {
> +				if (is_device_private_page(subpage))
> +					entry = make_readable_migration_device_private_entry(
> +								page_to_pfn(subpage));
> +				else
> +					entry = make_readable_migration_entry(
> +								page_to_pfn(subpage));
> +			}
> +
>  			if (likely(pte_present(pteval))) {
>  				if (pte_young(pteval))
>  					entry = make_migration_entry_young(entry);
> -- 
> 2.34.1
> 

Thanks,
Chih-En Lin


^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: [RFC PATCH 1/6] mm/hmm: Add flag to track device private PFNs
  2025-11-28 18:36   ` Matthew Brost
@ 2025-12-02  1:20     ` Jordan Niethe
  2025-12-03  4:25       ` Balbir Singh
  0 siblings, 1 reply; 26+ messages in thread
From: Jordan Niethe @ 2025-12-02  1:20 UTC (permalink / raw)
  To: Matthew Brost
  Cc: linux-mm, balbirs, akpm, linux-kernel, dri-devel, david, ziy,
	apopple, lorenzo.stoakes, lyude, dakr, airlied, simona,
	rcampbell, mpenttil, jgg, willy

Hi,

On 29/11/25 05:36, Matthew Brost wrote:
> On Fri, Nov 28, 2025 at 03:41:41PM +1100, Jordan Niethe wrote:
>> A future change will remove device private pages from the physical
>> address space. This will mean that device private pages no longer have
>> normal PFN and must be handled separately.
>>
>> Prepare for this by adding a HMM_PFN_DEVICE_PRIVATE flag to indicate
>> that a hmm_pfn contains a PFN for a device private page.
>>
>> Signed-off-by: Jordan Niethe <jniethe@nvidia.com>
>> Signed-off-by: Alistair Popple <apopple@nvidia.com>
>> ---
>>   include/linux/hmm.h | 2 ++
>>   mm/hmm.c            | 2 +-
>>   2 files changed, 3 insertions(+), 1 deletion(-)
>>
>> diff --git a/include/linux/hmm.h b/include/linux/hmm.h
>> index db75ffc949a7..df571fa75a44 100644
>> --- a/include/linux/hmm.h
>> +++ b/include/linux/hmm.h
>> @@ -23,6 +23,7 @@ struct mmu_interval_notifier;
>>    * HMM_PFN_WRITE - if the page memory can be written to (requires HMM_PFN_VALID)
>>    * HMM_PFN_ERROR - accessing the pfn is impossible and the device should
>>    *                 fail. ie poisoned memory, special pages, no vma, etc
>> + * HMM_PFN_DEVICE_PRIVATE - the pfn field contains a DEVICE_PRIVATE pfn.
>>    * HMM_PFN_P2PDMA - P2P page
>>    * HMM_PFN_P2PDMA_BUS - Bus mapped P2P transfer
>>    * HMM_PFN_DMA_MAPPED - Flag preserved on input-to-output transformation
>> @@ -40,6 +41,7 @@ enum hmm_pfn_flags {
>>   	HMM_PFN_VALID = 1UL << (BITS_PER_LONG - 1),
>>   	HMM_PFN_WRITE = 1UL << (BITS_PER_LONG - 2),
>>   	HMM_PFN_ERROR = 1UL << (BITS_PER_LONG - 3),
>> +	HMM_PFN_DEVICE_PRIVATE = 1UL << (BITS_PER_LONG - 7),
>>   	/*
>>   	 * Sticky flags, carried from input to output,
>>   	 * don't forget to update HMM_PFN_INOUT_FLAGS
>> diff --git a/mm/hmm.c b/mm/hmm.c
>> index 87562914670a..1cff68ade1d4 100644
>> --- a/mm/hmm.c
>> +++ b/mm/hmm.c
>> @@ -262,7 +262,7 @@ static int hmm_vma_handle_pte(struct mm_walk *walk, unsigned long addr,
>>   		if (is_device_private_entry(entry) &&
>>   		    page_pgmap(pfn_swap_entry_to_page(entry))->owner ==
>>   		    range->dev_private_owner) {
>> -			cpu_flags = HMM_PFN_VALID;
>> +			cpu_flags = HMM_PFN_VALID | HMM_PFN_DEVICE_PRIVATE;
> 
> I think you’ll need to set this flag in hmm_vma_handle_absent_pmd as
> well. That function handles 2M device pages. Support for 2M device
> pages, I believe, will be included in the 6.19 PR, but
> hmm_vma_handle_absent_pmd is already upstream.

Thanks Matt, I agree. There will be a few more updates to this
series for 2MB device pages - I'll send the next revision on top of that
support.

Jordan.

> 
> Matt
> 
>>   			if (is_writable_device_private_entry(entry))
>>   				cpu_flags |= HMM_PFN_WRITE;
>>   			new_pfn_flags = swp_offset_pfn(entry) | cpu_flags;
>> -- 
>> 2.34.1
>>



^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: [RFC PATCH 0/6] Remove device private pages from physical address space
  2025-11-28 15:09 ` Matthew Wilcox
@ 2025-12-02  1:31   ` Jordan Niethe
  0 siblings, 0 replies; 26+ messages in thread
From: Jordan Niethe @ 2025-12-02  1:31 UTC (permalink / raw)
  To: Matthew Wilcox
  Cc: linux-mm, balbirs, matthew.brost, akpm, linux-kernel, dri-devel,
	david, ziy, apopple, lorenzo.stoakes, lyude, dakr, airlied,
	simona, rcampbell, mpenttil, jgg

Hi,

On 29/11/25 02:09, Matthew Wilcox wrote:
> On Fri, Nov 28, 2025 at 03:41:40PM +1100, Jordan Niethe wrote:
>> A consequence of placing the device private pages outside of the
>> physical address space is that they no longer have a PFN. However, it is
>> still necessary to be able to look up a corresponding device private
>> page from a device private PTE entry, which means that we still require
>> some way to index into this device private address space. This leads to
>> the idea of a device private PFN. This is like a PFN but instead of
> 
> Don't call it a "device private PFN".  That's going to lead to
> confusion.  Device private index?  Device memory index?

Sure, I think 'device memory index' is fine. What I was trying to
express with 'device private PFN' here is that each index into device
memory still represents a PAGE_SIZE region, but I agree it leads to
further confusion.

Thanks,
Jordan.

> 
>> By removing the device private pages from the physical address space,
>> this RFC also opens up the possibility to moving away from tracking
>> device private memory using struct pages in the future. This is
>> desirable as on systems with large amounts of memory these device
>> private struct pages use a signifiant amount of memory and take a
>> significant amount of time to initialize.
> 
> I did tell Jerome he was making a huge mistake with his design, but
> he forced it in anyway.



^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: [RFC PATCH 0/6] Remove device private pages from physical address space
  2025-11-28 16:07 ` Mika Penttilä
@ 2025-12-02  1:32   ` Jordan Niethe
  0 siblings, 0 replies; 26+ messages in thread
From: Jordan Niethe @ 2025-12-02  1:32 UTC (permalink / raw)
  To: Mika Penttilä, linux-mm
  Cc: balbirs, matthew.brost, akpm, linux-kernel, dri-devel, david,
	ziy, apopple, lorenzo.stoakes, lyude, dakr, airlied, simona,
	rcampbell, jgg, willy

Hi,

On 29/11/25 03:07, Mika Penttilä wrote:
> Hi Jordan!
> 
> On 11/28/25 06:41, Jordan Niethe wrote:
> 
>> Today, when creating these device private struct pages, the first step
>> is to use request_free_mem_region() to get a range of physical address
>> space large enough to represent the devices memory. This allocated
>> physical address range is then remapped as device private memory using
>> memremap_pages.
>>
> I just did a quick read thru, and liked how it turned out to be, nice work!
> 
> --Mika

Thanks Mika.

Jordan.

> 
> 



^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: [RFC PATCH 0/6] Remove device private pages from physical address space
  2025-12-01  1:51     ` Matthew Brost
@ 2025-12-02  1:40       ` Jordan Niethe
  0 siblings, 0 replies; 26+ messages in thread
From: Jordan Niethe @ 2025-12-02  1:40 UTC (permalink / raw)
  To: Matthew Brost, Alistair Popple
  Cc: linux-mm, balbirs, akpm, linux-kernel, dri-devel, david, ziy,
	lorenzo.stoakes, lyude, dakr, airlied, simona, rcampbell,
	mpenttil, jgg, willy

Hi,

On 1/12/25 12:51, Matthew Brost wrote:
> On Mon, Dec 01, 2025 at 10:23:32AM +1100, Alistair Popple wrote:
>> On 2025-11-29 at 06:22 +1100, Matthew Brost <matthew.brost@intel.com> wrote...
>>> On Fri, Nov 28, 2025 at 03:41:40PM +1100, Jordan Niethe wrote:
>>>> Today, when creating these device private struct pages, the first step
>>>> is to use request_free_mem_region() to get a range of physical address
>>>> space large enough to represent the devices memory. This allocated
>>>> physical address range is then remapped as device private memory using
>>>> memremap_pages.
>>>>
>>>> Needing allocation of physical address space has some problems:
>>>>
>>>>    1) There may be insufficient physical address space to represent the
>>>>       device memory. KASLR reducing the physical address space and VM
>>>>       configurations with limited physical address space increase the
>>>>       likelihood of hitting this especially as device memory increases. This
>>>>       has been observed to prevent device private from being initialized.
>>>>
>>>>    2) Attempting to add the device private pages to the linear map at
>>>>       addresses beyond the actual physical memory causes issues on
>>>>       architectures like aarch64  - meaning the feature does not work there [0].
>>>>
>>>> This RFC changes device private memory so that it does not require
>>>> allocation of physical address space and these problems are avoided.
>>>> Instead of using the physical address space, we introduce a "device
>>>> private address space" and allocate from there.
>>>>
>>>> A consequence of placing the device private pages outside of the
>>>> physical address space is that they no longer have a PFN. However, it is
>>>> still necessary to be able to look up a corresponding device private
>>>> page from a device private PTE entry, which means that we still require
>>>> some way to index into this device private address space. This leads to
>>>> the idea of a device private PFN. This is like a PFN but instead of
>>>> associating memory in the physical address space with a struct page, it
>>>> associates device memory in the device private address space with a
>>>> device private struct page.
>>>>
>>>> The problem that then needs to be addressed is how to avoid confusing
>>>> these device private PFNs with the regular PFNs. It is the inherent
>>>> limited usage of the device private pages themselves which make this
>>>> possible. A device private page is only used for userspace mappings, we
>>>> do not need to be concerned with them being used within the mm more
>>>> broadly. This means that the only way that the core kernel looks up
>>>> these pages is via the page table, where their PTE already indicates if
>>>> they refer to a device private page via their swap type, e.g.
>>>> SWP_DEVICE_WRITE. We can use this information to determine if the PTE
>>>> contains a normal PFN which should be looked up in the page map, or a
>>>> device private PFN which should be looked up elsewhere.
>>>>
>>>> This applies when we are creating PTE entries for device private pages -
>>>> because they have their own type there are already must be handled
>>>> separately, so it is a small step to convert them to a device private
>>>> PFN now too.
>>>>
>>>> The first part of the series updates callers where device private PFNs
>>>> might now be encountered to track this extra state.
>>>>
>>>> The last patch contains the bulk of the work where we change how we
>>>> convert between device private pages to device private PFNs and then use
>>>> a new interface for allocating device private pages without the need for
>>>> reserving physical address space.
>>>>
>>>> For the purposes of the RFC changes have been limited to test_hmm.c
>>>> updates to the other drivers will be included in the next revision.
>>>>
>>>> This would include updating existing users of memremap_pages() to use
>>>> memremap_device_private_pagemap() instead to allocate device private
>>>> pages. This also means they would no longer need to call
>>>> request_free_mem_region().  An equivalent of devm_memremap_pages() will
>>>> also be necessary.
>>>>
>>>> Users of the migrate_vma() interface will also need to be updated to be
>>>> aware these device private PFNs.
>>>>
>>>> By removing the device private pages from the physical address space,
>>>> this RFC also opens up the possibility to moving away from tracking
>>>> device private memory using struct pages in the future. This is
>>>> desirable as on systems with large amounts of memory these device
>>>> private struct pages use a significant amount of memory and take a
>>>> significant amount of time to initialize.
>>>
>>> A couple things.
>>>
>>> - I’m fairly certain that, briefly looking at this, it will break all
>>>    upstream DRM drivers (AMDKFD, Nouveau, Xe / GPUSVM) that use device
>>>    private pages. I looked into what I think conflicts with Xe / GPUSVM,
>>>    and I believe the impact is fairly minor. I’m happy to help by pulling
>>>    this code and fixing up our side.
>>
>> It most certainly will :-) I think Jordan called that out above but we wanted
> 
> I don't always read.
> 
>> to get the design right before spending too much time updating drivers. That
>> said I don't think the driver changes should be extensive, but let us know if
>> you disagree.
> 
> I did a quick look, and I believe it's pretty minor (e.g., pfn_to_page is used in a
> few places for device pages which would need a refactor, etc.). Maybe
> a bit more; we will find out, but I'm not too concerned.

Yes, the existing drivers will need to be updated to use the new
interface. It should be a mechanical enough change that I can include
the driver updates myself in the next revision, but will need some help
testing. Just wanted to get some feedback on the general approach first.

> 
>>
>>> - I’m fully on board with eventually moving to something that uses less
>>>    memory than struct page, and I’m happy to coordinate on future changes.
>>
>> Thanks!
>>
>>> - Before we start coordinating on this patch set, should we hold off until
>>>    the 6.19 cycle, which includes 2M device pages from Balbir [1] (i.e.,
>>>    rebase this series on top of 6.19 once it includes 2M pages)? I suspect
>>>    that, given the scope of this series and Balbir’s, there will be some
>>>    conflicts.
>>
>> Our aim here is to get some review of the design and the patches/implementation
>> for the 6.19 cycle but I agree that this will need to get rebased on top of
>> Balbir's series.
> 
> +1. Will be on the lookout for the next post and pull into 6.19 DRM tree
> and at least test out the Intel stuff + send fixes if needed.

The next revision I will rebase on Balbir's series.

> 
> I can enable both of you for Intel CI too, just include intel-xe list on
> next post and it will get kicked off and you can find the results on
> patchworks.

This will be very helpful, will do.

Thanks,
Jordan.

> 
> Matt
> 
>>
>>   - Alistair
>>
>>> Matt
>>>
>>> [1] https://patchwork.freedesktop.org/series/152798/
>>>
>>>>
>>>> Testing:
>>>> - selftests/mm/hmm-tests on an amd64 VM
>>>>
>>>> [0] https://lore.kernel.org/lkml/CAMj1kXFZ=4hLL1w6iCV5O5uVoVLHAJbc0rr40j24ObenAjXe9w@mail.gmail.com/
>>>>
>>>> Jordan Niethe (6):
>>>>    mm/hmm: Add flag to track device private PFNs
>>>>    mm/migrate_device: Add migrate PFN flag to track device private PFNs
>>>>    mm/page_vma_mapped: Add flags to page_vma_mapped_walk::pfn to track
>>>>      device private PFNs
>>>>    mm: Add a new swap type for migration entries with device private PFNs
>>>>    mm/util: Add flag to track device private PFNs in page snapshots
>>>>    mm: Remove device private pages from the physical address space
>>>>
>>>>   Documentation/mm/hmm.rst |   9 +-
>>>>   fs/proc/page.c           |   6 +-
>>>>   include/linux/hmm.h      |   5 ++
>>>>   include/linux/memremap.h |  25 +++++-
>>>>   include/linux/migrate.h  |   5 ++
>>>>   include/linux/mm.h       |   9 +-
>>>>   include/linux/rmap.h     |  33 +++++++-
>>>>   include/linux/swap.h     |   8 +-
>>>>   include/linux/swapops.h  | 102 +++++++++++++++++++++--
>>>>   lib/test_hmm.c           |  66 ++++++++-------
>>>>   mm/debug.c               |   9 +-
>>>>   mm/hmm.c                 |   2 +-
>>>>   mm/memory.c              |   9 +-
>>>>   mm/memremap.c            | 174 +++++++++++++++++++++++++++++----------
>>>>   mm/migrate.c             |   6 +-
>>>>   mm/migrate_device.c      |  44 ++++++----
>>>>   mm/mm_init.c             |   8 +-
>>>>   mm/mprotect.c            |  21 +++--
>>>>   mm/page_vma_mapped.c     |  18 +++-
>>>>   mm/pagewalk.c            |   2 +-
>>>>   mm/rmap.c                |  68 ++++++++++-----
>>>>   mm/util.c                |   8 +-
>>>>   mm/vmscan.c              |   2 +-
>>>>   23 files changed, 485 insertions(+), 154 deletions(-)
>>>>
>>>>
>>>> base-commit: e1afacb68573c3cd0a3785c6b0508876cd3423bc
>>>> -- 
>>>> 2.34.1
>>>>



^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: [RFC PATCH 4/6] mm: Add a new swap type for migration entries with device private PFNs
  2025-12-01  2:43   ` Chih-En Lin
@ 2025-12-02  1:42     ` Jordan Niethe
  0 siblings, 0 replies; 26+ messages in thread
From: Jordan Niethe @ 2025-12-02  1:42 UTC (permalink / raw)
  To: Chih-En Lin
  Cc: linux-mm, balbirs, matthew.brost, akpm, linux-kernel, dri-devel,
	david, ziy, apopple, lorenzo.stoakes, lyude, dakr, airlied,
	simona, rcampbell, mpenttil, jgg, willy

Hi,

On 1/12/25 13:43, Chih-En Lin wrote:
> On Fri, Nov 28, 2025 at 03:41:44PM +1100, Jordan Niethe wrote:
>> A future change will remove device private pages from the physical
>> address space. This will mean that device private pages no longer have
>> normal PFN and must be handled separately.
>>
>> When migrating a device private page a migration entry is created for
>> that page - this includes the PFN for that page. Once device private
>> PFNs exist in a different address space to regular PFNs we need to be
>> able to determine which kind of PFN is in the entry so we can associate
>> it with the correct page.
>>
>> Introduce new swap types:
>>
>>    - SWP_MIGRATION_DEVICE_READ
>>    - SWP_MIGRATION_DEVICE_WRITE
>>    - SWP_MIGRATION_DEVICE_READ_EXCLUSIVE
>>
>> These correspond to
>>
>>    - SWP_MIGRATION_READ
>>    - SWP_MIGRATION_WRITE
>>    - SWP_MIGRATION_READ_EXCLUSIVE
>>
>> except the swap entry contains a device private PFN.
>>
>> The existing helpers such as is_writable_migration_entry() will still
>> return true for a SWP_MIGRATION_DEVICE_WRITE entry.
>>
>> Introduce new helpers such as
>> is_writable_device_migration_private_entry() to disambiguate between a
>> SWP_MIGRATION_WRITE and a SWP_MIGRATION_DEVICE_WRITE entry.
>>
>> Signed-off-by: Jordan Niethe <jniethe@nvidia.com>
>> Signed-off-by: Alistair Popple <apopple@nvidia.com>
>> ---
>>   include/linux/swap.h    |  8 +++-
>>   include/linux/swapops.h | 87 ++++++++++++++++++++++++++++++++++++++---
>>   mm/memory.c             |  9 ++++-
>>   mm/migrate.c            |  2 +-
>>   mm/migrate_device.c     | 31 ++++++++++-----
>>   mm/mprotect.c           | 21 +++++++---
>>   mm/page_vma_mapped.c    |  2 +-
>>   mm/pagewalk.c           |  3 +-
>>   mm/rmap.c               | 32 ++++++++++-----
>>   9 files changed, 161 insertions(+), 34 deletions(-)
>>
>> diff --git a/include/linux/swap.h b/include/linux/swap.h
>> index e818fbade1e2..87f14d673979 100644
>> --- a/include/linux/swap.h
>> +++ b/include/linux/swap.h
>> @@ -74,12 +74,18 @@ static inline int current_is_kswapd(void)
>>    *
>>    * When a page is mapped by the device for exclusive access we set the CPU page
>>    * table entries to a special SWP_DEVICE_EXCLUSIVE entry.
>> + *
>> + * Because device private pages do not use regular PFNs, special migration
>> + * entries are also needed.
>>    */
>>   #ifdef CONFIG_DEVICE_PRIVATE
>> -#define SWP_DEVICE_NUM 3
>> +#define SWP_DEVICE_NUM 6
>>   #define SWP_DEVICE_WRITE (MAX_SWAPFILES+SWP_HWPOISON_NUM+SWP_MIGRATION_NUM)
>>   #define SWP_DEVICE_READ (MAX_SWAPFILES+SWP_HWPOISON_NUM+SWP_MIGRATION_NUM+1)
>>   #define SWP_DEVICE_EXCLUSIVE (MAX_SWAPFILES+SWP_HWPOISON_NUM+SWP_MIGRATION_NUM+2)
>> +#define SWP_MIGRATION_DEVICE_READ (MAX_SWAPFILES+SWP_HWPOISON_NUM+SWP_MIGRATION_NUM+3)
>> +#define SWP_MIGRATION_DEVICE_READ_EXCLUSIVE (MAX_SWAPFILES+SWP_HWPOISON_NUM+SWP_MIGRATION_NUM+4)
>> +#define SWP_MIGRATION_DEVICE_WRITE (MAX_SWAPFILES+SWP_HWPOISON_NUM+SWP_MIGRATION_NUM+5)
>>   #else
>>   #define SWP_DEVICE_NUM 0
>>   #endif
>> diff --git a/include/linux/swapops.h b/include/linux/swapops.h
>> index 64ea151a7ae3..7aa3f00e304a 100644
>> --- a/include/linux/swapops.h
>> +++ b/include/linux/swapops.h
>> @@ -196,6 +196,43 @@ static inline bool is_device_exclusive_entry(swp_entry_t entry)
>>   	return swp_type(entry) == SWP_DEVICE_EXCLUSIVE;
>>   }
>>   
>> +static inline swp_entry_t make_readable_migration_device_private_entry(pgoff_t offset)
>> +{
>> +	return swp_entry(SWP_MIGRATION_DEVICE_READ, offset);
>> +}
>> +
>> +static inline swp_entry_t make_writable_migration_device_private_entry(pgoff_t offset)
>> +{
>> +	return swp_entry(SWP_MIGRATION_DEVICE_WRITE, offset);
>> +}
>> +
>> +static inline bool is_device_private_migration_entry(swp_entry_t entry)
>> +{
>> +	return unlikely(swp_type(entry) == SWP_MIGRATION_DEVICE_READ ||
>> +			swp_type(entry) == SWP_MIGRATION_DEVICE_READ_EXCLUSIVE ||
>> +			swp_type(entry) == SWP_MIGRATION_DEVICE_WRITE);
>> +}
>> +
>> +static inline bool is_readable_device_migration_private_entry(swp_entry_t entry)
>> +{
>> +	return unlikely(swp_type(entry) == SWP_MIGRATION_DEVICE_READ);
>> +}
>> +
>> +static inline bool is_writable_device_migration_private_entry(swp_entry_t entry)
>> +{
>> +	return unlikely(swp_type(entry) == SWP_MIGRATION_DEVICE_WRITE);
>> +}
>> +
>> +static inline swp_entry_t make_device_migration_readable_exclusive_migration_entry(pgoff_t offset)
>> +{
>> +	return swp_entry(SWP_MIGRATION_DEVICE_READ_EXCLUSIVE, offset);
>> +}
>> +
>> +static inline bool is_device_migration_readable_exclusive_entry(swp_entry_t entry)
>> +{
>> +	return swp_type(entry) == SWP_MIGRATION_DEVICE_READ_EXCLUSIVE;
>> +}
> 
> The names are inconsistent.
> 
> Maybe make_device_migration_readable_exclusive_migration_entry to
> make_readable_exclusive_migration_device_private_entry, and
> is_device_migration_readable_exclusive_entry to
> is_readable_exclusive_device_private_migration_entry?

I agree - I'll change it.

Thanks,
Jordan.

> 
> 
>>   #else /* CONFIG_DEVICE_PRIVATE */
>>   static inline swp_entry_t make_readable_device_private_entry(pgoff_t offset)
>>   {
>> @@ -217,6 +254,11 @@ static inline bool is_writable_device_private_entry(swp_entry_t entry)
>>   	return false;
>>   }
>>   
>> +static inline bool is_readable_device_migration_private_entry(swp_entry_t entry)
>> +{
>> +	return false;
>> +}
>> +
>>   static inline swp_entry_t make_device_exclusive_entry(pgoff_t offset)
>>   {
>>   	return swp_entry(0, 0);
>> @@ -227,6 +269,36 @@ static inline bool is_device_exclusive_entry(swp_entry_t entry)
>>   	return false;
>>   }
>>   
>> +static inline swp_entry_t make_readable_migration_device_private_entry(pgoff_t offset)
>> +{
>> +	return swp_entry(0, 0);
>> +}
>> +
>> +static inline swp_entry_t make_writable_migration_device_private_entry(pgoff_t offset)
>> +{
>> +	return swp_entry(0, 0);
>> +}
>> +
>> +static inline bool is_device_private_migration_entry(swp_entry_t entry)
>> +{
>> +	return false;
>> +}
>> +
>> +static inline bool is_writable_device_migration_private_entry(swp_entry_t entry)
>> +{
>> +	return false;
>> +}
>> +
>> +static inline swp_entry_t make_device_migration_readable_exclusive_migration_entry(pgoff_t offset)
>> +{
>> +	return swp_entry(0, 0);
>> +}
>> +
>> +static inline bool is_device_migration_readable_exclusive_entry(swp_entry_t entry)
>> +{
>> +	return false;
>> +}
>> +
>>   #endif /* CONFIG_DEVICE_PRIVATE */
>>   
>>   #ifdef CONFIG_MIGRATION
>> @@ -234,22 +306,26 @@ static inline int is_migration_entry(swp_entry_t entry)
>>   {
>>   	return unlikely(swp_type(entry) == SWP_MIGRATION_READ ||
>>   			swp_type(entry) == SWP_MIGRATION_READ_EXCLUSIVE ||
>> -			swp_type(entry) == SWP_MIGRATION_WRITE);
>> +			swp_type(entry) == SWP_MIGRATION_WRITE ||
>> +			is_device_private_migration_entry(entry));
>>   }
>>   
>>   static inline int is_writable_migration_entry(swp_entry_t entry)
>>   {
>> -	return unlikely(swp_type(entry) == SWP_MIGRATION_WRITE);
>> +	return unlikely(swp_type(entry) == SWP_MIGRATION_WRITE ||
>> +			is_writable_device_migration_private_entry(entry));
>>   }
>>   
>>   static inline int is_readable_migration_entry(swp_entry_t entry)
>>   {
>> -	return unlikely(swp_type(entry) == SWP_MIGRATION_READ);
>> +	return unlikely(swp_type(entry) == SWP_MIGRATION_READ ||
>> +			is_readable_device_migration_private_entry(entry));
>>   }
>>   
>>   static inline int is_readable_exclusive_migration_entry(swp_entry_t entry)
>>   {
>> -	return unlikely(swp_type(entry) == SWP_MIGRATION_READ_EXCLUSIVE);
>> +	return unlikely(swp_type(entry) == SWP_MIGRATION_READ_EXCLUSIVE ||
>> +			is_device_migration_readable_exclusive_entry(entry));
>>   }
>>   
>>   static inline swp_entry_t make_readable_migration_entry(pgoff_t offset)
>> @@ -525,7 +601,8 @@ static inline bool is_pfn_swap_entry(swp_entry_t entry)
>>   	BUILD_BUG_ON(SWP_TYPE_SHIFT < SWP_PFN_BITS);
>>   
>>   	return is_migration_entry(entry) || is_device_private_entry(entry) ||
>> -	       is_device_exclusive_entry(entry) || is_hwpoison_entry(entry);
>> +	       is_device_exclusive_entry(entry) || is_hwpoison_entry(entry) ||
>> +	       is_device_private_migration_entry(entry);
>>   }
>>   
>>   struct page_vma_mapped_walk;
>> diff --git a/mm/memory.c b/mm/memory.c
>> index b59ae7ce42eb..f1ed361434ff 100644
>> --- a/mm/memory.c
>> +++ b/mm/memory.c
>> @@ -962,8 +962,13 @@ copy_nonpresent_pte(struct mm_struct *dst_mm, struct mm_struct *src_mm,
>>   			 * to be set to read. A previously exclusive entry is
>>   			 * now shared.
>>   			 */
>> -			entry = make_readable_migration_entry(
>> -							swp_offset(entry));
>> +			if (is_device_private_migration_entry(entry))
>> +				entry = make_readable_migration_device_private_entry(
>> +								swp_offset(entry));
>> +			else
>> +				entry = make_readable_migration_entry(
>> +								swp_offset(entry));
>> +
>>   			pte = swp_entry_to_pte(entry);
>>   			if (pte_swp_soft_dirty(orig_pte))
>>   				pte = pte_swp_mksoft_dirty(pte);
>> diff --git a/mm/migrate.c b/mm/migrate.c
>> index c0e9f15be2a2..3c561d61afba 100644
>> --- a/mm/migrate.c
>> +++ b/mm/migrate.c
>> @@ -495,7 +495,7 @@ void migration_entry_wait(struct mm_struct *mm, pmd_t *pmd,
>>   		goto out;
>>   
>>   	entry = pte_to_swp_entry(pte);
>> -	if (!is_migration_entry(entry))
>> +	if (!(is_migration_entry(entry)))
>>   		goto out;
>>   
>>   	migration_entry_wait_on_locked(entry, ptl);
>> diff --git a/mm/migrate_device.c b/mm/migrate_device.c
>> index 82f09b24d913..458b5114bb2b 100644
>> --- a/mm/migrate_device.c
>> +++ b/mm/migrate_device.c
>> @@ -235,15 +235,28 @@ static int migrate_vma_collect_pmd(pmd_t *pmdp,
>>   				folio_mark_dirty(folio);
>>   
>>   			/* Setup special migration page table entry */
>> -			if (mpfn & MIGRATE_PFN_WRITE)
>> -				entry = make_writable_migration_entry(
>> -							page_to_pfn(page));
>> -			else if (anon_exclusive)
>> -				entry = make_readable_exclusive_migration_entry(
>> -							page_to_pfn(page));
>> -			else
>> -				entry = make_readable_migration_entry(
>> -							page_to_pfn(page));
>> +			if (mpfn & MIGRATE_PFN_WRITE) {
>> +				if (is_device_private_page(page))
>> +					entry = make_writable_migration_device_private_entry(
>> +								page_to_pfn(page));
>> +				else
>> +					entry = make_writable_migration_entry(
>> +								page_to_pfn(page));
>> +			} else if (anon_exclusive) {
>> +				if (is_device_private_page(page))
>> +					entry = make_device_migration_readable_exclusive_migration_entry(
>> +								page_to_pfn(page));
>> +				else
>> +					entry = make_readable_exclusive_migration_entry(
>> +								page_to_pfn(page));
>> +			} else {
>> +				if (is_device_private_page(page))
>> +					entry = make_readable_migration_device_private_entry(
>> +								page_to_pfn(page));
>> +				else
>> +					entry = make_readable_migration_entry(
>> +								page_to_pfn(page));
>> +			}
>>   			if (pte_present(pte)) {
>>   				if (pte_young(pte))
>>   					entry = make_migration_entry_young(entry);
>> diff --git a/mm/mprotect.c b/mm/mprotect.c
>> index 113b48985834..7d79a0f53bf5 100644
>> --- a/mm/mprotect.c
>> +++ b/mm/mprotect.c
>> @@ -365,11 +365,22 @@ static long change_pte_range(struct mmu_gather *tlb,
>>   				 * A protection check is difficult so
>>   				 * just be safe and disable write
>>   				 */
>> -				if (folio_test_anon(folio))
>> -					entry = make_readable_exclusive_migration_entry(
>> -							     swp_offset(entry));
>> -				else
>> -					entry = make_readable_migration_entry(swp_offset(entry));
>> +				if (!is_writable_device_migration_private_entry(entry)) {
>> +					if (folio_test_anon(folio))
>> +						entry = make_readable_exclusive_migration_entry(
>> +								swp_offset(entry));
>> +					else
>> +						entry = make_readable_migration_entry(
>> +								swp_offset(entry));
>> +				} else {
>> +					if (folio_test_anon(folio))
>> +						entry = make_device_migration_readable_exclusive_migration_entry(
>> +								swp_offset(entry));
>> +					else
>> +						entry = make_readable_migration_device_private_entry(
>> +								swp_offset(entry));
>> +				}
>> +
>>   				newpte = swp_entry_to_pte(entry);
>>   				if (pte_swp_soft_dirty(oldpte))
>>   					newpte = pte_swp_mksoft_dirty(newpte);
>> diff --git a/mm/page_vma_mapped.c b/mm/page_vma_mapped.c
>> index 9146bd084435..e9fe747d3df3 100644
>> --- a/mm/page_vma_mapped.c
>> +++ b/mm/page_vma_mapped.c
>> @@ -112,7 +112,7 @@ static bool check_pte(struct page_vma_mapped_walk *pvmw, unsigned long pte_nr)
>>   			return false;
>>   		entry = pte_to_swp_entry(ptent);
>>   
>> -		if (!is_migration_entry(entry))
>> +		if (!(is_migration_entry(entry)))
>>   			return false;
>>   
>>   		pfn = swp_offset_pfn(entry);
>> diff --git a/mm/pagewalk.c b/mm/pagewalk.c
>> index 9f91cf85a5be..f5c77dda3359 100644
>> --- a/mm/pagewalk.c
>> +++ b/mm/pagewalk.c
>> @@ -1003,7 +1003,8 @@ struct folio *folio_walk_start(struct folio_walk *fw,
>>   		swp_entry_t entry = pte_to_swp_entry(pte);
>>   
>>   		if ((flags & FW_MIGRATION) &&
>> -		    is_migration_entry(entry)) {
>> +		    (is_migration_entry(entry) ||
>> +		     is_device_private_migration_entry(entry))) {
>>   			page = pfn_swap_entry_to_page(entry);
>>   			expose_page = false;
>>   			goto found;
>> diff --git a/mm/rmap.c b/mm/rmap.c
>> index e94500318f92..9642a79cbdb4 100644
>> --- a/mm/rmap.c
>> +++ b/mm/rmap.c
>> @@ -2535,15 +2535,29 @@ static bool try_to_migrate_one(struct folio *folio, struct vm_area_struct *vma,
>>   			 * pte. do_swap_page() will wait until the migration
>>   			 * pte is removed and then restart fault handling.
>>   			 */
>> -			if (writable)
>> -				entry = make_writable_migration_entry(
>> -							page_to_pfn(subpage));
>> -			else if (anon_exclusive)
>> -				entry = make_readable_exclusive_migration_entry(
>> -							page_to_pfn(subpage));
>> -			else
>> -				entry = make_readable_migration_entry(
>> -							page_to_pfn(subpage));
>> +			if (writable) {
>> +				if (is_device_private_page(subpage))
>> +					entry = make_writable_migration_device_private_entry(
>> +								page_to_pfn(subpage));
>> +				else
>> +					entry = make_writable_migration_entry(
>> +								page_to_pfn(subpage));
>> +			} else if (anon_exclusive) {
>> +				if (is_device_private_page(subpage))
>> +					entry = make_device_migration_readable_exclusive_migration_entry(
>> +								page_to_pfn(subpage));
>> +				else
>> +					entry = make_readable_exclusive_migration_entry(
>> +								page_to_pfn(subpage));
>> +			} else {
>> +				if (is_device_private_page(subpage))
>> +					entry = make_readable_migration_device_private_entry(
>> +								page_to_pfn(subpage));
>> +				else
>> +					entry = make_readable_migration_entry(
>> +								page_to_pfn(subpage));
>> +			}
>> +
>>   			if (likely(pte_present(pteval))) {
>>   				if (pte_young(pteval))
>>   					entry = make_migration_entry_young(entry);
>> -- 
>> 2.34.1
>>
> 
> Thanks,
> Chih-En Lin



^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: [RFC PATCH 6/6] mm: Remove device private pages from the physical address space
  2025-11-28 17:51   ` Jason Gunthorpe
@ 2025-12-02  2:28     ` Jordan Niethe
  2025-12-02  4:10       ` Alistair Popple
  0 siblings, 1 reply; 26+ messages in thread
From: Jordan Niethe @ 2025-12-02  2:28 UTC (permalink / raw)
  To: Jason Gunthorpe
  Cc: linux-mm, balbirs, matthew.brost, akpm, linux-kernel, dri-devel,
	david, ziy, apopple, lorenzo.stoakes, lyude, dakr, airlied,
	simona, rcampbell, mpenttil, willy

Hi,

On 29/11/25 04:51, Jason Gunthorpe wrote:
> On Fri, Nov 28, 2025 at 03:41:46PM +1100, Jordan Niethe wrote:
>> Introduce helpers:
>>
>>    - device_private_page_to_offset()
>>    - device_private_folio_to_offset()
>>
>> to take a given device private page / folio and return its offset within
>> the device private address space (this is essentially a PFN within the
>> device private address space).
> 
> It would be nice if we rarely/never needed to see number space outside
> the pte itself or the internal helpers..

Outside of the PTE itself, one of the use cases for the PFNs themselves
is range checking, as we see in mm/page_vma_mapped.c:check_pte().

> 
> Like, I don't think there should be stuff like this:
> 
>>   					entry = make_writable_migration_device_private_entry(
>> -								page_to_pfn(page));
>> +								device_private_page_to_offset(page));
> 
> make_writable_migration_device_private_entry() should accept the
> struct page as the handle?

That would be cleaner - I'll give it a try.
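
Something like the following is roughly what I have in mind (untested
sketch, reusing device_private_page_to_offset() from this series):

        static inline swp_entry_t make_writable_migration_device_private_entry(struct page *page)
        {
                return swp_entry(SWP_MIGRATION_DEVICE_WRITE,
                                 device_private_page_to_offset(page));
        }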

> 
> If it really is needed I think it should have its own dedicated type
> and not be intermixed with normal pfns..

One consideration here is that for things like range checking the PFNs,
the logic remains the same for device PFNs and normal PFNs.
If we represent device PFNs as a unique type, ideally we'd still like to
avoid introducing too much special handling.

Potentially I could see something like a tagged union for memory indices 
like ...

enum memory_index_type {
         MEMORY_INDEX_TYPE_PFN,
         MEMORY_INDEX_TYPE_DEVICE_MEMORY_INDEX,
};

struct memory_index {
        enum memory_index_type type;	/* tag kept outside the union so it doesn't alias the value */
        union {
                unsigned long pfn;
                unsigned long device_memory_index;
        };
};

... if we wanted to introduce a dedicated type.
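
A caller could then switch on the tag, e.g. (sketch only -
device_private_offset_to_page() is a made-up name here, this series only
adds the page-to-offset direction):

        struct page *page;

        switch (idx.type) {
        case MEMORY_INDEX_TYPE_PFN:
                page = pfn_to_page(idx.pfn);
                break;
        case MEMORY_INDEX_TYPE_DEVICE_MEMORY_INDEX:
                page = device_private_offset_to_page(idx.device_memory_index);
                break;
        }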

Another possibility could be to avoid exposing the PFN for cases like
this.

For example, if we went back to struct page_vma_mapped_walk containing a
folio / struct page instead of passing in a pfn, then we could introduce
some helper like ...

        bool swap_entry_contains_folio(struct folio *folio, swp_entry_t entry);

... that handles both device memory and normal memory and use that in
check_pte().
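
As a rough, untested sketch (ignoring the pte_nr / range handling that
check_pte() also does), that helper could look something like:

        static inline bool swap_entry_contains_folio(struct folio *folio,
                                                     swp_entry_t entry)
        {
                /* device private entries carry a device memory index, not a PFN */
                if (is_device_private_entry(entry) ||
                    is_device_private_migration_entry(entry))
                        return device_private_folio_to_offset(folio) ==
                               swp_offset(entry);

                return folio_pfn(folio) == swp_offset_pfn(entry);
        }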

Thanks,
Jordan.


> 
> Jason



^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: [RFC PATCH 6/6] mm: Remove device private pages from the physical address space
  2025-12-02  2:28     ` Jordan Niethe
@ 2025-12-02  4:10       ` Alistair Popple
  0 siblings, 0 replies; 26+ messages in thread
From: Alistair Popple @ 2025-12-02  4:10 UTC (permalink / raw)
  To: Jordan Niethe
  Cc: Jason Gunthorpe, linux-mm, balbirs, matthew.brost, akpm,
	linux-kernel, dri-devel, david, ziy, lorenzo.stoakes, lyude,
	dakr, airlied, simona, rcampbell, mpenttil, willy

On 2025-12-02 at 13:28 +1100, Jordan Niethe <jniethe@nvidia.com> wrote...
> Hi,
> 
> On 29/11/25 04:51, Jason Gunthorpe wrote:
> > On Fri, Nov 28, 2025 at 03:41:46PM +1100, Jordan Niethe wrote:
> > > Introduce helpers:
> > > 
> > >    - device_private_page_to_offset()
> > >    - device_private_folio_to_offset()
> > > 
> > > to take a given device private page / folio and return its offset within
> > > the device private address space (this is essentially a PFN within the
> > > device private address space).
> > 
> > It would be nice if we rarely/never needed to see number space outside
> > the pte itself or the internal helpers..
> 
> Outside of the PTE itself, one of the use cases for the PFNs themselves
> is range checking. Like we see in mm/page_vma_mapped.c:check_pte().
> 
> > 
> > Like, I don't think there should be stuff like this:
> > 
> > >   					entry = make_writable_migration_device_private_entry(
> > > -								page_to_pfn(page));
> > > +								device_private_page_to_offset(page));
> > 
> > make_writable_migration_device_private_entry() should accept the
> > struct page as the handle?
> 
> That would be more clean - I'll give it a try.
> 
> > 
> > If it really is needed I think it should have its own dedicated type
> > and not be intermixed with normal pfns..
> 
> One consideration here is for things like range checking the PFNs, the
> logic remains the same for device PFNs and the normal PFNs.
> If we represent the device PFNs as a unique type, ideally we'd like to
> still avoid introducing too much special handling.

Right, Jordan and I went back and forth on this a little bit prior to posting,
but in the end I thought it wasn't worth the overhead of a new type for such a
limited number of use cases, given the actual logic ends up being the same
anyway.

Getting rid of passing the pfn to make_writable_migration_device_private_entry()
makes sense though and should address most of these cases.

> Potentially I could see something like a tagged union for memory indices
> like ...
> 
> enum memory_index_type {
>         MEMORY_INDEX_TYPE_PFN,
>         MEMORY_INDEX_TYPE_DEVICE_MEMORY_INDEX,
> };
> 
> union memory_index {
>         unsigned long pfn;
>         unsigned long device_memory_index;
>         enum memory_index_type type;
> };
> 
> ... if we wanted to introduce a dedicated type.
> 
> Another possibility could be to avoid exposing the PFN for cases like
> this.
> 
> For example if we went back to struct page_vma_mapped_walk containing a
> folio / struct page instead of a passing in a pfn then we could introduce
> some helper
> like ...
> 
>         bool swap_entry_contains_folio(struct folio *folio, swp_entry_t
> entry);
> 
> ... that handles both device memory and normal memory and use that in
> check_pte().
> 
> Thanks,
> Jordan.
> 
> 
> > 
> > Jason
> 


^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: [RFC PATCH 0/6] Remove device private pages from physical address space
  2025-11-28  4:41 [RFC PATCH 0/6] Remove device private pages from physical address space Jordan Niethe
                   ` (9 preceding siblings ...)
  2025-11-28 19:22 ` Matthew Brost
@ 2025-12-02 22:20 ` Balbir Singh
  10 siblings, 0 replies; 26+ messages in thread
From: Balbir Singh @ 2025-12-02 22:20 UTC (permalink / raw)
  To: Jordan Niethe, linux-mm
  Cc: matthew.brost, akpm, linux-kernel, dri-devel, david, ziy,
	apopple, lorenzo.stoakes, lyude, dakr, airlied, simona,
	rcampbell, mpenttil, jgg, willy

On 11/28/25 15:41, Jordan Niethe wrote:
> Today, when creating these device private struct pages, the first step
> is to use request_free_mem_region() to get a range of physical address
> space large enough to represent the devices memory. This allocated
> physical address range is then remapped as device private memory using
> memremap_pages.
> 
> Needing allocation of physical address space has some problems:
> 
>   1) There may be insufficient physical address space to represent the
>      device memory. KASLR reducing the physical address space and VM
>      configurations with limited physical address space increase the
>      likelihood of hitting this especially as device memory increases. This
>      has been observed to prevent device private from being initialized.  
> 
>   2) Attempting to add the device private pages to the linear map at
>      addresses beyond the actual physical memory causes issues on
>      architectures like aarch64  - meaning the feature does not work there [0].
> 
> This RFC changes device private memory so that it does not require
> allocation of physical address space and these problems are avoided.
> Instead of using the physical address space, we introduce a "device
> private address space" and allocate from there.
> 
> A consequence of placing the device private pages outside of the
> physical address space is that they no longer have a PFN. However, it is
> still necessary to be able to look up a corresponding device private
> page from a device private PTE entry, which means that we still require
> some way to index into this device private address space. This leads to
> the idea of a device private PFN. This is like a PFN but instead of
> associating memory in the physical address space with a struct page, it
> associates device memory in the device private address space with a
> device private struct page.
> 
> The problem that then needs to be addressed is how to avoid confusing
> these device private PFNs with the regular PFNs. It is the inherent
> limited usage of the device private pages themselves which make this
> possible. A device private page is only used for userspace mappings, we
> do not need to be concerned with them being used within the mm more
> broadly. This means that the only way that the core kernel looks up
> these pages is via the page table, where their PTE already indicates if
> they refer to a device private page via their swap type, e.g.
> SWP_DEVICE_WRITE. We can use this information to determine if the PTE
> contains a normal PFN which should be looked up in the page map, or a
> device private PFN which should be looked up elsewhere.
> 
> This applies when we are creating PTE entries for device private pages -
> because they have their own type there are already must be handled
> separately, so it is a small step to convert them to a device private
> PFN now too.
> 

It'll be important to distinguish between the two kinds of PFNs and ensure
that they are not treated as interchangeable.

> The first part of the series updates callers where device private PFNs
> might now be encountered to track this extra state.
> 
> The last patch contains the bulk of the work where we change how we
> convert between device private pages to device private PFNs and then use
> a new interface for allocating device private pages without the need for
> reserving physical address space.
> 
> For the purposes of the RFC changes have been limited to test_hmm.c
> updates to the other drivers will be included in the next revision.
> 
> This would include updating existing users of memremap_pages() to use
> memremap_device_private_pagemap() instead to allocate device private
> pages. This also means they would no longer need to call
> request_free_mem_region().  An equivalent of devm_memremap_pages() will
> also be necessary.
> 
> Users of the migrate_vma() interface will also need to be updated to be
> aware these device private PFNs.
> 
> By removing the device private pages from the physical address space,
> this RFC also opens up the possibility to moving away from tracking
> device private memory using struct pages in the future. This is
> desirable as on systems with large amounts of memory these device
> private struct pages use a significant amount of memory and take a
> significant amount of time to initialize.
> 
> Testing:
> - selftests/mm/hmm-tests on an amd64 VM
> 
> [0] https://lore.kernel.org/lkml/CAMj1kXFZ=4hLL1w6iCV5O5uVoVLHAJbc0rr40j24ObenAjXe9w@mail.gmail.com/
> 
> Jordan Niethe (6):
>   mm/hmm: Add flag to track device private PFNs
>   mm/migrate_device: Add migrate PFN flag to track device private PFNs
>   mm/page_vma_mapped: Add flags to page_vma_mapped_walk::pfn to track
>     device private PFNs
>   mm: Add a new swap type for migration entries with device private PFNs
>   mm/util: Add flag to track device private PFNs in page snapshots
>   mm: Remove device private pages from the physical address space
> 
>  Documentation/mm/hmm.rst |   9 +-
>  fs/proc/page.c           |   6 +-
>  include/linux/hmm.h      |   5 ++
>  include/linux/memremap.h |  25 +++++-
>  include/linux/migrate.h  |   5 ++
>  include/linux/mm.h       |   9 +-
>  include/linux/rmap.h     |  33 +++++++-
>  include/linux/swap.h     |   8 +-
>  include/linux/swapops.h  | 102 +++++++++++++++++++++--
>  lib/test_hmm.c           |  66 ++++++++-------
>  mm/debug.c               |   9 +-
>  mm/hmm.c                 |   2 +-
>  mm/memory.c              |   9 +-
>  mm/memremap.c            | 174 +++++++++++++++++++++++++++++----------
>  mm/migrate.c             |   6 +-
>  mm/migrate_device.c      |  44 ++++++----
>  mm/mm_init.c             |   8 +-
>  mm/mprotect.c            |  21 +++--
>  mm/page_vma_mapped.c     |  18 +++-
>  mm/pagewalk.c            |   2 +-
>  mm/rmap.c                |  68 ++++++++++-----
>  mm/util.c                |   8 +-
>  mm/vmscan.c              |   2 +-
>  23 files changed, 485 insertions(+), 154 deletions(-)
> 
> 

Balbir


^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: [RFC PATCH 1/6] mm/hmm: Add flag to track device private PFNs
  2025-12-02  1:20     ` Jordan Niethe
@ 2025-12-03  4:25       ` Balbir Singh
  0 siblings, 0 replies; 26+ messages in thread
From: Balbir Singh @ 2025-12-03  4:25 UTC (permalink / raw)
  To: Jordan Niethe, Matthew Brost
  Cc: linux-mm, akpm, linux-kernel, dri-devel, david, ziy, apopple,
	lorenzo.stoakes, lyude, dakr, airlied, simona, rcampbell,
	mpenttil, jgg, willy

On 12/2/25 12:20, Jordan Niethe wrote:
> Hi,
> 
> On 29/11/25 05:36, Matthew Brost wrote:
>> On Fri, Nov 28, 2025 at 03:41:41PM +1100, Jordan Niethe wrote:
>>> A future change will remove device private pages from the physical
>>> address space. This will mean that device private pages no longer have
>>> normal PFN and must be handled separately.
>>>
>>> Prepare for this by adding a HMM_PFN_DEVICE_PRIVATE flag to indicate
>>> that a hmm_pfn contains a PFN for a device private page.
>>>
>>> Signed-off-by: Jordan Niethe <jniethe@nvidia.com>
>>> Signed-off-by: Alistair Popple <apopple@nvidia.com>
>>> ---
>>>   include/linux/hmm.h | 2 ++
>>>   mm/hmm.c            | 2 +-
>>>   2 files changed, 3 insertions(+), 1 deletion(-)
>>>
>>> diff --git a/include/linux/hmm.h b/include/linux/hmm.h
>>> index db75ffc949a7..df571fa75a44 100644
>>> --- a/include/linux/hmm.h
>>> +++ b/include/linux/hmm.h
>>> @@ -23,6 +23,7 @@ struct mmu_interval_notifier;
>>>    * HMM_PFN_WRITE - if the page memory can be written to (requires HMM_PFN_VALID)
>>>    * HMM_PFN_ERROR - accessing the pfn is impossible and the device should
>>>    *                 fail. ie poisoned memory, special pages, no vma, etc
>>> + * HMM_PFN_DEVICE_PRIVATE - the pfn field contains a DEVICE_PRIVATE pfn.
>>>    * HMM_PFN_P2PDMA - P2P page
>>>    * HMM_PFN_P2PDMA_BUS - Bus mapped P2P transfer
>>>    * HMM_PFN_DMA_MAPPED - Flag preserved on input-to-output transformation
>>> @@ -40,6 +41,7 @@ enum hmm_pfn_flags {
>>>       HMM_PFN_VALID = 1UL << (BITS_PER_LONG - 1),
>>>       HMM_PFN_WRITE = 1UL << (BITS_PER_LONG - 2),
>>>       HMM_PFN_ERROR = 1UL << (BITS_PER_LONG - 3),
>>> +    HMM_PFN_DEVICE_PRIVATE = 1UL << (BITS_PER_LONG - 7),

Doesn't this break HMM_PFN_ORDER_SHIFT? The assumption is that we have 5 bits for
order

>>>       /*
>>>        * Sticky flags, carried from input to output,
>>>        * don't forget to update HMM_PFN_INOUT_FLAGS
>>> diff --git a/mm/hmm.c b/mm/hmm.c
>>> index 87562914670a..1cff68ade1d4 100644
>>> --- a/mm/hmm.c
>>> +++ b/mm/hmm.c
>>> @@ -262,7 +262,7 @@ static int hmm_vma_handle_pte(struct mm_walk *walk, unsigned long addr,
>>>           if (is_device_private_entry(entry) &&
>>>               page_pgmap(pfn_swap_entry_to_page(entry))->owner ==
>>>               range->dev_private_owner) {
>>> -            cpu_flags = HMM_PFN_VALID;
>>> +            cpu_flags = HMM_PFN_VALID | HMM_PFN_DEVICE_PRIVATE;
>>
>> I think you’ll need to set this flag in hmm_vma_handle_absent_pmd as
>> well. That function handles 2M device pages. Support for 2M device
>> pages, I believe, will be included in the 6.19 PR, but
>> hmm_vma_handle_absent_pmd is already upstream.
> 
> Thanks Matt, I agree. There will be a few more updates to this
> series for 2MB device pages - I'll send the next revision on top of that
> support.
> 

I think it makes sense to build on top of v6.19 with THP support.

Balbir


^ permalink raw reply	[flat|nested] 26+ messages in thread

end of thread, other threads:[~2025-12-03  4:25 UTC | newest]

Thread overview: 26+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2025-11-28  4:41 [RFC PATCH 0/6] Remove device private pages from physical address space Jordan Niethe
2025-11-28  4:41 ` [RFC PATCH 1/6] mm/hmm: Add flag to track device private PFNs Jordan Niethe
2025-11-28 18:36   ` Matthew Brost
2025-12-02  1:20     ` Jordan Niethe
2025-12-03  4:25       ` Balbir Singh
2025-11-28  4:41 ` [RFC PATCH 2/6] mm/migrate_device: Add migrate PFN " Jordan Niethe
2025-11-28  4:41 ` [RFC PATCH 3/6] mm/page_vma_mapped: Add flags to page_vma_mapped_walk::pfn " Jordan Niethe
2025-11-28  4:41 ` [RFC PATCH 4/6] mm: Add a new swap type for migration entries with " Jordan Niethe
2025-12-01  2:43   ` Chih-En Lin
2025-12-02  1:42     ` Jordan Niethe
2025-11-28  4:41 ` [RFC PATCH 5/6] mm/util: Add flag to track device private PFNs in page snapshots Jordan Niethe
2025-11-28  4:41 ` [RFC PATCH 6/6] mm: Remove device private pages from the physical address space Jordan Niethe
2025-11-28 17:51   ` Jason Gunthorpe
2025-12-02  2:28     ` Jordan Niethe
2025-12-02  4:10       ` Alistair Popple
2025-11-28  7:40 ` [RFC PATCH 0/6] Remove device private pages from " David Hildenbrand (Red Hat)
2025-11-30 23:33   ` Alistair Popple
2025-11-28 15:09 ` Matthew Wilcox
2025-12-02  1:31   ` Jordan Niethe
2025-11-28 16:07 ` Mika Penttilä
2025-12-02  1:32   ` Jordan Niethe
2025-11-28 19:22 ` Matthew Brost
2025-11-30 23:23   ` Alistair Popple
2025-12-01  1:51     ` Matthew Brost
2025-12-02  1:40       ` Jordan Niethe
2025-12-02 22:20 ` Balbir Singh

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox