linux-mm.kvack.org archive mirror
* [PATCH v6 0/5] Enable THP support in drm_pagemap
@ 2026-01-16 11:10 Francois Dugast
  2026-01-16 11:10 ` [PATCH v6 1/5] mm/zone_device: Reinitialize large zone device private folios Francois Dugast
                   ` (4 more replies)
  0 siblings, 5 replies; 44+ messages in thread
From: Francois Dugast @ 2026-01-16 11:10 UTC (permalink / raw)
  To: intel-xe
  Cc: dri-devel, Francois Dugast, Zi Yan, Madhavan Srinivasan,
	Alistair Popple, Lorenzo Stoakes, Liam R . Howlett,
	Suren Baghdasaryan, Michal Hocko, Mike Rapoport, Vlastimil Babka,
	Nicholas Piggin, Michael Ellerman, Christophe Leroy (CS GROUP),
	Felix Kuehling, Alex Deucher, Christian König, David Airlie,
	Simona Vetter, Maarten Lankhorst, Maxime Ripard,
	Thomas Zimmermann, Lyude Paul, Danilo Krummrich, Bjorn Helgaas,
	Logan Gunthorpe, David Hildenbrand, Oscar Salvador,
	Andrew Morton, Jason Gunthorpe, Leon Romanovsky, Balbir Singh,
	Dan Williams, Matthew Wilcox, Jan Kara, Alexander Viro,
	Christian Brauner, Mika Penttilä,
	linuxppc-dev, kvm, linux-kernel, amd-gfx, nouveau, linux-pci,
	linux-mm, linux-cxl, nvdimm, linux-fsdevel

Use Balbir Singh's series for device-private THP support [1] and previous
preparation work in drm_pagemap [2] to add 2MB/THP support in xe. This leads
to significant performance improvements when using SVM with 2MB pages.

[1] https://lore.kernel.org/linux-mm/20251001065707.920170-1-balbirs@nvidia.com/
[2] https://patchwork.freedesktop.org/series/151754/

v2:
- rebase on top of multi-device SVM
- add drm_pagemap_cpages() with temporary patch
- address other feedback from Matt Brost on v1

v3:
The major change is to remove the dependency on the mm/huge_memory
helper migrate_device_split_page(), which was called explicitly when
a 2M buddy allocation backed by a large folio would later be reused
for a smaller allocation (4K or 64K). Instead, the first 3 patches
provided by Matthew Brost ensure large folios are split at the time
of freeing.

v4:
- add order argument to folio_free callback
- send complete series to linux-mm and MM folks as requested (Zi Yan
  and Andrew Morton) and cover letter to anyone receiving at least
  one of the patches (Liam R. Howlett)

v5:
- update zone_device_page_init() in patch #1 to reinitialize large
  zone device private folios

v6:
- fix drm_pagemap change in patch #1 to allow applying to 6.19 and
  add some comments

Cc: Zi Yan <ziy@nvidia.com>
Cc: Madhavan Srinivasan <maddy@linux.ibm.com>
Cc: Alistair Popple <apopple@nvidia.com>
Cc: Lorenzo Stoakes <lorenzo.stoakes@oracle.com>
Cc: Liam R. Howlett <Liam.Howlett@oracle.com>
Cc: Suren Baghdasaryan <surenb@google.com>
Cc: Michal Hocko <mhocko@suse.com>
Cc: Mike Rapoport <rppt@kernel.org>
Cc: Vlastimil Babka <vbabka@suse.cz>
Cc: Nicholas Piggin <npiggin@gmail.com>
Cc: Michael Ellerman <mpe@ellerman.id.au>
Cc: "Christophe Leroy (CS GROUP)" <chleroy@kernel.org>
Cc: Felix Kuehling <Felix.Kuehling@amd.com>
Cc: Alex Deucher <alexander.deucher@amd.com>
Cc: "Christian König" <christian.koenig@amd.com>
Cc: David Airlie <airlied@gmail.com>
Cc: Simona Vetter <simona@ffwll.ch>
Cc: Maarten Lankhorst <maarten.lankhorst@linux.intel.com>
Cc: Maxime Ripard <mripard@kernel.org>
Cc: Thomas Zimmermann <tzimmermann@suse.de>
Cc: Lyude Paul <lyude@redhat.com>
Cc: Danilo Krummrich <dakr@kernel.org>
Cc: Bjorn Helgaas <bhelgaas@google.com>
Cc: Logan Gunthorpe <logang@deltatee.com>
Cc: David Hildenbrand <david@kernel.org>
Cc: Oscar Salvador <osalvador@suse.de>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Jason Gunthorpe <jgg@ziepe.ca>
Cc: Leon Romanovsky <leon@kernel.org>
Cc: Balbir Singh <balbirs@nvidia.com>
Cc: Dan Williams <dan.j.williams@intel.com>
Cc: Matthew Wilcox <willy@infradead.org>
Cc: Jan Kara <jack@suse.cz>
Cc: Alexander Viro <viro@zeniv.linux.org.uk>
Cc: Christian Brauner <brauner@kernel.org>
Cc: Mika Penttilä <mpenttil@redhat.com>
Cc: linuxppc-dev@lists.ozlabs.org
Cc: kvm@vger.kernel.org
Cc: linux-kernel@vger.kernel.org
Cc: amd-gfx@lists.freedesktop.org
Cc: dri-devel@lists.freedesktop.org
Cc: nouveau@lists.freedesktop.org
Cc: linux-pci@vger.kernel.org
Cc: linux-mm@kvack.org
Cc: linux-cxl@vger.kernel.org
Cc: nvdimm@lists.linux.dev
Cc: linux-fsdevel@vger.kernel.org

Francois Dugast (3):
  drm/pagemap: Unlock and put folios when possible
  drm/pagemap: Add helper to access zone_device_data
  drm/pagemap: Enable THP support for GPU memory migration

Matthew Brost (2):
  mm/zone_device: Reinitialize large zone device private folios
  drm/pagemap: Correct cpages calculation for migrate_vma_setup

 arch/powerpc/kvm/book3s_hv_uvmem.c       |   2 +-
 drivers/gpu/drm/amd/amdkfd/kfd_migrate.c |   2 +-
 drivers/gpu/drm/drm_gpusvm.c             |   7 +-
 drivers/gpu/drm/drm_pagemap.c            | 158 ++++++++++++++++++-----
 drivers/gpu/drm/nouveau/nouveau_dmem.c   |   2 +-
 include/drm/drm_pagemap.h                |  15 +++
 include/linux/memremap.h                 |   9 +-
 lib/test_hmm.c                           |   4 +-
 mm/memremap.c                            |  35 ++++-
 9 files changed, 195 insertions(+), 39 deletions(-)

-- 
2.43.0




* [PATCH v6 1/5] mm/zone_device: Reinitialize large zone device private folios
  2026-01-16 11:10 [PATCH v6 0/5] Enable THP support in drm_pagemap Francois Dugast
@ 2026-01-16 11:10 ` Francois Dugast
  2026-01-16 13:10   ` Balbir Singh
                     ` (3 more replies)
  2026-01-16 11:10 ` [PATCH v6 2/5] drm/pagemap: Unlock and put folios when possible Francois Dugast
                   ` (3 subsequent siblings)
  4 siblings, 4 replies; 44+ messages in thread
From: Francois Dugast @ 2026-01-16 11:10 UTC (permalink / raw)
  To: intel-xe
  Cc: dri-devel, Matthew Brost, Zi Yan, Alistair Popple,
	Madhavan Srinivasan, Nicholas Piggin, Michael Ellerman,
	Christophe Leroy (CS GROUP),
	Felix Kuehling, Alex Deucher, Christian König, David Airlie,
	Simona Vetter, Maarten Lankhorst, Maxime Ripard,
	Thomas Zimmermann, Lyude Paul, Danilo Krummrich,
	David Hildenbrand, Oscar Salvador, Andrew Morton,
	Jason Gunthorpe, Leon Romanovsky, Lorenzo Stoakes,
	Liam R . Howlett, Vlastimil Babka, Mike Rapoport,
	Suren Baghdasaryan, Michal Hocko, Balbir Singh, linuxppc-dev,
	kvm, linux-kernel, amd-gfx, nouveau, linux-mm, linux-cxl,
	Francois Dugast

From: Matthew Brost <matthew.brost@intel.com>

Reinitialize metadata for large zone device private folios in
zone_device_page_init prior to creating a higher-order zone device
private folio. This step is necessary when the folio’s order changes
dynamically between zone_device_page_init calls to avoid building a
corrupt folio. As part of the metadata reinitialization, the dev_pagemap
must be passed in from the caller because the pgmap stored in the folio
page may have been overwritten with a compound head.

Without this fix, individual pages could have invalid pgmap fields and
flags (with PG_locked being notably problematic) due to prior different
order allocations, which can, and will, result in kernel crashes.
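
Below is a minimal sketch, not part of the patch, of how a driver
allocation path might look after this change, with the dev_pagemap passed
in explicitly; my_devmem_get_folio() and its pgmap argument are
hypothetical placeholders for a driver's own allocator and pagemap:

#include <linux/memremap.h>
#include <linux/mm.h>

/*
 * Hypothetical example: hand out a device-private folio of the requested
 * order. zone_device_folio_init() now reinitializes the metadata of every
 * page in the folio; the pgmap comes from the caller because the value
 * stored in the page may have been overwritten by a previous compound head.
 */
static struct folio *my_devmem_get_folio(struct dev_pagemap *pgmap,
					 unsigned long pfn, unsigned int order)
{
	struct folio *folio = page_folio(pfn_to_page(pfn));

	zone_device_folio_init(folio, pgmap, order);
	return folio;
}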

Cc: Zi Yan <ziy@nvidia.com>
Cc: Alistair Popple <apopple@nvidia.com>
Cc: Madhavan Srinivasan <maddy@linux.ibm.com>
Cc: Nicholas Piggin <npiggin@gmail.com>
Cc: Michael Ellerman <mpe@ellerman.id.au>
Cc: "Christophe Leroy (CS GROUP)" <chleroy@kernel.org>
Cc: Felix Kuehling <Felix.Kuehling@amd.com>
Cc: Alex Deucher <alexander.deucher@amd.com>
Cc: "Christian König" <christian.koenig@amd.com>
Cc: David Airlie <airlied@gmail.com>
Cc: Simona Vetter <simona@ffwll.ch>
Cc: Maarten Lankhorst <maarten.lankhorst@linux.intel.com>
Cc: Maxime Ripard <mripard@kernel.org>
Cc: Thomas Zimmermann <tzimmermann@suse.de>
Cc: Lyude Paul <lyude@redhat.com>
Cc: Danilo Krummrich <dakr@kernel.org>
Cc: David Hildenbrand <david@kernel.org>
Cc: Oscar Salvador <osalvador@suse.de>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Jason Gunthorpe <jgg@ziepe.ca>
Cc: Leon Romanovsky <leon@kernel.org>
Cc: Lorenzo Stoakes <lorenzo.stoakes@oracle.com>
Cc: Liam R. Howlett <Liam.Howlett@oracle.com>
Cc: Vlastimil Babka <vbabka@suse.cz>
Cc: Mike Rapoport <rppt@kernel.org>
Cc: Suren Baghdasaryan <surenb@google.com>
Cc: Michal Hocko <mhocko@suse.com>
Cc: Balbir Singh <balbirs@nvidia.com>
Cc: linuxppc-dev@lists.ozlabs.org
Cc: kvm@vger.kernel.org
Cc: linux-kernel@vger.kernel.org
Cc: amd-gfx@lists.freedesktop.org
Cc: dri-devel@lists.freedesktop.org
Cc: nouveau@lists.freedesktop.org
Cc: linux-mm@kvack.org
Cc: linux-cxl@vger.kernel.org
Fixes: d245f9b4ab80 ("mm/zone_device: support large zone device private folios")
Signed-off-by: Matthew Brost <matthew.brost@intel.com>
Signed-off-by: Francois Dugast <francois.dugast@intel.com>

---

The latest revision updates the commit message to explain what is broken
prior to this patch and also restructures the patch so it applies, and
works, on both the 6.19 branches and drm-tip, the latter of which includes
patches for the next kernel release PR. Intel CI passes on both the 6.19
branches and drm-tip at the point of the first patch in this series, and at
the last patch (drm-tip only, since the subsequent patches in the series
depend on patches present in drm-tip but not in 6.19).
---
 arch/powerpc/kvm/book3s_hv_uvmem.c       |  2 +-
 drivers/gpu/drm/amd/amdkfd/kfd_migrate.c |  2 +-
 drivers/gpu/drm/drm_pagemap.c            |  2 +-
 drivers/gpu/drm/nouveau/nouveau_dmem.c   |  2 +-
 include/linux/memremap.h                 |  9 ++++--
 lib/test_hmm.c                           |  4 ++-
 mm/memremap.c                            | 35 +++++++++++++++++++++++-
 7 files changed, 47 insertions(+), 9 deletions(-)

diff --git a/arch/powerpc/kvm/book3s_hv_uvmem.c b/arch/powerpc/kvm/book3s_hv_uvmem.c
index e5000bef90f2..7cf9310de0ec 100644
--- a/arch/powerpc/kvm/book3s_hv_uvmem.c
+++ b/arch/powerpc/kvm/book3s_hv_uvmem.c
@@ -723,7 +723,7 @@ static struct page *kvmppc_uvmem_get_page(unsigned long gpa, struct kvm *kvm)
 
 	dpage = pfn_to_page(uvmem_pfn);
 	dpage->zone_device_data = pvt;
-	zone_device_page_init(dpage, 0);
+	zone_device_page_init(dpage, &kvmppc_uvmem_pgmap, 0);
 	return dpage;
 out_clear:
 	spin_lock(&kvmppc_uvmem_bitmap_lock);
diff --git a/drivers/gpu/drm/amd/amdkfd/kfd_migrate.c b/drivers/gpu/drm/amd/amdkfd/kfd_migrate.c
index af53e796ea1b..6ada7b4af7c6 100644
--- a/drivers/gpu/drm/amd/amdkfd/kfd_migrate.c
+++ b/drivers/gpu/drm/amd/amdkfd/kfd_migrate.c
@@ -217,7 +217,7 @@ svm_migrate_get_vram_page(struct svm_range *prange, unsigned long pfn)
 	page = pfn_to_page(pfn);
 	svm_range_bo_ref(prange->svm_bo);
 	page->zone_device_data = prange->svm_bo;
-	zone_device_page_init(page, 0);
+	zone_device_page_init(page, page_pgmap(page), 0);
 }
 
 static void
diff --git a/drivers/gpu/drm/drm_pagemap.c b/drivers/gpu/drm/drm_pagemap.c
index 03ee39a761a4..38eca94f01a1 100644
--- a/drivers/gpu/drm/drm_pagemap.c
+++ b/drivers/gpu/drm/drm_pagemap.c
@@ -201,7 +201,7 @@ static void drm_pagemap_get_devmem_page(struct page *page,
 					struct drm_pagemap_zdd *zdd)
 {
 	page->zone_device_data = drm_pagemap_zdd_get(zdd);
-	zone_device_page_init(page, 0);
+	zone_device_page_init(page, page_pgmap(page), 0);
 }
 
 /**
diff --git a/drivers/gpu/drm/nouveau/nouveau_dmem.c b/drivers/gpu/drm/nouveau/nouveau_dmem.c
index 58071652679d..3d8031296eed 100644
--- a/drivers/gpu/drm/nouveau/nouveau_dmem.c
+++ b/drivers/gpu/drm/nouveau/nouveau_dmem.c
@@ -425,7 +425,7 @@ nouveau_dmem_page_alloc_locked(struct nouveau_drm *drm, bool is_large)
 			order = ilog2(DMEM_CHUNK_NPAGES);
 	}
 
-	zone_device_folio_init(folio, order);
+	zone_device_folio_init(folio, page_pgmap(folio_page(folio, 0)), order);
 	return page;
 }
 
diff --git a/include/linux/memremap.h b/include/linux/memremap.h
index 713ec0435b48..e3c2ccf872a8 100644
--- a/include/linux/memremap.h
+++ b/include/linux/memremap.h
@@ -224,7 +224,8 @@ static inline bool is_fsdax_page(const struct page *page)
 }
 
 #ifdef CONFIG_ZONE_DEVICE
-void zone_device_page_init(struct page *page, unsigned int order);
+void zone_device_page_init(struct page *page, struct dev_pagemap *pgmap,
+			   unsigned int order);
 void *memremap_pages(struct dev_pagemap *pgmap, int nid);
 void memunmap_pages(struct dev_pagemap *pgmap);
 void *devm_memremap_pages(struct device *dev, struct dev_pagemap *pgmap);
@@ -234,9 +235,11 @@ bool pgmap_pfn_valid(struct dev_pagemap *pgmap, unsigned long pfn);
 
 unsigned long memremap_compat_align(void);
 
-static inline void zone_device_folio_init(struct folio *folio, unsigned int order)
+static inline void zone_device_folio_init(struct folio *folio,
+					  struct dev_pagemap *pgmap,
+					  unsigned int order)
 {
-	zone_device_page_init(&folio->page, order);
+	zone_device_page_init(&folio->page, pgmap, order);
 	if (order)
 		folio_set_large_rmappable(folio);
 }
diff --git a/lib/test_hmm.c b/lib/test_hmm.c
index 8af169d3873a..455a6862ae50 100644
--- a/lib/test_hmm.c
+++ b/lib/test_hmm.c
@@ -662,7 +662,9 @@ static struct page *dmirror_devmem_alloc_page(struct dmirror *dmirror,
 			goto error;
 	}
 
-	zone_device_folio_init(page_folio(dpage), order);
+	zone_device_folio_init(page_folio(dpage),
+			       page_pgmap(folio_page(page_folio(dpage), 0)),
+			       order);
 	dpage->zone_device_data = rpage;
 	return dpage;
 
diff --git a/mm/memremap.c b/mm/memremap.c
index 63c6ab4fdf08..ac7be07e3361 100644
--- a/mm/memremap.c
+++ b/mm/memremap.c
@@ -477,10 +477,43 @@ void free_zone_device_folio(struct folio *folio)
 	}
 }
 
-void zone_device_page_init(struct page *page, unsigned int order)
+void zone_device_page_init(struct page *page, struct dev_pagemap *pgmap,
+			   unsigned int order)
 {
+	struct page *new_page = page;
+	unsigned int i;
+
 	VM_WARN_ON_ONCE(order > MAX_ORDER_NR_PAGES);
 
+	for (i = 0; i < (1UL << order); ++i, ++new_page) {
+		struct folio *new_folio = (struct folio *)new_page;
+
+		/*
+		 * new_page could have been part of a previous higher order folio
+		 * which encodes the order, in page + 1, in the flags bits. We
+		 * blindly clear bits which could have set the order field here,
+		 * including page head.
+		 */
+		new_page->flags.f &= ~0xffUL;	/* Clear possible order, page head */
+
+#ifdef NR_PAGES_IN_LARGE_FOLIO
+		/*
+		 * This pointer math looks odd, but new_page could have been
+		 * part of a previous higher order folio, which sets _nr_pages
+		 * in page + 1 (new_page). Therefore, we use pointer casting to
+		 * correctly locate the _nr_pages bits within new_page which
+		 * could have been modified by a previous higher order folio.
+		 */
+		((struct folio *)(new_page - 1))->_nr_pages = 0;
+#endif
+
+		new_folio->mapping = NULL;
+		new_folio->pgmap = pgmap;	/* Also clear compound head */
+		new_folio->share = 0;   /* fsdax only, unused for device private */
+		VM_WARN_ON_FOLIO(folio_ref_count(new_folio), new_folio);
+		VM_WARN_ON_FOLIO(!folio_is_zone_device(new_folio), new_folio);
+	}
+
 	/*
 	 * Drivers shouldn't be allocating pages after calling
 	 * memunmap_pages().
-- 
2.43.0




* [PATCH v6 2/5] drm/pagemap: Unlock and put folios when possible
  2026-01-16 11:10 [PATCH v6 0/5] Enable THP support in drm_pagemap Francois Dugast
  2026-01-16 11:10 ` [PATCH v6 1/5] mm/zone_device: Reinitialize large zone device private folios Francois Dugast
@ 2026-01-16 11:10 ` Francois Dugast
  2026-01-16 11:10 ` [PATCH v6 3/5] drm/pagemap: Add helper to access zone_device_data Francois Dugast
                   ` (2 subsequent siblings)
  4 siblings, 0 replies; 44+ messages in thread
From: Francois Dugast @ 2026-01-16 11:10 UTC (permalink / raw)
  To: intel-xe
  Cc: dri-devel, Francois Dugast, Andrew Morton, David Hildenbrand,
	Lorenzo Stoakes, Liam R . Howlett, Vlastimil Babka,
	Mike Rapoport, Suren Baghdasaryan, Michal Hocko, Zi Yan,
	Alistair Popple, Balbir Singh, linux-mm, Matthew Brost

If the page is part of a folio, unlock and put the whole folio at once
instead of individual pages one after the other. This will reduce the
number of operations once device THPs are in use.
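
For reference, a condensed sketch of the loop shape this change moves to
(the actual change is in the hunk below); the helper name is hypothetical,
the calls are the existing kernel and drm_pagemap APIs:

static void unlock_put_collected_folios(unsigned long *migrate_pfn,
					unsigned long npages)
{
	unsigned long i;

	for (i = 0; i < npages;) {
		unsigned int order = 0;

		if (migrate_pfn[i]) {
			struct folio *folio =
				page_folio(migrate_pfn_to_page(migrate_pfn[i]));

			/* Read the order before dropping our reference. */
			order = folio_order(folio);
			folio_unlock(folio);
			folio_put(folio);
			migrate_pfn[i] = 0;
		}

		/* One folio covers NR_PAGES(order) migrate_pfn entries. */
		i += NR_PAGES(order);
	}
}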

Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: David Hildenbrand <david@kernel.org>
Cc: Lorenzo Stoakes <lorenzo.stoakes@oracle.com>
Cc: Liam R. Howlett <Liam.Howlett@oracle.com>
Cc: Vlastimil Babka <vbabka@suse.cz>
Cc: Mike Rapoport <rppt@kernel.org>
Cc: Suren Baghdasaryan <surenb@google.com>
Cc: Michal Hocko <mhocko@suse.com>
Cc: Zi Yan <ziy@nvidia.com>
Cc: Alistair Popple <apopple@nvidia.com>
Cc: Balbir Singh <balbirs@nvidia.com>
Cc: linux-mm@kvack.org
Suggested-by: Matthew Brost <matthew.brost@intel.com>
Reviewed-by: Matthew Brost <matthew.brost@intel.com>
Signed-off-by: Francois Dugast <francois.dugast@intel.com>
Reviewed-by: Balbir Singh <balbirs@nvidia.com>
---
 drivers/gpu/drm/drm_pagemap.c | 26 +++++++++++++++++---------
 1 file changed, 17 insertions(+), 9 deletions(-)

diff --git a/drivers/gpu/drm/drm_pagemap.c b/drivers/gpu/drm/drm_pagemap.c
index 38eca94f01a1..8e24a2c24729 100644
--- a/drivers/gpu/drm/drm_pagemap.c
+++ b/drivers/gpu/drm/drm_pagemap.c
@@ -154,15 +154,15 @@ static void drm_pagemap_zdd_put(struct drm_pagemap_zdd *zdd)
 }
 
 /**
- * drm_pagemap_migration_unlock_put_page() - Put a migration page
- * @page: Pointer to the page to put
+ * drm_pagemap_migration_unlock_put_folio() - Put a migration folio
+ * @folio: Pointer to the folio to put
  *
- * This function unlocks and puts a page.
+ * This function unlocks and puts a folio.
  */
-static void drm_pagemap_migration_unlock_put_page(struct page *page)
+static void drm_pagemap_migration_unlock_put_folio(struct folio *folio)
 {
-	unlock_page(page);
-	put_page(page);
+	folio_unlock(folio);
+	folio_put(folio);
 }
 
 /**
@@ -177,15 +177,23 @@ static void drm_pagemap_migration_unlock_put_pages(unsigned long npages,
 {
 	unsigned long i;
 
-	for (i = 0; i < npages; ++i) {
+	for (i = 0; i < npages;) {
 		struct page *page;
+		struct folio *folio;
+		unsigned int order = 0;
 
 		if (!migrate_pfn[i])
-			continue;
+			goto next;
 
 		page = migrate_pfn_to_page(migrate_pfn[i]);
-		drm_pagemap_migration_unlock_put_page(page);
+		folio = page_folio(page);
+		order = folio_order(folio);
+
+		drm_pagemap_migration_unlock_put_folio(folio);
 		migrate_pfn[i] = 0;
+
+next:
+		i += NR_PAGES(order);
 	}
 }
 
-- 
2.43.0




* [PATCH v6 3/5] drm/pagemap: Add helper to access zone_device_data
  2026-01-16 11:10 [PATCH v6 0/5] Enable THP support in drm_pagemap Francois Dugast
  2026-01-16 11:10 ` [PATCH v6 1/5] mm/zone_device: Reinitialize large zone device private folios Francois Dugast
  2026-01-16 11:10 ` [PATCH v6 2/5] drm/pagemap: Unlock and put folios when possible Francois Dugast
@ 2026-01-16 11:10 ` Francois Dugast
  2026-01-16 11:10 ` [PATCH v6 4/5] drm/pagemap: Correct cpages calculation for migrate_vma_setup Francois Dugast
  2026-01-16 11:10 ` [PATCH v6 5/5] drm/pagemap: Enable THP support for GPU memory migration Francois Dugast
  4 siblings, 0 replies; 44+ messages in thread
From: Francois Dugast @ 2026-01-16 11:10 UTC (permalink / raw)
  To: intel-xe
  Cc: dri-devel, Francois Dugast, Andrew Morton, David Hildenbrand,
	Lorenzo Stoakes, Liam R . Howlett, Vlastimil Babka,
	Mike Rapoport, Suren Baghdasaryan, Michal Hocko, Zi Yan,
	Alistair Popple, Balbir Singh, linux-mm, Matthew Brost

This new helper ensures that all accesses to zone_device_data use the
correct API, whether or not the page is part of a folio.
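
For illustration, the conversion pattern applied in the hunks below; the
helper first resolves the page to its folio, so the same call is correct
for an order-0 page and for any page belonging to a large folio:

	struct drm_pagemap_zdd *zdd;

	/* Before: direct field access on the page. */
	zdd = page->zone_device_data;

	/* After: folio-aware helper introduced by this patch. */
	zdd = drm_pagemap_page_zone_device_data(page);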

v2:
- Move to drm_pagemap.h, stick to folio_zone_device_data (Matthew Brost)
- Return struct drm_pagemap_zdd * (Matthew Brost)

Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: David Hildenbrand <david@kernel.org>
Cc: Lorenzo Stoakes <lorenzo.stoakes@oracle.com>
Cc: Liam R. Howlett <Liam.Howlett@oracle.com>
Cc: Vlastimil Babka <vbabka@suse.cz>
Cc: Mike Rapoport <rppt@kernel.org>
Cc: Suren Baghdasaryan <surenb@google.com>
Cc: Michal Hocko <mhocko@suse.com>
Cc: Zi Yan <ziy@nvidia.com>
Cc: Alistair Popple <apopple@nvidia.com>
Cc: Balbir Singh <balbirs@nvidia.com>
Cc: linux-mm@kvack.org
Suggested-by: Matthew Brost <matthew.brost@intel.com>
Reviewed-by: Matthew Brost <matthew.brost@intel.com>
Signed-off-by: Francois Dugast <francois.dugast@intel.com>
---
 drivers/gpu/drm/drm_gpusvm.c  |  7 +++++--
 drivers/gpu/drm/drm_pagemap.c | 21 ++++++++++++---------
 include/drm/drm_pagemap.h     | 15 +++++++++++++++
 3 files changed, 32 insertions(+), 11 deletions(-)

diff --git a/drivers/gpu/drm/drm_gpusvm.c b/drivers/gpu/drm/drm_gpusvm.c
index aa9a0b60e727..585d913d3d19 100644
--- a/drivers/gpu/drm/drm_gpusvm.c
+++ b/drivers/gpu/drm/drm_gpusvm.c
@@ -1488,12 +1488,15 @@ int drm_gpusvm_get_pages(struct drm_gpusvm *gpusvm,
 		order = drm_gpusvm_hmm_pfn_to_order(pfns[i], i, npages);
 		if (is_device_private_page(page) ||
 		    is_device_coherent_page(page)) {
+			struct drm_pagemap_zdd *__zdd =
+				drm_pagemap_page_zone_device_data(page);
+
 			if (!ctx->allow_mixed &&
-			    zdd != page->zone_device_data && i > 0) {
+			    zdd != __zdd && i > 0) {
 				err = -EOPNOTSUPP;
 				goto err_unmap;
 			}
-			zdd = page->zone_device_data;
+			zdd = __zdd;
 			if (pagemap != page_pgmap(page)) {
 				if (i > 0) {
 					err = -EOPNOTSUPP;
diff --git a/drivers/gpu/drm/drm_pagemap.c b/drivers/gpu/drm/drm_pagemap.c
index 8e24a2c24729..61c6ca59df81 100644
--- a/drivers/gpu/drm/drm_pagemap.c
+++ b/drivers/gpu/drm/drm_pagemap.c
@@ -252,7 +252,7 @@ static int drm_pagemap_migrate_map_pages(struct device *dev,
 		order = folio_order(folio);
 
 		if (is_device_private_page(page)) {
-			struct drm_pagemap_zdd *zdd = page->zone_device_data;
+			struct drm_pagemap_zdd *zdd = drm_pagemap_page_zone_device_data(page);
 			struct drm_pagemap *dpagemap = zdd->dpagemap;
 			struct drm_pagemap_addr addr;
 
@@ -323,7 +323,7 @@ static void drm_pagemap_migrate_unmap_pages(struct device *dev,
 			goto next;
 
 		if (is_zone_device_page(page)) {
-			struct drm_pagemap_zdd *zdd = page->zone_device_data;
+			struct drm_pagemap_zdd *zdd = drm_pagemap_page_zone_device_data(page);
 			struct drm_pagemap *dpagemap = zdd->dpagemap;
 
 			dpagemap->ops->device_unmap(dpagemap, dev, pagemap_addr[i]);
@@ -611,7 +611,8 @@ int drm_pagemap_migrate_to_devmem(struct drm_pagemap_devmem *devmem_allocation,
 
 		pages[i] = NULL;
 		if (src_page && is_device_private_page(src_page)) {
-			struct drm_pagemap_zdd *src_zdd = src_page->zone_device_data;
+			struct drm_pagemap_zdd *src_zdd =
+				drm_pagemap_page_zone_device_data(src_page);
 
 			if (page_pgmap(src_page) == pagemap &&
 			    !mdetails->can_migrate_same_pagemap) {
@@ -733,8 +734,8 @@ static int drm_pagemap_migrate_populate_ram_pfn(struct vm_area_struct *vas,
 			goto next;
 
 		if (fault_page) {
-			if (src_page->zone_device_data !=
-			    fault_page->zone_device_data)
+			if (drm_pagemap_page_zone_device_data(src_page) !=
+			    drm_pagemap_page_zone_device_data(fault_page))
 				goto next;
 		}
 
@@ -1075,7 +1076,7 @@ static int __drm_pagemap_migrate_to_ram(struct vm_area_struct *vas,
 	void *buf;
 	int i, err = 0;
 
-	zdd = page->zone_device_data;
+	zdd = drm_pagemap_page_zone_device_data(page);
 	if (time_before64(get_jiffies_64(), zdd->devmem_allocation->timeslice_expiration))
 		return 0;
 
@@ -1158,7 +1159,9 @@ static int __drm_pagemap_migrate_to_ram(struct vm_area_struct *vas,
  */
 static void drm_pagemap_folio_free(struct folio *folio)
 {
-	drm_pagemap_zdd_put(folio->page.zone_device_data);
+	struct page *page = folio_page(folio, 0);
+
+	drm_pagemap_zdd_put(drm_pagemap_page_zone_device_data(page));
 }
 
 /**
@@ -1174,7 +1177,7 @@ static void drm_pagemap_folio_free(struct folio *folio)
  */
 static vm_fault_t drm_pagemap_migrate_to_ram(struct vm_fault *vmf)
 {
-	struct drm_pagemap_zdd *zdd = vmf->page->zone_device_data;
+	struct drm_pagemap_zdd *zdd = drm_pagemap_page_zone_device_data(vmf->page);
 	int err;
 
 	err = __drm_pagemap_migrate_to_ram(vmf->vma,
@@ -1240,7 +1243,7 @@ EXPORT_SYMBOL_GPL(drm_pagemap_devmem_init);
  */
 struct drm_pagemap *drm_pagemap_page_to_dpagemap(struct page *page)
 {
-	struct drm_pagemap_zdd *zdd = page->zone_device_data;
+	struct drm_pagemap_zdd *zdd = drm_pagemap_page_zone_device_data(page);
 
 	return zdd->devmem_allocation->dpagemap;
 }
diff --git a/include/drm/drm_pagemap.h b/include/drm/drm_pagemap.h
index 46e9c58f09e0..736fb6cb7b33 100644
--- a/include/drm/drm_pagemap.h
+++ b/include/drm/drm_pagemap.h
@@ -4,6 +4,7 @@
 
 #include <linux/dma-direction.h>
 #include <linux/hmm.h>
+#include <linux/memremap.h>
 #include <linux/types.h>
 
 #define NR_PAGES(order) (1U << (order))
@@ -359,4 +360,18 @@ int drm_pagemap_populate_mm(struct drm_pagemap *dpagemap,
 void drm_pagemap_destroy(struct drm_pagemap *dpagemap, bool is_atomic_or_reclaim);
 
 int drm_pagemap_reinit(struct drm_pagemap *dpagemap);
+
+/**
+ * drm_pagemap_page_zone_device_data() - Page to zone_device_data
+ * @page: Pointer to the page
+ *
+ * Return: Page's zone_device_data
+ */
+static inline struct drm_pagemap_zdd *drm_pagemap_page_zone_device_data(struct page *page)
+{
+	struct folio *folio = page_folio(page);
+
+	return folio_zone_device_data(folio);
+}
+
 #endif
-- 
2.43.0




* [PATCH v6 4/5] drm/pagemap: Correct cpages calculation for migrate_vma_setup
  2026-01-16 11:10 [PATCH v6 0/5] Enable THP support in drm_pagemap Francois Dugast
                   ` (2 preceding siblings ...)
  2026-01-16 11:10 ` [PATCH v6 3/5] drm/pagemap: Add helper to access zone_device_data Francois Dugast
@ 2026-01-16 11:10 ` Francois Dugast
  2026-01-16 11:37   ` Balbir Singh
  2026-01-16 11:10 ` [PATCH v6 5/5] drm/pagemap: Enable THP support for GPU memory migration Francois Dugast
  4 siblings, 1 reply; 44+ messages in thread
From: Francois Dugast @ 2026-01-16 11:10 UTC (permalink / raw)
  To: intel-xe
  Cc: dri-devel, Matthew Brost, Andrew Morton, David Hildenbrand,
	Lorenzo Stoakes, Liam R . Howlett, Vlastimil Babka,
	Mike Rapoport, Suren Baghdasaryan, Michal Hocko, Zi Yan,
	Alistair Popple, Balbir Singh, linux-mm, Francois Dugast

From: Matthew Brost <matthew.brost@intel.com>

cpages returned from migrate_vma_setup represents the total number of
individual pages found, not the number of 4K pages. The math in
drm_pagemap_migrate_to_devmem for npages is based on the number of 4K
pages, so cpages != npages can fail even if the entire memory range is
found in migrate_vma_setup (e.g., when a single 2M page is found).
Add drm_pagemap_cpages, which converts cpages to the number of 4K pages
found.
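
As a concrete example: migrating a 2M range gives npages = 512 in
drm_pagemap_migrate_to_devmem, but if migrate_vma_setup collects the whole
range as a single 2M THP it reports cpages = 1, so the old cpages != npages
check rejects a fully collected range; drm_pagemap_cpages decodes the order
of each collected entry and returns 512 instead.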

Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: David Hildenbrand <david@kernel.org>
Cc: Lorenzo Stoakes <lorenzo.stoakes@oracle.com>
Cc: Liam R. Howlett <Liam.Howlett@oracle.com>
Cc: Vlastimil Babka <vbabka@suse.cz>
Cc: Mike Rapoport <rppt@kernel.org>
Cc: Suren Baghdasaryan <surenb@google.com>
Cc: Michal Hocko <mhocko@suse.com>
Cc: Zi Yan <ziy@nvidia.com>
Cc: Alistair Popple <apopple@nvidia.com>
Cc: Balbir Singh <balbirs@nvidia.com>
Cc: linux-mm@kvack.org
Signed-off-by: Matthew Brost <matthew.brost@intel.com>
Reviewed-by: Francois Dugast <francois.dugast@intel.com>
Signed-off-by: Francois Dugast <francois.dugast@intel.com>
---
 drivers/gpu/drm/drm_pagemap.c | 38 ++++++++++++++++++++++++++++++++++-
 1 file changed, 37 insertions(+), 1 deletion(-)

diff --git a/drivers/gpu/drm/drm_pagemap.c b/drivers/gpu/drm/drm_pagemap.c
index 61c6ca59df81..801da343f0a6 100644
--- a/drivers/gpu/drm/drm_pagemap.c
+++ b/drivers/gpu/drm/drm_pagemap.c
@@ -452,6 +452,41 @@ static int drm_pagemap_migrate_range(struct drm_pagemap_devmem *devmem,
 	return ret;
 }
 
+/**
+ * drm_pagemap_cpages() - Count collected pages
+ * @migrate_pfn: Array of migrate_pfn entries to account
+ * @npages: Number of entries in @migrate_pfn
+ *
+ * Compute the total number of minimum-sized pages represented by the
+ * collected entries in @migrate_pfn. The total is derived from the
+ * order encoded in each entry.
+ *
+ * Return: Total number of minimum-sized pages.
+ */
+static int drm_pagemap_cpages(unsigned long *migrate_pfn, unsigned long npages)
+{
+	unsigned long i, cpages = 0;
+
+	for (i = 0; i < npages;) {
+		struct page *page = migrate_pfn_to_page(migrate_pfn[i]);
+		struct folio *folio;
+		unsigned int order = 0;
+
+		if (page) {
+			folio = page_folio(page);
+			order = folio_order(folio);
+			cpages += NR_PAGES(order);
+		} else if (migrate_pfn[i] & MIGRATE_PFN_COMPOUND) {
+			order = HPAGE_PMD_ORDER;
+			cpages += NR_PAGES(order);
+		}
+
+		i += NR_PAGES(order);
+	}
+
+	return cpages;
+}
+
 /**
  * drm_pagemap_migrate_to_devmem() - Migrate a struct mm_struct range to device memory
  * @devmem_allocation: The device memory allocation to migrate to.
@@ -564,7 +599,8 @@ int drm_pagemap_migrate_to_devmem(struct drm_pagemap_devmem *devmem_allocation,
 		goto err_free;
 	}
 
-	if (migrate.cpages != npages) {
+	if (migrate.cpages != npages &&
+	    drm_pagemap_cpages(migrate.src, npages) != npages) {
 		/*
 		 * Some pages to migrate. But we want to migrate all or
 		 * nothing. Raced or unknown device pages.
-- 
2.43.0




* [PATCH v6 5/5] drm/pagemap: Enable THP support for GPU memory migration
  2026-01-16 11:10 [PATCH v6 0/5] Enable THP support in drm_pagemap Francois Dugast
                   ` (3 preceding siblings ...)
  2026-01-16 11:10 ` [PATCH v6 4/5] drm/pagemap: Correct cpages calculation for migrate_vma_setup Francois Dugast
@ 2026-01-16 11:10 ` Francois Dugast
  4 siblings, 0 replies; 44+ messages in thread
From: Francois Dugast @ 2026-01-16 11:10 UTC (permalink / raw)
  To: intel-xe
  Cc: dri-devel, Francois Dugast, Matthew Brost, Thomas Hellström,
	Michal Mrozek, Andrew Morton, David Hildenbrand, Lorenzo Stoakes,
	Liam R . Howlett, Vlastimil Babka, Mike Rapoport,
	Suren Baghdasaryan, Michal Hocko, Zi Yan, Alistair Popple,
	Balbir Singh, linux-mm

This enables support for Transparent Huge Pages (THP) for device pages by
using MIGRATE_VMA_SELECT_COMPOUND during migration. It removes the need to
split folios and loop multiple times over all pages to perform required
operations at page level. Instead, we rely on the newly introduced support
for higher orders in drm_pagemap and on the folio-level API.
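
The opt-in is a flag at migrate_vma setup time; a minimal sketch matching
the hunks below, where the non-flag field values are placeholders:

	struct migrate_vma migrate = {
		.vma		= vas,
		.start		= start,
		.end		= end,
		.pgmap_owner	= owner,
		/* Also collect PMD-sized (THP) pages as single compound entries. */
		.flags		= MIGRATE_VMA_SELECT_SYSTEM |
				  MIGRATE_VMA_SELECT_DEVICE_COHERENT |
				  MIGRATE_VMA_SELECT_COMPOUND,
	};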

In Xe, this drastically improves performance when using SVM. The GT stats
below collected after a 2MB page fault show overall servicing is more than
7 times faster, and thanks to reduced CPU overhead the time spent on the
actual copy goes from 23% without THP to 80% with THP:

Without THP:

    svm_2M_pagefault_us: 966
    svm_2M_migrate_us: 942
    svm_2M_device_copy_us: 223
    svm_2M_get_pages_us: 9
    svm_2M_bind_us: 10

With THP:

    svm_2M_pagefault_us: 132
    svm_2M_migrate_us: 128
    svm_2M_device_copy_us: 106
    svm_2M_get_pages_us: 1
    svm_2M_bind_us: 2

v2:
- Fix one occurrence of drm_pagemap_get_devmem_page() (Matthew Brost)

v3:
- Remove migrate_device_split_page() and folio_split_lock, instead rely on
  free_zone_device_folio() to split folios before freeing (Matthew Brost)
- Assert folio order is HPAGE_PMD_ORDER (Matthew Brost)
- Always use folio_set_zone_device_data() in split (Matthew Brost)

v4:
- Warn on compound device page, s/continue/goto next/ (Matthew Brost)

v5:
- Revert warn on compound device page
- s/zone_device_page_init()/zone_device_folio_init() (Matthew Brost)

Cc: Matthew Brost <matthew.brost@intel.com>
Cc: Thomas Hellström <thomas.hellstrom@linux.intel.com>
Cc: Michal Mrozek <michal.mrozek@intel.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: David Hildenbrand <david@kernel.org>
Cc: Lorenzo Stoakes <lorenzo.stoakes@oracle.com>
Cc: Liam R. Howlett <Liam.Howlett@oracle.com>
Cc: Vlastimil Babka <vbabka@suse.cz>
Cc: Mike Rapoport <rppt@kernel.org>
Cc: Suren Baghdasaryan <surenb@google.com>
Cc: Michal Hocko <mhocko@suse.com>
Cc: Zi Yan <ziy@nvidia.com>
Cc: Alistair Popple <apopple@nvidia.com>
Cc: Balbir Singh <balbirs@nvidia.com>
Cc: linux-mm@kvack.org
Reviewed-by: Matthew Brost <matthew.brost@intel.com>
Signed-off-by: Francois Dugast <francois.dugast@intel.com>
---
 drivers/gpu/drm/drm_pagemap.c | 73 ++++++++++++++++++++++++++++++-----
 1 file changed, 63 insertions(+), 10 deletions(-)

diff --git a/drivers/gpu/drm/drm_pagemap.c b/drivers/gpu/drm/drm_pagemap.c
index 801da343f0a6..e2aecd519f14 100644
--- a/drivers/gpu/drm/drm_pagemap.c
+++ b/drivers/gpu/drm/drm_pagemap.c
@@ -200,16 +200,19 @@ static void drm_pagemap_migration_unlock_put_pages(unsigned long npages,
 /**
  * drm_pagemap_get_devmem_page() - Get a reference to a device memory page
  * @page: Pointer to the page
+ * @order: Order
  * @zdd: Pointer to the GPU SVM zone device data
  *
  * This function associates the given page with the specified GPU SVM zone
  * device data and initializes it for zone device usage.
  */
 static void drm_pagemap_get_devmem_page(struct page *page,
+					unsigned int order,
 					struct drm_pagemap_zdd *zdd)
 {
-	page->zone_device_data = drm_pagemap_zdd_get(zdd);
-	zone_device_page_init(page, page_pgmap(page), 0);
+	zone_device_folio_init((struct folio *)page, zdd->dpagemap->pagemap,
+			       order);
+	folio_set_zone_device_data(page_folio(page), drm_pagemap_zdd_get(zdd));
 }
 
 /**
@@ -534,7 +537,8 @@ int drm_pagemap_migrate_to_devmem(struct drm_pagemap_devmem *devmem_allocation,
 		 * rare and only occur when the madvise attributes of memory are
 		 * changed or atomics are being used.
 		 */
-		.flags		= MIGRATE_VMA_SELECT_SYSTEM | MIGRATE_VMA_SELECT_DEVICE_COHERENT,
+		.flags		= MIGRATE_VMA_SELECT_SYSTEM | MIGRATE_VMA_SELECT_DEVICE_COHERENT |
+				  MIGRATE_VMA_SELECT_COMPOUND,
 	};
 	unsigned long i, npages = npages_in_range(start, end);
 	unsigned long own_pages = 0, migrated_pages = 0;
@@ -640,11 +644,13 @@ int drm_pagemap_migrate_to_devmem(struct drm_pagemap_devmem *devmem_allocation,
 
 	own_pages = 0;
 
-	for (i = 0; i < npages; ++i) {
+	for (i = 0; i < npages;) {
+		unsigned long j;
 		struct page *page = pfn_to_page(migrate.dst[i]);
 		struct page *src_page = migrate_pfn_to_page(migrate.src[i]);
-		cur.start = i;
+		unsigned int order = 0;
 
+		cur.start = i;
 		pages[i] = NULL;
 		if (src_page && is_device_private_page(src_page)) {
 			struct drm_pagemap_zdd *src_zdd =
@@ -654,7 +660,7 @@ int drm_pagemap_migrate_to_devmem(struct drm_pagemap_devmem *devmem_allocation,
 			    !mdetails->can_migrate_same_pagemap) {
 				migrate.dst[i] = 0;
 				own_pages++;
-				continue;
+				goto next;
 			}
 			if (mdetails->source_peer_migrates) {
 				cur.dpagemap = src_zdd->dpagemap;
@@ -670,7 +676,20 @@ int drm_pagemap_migrate_to_devmem(struct drm_pagemap_devmem *devmem_allocation,
 			pages[i] = page;
 		}
 		migrate.dst[i] = migrate_pfn(migrate.dst[i]);
-		drm_pagemap_get_devmem_page(page, zdd);
+
+		if (migrate.src[i] & MIGRATE_PFN_COMPOUND) {
+			drm_WARN_ONCE(dpagemap->drm, src_page &&
+				      folio_order(page_folio(src_page)) != HPAGE_PMD_ORDER,
+				      "Unexpected folio order\n");
+
+			order = HPAGE_PMD_ORDER;
+			migrate.dst[i] |= MIGRATE_PFN_COMPOUND;
+
+			for (j = 1; j < NR_PAGES(order) && i + j < npages; j++)
+				migrate.dst[i + j] = 0;
+		}
+
+		drm_pagemap_get_devmem_page(page, order, zdd);
 
 		/* If we switched the migrating drm_pagemap, migrate previous pages now */
 		err = drm_pagemap_migrate_range(devmem_allocation, migrate.src, migrate.dst,
@@ -680,7 +699,11 @@ int drm_pagemap_migrate_to_devmem(struct drm_pagemap_devmem *devmem_allocation,
 			npages = i + 1;
 			goto err_finalize;
 		}
+
+next:
+		i += NR_PAGES(order);
 	}
+
 	cur.start = npages;
 	cur.ops = NULL; /* Force migration */
 	err = drm_pagemap_migrate_range(devmem_allocation, migrate.src, migrate.dst,
@@ -789,6 +812,8 @@ static int drm_pagemap_migrate_populate_ram_pfn(struct vm_area_struct *vas,
 		page = folio_page(folio, 0);
 		mpfn[i] = migrate_pfn(page_to_pfn(page));
 
+		if (order)
+			mpfn[i] |= MIGRATE_PFN_COMPOUND;
 next:
 		if (page)
 			addr += page_size(page);
@@ -1044,8 +1069,15 @@ int drm_pagemap_evict_to_ram(struct drm_pagemap_devmem *devmem_allocation)
 	if (err)
 		goto err_finalize;
 
-	for (i = 0; i < npages; ++i)
+	for (i = 0; i < npages;) {
+		unsigned int order = 0;
+
 		pages[i] = migrate_pfn_to_page(src[i]);
+		if (pages[i])
+			order = folio_order(page_folio(pages[i]));
+
+		i += NR_PAGES(order);
+	}
 
 	err = ops->copy_to_ram(pages, pagemap_addr, npages, NULL);
 	if (err)
@@ -1098,7 +1130,8 @@ static int __drm_pagemap_migrate_to_ram(struct vm_area_struct *vas,
 		.vma		= vas,
 		.pgmap_owner	= page_pgmap(page)->owner,
 		.flags		= MIGRATE_VMA_SELECT_DEVICE_PRIVATE |
-		MIGRATE_VMA_SELECT_DEVICE_COHERENT,
+				  MIGRATE_VMA_SELECT_DEVICE_COHERENT |
+				  MIGRATE_VMA_SELECT_COMPOUND,
 		.fault_page	= page,
 	};
 	struct drm_pagemap_migrate_details mdetails = {};
@@ -1164,8 +1197,15 @@ static int __drm_pagemap_migrate_to_ram(struct vm_area_struct *vas,
 	if (err)
 		goto err_finalize;
 
-	for (i = 0; i < npages; ++i)
+	for (i = 0; i < npages;) {
+		unsigned int order = 0;
+
 		pages[i] = migrate_pfn_to_page(migrate.src[i]);
+		if (pages[i])
+			order = folio_order(page_folio(pages[i]));
+
+		i += NR_PAGES(order);
+	}
 
 	err = ops->copy_to_ram(pages, pagemap_addr, npages, NULL);
 	if (err)
@@ -1223,9 +1263,22 @@ static vm_fault_t drm_pagemap_migrate_to_ram(struct vm_fault *vmf)
 	return err ? VM_FAULT_SIGBUS : 0;
 }
 
+static void drm_pagemap_folio_split(struct folio *orig_folio, struct folio *new_folio)
+{
+	struct drm_pagemap_zdd *zdd;
+
+	if (!new_folio)
+		return;
+
+	new_folio->pgmap = orig_folio->pgmap;
+	zdd = folio_zone_device_data(orig_folio);
+	folio_set_zone_device_data(new_folio, drm_pagemap_zdd_get(zdd));
+}
+
 static const struct dev_pagemap_ops drm_pagemap_pagemap_ops = {
 	.folio_free = drm_pagemap_folio_free,
 	.migrate_to_ram = drm_pagemap_migrate_to_ram,
+	.folio_split = drm_pagemap_folio_split,
 };
 
 /**
-- 
2.43.0




* Re: [PATCH v6 4/5] drm/pagemap: Correct cpages calculation for migrate_vma_setup
  2026-01-16 11:10 ` [PATCH v6 4/5] drm/pagemap: Correct cpages calculation for migrate_vma_setup Francois Dugast
@ 2026-01-16 11:37   ` Balbir Singh
  2026-01-16 12:02     ` Francois Dugast
  0 siblings, 1 reply; 44+ messages in thread
From: Balbir Singh @ 2026-01-16 11:37 UTC (permalink / raw)
  To: Francois Dugast, intel-xe
  Cc: dri-devel, Matthew Brost, Andrew Morton, David Hildenbrand,
	Lorenzo Stoakes, Liam R . Howlett, Vlastimil Babka,
	Mike Rapoport, Suren Baghdasaryan, Michal Hocko, Zi Yan,
	Alistair Popple, linux-mm

On 1/16/26 22:10, Francois Dugast wrote:
> From: Matthew Brost <matthew.brost@intel.com>
> 
> cpages returned from migrate_vma_setup represents the total number of
> individual pages found, not the number of 4K pages. The math in
> drm_pagemap_migrate_to_devmem for npages is based on the number of 4K
> pages, so cpages != npages can fail even if the entire memory range is
> found in migrate_vma_setup (e.g., when a single 2M page is found).
> Add drm_pagemap_cpages, which converts cpages to the number of 4K pages
> found.
> 
> Cc: Andrew Morton <akpm@linux-foundation.org>
> Cc: David Hildenbrand <david@kernel.org>
> Cc: Lorenzo Stoakes <lorenzo.stoakes@oracle.com>
> Cc: Liam R. Howlett <Liam.Howlett@oracle.com>
> Cc: Vlastimil Babka <vbabka@suse.cz>
> Cc: Mike Rapoport <rppt@kernel.org>
> Cc: Suren Baghdasaryan <surenb@google.com>
> Cc: Michal Hocko <mhocko@suse.com>
> Cc: Zi Yan <ziy@nvidia.com>
> Cc: Alistair Popple <apopple@nvidia.com>
> Cc: Balbir Singh <balbirs@nvidia.com>
> Cc: linux-mm@kvack.org
> Signed-off-by: Matthew Brost <matthew.brost@intel.com>
> Reviewed-by: Francois Dugast <francois.dugast@intel.com>
> Signed-off-by: Francois Dugast <francois.dugast@intel.com>
> ---
>  drivers/gpu/drm/drm_pagemap.c | 38 ++++++++++++++++++++++++++++++++++-
>  1 file changed, 37 insertions(+), 1 deletion(-)
> 
> diff --git a/drivers/gpu/drm/drm_pagemap.c b/drivers/gpu/drm/drm_pagemap.c
> index 61c6ca59df81..801da343f0a6 100644
> --- a/drivers/gpu/drm/drm_pagemap.c
> +++ b/drivers/gpu/drm/drm_pagemap.c
> @@ -452,6 +452,41 @@ static int drm_pagemap_migrate_range(struct drm_pagemap_devmem *devmem,
>  	return ret;
>  }
>  
> +/**
> + * drm_pagemap_cpages() - Count collected pages
> + * @migrate_pfn: Array of migrate_pfn entries to account
> + * @npages: Number of entries in @migrate_pfn
> + *
> + * Compute the total number of minimum-sized pages represented by the
> + * collected entries in @migrate_pfn. The total is derived from the
> + * order encoded in each entry.
> + *
> + * Return: Total number of minimum-sized pages.
> + */
> +static int drm_pagemap_cpages(unsigned long *migrate_pfn, unsigned long npages)
> +{
> +	unsigned long i, cpages = 0;
> +
> +	for (i = 0; i < npages;) {
> +		struct page *page = migrate_pfn_to_page(migrate_pfn[i]);
> +		struct folio *folio;
> +		unsigned int order = 0;
> +
> +		if (page) {
> +			folio = page_folio(page);
> +			order = folio_order(folio);
> +			cpages += NR_PAGES(order);
> +		} else if (migrate_pfn[i] & MIGRATE_PFN_COMPOUND) {
> +			order = HPAGE_PMD_ORDER;
> +			cpages += NR_PAGES(order);
> +		}
> +
> +		i += NR_PAGES(order);
> +	}
> +
> +	return cpages;
> +}
> +
>  /**
>   * drm_pagemap_migrate_to_devmem() - Migrate a struct mm_struct range to device memory
>   * @devmem_allocation: The device memory allocation to migrate to.
> @@ -564,7 +599,8 @@ int drm_pagemap_migrate_to_devmem(struct drm_pagemap_devmem *devmem_allocation,
>  		goto err_free;
>  	}
>  
> -	if (migrate.cpages != npages) {
> +	if (migrate.cpages != npages &&
> +	    drm_pagemap_cpages(migrate.src, npages) != npages) {
>  		/*
>  		 * Some pages to migrate. But we want to migrate all or
>  		 * nothing. Raced or unknown device pages.

I thought I did for the previous revision, but

Reviewed-by: Balbir Singh <balbirs@nvidia.com>



* Re: [PATCH v6 4/5] drm/pagemap: Correct cpages calculation for migrate_vma_setup
  2026-01-16 11:37   ` Balbir Singh
@ 2026-01-16 12:02     ` Francois Dugast
  0 siblings, 0 replies; 44+ messages in thread
From: Francois Dugast @ 2026-01-16 12:02 UTC (permalink / raw)
  To: Balbir Singh
  Cc: intel-xe, dri-devel, Matthew Brost, Andrew Morton,
	David Hildenbrand, Lorenzo Stoakes, Liam R . Howlett,
	Vlastimil Babka, Mike Rapoport, Suren Baghdasaryan, Michal Hocko,
	Zi Yan, Alistair Popple, linux-mm

On Fri, Jan 16, 2026 at 10:37:15PM +1100, Balbir Singh wrote:
> On 1/16/26 22:10, Francois Dugast wrote:
> > From: Matthew Brost <matthew.brost@intel.com>
> > 
> > cpages returned from migrate_vma_setup represents the total number of
> > individual pages found, not the number of 4K pages. The math in
> > drm_pagemap_migrate_to_devmem for npages is based on the number of 4K
> > pages, so cpages != npages can fail even if the entire memory range is
> > found in migrate_vma_setup (e.g., when a single 2M page is found).
> > Add drm_pagemap_cpages, which converts cpages to the number of 4K pages
> > found.
> > 
> > Cc: Andrew Morton <akpm@linux-foundation.org>
> > Cc: David Hildenbrand <david@kernel.org>
> > Cc: Lorenzo Stoakes <lorenzo.stoakes@oracle.com>
> > Cc: Liam R. Howlett <Liam.Howlett@oracle.com>
> > Cc: Vlastimil Babka <vbabka@suse.cz>
> > Cc: Mike Rapoport <rppt@kernel.org>
> > Cc: Suren Baghdasaryan <surenb@google.com>
> > Cc: Michal Hocko <mhocko@suse.com>
> > Cc: Zi Yan <ziy@nvidia.com>
> > Cc: Alistair Popple <apopple@nvidia.com>
> > Cc: Balbir Singh <balbirs@nvidia.com>
> > Cc: linux-mm@kvack.org
> > Signed-off-by: Matthew Brost <matthew.brost@intel.com>
> > Reviewed-by: Francois Dugast <francois.dugast@intel.com>
> > Signed-off-by: Francois Dugast <francois.dugast@intel.com>
> > ---
> >  drivers/gpu/drm/drm_pagemap.c | 38 ++++++++++++++++++++++++++++++++++-
> >  1 file changed, 37 insertions(+), 1 deletion(-)
> > 
> > diff --git a/drivers/gpu/drm/drm_pagemap.c b/drivers/gpu/drm/drm_pagemap.c
> > index 61c6ca59df81..801da343f0a6 100644
> > --- a/drivers/gpu/drm/drm_pagemap.c
> > +++ b/drivers/gpu/drm/drm_pagemap.c
> > @@ -452,6 +452,41 @@ static int drm_pagemap_migrate_range(struct drm_pagemap_devmem *devmem,
> >  	return ret;
> >  }
> >  
> > +/**
> > + * drm_pagemap_cpages() - Count collected pages
> > + * @migrate_pfn: Array of migrate_pfn entries to account
> > + * @npages: Number of entries in @migrate_pfn
> > + *
> > + * Compute the total number of minimum-sized pages represented by the
> > + * collected entries in @migrate_pfn. The total is derived from the
> > + * order encoded in each entry.
> > + *
> > + * Return: Total number of minimum-sized pages.
> > + */
> > +static int drm_pagemap_cpages(unsigned long *migrate_pfn, unsigned long npages)
> > +{
> > +	unsigned long i, cpages = 0;
> > +
> > +	for (i = 0; i < npages;) {
> > +		struct page *page = migrate_pfn_to_page(migrate_pfn[i]);
> > +		struct folio *folio;
> > +		unsigned int order = 0;
> > +
> > +		if (page) {
> > +			folio = page_folio(page);
> > +			order = folio_order(folio);
> > +			cpages += NR_PAGES(order);
> > +		} else if (migrate_pfn[i] & MIGRATE_PFN_COMPOUND) {
> > +			order = HPAGE_PMD_ORDER;
> > +			cpages += NR_PAGES(order);
> > +		}
> > +
> > +		i += NR_PAGES(order);
> > +	}
> > +
> > +	return cpages;
> > +}
> > +
> >  /**
> >   * drm_pagemap_migrate_to_devmem() - Migrate a struct mm_struct range to device memory
> >   * @devmem_allocation: The device memory allocation to migrate to.
> > @@ -564,7 +599,8 @@ int drm_pagemap_migrate_to_devmem(struct drm_pagemap_devmem *devmem_allocation,
> >  		goto err_free;
> >  	}
> >  
> > -	if (migrate.cpages != npages) {
> > +	if (migrate.cpages != npages &&
> > +	    drm_pagemap_cpages(migrate.src, npages) != npages) {
> >  		/*
> >  		 * Some pages to migrate. But we want to migrate all or
> >  		 * nothing. Raced or unknown device pages.
> 
> I thought I did for the previous revision, but

You did for patch #2, it was kept in this revision.

> 
> Reviewed-by: Balbir Singh <balbirs@nvidia.com>

Thanks Balbir!

Francois



* Re: [PATCH v6 1/5] mm/zone_device: Reinitialize large zone device private folios
  2026-01-16 11:10 ` [PATCH v6 1/5] mm/zone_device: Reinitialize large zone device private folios Francois Dugast
@ 2026-01-16 13:10   ` Balbir Singh
  2026-01-16 16:07   ` Vlastimil Babka
                     ` (2 subsequent siblings)
  3 siblings, 0 replies; 44+ messages in thread
From: Balbir Singh @ 2026-01-16 13:10 UTC (permalink / raw)
  To: Francois Dugast, intel-xe
  Cc: dri-devel, Matthew Brost, Zi Yan, Alistair Popple,
	Madhavan Srinivasan, Nicholas Piggin, Michael Ellerman,
	Christophe Leroy (CS GROUP),
	Felix Kuehling, Alex Deucher, Christian König, David Airlie,
	Simona Vetter, Maarten Lankhorst, Maxime Ripard,
	Thomas Zimmermann, Lyude Paul, Danilo Krummrich,
	David Hildenbrand, Oscar Salvador, Andrew Morton,
	Jason Gunthorpe, Leon Romanovsky, Lorenzo Stoakes,
	Liam R . Howlett, Vlastimil Babka, Mike Rapoport,
	Suren Baghdasaryan, Michal Hocko, linuxppc-dev, kvm,
	linux-kernel, amd-gfx, nouveau, linux-mm, linux-cxl

On 1/16/26 22:10, Francois Dugast wrote:
> From: Matthew Brost <matthew.brost@intel.com>
> 
> Reinitialize metadata for large zone device private folios in
> zone_device_page_init prior to creating a higher-order zone device
> private folio. This step is necessary when the folio’s order changes
> dynamically between zone_device_page_init calls to avoid building a
> corrupt folio. As part of the metadata reinitialization, the dev_pagemap
> must be passed in from the caller because the pgmap stored in the folio
> page may have been overwritten with a compound head.
> 
> Without this fix, individual pages could have invalid pgmap fields and
> flags (with PG_locked being notably problematic) due to prior different
> order allocations, which can, and will, result in kernel crashes.
> 
> Cc: Zi Yan <ziy@nvidia.com>
> Cc: Alistair Popple <apopple@nvidia.com>
> Cc: Madhavan Srinivasan <maddy@linux.ibm.com>
> Cc: Nicholas Piggin <npiggin@gmail.com>
> Cc: Michael Ellerman <mpe@ellerman.id.au>
> Cc: "Christophe Leroy (CS GROUP)" <chleroy@kernel.org>
> Cc: Felix Kuehling <Felix.Kuehling@amd.com>
> Cc: Alex Deucher <alexander.deucher@amd.com>
> Cc: "Christian König" <christian.koenig@amd.com>
> Cc: David Airlie <airlied@gmail.com>
> Cc: Simona Vetter <simona@ffwll.ch>
> Cc: Maarten Lankhorst <maarten.lankhorst@linux.intel.com>
> Cc: Maxime Ripard <mripard@kernel.org>
> Cc: Thomas Zimmermann <tzimmermann@suse.de>
> Cc: Lyude Paul <lyude@redhat.com>
> Cc: Danilo Krummrich <dakr@kernel.org>
> Cc: David Hildenbrand <david@kernel.org>
> Cc: Oscar Salvador <osalvador@suse.de>
> Cc: Andrew Morton <akpm@linux-foundation.org>
> Cc: Jason Gunthorpe <jgg@ziepe.ca>
> Cc: Leon Romanovsky <leon@kernel.org>
> Cc: Lorenzo Stoakes <lorenzo.stoakes@oracle.com>
> Cc: Liam R. Howlett <Liam.Howlett@oracle.com>
> Cc: Vlastimil Babka <vbabka@suse.cz>
> Cc: Mike Rapoport <rppt@kernel.org>
> Cc: Suren Baghdasaryan <surenb@google.com>
> Cc: Michal Hocko <mhocko@suse.com>
> Cc: Balbir Singh <balbirs@nvidia.com>
> Cc: linuxppc-dev@lists.ozlabs.org
> Cc: kvm@vger.kernel.org
> Cc: linux-kernel@vger.kernel.org
> Cc: amd-gfx@lists.freedesktop.org
> Cc: dri-devel@lists.freedesktop.org
> Cc: nouveau@lists.freedesktop.org
> Cc: linux-mm@kvack.org
> Cc: linux-cxl@vger.kernel.org
> Fixes: d245f9b4ab80 ("mm/zone_device: support large zone device private folios")
> Signed-off-by: Matthew Brost <matthew.brost@intel.com>
> Signed-off-by: Francois Dugast <francois.dugast@intel.com>
> 
> ---
> 
> The latest revision updates the commit message to explain what is broken
> prior to this patch and also restructures the patch so it applies, and
> works, on both the 6.19 branches and drm-tip, the latter in which includes
> patches for the next kernel release PR. Intel CI passes on both the 6.19
> branches and drm-tip at point of the first patch in this series and the
> last (drm-tip only given subsequent patches in the series require in
> patches drm-tip but not present 6.19).
> ---
>  arch/powerpc/kvm/book3s_hv_uvmem.c       |  2 +-
>  drivers/gpu/drm/amd/amdkfd/kfd_migrate.c |  2 +-
>  drivers/gpu/drm/drm_pagemap.c            |  2 +-
>  drivers/gpu/drm/nouveau/nouveau_dmem.c   |  2 +-
>  include/linux/memremap.h                 |  9 ++++--
>  lib/test_hmm.c                           |  4 ++-
>  mm/memremap.c                            | 35 +++++++++++++++++++++++-
>  7 files changed, 47 insertions(+), 9 deletions(-)
> 
> diff --git a/arch/powerpc/kvm/book3s_hv_uvmem.c b/arch/powerpc/kvm/book3s_hv_uvmem.c
> index e5000bef90f2..7cf9310de0ec 100644
> --- a/arch/powerpc/kvm/book3s_hv_uvmem.c
> +++ b/arch/powerpc/kvm/book3s_hv_uvmem.c
> @@ -723,7 +723,7 @@ static struct page *kvmppc_uvmem_get_page(unsigned long gpa, struct kvm *kvm)
>  
>  	dpage = pfn_to_page(uvmem_pfn);
>  	dpage->zone_device_data = pvt;
> -	zone_device_page_init(dpage, 0);
> +	zone_device_page_init(dpage, &kvmppc_uvmem_pgmap, 0);
>  	return dpage;
>  out_clear:
>  	spin_lock(&kvmppc_uvmem_bitmap_lock);
> diff --git a/drivers/gpu/drm/amd/amdkfd/kfd_migrate.c b/drivers/gpu/drm/amd/amdkfd/kfd_migrate.c
> index af53e796ea1b..6ada7b4af7c6 100644
> --- a/drivers/gpu/drm/amd/amdkfd/kfd_migrate.c
> +++ b/drivers/gpu/drm/amd/amdkfd/kfd_migrate.c
> @@ -217,7 +217,7 @@ svm_migrate_get_vram_page(struct svm_range *prange, unsigned long pfn)
>  	page = pfn_to_page(pfn);
>  	svm_range_bo_ref(prange->svm_bo);
>  	page->zone_device_data = prange->svm_bo;
> -	zone_device_page_init(page, 0);
> +	zone_device_page_init(page, page_pgmap(page), 0);
>  }
>  
>  static void
> diff --git a/drivers/gpu/drm/drm_pagemap.c b/drivers/gpu/drm/drm_pagemap.c
> index 03ee39a761a4..38eca94f01a1 100644
> --- a/drivers/gpu/drm/drm_pagemap.c
> +++ b/drivers/gpu/drm/drm_pagemap.c
> @@ -201,7 +201,7 @@ static void drm_pagemap_get_devmem_page(struct page *page,
>  					struct drm_pagemap_zdd *zdd)
>  {
>  	page->zone_device_data = drm_pagemap_zdd_get(zdd);
> -	zone_device_page_init(page, 0);
> +	zone_device_page_init(page, page_pgmap(page), 0);
>  }
>  
>  /**
> diff --git a/drivers/gpu/drm/nouveau/nouveau_dmem.c b/drivers/gpu/drm/nouveau/nouveau_dmem.c
> index 58071652679d..3d8031296eed 100644
> --- a/drivers/gpu/drm/nouveau/nouveau_dmem.c
> +++ b/drivers/gpu/drm/nouveau/nouveau_dmem.c
> @@ -425,7 +425,7 @@ nouveau_dmem_page_alloc_locked(struct nouveau_drm *drm, bool is_large)
>  			order = ilog2(DMEM_CHUNK_NPAGES);
>  	}
>  
> -	zone_device_folio_init(folio, order);
> +	zone_device_folio_init(folio, page_pgmap(folio_page(folio, 0)), order);
>  	return page;
>  }
>  
> diff --git a/include/linux/memremap.h b/include/linux/memremap.h
> index 713ec0435b48..e3c2ccf872a8 100644
> --- a/include/linux/memremap.h
> +++ b/include/linux/memremap.h
> @@ -224,7 +224,8 @@ static inline bool is_fsdax_page(const struct page *page)
>  }
>  
>  #ifdef CONFIG_ZONE_DEVICE
> -void zone_device_page_init(struct page *page, unsigned int order);
> +void zone_device_page_init(struct page *page, struct dev_pagemap *pgmap,
> +			   unsigned int order);
>  void *memremap_pages(struct dev_pagemap *pgmap, int nid);
>  void memunmap_pages(struct dev_pagemap *pgmap);
>  void *devm_memremap_pages(struct device *dev, struct dev_pagemap *pgmap);
> @@ -234,9 +235,11 @@ bool pgmap_pfn_valid(struct dev_pagemap *pgmap, unsigned long pfn);
>  
>  unsigned long memremap_compat_align(void);
>  
> -static inline void zone_device_folio_init(struct folio *folio, unsigned int order)
> +static inline void zone_device_folio_init(struct folio *folio,
> +					  struct dev_pagemap *pgmap,
> +					  unsigned int order)
>  {
> -	zone_device_page_init(&folio->page, order);
> +	zone_device_page_init(&folio->page, pgmap, order);
>  	if (order)
>  		folio_set_large_rmappable(folio);
>  }
> diff --git a/lib/test_hmm.c b/lib/test_hmm.c
> index 8af169d3873a..455a6862ae50 100644
> --- a/lib/test_hmm.c
> +++ b/lib/test_hmm.c
> @@ -662,7 +662,9 @@ static struct page *dmirror_devmem_alloc_page(struct dmirror *dmirror,
>  			goto error;
>  	}
>  
> -	zone_device_folio_init(page_folio(dpage), order);
> +	zone_device_folio_init(page_folio(dpage),
> +			       page_pgmap(folio_page(page_folio(dpage), 0)),
> +			       order);
>  	dpage->zone_device_data = rpage;
>  	return dpage;
>  
> diff --git a/mm/memremap.c b/mm/memremap.c
> index 63c6ab4fdf08..ac7be07e3361 100644
> --- a/mm/memremap.c
> +++ b/mm/memremap.c
> @@ -477,10 +477,43 @@ void free_zone_device_folio(struct folio *folio)
>  	}
>  }
>  
> -void zone_device_page_init(struct page *page, unsigned int order)
> +void zone_device_page_init(struct page *page, struct dev_pagemap *pgmap,
> +			   unsigned int order)
>  {
> +	struct page *new_page = page;
> +	unsigned int i;
> +
>  	VM_WARN_ON_ONCE(order > MAX_ORDER_NR_PAGES);
>  
> +	for (i = 0; i < (1UL << order); ++i, ++new_page) {
> +		struct folio *new_folio = (struct folio *)new_page;
> +
> +		/*
> +		 * new_page could have been part of a previous higher order folio
> +		 * which encodes the order, in page + 1, in the flags bits. We
> +		 * blindly clear bits which could have set the order field here,
> +		 * including page head.
> +		 */
> +		new_page->flags.f &= ~0xffUL;	/* Clear possible order, page head */
> +
> +#ifdef NR_PAGES_IN_LARGE_FOLIO
> +		/*
> +		 * This pointer math looks odd, but new_page could have been
> +		 * part of a previous higher order folio, which sets _nr_pages
> +		 * in page + 1 (new_page). Therefore, we use pointer casting to
> +		 * correctly locate the _nr_pages bits within new_page which
> +		 * could have been modified by a previous higher order folio.
> +		 */
> +		((struct folio *)(new_page - 1))->_nr_pages = 0;
> +#endif
> +
> +		new_folio->mapping = NULL;
> +		new_folio->pgmap = pgmap;	/* Also clear compound head */
> +		new_folio->share = 0;   /* fsdax only, unused for device private */

It could use a

BUILD_BUG_ON(offsetof(struct folio, pgmap) > sizeof(struct page));
BUILD_BUG_ON(offsetof(struct folio, share) > sizeof(struct page));

> +		VM_WARN_ON_FOLIO(folio_ref_count(new_folio), new_folio);
> +		VM_WARN_ON_FOLIO(!folio_is_zone_device(new_folio), new_folio);
> +	}
> +
>  	/*
>  	 * Drivers shouldn't be allocating pages after calling
>  	 * memunmap_pages().


Very subtle. I wonder if, from a new page perspective, a memset of new_page to 0
is cleaner, but I guess it does touch more bytes.

Reviewed-by: Balbir Singh <balbirs@nvidia.com>



* Re: [PATCH v6 1/5] mm/zone_device: Reinitialize large zone device private folios
  2026-01-16 11:10 ` [PATCH v6 1/5] mm/zone_device: Reinitialize large zone device private folios Francois Dugast
  2026-01-16 13:10   ` Balbir Singh
@ 2026-01-16 16:07   ` Vlastimil Babka
  2026-01-16 17:20     ` Jason Gunthorpe
  2026-01-22  8:02     ` Vlastimil Babka
  2026-01-16 17:49   ` Jason Gunthorpe
  2026-01-16 22:34   ` Andrew Morton
  3 siblings, 2 replies; 44+ messages in thread
From: Vlastimil Babka @ 2026-01-16 16:07 UTC (permalink / raw)
  To: Francois Dugast, intel-xe
  Cc: dri-devel, Matthew Brost, Zi Yan, Alistair Popple,
	Madhavan Srinivasan, Nicholas Piggin, Michael Ellerman,
	Christophe Leroy (CS GROUP),
	Felix Kuehling, Alex Deucher, Christian König, David Airlie,
	Simona Vetter, Maarten Lankhorst, Maxime Ripard,
	Thomas Zimmermann, Lyude Paul, Danilo Krummrich,
	David Hildenbrand, Oscar Salvador, Andrew Morton,
	Jason Gunthorpe, Leon Romanovsky, Lorenzo Stoakes,
	Liam R . Howlett, Mike Rapoport, Suren Baghdasaryan,
	Michal Hocko, Balbir Singh, linuxppc-dev, kvm, linux-kernel,
	amd-gfx, nouveau, linux-mm, linux-cxl

On 1/16/26 12:10, Francois Dugast wrote:
> From: Matthew Brost <matthew.brost@intel.com>
> diff --git a/mm/memremap.c b/mm/memremap.c
> index 63c6ab4fdf08..ac7be07e3361 100644
> --- a/mm/memremap.c
> +++ b/mm/memremap.c
> @@ -477,10 +477,43 @@ void free_zone_device_folio(struct folio *folio)
>  	}
>  }
>  
> -void zone_device_page_init(struct page *page, unsigned int order)
> +void zone_device_page_init(struct page *page, struct dev_pagemap *pgmap,
> +			   unsigned int order)
>  {
> +	struct page *new_page = page;
> +	unsigned int i;
> +
>  	VM_WARN_ON_ONCE(order > MAX_ORDER_NR_PAGES);
>  
> +	for (i = 0; i < (1UL << order); ++i, ++new_page) {
> +		struct folio *new_folio = (struct folio *)new_page;
> +
> +		/*
> +		 * new_page could have been part of previous higher order folio
> +		 * which encodes the order, in page + 1, in the flags bits. We
> +		 * blindly clear bits which could have set my order field here,
> +		 * including page head.
> +		 */
> +		new_page->flags.f &= ~0xffUL;	/* Clear possible order, page head */
> +
> +#ifdef NR_PAGES_IN_LARGE_FOLIO
> +		/*
> +		 * This pointer math looks odd, but new_page could have been
> +		 * part of a previous higher order folio, which sets _nr_pages
> +		 * in page + 1 (new_page). Therefore, we use pointer casting to
> +		 * correctly locate the _nr_pages bits within new_page which
> +		 * could have modified by previous higher order folio.
> +		 */
> +		((struct folio *)(new_page - 1))->_nr_pages = 0;
> +#endif
> +
> +		new_folio->mapping = NULL;
> +		new_folio->pgmap = pgmap;	/* Also clear compound head */
> +		new_folio->share = 0;   /* fsdax only, unused for device private */
> +		VM_WARN_ON_FOLIO(folio_ref_count(new_folio), new_folio);
> +		VM_WARN_ON_FOLIO(!folio_is_zone_device(new_folio), new_folio);
> +	}
> +
>  	/*
>  	 * Drivers shouldn't be allocating pages after calling
>  	 * memunmap_pages().

Can't say I'm a fan of this. It probably works now (so I'm not nacking) but
seems rather fragile. It seems likely to me somebody will try to change some
implementation detail in the page allocator and not notice it breaks this,
for example. I hope we can eventually get to something more robust.



^ permalink raw reply	[flat|nested] 44+ messages in thread

* Re: [PATCH v6 1/5] mm/zone_device: Reinitialize large zone device private folios
  2026-01-16 16:07   ` Vlastimil Babka
@ 2026-01-16 17:20     ` Jason Gunthorpe
  2026-01-16 17:27       ` Vlastimil Babka
  2026-01-22  8:02     ` Vlastimil Babka
  1 sibling, 1 reply; 44+ messages in thread
From: Jason Gunthorpe @ 2026-01-16 17:20 UTC (permalink / raw)
  To: Vlastimil Babka
  Cc: Francois Dugast, intel-xe, dri-devel, Matthew Brost, Zi Yan,
	Alistair Popple, Madhavan Srinivasan, Nicholas Piggin,
	Michael Ellerman, Christophe Leroy (CS GROUP),
	Felix Kuehling, Alex Deucher, Christian König, David Airlie,
	Simona Vetter, Maarten Lankhorst, Maxime Ripard,
	Thomas Zimmermann, Lyude Paul, Danilo Krummrich,
	David Hildenbrand, Oscar Salvador, Andrew Morton,
	Leon Romanovsky, Lorenzo Stoakes, Liam R . Howlett,
	Mike Rapoport, Suren Baghdasaryan, Michal Hocko, Balbir Singh,
	linuxppc-dev, kvm, linux-kernel, amd-gfx, nouveau, linux-mm,
	linux-cxl

On Fri, Jan 16, 2026 at 05:07:09PM +0100, Vlastimil Babka wrote:
> On 1/16/26 12:10, Francois Dugast wrote:
> > From: Matthew Brost <matthew.brost@intel.com>
> > diff --git a/mm/memremap.c b/mm/memremap.c
> > index 63c6ab4fdf08..ac7be07e3361 100644
> > --- a/mm/memremap.c
> > +++ b/mm/memremap.c
> > @@ -477,10 +477,43 @@ void free_zone_device_folio(struct folio *folio)
> >  	}
> >  }
> >  
> > -void zone_device_page_init(struct page *page, unsigned int order)
> > +void zone_device_page_init(struct page *page, struct dev_pagemap *pgmap,
> > +			   unsigned int order)
> >  {
> > +	struct page *new_page = page;
> > +	unsigned int i;
> > +
> >  	VM_WARN_ON_ONCE(order > MAX_ORDER_NR_PAGES);
> >  
> > +	for (i = 0; i < (1UL << order); ++i, ++new_page) {
> > +		struct folio *new_folio = (struct folio *)new_page;
> > +
> > +		/*
> > +		 * new_page could have been part of previous higher order folio
> > +		 * which encodes the order, in page + 1, in the flags bits. We
> > +		 * blindly clear bits which could have set my order field here,
> > +		 * including page head.
> > +		 */
> > +		new_page->flags.f &= ~0xffUL;	/* Clear possible order, page head */
> > +
> > +#ifdef NR_PAGES_IN_LARGE_FOLIO
> > +		/*
> > +		 * This pointer math looks odd, but new_page could have been
> > +		 * part of a previous higher order folio, which sets _nr_pages
> > +		 * in page + 1 (new_page). Therefore, we use pointer casting to
> > +		 * correctly locate the _nr_pages bits within new_page which
> > +		 * could have modified by previous higher order folio.
> > +		 */
> > +		((struct folio *)(new_page - 1))->_nr_pages = 0;
> > +#endif
> > +
> > +		new_folio->mapping = NULL;
> > +		new_folio->pgmap = pgmap;	/* Also clear compound head */
> > +		new_folio->share = 0;   /* fsdax only, unused for device private */
> > +		VM_WARN_ON_FOLIO(folio_ref_count(new_folio), new_folio);
> > +		VM_WARN_ON_FOLIO(!folio_is_zone_device(new_folio), new_folio);
> > +	}
> > +
> >  	/*
> >  	 * Drivers shouldn't be allocating pages after calling
> >  	 * memunmap_pages().
> 
> Can't say I'm a fan of this. It probably works now (so I'm not nacking) but
> seems rather fragile. It seems likely to me somebody will try to change some
> implementation detail in the page allocator and not notice it breaks this,
> for example. I hope we can eventually get to something more robust.

These pages shouldn't be in the buddy allocator at all? The driver
using the ZONE_DEVICE pages is responsible to provide its own
allocator.

Did you mean something else?

Jason
 


^ permalink raw reply	[flat|nested] 44+ messages in thread

* Re: [PATCH v6 1/5] mm/zone_device: Reinitialize large zone device private folios
  2026-01-16 17:20     ` Jason Gunthorpe
@ 2026-01-16 17:27       ` Vlastimil Babka
  0 siblings, 0 replies; 44+ messages in thread
From: Vlastimil Babka @ 2026-01-16 17:27 UTC (permalink / raw)
  To: Jason Gunthorpe
  Cc: Francois Dugast, intel-xe, dri-devel, Matthew Brost, Zi Yan,
	Alistair Popple, Madhavan Srinivasan, Nicholas Piggin,
	Michael Ellerman, Christophe Leroy (CS GROUP),
	Felix Kuehling, Alex Deucher, Christian König, David Airlie,
	Simona Vetter, Maarten Lankhorst, Maxime Ripard,
	Thomas Zimmermann, Lyude Paul, Danilo Krummrich,
	David Hildenbrand, Oscar Salvador, Andrew Morton,
	Leon Romanovsky, Lorenzo Stoakes, Liam R . Howlett,
	Mike Rapoport, Suren Baghdasaryan, Michal Hocko, Balbir Singh,
	linuxppc-dev, kvm, linux-kernel, amd-gfx, nouveau, linux-mm,
	linux-cxl

On 1/16/26 18:20, Jason Gunthorpe wrote:
> On Fri, Jan 16, 2026 at 05:07:09PM +0100, Vlastimil Babka wrote:
>> On 1/16/26 12:10, Francois Dugast wrote:
>> > From: Matthew Brost <matthew.brost@intel.com>
>> > diff --git a/mm/memremap.c b/mm/memremap.c
>> > index 63c6ab4fdf08..ac7be07e3361 100644
>> > --- a/mm/memremap.c
>> > +++ b/mm/memremap.c
>> > @@ -477,10 +477,43 @@ void free_zone_device_folio(struct folio *folio)
>> >  	}
>> >  }
>> >  
>> > -void zone_device_page_init(struct page *page, unsigned int order)
>> > +void zone_device_page_init(struct page *page, struct dev_pagemap *pgmap,
>> > +			   unsigned int order)
>> >  {
>> > +	struct page *new_page = page;
>> > +	unsigned int i;
>> > +
>> >  	VM_WARN_ON_ONCE(order > MAX_ORDER_NR_PAGES);
>> >  
>> > +	for (i = 0; i < (1UL << order); ++i, ++new_page) {
>> > +		struct folio *new_folio = (struct folio *)new_page;
>> > +
>> > +		/*
>> > +		 * new_page could have been part of previous higher order folio
>> > +		 * which encodes the order, in page + 1, in the flags bits. We
>> > +		 * blindly clear bits which could have set my order field here,
>> > +		 * including page head.
>> > +		 */
>> > +		new_page->flags.f &= ~0xffUL;	/* Clear possible order, page head */
>> > +
>> > +#ifdef NR_PAGES_IN_LARGE_FOLIO
>> > +		/*
>> > +		 * This pointer math looks odd, but new_page could have been
>> > +		 * part of a previous higher order folio, which sets _nr_pages
>> > +		 * in page + 1 (new_page). Therefore, we use pointer casting to
>> > +		 * correctly locate the _nr_pages bits within new_page which
>> > +		 * could have modified by previous higher order folio.
>> > +		 */
>> > +		((struct folio *)(new_page - 1))->_nr_pages = 0;
>> > +#endif
>> > +
>> > +		new_folio->mapping = NULL;
>> > +		new_folio->pgmap = pgmap;	/* Also clear compound head */
>> > +		new_folio->share = 0;   /* fsdax only, unused for device private */
>> > +		VM_WARN_ON_FOLIO(folio_ref_count(new_folio), new_folio);
>> > +		VM_WARN_ON_FOLIO(!folio_is_zone_device(new_folio), new_folio);
>> > +	}
>> > +
>> >  	/*
>> >  	 * Drivers shouldn't be allocating pages after calling
>> >  	 * memunmap_pages().
>> 
>> Can't say I'm a fan of this. It probably works now (so I'm not nacking) but
>> seems rather fragile. It seems likely to me somebody will try to change some
>> implementation detail in the page allocator and not notice it breaks this,
>> for example. I hope we can eventually get to something more robust.
> 
> These pages shouldn't be in the buddy allocator at all? The driver
> using the ZONE_DEVICE pages is responsible to provide its own
> allocator.
> 
> Did you mean something else?

Yeah sorry that was imprecise. I meant the struct page/folio layout
implementation details (which may or may not be related to the page allocator).

> Jason
>  



^ permalink raw reply	[flat|nested] 44+ messages in thread

* Re: [PATCH v6 1/5] mm/zone_device: Reinitialize large zone device private folios
  2026-01-16 11:10 ` [PATCH v6 1/5] mm/zone_device: Reinitialize large zone device private folios Francois Dugast
  2026-01-16 13:10   ` Balbir Singh
  2026-01-16 16:07   ` Vlastimil Babka
@ 2026-01-16 17:49   ` Jason Gunthorpe
  2026-01-16 19:17     ` Vlastimil Babka
  2026-01-16 22:34   ` Andrew Morton
  3 siblings, 1 reply; 44+ messages in thread
From: Jason Gunthorpe @ 2026-01-16 17:49 UTC (permalink / raw)
  To: Francois Dugast
  Cc: intel-xe, dri-devel, Matthew Brost, Zi Yan, Alistair Popple,
	Madhavan Srinivasan, Nicholas Piggin, Michael Ellerman,
	Christophe Leroy (CS GROUP),
	Felix Kuehling, Alex Deucher, Christian König, David Airlie,
	Simona Vetter, Maarten Lankhorst, Maxime Ripard,
	Thomas Zimmermann, Lyude Paul, Danilo Krummrich,
	David Hildenbrand, Oscar Salvador, Andrew Morton,
	Leon Romanovsky, Lorenzo Stoakes, Liam R . Howlett,
	Vlastimil Babka, Mike Rapoport, Suren Baghdasaryan, Michal Hocko,
	Balbir Singh, linuxppc-dev, kvm, linux-kernel, amd-gfx, nouveau,
	linux-mm, linux-cxl

On Fri, Jan 16, 2026 at 12:10:16PM +0100, Francois Dugast wrote:
> -void zone_device_page_init(struct page *page, unsigned int order)
> +void zone_device_page_init(struct page *page, struct dev_pagemap *pgmap,
> +			   unsigned int order)
>  {
> +	struct page *new_page = page;
> +	unsigned int i;
> +
>  	VM_WARN_ON_ONCE(order > MAX_ORDER_NR_PAGES);
>  
> +	for (i = 0; i < (1UL << order); ++i, ++new_page) {
> +		struct folio *new_folio = (struct folio *)new_page;
> +
> +		/*
> +		 * new_page could have been part of previous higher order folio
> +		 * which encodes the order, in page + 1, in the flags bits. We
> +		 * blindly clear bits which could have set my order field here,
> +		 * including page head.
> +		 */
> +		new_page->flags.f &= ~0xffUL;	/* Clear possible order, page head */
> +
> +#ifdef NR_PAGES_IN_LARGE_FOLIO
> +		/*
> +		 * This pointer math looks odd, but new_page could have been
> +		 * part of a previous higher order folio, which sets _nr_pages
> +		 * in page + 1 (new_page). Therefore, we use pointer casting to
> +		 * correctly locate the _nr_pages bits within new_page which
> +		 * could have modified by previous higher order folio.
> +		 */
> +		((struct folio *)(new_page - 1))->_nr_pages = 0;
> +#endif

This seems too weird, why is it in the loop?  There is only one
_nr_pages per folio.

This is mostly zeroing some memory in the tail pages? Why?

Why can't this use the normal helpers, like memmap_init_compound()?

 struct folio *new_folio = page

 /* First 4 tail pages are part of struct folio */
 for (i = 4; i < (1UL << order); i++) {
     prep_compound_tail(..)
 }

 prep_comound_head(page, order)
 new_folio->_nr_pages = 0

??

Jason


^ permalink raw reply	[flat|nested] 44+ messages in thread

* Re: [PATCH v6 1/5] mm/zone_device: Reinitialize large zone device private folios
  2026-01-16 17:49   ` Jason Gunthorpe
@ 2026-01-16 19:17     ` Vlastimil Babka
  2026-01-16 20:31       ` Matthew Brost
  2026-01-17  0:19       ` Jason Gunthorpe
  0 siblings, 2 replies; 44+ messages in thread
From: Vlastimil Babka @ 2026-01-16 19:17 UTC (permalink / raw)
  To: Jason Gunthorpe, Francois Dugast
  Cc: intel-xe, dri-devel, Matthew Brost, Zi Yan, Alistair Popple,
	Madhavan Srinivasan, Nicholas Piggin, Michael Ellerman,
	Christophe Leroy (CS GROUP),
	Felix Kuehling, Alex Deucher, Christian König, David Airlie,
	Simona Vetter, Maarten Lankhorst, Maxime Ripard,
	Thomas Zimmermann, Lyude Paul, Danilo Krummrich,
	David Hildenbrand, Oscar Salvador, Andrew Morton,
	Leon Romanovsky, Lorenzo Stoakes, Liam R . Howlett,
	Mike Rapoport, Suren Baghdasaryan, Michal Hocko, Balbir Singh,
	linuxppc-dev, kvm, linux-kernel, amd-gfx, nouveau, linux-mm,
	linux-cxl

On 1/16/26 18:49, Jason Gunthorpe wrote:
> On Fri, Jan 16, 2026 at 12:10:16PM +0100, Francois Dugast wrote:
>> -void zone_device_page_init(struct page *page, unsigned int order)
>> +void zone_device_page_init(struct page *page, struct dev_pagemap *pgmap,
>> +			   unsigned int order)
>>  {
>> +	struct page *new_page = page;
>> +	unsigned int i;
>> +
>>  	VM_WARN_ON_ONCE(order > MAX_ORDER_NR_PAGES);
>>  
>> +	for (i = 0; i < (1UL << order); ++i, ++new_page) {
>> +		struct folio *new_folio = (struct folio *)new_page;
>> +
>> +		/*
>> +		 * new_page could have been part of previous higher order folio
>> +		 * which encodes the order, in page + 1, in the flags bits. We
>> +		 * blindly clear bits which could have set my order field here,
>> +		 * including page head.
>> +		 */
>> +		new_page->flags.f &= ~0xffUL;	/* Clear possible order, page head */
>> +
>> +#ifdef NR_PAGES_IN_LARGE_FOLIO
>> +		/*
>> +		 * This pointer math looks odd, but new_page could have been
>> +		 * part of a previous higher order folio, which sets _nr_pages
>> +		 * in page + 1 (new_page). Therefore, we use pointer casting to
>> +		 * correctly locate the _nr_pages bits within new_page which
>> +		 * could have modified by previous higher order folio.
>> +		 */
>> +		((struct folio *)(new_page - 1))->_nr_pages = 0;
>> +#endif
> 
> This seems too weird, why is it in the loop?  There is only one
> _nr_pages per folio.

I suppose we could be getting say an order-9 folio that was previously used
as two order-8 folios? And each of them had their _nr_pages in their head
and we can't know that at this point so we have to reset everything?

AFAIU this would not be a problem if the clearing of the previous state was
done upon freeing, as e.g. v4 did, but I think you also argued it meant
processing the pages when freeing and then again at reallocation, so it's
now like this instead?

Or maybe you mean that stray _nr_pages in some tail page from previous
lifetimes can't affect the current lifetime in a wrong way for something
looking at said page? I don't know immediately.

> This is mostly zeroing some memory in the tail pages? Why?
> 
> Why can't this use the normal helpers, like memmap_init_compound()?
> 
>  struct folio *new_folio = page
> 
>  /* First 4 tail pages are part of struct folio */
>  for (i = 4; i < (1UL << order); i++) {
>      prep_compound_tail(..)
>  }
> 
>  prep_comound_head(page, order)
>  new_folio->_nr_pages = 0
> 
> ??
> 
> Jason



^ permalink raw reply	[flat|nested] 44+ messages in thread

* Re: [PATCH v6 1/5] mm/zone_device: Reinitialize large zone device private folios
  2026-01-16 19:17     ` Vlastimil Babka
@ 2026-01-16 20:31       ` Matthew Brost
  2026-01-17  0:51         ` Jason Gunthorpe
  2026-01-17  0:19       ` Jason Gunthorpe
  1 sibling, 1 reply; 44+ messages in thread
From: Matthew Brost @ 2026-01-16 20:31 UTC (permalink / raw)
  To: Vlastimil Babka
  Cc: Jason Gunthorpe, Francois Dugast, intel-xe, dri-devel, Zi Yan,
	Alistair Popple, Madhavan Srinivasan, Nicholas Piggin,
	Michael Ellerman, Christophe Leroy (CS GROUP),
	Felix Kuehling, Alex Deucher, Christian König, David Airlie,
	Simona Vetter, Maarten Lankhorst, Maxime Ripard,
	Thomas Zimmermann, Lyude Paul, Danilo Krummrich,
	David Hildenbrand, Oscar Salvador, Andrew Morton,
	Leon Romanovsky, Lorenzo Stoakes, Liam R . Howlett,
	Mike Rapoport, Suren Baghdasaryan, Michal Hocko, Balbir Singh,
	linuxppc-dev, kvm, linux-kernel, amd-gfx, nouveau, linux-mm,
	linux-cxl

On Fri, Jan 16, 2026 at 08:17:22PM +0100, Vlastimil Babka wrote:
> On 1/16/26 18:49, Jason Gunthorpe wrote:
> > On Fri, Jan 16, 2026 at 12:10:16PM +0100, Francois Dugast wrote:
> >> -void zone_device_page_init(struct page *page, unsigned int order)
> >> +void zone_device_page_init(struct page *page, struct dev_pagemap *pgmap,
> >> +			   unsigned int order)
> >>  {
> >> +	struct page *new_page = page;
> >> +	unsigned int i;
> >> +
> >>  	VM_WARN_ON_ONCE(order > MAX_ORDER_NR_PAGES);
> >>  
> >> +	for (i = 0; i < (1UL << order); ++i, ++new_page) {
> >> +		struct folio *new_folio = (struct folio *)new_page;
> >> +
> >> +		/*
> >> +		 * new_page could have been part of previous higher order folio
> >> +		 * which encodes the order, in page + 1, in the flags bits. We
> >> +		 * blindly clear bits which could have set my order field here,
> >> +		 * including page head.
> >> +		 */
> >> +		new_page->flags.f &= ~0xffUL;	/* Clear possible order, page head */
> >> +
> >> +#ifdef NR_PAGES_IN_LARGE_FOLIO
> >> +		/*
> >> +		 * This pointer math looks odd, but new_page could have been
> >> +		 * part of a previous higher order folio, which sets _nr_pages
> >> +		 * in page + 1 (new_page). Therefore, we use pointer casting to
> >> +		 * correctly locate the _nr_pages bits within new_page which
> >> +		 * could have modified by previous higher order folio.
> >> +		 */
> >> +		((struct folio *)(new_page - 1))->_nr_pages = 0;
> >> +#endif
> > 
> > This seems too weird, why is it in the loop?  There is only one
> > _nr_pages per folio.
> 
> I suppose we could be getting say an order-9 folio that was previously used
> as two order-8 folios? And each of them had their _nr_pages in their head

Yes, this is a good example. At this point we have no idea what the previous
allocation(s) order(s) were - we could have multiple places in the loop
where _nr_pages is populated, thus we have to clear this everywhere.

> and we can't know that at this point so we have to reset everything?
> 

Yes, see above, correct. We have no visibility into the previous state of the
pages, so the only option is to reset everything.

> AFAIU this would not be a problem if the clearing of the previous state was
> done upon freeing, as e.g. v4 did, but I think you also argued it meant
> processing the pages when freeing and then again at reallocation, so it's
> now like this instead?

Yes, if we clean up the previous folio state upon freeing, then this
problem goes away, but then we are back to passing in the order as an
argument to ->folio_free().

> 
> Or maybe you mean that stray _nr_pages in some tail page from previous
> lifetimes can't affect the current lifetime in a wrong way for something
> looking at said page? I don't know immediately.
> 
> > This is mostly zeroing some memory in the tail pages? Why?
> > 
> > Why can't this use the normal helpers, like memmap_init_compound()?
> > 
> >  struct folio *new_folio = page
> > 
> >  /* First 4 tail pages are part of struct folio */
> >  for (i = 4; i < (1UL << order); i++) {
> >      prep_compound_tail(..)
> >  }
> > 
> >  prep_comound_head(page, order)
> >  new_folio->_nr_pages = 0
> > 
> > ??

I've beaten this to death with Alistair; the normal helpers do not work here.

An order-zero allocation could have a stale _nr_pages set in its page;
new_folio->_nr_pages is page + 1 memory.
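
To make the layout point concrete, a compile-time check along these lines
(illustrative only, not part of the posted patch) captures why an order-0
page can still carry a stale _nr_pages and why the loop reaches it through
the (new_page - 1) cast:

	#ifdef NR_PAGES_IN_LARGE_FOLIO
	/*
	 * _nr_pages is overlaid on the struct page that follows the head
	 * page, i.e. it is "page + 1 memory": a page that is order-0 today
	 * may still hold a stale value written back when it was the first
	 * tail page of a larger folio.
	 */
	static inline void assert_nr_pages_is_page_plus_one(void)
	{
		BUILD_BUG_ON(offsetof(struct folio, _nr_pages) < sizeof(struct page));
	}
	#endif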

Matt

> > 
> > Jason
> 


^ permalink raw reply	[flat|nested] 44+ messages in thread

* Re: [PATCH v6 1/5] mm/zone_device: Reinitialize large zone device private folios
  2026-01-16 11:10 ` [PATCH v6 1/5] mm/zone_device: Reinitialize large zone device private folios Francois Dugast
                     ` (2 preceding siblings ...)
  2026-01-16 17:49   ` Jason Gunthorpe
@ 2026-01-16 22:34   ` Andrew Morton
  2026-01-16 22:36     ` Matthew Brost
  3 siblings, 1 reply; 44+ messages in thread
From: Andrew Morton @ 2026-01-16 22:34 UTC (permalink / raw)
  To: Francois Dugast
  Cc: intel-xe, dri-devel, Matthew Brost, Zi Yan, Alistair Popple,
	Madhavan Srinivasan, Nicholas Piggin, Michael Ellerman,
	Christophe Leroy (CS GROUP),
	Felix Kuehling, Alex Deucher, Christian König, David Airlie,
	Simona Vetter, Maarten Lankhorst, Maxime Ripard,
	Thomas Zimmermann, Lyude Paul, Danilo Krummrich,
	David Hildenbrand, Oscar Salvador, Jason Gunthorpe,
	Leon Romanovsky, Lorenzo Stoakes, Liam R . Howlett,
	Vlastimil Babka, Mike Rapoport, Suren Baghdasaryan, Michal Hocko,
	Balbir Singh, linuxppc-dev, kvm, linux-kernel, amd-gfx, nouveau,
	linux-mm, linux-cxl

On Fri, 16 Jan 2026 12:10:16 +0100 Francois Dugast <francois.dugast@intel.com> wrote:

> Reinitialize metadata for large zone device private folios in
> zone_device_page_init prior to creating a higher-order zone device
> private folio. This step is necessary when the folio’s order changes
> dynamically between zone_device_page_init calls to avoid building a
> corrupt folio. As part of the metadata reinitialization, the dev_pagemap
> must be passed in from the caller because the pgmap stored in the folio
> page may have been overwritten with a compound head.
> 
> Without this fix, individual pages could have invalid pgmap fields and
> flags (with PG_locked being notably problematic) due to prior different
> order allocations, which can, and will, result in kernel crashes.

Is it OK to leave 6.18.x without this fixed?


^ permalink raw reply	[flat|nested] 44+ messages in thread

* Re: [PATCH v6 1/5] mm/zone_device: Reinitialize large zone device private folios
  2026-01-16 22:34   ` Andrew Morton
@ 2026-01-16 22:36     ` Matthew Brost
  0 siblings, 0 replies; 44+ messages in thread
From: Matthew Brost @ 2026-01-16 22:36 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Francois Dugast, intel-xe, dri-devel, Zi Yan, Alistair Popple,
	Madhavan Srinivasan, Nicholas Piggin, Michael Ellerman,
	Christophe Leroy (CS GROUP),
	Felix Kuehling, Alex Deucher, Christian König, David Airlie,
	Simona Vetter, Maarten Lankhorst, Maxime Ripard,
	Thomas Zimmermann, Lyude Paul, Danilo Krummrich,
	David Hildenbrand, Oscar Salvador, Jason Gunthorpe,
	Leon Romanovsky, Lorenzo Stoakes, Liam R . Howlett,
	Vlastimil Babka, Mike Rapoport, Suren Baghdasaryan, Michal Hocko,
	Balbir Singh, linuxppc-dev, kvm, linux-kernel, amd-gfx, nouveau,
	linux-mm, linux-cxl

On Fri, Jan 16, 2026 at 02:34:32PM -0800, Andrew Morton wrote:
> On Fri, 16 Jan 2026 12:10:16 +0100 Francois Dugast <francois.dugast@intel.com> wrote:
> 
> > Reinitialize metadata for large zone device private folios in
> > zone_device_page_init prior to creating a higher-order zone device
> > private folio. This step is necessary when the folio’s order changes
> > dynamically between zone_device_page_init calls to avoid building a
> > corrupt folio. As part of the metadata reinitialization, the dev_pagemap
> > must be passed in from the caller because the pgmap stored in the folio
> > page may have been overwritten with a compound head.
> > 
> > Without this fix, individual pages could have invalid pgmap fields and
> > flags (with PG_locked being notably problematic) due to prior different
> > order allocations, which can, and will, result in kernel crashes.
> 
> Is it OK to leave 6.18.x without this fixed?

I think 6.18.x is fine; the offending patch + large device pages are
going into 6.19, right?

Matt


^ permalink raw reply	[flat|nested] 44+ messages in thread

* Re: [PATCH v6 1/5] mm/zone_device: Reinitialize large zone device private folios
  2026-01-16 19:17     ` Vlastimil Babka
  2026-01-16 20:31       ` Matthew Brost
@ 2026-01-17  0:19       ` Jason Gunthorpe
  2026-01-19  5:41         ` Alistair Popple
  1 sibling, 1 reply; 44+ messages in thread
From: Jason Gunthorpe @ 2026-01-17  0:19 UTC (permalink / raw)
  To: Vlastimil Babka
  Cc: Francois Dugast, intel-xe, dri-devel, Matthew Brost, Zi Yan,
	Alistair Popple, Madhavan Srinivasan, Nicholas Piggin,
	Michael Ellerman, Christophe Leroy (CS GROUP),
	Felix Kuehling, Alex Deucher, Christian König, David Airlie,
	Simona Vetter, Maarten Lankhorst, Maxime Ripard,
	Thomas Zimmermann, Lyude Paul, Danilo Krummrich,
	David Hildenbrand, Oscar Salvador, Andrew Morton,
	Leon Romanovsky, Lorenzo Stoakes, Liam R . Howlett,
	Mike Rapoport, Suren Baghdasaryan, Michal Hocko, Balbir Singh,
	linuxppc-dev, kvm, linux-kernel, amd-gfx, nouveau, linux-mm,
	linux-cxl

On Fri, Jan 16, 2026 at 08:17:22PM +0100, Vlastimil Babka wrote:
> >> +#ifdef NR_PAGES_IN_LARGE_FOLIO
> >> +		/*
> >> +		 * This pointer math looks odd, but new_page could have been
> >> +		 * part of a previous higher order folio, which sets _nr_pages
> >> +		 * in page + 1 (new_page). Therefore, we use pointer casting to
> >> +		 * correctly locate the _nr_pages bits within new_page which
> >> +		 * could have modified by previous higher order folio.
> >> +		 */
> >> +		((struct folio *)(new_page - 1))->_nr_pages = 0;
> >> +#endif
> > 
> > This seems too weird, why is it in the loop?  There is only one
> > _nr_pages per folio.
> 
> I suppose we could be getting say an order-9 folio that was previously used
> as two order-8 folios? And each of them had their _nr_pages in their head
> and we can't know that at this point so we have to reset everything?

Er, did I miss something - who reads _nr_pages from a random tail
page? Doesn't everything working with random tail pages read order,
compute the head page, cast to folio and then access _nr_pages?

> Or maybe you mean that stray _nr_pages in some tail page from previous
> lifetimes can't affect the current lifetime in a wrong way for something
> looking at said page? I don't know immediately.

Yes, exactly.

Basically, what bytes exactly need to be set to what in tail pages for
the system to work? Those should be set.

And if we want to have things set on free that's fine too, but there
should be reasons for doing stuff, and this weird thing above makes
zero sense.

Jason


^ permalink raw reply	[flat|nested] 44+ messages in thread

* Re: [PATCH v6 1/5] mm/zone_device: Reinitialize large zone device private folios
  2026-01-16 20:31       ` Matthew Brost
@ 2026-01-17  0:51         ` Jason Gunthorpe
  2026-01-17  3:55           ` Matthew Brost
  0 siblings, 1 reply; 44+ messages in thread
From: Jason Gunthorpe @ 2026-01-17  0:51 UTC (permalink / raw)
  To: Matthew Brost
  Cc: Vlastimil Babka, Francois Dugast, intel-xe, dri-devel, Zi Yan,
	Alistair Popple, Madhavan Srinivasan, Nicholas Piggin,
	Michael Ellerman, Christophe Leroy (CS GROUP),
	Felix Kuehling, Alex Deucher, Christian König, David Airlie,
	Simona Vetter, Maarten Lankhorst, Maxime Ripard,
	Thomas Zimmermann, Lyude Paul, Danilo Krummrich,
	David Hildenbrand, Oscar Salvador, Andrew Morton,
	Leon Romanovsky, Lorenzo Stoakes, Liam R . Howlett,
	Mike Rapoport, Suren Baghdasaryan, Michal Hocko, Balbir Singh,
	linuxppc-dev, kvm, linux-kernel, amd-gfx, nouveau, linux-mm,
	linux-cxl

On Fri, Jan 16, 2026 at 12:31:25PM -0800, Matthew Brost wrote:
> > I suppose we could be getting say an order-9 folio that was previously used
> > as two order-8 folios? And each of them had their _nr_pages in their head
> 
> Yes, this is a good example. At this point we have idea what previous
> allocation(s) order(s) were - we could have multiple places in the loop
> where _nr_pages is populated, thus we have to clear this everywhere. 

Why? The fact you have to use such a crazy expression to even access
_nr_pages strongly says nothing will read it as _nr_pages.

Explain each thing:

		new_page->flags.f &= ~0xffUL;	/* Clear possible order, page head */

OK, the tail page flags need to be set right, and prep_compound_page()
called later depends on them being zero.

		((struct folio *)(new_page - 1))->_nr_pages = 0;

Can't see a reason, nothing reads _nr_pages from a random tail
page. _nr_pages is the last 8 bytes of struct page so it overlaps
memcg_data, which is also not supposed to be read from a tail page?

		new_folio->mapping = NULL;

Pointless, prep_compound_page() -> prep_compound_tail() -> p->mapping = TAIL_MAPPING;

		new_folio->pgmap = pgmap;	/* Also clear compound head */

Pointless, compound_head is set in prep_compound_tail(): set_compound_head(p, head);

		new_folio->share = 0;   /* fsdax only, unused for device private */

Not sure, certainly share isn't read from a tail page..

> > > Why can't this use the normal helpers, like memmap_init_compound()?
> > > 
> > >  struct folio *new_folio = page
> > > 
> > >  /* First 4 tail pages are part of struct folio */
> > >  for (i = 4; i < (1UL << order); i++) {
> > >      prep_compound_tail(..)
> > >  }
> > > 
> > >  prep_comound_head(page, order)
> > >  new_folio->_nr_pages = 0
> > > 
> > > ??
> 
> I've beat this to death with Alistair, normal helpers do not work here.

What do you mean? It already calls prep_compound_page()! The issue
seems to be that prep_compound_page() makes assumptions about what
values are in flags already?

So how about move that page flags mask logic into
prep_compound_tail()? I think that would help Vlastimil's
concern. That function is already touching most of the cache line so
an extra word shouldn't make a performance difference.

> An order zero allocation could have _nr_pages set in its page,
> new_folio->_nr_pages is page + 1 memory.

An order zero allocation does not have _nr_pages because it is in page
+1 memory that doesn't exist.

An order zero allocation might have memcg_data in the same slot, does
it need zeroing? If so why not add that to prep_compound_head() ?

Also, prep_compound_head() handles order 0 too:

	if (IS_ENABLED(CONFIG_64BIT) || order > 1) {
		atomic_set(&folio->_pincount, 0);
		atomic_set(&folio->_entire_mapcount, -1);
	}
	if (order > 1)
		INIT_LIST_HEAD(&folio->_deferred_list);

So some of the problem here looks to be not calling it:

	if (order)
		prep_compound_page(page, order);

So, remove that if ? Also shouldn't it be moved above the
set_page_count/lock_page ?

Jason


^ permalink raw reply	[flat|nested] 44+ messages in thread

* Re: [PATCH v6 1/5] mm/zone_device: Reinitialize large zone device private folios
  2026-01-17  0:51         ` Jason Gunthorpe
@ 2026-01-17  3:55           ` Matthew Brost
  2026-01-17  4:42             ` Balbir Singh
  0 siblings, 1 reply; 44+ messages in thread
From: Matthew Brost @ 2026-01-17  3:55 UTC (permalink / raw)
  To: Jason Gunthorpe
  Cc: Vlastimil Babka, Francois Dugast, intel-xe, dri-devel, Zi Yan,
	Alistair Popple, Madhavan Srinivasan, Nicholas Piggin,
	Michael Ellerman, Christophe Leroy (CS GROUP),
	Felix Kuehling, Alex Deucher, Christian König, David Airlie,
	Simona Vetter, Maarten Lankhorst, Maxime Ripard,
	Thomas Zimmermann, Lyude Paul, Danilo Krummrich,
	David Hildenbrand, Oscar Salvador, Andrew Morton,
	Leon Romanovsky, Lorenzo Stoakes, Liam R . Howlett,
	Mike Rapoport, Suren Baghdasaryan, Michal Hocko, Balbir Singh,
	linuxppc-dev, kvm, linux-kernel, amd-gfx, nouveau, linux-mm,
	linux-cxl

On Fri, Jan 16, 2026 at 08:51:14PM -0400, Jason Gunthorpe wrote:
> On Fri, Jan 16, 2026 at 12:31:25PM -0800, Matthew Brost wrote:
> > > I suppose we could be getting say an order-9 folio that was previously used
> > > as two order-8 folios? And each of them had their _nr_pages in their head
> > 
> > Yes, this is a good example. At this point we have idea what previous
> > allocation(s) order(s) were - we could have multiple places in the loop
> > where _nr_pages is populated, thus we have to clear this everywhere. 
> 
> Why? The fact you have to use such a crazy expression to even access
> _nr_pages strongly says nothing will read it as _nr_pages.
> 
> Explain each thing:
> 
> 		new_page->flags.f &= ~0xffUL;	/* Clear possible order, page head */
> 
> OK, the tail page flags need to be set right, and prep_compound_page()
> called later depends on them being zero.
> 
> 		((struct folio *)(new_page - 1))->_nr_pages = 0;
> 
> Can't see a reason, nothing reads _nr_pages from a random tail
> page. _nr_pages is the last 8 bytes of struct page so it overlaps
> memcg_data, which is also not supposed to be read from a tail page?
> 
> 		new_folio->mapping = NULL;
> 
> Pointless, prep_compound_page() -> prep_compound_tail() -> p->mapping = TAIL_MAPPING;
> 
> 		new_folio->pgmap = pgmap;	/* Also clear compound head */
> 
> Pointless, compound_head is set in prep_compound_tail(): set_compound_head(p, head);
> 
> 		new_folio->share = 0;   /* fsdax only, unused for device private */
> 
> Not sure, certainly share isn't read from a tail page..
> 
> > > > Why can't this use the normal helpers, like memmap_init_compound()?
> > > > 
> > > >  struct folio *new_folio = page
> > > > 
> > > >  /* First 4 tail pages are part of struct folio */
> > > >  for (i = 4; i < (1UL << order); i++) {
> > > >      prep_compound_tail(..)
> > > >  }
> > > > 
> > > >  prep_comound_head(page, order)
> > > >  new_folio->_nr_pages = 0
> > > > 
> > > > ??
> > 
> > I've beat this to death with Alistair, normal helpers do not work here.
> 
> What do you mean? It already calls prep_compound_page()! The issue
> seems to be that prep_compound_page() makes assumptions about what
> values are in flags already?
> 
> So how about move that page flags mask logic into
> prep_compound_tail()? I think that would help Vlastimil's
> concern. That function is already touching most of the cache line so
> an extra word shouldn't make a performance difference.
> 
> > An order zero allocation could have _nr_pages set in its page,
> > new_folio->_nr_pages is page + 1 memory.
> 
> An order zero allocation does not have _nr_pages because it is in page
> +1 memory that doesn't exist.
> 
> An order zero allocation might have memcg_data in the same slot, does
> it need zeroing? If so why not add that to prep_compound_head() ?
> 
> Also, prep_compound_head() handles order 0 too:
> 
> 	if (IS_ENABLED(CONFIG_64BIT) || order > 1) {
> 		atomic_set(&folio->_pincount, 0);
> 		atomic_set(&folio->_entire_mapcount, -1);
> 	}
> 	if (order > 1)
> 		INIT_LIST_HEAD(&folio->_deferred_list);
> 
> So some of the problem here looks to be not calling it:
> 
> 	if (order)
> 		prep_compound_page(page, order);
> 
> So, remove that if ? Also shouldn't it be moved above the
> set_page_count/lock_page ?
> 

I'm not addressing each comment, some might be valid, others are not.

Ok, can I rework this in a follow-up? I will commit to that. Anything
we touch here is extremely sensitive to failures - Intel is the primary
test vector for any modification to device pages, from what I can tell.

The fact is that large device pages do not really work without this
patch, or prior revs. I've spent a lot of time getting large device
pages stable, both here and in the initial series, and I am committing
to help in follow-on series touching SVM-related things.

I’m going to miss my merge window with this (RB’d) patch blocked for
large device pages. Expect my commitment to helping other vendors to
drop if this happens. I’ll maybe just say: that doesn’t work in my CI,
try again.

Or perhaps we just revert large device pages in 6.19 if we can't get a
consensus here as we shouldn't ship a non-functional kernel.

Matt

> Jason


^ permalink raw reply	[flat|nested] 44+ messages in thread

* Re: [PATCH v6 1/5] mm/zone_device: Reinitialize large zone device private folios
  2026-01-17  3:55           ` Matthew Brost
@ 2026-01-17  4:42             ` Balbir Singh
  2026-01-17  5:27               ` Matthew Brost
  0 siblings, 1 reply; 44+ messages in thread
From: Balbir Singh @ 2026-01-17  4:42 UTC (permalink / raw)
  To: Matthew Brost, Jason Gunthorpe
  Cc: Vlastimil Babka, Francois Dugast, intel-xe, dri-devel, Zi Yan,
	Alistair Popple, Madhavan Srinivasan, Nicholas Piggin,
	Michael Ellerman, Christophe Leroy (CS GROUP),
	Felix Kuehling, Alex Deucher, Christian König, David Airlie,
	Simona Vetter, Maarten Lankhorst, Maxime Ripard,
	Thomas Zimmermann, Lyude Paul, Danilo Krummrich,
	David Hildenbrand, Oscar Salvador, Andrew Morton,
	Leon Romanovsky, Lorenzo Stoakes, Liam R . Howlett,
	Mike Rapoport, Suren Baghdasaryan, Michal Hocko, linuxppc-dev,
	kvm, linux-kernel, amd-gfx, nouveau, linux-mm, linux-cxl

On 1/17/26 14:55, Matthew Brost wrote:
> On Fri, Jan 16, 2026 at 08:51:14PM -0400, Jason Gunthorpe wrote:
>> On Fri, Jan 16, 2026 at 12:31:25PM -0800, Matthew Brost wrote:
>>>> I suppose we could be getting say an order-9 folio that was previously used
>>>> as two order-8 folios? And each of them had their _nr_pages in their head
>>>
>>> Yes, this is a good example. At this point we have idea what previous
>>> allocation(s) order(s) were - we could have multiple places in the loop
>>> where _nr_pages is populated, thus we have to clear this everywhere. 
>>
>> Why? The fact you have to use such a crazy expression to even access
>> _nr_pages strongly says nothing will read it as _nr_pages.
>>
>> Explain each thing:
>>
>> 		new_page->flags.f &= ~0xffUL;	/* Clear possible order, page head */
>>
>> OK, the tail page flags need to be set right, and prep_compound_page()
>> called later depends on them being zero.
>>
>> 		((struct folio *)(new_page - 1))->_nr_pages = 0;
>>
>> Can't see a reason, nothing reads _nr_pages from a random tail
>> page. _nr_pages is the last 8 bytes of struct page so it overlaps
>> memcg_data, which is also not supposed to be read from a tail page?
>>
>> 		new_folio->mapping = NULL;
>>
>> Pointless, prep_compound_page() -> prep_compound_tail() -> p->mapping = TAIL_MAPPING;
>>
>> 		new_folio->pgmap = pgmap;	/* Also clear compound head */
>>
>> Pointless, compound_head is set in prep_compound_tail(): set_compound_head(p, head);
>>
>> 		new_folio->share = 0;   /* fsdax only, unused for device private */
>>
>> Not sure, certainly share isn't read from a tail page..
>>
>>>>> Why can't this use the normal helpers, like memmap_init_compound()?
>>>>>
>>>>>  struct folio *new_folio = page
>>>>>
>>>>>  /* First 4 tail pages are part of struct folio */
>>>>>  for (i = 4; i < (1UL << order); i++) {
>>>>>      prep_compound_tail(..)
>>>>>  }
>>>>>
>>>>>  prep_comound_head(page, order)
>>>>>  new_folio->_nr_pages = 0
>>>>>
>>>>> ??
>>>
>>> I've beat this to death with Alistair, normal helpers do not work here.
>>
>> What do you mean? It already calls prep_compound_page()! The issue
>> seems to be that prep_compound_page() makes assumptions about what
>> values are in flags already?
>>
>> So how about move that page flags mask logic into
>> prep_compound_tail()? I think that would help Vlastimil's
>> concern. That function is already touching most of the cache line so
>> an extra word shouldn't make a performance difference.
>>
>>> An order zero allocation could have _nr_pages set in its page,
>>> new_folio->_nr_pages is page + 1 memory.
>>
>> An order zero allocation does not have _nr_pages because it is in page
>> +1 memory that doesn't exist.
>>
>> An order zero allocation might have memcg_data in the same slot, does
>> it need zeroing? If so why not add that to prep_compound_head() ?
>>
>> Also, prep_compound_head() handles order 0 too:
>>
>> 	if (IS_ENABLED(CONFIG_64BIT) || order > 1) {
>> 		atomic_set(&folio->_pincount, 0);
>> 		atomic_set(&folio->_entire_mapcount, -1);
>> 	}
>> 	if (order > 1)
>> 		INIT_LIST_HEAD(&folio->_deferred_list);
>>
>> So some of the problem here looks to be not calling it:
>>
>> 	if (order)
>> 		prep_compound_page(page, order);
>>
>> So, remove that if ? Also shouldn't it be moved above the
>> set_page_count/lock_page ?
>>
> 
> I'm not addressing each comment, some might be valid, others are not.
> 
> Ok, can I rework this in a follow-up - I will commit to that? Anything
> we touch here is extremely sensitive to failures - Intel is the primary
> test vector for any modification to device pages for what I can tell.
> 
> The fact is that large device pages do not really work without this
> patch, or prior revs. I’ve spent a lot of time getting large device
> pages stable — both here and in the initial series, commiting to help in
> follow on series touch SVM related things.
> 

Matthew, I feel your frustration and appreciate your help.
For the current state of 6.19, your changes work for me, and I added a
Reviewed-by to the patch. It affects a small number of drivers and makes
them work for zone device folios. I am happy to maintain the changes
sent out as part of zone_device_page_init().

We can rework the details in a follow up series, there are many ideas
and ways of doing this (Jason, Alistair, Zi have good ideas as well).

> I’m going to miss my merge window with this (RB’d) patch blocked for
> large device pages. Expect my commitment to helping other vendors to
> drop if this happens. I’ll maybe just say: that doesn’t work in my CI,
> try again.
> 
> Or perhaps we just revert large device pages in 6.19 if we can't get a
> consensus here as we shouldn't ship a non-functional kernel.
> 
> Matt
> 
>> Jason



^ permalink raw reply	[flat|nested] 44+ messages in thread

* Re: [PATCH v6 1/5] mm/zone_device: Reinitialize large zone device private folios
  2026-01-17  4:42             ` Balbir Singh
@ 2026-01-17  5:27               ` Matthew Brost
  2026-01-19  5:59                 ` Alistair Popple
  0 siblings, 1 reply; 44+ messages in thread
From: Matthew Brost @ 2026-01-17  5:27 UTC (permalink / raw)
  To: Balbir Singh
  Cc: Jason Gunthorpe, Vlastimil Babka, Francois Dugast, intel-xe,
	dri-devel, Zi Yan, Alistair Popple, Madhavan Srinivasan,
	Nicholas Piggin, Michael Ellerman, Christophe Leroy (CS GROUP),
	Felix Kuehling, Alex Deucher, Christian König, David Airlie,
	Simona Vetter, Maarten Lankhorst, Maxime Ripard,
	Thomas Zimmermann, Lyude Paul, Danilo Krummrich,
	David Hildenbrand, Oscar Salvador, Andrew Morton,
	Leon Romanovsky, Lorenzo Stoakes, Liam R . Howlett,
	Mike Rapoport, Suren Baghdasaryan, Michal Hocko, linuxppc-dev,
	kvm, linux-kernel, amd-gfx, nouveau, linux-mm, linux-cxl

On Sat, Jan 17, 2026 at 03:42:16PM +1100, Balbir Singh wrote:
> On 1/17/26 14:55, Matthew Brost wrote:
> > On Fri, Jan 16, 2026 at 08:51:14PM -0400, Jason Gunthorpe wrote:
> >> On Fri, Jan 16, 2026 at 12:31:25PM -0800, Matthew Brost wrote:
> >>>> I suppose we could be getting say an order-9 folio that was previously used
> >>>> as two order-8 folios? And each of them had their _nr_pages in their head
> >>>
> >>> Yes, this is a good example. At this point we have idea what previous
> >>> allocation(s) order(s) were - we could have multiple places in the loop
> >>> where _nr_pages is populated, thus we have to clear this everywhere. 
> >>
> >> Why? The fact you have to use such a crazy expression to even access
> >> _nr_pages strongly says nothing will read it as _nr_pages.
> >>
> >> Explain each thing:
> >>
> >> 		new_page->flags.f &= ~0xffUL;	/* Clear possible order, page head */
> >>
> >> OK, the tail page flags need to be set right, and prep_compound_page()
> >> called later depends on them being zero.
> >>
> >> 		((struct folio *)(new_page - 1))->_nr_pages = 0;
> >>
> >> Can't see a reason, nothing reads _nr_pages from a random tail
> >> page. _nr_pages is the last 8 bytes of struct page so it overlaps
> >> memcg_data, which is also not supposed to be read from a tail page?
> >>
> >> 		new_folio->mapping = NULL;
> >>
> >> Pointless, prep_compound_page() -> prep_compound_tail() -> p->mapping = TAIL_MAPPING;
> >>
> >> 		new_folio->pgmap = pgmap;	/* Also clear compound head */
> >>
> >> Pointless, compound_head is set in prep_compound_tail(): set_compound_head(p, head);
> >>
> >> 		new_folio->share = 0;   /* fsdax only, unused for device private */
> >>
> >> Not sure, certainly share isn't read from a tail page..
> >>
> >>>>> Why can't this use the normal helpers, like memmap_init_compound()?
> >>>>>
> >>>>>  struct folio *new_folio = page
> >>>>>
> >>>>>  /* First 4 tail pages are part of struct folio */
> >>>>>  for (i = 4; i < (1UL << order); i++) {
> >>>>>      prep_compound_tail(..)
> >>>>>  }
> >>>>>
> >>>>>  prep_comound_head(page, order)
> >>>>>  new_folio->_nr_pages = 0
> >>>>>
> >>>>> ??
> >>>
> >>> I've beat this to death with Alistair, normal helpers do not work here.
> >>
> >> What do you mean? It already calls prep_compound_page()! The issue
> >> seems to be that prep_compound_page() makes assumptions about what
> >> values are in flags already?
> >>
> >> So how about move that page flags mask logic into
> >> prep_compound_tail()? I think that would help Vlastimil's
> >> concern. That function is already touching most of the cache line so
> >> an extra word shouldn't make a performance difference.
> >>
> >>> An order zero allocation could have _nr_pages set in its page,
> >>> new_folio->_nr_pages is page + 1 memory.
> >>
> >> An order zero allocation does not have _nr_pages because it is in page
> >> +1 memory that doesn't exist.
> >>
> >> An order zero allocation might have memcg_data in the same slot, does
> >> it need zeroing? If so why not add that to prep_compound_head() ?
> >>
> >> Also, prep_compound_head() handles order 0 too:
> >>
> >> 	if (IS_ENABLED(CONFIG_64BIT) || order > 1) {
> >> 		atomic_set(&folio->_pincount, 0);
> >> 		atomic_set(&folio->_entire_mapcount, -1);
> >> 	}
> >> 	if (order > 1)
> >> 		INIT_LIST_HEAD(&folio->_deferred_list);
> >>
> >> So some of the problem here looks to be not calling it:
> >>
> >> 	if (order)
> >> 		prep_compound_page(page, order);
> >>
> >> So, remove that if ? Also shouldn't it be moved above the
> >> set_page_count/lock_page ?
> >>
> > 
> > I'm not addressing each comment, some might be valid, others are not.
> > 
> > Ok, can I rework this in a follow-up - I will commit to that? Anything
> > we touch here is extremely sensitive to failures - Intel is the primary
> > test vector for any modification to device pages for what I can tell.
> > 
> > The fact is that large device pages do not really work without this
> > patch, or prior revs. I’ve spent a lot of time getting large device
> > pages stable — both here and in the initial series, commiting to help in
> > follow on series touch SVM related things.
> > 
> 
> Matthew, I feel your frustration and appreciate your help.
> For the current state of 6.19, your changes work for me, I added a
> Reviewed-by to the patch. It affects a small number of drivers and makes
> them work for zone device folios. I am happy to maintain the changes
> sent out as a part of zone_device_page_init()
> 

+1

> We can rework the details in a follow up series, there are many ideas
> and ways of doing this (Jason, Alistair, Zi have good ideas as well).
> 

I agree we can rework this in a follow-up — the core MM is hard, and for
valid reasons, but we can all work together on cleaning it up.

Matt

> > I’m going to miss my merge window with this (RB’d) patch blocked for
> > large device pages. Expect my commitment to helping other vendors to
> > drop if this happens. I’ll maybe just say: that doesn’t work in my CI,
> > try again.
> > 
> > Or perhaps we just revert large device pages in 6.19 if we can't get a
> > consensus here as we shouldn't ship a non-functional kernel.
> > 
> > Matt
> > 
> >> Jason
> 


^ permalink raw reply	[flat|nested] 44+ messages in thread

* Re: [PATCH v6 1/5] mm/zone_device: Reinitialize large zone device private folios
  2026-01-17  0:19       ` Jason Gunthorpe
@ 2026-01-19  5:41         ` Alistair Popple
  2026-01-19 14:24           ` Jason Gunthorpe
  0 siblings, 1 reply; 44+ messages in thread
From: Alistair Popple @ 2026-01-19  5:41 UTC (permalink / raw)
  To: Jason Gunthorpe
  Cc: Vlastimil Babka, Francois Dugast, intel-xe, dri-devel,
	Matthew Brost, Zi Yan, Madhavan Srinivasan, Nicholas Piggin,
	Michael Ellerman, Christophe Leroy (CS GROUP),
	Felix Kuehling, Alex Deucher, Christian König, David Airlie,
	Simona Vetter, Maarten Lankhorst, Maxime Ripard,
	Thomas Zimmermann, Lyude Paul, Danilo Krummrich,
	David Hildenbrand, Oscar Salvador, Andrew Morton,
	Leon Romanovsky, Lorenzo Stoakes, Liam R . Howlett,
	Mike Rapoport, Suren Baghdasaryan, Michal Hocko, Balbir Singh,
	linuxppc-dev, kvm, linux-kernel, amd-gfx, nouveau, linux-mm,
	linux-cxl

On 2026-01-17 at 11:19 +1100, Jason Gunthorpe <jgg@nvidia.com> wrote...
> On Fri, Jan 16, 2026 at 08:17:22PM +0100, Vlastimil Babka wrote:
> > >> +#ifdef NR_PAGES_IN_LARGE_FOLIO
> > >> +		/*
> > >> +		 * This pointer math looks odd, but new_page could have been
> > >> +		 * part of a previous higher order folio, which sets _nr_pages
> > >> +		 * in page + 1 (new_page). Therefore, we use pointer casting to
> > >> +		 * correctly locate the _nr_pages bits within new_page which
> > >> +		 * could have modified by previous higher order folio.
> > >> +		 */
> > >> +		((struct folio *)(new_page - 1))->_nr_pages = 0;
> > >> +#endif
> > > 
> > > This seems too weird, why is it in the loop?  There is only one
> > > _nr_pages per folio.

Yeah, I don't really know what the motivation is for going via the folio
field, which needs the odd pointer math, versus just setting page->memcg_data
= 0 directly, which would work equally well and would have avoided a lot of
confusion.

> > I suppose we could be getting say an order-9 folio that was previously used
> > as two order-8 folios? And each of them had their _nr_pages in their head
> > and we can't know that at this point so we have to reset everything?
> 
> Er, did I miss something - who reads _nr_pages from a random tail
> page? Doesn't everything working with random tail pages read order,
> compute the head page, cast to folio and then access _nr_pages?
> 
> > Or maybe you mean that stray _nr_pages in some tail page from previous
> > lifetimes can't affect the current lifetime in a wrong way for something
> > looking at said page? I don't know immediately.
> 
> Yes, exactly.
> 
> Basically, what bytes exactly need to be set to what in tail pages for
> the system to work? Those should be set.
> 
> And if we want to have things set on free that's fine too, but there
> should be reasons for doing stuff, and this weird thing above makes
> zero sense.

You can't think of these as tail pages or head pages. They are just random
struct pages, possibly order-0 or PageHead or PageTail, with fields in a
"random" state based on what they were last used for.

All this function should be trying to do is initialising this random state to
something sane as defined by the core-mm for it to consume. Yes, some might
later end up being tail (or head) pages if order > 0 and prep_compound_page()
is called. But the point of this function and the loop is to initialise the
struct page as an order-0 page with "sane" fields to pass to core-mm or call
prep_compound_page() on.

This could for example just use memset(new_page, 0, sizeof(struct page)) and
then refill all the fields correctly (although Vlastimil pointed out some page
flags need preservation). But a big part of the problem is there is no single
definition (AFAIK) of what state a struct page should be in before handing it to
the core-mm via either vm_insert_page()/pages()/etc. or migrate_vma_*() nor what
state the kernel leaves it in once freed.

I would like to see this addressed because it leads to all sorts of weirdness -
for example vm_insert_page() and migrate_vma_*() both require the page refcount
to be 1 for no good reason (drivers usually have to drop it immediately after
the call and they implicitly own the ZONE_DEVICE page lifetimes anyway so why make them
hold a reference just to map the page). Yet only migrate_vma_*() requires the
page to be locked (so other ZONE_DEVICE users just have to immediately unlock).

And I presume page->memcg_data must be set to zero, or Matthew wouldn't have
run into problems prompting him to reinit it. But I don't really know what other
requirements there are for setting page fields; they all sort of come implicitly
from the vm_insert_page/migrate_vma APIs.

 - Alistair

> Jason


^ permalink raw reply	[flat|nested] 44+ messages in thread

* Re: [PATCH v6 1/5] mm/zone_device: Reinitialize large zone device private folios
  2026-01-17  5:27               ` Matthew Brost
@ 2026-01-19  5:59                 ` Alistair Popple
  2026-01-19 14:20                   ` Jason Gunthorpe
  0 siblings, 1 reply; 44+ messages in thread
From: Alistair Popple @ 2026-01-19  5:59 UTC (permalink / raw)
  To: Matthew Brost
  Cc: Balbir Singh, Jason Gunthorpe, Vlastimil Babka, Francois Dugast,
	intel-xe, dri-devel, Zi Yan, Madhavan Srinivasan, Nicholas Piggin,
	Michael Ellerman, Christophe Leroy (CS GROUP),
	Felix Kuehling, Alex Deucher, Christian König, David Airlie,
	Simona Vetter, Maarten Lankhorst, Maxime Ripard,
	Thomas Zimmermann, Lyude Paul, Danilo Krummrich,
	David Hildenbrand, Oscar Salvador, Andrew Morton,
	Leon Romanovsky, Lorenzo Stoakes, Liam R . Howlett,
	Mike Rapoport, Suren Baghdasaryan, Michal Hocko, linuxppc-dev,
	kvm, linux-kernel, amd-gfx, nouveau, linux-mm, linux-cxl

On 2026-01-17 at 16:27 +1100, Matthew Brost <matthew.brost@intel.com> wrote...
> On Sat, Jan 17, 2026 at 03:42:16PM +1100, Balbir Singh wrote:
> > On 1/17/26 14:55, Matthew Brost wrote:
> > > On Fri, Jan 16, 2026 at 08:51:14PM -0400, Jason Gunthorpe wrote:
> > >> On Fri, Jan 16, 2026 at 12:31:25PM -0800, Matthew Brost wrote:
> > >>>> I suppose we could be getting say an order-9 folio that was previously used
> > >>>> as two order-8 folios? And each of them had their _nr_pages in their head
> > >>>
> > >>> Yes, this is a good example. At this point we have idea what previous
> > >>> allocation(s) order(s) were - we could have multiple places in the loop
> > >>> where _nr_pages is populated, thus we have to clear this everywhere. 
> > >>
> > >> Why? The fact you have to use such a crazy expression to even access
> > >> _nr_pages strongly says nothing will read it as _nr_pages.
> > >>
> > >> Explain each thing:
> > >>
> > >> 		new_page->flags.f &= ~0xffUL;	/* Clear possible order, page head */
> > >>
> > >> OK, the tail page flags need to be set right, and prep_compound_page()
> > >> called later depends on them being zero.
> > >>
> > >> 		((struct folio *)(new_page - 1))->_nr_pages = 0;
> > >>
> > >> Can't see a reason, nothing reads _nr_pages from a random tail
> > >> page. _nr_pages is the last 8 bytes of struct page so it overlaps
> > >> memcg_data, which is also not supposed to be read from a tail page?

This is (or was) either an order-0 page, a head page or a tail page, who
knows. So it doesn't really matter whether or not _nr_pages or memcg_data are
supposed to be read from a tail page. What really matters is: does any of
vm_insert_page(), migrate_vma_*() or prep_compound_page() expect this to be a
particular value when called on this page?

AFAIK memcg_data is at least expected to be NULL for migrate_vma_*() when called
on an order-0 page, which means it has to be cleared.

Although I think it would be far less confusing if it was just written like that
rather than via the folio math, it achieves the same thing and is technically
correct.
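
Spelled out the way suggested here, the per-page clearing would look something
like the sketch below; this assumes, as stated above and earlier in the thread,
that memcg_data is the struct page field the folio's _nr_pages aliases. Note
that memcg_data only exists under CONFIG_MEMCG, which the folio cast in the
patch does not depend on:

	static inline void clear_stale_nr_pages(struct page *page)
	{
	#ifdef CONFIG_MEMCG
		/*
		 * Per the discussion above, these are the same bytes the
		 * quoted patch reaches via
		 * ((struct folio *)(page - 1))->_nr_pages = 0.
		 */
		page->memcg_data = 0;
	#endif
	}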

> > >> 		new_folio->mapping = NULL;
> > >>
> > >> Pointless, prep_compound_page() -> prep_compound_tail() -> p->mapping = TAIL_MAPPING;

Not pointless - vm_insert_page() for example expects folio_test_anon(), which
won't be the case if p->mapping was previously set to TAIL_MAPPING, so it
needs to be cleared. migrate_vma_setup() has a similar issue.

> > >>
> > >> 		new_folio->pgmap = pgmap;	/* Also clear compound head */
> > >>
> > >> Pointless, compound_head is set in prep_compound_tail(): set_compound_head(p, head);

No it isn't - we're not clearing tail pages here, we're initialising ZONE_DEVICE
struct pages ready for use by the core-mm which means the pgmap needs to be
correct.

> > >> 		new_folio->share = 0;   /* fsdax only, unused for device private */
> > >>
> > >> Not sure, certainly share isn't read from a tail page..

Yeah, not useful for now because FS DAX isn't using this function. Arguably it
should though.

> > >>>>> Why can't this use the normal helpers, like memmap_init_compound()?

Because that's not what this function is trying to do - e.g. we might not be
trying to create a compound page. Although something like
memmap_init_zone_device() looks like it would be a good starting point, with
the page order being a parameter instead of being read from the pgmap.
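
Something along these lines is what I have in mind - a purely hypothetical
signature for illustration, not an existing helper:

	/*
	 * Hypothetical variant of memmap_init_zone_device(), sketch only:
	 * the order comes in as a parameter rather than being read from
	 * the pgmap.
	 */
	void zone_device_memmap_init(struct page *page, struct dev_pagemap *pgmap,
				     unsigned int order);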

> > >>>>>
> > >>>>>  struct folio *new_folio = page
> > >>>>>
> > >>>>>  /* First 4 tail pages are part of struct folio */
> > >>>>>  for (i = 4; i < (1UL << order); i++) {
> > >>>>>      prep_compound_tail(..)
> > >>>>>  }
> > >>>>>
> > >>>>>  prep_comound_head(page, order)
> > >>>>>  new_folio->_nr_pages = 0
> > >>>>>
> > >>>>> ??
> > >>>
> > >>> I've beat this to death with Alistair, normal helpers do not work here.
> > >> What do you mean? It already calls prep_compound_page()! The issue
> > >> seems to be that prep_compound_page() makes assumptions about what
> > >> values are in flags already?
> > >>
> > >> So how about move that page flags mask logic into
> > >> prep_compound_tail()? I think that would help Vlastimil's
> > >> concern. That function is already touching most of the cache line so
> > >> an extra word shouldn't make a performance difference.
> > >>
> > >>> An order zero allocation could have _nr_pages set in its page,
> > >>> new_folio->_nr_pages is page + 1 memory.
> > >>
> > >> An order zero allocation does not have _nr_pages because it is in page
> > >> +1 memory that doesn't exist.
> > >>
> > >> An order zero allocation might have memcg_data in the same slot, does
> > >> it need zeroing? If so why not add that to prep_compound_head() ?
> > >>
> > >> Also, prep_compound_head() handles order 0 too:
> > >>
> > >> 	if (IS_ENABLED(CONFIG_64BIT) || order > 1) {
> > >> 		atomic_set(&folio->_pincount, 0);
> > >> 		atomic_set(&folio->_entire_mapcount, -1);
> > >> 	}
> > >> 	if (order > 1)
> > >> 		INIT_LIST_HEAD(&folio->_deferred_list);
> > >>
> > >> So some of the problem here looks to be not calling it:
> > >>
> > >> 	if (order)
> > >> 		prep_compound_page(page, order);
> > >>
> > >> So, remove that if ? Also shouldn't it be moved above the
> > >> set_page_count/lock_page ?
> > >>
> > > 
> > > I'm not addressing each comment, some might be valid, others are not.

Hopefully some of my explanations above help.

> > > 
> > > Ok, can I rework this in a follow-up - I will commit to that? Anything
> > > we touch here is extremely sensitive to failures - Intel is the primary
> > > test vector for any modification to device pages for what I can tell.
> > > 
> > > The fact is that large device pages do not really work without this
> > > patch, or prior revs. I’ve spent a lot of time getting large device
> > > pages stable — both here and in the initial series, commiting to help in
> > > follow on series touch SVM related things.
> > > 
> > 
> > Matthew, I feel your frustration and appreciate your help.
> > For the current state of 6.19, your changes work for me, I added a
> > Reviewed-by to the patch. It affects a small number of drivers and makes
> > them work for zone device folios. I am happy to maintain the changes
> > sent out as a part of zone_device_page_init()

No problem with the above, and FWIW it seems correct. Although I suspect just
setting page->memcg_data = 0 would have been far less controversial ;)

> +1
> 
> > We can rework the details in a follow up series, there are many ideas
> > and ways of doing this (Jason, Alistair, Zi have good ideas as well).
> > 
> 
> I agree we can rework this in a follow-up — the core MM is hard, and for
> valid reasons, but we can all work together on cleaning it up.
> 
> Matt
> 
> > > I’m going to miss my merge window with this (RB’d) patch blocked for
> > > large device pages. Expect my commitment to helping other vendors to
> > > drop if this happens. I’ll maybe just say: that doesn’t work in my CI,
> > > try again.
> > > 
> > > Or perhaps we just revert large device pages in 6.19 if we can't get a
> > > consensus here as we shouldn't ship a non-functional kernel.
> > > 
> > > Matt
> > > 
> > >> Jason
> > 


^ permalink raw reply	[flat|nested] 44+ messages in thread

* Re: [PATCH v6 1/5] mm/zone_device: Reinitialize large zone device private folios
  2026-01-19  5:59                 ` Alistair Popple
@ 2026-01-19 14:20                   ` Jason Gunthorpe
  2026-01-19 20:09                     ` Zi Yan
  0 siblings, 1 reply; 44+ messages in thread
From: Jason Gunthorpe @ 2026-01-19 14:20 UTC (permalink / raw)
  To: Alistair Popple
  Cc: Matthew Brost, Balbir Singh, Vlastimil Babka, Francois Dugast,
	intel-xe, dri-devel, Zi Yan, Madhavan Srinivasan, Nicholas Piggin,
	Michael Ellerman, Christophe Leroy (CS GROUP),
	Felix Kuehling, Alex Deucher, Christian König, David Airlie,
	Simona Vetter, Maarten Lankhorst, Maxime Ripard,
	Thomas Zimmermann, Lyude Paul, Danilo Krummrich,
	David Hildenbrand, Oscar Salvador, Andrew Morton,
	Leon Romanovsky, Lorenzo Stoakes, Liam R . Howlett,
	Mike Rapoport, Suren Baghdasaryan, Michal Hocko, linuxppc-dev,
	kvm, linux-kernel, amd-gfx, nouveau, linux-mm, linux-cxl

On Mon, Jan 19, 2026 at 04:59:56PM +1100, Alistair Popple wrote:
> On 2026-01-17 at 16:27 +1100, Matthew Brost <matthew.brost@intel.com> wrote...
> > On Sat, Jan 17, 2026 at 03:42:16PM +1100, Balbir Singh wrote:
> > > On 1/17/26 14:55, Matthew Brost wrote:
> > > > On Fri, Jan 16, 2026 at 08:51:14PM -0400, Jason Gunthorpe wrote:
> > > >> On Fri, Jan 16, 2026 at 12:31:25PM -0800, Matthew Brost wrote:
> > > >>>> I suppose we could be getting say an order-9 folio that was previously used
> > > >>>> as two order-8 folios? And each of them had their _nr_pages in their head
> > > >>>
> > > >>> Yes, this is a good example. At this point we have idea what previous
> > > >>> allocation(s) order(s) were - we could have multiple places in the loop
> > > >>> where _nr_pages is populated, thus we have to clear this everywhere. 
> > > >>
> > > >> Why? The fact you have to use such a crazy expression to even access
> > > >> _nr_pages strongly says nothing will read it as _nr_pages.
> > > >>
> > > >> Explain each thing:
> > > >>
> > > >> 		new_page->flags.f &= ~0xffUL;	/* Clear possible order, page head */
> > > >>
> > > >> OK, the tail page flags need to be set right, and prep_compound_page()
> > > >> called later depends on them being zero.
> > > >>
> > > >> 		((struct folio *)(new_page - 1))->_nr_pages = 0;
> > > >>
> > > >> Can't see a reason, nothing reads _nr_pages from a random tail
> > > >> page. _nr_pages is the last 8 bytes of struct page so it overlaps
> > > >> memcg_data, which is also not supposed to be read from a tail page?
> 
> This is (or was) either a order-0 page, a head page or a tail page, who
> knows. So it doesn't really matter whether or not _nr_pages or memcg_data are
> supposed to be read from a tail page or not. What really matters is does any of
> vm_insert_page(), migrate_vma_*() or prep_compound_page() expect this to be a
> particular value when called on this page?

This weird expression is doing three things:
1) it is zeroing memcg on the head page
2) it is zeroing _nr_pages on the head folio
3) it is zeroing memcg on all the tail pages.

Are you arguing for 1, 2 or 3?

#1 is missing today
#2 is handled directly by the prep_compound_page() -> prep_compound_head() -> folio_set_order()
#3 I argue isn't necessary.

> AFAIK memcg_data is at least expected to be NULL for migrate_vma_*() when called
> on an order-0 page, which means it has to be cleared.

Great, so let's write that in prep_compound_head()!

> Although I think it would be far less confusing if it was just written like that
> rather than the folio math but it achieves the same thing and is technically
> correct.

I have yet to hear a reason to do #3.

> > > >> 		new_folio->mapping = NULL;
> > > >>
> > > >> Pointless, prep_compound_page() -> prep_compound_tail() -> p->mapping = TAIL_MAPPING;
>
> Not pointless - vm_insert_page() for example expects folio_test_anon() which
> which won't be the case if p->mapping was previously set to TAIL_MAPPING so it
> needs to be cleared. migrate_vma_setup() has a similar issue.

It is pointless to put it in the loop! Sure, set the head page.

> > > >> 		new_folio->pgmap = pgmap;	/* Also clear compound head */
> > > >>
> > > >> Pointless, compound_head is set in prep_compound_tail(): set_compound_head(p, head);
> 
> No it isn't - we're not clearing tail pages here, we're initialising ZONE_DEVICE
> struct pages ready for use by the core-mm which means the pgmap needs to be
> correct.

See above, same issue. The tail pages have pgmap set to NULL because
prep_compound_tail() does it. So why do we set it to pgmap here and
then clear it a few lines below?

Set it once in the head folio outside this loop.

> No problem with the above, and FWIW it seems correct. Although I suspect just
> setting page->memcg_data = 0 would have been far less controversial ;)

It is "correct" but horrible.

What is wrong with this? Isn't it so much better and more efficient??

diff --git a/mm/internal.h b/mm/internal.h
index e430da900430a1..a7d3f5e4b85e49 100644
--- a/mm/internal.h
+++ b/mm/internal.h
@@ -806,14 +806,21 @@ static inline void prep_compound_head(struct page *page, unsigned int order)
 		atomic_set(&folio->_pincount, 0);
 		atomic_set(&folio->_entire_mapcount, -1);
 	}
-	if (order > 1)
+	if (order > 1) {
 		INIT_LIST_HEAD(&folio->_deferred_list);
+	} else {
+		folio->mapping = NULL;
+#ifdef CONFIG_MEMCG
+		folio->memcg_data = 0;
+#endif
+	}
 }
 
 static inline void prep_compound_tail(struct page *head, int tail_idx)
 {
 	struct page *p = head + tail_idx;
 
+	p->flags.f &= ~0xffUL;	/* Clear possible order, page head */
 	p->mapping = TAIL_MAPPING;
 	set_compound_head(p, head);
 	set_page_private(p, 0);
diff --git a/mm/memremap.c b/mm/memremap.c
index 4c2e0d68eb2798..7ec034c11068e1 100644
--- a/mm/memremap.c
+++ b/mm/memremap.c
@@ -479,19 +479,23 @@ void free_zone_device_folio(struct folio *folio)
 	}
 }
 
-void zone_device_page_init(struct page *page, unsigned int order)
+void zone_device_page_init(struct page *page, struct dev_pagemap *pgmap,
+			   unsigned int order)
 {
 	VM_WARN_ON_ONCE(order > MAX_ORDER_NR_PAGES);
+	struct folio *folio;
 
 	/*
 	 * Drivers shouldn't be allocating pages after calling
 	 * memunmap_pages().
 	 */
 	WARN_ON_ONCE(!percpu_ref_tryget_many(&page_pgmap(page)->ref, 1 << order));
-	set_page_count(page, 1);
-	lock_page(page);
 
-	if (order)
-		prep_compound_page(page, order);
+	prep_compound_page(page, order);
+
+	folio = page_folio(page);
+	folio->pgmap = pgmap;
+	folio_lock(folio);
+	folio_set_count(folio, 1);
 }
 EXPORT_SYMBOL_GPL(zone_device_page_init);

Jason


^ permalink raw reply	[flat|nested] 44+ messages in thread

* Re: [PATCH v6 1/5] mm/zone_device: Reinitialize large zone device private folios
  2026-01-19  5:41         ` Alistair Popple
@ 2026-01-19 14:24           ` Jason Gunthorpe
  0 siblings, 0 replies; 44+ messages in thread
From: Jason Gunthorpe @ 2026-01-19 14:24 UTC (permalink / raw)
  To: Alistair Popple
  Cc: Vlastimil Babka, Francois Dugast, intel-xe, dri-devel,
	Matthew Brost, Zi Yan, Madhavan Srinivasan, Nicholas Piggin,
	Michael Ellerman, Christophe Leroy (CS GROUP),
	Felix Kuehling, Alex Deucher, Christian König, David Airlie,
	Simona Vetter, Maarten Lankhorst, Maxime Ripard,
	Thomas Zimmermann, Lyude Paul, Danilo Krummrich,
	David Hildenbrand, Oscar Salvador, Andrew Morton,
	Leon Romanovsky, Lorenzo Stoakes, Liam R . Howlett,
	Mike Rapoport, Suren Baghdasaryan, Michal Hocko, Balbir Singh,
	linuxppc-dev, kvm, linux-kernel, amd-gfx, nouveau, linux-mm,
	linux-cxl

On Mon, Jan 19, 2026 at 04:41:42PM +1100, Alistair Popple wrote:
> > And if we want to have things set on free that's fine too, but there
> > should be reasons for doing stuff, and this weird thing above makes
> > zero sense.
> 
> You can't think of these as tail pages or head pages. They are just random
> struct pages, possibly order-0 or PageHead or PageTail, with fields in a
> "random" state based on what they were last used for.

Agree on random state.
 
> All this function should be trying to do is initialising this random state to
> something sane as defined by the core-mm for it to consume. Yes, some might
> later end up being tail (or head) pages if order > 0 and prep_compound_page()
> is called. 

Not "later" during this function. The end result of this entire
function is to set up a folio starting at page, at the given order. Meaning we
are deliberately *creating* head and tail pages out of random junk
left over in the struct page.

> But the point of this function and the loop is to initialise the
> struct page as an order-0 page with "sane" fields to pass to core-mm or call
> prep_compound_page() on.

Which is what seems nonsensical to me. prep_compound_page() does
another loop over all these pages and *re-sets* many of the same
fields. You are arguing we should clean things and then call
prep_compound_page(); I'm arguing prep_compound_page() should just accept
the junk and fix it.

> This could for example just use memset(new_page, 0, sizeof(struct page)) and
> then refill all the fields correctly (although Vlastimil pointed out some page
> flags need preservation). But a big part of the problem is there is no single
> definition (AFAIK) of what state a struct page should be in before handing it to
> the core-mm via either vm_insert_page()/pages()/etc. or migrate_vma_*() nor what
> state the kernel leaves it in once freed.

I agree with this, but I argue that prep_compound_page() should codify
whatever that requirement is so we can trend toward such an agreed
definition. See Matthew's first missive on this about doing things
properly in the core code instead of hacking in drivers.

Jason


^ permalink raw reply	[flat|nested] 44+ messages in thread

* Re: [PATCH v6 1/5] mm/zone_device: Reinitialize large zone device private folios
  2026-01-19 14:20                   ` Jason Gunthorpe
@ 2026-01-19 20:09                     ` Zi Yan
  2026-01-19 20:35                       ` Jason Gunthorpe
  0 siblings, 1 reply; 44+ messages in thread
From: Zi Yan @ 2026-01-19 20:09 UTC (permalink / raw)
  To: Jason Gunthorpe, Matthew Wilcox
  Cc: Alistair Popple, Matthew Brost, Balbir Singh, Vlastimil Babka,
	Francois Dugast, intel-xe, dri-devel, Madhavan Srinivasan,
	Nicholas Piggin, Michael Ellerman, Christophe Leroy (CS GROUP),
	Felix Kuehling, Alex Deucher, Christian König, David Airlie,
	Simona Vetter, Maarten Lankhorst, Maxime Ripard,
	Thomas Zimmermann, Lyude Paul, Danilo Krummrich,
	David Hildenbrand, Oscar Salvador, Andrew Morton,
	Leon Romanovsky, Lorenzo Stoakes, Liam R . Howlett,
	Mike Rapoport, Suren Baghdasaryan, Michal Hocko, linuxppc-dev,
	kvm, linux-kernel, amd-gfx, nouveau, linux-mm, linux-cxl

On 19 Jan 2026, at 9:20, Jason Gunthorpe wrote:

> On Mon, Jan 19, 2026 at 04:59:56PM +1100, Alistair Popple wrote:
>> On 2026-01-17 at 16:27 +1100, Matthew Brost <matthew.brost@intel.com> wrote...
>>> On Sat, Jan 17, 2026 at 03:42:16PM +1100, Balbir Singh wrote:
>>>> On 1/17/26 14:55, Matthew Brost wrote:
>>>>> On Fri, Jan 16, 2026 at 08:51:14PM -0400, Jason Gunthorpe wrote:
>>>>>> On Fri, Jan 16, 2026 at 12:31:25PM -0800, Matthew Brost wrote:
>>>>>>>> I suppose we could be getting say an order-9 folio that was previously used
>>>>>>>> as two order-8 folios? And each of them had their _nr_pages in their head
>>>>>>>
>>>>>>> Yes, this is a good example. At this point we have idea what previous
>>>>>>> allocation(s) order(s) were - we could have multiple places in the loop
>>>>>>> where _nr_pages is populated, thus we have to clear this everywhere.
>>>>>>
>>>>>> Why? The fact you have to use such a crazy expression to even access
>>>>>> _nr_pages strongly says nothing will read it as _nr_pages.
>>>>>>
>>>>>> Explain each thing:
>>>>>>
>>>>>> 		new_page->flags.f &= ~0xffUL;	/* Clear possible order, page head */
>>>>>>
>>>>>> OK, the tail page flags need to be set right, and prep_compound_page()
>>>>>> called later depends on them being zero.
>>>>>>
>>>>>> 		((struct folio *)(new_page - 1))->_nr_pages = 0;
>>>>>>
>>>>>> Can't see a reason, nothing reads _nr_pages from a random tail
>>>>>> page. _nr_pages is the last 8 bytes of struct page so it overlaps
>>>>>> memcg_data, which is also not supposed to be read from a tail page?
>>
>> This is (or was) either a order-0 page, a head page or a tail page, who
>> knows. So it doesn't really matter whether or not _nr_pages or memcg_data are
>> supposed to be read from a tail page or not. What really matters is does any of
>> vm_insert_page(), migrate_vma_*() or prep_compound_page() expect this to be a
>> particular value when called on this page?
>
> This weird expression is doing three things,
> 1) it is zeroing memcg on the head page
> 2) it is zeroing _nr_pages on the head folio
> 3) it is zeroing memcg on all the tail pages.
>
> Are you aruging for 1, 2 or 3?
>
> #1 is missing today
> #2 is handled directly by the prep_compound_page() -> prep_compound_head() -> folio_set_order()
> #3 I argue isn't necessary.
>
>> AFAIK memcg_data is at least expected to be NULL for migrate_vma_*() when called
>> on an order-0 page, which means it has to be cleared.
>
> Great, so lets write that in prep_compound_head()!
>
>> Although I think it would be far less confusing if it was just written like that
>> rather than the folio math but it achieves the same thing and is technically
>> correct.
>
> I have yet to hear a reason to do #3.
>
>>>>>> 		new_folio->mapping = NULL;
>>>>>>
>>>>>> Pointless, prep_compound_page() -> prep_compound_tail() -> p->mapping = TAIL_MAPPING;
>>
>> Not pointless - vm_insert_page() for example expects folio_test_anon() which
>> which won't be the case if p->mapping was previously set to TAIL_MAPPING so it
>> needs to be cleared. migrate_vma_setup() has a similar issue.
>
> It is pointless to put it in the loop! Sure set the head page.
>
>>>>>> 		new_folio->pgmap = pgmap;	/* Also clear compound head */
>>>>>>
>>>>>> Pointless, compound_head is set in prep_compound_tail(): set_compound_head(p, head);
>>
>> No it isn't - we're not clearing tail pages here, we're initialising ZONE_DEVICE
>> struct pages ready for use by the core-mm which means the pgmap needs to be
>> correct.
>
> See above, same issue. The tail pages have pgmap set to NULL because
> prep_compound_tail() does it. So why do we set it to pgmap here and
> then clear it a few lines below?
>
> Set it once in the head folio outside this loop.
>
>> No problem with the above, and FWIW it seems correct. Although I suspect just
>> setting page->memcg_data = 0 would have been far less controversial ;)
>
> It is "correct" but horrible.
>
> What is wrong with this? Isn't it so much better and more efficient??
>
> diff --git a/mm/internal.h b/mm/internal.h
> index e430da900430a1..a7d3f5e4b85e49 100644
> --- a/mm/internal.h
> +++ b/mm/internal.h
> @@ -806,14 +806,21 @@ static inline void prep_compound_head(struct page *page, unsigned int order)
>  		atomic_set(&folio->_pincount, 0);
>  		atomic_set(&folio->_entire_mapcount, -1);
>  	}
> -	if (order > 1)
> +	if (order > 1) {
>  		INIT_LIST_HEAD(&folio->_deferred_list);
> +	} else {
> +		folio->mapping = NULL;
> +#ifdef CONFIG_MEMCG
> +		folio->memcg_data = 0;
> +#endif
> +	}

prep_compound_head() is only called on >0 order pages. The above
code means when order == 1, folio->mapping and folio->memcg_data are
assigned NULL.

>  }
>
>  static inline void prep_compound_tail(struct page *head, int tail_idx)
>  {
>  	struct page *p = head + tail_idx;
>
> +	p->flags.f &= ~0xffUL;	/* Clear possible order, page head */

No one cares about tail page flags if it is not checked in check_new_page()
from mm/page_alloc.c.

>  	p->mapping = TAIL_MAPPING;
>  	set_compound_head(p, head);
>  	set_page_private(p, 0);
> diff --git a/mm/memremap.c b/mm/memremap.c
> index 4c2e0d68eb2798..7ec034c11068e1 100644
> --- a/mm/memremap.c
> +++ b/mm/memremap.c
> @@ -479,19 +479,23 @@ void free_zone_device_folio(struct folio *folio)
>  	}
>  }
>
> -void zone_device_page_init(struct page *page, unsigned int order)
> +void zone_device_page_init(struct page *page, struct dev_pagemap *pgmap,
> +			   unsigned int order)
>  {
>  	VM_WARN_ON_ONCE(order > MAX_ORDER_NR_PAGES);
> +	struct folio *folio;
>
>  	/*
>  	 * Drivers shouldn't be allocating pages after calling
>  	 * memunmap_pages().
>  	 */
>  	WARN_ON_ONCE(!percpu_ref_tryget_many(&page_pgmap(page)->ref, 1 << order));
> -	set_page_count(page, 1);
> -	lock_page(page);
>
> -	if (order)
> -		prep_compound_page(page, order);
> +	prep_compound_page(page, order);

prep_compound_page() should only be called for >0 order pages. This creates
another weirdness in device pages by assuming all pages are compound.

> +
> +	folio = page_folio(page);
> +	folio->pgmap = pgmap;
> +	folio_lock(folio);
> +	folio_set_count(folio, 1);

/* clear possible previous page->mapping */
folio->mapping = NULL;

/* clear possible previous page->_nr_pages */
#ifdef CONFIG_MEMCG
	folio->memcg_data = 0;
#endif

With the two above, and still calling prep_compound_page() only when order > 0,
the code should work. There is no need to change the prep_compound_*()
functions.

>  }
>  EXPORT_SYMBOL_GPL(zone_device_page_init);


This patch mixes the concepts of page and folio together, thus
causing confusion. Core MM sees page and folio as two separate things:
1. page is the smallest internal physical memory management unit,
2. folio is an abstraction on top of pages, and other abstractions can be
   slab, ptdesc, and more (https://kernelnewbies.org/MatthewWilcox/Memdescs).

A compound page is a high-order page whose subpages are all managed as a whole,
but it is converted to a folio by page_rmappable_folio() (see
__folio_alloc_noprof()). And a slab page can be a compound page too (see
page_slab(), which does a compound_head()-like operation). So a compound page is
not the same as a folio.

I can see folio is used in prep_compound_head()
and think it is confusing, since these pages should not be regarded as
a folio yet. I probably blame willy (cc'd), since he started it from commit
94688e8eb453 ("mm: remove folio_pincount_ptr() and head_compound_pincount()")
and before that prep_compound_head() was all about pages. folio_set_order()
was set_compound_order() before commit 1e3be4856f49d ("mm/folio: replace
set_compound_order with folio_set_order").

If device pages have to be initialized on top of pages with obsolete state,
they should at least be initialized first as pages, then as folios, to avoid
confusion.


--
Best Regards,
Yan, Zi


^ permalink raw reply	[flat|nested] 44+ messages in thread

* Re: [PATCH v6 1/5] mm/zone_device: Reinitialize large zone device private folios
  2026-01-19 20:09                     ` Zi Yan
@ 2026-01-19 20:35                       ` Jason Gunthorpe
  2026-01-19 22:15                         ` Balbir Singh
  0 siblings, 1 reply; 44+ messages in thread
From: Jason Gunthorpe @ 2026-01-19 20:35 UTC (permalink / raw)
  To: Zi Yan
  Cc: Matthew Wilcox, Alistair Popple, Matthew Brost, Balbir Singh,
	Vlastimil Babka, Francois Dugast, intel-xe, dri-devel,
	Madhavan Srinivasan, Nicholas Piggin, Michael Ellerman,
	Christophe Leroy (CS GROUP),
	Felix Kuehling, Alex Deucher, Christian König, David Airlie,
	Simona Vetter, Maarten Lankhorst, Maxime Ripard,
	Thomas Zimmermann, Lyude Paul, Danilo Krummrich,
	David Hildenbrand, Oscar Salvador, Andrew Morton,
	Leon Romanovsky, Lorenzo Stoakes, Liam R . Howlett,
	Mike Rapoport, Suren Baghdasaryan, Michal Hocko, linuxppc-dev,
	kvm, linux-kernel, amd-gfx, nouveau, linux-mm, linux-cxl

On Mon, Jan 19, 2026 at 03:09:00PM -0500, Zi Yan wrote:
> > diff --git a/mm/internal.h b/mm/internal.h
> > index e430da900430a1..a7d3f5e4b85e49 100644
> > --- a/mm/internal.h
> > +++ b/mm/internal.h
> > @@ -806,14 +806,21 @@ static inline void prep_compound_head(struct page *page, unsigned int order)
> >  		atomic_set(&folio->_pincount, 0);
> >  		atomic_set(&folio->_entire_mapcount, -1);
> >  	}
> > -	if (order > 1)
> > +	if (order > 1) {
> >  		INIT_LIST_HEAD(&folio->_deferred_list);
> > +	} else {
> > +		folio->mapping = NULL;
> > +#ifdef CONFIG_MEMCG
> > +		folio->memcg_data = 0;
> > +#endif
> > +	}
> 
> prep_compound_head() is only called on >0 order pages. The above
> code means when order == 1, folio->mapping and folio->memcg_data are
> assigned NULL.

OK, fair enough, the conditionals would have to change and maybe it
shouldn't be called "compound_head" if it also cleans up normal pages.

> >  static inline void prep_compound_tail(struct page *head, int tail_idx)
> >  {
> >  	struct page *p = head + tail_idx;
> >
> > +	p->flags.f &= ~0xffUL;	/* Clear possible order, page head */
> 
> No one cares about tail page flags if it is not checked in check_new_page()
> from mm/page_alloc.c.

At least page_fixed_fake_head() does check PG_head in some
configurations. It does seem safer to clear it. Possibly order is
never used, but it is free to clear it.

> > -	if (order)
> > -		prep_compound_page(page, order);
> > +	prep_compound_page(page, order);
> 
> prep_compound_page() should only be called for >0 order pages. This creates
> another weirdness in device pages by assuming all pages are
> compound.

OK

> > +	folio = page_folio(page);
> > +	folio->pgmap = pgmap;
> > +	folio_lock(folio);
> > +	folio_set_count(folio, 1);
> 
> /* clear possible previous page->mapping */
> folio->mapping = NULL;
> 
> /* clear possible previous page->_nr_pages */
> #ifdef CONFIG_MEMCG
> 	folio->memcg_data = 0;
> #endif

This is reasonable too, but prep_compound_head() was doing more than
that: it is also clearing the order, and this needs to clear the head
bit.  That's why it was appealing to reuse those functions, but you
are right they are not ideal.

I suppose we want some prep_single_page(page) and some reorg to share
code with the other prep function.
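
Something like this, perhaps - a hypothetical helper, untested sketch only,
reusing the clears discussed above:

static inline void prep_single_page(struct page *page)
{
	/* reset a possibly stale struct page back to a sane order-0 state */
	page->flags.f &= ~0xffUL;	/* stale order / PG_head */
	page->mapping = NULL;
#ifdef CONFIG_MEMCG
	page->memcg_data = 0;
#endif
}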

> This patch mixed the concept of page and folio together, thus
> causing confusion. Core MM sees page and folio two separate things:
> 1. page is the smallest internal physical memory management unit,
> 2. folio is an abstraction on top of pages, and other abstractions can be
>    slab, ptdesc, and more (https://kernelnewbies.org/MatthewWilcox/Memdescs).

I think the users of zone_device_page_init() are principally trying to
create something that can be installed in a non-special PTE. Meaning
the output is always a folio because it is going to be read as a folio
in the page walkers.

Thus, the job of this function is to take the memory range starting at
page for 2^order and turn it into a single valid folio with refcount
of 1.

> If device pages have to initialize on top of pages with obsolete states,
> at least it should be first initialized as pages, then as folios to avoid
> confusion.

I don't think so. It should do the above job efficiently and iterate
over the page list exactly once.

Jason


^ permalink raw reply	[flat|nested] 44+ messages in thread

* Re: [PATCH v6 1/5] mm/zone_device: Reinitialize large zone device private folios
  2026-01-19 20:35                       ` Jason Gunthorpe
@ 2026-01-19 22:15                         ` Balbir Singh
  2026-01-20  2:50                           ` Zi Yan
  0 siblings, 1 reply; 44+ messages in thread
From: Balbir Singh @ 2026-01-19 22:15 UTC (permalink / raw)
  To: Jason Gunthorpe, Zi Yan
  Cc: Matthew Wilcox, Alistair Popple, Matthew Brost, Vlastimil Babka,
	Francois Dugast, intel-xe, dri-devel, Madhavan Srinivasan,
	Nicholas Piggin, Michael Ellerman, Christophe Leroy (CS GROUP),
	Felix Kuehling, Alex Deucher, Christian König, David Airlie,
	Simona Vetter, Maarten Lankhorst, Maxime Ripard,
	Thomas Zimmermann, Lyude Paul, Danilo Krummrich,
	David Hildenbrand, Oscar Salvador, Andrew Morton,
	Leon Romanovsky, Lorenzo Stoakes, Liam R . Howlett,
	Mike Rapoport, Suren Baghdasaryan, Michal Hocko, linuxppc-dev,
	kvm, linux-kernel, amd-gfx, nouveau, linux-mm, linux-cxl

On 1/20/26 07:35, Jason Gunthorpe wrote:
> On Mon, Jan 19, 2026 at 03:09:00PM -0500, Zi Yan wrote:
>>> diff --git a/mm/internal.h b/mm/internal.h
>>> index e430da900430a1..a7d3f5e4b85e49 100644
>>> --- a/mm/internal.h
>>> +++ b/mm/internal.h
>>> @@ -806,14 +806,21 @@ static inline void prep_compound_head(struct page *page, unsigned int order)
>>>  		atomic_set(&folio->_pincount, 0);
>>>  		atomic_set(&folio->_entire_mapcount, -1);
>>>  	}
>>> -	if (order > 1)
>>> +	if (order > 1) {
>>>  		INIT_LIST_HEAD(&folio->_deferred_list);
>>> +	} else {
>>> +		folio->mapping = NULL;
>>> +#ifdef CONFIG_MEMCG
>>> +		folio->memcg_data = 0;
>>> +#endif
>>> +	}
>>
>> prep_compound_head() is only called on >0 order pages. The above
>> code means when order == 1, folio->mapping and folio->memcg_data are
>> assigned NULL.
> 
> OK, fair enough, the conditionals would have to change and maybe it
> shouldn't be called "compound_head" if it also cleans up normal pages.
> 
>>>  static inline void prep_compound_tail(struct page *head, int tail_idx)
>>>  {
>>>  	struct page *p = head + tail_idx;
>>>
>>> +	p->flags.f &= ~0xffUL;	/* Clear possible order, page head */
>>
>> No one cares about tail page flags if it is not checked in check_new_page()
>> from mm/page_alloc.c.
> 
> At least page_fixed_fake_head() does check PG_head in some
> configurations. It does seem safer to clear it. Possibly order is
> never used, but it is free to clear it.
> 
>>> -	if (order)
>>> -		prep_compound_page(page, order);
>>> +	prep_compound_page(page, order);
>>
>> prep_compound_page() should only be called for >0 order pages. This creates
>> another weirdness in device pages by assuming all pages are
>> compound.
> 
> OK
> 
>>> +	folio = page_folio(page);
>>> +	folio->pgmap = pgmap;
>>> +	folio_lock(folio);
>>> +	folio_set_count(folio, 1);
>>
>> /* clear possible previous page->mapping */
>> folio->mapping = NULL;
>>
>> /* clear possible previous page->_nr_pages */
>> #ifdef CONFIG_MEMCG
>> 	folio->memcg_data = 0;
>> #endif
> 
> This is reasonable too, but prep_compound_head() was doing more than
> that, it is also clearing the order, and this needs to clear the head
> bit.  That's why it was apppealing to reuse those functions, but you
> are right they are not ideal.
> 
> I suppose we want some prep_single_page(page) and some reorg to share
> code with the other prep function.
> 

There are __init_zone_device_page() and __init_single_page();
they zero out the page and set the zone, pfn and nid, among other things.
I propose we use the current version with zone_device_free_folio() as is.

We can figure out if __init_zone_device_page() can be reused or refactored
for the purpose of doing this with core MM APIs.


>> This patch mixed the concept of page and folio together, thus
>> causing confusion. Core MM sees page and folio two separate things:
>> 1. page is the smallest internal physical memory management unit,
>> 2. folio is an abstraction on top of pages, and other abstractions can be
>>    slab, ptdesc, and more (https://kernelnewbies.org/MatthewWilcox/Memdescs).
> 
> I think the users of zone_device_page_init() are principally trying to
> create something that can be installed in a non-special PTE. Meaning
> the output is always a folio because it is going to be read as a folio
> in the page walkers.
> 
> Thus, the job of this function is to take the memory range starting at
> page for 2^order and turn it into a single valid folio with refcount
> of 1.
> 
>> If device pages have to initialize on top of pages with obsolete states,
>> at least it should be first initialized as pages, then as folios to avoid
>> confusion.
> 
> I don't think so. It should do the above job efficiently and iterate
> over the page list exactly once.
> 
> Jason

Agreed

Balbir


^ permalink raw reply	[flat|nested] 44+ messages in thread

* Re: [PATCH v6 1/5] mm/zone_device: Reinitialize large zone device private folios
  2026-01-19 22:15                         ` Balbir Singh
@ 2026-01-20  2:50                           ` Zi Yan
  2026-01-20 13:53                             ` Jason Gunthorpe
  2026-01-21  3:51                             ` Balbir Singh
  0 siblings, 2 replies; 44+ messages in thread
From: Zi Yan @ 2026-01-20  2:50 UTC (permalink / raw)
  To: Jason Gunthorpe, Balbir Singh
  Cc: Matthew Wilcox, Alistair Popple, Matthew Brost, Vlastimil Babka,
	Francois Dugast, intel-xe, dri-devel, Madhavan Srinivasan,
	Nicholas Piggin, Michael Ellerman, Christophe Leroy (CS GROUP),
	Felix Kuehling, Alex Deucher, Christian König, David Airlie,
	Simona Vetter, Maarten Lankhorst, Maxime Ripard,
	Thomas Zimmermann, Lyude Paul, Danilo Krummrich,
	David Hildenbrand, Oscar Salvador, Andrew Morton,
	Leon Romanovsky, Lorenzo Stoakes, Liam R . Howlett,
	Mike Rapoport, Suren Baghdasaryan, Michal Hocko, linuxppc-dev,
	kvm, linux-kernel, amd-gfx, nouveau, linux-mm, linux-cxl

On 19 Jan 2026, at 17:15, Balbir Singh wrote:

> On 1/20/26 07:35, Jason Gunthorpe wrote:
>> On Mon, Jan 19, 2026 at 03:09:00PM -0500, Zi Yan wrote:
>>>> diff --git a/mm/internal.h b/mm/internal.h
>>>> index e430da900430a1..a7d3f5e4b85e49 100644
>>>> --- a/mm/internal.h
>>>> +++ b/mm/internal.h
>>>> @@ -806,14 +806,21 @@ static inline void prep_compound_head(struct page *page, unsigned int order)
>>>>  		atomic_set(&folio->_pincount, 0);
>>>>  		atomic_set(&folio->_entire_mapcount, -1);
>>>>  	}
>>>> -	if (order > 1)
>>>> +	if (order > 1) {
>>>>  		INIT_LIST_HEAD(&folio->_deferred_list);
>>>> +	} else {
>>>> +		folio->mapping = NULL;
>>>> +#ifdef CONFIG_MEMCG
>>>> +		folio->memcg_data = 0;
>>>> +#endif
>>>> +	}
>>>
>>> prep_compound_head() is only called on >0 order pages. The above
>>> code means when order == 1, folio->mapping and folio->memcg_data are
>>> assigned NULL.
>>
>> OK, fair enough, the conditionals would have to change and maybe it
>> shouldn't be called "compound_head" if it also cleans up normal pages.
>>
>>>>  static inline void prep_compound_tail(struct page *head, int tail_idx)
>>>>  {
>>>>  	struct page *p = head + tail_idx;
>>>>
>>>> +	p->flags.f &= ~0xffUL;	/* Clear possible order, page head */
>>>
>>> No one cares about tail page flags if it is not checked in check_new_page()
>>> from mm/page_alloc.c.
>>
>> At least page_fixed_fake_head() does check PG_head in some
>> configurations. It does seem safer to clear it. Possibly order is
>> never used, but it is free to clear it.
>>
>>>> -	if (order)
>>>> -		prep_compound_page(page, order);
>>>> +	prep_compound_page(page, order);
>>>
>>> prep_compound_page() should only be called for >0 order pages. This creates
>>> another weirdness in device pages by assuming all pages are
>>> compound.
>>
>> OK
>>
>>>> +	folio = page_folio(page);
>>>> +	folio->pgmap = pgmap;
>>>> +	folio_lock(folio);
>>>> +	folio_set_count(folio, 1);
>>>
>>> /* clear possible previous page->mapping */
>>> folio->mapping = NULL;
>>>
>>> /* clear possible previous page->_nr_pages */
>>> #ifdef CONFIG_MEMCG
>>> 	folio->memcg_data = 0;
>>> #endif
>>
>> This is reasonable too, but prep_compound_head() was doing more than
>> that, it is also clearing the order, and this needs to clear the head
>> bit.  That's why it was apppealing to reuse those functions, but you
>> are right they are not ideal.

PG_head is and must be bit 6, that means the stored order needs to be
at least 2^6=64 to get it set. Who allocates a folio with that large order?
This p->flags.f &= ~0xffUL thing is unnecessary. What really needs
to be done is folio->flags.f &= ~PAGE_FLAGS_CHECK_AT_PREP to make
sure the new folio flags are the same as newly allocated folios
from core MM page allocator.

>>
>> I suppose we want some prep_single_page(page) and some reorg to share
>> code with the other prep function.

This is just an unnecessary need, caused by a lack of knowledge of, or an
unwillingness to investigate, core MM page and folio initialization code.

>>
>
> There is __init_zone_device_page() and __init_single_page(),
> it does zero out the page and sets the zone, pfn, nid among other things.
> I propose we use the current version with zone_device_free_folio() as is.
>
> We can figure out if __init_zone_device_page() can be reused or refactored
> for the purposes to doing this with core MM API's
>
>
>>> This patch mixed the concept of page and folio together, thus
>>> causing confusion. Core MM sees page and folio two separate things:
>>> 1. page is the smallest internal physical memory management unit,
>>> 2. folio is an abstraction on top of pages, and other abstractions can be
>>>    slab, ptdesc, and more (https://kernelnewbies.org/MatthewWilcox/Memdescs).
>>
>> I think the users of zone_device_page_init() are principally trying to
>> create something that can be installed in a non-special PTE. Meaning
>> the output is always a folio because it is going to be read as a folio
>> in the page walkers.
>>
>> Thus, the job of this function is to take the memory range starting at
>> page for 2^order and turn it into a single valid folio with refcount
>> of 1.
>>
>>> If device pages have to initialize on top of pages with obsolete states,
>>> at least it should be first initialized as pages, then as folios to avoid
>>> confusion.
>>
>> I don't think so. It should do the above job efficiently and iterate
>> over the page list exactly once.

folio initialization should not iterate over any page list, since a folio is
supposed to be treated as a whole rather than as individual pages.

Based on my understanding,

folio->mapping = NULL;
folio->memcg_data = 0;
folio->flags.f &= ~PAGE_FLAGS_CHECK_AT_PREP;

should be enough.

if (order)
	folio_set_large_rmappable(folio);

is done at zone_device_folio_init().
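
Put together, an untested sketch of what zone_device_page_init() could look
like with the above applied (field names as in the current code, illustration
only):

void zone_device_page_init(struct page *page, unsigned int order)
{
	WARN_ON_ONCE(!percpu_ref_tryget_many(&page_pgmap(page)->ref, 1 << order));

	/* clear state possibly left over from a previous allocation */
	page->mapping = NULL;
#ifdef CONFIG_MEMCG
	page->memcg_data = 0;
#endif
	page->flags.f &= ~PAGE_FLAGS_CHECK_AT_PREP;

	set_page_count(page, 1);
	lock_page(page);

	if (order)
		prep_compound_page(page, order);
}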

Best Regards,
Yan, Zi


^ permalink raw reply	[flat|nested] 44+ messages in thread

* Re: [PATCH v6 1/5] mm/zone_device: Reinitialize large zone device private folios
  2026-01-20  2:50                           ` Zi Yan
@ 2026-01-20 13:53                             ` Jason Gunthorpe
  2026-01-21  3:01                               ` Zi Yan
  2026-01-21  3:51                             ` Balbir Singh
  1 sibling, 1 reply; 44+ messages in thread
From: Jason Gunthorpe @ 2026-01-20 13:53 UTC (permalink / raw)
  To: Zi Yan
  Cc: Balbir Singh, Matthew Wilcox, Alistair Popple, Matthew Brost,
	Vlastimil Babka, Francois Dugast, intel-xe, dri-devel,
	Madhavan Srinivasan, Nicholas Piggin, Michael Ellerman,
	Christophe Leroy (CS GROUP),
	Felix Kuehling, Alex Deucher, Christian König, David Airlie,
	Simona Vetter, Maarten Lankhorst, Maxime Ripard,
	Thomas Zimmermann, Lyude Paul, Danilo Krummrich,
	David Hildenbrand, Oscar Salvador, Andrew Morton,
	Leon Romanovsky, Lorenzo Stoakes, Liam R . Howlett,
	Mike Rapoport, Suren Baghdasaryan, Michal Hocko, linuxppc-dev,
	kvm, linux-kernel, amd-gfx, nouveau, linux-mm, linux-cxl

On Mon, Jan 19, 2026 at 09:50:16PM -0500, Zi Yan wrote:
> >> I suppose we want some prep_single_page(page) and some reorg to share
> >> code with the other prep function.
> 
> This is just an unnecessary need due to lack of knowledge of/do not want
> to investigate core MM page and folio initialization code.

It will be better to keep this related code together, not spread all
around.

> >> I don't think so. It should do the above job efficiently and iterate
> >> over the page list exactly once.
> 
> folio initialization should not iterate over any page list, since folio is
> supposed to be treated as a whole instead of individual pages.

The tail pages need to have the right data in them or compound_head
won't work.

> folio->mapping = NULL;
> folio->memcg_data = 0;
> folio->flags.f &= ~PAGE_FLAGS_CHECK_AT_PREP;
> 
> should be enough.

This seems believable to me for setting up an order 0 page.

> if (order)
> 	folio_set_large_rmappable(folio);

That one is in zone_device_folio_init()

And maybe the naming has got really confused if we have both functions
now :\

Jason


^ permalink raw reply	[flat|nested] 44+ messages in thread

* Re: [PATCH v6 1/5] mm/zone_device: Reinitialize large zone device private folios
  2026-01-20 13:53                             ` Jason Gunthorpe
@ 2026-01-21  3:01                               ` Zi Yan
  2026-01-22  7:19                                 ` Matthew Brost
  2026-01-22 15:46                                 ` Jason Gunthorpe
  0 siblings, 2 replies; 44+ messages in thread
From: Zi Yan @ 2026-01-21  3:01 UTC (permalink / raw)
  To: Jason Gunthorpe
  Cc: Balbir Singh, Matthew Wilcox, Alistair Popple, Matthew Brost,
	Vlastimil Babka, Francois Dugast, intel-xe, dri-devel,
	Madhavan Srinivasan, Nicholas Piggin, Michael Ellerman,
	Christophe Leroy (CS GROUP),
	Felix Kuehling, Alex Deucher, Christian König, David Airlie,
	Simona Vetter, Maarten Lankhorst, Maxime Ripard,
	Thomas Zimmermann, Lyude Paul, Danilo Krummrich,
	David Hildenbrand, Oscar Salvador, Andrew Morton,
	Leon Romanovsky, Lorenzo Stoakes, Liam R . Howlett,
	Mike Rapoport, Suren Baghdasaryan, Michal Hocko, linuxppc-dev,
	kvm, linux-kernel, amd-gfx, nouveau, linux-mm, linux-cxl

On 20 Jan 2026, at 8:53, Jason Gunthorpe wrote:

> On Mon, Jan 19, 2026 at 09:50:16PM -0500, Zi Yan wrote:
>>>> I suppose we want some prep_single_page(page) and some reorg to share
>>>> code with the other prep function.
>>
>> This is just an unnecessary need due to lack of knowledge of/do not want
>> to investigate core MM page and folio initialization code.
>
> It will be better to keep this related code together, not spread all
> around.

Or clarify what code is for preparing pages, which would go away at memdesc
time, and what code is for preparing folios, which would stay.

>
>>>> I don't think so. It should do the above job efficiently and iterate
>>>> over the page list exactly once.
>>
>> folio initialization should not iterate over any page list, since folio is
>> supposed to be treated as a whole instead of individual pages.
>
> The tail pages need to have the right data in them or compound_head
> won't work.

That is done by set_compound_head() in prep_compound_tail().
prep_compound_page() takes care of it. As long as it is called, even if
the pages in that compound page have random states before, the compound
page should function correctly afterwards.

>
>> folio->mapping = NULL;
>> folio->memcg_data = 0;
>> folio->flags.f &= ~PAGE_FLAGS_CHECK_AT_PREP;
>>
>> should be enough.
>
> This seems believable to me for setting up an order 0 page.

It works for any folio, regardless of its order. Fields used in the second
or third subpages are all taken care of by prep_compound_page().

>
>> if (order)
>> 	folio_set_large_rmappable(folio);
>
> That one is in zone_device_folio_init()

Yes. And the code location looks right to me.

>
> And maybe the naming has got really confused if we have both functions
> now :\

Yes. One of the issues is that device private code used to only handle
order-0 pages and was converted to use high-order folios directly without
using high-order pages (namely compound pages) as an intermediate step.
This two-step-in-one caused confusion. But the key thing to avoid the
confusion is that to form a high-order folio, a list of contiguous pages
first becomes a compound page by calling prep_compound_page(), then
the compound page becomes a folio by calling folio_set_large_rmappable().
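
i.e. conceptually, for order > 0 (sketch only):

	prep_compound_page(page, order);		/* contiguous pages -> compound page */
	folio_set_large_rmappable(page_folio(page));	/* compound page -> folio */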

BTW, the code in prep_compound_head() after folio_set_order(folio, order)
should belong to folio_set_large_rmappable() and they are causing confusion,
since they are only applicable to rmappable large folios. I am going to
send a patch to fix it.


Best Regards,
Yan, Zi


^ permalink raw reply	[flat|nested] 44+ messages in thread

* Re: [PATCH v6 1/5] mm/zone_device: Reinitialize large zone device private folios
  2026-01-20  2:50                           ` Zi Yan
  2026-01-20 13:53                             ` Jason Gunthorpe
@ 2026-01-21  3:51                             ` Balbir Singh
  1 sibling, 0 replies; 44+ messages in thread
From: Balbir Singh @ 2026-01-21  3:51 UTC (permalink / raw)
  To: Zi Yan, Jason Gunthorpe
  Cc: Matthew Wilcox, Alistair Popple, Matthew Brost, Vlastimil Babka,
	Francois Dugast, intel-xe, dri-devel, Madhavan Srinivasan,
	Nicholas Piggin, Michael Ellerman, Christophe Leroy (CS GROUP),
	Felix Kuehling, Alex Deucher, Christian König, David Airlie,
	Simona Vetter, Maarten Lankhorst, Maxime Ripard,
	Thomas Zimmermann, Lyude Paul, Danilo Krummrich,
	David Hildenbrand, Oscar Salvador, Andrew Morton,
	Leon Romanovsky, Lorenzo Stoakes, Liam R . Howlett,
	Mike Rapoport, Suren Baghdasaryan, Michal Hocko, linuxppc-dev,
	kvm, linux-kernel, amd-gfx, nouveau, linux-mm, linux-cxl

On 1/20/26 13:50, Zi Yan wrote:
> On 19 Jan 2026, at 17:15, Balbir Singh wrote:
> 
>> On 1/20/26 07:35, Jason Gunthorpe wrote:
>>> On Mon, Jan 19, 2026 at 03:09:00PM -0500, Zi Yan wrote:
>>>>> diff --git a/mm/internal.h b/mm/internal.h
>>>>> index e430da900430a1..a7d3f5e4b85e49 100644
>>>>> --- a/mm/internal.h
>>>>> +++ b/mm/internal.h
>>>>> @@ -806,14 +806,21 @@ static inline void prep_compound_head(struct page *page, unsigned int order)
>>>>>  		atomic_set(&folio->_pincount, 0);
>>>>>  		atomic_set(&folio->_entire_mapcount, -1);
>>>>>  	}
>>>>> -	if (order > 1)
>>>>> +	if (order > 1) {
>>>>>  		INIT_LIST_HEAD(&folio->_deferred_list);
>>>>> +	} else {
>>>>> +		folio->mapping = NULL;
>>>>> +#ifdef CONFIG_MEMCG
>>>>> +		folio->memcg_data = 0;
>>>>> +#endif
>>>>> +	}
>>>>
>>>> prep_compound_head() is only called on >0 order pages. The above
>>>> code means when order == 1, folio->mapping and folio->memcg_data are
>>>> assigned NULL.
>>>
>>> OK, fair enough, the conditionals would have to change and maybe it
>>> shouldn't be called "compound_head" if it also cleans up normal pages.
>>>
>>>>>  static inline void prep_compound_tail(struct page *head, int tail_idx)
>>>>>  {
>>>>>  	struct page *p = head + tail_idx;
>>>>>
>>>>> +	p->flags.f &= ~0xffUL;	/* Clear possible order, page head */
>>>>
>>>> No one cares about tail page flags if it is not checked in check_new_page()
>>>> from mm/page_alloc.c.
>>>
>>> At least page_fixed_fake_head() does check PG_head in some
>>> configurations. It does seem safer to clear it. Possibly order is
>>> never used, but it is free to clear it.
>>>
>>>>> -	if (order)
>>>>> -		prep_compound_page(page, order);
>>>>> +	prep_compound_page(page, order);
>>>>
>>>> prep_compound_page() should only be called for >0 order pages. This creates
>>>> another weirdness in device pages by assuming all pages are
>>>> compound.
>>>
>>> OK
>>>
>>>>> +	folio = page_folio(page);
>>>>> +	folio->pgmap = pgmap;
>>>>> +	folio_lock(folio);
>>>>> +	folio_set_count(folio, 1);
>>>>
>>>> /* clear possible previous page->mapping */
>>>> folio->mapping = NULL;
>>>>
>>>> /* clear possible previous page->_nr_pages */
>>>> #ifdef CONFIG_MEMCG
>>>> 	folio->memcg_data = 0;
>>>> #endif
>>>
>>> This is reasonable too, but prep_compound_head() was doing more than
>>> that, it is also clearing the order, and this needs to clear the head
>>> bit.  That's why it was apppealing to reuse those functions, but you
>>> are right they are not ideal.
> 
> PG_head is and must be bit 6, that means the stored order needs to be
> at least 2^6=64 to get it set. Who allocates a folio with that large order?
> This p->flags.f &= ~0xffUL thing is unnecessary. What really needs
> to be done is folio->flags.f &= ~PAGE_FLAGS_CHECK_AT_PREP to make
> sure the new folio flags are the same as newly allocated folios
> from core MM page allocator.
> 
>>>
>>> I suppose we want some prep_single_page(page) and some reorg to share
>>> code with the other prep function.
> 
> This is just an unnecessary need due to lack of knowledge of/do not want
> to investigate core MM page and folio initialization code.
> 
>>>
>>
>> There is __init_zone_device_page() and __init_single_page(),
>> it does zero out the page and sets the zone, pfn, nid among other things.
>> I propose we use the current version with zone_device_free_folio() as is.
>>
>> We can figure out if __init_zone_device_page() can be reused or refactored
>> for the purposes to doing this with core MM API's
>>
>>
>>>> This patch mixed the concept of page and folio together, thus
>>>> causing confusion. Core MM sees page and folio two separate things:
>>>> 1. page is the smallest internal physical memory management unit,
>>>> 2. folio is an abstraction on top of pages, and other abstractions can be
>>>>    slab, ptdesc, and more (https://kernelnewbies.org/MatthewWilcox/Memdescs).
>>>
>>> I think the users of zone_device_page_init() are principally trying to
>>> create something that can be installed in a non-special PTE. Meaning
>>> the output is always a folio because it is going to be read as a folio
>>> in the page walkers.
>>>
>>> Thus, the job of this function is to take the memory range starting at
>>> page for 2^order and turn it into a single valid folio with refcount
>>> of 1.
>>>
>>>> If device pages have to initialize on top of pages with obsolete states,
>>>> at least it should be first initialized as pages, then as folios to avoid
>>>> confusion.
>>>
>>> I don't think so. It should do the above job efficiently and iterate
>>> over the page list exactly once.
> 
> folio initialization should not iterate over any page list, since folio is
> supposed to be treated as a whole instead of individual pages.
> 
> Based on my understanding,
> 
> folio->mapping = NULL;
> folio->memcg_data = 0;
> folio->flags.f &= ~PAGE_FLAGS_CHECK_AT_PREP;
> 
> should be enough.
> 

I think it should be enough as well; worst case, memcg_data is aliased with
slab_obj_exts, but we don't expect zone device folios to have slab_obj_exts
set.

folio->memcg_data needs to be under an #ifdef CONFIG_MEMCG, and folio->mapping
was set to NULL during the previous free (one could assume it's unchanged).


> if (order)
> 	folio_set_large_rmappable(folio);
> 
> is done at zone_device_folio_init().
> 



Balbir


^ permalink raw reply	[flat|nested] 44+ messages in thread

* Re: [PATCH v6 1/5] mm/zone_device: Reinitialize large zone device private folios
  2026-01-21  3:01                               ` Zi Yan
@ 2026-01-22  7:19                                 ` Matthew Brost
  2026-01-22  8:00                                   ` Vlastimil Babka
  2026-01-22 14:29                                   ` Jason Gunthorpe
  2026-01-22 15:46                                 ` Jason Gunthorpe
  1 sibling, 2 replies; 44+ messages in thread
From: Matthew Brost @ 2026-01-22  7:19 UTC (permalink / raw)
  To: Zi Yan
  Cc: Jason Gunthorpe, Balbir Singh, Matthew Wilcox, Alistair Popple,
	Vlastimil Babka, Francois Dugast, intel-xe, dri-devel,
	Madhavan Srinivasan, Nicholas Piggin, Michael Ellerman,
	Christophe Leroy (CS GROUP),
	Felix Kuehling, Alex Deucher, Christian König, David Airlie,
	Simona Vetter, Maarten Lankhorst, Maxime Ripard,
	Thomas Zimmermann, Lyude Paul, Danilo Krummrich,
	David Hildenbrand, Oscar Salvador, Andrew Morton,
	Leon Romanovsky, Lorenzo Stoakes, Liam R . Howlett,
	Mike Rapoport, Suren Baghdasaryan, Michal Hocko, linuxppc-dev,
	kvm, linux-kernel, amd-gfx, nouveau, linux-mm, linux-cxl

On Tue, Jan 20, 2026 at 10:01:18PM -0500, Zi Yan wrote:
> On 20 Jan 2026, at 8:53, Jason Gunthorpe wrote:
> 

This whole thread makes my head hurt, as does core MM.

IMO the TL;DR is:

- Why is Intel the only one proving this stuff works? We can debate all
  day about what should or should not work — but someone else needs to
  actually prove it, rather than type hypotheticals.

- Intel has demonstrated that this works and is still getting blocked.

- This entire thread is about a fixes patch for large device pages.
  Changing prep_compound_page is completely out of scope for a fixes
  patch, and honestly so is most of the rest of what’s being proposed.

- At a minimum, you must clear every page’s flags in the loop. So why not
  conservatively clear anything else a folio might have set before calling
  an existing core-MM function, ensuring the pages are in a known state?
  This is a fixes patch.

- Given the current state of the discussion, I don’t think large device
  pages should be in 6.19. And if so, why didn’t the entire device pages
  series receive this level of scrutiny earlier? It’s my mistake for not
  saying “no” until the reallocation at different sizes issue was resolved.

@Andrew - I'd revert large device pages in 6.19 as it doesn't work and
we seemingly cannot close on this.

Matt

> > On Mon, Jan 19, 2026 at 09:50:16PM -0500, Zi Yan wrote:
> >>>> I suppose we want some prep_single_page(page) and some reorg to share
> >>>> code with the other prep function.
> >>
> >> This is just an unnecessary need due to lack of knowledge of/do not want
> >> to investigate core MM page and folio initialization code.
> >
> > It will be better to keep this related code together, not spread all
> > around.
> 
> Or clarify what code is for preparing pages, which would go away at memdesc
> time, and what code is for preparing folios, which would stay.
> 
> >
> >>>> I don't think so. It should do the above job efficiently and iterate
> >>>> over the page list exactly once.
> >>
> >> folio initialization should not iterate over any page list, since folio is
> >> supposed to be treated as a whole instead of individual pages.
> >
> > The tail pages need to have the right data in them or compound_head
> > won't work.
> 
> That is done by set_compound_head() in prep_compound_tail().
> prep_compound_page() take cares of it. As long as it is called, even if
> the pages in that compound page have random states before, the compound
> page should function correctly afterwards.
> 
> >
> >> folio->mapping = NULL;
> >> folio->memcg_data = 0;
> >> folio->flags.f &= ~PAGE_FLAGS_CHECK_AT_PREP;
> >>
> >> should be enough.
> >
> > This seems believable to me for setting up an order 0 page.
> 
> It works for any folio, regardless of its order. fields used in second
> or third subpages are all taken care of by prep_compound_page().
> 
> >
> >> if (order)
> >> 	folio_set_large_rmappable(folio);
> >
> > That one is in zone_device_folio_init()
> 
> Yes. And the code location looks right to me.
> 
> >
> > And maybe the naming has got really confused if we have both functions
> > now :\
> 
> Yes. One of the issues is that device private code used to only handles
> order-0 pages and was converted to use high order folio directly without
> using high order page (namely compound page) as an intermediate step.
> This two-step-in-one caused confusion. But the key thing to avoid the
> confusion is that to form a high order folio, a list of contiguous pages
> would become a compound page by calling prep_compound_page(), then
> the compound page becomes a folio by calling folio_set_large_rmappable().
> 
> BTW, the code in prep_compound_head() after folio_set_order(folio, order)
> should belong to folio_set_large_rmappable() and they are causing confusion,
> since they are only applicable to rmappable large folios. I am going to
> send a patch to fix it.
> 
> 
> Best Regards,
> Yan, Zi


^ permalink raw reply	[flat|nested] 44+ messages in thread

* Re: [PATCH v6 1/5] mm/zone_device: Reinitialize large zone device private folios
  2026-01-22  7:19                                 ` Matthew Brost
@ 2026-01-22  8:00                                   ` Vlastimil Babka
  2026-01-22  9:10                                     ` Balbir Singh
  2026-01-22 14:29                                   ` Jason Gunthorpe
  1 sibling, 1 reply; 44+ messages in thread
From: Vlastimil Babka @ 2026-01-22  8:00 UTC (permalink / raw)
  To: Matthew Brost, Zi Yan
  Cc: Jason Gunthorpe, Balbir Singh, Matthew Wilcox, Alistair Popple,
	Francois Dugast, intel-xe, dri-devel, Madhavan Srinivasan,
	Nicholas Piggin, Michael Ellerman, Christophe Leroy (CS GROUP),
	Felix Kuehling, Alex Deucher, Christian König, David Airlie,
	Simona Vetter, Maarten Lankhorst, Maxime Ripard,
	Thomas Zimmermann, Lyude Paul, Danilo Krummrich,
	David Hildenbrand, Oscar Salvador, Andrew Morton,
	Leon Romanovsky, Lorenzo Stoakes, Liam R . Howlett,
	Mike Rapoport, Suren Baghdasaryan, Michal Hocko, linuxppc-dev,
	kvm, linux-kernel, amd-gfx, nouveau, linux-mm, linux-cxl

On 1/22/26 08:19, Matthew Brost wrote:
> On Tue, Jan 20, 2026 at 10:01:18PM -0500, Zi Yan wrote:
>> On 20 Jan 2026, at 8:53, Jason Gunthorpe wrote:
>> 
> 
> This whole thread makes my head hurt, as does core MM.
> 
> IMO the TL;DR is:
> 
> - Why is Intel the only one proving this stuff works? We can debate all
>   day about what should or should not work — but someone else needs to
>   actually prove it.i, rather than type hypotheticals.
> 
> - Intel has demonstrated that this works and is still getting blocked.
> 
> - This entire thread is about a fixes patch for large device pages.
>   Changing prep_compound_page is completely out of scope for a fixes
>   patch, and honestly so is most of the rest of what’s being proposed.

FWIW I'm ok if this lands as a fix patch, and perceived the discussion to be
about how to refactor things more properly afterwards, going forward.

> - At a minimum, you must clear every page’s flags in the loop. So why not
>   conservatively clear anything else a folio might have set before calling
>   an existing core-MM function, ensuring the pages are in a known state?
>   This is a fixes patch.
> 
> - Given the current state of the discussion, I don’t think large device
>   pages should be in 6.19. And if so, why didn’t the entire device pages
>   series receive this level of scrutiny earlier? It’s my mistake for not
>   saying “no” until the reallocation at different sizes issue was resolved.
> 
> @Andrew. - I'd revert large device pages in 6.19 as it doesn't work and
> we seemingly cannot close on this.
> 
> Matt
> 
>> > On Mon, Jan 19, 2026 at 09:50:16PM -0500, Zi Yan wrote:
>> >>>> I suppose we want some prep_single_page(page) and some reorg to share
>> >>>> code with the other prep function.
>> >>
>> >> This is just an unnecessary requirement, arising from a lack of knowledge of,
>> >> or an unwillingness to investigate, core MM page and folio initialization code.
>> >
>> > It will be better to keep this related code together, not spread all
>> > around.
>> 
>> Or clarify what code is for preparing pages, which would go away at memdesc
>> time, and what code is for preparing folios, which would stay.
>> 
>> >
>> >>>> I don't think so. It should do the above job efficiently and iterate
>> >>>> over the page list exactly once.
>> >>
>> >> folio initialization should not iterate over any page list, since a folio is
>> >> supposed to be treated as a whole instead of as individual pages.
>> >
>> > The tail pages need to have the right data in them or compound_head
>> > won't work.
>> 
>> That is done by set_compound_head() in prep_compound_tail().
>> prep_compound_page() takes care of it. As long as it is called, even if
>> the pages in that compound page have random states before, the compound
>> page should function correctly afterwards.
>> 
>> >
>> >> folio->mapping = NULL;
>> >> folio->memcg_data = 0;
>> >> folio->flags.f &= ~PAGE_FLAGS_CHECK_AT_PREP;
>> >>
>> >> should be enough.
>> >
>> > This seems believable to me for setting up an order 0 page.
>> 
>> It works for any folio, regardless of its order. Fields used in the second
>> or third subpages are all taken care of by prep_compound_page().
>> 
>> >
>> >> if (order)
>> >> 	folio_set_large_rmappable(folio);
>> >
>> > That one is in zone_device_folio_init()
>> 
>> Yes. And the code location looks right to me.
>> 
>> >
>> > And maybe the naming has got really confused if we have both functions
>> > now :\
>> 
>> Yes. One of the issues is that device private code used to only handle
>> order-0 pages and was converted to use high order folio directly without
>> using high order page (namely compound page) as an intermediate step.
>> This two-step-in-one caused confusion. But the key thing to avoid the
>> confusion is that to form a high order folio, a list of contiguous pages
>> would become a compound page by calling prep_compound_page(), then
>> the compound page becomes a folio by calling folio_set_large_rmappable().
>> 
>> BTW, the code in prep_compound_head() after folio_set_order(folio, order)
>> should belong to folio_set_large_rmappable(); it is causing confusion,
>> since it is only applicable to rmappable large folios. I am going to
>> send a patch to fix it.
>> 
>> 
>> Best Regards,
>> Yan, Zi



^ permalink raw reply	[flat|nested] 44+ messages in thread

* Re: [PATCH v6 1/5] mm/zone_device: Reinitialize large zone device private folios
  2026-01-16 16:07   ` Vlastimil Babka
  2026-01-16 17:20     ` Jason Gunthorpe
@ 2026-01-22  8:02     ` Vlastimil Babka
  1 sibling, 0 replies; 44+ messages in thread
From: Vlastimil Babka @ 2026-01-22  8:02 UTC (permalink / raw)
  To: Francois Dugast, intel-xe
  Cc: dri-devel, Matthew Brost, Zi Yan, Alistair Popple,
	Madhavan Srinivasan, Nicholas Piggin, Michael Ellerman,
	Christophe Leroy (CS GROUP),
	Felix Kuehling, Alex Deucher, Christian König, David Airlie,
	Simona Vetter, Maarten Lankhorst, Maxime Ripard,
	Thomas Zimmermann, Lyude Paul, Danilo Krummrich,
	David Hildenbrand, Oscar Salvador, Andrew Morton,
	Jason Gunthorpe, Leon Romanovsky, Lorenzo Stoakes,
	Liam R . Howlett, Mike Rapoport, Suren Baghdasaryan,
	Michal Hocko, Balbir Singh, linuxppc-dev, kvm, linux-kernel,
	amd-gfx, nouveau, linux-mm, linux-cxl

On 1/16/26 17:07, Vlastimil Babka wrote:
> On 1/16/26 12:10, Francois Dugast wrote:
>> From: Matthew Brost <matthew.brost@intel.com>
>> diff --git a/mm/memremap.c b/mm/memremap.c
>> index 63c6ab4fdf08..ac7be07e3361 100644
>> --- a/mm/memremap.c
>> +++ b/mm/memremap.c
>> @@ -477,10 +477,43 @@ void free_zone_device_folio(struct folio *folio)
>>  	}
>>  }
>>  
>> -void zone_device_page_init(struct page *page, unsigned int order)
>> +void zone_device_page_init(struct page *page, struct dev_pagemap *pgmap,
>> +			   unsigned int order)
>>  {
>> +	struct page *new_page = page;
>> +	unsigned int i;
>> +
>>  	VM_WARN_ON_ONCE(order > MAX_ORDER_NR_PAGES);
>>  
>> +	for (i = 0; i < (1UL << order); ++i, ++new_page) {
>> +		struct folio *new_folio = (struct folio *)new_page;
>> +
>> +		/*
>> +		 * new_page could have been part of a previous higher order folio
>> +		 * which encodes the order, in page + 1, in the flags bits. We
>> +		 * blindly clear bits which could have set the order field here,
>> +		 * including page head.
>> +		 */
>> +		new_page->flags.f &= ~0xffUL;	/* Clear possible order, page head */
>> +
>> +#ifdef NR_PAGES_IN_LARGE_FOLIO
>> +		/*
>> +		 * This pointer math looks odd, but new_page could have been
>> +		 * part of a previous higher order folio, which sets _nr_pages
>> +		 * in page + 1 (new_page). Therefore, we use pointer casting to
>> +		 * correctly locate the _nr_pages bits within new_page which
>> +		 * could have been modified by a previous higher order folio.
>> +		 */
>> +		((struct folio *)(new_page - 1))->_nr_pages = 0;
>> +#endif
>> +
>> +		new_folio->mapping = NULL;
>> +		new_folio->pgmap = pgmap;	/* Also clear compound head */
>> +		new_folio->share = 0;   /* fsdax only, unused for device private */
>> +		VM_WARN_ON_FOLIO(folio_ref_count(new_folio), new_folio);
>> +		VM_WARN_ON_FOLIO(!folio_is_zone_device(new_folio), new_folio);
>> +	}
>> +
>>  	/*
>>  	 * Drivers shouldn't be allocating pages after calling
>>  	 * memunmap_pages().
> 
> Can't say I'm a fan of this. It probably works now (so I'm not nacking) but
> seems rather fragile. It seems likely to me somebody will try to change some
> implementation detail in the page allocator and not notice it breaks this,
> for example. I hope we can eventually get to something more robust.

For doing this as a hotfix for 6.19, assuming we'll refactor later:

Acked-by: Vlastimil Babka <vbabka@suse.cz>


^ permalink raw reply	[flat|nested] 44+ messages in thread

* Re: [PATCH v6 1/5] mm/zone_device: Reinitialize large zone device private folios
  2026-01-22  8:00                                   ` Vlastimil Babka
@ 2026-01-22  9:10                                     ` Balbir Singh
  2026-01-22 21:41                                       ` Andrew Morton
  0 siblings, 1 reply; 44+ messages in thread
From: Balbir Singh @ 2026-01-22  9:10 UTC (permalink / raw)
  To: Vlastimil Babka, Matthew Brost, Zi Yan
  Cc: Jason Gunthorpe, Matthew Wilcox, Alistair Popple,
	Francois Dugast, intel-xe, dri-devel, Madhavan Srinivasan,
	Nicholas Piggin, Michael Ellerman, Christophe Leroy (CS GROUP),
	Felix Kuehling, Alex Deucher, Christian König, David Airlie,
	Simona Vetter, Maarten Lankhorst, Maxime Ripard,
	Thomas Zimmermann, Lyude Paul, Danilo Krummrich,
	David Hildenbrand, Oscar Salvador, Andrew Morton,
	Leon Romanovsky, Lorenzo Stoakes, Liam R . Howlett,
	Mike Rapoport, Suren Baghdasaryan, Michal Hocko, linuxppc-dev,
	kvm, linux-kernel, amd-gfx, nouveau, linux-mm, linux-cxl

On 1/22/26 19:00, Vlastimil Babka wrote:
> On 1/22/26 08:19, Matthew Brost wrote:
>> On Tue, Jan 20, 2026 at 10:01:18PM -0500, Zi Yan wrote:
>>> On 20 Jan 2026, at 8:53, Jason Gunthorpe wrote:
>>>
>>
>> This whole thread makes my head hurt, as does core MM.
>>
>> IMO the TL;DR is:
>>
>> - Why is Intel the only one proving this stuff works? We can debate all
>>   day about what should or should not work — but someone else needs to
>>   actually prove it, rather than type hypotheticals.
>>
>> - Intel has demonstrated that this works and is still getting blocked.
>>
>> - This entire thread is about a fixes patch for large device pages.
>>   Changing prep_compound_page is completely out of scope for a fixes
>>   patch, and honestly so is most of the rest of what’s being proposed.
> 
> FWIW I'm ok if this lands as a fix patch, and perceived the discussion to be
> about how to refactor things more properly afterwards, going forward.
> 

I've said the same thing and I concur: we can use the patch as-is and
change it to set the relevant identified fields after 6.19.

Balbir

>> - At a minimum, you must clear every page’s flags in the loop. So why not
>>   conservatively clear anything else a folio might have set before calling
>>   an existing core-MM function, ensuring the pages are in a known state?
>>   This is a fixes patch.
>>
>> - Given the current state of the discussion, I don’t think large device
>>   pages should be in 6.19. And if so, why didn’t the entire device pages
>>   series receive this level of scrutiny earlier? It’s my mistake for not
>>   saying “no” until the reallocation at different sizes issue was resolved.
>>
>> @Andrew. - I'd revert large device pages in 6.19 as it doesn't work and
>> we seemingly cannot close on this.
>>
>> Matt


<snip>


^ permalink raw reply	[flat|nested] 44+ messages in thread

* Re: [PATCH v6 1/5] mm/zone_device: Reinitialize large zone device private folios
  2026-01-22  7:19                                 ` Matthew Brost
  2026-01-22  8:00                                   ` Vlastimil Babka
@ 2026-01-22 14:29                                   ` Jason Gunthorpe
  1 sibling, 0 replies; 44+ messages in thread
From: Jason Gunthorpe @ 2026-01-22 14:29 UTC (permalink / raw)
  To: Matthew Brost
  Cc: Zi Yan, Balbir Singh, Matthew Wilcox, Alistair Popple,
	Vlastimil Babka, Francois Dugast, intel-xe, dri-devel,
	Madhavan Srinivasan, Nicholas Piggin, Michael Ellerman,
	Christophe Leroy (CS GROUP),
	Felix Kuehling, Alex Deucher, Christian König, David Airlie,
	Simona Vetter, Maarten Lankhorst, Maxime Ripard,
	Thomas Zimmermann, Lyude Paul, Danilo Krummrich,
	David Hildenbrand, Oscar Salvador, Andrew Morton,
	Leon Romanovsky, Lorenzo Stoakes, Liam R . Howlett,
	Mike Rapoport, Suren Baghdasaryan, Michal Hocko, linuxppc-dev,
	kvm, linux-kernel, amd-gfx, nouveau, linux-mm, linux-cxl

On Wed, Jan 21, 2026 at 11:19:45PM -0800, Matthew Brost wrote:
> - Why is Intel the only one proving this stuff works? We can debate all
>   day about what should or should not work — but someone else needs to
>   actually prove it, rather than type hypotheticals.

Oh come on, NVIDIA has done an *enormous* amount of work to get these
things to this point where they are actually getting close to
functional and usable.

Don't "Oh poor intel" me :P

> - Intel has demonstrated that this works and is still getting blocked.

We generally don't merge patches because they "works for me". The
issue is the thing you presented is very ugly and inefficient, and
when we start talking about the right way to do it you get all
defensive and disappear.

> - Given the current state of the discussion, I don’t think large device
>   pages should be in 6.19. And if so, why didn’t the entire device pages
>   series receive this level of scrutiny earlier? It’s my mistake for not
>   saying “no” until the reallocation at different sizes issue was resolved.

It did; nobody noticed this bug or posted something so obviously ugly :P

> @Andrew. - I'd revert large device pages in 6.19 as it doesn't work and
> we seemingly cannot close on this.

What's the issue here? You said you were going to go ahead with the
ugly thing, go do it and come back with something better. That's what
you wanted, right?

Jason


^ permalink raw reply	[flat|nested] 44+ messages in thread

* Re: [PATCH v6 1/5] mm/zone_device: Reinitialize large zone device private folios
  2026-01-21  3:01                               ` Zi Yan
  2026-01-22  7:19                                 ` Matthew Brost
@ 2026-01-22 15:46                                 ` Jason Gunthorpe
  2026-01-23  2:41                                   ` Zi Yan
  1 sibling, 1 reply; 44+ messages in thread
From: Jason Gunthorpe @ 2026-01-22 15:46 UTC (permalink / raw)
  To: Zi Yan
  Cc: Balbir Singh, Matthew Wilcox, Alistair Popple, Matthew Brost,
	Vlastimil Babka, Francois Dugast, intel-xe, dri-devel,
	Madhavan Srinivasan, Nicholas Piggin, Michael Ellerman,
	Christophe Leroy (CS GROUP),
	Felix Kuehling, Alex Deucher, Christian König, David Airlie,
	Simona Vetter, Maarten Lankhorst, Maxime Ripard,
	Thomas Zimmermann, Lyude Paul, Danilo Krummrich,
	David Hildenbrand, Oscar Salvador, Andrew Morton,
	Leon Romanovsky, Lorenzo Stoakes, Liam R . Howlett,
	Mike Rapoport, Suren Baghdasaryan, Michal Hocko, linuxppc-dev,
	kvm, linux-kernel, amd-gfx, nouveau, linux-mm, linux-cxl

On Tue, Jan 20, 2026 at 10:01:18PM -0500, Zi Yan wrote:
> On 20 Jan 2026, at 8:53, Jason Gunthorpe wrote:
> 
> > On Mon, Jan 19, 2026 at 09:50:16PM -0500, Zi Yan wrote:
> >>>> I suppose we want some prep_single_page(page) and some reorg to share
> >>>> code with the other prep function.
> >>
> >> This is just an unnecessary requirement, arising from a lack of knowledge of,
> >> or an unwillingness to investigate, core MM page and folio initialization code.
> >
> > It will be better to keep this related code together, not spread all
> > around.
> 
> Or clarify what code is for preparing pages, which would go away at memdesc
> time, and what code is for preparing folios, which would stay.

That comes back to the question of 'what are the rules for frozen
pages'

Now that we have frozen pages, where the frozen owner can use some of
the struct page memory however it likes, that memory needs to be reset
before the page is thawed and converted back to a folio.

memdesc time is only useful for memory that is not writable by frozen
owners - basically must be constant forever.

> >
> >>>> I don't think so. It should do the above job efficiently and iterate
> >>>> over the page list exactly once.
> >>
> >> folio initialization should not iterate over any page list, since a folio is
> >> supposed to be treated as a whole instead of as individual pages.
> >
> > The tail pages need to have the right data in them or compound_head
> > won't work.
> 
> That is done by set_compound_head() in prep_compound_tail().

Inside a page loop :)

	__SetPageHead(page);
	for (i = 1; i < nr_pages; i++)
		prep_compound_tail(page, i);

> Yes. One of the issues is that device private code used to only handle
> order-0 pages and was converted to use high order folio directly without
> using high order page (namely compound page) as an intermediate step.
> This two-step-in-one caused confusion. But the key thing to avoid the
> confusion is that to form a high order folio, a list of contiguous pages
> would become a compound page by calling prep_compound_page(), then
> the compound page becomes a folio by calling folio_set_large_rmappable().

That seems logical to me.
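
For reference, a minimal sketch of that two-step flow, using the existing
helpers at a hypothetical call site (illustrative only, not code from this
series):

	/* Illustrative only: form a high order folio in two steps. */
	static void example_make_large_folio(struct page *head, unsigned int order)
	{
		/* Step 1: contiguous pages -> compound page (head plus tails). */
		prep_compound_page(head, order);

		/* Step 2: compound page -> rmappable large folio. */
		if (order)
			folio_set_large_rmappable(page_folio(head));
	}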

Jason


^ permalink raw reply	[flat|nested] 44+ messages in thread

* Re: [PATCH v6 1/5] mm/zone_device: Reinitialize large zone device private folios
  2026-01-22  9:10                                     ` Balbir Singh
@ 2026-01-22 21:41                                       ` Andrew Morton
  2026-01-22 22:53                                         ` Alistair Popple
  2026-01-23  6:45                                         ` Vlastimil Babka
  0 siblings, 2 replies; 44+ messages in thread
From: Andrew Morton @ 2026-01-22 21:41 UTC (permalink / raw)
  To: Balbir Singh
  Cc: Vlastimil Babka, Matthew Brost, Zi Yan, Jason Gunthorpe,
	Matthew Wilcox, Alistair Popple, Francois Dugast, intel-xe,
	dri-devel, Madhavan Srinivasan, Nicholas Piggin, Michael Ellerman,
	Christophe Leroy (CS GROUP),
	Felix Kuehling, Alex Deucher, Christian König, David Airlie,
	Simona Vetter, Maarten Lankhorst, Maxime Ripard,
	Thomas Zimmermann, Lyude Paul, Danilo Krummrich,
	David Hildenbrand, Oscar Salvador, Leon Romanovsky,
	Lorenzo Stoakes, Liam R . Howlett, Mike Rapoport,
	Suren Baghdasaryan, Michal Hocko, linuxppc-dev, kvm,
	linux-kernel, amd-gfx, nouveau, linux-mm, linux-cxl

On Thu, 22 Jan 2026 20:10:44 +1100 Balbir Singh <balbirs@nvidia.com> wrote:

> >> - Intel has demonstrated that this works and is still getting blocked.
> >>
> >> - This entire thread is about a fixes patch for large device pages.
> >>   Changing prep_compound_page is completely out of scope for a fixes
> >>   patch, and honestly so is most of the rest of what’s being proposed.
> > 
> > FWIW I'm ok if this lands as a fix patch, and perceived the discussion to be
> > about how to refactor things more properly afterwards, going forward.
> > 
> 
> I've said the same thing and I concur, we can use the patch as-is and
> change this to set the relevant identified fields after 6.19

So the plan is to add this patch to 6.19-rc and take another look at
patches [2-5] during next -rc cycle?

I think the plan is to take Matthew's work via the DRM tree?  But if people
want me to patchbunny this fix then please lmk.

I presently have

Signed-off-by: Matthew Brost <matthew.brost@intel.com>
Signed-off-by: Francois Dugast <francois.dugast@intel.com>
Acked-by: Felix Kuehling <felix.kuehling@amd.com>
Reviewed-by: Balbir Singh <balbirs@nvidia.com>

If people wish to add to this then please do so.

I'll restore this patch into mm.git's hotfix branch (and hence
linux-next) because testing.


^ permalink raw reply	[flat|nested] 44+ messages in thread

* Re: [PATCH v6 1/5] mm/zone_device: Reinitialize large zone device private folios
  2026-01-22 21:41                                       ` Andrew Morton
@ 2026-01-22 22:53                                         ` Alistair Popple
  2026-01-23  6:45                                         ` Vlastimil Babka
  1 sibling, 0 replies; 44+ messages in thread
From: Alistair Popple @ 2026-01-22 22:53 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Balbir Singh, Vlastimil Babka, Matthew Brost, Zi Yan,
	Jason Gunthorpe, Matthew Wilcox, Francois Dugast, intel-xe,
	dri-devel, Madhavan Srinivasan, Nicholas Piggin, Michael Ellerman,
	Christophe Leroy (CS GROUP),
	Felix Kuehling, Alex Deucher, Christian König, David Airlie,
	Simona Vetter, Maarten Lankhorst, Maxime Ripard,
	Thomas Zimmermann, Lyude Paul, Danilo Krummrich,
	David Hildenbrand, Oscar Salvador, Leon Romanovsky,
	Lorenzo Stoakes, Liam R . Howlett, Mike Rapoport,
	Suren Baghdasaryan, Michal Hocko, linuxppc-dev, kvm,
	linux-kernel, amd-gfx, nouveau, linux-mm, linux-cxl

On 2026-01-23 at 08:41 +1100, Andrew Morton <akpm@linux-foundation.org> wrote...
> On Thu, 22 Jan 2026 20:10:44 +1100 Balbir Singh <balbirs@nvidia.com> wrote:
> 
> > >> - Intel has demonstrated that this works and is still getting blocked.
> > >>
> > >> - This entire thread is about a fixes patch for large device pages.
> > >>   Changing prep_compound_page is completely out of scope for a fixes
> > >>   patch, and honestly so is most of the rest of what’s being proposed.
> > > 
> > > FWIW I'm ok if this lands as a fix patch, and perceived the discussion to be
> > > about how to refactor things more properly afterwards, going forward.
> > > 
> > 
> > I've said the same thing and I concur, we can use the patch as-is and
> > change this to set the relevant identified fields after 6.19
> 
> So the plan is to add this patch to 6.19-rc and take another look at
> patches [2-5] during next -rc cycle?

I'm ok with this as a quick fix, and happy to take a look at cleaning this up
in the next cycle or two as it's been on my TODO list for a while anyway.

> I think the plan is to take Matthew's work via the DRM tree?  But if people
> want me to patchbunny this fix then please lmk.
> 
> I presently have
> 
> Signed-off-by: Matthew Brost <matthew.brost@intel.com>
> Signed-off-by: Francois Dugast <francois.dugast@intel.com>
> Acked-by: Felix Kuehling <felix.kuehling@amd.com>
> Reviewed-by: Balbir Singh <balbirs@nvidia.com>
> 
> If people wish to add to this then please do so.
> 
> I'll restore this patch into mm.git's hotfix branch (and hence
> linux-next) because testing.


^ permalink raw reply	[flat|nested] 44+ messages in thread

* Re: [PATCH v6 1/5] mm/zone_device: Reinitialize large zone device private folios
  2026-01-22 15:46                                 ` Jason Gunthorpe
@ 2026-01-23  2:41                                   ` Zi Yan
  2026-01-23 14:19                                     ` Jason Gunthorpe
  0 siblings, 1 reply; 44+ messages in thread
From: Zi Yan @ 2026-01-23  2:41 UTC (permalink / raw)
  To: Jason Gunthorpe
  Cc: Balbir Singh, Matthew Wilcox, Alistair Popple, Matthew Brost,
	Vlastimil Babka, Francois Dugast, intel-xe, dri-devel,
	Madhavan Srinivasan, Nicholas Piggin, Michael Ellerman,
	Christophe Leroy (CS GROUP),
	Felix Kuehling, Alex Deucher, Christian König, David Airlie,
	Simona Vetter, Maarten Lankhorst, Maxime Ripard,
	Thomas Zimmermann, Lyude Paul, Danilo Krummrich,
	David Hildenbrand, Oscar Salvador, Andrew Morton,
	Leon Romanovsky, Lorenzo Stoakes, Liam R . Howlett,
	Mike Rapoport, Suren Baghdasaryan, Michal Hocko, linuxppc-dev,
	kvm, linux-kernel, amd-gfx, nouveau, linux-mm, linux-cxl

On 22 Jan 2026, at 10:46, Jason Gunthorpe wrote:

> On Tue, Jan 20, 2026 at 10:01:18PM -0500, Zi Yan wrote:
>> On 20 Jan 2026, at 8:53, Jason Gunthorpe wrote:
>>
>>> On Mon, Jan 19, 2026 at 09:50:16PM -0500, Zi Yan wrote:
>>>>>> I suppose we want some prep_single_page(page) and some reorg to share
>>>>>> code with the other prep function.
>>>>
>>>> This is just an unnecessary requirement, arising from a lack of knowledge of,
>>>> or an unwillingness to investigate, core MM page and folio initialization code.
>>>
>>> It will be better to keep this related code together, not spread all
>>> around.
>>
>> Or clarify what code is for preparing pages, which would go away at memdesc
>> time, and what code is for preparing folios, which would stay.
>
> That comes back to the question of 'what are the rules for frozen
> pages'
>
> Now that we have frozen pages where the frozen owner can use some of
> the struct page memory however it likes that memory needs to be reset
> before the page is thawed and converted back to a folio.

Based on my understanding, a frozen folio cannot be changed however the
owner wants, since the modification needs to prevent parallel scanners
from misusing the folio. For example, PFN scanners like memory compaction
need to know this is a frozen folio with a certain order, so that they
will skip it as a whole. But if you change the frozen folio in a way
that a parallel scanner cannot recognize the right order (e.g., the frozen
folio order becomes lower) and it finds some of the subpages have a non-zero
refcount, that can cause issues.

But I assume device private pages do not have such a parallel scanner
looking at each struct page one by one and examining their state.

>
> memdesc time is only useful for memory that is not writable by frozen
> owners - basically must be constant forever.

Bits 0-3 of memdesc are a type field, so the owner should be able to
set it, so that others will stay away.

BTW, it seems that you treat frozen folios and free folios interchangeably
in this device private folio discussion. To me, they are different,
since a frozen folio is transient, to prevent others from touching the folio,
e.g., while a free page is taken from buddy and the allocator is setting up
its state, or while a folio is split. You do not want memory compaction code
to touch these transient folios/pages. Free folios, in contrast, are stable
until the next allocation, and others can recognize them and perform
reasonable operations. For example, memory compaction code can take
a free page out of buddy and use it as a migration destination.
That is why I want to remove all device private folio state when a folio
is freed. But memory compaction code never scans device private folios
and there are no other similar scanners, so that requirement might not
be needed.
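
To make the hazard concrete, here is a grossly simplified sketch of the kind
of PFN walk such a scanner performs (illustrative only; real compaction code
checks much more than this):

	static unsigned long example_scan_step(unsigned long pfn)
	{
		struct folio *folio = page_folio(pfn_to_page(pfn));

		/* Skip a zero-refcount folio as a whole; relies on a stable order. */
		if (!folio_ref_count(folio))
			return pfn + folio_nr_pages(folio);

		return pfn + 1;
	}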

>
>>>
>>>>>> I don't think so. It should do the above job efficiently and iterate
>>>>>> over the page list exactly once.
>>>>
> >>>> folio initialization should not iterate over any page list, since a folio is
> >>>> supposed to be treated as a whole instead of as individual pages.
>>>
>>> The tail pages need to have the right data in them or compound_head
>>> won't work.
>>
>> That is done by set_compound_head() in prep_compound_tail().
>
> Inside a page loop :)
>
> 	__SetPageHead(page);
> 	for (i = 1; i < nr_pages; i++)
> 		prep_compound_tail(page, i);

Yes, but to a folio, the fields of tail pages 1 and 2 are used because
we do not want to inflate struct folio for high order folios. In this
loop, all tail pages are processed in the same way. To follow your method,
there would be some ifs for tail page 1 to clear _nr_pages and for tail
page 2 to clear other fields. It feels to me that we are clearly mixing
struct page and struct folio.
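
To illustrate the mixing, the per-page loop would then need folio-aware
special cases, roughly like the hypothetical sketch below (ignoring config
guards such as NR_PAGES_IN_LARGE_FOLIO; not proposed code):

	static void example_clear_tail_pages(struct page *head, unsigned int order)
	{
		unsigned long i;

		for (i = 1; i < (1UL << order); i++) {
			struct page *tail = head + i;

			tail->flags.f &= ~PAGE_FLAGS_CHECK_AT_PREP;

			/* struct folio overlays _nr_pages onto tail page 1 ... */
			if (i == 1)
				((struct folio *)head)->_nr_pages = 0;

			/* ... and more fields onto tail page 2, needing more ifs. */
		}
	}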

>
> >> Yes. One of the issues is that device private code used to only handle
>> order-0 pages and was converted to use high order folio directly without
>> using high order page (namely compound page) as an intermediate step.
>> This two-step-in-one caused confusion. But the key thing to avoid the
>> confusion is that to form a high order folio, a list of contiguous pages
>> would become a compound page by calling prep_compound_page(), then
>> the compound page becomes a folio by calling folio_set_large_rmappable().
>
> That seems logical to me.
>
> Jason


Best Regards,
Yan, Zi


^ permalink raw reply	[flat|nested] 44+ messages in thread

* Re: [PATCH v6 1/5] mm/zone_device: Reinitialize large zone device private folios
  2026-01-22 21:41                                       ` Andrew Morton
  2026-01-22 22:53                                         ` Alistair Popple
@ 2026-01-23  6:45                                         ` Vlastimil Babka
  1 sibling, 0 replies; 44+ messages in thread
From: Vlastimil Babka @ 2026-01-23  6:45 UTC (permalink / raw)
  To: Andrew Morton, Balbir Singh
  Cc: Matthew Brost, Zi Yan, Jason Gunthorpe, Matthew Wilcox,
	Alistair Popple, Francois Dugast, intel-xe, dri-devel,
	Madhavan Srinivasan, Nicholas Piggin, Michael Ellerman,
	Christophe Leroy (CS GROUP),
	Felix Kuehling, Alex Deucher, Christian König, David Airlie,
	Simona Vetter, Maarten Lankhorst, Maxime Ripard,
	Thomas Zimmermann, Lyude Paul, Danilo Krummrich,
	David Hildenbrand, Oscar Salvador, Leon Romanovsky,
	Lorenzo Stoakes, Liam R . Howlett, Mike Rapoport,
	Suren Baghdasaryan, Michal Hocko, linuxppc-dev, kvm,
	linux-kernel, amd-gfx, nouveau, linux-mm, linux-cxl

On 1/22/26 22:41, Andrew Morton wrote:
> On Thu, 22 Jan 2026 20:10:44 +1100 Balbir Singh <balbirs@nvidia.com> wrote:
> 
>> >> - Intel has demonstrated that this works and is still getting blocked.
>> >>
>> >> - This entire thread is about a fixes patch for large device pages.
>> >>   Changing prep_compound_page is completely out of scope for a fixes
>> >>   patch, and honestly so is most of the rest of what’s being proposed.
>> > 
>> > FWIW I'm ok if this lands as a fix patch, and perceived the discussion to be
>> > about how to refactor things more properly afterwards, going forward.
>> > 
>> 
>> I've said the same thing and I concur, we can use the patch as-is and
>> change this to set the relevant identified fields after 6.19
> 
> So the plan is to add this patch to 6.19-rc and take another look at
> patches [2-5] during next -rc cycle?
> 
> I think the plan is to take Matthew's work via the DRM tree?  But if people
> want me to patchbunny this fix then please lmk.
> 
> I presently have
> 
> Signed-off-by: Matthew Brost <matthew.brost@intel.com>
> Signed-off-by: Francois Dugast <francois.dugast@intel.com>
> Acked-by: Felix Kuehling <felix.kuehling@amd.com>
> Reviewed-by: Balbir Singh <balbirs@nvidia.com>
> 
> If people wish to add to this then please do so.

I did too.

Acked-by: Vlastimil Babka <vbabka@suse.cz>

> I'll restore this patch into mm.git's hotfix branch (and hence
> linux-next) because testing.



^ permalink raw reply	[flat|nested] 44+ messages in thread

* Re: [PATCH v6 1/5] mm/zone_device: Reinitialize large zone device private folios
  2026-01-23  2:41                                   ` Zi Yan
@ 2026-01-23 14:19                                     ` Jason Gunthorpe
  0 siblings, 0 replies; 44+ messages in thread
From: Jason Gunthorpe @ 2026-01-23 14:19 UTC (permalink / raw)
  To: Zi Yan
  Cc: Balbir Singh, Matthew Wilcox, Alistair Popple, Matthew Brost,
	Vlastimil Babka, Francois Dugast, intel-xe, dri-devel,
	Madhavan Srinivasan, Nicholas Piggin, Michael Ellerman,
	Christophe Leroy (CS GROUP),
	Felix Kuehling, Alex Deucher, Christian König, David Airlie,
	Simona Vetter, Maarten Lankhorst, Maxime Ripard,
	Thomas Zimmermann, Lyude Paul, Danilo Krummrich,
	David Hildenbrand, Oscar Salvador, Andrew Morton,
	Leon Romanovsky, Lorenzo Stoakes, Liam R . Howlett,
	Mike Rapoport, Suren Baghdasaryan, Michal Hocko, linuxppc-dev,
	kvm, linux-kernel, amd-gfx, nouveau, linux-mm, linux-cxl

On Thu, Jan 22, 2026 at 09:41:03PM -0500, Zi Yan wrote:
> > Now that we have frozen pages where the frozen owner can use some of
> > the struct page memory however it likes that memory needs to be reset
> > before the page is thawed and converted back to a folio.
> 
> Based on my understanding, a frozen folio cannot be changed however the
> owner wants, since the modification needs to prevent parallel scanners
> from misusing the folio. For example, PFN scanners like memory compaction
> need to know this is a frozen folio with a certain order, so that they
> will skip it as a whole. But if you change the frozen folio in a way
> that a parallel scanner cannot recognize the right order (e.g., the frozen
> folio order becomes lower) and it finds some of the subpages have a non-zero
> refcount, that can cause issues.

Yes, and this is part of the rules for what bits of the struct page
memory you can use for frozen pages. I've never seen them clearly
written down, unfortunately.

> But I assume device private pages do not have such a parallel scanner
> looking at each struct page one by one and examining their state.

I hope not!
 
> BTW, it seems that you treat frozen folios and free folios interchangeably
> in this device private folio discussion. To me, they are different,
> since a frozen folio is transient, to prevent others from touching the folio,

Yes, but really it is the same issue. Once the folio is frozen, either
because it is free or for any other use case, it must follow a certain set
of rules to be compatible with the parallel scanners and other things that
may still inspect the page without taking any refcounts.

The scanner can't tell if the refcount is 0 because it is frozen or
because it is free.

> >>>>>> I don't think so. It should do the above job efficiently and iterate
> >>>>>> over the page list exactly once.
> >>>>
> >>>> folio initialization should not iterate over any page list, since a folio is
> >>>> supposed to be treated as a whole instead of as individual pages.
> >>>
> >>> The tail pages need to have the right data in them or compound_head
> >>> won't work.
> >>
> >> That is done by set_compound_head() in prep_compound_tail().
> >
> > Inside a page loop :)
> >
> > 	__SetPageHead(page);
> > 	for (i = 1; i < nr_pages; i++)
> > 		prep_compound_tail(page, i);
> 
> Yes, but to a folio, the fields of tail pages 1 and 2 are used because
> we do not want to inflate struct folio for high order folios. In this
> loop, all tail pages are processed in the same way. To follow your method,
> there would be some ifs for tail page 1 to clear _nr_pages and for tail
> page 2 to clear other fields. It feels to me that we are clearly mixing
> struct page and struct folio.

I'm not saying we should mix them; I'm just pointing out that we have to
clean up the tail pages by writing to every one. This can also adjust the
flags and order on tail pages if required. Initing the folio is a separate
step, and you already showed the right way to code that.

Jason


^ permalink raw reply	[flat|nested] 44+ messages in thread

end of thread, other threads:[~2026-01-23 14:20 UTC | newest]

Thread overview: 44+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2026-01-16 11:10 [PATCH v6 0/5] Enable THP support in drm_pagemap Francois Dugast
2026-01-16 11:10 ` [PATCH v6 1/5] mm/zone_device: Reinitialize large zone device private folios Francois Dugast
2026-01-16 13:10   ` Balbir Singh
2026-01-16 16:07   ` Vlastimil Babka
2026-01-16 17:20     ` Jason Gunthorpe
2026-01-16 17:27       ` Vlastimil Babka
2026-01-22  8:02     ` Vlastimil Babka
2026-01-16 17:49   ` Jason Gunthorpe
2026-01-16 19:17     ` Vlastimil Babka
2026-01-16 20:31       ` Matthew Brost
2026-01-17  0:51         ` Jason Gunthorpe
2026-01-17  3:55           ` Matthew Brost
2026-01-17  4:42             ` Balbir Singh
2026-01-17  5:27               ` Matthew Brost
2026-01-19  5:59                 ` Alistair Popple
2026-01-19 14:20                   ` Jason Gunthorpe
2026-01-19 20:09                     ` Zi Yan
2026-01-19 20:35                       ` Jason Gunthorpe
2026-01-19 22:15                         ` Balbir Singh
2026-01-20  2:50                           ` Zi Yan
2026-01-20 13:53                             ` Jason Gunthorpe
2026-01-21  3:01                               ` Zi Yan
2026-01-22  7:19                                 ` Matthew Brost
2026-01-22  8:00                                   ` Vlastimil Babka
2026-01-22  9:10                                     ` Balbir Singh
2026-01-22 21:41                                       ` Andrew Morton
2026-01-22 22:53                                         ` Alistair Popple
2026-01-23  6:45                                         ` Vlastimil Babka
2026-01-22 14:29                                   ` Jason Gunthorpe
2026-01-22 15:46                                 ` Jason Gunthorpe
2026-01-23  2:41                                   ` Zi Yan
2026-01-23 14:19                                     ` Jason Gunthorpe
2026-01-21  3:51                             ` Balbir Singh
2026-01-17  0:19       ` Jason Gunthorpe
2026-01-19  5:41         ` Alistair Popple
2026-01-19 14:24           ` Jason Gunthorpe
2026-01-16 22:34   ` Andrew Morton
2026-01-16 22:36     ` Matthew Brost
2026-01-16 11:10 ` [PATCH v6 2/5] drm/pagemap: Unlock and put folios when possible Francois Dugast
2026-01-16 11:10 ` [PATCH v6 3/5] drm/pagemap: Add helper to access zone_device_data Francois Dugast
2026-01-16 11:10 ` [PATCH v6 4/5] drm/pagemap: Correct cpages calculation for migrate_vma_setup Francois Dugast
2026-01-16 11:37   ` Balbir Singh
2026-01-16 12:02     ` Francois Dugast
2026-01-16 11:10 ` [PATCH v6 5/5] drm/pagemap: Enable THP support for GPU memory migration Francois Dugast

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox