* [PATCH v4 1/7] mm/zone_device: Add order argument to folio_free callback
2026-01-11 20:55 [PATCH v4 0/7] Enable THP support in drm_pagemap Francois Dugast
@ 2026-01-11 20:55 ` Francois Dugast
2026-01-11 22:35 ` Matthew Wilcox
2026-01-11 20:55 ` [PATCH v4 2/7] mm/zone_device: Add free_zone_device_folio_prepare() helper Francois Dugast
` (5 subsequent siblings)
6 siblings, 1 reply; 32+ messages in thread
From: Francois Dugast @ 2026-01-11 20:55 UTC (permalink / raw)
To: intel-xe
Cc: dri-devel, Matthew Brost, Zi Yan, Madhavan Srinivasan,
Nicholas Piggin, Michael Ellerman, Christophe Leroy (CS GROUP),
Felix Kuehling, Alex Deucher, Christian König, David Airlie,
Simona Vetter, Maarten Lankhorst, Maxime Ripard,
Thomas Zimmermann, Lyude Paul, Danilo Krummrich, Bjorn Helgaas,
Logan Gunthorpe, David Hildenbrand, Oscar Salvador,
Andrew Morton, Jason Gunthorpe, Leon Romanovsky, Balbir Singh,
Lorenzo Stoakes, Liam R . Howlett, Vlastimil Babka,
Mike Rapoport, Suren Baghdasaryan, Michal Hocko, Alistair Popple,
linuxppc-dev, kvm, linux-kernel, amd-gfx, nouveau, linux-pci,
linux-mm, linux-cxl, Francois Dugast
From: Matthew Brost <matthew.brost@intel.com>
The core MM splits the folio before calling folio_free, restoring the
zone pages associated with the folio to an initialized state (e.g.,
non-compound, pgmap valid, etc.). The order argument represents the
folio’s order prior to the split, which can be used on the driver side
to know how many pages are being freed.
Fixes: 3a5a06554566 ("mm/zone_device: rename page_free callback to folio_free")
Cc: Zi Yan <ziy@nvidia.com>
Cc: Madhavan Srinivasan <maddy@linux.ibm.com>
Cc: Nicholas Piggin <npiggin@gmail.com>
Cc: Michael Ellerman <mpe@ellerman.id.au>
Cc: "Christophe Leroy (CS GROUP)" <chleroy@kernel.org>
Cc: Felix Kuehling <Felix.Kuehling@amd.com>
Cc: Alex Deucher <alexander.deucher@amd.com>
Cc: "Christian König" <christian.koenig@amd.com>
Cc: David Airlie <airlied@gmail.com>
Cc: Simona Vetter <simona@ffwll.ch>
Cc: Maarten Lankhorst <maarten.lankhorst@linux.intel.com>
Cc: Maxime Ripard <mripard@kernel.org>
Cc: Thomas Zimmermann <tzimmermann@suse.de>
Cc: Lyude Paul <lyude@redhat.com>
Cc: Danilo Krummrich <dakr@kernel.org>
Cc: Bjorn Helgaas <bhelgaas@google.com>
Cc: Logan Gunthorpe <logang@deltatee.com>
Cc: David Hildenbrand <david@kernel.org>
Cc: Oscar Salvador <osalvador@suse.de>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Jason Gunthorpe <jgg@ziepe.ca>
Cc: Leon Romanovsky <leon@kernel.org>
Cc: Balbir Singh <balbirs@nvidia.com>
Cc: Lorenzo Stoakes <lorenzo.stoakes@oracle.com>
Cc: Liam R. Howlett <Liam.Howlett@oracle.com>
Cc: Vlastimil Babka <vbabka@suse.cz>
Cc: Mike Rapoport <rppt@kernel.org>
Cc: Suren Baghdasaryan <surenb@google.com>
Cc: Michal Hocko <mhocko@suse.com>
Cc: Alistair Popple <apopple@nvidia.com>
Cc: linuxppc-dev@lists.ozlabs.org
Cc: kvm@vger.kernel.org
Cc: linux-kernel@vger.kernel.org
Cc: amd-gfx@lists.freedesktop.org
Cc: dri-devel@lists.freedesktop.org
Cc: nouveau@lists.freedesktop.org
Cc: linux-pci@vger.kernel.org
Cc: linux-mm@kvack.org
Cc: linux-cxl@vger.kernel.org
Signed-off-by: Matthew Brost <matthew.brost@intel.com>
Signed-off-by: Francois Dugast <francois.dugast@intel.com>
---
arch/powerpc/kvm/book3s_hv_uvmem.c | 2 +-
drivers/gpu/drm/amd/amdkfd/kfd_migrate.c | 2 +-
drivers/gpu/drm/drm_pagemap.c | 3 ++-
drivers/gpu/drm/nouveau/nouveau_dmem.c | 4 ++--
drivers/pci/p2pdma.c | 2 +-
include/linux/memremap.h | 7 ++++++-
lib/test_hmm.c | 4 +---
mm/memremap.c | 5 +++--
8 files changed, 17 insertions(+), 12 deletions(-)
diff --git a/arch/powerpc/kvm/book3s_hv_uvmem.c b/arch/powerpc/kvm/book3s_hv_uvmem.c
index e5000bef90f2..b58f34eec6e5 100644
--- a/arch/powerpc/kvm/book3s_hv_uvmem.c
+++ b/arch/powerpc/kvm/book3s_hv_uvmem.c
@@ -1014,7 +1014,7 @@ static vm_fault_t kvmppc_uvmem_migrate_to_ram(struct vm_fault *vmf)
* to a normal PFN during H_SVM_PAGE_OUT.
* Gets called with kvm->arch.uvmem_lock held.
*/
-static void kvmppc_uvmem_folio_free(struct folio *folio)
+static void kvmppc_uvmem_folio_free(struct folio *folio, unsigned int order)
{
struct page *page = &folio->page;
unsigned long pfn = page_to_pfn(page) -
diff --git a/drivers/gpu/drm/amd/amdkfd/kfd_migrate.c b/drivers/gpu/drm/amd/amdkfd/kfd_migrate.c
index af53e796ea1b..a26e3c448e47 100644
--- a/drivers/gpu/drm/amd/amdkfd/kfd_migrate.c
+++ b/drivers/gpu/drm/amd/amdkfd/kfd_migrate.c
@@ -567,7 +567,7 @@ svm_migrate_ram_to_vram(struct svm_range *prange, uint32_t best_loc,
return r < 0 ? r : 0;
}
-static void svm_migrate_folio_free(struct folio *folio)
+static void svm_migrate_folio_free(struct folio *folio, unsigned int order)
{
struct page *page = &folio->page;
struct svm_range_bo *svm_bo = page->zone_device_data;
diff --git a/drivers/gpu/drm/drm_pagemap.c b/drivers/gpu/drm/drm_pagemap.c
index 03ee39a761a4..df253b13cf85 100644
--- a/drivers/gpu/drm/drm_pagemap.c
+++ b/drivers/gpu/drm/drm_pagemap.c
@@ -1144,11 +1144,12 @@ static int __drm_pagemap_migrate_to_ram(struct vm_area_struct *vas,
/**
* drm_pagemap_folio_free() - Put GPU SVM zone device data associated with a folio
* @folio: Pointer to the folio
+ * @order: Order of the folio prior to being split by core MM
*
* This function is a callback used to put the GPU SVM zone device data
* associated with a page when it is being released.
*/
-static void drm_pagemap_folio_free(struct folio *folio)
+static void drm_pagemap_folio_free(struct folio *folio, unsigned int order)
{
drm_pagemap_zdd_put(folio->page.zone_device_data);
}
diff --git a/drivers/gpu/drm/nouveau/nouveau_dmem.c b/drivers/gpu/drm/nouveau/nouveau_dmem.c
index 58071652679d..545f316fca14 100644
--- a/drivers/gpu/drm/nouveau/nouveau_dmem.c
+++ b/drivers/gpu/drm/nouveau/nouveau_dmem.c
@@ -115,14 +115,14 @@ unsigned long nouveau_dmem_page_addr(struct page *page)
return chunk->bo->offset + off;
}
-static void nouveau_dmem_folio_free(struct folio *folio)
+static void nouveau_dmem_folio_free(struct folio *folio, unsigned int order)
{
struct page *page = &folio->page;
struct nouveau_dmem_chunk *chunk = nouveau_page_to_chunk(page);
struct nouveau_dmem *dmem = chunk->drm->dmem;
spin_lock(&dmem->lock);
- if (folio_order(folio)) {
+ if (order) {
page->zone_device_data = dmem->free_folios;
dmem->free_folios = folio;
} else {
diff --git a/drivers/pci/p2pdma.c b/drivers/pci/p2pdma.c
index 4a2fc7ab42c3..a6fa7610f8a8 100644
--- a/drivers/pci/p2pdma.c
+++ b/drivers/pci/p2pdma.c
@@ -200,7 +200,7 @@ static const struct attribute_group p2pmem_group = {
.name = "p2pmem",
};
-static void p2pdma_folio_free(struct folio *folio)
+static void p2pdma_folio_free(struct folio *folio, unsigned int order)
{
struct page *page = &folio->page;
struct pci_p2pdma_pagemap *pgmap = to_p2p_pgmap(page_pgmap(page));
diff --git a/include/linux/memremap.h b/include/linux/memremap.h
index 713ec0435b48..97fcffeb1c1e 100644
--- a/include/linux/memremap.h
+++ b/include/linux/memremap.h
@@ -79,8 +79,13 @@ struct dev_pagemap_ops {
* Called once the folio refcount reaches 0. The reference count will be
* reset to one by the core code after the method is called to prepare
* for handing out the folio again.
+ *
+ * The core MM splits the folio before calling folio_free, restoring the
+ * zone pages associated with the folio to an initialized state (e.g.,
+ * non-compound, pgmap valid, etc...). The order argument represents the
+ * folio’s order prior to the split.
*/
- void (*folio_free)(struct folio *folio);
+ void (*folio_free)(struct folio *folio, unsigned int order);
/*
* Used for private (un-addressable) device memory only. Must migrate
diff --git a/lib/test_hmm.c b/lib/test_hmm.c
index 8af169d3873a..e17c71d02a3a 100644
--- a/lib/test_hmm.c
+++ b/lib/test_hmm.c
@@ -1580,13 +1580,11 @@ static const struct file_operations dmirror_fops = {
.owner = THIS_MODULE,
};
-static void dmirror_devmem_free(struct folio *folio)
+static void dmirror_devmem_free(struct folio *folio, unsigned int order)
{
struct page *page = &folio->page;
struct page *rpage = BACKING_PAGE(page);
struct dmirror_device *mdevice;
- struct folio *rfolio = page_folio(rpage);
- unsigned int order = folio_order(rfolio);
if (rpage != page) {
if (order)
diff --git a/mm/memremap.c b/mm/memremap.c
index 63c6ab4fdf08..39dc4bd190d0 100644
--- a/mm/memremap.c
+++ b/mm/memremap.c
@@ -417,6 +417,7 @@ void free_zone_device_folio(struct folio *folio)
{
struct dev_pagemap *pgmap = folio->pgmap;
unsigned long nr = folio_nr_pages(folio);
+ unsigned int order = folio_order(folio);
int i;
if (WARN_ON_ONCE(!pgmap))
@@ -453,7 +454,7 @@ void free_zone_device_folio(struct folio *folio)
case MEMORY_DEVICE_COHERENT:
if (WARN_ON_ONCE(!pgmap->ops || !pgmap->ops->folio_free))
break;
- pgmap->ops->folio_free(folio);
+ pgmap->ops->folio_free(folio, order);
percpu_ref_put_many(&folio->pgmap->ref, nr);
break;
@@ -472,7 +473,7 @@ void free_zone_device_folio(struct folio *folio)
case MEMORY_DEVICE_PCI_P2PDMA:
if (WARN_ON_ONCE(!pgmap->ops || !pgmap->ops->folio_free))
break;
- pgmap->ops->folio_free(folio);
+ pgmap->ops->folio_free(folio, order);
break;
}
}
--
2.43.0
^ permalink raw reply [flat|nested] 32+ messages in thread
* Re: [PATCH v4 1/7] mm/zone_device: Add order argument to folio_free callback
2026-01-11 20:55 ` [PATCH v4 1/7] mm/zone_device: Add order argument to folio_free callback Francois Dugast
@ 2026-01-11 22:35 ` Matthew Wilcox
2026-01-12 0:19 ` Balbir Singh
0 siblings, 1 reply; 32+ messages in thread
From: Matthew Wilcox @ 2026-01-11 22:35 UTC (permalink / raw)
To: Francois Dugast
Cc: intel-xe, dri-devel, Matthew Brost, Zi Yan, Madhavan Srinivasan,
Nicholas Piggin, Michael Ellerman, Christophe Leroy (CS GROUP),
Felix Kuehling, Alex Deucher, Christian König, David Airlie,
Simona Vetter, Maarten Lankhorst, Maxime Ripard,
Thomas Zimmermann, Lyude Paul, Danilo Krummrich, Bjorn Helgaas,
Logan Gunthorpe, David Hildenbrand, Oscar Salvador,
Andrew Morton, Jason Gunthorpe, Leon Romanovsky, Balbir Singh,
Lorenzo Stoakes, Liam R . Howlett, Vlastimil Babka,
Mike Rapoport, Suren Baghdasaryan, Michal Hocko, Alistair Popple,
linuxppc-dev, kvm, linux-kernel, amd-gfx, nouveau, linux-pci,
linux-mm, linux-cxl
On Sun, Jan 11, 2026 at 09:55:40PM +0100, Francois Dugast wrote:
> The core MM splits the folio before calling folio_free, restoring the
> zone pages associated with the folio to an initialized state (e.g.,
> non-compound, pgmap valid, etc...). The order argument represents the
> folio’s order prior to the split which can be used driver side to know
> how many pages are being freed.
This really feels like the wrong way to fix this problem.
I think someone from the graphics side really needs to take the lead on
understanding what the MM is doing (both currently and in the future).
I'm happy to work with you, but it feels like there's a lot of churn right
now because there's a lot of people working on this without understanding
the MM side of things (and conversely, I don't think (m)any people on the
MM side really understand what graphics cards are trying to accomplish).
Who is that going to be? I'm happy to get on the phone with someone.
^ permalink raw reply [flat|nested] 32+ messages in thread
* Re: [PATCH v4 1/7] mm/zone_device: Add order argument to folio_free callback
2026-01-11 22:35 ` Matthew Wilcox
@ 2026-01-12 0:19 ` Balbir Singh
2026-01-12 0:51 ` Zi Yan
0 siblings, 1 reply; 32+ messages in thread
From: Balbir Singh @ 2026-01-12 0:19 UTC (permalink / raw)
To: Matthew Wilcox, Francois Dugast
Cc: intel-xe, dri-devel, Matthew Brost, Zi Yan, Madhavan Srinivasan,
Nicholas Piggin, Michael Ellerman, Christophe Leroy (CS GROUP),
Felix Kuehling, Alex Deucher, Christian König, David Airlie,
Simona Vetter, Maarten Lankhorst, Maxime Ripard,
Thomas Zimmermann, Lyude Paul, Danilo Krummrich, Bjorn Helgaas,
Logan Gunthorpe, David Hildenbrand, Oscar Salvador,
Andrew Morton, Jason Gunthorpe, Leon Romanovsky, Lorenzo Stoakes,
Liam R . Howlett, Vlastimil Babka, Mike Rapoport,
Suren Baghdasaryan, Michal Hocko, Alistair Popple, linuxppc-dev,
kvm, linux-kernel, amd-gfx, nouveau, linux-pci, linux-mm,
linux-cxl
On 1/12/26 08:35, Matthew Wilcox wrote:
> On Sun, Jan 11, 2026 at 09:55:40PM +0100, Francois Dugast wrote:
>> The core MM splits the folio before calling folio_free, restoring the
>> zone pages associated with the folio to an initialized state (e.g.,
>> non-compound, pgmap valid, etc...). The order argument represents the
>> folio’s order prior to the split which can be used driver side to know
>> how many pages are being freed.
>
> This really feels like the wrong way to fix this problem.
>
This stems from a special requirement: freeing is done in two phases
1. Free the folio -> inform the driver (which implies freeing the backing device memory)
2. Return the folio back, split it back to single order folios
The current code does not do 2. Doing 1 followed by 2 does not work for
Francois since the backing memory can get reused before we reach step 2.
The proposed patch does 2 followed by 1, but doing 2 first means we've lost the
folio order and thus the old order is passed in. Although, I wonder if the
backing folio's zone_device_data can be used to encode any order information
about the device side allocation.
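To make the ordering concrete, in free_zone_device_folio() terms the two options
look roughly like this (free_zone_device_folio_prepare() is the helper added in
patch 2/7; refcount handling and error paths omitted):

	/* (a) 1 then 2: the driver can hand the memory out again before
	 * the compound state has been cleared
	 */
	pgmap->ops->folio_free(folio);
	free_zone_device_folio_prepare(folio);		/* too late, can race with reuse */

	/* (b) 2 then 1, as proposed: safe, but folio_order() now reads 0,
	 * so the old order has to be passed explicitly
	 */
	order = folio_order(folio);
	free_zone_device_folio_prepare(folio);
	pgmap->ops->folio_free(folio, order);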
@Francois, I hope I did not miss anything in the explanation above.
> I think someone from the graphics side really needs to take the lead on
> understanding what the MM is doing (both currently and in the future).
> I'm happy to work with you, but it feels like there's a lot of churn right
> now because there's a lot of people working on this without understanding
> the MM side of things (and conversely, I don't think (m)any people on the
> MM side really understand what graphics cards are trying to accomplish).
>
I suspect you are referring to folio specialization and/or downsizing?
> Who is that going to be? I'm happy to get on the phone with someone.
Happy to work with you, but I am not the authority on graphics, I can speak
to zone device folios. I suspect we'd need to speak to more than one person.
Balbir
^ permalink raw reply [flat|nested] 32+ messages in thread
* Re: [PATCH v4 1/7] mm/zone_device: Add order argument to folio_free callback
2026-01-12 0:19 ` Balbir Singh
@ 2026-01-12 0:51 ` Zi Yan
2026-01-12 1:37 ` Matthew Brost
` (2 more replies)
0 siblings, 3 replies; 32+ messages in thread
From: Zi Yan @ 2026-01-12 0:51 UTC (permalink / raw)
To: Matthew Wilcox, Balbir Singh
Cc: Francois Dugast, intel-xe, dri-devel, Matthew Brost,
Madhavan Srinivasan, Nicholas Piggin, Michael Ellerman,
Christophe Leroy (CS GROUP),
Felix Kuehling, Alex Deucher, Christian König, David Airlie,
Simona Vetter, Maarten Lankhorst, Maxime Ripard,
Thomas Zimmermann, Lyude Paul, Danilo Krummrich, Bjorn Helgaas,
Logan Gunthorpe, David Hildenbrand, Oscar Salvador,
Andrew Morton, Jason Gunthorpe, Leon Romanovsky, Lorenzo Stoakes,
Liam R . Howlett, Vlastimil Babka, Mike Rapoport,
Suren Baghdasaryan, Michal Hocko, Alistair Popple, linuxppc-dev,
kvm, linux-kernel, amd-gfx, nouveau, linux-pci, linux-mm,
linux-cxl
On 11 Jan 2026, at 19:19, Balbir Singh wrote:
> On 1/12/26 08:35, Matthew Wilcox wrote:
>> On Sun, Jan 11, 2026 at 09:55:40PM +0100, Francois Dugast wrote:
>>> The core MM splits the folio before calling folio_free, restoring the
>>> zone pages associated with the folio to an initialized state (e.g.,
>>> non-compound, pgmap valid, etc...). The order argument represents the
>>> folio’s order prior to the split which can be used driver side to know
>>> how many pages are being freed.
>>
>> This really feels like the wrong way to fix this problem.
>>
Hi Matthew,
I think the wording is confusing, since the actual issue is that:
1. zone_device_page_init() calls prep_compound_page() to form a large folio,
2. but free_zone_device_folio() never reverses the course,
3. the undo of prep_compound_page() in free_zone_device_folio() needs to
be done before the driver callback ->folio_free(), since once ->folio_free()
is called, the folio can be reallocated immediately,
4. after the undo of prep_compound_page(), folio_order() can no longer provide
the original order information, thus, folio_free() needs that for proper
device side ref manipulation.
So this is not used for a "split" but for an undo of prep_compound_page(). It might
look like a split to non core MM people, since it changes a large folio
to a bunch of base pages. BTW, core MM has no compound_page_dtor() but
open codes it in free_pages_prepare() by resetting page flags, page->mapping,
and so on. So that might be why the undo of prep_compound_page() is missed
by non core MM people.
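To make the "undo of prep_compound_page()" part concrete, the reset in question
is roughly the following (illustrative only; the helper name is made up and the
exact set of fields to clear depends on the kernel version):

	static void undo_prep_compound_page(struct page *head, unsigned int order)
	{
		unsigned long i;

		for (i = 1; i < (1UL << order); i++)
			clear_compound_head(head + i);	/* tail pages stop pointing at head */
		ClearPageHead(head);			/* head becomes a plain base page */
		/* the folio order, ->mapping and friends also need resetting,
		 * much like free_pages_prepare() open codes for normal pages */
	}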
>
> This stems from a special requirement, freeing is done in two phases
>
> 1. Free the folio -> inform the driver (which implies freeing the backing device memory)
> 2. Return the folio back, split it back to single order folios
Hi Balbir,
Please refrain from using "split" here, since it confuses MM people. A folio
is split when it is still in use, but in this case, the folio has been freed
and needs to be restored to "free page" state.
>
> The current code does not do 2. 1 followed by 2 does not work for
> Francois since the backing memory can get reused before we reach step 2.
> The proposed patch does 2 followed 1, but doing 2 means we've lost the
> folio order and thus the old order is passed in. Although, I wonder if the
> backing folio's zone_device_data can be used to encode any order information
> about the device side allocation.
>
> @Francois, I hope I did not miss anything in the explanation above.
>
>> I think someone from the graphics side really needs to take the lead on
>> understanding what the MM is doing (both currently and in the future).
>> I'm happy to work with you, but it feels like there's a lot of churn right
>> now because there's a lot of people working on this without understanding
>> the MM side of things (and conversely, I don't think (m)any people on the
>> MM side really understand what graphics cards are trying to accomplish).
>>
>
> I suspect you are referring to folio specialization and/or downsizing?
>
>> Who is that going to be? I'm happy to get on the phone with someone.
>
> Happy to work with you, but I am not the authority on graphics, I can speak
> to zone device folios. I suspect we'd need to speak to more than one person.
>
--
Best Regards,
Yan, Zi
^ permalink raw reply [flat|nested] 32+ messages in thread
* Re: [PATCH v4 1/7] mm/zone_device: Add order argument to folio_free callback
2026-01-12 0:51 ` Zi Yan
@ 2026-01-12 1:37 ` Matthew Brost
2026-01-12 4:50 ` Balbir Singh
2026-01-12 13:45 ` Jason Gunthorpe
2 siblings, 0 replies; 32+ messages in thread
From: Matthew Brost @ 2026-01-12 1:37 UTC (permalink / raw)
To: Zi Yan
Cc: Matthew Wilcox, Balbir Singh, Francois Dugast, intel-xe,
dri-devel, Madhavan Srinivasan, Nicholas Piggin,
Michael Ellerman, Christophe Leroy (CS GROUP),
Felix Kuehling, Alex Deucher, Christian König, David Airlie,
Simona Vetter, Maarten Lankhorst, Maxime Ripard,
Thomas Zimmermann, Lyude Paul, Danilo Krummrich, Bjorn Helgaas,
Logan Gunthorpe, David Hildenbrand, Oscar Salvador,
Andrew Morton, Jason Gunthorpe, Leon Romanovsky, Lorenzo Stoakes,
Liam R . Howlett, Vlastimil Babka, Mike Rapoport,
Suren Baghdasaryan, Michal Hocko, Alistair Popple, linuxppc-dev,
kvm, linux-kernel, amd-gfx, nouveau, linux-pci, linux-mm,
linux-cxl
On Sun, Jan 11, 2026 at 07:51:01PM -0500, Zi Yan wrote:
> On 11 Jan 2026, at 19:19, Balbir Singh wrote:
>
> > On 1/12/26 08:35, Matthew Wilcox wrote:
> >> On Sun, Jan 11, 2026 at 09:55:40PM +0100, Francois Dugast wrote:
> >>> The core MM splits the folio before calling folio_free, restoring the
> >>> zone pages associated with the folio to an initialized state (e.g.,
> >>> non-compound, pgmap valid, etc...). The order argument represents the
> >>> folio’s order prior to the split which can be used driver side to know
> >>> how many pages are being freed.
> >>
> >> This really feels like the wrong way to fix this problem.
> >>
>
> Hi Matthew,
>
> I think the wording is confusing, since the actual issue is that:
>
> 1. zone_device_page_init() calls prep_compound_page() to form a large folio,
> 2. but free_zone_device_folio() never reverse the course,
> 3. the undo of prep_compound_page() in free_zone_device_folio() needs to
> be done before driver callback ->folio_free(), since once ->folio_free()
> is called, the folio can be reallocated immediately,
> 4. after the undo of prep_compound_page(), folio_order() can no longer provide
> the original order information, thus, folio_free() needs that for proper
> device side ref manipulation.
>
> So this is not used for "split" but undo of prep_compound_page(). It might
> look like a split to non core MM people, since it changes a large folio
> to a bunch of base pages. BTW, core MM has no compound_page_dctor() but
> open codes it in free_pages_prepare() by resetting page flags, page->mapping,
> and so on. So it might be why the undo prep_compound_page() is missed
> by non core MM people.
>
Let me try to reword this while avoiding the term “split” and properly
explaining the problem.
> >
> > This stems from a special requirement, freeing is done in two phases
> >
> > 1. Free the folio -> inform the driver (which implies freeing the backing device memory)
> > 2. Return the folio back, split it back to single order folios
>
> Hi Balbir,
>
> Please refrain from using "split" here, since it confuses MM people. A folio
> is split when it is still in use, but in this case, the folio has been freed
> and needs to be restored to "free page" state.
>
Yeah, “split” is a bad term. We are reinitializing all zone pages in a
folio upon free.
> >
> > The current code does not do 2. 1 followed by 2 does not work for
> > Francois since the backing memory can get reused before we reach step 2.
> > The proposed patch does 2 followed 1, but doing 2 means we've lost the
> > folio order and thus the old order is passed in. Although, I wonder if the
> > backing folio's zone_device_data can be used to encode any order information
> > about the device side allocation.
> >
> > @Francois, I hope I did not miss anything in the explanation above.
Yes, correct. The pages in the folio must be reinitialized before
calling into the driver to free them, because once that happens, the
pages can be immediately reallocated.
> >
> >> I think someone from the graphics side really needs to take the lead on
> >> understanding what the MM is doing (both currently and in the future).
> >> I'm happy to work with you, but it feels like there's a lot of churn right
> >> now because there's a lot of people working on this without understanding
> >> the MM side of things (and conversely, I don't think (m)any people on the
> >> MM side really understand what graphics cards are trying to accomplish).
I can’t disagree with anything you’re saying. The core MM is about as
complex as it gets, and my understanding of what’s going on isn’t
great—it’s basically just reverse engineering until I reach a point
where I can fix a problem, think it’s correct, and hope I don’t get
shredded.
Graphics/DRM is also quite complex, but that’s where I work...
> >>
> >
> > I suspect you are referring to folio specialization and/or downsizing?
> >
> >> Who is that going to be? I'm happy to get on the phone with someone.
> >
> > Happy to work with you, but I am not the authority on graphics, I can speak
> > to zone device folios. I suspect we'd need to speak to more than one person.
> >
Also happy to work with you, but I agree with Zi—graphics isn’t
something one company can speak as an authority on, much less one
person.
Matt
>
> --
> Best Regards,
> Yan, Zi
^ permalink raw reply [flat|nested] 32+ messages in thread
* Re: [PATCH v4 1/7] mm/zone_device: Add order argument to folio_free callback
2026-01-12 0:51 ` Zi Yan
2026-01-12 1:37 ` Matthew Brost
@ 2026-01-12 4:50 ` Balbir Singh
2026-01-12 13:45 ` Jason Gunthorpe
2 siblings, 0 replies; 32+ messages in thread
From: Balbir Singh @ 2026-01-12 4:50 UTC (permalink / raw)
To: Zi Yan, Matthew Wilcox
Cc: Francois Dugast, intel-xe, dri-devel, Matthew Brost,
Madhavan Srinivasan, Nicholas Piggin, Michael Ellerman,
Christophe Leroy (CS GROUP),
Felix Kuehling, Alex Deucher, Christian König, David Airlie,
Simona Vetter, Maarten Lankhorst, Maxime Ripard,
Thomas Zimmermann, Lyude Paul, Danilo Krummrich, Bjorn Helgaas,
Logan Gunthorpe, David Hildenbrand, Oscar Salvador,
Andrew Morton, Jason Gunthorpe, Leon Romanovsky, Lorenzo Stoakes,
Liam R . Howlett, Vlastimil Babka, Mike Rapoport,
Suren Baghdasaryan, Michal Hocko, Alistair Popple, linuxppc-dev,
kvm, linux-kernel, amd-gfx, nouveau, linux-pci, linux-mm,
linux-cxl
On 1/12/26 10:51, Zi Yan wrote:
> On 11 Jan 2026, at 19:19, Balbir Singh wrote:
>
>> On 1/12/26 08:35, Matthew Wilcox wrote:
>>> On Sun, Jan 11, 2026 at 09:55:40PM +0100, Francois Dugast wrote:
>>>> The core MM splits the folio before calling folio_free, restoring the
>>>> zone pages associated with the folio to an initialized state (e.g.,
>>>> non-compound, pgmap valid, etc...). The order argument represents the
>>>> folio’s order prior to the split which can be used driver side to know
>>>> how many pages are being freed.
>>>
>>> This really feels like the wrong way to fix this problem.
>>>
>
> Hi Matthew,
>
> I think the wording is confusing, since the actual issue is that:
>
> 1. zone_device_page_init() calls prep_compound_page() to form a large folio,
> 2. but free_zone_device_folio() never reverse the course,
> 3. the undo of prep_compound_page() in free_zone_device_folio() needs to
> be done before driver callback ->folio_free(), since once ->folio_free()
> is called, the folio can be reallocated immediately,
> 4. after the undo of prep_compound_page(), folio_order() can no longer provide
> the original order information, thus, folio_free() needs that for proper
> device side ref manipulation.
>
> So this is not used for "split" but undo of prep_compound_page(). It might
> look like a split to non core MM people, since it changes a large folio
> to a bunch of base pages. BTW, core MM has no compound_page_dctor() but
> open codes it in free_pages_prepare() by resetting page flags, page->mapping,
> and so on. So it might be why the undo prep_compound_page() is missed
> by non core MM people.
>
>>
>> This stems from a special requirement, freeing is done in two phases
>>
>> 1. Free the folio -> inform the driver (which implies freeing the backing device memory)
>> 2. Return the folio back, split it back to single order folios
>
> Hi Balbir,
>
> Please refrain from using "split" here, since it confuses MM people. A folio
> is split when it is still in use, but in this case, the folio has been freed
> and needs to be restored to "free page" state.
>
Yeah, the word split came from the initial version that called it folio_split_unref()
and I was also thinking of the split callback for zone device folios, but I agree
(re)initialization is a better term.
>>
>> The current code does not do 2. 1 followed by 2 does not work for
>> Francois since the backing memory can get reused before we reach step 2.
>> The proposed patch does 2 followed 1, but doing 2 means we've lost the
>> folio order and thus the old order is passed in. Although, I wonder if the
>> backing folio's zone_device_data can be used to encode any order information
>> about the device side allocation.
>>
>> @Francois, I hope I did not miss anything in the explanation above.
>>
>>> I think someone from the graphics side really needs to take the lead on
>>> understanding what the MM is doing (both currently and in the future).
>>> I'm happy to work with you, but it feels like there's a lot of churn right
>>> now because there's a lot of people working on this without understanding
>>> the MM side of things (and conversely, I don't think (m)any people on the
>>> MM side really understand what graphics cards are trying to accomplish).
>>>
>>
>> I suspect you are referring to folio specialization and/or downsizing?
>>
>>> Who is that going to be? I'm happy to get on the phone with someone.
>>
>> Happy to work with you, but I am not the authority on graphics, I can speak
>> to zone device folios. I suspect we'd need to speak to more than one person.
>>
>
> --
> Best Regards,
> Yan, Zi
^ permalink raw reply [flat|nested] 32+ messages in thread
* Re: [PATCH v4 1/7] mm/zone_device: Add order argument to folio_free callback
2026-01-12 0:51 ` Zi Yan
2026-01-12 1:37 ` Matthew Brost
2026-01-12 4:50 ` Balbir Singh
@ 2026-01-12 13:45 ` Jason Gunthorpe
2026-01-12 16:31 ` Zi Yan
2026-01-12 21:49 ` Matthew Brost
2 siblings, 2 replies; 32+ messages in thread
From: Jason Gunthorpe @ 2026-01-12 13:45 UTC (permalink / raw)
To: Zi Yan
Cc: Matthew Wilcox, Balbir Singh, Francois Dugast, intel-xe,
dri-devel, Matthew Brost, Madhavan Srinivasan, Nicholas Piggin,
Michael Ellerman, Christophe Leroy (CS GROUP),
Felix Kuehling, Alex Deucher, Christian König, David Airlie,
Simona Vetter, Maarten Lankhorst, Maxime Ripard,
Thomas Zimmermann, Lyude Paul, Danilo Krummrich, Bjorn Helgaas,
Logan Gunthorpe, David Hildenbrand, Oscar Salvador,
Andrew Morton, Leon Romanovsky, Lorenzo Stoakes,
Liam R . Howlett, Vlastimil Babka, Mike Rapoport,
Suren Baghdasaryan, Michal Hocko, Alistair Popple, linuxppc-dev,
kvm, linux-kernel, amd-gfx, nouveau, linux-pci, linux-mm,
linux-cxl
On Sun, Jan 11, 2026 at 07:51:01PM -0500, Zi Yan wrote:
> On 11 Jan 2026, at 19:19, Balbir Singh wrote:
>
> > On 1/12/26 08:35, Matthew Wilcox wrote:
> >> On Sun, Jan 11, 2026 at 09:55:40PM +0100, Francois Dugast wrote:
> >>> The core MM splits the folio before calling folio_free, restoring the
> >>> zone pages associated with the folio to an initialized state (e.g.,
> >>> non-compound, pgmap valid, etc...). The order argument represents the
> >>> folio’s order prior to the split which can be used driver side to know
> >>> how many pages are being freed.
> >>
> >> This really feels like the wrong way to fix this problem.
> >>
>
> Hi Matthew,
>
> I think the wording is confusing, since the actual issue is that:
>
> 1. zone_device_page_init() calls prep_compound_page() to form a large folio,
> 2. but free_zone_device_folio() never reverse the course,
> 3. the undo of prep_compound_page() in free_zone_device_folio() needs to
> be done before driver callback ->folio_free(), since once ->folio_free()
> is called, the folio can be reallocated immediately,
> 4. after the undo of prep_compound_page(), folio_order() can no longer provide
> the original order information, thus, folio_free() needs that for proper
> device side ref manipulation.
There is something wrong with the driver if the "folio can be
reallocated immediately".
The flow generally expects there to be a driver allocator linked to
folio_free()
1) Allocator finds free memory
2) zone_device_page_init() allocates the memory and makes refcount=1
3) __folio_put() knows the recount 0.
4) free_zone_device_folio() calls folio_free(), but it doesn't
actually need to undo prep_compound_page() because *NOTHING* can
use the page pointer at this point.
5) Driver puts the memory back into the allocator and now #1 can
happen. It knows how much memory to put back because folio->order
is valid from #2
6) #1 happens again, then #2 happens again and the folio is in the
right state for use. The successor #2 fully undoes the work of the
predecessor #2.
If you have races where #1 can happen immediately after #3 then the
driver design is fundamentally broken and passing around order isn't
going to help anything.
If the allocator is using the struct page memory then step #5 should
also clean up the struct page with the allocator data before returning
it to the allocator.
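As a sketch of that model, with my_vram_free() and struct my_allocator standing
in for whatever the driver really uses:

	static void my_folio_free(struct folio *folio)
	{
		struct my_allocator *alloc = folio->page.zone_device_data;
		unsigned long nr = 1UL << folio_order(folio);	/* metadata still valid here:
								 * refcount is 0 but nothing
								 * else can use the pointer */

		/* step #5: return the memory; only after this can #1 run again */
		my_vram_free(alloc, folio_pfn(folio), nr);
	}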
I vaguely remember talking about this before in the context of the Xe
driver.. You can't just take an existing VRAM allocator and layer it
on top of the folios and have it broadly ignore the folio_free
callback.
Jason
^ permalink raw reply [flat|nested] 32+ messages in thread
* Re: [PATCH v4 1/7] mm/zone_device: Add order argument to folio_free callback
2026-01-12 13:45 ` Jason Gunthorpe
@ 2026-01-12 16:31 ` Zi Yan
2026-01-12 16:50 ` Jason Gunthorpe
2026-01-12 21:49 ` Matthew Brost
1 sibling, 1 reply; 32+ messages in thread
From: Zi Yan @ 2026-01-12 16:31 UTC (permalink / raw)
To: Jason Gunthorpe
Cc: Matthew Wilcox, Balbir Singh, Francois Dugast, intel-xe,
dri-devel, Matthew Brost, Madhavan Srinivasan, Nicholas Piggin,
Michael Ellerman, Christophe Leroy (CS GROUP),
Felix Kuehling, Alex Deucher, Christian König, David Airlie,
Simona Vetter, Maarten Lankhorst, Maxime Ripard,
Thomas Zimmermann, Lyude Paul, Danilo Krummrich, Bjorn Helgaas,
Logan Gunthorpe, David Hildenbrand, Oscar Salvador,
Andrew Morton, Leon Romanovsky, Lorenzo Stoakes,
Liam R . Howlett, Vlastimil Babka, Mike Rapoport,
Suren Baghdasaryan, Michal Hocko, Alistair Popple, linuxppc-dev,
kvm, linux-kernel, amd-gfx, nouveau, linux-pci, linux-mm,
linux-cxl
On 12 Jan 2026, at 8:45, Jason Gunthorpe wrote:
> On Sun, Jan 11, 2026 at 07:51:01PM -0500, Zi Yan wrote:
>> On 11 Jan 2026, at 19:19, Balbir Singh wrote:
>>
>>> On 1/12/26 08:35, Matthew Wilcox wrote:
>>>> On Sun, Jan 11, 2026 at 09:55:40PM +0100, Francois Dugast wrote:
>>>>> The core MM splits the folio before calling folio_free, restoring the
>>>>> zone pages associated with the folio to an initialized state (e.g.,
>>>>> non-compound, pgmap valid, etc...). The order argument represents the
>>>>> folio’s order prior to the split which can be used driver side to know
>>>>> how many pages are being freed.
>>>>
>>>> This really feels like the wrong way to fix this problem.
>>>>
>>
>> Hi Matthew,
>>
>> I think the wording is confusing, since the actual issue is that:
>>
>> 1. zone_device_page_init() calls prep_compound_page() to form a large folio,
>> 2. but free_zone_device_folio() never reverse the course,
>> 3. the undo of prep_compound_page() in free_zone_device_folio() needs to
>> be done before driver callback ->folio_free(), since once ->folio_free()
>> is called, the folio can be reallocated immediately,
>> 4. after the undo of prep_compound_page(), folio_order() can no longer provide
>> the original order information, thus, folio_free() needs that for proper
>> device side ref manipulation.
>
> There is something wrong with the driver if the "folio can be
> reallocated immediately".
>
> The flow generally expects there to be a driver allocator linked to
> folio_free()
>
> 1) Allocator finds free memory
> 2) zone_device_page_init() allocates the memory and makes refcount=1
> 3) __folio_put() knows the recount 0.
> 4) free_zone_device_folio() calls folio_free(), but it doesn't
> actually need to undo prep_compound_page() because *NOTHING* can
> use the page pointer at this point.
> 5) Driver puts the memory back into the allocator and now #1 can
> happen. It knows how much memory to put back because folio->order
> is valid from #2
> 6) #1 happens again, then #2 happens again and the folio is in the
> right state for use. The successor #2 fully undoes the work of the
> predecessor #2.
But how can a successor #2 undo the work if the second #1 only allocates
half of the original folio? For example, an order-9 at PFN 0 is
allocated and freed, then an order-8 at PFN 0 is allocated and another
order-8 at PFN 256 is allocated. How can two #2s undo the same order-9
without corrupting each other’s data?
>
> If you have races where #1 can happen immediately after #3 then the
> driver design is fundamentally broken and passing around order isn't
> going to help anything.
>
> If the allocator is using the struct page memory then step #5 should
> also clean up the struct page with the allocator data before returning
> it to the allocator.
Do you mean ->folio_free() callback should undo prep_compound_page()
instead?
>
> I vaugely remember talking about this before in the context of the Xe
> driver.. You can't just take an existing VRAM allocator and layer it
> on top of the folios and have it broadly ignore the folio_free
> callback.
Best Regards,
Yan, Zi
^ permalink raw reply [flat|nested] 32+ messages in thread
* Re: [PATCH v4 1/7] mm/zone_device: Add order argument to folio_free callback
2026-01-12 16:31 ` Zi Yan
@ 2026-01-12 16:50 ` Jason Gunthorpe
2026-01-12 17:46 ` Zi Yan
2026-01-12 23:07 ` Matthew Brost
0 siblings, 2 replies; 32+ messages in thread
From: Jason Gunthorpe @ 2026-01-12 16:50 UTC (permalink / raw)
To: Zi Yan
Cc: Matthew Wilcox, Balbir Singh, Francois Dugast, intel-xe,
dri-devel, Matthew Brost, Madhavan Srinivasan, Nicholas Piggin,
Michael Ellerman, Christophe Leroy (CS GROUP),
Felix Kuehling, Alex Deucher, Christian König, David Airlie,
Simona Vetter, Maarten Lankhorst, Maxime Ripard,
Thomas Zimmermann, Lyude Paul, Danilo Krummrich, Bjorn Helgaas,
Logan Gunthorpe, David Hildenbrand, Oscar Salvador,
Andrew Morton, Leon Romanovsky, Lorenzo Stoakes,
Liam R . Howlett, Vlastimil Babka, Mike Rapoport,
Suren Baghdasaryan, Michal Hocko, Alistair Popple, linuxppc-dev,
kvm, linux-kernel, amd-gfx, nouveau, linux-pci, linux-mm,
linux-cxl
On Mon, Jan 12, 2026 at 11:31:04AM -0500, Zi Yan wrote:
> > folio_free()
> >
> > 1) Allocator finds free memory
> > 2) zone_device_page_init() allocates the memory and makes refcount=1
> > 3) __folio_put() knows the recount 0.
> > 4) free_zone_device_folio() calls folio_free(), but it doesn't
> > actually need to undo prep_compound_page() because *NOTHING* can
> > use the page pointer at this point.
> > 5) Driver puts the memory back into the allocator and now #1 can
> > happen. It knows how much memory to put back because folio->order
> > is valid from #2
> > 6) #1 happens again, then #2 happens again and the folio is in the
> > right state for use. The successor #2 fully undoes the work of the
> > predecessor #2.
>
> But how can a successor #2 undo the work if the second #1 only allocates
> half of the original folio? For example, an order-9 at PFN 0 is
> allocated and freed, then an order-8 at PFN 0 is allocated and another
> order-8 at PFN 256 is allocated. How can two #2s undo the same order-9
> without corrupting each other’s data?
What do you mean? The fundamental rule is you can't read the folio or
the order outside folio_free once its refcount reaches 0.
So the successor #2 will write updated heads and order to the order 8
pages at PFN 0 and the ones starting at PFN 256 will remain with
garbage.
This is OK because nothing is allowed to read them as their refcount
is 0.
If later PFN256 is allocated then it will get updated head and order
at the same time its refcount becomes 1.
There is no corruption and they don't corrupt each other's data.
> > If the allocator is using the struct page memory then step #5 should
> > also clean up the struct page with the allocator data before returning
> > it to the allocator.
>
> Do you mean ->folio_free() callback should undo prep_compound_page()
> instead?
I wouldn't say undo, I was very careful to say it needs to get the
struct page memory into a state that the allocator algorithm expects,
whatever that means.
Jason
^ permalink raw reply [flat|nested] 32+ messages in thread
* Re: [PATCH v4 1/7] mm/zone_device: Add order argument to folio_free callback
2026-01-12 16:50 ` Jason Gunthorpe
@ 2026-01-12 17:46 ` Zi Yan
2026-01-12 18:25 ` Jason Gunthorpe
2026-01-12 23:07 ` Matthew Brost
1 sibling, 1 reply; 32+ messages in thread
From: Zi Yan @ 2026-01-12 17:46 UTC (permalink / raw)
To: Jason Gunthorpe
Cc: Matthew Wilcox, Balbir Singh, Francois Dugast, intel-xe,
dri-devel, Matthew Brost, Madhavan Srinivasan, Nicholas Piggin,
Michael Ellerman, Christophe Leroy (CS GROUP),
Felix Kuehling, Alex Deucher, Christian König, David Airlie,
Simona Vetter, Maarten Lankhorst, Maxime Ripard,
Thomas Zimmermann, Lyude Paul, Danilo Krummrich, Bjorn Helgaas,
Logan Gunthorpe, David Hildenbrand, Oscar Salvador,
Andrew Morton, Leon Romanovsky, Lorenzo Stoakes,
Liam R . Howlett, Vlastimil Babka, Mike Rapoport,
Suren Baghdasaryan, Michal Hocko, Alistair Popple, linuxppc-dev,
kvm, linux-kernel, amd-gfx, nouveau, linux-pci, linux-mm,
linux-cxl
On 12 Jan 2026, at 11:50, Jason Gunthorpe wrote:
> On Mon, Jan 12, 2026 at 11:31:04AM -0500, Zi Yan wrote:
>>> folio_free()
>>>
>>> 1) Allocator finds free memory
>>> 2) zone_device_page_init() allocates the memory and makes refcount=1
>>> 3) __folio_put() knows the recount 0.
>>> 4) free_zone_device_folio() calls folio_free(), but it doesn't
>>> actually need to undo prep_compound_page() because *NOTHING* can
>>> use the page pointer at this point.
>>> 5) Driver puts the memory back into the allocator and now #1 can
>>> happen. It knows how much memory to put back because folio->order
>>> is valid from #2
>>> 6) #1 happens again, then #2 happens again and the folio is in the
>>> right state for use. The successor #2 fully undoes the work of the
>>> predecessor #2.
>>
>> But how can a successor #2 undo the work if the second #1 only allocates
>> half of the original folio? For example, an order-9 at PFN 0 is
>> allocated and freed, then an order-8 at PFN 0 is allocated and another
>> order-8 at PFN 256 is allocated. How can two #2s undo the same order-9
>> without corrupting each other’s data?
>
> What do you mean? The fundamental rule is you can't read the folio or
> the order outside folio_free once it's refcount reaches 0.
There is no such a rule. In core MM, folio_split(), which splits a high
order folio to low order ones, freezes the folio (turning refcount to 0)
and manipulates the folio order and all tail pages compound_head to
restructure the folio. Your fundamental rule breaks this.
Allowing compound information to stay after a folio is freed means
you cannot tell whether a folio is under split or freed.
>
> So the successor #2 will write updated heads and order to the order 8
> pages at PFN 0 and the ones starting at PFN 256 will remain with
> garbage.
>
> This is OK because nothing is allowed to read them as their refcount
> is 0.
>
> If later PFN256 is allocated then it will get updated head and order
> at the same time it's refcount becomes 1.
>
> There is corruption and they don't corrupt each other's data.
>
>>> If the allocator is using the struct page memory then step #5 should
>>> also clean up the struct page with the allocator data before returning
>>> it to the allocator.
>>
>> Do you mean ->folio_free() callback should undo prep_compound_page()
>> instead?
>
> I wouldn't say undo, I was very careful to say it needs to get the
> struct page memory into a state that the allocator algorithm expects,
> whatever that means.
>
> Jason
Best Regards,
Yan, Zi
^ permalink raw reply [flat|nested] 32+ messages in thread
* Re: [PATCH v4 1/7] mm/zone_device: Add order argument to folio_free callback
2026-01-12 17:46 ` Zi Yan
@ 2026-01-12 18:25 ` Jason Gunthorpe
2026-01-12 18:55 ` Zi Yan
0 siblings, 1 reply; 32+ messages in thread
From: Jason Gunthorpe @ 2026-01-12 18:25 UTC (permalink / raw)
To: Zi Yan
Cc: Matthew Wilcox, Balbir Singh, Francois Dugast, intel-xe,
dri-devel, Matthew Brost, Madhavan Srinivasan, Nicholas Piggin,
Michael Ellerman, Christophe Leroy (CS GROUP),
Felix Kuehling, Alex Deucher, Christian König, David Airlie,
Simona Vetter, Maarten Lankhorst, Maxime Ripard,
Thomas Zimmermann, Lyude Paul, Danilo Krummrich, Bjorn Helgaas,
Logan Gunthorpe, David Hildenbrand, Oscar Salvador,
Andrew Morton, Leon Romanovsky, Lorenzo Stoakes,
Liam R . Howlett, Vlastimil Babka, Mike Rapoport,
Suren Baghdasaryan, Michal Hocko, Alistair Popple, linuxppc-dev,
kvm, linux-kernel, amd-gfx, nouveau, linux-pci, linux-mm,
linux-cxl
On Mon, Jan 12, 2026 at 12:46:57PM -0500, Zi Yan wrote:
> On 12 Jan 2026, at 11:50, Jason Gunthorpe wrote:
>
> > On Mon, Jan 12, 2026 at 11:31:04AM -0500, Zi Yan wrote:
> >>> folio_free()
> >>>
> >>> 1) Allocator finds free memory
> >>> 2) zone_device_page_init() allocates the memory and makes refcount=1
> >>> 3) __folio_put() knows the recount 0.
> >>> 4) free_zone_device_folio() calls folio_free(), but it doesn't
> >>> actually need to undo prep_compound_page() because *NOTHING* can
> >>> use the page pointer at this point.
> >>> 5) Driver puts the memory back into the allocator and now #1 can
> >>> happen. It knows how much memory to put back because folio->order
> >>> is valid from #2
> >>> 6) #1 happens again, then #2 happens again and the folio is in the
> >>> right state for use. The successor #2 fully undoes the work of the
> >>> predecessor #2.
> >>
> >> But how can a successor #2 undo the work if the second #1 only allocates
> >> half of the original folio? For example, an order-9 at PFN 0 is
> >> allocated and freed, then an order-8 at PFN 0 is allocated and another
> >> order-8 at PFN 256 is allocated. How can two #2s undo the same order-9
> >> without corrupting each other’s data?
> >
> > What do you mean? The fundamental rule is you can't read the folio or
> > the order outside folio_free once it's refcount reaches 0.
>
> There is no such a rule. In core MM, folio_split(), which splits a high
> order folio to low order ones, freezes the folio (turning refcount to 0)
> and manipulates the folio order and all tail pages compound_head to
> restructure the folio.
That's different, I am talking about reaching 0 because it has been
freed, meaning there are no external pointers to it.
Further, when a page is frozen page_ref_freeze() takes in the number
of references the caller has ownership over and it doesn't succeed if
there are stray references elsewhere.
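The freeze step in the split path has roughly this shape (simplified from
mm/huge_memory.c; extra_pins is the number of references the caller itself owns
on top of the base one):

	if (!folio_ref_freeze(folio, 1 + extra_pins))
		return -EAGAIN;		/* stray references elsewhere: refuse to split */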
This is very important because the entire operating model of split
only works if it has exclusive locks over all the valid pointers into
that page.
Spurious refcount failures concurrent with split cannot be allowed.
I don't see how pointing at __folio_freeze_and_split_unmapped() can
justify this series.
> Your fundamental rule breaks this. Allowing compound information
> to stay after a folio is freed means you cannot tell whether a folio
> is under split or freed.
You can't refcount a folio out of nothing. It has to come from a
memory location that already is holding a refcount, and then you can
incr it.
For example lockless GUP fast will read the PTE, adjust to the head
page, attempt to incr it, then recheck the PTE.
If there are races then sure maybe the PTE will point to a stray tail
page that refers to an already allocated head page, but the re-check
of the PTE wille exclude this. The refcount system already has to
tolerate spurious refcount incrs because of GUP fast.
Nothing should be looking at order and refcount to try to guess if
concurrent split is happening!!
Jason
^ permalink raw reply [flat|nested] 32+ messages in thread
* Re: [PATCH v4 1/7] mm/zone_device: Add order argument to folio_free callback
2026-01-12 18:25 ` Jason Gunthorpe
@ 2026-01-12 18:55 ` Zi Yan
2026-01-12 19:28 ` Jason Gunthorpe
0 siblings, 1 reply; 32+ messages in thread
From: Zi Yan @ 2026-01-12 18:55 UTC (permalink / raw)
To: Jason Gunthorpe
Cc: Matthew Wilcox, Balbir Singh, Francois Dugast, intel-xe,
dri-devel, Matthew Brost, Madhavan Srinivasan, Nicholas Piggin,
Michael Ellerman, Christophe Leroy (CS GROUP),
Felix Kuehling, Alex Deucher, Christian König, David Airlie,
Simona Vetter, Maarten Lankhorst, Maxime Ripard,
Thomas Zimmermann, Lyude Paul, Danilo Krummrich, Bjorn Helgaas,
Logan Gunthorpe, David Hildenbrand, Oscar Salvador,
Andrew Morton, Leon Romanovsky, Lorenzo Stoakes,
Liam R . Howlett, Vlastimil Babka, Mike Rapoport,
Suren Baghdasaryan, Michal Hocko, Alistair Popple, linuxppc-dev,
kvm, linux-kernel, amd-gfx, nouveau, linux-pci, linux-mm,
linux-cxl
On 12 Jan 2026, at 13:25, Jason Gunthorpe wrote:
> On Mon, Jan 12, 2026 at 12:46:57PM -0500, Zi Yan wrote:
>> On 12 Jan 2026, at 11:50, Jason Gunthorpe wrote:
>>
>>> On Mon, Jan 12, 2026 at 11:31:04AM -0500, Zi Yan wrote:
>>>>> folio_free()
>>>>>
>>>>> 1) Allocator finds free memory
>>>>> 2) zone_device_page_init() allocates the memory and makes refcount=1
>>>>> 3) __folio_put() knows the recount 0.
>>>>> 4) free_zone_device_folio() calls folio_free(), but it doesn't
>>>>> actually need to undo prep_compound_page() because *NOTHING* can
>>>>> use the page pointer at this point.
>>>>> 5) Driver puts the memory back into the allocator and now #1 can
>>>>> happen. It knows how much memory to put back because folio->order
>>>>> is valid from #2
>>>>> 6) #1 happens again, then #2 happens again and the folio is in the
>>>>> right state for use. The successor #2 fully undoes the work of the
>>>>> predecessor #2.
>>>>
>>>> But how can a successor #2 undo the work if the second #1 only allocates
>>>> half of the original folio? For example, an order-9 at PFN 0 is
>>>> allocated and freed, then an order-8 at PFN 0 is allocated and another
>>>> order-8 at PFN 256 is allocated. How can two #2s undo the same order-9
>>>> without corrupting each other’s data?
>>>
>>> What do you mean? The fundamental rule is you can't read the folio or
>>> the order outside folio_free once it's refcount reaches 0.
>>
>> There is no such a rule. In core MM, folio_split(), which splits a high
>> order folio to low order ones, freezes the folio (turning refcount to 0)
>> and manipulates the folio order and all tail pages compound_head to
>> restructure the folio.
>
> That's different, I am talking about reaching 0 because it has been
> freed, meaning there are no external pointers to it.
>
> Further, when a page is frozen page_ref_freeze() takes in the number
> of references the caller has ownership over and it doesn't succeed if
> there are stray references elsewhere.
>
> This is very important because the entire operating model of split
> only works if it has exclusive locks over all the valid pointers into
> that page.
>
> Spurious refcount failures concurrent with split cannot be allowed.
>
> I don't see how pointing at __folio_freeze_and_split_unmapped() can
> justify this series.
>
But for anyone looking at the folio state (refcount == 0, compound_head
is set), there is no way to tell the difference.
If what you said is true, why is free_pages_prepare() needed? No one
should touch these free pages. Why bother resetting these states?
>> Your fundamental rule breaks this. Allowing compound information
>> to stay after a folio is freed means you cannot tell whether a folio
>> is under split or freed.
>
> You can't refcount a folio out of nothing. It has to come from a
> memory location that already is holding a refcount, and then you can
> incr it.
Right. There is also no guarantee that all code is correct and follows
this.
My point here is that calling prep_compound_page() on a compound page
does not follow core MM’s conventions.
Best Regards,
Yan, Zi
^ permalink raw reply [flat|nested] 32+ messages in thread
* Re: [PATCH v4 1/7] mm/zone_device: Add order argument to folio_free callback
2026-01-12 18:55 ` Zi Yan
@ 2026-01-12 19:28 ` Jason Gunthorpe
0 siblings, 0 replies; 32+ messages in thread
From: Jason Gunthorpe @ 2026-01-12 19:28 UTC (permalink / raw)
To: Zi Yan
Cc: Matthew Wilcox, Balbir Singh, Francois Dugast, intel-xe,
dri-devel, Matthew Brost, Madhavan Srinivasan, Nicholas Piggin,
Michael Ellerman, Christophe Leroy (CS GROUP),
Felix Kuehling, Alex Deucher, Christian König, David Airlie,
Simona Vetter, Maarten Lankhorst, Maxime Ripard,
Thomas Zimmermann, Lyude Paul, Danilo Krummrich, Bjorn Helgaas,
Logan Gunthorpe, David Hildenbrand, Oscar Salvador,
Andrew Morton, Leon Romanovsky, Lorenzo Stoakes,
Liam R . Howlett, Vlastimil Babka, Mike Rapoport,
Suren Baghdasaryan, Michal Hocko, Alistair Popple, linuxppc-dev,
kvm, linux-kernel, amd-gfx, nouveau, linux-pci, linux-mm,
linux-cxl
On Mon, Jan 12, 2026 at 01:55:18PM -0500, Zi Yan wrote:
> > That's different, I am talking about reaching 0 because it has been
> > freed, meaning there are no external pointers to it.
> >
> > Further, when a page is frozen page_ref_freeze() takes in the number
> > of references the caller has ownership over and it doesn't succeed if
> > there are stray references elsewhere.
> >
> > This is very important because the entire operating model of split
> > only works if it has exclusive locks over all the valid pointers into
> > that page.
> >
> > Spurious refcount failures concurrent with split cannot be allowed.
> >
> > I don't see how pointing at __folio_freeze_and_split_unmapped() can
> > justify this series.
> >
>
> But from anyone looking at the folio state, refcount == 0, compound_head
> is set, they cannot tell the difference.
This isn't reliable, nothing correct can be doing it :\
> If what you said is true, why is free_pages_prepare() needed? No one
> should touch these free pages. Why bother resetting these states.
? That function does a lot of stuff; things like uncharging the cgroup
should obviously happen at free time.
What part of it are you looking at?
> > You can't refcount a folio out of nothing. It has to come from a
> > memory location that already is holding a refcount, and then you can
> > incr it.
>
> Right. There is also no guarantee that all code is correct and follows
> this.
Let's concretely point at things that have a problem please.
> My point here is that calling prep_compound_page() on a compound page
> does not follow core MM’s conventions.
Maybe, but that doesn't mean it isn't the right solution..
Jason
^ permalink raw reply [flat|nested] 32+ messages in thread
* Re: [PATCH v4 1/7] mm/zone_device: Add order argument to folio_free callback
2026-01-12 16:50 ` Jason Gunthorpe
2026-01-12 17:46 ` Zi Yan
@ 2026-01-12 23:07 ` Matthew Brost
1 sibling, 0 replies; 32+ messages in thread
From: Matthew Brost @ 2026-01-12 23:07 UTC (permalink / raw)
To: Jason Gunthorpe
Cc: Zi Yan, Matthew Wilcox, Balbir Singh, Francois Dugast, intel-xe,
dri-devel, Madhavan Srinivasan, Nicholas Piggin,
Michael Ellerman, Christophe Leroy (CS GROUP),
Felix Kuehling, Alex Deucher, Christian König, David Airlie,
Simona Vetter, Maarten Lankhorst, Maxime Ripard,
Thomas Zimmermann, Lyude Paul, Danilo Krummrich, Bjorn Helgaas,
Logan Gunthorpe, David Hildenbrand, Oscar Salvador,
Andrew Morton, Leon Romanovsky, Lorenzo Stoakes,
Liam R . Howlett, Vlastimil Babka, Mike Rapoport,
Suren Baghdasaryan, Michal Hocko, Alistair Popple, linuxppc-dev,
kvm, linux-kernel, amd-gfx, nouveau, linux-pci, linux-mm,
linux-cxl
On Mon, Jan 12, 2026 at 12:50:01PM -0400, Jason Gunthorpe wrote:
> On Mon, Jan 12, 2026 at 11:31:04AM -0500, Zi Yan wrote:
> > > folio_free()
> > >
> > > 1) Allocator finds free memory
> > > 2) zone_device_page_init() allocates the memory and makes refcount=1
> > > 3) __folio_put() knows the recount 0.
> > > 4) free_zone_device_folio() calls folio_free(), but it doesn't
> > > actually need to undo prep_compound_page() because *NOTHING* can
> > > use the page pointer at this point.
> > > 5) Driver puts the memory back into the allocator and now #1 can
> > > happen. It knows how much memory to put back because folio->order
> > > is valid from #2
> > > 6) #1 happens again, then #2 happens again and the folio is in the
> > > right state for use. The successor #2 fully undoes the work of the
> > > predecessor #2.
> >
> > But how can a successor #2 undo the work if the second #1 only allocates
> > half of the original folio? For example, an order-9 at PFN 0 is
> > allocated and freed, then an order-8 at PFN 0 is allocated and another
> > order-8 at PFN 256 is allocated. How can two #2s undo the same order-9
> > without corrupting each other’s data?
>
> What do you mean? The fundamental rule is you can't read the folio or
> the order outside folio_free once it's refcount reaches 0.
>
> So the successor #2 will write updated heads and order to the order 8
> pages at PFN 0 and the ones starting at PFN 256 will remain with
> garbage.
>
> This is OK because nothing is allowed to read them as their refcount
> is 0.
>
> If later PFN256 is allocated then it will get updated head and order
> at the same time it's refcount becomes 1.
>
> There is no corruption and they don't corrupt each other's data.
>
> > > If the allocator is using the struct page memory then step #5 should
> > > also clean up the struct page with the allocator data before returning
> > > it to the allocator.
> >
> > Do you mean ->folio_free() callback should undo prep_compound_page()
> > instead?
>
> I wouldn't say undo, I was very careful to say it needs to get the
> struct page memory into a state that the allocator algorithm expects,
> whatever that means.
>
Hi Jason,
A lot of back and forth with Zi — if I’m understanding correctly, your
suggestion is to just call free_zone_device_folio_prepare() [1] in
->folio_free() if required by the driver. This is the function that puts
struct page into a state my allocator expects. That works just fine for
me.
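Something like this, using the drm_pagemap callback as the example (only a
sketch; the point is just that free_zone_device_folio_prepare() runs inside
->folio_free(), so the original folio_order() is still readable and no extra
argument is needed):

	static void drm_pagemap_folio_free(struct folio *folio)
	{
		void *zdd = folio->page.zone_device_data;
		unsigned int order = folio_order(folio);	/* reset has not happened yet */

		free_zone_device_folio_prepare(folio);		/* helper from patch 2/7 */
		/* ... return 1 << order pages to the driver allocator ... */
		drm_pagemap_zdd_put(zdd);
	}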
Matt
[1] https://patchwork.freedesktop.org/patch/697877/?series=159120&rev=4
> Jason
^ permalink raw reply [flat|nested] 32+ messages in thread
* Re: [PATCH v4 1/7] mm/zone_device: Add order argument to folio_free callback
2026-01-12 13:45 ` Jason Gunthorpe
2026-01-12 16:31 ` Zi Yan
@ 2026-01-12 21:49 ` Matthew Brost
2026-01-12 23:15 ` Zi Yan
1 sibling, 1 reply; 32+ messages in thread
From: Matthew Brost @ 2026-01-12 21:49 UTC (permalink / raw)
To: Jason Gunthorpe
Cc: Zi Yan, Matthew Wilcox, Balbir Singh, Francois Dugast, intel-xe,
dri-devel, Madhavan Srinivasan, Nicholas Piggin,
Michael Ellerman, Christophe Leroy (CS GROUP),
Felix Kuehling, Alex Deucher, Christian König, David Airlie,
Simona Vetter, Maarten Lankhorst, Maxime Ripard,
Thomas Zimmermann, Lyude Paul, Danilo Krummrich, Bjorn Helgaas,
Logan Gunthorpe, David Hildenbrand, Oscar Salvador,
Andrew Morton, Leon Romanovsky, Lorenzo Stoakes,
Liam R . Howlett, Vlastimil Babka, Mike Rapoport,
Suren Baghdasaryan, Michal Hocko, Alistair Popple, linuxppc-dev,
kvm, linux-kernel, amd-gfx, nouveau, linux-pci, linux-mm,
linux-cxl
On Mon, Jan 12, 2026 at 09:45:10AM -0400, Jason Gunthorpe wrote:
Hi, catching up here.
> On Sun, Jan 11, 2026 at 07:51:01PM -0500, Zi Yan wrote:
> > On 11 Jan 2026, at 19:19, Balbir Singh wrote:
> >
> > > On 1/12/26 08:35, Matthew Wilcox wrote:
> > >> On Sun, Jan 11, 2026 at 09:55:40PM +0100, Francois Dugast wrote:
> > >>> The core MM splits the folio before calling folio_free, restoring the
> > >>> zone pages associated with the folio to an initialized state (e.g.,
> > >>> non-compound, pgmap valid, etc...). The order argument represents the
> > >>> folio’s order prior to the split which can be used driver side to know
> > >>> how many pages are being freed.
> > >>
> > >> This really feels like the wrong way to fix this problem.
> > >>
> >
> > Hi Matthew,
> >
> > I think the wording is confusing, since the actual issue is that:
> >
> > 1. zone_device_page_init() calls prep_compound_page() to form a large folio,
> > 2. but free_zone_device_folio() never reverses the course,
> > 3. the undo of prep_compound_page() in free_zone_device_folio() needs to
> > be done before driver callback ->folio_free(), since once ->folio_free()
> > is called, the folio can be reallocated immediately,
> > 4. after the undo of prep_compound_page(), folio_order() can no longer provide
> > the original order information, thus, folio_free() needs that for proper
> > device side ref manipulation.
>
> There is something wrong with the driver if the "folio can be
> reallocated immediately".
>
> The flow generally expects there to be a driver allocator linked to
> folio_free()
>
> 1) Allocator finds free memory
> 2) zone_device_page_init() allocates the memory and makes refcount=1
> 3) __folio_put() knows the refcount is 0.
> 4) free_zone_device_folio() calls folio_free(), but it doesn't
> actually need to undo prep_compound_page() because *NOTHING* can
> use the page pointer at this point.
Correct—nothing can use the folio prior to calling folio_free(). Once
folio_free() returns, the driver side is free to immediately reallocate
the folio (or a subset of its pages).
> 5) Driver puts the memory back into the allocator and now #1 can
> happen. It knows how much memory to put back because folio->order
> is valid from #2
> 6) #1 happens again, then #2 happens again and the folio is in the
> right state for use. The successor #2 fully undoes the work of the
> predecessor #2.
>
> If you have races where #1 can happen immediately after #3 then the
> driver design is fundamentally broken and passing around order isn't
> going to help anything.
>
The above race does not exist; if it did, I agree we’d be solving
nothing here.
> If the allocator is using the struct page memory then step #5 should
> also clean up the struct page with the allocator data before returning
> it to the allocator.
>
We could move the call to free_zone_device_folio_prepare() [1] into the
driver-side implementation of ->folio_free() and drop the order argument
here. Zi didn’t particularly like that; he preferred calling
free_zone_device_folio_prepare() [2] before invoking ->folio_free(),
which is why this patch exists.
FWIW, I do not have a strong opinion here—either way works. Xe doesn’t
actually need the order regardless of where
free_zone_device_folio_prepare() is called, but Nouveau does need the
order if free_zone_device_folio_prepare() is called before
->folio_free().
[1] https://patchwork.freedesktop.org/patch/697877/?series=159120&rev=4
[2] https://patchwork.freedesktop.org/patch/697709/?series=159120&rev=3#comment_1282405
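To make the Nouveau case concrete, something like the below is why the
order argument matters when free_zone_device_folio_prepare() runs first
(sketch only; release_chunks() is a stand-in, not the real nouveau_dmem
code):

	static void folio_free_with_order(struct folio *folio, unsigned int order)
	{
		/*
		 * The core already reset the folio before calling us, so
		 * folio_order() reads 0 here; the order argument is the only
		 * record of how many pages are being freed.
		 */
		release_chunks(folio_pfn(folio), 1UL << order);
	}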
> I vaguely remember talking about this before in the context of the Xe
> driver.. You can't just take an existing VRAM allocator and layer it
> on top of the folios and have it broadly ignore the folio_free
> callback.
>
We are definitely not ignoring the ->folio_free callback—that is the
point at which we tell our VRAM allocator (DRM buddy) it is okay to
release the allocation and make it available for reuse.
Matt
> Jason
^ permalink raw reply [flat|nested] 32+ messages in thread
* Re: [PATCH v4 1/7] mm/zone_device: Add order argument to folio_free callback
2026-01-12 21:49 ` Matthew Brost
@ 2026-01-12 23:15 ` Zi Yan
2026-01-12 23:22 ` Matthew Brost
0 siblings, 1 reply; 32+ messages in thread
From: Zi Yan @ 2026-01-12 23:15 UTC (permalink / raw)
To: Matthew Brost
Cc: Jason Gunthorpe, Matthew Wilcox, Balbir Singh, Francois Dugast,
intel-xe, dri-devel, Madhavan Srinivasan, Nicholas Piggin,
Michael Ellerman, Christophe Leroy (CS GROUP),
Felix Kuehling, Alex Deucher, Christian König, David Airlie,
Simona Vetter, Maarten Lankhorst, Maxime Ripard,
Thomas Zimmermann, Lyude Paul, Danilo Krummrich, Bjorn Helgaas,
Logan Gunthorpe, David Hildenbrand, Oscar Salvador,
Andrew Morton, Leon Romanovsky, Lorenzo Stoakes,
Liam R . Howlett, Vlastimil Babka, Mike Rapoport,
Suren Baghdasaryan, Michal Hocko, Alistair Popple, linuxppc-dev,
kvm, linux-kernel, amd-gfx, nouveau, linux-pci, linux-mm,
linux-cxl
On 12 Jan 2026, at 16:49, Matthew Brost wrote:
> On Mon, Jan 12, 2026 at 09:45:10AM -0400, Jason Gunthorpe wrote:
>
> Hi, catching up here.
>
>> On Sun, Jan 11, 2026 at 07:51:01PM -0500, Zi Yan wrote:
>>> On 11 Jan 2026, at 19:19, Balbir Singh wrote:
>>>
>>>> On 1/12/26 08:35, Matthew Wilcox wrote:
>>>>> On Sun, Jan 11, 2026 at 09:55:40PM +0100, Francois Dugast wrote:
>>>>>> The core MM splits the folio before calling folio_free, restoring the
>>>>>> zone pages associated with the folio to an initialized state (e.g.,
>>>>>> non-compound, pgmap valid, etc...). The order argument represents the
>>>>>> folio’s order prior to the split which can be used driver side to know
>>>>>> how many pages are being freed.
>>>>>
>>>>> This really feels like the wrong way to fix this problem.
>>>>>
>>>
>>> Hi Matthew,
>>>
>>> I think the wording is confusing, since the actual issue is that:
>>>
>>> 1. zone_device_page_init() calls prep_compound_page() to form a large folio,
>>> 2. but free_zone_device_folio() never reverses the course,
>>> 3. the undo of prep_compound_page() in free_zone_device_folio() needs to
>>> be done before driver callback ->folio_free(), since once ->folio_free()
>>> is called, the folio can be reallocated immediately,
>>> 4. after the undo of prep_compound_page(), folio_order() can no longer provide
>>> the original order information, thus, folio_free() needs that for proper
>>> device side ref manipulation.
>>
>> There is something wrong with the driver if the "folio can be
>> reallocated immediately".
>>
>> The flow generally expects there to be a driver allocator linked to
>> folio_free()
>>
>> 1) Allocator finds free memory
>> 2) zone_device_page_init() allocates the memory and makes refcount=1
>> 3) __folio_put() knows the refcount is 0.
>> 4) free_zone_device_folio() calls folio_free(), but it doesn't
>> actually need to undo prep_compound_page() because *NOTHING* can
>> use the page pointer at this point.
>
> Correct—nothing can use the folio prior to calling folio_free(). Once
> folio_free() returns, the driver side is free to immediately reallocate
> the folio (or a subset of its pages).
>
>> 5) Driver puts the memory back into the allocator and now #1 can
>> happen. It knows how much memory to put back because folio->order
>> is valid from #2
>> 6) #1 happens again, then #2 happens again and the folio is in the
>> right state for use. The successor #2 fully undoes the work of the
>> predecessor #2.
>>
>> If you have races where #1 can happen immediately after #3 then the
>> driver design is fundamentally broken and passing around order isn't
>> going to help anything.
>>
>
> The above race does not exist; if it did, I agree we’d be solving
> nothing here.
>
>> If the allocator is using the struct page memory then step #5 should
>> also clean up the struct page with the allocator data before returning
>> it to the allocator.
>>
>
> We could move the call to free_zone_device_folio_prepare() [1] into the
> driver-side implementation of ->folio_free() and drop the order argument
> here. Zi didn’t particularly like that; he preferred calling
> free_zone_device_folio_prepare() [2] before invoking ->folio_free(),
> which is why this patch exists.
On second thought, if calling free_zone_device_folio_prepare() in
->folio_free() works, feel free to do so.
>
> FWIW, I do not have a strong opinion here—either way works. Xe doesn’t
> actually need the order regardless of where
> free_zone_device_folio_prepare() is called, but Nouveau does need the
> order if free_zone_device_folio_prepare() is called before
> ->folio_free().
>
> [1] https://patchwork.freedesktop.org/patch/697877/?series=159120&rev=4
> [2] https://patchwork.freedesktop.org/patch/697709/?series=159120&rev=3#comment_1282405
>
>> I vaguely remember talking about this before in the context of the Xe
>> driver.. You can't just take an existing VRAM allocator and layer it
>> on top of the folios and have it broadly ignore the folio_free
>> callback.
>>
>
> We are definitely not ignoring the ->folio_free callback—that is the
> point at which we tell our VRAM allocator (DRM buddy) it is okay to
> release the allocation and make it available for reuse.
>
> Matt
>
>> Jason
Best Regards,
Yan, Zi
^ permalink raw reply [flat|nested] 32+ messages in thread
* Re: [PATCH v4 1/7] mm/zone_device: Add order argument to folio_free callback
2026-01-12 23:15 ` Zi Yan
@ 2026-01-12 23:22 ` Matthew Brost
0 siblings, 0 replies; 32+ messages in thread
From: Matthew Brost @ 2026-01-12 23:22 UTC (permalink / raw)
To: Zi Yan
Cc: Jason Gunthorpe, Matthew Wilcox, Balbir Singh, Francois Dugast,
intel-xe, dri-devel, Madhavan Srinivasan, Nicholas Piggin,
Michael Ellerman, Christophe Leroy (CS GROUP),
Felix Kuehling, Alex Deucher, Christian König, David Airlie,
Simona Vetter, Maarten Lankhorst, Maxime Ripard,
Thomas Zimmermann, Lyude Paul, Danilo Krummrich, Bjorn Helgaas,
Logan Gunthorpe, David Hildenbrand, Oscar Salvador,
Andrew Morton, Leon Romanovsky, Lorenzo Stoakes,
Liam R . Howlett, Vlastimil Babka, Mike Rapoport,
Suren Baghdasaryan, Michal Hocko, Alistair Popple, linuxppc-dev,
kvm, linux-kernel, amd-gfx, nouveau, linux-pci, linux-mm,
linux-cxl
On Mon, Jan 12, 2026 at 06:15:26PM -0500, Zi Yan wrote:
> On 12 Jan 2026, at 16:49, Matthew Brost wrote:
>
> > On Mon, Jan 12, 2026 at 09:45:10AM -0400, Jason Gunthorpe wrote:
> >
> > Hi, catching up here.
> >
> >> On Sun, Jan 11, 2026 at 07:51:01PM -0500, Zi Yan wrote:
> >>> On 11 Jan 2026, at 19:19, Balbir Singh wrote:
> >>>
> >>>> On 1/12/26 08:35, Matthew Wilcox wrote:
> >>>>> On Sun, Jan 11, 2026 at 09:55:40PM +0100, Francois Dugast wrote:
> >>>>>> The core MM splits the folio before calling folio_free, restoring the
> >>>>>> zone pages associated with the folio to an initialized state (e.g.,
> >>>>>> non-compound, pgmap valid, etc...). The order argument represents the
> >>>>>> folio’s order prior to the split which can be used driver side to know
> >>>>>> how many pages are being freed.
> >>>>>
> >>>>> This really feels like the wrong way to fix this problem.
> >>>>>
> >>>
> >>> Hi Matthew,
> >>>
> >>> I think the wording is confusing, since the actual issue is that:
> >>>
> >>> 1. zone_device_page_init() calls prep_compound_page() to form a large folio,
> >>> 2. but free_zone_device_folio() never reverses the course,
> >>> 3. the undo of prep_compound_page() in free_zone_device_folio() needs to
> >>> be done before driver callback ->folio_free(), since once ->folio_free()
> >>> is called, the folio can be reallocated immediately,
> >>> 4. after the undo of prep_compound_page(), folio_order() can no longer provide
> >>> the original order information, thus, folio_free() needs that for proper
> >>> device side ref manipulation.
> >>
> >> There is something wrong with the driver if the "folio can be
> >> reallocated immediately".
> >>
> >> The flow generally expects there to be a driver allocator linked to
> >> folio_free()
> >>
> >> 1) Allocator finds free memory
> >> 2) zone_device_page_init() allocates the memory and makes refcount=1
> >> 3) __folio_put() knows the refcount is 0.
> >> 4) free_zone_device_folio() calls folio_free(), but it doesn't
> >> actually need to undo prep_compound_page() because *NOTHING* can
> >> use the page pointer at this point.
> >
> > Correct—nothing can use the folio prior to calling folio_free(). Once
> > folio_free() returns, the driver side is free to immediately reallocate
> > the folio (or a subset of its pages).
> >
> >> 5) Driver puts the memory back into the allocator and now #1 can
> >> happen. It knows how much memory to put back because folio->order
> >> is valid from #2
> >> 6) #1 happens again, then #2 happens again and the folio is in the
> >> right state for use. The successor #2 fully undoes the work of the
> >> predecessor #2.
> >>
> >> If you have races where #1 can happen immediately after #3 then the
> >> driver design is fundamentally broken and passing around order isn't
> >> going to help anything.
> >>
> >
> > The above race does not exist; if it did, I agree we’d be solving
> > nothing here.
> >
> >> If the allocator is using the struct page memory then step #5 should
> >> also clean up the struct page with the allocator data before returning
> >> it to the allocator.
> >>
> >
> > We could move the call to free_zone_device_folio_prepare() [1] into the
> > driver-side implementation of ->folio_free() and drop the order argument
> > here. Zi didn’t particularly like that; he preferred calling
> > free_zone_device_folio_prepare() [2] before invoking ->folio_free(),
> > which is why this patch exists.
>
> On second thought, if calling free_zone_device_folio_prepare() in
> ->folio_free() works, feel free to do so.
>
+1, testing this change right now and it does indeed work.
Matt
> >
> > FWIW, I do not have a strong opinion here—either way works. Xe doesn’t
> > actually need the order regardless of where
> > free_zone_device_folio_prepare() is called, but Nouveau does need the
> > order if free_zone_device_folio_prepare() is called before
> > ->folio_free().
> >
> > [1] https://patchwork.freedesktop.org/patch/697877/?series=159120&rev=4
> > [2] https://patchwork.freedesktop.org/patch/697709/?series=159120&rev=3#comment_1282405
> >
> >> I vaguely remember talking about this before in the context of the Xe
> >> driver.. You can't just take an existing VRAM allocator and layer it
> >> on top of the folios and have it broadly ignore the folio_free
> >> callback.
> >>
> >
> > We are definitely not ignoring the ->folio_free callback—that is the
> > point at which we tell our VRAM allocator (DRM buddy) it is okay to
> > release the allocation and make it available for reuse.
> >
> > Matt
> >
> >> Jason
>
>
> Best Regards,
> Yan, Zi
^ permalink raw reply [flat|nested] 32+ messages in thread
* [PATCH v4 2/7] mm/zone_device: Add free_zone_device_folio_prepare() helper
2026-01-11 20:55 [PATCH v4 0/7] Enable THP support in drm_pagemap Francois Dugast
2026-01-11 20:55 ` [PATCH v4 1/7] mm/zone_device: Add order argument to folio_free callback Francois Dugast
@ 2026-01-11 20:55 ` Francois Dugast
2026-01-12 0:44 ` Balbir Singh
2026-01-11 20:55 ` [PATCH v4 3/7] fs/dax: Use " Francois Dugast
` (4 subsequent siblings)
6 siblings, 1 reply; 32+ messages in thread
From: Francois Dugast @ 2026-01-11 20:55 UTC (permalink / raw)
To: intel-xe
Cc: dri-devel, Matthew Brost, Zi Yan, David Hildenbrand,
Oscar Salvador, Andrew Morton, Balbir Singh, Lorenzo Stoakes,
Liam R . Howlett, Vlastimil Babka, Mike Rapoport,
Suren Baghdasaryan, Michal Hocko, Alistair Popple, linux-mm,
linux-cxl, linux-kernel, Francois Dugast
From: Matthew Brost <matthew.brost@intel.com>
Add free_zone_device_folio_prepare(), a helper that restores large
ZONE_DEVICE folios to a sane, initial state before freeing them.
Compound ZONE_DEVICE folios overwrite per-page state (e.g. pgmap and
compound metadata). Before returning such pages to the device pgmap
allocator, each constituent page must be reset to a standalone
ZONE_DEVICE folio with a valid pgmap and no compound state.
Use this helper prior to folio_free() for device-private and
device-coherent folios to ensure consistent device page state for
subsequent allocations.
Fixes: d245f9b4ab80 ("mm/zone_device: support large zone device private folios")
Cc: Zi Yan <ziy@nvidia.com>
Cc: David Hildenbrand <david@kernel.org>
Cc: Oscar Salvador <osalvador@suse.de>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Balbir Singh <balbirs@nvidia.com>
Cc: Lorenzo Stoakes <lorenzo.stoakes@oracle.com>
Cc: Liam R. Howlett <Liam.Howlett@oracle.com>
Cc: Vlastimil Babka <vbabka@suse.cz>
Cc: Mike Rapoport <rppt@kernel.org>
Cc: Suren Baghdasaryan <surenb@google.com>
Cc: Michal Hocko <mhocko@suse.com>
Cc: Alistair Popple <apopple@nvidia.com>
Cc: linux-mm@kvack.org
Cc: linux-cxl@vger.kernel.org
Cc: linux-kernel@vger.kernel.org
Suggested-by: Alistair Popple <apopple@nvidia.com>
Signed-off-by: Matthew Brost <matthew.brost@intel.com>
Signed-off-by: Francois Dugast <francois.dugast@intel.com>
---
include/linux/memremap.h | 1 +
mm/memremap.c | 55 ++++++++++++++++++++++++++++++++++++++++
2 files changed, 56 insertions(+)
diff --git a/include/linux/memremap.h b/include/linux/memremap.h
index 97fcffeb1c1e..88e1d4707296 100644
--- a/include/linux/memremap.h
+++ b/include/linux/memremap.h
@@ -230,6 +230,7 @@ static inline bool is_fsdax_page(const struct page *page)
#ifdef CONFIG_ZONE_DEVICE
void zone_device_page_init(struct page *page, unsigned int order);
+void free_zone_device_folio_prepare(struct folio *folio);
void *memremap_pages(struct dev_pagemap *pgmap, int nid);
void memunmap_pages(struct dev_pagemap *pgmap);
void *devm_memremap_pages(struct device *dev, struct dev_pagemap *pgmap);
diff --git a/mm/memremap.c b/mm/memremap.c
index 39dc4bd190d0..375a61e18858 100644
--- a/mm/memremap.c
+++ b/mm/memremap.c
@@ -413,6 +413,60 @@ struct dev_pagemap *get_dev_pagemap(unsigned long pfn)
}
EXPORT_SYMBOL_GPL(get_dev_pagemap);
+/**
+ * free_zone_device_folio_prepare() - Prepare a ZONE_DEVICE folio for freeing.
+ * @folio: ZONE_DEVICE folio to prepare for release.
+ *
+ * ZONE_DEVICE pages/folios (e.g., device-private memory or fsdax-backed pages)
+ * can be compound. When freeing a compound ZONE_DEVICE folio, the tail pages
+ * must be restored to a sane ZONE_DEVICE state before they are released.
+ *
+ * This helper:
+ * - Clears @folio->mapping and, for compound folios, clears each page's
+ * compound-head state (ClearPageHead()/clear_compound_head()).
+ * - Resets the compound order metadata (folio_reset_order()) and then
+ * initializes each constituent page as a standalone ZONE_DEVICE folio:
+ * * clears ->mapping
+ * * restores ->pgmap (prep_compound_page() overwrites it)
+ * * clears ->share (only relevant for fsdax; unused for device-private)
+ *
+ * If @folio is order-0, only the mapping is cleared and no further work is
+ * required.
+ */
+void free_zone_device_folio_prepare(struct folio *folio)
+{
+ struct dev_pagemap *pgmap = page_pgmap(&folio->page);
+ int order, i;
+
+ VM_WARN_ON_FOLIO(!folio_is_zone_device(folio), folio);
+
+ folio->mapping = NULL;
+ order = folio_order(folio);
+ if (!order)
+ return;
+
+ folio_reset_order(folio);
+
+ for (i = 0; i < (1UL << order); i++) {
+ struct page *page = folio_page(folio, i);
+ struct folio *new_folio = (struct folio *)page;
+
+ ClearPageHead(page);
+ clear_compound_head(page);
+
+ new_folio->mapping = NULL;
+ /*
+ * Reset pgmap which was over-written by
+ * prep_compound_page().
+ */
+ new_folio->pgmap = pgmap;
+ new_folio->share = 0; /* fsdax only, unused for device private */
+ VM_WARN_ON_FOLIO(folio_ref_count(new_folio), new_folio);
+ VM_WARN_ON_FOLIO(!folio_is_zone_device(new_folio), new_folio);
+ }
+}
+EXPORT_SYMBOL_GPL(free_zone_device_folio_prepare);
+
void free_zone_device_folio(struct folio *folio)
{
struct dev_pagemap *pgmap = folio->pgmap;
@@ -454,6 +508,7 @@ void free_zone_device_folio(struct folio *folio)
case MEMORY_DEVICE_COHERENT:
if (WARN_ON_ONCE(!pgmap->ops || !pgmap->ops->folio_free))
break;
+ free_zone_device_folio_prepare(folio);
pgmap->ops->folio_free(folio, order);
percpu_ref_put_many(&folio->pgmap->ref, nr);
break;
--
2.43.0
^ permalink raw reply [flat|nested] 32+ messages in thread
* Re: [PATCH v4 2/7] mm/zone_device: Add free_zone_device_folio_prepare() helper
2026-01-11 20:55 ` [PATCH v4 2/7] mm/zone_device: Add free_zone_device_folio_prepare() helper Francois Dugast
@ 2026-01-12 0:44 ` Balbir Singh
2026-01-12 1:16 ` Matthew Brost
0 siblings, 1 reply; 32+ messages in thread
From: Balbir Singh @ 2026-01-12 0:44 UTC (permalink / raw)
To: Francois Dugast, intel-xe
Cc: dri-devel, Matthew Brost, Zi Yan, David Hildenbrand,
Oscar Salvador, Andrew Morton, Lorenzo Stoakes, Liam R . Howlett,
Vlastimil Babka, Mike Rapoport, Suren Baghdasaryan, Michal Hocko,
Alistair Popple, linux-mm, linux-cxl, linux-kernel
On 1/12/26 06:55, Francois Dugast wrote:
> From: Matthew Brost <matthew.brost@intel.com>
>
> Add free_zone_device_folio_prepare(), a helper that restores large
> ZONE_DEVICE folios to a sane, initial state before freeing them.
>
> Compound ZONE_DEVICE folios overwrite per-page state (e.g. pgmap and
> compound metadata). Before returning such pages to the device pgmap
> allocator, each constituent page must be reset to a standalone
> ZONE_DEVICE folio with a valid pgmap and no compound state.
>
> Use this helper prior to folio_free() for device-private and
> device-coherent folios to ensure consistent device page state for
> subsequent allocations.
>
> Fixes: d245f9b4ab80 ("mm/zone_device: support large zone device private folios")
> Cc: Zi Yan <ziy@nvidia.com>
> Cc: David Hildenbrand <david@kernel.org>
> Cc: Oscar Salvador <osalvador@suse.de>
> Cc: Andrew Morton <akpm@linux-foundation.org>
> Cc: Balbir Singh <balbirs@nvidia.com>
> Cc: Lorenzo Stoakes <lorenzo.stoakes@oracle.com>
> Cc: Liam R. Howlett <Liam.Howlett@oracle.com>
> Cc: Vlastimil Babka <vbabka@suse.cz>
> Cc: Mike Rapoport <rppt@kernel.org>
> Cc: Suren Baghdasaryan <surenb@google.com>
> Cc: Michal Hocko <mhocko@suse.com>
> Cc: Alistair Popple <apopple@nvidia.com>
> Cc: linux-mm@kvack.org
> Cc: linux-cxl@vger.kernel.org
> Cc: linux-kernel@vger.kernel.org
> Suggested-by: Alistair Popple <apopple@nvidia.com>
> Signed-off-by: Matthew Brost <matthew.brost@intel.com>
> Signed-off-by: Francois Dugast <francois.dugast@intel.com>
> ---
> include/linux/memremap.h | 1 +
> mm/memremap.c | 55 ++++++++++++++++++++++++++++++++++++++++
> 2 files changed, 56 insertions(+)
>
> diff --git a/include/linux/memremap.h b/include/linux/memremap.h
> index 97fcffeb1c1e..88e1d4707296 100644
> --- a/include/linux/memremap.h
> +++ b/include/linux/memremap.h
> @@ -230,6 +230,7 @@ static inline bool is_fsdax_page(const struct page *page)
>
> #ifdef CONFIG_ZONE_DEVICE
> void zone_device_page_init(struct page *page, unsigned int order);
> +void free_zone_device_folio_prepare(struct folio *folio);
> void *memremap_pages(struct dev_pagemap *pgmap, int nid);
> void memunmap_pages(struct dev_pagemap *pgmap);
> void *devm_memremap_pages(struct device *dev, struct dev_pagemap *pgmap);
> diff --git a/mm/memremap.c b/mm/memremap.c
> index 39dc4bd190d0..375a61e18858 100644
> --- a/mm/memremap.c
> +++ b/mm/memremap.c
> @@ -413,6 +413,60 @@ struct dev_pagemap *get_dev_pagemap(unsigned long pfn)
> }
> EXPORT_SYMBOL_GPL(get_dev_pagemap);
>
> +/**
> + * free_zone_device_folio_prepare() - Prepare a ZONE_DEVICE folio for freeing.
> + * @folio: ZONE_DEVICE folio to prepare for release.
> + *
> + * ZONE_DEVICE pages/folios (e.g., device-private memory or fsdax-backed pages)
> + * can be compound. When freeing a compound ZONE_DEVICE folio, the tail pages
> + * must be restored to a sane ZONE_DEVICE state before they are released.
> + *
> + * This helper:
> + * - Clears @folio->mapping and, for compound folios, clears each page's
> + * compound-head state (ClearPageHead()/clear_compound_head()).
> + * - Resets the compound order metadata (folio_reset_order()) and then
> + * initializes each constituent page as a standalone ZONE_DEVICE folio:
> + * * clears ->mapping
> + * * restores ->pgmap (prep_compound_page() overwrites it)
> + * * clears ->share (only relevant for fsdax; unused for device-private)
> + *
> + * If @folio is order-0, only the mapping is cleared and no further work is
> + * required.
> + */
> +void free_zone_device_folio_prepare(struct folio *folio)
> +{
> + struct dev_pagemap *pgmap = page_pgmap(&folio->page);
> + int order, i;
> +
> + VM_WARN_ON_FOLIO(!folio_is_zone_device(folio), folio);
> +
> + folio->mapping = NULL;
> + order = folio_order(folio);
> + if (!order)
> + return;
> +
> + folio_reset_order(folio);
> +
> + for (i = 0; i < (1UL << order); i++) {
> + struct page *page = folio_page(folio, i);
> + struct folio *new_folio = (struct folio *)page;
> +
> + ClearPageHead(page);
> + clear_compound_head(page);
> +
> + new_folio->mapping = NULL;
> + /*
> + * Reset pgmap which was over-written by
> + * prep_compound_page().
> + */
> + new_folio->pgmap = pgmap;
> + new_folio->share = 0; /* fsdax only, unused for device private */
> + VM_WARN_ON_FOLIO(folio_ref_count(new_folio), new_folio);
> + VM_WARN_ON_FOLIO(!folio_is_zone_device(new_folio), new_folio);
Does calling the free_folio() callback on new_folio solve the issue you are facing, or is
that PMD_ORDER more frees than we'd like?
> + }
> +}
> +EXPORT_SYMBOL_GPL(free_zone_device_folio_prepare);
> +
> void free_zone_device_folio(struct folio *folio)
> {
> struct dev_pagemap *pgmap = folio->pgmap;
> @@ -454,6 +508,7 @@ void free_zone_device_folio(struct folio *folio)
> case MEMORY_DEVICE_COHERENT:
> if (WARN_ON_ONCE(!pgmap->ops || !pgmap->ops->folio_free))
> break;
> + free_zone_device_folio_prepare(folio);
> pgmap->ops->folio_free(folio, order);
> percpu_ref_put_many(&folio->pgmap->ref, nr);
> break;
Balbir
^ permalink raw reply [flat|nested] 32+ messages in thread
* Re: [PATCH v4 2/7] mm/zone_device: Add free_zone_device_folio_prepare() helper
2026-01-12 0:44 ` Balbir Singh
@ 2026-01-12 1:16 ` Matthew Brost
2026-01-12 2:15 ` Balbir Singh
0 siblings, 1 reply; 32+ messages in thread
From: Matthew Brost @ 2026-01-12 1:16 UTC (permalink / raw)
To: Balbir Singh
Cc: Francois Dugast, intel-xe, dri-devel, Zi Yan, David Hildenbrand,
Oscar Salvador, Andrew Morton, Lorenzo Stoakes, Liam R . Howlett,
Vlastimil Babka, Mike Rapoport, Suren Baghdasaryan, Michal Hocko,
Alistair Popple, linux-mm, linux-cxl, linux-kernel
On Mon, Jan 12, 2026 at 11:44:15AM +1100, Balbir Singh wrote:
> On 1/12/26 06:55, Francois Dugast wrote:
> > From: Matthew Brost <matthew.brost@intel.com>
> >
> > Add free_zone_device_folio_prepare(), a helper that restores large
> > ZONE_DEVICE folios to a sane, initial state before freeing them.
> >
> > Compound ZONE_DEVICE folios overwrite per-page state (e.g. pgmap and
> > compound metadata). Before returning such pages to the device pgmap
> > allocator, each constituent page must be reset to a standalone
> > ZONE_DEVICE folio with a valid pgmap and no compound state.
> >
> > Use this helper prior to folio_free() for device-private and
> > device-coherent folios to ensure consistent device page state for
> > subsequent allocations.
> >
> > Fixes: d245f9b4ab80 ("mm/zone_device: support large zone device private folios")
> > Cc: Zi Yan <ziy@nvidia.com>
> > Cc: David Hildenbrand <david@kernel.org>
> > Cc: Oscar Salvador <osalvador@suse.de>
> > Cc: Andrew Morton <akpm@linux-foundation.org>
> > Cc: Balbir Singh <balbirs@nvidia.com>
> > Cc: Lorenzo Stoakes <lorenzo.stoakes@oracle.com>
> > Cc: Liam R. Howlett <Liam.Howlett@oracle.com>
> > Cc: Vlastimil Babka <vbabka@suse.cz>
> > Cc: Mike Rapoport <rppt@kernel.org>
> > Cc: Suren Baghdasaryan <surenb@google.com>
> > Cc: Michal Hocko <mhocko@suse.com>
> > Cc: Alistair Popple <apopple@nvidia.com>
> > Cc: linux-mm@kvack.org
> > Cc: linux-cxl@vger.kernel.org
> > Cc: linux-kernel@vger.kernel.org
> > Suggested-by: Alistair Popple <apopple@nvidia.com>
> > Signed-off-by: Matthew Brost <matthew.brost@intel.com>
> > Signed-off-by: Francois Dugast <francois.dugast@intel.com>
> > ---
> > include/linux/memremap.h | 1 +
> > mm/memremap.c | 55 ++++++++++++++++++++++++++++++++++++++++
> > 2 files changed, 56 insertions(+)
> >
> > diff --git a/include/linux/memremap.h b/include/linux/memremap.h
> > index 97fcffeb1c1e..88e1d4707296 100644
> > --- a/include/linux/memremap.h
> > +++ b/include/linux/memremap.h
> > @@ -230,6 +230,7 @@ static inline bool is_fsdax_page(const struct page *page)
> >
> > #ifdef CONFIG_ZONE_DEVICE
> > void zone_device_page_init(struct page *page, unsigned int order);
> > +void free_zone_device_folio_prepare(struct folio *folio);
> > void *memremap_pages(struct dev_pagemap *pgmap, int nid);
> > void memunmap_pages(struct dev_pagemap *pgmap);
> > void *devm_memremap_pages(struct device *dev, struct dev_pagemap *pgmap);
> > diff --git a/mm/memremap.c b/mm/memremap.c
> > index 39dc4bd190d0..375a61e18858 100644
> > --- a/mm/memremap.c
> > +++ b/mm/memremap.c
> > @@ -413,6 +413,60 @@ struct dev_pagemap *get_dev_pagemap(unsigned long pfn)
> > }
> > EXPORT_SYMBOL_GPL(get_dev_pagemap);
> >
> > +/**
> > + * free_zone_device_folio_prepare() - Prepare a ZONE_DEVICE folio for freeing.
> > + * @folio: ZONE_DEVICE folio to prepare for release.
> > + *
> > + * ZONE_DEVICE pages/folios (e.g., device-private memory or fsdax-backed pages)
> > + * can be compound. When freeing a compound ZONE_DEVICE folio, the tail pages
> > + * must be restored to a sane ZONE_DEVICE state before they are released.
> > + *
> > + * This helper:
> > + * - Clears @folio->mapping and, for compound folios, clears each page's
> > + * compound-head state (ClearPageHead()/clear_compound_head()).
> > + * - Resets the compound order metadata (folio_reset_order()) and then
> > + * initializes each constituent page as a standalone ZONE_DEVICE folio:
> > + * * clears ->mapping
> > + * * restores ->pgmap (prep_compound_page() overwrites it)
> > + * * clears ->share (only relevant for fsdax; unused for device-private)
> > + *
> > + * If @folio is order-0, only the mapping is cleared and no further work is
> > + * required.
> > + */
> > +void free_zone_device_folio_prepare(struct folio *folio)
> > +{
> > + struct dev_pagemap *pgmap = page_pgmap(&folio->page);
> > + int order, i;
> > +
> > + VM_WARN_ON_FOLIO(!folio_is_zone_device(folio), folio);
> > +
> > + folio->mapping = NULL;
> > + order = folio_order(folio);
> > + if (!order)
> > + return;
> > +
> > + folio_reset_order(folio);
> > +
> > + for (i = 0; i < (1UL << order); i++) {
> > + struct page *page = folio_page(folio, i);
> > + struct folio *new_folio = (struct folio *)page;
> > +
> > + ClearPageHead(page);
> > + clear_compound_head(page);
> > +
> > + new_folio->mapping = NULL;
> > + /*
> > + * Reset pgmap which was over-written by
> > + * prep_compound_page().
> > + */
> > + new_folio->pgmap = pgmap;
> > + new_folio->share = 0; /* fsdax only, unused for device private */
> > + VM_WARN_ON_FOLIO(folio_ref_count(new_folio), new_folio);
> > + VM_WARN_ON_FOLIO(!folio_is_zone_device(new_folio), new_folio);
>
> Does calling the free_folio() callback on new_folio solve the issue you are facing, or is
> that PMD_ORDER more frees than we'd like?
>
No, calling free_folio() more often doesn’t solve anything—in fact, that
would make my implementation explode. I explained this in detail here [1]
to Zi.
To recap [1], my memory allocator has no visibility into individual
pages or folios; it is DRM Buddy layered on top of TTM BO. This design
allows VRAM to be allocated or evicted for both traditional GPU
allocations (GEMs) and SVM allocations.
Now, to recap the actual issue: if device folios are not split upon free
and are later reallocated with a different order in
zone_device_page_init, the implementation breaks. This problem is not
specific to Xe—Nouveau happens to always allocate at the same order, so
it works by coincidence. Reallocating at a different order is valid
behavior and must be supported.
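To spell out the failing pattern (pseudo-flow only; the free in the
middle happens through the normal refcount path, and the two calls below
are just to show the shape of the reuse):

	/* back a 2MB range with a single order-9 device folio */
	zone_device_page_init(pfn_to_page(pfn), 9);
	/* ... last put, ->folio_free() runs ... */
	/* later, reuse only the upper half at a smaller order */
	zone_device_page_init(pfn_to_page(pfn + 256), 8);

Without a split/reset on free, the second init starts from a page that
is still marked as a tail of the old order-9 folio.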
Matt
[1] https://patchwork.freedesktop.org/patch/697710/?series=159119&rev=3#comment_1282413
> > + }
> > +}
> > +EXPORT_SYMBOL_GPL(free_zone_device_folio_prepare);
> > +
> > void free_zone_device_folio(struct folio *folio)
> > {
> > struct dev_pagemap *pgmap = folio->pgmap;
> > @@ -454,6 +508,7 @@ void free_zone_device_folio(struct folio *folio)
> > case MEMORY_DEVICE_COHERENT:
> > if (WARN_ON_ONCE(!pgmap->ops || !pgmap->ops->folio_free))
> > break;
> > + free_zone_device_folio_prepare(folio);
> > pgmap->ops->folio_free(folio, order);
> > percpu_ref_put_many(&folio->pgmap->ref, nr);
> > break;
>
> Balbir
^ permalink raw reply [flat|nested] 32+ messages in thread
* Re: [PATCH v4 2/7] mm/zone_device: Add free_zone_device_folio_prepare() helper
2026-01-12 1:16 ` Matthew Brost
@ 2026-01-12 2:15 ` Balbir Singh
2026-01-12 2:37 ` Matthew Brost
0 siblings, 1 reply; 32+ messages in thread
From: Balbir Singh @ 2026-01-12 2:15 UTC (permalink / raw)
To: Matthew Brost
Cc: Francois Dugast, intel-xe, dri-devel, Zi Yan, David Hildenbrand,
Oscar Salvador, Andrew Morton, Lorenzo Stoakes, Liam R . Howlett,
Vlastimil Babka, Mike Rapoport, Suren Baghdasaryan, Michal Hocko,
Alistair Popple, linux-mm, linux-cxl, linux-kernel
On 1/12/26 11:16, Matthew Brost wrote:
> On Mon, Jan 12, 2026 at 11:44:15AM +1100, Balbir Singh wrote:
>> On 1/12/26 06:55, Francois Dugast wrote:
>>> From: Matthew Brost <matthew.brost@intel.com>
>>>
>>> Add free_zone_device_folio_prepare(), a helper that restores large
>>> ZONE_DEVICE folios to a sane, initial state before freeing them.
>>>
>>> Compound ZONE_DEVICE folios overwrite per-page state (e.g. pgmap and
>>> compound metadata). Before returning such pages to the device pgmap
>>> allocator, each constituent page must be reset to a standalone
>>> ZONE_DEVICE folio with a valid pgmap and no compound state.
>>>
>>> Use this helper prior to folio_free() for device-private and
>>> device-coherent folios to ensure consistent device page state for
>>> subsequent allocations.
>>>
>>> Fixes: d245f9b4ab80 ("mm/zone_device: support large zone device private folios")
>>> Cc: Zi Yan <ziy@nvidia.com>
>>> Cc: David Hildenbrand <david@kernel.org>
>>> Cc: Oscar Salvador <osalvador@suse.de>
>>> Cc: Andrew Morton <akpm@linux-foundation.org>
>>> Cc: Balbir Singh <balbirs@nvidia.com>
>>> Cc: Lorenzo Stoakes <lorenzo.stoakes@oracle.com>
>>> Cc: Liam R. Howlett <Liam.Howlett@oracle.com>
>>> Cc: Vlastimil Babka <vbabka@suse.cz>
>>> Cc: Mike Rapoport <rppt@kernel.org>
>>> Cc: Suren Baghdasaryan <surenb@google.com>
>>> Cc: Michal Hocko <mhocko@suse.com>
>>> Cc: Alistair Popple <apopple@nvidia.com>
>>> Cc: linux-mm@kvack.org
>>> Cc: linux-cxl@vger.kernel.org
>>> Cc: linux-kernel@vger.kernel.org
>>> Suggested-by: Alistair Popple <apopple@nvidia.com>
>>> Signed-off-by: Matthew Brost <matthew.brost@intel.com>
>>> Signed-off-by: Francois Dugast <francois.dugast@intel.com>
>>> ---
>>> include/linux/memremap.h | 1 +
>>> mm/memremap.c | 55 ++++++++++++++++++++++++++++++++++++++++
>>> 2 files changed, 56 insertions(+)
>>>
>>> diff --git a/include/linux/memremap.h b/include/linux/memremap.h
>>> index 97fcffeb1c1e..88e1d4707296 100644
>>> --- a/include/linux/memremap.h
>>> +++ b/include/linux/memremap.h
>>> @@ -230,6 +230,7 @@ static inline bool is_fsdax_page(const struct page *page)
>>>
>>> #ifdef CONFIG_ZONE_DEVICE
>>> void zone_device_page_init(struct page *page, unsigned int order);
>>> +void free_zone_device_folio_prepare(struct folio *folio);
>>> void *memremap_pages(struct dev_pagemap *pgmap, int nid);
>>> void memunmap_pages(struct dev_pagemap *pgmap);
>>> void *devm_memremap_pages(struct device *dev, struct dev_pagemap *pgmap);
>>> diff --git a/mm/memremap.c b/mm/memremap.c
>>> index 39dc4bd190d0..375a61e18858 100644
>>> --- a/mm/memremap.c
>>> +++ b/mm/memremap.c
>>> @@ -413,6 +413,60 @@ struct dev_pagemap *get_dev_pagemap(unsigned long pfn)
>>> }
>>> EXPORT_SYMBOL_GPL(get_dev_pagemap);
>>>
>>> +/**
>>> + * free_zone_device_folio_prepare() - Prepare a ZONE_DEVICE folio for freeing.
>>> + * @folio: ZONE_DEVICE folio to prepare for release.
>>> + *
>>> + * ZONE_DEVICE pages/folios (e.g., device-private memory or fsdax-backed pages)
>>> + * can be compound. When freeing a compound ZONE_DEVICE folio, the tail pages
>>> + * must be restored to a sane ZONE_DEVICE state before they are released.
>>> + *
>>> + * This helper:
>>> + * - Clears @folio->mapping and, for compound folios, clears each page's
>>> + * compound-head state (ClearPageHead()/clear_compound_head()).
>>> + * - Resets the compound order metadata (folio_reset_order()) and then
>>> + * initializes each constituent page as a standalone ZONE_DEVICE folio:
>>> + * * clears ->mapping
>>> + * * restores ->pgmap (prep_compound_page() overwrites it)
>>> + * * clears ->share (only relevant for fsdax; unused for device-private)
>>> + *
>>> + * If @folio is order-0, only the mapping is cleared and no further work is
>>> + * required.
>>> + */
>>> +void free_zone_device_folio_prepare(struct folio *folio)
>>> +{
>>> + struct dev_pagemap *pgmap = page_pgmap(&folio->page);
>>> + int order, i;
>>> +
>>> + VM_WARN_ON_FOLIO(!folio_is_zone_device(folio), folio);
>>> +
>>> + folio->mapping = NULL;
>>> + order = folio_order(folio);
>>> + if (!order)
>>> + return;
>>> +
>>> + folio_reset_order(folio);
>>> +
>>> + for (i = 0; i < (1UL << order); i++) {
>>> + struct page *page = folio_page(folio, i);
>>> + struct folio *new_folio = (struct folio *)page;
>>> +
>>> + ClearPageHead(page);
>>> + clear_compound_head(page);
>>> +
>>> + new_folio->mapping = NULL;
>>> + /*
>>> + * Reset pgmap which was over-written by
>>> + * prep_compound_page().
>>> + */
>>> + new_folio->pgmap = pgmap;
>>> + new_folio->share = 0; /* fsdax only, unused for device private */
>>> + VM_WARN_ON_FOLIO(folio_ref_count(new_folio), new_folio);
>>> + VM_WARN_ON_FOLIO(!folio_is_zone_device(new_folio), new_folio);
>>
>> Does calling the free_folio() callback on new_folio solve the issue you are facing, or is
>> that PMD_ORDER more frees than we'd like?
>>
>
> No, calling free_folio() more often doesn’t solve anything—in fact, that
> would make my implementation explode. I explained this in detail here [1]
> to Zi.
>
> To recap [1], my memory allocator has no visibility into individual
> pages or folios; it is DRM Buddy layered on top of TTM BO. This design
> allows VRAM to be allocated or evicted for both traditional GPU
> allocations (GEMs) and SVM allocations.
>
I assume it is still backed by pages that are ref counted? I suspect you'd
need to convert one reference count to PMD_ORDER reference counts to make
this change work, or are the references not at page granularity?
I followed the code through drm_zdd_pagemap_put() and zdd->refcount seemed
like a per folio refcount
> Now, to recap the actual issue: if device folios are not split upon free
> and are later reallocated with a different order in
> zone_device_page_init, the implementation breaks. This problem is not
> specific to Xe—Nouveau happens to always allocate at the same order, so
> it works by coincidence. Reallocating at a different order is valid
> behavior and must be supported.
>
Agreed
> Matt
>
> [1] https://patchwork.freedesktop.org/patch/697710/?series=159119&rev=3#comment_1282413
>
>>> + }
>>> +}
>>> +EXPORT_SYMBOL_GPL(free_zone_device_folio_prepare);
>>> +
>>> void free_zone_device_folio(struct folio *folio)
>>> {
>>> struct dev_pagemap *pgmap = folio->pgmap;
>>> @@ -454,6 +508,7 @@ void free_zone_device_folio(struct folio *folio)
>>> case MEMORY_DEVICE_COHERENT:
>>> if (WARN_ON_ONCE(!pgmap->ops || !pgmap->ops->folio_free))
>>> break;
>>> + free_zone_device_folio_prepare(folio);
>>> pgmap->ops->folio_free(folio, order);
>>> percpu_ref_put_many(&folio->pgmap->ref, nr);
>>> break;
>>
>> Balbir
^ permalink raw reply [flat|nested] 32+ messages in thread
* Re: [PATCH v4 2/7] mm/zone_device: Add free_zone_device_folio_prepare() helper
2026-01-12 2:15 ` Balbir Singh
@ 2026-01-12 2:37 ` Matthew Brost
2026-01-12 2:50 ` Matthew Brost
0 siblings, 1 reply; 32+ messages in thread
From: Matthew Brost @ 2026-01-12 2:37 UTC (permalink / raw)
To: Balbir Singh
Cc: Francois Dugast, intel-xe, dri-devel, Zi Yan, David Hildenbrand,
Oscar Salvador, Andrew Morton, Lorenzo Stoakes, Liam R . Howlett,
Vlastimil Babka, Mike Rapoport, Suren Baghdasaryan, Michal Hocko,
Alistair Popple, linux-mm, linux-cxl, linux-kernel
On Mon, Jan 12, 2026 at 01:15:12PM +1100, Balbir Singh wrote:
> On 1/12/26 11:16, Matthew Brost wrote:
> > On Mon, Jan 12, 2026 at 11:44:15AM +1100, Balbir Singh wrote:
> >> On 1/12/26 06:55, Francois Dugast wrote:
> >>> From: Matthew Brost <matthew.brost@intel.com>
> >>>
> >>> Add free_zone_device_folio_prepare(), a helper that restores large
> >>> ZONE_DEVICE folios to a sane, initial state before freeing them.
> >>>
> >>> Compound ZONE_DEVICE folios overwrite per-page state (e.g. pgmap and
> >>> compound metadata). Before returning such pages to the device pgmap
> >>> allocator, each constituent page must be reset to a standalone
> >>> ZONE_DEVICE folio with a valid pgmap and no compound state.
> >>>
> >>> Use this helper prior to folio_free() for device-private and
> >>> device-coherent folios to ensure consistent device page state for
> >>> subsequent allocations.
> >>>
> >>> Fixes: d245f9b4ab80 ("mm/zone_device: support large zone device private folios")
> >>> Cc: Zi Yan <ziy@nvidia.com>
> >>> Cc: David Hildenbrand <david@kernel.org>
> >>> Cc: Oscar Salvador <osalvador@suse.de>
> >>> Cc: Andrew Morton <akpm@linux-foundation.org>
> >>> Cc: Balbir Singh <balbirs@nvidia.com>
> >>> Cc: Lorenzo Stoakes <lorenzo.stoakes@oracle.com>
> >>> Cc: Liam R. Howlett <Liam.Howlett@oracle.com>
> >>> Cc: Vlastimil Babka <vbabka@suse.cz>
> >>> Cc: Mike Rapoport <rppt@kernel.org>
> >>> Cc: Suren Baghdasaryan <surenb@google.com>
> >>> Cc: Michal Hocko <mhocko@suse.com>
> >>> Cc: Alistair Popple <apopple@nvidia.com>
> >>> Cc: linux-mm@kvack.org
> >>> Cc: linux-cxl@vger.kernel.org
> >>> Cc: linux-kernel@vger.kernel.org
> >>> Suggested-by: Alistair Popple <apopple@nvidia.com>
> >>> Signed-off-by: Matthew Brost <matthew.brost@intel.com>
> >>> Signed-off-by: Francois Dugast <francois.dugast@intel.com>
> >>> ---
> >>> include/linux/memremap.h | 1 +
> >>> mm/memremap.c | 55 ++++++++++++++++++++++++++++++++++++++++
> >>> 2 files changed, 56 insertions(+)
> >>>
> >>> diff --git a/include/linux/memremap.h b/include/linux/memremap.h
> >>> index 97fcffeb1c1e..88e1d4707296 100644
> >>> --- a/include/linux/memremap.h
> >>> +++ b/include/linux/memremap.h
> >>> @@ -230,6 +230,7 @@ static inline bool is_fsdax_page(const struct page *page)
> >>>
> >>> #ifdef CONFIG_ZONE_DEVICE
> >>> void zone_device_page_init(struct page *page, unsigned int order);
> >>> +void free_zone_device_folio_prepare(struct folio *folio);
> >>> void *memremap_pages(struct dev_pagemap *pgmap, int nid);
> >>> void memunmap_pages(struct dev_pagemap *pgmap);
> >>> void *devm_memremap_pages(struct device *dev, struct dev_pagemap *pgmap);
> >>> diff --git a/mm/memremap.c b/mm/memremap.c
> >>> index 39dc4bd190d0..375a61e18858 100644
> >>> --- a/mm/memremap.c
> >>> +++ b/mm/memremap.c
> >>> @@ -413,6 +413,60 @@ struct dev_pagemap *get_dev_pagemap(unsigned long pfn)
> >>> }
> >>> EXPORT_SYMBOL_GPL(get_dev_pagemap);
> >>>
> >>> +/**
> >>> + * free_zone_device_folio_prepare() - Prepare a ZONE_DEVICE folio for freeing.
> >>> + * @folio: ZONE_DEVICE folio to prepare for release.
> >>> + *
> >>> + * ZONE_DEVICE pages/folios (e.g., device-private memory or fsdax-backed pages)
> >>> + * can be compound. When freeing a compound ZONE_DEVICE folio, the tail pages
> >>> + * must be restored to a sane ZONE_DEVICE state before they are released.
> >>> + *
> >>> + * This helper:
> >>> + * - Clears @folio->mapping and, for compound folios, clears each page's
> >>> + * compound-head state (ClearPageHead()/clear_compound_head()).
> >>> + * - Resets the compound order metadata (folio_reset_order()) and then
> >>> + * initializes each constituent page as a standalone ZONE_DEVICE folio:
> >>> + * * clears ->mapping
> >>> + * * restores ->pgmap (prep_compound_page() overwrites it)
> >>> + * * clears ->share (only relevant for fsdax; unused for device-private)
> >>> + *
> >>> + * If @folio is order-0, only the mapping is cleared and no further work is
> >>> + * required.
> >>> + */
> >>> +void free_zone_device_folio_prepare(struct folio *folio)
> >>> +{
> >>> + struct dev_pagemap *pgmap = page_pgmap(&folio->page);
> >>> + int order, i;
> >>> +
> >>> + VM_WARN_ON_FOLIO(!folio_is_zone_device(folio), folio);
> >>> +
> >>> + folio->mapping = NULL;
> >>> + order = folio_order(folio);
> >>> + if (!order)
> >>> + return;
> >>> +
> >>> + folio_reset_order(folio);
> >>> +
> >>> + for (i = 0; i < (1UL << order); i++) {
> >>> + struct page *page = folio_page(folio, i);
> >>> + struct folio *new_folio = (struct folio *)page;
> >>> +
> >>> + ClearPageHead(page);
> >>> + clear_compound_head(page);
> >>> +
> >>> + new_folio->mapping = NULL;
> >>> + /*
> >>> + * Reset pgmap which was over-written by
> >>> + * prep_compound_page().
> >>> + */
> >>> + new_folio->pgmap = pgmap;
> >>> + new_folio->share = 0; /* fsdax only, unused for device private */
> >>> + VM_WARN_ON_FOLIO(folio_ref_count(new_folio), new_folio);
> >>> + VM_WARN_ON_FOLIO(!folio_is_zone_device(new_folio), new_folio);
> >>
> >> Does calling the free_folio() callback on new_folio solve the issue you are facing, or is
> >> that PMD_ORDER more frees than we'd like?
> >>
> >
> > No, calling free_folio() more often doesn’t solve anything—in fact, that
> > would make my implementation explode. I explained this in detail here [1]
> > to Zi.
> >
> > To recap [1], my memory allocator has no visibility into individual
> > pages or folios; it is DRM Buddy layered on top of TTM BO. This design
> > allows VRAM to be allocated or evicted for both traditional GPU
> > allocations (GEMs) and SVM allocations.
> >
>
> I assume it is still backed by pages that are ref counted? I suspect you'd
Yes.
> need to convert one reference count to PMD_ORDER reference counts to make
> this change work, or are the references not at page granularity?
>
> I followed the code through drm_zdd_pagemap_put() and zdd->refcount seemed
> like a per folio refcount
>
The refcount is incremented by 1 for each call to
folio_set_zone_device_data. If we have a 2MB device folio backing a
2MB allocation, the refcount is 1. If we have 512 4KB device pages
backing a 2MB allocation, the refcount is 512. The refcount matches the
number of folio_free calls we expect to receive for the size of the
backing allocation. Right now, in Xe, we allocate either 4k, 64k or 2M
but this is all configurable via a driver-side table (Xe) in GPU SVM (drm
common layer).
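In other words (an illustrative helper only, not code from GPU SVM; the
name is made up):

	static unsigned int expected_folio_free_calls(size_t alloc_size,
						      unsigned int order)
	{
		/* one ->folio_free() call per device folio backing the allocation */
		return alloc_size >> (PAGE_SHIFT + order);	/* 2M @ order 9 -> 1, 2M @ order 0 -> 512 */
	}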
Matt
> > Now, to recap the actual issue: if device folios are not split upon free
> > and are later reallocated with a different order in
> > zone_device_page_init, the implementation breaks. This problem is not
> > specific to Xe—Nouveau happens to always allocate at the same order, so
> > it works by coincidence. Reallocating at a different order is valid
> > behavior and must be supported.
> >
>
> Agreed
>
> > Matt
> >
> > [1] https://patchwork.freedesktop.org/patch/697710/?series=159119&rev=3#comment_1282413
> >
> >>> + }
> >>> +}
> >>> +EXPORT_SYMBOL_GPL(free_zone_device_folio_prepare);
> >>> +
> >>> void free_zone_device_folio(struct folio *folio)
> >>> {
> >>> struct dev_pagemap *pgmap = folio->pgmap;
> >>> @@ -454,6 +508,7 @@ void free_zone_device_folio(struct folio *folio)
> >>> case MEMORY_DEVICE_COHERENT:
> >>> if (WARN_ON_ONCE(!pgmap->ops || !pgmap->ops->folio_free))
> >>> break;
> >>> + free_zone_device_folio_prepare(folio);
> >>> pgmap->ops->folio_free(folio, order);
> >>> percpu_ref_put_many(&folio->pgmap->ref, nr);
> >>> break;
> >>
> >> Balbir
>
^ permalink raw reply [flat|nested] 32+ messages in thread
* Re: [PATCH v4 2/7] mm/zone_device: Add free_zone_device_folio_prepare() helper
2026-01-12 2:37 ` Matthew Brost
@ 2026-01-12 2:50 ` Matthew Brost
0 siblings, 0 replies; 32+ messages in thread
From: Matthew Brost @ 2026-01-12 2:50 UTC (permalink / raw)
To: Balbir Singh
Cc: Francois Dugast, intel-xe, dri-devel, Zi Yan, David Hildenbrand,
Oscar Salvador, Andrew Morton, Lorenzo Stoakes, Liam R . Howlett,
Vlastimil Babka, Mike Rapoport, Suren Baghdasaryan, Michal Hocko,
Alistair Popple, linux-mm, linux-cxl, linux-kernel
On Sun, Jan 11, 2026 at 06:37:06PM -0800, Matthew Brost wrote:
> On Mon, Jan 12, 2026 at 01:15:12PM +1100, Balbir Singh wrote:
> > On 1/12/26 11:16, Matthew Brost wrote:
> > > On Mon, Jan 12, 2026 at 11:44:15AM +1100, Balbir Singh wrote:
> > >> On 1/12/26 06:55, Francois Dugast wrote:
> > >>> From: Matthew Brost <matthew.brost@intel.com>
> > >>>
> > >>> Add free_zone_device_folio_prepare(), a helper that restores large
> > >>> ZONE_DEVICE folios to a sane, initial state before freeing them.
> > >>>
> > >>> Compound ZONE_DEVICE folios overwrite per-page state (e.g. pgmap and
> > >>> compound metadata). Before returning such pages to the device pgmap
> > >>> allocator, each constituent page must be reset to a standalone
> > >>> ZONE_DEVICE folio with a valid pgmap and no compound state.
> > >>>
> > >>> Use this helper prior to folio_free() for device-private and
> > >>> device-coherent folios to ensure consistent device page state for
> > >>> subsequent allocations.
> > >>>
> > >>> Fixes: d245f9b4ab80 ("mm/zone_device: support large zone device private folios")
> > >>> Cc: Zi Yan <ziy@nvidia.com>
> > >>> Cc: David Hildenbrand <david@kernel.org>
> > >>> Cc: Oscar Salvador <osalvador@suse.de>
> > >>> Cc: Andrew Morton <akpm@linux-foundation.org>
> > >>> Cc: Balbir Singh <balbirs@nvidia.com>
> > >>> Cc: Lorenzo Stoakes <lorenzo.stoakes@oracle.com>
> > >>> Cc: Liam R. Howlett <Liam.Howlett@oracle.com>
> > >>> Cc: Vlastimil Babka <vbabka@suse.cz>
> > >>> Cc: Mike Rapoport <rppt@kernel.org>
> > >>> Cc: Suren Baghdasaryan <surenb@google.com>
> > >>> Cc: Michal Hocko <mhocko@suse.com>
> > >>> Cc: Alistair Popple <apopple@nvidia.com>
> > >>> Cc: linux-mm@kvack.org
> > >>> Cc: linux-cxl@vger.kernel.org
> > >>> Cc: linux-kernel@vger.kernel.org
> > >>> Suggested-by: Alistair Popple <apopple@nvidia.com>
> > >>> Signed-off-by: Matthew Brost <matthew.brost@intel.com>
> > >>> Signed-off-by: Francois Dugast <francois.dugast@intel.com>
> > >>> ---
> > >>> include/linux/memremap.h | 1 +
> > >>> mm/memremap.c | 55 ++++++++++++++++++++++++++++++++++++++++
> > >>> 2 files changed, 56 insertions(+)
> > >>>
> > >>> diff --git a/include/linux/memremap.h b/include/linux/memremap.h
> > >>> index 97fcffeb1c1e..88e1d4707296 100644
> > >>> --- a/include/linux/memremap.h
> > >>> +++ b/include/linux/memremap.h
> > >>> @@ -230,6 +230,7 @@ static inline bool is_fsdax_page(const struct page *page)
> > >>>
> > >>> #ifdef CONFIG_ZONE_DEVICE
> > >>> void zone_device_page_init(struct page *page, unsigned int order);
> > >>> +void free_zone_device_folio_prepare(struct folio *folio);
> > >>> void *memremap_pages(struct dev_pagemap *pgmap, int nid);
> > >>> void memunmap_pages(struct dev_pagemap *pgmap);
> > >>> void *devm_memremap_pages(struct device *dev, struct dev_pagemap *pgmap);
> > >>> diff --git a/mm/memremap.c b/mm/memremap.c
> > >>> index 39dc4bd190d0..375a61e18858 100644
> > >>> --- a/mm/memremap.c
> > >>> +++ b/mm/memremap.c
> > >>> @@ -413,6 +413,60 @@ struct dev_pagemap *get_dev_pagemap(unsigned long pfn)
> > >>> }
> > >>> EXPORT_SYMBOL_GPL(get_dev_pagemap);
> > >>>
> > >>> +/**
> > >>> + * free_zone_device_folio_prepare() - Prepare a ZONE_DEVICE folio for freeing.
> > >>> + * @folio: ZONE_DEVICE folio to prepare for release.
> > >>> + *
> > >>> + * ZONE_DEVICE pages/folios (e.g., device-private memory or fsdax-backed pages)
> > >>> + * can be compound. When freeing a compound ZONE_DEVICE folio, the tail pages
> > >>> + * must be restored to a sane ZONE_DEVICE state before they are released.
> > >>> + *
> > >>> + * This helper:
> > >>> + * - Clears @folio->mapping and, for compound folios, clears each page's
> > >>> + * compound-head state (ClearPageHead()/clear_compound_head()).
> > >>> + * - Resets the compound order metadata (folio_reset_order()) and then
> > >>> + * initializes each constituent page as a standalone ZONE_DEVICE folio:
> > >>> + * * clears ->mapping
> > >>> + * * restores ->pgmap (prep_compound_page() overwrites it)
> > >>> + * * clears ->share (only relevant for fsdax; unused for device-private)
> > >>> + *
> > >>> + * If @folio is order-0, only the mapping is cleared and no further work is
> > >>> + * required.
> > >>> + */
> > >>> +void free_zone_device_folio_prepare(struct folio *folio)
> > >>> +{
> > >>> + struct dev_pagemap *pgmap = page_pgmap(&folio->page);
> > >>> + int order, i;
> > >>> +
> > >>> + VM_WARN_ON_FOLIO(!folio_is_zone_device(folio), folio);
> > >>> +
> > >>> + folio->mapping = NULL;
> > >>> + order = folio_order(folio);
> > >>> + if (!order)
> > >>> + return;
> > >>> +
> > >>> + folio_reset_order(folio);
> > >>> +
> > >>> + for (i = 0; i < (1UL << order); i++) {
> > >>> + struct page *page = folio_page(folio, i);
> > >>> + struct folio *new_folio = (struct folio *)page;
> > >>> +
> > >>> + ClearPageHead(page);
> > >>> + clear_compound_head(page);
> > >>> +
> > >>> + new_folio->mapping = NULL;
> > >>> + /*
> > >>> + * Reset pgmap which was over-written by
> > >>> + * prep_compound_page().
> > >>> + */
> > >>> + new_folio->pgmap = pgmap;
> > >>> + new_folio->share = 0; /* fsdax only, unused for device private */
> > >>> + VM_WARN_ON_FOLIO(folio_ref_count(new_folio), new_folio);
> > >>> + VM_WARN_ON_FOLIO(!folio_is_zone_device(new_folio), new_folio);
> > >>
> > >> Does calling the free_folio() callback on new_folio solve the issue you are facing, or is
> > >> that PMD_ORDER more frees than we'd like?
> > >>
> > >
> > > No, calling free_folio() more often doesn’t solve anything—in fact, that
> > > would make my implementation explode. I explained this in detail here [1]
> > > to Zi.
> > >
> > > To recap [1], my memory allocator has no visibility into individual
> > > pages or folios; it is DRM Buddy layered on top of TTM BO. This design
> > > allows VRAM to be allocated or evicted for both traditional GPU
> > > allocations (GEMs) and SVM allocations.
> > >
> >
> > I assume it is still backed by pages that are ref counted? I suspect you'd
>
> Yes.
>
Let me clarify this a bit. We don’t track individual pages in our
refcounting; instead, we maintain a reference count for the original
allocation (i.e., there are no partial frees of the original
allocation). This refcounting is handled in GPU SVM (DRM common), and
when the allocation’s refcount reaches zero, GPU SVM calls into the
driver to indicate that the memory can be released. In Xe, the backing
memory is a TTM BO (think of this as an eviction hook), which is layered
on top of DRM Buddy (which actually controls VRAM allocation and can
determine device pages from this layer).
I suspect AMD, when using GPU SVM (they have indicated this is the
plan), will also use TTM BO here. Nova, assuming they eventually adopt
SVM and use GPU SVM, will likely implement something very similar to TTM
in Rust, but with DRM Buddy also controlling the actual allocation (they
have already written bindings for DRM buddy).
Matt
> > need to convert one reference count to PMD_ORDER reference counts to make
> > this change work, or are the references not at page granularity?
> >
> > I followed the code through drm_zdd_pagemap_put() and zdd->refcount seemed
> > like a per folio refcount
> >
>
> The refcount is incremented by 1 for each call to
> folio_set_zone_device_data. If we have a 2MB device folio backing a
> 2MB allocation, the refcount is 1. If we have 512 4KB device pages
> backing a 2MB allocation, the refcount is 512. The refcount matches the
> number of folio_free calls we expect to receive for the size of the
> backing allocation. Right now, in Xe, we allocate either 4k, 64k or 2M
> but this is all configurable via a driver-side table (Xe) in GPU SVM (drm
> common layer).
>
> Matt
>
> > > Now, to recap the actual issue: if device folios are not split upon free
> > > and are later reallocated with a different order in
> > > zone_device_page_init, the implementation breaks. This problem is not
> > > specific to Xe—Nouveau happens to always allocate at the same order, so
> > > it works by coincidence. Reallocating at a different order is valid
> > > behavior and must be supported.
> > >
> >
> > Agreed
> >
> > > Matt
> > >
> > > [1] https://patchwork.freedesktop.org/patch/697710/?series=159119&rev=3#comment_1282413
> > >
> > >>> + }
> > >>> +}
> > >>> +EXPORT_SYMBOL_GPL(free_zone_device_folio_prepare);
> > >>> +
> > >>> void free_zone_device_folio(struct folio *folio)
> > >>> {
> > >>> struct dev_pagemap *pgmap = folio->pgmap;
> > >>> @@ -454,6 +508,7 @@ void free_zone_device_folio(struct folio *folio)
> > >>> case MEMORY_DEVICE_COHERENT:
> > >>> if (WARN_ON_ONCE(!pgmap->ops || !pgmap->ops->folio_free))
> > >>> break;
> > >>> + free_zone_device_folio_prepare(folio);
> > >>> pgmap->ops->folio_free(folio, order);
> > >>> percpu_ref_put_many(&folio->pgmap->ref, nr);
> > >>> break;
> > >>
> > >> Balbir
> >
^ permalink raw reply [flat|nested] 32+ messages in thread
* [PATCH v4 3/7] fs/dax: Use free_zone_device_folio_prepare() helper
2026-01-11 20:55 [PATCH v4 0/7] Enable THP support in drm_pagemap Francois Dugast
2026-01-11 20:55 ` [PATCH v4 1/7] mm/zone_device: Add order argument to folio_free callback Francois Dugast
2026-01-11 20:55 ` [PATCH v4 2/7] mm/zone_device: Add free_zone_device_folio_prepare() helper Francois Dugast
@ 2026-01-11 20:55 ` Francois Dugast
2026-01-12 4:14 ` kernel test robot
2026-01-11 20:55 ` [PATCH v4 4/7] drm/pagemap: Unlock and put folios when possible Francois Dugast
` (3 subsequent siblings)
6 siblings, 1 reply; 32+ messages in thread
From: Francois Dugast @ 2026-01-11 20:55 UTC (permalink / raw)
To: intel-xe
Cc: dri-devel, Matthew Brost, Dan Williams, Matthew Wilcox, Jan Kara,
Alexander Viro, Christian Brauner, Andrew Morton,
David Hildenbrand, Lorenzo Stoakes, Liam R . Howlett,
Vlastimil Babka, Mike Rapoport, Suren Baghdasaryan, Michal Hocko,
Zi Yan, Alistair Popple, Balbir Singh, linux-mm, linux-fsdevel,
nvdimm, linux-kernel, Francois Dugast
From: Matthew Brost <matthew.brost@intel.com>
Use free_zone_device_folio_prepare() to restore fsdax ZONE_DEVICE folios
to a sane initial state upon the final put.
Cc: Dan Williams <dan.j.williams@intel.com>
Cc: Matthew Wilcox <willy@infradead.org>
Cc: Jan Kara <jack@suse.cz>
Cc: Alexander Viro <viro@zeniv.linux.org.uk>
Cc: Christian Brauner <brauner@kernel.org>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: David Hildenbrand <david@kernel.org>
Cc: Lorenzo Stoakes <lorenzo.stoakes@oracle.com>
Cc: Liam R. Howlett <Liam.Howlett@oracle.com>
Cc: Vlastimil Babka <vbabka@suse.cz>
Cc: Mike Rapoport <rppt@kernel.org>
Cc: Suren Baghdasaryan <surenb@google.com>
Cc: Michal Hocko <mhocko@suse.com>
Cc: Zi Yan <ziy@nvidia.com>
Cc: Alistair Popple <apopple@nvidia.com>
Cc: Balbir Singh <balbirs@nvidia.com>
Cc: linux-mm@kvack.org
Cc: linux-fsdevel@vger.kernel.org
Cc: nvdimm@lists.linux.dev
Cc: linux-kernel@vger.kernel.org
Suggested-by: Alistair Popple <apopple@nvidia.com>
Signed-off-by: Matthew Brost <matthew.brost@intel.com>
Signed-off-by: Francois Dugast <francois.dugast@intel.com>
---
fs/dax.c | 24 +-----------------------
1 file changed, 1 insertion(+), 23 deletions(-)
diff --git a/fs/dax.c b/fs/dax.c
index 289e6254aa30..d998f7615abb 100644
--- a/fs/dax.c
+++ b/fs/dax.c
@@ -391,29 +391,7 @@ static inline unsigned long dax_folio_put(struct folio *folio)
if (ref)
return ref;
- folio->mapping = NULL;
- order = folio_order(folio);
- if (!order)
- return 0;
- folio_reset_order(folio);
-
- for (i = 0; i < (1UL << order); i++) {
- struct dev_pagemap *pgmap = page_pgmap(&folio->page);
- struct page *page = folio_page(folio, i);
- struct folio *new_folio = (struct folio *)page;
-
- ClearPageHead(page);
- clear_compound_head(page);
-
- new_folio->mapping = NULL;
- /*
- * Reset pgmap which was over-written by
- * prep_compound_page().
- */
- new_folio->pgmap = pgmap;
- new_folio->share = 0;
- WARN_ON_ONCE(folio_ref_count(new_folio));
- }
+ free_zone_device_folio_prepare(folio);
return ref;
}
--
2.43.0
^ permalink raw reply [flat|nested] 32+ messages in thread
* Re: [PATCH v4 3/7] fs/dax: Use free_zone_device_folio_prepare() helper
2026-01-11 20:55 ` [PATCH v4 3/7] fs/dax: Use " Francois Dugast
@ 2026-01-12 4:14 ` kernel test robot
0 siblings, 0 replies; 32+ messages in thread
From: kernel test robot @ 2026-01-12 4:14 UTC (permalink / raw)
To: Francois Dugast, intel-xe
Cc: llvm, oe-kbuild-all, dri-devel, Matthew Brost, Dan Williams,
Matthew Wilcox, Jan Kara, Alexander Viro, Christian Brauner,
Andrew Morton, Linux Memory Management List, David Hildenbrand,
Lorenzo Stoakes, Liam R . Howlett, Vlastimil Babka,
Mike Rapoport, Suren Baghdasaryan, Michal Hocko, Zi Yan,
Alistair Popple, Balbir Singh, linux-fsdevel, nvdimm,
linux-kernel, Francois Dugast
Hi Francois,
kernel test robot noticed the following build warnings:
[auto build test WARNING on drm-tip/drm-tip]
[also build test WARNING on linus/master v6.19-rc4 next-20260109]
[cannot apply to drm-misc/drm-misc-next akpm-mm/mm-everything powerpc/topic/ppc-kvm pci/next pci/for-linus akpm-mm/mm-nonmm-unstable brauner-vfs/vfs.all daeinki-drm-exynos/exynos-drm-next drm/drm-next drm-i915/for-linux-next drm-i915/for-linux-next-fixes]
[If your patch is applied to the wrong git tree, kindly drop us a note.
And when submitting patch, we suggest to use '--base' as documented in
https://git-scm.com/docs/git-format-patch#_base_tree_information]
url: https://github.com/intel-lab-lkp/linux/commits/Francois-Dugast/mm-zone_device-Add-order-argument-to-folio_free-callback/20260112-050347
base: https://gitlab.freedesktop.org/drm/tip.git drm-tip
patch link: https://lore.kernel.org/r/20260111205820.830410-4-francois.dugast%40intel.com
patch subject: [PATCH v4 3/7] fs/dax: Use free_zone_device_folio_prepare() helper
config: x86_64-rhel-9.4-rust (https://download.01.org/0day-ci/archive/20260112/202601121151.afcnEvLk-lkp@intel.com/config)
compiler: clang version 20.1.8 (https://github.com/llvm/llvm-project 87f0227cb60147a26a1eeb4fb06e3b505e9c7261)
rustc: rustc 1.88.0 (6b00bc388 2025-06-23)
reproduce (this is a W=1 build): (https://download.01.org/0day-ci/archive/20260112/202601121151.afcnEvLk-lkp@intel.com/reproduce)
If you fix the issue in a separate patch/commit (i.e. not just a new version of
the same patch/commit), kindly add following tags
| Reported-by: kernel test robot <lkp@intel.com>
| Closes: https://lore.kernel.org/oe-kbuild-all/202601121151.afcnEvLk-lkp@intel.com/
All warnings (new ones prefixed by >>):
>> fs/dax.c:384:6: warning: unused variable 'order' [-Wunused-variable]
384 | int order, i;
| ^~~~~
>> fs/dax.c:384:13: warning: unused variable 'i' [-Wunused-variable]
384 | int order, i;
| ^
2 warnings generated.
vim +/order +384 fs/dax.c
38607c62b34b46 Alistair Popple 2025-02-28 380
38607c62b34b46 Alistair Popple 2025-02-28 381 static inline unsigned long dax_folio_put(struct folio *folio)
6061b69b9a550a Shiyang Ruan 2022-06-03 382 {
38607c62b34b46 Alistair Popple 2025-02-28 383 unsigned long ref;
38607c62b34b46 Alistair Popple 2025-02-28 @384 int order, i;
38607c62b34b46 Alistair Popple 2025-02-28 385
38607c62b34b46 Alistair Popple 2025-02-28 386 if (!dax_folio_is_shared(folio))
38607c62b34b46 Alistair Popple 2025-02-28 387 ref = 0;
38607c62b34b46 Alistair Popple 2025-02-28 388 else
38607c62b34b46 Alistair Popple 2025-02-28 389 ref = --folio->share;
38607c62b34b46 Alistair Popple 2025-02-28 390
38607c62b34b46 Alistair Popple 2025-02-28 391 if (ref)
38607c62b34b46 Alistair Popple 2025-02-28 392 return ref;
38607c62b34b46 Alistair Popple 2025-02-28 393
ae869c69ed0719 Matthew Brost 2026-01-11 394 free_zone_device_folio_prepare(folio);
16900426586032 Shiyang Ruan 2022-12-01 395
38607c62b34b46 Alistair Popple 2025-02-28 396 return ref;
16900426586032 Shiyang Ruan 2022-12-01 397 }
16900426586032 Shiyang Ruan 2022-12-01 398
--
0-DAY CI Kernel Test Service
https://github.com/intel/lkp-tests/wiki
^ permalink raw reply [flat|nested] 32+ messages in thread
* [PATCH v4 4/7] drm/pagemap: Unlock and put folios when possible
2026-01-11 20:55 [PATCH v4 0/7] Enable THP support in drm_pagemap Francois Dugast
` (2 preceding siblings ...)
2026-01-11 20:55 ` [PATCH v4 3/7] fs/dax: Use " Francois Dugast
@ 2026-01-11 20:55 ` Francois Dugast
2026-01-11 20:55 ` [PATCH v4 5/7] drm/pagemap: Add helper to access zone_device_data Francois Dugast
` (2 subsequent siblings)
6 siblings, 0 replies; 32+ messages in thread
From: Francois Dugast @ 2026-01-11 20:55 UTC (permalink / raw)
To: intel-xe
Cc: dri-devel, Francois Dugast, Andrew Morton, David Hildenbrand,
Lorenzo Stoakes, Liam R . Howlett, Vlastimil Babka,
Mike Rapoport, Suren Baghdasaryan, Michal Hocko, Zi Yan,
Alistair Popple, Balbir Singh, linux-mm, Matthew Brost
If the page is part of a larger folio, unlock and put the whole folio at
once instead of its individual pages one after the other. This will
reduce the number of operations once device THPs are in use.
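A simplified sketch of the resulting loop shape (the actual change is in
the diff below): the walk advances by the folio size, so a 2MB THP gets a
single folio_unlock()/folio_put() instead of 512 unlock_page()/put_page()
calls.
	for (i = 0; i < npages;) {
		struct page *page = migrate_pfn_to_page(migrate_pfn[i]);
		unsigned int order = 0;
		if (page) {
			struct folio *folio = page_folio(page);
			order = folio_order(folio);
			folio_unlock(folio);
			folio_put(folio);
		}
		i += 1UL << order;	/* skip the folio's tail pages */
	}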
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: David Hildenbrand <david@kernel.org>
Cc: Lorenzo Stoakes <lorenzo.stoakes@oracle.com>
Cc: Liam R. Howlett <Liam.Howlett@oracle.com>
Cc: Vlastimil Babka <vbabka@suse.cz>
Cc: Mike Rapoport <rppt@kernel.org>
Cc: Suren Baghdasaryan <surenb@google.com>
Cc: Michal Hocko <mhocko@suse.com>
Cc: Zi Yan <ziy@nvidia.com>
Cc: Alistair Popple <apopple@nvidia.com>
Cc: Balbir Singh <balbirs@nvidia.com>
Cc: linux-mm@kvack.org
Suggested-by: Matthew Brost <matthew.brost@intel.com>
Reviewed-by: Matthew Brost <matthew.brost@intel.com>
Signed-off-by: Francois Dugast <francois.dugast@intel.com>
---
drivers/gpu/drm/drm_pagemap.c | 26 +++++++++++++++++---------
1 file changed, 17 insertions(+), 9 deletions(-)
diff --git a/drivers/gpu/drm/drm_pagemap.c b/drivers/gpu/drm/drm_pagemap.c
index df253b13cf85..bd9a4703fbce 100644
--- a/drivers/gpu/drm/drm_pagemap.c
+++ b/drivers/gpu/drm/drm_pagemap.c
@@ -154,15 +154,15 @@ static void drm_pagemap_zdd_put(struct drm_pagemap_zdd *zdd)
}
/**
- * drm_pagemap_migration_unlock_put_page() - Put a migration page
- * @page: Pointer to the page to put
+ * drm_pagemap_migration_unlock_put_folio() - Put a migration folio
+ * @folio: Pointer to the folio to put
*
- * This function unlocks and puts a page.
+ * This function unlocks and puts a folio.
*/
-static void drm_pagemap_migration_unlock_put_page(struct page *page)
+static void drm_pagemap_migration_unlock_put_folio(struct folio *folio)
{
- unlock_page(page);
- put_page(page);
+ folio_unlock(folio);
+ folio_put(folio);
}
/**
@@ -177,15 +177,23 @@ static void drm_pagemap_migration_unlock_put_pages(unsigned long npages,
{
unsigned long i;
- for (i = 0; i < npages; ++i) {
+ for (i = 0; i < npages;) {
struct page *page;
+ struct folio *folio;
+ unsigned int order = 0;
if (!migrate_pfn[i])
- continue;
+ goto next;
page = migrate_pfn_to_page(migrate_pfn[i]);
- drm_pagemap_migration_unlock_put_page(page);
+ folio = page_folio(page);
+ order = folio_order(folio);
+
+ drm_pagemap_migration_unlock_put_folio(folio);
migrate_pfn[i] = 0;
+
+next:
+ i += NR_PAGES(order);
}
}
--
2.43.0
^ permalink raw reply [flat|nested] 32+ messages in thread
* [PATCH v4 5/7] drm/pagemap: Add helper to access zone_device_data
2026-01-11 20:55 [PATCH v4 0/7] Enable THP support in drm_pagemap Francois Dugast
` (3 preceding siblings ...)
2026-01-11 20:55 ` [PATCH v4 4/7] drm/pagemap: Unlock and put folios when possible Francois Dugast
@ 2026-01-11 20:55 ` Francois Dugast
2026-01-11 20:55 ` [PATCH v4 6/7] drm/pagemap: Correct cpages calculation for migrate_vma_setup Francois Dugast
2026-01-11 20:55 ` [PATCH v4 7/7] drm/pagemap: Enable THP support for GPU memory migration Francois Dugast
6 siblings, 0 replies; 32+ messages in thread
From: Francois Dugast @ 2026-01-11 20:55 UTC (permalink / raw)
To: intel-xe
Cc: dri-devel, Francois Dugast, Andrew Morton, David Hildenbrand,
Lorenzo Stoakes, Liam R . Howlett, Vlastimil Babka,
Mike Rapoport, Suren Baghdasaryan, Michal Hocko, Zi Yan,
Alistair Popple, Balbir Singh, linux-mm, Matthew Brost
This new helper ensures all accesses to zone_device_data use the
correct API, whether or not the page is part of a larger folio.
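For example (a sketch, not an exact call site), a driver can resolve the
zdd from either a head page or a tail page of a device THP without
dereferencing page->zone_device_data directly:
	struct page *page = migrate_pfn_to_page(mpfn);	/* may be a tail page */
	struct drm_pagemap_zdd *zdd = drm_pagemap_page_zone_device_data(page);
The helper folds page_folio() and folio_zone_device_data(), so the
zone_device_data of the owning folio is always the one returned.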
v2:
- Move to drm_pagemap.h, stick to folio_zone_device_data (Matthew Brost)
- Return struct drm_pagemap_zdd * (Matthew Brost)
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: David Hildenbrand <david@kernel.org>
Cc: Lorenzo Stoakes <lorenzo.stoakes@oracle.com>
Cc: Liam R. Howlett <Liam.Howlett@oracle.com>
Cc: Vlastimil Babka <vbabka@suse.cz>
Cc: Mike Rapoport <rppt@kernel.org>
Cc: Suren Baghdasaryan <surenb@google.com>
Cc: Michal Hocko <mhocko@suse.com>
Cc: Zi Yan <ziy@nvidia.com>
Cc: Alistair Popple <apopple@nvidia.com>
Cc: Balbir Singh <balbirs@nvidia.com>
Cc: linux-mm@kvack.org
Suggested-by: Matthew Brost <matthew.brost@intel.com>
Reviewed-by: Matthew Brost <matthew.brost@intel.com>
Signed-off-by: Francois Dugast <francois.dugast@intel.com>
---
drivers/gpu/drm/drm_gpusvm.c | 7 +++++--
drivers/gpu/drm/drm_pagemap.c | 21 ++++++++++++---------
include/drm/drm_pagemap.h | 15 +++++++++++++++
3 files changed, 32 insertions(+), 11 deletions(-)
diff --git a/drivers/gpu/drm/drm_gpusvm.c b/drivers/gpu/drm/drm_gpusvm.c
index aa9a0b60e727..585d913d3d19 100644
--- a/drivers/gpu/drm/drm_gpusvm.c
+++ b/drivers/gpu/drm/drm_gpusvm.c
@@ -1488,12 +1488,15 @@ int drm_gpusvm_get_pages(struct drm_gpusvm *gpusvm,
order = drm_gpusvm_hmm_pfn_to_order(pfns[i], i, npages);
if (is_device_private_page(page) ||
is_device_coherent_page(page)) {
+ struct drm_pagemap_zdd *__zdd =
+ drm_pagemap_page_zone_device_data(page);
+
if (!ctx->allow_mixed &&
- zdd != page->zone_device_data && i > 0) {
+ zdd != __zdd && i > 0) {
err = -EOPNOTSUPP;
goto err_unmap;
}
- zdd = page->zone_device_data;
+ zdd = __zdd;
if (pagemap != page_pgmap(page)) {
if (i > 0) {
err = -EOPNOTSUPP;
diff --git a/drivers/gpu/drm/drm_pagemap.c b/drivers/gpu/drm/drm_pagemap.c
index bd9a4703fbce..308c14291eba 100644
--- a/drivers/gpu/drm/drm_pagemap.c
+++ b/drivers/gpu/drm/drm_pagemap.c
@@ -252,7 +252,7 @@ static int drm_pagemap_migrate_map_pages(struct device *dev,
order = folio_order(folio);
if (is_device_private_page(page)) {
- struct drm_pagemap_zdd *zdd = page->zone_device_data;
+ struct drm_pagemap_zdd *zdd = drm_pagemap_page_zone_device_data(page);
struct drm_pagemap *dpagemap = zdd->dpagemap;
struct drm_pagemap_addr addr;
@@ -323,7 +323,7 @@ static void drm_pagemap_migrate_unmap_pages(struct device *dev,
goto next;
if (is_zone_device_page(page)) {
- struct drm_pagemap_zdd *zdd = page->zone_device_data;
+ struct drm_pagemap_zdd *zdd = drm_pagemap_page_zone_device_data(page);
struct drm_pagemap *dpagemap = zdd->dpagemap;
dpagemap->ops->device_unmap(dpagemap, dev, pagemap_addr[i]);
@@ -611,7 +611,8 @@ int drm_pagemap_migrate_to_devmem(struct drm_pagemap_devmem *devmem_allocation,
pages[i] = NULL;
if (src_page && is_device_private_page(src_page)) {
- struct drm_pagemap_zdd *src_zdd = src_page->zone_device_data;
+ struct drm_pagemap_zdd *src_zdd =
+ drm_pagemap_page_zone_device_data(src_page);
if (page_pgmap(src_page) == pagemap &&
!mdetails->can_migrate_same_pagemap) {
@@ -733,8 +734,8 @@ static int drm_pagemap_migrate_populate_ram_pfn(struct vm_area_struct *vas,
goto next;
if (fault_page) {
- if (src_page->zone_device_data !=
- fault_page->zone_device_data)
+ if (drm_pagemap_page_zone_device_data(src_page) !=
+ drm_pagemap_page_zone_device_data(fault_page))
goto next;
}
@@ -1075,7 +1076,7 @@ static int __drm_pagemap_migrate_to_ram(struct vm_area_struct *vas,
void *buf;
int i, err = 0;
- zdd = page->zone_device_data;
+ zdd = drm_pagemap_page_zone_device_data(page);
if (time_before64(get_jiffies_64(), zdd->devmem_allocation->timeslice_expiration))
return 0;
@@ -1159,7 +1160,9 @@ static int __drm_pagemap_migrate_to_ram(struct vm_area_struct *vas,
*/
static void drm_pagemap_folio_free(struct folio *folio, unsigned int order)
{
- drm_pagemap_zdd_put(folio->page.zone_device_data);
+ struct page *page = folio_page(folio, 0);
+
+ drm_pagemap_zdd_put(drm_pagemap_page_zone_device_data(page));
}
/**
@@ -1175,7 +1178,7 @@ static void drm_pagemap_folio_free(struct folio *folio, unsigned int order)
*/
static vm_fault_t drm_pagemap_migrate_to_ram(struct vm_fault *vmf)
{
- struct drm_pagemap_zdd *zdd = vmf->page->zone_device_data;
+ struct drm_pagemap_zdd *zdd = drm_pagemap_page_zone_device_data(vmf->page);
int err;
err = __drm_pagemap_migrate_to_ram(vmf->vma,
@@ -1241,7 +1244,7 @@ EXPORT_SYMBOL_GPL(drm_pagemap_devmem_init);
*/
struct drm_pagemap *drm_pagemap_page_to_dpagemap(struct page *page)
{
- struct drm_pagemap_zdd *zdd = page->zone_device_data;
+ struct drm_pagemap_zdd *zdd = drm_pagemap_page_zone_device_data(page);
return zdd->devmem_allocation->dpagemap;
}
diff --git a/include/drm/drm_pagemap.h b/include/drm/drm_pagemap.h
index 46e9c58f09e0..736fb6cb7b33 100644
--- a/include/drm/drm_pagemap.h
+++ b/include/drm/drm_pagemap.h
@@ -4,6 +4,7 @@
#include <linux/dma-direction.h>
#include <linux/hmm.h>
+#include <linux/memremap.h>
#include <linux/types.h>
#define NR_PAGES(order) (1U << (order))
@@ -359,4 +360,18 @@ int drm_pagemap_populate_mm(struct drm_pagemap *dpagemap,
void drm_pagemap_destroy(struct drm_pagemap *dpagemap, bool is_atomic_or_reclaim);
int drm_pagemap_reinit(struct drm_pagemap *dpagemap);
+
+/**
+ * drm_pagemap_page_zone_device_data() - Page to zone_device_data
+ * @page: Pointer to the page
+ *
+ * Return: Page's zone_device_data
+ */
+static inline struct drm_pagemap_zdd *drm_pagemap_page_zone_device_data(struct page *page)
+{
+ struct folio *folio = page_folio(page);
+
+ return folio_zone_device_data(folio);
+}
+
#endif
--
2.43.0
^ permalink raw reply [flat|nested] 32+ messages in thread
* [PATCH v4 6/7] drm/pagemap: Correct cpages calculation for migrate_vma_setup
2026-01-11 20:55 [PATCH v4 0/7] Enable THP support in drm_pagemap Francois Dugast
` (4 preceding siblings ...)
2026-01-11 20:55 ` [PATCH v4 5/7] drm/pagemap: Add helper to access zone_device_data Francois Dugast
@ 2026-01-11 20:55 ` Francois Dugast
2026-01-12 14:17 ` Francois Dugast
2026-01-11 20:55 ` [PATCH v4 7/7] drm/pagemap: Enable THP support for GPU memory migration Francois Dugast
6 siblings, 1 reply; 32+ messages in thread
From: Francois Dugast @ 2026-01-11 20:55 UTC (permalink / raw)
To: intel-xe
Cc: dri-devel, Matthew Brost, Andrew Morton, David Hildenbrand,
Lorenzo Stoakes, Liam R . Howlett, Vlastimil Babka,
Mike Rapoport, Suren Baghdasaryan, Michal Hocko, Zi Yan,
Alistair Popple, Balbir Singh, linux-mm, Francois Dugast
From: Matthew Brost <matthew.brost@intel.com>
cpages returned from migrate_vma_setup represents the number of
individual pages found (a compound page counts as one), not the number
of 4K pages. The math in drm_pagemap_migrate_to_devmem for npages is
based on the number of 4K pages, so the cpages != npages check can fail
even if the entire memory range is found in migrate_vma_setup (e.g.,
when a single 2M page is found).
Add drm_pagemap_cpages(), which computes the number of 4K pages found
from the collected migrate PFN entries.
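For example, a 2MB aligned range gives npages = 512. If
migrate_vma_setup() collects it as a single 2MB THP, that THP is counted
once rather than as 512 4K pages, so the old cpages != npages check would
abort the migration even though the whole range was found.
drm_pagemap_cpages() instead walks migrate.src and, for an order-9 entry,
accounts 1 << 9 = 512 pages, so the check passes.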
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: David Hildenbrand <david@kernel.org>
Cc: Lorenzo Stoakes <lorenzo.stoakes@oracle.com>
Cc: Liam R. Howlett <Liam.Howlett@oracle.com>
Cc: Vlastimil Babka <vbabka@suse.cz>
Cc: Mike Rapoport <rppt@kernel.org>
Cc: Suren Baghdasaryan <surenb@google.com>
Cc: Michal Hocko <mhocko@suse.com>
Cc: Zi Yan <ziy@nvidia.com>
Cc: Alistair Popple <apopple@nvidia.com>
Cc: Balbir Singh <balbirs@nvidia.com>
Cc: linux-mm@kvack.org
Signed-off-by: Matthew Brost <matthew.brost@intel.com>
Signed-off-by: Francois Dugast <francois.dugast@intel.com>
---
drivers/gpu/drm/drm_pagemap.c | 38 ++++++++++++++++++++++++++++++++++-
1 file changed, 37 insertions(+), 1 deletion(-)
diff --git a/drivers/gpu/drm/drm_pagemap.c b/drivers/gpu/drm/drm_pagemap.c
index 308c14291eba..af2c8f4da00e 100644
--- a/drivers/gpu/drm/drm_pagemap.c
+++ b/drivers/gpu/drm/drm_pagemap.c
@@ -452,6 +452,41 @@ static int drm_pagemap_migrate_range(struct drm_pagemap_devmem *devmem,
return ret;
}
+/**
+ * drm_pagemap_cpages() - Count collected pages
+ * @migrate_pfn: Array of migrate_pfn entries to account
+ * @npages: Number of entries in @migrate_pfn
+ *
+ * Compute the total number of minimum-sized pages represented by the
+ * collected entries in @migrate_pfn. The total is derived from the
+ * order encoded in each entry.
+ *
+ * Return: Total number of minimum-sized pages.
+ */
+static int drm_pagemap_cpages(unsigned long *migrate_pfn, unsigned long npages)
+{
+ unsigned long i, cpages = 0;
+
+ for (i = 0; i < npages;) {
+ struct page *page = migrate_pfn_to_page(migrate_pfn[i]);
+ struct folio *folio;
+ unsigned int order = 0;
+
+ if (page) {
+ folio = page_folio(page);
+ order = folio_order(folio);
+ cpages += NR_PAGES(order);
+ } else if (migrate_pfn[i] & MIGRATE_PFN_COMPOUND) {
+ order = HPAGE_PMD_ORDER;
+ cpages += NR_PAGES(order);
+ }
+
+ i += NR_PAGES(order);
+ }
+
+ return cpages;
+}
+
/**
* drm_pagemap_migrate_to_devmem() - Migrate a struct mm_struct range to device memory
* @devmem_allocation: The device memory allocation to migrate to.
@@ -564,7 +599,8 @@ int drm_pagemap_migrate_to_devmem(struct drm_pagemap_devmem *devmem_allocation,
goto err_free;
}
- if (migrate.cpages != npages) {
+ if (migrate.cpages != npages &&
+ drm_pagemap_cpages(migrate.src, npages) != npages) {
/*
* Some pages to migrate. But we want to migrate all or
* nothing. Raced or unknown device pages.
--
2.43.0
^ permalink raw reply [flat|nested] 32+ messages in thread
* Re: [PATCH v4 6/7] drm/pagemap: Correct cpages calculation for migrate_vma_setup
2026-01-11 20:55 ` [PATCH v4 6/7] drm/pagemap: Correct cpages calculation for migrate_vma_setup Francois Dugast
@ 2026-01-12 14:17 ` Francois Dugast
0 siblings, 0 replies; 32+ messages in thread
From: Francois Dugast @ 2026-01-12 14:17 UTC (permalink / raw)
To: intel-xe
Cc: dri-devel, Matthew Brost, Andrew Morton, David Hildenbrand,
Lorenzo Stoakes, Liam R . Howlett, Vlastimil Babka,
Mike Rapoport, Suren Baghdasaryan, Michal Hocko, Zi Yan,
Alistair Popple, Balbir Singh, linux-mm
On Sun, Jan 11, 2026 at 09:55:45PM +0100, Francois Dugast wrote:
> From: Matthew Brost <matthew.brost@intel.com>
>
> cpages returned from migrate_vma_setup represents the number of
> individual pages found (a compound page counts as one), not the number
> of 4K pages. The math in drm_pagemap_migrate_to_devmem for npages is
> based on the number of 4K pages, so the cpages != npages check can fail
> even if the entire memory range is found in migrate_vma_setup (e.g.,
> when a single 2M page is found).
> Add drm_pagemap_cpages(), which computes the number of 4K pages found
> from the collected migrate PFN entries.
>
> Cc: Andrew Morton <akpm@linux-foundation.org>
> Cc: David Hildenbrand <david@kernel.org>
> Cc: Lorenzo Stoakes <lorenzo.stoakes@oracle.com>
> Cc: Liam R. Howlett <Liam.Howlett@oracle.com>
> Cc: Vlastimil Babka <vbabka@suse.cz>
> Cc: Mike Rapoport <rppt@kernel.org>
> Cc: Suren Baghdasaryan <surenb@google.com>
> Cc: Michal Hocko <mhocko@suse.com>
> Cc: Zi Yan <ziy@nvidia.com>
> Cc: Alistair Popple <apopple@nvidia.com>
> Cc: Balbir Singh <balbirs@nvidia.com>
> Cc: linux-mm@kvack.org
> Signed-off-by: Matthew Brost <matthew.brost@intel.com>
> Signed-off-by: Francois Dugast <francois.dugast@intel.com>
Reviewed-by: Francois Dugast <francois.dugast@intel.com>
> ---
> drivers/gpu/drm/drm_pagemap.c | 38 ++++++++++++++++++++++++++++++++++-
> 1 file changed, 37 insertions(+), 1 deletion(-)
>
> diff --git a/drivers/gpu/drm/drm_pagemap.c b/drivers/gpu/drm/drm_pagemap.c
> index 308c14291eba..af2c8f4da00e 100644
> --- a/drivers/gpu/drm/drm_pagemap.c
> +++ b/drivers/gpu/drm/drm_pagemap.c
> @@ -452,6 +452,41 @@ static int drm_pagemap_migrate_range(struct drm_pagemap_devmem *devmem,
> return ret;
> }
>
> +/**
> + * drm_pagemap_cpages() - Count collected pages
> + * @migrate_pfn: Array of migrate_pfn entries to account
> + * @npages: Number of entries in @migrate_pfn
> + *
> + * Compute the total number of minimum-sized pages represented by the
> + * collected entries in @migrate_pfn. The total is derived from the
> + * order encoded in each entry.
> + *
> + * Return: Total number of minimum-sized pages.
> + */
> +static int drm_pagemap_cpages(unsigned long *migrate_pfn, unsigned long npages)
> +{
> + unsigned long i, cpages = 0;
> +
> + for (i = 0; i < npages;) {
> + struct page *page = migrate_pfn_to_page(migrate_pfn[i]);
> + struct folio *folio;
> + unsigned int order = 0;
> +
> + if (page) {
> + folio = page_folio(page);
> + order = folio_order(folio);
> + cpages += NR_PAGES(order);
> + } else if (migrate_pfn[i] & MIGRATE_PFN_COMPOUND) {
> + order = HPAGE_PMD_ORDER;
> + cpages += NR_PAGES(order);
> + }
> +
> + i += NR_PAGES(order);
> + }
> +
> + return cpages;
> +}
> +
> /**
> * drm_pagemap_migrate_to_devmem() - Migrate a struct mm_struct range to device memory
> * @devmem_allocation: The device memory allocation to migrate to.
> @@ -564,7 +599,8 @@ int drm_pagemap_migrate_to_devmem(struct drm_pagemap_devmem *devmem_allocation,
> goto err_free;
> }
>
> - if (migrate.cpages != npages) {
> + if (migrate.cpages != npages &&
> + drm_pagemap_cpages(migrate.src, npages) != npages) {
> /*
> * Some pages to migrate. But we want to migrate all or
> * nothing. Raced or unknown device pages.
> --
> 2.43.0
>
^ permalink raw reply [flat|nested] 32+ messages in thread
* [PATCH v4 7/7] drm/pagemap: Enable THP support for GPU memory migration
2026-01-11 20:55 [PATCH v4 0/7] Enable THP support in drm_pagemap Francois Dugast
` (5 preceding siblings ...)
2026-01-11 20:55 ` [PATCH v4 6/7] drm/pagemap: Correct cpages calculation for migrate_vma_setup Francois Dugast
@ 2026-01-11 20:55 ` Francois Dugast
2026-01-11 21:37 ` Matthew Brost
6 siblings, 1 reply; 32+ messages in thread
From: Francois Dugast @ 2026-01-11 20:55 UTC (permalink / raw)
To: intel-xe
Cc: dri-devel, Francois Dugast, Matthew Brost, Thomas Hellström,
Michal Mrozek, Andrew Morton, David Hildenbrand, Lorenzo Stoakes,
Liam R . Howlett, Vlastimil Babka, Mike Rapoport,
Suren Baghdasaryan, Michal Hocko, Zi Yan, Alistair Popple,
Balbir Singh, linux-mm
This enables support for Transparent Huge Pages (THP) for device pages by
using MIGRATE_VMA_SELECT_COMPOUND during migration. It removes the need to
split folios and loop multiple times over all pages to perform required
operations at page level. Instead, we rely on newly introduced support for
higher orders in drm_pagemap and folio-level API.
In Xe, this drastically improves performance when using SVM. The GT stats
below collected after a 2MB page fault show overall servicing is more than
7 times faster, and thanks to reduced CPU overhead the time spent on the
actual copy goes from 23% without THP to 80% with THP:
Without THP:
svm_2M_pagefault_us: 966
svm_2M_migrate_us: 942
svm_2M_device_copy_us: 223
svm_2M_get_pages_us: 9
svm_2M_bind_us: 10
With THP:
svm_2M_pagefault_us: 132
svm_2M_migrate_us: 128
svm_2M_device_copy_us: 106
svm_2M_get_pages_us: 1
svm_2M_bind_us: 2
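(For reference: 966 us / 132 us is roughly 7.3, i.e. overall servicing is
over 7x faster, and the copy share goes from 223 / 966, about 23%, to
106 / 132, about 80%.)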
v2:
- Fix one occurrence of drm_pagemap_get_devmem_page() (Matthew Brost)
v3:
- Remove migrate_device_split_page() and folio_split_lock, instead rely on
free_zone_device_folio() to split folios before freeing (Matthew Brost)
- Assert folio order is HPAGE_PMD_ORDER (Matthew Brost)
- Always use folio_set_zone_device_data() in split (Matthew Brost)
v4:
- Warn on compound device page, s/continue/goto next/ (Matthew Brost)
Cc: Matthew Brost <matthew.brost@intel.com>
Cc: Thomas Hellström <thomas.hellstrom@linux.intel.com>
Cc: Michal Mrozek <michal.mrozek@intel.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: David Hildenbrand <david@kernel.org>
Cc: Lorenzo Stoakes <lorenzo.stoakes@oracle.com>
Cc: Liam R. Howlett <Liam.Howlett@oracle.com>
Cc: Vlastimil Babka <vbabka@suse.cz>
Cc: Mike Rapoport <rppt@kernel.org>
Cc: Suren Baghdasaryan <surenb@google.com>
Cc: Michal Hocko <mhocko@suse.com>
Cc: Zi Yan <ziy@nvidia.com>
Cc: Alistair Popple <apopple@nvidia.com>
Cc: Balbir Singh <balbirs@nvidia.com>
Cc: linux-mm@kvack.org
Signed-off-by: Francois Dugast <francois.dugast@intel.com>
---
drivers/gpu/drm/drm_pagemap.c | 77 ++++++++++++++++++++++++++++++-----
1 file changed, 67 insertions(+), 10 deletions(-)
diff --git a/drivers/gpu/drm/drm_pagemap.c b/drivers/gpu/drm/drm_pagemap.c
index af2c8f4da00e..bd2c9af51564 100644
--- a/drivers/gpu/drm/drm_pagemap.c
+++ b/drivers/gpu/drm/drm_pagemap.c
@@ -200,16 +200,20 @@ static void drm_pagemap_migration_unlock_put_pages(unsigned long npages,
/**
* drm_pagemap_get_devmem_page() - Get a reference to a device memory page
* @page: Pointer to the page
+ * @order: Order
* @zdd: Pointer to the GPU SVM zone device data
*
* This function associates the given page with the specified GPU SVM zone
* device data and initializes it for zone device usage.
*/
static void drm_pagemap_get_devmem_page(struct page *page,
+ unsigned int order,
struct drm_pagemap_zdd *zdd)
{
- page->zone_device_data = drm_pagemap_zdd_get(zdd);
- zone_device_page_init(page, 0);
+ struct folio *folio = page_folio(page);
+
+ folio_set_zone_device_data(folio, drm_pagemap_zdd_get(zdd));
+ zone_device_page_init(page, order);
}
/**
@@ -534,7 +538,8 @@ int drm_pagemap_migrate_to_devmem(struct drm_pagemap_devmem *devmem_allocation,
* rare and only occur when the madvise attributes of memory are
* changed or atomics are being used.
*/
- .flags = MIGRATE_VMA_SELECT_SYSTEM | MIGRATE_VMA_SELECT_DEVICE_COHERENT,
+ .flags = MIGRATE_VMA_SELECT_SYSTEM | MIGRATE_VMA_SELECT_DEVICE_COHERENT |
+ MIGRATE_VMA_SELECT_COMPOUND,
};
unsigned long i, npages = npages_in_range(start, end);
unsigned long own_pages = 0, migrated_pages = 0;
@@ -640,11 +645,16 @@ int drm_pagemap_migrate_to_devmem(struct drm_pagemap_devmem *devmem_allocation,
own_pages = 0;
- for (i = 0; i < npages; ++i) {
+ for (i = 0; i < npages;) {
+ unsigned long j;
struct page *page = pfn_to_page(migrate.dst[i]);
struct page *src_page = migrate_pfn_to_page(migrate.src[i]);
- cur.start = i;
+ unsigned int order = 0;
+
+ drm_WARN_ONCE(dpagemap->drm, folio_order(page_folio(page)),
+ "Unexpected compound device page found\n");
+ cur.start = i;
pages[i] = NULL;
if (src_page && is_device_private_page(src_page)) {
struct drm_pagemap_zdd *src_zdd =
@@ -654,7 +664,7 @@ int drm_pagemap_migrate_to_devmem(struct drm_pagemap_devmem *devmem_allocation,
!mdetails->can_migrate_same_pagemap) {
migrate.dst[i] = 0;
own_pages++;
- continue;
+ goto next;
}
if (mdetails->source_peer_migrates) {
cur.dpagemap = src_zdd->dpagemap;
@@ -670,7 +680,20 @@ int drm_pagemap_migrate_to_devmem(struct drm_pagemap_devmem *devmem_allocation,
pages[i] = page;
}
migrate.dst[i] = migrate_pfn(migrate.dst[i]);
- drm_pagemap_get_devmem_page(page, zdd);
+
+ if (migrate.src[i] & MIGRATE_PFN_COMPOUND) {
+ drm_WARN_ONCE(dpagemap->drm, src_page &&
+ folio_order(page_folio(src_page)) != HPAGE_PMD_ORDER,
+ "Unexpected folio order\n");
+
+ order = HPAGE_PMD_ORDER;
+ migrate.dst[i] |= MIGRATE_PFN_COMPOUND;
+
+ for (j = 1; j < NR_PAGES(order) && i + j < npages; j++)
+ migrate.dst[i + j] = 0;
+ }
+
+ drm_pagemap_get_devmem_page(page, order, zdd);
/* If we switched the migrating drm_pagemap, migrate previous pages now */
err = drm_pagemap_migrate_range(devmem_allocation, migrate.src, migrate.dst,
@@ -680,7 +703,11 @@ int drm_pagemap_migrate_to_devmem(struct drm_pagemap_devmem *devmem_allocation,
npages = i + 1;
goto err_finalize;
}
+
+next:
+ i += NR_PAGES(order);
}
+
cur.start = npages;
cur.ops = NULL; /* Force migration */
err = drm_pagemap_migrate_range(devmem_allocation, migrate.src, migrate.dst,
@@ -789,6 +816,8 @@ static int drm_pagemap_migrate_populate_ram_pfn(struct vm_area_struct *vas,
page = folio_page(folio, 0);
mpfn[i] = migrate_pfn(page_to_pfn(page));
+ if (order)
+ mpfn[i] |= MIGRATE_PFN_COMPOUND;
next:
if (page)
addr += page_size(page);
@@ -1044,8 +1073,15 @@ int drm_pagemap_evict_to_ram(struct drm_pagemap_devmem *devmem_allocation)
if (err)
goto err_finalize;
- for (i = 0; i < npages; ++i)
+ for (i = 0; i < npages;) {
+ unsigned int order = 0;
+
pages[i] = migrate_pfn_to_page(src[i]);
+ if (pages[i])
+ order = folio_order(page_folio(pages[i]));
+
+ i += NR_PAGES(order);
+ }
err = ops->copy_to_ram(pages, pagemap_addr, npages, NULL);
if (err)
@@ -1098,7 +1134,8 @@ static int __drm_pagemap_migrate_to_ram(struct vm_area_struct *vas,
.vma = vas,
.pgmap_owner = page_pgmap(page)->owner,
.flags = MIGRATE_VMA_SELECT_DEVICE_PRIVATE |
- MIGRATE_VMA_SELECT_DEVICE_COHERENT,
+ MIGRATE_VMA_SELECT_DEVICE_COHERENT |
+ MIGRATE_VMA_SELECT_COMPOUND,
.fault_page = page,
};
struct drm_pagemap_migrate_details mdetails = {};
@@ -1164,8 +1201,15 @@ static int __drm_pagemap_migrate_to_ram(struct vm_area_struct *vas,
if (err)
goto err_finalize;
- for (i = 0; i < npages; ++i)
+ for (i = 0; i < npages;) {
+ unsigned int order = 0;
+
pages[i] = migrate_pfn_to_page(migrate.src[i]);
+ if (pages[i])
+ order = folio_order(page_folio(pages[i]));
+
+ i += NR_PAGES(order);
+ }
err = ops->copy_to_ram(pages, pagemap_addr, npages, NULL);
if (err)
@@ -1224,9 +1268,22 @@ static vm_fault_t drm_pagemap_migrate_to_ram(struct vm_fault *vmf)
return err ? VM_FAULT_SIGBUS : 0;
}
+static void drm_pagemap_folio_split(struct folio *orig_folio, struct folio *new_folio)
+{
+ struct drm_pagemap_zdd *zdd;
+
+ if (!new_folio)
+ return;
+
+ new_folio->pgmap = orig_folio->pgmap;
+ zdd = folio_zone_device_data(orig_folio);
+ folio_set_zone_device_data(new_folio, drm_pagemap_zdd_get(zdd));
+}
+
static const struct dev_pagemap_ops drm_pagemap_pagemap_ops = {
.folio_free = drm_pagemap_folio_free,
.migrate_to_ram = drm_pagemap_migrate_to_ram,
+ .folio_split = drm_pagemap_folio_split,
};
/**
--
2.43.0
^ permalink raw reply [flat|nested] 32+ messages in thread
* Re: [PATCH v4 7/7] drm/pagemap: Enable THP support for GPU memory migration
2026-01-11 20:55 ` [PATCH v4 7/7] drm/pagemap: Enable THP support for GPU memory migration Francois Dugast
@ 2026-01-11 21:37 ` Matthew Brost
0 siblings, 0 replies; 32+ messages in thread
From: Matthew Brost @ 2026-01-11 21:37 UTC (permalink / raw)
To: Francois Dugast
Cc: intel-xe, dri-devel, Thomas Hellström, Michal Mrozek,
Andrew Morton, David Hildenbrand, Lorenzo Stoakes,
Liam R . Howlett, Vlastimil Babka, Mike Rapoport,
Suren Baghdasaryan, Michal Hocko, Zi Yan, Alistair Popple,
Balbir Singh, linux-mm
On Sun, Jan 11, 2026 at 09:55:46PM +0100, Francois Dugast wrote:
> This enables support for Transparent Huge Pages (THP) for device pages by
> using MIGRATE_VMA_SELECT_COMPOUND during migration. It removes the need to
> split folios and loop multiple times over all pages to perform required
> operations at page level. Instead, we rely on newly introduced support for
> higher orders in drm_pagemap and folio-level API.
>
> In Xe, this drastically improves performance when using SVM. The GT stats
> below collected after a 2MB page fault show overall servicing is more than
> 7 times faster, and thanks to reduced CPU overhead the time spent on the
> actual copy goes from 23% without THP to 80% with THP:
>
> Without THP:
>
> svm_2M_pagefault_us: 966
> svm_2M_migrate_us: 942
> svm_2M_device_copy_us: 223
> svm_2M_get_pages_us: 9
> svm_2M_bind_us: 10
>
> With THP:
>
> svm_2M_pagefault_us: 132
> svm_2M_migrate_us: 128
> svm_2M_device_copy_us: 106
> svm_2M_get_pages_us: 1
> svm_2M_bind_us: 2
>
> v2:
> - Fix one occurrence of drm_pagemap_get_devmem_page() (Matthew Brost)
>
> v3:
> - Remove migrate_device_split_page() and folio_split_lock, instead rely on
> free_zone_device_folio() to split folios before freeing (Matthew Brost)
> - Assert folio order is HPAGE_PMD_ORDER (Matthew Brost)
> - Always use folio_set_zone_device_data() in split (Matthew Brost)
>
> v4:
> - Warn on compound device page, s/continue/goto next/ (Matthew Brost)
>
> Cc: Matthew Brost <matthew.brost@intel.com>
Reviewed-by: Matthew Brost <matthew.brost@intel.com>
> Cc: Thomas Hellström <thomas.hellstrom@linux.intel.com>
> Cc: Michal Mrozek <michal.mrozek@intel.com>
> Cc: Andrew Morton <akpm@linux-foundation.org>
> Cc: David Hildenbrand <david@kernel.org>
> Cc: Lorenzo Stoakes <lorenzo.stoakes@oracle.com>
> Cc: Liam R. Howlett <Liam.Howlett@oracle.com>
> Cc: Vlastimil Babka <vbabka@suse.cz>
> Cc: Mike Rapoport <rppt@kernel.org>
> Cc: Suren Baghdasaryan <surenb@google.com>
> Cc: Michal Hocko <mhocko@suse.com>
> Cc: Zi Yan <ziy@nvidia.com>
> Cc: Alistair Popple <apopple@nvidia.com>
> Cc: Balbir Singh <balbirs@nvidia.com>
> Cc: linux-mm@kvack.org
> Signed-off-by: Francois Dugast <francois.dugast@intel.com>
> ---
> drivers/gpu/drm/drm_pagemap.c | 77 ++++++++++++++++++++++++++++++-----
> 1 file changed, 67 insertions(+), 10 deletions(-)
>
> diff --git a/drivers/gpu/drm/drm_pagemap.c b/drivers/gpu/drm/drm_pagemap.c
> index af2c8f4da00e..bd2c9af51564 100644
> --- a/drivers/gpu/drm/drm_pagemap.c
> +++ b/drivers/gpu/drm/drm_pagemap.c
> @@ -200,16 +200,20 @@ static void drm_pagemap_migration_unlock_put_pages(unsigned long npages,
> /**
> * drm_pagemap_get_devmem_page() - Get a reference to a device memory page
> * @page: Pointer to the page
> + * @order: Order
> * @zdd: Pointer to the GPU SVM zone device data
> *
> * This function associates the given page with the specified GPU SVM zone
> * device data and initializes it for zone device usage.
> */
> static void drm_pagemap_get_devmem_page(struct page *page,
> + unsigned int order,
> struct drm_pagemap_zdd *zdd)
> {
> - page->zone_device_data = drm_pagemap_zdd_get(zdd);
> - zone_device_page_init(page, 0);
> + struct folio *folio = page_folio(page);
> +
> + folio_set_zone_device_data(folio, drm_pagemap_zdd_get(zdd));
> + zone_device_page_init(page, order);
> }
>
> /**
> @@ -534,7 +538,8 @@ int drm_pagemap_migrate_to_devmem(struct drm_pagemap_devmem *devmem_allocation,
> * rare and only occur when the madvise attributes of memory are
> * changed or atomics are being used.
> */
> - .flags = MIGRATE_VMA_SELECT_SYSTEM | MIGRATE_VMA_SELECT_DEVICE_COHERENT,
> + .flags = MIGRATE_VMA_SELECT_SYSTEM | MIGRATE_VMA_SELECT_DEVICE_COHERENT |
> + MIGRATE_VMA_SELECT_COMPOUND,
> };
> unsigned long i, npages = npages_in_range(start, end);
> unsigned long own_pages = 0, migrated_pages = 0;
> @@ -640,11 +645,16 @@ int drm_pagemap_migrate_to_devmem(struct drm_pagemap_devmem *devmem_allocation,
>
> own_pages = 0;
>
> - for (i = 0; i < npages; ++i) {
> + for (i = 0; i < npages;) {
> + unsigned long j;
> struct page *page = pfn_to_page(migrate.dst[i]);
> struct page *src_page = migrate_pfn_to_page(migrate.src[i]);
> - cur.start = i;
> + unsigned int order = 0;
> +
> + drm_WARN_ONCE(dpagemap->drm, folio_order(page_folio(page)),
> + "Unexpected compound device page found\n");
>
> + cur.start = i;
> pages[i] = NULL;
> if (src_page && is_device_private_page(src_page)) {
> struct drm_pagemap_zdd *src_zdd =
> @@ -654,7 +664,7 @@ int drm_pagemap_migrate_to_devmem(struct drm_pagemap_devmem *devmem_allocation,
> !mdetails->can_migrate_same_pagemap) {
> migrate.dst[i] = 0;
> own_pages++;
> - continue;
> + goto next;
> }
> if (mdetails->source_peer_migrates) {
> cur.dpagemap = src_zdd->dpagemap;
> @@ -670,7 +680,20 @@ int drm_pagemap_migrate_to_devmem(struct drm_pagemap_devmem *devmem_allocation,
> pages[i] = page;
> }
> migrate.dst[i] = migrate_pfn(migrate.dst[i]);
> - drm_pagemap_get_devmem_page(page, zdd);
> +
> + if (migrate.src[i] & MIGRATE_PFN_COMPOUND) {
> + drm_WARN_ONCE(dpagemap->drm, src_page &&
> + folio_order(page_folio(src_page)) != HPAGE_PMD_ORDER,
> + "Unexpected folio order\n");
> +
> + order = HPAGE_PMD_ORDER;
> + migrate.dst[i] |= MIGRATE_PFN_COMPOUND;
> +
> + for (j = 1; j < NR_PAGES(order) && i + j < npages; j++)
> + migrate.dst[i + j] = 0;
> + }
> +
> + drm_pagemap_get_devmem_page(page, order, zdd);
>
> /* If we switched the migrating drm_pagemap, migrate previous pages now */
> err = drm_pagemap_migrate_range(devmem_allocation, migrate.src, migrate.dst,
> @@ -680,7 +703,11 @@ int drm_pagemap_migrate_to_devmem(struct drm_pagemap_devmem *devmem_allocation,
> npages = i + 1;
> goto err_finalize;
> }
> +
> +next:
> + i += NR_PAGES(order);
> }
> +
> cur.start = npages;
> cur.ops = NULL; /* Force migration */
> err = drm_pagemap_migrate_range(devmem_allocation, migrate.src, migrate.dst,
> @@ -789,6 +816,8 @@ static int drm_pagemap_migrate_populate_ram_pfn(struct vm_area_struct *vas,
> page = folio_page(folio, 0);
> mpfn[i] = migrate_pfn(page_to_pfn(page));
>
> + if (order)
> + mpfn[i] |= MIGRATE_PFN_COMPOUND;
> next:
> if (page)
> addr += page_size(page);
> @@ -1044,8 +1073,15 @@ int drm_pagemap_evict_to_ram(struct drm_pagemap_devmem *devmem_allocation)
> if (err)
> goto err_finalize;
>
> - for (i = 0; i < npages; ++i)
> + for (i = 0; i < npages;) {
> + unsigned int order = 0;
> +
> pages[i] = migrate_pfn_to_page(src[i]);
> + if (pages[i])
> + order = folio_order(page_folio(pages[i]));
> +
> + i += NR_PAGES(order);
> + }
>
> err = ops->copy_to_ram(pages, pagemap_addr, npages, NULL);
> if (err)
> @@ -1098,7 +1134,8 @@ static int __drm_pagemap_migrate_to_ram(struct vm_area_struct *vas,
> .vma = vas,
> .pgmap_owner = page_pgmap(page)->owner,
> .flags = MIGRATE_VMA_SELECT_DEVICE_PRIVATE |
> - MIGRATE_VMA_SELECT_DEVICE_COHERENT,
> + MIGRATE_VMA_SELECT_DEVICE_COHERENT |
> + MIGRATE_VMA_SELECT_COMPOUND,
> .fault_page = page,
> };
> struct drm_pagemap_migrate_details mdetails = {};
> @@ -1164,8 +1201,15 @@ static int __drm_pagemap_migrate_to_ram(struct vm_area_struct *vas,
> if (err)
> goto err_finalize;
>
> - for (i = 0; i < npages; ++i)
> + for (i = 0; i < npages;) {
> + unsigned int order = 0;
> +
> pages[i] = migrate_pfn_to_page(migrate.src[i]);
> + if (pages[i])
> + order = folio_order(page_folio(pages[i]));
> +
> + i += NR_PAGES(order);
> + }
>
> err = ops->copy_to_ram(pages, pagemap_addr, npages, NULL);
> if (err)
> @@ -1224,9 +1268,22 @@ static vm_fault_t drm_pagemap_migrate_to_ram(struct vm_fault *vmf)
> return err ? VM_FAULT_SIGBUS : 0;
> }
>
> +static void drm_pagemap_folio_split(struct folio *orig_folio, struct folio *new_folio)
> +{
> + struct drm_pagemap_zdd *zdd;
> +
> + if (!new_folio)
> + return;
> +
> + new_folio->pgmap = orig_folio->pgmap;
> + zdd = folio_zone_device_data(orig_folio);
> + folio_set_zone_device_data(new_folio, drm_pagemap_zdd_get(zdd));
> +}
> +
> static const struct dev_pagemap_ops drm_pagemap_pagemap_ops = {
> .folio_free = drm_pagemap_folio_free,
> .migrate_to_ram = drm_pagemap_migrate_to_ram,
> + .folio_split = drm_pagemap_folio_split,
> };
>
> /**
> --
> 2.43.0
>
^ permalink raw reply [flat|nested] 32+ messages in thread