linux-mm.kvack.org archive mirror
* [PATCH v1 0/4] mm: cleanups for device-exclusive entries (hmm)
@ 2025-01-29 11:57 David Hildenbrand
  2025-01-29 11:57 ` [PATCH v1 1/4] lib/test_hmm: make dmirror_atomic_map() consume a single page David Hildenbrand
                   ` (3 more replies)
  0 siblings, 4 replies; 20+ messages in thread
From: David Hildenbrand @ 2025-01-29 11:57 UTC (permalink / raw)
  To: linux-kernel
  Cc: linux-doc, dri-devel, linux-mm, nouveau, David Hildenbrand,
	Andrew Morton, Jérôme Glisse, Jonathan Corbet,
	Alex Shi, Yanteng Si, Karol Herbst, Lyude Paul, Danilo Krummrich,
	David Airlie, Simona Vetter, Liam R. Howlett, Lorenzo Stoakes,
	Vlastimil Babka, Jann Horn, Pasha Tatashin, Peter Xu,
	Alistair Popple, Jason Gunthorpe

This is a follow-up to [1], performing some related cleanups. There
are more cleanups to be had, but I'll have to focus on some other stuff
next. I might come back to that once I'm stuck on (annoyed by :) )
other things.

Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: "Jérôme Glisse" <jglisse@redhat.com>
Cc: Jonathan Corbet <corbet@lwn.net>
Cc: Alex Shi <alexs@kernel.org>
Cc: Yanteng Si <si.yanteng@linux.dev>
Cc: Karol Herbst <kherbst@redhat.com>
Cc: Lyude Paul <lyude@redhat.com>
Cc: Danilo Krummrich <dakr@kernel.org>
Cc: David Airlie <airlied@gmail.com>
Cc: Simona Vetter <simona@ffwll.ch>
Cc: "Liam R. Howlett" <Liam.Howlett@oracle.com>
Cc: Lorenzo Stoakes <lorenzo.stoakes@oracle.com>
Cc: Vlastimil Babka <vbabka@suse.cz>
Cc: Jann Horn <jannh@google.com>
Cc: Pasha Tatashin <pasha.tatashin@soleen.com>
Cc: Peter Xu <peterx@redhat.com>
Cc: Alistair Popple <apopple@nvidia.com>
Cc: Jason Gunthorpe <jgg@nvidia.com>

[1] https://lkml.kernel.org/r/20250129115411.2077152-1-david@redhat.com

David Hildenbrand (4):
  lib/test_hmm: make dmirror_atomic_map() consume a single page
  mm/mmu_notifier: drop owner from MMU_NOTIFY_EXCLUSIVE
  mm/memory: pass folio and pte to restore_exclusive_pte()
  mm/memory: document restore_exclusive_pte()

 drivers/gpu/drm/nouveau/nouveau_svm.c |  6 +--
 include/linux/mmu_notifier.h          |  4 +-
 include/linux/rmap.h                  |  2 +-
 lib/test_hmm.c                        | 35 ++++++-----------
 mm/memory.c                           | 54 +++++++++++++++++++--------
 mm/rmap.c                             |  3 +-
 6 files changed, 54 insertions(+), 50 deletions(-)

-- 
2.48.1




* [PATCH v1 1/4] lib/test_hmm: make dmirror_atomic_map() consume a single page
  2025-01-29 11:57 [PATCH v1 0/4] mm: cleanups for device-exclusive entries (hmm) David Hildenbrand
@ 2025-01-29 11:57 ` David Hildenbrand
  2025-01-30  0:29   ` Alistair Popple
  2025-01-29 11:58 ` [PATCH v1 2/4] mm/mmu_notifier: drop owner from MMU_NOTIFY_EXCLUSIVE David Hildenbrand
                   ` (2 subsequent siblings)
  3 siblings, 1 reply; 20+ messages in thread
From: David Hildenbrand @ 2025-01-29 11:57 UTC (permalink / raw)
  To: linux-kernel
  Cc: linux-doc, dri-devel, linux-mm, nouveau, David Hildenbrand,
	Andrew Morton, Jérôme Glisse, Jonathan Corbet,
	Alex Shi, Yanteng Si, Karol Herbst, Lyude Paul, Danilo Krummrich,
	David Airlie, Simona Vetter, Liam R. Howlett, Lorenzo Stoakes,
	Vlastimil Babka, Jann Horn, Pasha Tatashin, Peter Xu,
	Alistair Popple, Jason Gunthorpe

The caller now always passes a single page; let's simplify, and return
"0" on success.

Signed-off-by: David Hildenbrand <david@redhat.com>
---
 lib/test_hmm.c | 33 ++++++++++-----------------------
 1 file changed, 10 insertions(+), 23 deletions(-)

diff --git a/lib/test_hmm.c b/lib/test_hmm.c
index 9e1b07a227a3..1c0a58279db9 100644
--- a/lib/test_hmm.c
+++ b/lib/test_hmm.c
@@ -706,34 +706,23 @@ static int dmirror_check_atomic(struct dmirror *dmirror, unsigned long start,
 	return 0;
 }
 
-static int dmirror_atomic_map(unsigned long start, unsigned long end,
-			      struct page **pages, struct dmirror *dmirror)
+static int dmirror_atomic_map(unsigned long addr, struct page *page,
+		struct dmirror *dmirror)
 {
-	unsigned long pfn, mapped = 0;
-	int i;
+	void *entry;
 
 	/* Map the migrated pages into the device's page tables. */
 	mutex_lock(&dmirror->mutex);
 
-	for (i = 0, pfn = start >> PAGE_SHIFT; pfn < (end >> PAGE_SHIFT); pfn++, i++) {
-		void *entry;
-
-		if (!pages[i])
-			continue;
-
-		entry = pages[i];
-		entry = xa_tag_pointer(entry, DPT_XA_TAG_ATOMIC);
-		entry = xa_store(&dmirror->pt, pfn, entry, GFP_ATOMIC);
-		if (xa_is_err(entry)) {
-			mutex_unlock(&dmirror->mutex);
-			return xa_err(entry);
-		}
-
-		mapped++;
+	entry = xa_tag_pointer(page, DPT_XA_TAG_ATOMIC);
+	entry = xa_store(&dmirror->pt, addr >> PAGE_SHIFT, entry, GFP_ATOMIC);
+	if (xa_is_err(entry)) {
+		mutex_unlock(&dmirror->mutex);
+		return xa_err(entry);
 	}
 
 	mutex_unlock(&dmirror->mutex);
-	return mapped;
+	return 0;
 }
 
 static int dmirror_migrate_finalize_and_map(struct migrate_vma *args,
@@ -803,9 +792,7 @@ static int dmirror_exclusive(struct dmirror *dmirror,
 			break;
 		}
 
-		ret = dmirror_atomic_map(addr, addr + PAGE_SIZE, &page, dmirror);
-		if (!ret)
-			ret = -EBUSY;
+		ret = dmirror_atomic_map(addr, page, dmirror);
 		folio_unlock(folio);
 		folio_put(folio);
 
-- 
2.48.1




* [PATCH v1 2/4] mm/mmu_notifier: drop owner from MMU_NOTIFY_EXCLUSIVE
  2025-01-29 11:57 [PATCH v1 0/4] mm: cleanups for device-exclusive entries (hmm) David Hildenbrand
  2025-01-29 11:57 ` [PATCH v1 1/4] lib/test_hmm: make dmirror_atomic_map() consume a single page David Hildenbrand
@ 2025-01-29 11:58 ` David Hildenbrand
  2025-01-30  5:34   ` Alistair Popple
  2025-01-29 11:58 ` [PATCH v1 3/4] mm/memory: pass folio and pte to restore_exclusive_pte() David Hildenbrand
  2025-01-29 11:58 ` [PATCH v1 4/4] mm/memory: document restore_exclusive_pte() David Hildenbrand
  3 siblings, 1 reply; 20+ messages in thread
From: David Hildenbrand @ 2025-01-29 11:58 UTC (permalink / raw)
  To: linux-kernel
  Cc: linux-doc, dri-devel, linux-mm, nouveau, David Hildenbrand,
	Andrew Morton, Jérôme Glisse, Jonathan Corbet,
	Alex Shi, Yanteng Si, Karol Herbst, Lyude Paul, Danilo Krummrich,
	David Airlie, Simona Vetter, Liam R. Howlett, Lorenzo Stoakes,
	Vlastimil Babka, Jann Horn, Pasha Tatashin, Peter Xu,
	Alistair Popple, Jason Gunthorpe

We no longer get a MMU_NOTIFY_EXCLUSIVE on conversion with the owner set
that one has to filter out: if there already *is* a device-exclusive
entry (e.g., other device, we don't have that information), GUP will
convert it back to an ordinary PTE and notify via
remove_device_exclusive_entry().

Signed-off-by: David Hildenbrand <david@redhat.com>
---
 drivers/gpu/drm/nouveau/nouveau_svm.c | 6 +-----
 include/linux/mmu_notifier.h          | 4 +---
 include/linux/rmap.h                  | 2 +-
 lib/test_hmm.c                        | 2 +-
 mm/rmap.c                             | 3 +--
 5 files changed, 5 insertions(+), 12 deletions(-)

diff --git a/drivers/gpu/drm/nouveau/nouveau_svm.c b/drivers/gpu/drm/nouveau/nouveau_svm.c
index 39e3740980bb..4758fee182b4 100644
--- a/drivers/gpu/drm/nouveau/nouveau_svm.c
+++ b/drivers/gpu/drm/nouveau/nouveau_svm.c
@@ -510,10 +510,6 @@ static bool nouveau_svm_range_invalidate(struct mmu_interval_notifier *mni,
 	struct svm_notifier *sn =
 		container_of(mni, struct svm_notifier, notifier);
 
-	if (range->event == MMU_NOTIFY_EXCLUSIVE &&
-	    range->owner == sn->svmm->vmm->cli->drm->dev)
-		return true;
-
 	/*
 	 * serializes the update to mni->invalidate_seq done by caller and
 	 * prevents invalidation of the PTE from progressing while HW is being
@@ -609,7 +605,7 @@ static int nouveau_atomic_range_fault(struct nouveau_svmm *svmm,
 
 		notifier_seq = mmu_interval_read_begin(&notifier->notifier);
 		mmap_read_lock(mm);
-		page = make_device_exclusive(mm, start, drm->dev, &folio);
+		page = make_device_exclusive(mm, start, &folio);
 		mmap_read_unlock(mm);
 		if (IS_ERR(page)) {
 			ret = -EINVAL;
diff --git a/include/linux/mmu_notifier.h b/include/linux/mmu_notifier.h
index d4e714661826..bac2385099dd 100644
--- a/include/linux/mmu_notifier.h
+++ b/include/linux/mmu_notifier.h
@@ -44,9 +44,7 @@ struct mmu_interval_notifier;
  * owner field matches the driver's device private pgmap owner.
  *
  * @MMU_NOTIFY_EXCLUSIVE: to signal a device driver that the device will no
- * longer have exclusive access to the page. When sent during creation of an
- * exclusive range the owner will be initialised to the value provided by the
- * caller of make_device_exclusive(), otherwise the owner will be NULL.
+ * longer have exclusive access to the page.
  */
 enum mmu_notifier_event {
 	MMU_NOTIFY_UNMAP = 0,
diff --git a/include/linux/rmap.h b/include/linux/rmap.h
index 86425d42c1a9..3b216b91d2e5 100644
--- a/include/linux/rmap.h
+++ b/include/linux/rmap.h
@@ -664,7 +664,7 @@ void try_to_migrate(struct folio *folio, enum ttu_flags flags);
 void try_to_unmap(struct folio *, enum ttu_flags flags);
 
 struct page *make_device_exclusive(struct mm_struct *mm, unsigned long addr,
-		void *owner, struct folio **foliop);
+		struct folio **foliop);
 
 /* Avoid racy checks */
 #define PVMW_SYNC		(1 << 0)
diff --git a/lib/test_hmm.c b/lib/test_hmm.c
index 1c0a58279db9..8520c1d1b21b 100644
--- a/lib/test_hmm.c
+++ b/lib/test_hmm.c
@@ -786,7 +786,7 @@ static int dmirror_exclusive(struct dmirror *dmirror,
 		struct folio *folio;
 		struct page *page;
 
-		page = make_device_exclusive(mm, addr, &folio, NULL);
+		page = make_device_exclusive(mm, addr, &folio);
 		if (IS_ERR(page)) {
 			ret = PTR_ERR(page);
 			break;
diff --git a/mm/rmap.c b/mm/rmap.c
index 4acc9f6d743a..d99dbf59adc6 100644
--- a/mm/rmap.c
+++ b/mm/rmap.c
@@ -2397,7 +2397,6 @@ void try_to_migrate(struct folio *folio, enum ttu_flags flags)
  * make_device_exclusive() - Mark an address for exclusive use by a device
  * @mm: mm_struct of associated target process
  * @addr: the virtual address to mark for exclusive device access
- * @owner: passed to MMU_NOTIFY_EXCLUSIVE range notifier to allow filtering
  * @foliop: folio pointer will be stored here on success.
  *
  * This function looks up the page mapped at the given address, grabs a
@@ -2421,7 +2420,7 @@ void try_to_migrate(struct folio *folio, enum ttu_flags flags)
  * Returns: pointer to mapped page on success, otherwise a negative error.
  */
 struct page *make_device_exclusive(struct mm_struct *mm, unsigned long addr,
-		void *owner, struct folio **foliop)
+		struct folio **foliop)
 {
 	struct folio *folio, *fw_folio;
 	struct vm_area_struct *vma;
-- 
2.48.1




* [PATCH v1 3/4] mm/memory: pass folio and pte to restore_exclusive_pte()
  2025-01-29 11:57 [PATCH v1 0/4] mm: cleanups for device-exclusive entries (hmm) David Hildenbrand
  2025-01-29 11:57 ` [PATCH v1 1/4] lib/test_hmm: make dmirror_atomic_map() consume a single page David Hildenbrand
  2025-01-29 11:58 ` [PATCH v1 2/4] mm/mmu_notifier: drop owner from MMU_NOTIFY_EXCLUSIVE David Hildenbrand
@ 2025-01-29 11:58 ` David Hildenbrand
  2025-01-30  5:37   ` Alistair Popple
  2025-01-29 11:58 ` [PATCH v1 4/4] mm/memory: document restore_exclusive_pte() David Hildenbrand
  3 siblings, 1 reply; 20+ messages in thread
From: David Hildenbrand @ 2025-01-29 11:58 UTC (permalink / raw)
  To: linux-kernel
  Cc: linux-doc, dri-devel, linux-mm, nouveau, David Hildenbrand,
	Andrew Morton, Jérôme Glisse, Jonathan Corbet,
	Alex Shi, Yanteng Si, Karol Herbst, Lyude Paul, Danilo Krummrich,
	David Airlie, Simona Vetter, Liam R. Howlett, Lorenzo Stoakes,
	Vlastimil Babka, Jann Horn, Pasha Tatashin, Peter Xu,
	Alistair Popple, Jason Gunthorpe

Let's pass the folio and the pte to restore_exclusive_pte(), so we
can avoid repeated page_folio() and ptep_get(). To do that,
pass the pte to try_restore_exclusive_pte() and use a folio in there
already.

While at it, just avoid the "swp_entry_t entry" variable in
try_restore_exclusive_pte() and add a folio-locked check to
restore_exclusive_pte().

Signed-off-by: David Hildenbrand <david@redhat.com>
---
 mm/memory.c | 29 ++++++++++++++---------------
 1 file changed, 14 insertions(+), 15 deletions(-)

diff --git a/mm/memory.c b/mm/memory.c
index cd689cd8a7c8..46956994aaff 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -719,14 +719,13 @@ struct folio *vm_normal_folio_pmd(struct vm_area_struct *vma,
 #endif
 
 static void restore_exclusive_pte(struct vm_area_struct *vma,
-				  struct page *page, unsigned long address,
-				  pte_t *ptep)
+		struct folio *folio, struct page *page, unsigned long address,
+		pte_t *ptep, pte_t orig_pte)
 {
-	struct folio *folio = page_folio(page);
-	pte_t orig_pte;
 	pte_t pte;
 
-	orig_pte = ptep_get(ptep);
+	VM_WARN_ON_FOLIO(!folio_test_locked(folio), folio);
+
 	pte = pte_mkold(mk_pte(page, READ_ONCE(vma->vm_page_prot)));
 	if (pte_swp_soft_dirty(orig_pte))
 		pte = pte_mksoft_dirty(pte);
@@ -756,16 +755,15 @@ static void restore_exclusive_pte(struct vm_area_struct *vma,
  * Tries to restore an exclusive pte if the page lock can be acquired without
  * sleeping.
  */
-static int
-try_restore_exclusive_pte(pte_t *src_pte, struct vm_area_struct *vma,
-			unsigned long addr)
+static int try_restore_exclusive_pte(struct vm_area_struct *vma,
+		unsigned long addr, pte_t *ptep, pte_t orig_pte)
 {
-	swp_entry_t entry = pte_to_swp_entry(ptep_get(src_pte));
-	struct page *page = pfn_swap_entry_to_page(entry);
+	struct page *page = pfn_swap_entry_to_page(pte_to_swp_entry(orig_pte));
+	struct folio *folio = page_folio(page);
 
-	if (trylock_page(page)) {
-		restore_exclusive_pte(vma, page, addr, src_pte);
-		unlock_page(page);
+	if (folio_trylock(folio)) {
+		restore_exclusive_pte(vma, folio, page, addr, ptep, orig_pte);
+		folio_unlock(folio);
 		return 0;
 	}
 
@@ -871,7 +869,7 @@ copy_nonpresent_pte(struct mm_struct *dst_mm, struct mm_struct *src_mm,
 		 * (ie. COW) mappings.
 		 */
 		VM_BUG_ON(!is_cow_mapping(src_vma->vm_flags));
-		if (try_restore_exclusive_pte(src_pte, src_vma, addr))
+		if (try_restore_exclusive_pte(src_vma, addr, src_pte, orig_pte))
 			return -EBUSY;
 		return -ENOENT;
 	} else if (is_pte_marker_entry(entry)) {
@@ -3979,7 +3977,8 @@ static vm_fault_t remove_device_exclusive_entry(struct vm_fault *vmf)
 	vmf->pte = pte_offset_map_lock(vma->vm_mm, vmf->pmd, vmf->address,
 				&vmf->ptl);
 	if (likely(vmf->pte && pte_same(ptep_get(vmf->pte), vmf->orig_pte)))
-		restore_exclusive_pte(vma, vmf->page, vmf->address, vmf->pte);
+		restore_exclusive_pte(vma, folio, vmf->page, vmf->address,
+				      vmf->pte, vmf->orig_pte);
 
 	if (vmf->pte)
 		pte_unmap_unlock(vmf->pte, vmf->ptl);
-- 
2.48.1




* [PATCH v1 4/4] mm/memory: document restore_exclusive_pte()
  2025-01-29 11:57 [PATCH v1 0/4] mm: cleanups for device-exclusive entries (hmm) David Hildenbrand
                   ` (2 preceding siblings ...)
  2025-01-29 11:58 ` [PATCH v1 3/4] mm/memory: pass folio and pte to restore_exclusive_pte() David Hildenbrand
@ 2025-01-29 11:58 ` David Hildenbrand
  2025-01-30  0:27   ` Alistair Popple
  3 siblings, 1 reply; 20+ messages in thread
From: David Hildenbrand @ 2025-01-29 11:58 UTC (permalink / raw)
  To: linux-kernel
  Cc: linux-doc, dri-devel, linux-mm, nouveau, David Hildenbrand,
	Andrew Morton, Jérôme Glisse, Jonathan Corbet,
	Alex Shi, Yanteng Si, Karol Herbst, Lyude Paul, Danilo Krummrich,
	David Airlie, Simona Vetter, Liam R. Howlett, Lorenzo Stoakes,
	Vlastimil Babka, Jann Horn, Pasha Tatashin, Peter Xu,
	Alistair Popple, Jason Gunthorpe

Let's document how this function is to be used, and why the requirement
for the folio lock might maybe be dropped in the future.

Signed-off-by: David Hildenbrand <david@redhat.com>
---
 mm/memory.c | 25 +++++++++++++++++++++++++
 1 file changed, 25 insertions(+)

diff --git a/mm/memory.c b/mm/memory.c
index 46956994aaff..caaae8df11a9 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -718,6 +718,31 @@ struct folio *vm_normal_folio_pmd(struct vm_area_struct *vma,
 }
 #endif
 
+/**
+ * restore_exclusive_pte - Restore a device-exclusive entry
+ * @vma: VMA covering @address
+ * @folio: the mapped folio
+ * @page: the mapped folio page
+ * @address: the virtual address
+ * @ptep: PTE pointer into the locked page table mapping the folio page
+ * @orig_pte: PTE value at @ptep
+ *
+ * Restore a device-exclusive non-swap entry to an ordinary present PTE.
+ *
+ * The folio and the page table must be locked, and MMU notifiers must have
+ * been called to invalidate any (exclusive) device mappings. In case of
+ * fork(), MMU_NOTIFY_PROTECTION_PAGE is triggered, and in case of a page
+ * fault MMU_NOTIFY_EXCLUSIVE is triggered.
+ *
+ * Locking the folio makes sure that anybody who just converted the PTE to
+ * a device-private entry can map it into the device, before unlocking it; so
+ * the folio lock prevents concurrent conversion to device-exclusive.
+ *
+ * TODO: the folio lock does not protect against all cases of concurrent
+ * page table modifications (e.g., MADV_DONTNEED, mprotect), so device drivers
+ * must already use MMU notifiers to sync against any concurrent changes
+ * Maybe the requirement for the folio lock can be dropped in the future.
+ */
 static void restore_exclusive_pte(struct vm_area_struct *vma,
 		struct folio *folio, struct page *page, unsigned long address,
 		pte_t *ptep, pte_t orig_pte)
-- 
2.48.1




* Re: [PATCH v1 4/4] mm/memory: document restore_exclusive_pte()
  2025-01-29 11:58 ` [PATCH v1 4/4] mm/memory: document restore_exclusive_pte() David Hildenbrand
@ 2025-01-30  0:27   ` Alistair Popple
  2025-01-30  9:37     ` David Hildenbrand
  2025-01-30 10:43     ` Simona Vetter
  0 siblings, 2 replies; 20+ messages in thread
From: Alistair Popple @ 2025-01-30  0:27 UTC (permalink / raw)
  To: David Hildenbrand
  Cc: linux-kernel, linux-doc, dri-devel, linux-mm, nouveau,
	Andrew Morton, Jérôme Glisse, Jonathan Corbet,
	Alex Shi, Yanteng Si, Karol Herbst, Lyude Paul, Danilo Krummrich,
	David Airlie, Simona Vetter, Liam R. Howlett, Lorenzo Stoakes,
	Vlastimil Babka, Jann Horn, Pasha Tatashin, Peter Xu,
	Jason Gunthorpe

On Wed, Jan 29, 2025 at 12:58:02PM +0100, David Hildenbrand wrote:
> Let's document how this function is to be used, and why the requirement
> for the folio lock might maybe be dropped in the future.

Sorry, only just catching up on your other thread. The folio lock was to ensure
the GPU got a chance to make forward progress by mapping the page. Without it
the CPU could immediately invalidate the entry before the GPU had a chance to
retry the fault.

Obviously, performance-wise, such thrashing is terrible, so it should
really be avoided by userspace, but the lock at least allowed such programs
to complete.
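
A minimal sketch of the window this gives the device, pieced together from the
make_device_exclusive() callers in this series; program_device_pte() is a
made-up placeholder for the driver's actual mapping code:

	page = make_device_exclusive(mm, addr, &folio);
	if (IS_ERR(page))
		return PTR_ERR(page);
	/*
	 * The folio comes back locked. A concurrent CPU fault doing
	 * remove_device_exclusive_entry() can only restore the PTE once it
	 * holds the folio lock, so the device gets a chance to map the page.
	 */
	program_device_pte(page);	/* placeholder, not a real helper */
	folio_unlock(folio);
	folio_put(folio);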

> Signed-off-by: David Hildenbrand <david@redhat.com>
> ---
>  mm/memory.c | 25 +++++++++++++++++++++++++
>  1 file changed, 25 insertions(+)
> 
> diff --git a/mm/memory.c b/mm/memory.c
> index 46956994aaff..caaae8df11a9 100644
> --- a/mm/memory.c
> +++ b/mm/memory.c
> @@ -718,6 +718,31 @@ struct folio *vm_normal_folio_pmd(struct vm_area_struct *vma,
>  }
>  #endif
>  
> +/**
> + * restore_exclusive_pte - Restore a device-exclusive entry
> + * @vma: VMA covering @address
> + * @folio: the mapped folio
> + * @page: the mapped folio page
> + * @address: the virtual address
> + * @ptep: PTE pointer into the locked page table mapping the folio page
> + * @orig_pte: PTE value at @ptep
> + *
> + * Restore a device-exclusive non-swap entry to an ordinary present PTE.
> + *
> + * The folio and the page table must be locked, and MMU notifiers must have
> + * been called to invalidate any (exclusive) device mappings. In case of
> + * fork(), MMU_NOTIFY_PROTECTION_PAGE is triggered, and in case of a page
> + * fault MMU_NOTIFY_EXCLUSIVE is triggered.
> + *
> + * Locking the folio makes sure that anybody who just converted the PTE to
> + * a device-private entry can map it into the device, before unlocking it; so
> + * the folio lock prevents concurrent conversion to device-exclusive.

I don't quite follow this - a concurrent conversion would already fail
because the GUP in make_device_exclusive_range() would most likely cause
an unexpected reference during the migration. And if a migration entry
has already been installed for the device private PTE conversion then
make_device_exclusive_range() will skip it as a non-present entry anyway.

However s/device-private/device-exclusive/ makes sense - the intent was to allow
the device to map it before a call to restore_exclusive_pte() (ie. a CPU fault)
could convert it back to a normal PTE.

> + * TODO: the folio lock does not protect against all cases of concurrent
> + * page table modifications (e.g., MADV_DONTNEED, mprotect), so device drivers
> + * must already use MMU notifiers to sync against any concurrent changes

Right. It's expected drivers are using MMU notifiers to keep page tables in
sync, same as for hmm_range_fault().
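
For reference, a minimal sketch of that driver-side synchronization with an
interval notifier (names are illustrative, not from any real driver;
registration via mmu_interval_notifier_insert() and all device specifics are
omitted):

	struct my_ctx {
		struct mmu_interval_notifier notifier;
		struct mutex lock;
		/* ... device page-table state ... */
	};

	static bool my_invalidate(struct mmu_interval_notifier *mni,
				  const struct mmu_notifier_range *range,
				  unsigned long cur_seq)
	{
		struct my_ctx *ctx = container_of(mni, struct my_ctx, notifier);

		if (!mmu_notifier_range_blockable(range))
			return false;

		mutex_lock(&ctx->lock);
		mmu_interval_set_seq(mni, cur_seq);
		/* tear down device mappings covering [range->start, range->end) */
		mutex_unlock(&ctx->lock);
		return true;
	}

	static const struct mmu_interval_notifier_ops my_ops = {
		.invalidate = my_invalidate,
	};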

> + * Maybe the requirement for the folio lock can be dropped in the future.
> + */
>  static void restore_exclusive_pte(struct vm_area_struct *vma,
>  		struct folio *folio, struct page *page, unsigned long address,
>  		pte_t *ptep, pte_t orig_pte)
> -- 
> 2.48.1
> 



* Re: [PATCH v1 1/4] lib/test_hmm: make dmirror_atomic_map() consume a single page
  2025-01-29 11:57 ` [PATCH v1 1/4] lib/test_hmm: make dmirror_atomic_map() consume a single page David Hildenbrand
@ 2025-01-30  0:29   ` Alistair Popple
  0 siblings, 0 replies; 20+ messages in thread
From: Alistair Popple @ 2025-01-30  0:29 UTC (permalink / raw)
  To: David Hildenbrand
  Cc: linux-kernel, linux-doc, dri-devel, linux-mm, nouveau,
	Andrew Morton, Jérôme Glisse, Jonathan Corbet,
	Alex Shi, Yanteng Si, Karol Herbst, Lyude Paul, Danilo Krummrich,
	David Airlie, Simona Vetter, Liam R. Howlett, Lorenzo Stoakes,
	Vlastimil Babka, Jann Horn, Pasha Tatashin, Peter Xu,
	Jason Gunthorpe

On Wed, Jan 29, 2025 at 12:57:59PM +0100, David Hildenbrand wrote:
> The caller now always passes a single page; let's simplify, and return
> "0" on success.

Thanks for cleaning that up.

Reviewed-by: Alistair Popple <apopple@nvidia.com>

> Signed-off-by: David Hildenbrand <david@redhat.com>
> ---
>  lib/test_hmm.c | 33 ++++++++++-----------------------
>  1 file changed, 10 insertions(+), 23 deletions(-)
> 
> diff --git a/lib/test_hmm.c b/lib/test_hmm.c
> index 9e1b07a227a3..1c0a58279db9 100644
> --- a/lib/test_hmm.c
> +++ b/lib/test_hmm.c
> @@ -706,34 +706,23 @@ static int dmirror_check_atomic(struct dmirror *dmirror, unsigned long start,
>  	return 0;
>  }
>  
> -static int dmirror_atomic_map(unsigned long start, unsigned long end,
> -			      struct page **pages, struct dmirror *dmirror)
> +static int dmirror_atomic_map(unsigned long addr, struct page *page,
> +		struct dmirror *dmirror)
>  {
> -	unsigned long pfn, mapped = 0;
> -	int i;
> +	void *entry;
>  
>  	/* Map the migrated pages into the device's page tables. */
>  	mutex_lock(&dmirror->mutex);
>  
> -	for (i = 0, pfn = start >> PAGE_SHIFT; pfn < (end >> PAGE_SHIFT); pfn++, i++) {
> -		void *entry;
> -
> -		if (!pages[i])
> -			continue;
> -
> -		entry = pages[i];
> -		entry = xa_tag_pointer(entry, DPT_XA_TAG_ATOMIC);
> -		entry = xa_store(&dmirror->pt, pfn, entry, GFP_ATOMIC);
> -		if (xa_is_err(entry)) {
> -			mutex_unlock(&dmirror->mutex);
> -			return xa_err(entry);
> -		}
> -
> -		mapped++;
> +	entry = xa_tag_pointer(page, DPT_XA_TAG_ATOMIC);
> +	entry = xa_store(&dmirror->pt, addr >> PAGE_SHIFT, entry, GFP_ATOMIC);
> +	if (xa_is_err(entry)) {
> +		mutex_unlock(&dmirror->mutex);
> +		return xa_err(entry);
>  	}
>  
>  	mutex_unlock(&dmirror->mutex);
> -	return mapped;
> +	return 0;
>  }
>  
>  static int dmirror_migrate_finalize_and_map(struct migrate_vma *args,
> @@ -803,9 +792,7 @@ static int dmirror_exclusive(struct dmirror *dmirror,
>  			break;
>  		}
>  
> -		ret = dmirror_atomic_map(addr, addr + PAGE_SIZE, &page, dmirror);
> -		if (!ret)
> -			ret = -EBUSY;
> +		ret = dmirror_atomic_map(addr, page, dmirror);
>  		folio_unlock(folio);
>  		folio_put(folio);
>  
> -- 
> 2.48.1
> 



* Re: [PATCH v1 2/4] mm/mmu_notifier: drop owner from MMU_NOTIFY_EXCLUSIVE
  2025-01-29 11:58 ` [PATCH v1 2/4] mm/mmu_notifier: drop owner from MMU_NOTIFY_EXCLUSIVE David Hildenbrand
@ 2025-01-30  5:34   ` Alistair Popple
  2025-01-30  9:28     ` David Hildenbrand
  0 siblings, 1 reply; 20+ messages in thread
From: Alistair Popple @ 2025-01-30  5:34 UTC (permalink / raw)
  To: David Hildenbrand
  Cc: linux-kernel, linux-doc, dri-devel, linux-mm, nouveau,
	Andrew Morton, Jérôme Glisse, Jonathan Corbet,
	Alex Shi, Yanteng Si, Karol Herbst, Lyude Paul, Danilo Krummrich,
	David Airlie, Simona Vetter, Liam R. Howlett, Lorenzo Stoakes,
	Vlastimil Babka, Jann Horn, Pasha Tatashin, Peter Xu,
	Jason Gunthorpe

On Wed, Jan 29, 2025 at 12:58:00PM +0100, David Hildenbrand wrote:
> We no longer get a MMU_NOTIFY_EXCLUSIVE on conversion with the owner set
> that one has to filter out: if there already *is* a device-exclusive
> entry (e.g., other device, we don't have that information), GUP will
> convert it back to an ordinary PTE and notify via
> remove_device_exclusive_entry().

What tree is this against? I tried applying to v6.13 and Linus current master
but neither applied cleanly.
 
> Signed-off-by: David Hildenbrand <david@redhat.com>
> ---
>  drivers/gpu/drm/nouveau/nouveau_svm.c | 6 +-----
>  include/linux/mmu_notifier.h          | 4 +---
>  include/linux/rmap.h                  | 2 +-
>  lib/test_hmm.c                        | 2 +-
>  mm/rmap.c                             | 3 +--
>  5 files changed, 5 insertions(+), 12 deletions(-)
> 
> diff --git a/drivers/gpu/drm/nouveau/nouveau_svm.c b/drivers/gpu/drm/nouveau/nouveau_svm.c
> index 39e3740980bb..4758fee182b4 100644
> --- a/drivers/gpu/drm/nouveau/nouveau_svm.c
> +++ b/drivers/gpu/drm/nouveau/nouveau_svm.c
> @@ -510,10 +510,6 @@ static bool nouveau_svm_range_invalidate(struct mmu_interval_notifier *mni,
>  	struct svm_notifier *sn =
>  		container_of(mni, struct svm_notifier, notifier);
>  
> -	if (range->event == MMU_NOTIFY_EXCLUSIVE &&
> -	    range->owner == sn->svmm->vmm->cli->drm->dev)
> -		return true;

I think this will cause a live-lock because make_device_exclusive_range()
will call the notifier, which, without the filtering, will increment the sequence
count and cause endless retries of the loop in nouveau_atomic_range_fault().
The notifier needs to be able to figure out whether it was called in response to
something this thread did (ie. make_device_exclusive_range), in which case it can
ignore the invalidation, or from some other thread.
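
To spell that out, the loop in question looks roughly like this (a heavily
simplified sketch of nouveau_atomic_range_fault() using the new
make_device_exclusive() signature; locking and error paths trimmed):

	struct folio *folio;
	struct page *page;
	unsigned long seq;

	do {
		seq = mmu_interval_read_begin(&notifier->notifier);
		mmap_read_lock(mm);
		page = make_device_exclusive(mm, start, &folio);
		mmap_read_unlock(mm);
		if (IS_ERR(page))
			return PTR_ERR(page);

		/* program the device PTE here, then drop the folio lock */
		folio_unlock(folio);
		folio_put(folio);
		/*
		 * make_device_exclusive() itself fired MMU_NOTIFY_EXCLUSIVE.
		 * If the driver's invalidate callback cannot recognize that
		 * invalidation as its own, it bumps the sequence count and
		 * this retry check fails every time: a live-lock.
		 */
	} while (mmu_interval_read_retry(&notifier->notifier, seq));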

Looking at hmm_test I see that it doesn't use the sequence counter to ensure
the PTE remains valid whilst it is mapped. I think that is probably wrong, so
apologies if that led you astray.

>  	/*
>  	 * serializes the update to mni->invalidate_seq done by caller and
>  	 * prevents invalidation of the PTE from progressing while HW is being
> @@ -609,7 +605,7 @@ static int nouveau_atomic_range_fault(struct nouveau_svmm *svmm,
>  
>  		notifier_seq = mmu_interval_read_begin(&notifier->notifier);
>  		mmap_read_lock(mm);
> -		page = make_device_exclusive(mm, start, drm->dev, &folio);
> +		page = make_device_exclusive(mm, start, &folio);
>  		mmap_read_unlock(mm);
>  		if (IS_ERR(page)) {
>  			ret = -EINVAL;
> diff --git a/include/linux/mmu_notifier.h b/include/linux/mmu_notifier.h
> index d4e714661826..bac2385099dd 100644
> --- a/include/linux/mmu_notifier.h
> +++ b/include/linux/mmu_notifier.h
> @@ -44,9 +44,7 @@ struct mmu_interval_notifier;
>   * owner field matches the driver's device private pgmap owner.
>   *
>   * @MMU_NOTIFY_EXCLUSIVE: to signal a device driver that the device will no
> - * longer have exclusive access to the page. When sent during creation of an
> - * exclusive range the owner will be initialised to the value provided by the
> - * caller of make_device_exclusive(), otherwise the owner will be NULL.
> + * longer have exclusive access to the page.
>   */
>  enum mmu_notifier_event {
>  	MMU_NOTIFY_UNMAP = 0,
> diff --git a/include/linux/rmap.h b/include/linux/rmap.h
> index 86425d42c1a9..3b216b91d2e5 100644
> --- a/include/linux/rmap.h
> +++ b/include/linux/rmap.h
> @@ -664,7 +664,7 @@ void try_to_migrate(struct folio *folio, enum ttu_flags flags);
>  void try_to_unmap(struct folio *, enum ttu_flags flags);
>  
>  struct page *make_device_exclusive(struct mm_struct *mm, unsigned long addr,
> -		void *owner, struct folio **foliop);
> +		struct folio **foliop);
>  
>  /* Avoid racy checks */
>  #define PVMW_SYNC		(1 << 0)
> diff --git a/lib/test_hmm.c b/lib/test_hmm.c
> index 1c0a58279db9..8520c1d1b21b 100644
> --- a/lib/test_hmm.c
> +++ b/lib/test_hmm.c
> @@ -786,7 +786,7 @@ static int dmirror_exclusive(struct dmirror *dmirror,
>  		struct folio *folio;
>  		struct page *page;
>  
> -		page = make_device_exclusive(mm, addr, &folio, NULL);
> +		page = make_device_exclusive(mm, addr, &folio);
>  		if (IS_ERR(page)) {
>  			ret = PTR_ERR(page);
>  			break;
> diff --git a/mm/rmap.c b/mm/rmap.c
> index 4acc9f6d743a..d99dbf59adc6 100644
> --- a/mm/rmap.c
> +++ b/mm/rmap.c
> @@ -2397,7 +2397,6 @@ void try_to_migrate(struct folio *folio, enum ttu_flags flags)
>   * make_device_exclusive() - Mark an address for exclusive use by a device
>   * @mm: mm_struct of associated target process
>   * @addr: the virtual address to mark for exclusive device access
> - * @owner: passed to MMU_NOTIFY_EXCLUSIVE range notifier to allow filtering
>   * @foliop: folio pointer will be stored here on success.
>   *
>   * This function looks up the page mapped at the given address, grabs a
> @@ -2421,7 +2420,7 @@ void try_to_migrate(struct folio *folio, enum ttu_flags flags)
>   * Returns: pointer to mapped page on success, otherwise a negative error.
>   */
>  struct page *make_device_exclusive(struct mm_struct *mm, unsigned long addr,
> -		void *owner, struct folio **foliop)
> +		struct folio **foliop)
>  {
>  	struct folio *folio, *fw_folio;
>  	struct vm_area_struct *vma;
> -- 
> 2.48.1
> 



* Re: [PATCH v1 3/4] mm/memory: pass folio and pte to restore_exclusive_pte()
  2025-01-29 11:58 ` [PATCH v1 3/4] mm/memory: pass folio and pte to restore_exclusive_pte() David Hildenbrand
@ 2025-01-30  5:37   ` Alistair Popple
  0 siblings, 0 replies; 20+ messages in thread
From: Alistair Popple @ 2025-01-30  5:37 UTC (permalink / raw)
  To: David Hildenbrand
  Cc: linux-kernel, linux-doc, dri-devel, linux-mm, nouveau,
	Andrew Morton, Jérôme Glisse, Jonathan Corbet,
	Alex Shi, Yanteng Si, Karol Herbst, Lyude Paul, Danilo Krummrich,
	David Airlie, Simona Vetter, Liam R. Howlett, Lorenzo Stoakes,
	Vlastimil Babka, Jann Horn, Pasha Tatashin, Peter Xu,
	Jason Gunthorpe

On Wed, Jan 29, 2025 at 12:58:01PM +0100, David Hildenbrand wrote:
> Let's pass the folio and the pte to restore_exclusive_pte(), so we
> can avoid repeated page_folio() and ptep_get(). To do that,
> pass the pte to try_restore_exclusive_pte() and use a folio in there
> already.
> 
> While at it, just avoid the "swp_entry_t entry" variable in
> try_restore_exclusive_pte() and add a folio-locked check to
> restore_exclusive_pte().

Seems reasonable.

Reviewed-by: Alistair Popple <apopple@nvidia.com>
 
> Signed-off-by: David Hildenbrand <david@redhat.com>
> ---
>  mm/memory.c | 29 ++++++++++++++---------------
>  1 file changed, 14 insertions(+), 15 deletions(-)
> 
> diff --git a/mm/memory.c b/mm/memory.c
> index cd689cd8a7c8..46956994aaff 100644
> --- a/mm/memory.c
> +++ b/mm/memory.c
> @@ -719,14 +719,13 @@ struct folio *vm_normal_folio_pmd(struct vm_area_struct *vma,
>  #endif
>  
>  static void restore_exclusive_pte(struct vm_area_struct *vma,
> -				  struct page *page, unsigned long address,
> -				  pte_t *ptep)
> +		struct folio *folio, struct page *page, unsigned long address,
> +		pte_t *ptep, pte_t orig_pte)
>  {
> -	struct folio *folio = page_folio(page);
> -	pte_t orig_pte;
>  	pte_t pte;
>  
> -	orig_pte = ptep_get(ptep);
> +	VM_WARN_ON_FOLIO(!folio_test_locked(folio), folio);
> +
>  	pte = pte_mkold(mk_pte(page, READ_ONCE(vma->vm_page_prot)));
>  	if (pte_swp_soft_dirty(orig_pte))
>  		pte = pte_mksoft_dirty(pte);
> @@ -756,16 +755,15 @@ static void restore_exclusive_pte(struct vm_area_struct *vma,
>   * Tries to restore an exclusive pte if the page lock can be acquired without
>   * sleeping.
>   */
> -static int
> -try_restore_exclusive_pte(pte_t *src_pte, struct vm_area_struct *vma,
> -			unsigned long addr)
> +static int try_restore_exclusive_pte(struct vm_area_struct *vma,
> +		unsigned long addr, pte_t *ptep, pte_t orig_pte)
>  {
> -	swp_entry_t entry = pte_to_swp_entry(ptep_get(src_pte));
> -	struct page *page = pfn_swap_entry_to_page(entry);
> +	struct page *page = pfn_swap_entry_to_page(pte_to_swp_entry(orig_pte));
> +	struct folio *folio = page_folio(page);
>  
> -	if (trylock_page(page)) {
> -		restore_exclusive_pte(vma, page, addr, src_pte);
> -		unlock_page(page);
> +	if (folio_trylock(folio)) {
> +		restore_exclusive_pte(vma, folio, page, addr, ptep, orig_pte);
> +		folio_unlock(folio);
>  		return 0;
>  	}
>  
> @@ -871,7 +869,7 @@ copy_nonpresent_pte(struct mm_struct *dst_mm, struct mm_struct *src_mm,
>  		 * (ie. COW) mappings.
>  		 */
>  		VM_BUG_ON(!is_cow_mapping(src_vma->vm_flags));
> -		if (try_restore_exclusive_pte(src_pte, src_vma, addr))
> +		if (try_restore_exclusive_pte(src_vma, addr, src_pte, orig_pte))
>  			return -EBUSY;
>  		return -ENOENT;
>  	} else if (is_pte_marker_entry(entry)) {
> @@ -3979,7 +3977,8 @@ static vm_fault_t remove_device_exclusive_entry(struct vm_fault *vmf)
>  	vmf->pte = pte_offset_map_lock(vma->vm_mm, vmf->pmd, vmf->address,
>  				&vmf->ptl);
>  	if (likely(vmf->pte && pte_same(ptep_get(vmf->pte), vmf->orig_pte)))
> -		restore_exclusive_pte(vma, vmf->page, vmf->address, vmf->pte);
> +		restore_exclusive_pte(vma, folio, vmf->page, vmf->address,
> +				      vmf->pte, vmf->orig_pte);
>  
>  	if (vmf->pte)
>  		pte_unmap_unlock(vmf->pte, vmf->ptl);
> -- 
> 2.48.1
> 



* Re: [PATCH v1 2/4] mm/mmu_notifier: drop owner from MMU_NOTIFY_EXCLUSIVE
  2025-01-30  5:34   ` Alistair Popple
@ 2025-01-30  9:28     ` David Hildenbrand
  2025-01-30 13:29       ` Simona Vetter
  0 siblings, 1 reply; 20+ messages in thread
From: David Hildenbrand @ 2025-01-30  9:28 UTC (permalink / raw)
  To: Alistair Popple
  Cc: linux-kernel, linux-doc, dri-devel, linux-mm, nouveau,
	Andrew Morton, Jérôme Glisse, Jonathan Corbet,
	Alex Shi, Yanteng Si, Karol Herbst, Lyude Paul, Danilo Krummrich,
	David Airlie, Simona Vetter, Liam R. Howlett, Lorenzo Stoakes,
	Vlastimil Babka, Jann Horn, Pasha Tatashin, Peter Xu,
	Jason Gunthorpe

On 30.01.25 06:34, Alistair Popple wrote:
> On Wed, Jan 29, 2025 at 12:58:00PM +0100, David Hildenbrand wrote:
>> We no longer get a MMU_NOTIFY_EXCLUSIVE on conversion with the owner set
>> that one has to filter out: if there already *is* a device-exclusive
>> entry (e.g., other device, we don't have that information), GUP will
>> convert it back to an ordinary PTE and notify via
>> remove_device_exclusive_entry().
> 
> What tree is this against? I tried applying to v6.13 and Linus current master
> but neither applied cleanly.

See the cover letter. This is on top of the fixes series, which is based 
on mm-unstable from yesterday.

>   
>> Signed-off-by: David Hildenbrand <david@redhat.com>
>> ---
>>   drivers/gpu/drm/nouveau/nouveau_svm.c | 6 +-----
>>   include/linux/mmu_notifier.h          | 4 +---
>>   include/linux/rmap.h                  | 2 +-
>>   lib/test_hmm.c                        | 2 +-
>>   mm/rmap.c                             | 3 +--
>>   5 files changed, 5 insertions(+), 12 deletions(-)
>>
>> diff --git a/drivers/gpu/drm/nouveau/nouveau_svm.c b/drivers/gpu/drm/nouveau/nouveau_svm.c
>> index 39e3740980bb..4758fee182b4 100644
>> --- a/drivers/gpu/drm/nouveau/nouveau_svm.c
>> +++ b/drivers/gpu/drm/nouveau/nouveau_svm.c
>> @@ -510,10 +510,6 @@ static bool nouveau_svm_range_invalidate(struct mmu_interval_notifier *mni,
>>   	struct svm_notifier *sn =
>>   		container_of(mni, struct svm_notifier, notifier);
>>   
>> -	if (range->event == MMU_NOTIFY_EXCLUSIVE &&
>> -	    range->owner == sn->svmm->vmm->cli->drm->dev)
>> -		return true;
> 
> I think this will cause a live-lock because make_device_exclusive_range()
> will call the notifier which without the filtering will increment the sequence
> count and cause endless retries of the loop in nouveau_atomic_range_fault().
> The notifier needs to be able to figure out if it was called in response to
> something this thread did (ie. make_device_exclusive_range) and can therefore
> ignore the invalidation, or from some other thread.

Yes, as discussed in the other patch, this must stay to inform secondary 
MMUs about the conversion *to* device exclusive.

> 
> Looking at hmm_test I see that doesn't use the sequence counter to ensure
> the PTE remains valid whilst it is mapped. I think that is probably wrong, so
> apologies if that lead you astray.

Yes, the hmm_test does not completely follow the same model the nouveau 
implementation does; so it might not be completely correct.

-- 
Cheers,

David / dhildenb




* Re: [PATCH v1 4/4] mm/memory: document restore_exclusive_pte()
  2025-01-30  0:27   ` Alistair Popple
@ 2025-01-30  9:37     ` David Hildenbrand
  2025-01-30 13:31       ` Simona Vetter
  2025-01-30 10:43     ` Simona Vetter
  1 sibling, 1 reply; 20+ messages in thread
From: David Hildenbrand @ 2025-01-30  9:37 UTC (permalink / raw)
  To: Alistair Popple
  Cc: linux-kernel, linux-doc, dri-devel, linux-mm, nouveau,
	Andrew Morton, Jérôme Glisse, Jonathan Corbet,
	Alex Shi, Yanteng Si, Karol Herbst, Lyude Paul, Danilo Krummrich,
	David Airlie, Simona Vetter, Liam R. Howlett, Lorenzo Stoakes,
	Vlastimil Babka, Jann Horn, Pasha Tatashin, Peter Xu,
	Jason Gunthorpe

On 30.01.25 01:27, Alistair Popple wrote:
> On Wed, Jan 29, 2025 at 12:58:02PM +0100, David Hildenbrand wrote:
>> Let's document how this function is to be used, and why the requirement
>> for the folio lock might maybe be dropped in the future.
> 
> Sorry, only just catching up on your other thread. The folio lock was to ensure
> the GPU got a chance to make forward progress by mapping the page. Without it
> the CPU could immediately invalidate the entry before the GPU had a chance to
> retry the fault.
 > > Obviously performance wise having such thrashing is terrible, so should
> really be avoided by userspace, but the lock at least allowed such programs
> to complete.

Thanks for the clarification. So it's relevant that the MMU notifier in 
remove_device_exclusive_entry() is sent after taking the folio lock.

However, as soon as we drop the folio lock, 
remove_device_exclusive_entry() will become active, lock the folio and 
trigger the MMU notifier.

So the time it is actually mapped into the device is rather

> 
>> Signed-off-by: David Hildenbrand <david@redhat.com>
>> ---
>>   mm/memory.c | 25 +++++++++++++++++++++++++
>>   1 file changed, 25 insertions(+)
>>
>> diff --git a/mm/memory.c b/mm/memory.c
>> index 46956994aaff..caaae8df11a9 100644
>> --- a/mm/memory.c
>> +++ b/mm/memory.c
>> @@ -718,6 +718,31 @@ struct folio *vm_normal_folio_pmd(struct vm_area_struct *vma,
>>   }
>>   #endif
>>   
>> +/**
>> + * restore_exclusive_pte - Restore a device-exclusive entry
>> + * @vma: VMA covering @address
>> + * @folio: the mapped folio
>> + * @page: the mapped folio page
>> + * @address: the virtual address
>> + * @ptep: PTE pointer into the locked page table mapping the folio page
>> + * @orig_pte: PTE value at @ptep
>> + *
>> + * Restore a device-exclusive non-swap entry to an ordinary present PTE.
>> + *
>> + * The folio and the page table must be locked, and MMU notifiers must have
>> + * been called to invalidate any (exclusive) device mappings. In case of
>> + * fork(), MMU_NOTIFY_PROTECTION_PAGE is triggered, and in case of a page
>> + * fault MMU_NOTIFY_EXCLUSIVE is triggered.
>> + *
>> + * Locking the folio makes sure that anybody who just converted the PTE to
>> + * a device-private entry can map it into the device, before unlocking it; so
>> + * the folio lock prevents concurrent conversion to device-exclusive.
> 
> I don't quite follow this - a concurrent conversion would already fail
> because the GUP in make_device_exclusive_range() would most likely cause
> an unexpected reference during the migration. And if a migration entry
> has already been installed for the device private PTE conversion then
> make_device_exclusive_range() will skip it as a non-present entry anyway.

Sorry, I meant "device-exclusive", so migration is not a concern.

> 
> However s/device-private/device-exclusive/ makes sense - the intent was to allow
> the device to map it before a call to restore_exclusive_pte() (ie. a CPU fault)
> could convert it back to a normal PTE.
> 
>> + * TODO: the folio lock does not protect against all cases of concurrent
>> + * page table modifications (e.g., MADV_DONTNEED, mprotect), so device drivers
>> + * must already use MMU notifiers to sync against any concurrent changes
> 
> Right. It's expected drivers are using MMU notifiers to keep page tables in
> sync, same as for hmm_range_fault().

Let me try to rephrase it given that the folio lock is purely to 
guarantee forward-progress, not for correctness; that's what MMU 
notifiers must be used for.

-- 
Cheers,

David / dhildenb




* Re: [PATCH v1 4/4] mm/memory: document restore_exclusive_pte()
  2025-01-30  0:27   ` Alistair Popple
  2025-01-30  9:37     ` David Hildenbrand
@ 2025-01-30 10:43     ` Simona Vetter
  2025-01-31  0:20       ` Alistair Popple
  1 sibling, 1 reply; 20+ messages in thread
From: Simona Vetter @ 2025-01-30 10:43 UTC (permalink / raw)
  To: Alistair Popple
  Cc: David Hildenbrand, linux-kernel, linux-doc, dri-devel, linux-mm,
	nouveau, Andrew Morton, Jérôme Glisse, Jonathan Corbet,
	Alex Shi, Yanteng Si, Karol Herbst, Lyude Paul, Danilo Krummrich,
	David Airlie, Simona Vetter, Liam R. Howlett, Lorenzo Stoakes,
	Vlastimil Babka, Jann Horn, Pasha Tatashin, Peter Xu,
	Jason Gunthorpe

On Thu, Jan 30, 2025 at 11:27:37AM +1100, Alistair Popple wrote:
> On Wed, Jan 29, 2025 at 12:58:02PM +0100, David Hildenbrand wrote:
> > Let's document how this function is to be used, and why the requirement
> > for the folio lock might maybe be dropped in the future.
> 
> Sorry, only just catching up on your other thread. The folio lock was to ensure
> the GPU got a chance to make forward progress by mapping the page. Without it
> the CPU could immediately invalidate the entry before the GPU had a chance to
> retry the fault.
> 
> Obviously performance wise having such thrashing is terrible, so should
> really be avoided by userspace, but the lock at least allowed such programs
> to complete.

Imo this is not a legit use-case. If userspace concurrently (instead of
clearly alternating) uses the same 4k page for gpu atomics and on the cpu,
it just gets to keep the fallout.

Plus there's no guarantee that we hold the folio_lock long enough for the
gpu to actually complete the atomic, so this isn't even really helping
with forward progress even if this somehow would be a legit usecase.

But this is also why thp is potentially an issue, because if thp
constantly creates pmd entries that potentially causes false sharing and
we do have thrashing that shouldn't happen.
-Sima

> > Signed-off-by: David Hildenbrand <david@redhat.com>
> > ---
> >  mm/memory.c | 25 +++++++++++++++++++++++++
> >  1 file changed, 25 insertions(+)
> > 
> > diff --git a/mm/memory.c b/mm/memory.c
> > index 46956994aaff..caaae8df11a9 100644
> > --- a/mm/memory.c
> > +++ b/mm/memory.c
> > @@ -718,6 +718,31 @@ struct folio *vm_normal_folio_pmd(struct vm_area_struct *vma,
> >  }
> >  #endif
> >  
> > +/**
> > + * restore_exclusive_pte - Restore a device-exclusive entry
> > + * @vma: VMA covering @address
> > + * @folio: the mapped folio
> > + * @page: the mapped folio page
> > + * @address: the virtual address
> > + * @ptep: PTE pointer into the locked page table mapping the folio page
> > + * @orig_pte: PTE value at @ptep
> > + *
> > + * Restore a device-exclusive non-swap entry to an ordinary present PTE.
> > + *
> > + * The folio and the page table must be locked, and MMU notifiers must have
> > + * been called to invalidate any (exclusive) device mappings. In case of
> > + * fork(), MMU_NOTIFY_PROTECTION_PAGE is triggered, and in case of a page
> > + * fault MMU_NOTIFY_EXCLUSIVE is triggered.
> > + *
> > + * Locking the folio makes sure that anybody who just converted the PTE to
> > + * a device-private entry can map it into the device, before unlocking it; so
> > + * the folio lock prevents concurrent conversion to device-exclusive.
> 
> I don't quite follow this - a concurrent conversion would already fail
> because the GUP in make_device_exclusive_range() would most likely cause
> an unexpected reference during the migration. And if a migration entry
> has already been installed for the device private PTE conversion then
> make_device_exclusive_range() will skip it as a non-present entry anyway.
> 
> However s/device-private/device-exclusive/ makes sense - the intent was to allow
> the device to map it before a call to restore_exclusive_pte() (ie. a CPU fault)
> could convert it back to a normal PTE.
> 
> > + * TODO: the folio lock does not protect against all cases of concurrent
> > + * page table modifications (e.g., MADV_DONTNEED, mprotect), so device drivers
> > + * must already use MMU notifiers to sync against any concurrent changes
> 
> Right. It's expected drivers are using MMU notifiers to keep page tables in
> sync, same as for hmm_range_fault().
> 
> > + * Maybe the requirement for the folio lock can be dropped in the future.
> > + */
> >  static void restore_exclusive_pte(struct vm_area_struct *vma,
> >  		struct folio *folio, struct page *page, unsigned long address,
> >  		pte_t *ptep, pte_t orig_pte)
> > -- 
> > 2.48.1
> > 

-- 
Simona Vetter
Software Engineer, Intel Corporation
http://blog.ffwll.ch



* Re: [PATCH v1 2/4] mm/mmu_notifier: drop owner from MMU_NOTIFY_EXCLUSIVE
  2025-01-30  9:28     ` David Hildenbrand
@ 2025-01-30 13:29       ` Simona Vetter
  2025-01-30 15:26         ` David Hildenbrand
  0 siblings, 1 reply; 20+ messages in thread
From: Simona Vetter @ 2025-01-30 13:29 UTC (permalink / raw)
  To: David Hildenbrand
  Cc: Alistair Popple, linux-kernel, linux-doc, dri-devel, linux-mm,
	nouveau, Andrew Morton, Jérôme Glisse, Jonathan Corbet,
	Alex Shi, Yanteng Si, Karol Herbst, Lyude Paul, Danilo Krummrich,
	David Airlie, Simona Vetter, Liam R. Howlett, Lorenzo Stoakes,
	Vlastimil Babka, Jann Horn, Pasha Tatashin, Peter Xu,
	Jason Gunthorpe

On Thu, Jan 30, 2025 at 10:28:00AM +0100, David Hildenbrand wrote:
> On 30.01.25 06:34, Alistair Popple wrote:
> > Looking at hmm_test I see that doesn't use the sequence counter to ensure
> > the PTE remains valid whilst it is mapped. I think that is probably wrong, so
> > apologies if that lead you astray.
> 
> Yes, the hmm_test does not completely follow the same model the nouveau
> implementation does; so it might not be completely correct.

A bit unrelated, but it just crossed my mind:

I guess another crucial difference is that the hw (probably, not sure)
will restart the fault if we don't repair it to its liking. So the
hmm-test does need some kind of retry loop too somewhere to match that.
But might be good to also still land some of the other improvements
discussed in these threads to make make_device_exclusive a bit more
reliable instead of relying on busy-looping through the hw fault handler
for everything.
-Sima
-- 
Simona Vetter
Software Engineer, Intel Corporation
http://blog.ffwll.ch



* Re: [PATCH v1 4/4] mm/memory: document restore_exclusive_pte()
  2025-01-30  9:37     ` David Hildenbrand
@ 2025-01-30 13:31       ` Simona Vetter
  2025-01-30 15:29         ` David Hildenbrand
  0 siblings, 1 reply; 20+ messages in thread
From: Simona Vetter @ 2025-01-30 13:31 UTC (permalink / raw)
  To: David Hildenbrand
  Cc: Alistair Popple, linux-kernel, linux-doc, dri-devel, linux-mm,
	nouveau, Andrew Morton, Jérôme Glisse, Jonathan Corbet,
	Alex Shi, Yanteng Si, Karol Herbst, Lyude Paul, Danilo Krummrich,
	David Airlie, Simona Vetter, Liam R. Howlett, Lorenzo Stoakes,
	Vlastimil Babka, Jann Horn, Pasha Tatashin, Peter Xu,
	Jason Gunthorpe

On Thu, Jan 30, 2025 at 10:37:06AM +0100, David Hildenbrand wrote:
> On 30.01.25 01:27, Alistair Popple wrote:
> > On Wed, Jan 29, 2025 at 12:58:02PM +0100, David Hildenbrand wrote:
> > > Let's document how this function is to be used, and why the requirement
> > > for the folio lock might maybe be dropped in the future.
> > 
> > Sorry, only just catching up on your other thread. The folio lock was to ensure
> > the GPU got a chance to make forward progress by mapping the page. Without it
> > the CPU could immediately invalidate the entry before the GPU had a chance to
> > retry the fault.
> > > Obviously performance wise having such thrashing is terrible, so should
> > really be avoided by userspace, but the lock at least allowed such programs
> > to complete.
> 
> Thanks for the clarification. So it's relevant that the MMU notifier in
> remove_device_exclusive_entry() is sent after taking the folio lock.
> 
> However, as soon as we drop the folio lock, remove_device_exclusive_entry()
> will become active, lock the folio and trigger the MMU notifier.
> 
> So the time it is actually mapped into the device is rather

Looks like you cut off a bit here (or mail transport did that somewhere),
but see my other reply: I don't think this is a legit use-case. So we don't
have to worry. Well, beyond documenting that if userspace concurrently thrashes
the same page with both device atomics and cpu access, it will stall real
bad.
-Sima
-- 
Simona Vetter
Software Engineer, Intel Corporation
http://blog.ffwll.ch



* Re: [PATCH v1 2/4] mm/mmu_notifier: drop owner from MMU_NOTIFY_EXCLUSIVE
  2025-01-30 13:29       ` Simona Vetter
@ 2025-01-30 15:26         ` David Hildenbrand
  0 siblings, 0 replies; 20+ messages in thread
From: David Hildenbrand @ 2025-01-30 15:26 UTC (permalink / raw)
  To: Alistair Popple, linux-kernel, linux-doc, dri-devel, linux-mm,
	nouveau, Andrew Morton, Jérôme Glisse, Jonathan Corbet,
	Alex Shi, Yanteng Si, Karol Herbst, Lyude Paul, Danilo Krummrich,
	David Airlie, Simona Vetter, Liam R. Howlett, Lorenzo Stoakes,
	Vlastimil Babka, Jann Horn, Pasha Tatashin, Peter Xu,
	Jason Gunthorpe

On 30.01.25 14:29, Simona Vetter wrote:
> On Thu, Jan 30, 2025 at 10:28:00AM +0100, David Hildenbrand wrote:
>> On 30.01.25 06:34, Alistair Popple wrote:
>>> Looking at hmm_test I see that doesn't use the sequence counter to ensure
>>> the PTE remains valid whilst it is mapped. I think that is probably wrong, so
>>> apologies if that lead you astray.
>>
>> Yes, the hmm_test does not completely follow the same model the nouveau
>> implementation does; so it might not be completely correct.
> 
> But unrelated but just crossed my mind:
> 
> I guess another crucial difference is that the hw (probably, not sure)
> will restart the fault if we don't repair it to its liking. So the
> hmm-test does need some kind of retry loop too somewhere to match that.

Yes. Especially for the folio lock, spinning is a rather suboptimal
approach. So we likely would want the option to just lock it instead of 
try-locking it. (or getting rid of it entirely :) )

> But might be good to also still land some of the other improvements
> discussed in these threads to make make_device_exclusive a bit more
> reliable instead of relying on busy-looping throug the hw fault handler
> for everything.

Right.

-- 
Cheers,

David / dhildenb




* Re: [PATCH v1 4/4] mm/memory: document restore_exclusive_pte()
  2025-01-30 13:31       ` Simona Vetter
@ 2025-01-30 15:29         ` David Hildenbrand
  2025-01-31  0:14           ` Alistair Popple
  0 siblings, 1 reply; 20+ messages in thread
From: David Hildenbrand @ 2025-01-30 15:29 UTC (permalink / raw)
  To: Alistair Popple, linux-kernel, linux-doc, dri-devel, linux-mm,
	nouveau, Andrew Morton, Jérôme Glisse, Jonathan Corbet,
	Alex Shi, Yanteng Si, Karol Herbst, Lyude Paul, Danilo Krummrich,
	David Airlie, Simona Vetter, Liam R. Howlett, Lorenzo Stoakes,
	Vlastimil Babka, Jann Horn, Pasha Tatashin, Peter Xu,
	Jason Gunthorpe

On 30.01.25 14:31, Simona Vetter wrote:
> On Thu, Jan 30, 2025 at 10:37:06AM +0100, David Hildenbrand wrote:
>> On 30.01.25 01:27, Alistair Popple wrote:
>>> On Wed, Jan 29, 2025 at 12:58:02PM +0100, David Hildenbrand wrote:
>>>> Let's document how this function is to be used, and why the requirement
>>>> for the folio lock might maybe be dropped in the future.
>>>
>>> Sorry, only just catching up on your other thread. The folio lock was to ensure
>>> the GPU got a chance to make forward progress by mapping the page. Without it
>>> the CPU could immediately invalidate the entry before the GPU had a chance to
>>> retry the fault.
>>>> Obviously performance wise having such thrashing is terrible, so should
>>> really be avoided by userspace, but the lock at least allowed such programs
>>> to complete.
>>
>> Thanks for the clarification. So it's relevant that the MMU notifier in
>> remove_device_exclusive_entry() is sent after taking the folio lock.
>>
>> However, as soon as we drop the folio lock, remove_device_exclusive_entry()
>> will become active, lock the folio and trigger the MMU notifier.
>>
>> So the time it is actually mapped into the device is rather

I meant to say "rather short." :)

> 
> Looks like you cut off a bit here (or mail transport did that somewhere),
> but see my other reply I don't think this is a legit use-case. So we don't
> have to worry.

In that case, we would need the folio lock in the future.

> Well beyond documenting that if userspace concurrently thrashes
> the same page with both device atomics and cpu access it will stall real
> bad.

I'm curious, is locking between device-cpu or device-device something 
that can happen frequently? In that case, you would get that thrashing
naturally?

-- 
Cheers,

David / dhildenb




* Re: [PATCH v1 4/4] mm/memory: document restore_exclusive_pte()
  2025-01-30 15:29         ` David Hildenbrand
@ 2025-01-31  0:14           ` Alistair Popple
  2025-01-31 17:20             ` Simona Vetter
  0 siblings, 1 reply; 20+ messages in thread
From: Alistair Popple @ 2025-01-31  0:14 UTC (permalink / raw)
  To: David Hildenbrand
  Cc: linux-kernel, linux-doc, dri-devel, linux-mm, nouveau,
	Andrew Morton, Jérôme Glisse, Jonathan Corbet,
	Alex Shi, Yanteng Si, Karol Herbst, Lyude Paul, Danilo Krummrich,
	David Airlie, Simona Vetter, Liam R. Howlett, Lorenzo Stoakes,
	Vlastimil Babka, Jann Horn, Pasha Tatashin, Peter Xu,
	Jason Gunthorpe

On Thu, Jan 30, 2025 at 04:29:33PM +0100, David Hildenbrand wrote:
> On 30.01.25 14:31, Simona Vetter wrote:
> > On Thu, Jan 30, 2025 at 10:37:06AM +0100, David Hildenbrand wrote:
> > > On 30.01.25 01:27, Alistair Popple wrote:
> > > > On Wed, Jan 29, 2025 at 12:58:02PM +0100, David Hildenbrand wrote:
> > > > > Let's document how this function is to be used, and why the requirement
> > > > > for the folio lock might maybe be dropped in the future.
> > > > 
> > > > Sorry, only just catching up on your other thread. The folio lock was to ensure
> > > > the GPU got a chance to make forward progress by mapping the page. Without it
> > > > the CPU could immediately invalidate the entry before the GPU had a chance to
> > > > retry the fault.
> > > > > Obviously performance wise having such thrashing is terrible, so should
> > > > really be avoided by userspace, but the lock at least allowed such programs
> > > > to complete.
> > > 
> > > Thanks for the clarification. So it's relevant that the MMU notifier in
> > > remove_device_exclusive_entry() is sent after taking the folio lock.
> > > 
> > > However, as soon as we drop the folio lock, remove_device_exclusive_entry()
> > > will become active, lock the folio and trigger the MMU notifier.
> > > 
> > > So the time it is actually mapped into the device is rather
> 
> I meant to say "rather short." :)
> 
> > 
> > Looks like you cut off a bit here (or mail transport did that somewhere),
> > but see my other reply I don't think this is a legit use-case. So we don't
> > have to worry.
> 
> In that case, we would need the folio lock in the future.
> 
> > Well beyond documenting that if userspace concurrently thrashes
> > the same page with both device atomics and cpu access it will stall real
> > bad.
> 
> I'm curious, is locking between device-cpu or device-device something that
> can happen frequently? In that case, you would get that thrashing naturally?

It results in terrible performance, so in practice it isn't something that I've
seen except when stress testing the driver. Those stress tests were useful for
exposing a range of kernel/driver bugs/issues though, and despite the short time
it is mapped, the lock was sufficient to allow atomic thrashing tests to complete
vs. having the device fault endlessly.
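
To be concrete about what I mean by "atomic thrashing test", the pattern is
roughly the sketch below. It's only an illustration: trigger_device_atomic()
is a made-up placeholder for whatever driver-specific mechanism (ioctl, GPU
kernel launch, ...) requests a device atomic on the page, it is not a real
API, and it's stubbed out so the sketch at least compiles:

/* CPU side and device side take turns bouncing the same 4k page between a
 * normal PTE and a device-exclusive entry. */
#include <pthread.h>
#include <stdatomic.h>
#include <stdlib.h>

#define ITERS 100000

static _Atomic long *counter;	/* lives in a single shared 4k page */

/* Hypothetical stand-in for the driver/selftest hook; stubbed here. */
static void trigger_device_atomic(_Atomic long *addr)
{
	(void)addr;
}

static void *cpu_side(void *arg)
{
	/* A CPU access faults whenever the device holds the page exclusively. */
	for (int i = 0; i < ITERS; i++)
		atomic_fetch_add(counter, 1);
	return NULL;
}

static void *device_side(void *arg)
{
	/* Each device atomic re-creates the device-exclusive entry. */
	for (int i = 0; i < ITERS; i++)
		trigger_device_atomic(counter);
	return NULL;
}

int main(void)
{
	pthread_t cpu, dev;

	counter = aligned_alloc(4096, 4096);
	*counter = 0;
	pthread_create(&cpu, NULL, cpu_side, NULL);
	pthread_create(&dev, NULL, device_side, NULL);
	pthread_join(cpu, NULL);
	pthread_join(dev, NULL);
	return 0;
}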

So unless it's making things more difficult I'd rather keep the lock.

> -- 
> Cheers,
> 
> David / dhildenb
> 


^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: [PATCH v1 4/4] mm/memory: document restore_exclusive_pte()
  2025-01-30 10:43     ` Simona Vetter
@ 2025-01-31  0:20       ` Alistair Popple
  2025-01-31  9:15         ` David Hildenbrand
  0 siblings, 1 reply; 20+ messages in thread
From: Alistair Popple @ 2025-01-31  0:20 UTC (permalink / raw)
  To: David Hildenbrand, linux-kernel, linux-doc, dri-devel, linux-mm,
	nouveau, Andrew Morton, Jérôme Glisse, Jonathan Corbet,
	Alex Shi, Yanteng Si, Karol Herbst, Lyude Paul, Danilo Krummrich,
	David Airlie, Simona Vetter, Liam R. Howlett, Lorenzo Stoakes,
	Vlastimil Babka, Jann Horn, Pasha Tatashin, Peter Xu,
	Jason Gunthorpe

On Thu, Jan 30, 2025 at 11:43:25AM +0100, Simona Vetter wrote:
> On Thu, Jan 30, 2025 at 11:27:37AM +1100, Alistair Popple wrote:
> > On Wed, Jan 29, 2025 at 12:58:02PM +0100, David Hildenbrand wrote:
> > > Let's document how this function is to be used, and why the requirement
> > > for the folio lock might maybe be dropped in the future.
> > 
> > Sorry, only just catching up on your other thread. The folio lock was to ensure
> > the GPU got a chance to make forward progress by mapping the page. Without it
> > the CPU could immediately invalidate the entry before the GPU had a chance to
> > retry the fault.
> > 
> > Obviously performance wise having such thrashing is terrible, so should
> > really be avoided by userspace, but the lock at least allowed such programs
> > to complete.
> 
> Imo this is not a legit use-case. If userspace concurrently (instead of
> clearly alternating) uses the same 4k page for gpu atomics and on the cpu,
> it just gets to keep the fallout.
> 
> Plus there's no guarantee that we hold the folio_lock long enough for the
> gpu to actually complete the atomic, so this isn't even really helping
> with forward progress even if this somehow would be a legit usecase.

Yes, agree it's not a legit real world use case. In practice though it was
useful for testing this and other driver code by thrashing to generate a lot
of device/cpu faults and invalidations. Obviously "just for testing" is not a
great justification though, so if it's causing problems we could get rid of it.

> But this is also why thp is potentially an issue, because if thp
> constantly creates pmd entries that potentially causes false sharing and
> we do have thrashing that shouldn't happen.

Yeah, I don't think we should extend this to THPs.

 - Alistair

> -Sima
> 
> > > Signed-off-by: David Hildenbrand <david@redhat.com>
> > > ---
> > >  mm/memory.c | 25 +++++++++++++++++++++++++
> > >  1 file changed, 25 insertions(+)
> > > 
> > > diff --git a/mm/memory.c b/mm/memory.c
> > > index 46956994aaff..caaae8df11a9 100644
> > > --- a/mm/memory.c
> > > +++ b/mm/memory.c
> > > @@ -718,6 +718,31 @@ struct folio *vm_normal_folio_pmd(struct vm_area_struct *vma,
> > >  }
> > >  #endif
> > >  
> > > +/**
> > > + * restore_exclusive_pte - Restore a device-exclusive entry
> > > + * @vma: VMA covering @address
> > > + * @folio: the mapped folio
> > > + * @page: the mapped folio page
> > > + * @address: the virtual address
> > > + * @ptep: PTE pointer into the locked page table mapping the folio page
> > > + * @orig_pte: PTE value at @ptep
> > > + *
> > > + * Restore a device-exclusive non-swap entry to an ordinary present PTE.
> > > + *
> > > + * The folio and the page table must be locked, and MMU notifiers must have
> > > + * been called to invalidate any (exclusive) device mappings. In case of
> > > + * fork(), MMU_NOTIFY_PROTECTION_PAGE is triggered, and in case of a page
> > > + * fault MMU_NOTIFY_EXCLUSIVE is triggered.
> > > + *
> > > + * Locking the folio makes sure that anybody who just converted the PTE to
> > > + * a device-private entry can map it into the device, before unlocking it; so
> > > + * the folio lock prevents concurrent conversion to device-exclusive.
> > 
> > I don't quite follow this - a concurrent conversion would already fail
> > because the GUP in make_device_exclusive_range() would most likely cause
> > an unexpected reference during the migration. And if a migration entry
> > has already been installed for the device private PTE conversion then
> > make_device_exclusive_range() will skip it as a non-present entry anyway.
> > 
> > However s/device-private/device-exclusive/ makes sense - the intent was to allow
> > the device to map it before a call to restore_exclusive_pte() (ie. a CPU fault)
> > could convert it back to a normal PTE.
> > 
> > > + * TODO: the folio lock does not protect against all cases of concurrent
> > > + * page table modifications (e.g., MADV_DONTNEED, mprotect), so device drivers
> > > + * must already use MMU notifiers to sync against any concurrent changes
> > 
> > Right. It's expected drivers are using MMU notifiers to keep page tables in
> > sync, same as for hmm_range_fault().
> > 
> > > + * Maybe the requirement for the folio lock can be dropped in the future.
> > > + */
> > >  static void restore_exclusive_pte(struct vm_area_struct *vma,
> > >  		struct folio *folio, struct page *page, unsigned long address,
> > >  		pte_t *ptep, pte_t orig_pte)
> > > -- 
> > > 2.48.1
> > > 
> 
> -- 
> Simona Vetter
> Software Engineer, Intel Corporation
> http://blog.ffwll.ch


^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: [PATCH v1 4/4] mm/memory: document restore_exclusive_pte()
  2025-01-31  0:20       ` Alistair Popple
@ 2025-01-31  9:15         ` David Hildenbrand
  0 siblings, 0 replies; 20+ messages in thread
From: David Hildenbrand @ 2025-01-31  9:15 UTC (permalink / raw)
  To: Alistair Popple, linux-kernel, linux-doc, dri-devel, linux-mm,
	nouveau, Andrew Morton, Jérôme Glisse, Jonathan Corbet,
	Alex Shi, Yanteng Si, Karol Herbst, Lyude Paul, Danilo Krummrich,
	David Airlie, Simona Vetter, Liam R. Howlett, Lorenzo Stoakes,
	Vlastimil Babka, Jann Horn, Pasha Tatashin, Peter Xu,
	Jason Gunthorpe

On 31.01.25 01:20, Alistair Popple wrote:
> On Thu, Jan 30, 2025 at 11:43:25AM +0100, Simona Vetter wrote:
>> On Thu, Jan 30, 2025 at 11:27:37AM +1100, Alistair Popple wrote:
>>> On Wed, Jan 29, 2025 at 12:58:02PM +0100, David Hildenbrand wrote:
>>>> Let's document how this function is to be used, and why the requirement
>>>> for the folio lock might maybe be dropped in the future.
>>>
>>> Sorry, only just catching up on your other thread. The folio lock was to ensure
>>> the GPU got a chance to make forward progress by mapping the page. Without it
>>> the CPU could immediately invalidate the entry before the GPU had a chance to
>>> retry the fault.
>>>
>>> Obviously performance wise having such thrashing is terrible, so should
>>> really be avoided by userspace, but the lock at least allowed such programs
>>> to complete.
>>
>> Imo this is not a legit use-case. If userspace concurrently (instead of
>> clearly alternating) uses the same 4k page for gpu atomics and on the cpu,
>> it just gets to keep the fallout.
>>
>> Plus there's no guarantee that we hold the folio_lock long enough for the
>> gpu to actually complete the atomic, so this isn't even really helping
>> with forward progress even if this somehow would be a legit usecase.
> 
> Yes, agree it's not a legit real world use case. In practice though it was
> useful for testing this and other driver code by thrashing to generate a lot
> device/cpu faults and invalidations. Obviously "just for testing" is not a great
> justification though, so if it's causing problems we could get rid of it.

Okay, I'll make that clear in the documentation. Getting rid of the 
folio lock might be really beneficial in some cases.

-- 
Cheers,

David / dhildenb



^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: [PATCH v1 4/4] mm/memory: document restore_exclusive_pte()
  2025-01-31  0:14           ` Alistair Popple
@ 2025-01-31 17:20             ` Simona Vetter
  0 siblings, 0 replies; 20+ messages in thread
From: Simona Vetter @ 2025-01-31 17:20 UTC (permalink / raw)
  To: Alistair Popple
  Cc: David Hildenbrand, linux-kernel, linux-doc, dri-devel, linux-mm,
	nouveau, Andrew Morton, Jérôme Glisse, Jonathan Corbet,
	Alex Shi, Yanteng Si, Karol Herbst, Lyude Paul, Danilo Krummrich,
	David Airlie, Simona Vetter, Liam R. Howlett, Lorenzo Stoakes,
	Vlastimil Babka, Jann Horn, Pasha Tatashin, Peter Xu,
	Jason Gunthorpe

On Fri, Jan 31, 2025 at 11:14:06AM +1100, Alistair Popple wrote:
> On Thu, Jan 30, 2025 at 04:29:33PM +0100, David Hildenbrand wrote:
> > On 30.01.25 14:31, Simona Vetter wrote:
> > > On Thu, Jan 30, 2025 at 10:37:06AM +0100, David Hildenbrand wrote:
> > > > On 30.01.25 01:27, Alistair Popple wrote:
> > > > > On Wed, Jan 29, 2025 at 12:58:02PM +0100, David Hildenbrand wrote:
> > > > > > Let's document how this function is to be used, and why the requirement
> > > > > > for the folio lock might maybe be dropped in the future.
> > > > > 
> > > > > Sorry, only just catching up on your other thread. The folio lock was to ensure
> > > > > the GPU got a chance to make forward progress by mapping the page. Without it
> > > > > the CPU could immediately invalidate the entry before the GPU had a chance to
> > > > > retry the fault.
> > > > > > Obviously performance wise having such thrashing is terrible, so should
> > > > > really be avoided by userspace, but the lock at least allowed such programs
> > > > > to complete.
> > > > 
> > > > Thanks for the clarification. So it's relevant that the MMU notifier in
> > > > remove_device_exclusive_entry() is sent after taking the folio lock.
> > > > 
> > > > However, as soon as we drop the folio lock, remove_device_exclusive_entry()
> > > > will become active, lock the folio and trigger the MMU notifier.
> > > > 
> > > > So the time it is actually mapped into the device is rather
> > 
> > I meant to say "rather short." :)
> > 
> > > 
> > > Looks like you cut off a bit here (or mail transport did that somewhere),
> > > but see my other reply I don't think this is a legit use-case. So we don't
> > > have to worry.
> > 
> > In that case, we would need the folio lock in the future.
> > 
> > > Well beyond documenting that if userspace concurrently thrashes
> > > the same page with both device atomics and cpu access it will stall real
> > > bad.
> > 
> > I'm curious, is locking between device-cpu or device-device something that
> > can happen frequently? In that case, you would get that thrashing naturally?
> 
> It results in terrible performance so in practice it isn't something that I've
> seen except when stress testing the driver. Those stress tests were useful for
> exposing a range of kernel/driver bugs/issues though, and despite the short time
> it is mapped the lock was sufficient to allow atomic thrashing tests to complete
> vs. having the device fault endlessly.

Yeah the practical use-case of device atomics is that they alternate, as
the processing shifts between the cpu and the gpu. Classic one is when you
generate variable amounts of data per input block that you fill into a big
preallocated array, and the atomic counter is your dumb-as-rock malloc for
that. The atomic moves as the generating/consuming of that data moves
around the system (and needs a fault each time it moves), but you really
never want to have concurrent access going on. Any thrashing concurrent
access is just broken userspace (or a driver stress test).
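
If it helps to picture it, the usage model is roughly the bump allocator
below. This is just an illustrative sketch with made-up names, not any
particular driver's or application's code:

/*
 * "Atomic counter as dumb-as-rock malloc": whoever currently produces data
 * (CPU or GPU) reserves space in a big preallocated array by atomically
 * bumping a counter that lives in the same shared buffer. The atomic moves
 * with the work, so a well-behaved program never hammers it from both sides
 * at once.
 */
#include <stdatomic.h>
#include <stddef.h>

struct bump_buf {
	_Atomic size_t head;	/* the counter that faults back and forth */
	size_t size;		/* capacity of data[] */
	char data[];		/* big preallocated output array */
};

/* Reserve 'bytes' from the shared buffer; returns NULL when it is full. */
static void *bump_alloc(struct bump_buf *b, size_t bytes)
{
	size_t off = atomic_fetch_add(&b->head, bytes);

	if (off + bytes > b->size)
		return NULL;
	return &b->data[off];
}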

> So unless it's making things more difficult I'd rather keep the lock.

But why do these stress-tests need to complete? We have a lot of these in
our gpu test suite, and we just nuke them after a short timeout if nothing
blows up. Concurrently hammering the same page from both gpu and cpu
really isn't something that should have any forward progress guarantees
imo. And I feel like too much locking just risks us hiding some much
nastier (design) bugs underneath - I've had that experience reviewing both
amdkfd and the in-work xe code. So my preference for gpu interacting with
core mm is that we have the least amount of locking to make sure we really
don't have a breaking design impedance mismatch going on.

Cheers, Sima
-- 
Simona Vetter
Software Engineer, Intel Corporation
http://blog.ffwll.ch


^ permalink raw reply	[flat|nested] 20+ messages in thread

end of thread, other threads:[~2025-01-31 17:20 UTC | newest]

Thread overview: 20+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2025-01-29 11:57 [PATCH v1 0/4] mm: cleanups for device-exclusive entries (hmm) David Hildenbrand
2025-01-29 11:57 ` [PATCH v1 1/4] lib/test_hmm: make dmirror_atomic_map() consume a single page David Hildenbrand
2025-01-30  0:29   ` Alistair Popple
2025-01-29 11:58 ` [PATCH v1 2/4] mm/mmu_notifier: drop owner from MMU_NOTIFY_EXCLUSIVE David Hildenbrand
2025-01-30  5:34   ` Alistair Popple
2025-01-30  9:28     ` David Hildenbrand
2025-01-30 13:29       ` Simona Vetter
2025-01-30 15:26         ` David Hildenbrand
2025-01-29 11:58 ` [PATCH v1 3/4] mm/memory: pass folio and pte to restore_exclusive_pte() David Hildenbrand
2025-01-30  5:37   ` Alistair Popple
2025-01-29 11:58 ` [PATCH v1 4/4] mm/memory: document restore_exclusive_pte() David Hildenbrand
2025-01-30  0:27   ` Alistair Popple
2025-01-30  9:37     ` David Hildenbrand
2025-01-30 13:31       ` Simona Vetter
2025-01-30 15:29         ` David Hildenbrand
2025-01-31  0:14           ` Alistair Popple
2025-01-31 17:20             ` Simona Vetter
2025-01-30 10:43     ` Simona Vetter
2025-01-31  0:20       ` Alistair Popple
2025-01-31  9:15         ` David Hildenbrand

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox