linux-mm.kvack.org archive mirror
* [PATCH v4] mm: Fix a hmm_range_fault() livelock / starvation problem
@ 2026-02-05 11:10 Thomas Hellström
  2026-02-05 11:20 ` Balbir Singh
                   ` (2 more replies)
  0 siblings, 3 replies; 9+ messages in thread
From: Thomas Hellström @ 2026-02-05 11:10 UTC (permalink / raw)
  To: intel-xe
  Cc: Thomas Hellström, Alistair Popple, Ralph Campbell,
	Christoph Hellwig, Jason Gunthorpe, Jason Gunthorpe,
	Leon Romanovsky, Andrew Morton, Matthew Brost, John Hubbard,
	linux-mm, dri-devel, stable

If hmm_range_fault() fails a folio_trylock() in do_swap_page(),
while trying to acquire the lock of a device-private folio for
migration to ram, the function will spin until it succeeds in
grabbing the lock.

However, if the process holding the lock depends on a work item
to complete, and that work item is scheduled on the same CPU as
the spinning hmm_range_fault(), the work item may be starved and
we end up in a livelock / starvation situation that is never
resolved.

This can happen, for example, if the process holding the
device-private folio lock is stuck in
   migrate_device_unmap()->lru_add_drain_all()
lru_add_drain_all() requires a short work item to run on all
online CPUs before it can complete.
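
For reference, that mechanism looks roughly as follows. This is a
simplified sketch rather than the exact mm/swap.c code, and
cpu_has_pending_lru_batches() is a placeholder for the real per-cpu
checks:

static void lru_add_drain_all_sketch(void)
{
	static struct cpumask has_work;
	int cpu;

	cpumask_clear(&has_work);
	for_each_online_cpu(cpu) {
		struct work_struct *work = &per_cpu(lru_add_drain_work, cpu);

		/* cpu_has_pending_lru_batches() stands in for the real checks. */
		if (cpu_has_pending_lru_batches(cpu)) {
			INIT_WORK(work, lru_add_drain_per_cpu);
			queue_work_on(cpu, mm_percpu_wq, work);
			__cpumask_set_cpu(cpu, &has_work);
		}
	}

	/*
	 * Each flush_work() sleeps until the per-cpu work item has run on
	 * its CPU. If one of those CPUs is monopolized by a task that
	 * never sleeps, this loop never finishes.
	 */
	for_each_cpu(cpu, &has_work)
		flush_work(&per_cpu(lru_add_drain_work, cpu));
}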

The prerequisites for this to happen are:
a) Both zone device and system memory folios are considered in
   migrate_device_unmap(), so that there is a reason to call
   lru_add_drain_all() for a system memory folio while a
   folio lock is held on a zone device folio.
b) The zone device folio has an initial mapcount > 1 which causes
   at least one migration PTE entry insertion to be deferred to
   try_to_migrate(), which can happen after the call to
   lru_add_drain_all().
c) No preemption, or voluntary preemption only.

This all seems pretty unlikely to happen, but it is indeed hit by
the "xe_exec_system_allocator" igt test.

Resolve this by waiting for the folio to be unlocked if the
folio_trylock() fails in the do_swap_page() function.

Rename migration_entry_wait_on_locked() to
softleaf_entry_wait_on_locked() and update its documentation to
indicate the new use-case.

Future code improvements might consider moving the
lru_add_drain_all() call in migrate_device_unmap() so that it is
called *after* all pages have had their migration entries inserted.
That would also eliminate b) above.

v2:
- Instead of a cond_resched() in the hmm_range_fault() function,
  eliminate the problem by waiting for the folio to be unlocked
  in do_swap_page() (Alistair Popple, Andrew Morton)
v3:
- Add a stub migration_entry_wait_on_locked() for the
  !CONFIG_MIGRATION case. (Kernel Test Robot)
v4:
- Rename migrate_entry_wait_on_locked() to
  softleaf_entry_wait_on_locked() and update docs (Alistair Popple)

Suggested-by: Alistair Popple <apopple@nvidia.com>
Fixes: 1afaeb8293c9 ("mm/migrate: Trylock device page in do_swap_page")
Cc: Ralph Campbell <rcampbell@nvidia.com>
Cc: Christoph Hellwig <hch@lst.de>
Cc: Jason Gunthorpe <jgg@mellanox.com>
Cc: Jason Gunthorpe <jgg@ziepe.ca>
Cc: Leon Romanovsky <leon@kernel.org>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Matthew Brost <matthew.brost@intel.com>
Cc: John Hubbard <jhubbard@nvidia.com>
Cc: Alistair Popple <apopple@nvidia.com>
Cc: linux-mm@kvack.org
Cc: <dri-devel@lists.freedesktop.org>
Signed-off-by: Thomas Hellström <thomas.hellstrom@linux.intel.com>
Cc: <stable@vger.kernel.org> # v6.15+
Reviewed-by: John Hubbard <jhubbard@nvidia.com> #v3
---
 include/linux/migrate.h |  8 +++++++-
 mm/filemap.c            | 15 ++++++++++-----
 mm/memory.c             |  3 ++-
 mm/migrate.c            |  8 ++++----
 mm/migrate_device.c     |  2 +-
 5 files changed, 24 insertions(+), 12 deletions(-)

diff --git a/include/linux/migrate.h b/include/linux/migrate.h
index 26ca00c325d9..3cc387f1957d 100644
--- a/include/linux/migrate.h
+++ b/include/linux/migrate.h
@@ -65,7 +65,7 @@ bool isolate_folio_to_list(struct folio *folio, struct list_head *list);
 
 int migrate_huge_page_move_mapping(struct address_space *mapping,
 		struct folio *dst, struct folio *src);
-void migration_entry_wait_on_locked(softleaf_t entry, spinlock_t *ptl)
+void softleaf_entry_wait_on_locked(softleaf_t entry, spinlock_t *ptl)
 		__releases(ptl);
 void folio_migrate_flags(struct folio *newfolio, struct folio *folio);
 int folio_migrate_mapping(struct address_space *mapping,
@@ -97,6 +97,12 @@ static inline int set_movable_ops(const struct movable_operations *ops, enum pag
 	return -ENOSYS;
 }
 
+static inline void softleaf_entry_wait_on_locked(softleaf_t entry, spinlock_t *ptl)
+	__releases(ptl)
+{
+	spin_unlock(ptl);
+}
+
 #endif /* CONFIG_MIGRATION */
 
 #ifdef CONFIG_NUMA_BALANCING
diff --git a/mm/filemap.c b/mm/filemap.c
index ebd75684cb0a..d98e4883f13d 100644
--- a/mm/filemap.c
+++ b/mm/filemap.c
@@ -1379,14 +1379,16 @@ static inline int folio_wait_bit_common(struct folio *folio, int bit_nr,
 
 #ifdef CONFIG_MIGRATION
 /**
- * migration_entry_wait_on_locked - Wait for a migration entry to be removed
- * @entry: migration swap entry.
+ * softleaf_entry_wait_on_locked - Wait for a migration entry or
+ * device_private entry to be removed.
+ * @entry: migration or device_private swap entry.
  * @ptl: already locked ptl. This function will drop the lock.
  *
- * Wait for a migration entry referencing the given page to be removed. This is
+ * Wait for a migration entry referencing the given page, or device_private
+ * entry referencing a device-private page to be unlocked. This is
  * equivalent to folio_put_wait_locked(folio, TASK_UNINTERRUPTIBLE) except
  * this can be called without taking a reference on the page. Instead this
- * should be called while holding the ptl for the migration entry referencing
+ * should be called while holding the ptl for @entry referencing
  * the page.
  *
  * Returns after unlocking the ptl.
@@ -1394,7 +1396,7 @@ static inline int folio_wait_bit_common(struct folio *folio, int bit_nr,
  * This follows the same logic as folio_wait_bit_common() so see the comments
  * there.
  */
-void migration_entry_wait_on_locked(softleaf_t entry, spinlock_t *ptl)
+void softleaf_entry_wait_on_locked(softleaf_t entry, spinlock_t *ptl)
 	__releases(ptl)
 {
 	struct wait_page_queue wait_page;
@@ -1428,6 +1430,9 @@ void migration_entry_wait_on_locked(softleaf_t entry, spinlock_t *ptl)
 	 * If a migration entry exists for the page the migration path must hold
 	 * a valid reference to the page, and it must take the ptl to remove the
 	 * migration entry. So the page is valid until the ptl is dropped.
+	 * Similarly any path attempting to drop the last reference to a
+	 * device-private page needs to grab the ptl to remove the device-private
+	 * entry.
 	 */
 	spin_unlock(ptl);
 
diff --git a/mm/memory.c b/mm/memory.c
index da360a6eb8a4..20172476a57f 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -4684,7 +4684,8 @@ vm_fault_t do_swap_page(struct vm_fault *vmf)
 				unlock_page(vmf->page);
 				put_page(vmf->page);
 			} else {
-				pte_unmap_unlock(vmf->pte, vmf->ptl);
+				pte_unmap(vmf->pte);
+				softleaf_entry_wait_on_locked(entry, vmf->ptl);
 			}
 		} else if (softleaf_is_hwpoison(entry)) {
 			ret = VM_FAULT_HWPOISON;
diff --git a/mm/migrate.c b/mm/migrate.c
index 4688b9e38cd2..cf6449b4202e 100644
--- a/mm/migrate.c
+++ b/mm/migrate.c
@@ -499,7 +499,7 @@ void migration_entry_wait(struct mm_struct *mm, pmd_t *pmd,
 	if (!softleaf_is_migration(entry))
 		goto out;
 
-	migration_entry_wait_on_locked(entry, ptl);
+	softleaf_entry_wait_on_locked(entry, ptl);
 	return;
 out:
 	spin_unlock(ptl);
@@ -531,10 +531,10 @@ void migration_entry_wait_huge(struct vm_area_struct *vma, unsigned long addr, p
 		 * If migration entry existed, safe to release vma lock
 		 * here because the pgtable page won't be freed without the
 		 * pgtable lock released.  See comment right above pgtable
-		 * lock release in migration_entry_wait_on_locked().
+		 * lock release in softleaf_entry_wait_on_locked().
 		 */
 		hugetlb_vma_unlock_read(vma);
-		migration_entry_wait_on_locked(entry, ptl);
+		softleaf_entry_wait_on_locked(entry, ptl);
 		return;
 	}
 
@@ -552,7 +552,7 @@ void pmd_migration_entry_wait(struct mm_struct *mm, pmd_t *pmd)
 	ptl = pmd_lock(mm, pmd);
 	if (!pmd_is_migration_entry(*pmd))
 		goto unlock;
-	migration_entry_wait_on_locked(softleaf_from_pmd(*pmd), ptl);
+	softleaf_entry_wait_on_locked(softleaf_from_pmd(*pmd), ptl);
 	return;
 unlock:
 	spin_unlock(ptl);
diff --git a/mm/migrate_device.c b/mm/migrate_device.c
index 23379663b1e1..deab89fd4541 100644
--- a/mm/migrate_device.c
+++ b/mm/migrate_device.c
@@ -176,7 +176,7 @@ static int migrate_vma_collect_huge_pmd(pmd_t *pmdp, unsigned long start,
 		}
 
 		if (softleaf_is_migration(entry)) {
-			migration_entry_wait_on_locked(entry, ptl);
+			softleaf_entry_wait_on_locked(entry, ptl);
 			spin_unlock(ptl);
 			return -EAGAIN;
 		}
-- 
2.52.0



^ permalink raw reply	[flat|nested] 9+ messages in thread

* Re: [PATCH v4] mm: Fix a hmm_range_fault() livelock / starvation problem
  2026-02-05 11:10 [PATCH v4] mm: Fix a hmm_range_fault() livelock / starvation problem Thomas Hellström
@ 2026-02-05 11:20 ` Balbir Singh
  2026-02-05 12:41   ` Thomas Hellström
  2026-02-09 14:47 ` Thomas Hellström
  2026-02-10  2:56 ` Balbir Singh
  2 siblings, 1 reply; 9+ messages in thread
From: Balbir Singh @ 2026-02-05 11:20 UTC (permalink / raw)
  To: Thomas Hellström, intel-xe
  Cc: Alistair Popple, Ralph Campbell, Christoph Hellwig,
	Jason Gunthorpe, Jason Gunthorpe, Leon Romanovsky, Andrew Morton,
	Matthew Brost, John Hubbard, linux-mm, dri-devel, stable

On 2/5/26 22:10, Thomas Hellström wrote:
> If hmm_range_fault() fails a folio_trylock() in do_swap_page,
> trying to acquire the lock of a device-private folio for migration,
> to ram, the function will spin until it succeeds grabbing the lock.
> 
> However, if the process holding the lock is depending on a work
> item to be completed, which is scheduled on the same CPU as the
> spinning hmm_range_fault(), that work item might be starved and
> we end up in a livelock / starvation situation which is never
> resolved.
> 
> This can happen, for example if the process holding the
> device-private folio lock is stuck in
>    migrate_device_unmap()->lru_add_drain_all()
> The lru_add_drain_all() function requires a short work-item
> to be run on all online cpus to complete.
> 
> A prerequisite for this to happen is:
> a) Both zone device and system memory folios are considered in
>    migrate_device_unmap(), so that there is a reason to call
>    lru_add_drain_all() for a system memory folio while a
>    folio lock is held on a zone device folio.
> b) The zone device folio has an initial mapcount > 1 which causes
>    at least one migration PTE entry insertion to be deferred to
>    try_to_migrate(), which can happen after the call to
>    lru_add_drain_all().
> c) No or voluntary only preemption.
> 
> This all seems pretty unlikely to happen, but indeed is hit by
> the "xe_exec_system_allocator" igt test.
> 

Do you have a stack trace from the test? I am trying to visualize the
livelock/starvation, but I can't from the description.

> Resolve this by waiting for the folio to be unlocked if the
> folio_trylock() fails in the do_swap_page() function.
> 
> Rename the migration_entry_wait_on_locked() function to
> softleaf_entry_wait_unlock() and update its documentation to
> indicate the new use-case.
> 
> Future code improvements might consider moving
> the lru_add_drain_all() call in migrate_device_unmap() to be
> called *after* all pages have migration entries inserted.
> That would eliminate also b) above.
> 
> v2:
> - Instead of a cond_resched() in the hmm_range_fault() function,
>   eliminate the problem by waiting for the folio to be unlocked
>   in do_swap_page() (Alistair Popple, Andrew Morton)
> v3:
> - Add a stub migration_entry_wait_on_locked() for the
>   !CONFIG_MIGRATION case. (Kernel Test Robot)
> v4:
> - Rename migrate_entry_wait_on_locked() to
>   softleaf_entry_wait_on_locked() and update docs (Alistair Popple)
> 
> Suggested-by: Alistair Popple <apopple@nvidia.com>
> Fixes: 1afaeb8293c9 ("mm/migrate: Trylock device page in do_swap_page")
> Cc: Ralph Campbell <rcampbell@nvidia.com>
> Cc: Christoph Hellwig <hch@lst.de>
> Cc: Jason Gunthorpe <jgg@mellanox.com>
> Cc: Jason Gunthorpe <jgg@ziepe.ca>
> Cc: Leon Romanovsky <leon@kernel.org>
> Cc: Andrew Morton <akpm@linux-foundation.org>
> Cc: Matthew Brost <matthew.brost@intel.com>
> Cc: John Hubbard <jhubbard@nvidia.com>
> Cc: Alistair Popple <apopple@nvidia.com>
> Cc: linux-mm@kvack.org
> Cc: <dri-devel@lists.freedesktop.org>
> Signed-off-by: Thomas Hellström <thomas.hellstrom@linux.intel.com>
> Cc: <stable@vger.kernel.org> # v6.15+
> Reviewed-by: John Hubbard <jhubbard@nvidia.com> #v3
> ---
>  include/linux/migrate.h |  8 +++++++-
>  mm/filemap.c            | 15 ++++++++++-----
>  mm/memory.c             |  3 ++-
>  mm/migrate.c            |  8 ++++----
>  mm/migrate_device.c     |  2 +-
>  5 files changed, 24 insertions(+), 12 deletions(-)
> 
> diff --git a/include/linux/migrate.h b/include/linux/migrate.h
> index 26ca00c325d9..3cc387f1957d 100644
> --- a/include/linux/migrate.h
> +++ b/include/linux/migrate.h
> @@ -65,7 +65,7 @@ bool isolate_folio_to_list(struct folio *folio, struct list_head *list);
>  
>  int migrate_huge_page_move_mapping(struct address_space *mapping,
>  		struct folio *dst, struct folio *src);
> -void migration_entry_wait_on_locked(softleaf_t entry, spinlock_t *ptl)
> +void softleaf_entry_wait_on_locked(softleaf_t entry, spinlock_t *ptl)
>  		__releases(ptl);
>  void folio_migrate_flags(struct folio *newfolio, struct folio *folio);
>  int folio_migrate_mapping(struct address_space *mapping,
> @@ -97,6 +97,12 @@ static inline int set_movable_ops(const struct movable_operations *ops, enum pag
>  	return -ENOSYS;
>  }
>  
> +static inline void softleaf_entry_wait_on_locked(softleaf_t entry, spinlock_t *ptl)
> +	__releases(ptl)
> +{
> +	spin_unlock(ptl);
> +}
> +
>  #endif /* CONFIG_MIGRATION */
>  
>  #ifdef CONFIG_NUMA_BALANCING
> diff --git a/mm/filemap.c b/mm/filemap.c
> index ebd75684cb0a..d98e4883f13d 100644
> --- a/mm/filemap.c
> +++ b/mm/filemap.c
> @@ -1379,14 +1379,16 @@ static inline int folio_wait_bit_common(struct folio *folio, int bit_nr,
>  
>  #ifdef CONFIG_MIGRATION
>  /**
> - * migration_entry_wait_on_locked - Wait for a migration entry to be removed
> - * @entry: migration swap entry.
> + * softleaf_entry_wait_on_locked - Wait for a migration entry or
> + * device_private entry to be removed.
> + * @entry: migration or device_private swap entry.
>   * @ptl: already locked ptl. This function will drop the lock.
>   *
> - * Wait for a migration entry referencing the given page to be removed. This is
> + * Wait for a migration entry referencing the given page, or device_private
> + * entry referencing a dvice_private page to be unlocked. This is
>   * equivalent to folio_put_wait_locked(folio, TASK_UNINTERRUPTIBLE) except
>   * this can be called without taking a reference on the page. Instead this
> - * should be called while holding the ptl for the migration entry referencing
> + * should be called while holding the ptl for @entry referencing
>   * the page.
>   *
>   * Returns after unlocking the ptl.
> @@ -1394,7 +1396,7 @@ static inline int folio_wait_bit_common(struct folio *folio, int bit_nr,
>   * This follows the same logic as folio_wait_bit_common() so see the comments
>   * there.
>   */
> -void migration_entry_wait_on_locked(softleaf_t entry, spinlock_t *ptl)
> +void softleaf_entry_wait_on_locked(softleaf_t entry, spinlock_t *ptl)
>  	__releases(ptl)
>  {
>  	struct wait_page_queue wait_page;
> @@ -1428,6 +1430,9 @@ void migration_entry_wait_on_locked(softleaf_t entry, spinlock_t *ptl)
>  	 * If a migration entry exists for the page the migration path must hold
>  	 * a valid reference to the page, and it must take the ptl to remove the
>  	 * migration entry. So the page is valid until the ptl is dropped.
> +	 * Similarly any path attempting to drop the last reference to a
> +	 * device-private page needs to grab the ptl to remove the device-private
> +	 * entry.
>  	 */
>  	spin_unlock(ptl);
>  
> diff --git a/mm/memory.c b/mm/memory.c
> index da360a6eb8a4..20172476a57f 100644
> --- a/mm/memory.c
> +++ b/mm/memory.c
> @@ -4684,7 +4684,8 @@ vm_fault_t do_swap_page(struct vm_fault *vmf)
>  				unlock_page(vmf->page);
>  				put_page(vmf->page);
>  			} else {
> -				pte_unmap_unlock(vmf->pte, vmf->ptl);
> +				pte_unmap(vmf->pte);
> +				softleaf_entry_wait_on_locked(entry, vmf->ptl);
>  			}
>  		} else if (softleaf_is_hwpoison(entry)) {
>  			ret = VM_FAULT_HWPOISON;
> diff --git a/mm/migrate.c b/mm/migrate.c
> index 4688b9e38cd2..cf6449b4202e 100644
> --- a/mm/migrate.c
> +++ b/mm/migrate.c
> @@ -499,7 +499,7 @@ void migration_entry_wait(struct mm_struct *mm, pmd_t *pmd,
>  	if (!softleaf_is_migration(entry))
>  		goto out;
>  
> -	migration_entry_wait_on_locked(entry, ptl);
> +	softleaf_entry_wait_on_locked(entry, ptl);
>  	return;
>  out:
>  	spin_unlock(ptl);
> @@ -531,10 +531,10 @@ void migration_entry_wait_huge(struct vm_area_struct *vma, unsigned long addr, p
>  		 * If migration entry existed, safe to release vma lock
>  		 * here because the pgtable page won't be freed without the
>  		 * pgtable lock released.  See comment right above pgtable
> -		 * lock release in migration_entry_wait_on_locked().
> +		 * lock release in softleaf_entry_wait_on_locked().
>  		 */
>  		hugetlb_vma_unlock_read(vma);
> -		migration_entry_wait_on_locked(entry, ptl);
> +		softleaf_entry_wait_on_locked(entry, ptl);
>  		return;
>  	}
>  
> @@ -552,7 +552,7 @@ void pmd_migration_entry_wait(struct mm_struct *mm, pmd_t *pmd)
>  	ptl = pmd_lock(mm, pmd);
>  	if (!pmd_is_migration_entry(*pmd))
>  		goto unlock;
> -	migration_entry_wait_on_locked(softleaf_from_pmd(*pmd), ptl);
> +	softleaf_entry_wait_on_locked(softleaf_from_pmd(*pmd), ptl);
>  	return;
>  unlock:
>  	spin_unlock(ptl);
> diff --git a/mm/migrate_device.c b/mm/migrate_device.c
> index 23379663b1e1..deab89fd4541 100644
> --- a/mm/migrate_device.c
> +++ b/mm/migrate_device.c
> @@ -176,7 +176,7 @@ static int migrate_vma_collect_huge_pmd(pmd_t *pmdp, unsigned long start,
>  		}
>  
>  		if (softleaf_is_migration(entry)) {
> -			migration_entry_wait_on_locked(entry, ptl);
> +			softleaf_entry_wait_on_locked(entry, ptl);
>  			spin_unlock(ptl);
>  			return -EAGAIN;
>  		}

Balbir


^ permalink raw reply	[flat|nested] 9+ messages in thread

* Re: [PATCH v4] mm: Fix a hmm_range_fault() livelock / starvation problem
  2026-02-05 11:20 ` Balbir Singh
@ 2026-02-05 12:41   ` Thomas Hellström
  2026-02-10  2:47     ` Balbir Singh
  0 siblings, 1 reply; 9+ messages in thread
From: Thomas Hellström @ 2026-02-05 12:41 UTC (permalink / raw)
  To: Balbir Singh, intel-xe
  Cc: Alistair Popple, Ralph Campbell, Christoph Hellwig,
	Jason Gunthorpe, Jason Gunthorpe, Leon Romanovsky, Andrew Morton,
	Matthew Brost, John Hubbard, linux-mm, dri-devel, stable

On Thu, 2026-02-05 at 22:20 +1100, Balbir Singh wrote:
> On 2/5/26 22:10, Thomas Hellström wrote:
> > If hmm_range_fault() fails a folio_trylock() in do_swap_page,
> > trying to acquire the lock of a device-private folio for migration,
> > to ram, the function will spin until it succeeds grabbing the lock.
> > 
> > However, if the process holding the lock is depending on a work
> > item to be completed, which is scheduled on the same CPU as the
> > spinning hmm_range_fault(), that work item might be starved and
> > we end up in a livelock / starvation situation which is never
> > resolved.
> > 
> > This can happen, for example if the process holding the
> > device-private folio lock is stuck in
> >    migrate_device_unmap()->lru_add_drain_all()
> > The lru_add_drain_all() function requires a short work-item
> > to be run on all online cpus to complete.
> > 
> > A prerequisite for this to happen is:
> > a) Both zone device and system memory folios are considered in
> >    migrate_device_unmap(), so that there is a reason to call
> >    lru_add_drain_all() for a system memory folio while a
> >    folio lock is held on a zone device folio.
> > b) The zone device folio has an initial mapcount > 1 which causes
> >    at least one migration PTE entry insertion to be deferred to
> >    try_to_migrate(), which can happen after the call to
> >    lru_add_drain_all().
> > c) No or voluntary only preemption.
> > 
> > This all seems pretty unlikely to happen, but indeed is hit by
> > the "xe_exec_system_allocator" igt test.
> > 
> 
> Do you have a stack trace from the test? I am trying to visualize the
> livelock/starvation, but I can't from the description.

The spinning thread (the backtrace varies slightly from time to time):

[  805.201476] watchdog: BUG: soft lockup - CPU#139 stuck for 52s!
[kworker/u900:1:9985]
[  805.201477] Modules linked in: xt_conntrack nft_chain_nat
xt_MASQUERADE nf_nat nf_conntrack nf_defrag_ipv6 nf_defrag_ipv4 bridge
stp llc xfrm_user xfrm_algo xt_addrtype nft_compat x_tables nf_tables
mei_gsc_proxy pmt_crashlog mtd_intel_dg mei_gsc overlay qrtr
snd_hda_codec_intelhdmi snd_hda_codec_hdmi intel_rapl_msr
intel_rapl_common cfg80211 intel_uncore_frequency
intel_uncore_frequency_common intel_ifs i10nm_edac sunrpc binfmt_misc
skx_edac_common nfit xe x86_pkg_temp_thermal intel_powerclamp coretemp
nls_iso8859_1 kvm_intel kvm drm_ttm_helper drm_suballoc_helper
gpu_sched snd_hda_intel cmdlinepart drm_gpuvm snd_intel_dspcfg drm_exec
spi_nor drm_gpusvm_helper snd_hda_codec drm_buddy pmt_telemetry
dax_hmem snd_hwdep pmt_discovery mtd video irqbypass cxl_acpi qat_4xxx
iaa_crypto snd_hda_core pmt_class ttm rapl ses cxl_port snd_pcm
intel_cstate enclosure cxl_core intel_qat isst_if_mmio isst_if_mbox_pci
drm_display_helper snd_timer snd cec idxd crc8 einj ast mei_me
spi_intel_pci rc_core soundcore isst_if_common
[  805.201496]  ipmi_ssif authenc i2c_i801 intel_vsec idxd_bus
spi_intel i2c_algo_bit mei i2c_ismt i2c_smbus wmi joydev input_leds
ipmi_si acpi_power_meter acpi_ipmi ipmi_devintf ipmi_msghandler
acpi_pad mac_hid pfr_telemetry pfr_update sch_fq_codel msr efi_pstore
dm_multipath nfnetlink dmi_sysfs autofs4 btrfs blake2b libblake2b
raid10 raid456 async_raid6_recov async_memcpy async_pq async_xor
async_tx xor raid6_pq raid1 raid0 linear rndis_host cdc_ether usbnet
mii nvme hid_generic mpt3sas i40e nvme_core usbhid ahci
ghash_clmulni_intel raid_class nvme_keyring scsi_transport_sas hid
libahci nvme_auth libie hkdf libie_adminq pinctrl_emmitsburg
aesni_intel
[  805.201510] CPU: 139 UID: 0 PID: 9985 Comm: kworker/u900:1 Tainted:
G S      W    L      6.19.0-rc7+ #18 PREEMPT(voluntary) 
[  805.201512] Tainted: [S]=CPU_OUT_OF_SPEC, [W]=WARN, [L]=SOFTLOCKUP
[  805.201512] Hardware name: Supermicro SYS-421GE-TNRT/X13DEG-OA, BIOS
2.5a 02/21/2025
[  805.201513] Workqueue: xe_page_fault_work_queue
xe_pagefault_queue_work [xe]
[  805.201599] RIP: 0010:_raw_spin_unlock+0x16/0x40
[  805.201602] Code: cc 0f 1f 00 90 90 90 90 90 90 90 90 90 90 90 90 90
90 90 90 0f 1f 44 00 00 55 48 89 e5 c6 07 00 0f 1f 00 65 ff 0d fa a6 40
01 <74> 10 5d 31 c0 31 d2 31 c9 31 f6 31 ff c3 cc cc cc cc 0f 1f 44 00
[  805.201603] RSP: 0018:ffffd2a663a4f678 EFLAGS: 00000247
[  805.201603] RAX: fffff85c67e35080 RBX: ffffd2a663a4f7b8 RCX:
0000000000000000
[  805.201604] RDX: ffff8b88fdd31a00 RSI: 0000000000000000 RDI:
fffff75c86ff5928
[  805.201605] RBP: ffffd2a663a4f678 R08: 0000000000000000 R09:
0000000000000000
[  805.201605] R10: 0000000000000000 R11: 0000000000000000 R12:
0000631d10d42000
[  805.201606] R13: ffffd2a663a4f7b8 R14: 00000001a4ca4067 R15:
74000003ff9f8d42
[  805.201606] FS:  0000000000000000(0000) GS:ffff8bc76202b000(0000)
knlGS:0000000000000000
[  805.201607] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[  805.201608] CR2: 0000631d10c00088 CR3: 0000003de3040004 CR4:
0000000000f72ef0
[  805.201609] PKRU: 55555554
[  805.201609] Call Trace:
[  805.201610]  <TASK>
[  805.201610]  do_swap_page+0x17c6/0x1b70
[  805.201612]  ? sysvec_apic_timer_interrupt+0x57/0xc0
[  805.201614]  ? asm_sysvec_apic_timer_interrupt+0x1b/0x20
[  805.201615]  ? __pfx_default_wake_function+0x10/0x10
[  805.201617]  ? ___pte_offset_map+0x1c/0x130
[  805.201619]  __handle_mm_fault+0xa75/0x1020
[  805.201621]  handle_mm_fault+0xeb/0x2f0
[  805.201622]  ? handle_mm_fault+0x11a/0x2f0
[  805.201623]  hmm_vma_fault.isra.0+0x5b/0xb0
[  805.201625]  hmm_vma_walk_pmd+0x5c7/0xc40
[  805.201627]  ? sysvec_apic_timer_interrupt+0x57/0xc0
[  805.201629]  walk_pgd_range+0x5ba/0xbf0
[  805.201631]  __walk_page_range+0x8e/0x220
[  805.201633]  walk_page_range_mm_unsafe+0x149/0x210
[  805.201635]  walk_page_range+0x2a/0x40
[  805.201636]  hmm_range_fault+0x5c/0xb0
[  805.201638]  drm_gpusvm_range_evict+0x11a/0x1d0 [drm_gpusvm_helper]
[  805.201641]  __xe_svm_handle_pagefault+0x5fa/0xf00 [xe]
[  805.201736]  ? select_task_rq_fair+0x9bc/0x2970
[  805.201738]  xe_svm_handle_pagefault+0x3d/0xb0 [xe]
[  805.201827]  xe_pagefault_queue_work+0x233/0x370 [xe]
[  805.201905]  process_one_work+0x18d/0x370
[  805.201907]  worker_thread+0x31a/0x460
[  805.201908]  ? __pfx_worker_thread+0x10/0x10
[  805.201909]  kthread+0x10b/0x220
[  805.201910]  ? __pfx_kthread+0x10/0x10
[  805.201912]  ret_from_fork+0x289/0x2c0
[  805.201913]  ? __pfx_kthread+0x10/0x10
[  805.201915]  ret_from_fork_asm+0x1a/0x30
[  805.201917]  </TASK>

The thread holding the page-lock:

[ 1629.938195] Workqueue: xe_page_fault_work_queue
xe_pagefault_queue_work [xe]
[ 1629.938340] Call Trace:
[ 1629.938341]  <TASK>
[ 1629.938342]  __schedule+0x47f/0x1890
[ 1629.938346]  ? psi_group_change+0x1bd/0x4d0
[ 1629.938350]  ? __pick_eevdf+0x70/0x180
[ 1629.938353]  schedule+0x27/0xf0
[ 1629.938357]  schedule_timeout+0xcf/0x110
[ 1629.938361]  __wait_for_common+0x98/0x180
[ 1629.938364]  ? __pfx_schedule_timeout+0x10/0x10
[ 1629.938368]  wait_for_completion+0x24/0x40
[ 1629.938370]  __flush_work+0x2b6/0x400
[ 1629.938373]  ? kick_pool+0x77/0x1b0
[ 1629.938377]  ? __pfx_wq_barrier_func+0x10/0x10
[ 1629.938382]  flush_work+0x1c/0x30
[ 1629.938384]  __lru_add_drain_all+0x19f/0x2a0
[ 1629.938390]  lru_add_drain_all+0x10/0x20
[ 1629.938392]  migrate_device_unmap+0x433/0x480
[ 1629.938398]  migrate_vma_setup+0x245/0x300
[ 1629.938403]  drm_pagemap_migrate_to_devmem+0x2a8/0xc00
[drm_gpusvm_helper]
[ 1629.938410]  ? krealloc_node_align_noprof+0x12f/0x3a0
[ 1629.938413]  ? __xe_bo_create_locked+0x376/0x840 [xe]
[ 1629.938529]  xe_drm_pagemap_populate_mm+0x25f/0x3a0 [xe]
[ 1629.938721]  drm_pagemap_populate_mm+0x74/0xe0 [drm_gpusvm_helper]
[ 1629.938731]  xe_svm_alloc_vram+0xad/0x270 [xe]
[ 1629.938933]  ? xe_tile_local_pagemap+0x41/0x170 [xe]
[ 1629.939095]  ? ktime_get+0x41/0x100
[ 1629.939098]  __xe_svm_handle_pagefault+0xa90/0xf00 [xe]
[ 1629.939279]  xe_svm_handle_pagefault+0x3d/0xb0 [xe]
[ 1629.939460]  xe_pagefault_queue_work+0x233/0x370 [xe]
[ 1629.939620]  process_one_work+0x18d/0x370
[ 1629.939623]  worker_thread+0x31a/0x460
[ 1629.939626]  ? __pfx_worker_thread+0x10/0x10
[ 1629.939629]  kthread+0x10b/0x220
[ 1629.939632]  ? __pfx_kthread+0x10/0x10
[ 1629.939636]  ret_from_fork+0x289/0x2c0
[ 1629.939639]  ? __pfx_kthread+0x10/0x10
[ 1629.939642]  ret_from_fork_asm+0x1a/0x30
[ 1629.939648]  </TASK>

The worker that this thread waits on in flush_work() is, most likely,
the one starved of cpu time on cpu #139.
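
To spell out the spinning side: when the folio_trylock() fails, the
pre-patch code simply drops the ptl and returns, so the fault is
reported as handled without anything having changed, and
hmm_range_fault() immediately faults the same address again.
Paraphrased (not the exact lines, see the diff):

	if (folio_trylock(folio)) {
		/* ... migrate the device-private folio back to ram ... */
	} else {
		/*
		 * Contended: drop the ptl and return without resolving the
		 * fault. hmm_vma_fault() -> handle_mm_fault() then retries
		 * the same address right away, so with no (or only
		 * voluntary) preemption this CPU busy-loops until the folio
		 * lock is released.
		 */
		pte_unmap_unlock(vmf->pte, vmf->ptl);
	}

That refault loop on CPU #139 is what prevents the lru_add_drain work
item from ever running there, which in turn keeps the other worker
stuck in flush_work() with the folio lock held.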

Thanks,
Thomas


^ permalink raw reply	[flat|nested] 9+ messages in thread

* Re: [PATCH v4] mm: Fix a hmm_range_fault() livelock / starvation problem
  2026-02-05 11:10 [PATCH v4] mm: Fix a hmm_range_fault() livelock / starvation problem Thomas Hellström
  2026-02-05 11:20 ` Balbir Singh
@ 2026-02-09 14:47 ` Thomas Hellström
  2026-02-10  1:34   ` Andrew Morton
  2026-02-10  2:22   ` Alistair Popple
  2026-02-10  2:56 ` Balbir Singh
  2 siblings, 2 replies; 9+ messages in thread
From: Thomas Hellström @ 2026-02-09 14:47 UTC (permalink / raw)
  To: intel-xe
  Cc: Alistair Popple, Ralph Campbell, Christoph Hellwig,
	Jason Gunthorpe, Jason Gunthorpe, Leon Romanovsky, Andrew Morton,
	Matthew Brost, John Hubbard, linux-mm, dri-devel, stable

@Alistair, any chance of an R-B for the below version?
@Andrew, will this go through the -mm tree or alternatively an ack for
merging through drm-xe-fixes?

/Thomas

8<-------------------------------------------------------------------

On Thu, 2026-02-05 at 12:10 +0100, Thomas Hellström wrote:
> If hmm_range_fault() fails a folio_trylock() in do_swap_page,
> trying to acquire the lock of a device-private folio for migration,
> to ram, the function will spin until it succeeds grabbing the lock.
> 
> However, if the process holding the lock is depending on a work
> item to be completed, which is scheduled on the same CPU as the
> spinning hmm_range_fault(), that work item might be starved and
> we end up in a livelock / starvation situation which is never
> resolved.
> 
> This can happen, for example if the process holding the
> device-private folio lock is stuck in
>    migrate_device_unmap()->lru_add_drain_all()
> The lru_add_drain_all() function requires a short work-item
> to be run on all online cpus to complete.
> 
> A prerequisite for this to happen is:
> a) Both zone device and system memory folios are considered in
>    migrate_device_unmap(), so that there is a reason to call
>    lru_add_drain_all() for a system memory folio while a
>    folio lock is held on a zone device folio.
> b) The zone device folio has an initial mapcount > 1 which causes
>    at least one migration PTE entry insertion to be deferred to
>    try_to_migrate(), which can happen after the call to
>    lru_add_drain_all().
> c) No or voluntary only preemption.
> 
> This all seems pretty unlikely to happen, but indeed is hit by
> the "xe_exec_system_allocator" igt test.
> 
> Resolve this by waiting for the folio to be unlocked if the
> folio_trylock() fails in the do_swap_page() function.
> 
> Rename the migration_entry_wait_on_locked() function to
> softleaf_entry_wait_unlock() and update its documentation to
> indicate the new use-case.
> 
> Future code improvements might consider moving
> the lru_add_drain_all() call in migrate_device_unmap() to be
> called *after* all pages have migration entries inserted.
> That would eliminate also b) above.
> 
> v2:
> - Instead of a cond_resched() in the hmm_range_fault() function,
>   eliminate the problem by waiting for the folio to be unlocked
>   in do_swap_page() (Alistair Popple, Andrew Morton)
> v3:
> - Add a stub migration_entry_wait_on_locked() for the
>   !CONFIG_MIGRATION case. (Kernel Test Robot)
> v4:
> - Rename migrate_entry_wait_on_locked() to
>   softleaf_entry_wait_on_locked() and update docs (Alistair Popple)
> 
> Suggested-by: Alistair Popple <apopple@nvidia.com>
> Fixes: 1afaeb8293c9 ("mm/migrate: Trylock device page in
> do_swap_page")
> Cc: Ralph Campbell <rcampbell@nvidia.com>
> Cc: Christoph Hellwig <hch@lst.de>
> Cc: Jason Gunthorpe <jgg@mellanox.com>
> Cc: Jason Gunthorpe <jgg@ziepe.ca>
> Cc: Leon Romanovsky <leon@kernel.org>
> Cc: Andrew Morton <akpm@linux-foundation.org>
> Cc: Matthew Brost <matthew.brost@intel.com>
> Cc: John Hubbard <jhubbard@nvidia.com>
> Cc: Alistair Popple <apopple@nvidia.com>
> Cc: linux-mm@kvack.org
> Cc: <dri-devel@lists.freedesktop.org>
> Signed-off-by: Thomas Hellström <thomas.hellstrom@linux.intel.com>
> Cc: <stable@vger.kernel.org> # v6.15+
> Reviewed-by: John Hubbard <jhubbard@nvidia.com> #v3
> ---
>  include/linux/migrate.h |  8 +++++++-
>  mm/filemap.c            | 15 ++++++++++-----
>  mm/memory.c             |  3 ++-
>  mm/migrate.c            |  8 ++++----
>  mm/migrate_device.c     |  2 +-
>  5 files changed, 24 insertions(+), 12 deletions(-)
> 
> diff --git a/include/linux/migrate.h b/include/linux/migrate.h
> index 26ca00c325d9..3cc387f1957d 100644
> --- a/include/linux/migrate.h
> +++ b/include/linux/migrate.h
> @@ -65,7 +65,7 @@ bool isolate_folio_to_list(struct folio *folio,
> struct list_head *list);
>  
>  int migrate_huge_page_move_mapping(struct address_space *mapping,
>  		struct folio *dst, struct folio *src);
> -void migration_entry_wait_on_locked(softleaf_t entry, spinlock_t
> *ptl)
> +void softleaf_entry_wait_on_locked(softleaf_t entry, spinlock_t
> *ptl)
>  		__releases(ptl);
>  void folio_migrate_flags(struct folio *newfolio, struct folio
> *folio);
>  int folio_migrate_mapping(struct address_space *mapping,
> @@ -97,6 +97,12 @@ static inline int set_movable_ops(const struct
> movable_operations *ops, enum pag
>  	return -ENOSYS;
>  }
>  
> +static inline void softleaf_entry_wait_on_locked(softleaf_t entry,
> spinlock_t *ptl)
> +	__releases(ptl)
> +{
> +	spin_unlock(ptl);
> +}
> +
>  #endif /* CONFIG_MIGRATION */
>  
>  #ifdef CONFIG_NUMA_BALANCING
> diff --git a/mm/filemap.c b/mm/filemap.c
> index ebd75684cb0a..d98e4883f13d 100644
> --- a/mm/filemap.c
> +++ b/mm/filemap.c
> @@ -1379,14 +1379,16 @@ static inline int
> folio_wait_bit_common(struct folio *folio, int bit_nr,
>  
>  #ifdef CONFIG_MIGRATION
>  /**
> - * migration_entry_wait_on_locked - Wait for a migration entry to be
> removed
> - * @entry: migration swap entry.
> + * softleaf_entry_wait_on_locked - Wait for a migration entry or
> + * device_private entry to be removed.
> + * @entry: migration or device_private swap entry.
>   * @ptl: already locked ptl. This function will drop the lock.
>   *
> - * Wait for a migration entry referencing the given page to be
> removed. This is
> + * Wait for a migration entry referencing the given page, or
> device_private
> + * entry referencing a dvice_private page to be unlocked. This is
>   * equivalent to folio_put_wait_locked(folio, TASK_UNINTERRUPTIBLE)
> except
>   * this can be called without taking a reference on the page.
> Instead this
> - * should be called while holding the ptl for the migration entry
> referencing
> + * should be called while holding the ptl for @entry referencing
>   * the page.
>   *
>   * Returns after unlocking the ptl.
> @@ -1394,7 +1396,7 @@ static inline int folio_wait_bit_common(struct
> folio *folio, int bit_nr,
>   * This follows the same logic as folio_wait_bit_common() so see the
> comments
>   * there.
>   */
> -void migration_entry_wait_on_locked(softleaf_t entry, spinlock_t
> *ptl)
> +void softleaf_entry_wait_on_locked(softleaf_t entry, spinlock_t
> *ptl)
>  	__releases(ptl)
>  {
>  	struct wait_page_queue wait_page;
> @@ -1428,6 +1430,9 @@ void migration_entry_wait_on_locked(softleaf_t
> entry, spinlock_t *ptl)
>  	 * If a migration entry exists for the page the migration
> path must hold
>  	 * a valid reference to the page, and it must take the ptl
> to remove the
>  	 * migration entry. So the page is valid until the ptl is
> dropped.
> +	 * Similarly any path attempting to drop the last reference
> to a
> +	 * device-private page needs to grab the ptl to remove the
> device-private
> +	 * entry.
>  	 */
>  	spin_unlock(ptl);
>  
> diff --git a/mm/memory.c b/mm/memory.c
> index da360a6eb8a4..20172476a57f 100644
> --- a/mm/memory.c
> +++ b/mm/memory.c
> @@ -4684,7 +4684,8 @@ vm_fault_t do_swap_page(struct vm_fault *vmf)
>  				unlock_page(vmf->page);
>  				put_page(vmf->page);
>  			} else {
> -				pte_unmap_unlock(vmf->pte, vmf-
> >ptl);
> +				pte_unmap(vmf->pte);
> +				softleaf_entry_wait_on_locked(entry,
> vmf->ptl);
>  			}
>  		} else if (softleaf_is_hwpoison(entry)) {
>  			ret = VM_FAULT_HWPOISON;
> diff --git a/mm/migrate.c b/mm/migrate.c
> index 4688b9e38cd2..cf6449b4202e 100644
> --- a/mm/migrate.c
> +++ b/mm/migrate.c
> @@ -499,7 +499,7 @@ void migration_entry_wait(struct mm_struct *mm,
> pmd_t *pmd,
>  	if (!softleaf_is_migration(entry))
>  		goto out;
>  
> -	migration_entry_wait_on_locked(entry, ptl);
> +	softleaf_entry_wait_on_locked(entry, ptl);
>  	return;
>  out:
>  	spin_unlock(ptl);
> @@ -531,10 +531,10 @@ void migration_entry_wait_huge(struct
> vm_area_struct *vma, unsigned long addr, p
>  		 * If migration entry existed, safe to release vma
> lock
>  		 * here because the pgtable page won't be freed
> without the
>  		 * pgtable lock released.  See comment right above
> pgtable
> -		 * lock release in migration_entry_wait_on_locked().
> +		 * lock release in softleaf_entry_wait_on_locked().
>  		 */
>  		hugetlb_vma_unlock_read(vma);
> -		migration_entry_wait_on_locked(entry, ptl);
> +		softleaf_entry_wait_on_locked(entry, ptl);
>  		return;
>  	}
>  
> @@ -552,7 +552,7 @@ void pmd_migration_entry_wait(struct mm_struct
> *mm, pmd_t *pmd)
>  	ptl = pmd_lock(mm, pmd);
>  	if (!pmd_is_migration_entry(*pmd))
>  		goto unlock;
> -	migration_entry_wait_on_locked(softleaf_from_pmd(*pmd),
> ptl);
> +	softleaf_entry_wait_on_locked(softleaf_from_pmd(*pmd), ptl);
>  	return;
>  unlock:
>  	spin_unlock(ptl);
> diff --git a/mm/migrate_device.c b/mm/migrate_device.c
> index 23379663b1e1..deab89fd4541 100644
> --- a/mm/migrate_device.c
> +++ b/mm/migrate_device.c
> @@ -176,7 +176,7 @@ static int migrate_vma_collect_huge_pmd(pmd_t
> *pmdp, unsigned long start,
>  		}
>  
>  		if (softleaf_is_migration(entry)) {
> -			migration_entry_wait_on_locked(entry, ptl);
> +			softleaf_entry_wait_on_locked(entry, ptl);
>  			spin_unlock(ptl);
>  			return -EAGAIN;
>  		}


^ permalink raw reply	[flat|nested] 9+ messages in thread

* Re: [PATCH v4] mm: Fix a hmm_range_fault() livelock / starvation problem
  2026-02-09 14:47 ` Thomas Hellström
@ 2026-02-10  1:34   ` Andrew Morton
  2026-02-12  8:52     ` Thomas Hellström
  2026-02-10  2:22   ` Alistair Popple
  1 sibling, 1 reply; 9+ messages in thread
From: Andrew Morton @ 2026-02-10  1:34 UTC (permalink / raw)
  To: Thomas Hellström
  Cc: intel-xe, Alistair Popple, Ralph Campbell, Christoph Hellwig,
	Jason Gunthorpe, Jason Gunthorpe, Leon Romanovsky, Matthew Brost,
	John Hubbard, linux-mm, dri-devel, stable

On Mon, 09 Feb 2026 15:47:38 +0100 Thomas Hellström <thomas.hellstrom@linux.intel.com> wrote:

> @Alistair, any chance of an R-B for the below version?

Yes please.

> @Andrew, will this go through the -mm tree or alternaltively an ack for
> merging through drm-xe-fixes?

Either works.  If you want to take this via drm then
I'll drop the mm.git copy once the drm tree's version appears in linux-next.

Acked-by: Andrew Morton <akpm@linux-foundation.org>

> > The lru_add_drain_all() function requires a short work-item

Pet peeve: s/the foo() function/foo()/g.  It's just as good!

> > --- a/mm/memory.c
> > +++ b/mm/memory.c
> > @@ -4684,7 +4684,8 @@ vm_fault_t do_swap_page(struct vm_fault *vmf)
> >  				unlock_page(vmf->page);
> >  				put_page(vmf->page);
> >  			} else {
> > -				pte_unmap_unlock(vmf->pte, vmf->ptl);
> > +				pte_unmap(vmf->pte);
> > +				softleaf_entry_wait_on_locked(entry, vmf->ptl);
> >  			}
> >  		} else if (softleaf_is_hwpoison(entry)) {
> >  			ret = VM_FAULT_HWPOISON;

So apart from the rename, this is the whole patch.  This got nicer!


^ permalink raw reply	[flat|nested] 9+ messages in thread

* Re: [PATCH v4] mm: Fix a hmm_range_fault() livelock / starvation problem
  2026-02-09 14:47 ` Thomas Hellström
  2026-02-10  1:34   ` Andrew Morton
@ 2026-02-10  2:22   ` Alistair Popple
  1 sibling, 0 replies; 9+ messages in thread
From: Alistair Popple @ 2026-02-10  2:22 UTC (permalink / raw)
  To: Thomas Hellström
  Cc: intel-xe, Ralph Campbell, Christoph Hellwig, Jason Gunthorpe,
	Jason Gunthorpe, Leon Romanovsky, Andrew Morton, Matthew Brost,
	John Hubbard, linux-mm, dri-devel, stable

On 2026-02-10 at 01:47 +1100, Thomas Hellström <thomas.hellstrom@linux.intel.com> wrote...
> @Alistair, any chance of an R-B for the below version?

For sure. Sorry I've been slow getting back to this, but I got caught
up with internal stuff.

> > +static inline void softleaf_entry_wait_on_locked(softleaf_t entry,
> > spinlock_t *ptl)
> > +	__releases(ptl)
> > +{
> > +	spin_unlock(ptl);
> > +}
> > +

I noticed this just because we didn't have it previously, but I assume it's to
avoid compilation failures in do_swap_page(). This is definitely the better way
of dealing with the conditional compilation, though if I were to add a nit it
would be that a WARN_ON_ONCE() would be nice here.
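
Something like the following, perhaps; just a sketch of that nit, not
something this patch needs to grow:

static inline void softleaf_entry_wait_on_locked(softleaf_t entry, spinlock_t *ptl)
	__releases(ptl)
{
	/* Reaching this without CONFIG_MIGRATION would be unexpected. */
	WARN_ON_ONCE(1);
	spin_unlock(ptl);
}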

But this is fine, and thanks for doing the rename. Feel free to add:

Reviewed-by: Alistair Popple <apopple@nvidia.com>

> >  #endif /* CONFIG_MIGRATION */
> >  
> >  #ifdef CONFIG_NUMA_BALANCING
> > diff --git a/mm/filemap.c b/mm/filemap.c
> > index ebd75684cb0a..d98e4883f13d 100644
> > --- a/mm/filemap.c
> > +++ b/mm/filemap.c
> > @@ -1379,14 +1379,16 @@ static inline int
> > folio_wait_bit_common(struct folio *folio, int bit_nr,
> >  
> >  #ifdef CONFIG_MIGRATION
> >  /**
> > - * migration_entry_wait_on_locked - Wait for a migration entry to be
> > removed
> > - * @entry: migration swap entry.
> > + * softleaf_entry_wait_on_locked - Wait for a migration entry or
> > + * device_private entry to be removed.
> > + * @entry: migration or device_private swap entry.
> >   * @ptl: already locked ptl. This function will drop the lock.
> >   *
> > - * Wait for a migration entry referencing the given page to be
> > removed. This is
> > + * Wait for a migration entry referencing the given page, or
> > device_private
> > + * entry referencing a dvice_private page to be unlocked. This is
> >   * equivalent to folio_put_wait_locked(folio, TASK_UNINTERRUPTIBLE)
> > except
> >   * this can be called without taking a reference on the page.
> > Instead this
> > - * should be called while holding the ptl for the migration entry
> > referencing
> > + * should be called while holding the ptl for @entry referencing
> >   * the page.
> >   *
> >   * Returns after unlocking the ptl.
> > @@ -1394,7 +1396,7 @@ static inline int folio_wait_bit_common(struct
> > folio *folio, int bit_nr,
> >   * This follows the same logic as folio_wait_bit_common() so see the
> > comments
> >   * there.
> >   */
> > -void migration_entry_wait_on_locked(softleaf_t entry, spinlock_t
> > *ptl)
> > +void softleaf_entry_wait_on_locked(softleaf_t entry, spinlock_t
> > *ptl)
> >  	__releases(ptl)
> >  {
> >  	struct wait_page_queue wait_page;
> > @@ -1428,6 +1430,9 @@ void migration_entry_wait_on_locked(softleaf_t
> > entry, spinlock_t *ptl)
> >  	 * If a migration entry exists for the page the migration
> > path must hold
> >  	 * a valid reference to the page, and it must take the ptl
> > to remove the
> >  	 * migration entry. So the page is valid until the ptl is
> > dropped.
> > +	 * Similarly any path attempting to drop the last reference
> > to a
> > +	 * device-private page needs to grab the ptl to remove the
> > device-private
> > +	 * entry.
> >  	 */
> >  	spin_unlock(ptl);
> >  
> > diff --git a/mm/memory.c b/mm/memory.c
> > index da360a6eb8a4..20172476a57f 100644
> > --- a/mm/memory.c
> > +++ b/mm/memory.c
> > @@ -4684,7 +4684,8 @@ vm_fault_t do_swap_page(struct vm_fault *vmf)
> >  				unlock_page(vmf->page);
> >  				put_page(vmf->page);
> >  			} else {
> > -				pte_unmap_unlock(vmf->pte, vmf-
> > >ptl);
> > +				pte_unmap(vmf->pte);
> > +				softleaf_entry_wait_on_locked(entry,
> > vmf->ptl);
> >  			}
> >  		} else if (softleaf_is_hwpoison(entry)) {
> >  			ret = VM_FAULT_HWPOISON;
> > diff --git a/mm/migrate.c b/mm/migrate.c
> > index 4688b9e38cd2..cf6449b4202e 100644
> > --- a/mm/migrate.c
> > +++ b/mm/migrate.c
> > @@ -499,7 +499,7 @@ void migration_entry_wait(struct mm_struct *mm,
> > pmd_t *pmd,
> >  	if (!softleaf_is_migration(entry))
> >  		goto out;
> >  
> > -	migration_entry_wait_on_locked(entry, ptl);
> > +	softleaf_entry_wait_on_locked(entry, ptl);
> >  	return;
> >  out:
> >  	spin_unlock(ptl);
> > @@ -531,10 +531,10 @@ void migration_entry_wait_huge(struct
> > vm_area_struct *vma, unsigned long addr, p
> >  		 * If migration entry existed, safe to release vma
> > lock
> >  		 * here because the pgtable page won't be freed
> > without the
> >  		 * pgtable lock released.  See comment right above
> > pgtable
> > -		 * lock release in migration_entry_wait_on_locked().
> > +		 * lock release in softleaf_entry_wait_on_locked().
> >  		 */
> >  		hugetlb_vma_unlock_read(vma);
> > -		migration_entry_wait_on_locked(entry, ptl);
> > +		softleaf_entry_wait_on_locked(entry, ptl);
> >  		return;
> >  	}
> >  
> > @@ -552,7 +552,7 @@ void pmd_migration_entry_wait(struct mm_struct
> > *mm, pmd_t *pmd)
> >  	ptl = pmd_lock(mm, pmd);
> >  	if (!pmd_is_migration_entry(*pmd))
> >  		goto unlock;
> > -	migration_entry_wait_on_locked(softleaf_from_pmd(*pmd),
> > ptl);
> > +	softleaf_entry_wait_on_locked(softleaf_from_pmd(*pmd), ptl);
> >  	return;
> >  unlock:
> >  	spin_unlock(ptl);
> > diff --git a/mm/migrate_device.c b/mm/migrate_device.c
> > index 23379663b1e1..deab89fd4541 100644
> > --- a/mm/migrate_device.c
> > +++ b/mm/migrate_device.c
> > @@ -176,7 +176,7 @@ static int migrate_vma_collect_huge_pmd(pmd_t
> > *pmdp, unsigned long start,
> >  		}
> >  
> >  		if (softleaf_is_migration(entry)) {
> > -			migration_entry_wait_on_locked(entry, ptl);
> > +			softleaf_entry_wait_on_locked(entry, ptl);
> >  			spin_unlock(ptl);
> >  			return -EAGAIN;
> >  		}


^ permalink raw reply	[flat|nested] 9+ messages in thread

* Re: [PATCH v4] mm: Fix a hmm_range_fault() livelock / starvation problem
  2026-02-05 12:41   ` Thomas Hellström
@ 2026-02-10  2:47     ` Balbir Singh
  0 siblings, 0 replies; 9+ messages in thread
From: Balbir Singh @ 2026-02-10  2:47 UTC (permalink / raw)
  To: Thomas Hellström, intel-xe
  Cc: Alistair Popple, Ralph Campbell, Christoph Hellwig,
	Jason Gunthorpe, Jason Gunthorpe, Leon Romanovsky, Andrew Morton,
	Matthew Brost, John Hubbard, linux-mm, dri-devel, stable

On 2/5/26 23:41, Thomas Hellström wrote:
> On Thu, 2026-02-05 at 22:20 +1100, Balbir Singh wrote:
>> On 2/5/26 22:10, Thomas Hellström wrote:
>>> If hmm_range_fault() fails a folio_trylock() in do_swap_page,
>>> trying to acquire the lock of a device-private folio for migration,
>>> to ram, the function will spin until it succeeds grabbing the lock.
>>>
>>> However, if the process holding the lock is depending on a work
>>> item to be completed, which is scheduled on the same CPU as the
>>> spinning hmm_range_fault(), that work item might be starved and
>>> we end up in a livelock / starvation situation which is never
>>> resolved.
>>>
>>> This can happen, for example if the process holding the
>>> device-private folio lock is stuck in
>>>    migrate_device_unmap()->lru_add_drain_all()
>>> The lru_add_drain_all() function requires a short work-item
>>> to be run on all online cpus to complete.
>>>
>>> A prerequisite for this to happen is:
>>> a) Both zone device and system memory folios are considered in
>>>    migrate_device_unmap(), so that there is a reason to call
>>>    lru_add_drain_all() for a system memory folio while a
>>>    folio lock is held on a zone device folio.
>>> b) The zone device folio has an initial mapcount > 1 which causes
>>>    at least one migration PTE entry insertion to be deferred to
>>>    try_to_migrate(), which can happen after the call to
>>>    lru_add_drain_all().
>>> c) No or voluntary only preemption.
>>>
>>> This all seems pretty unlikely to happen, but indeed is hit by
>>> the "xe_exec_system_allocator" igt test.
>>>
>>
>> Do you have a stack trace from the test? I am trying to visualize the
>> livelock/starvation, but I can't from the description.
> 
> The spinning thread: (The backtrace varies slightly from time to time:)
> 
> [  805.201476] watchdog: BUG: soft lockup - CPU#139 stuck for 52s!
> [kworker/u900:1:9985]
> [  805.201477] Modules linked in: xt_conntrack nft_chain_nat
> xt_MASQUERADE nf_nat nf_conntrack nf_defrag_ipv6 nf_defrag_ipv4 bridge
> stp llc xfrm_user xfrm_algo xt_addrtype nft_compat x_tables nf_tables
> mei_gsc_proxy pmt_crashlog mtd_intel_dg mei_gsc overlay qrtr
> snd_hda_codec_intelhdmi snd_hda_codec_hdmi intel_rapl_msr
> intel_rapl_common cfg80211 intel_uncore_frequency
> intel_uncore_frequency_common intel_ifs i10nm_edac sunrpc binfmt_misc
> skx_edac_common nfit xe x86_pkg_temp_thermal intel_powerclamp coretemp
> nls_iso8859_1 kvm_intel kvm drm_ttm_helper drm_suballoc_helper
> gpu_sched snd_hda_intel cmdlinepart drm_gpuvm snd_intel_dspcfg drm_exec
> spi_nor drm_gpusvm_helper snd_hda_codec drm_buddy pmt_telemetry
> dax_hmem snd_hwdep pmt_discovery mtd video irqbypass cxl_acpi qat_4xxx
> iaa_crypto snd_hda_core pmt_class ttm rapl ses cxl_port snd_pcm
> intel_cstate enclosure cxl_core intel_qat isst_if_mmio isst_if_mbox_pci
> drm_display_helper snd_timer snd cec idxd crc8 einj ast mei_me
> spi_intel_pci rc_core soundcore isst_if_common
> [  805.201496]  ipmi_ssif authenc i2c_i801 intel_vsec idxd_bus
> spi_intel i2c_algo_bit mei i2c_ismt i2c_smbus wmi joydev input_leds
> ipmi_si acpi_power_meter acpi_ipmi ipmi_devintf ipmi_msghandler
> acpi_pad mac_hid pfr_telemetry pfr_update sch_fq_codel msr efi_pstore
> dm_multipath nfnetlink dmi_sysfs autofs4 btrfs blake2b libblake2b
> raid10 raid456 async_raid6_recov async_memcpy async_pq async_xor
> async_tx xor raid6_pq raid1 raid0 linear rndis_host cdc_ether usbnet
> mii nvme hid_generic mpt3sas i40e nvme_core usbhid ahci
> ghash_clmulni_intel raid_class nvme_keyring scsi_transport_sas hid
> libahci nvme_auth libie hkdf libie_adminq pinctrl_emmitsburg
> aesni_intel
> [  805.201510] CPU: 139 UID: 0 PID: 9985 Comm: kworker/u900:1 Tainted:
> G S      W    L      6.19.0-rc7+ #18 PREEMPT(voluntary) 
> [  805.201512] Tainted: [S]=CPU_OUT_OF_SPEC, [W]=WARN, [L]=SOFTLOCKUP
> [  805.201512] Hardware name: Supermicro SYS-421GE-TNRT/X13DEG-OA, BIOS
> 2.5a 02/21/2025
> [  805.201513] Workqueue: xe_page_fault_work_queue
> xe_pagefault_queue_work [xe]
> [  805.201599] RIP: 0010:_raw_spin_unlock+0x16/0x40
> [  805.201602] Code: cc 0f 1f 00 90 90 90 90 90 90 90 90 90 90 90 90 90
> 90 90 90 0f 1f 44 00 00 55 48 89 e5 c6 07 00 0f 1f 00 65 ff 0d fa a6 40
> 01 <74> 10 5d 31 c0 31 d2 31 c9 31 f6 31 ff c3 cc cc cc cc 0f 1f 44 00
> [  805.201603] RSP: 0018:ffffd2a663a4f678 EFLAGS: 00000247
> [  805.201603] RAX: fffff85c67e35080 RBX: ffffd2a663a4f7b8 RCX:
> 0000000000000000
> [  805.201604] RDX: ffff8b88fdd31a00 RSI: 0000000000000000 RDI:
> fffff75c86ff5928
> [  805.201605] RBP: ffffd2a663a4f678 R08: 0000000000000000 R09:
> 0000000000000000
> [  805.201605] R10: 0000000000000000 R11: 0000000000000000 R12:
> 0000631d10d42000
> [  805.201606] R13: ffffd2a663a4f7b8 R14: 00000001a4ca4067 R15:
> 74000003ff9f8d42
> [  805.201606] FS:  0000000000000000(0000) GS:ffff8bc76202b000(0000)
> knlGS:0000000000000000
> [  805.201607] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
> [  805.201608] CR2: 0000631d10c00088 CR3: 0000003de3040004 CR4:
> 0000000000f72ef0
> [  805.201609] PKRU: 55555554
> [  805.201609] Call Trace:
> [  805.201610]  <TASK>
> [  805.201610]  do_swap_page+0x17c6/0x1b70
> [  805.201612]  ? sysvec_apic_timer_interrupt+0x57/0xc0
> [  805.201614]  ? asm_sysvec_apic_timer_interrupt+0x1b/0x20
> [  805.201615]  ? __pfx_default_wake_function+0x10/0x10
> [  805.201617]  ? ___pte_offset_map+0x1c/0x130
> [  805.201619]  __handle_mm_fault+0xa75/0x1020
> [  805.201621]  handle_mm_fault+0xeb/0x2f0
> [  805.201622]  ? handle_mm_fault+0x11a/0x2f0
> [  805.201623]  hmm_vma_fault.isra.0+0x5b/0xb0
> [  805.201625]  hmm_vma_walk_pmd+0x5c7/0xc40
> [  805.201627]  ? sysvec_apic_timer_interrupt+0x57/0xc0
> [  805.201629]  walk_pgd_range+0x5ba/0xbf0
> [  805.201631]  __walk_page_range+0x8e/0x220
> [  805.201633]  walk_page_range_mm_unsafe+0x149/0x210
> [  805.201635]  walk_page_range+0x2a/0x40
> [  805.201636]  hmm_range_fault+0x5c/0xb0
> [  805.201638]  drm_gpusvm_range_evict+0x11a/0x1d0 [drm_gpusvm_helper]
> [  805.201641]  __xe_svm_handle_pagefault+0x5fa/0xf00 [xe]
> [  805.201736]  ? select_task_rq_fair+0x9bc/0x2970
> [  805.201738]  xe_svm_handle_pagefault+0x3d/0xb0 [xe]
> [  805.201827]  xe_pagefault_queue_work+0x233/0x370 [xe]
> [  805.201905]  process_one_work+0x18d/0x370
> [  805.201907]  worker_thread+0x31a/0x460
> [  805.201908]  ? __pfx_worker_thread+0x10/0x10
> [  805.201909]  kthread+0x10b/0x220
> [  805.201910]  ? __pfx_kthread+0x10/0x10
> [  805.201912]  ret_from_fork+0x289/0x2c0
> [  805.201913]  ? __pfx_kthread+0x10/0x10
> [  805.201915]  ret_from_fork_asm+0x1a/0x30
> [  805.201917]  </TASK>
> 
> The thread holding the page-lock:
> 
> [ 1629.938195] Workqueue: xe_page_fault_work_queue
> xe_pagefault_queue_work [xe]
> [ 1629.938340] Call Trace:
> [ 1629.938341]  <TASK>
> [ 1629.938342]  __schedule+0x47f/0x1890
> [ 1629.938346]  ? psi_group_change+0x1bd/0x4d0
> [ 1629.938350]  ? __pick_eevdf+0x70/0x180
> [ 1629.938353]  schedule+0x27/0xf0
> [ 1629.938357]  schedule_timeout+0xcf/0x110
> [ 1629.938361]  __wait_for_common+0x98/0x180
> [ 1629.938364]  ? __pfx_schedule_timeout+0x10/0x10
> [ 1629.938368]  wait_for_completion+0x24/0x40
> [ 1629.938370]  __flush_work+0x2b6/0x400
> [ 1629.938373]  ? kick_pool+0x77/0x1b0
> [ 1629.938377]  ? __pfx_wq_barrier_func+0x10/0x10
> [ 1629.938382]  flush_work+0x1c/0x30
> [ 1629.938384]  __lru_add_drain_all+0x19f/0x2a0
> [ 1629.938390]  lru_add_drain_all+0x10/0x20
> [ 1629.938392]  migrate_device_unmap+0x433/0x480
> [ 1629.938398]  migrate_vma_setup+0x245/0x300
> [ 1629.938403]  drm_pagemap_migrate_to_devmem+0x2a8/0xc00
> [drm_gpusvm_helper]
> [ 1629.938410]  ? krealloc_node_align_noprof+0x12f/0x3a0
> [ 1629.938413]  ? __xe_bo_create_locked+0x376/0x840 [xe]
> [ 1629.938529]  xe_drm_pagemap_populate_mm+0x25f/0x3a0 [xe]
> [ 1629.938721]  drm_pagemap_populate_mm+0x74/0xe0 [drm_gpusvm_helper]
> [ 1629.938731]  xe_svm_alloc_vram+0xad/0x270 [xe]
> [ 1629.938933]  ? xe_tile_local_pagemap+0x41/0x170 [xe]
> [ 1629.939095]  ? ktime_get+0x41/0x100
> [ 1629.939098]  __xe_svm_handle_pagefault+0xa90/0xf00 [xe]
> [ 1629.939279]  xe_svm_handle_pagefault+0x3d/0xb0 [xe]
> [ 1629.939460]  xe_pagefault_queue_work+0x233/0x370 [xe]
> [ 1629.939620]  process_one_work+0x18d/0x370
> [ 1629.939623]  worker_thread+0x31a/0x460
> [ 1629.939626]  ? __pfx_worker_thread+0x10/0x10
> [ 1629.939629]  kthread+0x10b/0x220
> [ 1629.939632]  ? __pfx_kthread+0x10/0x10
> [ 1629.939636]  ret_from_fork+0x289/0x2c0
> [ 1629.939639]  ? __pfx_kthread+0x10/0x10
> [ 1629.939642]  ret_from_fork_asm+0x1a/0x30
> [ 1629.939648]  </TASK>
> 
> The worker that this thread waits on in flush_work() is, 
> most likely, the one starved on cpu-time on cpu #139.
> 
Thanks, makes sense!

Balbir


^ permalink raw reply	[flat|nested] 9+ messages in thread

* Re: [PATCH v4] mm: Fix a hmm_range_fault() livelock / starvation problem
  2026-02-05 11:10 [PATCH v4] mm: Fix a hmm_range_fault() livelock / starvation problem Thomas Hellström
  2026-02-05 11:20 ` Balbir Singh
  2026-02-09 14:47 ` Thomas Hellström
@ 2026-02-10  2:56 ` Balbir Singh
  2 siblings, 0 replies; 9+ messages in thread
From: Balbir Singh @ 2026-02-10  2:56 UTC (permalink / raw)
  To: Thomas Hellström, intel-xe
  Cc: Alistair Popple, Ralph Campbell, Christoph Hellwig,
	Jason Gunthorpe, Jason Gunthorpe, Leon Romanovsky, Andrew Morton,
	Matthew Brost, John Hubbard, linux-mm, dri-devel, stable

On 2/5/26 22:10, Thomas Hellström wrote:
> If hmm_range_fault() fails a folio_trylock() in do_swap_page,
> trying to acquire the lock of a device-private folio for migration,
> to ram, the function will spin until it succeeds grabbing the lock.
> 
> However, if the process holding the lock is depending on a work
> item to be completed, which is scheduled on the same CPU as the
> spinning hmm_range_fault(), that work item might be starved and
> we end up in a livelock / starvation situation which is never
> resolved.
> 
> This can happen, for example if the process holding the
> device-private folio lock is stuck in
>    migrate_device_unmap()->lru_add_drain_all()
> The lru_add_drain_all() function requires a short work-item
> to be run on all online cpus to complete.
> 
> A prerequisite for this to happen is:
> a) Both zone device and system memory folios are considered in
>    migrate_device_unmap(), so that there is a reason to call
>    lru_add_drain_all() for a system memory folio while a
>    folio lock is held on a zone device folio.
> b) The zone device folio has an initial mapcount > 1 which causes
>    at least one migration PTE entry insertion to be deferred to
>    try_to_migrate(), which can happen after the call to
>    lru_add_drain_all().
> c) No or voluntary only preemption.
> 
> This all seems pretty unlikely to happen, but indeed is hit by
> the "xe_exec_system_allocator" igt test.
> 
> Resolve this by waiting for the folio to be unlocked if the
> folio_trylock() fails in the do_swap_page() function.
> 
> Rename the migration_entry_wait_on_locked() function to
> softleaf_entry_wait_unlock() and update its documentation to
> indicate the new use-case.
> 
> Future code improvements might consider moving
> the lru_add_drain_all() call in migrate_device_unmap() to be
> called *after* all pages have migration entries inserted.
> That would eliminate also b) above.
> 
> v2:
> - Instead of a cond_resched() in the hmm_range_fault() function,
>   eliminate the problem by waiting for the folio to be unlocked
>   in do_swap_page() (Alistair Popple, Andrew Morton)
> v3:
> - Add a stub migration_entry_wait_on_locked() for the
>   !CONFIG_MIGRATION case. (Kernel Test Robot)
> v4:
> - Rename migrate_entry_wait_on_locked() to
>   softleaf_entry_wait_on_locked() and update docs (Alistair Popple)
> 
> Suggested-by: Alistair Popple <apopple@nvidia.com>
> Fixes: 1afaeb8293c9 ("mm/migrate: Trylock device page in do_swap_page")
> Cc: Ralph Campbell <rcampbell@nvidia.com>
> Cc: Christoph Hellwig <hch@lst.de>
> Cc: Jason Gunthorpe <jgg@mellanox.com>
> Cc: Jason Gunthorpe <jgg@ziepe.ca>
> Cc: Leon Romanovsky <leon@kernel.org>
> Cc: Andrew Morton <akpm@linux-foundation.org>
> Cc: Matthew Brost <matthew.brost@intel.com>
> Cc: John Hubbard <jhubbard@nvidia.com>
> Cc: Alistair Popple <apopple@nvidia.com>
> Cc: linux-mm@kvack.org
> Cc: <dri-devel@lists.freedesktop.org>
> Signed-off-by: Thomas Hellström <thomas.hellstrom@linux.intel.com>
> Cc: <stable@vger.kernel.org> # v6.15+
> Reviewed-by: John Hubbard <jhubbard@nvidia.com> #v3
> ---
>  include/linux/migrate.h |  8 +++++++-
>  mm/filemap.c            | 15 ++++++++++-----
>  mm/memory.c             |  3 ++-
>  mm/migrate.c            |  8 ++++----
>  mm/migrate_device.c     |  2 +-
>  5 files changed, 24 insertions(+), 12 deletions(-)
> 
> diff --git a/include/linux/migrate.h b/include/linux/migrate.h
> index 26ca00c325d9..3cc387f1957d 100644
> --- a/include/linux/migrate.h
> +++ b/include/linux/migrate.h
> @@ -65,7 +65,7 @@ bool isolate_folio_to_list(struct folio *folio, struct list_head *list);
>  
>  int migrate_huge_page_move_mapping(struct address_space *mapping,
>  		struct folio *dst, struct folio *src);
> -void migration_entry_wait_on_locked(softleaf_t entry, spinlock_t *ptl)
> +void softleaf_entry_wait_on_locked(softleaf_t entry, spinlock_t *ptl)
>  		__releases(ptl);
>  void folio_migrate_flags(struct folio *newfolio, struct folio *folio);
>  int folio_migrate_mapping(struct address_space *mapping,
> @@ -97,6 +97,12 @@ static inline int set_movable_ops(const struct movable_operations *ops, enum pag
>  	return -ENOSYS;
>  }
>  
> +static inline void softleaf_entry_wait_on_locked(softleaf_t entry, spinlock_t *ptl)
> +	__releases(ptl)
> +{
> +	spin_unlock(ptl);
> +}
> +
>  #endif /* CONFIG_MIGRATION */
>  
>  #ifdef CONFIG_NUMA_BALANCING
> diff --git a/mm/filemap.c b/mm/filemap.c
> index ebd75684cb0a..d98e4883f13d 100644
> --- a/mm/filemap.c
> +++ b/mm/filemap.c
> @@ -1379,14 +1379,16 @@ static inline int folio_wait_bit_common(struct folio *folio, int bit_nr,
>  
>  #ifdef CONFIG_MIGRATION
>  /**
> - * migration_entry_wait_on_locked - Wait for a migration entry to be removed
> - * @entry: migration swap entry.
> + * softleaf_entry_wait_on_locked - Wait for a migration entry or
> + * device_private entry to be removed.
> + * @entry: migration or device_private swap entry.
>   * @ptl: already locked ptl. This function will drop the lock.
>   *
> - * Wait for a migration entry referencing the given page to be removed. This is
> + * Wait for a migration entry referencing the given page, or device_private
> + * entry referencing a device_private page to be unlocked. This is
>   * equivalent to folio_put_wait_locked(folio, TASK_UNINTERRUPTIBLE) except
>   * this can be called without taking a reference on the page. Instead this
> - * should be called while holding the ptl for the migration entry referencing
> + * should be called while holding the ptl for @entry referencing
>   * the page.
>   *
>   * Returns after unlocking the ptl.
> @@ -1394,7 +1396,7 @@ static inline int folio_wait_bit_common(struct folio *folio, int bit_nr,
>   * This follows the same logic as folio_wait_bit_common() so see the comments
>   * there.
>   */
> -void migration_entry_wait_on_locked(softleaf_t entry, spinlock_t *ptl)
> +void softleaf_entry_wait_on_locked(softleaf_t entry, spinlock_t *ptl)
>  	__releases(ptl)
>  {
>  	struct wait_page_queue wait_page;
> @@ -1428,6 +1430,9 @@ void migration_entry_wait_on_locked(softleaf_t entry, spinlock_t *ptl)
>  	 * If a migration entry exists for the page the migration path must hold
>  	 * a valid reference to the page, and it must take the ptl to remove the
>  	 * migration entry. So the page is valid until the ptl is dropped.
> +	 * Similarly any path attempting to drop the last reference to a
> +	 * device-private page needs to grab the ptl to remove the device-private
> +	 * entry.
>  	 */
>  	spin_unlock(ptl);
>  
> diff --git a/mm/memory.c b/mm/memory.c
> index da360a6eb8a4..20172476a57f 100644
> --- a/mm/memory.c
> +++ b/mm/memory.c
> @@ -4684,7 +4684,8 @@ vm_fault_t do_swap_page(struct vm_fault *vmf)
>  				unlock_page(vmf->page);
>  				put_page(vmf->page);
>  			} else {
> -				pte_unmap_unlock(vmf->pte, vmf->ptl);
> +				pte_unmap(vmf->pte);
> +				softleaf_entry_wait_on_locked(entry, vmf->ptl);
>  			}
>  		} else if (softleaf_is_hwpoison(entry)) {
>  			ret = VM_FAULT_HWPOISON;
> diff --git a/mm/migrate.c b/mm/migrate.c
> index 4688b9e38cd2..cf6449b4202e 100644
> --- a/mm/migrate.c
> +++ b/mm/migrate.c
> @@ -499,7 +499,7 @@ void migration_entry_wait(struct mm_struct *mm, pmd_t *pmd,
>  	if (!softleaf_is_migration(entry))
>  		goto out;
>  
> -	migration_entry_wait_on_locked(entry, ptl);
> +	softleaf_entry_wait_on_locked(entry, ptl);
>  	return;
>  out:
>  	spin_unlock(ptl);
> @@ -531,10 +531,10 @@ void migration_entry_wait_huge(struct vm_area_struct *vma, unsigned long addr, p
>  		 * If migration entry existed, safe to release vma lock
>  		 * here because the pgtable page won't be freed without the
>  		 * pgtable lock released.  See comment right above pgtable
> -		 * lock release in migration_entry_wait_on_locked().
> +		 * lock release in softleaf_entry_wait_on_locked().
>  		 */
>  		hugetlb_vma_unlock_read(vma);
> -		migration_entry_wait_on_locked(entry, ptl);
> +		softleaf_entry_wait_on_locked(entry, ptl);
>  		return;
>  	}
>  
> @@ -552,7 +552,7 @@ void pmd_migration_entry_wait(struct mm_struct *mm, pmd_t *pmd)
>  	ptl = pmd_lock(mm, pmd);
>  	if (!pmd_is_migration_entry(*pmd))
>  		goto unlock;
> -	migration_entry_wait_on_locked(softleaf_from_pmd(*pmd), ptl);
> +	softleaf_entry_wait_on_locked(softleaf_from_pmd(*pmd), ptl);
>  	return;
>  unlock:
>  	spin_unlock(ptl);
> diff --git a/mm/migrate_device.c b/mm/migrate_device.c
> index 23379663b1e1..deab89fd4541 100644
> --- a/mm/migrate_device.c
> +++ b/mm/migrate_device.c
> @@ -176,7 +176,7 @@ static int migrate_vma_collect_huge_pmd(pmd_t *pmdp, unsigned long start,
>  		}
>  
>  		if (softleaf_is_migration(entry)) {
> -			migration_entry_wait_on_locked(entry, ptl);
> +			softleaf_entry_wait_on_locked(entry, ptl);
>  			spin_unlock(ptl);
>  			return -EAGAIN;
>  		}


Seems reasonable
Acked-by: Balbir Singh <balbirs@nvidia.com>
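
As an aside for anyone skimming the thread: the net behavioural change in
do_swap_page() when the trylock fails is "drop the PTL and sleep until the
folio is unlocked, then retry the fault" instead of "drop the PTL and retry
the fault immediately". A minimal user-space sketch of that difference,
with a plain pthread mutex standing in for the folio lock (invented names,
not the kernel API):

#include <pthread.h>
#include <stdio.h>

static pthread_mutex_t folio_lock = PTHREAD_MUTEX_INITIALIZER;

/* Old pattern: fail the trylock and let the caller retry at once (spins). */
static int fault_old(void)
{
	if (pthread_mutex_trylock(&folio_lock))
		return -1;		/* caller re-faults immediately */
	/* ... migrate the folio back to system memory ... */
	pthread_mutex_unlock(&folio_lock);
	return 0;
}

/*
 * New pattern: fail the trylock, sleep until the lock is released, then
 * let the caller retry -- by then the lock holder has made progress.
 */
static int fault_new(void)
{
	if (pthread_mutex_trylock(&folio_lock)) {
		pthread_mutex_lock(&folio_lock);	/* sleep, don't spin */
		pthread_mutex_unlock(&folio_lock);
		return -1;		/* caller re-faults */
	}
	/* ... migrate the folio back to system memory ... */
	pthread_mutex_unlock(&folio_lock);
	return 0;
}

int main(void)
{
	printf("old: %d, new: %d\n", fault_old(), fault_new());
	return 0;
}

In the kernel the sleep is softleaf_entry_wait_on_locked(), which is called
with the PTL held, drops it, and doesn't need to take a reference on the
folio.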


^ permalink raw reply	[flat|nested] 9+ messages in thread

* Re: [PATCH v4] mm: Fix a hmm_range_fault() livelock / starvation problem
  2026-02-10  1:34   ` Andrew Morton
@ 2026-02-12  8:52     ` Thomas Hellström
  0 siblings, 0 replies; 9+ messages in thread
From: Thomas Hellström @ 2026-02-12  8:52 UTC (permalink / raw)
  To: Andrew Morton
  Cc: intel-xe, Alistair Popple, Ralph Campbell, Christoph Hellwig,
	Jason Gunthorpe, Jason Gunthorpe, Leon Romanovsky, Matthew Brost,
	John Hubbard, linux-mm, dri-devel, stable

On Mon, 2026-02-09 at 17:34 -0800, Andrew Morton wrote:
> On Mon, 09 Feb 2026 15:47:38 +0100 Thomas Hellström
> <thomas.hellstrom@linux.intel.com> wrote:
> 
> > @Alistair, any chance of an R-B for the below version?
> 
> Yes please.
> 
> > @Andrew, will this go through the -mm tree, or alternatively an ack
> > for merging through drm-xe-fixes?
> 
> Either works.  I'll grab a copy.  If you want to take this via drm
> then I'll drop the mm.git copy once the drm tree's version appears
> in linux-next.
> 
> Acked-by: Andrew Morton <akpm@linux-foundation.org>
> 
> > 

The drm tree's version now appears in linux-next as

a69d1ab971a624c6f112cea61536569d579c3215

Thanks,
Thomas



^ permalink raw reply	[flat|nested] 9+ messages in thread

end of thread, other threads:[~2026-02-12  8:52 UTC | newest]

Thread overview: 9+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2026-02-05 11:10 [PATCH v4] mm: Fix a hmm_range_fault() livelock / starvation problem Thomas Hellström
2026-02-05 11:20 ` Balbir Singh
2026-02-05 12:41   ` Thomas Hellström
2026-02-10  2:47     ` Balbir Singh
2026-02-09 14:47 ` Thomas Hellström
2026-02-10  1:34   ` Andrew Morton
2026-02-12  8:52     ` Thomas Hellström
2026-02-10  2:22   ` Alistair Popple
2026-02-10  2:56 ` Balbir Singh

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox