* [PATCH] mm/hmm: Fix a hmm_range_fault() livelock / starvation problem
@ 2026-01-30 14:45 Thomas Hellström
2026-01-30 18:00 ` Andrew Morton
0 siblings, 1 reply; 30+ messages in thread
From: Thomas Hellström @ 2026-01-30 14:45 UTC (permalink / raw)
To: intel-xe
Cc: Thomas Hellström, Ralph Campbell, Christoph Hellwig,
Jason Gunthorpe, Jason Gunthorpe, Leon Romanovsky, Andrew Morton,
Matthew Brost, linux-mm, stable, dri-devel
If hmm_range_fault() fails a folio_trylock() in do_swap_page() while
trying to acquire the lock of a device-private folio for migration
to ram, the function will spin until it succeeds in grabbing the lock.
However, if the process holding the lock depends on the completion of
a work item that is scheduled on the same CPU as the spinning
hmm_range_fault(), that work item might be starved and we end up in a
livelock / starvation situation which is never resolved.
This can happen, for example, if the process holding the
device-private folio lock is stuck in
migrate_device_unmap()->lru_add_drain_all().
The lru_add_drain_all() function requires a short work item to be run
on all online CPUs before it completes.
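Roughly, that means a per-CPU work item must get CPU time on every CPU
that has pending folio batches; a simplified sketch of the relevant
part of lru_add_drain_all() (locking and the has_work cpumask
declaration are omitted):

	for_each_online_cpu(cpu) {
		struct work_struct *work = &per_cpu(lru_add_drain_work, cpu);

		if (cpu_needs_drain(cpu)) {
			INIT_WORK(work, lru_add_drain_per_cpu);
			queue_work_on(cpu, mm_percpu_wq, work);
			__cpumask_set_cpu(cpu, &has_work);
		}
	}

	/*
	 * If the CPU running the spinning hmm_range_fault() is in
	 * has_work, this flush never finishes.
	 */
	for_each_cpu(cpu, &has_work)
		flush_work(&per_cpu(lru_add_drain_work, cpu));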
A prerequisite for this to happen is:
a) Both zone device and system memory folios are considered in
migrate_device_unmap(), so that there is a reason to call
lru_add_drain_all() for a system memory folio while a
folio lock is held on a zone device folio.
b) The zone device folio has an initial mapcount > 1 which causes
at least one migration PTE entry insertion to be deferred to
try_to_migrate(), which can happen after the call to
lru_add_drain_all().
c) No preemption, or voluntary preemption only.
This all seems pretty unlikely to happen, but it is indeed hit by
the "xe_exec_system_allocator" igt test.
Resolve this using a cond_resched() after each iteration in
hmm_range_fault(). Future code improvements might consider moving
the lru_add_drain_all() call in migrate_device_unmap() out of the
folio-locked region.
Also, hmm_range_fault() can be a very long-running function, so a
cond_resched() at the end of each iteration is justified even in the
absence of an -EBUSY error.
Fixes: d28c2c9a4877 ("mm/hmm: make full use of walk_page_range()")
Cc: Ralph Campbell <rcampbell@nvidia.com>
Cc: Christoph Hellwig <hch@lst.de>
Cc: Jason Gunthorpe <jgg@mellanox.com>
Cc: Jason Gunthorpe <jgg@ziepe.ca>
Cc: Leon Romanovsky <leon@kernel.org>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Matthew Brost <matthew.brost@intel.com>
Cc: linux-mm@kvack.org
Cc: <stable@vger.kernel.org> # v5.5+
Cc: <dri-devel@lists.freedesktop.org>
Signed-off-by: Thomas Hellström <thomas.hellstrom@linux.intel.com>
---
mm/hmm.c | 7 +++++++
1 file changed, 7 insertions(+)
diff --git a/mm/hmm.c b/mm/hmm.c
index 4ec74c18bef6..160c9e4e5a92 100644
--- a/mm/hmm.c
+++ b/mm/hmm.c
@@ -674,6 +674,13 @@ int hmm_range_fault(struct hmm_range *range)
return -EBUSY;
ret = walk_page_range(mm, hmm_vma_walk.last, range->end,
&hmm_walk_ops, &hmm_vma_walk);
+ /*
+ * Conditionally reschedule to let other work items get
+ * a chance to unlock device-private pages whose locks
+ * we're spinning on.
+ */
+ cond_resched();
+
/*
* When -EBUSY is returned the loop restarts with
* hmm_vma_walk.last set to an address that has not been stored
--
2.52.0
^ permalink raw reply [flat|nested] 30+ messages in thread
* Re: [PATCH] mm/hmm: Fix a hmm_range_fault() livelock / starvation problem
2026-01-30 14:45 [PATCH] mm/hmm: Fix a hmm_range_fault() livelock / starvation problem Thomas Hellström
@ 2026-01-30 18:00 ` Andrew Morton
2026-01-30 19:56 ` Thomas Hellström
2026-01-31 3:01 ` John Hubbard
0 siblings, 2 replies; 30+ messages in thread
From: Andrew Morton @ 2026-01-30 18:00 UTC (permalink / raw)
To: Thomas Hellström
Cc: intel-xe, Ralph Campbell, Christoph Hellwig, Jason Gunthorpe,
Jason Gunthorpe, Leon Romanovsky, Matthew Brost, linux-mm,
stable, dri-devel
On Fri, 30 Jan 2026 15:45:29 +0100 Thomas Hellström <thomas.hellstrom@linux.intel.com> wrote:
> If hmm_range_fault() fails a folio_trylock() in do_swap_page,
> trying to acquire the lock of a device-private folio for migration,
> to ram, the function will spin until it succeeds grabbing the lock.
>
> However, if the process holding the lock is depending on a work
> item to be completed, which is scheduled on the same CPU as the
> spinning hmm_range_fault(), that work item might be starved and
> we end up in a livelock / starvation situation which is never
> resolved.
>
> This can happen, for example if the process holding the
> device-private folio lock is stuck in
> migrate_device_unmap()->lru_add_drain_all()
> The lru_add_drain_all() function requires a short work-item
> to be run on all online cpus to complete.
This is pretty bad behavior from lru_add_drain_all().
> A prerequisite for this to happen is:
> a) Both zone device and system memory folios are considered in
> migrate_device_unmap(), so that there is a reason to call
> lru_add_drain_all() for a system memory folio while a
> folio lock is held on a zone device folio.
> b) The zone device folio has an initial mapcount > 1 which causes
> at least one migration PTE entry insertion to be deferred to
> try_to_migrate(), which can happen after the call to
> lru_add_drain_all().
> c) No or voluntary only preemption.
>
> This all seems pretty unlikely to happen, but indeed is hit by
> the "xe_exec_system_allocator" igt test.
>
> Resolve this using a cond_resched() after each iteration in
> hmm_range_fault(). Future code improvements might consider moving
> the lru_add_drain_all() call in migrate_device_unmap() out of the
> folio locked region.
>
> Also, hmm_range_fault() can be a very long-running function
> so a cond_resched() at the end of each iteration can be
> motivated even in the absence of an -EBUSY.
>
> Fixes: d28c2c9a4877 ("mm/hmm: make full use of walk_page_range()")
Six years ago.
> --- a/mm/hmm.c
> +++ b/mm/hmm.c
> @@ -674,6 +674,13 @@ int hmm_range_fault(struct hmm_range *range)
> return -EBUSY;
> ret = walk_page_range(mm, hmm_vma_walk.last, range->end,
> &hmm_walk_ops, &hmm_vma_walk);
> + /*
> + * Conditionally reschedule to let other work items get
> + * a chance to unlock device-private pages whose locks
> + * we're spinning on.
> + */
> + cond_resched();
> +
> /*
> * When -EBUSY is returned the loop restarts with
> * hmm_vma_walk.last set to an address that has not been stored
If the process which is running hmm_range_fault() has
SCHED_FIFO/SCHED_RR then cond_resched() doesn't work. An explicit
msleep() would be better?
^ permalink raw reply [flat|nested] 30+ messages in thread
* Re: [PATCH] mm/hmm: Fix a hmm_range_fault() livelock / starvation problem
2026-01-30 18:00 ` Andrew Morton
@ 2026-01-30 19:56 ` Thomas Hellström
2026-01-30 20:38 ` Andrew Morton
2026-01-31 3:01 ` John Hubbard
1 sibling, 1 reply; 30+ messages in thread
From: Thomas Hellström @ 2026-01-30 19:56 UTC (permalink / raw)
To: Andrew Morton
Cc: intel-xe, Ralph Campbell, Christoph Hellwig, Jason Gunthorpe,
Jason Gunthorpe, Leon Romanovsky, Matthew Brost, linux-mm,
stable, dri-devel
On Fri, 2026-01-30 at 10:00 -0800, Andrew Morton wrote:
> On Fri, 30 Jan 2026 15:45:29 +0100 Thomas Hellström
> <thomas.hellstrom@linux.intel.com> wrote:
>
> > If hmm_range_fault() fails a folio_trylock() in do_swap_page,
> > trying to acquire the lock of a device-private folio for migration,
> > to ram, the function will spin until it succeeds grabbing the lock.
> >
> > However, if the process holding the lock is depending on a work
> > item to be completed, which is scheduled on the same CPU as the
> > spinning hmm_range_fault(), that work item might be starved and
> > we end up in a livelock / starvation situation which is never
> > resolved.
> >
> > This can happen, for example if the process holding the
> > device-private folio lock is stuck in
> > migrate_device_unmap()->lru_add_drain_all()
> > The lru_add_drain_all() function requires a short work-item
> > to be run on all online cpus to complete.
>
> This is pretty bad behavior from lru_add_drain_all().
>
> > A prerequisite for this to happen is:
> > a) Both zone device and system memory folios are considered in
> > migrate_device_unmap(), so that there is a reason to call
> > lru_add_drain_all() for a system memory folio while a
> > folio lock is held on a zone device folio.
> > b) The zone device folio has an initial mapcount > 1 which causes
> > at least one migration PTE entry insertion to be deferred to
> > try_to_migrate(), which can happen after the call to
> > lru_add_drain_all().
> > c) No or voluntary only preemption.
> >
> > This all seems pretty unlikely to happen, but indeed is hit by
> > the "xe_exec_system_allocator" igt test.
> >
> > Resolve this using a cond_resched() after each iteration in
> > hmm_range_fault(). Future code improvements might consider moving
> > the lru_add_drain_all() call in migrate_device_unmap() out of the
> > folio locked region.
> >
> > Also, hmm_range_fault() can be a very long-running function
> > so a cond_resched() at the end of each iteration can be
> > motivated even in the absence of an -EBUSY.
> >
> > Fixes: d28c2c9a4877 ("mm/hmm: make full use of walk_page_range()")
>
> Six years ago.
Yeah, although it's unlikely to have been hit before now; our
multi-device migration code might be the first instance where all
those prerequisites are fulfilled.
>
> > --- a/mm/hmm.c
> > +++ b/mm/hmm.c
> > @@ -674,6 +674,13 @@ int hmm_range_fault(struct hmm_range *range)
> > return -EBUSY;
> > ret = walk_page_range(mm, hmm_vma_walk.last,
> > range->end,
> > &hmm_walk_ops,
> > &hmm_vma_walk);
> > + /*
> > + * Conditionally reschedule to let other work
> > items get
> > + * a chance to unlock device-private pages whose
> > locks
> > + * we're spinning on.
> > + */
> > + cond_resched();
> > +
> > /*
> > * When -EBUSY is returned the loop restarts with
> > * hmm_vma_walk.last set to an address that has
> > not been stored
>
> If the process which is running hmm_range_fault() has
> SCHED_FIFO/SHCED_RR then cond_resched() doesn't work. An explicit
> msleep() would be better?
Unfortunately hmm_range_fault() is typically called from a GPU
pagefault handler, and it's crucial to get the GPU up and running
again as fast as possible.
Is there a way we could test for the cases where cond_resched() doesn't
work and in that case instead call sched_yield(), at least on -EBUSY
errors?
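Something along these lines is roughly what I have in mind; rt_task()
as the check and usleep_range() as the fallback are just placeholders,
not a tested proposal:

		ret = walk_page_range(mm, hmm_vma_walk.last, range->end,
				      &hmm_walk_ops, &hmm_vma_walk);
		if (ret == -EBUSY) {
			/* placeholder: tasks that cond_resched() won't help */
			if (rt_task(current))
				usleep_range(10, 20);
			else
				cond_resched();
		}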
Thanks,
Thomas
^ permalink raw reply [flat|nested] 30+ messages in thread
* Re: [PATCH] mm/hmm: Fix a hmm_range_fault() livelock / starvation problem
2026-01-30 19:56 ` Thomas Hellström
@ 2026-01-30 20:38 ` Andrew Morton
2026-01-30 21:01 ` Matthew Brost
0 siblings, 1 reply; 30+ messages in thread
From: Andrew Morton @ 2026-01-30 20:38 UTC (permalink / raw)
To: Thomas Hellström
Cc: intel-xe, Ralph Campbell, Christoph Hellwig, Jason Gunthorpe,
Jason Gunthorpe, Leon Romanovsky, Matthew Brost, linux-mm,
stable, dri-devel
On Fri, 30 Jan 2026 20:56:31 +0100 Thomas Hellström <thomas.hellstrom@linux.intel.com> wrote:
> >
> > > --- a/mm/hmm.c
> > > +++ b/mm/hmm.c
> > > @@ -674,6 +674,13 @@ int hmm_range_fault(struct hmm_range *range)
> > > return -EBUSY;
> > > ret = walk_page_range(mm, hmm_vma_walk.last,
> > > range->end,
> > > &hmm_walk_ops,
> > > &hmm_vma_walk);
> > > + /*
> > > + * Conditionally reschedule to let other work
> > > items get
> > > + * a chance to unlock device-private pages whose
> > > locks
> > > + * we're spinning on.
> > > + */
> > > + cond_resched();
> > > +
> > > /*
> > > * When -EBUSY is returned the loop restarts with
> > > * hmm_vma_walk.last set to an address that has
> > > not been stored
> >
> > If the process which is running hmm_range_fault() has
> > SCHED_FIFO/SHCED_RR then cond_resched() doesn't work. An explicit
> > msleep() would be better?
>
> Unfortunately hmm_range_fault() is typically called from a gpu
> pagefault handler and it's crucial to get the gpu up and running again
> as fast as possible.
Would a millisecond matter? Regular old preemption will often cause
longer delays.
> Is there a way we could test for the cases where cond_resched() doesn't
> work and in that case instead call sched_yield(), at least on -EBUSY
> errors?
kernel-internal sched_yield() was taken away years ago and I don't
think there's a replacement, particularly one which will cause a
realtime-policy task to yield to a non-rt-policy one.
It's common for kernel code to forget that it could have realtime
policy - we probably have potential lockups in various places.
I suggest you rerun your testcase with this patch using `chrt -r', see
if my speculation is correct.
^ permalink raw reply [flat|nested] 30+ messages in thread
* Re: [PATCH] mm/hmm: Fix a hmm_range_fault() livelock / starvation problem
2026-01-30 20:38 ` Andrew Morton
@ 2026-01-30 21:01 ` Matthew Brost
2026-01-30 21:08 ` Andrew Morton
0 siblings, 1 reply; 30+ messages in thread
From: Matthew Brost @ 2026-01-30 21:01 UTC (permalink / raw)
To: Andrew Morton
Cc: Thomas Hellström, intel-xe, Ralph Campbell,
Christoph Hellwig, Jason Gunthorpe, Jason Gunthorpe,
Leon Romanovsky, linux-mm, stable, dri-devel
On Fri, Jan 30, 2026 at 12:38:10PM -0800, Andrew Morton wrote:
> On Fri, 30 Jan 2026 20:56:31 +0100 Thomas Hellström <thomas.hellstrom@linux.intel.com> wrote:
>
> > >
> > > > --- a/mm/hmm.c
> > > > +++ b/mm/hmm.c
> > > > @@ -674,6 +674,13 @@ int hmm_range_fault(struct hmm_range *range)
> > > > return -EBUSY;
> > > > ret = walk_page_range(mm, hmm_vma_walk.last,
> > > > range->end,
> > > > &hmm_walk_ops,
> > > > &hmm_vma_walk);
> > > > + /*
> > > > + * Conditionally reschedule to let other work
> > > > items get
> > > > + * a chance to unlock device-private pages whose
> > > > locks
> > > > + * we're spinning on.
> > > > + */
> > > > + cond_resched();
> > > > +
> > > > /*
> > > > * When -EBUSY is returned the loop restarts with
> > > > * hmm_vma_walk.last set to an address that has
> > > > not been stored
> > >
> > > If the process which is running hmm_range_fault() has
> > > SCHED_FIFO/SHCED_RR then cond_resched() doesn't work. An explicit
> > > msleep() would be better?
> >
> > Unfortunately hmm_range_fault() is typically called from a gpu
> > pagefault handler and it's crucial to get the gpu up and running again
> > as fast as possible.
>
> Would a millisecond matter? Regular old preemption will often cause
> longer delays.
>
I think a millisecond is too high. We are aiming to have GPU page
faults serviced in 10-15us of CPU time (GPU copy time varies based on
the size of the fault / copy bus speed, but is still at most 200us).
Matt
> > Is there a way we could test for the cases where cond_resched() doesn't
> > work and in that case instead call sched_yield(), at least on -EBUSY
> > errors?
>
> kernel-internal sched_yield() was taken away years ago and I don't
> think there's a replacement, particularly one which will cause a
> realtime-policy task to yield to a non-rt-policy one.
>
> It's common for kernel code to forget that it could have realtime
> policy - we probably have potential lockups in various places.
>
> I suggest you rerun your testcase with this patch using `chrt -r', see
> if my speculation is correct.
^ permalink raw reply [flat|nested] 30+ messages in thread
* Re: [PATCH] mm/hmm: Fix a hmm_range_fault() livelock / starvation problem
2026-01-30 21:01 ` Matthew Brost
@ 2026-01-30 21:08 ` Andrew Morton
2026-01-31 0:59 ` Matthew Brost
0 siblings, 1 reply; 30+ messages in thread
From: Andrew Morton @ 2026-01-30 21:08 UTC (permalink / raw)
To: Matthew Brost
Cc: Thomas Hellström, intel-xe, Ralph Campbell,
Christoph Hellwig, Jason Gunthorpe, Jason Gunthorpe,
Leon Romanovsky, linux-mm, stable, dri-devel
On Fri, 30 Jan 2026 13:01:24 -0800 Matthew Brost <matthew.brost@intel.com> wrote:
> > > Unfortunately hmm_range_fault() is typically called from a gpu
> > > pagefault handler and it's crucial to get the gpu up and running again
> > > as fast as possible.
> >
> > Would a millisecond matter? Regular old preemption will often cause
> > longer delays.
> >
>
> I think millisecond is too high. We are aiming to GPU page faults
> serviced in 10-15us of CPU time (GPU copy time varies based on size of
> fault / copy bus speed but still at most 200us).
But it's a rare case?
Am I incorrect in believing that getting preempted will cause latencies
much larger than this?
> Matt
>
> > > Is there a way we could test for the cases where cond_resched() doesn't
> > > work and in that case instead call sched_yield(), at least on -EBUSY
> > > errors?
> >
> > kernel-internal sched_yield() was taken away years ago and I don't
> > think there's a replacement, particularly one which will cause a
> > realtime-policy task to yield to a non-rt-policy one.
> >
> > It's common for kernel code to forget that it could have realtime
> > policy - we probably have potential lockups in various places.
> >
> > I suggest you rerun your testcase with this patch using `chrt -r', see
> > if my speculation is correct.
Please?
^ permalink raw reply [flat|nested] 30+ messages in thread
* Re: [PATCH] mm/hmm: Fix a hmm_range_fault() livelock / starvation problem
2026-01-30 21:08 ` Andrew Morton
@ 2026-01-31 0:59 ` Matthew Brost
0 siblings, 0 replies; 30+ messages in thread
From: Matthew Brost @ 2026-01-31 0:59 UTC (permalink / raw)
To: Andrew Morton
Cc: Thomas Hellström, intel-xe, Ralph Campbell,
Christoph Hellwig, Jason Gunthorpe, Jason Gunthorpe,
Leon Romanovsky, linux-mm, stable, dri-devel
On Fri, Jan 30, 2026 at 01:08:35PM -0800, Andrew Morton wrote:
> On Fri, 30 Jan 2026 13:01:24 -0800 Matthew Brost <matthew.brost@intel.com> wrote:
>
> > > > Unfortunately hmm_range_fault() is typically called from a gpu
> > > > pagefault handler and it's crucial to get the gpu up and running again
> > > > as fast as possible.
> > >
> > > Would a millisecond matter? Regular old preemption will often cause
> > > longer delays.
> > >
> >
> > I think millisecond is too high. We are aiming to GPU page faults
> > serviced in 10-15us of CPU time (GPU copy time varies based on size of
> > fault / copy bus speed but still at most 200us).
>
> But it's a rare case?
>
Not that rare. I believe this code path — where hmm_range_fault returns
-EBUSY — can be triggered any time HMM_PFN_REQ_FAULT is set and a page
needs to be faulted in.
We don't set HMM_PFN_REQ_FAULT in our GPU fault handler unless
migrations are racing, which should indeed be rare with well behaved
user space. But there are other cases, such as userptr binds, that do
set HMM_PFN_REQ_FAULT, where it's somewhat expected to fault in a bunch
of CPU pages. Doing an msleep probably isn’t a great idea in core code
that a bunch of drivers call, unless this is truly the last resort.
> Am I incorrect in believing that getting preempted will cause latencies
> much larger than this?
>
I'm not really sure — I'm not a scheduling expert — but from my research
I think preemption is still less than 1ms, and in cases where you don't
preempt, cond_resched() is basically free.
> > Matt
> >
> > > > Is there a way we could test for the cases where cond_resched() doesn't
> > > > work and in that case instead call sched_yield(), at least on -EBUSY
> > > > errors?
> > >
> > > kernel-internal sched_yield() was taken away years ago and I don't
> > > think there's a replacement, particularly one which will cause a
> > > realtime-policy task to yield to a non-rt-policy one.
> > >
> > > It's common for kernel code to forget that it could have realtime
> > > policy - we probably have potential lockups in various places.
> > >
> > > I suggest you rerun your testcase with this patch using `chrt -r', see
> > > if my speculation is correct.
>
> Please?
Thomas is in Europe, so he’s already done for the day. But I tested this
fix and verified that it resolves the hang we were seeing.
I also did at least 10 runs with chrt -r 1, chrt -r 50, and chrt -r 99.
I couldn’t get it to hang — previously I could reproduce the hang in at
most 2 runs.
In this test case all threads are on work queues, which I believe can
bypass real-time scheduling policies, so that likely explains the chrt
result. I think we’d have to craft a test case that triggers an
hmm_range_fault from a user space call (userptr binds would do this) and
race it with a migration to catch any RT bugs.
Matt
^ permalink raw reply [flat|nested] 30+ messages in thread
* Re: [PATCH] mm/hmm: Fix a hmm_range_fault() livelock / starvation problem
2026-01-30 18:00 ` Andrew Morton
2026-01-30 19:56 ` Thomas Hellström
@ 2026-01-31 3:01 ` John Hubbard
2026-01-31 12:57 ` Thomas Hellström
1 sibling, 1 reply; 30+ messages in thread
From: John Hubbard @ 2026-01-31 3:01 UTC (permalink / raw)
To: Andrew Morton, Thomas Hellström
Cc: intel-xe, Ralph Campbell, Christoph Hellwig, Jason Gunthorpe,
Jason Gunthorpe, Leon Romanovsky, Matthew Brost, linux-mm,
stable, dri-devel
On 1/30/26 10:00 AM, Andrew Morton wrote:
> On Fri, 30 Jan 2026 15:45:29 +0100 Thomas Hellström <thomas.hellstrom@linux.intel.com> wrote:
...
>> This can happen, for example if the process holding the
>> device-private folio lock is stuck in
>> migrate_device_unmap()->lru_add_drain_all()
>> The lru_add_drain_all() function requires a short work-item
>> to be run on all online cpus to complete.
>
> This is pretty bad behavior from lru_add_drain_all().
Yes. And also, by code inspection, it seems like other folio_batch
items (I was going to say pagevecs, heh) can leak in after calling
lru_add_drain_all(), making things even worse.
Maybe we really should be calling lru_cache_disable/enable()
pairs for migration, even though it looks heavier weight.
This diff would address both points, and maybe fix Matthew's issue,
although I haven't done much real testing on it other than a quick
run of run_vmtests.sh:
diff --git a/mm/migrate_device.c b/mm/migrate_device.c
index 23379663b1e1..3c55a766dd33 100644
--- a/mm/migrate_device.c
+++ b/mm/migrate_device.c
@@ -570,7 +570,6 @@ static unsigned long migrate_device_unmap(unsigned long *src_pfns,
struct folio *fault_folio = fault_page ?
page_folio(fault_page) : NULL;
unsigned long i, restore = 0;
- bool allow_drain = true;
unsigned long unmapped = 0;
lru_add_drain();
@@ -595,12 +594,6 @@ static unsigned long migrate_device_unmap(unsigned long *src_pfns,
/* ZONE_DEVICE folios are not on LRU */
if (!folio_is_zone_device(folio)) {
- if (!folio_test_lru(folio) && allow_drain) {
- /* Drain CPU's lru cache */
- lru_add_drain_all();
- allow_drain = false;
- }
-
if (!folio_isolate_lru(folio)) {
src_pfns[i] &= ~MIGRATE_PFN_MIGRATE;
restore++;
@@ -759,11 +752,15 @@ int migrate_vma_setup(struct migrate_vma *args)
args->cpages = 0;
args->npages = 0;
+ lru_cache_disable();
+
migrate_vma_collect(args);
if (args->cpages)
migrate_vma_unmap(args);
+ lru_cache_enable();
+
/*
* At this point pages are locked and unmapped, and thus they have
* stable content and can safely be copied to destination memory that
@@ -1395,6 +1392,8 @@ int migrate_device_range(unsigned long *src_pfns, unsigned long start,
{
unsigned long i, j, pfn;
+ lru_cache_disable();
+
for (pfn = start, i = 0; i < npages; pfn++, i++) {
struct page *page = pfn_to_page(pfn);
struct folio *folio = page_folio(page);
@@ -1413,6 +1412,8 @@ int migrate_device_range(unsigned long *src_pfns, unsigned long start,
migrate_device_unmap(src_pfns, npages, NULL);
+ lru_cache_enable();
+
return 0;
}
EXPORT_SYMBOL(migrate_device_range);
@@ -1429,6 +1430,8 @@ int migrate_device_pfns(unsigned long *src_pfns, unsigned long npages)
{
unsigned long i, j;
+ lru_cache_disable();
+
for (i = 0; i < npages; i++) {
struct page *page = pfn_to_page(src_pfns[i]);
struct folio *folio = page_folio(page);
@@ -1446,6 +1449,8 @@ int migrate_device_pfns(unsigned long *src_pfns, unsigned long npages)
migrate_device_unmap(src_pfns, npages, NULL);
+ lru_cache_enable();
+
return 0;
}
EXPORT_SYMBOL(migrate_device_pfns);
thanks,
--
John Hubbard
^ permalink raw reply [flat|nested] 30+ messages in thread
* Re: [PATCH] mm/hmm: Fix a hmm_range_fault() livelock / starvation problem
2026-01-31 3:01 ` John Hubbard
@ 2026-01-31 12:57 ` Thomas Hellström
2026-01-31 19:00 ` Matthew Brost
0 siblings, 1 reply; 30+ messages in thread
From: Thomas Hellström @ 2026-01-31 12:57 UTC (permalink / raw)
To: John Hubbard, Andrew Morton
Cc: intel-xe, Ralph Campbell, Christoph Hellwig, Jason Gunthorpe,
Jason Gunthorpe, Leon Romanovsky, Matthew Brost, linux-mm,
stable, dri-devel
On Fri, 2026-01-30 at 19:01 -0800, John Hubbard wrote:
> On 1/30/26 10:00 AM, Andrew Morton wrote:
> > On Fri, 30 Jan 2026 15:45:29 +0100 Thomas Hellström
> > <thomas.hellstrom@linux.intel.com> wrote:
> ...
> > > This can happen, for example if the process holding the
> > > device-private folio lock is stuck in
> > > migrate_device_unmap()->lru_add_drain_all()
> > > The lru_add_drain_all() function requires a short work-item
> > > to be run on all online cpus to complete.
> >
> > This is pretty bad behavior from lru_add_drain_all().
>
> Yes. And also, by code inspection, it seems like other folio_batch
> items (I was going to say pagevecs, heh) can leak in after calling
> lru_add_drain_all(), making things even worse.
>
> Maybe we really should be calling lru_cache_disable/enable()
> pairs for migration, even though it looks heavier weight.
>
> This diff would address both points, and maybe fix Matthew's issue,
> although I haven't done much real testing on it other than a quick
> run of run_vmtests.sh:
It looks like lru_cache_disable() is using synchronize_rcu_expedited(),
which would be a huge performance killer?
From the migrate code it looks like it's calling lru_add_drain_all()
once only, because migration is still best effort, so it's accepting
failures if someone adds pages to the per-cpu lru_add structures,
rather than wanting to take the heavy performance loss of
lru_cache_disable().
The problem at hand is also solved if we move the lru_add_drain_all()
out of the page-locked region in migrate_vma_setup(): if we hit a
system folio that is not on the LRU, we'd unlock all folios, call
lru_add_drain_all() and retry from the start, roughly as sketched
below.
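Completely untested, and unlock_and_restore_folios() is a made-up name
for the existing unlock / restore path, but something like this in
migrate_device_unmap():

		/* ZONE_DEVICE folios are not on LRU */
		if (!folio_is_zone_device(folio)) {
			if (!folio_test_lru(folio) && !drained) {
				/*
				 * Unlock everything collected so far, drain
				 * outside any folio lock, then retry the
				 * whole pass once.
				 */
				unlock_and_restore_folios(src_pfns, i);
				lru_add_drain_all();
				drained = true;
				goto again;
			}

			if (!folio_isolate_lru(folio)) {
				src_pfns[i] &= ~MIGRATE_PFN_MIGRATE;
				restore++;
				continue;
			}
		}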
But the root cause, even though lru_add_drain_all() is badly behaved,
is IMHO the trylock spin in hmm_range_fault(). This was introduced
relatively recently to avoid another livelock problem, but there were
other fixes associated with that as well, so it might not be strictly
necessary.
IIRC the original non-trylocking code in do_swap_page() first took a
reference to the folio, released the page-table lock and then performed
a sleeping folio lock. The problem was that if the folio was already
locked for migration, that additional folio refcount would block
migration (which might not be a big problem considering do_swap_page()
might want to migrate to system ram anyway). @Matt Brost what's your
take on this?
I'm also not sure a folio refcount should block migration after the
introduction of pinned (like in pin_user_pages) pages. Rather perhaps a
folio pin-count should block migration and in that case do_swap_page()
can definitely do a sleeping folio lock and the problem is gone.
But it looks like an AR for us to try to check how bad
lru_cache_disable() really is. And perhaps compare with an
unconditional lru_add_drain_all() at migration start.
Does anybody know who would be able to tell whether a page refcount
still should block migration (like today) or whether that could
actually be relaxed to a page pincount?
Thanks,
Thomas
>
> diff --git a/mm/migrate_device.c b/mm/migrate_device.c
> index 23379663b1e1..3c55a766dd33 100644
> --- a/mm/migrate_device.c
> +++ b/mm/migrate_device.c
> @@ -570,7 +570,6 @@ static unsigned long
> migrate_device_unmap(unsigned long *src_pfns,
> struct folio *fault_folio = fault_page ?
> page_folio(fault_page) : NULL;
> unsigned long i, restore = 0;
> - bool allow_drain = true;
> unsigned long unmapped = 0;
>
> lru_add_drain();
> @@ -595,12 +594,6 @@ static unsigned long
> migrate_device_unmap(unsigned long *src_pfns,
>
> /* ZONE_DEVICE folios are not on LRU */
> if (!folio_is_zone_device(folio)) {
> - if (!folio_test_lru(folio) && allow_drain) {
> - /* Drain CPU's lru cache */
> - lru_add_drain_all();
> - allow_drain = false;
> - }
> -
> if (!folio_isolate_lru(folio)) {
> src_pfns[i] &= ~MIGRATE_PFN_MIGRATE;
> restore++;
> @@ -759,11 +752,15 @@ int migrate_vma_setup(struct migrate_vma *args)
> args->cpages = 0;
> args->npages = 0;
>
> + lru_cache_disable();
> +
> migrate_vma_collect(args);
>
> if (args->cpages)
> migrate_vma_unmap(args);
>
> + lru_cache_enable();
> +
> /*
> * At this point pages are locked and unmapped, and thus
> they have
> * stable content and can safely be copied to destination
> memory that
> @@ -1395,6 +1392,8 @@ int migrate_device_range(unsigned long
> *src_pfns, unsigned long start,
> {
> unsigned long i, j, pfn;
>
> + lru_cache_disable();
> +
> for (pfn = start, i = 0; i < npages; pfn++, i++) {
> struct page *page = pfn_to_page(pfn);
> struct folio *folio = page_folio(page);
> @@ -1413,6 +1412,8 @@ int migrate_device_range(unsigned long
> *src_pfns, unsigned long start,
>
> migrate_device_unmap(src_pfns, npages, NULL);
>
> + lru_cache_enable();
> +
> return 0;
> }
> EXPORT_SYMBOL(migrate_device_range);
> @@ -1429,6 +1430,8 @@ int migrate_device_pfns(unsigned long
> *src_pfns, unsigned long npages)
> {
> unsigned long i, j;
>
> + lru_cache_disable();
> +
> for (i = 0; i < npages; i++) {
> struct page *page = pfn_to_page(src_pfns[i]);
> struct folio *folio = page_folio(page);
> @@ -1446,6 +1449,8 @@ int migrate_device_pfns(unsigned long
> *src_pfns, unsigned long npages)
>
> migrate_device_unmap(src_pfns, npages, NULL);
>
> + lru_cache_enable();
> +
> return 0;
> }
> EXPORT_SYMBOL(migrate_device_pfns);
>
>
>
>
> thanks,
^ permalink raw reply [flat|nested] 30+ messages in thread
* Re: [PATCH] mm/hmm: Fix a hmm_range_fault() livelock / starvation problem
2026-01-31 12:57 ` Thomas Hellström
@ 2026-01-31 19:00 ` Matthew Brost
2026-01-31 21:42 ` John Hubbard
0 siblings, 1 reply; 30+ messages in thread
From: Matthew Brost @ 2026-01-31 19:00 UTC (permalink / raw)
To: Thomas Hellström
Cc: John Hubbard, Andrew Morton, intel-xe, Ralph Campbell,
Christoph Hellwig, Jason Gunthorpe, Jason Gunthorpe,
Leon Romanovsky, linux-mm, stable, dri-devel
On Sat, Jan 31, 2026 at 01:57:21PM +0100, Thomas Hellström wrote:
> On Fri, 2026-01-30 at 19:01 -0800, John Hubbard wrote:
> > On 1/30/26 10:00 AM, Andrew Morton wrote:
> > > On Fri, 30 Jan 2026 15:45:29 +0100 Thomas Hellström
> > > <thomas.hellstrom@linux.intel.com> wrote:
> > ...
> > > > This can happen, for example if the process holding the
> > > > device-private folio lock is stuck in
> > > > migrate_device_unmap()->lru_add_drain_all()
> > > > The lru_add_drain_all() function requires a short work-item
> > > > to be run on all online cpus to complete.
> > >
> > > This is pretty bad behavior from lru_add_drain_all().
> >
> > Yes. And also, by code inspection, it seems like other folio_batch
> > items (I was going to say pagevecs, heh) can leak in after calling
> > lru_add_drain_all(), making things even worse.
> >
> > Maybe we really should be calling lru_cache_disable/enable()
> > pairs for migration, even though it looks heavier weight.
> >
> > This diff would address both points, and maybe fix Matthew's issue,
> > although I haven't done much real testing on it other than a quick
> > run of run_vmtests.sh:
>
> It looks like lru_cache_disable() is using synchronize_rcu_expedited(),
> which whould be a huge performance killer?
>
Yep. I’ve done some quick testing with John’s patch, and
xe_exec_system_alloc slows down by what seems like orders of magnitude in
certain sections. I haven’t done a deep dive yet, but the initial results
don’t look good.
I also eventually hit a kernel deadlock. I have the stack trace saved.
> From the migrate code it looks like it's calling lru_add_drain_all()
> once only, because migration is still best effort, so it's accepting
> failures if someone adds pages to the per-cpu lru_add structures,
> rather than wanting to take the heavy performance loss of
> lru_cache_disable().
>
> The problem at hand is also solved if we move the lru_add_drain_all()
> out of the page-locked region in migrate_vma_setup(), like if we hit a
> system folio not on the LRU, we'd unlock all folios, call
> lru_add_drain_all() and retry from start.
>
That seems like something to try. It should actually be pretty easy to
implement as well. It’s good to determine whether a backoff like this is
common, and whether the backoff causes a performance hit or leads to a
large number of retries under the right race conditions.
> But the root cause, even though lru_add_drain_all() is bad-behaving, is
> IMHO the trylock spin in hmm_range_fault(). This is relatively recently
> introduced to avoid another livelock problem, but there were other
> fixes associated with that as well, so might not be strictly necessary.
>
> IIRC he original non-trylocking code in do_swap_page() first took a
Here is the change for reference:
git format-patch -1 1afaeb8293c9a
> reference to the folio, released the page-table lock and then performed
> a sleeping folio lock. Problem was that if the folio was already locked
So the original code never took the page lock.
> for migration, that additional folio refcount would block migration
The additional folio refcount could block migration, so if multiple
threads fault the same page you could spin thousands of times before
one of them actually wins the race and moves the page. Or, if
migrate_to_ram contends on some common mutex or similar structure
(Xe/GPU SVM doesn’t, but AMD and Nouveau do), you could get a stable
livelock.
> (which might not be a big problem considering do_swap_page() might want
> to migrate to system ram anyway). @Matt Brost what's your take on this?
>
The primary reason I used a trylock in do_swap_page is because the
migrate_vma_* functions also use trylocks. It seems reasonable to
simply convert the lock in do_swap_page to a sleeping lock. I believe
that would solve this issue for both non-RT and RT threads. I don’t know
enough about the MM to say whether using a sleeping lock here is
acceptable, though. Perhaps Andrew can provide guidance.
> I'm also not sure a folio refcount should block migration after the
> introduction of pinned (like in pin_user_pages) pages. Rather perhaps a
> folio pin-count should block migration and in that case do_swap_page()
> can definitely do a sleeping folio lock and the problem is gone.
>
I’m not convinced the folio refcount has any bearing if we can take a
sleeping lock in do_swap_page, but perhaps I’m missing something.
> But it looks like an AR for us to try to check how bad
> lru_cache_disable() really is. And perhaps compare with an
> unconditional lru_add_drain_all() at migration start.
>
> Does anybody know who would be able to tell whether a page refcount
> still should block migration (like today) or whether that could
> actually be relaxed to a page pincount?
>
This is a good question. AFAIK this is probably a leftover from the
original device-pages implementation, and it could likely be relaxed.
But I’m not really convinced the folio refcount is relevant to this
discussion (see above).
Matt
> Thanks,
> Thomas
>
> >
> > diff --git a/mm/migrate_device.c b/mm/migrate_device.c
> > index 23379663b1e1..3c55a766dd33 100644
> > --- a/mm/migrate_device.c
> > +++ b/mm/migrate_device.c
> > @@ -570,7 +570,6 @@ static unsigned long
> > migrate_device_unmap(unsigned long *src_pfns,
> > struct folio *fault_folio = fault_page ?
> > page_folio(fault_page) : NULL;
> > unsigned long i, restore = 0;
> > - bool allow_drain = true;
> > unsigned long unmapped = 0;
> >
> > lru_add_drain();
> > @@ -595,12 +594,6 @@ static unsigned long
> > migrate_device_unmap(unsigned long *src_pfns,
> >
> > /* ZONE_DEVICE folios are not on LRU */
> > if (!folio_is_zone_device(folio)) {
> > - if (!folio_test_lru(folio) && allow_drain) {
> > - /* Drain CPU's lru cache */
> > - lru_add_drain_all();
> > - allow_drain = false;
> > - }
> > -
> > if (!folio_isolate_lru(folio)) {
> > src_pfns[i] &= ~MIGRATE_PFN_MIGRATE;
> > restore++;
> > @@ -759,11 +752,15 @@ int migrate_vma_setup(struct migrate_vma *args)
> > args->cpages = 0;
> > args->npages = 0;
> >
> > + lru_cache_disable();
> > +
> > migrate_vma_collect(args);
> >
> > if (args->cpages)
> > migrate_vma_unmap(args);
> >
> > + lru_cache_enable();
> > +
> > /*
> > * At this point pages are locked and unmapped, and thus
> > they have
> > * stable content and can safely be copied to destination
> > memory that
> > @@ -1395,6 +1392,8 @@ int migrate_device_range(unsigned long
> > *src_pfns, unsigned long start,
> > {
> > unsigned long i, j, pfn;
> >
> > + lru_cache_disable();
> > +
> > for (pfn = start, i = 0; i < npages; pfn++, i++) {
> > struct page *page = pfn_to_page(pfn);
> > struct folio *folio = page_folio(page);
> > @@ -1413,6 +1412,8 @@ int migrate_device_range(unsigned long
> > *src_pfns, unsigned long start,
> >
> > migrate_device_unmap(src_pfns, npages, NULL);
> >
> > + lru_cache_enable();
> > +
> > return 0;
> > }
> > EXPORT_SYMBOL(migrate_device_range);
> > @@ -1429,6 +1430,8 @@ int migrate_device_pfns(unsigned long
> > *src_pfns, unsigned long npages)
> > {
> > unsigned long i, j;
> >
> > + lru_cache_disable();
> > +
> > for (i = 0; i < npages; i++) {
> > struct page *page = pfn_to_page(src_pfns[i]);
> > struct folio *folio = page_folio(page);
> > @@ -1446,6 +1449,8 @@ int migrate_device_pfns(unsigned long
> > *src_pfns, unsigned long npages)
> >
> > migrate_device_unmap(src_pfns, npages, NULL);
> >
> > + lru_cache_enable();
> > +
> > return 0;
> > }
> > EXPORT_SYMBOL(migrate_device_pfns);
> >
> >
> >
> >
> > thanks,
^ permalink raw reply [flat|nested] 30+ messages in thread
* Re: [PATCH] mm/hmm: Fix a hmm_range_fault() livelock / starvation problem
2026-01-31 19:00 ` Matthew Brost
@ 2026-01-31 21:42 ` John Hubbard
2026-02-01 19:24 ` Matthew Brost
2026-02-02 9:13 ` Thomas Hellström
0 siblings, 2 replies; 30+ messages in thread
From: John Hubbard @ 2026-01-31 21:42 UTC (permalink / raw)
To: Matthew Brost, Thomas Hellström
Cc: Andrew Morton, intel-xe, Ralph Campbell, Christoph Hellwig,
Jason Gunthorpe, Jason Gunthorpe, Leon Romanovsky, linux-mm,
stable, dri-devel
On 1/31/26 11:00 AM, Matthew Brost wrote:
> On Sat, Jan 31, 2026 at 01:57:21PM +0100, Thomas Hellström wrote:
>> On Fri, 2026-01-30 at 19:01 -0800, John Hubbard wrote:
>>> On 1/30/26 10:00 AM, Andrew Morton wrote:
>>>> On Fri, 30 Jan 2026 15:45:29 +0100 Thomas Hellström
>>>> <thomas.hellstrom@linux.intel.com> wrote:
>>> ...
>> It looks like lru_cache_disable() is using synchronize_rcu_expedited(),
>> which whould be a huge performance killer?
>>
>
> Yep. I’ve done some quick testing with John’s patch, and
> xe_exec_system_alloc slows down by what seems like orders of magnitude in
ouchie!
> certain sections. I haven’t done a deep dive yet, but the initial results
> don’t look good.
>
> I also eventually hit a kernel deadlock. I have the stack trace saved.
>
>> From the migrate code it looks like it's calling lru_add_drain_all()
>> once only, because migration is still best effort, so it's accepting
>> failures if someone adds pages to the per-cpu lru_add structures,
>> rather than wanting to take the heavy performance loss of
>> lru_cache_disable().
Yes, I'm clearly far too biased right now towards "make migration
succeed more often" (some notes below). lru_cache_disable() is sounding
awfully severe in terms of perf loss.
>>
>> The problem at hand is also solved if we move the lru_add_drain_all()
>> out of the page-locked region in migrate_vma_setup(), like if we hit a
>> system folio not on the LRU, we'd unlock all folios, call
>> lru_add_drain_all() and retry from start.
>>
>
> That seems like something to try. It should actually be pretty easy to
> implement as well. It’s good to determine whether a backoff like this is
This does seem like a less drastic fix, and it keeps the same design.
> common, and whether the backoff causes a performance hit or leads to a
> large number of retries under the right race conditions.
>
>> But the root cause, even though lru_add_drain_all() is bad-behaving, is
>> IMHO the trylock spin in hmm_range_fault(). This is relatively recently
>> introduced to avoid another livelock problem, but there were other
>> fixes associated with that as well, so might not be strictly necessary.
>>
>> IIRC he original non-trylocking code in do_swap_page() first took a
>
> Here is change for reference:
>
> git format-patch -1 1afaeb8293c9a
>
>> reference to the folio, released the page-table lock and then performed
>> a sleeping folio lock. Problem was that if the folio was already locked
>
> So original code never had page lock.
>
>> for migration, that additional folio refcount would block migration
>
> The additional folio refcount could block migration, so if multiple
> threads fault the same page you could spin thousands of times before
> one of them actually wins the race and moves the page. Or, if
> migrate_to_ram contends on some common mutex or similar structure
> (Xe/GPU SVM doesn’t, but AMD and Nouveau do), you could get a stable
> livelock.
>
>> (which might not be a big problem considering do_swap_page() might want
>> to migrate to system ram anyway). @Matt Brost what's your take on this?
>>
>
> The primary reason I used a trylock in do_swap_page is because the
> migrate_vma_* functions also use trylocks. It seems reasonable to
Those are trylocks because it is collecting multiple pages/folios, so in
order to avoid deadlocks (very easy to hit with that pattern), it goes
with trylock.
> simply convert the lock in do_swap_page to a sleeping lock. I believe
> that would solve this issue for both non-RT and RT threads. I don’t know
> enough about the MM to say whether using a sleeping lock here is
> acceptable, though. Perhaps Andrew can provide guidance.
This might actually be possible.
>
>> I'm also not sure a folio refcount should block migration after the
>> introduction of pinned (like in pin_user_pages) pages. Rather perhaps a
>> folio pin-count should block migration and in that case do_swap_page()
>> can definitely do a sleeping folio lock and the problem is gone.
A problem for that specific point is that pincount and refcount both
mean, "the page is pinned" (which in turn literally means "not allowed
to migrate/move").
(In fact, pincount is implemented in terms of refcount, in most
configurations still.)
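For small folios there isn't even a separate field; roughly, what
"pinned" means in practice (illustration only, not a proposed change):

	/*
	 * pin_user_pages() biases the refcount by GUP_PIN_COUNTING_BIAS
	 * for small folios, so "is it pinned" is itself a (heuristic)
	 * question about the refcount; only large folios carry a separate
	 * _pincount.
	 */
	if (folio_maybe_dma_pinned(folio))
		return -EBUSY;	/* treat as "do not migrate" */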
>>
>
> I’m not convinced the folio refcount has any bearing if we can take a
> sleeping lock in do_swap_page, but perhaps I’m missing something.
So far, I am not able to find a problem with your proposal. So,
something like this I believe could actually work:
diff --git a/mm/memory.c b/mm/memory.c
index da360a6eb8a4..af73430e7888 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -4652,6 +4652,8 @@ vm_fault_t do_swap_page(struct vm_fault *vmf)
vmf->page = softleaf_to_page(entry);
ret = remove_device_exclusive_entry(vmf);
} else if (softleaf_is_device_private(entry)) {
+ struct dev_pagemap *pgmap;
+
if (vmf->flags & FAULT_FLAG_VMA_LOCK) {
/*
* migrate_to_ram is not yet ready to operate
@@ -4674,18 +4676,13 @@ vm_fault_t do_swap_page(struct vm_fault *vmf)
* Get a page reference while we know the page can't be
* freed.
*/
- if (trylock_page(vmf->page)) {
- struct dev_pagemap *pgmap;
-
- get_page(vmf->page);
- pte_unmap_unlock(vmf->pte, vmf->ptl);
- pgmap = page_pgmap(vmf->page);
- ret = pgmap->ops->migrate_to_ram(vmf);
- unlock_page(vmf->page);
- put_page(vmf->page);
- } else {
- pte_unmap_unlock(vmf->pte, vmf->ptl);
- }
+ get_page(vmf->page);
+ pte_unmap_unlock(vmf->pte, vmf->ptl);
+ lock_page(vmf->page);
+ pgmap = page_pgmap(vmf->page);
+ ret = pgmap->ops->migrate_to_ram(vmf);
+ unlock_page(vmf->page);
+ put_page(vmf->page);
} else if (softleaf_is_hwpoison(entry)) {
ret = VM_FAULT_HWPOISON;
} else if (softleaf_is_marker(entry)) {
>
>> But it looks like an AR for us to try to check how bad
>> lru_cache_disable() really is. And perhaps compare with an
>> unconditional lru_add_drain_all() at migration start.
>>
>> Does anybody know who would be able to tell whether a page refcount
>> still should block migration (like today) or whether that could
>> actually be relaxed to a page pincount?
Yes, it really should block migration, see my response above: both
pincount and refcount literally mean "do not move this page".
As an aside because it might help at some point, I'm just now testing a
tiny patchset that uses:
wait_var_event_killable(&folio->_refcount,
folio_ref_count(folio) <= expected)
during migration, paired with:
wake_up_var(&folio->_refcount) in put_page().
This waits for the expected refcount, instead of doing a blind, tight
retry loop during migration attempts. This lets migration succeed even
when waiting a long time for another caller to release a refcount.
It works well, but of course, it also has a potentially serious
performance cost (which I need to quantify), because it adds cycles to
the put_page() hot path. Which is why I haven't posted it yet, even as
an RFC. It's still in the "is this even reasonable" stage, just food
for thought here.
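Roughly, the pairing I mean (call sites and error handling hand-waved):

	/* migration side: wait instead of a bounded trylock/retry loop */
	if (wait_var_event_killable(&folio->_refcount,
				    folio_ref_count(folio) <= expected))
		return -EINTR;	/* killed while waiting */

	/* release side, e.g. in folio_put() / put_page(): */
	if (folio_put_testzero(folio))
		__folio_put(folio);
	else
		wake_up_var(&folio->_refcount);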
thanks,
--
John Hubbard
^ permalink raw reply [flat|nested] 30+ messages in thread
* Re: [PATCH] mm/hmm: Fix a hmm_range_fault() livelock / starvation problem
2026-01-31 21:42 ` John Hubbard
@ 2026-02-01 19:24 ` Matthew Brost
2026-02-01 20:48 ` John Hubbard
2026-02-02 9:13 ` Thomas Hellström
1 sibling, 1 reply; 30+ messages in thread
From: Matthew Brost @ 2026-02-01 19:24 UTC (permalink / raw)
To: John Hubbard
Cc: Thomas Hellström, Andrew Morton, intel-xe, Ralph Campbell,
Christoph Hellwig, Jason Gunthorpe, Jason Gunthorpe,
Leon Romanovsky, linux-mm, stable, dri-devel
On Sat, Jan 31, 2026 at 01:42:20PM -0800, John Hubbard wrote:
> On 1/31/26 11:00 AM, Matthew Brost wrote:
> > On Sat, Jan 31, 2026 at 01:57:21PM +0100, Thomas Hellström wrote:
> > > On Fri, 2026-01-30 at 19:01 -0800, John Hubbard wrote:
> > > > On 1/30/26 10:00 AM, Andrew Morton wrote:
> > > > > On Fri, 30 Jan 2026 15:45:29 +0100 Thomas Hellström
> > > > > <thomas.hellstrom@linux.intel.com> wrote:
> > > > ...
> > > It looks like lru_cache_disable() is using synchronize_rcu_expedited(),
> > > which whould be a huge performance killer?
> > >
> >
> > Yep. I’ve done some quick testing with John’s patch, and
> > xe_exec_system_alloc slows down by what seems like orders of magnitude in
>
> ouchie!
>
> > certain sections. I haven’t done a deep dive yet, but the initial results
> > don’t look good.
> >
> > I also eventually hit a kernel deadlock. I have the stack trace saved.
> >
> > > From the migrate code it looks like it's calling lru_add_drain_all()
> > > once only, because migration is still best effort, so it's accepting
> > > failures if someone adds pages to the per-cpu lru_add structures,
> > > rather than wanting to take the heavy performance loss of
> > > lru_cache_disable().
>
> Yes, I'm clearly far too biased right now towards "make migration
> succeed more often" (some notes below). lru_cache_disable() is sounding
> awfully severe in terms of perf loss.
>
> > >
> > > The problem at hand is also solved if we move the lru_add_drain_all()
> > > out of the page-locked region in migrate_vma_setup(), like if we hit a
> > > system folio not on the LRU, we'd unlock all folios, call
> > > lru_add_drain_all() and retry from start.
> > >
> >
> > That seems like something to try. It should actually be pretty easy to
> > implement as well. It’s good to determine whether a backoff like this is
>
> This does seem like a less drastic fix, and it keeps the same design.
>
Perhaps Thomas and I can look at this option during the work week.
> > common, and whether the backoff causes a performance hit or leads to a
> > large number of retries under the right race conditions.
> >
> > > But the root cause, even though lru_add_drain_all() is bad-behaving, is
> > > IMHO the trylock spin in hmm_range_fault(). This is relatively recently
> > > introduced to avoid another livelock problem, but there were other
> > > fixes associated with that as well, so might not be strictly necessary.
> > >
> > > IIRC he original non-trylocking code in do_swap_page() first took a
> >
> > Here is change for reference:
> >
> > git format-patch -1 1afaeb8293c9a
> >
> > > reference to the folio, released the page-table lock and then performed
> > > a sleeping folio lock. Problem was that if the folio was already locked
> >
> > So original code never had page lock.
> >
> > > for migration, that additional folio refcount would block migration
> >
> > The additional folio refcount could block migration, so if multiple
> > threads fault the same page you could spin thousands of times before
> > one of them actually wins the race and moves the page. Or, if
> > migrate_to_ram contends on some common mutex or similar structure
> > (Xe/GPU SVM doesn’t, but AMD and Nouveau do), you could get a stable
> > livelock.
> >
> > > (which might not be a big problem considering do_swap_page() might want
> > > to migrate to system ram anyway). @Matt Brost what's your take on this?
> > >
> >
> > The primary reason I used a trylock in do_swap_page is because the
> > migrate_vma_* functions also use trylocks. It seems reasonable to
>
> Those are trylocks because it is collecting multiple pages/folios, so in
> order to avoid deadlocks (very easy to hit with that pattern), it goes
> with trylock.
>
> > simply convert the lock in do_swap_page to a sleeping lock. I believe
> > that would solve this issue for both non-RT and RT threads. I don’t know
> > enough about the MM to say whether using a sleeping lock here is
> > acceptable, though. Perhaps Andrew can provide guidance.
>
> This might actually be possible.
>
> >
> > > I'm also not sure a folio refcount should block migration after the
> > > introduction of pinned (like in pin_user_pages) pages. Rather perhaps a
> > > folio pin-count should block migration and in that case do_swap_page()
> > > can definitely do a sleeping folio lock and the problem is gone.
>
> A problem for that specific point is that pincount and refcount both
> mean, "the page is pinned" (which in turn literally means "not allowed
> to migrate/move").
>
> (In fact, pincount is implemented in terms of refcount, in most
> configurations still.)
>
> > >
> >
> > I’m not convinced the folio refcount has any bearing if we can take a
> > sleeping lock in do_swap_page, but perhaps I’m missing something.
>
> So far, I am not able to find a problem with your proposal. So,
> something like this I believe could actually work:
>
I did something slightly more defensive with a refcount protection, but
this seems to work and fixes the issue raised by Thomas, and shows no
noticeable performance difference. If we go this route,
do_huge_pmd_device_private would need to be updated with the same
pattern as well - I don't have large device pages enabled in my current
test branch, so I would have to test that part out too.
diff --git a/mm/memory.c b/mm/memory.c
index da360a6eb8a4..1e7ccc4a1a6c 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -4652,6 +4652,8 @@ vm_fault_t do_swap_page(struct vm_fault *vmf)
vmf->page = softleaf_to_page(entry);
ret = remove_device_exclusive_entry(vmf);
} else if (softleaf_is_device_private(entry)) {
+ struct dev_pagemap *pgmap;
+
if (vmf->flags & FAULT_FLAG_VMA_LOCK) {
/*
* migrate_to_ram is not yet ready to operate
@@ -4670,21 +4672,15 @@ vm_fault_t do_swap_page(struct vm_fault *vmf)
vmf->orig_pte)))
goto unlock;
- /*
- * Get a page reference while we know the page can't be
- * freed.
- */
- if (trylock_page(vmf->page)) {
- struct dev_pagemap *pgmap;
-
- get_page(vmf->page);
- pte_unmap_unlock(vmf->pte, vmf->ptl);
+ pte_unmap_unlock(vmf->pte, vmf->ptl);
+ lock_page(vmf->page);
+ if (get_page_unless_zero(vmf->page)) {
pgmap = page_pgmap(vmf->page);
ret = pgmap->ops->migrate_to_ram(vmf);
unlock_page(vmf->page);
put_page(vmf->page);
} else {
- pte_unmap_unlock(vmf->pte, vmf->ptl);
+ unlock_page(vmf->page);
}
} else if (softleaf_is_hwpoison(entry)) {
ret = VM_FAULT_HWPOISON;
> diff --git a/mm/memory.c b/mm/memory.c
> index da360a6eb8a4..af73430e7888 100644
> --- a/mm/memory.c
> +++ b/mm/memory.c
> @@ -4652,6 +4652,8 @@ vm_fault_t do_swap_page(struct vm_fault *vmf)
> vmf->page = softleaf_to_page(entry);
> ret = remove_device_exclusive_entry(vmf);
> } else if (softleaf_is_device_private(entry)) {
> + struct dev_pagemap *pgmap;
> +
> if (vmf->flags & FAULT_FLAG_VMA_LOCK) {
> /*
> * migrate_to_ram is not yet ready to operate
> @@ -4674,18 +4676,13 @@ vm_fault_t do_swap_page(struct vm_fault *vmf)
> * Get a page reference while we know the page can't be
> * freed.
> */
> - if (trylock_page(vmf->page)) {
> - struct dev_pagemap *pgmap;
> -
> - get_page(vmf->page);
> - pte_unmap_unlock(vmf->pte, vmf->ptl);
> - pgmap = page_pgmap(vmf->page);
> - ret = pgmap->ops->migrate_to_ram(vmf);
> - unlock_page(vmf->page);
> - put_page(vmf->page);
> - } else {
> - pte_unmap_unlock(vmf->pte, vmf->ptl);
> - }
> + get_page(vmf->page);
> + pte_unmap_unlock(vmf->pte, vmf->ptl);
> + lock_page(vmf->page);
> + pgmap = page_pgmap(vmf->page);
> + ret = pgmap->ops->migrate_to_ram(vmf);
> + unlock_page(vmf->page);
> + put_page(vmf->page);
> } else if (softleaf_is_hwpoison(entry)) {
> ret = VM_FAULT_HWPOISON;
> } else if (softleaf_is_marker(entry)) {
>
> >
> > > But it looks like an AR for us to try to check how bad
> > > lru_cache_disable() really is. And perhaps compare with an
> > > unconditional lru_add_drain_all() at migration start.
> > >
> > > Does anybody know who would be able to tell whether a page refcount
> > > still should block migration (like today) or whether that could
> > > actually be relaxed to a page pincount?
>
> Yes, it really should block migration, see my response above: both
> pincount and refcount literally mean "do not move this page".
>
> As an aside because it might help at some point, I'm just now testing a
> tiny patchset that uses:
>
> wait_var_event_killable(&folio->_refcount,
> folio_ref_count(folio) <= expected)
>
> during migration, paired with:
>
> wake_up_var(&folio->_refcount) in put_page().
>
> This waits for the expected refcount, instead of doing a blind, tight
> retry loop during migration attempts. This lets migration succeed even
> when waiting a long time for another caller to release a refcount.
>
> It works well, but of course, it also has a potentially serious
> performance cost (which I need to quantify), because it adds cycles to
> the put_page() hot path. Which is why I haven't posted it yet, even as
> an RFC. It's still in the "is this even reasonable" stage, just food
> for thought here.
>
If you post an RFC we (Intel) can give it a try, as we have tests that
really stress migration in odd ways and have fairly good metrics to
catch perf issues too.
Matt
> thanks,
> --
> John Hubbard
^ permalink raw reply [flat|nested] 30+ messages in thread
* Re: [PATCH] mm/hmm: Fix a hmm_range_fault() livelock / starvation problem
2026-02-01 19:24 ` Matthew Brost
@ 2026-02-01 20:48 ` John Hubbard
2026-02-01 21:07 ` Matthew Brost
0 siblings, 1 reply; 30+ messages in thread
From: John Hubbard @ 2026-02-01 20:48 UTC (permalink / raw)
To: Matthew Brost
Cc: Thomas Hellström, Andrew Morton, intel-xe, Ralph Campbell,
Christoph Hellwig, Jason Gunthorpe, Jason Gunthorpe,
Leon Romanovsky, linux-mm, stable, dri-devel
On 2/1/26 11:24 AM, Matthew Brost wrote:
> On Sat, Jan 31, 2026 at 01:42:20PM -0800, John Hubbard wrote:
>> On 1/31/26 11:00 AM, Matthew Brost wrote:
>>> On Sat, Jan 31, 2026 at 01:57:21PM +0100, Thomas Hellström wrote:
>>>> On Fri, 2026-01-30 at 19:01 -0800, John Hubbard wrote:
>>>>> On 1/30/26 10:00 AM, Andrew Morton wrote:
>>>>>> On Fri, 30 Jan 2026 15:45:29 +0100 Thomas Hellström wrote:
>>>>> ...
>>> I’m not convinced the folio refcount has any bearing if we can take a
>>> sleeping lock in do_swap_page, but perhaps I’m missing something.
>>
>> So far, I am not able to find a problem with your proposal. So,
>> something like this I believe could actually work:
>
> I did something slightly more defensive with a refcount protection, but
> this seems to work + fix the raised by Thomas and shows no noticeable
> performance difference. If we go this route, do_huge_pmd_device_private
> would need to be updated with the same pattern as well - I don't have
> large device pages enabled in current test branch but would have to test
> that part out too.
>
> diff --git a/mm/memory.c b/mm/memory.c
> index da360a6eb8a4..1e7ccc4a1a6c 100644
> --- a/mm/memory.c
> +++ b/mm/memory.c
> @@ -4652,6 +4652,8 @@ vm_fault_t do_swap_page(struct vm_fault *vmf)
> vmf->page = softleaf_to_page(entry);
> ret = remove_device_exclusive_entry(vmf);
> } else if (softleaf_is_device_private(entry)) {
> + struct dev_pagemap *pgmap;
> +
> if (vmf->flags & FAULT_FLAG_VMA_LOCK) {
> /*
> * migrate_to_ram is not yet ready to operate
> @@ -4670,21 +4672,15 @@ vm_fault_t do_swap_page(struct vm_fault *vmf)
> vmf->orig_pte)))
> goto unlock;
>
> - /*
> - * Get a page reference while we know the page can't be
> - * freed.
> - */
> - if (trylock_page(vmf->page)) {
> - struct dev_pagemap *pgmap;
> -
> - get_page(vmf->page);
> - pte_unmap_unlock(vmf->pte, vmf->ptl);
> + pte_unmap_unlock(vmf->pte, vmf->ptl);
> + lock_page(vmf->page);
> + if (get_page_unless_zero(vmf->page)) {
I think this ordering has a problem, because it releases the PTL before
getting a refcount. That allows another thread to free the page, and
then the lock_page() call here will be doing a use-after-free.
That's why I reversed the order of those, and then as a result the
get_page_unless_zero() also becomes unnecessary.
> pgmap = page_pgmap(vmf->page);
> ret = pgmap->ops->migrate_to_ram(vmf);
> unlock_page(vmf->page);
> put_page(vmf->page);
> } else {
> - pte_unmap_unlock(vmf->pte, vmf->ptl);
> + unlock_page(vmf->page);
> }
> } else if (softleaf_is_hwpoison(entry)) {
> ret = VM_FAULT_HWPOISON;
>
>> diff --git a/mm/memory.c b/mm/memory.c
>> index da360a6eb8a4..af73430e7888 100644
>> --- a/mm/memory.c
>> +++ b/mm/memory.c
>> @@ -4652,6 +4652,8 @@ vm_fault_t do_swap_page(struct vm_fault *vmf)
>> vmf->page = softleaf_to_page(entry);
>> ret = remove_device_exclusive_entry(vmf);
>> } else if (softleaf_is_device_private(entry)) {
>> + struct dev_pagemap *pgmap;
>> +
>> if (vmf->flags & FAULT_FLAG_VMA_LOCK) {
>> /*
>> * migrate_to_ram is not yet ready to operate
>> @@ -4674,18 +4676,13 @@ vm_fault_t do_swap_page(struct vm_fault *vmf)
>> * Get a page reference while we know the page can't be
>> * freed.
>> */
>> - if (trylock_page(vmf->page)) {
>> - struct dev_pagemap *pgmap;
>> -
>> - get_page(vmf->page);
>> - pte_unmap_unlock(vmf->pte, vmf->ptl);
>> - pgmap = page_pgmap(vmf->page);
>> - ret = pgmap->ops->migrate_to_ram(vmf);
>> - unlock_page(vmf->page);
>> - put_page(vmf->page);
>> - } else {
>> - pte_unmap_unlock(vmf->pte, vmf->ptl);
>> - }
>> + get_page(vmf->page);
>> + pte_unmap_unlock(vmf->pte, vmf->ptl);
>> + lock_page(vmf->page);
>> + pgmap = page_pgmap(vmf->page);
>> + ret = pgmap->ops->migrate_to_ram(vmf);
>> + unlock_page(vmf->page);
>> + put_page(vmf->page);
>> } else if (softleaf_is_hwpoison(entry)) {
>> ret = VM_FAULT_HWPOISON;
>> } else if (softleaf_is_marker(entry)) {
>>
>>>
>>>> But it looks like an AR for us to try to check how bad
>>>> lru_cache_disable() really is. And perhaps compare with an
>>>> unconditional lru_add_drain_all() at migration start.
>>>>
>>>> Does anybody know who would be able to tell whether a page refcount
>>>> still should block migration (like today) or whether that could
>>>> actually be relaxed to a page pincount?
>>
>> Yes, it really should block migration, see my response above: both
>> pincount and refcount literally mean "do not move this page".
>>
>> As an aside because it might help at some point, I'm just now testing a
>> tiny patchset that uses:
>>
>> wait_var_event_killable(&folio->_refcount,
>> folio_ref_count(folio) <= expected)
>>
>> during migration, paired with:
>>
>> wake_up_var(&folio->_refcount) in put_page().
>>
>> This waits for the expected refcount, instead of doing a blind, tight
>> retry loop during migration attempts. This lets migration succeed even
>> when waiting a long time for another caller to release a refcount.
>>
>> It works well, but of course, it also has a potentially serious
>> performance cost (which I need to quantify), because it adds cycles to
>> the put_page() hot path. Which is why I haven't posted it yet, even as
>> an RFC. It's still in the "is this even reasonable" stage, just food
>> for thought here.
>>
>
> If you post an RFC we (Intel) can give it try as we have tests that
> really stress migration in odd ways and have fairly good metrics to
> catch perf issues too.
>
That would be wonderful, thanks! Testing is hard.
thanks,
--
John Hubbard
^ permalink raw reply [flat|nested] 30+ messages in thread
* Re: [PATCH] mm/hmm: Fix a hmm_range_fault() livelock / starvation problem
2026-02-01 20:48 ` John Hubbard
@ 2026-02-01 21:07 ` Matthew Brost
2026-02-02 0:10 ` Alistair Popple
0 siblings, 1 reply; 30+ messages in thread
From: Matthew Brost @ 2026-02-01 21:07 UTC (permalink / raw)
To: John Hubbard
Cc: Thomas Hellström, Andrew Morton, intel-xe, Ralph Campbell,
Christoph Hellwig, Jason Gunthorpe, Jason Gunthorpe,
Leon Romanovsky, linux-mm, stable, dri-devel
On Sun, Feb 01, 2026 at 12:48:33PM -0800, John Hubbard wrote:
> On 2/1/26 11:24 AM, Matthew Brost wrote:
> > On Sat, Jan 31, 2026 at 01:42:20PM -0800, John Hubbard wrote:
> > > On 1/31/26 11:00 AM, Matthew Brost wrote:
> > > > On Sat, Jan 31, 2026 at 01:57:21PM +0100, Thomas Hellström wrote:
> > > > > On Fri, 2026-01-30 at 19:01 -0800, John Hubbard wrote:
> > > > > > On 1/30/26 10:00 AM, Andrew Morton wrote:
> > > > > > > On Fri, 30 Jan 2026 15:45:29 +0100 Thomas Hellström wrote:
> > > > > > ...
> > > > I’m not convinced the folio refcount has any bearing if we can take a
> > > > sleeping lock in do_swap_page, but perhaps I’m missing something.
> > >
> > > So far, I am not able to find a problem with your proposal. So,
> > > something like this I believe could actually work:
> >
> > I did something slightly more defensive with a refcount protection, but
> > this seems to work + fix the issue raised by Thomas and shows no noticeable
> > performance difference. If we go this route, do_huge_pmd_device_private
> > would need to be updated with the same pattern as well - I don't have
> > large device pages enabled in my current test branch but would have to test
> > that part out too.
> >
> > diff --git a/mm/memory.c b/mm/memory.c
> > index da360a6eb8a4..1e7ccc4a1a6c 100644
> > --- a/mm/memory.c
> > +++ b/mm/memory.c
> > @@ -4652,6 +4652,8 @@ vm_fault_t do_swap_page(struct vm_fault *vmf)
> > vmf->page = softleaf_to_page(entry);
> > ret = remove_device_exclusive_entry(vmf);
> > } else if (softleaf_is_device_private(entry)) {
> > + struct dev_pagemap *pgmap;
> > +
> > if (vmf->flags & FAULT_FLAG_VMA_LOCK) {
> > /*
> > * migrate_to_ram is not yet ready to operate
> > @@ -4670,21 +4672,15 @@ vm_fault_t do_swap_page(struct vm_fault *vmf)
> > vmf->orig_pte)))
> > goto unlock;
> >
> > - /*
> > - * Get a page reference while we know the page can't be
> > - * freed.
> > - */
> > - if (trylock_page(vmf->page)) {
> > - struct dev_pagemap *pgmap;
> > -
> > - get_page(vmf->page);
> > - pte_unmap_unlock(vmf->pte, vmf->ptl);
> > + pte_unmap_unlock(vmf->pte, vmf->ptl);
> > + lock_page(vmf->page);
> > + if (get_page_unless_zero(vmf->page)) {
>
> I think this ordering has a problem, because it releases the PTL before
> getting a refcount. That allows another thread to free the page, and
Yes, I reasoned that this could be a problem too after thinking about it
a bit more. The issue with taking a refcount without the lock is that
we’re back to the livelock problem that was solved here:
git format-patch -1 1afaeb8293c9a
> then the lock_page() call here will be doing a use-after-free.
>
I don’t think it’s a use-after-free per se; rather, the page could have
moved and been reallocated. If the same_pte check were moved under the
page lock, I think it would largely solve that, but if the page were
reallocated as a larger folio, the page lock might collide with the
folio-order bit encoding and hang forever. This is likely extremely hard
to hit, as you’d need multiple threads faulting the same page plus the
driver reallocating the page as a folio at the same time, but
nonetheless it could be a problem.
So maybe back to the drawing board...
Matt
> That's why I reversed the order of those, and then as a result the
> get_page_unless_zero() also becomes unnecessary.
>
> > pgmap = page_pgmap(vmf->page);
> > ret = pgmap->ops->migrate_to_ram(vmf);
> > unlock_page(vmf->page);
> > put_page(vmf->page);
> > } else {
> > - pte_unmap_unlock(vmf->pte, vmf->ptl);
> > + unlock_page(vmf->page);
> > }
> > } else if (softleaf_is_hwpoison(entry)) {
> > ret = VM_FAULT_HWPOISON;
> >
> > > diff --git a/mm/memory.c b/mm/memory.c
> > > index da360a6eb8a4..af73430e7888 100644
> > > --- a/mm/memory.c
> > > +++ b/mm/memory.c
> > > @@ -4652,6 +4652,8 @@ vm_fault_t do_swap_page(struct vm_fault *vmf)
> > > vmf->page = softleaf_to_page(entry);
> > > ret = remove_device_exclusive_entry(vmf);
> > > } else if (softleaf_is_device_private(entry)) {
> > > + struct dev_pagemap *pgmap;
> > > +
> > > if (vmf->flags & FAULT_FLAG_VMA_LOCK) {
> > > /*
> > > * migrate_to_ram is not yet ready to operate
> > > @@ -4674,18 +4676,13 @@ vm_fault_t do_swap_page(struct vm_fault *vmf)
> > > * Get a page reference while we know the page can't be
> > > * freed.
> > > */
> > > - if (trylock_page(vmf->page)) {
> > > - struct dev_pagemap *pgmap;
> > > -
> > > - get_page(vmf->page);
> > > - pte_unmap_unlock(vmf->pte, vmf->ptl);
> > > - pgmap = page_pgmap(vmf->page);
> > > - ret = pgmap->ops->migrate_to_ram(vmf);
> > > - unlock_page(vmf->page);
> > > - put_page(vmf->page);
> > > - } else {
> > > - pte_unmap_unlock(vmf->pte, vmf->ptl);
> > > - }
> > > + get_page(vmf->page);
> > > + pte_unmap_unlock(vmf->pte, vmf->ptl);
> > > + lock_page(vmf->page);
> > > + pgmap = page_pgmap(vmf->page);
> > > + ret = pgmap->ops->migrate_to_ram(vmf);
> > > + unlock_page(vmf->page);
> > > + put_page(vmf->page);
> > > } else if (softleaf_is_hwpoison(entry)) {
> > > ret = VM_FAULT_HWPOISON;
> > > } else if (softleaf_is_marker(entry)) {
> > >
> > > >
> > > > > But it looks like an AR for us to try to check how bad
> > > > > lru_cache_disable() really is. And perhaps compare with an
> > > > > unconditional lru_add_drain_all() at migration start.
> > > > >
> > > > > Does anybody know who would be able to tell whether a page refcount
> > > > > still should block migration (like today) or whether that could
> > > > > actually be relaxed to a page pincount?
> > >
> > > Yes, it really should block migration, see my response above: both
> > > pincount and refcount literally mean "do not move this page".
> > >
> > > As an aside because it might help at some point, I'm just now testing a
> > > tiny patchset that uses:
> > >
> > > wait_var_event_killable(&folio->_refcount,
> > > folio_ref_count(folio) <= expected)
> > >
> > > during migration, paired with:
> > >
> > > wake_up_var(&folio->_refcount) in put_page().
> > >
> > > This waits for the expected refcount, instead of doing a blind, tight
> > > retry loop during migration attempts. This lets migration succeed even
> > > when waiting a long time for another caller to release a refcount.
> > >
> > > It works well, but of course, it also has a potentially serious
> > > performance cost (which I need to quantify), because it adds cycles to
> > > the put_page() hot path. Which is why I haven't posted it yet, even as
> > > an RFC. It's still in the "is this even reasonable" stage, just food
> > > for thought here.
> > >
> >
> > If you post an RFC we (Intel) can give it try as we have tests that
> > really stress migration in odd ways and have fairly good metrics to
> > catch perf issues too.
> >
>
> That would be wonderful, thanks! Testing is hard.
>
> thanks,
> --
> John Hubbard
>
^ permalink raw reply [flat|nested] 30+ messages in thread
* Re: [PATCH] mm/hmm: Fix a hmm_range_fault() livelock / starvation problem
2026-02-01 21:07 ` Matthew Brost
@ 2026-02-02 0:10 ` Alistair Popple
2026-02-02 9:30 ` Thomas Hellström
0 siblings, 1 reply; 30+ messages in thread
From: Alistair Popple @ 2026-02-02 0:10 UTC (permalink / raw)
To: Matthew Brost
Cc: John Hubbard, Thomas Hellström, Andrew Morton, intel-xe,
Ralph Campbell, Christoph Hellwig, Jason Gunthorpe,
Jason Gunthorpe, Leon Romanovsky, linux-mm, stable, dri-devel
On 2026-02-02 at 08:07 +1100, Matthew Brost <matthew.brost@intel.com> wrote...
> On Sun, Feb 01, 2026 at 12:48:33PM -0800, John Hubbard wrote:
> > On 2/1/26 11:24 AM, Matthew Brost wrote:
> > > On Sat, Jan 31, 2026 at 01:42:20PM -0800, John Hubbard wrote:
> > > > On 1/31/26 11:00 AM, Matthew Brost wrote:
> > > > > On Sat, Jan 31, 2026 at 01:57:21PM +0100, Thomas Hellström wrote:
> > > > > > On Fri, 2026-01-30 at 19:01 -0800, John Hubbard wrote:
> > > > > > > On 1/30/26 10:00 AM, Andrew Morton wrote:
> > > > > > > > On Fri, 30 Jan 2026 15:45:29 +0100 Thomas Hellström wrote:
> > > > > > > ...
> > > > > I’m not convinced the folio refcount has any bearing if we can take a
> > > > > sleeping lock in do_swap_page, but perhaps I’m missing something.
I think the point of the trylock vs. lock is that if you can't immediately
lock the page, then it's an indication the page is undergoing a migration.
In other words, there's no point waiting for the lock and then trying to call
migrate_to_ram(), as the page will have already moved by the time you acquire
the lock. Of course, that just means you spin faulting until the page finally
migrates.
If I'm understanding the problem, it sounds like we just want to sleep until the
migration is complete, i.e. the same as the migration entry path. We don't have a
device_private_entry_wait() function, but I don't think we need one, see below.
> > > diff --git a/mm/memory.c b/mm/memory.c
> > > index da360a6eb8a4..1e7ccc4a1a6c 100644
> > > --- a/mm/memory.c
> > > +++ b/mm/memory.c
> > > @@ -4652,6 +4652,8 @@ vm_fault_t do_swap_page(struct vm_fault *vmf)
> > > vmf->page = softleaf_to_page(entry);
> > > ret = remove_device_exclusive_entry(vmf);
> > > } else if (softleaf_is_device_private(entry)) {
> > > + struct dev_pagemap *pgmap;
> > > +
> > > if (vmf->flags & FAULT_FLAG_VMA_LOCK) {
> > > /*
> > > * migrate_to_ram is not yet ready to operate
> > > @@ -4670,21 +4672,15 @@ vm_fault_t do_swap_page(struct vm_fault *vmf)
> > > vmf->orig_pte)))
> > > goto unlock;
> > >
> > > - /*
> > > - * Get a page reference while we know the page can't be
> > > - * freed.
> > > - */
> > > - if (trylock_page(vmf->page)) {
> > > - struct dev_pagemap *pgmap;
> > > -
> > > - get_page(vmf->page);
At this point we:
1. Know the page needs to migrate
2. Have the page locked
3. Have a reference on the page
4. Have the PTL locked
Or in other words we have everything we need to install a migration entry,
so why not just do that? This thread would then proceed into migrate_to_ram()
having already done migrate_vma_collect_pmd() for the faulting page, and any
other threads would just sleep in the migration-entry wait path until the
migration is complete, avoiding the livelock problem the trylock was introduced
for in 1afaeb8293c9a.
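As a rough sketch of that idea (using current mainline helper names rather
than the softleaf API in this tree, and ignoring write permission, A/D bits
and TLB flushing, so this is an illustration rather than a patch):

    /* With the PTL still held and the page locked and referenced: */
    swp_entry_t mig = make_readable_migration_entry(page_to_pfn(vmf->page));

    set_pte_at(vmf->vma->vm_mm, vmf->address, vmf->pte, swp_entry_to_pte(mig));
    pte_unmap_unlock(vmf->pte, vmf->ptl);

    /*
     * Concurrent faulters now sleep in migration_entry_wait() instead of
     * spinning on trylock_page(), while this thread continues into
     * pgmap->ops->migrate_to_ram(vmf) as before.
     */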
- Alistair
> > > - pte_unmap_unlock(vmf->pte, vmf->ptl);
> > > + pte_unmap_unlock(vmf->pte, vmf->ptl);
> > > + lock_page(vmf->page);
> > > + if (get_page_unless_zero(vmf->page)) {
> >
> > I think this ordering has a problem, because it releases the PTL before
> > getting a refcount. That allows another thread to free the page, and
>
> Yes, I reasoned that this could be a problem too after thinking about it
> a bit more. The issue with taking a refcount without the lock is that
> we’re back to the livelock problem that was solved here:
>
> git format-patch -1 1afaeb8293c9a
>
> > then the lock_page() call here will be doing a use-after-free.
> >
>
> I don’t think it’s a use-after-free per se; rather, the page could have
> moved and been reallocated. If the same_pte check were moved under the
> page lock, I think it would largely solve that, but if the page were
> reallocated as a larger folio, the page lock might collide with the
> folio-order bit encoding and hang forever. This is likely extremely hard
> to hit, as you’d need multiple threads faulting the same page plus the
> driver reallocating the page as a folio at the same time, but
> nonetheless it could be a problem.
>
> So maybe back to drawing board...
>
> Matt
>
> > That's why I reversed the order of those, and then as a result the
> > get_page_unless_zero() also becomes unnecessary.
> >
> > > pgmap = page_pgmap(vmf->page);
> > > ret = pgmap->ops->migrate_to_ram(vmf);
> > > unlock_page(vmf->page);
> > > put_page(vmf->page);
> > > } else {
> > > - pte_unmap_unlock(vmf->pte, vmf->ptl);
> > > + unlock_page(vmf->page);
> > > }
> > > } else if (softleaf_is_hwpoison(entry)) {
> > > ret = VM_FAULT_HWPOISON;
> > >
> > > > diff --git a/mm/memory.c b/mm/memory.c
> > > > index da360a6eb8a4..af73430e7888 100644
> > > > --- a/mm/memory.c
> > > > +++ b/mm/memory.c
> > > > @@ -4652,6 +4652,8 @@ vm_fault_t do_swap_page(struct vm_fault *vmf)
> > > > vmf->page = softleaf_to_page(entry);
> > > > ret = remove_device_exclusive_entry(vmf);
> > > > } else if (softleaf_is_device_private(entry)) {
> > > > + struct dev_pagemap *pgmap;
> > > > +
> > > > if (vmf->flags & FAULT_FLAG_VMA_LOCK) {
> > > > /*
> > > > * migrate_to_ram is not yet ready to operate
> > > > @@ -4674,18 +4676,13 @@ vm_fault_t do_swap_page(struct vm_fault *vmf)
> > > > * Get a page reference while we know the page can't be
> > > > * freed.
> > > > */
> > > > - if (trylock_page(vmf->page)) {
> > > > - struct dev_pagemap *pgmap;
> > > > -
> > > > - get_page(vmf->page);
> > > > - pte_unmap_unlock(vmf->pte, vmf->ptl);
> > > > - pgmap = page_pgmap(vmf->page);
> > > > - ret = pgmap->ops->migrate_to_ram(vmf);
> > > > - unlock_page(vmf->page);
> > > > - put_page(vmf->page);
> > > > - } else {
> > > > - pte_unmap_unlock(vmf->pte, vmf->ptl);
> > > > - }
> > > > + get_page(vmf->page);
> > > > + pte_unmap_unlock(vmf->pte, vmf->ptl);
> > > > + lock_page(vmf->page);
> > > > + pgmap = page_pgmap(vmf->page);
> > > > + ret = pgmap->ops->migrate_to_ram(vmf);
> > > > + unlock_page(vmf->page);
> > > > + put_page(vmf->page);
> > > > } else if (softleaf_is_hwpoison(entry)) {
> > > > ret = VM_FAULT_HWPOISON;
> > > > } else if (softleaf_is_marker(entry)) {
> > > >
> > > > >
> > > > > > But it looks like an AR for us to try to check how bad
> > > > > > lru_cache_disable() really is. And perhaps compare with an
> > > > > > unconditional lru_add_drain_all() at migration start.
> > > > > >
> > > > > > Does anybody know who would be able to tell whether a page refcount
> > > > > > still should block migration (like today) or whether that could
> > > > > > actually be relaxed to a page pincount?
> > > >
> > > > Yes, it really should block migration, see my response above: both
> > > > pincount and refcount literally mean "do not move this page".
> > > >
> > > > As an aside because it might help at some point, I'm just now testing a
> > > > tiny patchset that uses:
> > > >
> > > > wait_var_event_killable(&folio->_refcount,
> > > > folio_ref_count(folio) <= expected)
> > > >
> > > > during migration, paired with:
> > > >
> > > > wake_up_var(&folio->_refcount) in put_page().
> > > >
> > > > This waits for the expected refcount, instead of doing a blind, tight
> > > > retry loop during migration attempts. This lets migration succeed even
> > > > when waiting a long time for another caller to release a refcount.
> > > >
> > > > It works well, but of course, it also has a potentially serious
> > > > performance cost (which I need to quantify), because it adds cycles to
> > > > the put_page() hot path. Which is why I haven't posted it yet, even as
> > > > an RFC. It's still in the "is this even reasonable" stage, just food
> > > > for thought here.
> > > >
> > >
> > > If you post an RFC we (Intel) can give it try as we have tests that
> > > really stress migration in odd ways and have fairly good metrics to
> > > catch perf issues too.
> > >
> >
> > That would be wonderful, thanks! Testing is hard.
> >
> > thanks,
> > --
> > John Hubbard
> >
>
^ permalink raw reply [flat|nested] 30+ messages in thread
* Re: [PATCH] mm/hmm: Fix a hmm_range_fault() livelock / starvation problem
2026-01-31 21:42 ` John Hubbard
2026-02-01 19:24 ` Matthew Brost
@ 2026-02-02 9:13 ` Thomas Hellström
2026-02-02 10:34 ` Alistair Popple
2026-02-02 22:28 ` John Hubbard
1 sibling, 2 replies; 30+ messages in thread
From: Thomas Hellström @ 2026-02-02 9:13 UTC (permalink / raw)
To: John Hubbard, Matthew Brost
Cc: Andrew Morton, intel-xe, Ralph Campbell, Christoph Hellwig,
Jason Gunthorpe, Jason Gunthorpe, Leon Romanovsky, linux-mm,
stable, dri-devel
On Sat, 2026-01-31 at 13:42 -0800, John Hubbard wrote:
> On 1/31/26 11:00 AM, Matthew Brost wrote:
> > On Sat, Jan 31, 2026 at 01:57:21PM +0100, Thomas Hellström wrote:
> > > On Fri, 2026-01-30 at 19:01 -0800, John Hubbard wrote:
> > > > On 1/30/26 10:00 AM, Andrew Morton wrote:
> > > > > On Fri, 30 Jan 2026 15:45:29 +0100 Thomas Hellström
> > > > > <thomas.hellstrom@linux.intel.com> wrote:
> > > > ...
>
> >
> > > I'm also not sure a folio refcount should block migration after
> > > the
> > > introduction of pinned (like in pin_user_pages) pages. Rather
> > > perhaps a
> > > folio pin-count should block migration and in that case
> > > do_swap_page()
> > > can definitely do a sleeping folio lock and the problem is gone.
>
> A problem for that specific point is that pincount and refcount both
> mean, "the page is pinned" (which in turn literally means "not
> allowed
> to migrate/move").
Yeah, this is what I actually want to challenge, since this is what
blocks us from doing a clean, robust solution here. From a brief reading
of the docs around the pin-count implementation, I understand it as "If
you want to access the struct page metadata, get a refcount; if you
want to access the actual memory of a page, take a pin-count".
I guess that might still not be true for all old instances in the
kernel using get_user_pages() instead of pin_user_pages() for things
like DMA, but perhaps we can set that in stone and document it, at least
for device-private pages for now, which would be sufficient for the
do_swap_page() refcount not to block migration.
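As a concrete (hedged) illustration of that reading, using existing GUP
helpers rather than anything proposed in this thread, and with "uaddr" as a
made-up user address:

    /* "I only need the struct page / folio metadata to stay valid": */
    get_page(page);
    /* ... inspect or record metadata ... */
    put_page(page);

    /* "I'm going to access the memory the page backs (e.g. for DMA)": */
    struct page *pages[1];

    if (pin_user_pages_fast(uaddr, 1, FOLL_WRITE | FOLL_LONGTERM, pages) == 1) {
            /* ... device reads/writes the backing memory ... */
            unpin_user_page(pages[0]);
    }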
>
> (In fact, pincount is implemented in terms of refcount, in most
> configurations still.)
Yes, but that's only a space optimization, never intended to conflict,
right? Meaning a pin-count will imply a refcount, but a refcount will
never imply a pin-count?
Thanks,
Thomas
^ permalink raw reply [flat|nested] 30+ messages in thread
* Re: [PATCH] mm/hmm: Fix a hmm_range_fault() livelock / starvation problem
2026-02-02 0:10 ` Alistair Popple
@ 2026-02-02 9:30 ` Thomas Hellström
2026-02-02 10:25 ` Alistair Popple
0 siblings, 1 reply; 30+ messages in thread
From: Thomas Hellström @ 2026-02-02 9:30 UTC (permalink / raw)
To: Alistair Popple, Matthew Brost
Cc: John Hubbard, Andrew Morton, intel-xe, Ralph Campbell,
Christoph Hellwig, Jason Gunthorpe, Jason Gunthorpe,
Leon Romanovsky, linux-mm, stable, dri-devel
Hi,
On Mon, 2026-02-02 at 11:10 +1100, Alistair Popple wrote:
> On 2026-02-02 at 08:07 +1100, Matthew Brost <matthew.brost@intel.com>
> wrote...
> > On Sun, Feb 01, 2026 at 12:48:33PM -0800, John Hubbard wrote:
> > > On 2/1/26 11:24 AM, Matthew Brost wrote:
> > > > On Sat, Jan 31, 2026 at 01:42:20PM -0800, John Hubbard wrote:
> > > > > On 1/31/26 11:00 AM, Matthew Brost wrote:
> > > > > > On Sat, Jan 31, 2026 at 01:57:21PM +0100, Thomas Hellström
> > > > > > wrote:
> > > > > > > On Fri, 2026-01-30 at 19:01 -0800, John Hubbard wrote:
> > > > > > > > On 1/30/26 10:00 AM, Andrew Morton wrote:
> > > > > > > > > On Fri, 30 Jan 2026 15:45:29 +0100 Thomas Hellström
> > > > > > > > > wrote:
> > > > > > > > ...
> > > > > > I’m not convinced the folio refcount has any bearing if we
> > > > > > can take a
> > > > > > sleeping lock in do_swap_page, but perhaps I’m missing
> > > > > > something.
>
> I think the point of the trylock vs. lock is that if you can't
> immediately
> lock the page then it's an indication the page is undergoing a
> migration.
> In other words there's no point waiting for the lock and then trying
> to call
> migrate_to_ram() as the page will have already moved by the time you
> acquire
> the lock. Of course that just means you spin faulting until the page
> finally
> migrates.
>
> If I'm understanding the problem it sounds like we just want to sleep
> until the
> migration is complete, ie. same as the migration entry path. We don't
> have a
> device_private_entry_wait() function, but I don't think we need one,
> see below.
>
> > > > diff --git a/mm/memory.c b/mm/memory.c
> > > > index da360a6eb8a4..1e7ccc4a1a6c 100644
> > > > --- a/mm/memory.c
> > > > +++ b/mm/memory.c
> > > > @@ -4652,6 +4652,8 @@ vm_fault_t do_swap_page(struct vm_fault
> > > > *vmf)
> > > > vmf->page = softleaf_to_page(entry);
> > > > ret =
> > > > remove_device_exclusive_entry(vmf);
> > > > } else if (softleaf_is_device_private(entry))
> > > > {
> > > > + struct dev_pagemap *pgmap;
> > > > +
> > > > if (vmf->flags & FAULT_FLAG_VMA_LOCK)
> > > > {
> > > > /*
> > > > * migrate_to_ram is not yet
> > > > ready to operate
> > > > @@ -4670,21 +4672,15 @@ vm_fault_t do_swap_page(struct vm_fault
> > > > *vmf)
> > > > vmf-
> > > > >orig_pte)))
> > > > goto unlock;
> > > >
> > > > - /*
> > > > - * Get a page reference while we know
> > > > the page can't be
> > > > - * freed.
> > > > - */
> > > > - if (trylock_page(vmf->page)) {
> > > > - struct dev_pagemap *pgmap;
> > > > -
> > > > - get_page(vmf->page);
>
> At this point we:
> 1. Know the page needs to migrate
> 2. Have the page locked
> 3. Have a reference on the page
> 4. Have the PTL locked
>
> Or in other words we have everything we need to install a migration
> entry,
> so why not just do that? This thread would then proceed into
> migrate_to_ram()
> having already done migrate_vma_collect_pmd() for the faulting page
> and any
> other threads would just sleep in the wait on migration entry path
> until the
> migration is complete, avoiding the livelock problem the trylock was
> introduced
> for in 1afaeb8293c9a.
>
> - Alistair
>
> > >
There will always be a small window between when the page is locked and
when we can install a migration entry. If the page only has a single
mapcount, then the PTL is held during this window, so the issue does
not occur. But with multiple mapcounts we need to release the PTL
in the migration code to run try_to_migrate(), and before that the migrate
code runs lru_add_drain_all() and gets stuck.
However, it looks like Matt Brost, inspired by the above, tried changing
the order of try_to_migrate() vs lru_add_drain_all(), which would
cause the fault handler to eventually block on a migration entry. That
seems to solve the issue we're seeing here.
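Roughly, and only as a sketch of the ordering rather than of Matt's actual
change (helper names are current mainline):

    /* In the device-migration unmap path, with the source folio locked: */
    if (folio_mapped(folio))
            try_to_migrate(folio, 0);   /* PTEs become migration entries */

    /*
     * Only now run anything that needs a work item on every CPU;
     * concurrent faulters sleep in migration_entry_wait() rather than
     * spinning on the folio lock.
     */
    lru_add_drain_all();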
Still, I think the spinning on the trylock here is something we'd
want to get rid of, because IMO we shouldn't need to adapt otherwise
correct but slightly suboptimal code to a badly behaving path in the
fault handler. In essence, we're never allowed to lock a device-private
folio and then call something that needs to execute on all CPUs
before unlocking.
Thanks,
Thomas
^ permalink raw reply [flat|nested] 30+ messages in thread
* Re: [PATCH] mm/hmm: Fix a hmm_range_fault() livelock / starvation problem
2026-02-02 9:30 ` Thomas Hellström
@ 2026-02-02 10:25 ` Alistair Popple
2026-02-02 10:41 ` Thomas Hellström
0 siblings, 1 reply; 30+ messages in thread
From: Alistair Popple @ 2026-02-02 10:25 UTC (permalink / raw)
To: Thomas Hellström
Cc: Matthew Brost, John Hubbard, Andrew Morton, intel-xe,
Ralph Campbell, Christoph Hellwig, Jason Gunthorpe,
Jason Gunthorpe, Leon Romanovsky, linux-mm, stable, dri-devel
On 2026-02-02 at 20:30 +1100, Thomas Hellström <thomas.hellstrom@linux.intel.com> wrote...
> Hi,
>
> On Mon, 2026-02-02 at 11:10 +1100, Alistair Popple wrote:
> > On 2026-02-02 at 08:07 +1100, Matthew Brost <matthew.brost@intel.com>
> > wrote...
> > > On Sun, Feb 01, 2026 at 12:48:33PM -0800, John Hubbard wrote:
> > > > On 2/1/26 11:24 AM, Matthew Brost wrote:
> > > > > On Sat, Jan 31, 2026 at 01:42:20PM -0800, John Hubbard wrote:
> > > > > > On 1/31/26 11:00 AM, Matthew Brost wrote:
> > > > > > > On Sat, Jan 31, 2026 at 01:57:21PM +0100, Thomas Hellström
> > > > > > > wrote:
> > > > > > > > On Fri, 2026-01-30 at 19:01 -0800, John Hubbard wrote:
> > > > > > > > > On 1/30/26 10:00 AM, Andrew Morton wrote:
> > > > > > > > > > On Fri, 30 Jan 2026 15:45:29 +0100 Thomas Hellström
> > > > > > > > > > wrote:
> > > > > > > > > ...
> > > > > > > I’m not convinced the folio refcount has any bearing if we
> > > > > > > can take a
> > > > > > > sleeping lock in do_swap_page, but perhaps I’m missing
> > > > > > > something.
> >
> > I think the point of the trylock vs. lock is that if you can't
> > immediately
> > lock the page then it's an indication the page is undergoing a
> > migration.
> > In other words there's no point waiting for the lock and then trying
> > to call
> > migrate_to_ram() as the page will have already moved by the time you
> > acquire
> > the lock. Of course that just means you spin faulting until the page
> > finally
> > migrates.
> >
> > If I'm understanding the problem it sounds like we just want to sleep
> > until the
> > migration is complete, ie. same as the migration entry path. We don't
> > have a
> > device_private_entry_wait() function, but I don't think we need one,
> > see below.
> >
> > > > > diff --git a/mm/memory.c b/mm/memory.c
> > > > > index da360a6eb8a4..1e7ccc4a1a6c 100644
> > > > > --- a/mm/memory.c
> > > > > +++ b/mm/memory.c
> > > > > @@ -4652,6 +4652,8 @@ vm_fault_t do_swap_page(struct vm_fault
> > > > > *vmf)
> > > > > vmf->page = softleaf_to_page(entry);
> > > > > ret =
> > > > > remove_device_exclusive_entry(vmf);
> > > > > } else if (softleaf_is_device_private(entry))
> > > > > {
> > > > > + struct dev_pagemap *pgmap;
> > > > > +
> > > > > if (vmf->flags & FAULT_FLAG_VMA_LOCK)
> > > > > {
> > > > > /*
> > > > > * migrate_to_ram is not yet
> > > > > ready to operate
> > > > > @@ -4670,21 +4672,15 @@ vm_fault_t do_swap_page(struct vm_fault
> > > > > *vmf)
> > > > > vmf-
> > > > > >orig_pte)))
> > > > > goto unlock;
> > > > >
> > > > > - /*
> > > > > - * Get a page reference while we know
> > > > > the page can't be
> > > > > - * freed.
> > > > > - */
> > > > > - if (trylock_page(vmf->page)) {
> > > > > - struct dev_pagemap *pgmap;
> > > > > -
> > > > > - get_page(vmf->page);
> >
> > At this point we:
> > 1. Know the page needs to migrate
> > 2. Have the page locked
> > 3. Have a reference on the page
> > 4. Have the PTL locked
> >
> > Or in other words we have everything we need to install a migration
> > entry,
> > so why not just do that? This thread would then proceed into
> > migrate_to_ram()
> > having already done migrate_vma_collect_pmd() for the faulting page
> > and any
> > other threads would just sleep in the wait on migration entry path
> > until the
> > migration is complete, avoiding the livelock problem the trylock was
> > introduced
> > for in 1afaeb8293c9a.
> >
> > - Alistair
> >
> > > >
>
> There will always be a small time between when the page is locked and
> when we can install a migration entry. If the page only has a single
> mapcount, then the PTL lock is held during this time so the issue does
> not occur. But for multiple map-counts we need to release the PTL lock
> in migration to run try_to_migrate(), and before that, the migrate code
> is running lru_add_drain_all() and gets stuck.
Oh right, my solution would be fine for the single-mapping case but I hadn't
fully thought through the implications of other threads accessing this for
multiple mapcounts. Agree it doesn't solve anything there (the rest of the
threads would still spin on the trylock).
Still, we could use a similar solution for waiting on device-private entries as
we do for migration entries. Instead of spinning on the trylock (i.e. PG_locked)
we could just wait for it to become unlocked if it's already locked. Would
something like the below completely untested code work? (Obviously this is a bit
of a hack; to do it properly you'd want to do more than just remove the check from
migration_entry_wait.)
---
diff --git a/mm/memory.c b/mm/memory.c
index 2a55edc48a65..3e5e205ee279 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -4678,10 +4678,10 @@ vm_fault_t do_swap_page(struct vm_fault *vmf)
pte_unmap_unlock(vmf->pte, vmf->ptl);
pgmap = page_pgmap(vmf->page);
ret = pgmap->ops->migrate_to_ram(vmf);
- unlock_page(vmf->page);
put_page(vmf->page);
} else {
- pte_unmap_unlock(vmf->pte, vmf->ptl);
+ migration_entry_wait(vma->vm_mm, vmf->pmd,
+ vmf->address);
}
} else if (softleaf_is_hwpoison(entry)) {
ret = VM_FAULT_HWPOISON;
diff --git a/mm/migrate.c b/mm/migrate.c
index 5169f9717f60..b676daf0f4e8 100644
--- a/mm/migrate.c
+++ b/mm/migrate.c
@@ -496,8 +496,6 @@ void migration_entry_wait(struct mm_struct *mm, pmd_t *pmd,
goto out;
entry = softleaf_from_pte(pte);
- if (!softleaf_is_migration(entry))
- goto out;
migration_entry_wait_on_locked(entry, ptl);
return;
^ permalink raw reply [flat|nested] 30+ messages in thread
* Re: [PATCH] mm/hmm: Fix a hmm_range_fault() livelock / starvation problem
2026-02-02 9:13 ` Thomas Hellström
@ 2026-02-02 10:34 ` Alistair Popple
2026-02-02 10:51 ` Thomas Hellström
2026-02-02 22:28 ` John Hubbard
1 sibling, 1 reply; 30+ messages in thread
From: Alistair Popple @ 2026-02-02 10:34 UTC (permalink / raw)
To: Thomas Hellström
Cc: John Hubbard, Matthew Brost, Andrew Morton, intel-xe,
Ralph Campbell, Christoph Hellwig, Jason Gunthorpe,
Jason Gunthorpe, Leon Romanovsky, linux-mm, stable, dri-devel
On 2026-02-02 at 20:13 +1100, Thomas Hellström <thomas.hellstrom@linux.intel.com> wrote...
> On Sat, 2026-01-31 at 13:42 -0800, John Hubbard wrote:
> > On 1/31/26 11:00 AM, Matthew Brost wrote:
> > > On Sat, Jan 31, 2026 at 01:57:21PM +0100, Thomas Hellström wrote:
> > > > On Fri, 2026-01-30 at 19:01 -0800, John Hubbard wrote:
> > > > > On 1/30/26 10:00 AM, Andrew Morton wrote:
> > > > > > On Fri, 30 Jan 2026 15:45:29 +0100 Thomas Hellström
> > > > > > <thomas.hellstrom@linux.intel.com> wrote:
> > > > > ...
> >
> > >
> > > > I'm also not sure a folio refcount should block migration after
> > > > the
> > > > introduction of pinned (like in pin_user_pages) pages. Rather
> > > > perhaps a
> > > > folio pin-count should block migration and in that case
> > > > do_swap_page()
> > > > can definitely do a sleeping folio lock and the problem is gone.
> >
> > A problem for that specific point is that pincount and refcount both
> > mean, "the page is pinned" (which in turn literally means "not
> > allowed
> > to migrate/move").
>
> Yeah this is what I actually want to challenge since this is what
> blocks us from doing a clean robust solution here. From brief reading
> of the docs around the pin-count implementation, I understand it as "If
> you want to access the struct page metadata, get a refcount, If you
> want to access the actual memory of a page, take a pin-count"
>
> I guess that might still not be true for all old instances in the
> kernel using get_user_pages() instead of pin_user_pages() for things
> like DMA, but perhaps we can set that in stone and document it at least
> for device-private pages for now which would be sufficient for the
> do_swap_pages() refcount not to block migration.
Having just spent a long time cleaning up a bunch of special rules/cases for
ZONE_DEVICE page refcounting I'm rather against reintroducing them just for some
ZONE_DEVICE pages. So whatever arguments are applied or introduced here would
need to be made to work for all pages, not just some ZONE_DEVICE pages.
> >
> > (In fact, pincount is implemented in terms of refcount, in most
> > configurations still.)
>
> Yes but that's only a space optimization never intended to conflict,
> right? Meaning a pin-count will imply a refcount but a refcount will
> never imply a pin-count?
>
> Thanks,
> Thomas
>
^ permalink raw reply [flat|nested] 30+ messages in thread
* Re: [PATCH] mm/hmm: Fix a hmm_range_fault() livelock / starvation problem
2026-02-02 10:25 ` Alistair Popple
@ 2026-02-02 10:41 ` Thomas Hellström
2026-02-02 11:22 ` Alistair Popple
0 siblings, 1 reply; 30+ messages in thread
From: Thomas Hellström @ 2026-02-02 10:41 UTC (permalink / raw)
To: Alistair Popple
Cc: Matthew Brost, John Hubbard, Andrew Morton, intel-xe,
Ralph Campbell, Christoph Hellwig, Jason Gunthorpe,
Jason Gunthorpe, Leon Romanovsky, linux-mm, stable, dri-devel
On Mon, 2026-02-02 at 21:25 +1100, Alistair Popple wrote:
> On 2026-02-02 at 20:30 +1100, Thomas Hellström
> <thomas.hellstrom@linux.intel.com> wrote...
> > Hi,
> >
> > On Mon, 2026-02-02 at 11:10 +1100, Alistair Popple wrote:
> > > On 2026-02-02 at 08:07 +1100, Matthew Brost
> > > <matthew.brost@intel.com>
> > > wrote...
> > > > On Sun, Feb 01, 2026 at 12:48:33PM -0800, John Hubbard wrote:
> > > > > On 2/1/26 11:24 AM, Matthew Brost wrote:
> > > > > > On Sat, Jan 31, 2026 at 01:42:20PM -0800, John Hubbard
> > > > > > wrote:
> > > > > > > On 1/31/26 11:00 AM, Matthew Brost wrote:
> > > > > > > > On Sat, Jan 31, 2026 at 01:57:21PM +0100, Thomas
> > > > > > > > Hellström
> > > > > > > > wrote:
> > > > > > > > > On Fri, 2026-01-30 at 19:01 -0800, John Hubbard
> > > > > > > > > wrote:
> > > > > > > > > > On 1/30/26 10:00 AM, Andrew Morton wrote:
> > > > > > > > > > > On Fri, 30 Jan 2026 15:45:29 +0100 Thomas
> > > > > > > > > > > Hellström
> > > > > > > > > > > wrote:
> > > > > > > > > > ...
> > > > > > > > I’m not convinced the folio refcount has any bearing if
> > > > > > > > we
> > > > > > > > can take a
> > > > > > > > sleeping lock in do_swap_page, but perhaps I’m missing
> > > > > > > > something.
> > >
> > > I think the point of the trylock vs. lock is that if you can't
> > > immediately
> > > lock the page then it's an indication the page is undergoing a
> > > migration.
> > > In other words there's no point waiting for the lock and then
> > > trying
> > > to call
> > > migrate_to_ram() as the page will have already moved by the time
> > > you
> > > acquire
> > > the lock. Of course that just means you spin faulting until the
> > > page
> > > finally
> > > migrates.
> > >
> > > If I'm understanding the problem it sounds like we just want to
> > > sleep
> > > until the
> > > migration is complete, ie. same as the migration entry path. We
> > > don't
> > > have a
> > > device_private_entry_wait() function, but I don't think we need
> > > one,
> > > see below.
> > >
> > > > > > diff --git a/mm/memory.c b/mm/memory.c
> > > > > > index da360a6eb8a4..1e7ccc4a1a6c 100644
> > > > > > --- a/mm/memory.c
> > > > > > +++ b/mm/memory.c
> > > > > > @@ -4652,6 +4652,8 @@ vm_fault_t do_swap_page(struct
> > > > > > vm_fault
> > > > > > *vmf)
> > > > > > vmf->page =
> > > > > > softleaf_to_page(entry);
> > > > > > ret =
> > > > > > remove_device_exclusive_entry(vmf);
> > > > > > } else if
> > > > > > (softleaf_is_device_private(entry))
> > > > > > {
> > > > > > + struct dev_pagemap *pgmap;
> > > > > > +
> > > > > > if (vmf->flags &
> > > > > > FAULT_FLAG_VMA_LOCK)
> > > > > > {
> > > > > > /*
> > > > > > * migrate_to_ram is not
> > > > > > yet
> > > > > > ready to operate
> > > > > > @@ -4670,21 +4672,15 @@ vm_fault_t do_swap_page(struct
> > > > > > vm_fault
> > > > > > *vmf)
> > > > > >
> > > > > > vmf-
> > > > > > > orig_pte)))
> > > > > > goto unlock;
> > > > > >
> > > > > > - /*
> > > > > > - * Get a page reference while we
> > > > > > know
> > > > > > the page can't be
> > > > > > - * freed.
> > > > > > - */
> > > > > > - if (trylock_page(vmf->page)) {
> > > > > > - struct dev_pagemap *pgmap;
> > > > > > -
> > > > > > - get_page(vmf->page);
> > >
> > > At this point we:
> > > 1. Know the page needs to migrate
> > > 2. Have the page locked
> > > 3. Have a reference on the page
> > > 4. Have the PTL locked
> > >
> > > Or in other words we have everything we need to install a
> > > migration
> > > entry,
> > > so why not just do that? This thread would then proceed into
> > > migrate_to_ram()
> > > having already done migrate_vma_collect_pmd() for the faulting
> > > page
> > > and any
> > > other threads would just sleep in the wait on migration entry
> > > path
> > > until the
> > > migration is complete, avoiding the livelock problem the trylock
> > > was
> > > introduced
> > > for in 1afaeb8293c9a.
> > >
> > > - Alistair
> > >
> > > > >
> >
> > There will always be a small time between when the page is locked
> > and
> > when we can install a migration entry. If the page only has a
> > single
> > mapcount, then the PTL lock is held during this time so the issue
> > does
> > not occur. But for multiple map-counts we need to release the PTL
> > lock
> > in migration to run try_to_migrate(), and before that, the migrate
> > code
> > is running lru_add_drain_all() and gets stuck.
>
> Oh right, my solution would be fine for the single mapping case but I
> hadn't
> fully thought through the implications of other threads accessing
> this for
> multiple map-counts. Agree it doesn't solve anything there (the rest
> of the
> threads would still spin on the trylock).
>
> Still we could use a similar solution for waiting on device-private
> entries as
> we do for migration entries. Instead of spinning on the trylock (ie.
> PG_locked)
> we could just wait on it to become unlocked if it's already locked.
> Would
> something like the below completely untested code work? (obviously
> this is a bit
> of hack, to do it properly you'd want to do more than just remove the
> check from
> migration_entry_wait)
Well, I guess there could be a failed migration where something aborts
the migration even after a page is locked. Also, we must unlock
the PTL before waiting, otherwise we could deadlock.
I believe a robust solution would be to take a folio reference and do a
sleeping lock like in John's example, and then to assert that a folio pin-
count, not a refcount, is required to pin a device-private folio. That
would eliminate the problem of the refcount held across the lock blocking
migration. It looks like that's fully consistent with
https://docs.kernel.org/core-api/pin_user_pages.html
Then, as general improvements, we should fully unmap pages before calling
lru_add_drain_all() as MBrost suggests and finally, to be nicer to
the system in the common cases, add a cond_resched() to
hmm_range_fault().
Thanks,
Thomas
>
> ---
>
> diff --git a/mm/memory.c b/mm/memory.c
> index 2a55edc48a65..3e5e205ee279 100644
> --- a/mm/memory.c
> +++ b/mm/memory.c
> @@ -4678,10 +4678,10 @@ vm_fault_t do_swap_page(struct vm_fault *vmf)
> pte_unmap_unlock(vmf->pte, vmf-
> >ptl);
> pgmap = page_pgmap(vmf->page);
> ret = pgmap->ops-
> >migrate_to_ram(vmf);
> - unlock_page(vmf->page);
> put_page(vmf->page);
> } else {
> - pte_unmap_unlock(vmf->pte, vmf-
> >ptl);
> + migration_entry_wait(vma->vm_mm,
> vmf->pmd,
> + vmf->address);
> }
> } else if (softleaf_is_hwpoison(entry)) {
> ret = VM_FAULT_HWPOISON;
> diff --git a/mm/migrate.c b/mm/migrate.c
> index 5169f9717f60..b676daf0f4e8 100644
> --- a/mm/migrate.c
> +++ b/mm/migrate.c
> @@ -496,8 +496,6 @@ void migration_entry_wait(struct mm_struct *mm,
> pmd_t *pmd,
> goto out;
>
> entry = softleaf_from_pte(pte);
> - if (!softleaf_is_migration(entry))
> - goto out;
>
> migration_entry_wait_on_locked(entry, ptl);
> return;
^ permalink raw reply [flat|nested] 30+ messages in thread
* Re: [PATCH] mm/hmm: Fix a hmm_range_fault() livelock / starvation problem
2026-02-02 10:34 ` Alistair Popple
@ 2026-02-02 10:51 ` Thomas Hellström
2026-02-02 11:28 ` Alistair Popple
0 siblings, 1 reply; 30+ messages in thread
From: Thomas Hellström @ 2026-02-02 10:51 UTC (permalink / raw)
To: Alistair Popple
Cc: John Hubbard, Matthew Brost, Andrew Morton, intel-xe,
Ralph Campbell, Christoph Hellwig, Jason Gunthorpe,
Jason Gunthorpe, Leon Romanovsky, linux-mm, stable, dri-devel
On Mon, 2026-02-02 at 21:34 +1100, Alistair Popple wrote:
> On 2026-02-02 at 20:13 +1100, Thomas Hellström
> <thomas.hellstrom@linux.intel.com> wrote...
> > On Sat, 2026-01-31 at 13:42 -0800, John Hubbard wrote:
> > > On 1/31/26 11:00 AM, Matthew Brost wrote:
> > > > On Sat, Jan 31, 2026 at 01:57:21PM +0100, Thomas Hellström
> > > > wrote:
> > > > > On Fri, 2026-01-30 at 19:01 -0800, John Hubbard wrote:
> > > > > > On 1/30/26 10:00 AM, Andrew Morton wrote:
> > > > > > > On Fri, 30 Jan 2026 15:45:29 +0100 Thomas Hellström
> > > > > > > <thomas.hellstrom@linux.intel.com> wrote:
> > > > > > ...
> > >
> > > >
> > > > > I'm also not sure a folio refcount should block migration
> > > > > after
> > > > > the
> > > > > introduction of pinned (like in pin_user_pages) pages. Rather
> > > > > perhaps a
> > > > > folio pin-count should block migration and in that case
> > > > > do_swap_page()
> > > > > can definitely do a sleeping folio lock and the problem is
> > > > > gone.
> > >
> > > A problem for that specific point is that pincount and refcount
> > > both
> > > mean, "the page is pinned" (which in turn literally means "not
> > > allowed
> > > to migrate/move").
> >
> > Yeah this is what I actually want to challenge since this is what
> > blocks us from doing a clean robust solution here. From brief
> > reading
> > of the docs around the pin-count implementation, I understand it as
> > "If
> > you want to access the struct page metadata, get a refcount, If you
> > want to access the actual memory of a page, take a pin-count"
> >
> > I guess that might still not be true for all old instances in the
> > kernel using get_user_pages() instead of pin_user_pages() for
> > things
> > like DMA, but perhaps we can set that in stone and document it at
> > least
> > for device-private pages for now which would be sufficient for the
> > do_swap_pages() refcount not to block migration.
>
> Having just spent a long time cleaning up a bunch of special
> rules/cases for
> ZONE_DEVICE page refcounting I'm rather against reintroducing them
> just for some
> ZONE_DEVICE pages. So whatever arguments are applied or introduced
> here would
> need to be made to work for all pages, not just some ZONE_DEVICE
> pages.
That's completely understandable. I would like to be able to say that if
we apply the argument only when the pages being checked for a pin-count
are locked, lru-isolated and have a zero mapcount, then it would hold for
all pages, but my knowledge of the mm internals isn't sufficient,
unfortunately.
So even if that were the ultimate goal, we would probably have to be
prepared to revert (at least temporarily) such a solution for
!ZONE_DEVICE pages and have a plan for handling that.
Thanks,
Thomas
>
> > >
> > > (In fact, pincount is implemented in terms of refcount, in most
> > > configurations still.)
> >
> > Yes but that's only a space optimization never intended to
> > conflict,
> > right? Meaning a pin-count will imply a refcount but a refcount
> > will
> > never imply a pin-count?
> >
> > Thanks,
> > Thomas
> >
^ permalink raw reply [flat|nested] 30+ messages in thread
* Re: [PATCH] mm/hmm: Fix a hmm_range_fault() livelock / starvation problem
2026-02-02 10:41 ` Thomas Hellström
@ 2026-02-02 11:22 ` Alistair Popple
2026-02-02 11:44 ` Thomas Hellström
0 siblings, 1 reply; 30+ messages in thread
From: Alistair Popple @ 2026-02-02 11:22 UTC (permalink / raw)
To: Thomas Hellström
Cc: Matthew Brost, John Hubbard, Andrew Morton, intel-xe,
Ralph Campbell, Christoph Hellwig, Jason Gunthorpe,
Jason Gunthorpe, Leon Romanovsky, linux-mm, stable, dri-devel
On 2026-02-02 at 21:41 +1100, Thomas Hellström <thomas.hellstrom@linux.intel.com> wrote...
> On Mon, 2026-02-02 at 21:25 +1100, Alistair Popple wrote:
> > On 2026-02-02 at 20:30 +1100, Thomas Hellström
> > <thomas.hellstrom@linux.intel.com> wrote...
> > > Hi,
> > >
> > > On Mon, 2026-02-02 at 11:10 +1100, Alistair Popple wrote:
> > > > On 2026-02-02 at 08:07 +1100, Matthew Brost
> > > > <matthew.brost@intel.com>
> > > > wrote...
> > > > > On Sun, Feb 01, 2026 at 12:48:33PM -0800, John Hubbard wrote:
> > > > > > On 2/1/26 11:24 AM, Matthew Brost wrote:
> > > > > > > On Sat, Jan 31, 2026 at 01:42:20PM -0800, John Hubbard
> > > > > > > wrote:
> > > > > > > > On 1/31/26 11:00 AM, Matthew Brost wrote:
> > > > > > > > > On Sat, Jan 31, 2026 at 01:57:21PM +0100, Thomas
> > > > > > > > > Hellström
> > > > > > > > > wrote:
> > > > > > > > > > On Fri, 2026-01-30 at 19:01 -0800, John Hubbard
> > > > > > > > > > wrote:
> > > > > > > > > > > On 1/30/26 10:00 AM, Andrew Morton wrote:
> > > > > > > > > > > > On Fri, 30 Jan 2026 15:45:29 +0100 Thomas
> > > > > > > > > > > > Hellström
> > > > > > > > > > > > wrote:
> > > > > > > > > > > ...
> > > > > > > > > I’m not convinced the folio refcount has any bearing if
> > > > > > > > > we
> > > > > > > > > can take a
> > > > > > > > > sleeping lock in do_swap_page, but perhaps I’m missing
> > > > > > > > > something.
> > > >
> > > > I think the point of the trylock vs. lock is that if you can't
> > > > immediately
> > > > lock the page then it's an indication the page is undergoing a
> > > > migration.
> > > > In other words there's no point waiting for the lock and then
> > > > trying
> > > > to call
> > > > migrate_to_ram() as the page will have already moved by the time
> > > > you
> > > > acquire
> > > > the lock. Of course that just means you spin faulting until the
> > > > page
> > > > finally
> > > > migrates.
> > > >
> > > > If I'm understanding the problem it sounds like we just want to
> > > > sleep
> > > > until the
> > > > migration is complete, ie. same as the migration entry path. We
> > > > don't
> > > > have a
> > > > device_private_entry_wait() function, but I don't think we need
> > > > one,
> > > > see below.
> > > >
> > > > > > > diff --git a/mm/memory.c b/mm/memory.c
> > > > > > > index da360a6eb8a4..1e7ccc4a1a6c 100644
> > > > > > > --- a/mm/memory.c
> > > > > > > +++ b/mm/memory.c
> > > > > > > @@ -4652,6 +4652,8 @@ vm_fault_t do_swap_page(struct
> > > > > > > vm_fault
> > > > > > > *vmf)
> > > > > > > vmf->page =
> > > > > > > softleaf_to_page(entry);
> > > > > > > ret =
> > > > > > > remove_device_exclusive_entry(vmf);
> > > > > > > } else if
> > > > > > > (softleaf_is_device_private(entry))
> > > > > > > {
> > > > > > > + struct dev_pagemap *pgmap;
> > > > > > > +
> > > > > > > if (vmf->flags &
> > > > > > > FAULT_FLAG_VMA_LOCK)
> > > > > > > {
> > > > > > > /*
> > > > > > > * migrate_to_ram is not
> > > > > > > yet
> > > > > > > ready to operate
> > > > > > > @@ -4670,21 +4672,15 @@ vm_fault_t do_swap_page(struct
> > > > > > > vm_fault
> > > > > > > *vmf)
> > > > > > >
> > > > > > > vmf-
> > > > > > > > orig_pte)))
> > > > > > > goto unlock;
> > > > > > >
> > > > > > > - /*
> > > > > > > - * Get a page reference while we
> > > > > > > know
> > > > > > > the page can't be
> > > > > > > - * freed.
> > > > > > > - */
> > > > > > > - if (trylock_page(vmf->page)) {
> > > > > > > - struct dev_pagemap *pgmap;
> > > > > > > -
> > > > > > > - get_page(vmf->page);
> > > >
> > > > At this point we:
> > > > 1. Know the page needs to migrate
> > > > 2. Have the page locked
> > > > 3. Have a reference on the page
> > > > 4. Have the PTL locked
> > > >
> > > > Or in other words we have everything we need to install a
> > > > migration
> > > > entry,
> > > > so why not just do that? This thread would then proceed into
> > > > migrate_to_ram()
> > > > having already done migrate_vma_collect_pmd() for the faulting
> > > > page
> > > > and any
> > > > other threads would just sleep in the wait on migration entry
> > > > path
> > > > until the
> > > > migration is complete, avoiding the livelock problem the trylock
> > > > was
> > > > introduced
> > > > for in 1afaeb8293c9a.
> > > >
> > > > - Alistair
> > > >
> > > > > >
> > >
> > > There will always be a small time between when the page is locked
> > > and
> > > when we can install a migration entry. If the page only has a
> > > single
> > > mapcount, then the PTL lock is held during this time so the issue
> > > does
> > > not occur. But for multiple map-counts we need to release the PTL
> > > lock
> > > in migration to run try_to_migrate(), and before that, the migrate
> > > code
> > > is running lru_add_drain_all() and gets stuck.
> >
> > Oh right, my solution would be fine for the single mapping case but I
> > hadn't
> > fully thought through the implications of other threads accessing
> > this for
> > multiple map-counts. Agree it doesn't solve anything there (the rest
> > of the
> > threads would still spin on the trylock).
> >
> > Still we could use a similar solution for waiting on device-private
> > entries as
> > we do for migration entries. Instead of spinning on the trylock (ie.
> > PG_locked)
> > we could just wait on it to become unlocked if it's already locked.
> > Would
> > something like the below completely untested code work? (obviously
> > this is a bit
> > of hack, to do it properly you'd want to do more than just remove the
> > check from
> > migration_entry_wait)
>
> Well I guess there could be failed migration where something is
> aborting the migration even after a page is locked. Also we must unlock
> the PTL lock before waiting otherwise we could deadlock.
Yes, this is exactly what the migration entry wait code does. And if there's a
failed migration, no problem, you just retry. That's not a deadlock unless the
migration never succeeds, and then you're stuffed anyway.
> I believe a robust solution would be to take a folio reference and do a
> sleeping lock like John's example. Then to assert that a folio pin-
> count, not ref-count is required to pin a device-private folio. That
> would eliminate the problem of the refcount held while locking blocking
> migration. It looks like that's fully consistent with
Waiting on a migration entry like in my example below is exactly the same as
sleeping on the page lock, other than that it just waits for the page to be
unlocked rather than trying to lock it.
Internally, migration_entry_wait_on_locked() is just an open-coded version
of folio_lock() which deals with dropping the PTL, and that works without a page
refcount.
So I don't understand how this solution isn't robust. It requires no funniness
with refcounts and works practically the same as a sleeping lock.
- Alistair
> https://docs.kernel.org/core-api/pin_user_pages.html
>
> Then as general improvements we should fully unmap pages before calling
> lru_add_drain_all() as MBrost suggest and finally, to be more nice to
> the system in the common cases, add a cond_resched() to
> hmm_range_fault().
>
> Thanks,
> Thomas
>
>
>
> >
> > ---
> >
> > diff --git a/mm/memory.c b/mm/memory.c
> > index 2a55edc48a65..3e5e205ee279 100644
> > --- a/mm/memory.c
> > +++ b/mm/memory.c
> > @@ -4678,10 +4678,10 @@ vm_fault_t do_swap_page(struct vm_fault *vmf)
> > pte_unmap_unlock(vmf->pte, vmf-
> > >ptl);
> > pgmap = page_pgmap(vmf->page);
> > ret = pgmap->ops-
> > >migrate_to_ram(vmf);
> > - unlock_page(vmf->page);
> > put_page(vmf->page);
> > } else {
> > - pte_unmap_unlock(vmf->pte, vmf-
> > >ptl);
> > + migration_entry_wait(vma->vm_mm,
> > vmf->pmd,
> > + vmf->address);
> > }
> > } else if (softleaf_is_hwpoison(entry)) {
> > ret = VM_FAULT_HWPOISON;
> > diff --git a/mm/migrate.c b/mm/migrate.c
> > index 5169f9717f60..b676daf0f4e8 100644
> > --- a/mm/migrate.c
> > +++ b/mm/migrate.c
> > @@ -496,8 +496,6 @@ void migration_entry_wait(struct mm_struct *mm,
> > pmd_t *pmd,
> > goto out;
> >
> > entry = softleaf_from_pte(pte);
> > - if (!softleaf_is_migration(entry))
> > - goto out;
> >
> > migration_entry_wait_on_locked(entry, ptl);
> > return;
^ permalink raw reply [flat|nested] 30+ messages in thread
* Re: [PATCH] mm/hmm: Fix a hmm_range_fault() livelock / starvation problem
2026-02-02 10:51 ` Thomas Hellström
@ 2026-02-02 11:28 ` Alistair Popple
0 siblings, 0 replies; 30+ messages in thread
From: Alistair Popple @ 2026-02-02 11:28 UTC (permalink / raw)
To: Thomas Hellström
Cc: John Hubbard, Matthew Brost, Andrew Morton, intel-xe,
Ralph Campbell, Christoph Hellwig, Jason Gunthorpe,
Jason Gunthorpe, Leon Romanovsky, linux-mm, stable, dri-devel
On 2026-02-02 at 21:51 +1100, Thomas Hellström <thomas.hellstrom@linux.intel.com> wrote...
> On Mon, 2026-02-02 at 21:34 +1100, Alistair Popple wrote:
> > On 2026-02-02 at 20:13 +1100, Thomas Hellström
> > <thomas.hellstrom@linux.intel.com> wrote...
> > > On Sat, 2026-01-31 at 13:42 -0800, John Hubbard wrote:
> > > > On 1/31/26 11:00 AM, Matthew Brost wrote:
> > > > > On Sat, Jan 31, 2026 at 01:57:21PM +0100, Thomas Hellström
> > > > > wrote:
> > > > > > On Fri, 2026-01-30 at 19:01 -0800, John Hubbard wrote:
> > > > > > > On 1/30/26 10:00 AM, Andrew Morton wrote:
> > > > > > > > On Fri, 30 Jan 2026 15:45:29 +0100 Thomas Hellström
> > > > > > > > <thomas.hellstrom@linux.intel.com> wrote:
> > > > > > > ...
> > > >
> > > > >
> > > > > > I'm also not sure a folio refcount should block migration
> > > > > > after
> > > > > > the
> > > > > > introduction of pinned (like in pin_user_pages) pages. Rather
> > > > > > perhaps a
> > > > > > folio pin-count should block migration and in that case
> > > > > > do_swap_page()
> > > > > > can definitely do a sleeping folio lock and the problem is
> > > > > > gone.
> > > >
> > > > A problem for that specific point is that pincount and refcount
> > > > both
> > > > mean, "the page is pinned" (which in turn literally means "not
> > > > allowed
> > > > to migrate/move").
> > >
> > > Yeah this is what I actually want to challenge since this is what
> > > blocks us from doing a clean robust solution here. From brief
> > > reading
> > > of the docs around the pin-count implementation, I understand it as
> > > "If
> > > you want to access the struct page metadata, get a refcount, If you
> > > want to access the actual memory of a page, take a pin-count"
> > >
> > > I guess that might still not be true for all old instances in the
> > > kernel using get_user_pages() instead of pin_user_pages() for
> > > things
> > > like DMA, but perhaps we can set that in stone and document it at
> > > least
> > > for device-private pages for now which would be sufficient for the
> > > do_swap_pages() refcount not to block migration.
> >
> > Having just spent a long time cleaning up a bunch of special
> > rules/cases for
> > ZONE_DEVICE page refcounting I'm rather against reintroducing them
> > just for some
> > ZONE_DEVICE pages. So whatever arguments are applied or introduced
> > here would
> > need to be made to work for all pages, not just some ZONE_DEVICE
> > pages.
>
> That's completely understandable. I would like to be able to say if we
> apply the argument that when checking the pin-count pages are locked,
> lru-isolated and with zero map-count then that would hold for all
> pages, but my knowledge of the mm internals isn't sufficient
> unfortunately.
We don't actually have a good model for pinning device-private pages anyway
so I'm open to discussion, but I don't think we need to do that to solve this
problem. I would appreciate it if you could look at the proposed solution in the
other thread a little bit more closely - AFAICT it should address your problem
by doing the same thing as replacing the trylock_page() with lock_page() without
requiring getting a page reference, etc.
- Alistair
> So even if that would be an ultimate goal, we would probably have to be
> prepared to have to revert (at least temporarily) such a solution for
> !ZONE_DEVICE pages and have a plan for handling that.
>
> Thanks,
> Thomas
>
>
> >
> > > >
> > > > (In fact, pincount is implemented in terms of refcount, in most
> > > > configurations still.)
> > >
> > > Yes but that's only a space optimization never intended to
> > > conflict,
> > > right? Meaning a pin-count will imply a refcount but a refcount
> > > will
> > > never imply a pin-count?
> > >
> > > Thanks,
> > > Thomas
> > >
^ permalink raw reply [flat|nested] 30+ messages in thread
* Re: [PATCH] mm/hmm: Fix a hmm_range_fault() livelock / starvation problem
2026-02-02 11:22 ` Alistair Popple
@ 2026-02-02 11:44 ` Thomas Hellström
2026-02-02 12:26 ` Alistair Popple
0 siblings, 1 reply; 30+ messages in thread
From: Thomas Hellström @ 2026-02-02 11:44 UTC (permalink / raw)
To: Alistair Popple
Cc: Matthew Brost, John Hubbard, Andrew Morton, intel-xe,
Ralph Campbell, Christoph Hellwig, Jason Gunthorpe,
Jason Gunthorpe, Leon Romanovsky, linux-mm, stable, dri-devel
On Mon, 2026-02-02 at 22:22 +1100, Alistair Popple wrote:
> On 2026-02-02 at 21:41 +1100, Thomas Hellström
> <thomas.hellstrom@linux.intel.com> wrote...
> > On Mon, 2026-02-02 at 21:25 +1100, Alistair Popple wrote:
> > > On 2026-02-02 at 20:30 +1100, Thomas Hellström
> > > <thomas.hellstrom@linux.intel.com> wrote...
> > > > Hi,
> > > >
> > > > On Mon, 2026-02-02 at 11:10 +1100, Alistair Popple wrote:
> > > > > On 2026-02-02 at 08:07 +1100, Matthew Brost
> > > > > <matthew.brost@intel.com>
> > > > > wrote...
> > > > > > On Sun, Feb 01, 2026 at 12:48:33PM -0800, John Hubbard
> > > > > > wrote:
> > > > > > > On 2/1/26 11:24 AM, Matthew Brost wrote:
> > > > > > > > On Sat, Jan 31, 2026 at 01:42:20PM -0800, John Hubbard
> > > > > > > > wrote:
> > > > > > > > > On 1/31/26 11:00 AM, Matthew Brost wrote:
> > > > > > > > > > On Sat, Jan 31, 2026 at 01:57:21PM +0100, Thomas
> > > > > > > > > > Hellström
> > > > > > > > > > wrote:
> > > > > > > > > > > On Fri, 2026-01-30 at 19:01 -0800, John Hubbard
> > > > > > > > > > > wrote:
> > > > > > > > > > > > On 1/30/26 10:00 AM, Andrew Morton wrote:
> > > > > > > > > > > > > On Fri, 30 Jan 2026 15:45:29 +0100 Thomas
> > > > > > > > > > > > > Hellström
> > > > > > > > > > > > > wrote:
> > > > > > > > > > > > ...
> > > > > > > > > > I’m not convinced the folio refcount has any
> > > > > > > > > > bearing if
> > > > > > > > > > we
> > > > > > > > > > can take a
> > > > > > > > > > sleeping lock in do_swap_page, but perhaps I’m
> > > > > > > > > > missing
> > > > > > > > > > something.
> > > > >
> > > > > I think the point of the trylock vs. lock is that if you
> > > > > can't
> > > > > immediately
> > > > > lock the page then it's an indication the page is undergoing
> > > > > a
> > > > > migration.
> > > > > In other words there's no point waiting for the lock and then
> > > > > trying
> > > > > to call
> > > > > migrate_to_ram() as the page will have already moved by the
> > > > > time
> > > > > you
> > > > > acquire
> > > > > the lock. Of course that just means you spin faulting until
> > > > > the
> > > > > page
> > > > > finally
> > > > > migrates.
> > > > >
> > > > > If I'm understanding the problem it sounds like we just want
> > > > > to
> > > > > sleep
> > > > > until the
> > > > > migration is complete, ie. same as the migration entry path.
> > > > > We
> > > > > don't
> > > > > have a
> > > > > device_private_entry_wait() function, but I don't think we
> > > > > need
> > > > > one,
> > > > > see below.
> > > > >
> > > > > > > > diff --git a/mm/memory.c b/mm/memory.c
> > > > > > > > index da360a6eb8a4..1e7ccc4a1a6c 100644
> > > > > > > > --- a/mm/memory.c
> > > > > > > > +++ b/mm/memory.c
> > > > > > > > @@ -4652,6 +4652,8 @@ vm_fault_t do_swap_page(struct
> > > > > > > > vm_fault
> > > > > > > > *vmf)
> > > > > > > > vmf->page =
> > > > > > > > softleaf_to_page(entry);
> > > > > > > > ret =
> > > > > > > > remove_device_exclusive_entry(vmf);
> > > > > > > > } else if
> > > > > > > > (softleaf_is_device_private(entry))
> > > > > > > > {
> > > > > > > > + struct dev_pagemap *pgmap;
> > > > > > > > +
> > > > > > > > if (vmf->flags &
> > > > > > > > FAULT_FLAG_VMA_LOCK)
> > > > > > > > {
> > > > > > > > /*
> > > > > > > > * migrate_to_ram is
> > > > > > > > not
> > > > > > > > yet
> > > > > > > > ready to operate
> > > > > > > > @@ -4670,21 +4672,15 @@ vm_fault_t do_swap_page(struct
> > > > > > > > vm_fault
> > > > > > > > *vmf)
> > > > > > > >
> > > > > > > >
> > > > > > > > vmf-
> > > > > > > > > orig_pte)))
> > > > > > > > goto unlock;
> > > > > > > >
> > > > > > > > - /*
> > > > > > > > - * Get a page reference while
> > > > > > > > we
> > > > > > > > know
> > > > > > > > the page can't be
> > > > > > > > - * freed.
> > > > > > > > - */
> > > > > > > > - if (trylock_page(vmf->page)) {
> > > > > > > > - struct dev_pagemap
> > > > > > > > *pgmap;
> > > > > > > > -
> > > > > > > > - get_page(vmf->page);
> > > > >
> > > > > At this point we:
> > > > > 1. Know the page needs to migrate
> > > > > 2. Have the page locked
> > > > > 3. Have a reference on the page
> > > > > 4. Have the PTL locked
> > > > >
> > > > > Or in other words we have everything we need to install a
> > > > > migration
> > > > > entry,
> > > > > so why not just do that? This thread would then proceed into
> > > > > migrate_to_ram()
> > > > > having already done migrate_vma_collect_pmd() for the
> > > > > faulting
> > > > > page
> > > > > and any
> > > > > other threads would just sleep in the wait on migration entry
> > > > > path
> > > > > until the
> > > > > migration is complete, avoiding the livelock problem the
> > > > > trylock
> > > > > was
> > > > > introduced
> > > > > for in 1afaeb8293c9a.
> > > > >
> > > > > - Alistair
> > > > >
> > > > > > >
> > > >
> > > > There will always be a small time between when the page is
> > > > locked
> > > > and
> > > > when we can install a migration entry. If the page only has a
> > > > single
> > > > mapcount, then the PTL lock is held during this time so the
> > > > issue
> > > > does
> > > > not occur. But for multiple map-counts we need to release the
> > > > PTL
> > > > lock
> > > > in migration to run try_to_migrate(), and before that, the
> > > > migrate
> > > > code
> > > > is running lru_add_drain_all() and gets stuck.
> > >
> > > Oh right, my solution would be fine for the single mapping case
> > > but I
> > > hadn't
> > > fully thought through the implications of other threads accessing
> > > this for
> > > multiple map-counts. Agree it doesn't solve anything there (the
> > > rest
> > > of the
> > > threads would still spin on the trylock).
> > >
> > > Still we could use a similar solution for waiting on device-
> > > private
> > > entries as
> > > we do for migration entries. Instead of spinning on the trylock
> > > (ie.
> > > PG_locked)
> > > we could just wait on it to become unlocked if it's already
> > > locked.
> > > Would
> > > something like the below completely untested code work?
> > > (obviously
> > > this is a bit
> > > of hack, to do it properly you'd want to do more than just remove
> > > the
> > > check from
> > > migration_entry_wait)
> >
> > Well I guess there could be failed migration where something is
> > aborting the migration even after a page is locked. Also we must
> > unlock
> > the PTL lock before waiting otherwise we could deadlock.
>
> Yes, this is exactly what the migration entry wait code does. And if
> there's a
> failed migration, no problem, you just retry. That's not a deadlock
> unless the
> migration never succeeds and then your stuffed anyway.
>
> > I believe a robust solution would be to take a folio reference and
> > do a
> > sleeping lock like John's example. Then to assert that a folio pin-
> > count, not ref-count is required to pin a device-private folio.
> > That
> > would eliminate the problem of the refcount held while locking
> > blocking
> > migration. It looks like that's fully consistent with
>
> Waiting on a migration entry like in my example below is exactly the
> same as
> sleeping on the page lock other than it just waits for the page to be
> unlocked
> rather than trying to lock it.
>
> Internally migration_entry_wait_on_locked() is just an open-coded
> version
> of folio_lock() which deals with dropping the PTL and that works
> without a page
> refcount.
>
> So I don't understand how this solution isn't robust? It requires no
> funniness
> with refcounts and works practically the same as a sleeping lock.
You're right. I didn't look closely enough into what the
migration_entry_wait_on_locked() did. Sorry about that.
That would indeed fix the problem as well. Then the only argument
remaining for the get-a-reference-and-lock solution would be it's not
starvation prone in the same way. But that's definitely a problem I
think we could live with for now.
I'll give this code a test. BTW that removal of unlock_page() isn't
intentional, right?
Thanks,
Thomas
>
> - Alistair
>
> > https://docs.kernel.org/core-api/pin_user_pages.html
> >
> > Then as general improvements we should fully unmap pages before
> > calling
> > lru_add_drain_all() as MBrost suggest and finally, to be more nice
> > to
> > the system in the common cases, add a cond_resched() to
> > hmm_range_fault().
> >
> > Thanks,
> > Thomas
> >
> >
> >
> > >
> > > ---
> > >
> > > diff --git a/mm/memory.c b/mm/memory.c
> > > index 2a55edc48a65..3e5e205ee279 100644
> > > --- a/mm/memory.c
> > > +++ b/mm/memory.c
> > > @@ -4678,10 +4678,10 @@ vm_fault_t do_swap_page(struct vm_fault
> > > *vmf)
> > > pte_unmap_unlock(vmf->pte, vmf-
> > > > ptl);
> > > pgmap = page_pgmap(vmf->page);
> > > ret = pgmap->ops-
> > > > migrate_to_ram(vmf);
> > > - unlock_page(vmf->page);
> > > put_page(vmf->page);
> > > } else {
> > > - pte_unmap_unlock(vmf->pte, vmf-
> > > > ptl);
> > > + migration_entry_wait(vma->vm_mm,
> > > vmf->pmd,
> > > + vmf-
> > > >address);
> > > }
> > > } else if (softleaf_is_hwpoison(entry)) {
> > > ret = VM_FAULT_HWPOISON;
> > > diff --git a/mm/migrate.c b/mm/migrate.c
> > > index 5169f9717f60..b676daf0f4e8 100644
> > > --- a/mm/migrate.c
> > > +++ b/mm/migrate.c
> > > @@ -496,8 +496,6 @@ void migration_entry_wait(struct mm_struct
> > > *mm,
> > > pmd_t *pmd,
> > > goto out;
> > >
> > > entry = softleaf_from_pte(pte);
> > > - if (!softleaf_is_migration(entry))
> > > - goto out;
> > >
> > > migration_entry_wait_on_locked(entry, ptl);
> > > return;
^ permalink raw reply [flat|nested] 30+ messages in thread
* Re: [PATCH] mm/hmm: Fix a hmm_range_fault() livelock / starvation problem
2026-02-02 11:44 ` Thomas Hellström
@ 2026-02-02 12:26 ` Alistair Popple
2026-02-02 14:07 ` Thomas Hellström
0 siblings, 1 reply; 30+ messages in thread
From: Alistair Popple @ 2026-02-02 12:26 UTC (permalink / raw)
To: Thomas Hellström
Cc: Matthew Brost, John Hubbard, Andrew Morton, intel-xe,
Ralph Campbell, Christoph Hellwig, Jason Gunthorpe,
Jason Gunthorpe, Leon Romanovsky, linux-mm, stable, dri-devel
On 2026-02-02 at 22:44 +1100, Thomas Hellström <thomas.hellstrom@linux.intel.com> wrote...
> On Mon, 2026-02-02 at 22:22 +1100, Alistair Popple wrote:
> > On 2026-02-02 at 21:41 +1100, Thomas Hellström
> > <thomas.hellstrom@linux.intel.com> wrote...
> > > On Mon, 2026-02-02 at 21:25 +1100, Alistair Popple wrote:
> > > > On 2026-02-02 at 20:30 +1100, Thomas Hellström
> > > > <thomas.hellstrom@linux.intel.com> wrote...
> > > > > Hi,
> > > > >
> > > > > On Mon, 2026-02-02 at 11:10 +1100, Alistair Popple wrote:
> > > > > > On 2026-02-02 at 08:07 +1100, Matthew Brost
> > > > > > <matthew.brost@intel.com>
> > > > > > wrote...
> > > > > > > On Sun, Feb 01, 2026 at 12:48:33PM -0800, John Hubbard
> > > > > > > wrote:
> > > > > > > > On 2/1/26 11:24 AM, Matthew Brost wrote:
> > > > > > > > > On Sat, Jan 31, 2026 at 01:42:20PM -0800, John Hubbard
> > > > > > > > > wrote:
> > > > > > > > > > On 1/31/26 11:00 AM, Matthew Brost wrote:
> > > > > > > > > > > On Sat, Jan 31, 2026 at 01:57:21PM +0100, Thomas
> > > > > > > > > > > Hellström
> > > > > > > > > > > wrote:
> > > > > > > > > > > > On Fri, 2026-01-30 at 19:01 -0800, John Hubbard
> > > > > > > > > > > > wrote:
> > > > > > > > > > > > > On 1/30/26 10:00 AM, Andrew Morton wrote:
> > > > > > > > > > > > > > On Fri, 30 Jan 2026 15:45:29 +0100 Thomas
> > > > > > > > > > > > > > Hellström
> > > > > > > > > > > > > > wrote:
> > > > > > > > > > > > > ...
> > > > > > > > > > > I’m not convinced the folio refcount has any
> > > > > > > > > > > bearing if
> > > > > > > > > > > we
> > > > > > > > > > > can take a
> > > > > > > > > > > sleeping lock in do_swap_page, but perhaps I’m
> > > > > > > > > > > missing
> > > > > > > > > > > something.
> > > > > >
> > > > > > I think the point of the trylock vs. lock is that if you
> > > > > > can't
> > > > > > immediately
> > > > > > lock the page then it's an indication the page is undergoing
> > > > > > a
> > > > > > migration.
> > > > > > In other words there's no point waiting for the lock and then
> > > > > > trying
> > > > > > to call
> > > > > > migrate_to_ram() as the page will have already moved by the
> > > > > > time
> > > > > > you
> > > > > > acquire
> > > > > > the lock. Of course that just means you spin faulting until
> > > > > > the
> > > > > > page
> > > > > > finally
> > > > > > migrates.
> > > > > >
> > > > > > If I'm understanding the problem it sounds like we just want
> > > > > > to
> > > > > > sleep
> > > > > > until the
> > > > > > migration is complete, ie. same as the migration entry path.
> > > > > > We
> > > > > > don't
> > > > > > have a
> > > > > > device_private_entry_wait() function, but I don't think we
> > > > > > need
> > > > > > one,
> > > > > > see below.
> > > > > >
> > > > > > > > > diff --git a/mm/memory.c b/mm/memory.c
> > > > > > > > > index da360a6eb8a4..1e7ccc4a1a6c 100644
> > > > > > > > > --- a/mm/memory.c
> > > > > > > > > +++ b/mm/memory.c
> > > > > > > > > @@ -4652,6 +4652,8 @@ vm_fault_t do_swap_page(struct
> > > > > > > > > vm_fault
> > > > > > > > > *vmf)
> > > > > > > > > vmf->page =
> > > > > > > > > softleaf_to_page(entry);
> > > > > > > > > ret =
> > > > > > > > > remove_device_exclusive_entry(vmf);
> > > > > > > > > } else if
> > > > > > > > > (softleaf_is_device_private(entry))
> > > > > > > > > {
> > > > > > > > > + struct dev_pagemap *pgmap;
> > > > > > > > > +
> > > > > > > > > if (vmf->flags &
> > > > > > > > > FAULT_FLAG_VMA_LOCK)
> > > > > > > > > {
> > > > > > > > > /*
> > > > > > > > > * migrate_to_ram is
> > > > > > > > > not
> > > > > > > > > yet
> > > > > > > > > ready to operate
> > > > > > > > > @@ -4670,21 +4672,15 @@ vm_fault_t do_swap_page(struct
> > > > > > > > > vm_fault
> > > > > > > > > *vmf)
> > > > > > > > >
> > > > > > > > >
> > > > > > > > > vmf-
> > > > > > > > > > orig_pte)))
> > > > > > > > > goto unlock;
> > > > > > > > >
> > > > > > > > > - /*
> > > > > > > > > - * Get a page reference while
> > > > > > > > > we
> > > > > > > > > know
> > > > > > > > > the page can't be
> > > > > > > > > - * freed.
> > > > > > > > > - */
> > > > > > > > > - if (trylock_page(vmf->page)) {
> > > > > > > > > - struct dev_pagemap
> > > > > > > > > *pgmap;
> > > > > > > > > -
> > > > > > > > > - get_page(vmf->page);
> > > > > >
> > > > > > At this point we:
> > > > > > 1. Know the page needs to migrate
> > > > > > 2. Have the page locked
> > > > > > 3. Have a reference on the page
> > > > > > 4. Have the PTL locked
> > > > > >
> > > > > > Or in other words we have everything we need to install a
> > > > > > migration
> > > > > > entry,
> > > > > > so why not just do that? This thread would then proceed into
> > > > > > migrate_to_ram()
> > > > > > having already done migrate_vma_collect_pmd() for the
> > > > > > faulting
> > > > > > page
> > > > > > and any
> > > > > > other threads would just sleep in the wait on migration entry
> > > > > > path
> > > > > > until the
> > > > > > migration is complete, avoiding the livelock problem the
> > > > > > trylock
> > > > > > was
> > > > > > introduced
> > > > > > for in 1afaeb8293c9a.
> > > > > >
> > > > > > - Alistair
> > > > > >
> > > > > > > >
> > > > >
> > > > > There will always be a small time between when the page is
> > > > > locked
> > > > > and
> > > > > when we can install a migration entry. If the page only has a
> > > > > single
> > > > > mapcount, then the PTL lock is held during this time so the
> > > > > issue
> > > > > does
> > > > > not occur. But for multiple map-counts we need to release the
> > > > > PTL
> > > > > lock
> > > > > in migration to run try_to_migrate(), and before that, the
> > > > > migrate
> > > > > code
> > > > > is running lru_add_drain_all() and gets stuck.
> > > >
> > > > Oh right, my solution would be fine for the single mapping case
> > > > but I
> > > > hadn't
> > > > fully thought through the implications of other threads accessing
> > > > this for
> > > > multiple map-counts. Agree it doesn't solve anything there (the
> > > > rest
> > > > of the
> > > > threads would still spin on the trylock).
> > > >
> > > > Still we could use a similar solution for waiting on device-
> > > > private
> > > > entries as
> > > > we do for migration entries. Instead of spinning on the trylock
> > > > (ie.
> > > > PG_locked)
> > > > we could just wait on it to become unlocked if it's already
> > > > locked.
> > > > Would
> > > > something like the below completely untested code work?
> > > > (obviously
> > > > this is a bit
> > > > of hack, to do it properly you'd want to do more than just remove
> > > > the
> > > > check from
> > > > migration_entry_wait)
> > >
> > > Well I guess there could be failed migration where something is
> > > aborting the migration even after a page is locked. Also we must
> > > unlock
> > > the PTL lock before waiting otherwise we could deadlock.
> >
> > Yes, this is exactly what the migration entry wait code does. And if
> > there's a
> > failed migration, no problem, you just retry. That's not a deadlock
> > unless the
> > migration never succeeds and then your stuffed anyway.
> >
> > > I believe a robust solution would be to take a folio reference and
> > > do a
> > > sleeping lock like John's example. Then to assert that a folio pin-
> > > count, not ref-count is required to pin a device-private folio.
> > > That
> > > would eliminate the problem of the refcount held while locking
> > > blocking
> > > migration. It looks like that's fully consistent with
> >
> > Waiting on a migration entry like in my example below is exactly the
> > same as
> > sleeping on the page lock other than it just waits for the page to be
> > unlocked
> > rather than trying to lock it.
> >
> > Internally migration_entry_wait_on_locked() is just an open-coded
> > version
> > of folio_lock() which deals with dropping the PTL and that works
> > without a page
> > refcount.
> >
> > So I don't understand how this solution isn't robust? It requires no
> > funniness
> > with refcounts and works practically the same as a sleeping lock.
>
> You're right. I didn't look closely enough into what the
> migration_entry_wait_on_locked() did. Sorry about that.
No worries. I'm somewhat familiar with it from updating it specifically so it
wouldn't take a page reference as we used to have similar live-lock/starvation
issues in that path too.
> That would indeed fix the problem as well. Then the only argument
> remaining for the get-a-reference-and-lock solution would be it's not
> starvation prone in the same way. But that's definitely a problem I
> think we could live with for now.
I don't follow how this solution would be any more starvation prone than getting
a reference and locking - here the winning fault takes the lock and any other
faulting threads would just wait until it was released before returning from
the fault handler assuming it had been handled. But it's been a while since I've
thought about all the scenarios here so maybe I missed one.
> I'll give this code a test. BTW that removal of unlock_page() isn't
> intentional, right?
Thanks. And you're right, that was unintentional. Serves me for responding too
late at night :-)
- Alistair
> Thanks,
> Thomas
>
>
> >
> > - Alistair
> >
> > > https://docs.kernel.org/core-api/pin_user_pages.html
> > >
> > > Then as general improvements we should fully unmap pages before
> > > calling
> > > lru_add_drain_all() as MBrost suggest and finally, to be more nice
> > > to
> > > the system in the common cases, add a cond_resched() to
> > > hmm_range_fault().
> > >
> > > Thanks,
> > > Thomas
> > >
> > >
> > >
> > > >
> > > > ---
> > > >
> > > > diff --git a/mm/memory.c b/mm/memory.c
> > > > index 2a55edc48a65..3e5e205ee279 100644
> > > > --- a/mm/memory.c
> > > > +++ b/mm/memory.c
> > > > @@ -4678,10 +4678,10 @@ vm_fault_t do_swap_page(struct vm_fault
> > > > *vmf)
> > > > pte_unmap_unlock(vmf->pte, vmf-
> > > > > ptl);
> > > > pgmap = page_pgmap(vmf->page);
> > > > ret = pgmap->ops-
> > > > > migrate_to_ram(vmf);
> > > > - unlock_page(vmf->page);
> > > > put_page(vmf->page);
> > > > } else {
> > > > - pte_unmap_unlock(vmf->pte, vmf-
> > > > > ptl);
> > > > + migration_entry_wait(vma->vm_mm,
> > > > vmf->pmd,
> > > > + vmf-
> > > > >address);
> > > > }
> > > > } else if (softleaf_is_hwpoison(entry)) {
> > > > ret = VM_FAULT_HWPOISON;
> > > > diff --git a/mm/migrate.c b/mm/migrate.c
> > > > index 5169f9717f60..b676daf0f4e8 100644
> > > > --- a/mm/migrate.c
> > > > +++ b/mm/migrate.c
> > > > @@ -496,8 +496,6 @@ void migration_entry_wait(struct mm_struct
> > > > *mm,
> > > > pmd_t *pmd,
> > > > goto out;
> > > >
> > > > entry = softleaf_from_pte(pte);
> > > > - if (!softleaf_is_migration(entry))
> > > > - goto out;
> > > >
> > > > migration_entry_wait_on_locked(entry, ptl);
> > > > return;
^ permalink raw reply [flat|nested] 30+ messages in thread
* Re: [PATCH] mm/hmm: Fix a hmm_range_fault() livelock / starvation problem
2026-02-02 12:26 ` Alistair Popple
@ 2026-02-02 14:07 ` Thomas Hellström
2026-02-02 23:13 ` Alistair Popple
0 siblings, 1 reply; 30+ messages in thread
From: Thomas Hellström @ 2026-02-02 14:07 UTC (permalink / raw)
To: Alistair Popple
Cc: Matthew Brost, John Hubbard, Andrew Morton, intel-xe,
Ralph Campbell, Christoph Hellwig, Jason Gunthorpe,
Jason Gunthorpe, Leon Romanovsky, linux-mm, stable, dri-devel
On Mon, 2026-02-02 at 23:26 +1100, Alistair Popple wrote:
> On 2026-02-02 at 22:44 +1100, Thomas Hellström
> <thomas.hellstrom@linux.intel.com> wrote...
> > On Mon, 2026-02-02 at 22:22 +1100, Alistair Popple wrote:
> > > On 2026-02-02 at 21:41 +1100, Thomas Hellström
> > > <thomas.hellstrom@linux.intel.com> wrote...
> > > > On Mon, 2026-02-02 at 21:25 +1100, Alistair Popple wrote:
> > > > > On 2026-02-02 at 20:30 +1100, Thomas Hellström
> > > > > <thomas.hellstrom@linux.intel.com> wrote...
> > > > > > Hi,
> > > > > >
> > > > > > On Mon, 2026-02-02 at 11:10 +1100, Alistair Popple wrote:
> > > > > > > On 2026-02-02 at 08:07 +1100, Matthew Brost
> > > > > > > <matthew.brost@intel.com>
> > > > > > > wrote...
> > > > > > > > On Sun, Feb 01, 2026 at 12:48:33PM -0800, John Hubbard
> > > > > > > > wrote:
> > > > > > > > > On 2/1/26 11:24 AM, Matthew Brost wrote:
> > > > > > > > > > On Sat, Jan 31, 2026 at 01:42:20PM -0800, John
> > > > > > > > > > Hubbard
> > > > > > > > > > wrote:
> > > > > > > > > > > On 1/31/26 11:00 AM, Matthew Brost wrote:
> > > > > > > > > > > > On Sat, Jan 31, 2026 at 01:57:21PM +0100,
> > > > > > > > > > > > Thomas
> > > > > > > > > > > > Hellström
> > > > > > > > > > > > wrote:
> > > > > > > > > > > > > On Fri, 2026-01-30 at 19:01 -0800, John
> > > > > > > > > > > > > Hubbard
> > > > > > > > > > > > > wrote:
> > > > > > > > > > > > > > On 1/30/26 10:00 AM, Andrew Morton wrote:
> > > > > > > > > > > > > > > On Fri, 30 Jan 2026 15:45:29 +0100 Thomas
> > > > > > > > > > > > > > > Hellström
> > > > > > > > > > > > > > > wrote:
> > > > > > > > > > > > > > ...
> > > > > > > > > > > > I’m not convinced the folio refcount has any
> > > > > > > > > > > > bearing if
> > > > > > > > > > > > we
> > > > > > > > > > > > can take a
> > > > > > > > > > > > sleeping lock in do_swap_page, but perhaps I’m
> > > > > > > > > > > > missing
> > > > > > > > > > > > something.
> > > > > > >
> > > > > > > I think the point of the trylock vs. lock is that if you
> > > > > > > can't
> > > > > > > immediately
> > > > > > > lock the page then it's an indication the page is
> > > > > > > undergoing
> > > > > > > a
> > > > > > > migration.
> > > > > > > In other words there's no point waiting for the lock and
> > > > > > > then
> > > > > > > trying
> > > > > > > to call
> > > > > > > migrate_to_ram() as the page will have already moved by
> > > > > > > the
> > > > > > > time
> > > > > > > you
> > > > > > > acquire
> > > > > > > the lock. Of course that just means you spin faulting
> > > > > > > until
> > > > > > > the
> > > > > > > page
> > > > > > > finally
> > > > > > > migrates.
> > > > > > >
> > > > > > > If I'm understanding the problem it sounds like we just
> > > > > > > want
> > > > > > > to
> > > > > > > sleep
> > > > > > > until the
> > > > > > > migration is complete, ie. same as the migration entry
> > > > > > > path.
> > > > > > > We
> > > > > > > don't
> > > > > > > have a
> > > > > > > device_private_entry_wait() function, but I don't think
> > > > > > > we
> > > > > > > need
> > > > > > > one,
> > > > > > > see below.
> > > > > > >
> > > > > > > > > > diff --git a/mm/memory.c b/mm/memory.c
> > > > > > > > > > index da360a6eb8a4..1e7ccc4a1a6c 100644
> > > > > > > > > > --- a/mm/memory.c
> > > > > > > > > > +++ b/mm/memory.c
> > > > > > > > > > @@ -4652,6 +4652,8 @@ vm_fault_t
> > > > > > > > > > do_swap_page(struct
> > > > > > > > > > vm_fault
> > > > > > > > > > *vmf)
> > > > > > > > > > vmf->page =
> > > > > > > > > > softleaf_to_page(entry);
> > > > > > > > > > ret =
> > > > > > > > > > remove_device_exclusive_entry(vmf);
> > > > > > > > > > } else if
> > > > > > > > > > (softleaf_is_device_private(entry))
> > > > > > > > > > {
> > > > > > > > > > + struct dev_pagemap *pgmap;
> > > > > > > > > > +
> > > > > > > > > > if (vmf->flags &
> > > > > > > > > > FAULT_FLAG_VMA_LOCK)
> > > > > > > > > > {
> > > > > > > > > > /*
> > > > > > > > > > * migrate_to_ram
> > > > > > > > > > is
> > > > > > > > > > not
> > > > > > > > > > yet
> > > > > > > > > > ready to operate
> > > > > > > > > > @@ -4670,21 +4672,15 @@ vm_fault_t
> > > > > > > > > > do_swap_page(struct
> > > > > > > > > > vm_fault
> > > > > > > > > > *vmf)
> > > > > > > > > >
> > > > > > > > > >
> > > > > > > > > >
> > > > > > > > > > vmf-
> > > > > > > > > > > orig_pte)))
> > > > > > > > > > goto unlock;
> > > > > > > > > >
> > > > > > > > > > - /*
> > > > > > > > > > - * Get a page reference
> > > > > > > > > > while
> > > > > > > > > > we
> > > > > > > > > > know
> > > > > > > > > > the page can't be
> > > > > > > > > > - * freed.
> > > > > > > > > > - */
> > > > > > > > > > - if (trylock_page(vmf-
> > > > > > > > > > >page)) {
> > > > > > > > > > - struct dev_pagemap
> > > > > > > > > > *pgmap;
> > > > > > > > > > -
> > > > > > > > > > - get_page(vmf-
> > > > > > > > > > >page);
> > > > > > >
> > > > > > > At this point we:
> > > > > > > 1. Know the page needs to migrate
> > > > > > > 2. Have the page locked
> > > > > > > 3. Have a reference on the page
> > > > > > > 4. Have the PTL locked
> > > > > > >
> > > > > > > Or in other words we have everything we need to install a
> > > > > > > migration
> > > > > > > entry,
> > > > > > > so why not just do that? This thread would then proceed
> > > > > > > into
> > > > > > > migrate_to_ram()
> > > > > > > having already done migrate_vma_collect_pmd() for the
> > > > > > > faulting
> > > > > > > page
> > > > > > > and any
> > > > > > > other threads would just sleep in the wait on migration
> > > > > > > entry
> > > > > > > path
> > > > > > > until the
> > > > > > > migration is complete, avoiding the livelock problem the
> > > > > > > trylock
> > > > > > > was
> > > > > > > introduced
> > > > > > > for in 1afaeb8293c9a.
> > > > > > >
> > > > > > > - Alistair
> > > > > > >
> > > > > > > > >
> > > > > >
> > > > > > There will always be a small time between when the page is
> > > > > > locked
> > > > > > and
> > > > > > when we can install a migration entry. If the page only has
> > > > > > a
> > > > > > single
> > > > > > mapcount, then the PTL lock is held during this time so the
> > > > > > issue
> > > > > > does
> > > > > > not occur. But for multiple map-counts we need to release
> > > > > > the
> > > > > > PTL
> > > > > > lock
> > > > > > in migration to run try_to_migrate(), and before that, the
> > > > > > migrate
> > > > > > code
> > > > > > is running lru_add_drain_all() and gets stuck.
> > > > >
> > > > > Oh right, my solution would be fine for the single mapping
> > > > > case
> > > > > but I
> > > > > hadn't
> > > > > fully thought through the implications of other threads
> > > > > accessing
> > > > > this for
> > > > > multiple map-counts. Agree it doesn't solve anything there
> > > > > (the
> > > > > rest
> > > > > of the
> > > > > threads would still spin on the trylock).
> > > > >
> > > > > Still we could use a similar solution for waiting on device-
> > > > > private
> > > > > entries as
> > > > > we do for migration entries. Instead of spinning on the
> > > > > trylock
> > > > > (ie.
> > > > > PG_locked)
> > > > > we could just wait on it to become unlocked if it's already
> > > > > locked.
> > > > > Would
> > > > > something like the below completely untested code work?
> > > > > (obviously
> > > > > this is a bit
> > > > > of hack, to do it properly you'd want to do more than just
> > > > > remove
> > > > > the
> > > > > check from
> > > > > migration_entry_wait)
> > > >
> > > > Well I guess there could be failed migration where something is
> > > > aborting the migration even after a page is locked. Also we
> > > > must
> > > > unlock
> > > > the PTL lock before waiting otherwise we could deadlock.
> > >
> > > Yes, this is exactly what the migration entry wait code does. And
> > > if
> > > there's a
> > > failed migration, no problem, you just retry. That's not a
> > > deadlock
> > > unless the
> > > migration never succeeds and then your stuffed anyway.
> > >
> > > > I believe a robust solution would be to take a folio reference
> > > > and
> > > > do a
> > > > sleeping lock like John's example. Then to assert that a folio
> > > > pin-
> > > > count, not ref-count is required to pin a device-private folio.
> > > > That
> > > > would eliminate the problem of the refcount held while locking
> > > > blocking
> > > > migration. It looks like that's fully consistent with
> > >
> > > Waiting on a migration entry like in my example below is exactly
> > > the
> > > same as
> > > sleeping on the page lock other than it just waits for the page
> > > to be
> > > unlocked
> > > rather than trying to lock it.
> > >
> > > Internally migration_entry_wait_on_locked() is just an open-coded
> > > version
> > > of folio_lock() which deals with dropping the PTL and that works
> > > without a page
> > > refcount.
> > >
> > > So I don't understand how this solution isn't robust? It requires
> > > no
> > > funniness
> > > with refcounts and works practically the same as a sleeping lock.
> >
> > You're right. I didn't look closely enough into what the
> > migration_entry_wait_on_locked() did. Sorry about that.
>
> No worries. I'm somewhat familiar with it from updating it
> specifically so it
> wouldn't take a page reference as we used to have similar live-
> lock/starvation
> issues in that path too.
>
> > That would indeed fix the problem as well. Then the only argument
> > remaining for the get-a-reference-and-lock solution would be it's
> > not
> > starvation prone in the same way. But that's definitely a problem I
> > think we could live with for now.
>
> I don't follow how this solution would be any more starvation prone
> than getting
> a reference and locking - here the winning fault takes the lock and
> any other
> faulting threads would just wait until it was released before
> returning from
> the fault handler assuming it had been handled. But it's been a while
> since I've
> thought about all the scenarios here so maybe I missed one.
My thinking is that if theoretical racing lock-holders never actually
migrate the page to system memory, we can't *guarantee* migration will
ever happen, although admittedly this is very unlikely. If we instead
locked the page, we'd on the other hand need to walk the page
table again to check whether the pte content was still valid....
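As a rough illustration of that alternative (not the patch under
discussion, just the shape of it, reusing the names do_swap_page()
already has - vmf, page_pgmap(), migrate_to_ram()), the
lock-and-revalidate path could look something like:

	/*
	 * Sketch only: take a reference and a sleeping lock, then re-walk
	 * the page table to check that the PTE still carries the same
	 * device-private entry before calling migrate_to_ram().
	 */
	get_page(vmf->page);
	pte_unmap_unlock(vmf->pte, vmf->ptl);
	lock_page(vmf->page);

	vmf->pte = pte_offset_map_lock(vma->vm_mm, vmf->pmd, vmf->address,
				       &vmf->ptl);
	if (vmf->pte && pte_same(ptep_get(vmf->pte), vmf->orig_pte)) {
		struct dev_pagemap *pgmap = page_pgmap(vmf->page);

		pte_unmap_unlock(vmf->pte, vmf->ptl);
		ret = pgmap->ops->migrate_to_ram(vmf);
	} else if (vmf->pte) {
		/* The PTE changed under us; someone else already migrated. */
		pte_unmap_unlock(vmf->pte, vmf->ptl);
	}
	unlock_page(vmf->page);
	put_page(vmf->page);

That extra page-table walk is the cost compared to just waiting on the
lock.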
>
> > I'll give this code a test. BTW that removal of unlock_page() isn't
> > intentional, right?
>
> Thanks. And you're right, that was unintentional. Serves me for
> responding too
> late at night :-)
So I ended up with this:
diff --git a/mm/memory.c b/mm/memory.c
index da360a6eb8a4..84b6019eac6d 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -4684,7 +4684,8 @@ vm_fault_t do_swap_page(struct vm_fault *vmf)
unlock_page(vmf->page);
put_page(vmf->page);
} else {
- pte_unmap_unlock(vmf->pte, vmf->ptl);
+ pte_unmap(vmf->pte);
+			migration_entry_wait_on_locked(entry, vmf->ptl);
}
} else if (softleaf_is_hwpoison(entry)) {
ret = VM_FAULT_HWPOISON;
--
2.52.0
Seems to be a working fix.
/Thomas
>
> - Alistair
>
> > Thanks,
> > Thomas
> >
> >
> > >
> > > - Alistair
> > >
> > > > https://docs.kernel.org/core-api/pin_user_pages.html
> > > >
> > > > Then as general improvements we should fully unmap pages before
> > > > calling
> > > > lru_add_drain_all() as MBrost suggest and finally, to be more
> > > > nice
> > > > to
> > > > the system in the common cases, add a cond_resched() to
> > > > hmm_range_fault().
> > > >
> > > > Thanks,
> > > > Thomas
> > > >
> > > >
> > > >
> > > > >
> > > > > ---
> > > > >
> > > > > diff --git a/mm/memory.c b/mm/memory.c
> > > > > index 2a55edc48a65..3e5e205ee279 100644
> > > > > --- a/mm/memory.c
> > > > > +++ b/mm/memory.c
> > > > > @@ -4678,10 +4678,10 @@ vm_fault_t do_swap_page(struct
> > > > > vm_fault
> > > > > *vmf)
> > > > > pte_unmap_unlock(vmf->pte,
> > > > > vmf-
> > > > > > ptl);
> > > > > pgmap = page_pgmap(vmf-
> > > > > >page);
> > > > > ret = pgmap->ops-
> > > > > > migrate_to_ram(vmf);
> > > > > - unlock_page(vmf->page);
> > > > > put_page(vmf->page);
> > > > > } else {
> > > > > - pte_unmap_unlock(vmf->pte,
> > > > > vmf-
> > > > > > ptl);
> > > > > + migration_entry_wait(vma-
> > > > > >vm_mm,
> > > > > vmf->pmd,
> > > > > + vmf-
> > > > > > address);
> > > > > }
> > > > > } else if (softleaf_is_hwpoison(entry)) {
> > > > > ret = VM_FAULT_HWPOISON;
> > > > > diff --git a/mm/migrate.c b/mm/migrate.c
> > > > > index 5169f9717f60..b676daf0f4e8 100644
> > > > > --- a/mm/migrate.c
> > > > > +++ b/mm/migrate.c
> > > > > @@ -496,8 +496,6 @@ void migration_entry_wait(struct
> > > > > mm_struct
> > > > > *mm,
> > > > > pmd_t *pmd,
> > > > > goto out;
> > > > >
> > > > > entry = softleaf_from_pte(pte);
> > > > > - if (!softleaf_is_migration(entry))
> > > > > - goto out;
> > > > >
> > > > > migration_entry_wait_on_locked(entry, ptl);
> > > > > return;
^ permalink raw reply [flat|nested] 30+ messages in thread
* Re: [PATCH] mm/hmm: Fix a hmm_range_fault() livelock / starvation problem
2026-02-02 9:13 ` Thomas Hellström
2026-02-02 10:34 ` Alistair Popple
@ 2026-02-02 22:28 ` John Hubbard
2026-02-03 9:31 ` Thomas Hellström
1 sibling, 1 reply; 30+ messages in thread
From: John Hubbard @ 2026-02-02 22:28 UTC (permalink / raw)
To: Thomas Hellström, Matthew Brost
Cc: Andrew Morton, intel-xe, Ralph Campbell, Christoph Hellwig,
Jason Gunthorpe, Jason Gunthorpe, Leon Romanovsky, linux-mm,
stable, dri-devel
On 2/2/26 1:13 AM, Thomas Hellström wrote:
> On Sat, 2026-01-31 at 13:42 -0800, John Hubbard wrote:
>> On 1/31/26 11:00 AM, Matthew Brost wrote:
>>> On Sat, Jan 31, 2026 at 01:57:21PM +0100, Thomas Hellström wrote:
>>>> On Fri, 2026-01-30 at 19:01 -0800, John Hubbard wrote:
>>>>> On 1/30/26 10:00 AM, Andrew Morton wrote:
>>>>>> On Fri, 30 Jan 2026 15:45:29 +0100 Thomas Hellström
>>>>>> <thomas.hellstrom@linux.intel.com> wrote:
>>>>> ...
>>
>>>
>>>> I'm also not sure a folio refcount should block migration after
>>>> the
>>>> introduction of pinned (like in pin_user_pages) pages. Rather
>>>> perhaps a
>>>> folio pin-count should block migration and in that case
>>>> do_swap_page()
>>>> can definitely do a sleeping folio lock and the problem is gone.
>>
>> A problem for that specific point is that pincount and refcount both
>> mean, "the page is pinned" (which in turn literally means "not
>> allowed
>> to migrate/move").
>
> Yeah this is what I actually want to challenge since this is what
> blocks us from doing a clean robust solution here. From brief reading
> of the docs around the pin-count implementation, I understand it as "If
> you want to access the struct page metadata, get a refcount, If you
> want to access the actual memory of a page, take a pin-count"
>
> I guess that might still not be true for all old instances in the
> kernel using get_user_pages() instead of pin_user_pages() for things
> like DMA, but perhaps we can set that in stone and document it at least
> for device-private pages for now which would be sufficient for the
> do_swap_pages() refcount not to block migration.
>
It's an interesting direction to go...
>
>>
>> (In fact, pincount is implemented in terms of refcount, in most
>> configurations still.)
>
> Yes but that's only a space optimization never intended to conflict,
> right? Meaning a pin-count will imply a refcount but a refcount will
> never imply a pin-count?
>
Unfortunately, they are more tightly linked than that today, at least until
someday when specialized folios are everywhere (at which point pincount
gets its own field).
Until then, it's not just a "space optimization", it's "overload refcount
to also do pincounting". And "let core mm continue to treat refcounts as
meaning that the page is pinned".
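For reference, the overload currently looks roughly like this
(simplified from the folio_maybe_dma_pinned() helper in
include/linux/mm.h; exact details may differ by kernel version):

	static inline bool folio_maybe_dma_pinned(struct folio *folio)
	{
		/* Large folios do have a dedicated pincount field. */
		if (folio_test_large(folio))
			return atomic_read(&folio->_pincount) > 0;

		/*
		 * Small folios record a pin as GUP_PIN_COUNTING_BIAS (1024)
		 * extra references, so a high refcount can only be read as
		 * "maybe pinned" - false positives are expected.
		 */
		return (unsigned int)folio_ref_count(folio) >=
		       GUP_PIN_COUNTING_BIAS;
	}

So for a small folio there is literally no way to tell a pile of plain
references apart from a pin.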
thanks,
--
John Hubbard
^ permalink raw reply [flat|nested] 30+ messages in thread
* Re: [PATCH] mm/hmm: Fix a hmm_range_fault() livelock / starvation problem
2026-02-02 14:07 ` Thomas Hellström
@ 2026-02-02 23:13 ` Alistair Popple
0 siblings, 0 replies; 30+ messages in thread
From: Alistair Popple @ 2026-02-02 23:13 UTC (permalink / raw)
To: Thomas Hellström
Cc: Matthew Brost, John Hubbard, Andrew Morton, intel-xe,
Ralph Campbell, Christoph Hellwig, Jason Gunthorpe,
Jason Gunthorpe, Leon Romanovsky, linux-mm, stable, dri-devel
On 2026-02-03 at 01:07 +1100, Thomas Hellström <thomas.hellstrom@linux.intel.com> wrote...
> On Mon, 2026-02-02 at 23:26 +1100, Alistair Popple wrote:
> > On 2026-02-02 at 22:44 +1100, Thomas Hellström
> > <thomas.hellstrom@linux.intel.com> wrote...
> > > On Mon, 2026-02-02 at 22:22 +1100, Alistair Popple wrote:
> > > > On 2026-02-02 at 21:41 +1100, Thomas Hellström
> > > > <thomas.hellstrom@linux.intel.com> wrote...
> > > > > On Mon, 2026-02-02 at 21:25 +1100, Alistair Popple wrote:
> > > > > > On 2026-02-02 at 20:30 +1100, Thomas Hellström
> > > > > > <thomas.hellstrom@linux.intel.com> wrote...
> > > > > > > Hi,
> > > > > > >
> > > > > > > On Mon, 2026-02-02 at 11:10 +1100, Alistair Popple wrote:
> > > > > > > > On 2026-02-02 at 08:07 +1100, Matthew Brost
> > > > > > > > <matthew.brost@intel.com>
> > > > > > > > wrote...
> > > > > > > > > On Sun, Feb 01, 2026 at 12:48:33PM -0800, John Hubbard
> > > > > > > > > wrote:
> > > > > > > > > > On 2/1/26 11:24 AM, Matthew Brost wrote:
> > > > > > > > > > > On Sat, Jan 31, 2026 at 01:42:20PM -0800, John
> > > > > > > > > > > Hubbard
> > > > > > > > > > > wrote:
> > > > > > > > > > > > On 1/31/26 11:00 AM, Matthew Brost wrote:
> > > > > > > > > > > > > On Sat, Jan 31, 2026 at 01:57:21PM +0100,
> > > > > > > > > > > > > Thomas
> > > > > > > > > > > > > Hellström
> > > > > > > > > > > > > wrote:
> > > > > > > > > > > > > > On Fri, 2026-01-30 at 19:01 -0800, John
> > > > > > > > > > > > > > Hubbard
> > > > > > > > > > > > > > wrote:
> > > > > > > > > > > > > > > On 1/30/26 10:00 AM, Andrew Morton wrote:
> > > > > > > > > > > > > > > > On Fri, 30 Jan 2026 15:45:29 +0100 Thomas
> > > > > > > > > > > > > > > > Hellström
> > > > > > > > > > > > > > > > wrote:
> > > > > > > > > > > > > > > ...
> > > > > > > > > > > > > I’m not convinced the folio refcount has any
> > > > > > > > > > > > > bearing if
> > > > > > > > > > > > > we
> > > > > > > > > > > > > can take a
> > > > > > > > > > > > > sleeping lock in do_swap_page, but perhaps I’m
> > > > > > > > > > > > > missing
> > > > > > > > > > > > > something.
> > > > > > > >
> > > > > > > > I think the point of the trylock vs. lock is that if you
> > > > > > > > can't
> > > > > > > > immediately
> > > > > > > > lock the page then it's an indication the page is
> > > > > > > > undergoing
> > > > > > > > a
> > > > > > > > migration.
> > > > > > > > In other words there's no point waiting for the lock and
> > > > > > > > then
> > > > > > > > trying
> > > > > > > > to call
> > > > > > > > migrate_to_ram() as the page will have already moved by
> > > > > > > > the
> > > > > > > > time
> > > > > > > > you
> > > > > > > > acquire
> > > > > > > > the lock. Of course that just means you spin faulting
> > > > > > > > until
> > > > > > > > the
> > > > > > > > page
> > > > > > > > finally
> > > > > > > > migrates.
> > > > > > > >
> > > > > > > > If I'm understanding the problem it sounds like we just
> > > > > > > > want
> > > > > > > > to
> > > > > > > > sleep
> > > > > > > > until the
> > > > > > > > migration is complete, ie. same as the migration entry
> > > > > > > > path.
> > > > > > > > We
> > > > > > > > don't
> > > > > > > > have a
> > > > > > > > device_private_entry_wait() function, but I don't think
> > > > > > > > we
> > > > > > > > need
> > > > > > > > one,
> > > > > > > > see below.
> > > > > > > >
> > > > > > > > > > > diff --git a/mm/memory.c b/mm/memory.c
> > > > > > > > > > > index da360a6eb8a4..1e7ccc4a1a6c 100644
> > > > > > > > > > > --- a/mm/memory.c
> > > > > > > > > > > +++ b/mm/memory.c
> > > > > > > > > > > @@ -4652,6 +4652,8 @@ vm_fault_t
> > > > > > > > > > > do_swap_page(struct
> > > > > > > > > > > vm_fault
> > > > > > > > > > > *vmf)
> > > > > > > > > > > vmf->page =
> > > > > > > > > > > softleaf_to_page(entry);
> > > > > > > > > > > ret =
> > > > > > > > > > > remove_device_exclusive_entry(vmf);
> > > > > > > > > > > } else if
> > > > > > > > > > > (softleaf_is_device_private(entry))
> > > > > > > > > > > {
> > > > > > > > > > > + struct dev_pagemap *pgmap;
> > > > > > > > > > > +
> > > > > > > > > > > if (vmf->flags &
> > > > > > > > > > > FAULT_FLAG_VMA_LOCK)
> > > > > > > > > > > {
> > > > > > > > > > > /*
> > > > > > > > > > > * migrate_to_ram
> > > > > > > > > > > is
> > > > > > > > > > > not
> > > > > > > > > > > yet
> > > > > > > > > > > ready to operate
> > > > > > > > > > > @@ -4670,21 +4672,15 @@ vm_fault_t
> > > > > > > > > > > do_swap_page(struct
> > > > > > > > > > > vm_fault
> > > > > > > > > > > *vmf)
> > > > > > > > > > >
> > > > > > > > > > >
> > > > > > > > > > >
> > > > > > > > > > > vmf-
> > > > > > > > > > > > orig_pte)))
> > > > > > > > > > > goto unlock;
> > > > > > > > > > >
> > > > > > > > > > > - /*
> > > > > > > > > > > - * Get a page reference
> > > > > > > > > > > while
> > > > > > > > > > > we
> > > > > > > > > > > know
> > > > > > > > > > > the page can't be
> > > > > > > > > > > - * freed.
> > > > > > > > > > > - */
> > > > > > > > > > > - if (trylock_page(vmf-
> > > > > > > > > > > >page)) {
> > > > > > > > > > > - struct dev_pagemap
> > > > > > > > > > > *pgmap;
> > > > > > > > > > > -
> > > > > > > > > > > - get_page(vmf-
> > > > > > > > > > > >page);
> > > > > > > >
> > > > > > > > At this point we:
> > > > > > > > 1. Know the page needs to migrate
> > > > > > > > 2. Have the page locked
> > > > > > > > 3. Have a reference on the page
> > > > > > > > 4. Have the PTL locked
> > > > > > > >
> > > > > > > > Or in other words we have everything we need to install a
> > > > > > > > migration
> > > > > > > > entry,
> > > > > > > > so why not just do that? This thread would then proceed
> > > > > > > > into
> > > > > > > > migrate_to_ram()
> > > > > > > > having already done migrate_vma_collect_pmd() for the
> > > > > > > > faulting
> > > > > > > > page
> > > > > > > > and any
> > > > > > > > other threads would just sleep in the wait on migration
> > > > > > > > entry
> > > > > > > > path
> > > > > > > > until the
> > > > > > > > migration is complete, avoiding the livelock problem the
> > > > > > > > trylock
> > > > > > > > was
> > > > > > > > introduced
> > > > > > > > for in 1afaeb8293c9a.
> > > > > > > >
> > > > > > > > - Alistair
> > > > > > > >
> > > > > > > > > >
> > > > > > >
> > > > > > > There will always be a small time between when the page is
> > > > > > > locked
> > > > > > > and
> > > > > > > when we can install a migration entry. If the page only has
> > > > > > > a
> > > > > > > single
> > > > > > > mapcount, then the PTL lock is held during this time so the
> > > > > > > issue
> > > > > > > does
> > > > > > > not occur. But for multiple map-counts we need to release
> > > > > > > the
> > > > > > > PTL
> > > > > > > lock
> > > > > > > in migration to run try_to_migrate(), and before that, the
> > > > > > > migrate
> > > > > > > code
> > > > > > > is running lru_add_drain_all() and gets stuck.
> > > > > >
> > > > > > Oh right, my solution would be fine for the single mapping
> > > > > > case
> > > > > > but I
> > > > > > hadn't
> > > > > > fully thought through the implications of other threads
> > > > > > accessing
> > > > > > this for
> > > > > > multiple map-counts. Agree it doesn't solve anything there
> > > > > > (the
> > > > > > rest
> > > > > > of the
> > > > > > threads would still spin on the trylock).
> > > > > >
> > > > > > Still we could use a similar solution for waiting on device-
> > > > > > private
> > > > > > entries as
> > > > > > we do for migration entries. Instead of spinning on the
> > > > > > trylock
> > > > > > (ie.
> > > > > > PG_locked)
> > > > > > we could just wait on it to become unlocked if it's already
> > > > > > locked.
> > > > > > Would
> > > > > > something like the below completely untested code work?
> > > > > > (obviously
> > > > > > this is a bit
> > > > > > of hack, to do it properly you'd want to do more than just
> > > > > > remove
> > > > > > the
> > > > > > check from
> > > > > > migration_entry_wait)
> > > > >
> > > > > Well I guess there could be failed migration where something is
> > > > > aborting the migration even after a page is locked. Also we
> > > > > must
> > > > > unlock
> > > > > the PTL lock before waiting otherwise we could deadlock.
> > > >
> > > > Yes, this is exactly what the migration entry wait code does. And
> > > > if
> > > > there's a
> > > > failed migration, no problem, you just retry. That's not a
> > > > deadlock
> > > > unless the
> > > > migration never succeeds and then your stuffed anyway.
> > > >
> > > > > I believe a robust solution would be to take a folio reference
> > > > > and
> > > > > do a
> > > > > sleeping lock like John's example. Then to assert that a folio
> > > > > pin-
> > > > > count, not ref-count is required to pin a device-private folio.
> > > > > That
> > > > > would eliminate the problem of the refcount held while locking
> > > > > blocking
> > > > > migration. It looks like that's fully consistent with
> > > >
> > > > Waiting on a migration entry like in my example below is exactly
> > > > the
> > > > same as
> > > > sleeping on the page lock other than it just waits for the page
> > > > to be
> > > > unlocked
> > > > rather than trying to lock it.
> > > >
> > > > Internally migration_entry_wait_on_locked() is just an open-coded
> > > > version
> > > > of folio_lock() which deals with dropping the PTL and that works
> > > > without a page
> > > > refcount.
> > > >
> > > > So I don't understand how this solution isn't robust? It requires
> > > > no
> > > > funniness
> > > > with refcounts and works practically the same as a sleeping lock.
> > >
> > > You're right. I didn't look closely enough into what the
> > > migration_entry_wait_on_locked() did. Sorry about that.
> >
> > No worries. I'm somewhat familiar with it from updating it
> > specifically so it
> > wouldn't take a page reference as we used to have similar live-
> > lock/starvation
> > issues in that path too.
> >
> > > That would indeed fix the problem as well. Then the only argument
> > > remaining for the get-a-reference-and-lock solution would be it's
> > > not
> > > starvation prone in the same way. But that's definitely a problem I
> > > think we could live with for now.
> >
> > I don't follow how this solution would be any more starvation prone
> > than getting
> > a reference and locking - here the winning fault takes the lock and
> > any other
> > faulting threads would just wait until it was released before
> > returning from
> > the fault handler assuming it had been handled. But it's been a while
> > since I've
> > thought about all the scenarios here so maybe I missed one.
>
> My thinking is that if theoretical racing lock-holders never actually
> migrate the page to system memory, we can't *guarantee* migration will
> ever happen, although admittedly this is very unlikely. If we instead
> locked the page, we'd on the other hand need to walk the page
> table again to check whether the pte content was still valid....
Oh I see what you mean. Something else could be continually grabbing the page
lock but not migrating, meaning this thread would never get a chance to. I doubt
that happens, and agree it's not a concern in practice - AFAIK nothing other than
a driver or do_swap_page() should be taking the page lock, and only for migration,
so assuming the driver behaves the page will migrate (or result in a fatal error
due to e.g. OOM).
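For context, a bare-bones sketch of the driver side being assumed here
(the function and owner names are placeholders and the device copy is
elided; real implementations live in e.g. lib/test_hmm.c and the drm
drivers): migrate_vma_setup() collects and locks the folios, and
migrate_vma_finalize() releases what was taken during setup, so a
well-behaved driver only holds the lock across one migration attempt.

	static vm_fault_t my_migrate_to_ram(struct vm_fault *vmf)
	{
		unsigned long src = 0, dst = 0;
		struct migrate_vma args = {
			.vma		= vmf->vma,
			.start		= vmf->address,
			.end		= vmf->address + PAGE_SIZE,
			.src		= &src,
			.dst		= &dst,
			.pgmap_owner	= my_pgmap_owner,	/* placeholder */
			.flags		= MIGRATE_VMA_SELECT_DEVICE_PRIVATE,
			.fault_page	= vmf->page,
		};

		/* Collects, isolates and locks the folios to migrate. */
		if (migrate_vma_setup(&args))
			return VM_FAULT_SIGBUS;

		if (src & MIGRATE_PFN_MIGRATE) {
			/*
			 * Allocate a system page, copy the device data and
			 * set dst = migrate_pfn(page_to_pfn(newpage)).
			 */
		}

		migrate_vma_pages(&args);
		/* Releases what was taken in migrate_vma_setup(). */
		migrate_vma_finalize(&args);
		return 0;
	}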
That said if we did discover something else locking the page for reasons other
than migration and causing issues here we could wait on a page flag other than
PG_locked that was specific for migration. But hopefully that's not necessary.
> >
> > > I'll give this code a test. BTW that removal of unlock_page() isn't
> > > intentional, right?
> >
> > Thanks. And you're right, that was unintentional. Serves me for
> > responding too
> > late at night :-)
>
> So I ended up with this:
Thanks. That looks much more sane than what I posted.
> diff --git a/mm/memory.c b/mm/memory.c
> index da360a6eb8a4..84b6019eac6d 100644
> --- a/mm/memory.c
> +++ b/mm/memory.c
> @@ -4684,7 +4684,8 @@ vm_fault_t do_swap_page(struct vm_fault *vmf)
> unlock_page(vmf->page);
> put_page(vmf->page);
> } else {
> - pte_unmap_unlock(vmf->pte, vmf->ptl);
> + pte_unmap(vmf->pte);
> +			migration_entry_wait_on_locked(entry, vmf->ptl);
> }
> } else if (softleaf_is_hwpoison(entry)) {
> ret = VM_FAULT_HWPOISON;
> --
> 2.52.0
>
>
> Seems to be a working fix.
Great. Seems like a good fix to me.
- Alistair
> /Thomas
>
>
> >
> > - Alistair
> >
> > > Thanks,
> > > Thomas
> > >
> > >
> > > >
> > > > - Alistair
> > > >
> > > > > https://docs.kernel.org/core-api/pin_user_pages.html
> > > > >
> > > > > Then as general improvements we should fully unmap pages before
> > > > > calling
> > > > > lru_add_drain_all() as MBrost suggest and finally, to be more
> > > > > nice
> > > > > to
> > > > > the system in the common cases, add a cond_resched() to
> > > > > hmm_range_fault().
> > > > >
> > > > > Thanks,
> > > > > Thomas
> > > > >
> > > > >
> > > > >
> > > > > >
> > > > > > ---
> > > > > >
> > > > > > diff --git a/mm/memory.c b/mm/memory.c
> > > > > > index 2a55edc48a65..3e5e205ee279 100644
> > > > > > --- a/mm/memory.c
> > > > > > +++ b/mm/memory.c
> > > > > > @@ -4678,10 +4678,10 @@ vm_fault_t do_swap_page(struct
> > > > > > vm_fault
> > > > > > *vmf)
> > > > > > pte_unmap_unlock(vmf->pte,
> > > > > > vmf-
> > > > > > > ptl);
> > > > > > pgmap = page_pgmap(vmf-
> > > > > > >page);
> > > > > > ret = pgmap->ops-
> > > > > > > migrate_to_ram(vmf);
> > > > > > - unlock_page(vmf->page);
> > > > > > put_page(vmf->page);
> > > > > > } else {
> > > > > > - pte_unmap_unlock(vmf->pte,
> > > > > > vmf-
> > > > > > > ptl);
> > > > > > + migration_entry_wait(vma-
> > > > > > >vm_mm,
> > > > > > vmf->pmd,
> > > > > > + vmf-
> > > > > > > address);
> > > > > > }
> > > > > > } else if (softleaf_is_hwpoison(entry)) {
> > > > > > ret = VM_FAULT_HWPOISON;
> > > > > > diff --git a/mm/migrate.c b/mm/migrate.c
> > > > > > index 5169f9717f60..b676daf0f4e8 100644
> > > > > > --- a/mm/migrate.c
> > > > > > +++ b/mm/migrate.c
> > > > > > @@ -496,8 +496,6 @@ void migration_entry_wait(struct
> > > > > > mm_struct
> > > > > > *mm,
> > > > > > pmd_t *pmd,
> > > > > > goto out;
> > > > > >
> > > > > > entry = softleaf_from_pte(pte);
> > > > > > - if (!softleaf_is_migration(entry))
> > > > > > - goto out;
> > > > > >
> > > > > > migration_entry_wait_on_locked(entry, ptl);
> > > > > > return;
^ permalink raw reply [flat|nested] 30+ messages in thread
* Re: [PATCH] mm/hmm: Fix a hmm_range_fault() livelock / starvation problem
2026-02-02 22:28 ` John Hubbard
@ 2026-02-03 9:31 ` Thomas Hellström
2026-02-04 1:13 ` pincount vs refcount: " John Hubbard
0 siblings, 1 reply; 30+ messages in thread
From: Thomas Hellström @ 2026-02-03 9:31 UTC (permalink / raw)
To: John Hubbard, Matthew Brost
Cc: Andrew Morton, intel-xe, Ralph Campbell, Christoph Hellwig,
Jason Gunthorpe, Jason Gunthorpe, Leon Romanovsky, linux-mm,
stable, dri-devel
On Mon, 2026-02-02 at 14:28 -0800, John Hubbard wrote:
> On 2/2/26 1:13 AM, Thomas Hellström wrote:
> > On Sat, 2026-01-31 at 13:42 -0800, John Hubbard wrote:
> > > On 1/31/26 11:00 AM, Matthew Brost wrote:
> > > > On Sat, Jan 31, 2026 at 01:57:21PM +0100, Thomas Hellström
> > > > wrote:
> > > > > On Fri, 2026-01-30 at 19:01 -0800, John Hubbard wrote:
> > > > > > On 1/30/26 10:00 AM, Andrew Morton wrote:
> > > > > > > On Fri, 30 Jan 2026 15:45:29 +0100 Thomas Hellström
> > > > > > > <thomas.hellstrom@linux.intel.com> wrote:
> > > > > > ...
> > >
> > > >
> > > > > I'm also not sure a folio refcount should block migration
> > > > > after
> > > > > the
> > > > > introduction of pinned (like in pin_user_pages) pages. Rather
> > > > > perhaps a
> > > > > folio pin-count should block migration and in that case
> > > > > do_swap_page()
> > > > > can definitely do a sleeping folio lock and the problem is
> > > > > gone.
> > >
> > > A problem for that specific point is that pincount and refcount
> > > both
> > > mean, "the page is pinned" (which in turn literally means "not
> > > allowed
> > > to migrate/move").
> >
> > Yeah this is what I actually want to challenge since this is what
> > blocks us from doing a clean robust solution here. From brief
> > reading
> > of the docs around the pin-count implementation, I understand it as
> > "If
> > you want to access the struct page metadata, get a refcount, If you
> > want to access the actual memory of a page, take a pin-count"
> >
> > I guess that might still not be true for all old instances in the
> > kernel using get_user_pages() instead of pin_user_pages() for
> > things
> > like DMA, but perhaps we can set that in stone and document it at
> > least
> > for device-private pages for now which would be sufficient for the
> > do_swap_pages() refcount not to block migration.
> >
>
> It's an interesting direction to go...
>
> >
> > >
> > > (In fact, pincount is implemented in terms of refcount, in most
> > > configurations still.)
> >
> > Yes, but that's only a space optimization, never intended to
> > conflict, right? Meaning a pin-count will imply a refcount, but a
> > refcount will never imply a pin-count?
> >
> Unfortunately, they are more tightly linked than that today, at least
> until someday when specialized folios are everywhere (at which point
> pincount gets its own field).
>
> Until then, it's not just a "space optimization", it's "overload
> refcount to also do pincounting". And "let core mm continue to treat
> refcounts as meaning that the page is pinned".
So this is what I had in mind:

I think certainly this would work regardless of whether pincount is
implemented by means of refcount with a bias or not, and AFAICT it's
also consistent with

https://docs.kernel.org/core-api/pin_user_pages.html

But it would not work if some part of core mm grabs a page refcount and
*expects* that to pin a page in the sense that it should not be
migrated. But you're suggesting that's actually the case?
Thanks,
Thomas
diff --git a/mm/migrate_device.c b/mm/migrate_device.c
index a101a187e6da..c07a79995128 100644
--- a/mm/migrate_device.c
+++ b/mm/migrate_device.c
@@ -534,33 +534,15 @@ static void migrate_vma_collect(struct migrate_vma *migrate)
  * migrate_vma_check_page() - check if page is pinned or not
  * @page: struct page to check
  *
- * Pinned pages cannot be migrated. This is the same test as in
- * folio_migrate_mapping(), except that here we allow migration of a
- * ZONE_DEVICE page.
+ * Pinned pages cannot be migrated.
  */
 static bool migrate_vma_check_page(struct page *page, struct page *fault_page)
 {
 	struct folio *folio = page_folio(page);
 
-	/*
-	 * One extra ref because caller holds an extra reference, either from
-	 * folio_isolate_lru() for a regular folio, or migrate_vma_collect() for
-	 * a device folio.
-	 */
-	int extra = 1 + (page == fault_page);
-
-	/* Page from ZONE_DEVICE have one extra reference */
-	if (folio_is_zone_device(folio))
-		extra++;
-
-	/* For file back page */
-	if (folio_mapping(folio))
-		extra += 1 + folio_has_private(folio);
-
-	if ((folio_ref_count(folio) - extra) > folio_mapcount(folio))
-		return false;
+	VM_WARN_ON_FOLIO(folio_test_lru(folio) || folio_mapped(folio), folio);
 
-	return true;
+	return !folio_maybe_dma_pinned(folio);
 }
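
(Purely as an illustration of the convention the above relies on -- a
hypothetical driver snippet, not code from any real driver: users that
actually access the memory are expected to take a FOLL_PIN reference,
which is what folio_maybe_dma_pinned() detects, while a plain
folio_get()/get_page() reference would then no longer block migration
of a device-private folio.)

/*
 * Hypothetical example only: pin user pages for DMA with FOLL_PIN,
 * then drop the pin when the DMA has completed. The pin raises the
 * pincount that folio_maybe_dma_pinned() tests; a bare page reference
 * does not.
 */
static int example_dma_to_user(unsigned long uaddr, struct page **pages,
			       int npages)
{
	int pinned;

	/* pin_user_pages_fast() may pin fewer pages than requested */
	pinned = pin_user_pages_fast(uaddr, npages, FOLL_WRITE, pages);
	if (pinned < 0)
		return pinned;

	/* ... set up and run the DMA against the 'pinned' pages ... */

	unpin_user_pages(pages, pinned);
	return 0;
}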
>
>
> thanks,
^ permalink raw reply [flat|nested] 30+ messages in thread
* pincount vs refcount: [PATCH] mm/hmm: Fix a hmm_range_fault() livelock / starvation problem
2026-02-03 9:31 ` Thomas Hellström
@ 2026-02-04 1:13 ` John Hubbard
0 siblings, 0 replies; 30+ messages in thread
From: John Hubbard @ 2026-02-04 1:13 UTC (permalink / raw)
To: Thomas Hellström, Matthew Brost
Cc: Andrew Morton, intel-xe, Ralph Campbell, Christoph Hellwig,
Jason Gunthorpe, Jason Gunthorpe, Leon Romanovsky, linux-mm,
stable, dri-devel
On 2/3/26 1:31 AM, Thomas Hellström wrote:
...
> So this is what I had in mind:
>
> I think certainly this would work regardless of whether pincount is
> implemented by means of refcount with a bias or not, and AFAICT it's
> also consistent with
>
> https://docs.kernel.org/core-api/pin_user_pages.html
>
> But it would not work if some part of core mm grabs a page refcount and
> *expects* that to pin a page in the sense that it should not be
> migrated. But you're suggesting that's actually the case?
Yes. The migration code itself, in fact, uses folio_get() with the
expectation that this blocks migration:
migrate_vma_collect_pmd():

		/*
		 * By getting a reference on the folio we pin it and that blocks
		 * any kind of migration. Side effect is that it "freezes" the
		 * pte.
		 *
		 * We drop this reference after isolating the folio from the lru
		 * for non device folio (device folio are not on the lru and thus
		 * can't be dropped from it).
		 */
		folio = page_folio(page);
		folio_get(folio);
I'm experiencing a bit of local-mind-boggle at decoupling refcount
from pincount, but part of that is probably just bias towards "but
it's always been that way!". haha :)
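
(To make the coupling concrete, this is roughly what the check looks
like today -- paraphrased from memory rather than copied verbatim, so
treat include/linux/mm.h as the authoritative source:)

/* Approximately the current folio_maybe_dma_pinned() logic: */
static inline bool folio_maybe_dma_pinned(struct folio *folio)
{
	if (folio_test_large(folio))
		/* large folios track pins in a dedicated _pincount field */
		return atomic_read(&folio->_pincount) > 0;

	/*
	 * Small folios overload the refcount: each FOLL_PIN adds
	 * GUP_PIN_COUNTING_BIAS (1024), so a page with a huge number of
	 * plain references can look pinned even though it isn't.
	 */
	return folio_ref_count(folio) >= GUP_PIN_COUNTING_BIAS;
}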
thanks,
--
John Hubbard
^ permalink raw reply [flat|nested] 30+ messages in thread
end of thread, other threads:[~2026-02-04 1:14 UTC | newest]
Thread overview: 30+ messages
2026-01-30 14:45 [PATCH] mm/hmm: Fix a hmm_range_fault() livelock / starvation problem Thomas Hellström
2026-01-30 18:00 ` Andrew Morton
2026-01-30 19:56 ` Thomas Hellström
2026-01-30 20:38 ` Andrew Morton
2026-01-30 21:01 ` Matthew Brost
2026-01-30 21:08 ` Andrew Morton
2026-01-31 0:59 ` Matthew Brost
2026-01-31 3:01 ` John Hubbard
2026-01-31 12:57 ` Thomas Hellström
2026-01-31 19:00 ` Matthew Brost
2026-01-31 21:42 ` John Hubbard
2026-02-01 19:24 ` Matthew Brost
2026-02-01 20:48 ` John Hubbard
2026-02-01 21:07 ` Matthew Brost
2026-02-02 0:10 ` Alistair Popple
2026-02-02 9:30 ` Thomas Hellström
2026-02-02 10:25 ` Alistair Popple
2026-02-02 10:41 ` Thomas Hellström
2026-02-02 11:22 ` Alistair Popple
2026-02-02 11:44 ` Thomas Hellström
2026-02-02 12:26 ` Alistair Popple
2026-02-02 14:07 ` Thomas Hellström
2026-02-02 23:13 ` Alistair Popple
2026-02-02 9:13 ` Thomas Hellström
2026-02-02 10:34 ` Alistair Popple
2026-02-02 10:51 ` Thomas Hellström
2026-02-02 11:28 ` Alistair Popple
2026-02-02 22:28 ` John Hubbard
2026-02-03 9:31 ` Thomas Hellström
2026-02-04 1:13 ` pincount vs refcount: " John Hubbard