linux-mm.kvack.org archive mirror
* [RFC PATCH 0/2] mm/migrate: wait for folio refcount during longterm pin migration
@ 2026-04-10  3:23 John Hubbard
  2026-04-10  3:23 ` [RFC PATCH 1/2] mm: wake up folio refcount waiters on folio_put() John Hubbard
                   ` (3 more replies)
  0 siblings, 4 replies; 7+ messages in thread
From: John Hubbard @ 2026-04-10  3:23 UTC (permalink / raw)
  To: Andrew Morton
  Cc: David Hildenbrand, Lorenzo Stoakes, Liam R . Howlett,
	Vlastimil Babka, Mike Rapoport, Suren Baghdasaryan, Michal Hocko,
	Zi Yan, Matthew Brost, Joshua Hahn, Rakie Kim, Byungchul Park,
	Gregory Price, Ying Huang, Alistair Popple, Axel Rasmussen,
	Yuanchu Xie, Wei Xu, Chris Li, Kairui Song, Kemeng Shi,
	Nhat Pham, Baoquan He, Barry Song, LKML, linux-mm, John Hubbard

Hi,

This adds a bounded sleep to migration so that FOLL_LONGTERM pinning can
wait for transient folio references to drain, instead of failing after a
fixed number of retries. The wait uses a one-second timeout. An
alternative approach would be to call wait_var_event_killable() with no
timeout, but that doesn't match as well with migration's "this will
probably work" API. In other words, a short sleeping wait is more
appropriate here.

When migrating pages for FOLL_LONGTERM pinning, migration can fail with
-EAGAIN if a folio has unexpected references. These references are often
transient, but the current retry loop gives up too quickly. This series
adds wait_var_event_timeout() at the retry points, paired with
wake_up_var() in folio_put() to wake the sleeper as soon as the refcount
drops.

The wake_up_var() calls in folio_put() are gated behind a static key,
disabled by default, so non-migration workloads pay zero cost.
migrate_pages() enables the key on entry when the reason is
MR_LONGTERM_PIN, and disables it on exit.

Toggling the key is not free. folio_put() is static inline, so every
compilation unit that calls it gets its own patch site (roughly 500 in
vmlinux, plus modules). On x86, jump label patching is batched (256
sites per batch, 3 IPI rounds per batch), so enabling the key costs
6-9 IPI broadcasts, a few hundred microseconds on a large machine.
That cost is paid twice per migrate_pages() call. Migration itself
spends several milliseconds per batch on LRU isolation, TLB flushes,
and page copies. Concurrent longterm-pin migrations after the first
just do an atomic_inc (no patching).

Matthew Brost offered to performance-test this series [1], as Intel has
tests that stress migration and good metrics to catch regressions.

[1] https://lore.kernel.org/all/aX+oUorOWPt1xbgw@lstrano-desk.jf.intel.com/

John Hubbard (2):
  mm: wake up folio refcount waiters on folio_put()
  mm/migrate: wait for folio refcount during longterm pin migration

 include/linux/mm.h |  8 ++++++++
 mm/migrate.c       | 30 ++++++++++++++++++++++++++++++
 mm/swap.c          | 10 +++++++++-
 3 files changed, 47 insertions(+), 1 deletion(-)


base-commit: 9a9c8ce300cd3859cc87b408ef552cd697cc2ab7
-- 
2.53.0




* [RFC PATCH 1/2] mm: wake up folio refcount waiters on folio_put()
  2026-04-10  3:23 [RFC PATCH 0/2] mm/migrate: wait for folio refcount during longterm pin migration John Hubbard
@ 2026-04-10  3:23 ` John Hubbard
  2026-04-10  3:23 ` [RFC PATCH 2/2] mm/migrate: wait for folio refcount during longterm pin migration John Hubbard
                   ` (2 subsequent siblings)
  3 siblings, 0 replies; 7+ messages in thread
From: John Hubbard @ 2026-04-10  3:23 UTC (permalink / raw)
  To: Andrew Morton
  Cc: David Hildenbrand, Lorenzo Stoakes, Liam R . Howlett,
	Vlastimil Babka, Mike Rapoport, Suren Baghdasaryan, Michal Hocko,
	Zi Yan, Matthew Brost, Joshua Hahn, Rakie Kim, Byungchul Park,
	Gregory Price, Ying Huang, Alistair Popple, Axel Rasmussen,
	Yuanchu Xie, Wei Xu, Chris Li, Kairui Song, Kemeng Shi,
	Nhat Pham, Baoquan He, Barry Song, LKML, linux-mm, John Hubbard

When a folio's reference count is decremented but doesn't reach zero,
wake up any waiters that might be waiting for the refcount to drop.
This enables migration code to wait for transient references to be
released instead of busy-retrying.

The wake_up_var() calls are gated behind a static key that is disabled
by default, so folio_put() compiles to a NOP on the wakeup path when
no migration is waiting. The static key is enabled by the migration
code in a subsequent commit.

Signed-off-by: John Hubbard <jhubbard@nvidia.com>
---
 include/linux/mm.h |  8 ++++++++
 mm/swap.c          | 10 +++++++++-
 2 files changed, 17 insertions(+), 1 deletion(-)

diff --git a/include/linux/mm.h b/include/linux/mm.h
index abb4963c1f06..ccb723412c07 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -38,6 +38,8 @@
 #include <linux/bitmap.h>
 #include <linux/bitops.h>
 #include <linux/iommu-debug-pagealloc.h>
+#include <linux/jump_label.h>
+#include <linux/wait_bit.h>
 
 struct mempolicy;
 struct anon_vma;
@@ -1798,6 +1800,8 @@ static inline __must_check bool try_get_page(struct page *page)
 	return true;
 }
 
+DECLARE_STATIC_KEY_FALSE(folio_put_wakeup_key);
+
 /**
  * folio_put - Decrement the reference count on a folio.
  * @folio: The folio.
@@ -1815,6 +1819,8 @@ static inline void folio_put(struct folio *folio)
 {
 	if (folio_put_testzero(folio))
 		__folio_put(folio);
+	else if (static_branch_unlikely(&folio_put_wakeup_key))
+		wake_up_var(&folio->_refcount);
 }
 
 /**
@@ -1835,6 +1841,8 @@ static inline void folio_put_refs(struct folio *folio, int refs)
 {
 	if (folio_ref_sub_and_test(folio, refs))
 		__folio_put(folio);
+	else if (static_branch_unlikely(&folio_put_wakeup_key))
+		wake_up_var(&folio->_refcount);
 }
 
 void folios_put_refs(struct folio_batch *folios, unsigned int *refs);
diff --git a/mm/swap.c b/mm/swap.c
index bb19ccbece46..e57baa40129c 100644
--- a/mm/swap.c
+++ b/mm/swap.c
@@ -43,6 +43,9 @@
 #define CREATE_TRACE_POINTS
 #include <trace/events/pagemap.h>
 
+DEFINE_STATIC_KEY_FALSE(folio_put_wakeup_key);
+EXPORT_SYMBOL(folio_put_wakeup_key);
+
 /* How many pages do we try to swap or page in/out together? As a power of 2 */
 int page_cluster;
 static const int page_cluster_max = 31;
@@ -968,11 +971,16 @@ void folios_put_refs(struct folio_batch *folios, unsigned int *refs)
 			}
 			if (folio_ref_sub_and_test(folio, nr_refs))
 				free_zone_device_folio(folio);
+			else if (static_branch_unlikely(&folio_put_wakeup_key))
+				wake_up_var(&folio->_refcount);
 			continue;
 		}
 
-		if (!folio_ref_sub_and_test(folio, nr_refs))
+		if (!folio_ref_sub_and_test(folio, nr_refs)) {
+			if (static_branch_unlikely(&folio_put_wakeup_key))
+				wake_up_var(&folio->_refcount);
 			continue;
+		}
 
 		/* hugetlb has its own memcg */
 		if (folio_test_hugetlb(folio)) {
-- 
2.53.0




* [RFC PATCH 2/2] mm/migrate: wait for folio refcount during longterm pin migration
  2026-04-10  3:23 [RFC PATCH 0/2] mm/migrate: wait for folio refcount during longterm pin migration John Hubbard
  2026-04-10  3:23 ` [RFC PATCH 1/2] mm: wake up folio refcount waiters on folio_put() John Hubbard
@ 2026-04-10  3:23 ` John Hubbard
  2026-04-21  5:57   ` Alistair Popple
  2026-04-21  9:21   ` Huang, Ying
  2026-04-21  5:52 ` [RFC PATCH 0/2] " Alistair Popple
  2026-04-21  9:19 ` Huang, Ying
  3 siblings, 2 replies; 7+ messages in thread
From: John Hubbard @ 2026-04-10  3:23 UTC (permalink / raw)
  To: Andrew Morton
  Cc: David Hildenbrand, Lorenzo Stoakes, Liam R . Howlett,
	Vlastimil Babka, Mike Rapoport, Suren Baghdasaryan, Michal Hocko,
	Zi Yan, Matthew Brost, Joshua Hahn, Rakie Kim, Byungchul Park,
	Gregory Price, Ying Huang, Alistair Popple, Axel Rasmussen,
	Yuanchu Xie, Wei Xu, Chris Li, Kairui Song, Kemeng Shi,
	Nhat Pham, Baoquan He, Barry Song, LKML, linux-mm, John Hubbard

When migrating pages for FOLL_LONGTERM pinning (MR_LONGTERM_PIN), the
migration can fail with -EAGAIN if the folio has unexpected references.
These references are often transient (e.g., from GPU operations like
cuMemset that will complete shortly).

Previously, the migration code would retry up to 10 times
(NR_MAX_MIGRATE_PAGES_RETRY), but this busy-retry approach failed when
the transient reference holder needed more time than the retry loop
provides.

Fix this by waiting up to one second for the folio's refcount to drop
to the expected value before retrying migration. The wait uses
wait_var_event_timeout() paired with the wake_up_var() calls added to
folio_put() in the previous commit. If the timeout expires, the
existing retry loop continues as before. The folio_put_wakeup_key
static key is enabled for the duration of migrate_pages() so that
folio_put() only wakes waiters when migration is active.

Signed-off-by: John Hubbard <jhubbard@nvidia.com>
---
 mm/migrate.c | 30 ++++++++++++++++++++++++++++++
 1 file changed, 30 insertions(+)

diff --git a/mm/migrate.c b/mm/migrate.c
index 2c3d489ecf51..a5d9f85aa376 100644
--- a/mm/migrate.c
+++ b/mm/migrate.c
@@ -47,6 +47,8 @@
 #include <asm/tlbflush.h>
 
 #include <trace/events/migrate.h>
+#include <linux/jump_label.h>
+#include <linux/wait_bit.h>
 
 #include "internal.h"
 #include "swap.h"
@@ -1732,6 +1734,17 @@ static void migrate_folios_move(struct list_head *src_folios,
 			*retry += 1;
 			*thp_retry += is_thp;
 			*nr_retry_pages += nr_pages;
+			/*
+			 * For longterm pinning, wait for references
+			 * to be released before retrying.
+			 */
+			if (reason == MR_LONGTERM_PIN) {
+				int expected = folio_expected_ref_count(folio) + 1;
+
+				wait_var_event_timeout(&folio->_refcount,
+					folio_ref_count(folio) <= expected,
+					HZ);
+			}
 			break;
 		case 0:
 			stats->nr_succeeded += nr_pages;
@@ -1941,6 +1954,17 @@ static int migrate_pages_batch(struct list_head *from,
 				retry++;
 				thp_retry += is_thp;
 				nr_retry_pages += nr_pages;
+				/*
+				 * For longterm pinning, wait for references
+				 * to be released.
+				 */
+				if (reason == MR_LONGTERM_PIN) {
+					int expected = folio_expected_ref_count(folio) + 1;
+
+					wait_var_event_timeout(&folio->_refcount,
+							folio_ref_count(folio) <= expected,
+							HZ);
+				}
 				break;
 			case 0:
 				list_move_tail(&folio->lru, &unmap_folios);
@@ -2085,6 +2109,9 @@ int migrate_pages(struct list_head *from, new_folio_t get_new_folio,
 
 	memset(&stats, 0, sizeof(stats));
 
+	if (reason == MR_LONGTERM_PIN)
+		static_branch_inc(&folio_put_wakeup_key);
+
 	rc_gather = migrate_hugetlbs(from, get_new_folio, put_new_folio, private,
 				     mode, reason, &stats, &ret_folios);
 	if (rc_gather < 0)
@@ -2137,6 +2164,9 @@ int migrate_pages(struct list_head *from, new_folio_t get_new_folio,
 	if (!list_empty(from))
 		goto again;
 out:
+	if (reason == MR_LONGTERM_PIN)
+		static_branch_dec(&folio_put_wakeup_key);
+
 	/*
 	 * Put the permanent failure folio back to migration list, they
 	 * will be put back to the right list by the caller.
-- 
2.53.0




* Re: [RFC PATCH 0/2] mm/migrate: wait for folio refcount during longterm pin migration
  2026-04-10  3:23 [RFC PATCH 0/2] mm/migrate: wait for folio refcount during longterm pin migration John Hubbard
  2026-04-10  3:23 ` [RFC PATCH 1/2] mm: wake up folio refcount waiters on folio_put() John Hubbard
  2026-04-10  3:23 ` [RFC PATCH 2/2] mm/migrate: wait for folio refcount during longterm pin migration John Hubbard
@ 2026-04-21  5:52 ` Alistair Popple
  2026-04-21  9:19 ` Huang, Ying
  3 siblings, 0 replies; 7+ messages in thread
From: Alistair Popple @ 2026-04-21  5:52 UTC (permalink / raw)
  To: John Hubbard
  Cc: Andrew Morton, David Hildenbrand, Lorenzo Stoakes,
	Liam R . Howlett, Vlastimil Babka, Mike Rapoport,
	Suren Baghdasaryan, Michal Hocko, Zi Yan, Matthew Brost,
	Joshua Hahn, Rakie Kim, Byungchul Park, Gregory Price,
	Ying Huang, Axel Rasmussen, Yuanchu Xie, Wei Xu, Chris Li,
	Kairui Song, Kemeng Shi, Nhat Pham, Baoquan He, Barry Song, LKML,
	linux-mm

On 2026-04-10 at 13:23 +1000, John Hubbard <jhubbard@nvidia.com> wrote...
> Hi,
> 
> This adds a bounded sleep to migration so that FOLL_LONGTERM pinning can
> wait for transient folio references to drain, instead of failing after a
> fixed number of retries. The wait uses a one-second timeout. An
> alternative approach would be to call wait_var_event_killable() with no
> timeout, but that doesn't match as well with migration's "this will
> probably work" API. In other words, a short sleeping wait is more
> appropriate here.

This is much better than retrying $RANDOM times. It also seems it would
provide a nice definition of what a transient vs. longterm pin is: any
pin held longer than the migration timeout would be longterm.

> When migrating pages for FOLL_LONGTERM pinning, migration can fail with
> -EAGAIN if a folio has unexpected references. These references are often
> transient, but the current retry loop gives up too quickly. This series
> adds wait_var_event_timeout() at the retry points, paired with
> wake_up_var() in folio_put() to wake the sleeper as soon as the refcount
> drops.

Nothing wrong with the above, just a minor nit that I wanted to check
my understanding of. FOLL_LONGTERM causing migration implies this is in
ZONE_MOVABLE, and the aim of ZONE_MOVABLE is that memory is always movable. That
implies any unexpected page references should *always* be transient, not often
transient. At least that's my understanding assuming drivers are behaving.

> The wake_up_var() calls in folio_put() are gated behind a static key,
> disabled by default, so non-migration workloads pay zero cost.
> migrate_pages() enables the key on entry when the reason is
> MR_LONGTERM_PIN, and disables it on exit.
> 
> Toggling the key is not free. folio_put() is static inline, so every
> compilation unit that calls it gets its own patch site (roughly 500 in
> vmlinux, plus modules). On x86, jump label patching is batched (256
> sites per batch, 3 IPI rounds per batch), so enabling the key costs
> 6-9 IPI broadcasts, a few hundred microseconds on a large machine.
> That cost is paid twice per migrate_pages() call. Migration itself
> spends several milliseconds per batch on LRU isolation, TLB flushes,
> and page copies. Concurrent longterm-pin migrations after the first
> just do an atomic_inc (no patching).
> 
> Matthew Brost offered to performance-test this series [1], as Intel has
> tests that stress migration and good metrics to catch regressions.
> 
> [1] https://lore.kernel.org/all/aX+oUorOWPt1xbgw@lstrano-desk.jf.intel.com/
> 
> John Hubbard (2):
>   mm: wake up folio refcount waiters on folio_put()
>   mm/migrate: wait for folio refcount during longterm pin migration
> 
>  include/linux/mm.h |  8 ++++++++
>  mm/migrate.c       | 30 ++++++++++++++++++++++++++++++
>  mm/swap.c          | 10 +++++++++-
>  3 files changed, 47 insertions(+), 1 deletion(-)
> 
> 
> base-commit: 9a9c8ce300cd3859cc87b408ef552cd697cc2ab7
> -- 
> 2.53.0
> 



* Re: [RFC PATCH 2/2] mm/migrate: wait for folio refcount during longterm pin migration
  2026-04-10  3:23 ` [RFC PATCH 2/2] mm/migrate: wait for folio refcount during longterm pin migration John Hubbard
@ 2026-04-21  5:57   ` Alistair Popple
  2026-04-21  9:21   ` Huang, Ying
  1 sibling, 0 replies; 7+ messages in thread
From: Alistair Popple @ 2026-04-21  5:57 UTC (permalink / raw)
  To: John Hubbard
  Cc: Andrew Morton, David Hildenbrand, Lorenzo Stoakes,
	Liam R . Howlett, Vlastimil Babka, Mike Rapoport,
	Suren Baghdasaryan, Michal Hocko, Zi Yan, Matthew Brost,
	Joshua Hahn, Rakie Kim, Byungchul Park, Gregory Price,
	Ying Huang, Axel Rasmussen, Yuanchu Xie, Wei Xu, Chris Li,
	Kairui Song, Kemeng Shi, Nhat Pham, Baoquan He, Barry Song, LKML,
	linux-mm

On 2026-04-10 at 13:23 +1000, John Hubbard <jhubbard@nvidia.com> wrote...
> When migrating pages for FOLL_LONGTERM pinning (MR_LONGTERM_PIN), the
> migration can fail with -EAGAIN if the folio has unexpected references.
> These references are often transient (e.g., from GPU operations like
> cuMemset that will complete shortly).

Is there a reason this logic should only apply to FOLL_LONGTERM pinning?
Or could it also apply more generally to any ZONE_MOVABLE page, for which
migration should eventually succeed? Currently that path has similar retry
logic: it retries NR_MAX_MIGRATE_PAGES_RETRY times and then gives up.

We have similar retry problems in mm/migrate_device.c:migrate_vma_*(), so I
could see something similar being potentially useful there.

 - Alistair

> Previously, the migration code would retry up to 10 times
> (NR_MAX_MIGRATE_PAGES_RETRY), but this busy-retry approach failed when
> the transient reference holder needed more time than the retry loop
> provides.
> 
> Fix this by waiting up to one second for the folio's refcount to drop
> to the expected value before retrying migration. The wait uses
> wait_var_event_timeout() paired with the wake_up_var() calls added to
> folio_put() in the previous commit. If the timeout expires, the
> existing retry loop continues as before. The folio_put_wakeup_key
> static key is enabled for the duration of migrate_pages() so that
> folio_put() only wakes waiters when migration is active.
> 
> Signed-off-by: John Hubbard <jhubbard@nvidia.com>
> ---
>  mm/migrate.c | 30 ++++++++++++++++++++++++++++++
>  1 file changed, 30 insertions(+)
> 
> diff --git a/mm/migrate.c b/mm/migrate.c
> index 2c3d489ecf51..a5d9f85aa376 100644
> --- a/mm/migrate.c
> +++ b/mm/migrate.c
> @@ -47,6 +47,8 @@
>  #include <asm/tlbflush.h>
>  
>  #include <trace/events/migrate.h>
> +#include <linux/jump_label.h>
> +#include <linux/wait_bit.h>
>  
>  #include "internal.h"
>  #include "swap.h"
> @@ -1732,6 +1734,17 @@ static void migrate_folios_move(struct list_head *src_folios,
>  			*retry += 1;
>  			*thp_retry += is_thp;
>  			*nr_retry_pages += nr_pages;
> +			/*
> +			 * For longterm pinning, wait for references
> +			 * to be released before retrying.
> +			 */
> +			if (reason == MR_LONGTERM_PIN) {
> +				int expected = folio_expected_ref_count(folio) + 1;
> +
> +				wait_var_event_timeout(&folio->_refcount,
> +					folio_ref_count(folio) <= expected,
> +					HZ);
> +			}
>  			break;
>  		case 0:
>  			stats->nr_succeeded += nr_pages;
> @@ -1941,6 +1954,17 @@ static int migrate_pages_batch(struct list_head *from,
>  				retry++;
>  				thp_retry += is_thp;
>  				nr_retry_pages += nr_pages;
> +				/*
> +				 * For longterm pinning, wait for references
> +				 * to be released.
> +				 */
> +				if (reason == MR_LONGTERM_PIN) {
> +					int expected = folio_expected_ref_count(folio) + 1;
> +
> +					wait_var_event_timeout(&folio->_refcount,
> +							folio_ref_count(folio) <= expected,
> +							HZ);
> +				}
>  				break;
>  			case 0:
>  				list_move_tail(&folio->lru, &unmap_folios);
> @@ -2085,6 +2109,9 @@ int migrate_pages(struct list_head *from, new_folio_t get_new_folio,
>  
>  	memset(&stats, 0, sizeof(stats));
>  
> +	if (reason == MR_LONGTERM_PIN)
> +		static_branch_inc(&folio_put_wakeup_key);
> +
>  	rc_gather = migrate_hugetlbs(from, get_new_folio, put_new_folio, private,
>  				     mode, reason, &stats, &ret_folios);
>  	if (rc_gather < 0)
> @@ -2137,6 +2164,9 @@ int migrate_pages(struct list_head *from, new_folio_t get_new_folio,
>  	if (!list_empty(from))
>  		goto again;
>  out:
> +	if (reason == MR_LONGTERM_PIN)
> +		static_branch_dec(&folio_put_wakeup_key);
> +
>  	/*
>  	 * Put the permanent failure folio back to migration list, they
>  	 * will be put back to the right list by the caller.
> -- 
> 2.53.0
> 



* Re: [RFC PATCH 0/2] mm/migrate: wait for folio refcount during longterm pin migration
  2026-04-10  3:23 [RFC PATCH 0/2] mm/migrate: wait for folio refcount during longterm pin migration John Hubbard
                   ` (2 preceding siblings ...)
  2026-04-21  5:52 ` [RFC PATCH 0/2] " Alistair Popple
@ 2026-04-21  9:19 ` Huang, Ying
  3 siblings, 0 replies; 7+ messages in thread
From: Huang, Ying @ 2026-04-21  9:19 UTC (permalink / raw)
  To: John Hubbard
  Cc: Andrew Morton, David Hildenbrand, Lorenzo Stoakes,
	Liam R . Howlett, Vlastimil Babka, Mike Rapoport,
	Suren Baghdasaryan, Michal Hocko, Zi Yan, Matthew Brost,
	Joshua Hahn, Rakie Kim, Byungchul Park, Gregory Price,
	Alistair Popple, Axel Rasmussen, Yuanchu Xie, Wei Xu, Chris Li,
	Kairui Song, Kemeng Shi, Nhat Pham, Baoquan He, Barry Song, LKML,
	linux-mm

Hi, John,

John Hubbard <jhubbard@nvidia.com> writes:

> Hi,
>
> This adds a bounded sleep to migration so that FOLL_LONGTERM pinning can
> wait for transient folio references to drain, instead of failing after a
> fixed number of retries. The wait uses a one-second timeout. An

Is the one-second timeout appropriate for all users?  Do some users
prefer fail-fast behavior instead?  If so, should we add another FOLL
flag to support a timed wait?

> alternative approach would be to call wait_var_event_killable() with no
> timeout, but that doesn't match as well with migration's "this will
> probably work" API. In other words, a short sleeping wait is more
> appropriate here.
>
> When migrating pages for FOLL_LONGTERM pinning, migration can fail with
> -EAGAIN if a folio has unexpected references. These references are often
> transient, but the current retry loop gives up too quickly. This series
> adds wait_var_event_timeout() at the retry points, paired with
> wake_up_var() in folio_put() to wake the sleeper as soon as the refcount
> drops.
>
> The wake_up_var() calls in folio_put() are gated behind a static key,
> disabled by default, so non-migration workloads pay zero cost.
> migrate_pages() enables the key on entry when the reason is
> MR_LONGTERM_PIN, and disables it on exit.
>
> Toggling the key is not free. folio_put() is static inline, so every
> compilation unit that calls it gets its own patch site (roughly 500 in
> vmlinux, plus modules). On x86, jump label patching is batched (256
> sites per batch, 3 IPI rounds per batch), so enabling the key costs
> 6-9 IPI broadcasts, a few hundred microseconds on a large machine.
> That cost is paid twice per migrate_pages() call. Migration itself
> spends several milliseconds per batch on LRU isolation, TLB flushes,
> and page copies. Concurrent longterm-pin migrations after the first
> just do an atomic_inc (no patching).
>
> Matthew Brost offered to performance-test this series [1], as Intel has
> tests that stress migration and good metrics to catch regressions.
>
> [1] https://lore.kernel.org/all/aX+oUorOWPt1xbgw@lstrano-desk.jf.intel.com/
>
> John Hubbard (2):
>   mm: wake up folio refcount waiters on folio_put()
>   mm/migrate: wait for folio refcount during longterm pin migration
>
>  include/linux/mm.h |  8 ++++++++
>  mm/migrate.c       | 30 ++++++++++++++++++++++++++++++
>  mm/swap.c          | 10 +++++++++-
>  3 files changed, 47 insertions(+), 1 deletion(-)
>
>
> base-commit: 9a9c8ce300cd3859cc87b408ef552cd697cc2ab7

---
Best Regards,
Huang, Ying



* Re: [RFC PATCH 2/2] mm/migrate: wait for folio refcount during longterm pin migration
  2026-04-10  3:23 ` [RFC PATCH 2/2] mm/migrate: wait for folio refcount during longterm pin migration John Hubbard
  2026-04-21  5:57   ` Alistair Popple
@ 2026-04-21  9:21   ` Huang, Ying
  1 sibling, 0 replies; 7+ messages in thread
From: Huang, Ying @ 2026-04-21  9:21 UTC (permalink / raw)
  To: John Hubbard
  Cc: Andrew Morton, David Hildenbrand, Lorenzo Stoakes,
	Liam R . Howlett, Vlastimil Babka, Mike Rapoport,
	Suren Baghdasaryan, Michal Hocko, Zi Yan, Matthew Brost,
	Joshua Hahn, Rakie Kim, Byungchul Park, Gregory Price,
	Alistair Popple, Axel Rasmussen, Yuanchu Xie, Wei Xu, Chris Li,
	Kairui Song, Kemeng Shi, Nhat Pham, Baoquan He, Barry Song, LKML,
	linux-mm

John Hubbard <jhubbard@nvidia.com> writes:

> When migrating pages for FOLL_LONGTERM pinning (MR_LONGTERM_PIN), the
> migration can fail with -EAGAIN if the folio has unexpected references.
> These references are often transient (e.g., from GPU operations like
> cuMemset that will complete shortly).
>
> Previously, the migration code would retry up to 10 times
> (NR_MAX_MIGRATE_PAGES_RETRY), but this busy-retry approach failed when
> the transient reference holder needed more time than the retry loop
> provides.
>
> Fix this by waiting up to one second for the folio's refcount to drop
> to the expected value before retrying migration. The wait uses
> wait_var_event_timeout() paired with the wake_up_var() calls added to
> folio_put() in the previous commit. If the timeout expires, the
> existing retry loop continues as before. The folio_put_wakeup_key
> static key is enabled for the duration of migrate_pages() so that
> folio_put() only wakes waiters when migration is active.
>
> Signed-off-by: John Hubbard <jhubbard@nvidia.com>
> ---
>  mm/migrate.c | 30 ++++++++++++++++++++++++++++++
>  1 file changed, 30 insertions(+)
>
> diff --git a/mm/migrate.c b/mm/migrate.c
> index 2c3d489ecf51..a5d9f85aa376 100644
> --- a/mm/migrate.c
> +++ b/mm/migrate.c
> @@ -47,6 +47,8 @@
>  #include <asm/tlbflush.h>
>  
>  #include <trace/events/migrate.h>
> +#include <linux/jump_label.h>
> +#include <linux/wait_bit.h>
>  
>  #include "internal.h"
>  #include "swap.h"
> @@ -1732,6 +1734,17 @@ static void migrate_folios_move(struct list_head *src_folios,
>  			*retry += 1;
>  			*thp_retry += is_thp;
>  			*nr_retry_pages += nr_pages;
> +			/*
> +			 * For longterm pinning, wait for references
> +			 * to be released before retrying.
> +			 */
> +			if (reason == MR_LONGTERM_PIN) {
> +				int expected = folio_expected_ref_count(folio) + 1;
> +
> +				wait_var_event_timeout(&folio->_refcount,
> +					folio_ref_count(folio) <= expected,
> +					HZ);
> +			}
>  			break;
>  		case 0:
>  			stats->nr_succeeded += nr_pages;
> @@ -1941,6 +1954,17 @@ static int migrate_pages_batch(struct list_head *from,
>  				retry++;
>  				thp_retry += is_thp;
>  				nr_retry_pages += nr_pages;
> +				/*
> +				 * For longterm pinning, wait for references
> +				 * to be released.
> +				 */
> +				if (reason == MR_LONGTERM_PIN) {
> +					int expected = folio_expected_ref_count(folio) + 1;
> +
> +					wait_var_event_timeout(&folio->_refcount,
> +							folio_ref_count(folio) <= expected,
> +							HZ);
> +				}
>  				break;
>  			case 0:
>  				list_move_tail(&folio->lru, &unmap_folios);
> @@ -2085,6 +2109,9 @@ int migrate_pages(struct list_head *from, new_folio_t get_new_folio,
>  
>  	memset(&stats, 0, sizeof(stats));
>  
> +	if (reason == MR_LONGTERM_PIN)
> +		static_branch_inc(&folio_put_wakeup_key);
> +

This should be done in migrate_pages_sync() before the sync loop.

>  	rc_gather = migrate_hugetlbs(from, get_new_folio, put_new_folio, private,
>  				     mode, reason, &stats, &ret_folios);
>  	if (rc_gather < 0)
> @@ -2137,6 +2164,9 @@ int migrate_pages(struct list_head *from, new_folio_t get_new_folio,
>  	if (!list_empty(from))
>  		goto again;
>  out:
> +	if (reason == MR_LONGTERM_PIN)
> +		static_branch_dec(&folio_put_wakeup_key);
> +
>  	/*
>  	 * Put the permanent failure folio back to migration list, they
>  	 * will be put back to the right list by the caller.

---
Best Regards,
Huang, Ying


