* [PATCH v2 0/6] mm: better GUP pin lru_add_drain_all()
@ 2025-09-08 22:12 Hugh Dickins
2025-09-08 22:15 ` [PATCH v2 1/6] mm/gup: check ref_count instead of lru before migration Hugh Dickins
` (5 more replies)
0 siblings, 6 replies; 13+ messages in thread
From: Hugh Dickins @ 2025-09-08 22:12 UTC (permalink / raw)
To: Andrew Morton
Cc: Alexander Krabler, Aneesh Kumar K.V, Axel Rasmussen, Chris Li,
Christoph Hellwig, David Hildenbrand, Frederick Mayle,
Jason Gunthorpe, Johannes Weiner, John Hubbard, Keir Fraser,
Konstantin Khlebnikov, Li Zhe, Matthew Wilcox, Peter Xu,
Rik van Riel, Shivank Garg, Vlastimil Babka, Wei Xu, Will Deacon,
yangge, Yuanchu Xie, Yu Zhao, linux-kernel, linux-mm
A series of lru_add_drain_all()-related patches, arising from Will
Deacon's recent mm/gup migration report. Based on 6.17-rc3, but they
apply to replace v1 in mm.git. I suggest all but 6/6 be hotfixes
going to 6.17 and stable.
v1 was at
https://lore.kernel.org/linux-mm/a28b44f7-cdb4-8b81-4982-758ae774fbf7@google.com/
and its
1/7 mm: fix folio_expected_ref_count() when PG_private_2
has been dropped from v2, per Matthew Wilcox.
1/6 mm/gup: check ref_count instead of lru before migration
v1->v2: paragraph on PG_private_2 added to commit message
2/6 mm/gup: local lru_add_drain() to avoid lru_add_drain_all()
v1->v2: lru_add_drain() only when needed, per David
3/6 mm: Revert "mm/gup: clear the LRU flag of a page before adding to LRU batch"
v1->v2: Acked-by David added
4/6 mm: Revert "mm: vmscan.c: fix OOM on swap stress test"
v1->v2: Acked-by David added
5/6 mm: folio_may_be_lru_cached() unless folio_test_large()
v1->v2: folio_may_be_lru_cached(): lru_ per David
6/6 mm: lru_add_drain_all() do local lru_add_drain() first
v1->v2: Acked-by David added
include/linux/swap.h | 10 ++++++++
mm/gup.c | 14 ++++++++---
mm/mlock.c | 6 ++--
mm/swap.c | 53 +++++++++++++++++++++++--------------------
mm/vmscan.c | 2 -
5 files changed, 54 insertions(+), 31 deletions(-)
Thanks,
Hugh
* [PATCH v2 1/6] mm/gup: check ref_count instead of lru before migration
2025-09-08 22:12 [PATCH v2 0/6] mm: better GUP pin lru_add_drain_all() Hugh Dickins
@ 2025-09-08 22:15 ` Hugh Dickins
2025-09-09 7:54 ` David Hildenbrand
2025-09-09 10:48 ` Kiryl Shutsemau
2025-09-08 22:16 ` [PATCH v2 2/6] mm/gup: local lru_add_drain() to avoid lru_add_drain_all() Hugh Dickins
` (4 subsequent siblings)
5 siblings, 2 replies; 13+ messages in thread
From: Hugh Dickins @ 2025-09-08 22:15 UTC (permalink / raw)
To: Andrew Morton
Cc: Alexander Krabler, Aneesh Kumar K.V, Axel Rasmussen, Chris Li,
Christoph Hellwig, David Hildenbrand, Frederick Mayle,
Jason Gunthorpe, Johannes Weiner, John Hubbard, Keir Fraser,
Konstantin Khlebnikov, Li Zhe, Matthew Wilcox, Peter Xu,
Rik van Riel, Shivank Garg, Vlastimil Babka, Wei Xu, Will Deacon,
yangge, Yuanchu Xie, Yu Zhao, linux-kernel, linux-mm
Will Deacon reports:-
When taking a longterm GUP pin via pin_user_pages(),
__gup_longterm_locked() tries to migrate target folios that should not
be longterm pinned, for example because they reside in a CMA region or
movable zone. This is done by first pinning all of the target folios
anyway, collecting all of the longterm-unpinnable target folios into a
list, dropping the pins that were just taken and finally handing the
list off to migrate_pages() for the actual migration.
It is critically important that no unexpected references are held on the
folios being migrated, otherwise the migration will fail and
pin_user_pages() will return -ENOMEM to its caller. Unfortunately, it is
relatively easy to observe migration failures when running pKVM (which
uses pin_user_pages() on crosvm's virtual address space to resolve
stage-2 page faults from the guest) on a 6.15-based Pixel 6 device and
this results in the VM terminating prematurely.
In the failure case, 'crosvm' has called mlock(MLOCK_ONFAULT) on its
mapping of guest memory prior to the pinning. Subsequently, when
pin_user_pages() walks the page-table, the relevant 'pte' is not
present and so the faulting logic allocates a new folio, mlocks it
with mlock_folio() and maps it in the page-table.
Since commit 2fbb0c10d1e8 ("mm/munlock: mlock_page() munlock_page()
batch by pagevec"), mlock/munlock operations on a folio (formerly page),
are deferred. For example, mlock_folio() takes an additional reference
on the target folio before placing it into a per-cpu 'folio_batch' for
later processing by mlock_folio_batch(), which drops the refcount once
the operation is complete. Processing of the batches is coupled with
the LRU batch logic and can be forcefully drained with
lru_add_drain_all() but as long as a folio remains unprocessed on the
batch, its refcount will be elevated.
This deferred batching therefore interacts poorly with the pKVM pinning
scenario as we can find ourselves in a situation where the migration
code fails to migrate a folio due to the elevated refcount from the
pending mlock operation.
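For reference, the deferral in mlock_folio() follows roughly this
pattern (a simplified sketch abridged from mm/mlock.c; the NR_MLOCK
statistics updates are omitted):

	void mlock_folio(struct folio *folio)
	{
		struct folio_batch *fbatch;

		local_lock(&mlock_fbatch.lock);
		fbatch = this_cpu_ptr(&mlock_fbatch.fbatch);

		/* ... NR_MLOCK accounting elided ... */

		/* extra reference held until the batch is processed */
		folio_get(folio);
		if (!folio_batch_add(fbatch, mlock_lru(folio)) ||
		    folio_test_large(folio) || lru_cache_disabled())
			mlock_folio_batch(fbatch); /* drains, drops those refs */
		local_unlock(&mlock_fbatch.lock);
	}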
Hugh Dickins adds:-
!folio_test_lru() has never been a very reliable way to tell if an
lru_add_drain_all() is worth calling, to remove LRU cache references
to make the folio migratable: the LRU flag may be set even while the
folio is held with an extra reference in a per-CPU LRU cache.
5.18 commit 2fbb0c10d1e8 may have made it more unreliable. Then 6.11
commit 33dfe9204f29 ("mm/gup: clear the LRU flag of a page before adding
to LRU batch") tried to make it reliable, by moving LRU flag clearing;
but missed the mlock/munlock batches, so still unreliable as reported.
And it turns out to be difficult to extend 33dfe9204f29's LRU flag
clearing to the mlock/munlock batches: if they do benefit from batching,
mlock/munlock cannot be so effective when easily suppressed while !LRU.
Instead, switch to an expected ref_count check, which was more reliable
all along: some more false positives (unhelpful drains) than before, and
never a guarantee that the folio will prove migratable, but better.
Note on PG_private_2: ceph and nfs are still using the deprecated
PG_private_2 flag, with the aid of netfs and filemap support functions.
Although it is consistently matched by an increment of folio ref_count,
folio_expected_ref_count() intentionally does not recognize it, and ceph
folio migration currently depends on that for PG_private_2 folios to be
rejected. New references to the deprecated flag are discouraged, so do
not add it into the collect_longterm_unpinnable_folios() calculation:
but longterm pinning of transiently PG_private_2 ceph and nfs folios
(an uncommon case) may invoke a redundant lru_add_drain_all(). And
this makes easy the backport to earlier releases: up to and including
6.12, btrfs also used PG_private_2, but without a ref_count increment.
Note for stable backports: requires 6.16 commit 86ebd50224c0 ("mm:
add folio_expected_ref_count() for reference count calculation").
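For orientation, a simplified paraphrase of what
folio_expected_ref_count() accounts for (the authoritative helper added
by 86ebd50224c0 lives in include/linux/mm.h and handles further corner
cases; this is only a sketch):

	/* sketch only: see folio_expected_ref_count() in include/linux/mm.h */
	static inline int expected_refs_sketch(struct folio *folio)
	{
		int refs = 0;

		if (folio_test_anon(folio)) {
			/* one reference per page from the swap cache */
			if (folio_test_swapcache(folio))
				refs += folio_nr_pages(folio);
		} else {
			/* one reference per page from the page cache */
			if (folio->mapping)
				refs += folio_nr_pages(folio);
			/* one from PG_private (PG_private_2 deliberately not counted) */
			if (folio_test_private(folio))
				refs++;
		}
		/* one reference per page table mapping */
		return refs + folio_mapcount(folio);
	}

collect_longterm_unpinnable_folios() compares folio_ref_count() against
this plus 1, the 1 being the pin which GUP itself has just taken.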
Reported-by: Will Deacon <will@kernel.org>
Closes: https://lore.kernel.org/linux-mm/20250815101858.24352-1-will@kernel.org/
Fixes: 9a4e9f3b2d73 ("mm: update get_user_pages_longterm to migrate pages allocated from CMA region")
Signed-off-by: Hugh Dickins <hughd@google.com>
Cc: <stable@vger.kernel.org>
---
mm/gup.c | 3 ++-
1 file changed, 2 insertions(+), 1 deletion(-)
diff --git a/mm/gup.c b/mm/gup.c
index adffe663594d..82aec6443c0a 100644
--- a/mm/gup.c
+++ b/mm/gup.c
@@ -2307,7 +2307,8 @@ static unsigned long collect_longterm_unpinnable_folios(
continue;
}
- if (!folio_test_lru(folio) && drain_allow) {
+ if (drain_allow && folio_ref_count(folio) !=
+ folio_expected_ref_count(folio) + 1) {
lru_add_drain_all();
drain_allow = false;
}
--
2.51.0
* [PATCH v2 2/6] mm/gup: local lru_add_drain() to avoid lru_add_drain_all()
2025-09-08 22:12 [PATCH v2 0/6] mm: better GUP pin lru_add_drain_all() Hugh Dickins
2025-09-08 22:15 ` [PATCH v2 1/6] mm/gup: check ref_count instead of lru before migration Hugh Dickins
@ 2025-09-08 22:16 ` Hugh Dickins
2025-09-09 7:56 ` David Hildenbrand
2025-09-08 22:19 ` [PATCH v2 3/6] mm: Revert "mm/gup: clear the LRU flag of a page before adding to LRU batch" Hugh Dickins
` (3 subsequent siblings)
5 siblings, 1 reply; 13+ messages in thread
From: Hugh Dickins @ 2025-09-08 22:16 UTC (permalink / raw)
To: Andrew Morton
Cc: Alexander Krabler, Aneesh Kumar K.V, Axel Rasmussen, Chris Li,
Christoph Hellwig, David Hildenbrand, Frederick Mayle,
Jason Gunthorpe, Johannes Weiner, John Hubbard, Keir Fraser,
Konstantin Khlebnikov, Li Zhe, Matthew Wilcox, Peter Xu,
Rik van Riel, Shivank Garg, Vlastimil Babka, Wei Xu, Will Deacon,
yangge, Yuanchu Xie, Yu Zhao, linux-kernel, linux-mm
In many cases, if collect_longterm_unpinnable_folios() does need to
drain the LRU cache to release a reference, the cache in question is
on this same CPU, and much more efficiently drained by a preliminary
local lru_add_drain(), than the later cross-CPU lru_add_drain_all().
Marked for stable, to counter the increase in lru_add_drain_all()s
from "mm/gup: check ref_count instead of lru before migration".
Note for clean backports: can take 6.16 commit a03db236aebf ("gup:
optimize longterm pin_user_pages() for large folio") first.
Signed-off-by: Hugh Dickins <hughd@google.com>
Cc: <stable@vger.kernel.org>
---
mm/gup.c | 15 +++++++++++----
1 file changed, 11 insertions(+), 4 deletions(-)
diff --git a/mm/gup.c b/mm/gup.c
index 82aec6443c0a..b47066a54f52 100644
--- a/mm/gup.c
+++ b/mm/gup.c
@@ -2287,8 +2287,8 @@ static unsigned long collect_longterm_unpinnable_folios(
struct pages_or_folios *pofs)
{
unsigned long collected = 0;
- bool drain_allow = true;
struct folio *folio;
+ int drained = 0;
long i = 0;
for (folio = pofs_get_folio(pofs, i); folio;
@@ -2307,10 +2307,17 @@ static unsigned long collect_longterm_unpinnable_folios(
continue;
}
- if (drain_allow && folio_ref_count(folio) !=
- folio_expected_ref_count(folio) + 1) {
+ if (drained == 0 &&
+ folio_ref_count(folio) !=
+ folio_expected_ref_count(folio) + 1) {
+ lru_add_drain();
+ drained = 1;
+ }
+ if (drained == 1 &&
+ folio_ref_count(folio) !=
+ folio_expected_ref_count(folio) + 1) {
lru_add_drain_all();
- drain_allow = false;
+ drained = 2;
}
if (!folio_isolate_lru(folio))
--
2.51.0
* [PATCH v2 3/6] mm: Revert "mm/gup: clear the LRU flag of a page before adding to LRU batch"
2025-09-08 22:12 [PATCH v2 0/6] mm: better GUP pin lru_add_drain_all() Hugh Dickins
2025-09-08 22:15 ` [PATCH v2 1/6] mm/gup: check ref_count instead of lru before migration Hugh Dickins
2025-09-08 22:16 ` [PATCH v2 2/6] mm/gup: local lru_add_drain() to avoid lru_add_drain_all() Hugh Dickins
@ 2025-09-08 22:19 ` Hugh Dickins
2025-09-08 22:21 ` [PATCH v2 4/6] mm: Revert "mm: vmscan.c: fix OOM on swap stress test" Hugh Dickins
` (2 subsequent siblings)
5 siblings, 0 replies; 13+ messages in thread
From: Hugh Dickins @ 2025-09-08 22:19 UTC (permalink / raw)
To: Andrew Morton
Cc: Alexander Krabler, Aneesh Kumar K.V, Axel Rasmussen, Chris Li,
Christoph Hellwig, David Hildenbrand, Frederick Mayle,
Jason Gunthorpe, Johannes Weiner, John Hubbard, Keir Fraser,
Konstantin Khlebnikov, Li Zhe, Matthew Wilcox, Peter Xu,
Rik van Riel, Shivank Garg, Vlastimil Babka, Wei Xu, Will Deacon,
yangge, Yuanchu Xie, Yu Zhao, linux-kernel, linux-mm
This reverts commit 33dfe9204f29b415bbc0abb1a50642d1ba94f5e9:
now that collect_longterm_unpinnable_folios() is checking ref_count
instead of lru, and mlock/munlock do not participate in the revised
LRU flag clearing, those changes are misleading, and enlarge the
window during which mlock/munlock may miss an mlock_count update.
It is possible (I'd hesitate to claim probable) that the greater
likelihood of missed mlock_count updates would explain the "Realtime
threads delayed due to kcompactd0" observed on 6.12 in the Link below.
If that is the case, this reversion will help; but a complete solution
needs also a further patch, beyond the scope of this series.
Included some 80-column cleanup around folio_batch_add_and_move().
The role of folio_test_clear_lru() (before taking per-memcg lru_lock)
is questionable since 6.13 removed mem_cgroup_move_account() etc; but
perhaps there are still some races which need it - not examined here.
Link: https://lore.kernel.org/linux-mm/DU0PR01MB10385345F7153F334100981888259A@DU0PR01MB10385.eurprd01.prod.exchangelabs.com/
Signed-off-by: Hugh Dickins <hughd@google.com>
Acked-by: David Hildenbrand <david@redhat.com>
Cc: <stable@vger.kernel.org>
---
mm/swap.c | 50 ++++++++++++++++++++++++++------------------------
1 file changed, 26 insertions(+), 24 deletions(-)
diff --git a/mm/swap.c b/mm/swap.c
index 3632dd061beb..6ae2d5680574 100644
--- a/mm/swap.c
+++ b/mm/swap.c
@@ -164,6 +164,10 @@ static void folio_batch_move_lru(struct folio_batch *fbatch, move_fn_t move_fn)
for (i = 0; i < folio_batch_count(fbatch); i++) {
struct folio *folio = fbatch->folios[i];
+ /* block memcg migration while the folio moves between lru */
+ if (move_fn != lru_add && !folio_test_clear_lru(folio))
+ continue;
+
folio_lruvec_relock_irqsave(folio, &lruvec, &flags);
move_fn(lruvec, folio);
@@ -176,14 +180,10 @@ static void folio_batch_move_lru(struct folio_batch *fbatch, move_fn_t move_fn)
}
static void __folio_batch_add_and_move(struct folio_batch __percpu *fbatch,
- struct folio *folio, move_fn_t move_fn,
- bool on_lru, bool disable_irq)
+ struct folio *folio, move_fn_t move_fn, bool disable_irq)
{
unsigned long flags;
- if (on_lru && !folio_test_clear_lru(folio))
- return;
-
folio_get(folio);
if (disable_irq)
@@ -191,8 +191,8 @@ static void __folio_batch_add_and_move(struct folio_batch __percpu *fbatch,
else
local_lock(&cpu_fbatches.lock);
- if (!folio_batch_add(this_cpu_ptr(fbatch), folio) || folio_test_large(folio) ||
- lru_cache_disabled())
+ if (!folio_batch_add(this_cpu_ptr(fbatch), folio) ||
+ folio_test_large(folio) || lru_cache_disabled())
folio_batch_move_lru(this_cpu_ptr(fbatch), move_fn);
if (disable_irq)
@@ -201,13 +201,13 @@ static void __folio_batch_add_and_move(struct folio_batch __percpu *fbatch,
local_unlock(&cpu_fbatches.lock);
}
-#define folio_batch_add_and_move(folio, op, on_lru) \
- __folio_batch_add_and_move( \
- &cpu_fbatches.op, \
- folio, \
- op, \
- on_lru, \
- offsetof(struct cpu_fbatches, op) >= offsetof(struct cpu_fbatches, lock_irq) \
+#define folio_batch_add_and_move(folio, op) \
+ __folio_batch_add_and_move( \
+ &cpu_fbatches.op, \
+ folio, \
+ op, \
+ offsetof(struct cpu_fbatches, op) >= \
+ offsetof(struct cpu_fbatches, lock_irq) \
)
static void lru_move_tail(struct lruvec *lruvec, struct folio *folio)
@@ -231,10 +231,10 @@ static void lru_move_tail(struct lruvec *lruvec, struct folio *folio)
void folio_rotate_reclaimable(struct folio *folio)
{
if (folio_test_locked(folio) || folio_test_dirty(folio) ||
- folio_test_unevictable(folio))
+ folio_test_unevictable(folio) || !folio_test_lru(folio))
return;
- folio_batch_add_and_move(folio, lru_move_tail, true);
+ folio_batch_add_and_move(folio, lru_move_tail);
}
void lru_note_cost_unlock_irq(struct lruvec *lruvec, bool file,
@@ -328,10 +328,11 @@ static void folio_activate_drain(int cpu)
void folio_activate(struct folio *folio)
{
- if (folio_test_active(folio) || folio_test_unevictable(folio))
+ if (folio_test_active(folio) || folio_test_unevictable(folio) ||
+ !folio_test_lru(folio))
return;
- folio_batch_add_and_move(folio, lru_activate, true);
+ folio_batch_add_and_move(folio, lru_activate);
}
#else
@@ -507,7 +508,7 @@ void folio_add_lru(struct folio *folio)
lru_gen_in_fault() && !(current->flags & PF_MEMALLOC))
folio_set_active(folio);
- folio_batch_add_and_move(folio, lru_add, false);
+ folio_batch_add_and_move(folio, lru_add);
}
EXPORT_SYMBOL(folio_add_lru);
@@ -685,13 +686,13 @@ void lru_add_drain_cpu(int cpu)
void deactivate_file_folio(struct folio *folio)
{
/* Deactivating an unevictable folio will not accelerate reclaim */
- if (folio_test_unevictable(folio))
+ if (folio_test_unevictable(folio) || !folio_test_lru(folio))
return;
if (lru_gen_enabled() && lru_gen_clear_refs(folio))
return;
- folio_batch_add_and_move(folio, lru_deactivate_file, true);
+ folio_batch_add_and_move(folio, lru_deactivate_file);
}
/*
@@ -704,13 +705,13 @@ void deactivate_file_folio(struct folio *folio)
*/
void folio_deactivate(struct folio *folio)
{
- if (folio_test_unevictable(folio))
+ if (folio_test_unevictable(folio) || !folio_test_lru(folio))
return;
if (lru_gen_enabled() ? lru_gen_clear_refs(folio) : !folio_test_active(folio))
return;
- folio_batch_add_and_move(folio, lru_deactivate, true);
+ folio_batch_add_and_move(folio, lru_deactivate);
}
/**
@@ -723,10 +724,11 @@ void folio_deactivate(struct folio *folio)
void folio_mark_lazyfree(struct folio *folio)
{
if (!folio_test_anon(folio) || !folio_test_swapbacked(folio) ||
+ !folio_test_lru(folio) ||
folio_test_swapcache(folio) || folio_test_unevictable(folio))
return;
- folio_batch_add_and_move(folio, lru_lazyfree, true);
+ folio_batch_add_and_move(folio, lru_lazyfree);
}
void lru_add_drain(void)
--
2.51.0
* [PATCH v2 4/6] mm: Revert "mm: vmscan.c: fix OOM on swap stress test"
2025-09-08 22:12 [PATCH v2 0/6] mm: better GUP pin lru_add_drain_all() Hugh Dickins
` (2 preceding siblings ...)
2025-09-08 22:19 ` [PATCH v2 3/6] mm: Revert "mm/gup: clear the LRU flag of a page before adding to LRU batch" Hugh Dickins
@ 2025-09-08 22:21 ` Hugh Dickins
2025-09-08 22:23 ` [PATCH v2 5/6] mm: folio_may_be_lru_cached() unless folio_test_large() Hugh Dickins
2025-09-08 22:24 ` [PATCH v2 6/6] mm: lru_add_drain_all() do local lru_add_drain() first Hugh Dickins
5 siblings, 0 replies; 13+ messages in thread
From: Hugh Dickins @ 2025-09-08 22:21 UTC (permalink / raw)
To: Andrew Morton
Cc: Alexander Krabler, Aneesh Kumar K.V, Axel Rasmussen, Chris Li,
Christoph Hellwig, David Hildenbrand, Frederick Mayle,
Jason Gunthorpe, Johannes Weiner, John Hubbard, Keir Fraser,
Konstantin Khlebnikov, Li Zhe, Matthew Wilcox, Peter Xu,
Rik van Riel, Shivank Garg, Vlastimil Babka, Wei Xu, Will Deacon,
yangge, Yuanchu Xie, Yu Zhao, linux-kernel, linux-mm
This reverts commit 0885ef4705607936fc36a38fd74356e1c465b023: that
was a fix to the reverted 33dfe9204f29b415bbc0abb1a50642d1ba94f5e9.
Signed-off-by: Hugh Dickins <hughd@google.com>
Acked-by: David Hildenbrand <david@redhat.com>
Cc: <stable@vger.kernel.org>
---
mm/vmscan.c | 2 +-
1 file changed, 1 insertion(+), 1 deletion(-)
diff --git a/mm/vmscan.c b/mm/vmscan.c
index a48aec8bfd92..674999999cd0 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -4507,7 +4507,7 @@ static bool sort_folio(struct lruvec *lruvec, struct folio *folio, struct scan_c
}
/* ineligible */
- if (!folio_test_lru(folio) || zone > sc->reclaim_idx) {
+ if (zone > sc->reclaim_idx) {
gen = folio_inc_gen(lruvec, folio, false);
list_move_tail(&folio->lru, &lrugen->folios[gen][type][zone]);
return true;
--
2.51.0
* [PATCH v2 5/6] mm: folio_may_be_lru_cached() unless folio_test_large()
2025-09-08 22:12 [PATCH v2 0/6] mm: better GUP pin lru_add_drain_all() Hugh Dickins
` (3 preceding siblings ...)
2025-09-08 22:21 ` [PATCH v2 4/6] mm: Revert "mm: vmscan.c: fix OOM on swap stress test" Hugh Dickins
@ 2025-09-08 22:23 ` Hugh Dickins
2025-09-09 7:57 ` David Hildenbrand
2025-09-08 22:24 ` [PATCH v2 6/6] mm: lru_add_drain_all() do local lru_add_drain() first Hugh Dickins
5 siblings, 1 reply; 13+ messages in thread
From: Hugh Dickins @ 2025-09-08 22:23 UTC (permalink / raw)
To: Andrew Morton
Cc: Alexander Krabler, Aneesh Kumar K.V, Axel Rasmussen, Chris Li,
Christoph Hellwig, David Hildenbrand, Frederick Mayle,
Jason Gunthorpe, Johannes Weiner, John Hubbard, Keir Fraser,
Konstantin Khlebnikov, Li Zhe, Matthew Wilcox, Peter Xu,
Rik van Riel, Shivank Garg, Vlastimil Babka, Wei Xu, Will Deacon,
yangge, Yuanchu Xie, Yu Zhao, linux-kernel, linux-mm
mm/swap.c and mm/mlock.c agree to drain any per-CPU batch as soon as
a large folio is added: so collect_longterm_unpinnable_folios() just
wastes effort when calling lru_add_drain[_all]() on a large folio.
But although there is good reason not to batch up PMD-sized folios,
we might well benefit from batching a small number of low-order mTHPs
(though unclear how that "small number" limitation will be implemented).
So ask if folio_may_be_lru_cached() rather than !folio_test_large(), to
insulate those particular checks from future change. Name preferred
to "folio_is_batchable" because large folios can well be put on a batch:
it's just the per-CPU LRU caches, drained much later, which need care.
Marked for stable, to counter the increase in lru_add_drain_all()s
from "mm/gup: check ref_count instead of lru before migration".
Suggested-by: David Hildenbrand <david@redhat.com>
Signed-off-by: Hugh Dickins <hughd@google.com>
Cc: <stable@vger.kernel.org>
---
include/linux/swap.h | 10 ++++++++++
mm/gup.c | 4 ++--
mm/mlock.c | 6 +++---
mm/swap.c | 2 +-
4 files changed, 16 insertions(+), 6 deletions(-)
diff --git a/include/linux/swap.h b/include/linux/swap.h
index 2fe6ed2cc3fd..7012a0f758d8 100644
--- a/include/linux/swap.h
+++ b/include/linux/swap.h
@@ -385,6 +385,16 @@ void folio_add_lru_vma(struct folio *, struct vm_area_struct *);
void mark_page_accessed(struct page *);
void folio_mark_accessed(struct folio *);
+static inline bool folio_may_be_lru_cached(struct folio *folio)
+{
+ /*
+ * Holding PMD-sized folios in per-CPU LRU cache unbalances accounting.
+ * Holding small numbers of low-order mTHP folios in per-CPU LRU cache
+ * will be sensible, but nobody has implemented and tested that yet.
+ */
+ return !folio_test_large(folio);
+}
+
extern atomic_t lru_disable_count;
static inline bool lru_cache_disabled(void)
diff --git a/mm/gup.c b/mm/gup.c
index b47066a54f52..0bc4d140fc07 100644
--- a/mm/gup.c
+++ b/mm/gup.c
@@ -2307,13 +2307,13 @@ static unsigned long collect_longterm_unpinnable_folios(
continue;
}
- if (drained == 0 &&
+ if (drained == 0 && folio_may_be_lru_cached(folio) &&
folio_ref_count(folio) !=
folio_expected_ref_count(folio) + 1) {
lru_add_drain();
drained = 1;
}
- if (drained == 1 &&
+ if (drained == 1 && folio_may_be_lru_cached(folio) &&
folio_ref_count(folio) !=
folio_expected_ref_count(folio) + 1) {
lru_add_drain_all();
diff --git a/mm/mlock.c b/mm/mlock.c
index a1d93ad33c6d..bb0776f5ef7c 100644
--- a/mm/mlock.c
+++ b/mm/mlock.c
@@ -255,7 +255,7 @@ void mlock_folio(struct folio *folio)
folio_get(folio);
if (!folio_batch_add(fbatch, mlock_lru(folio)) ||
- folio_test_large(folio) || lru_cache_disabled())
+ !folio_may_be_lru_cached(folio) || lru_cache_disabled())
mlock_folio_batch(fbatch);
local_unlock(&mlock_fbatch.lock);
}
@@ -278,7 +278,7 @@ void mlock_new_folio(struct folio *folio)
folio_get(folio);
if (!folio_batch_add(fbatch, mlock_new(folio)) ||
- folio_test_large(folio) || lru_cache_disabled())
+ !folio_may_be_lru_cached(folio) || lru_cache_disabled())
mlock_folio_batch(fbatch);
local_unlock(&mlock_fbatch.lock);
}
@@ -299,7 +299,7 @@ void munlock_folio(struct folio *folio)
*/
folio_get(folio);
if (!folio_batch_add(fbatch, folio) ||
- folio_test_large(folio) || lru_cache_disabled())
+ !folio_may_be_lru_cached(folio) || lru_cache_disabled())
mlock_folio_batch(fbatch);
local_unlock(&mlock_fbatch.lock);
}
diff --git a/mm/swap.c b/mm/swap.c
index 6ae2d5680574..b74ebe865dd9 100644
--- a/mm/swap.c
+++ b/mm/swap.c
@@ -192,7 +192,7 @@ static void __folio_batch_add_and_move(struct folio_batch __percpu *fbatch,
local_lock(&cpu_fbatches.lock);
if (!folio_batch_add(this_cpu_ptr(fbatch), folio) ||
- folio_test_large(folio) || lru_cache_disabled())
+ !folio_may_be_lru_cached(folio) || lru_cache_disabled())
folio_batch_move_lru(this_cpu_ptr(fbatch), move_fn);
if (disable_irq)
--
2.51.0
* [PATCH v2 6/6] mm: lru_add_drain_all() do local lru_add_drain() first
2025-09-08 22:12 [PATCH v2 0/6] mm: better GUP pin lru_add_drain_all() Hugh Dickins
` (4 preceding siblings ...)
2025-09-08 22:23 ` [PATCH v2 5/6] mm: folio_may_be_lru_cached() unless folio_test_large() Hugh Dickins
@ 2025-09-08 22:24 ` Hugh Dickins
5 siblings, 0 replies; 13+ messages in thread
From: Hugh Dickins @ 2025-09-08 22:24 UTC (permalink / raw)
To: Andrew Morton
Cc: Alexander Krabler, Aneesh Kumar K.V, Axel Rasmussen, Chris Li,
Christoph Hellwig, David Hildenbrand, Frederick Mayle,
Jason Gunthorpe, Johannes Weiner, John Hubbard, Keir Fraser,
Konstantin Khlebnikov, Li Zhe, Matthew Wilcox, Peter Xu,
Rik van Riel, Shivank Garg, Vlastimil Babka, Wei Xu, Will Deacon,
yangge, Yuanchu Xie, Yu Zhao, linux-kernel, linux-mm
No numbers to back this up, but it seemed obvious to me that if there
are competing lru_add_drain_all()ers, the work will be minimized if each
flushes its own local queues before locking and doing cross-CPU drains.
Signed-off-by: Hugh Dickins <hughd@google.com>
Acked-by: David Hildenbrand <david@redhat.com>
---
mm/swap.c | 3 +++
1 file changed, 3 insertions(+)
diff --git a/mm/swap.c b/mm/swap.c
index b74ebe865dd9..881e53b2877e 100644
--- a/mm/swap.c
+++ b/mm/swap.c
@@ -834,6 +834,9 @@ static inline void __lru_add_drain_all(bool force_all_cpus)
*/
this_gen = smp_load_acquire(&lru_drain_gen);
+ /* It helps everyone if we do our own local drain immediately. */
+ lru_add_drain();
+
mutex_lock(&lock);
/*
--
2.51.0
* Re: [PATCH v2 1/6] mm/gup: check ref_count instead of lru before migration
2025-09-08 22:15 ` [PATCH v2 1/6] mm/gup: check ref_count instead of lru before migration Hugh Dickins
@ 2025-09-09 7:54 ` David Hildenbrand
2025-09-09 10:48 ` Kiryl Shutsemau
1 sibling, 0 replies; 13+ messages in thread
From: David Hildenbrand @ 2025-09-09 7:54 UTC (permalink / raw)
To: Hugh Dickins, Andrew Morton
Cc: Alexander Krabler, Aneesh Kumar K.V, Axel Rasmussen, Chris Li,
Christoph Hellwig, Frederick Mayle, Jason Gunthorpe,
Johannes Weiner, John Hubbard, Keir Fraser,
Konstantin Khlebnikov, Li Zhe, Matthew Wilcox, Peter Xu,
Rik van Riel, Shivank Garg, Vlastimil Babka, Wei Xu, Will Deacon,
yangge, Yuanchu Xie, Yu Zhao, linux-kernel, linux-mm
On 09.09.25 00:15, Hugh Dickins wrote:
> Will Deacon reports:-
>
> When taking a longterm GUP pin via pin_user_pages(),
> __gup_longterm_locked() tries to migrate target folios that should not
> be longterm pinned, for example because they reside in a CMA region or
> movable zone. This is done by first pinning all of the target folios
> anyway, collecting all of the longterm-unpinnable target folios into a
> list, dropping the pins that were just taken and finally handing the
> list off to migrate_pages() for the actual migration.
>
> It is critically important that no unexpected references are held on the
> folios being migrated, otherwise the migration will fail and
> pin_user_pages() will return -ENOMEM to its caller. Unfortunately, it is
> relatively easy to observe migration failures when running pKVM (which
> uses pin_user_pages() on crosvm's virtual address space to resolve
> stage-2 page faults from the guest) on a 6.15-based Pixel 6 device and
> this results in the VM terminating prematurely.
>
> In the failure case, 'crosvm' has called mlock(MLOCK_ONFAULT) on its
> mapping of guest memory prior to the pinning. Subsequently, when
> pin_user_pages() walks the page-table, the relevant 'pte' is not
> present and so the faulting logic allocates a new folio, mlocks it
> with mlock_folio() and maps it in the page-table.
>
> Since commit 2fbb0c10d1e8 ("mm/munlock: mlock_page() munlock_page()
> batch by pagevec"), mlock/munlock operations on a folio (formerly page),
> are deferred. For example, mlock_folio() takes an additional reference
> on the target folio before placing it into a per-cpu 'folio_batch' for
> later processing by mlock_folio_batch(), which drops the refcount once
> the operation is complete. Processing of the batches is coupled with
> the LRU batch logic and can be forcefully drained with
> lru_add_drain_all() but as long as a folio remains unprocessed on the
> batch, its refcount will be elevated.
>
> This deferred batching therefore interacts poorly with the pKVM pinning
> scenario as we can find ourselves in a situation where the migration
> code fails to migrate a folio due to the elevated refcount from the
> pending mlock operation.
>
> Hugh Dickins adds:-
>
> !folio_test_lru() has never been a very reliable way to tell if an
> lru_add_drain_all() is worth calling, to remove LRU cache references
> to make the folio migratable: the LRU flag may be set even while the
> folio is held with an extra reference in a per-CPU LRU cache.
>
> 5.18 commit 2fbb0c10d1e8 may have made it more unreliable. Then 6.11
> commit 33dfe9204f29 ("mm/gup: clear the LRU flag of a page before adding
> to LRU batch") tried to make it reliable, by moving LRU flag clearing;
> but missed the mlock/munlock batches, so still unreliable as reported.
>
> And it turns out to be difficult to extend 33dfe9204f29's LRU flag
> clearing to the mlock/munlock batches: if they do benefit from batching,
> mlock/munlock cannot be so effective when easily suppressed while !LRU.
>
> Instead, switch to an expected ref_count check, which was more reliable
> all along: some more false positives (unhelpful drains) than before, and
> never a guarantee that the folio will prove migratable, but better.
>
> Note on PG_private_2: ceph and nfs are still using the deprecated
> PG_private_2 flag, with the aid of netfs and filemap support functions.
> Although it is consistently matched by an increment of folio ref_count,
> folio_expected_ref_count() intentionally does not recognize it, and ceph
> folio migration currently depends on that for PG_private_2 folios to be
> rejected. New references to the deprecated flag are discouraged, so do
> not add it into the collect_longterm_unpinnable_folios() calculation:
> but longterm pinning of transiently PG_private_2 ceph and nfs folios
> (an uncommon case) may invoke a redundant lru_add_drain_all(). And
> this makes easy the backport to earlier releases: up to and including
> 6.12, btrfs also used PG_private_2, but without a ref_count increment.
>
> Note for stable backports: requires 6.16 commit 86ebd50224c0 ("mm:
> add folio_expected_ref_count() for reference count calculation").
>
> Reported-by: Will Deacon <will@kernel.org>
> Closes: https://lore.kernel.org/linux-mm/20250815101858.24352-1-will@kernel.org/
> Fixes: 9a4e9f3b2d73 ("mm: update get_user_pages_longterm to migrate pages allocated from CMA region")
> Signed-off-by: Hugh Dickins <hughd@google.com>
> Cc: <stable@vger.kernel.org>
> ---
> mm/gup.c | 3 ++-
> 1 file changed, 2 insertions(+), 1 deletion(-)
>
> diff --git a/mm/gup.c b/mm/gup.c
> index adffe663594d..82aec6443c0a 100644
> --- a/mm/gup.c
> +++ b/mm/gup.c
> @@ -2307,7 +2307,8 @@ static unsigned long collect_longterm_unpinnable_folios(
> continue;
> }
>
> - if (!folio_test_lru(folio) && drain_allow) {
> + if (drain_allow && folio_ref_count(folio) !=
> + folio_expected_ref_count(folio) + 1) {
> lru_add_drain_all();
> drain_allow = false;
> }
Acked-by: David Hildenbrand <david@redhat.com>
--
Cheers
David / dhildenb
* Re: [PATCH v2 2/6] mm/gup: local lru_add_drain() to avoid lru_add_drain_all()
2025-09-08 22:16 ` [PATCH v2 2/6] mm/gup: local lru_add_drain() to avoid lru_add_drain_all() Hugh Dickins
@ 2025-09-09 7:56 ` David Hildenbrand
2025-09-09 10:52 ` Kiryl Shutsemau
0 siblings, 1 reply; 13+ messages in thread
From: David Hildenbrand @ 2025-09-09 7:56 UTC (permalink / raw)
To: Hugh Dickins, Andrew Morton
Cc: Alexander Krabler, Aneesh Kumar K.V, Axel Rasmussen, Chris Li,
Christoph Hellwig, Frederick Mayle, Jason Gunthorpe,
Johannes Weiner, John Hubbard, Keir Fraser,
Konstantin Khlebnikov, Li Zhe, Matthew Wilcox, Peter Xu,
Rik van Riel, Shivank Garg, Vlastimil Babka, Wei Xu, Will Deacon,
yangge, Yuanchu Xie, Yu Zhao, linux-kernel, linux-mm
On 09.09.25 00:16, Hugh Dickins wrote:
> In many cases, if collect_longterm_unpinnable_folios() does need to
> drain the LRU cache to release a reference, the cache in question is
> on this same CPU, and much more efficiently drained by a preliminary
> local lru_add_drain(), than the later cross-CPU lru_add_drain_all().
>
> Marked for stable, to counter the increase in lru_add_drain_all()s
> from "mm/gup: check ref_count instead of lru before migration".
> Note for clean backports: can take 6.16 commit a03db236aebf ("gup:
> optimize longterm pin_user_pages() for large folio") first.
>
> Signed-off-by: Hugh Dickins <hughd@google.com>
> Cc: <stable@vger.kernel.org>
> ---
> mm/gup.c | 15 +++++++++++----
> 1 file changed, 11 insertions(+), 4 deletions(-)
>
> diff --git a/mm/gup.c b/mm/gup.c
> index 82aec6443c0a..b47066a54f52 100644
> --- a/mm/gup.c
> +++ b/mm/gup.c
> @@ -2287,8 +2287,8 @@ static unsigned long collect_longterm_unpinnable_folios(
> struct pages_or_folios *pofs)
> {
> unsigned long collected = 0;
> - bool drain_allow = true;
> struct folio *folio;
> + int drained = 0;
> long i = 0;
>
> for (folio = pofs_get_folio(pofs, i); folio;
> @@ -2307,10 +2307,17 @@ static unsigned long collect_longterm_unpinnable_folios(
> continue;
> }
>
> - if (drain_allow && folio_ref_count(folio) !=
> - folio_expected_ref_count(folio) + 1) {
> + if (drained == 0 &&
> + folio_ref_count(folio) !=
> + folio_expected_ref_count(folio) + 1) {
I would just have indented this as follows:
if (drained == 0 &&
    folio_ref_count(folio) != folio_expected_ref_count(folio) + 1) {
Same below.
In any case logic LGTM
Acked-by: David Hildenbrand <david@redhat.com>
--
Cheers
David / dhildenb
* Re: [PATCH v2 5/6] mm: folio_may_be_lru_cached() unless folio_test_large()
2025-09-08 22:23 ` [PATCH v2 5/6] mm: folio_may_be_lru_cached() unless folio_test_large() Hugh Dickins
@ 2025-09-09 7:57 ` David Hildenbrand
0 siblings, 0 replies; 13+ messages in thread
From: David Hildenbrand @ 2025-09-09 7:57 UTC (permalink / raw)
To: Hugh Dickins, Andrew Morton
Cc: Alexander Krabler, Aneesh Kumar K.V, Axel Rasmussen, Chris Li,
Christoph Hellwig, Frederick Mayle, Jason Gunthorpe,
Johannes Weiner, John Hubbard, Keir Fraser,
Konstantin Khlebnikov, Li Zhe, Matthew Wilcox, Peter Xu,
Rik van Riel, Shivank Garg, Vlastimil Babka, Wei Xu, Will Deacon,
yangge, Yuanchu Xie, Yu Zhao, linux-kernel, linux-mm
On 09.09.25 00:23, Hugh Dickins wrote:
> mm/swap.c and mm/mlock.c agree to drain any per-CPU batch as soon as
> a large folio is added: so collect_longterm_unpinnable_folios() just
> wastes effort when calling lru_add_drain[_all]() on a large folio.
>
> But although there is good reason not to batch up PMD-sized folios,
> we might well benefit from batching a small number of low-order mTHPs
> (though unclear how that "small number" limitation will be implemented).
>
> So ask if folio_may_be_lru_cached() rather than !folio_test_large(), to
> insulate those particular checks from future change. Name preferred
> to "folio_is_batchable" because large folios can well be put on a batch:
> it's just the per-CPU LRU caches, drained much later, which need care.
>
> Marked for stable, to counter the increase in lru_add_drain_all()s
> from "mm/gup: check ref_count instead of lru before migration".
>
> Suggested-by: David Hildenbrand <david@redhat.com>
> Signed-off-by: Hugh Dickins <hughd@google.com>
> Cc: <stable@vger.kernel.org>
> ---
Acked-by: David Hildenbrand <david@redhat.com>
--
Cheers
David / dhildenb
* Re: [PATCH v2 1/6] mm/gup: check ref_count instead of lru before migration
2025-09-08 22:15 ` [PATCH v2 1/6] mm/gup: check ref_count instead of lru before migration Hugh Dickins
2025-09-09 7:54 ` David Hildenbrand
@ 2025-09-09 10:48 ` Kiryl Shutsemau
1 sibling, 0 replies; 13+ messages in thread
From: Kiryl Shutsemau @ 2025-09-09 10:48 UTC (permalink / raw)
To: Hugh Dickins
Cc: Andrew Morton, Alexander Krabler, Aneesh Kumar K.V,
Axel Rasmussen, Chris Li, Christoph Hellwig, David Hildenbrand,
Frederick Mayle, Jason Gunthorpe, Johannes Weiner, John Hubbard,
Keir Fraser, Konstantin Khlebnikov, Li Zhe, Matthew Wilcox,
Peter Xu, Rik van Riel, Shivank Garg, Vlastimil Babka, Wei Xu,
Will Deacon, yangge, Yuanchu Xie, Yu Zhao, linux-kernel,
linux-mm
On Mon, Sep 08, 2025 at 03:15:03PM -0700, Hugh Dickins wrote:
> Will Deacon reports:-
>
> When taking a longterm GUP pin via pin_user_pages(),
> __gup_longterm_locked() tries to migrate target folios that should not
> be longterm pinned, for example because they reside in a CMA region or
> movable zone. This is done by first pinning all of the target folios
> anyway, collecting all of the longterm-unpinnable target folios into a
> list, dropping the pins that were just taken and finally handing the
> list off to migrate_pages() for the actual migration.
>
> It is critically important that no unexpected references are held on the
> folios being migrated, otherwise the migration will fail and
> pin_user_pages() will return -ENOMEM to its caller. Unfortunately, it is
> relatively easy to observe migration failures when running pKVM (which
> uses pin_user_pages() on crosvm's virtual address space to resolve
> stage-2 page faults from the guest) on a 6.15-based Pixel 6 device and
> this results in the VM terminating prematurely.
>
> In the failure case, 'crosvm' has called mlock(MLOCK_ONFAULT) on its
> mapping of guest memory prior to the pinning. Subsequently, when
> pin_user_pages() walks the page-table, the relevant 'pte' is not
> present and so the faulting logic allocates a new folio, mlocks it
> with mlock_folio() and maps it in the page-table.
>
> Since commit 2fbb0c10d1e8 ("mm/munlock: mlock_page() munlock_page()
> batch by pagevec"), mlock/munlock operations on a folio (formerly page),
> are deferred. For example, mlock_folio() takes an additional reference
> on the target folio before placing it into a per-cpu 'folio_batch' for
> later processing by mlock_folio_batch(), which drops the refcount once
> the operation is complete. Processing of the batches is coupled with
> the LRU batch logic and can be forcefully drained with
> lru_add_drain_all() but as long as a folio remains unprocessed on the
> batch, its refcount will be elevated.
>
> This deferred batching therefore interacts poorly with the pKVM pinning
> scenario as we can find ourselves in a situation where the migration
> code fails to migrate a folio due to the elevated refcount from the
> pending mlock operation.
>
> Hugh Dickins adds:-
>
> !folio_test_lru() has never been a very reliable way to tell if an
> lru_add_drain_all() is worth calling, to remove LRU cache references
> to make the folio migratable: the LRU flag may be set even while the
> folio is held with an extra reference in a per-CPU LRU cache.
>
> 5.18 commit 2fbb0c10d1e8 may have made it more unreliable. Then 6.11
> commit 33dfe9204f29 ("mm/gup: clear the LRU flag of a page before adding
> to LRU batch") tried to make it reliable, by moving LRU flag clearing;
> but missed the mlock/munlock batches, so still unreliable as reported.
>
> And it turns out to be difficult to extend 33dfe9204f29's LRU flag
> clearing to the mlock/munlock batches: if they do benefit from batching,
> mlock/munlock cannot be so effective when easily suppressed while !LRU.
>
> Instead, switch to an expected ref_count check, which was more reliable
> all along: some more false positives (unhelpful drains) than before, and
> never a guarantee that the folio will prove migratable, but better.
>
> Note on PG_private_2: ceph and nfs are still using the deprecated
> PG_private_2 flag, with the aid of netfs and filemap support functions.
> Although it is consistently matched by an increment of folio ref_count,
> folio_expected_ref_count() intentionally does not recognize it, and ceph
> folio migration currently depends on that for PG_private_2 folios to be
> rejected. New references to the deprecated flag are discouraged, so do
> not add it into the collect_longterm_unpinnable_folios() calculation:
> but longterm pinning of transiently PG_private_2 ceph and nfs folios
> (an uncommon case) may invoke a redundant lru_add_drain_all(). And
> this makes easy the backport to earlier releases: up to and including
> 6.12, btrfs also used PG_private_2, but without a ref_count increment.
>
> Note for stable backports: requires 6.16 commit 86ebd50224c0 ("mm:
> add folio_expected_ref_count() for reference count calculation").
>
> Reported-by: Will Deacon <will@kernel.org>
> Closes: https://lore.kernel.org/linux-mm/20250815101858.24352-1-will@kernel.org/
> Fixes: 9a4e9f3b2d73 ("mm: update get_user_pages_longterm to migrate pages allocated from CMA region")
> Signed-off-by: Hugh Dickins <hughd@google.com>
> Cc: <stable@vger.kernel.org>
Acked-by: Kiryl Shutsemau <kas@kernel.org>
--
Kiryl Shutsemau / Kirill A. Shutemov
* Re: [PATCH v2 2/6] mm/gup: local lru_add_drain() to avoid lru_add_drain_all()
2025-09-09 7:56 ` David Hildenbrand
@ 2025-09-09 10:52 ` Kiryl Shutsemau
2025-09-09 11:33 ` David Hildenbrand
0 siblings, 1 reply; 13+ messages in thread
From: Kiryl Shutsemau @ 2025-09-09 10:52 UTC (permalink / raw)
To: David Hildenbrand
Cc: Hugh Dickins, Andrew Morton, Alexander Krabler, Aneesh Kumar K.V,
Axel Rasmussen, Chris Li, Christoph Hellwig, Frederick Mayle,
Jason Gunthorpe, Johannes Weiner, John Hubbard, Keir Fraser,
Konstantin Khlebnikov, Li Zhe, Matthew Wilcox, Peter Xu,
Rik van Riel, Shivank Garg, Vlastimil Babka, Wei Xu, Will Deacon,
yangge, Yuanchu Xie, Yu Zhao, linux-kernel, linux-mm
On Tue, Sep 09, 2025 at 09:56:30AM +0200, David Hildenbrand wrote:
> On 09.09.25 00:16, Hugh Dickins wrote:
> > In many cases, if collect_longterm_unpinnable_folios() does need to
> > drain the LRU cache to release a reference, the cache in question is
> > on this same CPU, and much more efficiently drained by a preliminary
> > local lru_add_drain(), than the later cross-CPU lru_add_drain_all().
> >
> > Marked for stable, to counter the increase in lru_add_drain_all()s
> > from "mm/gup: check ref_count instead of lru before migration".
> > Note for clean backports: can take 6.16 commit a03db236aebf ("gup:
> > optimize longterm pin_user_pages() for large folio") first.
> >
> > Signed-off-by: Hugh Dickins <hughd@google.com>
> > Cc: <stable@vger.kernel.org>
> > ---
> > mm/gup.c | 15 +++++++++++----
> > 1 file changed, 11 insertions(+), 4 deletions(-)
> >
> > diff --git a/mm/gup.c b/mm/gup.c
> > index 82aec6443c0a..b47066a54f52 100644
> > --- a/mm/gup.c
> > +++ b/mm/gup.c
> > @@ -2287,8 +2287,8 @@ static unsigned long collect_longterm_unpinnable_folios(
> > struct pages_or_folios *pofs)
> > {
> > unsigned long collected = 0;
> > - bool drain_allow = true;
> > struct folio *folio;
> > + int drained = 0;
> > long i = 0;
> > for (folio = pofs_get_folio(pofs, i); folio;
> > @@ -2307,10 +2307,17 @@ static unsigned long collect_longterm_unpinnable_folios(
> > continue;
> > }
> > - if (drain_allow && folio_ref_count(folio) !=
> > - folio_expected_ref_count(folio) + 1) {
> > + if (drained == 0 &&
> > + folio_ref_count(folio) !=
> > + folio_expected_ref_count(folio) + 1) {
>
> I would just have indented this as follows:
>
> if (drained == 0 &&
> folio_ref_count(folio) != folio_expected_ref_count(folio) + 1) {
Do we want folio_check_expected_ref_count(folio, offset)?
--
Kiryl Shutsemau / Kirill A. Shutemov
* Re: [PATCH v2 2/6] mm/gup: local lru_add_drain() to avoid lru_add_drain_all()
2025-09-09 10:52 ` Kiryl Shutsemau
@ 2025-09-09 11:33 ` David Hildenbrand
0 siblings, 0 replies; 13+ messages in thread
From: David Hildenbrand @ 2025-09-09 11:33 UTC (permalink / raw)
To: Kiryl Shutsemau
Cc: Hugh Dickins, Andrew Morton, Alexander Krabler, Aneesh Kumar K.V,
Axel Rasmussen, Chris Li, Christoph Hellwig, Frederick Mayle,
Jason Gunthorpe, Johannes Weiner, John Hubbard, Keir Fraser,
Konstantin Khlebnikov, Li Zhe, Matthew Wilcox, Peter Xu,
Rik van Riel, Shivank Garg, Vlastimil Babka, Wei Xu, Will Deacon,
yangge, Yuanchu Xie, Yu Zhao, linux-kernel, linux-mm
On 09.09.25 12:52, Kiryl Shutsemau wrote:
> On Tue, Sep 09, 2025 at 09:56:30AM +0200, David Hildenbrand wrote:
>> On 09.09.25 00:16, Hugh Dickins wrote:
>>> In many cases, if collect_longterm_unpinnable_folios() does need to
>>> drain the LRU cache to release a reference, the cache in question is
>>> on this same CPU, and much more efficiently drained by a preliminary
>>> local lru_add_drain(), than the later cross-CPU lru_add_drain_all().
>>>
>>> Marked for stable, to counter the increase in lru_add_drain_all()s
>>> from "mm/gup: check ref_count instead of lru before migration".
>>> Note for clean backports: can take 6.16 commit a03db236aebf ("gup:
>>> optimize longterm pin_user_pages() for large folio") first.
>>>
>>> Signed-off-by: Hugh Dickins <hughd@google.com>
>>> Cc: <stable@vger.kernel.org>
>>> ---
>>> mm/gup.c | 15 +++++++++++----
>>> 1 file changed, 11 insertions(+), 4 deletions(-)
>>>
>>> diff --git a/mm/gup.c b/mm/gup.c
>>> index 82aec6443c0a..b47066a54f52 100644
>>> --- a/mm/gup.c
>>> +++ b/mm/gup.c
>>> @@ -2287,8 +2287,8 @@ static unsigned long collect_longterm_unpinnable_folios(
>>> struct pages_or_folios *pofs)
>>> {
>>> unsigned long collected = 0;
>>> - bool drain_allow = true;
>>> struct folio *folio;
>>> + int drained = 0;
>>> long i = 0;
>>> for (folio = pofs_get_folio(pofs, i); folio;
>>> @@ -2307,10 +2307,17 @@ static unsigned long collect_longterm_unpinnable_folios(
>>> continue;
>>> }
>>> - if (drain_allow && folio_ref_count(folio) !=
>>> - folio_expected_ref_count(folio) + 1) {
>>> + if (drained == 0 &&
>>> + folio_ref_count(folio) !=
>>> + folio_expected_ref_count(folio) + 1) {
>>
>> I would just have indented this as follows:
>>
>> if (drained == 0 &&
>> folio_ref_count(folio) != folio_expected_ref_count(folio) + 1) {
>
> Do we want folio_check_expected_ref_count(folio, offset)?
Not sure; if so, outside of this patch series, to also cover the other
handful of cases.
folio_has_unexpected_refs(folio, offset)
Would probably be clearer.
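Something like this, perhaps (just a sketch; the name, and whether the
extra references are passed as a parameter, not settled):

	/* hypothetical helper, not in the tree */
	static inline bool folio_has_unexpected_refs(struct folio *folio, int refs)
	{
		return folio_ref_count(folio) !=
		       folio_expected_ref_count(folio) + refs;
	}

The GUP check would then read folio_has_unexpected_refs(folio, 1), with
the 1 for the pin just taken.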
--
Cheers
David / dhildenb