* [PATCH 0/2] Improve UFFDIO_MOVE scalability by removing anon_vma lock
@ 2025-09-18 5:51 Lokesh Gidra
2025-09-18 5:51 ` [PATCH 1/2] mm: always call rmap_walk() on locked folios Lokesh Gidra
2025-09-18 5:51 ` [PATCH 2/2] mm/userfaultfd: don't lock anon_vma when performing UFFDIO_MOVE Lokesh Gidra
0 siblings, 2 replies; 26+ messages in thread
From: Lokesh Gidra @ 2025-09-18 5:51 UTC (permalink / raw)
To: akpm
Cc: linux-mm, kaleshsingh, ngeoffray, jannh, Lokesh Gidra,
David Hildenbrand, Lorenzo Stoakes, Harry Yoo, Peter Xu,
Suren Baghdasaryan, Barry Song, SeongJae Park
Userfaultfd has a scalability issue in its UFFDIO_MOVE ioctl, which is
heavily used on Android, where the Java garbage collector relies on it
for concurrent heap compaction.
The issue arises because UFFDIO_MOVE updates folio->mapping to an
anon_vma with a different root in order to move the folio from a src
VMA to a dst VMA. It performs the operation with the folio locked, but
this is insufficient, because rmap_walk() can be performed on non-KSM
anonymous folios without the folio lock.
This means that UFFDIO_MOVE has to acquire the anon_vma write lock
of the root anon_vma belonging to the folio it wishes to move.
This causes a scalability bottleneck when multiple threads perform
UFFDIO_MOVE simultaneously on distinct pages of the same src VMA. In
field traces from arm64 Android devices, we have observed janky user
interactions due to long (sometimes over ~50ms) uninterruptible
sleeps on the main UI thread caused by anon_vma lock contention in
UFFDIO_MOVE. This is particularly severe at the beginning of the GC's
compaction phase, when multiple threads are likely to be involved.
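For reference, the serialization point removed by this series is roughly
the following pattern in move_pages_ptes()/move_pages_huge_pmd()
(simplified sketch of the current code; retries and error handling
omitted):

        folio_lock(src_folio);
        src_anon_vma = folio_get_anon_vma(src_folio);
        /* all movers of folios sharing this root anon_vma contend here */
        anon_vma_lock_write(src_anon_vma);
        /* ... move PTEs, folio_move_anon_rmap(), update folio->index ... */
        anon_vma_unlock_write(src_anon_vma);
        put_anon_vma(src_anon_vma);
        folio_unlock(src_folio);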
This series resolves the issue by removing the exception in rmap_walk()
for non-KSM anon folios: all folios are now locked during the rmap
walk. This is less problematic than it might seem, as the only major
caller that relies on the unlocked mode is shrink_active_list().
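Concretely, the trylock-based rmap_walk() callers converted by patch 1
(folio_referenced(), DAMON's mkold/young helpers, page_idle) now use the
same pattern for every folio type, anon included (sketch):

        if (!folio_trylock(folio))
                return;         /* best effort: bail out on contention */
        rmap_walk(folio, &rwc);
        folio_unlock(folio);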
To assess the impact of locking non-KSM anon folios in
shrink_active_list(), we performed an app cycle test on an arm64
Android device. Over the whole duration of the test there were more
than 140k invocations of the function, of which over 29k had at least
one non-KSM anon folio on which folio_referenced() was called.
folio_trylock() did not fail in any of these invocations.
Of course, we now take a lock where we previously wouldn't have. In the
past this would have had a major impact by causing a CoW write fault to
copy a page in do_wp_page(), as commit 09854ba94c6a ("mm: do_wp_page()
simplification") made a failure to obtain the folio lock result in a
page copy even if one wasn't necessary.
However, since commit 6c287605fd56 ("mm: remember exclusively mapped
anonymous pages with PG_anon_exclusive"), and the introduction of the
folio anon exclusive flag, this issue is significantly mitigated.
The only remaining case we might worry about from this perspective is
that of read-only folios immediately after fork, where the anon
exclusive bit will not have been set yet.
We note, however, that in the case of read-only just-forked folios,
wp_can_reuse_anon_folio() will notice the raised reference count
established by shrink_active_list() via isolate_lru_folios() and refuse
to reuse the folio in any case, so this will in fact have no impact - the
folio lock is ultimately immaterial here.
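(The relevant check in wp_can_reuse_anon_folio() is roughly the following
simplified sketch, ignoring the swap cache adjustment; the extra reference
taken via isolate_lru_folios() makes it fail, so the folio is copied
rather than reused:)

        /* rough sketch of the reuse check, swap cache adjustment omitted */
        if (folio_ref_count(folio) > 1)
                return false;   /* extra ref, e.g. from LRU isolation: no reuse */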
All-in-all it appears that there is little opportunity for meaningful
negative impact from this change.
As a result of changing our approach to locking, we can remove all
the code that took steps to acquire an anon_vma write lock instead
of a folio lock. This results in a significant simplification and
scalability improvement of the code (currently only in UFFDIO_MOVE).
Furthermore, as a side-effect, folio_lock_anon_vma_read() gets simpler
as we don't need to worry that folio->mapping may have changed under us.
Prior discussions on this can be found at [1, 2].
[1] https://lore.kernel.org/all/CA+EESO4Z6wtX7ZMdDHQRe5jAAS_bQ-POq5+4aDx5jh2DvY6UHg@mail.gmail.com/
[2] https://lore.kernel.org/all/20250908044950.311548-1-lokeshgidra@google.com/
Lokesh Gidra (2):
mm: always call rmap_walk() on locked folios
mm/userfaultfd: don't lock anon_vma when performing UFFDIO_MOVE
CC: David Hildenbrand <david@redhat.com>
CC: Lorenzo Stoakes <lorenzo.stoakes@oracle.com>
CC: Harry Yoo <harry.yoo@oracle.com>
CC: Peter Xu <peterx@redhat.com>
CC: Suren Baghdasaryan <surenb@google.com>
CC: Barry Song <baohua@kernel.org>
CC: SeongJae Park <sj@kernel.org>
---
mm/damon/ops-common.c | 16 +++--------
mm/huge_memory.c | 22 +--------------
mm/memory-failure.c | 3 +++
mm/page_idle.c | 8 ++----
mm/rmap.c | 42 +++++++++--------------------
mm/userfaultfd.c | 62 ++++++++-----------------------------------
6 files changed, 33 insertions(+), 120 deletions(-)
base-commit: 27efecc552641210647138ad3936229e7dacdf42
--
2.51.0.384.g4c02a37b29-goog
* [PATCH 1/2] mm: always call rmap_walk() on locked folios
2025-09-18 5:51 [PATCH 0/2] Improve UFFDIO_MOVE scalability by removing anon_vma lock Lokesh Gidra
@ 2025-09-18 5:51 ` Lokesh Gidra
2025-09-18 11:57 ` Lorenzo Stoakes
2025-09-18 12:15 ` David Hildenbrand
2025-09-18 5:51 ` [PATCH 2/2] mm/userfaultfd: don't lock anon_vma when performing UFFDIO_MOVE Lokesh Gidra
1 sibling, 2 replies; 26+ messages in thread
From: Lokesh Gidra @ 2025-09-18 5:51 UTC (permalink / raw)
To: akpm
Cc: linux-mm, kaleshsingh, ngeoffray, jannh, Lokesh Gidra,
David Hildenbrand, Lorenzo Stoakes, Harry Yoo, Peter Xu,
Suren Baghdasaryan, Barry Song, SeongJae Park
Guarantee that rmap_walk() is called on locked folios so that threads
changing folio->mapping and folio->index for non-KSM anon folios can
serialize on fine-grained folio lock rather than anon_vma lock. Other
folio types are already always locked before rmap_walk().
This is in preparation for removing anon_vma write-lock from
UFFDIO_MOVE.
CC: David Hildenbrand <david@redhat.com>
CC: Lorenzo Stoakes <lorenzo.stoakes@oracle.com>
CC: Harry Yoo <harry.yoo@oracle.com>
CC: Peter Xu <peterx@redhat.com>
CC: Suren Baghdasaryan <surenb@google.com>
CC: Barry Song <baohua@kernel.org>
CC: SeongJae Park <sj@kernel.org>
Signed-off-by: Lokesh Gidra <lokeshgidra@google.com>
---
mm/damon/ops-common.c | 16 ++++------------
mm/memory-failure.c | 3 +++
mm/page_idle.c | 8 ++------
mm/rmap.c | 42 ++++++++++++------------------------------
4 files changed, 21 insertions(+), 48 deletions(-)
diff --git a/mm/damon/ops-common.c b/mm/damon/ops-common.c
index 998c5180a603..f61d6dde13dc 100644
--- a/mm/damon/ops-common.c
+++ b/mm/damon/ops-common.c
@@ -162,21 +162,17 @@ void damon_folio_mkold(struct folio *folio)
.rmap_one = damon_folio_mkold_one,
.anon_lock = folio_lock_anon_vma_read,
};
- bool need_lock;
if (!folio_mapped(folio) || !folio_raw_mapping(folio)) {
folio_set_idle(folio);
return;
}
- need_lock = !folio_test_anon(folio) || folio_test_ksm(folio);
- if (need_lock && !folio_trylock(folio))
+ if (!folio_trylock(folio))
return;
rmap_walk(folio, &rwc);
-
- if (need_lock)
- folio_unlock(folio);
+ folio_unlock(folio);
}
@@ -228,7 +224,6 @@ bool damon_folio_young(struct folio *folio)
.rmap_one = damon_folio_young_one,
.anon_lock = folio_lock_anon_vma_read,
};
- bool need_lock;
if (!folio_mapped(folio) || !folio_raw_mapping(folio)) {
if (folio_test_idle(folio))
@@ -237,14 +232,11 @@ bool damon_folio_young(struct folio *folio)
return true;
}
- need_lock = !folio_test_anon(folio) || folio_test_ksm(folio);
- if (need_lock && !folio_trylock(folio))
+ if (!folio_trylock(folio))
return false;
rmap_walk(folio, &rwc);
-
- if (need_lock)
- folio_unlock(folio);
+ folio_unlock(folio);
return accessed;
}
diff --git a/mm/memory-failure.c b/mm/memory-failure.c
index a24806bb8e82..f698df156bf8 100644
--- a/mm/memory-failure.c
+++ b/mm/memory-failure.c
@@ -2143,7 +2143,10 @@ static void kill_procs_now(struct page *p, unsigned long pfn, int flags,
{
LIST_HEAD(tokill);
+ folio_lock(folio);
collect_procs(folio, p, &tokill, flags & MF_ACTION_REQUIRED);
+ folio_unlock(folio);
+
kill_procs(&tokill, true, pfn, flags);
}
diff --git a/mm/page_idle.c b/mm/page_idle.c
index a82b340dc204..9bf573d22e87 100644
--- a/mm/page_idle.c
+++ b/mm/page_idle.c
@@ -101,19 +101,15 @@ static void page_idle_clear_pte_refs(struct folio *folio)
.rmap_one = page_idle_clear_pte_refs_one,
.anon_lock = folio_lock_anon_vma_read,
};
- bool need_lock;
if (!folio_mapped(folio) || !folio_raw_mapping(folio))
return;
- need_lock = !folio_test_anon(folio) || folio_test_ksm(folio);
- if (need_lock && !folio_trylock(folio))
+ if (!folio_trylock(folio))
return;
rmap_walk(folio, &rwc);
-
- if (need_lock)
- folio_unlock(folio);
+ folio_unlock(folio);
}
static ssize_t page_idle_bitmap_read(struct file *file, struct kobject *kobj,
diff --git a/mm/rmap.c b/mm/rmap.c
index 34333ae3bd80..90584f5da379 100644
--- a/mm/rmap.c
+++ b/mm/rmap.c
@@ -489,17 +489,15 @@ void __init anon_vma_init(void)
* if there is a mapcount, we can dereference the anon_vma after observing
* those.
*
- * NOTE: the caller should normally hold folio lock when calling this. If
- * not, the caller needs to double check the anon_vma didn't change after
- * taking the anon_vma lock for either read or write (UFFDIO_MOVE can modify it
- * concurrently without folio lock protection). See folio_lock_anon_vma_read()
- * which has already covered that, and comment above remap_pages().
+ * NOTE: the caller should hold folio lock when calling this.
*/
struct anon_vma *folio_get_anon_vma(const struct folio *folio)
{
struct anon_vma *anon_vma = NULL;
unsigned long anon_mapping;
+ VM_WARN_ON_FOLIO(!folio_test_locked(folio), folio);
+
rcu_read_lock();
anon_mapping = (unsigned long)READ_ONCE(folio->mapping);
if ((anon_mapping & FOLIO_MAPPING_FLAGS) != FOLIO_MAPPING_ANON)
@@ -546,7 +544,8 @@ struct anon_vma *folio_lock_anon_vma_read(const struct folio *folio,
struct anon_vma *root_anon_vma;
unsigned long anon_mapping;
-retry:
+ VM_WARN_ON_FOLIO(!folio_test_locked(folio), folio);
+
rcu_read_lock();
anon_mapping = (unsigned long)READ_ONCE(folio->mapping);
if ((anon_mapping & FOLIO_MAPPING_FLAGS) != FOLIO_MAPPING_ANON)
@@ -557,17 +556,6 @@ struct anon_vma *folio_lock_anon_vma_read(const struct folio *folio,
anon_vma = (struct anon_vma *) (anon_mapping - FOLIO_MAPPING_ANON);
root_anon_vma = READ_ONCE(anon_vma->root);
if (down_read_trylock(&root_anon_vma->rwsem)) {
- /*
- * folio_move_anon_rmap() might have changed the anon_vma as we
- * might not hold the folio lock here.
- */
- if (unlikely((unsigned long)READ_ONCE(folio->mapping) !=
- anon_mapping)) {
- up_read(&root_anon_vma->rwsem);
- rcu_read_unlock();
- goto retry;
- }
-
/*
* If the folio is still mapped, then this anon_vma is still
* its anon_vma, and holding the mutex ensures that it will
@@ -602,18 +590,6 @@ struct anon_vma *folio_lock_anon_vma_read(const struct folio *folio,
rcu_read_unlock();
anon_vma_lock_read(anon_vma);
- /*
- * folio_move_anon_rmap() might have changed the anon_vma as we might
- * not hold the folio lock here.
- */
- if (unlikely((unsigned long)READ_ONCE(folio->mapping) !=
- anon_mapping)) {
- anon_vma_unlock_read(anon_vma);
- put_anon_vma(anon_vma);
- anon_vma = NULL;
- goto retry;
- }
-
if (atomic_dec_and_test(&anon_vma->refcount)) {
/*
* Oops, we held the last refcount, release the lock
@@ -1005,7 +981,7 @@ int folio_referenced(struct folio *folio, int is_locked,
if (!folio_raw_mapping(folio))
return 0;
- if (!is_locked && (!folio_test_anon(folio) || folio_test_ksm(folio))) {
+ if (!is_locked) {
we_locked = folio_trylock(folio);
if (!we_locked)
return 1;
@@ -2815,6 +2791,12 @@ static void rmap_walk_anon(struct folio *folio,
pgoff_t pgoff_start, pgoff_end;
struct anon_vma_chain *avc;
+ /*
+ * The folio lock ensures that folio->mapping can't be changed under us
+ * to an anon_vma with different root.
+ */
+ VM_WARN_ON_FOLIO(!folio_test_locked(folio), folio);
+
if (locked) {
anon_vma = folio_anon_vma(folio);
/* anon_vma disappear under us? */
--
2.51.0.384.g4c02a37b29-goog
* [PATCH 2/2] mm/userfaultfd: don't lock anon_vma when performing UFFDIO_MOVE
2025-09-18 5:51 [PATCH 0/2] Improve UFFDIO_MOVE scalability by removing anon_vma lock Lokesh Gidra
2025-09-18 5:51 ` [PATCH 1/2] mm: always call rmap_walk() on locked folios Lokesh Gidra
@ 2025-09-18 5:51 ` Lokesh Gidra
2025-09-18 12:38 ` Lorenzo Stoakes
1 sibling, 1 reply; 26+ messages in thread
From: Lokesh Gidra @ 2025-09-18 5:51 UTC (permalink / raw)
To: akpm
Cc: linux-mm, kaleshsingh, ngeoffray, jannh, Lokesh Gidra,
David Hildenbrand, Lorenzo Stoakes, Peter Xu, Suren Baghdasaryan,
Barry Song
Now that rmap_walk() is guaranteed to be called with the folio lock
held, we can stop serializing on the src VMA anon_vma lock when moving
an exclusive folio from a src VMA to a dst VMA in UFFDIO_MOVE ioctl.
When moving a folio, we modify folio->mapping through
folio_move_anon_rmap() and adjust folio->index accordingly. Doing that
while we could have concurrent RMAP walks would be dangerous. Therefore,
to avoid that, we had to acquire anon_vma of src VMA in write-mode. That
meant that when multiple threads called UFFDIO_MOVE concurrently on
distinct pages of the same src VMA, they would serialize on it, hurting
scalability.
In addition to avoiding the scalability bottleneck, this patch also
simplifies the complicated lock dance that UFFDIO_MOVE has to go through
between RCU, folio-lock, ptl, and anon_vma.
folio_move_anon_rmap() already enforces that the folio is locked. So
when we have the folio locked we can no longer race with concurrent
rmap_walk() as used by folio_referenced() and hence the anon_vma lock
is no longer required.
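With that, the core of what UFFDIO_MOVE does to the folio reduces to
roughly the following (simplified sketch; PTE manipulation, batching and
error handling omitted):

        folio_lock(src_folio);
        /* the folio lock now excludes concurrent rmap walks */
        folio_move_anon_rmap(src_folio, dst_vma);
        src_folio->index = linear_page_index(dst_vma, dst_addr);
        /* ... move the PTE from src to dst under the PTLs ... */
        folio_unlock(src_folio);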
Note that this handling is now the same as for other
folio_move_anon_rmap() users that also do not hold the anon_vma lock --
namely COW reuse handling. These users never required the anon_vma lock
as they are only moving the anon VMA closer to the anon_vma leaf of the
VMA, for example, from an anon_vma root to a leaf of that root. rmap
walks were always able to tolerate that scenario.
CC: David Hildenbrand <david@redhat.com>
CC: Lorenzo Stoakes <lorenzo.stoakes@oracle.com>
CC: Peter Xu <peterx@redhat.com>
CC: Suren Baghdasaryan <surenb@google.com>
CC: Barry Song <baohua@kernel.org>
Signed-off-by: Lokesh Gidra <lokeshgidra@google.com>
---
mm/huge_memory.c | 22 +----------------
mm/userfaultfd.c | 62 +++++++++---------------------------------------
2 files changed, 12 insertions(+), 72 deletions(-)
diff --git a/mm/huge_memory.c b/mm/huge_memory.c
index 5acca24bbabb..f444c142a8be 100644
--- a/mm/huge_memory.c
+++ b/mm/huge_memory.c
@@ -2533,7 +2533,6 @@ int move_pages_huge_pmd(struct mm_struct *mm, pmd_t *dst_pmd, pmd_t *src_pmd, pm
pmd_t _dst_pmd, src_pmdval;
struct page *src_page;
struct folio *src_folio;
- struct anon_vma *src_anon_vma;
spinlock_t *src_ptl, *dst_ptl;
pgtable_t src_pgtable;
struct mmu_notifier_range range;
@@ -2582,23 +2581,9 @@ int move_pages_huge_pmd(struct mm_struct *mm, pmd_t *dst_pmd, pmd_t *src_pmd, pm
src_addr + HPAGE_PMD_SIZE);
mmu_notifier_invalidate_range_start(&range);
- if (src_folio) {
+ if (src_folio)
folio_lock(src_folio);
- /*
- * split_huge_page walks the anon_vma chain without the page
- * lock. Serialize against it with the anon_vma lock, the page
- * lock is not enough.
- */
- src_anon_vma = folio_get_anon_vma(src_folio);
- if (!src_anon_vma) {
- err = -EAGAIN;
- goto unlock_folio;
- }
- anon_vma_lock_write(src_anon_vma);
- } else
- src_anon_vma = NULL;
-
dst_ptl = pmd_lockptr(mm, dst_pmd);
double_pt_lock(src_ptl, dst_ptl);
if (unlikely(!pmd_same(*src_pmd, src_pmdval) ||
@@ -2643,11 +2628,6 @@ int move_pages_huge_pmd(struct mm_struct *mm, pmd_t *dst_pmd, pmd_t *src_pmd, pm
pgtable_trans_huge_deposit(mm, dst_pmd, src_pgtable);
unlock_ptls:
double_pt_unlock(src_ptl, dst_ptl);
- if (src_anon_vma) {
- anon_vma_unlock_write(src_anon_vma);
- put_anon_vma(src_anon_vma);
- }
-unlock_folio:
/* unblock rmap walks */
if (src_folio)
folio_unlock(src_folio);
diff --git a/mm/userfaultfd.c b/mm/userfaultfd.c
index af61b95c89e4..6be65089085e 100644
--- a/mm/userfaultfd.c
+++ b/mm/userfaultfd.c
@@ -1035,8 +1035,7 @@ static inline bool is_pte_pages_stable(pte_t *dst_pte, pte_t *src_pte,
*/
static struct folio *check_ptes_for_batched_move(struct vm_area_struct *src_vma,
unsigned long src_addr,
- pte_t *src_pte, pte_t *dst_pte,
- struct anon_vma *src_anon_vma)
+ pte_t *src_pte, pte_t *dst_pte)
{
pte_t orig_dst_pte, orig_src_pte;
struct folio *folio;
@@ -1052,8 +1051,7 @@ static struct folio *check_ptes_for_batched_move(struct vm_area_struct *src_vma,
folio = vm_normal_folio(src_vma, src_addr, orig_src_pte);
if (!folio || !folio_trylock(folio))
return NULL;
- if (!PageAnonExclusive(&folio->page) || folio_test_large(folio) ||
- folio_anon_vma(folio) != src_anon_vma) {
+ if (!PageAnonExclusive(&folio->page) || folio_test_large(folio)) {
folio_unlock(folio);
return NULL;
}
@@ -1061,9 +1059,8 @@ static struct folio *check_ptes_for_batched_move(struct vm_area_struct *src_vma,
}
/*
- * Moves src folios to dst in a batch as long as they share the same
- * anon_vma as the first folio, are not large, and can successfully
- * take the lock via folio_trylock().
+ * Moves src folios to dst in a batch as long as they are not large, and can
+ * successfully take the lock via folio_trylock().
*/
static long move_present_ptes(struct mm_struct *mm,
struct vm_area_struct *dst_vma,
@@ -1073,8 +1070,7 @@ static long move_present_ptes(struct mm_struct *mm,
pte_t orig_dst_pte, pte_t orig_src_pte,
pmd_t *dst_pmd, pmd_t dst_pmdval,
spinlock_t *dst_ptl, spinlock_t *src_ptl,
- struct folio **first_src_folio, unsigned long len,
- struct anon_vma *src_anon_vma)
+ struct folio **first_src_folio, unsigned long len)
{
int err = 0;
struct folio *src_folio = *first_src_folio;
@@ -1132,8 +1128,8 @@ static long move_present_ptes(struct mm_struct *mm,
src_pte++;
folio_unlock(src_folio);
- src_folio = check_ptes_for_batched_move(src_vma, src_addr, src_pte,
- dst_pte, src_anon_vma);
+ src_folio = check_ptes_for_batched_move(src_vma, src_addr,
+ src_pte, dst_pte);
if (!src_folio)
break;
}
@@ -1263,7 +1259,6 @@ static long move_pages_ptes(struct mm_struct *mm, pmd_t *dst_pmd, pmd_t *src_pmd
pmd_t dummy_pmdval;
pmd_t dst_pmdval;
struct folio *src_folio = NULL;
- struct anon_vma *src_anon_vma = NULL;
struct mmu_notifier_range range;
long ret = 0;
@@ -1347,9 +1342,9 @@ static long move_pages_ptes(struct mm_struct *mm, pmd_t *dst_pmd, pmd_t *src_pmd
}
/*
- * Pin and lock both source folio and anon_vma. Since we are in
- * RCU read section, we can't block, so on contention have to
- * unmap the ptes, obtain the lock and retry.
+ * Pin and lock source folio. Since we are in RCU read section,
+ * we can't block, so on contention have to unmap the ptes,
+ * obtain the lock and retry.
*/
if (!src_folio) {
struct folio *folio;
@@ -1423,33 +1418,11 @@ static long move_pages_ptes(struct mm_struct *mm, pmd_t *dst_pmd, pmd_t *src_pmd
goto retry;
}
- if (!src_anon_vma) {
- /*
- * folio_referenced walks the anon_vma chain
- * without the folio lock. Serialize against it with
- * the anon_vma lock, the folio lock is not enough.
- */
- src_anon_vma = folio_get_anon_vma(src_folio);
- if (!src_anon_vma) {
- /* page was unmapped from under us */
- ret = -EAGAIN;
- goto out;
- }
- if (!anon_vma_trylock_write(src_anon_vma)) {
- pte_unmap(src_pte);
- pte_unmap(dst_pte);
- src_pte = dst_pte = NULL;
- /* now we can block and wait */
- anon_vma_lock_write(src_anon_vma);
- goto retry;
- }
- }
-
ret = move_present_ptes(mm, dst_vma, src_vma,
dst_addr, src_addr, dst_pte, src_pte,
orig_dst_pte, orig_src_pte, dst_pmd,
dst_pmdval, dst_ptl, src_ptl, &src_folio,
- len, src_anon_vma);
+ len);
} else {
struct folio *folio = NULL;
@@ -1515,10 +1488,6 @@ static long move_pages_ptes(struct mm_struct *mm, pmd_t *dst_pmd, pmd_t *src_pmd
}
out:
- if (src_anon_vma) {
- anon_vma_unlock_write(src_anon_vma);
- put_anon_vma(src_anon_vma);
- }
if (src_folio) {
folio_unlock(src_folio);
folio_put(src_folio);
@@ -1792,15 +1761,6 @@ static void uffd_move_unlock(struct vm_area_struct *dst_vma,
* virtual regions without knowing if there are transparent hugepage
* in the regions or not, but preventing the risk of having to split
* the hugepmd during the remap.
- *
- * If there's any rmap walk that is taking the anon_vma locks without
- * first obtaining the folio lock (the only current instance is
- * folio_referenced), they will have to verify if the folio->mapping
- * has changed after taking the anon_vma lock. If it changed they
- * should release the lock and retry obtaining a new anon_vma, because
- * it means the anon_vma was changed by move_pages() before the lock
- * could be obtained. This is the only additional complexity added to
- * the rmap code to provide this anonymous page remapping functionality.
*/
ssize_t move_pages(struct userfaultfd_ctx *ctx, unsigned long dst_start,
unsigned long src_start, unsigned long len, __u64 mode)
--
2.51.0.384.g4c02a37b29-goog
* Re: [PATCH 1/2] mm: always call rmap_walk() on locked folios
2025-09-18 5:51 ` [PATCH 1/2] mm: always call rmap_walk() on locked folios Lokesh Gidra
@ 2025-09-18 11:57 ` Lorenzo Stoakes
2025-09-19 5:45 ` Lokesh Gidra
2025-09-18 12:15 ` David Hildenbrand
1 sibling, 1 reply; 26+ messages in thread
From: Lorenzo Stoakes @ 2025-09-18 11:57 UTC (permalink / raw)
To: Lokesh Gidra
Cc: akpm, linux-mm, kaleshsingh, ngeoffray, jannh, David Hildenbrand,
Harry Yoo, Peter Xu, Suren Baghdasaryan, Barry Song,
SeongJae Park
On Wed, Sep 17, 2025 at 10:51:34PM -0700, Lokesh Gidra wrote:
> Guarantee that rmap_walk() is called on locked folios so that threads
> changing folio->mapping and folio->index for non-KSM anon folios can
> serialize on fine-grained folio lock rather than anon_vma lock. Other
> folio types are already always locked before rmap_walk().
Be good to explain why you're doing certain things, like adding the folio
lock to kill_procs_now().
Also worth noting that you're going from _definitely_ locking non-KSM anon
to _not necessarily_ locking it.
You should explain why you think this is fine (in general - rmap callers do
some other check to see whether what was expected to be mapped actually
was, so it's fine, or otherwise treat things as best-effort).
You should probably also put some information about performance impact
here, I think Barry provided some?
>
> This is in preparation for removing anon_vma write-lock from
> UFFDIO_MOVE.
Thanks for mentioning this :)
>
> CC: David Hildenbrand <david@redhat.com>
> CC: Lorenzo Stoakes <lorenzo.stoakes@oracle.com>
> CC: Harry Yoo <harry.yoo@oracle.com>
> CC: Peter Xu <peterx@redhat.com>
> CC: Suren Baghdasaryan <surenb@google.com>
> CC: Barry Song <baohua@kernel.org>
> CC: SeongJae Park <sj@kernel.org>
> Signed-off-by: Lokesh Gidra <lokeshgidra@google.com>
OK so you're making:
folio_lock_anon_vma_read()
folio_get_anon_vma()
Require folio locks.
folio_lock_anon_vma_read() is called from:
collect_procs_anon() - changed to take folio lock.
page_idle_clear_pte_refs() - changed to take folio lock.
damon_folio_mkold() - changed to take folio lock.
damon_folio_young() - changed to take folio lock.
folio_referenced() - changed to take folio lock.
try_to_unmap() - ???
try_to_migrate() - ???
(I note that we allow a TTU_RMAP_LOCKED walk in the above unmap, migrate
cases too, wonder how these will interact?)
folio_get_anon_vma() is called from:
move_pages_huge_pmd() - already holds folio lock.
__folio_split() - already holds folio lock.
move_pages_ptes() [uffd] - already holds folio lock.
migrate_folio_unmap() - already holds folio lock.
unmap_and_move_huge_page() - already holds folio lock.
Can you:
a. Confirm the try_to_unmap() and try_to_migrate() cases take the folio
lock. Explicitly list the callers and how they acquire the folio lock.
b. Update the commit message to include the above. You're making a _very_
sensitive locking change here, it's important to demonstrate that you've
considered all cases.
Thanks!
> ---
> mm/damon/ops-common.c | 16 ++++------------
> mm/memory-failure.c | 3 +++
> mm/page_idle.c | 8 ++------
> mm/rmap.c | 42 ++++++++++++------------------------------
> 4 files changed, 21 insertions(+), 48 deletions(-)
>
> diff --git a/mm/damon/ops-common.c b/mm/damon/ops-common.c
> index 998c5180a603..f61d6dde13dc 100644
> --- a/mm/damon/ops-common.c
> +++ b/mm/damon/ops-common.c
> @@ -162,21 +162,17 @@ void damon_folio_mkold(struct folio *folio)
> .rmap_one = damon_folio_mkold_one,
> .anon_lock = folio_lock_anon_vma_read,
> };
> - bool need_lock;
>
> if (!folio_mapped(folio) || !folio_raw_mapping(folio)) {
> folio_set_idle(folio);
> return;
> }
>
> - need_lock = !folio_test_anon(folio) || folio_test_ksm(folio);
> - if (need_lock && !folio_trylock(folio))
> + if (!folio_trylock(folio))
> return;
This _seems_ to be best effort and not relying on anon always
succeeding. So should be fine.
>
> rmap_walk(folio, &rwc);
> -
> - if (need_lock)
> - folio_unlock(folio);
> + folio_unlock(folio);
>
> }
>
> @@ -228,7 +224,6 @@ bool damon_folio_young(struct folio *folio)
> .rmap_one = damon_folio_young_one,
> .anon_lock = folio_lock_anon_vma_read,
> };
> - bool need_lock;
>
> if (!folio_mapped(folio) || !folio_raw_mapping(folio)) {
> if (folio_test_idle(folio))
> @@ -237,14 +232,11 @@ bool damon_folio_young(struct folio *folio)
> return true;
> }
>
> - need_lock = !folio_test_anon(folio) || folio_test_ksm(folio);
> - if (need_lock && !folio_trylock(folio))
> + if (!folio_trylock(folio))
> return false;
Same as above here.
>
> rmap_walk(folio, &rwc);
> -
> - if (need_lock)
> - folio_unlock(folio);
> + folio_unlock(folio);
>
> return accessed;
> }
> diff --git a/mm/memory-failure.c b/mm/memory-failure.c
> index a24806bb8e82..f698df156bf8 100644
> --- a/mm/memory-failure.c
> +++ b/mm/memory-failure.c
> @@ -2143,7 +2143,10 @@ static void kill_procs_now(struct page *p, unsigned long pfn, int flags,
> {
> LIST_HEAD(tokill);
>
> + folio_lock(folio);
> collect_procs(folio, p, &tokill, flags & MF_ACTION_REQUIRED);
> + folio_unlock(folio);
> +
Good. I hate how this works.
> kill_procs(&tokill, true, pfn, flags);
> }
>
> diff --git a/mm/page_idle.c b/mm/page_idle.c
> index a82b340dc204..9bf573d22e87 100644
> --- a/mm/page_idle.c
> +++ b/mm/page_idle.c
> @@ -101,19 +101,15 @@ static void page_idle_clear_pte_refs(struct folio *folio)
> .rmap_one = page_idle_clear_pte_refs_one,
> .anon_lock = folio_lock_anon_vma_read,
> };
> - bool need_lock;
>
> if (!folio_mapped(folio) || !folio_raw_mapping(folio))
> return;
>
> - need_lock = !folio_test_anon(folio) || folio_test_ksm(folio);
> - if (need_lock && !folio_trylock(folio))
> + if (!folio_trylock(folio))
> return;
This checks the folio idle bit afterwards, so it's fine for anon to not
succeed due to contention.
>
> rmap_walk(folio, &rwc);
> -
> - if (need_lock)
> - folio_unlock(folio);
> + folio_unlock(folio);
> }
>
> static ssize_t page_idle_bitmap_read(struct file *file, struct kobject *kobj,
> diff --git a/mm/rmap.c b/mm/rmap.c
> index 34333ae3bd80..90584f5da379 100644
> --- a/mm/rmap.c
> +++ b/mm/rmap.c
> @@ -489,17 +489,15 @@ void __init anon_vma_init(void)
> * if there is a mapcount, we can dereference the anon_vma after observing
> * those.
> *
> - * NOTE: the caller should normally hold folio lock when calling this. If
> - * not, the caller needs to double check the anon_vma didn't change after
> - * taking the anon_vma lock for either read or write (UFFDIO_MOVE can modify it
> - * concurrently without folio lock protection). See folio_lock_anon_vma_read()
> - * which has already covered that, and comment above remap_pages().
> + * NOTE: the caller should hold folio lock when calling this.
> */
> struct anon_vma *folio_get_anon_vma(const struct folio *folio)
> {
> struct anon_vma *anon_vma = NULL;
> unsigned long anon_mapping;
>
> + VM_WARN_ON_FOLIO(!folio_test_locked(folio), folio);
> +
> rcu_read_lock();
> anon_mapping = (unsigned long)READ_ONCE(folio->mapping);
> if ((anon_mapping & FOLIO_MAPPING_FLAGS) != FOLIO_MAPPING_ANON)
> @@ -546,7 +544,8 @@ struct anon_vma *folio_lock_anon_vma_read(const struct folio *folio,
> struct anon_vma *root_anon_vma;
> unsigned long anon_mapping;
>
> -retry:
> + VM_WARN_ON_FOLIO(!folio_test_locked(folio), folio);
> +
> rcu_read_lock();
> anon_mapping = (unsigned long)READ_ONCE(folio->mapping);
> if ((anon_mapping & FOLIO_MAPPING_FLAGS) != FOLIO_MAPPING_ANON)
> @@ -557,17 +556,6 @@ struct anon_vma *folio_lock_anon_vma_read(const struct folio *folio,
> anon_vma = (struct anon_vma *) (anon_mapping - FOLIO_MAPPING_ANON);
> root_anon_vma = READ_ONCE(anon_vma->root);
> if (down_read_trylock(&root_anon_vma->rwsem)) {
> - /*
> - * folio_move_anon_rmap() might have changed the anon_vma as we
> - * might not hold the folio lock here.
> - */
> - if (unlikely((unsigned long)READ_ONCE(folio->mapping) !=
> - anon_mapping)) {
> - up_read(&root_anon_vma->rwsem);
> - rcu_read_unlock();
> - goto retry;
> - }
> -
> /*
> * If the folio is still mapped, then this anon_vma is still
> * its anon_vma, and holding the mutex ensures that it will
> @@ -602,18 +590,6 @@ struct anon_vma *folio_lock_anon_vma_read(const struct folio *folio,
> rcu_read_unlock();
> anon_vma_lock_read(anon_vma);
>
> - /*
> - * folio_move_anon_rmap() might have changed the anon_vma as we might
> - * not hold the folio lock here.
> - */
> - if (unlikely((unsigned long)READ_ONCE(folio->mapping) !=
> - anon_mapping)) {
> - anon_vma_unlock_read(anon_vma);
> - put_anon_vma(anon_vma);
> - anon_vma = NULL;
> - goto retry;
> - }
> -
> if (atomic_dec_and_test(&anon_vma->refcount)) {
> /*
> * Oops, we held the last refcount, release the lock
> @@ -1005,7 +981,7 @@ int folio_referenced(struct folio *folio, int is_locked,
> if (!folio_raw_mapping(folio))
> return 0;
>
> - if (!is_locked && (!folio_test_anon(folio) || folio_test_ksm(folio))) {
> + if (!is_locked) {
> we_locked = folio_trylock(folio);
> if (!we_locked)
> return 1;
> @@ -2815,6 +2791,12 @@ static void rmap_walk_anon(struct folio *folio,
> pgoff_t pgoff_start, pgoff_end;
> struct anon_vma_chain *avc;
>
> + /*
> + * The folio lock ensures that folio->mapping can't be changed under us
> + * to an anon_vma with different root.
> + */
> + VM_WARN_ON_FOLIO(!folio_test_locked(folio), folio);
> +
> if (locked) {
> anon_vma = folio_anon_vma(folio);
> /* anon_vma disappear under us? */
> --
> 2.51.0.384.g4c02a37b29-goog
>
>
All of the above actual changes to the locking logic look OK to me though.
* Re: [PATCH 1/2] mm: always call rmap_walk() on locked folios
2025-09-18 5:51 ` [PATCH 1/2] mm: always call rmap_walk() on locked folios Lokesh Gidra
2025-09-18 11:57 ` Lorenzo Stoakes
@ 2025-09-18 12:15 ` David Hildenbrand
2025-09-19 6:09 ` Lokesh Gidra
1 sibling, 1 reply; 26+ messages in thread
From: David Hildenbrand @ 2025-09-18 12:15 UTC (permalink / raw)
To: Lokesh Gidra, akpm
Cc: linux-mm, kaleshsingh, ngeoffray, jannh, Lorenzo Stoakes,
Harry Yoo, Peter Xu, Suren Baghdasaryan, Barry Song,
SeongJae Park
On 18.09.25 07:51, Lokesh Gidra wrote:
> Guarantee that rmap_walk() is called on locked folios so that threads
> changing folio->mapping and folio->index for non-KSM anon folios can
> serialize on fine-grained folio lock rather than anon_vma lock. Other
> folio types are already always locked before rmap_walk().
>
> This is in preparation for removing anon_vma write-lock from
> UFFDIO_MOVE.
I think quite a big bunch of the cover letter belongs into this patch
here. In essence, why is it okay, what could be problematic, history
etc, etc.
>
> CC: David Hildenbrand <david@redhat.com>
> CC: Lorenzo Stoakes <lorenzo.stoakes@oracle.com>
> CC: Harry Yoo <harry.yoo@oracle.com>
> CC: Peter Xu <peterx@redhat.com>
> CC: Suren Baghdasaryan <surenb@google.com>
> CC: Barry Song <baohua@kernel.org>
> CC: SeongJae Park <sj@kernel.org>
> Signed-off-by: Lokesh Gidra <lokeshgidra@google.com>
> ---
> mm/damon/ops-common.c | 16 ++++------------
> mm/memory-failure.c | 3 +++
> mm/page_idle.c | 8 ++------
> mm/rmap.c | 42 ++++++++++++------------------------------
> 4 files changed, 21 insertions(+), 48 deletions(-)
>
> diff --git a/mm/damon/ops-common.c b/mm/damon/ops-common.c
> index 998c5180a603..f61d6dde13dc 100644
> --- a/mm/damon/ops-common.c
> +++ b/mm/damon/ops-common.c
> @@ -162,21 +162,17 @@ void damon_folio_mkold(struct folio *folio)
> .rmap_one = damon_folio_mkold_one,
> .anon_lock = folio_lock_anon_vma_read,
> };
> - bool need_lock;
>
> if (!folio_mapped(folio) || !folio_raw_mapping(folio)) {
> folio_set_idle(folio);
> return;
> }
>
> - need_lock = !folio_test_anon(folio) || folio_test_ksm(folio);
> - if (need_lock && !folio_trylock(folio))
> + if (!folio_trylock(folio))
> return;
>
> rmap_walk(folio, &rwc);
> -
> - if (need_lock)
> - folio_unlock(folio);
> + folio_unlock(folio);
>
> }
>
> @@ -228,7 +224,6 @@ bool damon_folio_young(struct folio *folio)
> .rmap_one = damon_folio_young_one,
> .anon_lock = folio_lock_anon_vma_read,
> };
> - bool need_lock;
>
> if (!folio_mapped(folio) || !folio_raw_mapping(folio)) {
> if (folio_test_idle(folio))
> @@ -237,14 +232,11 @@ bool damon_folio_young(struct folio *folio)
> return true;
> }
>
> - need_lock = !folio_test_anon(folio) || folio_test_ksm(folio);
> - if (need_lock && !folio_trylock(folio))
> + if (!folio_trylock(folio))
> return false;
>
> rmap_walk(folio, &rwc);
> -
> - if (need_lock)
> - folio_unlock(folio);
> + folio_unlock(folio);
>
> return accessed;
> }
> diff --git a/mm/memory-failure.c b/mm/memory-failure.c
> index a24806bb8e82..f698df156bf8 100644
> --- a/mm/memory-failure.c
> +++ b/mm/memory-failure.c
> @@ -2143,7 +2143,10 @@ static void kill_procs_now(struct page *p, unsigned long pfn, int flags,
> {
> LIST_HEAD(tokill);
>
> + folio_lock(folio);
> collect_procs(folio, p, &tokill, flags & MF_ACTION_REQUIRED);
> + folio_unlock(folio);
> +
> kill_procs(&tokill, true, pfn, flags);
> }
Is mf_generic_kill_procs()->collect_procs() problematic?
The dax_lock_folio/dax_unlock_folio part confuses me: I think it only
locks an entry in the page cache but not the actual folio ("Lock the DAX
entry corresponding to a folio")?
--
Cheers
David / dhildenb
* Re: [PATCH 2/2] mm/userfaultfd: don't lock anon_vma when performing UFFDIO_MOVE
2025-09-18 5:51 ` [PATCH 2/2] mm/userfaultfd: don't lock anon_vma when performing UFFDIO_MOVE Lokesh Gidra
@ 2025-09-18 12:38 ` Lorenzo Stoakes
2025-09-19 6:30 ` Lokesh Gidra
0 siblings, 1 reply; 26+ messages in thread
From: Lorenzo Stoakes @ 2025-09-18 12:38 UTC (permalink / raw)
To: Lokesh Gidra
Cc: akpm, linux-mm, kaleshsingh, ngeoffray, jannh, David Hildenbrand,
Peter Xu, Suren Baghdasaryan, Barry Song
On Wed, Sep 17, 2025 at 10:51:35PM -0700, Lokesh Gidra wrote:
> Now that rmap_walk() is guaranteed to be called with the folio lock
> held, we can stop serializing on the src VMA anon_vma lock when moving
> an exclusive folio from a src VMA to a dst VMA in UFFDIO_MOVE ioctl.
>
> When moving a folio, we modify folio->mapping through
> folio_move_anon_rmap() and adjust folio->index accordingly. Doing that
> while we could have concurrent RMAP walks would be dangerous. Therefore,
> to avoid that, we had to acquire anon_vma of src VMA in write-mode. That
> meant that when multiple threads called UFFDIO_MOVE concurrently on
> distinct pages of the same src VMA, they would serialize on it, hurting
> scalability.
>
> In addition to avoiding the scalability bottleneck, this patch also
> simplifies the complicated lock dance that UFFDIO_MOVE has to go through
> between RCU, folio-lock, ptl, and anon_vma.
>
> folio_move_anon_rmap() already enforces that the folio is locked. So
> when we have the folio locked we can no longer race with concurrent
> rmap_walk() as used by folio_referenced() and hence the anon_vma lock
And other rmap callers right?
> is no longer required.
>
> Note that this handling is now the same as for other
> folio_move_anon_rmap() users that also do not hold the anon_vma lock --
> namely COW reuse handling. These users never required the anon_vma lock
> as they are only moving the anon VMA closer to the anon_vma leaf of the
> VMA, for example, from an anon_vma root to a leaf of that root. rmap
> walks were always able to tolerate that scenario.
Which users?
>
> CC: David Hildenbrand <david@redhat.com>
> CC: Lorenzo Stoakes <lorenzo.stoakes@oracle.com>
> CC: Peter Xu <peterx@redhat.com>
> CC: Suren Baghdasaryan <surenb@google.com>
> CC: Barry Song <baohua@kernel.org>
> Signed-off-by: Lokesh Gidra <lokeshgidra@google.com>
> ---
> mm/huge_memory.c | 22 +----------------
> mm/userfaultfd.c | 62 +++++++++---------------------------------------
> 2 files changed, 12 insertions(+), 72 deletions(-)
>
> diff --git a/mm/huge_memory.c b/mm/huge_memory.c
> index 5acca24bbabb..f444c142a8be 100644
> --- a/mm/huge_memory.c
> +++ b/mm/huge_memory.c
> @@ -2533,7 +2533,6 @@ int move_pages_huge_pmd(struct mm_struct *mm, pmd_t *dst_pmd, pmd_t *src_pmd, pm
> pmd_t _dst_pmd, src_pmdval;
> struct page *src_page;
> struct folio *src_folio;
> - struct anon_vma *src_anon_vma;
> spinlock_t *src_ptl, *dst_ptl;
> pgtable_t src_pgtable;
> struct mmu_notifier_range range;
> @@ -2582,23 +2581,9 @@ int move_pages_huge_pmd(struct mm_struct *mm, pmd_t *dst_pmd, pmd_t *src_pmd, pm
> src_addr + HPAGE_PMD_SIZE);
> mmu_notifier_invalidate_range_start(&range);
>
> - if (src_folio) {
> + if (src_folio)
> folio_lock(src_folio);
>
> - /*
> - * split_huge_page walks the anon_vma chain without the page
> - * lock. Serialize against it with the anon_vma lock, the page
> - * lock is not enough.
> - */
> - src_anon_vma = folio_get_anon_vma(src_folio);
> - if (!src_anon_vma) {
> - err = -EAGAIN;
> - goto unlock_folio;
> - }
> - anon_vma_lock_write(src_anon_vma);
> - } else
> - src_anon_vma = NULL;
> -
Hmm this seems an odd thing to include in the uffd change. Why not just include
it in the last commit or as a separate commit?
> dst_ptl = pmd_lockptr(mm, dst_pmd);
> double_pt_lock(src_ptl, dst_ptl);
> if (unlikely(!pmd_same(*src_pmd, src_pmdval) ||
> @@ -2643,11 +2628,6 @@ int move_pages_huge_pmd(struct mm_struct *mm, pmd_t *dst_pmd, pmd_t *src_pmd, pm
> pgtable_trans_huge_deposit(mm, dst_pmd, src_pgtable);
> unlock_ptls:
> double_pt_unlock(src_ptl, dst_ptl);
> - if (src_anon_vma) {
> - anon_vma_unlock_write(src_anon_vma);
> - put_anon_vma(src_anon_vma);
> - }
> -unlock_folio:
> /* unblock rmap walks */
> if (src_folio)
> folio_unlock(src_folio);
> diff --git a/mm/userfaultfd.c b/mm/userfaultfd.c
> index af61b95c89e4..6be65089085e 100644
> --- a/mm/userfaultfd.c
> +++ b/mm/userfaultfd.c
> @@ -1035,8 +1035,7 @@ static inline bool is_pte_pages_stable(pte_t *dst_pte, pte_t *src_pte,
> */
> static struct folio *check_ptes_for_batched_move(struct vm_area_struct *src_vma,
> unsigned long src_addr,
> - pte_t *src_pte, pte_t *dst_pte,
> - struct anon_vma *src_anon_vma)
> + pte_t *src_pte, pte_t *dst_pte)
> {
> pte_t orig_dst_pte, orig_src_pte;
> struct folio *folio;
> @@ -1052,8 +1051,7 @@ static struct folio *check_ptes_for_batched_move(struct vm_area_struct *src_vma,
> folio = vm_normal_folio(src_vma, src_addr, orig_src_pte);
> if (!folio || !folio_trylock(folio))
> return NULL;
> - if (!PageAnonExclusive(&folio->page) || folio_test_large(folio) ||
> - folio_anon_vma(folio) != src_anon_vma) {
> + if (!PageAnonExclusive(&folio->page) || folio_test_large(folio)) {
> folio_unlock(folio);
> return NULL;
> }
It's good to unwind this obviously, though god I hate all these open coded checks.
Let me also rant about how we seem to duplicate half of mm in uffd
code. Yuck. This is really not how this should have been done AT ALL.
> @@ -1061,9 +1059,8 @@ static struct folio *check_ptes_for_batched_move(struct vm_area_struct *src_vma,
> }
>
> /*
> - * Moves src folios to dst in a batch as long as they share the same
> - * anon_vma as the first folio, are not large, and can successfully
> - * take the lock via folio_trylock().
> + * Moves src folios to dst in a batch as long as they are not large, and can
> + * successfully take the lock via folio_trylock().
> */
> static long move_present_ptes(struct mm_struct *mm,
> struct vm_area_struct *dst_vma,
> @@ -1073,8 +1070,7 @@ static long move_present_ptes(struct mm_struct *mm,
> pte_t orig_dst_pte, pte_t orig_src_pte,
> pmd_t *dst_pmd, pmd_t dst_pmdval,
> spinlock_t *dst_ptl, spinlock_t *src_ptl,
> - struct folio **first_src_folio, unsigned long len,
> - struct anon_vma *src_anon_vma)
> + struct folio **first_src_folio, unsigned long len)
> {
> int err = 0;
> struct folio *src_folio = *first_src_folio;
> @@ -1132,8 +1128,8 @@ static long move_present_ptes(struct mm_struct *mm,
> src_pte++;
>
> folio_unlock(src_folio);
> - src_folio = check_ptes_for_batched_move(src_vma, src_addr, src_pte,
> - dst_pte, src_anon_vma);
> + src_folio = check_ptes_for_batched_move(src_vma, src_addr,
> + src_pte, dst_pte);
> if (!src_folio)
> break;
> }
> @@ -1263,7 +1259,6 @@ static long move_pages_ptes(struct mm_struct *mm, pmd_t *dst_pmd, pmd_t *src_pmd
> pmd_t dummy_pmdval;
> pmd_t dst_pmdval;
> struct folio *src_folio = NULL;
> - struct anon_vma *src_anon_vma = NULL;
> struct mmu_notifier_range range;
> long ret = 0;
>
> @@ -1347,9 +1342,9 @@ static long move_pages_ptes(struct mm_struct *mm, pmd_t *dst_pmd, pmd_t *src_pmd
> }
>
> /*
> - * Pin and lock both source folio and anon_vma. Since we are in
> - * RCU read section, we can't block, so on contention have to
> - * unmap the ptes, obtain the lock and retry.
> + * Pin and lock source folio. Since we are in RCU read section,
> + * we can't block, so on contention have to unmap the ptes,
> + * obtain the lock and retry.
Not sure what pinning the anon_vma meant anyway :)
> */
> if (!src_folio) {
> struct folio *folio;
> @@ -1423,33 +1418,11 @@ static long move_pages_ptes(struct mm_struct *mm, pmd_t *dst_pmd, pmd_t *src_pmd
> goto retry;
> }
>
> - if (!src_anon_vma) {
> - /*
> - * folio_referenced walks the anon_vma chain
> - * without the folio lock. Serialize against it with
> - * the anon_vma lock, the folio lock is not enough.
> - */
> - src_anon_vma = folio_get_anon_vma(src_folio);
> - if (!src_anon_vma) {
> - /* page was unmapped from under us */
> - ret = -EAGAIN;
> - goto out;
> - }
> - if (!anon_vma_trylock_write(src_anon_vma)) {
> - pte_unmap(src_pte);
> - pte_unmap(dst_pte);
> - src_pte = dst_pte = NULL;
> - /* now we can block and wait */
> - anon_vma_lock_write(src_anon_vma);
> - goto retry;
> - }
> - }
> -
> ret = move_present_ptes(mm, dst_vma, src_vma,
> dst_addr, src_addr, dst_pte, src_pte,
> orig_dst_pte, orig_src_pte, dst_pmd,
> dst_pmdval, dst_ptl, src_ptl, &src_folio,
> - len, src_anon_vma);
> + len);
> } else {
> struct folio *folio = NULL;
>
> @@ -1515,10 +1488,6 @@ static long move_pages_ptes(struct mm_struct *mm, pmd_t *dst_pmd, pmd_t *src_pmd
> }
>
> out:
> - if (src_anon_vma) {
> - anon_vma_unlock_write(src_anon_vma);
> - put_anon_vma(src_anon_vma);
> - }
> if (src_folio) {
> folio_unlock(src_folio);
> folio_put(src_folio);
> @@ -1792,15 +1761,6 @@ static void uffd_move_unlock(struct vm_area_struct *dst_vma,
> * virtual regions without knowing if there are transparent hugepage
> * in the regions or not, but preventing the risk of having to split
> * the hugepmd during the remap.
> - *
> - * If there's any rmap walk that is taking the anon_vma locks without
> - * first obtaining the folio lock (the only current instance is
> - * folio_referenced), they will have to verify if the folio->mapping
> - * has changed after taking the anon_vma lock. If it changed they
> - * should release the lock and retry obtaining a new anon_vma, because
> - * it means the anon_vma was changed by move_pages() before the lock
> - * could be obtained. This is the only additional complexity added to
> - * the rmap code to provide this anonymous page remapping functionality.
> */
> ssize_t move_pages(struct userfaultfd_ctx *ctx, unsigned long dst_start,
> unsigned long src_start, unsigned long len, __u64 mode)
> --
> 2.51.0.384.g4c02a37b29-goog
>
>
Rest of logic looks OK to me!
* Re: [PATCH 1/2] mm: always call rmap_walk() on locked folios
2025-09-18 11:57 ` Lorenzo Stoakes
@ 2025-09-19 5:45 ` Lokesh Gidra
2025-09-19 9:59 ` Lorenzo Stoakes
2025-11-03 14:58 ` Lorenzo Stoakes
0 siblings, 2 replies; 26+ messages in thread
From: Lokesh Gidra @ 2025-09-19 5:45 UTC (permalink / raw)
To: Lorenzo Stoakes
Cc: akpm, linux-mm, kaleshsingh, ngeoffray, jannh, David Hildenbrand,
Harry Yoo, Peter Xu, Suren Baghdasaryan, Barry Song,
SeongJae Park
On Thu, Sep 18, 2025 at 4:57 AM Lorenzo Stoakes
<lorenzo.stoakes@oracle.com> wrote:
>
> On Wed, Sep 17, 2025 at 10:51:34PM -0700, Lokesh Gidra wrote:
> > Guarantee that rmap_walk() is called on locked folios so that threads
> > changing folio->mapping and folio->index for non-KSM anon folios can
> > serialize on fine-grained folio lock rather than anon_vma lock. Other
> > folio types are already always locked before rmap_walk().
>
> Be good to explain why you're doing certain things, like adding the folio
> lock to kill_procs_now().
Agreed! I'll add in the next version.
>
> Also worth noting that you're going from _definitely_ locking non-KSM anon
> to _not necessarily_ locking it.
Will do. But, just to be clear, you mean the opposite right?
>
> You should explain why you think this is fine (in general - rmap callers do
> some other check to see if what is expected to mapped did happen so it's
> fine, or otherwise treat things as best-effort).
Sure thing.
>
> You should probably also put some information about performance impact
> here, I think Barry provided some?
>
I added that in the cover letter. Basically the impact of trylocking
non-KSM anon folios (in folio_referenced()) on shrink_active_list().
I'll move it here.
> >
> > This is in preparation for removing anon_vma write-lock from
> > UFFDIO_MOVE.
>
> Thanks for mentioning this :)
>
I got your message loud and clear the last time :)
> >
> > CC: David Hildenbrand <david@redhat.com>
> > CC: Lorenzo Stoakes <lorenzo.stoakes@oracle.com>
> > CC: Harry Yoo <harry.yoo@oracle.com>
> > CC: Peter Xu <peterx@redhat.com>
> > CC: Suren Baghdasaryan <surenb@google.com>
> > CC: Barry Song <baohua@kernel.org>
> > CC: SeongJae Park <sj@kernel.org>
> > Signed-off-by: Lokesh Gidra <lokeshgidra@google.com>
>
> OK so you're making:
>
> folio_lock_anon_vma_read()
> folio_get_anon_vma()
>
> Require folio locks.
>
> folio_lock_anon_vma_read() is called from:
>
> connect_procs_anon() - changed to take folio lock.
> page_idle_clear_pte_refs() - changed to take folio lock.
> damon_folio_mkold() - changed to take folio lock.
> damon_folio_young() - changed to take folio lock.
> folio_referenced() - changed to take folio lock.
> try_to_unmap() - ???
> try_to_migrate() - ???
>
> 9I note that we allow a TTU_RMAP_LOCKED walk in the above unmap, migrate
> cases too, wonder how these will interact?)
>
> folio_get_anon_vma() is called from:
>
> move_pages_huge_pmd() - already holds folio lock.
> __folio_split() - already holds folio lock.
> move_pages_ptes() [uffd] - already holds folio lock.
> migrate_folio_unmap() - already holds folio lock.
> unmap_and_move_huge_page() - already holds folio lock.
>
> Can you:
>
> a. Confirm the try_to_unmap() and try_to_migrate() cases take the folio
> lock. Explicitly list the callers and how they acquire the folio lock.
>
The description comments of both try_to_migrate() and try_to_unmap() say
that the caller must hold the folio lock. But just to be safe, I went
through all the callers, and all of them are holding the folio lock:
try_to_unmap() is called from:
unmap_folio()
collapse_file()
shrink_folio_list()
shrink_folio_list()->unmap_poisoned_folio()
do_migrate_range()->unmap_poisoned_folio()
try_memory_failure_hugetlb()->hwpoison_user_mappings()->unmap_poisoned_folio()
memory_failure()->hwpoison_user_mappings()->unmap_poisoned_folio()
try_to_migrate() is called from:
unmap_folio()
unmap_and_move_huge_page()
migrate_folio_unmap()
migrate_vma_collect()->migrate_vma_collect_pmd() acquires the folio lock in the case of
migrate_vma_setup()->migrate_vma_unmap()->migrate_device_unmap()
migrate_device_pfn_lock() acquires folio locks in the following cases:
migrate_device_range()->migrate_device_unmap()
migrate_device_pfns()->migrate_device_unmap()
All the callers of rmap_walk()/rmap_walk_locked() are already covered
in the folio_lock_anon_vma_read() list that you added, except
remove_migration_ptes(), which is called from:
__folio_split()->remap_page()
migrate_device_unmap()
__migrate_device_finalize()
unmap_and_move_huge_page()
migrate_folio_unmap() locks the folio for the following two in
migrate_pages_batch():
migrate_folio_move()
migrate_folio_undo_src()
> b. Update the commit message to include the above. You're making a _very_
> sensitive locking change here, it's important to demonstrate that you've
> considered all cases.
Certainly, will do.
>
> Thanks!
>
> > ---
> > mm/damon/ops-common.c | 16 ++++------------
> > mm/memory-failure.c | 3 +++
> > mm/page_idle.c | 8 ++------
> > mm/rmap.c | 42 ++++++++++++------------------------------
> > 4 files changed, 21 insertions(+), 48 deletions(-)
> >
> > diff --git a/mm/damon/ops-common.c b/mm/damon/ops-common.c
> > index 998c5180a603..f61d6dde13dc 100644
> > --- a/mm/damon/ops-common.c
> > +++ b/mm/damon/ops-common.c
> > @@ -162,21 +162,17 @@ void damon_folio_mkold(struct folio *folio)
> > .rmap_one = damon_folio_mkold_one,
> > .anon_lock = folio_lock_anon_vma_read,
> > };
> > - bool need_lock;
> >
> > if (!folio_mapped(folio) || !folio_raw_mapping(folio)) {
> > folio_set_idle(folio);
> > return;
> > }
> >
> > - need_lock = !folio_test_anon(folio) || folio_test_ksm(folio);
> > - if (need_lock && !folio_trylock(folio))
> > + if (!folio_trylock(folio))
> > return;
>
> This _seems_ to be best effort and not relying on anon always
> succeeding. So should be fine.
>
> >
> > rmap_walk(folio, &rwc);
> > -
> > - if (need_lock)
> > - folio_unlock(folio);
> > + folio_unlock(folio);
> >
> > }
> >
> > @@ -228,7 +224,6 @@ bool damon_folio_young(struct folio *folio)
> > .rmap_one = damon_folio_young_one,
> > .anon_lock = folio_lock_anon_vma_read,
> > };
> > - bool need_lock;
> >
> > if (!folio_mapped(folio) || !folio_raw_mapping(folio)) {
> > if (folio_test_idle(folio))
> > @@ -237,14 +232,11 @@ bool damon_folio_young(struct folio *folio)
> > return true;
> > }
> >
> > - need_lock = !folio_test_anon(folio) || folio_test_ksm(folio);
> > - if (need_lock && !folio_trylock(folio))
> > + if (!folio_trylock(folio))
> > return false;
>
> Same as above here.
>
> >
> > rmap_walk(folio, &rwc);
> > -
> > - if (need_lock)
> > - folio_unlock(folio);
> > + folio_unlock(folio);
> >
> > return accessed;
> > }
> > diff --git a/mm/memory-failure.c b/mm/memory-failure.c
> > index a24806bb8e82..f698df156bf8 100644
> > --- a/mm/memory-failure.c
> > +++ b/mm/memory-failure.c
> > @@ -2143,7 +2143,10 @@ static void kill_procs_now(struct page *p, unsigned long pfn, int flags,
> > {
> > LIST_HEAD(tokill);
> >
> > + folio_lock(folio);
> > collect_procs(folio, p, &tokill, flags & MF_ACTION_REQUIRED);
> > + folio_unlock(folio);
> > +
>
> Good. I hate how this works.
>
> > kill_procs(&tokill, true, pfn, flags);
> > }
> >
> > diff --git a/mm/page_idle.c b/mm/page_idle.c
> > index a82b340dc204..9bf573d22e87 100644
> > --- a/mm/page_idle.c
> > +++ b/mm/page_idle.c
> > @@ -101,19 +101,15 @@ static void page_idle_clear_pte_refs(struct folio *folio)
> > .rmap_one = page_idle_clear_pte_refs_one,
> > .anon_lock = folio_lock_anon_vma_read,
> > };
> > - bool need_lock;
> >
> > if (!folio_mapped(folio) || !folio_raw_mapping(folio))
> > return;
> >
> > - need_lock = !folio_test_anon(folio) || folio_test_ksm(folio);
> > - if (need_lock && !folio_trylock(folio))
> > + if (!folio_trylock(folio))
> > return;
>
> This checks folio idle bit after so that's fine for anon to not succeed due
> to contention.
>
> >
> > rmap_walk(folio, &rwc);
> > -
> > - if (need_lock)
> > - folio_unlock(folio);
> > + folio_unlock(folio);
> > }
> >
> > static ssize_t page_idle_bitmap_read(struct file *file, struct kobject *kobj,
> > diff --git a/mm/rmap.c b/mm/rmap.c
> > index 34333ae3bd80..90584f5da379 100644
> > --- a/mm/rmap.c
> > +++ b/mm/rmap.c
> > @@ -489,17 +489,15 @@ void __init anon_vma_init(void)
> > * if there is a mapcount, we can dereference the anon_vma after observing
> > * those.
> > *
> > - * NOTE: the caller should normally hold folio lock when calling this. If
> > - * not, the caller needs to double check the anon_vma didn't change after
> > - * taking the anon_vma lock for either read or write (UFFDIO_MOVE can modify it
> > - * concurrently without folio lock protection). See folio_lock_anon_vma_read()
> > - * which has already covered that, and comment above remap_pages().
> > + * NOTE: the caller should hold folio lock when calling this.
> > */
> > struct anon_vma *folio_get_anon_vma(const struct folio *folio)
> > {
> > struct anon_vma *anon_vma = NULL;
> > unsigned long anon_mapping;
> >
> > + VM_WARN_ON_FOLIO(!folio_test_locked(folio), folio);
> > +
> > rcu_read_lock();
> > anon_mapping = (unsigned long)READ_ONCE(folio->mapping);
> > if ((anon_mapping & FOLIO_MAPPING_FLAGS) != FOLIO_MAPPING_ANON)
> > @@ -546,7 +544,8 @@ struct anon_vma *folio_lock_anon_vma_read(const struct folio *folio,
> > struct anon_vma *root_anon_vma;
> > unsigned long anon_mapping;
> >
> > -retry:
> > + VM_WARN_ON_FOLIO(!folio_test_locked(folio), folio);
> > +
> > rcu_read_lock();
> > anon_mapping = (unsigned long)READ_ONCE(folio->mapping);
> > if ((anon_mapping & FOLIO_MAPPING_FLAGS) != FOLIO_MAPPING_ANON)
> > @@ -557,17 +556,6 @@ struct anon_vma *folio_lock_anon_vma_read(const struct folio *folio,
> > anon_vma = (struct anon_vma *) (anon_mapping - FOLIO_MAPPING_ANON);
> > root_anon_vma = READ_ONCE(anon_vma->root);
> > if (down_read_trylock(&root_anon_vma->rwsem)) {
> > - /*
> > - * folio_move_anon_rmap() might have changed the anon_vma as we
> > - * might not hold the folio lock here.
> > - */
> > - if (unlikely((unsigned long)READ_ONCE(folio->mapping) !=
> > - anon_mapping)) {
> > - up_read(&root_anon_vma->rwsem);
> > - rcu_read_unlock();
> > - goto retry;
> > - }
> > -
> > /*
> > * If the folio is still mapped, then this anon_vma is still
> > * its anon_vma, and holding the mutex ensures that it will
> > @@ -602,18 +590,6 @@ struct anon_vma *folio_lock_anon_vma_read(const struct folio *folio,
> > rcu_read_unlock();
> > anon_vma_lock_read(anon_vma);
> >
> > - /*
> > - * folio_move_anon_rmap() might have changed the anon_vma as we might
> > - * not hold the folio lock here.
> > - */
> > - if (unlikely((unsigned long)READ_ONCE(folio->mapping) !=
> > - anon_mapping)) {
> > - anon_vma_unlock_read(anon_vma);
> > - put_anon_vma(anon_vma);
> > - anon_vma = NULL;
> > - goto retry;
> > - }
> > -
> > if (atomic_dec_and_test(&anon_vma->refcount)) {
> > /*
> > * Oops, we held the last refcount, release the lock
> > @@ -1005,7 +981,7 @@ int folio_referenced(struct folio *folio, int is_locked,
> > if (!folio_raw_mapping(folio))
> > return 0;
> >
> > - if (!is_locked && (!folio_test_anon(folio) || folio_test_ksm(folio))) {
> > + if (!is_locked) {
> > we_locked = folio_trylock(folio);
> > if (!we_locked)
> > return 1;
> > @@ -2815,6 +2791,12 @@ static void rmap_walk_anon(struct folio *folio,
> > pgoff_t pgoff_start, pgoff_end;
> > struct anon_vma_chain *avc;
> >
> > + /*
> > + * The folio lock ensures that folio->mapping can't be changed under us
> > + * to an anon_vma with different root.
> > + */
> > + VM_WARN_ON_FOLIO(!folio_test_locked(folio), folio);
> > +
> > if (locked) {
> > anon_vma = folio_anon_vma(folio);
> > /* anon_vma disappear under us? */
> > --
> > 2.51.0.384.g4c02a37b29-goog
> >
> >
>
> All of the above actual changes to the locking logic looks ok to me though.
Awesome! :)
* Re: [PATCH 1/2] mm: always call rmap_walk() on locked folios
2025-09-18 12:15 ` David Hildenbrand
@ 2025-09-19 6:09 ` Lokesh Gidra
2025-09-24 10:00 ` David Hildenbrand
0 siblings, 1 reply; 26+ messages in thread
From: Lokesh Gidra @ 2025-09-19 6:09 UTC (permalink / raw)
To: David Hildenbrand
Cc: akpm, linux-mm, kaleshsingh, ngeoffray, jannh, Lorenzo Stoakes,
Harry Yoo, Peter Xu, Suren Baghdasaryan, Barry Song,
SeongJae Park
On Thu, Sep 18, 2025 at 5:16 AM David Hildenbrand <david@redhat.com> wrote:
>
> On 18.09.25 07:51, Lokesh Gidra wrote:
> > Guarantee that rmap_walk() is called on locked folios so that threads
> > changing folio->mapping and folio->index for non-KSM anon folios can
> > serialize on fine-grained folio lock rather than anon_vma lock. Other
> > folio types are already always locked before rmap_walk().
> >
> > This is in preparation for removing anon_vma write-lock from
> > UFFDIO_MOVE.
>
> I think quite a big bunch of the cover letter belongs into this patch
> here. In essence, why is it okay, what could be problematic, history
> etc, etc.
>
Actually, I thought that eventually the cover letter's text would be
copied here, so I didn't want to duplicate it. But I guess I can redo
it with the cover letter only containing the 'why' part and each patch
containing the 'what'.
> >
> > CC: David Hildenbrand <david@redhat.com>
> > CC: Lorenzo Stoakes <lorenzo.stoakes@oracle.com>
> > CC: Harry Yoo <harry.yoo@oracle.com>
> > CC: Peter Xu <peterx@redhat.com>
> > CC: Suren Baghdasaryan <surenb@google.com>
> > CC: Barry Song <baohua@kernel.org>
> > CC: SeongJae Park <sj@kernel.org>
> > Signed-off-by: Lokesh Gidra <lokeshgidra@google.com>
> > ---
> > mm/damon/ops-common.c | 16 ++++------------
> > mm/memory-failure.c | 3 +++
> > mm/page_idle.c | 8 ++------
> > mm/rmap.c | 42 ++++++++++++------------------------------
> > 4 files changed, 21 insertions(+), 48 deletions(-)
> >
> > diff --git a/mm/damon/ops-common.c b/mm/damon/ops-common.c
> > index 998c5180a603..f61d6dde13dc 100644
> > --- a/mm/damon/ops-common.c
> > +++ b/mm/damon/ops-common.c
> > @@ -162,21 +162,17 @@ void damon_folio_mkold(struct folio *folio)
> > .rmap_one = damon_folio_mkold_one,
> > .anon_lock = folio_lock_anon_vma_read,
> > };
> > - bool need_lock;
> >
> > if (!folio_mapped(folio) || !folio_raw_mapping(folio)) {
> > folio_set_idle(folio);
> > return;
> > }
> >
> > - need_lock = !folio_test_anon(folio) || folio_test_ksm(folio);
> > - if (need_lock && !folio_trylock(folio))
> > + if (!folio_trylock(folio))
> > return;
> >
> > rmap_walk(folio, &rwc);
> > -
> > - if (need_lock)
> > - folio_unlock(folio);
> > + folio_unlock(folio);
> >
> > }
> >
> > @@ -228,7 +224,6 @@ bool damon_folio_young(struct folio *folio)
> > .rmap_one = damon_folio_young_one,
> > .anon_lock = folio_lock_anon_vma_read,
> > };
> > - bool need_lock;
> >
> > if (!folio_mapped(folio) || !folio_raw_mapping(folio)) {
> > if (folio_test_idle(folio))
> > @@ -237,14 +232,11 @@ bool damon_folio_young(struct folio *folio)
> > return true;
> > }
> >
> > - need_lock = !folio_test_anon(folio) || folio_test_ksm(folio);
> > - if (need_lock && !folio_trylock(folio))
> > + if (!folio_trylock(folio))
> > return false;
> >
> > rmap_walk(folio, &rwc);
> > -
> > - if (need_lock)
> > - folio_unlock(folio);
> > + folio_unlock(folio);
> >
> > return accessed;
> > }
> > diff --git a/mm/memory-failure.c b/mm/memory-failure.c
> > index a24806bb8e82..f698df156bf8 100644
> > --- a/mm/memory-failure.c
> > +++ b/mm/memory-failure.c
> > @@ -2143,7 +2143,10 @@ static void kill_procs_now(struct page *p, unsigned long pfn, int flags,
> > {
> > LIST_HEAD(tokill);
> >
> > + folio_lock(folio);
> > collect_procs(folio, p, &tokill, flags & MF_ACTION_REQUIRED);
> > + folio_unlock(folio);
> > +
> > kill_procs(&tokill, true, pfn, flags);
> > }
>
> Is mf_generic_kill_procs()->collect_procs() problematic?
>
> The dax_lock_folio/dax_unlock_folio part confuses me: I think it only
> locks an entry in the page cache but not the actual folio ("Lock the DAX
> entry corresponding to a folio")?
Yeah. The name dax_lock_folio() gives an impression as if the folio is
locked but it isn't :)
IIUC, a dax folio can't have an anon_vma (folio->mapping is actually
an address_space instead of anon_vma), right? So, I thought it wasn't
required to actually lock the folio in this case. Please let me know
if you want me to still lock the folio around collect_procs(), or add
a comment?
>
> --
> Cheers
>
> David / dhildenb
>
* Re: [PATCH 2/2] mm/userfaultfd: don't lock anon_vma when performing UFFDIO_MOVE
2025-09-18 12:38 ` Lorenzo Stoakes
@ 2025-09-19 6:30 ` Lokesh Gidra
2025-09-19 9:57 ` Lorenzo Stoakes
0 siblings, 1 reply; 26+ messages in thread
From: Lokesh Gidra @ 2025-09-19 6:30 UTC (permalink / raw)
To: Lorenzo Stoakes
Cc: akpm, linux-mm, kaleshsingh, ngeoffray, jannh, David Hildenbrand,
Peter Xu, Suren Baghdasaryan, Barry Song
On Thu, Sep 18, 2025 at 5:38 AM Lorenzo Stoakes
<lorenzo.stoakes@oracle.com> wrote:
>
> On Wed, Sep 17, 2025 at 10:51:35PM -0700, Lokesh Gidra wrote:
> > Now that rmap_walk() is guaranteed to be called with the folio lock
> > held, we can stop serializing on the src VMA anon_vma lock when moving
> > an exclusive folio from a src VMA to a dst VMA in UFFDIO_MOVE ioctl.
> >
> > When moving a folio, we modify folio->mapping through
> > folio_move_anon_rmap() and adjust folio->index accordingly. Doing that
> > while we could have concurrent RMAP walks would be dangerous. Therefore,
> > to avoid that, we had to acquire anon_vma of src VMA in write-mode. That
> > meant that when multiple threads called UFFDIO_MOVE concurrently on
> > distinct pages of the same src VMA, they would serialize on it, hurting
> > scalability.
> >
> > In addition to avoiding the scalability bottleneck, this patch also
> > simplifies the complicated lock dance that UFFDIO_MOVE has to go through
> > between RCU, folio-lock, ptl, and anon_vma.
> >
> > folio_move_anon_rmap() already enforces that the folio is locked. So
> > when we have the folio locked we can no longer race with concurrent
> > rmap_walk() as used by folio_referenced() and hence the anon_vma lock
>
> And other rmap callers right?
Right. Will fix it in the next version.
>
> > is no longer required.
> >
> > Note that this handling is now the same as for other
> > folio_move_anon_rmap() users that also do not hold the anon_vma lock --
> > namely COW reuse handling. These users never required the anon_vma lock
> > as they are only moving the anon VMA closer to the anon_vma leaf of the
> > VMA, for example, from an anon_vma root to a leaf of that root. rmap
> > walks were always able to tolerate that scenario.
>
> Which users?
The COW reusers, namely:
do_wp_page()->wp_can_reuse_anon_folio()
do_huge_pmd_wp_page()
hugetlb_wp()
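For reference, a minimal sketch of the pattern those three share (simplified;
the real paths also verify exclusivity/refcounts before reusing, and the
helper name below is only illustrative):

#include <linux/mm.h>
#include <linux/pagemap.h>
#include <linux/rmap.h>

/*
 * Simplified COW-reuse step: with only the folio lock held, move the
 * folio's anon_vma from the root towards the VMA's leaf anon_vma.
 * folio_move_anon_rmap() itself asserts that the folio is locked.
 */
static bool reuse_anon_folio_sketch(struct folio *folio,
                                    struct vm_area_struct *vma)
{
        if (!folio_trylock(folio))
                return false;
        /* the real callers check PageAnonExclusive/refcounts before this */
        folio_move_anon_rmap(folio, vma);
        folio_unlock(folio);
        return true;
}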
>
> >
> > CC: David Hildenbrand <david@redhat.com>
> > CC: Lorenzo Stoakes <lorenzo.stoakes@oracle.com>
> > CC: Peter Xu <peterx@redhat.com>
> > CC: Suren Baghdasaryan <surenb@google.com>
> > CC: Barry Song <baohua@kernel.org>
> > Signed-off-by: Lokesh Gidra <lokeshgidra@google.com>
> > ---
> > mm/huge_memory.c | 22 +----------------
> > mm/userfaultfd.c | 62 +++++++++---------------------------------------
> > 2 files changed, 12 insertions(+), 72 deletions(-)
> >
> > diff --git a/mm/huge_memory.c b/mm/huge_memory.c
> > index 5acca24bbabb..f444c142a8be 100644
> > --- a/mm/huge_memory.c
> > +++ b/mm/huge_memory.c
> > @@ -2533,7 +2533,6 @@ int move_pages_huge_pmd(struct mm_struct *mm, pmd_t *dst_pmd, pmd_t *src_pmd, pm
> > pmd_t _dst_pmd, src_pmdval;
> > struct page *src_page;
> > struct folio *src_folio;
> > - struct anon_vma *src_anon_vma;
> > spinlock_t *src_ptl, *dst_ptl;
> > pgtable_t src_pgtable;
> > struct mmu_notifier_range range;
> > @@ -2582,23 +2581,9 @@ int move_pages_huge_pmd(struct mm_struct *mm, pmd_t *dst_pmd, pmd_t *src_pmd, pm
> > src_addr + HPAGE_PMD_SIZE);
> > mmu_notifier_invalidate_range_start(&range);
> >
> > - if (src_folio) {
> > + if (src_folio)
> > folio_lock(src_folio);
> >
> > - /*
> > - * split_huge_page walks the anon_vma chain without the page
> > - * lock. Serialize against it with the anon_vma lock, the page
> > - * lock is not enough.
> > - */
> > - src_anon_vma = folio_get_anon_vma(src_folio);
> > - if (!src_anon_vma) {
> > - err = -EAGAIN;
> > - goto unlock_folio;
> > - }
> > - anon_vma_lock_write(src_anon_vma);
> > - } else
> > - src_anon_vma = NULL;
> > -
>
> Hmm this seems an odd thing to include in the uffd change. Why not just include
> it in the last commit or as a separate commit?
I'm not sure I follow. What am I including here?
BTW, IMHO, the comment is wrong here. folio split code already
acquires folio lock. The anon_vma lock is required here for the same
reason as non-large page case - to avoid concurrent rmap walks.
>
> > dst_ptl = pmd_lockptr(mm, dst_pmd);
> > double_pt_lock(src_ptl, dst_ptl);
> > if (unlikely(!pmd_same(*src_pmd, src_pmdval) ||
> > @@ -2643,11 +2628,6 @@ int move_pages_huge_pmd(struct mm_struct *mm, pmd_t *dst_pmd, pmd_t *src_pmd, pm
> > pgtable_trans_huge_deposit(mm, dst_pmd, src_pgtable);
> > unlock_ptls:
> > double_pt_unlock(src_ptl, dst_ptl);
> > - if (src_anon_vma) {
> > - anon_vma_unlock_write(src_anon_vma);
> > - put_anon_vma(src_anon_vma);
> > - }
> > -unlock_folio:
> > /* unblock rmap walks */
> > if (src_folio)
> > folio_unlock(src_folio);
> > diff --git a/mm/userfaultfd.c b/mm/userfaultfd.c
> > index af61b95c89e4..6be65089085e 100644
> > --- a/mm/userfaultfd.c
> > +++ b/mm/userfaultfd.c
> > @@ -1035,8 +1035,7 @@ static inline bool is_pte_pages_stable(pte_t *dst_pte, pte_t *src_pte,
> > */
> > static struct folio *check_ptes_for_batched_move(struct vm_area_struct *src_vma,
> > unsigned long src_addr,
> > - pte_t *src_pte, pte_t *dst_pte,
> > - struct anon_vma *src_anon_vma)
> > + pte_t *src_pte, pte_t *dst_pte)
> > {
> > pte_t orig_dst_pte, orig_src_pte;
> > struct folio *folio;
> > @@ -1052,8 +1051,7 @@ static struct folio *check_ptes_for_batched_move(struct vm_area_struct *src_vma,
> > folio = vm_normal_folio(src_vma, src_addr, orig_src_pte);
> > if (!folio || !folio_trylock(folio))
> > return NULL;
> > - if (!PageAnonExclusive(&folio->page) || folio_test_large(folio) ||
> > - folio_anon_vma(folio) != src_anon_vma) {
> > + if (!PageAnonExclusive(&folio->page) || folio_test_large(folio)) {
> > folio_unlock(folio);
> > return NULL;
> > }
>
> It's good to unwind this obviously, though god I hate all these open coded checks.
>
> Let me also rant about how we seem to duplicate half of mm in uffd
> code. Yuck. This is really not how this should have been done AT ALL.
>
> > @@ -1061,9 +1059,8 @@ static struct folio *check_ptes_for_batched_move(struct vm_area_struct *src_vma,
> > }
> >
> > /*
> > - * Moves src folios to dst in a batch as long as they share the same
> > - * anon_vma as the first folio, are not large, and can successfully
> > - * take the lock via folio_trylock().
> > + * Moves src folios to dst in a batch as long as they are not large, and can
> > + * successfully take the lock via folio_trylock().
> > */
> > static long move_present_ptes(struct mm_struct *mm,
> > struct vm_area_struct *dst_vma,
> > @@ -1073,8 +1070,7 @@ static long move_present_ptes(struct mm_struct *mm,
> > pte_t orig_dst_pte, pte_t orig_src_pte,
> > pmd_t *dst_pmd, pmd_t dst_pmdval,
> > spinlock_t *dst_ptl, spinlock_t *src_ptl,
> > - struct folio **first_src_folio, unsigned long len,
> > - struct anon_vma *src_anon_vma)
> > + struct folio **first_src_folio, unsigned long len)
> > {
> > int err = 0;
> > struct folio *src_folio = *first_src_folio;
> > @@ -1132,8 +1128,8 @@ static long move_present_ptes(struct mm_struct *mm,
> > src_pte++;
> >
> > folio_unlock(src_folio);
> > - src_folio = check_ptes_for_batched_move(src_vma, src_addr, src_pte,
> > - dst_pte, src_anon_vma);
> > + src_folio = check_ptes_for_batched_move(src_vma, src_addr,
> > + src_pte, dst_pte);
> > if (!src_folio)
> > break;
> > }
> > @@ -1263,7 +1259,6 @@ static long move_pages_ptes(struct mm_struct *mm, pmd_t *dst_pmd, pmd_t *src_pmd
> > pmd_t dummy_pmdval;
> > pmd_t dst_pmdval;
> > struct folio *src_folio = NULL;
> > - struct anon_vma *src_anon_vma = NULL;
> > struct mmu_notifier_range range;
> > long ret = 0;
> >
> > @@ -1347,9 +1342,9 @@ static long move_pages_ptes(struct mm_struct *mm, pmd_t *dst_pmd, pmd_t *src_pmd
> > }
> >
> > /*
> > - * Pin and lock both source folio and anon_vma. Since we are in
> > - * RCU read section, we can't block, so on contention have to
> > - * unmap the ptes, obtain the lock and retry.
> > + * Pin and lock source folio. Since we are in RCU read section,
> > + * we can't block, so on contention have to unmap the ptes,
> > + * obtain the lock and retry.
>
> Not sure what pinning the anon_vma meant anyway :)
>
> > */
> > if (!src_folio) {
> > struct folio *folio;
> > @@ -1423,33 +1418,11 @@ static long move_pages_ptes(struct mm_struct *mm, pmd_t *dst_pmd, pmd_t *src_pmd
> > goto retry;
> > }
> >
> > - if (!src_anon_vma) {
> > - /*
> > - * folio_referenced walks the anon_vma chain
> > - * without the folio lock. Serialize against it with
> > - * the anon_vma lock, the folio lock is not enough.
> > - */
> > - src_anon_vma = folio_get_anon_vma(src_folio);
> > - if (!src_anon_vma) {
> > - /* page was unmapped from under us */
> > - ret = -EAGAIN;
> > - goto out;
> > - }
> > - if (!anon_vma_trylock_write(src_anon_vma)) {
> > - pte_unmap(src_pte);
> > - pte_unmap(dst_pte);
> > - src_pte = dst_pte = NULL;
> > - /* now we can block and wait */
> > - anon_vma_lock_write(src_anon_vma);
> > - goto retry;
> > - }
> > - }
> > -
> > ret = move_present_ptes(mm, dst_vma, src_vma,
> > dst_addr, src_addr, dst_pte, src_pte,
> > orig_dst_pte, orig_src_pte, dst_pmd,
> > dst_pmdval, dst_ptl, src_ptl, &src_folio,
> > - len, src_anon_vma);
> > + len);
> > } else {
> > struct folio *folio = NULL;
> >
> > @@ -1515,10 +1488,6 @@ static long move_pages_ptes(struct mm_struct *mm, pmd_t *dst_pmd, pmd_t *src_pmd
> > }
> >
> > out:
> > - if (src_anon_vma) {
> > - anon_vma_unlock_write(src_anon_vma);
> > - put_anon_vma(src_anon_vma);
> > - }
> > if (src_folio) {
> > folio_unlock(src_folio);
> > folio_put(src_folio);
> > @@ -1792,15 +1761,6 @@ static void uffd_move_unlock(struct vm_area_struct *dst_vma,
> > * virtual regions without knowing if there are transparent hugepage
> > * in the regions or not, but preventing the risk of having to split
> > * the hugepmd during the remap.
> > - *
> > - * If there's any rmap walk that is taking the anon_vma locks without
> > - * first obtaining the folio lock (the only current instance is
> > - * folio_referenced), they will have to verify if the folio->mapping
> > - * has changed after taking the anon_vma lock. If it changed they
> > - * should release the lock and retry obtaining a new anon_vma, because
> > - * it means the anon_vma was changed by move_pages() before the lock
> > - * could be obtained. This is the only additional complexity added to
> > - * the rmap code to provide this anonymous page remapping functionality.
> > */
> > ssize_t move_pages(struct userfaultfd_ctx *ctx, unsigned long dst_start,
> > unsigned long src_start, unsigned long len, __u64 mode)
> > --
> > 2.51.0.384.g4c02a37b29-goog
> >
> >
>
> Rest of logic looks OK to me!
* Re: [PATCH 2/2] mm/userfaultfd: don't lock anon_vma when performing UFFDIO_MOVE
2025-09-19 6:30 ` Lokesh Gidra
@ 2025-09-19 9:57 ` Lorenzo Stoakes
2025-09-19 18:34 ` Lokesh Gidra
0 siblings, 1 reply; 26+ messages in thread
From: Lorenzo Stoakes @ 2025-09-19 9:57 UTC (permalink / raw)
To: Lokesh Gidra
Cc: akpm, linux-mm, kaleshsingh, ngeoffray, jannh, David Hildenbrand,
Peter Xu, Suren Baghdasaryan, Barry Song
On Thu, Sep 18, 2025 at 11:30:48PM -0700, Lokesh Gidra wrote:
> On Thu, Sep 18, 2025 at 5:38 AM Lorenzo Stoakes
> <lorenzo.stoakes@oracle.com> wrote:
> >
> > On Wed, Sep 17, 2025 at 10:51:35PM -0700, Lokesh Gidra wrote:
> > > Now that rmap_walk() is guaranteed to be called with the folio lock
> > > held, we can stop serializing on the src VMA anon_vma lock when moving
> > > an exclusive folio from a src VMA to a dst VMA in UFFDIO_MOVE ioctl.
> > >
> > > When moving a folio, we modify folio->mapping through
> > > folio_move_anon_rmap() and adjust folio->index accordingly. Doing that
> > > while we could have concurrent RMAP walks would be dangerous. Therefore,
> > > to avoid that, we had to acquire anon_vma of src VMA in write-mode. That
> > > meant that when multiple threads called UFFDIO_MOVE concurrently on
> > > distinct pages of the same src VMA, they would serialize on it, hurting
> > > scalability.
> > >
> > > In addition to avoiding the scalability bottleneck, this patch also
> > > simplifies the complicated lock dance that UFFDIO_MOVE has to go through
> > > between RCU, folio-lock, ptl, and anon_vma.
> > >
> > > folio_move_anon_rmap() already enforces that the folio is locked. So
> > > when we have the folio locked we can no longer race with concurrent
> > > rmap_walk() as used by folio_referenced() and hence the anon_vma lock
> >
> > And other rmap callers right?
> Right. Will fix it in the next version.
Thanks!
> >
> > > is no longer required.
> > >
> > > Note that this handling is now the same as for other
> > > folio_move_anon_rmap() users that also do not hold the anon_vma lock --
> > > namely COW reuse handling. These users never required the anon_vma lock
> > > as they are only moving the anon VMA closer to the anon_vma leaf of the
> > > VMA, for example, from an anon_vma root to a leaf of that root. rmap
> > > walks were always able to tolerate that scenario.
> >
> > Which users?
>
> The COW reusers, namely:
> do_wp_page()->wp_can_reuse_anon_folio()
> do_huge_pmd_wp_page()
> hugetlb_wp()
Right let's put this in the commit message is what I mean :)
>
> >
> > >
> > > CC: David Hildenbrand <david@redhat.com>
> > > CC: Lorenzo Stoakes <lorenzo.stoakes@oracle.com>
> > > CC: Peter Xu <peterx@redhat.com>
> > > CC: Suren Baghdasaryan <surenb@google.com>
> > > CC: Barry Song <baohua@kernel.org>
> > > Signed-off-by: Lokesh Gidra <lokeshgidra@google.com>
> > > ---
> > > mm/huge_memory.c | 22 +----------------
> > > mm/userfaultfd.c | 62 +++++++++---------------------------------------
> > > 2 files changed, 12 insertions(+), 72 deletions(-)
> > >
> > > diff --git a/mm/huge_memory.c b/mm/huge_memory.c
> > > index 5acca24bbabb..f444c142a8be 100644
> > > --- a/mm/huge_memory.c
> > > +++ b/mm/huge_memory.c
> > > @@ -2533,7 +2533,6 @@ int move_pages_huge_pmd(struct mm_struct *mm, pmd_t *dst_pmd, pmd_t *src_pmd, pm
> > > pmd_t _dst_pmd, src_pmdval;
> > > struct page *src_page;
> > > struct folio *src_folio;
> > > - struct anon_vma *src_anon_vma;
> > > spinlock_t *src_ptl, *dst_ptl;
> > > pgtable_t src_pgtable;
> > > struct mmu_notifier_range range;
> > > @@ -2582,23 +2581,9 @@ int move_pages_huge_pmd(struct mm_struct *mm, pmd_t *dst_pmd, pmd_t *src_pmd, pm
> > > src_addr + HPAGE_PMD_SIZE);
> > > mmu_notifier_invalidate_range_start(&range);
> > >
> > > - if (src_folio) {
> > > + if (src_folio)
> > > folio_lock(src_folio);
> > >
> > > - /*
> > > - * split_huge_page walks the anon_vma chain without the page
> > > - * lock. Serialize against it with the anon_vma lock, the page
> > > - * lock is not enough.
> > > - */
> > > - src_anon_vma = folio_get_anon_vma(src_folio);
> > > - if (!src_anon_vma) {
> > > - err = -EAGAIN;
> > > - goto unlock_folio;
> > > - }
> > > - anon_vma_lock_write(src_anon_vma);
> > > - } else
> > > - src_anon_vma = NULL;
> > > -
> >
> > Hmm this seems an odd thing to include in the uffd change. Why not just include
> > it in the last commit or as a separate commit?
You're changing move_pages_huge_pmd() here in a patch that's about the uffd
change; seems unrelated, no?
>
> I'm not sure I follow. What am I including here?
>
> BTW, IMHO, the comment is wrong here. folio split code already
> acquires folio lock. The anon_vma lock is required here for the same
> reason as non-large page case - to avoid concurrent rmap walks.
This is called via split_huge_page() used by KSM and memory failure, not the
usual folio split logic afaict.
But those callers all take the folio lock afaict :)
So yeah the comment is wrong it seems!
* Re: [PATCH 1/2] mm: always call rmap_walk() on locked folios
2025-09-19 5:45 ` Lokesh Gidra
@ 2025-09-19 9:59 ` Lorenzo Stoakes
2025-11-03 14:58 ` Lorenzo Stoakes
1 sibling, 0 replies; 26+ messages in thread
From: Lorenzo Stoakes @ 2025-09-19 9:59 UTC (permalink / raw)
To: Lokesh Gidra
Cc: akpm, linux-mm, kaleshsingh, ngeoffray, jannh, David Hildenbrand,
Harry Yoo, Peter Xu, Suren Baghdasaryan, Barry Song,
SeongJae Park
On Thu, Sep 18, 2025 at 10:45:21PM -0700, Lokesh Gidra wrote:
> On Thu, Sep 18, 2025 at 4:57 AM Lorenzo Stoakes
> <lorenzo.stoakes@oracle.com> wrote:
> >
> > On Wed, Sep 17, 2025 at 10:51:34PM -0700, Lokesh Gidra wrote:
> > > Guarantee that rmap_walk() is called on locked folios so that threads
> > > changing folio->mapping and folio->index for non-KSM anon folios can
> > > serialize on fine-grained folio lock rather than anon_vma lock. Other
> > > folio types are already always locked before rmap_walk().
> >
> > Be good to explain why you're doing certain things, like adding the folio
> > lock to kill_procs_now().
>
> Agreed! I'll add in the next version.
Great, thanks! :)
> >
> > Also worth noting that you're going from _definitely_ locking non-KSM anon
> > to _not necessarily_ locking it.
>
> Will do. But, just to be clear, you mean the opposite right?
> >
> > You should explain why you think this is fine (in general - rmap callers do
> > some other check to see if what is expected to mapped did happen so it's
> > fine, or otherwise treat things as best-effort).
>
> Sure thing.
Thanks!
> >
> > You should probably also put some information about performance impact
> > here, I think Barry provided some?
> >
I added that in the cover letter. Basically the impact of trylocking
non-KSM anon folios (in folio_referenced()) on shrink_active_list().
I'll move it here.
> > >
> > > This is in preparation for removing anon_vma write-lock from
> > > UFFDIO_MOVE.
> >
> > Thanks for mentioning this :)
> >
> I got your message loud and clear the last time :)
> > >
> > > CC: David Hildenbrand <david@redhat.com>
> > > CC: Lorenzo Stoakes <lorenzo.stoakes@oracle.com>
> > > CC: Harry Yoo <harry.yoo@oracle.com>
> > > CC: Peter Xu <peterx@redhat.com>
> > > CC: Suren Baghdasaryan <surenb@google.com>
> > > CC: Barry Song <baohua@kernel.org>
> > > CC: SeongJae Park <sj@kernel.org>
> > > Signed-off-by: Lokesh Gidra <lokeshgidra@google.com>
> >
> > OK so you're making:
> >
> > folio_lock_anon_vma_read()
> > folio_get_anon_vma()
> >
> > Require folio locks.
> >
> > folio_lock_anon_vma_read() is called from:
> >
> > collect_procs_anon() - changed to take folio lock.
> > page_idle_clear_pte_refs() - changed to take folio lock.
> > damon_folio_mkold() - changed to take folio lock.
> > damon_folio_young() - changed to take folio lock.
> > folio_referenced() - changed to take folio lock.
> > try_to_unmap() - ???
> > try_to_migrate() - ???
> >
> > (I note that we allow a TTU_RMAP_LOCKED walk in the above unmap, migrate
> > cases too, wonder how these will interact?)
> >
> > folio_get_anon_vma() is called from:
> >
> > move_pages_huge_pmd() - already holds folio lock.
> > __folio_split() - already holds folio lock.
> > move_pages_ptes() [uffd] - already holds folio lock.
> > migrate_folio_unmap() - already holds folio lock.
> > unmap_and_move_huge_page() - already holds folio lock.
> >
> > Can you:
> >
> > a. Confirm the try_to_unmap() and try_to_migrate() cases take the folio
> > lock. Explicitly list the callers and how they acquire the folio lock.
> >
The documentation comments of both try_to_migrate() and try_to_unmap() say
that the caller must hold the folio lock. But just to be safe, I went
through all the callers, and all of them are holding the folio lock:
>
> try_to_unmap() is called from:
> unmap_folio()
> collapse_file()
> shrink_folio_list()
shrink_folio_list()->unmap_poisoned_folio()
> do_migrate_range()->unmap_poisoned_folio()
> try_memory_failure_hugetlb()->hwpoison_user_mappings()->unmap_poisoned_folio()
> memory_failure()->hwpoison_user_mappings()->unmap_poisoned_folio()
>
> try_to_migrate() is called from:
> unmap_folio()
> unmap_and_move_huge_page()
> migrate_folio_unmap()
migrate_vma_collect()->migrate_vma_collect_pmd() acquires the folio lock in the case of
> migrate_vma_setup()->migrate_vma_unmap()->migrate_device_unmap()
> migrate_device_pfn_lock() acquires folio locks in the following cases:
> migrate_device_range()->migrate_device_unmap()
> migrate_device_pfns()->migrate_device_unmap()
>
> All the callers of rmap_walk()/rmap_walk_locked() are already covered
> in the folio_lock_anon_vma_read() list that you added, except
> remove_migration_ptes(), which is called from:
> __folio_split()->remap_page()
> migrate_device_unmap()
> __migrate_device_finalize()
> unmap_and_move_huge_page()
> migrate_folio_unmap() locks the folio for the following two in
> migrate_pages_batch():
> migrate_folio_move()
> migrate_folio_undo_src()
Awesome thanks for checking that! I did think it was probably fine, but
it's important to be as thorough as we can be.
>
> > b. Update the commit message to include the above. You're making a _very_
> > sensitive locking change here, it's important to demonstrate that you've
> > considered all cases.
> Certainly, will do.
> >
> > Thanks!
> >
> > > ---
> > > mm/damon/ops-common.c | 16 ++++------------
> > > mm/memory-failure.c | 3 +++
> > > mm/page_idle.c | 8 ++------
> > > mm/rmap.c | 42 ++++++++++++------------------------------
> > > 4 files changed, 21 insertions(+), 48 deletions(-)
> > >
> > > diff --git a/mm/damon/ops-common.c b/mm/damon/ops-common.c
> > > index 998c5180a603..f61d6dde13dc 100644
> > > --- a/mm/damon/ops-common.c
> > > +++ b/mm/damon/ops-common.c
> > > @@ -162,21 +162,17 @@ void damon_folio_mkold(struct folio *folio)
> > > .rmap_one = damon_folio_mkold_one,
> > > .anon_lock = folio_lock_anon_vma_read,
> > > };
> > > - bool need_lock;
> > >
> > > if (!folio_mapped(folio) || !folio_raw_mapping(folio)) {
> > > folio_set_idle(folio);
> > > return;
> > > }
> > >
> > > - need_lock = !folio_test_anon(folio) || folio_test_ksm(folio);
> > > - if (need_lock && !folio_trylock(folio))
> > > + if (!folio_trylock(folio))
> > > return;
> >
> > This _seems_ to be best effort and not relying on anon always
> > succeeding. So should be fine.
> >
> > >
> > > rmap_walk(folio, &rwc);
> > > -
> > > - if (need_lock)
> > > - folio_unlock(folio);
> > > + folio_unlock(folio);
> > >
> > > }
> > >
> > > @@ -228,7 +224,6 @@ bool damon_folio_young(struct folio *folio)
> > > .rmap_one = damon_folio_young_one,
> > > .anon_lock = folio_lock_anon_vma_read,
> > > };
> > > - bool need_lock;
> > >
> > > if (!folio_mapped(folio) || !folio_raw_mapping(folio)) {
> > > if (folio_test_idle(folio))
> > > @@ -237,14 +232,11 @@ bool damon_folio_young(struct folio *folio)
> > > return true;
> > > }
> > >
> > > - need_lock = !folio_test_anon(folio) || folio_test_ksm(folio);
> > > - if (need_lock && !folio_trylock(folio))
> > > + if (!folio_trylock(folio))
> > > return false;
> >
> > Same as above here.
> >
> > >
> > > rmap_walk(folio, &rwc);
> > > -
> > > - if (need_lock)
> > > - folio_unlock(folio);
> > > + folio_unlock(folio);
> > >
> > > return accessed;
> > > }
> > > diff --git a/mm/memory-failure.c b/mm/memory-failure.c
> > > index a24806bb8e82..f698df156bf8 100644
> > > --- a/mm/memory-failure.c
> > > +++ b/mm/memory-failure.c
> > > @@ -2143,7 +2143,10 @@ static void kill_procs_now(struct page *p, unsigned long pfn, int flags,
> > > {
> > > LIST_HEAD(tokill);
> > >
> > > + folio_lock(folio);
> > > collect_procs(folio, p, &tokill, flags & MF_ACTION_REQUIRED);
> > > + folio_unlock(folio);
> > > +
> >
> > Good. I hate how this works.
> >
> > > kill_procs(&tokill, true, pfn, flags);
> > > }
> > >
> > > diff --git a/mm/page_idle.c b/mm/page_idle.c
> > > index a82b340dc204..9bf573d22e87 100644
> > > --- a/mm/page_idle.c
> > > +++ b/mm/page_idle.c
> > > @@ -101,19 +101,15 @@ static void page_idle_clear_pte_refs(struct folio *folio)
> > > .rmap_one = page_idle_clear_pte_refs_one,
> > > .anon_lock = folio_lock_anon_vma_read,
> > > };
> > > - bool need_lock;
> > >
> > > if (!folio_mapped(folio) || !folio_raw_mapping(folio))
> > > return;
> > >
> > > - need_lock = !folio_test_anon(folio) || folio_test_ksm(folio);
> > > - if (need_lock && !folio_trylock(folio))
> > > + if (!folio_trylock(folio))
> > > return;
> >
> > This checks folio idle bit after so that's fine for anon to not succeed due
> > to contention.
> >
> > >
> > > rmap_walk(folio, &rwc);
> > > -
> > > - if (need_lock)
> > > - folio_unlock(folio);
> > > + folio_unlock(folio);
> > > }
> > >
> > > static ssize_t page_idle_bitmap_read(struct file *file, struct kobject *kobj,
> > > diff --git a/mm/rmap.c b/mm/rmap.c
> > > index 34333ae3bd80..90584f5da379 100644
> > > --- a/mm/rmap.c
> > > +++ b/mm/rmap.c
> > > @@ -489,17 +489,15 @@ void __init anon_vma_init(void)
> > > * if there is a mapcount, we can dereference the anon_vma after observing
> > > * those.
> > > *
> > > - * NOTE: the caller should normally hold folio lock when calling this. If
> > > - * not, the caller needs to double check the anon_vma didn't change after
> > > - * taking the anon_vma lock for either read or write (UFFDIO_MOVE can modify it
> > > - * concurrently without folio lock protection). See folio_lock_anon_vma_read()
> > > - * which has already covered that, and comment above remap_pages().
> > > + * NOTE: the caller should hold folio lock when calling this.
> > > */
> > > struct anon_vma *folio_get_anon_vma(const struct folio *folio)
> > > {
> > > struct anon_vma *anon_vma = NULL;
> > > unsigned long anon_mapping;
> > >
> > > + VM_WARN_ON_FOLIO(!folio_test_locked(folio), folio);
> > > +
> > > rcu_read_lock();
> > > anon_mapping = (unsigned long)READ_ONCE(folio->mapping);
> > > if ((anon_mapping & FOLIO_MAPPING_FLAGS) != FOLIO_MAPPING_ANON)
> > > @@ -546,7 +544,8 @@ struct anon_vma *folio_lock_anon_vma_read(const struct folio *folio,
> > > struct anon_vma *root_anon_vma;
> > > unsigned long anon_mapping;
> > >
> > > -retry:
> > > + VM_WARN_ON_FOLIO(!folio_test_locked(folio), folio);
> > > +
> > > rcu_read_lock();
> > > anon_mapping = (unsigned long)READ_ONCE(folio->mapping);
> > > if ((anon_mapping & FOLIO_MAPPING_FLAGS) != FOLIO_MAPPING_ANON)
> > > @@ -557,17 +556,6 @@ struct anon_vma *folio_lock_anon_vma_read(const struct folio *folio,
> > > anon_vma = (struct anon_vma *) (anon_mapping - FOLIO_MAPPING_ANON);
> > > root_anon_vma = READ_ONCE(anon_vma->root);
> > > if (down_read_trylock(&root_anon_vma->rwsem)) {
> > > - /*
> > > - * folio_move_anon_rmap() might have changed the anon_vma as we
> > > - * might not hold the folio lock here.
> > > - */
> > > - if (unlikely((unsigned long)READ_ONCE(folio->mapping) !=
> > > - anon_mapping)) {
> > > - up_read(&root_anon_vma->rwsem);
> > > - rcu_read_unlock();
> > > - goto retry;
> > > - }
> > > -
> > > /*
> > > * If the folio is still mapped, then this anon_vma is still
> > > * its anon_vma, and holding the mutex ensures that it will
> > > @@ -602,18 +590,6 @@ struct anon_vma *folio_lock_anon_vma_read(const struct folio *folio,
> > > rcu_read_unlock();
> > > anon_vma_lock_read(anon_vma);
> > >
> > > - /*
> > > - * folio_move_anon_rmap() might have changed the anon_vma as we might
> > > - * not hold the folio lock here.
> > > - */
> > > - if (unlikely((unsigned long)READ_ONCE(folio->mapping) !=
> > > - anon_mapping)) {
> > > - anon_vma_unlock_read(anon_vma);
> > > - put_anon_vma(anon_vma);
> > > - anon_vma = NULL;
> > > - goto retry;
> > > - }
> > > -
> > > if (atomic_dec_and_test(&anon_vma->refcount)) {
> > > /*
> > > * Oops, we held the last refcount, release the lock
> > > @@ -1005,7 +981,7 @@ int folio_referenced(struct folio *folio, int is_locked,
> > > if (!folio_raw_mapping(folio))
> > > return 0;
> > >
> > > - if (!is_locked && (!folio_test_anon(folio) || folio_test_ksm(folio))) {
> > > + if (!is_locked) {
> > > we_locked = folio_trylock(folio);
> > > if (!we_locked)
> > > return 1;
> > > @@ -2815,6 +2791,12 @@ static void rmap_walk_anon(struct folio *folio,
> > > pgoff_t pgoff_start, pgoff_end;
> > > struct anon_vma_chain *avc;
> > >
> > > + /*
> > > + * The folio lock ensures that folio->mapping can't be changed under us
> > > + * to an anon_vma with different root.
> > > + */
> > > + VM_WARN_ON_FOLIO(!folio_test_locked(folio), folio);
> > > +
> > > if (locked) {
> > > anon_vma = folio_anon_vma(folio);
> > > /* anon_vma disappear under us? */
> > > --
> > > 2.51.0.384.g4c02a37b29-goog
> > >
> > >
> >
> > All of the above actual changes to the locking logic looks ok to me though.
> Awesome! :)
>
:) We're nearly there on this... now that you, David and I have gone through
this in detail, I'm confident this is actually a really quite nice
improvement in general, and by unravelling some of the locking hell we might
actually get some other benefits out of this (as referenced by Barry) :)
Cheers, Lorenzo
* Re: [PATCH 2/2] mm/userfaultfd: don't lock anon_vma when performing UFFDIO_MOVE
2025-09-19 9:57 ` Lorenzo Stoakes
@ 2025-09-19 18:34 ` Lokesh Gidra
0 siblings, 0 replies; 26+ messages in thread
From: Lokesh Gidra @ 2025-09-19 18:34 UTC (permalink / raw)
To: Lorenzo Stoakes
Cc: akpm, linux-mm, kaleshsingh, ngeoffray, jannh, David Hildenbrand,
Peter Xu, Suren Baghdasaryan, Barry Song
On Fri, Sep 19, 2025 at 2:58 AM Lorenzo Stoakes
<lorenzo.stoakes@oracle.com> wrote:
>
> On Thu, Sep 18, 2025 at 11:30:48PM -0700, Lokesh Gidra wrote:
> > On Thu, Sep 18, 2025 at 5:38 AM Lorenzo Stoakes
> > <lorenzo.stoakes@oracle.com> wrote:
> > >
> > > On Wed, Sep 17, 2025 at 10:51:35PM -0700, Lokesh Gidra wrote:
> > > > Now that rmap_walk() is guaranteed to be called with the folio lock
> > > > held, we can stop serializing on the src VMA anon_vma lock when moving
> > > > an exclusive folio from a src VMA to a dst VMA in UFFDIO_MOVE ioctl.
> > > >
> > > > When moving a folio, we modify folio->mapping through
> > > > folio_move_anon_rmap() and adjust folio->index accordingly. Doing that
> > > > while we could have concurrent RMAP walks would be dangerous. Therefore,
> > > > to avoid that, we had to acquire anon_vma of src VMA in write-mode. That
> > > > meant that when multiple threads called UFFDIO_MOVE concurrently on
> > > > distinct pages of the same src VMA, they would serialize on it, hurting
> > > > scalability.
> > > >
> > > > In addition to avoiding the scalability bottleneck, this patch also
> > > > simplifies the complicated lock dance that UFFDIO_MOVE has to go through
> > > > between RCU, folio-lock, ptl, and anon_vma.
> > > >
> > > > folio_move_anon_rmap() already enforces that the folio is locked. So
> > > > when we have the folio locked we can no longer race with concurrent
> > > > rmap_walk() as used by folio_referenced() and hence the anon_vma lock
> > >
> > > And other rmap callers right?
> > Right. Will fix it in the next version.
>
> Thanks!
>
> > >
> > > > is no longer required.
> > > >
> > > > Note that this handling is now the same as for other
> > > > folio_move_anon_rmap() users that also do not hold the anon_vma lock --
> > > > namely COW reuse handling. These users never required the anon_vma lock
> > > > as they are only moving the anon VMA closer to the anon_vma leaf of the
> > > > VMA, for example, from an anon_vma root to a leaf of that root. rmap
> > > > walks were always able to tolerate that scenario.
> > >
> > > Which users?
> >
> > The COW reusers, namely:
> > do_wp_page()->wp_can_reuse_anon_folio()
> > do_huge_pmd_wp_page()
> > hugetlb_wp()
>
> Right let's put this in the commit message is what I mean :)
>
> >
> > >
> > > >
> > > > CC: David Hildenbrand <david@redhat.com>
> > > > CC: Lorenzo Stoakes <lorenzo.stoakes@oracle.com>
> > > > CC: Peter Xu <peterx@redhat.com>
> > > > CC: Suren Baghdasaryan <surenb@google.com>
> > > > CC: Barry Song <baohua@kernel.org>
> > > > Signed-off-by: Lokesh Gidra <lokeshgidra@google.com>
> > > > ---
> > > > mm/huge_memory.c | 22 +----------------
> > > > mm/userfaultfd.c | 62 +++++++++---------------------------------------
> > > > 2 files changed, 12 insertions(+), 72 deletions(-)
> > > >
> > > > diff --git a/mm/huge_memory.c b/mm/huge_memory.c
> > > > index 5acca24bbabb..f444c142a8be 100644
> > > > --- a/mm/huge_memory.c
> > > > +++ b/mm/huge_memory.c
> > > > @@ -2533,7 +2533,6 @@ int move_pages_huge_pmd(struct mm_struct *mm, pmd_t *dst_pmd, pmd_t *src_pmd, pm
> > > > pmd_t _dst_pmd, src_pmdval;
> > > > struct page *src_page;
> > > > struct folio *src_folio;
> > > > - struct anon_vma *src_anon_vma;
> > > > spinlock_t *src_ptl, *dst_ptl;
> > > > pgtable_t src_pgtable;
> > > > struct mmu_notifier_range range;
> > > > @@ -2582,23 +2581,9 @@ int move_pages_huge_pmd(struct mm_struct *mm, pmd_t *dst_pmd, pmd_t *src_pmd, pm
> > > > src_addr + HPAGE_PMD_SIZE);
> > > > mmu_notifier_invalidate_range_start(&range);
> > > >
> > > > - if (src_folio) {
> > > > + if (src_folio)
> > > > folio_lock(src_folio);
> > > >
> > > > - /*
> > > > - * split_huge_page walks the anon_vma chain without the page
> > > > - * lock. Serialize against it with the anon_vma lock, the page
> > > > - * lock is not enough.
> > > > - */
> > > > - src_anon_vma = folio_get_anon_vma(src_folio);
> > > > - if (!src_anon_vma) {
> > > > - err = -EAGAIN;
> > > > - goto unlock_folio;
> > > > - }
> > > > - anon_vma_lock_write(src_anon_vma);
> > > > - } else
> > > > - src_anon_vma = NULL;
> > > > -
> > >
> > > Hmm this seems an odd thing to include in the uffd change. Why not just include
> > > it in the last commit or as a separate commit?
>
> You're changing move_pages_huge_pmd() here in a patch that's about the uffd
> change; seems unrelated, no?
This function is a part of UFFDIO_MOVE only :) It handles the
huge-page case of UFFDIO_MOVE and there are no other callers. But let
me know if you would like this in a separate patch.
>
> >
> > I'm not sure I follow. What am I including here?
> >
> > BTW, IMHO, the comment is wrong here. folio split code already
> > acquires folio lock. The anon_vma lock is required here for the same
> > reason as non-large page case - to avoid concurrent rmap walks.
>
> This is called via split_huge_page() used by KSM and memory failure, not the
> usual folio split logic afaict.
>
> But those callers all take the folio lock afaict :)
Sorry, yes, that's what I meant. The real issue here is also
rmap_walk(), which is why the anon_vma lock was required, not what
is mentioned in the comment.
>
> So yeah the comment is wrong it seems!
* Re: [PATCH 1/2] mm: always call rmap_walk() on locked folios
2025-09-19 6:09 ` Lokesh Gidra
@ 2025-09-24 10:00 ` David Hildenbrand
2025-09-24 19:17 ` Lokesh Gidra
0 siblings, 1 reply; 26+ messages in thread
From: David Hildenbrand @ 2025-09-24 10:00 UTC (permalink / raw)
To: Lokesh Gidra
Cc: akpm, linux-mm, kaleshsingh, ngeoffray, jannh, Lorenzo Stoakes,
Harry Yoo, Peter Xu, Suren Baghdasaryan, Barry Song,
SeongJae Park, Dan Williams, Alistair Popple
>>
>> Is mf_generic_kill_procs()->collect_procs() problematic?
>>
>> The dax_lock_folio/dax_unlock_folio part confuses me: I think it only
>> locks an entry in the page cache but not the actual folio ("Lock the DAX
>> entry corresponding to a folio")?
> Yeah. The name dax_lock_folio() gives an impression as if the folio is
> locked but it isn't :)
Sorry for the late reply, I saw you posted v2 in the meantime.
>
> IIUC, a dax folio can't have an anon_vma (folio->mapping is actually
> an address_space instead of anon_vma), right?
We have these weird device-private dax folios that are anonymous and
should have the anon_vma set up.
> So, I thought it wasn't
> required to actually lock the folio in this case. Please let me know
> if you want me to still lock the folio around collect_procs(), or add
> a comment?
I think we can end up reaching memory_failure_dev_pagemap() with an
anonymous dax folio.
Not sure if anything would prevent us into calling
mf_generic_kill_procs()->collect_procs()->collect_procs_anon()->folio_lock_anon_vma_read()
CCing Dan and Alistair
--
Cheers
David / dhildenb
* Re: [PATCH 1/2] mm: always call rmap_walk() on locked folios
2025-09-24 10:00 ` David Hildenbrand
@ 2025-09-24 19:17 ` Lokesh Gidra
2025-09-25 11:06 ` David Hildenbrand
0 siblings, 1 reply; 26+ messages in thread
From: Lokesh Gidra @ 2025-09-24 19:17 UTC (permalink / raw)
To: David Hildenbrand
Cc: akpm, linux-mm, kaleshsingh, ngeoffray, jannh, Lorenzo Stoakes,
Harry Yoo, Peter Xu, Suren Baghdasaryan, Barry Song,
SeongJae Park, Dan Williams, Alistair Popple
On Wed, Sep 24, 2025 at 3:00 AM David Hildenbrand <david@redhat.com> wrote:
>
> >>
> >> Is mf_generic_kill_procs()->collect_procs() problematic?
> >>
> >> The dax_lock_folio/dax_unlock_folio part confuses me: I think it only
> >> locks an entry in the page cache but not the actual folio ("Lock the DAX
> >> entry corresponding to a folio")?
> > Yeah. The name dax_lock_folio() gives an impression as if the folio is
> > locked but it isn't :)
>
> Sorry for the late reply, I saw you posted v2 in the meantime.
>
> >
> > IIUC, a dax folio can't have an anon_vma (folio->mapping is actually
> > an address_space instead of anon_vma), right?
>
> We have these weird device-private dax folios that are anonymous and
> should have the anon_vma set up.
>
> > So, I thought it wasn't
> > required to actually lock the folio in this case. Please let me know
> > if you want me to still lock the folio around collect_procs(), or add
> > a comment?
>
> I think we can end up reaching memory_failure_dev_pagemap() with an
> anonymous dax folio.
>
> Not sure if anything would prevent us into calling
>
> mf_generic_kill_procs()->collect_procs()->collect_procs_anon()->folio_lock_anon_vma_read()
>
I must be missing something but dax_lock_folio() dereferences
folio->mapping (to get to host) without checking for
FOLIO_MAPPING_FLAGS presence. If it were an anon folio, wouldn't that
be a problem? And then in collect_procs() we obviously check for
folio_test_anon() on the same folio before calling
collect_procs_anon().
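For reference, here is a minimal sketch of the distinction collect_procs()
relies on, mirroring the folio_anon_vma()/folio_lock_anon_vma_read() logic
quoted earlier (the helper name is only illustrative):

#include <linux/mm.h>
#include <linux/rmap.h>

/*
 * folio->mapping either points to an address_space (file/dax folios) or,
 * with FOLIO_MAPPING_ANON set in its low bits, to an anon_vma.  Code that
 * blindly dereferences it as an address_space would misread an anon
 * folio's anon_vma pointer.
 */
static struct anon_vma *anon_vma_of_sketch(const struct folio *folio)
{
        unsigned long mapping = (unsigned long)READ_ONCE(folio->mapping);

        if ((mapping & FOLIO_MAPPING_FLAGS) != FOLIO_MAPPING_ANON)
                return NULL;    /* file or dax mapping, not anonymous */
        return (struct anon_vma *)(mapping - FOLIO_MAPPING_ANON);
}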
> CCing Dan and Alistair
>
> --
> Cheers
>
> David / dhildenb
>
* Re: [PATCH 1/2] mm: always call rmap_walk() on locked folios
2025-09-24 19:17 ` Lokesh Gidra
@ 2025-09-25 11:06 ` David Hildenbrand
2025-10-02 6:46 ` Lokesh Gidra
0 siblings, 1 reply; 26+ messages in thread
From: David Hildenbrand @ 2025-09-25 11:06 UTC (permalink / raw)
To: Lokesh Gidra, Dan Williams, Alistair Popple
Cc: akpm, linux-mm, kaleshsingh, ngeoffray, jannh, Lorenzo Stoakes,
Harry Yoo, Peter Xu, Suren Baghdasaryan, Barry Song,
SeongJae Park
On 24.09.25 21:17, Lokesh Gidra wrote:
> On Wed, Sep 24, 2025 at 3:00 AM David Hildenbrand <david@redhat.com> wrote:
>>
>>>>
>>>> Is mf_generic_kill_procs()->collect_procs() problematic?
>>>>
>>>> The dax_lock_folio/dax_unlock_folio part confuses me: I think it only
>>>> locks an entry in the page cache but not the actual folio ("Lock the DAX
>>>> entry corresponding to a folio")?
>>> Yeah. The name dax_lock_folio() gives an impression as if the folio is
>>> locked but it isn't :)
>>
>> Sorry for the late reply, I saw you posted v2 in the meantime.
>>
>>>
>>> IIUC, a dax folio can't have an anon_vma (folio->mapping is actually
>>> an address_space instead of anon_vma), right?
>>
>> We have these weird device-private dax folios that are anonymous and
>> should have the anon_vma set up.
>>
>>> So, I thought it wasn't
>>> required to actually lock the folio in this case. Please let me know
>>> if you want me to still lock the folio around collect_procs(), or add
>>> a comment?
>>
>> I think we can end up reaching memory_failure_dev_pagemap() with an
>> anonymous dax folio.
>>
>> Not sure if anything would prevent us into calling
>>
>> mf_generic_kill_procs()->collect_procs()->collect_procs_anon()->folio_lock_anon_vma_read()
>>
> I must be missing something but dax_lock_folio() dereferences
Maybe the code is missing something :)
> folio->mapping (to get to host) without checking for
> FOLIO_MAPPING_FLAGS presence. If it were an anon folio, wouldn't that
> be a problem? And then in collect_procs() we obviously check for
> folio_test_anon() on the same folio before calling
> collect_procs_anon().
Right, if we reach dax_lock_folio() with an anon folio we are probably
already messing things up? Not sure if the "!dax_mapping(mapping)" would
save us, likely not, because it would be an anon_vma.
Hopefully Dan+Alistair have a clue how this works and if it already
works as expected with device-private anon folios.
--
Cheers
David / dhildenb
* Re: [PATCH 1/2] mm: always call rmap_walk() on locked folios
2025-09-25 11:06 ` David Hildenbrand
@ 2025-10-02 6:46 ` Lokesh Gidra
2025-10-02 7:22 ` David Hildenbrand
0 siblings, 1 reply; 26+ messages in thread
From: Lokesh Gidra @ 2025-10-02 6:46 UTC (permalink / raw)
To: David Hildenbrand
Cc: Dan Williams, Alistair Popple, akpm, linux-mm, kaleshsingh,
ngeoffray, jannh, Lorenzo Stoakes, Harry Yoo, Peter Xu,
Suren Baghdasaryan, Barry Song, SeongJae Park
On Thu, Sep 25, 2025 at 4:07 AM David Hildenbrand <david@redhat.com> wrote:
>
> On 24.09.25 21:17, Lokesh Gidra wrote:
> > On Wed, Sep 24, 2025 at 3:00 AM David Hildenbrand <david@redhat.com> wrote:
> >>
> >>>>
> >>>> Is mf_generic_kill_procs()->collect_procs() problematic?
> >>>>
> >>>> The dax_lock_folio/dax_unlock_folio part confuses me: I think it only
> >>>> locks an entry in the page cache but not the actual folio ("Lock the DAX
> >>>> entry corresponding to a folio")?
> >>> Yeah. The name dax_lock_folio() gives an impression as if the folio is
> >>> locked but it isn't :)
> >>
> >> Sorry for the late reply, I saw you posted v2 in the meantime.
> >>
> >>>
> >>> IIUC, a dax folio can't have an anon_vma (folio->mapping is actually
> >>> an address_space instead of anon_vma), right?
> >>
> >> We have these weird device-private dax folios that are anonymous and
> >> should have the anon_vma set up.
> >>
> >>> So, I thought it wasn't
> >>> required to actually lock the folio in this case. Please let me know
> >>> if you want me to still lock the folio around collect_procs(), or add
> >>> a comment?
> >>
> >> I think we can end up reaching memory_failure_dev_pagemap() with an
> >> anonymous dax folio.
> >>
> >> Not sure if anything would prevent us into calling
> >>
> >> mf_generic_kill_procs()->collect_procs()->collect_procs_anon()->folio_lock_anon_vma_read()
> >>
> > I must be missing something but dax_lock_folio() dereferences
>
> Maybe the code is missing something :)
>
> > folio->mapping (to get to host) without checking for
> > FOLIO_MAPPING_FLAGS presence. If it were an anon folio, wouldn't that
> > be a problem? And then in collect_procs() we obviously check for
> > folio_test_anon() on the same folio before calling
> > collect_procs_anon().
>
> Right, if we reach dax_lock_folio() with an anon folio we are probably
> already messing things up? Not sure if the "!dax_mapping(mapping)" would
> save us, likely not, because it would be an anon_vma.
>
> Hopefully Dan+Alistair have a clue how this works and if it already
> works as expected with device-private anon folios.
>
Hi David,
Any suggestion how to make progress? Should I upload v3 with
mf_generic_kill_procs()->collect_procs() in folio-lock critical
section?
> --
> Cheers
>
> David / dhildenb
>
* Re: [PATCH 1/2] mm: always call rmap_walk() on locked folios
2025-10-02 6:46 ` Lokesh Gidra
@ 2025-10-02 7:22 ` David Hildenbrand
2025-10-02 7:48 ` Lokesh Gidra
2025-10-03 23:02 ` Peter Xu
0 siblings, 2 replies; 26+ messages in thread
From: David Hildenbrand @ 2025-10-02 7:22 UTC (permalink / raw)
To: Lokesh Gidra, Dan Williams, Alistair Popple
Cc: akpm, linux-mm, kaleshsingh, ngeoffray, jannh, Lorenzo Stoakes,
Harry Yoo, Peter Xu, Suren Baghdasaryan, Barry Song,
SeongJae Park
On 02.10.25 08:46, Lokesh Gidra wrote:
> On Thu, Sep 25, 2025 at 4:07 AM David Hildenbrand <david@redhat.com> wrote:
>>
>> On 24.09.25 21:17, Lokesh Gidra wrote:
>>> On Wed, Sep 24, 2025 at 3:00 AM David Hildenbrand <david@redhat.com> wrote:
>>>>
>>>>>>
>>>>>> Is mf_generic_kill_procs()->collect_procs() problematic?
>>>>>>
>>>>>> The dax_lock_folio/dax_unlock_folio part confuses me: I think it only
>>>>>> locks an entry in the page cache but not the actual folio ("Lock the DAX
>>>>>> entry corresponding to a folio")?
>>>>> Yeah. The name dax_lock_folio() gives an impression as if the folio is
>>>>> locked but it isn't :)
>>>>
>>>> Sorry for the late reply, I saw you posted v2 in the meantime.
>>>>
>>>>>
>>>>> IIUC, a dax folio can't have an anon_vma (folio->mapping is actually
>>>>> an address_space instead of anon_vma), right?
>>>>
>>>> We have these weird device-private dax folios that are anonymous and
>>>> should have the anon_vma set up.
>>>>
>>>>> So, I thought it wasn't
>>>>> required to actually lock the folio in this case. Please let me know
>>>>> if you want me to still lock the folio around collect_procs(), or add
>>>>> a comment?
>>>>
>>>> I think we can end up reaching memory_failure_dev_pagemap() with an
>>>> anonymous dax folio.
>>>>
>>>> Not sure if anything would prevent us into calling
>>>>
>>>> mf_generic_kill_procs()->collect_procs()->collect_procs_anon()->folio_lock_anon_vma_read()
>>>>
>>> I must be missing something but dax_lock_folio() dereferences
>>
>> Maybe the code is missing something :)
>>
>>> folio->mapping (to get to host) without checking for
>>> FOLIO_MAPPING_FLAGS presence. If it were an anon folio, wouldn't that
>>> be a problem? And then in collect_procs() we obviously check for
>>> folio_test_anon() on the same folio before calling
>>> collect_procs_anon().
>>
>> Right, if we reach dax_lock_folio() with an anon folio we are probably
>> already messing things up? Not sure if the "!dax_mapping(mapping)" would
>> save us, likely not, because it would be an anon_vma.
>>
>> Hopefully Dan+Alistair have a clue how this works and if it already
>> works as expected with device-private anon folios.
>>
> Hi David,
>
> Any suggestion how to make progress? Should I upload v3 with
> mf_generic_kill_procs()->collect_procs() in folio-lock critical
> section?
Staring at mf_generic_kill_procs() once again, we have this
switch (pgmap->type) {
case MEMORY_DEVICE_PRIVATE:
case MEMORY_DEVICE_COHERENT:
rc = -ENXIO;
goto unlock;
...
}
IIRC, all anon device folios are MEMORY_DEVICE_PRIVATE, so we should
actually not succeed in this function and just abort.
We still call dax_lock_folio() earlier, which likely doesn't do the
right thing for anon device folios, but that's an existing issue.
So regarding your patch I think we're good!
--
Cheers
David / dhildenb
* Re: [PATCH 1/2] mm: always call rmap_walk() on locked folios
2025-10-02 7:22 ` David Hildenbrand
@ 2025-10-02 7:48 ` Lokesh Gidra
2025-10-03 23:02 ` Peter Xu
1 sibling, 0 replies; 26+ messages in thread
From: Lokesh Gidra @ 2025-10-02 7:48 UTC (permalink / raw)
To: David Hildenbrand
Cc: Dan Williams, Alistair Popple, akpm, linux-mm, kaleshsingh,
ngeoffray, jannh, Lorenzo Stoakes, Harry Yoo, Peter Xu,
Suren Baghdasaryan, Barry Song, SeongJae Park
On Thu, Oct 2, 2025 at 12:22 AM David Hildenbrand <david@redhat.com> wrote:
>
> On 02.10.25 08:46, Lokesh Gidra wrote:
> > On Thu, Sep 25, 2025 at 4:07 AM David Hildenbrand <david@redhat.com> wrote:
> >>
> >> On 24.09.25 21:17, Lokesh Gidra wrote:
> >>> On Wed, Sep 24, 2025 at 3:00 AM David Hildenbrand <david@redhat.com> wrote:
> >>>>
> >>>>>>
> >>>>>> Is mf_generic_kill_procs()->collect_procs() problematic?
> >>>>>>
> >>>>>> The dax_lock_folio/dax_unlock_folio part confuses me: I think it only
> >>>>>> locks an entry in the page cache but not the actual folio ("Lock the DAX
> >>>>>> entry corresponding to a folio")?
> >>>>> Yeah. The name dax_lock_folio() gives an impression as if the folio is
> >>>>> locked but it isn't :)
> >>>>
> >>>> Sorry for the late reply, I saw you posted v2 in the meantime.
> >>>>
> >>>>>
> >>>>> IIUC, a dax folio can't have an anon_vma (folio->mapping is actually
> >>>>> an address_space instead of anon_vma), right?
> >>>>
> >>>> We have these weird device-private dax folios that are anonymous and
> >>>> should have the anon_vma set up.
> >>>>
> >>>>> So, I thought it wasn't
> >>>>> required to actually lock the folio in this case. Please let me know
> >>>>> if you want me to still lock the folio around collect_procs(), or add
> >>>>> a comment?
> >>>>
> >>>> I think we can end up reaching memory_failure_dev_pagemap() with an
> >>>> anonymous dax folio.
> >>>>
> >>>> Not sure if anything would prevent us into calling
> >>>>
> >>>> mf_generic_kill_procs()->collect_procs()->collect_procs_anon()->folio_lock_anon_vma_read()
> >>>>
> >>> I must be missing something but dax_lock_folio() dereferences
> >>
> >> Maybe the code is missing something :)
> >>
> >>> folio->mapping (to get to host) without checking for
> >>> FOLIO_MAPPING_FLAGS presence. If it were an anon folio, wouldn't that
> >>> be a problem? And then in collect_procs() we obviously check for
> >>> folio_test_anon() on the same folio before calling
> >>> collect_procs_anon().
> >>
> >> Right, if we reach dax_lock_folio() with an anon folio we are probably
> >> already messing things up? Not sure if the "!dax_mapping(mapping)" would
> >> save us, likely not, because it would be an anon_vma.
> >>
> >> Hopefully Dan+Alistair have a clue how this works and if it already
> >> works as expected with device-private anon folios.
> >>
> > Hi David,
> >
> > Any suggestion how to make progress? Should I upload v3 with
> > mf_generic_kill_procs()->collect_procs() in folio-lock critical
> > section?
>
> Staring at mf_generic_kill_procs() once again, we have this
>
> switch (pgmap->type) {
> case MEMORY_DEVICE_PRIVATE:
> case MEMORY_DEVICE_COHERENT:
> rc = -ENXIO;
> goto unlock;
> ...
> }
>
> IIRC, all anon device folios are MEMORY_DEVICE_PRIVATE, so we should
> actually not succeed in this function and just abort.
>
> We still call dax_lock_folio() earlier, which likely doesn't do the
> right thing for anon device folios, but that's an existing issue.
>
> So regarding your patch I think we're good!
>
Awesome! Thanks so much for looking into this. Can you please
ack/review the v2's 1st patch also :-)
> --
> Cheers
>
> David / dhildenb
>
* Re: [PATCH 1/2] mm: always call rmap_walk() on locked folios
2025-10-02 7:22 ` David Hildenbrand
2025-10-02 7:48 ` Lokesh Gidra
@ 2025-10-03 23:02 ` Peter Xu
2025-10-06 6:43 ` David Hildenbrand
1 sibling, 1 reply; 26+ messages in thread
From: Peter Xu @ 2025-10-03 23:02 UTC (permalink / raw)
To: David Hildenbrand
Cc: Lokesh Gidra, Dan Williams, Alistair Popple, akpm, linux-mm,
kaleshsingh, ngeoffray, jannh, Lorenzo Stoakes, Harry Yoo,
Suren Baghdasaryan, Barry Song, SeongJae Park
On Thu, Oct 02, 2025 at 09:22:45AM +0200, David Hildenbrand wrote:
> Staring at mf_generic_kill_procs() once again, we have this
>
> switch (pgmap->type) {
> case MEMORY_DEVICE_PRIVATE:
> case MEMORY_DEVICE_COHERENT:
> rc = -ENXIO;
> goto unlock;
> ...
> }
>
> IIRC, all anon device folios are MEMORY_DEVICE_PRIVATE, so we should
> actually not succeed in this function and just abort.
The doc says:
* MEMORY_DEVICE_PRIVATE:
* Device memory that is not directly addressable by the CPU: CPU can neither
* read nor write private memory. In this case, we do still have struct pages
* backing the device memory. Doing so simplifies the implementation, but it is
* important to remember that there are certain points at which the struct page
* must be treated as an opaque object, rather than a "normal" struct page.
It does not refer to anonymous private memory, but to memory that the CPU
cannot access.
The only way to interpret this from reading dax_lock_folio() is that
folio->mapping should normally point to a real dax address_space and
never to an anon_vma; otherwise dax_mapping() does look improper indeed.
So I also agree with the conclusion so far; it looks like it should be fine to
ignore the dax path for this problem.
Thanks,
--
Peter Xu
* Re: [PATCH 1/2] mm: always call rmap_walk() on locked folios
2025-10-03 23:02 ` Peter Xu
@ 2025-10-06 6:43 ` David Hildenbrand
2025-10-06 19:49 ` Peter Xu
0 siblings, 1 reply; 26+ messages in thread
From: David Hildenbrand @ 2025-10-06 6:43 UTC (permalink / raw)
To: Peter Xu
Cc: Lokesh Gidra, Dan Williams, Alistair Popple, akpm, linux-mm,
kaleshsingh, ngeoffray, jannh, Lorenzo Stoakes, Harry Yoo,
Suren Baghdasaryan, Barry Song, SeongJae Park
On 04.10.25 01:02, Peter Xu wrote:
> On Thu, Oct 02, 2025 at 09:22:45AM +0200, David Hildenbrand wrote:
>> Staring at mf_generic_kill_procs() once again, we have this
>>
>> switch (pgmap->type) {
>> case MEMORY_DEVICE_PRIVATE:
>> case MEMORY_DEVICE_COHERENT:
>> rc = -ENXIO;
>> goto unlock;
>> ...
>> }
>>
>> IIRC, all anon device folios are MEMORY_DEVICE_PRIVATE, so we should
>> actually not succeed in this function and just abort.
>
> The doc says:
>
> * MEMORY_DEVICE_PRIVATE:
> * Device memory that is not directly addressable by the CPU: CPU can neither
> * read nor write private memory. In this case, we do still have struct pages
> * backing the device memory. Doing so simplifies the implementation, but it is
> * important to remember that there are certain points at which the struct page
> * must be treated as an opaque object, rather than a "normal" struct page.
>
> It does not read as describing anon private memories, but rather something
> that the CPU cannot access..
I would not trust the docs that much ;)
We establish device-private folios when a driver requests to migrate an
ordinary (non-zone-device) anonymous folio to device-private memory
(e.g., due to a device page fault).
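Roughly, the driver side looks something like the sketch below. It follows
the documented migrate_vma_* API, but the function name and parameters are
placeholders, error handling is minimal, and the device-specific
allocation/copy step is only hinted at in a comment:

	static int drv_migrate_to_device(struct vm_area_struct *vma,
					 unsigned long start, unsigned long end,
					 unsigned long *src_pfns,
					 unsigned long *dst_pfns,
					 void *drv_owner)
	{
		struct migrate_vma args = {
			.vma		= vma,
			.start		= start,
			.end		= end,
			.src		= src_pfns,
			.dst		= dst_pfns,
			.pgmap_owner	= drv_owner,
			.flags		= MIGRATE_VMA_SELECT_SYSTEM,
		};
		int ret;

		ret = migrate_vma_setup(&args);	/* isolate + unmap the anon folios */
		if (ret)
			return ret;

		/*
		 * Driver-specific part: allocate device-private pages, copy the
		 * data over, and fill args.dst[] with migrate_pfn() values.
		 */

		migrate_vma_pages(&args);	/* install device-private entries */
		migrate_vma_finalize(&args);	/* or restore originals on failure */
		return 0;
	}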
When the CPU re-accesses that memory, we spot a device-private entry and
trigger migration back from device-private memory to ordinary
(CPU-accessible) RAM.
The latter is triggered in do_swap_page() where we handle
is_device_private_entry() to call pgmap->ops->migrate_to_ram(vmf);
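Stripped of locking and refcounting, that path boils down to something like
this (sketch only; the function name is made up, and I'm passing the
dev_pagemap in explicitly to sidestep how it is looked up from the page):

	static vm_fault_t handle_device_private_sketch(struct vm_fault *vmf,
						       struct dev_pagemap *pgmap)
	{
		swp_entry_t entry = pte_to_swp_entry(vmf->orig_pte);

		if (!is_device_private_entry(entry))
			return 0;	/* normal swap-in handling continues */

		/* The folio the CPU touched lives in device memory. */
		vmf->page = pfn_swap_entry_to_page(entry);

		/* Ask the driver to migrate it back to CPU-addressable RAM. */
		return pgmap->ops->migrate_to_ram(vmf);
	}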
For migration and everything around that to work, obviously the
device-private page also must be anonymous.
It's a bit more obvious in copy_nonpresent_pte(), where we call
folio_try_dup_anon_rmap_pte() for is_device_private_entry(), because
these are just anonymous folios.
--
Cheers
David / dhildenb
* Re: [PATCH 1/2] mm: always call rmap_walk() on locked folios
2025-10-06 6:43 ` David Hildenbrand
@ 2025-10-06 19:49 ` Peter Xu
2025-10-06 20:02 ` David Hildenbrand
0 siblings, 1 reply; 26+ messages in thread
From: Peter Xu @ 2025-10-06 19:49 UTC (permalink / raw)
To: David Hildenbrand
Cc: Lokesh Gidra, Dan Williams, Alistair Popple, akpm, linux-mm,
kaleshsingh, ngeoffray, jannh, Lorenzo Stoakes, Harry Yoo,
Suren Baghdasaryan, Barry Song, SeongJae Park
On Mon, Oct 06, 2025 at 08:43:17AM +0200, David Hildenbrand wrote:
> On 04.10.25 01:02, Peter Xu wrote:
> > On Thu, Oct 02, 2025 at 09:22:45AM +0200, David Hildenbrand wrote:
> > > Staring at mf_generic_kill_procs() once again, we have this
> > >
> > > switch (pgmap->type) {
> > > case MEMORY_DEVICE_PRIVATE:
> > > case MEMORY_DEVICE_COHERENT:
> > > rc = -ENXIO;
> > > goto unlock;
> > > ...
> > > }
> > >
> > > IIRC, all anon device folios are MEMORY_DEVICE_PRIVATE, so we should
> > > actually not succeed in this function and just abort.
> >
> > The doc says:
> >
> > * MEMORY_DEVICE_PRIVATE:
> > * Device memory that is not directly addressable by the CPU: CPU can neither
> > * read nor write private memory. In this case, we do still have struct pages
> > * backing the device memory. Doing so simplifies the implementation, but it is
> > * important to remember that there are certain points at which the struct page
> > * must be treated as an opaque object, rather than a "normal" struct page.
> >
> > It does not read as describing anon private memories, but rather something
> > that the CPU cannot access..
>
> I would not trust the docs that much ;)
>
> We establish device-private folios when a driver requests to migrate an
> ordinary (non-zone-device) anonymous folio to device-private memory (e.g.,
> due to a device page fault).
>
> When the CPU re-accesses that memory, we spot a device-private entry and
> trigger migration back from device-private memory to ordinary
> (CPU-accessible) RAM.
>
> The latter is triggered in do_swap_page() where we handle
> is_device_private_entry() to call pgmap->ops->migrate_to_ram(vmf);
>
> For migration and everything around that to work, obviously the
> device-private page also must be anonymous.
>
> It's a bit more obvious in copy_nonpresent_pte(), where we call
> folio_try_dup_anon_rmap_pte() for is_device_private_entry(), because these
> are just anonymous folios.
Ah, I actually didn't expect folio_test_anon() would return true for
ZONE_DEVICE folios that replaced previously private anon folios..
To me, that was only some housekeeping for the ZONE_DEVICE folios,
remembering the mapping/index/... somewhere so that when they get migrated
back to private anon we know how to recover these things. That doesn't
mean it's a "real" private anon folio, does it? After all, it's a ZONE_DEVICE
folio, and the memmap does not map 1:1 to a RAM backend.
But I confess I'm not familiar with these. Fortunately, I think either way
it doesn't seem to affect the design of the series being reviewed.
Thanks,
--
Peter Xu
* Re: [PATCH 1/2] mm: always call rmap_walk() on locked folios
2025-10-06 19:49 ` Peter Xu
@ 2025-10-06 20:02 ` David Hildenbrand
2025-10-06 20:50 ` Peter Xu
0 siblings, 1 reply; 26+ messages in thread
From: David Hildenbrand @ 2025-10-06 20:02 UTC (permalink / raw)
To: Peter Xu
Cc: Lokesh Gidra, Dan Williams, Alistair Popple, akpm, linux-mm,
kaleshsingh, ngeoffray, jannh, Lorenzo Stoakes, Harry Yoo,
Suren Baghdasaryan, Barry Song, SeongJae Park
On 06.10.25 21:49, Peter Xu wrote:
> On Mon, Oct 06, 2025 at 08:43:17AM +0200, David Hildenbrand wrote:
>> On 04.10.25 01:02, Peter Xu wrote:
>>> On Thu, Oct 02, 2025 at 09:22:45AM +0200, David Hildenbrand wrote:
>>>> Staring at mf_generic_kill_procs() once again, we have this
>>>>
>>>> switch (pgmap->type) {
>>>> case MEMORY_DEVICE_PRIVATE:
>>>> case MEMORY_DEVICE_COHERENT:
>>>> rc = -ENXIO;
>>>> goto unlock;
>>>> ...
>>>> }
>>>>
>>>> IIRC, all anon device folios are MEMORY_DEVICE_PRIVATE, so we should
>>>> actually not succeed in this function and just abort.
>>>
>>> The doc says:
>>>
>>> * MEMORY_DEVICE_PRIVATE:
>>> * Device memory that is not directly addressable by the CPU: CPU can neither
>>> * read nor write private memory. In this case, we do still have struct pages
>>> * backing the device memory. Doing so simplifies the implementation, but it is
>>> * important to remember that there are certain points at which the struct page
>>> * must be treated as an opaque object, rather than a "normal" struct page.
>>>
>>> It does not read as describing anon private memories, but rather something
>>> that the CPU cannot access..
>>
>> I would not trust the docs that much ;)
>>
>> We establish device-private folios when a driver requests to migrate an
>> ordinary (non-zone-device) anonymous folio to device-private memory (e.g.,
>> due to a device page fault).
>>
>> When the CPU re-accesses that memory, we spot a device-private entry and
>> trigger migration back from device-private memory to ordinary
>> (CPU-accessible) RAM.
>>
>> The latter is triggered in do_swap_page() where we handle
>> is_device_private_entry() to call pgmap->ops->migrate_to_ram(vmf);
>>
>> For migration and everything around that to work, obviously the
>> device-private page also must be anonymous.
>>
>> It's a bit more obvious in copy_nonpresent_pte(), where we call
>> folio_try_dup_anon_rmap_pte() for is_device_private_entry(), because these
>> are just anonymous folios.
>
> Ah, I actually didn't expect folio_test_anon() would return true for
> ZONE_DEVICE folios that replaced previously private anon folios..
>
> To me, that was only some housekeeping for the ZONE_DEVICE folios,
> remembering the mapping/index/... somewhere so that when they get migrated
> back to private anon we know how to recover these things. That doesn't
> mean it's a "real" private anon folio, does it?
Just like other anonymous folios we can walk the rmap to unmap/migrate
them. And just like anonymous folios, they are not added to a page cache.
COW-sharing etc. works the very same way for them.
So I think they are really just anonymous folios representing a chunk of
device memory that is not addressable by the CPU. Apart from that,
nothing should be too special about them (well, except the way they are
mapped into page tables of course).
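To make that last point a bit more concrete: the folio is referenced from
the page table via a non-present, swap-style device-private entry rather
than a normal present PTE, roughly like the sketch below (function name
made up, details simplified):

	static pte_t device_private_pte_sketch(struct page *page, bool writable)
	{
		swp_entry_t entry;

		if (writable)
			entry = make_writable_device_private_entry(page_to_pfn(page));
		else
			entry = make_readable_device_private_entry(page_to_pfn(page));

		/* Non-present entry: a CPU access faults into migrate_to_ram(). */
		return swp_entry_to_pte(entry);
	}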
> After all, it's a ZONE_DEVICE folio, and the
> memmap does not map 1:1 to a RAM backend.
>
> But I confess I'm not familiar with these. Fortunately, I think either way
> it doesn't seem to affect the design of the series being reviewed.
Yes.
--
Cheers
David / dhildenb
* Re: [PATCH 1/2] mm: always call rmap_walk() on locked folios
2025-10-06 20:02 ` David Hildenbrand
@ 2025-10-06 20:50 ` Peter Xu
0 siblings, 0 replies; 26+ messages in thread
From: Peter Xu @ 2025-10-06 20:50 UTC (permalink / raw)
To: David Hildenbrand
Cc: Lokesh Gidra, Dan Williams, Alistair Popple, akpm, linux-mm,
kaleshsingh, ngeoffray, jannh, Lorenzo Stoakes, Harry Yoo,
Suren Baghdasaryan, Barry Song, SeongJae Park
On Mon, Oct 06, 2025 at 10:02:45PM +0200, David Hildenbrand wrote:
> Just like other anonymous folios we can walk the rmap to unmap/migrate them.
> And just like anonymous folios, they are not added to a page cache.
>
> COW-sharing etc. works the very same way for them.
I agree. After all, many of the folio fields are reused here.
>
> So I think they are really just anonymous folios representing a chunk of
> device memory that is not addressable by the CPU. Apart from that, nothing
> should be too special about them (well, except the way they are mapped into
> page tables of course).
I think they're still very special, though.. not only because they're
essentially swap entries, but also because I don't strictly think they're
really "anonymous", per the definition of normal anon private folios..
After all, they are closely attached to whatever device driver is providing
the service under the hood. They don't really have a page cache, but they
kind of have one maintained in the device model..
The "private" side is accurate and uncontroversial.
Thanks,
--
Peter Xu
* Re: [PATCH 1/2] mm: always call rmap_walk() on locked folios
2025-09-19 5:45 ` Lokesh Gidra
2025-09-19 9:59 ` Lorenzo Stoakes
@ 2025-11-03 14:58 ` Lorenzo Stoakes
2025-11-03 15:46 ` Lokesh Gidra
1 sibling, 1 reply; 26+ messages in thread
From: Lorenzo Stoakes @ 2025-11-03 14:58 UTC (permalink / raw)
To: Lokesh Gidra
Cc: akpm, linux-mm, kaleshsingh, ngeoffray, jannh, David Hildenbrand,
Harry Yoo, Peter Xu, Suren Baghdasaryan, Barry Song,
SeongJae Park
On Thu, Sep 18, 2025 at 10:45:21PM -0700, Lokesh Gidra wrote:
> On Thu, Sep 18, 2025 at 4:57 AM Lorenzo Stoakes
> <lorenzo.stoakes@oracle.com> wrote:
> >
> > On Wed, Sep 17, 2025 at 10:51:34PM -0700, Lokesh Gidra wrote:
> > > Guarantee that rmap_walk() is called on locked folios so that threads
> > > changing folio->mapping and folio->index for non-KSM anon folios can
> > > serialize on fine-grained folio lock rather than anon_vma lock. Other
> > > folio types are already always locked before rmap_walk().
> >
> > Be good to explain why you're doing certain things, like adding the folio
> > lock to kill_procs_now().
>
> Agreed! I'll add in the next version.
Hi Lokesh,
I realise this was trivial, but it triggered me to hold off on further review on
the basis you'd send another version.
Are you actually planning to do so? As it seems set to merge without me having
completed my review.
I'll try to get to reviewing the rest of this, I just happened to have a look
and wondered why my tags weren't there...
You may as well hold off until I finish review anyway.
Thanks, Lorenzo
* Re: [PATCH 1/2] mm: always call rmap_walk() on locked folios
2025-11-03 14:58 ` Lorenzo Stoakes
@ 2025-11-03 15:46 ` Lokesh Gidra
2025-11-03 16:38 ` Lorenzo Stoakes
0 siblings, 1 reply; 26+ messages in thread
From: Lokesh Gidra @ 2025-11-03 15:46 UTC (permalink / raw)
To: Lorenzo Stoakes
Cc: akpm, linux-mm, kaleshsingh, ngeoffray, jannh, David Hildenbrand,
Harry Yoo, Peter Xu, Suren Baghdasaryan, Barry Song,
SeongJae Park
On Mon, Nov 3, 2025 at 6:59 AM Lorenzo Stoakes
<lorenzo.stoakes@oracle.com> wrote:
>
> On Thu, Sep 18, 2025 at 10:45:21PM -0700, Lokesh Gidra wrote:
> > On Thu, Sep 18, 2025 at 4:57 AM Lorenzo Stoakes
> > <lorenzo.stoakes@oracle.com> wrote:
> > >
> > > On Wed, Sep 17, 2025 at 10:51:34PM -0700, Lokesh Gidra wrote:
> > > > Guarantee that rmap_walk() is called on locked folios so that threads
> > > > changing folio->mapping and folio->index for non-KSM anon folios can
> > > > serialize on fine-grained folio lock rather than anon_vma lock. Other
> > > > folio types are already always locked before rmap_walk().
> > >
> > > Be good to explain why you're doing certain things, like adding the folio
> > > lock to kill_procs_now().
> >
> > Agreed! I'll add in the next version.
>
> Hi Lokesh,
>
> I realise this was trivial, but it triggered me to hold off on further review on
> the basis you'd send another version.
>
> Are you actually planning to do so? As it seems set to merge without me having
> completed my review.
>
Hi Lorenzo,
I think somehow the v2 slipped under your radar. I think I addressed all
your comments in v2
(https://lore.kernel.org/all/20250923071019.775806-1-lokeshgidra@google.com/).
Please let me know if I missed anything.
> I'll try to get to reviewing the rest of this, I just happened to have a look
> and wondered why my tags weren't there...
>
I was actually wondering why you hadn't responded to v2 at all :) I
look forward to you reviewing it. It would be best to have your tags
on it.
Thanks, Lokesh
> You may as well hold off until I finish review anyway.
>
> Thanks, Lorenzo
* Re: [PATCH 1/2] mm: always call rmap_walk() on locked folios
2025-11-03 15:46 ` Lokesh Gidra
@ 2025-11-03 16:38 ` Lorenzo Stoakes
0 siblings, 0 replies; 26+ messages in thread
From: Lorenzo Stoakes @ 2025-11-03 16:38 UTC (permalink / raw)
To: Lokesh Gidra
Cc: akpm, linux-mm, kaleshsingh, ngeoffray, jannh, David Hildenbrand,
Harry Yoo, Peter Xu, Suren Baghdasaryan, Barry Song,
SeongJae Park
On Mon, Nov 03, 2025 at 07:46:57AM -0800, Lokesh Gidra wrote:
> On Mon, Nov 3, 2025 at 6:59 AM Lorenzo Stoakes
> <lorenzo.stoakes@oracle.com> wrote:
> >
> > On Thu, Sep 18, 2025 at 10:45:21PM -0700, Lokesh Gidra wrote:
> > > On Thu, Sep 18, 2025 at 4:57 AM Lorenzo Stoakes
> > > <lorenzo.stoakes@oracle.com> wrote:
> > > >
> > > > On Wed, Sep 17, 2025 at 10:51:34PM -0700, Lokesh Gidra wrote:
> > > > > Guarantee that rmap_walk() is called on locked folios so that threads
> > > > > changing folio->mapping and folio->index for non-KSM anon folios can
> > > > > serialize on fine-grained folio lock rather than anon_vma lock. Other
> > > > > folio types are already always locked before rmap_walk().
> > > >
> > > > Be good to explain why you're doing certain things, like adding the folio
> > > > lock to kill_procs_now().
> > >
> > > Agreed! I'll add in the next version.
> >
> > Hi Lokesh,
> >
> > I realise this was trivial, but it triggered me to hold off on further review on
> > the basis you'd send another version.
> >
> > Are you actually planning to do so? As it seems set to merge without me having
> > completed my review.
> >
> Hi Lorenzo,
>
> I think somehow the v2 slipped under your radar. I think I addressed all
> your comments in v2
> (https://lore.kernel.org/all/20250923071019.775806-1-lokeshgidra@google.com/).
> Please let me know if I missed anything.
Doh!
Clearly. Sorry about that, this is my bad.
I have no idea how I missed that...!
>
> > I'll try to get to reviewing the rest of this, I just happened to have a look
> > and wondered why my tags weren't there...
> >
> I was actually wondering why you hadn't responded to v2 at all :) I
> look forward to you reviewing it. It would be best to have your tags
> on it.
Yeah, this obviously was not on purpose :)
Will take a look at that!
>
> Thanks, Lokesh
>
> > You may as well hold off until I finish review anyway.
> >
> > Thanks, Lorenzo
>
Cheers, Lorenzo
Thread overview: 26+ messages
2025-09-18 5:51 [PATCH 0/2] Improve UFFDIO_MOVE scalability by removing anon_vma lock Lokesh Gidra
2025-09-18 5:51 ` [PATCH 1/2] mm: always call rmap_walk() on locked folios Lokesh Gidra
2025-09-18 11:57 ` Lorenzo Stoakes
2025-09-19 5:45 ` Lokesh Gidra
2025-09-19 9:59 ` Lorenzo Stoakes
2025-11-03 14:58 ` Lorenzo Stoakes
2025-11-03 15:46 ` Lokesh Gidra
2025-11-03 16:38 ` Lorenzo Stoakes
2025-09-18 12:15 ` David Hildenbrand
2025-09-19 6:09 ` Lokesh Gidra
2025-09-24 10:00 ` David Hildenbrand
2025-09-24 19:17 ` Lokesh Gidra
2025-09-25 11:06 ` David Hildenbrand
2025-10-02 6:46 ` Lokesh Gidra
2025-10-02 7:22 ` David Hildenbrand
2025-10-02 7:48 ` Lokesh Gidra
2025-10-03 23:02 ` Peter Xu
2025-10-06 6:43 ` David Hildenbrand
2025-10-06 19:49 ` Peter Xu
2025-10-06 20:02 ` David Hildenbrand
2025-10-06 20:50 ` Peter Xu
2025-09-18 5:51 ` [PATCH 2/2] mm/userfaultfd: don't lock anon_vma when performing UFFDIO_MOVE Lokesh Gidra
2025-09-18 12:38 ` Lorenzo Stoakes
2025-09-19 6:30 ` Lokesh Gidra
2025-09-19 9:57 ` Lorenzo Stoakes
2025-09-19 18:34 ` Lokesh Gidra