[RFC PATCH 1/2] mm: always call rmap

linux-mm.kvack.org archive mirror
 help / color / mirror / Atom feed

* [RFC PATCH 1/2] mm: always call rmap_walk() on locked folios
@ 2025-09-08  4:49 Lokesh Gidra
  2025-09-08  4:49 ` [RFC PATCH 2/2] userfaultfd: remove anon-vma lock for moving folios in MOVE ioctl Lokesh Gidra
                   ` (3 more replies)
  0 siblings, 4 replies; 21+ messages in thread
From: Lokesh Gidra @ 2025-09-08  4:49 UTC (permalink / raw)
  To: akpm
  Cc: linux-mm, kaleshsingh, ngeoffray, Lokesh Gidra,
	David Hildenbrand, Lorenzo Stoakes, Harry Yoo, Peter Xu,
	Suren Baghdasaryan, Barry Song, SeongJae Park

Prior discussion about this can be found at [1].

rmap_walk() requires all folios, except non-KSM anon, to be locked. This
implies that when threads update folio->mapping to an anon_vma with
different root (currently only done by UFFDIO MOVE), they have to
serialize against rmap_walk() with write-lock on the anon_vma, hurting
scalability. Furthermore, this necessitates rechecking anon_vma when
pinning/locking an anon_vma (like in folio_lock_anon_vma_read()).

This can be simplified quite a bit by ensuring that rmap_walk() is
always called on locked folios. Among the few callers of rmap_walk() on
unlocked anon folios, shrink_active_list()->folio_referenced() is the
only performance critical one.

shrink_active_list() doesn't act differently depending on what
folio_referenced() returns for an anon folio. So returning 1 when it
is contended, like in case of other folio types, wouldn't have any
negative impact.

Furthermore, as David pointed out in the previous discussion [2], this
could potentially only affect R/O pages after fork as PG_anon_exclusive
is not set. But, such folios are already isolated (prior to calling
folio_referenced()) by grabbing a reference and clearing LRU, so
do_wp_page()->wp_can_reuse_anon_folio() would not reuse such folios
anyways.

[1] https://lore.kernel.org/all/CA+EESO4Z6wtX7ZMdDHQRe5jAAS_bQ-POq5+4aDx5jh2DvY6UHg@mail.gmail.com/
[2] https://lore.kernel.org/all/dc92aef8-757f-4432-923e-70d92d13fb37@redhat.com/

CC: David Hildenbrand <david@redhat.com>
CC: Lorenzo Stoakes <lorenzo.stoakes@oracle.com>
CC: Harry Yoo <harry.yoo@oracle.com>
CC: Peter Xu <peterx@redhat.com>
CC: Suren Baghdasaryan <surenb@google.com>
CC: Barry Song <baohua@kernel.org>
CC: SeongJae Park <sj@kernel.org>
Signed-off-by: Lokesh Gidra <lokeshgidra@google.com>
---
 mm/damon/ops-common.c | 16 ++++------------
 mm/page_idle.c        |  8 ++------
 mm/rmap.c             | 40 ++++++++++------------------------------
 3 files changed, 16 insertions(+), 48 deletions(-)

diff --git a/mm/damon/ops-common.c b/mm/damon/ops-common.c
index 998c5180a603..f61d6dde13dc 100644
--- a/mm/damon/ops-common.c
+++ b/mm/damon/ops-common.c
@@ -162,21 +162,17 @@ void damon_folio_mkold(struct folio *folio)
 		.rmap_one = damon_folio_mkold_one,
 		.anon_lock = folio_lock_anon_vma_read,
 	};
-	bool need_lock;
 
 	if (!folio_mapped(folio) || !folio_raw_mapping(folio)) {
 		folio_set_idle(folio);
 		return;
 	}
 
-	need_lock = !folio_test_anon(folio) || folio_test_ksm(folio);
-	if (need_lock && !folio_trylock(folio))
+	if (!folio_trylock(folio))
 		return;
 
 	rmap_walk(folio, &rwc);
-
-	if (need_lock)
-		folio_unlock(folio);
+	folio_unlock(folio);
 
 }
 
@@ -228,7 +224,6 @@ bool damon_folio_young(struct folio *folio)
 		.rmap_one = damon_folio_young_one,
 		.anon_lock = folio_lock_anon_vma_read,
 	};
-	bool need_lock;
 
 	if (!folio_mapped(folio) || !folio_raw_mapping(folio)) {
 		if (folio_test_idle(folio))
@@ -237,14 +232,11 @@ bool damon_folio_young(struct folio *folio)
 			return true;
 	}
 
-	need_lock = !folio_test_anon(folio) || folio_test_ksm(folio);
-	if (need_lock && !folio_trylock(folio))
+	if (!folio_trylock(folio))
 		return false;
 
 	rmap_walk(folio, &rwc);
-
-	if (need_lock)
-		folio_unlock(folio);
+	folio_unlock(folio);
 
 	return accessed;
 }
diff --git a/mm/page_idle.c b/mm/page_idle.c
index a82b340dc204..9bf573d22e87 100644
--- a/mm/page_idle.c
+++ b/mm/page_idle.c
@@ -101,19 +101,15 @@ static void page_idle_clear_pte_refs(struct folio *folio)
 		.rmap_one = page_idle_clear_pte_refs_one,
 		.anon_lock = folio_lock_anon_vma_read,
 	};
-	bool need_lock;
 
 	if (!folio_mapped(folio) || !folio_raw_mapping(folio))
 		return;
 
-	need_lock = !folio_test_anon(folio) || folio_test_ksm(folio);
-	if (need_lock && !folio_trylock(folio))
+	if (!folio_trylock(folio))
 		return;
 
 	rmap_walk(folio, &rwc);
-
-	if (need_lock)
-		folio_unlock(folio);
+	folio_unlock(folio);
 }
 
 static ssize_t page_idle_bitmap_read(struct file *file, struct kobject *kobj,
diff --git a/mm/rmap.c b/mm/rmap.c
index 34333ae3bd80..fc53f31434f4 100644
--- a/mm/rmap.c
+++ b/mm/rmap.c
@@ -489,17 +489,15 @@ void __init anon_vma_init(void)
  * if there is a mapcount, we can dereference the anon_vma after observing
  * those.
  *
- * NOTE: the caller should normally hold folio lock when calling this.  If
- * not, the caller needs to double check the anon_vma didn't change after
- * taking the anon_vma lock for either read or write (UFFDIO_MOVE can modify it
- * concurrently without folio lock protection). See folio_lock_anon_vma_read()
- * which has already covered that, and comment above remap_pages().
+ * NOTE: the caller should hold folio lock when calling this.
  */
 struct anon_vma *folio_get_anon_vma(const struct folio *folio)
 {
 	struct anon_vma *anon_vma = NULL;
 	unsigned long anon_mapping;
 
+	VM_WARN_ON_ONCE_FOLIO(!folio_test_locked(folio), folio);
+
 	rcu_read_lock();
 	anon_mapping = (unsigned long)READ_ONCE(folio->mapping);
 	if ((anon_mapping & FOLIO_MAPPING_FLAGS) != FOLIO_MAPPING_ANON)
@@ -546,7 +544,6 @@ struct anon_vma *folio_lock_anon_vma_read(const struct folio *folio,
 	struct anon_vma *root_anon_vma;
 	unsigned long anon_mapping;
 
-retry:
 	rcu_read_lock();
 	anon_mapping = (unsigned long)READ_ONCE(folio->mapping);
 	if ((anon_mapping & FOLIO_MAPPING_FLAGS) != FOLIO_MAPPING_ANON)
@@ -557,17 +554,6 @@ struct anon_vma *folio_lock_anon_vma_read(const struct folio *folio,
 	anon_vma = (struct anon_vma *) (anon_mapping - FOLIO_MAPPING_ANON);
 	root_anon_vma = READ_ONCE(anon_vma->root);
 	if (down_read_trylock(&root_anon_vma->rwsem)) {
-		/*
-		 * folio_move_anon_rmap() might have changed the anon_vma as we
-		 * might not hold the folio lock here.
-		 */
-		if (unlikely((unsigned long)READ_ONCE(folio->mapping) !=
-			     anon_mapping)) {
-			up_read(&root_anon_vma->rwsem);
-			rcu_read_unlock();
-			goto retry;
-		}
-
 		/*
 		 * If the folio is still mapped, then this anon_vma is still
 		 * its anon_vma, and holding the mutex ensures that it will
@@ -602,18 +588,6 @@ struct anon_vma *folio_lock_anon_vma_read(const struct folio *folio,
 	rcu_read_unlock();
 	anon_vma_lock_read(anon_vma);
 
-	/*
-	 * folio_move_anon_rmap() might have changed the anon_vma as we might
-	 * not hold the folio lock here.
-	 */
-	if (unlikely((unsigned long)READ_ONCE(folio->mapping) !=
-		     anon_mapping)) {
-		anon_vma_unlock_read(anon_vma);
-		put_anon_vma(anon_vma);
-		anon_vma = NULL;
-		goto retry;
-	}
-
 	if (atomic_dec_and_test(&anon_vma->refcount)) {
 		/*
 		 * Oops, we held the last refcount, release the lock
@@ -1005,7 +979,7 @@ int folio_referenced(struct folio *folio, int is_locked,
 	if (!folio_raw_mapping(folio))
 		return 0;
 
-	if (!is_locked && (!folio_test_anon(folio) || folio_test_ksm(folio))) {
+	if (!is_locked) {
 		we_locked = folio_trylock(folio);
 		if (!we_locked)
 			return 1;
@@ -2815,6 +2789,12 @@ static void rmap_walk_anon(struct folio *folio,
 	pgoff_t pgoff_start, pgoff_end;
 	struct anon_vma_chain *avc;
 
+	/*
+	 * The folio lock ensures that folio->mapping can be changed under us to
+	 * an anon_vma with different root, like UFFDIO MOVE.
+	 */
+	VM_BUG_ON_FOLIO(!folio_test_locked(folio), folio);
+
 	if (locked) {
 		anon_vma = folio_anon_vma(folio);
 		/* anon_vma disappear under us? */

base-commit: b024763926d2726978dff6588b81877d000159c1
-- 
2.51.0.355.g5224444f11-goog



^ permalink raw reply	[flat|nested] 21+ messages in thread

* [RFC PATCH 2/2] userfaultfd: remove anon-vma lock for moving folios in MOVE ioctl
  2025-09-08  4:49 [RFC PATCH 1/2] mm: always call rmap_walk() on locked folios Lokesh Gidra
@ 2025-09-08  4:49 ` Lokesh Gidra
  2025-09-11 20:07   ` Lorenzo Stoakes
  2025-09-12  9:15   ` David Hildenbrand
  2025-09-08 21:47 ` [RFC PATCH 1/2] mm: always call rmap_walk() on locked folios Barry Song
                   ` (2 subsequent siblings)
  3 siblings, 2 replies; 21+ messages in thread
From: Lokesh Gidra @ 2025-09-08  4:49 UTC (permalink / raw)
  To: akpm
  Cc: linux-mm, kaleshsingh, ngeoffray, Lokesh Gidra,
	David Hildenbrand, Lorenzo Stoakes, Peter Xu, Suren Baghdasaryan,
	Barry Song

Since rmap_walk() is now always called on locked anon folios, we don't
have to serialize on anon_vma lock when updating folio->mapping.

This helps avoid contention on src anon_vma when multiple threads are
simultaneously moving distinct pages from the same src vma.

CC: David Hildenbrand <david@redhat.com>
CC: Lorenzo Stoakes <lorenzo.stoakes@oracle.com>
CC: Peter Xu <peterx@redhat.com>
CC: Suren Baghdasaryan <surenb@google.com>
CC: Barry Song <baohua@kernel.org>
Signed-off-by: Lokesh Gidra <lokeshgidra@google.com>
---
 mm/huge_memory.c | 22 +----------------
 mm/userfaultfd.c | 62 +++++++++---------------------------------------
 2 files changed, 12 insertions(+), 72 deletions(-)

diff --git a/mm/huge_memory.c b/mm/huge_memory.c
index 26cedfcd7418..5cd3957f92d4 100644
--- a/mm/huge_memory.c
+++ b/mm/huge_memory.c
@@ -2533,7 +2533,6 @@ int move_pages_huge_pmd(struct mm_struct *mm, pmd_t *dst_pmd, pmd_t *src_pmd, pm
 	pmd_t _dst_pmd, src_pmdval;
 	struct page *src_page;
 	struct folio *src_folio;
-	struct anon_vma *src_anon_vma;
 	spinlock_t *src_ptl, *dst_ptl;
 	pgtable_t src_pgtable;
 	struct mmu_notifier_range range;
@@ -2582,23 +2581,9 @@ int move_pages_huge_pmd(struct mm_struct *mm, pmd_t *dst_pmd, pmd_t *src_pmd, pm
 				src_addr + HPAGE_PMD_SIZE);
 	mmu_notifier_invalidate_range_start(&range);
 
-	if (src_folio) {
+	if (src_folio)
 		folio_lock(src_folio);
 
-		/*
-		 * split_huge_page walks the anon_vma chain without the page
-		 * lock. Serialize against it with the anon_vma lock, the page
-		 * lock is not enough.
-		 */
-		src_anon_vma = folio_get_anon_vma(src_folio);
-		if (!src_anon_vma) {
-			err = -EAGAIN;
-			goto unlock_folio;
-		}
-		anon_vma_lock_write(src_anon_vma);
-	} else
-		src_anon_vma = NULL;
-
 	dst_ptl = pmd_lockptr(mm, dst_pmd);
 	double_pt_lock(src_ptl, dst_ptl);
 	if (unlikely(!pmd_same(*src_pmd, src_pmdval) ||
@@ -2643,11 +2628,6 @@ int move_pages_huge_pmd(struct mm_struct *mm, pmd_t *dst_pmd, pmd_t *src_pmd, pm
 	pgtable_trans_huge_deposit(mm, dst_pmd, src_pgtable);
 unlock_ptls:
 	double_pt_unlock(src_ptl, dst_ptl);
-	if (src_anon_vma) {
-		anon_vma_unlock_write(src_anon_vma);
-		put_anon_vma(src_anon_vma);
-	}
-unlock_folio:
 	/* unblock rmap walks */
 	if (src_folio)
 		folio_unlock(src_folio);
diff --git a/mm/userfaultfd.c b/mm/userfaultfd.c
index 50aaa8dcd24c..1a36760a36c7 100644
--- a/mm/userfaultfd.c
+++ b/mm/userfaultfd.c
@@ -1035,8 +1035,7 @@ static inline bool is_pte_pages_stable(pte_t *dst_pte, pte_t *src_pte,
  */
 static struct folio *check_ptes_for_batched_move(struct vm_area_struct *src_vma,
 						 unsigned long src_addr,
-						 pte_t *src_pte, pte_t *dst_pte,
-						 struct anon_vma *src_anon_vma)
+						 pte_t *src_pte, pte_t *dst_pte)
 {
 	pte_t orig_dst_pte, orig_src_pte;
 	struct folio *folio;
@@ -1052,8 +1051,7 @@ static struct folio *check_ptes_for_batched_move(struct vm_area_struct *src_vma,
 	folio = vm_normal_folio(src_vma, src_addr, orig_src_pte);
 	if (!folio || !folio_trylock(folio))
 		return NULL;
-	if (!PageAnonExclusive(&folio->page) || folio_test_large(folio) ||
-	    folio_anon_vma(folio) != src_anon_vma) {
+	if (!PageAnonExclusive(&folio->page) || folio_test_large(folio)) {
 		folio_unlock(folio);
 		return NULL;
 	}
@@ -1061,9 +1059,8 @@ static struct folio *check_ptes_for_batched_move(struct vm_area_struct *src_vma,
 }
 
 /*
- * Moves src folios to dst in a batch as long as they share the same
- * anon_vma as the first folio, are not large, and can successfully
- * take the lock via folio_trylock().
+ * Moves src folios to dst in a batch as long as they are not large, and can
+ * successfully take the lock via folio_trylock().
  */
 static long move_present_ptes(struct mm_struct *mm,
 			      struct vm_area_struct *dst_vma,
@@ -1073,8 +1070,7 @@ static long move_present_ptes(struct mm_struct *mm,
 			      pte_t orig_dst_pte, pte_t orig_src_pte,
 			      pmd_t *dst_pmd, pmd_t dst_pmdval,
 			      spinlock_t *dst_ptl, spinlock_t *src_ptl,
-			      struct folio **first_src_folio, unsigned long len,
-			      struct anon_vma *src_anon_vma)
+			      struct folio **first_src_folio, unsigned long len)
 {
 	int err = 0;
 	struct folio *src_folio = *first_src_folio;
@@ -1132,8 +1128,8 @@ static long move_present_ptes(struct mm_struct *mm,
 		src_pte++;
 
 		folio_unlock(src_folio);
-		src_folio = check_ptes_for_batched_move(src_vma, src_addr, src_pte,
-							dst_pte, src_anon_vma);
+		src_folio = check_ptes_for_batched_move(src_vma, src_addr,
+							src_pte, dst_pte);
 		if (!src_folio)
 			break;
 	}
@@ -1263,7 +1259,6 @@ static long move_pages_ptes(struct mm_struct *mm, pmd_t *dst_pmd, pmd_t *src_pmd
 	pmd_t dummy_pmdval;
 	pmd_t dst_pmdval;
 	struct folio *src_folio = NULL;
-	struct anon_vma *src_anon_vma = NULL;
 	struct mmu_notifier_range range;
 	long ret = 0;
 
@@ -1347,9 +1342,9 @@ static long move_pages_ptes(struct mm_struct *mm, pmd_t *dst_pmd, pmd_t *src_pmd
 		}
 
 		/*
-		 * Pin and lock both source folio and anon_vma. Since we are in
-		 * RCU read section, we can't block, so on contention have to
-		 * unmap the ptes, obtain the lock and retry.
+		 * Pin and lock source folio. Since we are in RCU read section,
+		 * we can't block, so on contention have to unmap the ptes,
+		 * obtain the lock and retry.
 		 */
 		if (!src_folio) {
 			struct folio *folio;
@@ -1423,33 +1418,11 @@ static long move_pages_ptes(struct mm_struct *mm, pmd_t *dst_pmd, pmd_t *src_pmd
 			goto retry;
 		}
 
-		if (!src_anon_vma) {
-			/*
-			 * folio_referenced walks the anon_vma chain
-			 * without the folio lock. Serialize against it with
-			 * the anon_vma lock, the folio lock is not enough.
-			 */
-			src_anon_vma = folio_get_anon_vma(src_folio);
-			if (!src_anon_vma) {
-				/* page was unmapped from under us */
-				ret = -EAGAIN;
-				goto out;
-			}
-			if (!anon_vma_trylock_write(src_anon_vma)) {
-				pte_unmap(src_pte);
-				pte_unmap(dst_pte);
-				src_pte = dst_pte = NULL;
-				/* now we can block and wait */
-				anon_vma_lock_write(src_anon_vma);
-				goto retry;
-			}
-		}
-
 		ret = move_present_ptes(mm, dst_vma, src_vma,
 					dst_addr, src_addr, dst_pte, src_pte,
 					orig_dst_pte, orig_src_pte, dst_pmd,
 					dst_pmdval, dst_ptl, src_ptl, &src_folio,
-					len, src_anon_vma);
+					len);
 	} else {
 		struct folio *folio = NULL;
 
@@ -1516,10 +1489,6 @@ static long move_pages_ptes(struct mm_struct *mm, pmd_t *dst_pmd, pmd_t *src_pmd
 	}
 
 out:
-	if (src_anon_vma) {
-		anon_vma_unlock_write(src_anon_vma);
-		put_anon_vma(src_anon_vma);
-	}
 	if (src_folio) {
 		folio_unlock(src_folio);
 		folio_put(src_folio);
@@ -1793,15 +1762,6 @@ static void uffd_move_unlock(struct vm_area_struct *dst_vma,
  * virtual regions without knowing if there are transparent hugepage
  * in the regions or not, but preventing the risk of having to split
  * the hugepmd during the remap.
- *
- * If there's any rmap walk that is taking the anon_vma locks without
- * first obtaining the folio lock (the only current instance is
- * folio_referenced), they will have to verify if the folio->mapping
- * has changed after taking the anon_vma lock. If it changed they
- * should release the lock and retry obtaining a new anon_vma, because
- * it means the anon_vma was changed by move_pages() before the lock
- * could be obtained. This is the only additional complexity added to
- * the rmap code to provide this anonymous page remapping functionality.
  */
 ssize_t move_pages(struct userfaultfd_ctx *ctx, unsigned long dst_start,
 		   unsigned long src_start, unsigned long len, __u64 mode)
-- 
2.51.0.355.g5224444f11-goog



^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: [RFC PATCH 1/2] mm: always call rmap_walk() on locked folios
  2025-09-08  4:49 [RFC PATCH 1/2] mm: always call rmap_walk() on locked folios Lokesh Gidra
  2025-09-08  4:49 ` [RFC PATCH 2/2] userfaultfd: remove anon-vma lock for moving folios in MOVE ioctl Lokesh Gidra
@ 2025-09-08 21:47 ` Barry Song
  2025-09-08 22:12   ` Lokesh Gidra
  2025-09-10 10:10 ` Harry Yoo
  2025-09-11 19:39 ` Lorenzo Stoakes
  3 siblings, 1 reply; 21+ messages in thread
From: Barry Song @ 2025-09-08 21:47 UTC (permalink / raw)
  To: Lokesh Gidra
  Cc: akpm, linux-mm, kaleshsingh, ngeoffray, David Hildenbrand,
	Lorenzo Stoakes, Harry Yoo, Peter Xu, Suren Baghdasaryan,
	SeongJae Park

On Mon, Sep 8, 2025 at 12:50 PM Lokesh Gidra <lokeshgidra@google.com> wrote:
>
> Prior discussion about this can be found at [1].
>
> rmap_walk() requires all folios, except non-KSM anon, to be locked. This
> implies that when threads update folio->mapping to an anon_vma with
> different root (currently only done by UFFDIO MOVE), they have to
> serialize against rmap_walk() with write-lock on the anon_vma, hurting
> scalability. Furthermore, this necessitates rechecking anon_vma when
> pinning/locking an anon_vma (like in folio_lock_anon_vma_read()).
>
> This can be simplified quite a bit by ensuring that rmap_walk() is
> always called on locked folios. Among the few callers of rmap_walk() on
> unlocked anon folios, shrink_active_list()->folio_referenced() is the
> only performance critical one.

As I understand it, shrink_inactive_list() also invokes folio_referenced().
Shouldn’t it be called just as often as shrink_active_list()?

>
> shrink_active_list() doesn't act differently depending on what
> folio_referenced() returns for an anon folio. So returning 1 when it
> is contended, like in case of other folio types, wouldn't have any
> negative impact.

A complaint was raised that the LRU could become slightly disordered:
https://lore.kernel.org/linux-mm/20240219141703.3851-1-lipeifeng@oppo.com/

We can re-test to confirm if this is the case.

Thanks
Barry


^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: [RFC PATCH 1/2] mm: always call rmap_walk() on locked folios
  2025-09-08 21:47 ` [RFC PATCH 1/2] mm: always call rmap_walk() on locked folios Barry Song
@ 2025-09-08 22:12   ` Lokesh Gidra
  2025-09-09  0:40     ` Barry Song
  0 siblings, 1 reply; 21+ messages in thread
From: Lokesh Gidra @ 2025-09-08 22:12 UTC (permalink / raw)
  To: Barry Song
  Cc: akpm, linux-mm, kaleshsingh, ngeoffray, David Hildenbrand,
	Lorenzo Stoakes, Harry Yoo, Peter Xu, Suren Baghdasaryan,
	SeongJae Park

On Mon, Sep 8, 2025 at 2:47 PM Barry Song <21cnbao@gmail.com> wrote:
>
> On Mon, Sep 8, 2025 at 12:50 PM Lokesh Gidra <lokeshgidra@google.com> wrote:
> >
> > Prior discussion about this can be found at [1].
> >
> > rmap_walk() requires all folios, except non-KSM anon, to be locked. This
> > implies that when threads update folio->mapping to an anon_vma with
> > different root (currently only done by UFFDIO MOVE), they have to
> > serialize against rmap_walk() with write-lock on the anon_vma, hurting
> > scalability. Furthermore, this necessitates rechecking anon_vma when
> > pinning/locking an anon_vma (like in folio_lock_anon_vma_read()).
> >
> > This can be simplified quite a bit by ensuring that rmap_walk() is
> > always called on locked folios. Among the few callers of rmap_walk() on
> > unlocked anon folios, shrink_active_list()->folio_referenced() is the
> > only performance critical one.
>
> As I understand it, shrink_inactive_list() also invokes folio_referenced().
> Shouldn’t it be called just as often as shrink_active_list()?

I'm only talking about those callers which call rmap_walk() without
locking anon folio. The
shrink_inactive_list()->folio_check_references()->folio_referenced()
path that you are talking about always locks the folio. So the
behavior in that case wouldn't change.
>
> >
> > shrink_active_list() doesn't act differently depending on what
> > folio_referenced() returns for an anon folio. So returning 1 when it
> > is contended, like in case of other folio types, wouldn't have any
> > negative impact.
>
> A complaint was raised that the LRU could become slightly disordered:
> https://lore.kernel.org/linux-mm/20240219141703.3851-1-lipeifeng@oppo.com/
>
> We can re-test to confirm if this is the case.
The patch in the link you provided is suggesting to control try-lock
for anon_vma lock. But here we are dealing with folio lock. Not sure
if the ordering issue will be there in this case.
>
> Thanks
> Barry


^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: [RFC PATCH 1/2] mm: always call rmap_walk() on locked folios
  2025-09-08 22:12   ` Lokesh Gidra
@ 2025-09-09  0:40     ` Barry Song
  2025-09-09  5:37       ` Lokesh Gidra
  0 siblings, 1 reply; 21+ messages in thread
From: Barry Song @ 2025-09-09  0:40 UTC (permalink / raw)
  To: Lokesh Gidra
  Cc: akpm, linux-mm, kaleshsingh, ngeoffray, David Hildenbrand,
	Lorenzo Stoakes, Harry Yoo, Peter Xu, Suren Baghdasaryan,
	SeongJae Park

On Tue, Sep 9, 2025 at 6:12 AM Lokesh Gidra <lokeshgidra@google.com> wrote:
>
> On Mon, Sep 8, 2025 at 2:47 PM Barry Song <21cnbao@gmail.com> wrote:
> >
> > On Mon, Sep 8, 2025 at 12:50 PM Lokesh Gidra <lokeshgidra@google.com> wrote:
> > >
> > > Prior discussion about this can be found at [1].
> > >
> > > rmap_walk() requires all folios, except non-KSM anon, to be locked. This
> > > implies that when threads update folio->mapping to an anon_vma with
> > > different root (currently only done by UFFDIO MOVE), they have to
> > > serialize against rmap_walk() with write-lock on the anon_vma, hurting
> > > scalability. Furthermore, this necessitates rechecking anon_vma when
> > > pinning/locking an anon_vma (like in folio_lock_anon_vma_read()).
> > >
> > > This can be simplified quite a bit by ensuring that rmap_walk() is
> > > always called on locked folios. Among the few callers of rmap_walk() on
> > > unlocked anon folios, shrink_active_list()->folio_referenced() is the
> > > only performance critical one.
> >
> > As I understand it, shrink_inactive_list() also invokes folio_referenced().
> > Shouldn’t it be called just as often as shrink_active_list()?
>
> I'm only talking about those callers which call rmap_walk() without
> locking anon folio. The
> shrink_inactive_list()->folio_check_references()->folio_referenced()
> path that you are talking about always locks the folio. So the
> behavior in that case wouldn't change.

Thanks for the clarification. Could you add a note about this if there
is a v2?

> >
> > >
> > > shrink_active_list() doesn't act differently depending on what
> > > folio_referenced() returns for an anon folio. So returning 1 when it
> > > is contended, like in case of other folio types, wouldn't have any
> > > negative impact.
> >
> > A complaint was raised that the LRU could become slightly disordered:
> > https://lore.kernel.org/linux-mm/20240219141703.3851-1-lipeifeng@oppo.com/
> >
> > We can re-test to confirm if this is the case.
> The patch in the link you provided is suggesting to control try-lock
> for anon_vma lock. But here we are dealing with folio lock. Not sure
> if the ordering issue will be there in this case.

Right. Not sure what percentage of folios will be contended; I assume
it is minor. Maybe you could share some data on this in a v2?

Thanks
Barry


^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: [RFC PATCH 1/2] mm: always call rmap_walk() on locked folios
  2025-09-09  0:40     ` Barry Song
@ 2025-09-09  5:37       ` Lokesh Gidra
  2025-09-09  5:51         ` Barry Song
  0 siblings, 1 reply; 21+ messages in thread
From: Lokesh Gidra @ 2025-09-09  5:37 UTC (permalink / raw)
  To: Barry Song
  Cc: akpm, linux-mm, kaleshsingh, ngeoffray, David Hildenbrand,
	Lorenzo Stoakes, Harry Yoo, Peter Xu, Suren Baghdasaryan,
	SeongJae Park

On Mon, Sep 8, 2025 at 5:40 PM Barry Song <21cnbao@gmail.com> wrote:
>
> On Tue, Sep 9, 2025 at 6:12 AM Lokesh Gidra <lokeshgidra@google.com> wrote:
> >
> > On Mon, Sep 8, 2025 at 2:47 PM Barry Song <21cnbao@gmail.com> wrote:
> > >
> > > On Mon, Sep 8, 2025 at 12:50 PM Lokesh Gidra <lokeshgidra@google.com> wrote:
> > > >
> > > > Prior discussion about this can be found at [1].
> > > >
> > > > rmap_walk() requires all folios, except non-KSM anon, to be locked. This
> > > > implies that when threads update folio->mapping to an anon_vma with
> > > > different root (currently only done by UFFDIO MOVE), they have to
> > > > serialize against rmap_walk() with write-lock on the anon_vma, hurting
> > > > scalability. Furthermore, this necessitates rechecking anon_vma when
> > > > pinning/locking an anon_vma (like in folio_lock_anon_vma_read()).
> > > >
> > > > This can be simplified quite a bit by ensuring that rmap_walk() is
> > > > always called on locked folios. Among the few callers of rmap_walk() on
> > > > unlocked anon folios, shrink_active_list()->folio_referenced() is the
> > > > only performance critical one.
> > >
> > > As I understand it, shrink_inactive_list() also invokes folio_referenced().
> > > Shouldn’t it be called just as often as shrink_active_list()?
> >
> > I'm only talking about those callers which call rmap_walk() without
> > locking anon folio. The
> > shrink_inactive_list()->folio_check_references()->folio_referenced()
> > path that you are talking about always locks the folio. So the
> > behavior in that case wouldn't change.
>
> Thanks for the clarification. Could you add a note about this if there
> is a v2?
>
Certainly, will do.
> > >
> > > >
> > > > shrink_active_list() doesn't act differently depending on what
> > > > folio_referenced() returns for an anon folio. So returning 1 when it
> > > > is contended, like in case of other folio types, wouldn't have any
> > > > negative impact.
> > >
> > > A complaint was raised that the LRU could become slightly disordered:
> > > https://lore.kernel.org/linux-mm/20240219141703.3851-1-lipeifeng@oppo.com/
> > >
> > > We can re-test to confirm if this is the case.
> > The patch in the link you provided is suggesting to control try-lock
> > for anon_vma lock. But here we are dealing with folio lock. Not sure
> > if the ordering issue will be there in this case.
>
> Right. Not sure what percentage of folios will be contended; I assume
> it is minor. Maybe you could share some data on this in a v2?

Any suggestion on how (or which test/benchmark) would be good to
gather data on this?

IIUC, shink_active_list() doesn't behave any differently whether there
is contention or not, except when it's an executable file folio. So I
doubt such data would be any useful. Please correct me if I'm wrong.
>
> Thanks
> Barry


^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: [RFC PATCH 1/2] mm: always call rmap_walk() on locked folios
  2025-09-09  5:37       ` Lokesh Gidra
@ 2025-09-09  5:51         ` Barry Song
  2025-09-09  5:56           ` Lokesh Gidra
  0 siblings, 1 reply; 21+ messages in thread
From: Barry Song @ 2025-09-09  5:51 UTC (permalink / raw)
  To: Lokesh Gidra
  Cc: akpm, linux-mm, kaleshsingh, ngeoffray, David Hildenbrand,
	Lorenzo Stoakes, Harry Yoo, Peter Xu, Suren Baghdasaryan,
	SeongJae Park

On Tue, Sep 9, 2025 at 1:37 PM Lokesh Gidra <lokeshgidra@google.com> wrote:
>
> On Mon, Sep 8, 2025 at 5:40 PM Barry Song <21cnbao@gmail.com> wrote:
> >
> > On Tue, Sep 9, 2025 at 6:12 AM Lokesh Gidra <lokeshgidra@google.com> wrote:
> > >
> > > On Mon, Sep 8, 2025 at 2:47 PM Barry Song <21cnbao@gmail.com> wrote:
> > > >
> > > > On Mon, Sep 8, 2025 at 12:50 PM Lokesh Gidra <lokeshgidra@google.com> wrote:
> > > > >
> > > > > Prior discussion about this can be found at [1].
> > > > >
> > > > > rmap_walk() requires all folios, except non-KSM anon, to be locked. This
> > > > > implies that when threads update folio->mapping to an anon_vma with
> > > > > different root (currently only done by UFFDIO MOVE), they have to
> > > > > serialize against rmap_walk() with write-lock on the anon_vma, hurting
> > > > > scalability. Furthermore, this necessitates rechecking anon_vma when
> > > > > pinning/locking an anon_vma (like in folio_lock_anon_vma_read()).
> > > > >
> > > > > This can be simplified quite a bit by ensuring that rmap_walk() is
> > > > > always called on locked folios. Among the few callers of rmap_walk() on
> > > > > unlocked anon folios, shrink_active_list()->folio_referenced() is the
> > > > > only performance critical one.
> > > >
> > > > As I understand it, shrink_inactive_list() also invokes folio_referenced().
> > > > Shouldn’t it be called just as often as shrink_active_list()?
> > >
> > > I'm only talking about those callers which call rmap_walk() without
> > > locking anon folio. The
> > > shrink_inactive_list()->folio_check_references()->folio_referenced()
> > > path that you are talking about always locks the folio. So the
> > > behavior in that case wouldn't change.
> >
> > Thanks for the clarification. Could you add a note about this if there
> > is a v2?
> >
> Certainly, will do.
> > > >
> > > > >
> > > > > shrink_active_list() doesn't act differently depending on what
> > > > > folio_referenced() returns for an anon folio. So returning 1 when it
> > > > > is contended, like in case of other folio types, wouldn't have any
> > > > > negative impact.
> > > >
> > > > A complaint was raised that the LRU could become slightly disordered:
> > > > https://lore.kernel.org/linux-mm/20240219141703.3851-1-lipeifeng@oppo.com/
> > > >
> > > > We can re-test to confirm if this is the case.
> > > The patch in the link you provided is suggesting to control try-lock
> > > for anon_vma lock. But here we are dealing with folio lock. Not sure
> > > if the ordering issue will be there in this case.
> >
> > Right. Not sure what percentage of folios will be contended; I assume
> > it is minor. Maybe you could share some data on this in a v2?
>
> Any suggestion on how (or which test/benchmark) would be good to
> gather data on this?
>
> IIUC, shink_active_list() doesn't behave any differently whether there
> is contention or not, except when it's an executable file folio. So I
> doubt such data would be any useful. Please correct me if I'm wrong.

Since we skipped clearing the PTE young bit in folio_referenced_one, a
cold page might be misidentified as hot during shrink_inactive_list().
My understanding is that as long as the percentage is small, this
shouldn't be a concern.

Thanks
Barry


^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: [RFC PATCH 1/2] mm: always call rmap_walk() on locked folios
  2025-09-09  5:51         ` Barry Song
@ 2025-09-09  5:56           ` Lokesh Gidra
  2025-09-09  6:01             ` Barry Song
  0 siblings, 1 reply; 21+ messages in thread
From: Lokesh Gidra @ 2025-09-09  5:56 UTC (permalink / raw)
  To: Barry Song
  Cc: akpm, linux-mm, kaleshsingh, ngeoffray, David Hildenbrand,
	Lorenzo Stoakes, Harry Yoo, Peter Xu, Suren Baghdasaryan,
	SeongJae Park

On Mon, Sep 8, 2025 at 10:52 PM Barry Song <21cnbao@gmail.com> wrote:
>
> On Tue, Sep 9, 2025 at 1:37 PM Lokesh Gidra <lokeshgidra@google.com> wrote:
> >
> > On Mon, Sep 8, 2025 at 5:40 PM Barry Song <21cnbao@gmail.com> wrote:
> > >
> > > On Tue, Sep 9, 2025 at 6:12 AM Lokesh Gidra <lokeshgidra@google.com> wrote:
> > > >
> > > > On Mon, Sep 8, 2025 at 2:47 PM Barry Song <21cnbao@gmail.com> wrote:
> > > > >
> > > > > On Mon, Sep 8, 2025 at 12:50 PM Lokesh Gidra <lokeshgidra@google.com> wrote:
> > > > > >
> > > > > > Prior discussion about this can be found at [1].
> > > > > >
> > > > > > rmap_walk() requires all folios, except non-KSM anon, to be locked. This
> > > > > > implies that when threads update folio->mapping to an anon_vma with
> > > > > > different root (currently only done by UFFDIO MOVE), they have to
> > > > > > serialize against rmap_walk() with write-lock on the anon_vma, hurting
> > > > > > scalability. Furthermore, this necessitates rechecking anon_vma when
> > > > > > pinning/locking an anon_vma (like in folio_lock_anon_vma_read()).
> > > > > >
> > > > > > This can be simplified quite a bit by ensuring that rmap_walk() is
> > > > > > always called on locked folios. Among the few callers of rmap_walk() on
> > > > > > unlocked anon folios, shrink_active_list()->folio_referenced() is the
> > > > > > only performance critical one.
> > > > >
> > > > > As I understand it, shrink_inactive_list() also invokes folio_referenced().
> > > > > Shouldn’t it be called just as often as shrink_active_list()?
> > > >
> > > > I'm only talking about those callers which call rmap_walk() without
> > > > locking anon folio. The
> > > > shrink_inactive_list()->folio_check_references()->folio_referenced()
> > > > path that you are talking about always locks the folio. So the
> > > > behavior in that case wouldn't change.
> > >
> > > Thanks for the clarification. Could you add a note about this if there
> > > is a v2?
> > >
> > Certainly, will do.
> > > > >
> > > > > >
> > > > > > shrink_active_list() doesn't act differently depending on what
> > > > > > folio_referenced() returns for an anon folio. So returning 1 when it
> > > > > > is contended, like in case of other folio types, wouldn't have any
> > > > > > negative impact.
> > > > >
> > > > > A complaint was raised that the LRU could become slightly disordered:
> > > > > https://lore.kernel.org/linux-mm/20240219141703.3851-1-lipeifeng@oppo.com/
> > > > >
> > > > > We can re-test to confirm if this is the case.
> > > > The patch in the link you provided is suggesting to control try-lock
> > > > for anon_vma lock. But here we are dealing with folio lock. Not sure
> > > > if the ordering issue will be there in this case.
> > >
> > > Right. Not sure what percentage of folios will be contended; I assume
> > > it is minor. Maybe you could share some data on this in a v2?
> >
> > Any suggestion on how (or which test/benchmark) would be good to
> > gather data on this?
> >
> > IIUC, shink_active_list() doesn't behave any differently whether there
> > is contention or not, except when it's an executable file folio. So I
> > doubt such data would be any useful. Please correct me if I'm wrong.
>
> Since we skipped clearing the PTE young bit in folio_referenced_one, a
> cold page might be misidentified as hot during shrink_inactive_list().
> My understanding is that as long as the percentage is small, this
> shouldn't be a concern.
I see. That makes a lot more sense why folio_referenced() is called on
all folios in shrink_active_list(). I missed that young bit clearing
before.

Any suggestions on a good testcase to gather this data?
>
> Thanks
> Barry


^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: [RFC PATCH 1/2] mm: always call rmap_walk() on locked folios
  2025-09-09  5:56           ` Lokesh Gidra
@ 2025-09-09  6:01             ` Barry Song
  2025-09-11 19:05               ` Lokesh Gidra
  0 siblings, 1 reply; 21+ messages in thread
From: Barry Song @ 2025-09-09  6:01 UTC (permalink / raw)
  To: Lokesh Gidra
  Cc: akpm, linux-mm, kaleshsingh, ngeoffray, David Hildenbrand,
	Lorenzo Stoakes, Harry Yoo, Peter Xu, Suren Baghdasaryan,
	SeongJae Park

On Tue, Sep 9, 2025 at 1:57 PM Lokesh Gidra <lokeshgidra@google.com> wrote:
>
> On Mon, Sep 8, 2025 at 10:52 PM Barry Song <21cnbao@gmail.com> wrote:
> >
> > On Tue, Sep 9, 2025 at 1:37 PM Lokesh Gidra <lokeshgidra@google.com> wrote:
> > >
> > > On Mon, Sep 8, 2025 at 5:40 PM Barry Song <21cnbao@gmail.com> wrote:
> > > >
> > > > On Tue, Sep 9, 2025 at 6:12 AM Lokesh Gidra <lokeshgidra@google.com> wrote:
> > > > >
> > > > > On Mon, Sep 8, 2025 at 2:47 PM Barry Song <21cnbao@gmail.com> wrote:
> > > > > >
> > > > > > On Mon, Sep 8, 2025 at 12:50 PM Lokesh Gidra <lokeshgidra@google.com> wrote:
> > > > > > >
> > > > > > > Prior discussion about this can be found at [1].
> > > > > > >
> > > > > > > rmap_walk() requires all folios, except non-KSM anon, to be locked. This
> > > > > > > implies that when threads update folio->mapping to an anon_vma with
> > > > > > > different root (currently only done by UFFDIO MOVE), they have to
> > > > > > > serialize against rmap_walk() with write-lock on the anon_vma, hurting
> > > > > > > scalability. Furthermore, this necessitates rechecking anon_vma when
> > > > > > > pinning/locking an anon_vma (like in folio_lock_anon_vma_read()).
> > > > > > >
> > > > > > > This can be simplified quite a bit by ensuring that rmap_walk() is
> > > > > > > always called on locked folios. Among the few callers of rmap_walk() on
> > > > > > > unlocked anon folios, shrink_active_list()->folio_referenced() is the
> > > > > > > only performance critical one.
> > > > > >
> > > > > > As I understand it, shrink_inactive_list() also invokes folio_referenced().
> > > > > > Shouldn’t it be called just as often as shrink_active_list()?
> > > > >
> > > > > I'm only talking about those callers which call rmap_walk() without
> > > > > locking anon folio. The
> > > > > shrink_inactive_list()->folio_check_references()->folio_referenced()
> > > > > path that you are talking about always locks the folio. So the
> > > > > behavior in that case wouldn't change.
> > > >
> > > > Thanks for the clarification. Could you add a note about this if there
> > > > is a v2?
> > > >
> > > Certainly, will do.
> > > > > >
> > > > > > >
> > > > > > > shrink_active_list() doesn't act differently depending on what
> > > > > > > folio_referenced() returns for an anon folio. So returning 1 when it
> > > > > > > is contended, like in case of other folio types, wouldn't have any
> > > > > > > negative impact.
> > > > > >
> > > > > > A complaint was raised that the LRU could become slightly disordered:
> > > > > > https://lore.kernel.org/linux-mm/20240219141703.3851-1-lipeifeng@oppo.com/
> > > > > >
> > > > > > We can re-test to confirm if this is the case.
> > > > > The patch in the link you provided is suggesting to control try-lock
> > > > > for anon_vma lock. But here we are dealing with folio lock. Not sure
> > > > > if the ordering issue will be there in this case.
> > > >
> > > > Right. Not sure what percentage of folios will be contended; I assume
> > > > it is minor. Maybe you could share some data on this in a v2?
> > >
> > > Any suggestion on how (or which test/benchmark) would be good to
> > > gather data on this?
> > >
> > > IIUC, shink_active_list() doesn't behave any differently whether there
> > > is contention or not, except when it's an executable file folio. So I
> > > doubt such data would be any useful. Please correct me if I'm wrong.
> >
> > Since we skipped clearing the PTE young bit in folio_referenced_one, a
> > cold page might be misidentified as hot during shrink_inactive_list().
> > My understanding is that as long as the percentage is small, this
> > shouldn't be a concern.
> I see. That makes a lot more sense why folio_referenced() is called on
> all folios in shrink_active_list(). I missed that young bit clearing
> before.
>
> Any suggestions on a good testcase to gather this data?

I would run monkey for a few hours with some debug counters, e.g. how
many folios pass through shrink_active_list(), how many get contended
and moved to the inactive list without clearing the young bit. If the
percentage is small, we can just ignore this disordered LRU behavior.

Thanks
Barry


^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: [RFC PATCH 1/2] mm: always call rmap_walk() on locked folios
  2025-09-08  4:49 [RFC PATCH 1/2] mm: always call rmap_walk() on locked folios Lokesh Gidra
  2025-09-08  4:49 ` [RFC PATCH 2/2] userfaultfd: remove anon-vma lock for moving folios in MOVE ioctl Lokesh Gidra
  2025-09-08 21:47 ` [RFC PATCH 1/2] mm: always call rmap_walk() on locked folios Barry Song
@ 2025-09-10 10:10 ` Harry Yoo
  2025-09-10 15:33   ` Lokesh Gidra
  2025-09-12  3:29   ` Miaohe Lin
  2025-09-11 19:39 ` Lorenzo Stoakes
  3 siblings, 2 replies; 21+ messages in thread
From: Harry Yoo @ 2025-09-10 10:10 UTC (permalink / raw)
  To: Lokesh Gidra
  Cc: akpm, linux-mm, kaleshsingh, ngeoffray, David Hildenbrand,
	Lorenzo Stoakes, Peter Xu, Suren Baghdasaryan, Barry Song,
	SeongJae Park, Miaohe Lin, Naoya Horiguchi

On Sun, Sep 07, 2025 at 09:49:49PM -0700, Lokesh Gidra wrote:
> Prior discussion about this can be found at [1].
> 
> rmap_walk() requires all folios, except non-KSM anon, to be locked. This
> implies that when threads update folio->mapping to an anon_vma with
> different root (currently only done by UFFDIO MOVE), they have to
> serialize against rmap_walk() with write-lock on the anon_vma, hurting
> scalability. Furthermore, this necessitates rechecking anon_vma when
> pinning/locking an anon_vma (like in folio_lock_anon_vma_read()).
> 
> This can be simplified quite a bit by ensuring that rmap_walk() is
> always called on locked folios. Among the few callers of rmap_walk() on
> unlocked anon folios, shrink_active_list()->folio_referenced() is the
> only performance critical one.
> 
> shrink_active_list() doesn't act differently depending on what
> folio_referenced() returns for an anon folio. So returning 1 when it
> is contended, like in case of other folio types, wouldn't have any
> negative impact.
> 
> Furthermore, as David pointed out in the previous discussion [2], this
> could potentially only affect R/O pages after fork as PG_anon_exclusive
> is not set. But, such folios are already isolated (prior to calling
> folio_referenced()) by grabbing a reference and clearing LRU, so
> do_wp_page()->wp_can_reuse_anon_folio() would not reuse such folios
> anyways.
> 
> [1] https://lore.kernel.org/all/CA*EESO4Z6wtX7ZMdDHQRe5jAAS_bQ-POq5*4aDx5jh2DvY6UHg@mail.gmail.com
> [2] https://lore.kernel.org/all/dc92aef8-757f-4432-923e-70d92d13fb37@redhat.com
> 
> CC: David Hildenbrand <david@redhat.com>
> CC: Lorenzo Stoakes <lorenzo.stoakes@oracle.com>
> CC: Harry Yoo <harry.yoo@oracle.com>
> CC: Peter Xu <peterx@redhat.com>
> CC: Suren Baghdasaryan <surenb@google.com>
> CC: Barry Song <baohua@kernel.org>
> CC: SeongJae Park <sj@kernel.org>
> Signed-off-by: Lokesh Gidra <lokeshgidra@google.com>
> ---
>  mm/damon/ops-common.c | 16 ++++------------
>  mm/page_idle.c        |  8 ++------
>  mm/rmap.c             | 40 ++++++++++------------------------------
>  3 files changed, 16 insertions(+), 48 deletions(-)
> 
> @@ -557,17 +554,6 @@ struct anon_vma *folio_lock_anon_vma_read(const struct folio *folio,
>  	anon_vma = (struct anon_vma *) (anon_mapping - FOLIO_MAPPING_ANON);
>  	root_anon_vma = READ_ONCE(anon_vma->root);
>  	if (down_read_trylock(&root_anon_vma->rwsem)) {
> -		/*
> -		 * folio_move_anon_rmap() might have changed the anon_vma as we
> -		 * might not hold the folio lock here.
> -		 */
> -		if (unlikely((unsigned long)READ_ONCE(folio->mapping) !=
> -			     anon_mapping)) {
> -			up_read(&root_anon_vma->rwsem);
> -			rcu_read_unlock();
> -			goto retry;
> -		}
> -

folio_lock_anon_vma_read() can be called without folio lock in a path:
memory_failure() -> kill_procs_now() -> collect_procs() ->
collect_procs_anon().

Not sure why collect_procs_{anon,ksm,file,fsdax} do not use rmap_walk()
functionality :/

Should we take folio lock before calling kill_procs_now() in
memory_failure()?

-- 
Cheers,
Harry / Hyeonggon


^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: [RFC PATCH 1/2] mm: always call rmap_walk() on locked folios
  2025-09-10 10:10 ` Harry Yoo
@ 2025-09-10 15:33   ` Lokesh Gidra
  2025-09-11  8:40     ` Harry Yoo
  2025-09-12  3:29   ` Miaohe Lin
  1 sibling, 1 reply; 21+ messages in thread
From: Lokesh Gidra @ 2025-09-10 15:33 UTC (permalink / raw)
  To: Harry Yoo
  Cc: akpm, linux-mm, kaleshsingh, ngeoffray, David Hildenbrand,
	Lorenzo Stoakes, Peter Xu, Suren Baghdasaryan, Barry Song,
	SeongJae Park, Miaohe Lin, Naoya Horiguchi

On Wed, Sep 10, 2025 at 3:10 AM Harry Yoo <harry.yoo@oracle.com> wrote:
>
> On Sun, Sep 07, 2025 at 09:49:49PM -0700, Lokesh Gidra wrote:
> > Prior discussion about this can be found at [1].
> >
> > rmap_walk() requires all folios, except non-KSM anon, to be locked. This
> > implies that when threads update folio->mapping to an anon_vma with
> > different root (currently only done by UFFDIO MOVE), they have to
> > serialize against rmap_walk() with write-lock on the anon_vma, hurting
> > scalability. Furthermore, this necessitates rechecking anon_vma when
> > pinning/locking an anon_vma (like in folio_lock_anon_vma_read()).
> >
> > This can be simplified quite a bit by ensuring that rmap_walk() is
> > always called on locked folios. Among the few callers of rmap_walk() on
> > unlocked anon folios, shrink_active_list()->folio_referenced() is the
> > only performance critical one.
> >
> > shrink_active_list() doesn't act differently depending on what
> > folio_referenced() returns for an anon folio. So returning 1 when it
> > is contended, like in case of other folio types, wouldn't have any
> > negative impact.
> >
> > Furthermore, as David pointed out in the previous discussion [2], this
> > could potentially only affect R/O pages after fork as PG_anon_exclusive
> > is not set. But, such folios are already isolated (prior to calling
> > folio_referenced()) by grabbing a reference and clearing LRU, so
> > do_wp_page()->wp_can_reuse_anon_folio() would not reuse such folios
> > anyways.
> >
> > [1] https://lore.kernel.org/all/CA*EESO4Z6wtX7ZMdDHQRe5jAAS_bQ-POq5*4aDx5jh2DvY6UHg@mail.gmail.com
> > [2] https://lore.kernel.org/all/dc92aef8-757f-4432-923e-70d92d13fb37@redhat.com
> >
> > CC: David Hildenbrand <david@redhat.com>
> > CC: Lorenzo Stoakes <lorenzo.stoakes@oracle.com>
> > CC: Harry Yoo <harry.yoo@oracle.com>
> > CC: Peter Xu <peterx@redhat.com>
> > CC: Suren Baghdasaryan <surenb@google.com>
> > CC: Barry Song <baohua@kernel.org>
> > CC: SeongJae Park <sj@kernel.org>
> > Signed-off-by: Lokesh Gidra <lokeshgidra@google.com>
> > ---
> >  mm/damon/ops-common.c | 16 ++++------------
> >  mm/page_idle.c        |  8 ++------
> >  mm/rmap.c             | 40 ++++++++++------------------------------
> >  3 files changed, 16 insertions(+), 48 deletions(-)
> >
> > @@ -557,17 +554,6 @@ struct anon_vma *folio_lock_anon_vma_read(const struct folio *folio,
> >       anon_vma = (struct anon_vma *) (anon_mapping - FOLIO_MAPPING_ANON);
> >       root_anon_vma = READ_ONCE(anon_vma->root);
> >       if (down_read_trylock(&root_anon_vma->rwsem)) {
> > -             /*
> > -              * folio_move_anon_rmap() might have changed the anon_vma as we
> > -              * might not hold the folio lock here.
> > -              */
> > -             if (unlikely((unsigned long)READ_ONCE(folio->mapping) !=
> > -                          anon_mapping)) {
> > -                     up_read(&root_anon_vma->rwsem);
> > -                     rcu_read_unlock();
> > -                     goto retry;
> > -             }
> > -
>
> folio_lock_anon_vma_read() can be called without folio lock in a path:
> memory_failure() -> kill_procs_now() -> collect_procs() ->
> collect_procs_anon().
>
Thanks for catching this. Fell off the cracks for me.

> Not sure why collect_procs_{anon,ksm,file,fsdax} do not use rmap_walk()
> functionality :/
>
> Should we take folio lock before calling kill_procs_now() in
> memory_failure()?

To me it seems minimal (and sufficient) to put the collect_procs()
call in kill_procs_now() inside folio lock's critical section.
>
> --
> Cheers,
> Harry / Hyeonggon


^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: [RFC PATCH 1/2] mm: always call rmap_walk() on locked folios
  2025-09-10 15:33   ` Lokesh Gidra
@ 2025-09-11  8:40     ` Harry Yoo
  0 siblings, 0 replies; 21+ messages in thread
From: Harry Yoo @ 2025-09-11  8:40 UTC (permalink / raw)
  To: Lokesh Gidra
  Cc: akpm, linux-mm, kaleshsingh, ngeoffray, David Hildenbrand,
	Lorenzo Stoakes, Peter Xu, Suren Baghdasaryan, Barry Song,
	SeongJae Park, Miaohe Lin, Naoya Horiguchi

On Wed, Sep 10, 2025 at 08:33:54AM -0700, Lokesh Gidra wrote:
> On Wed, Sep 10, 2025 at 3:10 AM Harry Yoo <harry.yoo@oracle.com> wrote:
> >
> > On Sun, Sep 07, 2025 at 09:49:49PM -0700, Lokesh Gidra wrote:
> > > Prior discussion about this can be found at [1].
> > >
> > > rmap_walk() requires all folios, except non-KSM anon, to be locked. This
> > > implies that when threads update folio->mapping to an anon_vma with
> > > different root (currently only done by UFFDIO MOVE), they have to
> > > serialize against rmap_walk() with write-lock on the anon_vma, hurting
> > > scalability. Furthermore, this necessitates rechecking anon_vma when
> > > pinning/locking an anon_vma (like in folio_lock_anon_vma_read()).
> > >
> > > This can be simplified quite a bit by ensuring that rmap_walk() is
> > > always called on locked folios. Among the few callers of rmap_walk() on
> > > unlocked anon folios, shrink_active_list()->folio_referenced() is the
> > > only performance critical one.
> > >
> > > shrink_active_list() doesn't act differently depending on what
> > > folio_referenced() returns for an anon folio. So returning 1 when it
> > > is contended, like in case of other folio types, wouldn't have any
> > > negative impact.
> > >
> > > Furthermore, as David pointed out in the previous discussion [2], this
> > > could potentially only affect R/O pages after fork as PG_anon_exclusive
> > > is not set. But, such folios are already isolated (prior to calling
> > > folio_referenced()) by grabbing a reference and clearing LRU, so
> > > do_wp_page()->wp_can_reuse_anon_folio() would not reuse such folios
> > > anyways.
> > >
> > > [1] https://lore.kernel.org/all/CA*EESO4Z6wtX7ZMdDHQRe5jAAS_bQ-POq5*4aDx5jh2DvY6UHg@mail.gmail.com
> > > [2] https://lore.kernel.org/all/dc92aef8-757f-4432-923e-70d92d13fb37@redhat.com
> > >
> > > CC: David Hildenbrand <david@redhat.com>
> > > CC: Lorenzo Stoakes <lorenzo.stoakes@oracle.com>
> > > CC: Harry Yoo <harry.yoo@oracle.com>
> > > CC: Peter Xu <peterx@redhat.com>
> > > CC: Suren Baghdasaryan <surenb@google.com>
> > > CC: Barry Song <baohua@kernel.org>
> > > CC: SeongJae Park <sj@kernel.org>
> > > Signed-off-by: Lokesh Gidra <lokeshgidra@google.com>
> > > ---
> > >  mm/damon/ops-common.c | 16 ++++------------
> > >  mm/page_idle.c        |  8 ++------
> > >  mm/rmap.c             | 40 ++++++++++------------------------------
> > >  3 files changed, 16 insertions(+), 48 deletions(-)
> > >
> > > @@ -557,17 +554,6 @@ struct anon_vma *folio_lock_anon_vma_read(const struct folio *folio,
> > >       anon_vma = (struct anon_vma *) (anon_mapping - FOLIO_MAPPING_ANON);
> > >       root_anon_vma = READ_ONCE(anon_vma->root);
> > >       if (down_read_trylock(&root_anon_vma->rwsem)) {
> > > -             /*
> > > -              * folio_move_anon_rmap() might have changed the anon_vma as we
> > > -              * might not hold the folio lock here.
> > > -              */
> > > -             if (unlikely((unsigned long)READ_ONCE(folio->mapping) !=
> > > -                          anon_mapping)) {
> > > -                     up_read(&root_anon_vma->rwsem);
> > > -                     rcu_read_unlock();
> > > -                     goto retry;
> > > -             }
> > > -
> >
> > folio_lock_anon_vma_read() can be called without folio lock in a path:
> > memory_failure() -> kill_procs_now() -> collect_procs() ->
> > collect_procs_anon().
> >
>
> Thanks for catching this. Fell off the cracks for me.

No problem ;)

> > Not sure why collect_procs_{anon,ksm,file,fsdax} do not use rmap_walk()
> > functionality :/
> >
> > Should we take folio lock before calling kill_procs_now() in
> > memory_failure()?
> 
> To me it seems minimal (and sufficient) to put the collect_procs()
> call in kill_procs_now() inside folio lock's critical section.

Sounds sufficient to me.

-- 
Cheers,
Harry / Hyeonggon


^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: [RFC PATCH 1/2] mm: always call rmap_walk() on locked folios
  2025-09-09  6:01             ` Barry Song
@ 2025-09-11 19:05               ` Lokesh Gidra
  2025-09-12  5:10                 ` Barry Song
  0 siblings, 1 reply; 21+ messages in thread
From: Lokesh Gidra @ 2025-09-11 19:05 UTC (permalink / raw)
  To: Barry Song
  Cc: akpm, linux-mm, kaleshsingh, ngeoffray, David Hildenbrand,
	Lorenzo Stoakes, Harry Yoo, Peter Xu, Suren Baghdasaryan,
	SeongJae Park

On Mon, Sep 8, 2025 at 11:01 PM Barry Song <21cnbao@gmail.com> wrote:
>
> On Tue, Sep 9, 2025 at 1:57 PM Lokesh Gidra <lokeshgidra@google.com> wrote:
> >
> > On Mon, Sep 8, 2025 at 10:52 PM Barry Song <21cnbao@gmail.com> wrote:
> > >
> > > On Tue, Sep 9, 2025 at 1:37 PM Lokesh Gidra <lokeshgidra@google.com> wrote:
> > > >
> > > > On Mon, Sep 8, 2025 at 5:40 PM Barry Song <21cnbao@gmail.com> wrote:
> > > > >
> > > > > On Tue, Sep 9, 2025 at 6:12 AM Lokesh Gidra <lokeshgidra@google.com> wrote:
> > > > > >
> > > > > > On Mon, Sep 8, 2025 at 2:47 PM Barry Song <21cnbao@gmail.com> wrote:
> > > > > > >
> > > > > > > On Mon, Sep 8, 2025 at 12:50 PM Lokesh Gidra <lokeshgidra@google.com> wrote:
> > > > > > > >
> > > > > > > > Prior discussion about this can be found at [1].
> > > > > > > >
> > > > > > > > rmap_walk() requires all folios, except non-KSM anon, to be locked. This
> > > > > > > > implies that when threads update folio->mapping to an anon_vma with
> > > > > > > > different root (currently only done by UFFDIO MOVE), they have to
> > > > > > > > serialize against rmap_walk() with write-lock on the anon_vma, hurting
> > > > > > > > scalability. Furthermore, this necessitates rechecking anon_vma when
> > > > > > > > pinning/locking an anon_vma (like in folio_lock_anon_vma_read()).
> > > > > > > >
> > > > > > > > This can be simplified quite a bit by ensuring that rmap_walk() is
> > > > > > > > always called on locked folios. Among the few callers of rmap_walk() on
> > > > > > > > unlocked anon folios, shrink_active_list()->folio_referenced() is the
> > > > > > > > only performance critical one.
> > > > > > >
> > > > > > > As I understand it, shrink_inactive_list() also invokes folio_referenced().
> > > > > > > Shouldn’t it be called just as often as shrink_active_list()?
> > > > > >
> > > > > > I'm only talking about those callers which call rmap_walk() without
> > > > > > locking anon folio. The
> > > > > > shrink_inactive_list()->folio_check_references()->folio_referenced()
> > > > > > path that you are talking about always locks the folio. So the
> > > > > > behavior in that case wouldn't change.
> > > > >
> > > > > Thanks for the clarification. Could you add a note about this if there
> > > > > is a v2?
> > > > >
> > > > Certainly, will do.
> > > > > > >
> > > > > > > >
> > > > > > > > shrink_active_list() doesn't act differently depending on what
> > > > > > > > folio_referenced() returns for an anon folio. So returning 1 when it
> > > > > > > > is contended, like in case of other folio types, wouldn't have any
> > > > > > > > negative impact.
> > > > > > >
> > > > > > > A complaint was raised that the LRU could become slightly disordered:
> > > > > > > https://lore.kernel.org/linux-mm/20240219141703.3851-1-lipeifeng@oppo.com/
> > > > > > >
> > > > > > > We can re-test to confirm if this is the case.
> > > > > > The patch in the link you provided is suggesting to control try-lock
> > > > > > for anon_vma lock. But here we are dealing with folio lock. Not sure
> > > > > > if the ordering issue will be there in this case.
> > > > >
> > > > > Right. Not sure what percentage of folios will be contended; I assume
> > > > > it is minor. Maybe you could share some data on this in a v2?
> > > >
> > > > Any suggestion on how (or which test/benchmark) would be good to
> > > > gather data on this?
> > > >
> > > > IIUC, shink_active_list() doesn't behave any differently whether there
> > > > is contention or not, except when it's an executable file folio. So I
> > > > doubt such data would be any useful. Please correct me if I'm wrong.
> > >
> > > Since we skipped clearing the PTE young bit in folio_referenced_one, a
> > > cold page might be misidentified as hot during shrink_inactive_list().
> > > My understanding is that as long as the percentage is small, this
> > > shouldn't be a concern.
> > I see. That makes a lot more sense why folio_referenced() is called on
> > all folios in shrink_active_list(). I missed that young bit clearing
> > before.
> >
> > Any suggestions on a good testcase to gather this data?
>
> I would run monkey for a few hours with some debug counters, e.g. how
> many folios pass through shrink_active_list(), how many get contended
> and moved to the inactive list without clearing the young bit. If the
> percentage is small, we can just ignore this disordered LRU behavior.
>
Thanks for the suggestion, Barry.

Monkey test wasn't successful in creating sufficient mem pressure. So,
I used an app cycle test. It took over 1 hour to complete it on an
arm64 Android device with memory limited to 6GB.

During the test shrink_active_list() got called over 140k times. Out
of that, over 29k invocations had at least one non-KSM anon folio.
None of the folio_referenced() calls on these folios ended up with
contention i.e. folio_trylock() failing.

So, as thought, this patch doesn't seem to have any negative effect on
shrink_active_list().

> Thanks
> Barry


^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: [RFC PATCH 1/2] mm: always call rmap_walk() on locked folios
  2025-09-08  4:49 [RFC PATCH 1/2] mm: always call rmap_walk() on locked folios Lokesh Gidra
                   ` (2 preceding siblings ...)
  2025-09-10 10:10 ` Harry Yoo
@ 2025-09-11 19:39 ` Lorenzo Stoakes
  2025-09-12  9:03   ` David Hildenbrand
  3 siblings, 1 reply; 21+ messages in thread
From: Lorenzo Stoakes @ 2025-09-11 19:39 UTC (permalink / raw)
  To: Lokesh Gidra
  Cc: akpm, linux-mm, kaleshsingh, ngeoffray, David Hildenbrand,
	Harry Yoo, Peter Xu, Suren Baghdasaryan, Barry Song,
	SeongJae Park

Please please please use a cover letter if there's more than 1 patch :)

I really dislike the 2/2 replying to the 1/2.

On Sun, Sep 07, 2025 at 09:49:49PM -0700, Lokesh Gidra wrote:
> Prior discussion about this can be found at [1].
>
> rmap_walk() requires all folios, except non-KSM anon, to be locked. This
> implies that when threads update folio->mapping to an anon_vma with
> different root (currently only done by UFFDIO MOVE), they have to

I said this on the discussion thread, but can we please stop dancing around
and acting as if this isn't an entirely uffd-specific patch please :)

Let's very explicitly say that's why we're doing this.

> serialize against rmap_walk() with write-lock on the anon_vma, hurting
> scalability. Furthermore, this necessitates rechecking anon_vma when
> pinning/locking an anon_vma (like in folio_lock_anon_vma_read()).

THis is really quite confusing, you're compressing far too much information
into a single sentence.

Let's reword this to make it clearer like:

	Userfaultfd has a scaling issue with its UFFDIO_MOVE operation, an
	operation that is heavily used in android [insert reason why].

	The issue arises because UFFDIO_MOVE updates folio->mapping to an
	anon_vma with a different root. It acquires the folio lock to do
	so, but this is insufficient, because rmap_walk() has a mode in
	which a folio lock need not be acquired, exclusive to non-KSM
	anonymous folios.

	This means that UFFDIO_MOVE has to acquire the anon_vma write lock
	of the root anon_vma belonging to the folio it wishes to move.

	This has resulted in scalability issues due to contention between
	[insert contention information]. We have observed:

	[ insert some data to back this up ]

	This patch resolves the issue by removing this exception. This is
	less problematic than it might seem, as the only caller which
	utilises this mode is shrink_active_list().

Something like this is _a lot_ clearer I think.

>
> This can be simplified quite a bit by ensuring that rmap_walk() is
> always called on locked folios. Among the few callers of rmap_walk() on
> unlocked anon folios, shrink_active_list()->folio_referenced() is the
> only performance critical one.

Let's please not call this a simplification, I mean yes we simplify the
code per se, but we're fundamentally changing the locking logic.

Let's explicitly say that.

Also I find it odd that you say shrink_active_list()->folio_referenced() is
'performance critical', I mean if so, surely this series is broken then?

I'd delete that, the entire basis of this being ok is that it's _not_
performance critical to make this change.

>
> shrink_active_list() doesn't act differently depending on what
> folio_referenced() returns for an anon folio. So returning 1 when it
> is contended, like in case of other folio types, wouldn't have any
> negative impact.

I think this is a little unclear. I mean it very much _does_ behave
differently if it returns 0. So better to say it treats folio_referenced()
as a boolean, so if the folio_referenced() call returns an incorrect
referenced count while still indicating it is referenced that's fine.

>
> Furthermore, as David pointed out in the previous discussion [2], this
> could potentially only affect R/O pages after fork as PG_anon_exclusive
> is not set. But, such folios are already isolated (prior to calling
> folio_referenced()) by grabbing a reference and clearing LRU, so
> do_wp_page()->wp_can_reuse_anon_folio() would not reuse such folios
> anyways.

I think this needs to be expanded too, this is still very unclear.

I think clearer to explicitly expand upon this, like e.g.:

	"Of course we now take a lock where we wouldn't previously have
	done so. In the past would have had a major impact in causing a CoW
	write fault to copy a page in do_wp_page(), since commit
	09854ba94c6a ("mm: do_wp_page() simplification") caused a failure
	to obtain a folio lock to result in a copy even if one wasn't
	necessary.

	However, since commit 6c287605fd56 ("mm: remember exclusively
	mapped anonymous pages with PG_anon_exclusive"), and the
	introduction of the folio anon exclusive flag, this issue is
	significantly mitigated.

	The only case remaining that we might worry about from this
	perspective is that of read-only folios immediately after fork
	where the anon exclusive bit will not have been set yet.

	We note however in the case of read-only just-forked folios that
	wp_can_reuse_anon_folio() will notice the raised reference count
	established by shrink_active_list() via isolate_lru_folios() and
	refuse to reuse in any case, so this will in fact have no impact -
	the folio lock is ultimately immaterial here.

	All-in-all it appears that there is little opportunity for
	meaningful negative impact from this change".

And now, having said all of that, you can talk about simplification,
something like:

	"As a result of changing our approach to locking, we can remove all
	the code that took steps to acuqire an anon_vma write lock instead
	of a folio lock. This results in a significant simplification of
	the code."

:)

>
> [1] https://lore.kernel.org/all/CA+EESO4Z6wtX7ZMdDHQRe5jAAS_bQ-POq5+4aDx5jh2DvY6UHg@mail.gmail.com/
> [2] https://lore.kernel.org/all/dc92aef8-757f-4432-923e-70d92d13fb37@redhat.com/
>
> CC: David Hildenbrand <david@redhat.com>
> CC: Lorenzo Stoakes <lorenzo.stoakes@oracle.com>
> CC: Harry Yoo <harry.yoo@oracle.com>
> CC: Peter Xu <peterx@redhat.com>
> CC: Suren Baghdasaryan <surenb@google.com>
> CC: Barry Song <baohua@kernel.org>
> CC: SeongJae Park <sj@kernel.org>
> Signed-off-by: Lokesh Gidra <lokeshgidra@google.com>
> ---
>  mm/damon/ops-common.c | 16 ++++------------
>  mm/page_idle.c        |  8 ++------
>  mm/rmap.c             | 40 ++++++++++------------------------------
>  3 files changed, 16 insertions(+), 48 deletions(-)
>
> diff --git a/mm/damon/ops-common.c b/mm/damon/ops-common.c
> index 998c5180a603..f61d6dde13dc 100644
> --- a/mm/damon/ops-common.c
> +++ b/mm/damon/ops-common.c
> @@ -162,21 +162,17 @@ void damon_folio_mkold(struct folio *folio)
>  		.rmap_one = damon_folio_mkold_one,
>  		.anon_lock = folio_lock_anon_vma_read,
>  	};
> -	bool need_lock;
>
>  	if (!folio_mapped(folio) || !folio_raw_mapping(folio)) {
>  		folio_set_idle(folio);
>  		return;
>  	}
>
> -	need_lock = !folio_test_anon(folio) || folio_test_ksm(folio);
> -	if (need_lock && !folio_trylock(folio))
> +	if (!folio_trylock(folio))
>  		return;
>
>  	rmap_walk(folio, &rwc);
> -
> -	if (need_lock)
> -		folio_unlock(folio);
> +	folio_unlock(folio);
>
>  }
>
> @@ -228,7 +224,6 @@ bool damon_folio_young(struct folio *folio)
>  		.rmap_one = damon_folio_young_one,
>  		.anon_lock = folio_lock_anon_vma_read,
>  	};
> -	bool need_lock;
>
>  	if (!folio_mapped(folio) || !folio_raw_mapping(folio)) {
>  		if (folio_test_idle(folio))
> @@ -237,14 +232,11 @@ bool damon_folio_young(struct folio *folio)
>  			return true;
>  	}
>
> -	need_lock = !folio_test_anon(folio) || folio_test_ksm(folio);
> -	if (need_lock && !folio_trylock(folio))
> +	if (!folio_trylock(folio))
>  		return false;
>
>  	rmap_walk(folio, &rwc);
> -
> -	if (need_lock)
> -		folio_unlock(folio);
> +	folio_unlock(folio);

Oh wow damon did this too... Are we sure we caught all such cases? Have you
checked carefully?

Hate that we've open-coded this all over the place.

>
>  	return accessed;
>  }
> diff --git a/mm/page_idle.c b/mm/page_idle.c
> index a82b340dc204..9bf573d22e87 100644
> --- a/mm/page_idle.c
> +++ b/mm/page_idle.c
> @@ -101,19 +101,15 @@ static void page_idle_clear_pte_refs(struct folio *folio)
>  		.rmap_one = page_idle_clear_pte_refs_one,
>  		.anon_lock = folio_lock_anon_vma_read,
>  	};
> -	bool need_lock;
>
>  	if (!folio_mapped(folio) || !folio_raw_mapping(folio))
>  		return;
>
> -	need_lock = !folio_test_anon(folio) || folio_test_ksm(folio);
> -	if (need_lock && !folio_trylock(folio))
> +	if (!folio_trylock(folio))
>  		return;

And more open-coding here, god. Horrid. It's quite nice to attack this actually.

>
>  	rmap_walk(folio, &rwc);
> -
> -	if (need_lock)
> -		folio_unlock(folio);
> +	folio_unlock(folio);
>  }
>
>  static ssize_t page_idle_bitmap_read(struct file *file, struct kobject *kobj,
> diff --git a/mm/rmap.c b/mm/rmap.c
> index 34333ae3bd80..fc53f31434f4 100644
> --- a/mm/rmap.c
> +++ b/mm/rmap.c
> @@ -489,17 +489,15 @@ void __init anon_vma_init(void)
>   * if there is a mapcount, we can dereference the anon_vma after observing
>   * those.
>   *
> - * NOTE: the caller should normally hold folio lock when calling this.  If
> - * not, the caller needs to double check the anon_vma didn't change after
> - * taking the anon_vma lock for either read or write (UFFDIO_MOVE can modify it
> - * concurrently without folio lock protection). See folio_lock_anon_vma_read()
> - * which has already covered that, and comment above remap_pages().
> + * NOTE: the caller should hold folio lock when calling this.
>   */
>  struct anon_vma *folio_get_anon_vma(const struct folio *folio)
>  {
>  	struct anon_vma *anon_vma = NULL;
>  	unsigned long anon_mapping;
>
> +	VM_WARN_ON_ONCE_FOLIO(!folio_test_locked(folio), folio);
> +
>  	rcu_read_lock();
>  	anon_mapping = (unsigned long)READ_ONCE(folio->mapping);
>  	if ((anon_mapping & FOLIO_MAPPING_FLAGS) != FOLIO_MAPPING_ANON)
> @@ -546,7 +544,6 @@ struct anon_vma *folio_lock_anon_vma_read(const struct folio *folio,
>  	struct anon_vma *root_anon_vma;
>  	unsigned long anon_mapping;
>
> -retry:
>  	rcu_read_lock();
>  	anon_mapping = (unsigned long)READ_ONCE(folio->mapping);
>  	if ((anon_mapping & FOLIO_MAPPING_FLAGS) != FOLIO_MAPPING_ANON)
> @@ -557,17 +554,6 @@ struct anon_vma *folio_lock_anon_vma_read(const struct folio *folio,
>  	anon_vma = (struct anon_vma *) (anon_mapping - FOLIO_MAPPING_ANON);
>  	root_anon_vma = READ_ONCE(anon_vma->root);
>  	if (down_read_trylock(&root_anon_vma->rwsem)) {
> -		/*
> -		 * folio_move_anon_rmap() might have changed the anon_vma as we
> -		 * might not hold the folio lock here.
> -		 */
> -		if (unlikely((unsigned long)READ_ONCE(folio->mapping) !=
> -			     anon_mapping)) {
> -			up_read(&root_anon_vma->rwsem);
> -			rcu_read_unlock();
> -			goto retry;
> -		}
> -
>  		/*
>  		 * If the folio is still mapped, then this anon_vma is still
>  		 * its anon_vma, and holding the mutex ensures that it will
> @@ -602,18 +588,6 @@ struct anon_vma *folio_lock_anon_vma_read(const struct folio *folio,
>  	rcu_read_unlock();
>  	anon_vma_lock_read(anon_vma);
>
> -	/*
> -	 * folio_move_anon_rmap() might have changed the anon_vma as we might
> -	 * not hold the folio lock here.
> -	 */
> -	if (unlikely((unsigned long)READ_ONCE(folio->mapping) !=
> -		     anon_mapping)) {
> -		anon_vma_unlock_read(anon_vma);
> -		put_anon_vma(anon_vma);
> -		anon_vma = NULL;
> -		goto retry;
> -	}
> -
>  	if (atomic_dec_and_test(&anon_vma->refcount)) {
>  		/*
>  		 * Oops, we held the last refcount, release the lock
> @@ -1005,7 +979,7 @@ int folio_referenced(struct folio *folio, int is_locked,
>  	if (!folio_raw_mapping(folio))
>  		return 0;
>
> -	if (!is_locked && (!folio_test_anon(folio) || folio_test_ksm(folio))) {
> +	if (!is_locked) {
>  		we_locked = folio_trylock(folio);
>  		if (!we_locked)
>  			return 1;
> @@ -2815,6 +2789,12 @@ static void rmap_walk_anon(struct folio *folio,
>  	pgoff_t pgoff_start, pgoff_end;
>  	struct anon_vma_chain *avc;
>
> +	/*
> +	 * The folio lock ensures that folio->mapping can be changed under us to

Can't? :)

> +	 * an anon_vma with different root, like UFFDIO MOVE.

Not a fan of this 'like UFFDIO_MOVE'. I mean here I think we should just
drop that bit.

> +	 */
> +	VM_BUG_ON_FOLIO(!folio_test_locked(folio), folio);

PLEASE do not add BUG_ON(), checkpatch.pl will warn you on this and it's
not something we do these days pretty much ever. VM_WARN_ON_ONCE_FOLIO()
please.

> +
>  	if (locked) {
>  		anon_vma = folio_anon_vma(folio);
>  		/* anon_vma disappear under us? */
>
> base-commit: b024763926d2726978dff6588b81877d000159c1
> --
> 2.51.0.355.g5224444f11-goog
>
>

Cheers, Lorenzo


^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: [RFC PATCH 2/2] userfaultfd: remove anon-vma lock for moving folios in MOVE ioctl
  2025-09-08  4:49 ` [RFC PATCH 2/2] userfaultfd: remove anon-vma lock for moving folios in MOVE ioctl Lokesh Gidra
@ 2025-09-11 20:07   ` Lorenzo Stoakes
  2025-09-12  9:15   ` David Hildenbrand
  1 sibling, 0 replies; 21+ messages in thread
From: Lorenzo Stoakes @ 2025-09-11 20:07 UTC (permalink / raw)
  To: Lokesh Gidra
  Cc: akpm, linux-mm, kaleshsingh, ngeoffray, David Hildenbrand,
	Peter Xu, Suren Baghdasaryan, Barry Song

Subject is weird, it's anon_vma not anon-vma and mm/userfaultfd seems more
appropriate as a prefix.

How about:

userfaultfd: don't lock anon_vma when performing UFFDIO_MOVE


On Sun, Sep 07, 2025 at 09:49:50PM -0700, Lokesh Gidra wrote:
> Since rmap_walk() is now always called on locked anon folios, we don't
> have to serialize on anon_vma lock when updating folio->mapping.

You're again being indirect about when this happens, please EXPLICITLY mention
uffd and UFFDIO_MOVE please.

I mean, this patch ilterailiy has userfautlfd in the prefix of the subject,
there's no need to be indirect here :)

>
> This helps avoid contention on src anon_vma when multiple threads are
> simultaneously moving distinct pages from the same src vma.

It feels like you need to add more here. I don't know what 'src' anon_vma is,
spell it out as the anon_vma belonging to the folio which you intend to move,
and be a little more specific - when multiple threads invoke UFFDIO_MOVE on the
same VMA.

Also here is a _prime_ chance to mention the simplification which is
something you do achieve here (very clearly) as a result of changing the
locking approach.

>
> CC: David Hildenbrand <david@redhat.com>
> CC: Lorenzo Stoakes <lorenzo.stoakes@oracle.com>
> CC: Peter Xu <peterx@redhat.com>
> CC: Suren Baghdasaryan <surenb@google.com>
> CC: Barry Song <baohua@kernel.org>
> Signed-off-by: Lokesh Gidra <lokeshgidra@google.com>

This generally LGTM, but let's revisit on respin w/commit msg update.

I do like the simplification here :)

Cheers, Lorenzo

> ---
>  mm/huge_memory.c | 22 +----------------
>  mm/userfaultfd.c | 62 +++++++++---------------------------------------
>  2 files changed, 12 insertions(+), 72 deletions(-)
>
> diff --git a/mm/huge_memory.c b/mm/huge_memory.c
> index 26cedfcd7418..5cd3957f92d4 100644
> --- a/mm/huge_memory.c
> +++ b/mm/huge_memory.c
> @@ -2533,7 +2533,6 @@ int move_pages_huge_pmd(struct mm_struct *mm, pmd_t *dst_pmd, pmd_t *src_pmd, pm
>  	pmd_t _dst_pmd, src_pmdval;
>  	struct page *src_page;
>  	struct folio *src_folio;
> -	struct anon_vma *src_anon_vma;
>  	spinlock_t *src_ptl, *dst_ptl;
>  	pgtable_t src_pgtable;
>  	struct mmu_notifier_range range;
> @@ -2582,23 +2581,9 @@ int move_pages_huge_pmd(struct mm_struct *mm, pmd_t *dst_pmd, pmd_t *src_pmd, pm
>  				src_addr + HPAGE_PMD_SIZE);
>  	mmu_notifier_invalidate_range_start(&range);
>
> -	if (src_folio) {
> +	if (src_folio)
>  		folio_lock(src_folio);
>
> -		/*
> -		 * split_huge_page walks the anon_vma chain without the page
> -		 * lock. Serialize against it with the anon_vma lock, the page
> -		 * lock is not enough.
> -		 */
> -		src_anon_vma = folio_get_anon_vma(src_folio);
> -		if (!src_anon_vma) {
> -			err = -EAGAIN;
> -			goto unlock_folio;
> -		}
> -		anon_vma_lock_write(src_anon_vma);
> -	} else
> -		src_anon_vma = NULL;
> -
>  	dst_ptl = pmd_lockptr(mm, dst_pmd);
>  	double_pt_lock(src_ptl, dst_ptl);
>  	if (unlikely(!pmd_same(*src_pmd, src_pmdval) ||
> @@ -2643,11 +2628,6 @@ int move_pages_huge_pmd(struct mm_struct *mm, pmd_t *dst_pmd, pmd_t *src_pmd, pm
>  	pgtable_trans_huge_deposit(mm, dst_pmd, src_pgtable);
>  unlock_ptls:
>  	double_pt_unlock(src_ptl, dst_ptl);
> -	if (src_anon_vma) {
> -		anon_vma_unlock_write(src_anon_vma);
> -		put_anon_vma(src_anon_vma);
> -	}
> -unlock_folio:
>  	/* unblock rmap walks */
>  	if (src_folio)
>  		folio_unlock(src_folio);
> diff --git a/mm/userfaultfd.c b/mm/userfaultfd.c
> index 50aaa8dcd24c..1a36760a36c7 100644
> --- a/mm/userfaultfd.c
> +++ b/mm/userfaultfd.c
> @@ -1035,8 +1035,7 @@ static inline bool is_pte_pages_stable(pte_t *dst_pte, pte_t *src_pte,
>   */
>  static struct folio *check_ptes_for_batched_move(struct vm_area_struct *src_vma,
>  						 unsigned long src_addr,
> -						 pte_t *src_pte, pte_t *dst_pte,
> -						 struct anon_vma *src_anon_vma)
> +						 pte_t *src_pte, pte_t *dst_pte)
>  {
>  	pte_t orig_dst_pte, orig_src_pte;
>  	struct folio *folio;
> @@ -1052,8 +1051,7 @@ static struct folio *check_ptes_for_batched_move(struct vm_area_struct *src_vma,
>  	folio = vm_normal_folio(src_vma, src_addr, orig_src_pte);
>  	if (!folio || !folio_trylock(folio))
>  		return NULL;
> -	if (!PageAnonExclusive(&folio->page) || folio_test_large(folio) ||
> -	    folio_anon_vma(folio) != src_anon_vma) {
> +	if (!PageAnonExclusive(&folio->page) || folio_test_large(folio)) {
>  		folio_unlock(folio);
>  		return NULL;
>  	}
> @@ -1061,9 +1059,8 @@ static struct folio *check_ptes_for_batched_move(struct vm_area_struct *src_vma,
>  }
>
>  /*
> - * Moves src folios to dst in a batch as long as they share the same
> - * anon_vma as the first folio, are not large, and can successfully
> - * take the lock via folio_trylock().
> + * Moves src folios to dst in a batch as long as they are not large, and can
> + * successfully take the lock via folio_trylock().
>   */
>  static long move_present_ptes(struct mm_struct *mm,
>  			      struct vm_area_struct *dst_vma,
> @@ -1073,8 +1070,7 @@ static long move_present_ptes(struct mm_struct *mm,
>  			      pte_t orig_dst_pte, pte_t orig_src_pte,
>  			      pmd_t *dst_pmd, pmd_t dst_pmdval,
>  			      spinlock_t *dst_ptl, spinlock_t *src_ptl,
> -			      struct folio **first_src_folio, unsigned long len,
> -			      struct anon_vma *src_anon_vma)
> +			      struct folio **first_src_folio, unsigned long len)
>  {
>  	int err = 0;
>  	struct folio *src_folio = *first_src_folio;
> @@ -1132,8 +1128,8 @@ static long move_present_ptes(struct mm_struct *mm,
>  		src_pte++;
>
>  		folio_unlock(src_folio);
> -		src_folio = check_ptes_for_batched_move(src_vma, src_addr, src_pte,
> -							dst_pte, src_anon_vma);
> +		src_folio = check_ptes_for_batched_move(src_vma, src_addr,
> +							src_pte, dst_pte);
>  		if (!src_folio)
>  			break;
>  	}
> @@ -1263,7 +1259,6 @@ static long move_pages_ptes(struct mm_struct *mm, pmd_t *dst_pmd, pmd_t *src_pmd
>  	pmd_t dummy_pmdval;
>  	pmd_t dst_pmdval;
>  	struct folio *src_folio = NULL;
> -	struct anon_vma *src_anon_vma = NULL;
>  	struct mmu_notifier_range range;
>  	long ret = 0;
>
> @@ -1347,9 +1342,9 @@ static long move_pages_ptes(struct mm_struct *mm, pmd_t *dst_pmd, pmd_t *src_pmd
>  		}
>
>  		/*
> -		 * Pin and lock both source folio and anon_vma. Since we are in
> -		 * RCU read section, we can't block, so on contention have to
> -		 * unmap the ptes, obtain the lock and retry.
> +		 * Pin and lock source folio. Since we are in RCU read section,
> +		 * we can't block, so on contention have to unmap the ptes,
> +		 * obtain the lock and retry.
>  		 */
>  		if (!src_folio) {
>  			struct folio *folio;
> @@ -1423,33 +1418,11 @@ static long move_pages_ptes(struct mm_struct *mm, pmd_t *dst_pmd, pmd_t *src_pmd
>  			goto retry;
>  		}
>
> -		if (!src_anon_vma) {
> -			/*
> -			 * folio_referenced walks the anon_vma chain
> -			 * without the folio lock. Serialize against it with
> -			 * the anon_vma lock, the folio lock is not enough.
> -			 */
> -			src_anon_vma = folio_get_anon_vma(src_folio);
> -			if (!src_anon_vma) {
> -				/* page was unmapped from under us */
> -				ret = -EAGAIN;
> -				goto out;
> -			}
> -			if (!anon_vma_trylock_write(src_anon_vma)) {
> -				pte_unmap(src_pte);
> -				pte_unmap(dst_pte);
> -				src_pte = dst_pte = NULL;
> -				/* now we can block and wait */
> -				anon_vma_lock_write(src_anon_vma);
> -				goto retry;
> -			}
> -		}
> -
>  		ret = move_present_ptes(mm, dst_vma, src_vma,
>  					dst_addr, src_addr, dst_pte, src_pte,
>  					orig_dst_pte, orig_src_pte, dst_pmd,
>  					dst_pmdval, dst_ptl, src_ptl, &src_folio,
> -					len, src_anon_vma);
> +					len);
>  	} else {
>  		struct folio *folio = NULL;
>
> @@ -1516,10 +1489,6 @@ static long move_pages_ptes(struct mm_struct *mm, pmd_t *dst_pmd, pmd_t *src_pmd
>  	}
>
>  out:
> -	if (src_anon_vma) {
> -		anon_vma_unlock_write(src_anon_vma);
> -		put_anon_vma(src_anon_vma);
> -	}
>  	if (src_folio) {
>  		folio_unlock(src_folio);
>  		folio_put(src_folio);
> @@ -1793,15 +1762,6 @@ static void uffd_move_unlock(struct vm_area_struct *dst_vma,
>   * virtual regions without knowing if there are transparent hugepage
>   * in the regions or not, but preventing the risk of having to split
>   * the hugepmd during the remap.
> - *
> - * If there's any rmap walk that is taking the anon_vma locks without
> - * first obtaining the folio lock (the only current instance is
> - * folio_referenced), they will have to verify if the folio->mapping
> - * has changed after taking the anon_vma lock. If it changed they
> - * should release the lock and retry obtaining a new anon_vma, because
> - * it means the anon_vma was changed by move_pages() before the lock
> - * could be obtained. This is the only additional complexity added to
> - * the rmap code to provide this anonymous page remapping functionality.
>   */
>  ssize_t move_pages(struct userfaultfd_ctx *ctx, unsigned long dst_start,
>  		   unsigned long src_start, unsigned long len, __u64 mode)
> --
> 2.51.0.355.g5224444f11-goog
>
>


^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: [RFC PATCH 1/2] mm: always call rmap_walk() on locked folios
  2025-09-10 10:10 ` Harry Yoo
  2025-09-10 15:33   ` Lokesh Gidra
@ 2025-09-12  3:29   ` Miaohe Lin
  1 sibling, 0 replies; 21+ messages in thread
From: Miaohe Lin @ 2025-09-12  3:29 UTC (permalink / raw)
  To: Harry Yoo, Lokesh Gidra
  Cc: akpm, linux-mm, kaleshsingh, ngeoffray, David Hildenbrand,
	Lorenzo Stoakes, Peter Xu, Suren Baghdasaryan, Barry Song,
	SeongJae Park, Naoya Horiguchi

On 2025/9/10 18:10, Harry Yoo wrote:
> On Sun, Sep 07, 2025 at 09:49:49PM -0700, Lokesh Gidra wrote:
>> Prior discussion about this can be found at [1].
>>
>> rmap_walk() requires all folios, except non-KSM anon, to be locked. This
>> implies that when threads update folio->mapping to an anon_vma with
>> different root (currently only done by UFFDIO MOVE), they have to
>> serialize against rmap_walk() with write-lock on the anon_vma, hurting
>> scalability. Furthermore, this necessitates rechecking anon_vma when
>> pinning/locking an anon_vma (like in folio_lock_anon_vma_read()).
>>
>> This can be simplified quite a bit by ensuring that rmap_walk() is
>> always called on locked folios. Among the few callers of rmap_walk() on
>> unlocked anon folios, shrink_active_list()->folio_referenced() is the
>> only performance critical one.
>>
>> shrink_active_list() doesn't act differently depending on what
>> folio_referenced() returns for an anon folio. So returning 1 when it
>> is contended, like in case of other folio types, wouldn't have any
>> negative impact.
>>
>> Furthermore, as David pointed out in the previous discussion [2], this
>> could potentially only affect R/O pages after fork as PG_anon_exclusive
>> is not set. But, such folios are already isolated (prior to calling
>> folio_referenced()) by grabbing a reference and clearing LRU, so
>> do_wp_page()->wp_can_reuse_anon_folio() would not reuse such folios
>> anyways.
>>
>> [1] https://lore.kernel.org/all/CA*EESO4Z6wtX7ZMdDHQRe5jAAS_bQ-POq5*4aDx5jh2DvY6UHg@mail.gmail.com
>> [2] https://lore.kernel.org/all/dc92aef8-757f-4432-923e-70d92d13fb37@redhat.com
>>
>> CC: David Hildenbrand <david@redhat.com>
>> CC: Lorenzo Stoakes <lorenzo.stoakes@oracle.com>
>> CC: Harry Yoo <harry.yoo@oracle.com>
>> CC: Peter Xu <peterx@redhat.com>
>> CC: Suren Baghdasaryan <surenb@google.com>
>> CC: Barry Song <baohua@kernel.org>
>> CC: SeongJae Park <sj@kernel.org>
>> Signed-off-by: Lokesh Gidra <lokeshgidra@google.com>
>> ---
>>  mm/damon/ops-common.c | 16 ++++------------
>>  mm/page_idle.c        |  8 ++------
>>  mm/rmap.c             | 40 ++++++++++------------------------------
>>  3 files changed, 16 insertions(+), 48 deletions(-)
>>
>> @@ -557,17 +554,6 @@ struct anon_vma *folio_lock_anon_vma_read(const struct folio *folio,
>>  	anon_vma = (struct anon_vma *) (anon_mapping - FOLIO_MAPPING_ANON);
>>  	root_anon_vma = READ_ONCE(anon_vma->root);
>>  	if (down_read_trylock(&root_anon_vma->rwsem)) {
>> -		/*
>> -		 * folio_move_anon_rmap() might have changed the anon_vma as we
>> -		 * might not hold the folio lock here.
>> -		 */
>> -		if (unlikely((unsigned long)READ_ONCE(folio->mapping) !=
>> -			     anon_mapping)) {
>> -			up_read(&root_anon_vma->rwsem);
>> -			rcu_read_unlock();
>> -			goto retry;
>> -		}
>> -
> 
> folio_lock_anon_vma_read() can be called without folio lock in a path:
> memory_failure() -> kill_procs_now() -> collect_procs() ->
> collect_procs_anon().
> 
> Not sure why collect_procs_{anon,ksm,file,fsdax} do not use rmap_walk()
> functionality :/

Because we need to find all processes having the page mapped and kill them.
But there's no convenient way to get back to mapped processes from the VMAs.
So a brute-force search over all running processes is used here.

> 
> Should we take folio lock before calling kill_procs_now() in
> memory_failure()?

I agree with Lokesh that we should put kill_procs_now() inside folio lock's critical section.

Thanks.
.


^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: [RFC PATCH 1/2] mm: always call rmap_walk() on locked folios
  2025-09-11 19:05               ` Lokesh Gidra
@ 2025-09-12  5:10                 ` Barry Song
  0 siblings, 0 replies; 21+ messages in thread
From: Barry Song @ 2025-09-12  5:10 UTC (permalink / raw)
  To: Lokesh Gidra
  Cc: akpm, linux-mm, kaleshsingh, ngeoffray, David Hildenbrand,
	Lorenzo Stoakes, Harry Yoo, Peter Xu, Suren Baghdasaryan,
	SeongJae Park

[...]
> > > > >
> > > > > IIUC, shink_active_list() doesn't behave any differently whether there
> > > > > is contention or not, except when it's an executable file folio. So I
> > > > > doubt such data would be any useful. Please correct me if I'm wrong.
> > > >
> > > > Since we skipped clearing the PTE young bit in folio_referenced_one, a
> > > > cold page might be misidentified as hot during shrink_inactive_list().
> > > > My understanding is that as long as the percentage is small, this
> > > > shouldn't be a concern.
> > > I see. That makes a lot more sense why folio_referenced() is called on
> > > all folios in shrink_active_list(). I missed that young bit clearing
> > > before.
> > >
> > > Any suggestions on a good testcase to gather this data?
> >
> > I would run monkey for a few hours with some debug counters, e.g. how
> > many folios pass through shrink_active_list(), how many get contended
> > and moved to the inactive list without clearing the young bit. If the
> > percentage is small, we can just ignore this disordered LRU behavior.
> >
> Thanks for the suggestion, Barry.
>
> Monkey test wasn't successful in creating sufficient mem pressure. So,
> I used an app cycle test. It took over 1 hour to complete it on an
> arm64 Android device with memory limited to 6GB.
>
> During the test shrink_active_list() got called over 140k times. Out
> of that, over 29k invocations had at least one non-KSM anon folio.
> None of the folio_referenced() calls on these folios ended up with
> contention i.e. folio_trylock() failing.
>
> So, as thought, this patch doesn't seem to have any negative effect on
> shrink_active_list().

Cool, thanks! It seems to meet expectations, given that a folio is
a very small granularity. I also raised another discussion [1],
which, if it works out, should provide one more reason why we
need your patches.


[1] https://lore.kernel.org/linux-mm/CAGsJ_4x=YsQR=nNcHA-q=0vg0b7ok=81C_qQqKmoJ+BZ+HVduQ@mail.gmail.com/

Thanks
Barry


^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: [RFC PATCH 1/2] mm: always call rmap_walk() on locked folios
  2025-09-11 19:39 ` Lorenzo Stoakes
@ 2025-09-12  9:03   ` David Hildenbrand
  2025-09-13  4:27     ` Lokesh Gidra
  0 siblings, 1 reply; 21+ messages in thread
From: David Hildenbrand @ 2025-09-12  9:03 UTC (permalink / raw)
  To: Lorenzo Stoakes, Lokesh Gidra
  Cc: akpm, linux-mm, kaleshsingh, ngeoffray, Harry Yoo, Peter Xu,
	Suren Baghdasaryan, Barry Song, SeongJae Park

On 11.09.25 21:39, Lorenzo Stoakes wrote:
> Please please please use a cover letter if there's more than 1 patch :)
> 
> I really dislike the 2/2 replying to the 1/2.

+1000

> 
> On Sun, Sep 07, 2025 at 09:49:49PM -0700, Lokesh Gidra wrote:
>> Prior discussion about this can be found at [1].
>>
>> rmap_walk() requires all folios, except non-KSM anon, to be locked. This
>> implies that when threads update folio->mapping to an anon_vma with
>> different root (currently only done by UFFDIO MOVE), they have to
> 
> I said this on the discussion thread, but can we please stop dancing around
> and acting as if this isn't an entirely uffd-specific patch please :)
> 
> Let's very explicitly say that's why we're doing this.
> 
>> serialize against rmap_walk() with write-lock on the anon_vma, hurting
>> scalability. Furthermore, this necessitates rechecking anon_vma when
>> pinning/locking an anon_vma (like in folio_lock_anon_vma_read()).
> 
> THis is really quite confusing, you're compressing far too much information
> into a single sentence.
> 
> Let's reword this to make it clearer like:
> 
> 	Userfaultfd has a scaling issue with its UFFDIO_MOVE operation, an
> 	operation that is heavily used in android [insert reason why].
> 
> 	The issue arises because UFFDIO_MOVE updates folio->mapping to an
> 	anon_vma with a different root. It acquires the folio lock to do
> 	so, but this is insufficient, because rmap_walk() has a mode in
> 	which a folio lock need not be acquired, exclusive to non-KSM
> 	anonymous folios.
> 
> 	This means that UFFDIO_MOVE has to acquire the anon_vma write lock
> 	of the root anon_vma belonging to the folio it wishes to move.
> 
> 	This has resulted in scalability issues due to contention between
> 	[insert contention information]. We have observed:
> 
> 	[ insert some data to back this up ]
> 
> 	This patch resolves the issue by removing this exception. This is
> 	less problematic than it might seem, as the only caller which
> 	utilises this mode is shrink_active_list().
> 
> Something like this is _a lot_ clearer I think.

Yes, fully agreed.

> 
>>
>> This can be simplified quite a bit by ensuring that rmap_walk() is
>> always called on locked folios. Among the few callers of rmap_walk() on
>> unlocked anon folios, shrink_active_list()->folio_referenced() is the
>> only performance critical one.
> 
> Let's please not call this a simplification, I mean yes we simplify the
> code per se, but we're fundamentally changing the locking logic.
> 
> Let's explicitly say that.
> 
> Also I find it odd that you say shrink_active_list()->folio_referenced() is
> 'performance critical', I mean if so, surely this series is broken then?
> 
> I'd delete that, the entire basis of this being ok is that it's _not_
> performance critical to make this change.

I think we can mention that as a side-effect of this performance 
optimization for uffd, folio_get_anon_vma() gets simpler and we no 
langer handle locking of anon folios different to locking of other 
(pagecache, ksm) folios.


-- 
Cheers

David / dhildenb



^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: [RFC PATCH 2/2] userfaultfd: remove anon-vma lock for moving folios in MOVE ioctl
  2025-09-08  4:49 ` [RFC PATCH 2/2] userfaultfd: remove anon-vma lock for moving folios in MOVE ioctl Lokesh Gidra
  2025-09-11 20:07   ` Lorenzo Stoakes
@ 2025-09-12  9:15   ` David Hildenbrand
  1 sibling, 0 replies; 21+ messages in thread
From: David Hildenbrand @ 2025-09-12  9:15 UTC (permalink / raw)
  To: Lokesh Gidra, akpm
  Cc: linux-mm, kaleshsingh, ngeoffray, Lorenzo Stoakes, Peter Xu,
	Suren Baghdasaryan, Barry Song

On 08.09.25 06:49, Lokesh Gidra wrote:
> Since rmap_walk() is now always called on locked anon folios, we don't
> have to serialize on anon_vma lock when updating folio->mapping.

I would write that as

"Now that rmap_walk() is guaranteed to be called with the folio lock 
held, we can stop serializing on the src VMA anon_vma lock when moving 
an exclusive folio from one a src VMA to a dst VMA.

When moving a folio, we modify folio->mapping through 
folio_move_anon_rmap() and adjust folio->index accordingly. Doing that
while we could have concurrent RMAP walks would be dangerous.

folio_move_anon_rmap() already enforces that the folio is locked. So
when we have the folio locked we can no longer race with concurrent
rmap_walk() as used by folio_referenced() and the anon_vma lock is no 
longer required.

Note that this handling is now the same as for other 
folio_move_anon_rmap() users that also do not hold the anon_vma lock -- 
namely COW reuse handling. These users never required the anon_vma lock
as they are only moving the anon VMA closer to the anon_vma leaf of the 
VMA, for example, from an anon_vma root to a leaf of that root. rmap 
walks were always able to tolerate that scenario."

Something like that.

> 
> This helps avoid contention on src anon_vma when multiple threads are
> simultaneously moving distinct pages from the same src vma.
> 
> CC: David Hildenbrand <david@redhat.com>
> CC: Lorenzo Stoakes <lorenzo.stoakes@oracle.com>
> CC: Peter Xu <peterx@redhat.com>
> CC: Suren Baghdasaryan <surenb@google.com>
> CC: Barry Song <baohua@kernel.org>
> Signed-off-by: Lokesh Gidra <lokeshgidra@google.com>
> ---

-- 
Cheers

David / dhildenb

^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: [RFC PATCH 1/2] mm: always call rmap_walk() on locked folios
  2025-09-12  9:03   ` David Hildenbrand
@ 2025-09-13  4:27     ` Lokesh Gidra
  2025-09-15 11:27       ` Lorenzo Stoakes
  0 siblings, 1 reply; 21+ messages in thread
From: Lokesh Gidra @ 2025-09-13  4:27 UTC (permalink / raw)
  To: David Hildenbrand
  Cc: Lorenzo Stoakes, akpm, linux-mm, kaleshsingh, ngeoffray,
	Harry Yoo, Peter Xu, Suren Baghdasaryan, Barry Song,
	SeongJae Park

On Fri, Sep 12, 2025 at 2:03 AM David Hildenbrand <david@redhat.com> wrote:
>
> On 11.09.25 21:39, Lorenzo Stoakes wrote:
> > Please please please use a cover letter if there's more than 1 patch :)
> >
> > I really dislike the 2/2 replying to the 1/2.
>
> +1000
>
> >
> > On Sun, Sep 07, 2025 at 09:49:49PM -0700, Lokesh Gidra wrote:
> >> Prior discussion about this can be found at [1].
> >>
> >> rmap_walk() requires all folios, except non-KSM anon, to be locked. This
> >> implies that when threads update folio->mapping to an anon_vma with
> >> different root (currently only done by UFFDIO MOVE), they have to
> >
> > I said this on the discussion thread, but can we please stop dancing around
> > and acting as if this isn't an entirely uffd-specific patch please :)
> >
> > Let's very explicitly say that's why we're doing this.
> >
> >> serialize against rmap_walk() with write-lock on the anon_vma, hurting
> >> scalability. Furthermore, this necessitates rechecking anon_vma when
> >> pinning/locking an anon_vma (like in folio_lock_anon_vma_read()).
> >
> > THis is really quite confusing, you're compressing far too much information
> > into a single sentence.
> >
> > Let's reword this to make it clearer like:
> >
> >       Userfaultfd has a scaling issue with its UFFDIO_MOVE operation, an
> >       operation that is heavily used in android [insert reason why].
> >
> >       The issue arises because UFFDIO_MOVE updates folio->mapping to an
> >       anon_vma with a different root. It acquires the folio lock to do
> >       so, but this is insufficient, because rmap_walk() has a mode in
> >       which a folio lock need not be acquired, exclusive to non-KSM
> >       anonymous folios.
> >
> >       This means that UFFDIO_MOVE has to acquire the anon_vma write lock
> >       of the root anon_vma belonging to the folio it wishes to move.
> >
> >       This has resulted in scalability issues due to contention between
> >       [insert contention information]. We have observed:
> >
> >       [ insert some data to back this up ]
> >
> >       This patch resolves the issue by removing this exception. This is
> >       less problematic than it might seem, as the only caller which
> >       utilises this mode is shrink_active_list().
> >
> > Something like this is _a lot_ clearer I think.
>
> Yes, fully agreed.
>
> >
> >>
> >> This can be simplified quite a bit by ensuring that rmap_walk() is
> >> always called on locked folios. Among the few callers of rmap_walk() on
> >> unlocked anon folios, shrink_active_list()->folio_referenced() is the
> >> only performance critical one.
> >
> > Let's please not call this a simplification, I mean yes we simplify the
> > code per se, but we're fundamentally changing the locking logic.
> >
> > Let's explicitly say that.
> >
> > Also I find it odd that you say shrink_active_list()->folio_referenced() is
> > 'performance critical', I mean if so, surely this series is broken then?
> >
> > I'd delete that, the entire basis of this being ok is that it's _not_
> > performance critical to make this change.
>
> I think we can mention that as a side-effect of this performance
> optimization for uffd, folio_get_anon_vma() gets simpler and we no
> langer handle locking of anon folios different to locking of other
> (pagecache, ksm) folios.
>
Thank you both for the valuable feedback. I'll upload next version
within few days addressing all the comments.
>
> --
> Cheers
>
> David / dhildenb
>


^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: [RFC PATCH 1/2] mm: always call rmap_walk() on locked folios
  2025-09-13  4:27     ` Lokesh Gidra
@ 2025-09-15 11:27       ` Lorenzo Stoakes
  0 siblings, 0 replies; 21+ messages in thread
From: Lorenzo Stoakes @ 2025-09-15 11:27 UTC (permalink / raw)
  To: Lokesh Gidra
  Cc: David Hildenbrand, akpm, linux-mm, kaleshsingh, ngeoffray,
	Harry Yoo, Peter Xu, Suren Baghdasaryan, Barry Song,
	SeongJae Park

On Fri, Sep 12, 2025 at 09:27:03PM -0700, Lokesh Gidra wrote:
> Thank you both for the valuable feedback. I'll upload next version
> within few days addressing all the comments.

Thanks! :)

Cheers, Lorenzo


^ permalink raw reply	[flat|nested] 21+ messages in thread

end of thread, other threads:[~2025-09-15 11:28 UTC | newest]

Thread overview: 21+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2025-09-08  4:49 [RFC PATCH 1/2] mm: always call rmap_walk() on locked folios Lokesh Gidra
2025-09-08  4:49 ` [RFC PATCH 2/2] userfaultfd: remove anon-vma lock for moving folios in MOVE ioctl Lokesh Gidra
2025-09-11 20:07   ` Lorenzo Stoakes
2025-09-12  9:15   ` David Hildenbrand
2025-09-08 21:47 ` [RFC PATCH 1/2] mm: always call rmap_walk() on locked folios Barry Song
2025-09-08 22:12   ` Lokesh Gidra
2025-09-09  0:40     ` Barry Song
2025-09-09  5:37       ` Lokesh Gidra
2025-09-09  5:51         ` Barry Song
2025-09-09  5:56           ` Lokesh Gidra
2025-09-09  6:01             ` Barry Song
2025-09-11 19:05               ` Lokesh Gidra
2025-09-12  5:10                 ` Barry Song
2025-09-10 10:10 ` Harry Yoo
2025-09-10 15:33   ` Lokesh Gidra
2025-09-11  8:40     ` Harry Yoo
2025-09-12  3:29   ` Miaohe Lin
2025-09-11 19:39 ` Lorenzo Stoakes
2025-09-12  9:03   ` David Hildenbrand
2025-09-13  4:27     ` Lokesh Gidra
2025-09-15 11:27       ` Lorenzo Stoakes

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox