* [PATCH v2 0/2] Improve UFFDIO_MOVE scalability by removing anon_vma lock
@ 2025-09-23 7:10 Lokesh Gidra
2025-09-23 7:10 ` [PATCH v2 1/2] mm: always call rmap_walk() on locked folios Lokesh Gidra
` (2 more replies)
0 siblings, 3 replies; 9+ messages in thread
From: Lokesh Gidra @ 2025-09-23 7:10 UTC (permalink / raw)
To: akpm
Cc: linux-mm, kaleshsingh, ngeoffray, jannh, Lokesh Gidra,
David Hildenbrand, Lorenzo Stoakes, Harry Yoo, Peter Xu,
Suren Baghdasaryan, Barry Song, SeongJae Park
Userfaultfd has a scalability issue in its UFFDIO_MOVE ioctl, which is
heavily used on Android, where the Java garbage collector relies on it
for concurrent heap compaction.
The issue arises because UFFDIO_MOVE updates folio->mapping to an
anon_vma with a different root in order to move the folio from a src
VMA to a dst VMA. It performs the operation with the folio locked, but
this is insufficient, because rmap_walk() can be performed on non-KSM
anonymous folios without the folio lock.
This means that UFFDIO_MOVE has to acquire the anon_vma write lock
of the root anon_vma belonging to the folio it wishes to move.
This causes a scalability bottleneck when multiple threads perform
UFFDIO_MOVE simultaneously on distinct pages of the same src VMA. In
field traces from arm64 Android devices, we have observed janky user
interactions due to long (sometimes over ~50ms) uninterruptible
sleeps on the main UI thread caused by anon_vma lock contention in
UFFDIO_MOVE. This is particularly severe at the beginning of the
GC's compaction phase, when multiple threads are likely to be
involved.
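For context, here is a minimal userspace sketch of the operation being
optimized. This is illustrative only: the userfaultfd fd setup and
UFFDIO_REGISTER step are assumed to have happened elsewhere, and
move_range() is a hypothetical helper, not an existing API; the struct
and ioctl names come from the userfaultfd UAPI header.

/*
 * Hedged sketch: one GC worker thread moving a range of compacted pages
 * from the src heap region to the dst region with UFFDIO_MOVE.
 */
#include <linux/userfaultfd.h>
#include <sys/ioctl.h>
#include <stdint.h>

static int64_t move_range(int uffd, uint64_t src, uint64_t dst, uint64_t len)
{
	struct uffdio_move mv = {
		.src  = src,
		.dst  = dst,
		.len  = len,
		.mode = 0,	/* no DONTWAKE: wake fault waiters on dst */
	};

	if (ioctl(uffd, UFFDIO_MOVE, &mv) == 0)
		return mv.len;	/* the whole range was moved */
	/*
	 * On failure the kernel reports progress in 'move': the number of
	 * bytes moved so far, or a negated errno if nothing was moved.
	 */
	return mv.move;
}

When many such worker threads operate on distinct pages of the same src
VMA, they all contend on the same root anon_vma rwsem in the current
code, which is exactly the bottleneck described above.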
This series resolves the issue by removing the exception in rmap_walk()
for non-KSM anon folios, ensuring that all folios are locked during the
rmap walk. This is less problematic than it might seem, as the only
major caller which utilises this mode is shrink_active_list(), which is
covered in detail in the first patch of this series.
As a result of changing our approach to locking, we can remove all
the code that took steps to acquire an anon_vma write lock instead
of a folio lock. This results in a significant simplification and
scalability improvement of the code (currently only in UFFDIO_MOVE).
Furthermore, as a side-effect, folio_lock_anon_vma_read() gets simpler
as we don't need to worry that folio->mapping may have changed under us.
Prior discussions on this can be found at [1, 2].
Changes since v1 [3]:
1. Move relevant parts of cover letter description to first patch, per
David Hildenbrand.
2. Enumerate all callers of rmap_walk(), folio_lock_anon_vma_read(), and
folio_get_anon_vma(), per Lorenzo Stoakes.
3. Make other corrections/improvements to commit message, per Lorenzo
Stoakes.
[1] https://lore.kernel.org/all/CA+EESO4Z6wtX7ZMdDHQRe5jAAS_bQ-POq5+4aDx5jh2DvY6UHg@mail.gmail.com/
[2] https://lore.kernel.org/all/20250908044950.311548-1-lokeshgidra@google.com/
[3] https://lore.kernel.org/all/20250918055135.2881413-1-lokeshgidra@google.com/
Lokesh Gidra (2):
mm: always call rmap_walk() on locked folios
mm/userfaultfd: don't lock anon_vma when performing UFFDIO_MOVE
CC: David Hildenbrand <david@redhat.com>
CC: Lorenzo Stoakes <lorenzo.stoakes@oracle.com>
CC: Harry Yoo <harry.yoo@oracle.com>
CC: Peter Xu <peterx@redhat.com>
CC: Suren Baghdasaryan <surenb@google.com>
CC: Barry Song <baohua@kernel.org>
CC: SeongJae Park <sj@kernel.org>
---
mm/damon/ops-common.c | 16 +++--------
mm/huge_memory.c | 22 +--------------
mm/memory-failure.c | 3 +++
mm/page_idle.c | 8 ++----
mm/rmap.c | 42 +++++++++--------------------
mm/userfaultfd.c | 62 ++++++++-----------------------------------
6 files changed, 33 insertions(+), 120 deletions(-)
--
2.51.0.534.gc79095c0ca-goog
* [PATCH v2 1/2] mm: always call rmap_walk() on locked folios
2025-09-23 7:10 [PATCH v2 0/2] Improve UFFDIO_MOVE scalability by removing anon_vma lock Lokesh Gidra
@ 2025-09-23 7:10 ` Lokesh Gidra
2025-09-24 10:06 ` David Hildenbrand
2025-11-03 17:51 ` Lorenzo Stoakes
2025-09-23 7:10 ` [PATCH v2 2/2] mm/userfaultfd: don't lock anon_vma when performing UFFDIO_MOVE Lokesh Gidra
2025-10-03 23:03 ` [PATCH v2 0/2] Improve UFFDIO_MOVE scalability by removing anon_vma lock Peter Xu
2 siblings, 2 replies; 9+ messages in thread
From: Lokesh Gidra @ 2025-09-23 7:10 UTC (permalink / raw)
To: akpm
Cc: linux-mm, kaleshsingh, ngeoffray, jannh, Lokesh Gidra,
David Hildenbrand, Lorenzo Stoakes, Harry Yoo, Peter Xu,
Suren Baghdasaryan, Barry Song, SeongJae Park
Guarantee that rmap_walk() is called on locked folios so that threads
changing folio->mapping and folio->index for non-KSM anon folios can
serialize on the fine-grained folio lock rather than on the anon_vma
lock. Other folio types are already always locked before rmap_walk().
With this, we go from 'not necessarily' locking the non-KSM anon folio
to 'definitely' locking it during rmap walks.
This patch is in preparation for removing anon_vma write-lock from
UFFDIO_MOVE.
With this patch, three functions are now expected to be called with a
locked folio. To be careful not to miss any case, here is the exhaustive
list of all their callers.
1) rmap_walk() is called from:
a) folio_referenced()
b) damon_folio_mkold()
c) damon_folio_young()
d) page_idle_clear_pte_refs()
e) try_to_unmap()
f) try_to_migrate()
g) folio_mkclean()
h) remove_migration_ptes()
In the above list, the first four are changed in this patch to try-lock
non-KSM anon folios, as is already done for other folio types. The
remaining functions in the list already hold the folio lock when calling
rmap_walk().
2) folio_lock_anon_vma_read() is called from the following functions:
a) collect_procs_anon()
b) page_idle_clear_pte_refs()
c) damon_folio_mkold()
d) damon_folio_young()
e) folio_referenced()
f) try_to_unmap()
g) try_to_migrate()
All the functions in the above list, except collect_procs_anon(), are
covered by the rmap_walk() list above. For collect_procs_anon(), this
patch changes kill_procs_now() to take the folio lock, which ensures
that all callers of folio_lock_anon_vma_read() now hold the lock.
3) folio_get_anon_vma() is called from the following functions, all of
which already hold the folio lock:
a) move_pages_huge_pmd()
b) __folio_split()
c) move_pages_ptes()
d) migrate_folio_unmap()
e) unmap_and_move_huge_page()
Functionally, this patch doesn't break the logic: rmap walkers generally
perform some other check to confirm that what they expect to be mapped
actually is, or otherwise treat the walk as best-effort.
Among the 4 functions changed in this patch, folio_referenced() is the
only core-mm function, and is also frequently called. To assess the
impact of locking non-KSM anon folios in the
shrink_active_list()->folio_referenced() path, we performed an app cycle
test on an arm64 Android device. Over the whole duration of the test
there were over 140k invocations of shrink_active_list(), out of which
over 29k had at least one non-KSM anon folio on which folio_referenced()
was called. folio_trylock() did not fail in any of these invocations.
Of course, we now take a lock where we previously wouldn't have. In the
past this would have had a major impact by causing a CoW write fault to
copy a page in do_wp_page(), as commit 09854ba94c6a ("mm: do_wp_page()
simplification") made a failure to obtain the folio lock result in a
page copy even when one wasn't necessary.
However, since commit 6c287605fd56 ("mm: remember exclusively mapped
anonymous pages with PG_anon_exclusive"), and the introduction of the
folio anon exclusive flag, this issue is significantly mitigated.
The only case remaining that we might worry about from this perspective
is that of read-only folios immediately after fork where the anon
exclusive bit will not have been set yet.
We note however in the case of read-only just-forked folios that
wp_can_reuse_anon_folio() will notice the raised reference count
established by shrink_active_list() via isolate_lru_folios() and refuse
to reuse in any case, so this will in fact have no impact - the folio
lock is ultimately immaterial here.
All-in-all it appears that there is little opportunity for meaningful
negative impact from this change.
CC: David Hildenbrand <david@redhat.com>
CC: Lorenzo Stoakes <lorenzo.stoakes@oracle.com>
CC: Harry Yoo <harry.yoo@oracle.com>
CC: Peter Xu <peterx@redhat.com>
CC: Suren Baghdasaryan <surenb@google.com>
CC: Barry Song <baohua@kernel.org>
CC: SeongJae Park <sj@kernel.org>
Signed-off-by: Lokesh Gidra <lokeshgidra@google.com>
---
mm/damon/ops-common.c | 16 ++++------------
mm/memory-failure.c | 3 +++
mm/page_idle.c | 8 ++------
mm/rmap.c | 42 ++++++++++++------------------------------
4 files changed, 21 insertions(+), 48 deletions(-)
diff --git a/mm/damon/ops-common.c b/mm/damon/ops-common.c
index 998c5180a603..f61d6dde13dc 100644
--- a/mm/damon/ops-common.c
+++ b/mm/damon/ops-common.c
@@ -162,21 +162,17 @@ void damon_folio_mkold(struct folio *folio)
.rmap_one = damon_folio_mkold_one,
.anon_lock = folio_lock_anon_vma_read,
};
- bool need_lock;
if (!folio_mapped(folio) || !folio_raw_mapping(folio)) {
folio_set_idle(folio);
return;
}
- need_lock = !folio_test_anon(folio) || folio_test_ksm(folio);
- if (need_lock && !folio_trylock(folio))
+ if (!folio_trylock(folio))
return;
rmap_walk(folio, &rwc);
-
- if (need_lock)
- folio_unlock(folio);
+ folio_unlock(folio);
}
@@ -228,7 +224,6 @@ bool damon_folio_young(struct folio *folio)
.rmap_one = damon_folio_young_one,
.anon_lock = folio_lock_anon_vma_read,
};
- bool need_lock;
if (!folio_mapped(folio) || !folio_raw_mapping(folio)) {
if (folio_test_idle(folio))
@@ -237,14 +232,11 @@ bool damon_folio_young(struct folio *folio)
return true;
}
- need_lock = !folio_test_anon(folio) || folio_test_ksm(folio);
- if (need_lock && !folio_trylock(folio))
+ if (!folio_trylock(folio))
return false;
rmap_walk(folio, &rwc);
-
- if (need_lock)
- folio_unlock(folio);
+ folio_unlock(folio);
return accessed;
}
diff --git a/mm/memory-failure.c b/mm/memory-failure.c
index a24806bb8e82..f698df156bf8 100644
--- a/mm/memory-failure.c
+++ b/mm/memory-failure.c
@@ -2143,7 +2143,10 @@ static void kill_procs_now(struct page *p, unsigned long pfn, int flags,
{
LIST_HEAD(tokill);
+ folio_lock(folio);
collect_procs(folio, p, &tokill, flags & MF_ACTION_REQUIRED);
+ folio_unlock(folio);
+
kill_procs(&tokill, true, pfn, flags);
}
diff --git a/mm/page_idle.c b/mm/page_idle.c
index a82b340dc204..9bf573d22e87 100644
--- a/mm/page_idle.c
+++ b/mm/page_idle.c
@@ -101,19 +101,15 @@ static void page_idle_clear_pte_refs(struct folio *folio)
.rmap_one = page_idle_clear_pte_refs_one,
.anon_lock = folio_lock_anon_vma_read,
};
- bool need_lock;
if (!folio_mapped(folio) || !folio_raw_mapping(folio))
return;
- need_lock = !folio_test_anon(folio) || folio_test_ksm(folio);
- if (need_lock && !folio_trylock(folio))
+ if (!folio_trylock(folio))
return;
rmap_walk(folio, &rwc);
-
- if (need_lock)
- folio_unlock(folio);
+ folio_unlock(folio);
}
static ssize_t page_idle_bitmap_read(struct file *file, struct kobject *kobj,
diff --git a/mm/rmap.c b/mm/rmap.c
index 0bc7cf8b7359..fd9f18670440 100644
--- a/mm/rmap.c
+++ b/mm/rmap.c
@@ -489,17 +489,15 @@ void __init anon_vma_init(void)
* if there is a mapcount, we can dereference the anon_vma after observing
* those.
*
- * NOTE: the caller should normally hold folio lock when calling this. If
- * not, the caller needs to double check the anon_vma didn't change after
- * taking the anon_vma lock for either read or write (UFFDIO_MOVE can modify it
- * concurrently without folio lock protection). See folio_lock_anon_vma_read()
- * which has already covered that, and comment above remap_pages().
+ * NOTE: the caller should hold folio lock when calling this.
*/
struct anon_vma *folio_get_anon_vma(const struct folio *folio)
{
struct anon_vma *anon_vma = NULL;
unsigned long anon_mapping;
+ VM_WARN_ON_FOLIO(!folio_test_locked(folio), folio);
+
rcu_read_lock();
anon_mapping = (unsigned long)READ_ONCE(folio->mapping);
if ((anon_mapping & FOLIO_MAPPING_FLAGS) != FOLIO_MAPPING_ANON)
@@ -546,7 +544,8 @@ struct anon_vma *folio_lock_anon_vma_read(const struct folio *folio,
struct anon_vma *root_anon_vma;
unsigned long anon_mapping;
-retry:
+ VM_WARN_ON_FOLIO(!folio_test_locked(folio), folio);
+
rcu_read_lock();
anon_mapping = (unsigned long)READ_ONCE(folio->mapping);
if ((anon_mapping & FOLIO_MAPPING_FLAGS) != FOLIO_MAPPING_ANON)
@@ -557,17 +556,6 @@ struct anon_vma *folio_lock_anon_vma_read(const struct folio *folio,
anon_vma = (struct anon_vma *) (anon_mapping - FOLIO_MAPPING_ANON);
root_anon_vma = READ_ONCE(anon_vma->root);
if (down_read_trylock(&root_anon_vma->rwsem)) {
- /*
- * folio_move_anon_rmap() might have changed the anon_vma as we
- * might not hold the folio lock here.
- */
- if (unlikely((unsigned long)READ_ONCE(folio->mapping) !=
- anon_mapping)) {
- up_read(&root_anon_vma->rwsem);
- rcu_read_unlock();
- goto retry;
- }
-
/*
* If the folio is still mapped, then this anon_vma is still
* its anon_vma, and holding the mutex ensures that it will
@@ -602,18 +590,6 @@ struct anon_vma *folio_lock_anon_vma_read(const struct folio *folio,
rcu_read_unlock();
anon_vma_lock_read(anon_vma);
- /*
- * folio_move_anon_rmap() might have changed the anon_vma as we might
- * not hold the folio lock here.
- */
- if (unlikely((unsigned long)READ_ONCE(folio->mapping) !=
- anon_mapping)) {
- anon_vma_unlock_read(anon_vma);
- put_anon_vma(anon_vma);
- anon_vma = NULL;
- goto retry;
- }
-
if (atomic_dec_and_test(&anon_vma->refcount)) {
/*
* Oops, we held the last refcount, release the lock
@@ -988,7 +964,7 @@ int folio_referenced(struct folio *folio, int is_locked,
if (!folio_raw_mapping(folio))
return 0;
- if (!is_locked && (!folio_test_anon(folio) || folio_test_ksm(folio))) {
+ if (!is_locked) {
we_locked = folio_trylock(folio);
if (!we_locked)
return 1;
@@ -2820,6 +2796,12 @@ static void rmap_walk_anon(struct folio *folio,
pgoff_t pgoff_start, pgoff_end;
struct anon_vma_chain *avc;
+ /*
+ * The folio lock ensures that folio->mapping can't be changed under us
+ * to an anon_vma with different root.
+ */
+ VM_WARN_ON_FOLIO(!folio_test_locked(folio), folio);
+
if (locked) {
anon_vma = folio_anon_vma(folio);
/* anon_vma disappear under us? */
--
2.51.0.534.gc79095c0ca-goog
* [PATCH v2 2/2] mm/userfaultfd: don't lock anon_vma when performing UFFDIO_MOVE
2025-09-23 7:10 [PATCH v2 0/2] Improve UFFDIO_MOVE scalability by removing anon_vma lock Lokesh Gidra
2025-09-23 7:10 ` [PATCH v2 1/2] mm: always call rmap_walk() on locked folios Lokesh Gidra
@ 2025-09-23 7:10 ` Lokesh Gidra
2025-09-24 10:07 ` David Hildenbrand
2025-11-03 17:52 ` Lorenzo Stoakes
2025-10-03 23:03 ` [PATCH v2 0/2] Improve UFFDIO_MOVE scalability by removing anon_vma lock Peter Xu
2 siblings, 2 replies; 9+ messages in thread
From: Lokesh Gidra @ 2025-09-23 7:10 UTC (permalink / raw)
To: akpm
Cc: linux-mm, kaleshsingh, ngeoffray, jannh, Lokesh Gidra,
David Hildenbrand, Lorenzo Stoakes, Peter Xu, Suren Baghdasaryan,
Barry Song
Now that rmap_walk() is guaranteed to be called with the folio lock
held, we can stop serializing on the src VMA anon_vma lock when moving
an exclusive folio from a src VMA to a dst VMA in UFFDIO_MOVE ioctl.
When moving a folio, we modify folio->mapping through
folio_move_anon_rmap() and adjust folio->index accordingly. Doing that
while concurrent rmap walks could be in progress would be dangerous.
Therefore, to avoid that, we had to acquire the src VMA's anon_vma lock
in write mode. That meant that when multiple threads called UFFDIO_MOVE
concurrently on distinct pages of the same src VMA, they would serialize
on it, hurting scalability.
In addition to avoiding the scalability bottleneck, this patch also
simplifies the complicated lock dance that UFFDIO_MOVE has to go through
between RCU, folio-lock, ptl, and anon_vma.
folio_move_anon_rmap() already enforces that the folio is locked. So
once we hold the folio lock, we can no longer race with a concurrent
rmap_walk() from folio_referenced() and the other walkers that used to
run it on unlocked non-KSM anon folios, and therefore the anon_vma lock
is no longer required.
Note that this handling is now the same as for other
folio_move_anon_rmap() users that also do not hold the anon_vma lock --
namely COW reuse handling (do_wp_page()->wp_can_reuse_anon_folio(),
do_huge_pmd_wp_page(), and hugetlb_wp()). These users never required the
anon_vma lock, as they only move the folio's anon_vma closer to the
VMA's anon_vma leaf, for example from an anon_vma root to a leaf of that
root. rmap walks were always able to tolerate that scenario.
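To spell out the serialization this relies on, below is a heavily
simplified sketch (not the actual kernel code) of the two sides that the
folio lock now orders. The helpers folio_lock(), folio_move_anon_rmap(),
linear_page_index() and rmap_walk() are real; the surrounding functions
and control flow are assumptions for illustration only.

/* Mover side (UFFDIO_MOVE), per folio: re-home it under the folio lock. */
static void uffd_move_one_folio(struct folio *folio,
				struct vm_area_struct *dst_vma,
				unsigned long dst_addr)
{
	folio_lock(folio);			/* excludes rmap walkers */
	folio_move_anon_rmap(folio, dst_vma);	/* rewrites folio->mapping */
	folio->index = linear_page_index(dst_vma, dst_addr);
	folio_unlock(folio);
}

/*
 * Walker side: after patch 1/2, every rmap walk (try-)takes the folio
 * lock, so folio->mapping cannot switch to an anon_vma with a different
 * root while the anon_vma interval tree is being traversed.
 */
static void walk_one_folio(struct folio *folio, struct rmap_walk_control *rwc)
{
	if (!folio_trylock(folio))
		return;
	rmap_walk(folio, rwc);
	folio_unlock(folio);
}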
CC: David Hildenbrand <david@redhat.com>
CC: Lorenzo Stoakes <lorenzo.stoakes@oracle.com>
CC: Peter Xu <peterx@redhat.com>
CC: Suren Baghdasaryan <surenb@google.com>
CC: Barry Song <baohua@kernel.org>
Signed-off-by: Lokesh Gidra <lokeshgidra@google.com>
---
mm/huge_memory.c | 22 +----------------
mm/userfaultfd.c | 62 +++++++++---------------------------------------
2 files changed, 12 insertions(+), 72 deletions(-)
diff --git a/mm/huge_memory.c b/mm/huge_memory.c
index 1b81680b4225..a16e3778b544 100644
--- a/mm/huge_memory.c
+++ b/mm/huge_memory.c
@@ -2533,7 +2533,6 @@ int move_pages_huge_pmd(struct mm_struct *mm, pmd_t *dst_pmd, pmd_t *src_pmd, pm
pmd_t _dst_pmd, src_pmdval;
struct page *src_page;
struct folio *src_folio;
- struct anon_vma *src_anon_vma;
spinlock_t *src_ptl, *dst_ptl;
pgtable_t src_pgtable;
struct mmu_notifier_range range;
@@ -2582,23 +2581,9 @@ int move_pages_huge_pmd(struct mm_struct *mm, pmd_t *dst_pmd, pmd_t *src_pmd, pm
src_addr + HPAGE_PMD_SIZE);
mmu_notifier_invalidate_range_start(&range);
- if (src_folio) {
+ if (src_folio)
folio_lock(src_folio);
- /*
- * split_huge_page walks the anon_vma chain without the page
- * lock. Serialize against it with the anon_vma lock, the page
- * lock is not enough.
- */
- src_anon_vma = folio_get_anon_vma(src_folio);
- if (!src_anon_vma) {
- err = -EAGAIN;
- goto unlock_folio;
- }
- anon_vma_lock_write(src_anon_vma);
- } else
- src_anon_vma = NULL;
-
dst_ptl = pmd_lockptr(mm, dst_pmd);
double_pt_lock(src_ptl, dst_ptl);
if (unlikely(!pmd_same(*src_pmd, src_pmdval) ||
@@ -2643,11 +2628,6 @@ int move_pages_huge_pmd(struct mm_struct *mm, pmd_t *dst_pmd, pmd_t *src_pmd, pm
pgtable_trans_huge_deposit(mm, dst_pmd, src_pgtable);
unlock_ptls:
double_pt_unlock(src_ptl, dst_ptl);
- if (src_anon_vma) {
- anon_vma_unlock_write(src_anon_vma);
- put_anon_vma(src_anon_vma);
- }
-unlock_folio:
/* unblock rmap walks */
if (src_folio)
folio_unlock(src_folio);
diff --git a/mm/userfaultfd.c b/mm/userfaultfd.c
index af61b95c89e4..6be65089085e 100644
--- a/mm/userfaultfd.c
+++ b/mm/userfaultfd.c
@@ -1035,8 +1035,7 @@ static inline bool is_pte_pages_stable(pte_t *dst_pte, pte_t *src_pte,
*/
static struct folio *check_ptes_for_batched_move(struct vm_area_struct *src_vma,
unsigned long src_addr,
- pte_t *src_pte, pte_t *dst_pte,
- struct anon_vma *src_anon_vma)
+ pte_t *src_pte, pte_t *dst_pte)
{
pte_t orig_dst_pte, orig_src_pte;
struct folio *folio;
@@ -1052,8 +1051,7 @@ static struct folio *check_ptes_for_batched_move(struct vm_area_struct *src_vma,
folio = vm_normal_folio(src_vma, src_addr, orig_src_pte);
if (!folio || !folio_trylock(folio))
return NULL;
- if (!PageAnonExclusive(&folio->page) || folio_test_large(folio) ||
- folio_anon_vma(folio) != src_anon_vma) {
+ if (!PageAnonExclusive(&folio->page) || folio_test_large(folio)) {
folio_unlock(folio);
return NULL;
}
@@ -1061,9 +1059,8 @@ static struct folio *check_ptes_for_batched_move(struct vm_area_struct *src_vma,
}
/*
- * Moves src folios to dst in a batch as long as they share the same
- * anon_vma as the first folio, are not large, and can successfully
- * take the lock via folio_trylock().
+ * Moves src folios to dst in a batch as long as they are not large, and can
+ * successfully take the lock via folio_trylock().
*/
static long move_present_ptes(struct mm_struct *mm,
struct vm_area_struct *dst_vma,
@@ -1073,8 +1070,7 @@ static long move_present_ptes(struct mm_struct *mm,
pte_t orig_dst_pte, pte_t orig_src_pte,
pmd_t *dst_pmd, pmd_t dst_pmdval,
spinlock_t *dst_ptl, spinlock_t *src_ptl,
- struct folio **first_src_folio, unsigned long len,
- struct anon_vma *src_anon_vma)
+ struct folio **first_src_folio, unsigned long len)
{
int err = 0;
struct folio *src_folio = *first_src_folio;
@@ -1132,8 +1128,8 @@ static long move_present_ptes(struct mm_struct *mm,
src_pte++;
folio_unlock(src_folio);
- src_folio = check_ptes_for_batched_move(src_vma, src_addr, src_pte,
- dst_pte, src_anon_vma);
+ src_folio = check_ptes_for_batched_move(src_vma, src_addr,
+ src_pte, dst_pte);
if (!src_folio)
break;
}
@@ -1263,7 +1259,6 @@ static long move_pages_ptes(struct mm_struct *mm, pmd_t *dst_pmd, pmd_t *src_pmd
pmd_t dummy_pmdval;
pmd_t dst_pmdval;
struct folio *src_folio = NULL;
- struct anon_vma *src_anon_vma = NULL;
struct mmu_notifier_range range;
long ret = 0;
@@ -1347,9 +1342,9 @@ static long move_pages_ptes(struct mm_struct *mm, pmd_t *dst_pmd, pmd_t *src_pmd
}
/*
- * Pin and lock both source folio and anon_vma. Since we are in
- * RCU read section, we can't block, so on contention have to
- * unmap the ptes, obtain the lock and retry.
+ * Pin and lock source folio. Since we are in RCU read section,
+ * we can't block, so on contention have to unmap the ptes,
+ * obtain the lock and retry.
*/
if (!src_folio) {
struct folio *folio;
@@ -1423,33 +1418,11 @@ static long move_pages_ptes(struct mm_struct *mm, pmd_t *dst_pmd, pmd_t *src_pmd
goto retry;
}
- if (!src_anon_vma) {
- /*
- * folio_referenced walks the anon_vma chain
- * without the folio lock. Serialize against it with
- * the anon_vma lock, the folio lock is not enough.
- */
- src_anon_vma = folio_get_anon_vma(src_folio);
- if (!src_anon_vma) {
- /* page was unmapped from under us */
- ret = -EAGAIN;
- goto out;
- }
- if (!anon_vma_trylock_write(src_anon_vma)) {
- pte_unmap(src_pte);
- pte_unmap(dst_pte);
- src_pte = dst_pte = NULL;
- /* now we can block and wait */
- anon_vma_lock_write(src_anon_vma);
- goto retry;
- }
- }
-
ret = move_present_ptes(mm, dst_vma, src_vma,
dst_addr, src_addr, dst_pte, src_pte,
orig_dst_pte, orig_src_pte, dst_pmd,
dst_pmdval, dst_ptl, src_ptl, &src_folio,
- len, src_anon_vma);
+ len);
} else {
struct folio *folio = NULL;
@@ -1515,10 +1488,6 @@ static long move_pages_ptes(struct mm_struct *mm, pmd_t *dst_pmd, pmd_t *src_pmd
}
out:
- if (src_anon_vma) {
- anon_vma_unlock_write(src_anon_vma);
- put_anon_vma(src_anon_vma);
- }
if (src_folio) {
folio_unlock(src_folio);
folio_put(src_folio);
@@ -1792,15 +1761,6 @@ static void uffd_move_unlock(struct vm_area_struct *dst_vma,
* virtual regions without knowing if there are transparent hugepage
* in the regions or not, but preventing the risk of having to split
* the hugepmd during the remap.
- *
- * If there's any rmap walk that is taking the anon_vma locks without
- * first obtaining the folio lock (the only current instance is
- * folio_referenced), they will have to verify if the folio->mapping
- * has changed after taking the anon_vma lock. If it changed they
- * should release the lock and retry obtaining a new anon_vma, because
- * it means the anon_vma was changed by move_pages() before the lock
- * could be obtained. This is the only additional complexity added to
- * the rmap code to provide this anonymous page remapping functionality.
*/
ssize_t move_pages(struct userfaultfd_ctx *ctx, unsigned long dst_start,
unsigned long src_start, unsigned long len, __u64 mode)
--
2.51.0.534.gc79095c0ca-goog
* Re: [PATCH v2 1/2] mm: always call rmap_walk() on locked folios
2025-09-23 7:10 ` [PATCH v2 1/2] mm: always call rmap_walk() on locked folios Lokesh Gidra
@ 2025-09-24 10:06 ` David Hildenbrand
2025-10-02 7:56 ` David Hildenbrand
2025-11-03 17:51 ` Lorenzo Stoakes
1 sibling, 1 reply; 9+ messages in thread
From: David Hildenbrand @ 2025-09-24 10:06 UTC (permalink / raw)
To: Lokesh Gidra, akpm
Cc: linux-mm, kaleshsingh, ngeoffray, jannh, Lorenzo Stoakes,
Harry Yoo, Peter Xu, Suren Baghdasaryan, Barry Song,
SeongJae Park
On 23.09.25 09:10, Lokesh Gidra wrote:
> Guarantee that rmap_walk() is called on locked folios so that threads
> changing folio->mapping and folio->index for non-KSM anon folios can
> serialize on the fine-grained folio lock rather than on the anon_vma
> lock. Other folio types are already always locked before rmap_walk().
> With this, we go from 'not necessarily' locking the non-KSM anon folio
> to 'definitely' locking it during rmap walks.
>
> This patch is in preparation for removing anon_vma write-lock from
> UFFDIO_MOVE.
>
> With this patch, three functions are now expected to be called with a
> locked folio. To be careful not to miss any case, here is the exhaustive
> list of all their callers.
>
> 1) rmap_walk() is called from:
>
> a) folio_referenced()
> b) damon_folio_mkold()
> c) damon_folio_young()
> d) page_idle_clear_pte_refs()
> e) try_to_unmap()
> f) try_to_migrate()
> g) folio_mkclean()
> h) remove_migration_ptes()
>
> In the above list, the first four are changed in this patch to try-lock
> non-KSM anon folios, as is already done for other folio types. The
> remaining functions in the list already hold the folio lock when calling
> rmap_walk().
>
> 2) folio_lock_anon_vma_read() is called from the following functions:
>
> a) collect_procs_anon()
> b) page_idle_clear_pte_refs()
> c) damon_folio_mkold()
> d) damon_folio_young()
> e) folio_referenced()
> f) try_to_unmap()
> g) try_to_migrate()
>
> All the functions in the above list, except collect_procs_anon(), are
> covered by the rmap_walk() list above. For collect_procs_anon(), this
> patch changes kill_procs_now() to take the folio lock, which ensures
> that all callers of folio_lock_anon_vma_read() now hold the lock.
>
> 3) folio_get_anon_vma() is called from the following functions, all of
> which already hold the folio lock:
>
> a) move_pages_huge_pmd()
> b) __folio_split()
> c) move_pages_ptes()
> d) migrate_folio_unmap()
> e) unmap_and_move_huge_page()
>
> Functionally, this patch doesn't break the logic: rmap walkers generally
> perform some other check to confirm that what they expect to be mapped
> actually is, or otherwise treat the walk as best-effort.
>
> Among the 4 functions changed in this patch, folio_referenced() is the
> only core-mm function, and is also frequently called. To assess the
> impact of locking non-KSM anon folios in the
> shrink_active_list()->folio_referenced() path, we performed an app cycle
> test on an arm64 Android device. Over the whole duration of the test
> there were over 140k invocations of shrink_active_list(), out of which
> over 29k had at least one non-KSM anon folio on which folio_referenced()
> was called. folio_trylock() did not fail in any of these invocations.
Yeah, that's the expectation: it should be a rare event.
>
> Of course, we now take a lock where we previously wouldn't have. In the
> past this would have had a major impact by causing a CoW write fault to
> copy a page in do_wp_page(), as commit 09854ba94c6a ("mm: do_wp_page()
> simplification") made a failure to obtain the folio lock result in a
> page copy even when one wasn't necessary.
>
> However, since commit 6c287605fd56 ("mm: remember exclusively mapped
> anonymous pages with PG_anon_exclusive"), and the introduction of the
> folio anon exclusive flag, this issue is significantly mitigated.
>
> The only case remaining that we might worry about from this perspective
> is that of read-only folios immediately after fork where the anon
> exclusive bit will not have been set yet.
>
> We note however in the case of read-only just-forked folios that
> wp_can_reuse_anon_folio() will notice the raised reference count
> established by shrink_active_list() via isolate_lru_folios() and refuse
> to reuse in any case, so this will in fact have no impact - the folio
> lock is ultimately immaterial here.
>
> All-in-all it appears that there is little opportunity for meaningful
> negative impact from this change.
Yes, that's a good summary now, thanks a bunch!
[...]
>
> static ssize_t page_idle_bitmap_read(struct file *file, struct kobject *kobj,
> diff --git a/mm/rmap.c b/mm/rmap.c
> index 0bc7cf8b7359..fd9f18670440 100644
> --- a/mm/rmap.c
> +++ b/mm/rmap.c
> @@ -489,17 +489,15 @@ void __init anon_vma_init(void)
> * if there is a mapcount, we can dereference the anon_vma after observing
> * those.
> *
> - * NOTE: the caller should normally hold folio lock when calling this. If
> - * not, the caller needs to double check the anon_vma didn't change after
> - * taking the anon_vma lock for either read or write (UFFDIO_MOVE can modify it
> - * concurrently without folio lock protection). See folio_lock_anon_vma_read()
> - * which has already covered that, and comment above remap_pages().
> + * NOTE: the caller should hold folio lock when calling this.
> */
> struct anon_vma *folio_get_anon_vma(const struct folio *folio)
> {
> struct anon_vma *anon_vma = NULL;
> unsigned long anon_mapping;
>
> + VM_WARN_ON_FOLIO(!folio_test_locked(folio), folio);
> +
As raised in v1, I am not sure about device-private dax folios that are
anonymous, where we could end up here through
memory_failure_dev_pagemap() without having the folio locked.
If so, we might have to throw in an actual folio lock somewhere on that path.
--
Cheers
David / dhildenb
* Re: [PATCH v2 2/2] mm/userfaultfd: don't lock anon_vma when performing UFFDIO_MOVE
2025-09-23 7:10 ` [PATCH v2 2/2] mm/userfaultfd: don't lock anon_vma when performing UFFDIO_MOVE Lokesh Gidra
@ 2025-09-24 10:07 ` David Hildenbrand
2025-11-03 17:52 ` Lorenzo Stoakes
1 sibling, 0 replies; 9+ messages in thread
From: David Hildenbrand @ 2025-09-24 10:07 UTC (permalink / raw)
To: Lokesh Gidra, akpm
Cc: linux-mm, kaleshsingh, ngeoffray, jannh, Lorenzo Stoakes,
Peter Xu, Suren Baghdasaryan, Barry Song
On 23.09.25 09:10, Lokesh Gidra wrote:
> Now that rmap_walk() is guaranteed to be called with the folio lock
> held, we can stop serializing on the src VMA anon_vma lock when moving
> an exclusive folio from a src VMA to a dst VMA in UFFDIO_MOVE ioctl.
>
> When moving a folio, we modify folio->mapping through
> folio_move_anon_rmap() and adjust folio->index accordingly. Doing that
> while concurrent rmap walks could be in progress would be dangerous.
> Therefore, to avoid that, we had to acquire the src VMA's anon_vma lock
> in write mode. That meant that when multiple threads called UFFDIO_MOVE
> concurrently on distinct pages of the same src VMA, they would serialize
> on it, hurting scalability.
>
> In addition to avoiding the scalability bottleneck, this patch also
> simplifies the complicated lock dance that UFFDIO_MOVE has to go through
> between RCU, folio-lock, ptl, and anon_vma.
>
> folio_move_anon_rmap() already enforces that the folio is locked. So
> once we hold the folio lock, we can no longer race with a concurrent
> rmap_walk() from folio_referenced() and the other walkers that used to
> run it on unlocked non-KSM anon folios, and therefore the anon_vma lock
> is no longer required.
>
> Note that this handling is now the same as for other
> folio_move_anon_rmap() users that also do not hold the anon_vma lock --
> namely COW reuse handling (do_wp_page()->wp_can_reuse_anon_folio(),
> do_huge_pmd_wp_page(), and hugetlb_wp()). These users never required the
> anon_vma lock, as they only move the folio's anon_vma closer to the
> VMA's anon_vma leaf, for example from an anon_vma root to a leaf of that
> root. rmap walks were always able to tolerate that scenario.
>
> CC: David Hildenbrand <david@redhat.com>
> CC: Lorenzo Stoakes <lorenzo.stoakes@oracle.com>
> CC: Peter Xu <peterx@redhat.com>
> CC: Suren Baghdasaryan <surenb@google.com>
> CC: Barry Song <baohua@kernel.org>
> Signed-off-by: Lokesh Gidra <lokeshgidra@google.com>
> ---
Acked-by: David Hildenbrand <david@redhat.com>
--
Cheers
David / dhildenb
* Re: [PATCH v2 1/2] mm: always call rmap_walk() on locked folios
2025-09-24 10:06 ` David Hildenbrand
@ 2025-10-02 7:56 ` David Hildenbrand
0 siblings, 0 replies; 9+ messages in thread
From: David Hildenbrand @ 2025-10-02 7:56 UTC (permalink / raw)
To: Lokesh Gidra, akpm
Cc: linux-mm, kaleshsingh, ngeoffray, jannh, Lorenzo Stoakes,
Harry Yoo, Peter Xu, Suren Baghdasaryan, Barry Song,
SeongJae Park
>>
>> static ssize_t page_idle_bitmap_read(struct file *file, struct kobject *kobj,
>> diff --git a/mm/rmap.c b/mm/rmap.c
>> index 0bc7cf8b7359..fd9f18670440 100644
>> --- a/mm/rmap.c
>> +++ b/mm/rmap.c
>> @@ -489,17 +489,15 @@ void __init anon_vma_init(void)
>> * if there is a mapcount, we can dereference the anon_vma after observing
>> * those.
>> *
>> - * NOTE: the caller should normally hold folio lock when calling this. If
>> - * not, the caller needs to double check the anon_vma didn't change after
>> - * taking the anon_vma lock for either read or write (UFFDIO_MOVE can modify it
>> - * concurrently without folio lock protection). See folio_lock_anon_vma_read()
>> - * which has already covered that, and comment above remap_pages().
>> + * NOTE: the caller should hold folio lock when calling this.
>> */
>> struct anon_vma *folio_get_anon_vma(const struct folio *folio)
>> {
>> struct anon_vma *anon_vma = NULL;
>> unsigned long anon_mapping;
>>
>> + VM_WARN_ON_FOLIO(!folio_test_locked(folio), folio);
>> +
>
> As raised in v1, I am not sure about device-private dax folios that are
> anonymous, where we could end up here through
> memory_failure_dev_pagemap() without having the folio locked.
As discussed in the v1 thread, we should be bailing out for
device_private folios earlier, and IIUC, these are the only ones that
can be anon. Likely there is still something shaky with anon folios
earlier on that path, but that's not related to this series.
So let's see if this patch will reveal any unexpected surprises.
Acked-by: David Hildenbrand <david@redhat.com>
--
Cheers
David / dhildenb
* Re: [PATCH v2 0/2] Improve UFFDIO_MOVE scalability by removing anon_vma lock
2025-09-23 7:10 [PATCH v2 0/2] Improve UFFDIO_MOVE scalability by removing anon_vma lock Lokesh Gidra
2025-09-23 7:10 ` [PATCH v2 1/2] mm: always call rmap_walk() on locked folios Lokesh Gidra
2025-09-23 7:10 ` [PATCH v2 2/2] mm/userfaultfd: don't lock anon_vma when performing UFFDIO_MOVE Lokesh Gidra
@ 2025-10-03 23:03 ` Peter Xu
2 siblings, 0 replies; 9+ messages in thread
From: Peter Xu @ 2025-10-03 23:03 UTC (permalink / raw)
To: Lokesh Gidra
Cc: akpm, linux-mm, kaleshsingh, ngeoffray, jannh, David Hildenbrand,
Lorenzo Stoakes, Harry Yoo, Suren Baghdasaryan, Barry Song,
SeongJae Park
On Tue, Sep 23, 2025 at 12:10:17AM -0700, Lokesh Gidra wrote:
> Userfaultfd has a scalability issue in its UFFDIO_MOVE ioctl, which is
> heavily used on Android, where the Java garbage collector relies on it
> for concurrent heap compaction.
>
> The issue arises because UFFDIO_MOVE updates folio->mapping to an
> anon_vma with a different root in order to move the folio from a src
> VMA to a dst VMA. It performs the operation with the folio locked, but
> this is insufficient, because rmap_walk() can be performed on non-KSM
> anonymous folios without the folio lock.
>
> This means that UFFDIO_MOVE has to acquire the anon_vma write lock
> of the root anon_vma belonging to the folio it wishes to move.
>
> This causes a scalability bottleneck when multiple threads perform
> UFFDIO_MOVE simultaneously on distinct pages of the same src VMA. In
> field traces from arm64 Android devices, we have observed janky user
> interactions due to long (sometimes over ~50ms) uninterruptible
> sleeps on the main UI thread caused by anon_vma lock contention in
> UFFDIO_MOVE. This is particularly severe at the beginning of the
> GC's compaction phase, when multiple threads are likely to be
> involved.
>
> This series resolves the issue by removing the exception in rmap_walk()
> for non-KSM anon folios, ensuring that all folios are locked during the
> rmap walk. This is less problematic than it might seem, as the only
> major caller which utilises this mode is shrink_active_list(), which is
> covered in detail in the first patch of this series.
>
> As a result of changing our approach to locking, we can remove all
> the code that took steps to acquire an anon_vma write lock instead
> of a folio lock. This results in a significant simplification and
> scalability improvement of the code (currently only in UFFDIO_MOVE).
> Furthermore, as a side-effect, folio_lock_anon_vma_read() gets simpler
> as we don't need to worry that folio->mapping may have changed under us.
>
> Prior discussions on this can be found at [1, 2].
>
> Changes since v1 [3]:
> 1. Move relevant parts of cover letter description to first patch, per
> David Hildenbrand.
> 2. Enumerate all callers of rmap_walk(), folio_lock_anon_vma_read(), and
> folio_get_anon_vma(), per Lorenzo Stoakes.
> 3. Make other corrections/improvements to commit message, per Lorenzo
> Stoakes.
>
> [1] https://lore.kernel.org/all/CA+EESO4Z6wtX7ZMdDHQRe5jAAS_bQ-POq5+4aDx5jh2DvY6UHg@mail.gmail.com/
> [2] https://lore.kernel.org/all/20250908044950.311548-1-lokeshgidra@google.com/
> [3] https://lore.kernel.org/all/20250918055135.2881413-1-lokeshgidra@google.com/
>
> Lokesh Gidra (2):
> mm: always call rmap_walk() on locked folios
> mm/userfaultfd: don't lock anon_vma when performing UFFDIO_MOVE
>
> CC: David Hildenbrand <david@redhat.com>
> CC: Lorenzo Stoakes <lorenzo.stoakes@oracle.com>
> CC: Harry Yoo <harry.yoo@oracle.com>
> CC: Peter Xu <peterx@redhat.com>
> CC: Suren Baghdasaryan <surenb@google.com>
> CC: Barry Song <baohua@kernel.org>
> CC: SeongJae Park <sj@kernel.org>
FWIW:
Acked-by: Peter Xu <peterx@redhat.com>
--
Peter Xu
* Re: [PATCH v2 1/2] mm: always call rmap_walk() on locked folios
2025-09-23 7:10 ` [PATCH v2 1/2] mm: always call rmap_walk() on locked folios Lokesh Gidra
2025-09-24 10:06 ` David Hildenbrand
@ 2025-11-03 17:51 ` Lorenzo Stoakes
1 sibling, 0 replies; 9+ messages in thread
From: Lorenzo Stoakes @ 2025-11-03 17:51 UTC (permalink / raw)
To: Lokesh Gidra
Cc: akpm, linux-mm, kaleshsingh, ngeoffray, jannh, David Hildenbrand,
Harry Yoo, Peter Xu, Suren Baghdasaryan, Barry Song,
SeongJae Park
Apologies for late review, I somehow missed this...!
On Tue, Sep 23, 2025 at 12:10:18AM -0700, Lokesh Gidra wrote:
> Guarantee that rmap_walk() is called on locked folios so that threads
> changing folio->mapping and folio->index for non-KSM anon folios can
> serialize on the fine-grained folio lock rather than on the anon_vma
> lock. Other folio types are already always locked before rmap_walk().
> With this, we go from 'not necessarily' locking the non-KSM anon folio
> to 'definitely' locking it during rmap walks.
>
> This patch is in preparation for removing anon_vma write-lock from
> UFFDIO_MOVE.
>
> With this patch, three functions are now expected to be called with a
> locked folio. To be careful not to miss any case, here is the exhaustive
> list of all their callers.
>
> 1) rmap_walk() is called from:
>
> a) folio_referenced()
> b) damon_folio_mkold()
> c) damon_folio_young()
> d) page_idle_clear_pte_refs()
> e) try_to_unmap()
> f) try_to_migrate()
> g) folio_mkclean()
> h) remove_migration_ptes()
>
> In the above list, the first four are changed in this patch to try-lock
> non-KSM anon folios, as is already done for other folio types. The
> remaining functions in the list already hold the folio lock when calling
> rmap_walk().
>
> 2) folio_lock_anon_vma_read() is called from the following functions:
>
> a) collect_procs_anon()
> b) page_idle_clear_pte_refs()
> c) damon_folio_mkold()
> d) damon_folio_young()
> e) folio_referenced()
> f) try_to_unmap()
> g) try_to_migrate()
>
> All the functions in the above list, except collect_procs_anon(), are
> covered by the rmap_walk() list above. For collect_procs_anon(), this
> patch changes kill_procs_now() to take the folio lock, which ensures
> that all callers of folio_lock_anon_vma_read() now hold the lock.
>
> 3) folio_get_anon_vma() is called from the following functions, all of
> which already hold the folio lock:
>
> a) move_pages_huge_pmd()
> b) __folio_split()
> c) move_pages_ptes()
> d) migrate_folio_unmap()
> e) unmap_and_move_huge_page()
>
> Functionally, this patch doesn't break the logic: rmap walkers generally
> perform some other check to confirm that what they expect to be mapped
> actually is, or otherwise treat the walk as best-effort.
>
> Among the 4 functions changed in this patch, folio_referenced() is the
> only core-mm function, and is also frequently called. To assess the
> impact of locking non-KSM anon folios in the
> shrink_active_list()->folio_referenced() path, we performed an app cycle
> test on an arm64 Android device. Over the whole duration of the test
> there were over 140k invocations of shrink_active_list(), out of which
> over 29k had at least one non-KSM anon folio on which folio_referenced()
> was called. folio_trylock() did not fail in any of these invocations.
>
> Of course, we now take a lock where we previously wouldn't have. In the
> past this would have had a major impact by causing a CoW write fault to
> copy a page in do_wp_page(), as commit 09854ba94c6a ("mm: do_wp_page()
> simplification") made a failure to obtain the folio lock result in a
> page copy even when one wasn't necessary.
>
> However, since commit 6c287605fd56 ("mm: remember exclusively mapped
> anonymous pages with PG_anon_exclusive"), and the introduction of the
> folio anon exclusive flag, this issue is significantly mitigated.
Thanks this is great!
>
> The only case remaining that we might worry about from this perspective
> is that of read-only folios immediately after fork where the anon
> exclusive bit will not have been set yet.
>
> We note however in the case of read-only just-forked folios that
> wp_can_reuse_anon_folio() will notice the raised reference count
> established by shrink_active_list() via isolate_lru_folios() and refuse
> to reuse in any case, so this will in fact have no impact - the folio
> lock is ultimately immaterial here.
Great!
>
> All-in-all it appears that there is little opportunity for meaningful
> negative impact from this change.
Thanks.
>
> CC: David Hildenbrand <david@redhat.com>
> CC: Lorenzo Stoakes <lorenzo.stoakes@oracle.com>
> CC: Harry Yoo <harry.yoo@oracle.com>
> CC: Peter Xu <peterx@redhat.com>
> CC: Suren Baghdasaryan <surenb@google.com>
> CC: Barry Song <baohua@kernel.org>
> CC: SeongJae Park <sj@kernel.org>
> Signed-off-by: Lokesh Gidra <lokeshgidra@google.com>
LGTM, so:
Reviewed-by: Lorenzo Stoakes <lorenzo.stoakes@oracle.com>
> ---
> mm/damon/ops-common.c | 16 ++++------------
> mm/memory-failure.c | 3 +++
> mm/page_idle.c | 8 ++------
> mm/rmap.c | 42 ++++++++++++------------------------------
> 4 files changed, 21 insertions(+), 48 deletions(-)
>
> diff --git a/mm/damon/ops-common.c b/mm/damon/ops-common.c
> index 998c5180a603..f61d6dde13dc 100644
> --- a/mm/damon/ops-common.c
> +++ b/mm/damon/ops-common.c
> @@ -162,21 +162,17 @@ void damon_folio_mkold(struct folio *folio)
> .rmap_one = damon_folio_mkold_one,
> .anon_lock = folio_lock_anon_vma_read,
> };
> - bool need_lock;
>
> if (!folio_mapped(folio) || !folio_raw_mapping(folio)) {
> folio_set_idle(folio);
> return;
> }
>
> - need_lock = !folio_test_anon(folio) || folio_test_ksm(folio);
> - if (need_lock && !folio_trylock(folio))
> + if (!folio_trylock(folio))
> return;
>
> rmap_walk(folio, &rwc);
> -
> - if (need_lock)
> - folio_unlock(folio);
> + folio_unlock(folio);
>
> }
>
> @@ -228,7 +224,6 @@ bool damon_folio_young(struct folio *folio)
> .rmap_one = damon_folio_young_one,
> .anon_lock = folio_lock_anon_vma_read,
> };
> - bool need_lock;
>
> if (!folio_mapped(folio) || !folio_raw_mapping(folio)) {
> if (folio_test_idle(folio))
> @@ -237,14 +232,11 @@ bool damon_folio_young(struct folio *folio)
> return true;
> }
>
> - need_lock = !folio_test_anon(folio) || folio_test_ksm(folio);
> - if (need_lock && !folio_trylock(folio))
> + if (!folio_trylock(folio))
> return false;
>
> rmap_walk(folio, &rwc);
> -
> - if (need_lock)
> - folio_unlock(folio);
> + folio_unlock(folio);
>
> return accessed;
> }
> diff --git a/mm/memory-failure.c b/mm/memory-failure.c
> index a24806bb8e82..f698df156bf8 100644
> --- a/mm/memory-failure.c
> +++ b/mm/memory-failure.c
> @@ -2143,7 +2143,10 @@ static void kill_procs_now(struct page *p, unsigned long pfn, int flags,
> {
> LIST_HEAD(tokill);
>
> + folio_lock(folio);
> collect_procs(folio, p, &tokill, flags & MF_ACTION_REQUIRED);
> + folio_unlock(folio);
> +
> kill_procs(&tokill, true, pfn, flags);
> }
>
> diff --git a/mm/page_idle.c b/mm/page_idle.c
> index a82b340dc204..9bf573d22e87 100644
> --- a/mm/page_idle.c
> +++ b/mm/page_idle.c
> @@ -101,19 +101,15 @@ static void page_idle_clear_pte_refs(struct folio *folio)
> .rmap_one = page_idle_clear_pte_refs_one,
> .anon_lock = folio_lock_anon_vma_read,
> };
> - bool need_lock;
>
> if (!folio_mapped(folio) || !folio_raw_mapping(folio))
> return;
>
> - need_lock = !folio_test_anon(folio) || folio_test_ksm(folio);
> - if (need_lock && !folio_trylock(folio))
> + if (!folio_trylock(folio))
> return;
>
> rmap_walk(folio, &rwc);
> -
> - if (need_lock)
> - folio_unlock(folio);
> + folio_unlock(folio);
> }
>
> static ssize_t page_idle_bitmap_read(struct file *file, struct kobject *kobj,
> diff --git a/mm/rmap.c b/mm/rmap.c
> index 0bc7cf8b7359..fd9f18670440 100644
> --- a/mm/rmap.c
> +++ b/mm/rmap.c
> @@ -489,17 +489,15 @@ void __init anon_vma_init(void)
> * if there is a mapcount, we can dereference the anon_vma after observing
> * those.
> *
> - * NOTE: the caller should normally hold folio lock when calling this. If
> - * not, the caller needs to double check the anon_vma didn't change after
> - * taking the anon_vma lock for either read or write (UFFDIO_MOVE can modify it
> - * concurrently without folio lock protection). See folio_lock_anon_vma_read()
> - * which has already covered that, and comment above remap_pages().
> + * NOTE: the caller should hold folio lock when calling this.
> */
> struct anon_vma *folio_get_anon_vma(const struct folio *folio)
> {
> struct anon_vma *anon_vma = NULL;
> unsigned long anon_mapping;
>
> + VM_WARN_ON_FOLIO(!folio_test_locked(folio), folio);
> +
> rcu_read_lock();
> anon_mapping = (unsigned long)READ_ONCE(folio->mapping);
> if ((anon_mapping & FOLIO_MAPPING_FLAGS) != FOLIO_MAPPING_ANON)
> @@ -546,7 +544,8 @@ struct anon_vma *folio_lock_anon_vma_read(const struct folio *folio,
> struct anon_vma *root_anon_vma;
> unsigned long anon_mapping;
>
> -retry:
> + VM_WARN_ON_FOLIO(!folio_test_locked(folio), folio);
> +
> rcu_read_lock();
> anon_mapping = (unsigned long)READ_ONCE(folio->mapping);
> if ((anon_mapping & FOLIO_MAPPING_FLAGS) != FOLIO_MAPPING_ANON)
> @@ -557,17 +556,6 @@ struct anon_vma *folio_lock_anon_vma_read(const struct folio *folio,
> anon_vma = (struct anon_vma *) (anon_mapping - FOLIO_MAPPING_ANON);
> root_anon_vma = READ_ONCE(anon_vma->root);
> if (down_read_trylock(&root_anon_vma->rwsem)) {
> - /*
> - * folio_move_anon_rmap() might have changed the anon_vma as we
> - * might not hold the folio lock here.
> - */
> - if (unlikely((unsigned long)READ_ONCE(folio->mapping) !=
> - anon_mapping)) {
> - up_read(&root_anon_vma->rwsem);
> - rcu_read_unlock();
> - goto retry;
> - }
> -
> /*
> * If the folio is still mapped, then this anon_vma is still
> * its anon_vma, and holding the mutex ensures that it will
> @@ -602,18 +590,6 @@ struct anon_vma *folio_lock_anon_vma_read(const struct folio *folio,
> rcu_read_unlock();
> anon_vma_lock_read(anon_vma);
>
> - /*
> - * folio_move_anon_rmap() might have changed the anon_vma as we might
> - * not hold the folio lock here.
> - */
> - if (unlikely((unsigned long)READ_ONCE(folio->mapping) !=
> - anon_mapping)) {
> - anon_vma_unlock_read(anon_vma);
> - put_anon_vma(anon_vma);
> - anon_vma = NULL;
> - goto retry;
> - }
> -
> if (atomic_dec_and_test(&anon_vma->refcount)) {
> /*
> * Oops, we held the last refcount, release the lock
> @@ -988,7 +964,7 @@ int folio_referenced(struct folio *folio, int is_locked,
> if (!folio_raw_mapping(folio))
> return 0;
>
> - if (!is_locked && (!folio_test_anon(folio) || folio_test_ksm(folio))) {
> + if (!is_locked) {
> we_locked = folio_trylock(folio);
> if (!we_locked)
> return 1;
> @@ -2820,6 +2796,12 @@ static void rmap_walk_anon(struct folio *folio,
> pgoff_t pgoff_start, pgoff_end;
> struct anon_vma_chain *avc;
>
> + /*
> + * The folio lock ensures that folio->mapping can't be changed under us
> + * to an anon_vma with different root.
> + */
> + VM_WARN_ON_FOLIO(!folio_test_locked(folio), folio);
> +
> if (locked) {
> anon_vma = folio_anon_vma(folio);
> /* anon_vma disappear under us? */
> --
> 2.51.0.534.gc79095c0ca-goog
>
>
* Re: [PATCH v2 2/2] mm/userfaultfd: don't lock anon_vma when performing UFFDIO_MOVE
2025-09-23 7:10 ` [PATCH v2 2/2] mm/userfaultfd: don't lock anon_vma when performing UFFDIO_MOVE Lokesh Gidra
2025-09-24 10:07 ` David Hildenbrand
@ 2025-11-03 17:52 ` Lorenzo Stoakes
1 sibling, 0 replies; 9+ messages in thread
From: Lorenzo Stoakes @ 2025-11-03 17:52 UTC (permalink / raw)
To: Lokesh Gidra
Cc: akpm, linux-mm, kaleshsingh, ngeoffray, jannh, David Hildenbrand,
Peter Xu, Suren Baghdasaryan, Barry Song
On Tue, Sep 23, 2025 at 12:10:19AM -0700, Lokesh Gidra wrote:
> Now that rmap_walk() is guaranteed to be called with the folio lock
> held, we can stop serializing on the src VMA anon_vma lock when moving
> an exclusive folio from a src VMA to a dst VMA in UFFDIO_MOVE ioctl.
>
> When moving a folio, we modify folio->mapping through
> folio_move_anon_rmap() and adjust folio->index accordingly. Doing that
> while concurrent rmap walks could be in progress would be dangerous.
> Therefore, to avoid that, we had to acquire the src VMA's anon_vma lock
> in write mode. That meant that when multiple threads called UFFDIO_MOVE
> concurrently on distinct pages of the same src VMA, they would serialize
> on it, hurting scalability.
>
> In addition to avoiding the scalability bottleneck, this patch also
> simplifies the complicated lock dance that UFFDIO_MOVE has to go through
> between RCU, folio-lock, ptl, and anon_vma.
>
> folio_move_anon_rmap() already enforces that the folio is locked. So
> once we hold the folio lock, we can no longer race with a concurrent
> rmap_walk() from folio_referenced() and the other walkers that used to
> run it on unlocked non-KSM anon folios, and therefore the anon_vma lock
> is no longer required.
>
> Note that this handling is now the same as for other
> folio_move_anon_rmap() users that also do not hold the anon_vma lock --
> namely COW reuse handling (do_wp_page()->wp_can_reuse_anon_folio(),
> do_huge_pmd_wp_page(), and hugetlb_wp()). These users never required the
> anon_vma lock, as they only move the folio's anon_vma closer to the
> VMA's anon_vma leaf, for example from an anon_vma root to a leaf of that
> root. rmap walks were always able to tolerate that scenario.
>
> CC: David Hildenbrand <david@redhat.com>
> CC: Lorenzo Stoakes <lorenzo.stoakes@oracle.com>
> CC: Peter Xu <peterx@redhat.com>
> CC: Suren Baghdasaryan <surenb@google.com>
> CC: Barry Song <baohua@kernel.org>
> Signed-off-by: Lokesh Gidra <lokeshgidra@google.com>
LGTM, so:
Reviewed-by: Lorenzo Stoakes <lorenzo.stoakes@oracle.com>
> ---
> mm/huge_memory.c | 22 +----------------
> mm/userfaultfd.c | 62 +++++++++---------------------------------------
> 2 files changed, 12 insertions(+), 72 deletions(-)
>
> diff --git a/mm/huge_memory.c b/mm/huge_memory.c
> index 1b81680b4225..a16e3778b544 100644
> --- a/mm/huge_memory.c
> +++ b/mm/huge_memory.c
> @@ -2533,7 +2533,6 @@ int move_pages_huge_pmd(struct mm_struct *mm, pmd_t *dst_pmd, pmd_t *src_pmd, pm
> pmd_t _dst_pmd, src_pmdval;
> struct page *src_page;
> struct folio *src_folio;
> - struct anon_vma *src_anon_vma;
> spinlock_t *src_ptl, *dst_ptl;
> pgtable_t src_pgtable;
> struct mmu_notifier_range range;
> @@ -2582,23 +2581,9 @@ int move_pages_huge_pmd(struct mm_struct *mm, pmd_t *dst_pmd, pmd_t *src_pmd, pm
> src_addr + HPAGE_PMD_SIZE);
> mmu_notifier_invalidate_range_start(&range);
>
> - if (src_folio) {
> + if (src_folio)
> folio_lock(src_folio);
>
> - /*
> - * split_huge_page walks the anon_vma chain without the page
> - * lock. Serialize against it with the anon_vma lock, the page
> - * lock is not enough.
> - */
> - src_anon_vma = folio_get_anon_vma(src_folio);
> - if (!src_anon_vma) {
> - err = -EAGAIN;
> - goto unlock_folio;
> - }
> - anon_vma_lock_write(src_anon_vma);
> - } else
> - src_anon_vma = NULL;
> -
> dst_ptl = pmd_lockptr(mm, dst_pmd);
> double_pt_lock(src_ptl, dst_ptl);
> if (unlikely(!pmd_same(*src_pmd, src_pmdval) ||
> @@ -2643,11 +2628,6 @@ int move_pages_huge_pmd(struct mm_struct *mm, pmd_t *dst_pmd, pmd_t *src_pmd, pm
> pgtable_trans_huge_deposit(mm, dst_pmd, src_pgtable);
> unlock_ptls:
> double_pt_unlock(src_ptl, dst_ptl);
> - if (src_anon_vma) {
> - anon_vma_unlock_write(src_anon_vma);
> - put_anon_vma(src_anon_vma);
> - }
> -unlock_folio:
> /* unblock rmap walks */
> if (src_folio)
> folio_unlock(src_folio);
> diff --git a/mm/userfaultfd.c b/mm/userfaultfd.c
> index af61b95c89e4..6be65089085e 100644
> --- a/mm/userfaultfd.c
> +++ b/mm/userfaultfd.c
> @@ -1035,8 +1035,7 @@ static inline bool is_pte_pages_stable(pte_t *dst_pte, pte_t *src_pte,
> */
> static struct folio *check_ptes_for_batched_move(struct vm_area_struct *src_vma,
> unsigned long src_addr,
> - pte_t *src_pte, pte_t *dst_pte,
> - struct anon_vma *src_anon_vma)
> + pte_t *src_pte, pte_t *dst_pte)
> {
> pte_t orig_dst_pte, orig_src_pte;
> struct folio *folio;
> @@ -1052,8 +1051,7 @@ static struct folio *check_ptes_for_batched_move(struct vm_area_struct *src_vma,
> folio = vm_normal_folio(src_vma, src_addr, orig_src_pte);
> if (!folio || !folio_trylock(folio))
> return NULL;
> - if (!PageAnonExclusive(&folio->page) || folio_test_large(folio) ||
> - folio_anon_vma(folio) != src_anon_vma) {
> + if (!PageAnonExclusive(&folio->page) || folio_test_large(folio)) {
> folio_unlock(folio);
> return NULL;
> }
> @@ -1061,9 +1059,8 @@ static struct folio *check_ptes_for_batched_move(struct vm_area_struct *src_vma,
> }
>
> /*
> - * Moves src folios to dst in a batch as long as they share the same
> - * anon_vma as the first folio, are not large, and can successfully
> - * take the lock via folio_trylock().
> + * Moves src folios to dst in a batch as long as they are not large, and can
> + * successfully take the lock via folio_trylock().
> */
> static long move_present_ptes(struct mm_struct *mm,
> struct vm_area_struct *dst_vma,
> @@ -1073,8 +1070,7 @@ static long move_present_ptes(struct mm_struct *mm,
> pte_t orig_dst_pte, pte_t orig_src_pte,
> pmd_t *dst_pmd, pmd_t dst_pmdval,
> spinlock_t *dst_ptl, spinlock_t *src_ptl,
> - struct folio **first_src_folio, unsigned long len,
> - struct anon_vma *src_anon_vma)
> + struct folio **first_src_folio, unsigned long len)
> {
> int err = 0;
> struct folio *src_folio = *first_src_folio;
> @@ -1132,8 +1128,8 @@ static long move_present_ptes(struct mm_struct *mm,
> src_pte++;
>
> folio_unlock(src_folio);
> - src_folio = check_ptes_for_batched_move(src_vma, src_addr, src_pte,
> - dst_pte, src_anon_vma);
> + src_folio = check_ptes_for_batched_move(src_vma, src_addr,
> + src_pte, dst_pte);
> if (!src_folio)
> break;
> }
> @@ -1263,7 +1259,6 @@ static long move_pages_ptes(struct mm_struct *mm, pmd_t *dst_pmd, pmd_t *src_pmd
> pmd_t dummy_pmdval;
> pmd_t dst_pmdval;
> struct folio *src_folio = NULL;
> - struct anon_vma *src_anon_vma = NULL;
> struct mmu_notifier_range range;
> long ret = 0;
>
> @@ -1347,9 +1342,9 @@ static long move_pages_ptes(struct mm_struct *mm, pmd_t *dst_pmd, pmd_t *src_pmd
> }
>
> /*
> - * Pin and lock both source folio and anon_vma. Since we are in
> - * RCU read section, we can't block, so on contention have to
> - * unmap the ptes, obtain the lock and retry.
> + * Pin and lock source folio. Since we are in RCU read section,
> + * we can't block, so on contention have to unmap the ptes,
> + * obtain the lock and retry.
> */
> if (!src_folio) {
> struct folio *folio;
> @@ -1423,33 +1418,11 @@ static long move_pages_ptes(struct mm_struct *mm, pmd_t *dst_pmd, pmd_t *src_pmd
> goto retry;
> }
>
> - if (!src_anon_vma) {
> - /*
> - * folio_referenced walks the anon_vma chain
> - * without the folio lock. Serialize against it with
> - * the anon_vma lock, the folio lock is not enough.
> - */
> - src_anon_vma = folio_get_anon_vma(src_folio);
> - if (!src_anon_vma) {
> - /* page was unmapped from under us */
> - ret = -EAGAIN;
> - goto out;
> - }
> - if (!anon_vma_trylock_write(src_anon_vma)) {
> - pte_unmap(src_pte);
> - pte_unmap(dst_pte);
> - src_pte = dst_pte = NULL;
> - /* now we can block and wait */
> - anon_vma_lock_write(src_anon_vma);
> - goto retry;
> - }
> - }
> -
> ret = move_present_ptes(mm, dst_vma, src_vma,
> dst_addr, src_addr, dst_pte, src_pte,
> orig_dst_pte, orig_src_pte, dst_pmd,
> dst_pmdval, dst_ptl, src_ptl, &src_folio,
> - len, src_anon_vma);
> + len);
> } else {
> struct folio *folio = NULL;
>
> @@ -1515,10 +1488,6 @@ static long move_pages_ptes(struct mm_struct *mm, pmd_t *dst_pmd, pmd_t *src_pmd
> }
>
> out:
> - if (src_anon_vma) {
> - anon_vma_unlock_write(src_anon_vma);
> - put_anon_vma(src_anon_vma);
> - }
> if (src_folio) {
> folio_unlock(src_folio);
> folio_put(src_folio);
> @@ -1792,15 +1761,6 @@ static void uffd_move_unlock(struct vm_area_struct *dst_vma,
> * virtual regions without knowing if there are transparent hugepage
> * in the regions or not, but preventing the risk of having to split
> * the hugepmd during the remap.
> - *
> - * If there's any rmap walk that is taking the anon_vma locks without
> - * first obtaining the folio lock (the only current instance is
> - * folio_referenced), they will have to verify if the folio->mapping
> - * has changed after taking the anon_vma lock. If it changed they
> - * should release the lock and retry obtaining a new anon_vma, because
> - * it means the anon_vma was changed by move_pages() before the lock
> - * could be obtained. This is the only additional complexity added to
> - * the rmap code to provide this anonymous page remapping functionality.
> */
> ssize_t move_pages(struct userfaultfd_ctx *ctx, unsigned long dst_start,
> unsigned long src_start, unsigned long len, __u64 mode)
> --
> 2.51.0.534.gc79095c0ca-goog
>
>