From: Lorenzo Stoakes <lorenzo.stoakes@oracle.com>
To: Lokesh Gidra <lokeshgidra@google.com>
Cc: akpm@linux-foundation.org, linux-mm@kvack.org,
kaleshsingh@google.com, ngeoffray@google.com, jannh@google.com,
David Hildenbrand <david@redhat.com>,
Peter Xu <peterx@redhat.com>,
Suren Baghdasaryan <surenb@google.com>,
Barry Song <baohua@kernel.org>
Subject: Re: [PATCH v2 2/2] mm/userfaultfd: don't lock anon_vma when performing UFFDIO_MOVE
Date: Mon, 3 Nov 2025 17:52:32 +0000
Message-ID: <4acead1b-6b35-4104-b1a4-c96089c1caa2@lucifer.local>
In-Reply-To: <20250923071019.775806-3-lokeshgidra@google.com>

On Tue, Sep 23, 2025 at 12:10:19AM -0700, Lokesh Gidra wrote:
> Now that rmap_walk() is guaranteed to be called with the folio lock
> held, we can stop serializing on the src VMA anon_vma lock when moving
> an exclusive folio from a src VMA to a dst VMA in UFFDIO_MOVE ioctl.
>
> When moving a folio, we modify folio->mapping through
> folio_move_anon_rmap() and adjust folio->index accordingly. Doing that
> while we could have concurrent RMAP walks would be dangerous. Therefore,
> to avoid that, we had to acquire anon_vma of src VMA in write-mode. That
> meant that when multiple threads called UFFDIO_MOVE concurrently on
> distinct pages of the same src VMA, they would serialize on it, hurting
> scalability.
>
> In addition to avoiding the scalability bottleneck, this patch also
> simplifies the complicated lock dance that UFFDIO_MOVE has to go through
> between RCU, folio-lock, ptl, and anon_vma.
>
> folio_move_anon_rmap() already enforces that the folio is locked. So
> when we have the folio locked we can no longer race with concurrent
> rmap_walk() as used by folio_referenced() and others that used to call
> it on unlocked non-KSM anon folios, and therefore the anon_vma lock is
> no longer required.
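
A minimal user-space sketch of the argument above, not kernel code. It
assumes a single atomic flag can stand in for the folio lock, and every
toy_* name is hypothetical. It only models the mutual exclusion: the mover
may change mapping and index only while holding the lock, and a walker must
take the same lock before reading them, so it can never observe a
half-updated pair.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in for struct folio: only the fields discussed here. */
struct toy_folio {
	atomic_bool locked;	/* models the folio lock */
	const void *mapping;	/* models folio->mapping (the anon_vma) */
	unsigned long index;	/* models folio->index */
};

/* Non-blocking lock attempt; returns true if we now own the lock. */
static bool toy_folio_trylock(struct toy_folio *f)
{
	/* exchange returns the previous value: false means it was free */
	return !atomic_exchange_explicit(&f->locked, true,
					 memory_order_acquire);
}

static void toy_folio_unlock(struct toy_folio *f)
{
	/* unlocking a folio that is not locked is a bug in this model */
	assert(atomic_load_explicit(&f->locked, memory_order_relaxed));
	atomic_store_explicit(&f->locked, false, memory_order_release);
}

/*
 * Mover side: refuses to touch mapping/index unless the caller already
 * holds the lock. This models the "folio must be locked" requirement
 * only; the real function does more than this.
 */
static bool toy_move_anon_rmap(struct toy_folio *f, const void *new_mapping,
			       unsigned long new_index)
{
	if (!atomic_load_explicit(&f->locked, memory_order_relaxed))
		return false;
	f->mapping = new_mapping;
	f->index = new_index;
	return true;
}

/*
 * Walker side: takes the lock before reading, so it sees either the old
 * (mapping, index) pair or the new one, never a mix. Returns false if
 * the lock is currently held by someone else.
 */
static bool toy_rmap_walk(struct toy_folio *f, const void **mapping,
			  unsigned long *index)
{
	if (!toy_folio_trylock(f))
		return false;
	*mapping = f->mapping;
	*index = f->index;
	toy_folio_unlock(f);
	return true;
}
```

In this model a walk attempted while the mover holds the lock is refused,
and a walk after the unlock sees the new mapping and index together. No
second lock is needed for that guarantee.
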
>
> Note that this handling is now the same as for other
> folio_move_anon_rmap() users that also do not hold the anon_vma lock --
> namely COW reuse handling (do_wp_page()->wp_can_reuse_anon_folio(),
> do_huge_pmd_wp_page(), and hugetlb_wp()). These users never required the
> anon_vma lock as they are only moving the folio's anon_vma closer to the
> anon_vma leaf of the VMA, for example, from an anon_vma root to a leaf of that root.
> rmap walks were always able to tolerate that scenario.
>
> CC: David Hildenbrand <david@redhat.com>
> CC: Lorenzo Stoakes <lorenzo.stoakes@oracle.com>
> CC: Peter Xu <peterx@redhat.com>
> CC: Suren Baghdasaryan <surenb@google.com>
> CC: Barry Song <baohua@kernel.org>
> Signed-off-by: Lokesh Gidra <lokeshgidra@google.com>

LGTM, so:

Reviewed-by: Lorenzo Stoakes <lorenzo.stoakes@oracle.com>

> ---
> mm/huge_memory.c | 22 +----------------
> mm/userfaultfd.c | 62 +++++++++---------------------------------------
> 2 files changed, 12 insertions(+), 72 deletions(-)
>
> diff --git a/mm/huge_memory.c b/mm/huge_memory.c
> index 1b81680b4225..a16e3778b544 100644
> --- a/mm/huge_memory.c
> +++ b/mm/huge_memory.c
> @@ -2533,7 +2533,6 @@ int move_pages_huge_pmd(struct mm_struct *mm, pmd_t *dst_pmd, pmd_t *src_pmd, pm
> pmd_t _dst_pmd, src_pmdval;
> struct page *src_page;
> struct folio *src_folio;
> - struct anon_vma *src_anon_vma;
> spinlock_t *src_ptl, *dst_ptl;
> pgtable_t src_pgtable;
> struct mmu_notifier_range range;
> @@ -2582,23 +2581,9 @@ int move_pages_huge_pmd(struct mm_struct *mm, pmd_t *dst_pmd, pmd_t *src_pmd, pm
> src_addr + HPAGE_PMD_SIZE);
> mmu_notifier_invalidate_range_start(&range);
>
> - if (src_folio) {
> + if (src_folio)
> folio_lock(src_folio);
>
> - /*
> - * split_huge_page walks the anon_vma chain without the page
> - * lock. Serialize against it with the anon_vma lock, the page
> - * lock is not enough.
> - */
> - src_anon_vma = folio_get_anon_vma(src_folio);
> - if (!src_anon_vma) {
> - err = -EAGAIN;
> - goto unlock_folio;
> - }
> - anon_vma_lock_write(src_anon_vma);
> - } else
> - src_anon_vma = NULL;
> -
> dst_ptl = pmd_lockptr(mm, dst_pmd);
> double_pt_lock(src_ptl, dst_ptl);
> if (unlikely(!pmd_same(*src_pmd, src_pmdval) ||
> @@ -2643,11 +2628,6 @@ int move_pages_huge_pmd(struct mm_struct *mm, pmd_t *dst_pmd, pmd_t *src_pmd, pm
> pgtable_trans_huge_deposit(mm, dst_pmd, src_pgtable);
> unlock_ptls:
> double_pt_unlock(src_ptl, dst_ptl);
> - if (src_anon_vma) {
> - anon_vma_unlock_write(src_anon_vma);
> - put_anon_vma(src_anon_vma);
> - }
> -unlock_folio:
> /* unblock rmap walks */
> if (src_folio)
> folio_unlock(src_folio);
> diff --git a/mm/userfaultfd.c b/mm/userfaultfd.c
> index af61b95c89e4..6be65089085e 100644
> --- a/mm/userfaultfd.c
> +++ b/mm/userfaultfd.c
> @@ -1035,8 +1035,7 @@ static inline bool is_pte_pages_stable(pte_t *dst_pte, pte_t *src_pte,
> */
> static struct folio *check_ptes_for_batched_move(struct vm_area_struct *src_vma,
> unsigned long src_addr,
> - pte_t *src_pte, pte_t *dst_pte,
> - struct anon_vma *src_anon_vma)
> + pte_t *src_pte, pte_t *dst_pte)
> {
> pte_t orig_dst_pte, orig_src_pte;
> struct folio *folio;
> @@ -1052,8 +1051,7 @@ static struct folio *check_ptes_for_batched_move(struct vm_area_struct *src_vma,
> folio = vm_normal_folio(src_vma, src_addr, orig_src_pte);
> if (!folio || !folio_trylock(folio))
> return NULL;
> - if (!PageAnonExclusive(&folio->page) || folio_test_large(folio) ||
> - folio_anon_vma(folio) != src_anon_vma) {
> + if (!PageAnonExclusive(&folio->page) || folio_test_large(folio)) {
> folio_unlock(folio);
> return NULL;
> }
> @@ -1061,9 +1059,8 @@ static struct folio *check_ptes_for_batched_move(struct vm_area_struct *src_vma,
> }
>
> /*
> - * Moves src folios to dst in a batch as long as they share the same
> - * anon_vma as the first folio, are not large, and can successfully
> - * take the lock via folio_trylock().
> + * Moves src folios to dst in a batch as long as they are not large, and can
> + * successfully take the lock via folio_trylock().
> */
> static long move_present_ptes(struct mm_struct *mm,
> struct vm_area_struct *dst_vma,
> @@ -1073,8 +1070,7 @@ static long move_present_ptes(struct mm_struct *mm,
> pte_t orig_dst_pte, pte_t orig_src_pte,
> pmd_t *dst_pmd, pmd_t dst_pmdval,
> spinlock_t *dst_ptl, spinlock_t *src_ptl,
> - struct folio **first_src_folio, unsigned long len,
> - struct anon_vma *src_anon_vma)
> + struct folio **first_src_folio, unsigned long len)
> {
> int err = 0;
> struct folio *src_folio = *first_src_folio;
> @@ -1132,8 +1128,8 @@ static long move_present_ptes(struct mm_struct *mm,
> src_pte++;
>
> folio_unlock(src_folio);
> - src_folio = check_ptes_for_batched_move(src_vma, src_addr, src_pte,
> - dst_pte, src_anon_vma);
> + src_folio = check_ptes_for_batched_move(src_vma, src_addr,
> + src_pte, dst_pte);
> if (!src_folio)
> break;
> }
> @@ -1263,7 +1259,6 @@ static long move_pages_ptes(struct mm_struct *mm, pmd_t *dst_pmd, pmd_t *src_pmd
> pmd_t dummy_pmdval;
> pmd_t dst_pmdval;
> struct folio *src_folio = NULL;
> - struct anon_vma *src_anon_vma = NULL;
> struct mmu_notifier_range range;
> long ret = 0;
>
> @@ -1347,9 +1342,9 @@ static long move_pages_ptes(struct mm_struct *mm, pmd_t *dst_pmd, pmd_t *src_pmd
> }
>
> /*
> - * Pin and lock both source folio and anon_vma. Since we are in
> - * RCU read section, we can't block, so on contention have to
> - * unmap the ptes, obtain the lock and retry.
> + * Pin and lock source folio. Since we are in RCU read section,
> + * we can't block, so on contention have to unmap the ptes,
> + * obtain the lock and retry.
> */
> if (!src_folio) {
> struct folio *folio;
> @@ -1423,33 +1418,11 @@ static long move_pages_ptes(struct mm_struct *mm, pmd_t *dst_pmd, pmd_t *src_pmd
> goto retry;
> }
>
> - if (!src_anon_vma) {
> - /*
> - * folio_referenced walks the anon_vma chain
> - * without the folio lock. Serialize against it with
> - * the anon_vma lock, the folio lock is not enough.
> - */
> - src_anon_vma = folio_get_anon_vma(src_folio);
> - if (!src_anon_vma) {
> - /* page was unmapped from under us */
> - ret = -EAGAIN;
> - goto out;
> - }
> - if (!anon_vma_trylock_write(src_anon_vma)) {
> - pte_unmap(src_pte);
> - pte_unmap(dst_pte);
> - src_pte = dst_pte = NULL;
> - /* now we can block and wait */
> - anon_vma_lock_write(src_anon_vma);
> - goto retry;
> - }
> - }
> -
> ret = move_present_ptes(mm, dst_vma, src_vma,
> dst_addr, src_addr, dst_pte, src_pte,
> orig_dst_pte, orig_src_pte, dst_pmd,
> dst_pmdval, dst_ptl, src_ptl, &src_folio,
> - len, src_anon_vma);
> + len);
> } else {
> struct folio *folio = NULL;
>
> @@ -1515,10 +1488,6 @@ static long move_pages_ptes(struct mm_struct *mm, pmd_t *dst_pmd, pmd_t *src_pmd
> }
>
> out:
> - if (src_anon_vma) {
> - anon_vma_unlock_write(src_anon_vma);
> - put_anon_vma(src_anon_vma);
> - }
> if (src_folio) {
> folio_unlock(src_folio);
> folio_put(src_folio);
> @@ -1792,15 +1761,6 @@ static void uffd_move_unlock(struct vm_area_struct *dst_vma,
> * virtual regions without knowing if there are transparent hugepage
> * in the regions or not, but preventing the risk of having to split
> * the hugepmd during the remap.
> - *
> - * If there's any rmap walk that is taking the anon_vma locks without
> - * first obtaining the folio lock (the only current instance is
> - * folio_referenced), they will have to verify if the folio->mapping
> - * has changed after taking the anon_vma lock. If it changed they
> - * should release the lock and retry obtaining a new anon_vma, because
> - * it means the anon_vma was changed by move_pages() before the lock
> - * could be obtained. This is the only additional complexity added to
> - * the rmap code to provide this anonymous page remapping functionality.
> */
> ssize_t move_pages(struct userfaultfd_ctx *ctx, unsigned long dst_start,
> unsigned long src_start, unsigned long len, __u64 mode)
> --
> 2.51.0.534.gc79095c0ca-goog
>
>