From: Lorenzo Stoakes <lorenzo.stoakes@oracle.com>
To: Lokesh Gidra <lokeshgidra@google.com>
Cc: akpm@linux-foundation.org, linux-mm@kvack.org,
	kaleshsingh@google.com, ngeoffray@google.com, jannh@google.com,
	David Hildenbrand <david@redhat.com>,
	Peter Xu <peterx@redhat.com>,
	Suren Baghdasaryan <surenb@google.com>,
	Barry Song <baohua@kernel.org>
Subject: Re: [PATCH v2 2/2] mm/userfaultfd: don't lock anon_vma when performing UFFDIO_MOVE
Date: Mon, 3 Nov 2025 17:52:32 +0000
Message-ID: <4acead1b-6b35-4104-b1a4-c96089c1caa2@lucifer.local>
In-Reply-To: <20250923071019.775806-3-lokeshgidra@google.com>

On Tue, Sep 23, 2025 at 12:10:19AM -0700, Lokesh Gidra wrote:
> Now that rmap_walk() is guaranteed to be called with the folio lock
> held, we can stop serializing on the src VMA's anon_vma lock when
> moving an exclusive folio from a src VMA to a dst VMA in the
> UFFDIO_MOVE ioctl.
>
> When moving a folio, we modify folio->mapping through
> folio_move_anon_rmap() and adjust folio->index accordingly. Doing that
> while concurrent rmap walks could be in progress would be dangerous,
> so we had to acquire the src VMA's anon_vma lock in write mode. That
> meant that when multiple threads called UFFDIO_MOVE concurrently on
> distinct pages of the same src VMA, they would all serialize on that
> lock, hurting scalability.
>
> In addition to removing the scalability bottleneck, this patch also
> simplifies the complicated lock dance that UFFDIO_MOVE had to go
> through between the RCU read lock, the folio lock, the PTL, and the
> anon_vma lock.
>
> folio_move_anon_rmap() already enforces that the folio is locked, and
> rmap_walk() is now guaranteed the folio lock as well. So once we hold
> the folio lock we can no longer race with a concurrent rmap_walk() as
> used by folio_referenced() and others, which previously could run on
> unlocked non-KSM anon folios, and therefore the anon_vma lock is no
> longer required.
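
As an aside, the invariant reads nicely with the two sides put next to
each other. A heavily paraphrased sketch (not the literal upstream
bodies, and the rmap_walk() assertion stands in for what patch 1/2
enforces):

	/* After patch 1/2: every rmap walk runs with the folio locked. */
	void rmap_walk(struct folio *folio, struct rmap_walk_control *rwc)
	{
		VM_WARN_ON_FOLIO(!folio_test_locked(folio), folio);
		/* ... walk the anon_vma interval tree ... */
	}

	/* The mover likewise requires the folio lock before retargeting. */
	void folio_move_anon_rmap(struct folio *folio, struct vm_area_struct *vma)
	{
		void *anon_vma = (void *)vma->anon_vma + PAGE_MAPPING_ANON;

		VM_BUG_ON_FOLIO(!folio_test_locked(folio), folio);
		/* Retarget the folio at the dst VMA's anon_vma. */
		WRITE_ONCE(folio->mapping, anon_vma);
	}

So once UFFDIO_MOVE holds the folio lock, no rmap walk can observe
folio->mapping mid-update.
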
>
> Note that this handling is now the same as for other
> folio_move_anon_rmap() users that also do not hold the anon_vma lock --
> namely COW reuse handling (do_wp_page()->wp_can_reuse_anon_folio(),
> do_huge_pmd_wp_page(), and hugetlb_wp()). These users never required
> the anon_vma lock, as they only move the folio's anon_vma closer to
> the VMA's anon_vma leaf, for example from an anon_vma root to a leaf
> of that root. Rmap walks were always able to tolerate that scenario.
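
For anyone reading along, that pre-existing COW reuse pattern boils
down to something like the below (a simplified sketch of the
wp_can_reuse_anon_folio() shape, not the literal code; in particular
the exclusivity check is reduced to a plain refcount test here for
illustration):

	if (!folio_trylock(folio))
		return false;
	if (folio_ref_count(folio) != 1) {	/* simplified exclusivity check */
		folio_unlock(folio);
		return false;
	}
	/*
	 * Exclusive and locked: rebinding to this VMA's anon_vma only moves
	 * the folio "down" the hierarchy (root towards leaf), which rmap
	 * walks have always tolerated, so no anon_vma lock is needed.
	 */
	folio_move_anon_rmap(folio, vma);
	folio_unlock(folio);
	return true;

UFFDIO_MOVE now simply joins that club.
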
>
> CC: David Hildenbrand <david@redhat.com>
> CC: Lorenzo Stoakes <lorenzo.stoakes@oracle.com>
> CC: Peter Xu <peterx@redhat.com>
> CC: Suren Baghdasaryan <surenb@google.com>
> CC: Barry Song <baohua@kernel.org>
> Signed-off-by: Lokesh Gidra <lokeshgidra@google.com>

LGTM, so:

Reviewed-by: Lorenzo Stoakes <lorenzo.stoakes@oracle.com>

> ---
>  mm/huge_memory.c | 22 +----------------
>  mm/userfaultfd.c | 62 +++++++++---------------------------------------
>  2 files changed, 12 insertions(+), 72 deletions(-)
>
> diff --git a/mm/huge_memory.c b/mm/huge_memory.c
> index 1b81680b4225..a16e3778b544 100644
> --- a/mm/huge_memory.c
> +++ b/mm/huge_memory.c
> @@ -2533,7 +2533,6 @@ int move_pages_huge_pmd(struct mm_struct *mm, pmd_t *dst_pmd, pmd_t *src_pmd, pm
>  	pmd_t _dst_pmd, src_pmdval;
>  	struct page *src_page;
>  	struct folio *src_folio;
> -	struct anon_vma *src_anon_vma;
>  	spinlock_t *src_ptl, *dst_ptl;
>  	pgtable_t src_pgtable;
>  	struct mmu_notifier_range range;
> @@ -2582,23 +2581,9 @@ int move_pages_huge_pmd(struct mm_struct *mm, pmd_t *dst_pmd, pmd_t *src_pmd, pm
>  				src_addr + HPAGE_PMD_SIZE);
>  	mmu_notifier_invalidate_range_start(&range);
>
> -	if (src_folio) {
> +	if (src_folio)
>  		folio_lock(src_folio);
>
> -		/*
> -		 * split_huge_page walks the anon_vma chain without the page
> -		 * lock. Serialize against it with the anon_vma lock, the page
> -		 * lock is not enough.
> -		 */
> -		src_anon_vma = folio_get_anon_vma(src_folio);
> -		if (!src_anon_vma) {
> -			err = -EAGAIN;
> -			goto unlock_folio;
> -		}
> -		anon_vma_lock_write(src_anon_vma);
> -	} else
> -		src_anon_vma = NULL;
> -
>  	dst_ptl = pmd_lockptr(mm, dst_pmd);
>  	double_pt_lock(src_ptl, dst_ptl);
>  	if (unlikely(!pmd_same(*src_pmd, src_pmdval) ||
> @@ -2643,11 +2628,6 @@ int move_pages_huge_pmd(struct mm_struct *mm, pmd_t *dst_pmd, pmd_t *src_pmd, pm
>  	pgtable_trans_huge_deposit(mm, dst_pmd, src_pgtable);
>  unlock_ptls:
>  	double_pt_unlock(src_ptl, dst_ptl);
> -	if (src_anon_vma) {
> -		anon_vma_unlock_write(src_anon_vma);
> -		put_anon_vma(src_anon_vma);
> -	}
> -unlock_folio:
>  	/* unblock rmap walks */
>  	if (src_folio)
>  		folio_unlock(src_folio);
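
The PMD path reads much better with the anon_vma dance gone. If I've
followed the surrounding code correctly, the locking sequence here is
now simply (a sketch; error paths, pmd_same() checks and TLB details
elided):

	mmu_notifier_invalidate_range_start(&range);
	if (src_folio)
		folio_lock(src_folio);		/* excludes rmap walks */
	double_pt_lock(src_ptl, dst_ptl);
	/* ... verify the PMDs are unchanged, then move the entry ... */
	folio_move_anon_rmap(src_folio, dst_vma);
	double_pt_unlock(src_ptl, dst_ptl);
	if (src_folio)
		folio_unlock(src_folio);
	mmu_notifier_invalidate_range_end(&range);
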
> diff --git a/mm/userfaultfd.c b/mm/userfaultfd.c
> index af61b95c89e4..6be65089085e 100644
> --- a/mm/userfaultfd.c
> +++ b/mm/userfaultfd.c
> @@ -1035,8 +1035,7 @@ static inline bool is_pte_pages_stable(pte_t *dst_pte, pte_t *src_pte,
>   */
>  static struct folio *check_ptes_for_batched_move(struct vm_area_struct *src_vma,
>  						 unsigned long src_addr,
> -						 pte_t *src_pte, pte_t *dst_pte,
> -						 struct anon_vma *src_anon_vma)
> +						 pte_t *src_pte, pte_t *dst_pte)
>  {
>  	pte_t orig_dst_pte, orig_src_pte;
>  	struct folio *folio;
> @@ -1052,8 +1051,7 @@ static struct folio *check_ptes_for_batched_move(struct vm_area_struct *src_vma,
>  	folio = vm_normal_folio(src_vma, src_addr, orig_src_pte);
>  	if (!folio || !folio_trylock(folio))
>  		return NULL;
> -	if (!PageAnonExclusive(&folio->page) || folio_test_large(folio) ||
> -	    folio_anon_vma(folio) != src_anon_vma) {
> +	if (!PageAnonExclusive(&folio->page) || folio_test_large(folio)) {
>  		folio_unlock(folio);
>  		return NULL;
>  	}
> @@ -1061,9 +1059,8 @@ static struct folio *check_ptes_for_batched_move(struct vm_area_struct *src_vma,
>  }
>
>  /*
> - * Moves src folios to dst in a batch as long as they share the same
> - * anon_vma as the first folio, are not large, and can successfully
> - * take the lock via folio_trylock().
> + * Moves src folios to dst in a batch as long as they are not large, and can
> + * successfully take the lock via folio_trylock().
>   */
>  static long move_present_ptes(struct mm_struct *mm,
>  			      struct vm_area_struct *dst_vma,
> @@ -1073,8 +1070,7 @@ static long move_present_ptes(struct mm_struct *mm,
>  			      pte_t orig_dst_pte, pte_t orig_src_pte,
>  			      pmd_t *dst_pmd, pmd_t dst_pmdval,
>  			      spinlock_t *dst_ptl, spinlock_t *src_ptl,
> -			      struct folio **first_src_folio, unsigned long len,
> -			      struct anon_vma *src_anon_vma)
> +			      struct folio **first_src_folio, unsigned long len)
>  {
>  	int err = 0;
>  	struct folio *src_folio = *first_src_folio;
> @@ -1132,8 +1128,8 @@ static long move_present_ptes(struct mm_struct *mm,
>  		src_pte++;
>
>  		folio_unlock(src_folio);
> -		src_folio = check_ptes_for_batched_move(src_vma, src_addr, src_pte,
> -							dst_pte, src_anon_vma);
> +		src_folio = check_ptes_for_batched_move(src_vma, src_addr,
> +							src_pte, dst_pte);
>  		if (!src_folio)
>  			break;
>  	}
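
And just to check my reading of the batch loop shape after this change
(paraphrased, not the literal function body):

	/* move_present_ptes() batching, roughly: */
	for (;;) {
		/* src_folio is locked, small, and PageAnonExclusive */
		folio_move_anon_rmap(src_folio, dst_vma);
		/* ... install the dst PTE, advance addresses and PTE pointers ... */
		if (src_addr == src_end)
			break;
		folio_unlock(src_folio);
		src_folio = check_ptes_for_batched_move(src_vma, src_addr,
							src_pte, dst_pte);
		if (!src_folio)
			break;		/* next PTE can't join the batch */
	}

In other words, the only per-folio gates left are the trylock, the size
check, and PageAnonExclusive, with no anon_vma comparison.
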
> @@ -1263,7 +1259,6 @@ static long move_pages_ptes(struct mm_struct *mm, pmd_t *dst_pmd, pmd_t *src_pmd
>  	pmd_t dummy_pmdval;
>  	pmd_t dst_pmdval;
>  	struct folio *src_folio = NULL;
> -	struct anon_vma *src_anon_vma = NULL;
>  	struct mmu_notifier_range range;
>  	long ret = 0;
>
> @@ -1347,9 +1342,9 @@ static long move_pages_ptes(struct mm_struct *mm, pmd_t *dst_pmd, pmd_t *src_pmd
>  		}
>
>  		/*
> -		 * Pin and lock both source folio and anon_vma. Since we are in
> -		 * RCU read section, we can't block, so on contention have to
> -		 * unmap the ptes, obtain the lock and retry.
> +		 * Pin and lock source folio. Since we are in RCU read section,
> +		 * we can't block, so on contention have to unmap the ptes,
> +		 * obtain the lock and retry.
>  		 */
>  		if (!src_folio) {
>  			struct folio *folio;
> @@ -1423,33 +1418,11 @@ static long move_pages_ptes(struct mm_struct *mm, pmd_t *dst_pmd, pmd_t *src_pmd
>  			goto retry;
>  		}
>
> -		if (!src_anon_vma) {
> -			/*
> -			 * folio_referenced walks the anon_vma chain
> -			 * without the folio lock. Serialize against it with
> -			 * the anon_vma lock, the folio lock is not enough.
> -			 */
> -			src_anon_vma = folio_get_anon_vma(src_folio);
> -			if (!src_anon_vma) {
> -				/* page was unmapped from under us */
> -				ret = -EAGAIN;
> -				goto out;
> -			}
> -			if (!anon_vma_trylock_write(src_anon_vma)) {
> -				pte_unmap(src_pte);
> -				pte_unmap(dst_pte);
> -				src_pte = dst_pte = NULL;
> -				/* now we can block and wait */
> -				anon_vma_lock_write(src_anon_vma);
> -				goto retry;
> -			}
> -		}
> -
>  		ret = move_present_ptes(mm, dst_vma, src_vma,
>  					dst_addr, src_addr, dst_pte, src_pte,
>  					orig_dst_pte, orig_src_pte, dst_pmd,
>  					dst_pmdval, dst_ptl, src_ptl, &src_folio,
> -					len, src_anon_vma);
> +					len);
>  	} else {
>  		struct folio *folio = NULL;
>
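
Which means the only trylock/retry stage left under the lockless PTE
mapping is the folio lock itself, i.e. roughly:

	/* PTEs are mapped lockless under RCU; we must not block here. */
	if (!folio_trylock(folio)) {
		pte_unmap(src_pte);
		pte_unmap(dst_pte);
		src_pte = dst_pte = NULL;
		/* now we can block and wait */
		folio_lock(folio);
		goto retry;	/* folio stays locked across the retry */
	}
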
> @@ -1515,10 +1488,6 @@ static long move_pages_ptes(struct mm_struct *mm, pmd_t *dst_pmd, pmd_t *src_pmd
>  	}
>
>  out:
> -	if (src_anon_vma) {
> -		anon_vma_unlock_write(src_anon_vma);
> -		put_anon_vma(src_anon_vma);
> -	}
>  	if (src_folio) {
>  		folio_unlock(src_folio);
>  		folio_put(src_folio);
> @@ -1792,15 +1761,6 @@ static void uffd_move_unlock(struct vm_area_struct *dst_vma,
>   * virtual regions without knowing if there are transparent hugepage
>   * in the regions or not, but preventing the risk of having to split
>   * the hugepmd during the remap.
> - *
> - * If there's any rmap walk that is taking the anon_vma locks without
> - * first obtaining the folio lock (the only current instance is
> - * folio_referenced), they will have to verify if the folio->mapping
> - * has changed after taking the anon_vma lock. If it changed they
> - * should release the lock and retry obtaining a new anon_vma, because
> - * it means the anon_vma was changed by move_pages() before the lock
> - * could be obtained. This is the only additional complexity added to
> - * the rmap code to provide this anonymous page remapping functionality.
>   */
>  ssize_t move_pages(struct userfaultfd_ctx *ctx, unsigned long dst_start,
>  		   unsigned long src_start, unsigned long len, __u64 mode)
> --
> 2.51.0.534.gc79095c0ca-goog
>
>

