Re: [RFC PATCH 1/2] mm: always call rmap_walk() on locked folios

From: Lorenzo Stoakes <lorenzo.stoakes@oracle.com>
To: Lokesh Gidra <lokeshgidra@google.com>
Cc: akpm@linux-foundation.org, linux-mm@kvack.org,
	kaleshsingh@google.com, ngeoffray@google.com,
	David Hildenbrand <david@redhat.com>,
	Harry Yoo <harry.yoo@oracle.com>, Peter Xu <peterx@redhat.com>,
	Suren Baghdasaryan <surenb@google.com>,
	Barry Song <baohua@kernel.org>, SeongJae Park <sj@kernel.org>
Subject: Re: [RFC PATCH 1/2] mm: always call rmap_walk() on locked folios
Date: Thu, 11 Sep 2025 20:39:02 +0100
Message-ID: <d507dc89-e6b0-4360-bc77-1a8bf74b9134@lucifer.local>
In-Reply-To: <20250908044950.311548-1-lokeshgidra@google.com>

Please please please use a cover letter if there's more than 1 patch :)

I really dislike the 2/2 replying to the 1/2.

On Sun, Sep 07, 2025 at 09:49:49PM -0700, Lokesh Gidra wrote:
> Prior discussion about this can be found at [1].
>
> rmap_walk() requires all folios, except non-KSM anon, to be locked. This
> implies that when threads update folio->mapping to an anon_vma with
> different root (currently only done by UFFDIO MOVE), they have to

I said this on the discussion thread, but can we please stop dancing around
and acting as if this isn't an entirely uffd-specific patch please :)

Let's very explicitly say that's why we're doing this.

> serialize against rmap_walk() with write-lock on the anon_vma, hurting
> scalability. Furthermore, this necessitates rechecking anon_vma when
> pinning/locking an anon_vma (like in folio_lock_anon_vma_read()).

This is really quite confusing, you're compressing far too much information
into a single sentence.

Let's reword this to make it clearer like:

	Userfaultfd has a scaling issue with its UFFDIO_MOVE operation, an
	operation that is heavily used in Android [insert reason why].

	The issue arises because UFFDIO_MOVE updates folio->mapping to an
	anon_vma with a different root. It acquires the folio lock to do
	so, but this is insufficient, because rmap_walk() has a mode in
	which a folio lock need not be acquired, exclusive to non-KSM
	anonymous folios.

	This means that UFFDIO_MOVE has to acquire the anon_vma write lock
	of the root anon_vma belonging to the folio it wishes to move.

	This has resulted in scalability issues due to contention between
	[insert contention information]. We have observed:

	[ insert some data to back this up ]

	This patch resolves the issue by removing this exception. This is
	less problematic than it might seem, as the only caller which
	utilises this mode is shrink_active_list().

Something like this is _a lot_ clearer I think.

>
> This can be simplified quite a bit by ensuring that rmap_walk() is
> always called on locked folios. Among the few callers of rmap_walk() on
> unlocked anon folios, shrink_active_list()->folio_referenced() is the
> only performance critical one.

Let's please not call this a simplification, I mean yes we simplify the
code per se, but we're fundamentally changing the locking logic.

Let's explicitly say that.

Also I find it odd that you say shrink_active_list()->folio_referenced() is
'performance critical', I mean if so, surely this series is broken then?

I'd delete that, the entire basis of this being ok is that it's _not_
performance critical to make this change.

>
> shrink_active_list() doesn't act differently depending on what
> folio_referenced() returns for an anon folio. So returning 1 when it
> is contended, like in case of other folio types, wouldn't have any
> negative impact.

I think this is a little unclear. I mean it very much _does_ behave
differently if it returns 0. So better to say it treats folio_referenced()
as a boolean, so if the folio_referenced() call returns an incorrect
referenced count while still indicating it is referenced that's fine.
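
To make the "treated as a boolean" point concrete, here is a toy userspace
model. It is not kernel code: struct toy_folio, toy_folio_referenced() and
toy_caller_sees_referenced() are hypothetical names, and the caller is
deliberately reduced to the zero versus non-zero test rather than mirroring
the real shrink_active_list() logic.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical stand-in for a folio; not the kernel's struct folio. */
struct toy_folio {
	bool locked;     /* somebody else currently holds the folio lock */
	int  young_ptes; /* number of mappings with the accessed bit set */
};

/*
 * Toy model of the patched behaviour: if the lock cannot be taken,
 * report 1 ("assume referenced") instead of the real count.
 */
static int toy_folio_referenced(struct toy_folio *folio)
{
	int count;

	assert(folio->young_ptes >= 0);

	if (folio->locked)
		return 1;          /* trylock failed: the count is inexact */

	folio->locked = true;      /* trylock succeeded */
	count = folio->young_ptes; /* stands in for the rmap walk */
	folio->locked = false;     /* unlock */

	return count;
}

/* Toy caller: only zero versus non-zero is ever looked at. */
static bool toy_caller_sees_referenced(struct toy_folio *folio)
{
	return toy_folio_referenced(folio) != 0;
}
```

In this model a contended folio with five young mappings reports 1 rather
than 5, yet the caller reaches the same decision. A contended folio with no
young mappings is conservatively reported as referenced, which is the
behaviour the quoted text describes for other folio types.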

>
> Furthermore, as David pointed out in the previous discussion [2], this
> could potentially only affect R/O pages after fork as PG_anon_exclusive
> is not set. But, such folios are already isolated (prior to calling
> folio_referenced()) by grabbing a reference and clearing LRU, so
> do_wp_page()->wp_can_reuse_anon_folio() would not reuse such folios
> anyways.

I think this needs to be expanded too, this is still very unclear.

I think it'd be clearer to explicitly expand upon this, e.g.:

	"Of course we now take a lock where we wouldn't previously have
	done so. In the past this would have had a major impact in causing a CoW
	write fault to copy a page in do_wp_page(), since commit
	09854ba94c6a ("mm: do_wp_page() simplification") caused a failure
	to obtain a folio lock to result in a copy even if one wasn't
	necessary.

	However, since commit 6c287605fd56 ("mm: remember exclusively
	mapped anonymous pages with PG_anon_exclusive"), and the
	introduction of the folio anon exclusive flag, this issue is
	significantly mitigated.

	The only case remaining that we might worry about from this
	perspective is that of read-only folios immediately after fork
	where the anon exclusive bit will not have been set yet.

	We note however in the case of read-only just-forked folios that
	wp_can_reuse_anon_folio() will notice the raised reference count
	established by shrink_active_list() via isolate_lru_folios() and
	refuse to reuse in any case, so this will in fact have no impact -
	the folio lock is ultimately immaterial here.

	All-in-all it appears that there is little opportunity for
	meaningful negative impact from this change".
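
The reuse argument above can be sketched as a toy userspace model. This is
an assumption-laden simplification, not the real wp_can_reuse_anon_folio():
struct toy_anon_folio and toy_can_reuse() are hypothetical, and the real
function performs additional checks (KSM, swap cache, LRU caches) that are
left out here.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical stand-in; not the kernel's struct folio. */
struct toy_anon_folio {
	int  refcount;       /* total references held on the folio */
	bool anon_exclusive; /* models the anon exclusive flag */
	bool locked;         /* models the folio lock being held elsewhere */
};

/* Simplified model of the "may the write fault reuse this folio?" test. */
static bool toy_can_reuse(const struct toy_anon_folio *folio)
{
	assert(folio->refcount >= 1);

	if (folio->anon_exclusive)
		return true;   /* known exclusive: the lock is never consulted */

	if (folio->refcount > 1)
		return false;  /* extra reference: refused before the lock matters */

	return !folio->locked; /* only here would a failed trylock force a copy */
}
```

Under these assumptions, a just-forked read-only folio that reclaim has
isolated carries an extra reference, so the result is "copy" whether or not
the folio lock happens to be held.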

And now, having said all of that, you can talk about simplification,
something like:

	"As a result of changing our approach to locking, we can remove all
	the code that took steps to acquire an anon_vma write lock instead
	of a folio lock. This results in a significant simplification of
	the code."

:)

>
> [1] https://lore.kernel.org/all/CA+EESO4Z6wtX7ZMdDHQRe5jAAS_bQ-POq5+4aDx5jh2DvY6UHg@mail.gmail.com/
> [2] https://lore.kernel.org/all/dc92aef8-757f-4432-923e-70d92d13fb37@redhat.com/
>
> CC: David Hildenbrand <david@redhat.com>
> CC: Lorenzo Stoakes <lorenzo.stoakes@oracle.com>
> CC: Harry Yoo <harry.yoo@oracle.com>
> CC: Peter Xu <peterx@redhat.com>
> CC: Suren Baghdasaryan <surenb@google.com>
> CC: Barry Song <baohua@kernel.org>
> CC: SeongJae Park <sj@kernel.org>
> Signed-off-by: Lokesh Gidra <lokeshgidra@google.com>
> ---
>  mm/damon/ops-common.c | 16 ++++------------
>  mm/page_idle.c        |  8 ++------
>  mm/rmap.c             | 40 ++++++++++------------------------------
>  3 files changed, 16 insertions(+), 48 deletions(-)
>
> diff --git a/mm/damon/ops-common.c b/mm/damon/ops-common.c
> index 998c5180a603..f61d6dde13dc 100644
> --- a/mm/damon/ops-common.c
> +++ b/mm/damon/ops-common.c
> @@ -162,21 +162,17 @@ void damon_folio_mkold(struct folio *folio)
>  		.rmap_one = damon_folio_mkold_one,
>  		.anon_lock = folio_lock_anon_vma_read,
>  	};
> -	bool need_lock;
>
>  	if (!folio_mapped(folio) || !folio_raw_mapping(folio)) {
>  		folio_set_idle(folio);
>  		return;
>  	}
>
> -	need_lock = !folio_test_anon(folio) || folio_test_ksm(folio);
> -	if (need_lock && !folio_trylock(folio))
> +	if (!folio_trylock(folio))
>  		return;
>
>  	rmap_walk(folio, &rwc);
> -
> -	if (need_lock)
> -		folio_unlock(folio);
> +	folio_unlock(folio);
>
>  }
>
> @@ -228,7 +224,6 @@ bool damon_folio_young(struct folio *folio)
>  		.rmap_one = damon_folio_young_one,
>  		.anon_lock = folio_lock_anon_vma_read,
>  	};
> -	bool need_lock;
>
>  	if (!folio_mapped(folio) || !folio_raw_mapping(folio)) {
>  		if (folio_test_idle(folio))
> @@ -237,14 +232,11 @@ bool damon_folio_young(struct folio *folio)
>  			return true;
>  	}
>
> -	need_lock = !folio_test_anon(folio) || folio_test_ksm(folio);
> -	if (need_lock && !folio_trylock(folio))
> +	if (!folio_trylock(folio))
>  		return false;
>
>  	rmap_walk(folio, &rwc);
> -
> -	if (need_lock)
> -		folio_unlock(folio);
> +	folio_unlock(folio);

Oh wow damon did this too... Are we sure we caught all such cases? Have you
checked carefully?

Hate that we've open-coded this all over the place.
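
As an illustration of what factoring out the repeated trylock / walk /
unlock sequence might look like, here is a userspace sketch. Every name in
it (struct walk_folio, toy_folio_trylock(), toy_rmap_walk_trylocked(), ...)
is hypothetical; no such helper exists in the series or, as far as this
sketch assumes, in the tree.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical stand-in; not the kernel's struct folio. */
struct walk_folio {
	bool locked; /* models the folio lock */
	int  walks;  /* how many times the walk callback ran */
};

static bool toy_folio_trylock(struct walk_folio *folio)
{
	if (folio->locked)
		return false;
	folio->locked = true;
	return true;
}

static void toy_folio_unlock(struct walk_folio *folio)
{
	folio->locked = false;
}

/*
 * Hypothetical helper: take the lock, run the walk, drop the lock.
 * Returns false without walking if the lock is contended.
 */
static bool toy_rmap_walk_trylocked(struct walk_folio *folio,
				    void (*walk)(struct walk_folio *))
{
	assert(walk);

	if (!toy_folio_trylock(folio))
		return false;

	walk(folio);
	toy_folio_unlock(folio);
	return true;
}

/* Example callback standing in for rmap_walk(folio, &rwc). */
static void toy_count_walk(struct walk_folio *folio)
{
	folio->walks++;
}
```

With a helper of this shape, each of the three call sites touched by the
patch would reduce to a single call plus its own handling of the "lock was
contended" case.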

>
>  	return accessed;
>  }
> diff --git a/mm/page_idle.c b/mm/page_idle.c
> index a82b340dc204..9bf573d22e87 100644
> --- a/mm/page_idle.c
> +++ b/mm/page_idle.c
> @@ -101,19 +101,15 @@ static void page_idle_clear_pte_refs(struct folio *folio)
>  		.rmap_one = page_idle_clear_pte_refs_one,
>  		.anon_lock = folio_lock_anon_vma_read,
>  	};
> -	bool need_lock;
>
>  	if (!folio_mapped(folio) || !folio_raw_mapping(folio))
>  		return;
>
> -	need_lock = !folio_test_anon(folio) || folio_test_ksm(folio);
> -	if (need_lock && !folio_trylock(folio))
> +	if (!folio_trylock(folio))
>  		return;

And more open-coding here, god. Horrid. It's quite nice to attack this actually.

>
>  	rmap_walk(folio, &rwc);
> -
> -	if (need_lock)
> -		folio_unlock(folio);
> +	folio_unlock(folio);
>  }
>
>  static ssize_t page_idle_bitmap_read(struct file *file, struct kobject *kobj,
> diff --git a/mm/rmap.c b/mm/rmap.c
> index 34333ae3bd80..fc53f31434f4 100644
> --- a/mm/rmap.c
> +++ b/mm/rmap.c
> @@ -489,17 +489,15 @@ void __init anon_vma_init(void)
>   * if there is a mapcount, we can dereference the anon_vma after observing
>   * those.
>   *
> - * NOTE: the caller should normally hold folio lock when calling this.  If
> - * not, the caller needs to double check the anon_vma didn't change after
> - * taking the anon_vma lock for either read or write (UFFDIO_MOVE can modify it
> - * concurrently without folio lock protection). See folio_lock_anon_vma_read()
> - * which has already covered that, and comment above remap_pages().
> + * NOTE: the caller should hold folio lock when calling this.
>   */
>  struct anon_vma *folio_get_anon_vma(const struct folio *folio)
>  {
>  	struct anon_vma *anon_vma = NULL;
>  	unsigned long anon_mapping;
>
> +	VM_WARN_ON_ONCE_FOLIO(!folio_test_locked(folio), folio);
> +
>  	rcu_read_lock();
>  	anon_mapping = (unsigned long)READ_ONCE(folio->mapping);
>  	if ((anon_mapping & FOLIO_MAPPING_FLAGS) != FOLIO_MAPPING_ANON)
> @@ -546,7 +544,6 @@ struct anon_vma *folio_lock_anon_vma_read(const struct folio *folio,
>  	struct anon_vma *root_anon_vma;
>  	unsigned long anon_mapping;
>
> -retry:
>  	rcu_read_lock();
>  	anon_mapping = (unsigned long)READ_ONCE(folio->mapping);
>  	if ((anon_mapping & FOLIO_MAPPING_FLAGS) != FOLIO_MAPPING_ANON)
> @@ -557,17 +554,6 @@ struct anon_vma *folio_lock_anon_vma_read(const struct folio *folio,
>  	anon_vma = (struct anon_vma *) (anon_mapping - FOLIO_MAPPING_ANON);
>  	root_anon_vma = READ_ONCE(anon_vma->root);
>  	if (down_read_trylock(&root_anon_vma->rwsem)) {
> -		/*
> -		 * folio_move_anon_rmap() might have changed the anon_vma as we
> -		 * might not hold the folio lock here.
> -		 */
> -		if (unlikely((unsigned long)READ_ONCE(folio->mapping) !=
> -			     anon_mapping)) {
> -			up_read(&root_anon_vma->rwsem);
> -			rcu_read_unlock();
> -			goto retry;
> -		}
> -
>  		/*
>  		 * If the folio is still mapped, then this anon_vma is still
>  		 * its anon_vma, and holding the mutex ensures that it will
> @@ -602,18 +588,6 @@ struct anon_vma *folio_lock_anon_vma_read(const struct folio *folio,
>  	rcu_read_unlock();
>  	anon_vma_lock_read(anon_vma);
>
> -	/*
> -	 * folio_move_anon_rmap() might have changed the anon_vma as we might
> -	 * not hold the folio lock here.
> -	 */
> -	if (unlikely((unsigned long)READ_ONCE(folio->mapping) !=
> -		     anon_mapping)) {
> -		anon_vma_unlock_read(anon_vma);
> -		put_anon_vma(anon_vma);
> -		anon_vma = NULL;
> -		goto retry;
> -	}
> -
>  	if (atomic_dec_and_test(&anon_vma->refcount)) {
>  		/*
>  		 * Oops, we held the last refcount, release the lock
> @@ -1005,7 +979,7 @@ int folio_referenced(struct folio *folio, int is_locked,
>  	if (!folio_raw_mapping(folio))
>  		return 0;
>
> -	if (!is_locked && (!folio_test_anon(folio) || folio_test_ksm(folio))) {
> +	if (!is_locked) {
>  		we_locked = folio_trylock(folio);
>  		if (!we_locked)
>  			return 1;
> @@ -2815,6 +2789,12 @@ static void rmap_walk_anon(struct folio *folio,
>  	pgoff_t pgoff_start, pgoff_end;
>  	struct anon_vma_chain *avc;
>
> +	/*
> +	 * The folio lock ensures that folio->mapping can be changed under us to

Can't? :)

> +	 * an anon_vma with different root, like UFFDIO MOVE.

Not a fan of this 'like UFFDIO_MOVE'. I mean here I think we should just
drop that bit.

> +	 */
> +	VM_BUG_ON_FOLIO(!folio_test_locked(folio), folio);

PLEASE do not add BUG_ON(), checkpatch.pl will warn you on this and it's
not something we do these days pretty much ever. VM_WARN_ON_ONCE_FOLIO()
please.
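
For anyone unfamiliar with the difference, a rough userspace model of
warn-once semantics follows. TOY_WARN_ON_ONCE() and toy_rmap_walk_anon()
are hypothetical and only imitate the idea; the real
VM_WARN_ON_ONCE_FOLIO() lives in the kernel headers and also dumps folio
state.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdio.h>

static int toy_warnings_emitted; /* total warnings printed */
static int toy_walks_completed;  /* walks that ran to completion */

/*
 * Warn the first time the condition is true at this call site, then stay
 * silent. Execution always continues, unlike a BUG-style assertion.
 */
#define TOY_WARN_ON_ONCE(cond)                                    \
	do {                                                      \
		static bool warned_;                              \
		if ((cond) && !warned_) {                         \
			warned_ = true;                           \
			toy_warnings_emitted++;                   \
			fprintf(stderr, "warning: %s at %s:%d\n", \
				#cond, __FILE__, __LINE__);       \
		}                                                 \
	} while (0)

/* Toy walk: complains once about a missing lock but still proceeds. */
static void toy_rmap_walk_anon(bool folio_locked)
{
	assert(toy_walks_completed >= 0);

	TOY_WARN_ON_ONCE(!folio_locked);
	toy_walks_completed++;
}
```

Calling toy_rmap_walk_anon(false) twice prints a single warning and both
walks still complete, whereas a BUG-style check would have stopped at the
first violation.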

> +
>  	if (locked) {
>  		anon_vma = folio_anon_vma(folio);
>  		/* anon_vma disappear under us? */
>
> base-commit: b024763926d2726978dff6588b81877d000159c1
> --
> 2.51.0.355.g5224444f11-goog
>
>

Cheers, Lorenzo
