From: Lorenzo Stoakes <lorenzo.stoakes@oracle.com>
To: Suren Baghdasaryan <surenb@google.com>
Cc: Barry Song <21cnbao@gmail.com>,
Matthew Wilcox <willy@infradead.org>,
Nicolas Geoffray <ngeoffray@google.com>,
Lokesh Gidra <lokeshgidra@google.com>,
David Hildenbrand <david@redhat.com>,
Harry Yoo <harry.yoo@oracle.com>,
Andrew Morton <akpm@linux-foundation.org>,
Rik van Riel <riel@surriel.com>,
"Liam R . Howlett" <Liam.Howlett@oracle.com>,
Vlastimil Babka <vbabka@suse.cz>, Jann Horn <jannh@google.com>,
Linux-MM <linux-mm@kvack.org>,
Kalesh Singh <kaleshsingh@google.com>,
SeongJae Park <sj@kernel.org>, Barry Song <v-songbaohua@oppo.com>,
Peter Xu <peterx@redhat.com>
Subject: Re: [DISCUSSION] anon_vma root lock contention and per anon_vma lock
Date: Mon, 15 Sep 2025 09:41:33 +0100
Message-ID: <a3a96ab2-5a8e-4d1b-9a24-01955f931b6f@lucifer.local>
In-Reply-To: <CAJuCfpG1DHeLtsTcD3A46vDV5BwBDAD7B-EVG4TQEY-GSvzfeg@mail.gmail.com>
On Sun, Sep 14, 2025 at 06:47:48PM -0700, Suren Baghdasaryan wrote:
> On Sun, Sep 14, 2025 at 5:23 PM Barry Song <21cnbao@gmail.com> wrote:
> >
> > On Mon, Sep 15, 2025 at 7:53 AM Matthew Wilcox <willy@infradead.org> wrote:
> > >
> > > On Thu, Sep 11, 2025 at 07:17:01PM +1200, Barry Song wrote:
> > > > In the process tree, many processes may share anon_vma->root, even if
> > > > they don’t share the anon_vma itself. This causes serious lock contention
> > > > between memory reclamation (which calls folio_referenced and try_to_unmap)
> > > > and other processes calling fork(), exit(), mprotect(), etc.
> > > >
> > > > On Android, this issue becomes more severe since many processes are
> > > > descendants of zygote.
> > >
> > > I'm not nearly as familiar with anon_vma as, well, the rest of you
> > > are. As I understand this situation, usually after fork(), a process
> > > calls exec() and the VMAs evaporate. Android is different in that after
> > > the zygote calls fork(), there is no exec() and so the VMAs stay COW.
> > >
> > > I wonder if we could fix this by adding a new syscall:
> > >
> > > mremap(addr, size, size, MREMAP_COW_NOW);
> > >
> > > That would create a new VMA that contains the COWed pages from the
> > > old VMA, but crucially no longer attached to the anon_vma root of
> > > the zygote. You wouldn't want to call this for every VMA, of course.
> > > Just the ones which are likely to be fully COWed.
> > >
> > > Maybe this isn't practical, but I thought it worth suggesting.
> >
> > Thank you for the suggestion, Matthew.
> >
> > Lorenzo suggested possibly unlinking the child anon_vma from the root once all
> > folios have been CoW-ed:
> >
> > "Right now, even if you entirely CoW everything in a VMA, we are still
> > attached to parents with all the overhead. That's something I can look at.
> > "
> >
> > My concern is that it’s difficult to determine whether a VMA has been completely
> > CoW-ed, and a single shared folio would prevent the unlink.
> > So I’m not sure this approach would work.
> >
> > You seem to be proposing a forced CoW as a way to safely unlink from the root.
> >
> > A side effect is the potential for sudden, heavy memory allocation,
> > whereas CoW lets asynchronous tasks such as kswapd work concurrently.
> >
> > Another issue is the extra memory use from folios that could have been
> > shared but aren’t—likely minor on Android, since only a small portion
> > of memory is actually shared, based on our observations.
> >
> > Calling mremap for each VMA might be difficult. Something applied to the
> > whole process could be more practical—similar to exec, but only
> > performing CoW and unlinking the anon_vma root.
> >
> > On the other hand, most anon folios are not actually shared, yet
> > folio_referenced and try_to_unmap still take the entire root lock.
> > In reality, they only care about their own node—no need to iterate
> > the whole tree.
> >
> > I still think optimizing from that angle could be a better entry point :-)
>
> Hi Barry,
> Thanks for raising this issue. I think technically the optimization
> you are suggesting is possible and it does look similar to per-vma
> locking in that:
> - The reader tries to read-lock a specific interval and on failure
> falls back to locking the entire tree (root);
> - The writer write-locks the root first and then one or more
> individual nodes in the tree. Once the writer is done it unlocks all
> the nodes it locked and then the root.
> But as Lorenzo pointed out, this will not be pretty, as it adds yet
> another lock and more locking/unlocking into the writer path.
> In the case of the pagefault path, improving its performance at the
> expense of the writers was not questioned due to pagefault being such
> a hot path. I'm not sure reclaim will be given the same benefit...
> Something to consider.
> In any case, I'm very interested in continuing this discussion and
> would love to test a POC or discuss this at LPC.
> Thanks,
> Suren.
Hi Suren,
Have submitted a proposal to LPC re: anon_vma in general, and (in collaboration
with Barry) am planning a PoC for this!
I think we need to be very data-centric here, so we have to ensure that the
benefit is worth the cost. Anyway I should put out a [PoC PATCH] soon!
We could even start by looking at a crude version that just uses per-anon_vma
rwsem as a first step.
But I do think any actually workable solution will have to resemble VMA locks
for it to be even vaguely acceptable, as otherwise grabbing the rwsem is, y'know,
probably a little too much overhead :)
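As a rough illustration of the scheme Suren describes above (the reader tries
the node and falls back to the root; the writer takes the root, then the nodes,
and unlocks in reverse), here is a userspace toy in C. It is not kernel code and
not the planned PoC: toy_rwlock, toy_root, toy_node and the *_trylock helpers
are all made-up names, and try-only C11 atomics stand in for the rwsem purely so
the lock ordering can be exercised without blocking.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>

/* Toy try-only rwlock: state > 0 is a reader count, 0 is free, -1 is a writer. */
struct toy_rwlock { atomic_int state; };

static bool toy_try_read(struct toy_rwlock *l)
{
	int s = atomic_load(&l->state);

	while (s >= 0) {
		/* A failed CAS reloads s, so the loop condition re-checks it. */
		if (atomic_compare_exchange_weak(&l->state, &s, s + 1))
			return true;
	}
	return false;	/* a writer holds the lock */
}

static bool toy_try_write(struct toy_rwlock *l)
{
	int expected = 0;

	return atomic_compare_exchange_strong(&l->state, &expected, -1);
}

static void toy_unlock_read(struct toy_rwlock *l)
{
	int prev = atomic_fetch_sub(&l->state, 1);

	assert(prev > 0);
}

static void toy_unlock_write(struct toy_rwlock *l)
{
	int prev = atomic_exchange(&l->state, 0);

	assert(prev == -1);
}

/* Hypothetical stand-ins for anon_vma->root and a single anon_vma. */
struct toy_root { struct toy_rwlock lock; };
struct toy_node { struct toy_rwlock lock; struct toy_root *root; };

enum held { HELD_NONE, HELD_NODE, HELD_ROOT };

/* Reader (think rmap walk): prefer the node lock, fall back to the root. */
static enum held reader_trylock(struct toy_node *n)
{
	if (toy_try_read(&n->lock))
		return HELD_NODE;
	if (toy_try_read(&n->root->lock))
		return HELD_ROOT;
	return HELD_NONE;	/* a real reader would sleep on the root here */
}

static void reader_unlock(struct toy_node *n, enum held h)
{
	if (h == HELD_NODE)
		toy_unlock_read(&n->lock);
	else if (h == HELD_ROOT)
		toy_unlock_read(&n->root->lock);
}

/* Writer: root first, then the node it wants to modify. */
static bool writer_trylock(struct toy_node *n)
{
	if (!toy_try_write(&n->root->lock))
		return false;
	if (!toy_try_write(&n->lock)) {
		toy_unlock_write(&n->root->lock);	/* back out fully */
		return false;
	}
	return true;
}

static void writer_unlock(struct toy_node *n)
{
	toy_unlock_write(&n->lock);		/* nodes first ... */
	toy_unlock_write(&n->root->lock);	/* ... then the root */
}
```

The thing to notice is that a writer holding the root plus node A does not stop
a reader taking node B directly, which is the whole point, while the writer path
pays for two lock/unlock pairs instead of one.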
One big issue we have here is that we are doing anon_vma clone on _so many_
operations, that is split/merge/fork.
So yeah needs assessment, but I think longer term this kind of issue can feed
into my 'grand redesign' of anon_vma which is a simmering background task (and
which, again, I hope to talk about at LPC if my topic is accepted).
Cheers, Lorenzo