linux-mm.kvack.org archive mirror
From: Lorenzo Stoakes <lorenzo.stoakes@oracle.com>
To: Matthew Wilcox <willy@infradead.org>
Cc: Barry Song <21cnbao@gmail.com>,
	Nicolas Geoffray <ngeoffray@google.com>,
	Lokesh Gidra <lokeshgidra@google.com>,
	David Hildenbrand <david@redhat.com>,
	Harry Yoo <harry.yoo@oracle.com>,
	Suren Baghdasaryan <surenb@google.com>,
	Andrew Morton <akpm@linux-foundation.org>,
	Rik van Riel <riel@surriel.com>,
	"Liam R . Howlett" <Liam.Howlett@oracle.com>,
	Vlastimil Babka <vbabka@suse.cz>, Jann Horn <jannh@google.com>,
	Linux-MM <linux-mm@kvack.org>,
	Kalesh Singh <kaleshsingh@google.com>,
	SeongJae Park <sj@kernel.org>, Barry Song <v-songbaohua@oppo.com>,
	Peter Xu <peterx@redhat.com>
Subject: Re: [DISCUSSION] anon_vma root lock contention and per anon_vma lock
Date: Mon, 15 Sep 2025 10:22:25 +0100
Message-ID: <b6d5c607-a9d9-4c1b-b4ac-41fcaa2e696d@lucifer.local>
In-Reply-To: <aMd-2argDQCHww_Q@casper.infradead.org>

On Mon, Sep 15, 2025 at 03:50:01AM +0100, Matthew Wilcox wrote:
> On Mon, Sep 15, 2025 at 08:23:38AM +0800, Barry Song wrote:
> > > I wonder if we could fix this by adding a new syscall:
> > >
> > >         mremap(addr, size, size, MREMAP_COW_NOW);
> > >
> > > That would create a new VMA that contains the COWed pages from the
> > > old VMA, but crucially no longer attached to the anon_vma root of
> > > the zygote.  You wouldn't want to call this for every VMA, of course.
> > > Just the ones which are likely to be fully COWed.
> > >
> > > Maybe this isn't practical, but I thought it worth suggesting.
> >
> > Lorenzo suggested possibly unlinking the child anon_vma from the root once all
> > folios have been CoW-ed:
> >
> > "Right now, even if you entirely CoW everything in a VMA, we are still
> > attached to parents with all the overhead. That's something I can look at."
> >
> > My concern is that it’s difficult to determine whether a VMA has been completely
> > CoW-ed, and a single shared folio would prevent the unlink.
> > So I’m not sure this approach would work.
>
> I'm concerned that tracking how many folios remain shared may be
> inefficient.  Also that information needs to be gathered in both parent
> and child.

Yeah I think you would need to track parent + child which is just _lovely_
isn't it.

I'm really not in love with the overwrought structure of anon_vmas in general,
we've made life hard for ourselves and tacked on a bunch of complexity.

Again, I think "Lorenzo's grand rework" could help tackle this from a
fundamental basis (don't ask me for too many details just yet :P)

This also gets potentially complicated with the anon_vma reuse logic too.

It's about RoI.

Again, the more I look at this, the more I feel that the whole thing needs
rearchitecture.

>
> > You seem to be proposing a forced CoW as a way to safely unlink from the root.
> >
> > A side effect is the potential for sudden, heavy memory allocation,
> > whereas CoW lets asynchronous tasks such as kswapd work concurrently.
>
> Perhaps you could help us out with some stats on that -- how much
> anonymous memory starts out shared between the zygote and a newly
> spawned process?

Yes stats are good!

>
> > Another issue is the extra memory use from folios that could have been
> > shared but aren’t—likely minor on Android, since only a small portion
> > of memory is actually shared, based on our observations.
> >
> > Calling mremap for each VMA might be difficult. Something applied to the
> > whole process could be more practical—similar to exec, but only
> > performing CoW and unlinking the anon_vma root.
>
> That seems like it would be worse for memory consumption than doing it
> on the VMAs in question.

Yes!

Surely the whole point of using the zygote is to take advantage of CoW no?
It'd surely hugely slow down establishing a new process if we did this
per-process?
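
For anyone following along who wants the zygote pattern spelled out, here is
a minimal userspace sketch, assuming Linux and a plain fork() with no exec.
The function name zygote_cow_demo() is made up for illustration. It only shows
the visible semantics (the child starts out sharing the parent's anonymous
pages and a write stays private to the child); it says nothing about how the
anon_vma chain is laid out underneath.

```c
#define _DEFAULT_SOURCE
#include <string.h>
#include <sys/mman.h>
#include <sys/wait.h>
#include <unistd.h>

/*
 * Hypothetical helper, not from any existing code base.
 * Returns 1 if a write made by a forked child stays private to the child
 * (the parent still sees its own data), 0 if the parent sees the child's
 * write, and -1 if setup or the child itself failed.
 */
int zygote_cow_demo(void)
{
    long page = sysconf(_SC_PAGESIZE);
    size_t len = 4 * (size_t)page;
    int result = -1;

    /* Private anonymous memory, standing in for zygote heap state. */
    unsigned char *buf = mmap(NULL, len, PROT_READ | PROT_WRITE,
                              MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    if (buf == MAP_FAILED)
        return -1;

    /* The "zygote" populates the pages before forking. */
    memset(buf, 'A', len);

    pid_t pid = fork();
    if (pid == 0) {
        /* Child: write to the first page only, leave the rest alone. */
        buf[0] = 'B';
        /* The child sees its own write and the inherited second page. */
        _exit(buf[0] == 'B' && buf[page] == 'A' ? 0 : 1);
    }

    if (pid > 0) {
        int status;

        if (waitpid(pid, &status, 0) == pid &&
            WIFEXITED(status) && WEXITSTATUS(status) == 0)
            result = (buf[0] == 'A'); /* parent's copy is untouched */
    }

    munmap(buf, len);
    return result;
}
```

Forcing a copy of everything up front, per-process, would throw away exactly
the sharing that the untouched pages get here for free.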

>
> Another possibility would be for the zygote to set a flag on the VMA,
> say EAGER_COW which forces a COW of all pages as soon as the first one
> is COWed.  But then we're paying at fault time rather than in a syscall
> that we can predict.

I assume you mean write fault.

How would we identify which folios to CoW at fault time (other than the one
we are write faulting on)? There's no way to go from anon_vma to folios
without doing a full rmap traversal to the VMA then back down again to
page tables, so that'd make this pretty damn expensive surely?
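
To make the cost concrete, here is a toy model in plain C, not kernel code.
struct toy_pte and toy_eager_cow() are invented for illustration and stand in
for page table entries and a hypothetical eager-CoW pass. The assumption baked
in is the one above: with no index from anon_vma to folios, the only way to
find what is still shared is to look at every slot in the VMA's range.

```c
#include <stdbool.h>
#include <stddef.h>

/* Toy stand-in for a page table entry; NOT a kernel structure. */
struct toy_pte {
    bool present;   /* something is mapped at this slot */
    bool shared;    /* still shared with the parent, i.e. not yet CoW-ed */
};

/*
 * Hypothetical eager-CoW pass over one VMA's worth of entries.
 * Every slot has to be visited, because nothing tells us in advance
 * which ones are present and still shared.
 *
 * Returns the number of entries visited; *copied receives the number of
 * entries that needed a copy.
 */
size_t toy_eager_cow(struct toy_pte *ptes, size_t nr, size_t *copied)
{
    size_t visited = 0;

    *copied = 0;
    for (size_t i = 0; i < nr; i++) {
        visited++;
        if (ptes[i].present && ptes[i].shared) {
            /* Stands in for allocating a new folio and copying into it. */
            ptes[i].shared = false;
            (*copied)++;
        }
    }
    return visited;
}
```

In this model the work scales with the size of the range rather than with the
number of pages that actually get copied, and the real thing would be doing
that walk inside the fault path.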

And we'd need to hold some kind of lock at that point...

Also again I wonder how easy it will be to identify which VMAs you're happy
to make expensive like this?

Also you will end up fragmenting VMAs potentially doing this.

I think it's a sort of nice idea fundamentally, but the cost is likely to
be high and it will add a bunch of complexity.

>
> Another point in favour of COW_NOW or EAGER_COW is that we can choose to
> allocate folios of the appropriate size at that time.  Unless something's
> changed, I think we always COW individual pages rather than multiple
> pages at once.
>

I don't think there's any practical way to do it any differently because
you don't know what's mapped/not at the point of fault.

How expensive would it be to have xarray for anon, I wonder... :)

