From: Lorenzo Stoakes <lorenzo.stoakes@oracle.com>
To: David Hildenbrand <david@redhat.com>
Cc: Matthew Wilcox <willy@infradead.org>,
Barry Song <21cnbao@gmail.com>,
Nicolas Geoffray <ngeoffray@google.com>,
Lokesh Gidra <lokeshgidra@google.com>,
Harry Yoo <harry.yoo@oracle.com>,
Suren Baghdasaryan <surenb@google.com>,
Andrew Morton <akpm@linux-foundation.org>,
Rik van Riel <riel@surriel.com>,
"Liam R . Howlett" <Liam.Howlett@oracle.com>,
Vlastimil Babka <vbabka@suse.cz>, Jann Horn <jannh@google.com>,
Linux-MM <linux-mm@kvack.org>,
Kalesh Singh <kaleshsingh@google.com>,
SeongJae Park <sj@kernel.org>, Barry Song <v-songbaohua@oppo.com>,
Peter Xu <peterx@redhat.com>
Subject: Re: [DISCUSSION] anon_vma root lock contention and per anon_vma lock
Date: Mon, 15 Sep 2025 10:42:30 +0100
Message-ID: <585b0ca3-cc56-4f74-9950-800d6faf8012@lucifer.local>
In-Reply-To: <18361483-5089-4414-b974-4c481189b9fa@redhat.com>
On Mon, Sep 15, 2025 at 07:17:41AM +0200, David Hildenbrand wrote:
> On 15.09.25 04:50, Matthew Wilcox wrote:
> > On Mon, Sep 15, 2025 at 08:23:38AM +0800, Barry Song wrote:
>
> A couple of notes:
>
> > > > I wonder if we could fix this by adding a new syscall:
> > > >
> > > > mremap(addr, size, size, MREMAP_COW_NOW);
> > > >
> > > > That would create a new VMA that contains the COWed pages from the
> > > > old VMA, but crucially no longer attached to the anon_vma root of
> > > > the zygote. You wouldn't want to call this for every VMA, of course.
> > > > Just the ones which are likely to be fully COWed.
>
> MADV_POPULATE does that for writable VMAs (excluding the rmap opt, but that
> could likely be implemented).
>
> A student of mine implemented a MADV_UNSHARE that achieves the same by
> triggering unshare-faults even for non-writable VMAs (again, excluding the
> rmap opt).
I think the rmap bit is non-trivial as per my other replies.
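For reference, a minimal userspace sketch of the MADV_POPULATE_WRITE route
David mentions, assuming Linux >= 5.14 and a writable private anonymous
mapping. The helper names break_cow_range() and child_unshare_demo() are made
up for illustration. Note this only breaks CoW on the pages; as said above it
does nothing about the rmap side, so the child's anon_vma stays attached to
the root.

```c
#define _GNU_SOURCE
#include <stddef.h>
#include <string.h>
#include <sys/mman.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/* Older libc headers may lack this; 23 is the uapi value (Linux >= 5.14). */
#ifndef MADV_POPULATE_WRITE
#define MADV_POPULATE_WRITE 23
#endif

/*
 * Prefault [addr, addr + len) writable. Called in a forked child on a
 * writable private anonymous range, this triggers the write faults up
 * front and so breaks CoW for the whole range in one call.
 * Returns 0 on success, -1 with errno set (EINVAL on older kernels).
 */
int break_cow_range(void *addr, size_t len)
{
	return madvise(addr, len, MADV_POPULATE_WRITE);
}

/*
 * Hypothetical demo: the child unshares the range and scribbles on it,
 * then the parent checks that its own copy is untouched.
 * Returns 1 if the parent's data is intact, 0 if not, -1 on setup failure.
 */
int child_unshare_demo(size_t npages)
{
	size_t len = npages * (size_t)sysconf(_SC_PAGESIZE);
	unsigned char *p;
	size_t i;
	int intact = 1;
	pid_t pid;

	p = mmap(NULL, len, PROT_READ | PROT_WRITE,
		 MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
	if (p == MAP_FAILED)
		return -1;
	memset(p, 0xaa, len);	/* populate in the parent before fork() */

	pid = fork();
	if (pid < 0) {
		munmap(p, len);
		return -1;
	}
	if (pid == 0) {
		/* If this fails, the memset() below still CoWs page by page. */
		(void)break_cow_range(p, len);
		memset(p, 0x55, len);
		_exit(0);
	}

	if (waitpid(pid, NULL, 0) < 0)
		intact = -1;
	for (i = 0; intact == 1 && i < len; i++)
		if (p[i] != 0xaa)
			intact = 0;
	munmap(p, len);
	return intact;
}
```

The cost is the one raised elsewhere in this thread: every page in the range
gets its own copy in the child whether or not it would ever have been written.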
>
> We used MADV_UNSHARE to break COW asynchronously to the already-running
> workload to keep fork() still short but avoid the overhead of COW faults
> later.
>
> [ insert usual comment about no weird mremap flags ]
Yup
>
> > > >
> > > > Maybe this isn't practical, but I thought it worth suggesting.
> > >
> > > Lorenzo suggested possibly unlinking the child anon_vma from the root once all
> > > folios have been CoW-ed:
> > >
> > > "Right now, even if you entirely CoW everything in a VMA, we are still
> > > attached to parents with all the overhead. That's something I can look at.
> > > "
> > >
> > > My concern is that it’s difficult to determine whether a VMA has been completely
> > > CoW-ed, and a single shared folio would prevent the unlink.
> > > So I’m not sure this approach would work.
> >
> > I'm concerned that tracking how many folios remain shared may be
> > inefficient. Also that information needs to be gathered in both parent
> > and child.
>
> Yeah, not a fan. Tracking per MM might work, tracking per VMA is problematic
> due to the possibility for VMA splits.
Well, I think it's possible (maybe)... with a rework :>)
"Lorenzo's grand rework" etc. etc.
>
> >
> > > You seem to be proposing a forced CoW as a way to safely unlink from the root.
> > >
> > > A side effect is the potential for sudden, heavy memory allocation,
> > > whereas CoW lets asynchronous tasks such as kswapd work concurrently.
> >
> > Perhaps you could help us out with some stats on that -- how much
> > anonymous memory starts out shared between the zygote and a newly
> > spawned process?
> >
> > > Another issue is the extra memory use from folios that could have been
> > > shared but aren’t—likely minor on Android, since only a small portion
> > > of memory is actually shared, based on our observations.
> > >
> > > Calling mremap for each VMA might be difficult. Something applied to the
> > > whole process could be more practical—similar to exec, but only
> > > performing CoW and unlinking the anon_vma root.
> >
> > That seems like it would be worse for memory consumption than doing it
> > on the VMAs in question.
>
> MADV_UNSHARE we implemented simply took a range and one could apply it to
> the full process by supplying the full range.
>
> But yeah, the downside in any case is that you lose
You just lose? :P I assume you forgot to finish this thought :>)
I wonder from rmap point of view whether you could actually simply check to
see if you're fully CoW'd.
E.g.:
madvise(..., MADV_ISOLATE_COWED)
And have it take the anon_vma write lock from root, have it walk the rmap,
go and check to see if every folio in the VMA is now CoW'd, and if so,
detach the CoW'd anon_vma from its parent/root?
This would be a sort of after-the-fact thing, but maybe could be done
periodically.
Of course then if you had one folio that was not yet CoW'd, that'd prevent
this from completing.
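To make the all-or-nothing nature concrete, here is a toy userspace model of
that check. To be clear, this is not kernel code: struct toy_folio, struct
toy_vma and toy_isolate_cowed() are invented for illustration, the
'exclusive' flag stands in for whatever the real rmap walk would have to
determine, and locking is ignored entirely.

```c
#include <stdbool.h>
#include <stddef.h>

/* Toy stand-in for a folio: only records whether it has been CoW'd. */
struct toy_folio {
	bool exclusive;		/* true once this process owns its own copy */
};

/* Toy stand-in for a VMA and its link to the anon_vma root. */
struct toy_vma {
	struct toy_folio *folios;
	size_t nr_folios;
	bool linked_to_root;
};

/*
 * Model of the proposed (hypothetical) MADV_ISOLATE_COWED behaviour:
 * detach from the root only if every folio is already CoW'd. A single
 * still-shared folio leaves the VMA attached.
 * Returns true if the VMA was detached.
 */
bool toy_isolate_cowed(struct toy_vma *vma)
{
	size_t i;

	for (i = 0; i < vma->nr_folios; i++)
		if (!vma->folios[i].exclusive)
			return false;	/* one shared folio blocks the detach */

	vma->linked_to_root = false;
	return true;
}
```

In this model a caller would simply retry periodically until the call
returns true, which is the "after-the-fact" usage described above.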
>
> >
> > Another possibility would be for the zygote to set a flag on the VMA,
> > say EAGER_COW which forces a COW of all pages as soon as the first one
> > is COWed. But then we're paying at fault time rather than in a syscall
> > that we can predict.
>
> Right, or just avoid COW altogether (if fork time is irrelevant) and just
> copy during fork(). Either using a clone flag for the whole MM or using a
> new MADV option to copy during fork.
Could have sworn we already had an madvise() flag for that but no we
don't... MADV_COPY_ON_FORK...
Anyway any such solution has the issue of using extra memory when the user
very probably does not want this.
>
> >
> > Another point in favour of COW_NOW or EAGER_COW is that we can choose to
> > allocate folios of the appropriate size at that time. Unless something's
> > changed, I think we always COW individual pages rather than multiple
> > pages at once.
>
> Yes. khugepaged will soon start fixing that up later asynchronously I
> hope.
But only at mTHP granularity a. once the relevant series lands and b. if
mTHP is enabled (I mean for sub-PMD sized/aligned ranges) :>)
>
>
> But obviously, whenever we copy/unshare, we consume more memory. While this
> might possibly work for Android, I know that some workloads (was it
> webservers or webbrowsers for example?) spin up many instances through fork()
> to actually keep sharing pages and not break COW.
>
> So I would hope we can find a better optimization that doesn't rely on the
> workload to manually break COW and effectively consume more memory.
Yup, agreed.
>
> --
> Cheers
>
> David / dhildenb
>
>
Cheers, Lorenzo