From: David Hildenbrand <david@redhat.com>
To: Matthew Wilcox <willy@infradead.org>, Barry Song <21cnbao@gmail.com>
Cc: Nicolas Geoffray <ngeoffray@google.com>,
Lokesh Gidra <lokeshgidra@google.com>,
Lorenzo Stoakes <lorenzo.stoakes@oracle.com>,
Harry Yoo <harry.yoo@oracle.com>,
Suren Baghdasaryan <surenb@google.com>,
Andrew Morton <akpm@linux-foundation.org>,
Rik van Riel <riel@surriel.com>,
"Liam R . Howlett" <Liam.Howlett@oracle.com>,
Vlastimil Babka <vbabka@suse.cz>, Jann Horn <jannh@google.com>,
Linux-MM <linux-mm@kvack.org>,
Kalesh Singh <kaleshsingh@google.com>,
SeongJae Park <sj@kernel.org>, Barry Song <v-songbaohua@oppo.com>,
Peter Xu <peterx@redhat.com>
Subject: Re: [DISCUSSION] anon_vma root lock contention and per anon_vma lock
Date: Mon, 15 Sep 2025 07:17:41 +0200
Message-ID: <18361483-5089-4414-b974-4c481189b9fa@redhat.com>
In-Reply-To: <aMd-2argDQCHww_Q@casper.infradead.org>
On 15.09.25 04:50, Matthew Wilcox wrote:
> On Mon, Sep 15, 2025 at 08:23:38AM +0800, Barry Song wrote:
A couple of notes:
>>> I wonder if we could fix this by adding a new syscall:
>>>
>>> mremap(addr, size, size, MREMAP_COW_NOW);
>>>
>>> That would create a new VMA that contains the COWed pages from the
>>> old VMA, but crucially no longer attached to the anon_vma root of
>>> the zygote. You wouldn't want to call this for every VMA, of course.
>>> Just the ones which are likely to be fully COWed.
MADV_POPULATE_WRITE does that for writable VMAs (excluding the rmap opt,
but that could likely be implemented).
A student of mine implemented a MADV_UNSHARE that achieves the same by
triggering unshare-faults even for non-writable VMAs (again, excluding
the rmap opt).
We used MADV_UNSHARE to break COW asynchronously to the already-running
workload to keep fork() still short but avoid the overhead of COW faults
later.
[ insert usual comment about no weird mremap flags ]
>>>
>>> Maybe this isn't practical, but I thought it worth suggesting.
>>
>> Lorenzo suggested possibly unlinking the child anon_vma from the root once all
>> folios have been CoW-ed:
>>
>> "Right now, even if you entirely CoW everything in a VMA, we are still
>> attached to parents with all the overhead. That's something I can look at.
>> "
>>
>> My concern is that it’s difficult to determine whether a VMA has been completely
>> CoW-ed, and a single shared folio would prevent the unlink.
>> So I’m not sure this approach would work.
>
> I'm concerned that tracking how many folios remain shared may be
> inefficient. Also that information needs to be gathered in both parent
> and child.
Yeah, not a fan. Tracking per MM might work, tracking per VMA is
problematic due to the possibility for VMA splits.
>
>> You seem to be proposing a forced CoW as a way to safely unlink from the root.
>>
>> A side effect is the potential for sudden, heavy memory allocation,
>> whereas CoW lets asynchronous tasks such as kswapd work concurrently.
>
> Perhaps you could help us out with some stats on that -- how much
> anonymous memory starts out shared between the zygote and a newly
> spawned process?
>
>> Another issue is the extra memory use from folios that could have been
>> shared but aren’t—likely minor on Android, since only a small portion
>> of memory is actually shared, based on our observations.
>>
>> Calling mremap for each VMA might be difficult. Something applied to the
>> whole process could be more practical—similar to exec, but only
>> performing CoW and unlinking the anon_vma root.
>
> That seems like it would be worse for memory consumption than doing it
> on the VMAs in question.
MADV_UNSHARE we implemented simply took a range and one could apply it
to the full process by supplying the full range.
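As far as I know MADV_UNSHARE is not upstream, so doing something
comparable for a whole process with existing interfaces would probably
mean walking /proc/self/maps and issuing one call per suitable VMA. A
small sketch of the filtering step, assuming we only care about private,
writable mappings without a backing file; parse_cow_candidate() and
struct map_range are hypothetical names:

```c
#include <stdbool.h>
#include <stdio.h>

/* Address range taken from one /proc/<pid>/maps line. */
struct map_range {
    unsigned long start;
    unsigned long end;
};

/*
 * Parse one line of /proc/<pid>/maps. Returns true and fills *out if the
 * line describes a private ('p'), writable ('w') mapping with inode 0,
 * i.e. one that is not backed by a file.
 */
bool parse_cow_candidate(const char *line, struct map_range *out)
{
    unsigned long start, end, offset, inode;
    unsigned int dev_major, dev_minor;
    char perms[5];

    /* Fields: start-end perms offset major:minor inode [path] */
    int n = sscanf(line, "%lx-%lx %4s %lx %x:%x %lu",
                   &start, &end, perms, &offset,
                   &dev_major, &dev_minor, &inode);
    if (n != 7)
        return false;

    /* perms looks like "rw-p": index 1 is write, index 3 is p/s. */
    if (perms[1] != 'w' || perms[3] != 'p')
        return false;

    /* A non-zero inode means the mapping is file-backed; skip it here. */
    if (inode != 0)
        return false;

    out->start = start;
    out->end = end;
    return true;
}
```

Each accepted range could then be handed to a per-range call, keeping
the policy of which VMAs to touch in userspace.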
But yeah, the downside in any case is that you lose
>
> Another possibility would be for the zygote to set a flag on the VMA,
> say EAGER_COW which forces a COW of all pages as soon as the first one
> is COWed. But then we're paying at fault time rather than in a syscall
> that we can predict.
Right, or just avoid COW altogether (if fork time is irrelevant) and
just copy during fork(). Either using a clone flag for the whole MM or
using a new MADV option to copy during fork.
>
> Another point in favour of COW_NOW or EAGER_COW is that we can choose to
> allocate folios of the appropriate size at that time. Unless something's
> changed, I think we always COW individual pages rather than multiple
> pages at once.
Yes. I hope khugepaged will soon start fixing that up asynchronously
later.
But obviously, whenever we copy/unshare, we consume more memory. While
this might possibly work for Android, I know that some workloads (was it
webservers or web browsers, for example?) spin up many instances through
fork() to actually keep sharing pages and not break COW.
So I would hope we can find a better optimization that doesn't rely on
the workload to manually break COW and effectively consume more memory.
--
Cheers
David / dhildenb