linux-mm.kvack.org archive mirror
From: David Hildenbrand <david@redhat.com>
To: Matthew Wilcox <willy@infradead.org>, Barry Song <21cnbao@gmail.com>
Cc: Nicolas Geoffray <ngeoffray@google.com>,
	Lokesh Gidra <lokeshgidra@google.com>,
	Lorenzo Stoakes <lorenzo.stoakes@oracle.com>,
	Harry Yoo <harry.yoo@oracle.com>,
	Suren Baghdasaryan <surenb@google.com>,
	Andrew Morton <akpm@linux-foundation.org>,
	Rik van Riel <riel@surriel.com>,
	"Liam R . Howlett" <Liam.Howlett@oracle.com>,
	Vlastimil Babka <vbabka@suse.cz>, Jann Horn <jannh@google.com>,
	Linux-MM <linux-mm@kvack.org>,
	Kalesh Singh <kaleshsingh@google.com>,
	SeongJae Park <sj@kernel.org>, Barry Song <v-songbaohua@oppo.com>,
	Peter Xu <peterx@redhat.com>
Subject: Re: [DISCUSSION] anon_vma root lock contention and per anon_vma lock
Date: Mon, 15 Sep 2025 07:17:41 +0200
Message-ID: <18361483-5089-4414-b974-4c481189b9fa@redhat.com>
In-Reply-To: <aMd-2argDQCHww_Q@casper.infradead.org>

On 15.09.25 04:50, Matthew Wilcox wrote:
> On Mon, Sep 15, 2025 at 08:23:38AM +0800, Barry Song wrote:

A couple of notes:

>>> I wonder if we could fix this by adding a new syscall:
>>>
>>>          mremap(addr, size, size, MREMAP_COW_NOW);
>>>
>>> That would create a new VMA that contains the COWed pages from the
>>> old VMA, but crucially no longer attached to the anon_vma root of
>>> the zygote.  You wouldn't want to call this for every VMA, of course.
>>> Just the ones which are likely to be fully COWed.

MADV_POPULATE_WRITE does that for writable VMAs (excluding the rmap 
optimization, but that could likely be implemented).
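
To make the above concrete, here is a minimal sketch of breaking COW up 
front with MADV_POPULATE_WRITE (Linux 5.14+) in a forked child. The helper 
names unshare_range_now() and cow_break_demo() are made up for 
illustration. Note that this only copies the pages; as said above, it 
does not detach the child from the parent's anon_vma root.

```c
#define _GNU_SOURCE
#include <errno.h>
#include <stddef.h>
#include <string.h>
#include <sys/mman.h>
#include <sys/wait.h>
#include <unistd.h>

/* Available since Linux 5.14; older userspace headers may lack the define. */
#ifndef MADV_POPULATE_WRITE
#define MADV_POPULATE_WRITE 23
#endif

/*
 * Prefault [addr, addr + len) writable. For a private mapping inherited
 * across fork(), this triggers the write faults now, so pages still shared
 * with the parent are copied in one go instead of on first touch.
 * Returns 0 on success, -1 with errno set otherwise (EINVAL on kernels
 * that do not know MADV_POPULATE_WRITE).
 */
static int unshare_range_now(void *addr, size_t len)
{
	return madvise(addr, len, MADV_POPULATE_WRITE);
}

/*
 * Illustrative demo (hypothetical helper): the parent fills a private
 * anonymous mapping, the child breaks COW for the whole range and then
 * overwrites it.
 * Returns 0 if the child kept its data and the parent's copy is untouched,
 * 1 if the kernel lacks MADV_POPULATE_WRITE, -1 on any other error.
 */
static int cow_break_demo(size_t len)
{
	char *buf = mmap(NULL, len, PROT_READ | PROT_WRITE,
			 MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
	int status, rc, parent_ok;
	pid_t pid;

	if (buf == MAP_FAILED)
		return -1;
	memset(buf, 'p', len);

	pid = fork();
	if (pid < 0) {
		munmap(buf, len);
		return -1;
	}
	if (pid == 0) {
		if (unshare_range_now(buf, len) != 0)
			_exit(errno == EINVAL ? 1 : 2);
		/* The contents must survive the unshare. */
		if (buf[0] != 'p' || buf[len - 1] != 'p')
			_exit(3);
		/* These writes no longer take COW faults. */
		memset(buf, 'c', len);
		_exit(0);
	}

	if (waitpid(pid, &status, 0) < 0 || !WIFEXITED(status)) {
		munmap(buf, len);
		return -1;
	}
	rc = WEXITSTATUS(status);
	parent_ok = buf[0] == 'p' && buf[len - 1] == 'p';
	munmap(buf, len);

	if (rc == 1)
		return 1;
	return (rc == 0 && parent_ok) ? 0 : -1;
}
```

Calling cow_break_demo(64 * 1024) from a test program exercises the whole 
sequence; the memory cost is paid inside the madvise() call rather than 
spread over later faults.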

A student of mine implemented a MADV_UNSHARE that achieves the same by 
triggering unshare-faults even for non-writable VMAs (again, excluding 
the rmap opt).

We used MADV_UNSHARE to break COW asynchronously to the already-running 
workload to keep fork() still short but avoid the overhead of COW faults 
later.

[ insert usual comment about no weird mremap flags ]

>>>
>>> Maybe this isn't practical, but I thought it worth suggesting.
>>
>> Lorenzo suggested possibly unlinking the child anon_vma from the root once all
>> folios have been CoW-ed:
>>
>> "Right now, even if you entirely CoW everything in a VMA, we are still
>> attached to parents with all the overhead. That's something I can look at.
>> "
>>
>> My concern is that it’s difficult to determine whether a VMA has been completely
>> CoW-ed, and a single shared folio would prevent the unlink.
>> So I’m not sure this approach would work.
> 
> I'm concerned that tracking how many folios remain shared may be
> inefficient.  Also that information needs to be gathered in both parent
> and child.

Yeah, not a fan. Tracking per MM might work; tracking per VMA is 
problematic due to the possibility of VMA splits.

> 
>> You seem to be proposing a forced CoW as a way to safely unlink from the root.
>>
>> A side effect is the potential for sudden, heavy memory allocation,
>> whereas CoW lets asynchronous tasks such as kswap work concurrently.
> 
> Perhaps you could help us out with some stats on that -- how much
> anonymous memory starts out shared between the zygote and a newly
> spawned process?
> 
>> Another issue is the extra memory use from folios that could have been
>> shared but aren’t—likely minor on Android, since only a small portion
>> of memory is actually shared, based on our observations.
>>
>> Calling mremap for each VMA might be difficult. Something applied to the
>> whole process could be more practical—similar to exec, but only
>> performing CoW and unlinking the anon_vma root.
> 
> That seems like it would be worse for memory consumption than doing it
> on the VMAs in question.

The MADV_UNSHARE we implemented simply took a range, so one could apply 
it to the whole process by supplying the full address range.
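
MADV_UNSHARE is not upstream, so the following is only a rough 
approximation of the "whole process" variant, assuming MADV_POPULATE_WRITE 
as a stand-in and walking /proc/self/maps because that advice fails on 
holes and non-writable ranges. The function name and the max_range cap are 
made up for illustration.

```c
#define _GNU_SOURCE
#include <stddef.h>
#include <stdio.h>
#include <sys/mman.h>

/* Available since Linux 5.14; older userspace headers may lack the define. */
#ifndef MADV_POPULATE_WRITE
#define MADV_POPULATE_WRITE 23
#endif

/*
 * Walk /proc/self/maps and prefault every private, readable and writable
 * mapping no larger than max_range bytes. Ranges the kernel rejects are
 * skipped silently.
 * Returns the number of ranges populated, or -1 if the maps file cannot
 * be opened.
 */
static int unshare_all_private_writable(size_t max_range)
{
	FILE *maps = fopen("/proc/self/maps", "r");
	char line[512];
	int done = 0;

	if (!maps)
		return -1;

	while (fgets(line, sizeof(line), maps)) {
		unsigned long start, end;
		char perms[5];

		/* Each line starts with "start-end perms", e.g. "rw-p". */
		if (sscanf(line, "%lx-%lx %4s", &start, &end, perms) != 3)
			continue;
		if (perms[0] != 'r' || perms[1] != 'w' || perms[3] != 'p')
			continue;
		if (end - start > max_range)
			continue;
		if (madvise((void *)start, end - start,
			    MADV_POPULATE_WRITE) == 0)
			done++;
	}

	fclose(maps);
	return done;
}
```

The cap is there so that a huge, sparsely used mapping does not get 
populated by accident; a real interface would want a better policy than 
that.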

But yeah, the downside in any case is that you lose

> 
> Another possibility would be for the zygote to set a flag on the VMA,
> say EAGER_COW which forces a COW of all pages as soon as the first one
> is COWed.  But then we're paying at fault time rather than in a syscall
> that we can predict.

Right, or just avoid COW altogether (if fork time is irrelevant) and 
just copy during fork(), either using a clone flag for the whole MM or 
using a new MADV option that requests the copy during fork().

> 
> Another point in favour of COW_NOW or EAGER_COW is that we can choose to
> allocate folios of the appropriate size at that time.  Unless something's
> changed, I think we always COW individual pages rather than multiple
> pages at once.

Yes. khugepaged will hopefully soon start fixing that up later, 
asynchronously.


But obviously, whenever we copy/unshare, we consume more memory. While 
this might possibly work for Android, I know that some workloads (was it 
web servers or web browsers, for example?) spin up many instances through 
fork() precisely to keep sharing pages and not break COW.

So I would hope we can find a better optimization that doesn't rely on 
the workload to manually break COW and effectively consume more memory.

-- 
Cheers

David / dhildenb


