* [DISCUSSION] anon_vma root lock contention and per anon_vma lock
@ 2025-09-11 7:17 Barry Song
2025-09-11 8:14 ` David Hildenbrand
` (2 more replies)
0 siblings, 3 replies; 23+ messages in thread
From: Barry Song @ 2025-09-11 7:17 UTC (permalink / raw)
To: Nicolas Geoffray, Lokesh Gidra, David Hildenbrand,
Lorenzo Stoakes, Harry Yoo, Suren Baghdasaryan, Andrew Morton,
Rik van Riel, Liam R . Howlett, Vlastimil Babka, Jann Horn
Cc: Linux-MM, Kalesh Singh, SeongJae Park, Barry Song, Peter Xu
Hi All,
I’m aware that Lokesh started a discussion on the concurrency issue
between userfaultfd_move and memory reclamation [1]. However, my
concern is different, so I’m starting a separate discussion.
In the process tree, many processes may share anon_vma->root, even if
they don’t share the anon_vma itself. This causes serious lock contention
between memory reclamation (which calls folio_referenced and try_to_unmap)
and other processes calling fork(), exit(), mprotect(), etc.
On Android, this issue becomes more severe since many processes are
descendants of zygote.
Memory reclamation path:
  folio_lock_anon_vma_read

mprotect path:
  mprotect
    split_vma
      anon_vma_clone

fork / copy_process path:
  copy_process
    dup_mmap
      anon_vma_fork

exit path:
  exit_mmap
    free_pgtables
      unlink_anon_vmas
To be honest, memory reclamation—especially folio_referenced()—is a
problem. It is called very frequently and can block other important
user threads waiting for the anon_vma root lock, causing UI lag.
I have a rough idea: since the vast majority of anon folios are actually
exclusive (I observed almost 98% of Android anon folios fall into this
category), they don’t need to iterate the anon_vma tree. They belong to
a single process, and even for rmap, it is per-process.
I propose introducing a per-anon_vma lock. For exclusive folios whose
anon_vma is not shared, we could use this per-anon_vma lock.
folio_referenced declares that it will begin reading, and Lokesh’s
folio_lock may also help maintain folios as exclusive, so I am
somewhat in favor of his RFC. Any thread writing to such an anon_vma
would take the per-vma write lock, and possibly also the anon_vma
root write lock. If folio_referenced fails to declare the per-vma lock,
it can fall back to the global anon_vma->root read mutex, similar to
mmap_lock.
I haven’t carefully considered this or written any code yet—just a
very rough idea. Sorry if it comes across as too naive.
[1] https://lore.kernel.org/linux-mm/CA+EESO4Z6wtX7ZMdDHQRe5jAAS_bQ-POq5+4aDx5jh2DvY6UHg@mail.gmail.com/
Thanks
Barry
^ permalink raw reply [flat|nested] 23+ messages in thread
* Re: [DISCUSSION] anon_vma root lock contention and per anon_vma lock
2025-09-11 7:17 [DISCUSSION] anon_vma root lock contention and per anon_vma lock Barry Song
@ 2025-09-11 8:14 ` David Hildenbrand
2025-09-11 8:34 ` Lorenzo Stoakes
2025-09-11 9:18 ` Barry Song
2025-09-11 8:28 ` Lorenzo Stoakes
2025-09-14 23:53 ` Matthew Wilcox
2 siblings, 2 replies; 23+ messages in thread
From: David Hildenbrand @ 2025-09-11 8:14 UTC (permalink / raw)
To: Barry Song, Nicolas Geoffray, Lokesh Gidra, Lorenzo Stoakes,
Harry Yoo, Suren Baghdasaryan, Andrew Morton, Rik van Riel,
Liam R . Howlett, Vlastimil Babka, Jann Horn
Cc: Linux-MM, Kalesh Singh, SeongJae Park, Barry Song, Peter Xu
On 11.09.25 09:17, Barry Song wrote:
> Hi All,
>
> I’m aware that Lokesh started a discussion on the concurrency issue
> between userfaultfd_move and memory reclamation [1]. However, my
> concern is different, so I’m starting a separate discussion.
>
> In the process tree, many processes may share anon_vma->root, even if
> they don’t share the anon_vma itself. This causes serious lock contention
> between memory reclamation (which calls folio_referenced and try_to_unmap)
> and other processes calling fork(), exit(), mprotect(), etc.
>
> On Android, this issue becomes more severe since many processes are
> descendants of zygote.
>
> Memory reclamation path:
> folio_lock_anon_vma_read
>
> mprotect path:
> mprotect
> split_vma
> anon_vma_clone
>
> fork / copy_process path:
> copy_process
> dup_mmap
> anon_vma_fork
>
> exit path:
> exit_mmap
> free_pgtables
> unlink_anon_vmas
>
> To be honest, memory reclamation—especially folio_referenced()—is a
> problem. It is called very frequently and can block other important
> user threads waiting for the anon_vma root lock, causing UI lag.
>
> I have a rough idea: since the vast majority of anon folios are actually
> exclusive (I observed almost 98% of Android anon folios fall into this
> category), they don’t need to iterate the anon_vma tree. They belong to
> a single process, and even for rmap, it is per-process.
>
> I propose introducing a per-anon_vma lock. For exclusive folios whose
> anon_vma is not shared, we could use this per-anon_vma lock.
> folio_referenced declares that it will begin reading, and Lokesh’s
> folio_lock may also help maintain folios as exclusive, so I am
> somewhat in favor of his RFC. Any thread writing to such an anon_vma
> would take the per-vma write lock, and possibly also the anon_vma
> root write lock. If folio_referenced fails to declare the per-vma lock,
> it can fall back to the global anon_vma->root read mutex, similar to
> mmap_lock.
To summarize, are you proposing a similar locking scheme like we have
for mm vs. vma here for anon-vma root vs. anon-vma?
--
Cheers
David / dhildenb
* Re: [DISCUSSION] anon_vma root lock contention and per anon_vma lock
2025-09-11 7:17 [DISCUSSION] anon_vma root lock contention and per anon_vma lock Barry Song
2025-09-11 8:14 ` David Hildenbrand
@ 2025-09-11 8:28 ` Lorenzo Stoakes
2025-09-11 18:22 ` Jann Horn
2025-09-14 23:53 ` Matthew Wilcox
2 siblings, 1 reply; 23+ messages in thread
From: Lorenzo Stoakes @ 2025-09-11 8:28 UTC (permalink / raw)
To: Barry Song
Cc: Nicolas Geoffray, Lokesh Gidra, David Hildenbrand, Harry Yoo,
Suren Baghdasaryan, Andrew Morton, Rik van Riel,
Liam R . Howlett, Vlastimil Babka, Jann Horn, Linux-MM,
Kalesh Singh, SeongJae Park, Barry Song, Peter Xu
On Thu, Sep 11, 2025 at 07:17:01PM +1200, Barry Song wrote:
> Hi All,
>
> I’m aware that Lokesh started a discussion on the concurrency issue
> between userfaultfd_move and memory reclamation [1]. However, my
> concern is different, so I’m starting a separate discussion.
>
> In the process tree, many processes may share anon_vma->root, even if
> they don’t share the anon_vma itself. This causes serious lock contention
> between memory reclamation (which calls folio_referenced and try_to_unmap)
> and other processes calling fork(), exit(), mprotect(), etc.
Well, when you say lock contention, I mean - we need to have a lock that is held
over the entire fork tree, as we are cloning references to them.
This is at the anon_vma level - so the folio might be exclusive, but other
folios there might not be.
Note that I'm working on a radical rework of anon_vma's at the moment (time
is not in my favour given other tasks + review workload, but it _is_
happening).
So I'm interested to gather real world usecase data on how best to
implement things and this is interesting re: that.
My proposed approach would use something like ranged locks. It's a bit
fuzzy right now so definitely interested in putting some meat on that.
>
> On Android, this issue becomes more severe since many processes are
> descendants of zygote.
>
> Memory reclamation path:
> folio_lock_anon_vma_read
>
> mprotect path:
> mprotect
> split_vma
> anon_vma_clone
>
> fork / copy_process path:
> copy_process
> dup_mmap
> anon_vma_fork
>
> exit path:
> exit_mmap
> free_pgtables
> unlink_anon_vmas
>
> To be honest, memory reclamation—especially folio_referenced()—is a
> problem. It is called very frequently and can block other important
> user threads waiting for the anon_vma root lock, causing UI lag.
>
> I have a rough idea: since the vast majority of anon folios are actually
> exclusive (I observed almost 98% of Android anon folios fall into this
> category), they don’t need to iterate the anon_vma tree. They belong to
> a single process, and even for rmap, it is per-process.
>
> I propose introducing a per-anon_vma lock. For exclusive folios whose
> anon_vma is not shared, we could use this per-anon_vma lock.
I'm not sure how adding _more_ locks is going to reduce contention :) and
the anon_vma's are all linked to their parents etc. etc. so it's simply not
ok to hold one lock and not the others when making changes.
> folio_referenced declares that it will begin reading, and Lokesh’s
> folio_lock may also help maintain folios as exclusive, so I am
> somewhat in favor of his RFC. Any thread writing to such an anon_vma
Will reply on his latest re: Lokesh's approach.
> would take the per-vma write lock, and possibly also the anon_vma
> root write lock. If folio_referenced fails to declare the per-vma lock,
> it can fall back to the global anon_vma->root read mutex, similar to
> mmap_lock.
Again, we actually _need_ to hold a lock over this range. So you can't just
hold the root and a descendant; it has to be all of them.
>
> I haven’t carefully considered this or written any code yet—just a
> very rough idea. Sorry if it comes across as too naive.
It's fine, though I do wish we'd have a _little_ less workload this cycle,
can barely breathe at the moment, but that's not your fault ;)
I do wonder whether part of the problem here is keeping anon_vma's
connected to parents when they don't need to be.
Right now, even if you entirely CoW everything in a VMA, we are still
attached to parents with all the overhead. That's something I can look at.
But also perhaps worth considering how we approach the whole clone thing.
My (very early) anon_vma rework would do away with anon_vma_chain's
altogether and make forking simpler.
There'd be a per-mm object that connects to others via (probably) interval
tree edges for ranges that are CoW, so splitting for instance would be
easier.
Early days with it though...
>
> [1] https://lore.kernel.org/linux-mm/CA+EESO4Z6wtX7ZMdDHQRe5jAAS_bQ-POq5+4aDx5jh2DvY6UHg@mail.gmail.com/
>
> Thanks
> Barry
>
Cheers, Lorenzo
* Re: [DISCUSSION] anon_vma root lock contention and per anon_vma lock
2025-09-11 8:14 ` David Hildenbrand
@ 2025-09-11 8:34 ` Lorenzo Stoakes
2025-09-11 9:18 ` Barry Song
1 sibling, 0 replies; 23+ messages in thread
From: Lorenzo Stoakes @ 2025-09-11 8:34 UTC (permalink / raw)
To: David Hildenbrand
Cc: Barry Song, Nicolas Geoffray, Lokesh Gidra, Harry Yoo,
Suren Baghdasaryan, Andrew Morton, Rik van Riel,
Liam R . Howlett, Vlastimil Babka, Jann Horn, Linux-MM,
Kalesh Singh, SeongJae Park, Barry Song, Peter Xu
On Thu, Sep 11, 2025 at 10:14:26AM +0200, David Hildenbrand wrote:
> To summarize, are you proposing a similar locking scheme like we have for mm
> vs. vma here for anon-vma root vs. anon-vma?
I don't think that can work, because when holding the write lock we will drop
interval edges, not sure how we can possibly have concurrent readers at this
point.
And for write you'd have to analogously hold the root anon_vma write lock so
that'd be no faster.
I think really this problem speaks to fundamental issues with how anon_vma's are
implemented (that being the case is why I have such an interest in attacking
this problem).
I can try implementing some ideas in relation to this, as I do hope that I can
find incremental ways of doing so... - Barry I guess it's fairly easy to repro
some of these contention issues to test ideas out?
Cheers, Lorenzo
* Re: [DISCUSSION] anon_vma root lock contention and per anon_vma lock
2025-09-11 8:14 ` David Hildenbrand
2025-09-11 8:34 ` Lorenzo Stoakes
@ 2025-09-11 9:18 ` Barry Song
2025-09-11 10:47 ` Lorenzo Stoakes
1 sibling, 1 reply; 23+ messages in thread
From: Barry Song @ 2025-09-11 9:18 UTC (permalink / raw)
To: David Hildenbrand
Cc: Nicolas Geoffray, Lokesh Gidra, Lorenzo Stoakes, Harry Yoo,
Suren Baghdasaryan, Andrew Morton, Rik van Riel,
Liam R . Howlett, Vlastimil Babka, Jann Horn, Linux-MM,
Kalesh Singh, SeongJae Park, Barry Song, Peter Xu
On Thu, Sep 11, 2025 at 8:14 PM David Hildenbrand <david@redhat.com> wrote:
>
> On 11.09.25 09:17, Barry Song wrote:
> > Hi All,
> >
> > I’m aware that Lokesh started a discussion on the concurrency issue
> > between userfaultfd_move and memory reclamation [1]. However, my
> > concern is different, so I’m starting a separate discussion.
> >
> > In the process tree, many processes may share anon_vma->root, even if
> > they don’t share the anon_vma itself. This causes serious lock contention
> > between memory reclamation (which calls folio_referenced and try_to_unmap)
> > and other processes calling fork(), exit(), mprotect(), etc.
> >
> > On Android, this issue becomes more severe since many processes are
> > descendants of zygote.
> >
> > Memory reclamation path:
> > folio_lock_anon_vma_read
> >
> > mprotect path:
> > mprotect
> > split_vma
> > anon_vma_clone
> >
> > fork / copy_process path:
> > copy_process
> > dup_mmap
> > anon_vma_fork
> >
> > exit path:
> > exit_mmap
> > free_pgtables
> > unlink_anon_vmas
> >
> > To be honest, memory reclamation—especially folio_referenced()—is a
> > problem. It is called very frequently and can block other important
> > user threads waiting for the anon_vma root lock, causing UI lag.
> >
> > I have a rough idea: since the vast majority of anon folios are actually
> > exclusive (I observed almost 98% of Android anon folios fall into this
> > category), they don’t need to iterate the anon_vma tree. They belong to
> > a single process, and even for rmap, it is per-process.
> >
> > I propose introducing a per-anon_vma lock. For exclusive folios whose
> > anon_vma is not shared, we could use this per-anon_vma lock.
> > folio_referenced declares that it will begin reading, and Lokesh’s
> > folio_lock may also help maintain folios as exclusive, so I am
> > somewhat in favor of his RFC. Any thread writing to such an anon_vma
> > would take the per-vma write lock, and possibly also the anon_vma
> > root write lock. If folio_referenced fails to declare the per-vma lock,
> > it can fall back to the global anon_vma->root read mutex, similar to
> > mmap_lock.
>
> To summarize, are you proposing a similar locking scheme like we have
> for mm vs. vma here for anon-vma root vs. anon-vma?
Quite similar, but with the optimization limited only to exclusive anon
folios.
The main issue is in folio_referenced(), which frequently takes the
anon_vma root read lock. Complaints are likely due to memory
reclamation holding this read lock—if the process is preempted, it
becomes runnable but not running, while fork(), mprotect(), and
exit() may be forced to wait. I haven’t seen complaints about
writer–writer contention, so I don’t plan to optimize write-side
conflicts at this stage. In short: the problem is frequent rwsem read
locks that get preempted, blocking fork(), mprotect(), and exit().
If a folio is exclusive and we already hold folio_lock, it should
remain exclusive to a single process and a single vma. In that case,
such folios may not need an rmap tree at all for folio_referenced().
I’m mainly concerned about cases where a read lock is held but never a
write lock. As long as a folio is exclusive, stays exclusive during rmap,
and its rmap node remains present, there’s no need to modify the rmap
tree.
When changes are made, both the target being modified and the
anon_vma->root should be locked. We are not altering writer
behavior by requiring the anon_vma->root lock.
In short, I’m seeking a simple way to avoid taking anon_vma->root
during memory reclamation for folios mapped exclusively by a single
process.
Thanks
Barry
* Re: [DISCUSSION] anon_vma root lock contention and per anon_vma lock
2025-09-11 9:18 ` Barry Song
@ 2025-09-11 10:47 ` Lorenzo Stoakes
0 siblings, 0 replies; 23+ messages in thread
From: Lorenzo Stoakes @ 2025-09-11 10:47 UTC (permalink / raw)
To: Barry Song
Cc: David Hildenbrand, Nicolas Geoffray, Lokesh Gidra, Harry Yoo,
Suren Baghdasaryan, Andrew Morton, Rik van Riel,
Liam R . Howlett, Vlastimil Babka, Jann Horn, Linux-MM,
Kalesh Singh, SeongJae Park, Barry Song, Peter Xu
On Thu, Sep 11, 2025 at 09:18:17PM +1200, Barry Song wrote:
> On Thu, Sep 11, 2025 at 8:14 PM David Hildenbrand <david@redhat.com> wrote:
> >
> > On 11.09.25 09:17, Barry Song wrote:
> > > Hi All,
> > >
> > > I’m aware that Lokesh started a discussion on the concurrency issue
> > > between userfaultfd_move and memory reclamation [1]. However, my
> > > concern is different, so I’m starting a separate discussion.
> > >
> > > In the process tree, many processes may share anon_vma->root, even if
> > > they don’t share the anon_vma itself. This causes serious lock contention
> > > between memory reclamation (which calls folio_referenced and try_to_unmap)
> > > and other processes calling fork(), exit(), mprotect(), etc.
> > >
> > > On Android, this issue becomes more severe since many processes are
> > > descendants of zygote.
> > >
> > > Memory reclamation path:
> > > folio_lock_anon_vma_read
> > >
> > > mprotect path:
> > > mprotect
> > > split_vma
> > > anon_vma_clone
> > >
> > > fork / copy_process path:
> > > copy_process
> > > dup_mmap
> > > anon_vma_fork
> > >
> > > exit path:
> > > exit_mmap
> > > free_pgtables
> > > unlink_anon_vmas
> > >
> > > To be honest, memory reclamation—especially folio_referenced()—is a
> > > problem. It is called very frequently and can block other important
> > > user threads waiting for the anon_vma root lock, causing UI lag.
> > >
> > > I have a rough idea: since the vast majority of anon folios are actually
> > > exclusive (I observed almost 98% of Android anon folios fall into this
> > > category), they don’t need to iterate the anon_vma tree. They belong to
> > > a single process, and even for rmap, it is per-process.
> > >
> > > I propose introducing a per-anon_vma lock. For exclusive folios whose
> > > anon_vma is not shared, we could use this per-anon_vma lock.
> > > folio_referenced declares that it will begin reading, and Lokesh’s
> > > folio_lock may also help maintain folios as exclusive, so I am
> > > somewhat in favor of his RFC. Any thread writing to such an anon_vma
> > > would take the per-vma write lock, and possibly also the anon_vma
> > > root write lock. If folio_referenced fails to declare the per-vma lock,
> > > it can fall back to the global anon_vma->root read mutex, similar to
> > > mmap_lock.
> >
> > To summarize, are you proposing a similar locking scheme like we have
> > for mm vs. vma here for anon-vma root vs. anon-vma?
>
> Quite similar, but with the optimization limited only to exclusive anon
> folios.
>
> The main issue is in folio_referenced(), which frequently takes the
> anon_vma root read lock. Complaints are likely due to memory
> reclamation holding this read lock—if the process is preempted, it
> becomes runnable but not running, while fork(), mprotect(), and
> exit() may be forced to wait. I haven’t seen complaints about
> writer–writer contention, so I don’t plan to optimize write-side
> conflicts at this stage. In short: the problem is frequent rwsem read
> locks that get preempted, blocking fork(), mprotect(), and exit().
Right.
>
> If a folio is exclusive and we already hold folio_lock, it should
> remain exclusive to a single process and a single vma. In that case,
> such folios may not need an rmap tree at all for folio_referenced().
We still need exclusion at root level to prevent concurrent removal of
interval tree links at VMA level.
>
> I’m mainly concerned about cases where a read lock is held but never a
> write lock. As long as a folio is exclusive, stays exclusive during rmap,
> and its rmap node remains present, there’s no need to modify the rmap
> tree.
A folio being exclusively mapped doesn't mean all folios in the VMA are
exclusively mapped.
And it's irrelevant really because even if they were, we don't disconnect a
fully CoW'd VMA from ancestors (but this is something maybe I can look at
changing).
>
> When changes are made, both the target being modified and the
> anon_vma->root should be locked. We are not altering writer
> behavior by requiring the anon_vma->root lock.
So if we had per-anon_vma locks for this, writers would need to go to the
root, take root lock (which has no bearing on your readers, only exclusive
to other writers), traverse all anon_vma's in tree and (assuming using
their individual rwsem's) then grab the write lock for each.
Hard to see how this wouldn't have an impact on performance.
If we tried something similar to VMA locks, with seqnums etc. we would need
to have a solid foundation upon which to do this, that is a place to store
e.g. the equivalent of the vma_writer_wait and mm_lock_seq fields in
mm_struct.
We do keep anon_vma's around if they have children, so this would be a
place, but now we're adding _even more_ fields used in one place only to
_all_ anon_vma's.
I also suspect that having potentially a great many anon_vma's that you
have to traverse _every single time_ you need to merge/split/etc. etc. is
going to be costly, and possibly in some unexpected ways because once you
start having lots of forking all these operations will take way longer.
Even with seqnums etc. (with memory fences coming from atomic operations),
any such VMA lock-like implementation would be hugely complicated and
delicate, and in fact more complicated and delicate than what exists.
We've had a great deal of work done in the past relating to lock scalability
with larger fork graphs, because the initial anon_vma implementations had
issues with this.
>
> In short, I’m seeking a simple way to avoid taking anon_vma->root
> during memory reclamation for folios mapped exclusively by a single
> process.
I'm afraid there is no simple solution here.
Again overall, every time I look into anon_vma stuff, I end up concluding
it requires a more significant, fundamental rework, which is why I have
started on doing this in the bg.
I do suggest that we fold this concept into that work, as otherwise we're
going to step on each other's toes and I just think trying to bolt
something like this on to the existing implementation is not going to work
well.
I may try to solidify my ideas a bit and do a presentation on anon_vma at
LPC, but it's a question of time re: that.
>
> Thanks
> Barry
>
Cheers, Lorenzo
* Re: [DISCUSSION] anon_vma root lock contention and per anon_vma lock
2025-09-11 8:28 ` Lorenzo Stoakes
@ 2025-09-11 18:22 ` Jann Horn
2025-09-12 4:49 ` Lorenzo Stoakes
0 siblings, 1 reply; 23+ messages in thread
From: Jann Horn @ 2025-09-11 18:22 UTC (permalink / raw)
To: Lorenzo Stoakes
Cc: Barry Song, Nicolas Geoffray, Lokesh Gidra, David Hildenbrand,
Harry Yoo, Suren Baghdasaryan, Andrew Morton, Rik van Riel,
Liam R . Howlett, Vlastimil Babka, Linux-MM, Kalesh Singh,
SeongJae Park, Barry Song, Peter Xu
On Thu, Sep 11, 2025 at 10:29 AM Lorenzo Stoakes
<lorenzo.stoakes@oracle.com> wrote:
> On Thu, Sep 11, 2025 at 07:17:01PM +1200, Barry Song wrote:
> > Hi All,
> >
> > I’m aware that Lokesh started a discussion on the concurrency issue
> > between userfaultfd_move and memory reclamation [1]. However, my
> > concern is different, so I’m starting a separate discussion.
> >
> > In the process tree, many processes may share anon_vma->root, even if
> > they don’t share the anon_vma itself. This causes serious lock contention
> > between memory reclamation (which calls folio_referenced and try_to_unmap)
> > and other processes calling fork(), exit(), mprotect(), etc.
>
> Well, when you say lock contention, I mean - we need to have a lock that is held
> over the entire fork tree, as we are cloning references to them.
>
> This is at the anon_vma level - so the folio might be exclusive, but other
> folios there might not be.
>
> Note that I'm working on a radical rework of anon_vma's at the moment (time
> is not in my favour given other tasks + review workload, but it _is_
> happening).
>
> So I'm interested to gather real world usecase data on how best to
> implement things and this is interesting re: that.
>
> My proposed approach would use something like ranged locks. It's a bit
> fuzzy right now so definitely interested in putting some meat on that.
>
> >
> > On Android, this issue becomes more severe since many processes are
> > descendants of zygote.
> >
> > Memory reclamation path:
> > folio_lock_anon_vma_read
> >
> > mprotect path:
> > mprotect
> > split_vma
> > anon_vma_clone
> >
> > fork / copy_process path:
> > copy_process
> > dup_mmap
> > anon_vma_fork
> >
> > exit path:
> > exit_mmap
> > free_pgtables
> > unlink_anon_vmas
> >
> > To be honest, memory reclamation—especially folio_referenced()—is a
> > problem. It is called very frequently and can block other important
> > user threads waiting for the anon_vma root lock, causing UI lag.
> >
> > I have a rough idea: since the vast majority of anon folios are actually
> > exclusive (I observed almost 98% of Android anon folios fall into this
> > category), they don’t need to iterate the anon_vma tree. They belong to
> > a single process, and even for rmap, it is per-process.
> >
> > I propose introducing a per-anon_vma lock. For exclusive folios whose
> > anon_vma is not shared, we could use this per-anon_vma lock.
>
> I'm not sure how adding _more_ locks is going to reduce contention :) and
> the anon_vma's are all linked to their parents etc. etc. so it's simply not
> ok to hold one lock and not the others when making changes.
folio_referenced() only wants to look at mappings of a single folio,
right? And it only uses the anon_vma of that folio? So as long as we
can guarantee that the folio can't concurrently change which anon_vma
it is associated with, folio_referenced() really only cares about the
specific anon_vma that the folio is associated with, and the anon_vmas
of other folios in the VMAs we traverse are irrelevant?
Basically I think paths that come through the rmap would usually be
able to use such a fine-grained lock, while paths that come through
the MM would often have to use more coarse locking.
Of course paths requiring coarse locking (like for splitting VMAs and
such) would then have to take a pile of locks, one lock per anon_vma
associated with a given VMA. That part shouldn't be overly complicated
though, we'd mainly have to make sure that there is a consistent lock
ordering (such as "if you want to lock multiple anon_vmas, you have to
lock the root anon_vma before the others").
* Re: [DISCUSSION] anon_vma root lock contention and per anon_vma lock
2025-09-11 18:22 ` Jann Horn
@ 2025-09-12 4:49 ` Lorenzo Stoakes
2025-09-12 11:37 ` Jann Horn
0 siblings, 1 reply; 23+ messages in thread
From: Lorenzo Stoakes @ 2025-09-12 4:49 UTC (permalink / raw)
To: Jann Horn
Cc: Barry Song, Nicolas Geoffray, Lokesh Gidra, David Hildenbrand,
Harry Yoo, Suren Baghdasaryan, Andrew Morton, Rik van Riel,
Liam R . Howlett, Vlastimil Babka, Linux-MM, Kalesh Singh,
SeongJae Park, Barry Song, Peter Xu
On Thu, Sep 11, 2025 at 08:22:13PM +0200, Jann Horn wrote:
> On Thu, Sep 11, 2025 at 10:29 AM Lorenzo Stoakes
> <lorenzo.stoakes@oracle.com> wrote:
> > On Thu, Sep 11, 2025 at 07:17:01PM +1200, Barry Song wrote:
> > > Hi All,
> > >
> > > I’m aware that Lokesh started a discussion on the concurrency issue
> > > between userfaultfd_move and memory reclamation [1]. However, my
> > > concern is different, so I’m starting a separate discussion.
> > >
> > > In the process tree, many processes may share anon_vma->root, even if
> > > they don’t share the anon_vma itself. This causes serious lock contention
> > > between memory reclamation (which calls folio_referenced and try_to_unmap)
> > > and other processes calling fork(), exit(), mprotect(), etc.
> >
> > Well, when you say lock contention, I mean - we need to have a lock that is held
> > over the entire fork tree, as we are cloning references to them.
> >
> > This is at the anon_vma level - so the folio might be exclusive, but other
> > folios there might not be.
> >
> > Note that I'm working on a radical rework of anon_vma's at the moment (time
> > is not in my favour given other tasks + review workload, but it _is_
> > happening).
> >
> > So I'm interested to gather real world usecase data on how best to
> > implement things and this is interesting re: that.
> >
> > My proposed approach would use something like ranged locks. It's a bit
> > fuzzy right now so definitely interested in putting some meat on that.
> >
> > >
> > > On Android, this issue becomes more severe since many processes are
> > > descendants of zygote.
> > >
> > > Memory reclamation path:
> > > folio_lock_anon_vma_read
> > >
> > > mprotect path:
> > > mprotect
> > > split_vma
> > > anon_vma_clone
> > >
> > > fork / copy_process path:
> > > copy_process
> > > dup_mmap
> > > anon_vma_fork
> > >
> > > exit path:
> > > exit_mmap
> > > free_pgtables
> > > unlink_anon_vmas
> > >
> > > To be honest, memory reclamation—especially folio_referenced()—is a
> > > problem. It is called very frequently and can block other important
> > > user threads waiting for the anon_vma root lock, causing UI lag.
> > >
> > > I have a rough idea: since the vast majority of anon folios are actually
> > > exclusive (I observed almost 98% of Android anon folios fall into this
> > > category), they don’t need to iterate the anon_vma tree. They belong to
> > > a single process, and even for rmap, it is per-process.
> > >
> > > I propose introducing a per-anon_vma lock. For exclusive folios whose
> > > anon_vma is not shared, we could use this per-anon_vma lock.
> >
> > I'm not sure how adding _more_ locks is going to reduce contention :) and
> > the anon_vma's are all linked to their parents etc. etc. so it's simply not
> > ok to hold one lock and not the others when making changes.
>
> folio_referenced() only wants to look at mappings of a single folio,
> right? And it only uses the anon_vma of that folio? So as long as we
> can guarantee that the folio can't concurrently change which anon_vma
> it is associated with, folio_referenced() really only cares about the
> specific anon_vma that the folio is associated with, and the anon_vmas
> of other folios in the VMAs we traverse are irrelevant?
Right yeah, true. But the AVC's link you to 'related' VMA's which are
across the hierarchy.
I think really the refined way of saying this is - yes, you could, but
you're then putting the weight on the VMA side, and the VMA side is
being invoked _all the time_.
>
> Basically I think paths that come through the rmap would usually be
> able to use such a fine-grained lock, while paths that come through
> the MM would often have to use more coarse locking.
They'd have to use _both_ sets of locking.
And this is on every single fork, merge, etc. etc.
So we'd reduce lock acquisition from rmap end, and significantly increase
it, scaling with 'how far forked we are' ;)
So this is the fundamental issue.
>
> Of course paths requiring coarse locking (like for splitting VMAs and
> such) would then have to take a pile of locks, one lock per anon_vma
> associated with a given VMA. That part shouldn't be overly complicated
> though, we'd mainly have to make sure that there is a consistent lock
> ordering (such as "if you want to lock multiple anon_vmas, you have to
> lock the root anon_vma before the others").
>
I mean already this lock ordering is not so fun :)
I suspect there'd be other issues.
But perhaps a way forward is, since I'm working in this area already, to
try and hack together an RFC which we could use to figure out how heavy the
cost is...
So let me try and do that.
* Re: [DISCUSSION] anon_vma root lock contention and per anon_vma lock
2025-09-12 4:49 ` Lorenzo Stoakes
@ 2025-09-12 11:37 ` Jann Horn
2025-09-12 11:56 ` Lorenzo Stoakes
0 siblings, 1 reply; 23+ messages in thread
From: Jann Horn @ 2025-09-12 11:37 UTC (permalink / raw)
To: Lorenzo Stoakes
Cc: Barry Song, Nicolas Geoffray, Lokesh Gidra, David Hildenbrand,
Harry Yoo, Suren Baghdasaryan, Andrew Morton, Rik van Riel,
Liam R . Howlett, Vlastimil Babka, Linux-MM, Kalesh Singh,
SeongJae Park, Barry Song, Peter Xu
On Fri, Sep 12, 2025 at 6:49 AM Lorenzo Stoakes
<lorenzo.stoakes@oracle.com> wrote:
> On Thu, Sep 11, 2025 at 08:22:13PM +0200, Jann Horn wrote:
> > On Thu, Sep 11, 2025 at 10:29 AM Lorenzo Stoakes
> > <lorenzo.stoakes@oracle.com> wrote:
> > > On Thu, Sep 11, 2025 at 07:17:01PM +1200, Barry Song wrote:
> > > > Hi All,
> > > >
> > > > I’m aware that Lokesh started a discussion on the concurrency issue
> > > > between usefaultfd_move and memory reclamation [1]. However, my
> > > > concern is different, so I’m starting a separate discussion.
> > > >
> > > > In the process tree, many processes may share anon_vma->root, even if
> > > > they don’t share the anon_vma itself. This causes serious lock contention
> > > > between memory reclamation (which calls folio_referenced and try_to_unmap)
> > > > and other processes calling fork(), exit(), mprotect(), etc.
> > >
> > > Well, when you say lock contention, I mean - we need to have a lock that is held
> > > over the entire fork tree, as we are cloning references to them.
> > >
> > > This is at the anon_vma level - so the folio might be exclusive, but other
> > > folios there might not be.
> > >
> > > Note that I'm working on a radical rework of anon_vma's at the moment (time
> > > is not in my favour given other tasks + review workload, but it _is_
> > > happening).
> > >
> > > So I'm interested to gather real world usecase data on how best to
> > > implement things and this is interesting re: that.
> > >
> > > My proposed approach would use something like ranged locks. It's a bit
> > > fuzzy right now so definitely interested in putting some meat on that.
> > >
> > > >
> > > > On Android, this issue becomes more severe since many processes are
> > > > descendants of zygote.
> > > >
> > > > Memory reclamation path:
> > > > folio_lock_anon_vma_read
> > > >
> > > > mprotect path:
> > > > mprotect
> > > > split_vma
> > > > anon_vma_clone
> > > >
> > > > fork / copy_process path:
> > > > copy_process
> > > > dup_mmap
> > > > anon_vma_fork
> > > >
> > > > exit path:
> > > > exit_mmap
> > > > free_pgtables
> > > > unlink_anon_vmas
> > > >
> > > > To be honest, memory reclamation—especially folio_referenced()—is a
> > > > problem. It is called very frequently and can block other important
> > > > user threads waiting for the anon_vma root lock, causing UI lag.
> > > >
> > > > I have a rough idea: since the vast majority of anon folios are actually
> > > > exclusive (I observed almost 98% of Android anon folios fall into this
> > > > category), they don’t need to iterate the anon_vma tree. They belong to
> > > > a single process, and even for rmap, it is per-process.
> > > >
> > > > I propose introducing a per-anon_vma lock. For exclusive folios whose
> > > > anon_vma is not shared, we could use this per-anon_vma lock.
> > >
> > > I'm not sure how adding _more_ locks is going to reduce contention :) and
> > > the anon_vma's are all linked to their parents etc. etc. so it's simply not
> > > ok to hold one lock and not the others when making changes.
> >
> > folio_referenced() only wants to look at mappings of a single folio,
> > right? And it only uses the anon_vma of that folio? So as long as we
> > can guarantee that the folio can't concurrently change which anon_vma
> > it is associated with, folio_referenced() really only cares about the
> > specific anon_vma that the folio is associated with, and the anon_vmas
> > of other folios in the VMAs we traverse are irrelevant?
>
> Right yeah, true. But the AVCs link you to 'related' VMAs which are
> across the hierarchy.
>
> I think really the refined way of saying this is - yes, you could, but
> you're then putting the weight on the VMA side, and the VMA side is
> being invoked _all the time_.
Ah, fair.
I guess one approach would be to do something hazard-pointer-ish? Like
a semaphore-like thing in the root anon_vma that contains a normal
reader count, a hazard-pointer reader count (limited to some small
number like 2 or 4), and a writer count (up to 1), combined with a
limited number of hazard pointer slots; where a writer can ignore the
hazard-pointer reader count if none of the hazard pointers match any
anon_vma it wants to look at (but readers still always have to wait
for writers). The write-locking fastpath would just be a normal
"atomically add N if zero" just like with normal locking, and only the
case where there actually are hazard-pointer readers would make the
locking more expensive...
But inventing more artisanal locking schemes is probably not a great idea...
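Purely to make the shape concrete, here's a rough userspace toy model of
that scheme (all names are hypothetical and C11 atomics stand in for the
real kernel primitives; a sketch, not a proposal):

```c
#include <stdatomic.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>
#include <assert.h>

#define HP_SLOTS 4

/* Toy model of the hazard-pointer-ish root lock sketched above. */
struct hp_rwlock {
	atomic_int writer;			/* 0 or 1 */
	atomic_int readers;			/* normal root readers */
	atomic_uintptr_t hazard[HP_SLOTS];	/* anon_vmas pinned by hp-readers */
};

/* hp-reader: pin one specific anon_vma instead of read-locking the root.
 * Returns the slot index, or -1 meaning "fall back to the root lock". */
static int hp_read_lock(struct hp_rwlock *l, void *anon_vma)
{
	for (int i = 0; i < HP_SLOTS; i++) {
		uintptr_t expected = 0;
		if (atomic_compare_exchange_strong(&l->hazard[i], &expected,
						   (uintptr_t)anon_vma)) {
			/* re-check: readers still always wait for writers */
			if (atomic_load(&l->writer)) {
				atomic_store(&l->hazard[i], 0);
				return -1;
			}
			return i;
		}
	}
	return -1;	/* all slots busy */
}

static void hp_read_unlock(struct hp_rwlock *l, int slot)
{
	atomic_store(&l->hazard[slot], 0);
}

/* writer: the fastpath is a plain "CAS if zero"; hp-readers are only
 * a conflict if their pinned anon_vma is one the writer will touch. */
static bool hp_write_trylock(struct hp_rwlock *l, void **touched, size_t n)
{
	int expected = 0;

	if (!atomic_compare_exchange_strong(&l->writer, &expected, 1))
		return false;
	if (atomic_load(&l->readers)) {
		atomic_store(&l->writer, 0);
		return false;
	}
	for (int i = 0; i < HP_SLOTS; i++) {
		uintptr_t h = atomic_load(&l->hazard[i]);

		if (!h)
			continue;
		for (size_t j = 0; j < n; j++) {
			if (h == (uintptr_t)touched[j]) {
				atomic_store(&l->writer, 0);
				return false;	/* conflict: must wait */
			}
		}
	}
	return true;	/* hp-readers on unrelated anon_vmas ignored */
}

static void hp_write_unlock(struct hp_rwlock *l)
{
	atomic_store(&l->writer, 0);
}
```

So a writer on an unrelated subtree sails past a pinned reader, while a
writer touching the pinned anon_vma still has to wait.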
* Re: [DISCUSSION] anon_vma root lock contention and per anon_vma lock
2025-09-12 11:37 ` Jann Horn
@ 2025-09-12 11:56 ` Lorenzo Stoakes
0 siblings, 0 replies; 23+ messages in thread
From: Lorenzo Stoakes @ 2025-09-12 11:56 UTC (permalink / raw)
To: Jann Horn
Cc: Barry Song, Nicolas Geoffray, Lokesh Gidra, David Hildenbrand,
Harry Yoo, Suren Baghdasaryan, Andrew Morton, Rik van Riel,
Liam R . Howlett, Vlastimil Babka, Linux-MM, Kalesh Singh,
SeongJae Park, Barry Song, Peter Xu
On Fri, Sep 12, 2025 at 01:37:35PM +0200, Jann Horn wrote:
> On Fri, Sep 12, 2025 at 6:49 AM Lorenzo Stoakes
> > > folio_referenced() only wants to look at mappings of a single folio,
> > > right? And it only uses the anon_vma of that folio? So as long as we
> > > can guarantee that the folio can't concurrently change which anon_vma
> > > it is associated with, folio_referenced() really only cares about the
> > > specific anon_vma that the folio is associated with, and the anon_vmas
> > > of other folios in the VMAs we traverse are irrelevant?
> >
> > Right yeah, true. But the AVCs link you to 'related' VMAs which are
> > across the hierarchy.
> >
> > I think really the refined way of saying this is - yes, you could, but
> > you're then putting the weight on the VMA side, and the VMA side is
> > being invoked _all the time_.
>
> Ah, fair.
>
> I guess one approach would be to do something hazard-pointer-ish? Like
> a semaphore-like thing in the root anon_vma that contains a normal
> reader count, a hazard-pointer reader count (limited to some small
> number like 2 or 4), and a writer count (up to 1), combined with a
> limited number of hazard pointer slots; where a writer can ignore the
> hazard-pointer reader count if none of the hazard pointers match any
> anon_vma it wants to look at (but readers still always have to wait
> for writers). The write-locking fastpath would just be a normal
> "atomically add N if zero" just like with normal locking, and only the
> case where there actually are hazard-pointer readers would make the
> locking more expensive...
Ohhh nice idea! Will look into that :)
>
> But inventing more artisanal locking schemes is probably not a great idea...
>
Well, sometimes it's valid!
I will come up with some 'stupid' solution first so we can analyse it and
shoot out an RFC.
Also thanks Barry for raising this - this is an important issue and we do
need to figure out a way to attack it.
I think a combination of incremental work with the current anon_vma impl
and also adjusting the design for the new anon_vma approach I'm working
on is the way forward here.
Anyway will send out an RFC soon! :)
Cheers, Lorenzo
* Re: [DISCUSSION] anon_vma root lock contention and per anon_vma lock
2025-09-11 7:17 [DISCUSSION] anon_vma root lock contention and per anon_vma lock Barry Song
2025-09-11 8:14 ` David Hildenbrand
2025-09-11 8:28 ` Lorenzo Stoakes
@ 2025-09-14 23:53 ` Matthew Wilcox
2025-09-15 0:23 ` Barry Song
2025-09-15 8:57 ` Lorenzo Stoakes
2 siblings, 2 replies; 23+ messages in thread
From: Matthew Wilcox @ 2025-09-14 23:53 UTC (permalink / raw)
To: Barry Song
Cc: Nicolas Geoffray, Lokesh Gidra, David Hildenbrand,
Lorenzo Stoakes, Harry Yoo, Suren Baghdasaryan, Andrew Morton,
Rik van Riel, Liam R . Howlett, Vlastimil Babka, Jann Horn,
Linux-MM, Kalesh Singh, SeongJae Park, Barry Song, Peter Xu
On Thu, Sep 11, 2025 at 07:17:01PM +1200, Barry Song wrote:
> In the process tree, many processes may share anon_vma->root, even if
> they don’t share the anon_vma itself. This causes serious lock contention
> between memory reclamation (which calls folio_referenced and try_to_unmap)
> and other processes calling fork(), exit(), mprotect(), etc.
>
> On Android, this issue becomes more severe since many processes are
> descendants of zygote.
I'm not nearly as familiar with anon_vma as, well, the rest of you
are. As I understand this situation, usually after fork(), a process
calls exec() and the VMAs evaporate. Android is different in that after
the zygotecalls fork(), there is no exec() and so the VMAs stay COW.
I wonder if we could fix this by adding a new syscall:
mremap(addr, size, size, MREMAP_COW_NOW);
That would create a new VMA that contains the COWed pages from the
old VMA, but crucially no longer attached to the anon_vma root of
the zygote. You wouldn't want to call this for every VMA, of course.
Just the ones which are likely to be fully COWed.
Maybe this isn't practical, but I thought it worth suggesting.
* Re: [DISCUSSION] anon_vma root lock contention and per anon_vma lock
2025-09-14 23:53 ` Matthew Wilcox
@ 2025-09-15 0:23 ` Barry Song
2025-09-15 1:47 ` Suren Baghdasaryan
2025-09-15 2:50 ` Matthew Wilcox
2025-09-15 8:57 ` Lorenzo Stoakes
1 sibling, 2 replies; 23+ messages in thread
From: Barry Song @ 2025-09-15 0:23 UTC (permalink / raw)
To: Matthew Wilcox
Cc: Nicolas Geoffray, Lokesh Gidra, David Hildenbrand,
Lorenzo Stoakes, Harry Yoo, Suren Baghdasaryan, Andrew Morton,
Rik van Riel, Liam R . Howlett, Vlastimil Babka, Jann Horn,
Linux-MM, Kalesh Singh, SeongJae Park, Barry Song, Peter Xu
On Mon, Sep 15, 2025 at 7:53 AM Matthew Wilcox <willy@infradead.org> wrote:
>
> On Thu, Sep 11, 2025 at 07:17:01PM +1200, Barry Song wrote:
> > In the process tree, many processes may share anon_vma->root, even if
> > they don’t share the anon_vma itself. This causes serious lock contention
> > between memory reclamation (which calls folio_referenced and try_to_unmap)
> > and other processes calling fork(), exit(), mprotect(), etc.
> >
> > On Android, this issue becomes more severe since many processes are
> > descendants of zygote.
>
> I'm not nearly as familiar with anon_vma as, well, the rest of you
> are. As I understand this situation, usually after fork(), a process
> calls exec() and the VMAs evaporate. Android is different in that after
> the zygote calls fork(), there is no exec() and so the VMAs stay COW.
>
> I wonder if we could fix this by adding a new syscall:
>
> mremap(addr, size, size, MREMAP_COW_NOW);
>
> That would create a new VMA that contains the COWed pages from the
> old VMA, but crucially no longer attached to the anon_vma root of
> the zygote. You wouldn't want to call this for every VMA, of course.
> Just the ones which are likely to be fully COWed.
>
> Maybe this isn't practical, but I thought it worth suggesting.
Thank you for the suggestion, Matthew.
Lorenzo suggested possibly unlinking the child anon_vma from the root once all
folios have been CoW-ed:
"Right now, even if you entirely CoW everything in a VMA, we are still
attached to parents with all the overhead. That's something I can look at.
"
My concern is that it’s difficult to determine whether a VMA has been completely
CoW-ed, and a single shared folio would prevent the unlink.
So I’m not sure this approach would work.
You seem to be proposing a forced CoW as a way to safely unlink from the root.
A side effect is the potential for sudden, heavy memory allocation,
whereas CoW lets asynchronous tasks such as kswapd work concurrently.
Another issue is the extra memory use from folios that could have been
shared but aren’t—likely minor on Android, since only a small portion
of memory is actually shared, based on our observations.
Calling mremap for each VMA might be difficult. Something applied to the
whole process could be more practical—similar to exec, but only
performing CoW and unlinking the anon_vma root.
On the other hand, most anon folios are not actually shared, yet
folio_referenced and try_to_unmap still take the entire root lock.
In reality, they only care about their own node—no need to iterate
the whole tree.
I still think optimizing from that angle could be a better entry point :-)
Thanks
Barry
* Re: [DISCUSSION] anon_vma root lock contention and per anon_vma lock
2025-09-15 0:23 ` Barry Song
@ 2025-09-15 1:47 ` Suren Baghdasaryan
2025-09-15 8:41 ` Lorenzo Stoakes
2025-09-15 2:50 ` Matthew Wilcox
1 sibling, 1 reply; 23+ messages in thread
From: Suren Baghdasaryan @ 2025-09-15 1:47 UTC (permalink / raw)
To: Barry Song
Cc: Matthew Wilcox, Nicolas Geoffray, Lokesh Gidra,
David Hildenbrand, Lorenzo Stoakes, Harry Yoo, Andrew Morton,
Rik van Riel, Liam R . Howlett, Vlastimil Babka, Jann Horn,
Linux-MM, Kalesh Singh, SeongJae Park, Barry Song, Peter Xu
On Sun, Sep 14, 2025 at 5:23 PM Barry Song <21cnbao@gmail.com> wrote:
>
> On Mon, Sep 15, 2025 at 7:53 AM Matthew Wilcox <willy@infradead.org> wrote:
> >
> > On Thu, Sep 11, 2025 at 07:17:01PM +1200, Barry Song wrote:
> > > In the process tree, many processes may share anon_vma->root, even if
> > > they don’t share the anon_vma itself. This causes serious lock contention
> > > between memory reclamation (which calls folio_referenced and try_to_unmap)
> > > and other processes calling fork(), exit(), mprotect(), etc.
> > >
> > > On Android, this issue becomes more severe since many processes are
> > > descendants of zygote.
> >
> > I'm not nearly as familiar with anon_vma as, well, the rest of you
> > are. As I understand this situation, usually after fork(), a process
> > calls exec() and the VMAs evaporate. Android is different in that after
> > the zygote calls fork(), there is no exec() and so the VMAs stay COW.
> >
> > I wonder if we could fix this by adding a new syscall:
> >
> > mremap(addr, size, size, MREMAP_COW_NOW);
> >
> > That would create a new VMA that contains the COWed pages from the
> > old VMA, but crucially no longer attached to the anon_vma root of
> > the zygote. You wouldn't want to call this for every VMA, of course.
> > Just the ones which are likely to be fully COWed.
> >
> > Maybe this isn't practical, but I thought it worth suggesting.
>
> Thank you for the suggestion, Matthew.
>
> Lorenzo suggested possibly unlinking the child anon_vma from the root once all
> folios have been CoW-ed:
>
> "Right now, even if you entirely CoW everything in a VMA, we are still
> attached to parents with all the overhead. That's something I can look at.
> "
>
> My concern is that it’s difficult to determine whether a VMA has been completely
> CoW-ed, and a single shared folio would prevent the unlink.
> So I’m not sure this approach would work.
>
> You seem to be proposing a forced CoW as a way to safely unlink from the root.
>
> A side effect is the potential for sudden, heavy memory allocation,
> whereas CoW lets asynchronous tasks such as kswapd work concurrently.
>
> Another issue is the extra memory use from folios that could have been
> shared but aren’t—likely minor on Android, since only a small portion
> of memory is actually shared, based on our observations.
>
> Calling mremap for each VMA might be difficult. Something applied to the
> whole process could be more practical—similar to exec, but only
> performing CoW and unlinking the anon_vma root.
>
> On the other hand, most anon folios are not actually shared, yet
> folio_referenced and try_to_unmap still take the entire root lock.
> In reality, they only care about their own node—no need to iterate
> the whole tree.
>
> I still think optimizing from that angle could be a better entry point :-)
Hi Barry,
Thanks for raising this issue. I think technically the optimization
you are suggesting is possible and it does look similar to per-vma
locking in that:
- The reader tries to read-lock a specific interval and on failure
falls back to locking the entire tree (root);
- The writer write-locks the root first and then one or more
individual nodes in the tree. Once the writer is done it unlocks all
the nodes it locked and then the root.
But as Lorenzo pointed out, this will not be pretty, as it adds yet
another lock and more locking/unlocking into the writer path.
In the case of the pagefault path, improving its performance at the
expense of the writers was not questioned due to pagefault being such
a hot path. I'm not sure reclaim will be given the same benefit...
Something to consider.
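For illustration only, the reader-fallback/writer-ordering above could be
modeled in userspace something like this (hypothetical names; C11 atomics
with try-lock semantics standing in for the kernel's rwsems, so this is a
sketch of the ordering rules, not a real sleeping lock):

```c
#include <stdatomic.h>
#include <stdbool.h>
#include <stddef.h>
#include <assert.h>

struct model_lock { atomic_int state; };	/* >0 readers, -1 writer */

static bool try_read(struct model_lock *l)
{
	int s = atomic_load(&l->state);

	while (s >= 0)
		if (atomic_compare_exchange_weak(&l->state, &s, s + 1))
			return true;
	return false;
}

static bool try_write(struct model_lock *l)
{
	int expected = 0;
	return atomic_compare_exchange_strong(&l->state, &expected, -1);
}

static void unlock_read(struct model_lock *l)  { atomic_fetch_sub(&l->state, 1); }
static void unlock_write(struct model_lock *l) { atomic_store(&l->state, 0); }

struct model_anon_vma {
	struct model_lock lock;		/* hypothetical per-anon_vma lock */
	struct model_anon_vma *root;	/* root of the fork tree */
};

/* Reader (e.g. folio_referenced on an exclusive folio): try the
 * fine-grained lock first, fall back to the whole tree via the root. */
static bool reader_lock(struct model_anon_vma *av)
{
	if (try_read(&av->lock))
		return true;			/* fast path */
	while (!try_read(&av->root->lock))
		;				/* slow path: the root */
	return false;
}

/* Writer (fork/mprotect/exit): root first, then each node it modifies,
 * always in that order, so the lock ordering stays consistent. */
static void writer_lock(struct model_anon_vma *root,
			struct model_anon_vma **nodes, int n)
{
	while (!try_write(&root->lock))
		;
	for (int i = 0; i < n; i++)
		while (!try_write(&nodes[i]->lock))
			;
}

static void writer_unlock(struct model_anon_vma *root,
			  struct model_anon_vma **nodes, int n)
{
	for (int i = n - 1; i >= 0; i--)
		unlock_write(&nodes[i]->lock);
	unlock_write(&root->lock);
}
```

Which also makes the cost visible: every writer now takes one lock per
node it touches on top of the root.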
In any case, I'm very interested in continuing this discussion and
would love to test a POC or discuss this at LPC.
Thanks,
Suren.
>
> Thanks
> Barry
* Re: [DISCUSSION] anon_vma root lock contention and per anon_vma lock
2025-09-15 0:23 ` Barry Song
2025-09-15 1:47 ` Suren Baghdasaryan
@ 2025-09-15 2:50 ` Matthew Wilcox
2025-09-15 5:17 ` David Hildenbrand
2025-09-15 9:22 ` Lorenzo Stoakes
1 sibling, 2 replies; 23+ messages in thread
From: Matthew Wilcox @ 2025-09-15 2:50 UTC (permalink / raw)
To: Barry Song
Cc: Nicolas Geoffray, Lokesh Gidra, David Hildenbrand,
Lorenzo Stoakes, Harry Yoo, Suren Baghdasaryan, Andrew Morton,
Rik van Riel, Liam R . Howlett, Vlastimil Babka, Jann Horn,
Linux-MM, Kalesh Singh, SeongJae Park, Barry Song, Peter Xu
On Mon, Sep 15, 2025 at 08:23:38AM +0800, Barry Song wrote:
> > I wonder if we could fix this by adding a new syscall:
> >
> > mremap(addr, size, size, MREMAP_COW_NOW);
> >
> > That would create a new VMA that contains the COWed pages from the
> > old VMA, but crucially no longer attached to the anon_vma root of
> > the zygote. You wouldn't want to call this for every VMA, of course.
> > Just the ones which are likely to be fully COWed.
> >
> > Maybe this isn't practical, but I thought it worth suggesting.
>
> Lorenzo suggested possibly unlinking the child anon_vma from the root once all
> folios have been CoW-ed:
>
> "Right now, even if you entirely CoW everything in a VMA, we are still
> attached to parents with all the overhead. That's something I can look at.
> "
>
> My concern is that it’s difficult to determine whether a VMA has been completely
> CoW-ed, and a single shared folio would prevent the unlink.
> So I’m not sure this approach would work.
I'm concerned that tracking how many folios remain shared may be
inefficient. Also that information needs to be gathered in both parent
and child.
> You seem to be proposing a forced CoW as a way to safely unlink from the root.
>
> A side effect is the potential for sudden, heavy memory allocation,
> whereas CoW lets asynchronous tasks such as kswapd work concurrently.
Perhaps you could help us out with some stats on that -- how much
anonymous memory starts out shared between the zygote and a newly
spawned process?
> Another issue is the extra memory use from folios that could have been
> shared but aren’t—likely minor on Android, since only a small portion
> of memory is actually shared, based on our observations.
>
> Calling mremap for each VMA might be difficult. Something applied to the
> whole process could be more practical—similar to exec, but only
> performing CoW and unlinking the anon_vma root.
That seems like it would be worse for memory consumption than doing it
on the VMAs in question.
Another possibility would be for the zygote to set a flag on the VMA,
say EAGER_COW which forces a COW of all pages as soon as the first one
is COWed. But then we're paying at fault time rather than in a syscall
that we can predict.
Another point in favour of COW_NOW or EAGER_COW is that we can choose to
allocate folios of the appropriate size at that time. Unless something's
changed, I think we always COW individual pages rather than multiple
pages at once.
* Re: [DISCUSSION] anon_vma root lock contention and per anon_vma lock
2025-09-15 2:50 ` Matthew Wilcox
@ 2025-09-15 5:17 ` David Hildenbrand
2025-09-15 9:42 ` Lorenzo Stoakes
2025-09-15 9:22 ` Lorenzo Stoakes
1 sibling, 1 reply; 23+ messages in thread
From: David Hildenbrand @ 2025-09-15 5:17 UTC (permalink / raw)
To: Matthew Wilcox, Barry Song
Cc: Nicolas Geoffray, Lokesh Gidra, Lorenzo Stoakes, Harry Yoo,
Suren Baghdasaryan, Andrew Morton, Rik van Riel,
Liam R . Howlett, Vlastimil Babka, Jann Horn, Linux-MM,
Kalesh Singh, SeongJae Park, Barry Song, Peter Xu
On 15.09.25 04:50, Matthew Wilcox wrote:
> On Mon, Sep 15, 2025 at 08:23:38AM +0800, Barry Song wrote:
A couple of notes:
>>> I wonder if we could fix this by adding a new syscall:
>>>
>>> mremap(addr, size, size, MREMAP_COW_NOW);
>>>
>>> That would create a new VMA that contains the COWed pages from the
>>> old VMA, but crucially no longer attached to the anon_vma root of
>>> the zygote. You wouldn't want to call this for every VMA, of course.
>>> Just the ones which are likely to be fully COWed.
MADV_POPULATE does that for writable VMAs (excluding the rmap opt, but
that could likely be implemented).
A student of mine implemented a MADV_UNSHARE that achieves the same by
triggering unshare-faults even for non-writable VMAs (again, excluding
the rmap opt).
We used MADV_UNSHARE to break COW asynchronously to the already-running
workload to keep fork() still short but avoid the overhead of COW faults
later.
[ insert usual comment about no weird mremap flags ]
>>>
>>> Maybe this isn't practical, but I thought it worth suggesting.
>>
>> Lorenzo suggested possibly unlinking the child anon_vma from the root once all
>> folios have been CoW-ed:
>>
>> "Right now, even if you entirely CoW everything in a VMA, we are still
>> attached to parents with all the overhead. That's something I can look at.
>> "
>>
>> My concern is that it’s difficult to determine whether a VMA has been completely
>> CoW-ed, and a single shared folio would prevent the unlink.
>> So I’m not sure this approach would work.
>
> I'm concerned that tracking how many folios remain shared may be
> inefficient. Also that information needs to be gathered in both parent
> and child.
Yeah, not a fan. Tracking per MM might work, tracking per VMA is
problematic due to the possibility for VMA splits.
>
>> You seem to be proposing a forced CoW as a way to safely unlink from the root.
>>
>> A side effect is the potential for sudden, heavy memory allocation,
>> whereas CoW lets asynchronous tasks such as kswapd work concurrently.
>
> Perhaps you could help us out with some stats on that -- how much
> anonymous memory starts out shared between the zygote and a newly
> spawned process?
>
>> Another issue is the extra memory use from folios that could have been
>> shared but aren’t—likely minor on Android, since only a small portion
>> of memory is actually shared, based on our observations.
>>
>> Calling mremap for each VMA might be difficult. Something applied to the
>> whole process could be more practical—similar to exec, but only
>> performing CoW and unlinking the anon_vma root.
>
> That seems like it would be worse for memory consumption than doing it
> on the VMAs in question.
MADV_UNSHARE we implemented simply took a range and one could apply it
to the full process by supplying the full range.
But yeah, the downside in any case is that you lose the sharing.
>
> Another possibility would be for the zygote to set a flag on the VMA,
> say EAGER_COW which forces a COW of all pages as soon as the first one
> is COWed. But then we're paying at fault time rather than in a syscall
> that we can predict.
Right, or just avoid COW altogether (if fork time is irrelevant) and
just copy during fork(). Either using a clone flag for the whole MM or
using a new MADV option to copy during fork.
>
> Another point in favour of COW_NOW or EAGER_COW is that we can choose to
> allocate folios of the appropriate size at that time. Unless something's
> changed, I think we always COW individual pages rather than multiple
> pages at once.
Yes. khugepaged will soon start fixing that up asynchronously, I hope.
But obviously, whenever we copy/unshare, we consume more memory. While
this might possibly work for Android, I know that some workloads (was it
webservers or web browsers, for example?) spin up many instances through
fork() to actually keep sharing pages and not break COW.
So I would hope we can find a better optimization that doesn't rely on
the workload to manually break COW and effectively consume more memory.
--
Cheers
David / dhildenb
* Re: [DISCUSSION] anon_vma root lock contention and per anon_vma lock
2025-09-15 1:47 ` Suren Baghdasaryan
@ 2025-09-15 8:41 ` Lorenzo Stoakes
0 siblings, 0 replies; 23+ messages in thread
From: Lorenzo Stoakes @ 2025-09-15 8:41 UTC (permalink / raw)
To: Suren Baghdasaryan
Cc: Barry Song, Matthew Wilcox, Nicolas Geoffray, Lokesh Gidra,
David Hildenbrand, Harry Yoo, Andrew Morton, Rik van Riel,
Liam R . Howlett, Vlastimil Babka, Jann Horn, Linux-MM,
Kalesh Singh, SeongJae Park, Barry Song, Peter Xu
On Sun, Sep 14, 2025 at 06:47:48PM -0700, Suren Baghdasaryan wrote:
> On Sun, Sep 14, 2025 at 5:23 PM Barry Song <21cnbao@gmail.com> wrote:
> >
> > On Mon, Sep 15, 2025 at 7:53 AM Matthew Wilcox <willy@infradead.org> wrote:
> > >
> > > On Thu, Sep 11, 2025 at 07:17:01PM +1200, Barry Song wrote:
> > > > In the process tree, many processes may share anon_vma->root, even if
> > > > they don’t share the anon_vma itself. This causes serious lock contention
> > > > between memory reclamation (which calls folio_referenced and try_to_unmap)
> > > > and other processes calling fork(), exit(), mprotect(), etc.
> > > >
> > > > On Android, this issue becomes more severe since many processes are
> > > > descendants of zygote.
> > >
> > > I'm not nearly as familiar with anon_vma as, well, the rest of you
> > > are. As I understand this situation, usually after fork(), a process
> > > calls exec() and the VMAs evaporate. Android is different in that after
> > > the zygote calls fork(), there is no exec() and so the VMAs stay COW.
> > >
> > > I wonder if we could fix this by adding a new syscall:
> > >
> > > mremap(addr, size, size, MREMAP_COW_NOW);
> > >
> > > That would create a new VMA that contains the COWed pages from the
> > > old VMA, but crucially no longer attached to the anon_vma root of
> > > the zygote. You wouldn't want to call this for every VMA, of course.
> > > Just the ones which are likely to be fully COWed.
> > >
> > > Maybe this isn't practical, but I thought it worth suggesting.
> >
> > Thank you for the suggestion, Matthew.
> >
> > Lorenzo suggested possibly unlinking the child anon_vma from the root once all
> > folios have been CoW-ed:
> >
> > "Right now, even if you entirely CoW everything in a VMA, we are still
> > attached to parents with all the overhead. That's something I can look at.
> > "
> >
> > My concern is that it’s difficult to determine whether a VMA has been completely
> > CoW-ed, and a single shared folio would prevent the unlink.
> > So I’m not sure this approach would work.
> >
> > You seem to be proposing a forced CoW as a way to safely unlink from the root.
> >
> > A side effect is the potential for sudden, heavy memory allocation,
> > whereas CoW lets asynchronous tasks such as kswapd work concurrently.
> >
> > Another issue is the extra memory use from folios that could have been
> > shared but aren’t—likely minor on Android, since only a small portion
> > of memory is actually shared, based on our observations.
> >
> > Calling mremap for each VMA might be difficult. Something applied to the
> > whole process could be more practical—similar to exec, but only
> > performing CoW and unlinking the anon_vma root.
> >
> > On the other hand, most anon folios are not actually shared, yet
> > folio_referenced and try_to_unmap still take the entire root lock.
> > In reality, they only care about their own node—no need to iterate
> > the whole tree.
> >
> > I still think optimizing from that angle could be a better entry point :-)
>
> Hi Barry,
> Thanks for raising this issue. I think technically the optimization
> you are suggesting is possible and it does look similar to per-vma
> locking in that:
> - The reader tries to read-lock a specific interval and on failure
> falls back to locking the entire tree (root);
> - The writer write-locks the root first and then one or more
> individual nodes in the tree. Once the writer is done it unlocks all
> the nodes it locked and then the root.
> But as Lorenzo pointed out, this will not be pretty, as it adds yet
> another lock and more locking/unlocking into the writer path.
> In the case of the pagefault path, improving its performance at the
> expense of the writers was not questioned due to pagefault being such
> a hot path. I'm not sure reclaim will be given the same benefit...
> Something to consider.
> In any case, I'm very interested in continuing this discussion and
> would love to test a POC or discuss this at LPC.
> Thanks,
> Suren.
Hi Suren,
Have submitted a proposal to LPC re: anon_vma in general, and (in collaboration
with Barry) am planning a PoC for this!
I think we need to be very data-centric here. So we have to ensure that the cost
is worth the benefit. Anyway I should put out a [PoC PATCH] soon!
We could even start by looking at a crude version that just uses per-anon_vma
rwsem as a first step.
But I do think any actually workable solution will have to resemble VMA locks
for it to be even vaguely acceptable. As otherwise grabbing the rwsems is,
y'know, probably a little too much overhead :)
One big issue we have here is that we are doing anon_vma clone on _so many_
operations, that is split/merge/fork.
So yeah needs assessment, but I think longer term this kind of issue can feed
into my 'grand redesign' of anon_vma which is a simmering background task (and
again I hope to talk about at LPC if my topic is accepted).
Cheers, Lorenzo
* Re: [DISCUSSION] anon_vma root lock contention and per anon_vma lock
2025-09-14 23:53 ` Matthew Wilcox
2025-09-15 0:23 ` Barry Song
@ 2025-09-15 8:57 ` Lorenzo Stoakes
1 sibling, 0 replies; 23+ messages in thread
From: Lorenzo Stoakes @ 2025-09-15 8:57 UTC (permalink / raw)
To: Matthew Wilcox
Cc: Barry Song, Nicolas Geoffray, Lokesh Gidra, David Hildenbrand,
Harry Yoo, Suren Baghdasaryan, Andrew Morton, Rik van Riel,
Liam R . Howlett, Vlastimil Babka, Jann Horn, Linux-MM,
Kalesh Singh, SeongJae Park, Barry Song, Peter Xu
On Mon, Sep 15, 2025 at 12:53:20AM +0100, Matthew Wilcox wrote:
> On Thu, Sep 11, 2025 at 07:17:01PM +1200, Barry Song wrote:
> > In the process tree, many processes may share anon_vma->root, even if
> > they don’t share the anon_vma itself. This causes serious lock contention
> > between memory reclamation (which calls folio_referenced and try_to_unmap)
> > and other processes calling fork(), exit(), mprotect(), etc.
> >
> > On Android, this issue becomes more severe since many processes are
> > descendants of zygote.
>
> I'm not nearly as familiar with anon_vma as, well, the rest of you
> are. As I understand this situation, usually after fork(), a process
> calls exec() and the VMAs evaporate. Android is different in that after
> the zygote calls fork(), there is no exec() and so the VMAs stay COW.
Oh really, wasn't aware of this...
>
> I wonder if we could fix this by adding a new syscall:
>
> mremap(addr, size, size, MREMAP_COW_NOW);
>
> That would create a new VMA that contains the COWed pages from the
> old VMA, but crucially no longer attached to the anon_vma root of
> the zygote. You wouldn't want to call this for every VMA, of course.
> Just the ones which are likely to be fully COWed.
Hm, I'm not sure how this would work.
So the folio->mapping would point at the zygote's anon_vma, which would
have AVCs to the zygote + the child.
after this call you have a new VMA that surely would need that same
anon_vma referencing it via an AVC, unless you intend to actually CoW the
folios to new folios that reference the new VMA, which I guess is what
you mean?
This is essentially doing a CoW _and_ saying 'hey we are definitely
actually CoWing the _whole range_ in the VMA so can safely no longer link
to the zygote'.
I mean firstly I think the interface is definitely not right, I don't know
where you'd be mremap()'ing to and from.
I think it'd need to be more like an madvise(), one that you'd have to
restrict to a whole VMA.
But I really don't love this idea, I think we'd be solving a specific issue
for Android while leaving a genuine problem that exists in the anon_vma
logic alone, I'd far rather we attack things at a fundamental level.
Also I think figuring out which bits are likely to get CoW'd or not will be
non-trivial.
Presumably Google are doing this zygote stuff to take advantage of CoW, and
wouldn't want the overhead of copying data all that much.
>
> Maybe this isn't practical, but I thought it worth suggesting.
>
Yeah I'm not sure this is the right approach.
But you have managed to get 'cow now' stuck in my head ;)
Cheers, Lorenzo
* Re: [DISCUSSION] anon_vma root lock contention and per anon_vma lock
2025-09-15 2:50 ` Matthew Wilcox
2025-09-15 5:17 ` David Hildenbrand
@ 2025-09-15 9:22 ` Lorenzo Stoakes
2025-09-15 10:41 ` David Hildenbrand
1 sibling, 1 reply; 23+ messages in thread
From: Lorenzo Stoakes @ 2025-09-15 9:22 UTC (permalink / raw)
To: Matthew Wilcox
Cc: Barry Song, Nicolas Geoffray, Lokesh Gidra, David Hildenbrand,
Harry Yoo, Suren Baghdasaryan, Andrew Morton, Rik van Riel,
Liam R . Howlett, Vlastimil Babka, Jann Horn, Linux-MM,
Kalesh Singh, SeongJae Park, Barry Song, Peter Xu
On Mon, Sep 15, 2025 at 03:50:01AM +0100, Matthew Wilcox wrote:
> On Mon, Sep 15, 2025 at 08:23:38AM +0800, Barry Song wrote:
> > > I wonder if we could fix this by adding a new syscall:
> > >
> > > mremap(addr, size, size, MREMAP_COW_NOW);
> > >
> > > That would create a new VMA that contains the COWed pages from the
> > > old VMA, but crucially no longer attached to the anon_vma root of
> > > the zygote. You wouldn't want to call this for every VMA, of course.
> > > Just the ones which are likely to be fully COWed.
> > >
> > > Maybe this isn't practical, but I thought it worth suggesting.
> >
> > Lorenzo suggested possibly unlinking the child anon_vma from the root once all
> > folios have been CoW-ed:
> >
> > "Right now, even if you entirely CoW everything in a VMA, we are still
> > attached to parents with all the overhead. That's something I can look at.
> > "
> >
> > My concern is that it’s difficult to determine whether a VMA has been completely
> > CoW-ed, and a single shared folio would prevent the unlink.
> > So I’m not sure this approach would work.
>
> I'm concerned that tracking how many folios remain shared may be
> inefficient. Also that information needs to be gathered in both parent
> and child.
Yeah I think you would need to track parent + child which is just _lovely_
isn't it.
I'm really not in love with the overwrought structure of anon_vmas in general,
we've made life hard for ourselves and tacked on a bunch of complexity.
Again, I think "Lorenzo's grand rework" could help tackle this from a
fundamental basis (don't ask me for too many details just yet :P)
This also gets potentially complicated with the anon_vma reuse logic too.
It's about RoI.
Again, the more I look at this, the more I feel that the whole thing needs
rearchitecture.
>
> > You seem to be proposing a forced CoW as a way to safely unlink from the root.
> >
> > A side effect is the potential for sudden, heavy memory allocation,
> > whereas CoW lets asynchronous tasks such as kswapd work concurrently.
>
> Perhaps you could help us out with some stats on that -- how much
> anonymous memory starts out shared between the zygote and a newly
> spawned process?
Yes stats are good!
>
> > Another issue is the extra memory use from folios that could have been
> > shared but aren’t—likely minor on Android, since only a small portion
> > of memory is actually shared, based on our observations.
> >
> > Calling mremap for each VMA might be difficult. Something applied to the
> > whole process could be more practical—similar to exec, but only
> > performing CoW and unlinking the anon_vma root.
>
> That seems like it would be worse for memory consumption than doing it
> on the VMAs in question.
Yes!
Surely the whole point of using the zygote is to take advantage of CoW no?
It'd surely hugely slow down establishing a new process if we did this
per-process?
>
> Another possibility would be for the zygote to set a flag on the VMA,
> say EAGER_COW which forces a COW of all pages as soon as the first one
> is COWed. But then we're paying at fault time rather than in a syscall
> that we can predict.
I assume you mean write fault.
How would we identify which folios to CoW at fault time (other than the one
we are write faulting on)? There's no way to go from anon_vma to folios
without doing a full rmap traversal to the the VMA then back down again to
page tables, so that'd make this pretty damn expensive surely?
And we'd need to hold some kind of lock at that point...
Also again I wonder how easy it will be to identify which VMAs you're happy
to make expensive like this?
Also you will end up fragmenting VMAs potentially doing this.
I think it's a sort of nice idea fundamentally, but the cost is likely to
be high and it will add a bunch of complexity.
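To make that cost concrete, here is a rough pseudocode sketch of what an eager-CoW-on-write-fault would have to do; it mirrors the general shape of the kernel's anon rmap walk, but every helper name here is illustrative, not the real API:

```c
/* Pseudocode only: illustrates the traversal cost, not real kernel code. */
cow_entire_vma_on_first_write_fault(folio)
{
    anon_vma = folio_anon_vma(folio);          /* via folio->mapping */
    anon_vma_lock_read(anon_vma);              /* takes the *root* lock */

    /* Up: the interval tree yields every VMA this anon_vma maps into. */
    for_each_vma_in_interval_tree(avc, anon_vma) {
        vma = avc->vma;

        /* Down again: walk that VMA's page tables... */
        for (addr = vma->vm_start; addr < vma->vm_end; addr += PAGE_SIZE) {
            pte = walk_page_table(vma, addr);  /* pte lock per table */
            if (pte_maps_shared_anon_folio(pte))
                break_cow(vma, addr);          /* allocate + copy */
        }
    }
    anon_vma_unlock_read(anon_vma);
}
```

Every write fault that triggered this would pay a full up-and-down traversal under the root lock, i.e. exactly the contention this thread is trying to get rid of.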
>
> Another point in favour of COW_NOW or EAGER_COW is that we can choose to
> allocate folios of the appropriate size at that time. Unless something's
> changed, I think we always COW individual pages rather than multiple
> pages at once.
>
I don't think there's any practical way to do it any differently because
you don't know what's mapped/not at the point of fault.
How expensive would it be to have an xarray for anon, I wonder... :)
* Re: [DISCUSSION] anon_vma root lock contention and per anon_vma lock
2025-09-15 5:17 ` David Hildenbrand
@ 2025-09-15 9:42 ` Lorenzo Stoakes
2025-09-15 10:29 ` David Hildenbrand
0 siblings, 1 reply; 23+ messages in thread
From: Lorenzo Stoakes @ 2025-09-15 9:42 UTC (permalink / raw)
To: David Hildenbrand
Cc: Matthew Wilcox, Barry Song, Nicolas Geoffray, Lokesh Gidra,
Harry Yoo, Suren Baghdasaryan, Andrew Morton, Rik van Riel,
Liam R . Howlett, Vlastimil Babka, Jann Horn, Linux-MM,
Kalesh Singh, SeongJae Park, Barry Song, Peter Xu
On Mon, Sep 15, 2025 at 07:17:41AM +0200, David Hildenbrand wrote:
> On 15.09.25 04:50, Matthew Wilcox wrote:
> > On Mon, Sep 15, 2025 at 08:23:38AM +0800, Barry Song wrote:
>
> A couple of notes:
>
> > > > I wonder if we could fix this by adding a new syscall:
> > > >
> > > > mremap(addr, size, size, MREMAP_COW_NOW);
> > > >
> > > > That would create a new VMA that contains the COWed pages from the
> > > > old VMA, but crucially no longer attached to the anon_vma root of
> > > > the zygote. You wouldn't want to call this for every VMA, of course.
> > > > Just the ones which are likely to be fully COWed.
>
> MADV_POPULATE does that for writable VMAs (excluding the rmap opt, but that
> could likely be implemented).
>
> A student of mine implemented a MADV_UNSHARE that achieves the same by
> triggering unshare-faults even for non-writable VMAs (again, excluding the
> rmap opt).
I think the rmap bit is non-trivial as per my other replies.
>
> We used MADV_UNSHARE to break COW asynchronously to the already-running
> workload to keep fork() still short but avoid the overhead of COW faults
> later.
>
> [ insert usual comment about no weird mremap flags ]
Yup
>
> > > >
> > > > Maybe this isn't practical, but I thought it worth suggesting.
> > >
> > > Lorenzo suggested possibly unlinking the child anon_vma from the root once all
> > > folios have been CoW-ed:
> > >
> > > "Right now, even if you entirely CoW everything in a VMA, we are still
> > > attached to parents with all the overhead. That's something I can look at.
> > > "
> > >
> > > My concern is that it’s difficult to determine whether a VMA has been completely
> > > CoW-ed, and a single shared folio would prevent the unlink.
> > > So I’m not sure this approach would work.
> >
> > I'm concerned that tracking how many folios remain shared may be
> > inefficient. Also that information needs to be gathered in both parent
> > and child.
>
> Yeah, not a fan. Tracking per MM might work, tracking per VMA is problematic
> due to the possibility for VMA splits.
Well, I think it's possible (maybe)... with a rework :>)
"Lorenzo's grand rework" etc. etc.
>
> >
> > > You seem to be proposing a forced CoW as a way to safely unlink from the root.
> > >
> > > A side effect is the potential for sudden, heavy memory allocation,
> > > whereas CoW lets asynchronous tasks such as kswapd work concurrently.
> >
> > Perhaps you could help us out with some stats on that -- how much
> > anonymous memory starts out shared between the zygote and a newly
> > spawned process?
> >
> > > Another issue is the extra memory use from folios that could have been
> > > shared but aren’t—likely minor on Android, since only a small portion
> > > of memory is actually shared, based on our observations.
> > >
> > > Calling mremap for each VMA might be difficult. Something applied to the
> > > whole process could be more practical—similar to exec, but only
> > > performing CoW and unlinking the anon_vma root.
> >
> > That seems like it would be worse for memory consumption than doing it
> > on the VMAs in question.
>
> MADV_UNSHARE we implemented simply took a range and one could apply it to
> the full process by supplying the full range.
>
> But yeah, the downside in any case is that you lose
You just lose? :P I assume you forgot to finish this thought :>)
I wonder from rmap point of view whether you could actually simply check to
see if you're fully CoW'd.
E.g.:
madvise(..., MADV_ISOLATE_COWED)
And have it take the anon_vma write lock from root, have it walk the rmap,
go and check to see if every folio in the VMA is now CoW'd, and if so,
detach the CoW'd anon_vma from its parent/root?
This would be a sort of after-the-fact thing, but maybe could be done
periodically.
Of course then if you had one folio that was not yet CoW'd, that'd prevent
this from completing.
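In pseudocode, that hypothetical MADV_ISOLATE_COWED might look roughly like this; the flag and every helper here are invented for illustration, nothing is a real kernel interface:

```c
/* Pseudocode for the hypothetical MADV_ISOLATE_COWED; not real API. */
madvise_isolate_cowed(vma)
{
    anon_vma = vma->anon_vma;
    anon_vma_lock_write(anon_vma->root);       /* root write lock */

    for (addr = vma->vm_start; addr < vma->vm_end; addr += PAGE_SIZE) {
        folio = vma_lookup_folio(vma, addr);
        if (folio && !folio_is_exclusive(folio)) {
            /* A single still-shared folio blocks the whole unlink. */
            anon_vma_unlock_write(anon_vma->root);
            return -EBUSY;
        }
    }

    /* Every folio is CoW'd: detach from parent/root so future rmap
     * walks and locks stay local to this process. */
    unlink_from_parent_and_root(anon_vma);
    anon_vma_unlock_write(anon_vma->root);
    return 0;
}
```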
>
> >
> > Another possibility would be for the zygote to set a flag on the VMA,
> > say EAGER_COW which forces a COW of all pages as soon as the first one
> > is COWed. But then we're paying at fault time rather than in a syscall
> > that we can predict.
>
> Right, or just avoid COW altogether (if fork time is irrelevant) and just
> copy during fork(). Either using a clone flag for the whole MM or using a
> new MADV option copy during fork.
Could have sworn we already had an madvise() flag for that but no we
don't... MADV_COPY_ON_FORK...
Anyway any such solution has the issue of using extra memory when the user
very probably does not want this.
>
> >
> > Another point in favour of COW_NOW or EAGER_COW is that we can choose to
> > allocate folios of the appropriate size at that time. Unless something's
> > changed, I think we always COW individual pages rather than multiple
> > pages at once.
>
> Yes. khugepaged will soon start fixing that up asynchronously, I
> hope.
But only at mTHP granularity a. once the relevant series lands and b. if
mTHP is enabled (I mean for sub-PMD sized/aligned ranges) :>)
>
>
> But obviously, whenever we copy/unshare, we consume more memory. While this
> might possibly work for Android, I know that some workloads (was it
> webservers or web browsers for example?) spin up many instances through fork()
> to actually keep sharing pages and not break COW.
>
> So I would hope we can find a better optimization that doesn't rely on the
> workload to manually break COW and effectively consume more memory.
Yup, agreed.
>
> --
> Cheers
>
> David / dhildenb
>
>
Cheers, Lorenzo
* Re: [DISCUSSION] anon_vma root lock contention and per anon_vma lock
2025-09-15 9:42 ` Lorenzo Stoakes
@ 2025-09-15 10:29 ` David Hildenbrand
2025-09-15 10:56 ` Lorenzo Stoakes
0 siblings, 1 reply; 23+ messages in thread
From: David Hildenbrand @ 2025-09-15 10:29 UTC (permalink / raw)
To: Lorenzo Stoakes
Cc: Matthew Wilcox, Barry Song, Nicolas Geoffray, Lokesh Gidra,
Harry Yoo, Suren Baghdasaryan, Andrew Morton, Rik van Riel,
Liam R . Howlett, Vlastimil Babka, Jann Horn, Linux-MM,
Kalesh Singh, SeongJae Park, Barry Song, Peter Xu
On 15.09.25 11:42, Lorenzo Stoakes wrote:
> On Mon, Sep 15, 2025 at 07:17:41AM +0200, David Hildenbrand wrote:
>> On 15.09.25 04:50, Matthew Wilcox wrote:
>>> On Mon, Sep 15, 2025 at 08:23:38AM +0800, Barry Song wrote:
>>
>> A couple of notes:
>>
>>>>> I wonder if we could fix this by adding a new syscall:
>>>>>
>>>>> mremap(addr, size, size, MREMAP_COW_NOW);
>>>>>
>>>>> That would create a new VMA that contains the COWed pages from the
>>>>> old VMA, but crucially no longer attached to the anon_vma root of
>>>>> the zygote. You wouldn't want to call this for every VMA, of course.
>>>>> Just the ones which are likely to be fully COWed.
>>
>> MADV_POPULATE does that for writable VMAs (excluding the rmap opt, but that
>> could likely be implemented).
>>
>> A student of mine implemented a MADV_UNSHARE that achieves the same by
>> triggering unshare-faults even for non-writable VMAs (again, excluding the
>> rmap opt).
>
> I think the rmap bit is non-trivial as per my other replies.
>
>>
>> We used MADV_UNSHARE to break COW asynchronously to the already-running
>> workload to keep fork() still short but avoid the overhead of COW faults
>> later.
>>
>> [ insert usual comment about no weird mremap flags ]
>
> Yup
>
>>
>>>>>
>>>>> Maybe this isn't practical, but I thought it worth suggesting.
>>>>
>>>> Lorenzo suggested possibly unlinking the child anon_vma from the root once all
>>>> folios have been CoW-ed:
>>>>
>>>> "Right now, even if you entirely CoW everything in a VMA, we are still
>>>> attached to parents with all the overhead. That's something I can look at.
>>>> "
>>>>
>>>> My concern is that it’s difficult to determine whether a VMA has been completely
>>>> CoW-ed, and a single shared folio would prevent the unlink.
>>>> So I’m not sure this approach would work.
>>>
>>> I'm concerned that tracking how many folios remain shared may be
>>> inefficient. Also that information needs to be gathered in both parent
>>> and child.
>>
>> Yeah, not a fan. Tracking per MM might work, tracking per VMA is problematic
>> due to the possibility for VMA splits.
>
> Well, I think it's possible (maybe)... with a rework :>)
>
> "Lorenzo's grand rework" etc. etc.
>
>>
>>>
>>>> You seem to be proposing a forced CoW as a way to safely unlink from the root.
>>>>
>>>> A side effect is the potential for sudden, heavy memory allocation,
>>>> whereas CoW lets asynchronous tasks such as kswapd work concurrently.
>>>
>>> Perhaps you could help us out with some stats on that -- how much
>>> anonymous memory starts out shared between the zygote and a newly
>>> spawned process?
>>>
>>>> Another issue is the extra memory use from folios that could have been
>>>> shared but aren’t—likely minor on Android, since only a small portion
>>>> of memory is actually shared, based on our observations.
>>>>
>>>> Calling mremap for each VMA might be difficult. Something applied to the
>>>> whole process could be more practical—similar to exec, but only
>>>> performing CoW and unlinking the anon_vma root.
>>>
>>> That seems like it would be worse for memory consumption than doing it
>>> on the VMAs in question.
>>
>> MADV_UNSHARE we implemented simply took a range and one could apply it to
>> the full process by supplying the full range.
>>
>> But yeah, the downside in any case is that you lose
>
> You just lose? :P I assume you forgot to finish this thought :>)
"you lose the memory savings of COW" -- was still tired there.
>
> I wonder from rmap point of view whether you could actually simply check to
> see if you're fully CoW'd.
>
> E.g.:
>
> madvise(..., MADV_ISOLATE_COWED)
>
> And have it take the anon_vma write lock from root, have it walk the rmap,
> go and check to see if every folio in the VMA is now CoW'd, and if so,
> detach the CoW'd anon_vma from its parent/root?
TBH, this all feels like things we should be optimizing internally somehow.
And don't get me started on
MADV_ISOLATE_COWED eww
NO_COWS eww
COW NOW eww
>
> This would be a sort of after-the-fact thing, but maybe could be done
> periodically.
>
> Of course then if you had one folio that was not yet CoW'd, that'd prevent
> this from completing.
[...]
>>>
>>> Another point in favour of COW_NOW or EAGER_COW is that we can choose to
>>> allocate folios of the appropriate size at that time. Unless something's
>>> changed, I think we always COW individual pages rather than multiple
>>> pages at once.
>>
>> Yes. khugepaged will soon start fixing that up asynchronously, I
>> hope.
>
> But only at mTHP granularity a. once the relevant series lands and b. if
> mTHP is enabled (I mean for sub-PMD sized/aligned ranges) :>)
Well, we need khugepaged in one form or the other for mTHP in any case :P
And the glorious future will have all sizes enabled as default.
--
Cheers
David / dhildenb
* Re: [DISCUSSION] anon_vma root lock contention and per anon_vma lock
2025-09-15 9:22 ` Lorenzo Stoakes
@ 2025-09-15 10:41 ` David Hildenbrand
2025-09-15 10:51 ` Lorenzo Stoakes
0 siblings, 1 reply; 23+ messages in thread
From: David Hildenbrand @ 2025-09-15 10:41 UTC (permalink / raw)
To: Lorenzo Stoakes, Matthew Wilcox
Cc: Barry Song, Nicolas Geoffray, Lokesh Gidra, Harry Yoo,
Suren Baghdasaryan, Andrew Morton, Rik van Riel,
Liam R . Howlett, Vlastimil Babka, Jann Horn, Linux-MM,
Kalesh Singh, SeongJae Park, Barry Song, Peter Xu
>>
>> Another point in favour of COW_NOW or EAGER_COW is that we can choose to
>> allocate folios of the appropriate size at that time. Unless something's
>> changed, I think we always COW individual pages rather than multiple
>> pages at once.
>>
>
> I don't think there's any practical way to do it any differently because
> you don't know what's mapped/not at the point of fault.
>
> How expensive would it be to have an xarray for anon, I wonder... :)
We had a proposal last year, I think, where someone wanted to bring some
weird FreeBSD semantics into Linux MM and proposed a secondary tracking
structure for anon.
Tracking twice is obviously more expensive than tracking once, so I am
quite convinced that no, we never want that.
--
Cheers
David / dhildenb
* Re: [DISCUSSION] anon_vma root lock contention and per anon_vma lock
2025-09-15 10:41 ` David Hildenbrand
@ 2025-09-15 10:51 ` Lorenzo Stoakes
0 siblings, 0 replies; 23+ messages in thread
From: Lorenzo Stoakes @ 2025-09-15 10:51 UTC (permalink / raw)
To: David Hildenbrand
Cc: Matthew Wilcox, Barry Song, Nicolas Geoffray, Lokesh Gidra,
Harry Yoo, Suren Baghdasaryan, Andrew Morton, Rik van Riel,
Liam R . Howlett, Vlastimil Babka, Jann Horn, Linux-MM,
Kalesh Singh, SeongJae Park, Barry Song, Peter Xu
On Mon, Sep 15, 2025 at 12:41:52PM +0200, David Hildenbrand wrote:
> > >
> > > Another point in favour of COW_NOW or EAGER_COW is that we can choose to
> > > allocate folios of the appropriate size at that time. Unless something's
> > > changed, I think we always COW individual pages rather than multiple
> > > pages at once.
> > >
> >
> > I don't think there's any practical way to do it any differently because
> > you don't know what's mapped/not at the point of fault.
> >
> > How expensive would it be to have an xarray for anon, I wonder... :)
>
> We had a proposal last year, I think, where someone wanted to bring some
> weird FreeBSD semantics into Linux MM and proposed a secondary tracking
> structure for anon.
Yeah ok :) you've brought the big guns out to shoot down this... well I won't
say idea, cheeky thought :P
It'd make some things easier, but come at a really big cost.
It's obviously of great importance for the page cache (otherwise, how do you
even know where anything is there).
For anon this isn't the case.
>
> Tracking twice is obviously more expensive than tracking once, so I am quite
> convinced that no, we never want that.
Yup :)
>
> --
> Cheers
>
> David / dhildenb
>
>
Cheers, Lorenzo
* Re: [DISCUSSION] anon_vma root lock contention and per anon_vma lock
2025-09-15 10:29 ` David Hildenbrand
@ 2025-09-15 10:56 ` Lorenzo Stoakes
0 siblings, 0 replies; 23+ messages in thread
From: Lorenzo Stoakes @ 2025-09-15 10:56 UTC (permalink / raw)
To: David Hildenbrand
Cc: Matthew Wilcox, Barry Song, Nicolas Geoffray, Lokesh Gidra,
Harry Yoo, Suren Baghdasaryan, Andrew Morton, Rik van Riel,
Liam R . Howlett, Vlastimil Babka, Jann Horn, Linux-MM,
Kalesh Singh, SeongJae Park, Barry Song, Peter Xu
On Mon, Sep 15, 2025 at 12:29:50PM +0200, David Hildenbrand wrote:
> > > MADV_UNSHARE we implemented simply took a range and one could apply it to
> > > the full process by supplying the full range.
> > >
> > > But yeah, the downside in any case is that you lose
> >
> > You just lose? :P I assume you forgot to finish this thought :>)
>
> "you lose the memory savings of COW" -- was still tired there.
You and me both, bud... :)
And right, yes agreed.
>
> >
> > I wonder from rmap point of view whether you could actually simply check to
> > see if you're fully CoW'd.
> >
> > E.g.:
> >
> > madvise(..., MADV_ISOLATE_COWED)
> >
> > And have it take the anon_vma write lock from root, have it walk the rmap,
> > go and check to see if every folio in the VMA is now CoW'd, and if so,
> > detach the CoW'd anon_vma from its parent/root?
>
> TBH, this all feels like things we should be optimizing internally somehow.
>
> And don't get me started on
>
> MADV_ISOLATE_COWED eww
>
> NO_COWS eww
>
> COW NOW eww
Holy CoW! ;)
Yeah it'd be nice for us to do this automagically. But I suspect anything like
this will be quite painful.
A kthread to do this... hmmm... Doing it on fault? Expensive. Tracking it? Also
expensive.
++lorenzos_grand_rework_goals I guess...
>
> >
> > This would be a sort of after-the-fact thing, but maybe could be done
> > periodically.
> >
> > Of course then if you had one folio that was not yet CoW'd, that'd prevent
> > this from completing.
>
> [...]
>
> > > >
> > > > Another point in favour of COW_NOW or EAGER_COW is that we can choose to
> > > > allocate folios of the appropriate size at that time. Unless something's
> > > > changed, I think we always COW individual pages rather than multiple
> > > > pages at once.
> > >
> > > Yes. khugepaged will soon start fixing that up asynchronously, I
> > > hope.
> >
> > But only at mTHP granularity a. once the relevant series lands and b. if
> > mTHP is enabled (I mean for sub-PMD sized/aligned ranges) :>)
>
> Well, we need khugepaged in one form or the other for mTHP in any case :P
>
> And the glorious future will have all sizes enabled as default.
I look forward to this glorious future :)
>
> --
> Cheers
>
> David / dhildenb
>
>
Cheers, Lorenzo
end of thread, other threads:[~2025-09-15 10:56 UTC | newest]
Thread overview: 23+ messages
2025-09-11 7:17 [DISCUSSION] anon_vma root lock contention and per anon_vma lock Barry Song
2025-09-11 8:14 ` David Hildenbrand
2025-09-11 8:34 ` Lorenzo Stoakes
2025-09-11 9:18 ` Barry Song
2025-09-11 10:47 ` Lorenzo Stoakes
2025-09-11 8:28 ` Lorenzo Stoakes
2025-09-11 18:22 ` Jann Horn
2025-09-12 4:49 ` Lorenzo Stoakes
2025-09-12 11:37 ` Jann Horn
2025-09-12 11:56 ` Lorenzo Stoakes
2025-09-14 23:53 ` Matthew Wilcox
2025-09-15 0:23 ` Barry Song
2025-09-15 1:47 ` Suren Baghdasaryan
2025-09-15 8:41 ` Lorenzo Stoakes
2025-09-15 2:50 ` Matthew Wilcox
2025-09-15 5:17 ` David Hildenbrand
2025-09-15 9:42 ` Lorenzo Stoakes
2025-09-15 10:29 ` David Hildenbrand
2025-09-15 10:56 ` Lorenzo Stoakes
2025-09-15 9:22 ` Lorenzo Stoakes
2025-09-15 10:41 ` David Hildenbrand
2025-09-15 10:51 ` Lorenzo Stoakes
2025-09-15 8:57 ` Lorenzo Stoakes