linux-mm.kvack.org archive mirror
* [LSF/MM/BPF TOPIC] The Future of the Anonymous Reverse Mapping [RESEND]
@ 2026-03-30 21:23 Lorenzo Stoakes (Oracle)
  2026-03-31 23:30 ` Barry Song
  0 siblings, 1 reply; 6+ messages in thread
From: Lorenzo Stoakes (Oracle) @ 2026-03-30 21:23 UTC (permalink / raw)
  To: lsf-pc
  Cc: linux-mm, David Hildenbrand, Liam R. Howlett, Vlastimil Babka,
	Suren Baghdasaryan, Pedro Falcato, Ryan Roberts, Harry Yoo,
	Rik van Riel, Jann Horn, Chris Li, Barry Song

[sorry subject line was typo'd, resending with correct subject line for
visibility. Original at
https://lore.kernel.org/linux-mm/8aa41d47-ee41-4af1-a334-587a34fe865d@lucifer.local/]

Currently we track the reverse mapping between folios and VMAs at a VMA level,
utilising a complicated and confusing combination of anon_vma objects and
anon_vma_chains linking them, which must be updated when VMAs are split,
merged, remapped or forked.

It's further complicated by various optimisations intended to avoid scalability
issues in locking and memory allocation.

I have done recent work to improve the situation [0] which has also led to a
reported improvement in lock scalability [1], but fundamentally the situation
remains the same.

The logic is actually, when you think hard enough about it, a fairly
reasonable means of implementing the reverse mapping at a VMA level.

It is, however, a very broken abstraction as it stands. In order to work with
the logic, you have to essentially keep a broad understanding of the entire
implementation in your head at one time - that is, not much is really
abstracted.

This results in confusion, mistakes, and bit rot. It's also very time-consuming
to work with - personally I've gone to the lengths of writing a private set of
slides for myself on the topic as a reminder each time I come back to it.

There are also issues with lock scalability - the use of interval trees to
maintain a connection between an anon_vma and AVCs connected to VMAs requires
that a lock must be held across the entire 'CoW hierarchy' of parent and child
VMAs whenever performing an rmap walk or performing a merge, split, remap or
fork.

This is because we tear down all interval tree mappings and reestablish them
each time we might see changes in VMA geometry. This is an issue Barry Song
identified as problematic in a real world use case [2].

So what do we do to improve the situation?

Recently I have been working on an experimental new approach to the anonymous
reverse mapping, in which we instead track anonymous remaps, and then use the
VMA's virtual page offset to locate VMAs from the folio.
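To make the "virtual page offset" idea concrete, here is a minimal userspace sketch (the struct and function names are illustrative stand-ins, not the actual kernel code) of the standard rmap address computation, and of the anon identity that lets a folio's index locate candidate VMAs directly, provided remaps are tracked separately:

```c
#include <assert.h>
#include <stdint.h>

#define PAGE_SHIFT 12 /* assume 4KiB pages */

/* Illustrative stand-in for struct vm_area_struct; only the fields
 * relevant to the rmap computation are modelled. */
struct vma_sketch {
	uint64_t vm_start;  /* first virtual address of the mapping */
	uint64_t vm_pgoff;  /* page offset of vm_start within the object */
};

/* The standard rmap address computation: the virtual address at which
 * a folio with page index 'pgoff' is mapped within this VMA. */
static uint64_t vma_address_of(const struct vma_sketch *vma, uint64_t pgoff)
{
	return vma->vm_start + ((pgoff - vma->vm_pgoff) << PAGE_SHIFT);
}

/* For an anonymous VMA first faulted at vm_start, vm_pgoff equals
 * vm_start >> PAGE_SHIFT, so the folio's index alone pins down the
 * virtual page it is mapped at. */
static int anon_identity_holds(const struct vma_sketch *vma)
{
	return vma->vm_pgoff == vma->vm_start >> PAGE_SHIFT;
}
```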

I have got the implementation working to the point where it tracks the exact
same VMAs as the anon_vma implementation, and it seems a lot of it can be done
under RCU.

It avoids the need to maintain expensive mappings at a VMA level, though it
incurs a cost in tracking remaps, and MAP_PRIVATE files are very much a TODO
(they maintain a file vma->vm_pgoff, even when CoW'd, so the remap tracking is
pretty sub-optimal).

I am investigating whether I can change how MAP_PRIVATE file-backed mappings
work to avoid this issue, and will be developing tests to see how lock
scalability, throughput and memory usage compare to the anon_vma approach under
different workloads.

This experiment may or may not work out, either way it will be interesting to
discuss it.

By the time LSF/MM comes around I may even have already decided on a different
approach but that's what makes things interesting :)

[0]:https://lore.kernel.org/all/cover.1767711638.git.lorenzo.stoakes@oracle.com/
[1]:https://lore.kernel.org/all/202602061747.855f053f-lkp@intel.com/
[2]:https://lore.kernel.org/linux-mm/CAGsJ_4x=YsQR=nNcHA-q=0vg0b7ok=81C_qQqKmoJ+BZ+HVduQ@mail.gmail.com/

Cheers, Lorenzo


^ permalink raw reply	[flat|nested] 6+ messages in thread

* Re: [LSF/MM/BPF TOPIC] The Future of the Anonymous Reverse Mapping [RESEND]
  2026-03-30 21:23 [LSF/MM/BPF TOPIC] The Future of the Anonymous Reverse Mapping [RESEND] Lorenzo Stoakes (Oracle)
@ 2026-03-31 23:30 ` Barry Song
  2026-04-01  8:43   ` Lorenzo Stoakes (Oracle)
  0 siblings, 1 reply; 6+ messages in thread
From: Barry Song @ 2026-03-31 23:30 UTC (permalink / raw)
  To: Lorenzo Stoakes (Oracle)
  Cc: lsf-pc, linux-mm, David Hildenbrand, Liam R. Howlett,
	Vlastimil Babka, Suren Baghdasaryan, Pedro Falcato, Ryan Roberts,
	Harry Yoo, Rik van Riel, Jann Horn, Chris Li

Hi Lorenzo,

Thank you very much for bringing this up for discussion.

On Tue, Mar 31, 2026 at 5:23 AM Lorenzo Stoakes (Oracle) <ljs@kernel.org> wrote:
>
> [sorry subject line was typo'd, resending with correct subject line for
> visibility. Original at
> https://lore.kernel.org/linux-mm/8aa41d47-ee41-4af1-a334-587a34fe865d@lucifer.local/]
>
> Currently we track the reverse mapping between folios and VMAs at a VMA level,
> utilising a complicated and confusing combination of anon_vma objects and
> anon_vma_chains linking them, which must be updated when VMAs are split,
> merged, remapped or forked.
>
> It's further complicated by various optimisations intended to avoid scalability
> issues in locking and memory allocation.
>
> I have done recent work to improve the situation [0] which has also led to a
> reported improvement in lock scalability [1], but fundamentally the situation
> remains the same.
>
> The logic is actually, when you think hard enough about it, a fairly
> reasonable means of implementing the reverse mapping at a VMA level.
>
> It is, however, a very broken abstraction as it stands. In order to work with
> the logic, you have to essentially keep a broad understanding of the entire
> implementation in your head at one time - that is, not much is really
> abstracted.
>
> This results in confusion, mistakes, and bit rot. It's also very time-consuming
> to work with - personally I've gone to the lengths of writing a private set of
> slides for myself on the topic as a reminder each time I come back to it.
>
> There are also issues with lock scalability - the use of interval trees to
> maintain a connection between an anon_vma and AVCs connected to VMAs requires
> that a lock must be held across the entire 'CoW hierarchy' of parent and child
> VMAs whenever performing an rmap walk or performing a merge, split, remap or
> fork.
>
> This is because we tear down all interval tree mappings and reestablish them
> each time we might see changes in VMA geometry. This is an issue Barry Song
> identified as problematic in a real world use case [2].
>
> So what do we do to improve the situation?
>
> Recently I have been working on an experimental new approach to the anonymous
> reverse mapping, in which we instead track anonymous remaps, and then use the
> VMA's virtual page offset to locate VMAs from the folio.

Please forgive my confusion. I’m still struggling to fully
understand your approach of “tracking anonymous remaps.”
Could you provide a concrete example to illustrate how it works?

For example, if A forks B, and then B forks C, how do we
determine the VMAs for a folio from the original A that has
not yet been COWed in B or C?

Additionally, if B COWs and obtains a new folio before forking
C, how do we determine its VMAs in B and C?

Also, what happens if C performs a remap on the inherited VMA
in the two cases described above?

>
> I have got the implementation working to the point where it tracks the exact
> same VMAs as the anon_vma implementation, and it seems a lot of it can be done
> under RCU.
>
> It avoids the need to maintain expensive mappings at a VMA level, though it
> incurs a cost in tracking remaps, and MAP_PRIVATE files are very much a TODO
> (they maintain a file vma->vm_pgoff, even when CoW'd, so the remap tracking is
> pretty sub-optimal).
>
> I am investigating whether I can change how MAP_PRIVATE file-backed mappings
> work to avoid this issue, and will be developing tests to see how lock
> scalability, throughput and memory usage compare to the anon_vma approach under
> different workloads.
>
> This experiment may or may not work out, either way it will be interesting to
> discuss it.
>
> By the time LSF/MM comes around I may even have already decided on a different
> approach but that's what makes things interesting :)
>
> [0]:https://lore.kernel.org/all/cover.1767711638.git.lorenzo.stoakes@oracle.com/
> [1]:https://lore.kernel.org/all/202602061747.855f053f-lkp@intel.com/
> [2]:https://lore.kernel.org/linux-mm/CAGsJ_4x=YsQR=nNcHA-q=0vg0b7ok=81C_qQqKmoJ+BZ+HVduQ@mail.gmail.com/
>
> Cheers, Lorenzo

Thanks
Barry



* Re: [LSF/MM/BPF TOPIC] The Future of the Anonymous Reverse Mapping [RESEND]
  2026-03-31 23:30 ` Barry Song
@ 2026-04-01  8:43   ` Lorenzo Stoakes (Oracle)
  2026-04-01 21:03     ` Barry Song
  0 siblings, 1 reply; 6+ messages in thread
From: Lorenzo Stoakes (Oracle) @ 2026-04-01  8:43 UTC (permalink / raw)
  To: Barry Song
  Cc: lsf-pc, linux-mm, David Hildenbrand, Liam R. Howlett,
	Vlastimil Babka, Suren Baghdasaryan, Pedro Falcato, Ryan Roberts,
	Harry Yoo, Rik van Riel, Jann Horn, Chris Li

On Wed, Apr 01, 2026 at 07:30:41AM +0800, Barry Song wrote:
> Hi Lorenzo,
>
> Thank you very much for bringing this up for discussion.
>
> On Tue, Mar 31, 2026 at 5:23 AM Lorenzo Stoakes (Oracle) <ljs@kernel.org> wrote:
> >
> > [sorry subject line was typo'd, resending with correct subject line for
> > visibility. Original at
> > https://lore.kernel.org/linux-mm/8aa41d47-ee41-4af1-a334-587a34fe865d@lucifer.local/]
> >
> > Currently we track the reverse mapping between folios and VMAs at a VMA level,
> > utilising a complicated and confusing combination of anon_vma objects and
> > anon_vma_chains linking them, which must be updated when VMAs are split,
> > merged, remapped or forked.
> >
> > It's further complicated by various optimisations intended to avoid scalability
> > issues in locking and memory allocation.
> >
> > I have done recent work to improve the situation [0] which has also led to a
> > reported improvement in lock scalability [1], but fundamentally the situation
> > remains the same.
> >
> > The logic is actually, when you think hard enough about it, a fairly
> > reasonable means of implementing the reverse mapping at a VMA level.
> >
> > It is, however, a very broken abstraction as it stands. In order to work with
> > the logic, you have to essentially keep a broad understanding of the entire
> > implementation in your head at one time - that is, not much is really
> > abstracted.
> >
> > This results in confusion, mistakes, and bit rot. It's also very time-consuming
> > to work with - personally I've gone to the lengths of writing a private set of
> > slides for myself on the topic as a reminder each time I come back to it.
> >
> > There are also issues with lock scalability - the use of interval trees to
> > maintain a connection between an anon_vma and AVCs connected to VMAs requires
> > that a lock must be held across the entire 'CoW hierarchy' of parent and child
> > VMAs whenever performing an rmap walk or performing a merge, split, remap or
> > fork.
> >
> > This is because we tear down all interval tree mappings and reestablish them
> > each time we might see changes in VMA geometry. This is an issue Barry Song
> > identified as problematic in a real world use case [2].
> >
> > So what do we do to improve the situation?
> >
> > Recently I have been working on an experimental new approach to the anonymous
> > reverse mapping, in which we instead track anonymous remaps, and then use the
> > VMA's virtual page offset to locate VMAs from the folio.
>
> Please forgive my confusion. I’m still struggling to fully
> understand your approach of “tracking anonymous remaps.”
> Could you provide a concrete example to illustrate how it works?

I should really put this code somewhere :)

>
> For example, if A forks B, and then B forks C, how do we
> determine the VMAs for a folio from the original A that has
> not yet been COWed in B or C?

The folio references the cow_context associated with the mm in A.

So mm has a new cow_context field that points to cow_context, and the
cow_context can outlive the mm if it has children.

Each cow context tracks its forked children also, so an rmap search will
traverse A, B, C.
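As a rough illustration of that hierarchy walk (hypothetical names and layout; the real cow_context would use kernel lists under RCU rather than fixed arrays), an rmap search starting from the folio's root context visits that context and every forked descendant:

```c
#include <assert.h>

#define MAX_FORKS 4 /* arbitrary bound for the sketch */

/* Hypothetical model of a cow_context: it can outlive its mm and
 * tracks the contexts of forked children. */
struct cow_ctx_sketch {
	char name;                                 /* 'A', 'B', 'C', ... */
	struct cow_ctx_sketch *children[MAX_FORKS];
	int nr_children;
};

/* Visit this context and all descendants, recording the mms an rmap
 * walk would have to consider; returns the number visited. */
static int cow_walk(const struct cow_ctx_sketch *ctx, char *visited)
{
	int n = 0;

	visited[n++] = ctx->name;
	for (int i = 0; i < ctx->nr_children; i++)
		n += cow_walk(ctx->children[i], visited + n);
	return n;
}
```

So a folio from A that was never CoW'd points at A's context, and the search covers A, B and C.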

>
> Additionally, if B COWs and obtains a new folio before forking
> C, how do we determine its VMAs in B and C?

The new folio would point to B's cow context, and it'd traverse B and C to find
relevant folios.

Overall we pay a higher search price (though arguably, not too bad still) but
get to do it _all_ under RCU.

In exchange, we avoid the locking issues and use ~30x less memory.

(Of course I am yet to solve rmap lock stabilisation so got to try and do that
first :)

>
> Also, what happens if C performs a remap on the inherited VMA
> in the two cases described above?

Remaps are tracked within cow_contexts via an extended maple tree (currently
maple tree -> dynamic arrays) that also handles multiple entries and overlaps.
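A crude sketch of the lookup side of that (a flat array standing in for the extended maple tree; names and layout are illustrative only): each record maps a range of original page offsets to new ones, and a search must yield every candidate, since overlapping remaps can produce multiple hits:

```c
#include <assert.h>
#include <stdint.h>

/* One recorded remap: 'nr' pages moved from old_pgoff to new_pgoff. */
struct remap_rec {
	uint64_t old_pgoff;
	uint64_t nr;
	uint64_t new_pgoff;
};

/* Translate a folio's original page offset through the recorded remaps.
 * Writes every candidate offset to 'out' (overlapping records can each
 * match the same offset) and returns how many were found. */
static int remap_lookup(const struct remap_rec *tbl, int n,
			uint64_t pgoff, uint64_t *out)
{
	int found = 0;

	for (int i = 0; i < n; i++) {
		if (pgoff >= tbl[i].old_pgoff &&
		    pgoff < tbl[i].old_pgoff + tbl[i].nr)
			out[found++] = tbl[i].new_pgoff +
				       (pgoff - tbl[i].old_pgoff);
	}
	return found;
}
```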

>
> >
> > I have got the implementation working to the point where it tracks the exact
> > same VMAs as the anon_vma implementation, and it seems a lot of it can be done
> > under RCU.
> >
> > It avoids the need to maintain expensive mappings at a VMA level, though it
> > incurs a cost in tracking remaps, and MAP_PRIVATE files are very much a TODO
> > (they maintain a file vma->vm_pgoff, even when CoW'd, so the remap tracking is
> > pretty sub-optimal).
> >
> > I am investigating whether I can change how MAP_PRIVATE file-backed mappings
> > work to avoid this issue, and will be developing tests to see how lock
> > scalability, throughput and memory usage compare to the anon_vma approach under
> > different workloads.
> >
> > This experiment may or may not work out, either way it will be interesting to
> > discuss it.
> >
> > By the time LSF/MM comes around I may even have already decided on a different
> > approach but that's what makes things interesting :)
> >
> > [0]:https://lore.kernel.org/all/cover.1767711638.git.lorenzo.stoakes@oracle.com/
> > [1]:https://lore.kernel.org/all/202602061747.855f053f-lkp@intel.com/
> > [2]:https://lore.kernel.org/linux-mm/CAGsJ_4x=YsQR=nNcHA-q=0vg0b7ok=81C_qQqKmoJ+BZ+HVduQ@mail.gmail.com/
> >
> > Cheers, Lorenzo
>
> Thanks
> Barry

Cheers, Lorenzo



* Re: [LSF/MM/BPF TOPIC] The Future of the Anonymous Reverse Mapping [RESEND]
  2026-04-01  8:43   ` Lorenzo Stoakes (Oracle)
@ 2026-04-01 21:03     ` Barry Song
  2026-04-02 12:20       ` Lorenzo Stoakes (Oracle)
  0 siblings, 1 reply; 6+ messages in thread
From: Barry Song @ 2026-04-01 21:03 UTC (permalink / raw)
  To: Lorenzo Stoakes (Oracle)
  Cc: lsf-pc, linux-mm, David Hildenbrand, Liam R. Howlett,
	Vlastimil Babka, Suren Baghdasaryan, Pedro Falcato, Ryan Roberts,
	Harry Yoo, Rik van Riel, Jann Horn, Chris Li

On Wed, Apr 1, 2026 at 4:43 PM Lorenzo Stoakes (Oracle) <ljs@kernel.org> wrote:
>
> On Wed, Apr 01, 2026 at 07:30:41AM +0800, Barry Song wrote:
> > Hi Lorenzo,
> >
> > Thank you very much for bringing this up for discussion.
> >
> > On Tue, Mar 31, 2026 at 5:23 AM Lorenzo Stoakes (Oracle) <ljs@kernel.org> wrote:
> > >
> > > [sorry subject line was typo'd, resending with correct subject line for
> > > visibility. Original at
> > > https://lore.kernel.org/linux-mm/8aa41d47-ee41-4af1-a334-587a34fe865d@lucifer.local/]
> > >
> > > Currently we track the reverse mapping between folios and VMAs at a VMA level,
> > > utilising a complicated and confusing combination of anon_vma objects and
> > > anon_vma_chains linking them, which must be updated when VMAs are split,
> > > merged, remapped or forked.
> > >
> > > It's further complicated by various optimisations intended to avoid scalability
> > > issues in locking and memory allocation.
> > >
> > > I have done recent work to improve the situation [0] which has also led to a
> > > reported improvement in lock scalability [1], but fundamentally the situation
> > > remains the same.
> > >
> > > The logic is actually, when you think hard enough about it, a fairly
> > > reasonable means of implementing the reverse mapping at a VMA level.
> > >
> > > It is, however, a very broken abstraction as it stands. In order to work with
> > > the logic, you have to essentially keep a broad understanding of the entire
> > > implementation in your head at one time - that is, not much is really
> > > abstracted.
> > >
> > > This results in confusion, mistakes, and bit rot. It's also very time-consuming
> > > to work with - personally I've gone to the lengths of writing a private set of
> > > slides for myself on the topic as a reminder each time I come back to it.
> > >
> > > There are also issues with lock scalability - the use of interval trees to
> > > maintain a connection between an anon_vma and AVCs connected to VMAs requires
> > > that a lock must be held across the entire 'CoW hierarchy' of parent and child
> > > VMAs whenever performing an rmap walk or performing a merge, split, remap or
> > > fork.
> > >
> > > This is because we tear down all interval tree mappings and reestablish them
> > > each time we might see changes in VMA geometry. This is an issue Barry Song
> > > identified as problematic in a real world use case [2].
> > >
> > > So what do we do to improve the situation?
> > >
> > > Recently I have been working on an experimental new approach to the anonymous
> > > reverse mapping, in which we instead track anonymous remaps, and then use the
> > > VMA's virtual page offset to locate VMAs from the folio.
> >
> > Please forgive my confusion. I’m still struggling to fully
> > understand your approach of “tracking anonymous remaps.”
> > Could you provide a concrete example to illustrate how it works?
>
> I should really put this code somewhere :)
>
> >
> > For example, if A forks B, and then B forks C, how do we
> > determine the VMAs for a folio from the original A that has
> > not yet been COWed in B or C?
>
> The folio references the cow_context associated with the mm in A.
>
> So mm has a new cow_context field that points to cow_context, and the
> cow_context can outlive the mm if it has children.

So we can’t use list_for_each_entry_rcu(child, &parent->children, sibling)
because in vfork() and exec() cases the mm_struct is not inherited?

>
> Each cow context tracks its forked children also, so an rmap search will
> traverse A, B, C.

I still don’t understand how we can get a folio’s VMA from the folio itself.
For anonymous VMAs, vma->vm_pgoff is always zero, right?

Are you changing vm_pgoff to a value equal to vm_start >> PAGE_SHIFT?

In case A forks B, and B unmaps a VMA then maps a new
VMA at the same address as before, what happens? Will the
traversal find the new VMA, which doesn’t actually map the folio?

>
> >
> > Additionally, if B COWs and obtains a new folio before forking
> > C, how do we determine its VMAs in B and C?
>
> The new folio would point to B's cow context, and it'd traverse B and C to find
> relevant folios.
>
> Overall we pay a higher search price (though arguably, not too bad still) but
> get to do it _all_ under RCU.

Yep. I see that list_for_each_entry_rcu(child, &parent->children, sibling)
can work safely under RCU.

>
> In exchange, we avoid the locking issues and use ~30x less memory.
>
> (Of course I am yet to solve rmap lock stabilisation so got to try and do that
> first :)
>
> >
> > Also, what happens if C performs a remap on the inherited VMA
> > in the two cases described above?
>
> > Remaps are tracked within cow_contexts via an extended maple tree (currently
> maple tree -> dynamic arrays) that also handles multiple entries and overlaps.

If we have multiple remaps for multiple VMAs within one mm_struct,
will we end up traversing all the dynamic arrays for any folio that
might be located in a VMA that has been remapped?

>
> >
> > >
> > > I have got the implementation working to the point where it tracks the exact
> > > same VMAs as the anon_vma implementation, and it seems a lot of it can be done
> > > under RCU.
> > >
> > > It avoids the need to maintain expensive mappings at a VMA level, though it
> > > incurs a cost in tracking remaps, and MAP_PRIVATE files are very much a TODO
> > > (they maintain a file vma->vm_pgoff, even when CoW'd, so the remap tracking is
> > > pretty sub-optimal).
> > >
> > > I am investigating whether I can change how MAP_PRIVATE file-backed mappings
> > > work to avoid this issue, and will be developing tests to see how lock
> > > scalability, throughput and memory usage compare to the anon_vma approach under
> > > different workloads.
> > >
> > > This experiment may or may not work out, either way it will be interesting to
> > > discuss it.
> > >
> > > By the time LSF/MM comes around I may even have already decided on a different
> > > approach but that's what makes things interesting :)
> > >
> > > [0]:https://lore.kernel.org/all/cover.1767711638.git.lorenzo.stoakes@oracle.com/
> > > [1]:https://lore.kernel.org/all/202602061747.855f053f-lkp@intel.com/
> > > [2]:https://lore.kernel.org/linux-mm/CAGsJ_4x=YsQR=nNcHA-q=0vg0b7ok=81C_qQqKmoJ+BZ+HVduQ@mail.gmail.com/
> > >

Thanks
Barry



* Re: [LSF/MM/BPF TOPIC] The Future of the Anonymous Reverse Mapping [RESEND]
  2026-04-01 21:03     ` Barry Song
@ 2026-04-02 12:20       ` Lorenzo Stoakes (Oracle)
  2026-04-02 21:49         ` Barry Song
  0 siblings, 1 reply; 6+ messages in thread
From: Lorenzo Stoakes (Oracle) @ 2026-04-02 12:20 UTC (permalink / raw)
  To: Barry Song
  Cc: lsf-pc, linux-mm, David Hildenbrand, Liam R. Howlett,
	Vlastimil Babka, Suren Baghdasaryan, Pedro Falcato, Ryan Roberts,
	Harry Yoo, Rik van Riel, Jann Horn, Chris Li

On Thu, Apr 02, 2026 at 05:03:42AM +0800, Barry Song wrote:
> On Wed, Apr 1, 2026 at 4:43 PM Lorenzo Stoakes (Oracle) <ljs@kernel.org> wrote:
> >
> > On Wed, Apr 01, 2026 at 07:30:41AM +0800, Barry Song wrote:
> > > Hi Lorenzo,
> > >
> > > Thank you very much for bringing this up for discussion.
> > >
> > > On Tue, Mar 31, 2026 at 5:23 AM Lorenzo Stoakes (Oracle) <ljs@kernel.org> wrote:
> > > >
> > > > [sorry subject line was typo'd, resending with correct subject line for
> > > > visibility. Original at
> > > > https://lore.kernel.org/linux-mm/8aa41d47-ee41-4af1-a334-587a34fe865d@lucifer.local/]
> > > >
> > > > Currently we track the reverse mapping between folios and VMAs at a VMA level,
> > > > utilising a complicated and confusing combination of anon_vma objects and
> > > > anon_vma_chains linking them, which must be updated when VMAs are split,
> > > > merged, remapped or forked.
> > > >
> > > > It's further complicated by various optimisations intended to avoid scalability
> > > > issues in locking and memory allocation.
> > > >
> > > > I have done recent work to improve the situation [0] which has also led to a
> > > > reported improvement in lock scalability [1], but fundamentally the situation
> > > > remains the same.
> > > >
> > > > The logic is actually, when you think hard enough about it, a fairly
> > > > reasonable means of implementing the reverse mapping at a VMA level.
> > > >
> > > > It is, however, a very broken abstraction as it stands. In order to work with
> > > > the logic, you have to essentially keep a broad understanding of the entire
> > > > implementation in your head at one time - that is, not much is really
> > > > abstracted.
> > > >
> > > > This results in confusion, mistakes, and bit rot. It's also very time-consuming
> > > > to work with - personally I've gone to the lengths of writing a private set of
> > > > slides for myself on the topic as a reminder each time I come back to it.
> > > >
> > > > There are also issues with lock scalability - the use of interval trees to
> > > > maintain a connection between an anon_vma and AVCs connected to VMAs requires
> > > > that a lock must be held across the entire 'CoW hierarchy' of parent and child
> > > > VMAs whenever performing an rmap walk or performing a merge, split, remap or
> > > > fork.
> > > >
> > > > This is because we tear down all interval tree mappings and reestablish them
> > > > each time we might see changes in VMA geometry. This is an issue Barry Song
> > > > identified as problematic in a real world use case [2].
> > > >
> > > > So what do we do to improve the situation?
> > > >
> > > > Recently I have been working on an experimental new approach to the anonymous
> > > > reverse mapping, in which we instead track anonymous remaps, and then use the
> > > > VMA's virtual page offset to locate VMAs from the folio.
> > >
> > > Please forgive my confusion. I’m still struggling to fully
> > > understand your approach of “tracking anonymous remaps.”
> > > Could you provide a concrete example to illustrate how it works?
> >
> > I should really put this code somewhere :)
> >
> > >
> > > For example, if A forks B, and then B forks C, how do we
> > > determine the VMAs for a folio from the original A that has
> > > not yet been COWed in B or C?
> >
> > The folio references the cow_context associated with the mm in A.
> >
> > So mm has a new cow_context field that points to cow_context, and the
> > cow_context can outlive the mm if it has children.
>
> So we can’t use list_for_each_entry_rcu(child, &parent->children, sibling)
> because in vfork() and exec() cases the mm_struct is not inherited?

Umm, memory is not preserved across an exec() :) so it works fine with that.

vfork() is CLONE_VM so the mm is shared and everything works fine.

>
> >
> > Each cow context tracks its forked children also, so an rmap search will
> > traverse A, B, C.
>
> I still don’t understand how we can get a folio’s VMA from the folio itself.
> For anonymous VMAs, vma->vm_pgoff is always zero, right?

No, not at all.

vma->vm_pgoff is equal to vma->vm_start >> PAGE_SHIFT when first faulted in for
anon.

That reduces the problem to tracking remaps, which I do.
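To spell that out with a toy example (names hypothetical; real mremap() semantics are more involved): for anon memory a move leaves existing folios' indices, and hence the VMA's vm_pgoff, unchanged while vm_start moves, so the identity breaks and the delta is the state that has to be recorded:

```c
#include <assert.h>
#include <stdint.h>

#define PAGE_SHIFT 12 /* assume 4KiB pages */

struct vma_sketch {
	uint64_t vm_start;
	uint64_t vm_pgoff;
};

/* Holds for an anon VMA that has never been moved. */
static int anon_identity(const struct vma_sketch *v)
{
	return v->vm_pgoff == v->vm_start >> PAGE_SHIFT;
}

/* Toy mremap: the mapping moves, but already-faulted anon folios keep
 * their indices, so vm_pgoff stays put.  The returned page delta is
 * what a remap-tracking scheme must record to still locate the VMA
 * from a folio's index. */
static int64_t toy_mremap(struct vma_sketch *v, uint64_t new_start)
{
	v->vm_start = new_start;
	return (int64_t)(v->vm_start >> PAGE_SHIFT) - (int64_t)v->vm_pgoff;
}
```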

>
> Are you changing vm_pgoff to a value equal to vm_start >> PAGE_SHIFT?

No, that's how anon works already.

>
> In case A forks B, and B unmaps a VMA then maps a new
> VMA at the same address as before, what happens? Will the
> traversal find the new VMA, which doesn’t actually map the folio?

Well, you're missing something there: the folio would have to be non-anon-exclusive
(which is rare). Yes, it'd find the new VMA, traverse it, find the folio does
not match, and then traverse the children.

rmap walks _always_ allow for the possibility of walking VMAs that a folio does
not belong to.

For instance, with anon_vma, if you CoW a bunch of folios to child process VMAs,
a walk on a non-CoW'd folio will _still_ traverse all of those uselessly.

In any case, this isn't a common case.

However note that if a folio _becomes_ anon exclusive, it switches its 'root'
cow context to the one associated with the mm which it became exclusive to.

>
> >
> > >
> > > Additionally, if B COWs and obtains a new folio before forking
> > > C, how do we determine its VMAs in B and C?
> >
> > The new folio would point to B's cow context, and it'd traverse B and C to find
> > relevant folios.
> >
> > Overall we pay a higher search price (though arguably, not too bad still) but
> > get to do it _all_ under RCU.
>
> Yep. I see that list_for_each_entry_rcu(child, &parent->children, sibling)
> can work safely under RCU.
>
> >
> > In exchange, we avoid the locking issues and use ~30x less memory.
> >
> > (Of course I am yet to solve rmap lock stabilisation so got to try and do that
> > first :)
> >
> > >
> > > Also, what happens if C performs a remap on the inherited VMA
> > > in the two cases described above?
> >
> > Remaps are tracked within cow_contexts via an extended maple tree (currently
> > maple tree -> dynamic arrays) that also handles multiple entries and overlaps.
>
> If we have multiple remaps for multiple VMAs within one mm_struct,
> will we end up traversing all the dynamic arrays for any folio that
> might be located in a VMA that has been remapped?

Yup. But there aren't all that many, and it's all under RCU so :)

That part of the search should be quick, parts of the search involving page
tables, less so.

Also I need to figure out how to maintain stabilisation without an rmap lock, an
ongoing open problem in all this.

In the end, as the original mail said, I may conclude _this_ approach is
unworkable and come up with an alternative that's more conventional.

BUT. Doing it this way uses ~30x less kernel-allocated memory. I tried a heavy
load case and the saving was very substantial. That's not to be sniffed at.

In any case, all of this is going to be _very_ driven by metrics. How slow is
it, how much overhead does it actually produce, is it workable, are the
trade-offs right, etc.

It's an exploration rather than a fait accompli.

Cheers, Lorenzo



* Re: [LSF/MM/BPF TOPIC] The Future of the Anonymous Reverse Mapping [RESEND]
  2026-04-02 12:20       ` Lorenzo Stoakes (Oracle)
@ 2026-04-02 21:49         ` Barry Song
  0 siblings, 0 replies; 6+ messages in thread
From: Barry Song @ 2026-04-02 21:49 UTC (permalink / raw)
  To: Lorenzo Stoakes (Oracle)
  Cc: lsf-pc, linux-mm, David Hildenbrand, Liam R. Howlett,
	Vlastimil Babka, Suren Baghdasaryan, Pedro Falcato, Ryan Roberts,
	Harry Yoo, Rik van Riel, Jann Horn, Chris Li

On Thu, Apr 2, 2026 at 8:20 PM Lorenzo Stoakes (Oracle) <ljs@kernel.org> wrote:
>
> On Thu, Apr 02, 2026 at 05:03:42AM +0800, Barry Song wrote:
> > On Wed, Apr 1, 2026 at 4:43 PM Lorenzo Stoakes (Oracle) <ljs@kernel.org> wrote:
> > >
> > > On Wed, Apr 01, 2026 at 07:30:41AM +0800, Barry Song wrote:
> > > > Hi Lorenzo,
> > > >
> > > > Thank you very much for bringing this up for discussion.
> > > >
> > > > On Tue, Mar 31, 2026 at 5:23 AM Lorenzo Stoakes (Oracle) <ljs@kernel.org> wrote:
> > > > >
> > > > > [sorry subject line was typo'd, resending with correct subject line for
> > > > > visibility. Original at
> > > > > https://lore.kernel.org/linux-mm/8aa41d47-ee41-4af1-a334-587a34fe865d@lucifer.local/]
> > > > >
> > > > > Currently we track the reverse mapping between folios and VMAs at a VMA level,
> > > > > utilising a complicated and confusing combination of anon_vma objects and
> > > > > anon_vma_chains linking them, which must be updated when VMAs are split,
> > > > > merged, remapped or forked.
> > > > >
> > > > > It's further complicated by various optimisations intended to avoid scalability
> > > > > issues in locking and memory allocation.
> > > > >
> > > > > I have done recent work to improve the situation [0] which has also led to a
> > > > > reported improvement in lock scalability [1], but fundamentally the situation
> > > > > remains the same.
> > > > >
> > > > > The logic is actually, when you think hard enough about it, a fairly
> > > > > reasonable means of implementing the reverse mapping at a VMA level.
> > > > >
> > > > > It is, however, a very broken abstraction as it stands. In order to work with
> > > > > the logic, you have to essentially keep a broad understanding of the entire
> > > > > implementation in your head at one time - that is, not much is really
> > > > > abstracted.
> > > > >
> > > > > This results in confusion, mistakes, and bit rot. It's also very time-consuming
> > > > > to work with - personally I've gone to the lengths of writing a private set of
> > > > > slides for myself on the topic as a reminder each time I come back to it.
> > > > >
> > > > > There are also issues with lock scalability - the use of interval trees to
> > > > > maintain a connection between an anon_vma and AVCs connected to VMAs requires
> > > > > that a lock must be held across the entire 'CoW hierarchy' of parent and child
> > > > > VMAs whenever performing an rmap walk or performing a merge, split, remap or
> > > > > fork.
> > > > >
> > > > > This is because we tear down all interval tree mappings and reestablish them
> > > > > each time we might see changes in VMA geometry. This is an issue Barry Song
> > > > > identified as problematic in a real world use case [2].
> > > > >
> > > > > So what do we do to improve the situation?
> > > > >
> > > > > Recently I have been working on an experimental new approach to the anonymous
> > > > > reverse mapping, in which we instead track anonymous remaps, and then use the
> > > > > VMA's virtual page offset to locate VMAs from the folio.
> > > >
> > > > Please forgive my confusion. I’m still struggling to fully
> > > > understand your approach of “tracking anonymous remaps.”
> > > > Could you provide a concrete example to illustrate how it works?
> > >
> > > I should really put this code somewhere :)
> > >
> > > >
> > > > For example, if A forks B, and then B forks C, how do we
> > > > determine the VMAs for a folio from the original A that has
> > > > not yet been COWed in B or C?
> > >
> > > The folio references the cow_context associated with the mm in A.
> > >
> > > So mm_struct has a new cow_context field that points to a cow_context, and
> > > the cow_context can outlive the mm if it has children.
> >
> > So we can’t use list_for_each_entry_rcu(child, &parent->children, sibling)
> > because in vfork() and exec() cases the mm_struct is not inherited?
>
> Umm, memory is not preserved across an exec() :) so it works fine with that.
>
> vfork() is CLONE_VM so the mm is shared and everything works fine.

My question is whether we can reuse the process tree, similar to
walk_tg_tree_from(). With some flags in mm_struct, it might be
possible to distinguish whether an mm_struct was copied from the
parent or created by a new exec.

>
> >
> > >
> > > Each cow context tracks its forked children also, so an rmap search will
> > > traverse A, B, C.
> >
> > I still don’t understand how we can get a folio’s VMA from the folio itself.
> > For anonymous VMAs, vma->vm_pgoff is always zero, right?
>
> No, not at all.
>
> vma->vm_pgoff is equal to vma->vm_start >> PAGE_SHIFT when first faulted in for
> anon.
>
> That reduces the problem to tracking remaps, which I do.
>
> >
> > Are you changing vm_pgoff to a value equal to vm_start >> PAGE_SHIFT?
>
> No, that's how anon works already.

Sorry for my mistake. I was somehow reading incorrect information from
/proc/<pid>/maps, where anonymous VMAs always appeared as zero.
A simple patch like the one below proves that you are absolutely right:

diff --git a/fs/proc/task_mmu.c b/fs/proc/task_mmu.c
index 33e5094a7842..0cecff1c6307 100644
--- a/fs/proc/task_mmu.c
+++ b/fs/proc/task_mmu.c
@@ -475,9 +475,9 @@ show_map_vma(struct seq_file *m, struct vm_area_struct *vma)

                dev = inode->i_sb->s_dev;
                ino = inode->i_ino;
-               pgoff = ((loff_t)vma->vm_pgoff) << PAGE_SHIFT;
+               //pgoff = ((loff_t)vma->vm_pgoff) << PAGE_SHIFT;
        }
-
+       pgoff = ((loff_t)vma->vm_pgoff) << PAGE_SHIFT;
        start = vma->vm_start;
        end = vma->vm_end;
        show_vma_header_prefix(m, start, end, flags, pgoff, dev, ino);

>
> >
> > In case A forks B, and B unmaps a VMA then maps a new
> > VMA at the same address as before, what happens? Will the
> > traversal find the new VMA, which doesn’t actually map the folio?
>
> Well you're missing stuff there, the folio would have to be non-anon exclusive
> (which is rare). Yes it'd find the new VMA, then traverse, and find the folio
> does not match, and traverse children.
>
> rmap walks _always_ allow for you walking VMAs that a folio does not belong
> to.

I understand that we can check whether the folio belongs to the new
VMA, but I’m curious whether this will occur more frequently in practice
after the change. In the rmap case, I assume the original A’s folio
anon_vma would be detached from process B once B unmaps and then maps
a new VMA, so we wouldn’t search B anymore - is that correct?
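For my own understanding, the belongs-to check I would expect is the usual
linear-mapping one. A user-space sketch - the names are my own invention,
though the computation mirrors what the kernel's vma_address() does:

```c
#include <stdbool.h>

#define PAGE_SHIFT 12

/* Toy stand-in for struct vm_area_struct. For anon memory, vm_pgoff is
 * set from vm_start >> PAGE_SHIFT at first fault, as discussed above. */
struct vma_model {
	unsigned long vm_start;	/* byte address, inclusive */
	unsigned long vm_end;	/* byte address, exclusive */
	unsigned long vm_pgoff;	/* page offset of vm_start */
};

/*
 * Compute where page offset @pgoff would be mapped in @vma and report
 * whether that address actually falls inside the VMA. An rmap walk can
 * use a check like this to skip a VMA that was found by address but
 * cannot map the folio at all.
 */
static bool vma_covers_pgoff(const struct vma_model *vma,
			     unsigned long pgoff, unsigned long *addr)
{
	unsigned long address = vma->vm_start +
		((pgoff - vma->vm_pgoff) << PAGE_SHIFT);

	if (address < vma->vm_start || address >= vma->vm_end)
		return false;
	*addr = address;
	return true;
}
```

Even when this check passes, I assume the walk still has to consult the page
tables to see whether the PTE really references the folio in question.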

>
> For instance, with anon_vma, if you CoW a bunch of folios to child process VMAs,
> the non-CoW'd folio will _still_ traverse all of that uselessly.
>
> In any case, this isn't a common case.
>
> However note that if a folio _becomes_ anon exclusive, it switches its 'root'
> cow context to the one associated with the mm which it became exclusive to.
>

Agreed. I’m curious about the case of A’s folio, whose VMA has been
completely replaced in B after the unmap and map. In the old anon_vma
case, we wouldn’t search B anymore, but now we’ll need to check B's
vm_pgoff since it covers the folio’s address - is that correct?

> >
> > >
> > > >
> > > > Additionally, if B COWs and obtains a new folio before forking
> > > > C, how do we determine its VMAs in B and C?
> > >
> > > The new folio would point to B's cow context, and it'd traverse B and C to find
> > > relevant folios.
> > >
> > > Overall we pay a higher search price (though arguably, not too bad still) but
> > > get to do it _all_ under RCU.
> >
> > Yep. I see that list_for_each_entry_rcu(child, &parent->children, sibling)
> > can work safely under RCU.
> >
> > >
> > > In exchange, we avoid the locking issues and use ~30x less memory.
> > >
> > > (Of course I am yet to solve rmap lock stabilisation so got to try and do that
> > > first :)
> > >
> > > >
> > > > Also, what happens if C performs a remap on the inherited VMA
> > > > in the two cases described above?
> > >
> > > Remaps are tracked within cow_context's via an extended maple tree (currently
> > > maple tree -> dynamic arrays) that also handles multiple entries and overlaps.
> >
> > If we have multiple remaps for multiple VMAs within one mm_struct,
> > will we end up traversing all the dynamic arrays for any folio that
> > might be located in a VMA that has been remapped?
>
> Yup. But there aren't all that many, and it's all under RCU so :)
>
> That part of the search should be quick; the parts involving page tables,
> less so.
>
> Also I need to figure out how to maintain stabilisation without an rmap lock, an
> ongoing open problem in all this.
>
> In the end, as the original mail said, I may conclude _this_ approach is
> unworkable and come up with an alternative that's more conventional.

I’m genuinely interested in the new approach. If you have the code, I’d be
happy to read, test, and work on it.
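In the meantime, to check I have the "tracking remaps" part right: my mental
model is that a folio keeps its original page offset from first fault, and any
mremap() is recorded so that offset can be translated before searching the
current address space. A rough sketch, with entirely made-up structure names:

```c
/* Made-up record of one tracked remap; the real experiment stores these
 * in an extended maple tree inside the cow_context. */
struct remap_record {
	unsigned long old_pgoff;	/* page offset the range moved from */
	unsigned long new_pgoff;	/* page offset it now lives at */
	unsigned long nr_pages;
};

/*
 * Translate a folio's original page offset (fixed at first fault, when
 * vm_pgoff == vm_start >> PAGE_SHIFT) through any remap that moved its
 * range, yielding the offset to look up in the current address space.
 */
static unsigned long translate_pgoff(const struct remap_record *remaps,
				     int nr, unsigned long pgoff)
{
	for (int i = 0; i < nr; i++) {
		const struct remap_record *r = &remaps[i];

		if (pgoff >= r->old_pgoff &&
		    pgoff < r->old_pgoff + r->nr_pages)
			return r->new_pgoff + (pgoff - r->old_pgoff);
	}
	return pgoff;	/* never remapped: natural offset still valid */
}
```

Is that roughly the shape of it? (Ignoring overlapping and stacked remaps,
which I gather the dynamic arrays are there to handle.)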

>
> BUT. Doing it this way saves 30x the amount of kernel allocated memory. I tried
> a heavy load case and it was very substantial. That's not to be sniffed at.
>
> In any case, all of this is going to be _very_ driven by metrics. How slow is
> it, how much overhead does it actually produce, is it workable, are the
> trade-offs right, etc.
>
> It's an exploration rather than a fait accompli.

Right now, I’m still at the stage of trying to understand the details of
your new approach and would like to learn more - so I might have quite a
few naive questions :-)

Thanks
Barry




Thread overview: 6+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2026-03-30 21:23 [LSF/MM/BPF TOPIC] The Future of the Anonymous Reverse Mapping [RESEND] Lorenzo Stoakes (Oracle)
2026-03-31 23:30 ` Barry Song
2026-04-01  8:43   ` Lorenzo Stoakes (Oracle)
2026-04-01 21:03     ` Barry Song
2026-04-02 12:20       ` Lorenzo Stoakes (Oracle)
2026-04-02 21:49         ` Barry Song
