Re: [LSF/MM/BPF TOPIC] The Future of the Anonymous Reverse Mapping [RESEND]

linux-mm.kvack.org archive mirror
 help / color / mirror / Atom feed

From: Barry Song <baohua@kernel.org>
To: "Lorenzo Stoakes (Oracle)" <ljs@kernel.org>
Cc: lsf-pc@lists.linux-foundation.org, linux-mm@kvack.org,
	 David Hildenbrand <david@kernel.org>,
	"Liam R. Howlett" <Liam.Howlett@oracle.com>,
	 Vlastimil Babka <vbabka@kernel.org>,
	Suren Baghdasaryan <surenb@google.com>,
	Pedro Falcato <pfalcato@suse.de>,
	 Ryan Roberts <ryan.roberts@arm.com>,
	Harry Yoo <harry.yoo@oracle.com>,
	 Rik van Riel <riel@surriel.com>, Jann Horn <jannh@google.com>,
	Chris Li <chriscli@google.com>
Subject: Re: [LSF/MM/BPF TOPIC] The Future of the Anonymous Reverse Mapping [RESEND]
Date: Fri, 3 Apr 2026 05:49:10 +0800	[thread overview]
Message-ID: <CAGsJ_4za9WM8=-60OVKraOw8KLCgb+CcJfiKixuFsMxUWQ2nzQ@mail.gmail.com> (raw)
In-Reply-To: <926d7e26-4f13-4e70-a392-1111de27f700@lucifer.local>

On Thu, Apr 2, 2026 at 8:20 PM Lorenzo Stoakes (Oracle) <ljs@kernel.org> wrote:
>
> On Thu, Apr 02, 2026 at 05:03:42AM +0800, Barry Song wrote:
> > On Wed, Apr 1, 2026 at 4:43 PM Lorenzo Stoakes (Oracle) <ljs@kernel.org> wrote:
> > >
> > > On Wed, Apr 01, 2026 at 07:30:41AM +0800, Barry Song wrote:
> > > > Hi Lorenzo,
> > > >
> > > > Thank you very much for bringing this up for discussion.
> > > >
> > > > On Tue, Mar 31, 2026 at 5:23 AM Lorenzo Stoakes (Oracle) <ljs@kernel.org> wrote:
> > > > >
> > > > > [sorry subject line was typo'd, resending with correct subject line for
> > > > > visibility. Original at
> > > > > https://lore.kernel.org/linux-mm/8aa41d47-ee41-4af1-a334-587a34fe865d@lucifer.local/]
> > > > >
> > > > > Currently we track the reverse mapping between folios and VMAs at a VMA level,
> > > > > utilising a complicated and confusing combination of anon_vma objects and
> > > > > anon_vma_chain's linking them, which must be updated when VMAs are split,
> > > > > merged, remapped or forked.
> > > > >
> > > > > It's further complicated by various optimisations intended to avoid scalability
> > > > > issues in locking and memory allocation.
> > > > >
> > > > > I have done recent work to improve the situation [0] which has also lead to a
> > > > > reported improvement in lock scalability [1], but fundamentally the situation
> > > > > remains the same.
> > > > >
> > > > > The logic is actually, when you think hard enough about it, is a fairly
> > > > > reasonable means of implementing the reverse mapping at a VMA level.
> > > > >
> > > > > It is, however, a very broken abstraction as it stands. In order to work with
> > > > > the logic, you have to essentially keep a broad understanding of the entire
> > > > > implementation in your head at one time - that is, not much is really
> > > > > abstracted.
> > > > >
> > > > > This results in confusion, mistakes, and bit rot. It's also very time-consuming
> > > > > to work with - personally I've gone to the lengths of writing a private set of
> > > > > slides for myself on the topic as a reminder each time I come back to it.
> > > > >
> > > > > There are also issues with lock scalability - the use of interval trees to
> > > > > maintain a connection between an anon_vma and AVCs connected to VMAs requires
> > > > > that a lock must be held across the entire 'CoW hierarchy' of parent and child
> > > > > VMAs whenever performing an rmap walk or performing a merge, split, remap or
> > > > > fork.
> > > > >
> > > > > This is because we tear down all interval tree mappings and reestablish them
> > > > > each time we might see changes in VMA geometry. This is an issue Barry Song
> > > > > identified as problematic in a real world use case [2].
> > > > >
> > > > > So what do we do to improve the situation?
> > > > >
> > > > > Recently I have been working on an experimental new approach to the anonymous
> > > > > reverse mapping, in which we instead track anonymous remaps, and then use the
> > > > > VMA's virtual page offset to locate VMAs from the folio.
> > > >
> > > > Please forgive my confusion. I’m still struggling to fully
> > > > understand your approach of “tracking anonymous remaps.”
> > > > Could you provide a concrete example to illustrate how it works?
> > >
> > > I should really put this code somewhere :)
> > >
> > > >
> > > > For example, if A forks B, and then B forks C, how do we
> > > > determine the VMAs for a folio from the original A that has
> > > > not yet been COWed in B or C?
> > >
> > > The folio references the cow_context associated with the mm in A.
> > >
> > > So mm has a new cow_context field that points to cow_context, and the
> > > cow_context can outlive the mm if it has children.
> >
> > So we can’t use list_for_each_entry_rcu(child, &parent->children, sibling)
> > because in vfork() and exec() cases the mm_struct is not inherited?
>
> Umm, memory is not preserved across an exec() :) so it works fine with that.
>
> vfork() is CLONE_VM so the mm is shared and everything works fine.

My question is whether we can reuse the process tree, similar to
walk_tg_tree_from(). With some flags in mm_struct, it might be
possible to distinguish whether an mm_struct was copied from the
parent or created by a new exec.

>
> >
> > >
> > > Each cow context tracks its forked children also, so an rmap search will
> > > traverse A, B, C.
> >
> > I still don’t understand how we can get a folio’s VMA from the folio itself.
> > For anonymous VMAs, vma->vm_pgoff is always zero, right?
>
> No, not at all.
>
> vma->vm_pgoff is equal to vma->vm_start >> PAGE_SHIFT when first faulted in for
> anon.
>
> That reduces the problem to tracking remaps, which I do.
>
> >
> > Are you changing vm_pgoff to a value equal to vm_start >> PAGE_SHIFT?
>
> No, that's how anon works already.

Sorry for my mistake. I was somehow reading incorrect information from
/proc/<pid>/maps, where anonymous VMAs always appeared as zero.
A simple patch like the one below proves that you are absolutely right:

diff --git a/fs/proc/task_mmu.c b/fs/proc/task_mmu.c
index 33e5094a7842..0cecff1c6307 100644
--- a/fs/proc/task_mmu.c
+++ b/fs/proc/task_mmu.c
@@ -475,9 +475,9 @@ show_map_vma(struct seq_file *m, struct vm_area_struct *vma)

                dev = inode->i_sb->s_dev;
                ino = inode->i_ino;
-               pgoff = ((loff_t)vma->vm_pgoff) << PAGE_SHIFT;
+               //pgoff = ((loff_t)vma->vm_pgoff) << PAGE_SHIFT;
        }
-
+       pgoff = ((loff_t)vma->vm_pgoff) << PAGE_SHIFT;
        start = vma->vm_start;
        end = vma->vm_end;
        show_vma_header_prefix(m, start, end, flags, pgoff, dev, ino);

>
> >
> > In case A forks B, and B unmaps a VMA then maps a new
> > VMA at the same address as before, what happens? Will the
> > traversal find the new VMA, which doesn’t actually map the folio?
>
> Well you're missing stuff there, the folio would have to be non-anon exclusive
> (which is rare). Yes it'd find the new VMA, then traverse, and find the folio
> does not match, and traverse children.
>
> rmap walks _always_ allow for you walking VMAs that a folio does not belong
> to.

I understand that we can check whether the folio belongs to the new
VMA, but I’m curious whether this will occur more frequently in practice
after the change. In the rmap case, I assume the original A’s folio
anon_vma would be detached from process B once B unmaps and then maps
a new VMA, so we wouldn’t search B anymore—is that correct?

>
> For instance, with anon_vma, if you CoW a bunch of folios to child process VMAs,
> the non-CoW'd folio will _still_ traverse all of that uselessly.
>
> In any case, this isn't a common case.
>
> However note that if a folio _becomes_ anon exclusive, it switches its 'root'
> cow context to the one associated with the mm which it became exclusive to.
>

Agreed. I’m curious about the case of A’s folio, whose VMA has been
completely replaced in B after the unmap and map. In the old anon_vma
case, we wouldn’t search B anymore, but now we’ll need to check B's
vm_pgoff since it covers the folio’s address—is that correct?

> >
> > >
> > > >
> > > > Additionally, if B COWs and obtains a new folio before forking
> > > > C, how do we determine its VMAs in B and C?
> > >
> > > The new folio would point to B's cow context, and it'd traverse B and C to find
> > > relevant folios.
> > >
> > > Overall we pay a higher search price (though arguably, not too bad still) but
> > > get to do it _all_ under RCU.
> >
> > Yep. I see that list_for_each_entry_rcu(child, &parent->children, sibling)
> > can work safely under RCU.
> >
> > >
> > > In exchange, we avoid the locking issues and use ~30x less memory.
> > >
> > > (Of course I am yet to solve rmap lock stabilisation so got to try and do that
> > > first :)
> > >
> > > >
> > > > Also, what happens if C performs a remap on the inherited VMA
> > > > in the two cases described above?
> > >
> > > Remaps are tracked within cow_context's via an extended maple tree (currently
> > > maple tree -> dynamic arrays) that also handles multiple entries and overlaps.
> >
> > If we have multiple remaps for multiple VMAs within one mm_struct,
> > will we end up traversing all the dynamic arrays for any folio that
> > might be located in a VMA that has been remapped?
>
> Yup. But there aren't all that many, and it's all under RCU so :)
>
> That part of the search should be quick, parts of the search involving page
> tables, less so.
>
> Also I need to figure out how to maintain stabilisation without an rmap lock, an
> ongoing open problem in all this.
>
> In the end, as the original mail said, I may conclude _this_ approach is
> unworkable and come up with an alternative that's more conventional.

I’m genuinely interested in the new approach. If you have the code, I’d be
happy to read, test, and work on it.

>
> BUT. Doing it this way saves 30x the amount of kernel allocated memory. I tried
> a heavy load case and it was very substantial. That's not to be sniffed at.
>
> In any case, all of this is going to be _very_ driven by metrics. How slow is
> it, how much overhead does it actually produce, is it workable, are the
> trade-offs right, etc.
>
> It's an exploration rather than a fait accompli.

Right now, I’m still at the stage of trying to understand the details of
your new approach and would like to learn more—so I might have quite a
few naive questions :-)

Thanks
Barry

     prev parent reply	other threads:[~2026-04-02 21:49 UTC|newest]

Thread overview: 6+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2026-03-30 21:23 Lorenzo Stoakes (Oracle)
2026-03-31 23:30 ` Barry Song
2026-04-01  8:43   ` Lorenzo Stoakes (Oracle)
2026-04-01 21:03     ` Barry Song
2026-04-02 12:20       ` Lorenzo Stoakes (Oracle)
2026-04-02 21:49         ` Barry Song [this message]

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to='CAGsJ_4za9WM8=-60OVKraOw8KLCgb+CcJfiKixuFsMxUWQ2nzQ@mail.gmail.com' \
    --to=baohua@kernel.org \
    --cc=Liam.Howlett@oracle.com \
    --cc=chriscli@google.com \
    --cc=david@kernel.org \
    --cc=harry.yoo@oracle.com \
    --cc=jannh@google.com \
    --cc=linux-mm@kvack.org \
    --cc=ljs@kernel.org \
    --cc=lsf-pc@lists.linux-foundation.org \
    --cc=pfalcato@suse.de \
    --cc=riel@surriel.com \
    --cc=ryan.roberts@arm.com \
    --cc=surenb@google.com \
    --cc=vbabka@kernel.org \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link

Be sure your reply has a Subject: header at the top and a blank line before the message body.

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox