* [LSF/MM/BPF TOPIC] The Future of the Anonymous Reverse Mapping [RESEND]
@ 2026-03-30 21:23 Lorenzo Stoakes (Oracle)
  2026-03-31 23:30 ` Barry Song
  0 siblings, 1 reply; 6+ messages in thread
From: Lorenzo Stoakes (Oracle) @ 2026-03-30 21:23 UTC (permalink / raw)
To: lsf-pc
Cc: linux-mm, David Hildenbrand, Liam R. Howlett, Vlastimil Babka,
    Suren Baghdasaryan, Pedro Falcato, Ryan Roberts, Harry Yoo,
    Rik van Riel, Jann Horn, Chris Li, Barry Song

[sorry, subject line was typo'd; resending with correct subject line for
visibility. Original at
https://lore.kernel.org/linux-mm/8aa41d47-ee41-4af1-a334-587a34fe865d@lucifer.local/]

Currently we track the reverse mapping between folios and VMAs at a VMA
level, utilising a complicated and confusing combination of anon_vma
objects and the anon_vma_chains linking them, which must be updated when
VMAs are split, merged, remapped or forked.

It's further complicated by various optimisations intended to avoid
scalability issues in locking and memory allocation.

I have done recent work to improve the situation [0] which has also led
to a reported improvement in lock scalability [1], but fundamentally the
situation remains the same.

The logic is actually, when you think hard enough about it, a fairly
reasonable means of implementing the reverse mapping at a VMA level.

It is, however, a very broken abstraction as it stands. In order to work
with the logic, you have to essentially keep a broad understanding of
the entire implementation in your head at one time - that is, not much
is really abstracted.

This results in confusion, mistakes, and bit rot. It's also very
time-consuming to work with - personally I've gone to the lengths of
writing a private set of slides for myself on the topic as a reminder
each time I come back to it.
There are also issues with lock scalability - the use of interval trees
to maintain a connection between an anon_vma and the AVCs connected to
VMAs requires that a lock be held across the entire 'CoW hierarchy' of
parent and child VMAs whenever performing an rmap walk or performing a
merge, split, remap or fork.

This is because we tear down all interval tree mappings and reestablish
them each time we might see changes in VMA geometry. This is an issue
Barry Song identified as problematic in a real world use case [2].

So what do we do to improve the situation?

Recently I have been working on an experimental new approach to the
anonymous reverse mapping, in which we instead track anonymous remaps,
and then use the VMA's virtual page offset to locate VMAs from the
folio.

I have got the implementation working to the point where it tracks the
exact same VMAs as the anon_vma implementation, and it seems a lot of it
can be done under RCU.

It avoids the need to maintain expensive mappings at a VMA level, though
it incurs a cost in tracking remaps, and MAP_PRIVATE files are very much
a TODO (they maintain a file vma->vm_pgoff, even when CoW'd, so the
remap tracking is pretty sub-optimal).

I am investigating whether I can change how MAP_PRIVATE file-backed
mappings work to avoid this issue, and will be developing tests to see
how lock scalability, throughput and memory usage compare to the
anon_vma approach under different workloads.

This experiment may or may not work out; either way it will be
interesting to discuss it.

By the time LSF/MM comes around I may even have already decided on a
different approach, but that's what makes things interesting :)

[0]: https://lore.kernel.org/all/cover.1767711638.git.lorenzo.stoakes@oracle.com/
[1]: https://lore.kernel.org/all/202602061747.855f053f-lkp@intel.com/
[2]: https://lore.kernel.org/linux-mm/CAGsJ_4x=YsQR=nNcHA-q=0vg0b7ok=81C_qQqKmoJ+BZ+HVduQ@mail.gmail.com/

Cheers, Lorenzo

^ permalink raw reply	[flat|nested] 6+ messages in thread
* Re: [LSF/MM/BPF TOPIC] The Future of the Anonymous Reverse Mapping [RESEND] 2026-03-30 21:23 [LSF/MM/BPF TOPIC] The Future of the Anonymous Reverse Mapping [RESEND] Lorenzo Stoakes (Oracle) @ 2026-03-31 23:30 ` Barry Song 2026-04-01 8:43 ` Lorenzo Stoakes (Oracle) 0 siblings, 1 reply; 6+ messages in thread From: Barry Song @ 2026-03-31 23:30 UTC (permalink / raw) To: Lorenzo Stoakes (Oracle) Cc: lsf-pc, linux-mm, David Hildenbrand, Liam R. Howlett, Vlastimil Babka, Suren Baghdasaryan, Pedro Falcato, Ryan Roberts, Harry Yoo, Rik van Riel, Jann Horn, Chris Li Hi Lorenzo, Thank you very much for bringing this up for discussion. On Tue, Mar 31, 2026 at 5:23 AM Lorenzo Stoakes (Oracle) <ljs@kernel.org> wrote: > > [sorry subject line was typo'd, resending with correct subject line for > visibility. Original at > https://lore.kernel.org/linux-mm/8aa41d47-ee41-4af1-a334-587a34fe865d@lucifer.local/] > > Currently we track the reverse mapping between folios and VMAs at a VMA level, > utilising a complicated and confusing combination of anon_vma objects and > anon_vma_chain's linking them, which must be updated when VMAs are split, > merged, remapped or forked. > > It's further complicated by various optimisations intended to avoid scalability > issues in locking and memory allocation. > > I have done recent work to improve the situation [0] which has also lead to a > reported improvement in lock scalability [1], but fundamentally the situation > remains the same. > > The logic is actually, when you think hard enough about it, is a fairly > reasonable means of implementing the reverse mapping at a VMA level. > > It is, however, a very broken abstraction as it stands. In order to work with > the logic, you have to essentially keep a broad understanding of the entire > implementation in your head at one time - that is, not much is really > abstracted. > > This results in confusion, mistakes, and bit rot. 
It's also very time-consuming > to work with - personally I've gone to the lengths of writing a private set of > slides for myself on the topic as a reminder each time I come back to it. > > There are also issues with lock scalability - the use of interval trees to > maintain a connection between an anon_vma and AVCs connected to VMAs requires > that a lock must be held across the entire 'CoW hierarchy' of parent and child > VMAs whenever performing an rmap walk or performing a merge, split, remap or > fork. > > This is because we tear down all interval tree mappings and reestablish them > each time we might see changes in VMA geometry. This is an issue Barry Song > identified as problematic in a real world use case [2]. > > So what do we do to improve the situation? > > Recently I have been working on an experimental new approach to the anonymous > reverse mapping, in which we instead track anonymous remaps, and then use the > VMA's virtual page offset to locate VMAs from the folio. Please forgive my confusion. I’m still struggling to fully understand your approach of “tracking anonymous remaps.” Could you provide a concrete example to illustrate how it works? For example, if A forks B, and then B forks C, how do we determine the VMAs for a folio from the original A that has not yet been COWed in B or C? Additionally, if B COWs and obtains a new folio before forking C, how do we determine its VMAs in B and C? Also, what happens if C performs a remap on the inherited VMA in the two cases described above? > > I have got the implementation working to the point where it tracks the exact > same VMAs as the anon_vma implementation, and it seems a lot of it can be done > under RCU. > > It avoids the need to maintain expensive mappings at a VMA level, though it > incurs a cost in tracking remaps, and MAP_PRIVATE files are very much a TODO > (they maintain a file vma->vm_pgoff, even when CoW'd, so the remap tracking is > pretty sub-optimal). 
> > I am investigating whether I can change how MAP_PRIVATE file-backed mappings > work to avoid this issue, and will be developing tests to see how lock > scalability, throughput and memory usage compare to the anon_vma approach under > different workloads. > > This experiment may or may not work out, either way it will be interesting to > discuss it. > > By the time LSF/MM comes around I may even have already decided on a different > approach but that's what makes things interesting :) > > [0]:https://lore.kernel.org/all/cover.1767711638.git.lorenzo.stoakes@oracle.com/ > [1]:https://lore.kernel.org/all/202602061747.855f053f-lkp@intel.com/ > [2]:https://lore.kernel.org/linux-mm/CAGsJ_4x=YsQR=nNcHA-q=0vg0b7ok=81C_qQqKmoJ+BZ+HVduQ@mail.gmail.com/ > > Cheers, Lorenzo Thanks Barry ^ permalink raw reply [flat|nested] 6+ messages in thread
* Re: [LSF/MM/BPF TOPIC] The Future of the Anonymous Reverse Mapping [RESEND] 2026-03-31 23:30 ` Barry Song @ 2026-04-01 8:43 ` Lorenzo Stoakes (Oracle) 2026-04-01 21:03 ` Barry Song 0 siblings, 1 reply; 6+ messages in thread From: Lorenzo Stoakes (Oracle) @ 2026-04-01 8:43 UTC (permalink / raw) To: Barry Song Cc: lsf-pc, linux-mm, David Hildenbrand, Liam R. Howlett, Vlastimil Babka, Suren Baghdasaryan, Pedro Falcato, Ryan Roberts, Harry Yoo, Rik van Riel, Jann Horn, Chris Li On Wed, Apr 01, 2026 at 07:30:41AM +0800, Barry Song wrote: > Hi Lorenzo, > > Thank you very much for bringing this up for discussion. > > On Tue, Mar 31, 2026 at 5:23 AM Lorenzo Stoakes (Oracle) <ljs@kernel.org> wrote: > > > > [sorry subject line was typo'd, resending with correct subject line for > > visibility. Original at > > https://lore.kernel.org/linux-mm/8aa41d47-ee41-4af1-a334-587a34fe865d@lucifer.local/] > > > > Currently we track the reverse mapping between folios and VMAs at a VMA level, > > utilising a complicated and confusing combination of anon_vma objects and > > anon_vma_chain's linking them, which must be updated when VMAs are split, > > merged, remapped or forked. > > > > It's further complicated by various optimisations intended to avoid scalability > > issues in locking and memory allocation. > > > > I have done recent work to improve the situation [0] which has also lead to a > > reported improvement in lock scalability [1], but fundamentally the situation > > remains the same. > > > > The logic is actually, when you think hard enough about it, is a fairly > > reasonable means of implementing the reverse mapping at a VMA level. > > > > It is, however, a very broken abstraction as it stands. In order to work with > > the logic, you have to essentially keep a broad understanding of the entire > > implementation in your head at one time - that is, not much is really > > abstracted. > > > > This results in confusion, mistakes, and bit rot. 
It's also very time-consuming > > to work with - personally I've gone to the lengths of writing a private set of > > slides for myself on the topic as a reminder each time I come back to it. > > > > There are also issues with lock scalability - the use of interval trees to > > maintain a connection between an anon_vma and AVCs connected to VMAs requires > > that a lock must be held across the entire 'CoW hierarchy' of parent and child > > VMAs whenever performing an rmap walk or performing a merge, split, remap or > > fork. > > > > This is because we tear down all interval tree mappings and reestablish them > > each time we might see changes in VMA geometry. This is an issue Barry Song > > identified as problematic in a real world use case [2]. > > > > So what do we do to improve the situation? > > > > Recently I have been working on an experimental new approach to the anonymous > > reverse mapping, in which we instead track anonymous remaps, and then use the > > VMA's virtual page offset to locate VMAs from the folio. > > Please forgive my confusion. I’m still struggling to fully > understand your approach of “tracking anonymous remaps.” > Could you provide a concrete example to illustrate how it works? I should really put this code somewhere :) > > For example, if A forks B, and then B forks C, how do we > determine the VMAs for a folio from the original A that has > not yet been COWed in B or C? The folio references the cow_context associated with the mm in A. So mm has a new cow_context field that points to cow_context, and the cow_context can outlive the mm if it has children. Each cow context tracks its forked children also, so an rmap search will traverse A, B, C. > > Additionally, if B COWs and obtains a new folio before forking > C, how do we determine its VMAs in B and C? The new folio would point to B's cow context, and it'd traverse B and C to find relevant folios. 
Overall we pay a higher search price (though arguably, not too bad still) but get to do it _all_ under RCU. In exchange, we avoid the locking issues and use ~30x less memory. (Of course I am yet to solve rmap lock stabilisation so got to try and do that first :) > > Also, what happens if C performs a remap on the inherited VMA > in the two cases described above? Remaps are tracked within cow_context's via an extended maple tree (currently maple tree -> dynamic arrays) that also handles multiple entries and overlaps. > > > > > I have got the implementation working to the point where it tracks the exact > > same VMAs as the anon_vma implementation, and it seems a lot of it can be done > > under RCU. > > > > It avoids the need to maintain expensive mappings at a VMA level, though it > > incurs a cost in tracking remaps, and MAP_PRIVATE files are very much a TODO > > (they maintain a file vma->vm_pgoff, even when CoW'd, so the remap tracking is > > pretty sub-optimal). > > > > I am investigating whether I can change how MAP_PRIVATE file-backed mappings > > work to avoid this issue, and will be developing tests to see how lock > > scalability, throughput and memory usage compare to the anon_vma approach under > > different workloads. > > > > This experiment may or may not work out, either way it will be interesting to > > discuss it. > > > > By the time LSF/MM comes around I may even have already decided on a different > > approach but that's what makes things interesting :) > > > > [0]:https://lore.kernel.org/all/cover.1767711638.git.lorenzo.stoakes@oracle.com/ > > [1]:https://lore.kernel.org/all/202602061747.855f053f-lkp@intel.com/ > > [2]:https://lore.kernel.org/linux-mm/CAGsJ_4x=YsQR=nNcHA-q=0vg0b7ok=81C_qQqKmoJ+BZ+HVduQ@mail.gmail.com/ > > > > Cheers, Lorenzo > > Thanks > Barry Cheers, Lorenzo ^ permalink raw reply [flat|nested] 6+ messages in thread
* Re: [LSF/MM/BPF TOPIC] The Future of the Anonymous Reverse Mapping [RESEND] 2026-04-01 8:43 ` Lorenzo Stoakes (Oracle) @ 2026-04-01 21:03 ` Barry Song 2026-04-02 12:20 ` Lorenzo Stoakes (Oracle) 0 siblings, 1 reply; 6+ messages in thread From: Barry Song @ 2026-04-01 21:03 UTC (permalink / raw) To: Lorenzo Stoakes (Oracle) Cc: lsf-pc, linux-mm, David Hildenbrand, Liam R. Howlett, Vlastimil Babka, Suren Baghdasaryan, Pedro Falcato, Ryan Roberts, Harry Yoo, Rik van Riel, Jann Horn, Chris Li On Wed, Apr 1, 2026 at 4:43 PM Lorenzo Stoakes (Oracle) <ljs@kernel.org> wrote: > > On Wed, Apr 01, 2026 at 07:30:41AM +0800, Barry Song wrote: > > Hi Lorenzo, > > > > Thank you very much for bringing this up for discussion. > > > > On Tue, Mar 31, 2026 at 5:23 AM Lorenzo Stoakes (Oracle) <ljs@kernel.org> wrote: > > > > > > [sorry subject line was typo'd, resending with correct subject line for > > > visibility. Original at > > > https://lore.kernel.org/linux-mm/8aa41d47-ee41-4af1-a334-587a34fe865d@lucifer.local/] > > > > > > Currently we track the reverse mapping between folios and VMAs at a VMA level, > > > utilising a complicated and confusing combination of anon_vma objects and > > > anon_vma_chain's linking them, which must be updated when VMAs are split, > > > merged, remapped or forked. > > > > > > It's further complicated by various optimisations intended to avoid scalability > > > issues in locking and memory allocation. > > > > > > I have done recent work to improve the situation [0] which has also lead to a > > > reported improvement in lock scalability [1], but fundamentally the situation > > > remains the same. > > > > > > The logic is actually, when you think hard enough about it, is a fairly > > > reasonable means of implementing the reverse mapping at a VMA level. > > > > > > It is, however, a very broken abstraction as it stands. 
In order to work with > > > the logic, you have to essentially keep a broad understanding of the entire > > > implementation in your head at one time - that is, not much is really > > > abstracted. > > > > > > This results in confusion, mistakes, and bit rot. It's also very time-consuming > > > to work with - personally I've gone to the lengths of writing a private set of > > > slides for myself on the topic as a reminder each time I come back to it. > > > > > > There are also issues with lock scalability - the use of interval trees to > > > maintain a connection between an anon_vma and AVCs connected to VMAs requires > > > that a lock must be held across the entire 'CoW hierarchy' of parent and child > > > VMAs whenever performing an rmap walk or performing a merge, split, remap or > > > fork. > > > > > > This is because we tear down all interval tree mappings and reestablish them > > > each time we might see changes in VMA geometry. This is an issue Barry Song > > > identified as problematic in a real world use case [2]. > > > > > > So what do we do to improve the situation? > > > > > > Recently I have been working on an experimental new approach to the anonymous > > > reverse mapping, in which we instead track anonymous remaps, and then use the > > > VMA's virtual page offset to locate VMAs from the folio. > > > > Please forgive my confusion. I’m still struggling to fully > > understand your approach of “tracking anonymous remaps.” > > Could you provide a concrete example to illustrate how it works? > > I should really put this code somewhere :) > > > > > For example, if A forks B, and then B forks C, how do we > > determine the VMAs for a folio from the original A that has > > not yet been COWed in B or C? > > The folio references the cow_context associated with the mm in A. > > So mm has a new cow_context field that points to cow_context, and the > cow_context can outlive the mm if it has children. 
So we can’t use list_for_each_entry_rcu(child, &parent->children, sibling) because in vfork() and exec() cases the mm_struct is not inherited? > > Each cow context tracks its forked children also, so an rmap search will > traverse A, B, C. I still don’t understand how we can get a folio’s VMA from the folio itself. For anonymous VMAs, vma->vm_pgoff is always zero, right? Are you changing vm_pgoff to a value equal to vm_start >> PAGE_SHIFT? In case A forks B, and B unmaps a VMA then maps a new VMA at the same address as before, what happens? Will the traversal find the new VMA, which doesn’t actually map the folio? > > > > > Additionally, if B COWs and obtains a new folio before forking > > C, how do we determine its VMAs in B and C? > > The new folio would point to B's cow context, and it'd traverse B and C to find > relevant folios. > > Overall we pay a higher search price (though arguably, not too bad still) but > get to do it _all_ under RCU. Yep. I see that list_for_each_entry_rcu(child, &parent->children, sibling) can work safely under RCU. > > In exchange, we avoid the locking issues and use ~30x less memory. > > (Of course I am yet to solve rmap lock stabilisation so got to try and do that > first :) > > > > > Also, what happens if C performs a remap on the inherited VMA > > in the two cases described above? > > Remaps are tracked within cow_context's via an extended maple tree (currently > maple tree -> dynamic arrays) that also handles multiple entries and overlaps. If we have multiple remaps for multiple VMAs within one mm_struct, will we end up traversing all the dynamic arrays for any folio that might be located in a VMA that has been remapped? > > > > > > > > > I have got the implementation working to the point where it tracks the exact > > > same VMAs as the anon_vma implementation, and it seems a lot of it can be done > > > under RCU. 
> > > > > > It avoids the need to maintain expensive mappings at a VMA level, though it > > > incurs a cost in tracking remaps, and MAP_PRIVATE files are very much a TODO > > > (they maintain a file vma->vm_pgoff, even when CoW'd, so the remap tracking is > > > pretty sub-optimal). > > > > > > I am investigating whether I can change how MAP_PRIVATE file-backed mappings > > > work to avoid this issue, and will be developing tests to see how lock > > > scalability, throughput and memory usage compare to the anon_vma approach under > > > different workloads. > > > > > > This experiment may or may not work out, either way it will be interesting to > > > discuss it. > > > > > > By the time LSF/MM comes around I may even have already decided on a different > > > approach but that's what makes things interesting :) > > > > > > [0]:https://lore.kernel.org/all/cover.1767711638.git.lorenzo.stoakes@oracle.com/ > > > [1]:https://lore.kernel.org/all/202602061747.855f053f-lkp@intel.com/ > > > [2]:https://lore.kernel.org/linux-mm/CAGsJ_4x=YsQR=nNcHA-q=0vg0b7ok=81C_qQqKmoJ+BZ+HVduQ@mail.gmail.com/ > > > Thanks Barry ^ permalink raw reply [flat|nested] 6+ messages in thread
* Re: [LSF/MM/BPF TOPIC] The Future of the Anonymous Reverse Mapping [RESEND] 2026-04-01 21:03 ` Barry Song @ 2026-04-02 12:20 ` Lorenzo Stoakes (Oracle) 2026-04-02 21:49 ` Barry Song 0 siblings, 1 reply; 6+ messages in thread From: Lorenzo Stoakes (Oracle) @ 2026-04-02 12:20 UTC (permalink / raw) To: Barry Song Cc: lsf-pc, linux-mm, David Hildenbrand, Liam R. Howlett, Vlastimil Babka, Suren Baghdasaryan, Pedro Falcato, Ryan Roberts, Harry Yoo, Rik van Riel, Jann Horn, Chris Li On Thu, Apr 02, 2026 at 05:03:42AM +0800, Barry Song wrote: > On Wed, Apr 1, 2026 at 4:43 PM Lorenzo Stoakes (Oracle) <ljs@kernel.org> wrote: > > > > On Wed, Apr 01, 2026 at 07:30:41AM +0800, Barry Song wrote: > > > Hi Lorenzo, > > > > > > Thank you very much for bringing this up for discussion. > > > > > > On Tue, Mar 31, 2026 at 5:23 AM Lorenzo Stoakes (Oracle) <ljs@kernel.org> wrote: > > > > > > > > [sorry subject line was typo'd, resending with correct subject line for > > > > visibility. Original at > > > > https://lore.kernel.org/linux-mm/8aa41d47-ee41-4af1-a334-587a34fe865d@lucifer.local/] > > > > > > > > Currently we track the reverse mapping between folios and VMAs at a VMA level, > > > > utilising a complicated and confusing combination of anon_vma objects and > > > > anon_vma_chain's linking them, which must be updated when VMAs are split, > > > > merged, remapped or forked. > > > > > > > > It's further complicated by various optimisations intended to avoid scalability > > > > issues in locking and memory allocation. > > > > > > > > I have done recent work to improve the situation [0] which has also lead to a > > > > reported improvement in lock scalability [1], but fundamentally the situation > > > > remains the same. > > > > > > > > The logic is actually, when you think hard enough about it, is a fairly > > > > reasonable means of implementing the reverse mapping at a VMA level. > > > > > > > > It is, however, a very broken abstraction as it stands. 
In order to work with > > > > the logic, you have to essentially keep a broad understanding of the entire > > > > implementation in your head at one time - that is, not much is really > > > > abstracted. > > > > > > > > This results in confusion, mistakes, and bit rot. It's also very time-consuming > > > > to work with - personally I've gone to the lengths of writing a private set of > > > > slides for myself on the topic as a reminder each time I come back to it. > > > > > > > > There are also issues with lock scalability - the use of interval trees to > > > > maintain a connection between an anon_vma and AVCs connected to VMAs requires > > > > that a lock must be held across the entire 'CoW hierarchy' of parent and child > > > > VMAs whenever performing an rmap walk or performing a merge, split, remap or > > > > fork. > > > > > > > > This is because we tear down all interval tree mappings and reestablish them > > > > each time we might see changes in VMA geometry. This is an issue Barry Song > > > > identified as problematic in a real world use case [2]. > > > > > > > > So what do we do to improve the situation? > > > > > > > > Recently I have been working on an experimental new approach to the anonymous > > > > reverse mapping, in which we instead track anonymous remaps, and then use the > > > > VMA's virtual page offset to locate VMAs from the folio. > > > > > > Please forgive my confusion. I’m still struggling to fully > > > understand your approach of “tracking anonymous remaps.” > > > Could you provide a concrete example to illustrate how it works? > > > > I should really put this code somewhere :) > > > > > > > > For example, if A forks B, and then B forks C, how do we > > > determine the VMAs for a folio from the original A that has > > > not yet been COWed in B or C? > > > > The folio references the cow_context associated with the mm in A. 
> > > > So mm has a new cow_context field that points to cow_context, and the > > cow_context can outlive the mm if it has children. > > So we can’t use list_for_each_entry_rcu(child, &parent->children, sibling) > because in vfork() and exec() cases the mm_struct is not inherited? Umm, memory is not preserved across an exec() :) so it works fine with that. vfork() is CLONE_VM so the mm is shared and everything works fine. > > > > > Each cow context tracks its forked children also, so an rmap search will > > traverse A, B, C. > > I still don’t understand how we can get a folio’s VMA from the folio itself. > For anonymous VMAs, vma->vm_pgoff is always zero, right? No, not at all. vma->vm_pgoff is equal to vma->vm_start >> PAGE_SHIFT when first faulted in for anon. That reduces the problem to tracking remaps, which I do. > > Are you changing vm_pgoff to a value equal to vm_start >> PAGE_SHIFT? No, that's how anon works already. > > In case A forks B, and B unmaps a VMA then maps a new > VMA at the same address as before, what happens? Will the > traversal find the new VMA, which doesn’t actually map the folio? Well you're missing stuff there, the folio would have to be non-anon exclusive (which is rare). Yes it'd find the new VMA, then traverse, and find the folio does not match, and traverse children. rmap walks _always_ allow for you walking VMAs that a folio does not belong to. For instance, with anon_vma, if you CoW a bunch of folios to child process VMAs, the non-CoW'd folio will _still_ traverse all of that uselessly. In any case, this isn't a common case. However note that if a folio _becomes_ anon exclusive, it switches its 'root' cow context to the one associated with the mm which it became exclusive to. > > > > > > > > > Additionally, if B COWs and obtains a new folio before forking > > > C, how do we determine its VMAs in B and C? > > > > The new folio would point to B's cow context, and it'd traverse B and C to find > > relevant folios. 
> > > > Overall we pay a higher search price (though arguably, not too bad still) but > > get to do it _all_ under RCU. > > Yep. I see that list_for_each_entry_rcu(child, &parent->children, sibling) > can work safely under RCU. > > > > > In exchange, we avoid the locking issues and use ~30x less memory. > > > > (Of course I am yet to solve rmap lock stabilisation so got to try and do that > > first :) > > > > > > > > Also, what happens if C performs a remap on the inherited VMA > > > in the two cases described above? > > > > Remaps are tracked within cow_context's via an extended maple tree (currently > > maple tree -> dynamic arrays) that also handles multiple entries and overlaps. > > If we have multiple remaps for multiple VMAs within one mm_struct, > will we end up traversing all the dynamic arrays for any folio that > might be located in a VMA that has been remapped? Yup. But there aren't all that many, and it's all under RCU so :) That part of the search should be quick, parts of the search involving page tables, less so. Also I need to figure out how to maintain stabilisation without an rmap lock, an ongoing open problem in all this. In the end, as the original mail said, I may conclude _this_ approach is unworkable and come up with an alternative that's more conventional. BUT. Doing it this way saves 30x the amount of kernel allocated memory. I tried a heavy load case and it was very substantial. That's not to be sniffed at. In any case, all of this is going to be _very_ driven by metrics. How slow is it, how much overhead does it actually produce, is it workable, are the trade-offs right, etc. It's an exploration rather than a fait accompli. Cheers, Lorenzo ^ permalink raw reply [flat|nested] 6+ messages in thread
* Re: [LSF/MM/BPF TOPIC] The Future of the Anonymous Reverse Mapping [RESEND] 2026-04-02 12:20 ` Lorenzo Stoakes (Oracle) @ 2026-04-02 21:49 ` Barry Song 0 siblings, 0 replies; 6+ messages in thread From: Barry Song @ 2026-04-02 21:49 UTC (permalink / raw) To: Lorenzo Stoakes (Oracle) Cc: lsf-pc, linux-mm, David Hildenbrand, Liam R. Howlett, Vlastimil Babka, Suren Baghdasaryan, Pedro Falcato, Ryan Roberts, Harry Yoo, Rik van Riel, Jann Horn, Chris Li On Thu, Apr 2, 2026 at 8:20 PM Lorenzo Stoakes (Oracle) <ljs@kernel.org> wrote: > > On Thu, Apr 02, 2026 at 05:03:42AM +0800, Barry Song wrote: > > On Wed, Apr 1, 2026 at 4:43 PM Lorenzo Stoakes (Oracle) <ljs@kernel.org> wrote: > > > > > > On Wed, Apr 01, 2026 at 07:30:41AM +0800, Barry Song wrote: > > > > Hi Lorenzo, > > > > > > > > Thank you very much for bringing this up for discussion. > > > > > > > > On Tue, Mar 31, 2026 at 5:23 AM Lorenzo Stoakes (Oracle) <ljs@kernel.org> wrote: > > > > > > > > > > [sorry subject line was typo'd, resending with correct subject line for > > > > > visibility. Original at > > > > > https://lore.kernel.org/linux-mm/8aa41d47-ee41-4af1-a334-587a34fe865d@lucifer.local/] > > > > > > > > > > Currently we track the reverse mapping between folios and VMAs at a VMA level, > > > > > utilising a complicated and confusing combination of anon_vma objects and > > > > > anon_vma_chain's linking them, which must be updated when VMAs are split, > > > > > merged, remapped or forked. > > > > > > > > > > It's further complicated by various optimisations intended to avoid scalability > > > > > issues in locking and memory allocation. > > > > > > > > > > I have done recent work to improve the situation [0] which has also lead to a > > > > > reported improvement in lock scalability [1], but fundamentally the situation > > > > > remains the same. 
> > > > > > > > > > The logic is actually, when you think hard enough about it, is a fairly > > > > > reasonable means of implementing the reverse mapping at a VMA level. > > > > > > > > > > It is, however, a very broken abstraction as it stands. In order to work with > > > > > the logic, you have to essentially keep a broad understanding of the entire > > > > > implementation in your head at one time - that is, not much is really > > > > > abstracted. > > > > > > > > > > This results in confusion, mistakes, and bit rot. It's also very time-consuming > > > > > to work with - personally I've gone to the lengths of writing a private set of > > > > > slides for myself on the topic as a reminder each time I come back to it. > > > > > > > > > > There are also issues with lock scalability - the use of interval trees to > > > > > maintain a connection between an anon_vma and AVCs connected to VMAs requires > > > > > that a lock must be held across the entire 'CoW hierarchy' of parent and child > > > > > VMAs whenever performing an rmap walk or performing a merge, split, remap or > > > > > fork. > > > > > > > > > > This is because we tear down all interval tree mappings and reestablish them > > > > > each time we might see changes in VMA geometry. This is an issue Barry Song > > > > > identified as problematic in a real world use case [2]. > > > > > > > > > > So what do we do to improve the situation? > > > > > > > > > > Recently I have been working on an experimental new approach to the anonymous > > > > > reverse mapping, in which we instead track anonymous remaps, and then use the > > > > > VMA's virtual page offset to locate VMAs from the folio. > > > > > > > > Please forgive my confusion. I’m still struggling to fully > > > > understand your approach of “tracking anonymous remaps.” > > > > Could you provide a concrete example to illustrate how it works? 
> > >
> > > I should really put this code somewhere :)
> > >
> > > > For example, if A forks B, and then B forks C, how do we
> > > > determine the VMAs for a folio from the original A that has
> > > > not yet been COWed in B or C?
> > >
> > > The folio references the cow_context associated with the mm in A.
> > >
> > > So mm has a new cow_context field that points to a cow_context, and the
> > > cow_context can outlive the mm if it has children.
> >
> > So we can't use list_for_each_entry_rcu(child, &parent->children, sibling)
> > because in vfork() and exec() cases the mm_struct is not inherited?
>
> Umm, memory is not preserved across an exec() :) so it works fine with that.
>
> vfork() is CLONE_VM so the mm is shared and everything works fine.

My question is whether we can reuse the process tree, similar to
walk_tg_tree_from(). With some flags in mm_struct, it might be possible
to distinguish whether an mm_struct was copied from the parent or
created by a new exec.

> > > Each cow context tracks its forked children also, so an rmap search will
> > > traverse A, B, C.
> >
> > I still don't understand how we can get a folio's VMA from the folio itself.
> > For anonymous VMAs, vma->vm_pgoff is always zero, right?
>
> No, not at all.
>
> vma->vm_pgoff is equal to vma->vm_start >> PAGE_SHIFT when first faulted in
> for anon.
>
> That reduces the problem to tracking remaps, which I do.
>
> > Are you changing vm_pgoff to a value equal to vm_start >> PAGE_SHIFT?
>
> No, that's how anon works already.

Sorry for my mistake. I was somehow reading incorrect information from
/proc/<pid>/maps, where anonymous VMAs always appeared with a zero offset.
A simple patch like the one below proves that you are absolutely right:

diff --git a/fs/proc/task_mmu.c b/fs/proc/task_mmu.c
index 33e5094a7842..0cecff1c6307 100644
--- a/fs/proc/task_mmu.c
+++ b/fs/proc/task_mmu.c
@@ -475,9 +475,9 @@ show_map_vma(struct seq_file *m, struct vm_area_struct *vma)
 		dev = inode->i_sb->s_dev;
 		ino = inode->i_ino;
-		pgoff = ((loff_t)vma->vm_pgoff) << PAGE_SHIFT;
+		//pgoff = ((loff_t)vma->vm_pgoff) << PAGE_SHIFT;
 	}
-
+	pgoff = ((loff_t)vma->vm_pgoff) << PAGE_SHIFT;
 	start = vma->vm_start;
 	end = vma->vm_end;
 	show_vma_header_prefix(m, start, end, flags, pgoff, dev, ino);

> >
> > In case A forks B, and B unmaps a VMA then maps a new
> > VMA at the same address as before, what happens? Will the
> > traversal find the new VMA, which doesn't actually map the folio?
>
> Well you're missing stuff there, the folio would have to be non-anon exclusive
> (which is rare). Yes it'd find the new VMA, then traverse, and find the folio
> does not match, and traverse children.
>
> rmap walks _always_ allow for you walking VMAs that a folio does not belong
> to.

I understand that we can check whether the folio belongs to the new VMA,
but I'm curious whether this will occur more frequently in practice after
the change. In the rmap case, I assume the original A's folio anon_vma
would be detached from process B once B unmaps and then maps a new VMA,
so we wouldn't search B anymore - is that correct?

> For instance, with anon_vma, if you CoW a bunch of folios to child process
> VMAs, the non-CoW'd folio will _still_ traverse all of that uselessly.
>
> In any case, this isn't a common case.
>
> However note that if a folio _becomes_ anon exclusive, it switches its 'root'
> cow context to the one associated with the mm which it became exclusive to.

Agreed. I'm curious about the case of A's folio, whose VMA has been
completely replaced in B after the unmap and map.
In the old anon_vma case, we wouldn't search B anymore, but now we'll need
to check B's vm_pgoff since it covers the folio's address - is that correct?

> > > >
> > > > Additionally, if B COWs and obtains a new folio before forking
> > > > C, how do we determine its VMAs in B and C?
> > >
> > > The new folio would point to B's cow context, and it'd traverse B and C
> > > to find relevant folios.
> > >
> > > Overall we pay a higher search price (though arguably, not too bad still)
> > > but get to do it _all_ under RCU.
> >
> > Yep. I see that list_for_each_entry_rcu(child, &parent->children, sibling)
> > can work safely under RCU.
> >
> > > In exchange, we avoid the locking issues and use ~30x less memory.
> > >
> > > (Of course I am yet to solve rmap lock stabilisation so got to try and do
> > > that first :)
> > >
> > > > Also, what happens if C performs a remap on the inherited VMA
> > > > in the two cases described above?
> > >
> > > Remaps are tracked within cow_contexts via an extended maple tree
> > > (currently maple tree -> dynamic arrays) that also handles multiple
> > > entries and overlaps.
> >
> > If we have multiple remaps for multiple VMAs within one mm_struct,
> > will we end up traversing all the dynamic arrays for any folio that
> > might be located in a VMA that has been remapped?
>
> Yup. But there aren't all that many, and it's all under RCU so :)
>
> That part of the search should be quick, parts of the search involving page
> tables, less so.
>
> Also I need to figure out how to maintain stabilisation without an rmap lock,
> an ongoing open problem in all this.
>
> In the end, as the original mail said, I may conclude _this_ approach is
> unworkable and come up with an alternative that's more conventional.

I'm genuinely interested in the new approach. If you have the code, I'd be
happy to read, test, and work on it.

> BUT. Doing it this way saves 30x the amount of kernel allocated memory.
> I tried a heavy load case and it was very substantial. That's not to be
> sniffed at.
>
> In any case, all of this is going to be _very_ driven by metrics. How slow is
> it, how much overhead does it actually produce, is it workable, are the
> trade-offs right, etc.
>
> It's an exploration rather than a fait accompli.

Right now, I'm still at the stage of trying to understand the details of
your new approach and would like to learn more - so I might have quite a
few naive questions :-)

Thanks
Barry

^ permalink raw reply	[flat|nested] 6+ messages in thread
end of thread, other threads:[~2026-04-02 21:49 UTC | newest]

Thread overview: 6+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2026-03-30 21:23 [LSF/MM/BPF TOPIC] The Future of the Anonymous Reverse Mapping [RESEND] Lorenzo Stoakes (Oracle)
2026-03-31 23:30 ` Barry Song
2026-04-01  8:43   ` Lorenzo Stoakes (Oracle)
2026-04-01 21:03     ` Barry Song
2026-04-02 12:20       ` Lorenzo Stoakes (Oracle)
2026-04-02 21:49         ` Barry Song