On 2025/7/24 16:59, David Hildenbrand wrote:

On 24.07.25 10:44, Huan Yang wrote:

Summary
==
This patchset reuses page_type to store migrate entry count during the
period from migrate entry setup to removal, enabling accelerated VMA
traversal when removing migrate entries, following a principle similar to
the early termination in try_to_migrate once the folio is unmapped.

I absolutely detest (ab)using page types for that, so no from my side unless I am missing something important.

In my self-constructed test scenario, the migration time can be reduced

How relevant is that in practice?

IMO, any folio that is mapped by fewer VMAs than its mapping (anon_vma or address_space) contains will benefit from this.

So, all pages that have been COW-ed by child processes can be skipped.

from over 150+ms to around 30+ms, achieving nearly a 70% performance
improvement. Additionally, the flame graph shows that the proportion of
remove_migration_ptes can be reduced from 80%+ to 60%+.

Note: "migrate entry" here specifically means a migrate PTE entry, because
large folios do not support page types or the reuse of the field while
mapcount is 0.

Principle
==
When a page removes all PTEs in try_to_migrate and sets up a migrate PTE
entry, we can determine whether the traversal of remaining VMAs can be
terminated early by checking if mapcount is zero. This optimization
helps improve performance during migration.

However, when removing migrate PTE entries and setting up PTEs for the
destination folio in remove_migration_ptes, there is no such information
available to assist in deciding whether the traversal of remaining VMAs
can be ended early. Therefore, it is necessary to traverse all VMAs
associated with this folio.

Yes, we don't know how many migration entries are still pointing at the page.

In reality, when a folio is fully unmapped and before all migrate PTE
entries are removed, the mapcount will always be zero. Since page_type
and mapcount share a union, and referring to folio_mapcount, we can
reuse page_type to record the number of migrate PTE entries of the
current folio in the system as long as it's not a large folio. This
reuse does not affect calls to folio_mapcount, which will always return
zero.

Therefore, we can set the folio's page_type to PGTY_mgt_entry when
try_to_migrate completes, the folio is already unmapped, and it's not a
large folio. The remaining 24 bits can then be used to record the number
of migrate PTE entries generated by try_to_migrate.
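
To make the bit layout above concrete, here is a minimal userspace sketch, assuming a 32-bit word whose top 8 bits carry a type tag and whose low 24 bits carry the entry count. All names (model_pack, model_dec_and_test, MODEL_TYPE_MGT) and the tag value 0xF7 are hypothetical and chosen only for illustration; this is not the patch's code and not the kernel's actual page_type encoding.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/*
 * Hypothetical layout: top 8 bits hold a type tag, low 24 bits hold a count.
 * The tag value is made up for this sketch, not taken from the kernel.
 */
#define MODEL_TYPE_SHIFT  24
#define MODEL_COUNT_MASK  ((UINT32_C(1) << MODEL_TYPE_SHIFT) - 1)
#define MODEL_TYPE_MGT    UINT32_C(0xF7)

/* Build a word that is tagged as "migrate entry" and carries nr_entries. */
static inline uint32_t model_pack(uint32_t nr_entries)
{
    assert(nr_entries <= MODEL_COUNT_MASK);   /* count must fit in 24 bits */
    return (MODEL_TYPE_MGT << MODEL_TYPE_SHIFT) | nr_entries;
}

/* True if the word carries the hypothetical migrate-entry tag. */
static inline bool model_is_mgt(uint32_t word)
{
    return (word >> MODEL_TYPE_SHIFT) == MODEL_TYPE_MGT;
}

/* Extract the 24-bit entry count. */
static inline uint32_t model_count(uint32_t word)
{
    return word & MODEL_COUNT_MASK;
}

/* Drop one entry; returns true when no entries remain. */
static inline bool model_dec_and_test(uint32_t *word)
{
    assert(model_is_mgt(*word) && model_count(*word) > 0);
    *word -= 1;                               /* only touches the low bits */
    return model_count(*word) == 0;
}
```

The sketch ignores locking and atomicity entirely; in the proposal, updates are meant to happen under the rmap walk lock.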

In the future the page type will no longer overlay the mapcount and, consequently, be sticky.

Then, in remove_migration_ptes, when the nr_mgt_entry count drops to
zero, we can terminate the VMA traversal early.
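
The early-exit idea can be modeled in userspace as follows. This is only a sketch under stated assumptions: struct model_vma and model_remove_entries are hypothetical stand-ins for the VMAs of a mapping and for the rmap walk in remove_migration_ptes, and the counter is passed in explicitly rather than read from page_type.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in for one VMA in the folio's mapping. */
struct model_vma {
    bool has_migration_entry;   /* does this VMA hold an entry for the folio? */
};

/*
 * Walk the VMAs and "restore" each migration entry found. The loop stops
 * as soon as *remaining reaches zero instead of visiting every VMA.
 * Returns the number of VMAs actually visited.
 */
static size_t model_remove_entries(struct model_vma *vmas, size_t nr_vmas,
                                   unsigned int *remaining)
{
    size_t visited = 0;

    assert(vmas != NULL || nr_vmas == 0);

    for (size_t i = 0; i < nr_vmas && *remaining > 0; i++) {
        visited++;
        if (vmas[i].has_migration_entry) {
            vmas[i].has_migration_entry = false;
            (*remaining)--;
        }
    }
    return visited;
}
```

With six VMAs of which only the first and third hold an entry, the model visits three VMAs and skips the rest, which is the case (for example, pages COW-ed away by children) where the saving would come from.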

It's important to note that we need to initialize the folio's page_type
to PGTY_mgt_entry and set the migrate entry count only while holding the
rmap walk lock. This is because during the lock period, we can prevent
new VMA fork (which would increase migrate entries) and VMA unmap
(which would decrease migrate entries).

The more I read about PGTY_mgt_entry, the more I hate it.

However, I suspect there is actually an additional critical section here.
For example, for anon folios:

Process Parent                          fork
try_to_migrate
                                         anon_vma_clone
                                             write_lock
                                                 avc_insert_tree tail
                                         ....
     folio_lock_anon_vma_read             copy_pte_range
         vma_iter                            pte_lock
                 ....                           pte_present copy
                                             ...
                 pte_lock
                     new forked pte clean
....
remove_migration_ptes
     rmap_walk_anon_lock

If my understanding is correct and such a critical section exists, it
shouldn't cause any issues—newly added PTEs can still be properly
removed and converted into migrate entries.

But in this:

Process Parent                          fork
try_to_migrate
                                         anon_vma_clone
                                             write_lock
                                                 avc_insert_tree
                                         ....
     folio_lock_anon_vma_read             copy_pte_range
         vma_iter
                 pte_lock
                     migrate entry set
                 ....                        pte_lock
                                                 pte_nonpresent copy
                                             ....
....
remove_migration_ptes
     rmap_walk_anon_lock

Just a note: migration entries also apply to non-anon folios.

Yes, it's just an example.