On 2025/7/24 16:59, David Hildenbrand wrote:
> On 24.07.25 10:44, Huan Yang wrote:
>> Summary
>> ==
>> This patchset reuses page_type to store migrate entry count during the
>> period from migrate entry setup to removal, enabling accelerated VMA
>> traversal when removing migrate entries, following a similar principle to
>> early termination when folio is unmapped in try_to_migrate.
>
> I absolutely detest (ab)using page types for that, so no from my side 
> unless I am missing something important.
>
>>
>> In my self-constructed test scenario, the migration time can be reduced
>
> How relevant is that in practice?

IMO, any folio that is mapped by fewer VMAs than its mapping (anon_vma
or address_space) contains will benefit from this.

So, all pages that have been COW-ed by child processes can be skipped.
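
To make the benefit concrete, here is a minimal userspace sketch of the
early-termination idea. It is only an illustration, not the patch itself:
struct toy_vma and toy_remove_migration_ptes() are hypothetical names, and
the array stands in for the VMAs reachable from the anon_vma or
address_space. A negative count models the current behaviour, where the
number of outstanding migration entries is unknown and every VMA is visited.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in for one VMA found during the rmap walk. */
struct toy_vma {
	bool has_migration_entry;	/* VMA holds a migration PTE for the folio */
};

/*
 * Walk the VMAs and "remove" migration entries.
 *
 * nr_entries models the count recorded when try_to_migrate finished.
 * A negative value means "unknown", so the walk cannot stop early.
 * Returns the number of VMAs that had to be visited.
 */
static size_t toy_remove_migration_ptes(const struct toy_vma *vmas,
					size_t nr_vmas, long nr_entries)
{
	size_t visited = 0;

	assert(vmas != NULL || nr_vmas == 0);

	for (size_t i = 0; i < nr_vmas; i++) {
		/* All recorded entries are restored: stop the walk. */
		if (nr_entries == 0)
			break;

		visited++;
		if (vmas[i].has_migration_entry && nr_entries > 0)
			nr_entries--;
	}

	return visited;
}
```

With four VMAs of which only the first holds a migration entry (for
example, the children have already COW-ed the page), the walk visits one
VMA when the count is known and all four when it is not.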

>
>> from over 150+ms to around 30+ms, achieving nearly a 70% performance
>> improvement. Additionally, the flame graph shows that the proportion of
>> remove_migration_ptes can be reduced from 80%+ to 60%+.
>>
>> Notice: migrate entry specifically refers to migrate PTE entry, as large
>> folios do not support reusing page_type while mapcount is 0.
>>
>> Principle
>> ==
>> When a page removes all PTEs in try_to_migrate and sets up a migrate PTE
>> entry, we can determine whether the traversal of remaining VMAs can be
>> terminated early by checking if mapcount is zero. This optimization
>> helps improve performance during migration.
>>
>> However, when removing migrate PTE entries and setting up PTEs for the
>> destination folio in remove_migration_ptes, there is no such information
>> available to assist in deciding whether the traversal of remaining VMAs
>> can be ended early. Therefore, it is necessary to traverse all VMAs
>> associated with this folio.
>
> Yes, we don't know how many migration entries are still pointing at 
> the page.
>
>>
>> In reality, when a folio is fully unmapped and before all migrate PTE
>> entries are removed, the mapcount will always be zero. Since page_type
>> and mapcount share a union, and referring to folio_mapcount, we can
>> reuse page_type to record the number of migrate PTE entries of the
>> current folio in the system as long as it's not a large folio. This
>> reuse does not affect calls to folio_mapcount, which will always return
>> zero.
>>
>> Therefore, we can set the folio's page_type to PGTY_mgt_entry when
>> try_to_migrate completes, the folio is already unmapped, and it's not a
>> large folio. The remaining 24 bits can then be used to record the number
>> of migrate PTE entries generated by try_to_migrate.
>
> In the future the page type will no longer overlay the mapcount and, 
> consequently, be sticky.
>
>>
>> Then, in remove_migration_ptes, when the nr_mgt_entry count drops to
>> zero, we can terminate the VMA traversal early.
>>
>> It's important to note that we need to initialize the folio's page_type
>> to PGTY_mgt_entry and set the migrate entry count only while holding the
>> rmap walk lock. This is because during the lock period, we can prevent
>> new VMA fork (which would increase migrate entries) and VMA unmap
>> (which would decrease migrate entries).
>
> The more I read about PGTY_mgt_entry, the more I hate it.
>
>>
>> However, I doubt there is actually an additional critical section here, for
>> example anon:
>>
>> Process Parent                          fork
>> try_to_migrate
>>                                          anon_vma_clone
>>                                              write_lock
>>                                                  avc_inster_tree tail
>>                                          ....
>>      folio_lock_anon_vma_read             copy_pte_range
>>          vma_iter                            pte_lock
>>                  ....                           pte_present copy
>>                                              ...
>>                  pte_lock
>>                      new forked pte clean
>> ....
>> remove_migration_ptes
>>      rmap_walk_anon_lock
>>
>> If my understanding is correct and such a critical section exists, it
>> shouldn't cause any issues—newly added PTEs can still be properly
>> removed and converted into migrate entries.
>>
>> But in this:
>>
>> Process Parent                          fork
>> try_to_migrate
>>                                          anon_vma_clone
>>                                              write_lock
>>                                                  avc_inster_tree
>>                                          ....
>>      folio_lock_anon_vma_read             copy_pte_range
>>          vma_iter
>>                  pte_lock
>>                      migrate entry set
>>                  ....                        pte_lock
>>                                                  pte_nonpresent copy
>>                                              ....
>> ....
>> remove_migration_ptes
>>      rmap_walk_anon_lock
>
> Just a note: migration entries also apply to non-anon folios.

Yes, this is just an example.