On 2025/7/24 16:59, David Hildenbrand wrote:
> On 24.07.25 10:44, Huan Yang wrote:
>> Summary
>> ==
>> This patchset reuses page_type to store migrate entry count during the
>> period from migrate entry setup to removal, enabling accelerated VMA
>> traversal when removing migrate entries, following a similar principle
>> to early termination when folio is unmapped in try_to_migrate.
>
> I absolutely detest (ab)using page types for that, so no from my side
> unless I am missing something important.
>
>>
>> In my self-constructed test scenario, the migration time can be reduced
>
> How relevant is that in practice?

IMO, any folio that is mapped by fewer VMAs than the number of VMAs in its
mapping (anon_vma or address_space) will benefit from this.

So, all pages that have been COW-ed by child processes can be skipped.

>
>> from over 150+ms to around 30+ms, achieving nearly a 70% performance
>> improvement. Additionally, the flame graph shows that the proportion of
>> remove_migration_ptes can be reduced from 80%+ to 60%+.
>>
>> Notice: migrate entry specifically refers to migrate PTE entry, as large
>> folios do not support page type and 0 mapcount reuse.
>>
>> Principle
>> ==
>> When a page removes all PTEs in try_to_migrate and sets up a migrate PTE
>> entry, we can determine whether the traversal of remaining VMAs can be
>> terminated early by checking if mapcount is zero. This optimization
>> helps improve performance during migration.
>>
>> However, when removing migrate PTE entries and setting up PTEs for the
>> destination folio in remove_migration_ptes, there is no such information
>> available to assist in deciding whether the traversal of remaining VMAs
>> can be ended early. Therefore, it is necessary to traverse all VMAs
>> associated with this folio.
>
> Yes, we don't know how many migration entries are still pointing at
> the page.
>
>>
>> In reality, when a folio is fully unmapped and before all migrate PTE
>> entries are removed, the mapcount will always be zero. Since page_type
>> and mapcount share a union, and referring to folio_mapcount, we can
>> reuse page_type to record the number of migrate PTE entries of the
>> current folio in the system as long as it's not a large folio. This
>> reuse does not affect calls to folio_mapcount, which will always return
>> zero.
>
>> Therefore, we can set the folio's page_type to PGTY_mgt_entry when
>> try_to_migrate completes, the folio is already unmapped, and it's not a
>> large folio. The remaining 24 bits can then be used to record the number
>> of migrate PTE entries generated by try_to_migrate.
>
> In the future the page type will no longer overlay the mapcount and,
> consequently, be sticky.
>
>>
>> Then, in remove_migration_ptes, when the nr_mgt_entry count drops to
>> zero, we can terminate the VMA traversal early.
>>
>> It's important to note that we need to initialize the folio's page_type
>> to PGTY_mgt_entry and set the migrate entry count only while holding the
>> rmap walk lock. This is because during the lock period, we can prevent
>> new VMA fork (which would increase migrate entries) and VMA unmap
>> (which would decrease migrate entries).
>
> The more I read about PGTY_mgt_entry, the more I hate it.
>
>>
>> However, I doubt there is actually an additional critical section here,
>> for example anon:
>>
>> Process Parent                           fork
>> try_to_migrate
>>                                          anon_vma_clone
>>                                              write_lock
>>                                                  avc_inster_tree tail
>>                                          ....
>>      folio_lock_anon_vma_read            copy_pte_range
>>          vma_iter                            pte_lock
>>                  ....                            pte_present copy
>>                                              ...
>>                  pte_lock
>>                      new forked pte clean
>> ....
>> remove_migration_ptes
>>      rmap_walk_anon_lock
>>
>> If my understanding is correct and such a critical section exists, it
>> shouldn't cause any issues: newly added PTEs can still be properly
>> removed and converted into migrate entries.
>>
>> But in this:
>>
>> Process Parent                           fork
>> try_to_migrate
>>                                          anon_vma_clone
>>                                              write_lock
>>                                                  avc_inster_tree
>>                                          ....
>>      folio_lock_anon_vma_read            copy_pte_range
>>          vma_iter
>>                  pte_lock
>>                      migrate entry set
>>                  ....                        pte_lock
>>                                                  pte_nonpresent copy
>>                                              ....
>> ....
>> remove_migration_ptes
>>      rmap_walk_anon_lock
>
> Just a note: migration entries also apply to non-anon folios.

Yes, this is just an example.
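
To make the early-termination idea discussed above concrete, here is a toy
user-space model in C. It is only a sketch under stated assumptions, not
kernel code: struct toy_vma, struct toy_folio, toy_try_to_migrate and
toy_remove_migration_ptes are hypothetical names invented for illustration,
the per-folio counter is a plain field rather than anything stored in
page_type, and locking, fork and unmap races are not modelled at all.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical toy VMA: it either maps the folio or holds a migration entry. */
struct toy_vma {
	bool maps_folio;          /* a present PTE points at the folio */
	bool has_migration_entry; /* a migration PTE entry replaced it */
};

/* Hypothetical stand-in for the per-folio migration entry count. */
struct toy_folio {
	unsigned int nr_mgt_entry;
};

/*
 * Model of the unmap phase: every present mapping becomes a migration
 * entry, and the number of entries created is recorded on the folio.
 */
void toy_try_to_migrate(struct toy_folio *folio, struct toy_vma *vmas,
			size_t nr_vmas)
{
	folio->nr_mgt_entry = 0;
	for (size_t i = 0; i < nr_vmas; i++) {
		if (!vmas[i].maps_folio)
			continue;
		vmas[i].maps_folio = false;
		vmas[i].has_migration_entry = true;
		folio->nr_mgt_entry++;
	}
}

/*
 * Model of the restore phase: walk the VMAs and turn migration entries
 * back into present mappings. With early_exit set, the walk stops as soon
 * as the recorded count reaches zero. Returns the number of VMAs visited.
 */
size_t toy_remove_migration_ptes(struct toy_folio *folio,
				 struct toy_vma *vmas, size_t nr_vmas,
				 bool early_exit)
{
	size_t visited = 0;

	for (size_t i = 0; i < nr_vmas; i++) {
		if (early_exit && folio->nr_mgt_entry == 0)
			break;
		visited++;
		if (!vmas[i].has_migration_entry)
			continue;
		vmas[i].has_migration_entry = false;
		vmas[i].maps_folio = true;
		assert(folio->nr_mgt_entry > 0);
		folio->nr_mgt_entry--;
	}
	return visited;
}
```

With 8 toy VMAs of which only the first and third map the folio, the
early-exit walk visits 3 VMAs, while the walk without early exit visits
all 8. The saving in this model depends entirely on where the mapping VMAs
sit in the walk order, so it says nothing about real-world gains.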