* [RFC PATCH 0/3] Reduce NUMA balance caused TLB-shootdowns in a VM
@ 2023-08-08 7:13 Yan Zhao
2023-08-08 7:14 ` [RFC PATCH 1/3] mm/mmu_notifier: introduce a new mmu notifier flag MMU_NOTIFIER_RANGE_NUMA Yan Zhao
` (2 more replies)
0 siblings, 3 replies; 16+ messages in thread
From: Yan Zhao @ 2023-08-08 7:13 UTC (permalink / raw)
To: linux-mm, linux-kernel, kvm
Cc: pbonzini, seanjc, mike.kravetz, apopple, jgg, rppt, akpm,
kevin.tian, Yan Zhao
This is an RFC series that tries to fix the unnecessary NUMA protection
and TLB shootdowns seen in VMs with assigned devices or VFIO mediated
devices during NUMA balancing.
For VMs with assigned devices or VFIO mediated devices, all or part of
the guest memory is pinned long-term.
Auto NUMA balancing periodically selects VMAs of a process and changes
their protection to PROT_NONE, even though some or all pages in the
selected ranges are long-term pinned for DMA, which is the case for VMs
with assigned devices or VFIO mediated devices.
Though this causes no correctness problem, because NUMA migration will
ultimately reject migrating such pages and restore the PROT_NONE PTEs,
it makes KVM's secondary MMU get zapped periodically, with identical
SPTEs faulted back in afterwards, wasting CPU cycles and generating
unnecessary TLB shootdowns.
This series first introduces a new flag, MMU_NOTIFIER_RANGE_NUMA, in
patch 1 to accompany the mmu notifier event type
MMU_NOTIFY_PROTECTION_VMA, so that a subscriber of the mmu notifier
(e.g. KVM) can tell that an invalidation event is sent specifically for
NUMA migration.
Then, with patch 3, while zapping its secondary MMU, KVM can check for
long-term pinned pages and keep accessing them even though they are
PROT_NONE-mapped in the primary MMU.
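Taken together, patches 1 and 3 let KVM derive that decision from the
notification itself. As a rough sketch (not code taken from the diffs;
the helper name is made up for illustration), the condition patch 3
computes its skip_pinned hint from looks like this:

	/*
	 * Sketch only: recognize an invalidation sent purely for
	 * NUMA-balancing protection on an mm that may hold pinned pages.
	 * Patch 3 open-codes this when filling kvm_hva_range.skip_pinned.
	 */
	static bool range_may_skip_pinned(struct mmu_notifier_range *range)
	{
		return range->event == MMU_NOTIFY_PROTECTION_VMA &&
		       (range->flags & MMU_NOTIFIER_RANGE_NUMA) &&
		       test_bit(MMF_HAS_PINNED, &range->mm->flags);
	}
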
Patch 2 skips setting PROT_NONE on long-term pinned pages in the
primary MMU, to avoid the page faults introduced by NUMA protection and
the later restoration of the old huge PMDs/PTEs in the primary MMU.
Since change_pmd_range() sends .invalidate_range_start() before walking
down and checking which pages to skip, patches 1 and 3 are still
required for KVM.
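In both the primary MMU (patch 2) and KVM (patch 3), "long-term pinned"
is approximated by the same page-level test. Pulled out as a standalone
helper purely for readability (a sketch; the helper name is made up,
and the actual diffs open-code the check in change_pte_range(),
change_huge_pmd() and tdp_mmu_zap_leafs()):

	/* Sketch only: the per-page test applied by patches 2 and 3. */
	static bool anon_exclusive_maybe_pinned(struct page *page)
	{
		struct folio *folio = page_folio(page);

		return folio_test_anon(folio) &&
		       PageAnonExclusive(&folio->page) &&
		       folio_maybe_dma_pinned(folio);
	}
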
In my test environment, with this series applied, during boot-up of a
VM with assigned devices, the TLB shootdown count in KVM caused by the
.invalidate_range_start() notifications sent for NUMA balancing from
change_pmd_range() is reduced from 9000+ on average to 0.
Yan Zhao (3):
mm/mmu_notifier: introduce a new mmu notifier flag
MMU_NOTIFIER_RANGE_NUMA
mm: don't set PROT_NONE to maybe-dma-pinned pages for NUMA-migrate
purpose
KVM: x86/mmu: skip zap maybe-dma-pinned pages for NUMA migration
arch/x86/kvm/mmu/mmu.c | 4 ++--
arch/x86/kvm/mmu/tdp_mmu.c | 26 ++++++++++++++++++++++----
arch/x86/kvm/mmu/tdp_mmu.h | 4 ++--
include/linux/kvm_host.h | 1 +
include/linux/mmu_notifier.h | 1 +
mm/huge_memory.c | 5 +++++
mm/mprotect.c | 9 ++++++++-
virt/kvm/kvm_main.c | 5 +++++
8 files changed, 46 insertions(+), 9 deletions(-)
base-commit: fdf0eaf11452d72945af31804e2a1048ee1b574c
--
2.17.1
^ permalink raw reply [flat|nested] 16+ messages in thread* [RFC PATCH 1/3] mm/mmu_notifier: introduce a new mmu notifier flag MMU_NOTIFIER_RANGE_NUMA 2023-08-08 7:13 [RFC PATCH 0/3] Reduce NUMA balance caused TLB-shootdowns in a VM Yan Zhao @ 2023-08-08 7:14 ` Yan Zhao 2023-08-08 7:15 ` [RFC PATCH 2/3] mm: don't set PROT_NONE to maybe-dma-pinned pages for NUMA-migrate purpose Yan Zhao 2023-08-08 7:17 ` [RFC PATCH 3/3] KVM: x86/mmu: skip zap maybe-dma-pinned pages for NUMA migration Yan Zhao 2 siblings, 0 replies; 16+ messages in thread From: Yan Zhao @ 2023-08-08 7:14 UTC (permalink / raw) To: linux-mm, linux-kernel, kvm Cc: pbonzini, seanjc, mike.kravetz, apopple, jgg, rppt, akpm, kevin.tian, Yan Zhao Introduce a new mmu notifier flag MMU_NOTIFIER_RANGE_NUMA to indicate the notification of MMU_NOTIFY_PROTECTION_VMA is for NUMA balance purpose specifically. So that, the subscriber of mmu notifier, like KVM, can do some performance optimization according to this accurate information. Signed-off-by: Yan Zhao <yan.y.zhao@intel.com> --- include/linux/mmu_notifier.h | 1 + mm/mprotect.c | 4 +++- 2 files changed, 4 insertions(+), 1 deletion(-) diff --git a/include/linux/mmu_notifier.h b/include/linux/mmu_notifier.h index 64a3e051c3c4..a6dc829a4bce 100644 --- a/include/linux/mmu_notifier.h +++ b/include/linux/mmu_notifier.h @@ -60,6 +60,7 @@ enum mmu_notifier_event { }; #define MMU_NOTIFIER_RANGE_BLOCKABLE (1 << 0) +#define MMU_NOTIFIER_RANGE_NUMA (1 << 1) struct mmu_notifier_ops { /* diff --git a/mm/mprotect.c b/mm/mprotect.c index 6f658d483704..cb99a7d66467 100644 --- a/mm/mprotect.c +++ b/mm/mprotect.c @@ -381,7 +381,9 @@ static inline long change_pmd_range(struct mmu_gather *tlb, /* invoke the mmu notifier if the pmd is populated */ if (!range.start) { mmu_notifier_range_init(&range, - MMU_NOTIFY_PROTECTION_VMA, 0, + MMU_NOTIFY_PROTECTION_VMA, + cp_flags & MM_CP_PROT_NUMA ? + MMU_NOTIFIER_RANGE_NUMA : 0, vma->vm_mm, addr, end); mmu_notifier_invalidate_range_start(&range); } -- 2.17.1 ^ permalink raw reply [flat|nested] 16+ messages in thread
* [RFC PATCH 2/3] mm: don't set PROT_NONE to maybe-dma-pinned pages for NUMA-migrate purpose 2023-08-08 7:13 [RFC PATCH 0/3] Reduce NUMA balance caused TLB-shootdowns in a VM Yan Zhao 2023-08-08 7:14 ` [RFC PATCH 1/3] mm/mmu_notifier: introduce a new mmu notifier flag MMU_NOTIFIER_RANGE_NUMA Yan Zhao @ 2023-08-08 7:15 ` Yan Zhao 2023-08-08 7:17 ` [RFC PATCH 3/3] KVM: x86/mmu: skip zap maybe-dma-pinned pages for NUMA migration Yan Zhao 2 siblings, 0 replies; 16+ messages in thread From: Yan Zhao @ 2023-08-08 7:15 UTC (permalink / raw) To: linux-mm, linux-kernel, kvm Cc: pbonzini, seanjc, mike.kravetz, apopple, jgg, rppt, akpm, kevin.tian, Yan Zhao Don't set PROT_NONE for exclusive anonymas and maybe-dma-pinned pages for NUMA migration purpose. For exclusive anonymas and page_maybe_dma_pinned() pages, NUMA-migration will eventually drop migration of those pages in try_to_migrate_one(). (i.e. after -EBUSY returned in page_try_share_anon_rmap()). So, skip setting PROT_NONE to those kind of pages earlier in change_protection_range() phase to avoid later futile page faults, detections, and restoration to original PTEs/PMDs. Signed-off-by: Yan Zhao <yan.y.zhao@intel.com> --- mm/huge_memory.c | 5 +++++ mm/mprotect.c | 5 +++++ 2 files changed, 10 insertions(+) diff --git a/mm/huge_memory.c b/mm/huge_memory.c index eb3678360b97..a71cf686e3b2 100644 --- a/mm/huge_memory.c +++ b/mm/huge_memory.c @@ -1875,6 +1875,11 @@ int change_huge_pmd(struct mmu_gather *tlb, struct vm_area_struct *vma, goto unlock; page = pmd_page(*pmd); + + if (PageAnon(page) && PageAnonExclusive(page) && + page_maybe_dma_pinned(page)) + goto unlock; + toptier = node_is_toptier(page_to_nid(page)); /* * Skip scanning top tier node if normal numa diff --git a/mm/mprotect.c b/mm/mprotect.c index cb99a7d66467..a1f63df34b86 100644 --- a/mm/mprotect.c +++ b/mm/mprotect.c @@ -146,6 +146,11 @@ static long change_pte_range(struct mmu_gather *tlb, nid = page_to_nid(page); if (target_node == nid) continue; + + if (PageAnon(page) && PageAnonExclusive(page) && + page_maybe_dma_pinned(page)) + continue; + toptier = node_is_toptier(nid); /* -- 2.17.1 ^ permalink raw reply [flat|nested] 16+ messages in thread
* [RFC PATCH 3/3] KVM: x86/mmu: skip zap maybe-dma-pinned pages for NUMA migration 2023-08-08 7:13 [RFC PATCH 0/3] Reduce NUMA balance caused TLB-shootdowns in a VM Yan Zhao 2023-08-08 7:14 ` [RFC PATCH 1/3] mm/mmu_notifier: introduce a new mmu notifier flag MMU_NOTIFIER_RANGE_NUMA Yan Zhao 2023-08-08 7:15 ` [RFC PATCH 2/3] mm: don't set PROT_NONE to maybe-dma-pinned pages for NUMA-migrate purpose Yan Zhao @ 2023-08-08 7:17 ` Yan Zhao 2023-08-08 12:32 ` Jason Gunthorpe 2023-08-26 6:39 ` liulongfang 2 siblings, 2 replies; 16+ messages in thread From: Yan Zhao @ 2023-08-08 7:17 UTC (permalink / raw) To: linux-mm, linux-kernel, kvm Cc: pbonzini, seanjc, mike.kravetz, apopple, jgg, rppt, akpm, kevin.tian, Yan Zhao Skip zapping pages that're exclusive anonymas and maybe-dma-pinned in TDP MMU if it's for NUMA migration purpose to save unnecessary zaps and TLB shootdowns. For NUMA balancing, change_pmd_range() will send .invalidate_range_start() and .invalidate_range_end() pair unconditionally before setting a huge PMD or PTE to be PROT_NONE. No matter whether PROT_NONE is set under change_pmd_range(), NUMA migration will eventually reject migrating of exclusive anonymas and maybe_dma_pinned pages in later try_to_migrate_one() phase and restoring the affected huge PMD or PTE. Therefore, if KVM can detect those kind of pages in the zap phase, zap and TLB shootdowns caused by this kind of protection can be avoided. Corner cases like below are still fine. 1. Auto NUMA balancing selects a PMD range to set PROT_NONE in change_pmd_range(). 2. A page is maybe-dma-pinned at the time of sending .invalidate_range_start() with event type MMU_NOTIFY_PROTECTION_VMA. ==> so it's not zapped in KVM's secondary MMU. 3. The page is unpinned after sending .invalidate_range_start(), therefore is not maybe-dma-pinned and set to PROT_NONE in primary MMU. 4. For some reason, page fault is triggered in primary MMU and the page will be found to be suitable for NUMA migration. 5. try_to_migrate_one() will send .invalidate_range_start() notification with event type MMU_NOTIFY_CLEAR to KVM, and ===> KVM will zap the pages in secondary MMU. 6. The old page will be replaced by a new page in primary MMU. If step 4 does not happen, though KVM will keep accessing a page that might not be on the best NUMA node, it can be fixed by a next round of step 1 in Auto NUMA balancing as change_pmd_range() will send mmu notification without checking PROT_NONE is set or not. Currently in this patch, for NUMA migration protection purpose, only exclusive anonymous maybe-dma-pinned pages are skipped. Can later include other type of pages, e.g., is_zone_device_page() or PageKsm() if necessary. 
Signed-off-by: Yan Zhao <yan.y.zhao@intel.com> --- arch/x86/kvm/mmu/mmu.c | 4 ++-- arch/x86/kvm/mmu/tdp_mmu.c | 26 ++++++++++++++++++++++---- arch/x86/kvm/mmu/tdp_mmu.h | 4 ++-- include/linux/kvm_host.h | 1 + virt/kvm/kvm_main.c | 5 +++++ 5 files changed, 32 insertions(+), 8 deletions(-) diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c index d72f2b20f430..9dccc25b1389 100644 --- a/arch/x86/kvm/mmu/mmu.c +++ b/arch/x86/kvm/mmu/mmu.c @@ -6307,8 +6307,8 @@ void kvm_zap_gfn_range(struct kvm *kvm, gfn_t gfn_start, gfn_t gfn_end) if (tdp_mmu_enabled) { for (i = 0; i < KVM_ADDRESS_SPACE_NUM; i++) - flush = kvm_tdp_mmu_zap_leafs(kvm, i, gfn_start, - gfn_end, true, flush); + flush = kvm_tdp_mmu_zap_leafs(kvm, i, gfn_start, gfn_end, + true, flush, false); } if (flush) diff --git a/arch/x86/kvm/mmu/tdp_mmu.c b/arch/x86/kvm/mmu/tdp_mmu.c index 6250bd3d20c1..17762b5a2b98 100644 --- a/arch/x86/kvm/mmu/tdp_mmu.c +++ b/arch/x86/kvm/mmu/tdp_mmu.c @@ -838,7 +838,8 @@ bool kvm_tdp_mmu_zap_sp(struct kvm *kvm, struct kvm_mmu_page *sp) * operation can cause a soft lockup. */ static bool tdp_mmu_zap_leafs(struct kvm *kvm, struct kvm_mmu_page *root, - gfn_t start, gfn_t end, bool can_yield, bool flush) + gfn_t start, gfn_t end, bool can_yield, bool flush, + bool skip_pinned) { struct tdp_iter iter; @@ -859,6 +860,21 @@ static bool tdp_mmu_zap_leafs(struct kvm *kvm, struct kvm_mmu_page *root, !is_last_spte(iter.old_spte, iter.level)) continue; + if (skip_pinned) { + kvm_pfn_t pfn = spte_to_pfn(iter.old_spte); + struct page *page = kvm_pfn_to_refcounted_page(pfn); + struct folio *folio; + + if (!page) + continue; + + folio = page_folio(page); + + if (folio_test_anon(folio) && PageAnonExclusive(&folio->page) && + folio_maybe_dma_pinned(folio)) + continue; + } + tdp_mmu_iter_set_spte(kvm, &iter, 0); flush = true; } @@ -878,12 +894,13 @@ static bool tdp_mmu_zap_leafs(struct kvm *kvm, struct kvm_mmu_page *root, * more SPTEs were zapped since the MMU lock was last acquired. 
*/ bool kvm_tdp_mmu_zap_leafs(struct kvm *kvm, int as_id, gfn_t start, gfn_t end, - bool can_yield, bool flush) + bool can_yield, bool flush, bool skip_pinned) { struct kvm_mmu_page *root; for_each_tdp_mmu_root_yield_safe(kvm, root, as_id) - flush = tdp_mmu_zap_leafs(kvm, root, start, end, can_yield, flush); + flush = tdp_mmu_zap_leafs(kvm, root, start, end, can_yield, flush, + skip_pinned); return flush; } @@ -1147,7 +1164,8 @@ bool kvm_tdp_mmu_unmap_gfn_range(struct kvm *kvm, struct kvm_gfn_range *range, bool flush) { return kvm_tdp_mmu_zap_leafs(kvm, range->slot->as_id, range->start, - range->end, range->may_block, flush); + range->end, range->may_block, flush, + range->skip_pinned); } typedef bool (*tdp_handler_t)(struct kvm *kvm, struct tdp_iter *iter, diff --git a/arch/x86/kvm/mmu/tdp_mmu.h b/arch/x86/kvm/mmu/tdp_mmu.h index 0a63b1afabd3..2a9de44bc5c3 100644 --- a/arch/x86/kvm/mmu/tdp_mmu.h +++ b/arch/x86/kvm/mmu/tdp_mmu.h @@ -20,8 +20,8 @@ __must_check static inline bool kvm_tdp_mmu_get_root(struct kvm_mmu_page *root) void kvm_tdp_mmu_put_root(struct kvm *kvm, struct kvm_mmu_page *root, bool shared); -bool kvm_tdp_mmu_zap_leafs(struct kvm *kvm, int as_id, gfn_t start, - gfn_t end, bool can_yield, bool flush); +bool kvm_tdp_mmu_zap_leafs(struct kvm *kvm, int as_id, gfn_t start, gfn_t end, + bool can_yield, bool flush, bool skip_pinned); bool kvm_tdp_mmu_zap_sp(struct kvm *kvm, struct kvm_mmu_page *sp); void kvm_tdp_mmu_zap_all(struct kvm *kvm); void kvm_tdp_mmu_invalidate_all_roots(struct kvm *kvm); diff --git a/include/linux/kvm_host.h b/include/linux/kvm_host.h index 9125d0ab642d..f883d6b59545 100644 --- a/include/linux/kvm_host.h +++ b/include/linux/kvm_host.h @@ -266,6 +266,7 @@ struct kvm_gfn_range { gfn_t end; union kvm_mmu_notifier_arg arg; bool may_block; + bool skip_pinned; }; bool kvm_unmap_gfn_range(struct kvm *kvm, struct kvm_gfn_range *range); bool kvm_age_gfn(struct kvm *kvm, struct kvm_gfn_range *range); diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c index f84ef9399aee..1202c1daa568 100644 --- a/virt/kvm/kvm_main.c +++ b/virt/kvm/kvm_main.c @@ -532,6 +532,7 @@ struct kvm_hva_range { on_unlock_fn_t on_unlock; bool flush_on_ret; bool may_block; + bool skip_pinned; }; /* @@ -595,6 +596,7 @@ static __always_inline int __kvm_handle_hva_range(struct kvm *kvm, */ gfn_range.arg = range->arg; gfn_range.may_block = range->may_block; + gfn_range.skip_pinned = range->skip_pinned; /* * {gfn(page) | page intersects with [hva_start, hva_end)} = @@ -754,6 +756,9 @@ static int kvm_mmu_notifier_invalidate_range_start(struct mmu_notifier *mn, .on_unlock = kvm_arch_guest_memory_reclaimed, .flush_on_ret = true, .may_block = mmu_notifier_range_blockable(range), + .skip_pinned = test_bit(MMF_HAS_PINNED, &range->mm->flags) && + (range->event == MMU_NOTIFY_PROTECTION_VMA) && + (range->flags & MMU_NOTIFIER_RANGE_NUMA), }; trace_kvm_unmap_hva_range(range->start, range->end); -- 2.17.1 ^ permalink raw reply [flat|nested] 16+ messages in thread
* Re: [RFC PATCH 3/3] KVM: x86/mmu: skip zap maybe-dma-pinned pages for NUMA migration 2023-08-08 7:17 ` [RFC PATCH 3/3] KVM: x86/mmu: skip zap maybe-dma-pinned pages for NUMA migration Yan Zhao @ 2023-08-08 12:32 ` Jason Gunthorpe 2023-08-08 14:26 ` Sean Christopherson 2023-08-26 6:39 ` liulongfang 1 sibling, 1 reply; 16+ messages in thread From: Jason Gunthorpe @ 2023-08-08 12:32 UTC (permalink / raw) To: Yan Zhao Cc: linux-mm, linux-kernel, kvm, pbonzini, seanjc, mike.kravetz, apopple, rppt, akpm, kevin.tian On Tue, Aug 08, 2023 at 03:17:02PM +0800, Yan Zhao wrote: > Skip zapping pages that're exclusive anonymas and maybe-dma-pinned in TDP > MMU if it's for NUMA migration purpose to save unnecessary zaps and TLB > shootdowns. > > For NUMA balancing, change_pmd_range() will send .invalidate_range_start() > and .invalidate_range_end() pair unconditionally before setting a huge PMD > or PTE to be PROT_NONE. > > No matter whether PROT_NONE is set under change_pmd_range(), NUMA migration > will eventually reject migrating of exclusive anonymas and maybe_dma_pinned > pages in later try_to_migrate_one() phase and restoring the affected huge > PMD or PTE. > > Therefore, if KVM can detect those kind of pages in the zap phase, zap and > TLB shootdowns caused by this kind of protection can be avoided. > > Corner cases like below are still fine. > 1. Auto NUMA balancing selects a PMD range to set PROT_NONE in > change_pmd_range(). > 2. A page is maybe-dma-pinned at the time of sending > .invalidate_range_start() with event type MMU_NOTIFY_PROTECTION_VMA. > ==> so it's not zapped in KVM's secondary MMU. > 3. The page is unpinned after sending .invalidate_range_start(), therefore > is not maybe-dma-pinned and set to PROT_NONE in primary MMU. > 4. For some reason, page fault is triggered in primary MMU and the page > will be found to be suitable for NUMA migration. > 5. try_to_migrate_one() will send .invalidate_range_start() notification > with event type MMU_NOTIFY_CLEAR to KVM, and ===> > KVM will zap the pages in secondary MMU. > 6. The old page will be replaced by a new page in primary MMU. > > If step 4 does not happen, though KVM will keep accessing a page that > might not be on the best NUMA node, it can be fixed by a next round of > step 1 in Auto NUMA balancing as change_pmd_range() will send mmu > notification without checking PROT_NONE is set or not. > > Currently in this patch, for NUMA migration protection purpose, only > exclusive anonymous maybe-dma-pinned pages are skipped. > Can later include other type of pages, e.g., is_zone_device_page() or > PageKsm() if necessary. 
> > Signed-off-by: Yan Zhao <yan.y.zhao@intel.com> > --- > arch/x86/kvm/mmu/mmu.c | 4 ++-- > arch/x86/kvm/mmu/tdp_mmu.c | 26 ++++++++++++++++++++++---- > arch/x86/kvm/mmu/tdp_mmu.h | 4 ++-- > include/linux/kvm_host.h | 1 + > virt/kvm/kvm_main.c | 5 +++++ > 5 files changed, 32 insertions(+), 8 deletions(-) > > diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c > index d72f2b20f430..9dccc25b1389 100644 > --- a/arch/x86/kvm/mmu/mmu.c > +++ b/arch/x86/kvm/mmu/mmu.c > @@ -6307,8 +6307,8 @@ void kvm_zap_gfn_range(struct kvm *kvm, gfn_t gfn_start, gfn_t gfn_end) > > if (tdp_mmu_enabled) { > for (i = 0; i < KVM_ADDRESS_SPACE_NUM; i++) > - flush = kvm_tdp_mmu_zap_leafs(kvm, i, gfn_start, > - gfn_end, true, flush); > + flush = kvm_tdp_mmu_zap_leafs(kvm, i, gfn_start, gfn_end, > + true, flush, false); > } > > if (flush) > diff --git a/arch/x86/kvm/mmu/tdp_mmu.c b/arch/x86/kvm/mmu/tdp_mmu.c > index 6250bd3d20c1..17762b5a2b98 100644 > --- a/arch/x86/kvm/mmu/tdp_mmu.c > +++ b/arch/x86/kvm/mmu/tdp_mmu.c > @@ -838,7 +838,8 @@ bool kvm_tdp_mmu_zap_sp(struct kvm *kvm, struct kvm_mmu_page *sp) > * operation can cause a soft lockup. > */ > static bool tdp_mmu_zap_leafs(struct kvm *kvm, struct kvm_mmu_page *root, > - gfn_t start, gfn_t end, bool can_yield, bool flush) > + gfn_t start, gfn_t end, bool can_yield, bool flush, > + bool skip_pinned) > { > struct tdp_iter iter; > > @@ -859,6 +860,21 @@ static bool tdp_mmu_zap_leafs(struct kvm *kvm, struct kvm_mmu_page *root, > !is_last_spte(iter.old_spte, iter.level)) > continue; > > + if (skip_pinned) { > + kvm_pfn_t pfn = spte_to_pfn(iter.old_spte); > + struct page *page = kvm_pfn_to_refcounted_page(pfn); > + struct folio *folio; > + > + if (!page) > + continue; > + > + folio = page_folio(page); > + > + if (folio_test_anon(folio) && PageAnonExclusive(&folio->page) && > + folio_maybe_dma_pinned(folio)) > + continue; > + } > + I don't get it.. The last patch made it so that the NUMA balancing code doesn't change page_maybe_dma_pinned() pages to PROT_NONE So why doesn't KVM just check if the current and new SPTE are the same and refrain from invalidating if nothing changed? Duplicating the checks here seems very frail to me. If you did that then you probably don't need to change the notifiers. Jason ^ permalink raw reply [flat|nested] 16+ messages in thread
* Re: [RFC PATCH 3/3] KVM: x86/mmu: skip zap maybe-dma-pinned pages for NUMA migration 2023-08-08 12:32 ` Jason Gunthorpe @ 2023-08-08 14:26 ` Sean Christopherson 2023-08-08 14:32 ` Jason Gunthorpe 2023-08-09 0:29 ` Yan Zhao 0 siblings, 2 replies; 16+ messages in thread From: Sean Christopherson @ 2023-08-08 14:26 UTC (permalink / raw) To: Jason Gunthorpe Cc: Yan Zhao, linux-mm, linux-kernel, kvm, pbonzini, mike.kravetz, apopple, rppt, akpm, kevin.tian On Tue, Aug 08, 2023, Jason Gunthorpe wrote: > On Tue, Aug 08, 2023 at 03:17:02PM +0800, Yan Zhao wrote: > > @@ -859,6 +860,21 @@ static bool tdp_mmu_zap_leafs(struct kvm *kvm, struct kvm_mmu_page *root, > > !is_last_spte(iter.old_spte, iter.level)) > > continue; > > > > + if (skip_pinned) { > > + kvm_pfn_t pfn = spte_to_pfn(iter.old_spte); > > + struct page *page = kvm_pfn_to_refcounted_page(pfn); > > + struct folio *folio; > > + > > + if (!page) > > + continue; > > + > > + folio = page_folio(page); > > + > > + if (folio_test_anon(folio) && PageAnonExclusive(&folio->page) && > > + folio_maybe_dma_pinned(folio)) > > + continue; > > + } > > + > > I don't get it.. > > The last patch made it so that the NUMA balancing code doesn't change > page_maybe_dma_pinned() pages to PROT_NONE > > So why doesn't KVM just check if the current and new SPTE are the same > and refrain from invalidating if nothing changed? Because KVM doesn't have visibility into the current and new PTEs when the zapping occurs. The contract for invalidate_range_start() requires that KVM drop all references before returning, and so the zapping occurs before change_pte_range() or change_huge_pmd() have done antyhing. > Duplicating the checks here seems very frail to me. Yes, this is approach gets a hard NAK from me. IIUC, folio_maybe_dma_pinned() can yield different results purely based on refcounts, i.e. KVM could skip pages that the primary MMU does not, and thus violate the mmu_notifier contract. And in general, I am steadfastedly against adding any kind of heuristic to KVM's zapping logic. This really needs to be fixed in the primary MMU and not require any direct involvement from secondary MMUs, e.g. the mmu_notifier invalidation itself needs to be skipped. ^ permalink raw reply [flat|nested] 16+ messages in thread
* Re: [RFC PATCH 3/3] KVM: x86/mmu: skip zap maybe-dma-pinned pages for NUMA migration 2023-08-08 14:26 ` Sean Christopherson @ 2023-08-08 14:32 ` Jason Gunthorpe 2023-08-08 23:56 ` Sean Christopherson 2023-08-09 2:58 ` Yan Zhao 2023-08-09 0:29 ` Yan Zhao 1 sibling, 2 replies; 16+ messages in thread From: Jason Gunthorpe @ 2023-08-08 14:32 UTC (permalink / raw) To: Sean Christopherson Cc: Yan Zhao, linux-mm, linux-kernel, kvm, pbonzini, mike.kravetz, apopple, rppt, akpm, kevin.tian On Tue, Aug 08, 2023 at 07:26:07AM -0700, Sean Christopherson wrote: > On Tue, Aug 08, 2023, Jason Gunthorpe wrote: > > On Tue, Aug 08, 2023 at 03:17:02PM +0800, Yan Zhao wrote: > > > @@ -859,6 +860,21 @@ static bool tdp_mmu_zap_leafs(struct kvm *kvm, struct kvm_mmu_page *root, > > > !is_last_spte(iter.old_spte, iter.level)) > > > continue; > > > > > > + if (skip_pinned) { > > > + kvm_pfn_t pfn = spte_to_pfn(iter.old_spte); > > > + struct page *page = kvm_pfn_to_refcounted_page(pfn); > > > + struct folio *folio; > > > + > > > + if (!page) > > > + continue; > > > + > > > + folio = page_folio(page); > > > + > > > + if (folio_test_anon(folio) && PageAnonExclusive(&folio->page) && > > > + folio_maybe_dma_pinned(folio)) > > > + continue; > > > + } > > > + > > > > I don't get it.. > > > > The last patch made it so that the NUMA balancing code doesn't change > > page_maybe_dma_pinned() pages to PROT_NONE > > > > So why doesn't KVM just check if the current and new SPTE are the same > > and refrain from invalidating if nothing changed? > > Because KVM doesn't have visibility into the current and new PTEs when the zapping > occurs. The contract for invalidate_range_start() requires that KVM drop all > references before returning, and so the zapping occurs before change_pte_range() > or change_huge_pmd() have done antyhing. > > > Duplicating the checks here seems very frail to me. > > Yes, this is approach gets a hard NAK from me. IIUC, folio_maybe_dma_pinned() > can yield different results purely based on refcounts, i.e. KVM could skip pages > that the primary MMU does not, and thus violate the mmu_notifier contract. And > in general, I am steadfastedly against adding any kind of heuristic to KVM's > zapping logic. > > This really needs to be fixed in the primary MMU and not require any direct > involvement from secondary MMUs, e.g. the mmu_notifier invalidation itself needs > to be skipped. This likely has the same issue you just described, we don't know if it can be skipped until we iterate over the PTEs and by then it is too late to invoke the notifier. Maybe some kind of abort and restart scheme could work? Jason ^ permalink raw reply [flat|nested] 16+ messages in thread
* Re: [RFC PATCH 3/3] KVM: x86/mmu: skip zap maybe-dma-pinned pages for NUMA migration 2023-08-08 14:32 ` Jason Gunthorpe @ 2023-08-08 23:56 ` Sean Christopherson 2023-08-09 0:11 ` Yan Zhao 2023-08-09 5:06 ` Yan Zhao 2023-08-09 2:58 ` Yan Zhao 1 sibling, 2 replies; 16+ messages in thread From: Sean Christopherson @ 2023-08-08 23:56 UTC (permalink / raw) To: Jason Gunthorpe Cc: Yan Zhao, linux-mm, linux-kernel, kvm, pbonzini, mike.kravetz, apopple, rppt, akpm, kevin.tian On Tue, Aug 08, 2023, Jason Gunthorpe wrote: > On Tue, Aug 08, 2023 at 07:26:07AM -0700, Sean Christopherson wrote: > > On Tue, Aug 08, 2023, Jason Gunthorpe wrote: > > > On Tue, Aug 08, 2023 at 03:17:02PM +0800, Yan Zhao wrote: > > > > @@ -859,6 +860,21 @@ static bool tdp_mmu_zap_leafs(struct kvm *kvm, struct kvm_mmu_page *root, > > > > !is_last_spte(iter.old_spte, iter.level)) > > > > continue; > > > > > > > > + if (skip_pinned) { > > > > + kvm_pfn_t pfn = spte_to_pfn(iter.old_spte); > > > > + struct page *page = kvm_pfn_to_refcounted_page(pfn); > > > > + struct folio *folio; > > > > + > > > > + if (!page) > > > > + continue; > > > > + > > > > + folio = page_folio(page); > > > > + > > > > + if (folio_test_anon(folio) && PageAnonExclusive(&folio->page) && > > > > + folio_maybe_dma_pinned(folio)) > > > > + continue; > > > > + } > > > > + > > > > > > I don't get it.. > > > > > > The last patch made it so that the NUMA balancing code doesn't change > > > page_maybe_dma_pinned() pages to PROT_NONE > > > > > > So why doesn't KVM just check if the current and new SPTE are the same > > > and refrain from invalidating if nothing changed? > > > > Because KVM doesn't have visibility into the current and new PTEs when the zapping > > occurs. The contract for invalidate_range_start() requires that KVM drop all > > references before returning, and so the zapping occurs before change_pte_range() > > or change_huge_pmd() have done antyhing. > > > > > Duplicating the checks here seems very frail to me. > > > > Yes, this is approach gets a hard NAK from me. IIUC, folio_maybe_dma_pinned() > > can yield different results purely based on refcounts, i.e. KVM could skip pages > > that the primary MMU does not, and thus violate the mmu_notifier contract. And > > in general, I am steadfastedly against adding any kind of heuristic to KVM's > > zapping logic. > > > > This really needs to be fixed in the primary MMU and not require any direct > > involvement from secondary MMUs, e.g. the mmu_notifier invalidation itself needs > > to be skipped. > > This likely has the same issue you just described, we don't know if it > can be skipped until we iterate over the PTEs and by then it is too > late to invoke the notifier. Maybe some kind of abort and restart > scheme could work? Or maybe treat this as a userspace config problem? Pinning DMA pages in a VM, having a fair amount of remote memory, *and* expecting NUMA balancing to do anything useful for that VM seems like a userspace problem. Actually, does NUMA balancing even support this particular scenario? I see this in do_numa_page() /* TODO: handle PTE-mapped THP */ if (PageCompound(page)) goto out_map; and then for PG_anon_exclusive * ... For now, we only expect it to be * set on tail pages for PTE-mapped THP. */ PG_anon_exclusive = PG_mappedtodisk, which IIUC means zapping these pages to do migrate_on-fault will never succeed. Can we just tell userspace to mbind() the pinned region to explicitly exclude the VMA(s) from NUMA balancing? ^ permalink raw reply [flat|nested] 16+ messages in thread
* Re: [RFC PATCH 3/3] KVM: x86/mmu: skip zap maybe-dma-pinned pages for NUMA migration 2023-08-08 23:56 ` Sean Christopherson @ 2023-08-09 0:11 ` Yan Zhao 2023-08-09 11:59 ` Jason Gunthorpe 2023-08-09 5:06 ` Yan Zhao 1 sibling, 1 reply; 16+ messages in thread From: Yan Zhao @ 2023-08-09 0:11 UTC (permalink / raw) To: Sean Christopherson Cc: Jason Gunthorpe, linux-mm, linux-kernel, kvm, pbonzini, mike.kravetz, apopple, rppt, akpm, kevin.tian On Tue, Aug 08, 2023 at 04:56:11PM -0700, Sean Christopherson wrote: > On Tue, Aug 08, 2023, Jason Gunthorpe wrote: > > On Tue, Aug 08, 2023 at 07:26:07AM -0700, Sean Christopherson wrote: > > > On Tue, Aug 08, 2023, Jason Gunthorpe wrote: > > > > On Tue, Aug 08, 2023 at 03:17:02PM +0800, Yan Zhao wrote: > > > > > @@ -859,6 +860,21 @@ static bool tdp_mmu_zap_leafs(struct kvm *kvm, struct kvm_mmu_page *root, > > > > > !is_last_spte(iter.old_spte, iter.level)) > > > > > continue; > > > > > > > > > > + if (skip_pinned) { > > > > > + kvm_pfn_t pfn = spte_to_pfn(iter.old_spte); > > > > > + struct page *page = kvm_pfn_to_refcounted_page(pfn); > > > > > + struct folio *folio; > > > > > + > > > > > + if (!page) > > > > > + continue; > > > > > + > > > > > + folio = page_folio(page); > > > > > + > > > > > + if (folio_test_anon(folio) && PageAnonExclusive(&folio->page) && > > > > > + folio_maybe_dma_pinned(folio)) > > > > > + continue; > > > > > + } > > > > > + > > > > > > > > I don't get it.. > > > > > > > > The last patch made it so that the NUMA balancing code doesn't change > > > > page_maybe_dma_pinned() pages to PROT_NONE > > > > > > > > So why doesn't KVM just check if the current and new SPTE are the same > > > > and refrain from invalidating if nothing changed? > > > > > > Because KVM doesn't have visibility into the current and new PTEs when the zapping > > > occurs. The contract for invalidate_range_start() requires that KVM drop all > > > references before returning, and so the zapping occurs before change_pte_range() > > > or change_huge_pmd() have done antyhing. > > > > > > > Duplicating the checks here seems very frail to me. > > > > > > Yes, this is approach gets a hard NAK from me. IIUC, folio_maybe_dma_pinned() > > > can yield different results purely based on refcounts, i.e. KVM could skip pages > > > that the primary MMU does not, and thus violate the mmu_notifier contract. And > > > in general, I am steadfastedly against adding any kind of heuristic to KVM's > > > zapping logic. > > > > > > This really needs to be fixed in the primary MMU and not require any direct > > > involvement from secondary MMUs, e.g. the mmu_notifier invalidation itself needs > > > to be skipped. > > > > This likely has the same issue you just described, we don't know if it > > can be skipped until we iterate over the PTEs and by then it is too > > late to invoke the notifier. Maybe some kind of abort and restart > > scheme could work? > > Or maybe treat this as a userspace config problem? Pinning DMA pages in a VM, > having a fair amount of remote memory, *and* expecting NUMA balancing to do anything > useful for that VM seems like a userspace problem. > > Actually, does NUMA balancing even support this particular scenario? I see this > in do_numa_page() > > /* TODO: handle PTE-mapped THP */ > if (PageCompound(page)) > goto out_map; hi Sean, I think compound page is handled in do_huge_pmd_numa_page(), and I do observed numa migration of those kind of pages. > and then for PG_anon_exclusive > > * ... 
For now, we only expect it to be > * set on tail pages for PTE-mapped THP. > */ > PG_anon_exclusive = PG_mappedtodisk, > > which IIUC means zapping these pages to do migrate_on-fault will never succeed. > > Can we just tell userspace to mbind() the pinned region to explicitly exclude the > VMA(s) from NUMA balancing? For VMs with VFIO mdev mediated devices, the VMAs to be pinned are dynamic, I think it's hard to mbind() in advance. Thanks Yan ^ permalink raw reply [flat|nested] 16+ messages in thread
* Re: [RFC PATCH 3/3] KVM: x86/mmu: skip zap maybe-dma-pinned pages for NUMA migration 2023-08-09 0:11 ` Yan Zhao @ 2023-08-09 11:59 ` Jason Gunthorpe 2023-08-10 9:08 ` Yan Zhao 0 siblings, 1 reply; 16+ messages in thread From: Jason Gunthorpe @ 2023-08-09 11:59 UTC (permalink / raw) To: Yan Zhao Cc: Sean Christopherson, linux-mm, linux-kernel, kvm, pbonzini, mike.kravetz, apopple, rppt, akpm, kevin.tian On Wed, Aug 09, 2023 at 08:11:17AM +0800, Yan Zhao wrote: > > Can we just tell userspace to mbind() the pinned region to explicitly exclude the > > VMA(s) from NUMA balancing? > For VMs with VFIO mdev mediated devices, the VMAs to be pinned are > dynamic, I think it's hard to mbind() in advance. It is hard to view the mediated devices path as a performance path that deserves this kind of intervention :\ Jason ^ permalink raw reply [flat|nested] 16+ messages in thread
* Re: [RFC PATCH 3/3] KVM: x86/mmu: skip zap maybe-dma-pinned pages for NUMA migration 2023-08-09 11:59 ` Jason Gunthorpe @ 2023-08-10 9:08 ` Yan Zhao 0 siblings, 0 replies; 16+ messages in thread From: Yan Zhao @ 2023-08-10 9:08 UTC (permalink / raw) To: Jason Gunthorpe Cc: Sean Christopherson, linux-mm, linux-kernel, kvm, pbonzini, mike.kravetz, apopple, rppt, akpm, kevin.tian On Wed, Aug 09, 2023 at 08:59:16AM -0300, Jason Gunthorpe wrote: > On Wed, Aug 09, 2023 at 08:11:17AM +0800, Yan Zhao wrote: > > > > Can we just tell userspace to mbind() the pinned region to explicitly exclude the > > > VMA(s) from NUMA balancing? > > > For VMs with VFIO mdev mediated devices, the VMAs to be pinned are > > dynamic, I think it's hard to mbind() in advance. > > It is hard to view the mediated devices path as a performance path > that deserves this kind of intervention :\ Though you are right, maybe we can still make it better? What about introducing a new callback which will be called when a page is ensured to be PROT_NONE protected for NUMA balancing? Then, rather than duplicate mm logic in KVM, KVM can depend on this callback and do the page unmap in secondary MMU only for pages that are indeed PROT_NONE protected for NUMA balancing, excluding pages that are obviously non-NUMA-migratable. I sent a RFC v2 (commit messages and comments are not well polished) to show this idea, https://lore.kernel.org/all/20230810085636.25914-1-yan.y.zhao@intel.com/ Do you think we can continue the work? Thanks a lot for your review! ^ permalink raw reply [flat|nested] 16+ messages in thread
* Re: [RFC PATCH 3/3] KVM: x86/mmu: skip zap maybe-dma-pinned pages for NUMA migration 2023-08-08 23:56 ` Sean Christopherson 2023-08-09 0:11 ` Yan Zhao @ 2023-08-09 5:06 ` Yan Zhao 1 sibling, 0 replies; 16+ messages in thread From: Yan Zhao @ 2023-08-09 5:06 UTC (permalink / raw) To: Sean Christopherson Cc: Jason Gunthorpe, linux-mm, linux-kernel, kvm, pbonzini, mike.kravetz, apopple, rppt, akpm, kevin.tian On Tue, Aug 08, 2023 at 04:56:11PM -0700, Sean Christopherson wrote: > and then for PG_anon_exclusive > > * ... For now, we only expect it to be > * set on tail pages for PTE-mapped THP. > */ > PG_anon_exclusive = PG_mappedtodisk, > /* * Depending on the way an anonymous folio can be mapped into a page * table (e.g., single PMD/PUD/CONT of the head page vs. PTE-mapped * THP), PG_anon_exclusive may be set only for the head page or for * tail pages of an anonymous folio. For now, we only expect it to be * set on tail pages for PTE-mapped THP. */ PG_anon_exclusive = PG_mappedtodisk, Now sure why the comment says PG_anon_exclusive is set only on tail pages for PTE-mapped THP, what I observed is that only head page of a compound page is set to anon_exclusive. And the code path is here: __handle_mm_fault |->create_huge_pmd |->do_huge_pmd_anonymous_page //if (vma_is_anonymous(vmf->vma) |->folio = vma_alloc_folio(gfp, HPAGE_PMD_ORDER, vma, haddr, true); |->__do_huge_pmd_anonymous_page(vmf, &folio->page, gfp); |->folio_add_new_anon_rmap |->__page_set_anon_rmap(folio, &folio->page, vma, address, 1); |->SetPageAnonExclusive(page) And this code path has been present since 6c287605fd56 ("mm: remember exclusively mapped anonymous pages with PG_anon_exclusive") ^ permalink raw reply [flat|nested] 16+ messages in thread
* Re: [RFC PATCH 3/3] KVM: x86/mmu: skip zap maybe-dma-pinned pages for NUMA migration 2023-08-08 14:32 ` Jason Gunthorpe 2023-08-08 23:56 ` Sean Christopherson @ 2023-08-09 2:58 ` Yan Zhao 1 sibling, 0 replies; 16+ messages in thread From: Yan Zhao @ 2023-08-09 2:58 UTC (permalink / raw) To: Jason Gunthorpe Cc: Sean Christopherson, linux-mm, linux-kernel, kvm, pbonzini, mike.kravetz, apopple, rppt, akpm, kevin.tian On Tue, Aug 08, 2023 at 11:32:37AM -0300, Jason Gunthorpe wrote: .... > > This really needs to be fixed in the primary MMU and not require any direct > > involvement from secondary MMUs, e.g. the mmu_notifier invalidation itself needs > > to be skipped. > > This likely has the same issue you just described, we don't know if it > can be skipped until we iterate over the PTEs and by then it is too > late to invoke the notifier. Maybe some kind of abort and restart The problem is that KVM currently performs the zap in handler of .invalidate_range_start(), so before abort in mm, KVM has done the zap in secondary MMU. Or, could we move the zap in KVM side to handler of .invalidate_range_end() only for MMU_NOTIFY_PROTECTION_VMA and MMU_NOTIFIER_RANGE_NUMA? Then, in mm side, we could do the abort and update the range to contain only successful subrange .invalidate_range_end(). Is that acceptable? > scheme could work? > ^ permalink raw reply [flat|nested] 16+ messages in thread
* Re: [RFC PATCH 3/3] KVM: x86/mmu: skip zap maybe-dma-pinned pages for NUMA migration 2023-08-08 14:26 ` Sean Christopherson 2023-08-08 14:32 ` Jason Gunthorpe @ 2023-08-09 0:29 ` Yan Zhao 1 sibling, 0 replies; 16+ messages in thread From: Yan Zhao @ 2023-08-09 0:29 UTC (permalink / raw) To: Sean Christopherson Cc: Jason Gunthorpe, linux-mm, linux-kernel, kvm, pbonzini, mike.kravetz, apopple, rppt, akpm, kevin.tian On Tue, Aug 08, 2023 at 07:26:07AM -0700, Sean Christopherson wrote: > On Tue, Aug 08, 2023, Jason Gunthorpe wrote: > > On Tue, Aug 08, 2023 at 03:17:02PM +0800, Yan Zhao wrote: > > > @@ -859,6 +860,21 @@ static bool tdp_mmu_zap_leafs(struct kvm *kvm, struct kvm_mmu_page *root, > > > !is_last_spte(iter.old_spte, iter.level)) > > > continue; > > > > > > + if (skip_pinned) { > > > + kvm_pfn_t pfn = spte_to_pfn(iter.old_spte); > > > + struct page *page = kvm_pfn_to_refcounted_page(pfn); > > > + struct folio *folio; > > > + > > > + if (!page) > > > + continue; > > > + > > > + folio = page_folio(page); > > > + > > > + if (folio_test_anon(folio) && PageAnonExclusive(&folio->page) && > > > + folio_maybe_dma_pinned(folio)) > > > + continue; > > > + } > > > + > > > > I don't get it.. > > > > The last patch made it so that the NUMA balancing code doesn't change > > page_maybe_dma_pinned() pages to PROT_NONE > > > > So why doesn't KVM just check if the current and new SPTE are the same > > and refrain from invalidating if nothing changed? > > Because KVM doesn't have visibility into the current and new PTEs when the zapping > occurs. The contract for invalidate_range_start() requires that KVM drop all > references before returning, and so the zapping occurs before change_pte_range() > or change_huge_pmd() have done antyhing. > > > Duplicating the checks here seems very frail to me. > > Yes, this is approach gets a hard NAK from me. IIUC, folio_maybe_dma_pinned() > can yield different results purely based on refcounts, i.e. KVM could skip pages Do you mean the different results of folio_maybe_dma_pinned() and page_maybe_dma_pinned()? I choose to use folio_maybe_dma_pinned() in KVM on purpose because in this .invalidate_range_start() handler in KVM, we may get tail pages of a folio, so it's better to call this folio's version of folio_maybe_dma_pinned(). However, in mm core, i.e. in change_huge_pmd() and change_pte_range(), the "page" it gets is always head page of a folio, so though page_maybe_dma_pinned() is called in it, it actually equals to folio_maybe_dma_pinned(page_folio(page)). So, I think the two sides should yield equal results. On this other hand, if you are concerning about the ref count of page is dynamic, and because KVM and mm core do not check ref count of a page atomically, I think it's still fine. Because, the notification of .invalidate_range_start() with event type MMU_NOTIFY_PROTECTION_VMA only means the corresponding PTE is protected in the primary MMU, it does not mean the page is UNMAPed. In series [1], we can even see that for processes other than KVM, the PROT_NONE in primary MMU for NUMA migration purpose is actually ignored and the underlying PFNs are still accessed. So, could KVM open a door for maybe-dma-pinned pages, and keeps mapping those pages until (1) a invalidate notification other than MMU_NOTIFY_PROTECTION_VMA comes or (2) a invalidate notification with MMU_NOTIFY_PROTECTION_VMA comes again with reduced page ref count? 
[1]: https://lore.kernel.org/all/20230803143208.383663-1-david@redhat.com/ Thanks Yan > that the primary MMU does not, and thus violate the mmu_notifier contract. And > in general, I am steadfastedly against adding any kind of heuristic to KVM's > zapping logic. > > This really needs to be fixed in the primary MMU and not require any direct > involvement from secondary MMUs, e.g. the mmu_notifier invalidation itself needs > to be skipped. > ^ permalink raw reply [flat|nested] 16+ messages in thread
* Re: [RFC PATCH 3/3] KVM: x86/mmu: skip zap maybe-dma-pinned pages for NUMA migration 2023-08-08 7:17 ` [RFC PATCH 3/3] KVM: x86/mmu: skip zap maybe-dma-pinned pages for NUMA migration Yan Zhao 2023-08-08 12:32 ` Jason Gunthorpe @ 2023-08-26 6:39 ` liulongfang 2023-09-04 7:03 ` Yan Zhao 1 sibling, 1 reply; 16+ messages in thread From: liulongfang @ 2023-08-26 6:39 UTC (permalink / raw) To: Yan Zhao, linux-mm, linux-kernel, kvm Cc: pbonzini, seanjc, mike.kravetz, apopple, jgg, rppt, akpm, kevin.tian On 2023/8/8 15:17, Yan Zhao wrote: > Skip zapping pages that're exclusive anonymas and maybe-dma-pinned in TDP > MMU if it's for NUMA migration purpose to save unnecessary zaps and TLB > shootdowns. > > For NUMA balancing, change_pmd_range() will send .invalidate_range_start() > and .invalidate_range_end() pair unconditionally before setting a huge PMD > or PTE to be PROT_NONE. > > No matter whether PROT_NONE is set under change_pmd_range(), NUMA migration > will eventually reject migrating of exclusive anonymas and maybe_dma_pinned > pages in later try_to_migrate_one() phase and restoring the affected huge > PMD or PTE. > > Therefore, if KVM can detect those kind of pages in the zap phase, zap and > TLB shootdowns caused by this kind of protection can be avoided. > > Corner cases like below are still fine. > 1. Auto NUMA balancing selects a PMD range to set PROT_NONE in > change_pmd_range(). > 2. A page is maybe-dma-pinned at the time of sending > .invalidate_range_start() with event type MMU_NOTIFY_PROTECTION_VMA. > ==> so it's not zapped in KVM's secondary MMU. > 3. The page is unpinned after sending .invalidate_range_start(), therefore > is not maybe-dma-pinned and set to PROT_NONE in primary MMU. > 4. For some reason, page fault is triggered in primary MMU and the page > will be found to be suitable for NUMA migration. > 5. try_to_migrate_one() will send .invalidate_range_start() notification > with event type MMU_NOTIFY_CLEAR to KVM, and ===> > KVM will zap the pages in secondary MMU. > 6. The old page will be replaced by a new page in primary MMU. > > If step 4 does not happen, though KVM will keep accessing a page that > might not be on the best NUMA node, it can be fixed by a next round of > step 1 in Auto NUMA balancing as change_pmd_range() will send mmu > notification without checking PROT_NONE is set or not. > > Currently in this patch, for NUMA migration protection purpose, only > exclusive anonymous maybe-dma-pinned pages are skipped. > Can later include other type of pages, e.g., is_zone_device_page() or > PageKsm() if necessary. 
> > Signed-off-by: Yan Zhao <yan.y.zhao@intel.com> > --- > arch/x86/kvm/mmu/mmu.c | 4 ++-- > arch/x86/kvm/mmu/tdp_mmu.c | 26 ++++++++++++++++++++++---- > arch/x86/kvm/mmu/tdp_mmu.h | 4 ++-- > include/linux/kvm_host.h | 1 + > virt/kvm/kvm_main.c | 5 +++++ > 5 files changed, 32 insertions(+), 8 deletions(-) > > diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c > index d72f2b20f430..9dccc25b1389 100644 > --- a/arch/x86/kvm/mmu/mmu.c > +++ b/arch/x86/kvm/mmu/mmu.c > @@ -6307,8 +6307,8 @@ void kvm_zap_gfn_range(struct kvm *kvm, gfn_t gfn_start, gfn_t gfn_end) > > if (tdp_mmu_enabled) { > for (i = 0; i < KVM_ADDRESS_SPACE_NUM; i++) > - flush = kvm_tdp_mmu_zap_leafs(kvm, i, gfn_start, > - gfn_end, true, flush); > + flush = kvm_tdp_mmu_zap_leafs(kvm, i, gfn_start, gfn_end, > + true, flush, false); > } > > if (flush) > diff --git a/arch/x86/kvm/mmu/tdp_mmu.c b/arch/x86/kvm/mmu/tdp_mmu.c > index 6250bd3d20c1..17762b5a2b98 100644 > --- a/arch/x86/kvm/mmu/tdp_mmu.c > +++ b/arch/x86/kvm/mmu/tdp_mmu.c > @@ -838,7 +838,8 @@ bool kvm_tdp_mmu_zap_sp(struct kvm *kvm, struct kvm_mmu_page *sp) > * operation can cause a soft lockup. > */ > static bool tdp_mmu_zap_leafs(struct kvm *kvm, struct kvm_mmu_page *root, > - gfn_t start, gfn_t end, bool can_yield, bool flush) > + gfn_t start, gfn_t end, bool can_yield, bool flush, > + bool skip_pinned) > { > struct tdp_iter iter; > > @@ -859,6 +860,21 @@ static bool tdp_mmu_zap_leafs(struct kvm *kvm, struct kvm_mmu_page *root, > !is_last_spte(iter.old_spte, iter.level)) > continue; > > + if (skip_pinned) { > + kvm_pfn_t pfn = spte_to_pfn(iter.old_spte); > + struct page *page = kvm_pfn_to_refcounted_page(pfn); > + struct folio *folio; > + > + if (!page) > + continue; > + > + folio = page_folio(page); > + > + if (folio_test_anon(folio) && PageAnonExclusive(&folio->page) && > + folio_maybe_dma_pinned(folio)) > + continue; > + } > + > tdp_mmu_iter_set_spte(kvm, &iter, 0); > flush = true; > } > @@ -878,12 +894,13 @@ static bool tdp_mmu_zap_leafs(struct kvm *kvm, struct kvm_mmu_page *root, > * more SPTEs were zapped since the MMU lock was last acquired. 
> */ > bool kvm_tdp_mmu_zap_leafs(struct kvm *kvm, int as_id, gfn_t start, gfn_t end, > - bool can_yield, bool flush) > + bool can_yield, bool flush, bool skip_pinned) > { > struct kvm_mmu_page *root; > > for_each_tdp_mmu_root_yield_safe(kvm, root, as_id) > - flush = tdp_mmu_zap_leafs(kvm, root, start, end, can_yield, flush); > + flush = tdp_mmu_zap_leafs(kvm, root, start, end, can_yield, flush, > + skip_pinned); > > return flush; > } > @@ -1147,7 +1164,8 @@ bool kvm_tdp_mmu_unmap_gfn_range(struct kvm *kvm, struct kvm_gfn_range *range, > bool flush) > { > return kvm_tdp_mmu_zap_leafs(kvm, range->slot->as_id, range->start, > - range->end, range->may_block, flush); > + range->end, range->may_block, flush, > + range->skip_pinned); > } > > typedef bool (*tdp_handler_t)(struct kvm *kvm, struct tdp_iter *iter, > diff --git a/arch/x86/kvm/mmu/tdp_mmu.h b/arch/x86/kvm/mmu/tdp_mmu.h > index 0a63b1afabd3..2a9de44bc5c3 100644 > --- a/arch/x86/kvm/mmu/tdp_mmu.h > +++ b/arch/x86/kvm/mmu/tdp_mmu.h > @@ -20,8 +20,8 @@ __must_check static inline bool kvm_tdp_mmu_get_root(struct kvm_mmu_page *root) > void kvm_tdp_mmu_put_root(struct kvm *kvm, struct kvm_mmu_page *root, > bool shared); > > -bool kvm_tdp_mmu_zap_leafs(struct kvm *kvm, int as_id, gfn_t start, > - gfn_t end, bool can_yield, bool flush); > +bool kvm_tdp_mmu_zap_leafs(struct kvm *kvm, int as_id, gfn_t start, gfn_t end, > + bool can_yield, bool flush, bool skip_pinned); > bool kvm_tdp_mmu_zap_sp(struct kvm *kvm, struct kvm_mmu_page *sp); > void kvm_tdp_mmu_zap_all(struct kvm *kvm); > void kvm_tdp_mmu_invalidate_all_roots(struct kvm *kvm); > diff --git a/include/linux/kvm_host.h b/include/linux/kvm_host.h > index 9125d0ab642d..f883d6b59545 100644 > --- a/include/linux/kvm_host.h > +++ b/include/linux/kvm_host.h > @@ -266,6 +266,7 @@ struct kvm_gfn_range { > gfn_t end; > union kvm_mmu_notifier_arg arg; > bool may_block; > + bool skip_pinned; > }; > bool kvm_unmap_gfn_range(struct kvm *kvm, struct kvm_gfn_range *range); > bool kvm_age_gfn(struct kvm *kvm, struct kvm_gfn_range *range); > diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c > index f84ef9399aee..1202c1daa568 100644 > --- a/virt/kvm/kvm_main.c > +++ b/virt/kvm/kvm_main.c > @@ -532,6 +532,7 @@ struct kvm_hva_range { > on_unlock_fn_t on_unlock; > bool flush_on_ret; > bool may_block; > + bool skip_pinned; > }; > > /* > @@ -595,6 +596,7 @@ static __always_inline int __kvm_handle_hva_range(struct kvm *kvm, > */ > gfn_range.arg = range->arg; > gfn_range.may_block = range->may_block; > + gfn_range.skip_pinned = range->skip_pinned; > > /* > * {gfn(page) | page intersects with [hva_start, hva_end)} = > @@ -754,6 +756,9 @@ static int kvm_mmu_notifier_invalidate_range_start(struct mmu_notifier *mn, > .on_unlock = kvm_arch_guest_memory_reclaimed, > .flush_on_ret = true, > .may_block = mmu_notifier_range_blockable(range), > + .skip_pinned = test_bit(MMF_HAS_PINNED, &range->mm->flags) && > + (range->event == MMU_NOTIFY_PROTECTION_VMA) && > + (range->flags & MMU_NOTIFIER_RANGE_NUMA), > }; > > trace_kvm_unmap_hva_range(range->start, range->end); > I have a question. The numa id of the cpu can be reconfigured in the VM. Will the page table migration operations initiated by the numa balance in the VM and the numa balance in the host conflict with each other after this setting? Thanks Longfang. ^ permalink raw reply [flat|nested] 16+ messages in thread
* Re: [RFC PATCH 3/3] KVM: x86/mmu: skip zap maybe-dma-pinned pages for NUMA migration 2023-08-26 6:39 ` liulongfang @ 2023-09-04 7:03 ` Yan Zhao 0 siblings, 0 replies; 16+ messages in thread From: Yan Zhao @ 2023-09-04 7:03 UTC (permalink / raw) To: liulongfang Cc: linux-mm, linux-kernel, kvm, pbonzini, seanjc, mike.kravetz, apopple, jgg, rppt, akpm, kevin.tian > I have a question. The numa id of the cpu can be reconfigured in the VM. Are you talking about vNUMA and the numa balancing in guest? > Will the page table migration operations initiated by the numa balance in the > VM and the numa balance in the host conflict with each other after this setting? There's no page table migration in host. Only page table is modified and page is migrated in a single host. Live migration also doesn't copy the secondary page tables (i.e. EPT/NPT) to the target. Instead, it's rebuilt in the target side. So, I don't quite understand your question here. ^ permalink raw reply [flat|nested] 16+ messages in thread