* [RFC PATCH 0/3] Reduce NUMA balance caused TLB-shootdowns in a VM
@ 2023-08-08 7:13 Yan Zhao
2023-08-08 7:14 ` [RFC PATCH 1/3] mm/mmu_notifier: introduce a new mmu notifier flag MMU_NOTIFIER_RANGE_NUMA Yan Zhao
` (2 more replies)
0 siblings, 3 replies; 16+ messages in thread
From: Yan Zhao @ 2023-08-08 7:13 UTC (permalink / raw)
To: linux-mm, linux-kernel, kvm
Cc: pbonzini, seanjc, mike.kravetz, apopple, jgg, rppt, akpm,
kevin.tian, Yan Zhao
This is an RFC series that tries to fix the unnecessary NUMA protection
and TLB shootdowns seen in VMs with assigned devices or VFIO mediated
devices during NUMA balancing.
For VMs with assigned devices or VFIO mediated devices, all or part of
the guest memory is pinned long-term.
Auto NUMA balancing periodically selects VMAs of a process and changes
their protection to PROT_NONE, even though some or all pages in the
selected ranges are long-term pinned for DMA, which is the case for VMs
with assigned devices or VFIO mediated devices.
Though this causes no correctness problem, because NUMA migration will
ultimately reject migrating such pages and restore the PROT_NONE PTEs,
it makes KVM's secondary MMU get zapped periodically, with identical
SPTEs faulted back in afterwards, wasting CPU cycles and generating
unnecessary TLB shootdowns.
This series first introduces a new flag, MMU_NOTIFIER_RANGE_NUMA, in
patch 1 to accompany the mmu notifier event type
MMU_NOTIFY_PROTECTION_VMA, so that a subscriber of the mmu notifier
(e.g. KVM) can tell that an invalidation event is sent specifically for
NUMA migration.
Then, with patch 3, while zapping its secondary MMU, KVM can check for
long-term pinned pages and keep accessing them even though they are
PROT_NONE-mapped in the primary MMU.
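Taken together, patches 1 and 3 let KVM derive that decision from the
notification itself. As a rough sketch (not code taken from the diffs;
the helper name is made up for illustration), the condition patch 3
computes its skip_pinned hint from looks like this:

	/*
	 * Sketch only: recognize an invalidation sent purely for
	 * NUMA-balancing protection on an mm that may hold pinned pages.
	 * Patch 3 open-codes this when filling kvm_hva_range.skip_pinned.
	 */
	static bool range_may_skip_pinned(struct mmu_notifier_range *range)
	{
		return range->event == MMU_NOTIFY_PROTECTION_VMA &&
		       (range->flags & MMU_NOTIFIER_RANGE_NUMA) &&
		       test_bit(MMF_HAS_PINNED, &range->mm->flags);
	}
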
Patch 2 skips setting PROT_NONE on long-term pinned pages in the
primary MMU, to avoid the page faults introduced by NUMA protection and
the later restoration of the old huge PMDs/PTEs in the primary MMU.
Since change_pmd_range() sends .invalidate_range_start() before walking
down and checking which pages to skip, patches 1 and 3 are still
required for KVM.
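In both the primary MMU (patch 2) and KVM (patch 3), "long-term pinned"
is approximated by the same page-level test. Pulled out as a standalone
helper purely for readability (a sketch; the helper name is made up,
and the actual diffs open-code the check in change_pte_range(),
change_huge_pmd() and tdp_mmu_zap_leafs()):

	/* Sketch only: the per-page test applied by patches 2 and 3. */
	static bool anon_exclusive_maybe_pinned(struct page *page)
	{
		struct folio *folio = page_folio(page);

		return folio_test_anon(folio) &&
		       PageAnonExclusive(&folio->page) &&
		       folio_maybe_dma_pinned(folio);
	}
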
In my test environment, with this series applied, during boot-up of a
VM with assigned devices, the TLB shootdown count in KVM caused by the
.invalidate_range_start() notifications sent for NUMA balancing from
change_pmd_range() is reduced from 9000+ on average to 0.
Yan Zhao (3):
mm/mmu_notifier: introduce a new mmu notifier flag
MMU_NOTIFIER_RANGE_NUMA
mm: don't set PROT_NONE to maybe-dma-pinned pages for NUMA-migrate
purpose
KVM: x86/mmu: skip zap maybe-dma-pinned pages for NUMA migration
arch/x86/kvm/mmu/mmu.c | 4 ++--
arch/x86/kvm/mmu/tdp_mmu.c | 26 ++++++++++++++++++++++----
arch/x86/kvm/mmu/tdp_mmu.h | 4 ++--
include/linux/kvm_host.h | 1 +
include/linux/mmu_notifier.h | 1 +
mm/huge_memory.c | 5 +++++
mm/mprotect.c | 9 ++++++++-
virt/kvm/kvm_main.c | 5 +++++
8 files changed, 46 insertions(+), 9 deletions(-)
base-commit: fdf0eaf11452d72945af31804e2a1048ee1b574c
--
2.17.1
^ permalink raw reply [flat|nested] 16+ messages in thread* [RFC PATCH 1/3] mm/mmu_notifier: introduce a new mmu notifier flag MMU_NOTIFIER_RANGE_NUMA 2023-08-08 7:13 [RFC PATCH 0/3] Reduce NUMA balance caused TLB-shootdowns in a VM Yan Zhao @ 2023-08-08 7:14 ` Yan Zhao 2023-08-08 7:15 ` [RFC PATCH 2/3] mm: don't set PROT_NONE to maybe-dma-pinned pages for NUMA-migrate purpose Yan Zhao 2023-08-08 7:17 ` [RFC PATCH 3/3] KVM: x86/mmu: skip zap maybe-dma-pinned pages for NUMA migration Yan Zhao 2 siblings, 0 replies; 16+ messages in thread From: Yan Zhao @ 2023-08-08 7:14 UTC (permalink / raw) To: linux-mm, linux-kernel, kvm Cc: pbonzini, seanjc, mike.kravetz, apopple, jgg, rppt, akpm, kevin.tian, Yan Zhao Introduce a new mmu notifier flag MMU_NOTIFIER_RANGE_NUMA to indicate the notification of MMU_NOTIFY_PROTECTION_VMA is for NUMA balance purpose specifically. So that, the subscriber of mmu notifier, like KVM, can do some performance optimization according to this accurate information. Signed-off-by: Yan Zhao <yan.y.zhao@intel.com> --- include/linux/mmu_notifier.h | 1 + mm/mprotect.c | 4 +++- 2 files changed, 4 insertions(+), 1 deletion(-) diff --git a/include/linux/mmu_notifier.h b/include/linux/mmu_notifier.h index 64a3e051c3c4..a6dc829a4bce 100644 --- a/include/linux/mmu_notifier.h +++ b/include/linux/mmu_notifier.h @@ -60,6 +60,7 @@ enum mmu_notifier_event { }; #define MMU_NOTIFIER_RANGE_BLOCKABLE (1 << 0) +#define MMU_NOTIFIER_RANGE_NUMA (1 << 1) struct mmu_notifier_ops { /* diff --git a/mm/mprotect.c b/mm/mprotect.c index 6f658d483704..cb99a7d66467 100644 --- a/mm/mprotect.c +++ b/mm/mprotect.c @@ -381,7 +381,9 @@ static inline long change_pmd_range(struct mmu_gather *tlb, /* invoke the mmu notifier if the pmd is populated */ if (!range.start) { mmu_notifier_range_init(&range, - MMU_NOTIFY_PROTECTION_VMA, 0, + MMU_NOTIFY_PROTECTION_VMA, + cp_flags & MM_CP_PROT_NUMA ? + MMU_NOTIFIER_RANGE_NUMA : 0, vma->vm_mm, addr, end); mmu_notifier_invalidate_range_start(&range); } -- 2.17.1 ^ permalink raw reply [flat|nested] 16+ messages in thread
* [RFC PATCH 2/3] mm: don't set PROT_NONE to maybe-dma-pinned pages for NUMA-migrate purpose 2023-08-08 7:13 [RFC PATCH 0/3] Reduce NUMA balance caused TLB-shootdowns in a VM Yan Zhao 2023-08-08 7:14 ` [RFC PATCH 1/3] mm/mmu_notifier: introduce a new mmu notifier flag MMU_NOTIFIER_RANGE_NUMA Yan Zhao @ 2023-08-08 7:15 ` Yan Zhao 2023-08-08 7:17 ` [RFC PATCH 3/3] KVM: x86/mmu: skip zap maybe-dma-pinned pages for NUMA migration Yan Zhao 2 siblings, 0 replies; 16+ messages in thread From: Yan Zhao @ 2023-08-08 7:15 UTC (permalink / raw) To: linux-mm, linux-kernel, kvm Cc: pbonzini, seanjc, mike.kravetz, apopple, jgg, rppt, akpm, kevin.tian, Yan Zhao Don't set PROT_NONE for exclusive anonymas and maybe-dma-pinned pages for NUMA migration purpose. For exclusive anonymas and page_maybe_dma_pinned() pages, NUMA-migration will eventually drop migration of those pages in try_to_migrate_one(). (i.e. after -EBUSY returned in page_try_share_anon_rmap()). So, skip setting PROT_NONE to those kind of pages earlier in change_protection_range() phase to avoid later futile page faults, detections, and restoration to original PTEs/PMDs. Signed-off-by: Yan Zhao <yan.y.zhao@intel.com> --- mm/huge_memory.c | 5 +++++ mm/mprotect.c | 5 +++++ 2 files changed, 10 insertions(+) diff --git a/mm/huge_memory.c b/mm/huge_memory.c index eb3678360b97..a71cf686e3b2 100644 --- a/mm/huge_memory.c +++ b/mm/huge_memory.c @@ -1875,6 +1875,11 @@ int change_huge_pmd(struct mmu_gather *tlb, struct vm_area_struct *vma, goto unlock; page = pmd_page(*pmd); + + if (PageAnon(page) && PageAnonExclusive(page) && + page_maybe_dma_pinned(page)) + goto unlock; + toptier = node_is_toptier(page_to_nid(page)); /* * Skip scanning top tier node if normal numa diff --git a/mm/mprotect.c b/mm/mprotect.c index cb99a7d66467..a1f63df34b86 100644 --- a/mm/mprotect.c +++ b/mm/mprotect.c @@ -146,6 +146,11 @@ static long change_pte_range(struct mmu_gather *tlb, nid = page_to_nid(page); if (target_node == nid) continue; + + if (PageAnon(page) && PageAnonExclusive(page) && + page_maybe_dma_pinned(page)) + continue; + toptier = node_is_toptier(nid); /* -- 2.17.1 ^ permalink raw reply [flat|nested] 16+ messages in thread
* [RFC PATCH 3/3] KVM: x86/mmu: skip zap maybe-dma-pinned pages for NUMA migration 2023-08-08 7:13 [RFC PATCH 0/3] Reduce NUMA balance caused TLB-shootdowns in a VM Yan Zhao 2023-08-08 7:14 ` [RFC PATCH 1/3] mm/mmu_notifier: introduce a new mmu notifier flag MMU_NOTIFIER_RANGE_NUMA Yan Zhao 2023-08-08 7:15 ` [RFC PATCH 2/3] mm: don't set PROT_NONE to maybe-dma-pinned pages for NUMA-migrate purpose Yan Zhao @ 2023-08-08 7:17 ` Yan Zhao 2023-08-08 12:32 ` Jason Gunthorpe 2023-08-26 6:39 ` liulongfang 2 siblings, 2 replies; 16+ messages in thread From: Yan Zhao @ 2023-08-08 7:17 UTC (permalink / raw) To: linux-mm, linux-kernel, kvm Cc: pbonzini, seanjc, mike.kravetz, apopple, jgg, rppt, akpm, kevin.tian, Yan Zhao Skip zapping pages that're exclusive anonymas and maybe-dma-pinned in TDP MMU if it's for NUMA migration purpose to save unnecessary zaps and TLB shootdowns. For NUMA balancing, change_pmd_range() will send .invalidate_range_start() and .invalidate_range_end() pair unconditionally before setting a huge PMD or PTE to be PROT_NONE. No matter whether PROT_NONE is set under change_pmd_range(), NUMA migration will eventually reject migrating of exclusive anonymas and maybe_dma_pinned pages in later try_to_migrate_one() phase and restoring the affected huge PMD or PTE. Therefore, if KVM can detect those kind of pages in the zap phase, zap and TLB shootdowns caused by this kind of protection can be avoided. Corner cases like below are still fine. 1. Auto NUMA balancing selects a PMD range to set PROT_NONE in change_pmd_range(). 2. A page is maybe-dma-pinned at the time of sending .invalidate_range_start() with event type MMU_NOTIFY_PROTECTION_VMA. ==> so it's not zapped in KVM's secondary MMU. 3. The page is unpinned after sending .invalidate_range_start(), therefore is not maybe-dma-pinned and set to PROT_NONE in primary MMU. 4. For some reason, page fault is triggered in primary MMU and the page will be found to be suitable for NUMA migration. 5. try_to_migrate_one() will send .invalidate_range_start() notification with event type MMU_NOTIFY_CLEAR to KVM, and ===> KVM will zap the pages in secondary MMU. 6. The old page will be replaced by a new page in primary MMU. If step 4 does not happen, though KVM will keep accessing a page that might not be on the best NUMA node, it can be fixed by a next round of step 1 in Auto NUMA balancing as change_pmd_range() will send mmu notification without checking PROT_NONE is set or not. Currently in this patch, for NUMA migration protection purpose, only exclusive anonymous maybe-dma-pinned pages are skipped. Can later include other type of pages, e.g., is_zone_device_page() or PageKsm() if necessary. 
Signed-off-by: Yan Zhao <yan.y.zhao@intel.com> --- arch/x86/kvm/mmu/mmu.c | 4 ++-- arch/x86/kvm/mmu/tdp_mmu.c | 26 ++++++++++++++++++++++---- arch/x86/kvm/mmu/tdp_mmu.h | 4 ++-- include/linux/kvm_host.h | 1 + virt/kvm/kvm_main.c | 5 +++++ 5 files changed, 32 insertions(+), 8 deletions(-) diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c index d72f2b20f430..9dccc25b1389 100644 --- a/arch/x86/kvm/mmu/mmu.c +++ b/arch/x86/kvm/mmu/mmu.c @@ -6307,8 +6307,8 @@ void kvm_zap_gfn_range(struct kvm *kvm, gfn_t gfn_start, gfn_t gfn_end) if (tdp_mmu_enabled) { for (i = 0; i < KVM_ADDRESS_SPACE_NUM; i++) - flush = kvm_tdp_mmu_zap_leafs(kvm, i, gfn_start, - gfn_end, true, flush); + flush = kvm_tdp_mmu_zap_leafs(kvm, i, gfn_start, gfn_end, + true, flush, false); } if (flush) diff --git a/arch/x86/kvm/mmu/tdp_mmu.c b/arch/x86/kvm/mmu/tdp_mmu.c index 6250bd3d20c1..17762b5a2b98 100644 --- a/arch/x86/kvm/mmu/tdp_mmu.c +++ b/arch/x86/kvm/mmu/tdp_mmu.c @@ -838,7 +838,8 @@ bool kvm_tdp_mmu_zap_sp(struct kvm *kvm, struct kvm_mmu_page *sp) * operation can cause a soft lockup. */ static bool tdp_mmu_zap_leafs(struct kvm *kvm, struct kvm_mmu_page *root, - gfn_t start, gfn_t end, bool can_yield, bool flush) + gfn_t start, gfn_t end, bool can_yield, bool flush, + bool skip_pinned) { struct tdp_iter iter; @@ -859,6 +860,21 @@ static bool tdp_mmu_zap_leafs(struct kvm *kvm, struct kvm_mmu_page *root, !is_last_spte(iter.old_spte, iter.level)) continue; + if (skip_pinned) { + kvm_pfn_t pfn = spte_to_pfn(iter.old_spte); + struct page *page = kvm_pfn_to_refcounted_page(pfn); + struct folio *folio; + + if (!page) + continue; + + folio = page_folio(page); + + if (folio_test_anon(folio) && PageAnonExclusive(&folio->page) && + folio_maybe_dma_pinned(folio)) + continue; + } + tdp_mmu_iter_set_spte(kvm, &iter, 0); flush = true; } @@ -878,12 +894,13 @@ static bool tdp_mmu_zap_leafs(struct kvm *kvm, struct kvm_mmu_page *root, * more SPTEs were zapped since the MMU lock was last acquired. 
*/ bool kvm_tdp_mmu_zap_leafs(struct kvm *kvm, int as_id, gfn_t start, gfn_t end, - bool can_yield, bool flush) + bool can_yield, bool flush, bool skip_pinned) { struct kvm_mmu_page *root; for_each_tdp_mmu_root_yield_safe(kvm, root, as_id) - flush = tdp_mmu_zap_leafs(kvm, root, start, end, can_yield, flush); + flush = tdp_mmu_zap_leafs(kvm, root, start, end, can_yield, flush, + skip_pinned); return flush; } @@ -1147,7 +1164,8 @@ bool kvm_tdp_mmu_unmap_gfn_range(struct kvm *kvm, struct kvm_gfn_range *range, bool flush) { return kvm_tdp_mmu_zap_leafs(kvm, range->slot->as_id, range->start, - range->end, range->may_block, flush); + range->end, range->may_block, flush, + range->skip_pinned); } typedef bool (*tdp_handler_t)(struct kvm *kvm, struct tdp_iter *iter, diff --git a/arch/x86/kvm/mmu/tdp_mmu.h b/arch/x86/kvm/mmu/tdp_mmu.h index 0a63b1afabd3..2a9de44bc5c3 100644 --- a/arch/x86/kvm/mmu/tdp_mmu.h +++ b/arch/x86/kvm/mmu/tdp_mmu.h @@ -20,8 +20,8 @@ __must_check static inline bool kvm_tdp_mmu_get_root(struct kvm_mmu_page *root) void kvm_tdp_mmu_put_root(struct kvm *kvm, struct kvm_mmu_page *root, bool shared); -bool kvm_tdp_mmu_zap_leafs(struct kvm *kvm, int as_id, gfn_t start, - gfn_t end, bool can_yield, bool flush); +bool kvm_tdp_mmu_zap_leafs(struct kvm *kvm, int as_id, gfn_t start, gfn_t end, + bool can_yield, bool flush, bool skip_pinned); bool kvm_tdp_mmu_zap_sp(struct kvm *kvm, struct kvm_mmu_page *sp); void kvm_tdp_mmu_zap_all(struct kvm *kvm); void kvm_tdp_mmu_invalidate_all_roots(struct kvm *kvm); diff --git a/include/linux/kvm_host.h b/include/linux/kvm_host.h index 9125d0ab642d..f883d6b59545 100644 --- a/include/linux/kvm_host.h +++ b/include/linux/kvm_host.h @@ -266,6 +266,7 @@ struct kvm_gfn_range { gfn_t end; union kvm_mmu_notifier_arg arg; bool may_block; + bool skip_pinned; }; bool kvm_unmap_gfn_range(struct kvm *kvm, struct kvm_gfn_range *range); bool kvm_age_gfn(struct kvm *kvm, struct kvm_gfn_range *range); diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c index f84ef9399aee..1202c1daa568 100644 --- a/virt/kvm/kvm_main.c +++ b/virt/kvm/kvm_main.c @@ -532,6 +532,7 @@ struct kvm_hva_range { on_unlock_fn_t on_unlock; bool flush_on_ret; bool may_block; + bool skip_pinned; }; /* @@ -595,6 +596,7 @@ static __always_inline int __kvm_handle_hva_range(struct kvm *kvm, */ gfn_range.arg = range->arg; gfn_range.may_block = range->may_block; + gfn_range.skip_pinned = range->skip_pinned; /* * {gfn(page) | page intersects with [hva_start, hva_end)} = @@ -754,6 +756,9 @@ static int kvm_mmu_notifier_invalidate_range_start(struct mmu_notifier *mn, .on_unlock = kvm_arch_guest_memory_reclaimed, .flush_on_ret = true, .may_block = mmu_notifier_range_blockable(range), + .skip_pinned = test_bit(MMF_HAS_PINNED, &range->mm->flags) && + (range->event == MMU_NOTIFY_PROTECTION_VMA) && + (range->flags & MMU_NOTIFIER_RANGE_NUMA), }; trace_kvm_unmap_hva_range(range->start, range->end); -- 2.17.1 ^ permalink raw reply [flat|nested] 16+ messages in thread
* Re: [RFC PATCH 3/3] KVM: x86/mmu: skip zap maybe-dma-pinned pages for NUMA migration 2023-08-08 7:17 ` [RFC PATCH 3/3] KVM: x86/mmu: skip zap maybe-dma-pinned pages for NUMA migration Yan Zhao @ 2023-08-08 12:32 ` Jason Gunthorpe 2023-08-08 14:26 ` Sean Christopherson 2023-08-26 6:39 ` liulongfang 1 sibling, 1 reply; 16+ messages in thread From: Jason Gunthorpe @ 2023-08-08 12:32 UTC (permalink / raw) To: Yan Zhao Cc: linux-mm, linux-kernel, kvm, pbonzini, seanjc, mike.kravetz, apopple, rppt, akpm, kevin.tian On Tue, Aug 08, 2023 at 03:17:02PM +0800, Yan Zhao wrote: > Skip zapping pages that're exclusive anonymas and maybe-dma-pinned in TDP > MMU if it's for NUMA migration purpose to save unnecessary zaps and TLB > shootdowns. > > For NUMA balancing, change_pmd_range() will send .invalidate_range_start() > and .invalidate_range_end() pair unconditionally before setting a huge PMD > or PTE to be PROT_NONE. > > No matter whether PROT_NONE is set under change_pmd_range(), NUMA migration > will eventually reject migrating of exclusive anonymas and maybe_dma_pinned > pages in later try_to_migrate_one() phase and restoring the affected huge > PMD or PTE. > > Therefore, if KVM can detect those kind of pages in the zap phase, zap and > TLB shootdowns caused by this kind of protection can be avoided. > > Corner cases like below are still fine. > 1. Auto NUMA balancing selects a PMD range to set PROT_NONE in > change_pmd_range(). > 2. A page is maybe-dma-pinned at the time of sending > .invalidate_range_start() with event type MMU_NOTIFY_PROTECTION_VMA. > ==> so it's not zapped in KVM's secondary MMU. > 3. The page is unpinned after sending .invalidate_range_start(), therefore > is not maybe-dma-pinned and set to PROT_NONE in primary MMU. > 4. For some reason, page fault is triggered in primary MMU and the page > will be found to be suitable for NUMA migration. > 5. try_to_migrate_one() will send .invalidate_range_start() notification > with event type MMU_NOTIFY_CLEAR to KVM, and ===> > KVM will zap the pages in secondary MMU. > 6. The old page will be replaced by a new page in primary MMU. > > If step 4 does not happen, though KVM will keep accessing a page that > might not be on the best NUMA node, it can be fixed by a next round of > step 1 in Auto NUMA balancing as change_pmd_range() will send mmu > notification without checking PROT_NONE is set or not. > > Currently in this patch, for NUMA migration protection purpose, only > exclusive anonymous maybe-dma-pinned pages are skipped. > Can later include other type of pages, e.g., is_zone_device_page() or > PageKsm() if necessary. 
> > Signed-off-by: Yan Zhao <yan.y.zhao@intel.com> > --- > arch/x86/kvm/mmu/mmu.c | 4 ++-- > arch/x86/kvm/mmu/tdp_mmu.c | 26 ++++++++++++++++++++++---- > arch/x86/kvm/mmu/tdp_mmu.h | 4 ++-- > include/linux/kvm_host.h | 1 + > virt/kvm/kvm_main.c | 5 +++++ > 5 files changed, 32 insertions(+), 8 deletions(-) > > diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c > index d72f2b20f430..9dccc25b1389 100644 > --- a/arch/x86/kvm/mmu/mmu.c > +++ b/arch/x86/kvm/mmu/mmu.c > @@ -6307,8 +6307,8 @@ void kvm_zap_gfn_range(struct kvm *kvm, gfn_t gfn_start, gfn_t gfn_end) > > if (tdp_mmu_enabled) { > for (i = 0; i < KVM_ADDRESS_SPACE_NUM; i++) > - flush = kvm_tdp_mmu_zap_leafs(kvm, i, gfn_start, > - gfn_end, true, flush); > + flush = kvm_tdp_mmu_zap_leafs(kvm, i, gfn_start, gfn_end, > + true, flush, false); > } > > if (flush) > diff --git a/arch/x86/kvm/mmu/tdp_mmu.c b/arch/x86/kvm/mmu/tdp_mmu.c > index 6250bd3d20c1..17762b5a2b98 100644 > --- a/arch/x86/kvm/mmu/tdp_mmu.c > +++ b/arch/x86/kvm/mmu/tdp_mmu.c > @@ -838,7 +838,8 @@ bool kvm_tdp_mmu_zap_sp(struct kvm *kvm, struct kvm_mmu_page *sp) > * operation can cause a soft lockup. > */ > static bool tdp_mmu_zap_leafs(struct kvm *kvm, struct kvm_mmu_page *root, > - gfn_t start, gfn_t end, bool can_yield, bool flush) > + gfn_t start, gfn_t end, bool can_yield, bool flush, > + bool skip_pinned) > { > struct tdp_iter iter; > > @@ -859,6 +860,21 @@ static bool tdp_mmu_zap_leafs(struct kvm *kvm, struct kvm_mmu_page *root, > !is_last_spte(iter.old_spte, iter.level)) > continue; > > + if (skip_pinned) { > + kvm_pfn_t pfn = spte_to_pfn(iter.old_spte); > + struct page *page = kvm_pfn_to_refcounted_page(pfn); > + struct folio *folio; > + > + if (!page) > + continue; > + > + folio = page_folio(page); > + > + if (folio_test_anon(folio) && PageAnonExclusive(&folio->page) && > + folio_maybe_dma_pinned(folio)) > + continue; > + } > + I don't get it.. The last patch made it so that the NUMA balancing code doesn't change page_maybe_dma_pinned() pages to PROT_NONE So why doesn't KVM just check if the current and new SPTE are the same and refrain from invalidating if nothing changed? Duplicating the checks here seems very frail to me. If you did that then you probably don't need to change the notifiers. Jason ^ permalink raw reply [flat|nested] 16+ messages in thread
* Re: [RFC PATCH 3/3] KVM: x86/mmu: skip zap maybe-dma-pinned pages for NUMA migration 2023-08-08 12:32 ` Jason Gunthorpe @ 2023-08-08 14:26 ` Sean Christopherson 2023-08-08 14:32 ` Jason Gunthorpe 2023-08-09 0:29 ` Yan Zhao 0 siblings, 2 replies; 16+ messages in thread From: Sean Christopherson @ 2023-08-08 14:26 UTC (permalink / raw) To: Jason Gunthorpe Cc: Yan Zhao, linux-mm, linux-kernel, kvm, pbonzini, mike.kravetz, apopple, rppt, akpm, kevin.tian On Tue, Aug 08, 2023, Jason Gunthorpe wrote: > On Tue, Aug 08, 2023 at 03:17:02PM +0800, Yan Zhao wrote: > > @@ -859,6 +860,21 @@ static bool tdp_mmu_zap_leafs(struct kvm *kvm, struct kvm_mmu_page *root, > > !is_last_spte(iter.old_spte, iter.level)) > > continue; > > > > + if (skip_pinned) { > > + kvm_pfn_t pfn = spte_to_pfn(iter.old_spte); > > + struct page *page = kvm_pfn_to_refcounted_page(pfn); > > + struct folio *folio; > > + > > + if (!page) > > + continue; > > + > > + folio = page_folio(page); > > + > > + if (folio_test_anon(folio) && PageAnonExclusive(&folio->page) && > > + folio_maybe_dma_pinned(folio)) > > + continue; > > + } > > + > > I don't get it.. > > The last patch made it so that the NUMA balancing code doesn't change > page_maybe_dma_pinned() pages to PROT_NONE > > So why doesn't KVM just check if the current and new SPTE are the same > and refrain from invalidating if nothing changed? Because KVM doesn't have visibility into the current and new PTEs when the zapping occurs. The contract for invalidate_range_start() requires that KVM drop all references before returning, and so the zapping occurs before change_pte_range() or change_huge_pmd() have done antyhing. > Duplicating the checks here seems very frail to me. Yes, this is approach gets a hard NAK from me. IIUC, folio_maybe_dma_pinned() can yield different results purely based on refcounts, i.e. KVM could skip pages that the primary MMU does not, and thus violate the mmu_notifier contract. And in general, I am steadfastedly against adding any kind of heuristic to KVM's zapping logic. This really needs to be fixed in the primary MMU and not require any direct involvement from secondary MMUs, e.g. the mmu_notifier invalidation itself needs to be skipped. ^ permalink raw reply [flat|nested] 16+ messages in thread
* Re: [RFC PATCH 3/3] KVM: x86/mmu: skip zap maybe-dma-pinned pages for NUMA migration 2023-08-08 14:26 ` Sean Christopherson @ 2023-08-08 14:32 ` Jason Gunthorpe 2023-08-08 23:56 ` Sean Christopherson 2023-08-09 2:58 ` Yan Zhao 2023-08-09 0:29 ` Yan Zhao 1 sibling, 2 replies; 16+ messages in thread From: Jason Gunthorpe @ 2023-08-08 14:32 UTC (permalink / raw) To: Sean Christopherson Cc: Yan Zhao, linux-mm, linux-kernel, kvm, pbonzini, mike.kravetz, apopple, rppt, akpm, kevin.tian On Tue, Aug 08, 2023 at 07:26:07AM -0700, Sean Christopherson wrote: > On Tue, Aug 08, 2023, Jason Gunthorpe wrote: > > On Tue, Aug 08, 2023 at 03:17:02PM +0800, Yan Zhao wrote: > > > @@ -859,6 +860,21 @@ static bool tdp_mmu_zap_leafs(struct kvm *kvm, struct kvm_mmu_page *root, > > > !is_last_spte(iter.old_spte, iter.level)) > > > continue; > > > > > > + if (skip_pinned) { > > > + kvm_pfn_t pfn = spte_to_pfn(iter.old_spte); > > > + struct page *page = kvm_pfn_to_refcounted_page(pfn); > > > + struct folio *folio; > > > + > > > + if (!page) > > > + continue; > > > + > > > + folio = page_folio(page); > > > + > > > + if (folio_test_anon(folio) && PageAnonExclusive(&folio->page) && > > > + folio_maybe_dma_pinned(folio)) > > > + continue; > > > + } > > > + > > > > I don't get it.. > > > > The last patch made it so that the NUMA balancing code doesn't change > > page_maybe_dma_pinned() pages to PROT_NONE > > > > So why doesn't KVM just check if the current and new SPTE are the same > > and refrain from invalidating if nothing changed? > > Because KVM doesn't have visibility into the current and new PTEs when the zapping > occurs. The contract for invalidate_range_start() requires that KVM drop all > references before returning, and so the zapping occurs before change_pte_range() > or change_huge_pmd() have done antyhing. > > > Duplicating the checks here seems very frail to me. > > Yes, this is approach gets a hard NAK from me. IIUC, folio_maybe_dma_pinned() > can yield different results purely based on refcounts, i.e. KVM could skip pages > that the primary MMU does not, and thus violate the mmu_notifier contract. And > in general, I am steadfastedly against adding any kind of heuristic to KVM's > zapping logic. > > This really needs to be fixed in the primary MMU and not require any direct > involvement from secondary MMUs, e.g. the mmu_notifier invalidation itself needs > to be skipped. This likely has the same issue you just described, we don't know if it can be skipped until we iterate over the PTEs and by then it is too late to invoke the notifier. Maybe some kind of abort and restart scheme could work? Jason ^ permalink raw reply [flat|nested] 16+ messages in thread
* Re: [RFC PATCH 3/3] KVM: x86/mmu: skip zap maybe-dma-pinned pages for NUMA migration 2023-08-08 14:32 ` Jason Gunthorpe @ 2023-08-08 23:56 ` Sean Christopherson 2023-08-09 0:11 ` Yan Zhao 2023-08-09 5:06 ` Yan Zhao 2023-08-09 2:58 ` Yan Zhao 1 sibling, 2 replies; 16+ messages in thread From: Sean Christopherson @ 2023-08-08 23:56 UTC (permalink / raw) To: Jason Gunthorpe Cc: Yan Zhao, linux-mm, linux-kernel, kvm, pbonzini, mike.kravetz, apopple, rppt, akpm, kevin.tian On Tue, Aug 08, 2023, Jason Gunthorpe wrote: > On Tue, Aug 08, 2023 at 07:26:07AM -0700, Sean Christopherson wrote: > > On Tue, Aug 08, 2023, Jason Gunthorpe wrote: > > > On Tue, Aug 08, 2023 at 03:17:02PM +0800, Yan Zhao wrote: > > > > @@ -859,6 +860,21 @@ static bool tdp_mmu_zap_leafs(struct kvm *kvm, struct kvm_mmu_page *root, > > > > !is_last_spte(iter.old_spte, iter.level)) > > > > continue; > > > > > > > > + if (skip_pinned) { > > > > + kvm_pfn_t pfn = spte_to_pfn(iter.old_spte); > > > > + struct page *page = kvm_pfn_to_refcounted_page(pfn); > > > > + struct folio *folio; > > > > + > > > > + if (!page) > > > > + continue; > > > > + > > > > + folio = page_folio(page); > > > > + > > > > + if (folio_test_anon(folio) && PageAnonExclusive(&folio->page) && > > > > + folio_maybe_dma_pinned(folio)) > > > > + continue; > > > > + } > > > > + > > > > > > I don't get it.. > > > > > > The last patch made it so that the NUMA balancing code doesn't change > > > page_maybe_dma_pinned() pages to PROT_NONE > > > > > > So why doesn't KVM just check if the current and new SPTE are the same > > > and refrain from invalidating if nothing changed? > > > > Because KVM doesn't have visibility into the current and new PTEs when the zapping > > occurs. The contract for invalidate_range_start() requires that KVM drop all > > references before returning, and so the zapping occurs before change_pte_range() > > or change_huge_pmd() have done antyhing. > > > > > Duplicating the checks here seems very frail to me. > > > > Yes, this is approach gets a hard NAK from me. IIUC, folio_maybe_dma_pinned() > > can yield different results purely based on refcounts, i.e. KVM could skip pages > > that the primary MMU does not, and thus violate the mmu_notifier contract. And > > in general, I am steadfastedly against adding any kind of heuristic to KVM's > > zapping logic. > > > > This really needs to be fixed in the primary MMU and not require any direct > > involvement from secondary MMUs, e.g. the mmu_notifier invalidation itself needs > > to be skipped. > > This likely has the same issue you just described, we don't know if it > can be skipped until we iterate over the PTEs and by then it is too > late to invoke the notifier. Maybe some kind of abort and restart > scheme could work? Or maybe treat this as a userspace config problem? Pinning DMA pages in a VM, having a fair amount of remote memory, *and* expecting NUMA balancing to do anything useful for that VM seems like a userspace problem. Actually, does NUMA balancing even support this particular scenario? I see this in do_numa_page() /* TODO: handle PTE-mapped THP */ if (PageCompound(page)) goto out_map; and then for PG_anon_exclusive * ... For now, we only expect it to be * set on tail pages for PTE-mapped THP. */ PG_anon_exclusive = PG_mappedtodisk, which IIUC means zapping these pages to do migrate_on-fault will never succeed. Can we just tell userspace to mbind() the pinned region to explicitly exclude the VMA(s) from NUMA balancing? ^ permalink raw reply [flat|nested] 16+ messages in thread
* Re: [RFC PATCH 3/3] KVM: x86/mmu: skip zap maybe-dma-pinned pages for NUMA migration 2023-08-08 23:56 ` Sean Christopherson @ 2023-08-09 0:11 ` Yan Zhao 2023-08-09 11:59 ` Jason Gunthorpe 2023-08-09 5:06 ` Yan Zhao 1 sibling, 1 reply; 16+ messages in thread From: Yan Zhao @ 2023-08-09 0:11 UTC (permalink / raw) To: Sean Christopherson Cc: Jason Gunthorpe, linux-mm, linux-kernel, kvm, pbonzini, mike.kravetz, apopple, rppt, akpm, kevin.tian On Tue, Aug 08, 2023 at 04:56:11PM -0700, Sean Christopherson wrote: > On Tue, Aug 08, 2023, Jason Gunthorpe wrote: > > On Tue, Aug 08, 2023 at 07:26:07AM -0700, Sean Christopherson wrote: > > > On Tue, Aug 08, 2023, Jason Gunthorpe wrote: > > > > On Tue, Aug 08, 2023 at 03:17:02PM +0800, Yan Zhao wrote: > > > > > @@ -859,6 +860,21 @@ static bool tdp_mmu_zap_leafs(struct kvm *kvm, struct kvm_mmu_page *root, > > > > > !is_last_spte(iter.old_spte, iter.level)) > > > > > continue; > > > > > > > > > > + if (skip_pinned) { > > > > > + kvm_pfn_t pfn = spte_to_pfn(iter.old_spte); > > > > > + struct page *page = kvm_pfn_to_refcounted_page(pfn); > > > > > + struct folio *folio; > > > > > + > > > > > + if (!page) > > > > > + continue; > > > > > + > > > > > + folio = page_folio(page); > > > > > + > > > > > + if (folio_test_anon(folio) && PageAnonExclusive(&folio->page) && > > > > > + folio_maybe_dma_pinned(folio)) > > > > > + continue; > > > > > + } > > > > > + > > > > > > > > I don't get it.. > > > > > > > > The last patch made it so that the NUMA balancing code doesn't change > > > > page_maybe_dma_pinned() pages to PROT_NONE > > > > > > > > So why doesn't KVM just check if the current and new SPTE are the same > > > > and refrain from invalidating if nothing changed? > > > > > > Because KVM doesn't have visibility into the current and new PTEs when the zapping > > > occurs. The contract for invalidate_range_start() requires that KVM drop all > > > references before returning, and so the zapping occurs before change_pte_range() > > > or change_huge_pmd() have done antyhing. > > > > > > > Duplicating the checks here seems very frail to me. > > > > > > Yes, this is approach gets a hard NAK from me. IIUC, folio_maybe_dma_pinned() > > > can yield different results purely based on refcounts, i.e. KVM could skip pages > > > that the primary MMU does not, and thus violate the mmu_notifier contract. And > > > in general, I am steadfastedly against adding any kind of heuristic to KVM's > > > zapping logic. > > > > > > This really needs to be fixed in the primary MMU and not require any direct > > > involvement from secondary MMUs, e.g. the mmu_notifier invalidation itself needs > > > to be skipped. > > > > This likely has the same issue you just described, we don't know if it > > can be skipped until we iterate over the PTEs and by then it is too > > late to invoke the notifier. Maybe some kind of abort and restart > > scheme could work? > > Or maybe treat this as a userspace config problem? Pinning DMA pages in a VM, > having a fair amount of remote memory, *and* expecting NUMA balancing to do anything > useful for that VM seems like a userspace problem. > > Actually, does NUMA balancing even support this particular scenario? I see this > in do_numa_page() > > /* TODO: handle PTE-mapped THP */ > if (PageCompound(page)) > goto out_map; hi Sean, I think compound page is handled in do_huge_pmd_numa_page(), and I do observed numa migration of those kind of pages. > and then for PG_anon_exclusive > > * ... 
For now, we only expect it to be > * set on tail pages for PTE-mapped THP. > */ > PG_anon_exclusive = PG_mappedtodisk, > > which IIUC means zapping these pages to do migrate_on-fault will never succeed. > > Can we just tell userspace to mbind() the pinned region to explicitly exclude the > VMA(s) from NUMA balancing? For VMs with VFIO mdev mediated devices, the VMAs to be pinned are dynamic, I think it's hard to mbind() in advance. Thanks Yan ^ permalink raw reply [flat|nested] 16+ messages in thread
* Re: [RFC PATCH 3/3] KVM: x86/mmu: skip zap maybe-dma-pinned pages for NUMA migration 2023-08-09 0:11 ` Yan Zhao @ 2023-08-09 11:59 ` Jason Gunthorpe 2023-08-10 9:08 ` Yan Zhao 0 siblings, 1 reply; 16+ messages in thread From: Jason Gunthorpe @ 2023-08-09 11:59 UTC (permalink / raw) To: Yan Zhao Cc: Sean Christopherson, linux-mm, linux-kernel, kvm, pbonzini, mike.kravetz, apopple, rppt, akpm, kevin.tian On Wed, Aug 09, 2023 at 08:11:17AM +0800, Yan Zhao wrote: > > Can we just tell userspace to mbind() the pinned region to explicitly exclude the > > VMA(s) from NUMA balancing? > For VMs with VFIO mdev mediated devices, the VMAs to be pinned are > dynamic, I think it's hard to mbind() in advance. It is hard to view the mediated devices path as a performance path that deserves this kind of intervention :\ Jason ^ permalink raw reply [flat|nested] 16+ messages in thread
* Re: [RFC PATCH 3/3] KVM: x86/mmu: skip zap maybe-dma-pinned pages for NUMA migration 2023-08-09 11:59 ` Jason Gunthorpe @ 2023-08-10 9:08 ` Yan Zhao 0 siblings, 0 replies; 16+ messages in thread From: Yan Zhao @ 2023-08-10 9:08 UTC (permalink / raw) To: Jason Gunthorpe Cc: Sean Christopherson, linux-mm, linux-kernel, kvm, pbonzini, mike.kravetz, apopple, rppt, akpm, kevin.tian On Wed, Aug 09, 2023 at 08:59:16AM -0300, Jason Gunthorpe wrote: > On Wed, Aug 09, 2023 at 08:11:17AM +0800, Yan Zhao wrote: > > > > Can we just tell userspace to mbind() the pinned region to explicitly exclude the > > > VMA(s) from NUMA balancing? > > > For VMs with VFIO mdev mediated devices, the VMAs to be pinned are > > dynamic, I think it's hard to mbind() in advance. > > It is hard to view the mediated devices path as a performance path > that deserves this kind of intervention :\ Though you are right, maybe we can still make it better? What about introducing a new callback which will be called when a page is ensured to be PROT_NONE protected for NUMA balancing? Then, rather than duplicate mm logic in KVM, KVM can depend on this callback and do the page unmap in secondary MMU only for pages that are indeed PROT_NONE protected for NUMA balancing, excluding pages that are obviously non-NUMA-migratable. I sent a RFC v2 (commit messages and comments are not well polished) to show this idea, https://lore.kernel.org/all/20230810085636.25914-1-yan.y.zhao@intel.com/ Do you think we can continue the work? Thanks a lot for your review! ^ permalink raw reply [flat|nested] 16+ messages in thread
* Re: [RFC PATCH 3/3] KVM: x86/mmu: skip zap maybe-dma-pinned pages for NUMA migration 2023-08-08 23:56 ` Sean Christopherson 2023-08-09 0:11 ` Yan Zhao @ 2023-08-09 5:06 ` Yan Zhao 1 sibling, 0 replies; 16+ messages in thread From: Yan Zhao @ 2023-08-09 5:06 UTC (permalink / raw) To: Sean Christopherson Cc: Jason Gunthorpe, linux-mm, linux-kernel, kvm, pbonzini, mike.kravetz, apopple, rppt, akpm, kevin.tian On Tue, Aug 08, 2023 at 04:56:11PM -0700, Sean Christopherson wrote: > and then for PG_anon_exclusive > > * ... For now, we only expect it to be > * set on tail pages for PTE-mapped THP. > */ > PG_anon_exclusive = PG_mappedtodisk, > /* * Depending on the way an anonymous folio can be mapped into a page * table (e.g., single PMD/PUD/CONT of the head page vs. PTE-mapped * THP), PG_anon_exclusive may be set only for the head page or for * tail pages of an anonymous folio. For now, we only expect it to be * set on tail pages for PTE-mapped THP. */ PG_anon_exclusive = PG_mappedtodisk, Now sure why the comment says PG_anon_exclusive is set only on tail pages for PTE-mapped THP, what I observed is that only head page of a compound page is set to anon_exclusive. And the code path is here: __handle_mm_fault |->create_huge_pmd |->do_huge_pmd_anonymous_page //if (vma_is_anonymous(vmf->vma) |->folio = vma_alloc_folio(gfp, HPAGE_PMD_ORDER, vma, haddr, true); |->__do_huge_pmd_anonymous_page(vmf, &folio->page, gfp); |->folio_add_new_anon_rmap |->__page_set_anon_rmap(folio, &folio->page, vma, address, 1); |->SetPageAnonExclusive(page) And this code path has been present since 6c287605fd56 ("mm: remember exclusively mapped anonymous pages with PG_anon_exclusive") ^ permalink raw reply [flat|nested] 16+ messages in thread
* Re: [RFC PATCH 3/3] KVM: x86/mmu: skip zap maybe-dma-pinned pages for NUMA migration 2023-08-08 14:32 ` Jason Gunthorpe 2023-08-08 23:56 ` Sean Christopherson @ 2023-08-09 2:58 ` Yan Zhao 1 sibling, 0 replies; 16+ messages in thread From: Yan Zhao @ 2023-08-09 2:58 UTC (permalink / raw) To: Jason Gunthorpe Cc: Sean Christopherson, linux-mm, linux-kernel, kvm, pbonzini, mike.kravetz, apopple, rppt, akpm, kevin.tian On Tue, Aug 08, 2023 at 11:32:37AM -0300, Jason Gunthorpe wrote: .... > > This really needs to be fixed in the primary MMU and not require any direct > > involvement from secondary MMUs, e.g. the mmu_notifier invalidation itself needs > > to be skipped. > > This likely has the same issue you just described, we don't know if it > can be skipped until we iterate over the PTEs and by then it is too > late to invoke the notifier. Maybe some kind of abort and restart The problem is that KVM currently performs the zap in handler of .invalidate_range_start(), so before abort in mm, KVM has done the zap in secondary MMU. Or, could we move the zap in KVM side to handler of .invalidate_range_end() only for MMU_NOTIFY_PROTECTION_VMA and MMU_NOTIFIER_RANGE_NUMA? Then, in mm side, we could do the abort and update the range to contain only successful subrange .invalidate_range_end(). Is that acceptable? > scheme could work? > ^ permalink raw reply [flat|nested] 16+ messages in thread
* Re: [RFC PATCH 3/3] KVM: x86/mmu: skip zap maybe-dma-pinned pages for NUMA migration 2023-08-08 14:26 ` Sean Christopherson 2023-08-08 14:32 ` Jason Gunthorpe @ 2023-08-09 0:29 ` Yan Zhao 1 sibling, 0 replies; 16+ messages in thread From: Yan Zhao @ 2023-08-09 0:29 UTC (permalink / raw) To: Sean Christopherson Cc: Jason Gunthorpe, linux-mm, linux-kernel, kvm, pbonzini, mike.kravetz, apopple, rppt, akpm, kevin.tian On Tue, Aug 08, 2023 at 07:26:07AM -0700, Sean Christopherson wrote: > On Tue, Aug 08, 2023, Jason Gunthorpe wrote: > > On Tue, Aug 08, 2023 at 03:17:02PM +0800, Yan Zhao wrote: > > > @@ -859,6 +860,21 @@ static bool tdp_mmu_zap_leafs(struct kvm *kvm, struct kvm_mmu_page *root, > > > !is_last_spte(iter.old_spte, iter.level)) > > > continue; > > > > > > + if (skip_pinned) { > > > + kvm_pfn_t pfn = spte_to_pfn(iter.old_spte); > > > + struct page *page = kvm_pfn_to_refcounted_page(pfn); > > > + struct folio *folio; > > > + > > > + if (!page) > > > + continue; > > > + > > > + folio = page_folio(page); > > > + > > > + if (folio_test_anon(folio) && PageAnonExclusive(&folio->page) && > > > + folio_maybe_dma_pinned(folio)) > > > + continue; > > > + } > > > + > > > > I don't get it.. > > > > The last patch made it so that the NUMA balancing code doesn't change > > page_maybe_dma_pinned() pages to PROT_NONE > > > > So why doesn't KVM just check if the current and new SPTE are the same > > and refrain from invalidating if nothing changed? > > Because KVM doesn't have visibility into the current and new PTEs when the zapping > occurs. The contract for invalidate_range_start() requires that KVM drop all > references before returning, and so the zapping occurs before change_pte_range() > or change_huge_pmd() have done antyhing. > > > Duplicating the checks here seems very frail to me. > > Yes, this is approach gets a hard NAK from me. IIUC, folio_maybe_dma_pinned() > can yield different results purely based on refcounts, i.e. KVM could skip pages Do you mean the different results of folio_maybe_dma_pinned() and page_maybe_dma_pinned()? I choose to use folio_maybe_dma_pinned() in KVM on purpose because in this .invalidate_range_start() handler in KVM, we may get tail pages of a folio, so it's better to call this folio's version of folio_maybe_dma_pinned(). However, in mm core, i.e. in change_huge_pmd() and change_pte_range(), the "page" it gets is always head page of a folio, so though page_maybe_dma_pinned() is called in it, it actually equals to folio_maybe_dma_pinned(page_folio(page)). So, I think the two sides should yield equal results. On this other hand, if you are concerning about the ref count of page is dynamic, and because KVM and mm core do not check ref count of a page atomically, I think it's still fine. Because, the notification of .invalidate_range_start() with event type MMU_NOTIFY_PROTECTION_VMA only means the corresponding PTE is protected in the primary MMU, it does not mean the page is UNMAPed. In series [1], we can even see that for processes other than KVM, the PROT_NONE in primary MMU for NUMA migration purpose is actually ignored and the underlying PFNs are still accessed. So, could KVM open a door for maybe-dma-pinned pages, and keeps mapping those pages until (1) a invalidate notification other than MMU_NOTIFY_PROTECTION_VMA comes or (2) a invalidate notification with MMU_NOTIFY_PROTECTION_VMA comes again with reduced page ref count? 
[1]: https://lore.kernel.org/all/20230803143208.383663-1-david@redhat.com/ Thanks Yan > that the primary MMU does not, and thus violate the mmu_notifier contract. And > in general, I am steadfastedly against adding any kind of heuristic to KVM's > zapping logic. > > This really needs to be fixed in the primary MMU and not require any direct > involvement from secondary MMUs, e.g. the mmu_notifier invalidation itself needs > to be skipped. > ^ permalink raw reply [flat|nested] 16+ messages in thread
* Re: [RFC PATCH 3/3] KVM: x86/mmu: skip zap maybe-dma-pinned pages for NUMA migration 2023-08-08 7:17 ` [RFC PATCH 3/3] KVM: x86/mmu: skip zap maybe-dma-pinned pages for NUMA migration Yan Zhao 2023-08-08 12:32 ` Jason Gunthorpe @ 2023-08-26 6:39 ` liulongfang 2023-09-04 7:03 ` Yan Zhao 1 sibling, 1 reply; 16+ messages in thread From: liulongfang @ 2023-08-26 6:39 UTC (permalink / raw) To: Yan Zhao, linux-mm, linux-kernel, kvm Cc: pbonzini, seanjc, mike.kravetz, apopple, jgg, rppt, akpm, kevin.tian On 2023/8/8 15:17, Yan Zhao wrote: > Skip zapping pages that're exclusive anonymas and maybe-dma-pinned in TDP > MMU if it's for NUMA migration purpose to save unnecessary zaps and TLB > shootdowns. > > For NUMA balancing, change_pmd_range() will send .invalidate_range_start() > and .invalidate_range_end() pair unconditionally before setting a huge PMD > or PTE to be PROT_NONE. > > No matter whether PROT_NONE is set under change_pmd_range(), NUMA migration > will eventually reject migrating of exclusive anonymas and maybe_dma_pinned > pages in later try_to_migrate_one() phase and restoring the affected huge > PMD or PTE. > > Therefore, if KVM can detect those kind of pages in the zap phase, zap and > TLB shootdowns caused by this kind of protection can be avoided. > > Corner cases like below are still fine. > 1. Auto NUMA balancing selects a PMD range to set PROT_NONE in > change_pmd_range(). > 2. A page is maybe-dma-pinned at the time of sending > .invalidate_range_start() with event type MMU_NOTIFY_PROTECTION_VMA. > ==> so it's not zapped in KVM's secondary MMU. > 3. The page is unpinned after sending .invalidate_range_start(), therefore > is not maybe-dma-pinned and set to PROT_NONE in primary MMU. > 4. For some reason, page fault is triggered in primary MMU and the page > will be found to be suitable for NUMA migration. > 5. try_to_migrate_one() will send .invalidate_range_start() notification > with event type MMU_NOTIFY_CLEAR to KVM, and ===> > KVM will zap the pages in secondary MMU. > 6. The old page will be replaced by a new page in primary MMU. > > If step 4 does not happen, though KVM will keep accessing a page that > might not be on the best NUMA node, it can be fixed by a next round of > step 1 in Auto NUMA balancing as change_pmd_range() will send mmu > notification without checking PROT_NONE is set or not. > > Currently in this patch, for NUMA migration protection purpose, only > exclusive anonymous maybe-dma-pinned pages are skipped. > Can later include other type of pages, e.g., is_zone_device_page() or > PageKsm() if necessary. 
> > Signed-off-by: Yan Zhao <yan.y.zhao@intel.com> > --- > arch/x86/kvm/mmu/mmu.c | 4 ++-- > arch/x86/kvm/mmu/tdp_mmu.c | 26 ++++++++++++++++++++++---- > arch/x86/kvm/mmu/tdp_mmu.h | 4 ++-- > include/linux/kvm_host.h | 1 + > virt/kvm/kvm_main.c | 5 +++++ > 5 files changed, 32 insertions(+), 8 deletions(-) > > diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c > index d72f2b20f430..9dccc25b1389 100644 > --- a/arch/x86/kvm/mmu/mmu.c > +++ b/arch/x86/kvm/mmu/mmu.c > @@ -6307,8 +6307,8 @@ void kvm_zap_gfn_range(struct kvm *kvm, gfn_t gfn_start, gfn_t gfn_end) > > if (tdp_mmu_enabled) { > for (i = 0; i < KVM_ADDRESS_SPACE_NUM; i++) > - flush = kvm_tdp_mmu_zap_leafs(kvm, i, gfn_start, > - gfn_end, true, flush); > + flush = kvm_tdp_mmu_zap_leafs(kvm, i, gfn_start, gfn_end, > + true, flush, false); > } > > if (flush) > diff --git a/arch/x86/kvm/mmu/tdp_mmu.c b/arch/x86/kvm/mmu/tdp_mmu.c > index 6250bd3d20c1..17762b5a2b98 100644 > --- a/arch/x86/kvm/mmu/tdp_mmu.c > +++ b/arch/x86/kvm/mmu/tdp_mmu.c > @@ -838,7 +838,8 @@ bool kvm_tdp_mmu_zap_sp(struct kvm *kvm, struct kvm_mmu_page *sp) > * operation can cause a soft lockup. > */ > static bool tdp_mmu_zap_leafs(struct kvm *kvm, struct kvm_mmu_page *root, > - gfn_t start, gfn_t end, bool can_yield, bool flush) > + gfn_t start, gfn_t end, bool can_yield, bool flush, > + bool skip_pinned) > { > struct tdp_iter iter; > > @@ -859,6 +860,21 @@ static bool tdp_mmu_zap_leafs(struct kvm *kvm, struct kvm_mmu_page *root, > !is_last_spte(iter.old_spte, iter.level)) > continue; > > + if (skip_pinned) { > + kvm_pfn_t pfn = spte_to_pfn(iter.old_spte); > + struct page *page = kvm_pfn_to_refcounted_page(pfn); > + struct folio *folio; > + > + if (!page) > + continue; > + > + folio = page_folio(page); > + > + if (folio_test_anon(folio) && PageAnonExclusive(&folio->page) && > + folio_maybe_dma_pinned(folio)) > + continue; > + } > + > tdp_mmu_iter_set_spte(kvm, &iter, 0); > flush = true; > } > @@ -878,12 +894,13 @@ static bool tdp_mmu_zap_leafs(struct kvm *kvm, struct kvm_mmu_page *root, > * more SPTEs were zapped since the MMU lock was last acquired. 
> */ > bool kvm_tdp_mmu_zap_leafs(struct kvm *kvm, int as_id, gfn_t start, gfn_t end, > - bool can_yield, bool flush) > + bool can_yield, bool flush, bool skip_pinned) > { > struct kvm_mmu_page *root; > > for_each_tdp_mmu_root_yield_safe(kvm, root, as_id) > - flush = tdp_mmu_zap_leafs(kvm, root, start, end, can_yield, flush); > + flush = tdp_mmu_zap_leafs(kvm, root, start, end, can_yield, flush, > + skip_pinned); > > return flush; > } > @@ -1147,7 +1164,8 @@ bool kvm_tdp_mmu_unmap_gfn_range(struct kvm *kvm, struct kvm_gfn_range *range, > bool flush) > { > return kvm_tdp_mmu_zap_leafs(kvm, range->slot->as_id, range->start, > - range->end, range->may_block, flush); > + range->end, range->may_block, flush, > + range->skip_pinned); > } > > typedef bool (*tdp_handler_t)(struct kvm *kvm, struct tdp_iter *iter, > diff --git a/arch/x86/kvm/mmu/tdp_mmu.h b/arch/x86/kvm/mmu/tdp_mmu.h > index 0a63b1afabd3..2a9de44bc5c3 100644 > --- a/arch/x86/kvm/mmu/tdp_mmu.h > +++ b/arch/x86/kvm/mmu/tdp_mmu.h > @@ -20,8 +20,8 @@ __must_check static inline bool kvm_tdp_mmu_get_root(struct kvm_mmu_page *root) > void kvm_tdp_mmu_put_root(struct kvm *kvm, struct kvm_mmu_page *root, > bool shared); > > -bool kvm_tdp_mmu_zap_leafs(struct kvm *kvm, int as_id, gfn_t start, > - gfn_t end, bool can_yield, bool flush); > +bool kvm_tdp_mmu_zap_leafs(struct kvm *kvm, int as_id, gfn_t start, gfn_t end, > + bool can_yield, bool flush, bool skip_pinned); > bool kvm_tdp_mmu_zap_sp(struct kvm *kvm, struct kvm_mmu_page *sp); > void kvm_tdp_mmu_zap_all(struct kvm *kvm); > void kvm_tdp_mmu_invalidate_all_roots(struct kvm *kvm); > diff --git a/include/linux/kvm_host.h b/include/linux/kvm_host.h > index 9125d0ab642d..f883d6b59545 100644 > --- a/include/linux/kvm_host.h > +++ b/include/linux/kvm_host.h > @@ -266,6 +266,7 @@ struct kvm_gfn_range { > gfn_t end; > union kvm_mmu_notifier_arg arg; > bool may_block; > + bool skip_pinned; > }; > bool kvm_unmap_gfn_range(struct kvm *kvm, struct kvm_gfn_range *range); > bool kvm_age_gfn(struct kvm *kvm, struct kvm_gfn_range *range); > diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c > index f84ef9399aee..1202c1daa568 100644 > --- a/virt/kvm/kvm_main.c > +++ b/virt/kvm/kvm_main.c > @@ -532,6 +532,7 @@ struct kvm_hva_range { > on_unlock_fn_t on_unlock; > bool flush_on_ret; > bool may_block; > + bool skip_pinned; > }; > > /* > @@ -595,6 +596,7 @@ static __always_inline int __kvm_handle_hva_range(struct kvm *kvm, > */ > gfn_range.arg = range->arg; > gfn_range.may_block = range->may_block; > + gfn_range.skip_pinned = range->skip_pinned; > > /* > * {gfn(page) | page intersects with [hva_start, hva_end)} = > @@ -754,6 +756,9 @@ static int kvm_mmu_notifier_invalidate_range_start(struct mmu_notifier *mn, > .on_unlock = kvm_arch_guest_memory_reclaimed, > .flush_on_ret = true, > .may_block = mmu_notifier_range_blockable(range), > + .skip_pinned = test_bit(MMF_HAS_PINNED, &range->mm->flags) && > + (range->event == MMU_NOTIFY_PROTECTION_VMA) && > + (range->flags & MMU_NOTIFIER_RANGE_NUMA), > }; > > trace_kvm_unmap_hva_range(range->start, range->end); > I have a question. The numa id of the cpu can be reconfigured in the VM. Will the page table migration operations initiated by the numa balance in the VM and the numa balance in the host conflict with each other after this setting? Thanks Longfang. ^ permalink raw reply [flat|nested] 16+ messages in thread
* Re: [RFC PATCH 3/3] KVM: x86/mmu: skip zap maybe-dma-pinned pages for NUMA migration 2023-08-26 6:39 ` liulongfang @ 2023-09-04 7:03 ` Yan Zhao 0 siblings, 0 replies; 16+ messages in thread From: Yan Zhao @ 2023-09-04 7:03 UTC (permalink / raw) To: liulongfang Cc: linux-mm, linux-kernel, kvm, pbonzini, seanjc, mike.kravetz, apopple, jgg, rppt, akpm, kevin.tian > I have a question. The numa id of the cpu can be reconfigured in the VM. Are you talking about vNUMA and the numa balancing in guest? > Will the page table migration operations initiated by the numa balance in the > VM and the numa balance in the host conflict with each other after this setting? There's no page table migration in host. Only page table is modified and page is migrated in a single host. Live migration also doesn't copy the secondary page tables (i.e. EPT/NPT) to the target. Instead, it's rebuilt in the target side. So, I don't quite understand your question here. ^ permalink raw reply [flat|nested] 16+ messages in thread