linux-mm.kvack.org archive mirror
From: Yan Zhao <yan.y.zhao@intel.com>
To: Sean Christopherson <seanjc@google.com>
Cc: Jason Gunthorpe <jgg@nvidia.com>, <linux-mm@kvack.org>,
	<linux-kernel@vger.kernel.org>, <kvm@vger.kernel.org>,
	<pbonzini@redhat.com>, <mike.kravetz@oracle.com>,
	<apopple@nvidia.com>, <rppt@kernel.org>,
	<akpm@linux-foundation.org>, <kevin.tian@intel.com>
Subject: Re: [RFC PATCH 3/3] KVM: x86/mmu: skip zap maybe-dma-pinned pages for NUMA migration
Date: Wed, 9 Aug 2023 08:29:31 +0800	[thread overview]
Message-ID: <ZNLd64+92ltbp5SS@yzhao56-desk.sh.intel.com> (raw)
In-Reply-To: <ZNJQf1/jzEeyKaIi@google.com>

On Tue, Aug 08, 2023 at 07:26:07AM -0700, Sean Christopherson wrote:
> On Tue, Aug 08, 2023, Jason Gunthorpe wrote:
> > On Tue, Aug 08, 2023 at 03:17:02PM +0800, Yan Zhao wrote:
> > > @@ -859,6 +860,21 @@ static bool tdp_mmu_zap_leafs(struct kvm *kvm, struct kvm_mmu_page *root,
> > >  		    !is_last_spte(iter.old_spte, iter.level))
> > >  			continue;
> > >  
> > > +		if (skip_pinned) {
> > > +			kvm_pfn_t pfn = spte_to_pfn(iter.old_spte);
> > > +			struct page *page = kvm_pfn_to_refcounted_page(pfn);
> > > +			struct folio *folio;
> > > +
> > > +			if (!page)
> > > +				continue;
> > > +
> > > +			folio = page_folio(page);
> > > +
> > > +			if (folio_test_anon(folio) && PageAnonExclusive(&folio->page) &&
> > > +			    folio_maybe_dma_pinned(folio))
> > > +				continue;
> > > +		}
> > > +
> > 
> > I don't get it..
> > 
> > The last patch made it so that the NUMA balancing code doesn't change
> > page_maybe_dma_pinned() pages to PROT_NONE
> > 
> > So why doesn't KVM just check if the current and new SPTE are the same
> > and refrain from invalidating if nothing changed?
> 
> Because KVM doesn't have visibility into the current and new PTEs when the zapping
> occurs.  The contract for invalidate_range_start() requires that KVM drop all
> references before returning, and so the zapping occurs before change_pte_range()
> or change_huge_pmd() have done anything.
> 
> > Duplicating the checks here seems very frail to me.
> 
> Yes, this approach gets a hard NAK from me.  IIUC, folio_maybe_dma_pinned()
> can yield different results purely based on refcounts, i.e. KVM could skip pages
Do you mean that folio_maybe_dma_pinned() and page_maybe_dma_pinned()
can yield different results?

I chose to use folio_maybe_dma_pinned() in KVM on purpose: in KVM's
.invalidate_range_start() handler we may get tail pages of a folio, so it's
better to call the folio version, folio_maybe_dma_pinned().

However, in the mm core, i.e. in change_huge_pmd() and change_pte_range(),
the "page" it gets is always the head page of a folio, so although
page_maybe_dma_pinned() is called there, it is actually equivalent to
folio_maybe_dma_pinned(page_folio(page)).

So, I think the two sides should yield equal results.

On the other hand, if your concern is that a page's refcount is dynamic,
and that KVM and the mm core do not check it atomically, I think that is
still fine: an .invalidate_range_start() notification with event type
MMU_NOTIFY_PROTECTION_VMA only means the corresponding PTE is being
protected in the primary MMU; it does not mean the page is being unmapped.

In series [1], we can even see that for users other than KVM, PROT_NONE
set in the primary MMU for NUMA-migration purposes is actually ignored
and the underlying PFNs are still accessed.

So, could KVM open a door for maybe-dma-pinned pages and keep mapping
those pages until
(1) an invalidation notification with an event type other than
    MMU_NOTIFY_PROTECTION_VMA arrives, or
(2) another MMU_NOTIFY_PROTECTION_VMA notification arrives with a reduced
    page refcount?

[1]: https://lore.kernel.org/all/20230803143208.383663-1-david@redhat.com/

Thanks
Yan

> that the primary MMU does not, and thus violate the mmu_notifier contract.  And
> in general, I am steadfastly against adding any kind of heuristic to KVM's
> zapping logic.
> 
> This really needs to be fixed in the primary MMU and not require any direct
> involvement from secondary MMUs, e.g. the mmu_notifier invalidation itself needs
> to be skipped.
> 



Thread overview: 16+ messages
2023-08-08  7:13 [RFC PATCH 0/3] Reduce NUMA balance caused TLB-shootdowns in a VM Yan Zhao
2023-08-08  7:14 ` [RFC PATCH 1/3] mm/mmu_notifier: introduce a new mmu notifier flag MMU_NOTIFIER_RANGE_NUMA Yan Zhao
2023-08-08  7:15 ` [RFC PATCH 2/3] mm: don't set PROT_NONE to maybe-dma-pinned pages for NUMA-migrate purpose Yan Zhao
2023-08-08  7:17 ` [RFC PATCH 3/3] KVM: x86/mmu: skip zap maybe-dma-pinned pages for NUMA migration Yan Zhao
2023-08-08 12:32   ` Jason Gunthorpe
2023-08-08 14:26     ` Sean Christopherson
2023-08-08 14:32       ` Jason Gunthorpe
2023-08-08 23:56         ` Sean Christopherson
2023-08-09  0:11           ` Yan Zhao
2023-08-09 11:59             ` Jason Gunthorpe
2023-08-10  9:08               ` Yan Zhao
2023-08-09  5:06           ` Yan Zhao
2023-08-09  2:58         ` Yan Zhao
2023-08-09  0:29       ` Yan Zhao [this message]
2023-08-26  6:39   ` liulongfang
2023-09-04  7:03     ` Yan Zhao
