From: Sean Christopherson <seanjc@google.com>
To: Jason Gunthorpe <jgg@nvidia.com>
Cc: Yan Zhao <yan.y.zhao@intel.com>,
linux-mm@kvack.org, linux-kernel@vger.kernel.org,
kvm@vger.kernel.org, pbonzini@redhat.com,
mike.kravetz@oracle.com, apopple@nvidia.com, rppt@kernel.org,
akpm@linux-foundation.org, kevin.tian@intel.com
Subject: Re: [RFC PATCH 3/3] KVM: x86/mmu: skip zap maybe-dma-pinned pages for NUMA migration
Date: Tue, 8 Aug 2023 16:56:11 -0700
Message-ID: <ZNLWG++qK1mZcEOq@google.com>
In-Reply-To: <ZNJSBS9w+6cS5eRM@nvidia.com>
On Tue, Aug 08, 2023, Jason Gunthorpe wrote:
> On Tue, Aug 08, 2023 at 07:26:07AM -0700, Sean Christopherson wrote:
> > On Tue, Aug 08, 2023, Jason Gunthorpe wrote:
> > > On Tue, Aug 08, 2023 at 03:17:02PM +0800, Yan Zhao wrote:
> > > > @@ -859,6 +860,21 @@ static bool tdp_mmu_zap_leafs(struct kvm *kvm, struct kvm_mmu_page *root,
> > > > !is_last_spte(iter.old_spte, iter.level))
> > > > continue;
> > > >
> > > > + if (skip_pinned) {
> > > > + kvm_pfn_t pfn = spte_to_pfn(iter.old_spte);
> > > > + struct page *page = kvm_pfn_to_refcounted_page(pfn);
> > > > + struct folio *folio;
> > > > +
> > > > + if (!page)
> > > > + continue;
> > > > +
> > > > + folio = page_folio(page);
> > > > +
> > > > + if (folio_test_anon(folio) && PageAnonExclusive(&folio->page) &&
> > > > + folio_maybe_dma_pinned(folio))
> > > > + continue;
> > > > + }
> > > > +
> > >
> > > I don't get it..
> > >
> > > The last patch made it so that the NUMA balancing code doesn't change
> > > page_maybe_dma_pinned() pages to PROT_NONE
> > >
> > > So why doesn't KVM just check if the current and new SPTE are the same
> > > and refrain from invalidating if nothing changed?
> >
> > Because KVM doesn't have visibility into the current and new PTEs when the zapping
> > occurs. The contract for invalidate_range_start() requires that KVM drop all
> > references before returning, and so the zapping occurs before change_pte_range()
> > or change_huge_pmd() have done anything.
> >
> > > Duplicating the checks here seems very frail to me.
> >
> > Yes, this approach gets a hard NAK from me. IIUC, folio_maybe_dma_pinned()
> > can yield different results purely based on refcounts, i.e. KVM could skip pages
> > that the primary MMU does not, and thus violate the mmu_notifier contract. And
> > in general, I am steadfastly against adding any kind of heuristic to KVM's
> > zapping logic.
> >
> > This really needs to be fixed in the primary MMU and not require any direct
> > involvement from secondary MMUs, e.g. the mmu_notifier invalidation itself needs
> > to be skipped.
>
> This likely has the same issue you just described, we don't know if it
> can be skipped until we iterate over the PTEs and by then it is too
> late to invoke the notifier. Maybe some kind of abort and restart
> scheme could work?
Or maybe treat this as a userspace config problem? Pinning DMA pages in a VM,
having a fair amount of remote memory, *and* expecting NUMA balancing to do anything
useful for that VM seems like a userspace problem.
Actually, does NUMA balancing even support this particular scenario? I see this
in do_numa_page()
/* TODO: handle PTE-mapped THP */
if (PageCompound(page))
goto out_map;
and then for PG_anon_exclusive
* ... For now, we only expect it to be
* set on tail pages for PTE-mapped THP.
*/
PG_anon_exclusive = PG_mappedtodisk,
which IIUC means zapping these pages to do migrate-on-fault will never succeed.
Can we just tell userspace to mbind() the pinned region to explicitly exclude the
VMA(s) from NUMA balancing?
Thread overview: 16+ messages
2023-08-08 7:13 [RFC PATCH 0/3] Reduce NUMA balance caused TLB-shootdowns in a VM Yan Zhao
2023-08-08 7:14 ` [RFC PATCH 1/3] mm/mmu_notifier: introduce a new mmu notifier flag MMU_NOTIFIER_RANGE_NUMA Yan Zhao
2023-08-08 7:15 ` [RFC PATCH 2/3] mm: don't set PROT_NONE to maybe-dma-pinned pages for NUMA-migrate purpose Yan Zhao
2023-08-08 7:17 ` [RFC PATCH 3/3] KVM: x86/mmu: skip zap maybe-dma-pinned pages for NUMA migration Yan Zhao
2023-08-08 12:32 ` Jason Gunthorpe
2023-08-08 14:26 ` Sean Christopherson
2023-08-08 14:32 ` Jason Gunthorpe
2023-08-08 23:56 ` Sean Christopherson [this message]
2023-08-09 0:11 ` Yan Zhao
2023-08-09 11:59 ` Jason Gunthorpe
2023-08-10 9:08 ` Yan Zhao
2023-08-09 5:06 ` Yan Zhao
2023-08-09 2:58 ` Yan Zhao
2023-08-09 0:29 ` Yan Zhao
2023-08-26 6:39 ` liulongfang
2023-09-04 7:03 ` Yan Zhao