From: Jason Gunthorpe <jgg@nvidia.com>
To: Catalin Marinas <catalin.marinas@arm.com>
Cc: Marc Zyngier <maz@kernel.org>,
	ankita@nvidia.com, alex.williamson@redhat.com,
	naoya.horiguchi@nec.com, oliver.upton@linux.dev,
	aniketa@nvidia.com, cjia@nvidia.com, kwankhede@nvidia.com,
	targupta@nvidia.com, vsethi@nvidia.com, acurrid@nvidia.com,
	apopple@nvidia.com, jhubbard@nvidia.com, danw@nvidia.com,
	kvm@vger.kernel.org, linux-kernel@vger.kernel.org,
	linux-arm-kernel@lists.infradead.org, linux-mm@kvack.org,
	Lorenzo Pieralisi <lpieralisi@kernel.org>
Subject: Re: [PATCH v3 1/6] kvm: determine memory type from VMA
Date: Wed, 14 Jun 2023 09:44:06 -0300
Message-ID: <ZIm2FnH6EQfA4Qr+@nvidia.com>
In-Reply-To: <ZHcxHbCb439I1Uk2@arm.com>

On Wed, May 31, 2023 at 12:35:57PM +0100, Catalin Marinas wrote:

> Current status on arm64: vfio-pci user mapping (Qemu) of PCIe BARs is
> Device_nGnRnE. KVM also maps it at Stage 2 with the same attributes. The
> user mapping is fine since the VMM may not have detailed knowledge about
> how to safely map the BAR. KVM doesn't have such detailed knowledge
> either in the device pass-through case but for safety reasons it follows
> the same attributes as the user mapping.

To put a fine point on this: it means the KVM VM does not emulate the
same behavior as a bare-metal ARM system, and this results in
significant performance degradation (loss of write-combining) or
malfunction (loss of cacheability) in some VFIO device scenarios.

> architecture). Presumably cloud vendors allowing device pass-through
> already know their system well enough, no need for further policing in
> KVM (which it can't do properly anyway).

This also matches x86 pretty well. There is a lot of platform and BIOS
magic at work to make sure that VFIO is safe in general. As this is
not discoverable by the OS, we have no choice in the kernel but to
assume the operator knows what they are doing.

> While I think it's less likely for the VMM to access the same BAR that
> was mapped into the guest, if we want to be on the safe side from an ABI
> perspective (the returned mmap() now has different memory guarantees),
> we could make this an opt-in. That's pretty much the VMM stating that it
> is ok with losing some of the Device guarantees for the vfio-pci
> mapping. 

I don't know of any use case where this matters. The VMM is already
basically not allowed to touch the MMIO spaces for security
reasons. The only reason the MMIO is mmap'd into the VMM at all is
that this is the only way to get it into KVM.
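
Just to illustrate that point, the VMM side boils down to roughly the
sketch below - error handling dropped, and the device fd, VM fd, BAR
index, guest PA and slot number are all placeholders:

#include <stdint.h>
#include <sys/ioctl.h>
#include <sys/mman.h>
#include <linux/kvm.h>
#include <linux/vfio.h>

/* device_fd is an already-open vfio device fd, vm_fd the KVM VM fd */
static void *map_bar_into_guest(int device_fd, int vm_fd,
				__u64 guest_pa, __u32 slot)
{
	struct vfio_region_info info = {
		.argsz = sizeof(info),
		.index = VFIO_PCI_BAR1_REGION_INDEX,	/* arbitrary BAR */
	};
	struct kvm_userspace_memory_region region;
	void *va;

	/* Ask vfio-pci where the BAR lives inside the device fd */
	ioctl(device_fd, VFIO_DEVICE_GET_REGION_INFO, &info);

	/* This is the mapping that today comes back Device_nGnRnE */
	va = mmap(NULL, info.size, PROT_READ | PROT_WRITE, MAP_SHARED,
		  device_fd, info.offset);

	/* ...and its only real purpose is to be handed to KVM */
	region.slot = slot;
	region.flags = 0;
	region.guest_phys_addr = guest_pa;
	region.memory_size = info.size;
	region.userspace_addr = (__u64)(uintptr_t)va;
	ioctl(vm_fd, KVM_SET_USER_MEMORY_REGION, &region);

	return va;
}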

So an opt-in seems like overkill to me.

> Going back to this series, allowing Cacheable mapping in KVM has similar
> implications as above. (2) and (3) would be assumed safe by the platform
> vendors. Regarding (1), to avoid confusion, I would only allow it if FWB
> (force write-back) is present so that KVM can enforce a cacheable
> mapping at Stage 2 if the vfio variant driver also maps it as cacheable
> in user space.

Yes, currently this patch makes KVM crash if FWB is not available :\
The FWB requirement needs to be enforced in this patch as well.
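
Something along these lines in user_mem_abort() is what I have in mind
- only a sketch, the variable name is made up and this is not the
actual diff:

	/*
	 * Only honour a cacheable userspace mapping at stage 2 when
	 * FEAT_S2FWB is implemented, since that is what lets KVM force
	 * the final attribute to write-back. Without it, fall back to
	 * the existing Device mapping instead of crashing.
	 */
	if (vma_is_cacheable && !cpus_have_final_cap(ARM64_HAS_STAGE2_FWB))
		vma_is_cacheable = false;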

> There were some other thoughts on only allowing different attributes in
> KVM if the user counterpart does not have any mapping (e.g. fd-based
> KVM memory slots). However, this won't work well with CXL-attached
> memory for example where the VMM may want access to it (e.g. virtio). So
> I wouldn't spend more thoughts on this.

CXL is a different beast - struct-page-backed CXL memory should be
force-cacheable everywhere, just like normal system memory (it is
normal system memory). This is the kind of memory that might be mixed
with virtio.

However, some future CXL VFIO device memory should ideally give the VM
full flexibility to map it cacheable or not, to properly emulate bare
metal. An approach where we pre-select some of the CXL memory as
cacheable only does half the job.

I think providing a way for KVM to take in an FD instead of a VMA for
mapping memory could be interesting in general. We waste a fair number
of PTEs creating a 40GB linear map in the CPU just for KVM. Avoiding
the alias mapping is a nice side effect.
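(For scale, assuming a 4K granule: a 40GB mapping is about 10 million
PTEs, roughly 80MB of last-level page tables just for the alias, vs
~160KB if the same range could be covered by 2MB blocks.)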

We'd have to do something else for the SIGBUS stuff though.

> a) Relax the KVM Stage 2 mapping for vfio-pci to Normal NC
>    (pgprot_writecombine). TBD whether we need a VMM opt-in is at the

We are working on a patch for this and testing it; it isn't working
yet for some reason, still debugging.

> b) Map the KVM stage 2 mapping as cacheable+FWB iff the VMM has a
>    corresponding cacheable mapping.

That is this patch, with some improvements to check for FWB, etc.

Thanks,
Jason


