Re: [PATCH v3 1/1] KVM: arm64: Allow cacheable stage 2 mapping using VMA flags

linux-mm.kvack.org archive mirror

From: Oliver Upton <oliver.upton@linux.dev>
To: Jason Gunthorpe <jgg@nvidia.com>
Cc: Catalin Marinas <catalin.marinas@arm.com>,
	Ankit Agrawal <ankita@nvidia.com>,
	Sean Christopherson <seanjc@google.com>,
	Marc Zyngier <maz@kernel.org>,
	"joey.gouly@arm.com" <joey.gouly@arm.com>,
	"suzuki.poulose@arm.com" <suzuki.poulose@arm.com>,
	"yuzenghui@huawei.com" <yuzenghui@huawei.com>,
	"will@kernel.org" <will@kernel.org>,
	"ryan.roberts@arm.com" <ryan.roberts@arm.com>,
	"shahuang@redhat.com" <shahuang@redhat.com>,
	"lpieralisi@kernel.org" <lpieralisi@kernel.org>,
	"david@redhat.com" <david@redhat.com>,
	Aniket Agashe <aniketa@nvidia.com>, Neo Jia <cjia@nvidia.com>,
	Kirti Wankhede <kwankhede@nvidia.com>,
	"Tarun Gupta (SW-GPU)" <targupta@nvidia.com>,
	Vikram Sethi <vsethi@nvidia.com>,
	Andy Currid <acurrid@nvidia.com>,
	Alistair Popple <apopple@nvidia.com>,
	John Hubbard <jhubbard@nvidia.com>,
	Dan Williams <danw@nvidia.com>, Zhi Wang <zhiw@nvidia.com>,
	Matt Ochs <mochs@nvidia.com>, Uday Dhoke <udhoke@nvidia.com>,
	Dheeraj Nigam <dnigam@nvidia.com>,
	Krishnakant Jaju <kjaju@nvidia.com>,
	"alex.williamson@redhat.com" <alex.williamson@redhat.com>,
	"sebastianene@google.com" <sebastianene@google.com>,
	"coltonlewis@google.com" <coltonlewis@google.com>,
	"kevin.tian@intel.com" <kevin.tian@intel.com>,
	"yi.l.liu@intel.com" <yi.l.liu@intel.com>,
	"ardb@kernel.org" <ardb@kernel.org>,
	"akpm@linux-foundation.org" <akpm@linux-foundation.org>,
	"gshan@redhat.com" <gshan@redhat.com>,
	"linux-mm@kvack.org" <linux-mm@kvack.org>,
	"ddutile@redhat.com" <ddutile@redhat.com>,
	"tabba@google.com" <tabba@google.com>,
	"qperret@google.com" <qperret@google.com>,
	"kvmarm@lists.linux.dev" <kvmarm@lists.linux.dev>,
	"linux-kernel@vger.kernel.org" <linux-kernel@vger.kernel.org>,
	"linux-arm-kernel@lists.infradead.org"
	<linux-arm-kernel@lists.infradead.org>
Subject: Re: [PATCH v3 1/1] KVM: arm64: Allow cacheable stage 2 mapping using VMA flags
Date: Tue, 22 Apr 2025 14:28:18 -0700
Message-ID: <aAgJ8g8Gbb06quSM@linux.dev>
In-Reply-To: <20250422170324.GB1645809@nvidia.com>

On Tue, Apr 22, 2025 at 10:54:52AM -0300, Jason Gunthorpe wrote:
> On Tue, Apr 22, 2025 at 12:49:28AM -0700, Oliver Upton wrote:
> > The reality is that userspace is an equal participant in remaining coherent with
> > the guest. Whether or not FWB is employed for a particular region of IPA
> > space is useful information for userspace deciding what it needs to do to access guest
> > memory. Ignoring the Nvidia widget for a second, userspace also needs to know this for
> > 'normal', kernel-managed memory so it understands what CMOs may be necessary when (for
> > example) doing live migration of the VM.
> 
> Really? How does it work today then? Is this another existing problem?
> Userspace is doing CMOs during live migration that are not necessary?

Yes, this is a pre-existing problem. I'm not aware of a live migration
implementation that handles !S2FWB correctly; the ones I know of assume
all guest accesses are done through a cacheable alias.

So, if a VMM wants to do migration of VMs on !S2FWB correctly, it'd
probably want to know it can elide CMOs on something that actually bears
the feature.
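
To make the CMO point concrete, here is a minimal sketch in plain C. It is
not KVM code; every name in it is hypothetical. It assumes the usual
attribute combining rule: without FWB the weaker of the guest stage-1 and
stage-2 attributes wins, while with FWB a Write-Back stage-2 entry forces
Write-Back regardless of guest stage-1. Since the VMM cannot see guest
stage-1, it can only skip CMOs when every guest choice ends up Write-Back.

```c
#include <assert.h>
#include <stdbool.h>

/* Simplified two-level view of memory attributes (an assumption). */
enum mem_attr { MEM_ATTR_DEVICE_OR_NC, MEM_ATTR_WB };

/* Attribute the guest's accesses actually use for a region. */
enum mem_attr effective_guest_attr(bool s2fwb, enum mem_attr guest_s1,
				   enum mem_attr s2)
{
	/* With FWB, a Write-Back stage-2 entry overrides guest stage-1. */
	if (s2fwb && s2 == MEM_ATTR_WB)
		return MEM_ATTR_WB;
	/* Otherwise the weaker of the two attributes wins. */
	if (guest_s1 == MEM_ATTR_WB && s2 == MEM_ATTR_WB)
		return MEM_ATTR_WB;
	return MEM_ATTR_DEVICE_OR_NC;
}

/*
 * A VMM reading guest memory through its own cacheable alias may skip
 * CMOs only if no guest stage-1 choice can yield a non-cacheable access.
 */
bool vmm_may_elide_cmos(bool s2fwb, enum mem_attr s2)
{
	static const enum mem_attr guest_choices[] = {
		MEM_ATTR_DEVICE_OR_NC, MEM_ATTR_WB,
	};
	unsigned int i;

	for (i = 0; i < sizeof(guest_choices) / sizeof(guest_choices[0]); i++) {
		if (effective_guest_attr(s2fwb, guest_choices[i], s2) !=
		    MEM_ATTR_WB)
			return false;
	}
	return true;
}

/* Usage: only the FWB + Write-Back case lets the VMM skip CMOs. */
void cmo_example(void)
{
	assert(vmm_may_elide_cmos(true, MEM_ATTR_WB));
	assert(!vmm_may_elide_cmos(false, MEM_ATTR_WB));
}
```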

> >  - The memslot flag says userspace expects a particular GFN range to guarantee
> >    Write-Back semantics. This can be applied to 'normal', kernel-managed memory
> >    and PFNMAP thingies that have cacheable attributes at host stage-1.
> 
> Userspace doesn't actually know if it has a cachable mapping from VFIO
> though :(

That seems like a shortcoming on the VFIO side, not a KVM issue. What if
userspace wants to do atomics on some VFIO mapping, doesn't it need to
know that it has something with WB?

> I don't really see a point in this. If the KVM has the cap then
> userspace should assume the S2FWB behavior for all cachable memslots.

Wait, so userspace simultaneously doesn't know the cacheability at host
stage-1 but *does* for stage-2? This is why I contend that userspace
needs a mechanism to discover the memory attributes on a given memslot.
Without it there's no way of knowing what's a cacheable memslot.

Along those lines, how is the VMM going to describe that cacheable
PFNMAP region to the guest?

> What should happen if you have S2FWB but don't pass the flag? For
> normal kernel memory it should still use S2FWB. Thus for cachable
> PFNMAP it makes sense that it should also still use S2FWB without the
> flag?

For kernel-managed memory, I agree. Accepting the flag for a memslot
containing such memory would solely be for discoverability.

OTOH, cacheable PFNMAP is a new feature and I see no issue compelling
the use of a new bit with it.

On Tue, Apr 22, 2025 at 02:03:24PM -0300, Jason Gunthorpe wrote:
> On Tue, Apr 22, 2025 at 05:50:32PM +0100, Catalin Marinas wrote:
> 
> > So, for the above, the VMM needs to know that it somehow got into such
> > situation. If it knows the device (VFIO) capabilities and that the user
> > mapping is Cacheable, coupled with the new KVM CAP, it can infer that
> > Stage 2 will be S2FWB, no need for a memory slot flag.
> 
> So long as the memslot creation fails for cachable PFNMAP without
> S2FWB the VMM is fine. qemu will begin its first steps to startup the
> migration destination and immediately fail. The migration will be
> aborted before it even gets started on the source side.
> 
> As I said before, the present situation requires the site's
> orchestration to manage compatibility for live migration of VFIO
> devices. We only expect that the migration will abort early if the
> site has made a configuration error.
> 
> > have such information, maybe a new memory slot flag can be used to probe
> > what Stage 2 mapping is going to be: ask for KVM_MEM_PFNMAP_WB. If it
> > fails, Stage 2 is Device/NC and can attempt again with the WB flag.
> > It's a bit of a stretch for the KVM API but IIUC there's no option to
> > query the properties of a memory slot.
> 
> I don't know of any use case for something like this. If VFIO gives
> the VMM a cachable mapping there is no fallback to WB.
> 
> The operator could use a different VFIO device, one that doesn't need
> cachable, but the VMM can't flip the VFIO device between modes on the
> fly.

I agree with you that in the context of a VFIO device userspace doesn't
have any direct influence on the resulting memory attributes.

The entire reason I'm dragging my feet about this is I'm concerned we've
papered over the complexity of memory attributes (regardless of
provenance) for way too long. KVM's done enough to make this dance 'work'
in the context of kernel-managed memory, but adding more implicit KVM
behavior for cacheable thingies makes the KVM UAPI even more
unintelligible (as if it weren't already).

So this flag isn't about giving userspace any degree of control over
memory attributes. It's just a way to know that things it _expects_ to be
treated as cacheable are guaranteed to use cacheable attributes in
the VM.
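
A rough model of the semantics being argued for, as a sketch only. None of
these names exist in the KVM UAPI. Two behaviours are assumptions not
settled in this thread: memslot creation fails with -EINVAL when the flag is
set but Write-Back cannot be guaranteed, and cacheable PFNMAP without the
flag keeps today's Device/NC mapping.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>

enum backing { BACKING_KERNEL_MANAGED, BACKING_PFNMAP };
enum s2_attr { S2_DEVICE_OR_NC, S2_NORMAL_WB, S2_FORCED_WB };

/* Hypothetical view of a memslot request. */
struct memslot_req {
	enum backing backing;
	bool host_s1_cacheable;	/* cacheable attributes at host stage-1 */
	bool want_wb;		/* the proposed "expect Write-Back" flag */
};

/* Returns 0 and fills *out, or -EINVAL if the guarantee can't be met. */
int resolve_s2_attr(const struct memslot_req *req, bool has_s2fwb,
		    enum s2_attr *out)
{
	bool can_guarantee = has_s2fwb && req->host_s1_cacheable;

	if (req->want_wb) {
		/* The flag is an expectation: meet it or refuse the slot. */
		if (!can_guarantee)
			return -EINVAL;
		*out = S2_FORCED_WB;
		return 0;
	}

	if (req->backing == BACKING_KERNEL_MANAGED)
		/* Kernel-managed memory uses S2FWB with or without the flag. */
		*out = has_s2fwb ? S2_FORCED_WB : S2_NORMAL_WB;
	else
		/* Assumed: PFNMAP without the flag stays Device/NC. */
		*out = S2_DEVICE_OR_NC;
	return 0;
}

/* Usage: cacheable PFNMAP with the flag, on hardware with S2FWB. */
void memslot_example(void)
{
	struct memslot_req req = { BACKING_PFNMAP, true, true };
	enum s2_attr attr;

	assert(resolve_s2_attr(&req, true, &attr) == 0);
	assert(attr == S2_FORCED_WB);
}
```

The flag never selects an attribute by itself in this model. It only turns
a silent fallback into an explicit failure, which is the discoverability
property described above.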

Thanks,
Oliver

Thread overview: 61+ messages
2025-03-10 10:30 [PATCH v3 0/1] KVM: arm64: Map GPU device memory as cacheable ankita
2025-03-10 10:30 ` [PATCH v3 1/1] KVM: arm64: Allow cacheable stage 2 mapping using VMA flags ankita
2025-03-10 11:54   ` Marc Zyngier
2025-03-11  3:42     ` Ankit Agrawal
2025-03-11 11:18       ` Marc Zyngier
2025-03-11 12:07         ` Ankit Agrawal
2025-03-12  8:21           ` Marc Zyngier
2025-03-17  5:55             ` Ankit Agrawal
2025-03-17  9:27               ` Marc Zyngier
2025-03-17 19:54                 ` Catalin Marinas
2025-03-18  9:39                   ` Marc Zyngier
2025-03-18 12:55                     ` Jason Gunthorpe
2025-03-18 19:27                       ` Catalin Marinas
2025-03-18 19:35                         ` David Hildenbrand
2025-03-18 19:40                           ` Oliver Upton
2025-03-20  3:30                             ` bibo mao
2025-03-20  7:24                               ` bibo mao
2025-03-18 23:17                         ` Jason Gunthorpe
2025-03-19 18:03                           ` Catalin Marinas
2025-03-18 19:30                       ` Oliver Upton
2025-03-18 23:09                         ` Jason Gunthorpe
2025-03-19  7:01                           ` Oliver Upton
2025-03-19 17:04                             ` Jason Gunthorpe
2025-03-19 18:11                               ` Catalin Marinas
2025-03-19 19:22                                 ` Jason Gunthorpe
2025-03-19 21:48                                   ` Catalin Marinas
2025-03-26  8:31                                     ` Ankit Agrawal
2025-03-26 14:53                                       ` Sean Christopherson
2025-03-26 15:42                                         ` Marc Zyngier
2025-03-26 16:10                                           ` Sean Christopherson
2025-03-26 18:02                                             ` Marc Zyngier
2025-03-26 18:24                                               ` Sean Christopherson
2025-03-26 18:51                                                 ` Oliver Upton
2025-03-31 14:44                                                   ` Jason Gunthorpe
2025-03-31 14:56                                                 ` Jason Gunthorpe
2025-04-07 15:20                                                   ` Sean Christopherson
2025-04-07 16:15                                                     ` Jason Gunthorpe
2025-04-07 16:43                                                       ` Sean Christopherson
2025-04-16  8:51                                                         ` Ankit Agrawal
2025-04-21 16:03                                                           ` Ankit Agrawal
2025-04-22  7:49                                                           ` Oliver Upton
2025-04-22 13:54                                                             ` Jason Gunthorpe
2025-04-22 16:50                                                               ` Catalin Marinas
2025-04-22 17:03                                                                 ` Jason Gunthorpe
2025-04-22 21:28                                                                   ` Oliver Upton [this message]
2025-04-22 23:35                                                                     ` Jason Gunthorpe
2025-04-23 10:45                                                                       ` Catalin Marinas
2025-04-23 12:02                                                                         ` Jason Gunthorpe
2025-04-23 12:26                                                                           ` Catalin Marinas
2025-04-23 13:03                                                                             ` Jason Gunthorpe
2025-04-29 10:47                                                                               ` Ankit Agrawal
2025-04-29 13:27                                                                                 ` Catalin Marinas
2025-04-29 14:14                                                                                   ` Jason Gunthorpe
2025-04-29 16:03                                                                                     ` Catalin Marinas
2025-04-29 16:44                                                                                       ` Jason Gunthorpe
2025-04-29 18:09                                                                                         ` Catalin Marinas
2025-04-29 18:19                                                                                           ` Jason Gunthorpe
2025-05-07 15:26                                                                                             ` Ankit Agrawal
2025-05-09 12:47                                                                                               ` Catalin Marinas
2025-04-22 14:53                                                             ` Sean Christopherson
2025-03-18 12:57     ` Jason Gunthorpe
