Re: [PATCH v6 5/7] KVM: guest_memfd: Restore folio state after final folio_put()

From: Michael Roth <michael.roth@amd.com>
To: Fuad Tabba <tabba@google.com>
Cc: Vishal Annapurve <vannapurve@google.com>, <kvm@vger.kernel.org>,
	<linux-arm-msm@vger.kernel.org>, <linux-mm@kvack.org>,
	<pbonzini@redhat.com>, <chenhuacai@kernel.org>,
	<mpe@ellerman.id.au>, <anup@brainfault.org>,
	<paul.walmsley@sifive.com>, <palmer@dabbelt.com>,
	<aou@eecs.berkeley.edu>, <seanjc@google.com>,
	<viro@zeniv.linux.org.uk>, <brauner@kernel.org>,
	<willy@infradead.org>, <akpm@linux-foundation.org>,
	<xiaoyao.li@intel.com>, <yilun.xu@intel.com>,
	<chao.p.peng@linux.intel.com>, <jarkko@kernel.org>,
	<amoorthy@google.com>, <dmatlack@google.com>,
	<isaku.yamahata@intel.com>, <mic@digikod.net>, <vbabka@suse.cz>,
	<ackerleytng@google.com>, <mail@maciej.szmigiero.name>,
	<david@redhat.com>, <wei.w.wang@intel.com>,
	<liam.merwick@oracle.com>, <isaku.yamahata@gmail.com>,
	<kirill.shutemov@linux.intel.com>, <suzuki.poulose@arm.com>,
	<steven.price@arm.com>, <quic_eberman@quicinc.com>,
	<quic_mnalajal@quicinc.com>, <quic_tsoni@quicinc.com>,
	<quic_svaddagi@quicinc.com>, <quic_cvanscha@quicinc.com>,
	<quic_pderrin@quicinc.com>, <quic_pheragu@quicinc.com>,
	<catalin.marinas@arm.com>, <james.morse@arm.com>,
	<yuzenghui@huawei.com>, <oliver.upton@linux.dev>,
	<maz@kernel.org>, <will@kernel.org>, <qperret@google.com>,
	<keirf@google.com>, <roypat@amazon.co.uk>, <shuah@kernel.org>,
	<hch@infradead.org>, <jgg@nvidia.com>, <rientjes@google.com>,
	<jhubbard@nvidia.com>, <fvdl@google.com>, <hughd@google.com>,
	<jthoughton@google.com>, <peterx@redhat.com>
Subject: Re: [PATCH v6 5/7] KVM: guest_memfd: Restore folio state after final folio_put()
Date: Wed, 2 Apr 2025 17:17:39 -0500
Message-ID: <20250402221739.yqvuuiuxvvphgijd@amd.com>
In-Reply-To: <CA+EHjTwZaqX9Ab-XhFURn+Kn6OstN3PHNqUi_DxbHrQYBTa2KA@mail.gmail.com>

On Tue, Mar 25, 2025 at 03:57:00PM +0000, Fuad Tabba wrote:
> Hi Vishal,
> 
> 
> On Fri, 21 Mar 2025 at 20:09, Vishal Annapurve <vannapurve@google.com> wrote:
> >
> > On Tue, Mar 18, 2025 at 9:20 AM Fuad Tabba <tabba@google.com> wrote:
> > > ...
> > > +/*
> > > + * Callback function for __folio_put(), i.e., called once all references by the
> > > + * host to the folio have been dropped. This allows gmem to transition the state
> > > + * of the folio to shared with the guest, and allows the hypervisor to continue
> > > + * transitioning its state to private, since the host cannot attempt to access
> > > + * it anymore.
> > > + */
> > >  void kvm_gmem_handle_folio_put(struct folio *folio)
> > >  {
> > > -       WARN_ONCE(1, "A placeholder that shouldn't trigger. Work in progress.");
> > > +       struct address_space *mapping;
> > > +       struct xarray *shared_offsets;
> > > +       struct inode *inode;
> > > +       pgoff_t index;
> > > +       void *xval;
> > > +
> > > +       mapping = folio->mapping;
> > > +       if (WARN_ON_ONCE(!mapping))
> > > +               return;
> > > +
> > > +       inode = mapping->host;
> > > +       index = folio->index;
> > > +       shared_offsets = &kvm_gmem_private(inode)->shared_offsets;
> > > +       xval = xa_mk_value(KVM_GMEM_GUEST_SHARED);
> > > +
> > > +       filemap_invalidate_lock(inode->i_mapping);
> >
> > As discussed in the guest_memfd upstream, folio_put can happen from
> > atomic context [1], so we need a way to either defer the work outside
> > kvm_gmem_handle_folio_put() (which is very likely needed to handle
> > hugepages and merge operation) or ensure to execute the logic using
> > synchronization primitives that will not sleep.
> 
> Thanks for pointing this out. For now, rather than deferring (which
> we'll come to when hugepages come into play), I think this would be

FWIW, with SNP, it's only possible to unsplit an RMP entry if the guest
cooperates by re-validating/re-accepting the memory at a higher order.
Currently, this guest support is not implemented in Linux.

So, if we were to opportunistically unsplit hugepages, we'd zap the
mappings in KVM, let it fault in at a higher order so we could reduce
TLB misses, and then KVM would (via
kvm_x86_call(private_max_mapping_level)(kvm, pfn)) find that the RMP
entry is still split to 4K, and remap everything right back to the 4K
granularity it was already at to begin with.
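
To make that round trip concrete, here is a rough userspace model (not
KVM code). It assumes the level KVM ends up mapping at is simply the
smaller of the level the fault could use and the level the RMP entry
allows. The enum, the struct and both function names are hypothetical
stand-ins for this sketch, not the real kvm_x86 hook or RMP layout.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical mapping levels for this sketch; not the kernel's PG_LEVEL_* values. */
enum map_level { MAP_LEVEL_4K = 1, MAP_LEVEL_2M = 2 };

/* Toy model of an RMP entry: only tracks the size the guest validated it at. */
struct rmp_entry_model {
	enum map_level level;
};

/* Stand-in for the per-vendor "max private mapping level" query. */
static enum map_level private_max_level_model(const struct rmp_entry_model *rmp)
{
	assert(rmp != NULL);
	return rmp->level;
}

/*
 * The level a private fault ends up mapped at in this model: the level the
 * fault asked for, capped by what the RMP entry allows.
 */
static enum map_level effective_map_level(enum map_level fault_level,
					  const struct rmp_entry_model *rmp)
{
	enum map_level cap = private_max_level_model(rmp);

	return fault_level < cap ? fault_level : cap;
}
```

With an entry still split to 4K, a fault attempted at 2M comes back at
4K in this model, so the zap and refault buy nothing until the guest
re-accepts the range at the higher order.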

TDX seems to have a bit more flexibility in being able to
'unsplit'/promote private ranges back up to higher orders, so it could
potentially benefit from doing things opportunistically...

However, ideally...the guest would just avoid unnecessarily carving up
ranges to begin with and pack all its shared mappings into smaller GPA
ranges. Then, all this unsplitting of huge pages could be completely
avoided until cleanup/truncate time. So maybe even for hugepages we
should just plan to do things this way, at least as a start?
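
As a rough illustration of why packing helps (again a userspace toy,
with made-up names): count how many 2M-aligned GPA regions contain at
least one shared 4K page. Each such region can no longer be a single
private 2M mapping, so fewer regions means fewer splits. The sketch
assumes 4K base pages, 512 of them per 2M region, and a list of shared
guest frame numbers sorted in ascending order.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Assumption for this sketch: 4K base pages, so 512 frames per 2M region. */
#define GFNS_PER_2M 512u

/*
 * Count the distinct 2M-aligned regions that contain at least one shared
 * guest frame. The input must be sorted in ascending order.
 */
static size_t count_2m_regions_with_shared(const uint64_t *shared_gfns, size_t n)
{
	size_t regions = 0;
	uint64_t prev_region = 0;

	for (size_t i = 0; i < n; i++) {
		uint64_t region = shared_gfns[i] / GFNS_PER_2M;

		/* Enforce the sorted-input assumption stated above. */
		assert(i == 0 || shared_gfns[i] >= shared_gfns[i - 1]);

		/* A new region starts whenever the 2M index changes. */
		if (i == 0 || region != prev_region)
			regions++;
		prev_region = region;
	}

	return regions;
}
```

Four shared pages scattered across four different 2M regions touch four
regions, while the same four pages packed next to each other touch only
one.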

> possible to resolve by ensuring we have exclusive access* to the folio
> instead, and using that to ensure that we can access the
> shared_offsets maps.
> 
> * By exclusive access I mean either holding the folio lock, or knowing
> that no one else has references to the folio (which is the case when
> kvm_gmem_handle_folio_put() is called).
> 
> I'll try to respin something in time for folks to look at it before
> the next sync.

Thanks for posting. I was looking at how to get rid of
filemap_invalidate_lock() from the conversion path, and having that
separate rwlock seems to resolve a lot of the potential races I was
looking at. I'm working on rebasing SNP 2MB support on top of your v7
series now.

-Mike

> 
> Cheers,
> /fuad
> 
> > [1] https://elixir.bootlin.com/linux/v6.14-rc6/source/include/linux/mm.h#L1483
> >
> > > +       folio_lock(folio);
> > > +       kvm_gmem_restore_pending_folio(folio, inode);
> > > +       folio_unlock(folio);
> > > +       WARN_ON_ONCE(xa_err(xa_store(shared_offsets, index, xval, GFP_KERNEL)));
> > > +       filemap_invalidate_unlock(inode->i_mapping);
> > >  }
> > >  EXPORT_SYMBOL_GPL(kvm_gmem_handle_folio_put);
> > >
> > > --
> > > 2.49.0.rc1.451.g8f38331e32-goog
> > >
>
