From: Ackerley Tng <ackerleytng@google.com>
To: Fuad Tabba <tabba@google.com>
Cc: vbabka@suse.cz, kvm@vger.kernel.org,
linux-arm-msm@vger.kernel.org, linux-mm@kvack.org,
pbonzini@redhat.com, chenhuacai@kernel.org, mpe@ellerman.id.au,
anup@brainfault.org, paul.walmsley@sifive.com,
palmer@dabbelt.com, aou@eecs.berkeley.edu, seanjc@google.com,
viro@zeniv.linux.org.uk, brauner@kernel.org,
willy@infradead.org, akpm@linux-foundation.org,
xiaoyao.li@intel.com, yilun.xu@intel.com,
chao.p.peng@linux.intel.com, jarkko@kernel.org,
amoorthy@google.com, dmatlack@google.com,
yu.c.zhang@linux.intel.com, isaku.yamahata@intel.com,
mic@digikod.net, vannapurve@google.com,
mail@maciej.szmigiero.name, david@redhat.com,
michael.roth@amd.com, wei.w.wang@intel.com,
liam.merwick@oracle.com, isaku.yamahata@gmail.com,
kirill.shutemov@linux.intel.com, suzuki.poulose@arm.com,
steven.price@arm.com, quic_eberman@quicinc.com,
quic_mnalajal@quicinc.com, quic_tsoni@quicinc.com,
quic_svaddagi@quicinc.com, quic_cvanscha@quicinc.com,
quic_pderrin@quicinc.com, quic_pheragu@quicinc.com,
catalin.marinas@arm.com, james.morse@arm.com,
yuzenghui@huawei.com, oliver.upton@linux.dev, maz@kernel.org,
will@kernel.org, qperret@google.com, keirf@google.com,
roypat@amazon.co.uk, shuah@kernel.org, hch@infradead.org,
jgg@nvidia.com, rientjes@google.com, jhubbard@nvidia.com,
fvdl@google.com, hughd@google.com, jthoughton@google.com
Subject: Re: [RFC PATCH v5 06/15] KVM: guest_memfd: Handle final folio_put() of guestmem pages
Date: Thu, 06 Feb 2025 03:28:00 +0000
Message-ID: <diqz4j17sqf3.fsf@ackerleytng-ctop.c.googlers.com>
In-Reply-To: <CA+EHjTyToQEHoKOQLgDxdjTCCvawrtS8czsZYLehRO1N_Ph2EQ@mail.gmail.com> (message from Fuad Tabba on Thu, 23 Jan 2025 11:00:28 +0000)

Fuad Tabba <tabba@google.com> writes:
> On Wed, 22 Jan 2025 at 22:24, Ackerley Tng <ackerleytng@google.com> wrote:
>>
>> Fuad Tabba <tabba@google.com> writes:
>>
>> >> > <snip>
>> >> >
>> >> > +/*
>> >> > + * Registers a callback to __folio_put(), so that gmem knows that the host does
>> >> > + * not have any references to the folio. It does that by setting the folio type
>> >> > + * to guestmem.
>> >> > + *
>> >> > + * Returns 0 if the host doesn't have any references, or -EAGAIN if the host
>> >> > + * has references, and the callback has been registered.
>> >>
>> >> Note this comment.
>> >>
>> >> > + *
>> >> > + * Must be called with the following locks held:
>> >> > + * - filemap (inode->i_mapping) invalidate_lock
>> >> > + * - folio lock
>> >> > + */
>> >> > +static int __gmem_register_callback(struct folio *folio, struct inode *inode, pgoff_t idx)
>> >> > +{
>> >> > + struct xarray *mappable_offsets = &kvm_gmem_private(inode)->mappable_offsets;
>> >> > + void *xval_guest = xa_mk_value(KVM_GMEM_GUEST_MAPPABLE);
>> >> > + int refcount;
>> >> > +
>> >> > + rwsem_assert_held_write_nolockdep(&inode->i_mapping->invalidate_lock);
>> >> > + WARN_ON_ONCE(!folio_test_locked(folio));
>> >> > +
>> >> > + if (folio_mapped(folio) || folio_test_guestmem(folio))
>> >> > + return -EAGAIN;
>> >>
>> >> But here we return -EAGAIN and no callback was registered?
>> >
>> > This is intentional. If the folio is still mapped (i.e., its mapcount
>> > is elevated), then we cannot register the callback yet, so the
>> > host/vmm needs to unmap first, then try again. That said, I see the
>> > problem with the comment above, and I will clarify this.
>> >
>> >> > +
>> >> > + /* Register a callback first. */
>> >> > + __folio_set_guestmem(folio);
>> >> > +
>> >> > + /*
>> >> > + * Check for references after setting the type to guestmem, to guard
>> >> > + * against potential races with the refcount being decremented later.
>> >> > + *
>> >> > + * At least one reference is expected because the folio is locked.
>> >> > + */
>> >> > +
>> >> > + refcount = folio_ref_sub_return(folio, folio_nr_pages(folio));
>> >> > + if (refcount == 1) {
>> >> > + int r;
>> >> > +
>> >> > + /* refcount isn't elevated, it's now faultable by the guest. */
>> >>
>> >> Again this seems racy, somebody could have just speculatively increased it.
>> >> Maybe we need to freeze here as well?
>> >
>> > A speculative increase here is ok I think (famous last words). The
>> > callback was registered before the check, therefore, such an increase
>> > would trigger the callback.
>> >
>> > Thanks,
>> > /fuad
>> >
>> >
>>
>> I checked the callback (kvm_gmem_handle_folio_put()) and agree with you
>> that the mappability reset to KVM_GMEM_GUEST_MAPPABLE is handled
>> correctly (since kvm_gmem_handle_folio_put() doesn't assume anything
>> about the mappability state at callback-time).
>>
>> However, what if the new speculative reference writes to the page and
>> the guest goes on to fault in/use the page?
>
> I don't think that's a problem. At this point the page is in a
> transient state, but it is still shared from the guest's point of view.
> Moreover, no one can fault the page in at the host at this point (we
> check in kvm_gmem_fault()).
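
(To check my understanding of "we check in kvm_gmem_fault()": I read that
as a check along the lines of the fragment below. This is my paraphrase
based on this series' gmem_is_mappable(), not the literal fault handler,
so the exact placement and return value are my assumptions:)

	/* In kvm_gmem_fault(), paraphrased: */
	struct inode *inode = file_inode(vmf->vma->vm_file);

	if (!gmem_is_mappable(inode, vmf->pgoff))
		return VM_FAULT_SIGBUS;	/* host may not fault the page in */
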
>
> Let's have a look at the code:
>
> +static int __gmem_register_callback(struct folio *folio, struct inode *inode, pgoff_t idx)
> +{
> + struct xarray *mappable_offsets = &kvm_gmem_private(inode)->mappable_offsets;
> + void *xval_guest = xa_mk_value(KVM_GMEM_GUEST_MAPPABLE);
> + int refcount;
>
> At this point the guest still perceives the page as shared, and the
> state of the page is KVM_GMEM_NONE_MAPPABLE (a transient state). This
> means that kvm_gmem_fault() no longer faults the page in at the host.
>
> + rwsem_assert_held_write_nolockdep(&inode->i_mapping->invalidate_lock);
> + WARN_ON_ONCE(!folio_test_locked(folio));
> +
> + if (folio_mapped(folio) || folio_test_guestmem(folio))
> + return -EAGAIN;
> +
> + /* Register a callback first. */
> + __folio_set_guestmem(folio);
>
> This (in addition to the NONE_MAPPABLE state) also ensures that
> kvm_gmem_fault() no longer faults the page in at the host.
>
> + /*
> + * Check for references after setting the type to guestmem, to guard
> + * against potential races with the refcount being decremented later.
> + *
> + * At least one reference is expected because the folio is locked.
> + */
> +
> + refcount = folio_ref_sub_return(folio, folio_nr_pages(folio));
> + if (refcount == 1) {
> + int r;
>
> At this point we know that guest_memfd has the only real reference.
> Speculative references AFAIK do not access the page itself.
> +
> + /* refcount isn't elevated, it's now faultable by the guest. */
> + r = WARN_ON_ONCE(xa_err(xa_store(mappable_offsets, idx, xval_guest, GFP_KERNEL)));
>
> Now it's safe, so let the guest know that it can map the page.
>
> + if (!r)
> + __kvm_gmem_restore_pending_folio(folio);
> +
> + return r;
> + }
> +
> + return -EAGAIN;
> +}
>
> Does this make sense, or did I miss something?
Thanks for explaining! I don't know enough to confirm/deny this, but I
agree that if speculative references don't access the page itself, this
works.

What if, over here, we just drop the refcount unconditionally and let
setting the mappability to GUEST happen in the folio_put() callback?
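
To make that concrete, a rough (untested) sketch of what I mean, reusing
the names from this patch. The success/-EAGAIN return convention would
need rethinking, since the xa_store() would now always happen in
kvm_gmem_handle_folio_put(), and idx would no longer be used here:

static int __gmem_register_callback(struct folio *folio, struct inode *inode,
				    pgoff_t idx)
{
	rwsem_assert_held_write_nolockdep(&inode->i_mapping->invalidate_lock);
	WARN_ON_ONCE(!folio_test_locked(folio));

	if (folio_mapped(folio) || folio_test_guestmem(folio))
		return -EAGAIN;

	/* Register the callback first, as before. */
	__folio_set_guestmem(folio);

	/*
	 * Unconditionally drop gmem's references. At least one reference
	 * remains because our caller holds the folio lock, so whoever drops
	 * the last reference (possibly our caller's folio_put()) goes
	 * through __folio_put() into kvm_gmem_handle_folio_put(), which
	 * already does the xa_store() of KVM_GMEM_GUEST_MAPPABLE and calls
	 * __kvm_gmem_restore_pending_folio().
	 */
	folio_ref_sub(folio, folio_nr_pages(folio));

	return 0;
}

The cost is that the mappability flip becomes asynchronous: on return,
callers could no longer assume the page is already guest-mappable.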
>
> Thanks!
> /fuad
>
>> >> > + r = WARN_ON_ONCE(xa_err(xa_store(mappable_offsets, idx, xval_guest, GFP_KERNEL)));
>> >> > + if (!r)
>> >> > + __kvm_gmem_restore_pending_folio(folio);
>> >> > +
>> >> > + return r;
>> >> > + }
>> >> > +
>> >> > + return -EAGAIN;
>> >> > +}
>> >> > +
>> >> > +int kvm_slot_gmem_register_callback(struct kvm_memory_slot *slot, gfn_t gfn)
>> >> > +{
>> >> > + unsigned long pgoff = slot->gmem.pgoff + gfn - slot->base_gfn;
>> >> > + struct inode *inode = file_inode(slot->gmem.file);
>> >> > + struct folio *folio;
>> >> > + int r;
>> >> > +
>> >> > + filemap_invalidate_lock(inode->i_mapping);
>> >> > +
>> >> > + folio = filemap_lock_folio(inode->i_mapping, pgoff);
>> >> > + if (WARN_ON_ONCE(IS_ERR(folio))) {
>> >> > + r = PTR_ERR(folio);
>> >> > + goto out;
>> >> > + }
>> >> > +
>> >> > + r = __gmem_register_callback(folio, inode, pgoff);
>> >> > +
>> >> > + folio_unlock(folio);
>> >> > + folio_put(folio);
>> >> > +out:
>> >> > + filemap_invalidate_unlock(inode->i_mapping);
>> >> > +
>> >> > + return r;
>> >> > +}
>> >> > +
>> >> > +/*
>> >> > + * Callback function for __folio_put(), i.e., called when all references by the
>> >> > + * host to the folio have been dropped. This allows gmem to transition the state
>> >> > + * of the folio to mappable by the guest, and allows the hypervisor to continue
>> >> > + * transitioning its state to private, since the host cannot attempt to access
>> >> > + * it anymore.
>> >> > + */
>> >> > +void kvm_gmem_handle_folio_put(struct folio *folio)
>> >> > +{
>> >> > + struct xarray *mappable_offsets;
>> >> > + struct inode *inode;
>> >> > + pgoff_t index;
>> >> > + void *xval;
>> >> > +
>> >> > + inode = folio->mapping->host;
>> >> > + index = folio->index;
>> >> > + mappable_offsets = &kvm_gmem_private(inode)->mappable_offsets;
>> >> > + xval = xa_mk_value(KVM_GMEM_GUEST_MAPPABLE);
>> >> > +
>> >> > + filemap_invalidate_lock(inode->i_mapping);
>> >> > + __kvm_gmem_restore_pending_folio(folio);
>> >> > + WARN_ON_ONCE(xa_err(xa_store(mappable_offsets, index, xval, GFP_KERNEL)));
>> >> > + filemap_invalidate_unlock(inode->i_mapping);
>> >> > +}
>> >> > +
>> >> > static bool gmem_is_mappable(struct inode *inode, pgoff_t pgoff)
>> >> > {
>> >> > struct xarray *mappable_offsets = &kvm_gmem_private(inode)->mappable_offsets;
>> >>