From: Elliot Berman <quic_eberman@quicinc.com>
To: Patrick Roy <roypat@amazon.co.uk>
Cc: Andrew Morton <akpm@linux-foundation.org>,
Paolo Bonzini <pbonzini@redhat.com>,
Sean Christopherson <seanjc@google.com>,
Fuad Tabba <tabba@google.com>,
David Hildenbrand <david@redhat.com>, <qperret@google.com>,
Ackerley Tng <ackerleytng@google.com>,
<linux-coco@lists.linux.dev>, <linux-arm-msm@vger.kernel.org>,
<linux-kernel@vger.kernel.org>, <linux-mm@kvack.org>,
<kvm@vger.kernel.org>, James Gowans <jgowans@amazon.com>,
"Kalyazin, Nikita" <kalyazin@amazon.co.uk>,
"Manwaring, Derek" <derekmn@amazon.com>,
"Cali, Marco" <xmarcalx@amazon.co.uk>
Subject: Re: [PATCH RFC 3/4] mm: guest_memfd: Add option to remove guest private memory from direct map
Date: Thu, 8 Aug 2024 15:16:09 -0700 [thread overview]
Message-ID: <20240808145103617-0700.eberman@hu-eberman-lv.qualcomm.com> (raw)
In-Reply-To: <7166d51c-7757-44f2-a6f8-36da3e86bf90@amazon.co.uk>
On Thu, Aug 08, 2024 at 02:05:55PM +0100, Patrick Roy wrote:
> On Wed, 2024-08-07 at 20:06 +0100, Elliot Berman wrote:
> >>>>>> struct folio *guest_memfd_grab_folio(struct file *file, pgoff_t index, u32 flags)
> >>>>>> {
> >>>>>> + unsigned long gmem_flags = (unsigned long)file->private_data;
> >>>>>> struct inode *inode = file_inode(file);
> >>>>>> struct guest_memfd_operations *ops = inode->i_private;
> >>>>>> struct folio *folio;
> >>>>>> @@ -43,6 +89,12 @@ struct folio *guest_memfd_grab_folio(struct file *file, pgoff_t index, u32 flags
> >>>>>> goto out_err;
> >>>>>> }
> >>>>>>
> >>>>>> + if (gmem_flags & GUEST_MEMFD_FLAG_NO_DIRECT_MAP) {
> >>>>>> + r = guest_memfd_folio_private(folio);
> >>>>>> + if (r)
> >>>>>> + goto out_err;
> >>>>>> + }
> >>>>>> +
> >>>>>
> >>>>> How does a caller of guest_memfd_grab_folio know whether a folio needs
> >>>>> to be removed from the direct map? E.g. how can a caller know ahead of
> >>>>> time whether guest_memfd_grab_folio will return a freshly allocated
> >>>>> folio (which thus needs to be removed from the direct map), vs a folio
> >>>>> that already exists and has been removed from the direct map (probably
> >>>>> fine to remove from direct map again), vs a folio that already exists
> >>>>> and is currently re-inserted into the direct map for whatever reason
> >>>>> (must not remove these from the direct map, as other parts of
> >>>>> KVM/userspace probably don't expect the direct map entries to disappear
> >>>>> from underneath them). I couldn't figure this one out for my series,
> >>>>> which is why I went with hooking into the PG_uptodate logic to always
> >>>>> remove direct map entries on freshly allocated folios.
> >>>>>
> >>>>
> >>>> gmem_flags come from the owner. If the caller (in non-CoCo case) wants
> >>
> >> Ah, oops, I got it mixed up with the new `flags` parameter.
> >>
> >>>> to restore the direct map right away, it'd have to be a direct
> >>>> operation. As an optimization, we could add option that asks for page in
> >>>> "shared" state. If allocating new page, we can return it right away
> >>>> without removing from direct map. If grabbing existing folio, it would
> >>>> try to do the private->shared conversion.
> >>
> >> My concern is more with the implicit shared->private conversion that
> >> happens on every call to guest_memfd_grab_folio (and thus
> >> kvm_gmem_get_pfn) when grabbing existing folios. If something else
> >> marked the folio as shared, then we cannot punch it out of the direct
> >> map again until that something is done using the folio (when working on
> >> my RFC, kvm_gmem_get_pfn was indeed called on existing folios that were
> >> temporarily marked shared, as I was seeing panics because of this). And
> >> if the folio is currently private, there's nothing to do. So either way,
> >> guest_memfd_grab_folio shouldn't touch the direct map entry for existing
> >> folios.
> >>
> >
> > What I did could be documented/commented better.
>
> No worries, thanks for taking the time to walk me through understanding
> it!
>
> > If ops->accessible() is *not* provided, all guest_memfd allocations will
> > immediately remove from direct map and treat them immediately like guest
> > private (goal is to match what KVM does today on tip).
>
> Ah, so if ops->accessible() is not provided, then there will never be
> any shared memory inside gmem (like today, where gmem doesn't support
> shared memory altogether), and thus there's no problems with just
> unconditionally doing set_direct_map_invalid_noflush in
> guest_memfd_grab_folio, because all existing folios already have their
> direct map entry removed. Got it!
>
> > If ops->accessible() is provided, then guest_memfd allocations start
> > as "shared" and KVM/Gunyah need to do the shared->private conversion
> > when they want to do the private conversion on the folio. "Shared" is
> > the default because that is effectively a no-op.
> > For the non-CoCo case you're interested in, we'd have the
> > ops->accessible() provided and we wouldn't pull out the direct map from
> > gpc.
>
> So in pKVM/Gunyah's case, guest memory starts as shared, and at some
> point the guest will issue a hypercall (or similar) to flip it to
> private, at which point it'll get removed from the direct map?
>
> That isn't really what we want for our case. We consider the folios as
> private straight away, as we do not let the guest control their state at
> all. Everything is always "accessible" to both KVM and userspace in the
> sense that they can just flip gfns to shared as they please without the
> guest having any say in it.
>
> I think we should untangle the behavior of guest_memfd_grab_folio from
> the presence of ops->accessible. E.g. instead of direct map removal
> being dependent on ops->accessible we should have some
> GRAB_FOLIO_RETURN_SHARED flag for gmem_flags, which is set for y'all,
> and not set for us (I don't think we should have a "call
> set_direct_map_invalid_noflush unconditionally in
> guest_memfd_grab_folio" mode at all, because if sharing gmem is
> supported, then that is broken, and if sharing gmem is not supported
> then only removing direct map entries for freshly allocated folios gets
> us the same result of "all folios never in the direct map" while
> avoiding some no-op direct map operations).
>
> Because we would still use ->accessible, albeit for us that would be
> more for bookkeeping along the lines of "which gfns does userspace
> currently require to be in the direct map?". I haven't completely
> thought it through, but what I could see working for us would be a pair
> of ioctls for marking ranges accessible/inaccessible, with
> "accessibility" stored in some xarray (somewhat like Fuad's patches, I
> guess? [1]).
>
> In a world where we have a "sharing refcount", the "make accessible"
> ioctl reinserts into the direct map (if needed), lifts the "sharings
> refcount" for each folio in the given gfn range, and marks the range as
> accessible. And the "make inaccessible" ioctl would first check that
> userspace has unmapped all those gfns again, and if yes, mark them as
> inaccessible, drop the "sharings refcount" by 1 for each, and removes
> from the direct map again if it held the last reference (if userspace
> still has some gfns mapped, the ioctl would just fail).
>
I am warming up to the sharing refcount idea. How would the sharing
refcount look for the KVM gpc (gfn_to_pfn_cache) case?
> I guess for pKVM/Gunyah, there wouldn't be userspace ioctls, but instead
> the above would happen in handlers for share/unshare hypercalls. But the
> overall flow would be similar. The only difference is the default state
> of guest memory (shared for you, private for us). You want a
> guest_memfd_grab_folio that essentially returns folios with "sharing
> refcount == 1" (and thus present in the direct map), while we want the
> opposite.
>
> So I think something like the following should work for both of us
> (modulo some error handling):
>
> static struct folio *__kvm_gmem_get_folio(struct file *file, pgoff_t index, bool prepare, bool *fresh)
> {
>         // as today's kvm_gmem_get_folio, except
>         ...
>         if (!folio_test_uptodate(folio)) {
>                 ...
>                 if (fresh)
>                         *fresh = true;
>         }
>         ...
> }
>
> struct folio *kvm_gmem_get_folio(struct file *file, pgoff_t index, bool prepare)
> {
>         bool fresh = false;
>         unsigned long gmem_flags = /* ... */;
>         struct folio *folio = __kvm_gmem_get_folio(file, index, prepare, &fresh);
>
>         if (gmem_flags & GRAB_FOLIO_RETURN_SHARED) {
>                 // if "sharing refcount == 0", inserts back into direct map and lifts refcount, otherwise just lifts refcount
>                 guest_memfd_folio_clear_private(folio);
>         } else {
>                 if (fresh)
>                         guest_memfd_folio_private(folio);
>         }
>         return folio;
> }
>
> Now, thinking ahead, there's probably optimizations here where we defer
> the direct map manipulations to gmem_fault, at which point having a
> guest_memfd_grab_folio that doesn't remove direct map entries for fresh
> folios would be useful in our non-CoCo usecase too. But that should also
> be easily achievable by maybe having a flag to kvm_gmem_get_folio that
> forces the behavior of GRAB_FOLIO_RETURN_SHARED, independently of whether
> GRAB_FOLIO_RETURN_SHARED is set in gmem_flags.
>
> How does that sound to you?
>
Yeah, I think this is a good idea.
I'm also thinking of making a few tweaks to the ops structure:

struct guest_memfd_operations {
	int (*invalidate_begin)(struct inode *inode, pgoff_t offset, unsigned long nr);
	void (*invalidate_end)(struct inode *inode, pgoff_t offset, unsigned long nr);
	int (*prepare_accessible)(struct inode *inode, struct folio *folio);
	int (*prepare_private)(struct inode *inode, struct folio *folio);
	int (*release)(struct inode *inode);
};
When grabbing a folio, we'd always call either prepare_accessible() or
prepare_private(), based on GRAB_FOLIO_RETURN_SHARED. In the
prepare_private() case, guest_memfd can also ensure the folio is
unmapped and not pinned. If userspace tries to grab the folio in the
pKVM/Gunyah case, prepare_accessible() will fail and grab_folio returns
an error. There's a lot of detail I'm glossing over, but I hope this
gives a brief idea of the direction I was thinking.
In some cases, prepare_accessible() and the invalidate_*() functions
might effectively be the same thing, except that invalidate_*() could
operate on a range larger than a folio. That would be useful because we
might offer an optimization to reclaim a batch of pages rather than,
e.g., flushing caches for every page.
Thanks,
Elliot