From: Elliot Berman <quic_eberman@quicinc.com>
To: Patrick Roy <roypat@amazon.co.uk>
Cc: Andrew Morton <akpm@linux-foundation.org>,
Paolo Bonzini <pbonzini@redhat.com>,
Sean Christopherson <seanjc@google.com>,
Fuad Tabba <tabba@google.com>,
David Hildenbrand <david@redhat.com>, <qperret@google.com>,
Ackerley Tng <ackerleytng@google.com>,
<linux-coco@lists.linux.dev>, <linux-arm-msm@vger.kernel.org>,
<linux-kernel@vger.kernel.org>, <linux-mm@kvack.org>,
<kvm@vger.kernel.org>, James Gowans <jgowans@amazon.com>,
"Kalyazin, Nikita" <kalyazin@amazon.co.uk>,
"Manwaring, Derek" <derekmn@amazon.com>,
"Cali, Marco" <xmarcalx@amazon.co.uk>
Subject: Re: [PATCH RFC 3/4] mm: guest_memfd: Add option to remove guest private memory from direct map
Date: Thu, 8 Aug 2024 15:16:09 -0700 [thread overview]
Message-ID: <20240808145103617-0700.eberman@hu-eberman-lv.qualcomm.com> (raw)
In-Reply-To: <7166d51c-7757-44f2-a6f8-36da3e86bf90@amazon.co.uk>
On Thu, Aug 08, 2024 at 02:05:55PM +0100, Patrick Roy wrote:
> On Wed, 2024-08-07 at 20:06 +0100, Elliot Berman wrote:
> >>>>>> struct folio *guest_memfd_grab_folio(struct file *file, pgoff_t index, u32 flags)
> >>>>>> {
> >>>>>> + unsigned long gmem_flags = (unsigned long)file->private_data;
> >>>>>> struct inode *inode = file_inode(file);
> >>>>>> struct guest_memfd_operations *ops = inode->i_private;
> >>>>>> struct folio *folio;
> >>>>>> @@ -43,6 +89,12 @@ struct folio *guest_memfd_grab_folio(struct file *file, pgoff_t index, u32 flags
> >>>>>> goto out_err;
> >>>>>> }
> >>>>>>
> >>>>>> + if (gmem_flags & GUEST_MEMFD_FLAG_NO_DIRECT_MAP) {
> >>>>>> + r = guest_memfd_folio_private(folio);
> >>>>>> + if (r)
> >>>>>> + goto out_err;
> >>>>>> + }
> >>>>>> +
> >>>>>
> >>>>> How does a caller of guest_memfd_grab_folio know whether a folio needs
> >>>>> to be removed from the direct map? E.g. how can a caller know ahead of
> >>>>> time whether guest_memfd_grab_folio will return a freshly allocated
> >>>>> folio (which thus needs to be removed from the direct map), vs a folio
> >>>>> that already exists and has been removed from the direct map (probably
> >>>>> fine to remove from direct map again), vs a folio that already exists
> >>>>> and is currently re-inserted into the direct map for whatever reason
> >>>>> (must not remove these from the direct map, as other parts of
> >>>>> KVM/userspace probably don't expect the direct map entries to disappear
> >>>>> from underneath them). I couldn't figure this one out for my series,
> >>>>> which is why I went with hooking into the PG_uptodate logic to always
> >>>>> remove direct map entries on freshly allocated folios.
> >>>>>
> >>>>
> >>>> gmem_flags come from the owner. If the caller (in non-CoCo case) wants
> >>
> >> Ah, oops, I got it mixed up with the new `flags` parameter.
> >>
> >>>> to restore the direct map right away, it'd have to be a direct
> >>>> operation. As an optimization, we could add option that asks for page in
> >>>> "shared" state. If allocating new page, we can return it right away
> >>>> without removing from direct map. If grabbing existing folio, it would
> >>>> try to do the private->shared conversion.
> >>
> >> My concern is more with the implicit shared->private conversion that
> >> happens on every call to guest_memfd_grab_folio (and thus
> >> kvm_gmem_get_pfn) when grabbing existing folios. If something else
> >> marked the folio as shared, then we cannot punch it out of the direct
> >> map again until that something is done using the folio (when working on
> >> my RFC, kvm_gmem_get_pfn was indeed called on existing folios that were
> >> temporarily marked shared, as I was seeing panics because of this). And
> >> if the folio is currently private, there's nothing to do. So either way,
> >> guest_memfd_grab_folio shouldn't touch the direct map entry for existing
> >> folios.
> >>
> >
> > What I did could be documented/commented better.
>
> No worries, thanks for taking the time to walk me through understanding
> it!
>
> > If ops->accessible() is *not* provided, all guest_memfd allocations will
> > immediately remove from direct map and treat them immediately like guest
> > private (goal is to match what KVM does today on tip).
>
> Ah, so if ops->accessible() is not provided, then there will never be
> any shared memory inside gmem (like today, where gmem doesn't support
> shared memory altogether), and thus there's no problems with just
> unconditionally doing set_direct_map_invalid_noflush in
> guest_memfd_grab_folio, because all existing folios already have their
> direct map entry removed. Got it!
>
> > If ops->accessible() is provided, then guest_memfd allocations start
> > as "shared" and KVM/Gunyah need to do the shared->private conversion
> > when they want to do the private conversion on the folio. "Shared" is
> > the default because that is effectively a no-op.
> > For the non-CoCo case you're interested in, we'd have the
> > ops->accessible() provided and we wouldn't pull out the direct map from
> > gpc.
>
> So in pKVM/Gunyah's case, guest memory starts as shared, and at some
> point the guest will issue a hypercall (or similar) to flip it to
> private, at which point it'll get removed from the direct map?
>
> That isn't really what we want for our case. We consider the folios as
> private straight away, as we do not let the guest control their state at
> all. Everything is always "accessible" to both KVM and userspace in the
> sense that they can just flip gfns to shared as they please without the
> guest having any say in it.
>
> I think we should untangle the behavior of guest_memfd_grab_folio from
> the presence of ops->accessible. E.g. instead of direct map removal
> being dependent on ops->accessible we should have some
> GRAB_FOLIO_RETURN_SHARED flag for gmem_flags, which is set for y'all,
> and not set for us (I don't think we should have a "call
> set_direct_map_invalid_noflush unconditionally in
> guest_memfd_grab_folio" mode at all, because if sharing gmem is
> supported, then that is broken, and if sharing gmem is not supported
> then only removing direct map entries for freshly allocated folios gets
> us the same result of "all folios never in the direct map" while
> avoiding some no-op direct map operations).
>
> Because we would still use ->accessible, albeit for us that would be
> more for bookkeeping along the lines of "which gfns does userspace
> currently require to be in the direct map?". I haven't completely
> thought it through, but what I could see working for us would be a pair
> of ioctls for marking ranges accessible/inaccessible, with
> "accessibility" stored in some xarray (somewhat like Fuad's patches, I
> guess? [1]).
>
> In a world where we have a "sharing refcount", the "make accessible"
> ioctl reinserts into the direct map (if needed), lifts the "sharings
> refcount" for each folio in the given gfn range, and marks the range as
> accessible. And the "make inaccessible" ioctl would first check that
> userspace has unmapped all those gfns again, and if yes, mark them as
> inaccessible, drop the "sharings refcount" by 1 for each, and removes
> from the direct map again if it held the last reference (if userspace
> still has some gfns mapped, the ioctl would just fail).
>
I am warming up to the sharing refcount idea. How would the sharing
refcount look for the KVM gpc (gfn_to_pfn_cache) case?
> I guess for pKVM/Gunyah, there wouldn't be userspace ioctls, but instead
> the above would happen in handlers for share/unshare hypercalls. But the
> overall flow would be similar. The only difference is the default state
> of guest memory (shared for you, private for us). You want a
> guest_memfd_grab_folio that essentially returns folios with "sharing
> refcount == 1" (and thus present in the direct map), while we want the
> opposite.
>
> So I think something like the following should work for both of us
> (modulo some error handling):
>
> static struct folio *__kvm_gmem_get_folio(struct file *file, pgoff_t index, bool prepare, bool *fresh)
> {
>         // as today's kvm_gmem_get_folio, except
>         ...
>         if (!folio_test_uptodate(folio)) {
>                 ...
>                 if (fresh)
>                         *fresh = true;
>         }
>         ...
> }
>
> struct folio *kvm_gmem_get_folio(struct file *file, pgoff_t index, bool prepare)
> {
>         bool fresh = false;
>         unsigned long gmem_flags = /* ... */;
>         struct folio *folio = __kvm_gmem_get_folio(file, index, prepare, &fresh);
>
>         if (gmem_flags & GRAB_FOLIO_RETURN_SHARED) {
>                 // if "sharing refcount == 0", inserts back into direct map and lifts refcount, otherwise just lifts refcount
>                 guest_memfd_folio_clear_private(folio);
>         } else {
>                 if (fresh)
>                         guest_memfd_folio_private(folio);
>         }
>         return folio;
> }
>
> Now, thinking ahead, there's probably optimizations here where we defer
> the direct map manipulations to gmem_fault, at which point having a
> guest_memfd_grab_folio that doesn't remove direct map entries for fresh
> folios would be useful in our non-CoCo usecase too. But that should also
> be easily achievable by maybe having a flag to kvm_gmem_get_folio that
> forces the behavior of GRAB_FOLIO_RETURN_SHARED, independently of whether
> GRAB_FOLIO_RETURN_SHARED is set in gmem_flags.
>
> How does that sound to you?
>
Yeah, I think this is a good idea.
I'm also thinking of making a few tweaks to the ops structure:

struct guest_memfd_operations {
	int (*invalidate_begin)(struct inode *inode, pgoff_t offset, unsigned long nr);
	void (*invalidate_end)(struct inode *inode, pgoff_t offset, unsigned long nr);
	int (*prepare_accessible)(struct inode *inode, struct folio *folio);
	int (*prepare_private)(struct inode *inode, struct folio *folio);
	int (*release)(struct inode *inode);
};
When grabbing a folio, we'd always call either prepare_accessible() or
prepare_private(), based on GRAB_FOLIO_RETURN_SHARED. In the
prepare_private() case, guest_memfd can also ensure the folio is
unmapped and not pinned. If userspace tries to grab the folio in the
pKVM/Gunyah case, prepare_accessible() will fail and grab_folio returns
an error. There's a lot of detail I'm glossing over, but I hope this
gives a brief idea of the direction I was thinking.
In some cases, prepare_accessible() and the invalidate_*() functions
might effectively be the same thing, except that invalidate_*() could
operate on a range larger than a folio. That would be useful because we
might offer an optimization to reclaim a batch of pages rather than,
e.g., flushing caches for every page.
Thanks,
Elliot