linux-mm.kvack.org archive mirror
 help / color / mirror / Atom feed
From: Patrick Roy <roypat@amazon.co.uk>
To: Sean Christopherson <seanjc@google.com>,
	Ackerley Tng <ackerleytng@google.com>
Cc: <quic_eberman@quicinc.com>, <tabba@google.com>, <jgg@nvidia.com>,
	<peterx@redhat.com>, <david@redhat.com>, <rientjes@google.com>,
	<fvdl@google.com>, <jthoughton@google.com>, <pbonzini@redhat.com>,
	<zhiquan1.li@intel.com>, <fan.du@intel.com>, <jun.miao@intel.com>,
	<isaku.yamahata@intel.com>, <muchun.song@linux.dev>,
	<mike.kravetz@oracle.com>, <erdemaktas@google.com>,
	<vannapurve@google.com>, <qperret@google.com>,
	<jhubbard@nvidia.com>, <willy@infradead.org>, <shuah@kernel.org>,
	<brauner@kernel.org>, <bfoster@redhat.com>,
	<kent.overstreet@linux.dev>, <pvorel@suse.cz>, <rppt@kernel.org>,
	<richard.weiyang@gmail.com>, <anup@brainfault.org>,
	<haibo1.xu@intel.com>, <ajones@ventanamicro.com>,
	<vkuznets@redhat.com>, <maciej.wieczor-retman@intel.com>,
	<pgonda@google.com>, <oliver.upton@linux.dev>,
	<linux-kernel@vger.kernel.org>, <linux-mm@kvack.org>,
	<kvm@vger.kernel.org>, <linux-kselftest@vger.kernel.org>,
	<linux-fsdevel@kvack.org>, <jgowans@amazon.com>,
	<kalyazin@amazon.co.uk>, <derekmn@amazon.com>
Subject: Re: [RFC PATCH 30/39] KVM: guest_memfd: Handle folio preparation for guest_memfd mmap
Date: Thu, 10 Oct 2024 17:21:11 +0100	[thread overview]
Message-ID: <6bca3ad4-3eca-4a75-a775-5f8b0467d7a3@amazon.co.uk> (raw)
In-Reply-To: <ZwWOfXd9becAm4lH@google.com>



On Tue, 2024-10-08 at 20:56 +0100, Sean Christopherson wrote:
> On Tue, Oct 08, 2024, Ackerley Tng wrote:
>> Patrick Roy <roypat@amazon.co.uk> writes:
>>> For the "non-CoCo with direct map entries removed" VMs that we at AWS
>>> are going for, we'd like a VM type with host-controlled in-place
>>> conversions which doesn't zero on transitions,
> 
> Hmm, your use case shouldn't need conversions _for KVM_, as there's no need for
> KVM to care if userspace or the guest _wants_ a page to be shared vs. private.
> Userspace is fully trusted to manage things; KVM simply reacts to the current
> state of things.
> 
> And more importantly, whether or not the direct map is zapped needs to be a
> property of the guest_memfd inode, i.e. can't be associated with a struct kvm.
> I forget who got volunteered to do the work,

I think me? At least we talked about it briefly

> but we're going to need similar
> functionality for tracking the state of individual pages in a huge folio, as
> folio_mark_uptodate() is too coarse-grained.  I.e. at some point, I expect that
> guest_memfd will make it easy-ish to determine whether or not the direct map has
> been obliterated.
> 
> The shared vs. private attributes tracking in KVM is still needed (I think), as
> it communicates what userspace _wants_, whereas he guest_memfd machinery will
> track what the state _is_.

If I'm understanding this patch series correctly, the approach taken
here is to force the KVM memory attributes and the internal guest_memfd
state to be in-sync, because the VMA from mmap()ing guest_memfd is
reflected back into the userspace_addr of the memslot. So, to me, in
this world, "direct map zapped iff
kvm_has_mem_attributes(KVM_MEMORY_ATTRIBUTES_PRIVATE)", with memory
attribute changes forcing the corresponding gmem state change. That's
why I was talking about conversions above.

I've played around with this locally, and since KVM seems to generally
use copy_from_user and friends to access the userspace_addr VMA, (aka
private mem that's reflected back into memslots here), with this things
like MMIO emulation can be oblivious to gmem's existence, since
copy_from_user and co don't require GUP or presence of direct map
entries (well, "oblivious" in the sense that things like kvm_read_guest
currently ignore memory attributes and unconditionally access
userspace_addr, which I suppose is not really wanted for VMs where
userspace_addr and guest_memfd aren't short-circuited like this). The
exception is kvm_clock, where the pv_time page would need to be
explicitly converted to shared to restore the direct map entry, although
I think we could just let userspace deal with making sure this page is
shared (and then, if gmem supports GUP on shared memory, even the
gfn_to_pfn_caches could work without gmem knowledge. Without GUP, we'd
still need a tiny hack in the uhva->pfn translation somewhere to handle
gmem vmas, but iirc you did mention that having kvm-clock be special
might be fine).

I guess it does come down to what you note below, answering the question
of "how does KVM internally access guest_memfd for non-CoCo VMs".  Is
there any way we can make uaccesses like above work? I've finally gotten
around to re-running some performance benchmarks of my on-demand
reinsertion patches with all the needed TLB flushes added, and my fio
benchmark on a virtio-blk device suffers a ~50% throughput regression,
which does not necessarily spark joy. And I think James H.  mentioned at
LPC that making the userfault stuff work with my patches would be quite
hard. All this in addition to you also not necessarily sounding too keen
on it either :D

>>> so if KVM_X86_SW_PROTECTED_VM ends up zeroing, we'd need to add another new
>>> VM type for that.
> 
> Maybe we should sneak in a s/KVM_X86_SW_PROTECTED_VM/KVM_X86_SW_HARDENED_VM rename?
> The original thought behind "software protected VM" was to do a slow build of
> something akin to pKVM, but realistically I don't think that idea is going anywhere.

Ah, admittedly I've thought of KVM_X86_SW_PROTECTED_VM as a bit of a
playground where various configurations other VM types enforce can be
mixed and matched (e.g. zero on conversions yes/no, direct map removal
yes/no) so more of a KVM_X86_GMEM_VM, but am happy to update my
understanding :) 

> Alternatively, depending on how KVM accesses guest memory that's been removed from
> the direct map, another solution would be to allow "regular" VMs to bind memslots
> to guest_memfd, i.e. if the non-CoCo use case needs/wnats to bind all memory to
> guest_memfd, not just "private" mappings.
> 
> That's probably the biggest topic of discussion: how do we want to allow mapping
> guest_memfd into the guest, without direct map entries, but while still allowing
> KVM to access guest memory as needed, e.g. for shadow paging.  One approach is
> your RFC, where KVM maps guest_memfd pfns on-demand.
> 
> Another (slightly crazy) approach would be use protection keys to provide the
> security properties that you want, while giving KVM (and userspace) a quick-and-easy
> override to access guest memory.
> 
>  1. mmap() guest_memfd into userpace with RW protections
>  2. Configure PKRU to make guest_memfd memory inaccessible by default
>  3. Swizzle PKRU on-demand when intentionally accessing guest memory
> 
> It's essentially the same idea as SMAP+STAC/CLAC, just applied to guest memory
> instead of to usersepace memory.
> 
> The benefit of the PKRU approach is that there are no PTE modifications, and thus
> no TLB flushes, and only the CPU that is access guest memory gains temporary
> access.  The big downside is that it would be limited to modern hardware, but
> that might be acceptable, especially if it simplifies KVM's implementation.

Mh, but we only have 16 protection keys, so we cannot give each VM a
unique one. And if all guest memory shares the same protection key, then
during the on-demand swizzling the CPU would get access to _all_ guest
memory on the host, which "feels" scary. What do you think, @Derek?

Does ARM have something equivalent, btw?

>>> Somewhat related sidenote: For VMs that allow inplace conversions and do
>>> not zero, we do not need to zap the stage-2 mappings on memory attribute
>>> changes, right?
> 
> See above.  I don't think conversions by toggling the shared/private flag in
> KVM's memory attributes is the right fit for your use case.


  parent reply	other threads:[~2024-10-10 16:21 UTC|newest]

Thread overview: 130+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2024-09-10 23:43 [RFC PATCH 00/39] 1G page support for guest_memfd Ackerley Tng
2024-09-10 23:43 ` [RFC PATCH 01/39] mm: hugetlb: Simplify logic in dequeue_hugetlb_folio_vma() Ackerley Tng
2024-09-10 23:43 ` [RFC PATCH 02/39] mm: hugetlb: Refactor vma_has_reserves() to should_use_hstate_resv() Ackerley Tng
2024-09-10 23:43 ` [RFC PATCH 03/39] mm: hugetlb: Remove unnecessary check for avoid_reserve Ackerley Tng
2024-09-10 23:43 ` [RFC PATCH 04/39] mm: mempolicy: Refactor out policy_node_nodemask() Ackerley Tng
2024-09-11 16:46   ` Gregory Price
2024-09-10 23:43 ` [RFC PATCH 05/39] mm: hugetlb: Refactor alloc_buddy_hugetlb_folio_with_mpol() to interpret mempolicy instead of vma Ackerley Tng
2024-09-10 23:43 ` [RFC PATCH 06/39] mm: hugetlb: Refactor dequeue_hugetlb_folio_vma() to use mpol Ackerley Tng
2024-09-10 23:43 ` [RFC PATCH 07/39] mm: hugetlb: Refactor out hugetlb_alloc_folio Ackerley Tng
2024-09-10 23:43 ` [RFC PATCH 08/39] mm: truncate: Expose preparation steps for truncate_inode_pages_final Ackerley Tng
2024-09-10 23:43 ` [RFC PATCH 09/39] mm: hugetlb: Expose hugetlb_subpool_{get,put}_pages() Ackerley Tng
2024-09-10 23:43 ` [RFC PATCH 10/39] mm: hugetlb: Add option to create new subpool without using surplus Ackerley Tng
2024-09-10 23:43 ` [RFC PATCH 11/39] mm: hugetlb: Expose hugetlb_acct_memory() Ackerley Tng
2024-09-10 23:43 ` [RFC PATCH 12/39] mm: hugetlb: Move and expose hugetlb_zero_partial_page() Ackerley Tng
2024-09-10 23:43 ` [RFC PATCH 13/39] KVM: guest_memfd: Make guest mem use guest mem inodes instead of anonymous inodes Ackerley Tng
2025-04-02  4:01   ` Yan Zhao
2025-04-23 20:22     ` Ackerley Tng
2025-04-24  3:53       ` Yan Zhao
2024-09-10 23:43 ` [RFC PATCH 14/39] KVM: guest_memfd: hugetlb: initialization and cleanup Ackerley Tng
2024-09-20  9:17   ` Vishal Annapurve
2024-10-01 23:00     ` Ackerley Tng
2024-12-01 17:59   ` Peter Xu
2025-02-13  9:47     ` Ackerley Tng
2025-02-26 18:55       ` Ackerley Tng
2025-03-06 17:33   ` Peter Xu
2024-09-10 23:43 ` [RFC PATCH 15/39] KVM: guest_memfd: hugetlb: allocate and truncate from hugetlb Ackerley Tng
2024-09-13 22:26   ` Elliot Berman
2024-10-03 20:23     ` Ackerley Tng
2024-10-30  9:01   ` Jun Miao
2025-02-11  1:21     ` Ackerley Tng
2024-12-01 17:55   ` Peter Xu
2025-02-13  7:52     ` Ackerley Tng
2025-02-13 16:48       ` Peter Xu
2024-09-10 23:43 ` [RFC PATCH 16/39] KVM: guest_memfd: Add page alignment check for hugetlb guest_memfd Ackerley Tng
2024-09-10 23:43 ` [RFC PATCH 17/39] KVM: selftests: Add basic selftests for hugetlb-backed guest_memfd Ackerley Tng
2024-09-10 23:43 ` [RFC PATCH 18/39] KVM: selftests: Support various types of backing sources for private memory Ackerley Tng
2024-09-10 23:43 ` [RFC PATCH 19/39] KVM: selftests: Update test for various private memory backing source types Ackerley Tng
2024-09-10 23:43 ` [RFC PATCH 20/39] KVM: selftests: Add private_mem_conversions_test.sh Ackerley Tng
2024-09-10 23:43 ` [RFC PATCH 21/39] KVM: selftests: Test that guest_memfd usage is reported via hugetlb Ackerley Tng
2024-09-10 23:43 ` [RFC PATCH 22/39] mm: hugetlb: Expose vmemmap optimization functions Ackerley Tng
2024-09-10 23:43 ` [RFC PATCH 23/39] mm: hugetlb: Expose HugeTLB functions for promoting/demoting pages Ackerley Tng
2024-09-10 23:43 ` [RFC PATCH 24/39] mm: hugetlb: Add functions to add/move/remove from hugetlb lists Ackerley Tng
2024-09-10 23:43 ` [RFC PATCH 25/39] KVM: guest_memfd: Split HugeTLB pages for guest_memfd use Ackerley Tng
2024-09-10 23:43 ` [RFC PATCH 26/39] KVM: guest_memfd: Track faultability within a struct kvm_gmem_private Ackerley Tng
2024-10-10 16:06   ` Peter Xu
2024-10-11 23:32     ` Ackerley Tng
2024-10-15 21:34       ` Peter Xu
2024-10-15 23:42         ` Ackerley Tng
2024-10-16  8:45           ` David Hildenbrand
2024-10-16 20:16             ` Peter Xu
2024-10-16 22:51               ` Jason Gunthorpe
2024-10-16 23:49                 ` Peter Xu
2024-10-16 23:54                   ` Jason Gunthorpe
2024-10-17 14:58                     ` Peter Xu
2024-10-17 16:47                       ` Jason Gunthorpe
2024-10-17 17:05                         ` Peter Xu
2024-10-17 17:10                           ` Jason Gunthorpe
2024-10-17 19:11                             ` Peter Xu
2024-10-17 19:18                               ` Jason Gunthorpe
2024-10-17 19:29                                 ` David Hildenbrand
2024-10-18  7:15                                 ` Patrick Roy
2024-10-18  7:50                                   ` David Hildenbrand
2024-10-18  9:34                                     ` Patrick Roy
2024-10-17 17:11                         ` David Hildenbrand
2024-10-17 17:16                           ` Jason Gunthorpe
2024-10-17 17:55                             ` David Hildenbrand
2024-10-17 18:26                             ` Vishal Annapurve
2024-10-17 14:56                   ` David Hildenbrand
2024-10-17 15:02               ` David Hildenbrand
2024-10-16  8:50           ` David Hildenbrand
2024-10-16 10:48             ` Vishal Annapurve
2024-10-16 11:54               ` David Hildenbrand
2024-10-16 11:57                 ` Jason Gunthorpe
2025-02-25 20:37   ` Peter Xu
2025-04-23 22:07     ` Ackerley Tng
2024-09-10 23:43 ` [RFC PATCH 27/39] KVM: guest_memfd: Allow mmapping guest_memfd files Ackerley Tng
2025-01-20 22:42   ` Peter Xu
2025-04-23 20:25     ` Ackerley Tng
2025-03-04 23:24   ` Peter Xu
2025-04-02  4:07   ` Yan Zhao
2025-04-23 20:28     ` Ackerley Tng
2024-09-10 23:43 ` [RFC PATCH 28/39] KVM: guest_memfd: Use vm_type to determine default faultability Ackerley Tng
2024-09-10 23:44 ` [RFC PATCH 29/39] KVM: Handle conversions in the SET_MEMORY_ATTRIBUTES ioctl Ackerley Tng
2024-09-10 23:44 ` [RFC PATCH 30/39] KVM: guest_memfd: Handle folio preparation for guest_memfd mmap Ackerley Tng
2024-09-16 20:00   ` Elliot Berman
2024-10-03 21:32     ` Ackerley Tng
2024-10-03 23:43       ` Ackerley Tng
2024-10-08 19:30         ` Sean Christopherson
2024-10-07 15:56       ` Patrick Roy
2024-10-08 18:07         ` Ackerley Tng
2024-10-08 19:56           ` Sean Christopherson
2024-10-09  3:51             ` Manwaring, Derek
2024-10-09 13:52               ` Andrew Cooper
2024-10-10 16:21             ` Patrick Roy [this message]
2024-10-10 19:27               ` Manwaring, Derek
2024-10-17 23:16               ` Ackerley Tng
2024-10-18  7:10                 ` Patrick Roy
2024-09-10 23:44 ` [RFC PATCH 31/39] KVM: selftests: Allow vm_set_memory_attributes to be used without asserting return value of 0 Ackerley Tng
2024-09-10 23:44 ` [RFC PATCH 32/39] KVM: selftests: Test using guest_memfd memory from userspace Ackerley Tng
2024-09-10 23:44 ` [RFC PATCH 33/39] KVM: selftests: Test guest_memfd memory sharing between guest and host Ackerley Tng
2024-09-10 23:44 ` [RFC PATCH 34/39] KVM: selftests: Add notes in private_mem_kvm_exits_test for mmap-able guest_memfd Ackerley Tng
2024-09-10 23:44 ` [RFC PATCH 35/39] KVM: selftests: Test that pinned pages block KVM from setting memory attributes to PRIVATE Ackerley Tng
2024-09-10 23:44 ` [RFC PATCH 36/39] KVM: selftests: Refactor vm_mem_add to be more flexible Ackerley Tng
2024-09-10 23:44 ` [RFC PATCH 37/39] KVM: selftests: Add helper to perform madvise by memslots Ackerley Tng
2024-09-10 23:44 ` [RFC PATCH 38/39] KVM: selftests: Update private_mem_conversions_test for mmap()able guest_memfd Ackerley Tng
2024-09-10 23:44 ` [RFC PATCH 39/39] KVM: guest_memfd: Dynamically split/reconstruct HugeTLB page Ackerley Tng
2025-04-03 12:33   ` Yan Zhao
2025-04-23 22:02     ` Ackerley Tng
2025-04-24  1:09       ` Yan Zhao
2025-04-24  4:25         ` Yan Zhao
2025-04-24  5:55           ` Chenyi Qiang
2025-04-24  8:13             ` Yan Zhao
2025-04-24 14:10               ` Vishal Annapurve
2025-04-24 18:15                 ` Ackerley Tng
2025-04-25  4:02                   ` Yan Zhao
2025-04-25 22:45                     ` Ackerley Tng
2025-04-28  1:05                       ` Yan Zhao
2025-04-28 19:02                         ` Vishal Annapurve
2025-04-30 20:09                         ` Ackerley Tng
2025-05-06  1:23                           ` Yan Zhao
2025-05-06 19:22                             ` Ackerley Tng
2025-05-07  3:15                               ` Yan Zhao
2025-05-13 17:33                                 ` Ackerley Tng
2024-09-11  6:56 ` [RFC PATCH 00/39] 1G page support for guest_memfd Michal Hocko
2024-09-14  1:08 ` Du, Fan
2024-09-14 13:34   ` Vishal Annapurve
2025-01-28  9:42 ` Amit Shah
2025-02-03  8:35   ` Ackerley Tng
2025-02-06 11:07     ` Amit Shah
2025-02-07  6:25       ` Ackerley Tng

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=6bca3ad4-3eca-4a75-a775-5f8b0467d7a3@amazon.co.uk \
    --to=roypat@amazon.co.uk \
    --cc=ackerleytng@google.com \
    --cc=ajones@ventanamicro.com \
    --cc=anup@brainfault.org \
    --cc=bfoster@redhat.com \
    --cc=brauner@kernel.org \
    --cc=david@redhat.com \
    --cc=derekmn@amazon.com \
    --cc=erdemaktas@google.com \
    --cc=fan.du@intel.com \
    --cc=fvdl@google.com \
    --cc=haibo1.xu@intel.com \
    --cc=isaku.yamahata@intel.com \
    --cc=jgg@nvidia.com \
    --cc=jgowans@amazon.com \
    --cc=jhubbard@nvidia.com \
    --cc=jthoughton@google.com \
    --cc=jun.miao@intel.com \
    --cc=kalyazin@amazon.co.uk \
    --cc=kent.overstreet@linux.dev \
    --cc=kvm@vger.kernel.org \
    --cc=linux-fsdevel@kvack.org \
    --cc=linux-kernel@vger.kernel.org \
    --cc=linux-kselftest@vger.kernel.org \
    --cc=linux-mm@kvack.org \
    --cc=maciej.wieczor-retman@intel.com \
    --cc=mike.kravetz@oracle.com \
    --cc=muchun.song@linux.dev \
    --cc=oliver.upton@linux.dev \
    --cc=pbonzini@redhat.com \
    --cc=peterx@redhat.com \
    --cc=pgonda@google.com \
    --cc=pvorel@suse.cz \
    --cc=qperret@google.com \
    --cc=quic_eberman@quicinc.com \
    --cc=richard.weiyang@gmail.com \
    --cc=rientjes@google.com \
    --cc=rppt@kernel.org \
    --cc=seanjc@google.com \
    --cc=shuah@kernel.org \
    --cc=tabba@google.com \
    --cc=vannapurve@google.com \
    --cc=vkuznets@redhat.com \
    --cc=willy@infradead.org \
    --cc=zhiquan1.li@intel.com \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox