linux-mm.kvack.org archive mirror
 help / color / mirror / Atom feed
From: Patrick Roy <roypat@amazon.co.uk>
To: Mike Rapoport <rppt@kernel.org>
Cc: <seanjc@google.com>, <pbonzini@redhat.com>,
	<akpm@linux-foundation.org>, <dwmw@amazon.co.uk>,
	<david@redhat.com>, <tglx@linutronix.de>, <mingo@redhat.com>,
	<bp@alien8.de>, <dave.hansen@linux.intel.com>, <x86@kernel.org>,
	<hpa@zytor.com>, <willy@infradead.org>, <graf@amazon.com>,
	<derekmn@amazon.com>, <kalyazin@amazon.com>,
	<kvm@vger.kernel.org>, <linux-kernel@vger.kernel.org>,
	<linux-mm@kvack.org>, <dmatlack@google.com>, <tabba@google.com>,
	<chao.p.peng@linux.intel.com>, <xmarcalx@amazon.co.uk>,
	James Gowans <jgowans@amazon.com>
Subject: Re: [RFC PATCH 5/8] kvm: gmem: add option to remove guest private memory from direct map
Date: Wed, 10 Jul 2024 10:50:14 +0100	[thread overview]
Message-ID: <576c2316-80c3-47f0-9af4-86daa48bdc16@amazon.co.uk> (raw)
In-Reply-To: <Zo441yz7Yw2JZcPs@kernel.org>



On 7/10/24 08:31, Mike Rapoport wrote:
> CAUTION: This email originated from outside of the organization. Do not click links or open attachments unless you can confirm the sender and know the content is safe.
> 
> 
> 
> On Tue, Jul 09, 2024 at 02:20:33PM +0100, Patrick Roy wrote:
>> While guest_memfd is not available to be mapped by userspace, it is
>> still accessible through the kernel's direct map. This means that in
>> scenarios where guest-private memory is not hardware protected, it can
>> be speculatively read and its contents potentially leaked through
>> hardware side-channels. Removing guest-private memory from the direct
>> map, thus mitigates a large class of speculative execution issues
>> [1, Table 1].
>>
>> This patch adds a flag to the `KVM_CREATE_GUEST_MEMFD` which, if set, removes the
>> struct pages backing guest-private memory from the direct map. Should
>> `CONFIG_HAVE_KVM_GMEM_{INVALIDATE, PREPARE}` be set, pages are removed
>> after preparation and before invalidation, so that the
>> prepare/invalidate routines do not have to worry about potentially
>> absent direct map entries.
>>
>> Direct map removal do not reuse the `KVM_GMEM_PREPARE` machinery, since `prepare` can be
>> called multiple time, and it is the responsibility of the preparation
>> routine to not "prepare" the same folio twice [2]. Thus, instead
>> explicitly check if `filemap_grab_folio` allocated a new folio, and
>> remove the returned folio from the direct map only if this was the case.
>>
>> The patch uses release_folio instead of free_folio to reinsert pages
>> back into the direct map as by the time free_folio is called,
>> folio->mapping can already be NULL. This means that a call to
>> folio_inode inside free_folio might deference a NULL pointer, leaving no
>> way to access the inode which stores the flags that allow determining
>> whether the page was removed from the direct map in the first place.
>>
>> Lastly, the patch uses set_direct_map_{invalid,default}_noflush instead
>> of `set_memory_[n]p` to avoid expensive flushes of TLBs and the L*-cache
>> hierarchy. This is especially important once KVM restores direct map
>> entries on-demand in later patches, where simple FIO benchmarks of a
>> virtio-blk device have shown that TLB flushes on a Intel(R) Xeon(R)
>> Platinum 8375C CPU @ 2.90GHz resulted in 80% degradation in throughput
>> compared to a non-flushing solution.
>>
>> Not flushing the TLB means that until TLB entries for temporarily
>> restored direct map entries get naturally evicted, they can be used
>> during speculative execution, and effectively "unhide" the memory for
>> longer than intended. We consider this acceptable, as the only pages
>> that are temporarily reinserted into the direct map like this will
>> either hold PV data structures (kvm-clock, asyncpf, etc), or pages
>> containing privileged instructions inside the guest kernel image (in the
>> MMIO emulation case).
>>
>> [1]: https://download.vusec.net/papers/quarantine_raid23.pdf
>>
>> Signed-off-by: Patrick Roy <roypat@amazon.co.uk>
>> ---
>>  include/uapi/linux/kvm.h |  2 ++
>>  virt/kvm/guest_memfd.c   | 52 ++++++++++++++++++++++++++++++++++------
>>  2 files changed, 47 insertions(+), 7 deletions(-)
>>
>> diff --git a/include/uapi/linux/kvm.h b/include/uapi/linux/kvm.h
>> index e065d9fe7ab2..409116aa23c9 100644
>> --- a/include/uapi/linux/kvm.h
>> +++ b/include/uapi/linux/kvm.h
>> @@ -1563,4 +1563,6 @@ struct kvm_create_guest_memfd {
>>       __u64 reserved[6];
>>  };
>>
>> +#define KVM_GMEM_NO_DIRECT_MAP                 (1ULL << 0)
>> +
>>  #endif /* __LINUX_KVM_H */
>> diff --git a/virt/kvm/guest_memfd.c b/virt/kvm/guest_memfd.c
>> index 9148b9679bb1..dc9b0c2d0b0e 100644
>> --- a/virt/kvm/guest_memfd.c
>> +++ b/virt/kvm/guest_memfd.c
>> @@ -4,6 +4,7 @@
>>  #include <linux/kvm_host.h>
>>  #include <linux/pagemap.h>
>>  #include <linux/anon_inodes.h>
>> +#include <linux/set_memory.h>
>>
>>  #include "kvm_mm.h"
>>
>> @@ -49,9 +50,16 @@ static int kvm_gmem_prepare_folio(struct inode *inode, pgoff_t index, struct fol
>>       return 0;
>>  }
>>
>> +static bool kvm_gmem_not_present(struct inode *inode)
>> +{
>> +     return ((unsigned long)inode->i_private & KVM_GMEM_NO_DIRECT_MAP) != 0;
>> +}
>> +
>>  static struct folio *kvm_gmem_get_folio(struct inode *inode, pgoff_t index, bool prepare)
>>  {
>>       struct folio *folio;
>> +     bool zap_direct_map = false;
>> +     int r;
>>
>>       /* TODO: Support huge pages. */
>>       folio = filemap_grab_folio(inode->i_mapping, index);
>> @@ -74,16 +82,30 @@ static struct folio *kvm_gmem_get_folio(struct inode *inode, pgoff_t index, bool
>>               for (i = 0; i < nr_pages; i++)
>>                       clear_highpage(folio_page(folio, i));
>>
>> +             // We need to clear the folio before calling kvm_gmem_prepare_folio,
>> +             // but can only remove it from the direct map _after_ preparation is done.
> 
> No C++ comments please
> 

Ack, sorry, will avoid in the future!

>> +             zap_direct_map = kvm_gmem_not_present(inode);
>> +
>>               folio_mark_uptodate(folio);
>>       }
>>
>>       if (prepare) {
>> -             int r = kvm_gmem_prepare_folio(inode, index, folio);
>> -             if (r < 0) {
>> -                     folio_unlock(folio);
>> -                     folio_put(folio);
>> -                     return ERR_PTR(r);
>> -             }
>> +             r = kvm_gmem_prepare_folio(inode, index, folio);
>> +             if (r < 0)
>> +                     goto out_err;
>> +     }
>> +
>> +     if (zap_direct_map) {
>> +             r = set_direct_map_invalid_noflush(&folio->page);
> 
> It's not future proof to presume that folio is a single page here.
> You should loop over folio pages and add a TLB flush after the loop.
>

Right, will do the folio iteration thing (and same for all other places
I call the direct_map* functions in this RFC)! 

I'll also have a look at the TLB flush. I specifically avoided using
set_memory_np here because the flushes it did (TLB + L1/2/3) had
significant performance impact (see cover letter). I'll have to rerun my
benchmark with set_direct_map_invalid_noflush + flush_tlb_kernel_range
instead, but if the result is similar, and we really need the flush here
for correctness, I might have to go back to the drawing board about this
whole on-demand mapping approach :(

>> +             if (r < 0)
>> +                     goto out_err;
>> +
>> +             // We use the private flag to track whether the folio has been removed
>> +             // from the direct map. This is because inside of ->free_folio,
>> +             // we do not have access to the address_space anymore, meaning we
>> +             // cannot check folio_inode(folio)->i_private to determine whether
>> +             // KVM_GMEM_NO_DIRECT_MAP was set.
>> +             folio_set_private(folio);
>>       }
>>
>>       /*
>> @@ -91,6 +113,10 @@ static struct folio *kvm_gmem_get_folio(struct inode *inode, pgoff_t index, bool
>>        * unevictable and there is no storage to write back to.
>>        */
>>       return folio;
>> +out_err:
>> +     folio_unlock(folio);
>> +     folio_put(folio);
>> +     return ERR_PTR(r);
>>  }
>>
>>  static void kvm_gmem_invalidate_begin(struct kvm_gmem *gmem, pgoff_t start,
>> @@ -354,10 +380,22 @@ static void kvm_gmem_free_folio(struct folio *folio)
>>  }
>>  #endif
>>
>> +static void kvm_gmem_invalidate_folio(struct folio *folio, size_t start, size_t end)
>> +{
>> +     if (start == 0 && end == PAGE_SIZE) {
>> +             // We only get here if PG_private is set, which only happens if kvm_gmem_not_present
>> +             // returned true in kvm_gmem_get_folio. Thus no need to do that check again.
>> +             BUG_ON(set_direct_map_default_noflush(&folio->page));
> 
> Ditto.
> 
>> +
>> +             folio_clear_private(folio);
>> +     }
>> +}
>> +
>>  static const struct address_space_operations kvm_gmem_aops = {
>>       .dirty_folio = noop_dirty_folio,
>>       .migrate_folio  = kvm_gmem_migrate_folio,
>>       .error_remove_folio = kvm_gmem_error_folio,
>> +     .invalidate_folio = kvm_gmem_invalidate_folio,
>>  #ifdef CONFIG_HAVE_KVM_GMEM_INVALIDATE
>>       .free_folio = kvm_gmem_free_folio,
>>  #endif
>> @@ -443,7 +481,7 @@ int kvm_gmem_create(struct kvm *kvm, struct kvm_create_guest_memfd *args)
>>  {
>>       loff_t size = args->size;
>>       u64 flags = args->flags;
>> -     u64 valid_flags = 0;
>> +     u64 valid_flags = KVM_GMEM_NO_DIRECT_MAP;
>>
>>       if (flags & ~valid_flags)
>>               return -EINVAL;
>> --
>> 2.45.2
>>
> 
> --
> Sincerely yours,
> Mike.


  reply	other threads:[~2024-07-10  9:50 UTC|newest]

Thread overview: 34+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2024-07-09 13:20 [RFC PATCH 0/8] Unmapping guest_memfd from Direct Map Patrick Roy
2024-07-09 13:20 ` [RFC PATCH 1/8] kvm: Allow reading/writing gmem using kvm_{read,write}_guest Patrick Roy
2024-07-09 13:20 ` [RFC PATCH 2/8] kvm: use slowpath in gfn_to_hva_cache if memory is private Patrick Roy
2024-07-09 13:20 ` [RFC PATCH 3/8] kvm: pfncache: enlighten about gmem Patrick Roy
2024-07-09 14:36   ` David Woodhouse
2024-07-10  9:49     ` Patrick Roy
2024-07-10 10:20       ` David Woodhouse
2024-07-10 10:46         ` Patrick Roy
2024-07-10 10:50           ` David Woodhouse
2024-07-09 13:20 ` [RFC PATCH 4/8] kvm: x86: support walking guest page tables in gmem Patrick Roy
2024-07-09 13:20 ` [RFC PATCH 5/8] kvm: gmem: add option to remove guest private memory from direct map Patrick Roy
2024-07-10  7:31   ` Mike Rapoport
2024-07-10  9:50     ` Patrick Roy [this message]
2024-07-09 13:20 ` [RFC PATCH 6/8] kvm: gmem: Temporarily restore direct map entries when needed Patrick Roy
2024-07-11  6:25   ` Paolo Bonzini
2024-07-09 13:20 ` [RFC PATCH 7/8] mm: secretmem: use AS_INACCESSIBLE to prohibit GUP Patrick Roy
2024-07-09 21:09   ` David Hildenbrand
2024-07-10  7:32     ` Mike Rapoport
2024-07-10  9:50       ` Patrick Roy
2024-07-10 21:14         ` David Hildenbrand
2024-07-09 13:20 ` [RFC PATCH 8/8] kvm: gmem: Allow restricted userspace mappings Patrick Roy
2024-07-09 14:48   ` Fuad Tabba
2024-07-09 21:13     ` David Hildenbrand
2024-07-10  9:51       ` Patrick Roy
2024-07-10 21:12         ` David Hildenbrand
2024-07-10 21:53           ` Sean Christopherson
2024-07-10 21:56             ` David Hildenbrand
2024-07-12 15:59           ` Patrick Roy
2024-07-30 10:15             ` David Hildenbrand
2024-08-01 10:30               ` Patrick Roy
2024-07-22 12:28 ` [RFC PATCH 0/8] Unmapping guest_memfd from Direct Map Vlastimil Babka (SUSE)
2024-07-26  6:55   ` Patrick Roy
2024-07-30 10:17     ` David Hildenbrand
2024-07-26 16:44 ` Yosry Ahmed

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=576c2316-80c3-47f0-9af4-86daa48bdc16@amazon.co.uk \
    --to=roypat@amazon.co.uk \
    --cc=akpm@linux-foundation.org \
    --cc=bp@alien8.de \
    --cc=chao.p.peng@linux.intel.com \
    --cc=dave.hansen@linux.intel.com \
    --cc=david@redhat.com \
    --cc=derekmn@amazon.com \
    --cc=dmatlack@google.com \
    --cc=dwmw@amazon.co.uk \
    --cc=graf@amazon.com \
    --cc=hpa@zytor.com \
    --cc=jgowans@amazon.com \
    --cc=kalyazin@amazon.com \
    --cc=kvm@vger.kernel.org \
    --cc=linux-kernel@vger.kernel.org \
    --cc=linux-mm@kvack.org \
    --cc=mingo@redhat.com \
    --cc=pbonzini@redhat.com \
    --cc=rppt@kernel.org \
    --cc=seanjc@google.com \
    --cc=tabba@google.com \
    --cc=tglx@linutronix.de \
    --cc=willy@infradead.org \
    --cc=x86@kernel.org \
    --cc=xmarcalx@amazon.co.uk \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox