From: Brijesh Singh <brijesh.singh@amd.com>
To: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>,
David Rientjes <rientjes@google.com>
Cc: brijesh.singh@amd.com, Borislav Petkov <bp@alien8.de>,
Andy Lutomirski <luto@kernel.org>,
Sean Christopherson <seanjc@google.com>,
Andrew Morton <akpm@linux-foundation.org>,
Andi Kleen <ak@linux.intel.com>,
Tom Lendacky <thomas.lendacky@amd.com>,
Jon Grimm <jon.grimm@amd.com>,
Thomas Gleixner <tglx@linutronix.de>,
Christoph Hellwig <hch@lst.de>,
Peter Zijlstra <peterz@infradead.org>,
Paolo Bonzini <pbonzini@redhat.com>,
Ingo Molnar <mingo@redhat.com>, Joerg Roedel <jroedel@suse.de>,
x86@kernel.org, linux-mm@kvack.org
Subject: Re: AMD SEV-SNP/Intel TDX: validation of memory pages
Date: Tue, 2 Feb 2021 18:16:41 -0600
Message-ID: <961a2736-9bc9-43e1-1e75-6d373fe9590b@amd.com>
In-Reply-To: <20210202160205.3wfchtibq2sd7pe5@black.fi.intel.com>
On 2/2/21 10:02 AM, Kirill A. Shutemov wrote:
> On Mon, Feb 01, 2021 at 05:51:09PM -0800, David Rientjes wrote:
>> Hi everybody,
>>
>> I'd like to kick-start the discussion on lazy validation of guest memory
>> for the purposes of AMD SEV-SNP and Intel TDX.
>>
>> Both AMD SEV-SNP and Intel TDX require validation of guest memory before
>> it may be used by the guest. This is needed for integrity protection from
>> a potentially malicious hypervisor or other host components.
>>
>> For AMD SEV-SNP, the hypervisor assigns a page to the guest using the new
>> RMPUPDATE instruction. The guest then transitions the page to a usable
>> state with the new PVALIDATE instruction[1]. This sets the Validated flag
>> in the Reverse Map Table (RMP) for a guest-addressable page, which opts it
>> into hardware and firmware integrity protection. This may only be done by
>> the guest itself, and until then the guest cannot access the page.
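To make the RMPUPDATE/PVALIDATE flow concrete, here is a toy C model of the RMP. It is a sketch under stated assumptions, not the hardware definition: the names (`struct rmp_entry`, `model_rmpupdate()`, `model_pvalidate()`, `model_guest_access_ok()`) are hypothetical, host pages are small array indices, and the real RMP entry also carries the ASID, page size and other fields omitted here. It shows the two properties the text relies on: a page is unusable until the guest validates it, and each hPA is tied to a single gPA.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Toy RMP: one entry per host page frame (hPA index). Hypothetical
 * layout; the real entry also holds ASID, page size, immutable bit, ... */
#define NR_HOST_PAGES 16

struct rmp_entry {
    bool assigned;   /* set by the hypervisor via RMPUPDATE */
    uint64_t gpa;    /* the one guest page this host page may back */
    bool validated;  /* set by the guest via PVALIDATE */
};

static struct rmp_entry rmp[NR_HOST_PAGES];

/* Hypervisor: assign host page hpa to guest page gpa. In this model any
 * (re)assignment clears Validated, so the guest must validate again. */
static void model_rmpupdate(uint64_t hpa, uint64_t gpa)
{
    assert(hpa < NR_HOST_PAGES);
    rmp[hpa] = (struct rmp_entry){ .assigned = true, .gpa = gpa, .validated = false };
}

/* Guest: validate gpa, which the nested page tables map to hpa.
 * This model refuses if the RMP records a different gPA or if the page
 * is already validated. */
static bool model_pvalidate(uint64_t gpa, uint64_t hpa)
{
    assert(hpa < NR_HOST_PAGES);
    struct rmp_entry *e = &rmp[hpa];
    if (!e->assigned || e->gpa != gpa || e->validated)
        return false;
    e->validated = true;
    return true;
}

/* A guest access to gpa (backed by hpa) is allowed only if hpa is assigned
 * to exactly this gPA and validated; otherwise the access would fault. */
static bool model_guest_access_ok(uint64_t gpa, uint64_t hpa)
{
    assert(hpa < NR_HOST_PAGES);
    const struct rmp_entry *e = &rmp[hpa];
    return e->assigned && e->gpa == gpa && e->validated;
}
```

Note what the model does not prevent on its own: after `model_rmpupdate(5, 0x10)` remaps gPA 0x10 to a new host page, the guest could validate 0x10 a second time, which is the double-validation case Kirill raises further down.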
>>
>> The guest can only PVALIDATE memory for a gPA once; the RMP then
>> guarantees for each hPA that there is only a single gPA mapping. This
>> validation can either be done all up front at the time the guest is booted
>> or it can be done lazily at runtime on fault if the guest keeps track of
>> Valid vs Invalid pages. Because running PVALIDATE on all guest memory at
>> boot would take a very long time, I'd like to discuss the options for
>> doing it lazily.
>>
>> Similarly, for Intel TDX, the hypervisor unmaps the gPA from the shared
>> EPT and invalidates the TLB and all caches for the TD's vCPUs; it then
>> adds a page to the gPA address space for a TD by using the new
>> TDH.MEM.PAGE.AUG call. The TDG.MEM.PAGE.ACCEPT TDCALL[2] then allows a
>> guest to accept a guest page for a gPA and initialize it using the private
>> key for that TD. This may only be done by the TD itself and until that
>> time, the gPA cannot be used within the TD.
>>
>> Both AMD SEV-SNP and Intel TDX support hugepages. SEV-SNP supports 2MB
>> pages, whereas the TDX accept TDCALL supports both 2MB and 1GB pages.
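The hugepage difference matters when a range is validated in bulk. A minimal sketch, assuming 4K-aligned ranges and a hypothetical helper `split_for_accept()`: it walks a range and at each step uses the largest naturally aligned size the platform allows (capped at 2MB for SEV-SNP and 1GB for TDX, per the text above).

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define SZ_4K (1ULL << 12)
#define SZ_2M (1ULL << 21)
#define SZ_1G (1ULL << 30)

/* Split [start, end) into the largest naturally aligned chunks that one
 * validate/accept operation could cover. max_size is SZ_2M for SEV-SNP
 * or SZ_1G for TDX. counts[] receives the number of 4K, 2M and 1G chunks;
 * the return value is the total number of operations. */
static size_t split_for_accept(uint64_t start, uint64_t end, uint64_t max_size,
                               size_t counts[3])
{
    static const uint64_t sizes[3] = { SZ_4K, SZ_2M, SZ_1G };
    size_t n = 0;

    assert(start % SZ_4K == 0 && end % SZ_4K == 0 && start <= end);
    counts[0] = counts[1] = counts[2] = 0;
    while (start < end) {
        int i;
        /* Try 1G, then 2M; fall back to 4K (i == 0) if neither fits. */
        for (i = 2; i > 0; i--) {
            if (sizes[i] <= max_size && start % sizes[i] == 0 &&
                end - start >= sizes[i])
                break;
        }
        counts[i]++;
        start += sizes[i];
        n++;
    }
    return n;
}
```

For a [0, 1GB + 2MB + 4KB) range this gives 3 operations with a 1GB cap but 514 with a 2MB cap, which is one reason the larger accept size helps lazy schemes that validate in big chunks.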
>>
>> I believe the UEFI ECR[3] adding an unaccepted memory type to
>> EFI_MEMORY_TYPE was accepted in December. This should enable the guest to
>> learn what memory has not yet been validated (or accepted) by the firmware
>> if not all guest memory is validated up front.
>>
>> This likely requires pre-validating all memory that can be accessed
>> while handling a #VC (or #VE for TDX), such as IST stacks, including memory
>> in the x86 boot sequence that must be validated before the core mm
>> subsystem is up and running to handle the lazy validation. I believe
>> lazy validation can be done by the core mm after that, perhaps by
>> maintaining a new "validated" bit in struct page flags.
>>
>> Has anybody looked into this or, even better, is anybody currently working
>> on this?
> It's likely I'm going to do this on the Intel side, but I have not looked
> deeply into it.
>
>> I think supporting lazy validation/acceptance in the guest requires quite
>> invasive changes to core areas that lots of people on the recipient list
>> have strong opinions about. Some things that come to mind:
>>
>> - Annotations for pages that must be pre-validated in the x86 boot
>> sequence, including IST stacks
>>
>> - Proliferation of these annotations throughout any kernel code that can
>> access memory for #VC or #VE
>>
>> - Handling lazy validation of guest memory through the core mm layer,
>> most likely involving a bit in struct page flags to track their status
>>
>> - Any need for validating memory that is not backed by struct page that
>> needs to be special-cased
>>
>> - Any concerns about this for the DMA layer
>>
>> One possibility for minimal disruption to the boot entry code is to
>> require the guest BIOS to validate 4GB and below, and then leave 4GB and
>> above to be done lazily (the true amount of memory will actually be less
>> due to the MMIO hole).
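A worked example of the arithmetic behind the 4GB proposal and the MMIO-hole caveat. The layout (RAM at [0, 3GB) and at [4GB, 9GB), with a 1GB hole in between) is an assumption for illustration, as is the helper name `split_at_4g()`.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define SZ_4G (1ULL << 32)

struct ram_range { uint64_t start, end; };  /* [start, end) */

/* Split RAM ranges at the 4GB boundary: bytes below would be validated up
 * front by the BIOS under this proposal, bytes above left for lazy
 * validation. A range crossing 4GB contributes to both totals. */
static void split_at_4g(const struct ram_range *r, size_t n,
                        uint64_t *below, uint64_t *above)
{
    *below = *above = 0;
    for (size_t i = 0; i < n; i++) {
        uint64_t s = r[i].start, e = r[i].end;
        assert(s <= e);
        if (s < SZ_4G)
            *below += (e < SZ_4G ? e : SZ_4G) - s;
        if (e > SZ_4G)
            *above += e - (s > SZ_4G ? s : SZ_4G);
    }
}
```

For the assumed 8GB guest, only 3GB is covered by the BIOS pass and the remaining 5GB is left for lazy validation.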
> [ As I haven't looked into the actual code, I may be saying total garbage below... ]
>
> Pre-validating 4GB would indeed be the easiest way to go, but it's going
> to be too slow.
>
> The more realistic option is for the BIOS to pre-validate the memory where
> the kernel and initrd are placed, plus a few dozen megs for runtime. It
> means the decompression code would need to be aware of the validation.

I was thinking that having the BIOS validate the lower 4GB would simplify
the changes to the kernel entry code path as well as provide a clean
approach to supporting kexec.

My initial thought is:
- The BIOS or VMM validates the lower 4GB of memory.
- The BIOS marks memory above 4GB as unaccepted in the e820/EFI memmap.
- Kernel early boot can then work with minimal (or no) changes.
- If an unaccepted type is discovered, allocate a bitmap to keep track
  of which pages are validated. We can also explore whether removing
  the unaccepted flag from the memmap range would work.
- On #VC/#VE, check the bitmap to see whether the pages need to be
  validated. To speed things up, we can validate more than one page
  per #VC/#VE.
- If we get kexec'd, rebuild the e820/memmap from the bitmap so that
  we don't validate anything twice.
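A sketch of the bitmap-based plan, under stated assumptions: the names (`validated_bitmap`, `tracked_base`, `handle_unvalidated_fault()`, `platform_validate_page()`, `unvalidated_ranges()`) are hypothetical, the tracked region is a small fixed array, and `platform_validate_page()` only counts calls where a real guest would issue PVALIDATE or TDG.MEM.PAGE.ACCEPT. The fault path validates a whole 2MB-aligned chunk and skips pages already marked; the kexec path turns the bitmap back into unaccepted ranges.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define PAGE_SHIFT 12
#define NR_TRACKED_PAGES 4096      /* 16MB tracked in this toy */
#define VALIDATE_CHUNK_PAGES 512   /* validate 2MB around each fault */
#define BITS_PER_WORD 64

static uint64_t validated_bitmap[NR_TRACKED_PAGES / BITS_PER_WORD];
static uint64_t tracked_base;      /* physical address of tracked page 0 */
static size_t nr_validate_calls;   /* stands in for PVALIDATE/ACCEPT */

static bool page_validated(size_t pfn)
{
    return (validated_bitmap[pfn / BITS_PER_WORD] >> (pfn % BITS_PER_WORD)) & 1;
}

static void mark_validated(size_t pfn)
{
    validated_bitmap[pfn / BITS_PER_WORD] |= 1ULL << (pfn % BITS_PER_WORD);
}

/* Placeholder for the real PVALIDATE / TDG.MEM.PAGE.ACCEPT operation. */
static void platform_validate_page(uint64_t paddr)
{
    (void)paddr;
    nr_validate_calls++;
}

/* Called from the #VC/#VE handler for a fault at paddr: validate the whole
 * aligned chunk containing it, skipping pages that are already marked. */
static void handle_unvalidated_fault(uint64_t paddr)
{
    assert(paddr >= tracked_base);
    size_t pfn = (paddr - tracked_base) >> PAGE_SHIFT;
    assert(pfn < NR_TRACKED_PAGES);
    size_t first = pfn - pfn % VALIDATE_CHUNK_PAGES;
    for (size_t p = first; p < first + VALIDATE_CHUNK_PAGES; p++) {
        if (page_validated(p))
            continue;
        platform_validate_page(tracked_base + ((uint64_t)p << PAGE_SHIFT));
        mark_validated(p);
    }
}

/* Before kexec: write the still-unvalidated [start, end) ranges to out
 * (up to max entries) so the next kernel's memmap marks only these as
 * unaccepted. Returns the total number of ranges found. */
static size_t unvalidated_ranges(uint64_t (*out)[2], size_t max)
{
    size_t n = 0, p = 0;
    while (p < NR_TRACKED_PAGES) {
        if (page_validated(p)) {
            p++;
            continue;
        }
        size_t s = p;
        while (p < NR_TRACKED_PAGES && !page_validated(p))
            p++;
        if (n < max) {
            out[n][0] = tracked_base + ((uint64_t)s << PAGE_SHIFT);
            out[n][1] = tracked_base + ((uint64_t)p << PAGE_SHIFT);
        }
        n++;
    }
    return n;
}
```

A second fault inside an already-handled chunk issues no further validate calls, and one fault at 4GB + 2MB leaves two unvalidated ranges for the kexec'd kernel: [4GB, 4GB + 2MB) and [4GB + 4MB, 4GB + 16MB).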
>
> The critical thing is that once memory is validated we must not validate
> it again. It's a possible VMM->guest attack vector. We must track precisely
> what memory has been validated and stop the guest on detecting an
> unexpected second validation request.
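A minimal sketch of that rule, assuming a hypothetical guest-private array `guest_validated[]` and hypothetical handler names. It illustrates the decision rather than the mechanism: a "page not validated" exception on a page the guest already validated is treated as hostile instead of being silently re-validated.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdio.h>
#include <stdlib.h>

#define NR_PAGES 1024

/* Guest-private record of validated pages. It must survive hand-off
 * between boot stages; a plain array is used here for illustration. */
static bool guest_validated[NR_PAGES];

enum accept_result { ACCEPT_DONE, ACCEPT_ATTACK };

/* React to a "page not validated" exception for pfn. If the guest already
 * validated this page, a second fault typically means the backing page was
 * changed underneath it (e.g. remapped by the VMM); validating again could
 * let the host substitute a different page, so refuse. */
static enum accept_result on_not_validated_fault(size_t pfn)
{
    assert(pfn < NR_PAGES);
    if (guest_validated[pfn])
        return ACCEPT_ATTACK;
    /* ... PVALIDATE / TDG.MEM.PAGE.ACCEPT would be issued here ... */
    guest_validated[pfn] = true;
    return ACCEPT_DONE;
}

/* A real guest would terminate the VM here; the sketch just aborts. */
static void handle_fault_or_die(size_t pfn)
{
    if (on_not_validated_fault(pfn) == ACCEPT_ATTACK) {
        fprintf(stderr, "pfn %zu validated twice: possible VMM attack\n", pfn);
        abort();
    }
}
```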
>
> It also means that we have to keep this information when control gets
> passed from the decompression code to the real kernel. A page flag is no
> good for this.
>
> My initial thought is that we can use the e820/EFI memmap to keep track
> of this information -- remove the unaccepted memory flag from the range
> that got accepted.
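One way to picture keeping this state in the memory map itself: a sketch using a hypothetical, simplified map (`struct mem_map`, `MEM_UNACCEPTED`, `mark_accepted()`), not the real e820 or EFI structures. Accepting a subrange splits the unaccepted entry so that only the untouched parts keep the unaccepted type, which is what the next stage (the main kernel, or a kexec'd kernel) would read.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical, simplified memory map; not the real e820/EFI layout. */
enum mem_type { MEM_RAM, MEM_UNACCEPTED };

struct mem_entry {
    uint64_t start, end;  /* [start, end) */
    enum mem_type type;
};

#define MAX_ENTRIES 32

struct mem_map {
    struct mem_entry e[MAX_ENTRIES];
    size_t n;
};

/* Insert ent at index at, shifting later entries up by one. */
static void insert_entry(struct mem_map *m, size_t at, struct mem_entry ent)
{
    assert(m->n < MAX_ENTRIES && at <= m->n);
    for (size_t i = m->n; i > at; i--)
        m->e[i] = m->e[i - 1];
    m->e[at] = ent;
    m->n++;
}

/* Record that [start, end) has been accepted: each overlapping
 * MEM_UNACCEPTED entry is split into an optional unaccepted head, the
 * accepted middle (now MEM_RAM) and an optional unaccepted tail. */
static void mark_accepted(struct mem_map *m, uint64_t start, uint64_t end)
{
    for (size_t i = 0; i < m->n; i++) {
        struct mem_entry cur = m->e[i];
        if (cur.type != MEM_UNACCEPTED || cur.end <= start || cur.start >= end)
            continue;
        uint64_t s = cur.start > start ? cur.start : start;
        uint64_t t = cur.end < end ? cur.end : end;
        size_t k = i;
        if (cur.start < s) {
            m->e[k].end = s;  /* head stays unaccepted */
            insert_entry(m, ++k, (struct mem_entry){ s, t, MEM_RAM });
        } else {
            m->e[k] = (struct mem_entry){ s, t, MEM_RAM };
        }
        if (t < cur.end)
            insert_entry(m, ++k, (struct mem_entry){ t, cur.end, MEM_UNACCEPTED });
        i = k;  /* continue after the pieces just written */
    }
}
```

Adjacent MEM_RAM entries are not merged here; a fuller version would probably coalesce them to keep the map small.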
>
> The decompression code validates the memory that it needs for
> decompression, modifies the memmap accordingly and passes control to the
> main kernel.
>
> The main kernel may accept memory via #VE/#VC, but ideally it needs to
> stay within the memory accepted by the decompression code for initial boot.
>
> I think the bulk of memory validation can be done via existing machinery:
> we already have deferred struct page initialization code in the kernel and
> I believe we can hook into it for this purpose.
>
> Any comments?
>