From: David Hildenbrand
Organization: Red Hat
To: Joerg Roedel, David Rientjes, Borislav Petkov, Andy Lutomirski,
    Sean Christopherson, Andrew Morton, Vlastimil Babka,
    "Kirill A. Shutemov", Andi Kleen, Brijesh Singh, Tom Lendacky,
    Jon Grimm, Thomas Gleixner, Peter Zijlstra, Paolo Bonzini,
    Ingo Molnar, "Kaplan, David", Varad Gautam, Dario Faggioli
Cc: x86@kernel.org, linux-mm@kvack.org, linux-coco@lists.linux.dev
Subject: Re: Runtime Memory Validation in Intel-TDX and AMD-SNP
Date: Thu, 22 Jul 2021 17:57:37 +0200

On 19.07.21 14:58, Joerg Roedel wrote:
> Hi,
>
> I'd like to get some movement again into the discussion around how to
> implement runtime memory validation for
> confidential guests and wrote up
> some thoughts on it.
> Below are the results in the form of a proposal I put together. Please
> let me know your thoughts on it and whether it fits everyone's
> requirements.
>
> Thanks,
>
> Joerg
>
> Proposal for Runtime Memory Validation in Secure Guests on x86
> ==============================================================
>
> This proposal describes a method and protocol for runtime validation of
> memory in virtualization guests running with Intel Trust Domain
> Extensions (Intel-TDX) or AMD Secure Nested Paging (AMD-SNP).
>
> AMD-SNP and Intel-TDX use different terms to discuss memory page states.
> In AMD-SNP memory has to be 'validated' while in Intel-TDX it will be
> 'accepted'. This document uses the term 'validated' for both.
>
> Problem Statement
> -----------------
>
> Virtualization guests which run with AMD-SNP or Intel-TDX need to
> validate their memory before using it. The validation assigns a hardware
> state to each page which allows the guest to detect when the hypervisor
> tries to maliciously access or remap a guest-private page. The guest can
> only access validated pages.
>
> There are three ways the guest memory can be validated:
>
>    I.   The firmware validates all of guest memory at boot time. This
>         is the simplest method which requires the least changes to
>         the Linux kernel. But this method is also very slow and
>         causes unwanted delays in the boot process, as verification
>         can take several seconds (depending on guest memory size).
>
>    II.  The firmware only validates its own memory and memory
>         validation happens as the memory is used. This significantly
>         improves the boot time, but needs more intrusive changes to
>         the Linux kernel and its boot process.
>
>    III. Approach I. and II. can be combined.
>         The firmware only validates the first X MB/GB of guest
>         memory and the rest is validated on-demand.
>
> For method II. and III. the guest needs to track which pages have
> already been validated to detect hypervisor attacks. This information
> needs to be carried through the whole boot process.
>
> This poses challenges for the Linux boot process, as there is currently
> no way to forward information about validated memory up the boot chain.
> This proposal tries to describe a way to solve these challenges.
>
> Memory Validation through the Boot Process and in the Running System
> --------------------------------------------------------------------
>
> The memory is validated throughout the boot process as described below.
> These steps assume a firmware is present, but this proposal does not
> strictly require a firmware. The tasks done by the firmware can also be
> done by the hypervisor before starting the guest. The steps are:
>
>    1. The firmware validates all memory which will not be owned by
>       the boot loader or the OS.
>
>    2. The firmware also validates the first X MB of memory, just
>       enough to run a boot loader and to load the compressed Linux
>       kernel image. X is not expected to be very large, 64 or 128
>       MB should be enough. This pre-validation should not cause
>       significant delays in the boot process.
>
>    3. The validated memory is marked E820-Usable in struct
>       boot_params for the Linux decompressor. The rest of the
>       memory is also passed to Linux via new special E820 entries
>       which mark the memory as Usable-but-Invalid.
>
>    4. When the Linux decompressor takes over control, it evaluates
>       the E820 table and calculates the total amount of memory
>       available to Linux (valid and invalid memory).
>
>       The decompressor allocates a physically contiguous data
>       structure at a random memory location which is big enough to
>       hold the validation states of all 4 KiB pages available to
>       the guest.
>       This data structure will be called the Validation
>       Bitmap through the rest of this document. The Validation
>       Bitmap is indexed by page frame numbers.
>
>       It still needs to be determined how many bits are required
>       per page. This depends on the necessity to track validation
>       page-sizes. Two bits per page are enough to track the 3
>       page-sizes currently available on the x86 architecture.
>
>       The decompressor initializes the Validation Bitmap by first
>       validating its backing memory and then updating it with the
>       information from the E820 table. It will also update the
>       table if it changes the state of pages from invalid to valid
>       (and vice versa, e.g. for mapping a GHCB page).
>
>    5. The 'struct boot_params' is extended to carry the location
>       and size of the Validation Bitmap to the extracted kernel
>       image.
>       In fact, since the decompressor already receives a 'struct
>       boot_params', it will check if it carries a Validation
>       Bitmap. If it does, the decompressor uses the existing one
>       instead of allocating a new one.
>
>    6. When the extracted kernel image takes over control, it will
>       make sure the Validation Bitmap is up to date when memory
>       needs to be validated.
>
>    7. When set up, the memblock and page allocators have to check
>       whether the memory they return is already validated, and
>       validate it if not.
>
>       This should happen after the memory is allocated and all
>       allocator-locks are dropped, but before the memory is
>       returned to the caller. This way the access to the
>       validation bitmap can be implemented without locking and only
>       using atomic instructions.
>
>       Under no circumstances is the Linux kernel allowed to
>       validate a page more than once. Doing this might create
>       attack vectors for the hypervisor towards the guest.
>
>    8. When memory is returned to the memblock or page allocators,
>       it is _not_ invalidated. In fact, all memory which is freed
>       needs to be valid. If it was marked invalid in the meantime
>       (e.g.
>       if the memory was used for DMA buffers), the code
>       owning the memory needs to validate it again before freeing
>       it.
>
>       The benefit of doing memory validation at allocation time is
>       that it keeps the exception handler for invalid memory
>       simple, because no exceptions of this kind are expected under
>       normal operation.
>
> The Validation Bitmap
> ---------------------
>
> This document proposes the use of a Validation Bitmap to store the
> validation state of guest pages. This section discusses the benefits of
> this approach.
>
> The Linux kernel already has an array to store various state for each
> memory page in the system: The struct page array. While this would be a
> natural place to also store page validation information, the Validation
> Bitmap is chosen because having the information separated has some clear
> benefits:
>
>    - The Validation Bitmap is allocated in the Linux decompressor
>      and already available long before the struct page array is
>      initialized.
>
>    - Since it is a simple in-memory data structure which is
>      physically contiguous, it can be passed along through the
>      various stages of the boot process.
>
>    - It can even be passed to a new kernel booted via kexec/kdump,
>      making it trivial to enable these features for AMD-SNP and
>      Intel-TDX.
>
>    - When memory validation happens in the memblock and page
>      allocators, there is no need for locking when making changes
>      to the Validation Bitmap, because:
>
>        - Nobody will try to concurrently access the same bits, as
>          the code-path doing the validation is the only owner of
>          the memory.
>
>        - Updates can happen via atomic cmpxchg instructions
>          when multiple bits are used per page. If only one bit is
>          needed, atomic bit manipulation instructions will suffice.
>
>    - NUMA-locality is not considered to be a problem for the
>      Validation Bitmap. Since memory is not invalidated upon free,
>      the data structure will become read-mostly over time.
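
For illustration only, the lock-free update described above could look
roughly like the following userspace sketch. It assumes two bits per
page and uses C11 atomics in place of the kernel's cmpxchg helpers. The
state encoding and every name in it (vstate, vbitmap_bytes, vbitmap_get,
vbitmap_transition) are made up for this example and are not part of
the proposal or of any existing kernel interface.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical 2-bit encoding: invalid, or validated at one of 3 page sizes. */
enum vstate { V_INVALID = 0, V_4K = 1, V_2M = 2, V_1G = 3 };

#define BITS_PER_PAGE  2u
#define PAGES_PER_WORD (64u / BITS_PER_PAGE)

/* Bytes needed to track nr_pages 4 KiB pages, rounded up to whole words. */
static size_t vbitmap_bytes(uint64_t nr_pages)
{
	uint64_t words = (nr_pages + PAGES_PER_WORD - 1) / PAGES_PER_WORD;

	return (size_t)words * sizeof(uint64_t);
}

/* Read the state of one page; the map is indexed by page frame number. */
static enum vstate vbitmap_get(_Atomic uint64_t *map, uint64_t pfn)
{
	unsigned int shift = (unsigned int)(pfn % PAGES_PER_WORD) * BITS_PER_PAGE;
	uint64_t word = atomic_load(&map[pfn / PAGES_PER_WORD]);

	return (enum vstate)((word >> shift) & 3u);
}

/*
 * Move one page from state 'from' to state 'to' without taking a lock.
 * Returns false if the page is not in state 'from', so a caller never
 * validates a page that is already marked valid. The compare-exchange
 * loop only retries when neighbouring pages sharing the same 64-bit
 * word were updated concurrently.
 */
static bool vbitmap_transition(_Atomic uint64_t *map, uint64_t pfn,
			       enum vstate from, enum vstate to)
{
	unsigned int shift = (unsigned int)(pfn % PAGES_PER_WORD) * BITS_PER_PAGE;
	uint64_t mask = (uint64_t)3 << shift;
	_Atomic uint64_t *word;
	uint64_t old, desired;

	assert(map != NULL);
	word = &map[pfn / PAGES_PER_WORD];
	old = atomic_load(word);

	for (;;) {
		if (((old & mask) >> shift) != (uint64_t)from)
			return false;
		desired = (old & ~mask) | ((uint64_t)to << shift);
		/* On failure 'old' is refreshed with the current word value. */
		if (atomic_compare_exchange_weak(word, &old, desired))
			return true;
	}
}
```

With this layout a 4 GiB guest (1048576 pages of 4 KiB) would need a
256 KiB map. An allocator hook would call something like
vbitmap_transition(map, pfn, V_INVALID, V_4K) and only issue the
hardware validation for the page when that call returns true.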
>
> Final Notes
> -----------
>
> This proposal does not introduce requirements about the firmware that
> has to be used to run Intel-TDX or AMD-SNP guests. It works with UEFI
> and non-UEFI firmware, or with no firmware at all. This is important
> for use-cases like Confidential Containers running in VMs, which often
> use a very small firmware (or no firmware at all) for reducing boot
> times.
>

Although it is most probably not what people want to have, I'd just
like to mention something that might be possible. It essentially
hotplugs memory during boot, which has been suggested here already ...

1. Start the VM with small memory (e.g., 256MiB)
2. Let the firmware validate all boot memory
3. Use virtio-mem to expose additional memory to the VM

As the VM boots up, virtio-mem will add the requested amount of memory
to the guest. While it gets added, it will get validated and exposed to
the page allocator.

kexec might need some thought if we end up invalidating parts of our
validated boot memory (I assume that will happen when sharing memory).
We would have to express these semantics in the e820 map we forward to
our new kernel.

Pretty much all you'd need to do is teach virtio-mem encrypted memory
semantics. Shouldn't be too hard I guess, but we would have to look into
the details.

-- 
Thanks,

David / dhildenb