From: Kai Huang <kai.huang@intel.com>
To: linux-kernel@vger.kernel.org, kvm@vger.kernel.org
Cc: linux-mm@kvack.org, dave.hansen@intel.com, peterz@infradead.org,
tglx@linutronix.de, seanjc@google.com, pbonzini@redhat.com,
dan.j.williams@intel.com, rafael.j.wysocki@intel.com,
kirill.shutemov@linux.intel.com, ying.huang@intel.com,
reinette.chatre@intel.com, len.brown@intel.com,
tony.luck@intel.com, ak@linux.intel.com,
isaku.yamahata@intel.com, chao.gao@intel.com,
sathyanarayanan.kuppuswamy@linux.intel.com, david@redhat.com,
bagasdotme@gmail.com, sagis@google.com, imammedo@redhat.com,
kai.huang@intel.com
Subject: [PATCH v9 18/18] Documentation/x86: Add documentation for TDX host support
Date: Tue, 14 Feb 2023 00:59:25 +1300 [thread overview]
Message-ID: <d7ba69d0bb2a655a7f3a428328142738d5831d0c.1676286526.git.kai.huang@intel.com> (raw)
In-Reply-To: <cover.1676286526.git.kai.huang@intel.com>
Add documentation for TDX host kernel support. There is already one
file Documentation/x86/tdx.rst containing documentation for TDX guest
internals. Also reuse it for TDX host kernel support.
Introduce a new level menu "TDX Guest Support" and move existing
materials under it, and add a new menu for TDX host kernel support.
Signed-off-by: Kai Huang <kai.huang@intel.com>
---
Documentation/x86/tdx.rst | 176 +++++++++++++++++++++++++++++++++++---
1 file changed, 165 insertions(+), 11 deletions(-)
diff --git a/Documentation/x86/tdx.rst b/Documentation/x86/tdx.rst
index dc8d9fd2c3f7..8a84d7646bc3 100644
--- a/Documentation/x86/tdx.rst
+++ b/Documentation/x86/tdx.rst
@@ -10,6 +10,160 @@ encrypting the guest memory. In TDX, a special module running in a special
mode sits between the host and the guest and manages the guest/host
separation.
+TDX Host Kernel Support
+=======================
+
+TDX introduces a new CPU mode called Secure Arbitration Mode (SEAM) and
+a new isolated range pointed by the SEAM Ranger Register (SEAMRR). A
+CPU-attested software module called 'the TDX module' runs inside the new
+isolated range to provide the functionalities to manage and run protected
+VMs.
+
+TDX also leverages Intel Multi-Key Total Memory Encryption (MKTME) to
+provide crypto-protection to the VMs. TDX reserves part of MKTME KeyIDs
+as TDX private KeyIDs, which are only accessible within the SEAM mode.
+BIOS is responsible for partitioning legacy MKTME KeyIDs and TDX KeyIDs.
+
+Before the TDX module can be used to create and run protected VMs, it
+must be loaded into the isolated range and properly initialized. The TDX
+architecture doesn't require the BIOS to load the TDX module, but the
+kernel assumes it is loaded by the BIOS.
+
+TDX boot-time detection
+-----------------------
+
+The kernel detects TDX by detecting TDX private KeyIDs during kernel
+boot. Below dmesg shows when TDX is enabled by BIOS::
+
+ [..] tdx: BIOS enabled: private KeyID range: [16, 64).
+
+TDX module detection and initialization
+---------------------------------------
+
+There is no CPUID or MSR to detect the TDX module. The kernel detects it
+by initializing it.
+
+The kernel talks to the TDX module via the new SEAMCALL instruction. The
+TDX module implements SEAMCALL leaf functions to allow the kernel to
+initialize it.
+
+Initializing the TDX module consumes roughly ~1/256th system RAM size to
+use it as 'metadata' for the TDX memory. It also takes additional CPU
+time to initialize those metadata along with the TDX module itself. Both
+are not trivial. The kernel initializes the TDX module at runtime on
+demand. The caller to call tdx_enable() to initialize the TDX module::
+
+ ret = tdx_enable();
+ if (ret)
+ goto no_tdx;
+ // TDX is ready to use
+
+One step of initializing the TDX module requires at least one online cpu
+for each package. The caller needs to guarantee this otherwise the
+initialization will fail.
+
+Making SEAMCALL requires the CPU already being in VMX operation (VMXON
+has been done). For now tdx_enable() doesn't handle VMXON internally,
+but depends on the caller to guarantee that. So far only KVM calls
+tdx_enable() and KVM already handles VMXON.
+
+User can consult dmesg to see the presence of the TDX module, and whether
+it has been initialized.
+
+If the TDX module is not loaded, dmesg shows below::
+
+ [..] tdx: TDX module is not loaded.
+
+If the TDX module is initialized successfully, dmesg shows something
+like below::
+
+ [..] tdx: TDX module: attributes 0x0, vendor_id 0x8086, major_version 1, minor_version 0, build_date 20211209, build_num 160
+ [..] tdx: 262668 KBs allocated for PAMT.
+ [..] tdx: TDX module initialized.
+
+If the TDX module failed to initialize, dmesg also shows it failed to
+initialize::
+
+ [..] tdx: initialization failed ...
+
+TDX Interaction to Other Kernel Components
+------------------------------------------
+
+TDX Memory Policy
+~~~~~~~~~~~~~~~~~
+
+TDX reports a list of "Convertible Memory Region" (CMR) to tell the
+kernel which memory is TDX compatible. The kernel needs to build a list
+of memory regions (out of CMRs) as "TDX-usable" memory and pass those
+regions to the TDX module. Once this is done, those "TDX-usable" memory
+regions are fixed during module's lifetime.
+
+To keep things simple, currently the kernel simply guarantees all pages
+in the page allocator are TDX memory. Specifically, the kernel uses all
+system memory in the core-mm at the time of initializing the TDX module
+as TDX memory, and in the meantime, refuses to online any non-TDX-memory
+in the memory hotplug.
+
+This can be enhanced in the future, i.e. by allowing adding non-TDX
+memory to a separate NUMA node. In this case, the "TDX-capable" nodes
+and the "non-TDX-capable" nodes can co-exist, but the kernel/userspace
+needs to guarantee memory pages for TDX guests are always allocated from
+the "TDX-capable" nodes.
+
+Physical Memory Hotplug
+~~~~~~~~~~~~~~~~~~~~~~~
+
+Note TDX assumes convertible memory is always physically present during
+machine's runtime. A non-buggy BIOS should never support hot-removal of
+any convertible memory. This implementation doesn't handle ACPI memory
+removal but depends on the BIOS to behave correctly.
+
+CPU Hotplug
+~~~~~~~~~~~
+
+TDX module requires one SEAMCALL (TDH.SYS.LP.INIT) to do per-cpu module
+initialization on one cpu before any other SEAMCALLs can be made on that
+cpu, including those involved during the module initialization.
+
+Currently kernel simply guarantees all online cpus are "TDX-runnable"
+(TDH.SYS.LP.INIT has been done successfully on them). During module
+initialization, the SEAMCALL is done for all online cpus and CPU hotplug
+is disabled during the entire module initialization. If any fails, TDX
+is disabled. In CPU hotplug, the kernel provides another function
+tdx_cpu_online() for the user of TDX (KVM for now) to call in it's own
+CPU online callback, and reject to online the cpu if SEAMCALL fails.
+
+TDX doesn't support physical (ACPI) CPU hotplug. During machine boot,
+TDX verifies all boot-time present logical CPUs are TDX compatible before
+enabling TDX. A non-buggy BIOS should never support hot-add/removal of
+physical CPU. Currently the kernel doesn't handle physical CPU hotplug,
+but depends on the BIOS to behave correctly.
+
+Note TDX works with CPU logical online/offline, thus the kernel still
+allows to offline logical CPU and online it again.
+
+Kexec()
+~~~~~~~
+
+There are two problems in terms of using kexec() to boot to a new kernel
+when the old kernel has enabled TDX: 1) Part of the memory pages are
+still TDX private pages; 2) There might be dirty cachelines associated
+with TDX private pages.
+
+The first problem doesn't matter. KeyID 0 doesn't have integrity check.
+Even the new kernel wants use any non-zero KeyID, it needs to convert
+the memory to that KeyID and such conversion would work from any KeyID.
+
+However the old kernel needs to guarantee there's no dirty cacheline
+left behind before booting to the new kernel to avoid silent corruption
+from later cacheline writeback (Intel hardware doesn't guarantee cache
+coherency across different KeyIDs).
+
+Similar to AMD SME, the kernel just uses wbinvd() to flush cache before
+booting to the new kernel.
+
+TDX Guest Support
+=================
Since the host cannot directly access guest registers or memory, much
normal functionality of a hypervisor must be moved into the guest. This is
implemented using a Virtualization Exception (#VE) that is handled by the
@@ -20,7 +174,7 @@ TDX includes new hypercall-like mechanisms for communicating from the
guest to the hypervisor or the TDX module.
New TDX Exceptions
-==================
+------------------
TDX guests behave differently from bare-metal and traditional VMX guests.
In TDX guests, otherwise normal instructions or memory accesses can cause
@@ -30,7 +184,7 @@ Instructions marked with an '*' conditionally cause exceptions. The
details for these instructions are discussed below.
Instruction-based #VE
----------------------
+~~~~~~~~~~~~~~~~~~~~~
- Port I/O (INS, OUTS, IN, OUT)
- HLT
@@ -41,7 +195,7 @@ Instruction-based #VE
- CPUID*
Instruction-based #GP
----------------------
+~~~~~~~~~~~~~~~~~~~~~
- All VMX instructions: INVEPT, INVVPID, VMCLEAR, VMFUNC, VMLAUNCH,
VMPTRLD, VMPTRST, VMREAD, VMRESUME, VMWRITE, VMXOFF, VMXON
@@ -52,7 +206,7 @@ Instruction-based #GP
- RDMSR*,WRMSR*
RDMSR/WRMSR Behavior
---------------------
+~~~~~~~~~~~~~~~~~~~~
MSR access behavior falls into three categories:
@@ -73,7 +227,7 @@ trapping and handling in the TDX module. Other than possibly being slow,
these MSRs appear to function just as they would on bare metal.
CPUID Behavior
---------------
+~~~~~~~~~~~~~~
For some CPUID leaves and sub-leaves, the virtualized bit fields of CPUID
return values (in guest EAX/EBX/ECX/EDX) are configurable by the
@@ -93,7 +247,7 @@ not know how to handle. The guest kernel may ask the hypervisor for the
value with a hypercall.
#VE on Memory Accesses
-======================
+----------------------
There are essentially two classes of TDX memory: private and shared.
Private memory receives full TDX protections. Its content is protected
@@ -107,7 +261,7 @@ entries. This helps ensure that a guest does not place sensitive
information in shared memory, exposing it to the untrusted hypervisor.
#VE on Shared Memory
---------------------
+~~~~~~~~~~~~~~~~~~~~
Access to shared mappings can cause a #VE. The hypervisor ultimately
controls whether a shared memory access causes a #VE, so the guest must be
@@ -127,7 +281,7 @@ be careful not to access device MMIO regions unless it is also prepared to
handle a #VE.
#VE on Private Pages
---------------------
+~~~~~~~~~~~~~~~~~~~~
An access to private mappings can also cause a #VE. Since all kernel
memory is also private memory, the kernel might theoretically need to
@@ -145,7 +299,7 @@ The hypervisor is permitted to unilaterally move accepted pages to a
to handle the exception.
Linux #VE handler
-=================
+-----------------
Just like page faults or #GP's, #VE exceptions can be either handled or be
fatal. Typically, an unhandled userspace #VE results in a SIGSEGV.
@@ -167,7 +321,7 @@ While the block is in place, any #VE is elevated to a double fault (#DF)
which is not recoverable.
MMIO handling
-=============
+-------------
In non-TDX VMs, MMIO is usually implemented by giving a guest access to a
mapping which will cause a VMEXIT on access, and then the hypervisor
@@ -189,7 +343,7 @@ MMIO access via other means (like structure overlays) may result in an
oops.
Shared Memory Conversions
-=========================
+-------------------------
All TDX guest memory starts out as private at boot. This memory can not
be accessed by the hypervisor. However, some kernel users like device
--
2.39.1
prev parent reply other threads:[~2023-02-13 12:01 UTC|newest]
Thread overview: 52+ messages / expand[flat|nested] mbox.gz Atom feed top
2023-02-13 11:59 [PATCH v9 00/18] TDX host kernel support Kai Huang
2023-02-13 11:59 ` [PATCH v9 01/18] x86/tdx: Define TDX supported page sizes as macros Kai Huang
2023-02-13 11:59 ` [PATCH v9 02/18] x86/virt/tdx: Detect TDX during kernel boot Kai Huang
2023-02-13 11:59 ` [PATCH v9 03/18] x86/virt/tdx: Make INTEL_TDX_HOST depend on X86_X2APIC Kai Huang
2023-02-13 11:59 ` [PATCH v9 04/18] x86/virt/tdx: Add skeleton to initialize TDX on demand Kai Huang
2023-02-14 12:46 ` Peter Zijlstra
2023-02-14 17:23 ` Dave Hansen
2023-02-14 21:08 ` Huang, Kai
2023-02-13 11:59 ` [PATCH v9 05/18] x86/virt/tdx: Add SEAMCALL infrastructure Kai Huang
2023-02-13 17:48 ` Dave Hansen
2023-02-13 21:21 ` Huang, Kai
2023-02-13 22:39 ` Dave Hansen
2023-02-13 23:22 ` Huang, Kai
2023-02-14 8:57 ` Huang, Kai
2023-02-14 17:27 ` Dave Hansen
2023-02-14 22:17 ` Huang, Kai
2023-02-14 12:42 ` Peter Zijlstra
2023-02-14 21:02 ` Huang, Kai
2023-02-13 11:59 ` [PATCH v9 06/18] x86/virt/tdx: Do TDX module global initialization Kai Huang
2023-02-13 11:59 ` [PATCH v9 07/18] x86/virt/tdx: Do TDX module per-cpu initialization Kai Huang
2023-02-13 17:59 ` Dave Hansen
2023-02-13 21:19 ` Huang, Kai
2023-02-13 22:43 ` Dave Hansen
2023-02-14 0:02 ` Huang, Kai
2023-02-14 14:12 ` Peter Zijlstra
2023-02-14 22:53 ` Huang, Kai
2023-02-15 9:16 ` Peter Zijlstra
2023-02-15 9:46 ` Huang, Kai
2023-02-15 13:25 ` Peter Zijlstra
2023-02-15 21:37 ` Huang, Kai
2023-03-06 14:26 ` Huang, Kai
2023-02-13 18:07 ` Dave Hansen
2023-02-13 21:13 ` Huang, Kai
2023-02-13 22:28 ` Dave Hansen
2023-02-13 23:43 ` Huang, Kai
2023-02-13 23:52 ` Dave Hansen
2023-02-14 0:09 ` Huang, Kai
2023-02-14 14:12 ` Peter Zijlstra
2023-02-14 12:59 ` Peter Zijlstra
2023-02-13 11:59 ` [PATCH v9 08/18] x86/virt/tdx: Get information about TDX module and TDX-capable memory Kai Huang
2023-02-13 11:59 ` [PATCH v9 09/18] x86/virt/tdx: Use all system memory when initializing TDX module as TDX memory Kai Huang
2023-02-14 3:30 ` Huang, Ying
2023-02-14 8:24 ` Huang, Kai
2023-02-13 11:59 ` [PATCH v9 10/18] x86/virt/tdx: Add placeholder to construct TDMRs to cover all TDX memory regions Kai Huang
2023-02-13 11:59 ` [PATCH v9 11/18] x86/virt/tdx: Fill out " Kai Huang
2023-02-13 11:59 ` [PATCH v9 12/18] x86/virt/tdx: Allocate and set up PAMTs for TDMRs Kai Huang
2023-02-13 11:59 ` [PATCH v9 13/18] x86/virt/tdx: Designate reserved areas for all TDMRs Kai Huang
2023-02-13 11:59 ` [PATCH v9 14/18] x86/virt/tdx: Configure TDX module with the TDMRs and global KeyID Kai Huang
2023-02-13 11:59 ` [PATCH v9 15/18] x86/virt/tdx: Configure global KeyID on all packages Kai Huang
2023-02-13 11:59 ` [PATCH v9 16/18] x86/virt/tdx: Initialize all TDMRs Kai Huang
2023-02-13 11:59 ` [PATCH v9 17/18] x86/virt/tdx: Flush cache in kexec() when TDX is enabled Kai Huang
2023-02-13 11:59 ` Kai Huang [this message]
Reply instructions:
You may reply publicly to this message via plain-text email
using any one of the following methods:
* Save the following mbox file, import it into your mail client,
and reply-to-all from there: mbox
Avoid top-posting and favor interleaved quoting:
https://en.wikipedia.org/wiki/Posting_style#Interleaved_style
* Reply using the --to, --cc, and --in-reply-to
switches of git-send-email(1):
git send-email \
--in-reply-to=d7ba69d0bb2a655a7f3a428328142738d5831d0c.1676286526.git.kai.huang@intel.com \
--to=kai.huang@intel.com \
--cc=ak@linux.intel.com \
--cc=bagasdotme@gmail.com \
--cc=chao.gao@intel.com \
--cc=dan.j.williams@intel.com \
--cc=dave.hansen@intel.com \
--cc=david@redhat.com \
--cc=imammedo@redhat.com \
--cc=isaku.yamahata@intel.com \
--cc=kirill.shutemov@linux.intel.com \
--cc=kvm@vger.kernel.org \
--cc=len.brown@intel.com \
--cc=linux-kernel@vger.kernel.org \
--cc=linux-mm@kvack.org \
--cc=pbonzini@redhat.com \
--cc=peterz@infradead.org \
--cc=rafael.j.wysocki@intel.com \
--cc=reinette.chatre@intel.com \
--cc=sagis@google.com \
--cc=sathyanarayanan.kuppuswamy@linux.intel.com \
--cc=seanjc@google.com \
--cc=tglx@linutronix.de \
--cc=tony.luck@intel.com \
--cc=ying.huang@intel.com \
/path/to/YOUR_REPLY
https://kernel.org/pub/software/scm/git/docs/git-send-email.html
* If your mail client supports setting the In-Reply-To header
via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line
before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox