* Re: [PATCH v11 04/20] x86/cpu: Detect TDX partial write machine check erratum
[not found] ` <20230606123821.exit7gyxs42dxotz@box.shutemov.name>
@ 2023-06-06 22:58 ` Huang, Kai
2023-06-07 15:06 ` kirill.shutemov
0 siblings, 1 reply; 144+ messages in thread
From: Huang, Kai @ 2023-06-06 22:58 UTC (permalink / raw)
To: kirill.shutemov
Cc: kvm, Hansen, Dave, david, bagasdotme, ak, Wysocki, Rafael J,
linux-kernel, Chatre, Reinette, Christopherson,,
Sean, pbonzini, tglx, linux-mm, Yamahata, Isaku, Luck, Tony,
peterz, Shahar, Sagi, imammedo, Gao, Chao, Brown, Len,
sathyanarayanan.kuppuswamy, Huang, Ying, Williams, Dan J
On Tue, 2023-06-06 at 15:38 +0300, kirill.shutemov@linux.intel.com wrote:
> On Mon, Jun 05, 2023 at 02:27:17AM +1200, Kai Huang wrote:
> > TDX memory has integrity and confidentiality protections. Violations of
> > this integrity protection are supposed to only affect TDX operations and
> > are never supposed to affect the host kernel itself. In other words,
> > the host kernel should never, itself, see machine checks induced by the
> > TDX integrity hardware.
> >
> > Alas, the first few generations of TDX hardware have an erratum. A
> > "partial" write to a TDX private memory cacheline will silently "poison"
> > the line. Subsequent reads will consume the poison and generate a
> > machine check. According to the TDX hardware spec, neither of these
> > things should have happened.
> >
> > Virtually all kernel memory accesses operations happen in full
> > cachelines. In practice, writing a "byte" of memory usually reads a 64
> > byte cacheline of memory, modifies it, then writes the whole line back.
> > Those operations do not trigger this problem.
> >
> > This problem is triggered by "partial" writes where a write transaction
> > of less than cacheline lands at the memory controller. The CPU does
> > these via non-temporal write instructions (like MOVNTI), or through
> > UC/WC memory mappings. The issue can also be triggered away from the
> > CPU by devices doing partial writes via DMA.
> >
> > With this erratum, there are additional things need to be done around
> > machine check handler and kexec(), etc. Similar to other CPU bugs, use
> > a CPU bug bit to indicate this erratum, and detect this erratum during
> > early boot. Note this bug reflects the hardware thus it is detected
> > regardless of whether the kernel is built with TDX support or not.
> >
> > Signed-off-by: Kai Huang <kai.huang@intel.com>
> > ---
> >
> > v10 -> v11:
> > - New patch
> >
> > ---
> > arch/x86/include/asm/cpufeatures.h | 1 +
> > arch/x86/kernel/cpu/intel.c | 21 +++++++++++++++++++++
> > 2 files changed, 22 insertions(+)
> >
> > diff --git a/arch/x86/include/asm/cpufeatures.h b/arch/x86/include/asm/cpufeatures.h
> > index cb8ca46213be..dc8701f8d88b 100644
> > --- a/arch/x86/include/asm/cpufeatures.h
> > +++ b/arch/x86/include/asm/cpufeatures.h
> > @@ -483,5 +483,6 @@
> > #define X86_BUG_RETBLEED X86_BUG(27) /* CPU is affected by RETBleed */
> > #define X86_BUG_EIBRS_PBRSB X86_BUG(28) /* EIBRS is vulnerable to Post Barrier RSB Predictions */
> > #define X86_BUG_SMT_RSB X86_BUG(29) /* CPU is vulnerable to Cross-Thread Return Address Predictions */
> > +#define X86_BUG_TDX_PW_MCE X86_BUG(30) /* CPU may incur #MC if non-TD software does partial write to TDX private memory */
> >
> > #endif /* _ASM_X86_CPUFEATURES_H */
> > diff --git a/arch/x86/kernel/cpu/intel.c b/arch/x86/kernel/cpu/intel.c
> > index 1c4639588ff9..251b333e53d2 100644
> > --- a/arch/x86/kernel/cpu/intel.c
> > +++ b/arch/x86/kernel/cpu/intel.c
> > @@ -1552,3 +1552,24 @@ u8 get_this_hybrid_cpu_type(void)
> >
> > return cpuid_eax(0x0000001a) >> X86_HYBRID_CPU_TYPE_ID_SHIFT;
> > }
> > +
> > +/*
> > + * These CPUs have an erratum. A partial write from non-TD
> > + * software (e.g. via MOVNTI variants or UC/WC mapping) to TDX
> > + * private memory poisons that memory, and a subsequent read of
> > + * that memory triggers #MC.
> > + */
> > +static const struct x86_cpu_id tdx_pw_mce_cpu_ids[] __initconst = {
> > + X86_MATCH_INTEL_FAM6_MODEL(SAPPHIRERAPIDS_X, NULL),
> > + X86_MATCH_INTEL_FAM6_MODEL(EMERALDRAPIDS_X, NULL),
> > + { }
> > +};
> > +
> > +static int __init tdx_erratum_detect(void)
> > +{
> > + if (x86_match_cpu(tdx_pw_mce_cpu_ids))
> > + setup_force_cpu_bug(X86_BUG_TDX_PW_MCE);
> > +
> > + return 0;
> > +}
> > +early_initcall(tdx_erratum_detect);
>
> Initcall? Don't we already have a codepath to call it directly?
> Maybe cpu_set_bug_bits()?
>
I didn't like doing it in cpu_set_bug_bits() because the bugs handled in that
function seem to depend on one another. For instance, if a CPU is in the
NO_SPECULATION whitelist, then this function simply returns and assumes all
other bugs are not present:
	static void __init cpu_set_bug_bits(struct cpuinfo_x86 *c)
	{
		u64 ia32_cap = x86_read_arch_cap_msr();
		...
		if (cpu_matches(cpu_vuln_whitelist, NO_SPECULATION))
			return;
		setup_force_cpu_bug(X86_BUG_SPECTRE_V1);
		...
	}
This TDX erratum is quite self-contained, so I think using an initcall is the
cleanest way to do it.
And there are other bug flags that are handled in other places but not in
cpu_set_bug_bits(), for instance,
	static void init_intel(struct cpuinfo_x86 *c)
	{
		...
		if (c->x86 == 6 && boot_cpu_has(X86_FEATURE_MWAIT) &&
		    ((c->x86_model == INTEL_FAM6_ATOM_GOLDMONT)))
			set_cpu_bug(c, X86_BUG_MONITOR);
		...
	}
So it seems there's no hard rule that all bugs need to be done in
cpu_set_bug_bits().
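
For illustration only, the self-contained, table-driven detection described
above can be sketched in userspace C. This is not the kernel code: struct
cpu_id, match_cpu(), detect_bugs() and the MODEL_*/BUG_* macros are
hypothetical stand-ins for x86_cpu_id, x86_match_cpu(),
setup_force_cpu_bug() and the cpufeatures definitions, and the model numbers
are assumed values.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-ins for kernel definitions (values are assumptions). */
#define MODEL_SAPPHIRERAPIDS_X	0x8F
#define MODEL_EMERALDRAPIDS_X	0xCF
#define BUG_TDX_PW_MCE		30	/* mirrors X86_BUG(30) in the patch */

struct cpu_id {
	unsigned int family;
	unsigned int model;
};

/* CPUs assumed to be affected by the partial-write erratum. */
static const struct cpu_id tdx_pw_mce_ids[] = {
	{ 6, MODEL_SAPPHIRERAPIDS_X },
	{ 6, MODEL_EMERALDRAPIDS_X },
};

/* Return true if (family, model) appears in the table. */
static bool match_cpu(const struct cpu_id *tbl, size_t n,
		      unsigned int family, unsigned int model)
{
	for (size_t i = 0; i < n; i++)
		if (tbl[i].family == family && tbl[i].model == model)
			return true;
	return false;
}

/*
 * Independent of any other bug detection: the result depends only on
 * the CPU model, not on speculation whitelists or kernel config.
 */
static uint64_t detect_bugs(unsigned int family, unsigned int model)
{
	uint64_t bugs = 0;
	size_t n = sizeof(tdx_pw_mce_ids) / sizeof(tdx_pw_mce_ids[0]);

	if (match_cpu(tdx_pw_mce_ids, n, family, model))
		bugs |= UINT64_C(1) << BUG_TDX_PW_MCE;

	return bugs;
}
```

Because the check has no early-return dependency on other bug classes, it
can live in its own function, whichever call site ends up invoking it.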
* Re: [PATCH v11 02/20] x86/virt/tdx: Detect TDX during kernel boot
[not found] ` <a2da8af2-41a9-a0cf-dbe9-7f0a14bf05fe@linux.intel.com>
@ 2023-06-06 22:58 ` Huang, Kai
0 siblings, 0 replies; 144+ messages in thread
From: Huang, Kai @ 2023-06-06 22:58 UTC (permalink / raw)
To: sathyanarayanan.kuppuswamy, kvm, linux-kernel
Cc: Hansen, Dave, david, bagasdotme, ak, Wysocki, Rafael J,
kirill.shutemov, Chatre, Reinette, Christopherson,,
Sean, pbonzini, linux-mm, Yamahata, Isaku, tglx, Luck, Tony,
peterz, Shahar, Sagi, imammedo, Gao, Chao, Brown, Len, Huang,
Ying, Williams, Dan J
On Tue, 2023-06-06 at 07:00 -0700, Sathyanarayanan Kuppuswamy wrote:
> > + if (nr_tdx_keyids < 2) {
> > + pr_info("initialization failed: too few private KeyIDs available.\n");
> > + goto no_tdx;
>
> I think you can return -ENODEV directly here. Maybe this goto is added to
> adapt to next patches. But for this patch, I don't think you need it.
OK, I can do that. Thanks.
* Re: [PATCH v11 02/20] x86/virt/tdx: Detect TDX during kernel boot
[not found] ` <af4e428ab1245e9441031438e606c14472daf927.1685887183.git.kai.huang@intel.com>
[not found] ` <a2da8af2-41a9-a0cf-dbe9-7f0a14bf05fe@linux.intel.com>
@ 2023-06-06 23:44 ` Isaku Yamahata
2023-06-19 12:12 ` David Hildenbrand
2 siblings, 0 replies; 144+ messages in thread
From: Isaku Yamahata @ 2023-06-06 23:44 UTC (permalink / raw)
To: Kai Huang
Cc: linux-kernel, kvm, linux-mm, dave.hansen, kirill.shutemov,
tony.luck, peterz, tglx, seanjc, pbonzini, david, dan.j.williams,
rafael.j.wysocki, ying.huang, reinette.chatre, len.brown, ak,
isaku.yamahata, chao.gao, sathyanarayanan.kuppuswamy, bagasdotme,
sagis, imammedo, isaku.yamahata
On Mon, Jun 05, 2023 at 02:27:15AM +1200,
Kai Huang <kai.huang@intel.com> wrote:
> diff --git a/arch/x86/virt/vmx/tdx/tdx.c b/arch/x86/virt/vmx/tdx/tdx.c
> new file mode 100644
> index 000000000000..2d91e7120c90
> --- /dev/null
> +++ b/arch/x86/virt/vmx/tdx/tdx.c
> @@ -0,0 +1,92 @@
> +// SPDX-License-Identifier: GPL-2.0
> +/*
> + * Copyright(c) 2023 Intel Corporation.
> + *
> + * Intel Trusted Domain Extensions (TDX) support
> + */
> +
> +#define pr_fmt(fmt) "tdx: " fmt
> +
> +#include <linux/types.h>
> +#include <linux/cache.h>
> +#include <linux/init.h>
> +#include <linux/errno.h>
> +#include <linux/printk.h>
> +#include <asm/msr-index.h>
> +#include <asm/msr.h>
> +#include <asm/tdx.h>
> +
> +static u32 tdx_global_keyid __ro_after_init;
> +static u32 tdx_guest_keyid_start __ro_after_init;
> +static u32 tdx_nr_guest_keyids __ro_after_init;
> +
> +static int __init record_keyid_partitioning(u32 *tdx_keyid_start,
> + u32 *nr_tdx_keyids)
> +{
> + u32 _nr_mktme_keyids, _tdx_keyid_start, _nr_tdx_keyids;
> + int ret;
> +
> + /*
> + * IA32_MKTME_KEYID_PARTIONING:
> + * Bit [31:0]: Number of MKTME KeyIDs.
> + * Bit [63:32]: Number of TDX private KeyIDs.
> + */
> + ret = rdmsr_safe(MSR_IA32_MKTME_KEYID_PARTITIONING, &_nr_mktme_keyids,
> + &_nr_tdx_keyids);
> + if (ret)
> + return -ENODEV;
> +
> + if (!_nr_tdx_keyids)
> + return -ENODEV;
> +
> + /* TDX KeyIDs start after the last MKTME KeyID. */
> + _tdx_keyid_start = _nr_mktme_keyids + 1;
> +
> + *tdx_keyid_start = _tdx_keyid_start;
> + *nr_tdx_keyids = _nr_tdx_keyids;
> +
> + return 0;
> +}
> +
> +static int __init tdx_init(void)
> +{
> + u32 tdx_keyid_start, nr_tdx_keyids;
> + int err;
> +
> + err = record_keyid_partitioning(&tdx_keyid_start, &nr_tdx_keyids);
> + if (err)
> + return err;
> +
> + pr_info("BIOS enabled: private KeyID range [%u, %u)\n",
> + tdx_keyid_start, tdx_keyid_start + nr_tdx_keyids);
> +
> + /*
> + * The TDX module itself requires one 'global KeyID' to protect
> + * its metadata. If there's only one TDX KeyID, there won't be
> + * any left for TDX guests thus there's no point to enable TDX
> + * at all.
> + */
> + if (nr_tdx_keyids < 2) {
> + pr_info("initialization failed: too few private KeyIDs available.\n");
Because this case goes against the admin's expectation, should this be
pr_warn() or pr_err() instead?
Other than that, looks good to me.
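
As a rough userspace sketch of the KeyID arithmetic in the quoted patch
(assuming the MSR layout stated in its comment: low 32 bits are the number of
MKTME KeyIDs, high 32 bits the number of TDX private KeyIDs), the decoding
and the "fewer than two KeyIDs" check could look like the following. The
struct and function names are hypothetical and exist only for this example.

```c
#include <stdint.h>

/* Hypothetical container for the decoded partitioning. */
struct keyid_partition {
	uint32_t nr_mktme_keyids;
	uint32_t tdx_keyid_start;
	uint32_t nr_tdx_keyids;
};

/*
 * Decode a raw IA32_MKTME_KEYID_PARTITIONING value.
 * Returns 0 on success, -1 if no TDX private KeyIDs are configured.
 */
static int decode_keyid_partitioning(uint64_t msr, struct keyid_partition *p)
{
	uint32_t nr_mktme = (uint32_t)(msr & 0xffffffffu);
	uint32_t nr_tdx = (uint32_t)(msr >> 32);

	if (!nr_tdx)
		return -1;

	p->nr_mktme_keyids = nr_mktme;
	/* TDX KeyIDs start after the last MKTME KeyID. */
	p->tdx_keyid_start = nr_mktme + 1;
	p->nr_tdx_keyids = nr_tdx;
	return 0;
}

/*
 * One KeyID is reserved as the global KeyID, so at least two are
 * needed before any guest can be given one.
 */
static int tdx_keyids_usable(const struct keyid_partition *p)
{
	return p->nr_tdx_keyids >= 2;
}
```

With 31 MKTME KeyIDs and 32 TDX KeyIDs, for example, the private range is
[32, 64). A BIOS that reports exactly one TDX KeyID decodes successfully but
leaves nothing for guests, which is the case the log-level question is about.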
Reviewed-by: Isaku Yamahata <isaku.yamahata@intel.com>
--
Isaku Yamahata <isaku.yamahata@gmail.com>
* Re: [PATCH v11 05/20] x86/virt/tdx: Add SEAMCALL infrastructure
[not found] ` <ec640452a4385d61bec97f8b761ed1ff38898504.1685887183.git.kai.huang@intel.com>
@ 2023-06-06 23:55 ` Isaku Yamahata
2023-06-07 14:24 ` Dave Hansen
2023-06-19 12:52 ` David Hildenbrand
2 siblings, 0 replies; 144+ messages in thread
From: Isaku Yamahata @ 2023-06-06 23:55 UTC (permalink / raw)
To: Kai Huang
Cc: linux-kernel, kvm, linux-mm, dave.hansen, kirill.shutemov,
tony.luck, peterz, tglx, seanjc, pbonzini, david, dan.j.williams,
rafael.j.wysocki, ying.huang, reinette.chatre, len.brown, ak,
isaku.yamahata, chao.gao, sathyanarayanan.kuppuswamy, bagasdotme,
sagis, imammedo, isaku.yamahata
On Mon, Jun 05, 2023 at 02:27:18AM +1200,
Kai Huang <kai.huang@intel.com> wrote:
> TDX introduces a new CPU mode: Secure Arbitration Mode (SEAM). This
> mode runs only the TDX module itself or other code to load the TDX
> module.
>
> The host kernel communicates with SEAM software via a new SEAMCALL
> instruction. This is conceptually similar to a guest->host hypercall,
> except it is made from the host to SEAM software instead. The TDX
> module establishes a new SEAMCALL ABI which allows the host to
> initialize the module and to manage VMs.
>
> Add infrastructure to make SEAMCALLs. The SEAMCALL ABI is very similar
> to the TDCALL ABI and leverages much TDCALL infrastructure.
>
> SEAMCALL instruction causes #GP when TDX isn't BIOS enabled, and #UD
> when CPU is not in VMX operation. Currently, only KVM code mocks with
> VMX enabling, and KVM is the only user of TDX. This implementation
> chooses to make KVM itself responsible for enabling VMX before using
> TDX and let the rest of the kernel stay blissfully unaware of VMX.
>
> The current TDX_MODULE_CALL macro handles neither #GP nor #UD. The
> kernel would hit Oops if SEAMCALL were mistakenly made w/o enabling VMX
> first. Architecturally, there is no CPU flag to check whether the CPU
> is in VMX operation. Also, if a BIOS were buggy, it could still report
> valid TDX private KeyIDs when TDX actually couldn't be enabled.
>
> Extend the TDX_MODULE_CALL macro to handle #UD and #GP to return error
> codes. Introduce two new TDX error codes for them respectively so the
> caller can distinguish.
>
> Also add a wrapper function of SEAMCALL to convert SEAMCALL error code
> to the kernel error code, and print out SEAMCALL error code to help the
> user to understand what went wrong.
>
> Signed-off-by: Kai Huang <kai.huang@intel.com>
> ---
>
> v10 -> v11:
> - No update
>
> v9 -> v10:
> - Make the TDX_SEAMCALL_{GP|UD} error codes unconditional but doesn't
> define them when INTEL_TDX_HOST is enabled. (Dave)
> - Slightly improved changelog to explain why add assembly code to handle
> #UD and #GP.
>
> v8 -> v9:
> - Changed patch title (Dave).
> - Enhanced seamcall() to include the cpu id to the error message when
> SEAMCALL fails.
>
> v7 -> v8:
> - Improved changelog (Dave):
> - Trim down some sentences (Dave).
> - Removed __seamcall() and seamcall() function name and changed
> accordingly (Dave).
> - Improved the sentence explaining why to handle #GP (Dave).
> - Added code to print out error message in seamcall(), following
> the idea that tdx_enable() to return universal error and print out
> error message to make clear what's going wrong (Dave). Also mention
> this in changelog.
>
> v6 -> v7:
> - No change.
>
> v5 -> v6:
> - Added code to handle #UD and #GP (Dave).
> - Moved the seamcall() wrapper function to this patch, and used a
> temporary __always_unused to avoid compile warning (Dave).
>
> - v3 -> v5 (no feedback on v4):
> - Explicitly tell TDX_SEAMCALL_VMFAILINVALID is returned if the
> SEAMCALL itself fails.
> - Improve the changelog.
>
> ---
> arch/x86/include/asm/tdx.h | 5 +++
> arch/x86/virt/vmx/tdx/Makefile | 2 +-
> arch/x86/virt/vmx/tdx/seamcall.S | 52 +++++++++++++++++++++++++++++
> arch/x86/virt/vmx/tdx/tdx.c | 56 ++++++++++++++++++++++++++++++++
> arch/x86/virt/vmx/tdx/tdx.h | 10 ++++++
> arch/x86/virt/vmx/tdx/tdxcall.S | 19 +++++++++--
> 6 files changed, 141 insertions(+), 3 deletions(-)
> create mode 100644 arch/x86/virt/vmx/tdx/seamcall.S
> create mode 100644 arch/x86/virt/vmx/tdx/tdx.h
>
> diff --git a/arch/x86/include/asm/tdx.h b/arch/x86/include/asm/tdx.h
> index 4dfe2e794411..b489b5b9de5d 100644
> --- a/arch/x86/include/asm/tdx.h
> +++ b/arch/x86/include/asm/tdx.h
> @@ -8,6 +8,8 @@
> #include <asm/ptrace.h>
> #include <asm/shared/tdx.h>
>
> +#include <asm/trapnr.h>
> +
> /*
> * SW-defined error codes.
> *
> @@ -18,6 +20,9 @@
> #define TDX_SW_ERROR (TDX_ERROR | GENMASK_ULL(47, 40))
> #define TDX_SEAMCALL_VMFAILINVALID (TDX_SW_ERROR | _UL(0xFFFF0000))
>
> +#define TDX_SEAMCALL_GP (TDX_SW_ERROR | X86_TRAP_GP)
> +#define TDX_SEAMCALL_UD (TDX_SW_ERROR | X86_TRAP_UD)
> +
> #ifndef __ASSEMBLY__
>
> /* TDX supported page sizes from the TDX module ABI. */
> diff --git a/arch/x86/virt/vmx/tdx/Makefile b/arch/x86/virt/vmx/tdx/Makefile
> index 93ca8b73e1f1..38d534f2c113 100644
> --- a/arch/x86/virt/vmx/tdx/Makefile
> +++ b/arch/x86/virt/vmx/tdx/Makefile
> @@ -1,2 +1,2 @@
> # SPDX-License-Identifier: GPL-2.0-only
> -obj-y += tdx.o
> +obj-y += tdx.o seamcall.o
> diff --git a/arch/x86/virt/vmx/tdx/seamcall.S b/arch/x86/virt/vmx/tdx/seamcall.S
> new file mode 100644
> index 000000000000..f81be6b9c133
> --- /dev/null
> +++ b/arch/x86/virt/vmx/tdx/seamcall.S
> @@ -0,0 +1,52 @@
> +/* SPDX-License-Identifier: GPL-2.0 */
> +#include <linux/linkage.h>
> +#include <asm/frame.h>
> +
> +#include "tdxcall.S"
> +
> +/*
> + * __seamcall() - Host-side interface functions to SEAM software module
> + * (the P-SEAMLDR or the TDX module).
> + *
> + * Transform function call register arguments into the SEAMCALL register
> + * ABI. Return TDX_SEAMCALL_VMFAILINVALID if the SEAMCALL itself fails,
> + * or the completion status of the SEAMCALL leaf function. Additional
> + * output operands are saved in @out (if it is provided by the caller).
> + *
> + *-------------------------------------------------------------------------
> + * SEAMCALL ABI:
> + *-------------------------------------------------------------------------
> + * Input Registers:
> + *
> + * RAX - SEAMCALL Leaf number.
> + * RCX,RDX,R8-R9 - SEAMCALL Leaf specific input registers.
> + *
> + * Output Registers:
> + *
> + * RAX - SEAMCALL completion status code.
> + * RCX,RDX,R8-R11 - SEAMCALL Leaf specific output registers.
> + *
> + *-------------------------------------------------------------------------
> + *
> + * __seamcall() function ABI:
> + *
> + * @fn (RDI) - SEAMCALL Leaf number, moved to RAX
> + * @rcx (RSI) - Input parameter 1, moved to RCX
> + * @rdx (RDX) - Input parameter 2, moved to RDX
> + * @r8 (RCX) - Input parameter 3, moved to R8
> + * @r9 (R8) - Input parameter 4, moved to R9
> + *
> + * @out (R9) - struct tdx_module_output pointer
> + * stored temporarily in R12 (not
> + * used by the P-SEAMLDR or the TDX
> + * module). It can be NULL.
> + *
> + * Return (via RAX) the completion status of the SEAMCALL, or
> + * TDX_SEAMCALL_VMFAILINVALID.
> + */
> +SYM_FUNC_START(__seamcall)
> + FRAME_BEGIN
> + TDX_MODULE_CALL host=1
> + FRAME_END
> + RET
> +SYM_FUNC_END(__seamcall)
> diff --git a/arch/x86/virt/vmx/tdx/tdx.c b/arch/x86/virt/vmx/tdx/tdx.c
> index 2d91e7120c90..e82713dd5d54 100644
> --- a/arch/x86/virt/vmx/tdx/tdx.c
> +++ b/arch/x86/virt/vmx/tdx/tdx.c
> @@ -12,14 +12,70 @@
> #include <linux/init.h>
> #include <linux/errno.h>
> #include <linux/printk.h>
> +#include <linux/smp.h>
> #include <asm/msr-index.h>
> #include <asm/msr.h>
> #include <asm/tdx.h>
> +#include "tdx.h"
>
> static u32 tdx_global_keyid __ro_after_init;
> static u32 tdx_guest_keyid_start __ro_after_init;
> static u32 tdx_nr_guest_keyids __ro_after_init;
>
> +/*
> + * Wrapper of __seamcall() to convert SEAMCALL leaf function error code
> + * to kernel error code. @seamcall_ret and @out contain the SEAMCALL
> + * leaf function return code and the additional output respectively if
> + * not NULL.
> + */
> +static int __always_unused seamcall(u64 fn, u64 rcx, u64 rdx, u64 r8, u64 r9,
> + u64 *seamcall_ret,
> + struct tdx_module_output *out)
> +{
> + int cpu, ret = 0;
> + u64 sret;
> +
> + /* Need a stable CPU id for printing error message */
> + cpu = get_cpu();
> +
> + sret = __seamcall(fn, rcx, rdx, r8, r9, out);
> +
> + /* Save SEAMCALL return code if the caller wants it */
> + if (seamcall_ret)
> + *seamcall_ret = sret;
> +
> + /* SEAMCALL was successful */
> + if (!sret)
> + goto out;
> +
> + switch (sret) {
> + case TDX_SEAMCALL_GP:
> + pr_err_once("[firmware bug]: TDX is not enabled by BIOS.\n");
> + ret = -ENODEV;
> + break;
> + case TDX_SEAMCALL_VMFAILINVALID:
> + pr_err_once("TDX module is not loaded.\n");
> + ret = -ENODEV;
> + break;
> + case TDX_SEAMCALL_UD:
> + pr_err_once("SEAMCALL failed: CPU %d is not in VMX operation.\n",
> + cpu);
> + ret = -EINVAL;
> + break;
> + default:
> + pr_err_once("SEAMCALL failed: CPU %d: leaf %llu, error 0x%llx.\n",
> + cpu, fn, sret);
> + if (out)
> + pr_err_once("additional output: rcx 0x%llx, rdx 0x%llx, r8 0x%llx, r9 0x%llx, r10 0x%llx, r11 0x%llx.\n",
> + out->rcx, out->rdx, out->r8,
> + out->r9, out->r10, out->r11);
> + ret = -EIO;
> + }
> +out:
> + put_cpu();
> + return ret;
> +}
> +
> static int __init record_keyid_partitioning(u32 *tdx_keyid_start,
> u32 *nr_tdx_keyids)
> {
> diff --git a/arch/x86/virt/vmx/tdx/tdx.h b/arch/x86/virt/vmx/tdx/tdx.h
> new file mode 100644
> index 000000000000..48ad1a1ba737
> --- /dev/null
> +++ b/arch/x86/virt/vmx/tdx/tdx.h
> @@ -0,0 +1,10 @@
> +/* SPDX-License-Identifier: GPL-2.0 */
> +#ifndef _X86_VIRT_TDX_H
> +#define _X86_VIRT_TDX_H
> +
> +#include <linux/types.h>
> +
> +struct tdx_module_output;
> +u64 __seamcall(u64 fn, u64 rcx, u64 rdx, u64 r8, u64 r9,
> + struct tdx_module_output *out);
> +#endif
> diff --git a/arch/x86/virt/vmx/tdx/tdxcall.S b/arch/x86/virt/vmx/tdx/tdxcall.S
> index 49a54356ae99..757b0c34be10 100644
> --- a/arch/x86/virt/vmx/tdx/tdxcall.S
> +++ b/arch/x86/virt/vmx/tdx/tdxcall.S
> @@ -1,6 +1,7 @@
> /* SPDX-License-Identifier: GPL-2.0 */
> #include <asm/asm-offsets.h>
> #include <asm/tdx.h>
> +#include <asm/asm.h>
>
> /*
> * TDCALL and SEAMCALL are supported in Binutils >= 2.36.
> @@ -45,6 +46,7 @@
> /* Leave input param 2 in RDX */
>
> .if \host
> +1:
> seamcall
> /*
> * SEAMCALL instruction is essentially a VMExit from VMX root
> @@ -57,10 +59,23 @@
> * This value will never be used as actual SEAMCALL error code as
> * it is from the Reserved status code class.
> */
> - jnc .Lno_vmfailinvalid
> + jnc .Lseamcall_out
> mov $TDX_SEAMCALL_VMFAILINVALID, %rax
> -.Lno_vmfailinvalid:
> + jmp .Lseamcall_out
> +2:
> + /*
> + * SEAMCALL caused #GP or #UD. By reaching here %eax contains
> + * the trap number. Convert the trap number to the TDX error
> + * code by setting TDX_SW_ERROR to the high 32-bits of %rax.
> + *
> + * Note cannot OR TDX_SW_ERROR directly to %rax as OR instruction
> + * only accepts 32-bit immediate at most.
> + */
> + mov $TDX_SW_ERROR, %r12
> + orq %r12, %rax
>
> + _ASM_EXTABLE_FAULT(1b, 2b)
> +.Lseamcall_out:
> .else
> tdcall
> .endif
> --
> 2.40.1
>
Reviewed-by: Isaku Yamahata <isaku.yamahata@intel.com>
--
Isaku Yamahata <isaku.yamahata@gmail.com>
* Re: [PATCH v11 06/20] x86/virt/tdx: Handle SEAMCALL running out of entropy error
[not found] ` <9b3582c9f3a81ae68b32d9997fcd20baecb63b9b.1685887183.git.kai.huang@intel.com>
@ 2023-06-07 8:19 ` Isaku Yamahata
2023-06-07 15:08 ` Dave Hansen
` (3 subsequent siblings)
4 siblings, 0 replies; 144+ messages in thread
From: Isaku Yamahata @ 2023-06-07 8:19 UTC (permalink / raw)
To: Kai Huang
Cc: linux-kernel, kvm, linux-mm, dave.hansen, kirill.shutemov,
tony.luck, peterz, tglx, seanjc, pbonzini, david, dan.j.williams,
rafael.j.wysocki, ying.huang, reinette.chatre, len.brown, ak,
isaku.yamahata, chao.gao, sathyanarayanan.kuppuswamy, bagasdotme,
sagis, imammedo, isaku.yamahata
On Mon, Jun 05, 2023 at 02:27:19AM +1200,
Kai Huang <kai.huang@intel.com> wrote:
> Certain SEAMCALL leaf functions may return error due to running out of
> entropy, in which case the SEAMCALL should be retried as suggested by
> the TDX spec.
>
> Handle this case in SEAMCALL common function. Mimic the existing
> rdrand_long() to retry RDRAND_RETRY_LOOPS times.
>
> Signed-off-by: Kai Huang <kai.huang@intel.com>
> ---
>
> v10 -> v11:
> - New patch
>
> ---
> arch/x86/virt/vmx/tdx/tdx.c | 15 ++++++++++++++-
> arch/x86/virt/vmx/tdx/tdx.h | 17 +++++++++++++++++
> 2 files changed, 31 insertions(+), 1 deletion(-)
>
> diff --git a/arch/x86/virt/vmx/tdx/tdx.c b/arch/x86/virt/vmx/tdx/tdx.c
> index e82713dd5d54..e62e978eba1b 100644
> --- a/arch/x86/virt/vmx/tdx/tdx.c
> +++ b/arch/x86/virt/vmx/tdx/tdx.c
> @@ -15,6 +15,7 @@
> #include <linux/smp.h>
> #include <asm/msr-index.h>
> #include <asm/msr.h>
> +#include <asm/archrandom.h>
> #include <asm/tdx.h>
> #include "tdx.h"
>
> @@ -33,12 +34,24 @@ static int __always_unused seamcall(u64 fn, u64 rcx, u64 rdx, u64 r8, u64 r9,
> struct tdx_module_output *out)
> {
> int cpu, ret = 0;
> + int retry;
> u64 sret;
>
> /* Need a stable CPU id for printing error message */
> cpu = get_cpu();
>
> - sret = __seamcall(fn, rcx, rdx, r8, r9, out);
> + /*
> + * Certain SEAMCALL leaf functions may return error due to
> + * running out of entropy, in which case the SEAMCALL should
> + * be retried. Handle this in SEAMCALL common function.
> + *
> + * Mimic the existing rdrand_long() to retry
> + * RDRAND_RETRY_LOOPS times.
> + */
> + retry = RDRAND_RETRY_LOOPS;
> + do {
> + sret = __seamcall(fn, rcx, rdx, r8, r9, out);
> + } while (sret == TDX_RND_NO_ENTROPY && --retry);
>
> /* Save SEAMCALL return code if the caller wants it */
> if (seamcall_ret)
> diff --git a/arch/x86/virt/vmx/tdx/tdx.h b/arch/x86/virt/vmx/tdx/tdx.h
> index 48ad1a1ba737..55dbb1b8c971 100644
> --- a/arch/x86/virt/vmx/tdx/tdx.h
> +++ b/arch/x86/virt/vmx/tdx/tdx.h
> @@ -4,6 +4,23 @@
>
> #include <linux/types.h>
>
> +/*
> + * This file contains both macros and data structures defined by the TDX
> + * architecture and Linux defined software data structures and functions.
> + * The two should not be mixed together for better readability. The
> + * architectural definitions come first.
> + */
> +
> +/*
> + * TDX SEAMCALL error codes
> + */
> +#define TDX_RND_NO_ENTROPY 0x8000020300000000ULL
> +
> +/*
> + * Do not put any hardware-defined TDX structure representations below
> + * this comment!
> + */
> +
> struct tdx_module_output;
> u64 __seamcall(u64 fn, u64 rcx, u64 rdx, u64 r8, u64 r9,
> struct tdx_module_output *out);
> --
> 2.40.1
>
Reviewed-by: Isaku Yamahata <isaku.yamahata@intel.com>
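
The bounded retry in the quoted hunk can be exercised outside the kernel with
a small sketch. The retry count of 10 is assumed to match the kernel's
RDRAND_RETRY_LOOPS; seamcall_retry(), the callback type and fake_seamcall()
are hypothetical helpers standing in for the real __seamcall() path.

```c
#include <stdint.h>

#define TDX_RND_NO_ENTROPY	0x8000020300000000ULL	/* value from the patch */
#define RDRAND_RETRY_LOOPS	10	/* assumed to match the kernel's value */

/* Hypothetical callback standing in for __seamcall(). */
typedef uint64_t (*seamcall_fn)(void *ctx);

/*
 * Call @fn until it returns something other than TDX_RND_NO_ENTROPY,
 * giving up after RDRAND_RETRY_LOOPS attempts. @attempts reports how
 * many calls were actually made.
 */
static uint64_t seamcall_retry(seamcall_fn fn, void *ctx, int *attempts)
{
	int retry = RDRAND_RETRY_LOOPS;
	uint64_t sret;

	*attempts = 0;
	do {
		sret = fn(ctx);
		(*attempts)++;
	} while (sret == TDX_RND_NO_ENTROPY && --retry);

	return sret;
}

/*
 * Test stub: reports "no entropy" for the first *ctx calls, then
 * succeeds.
 */
static uint64_t fake_seamcall(void *ctx)
{
	int *remaining = ctx;

	if (*remaining > 0) {
		(*remaining)--;
		return TDX_RND_NO_ENTROPY;
	}
	return 0;
}
```

Note that the loop makes at most RDRAND_RETRY_LOOPS calls in total, and the
last TDX_RND_NO_ENTROPY is passed back to the caller if entropy never
recovers.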
--
Isaku Yamahata <isaku.yamahata@gmail.com>
* Re: [PATCH v11 04/20] x86/cpu: Detect TDX partial write machine check erratum
[not found] ` <86f2a8814240f4bbe850f6a09fc9d0b934979d1b.1685887183.git.kai.huang@intel.com>
[not found] ` <20230606123821.exit7gyxs42dxotz@box.shutemov.name>
@ 2023-06-07 14:15 ` Dave Hansen
2023-06-07 22:43 ` Huang, Kai
2023-06-19 12:21 ` David Hildenbrand
2 siblings, 1 reply; 144+ messages in thread
From: Dave Hansen @ 2023-06-07 14:15 UTC (permalink / raw)
To: Kai Huang, linux-kernel, kvm
Cc: linux-mm, kirill.shutemov, tony.luck, peterz, tglx, seanjc,
pbonzini, david, dan.j.williams, rafael.j.wysocki, ying.huang,
reinette.chatre, len.brown, ak, isaku.yamahata, chao.gao,
sathyanarayanan.kuppuswamy, bagasdotme, sagis, imammedo
On 6/4/23 07:27, Kai Huang wrote:
> TDX memory has integrity and confidentiality protections. Violations of
> this integrity protection are supposed to only affect TDX operations and
> are never supposed to affect the host kernel itself. In other words,
> the host kernel should never, itself, see machine checks induced by the
> TDX integrity hardware.
At the risk of patting myself on the back by acking a changelog that I
wrote 95% of:
Reviewed-by: Dave Hansen <dave.hansen@linux.intel.com>
* Re: [PATCH v11 05/20] x86/virt/tdx: Add SEAMCALL infrastructure
[not found] ` <ec640452a4385d61bec97f8b761ed1ff38898504.1685887183.git.kai.huang@intel.com>
2023-06-06 23:55 ` [PATCH v11 05/20] x86/virt/tdx: Add SEAMCALL infrastructure Isaku Yamahata
@ 2023-06-07 14:24 ` Dave Hansen
2023-06-07 18:53 ` Isaku Yamahata
2023-06-07 22:56 ` Huang, Kai
2023-06-19 12:52 ` David Hildenbrand
2 siblings, 2 replies; 144+ messages in thread
From: Dave Hansen @ 2023-06-07 14:24 UTC (permalink / raw)
To: Kai Huang, linux-kernel, kvm
Cc: linux-mm, kirill.shutemov, tony.luck, peterz, tglx, seanjc,
pbonzini, david, dan.j.williams, rafael.j.wysocki, ying.huang,
reinette.chatre, len.brown, ak, isaku.yamahata, chao.gao,
sathyanarayanan.kuppuswamy, bagasdotme, sagis, imammedo
On 6/4/23 07:27, Kai Huang wrote:
> TDX introduces a new CPU mode: Secure Arbitration Mode (SEAM). This
> mode runs only the TDX module itself or other code to load the TDX
> module.
>
> The host kernel communicates with SEAM software via a new SEAMCALL
> instruction. This is conceptually similar to a guest->host hypercall,
> except it is made from the host to SEAM software instead. The TDX
> module establishes a new SEAMCALL ABI which allows the host to
> initialize the module and to manage VMs.
>
> Add infrastructure to make SEAMCALLs. The SEAMCALL ABI is very similar
> to the TDCALL ABI and leverages much TDCALL infrastructure.
>
> SEAMCALL instruction causes #GP when TDX isn't BIOS enabled, and #UD
> when CPU is not in VMX operation. Currently, only KVM code mocks with
"mocks"? Did you mean "mucks"?
> VMX enabling, and KVM is the only user of TDX. This implementation
> chooses to make KVM itself responsible for enabling VMX before using
> TDX and let the rest of the kernel stay blissfully unaware of VMX.
>
> The current TDX_MODULE_CALL macro handles neither #GP nor #UD. The
> kernel would hit Oops if SEAMCALL were mistakenly made w/o enabling VMX
> first. Architecturally, there is no CPU flag to check whether the CPU
> is in VMX operation. Also, if a BIOS were buggy, it could still report
> valid TDX private KeyIDs when TDX actually couldn't be enabled.
I'm not sure this is a great justification. If the BIOS is lying to the
OS, we _should_ oops.
How else can this happen other than through silly kernel bugs? It's OK to
oops in the face of silly kernel bugs.
...
> diff --git a/arch/x86/include/asm/tdx.h b/arch/x86/include/asm/tdx.h
> index 4dfe2e794411..b489b5b9de5d 100644
> --- a/arch/x86/include/asm/tdx.h
> +++ b/arch/x86/include/asm/tdx.h
> @@ -8,6 +8,8 @@
> #include <asm/ptrace.h>
> #include <asm/shared/tdx.h>
>
> +#include <asm/trapnr.h>
> +
> /*
> * SW-defined error codes.
> *
> @@ -18,6 +20,9 @@
> #define TDX_SW_ERROR (TDX_ERROR | GENMASK_ULL(47, 40))
> #define TDX_SEAMCALL_VMFAILINVALID (TDX_SW_ERROR | _UL(0xFFFF0000))
>
> +#define TDX_SEAMCALL_GP (TDX_SW_ERROR | X86_TRAP_GP)
> +#define TDX_SEAMCALL_UD (TDX_SW_ERROR | X86_TRAP_UD)
> +
> #ifndef __ASSEMBLY__
>
> /* TDX supported page sizes from the TDX module ABI. */
> diff --git a/arch/x86/virt/vmx/tdx/Makefile b/arch/x86/virt/vmx/tdx/Makefile
> index 93ca8b73e1f1..38d534f2c113 100644
> --- a/arch/x86/virt/vmx/tdx/Makefile
> +++ b/arch/x86/virt/vmx/tdx/Makefile
> @@ -1,2 +1,2 @@
> # SPDX-License-Identifier: GPL-2.0-only
> -obj-y += tdx.o
> +obj-y += tdx.o seamcall.o
> diff --git a/arch/x86/virt/vmx/tdx/seamcall.S b/arch/x86/virt/vmx/tdx/seamcall.S
> new file mode 100644
> index 000000000000..f81be6b9c133
> --- /dev/null
> +++ b/arch/x86/virt/vmx/tdx/seamcall.S
> @@ -0,0 +1,52 @@
> +/* SPDX-License-Identifier: GPL-2.0 */
> +#include <linux/linkage.h>
> +#include <asm/frame.h>
> +
> +#include "tdxcall.S"
> +
> +/*
> + * __seamcall() - Host-side interface functions to SEAM software module
> + * (the P-SEAMLDR or the TDX module).
> + *
> + * Transform function call register arguments into the SEAMCALL register
> + * ABI. Return TDX_SEAMCALL_VMFAILINVALID if the SEAMCALL itself fails,
> + * or the completion status of the SEAMCALL leaf function. Additional
> + * output operands are saved in @out (if it is provided by the caller).
> + *
> + *-------------------------------------------------------------------------
> + * SEAMCALL ABI:
> + *-------------------------------------------------------------------------
> + * Input Registers:
> + *
> + * RAX - SEAMCALL Leaf number.
> + * RCX,RDX,R8-R9 - SEAMCALL Leaf specific input registers.
> + *
> + * Output Registers:
> + *
> + * RAX - SEAMCALL completion status code.
> + * RCX,RDX,R8-R11 - SEAMCALL Leaf specific output registers.
> + *
> + *-------------------------------------------------------------------------
> + *
> + * __seamcall() function ABI:
> + *
> + * @fn (RDI) - SEAMCALL Leaf number, moved to RAX
> + * @rcx (RSI) - Input parameter 1, moved to RCX
> + * @rdx (RDX) - Input parameter 2, moved to RDX
> + * @r8 (RCX) - Input parameter 3, moved to R8
> + * @r9 (R8) - Input parameter 4, moved to R9
> + *
> + * @out (R9) - struct tdx_module_output pointer
> + * stored temporarily in R12 (not
> + * used by the P-SEAMLDR or the TDX
> + * module). It can be NULL.
> + *
> + * Return (via RAX) the completion status of the SEAMCALL, or
> + * TDX_SEAMCALL_VMFAILINVALID.
> + */
> +SYM_FUNC_START(__seamcall)
> + FRAME_BEGIN
> + TDX_MODULE_CALL host=1
> + FRAME_END
> + RET
> +SYM_FUNC_END(__seamcall)
> diff --git a/arch/x86/virt/vmx/tdx/tdx.c b/arch/x86/virt/vmx/tdx/tdx.c
> index 2d91e7120c90..e82713dd5d54 100644
> --- a/arch/x86/virt/vmx/tdx/tdx.c
> +++ b/arch/x86/virt/vmx/tdx/tdx.c
> @@ -12,14 +12,70 @@
> #include <linux/init.h>
> #include <linux/errno.h>
> #include <linux/printk.h>
> +#include <linux/smp.h>
> #include <asm/msr-index.h>
> #include <asm/msr.h>
> #include <asm/tdx.h>
> +#include "tdx.h"
>
> static u32 tdx_global_keyid __ro_after_init;
> static u32 tdx_guest_keyid_start __ro_after_init;
> static u32 tdx_nr_guest_keyids __ro_after_init;
>
> +/*
> + * Wrapper of __seamcall() to convert SEAMCALL leaf function error code
> + * to kernel error code. @seamcall_ret and @out contain the SEAMCALL
> + * leaf function return code and the additional output respectively if
> + * not NULL.
> + */
> +static int __always_unused seamcall(u64 fn, u64 rcx, u64 rdx, u64 r8, u64 r9,
> + u64 *seamcall_ret,
> + struct tdx_module_output *out)
> +{
> + int cpu, ret = 0;
> + u64 sret;
> +
> + /* Need a stable CPU id for printing error message */
> + cpu = get_cpu();
> +
> + sret = __seamcall(fn, rcx, rdx, r8, r9, out);
> +
> + /* Save SEAMCALL return code if the caller wants it */
> + if (seamcall_ret)
> + *seamcall_ret = sret;
> +
> + /* SEAMCALL was successful */
> + if (!sret)
> + goto out;
> +
> + switch (sret) {
> + case TDX_SEAMCALL_GP:
> + pr_err_once("[firmware bug]: TDX is not enabled by BIOS.\n");
> + ret = -ENODEV;
> + break;
> + case TDX_SEAMCALL_VMFAILINVALID:
> + pr_err_once("TDX module is not loaded.\n");
> + ret = -ENODEV;
> + break;
> + case TDX_SEAMCALL_UD:
> + pr_err_once("SEAMCALL failed: CPU %d is not in VMX operation.\n",
> + cpu);
> + ret = -EINVAL;
> + break;
> + default:
> + pr_err_once("SEAMCALL failed: CPU %d: leaf %llu, error 0x%llx.\n",
> + cpu, fn, sret);
> + if (out)
> + pr_err_once("additional output: rcx 0x%llx, rdx 0x%llx, r8 0x%llx, r9 0x%llx, r10 0x%llx, r11 0x%llx.\n",
> + out->rcx, out->rdx, out->r8,
> + out->r9, out->r10, out->r11);
> + ret = -EIO;
> + }
> +out:
> + put_cpu();
> + return ret;
> +}
This fails to distinguish two very different things:
* A SEAMCALL error
and
* A SEAMCALL *failure*
"Errors" are normal. Hypercalls can return errors and so can SEAMCALLs.
No biggie.
But SEAMCALL failures are another matter. They mean that something
really fundamental is *BROKEN*.
Just saying "SEAMCALL was successful" is a bit ambiguous for me.
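For illustration only, the error/failure split could look roughly like this in plain C. Everything named here is an assumption made for the sketch: the enum, seamcall_classify() and the numeric values of the TDX_* constants are hypothetical (the patch only says that TDX_SW_ERROR lands in the high 32 bits of RAX with the trap number below it). This is not the patch's code.

```c
#include <stdint.h>

/* Illustrative values only (assumptions); the patch's headers are authoritative. */
#define TDX_SW_ERROR                0x8000FF0000000000ULL
#define X86_TRAP_UD                 6ULL
#define X86_TRAP_GP                 13ULL
#define TDX_SEAMCALL_UD             (TDX_SW_ERROR | X86_TRAP_UD)
#define TDX_SEAMCALL_GP             (TDX_SW_ERROR | X86_TRAP_GP)
#define TDX_SEAMCALL_VMFAILINVALID  (TDX_SW_ERROR | 0xFFFF0000ULL)

/* Hypothetical three-way outcome, not part of the patch. */
enum seamcall_outcome {
	SEAMCALL_OK,       /* leaf completed successfully */
	SEAMCALL_ERROR,    /* leaf ran and returned an error: normal */
	SEAMCALL_FAILURE,  /* the SEAMCALL itself could not run: something is broken */
};

static enum seamcall_outcome seamcall_classify(uint64_t sret)
{
	if (!sret)
		return SEAMCALL_OK;

	switch (sret) {
	case TDX_SEAMCALL_GP:             /* #GP: TDX not enabled by BIOS */
	case TDX_SEAMCALL_UD:             /* #UD: CPU not in VMX operation */
	case TDX_SEAMCALL_VMFAILINVALID:  /* TDX module not loaded */
		return SEAMCALL_FAILURE;
	default:
		return SEAMCALL_ERROR;
	}
}
```

A wrapper built on something like this could stay quiet for SEAMCALL_ERROR and leave the handling to the caller, keeping the loud diagnostics for SEAMCALL_FAILURE.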
> diff --git a/arch/x86/virt/vmx/tdx/tdx.h b/arch/x86/virt/vmx/tdx/tdx.h
> new file mode 100644
> index 000000000000..48ad1a1ba737
> --- /dev/null
> +++ b/arch/x86/virt/vmx/tdx/tdx.h
> @@ -0,0 +1,10 @@
> +/* SPDX-License-Identifier: GPL-2.0 */
> +#ifndef _X86_VIRT_TDX_H
> +#define _X86_VIRT_TDX_H
> +
> +#include <linux/types.h>
> +
> +struct tdx_module_output;
> +u64 __seamcall(u64 fn, u64 rcx, u64 rdx, u64 r8, u64 r9,
> + struct tdx_module_output *out);
> +#endif
> diff --git a/arch/x86/virt/vmx/tdx/tdxcall.S b/arch/x86/virt/vmx/tdx/tdxcall.S
> index 49a54356ae99..757b0c34be10 100644
> --- a/arch/x86/virt/vmx/tdx/tdxcall.S
> +++ b/arch/x86/virt/vmx/tdx/tdxcall.S
> @@ -1,6 +1,7 @@
> /* SPDX-License-Identifier: GPL-2.0 */
> #include <asm/asm-offsets.h>
> #include <asm/tdx.h>
> +#include <asm/asm.h>
>
> /*
> * TDCALL and SEAMCALL are supported in Binutils >= 2.36.
> @@ -45,6 +46,7 @@
> /* Leave input param 2 in RDX */
>
> .if \host
> +1:
> seamcall
> /*
> * SEAMCALL instruction is essentially a VMExit from VMX root
> @@ -57,10 +59,23 @@
> * This value will never be used as actual SEAMCALL error code as
> * it is from the Reserved status code class.
> */
> - jnc .Lno_vmfailinvalid
> + jnc .Lseamcall_out
> mov $TDX_SEAMCALL_VMFAILINVALID, %rax
> -.Lno_vmfailinvalid:
> + jmp .Lseamcall_out
> +2:
> + /*
> + * SEAMCALL caused #GP or #UD. By reaching here %eax contains
> + * the trap number. Convert the trap number to the TDX error
> + * code by setting TDX_SW_ERROR to the high 32-bits of %rax.
> + *
> + * Note cannot OR TDX_SW_ERROR directly to %rax as OR instruction
> + * only accepts 32-bit immediate at most.
> + */
> + mov $TDX_SW_ERROR, %r12
> + orq %r12, %rax
I think the justification for doing the #UD/#GP handling is a bit weak.
In the end, it gets us a nicer error message. Is that error message
*REALLY* needed? Or is an oops OK in the very rare circumstance that
the BIOS is totally buggy?
^ permalink raw reply [flat|nested] 144+ messages in thread
* Re: [PATCH v11 04/20] x86/cpu: Detect TDX partial write machine check erratum
2023-06-06 22:58 ` [PATCH v11 04/20] x86/cpu: Detect TDX partial write machine check erratum Huang, Kai
@ 2023-06-07 15:06 ` kirill.shutemov
0 siblings, 0 replies; 144+ messages in thread
From: kirill.shutemov @ 2023-06-07 15:06 UTC (permalink / raw)
To: Huang, Kai
Cc: kvm, Hansen, Dave, david, bagasdotme, ak, Wysocki, Rafael J,
linux-kernel, Chatre, Reinette, Christopherson,,
Sean, pbonzini, tglx, linux-mm, Yamahata, Isaku, Luck, Tony,
peterz, Shahar, Sagi, imammedo, Gao, Chao, Brown, Len,
sathyanarayanan.kuppuswamy, Huang, Ying, Williams, Dan J
On Tue, Jun 06, 2023 at 10:58:04PM +0000, Huang, Kai wrote:
> So it seems there's no hard rule that all bugs need to be done in
> cpu_set_bug_bits().
Yes, CPU identification is a mess, but an initcall makes it worse. An initcall
is a lazy way out that contributes to the mess. Maybe cpu_set_bug_bits() is the
wrong place; if so, find the right one.
--
Kiryl Shutsemau / Kirill A. Shutemov
^ permalink raw reply [flat|nested] 144+ messages in thread
* Re: [PATCH v11 06/20] x86/virt/tdx: Handle SEAMCALL running out of entropy error
[not found] ` <9b3582c9f3a81ae68b32d9997fcd20baecb63b9b.1685887183.git.kai.huang@intel.com>
2023-06-07 8:19 ` [PATCH v11 06/20] x86/virt/tdx: Handle SEAMCALL running out of entropy error Isaku Yamahata
@ 2023-06-07 15:08 ` Dave Hansen
2023-06-07 23:36 ` Huang, Kai
2023-06-08 0:08 ` kirill.shutemov
` (2 subsequent siblings)
4 siblings, 1 reply; 144+ messages in thread
From: Dave Hansen @ 2023-06-07 15:08 UTC (permalink / raw)
To: Kai Huang, linux-kernel, kvm
Cc: linux-mm, kirill.shutemov, tony.luck, peterz, tglx, seanjc,
pbonzini, david, dan.j.williams, rafael.j.wysocki, ying.huang,
reinette.chatre, len.brown, ak, isaku.yamahata, chao.gao,
sathyanarayanan.kuppuswamy, bagasdotme, sagis, imammedo
On 6/4/23 07:27, Kai Huang wrote:
> Certain SEAMCALL leaf functions may return error due to running out of
> entropy, in which case the SEAMCALL should be retried as suggested by
> the TDX spec.
>
> Handle this case in SEAMCALL common function. Mimic the existing
> rdrand_long() to retry RDRAND_RETRY_LOOPS times.
... because who are we kidding? When the TDX module says it doesn't
have enough entropy it means rdrand.
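A rough user-space sketch of the retry shape described in the changelog, for illustration. Assumptions: the value of RDRAND_RETRY_LOOPS, the value of TDX_RND_NO_ENTROPY, the callback type and the flaky_seamcall() stand-in are all made up for this example; only the idea of retrying a bounded number of times comes from the changelog.

```c
#include <stdint.h>

/* Assumed values for illustration; check the patch and the TDX spec. */
#define RDRAND_RETRY_LOOPS   10
#define TDX_RND_NO_ENTROPY   0x8000020300000000ULL

/* Hypothetical callback type standing in for one SEAMCALL invocation. */
typedef uint64_t (*seamcall_fn_t)(void *ctx);

/* Retry while the callee reports "no entropy", at most RDRAND_RETRY_LOOPS calls. */
static uint64_t seamcall_retry(seamcall_fn_t fn, void *ctx)
{
	uint64_t sret;
	int retry = RDRAND_RETRY_LOOPS;

	do {
		sret = fn(ctx);
	} while (sret == TDX_RND_NO_ENTROPY && --retry);

	return sret;
}

/* Stand-in used only to exercise the loop: fails a fixed number of times. */
struct flaky_ctx {
	int failures_left;
	int calls;
};

static uint64_t flaky_seamcall(void *ctx)
{
	struct flaky_ctx *c = ctx;

	c->calls++;
	if (c->failures_left > 0) {
		c->failures_left--;
		return TDX_RND_NO_ENTROPY;
	}
	return 0;
}
```

Any status other than the entropy one, success included, ends the loop immediately, so ordinary errors are not retried.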
Reviewed-by: Dave Hansen <dave.hansen@linux.intel.com>
^ permalink raw reply [flat|nested] 144+ messages in thread
* Re: [PATCH v11 07/20] x86/virt/tdx: Add skeleton to enable TDX on demand
[not found] ` <21b3a45cb73b4e1917c1eba75b7769781a15aa14.1685887183.git.kai.huang@intel.com>
@ 2023-06-07 15:22 ` Dave Hansen
2023-06-08 2:10 ` Huang, Kai
2023-06-19 13:16 ` David Hildenbrand
1 sibling, 1 reply; 144+ messages in thread
From: Dave Hansen @ 2023-06-07 15:22 UTC (permalink / raw)
To: Kai Huang, linux-kernel, kvm
Cc: linux-mm, kirill.shutemov, tony.luck, peterz, tglx, seanjc,
pbonzini, david, dan.j.williams, rafael.j.wysocki, ying.huang,
reinette.chatre, len.brown, ak, isaku.yamahata, chao.gao,
sathyanarayanan.kuppuswamy, bagasdotme, sagis, imammedo
On 6/4/23 07:27, Kai Huang wrote:
...
> +static int try_init_module_global(void)
> +{
> + unsigned long flags;
> + int ret;
> +
> + /*
> + * The TDX module global initialization only needs to be done
> + * once on any cpu.
> + */
> + raw_spin_lock_irqsave(&tdx_global_init_lock, flags);
Why is this "raw_"?
There's zero mention of it anywhere.
> + if (tdx_global_init_status & TDX_GLOBAL_INIT_DONE) {
> + ret = tdx_global_init_status & TDX_GLOBAL_INIT_FAILED ?
> + -EINVAL : 0;
> + goto out;
> + }
> +
> + /* All '0's are just unused parameters. */
> + ret = seamcall(TDH_SYS_INIT, 0, 0, 0, 0, NULL, NULL);
> +
> + tdx_global_init_status = TDX_GLOBAL_INIT_DONE;
> + if (ret)
> + tdx_global_init_status |= TDX_GLOBAL_INIT_FAILED;
> +out:
> + raw_spin_unlock_irqrestore(&tdx_global_init_lock, flags);
> +
> + return ret;
> +}
> +
> +/**
> + * tdx_cpu_enable - Enable TDX on local cpu
> + *
> + * Do one-time TDX module per-cpu initialization SEAMCALL (and TDX module
> + * global initialization SEAMCALL if not done) on local cpu to make this
> + * cpu be ready to run any other SEAMCALLs.
> + *
> + * Note this function must be called when preemption is not possible
> + * (i.e. via SMP call or in per-cpu thread). It is not IRQ safe either
> + * (i.e. cannot be called in per-cpu thread and via SMP call from remote
> + * cpu simultaneously).
lockdep_assert_*() are your friends. Unlike comments, they will
actually tell you if this goes wrong.
> +int tdx_cpu_enable(void)
> +{
> + unsigned int lp_status;
> + int ret;
> +
> + if (!platform_tdx_enabled())
> + return -EINVAL;
> +
> + lp_status = __this_cpu_read(tdx_lp_init_status);
> +
> + /* Already done */
> + if (lp_status & TDX_LP_INIT_DONE)
> + return lp_status & TDX_LP_INIT_FAILED ? -EINVAL : 0;
> +
> + /*
> + * The TDX module global initialization is the very first step
> + * to enable TDX. Need to do it first (if hasn't been done)
> + * before doing the per-cpu initialization.
> + */
> + ret = try_init_module_global();
> +
> + /*
> + * If the module global initialization failed, there's no point
> + * to do the per-cpu initialization. Just mark it as done but
> + * failed.
> + */
> + if (ret)
> + goto update_status;
> +
> + /* All '0's are just unused parameters */
> + ret = seamcall(TDH_SYS_LP_INIT, 0, 0, 0, 0, NULL, NULL);
> +
> +update_status:
> + lp_status = TDX_LP_INIT_DONE;
> + if (ret)
> + lp_status |= TDX_LP_INIT_FAILED;
> +
> + this_cpu_write(tdx_lp_init_status, lp_status);
> +
> + return ret;
> +}
> +EXPORT_SYMBOL_GPL(tdx_cpu_enable);
You danced around it in the changelog, but the reason for the exports is
not clear.
> +static int init_tdx_module(void)
> +{
> + /*
> + * TODO:
> + *
> + * - Get TDX module information and TDX-capable memory regions.
> + * - Build the list of TDX-usable memory regions.
> + * - Construct a list of "TD Memory Regions" (TDMRs) to cover
> + * all TDX-usable memory regions.
> + * - Configure the TDMRs and the global KeyID to the TDX module.
> + * - Configure the global KeyID on all packages.
> + * - Initialize all TDMRs.
> + *
> + * Return error before all steps are done.
> + */
> + return -EINVAL;
> +}
> +
> +static int __tdx_enable(void)
> +{
> + int ret;
> +
> + ret = init_tdx_module();
> + if (ret) {
> + pr_err("TDX module initialization failed (%d)\n", ret);
Have you actually gone and looked at how these pr_*()s look?
Won't they say:
tdx: TDX module initialized
Isn't that a _bit_ silly? Why not just say:
pr_info("module initialized.\n");
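To make the duplication concrete, here is a small user-space imitation of the prefixing. The "tdx: " prefix is taken from the example output above; pr_info_to() is a hypothetical stand-in that formats into a buffer instead of the kernel log, and the macros only mimic the kernel ones.

```c
#include <stdio.h>

/* User-space stand-ins for the kernel macros, for illustration only. */
#define pr_fmt(fmt)  "tdx: " fmt

/*
 * Hypothetical helper: formats into 'buf' the way pr_info() would prefix
 * a message. Uses the GNU ##__VA_ARGS__ extension, as the kernel does.
 */
#define pr_info_to(buf, size, fmt, ...) \
	snprintf(buf, size, pr_fmt(fmt), ##__VA_ARGS__)
```

With the prefix supplied by pr_fmt(), a message string that starts with "TDX" again just repeats what the prefix already says.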
^ permalink raw reply [flat|nested] 144+ messages in thread
* Re: [PATCH v11 08/20] x86/virt/tdx: Get information about TDX module and TDX-capable memory
[not found] ` <50386eddbb8046b0b222d385e56e8115ed566526.1685887183.git.kai.huang@intel.com>
@ 2023-06-07 15:25 ` Dave Hansen
2023-06-08 0:27 ` kirill.shutemov
` (2 subsequent siblings)
3 siblings, 0 replies; 144+ messages in thread
From: Dave Hansen @ 2023-06-07 15:25 UTC (permalink / raw)
To: Kai Huang, linux-kernel, kvm
Cc: linux-mm, kirill.shutemov, tony.luck, peterz, tglx, seanjc,
pbonzini, david, dan.j.williams, rafael.j.wysocki, ying.huang,
reinette.chatre, len.brown, ak, isaku.yamahata, chao.gao,
sathyanarayanan.kuppuswamy, bagasdotme, sagis, imammedo
On 6/4/23 07:27, Kai Huang wrote:
> Start to transit out the "multi-steps" to initialize the TDX module.
>
> TDX provides increased levels of memory confidentiality and integrity.
> This requires special hardware support for features like memory
> encryption and storage of memory integrity checksums. Not all memory
> satisfies these requirements.
>
> As a result, TDX introduced the concept of a "Convertible Memory Region"
> (CMR). During boot, the firmware builds a list of all of the memory
> ranges which can provide the TDX security guarantees.
>
> CMRs tell the kernel which memory is TDX compatible. The kernel takes
> CMRs (plus a little more metadata) and constructs "TD Memory Regions"
> (TDMRs). TDMRs let the kernel grant TDX protections to some or all of
> the CMR areas.
>
> The TDX module also reports necessary information to let the kernel
> build TDMRs and run TDX guests in structure 'tdsysinfo_struct'. The
> list of CMRs, along with the TDX module information, is available to
> the kernel by querying the TDX module.
>
> As a preparation to construct TDMRs, get the TDX module information and
> the list of CMRs. Print out CMRs to help user to decode which memory
> regions are TDX convertible.
>
> The 'tdsysinfo_struct' is fairly large (1024 bytes) and contains a lot
> of info about the TDX module. Fully define the entire structure, but
> only use the fields necessary to build the TDMRs and pr_info() some
> basics about the module. The rest of the fields will get used by KVM.
>
> For now both 'tdsysinfo_struct' and CMRs are only used during the module
> initialization. But because they are both relatively big, declare them
> inside the module initialization function but as static variables.
>
> Signed-off-by: Kai Huang <kai.huang@intel.com>
> Reviewed-by: Isaku Yamahata <isaku.yamahata@intel.com>
Reviewed-by: Dave Hansen <dave.hansen@linux.intel.com>
^ permalink raw reply [flat|nested] 144+ messages in thread
* Re: [PATCH v11 09/20] x86/virt/tdx: Use all system memory when initializing TDX module as TDX memory
[not found] ` <468533166590ff5ed11730350c4af8cdb0b99165.1685887183.git.kai.huang@intel.com>
@ 2023-06-07 15:48 ` Dave Hansen
2023-06-07 23:22 ` Huang, Kai
2023-06-08 22:40 ` kirill.shutemov
1 sibling, 1 reply; 144+ messages in thread
From: Dave Hansen @ 2023-06-07 15:48 UTC (permalink / raw)
To: Kai Huang, linux-kernel, kvm
Cc: linux-mm, kirill.shutemov, tony.luck, peterz, tglx, seanjc,
pbonzini, david, dan.j.williams, rafael.j.wysocki, ying.huang,
reinette.chatre, len.brown, ak, isaku.yamahata, chao.gao,
sathyanarayanan.kuppuswamy, bagasdotme, sagis, imammedo
On 6/4/23 07:27, Kai Huang wrote:
> As a step of initializing the TDX module, the kernel needs to tell the
> TDX module which memory regions can be used by the TDX module as TDX
> guest memory.
...
> Signed-off-by: Kai Huang <kai.huang@intel.com>
> Reviewed-by: "Huang, Ying" <ying.huang@intel.com>
> Reviewed-by: Isaku Yamahata <isaku.yamahata@intel.com>
This is rather short on reviews from folks who do a lot of memory
hotplug work. Partly because I don't see any of them cc'd.
Can you wrangle some mm reviews on this, please?
For the x86 side (and <sigh> because this patch probably took two years
to coalesce <double sigh>):
Reviewed-by: Dave Hansen <dave.hansen@linux.intel.com>
^ permalink raw reply [flat|nested] 144+ messages in thread
* Re: [PATCH v11 10/20] x86/virt/tdx: Add placeholder to construct TDMRs to cover all TDX memory regions
[not found] ` <f9148e67e968d7aed4707b67ea9b1aa761401255.1685887183.git.kai.huang@intel.com>
@ 2023-06-07 15:54 ` Dave Hansen
2023-06-07 15:57 ` Dave Hansen
2023-06-08 22:52 ` kirill.shutemov
2 siblings, 0 replies; 144+ messages in thread
From: Dave Hansen @ 2023-06-07 15:54 UTC (permalink / raw)
To: Kai Huang, linux-kernel, kvm
Cc: linux-mm, kirill.shutemov, tony.luck, peterz, tglx, seanjc,
pbonzini, david, dan.j.williams, rafael.j.wysocki, ying.huang,
reinette.chatre, len.brown, ak, isaku.yamahata, chao.gao,
sathyanarayanan.kuppuswamy, bagasdotme, sagis, imammedo
On 6/4/23 07:27, Kai Huang wrote:
> Constructing the list of TDMRs consists below steps:
<grumble> <grumble> grammar <grumble> <grumble>
Reviewed-by: Dave Hansen <dave.hansen@linux.intel.com>
^ permalink raw reply [flat|nested] 144+ messages in thread
* Re: [PATCH v11 10/20] x86/virt/tdx: Add placeholder to construct TDMRs to cover all TDX memory regions
[not found] ` <f9148e67e968d7aed4707b67ea9b1aa761401255.1685887183.git.kai.huang@intel.com>
2023-06-07 15:54 ` [PATCH v11 10/20] x86/virt/tdx: Add placeholder to construct TDMRs to cover all TDX memory regions Dave Hansen
@ 2023-06-07 15:57 ` Dave Hansen
2023-06-08 10:18 ` Huang, Kai
2023-06-08 22:52 ` kirill.shutemov
2 siblings, 1 reply; 144+ messages in thread
From: Dave Hansen @ 2023-06-07 15:57 UTC (permalink / raw)
To: Kai Huang, linux-kernel, kvm
Cc: linux-mm, kirill.shutemov, tony.luck, peterz, tglx, seanjc,
pbonzini, david, dan.j.williams, rafael.j.wysocki, ying.huang,
reinette.chatre, len.brown, ak, isaku.yamahata, chao.gao,
sathyanarayanan.kuppuswamy, bagasdotme, sagis, imammedo
On 6/4/23 07:27, Kai Huang wrote:
> +struct tdmr_info_list {
> + void *tdmrs; /* Flexible array to hold 'tdmr_info's */
> + int nr_consumed_tdmrs; /* How many 'tdmr_info's are in use */
I'm looking back here after seeing the weird cast in the next patch.
Why is this a void* instead of a _real_ type?
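One way to keep the cast out of the callers, sketched under assumptions: if the per-entry size is only known at runtime (the quoted hunk does not say why void * was chosen, so this is a guess), a typed accessor can hide the stride arithmetic. The tdmr_sz and max_tdmrs fields, the trimmed-down struct tdmr_info and the tdmr_entry() helper are hypothetical names for this sketch.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical, trimmed-down element; the real 'tdmr_info' has more fields. */
struct tdmr_info {
	uint64_t base;
	uint64_t size;
};

struct tdmr_info_list {
	void *tdmrs;            /* raw buffer of entries, each 'tdmr_sz' bytes */
	int nr_consumed_tdmrs;  /* how many entries are in use */
	int tdmr_sz;            /* assumed: per-entry size known only at runtime */
	int max_tdmrs;          /* assumed: capacity of the buffer, in entries */
};

/* Typed accessor so callers never open-code the cast or the stride. */
static struct tdmr_info *tdmr_entry(struct tdmr_info_list *list, int idx)
{
	if (idx < 0 || idx >= list->max_tdmrs)
		return NULL;

	return (struct tdmr_info *)((char *)list->tdmrs +
				    (size_t)idx * list->tdmr_sz);
}
```

If the entry size turns out to be a compile-time constant, a plain 'struct tdmr_info *' member would be simpler and the helper would not be needed.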
^ permalink raw reply [flat|nested] 144+ messages in thread
* Re: [PATCH v11 11/20] x86/virt/tdx: Fill out TDMRs to cover all TDX memory regions
[not found] ` <927ec9871721d2a50f1aba7d1cf7c3be50e4f49b.1685887183.git.kai.huang@intel.com>
@ 2023-06-07 16:05 ` Dave Hansen
2023-06-08 10:48 ` Huang, Kai
2023-06-08 23:02 ` kirill.shutemov
` (2 subsequent siblings)
3 siblings, 1 reply; 144+ messages in thread
From: Dave Hansen @ 2023-06-07 16:05 UTC (permalink / raw)
To: Kai Huang, linux-kernel, kvm
Cc: linux-mm, kirill.shutemov, tony.luck, peterz, tglx, seanjc,
pbonzini, david, dan.j.williams, rafael.j.wysocki, ying.huang,
reinette.chatre, len.brown, ak, isaku.yamahata, chao.gao,
sathyanarayanan.kuppuswamy, bagasdotme, sagis, imammedo
On 6/4/23 07:27, Kai Huang wrote:
> + /*
> + * Loop over TDX memory regions and fill out TDMRs to cover them.
> + * To keep it simple, always try to use one TDMR to cover one
> + * memory region.
> + *
> + * In practice TDX1.0 supports 64 TDMRs, which is big enough to
> + * cover all memory regions in reality if the admin doesn't use
> + * 'memmap' to create a bunch of discrete memory regions. When
> + * there's a real problem, enhancement can be done to merge TDMRs
> + * to reduce the final number of TDMRs.
> + */
Rather than focus in on one specific command-line parameter, let's just say:
In practice TDX supports at least 64 TDMRs. A 2-socket system
typically only consumes <NUMBER> of those. This code is dumb
and simple and may use more TDMRs than is strictly required.
Let's also put a pr_warn() in here if we exceed, say 1/2 or maybe 3/4 of
the 64. We'll hopefully start to get reports somewhat in advance if
systems get close to the limit.
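The threshold check could be as small as the sketch below. The 3/4 ratio is one of the two options floated above; the helper name and the TDX_MAX_TDMRS_ASSUMED constant are hypothetical, and in the kernel the caller would presumably pr_warn() when the helper returns true.

```c
#include <stdbool.h>

/* Assumed limit from the comment above: at least 64 TDMRs are supported. */
#define TDX_MAX_TDMRS_ASSUMED  64

/* True once more than 3/4 of the available TDMRs are consumed. */
static bool tdmr_usage_is_high(int nr_consumed, int max_tdmrs)
{
	/* Integer-only form of nr_consumed / max_tdmrs > 3/4. */
	return nr_consumed * 4 > max_tdmrs * 3;
}
```

Warning at 3/4 rather than at the limit gives reports a chance to arrive before any system actually fails to initialize.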
^ permalink raw reply [flat|nested] 144+ messages in thread
* Re: [PATCH v11 05/20] x86/virt/tdx: Add SEAMCALL infrastructure
2023-06-07 14:24 ` Dave Hansen
@ 2023-06-07 18:53 ` Isaku Yamahata
2023-06-07 19:27 ` Dave Hansen
2023-06-07 22:56 ` Huang, Kai
1 sibling, 1 reply; 144+ messages in thread
From: Isaku Yamahata @ 2023-06-07 18:53 UTC (permalink / raw)
To: Dave Hansen
Cc: Kai Huang, linux-kernel, kvm, linux-mm, kirill.shutemov,
tony.luck, peterz, tglx, seanjc, pbonzini, david, dan.j.williams,
rafael.j.wysocki, ying.huang, reinette.chatre, len.brown, ak,
isaku.yamahata, chao.gao, sathyanarayanan.kuppuswamy, bagasdotme,
sagis, imammedo, isaku.yamahata
On Wed, Jun 07, 2023 at 07:24:23AM -0700,
Dave Hansen <dave.hansen@intel.com> wrote:
> On 6/4/23 07:27, Kai Huang wrote:
> > TDX introduces a new CPU mode: Secure Arbitration Mode (SEAM). This
> > mode runs only the TDX module itself or other code to load the TDX
> > module.
> >
> > The host kernel communicates with SEAM software via a new SEAMCALL
> > instruction. This is conceptually similar to a guest->host hypercall,
> > except it is made from the host to SEAM software instead. The TDX
> > module establishes a new SEAMCALL ABI which allows the host to
> > initialize the module and to manage VMs.
> >
> > Add infrastructure to make SEAMCALLs. The SEAMCALL ABI is very similar
> > to the TDCALL ABI and leverages much TDCALL infrastructure.
> >
> > SEAMCALL instruction causes #GP when TDX isn't BIOS enabled, and #UD
> > when CPU is not in VMX operation. Currently, only KVM code mocks with
>
> "mocks"? Did you mean "mucks"?
>
> > VMX enabling, and KVM is the only user of TDX. This implementation
> > chooses to make KVM itself responsible for enabling VMX before using
> > TDX and let the rest of the kernel stay blissfully unaware of VMX.
> >
> > The current TDX_MODULE_CALL macro handles neither #GP nor #UD. The
> > kernel would hit Oops if SEAMCALL were mistakenly made w/o enabling VMX
> > first. Architecturally, there is no CPU flag to check whether the CPU
> > is in VMX operation. Also, if a BIOS were buggy, it could still report
> > valid TDX private KeyIDs when TDX actually couldn't be enabled.
>
> I'm not sure this is a great justification. If the BIOS is lying to the
> OS, we _should_ oops.
>
> How else can this happen other than silly kernel bugs. It's OK to oops
> in the face of silly kernel bugs.
TDX KVM + reboot can hit #UD. On reboot, VMX is disabled (VMXOFF) via the
syscore.shutdown callback. However, a guest TD can still be running and issue
a SEAMCALL, resulting in #UD.
Or we can postpone the change and make the TDX KVM patch series carry a patch
for it.
--
Isaku Yamahata <isaku.yamahata@gmail.com>
^ permalink raw reply [flat|nested] 144+ messages in thread
* Re: [PATCH v11 05/20] x86/virt/tdx: Add SEAMCALL infrastructure
2023-06-07 18:53 ` Isaku Yamahata
@ 2023-06-07 19:27 ` Dave Hansen
2023-06-07 19:47 ` Isaku Yamahata
0 siblings, 1 reply; 144+ messages in thread
From: Dave Hansen @ 2023-06-07 19:27 UTC (permalink / raw)
To: Isaku Yamahata
Cc: Kai Huang, linux-kernel, kvm, linux-mm, kirill.shutemov,
tony.luck, peterz, tglx, seanjc, pbonzini, david, dan.j.williams,
rafael.j.wysocki, ying.huang, reinette.chatre, len.brown, ak,
isaku.yamahata, chao.gao, sathyanarayanan.kuppuswamy, bagasdotme,
sagis, imammedo
On 6/7/23 11:53, Isaku Yamahata wrote:
>>> VMX enabling, and KVM is the only user of TDX. This implementation
>>> chooses to make KVM itself responsible for enabling VMX before using
>>> TDX and let the rest of the kernel stay blissfully unaware of VMX.
>>>
>>> The current TDX_MODULE_CALL macro handles neither #GP nor #UD. The
>>> kernel would hit Oops if SEAMCALL were mistakenly made w/o enabling VMX
>>> first. Architecturally, there is no CPU flag to check whether the CPU
>>> is in VMX operation. Also, if a BIOS were buggy, it could still report
>>> valid TDX private KeyIDs when TDX actually couldn't be enabled.
>> I'm not sure this is a great justification. If the BIOS is lying to the
>> OS, we _should_ oops.
>>
>> How else can this happen other than silly kernel bugs. It's OK to oops
>> in the face of silly kernel bugs.
> TDX KVM + reboot can hit #UD. On reboot, VMX is disabled (VMXOFF) via
> syscore.shutdown callback. However, guest TD can be still running to issue
> SEAMCALL resulting in #UD.
>
> Or we can postpone the change and make the TDX KVM patch series carry a patch
> for it.
How does the existing KVM use of VMLAUNCH/VMRESUME avoid that problem?
^ permalink raw reply [flat|nested] 144+ messages in thread
* Re: [PATCH v11 05/20] x86/virt/tdx: Add SEAMCALL infrastructure
2023-06-07 19:27 ` Dave Hansen
@ 2023-06-07 19:47 ` Isaku Yamahata
2023-06-07 20:08 ` Sean Christopherson
0 siblings, 1 reply; 144+ messages in thread
From: Isaku Yamahata @ 2023-06-07 19:47 UTC (permalink / raw)
To: Dave Hansen
Cc: Isaku Yamahata, Kai Huang, linux-kernel, kvm, linux-mm,
kirill.shutemov, tony.luck, peterz, tglx, seanjc, pbonzini,
david, dan.j.williams, rafael.j.wysocki, ying.huang,
reinette.chatre, len.brown, ak, isaku.yamahata, chao.gao,
sathyanarayanan.kuppuswamy, bagasdotme, sagis, imammedo
On Wed, Jun 07, 2023 at 12:27:33PM -0700,
Dave Hansen <dave.hansen@intel.com> wrote:
> On 6/7/23 11:53, Isaku Yamahata wrote:
> >>> VMX enabling, and KVM is the only user of TDX. This implementation
> >>> chooses to make KVM itself responsible for enabling VMX before using
> >>> TDX and let the rest of the kernel stay blissfully unaware of VMX.
> >>>
> >>> The current TDX_MODULE_CALL macro handles neither #GP nor #UD. The
> >>> kernel would hit Oops if SEAMCALL were mistakenly made w/o enabling VMX
> >>> first. Architecturally, there is no CPU flag to check whether the CPU
> >>> is in VMX operation. Also, if a BIOS were buggy, it could still report
> >>> valid TDX private KeyIDs when TDX actually couldn't be enabled.
> >> I'm not sure this is a great justification. If the BIOS is lying to the
> >> OS, we _should_ oops.
> >>
> >> How else can this happen other than silly kernel bugs. It's OK to oops
> >> in the face of silly kernel bugs.
> > TDX KVM + reboot can hit #UD. On reboot, VMX is disabled (VMXOFF) via
> > syscore.shutdown callback. However, guest TD can be still running to issue
> > SEAMCALL resulting in #UD.
> >
> > Or we can postpone the change and make the TDX KVM patch series carry a patch
> > for it.
>
> How does the existing KVM use of VMLAUNCH/VMRESUME avoid that problem?
extable. From arch/x86/kvm/vmx/vmenter.S
.Lvmresume:
vmresume
jmp .Lvmfail
.Lvmlaunch:
vmlaunch
jmp .Lvmfail
_ASM_EXTABLE(.Lvmresume, .Lfixup)
_ASM_EXTABLE(.Lvmlaunch, .Lfixup)
--
Isaku Yamahata <isaku.yamahata@gmail.com>
^ permalink raw reply [flat|nested] 144+ messages in thread
* Re: [PATCH v11 05/20] x86/virt/tdx: Add SEAMCALL infrastructure
2023-06-07 19:47 ` Isaku Yamahata
@ 2023-06-07 20:08 ` Sean Christopherson
2023-06-07 20:22 ` Dave Hansen
0 siblings, 1 reply; 144+ messages in thread
From: Sean Christopherson @ 2023-06-07 20:08 UTC (permalink / raw)
To: Isaku Yamahata
Cc: Dave Hansen, Kai Huang, linux-kernel, kvm, linux-mm,
kirill.shutemov, tony.luck, peterz, tglx, pbonzini, david,
dan.j.williams, rafael.j.wysocki, ying.huang, reinette.chatre,
len.brown, ak, isaku.yamahata, chao.gao,
sathyanarayanan.kuppuswamy, bagasdotme, sagis, imammedo
On Wed, Jun 07, 2023, Isaku Yamahata wrote:
> On Wed, Jun 07, 2023 at 12:27:33PM -0700,
> Dave Hansen <dave.hansen@intel.com> wrote:
>
> > On 6/7/23 11:53, Isaku Yamahata wrote:
> > >>> VMX enabling, and KVM is the only user of TDX. This implementation
> > >>> chooses to make KVM itself responsible for enabling VMX before using
> > >>> TDX and let the rest of the kernel stay blissfully unaware of VMX.
> > >>>
> > >>> The current TDX_MODULE_CALL macro handles neither #GP nor #UD. The
> > >>> kernel would hit Oops if SEAMCALL were mistakenly made w/o enabling VMX
> > >>> first. Architecturally, there is no CPU flag to check whether the CPU
> > >>> is in VMX operation. Also, if a BIOS were buggy, it could still report
> > >>> valid TDX private KeyIDs when TDX actually couldn't be enabled.
> > >> I'm not sure this is a great justification. If the BIOS is lying to the
> > >> OS, we _should_ oops.
> > >>
> > >> How else can this happen other than silly kernel bugs. It's OK to oops
> > >> in the face of silly kernel bugs.
> > > TDX KVM + reboot can hit #UD. On reboot, VMX is disabled (VMXOFF) via
> > > syscore.shutdown callback. However, guest TD can be still running to issue
> > > SEAMCALL resulting in #UD.
> > >
> > > Or we can postpone the change and make the TDX KVM patch series carry a patch
> > > for it.
> >
> > How does the existing KVM use of VMLAUNCH/VMRESUME avoid that problem?
>
> extable. From arch/x86/kvm/vmx/vmenter.S
>
> .Lvmresume:
> vmresume
> jmp .Lvmfail
>
> .Lvmlaunch:
> vmlaunch
> jmp .Lvmfail
>
> _ASM_EXTABLE(.Lvmresume, .Lfixup)
> _ASM_EXTABLE(.Lvmlaunch, .Lfixup)
More specifically, KVM eats faults on VMX and SVM instructions that occur after
KVM forcefully disables VMX/SVM.
E.g. with reboot -f, this will be reached without first stopping VMs:
static void kvm_shutdown(void)
{
/*
* Disable hardware virtualization and set kvm_rebooting to indicate
* that KVM has asynchronously disabled hardware virtualization, i.e.
* that relevant errors and exceptions aren't entirely unexpected.
* Some flavors of hardware virtualization need to be disabled before
* transferring control to firmware (to perform shutdown/reboot), e.g.
* on x86, virtualization can block INIT interrupts, which are used by
* firmware to pull APs back under firmware control. Note, this path
* is used for both shutdown and reboot scenarios, i.e. neither name is
* 100% comprehensive.
*/
pr_info("kvm: exiting hardware virtualization\n");
kvm_rebooting = true;
on_each_cpu(hardware_disable_nolock, NULL, 1);
}
which KVM x86 (VMX and SVM) then queries when deciding what to do with a spurious
fault on a VMX/SVM instruction:
/*
* Handle a fault on a hardware virtualization (VMX or SVM) instruction.
*
* Hardware virtualization extension instructions may fault if a reboot turns
* off virtualization while processes are running. Usually after catching the
* fault we just panic; during reboot instead the instruction is ignored.
*/
noinstr void kvm_spurious_fault(void)
{
/* Fault while not rebooting. We want the trace. */
BUG_ON(!kvm_rebooting);
}
EXPORT_SYMBOL_GPL(kvm_spurious_fault);
^ permalink raw reply [flat|nested] 144+ messages in thread
* Re: [PATCH v11 05/20] x86/virt/tdx: Add SEAMCALL infrastructure
2023-06-07 20:08 ` Sean Christopherson
@ 2023-06-07 20:22 ` Dave Hansen
2023-06-08 0:51 ` Huang, Kai
0 siblings, 1 reply; 144+ messages in thread
From: Dave Hansen @ 2023-06-07 20:22 UTC (permalink / raw)
To: Sean Christopherson, Isaku Yamahata
Cc: Kai Huang, linux-kernel, kvm, linux-mm, kirill.shutemov,
tony.luck, peterz, tglx, pbonzini, david, dan.j.williams,
rafael.j.wysocki, ying.huang, reinette.chatre, len.brown, ak,
isaku.yamahata, chao.gao, sathyanarayanan.kuppuswamy, bagasdotme,
sagis, imammedo
On 6/7/23 13:08, Sean Christopherson wrote:
>>>>>> The current TDX_MODULE_CALL macro handles neither #GP nor #UD. The
>>>>>> kernel would hit Oops if SEAMCALL were mistakenly made w/o enabling VMX
>>>>>> first. Architecturally, there is no CPU flag to check whether the CPU
>>>>>> is in VMX operation. Also, if a BIOS were buggy, it could still report
>>>>>> valid TDX private KeyIDs when TDX actually couldn't be enabled.
>>>>> I'm not sure this is a great justification. If the BIOS is lying to the
>>>>> OS, we _should_ oops.
>>>>>
>>>>> How else can this happen other than silly kernel bugs. It's OK to oops
>>>>> in the face of silly kernel bugs.
>>>> TDX KVM + reboot can hit #UD. On reboot, VMX is disabled (VMXOFF) via
>>>> syscore.shutdown callback. However, guest TD can be still running to issue
>>>> SEAMCALL resulting in #UD.
>>>>
>>>> Or we can postpone the change and make the TDX KVM patch series carry a patch
>>>> for it.
>>> How does the existing KVM use of VMLAUNCH/VMRESUME avoid that problem?
>> extable. From arch/x86/kvm/vmx/vmenter.S
>>
>> .Lvmresume:
>> vmresume
>> jmp .Lvmfail
>>
>> .Lvmlaunch:
>> vmlaunch
>> jmp .Lvmfail
>>
>> _ASM_EXTABLE(.Lvmresume, .Lfixup)
>> _ASM_EXTABLE(.Lvmlaunch, .Lfixup)
> More specifically, KVM eats faults on VMX and SVM instructions that occur after
> KVM forcefully disables VMX/SVM.
<grumble> That's a *TOTALLY* different argument than the patch makes.
KVM is being a _bit_ nutty here, but I do respect it trying to honor the
"-f". I have no objections to the SEAMCALL code being nutty in the same
way.
Why do I get the feeling that code is being written without
understanding _why_, despite this being v11?
^ permalink raw reply [flat|nested] 144+ messages in thread
* Re: [PATCH v11 04/20] x86/cpu: Detect TDX partial write machine check erratum
2023-06-07 14:15 ` Dave Hansen
@ 2023-06-07 22:43 ` Huang, Kai
2023-06-19 11:37 ` Huang, Kai
0 siblings, 1 reply; 144+ messages in thread
From: Huang, Kai @ 2023-06-07 22:43 UTC (permalink / raw)
To: kvm, Hansen, Dave, linux-kernel
Cc: Luck, Tony, david, bagasdotme, ak, Wysocki, Rafael J,
kirill.shutemov, Chatre, Reinette, Christopherson,,
Sean, pbonzini, tglx, Yamahata, Isaku, linux-mm, peterz, Shahar,
Sagi, imammedo, Gao, Chao, Brown, Len,
sathyanarayanan.kuppuswamy, Huang, Ying, Williams, Dan J
On Wed, 2023-06-07 at 07:15 -0700, Hansen, Dave wrote:
> On 6/4/23 07:27, Kai Huang wrote:
> > TDX memory has integrity and confidentiality protections. Violations of
> > this integrity protection are supposed to only affect TDX operations and
> > are never supposed to affect the host kernel itself. In other words,
> > the host kernel should never, itself, see machine checks induced by the
> > TDX integrity hardware.
>
> At the risk of patting myself on the back by acking a changelog that I
> wrote 95% of:
>
> Reviewed-by: Dave Hansen <dave.hansen@linux.intel.com>
>
Thanks!
^ permalink raw reply [flat|nested] 144+ messages in thread
* Re: [PATCH v11 05/20] x86/virt/tdx: Add SEAMCALL infrastructure
2023-06-07 14:24 ` Dave Hansen
2023-06-07 18:53 ` Isaku Yamahata
@ 2023-06-07 22:56 ` Huang, Kai
2023-06-08 14:05 ` Dave Hansen
1 sibling, 1 reply; 144+ messages in thread
From: Huang, Kai @ 2023-06-07 22:56 UTC (permalink / raw)
To: kvm, Hansen, Dave, linux-kernel
Cc: Luck, Tony, david, bagasdotme, ak, Wysocki, Rafael J,
kirill.shutemov, Chatre, Reinette, Christopherson,,
Sean, pbonzini, tglx, Yamahata, Isaku, linux-mm, peterz, Shahar,
Sagi, imammedo, Gao, Chao, Brown, Len,
sathyanarayanan.kuppuswamy, Huang, Ying, Williams, Dan J
On Wed, 2023-06-07 at 07:24 -0700, Hansen, Dave wrote:
> On 6/4/23 07:27, Kai Huang wrote:
> > TDX introduces a new CPU mode: Secure Arbitration Mode (SEAM). This
> > mode runs only the TDX module itself or other code to load the TDX
> > module.
> >
> > The host kernel communicates with SEAM software via a new SEAMCALL
> > instruction. This is conceptually similar to a guest->host hypercall,
> > except it is made from the host to SEAM software instead. The TDX
> > module establishes a new SEAMCALL ABI which allows the host to
> > initialize the module and to manage VMs.
> >
> > Add infrastructure to make SEAMCALLs. The SEAMCALL ABI is very similar
> > to the TDCALL ABI and leverages much TDCALL infrastructure.
> >
> > SEAMCALL instruction causes #GP when TDX isn't BIOS enabled, and #UD
> > when CPU is not in VMX operation. Currently, only KVM code mocks with
>
> "mocks"? Did you mean "mucks"?
Yes "mucks". I believe I made some mistake.
>
> > VMX enabling, and KVM is the only user of TDX. This implementation
> > chooses to make KVM itself responsible for enabling VMX before using
> > TDX and let the rest of the kernel stay blissfully unaware of VMX.
> >
> > The current TDX_MODULE_CALL macro handles neither #GP nor #UD. The
> > kernel would hit Oops if SEAMCALL were mistakenly made w/o enabling VMX
> > first. Architecturally, there is no CPU flag to check whether the CPU
> > is in VMX operation. Also, if a BIOS were buggy, it could still report
> > valid TDX private KeyIDs when TDX actually couldn't be enabled.
>
> I'm not sure this is a great justification. If the BIOS is lying to the
> OS, we _should_ oops.
>
> How else can this happen other than silly kernel bugs. It's OK to oops
> in the face of silly kernel bugs.
Agreed. And I'll just remove that sentence if you agree with below ...
[...]
> > + /*
> > + * SEAMCALL caused #GP or #UD. By reaching here %eax contains
> > + * the trap number. Convert the trap number to the TDX error
> > + * code by setting TDX_SW_ERROR to the high 32-bits of %rax.
> > + *
> > + * Note cannot OR TDX_SW_ERROR directly to %rax as OR instruction
> > + * only accepts 32-bit immediate at most.
> > + */
> > + mov $TDX_SW_ERROR, %r12
> > + orq %r12, %rax
>
> I think the justification for doing the #UD/#GP handling is a bit weak.
> In the end, it gets us a nicer error message. Is that error message
> *REALLY* needed? Or is an oops OK in the very rare circumstance that
> the BIOS is totally buggy?
...
It's not just for the "BIOS buggy" case. The main purpose is to give an error
message when the caller mistakenly calls tdx_enable().
Also, the machine check handler improvement patch now calls SEAMCALL to get a
given page's page type. It's totally legal for a machine check to happen when
the CPU isn't in VMX operation (e.g. KVM isn't loaded), and in fact we use the
SEAMCALL return value to detect whether the CPU is in VMX operation and handle
such cases accordingly.
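As a rough illustration of the conversion described in the quoted comment, here
is a userspace C sketch of folding a trap number into a 64-bit error code. The
vector numbers for #UD (6) and #GP (13) are architectural; the SW_ERROR bit
layout and the helper names trap_to_error() and error_is_ud() are assumptions
made up for this example, not the patch's definitions.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Architectural x86 exception vectors: #UD is 6, #GP is 13. */
#define TRAP_UD 6u
#define TRAP_GP 13u

/*
 * Assumed layout, for illustration only: a "software error" marker that
 * lives entirely in the high 32 bits, so it never collides with the trap
 * number held in the low 32 bits.
 */
#define SW_ERROR ((UINT64_C(1) << 63) | (UINT64_C(0xff) << 40))

/* Fold a trap number into a 64-bit SEAMCALL-style error code. */
uint64_t trap_to_error(uint32_t trapnr)
{
	assert(trapnr < 32);	/* exception vectors are 0..31 */
	return SW_ERROR | (uint64_t)trapnr;
}

/* True if the error code says the instruction faulted with #UD. */
bool error_is_ud(uint64_t err)
{
	return err == trap_to_error(TRAP_UD);
}
```

A caller such as the machine check handler could then treat error_is_ud() as
"CPU not in VMX operation" and skip the query instead of oopsing.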
^ permalink raw reply [flat|nested] 144+ messages in thread
* Re: [PATCH v11 09/20] x86/virt/tdx: Use all system memory when initializing TDX module as TDX memory
2023-06-07 15:48 ` [PATCH v11 09/20] x86/virt/tdx: Use all system memory when initializing TDX module as TDX memory Dave Hansen
@ 2023-06-07 23:22 ` Huang, Kai
0 siblings, 0 replies; 144+ messages in thread
From: Huang, Kai @ 2023-06-07 23:22 UTC (permalink / raw)
To: kvm, Hansen, Dave, linux-kernel
Cc: Luck, Tony, david, bagasdotme, ak, Wysocki, Rafael J,
kirill.shutemov, Chatre, Reinette, Christopherson,,
Sean, pbonzini, tglx, Yamahata, Isaku, linux-mm, peterz, Shahar,
Sagi, imammedo, Gao, Chao, Brown, Len,
sathyanarayanan.kuppuswamy, Huang, Ying, Williams, Dan J
On Wed, 2023-06-07 at 08:48 -0700, Dave Hansen wrote:
> On 6/4/23 07:27, Kai Huang wrote:
> > As a step of initializing the TDX module, the kernel needs to tell the
> > TDX module which memory regions can be used by the TDX module as TDX
> > guest memory.
> ...
> > Signed-off-by: Kai Huang <kai.huang@intel.com>
> > Reviewed-by: "Huang, Ying" <ying.huang@intel.com>
> > Reviewed-by: Isaku Yamahata <isaku.yamahata@intel.com>
>
> This is rather short on reviews from folks who do a lot of memory
> hotplug work. Partly because I don't see any of them cc'd.
David and Kirill are Cc'ed :) linux-mm list is Cc'ed too.
>
> Can you wrangle some mm reviews on this, please?
Yes, in v8 I followed your suggestion to Cc MM people (David/Oscar/Andrew) and
ask them to take a look, and David said it looked good to him.
https://lore.kernel.org/lkml/cover.1670566861.git.kai.huang@intel.com/T/#mffb978d157c99da598d6354d55650952c425c6fd
I then removed Oscar and Andrew from Cc list since this patch doesn't touch
core-MM memory hotplug but only uses memory notifier.
I'll ask Kirill to help to review.
Hi David,
I'd appreciate it if you could give your Acked-by or Reviewed-by if this patch
looks good to you. Thanks :)
>
> For the x86 side (and <sigh> because this patch probably took two years
> to coalesce <double sigh>):
>
> Reviewed-by: Dave Hansen <dave.hansen@linux.intel.com>
Thanks!
^ permalink raw reply [flat|nested] 144+ messages in thread
* Re: [PATCH v11 06/20] x86/virt/tdx: Handle SEAMCALL running out of entropy error
2023-06-07 15:08 ` Dave Hansen
@ 2023-06-07 23:36 ` Huang, Kai
2023-06-08 0:29 ` Dave Hansen
0 siblings, 1 reply; 144+ messages in thread
From: Huang, Kai @ 2023-06-07 23:36 UTC (permalink / raw)
To: kvm, Hansen, Dave, linux-kernel
Cc: Luck, Tony, david, bagasdotme, ak, Wysocki, Rafael J,
kirill.shutemov, Chatre, Reinette, Christopherson,,
Sean, pbonzini, tglx, Yamahata, Isaku, linux-mm, peterz, Shahar,
Sagi, imammedo, Gao, Chao, Brown, Len,
sathyanarayanan.kuppuswamy, Huang, Ying, Williams, Dan J
On Wed, 2023-06-07 at 08:08 -0700, Hansen, Dave wrote:
> On 6/4/23 07:27, Kai Huang wrote:
> > Certain SEAMCALL leaf functions may return error due to running out of
> > entropy, in which case the SEAMCALL should be retried as suggested by
> > the TDX spec.
> >
> > Handle this case in SEAMCALL common function. Mimic the existing
> > rdrand_long() to retry RDRAND_RETRY_LOOPS times.
>
> ... because who are we kidding? When the TDX module says it doesn't
> have enough entropy it means rdrand.
The TDX spec says "e.g., RDRAND or RDSEED".
Do you prefer below?
Certain SEAMCALL leaf functions may return error due to running out of entropy
(e.g., RDRAND or RDSEED), in which case the SEAMCALL should be retried as
suggested by the TDX spec.
Handle this case in the SEAMCALL common function. Based on the SDM there's no
big difference between RDRAND and RDSEED except the latter is "compliant to NIST
SP800-90B and NIST SP800-90C in the XOR construction mode". Just mimic the
existing rdrand_long() to retry RDRAND_RETRY_LOOPS times.
>
> Reviewed-by: Dave Hansen <dave.hansen@linux.intel.com>
>
>
Thanks!
^ permalink raw reply [flat|nested] 144+ messages in thread
* Re: [PATCH v11 03/20] x86/virt/tdx: Make INTEL_TDX_HOST depend on X86_X2APIC
[not found] ` <cee2f2664aac3c5314896c6d14cba50f2617c0e5.1685887183.git.kai.huang@intel.com>
@ 2023-06-08 0:08 ` kirill.shutemov
0 siblings, 0 replies; 144+ messages in thread
From: kirill.shutemov @ 2023-06-08 0:08 UTC (permalink / raw)
To: Kai Huang
Cc: linux-kernel, kvm, linux-mm, dave.hansen, tony.luck, peterz,
tglx, seanjc, pbonzini, david, dan.j.williams, rafael.j.wysocki,
ying.huang, reinette.chatre, len.brown, ak, isaku.yamahata,
chao.gao, sathyanarayanan.kuppuswamy, bagasdotme, sagis,
imammedo
On Mon, Jun 05, 2023 at 02:27:16AM +1200, Kai Huang wrote:
> TDX capable platforms are locked to X2APIC mode and cannot fall back to
> the legacy xAPIC mode when TDX is enabled by the BIOS. TDX host support
> requires x2APIC. Make INTEL_TDX_HOST depend on X86_X2APIC.
>
> Link: https://lore.kernel.org/lkml/ba80b303-31bf-d44a-b05d-5c0f83038798@intel.com/
> Signed-off-by: Kai Huang <kai.huang@intel.com>
> Reviewed-by: Dave Hansen <dave.hansen@linux.intel.com>
> Reviewed-by: David Hildenbrand <david@redhat.com>
Reviewed-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
--
Kiryl Shutsemau / Kirill A. Shutemov
^ permalink raw reply [flat|nested] 144+ messages in thread
* Re: [PATCH v11 06/20] x86/virt/tdx: Handle SEAMCALL running out of entropy error
[not found] ` <9b3582c9f3a81ae68b32d9997fcd20baecb63b9b.1685887183.git.kai.huang@intel.com>
2023-06-07 8:19 ` [PATCH v11 06/20] x86/virt/tdx: Handle SEAMCALL running out of entropy error Isaku Yamahata
2023-06-07 15:08 ` Dave Hansen
@ 2023-06-08 0:08 ` kirill.shutemov
2023-06-09 14:42 ` Nikolay Borisov
2023-06-19 13:00 ` David Hildenbrand
4 siblings, 0 replies; 144+ messages in thread
From: kirill.shutemov @ 2023-06-08 0:08 UTC (permalink / raw)
To: Kai Huang
Cc: linux-kernel, kvm, linux-mm, dave.hansen, tony.luck, peterz,
tglx, seanjc, pbonzini, david, dan.j.williams, rafael.j.wysocki,
ying.huang, reinette.chatre, len.brown, ak, isaku.yamahata,
chao.gao, sathyanarayanan.kuppuswamy, bagasdotme, sagis,
imammedo
On Mon, Jun 05, 2023 at 02:27:19AM +1200, Kai Huang wrote:
> Certain SEAMCALL leaf functions may return error due to running out of
> entropy, in which case the SEAMCALL should be retried as suggested by
> the TDX spec.
>
> Handle this case in SEAMCALL common function. Mimic the existing
> rdrand_long() to retry RDRAND_RETRY_LOOPS times.
>
> Signed-off-by: Kai Huang <kai.huang@intel.com>
Reviewed-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
--
Kiryl Shutsemau / Kirill A. Shutemov
^ permalink raw reply [flat|nested] 144+ messages in thread
* Re: [PATCH v11 08/20] x86/virt/tdx: Get information about TDX module and TDX-capable memory
[not found] ` <50386eddbb8046b0b222d385e56e8115ed566526.1685887183.git.kai.huang@intel.com>
2023-06-07 15:25 ` [PATCH v11 08/20] x86/virt/tdx: Get information about TDX module and TDX-capable memory Dave Hansen
@ 2023-06-08 0:27 ` kirill.shutemov
2023-06-08 2:40 ` Huang, Kai
2023-06-09 10:02 ` kirill.shutemov
2023-06-19 13:29 ` David Hildenbrand
3 siblings, 1 reply; 144+ messages in thread
From: kirill.shutemov @ 2023-06-08 0:27 UTC (permalink / raw)
To: Kai Huang
Cc: linux-kernel, kvm, linux-mm, dave.hansen, tony.luck, peterz,
tglx, seanjc, pbonzini, david, dan.j.williams, rafael.j.wysocki,
ying.huang, reinette.chatre, len.brown, ak, isaku.yamahata,
chao.gao, sathyanarayanan.kuppuswamy, bagasdotme, sagis,
imammedo
On Mon, Jun 05, 2023 at 02:27:21AM +1200, Kai Huang wrote:
> For now both 'tdsysinfo_struct' and CMRs are only used during the module
> initialization. But because they are both relatively big, declare them
> inside the module initialization function but as static variables.
This justification does not make sense to me. static variables will not be
freed after function returned. They will still consume memory.
I think you need to allocate/free memory dynamically, if they are too big
for stack.
...
> static int init_tdx_module(void)
> {
> + static DECLARE_PADDED_STRUCT(tdsysinfo_struct, tdsysinfo,
> + TDSYSINFO_STRUCT_SIZE, TDSYSINFO_STRUCT_ALIGNMENT);
> + static struct cmr_info cmr_array[MAX_CMRS]
> + __aligned(CMR_INFO_ARRAY_ALIGNMENT);
--
Kiryl Shutsemau / Kirill A. Shutemov
^ permalink raw reply [flat|nested] 144+ messages in thread
* Re: [PATCH v11 06/20] x86/virt/tdx: Handle SEAMCALL running out of entropy error
2023-06-07 23:36 ` Huang, Kai
@ 2023-06-08 0:29 ` Dave Hansen
0 siblings, 0 replies; 144+ messages in thread
From: Dave Hansen @ 2023-06-08 0:29 UTC (permalink / raw)
To: Huang, Kai, kvm, linux-kernel
Cc: Luck, Tony, david, bagasdotme, ak, Wysocki, Rafael J,
kirill.shutemov, Chatre, Reinette, Christopherson,,
Sean, pbonzini, tglx, Yamahata, Isaku, linux-mm, peterz, Shahar,
Sagi, imammedo, Gao, Chao, Brown, Len,
sathyanarayanan.kuppuswamy, Huang, Ying, Williams, Dan J
On 6/7/23 16:36, Huang, Kai wrote:
> On Wed, 2023-06-07 at 08:08 -0700, Hansen, Dave wrote:
>> On 6/4/23 07:27, Kai Huang wrote:
>>> Certain SEAMCALL leaf functions may return error due to running out of
>>> entropy, in which case the SEAMCALL should be retried as suggested by
>>> the TDX spec.
>>>
>>> Handle this case in SEAMCALL common function. Mimic the existing
>>> rdrand_long() to retry RDRAND_RETRY_LOOPS times.
>>
>> ... because who are we kidding? When the TDX module says it doesn't
>> have enough entropy it means rdrand.
>
> The TDX spec says "e.g., RDRAND or RDSEED".
Let's just say something a bit more useful and ambiguous:
Some SEAMCALLs use the RDRAND hardware and can fail for the
same reasons as RDRAND. Use the kernel RDRAND retry logic for
them.
We don't need to say "RDRAND and RDSEED", just saying "RDRAND hardware"
is fine. Everybody knows what you mean.
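A minimal userspace sketch of the bounded retry being discussed, for readers
who want to see the loop shape. RETRY_LOOPS is set to 10 to match the kernel's
RDRAND_RETRY_LOOPS; the ERR_NO_ENTROPY value, the call_fn type and the
flaky_call() test double are hypothetical and only stand in for a real
SEAMCALL.

```c
#include <assert.h>
#include <stdint.h>

#define RETRY_LOOPS 10	/* same bound as the kernel's RDRAND_RETRY_LOOPS */

/* Hypothetical status value meaning "out of entropy, try again". */
#define ERR_NO_ENTROPY UINT64_C(0x8000020300000000)

/* Stand-in for a SEAMCALL: any function returning a 64-bit status. */
typedef uint64_t (*call_fn)(void *ctx);

/*
 * Invoke fn up to RETRY_LOOPS times, retrying only on ERR_NO_ENTROPY.
 * Any other status, success or failure, is returned immediately.
 */
uint64_t call_with_retry(call_fn fn, void *ctx)
{
	uint64_t ret;
	int retry = RETRY_LOOPS;

	assert(fn);
	do {
		ret = fn(ctx);
	} while (ret == ERR_NO_ENTROPY && --retry);

	return ret;
}

/* Test double: reports ERR_NO_ENTROPY until the counter in ctx hits 0. */
uint64_t flaky_call(void *ctx)
{
	int *failures_left = ctx;

	if (*failures_left > 0) {
		(*failures_left)--;
		return ERR_NO_ENTROPY;
	}
	return 0;
}
```

With three pending failures the fourth attempt succeeds; with twenty, the loop
gives up after ten attempts and hands ERR_NO_ENTROPY back to the caller.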
^ permalink raw reply [flat|nested] 144+ messages in thread
* Re: [PATCH v11 05/20] x86/virt/tdx: Add SEAMCALL infrastructure
2023-06-07 20:22 ` Dave Hansen
@ 2023-06-08 0:51 ` Huang, Kai
2023-06-08 13:50 ` Dave Hansen
0 siblings, 1 reply; 144+ messages in thread
From: Huang, Kai @ 2023-06-08 0:51 UTC (permalink / raw)
To: Hansen, Dave, Christopherson,, Sean, isaku.yamahata
Cc: kvm, Luck, Tony, david, bagasdotme, ak, Wysocki, Rafael J,
kirill.shutemov, Chatre, Reinette, pbonzini, linux-mm, tglx,
linux-kernel, Yamahata, Isaku, peterz, Shahar, Sagi, imammedo,
Gao, Chao, Brown, Len, sathyanarayanan.kuppuswamy, Huang, Ying,
Williams, Dan J
On Wed, 2023-06-07 at 13:22 -0700, Hansen, Dave wrote:
> On 6/7/23 13:08, Sean Christopherson wrote:
> > > > > > > The current TDX_MODULE_CALL macro handles neither #GP nor #UD. The
> > > > > > > kernel would hit Oops if SEAMCALL were mistakenly made w/o enabling VMX
> > > > > > > first. Architecturally, there is no CPU flag to check whether the CPU
> > > > > > > is in VMX operation. Also, if a BIOS were buggy, it could still report
> > > > > > > valid TDX private KeyIDs when TDX actually couldn't be enabled.
> > > > > > I'm not sure this is a great justification. If the BIOS is lying to the
> > > > > > OS, we _should_ oops.
> > > > > >
> > > > > > How else can this happen other than silly kernel bugs. It's OK to oops
> > > > > > in the face of silly kernel bugs.
> > > > > TDX KVM + reboot can hit #UD. On reboot, VMX is disabled (VMXOFF) via
> > > > > syscore.shutdown callback. However, guest TD can be still running to issue
> > > > > SEAMCALL resulting in #UD.
> > > > >
> > > > > Or we can postpone the change and make the TDX KVM patch series carry a patch
> > > > > for it.
> > > > How does the existing KVM use of VMLAUNCH/VMRESUME avoid that problem?
> > > extable. From arch/x86/kvm/vmx/vmenter.S
> > >
> > > .Lvmresume:
> > > vmresume
> > > jmp .Lvmfail
> > >
> > > .Lvmlaunch:
> > > vmlaunch
> > > jmp .Lvmfail
> > >
> > > _ASM_EXTABLE(.Lvmresume, .Lfixup)
> > > _ASM_EXTABLE(.Lvmlaunch, .Lfixup)
> > More specifically, KVM eats faults on VMX and SVM instructions that occur after
> > KVM forcefully disables VMX/SVM.
>
> <grumble> That's a *TOTALLY* different argument than the patch makes.
>
> KVM is being a _bit_ nutty here, but I do respect it trying to honor the
> "-f". I have no objections to the SEAMCALL code being nutty in the same
> way.
>
> Why do I get the feeling that code is being written without
> understanding _why_, despite this being v11?
Hi Dave,
As I replied in another email, the main reason is to return an error code
instead of Oops when tdx_enable() is called mistakenly when CPU isn't in VMX
operation. Also in this version, the machine check handler can call SEAMCALL
legally when CPU isn't in VMX operation.
I once mentioned that, alternatively, we could check CR4.VMXE to see whether the
CPU is in VMX operation, but it looks like you preferred to use EXTABLE. From the
hardware's point of view, checking CR4.VMXE isn't enough, although currently
setting it and doing VMXON are always done together with IRQs disabled.
https://lore.kernel.org/lkml/cover.1655894131.git.kai.huang@intel.com/T/#m6e5673a191254bf36f48083cd215f7ff8f2b315b
How about I add below to the changelog?
"
The current TDX_MODULE_CALL macro handles neither #GP nor #UD. The kernel would
hit Oops if SEAMCALL were mistakenly made when TDX isn't enabled by the BIOS or
when the CPU isn't in VMX operation. For the former, the callers could check
platform_tdx_enabled() first, although that doesn't rule out a buggy BIOS, in
which case the kernel could still get Oops. For the latter, the caller could
check CR4.VMXE based on the fact that currently setting this bit and doing VMXON
are done together with IRQs disabled, although from the hardware's perspective
checking CR4.VMXE isn't enough.
However, this could be problematic if SEAMCALL is called in contexts such as an
exception handler, NMI handler, etc., as disabling IRQs doesn't prevent any of
them from happening.
To have a clean solution, just make the SEAMCALL always return an error code by
using EXTABLE so the SEAMCALL can be safely called in any context. A later
patch will need to use SEAMCALL in the machine check handler. There might be
such use cases in the future too.
"
^ permalink raw reply [flat|nested] 144+ messages in thread
* Re: [PATCH v11 07/20] x86/virt/tdx: Add skeleton to enable TDX on demand
2023-06-07 15:22 ` [PATCH v11 07/20] x86/virt/tdx: Add skeleton to enable TDX on demand Dave Hansen
@ 2023-06-08 2:10 ` Huang, Kai
2023-06-08 13:43 ` Dave Hansen
0 siblings, 1 reply; 144+ messages in thread
From: Huang, Kai @ 2023-06-08 2:10 UTC (permalink / raw)
To: kvm, Hansen, Dave, linux-kernel
Cc: Luck, Tony, david, bagasdotme, ak, Wysocki, Rafael J,
kirill.shutemov, Chatre, Reinette, Christopherson,,
Sean, pbonzini, tglx, Yamahata, Isaku, linux-mm, peterz, Shahar,
Sagi, imammedo, Gao, Chao, Brown, Len,
sathyanarayanan.kuppuswamy, Huang, Ying, Williams, Dan J
On Wed, 2023-06-07 at 08:22 -0700, Dave Hansen wrote:
> On 6/4/23 07:27, Kai Huang wrote:
> ...
> > +static int try_init_module_global(void)
> > +{
> > + unsigned long flags;
> > + int ret;
> > +
> > + /*
> > + * The TDX module global initialization only needs to be done
> > + * once on any cpu.
> > + */
> > + raw_spin_lock_irqsave(&tdx_global_init_lock, flags);
>
> Why is this "raw_"?
>
> There's zero mention of it anywhere.
Isaku pointed out that the normal spinlock_t is converted to a sleeping lock on
PREEMPT_RT kernels. KVM calls this with IRQs disabled, and thus requires a
non-sleeping lock.
How about adding below comment here?
/*
* Normal spinlock_t is converted to sleeping lock in PREEMPT_RT
* kernel. Use raw_spinlock_t instead so this function can be called
* even when IRQ is disabled in any kernel configuration.
*/
>
> > + if (tdx_global_init_status & TDX_GLOBAL_INIT_DONE) {
> > + ret = tdx_global_init_status & TDX_GLOBAL_INIT_FAILED ?
> > + -EINVAL : 0;
> > + goto out;
> > + }
> > +
> > + /* All '0's are just unused parameters. */
> > + ret = seamcall(TDH_SYS_INIT, 0, 0, 0, 0, NULL, NULL);
> > +
> > + tdx_global_init_status = TDX_GLOBAL_INIT_DONE;
> > + if (ret)
> > + tdx_global_init_status |= TDX_GLOBAL_INIT_FAILED;
> > +out:
> > + raw_spin_unlock_irqrestore(&tdx_global_init_lock, flags);
> > +
> > + return ret;
> > +}
> > +
> > +/**
> > + * tdx_cpu_enable - Enable TDX on local cpu
> > + *
> > + * Do one-time TDX module per-cpu initialization SEAMCALL (and TDX module
> > + * global initialization SEAMCALL if not done) on local cpu to make this
> > + * cpu be ready to run any other SEAMCALLs.
> > + *
> > + * Note this function must be called when preemption is not possible
> > + * (i.e. via SMP call or in per-cpu thread). It is not IRQ safe either
> > + * (i.e. cannot be called in per-cpu thread and via SMP call from remote
> > + * cpu simultaneously).
>
> lockdep_assert_*() are your friends. Unlike comments, they will
> actually tell you if this goes wrong.
Yeah. Will do. Thanks for reminding.
>
> > +int tdx_cpu_enable(void)
> > +{
> > + unsigned int lp_status;
> > + int ret;
> > +
> > + if (!platform_tdx_enabled())
> > + return -EINVAL;
> > +
> > + lp_status = __this_cpu_read(tdx_lp_init_status);
> > +
> > + /* Already done */
> > + if (lp_status & TDX_LP_INIT_DONE)
> > + return lp_status & TDX_LP_INIT_FAILED ? -EINVAL : 0;
> > +
> > + /*
> > + * The TDX module global initialization is the very first step
> > + * to enable TDX. Need to do it first (if hasn't been done)
> > + * before doing the per-cpu initialization.
> > + */
> > + ret = try_init_module_global();
> > +
> > + /*
> > + * If the module global initialization failed, there's no point
> > + * to do the per-cpu initialization. Just mark it as done but
> > + * failed.
> > + */
> > + if (ret)
> > + goto update_status;
> > +
> > + /* All '0's are just unused parameters */
> > + ret = seamcall(TDH_SYS_LP_INIT, 0, 0, 0, 0, NULL, NULL);
> > +
> > +update_status:
> > + lp_status = TDX_LP_INIT_DONE;
> > + if (ret)
> > + lp_status |= TDX_LP_INIT_FAILED;
> > +
> > + this_cpu_write(tdx_lp_init_status, lp_status);
> > +
> > + return ret;
> > +}
> > +EXPORT_SYMBOL_GPL(tdx_cpu_enable);
>
> You danced around it in the changelog, but the reason for the exports is
> not clear.
I'll add one sentence to the changelog to explain:
Export both tdx_cpu_enable() and tdx_enable() as KVM will be the kernel
component to use TDX.
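For illustration, the DONE/FAILED bookkeeping used by the quoted
try_init_module_global() and tdx_cpu_enable() can be sketched in userspace C
roughly as follows. The bit values, the init_fn type and the init_ok()/init_bad()
test doubles are assumptions for this example, and locking is deliberately
left out (the patch serializes with a raw spinlock).

```c
#include <assert.h>
#include <errno.h>

/* Bit values are assumptions for illustration. */
#define INIT_DONE	(1u << 0)
#define INIT_FAILED	(1u << 1)

/* Hypothetical one-shot initializer: returns 0 on success. */
typedef int (*init_fn)(void);

/*
 * Run init at most once and cache the outcome in *status. Later calls
 * replay the cached result without invoking init again. A real
 * implementation must serialize access to *status; that is omitted here.
 */
int try_init_once(unsigned int *status, init_fn init)
{
	int ret;

	assert(status && init);

	if (*status & INIT_DONE)
		return (*status & INIT_FAILED) ? -EINVAL : 0;

	ret = init();

	*status = INIT_DONE;
	if (ret)
		*status |= INIT_FAILED;

	return ret;
}

/* Test doubles that count how often they are invoked. */
int ok_calls, bad_calls;
int init_ok(void)  { ok_calls++;  return 0; }
int init_bad(void) { bad_calls++; return -EIO; }
```

As in the quoted code, the first failing call returns the initializer's own
error, while later calls report -EINVAL from the cached status.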
>
> > +static int init_tdx_module(void)
> > +{
> > + /*
> > + * TODO:
> > + *
> > + * - Get TDX module information and TDX-capable memory regions.
> > + * - Build the list of TDX-usable memory regions.
> > + * - Construct a list of "TD Memory Regions" (TDMRs) to cover
> > + * all TDX-usable memory regions.
> > + * - Configure the TDMRs and the global KeyID to the TDX module.
> > + * - Configure the global KeyID on all packages.
> > + * - Initialize all TDMRs.
> > + *
> > + * Return error before all steps are done.
> > + */
> > + return -EINVAL;
> > +}
> > +
> > +static int __tdx_enable(void)
> > +{
> > + int ret;
> > +
> > + ret = init_tdx_module();
> > + if (ret) {
> > + pr_err("TDX module initialization failed (%d)\n", ret);
>
> > Have you actually gone and looked at how these pr_*()'s look?
>
> Won't they say:
>
> tdx: TDX module initialized
>
> Isn't that a _bit_ silly? Why not just say:
>
> pr_info("module initialized.\n");
I did. However, I might have bad taste :)
Will change (and change the other pr_*() calls if they have a similar problem).
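To make the prefix point concrete, here is a small userspace sketch. It assumes
the file defines pr_fmt() to prepend "tdx: ", which is what the output quoted
above implies; pr_info_buf() and line_is() are hypothetical stand-ins that
format into a buffer rather than the kernel log.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/* Kernel-style convention: each message gets the subsystem prefix. */
#define pr_fmt(fmt) "tdx: " fmt

/*
 * Userspace stand-in for pr_info(): formats into 'buf' instead of the
 * kernel log. Relies on the GNU ', ##__VA_ARGS__' extension (GCC and
 * Clang), the same one the kernel's printk wrappers use.
 */
#define pr_info_buf(buf, size, fmt, ...) \
	snprintf(buf, size, pr_fmt(fmt), ##__VA_ARGS__)

/* Returns 1 if the formatted line equals 'expect'. */
int line_is(const char *line, const char *expect)
{
	assert(line && expect);
	return strcmp(line, expect) == 0;
}
```

Formatting "TDX module initialized\n" through pr_info_buf() yields the doubled
"tdx: TDX module initialized", while dropping "TDX" from the message leaves the
prefix to carry the subsystem name on its own.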
^ permalink raw reply [flat|nested] 144+ messages in thread
* Re: [PATCH v11 08/20] x86/virt/tdx: Get information about TDX module and TDX-capable memory
2023-06-08 0:27 ` kirill.shutemov
@ 2023-06-08 2:40 ` Huang, Kai
2023-06-08 11:41 ` kirill.shutemov
0 siblings, 1 reply; 144+ messages in thread
From: Huang, Kai @ 2023-06-08 2:40 UTC (permalink / raw)
To: kirill.shutemov
Cc: kvm, Hansen, Dave, david, bagasdotme, ak, Wysocki, Rafael J,
linux-kernel, Chatre, Reinette, Christopherson,,
Sean, pbonzini, tglx, linux-mm, Yamahata, Isaku, Luck, Tony,
peterz, Shahar, Sagi, imammedo, Gao, Chao, Brown, Len,
sathyanarayanan.kuppuswamy, Huang, Ying, Williams, Dan J
On Thu, 2023-06-08 at 03:27 +0300, kirill.shutemov@linux.intel.com wrote:
> On Mon, Jun 05, 2023 at 02:27:21AM +1200, Kai Huang wrote:
> > For now both 'tdsysinfo_struct' and CMRs are only used during the module
> > initialization. But because they are both relatively big, declare them
> > inside the module initialization function but as static variables.
>
> This justification does not make sense to me. static variables will not be
> freed after function returned. They will still consume memory.
>
> I think you need to allocate/free memory dynamically, if they are too big
> for stack.
I do need to keep tdsysinfo_struct as it will be used by KVM too. CMRs are not
used by KVM now but they might get used in the future, e.g., we may want to
expose them to /sys in the future.
Also it takes more lines of code to do dynamic allocation. I'd prefer the code
simplicity. Dave is fine with static too, but prefers putting them inside
the function:
https://lore.kernel.org/lkml/cover.1670566861.git.kai.huang@intel.com/T/#mbfdaa353278588da09e43f3ce37b7bf8ddedc1b2
I can update the changelog to reflect above:
For now both 'tdsysinfo_struct' and CMRs are only used during the module
initialization. KVM will need to at least use 'tdsysinfo_struct' when
supporting TDX guests. For now just declare them inside the module
initialization function but as static variables.
?
^ permalink raw reply [flat|nested] 144+ messages in thread
* Re: [PATCH v11 10/20] x86/virt/tdx: Add placeholder to construct TDMRs to cover all TDX memory regions
2023-06-07 15:57 ` Dave Hansen
@ 2023-06-08 10:18 ` Huang, Kai
0 siblings, 0 replies; 144+ messages in thread
From: Huang, Kai @ 2023-06-08 10:18 UTC (permalink / raw)
To: kvm, Hansen, Dave, linux-kernel
Cc: Luck, Tony, david, bagasdotme, ak, Wysocki, Rafael J,
kirill.shutemov, Chatre, Reinette, Christopherson,,
Sean, pbonzini, tglx, Yamahata, Isaku, linux-mm, peterz, Shahar,
Sagi, imammedo, Gao, Chao, Brown, Len,
sathyanarayanan.kuppuswamy, Huang, Ying, Williams, Dan J
On Wed, 2023-06-07 at 08:57 -0700, Dave Hansen wrote:
> On 6/4/23 07:27, Kai Huang wrote:
> > +struct tdmr_info_list {
> > + void *tdmrs; /* Flexible array to hold 'tdmr_info's */
> > + int nr_consumed_tdmrs; /* How many 'tdmr_info's are in use */
>
> I'm looking back here after seeing the weird cast in the next patch.
>
> Why is this a void* instead of a _real_ type?
I followed your suggestion in v8:
https://lore.kernel.org/linux-mm/725de6e9-e468-48ef-3bae-1e8a1b7ef0f7@intel.com/
I quoted the relevant part here:
> +/* Get the TDMR from the list at the given index. */
> +static struct tdmr_info *tdmr_entry(struct tdmr_info_list *tdmr_list,
> + int idx)
> +{
> + return (struct tdmr_info *)((unsigned long)tdmr_list->first_tdmr +
> + tdmr_list->tdmr_sz * idx);
> +}
I think that's more complicated and has more casting than necessary.
This looks nicer:
int tdmr_info_offset = tdmr_list->tdmr_sz * idx;
return (void *)tdmr_list->first_tdmr + tdmr_info_offset;
Also, it might even be worth keeping ->first_tdmr as a void*. It isn't
a real C array and keeping it as void* would keep anyone from doing:
tdmr_foo = tdmr_list->first_tdmr[foo];
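A short userspace sketch of the idea quoted above: when the entry size is only
known at runtime, keep the base pointer as void * and index it with byte
arithmetic. The struct entry_list type and its helpers are hypothetical
stand-ins for 'struct tdmr_info_list' and tdmr_entry(). ISO C does not allow
arithmetic on void *, so this version casts to char *; the kernel can add to a
void * directly because it builds with GNU extensions.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

/* Hypothetical stand-in for 'struct tdmr_info_list'. */
struct entry_list {
	void *entries;		/* not a real C array: entry size is a runtime value */
	size_t entry_sz;	/* size of one entry in bytes */
	int max_entries;	/* how many entries were allocated */
};

/* Allocate a zeroed list; returns 0 on success, -1 on failure. */
int entry_list_alloc(struct entry_list *list, size_t entry_sz, int max_entries)
{
	assert(list && entry_sz > 0 && max_entries > 0);
	list->entries = calloc((size_t)max_entries, entry_sz);
	if (!list->entries)
		return -1;
	list->entry_sz = entry_sz;
	list->max_entries = max_entries;
	return 0;
}

/* Return entry idx using byte arithmetic rather than array indexing. */
void *entry_at(const struct entry_list *list, int idx)
{
	assert(idx >= 0 && idx < list->max_entries);
	return (char *)list->entries + list->entry_sz * (size_t)idx;
}

void entry_list_free(struct entry_list *list)
{
	free(list->entries);
	list->entries = NULL;
}
```

Because 'entries' is a void *, an expression like list->entries[3] does not
compile, which is exactly the misuse the suggestion was meant to prevent.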
^ permalink raw reply [flat|nested] 144+ messages in thread
* Re: [PATCH v11 11/20] x86/virt/tdx: Fill out TDMRs to cover all TDX memory regions
2023-06-07 16:05 ` [PATCH v11 11/20] x86/virt/tdx: Fill out TDMRs to cover all TDX memory regions Dave Hansen
@ 2023-06-08 10:48 ` Huang, Kai
2023-06-08 13:11 ` Dave Hansen
0 siblings, 1 reply; 144+ messages in thread
From: Huang, Kai @ 2023-06-08 10:48 UTC (permalink / raw)
To: kvm, Hansen, Dave, linux-kernel
Cc: Luck, Tony, david, bagasdotme, ak, Wysocki, Rafael J,
kirill.shutemov, Chatre, Reinette, Christopherson,,
Sean, pbonzini, tglx, Yamahata, Isaku, linux-mm, peterz, Shahar,
Sagi, imammedo, Gao, Chao, Brown, Len,
sathyanarayanan.kuppuswamy, Huang, Ying, Williams, Dan J
On Wed, 2023-06-07 at 09:05 -0700, Dave Hansen wrote:
> On 6/4/23 07:27, Kai Huang wrote:
> > + /*
> > + * Loop over TDX memory regions and fill out TDMRs to cover them.
> > + * To keep it simple, always try to use one TDMR to cover one
> > + * memory region.
> > + *
> > + * In practice TDX1.0 supports 64 TDMRs, which is big enough to
> > + * cover all memory regions in reality if the admin doesn't use
> > + * 'memmap' to create a bunch of discrete memory regions. When
> > + * there's a real problem, enhancement can be done to merge TDMRs
> > + * to reduce the final number of TDMRs.
> > + */
>
> Rather than focus in on one specific command-line parameter, let's just say:
>
> In practice TDX supports at least 64 TDMRs. A 2-socket system
> typically only consumes <NUMBER> of those. This code is dumb
> and simple and may use more TDMRs than is strictly required.
Thanks, will do. I'll take a look at a machine to get the <NUMBER>.
>
> Let's also put a pr_warn() in here if we exceed, say 1/2 or maybe 3/4 of
> the 64. We'll hopefully start to get reports somewhat in advance if
> systems get close to the limit.
May I ask why this is useful? The TDX module can only be initialized once, so,
setting aside the module runtime update case, the kernel can only ever see one
of two results:
1) Initialization succeeds: consumed TDMRs don't exceed the maximum TDMRs
2) Initialization fails: consumed TDMRs exceed the maximum TDMRs
What's the value of a pr_warn() to the user when consumed TDMRs exceed some
threshold?
Anyway, if you want it, how does below code look?
static int fill_out_tdmrs(struct list_head *tmb_list,
struct tdmr_info_list *tdmr_list)
{
+ int consumed_tdmrs_threshold, tdmr_idx = 0;
struct tdx_memblock *tmb;
- int tdmr_idx = 0;
/*
* Loop over TDX memory regions and fill out TDMRs to cover them.
* To keep it simple, always try to use one TDMR to cover one
* memory region.
*
- * In practice TDX1.0 supports 64 TDMRs, which is big enough to
- * cover all memory regions in reality if the admin doesn't use
- * 'memmap' to create a bunch of discrete memory regions. When
- * there's a real problem, enhancement can be done to merge TDMRs
- * to reduce the final number of TDMRs.
+ * In practice TDX supports at least 64 TDMRs. A 2-socket system
+ * typically only consumes <NUMBER> of those. This code is dumb
+ * and simple and may use more TDMRs than is strictly required.
+ *
+ * Also set a threshold of consumed TDMRs, and pr_warn() to warn
+ * the user the system is getting close to the limit of supported
+ * number of TDMRs if the number of consumed TDMRs exceeds the
+ * threshold.
*/
+ consumed_tdmrs_threshold = tdmr_list->max_tdmrs * 3 / 4;
list_for_each_entry(tmb, tmb_list, list) {
struct tdmr_info *tdmr = tdmr_entry(tdmr_list, tdmr_idx);
u64 start, end;
@@ -463,6 +467,10 @@ static int fill_out_tdmrs(struct list_head *tmb_list,
return -ENOSPC;
}
+ if (tdmr_idx == consumed_tdmrs_threshold)
+ pr_warn("consumed TDMRs reaching limit: %d used (out of %d)\n",
+ tdmr_idx, tdmr_list->max_tdmrs);
+
tdmr = tdmr_entry(tdmr_list, tdmr_idx);
}
^ permalink raw reply [flat|nested] 144+ messages in thread
* Re: [PATCH v11 08/20] x86/virt/tdx: Get information about TDX module and TDX-capable memory
2023-06-08 2:40 ` Huang, Kai
@ 2023-06-08 11:41 ` kirill.shutemov
2023-06-08 13:13 ` Dave Hansen
2023-06-08 23:29 ` Isaku Yamahata
0 siblings, 2 replies; 144+ messages in thread
From: kirill.shutemov @ 2023-06-08 11:41 UTC (permalink / raw)
To: Huang, Kai
Cc: kvm, Hansen, Dave, david, bagasdotme, ak, Wysocki, Rafael J,
linux-kernel, Chatre, Reinette, Christopherson,,
Sean, pbonzini, tglx, linux-mm, Yamahata, Isaku, Luck, Tony,
peterz, Shahar, Sagi, imammedo, Gao, Chao, Brown, Len,
sathyanarayanan.kuppuswamy, Huang, Ying, Williams, Dan J
On Thu, Jun 08, 2023 at 02:40:27AM +0000, Huang, Kai wrote:
> On Thu, 2023-06-08 at 03:27 +0300, kirill.shutemov@linux.intel.com wrote:
> > On Mon, Jun 05, 2023 at 02:27:21AM +1200, Kai Huang wrote:
> > > For now both 'tdsysinfo_struct' and CMRs are only used during the module
> > > initialization. But because they are both relatively big, declare them
> > > inside the module initialization function but as static variables.
> >
> > This justification does not make sense to me. static variables will not be
> > freed after function returned. They will still consume memory.
> >
> > I think you need to allocate/free memory dynamically, if they are too big
> > for stack.
>
>
> I do need to keep tdsysinfo_struct as it will be used by KVM too.
Will you pass it down to KVM from this function? Will KVM use the struct
after the function returns?
> CMRs are not
> used by KVM now but they might get used in the future, e.g., we may want to
> expose them to /sys in the future.
>
> Also it takes more lines of code to do dynamic allocation. I'd prefer the code
> simplicity.
These structures take 1.5K of memory and the memory will be allocated on
all machines that boot a kernel with TDX enabled, regardless of whether the
machine has TDX or not. It seems very wasteful to me.
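As a userspace sketch of the allocate-and-free alternative suggested earlier in
this thread: the sizes and alignments below are assumptions chosen to add up to
the 1.5K mentioned above, and fill_fn, with_temp_buffers() and
check_alignment() are hypothetical names. Kernel code would use its own
allocator rather than aligned_alloc().

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

/* Sizes and alignments assumed for illustration (1024 + 512 = 1.5K). */
#define SYSINFO_SIZE	1024
#define SYSINFO_ALIGN	1024
#define CMR_ARRAY_SIZE	512
#define CMR_ARRAY_ALIGN	512

/* Hypothetical consumer of both buffers; returns 0 on success. */
typedef int (*fill_fn)(void *sysinfo, void *cmrs);

/*
 * Allocate both buffers with their required alignment, hand them to
 * 'fill', and free them before returning, so nothing stays resident
 * once initialization is over. Returns -1 if allocation fails.
 */
int with_temp_buffers(fill_fn fill)
{
	void *sysinfo = aligned_alloc(SYSINFO_ALIGN, SYSINFO_SIZE);
	void *cmrs = aligned_alloc(CMR_ARRAY_ALIGN, CMR_ARRAY_SIZE);
	int ret = -1;

	assert(fill);
	if (sysinfo && cmrs) {
		memset(sysinfo, 0, SYSINFO_SIZE);
		memset(cmrs, 0, CMR_ARRAY_SIZE);
		ret = fill(sysinfo, cmrs);
	}

	free(cmrs);
	free(sysinfo);
	return ret;
}

/* Test double: succeeds only if both buffers are aligned as requested. */
int check_alignment(void *sysinfo, void *cmrs)
{
	return ((uintptr_t)sysinfo % SYSINFO_ALIGN == 0 &&
		(uintptr_t)cmrs % CMR_ARRAY_ALIGN == 0) ? 0 : 1;
}
```

The trade-off is the one raised in the thread: a static buffer costs no code
but stays allocated forever, while this shape costs a few lines and an error
path but returns the memory.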
--
Kiryl Shutsemau / Kirill A. Shutemov
^ permalink raw reply [flat|nested] 144+ messages in thread
* Re: [PATCH v11 11/20] x86/virt/tdx: Fill out TDMRs to cover all TDX memory regions
2023-06-08 10:48 ` Huang, Kai
@ 2023-06-08 13:11 ` Dave Hansen
2023-06-12 2:33 ` Huang, Kai
0 siblings, 1 reply; 144+ messages in thread
From: Dave Hansen @ 2023-06-08 13:11 UTC (permalink / raw)
To: Huang, Kai, kvm, linux-kernel
Cc: Luck, Tony, david, bagasdotme, ak, Wysocki, Rafael J,
kirill.shutemov, Chatre, Reinette, Christopherson,,
Sean, pbonzini, tglx, Yamahata, Isaku, linux-mm, peterz, Shahar,
Sagi, imammedo, Gao, Chao, Brown, Len,
sathyanarayanan.kuppuswamy, Huang, Ying, Williams, Dan J
On 6/8/23 03:48, Huang, Kai wrote:
>> Let's also put a pr_warn() in here if we exceed, say 1/2 or maybe 3/4 of
>> the 64. We'll hopefully start to get reports somewhat in advance if
>> systems get close to the limit.
> May I ask why this is useful? TDX module can only be initialized once, so if
> not considering module runtime update case, the kernel can only get two results
> for once:
>
> 1) Succeed to initialize: consumed TDMRs doesn't exceed maximum TDMRs
> 2) Fail to initialize: consumed TDMRs exceeds maximum TDMRs
>
> What's the value of pr_warn() user when consumed TDMRs exceeds some threshold?
Today, we're saying, "64 TDMRs ought to be enough for anybody!"
I'd actually kinda like to know if anybody starts building platforms
that get anywhere near using 64. That way, we won't get a bug report
that TDX is broken and we'll have a fire drill. We'll get a bug report
that TDX is complaining and we'll have some time to go fix it without
anyone actually being broken.
Maybe not even a pr_warn(), but something that's a bit ominous and has a
chance of getting users to act.
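A minimal sketch of the early-warning check suggested above, written as plain userspace C. The helper name and the 3/4 threshold are illustrative assumptions, not part of the patch; in the kernel the result would presumably gate a pr_warn() or similar message.

```c
#include <assert.h>
#include <stdbool.h>

/*
 * Hypothetical helper (not from the patch): report whether the number
 * of consumed TDMRs has passed 3/4 of the supported maximum.
 */
bool tdmr_usage_is_high(int nr_consumed, int max_tdmrs)
{
	assert(nr_consumed >= 0 && max_tdmrs > 0);

	/* Multiply instead of dividing so small limits are not rounded down. */
	return 4L * nr_consumed > 3L * max_tdmrs;
}
```

With a limit of 64, this stays quiet at 48 consumed TDMRs and trips at 49, so a report would arrive while initialization still succeeds.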
* Re: [PATCH v11 08/20] x86/virt/tdx: Get information about TDX module and TDX-capable memory
2023-06-08 11:41 ` kirill.shutemov
@ 2023-06-08 13:13 ` Dave Hansen
2023-06-12 2:00 ` Huang, Kai
2023-06-08 23:29 ` Isaku Yamahata
1 sibling, 1 reply; 144+ messages in thread
From: Dave Hansen @ 2023-06-08 13:13 UTC (permalink / raw)
To: kirill.shutemov, Huang, Kai
Cc: kvm, david, bagasdotme, ak, Wysocki, Rafael J, linux-kernel,
Chatre, Reinette, Christopherson,,
Sean, pbonzini, tglx, linux-mm, Yamahata, Isaku, Luck, Tony,
peterz, Shahar, Sagi, imammedo, Gao, Chao, Brown, Len,
sathyanarayanan.kuppuswamy, Huang, Ying, Williams, Dan J
On 6/8/23 04:41, kirill.shutemov@linux.intel.com wrote:
> These structures take 1.5K of memory and the memory will be allocated for
> all machines that boots the kernel with TDX enabled, regardless if the
> machine has TDX or not. It seems very wasteful to me.
Actually, those variables are in .bss. They're allocated forever for
anyone that runs a kernel that has TDX support.
* Re: [PATCH v11 07/20] x86/virt/tdx: Add skeleton to enable TDX on demand
2023-06-08 2:10 ` Huang, Kai
@ 2023-06-08 13:43 ` Dave Hansen
2023-06-12 11:21 ` Huang, Kai
0 siblings, 1 reply; 144+ messages in thread
From: Dave Hansen @ 2023-06-08 13:43 UTC (permalink / raw)
To: Huang, Kai, kvm, linux-kernel
Cc: Luck, Tony, david, bagasdotme, ak, Wysocki, Rafael J,
kirill.shutemov, Chatre, Reinette, Christopherson,,
Sean, pbonzini, tglx, Yamahata, Isaku, linux-mm, peterz, Shahar,
Sagi, imammedo, Gao, Chao, Brown, Len,
sathyanarayanan.kuppuswamy, Huang, Ying, Williams, Dan J
On 6/7/23 19:10, Huang, Kai wrote:
> On Wed, 2023-06-07 at 08:22 -0700, Dave Hansen wrote:
>> On 6/4/23 07:27, Kai Huang wrote:
>> ...
>>> +static int try_init_module_global(void)
>>> +{
>>> + unsigned long flags;
>>> + int ret;
>>> +
>>> + /*
>>> + * The TDX module global initialization only needs to be done
>>> + * once on any cpu.
>>> + */
>>> + raw_spin_lock_irqsave(&tdx_global_init_lock, flags);
>>
>> Why is this "raw_"?
>>
>> There's zero mention of it anywhere.
>
> Isaku pointed out the normal spinlock_t is converted to sleeping lock for
> PREEMPT_RT kernel. KVM calls this with IRQ disabled, thus requires a non-
> sleeping lock.
>
> How about adding below comment here?
>
> /*
> * Normal spinlock_t is converted to sleeping lock in PREEMPT_RT
> * kernel. Use raw_spinlock_t instead so this function can be called
> * even when IRQ is disabled in any kernel configuration.
> */
Go look at *EVERY* *OTHER* raw_spinlock_t in the kernel. Do any of them
say this?
Comment the function, say that it's always called with interrupts and
preempt disabled. Leave it at that. *Maybe* add on that it needs raw
spinlocks because of it. But don't (try to) explain the background of
the lock type.
>>> +int tdx_cpu_enable(void)
>>> +{
>>> + unsigned int lp_status;
>>> + int ret;
>>> +
>>> + if (!platform_tdx_enabled())
>>> + return -EINVAL;
>>> +
>>> + lp_status = __this_cpu_read(tdx_lp_init_status);
>>> +
>>> + /* Already done */
>>> + if (lp_status & TDX_LP_INIT_DONE)
>>> + return lp_status & TDX_LP_INIT_FAILED ? -EINVAL : 0;
>>> +
>>> + /*
>>> + * The TDX module global initialization is the very first step
>>> + * to enable TDX. Need to do it first (if hasn't been done)
>>> + * before doing the per-cpu initialization.
>>> + */
>>> + ret = try_init_module_global();
>>> +
>>> + /*
>>> + * If the module global initialization failed, there's no point
>>> + * to do the per-cpu initialization. Just mark it as done but
>>> + * failed.
>>> + */
>>> + if (ret)
>>> + goto update_status;
>>> +
>>> + /* All '0's are just unused parameters */
>>> + ret = seamcall(TDH_SYS_LP_INIT, 0, 0, 0, 0, NULL, NULL);
>>> +
>>> +update_status:
>>> + lp_status = TDX_LP_INIT_DONE;
>>> + if (ret)
>>> + lp_status |= TDX_LP_INIT_FAILED;
>>> +
>>> + this_cpu_write(tdx_lp_init_status, lp_status);
>>> +
>>> + return ret;
>>> +}
>>> +EXPORT_SYMBOL_GPL(tdx_cpu_enable);
>>
>> You danced around it in the changelog, but the reason for the exports is
>> not clear.
>
> I'll add one sentence to the changelog to explain:
>
> Export both tdx_cpu_enable() and tdx_enable() as KVM will be the kernel
> component to use TDX.
Intel doesn't pay me by the word. Do you get paid that way? If not,
please just say:
Export both tdx_cpu_enable() and tdx_enable() for KVM use.
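The "done" and "failed" status bits in the quoted tdx_cpu_enable() can be modeled in userspace as a rough sketch. Everything below is a stand-in: the *_model names are hypothetical, a plain global replaces the per-cpu variable, and a stub replaces the SEAMCALL. The point is only that the first attempt is cached and later calls replay its outcome without retrying.

```c
#include <errno.h>

#define LP_INIT_DONE	(1u << 0)	/* initialization was attempted */
#define LP_INIT_FAILED	(1u << 1)	/* ... and the attempt failed */

/* Stand-in for the per-cpu status word (assumption: one CPU only). */
unsigned int lp_status_model;
/* Counts how often the expensive step actually ran. */
int lp_init_calls_model;

/* Hypothetical stub for the real initialization call. */
int do_lp_init_model(int simulated_result)
{
	lp_init_calls_model++;
	return simulated_result;
}

int cpu_enable_model(int simulated_result)
{
	int ret;

	/* Already attempted: replay the cached outcome, do not retry. */
	if (lp_status_model & LP_INIT_DONE)
		return (lp_status_model & LP_INIT_FAILED) ? -EINVAL : 0;

	ret = do_lp_init_model(simulated_result);

	/* Record that the attempt happened, and whether it failed. */
	lp_status_model = LP_INIT_DONE;
	if (ret)
		lp_status_model |= LP_INIT_FAILED;

	return ret;
}
```

Note that a failed first attempt returns the original error code, while every later call returns -EINVAL, which matches the quoted code's behavior.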
* Re: [PATCH v11 05/20] x86/virt/tdx: Add SEAMCALL infrastructure
2023-06-08 0:51 ` Huang, Kai
@ 2023-06-08 13:50 ` Dave Hansen
0 siblings, 0 replies; 144+ messages in thread
From: Dave Hansen @ 2023-06-08 13:50 UTC (permalink / raw)
To: Huang, Kai, Christopherson,, Sean, isaku.yamahata
Cc: kvm, Luck, Tony, david, bagasdotme, ak, Wysocki, Rafael J,
kirill.shutemov, Chatre, Reinette, pbonzini, linux-mm, tglx,
linux-kernel, Yamahata, Isaku, peterz, Shahar, Sagi, imammedo,
Gao, Chao, Brown, Len, sathyanarayanan.kuppuswamy, Huang, Ying,
Williams, Dan J
On 6/7/23 17:51, Huang, Kai wrote:
> How about I add below to the changelog?
>
> "
> The current TDX_MODULE_CALL macro handles neither #GP nor #UD. The kernel would
> hit Oops if SEAMCALL were mistakenly made when TDX is enabled by the BIOS or
> when CPU isn't in VMX operation. For the former, the callers could check
> platform_tdx_enabled() first, although that doesn't rule out the buggy BIOS in
> which case the kernel could still get Oops. For the latter, the caller could
> check CR4.VMXE based on the fact that currently setting this bit and doing VMXON
> are done together when IRQ is disabled, although from hardware's perspective
> checking CR4.VMXE isn't enough.
>
> However this could be problematic if SEAMCALL is called in the cases such as
> exception handler, NMI handler, etc, as disabling IRQ doesn't prevent any of
> them from happening.
>
> To have a clean solution, just make the SEAMCALL always return error code by
> using EXTTABLE so the SEAMCALL can be safely called in any context. A later
> patch will need to use SEAMCALL in the machine check handler. There might be
> such use cases in the future too.
> "
No, that's just word salad.
SEAMCALL is like VMRESUME. It will be called by KVM in unsafe (VMX
off) contexts in normal operation like "reboot -f". That means it needs
an exception handler for #UD(???).
I don't care if a bad BIOS can cause #GP. Bad BIOS == oops. You can
argue that even if I don't care, it's worth having a nice error message
and a common place for SEAMCALL error handling. But it's not
functionally needed.
* Re: [PATCH v11 05/20] x86/virt/tdx: Add SEAMCALL infrastructure
2023-06-07 22:56 ` Huang, Kai
@ 2023-06-08 14:05 ` Dave Hansen
0 siblings, 0 replies; 144+ messages in thread
From: Dave Hansen @ 2023-06-08 14:05 UTC (permalink / raw)
To: Huang, Kai, kvm, linux-kernel
Cc: Luck, Tony, david, bagasdotme, ak, Wysocki, Rafael J,
kirill.shutemov, Chatre, Reinette, Christopherson,,
Sean, pbonzini, tglx, Yamahata, Isaku, linux-mm, peterz, Shahar,
Sagi, imammedo, Gao, Chao, Brown, Len,
sathyanarayanan.kuppuswamy, Huang, Ying, Williams, Dan J
On 6/7/23 15:56, Huang, Kai wrote:
> It's not just for the "BIOS buggy" case. The main purpose is to give an error
> message when the caller mistakenly calls tdx_enable().
It's also OK to oops when there's a kernel bug, aka. caller mistake.
> Also, now the machine check handler improvement patch also calls SEAMCALL to get
> a given page's page type. It's totally legal that a machine check happens when
> the CPU isn't in VMX operation (e.g. KVM isn't loaded), and in fact we use the
> SEAMCALL return value to detect whether CPU is in VMX operation and handles such
> case accordingly.
Listen, I didn't say there wasn't a reason for it. I said that this
patch lacked the justification. So, stop throwing things at the wall,
pick the *REAL* reason, and go rewrite the patch, please.
* RE: [PATCH v11 00/20] TDX host kernel support
[not found] <cover.1685887183.git.kai.huang@intel.com>
` (7 preceding siblings ...)
[not found] ` <9b3582c9f3a81ae68b32d9997fcd20baecb63b9b.1685887183.git.kai.huang@intel.com>
@ 2023-06-08 21:03 ` Dan Williams
2023-06-12 10:56 ` Huang, Kai
[not found] ` <468533166590ff5ed11730350c4af8cdb0b99165.1685887183.git.kai.huang@intel.com>
` (10 subsequent siblings)
19 siblings, 1 reply; 144+ messages in thread
From: Dan Williams @ 2023-06-08 21:03 UTC (permalink / raw)
To: Kai Huang, linux-kernel, kvm
Cc: linux-mm, dave.hansen, kirill.shutemov, tony.luck, peterz, tglx,
seanjc, pbonzini, david, dan.j.williams, rafael.j.wysocki,
ying.huang, reinette.chatre, len.brown, ak, isaku.yamahata,
chao.gao, sathyanarayanan.kuppuswamy, bagasdotme, sagis,
imammedo, kai.huang
Kai Huang wrote:
> Intel Trusted Domain Extensions (TDX) protects guest VMs from malicious
> host and certain physical attacks. TDX specs are available in [1].
>
> This series is the initial support to enable TDX with minimal code to
> allow KVM to create and run TDX guests. KVM support for TDX is being
> developed separately[2]. A new "userspace inaccessible memfd" approach
> to support TDX private memory is also being developed[3]. The KVM will
> only support the new "userspace inaccessible memfd" as TDX guest memory.
This memfd approach is incompatible with one of the primary ways that
new memory topologies like high-bandwidth-memory and CXL are accessed,
via a device-special-file mapping. There is already precedent for mmap()
to only be used for communicating address value and not CPU accessible
memory. See "Userspace P2PDMA with O_DIRECT NVMe devices" [1].
So before this memfd requirement becomes too baked into the design I
want to understand if "userspace inaccessible" is the only requirement
so I can look to add that to the device-special-file interface for
"device" / "Soft Reserved" memory like HBM and CXL.
[1]: https://lore.kernel.org/all/20221021174116.7200-1-logang@deltatee.com/
* Re: [PATCH v11 09/20] x86/virt/tdx: Use all system memory when initializing TDX module as TDX memory
[not found] ` <468533166590ff5ed11730350c4af8cdb0b99165.1685887183.git.kai.huang@intel.com>
2023-06-07 15:48 ` [PATCH v11 09/20] x86/virt/tdx: Use all system memory when initializing TDX module as TDX memory Dave Hansen
@ 2023-06-08 22:40 ` kirill.shutemov
1 sibling, 0 replies; 144+ messages in thread
From: kirill.shutemov @ 2023-06-08 22:40 UTC (permalink / raw)
To: Kai Huang
Cc: linux-kernel, kvm, linux-mm, dave.hansen, tony.luck, peterz,
tglx, seanjc, pbonzini, david, dan.j.williams, rafael.j.wysocki,
ying.huang, reinette.chatre, len.brown, ak, isaku.yamahata,
chao.gao, sathyanarayanan.kuppuswamy, bagasdotme, sagis,
imammedo
On Mon, Jun 05, 2023 at 02:27:22AM +1200, Kai Huang wrote:
> As a step of initializing the TDX module, the kernel needs to tell the
> TDX module which memory regions can be used by the TDX module as TDX
> guest memory.
>
> TDX reports a list of "Convertible Memory Region" (CMR) to tell the
> kernel which memory is TDX compatible. The kernel needs to build a list
> of memory regions (out of CMRs) as "TDX-usable" memory and pass them to
> the TDX module. Once this is done, those "TDX-usable" memory regions
> are fixed during module's lifetime.
>
> To keep things simple, assume that all TDX-protected memory will come
> from the page allocator. Make sure all pages in the page allocator
> *are* TDX-usable memory.
>
> As TDX-usable memory is a fixed configuration, take a snapshot of the
> memory configuration from memblocks at the time of module initialization
> (memblocks are modified on memory hotplug). This snapshot is used to
> enable TDX support for *this* memory configuration only. Use a memory
> hotplug notifier to ensure that no other RAM can be added outside of
> this configuration.
>
> This approach requires all memblock memory regions at the time of module
> initialization to be TDX convertible memory to work, otherwise module
> initialization will fail in a later SEAMCALL when passing those regions
> to the module. This approach works when all boot-time "system RAM" is
> TDX convertible memory, and no non-TDX-convertible memory is hot-added
> to the core-mm before module initialization.
>
> For instance, on the first generation of TDX machines, both CXL memory
> and NVDIMM are not TDX convertible memory. Using kmem driver to hot-add
> any CXL memory or NVDIMM to the core-mm before module initialization
> will result in failure to initialize the module. The SEAMCALL error
> code will be available in the dmesg to help user to understand the
> failure.
>
> Signed-off-by: Kai Huang <kai.huang@intel.com>
> Reviewed-by: "Huang, Ying" <ying.huang@intel.com>
> Reviewed-by: Isaku Yamahata <isaku.yamahata@intel.com>
Reviewed-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
--
Kiryl Shutsemau / Kirill A. Shutemov
* Re: [PATCH v11 10/20] x86/virt/tdx: Add placeholder to construct TDMRs to cover all TDX memory regions
[not found] ` <f9148e67e968d7aed4707b67ea9b1aa761401255.1685887183.git.kai.huang@intel.com>
2023-06-07 15:54 ` [PATCH v11 10/20] x86/virt/tdx: Add placeholder to construct TDMRs to cover all TDX memory regions Dave Hansen
2023-06-07 15:57 ` Dave Hansen
@ 2023-06-08 22:52 ` kirill.shutemov
2023-06-12 2:21 ` Huang, Kai
2 siblings, 1 reply; 144+ messages in thread
From: kirill.shutemov @ 2023-06-08 22:52 UTC (permalink / raw)
To: Kai Huang
Cc: linux-kernel, kvm, linux-mm, dave.hansen, tony.luck, peterz,
tglx, seanjc, pbonzini, david, dan.j.williams, rafael.j.wysocki,
ying.huang, reinette.chatre, len.brown, ak, isaku.yamahata,
chao.gao, sathyanarayanan.kuppuswamy, bagasdotme, sagis,
imammedo
On Mon, Jun 05, 2023 at 02:27:23AM +1200, Kai Huang wrote:
> @@ -50,6 +51,8 @@ static DEFINE_MUTEX(tdx_module_lock);
> /* All TDX-usable memory regions. Protected by mem_hotplug_lock. */
> static LIST_HEAD(tdx_memlist);
>
> +static struct tdmr_info_list tdx_tdmr_list;
> +
> /*
> * Wrapper of __seamcall() to convert SEAMCALL leaf function error code
> * to kernel error code. @seamcall_ret and @out contain the SEAMCALL
The name is misleading. It is not a list, it is an array.
...
> @@ -112,6 +135,15 @@ struct tdx_memblock {
> unsigned long end_pfn;
> };
>
> +struct tdmr_info_list {
> + void *tdmrs; /* Flexible array to hold 'tdmr_info's */
> + int nr_consumed_tdmrs; /* How many 'tdmr_info's are in use */
> +
> + /* Metadata for finding target 'tdmr_info' and freeing @tdmrs */
> + int tdmr_sz; /* Size of one 'tdmr_info' */
> + int max_tdmrs; /* How many 'tdmr_info's are allocated */
> +};
> +
> struct tdx_module_output;
> u64 __seamcall(u64 fn, u64 rcx, u64 rdx, u64 r8, u64 r9,
> struct tdx_module_output *out);
Otherwise, looks okay.
Reviewed-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
--
Kiryl Shutsemau / Kirill A. Shutemov
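To illustrate the "array, not list" point: the quoted struct holds a `void *` buffer plus a per-entry size, so entries are found by byte offset rather than by following links. The sketch below is a userspace model under stated assumptions. The struct and function names are hypothetical, calloc() stands in for whatever allocator the kernel code uses, and the entry size is passed in by the caller on the assumption that it is only known at run time.

```c
#include <stddef.h>
#include <stdlib.h>

/* Userspace model of a runtime-sized array of fixed-size entries. */
struct tdmr_array_model {
	void *tdmrs;		/* one contiguous buffer holding all entries */
	int nr_consumed;	/* entries currently in use */
	int entry_sz;		/* size of one entry in bytes */
	int max_entries;	/* entries allocated */
};

/* Returns 0 on success, -1 if the allocation fails. */
int tdmr_array_alloc(struct tdmr_array_model *a, int entry_sz, int max_entries)
{
	a->tdmrs = calloc((size_t)max_entries, (size_t)entry_sz);
	if (!a->tdmrs)
		return -1;

	a->nr_consumed = 0;
	a->entry_sz = entry_sz;
	a->max_entries = max_entries;
	return 0;
}

/* Entry lookup is plain pointer arithmetic on the byte buffer. */
void *tdmr_array_entry(struct tdmr_array_model *a, int idx)
{
	return (char *)a->tdmrs + (size_t)idx * (size_t)a->entry_sz;
}

void tdmr_array_free(struct tdmr_array_model *a)
{
	free(a->tdmrs);
	a->tdmrs = NULL;
}
```

Because lookup is a multiplication and an add, nothing about the structure behaves like a linked list, which is the basis of the naming objection.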
* Re: [PATCH v11 11/20] x86/virt/tdx: Fill out TDMRs to cover all TDX memory regions
[not found] ` <927ec9871721d2a50f1aba7d1cf7c3be50e4f49b.1685887183.git.kai.huang@intel.com>
2023-06-07 16:05 ` [PATCH v11 11/20] x86/virt/tdx: Fill out TDMRs to cover all TDX memory regions Dave Hansen
@ 2023-06-08 23:02 ` kirill.shutemov
2023-06-12 2:25 ` Huang, Kai
2023-06-09 4:01 ` Sathyanarayanan Kuppuswamy
2023-06-14 12:31 ` Nikolay Borisov
3 siblings, 1 reply; 144+ messages in thread
From: kirill.shutemov @ 2023-06-08 23:02 UTC (permalink / raw)
To: Kai Huang
Cc: linux-kernel, kvm, linux-mm, dave.hansen, tony.luck, peterz,
tglx, seanjc, pbonzini, david, dan.j.williams, rafael.j.wysocki,
ying.huang, reinette.chatre, len.brown, ak, isaku.yamahata,
chao.gao, sathyanarayanan.kuppuswamy, bagasdotme, sagis,
imammedo
On Mon, Jun 05, 2023 at 02:27:24AM +1200, Kai Huang wrote:
> +#define TDMR_ALIGNMENT BIT_ULL(30)
Nit: SZ_1G can be a little bit more readable here.
Anyway:
Reviewed-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
--
Kiryl Shutsemau / Kirill A. Shutemov
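As a small check of the nit above, the following sketch shows that bit 30 and 1 GiB are the same value, and how such an alignment constant is typically used. The *_MODEL macros are local stand-ins for the kernel's BIT_ULL() and SZ_1G, and the two helpers are hypothetical names, not functions from the patch.

```c
#include <stdint.h>

/* Local stand-ins for the kernel macros (assumed equivalent in value). */
#define BIT_ULL_MODEL(n)	(1ULL << (n))
#define SZ_1G_MODEL		0x40000000ULL

/* Compile-time proof that both spellings name the same number. */
_Static_assert(BIT_ULL_MODEL(30) == SZ_1G_MODEL, "bit 30 is 1 GiB");

#define TDMR_ALIGNMENT_MODEL	SZ_1G_MODEL

/* Round an address down to the previous 1 GiB boundary. */
uint64_t tdmr_align_down_model(uint64_t addr)
{
	return addr & ~(TDMR_ALIGNMENT_MODEL - 1);
}

/* Round an address up to the next 1 GiB boundary. */
uint64_t tdmr_align_up_model(uint64_t addr)
{
	return (addr + TDMR_ALIGNMENT_MODEL - 1) & ~(TDMR_ALIGNMENT_MODEL - 1);
}
```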
* Re: [PATCH v11 12/20] x86/virt/tdx: Allocate and set up PAMTs for TDMRs
[not found] ` <4e108968c3294189ad150f62df1f146168036342.1685887183.git.kai.huang@intel.com>
@ 2023-06-08 23:24 ` kirill.shutemov
2023-06-08 23:43 ` Dave Hansen
2023-06-25 15:38 ` Huang, Kai
2023-06-15 7:48 ` Nikolay Borisov
1 sibling, 2 replies; 144+ messages in thread
From: kirill.shutemov @ 2023-06-08 23:24 UTC (permalink / raw)
To: Kai Huang
Cc: linux-kernel, kvm, linux-mm, dave.hansen, tony.luck, peterz,
tglx, seanjc, pbonzini, david, dan.j.williams, rafael.j.wysocki,
ying.huang, reinette.chatre, len.brown, ak, isaku.yamahata,
chao.gao, sathyanarayanan.kuppuswamy, bagasdotme, sagis,
imammedo
On Mon, Jun 05, 2023 at 02:27:25AM +1200, Kai Huang wrote:
> diff --git a/arch/x86/virt/vmx/tdx/tdx.c b/arch/x86/virt/vmx/tdx/tdx.c
> index fa9fa8bc581a..5f0499ba5d67 100644
> --- a/arch/x86/virt/vmx/tdx/tdx.c
> +++ b/arch/x86/virt/vmx/tdx/tdx.c
> @@ -265,7 +265,7 @@ static int tdx_get_sysinfo(struct tdsysinfo_struct *sysinfo,
> * overlap.
> */
> static int add_tdx_memblock(struct list_head *tmb_list, unsigned long start_pfn,
> - unsigned long end_pfn)
> + unsigned long end_pfn, int nid)
> {
> struct tdx_memblock *tmb;
>
> @@ -276,6 +276,7 @@ static int add_tdx_memblock(struct list_head *tmb_list, unsigned long start_pfn,
> INIT_LIST_HEAD(&tmb->list);
> tmb->start_pfn = start_pfn;
> tmb->end_pfn = end_pfn;
> + tmb->nid = nid;
>
> /* @tmb_list is protected by mem_hotplug_lock */
> list_add_tail(&tmb->list, tmb_list);
> @@ -303,9 +304,9 @@ static void free_tdx_memlist(struct list_head *tmb_list)
> static int build_tdx_memlist(struct list_head *tmb_list)
> {
> unsigned long start_pfn, end_pfn;
> - int i, ret;
> + int i, nid, ret;
>
> - for_each_mem_pfn_range(i, MAX_NUMNODES, &start_pfn, &end_pfn, NULL) {
> + for_each_mem_pfn_range(i, MAX_NUMNODES, &start_pfn, &end_pfn, &nid) {
> /*
> * The first 1MB is not reported as TDX convertible memory.
> * Although the first 1MB is always reserved and won't end up
> @@ -321,7 +322,7 @@ static int build_tdx_memlist(struct list_head *tmb_list)
> * memblock has already guaranteed they are in address
> * ascending order and don't overlap.
> */
> - ret = add_tdx_memblock(tmb_list, start_pfn, end_pfn);
> + ret = add_tdx_memblock(tmb_list, start_pfn, end_pfn, nid);
> if (ret)
> goto err;
> }
These three hunks and the change to struct tdx_memblock look unrelated.
Why not fold them into 09/20?
> @@ -472,6 +473,202 @@ static int fill_out_tdmrs(struct list_head *tmb_list,
> return 0;
> }
>
> +/*
> + * Calculate PAMT size given a TDMR and a page size. The returned
> + * PAMT size is always aligned up to 4K page boundary.
> + */
> +static unsigned long tdmr_get_pamt_sz(struct tdmr_info *tdmr, int pgsz,
> + u16 pamt_entry_size)
> +{
> + unsigned long pamt_sz, nr_pamt_entries;
> +
> + switch (pgsz) {
> + case TDX_PS_4K:
> + nr_pamt_entries = tdmr->size >> PAGE_SHIFT;
> + break;
> + case TDX_PS_2M:
> + nr_pamt_entries = tdmr->size >> PMD_SHIFT;
> + break;
> + case TDX_PS_1G:
> + nr_pamt_entries = tdmr->size >> PUD_SHIFT;
> + break;
> + default:
> + WARN_ON_ONCE(1);
> + return 0;
> + }
> +
> + pamt_sz = nr_pamt_entries * pamt_entry_size;
> + /* TDX requires PAMT size must be 4K aligned */
> + pamt_sz = ALIGN(pamt_sz, PAGE_SIZE);
> +
> + return pamt_sz;
> +}
> +
> +/*
> + * Locate a NUMA node which should hold the allocation of the @tdmr
> + * PAMT. This node will have some memory covered by the TDMR. The
> + * relative amount of memory covered is not considered.
> + */
> +static int tdmr_get_nid(struct tdmr_info *tdmr, struct list_head *tmb_list)
> +{
> + struct tdx_memblock *tmb;
> +
> + /*
> + * A TDMR must cover at least part of one TMB. That TMB will end
> + * after the TDMR begins. But, that TMB may have started before
> + * the TDMR. Find the next 'tmb' that _ends_ after this TDMR
> + * begins. Ignore 'tmb' start addresses. They are irrelevant.
> + */
> + list_for_each_entry(tmb, tmb_list, list) {
> + if (tmb->end_pfn > PHYS_PFN(tdmr->base))
> + return tmb->nid;
> + }
> +
> + /*
> + * Fall back to allocating the TDMR's metadata from node 0 when
> + * no TDX memory block can be found. This should never happen
> + * since TDMRs originate from TDX memory blocks.
> + */
> + pr_warn("TDMR [0x%llx, 0x%llx): unable to find local NUMA node for PAMT allocation, fallback to use node 0.\n",
> + tdmr->base, tdmr_end(tdmr));
> + return 0;
> +}
> +
> +#define TDX_PS_NR (TDX_PS_1G + 1)
This should be next to the rest of the TDX_PS_* definitions.
> +
> +/*
> + * Allocate PAMTs from the local NUMA node of some memory in @tmb_list
> + * within @tdmr, and set up PAMTs for @tdmr.
> + */
> +static int tdmr_set_up_pamt(struct tdmr_info *tdmr,
> + struct list_head *tmb_list,
> + u16 pamt_entry_size)
> +{
> + unsigned long pamt_base[TDX_PS_NR];
> + unsigned long pamt_size[TDX_PS_NR];
> + unsigned long tdmr_pamt_base;
> + unsigned long tdmr_pamt_size;
> + struct page *pamt;
> + int pgsz, nid;
> +
> + nid = tdmr_get_nid(tdmr, tmb_list);
> +
> + /*
> + * Calculate the PAMT size for each TDX supported page size
> + * and the total PAMT size.
> + */
> + tdmr_pamt_size = 0;
> + for (pgsz = TDX_PS_4K; pgsz <= TDX_PS_1G ; pgsz++) {
"< TDX_PS_NR" instead of "<= TDX_PS_1G".
> + pamt_size[pgsz] = tdmr_get_pamt_sz(tdmr, pgsz,
> + pamt_entry_size);
> + tdmr_pamt_size += pamt_size[pgsz];
> + }
> +
> + /*
> + * Allocate one chunk of physically contiguous memory for all
> + * PAMTs. This helps minimize the PAMT's use of reserved areas
> + * in overlapped TDMRs.
> + */
> + pamt = alloc_contig_pages(tdmr_pamt_size >> PAGE_SHIFT, GFP_KERNEL,
> + nid, &node_online_map);
> + if (!pamt)
> + return -ENOMEM;
> +
> + /*
> + * Break the contiguous allocation back up into the
> + * individual PAMTs for each page size.
> + */
> + tdmr_pamt_base = page_to_pfn(pamt) << PAGE_SHIFT;
> + for (pgsz = TDX_PS_4K; pgsz <= TDX_PS_1G; pgsz++) {
> + pamt_base[pgsz] = tdmr_pamt_base;
> + tdmr_pamt_base += pamt_size[pgsz];
> + }
> +
> + tdmr->pamt_4k_base = pamt_base[TDX_PS_4K];
> + tdmr->pamt_4k_size = pamt_size[TDX_PS_4K];
> + tdmr->pamt_2m_base = pamt_base[TDX_PS_2M];
> + tdmr->pamt_2m_size = pamt_size[TDX_PS_2M];
> + tdmr->pamt_1g_base = pamt_base[TDX_PS_1G];
> + tdmr->pamt_1g_size = pamt_size[TDX_PS_1G];
> +
> + return 0;
> +}
> +
> +static void tdmr_get_pamt(struct tdmr_info *tdmr, unsigned long *pamt_pfn,
> + unsigned long *pamt_npages)
> +{
> + unsigned long pamt_base, pamt_sz;
> +
> + /*
> + * The PAMT was allocated in one contiguous unit. The 4K PAMT
> + * should always point to the beginning of that allocation.
> + */
> + pamt_base = tdmr->pamt_4k_base;
> + pamt_sz = tdmr->pamt_4k_size + tdmr->pamt_2m_size + tdmr->pamt_1g_size;
> +
> + *pamt_pfn = PHYS_PFN(pamt_base);
> + *pamt_npages = pamt_sz >> PAGE_SHIFT;
> +}
> +
> +static void tdmr_free_pamt(struct tdmr_info *tdmr)
> +{
> + unsigned long pamt_pfn, pamt_npages;
> +
> + tdmr_get_pamt(tdmr, &pamt_pfn, &pamt_npages);
> +
> + /* Do nothing if PAMT hasn't been allocated for this TDMR */
> + if (!pamt_npages)
> + return;
> +
> + if (WARN_ON_ONCE(!pamt_pfn))
> + return;
> +
> + free_contig_range(pamt_pfn, pamt_npages);
> +}
> +
> +static void tdmrs_free_pamt_all(struct tdmr_info_list *tdmr_list)
> +{
> + int i;
> +
> + for (i = 0; i < tdmr_list->nr_consumed_tdmrs; i++)
> + tdmr_free_pamt(tdmr_entry(tdmr_list, i));
> +}
> +
> +/* Allocate and set up PAMTs for all TDMRs */
> +static int tdmrs_set_up_pamt_all(struct tdmr_info_list *tdmr_list,
> + struct list_head *tmb_list,
> + u16 pamt_entry_size)
> +{
> + int i, ret = 0;
> +
> + for (i = 0; i < tdmr_list->nr_consumed_tdmrs; i++) {
> + ret = tdmr_set_up_pamt(tdmr_entry(tdmr_list, i), tmb_list,
> + pamt_entry_size);
> + if (ret)
> + goto err;
> + }
> +
> + return 0;
> +err:
> + tdmrs_free_pamt_all(tdmr_list);
> + return ret;
> +}
> +
> +static unsigned long tdmrs_count_pamt_pages(struct tdmr_info_list *tdmr_list)
> +{
> + unsigned long pamt_npages = 0;
> + int i;
> +
> + for (i = 0; i < tdmr_list->nr_consumed_tdmrs; i++) {
> + unsigned long pfn, npages;
> +
> + tdmr_get_pamt(tdmr_entry(tdmr_list, i), &pfn, &npages);
> + pamt_npages += npages;
> + }
> +
> + return pamt_npages;
> +}
> +
> /*
> * Construct a list of TDMRs on the preallocated space in @tdmr_list
> * to cover all TDX memory regions in @tmb_list based on the TDX module
> @@ -487,10 +684,13 @@ static int construct_tdmrs(struct list_head *tmb_list,
> if (ret)
> return ret;
>
> + ret = tdmrs_set_up_pamt_all(tdmr_list, tmb_list,
> + sysinfo->pamt_entry_size);
> + if (ret)
> + return ret;
> /*
> * TODO:
> *
> - * - Allocate and set up PAMTs for each TDMR.
> * - Designate reserved areas for each TDMR.
> *
> * Return -EINVAL until constructing TDMRs is done
> @@ -547,6 +747,11 @@ static int init_tdx_module(void)
> * Return error before all steps are done.
> */
> ret = -EINVAL;
> + if (ret)
> + tdmrs_free_pamt_all(&tdx_tdmr_list);
> + else
> + pr_info("%lu KBs allocated for PAMT.\n",
> + tdmrs_count_pamt_pages(&tdx_tdmr_list) * 4);
"* 4"? This is very cryptic. procfs uses "<< (PAGE_SHIFT - 10)" which is
slightly less magic to me. And just make the helper return kilobytes
to begin with, if this is the only caller.
> out_free_tdmrs:
> if (ret)
> free_tdmr_list(&tdx_tdmr_list);
> diff --git a/arch/x86/virt/vmx/tdx/tdx.h b/arch/x86/virt/vmx/tdx/tdx.h
> index c20848e76469..e8110e1a9980 100644
> --- a/arch/x86/virt/vmx/tdx/tdx.h
> +++ b/arch/x86/virt/vmx/tdx/tdx.h
> @@ -133,6 +133,7 @@ struct tdx_memblock {
> struct list_head list;
> unsigned long start_pfn;
> unsigned long end_pfn;
> + int nid;
> };
>
> struct tdmr_info_list {
> --
> 2.40.1
>
--
Kiryl Shutsemau / Kirill A. Shutemov
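The PAMT sizing in the quoted tdmr_get_pamt_sz() can be worked through with a userspace sketch. The *_model names are hypothetical, and the 16-byte entry size used in the example figures is an assumption for illustration only; in the patch the value comes from sysinfo->pamt_entry_size.

```c
#include <stdint.h>

#define SHIFT_4K_MODEL	12
#define SHIFT_2M_MODEL	21
#define SHIFT_1G_MODEL	30
#define SIZE_4K_MODEL	(UINT64_C(1) << SHIFT_4K_MODEL)

/* Round x up to a multiple of a, where a is a power of two. */
uint64_t align_up_model(uint64_t x, uint64_t a)
{
	return (x + a - 1) & ~(a - 1);
}

/* PAMT bytes needed to track tdmr_size bytes at one page size. */
uint64_t pamt_size_model(uint64_t tdmr_size, unsigned int page_shift,
			 uint16_t entry_size)
{
	uint64_t nr_entries = tdmr_size >> page_shift;

	/* Each per-page-size PAMT is rounded up to a 4K boundary. */
	return align_up_model(nr_entries * entry_size, SIZE_4K_MODEL);
}

/* Total for the single contiguous allocation covering all page sizes. */
uint64_t pamt_total_model(uint64_t tdmr_size, uint16_t entry_size)
{
	return pamt_size_model(tdmr_size, SHIFT_4K_MODEL, entry_size) +
	       pamt_size_model(tdmr_size, SHIFT_2M_MODEL, entry_size) +
	       pamt_size_model(tdmr_size, SHIFT_1G_MODEL, entry_size);
}
```

For a 1 GiB TDMR with the assumed 16-byte entries, the 4K PAMT needs 262144 entries (4 MiB), the 2M PAMT needs 512 entries (8 KiB), and the 1G PAMT needs one entry that rounds up to 4 KiB, for 4206592 bytes in total. The 4K table dominates, at roughly 0.4% of the covered memory.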
* Re: [PATCH v11 08/20] x86/virt/tdx: Get information about TDX module and TDX-capable memory
2023-06-08 11:41 ` kirill.shutemov
2023-06-08 13:13 ` Dave Hansen
@ 2023-06-08 23:29 ` Isaku Yamahata
2023-06-08 23:54 ` kirill.shutemov
1 sibling, 1 reply; 144+ messages in thread
From: Isaku Yamahata @ 2023-06-08 23:29 UTC (permalink / raw)
To: kirill.shutemov
Cc: Huang, Kai, kvm, Hansen, Dave, david, bagasdotme, ak, Wysocki,
Rafael J, linux-kernel, Chatre, Reinette, Christopherson,,
Sean, pbonzini, tglx, linux-mm, Yamahata, Isaku, Luck, Tony,
peterz, Shahar, Sagi, imammedo, Gao, Chao, Brown, Len,
sathyanarayanan.kuppuswamy, Huang, Ying, Williams, Dan J,
isaku.yamahata
On Thu, Jun 08, 2023 at 02:41:28PM +0300,
"kirill.shutemov@linux.intel.com" <kirill.shutemov@linux.intel.com> wrote:
> On Thu, Jun 08, 2023 at 02:40:27AM +0000, Huang, Kai wrote:
> > On Thu, 2023-06-08 at 03:27 +0300, kirill.shutemov@linux.intel.com wrote:
> > > On Mon, Jun 05, 2023 at 02:27:21AM +1200, Kai Huang wrote:
> > > > For now both 'tdsysinfo_struct' and CMRs are only used during the module
> > > > initialization. But because they are both relatively big, declare them
> > > > inside the module initialization function but as static variables.
> > >
> > > This justification does not make sense to me. static variables will not be
> > > freed after function returned. They will still consume memory.
> > >
> > > I think you need to allocate/free memory dynamically, if they are too big
> > > for stack.
> >
> >
> > I do need to keep tdsysinfo_struct as it will be used by KVM too.
>
> Will you pass it down to KVM from this function? Will KVM use the struct
> after the function returns?
KVM needs tdsysinfo_struct to create a guest TD. It doesn't require
1024-alignment.
--
Isaku Yamahata <isaku.yamahata@gmail.com>
* Re: [PATCH v11 12/20] x86/virt/tdx: Allocate and set up PAMTs for TDMRs
2023-06-08 23:24 ` [PATCH v11 12/20] x86/virt/tdx: Allocate and set up PAMTs for TDMRs kirill.shutemov
@ 2023-06-08 23:43 ` Dave Hansen
2023-06-12 2:52 ` Huang, Kai
2023-06-25 15:38 ` Huang, Kai
1 sibling, 1 reply; 144+ messages in thread
From: Dave Hansen @ 2023-06-08 23:43 UTC (permalink / raw)
To: kirill.shutemov, Kai Huang
Cc: linux-kernel, kvm, linux-mm, tony.luck, peterz, tglx, seanjc,
pbonzini, david, dan.j.williams, rafael.j.wysocki, ying.huang,
reinette.chatre, len.brown, ak, isaku.yamahata, chao.gao,
sathyanarayanan.kuppuswamy, bagasdotme, sagis, imammedo
On 6/8/23 16:24, kirill.shutemov@linux.intel.com wrote:
>> ret = -EINVAL;
>> + if (ret)
>> + tdmrs_free_pamt_all(&tdx_tdmr_list);
>> + else
>> + pr_info("%lu KBs allocated for PAMT.\n",
>> + tdmrs_count_pamt_pages(&tdx_tdmr_list) * 4);
> "* 4"? This is very cryptic. procfs uses "<< (PAGE_SHIFT - 10)" which
> slightly less magic to me. And just make the helper that returns kilobytes
> to begin with, if it is the only caller.
Let's look at where this data comes from:
+static unsigned long tdmrs_count_pamt_pages(struct tdmr_info_list *tdmr_list)
+{
+ unsigned long pamt_npages = 0;
+ int i;
+
+ for (i = 0; i < tdmr_list->nr_consumed_tdmrs; i++) {
+ unsigned long pfn, npages;
+
+ tdmr_get_pamt(tdmr_entry(tdmr_list, i), &pfn, &npages);
+ pamt_npages += npages;
+ }
OK, so tdmr_get_pamt() is getting it in pages. How is it *stored*?
+static void tdmr_get_pamt(struct tdmr_info *tdmr, unsigned long *pamt_pfn,
+ unsigned long *pamt_npages)
+{
...
+ pamt_sz = tdmr->pamt_4k_size + tdmr->pamt_2m_size + tdmr->pamt_1g_size;
+
+ *pamt_pfn = PHYS_PFN(pamt_base);
+ *pamt_npages = pamt_sz >> PAGE_SHIFT;
+}
Oh, it's actually stored in bytes. So to print it out you actually
convert it from bytes->pages->kbytes. Not the best.
If tdmr_get_pamt() just returned 'pamt_size_bytes', you could do one
conversion at:
free_contig_range(pamt_pfn, pamt_size_bytes >> PAGE_SHIFT);
and since tdmrs_count_pamt_pages() has only one caller you can just make
it: tdmrs_count_pamt_kb(). The print becomes:
pr_info("%lu KBs allocated for PAMT.\n",
tdmrs_count_pamt_kb(&tdx_tdmr_list));
and tdmrs_count_pamt_kb() does something super fancy like:
return pamt_size_bytes / 1024;
which makes total complete obvious sense and needs zero explanation.
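The unit conversions under discussion can be put side by side in a short sketch. The helper names are hypothetical and a 4 KiB page size is assumed (as on x86); the aim is only to show that the direct bytes-to-KB division and the procfs-style shift on a page count give the same result.

```c
/* Assumption: 4 KiB pages, i.e. a page shift of 12. */
#define PAGE_SHIFT_MODEL 12

/* Bytes to whole pages, the conversion needed when freeing a range. */
unsigned long bytes_to_pages_model(unsigned long bytes)
{
	return bytes >> PAGE_SHIFT_MODEL;
}

/* Pages to KB, written the way procfs does it. */
unsigned long pages_to_kb_model(unsigned long npages)
{
	return npages << (PAGE_SHIFT_MODEL - 10);
}

/* Bytes to KB in one step, with no detour through pages. */
unsigned long bytes_to_kb_model(unsigned long bytes)
{
	return bytes / 1024;
}
```

For a page-aligned size such as 12288 bytes, both routes yield 12 KB, so keeping the total in bytes and converting once at each use loses nothing.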
* Re: [PATCH v11 13/20] x86/virt/tdx: Designate reserved areas for all TDMRs
[not found] ` <409448809f7c78191aa27d6d2970ba1384c2d464.1685887183.git.kai.huang@intel.com>
@ 2023-06-08 23:53 ` kirill.shutemov
0 siblings, 0 replies; 144+ messages in thread
From: kirill.shutemov @ 2023-06-08 23:53 UTC (permalink / raw)
To: Kai Huang
Cc: linux-kernel, kvm, linux-mm, dave.hansen, tony.luck, peterz,
tglx, seanjc, pbonzini, david, dan.j.williams, rafael.j.wysocki,
ying.huang, reinette.chatre, len.brown, ak, isaku.yamahata,
chao.gao, sathyanarayanan.kuppuswamy, bagasdotme, sagis,
imammedo
On Mon, Jun 05, 2023 at 02:27:26AM +1200, Kai Huang wrote:
> As the last step of constructing TDMRs, populate reserved areas for all
> TDMRs. For each TDMR, put all memory holes within this TDMR to the
> reserved areas. And for all PAMTs which overlap with this TDMR, put
> all the overlapping parts to reserved areas too.
>
> Signed-off-by: Kai Huang <kai.huang@intel.com>
> Reviewed-by: Isaku Yamahata <isaku.yamahata@intel.com>
Reviewed-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
--
Kiryl Shutsemau / Kirill A. Shutemov
^ permalink raw reply [flat|nested] 144+ messages in thread
* Re: [PATCH v11 14/20] x86/virt/tdx: Configure TDX module with the TDMRs and global KeyID
[not found] ` <4e6cd933edd2501147366df7a17e1087560a4320.1685887183.git.kai.huang@intel.com>
@ 2023-06-08 23:53 ` kirill.shutemov
0 siblings, 0 replies; 144+ messages in thread
From: kirill.shutemov @ 2023-06-08 23:53 UTC (permalink / raw)
To: Kai Huang
Cc: linux-kernel, kvm, linux-mm, dave.hansen, tony.luck, peterz,
tglx, seanjc, pbonzini, david, dan.j.williams, rafael.j.wysocki,
ying.huang, reinette.chatre, len.brown, ak, isaku.yamahata,
chao.gao, sathyanarayanan.kuppuswamy, bagasdotme, sagis,
imammedo
On Mon, Jun 05, 2023 at 02:27:27AM +1200, Kai Huang wrote:
> The TDX module uses a private KeyID as the "global KeyID" for mapping
> things like the PAMT and other TDX metadata. This KeyID has already
> been reserved when detecting TDX during the kernel early boot.
>
> After the list of "TD Memory Regions" (TDMRs) has been constructed to
> cover all TDX-usable memory regions, the next step is to pass them to
> the TDX module together with the global KeyID.
>
> Signed-off-by: Kai Huang <kai.huang@intel.com>
> Reviewed-by: Isaku Yamahata <isaku.yamahata@intel.com>
Reviewed-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
--
Kiryl Shutsemau / Kirill A. Shutemov
^ permalink raw reply [flat|nested] 144+ messages in thread
* Re: [PATCH v11 15/20] x86/virt/tdx: Configure global KeyID on all packages
[not found] ` <30358db4eff961c69783bbd4d9f3e50932a9a759.1685887183.git.kai.huang@intel.com>
@ 2023-06-08 23:53 ` kirill.shutemov
2023-06-15 8:12 ` Nikolay Borisov
1 sibling, 0 replies; 144+ messages in thread
From: kirill.shutemov @ 2023-06-08 23:53 UTC (permalink / raw)
To: Kai Huang
Cc: linux-kernel, kvm, linux-mm, dave.hansen, tony.luck, peterz,
tglx, seanjc, pbonzini, david, dan.j.williams, rafael.j.wysocki,
ying.huang, reinette.chatre, len.brown, ak, isaku.yamahata,
chao.gao, sathyanarayanan.kuppuswamy, bagasdotme, sagis,
imammedo
On Mon, Jun 05, 2023 at 02:27:28AM +1200, Kai Huang wrote:
> After the list of TDMRs and the global KeyID are configured to the TDX
> module, the kernel needs to configure the key of the global KeyID on all
> packages using TDH.SYS.KEY.CONFIG.
>
> This SEAMCALL cannot run in parallel on different cpus. Loop over all online
> cpus and use smp_call_on_cpu() to call this SEAMCALL on the first cpu of
> each package.
>
> To keep things simple, this implementation takes no affirmative steps to
> online cpus to make sure there's at least one cpu for each package. The
> callers (aka. KVM) can ensure success by ensuring that.
>
> Intel hardware doesn't guarantee cache coherency across different
> KeyIDs. The PAMTs are transitioning from being used by the kernel
> mapping (KeyId 0) to the TDX module's "global KeyID" mapping.
>
> This means that the kernel must flush any dirty KeyID-0 PAMT cachelines
> before the TDX module uses the global KeyID to access the PAMTs.
> Otherwise, if those dirty cachelines were written back, they would
> corrupt the TDX module's metadata. Aside: This corruption would be
> detected by the memory integrity hardware on the next read of the memory
> with the global KeyID. The result would likely be fatal to the system
> but would not impact TDX security.
>
> Following the TDX module specification, flush cache before configuring
> the global KeyID on all packages. Given the PAMT size can be large
> (~1/256th of system RAM), just use WBINVD on all CPUs to flush.
>
> If TDH.SYS.KEY.CONFIG fails, the TDX module may already have used the
> global KeyID to write the PAMTs. Therefore, use WBINVD to flush cache
> before returning the PAMTs back to the kernel. Also convert all PAMTs
> back to normal by using MOVDIR64B as suggested by the TDX module spec,
> although on the platform without the "partial write machine check"
> erratum it's OK to leave PAMTs as is.
>
> Signed-off-by: Kai Huang <kai.huang@intel.com>
> Reviewed-by: Isaku Yamahata <isaku.yamahata@intel.com>
Reviewed-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
--
Kiryl Shutsemau / Kirill A. Shutemov
^ permalink raw reply [flat|nested] 144+ messages in thread
* Re: [PATCH v11 08/20] x86/virt/tdx: Get information about TDX module and TDX-capable memory
2023-06-08 23:29 ` Isaku Yamahata
@ 2023-06-08 23:54 ` kirill.shutemov
2023-06-09 1:33 ` Isaku Yamahata
0 siblings, 1 reply; 144+ messages in thread
From: kirill.shutemov @ 2023-06-08 23:54 UTC (permalink / raw)
To: Isaku Yamahata
Cc: Huang, Kai, kvm, Hansen, Dave, david, bagasdotme, ak, Wysocki,
Rafael J, linux-kernel, Chatre, Reinette, Christopherson,,
Sean, pbonzini, tglx, linux-mm, Yamahata, Isaku, Luck, Tony,
peterz, Shahar, Sagi, imammedo, Gao, Chao, Brown, Len,
sathyanarayanan.kuppuswamy, Huang, Ying, Williams, Dan J
On Thu, Jun 08, 2023 at 04:29:19PM -0700, Isaku Yamahata wrote:
> On Thu, Jun 08, 2023 at 02:41:28PM +0300,
> "kirill.shutemov@linux.intel.com" <kirill.shutemov@linux.intel.com> wrote:
>
> > On Thu, Jun 08, 2023 at 02:40:27AM +0000, Huang, Kai wrote:
> > > On Thu, 2023-06-08 at 03:27 +0300, kirill.shutemov@linux.intel.com wrote:
> > > > On Mon, Jun 05, 2023 at 02:27:21AM +1200, Kai Huang wrote:
> > > > > For now both 'tdsysinfo_struct' and CMRs are only used during the module
> > > > > initialization. But because they are both relatively big, declare them
> > > > > inside the module initialization function but as static variables.
> > > >
> > > > This justification does not make sense to me. static variables will not be
> > > > freed after function returned. They will still consume memory.
> > > >
> > > > I think you need to allocate/free memory dynamically, if they are too big
> > > > for stack.
> > >
> > >
> > > I do need to keep tdsysinfo_struct as it will be used by KVM too.
> >
> > Will you pass it down to KVM from this function? Will KVM use the struct
> > after the function returns?
>
> KVM needs tdsysinfo_struct to create guest TD. It doesn't require
> 1024-alignment.
How does KVM get it from here?
--
Kiryl Shutsemau / Kirill A. Shutemov
^ permalink raw reply [flat|nested] 144+ messages in thread
* Re: [PATCH v11 20/20] Documentation/x86: Add documentation for TDX host support
[not found] ` <34853e0f8f38ec2fda66b0ba480d4df63b8aab43.1685887183.git.kai.huang@intel.com>
@ 2023-06-08 23:56 ` Dave Hansen
2023-06-12 3:41 ` Huang, Kai
2023-06-16 9:02 ` Nikolay Borisov
1 sibling, 1 reply; 144+ messages in thread
From: Dave Hansen @ 2023-06-08 23:56 UTC (permalink / raw)
To: Kai Huang, linux-kernel, kvm
Cc: linux-mm, kirill.shutemov, tony.luck, peterz, tglx, seanjc,
pbonzini, david, dan.j.williams, rafael.j.wysocki, ying.huang,
reinette.chatre, len.brown, ak, isaku.yamahata, chao.gao,
sathyanarayanan.kuppuswamy, bagasdotme, sagis, imammedo
On 6/4/23 07:27, Kai Huang wrote:
> +There is no CPUID or MSR to detect the TDX module. The kernel detects it
> +by initializing it.
Is this really what you want to say?
If the module is there, SEAMCALL works. If not, it doesn't. The module
is assumed to be there when SEAMCALL works. Right?
Yeah, that first SEAMCALL is probably an initialization call, but I feel
like that's beside the point.
^ permalink raw reply [flat|nested] 144+ messages in thread
* Re: [PATCH v11 08/20] x86/virt/tdx: Get information about TDX module and TDX-capable memory
2023-06-08 23:54 ` kirill.shutemov
@ 2023-06-09 1:33 ` Isaku Yamahata
0 siblings, 0 replies; 144+ messages in thread
From: Isaku Yamahata @ 2023-06-09 1:33 UTC (permalink / raw)
To: kirill.shutemov
Cc: Isaku Yamahata, Huang, Kai, kvm, Hansen, Dave, david, bagasdotme,
ak, Wysocki, Rafael J, linux-kernel, Chatre, Reinette,
Christopherson,,
Sean, pbonzini, tglx, linux-mm, Yamahata, Isaku, Luck, Tony,
peterz, Shahar, Sagi, imammedo, Gao, Chao, Brown, Len,
sathyanarayanan.kuppuswamy, Huang, Ying, Williams, Dan J
On Fri, Jun 09, 2023 at 02:54:41AM +0300,
"kirill.shutemov@linux.intel.com" <kirill.shutemov@linux.intel.com> wrote:
> On Thu, Jun 08, 2023 at 04:29:19PM -0700, Isaku Yamahata wrote:
> > On Thu, Jun 08, 2023 at 02:41:28PM +0300,
> > "kirill.shutemov@linux.intel.com" <kirill.shutemov@linux.intel.com> wrote:
> >
> > > On Thu, Jun 08, 2023 at 02:40:27AM +0000, Huang, Kai wrote:
> > > > On Thu, 2023-06-08 at 03:27 +0300, kirill.shutemov@linux.intel.com wrote:
> > > > > On Mon, Jun 05, 2023 at 02:27:21AM +1200, Kai Huang wrote:
> > > > > > For now both 'tdsysinfo_struct' and CMRs are only used during the module
> > > > > > initialization. But because they are both relatively big, declare them
> > > > > > inside the module initialization function but as static variables.
> > > > >
> > > > > This justification does not make sense to me. static variables will not be
> > > > > freed after function returned. They will still consume memory.
> > > > >
> > > > > I think you need to allocate/free memory dynamically, if they are too big
> > > > > for stack.
> > > >
> > > >
> > > > I do need to keep tdsysinfo_struct as it will be used by KVM too.
> > >
> > > Will you pass it down to KVM from this function? Will KVM use the struct
> > > after the function returns?
> >
> > KVM needs tdsysinfo_struct to create guest TD. It doesn't require
> > 1024-alignment.
>
> How does KVM get it from here?
For now, the TDX KVM patch series moves the tdsysinfo out of the function and
adds a getter function for it.
As long as KVM can access the info, it doesn't care whether its memory is
allocated statically or dynamically.
--
Isaku Yamahata <isaku.yamahata@gmail.com>
^ permalink raw reply [flat|nested] 144+ messages in thread
* Re: [PATCH v11 11/20] x86/virt/tdx: Fill out TDMRs to cover all TDX memory regions
[not found] ` <927ec9871721d2a50f1aba7d1cf7c3be50e4f49b.1685887183.git.kai.huang@intel.com>
2023-06-07 16:05 ` [PATCH v11 11/20] x86/virt/tdx: Fill out TDMRs to cover all TDX memory regions Dave Hansen
2023-06-08 23:02 ` kirill.shutemov
@ 2023-06-09 4:01 ` Sathyanarayanan Kuppuswamy
2023-06-12 2:28 ` Huang, Kai
2023-06-14 12:31 ` Nikolay Borisov
3 siblings, 1 reply; 144+ messages in thread
From: Sathyanarayanan Kuppuswamy @ 2023-06-09 4:01 UTC (permalink / raw)
To: Kai Huang, linux-kernel, kvm
Cc: linux-mm, dave.hansen, kirill.shutemov, tony.luck, peterz, tglx,
seanjc, pbonzini, david, dan.j.williams, rafael.j.wysocki,
ying.huang, reinette.chatre, len.brown, ak, isaku.yamahata,
chao.gao, bagasdotme, sagis, imammedo
On 6/4/23 7:27 AM, Kai Huang wrote:
> Start to transit out the "multi-steps" to construct a list of "TD Memory
> Regions" (TDMRs) to cover all TDX-usable memory regions.
>
> The kernel configures TDX-usable memory regions by passing a list of
> "TD Memory Regions" (TDMRs) to the TDX module. Each TDMR contains
> the information of the base/size of a memory region, the base/size of the
> associated Physical Address Metadata Table (PAMT) and a list of reserved
> areas in the region.
>
> Do the first step to fill out a number of TDMRs to cover all TDX memory
> regions. To keep it simple, always try to use one TDMR for each memory
> region. As the first step only set up the base/size for each TDMR.
As a first step?
>
> Each TDMR must be 1G aligned and the size must be in 1G granularity.
> This implies that one TDMR could cover multiple memory regions. If a
> memory region spans the 1GB boundary and the former part is already
> covered by the previous TDMR, just use a new TDMR for the remaining
> part.
>
> TDX only supports a limited number of TDMRs. Disable TDX if all TDMRs
> are consumed but there are more memory regions to cover.
>
> There are fancier things that could be done like trying to merge
> adjacent TDMRs. This would allow more pathological memory layouts to be
> supported. But, current systems are not even close to exhausting the
> existing TDMR resources in practice. For now, keep it simple.
>
> Signed-off-by: Kai Huang <kai.huang@intel.com>
> ---
>
> v10 -> v11:
> - No update
>
> v9 -> v10:
> - No change.
>
> v8 -> v9:
>
> - Added the last paragraph in the changelog (Dave).
> - Removed unnecessary type cast in tdmr_entry() (Dave).
>
>
> ---
> arch/x86/virt/vmx/tdx/tdx.c | 94 ++++++++++++++++++++++++++++++++++++-
> 1 file changed, 93 insertions(+), 1 deletion(-)
>
> diff --git a/arch/x86/virt/vmx/tdx/tdx.c b/arch/x86/virt/vmx/tdx/tdx.c
> index 7a20c72361e7..fa9fa8bc581a 100644
> --- a/arch/x86/virt/vmx/tdx/tdx.c
> +++ b/arch/x86/virt/vmx/tdx/tdx.c
> @@ -385,6 +385,93 @@ static void free_tdmr_list(struct tdmr_info_list *tdmr_list)
> tdmr_list->max_tdmrs * tdmr_list->tdmr_sz);
> }
>
> +/* Get the TDMR from the list at the given index. */
> +static struct tdmr_info *tdmr_entry(struct tdmr_info_list *tdmr_list,
> + int idx)
> +{
> + int tdmr_info_offset = tdmr_list->tdmr_sz * idx;
> +
> + return (void *)tdmr_list->tdmrs + tdmr_info_offset;
> +}
> +
> +#define TDMR_ALIGNMENT BIT_ULL(30)
> +#define TDMR_PFN_ALIGNMENT (TDMR_ALIGNMENT >> PAGE_SHIFT)
This macro is never used. Maybe you can drop it from this patch.
> +#define TDMR_ALIGN_DOWN(_addr) ALIGN_DOWN((_addr), TDMR_ALIGNMENT)
> +#define TDMR_ALIGN_UP(_addr) ALIGN((_addr), TDMR_ALIGNMENT)
> +
> +static inline u64 tdmr_end(struct tdmr_info *tdmr)
> +{
> + return tdmr->base + tdmr->size;
> +}
> +
> +/*
> + * Take the memory referenced in @tmb_list and populate the
> + * preallocated @tdmr_list, following all the special alignment
> + * and size rules for TDMR.
> + */
> +static int fill_out_tdmrs(struct list_head *tmb_list,
> + struct tdmr_info_list *tdmr_list)
> +{
> + struct tdx_memblock *tmb;
> + int tdmr_idx = 0;
> +
> + /*
> + * Loop over TDX memory regions and fill out TDMRs to cover them.
> + * To keep it simple, always try to use one TDMR to cover one
> + * memory region.
> + *
> + * In practice TDX1.0 supports 64 TDMRs, which is big enough to
> + * cover all memory regions in reality if the admin doesn't use
> + * 'memmap' to create a bunch of discrete memory regions. When
> + * there's a real problem, enhancement can be done to merge TDMRs
> + * to reduce the final number of TDMRs.
> + */
> + list_for_each_entry(tmb, tmb_list, list) {
> + struct tdmr_info *tdmr = tdmr_entry(tdmr_list, tdmr_idx);
> + u64 start, end;
> +
> + start = TDMR_ALIGN_DOWN(PFN_PHYS(tmb->start_pfn));
> + end = TDMR_ALIGN_UP(PFN_PHYS(tmb->end_pfn));
> +
> + /*
> + * A valid size indicates the current TDMR has already
> + * been filled out to cover the previous memory region(s).
> + */
> + if (tdmr->size) {
> + /*
> + * Loop to the next if the current memory region
> + * has already been fully covered.
> + */
> + if (end <= tdmr_end(tdmr))
> + continue;
> +
> + /* Otherwise, skip the already covered part. */
> + if (start < tdmr_end(tdmr))
> + start = tdmr_end(tdmr);
> +
> + /*
> + * Create a new TDMR to cover the current memory
> + * region, or the remaining part of it.
> + */
> + tdmr_idx++;
> + if (tdmr_idx >= tdmr_list->max_tdmrs) {
> + pr_warn("initialization failed: TDMRs exhausted.\n");
> + return -ENOSPC;
> + }
> +
> + tdmr = tdmr_entry(tdmr_list, tdmr_idx);
> + }
> +
> + tdmr->base = start;
> + tdmr->size = end - start;
> + }
> +
> + /* @tdmr_idx is always the index of last valid TDMR. */
> + tdmr_list->nr_consumed_tdmrs = tdmr_idx + 1;
> +
> + return 0;
> +}
> +
> /*
> * Construct a list of TDMRs on the preallocated space in @tdmr_list
> * to cover all TDX memory regions in @tmb_list based on the TDX module
> @@ -394,10 +481,15 @@ static int construct_tdmrs(struct list_head *tmb_list,
> struct tdmr_info_list *tdmr_list,
> struct tdsysinfo_struct *sysinfo)
> {
> + int ret;
> +
> + ret = fill_out_tdmrs(tmb_list, tdmr_list);
> + if (ret)
> + return ret;
> +
> /*
> * TODO:
> *
> - * - Fill out TDMRs to cover all TDX memory regions.
> * - Allocate and set up PAMTs for each TDMR.
> * - Designate reserved areas for each TDMR.
> *
Rest looks good to me.
Reviewed-by: Kuppuswamy Sathyanarayanan <sathyanarayanan.kuppuswamy@linux.intel.com>
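For readers following the loop in fill_out_tdmrs(), its covering rules can be modelled in user space. The sketch below is an illustration under stated assumptions, not the kernel code: 'struct region' and 'struct tdmr' are hypothetical simplifications of 'struct tdx_memblock' and 'struct tdmr_info', addresses are plain byte ranges rather than PFNs, the input is assumed to be non-empty, sorted and non-overlapping, and the output array is assumed to be zeroed.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define TDMR_ALIGNMENT	(1ULL << 30)	/* 1G, as in the patch */

struct region { uint64_t start, end; };	/* [start, end) in bytes; hypothetical */
struct tdmr   { uint64_t base, size; };	/* simplified 'struct tdmr_info' */

static uint64_t tdmr_align_down(uint64_t addr)
{
	return addr & ~(TDMR_ALIGNMENT - 1);
}

static uint64_t tdmr_align_up(uint64_t addr)
{
	return tdmr_align_down(addr + TDMR_ALIGNMENT - 1);
}

/*
 * Cover @regions with 1G-aligned TDMRs, one TDMR per region where possible.
 * Returns the number of TDMRs consumed, or -1 if @max entries are not enough.
 */
static int fill_out_tdmrs(const struct region *regions, size_t nr,
			  struct tdmr *tdmrs, int max)
{
	int idx = 0;
	size_t i;

	assert(max > 0);

	for (i = 0; i < nr; i++) {
		uint64_t start = tdmr_align_down(regions[i].start);
		uint64_t end = tdmr_align_up(regions[i].end);
		struct tdmr *t = &tdmrs[idx];

		/* A non-zero size means this TDMR already covers earlier regions. */
		if (t->size) {
			uint64_t t_end = t->base + t->size;

			/* Region already fully covered: nothing to do. */
			if (end <= t_end)
				continue;

			/* Skip the part the current TDMR already covers. */
			if (start < t_end)
				start = t_end;

			/* Use a new TDMR for the region or its remainder. */
			if (++idx >= max)
				return -1;
			t = &tdmrs[idx];
		}

		t->base = start;
		t->size = end - start;
	}

	return idx + 1;
}
```

For example, regions [1M, 1.5G), [1.59G, 1.75G) and [1.875G, 3.5G) end up in two TDMRs: [0, 2G) covers the first two regions, and [2G, 4G) covers the remainder of the third.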
--
Sathyanarayanan Kuppuswamy
Linux Kernel Developer
^ permalink raw reply [flat|nested] 144+ messages in thread
* Re: [PATCH v11 08/20] x86/virt/tdx: Get information about TDX module and TDX-capable memory
[not found] ` <50386eddbb8046b0b222d385e56e8115ed566526.1685887183.git.kai.huang@intel.com>
2023-06-07 15:25 ` [PATCH v11 08/20] x86/virt/tdx: Get information about TDX module and TDX-capable memory Dave Hansen
2023-06-08 0:27 ` kirill.shutemov
@ 2023-06-09 10:02 ` kirill.shutemov
2023-06-12 2:00 ` Huang, Kai
2023-06-19 13:29 ` David Hildenbrand
3 siblings, 1 reply; 144+ messages in thread
From: kirill.shutemov @ 2023-06-09 10:02 UTC (permalink / raw)
To: Kai Huang
Cc: linux-kernel, kvm, linux-mm, dave.hansen, tony.luck, peterz,
tglx, seanjc, pbonzini, david, dan.j.williams, rafael.j.wysocki,
ying.huang, reinette.chatre, len.brown, ak, isaku.yamahata,
chao.gao, sathyanarayanan.kuppuswamy, bagasdotme, sagis,
imammedo
> @@ -21,6 +23,76 @@
> */
> #define TDH_SYS_INIT 33
> #define TDH_SYS_LP_INIT 35
> +#define TDH_SYS_INFO 32
Could you keep these defines ordered? Here and all following patches.
--
Kiryl Shutsemau / Kirill A. Shutemov
^ permalink raw reply [flat|nested] 144+ messages in thread
* Re: [PATCH v11 16/20] x86/virt/tdx: Initialize all TDMRs
[not found] ` <7bd7d0c6196deb58b54d6e629603775844b1307d.1685887183.git.kai.huang@intel.com>
@ 2023-06-09 10:03 ` kirill.shutemov
0 siblings, 0 replies; 144+ messages in thread
From: kirill.shutemov @ 2023-06-09 10:03 UTC (permalink / raw)
To: Kai Huang
Cc: linux-kernel, kvm, linux-mm, dave.hansen, tony.luck, peterz,
tglx, seanjc, pbonzini, david, dan.j.williams, rafael.j.wysocki,
ying.huang, reinette.chatre, len.brown, ak, isaku.yamahata,
chao.gao, sathyanarayanan.kuppuswamy, bagasdotme, sagis,
imammedo
On Mon, Jun 05, 2023 at 02:27:29AM +1200, Kai Huang wrote:
> After the global KeyID has been configured on all packages, initialize
> all TDMRs to make all TDX-usable memory regions that are passed to the
> TDX module become usable.
>
> This is the last step of initializing the TDX module.
>
> Initializing TDMRs can be time consuming on large memory systems as it
> involves initializing all metadata entries for all pages that can be
> used by TDX guests. Initializing different TDMRs can be parallelized.
> For now to keep it simple, just initialize all TDMRs one by one. It can
> be enhanced in the future.
>
> Signed-off-by: Kai Huang <kai.huang@intel.com>
> Reviewed-by: Isaku Yamahata <isaku.yamahata@intel.com>
Reviewed-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
--
Kiryl Shutsemau / Kirill A. Shutemov
^ permalink raw reply [flat|nested] 144+ messages in thread
* Re: [PATCH v11 17/20] x86/kexec: Flush cache of TDX private memory
[not found] ` <17bcbe3e154415ee7a4c77489809a3db0c5ddf3f.1685887183.git.kai.huang@intel.com>
@ 2023-06-09 10:14 ` kirill.shutemov
0 siblings, 0 replies; 144+ messages in thread
From: kirill.shutemov @ 2023-06-09 10:14 UTC (permalink / raw)
To: Kai Huang
Cc: linux-kernel, kvm, linux-mm, dave.hansen, tony.luck, peterz,
tglx, seanjc, pbonzini, david, dan.j.williams, rafael.j.wysocki,
ying.huang, reinette.chatre, len.brown, ak, isaku.yamahata,
chao.gao, sathyanarayanan.kuppuswamy, bagasdotme, sagis,
imammedo
On Mon, Jun 05, 2023 at 02:27:30AM +1200, Kai Huang wrote:
> There are two problems in terms of using kexec() to boot to a new kernel
> when the old kernel has enabled TDX: 1) Part of the memory pages are
> still TDX private pages; 2) There might be dirty cachelines associated
> with TDX private pages.
>
> The first problem doesn't matter on the platforms w/o the "partial write
> machine check" erratum. KeyID 0 doesn't have integrity check. If the
> new kernel wants to use any non-zero KeyID, it needs to convert the
> memory to that KeyID and such conversion would work from any KeyID.
>
> However the old kernel needs to guarantee there's no dirty cacheline
> left behind before booting to the new kernel to avoid silent corruption
> from later cacheline writeback (Intel hardware doesn't guarantee cache
> coherency across different KeyIDs).
>
> There are two things that the old kernel needs to do to achieve that:
>
> 1) Stop accessing TDX private memory mappings:
> a. Stop making TDX module SEAMCALLs (TDX global KeyID);
> b. Stop TDX guests from running (per-guest TDX KeyID).
> 2) Flush any cachelines from previous TDX private KeyID writes.
>
> For 2), use wbinvd() to flush cache in stop_this_cpu(), following SME
> support. And in this way 1) happens for free as there's no TDX activity
> between wbinvd() and the native_halt().
>
> Flushing cache in stop_this_cpu() only flushes cache on remote cpus. On
> the cpu which does kexec(), unlike SME which does the cache flush in
> relocate_kernel(), do the cache flush right after stopping remote cpus
> in machine_shutdown(). This is because on the platforms with above
> erratum, the kernel needs to convert all TDX private pages back to
> normal before a fast warm reset reboot or booting to the new kernel in
> kexec(). Flushing cache in relocate_kernel() only covers the kexec()
> but not the fast warm reset reboot.
>
> Theoretically, cache flush is only needed when the TDX module has been
> initialized. However initializing the TDX module is done on demand at
> runtime, and it takes a mutex to read the module status. Just check
> whether TDX is enabled by the BIOS instead to flush cache.
>
> Signed-off-by: Kai Huang <kai.huang@intel.com>
> Reviewed-by: Isaku Yamahata <isaku.yamahata@intel.com>
Reviewed-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
--
Kiryl Shutsemau / Kirill A. Shutemov
^ permalink raw reply [flat|nested] 144+ messages in thread
* Re: [PATCH v11 19/20] x86/mce: Improve error log of kernel space TDX #MC due to erratum
[not found] ` <116cafb15625ac0bcda7b47143921d0c42061b69.1685887183.git.kai.huang@intel.com>
@ 2023-06-09 13:17 ` kirill.shutemov
2023-06-12 3:08 ` Huang, Kai
0 siblings, 1 reply; 144+ messages in thread
From: kirill.shutemov @ 2023-06-09 13:17 UTC (permalink / raw)
To: Kai Huang
Cc: linux-kernel, kvm, linux-mm, dave.hansen, tony.luck, peterz,
tglx, seanjc, pbonzini, david, dan.j.williams, rafael.j.wysocki,
ying.huang, reinette.chatre, len.brown, ak, isaku.yamahata,
chao.gao, sathyanarayanan.kuppuswamy, bagasdotme, sagis,
imammedo
On Mon, Jun 05, 2023 at 02:27:32AM +1200, Kai Huang wrote:
> The first few generations of TDX hardware have an erratum. Triggering
> it in Linux requires some kind of kernel bug involving relatively exotic
> memory writes to TDX private memory and will manifest via
> spurious-looking machine checks when reading the affected memory.
>
> == Background ==
>
> Virtually all kernel memory accesses operations happen in full
> cachelines. In practice, writing a "byte" of memory usually reads a 64
> byte cacheline of memory, modifies it, then writes the whole line back.
> Those operations do not trigger this problem.
>
> This problem is triggered by "partial" writes where a write transaction
> of less than cacheline lands at the memory controller. The CPU does
> these via non-temporal write instructions (like MOVNTI), or through
> UC/WC memory mappings. The issue can also be triggered away from the
> CPU by devices doing partial writes via DMA.
>
> == Problem ==
>
> A partial write to a TDX private memory cacheline will silently "poison"
> the line. Subsequent reads will consume the poison and generate a
> machine check. According to the TDX hardware spec, neither of these
> things should have happened.
>
> To add insult to injury, the Linux machine check code will present these as a
> literal "Hardware error" when they were, in fact, a software-triggered
> issue.
>
> == Solution ==
>
> In the end, this issue is hard to trigger. Rather than do something
> rash (and incomplete) like unmap TDX private memory from the direct map,
> improve the machine check handler.
>
> Currently, the #MC handler doesn't distinguish whether the memory is
> TDX private memory or not but just dumps, for instance, the message below:
>
> [...] mce: [Hardware Error]: CPU 147: Machine Check Exception: f Bank 1: bd80000000100134
> [...] mce: [Hardware Error]: RIP 10:<ffffffffadb69870> {__tlb_remove_page_size+0x10/0xa0}
> ...
> [...] mce: [Hardware Error]: Run the above through 'mcelog --ascii'
> [...] mce: [Hardware Error]: Machine check: Data load in unrecoverable area of kernel
> [...] Kernel panic - not syncing: Fatal local machine check
>
> Which says "Hardware Error" and "Data load in unrecoverable area of
> kernel".
>
> Ideally, it's better for the log to say "software bug around TDX private
> memory" instead of "Hardware Error". But in reality the real hardware
> memory error can happen, and sadly such software-triggered #MC cannot be
> distinguished from the real hardware error. Also, the error message is
> used by userspace tool 'mcelog' to parse, so changing the output may
> break userspace.
>
> So keep the "Hardware Error". The "Data load in unrecoverable area of
> kernel" is also helpful, so keep it too.
>
> Instead of modifying above error log, improve the error log by printing
> additional TDX related message to make the log like:
>
> ...
> [...] mce: [Hardware Error]: Machine check: Data load in unrecoverable area of kernel
> [...] mce: [Hardware Error]: Machine Check: Memory error from TDX private memory. May be result of CPU erratum.
The message mentions one part of the issue -- the CPU erratum -- but misses
the other required part -- the kernel bug that makes the kernel access memory
it is not supposed to.
--
Kiryl Shutsemau / Kirill A. Shutemov
^ permalink raw reply [flat|nested] 144+ messages in thread
* Re: [PATCH v11 18/20] x86: Handle TDX erratum to reset TDX private memory during kexec() and reboot
[not found] ` <5aa7506d4fedbf625e3fe8ceeb88af3be1ce97ea.1685887183.git.kai.huang@intel.com>
@ 2023-06-09 13:23 ` kirill.shutemov
2023-06-12 3:06 ` Huang, Kai
2023-06-14 9:33 ` Huang, Kai
1 sibling, 1 reply; 144+ messages in thread
From: kirill.shutemov @ 2023-06-09 13:23 UTC (permalink / raw)
To: Kai Huang
Cc: linux-kernel, kvm, linux-mm, dave.hansen, tony.luck, peterz,
tglx, seanjc, pbonzini, david, dan.j.williams, rafael.j.wysocki,
ying.huang, reinette.chatre, len.brown, ak, isaku.yamahata,
chao.gao, sathyanarayanan.kuppuswamy, bagasdotme, sagis,
imammedo
On Mon, Jun 05, 2023 at 02:27:31AM +1200, Kai Huang wrote:
> diff --git a/arch/x86/virt/vmx/tdx/tdx.c b/arch/x86/virt/vmx/tdx/tdx.c
> index 8ff07256a515..0aa413b712e8 100644
> --- a/arch/x86/virt/vmx/tdx/tdx.c
> +++ b/arch/x86/virt/vmx/tdx/tdx.c
> @@ -587,6 +587,14 @@ static int tdmr_set_up_pamt(struct tdmr_info *tdmr,
> tdmr_pamt_base += pamt_size[pgsz];
> }
>
> + /*
> + * tdx_memory_shutdown() also reads TDMR's PAMT during
> + * kexec() or reboot, which could happen at anytime, even
> + * during this particular code. Make sure pamt_4k_base
> + * is firstly set otherwise tdx_memory_shutdown() may
> + * get an invalid PAMT base when it sees a valid number
> + * of PAMT pages.
> + */
Hmm? What prevents the compiler from messing this up? It can reorder the
stores as it wishes, no?
Maybe add proper locking? Anything that prevents preemption would do,
right?
> tdmr->pamt_4k_base = pamt_base[TDX_PS_4K];
> tdmr->pamt_4k_size = pamt_size[TDX_PS_4K];
> tdmr->pamt_2m_base = pamt_base[TDX_PS_2M];
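To illustrate the ordering concern, here is a minimal user-space sketch in C11 of a "publish base first, then size" pattern with explicit release/acquire ordering. It is not what the patch does and not kernel code: 'struct pamt_pub' and both helpers are hypothetical, and in the kernel the closest equivalents would be smp_store_release()/smp_load_acquire() or a lock, as suggested above.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdint.h>

/* Hypothetical, simplified view of one PAMT base/size pair. */
struct pamt_pub {
	uint64_t base;		/* written first */
	_Atomic uint64_t size;	/* published last; 0 means "no PAMT yet" */
};

/* Writer side: make @base visible before a non-zero @size can be seen. */
static void pamt_publish(struct pamt_pub *p, uint64_t base, uint64_t size)
{
	assert(size != 0);

	p->base = base;
	/* Release: the store to ->base cannot be reordered after this store. */
	atomic_store_explicit(&p->size, size, memory_order_release);
}

/* Reader side: returns 1 and fills @base/@size if a PAMT was published. */
static int pamt_read(struct pamt_pub *p, uint64_t *base, uint64_t *size)
{
	/* Acquire: pairs with the release store in pamt_publish(). */
	uint64_t sz = atomic_load_explicit(&p->size, memory_order_acquire);

	if (!sz)
		return 0;

	*base = p->base;
	*size = sz;
	return 1;
}
```

With plain assignments and only a comment, neither the compiler nor the CPU is obliged to keep the two stores in program order; the release/acquire pair is what turns "set the base first" into a guarantee for a reader that observes a non-zero size.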
--
Kiryl Shutsemau / Kirill A. Shutemov
^ permalink raw reply [flat|nested] 144+ messages in thread
* Re: [PATCH v11 06/20] x86/virt/tdx: Handle SEAMCALL running out of entropy error
[not found] ` <9b3582c9f3a81ae68b32d9997fcd20baecb63b9b.1685887183.git.kai.huang@intel.com>
` (2 preceding siblings ...)
2023-06-08 0:08 ` kirill.shutemov
@ 2023-06-09 14:42 ` Nikolay Borisov
2023-06-12 11:04 ` Huang, Kai
2023-06-19 13:00 ` David Hildenbrand
4 siblings, 1 reply; 144+ messages in thread
From: Nikolay Borisov @ 2023-06-09 14:42 UTC (permalink / raw)
To: Kai Huang, linux-kernel, kvm
Cc: linux-mm, dave.hansen, kirill.shutemov, tony.luck, peterz, tglx,
seanjc, pbonzini, david, dan.j.williams, rafael.j.wysocki,
ying.huang, reinette.chatre, len.brown, ak, isaku.yamahata,
chao.gao, sathyanarayanan.kuppuswamy, bagasdotme, sagis,
imammedo
On 4.06.23 г. 17:27 ч., Kai Huang wrote:
> Certain SEAMCALL leaf functions may return error due to running out of
> entropy, in which case the SEAMCALL should be retried as suggested by
> the TDX spec.
>
> Handle this case in SEAMCALL common function. Mimic the existing
> rdrand_long() to retry RDRAND_RETRY_LOOPS times.
>
> Signed-off-by: Kai Huang <kai.huang@intel.com>
> ---
>
> v10 -> v11:
> - New patch
>
> ---
> arch/x86/virt/vmx/tdx/tdx.c | 15 ++++++++++++++-
> arch/x86/virt/vmx/tdx/tdx.h | 17 +++++++++++++++++
> 2 files changed, 31 insertions(+), 1 deletion(-)
>
<snip>
> diff --git a/arch/x86/virt/vmx/tdx/tdx.h b/arch/x86/virt/vmx/tdx/tdx.h
> index 48ad1a1ba737..55dbb1b8c971 100644
> --- a/arch/x86/virt/vmx/tdx/tdx.h
> +++ b/arch/x86/virt/vmx/tdx/tdx.h
> @@ -4,6 +4,23 @@
>
> #include <linux/types.h>
>
> +/*
> + * This file contains both macros and data structures defined by the TDX
> + * architecture and Linux defined software data structures and functions.
> + * The two should not be mixed together for better readability. The
> + * architectural definitions come first.
> + */
> +
> +/*
> + * TDX SEAMCALL error codes
> + */
> +#define TDX_RND_NO_ENTROPY 0x8000020300000000ULL
Where is this return value documented? In the TDX module 1.0 spec only
8000020[123]00000000 are specified, and there's 80000800
(TDX_KEY_GENERATION_FAILED), whose description mentions a possible
failure due to lack of entropy.
<snip>
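The retry behaviour described in the changelog can be sketched as a bounded loop. This is a user-space illustration under assumptions, not the patch itself: 'seamcall_fn' is a hypothetical stand-in for the real SEAMCALL wrapper, SEAMCALL_RETRY_LOOPS is a made-up name standing in for RDRAND_RETRY_LOOPS with an assumed value of 10, and fake_seamcall() exists only so the loop can be exercised.

```c
#include <assert.h>
#include <stdint.h>

#define TDX_RND_NO_ENTROPY	0x8000020300000000ULL	/* value from the patch */
#define SEAMCALL_RETRY_LOOPS	10	/* hypothetical; stands in for RDRAND_RETRY_LOOPS */

/* Hypothetical stand-in for the low-level SEAMCALL function. */
typedef uint64_t (*seamcall_fn)(void *ctx);

/* Call @fn, retrying a bounded number of times while it reports no entropy. */
static uint64_t seamcall_retry(seamcall_fn fn, void *ctx)
{
	int retry = SEAMCALL_RETRY_LOOPS;
	uint64_t ret;

	assert(fn);

	do {
		ret = fn(ctx);
	} while (ret == TDX_RND_NO_ENTROPY && --retry);

	return ret;
}

/* Test double: reports "no entropy" *ctx times, then succeeds with 0. */
static uint64_t fake_seamcall(void *ctx)
{
	int *failures_left = ctx;

	if (*failures_left > 0) {
		(*failures_left)--;
		return TDX_RND_NO_ENTROPY;
	}

	return 0;
}
```

The loop makes at most SEAMCALL_RETRY_LOOPS calls in total and hands the last status back to the caller, so a persistent lack of entropy still surfaces as TDX_RND_NO_ENTROPY instead of spinning forever.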
^ permalink raw reply [flat|nested] 144+ messages in thread
* Re: [PATCH v11 08/20] x86/virt/tdx: Get information about TDX module and TDX-capable memory
2023-06-08 13:13 ` Dave Hansen
@ 2023-06-12 2:00 ` Huang, Kai
0 siblings, 0 replies; 144+ messages in thread
From: Huang, Kai @ 2023-06-12 2:00 UTC (permalink / raw)
To: kirill.shutemov, Hansen, Dave
Cc: kvm, Luck, Tony, david, bagasdotme, ak, Wysocki, Rafael J,
linux-kernel, Chatre, Reinette, Christopherson,,
Sean, pbonzini, tglx, Yamahata, Isaku, linux-mm, peterz,
imammedo, Shahar, Sagi, Gao, Chao, Brown, Len,
sathyanarayanan.kuppuswamy, Huang, Ying, Williams, Dan J
On Thu, 2023-06-08 at 06:13 -0700, Dave Hansen wrote:
> On 6/8/23 04:41, kirill.shutemov@linux.intel.com wrote:
> > These structures take 1.5K of memory and the memory will be allocated for
> > all machines that boots the kernel with TDX enabled, regardless if the
> > machine has TDX or not. It seems very wasteful to me.
>
> Actually, those variables are in .bss. They're allocated forever for
> anyone that runs a kernel that has TDX support.
>
Hi Dave/Kirill,
Thanks for the feedback.
My understanding is you both prefer dynamic allocation. I'll change to use
that. Also I will free them after module initialization as for now they are
only used by module initialization.
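A rough user-space analogue of the allocate, use, free pattern described above is sketched below. Everything in it is an assumption for illustration: the 1024-byte size and alignment are taken from the alignment mentioned earlier in the thread, init_module_sketch() is a hypothetical name, and real kernel code would use the kernel allocators rather than aligned_alloc()/free().

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

#define TDSYSINFO_SIZE	1024	/* assumed size and alignment of the buffer */

/*
 * Allocate the buffer only for the duration of initialization and free it
 * before returning. Returns 0 on success, -1 on allocation failure.
 */
static int init_module_sketch(void)
{
	void *sysinfo = aligned_alloc(TDSYSINFO_SIZE, TDSYSINFO_SIZE);

	if (!sysinfo)
		return -1;

	/* The buffer must honour the assumed alignment requirement. */
	assert(((uintptr_t)sysinfo % TDSYSINFO_SIZE) == 0);
	memset(sysinfo, 0, TDSYSINFO_SIZE);

	/* ... the module information would be read and consumed here ... */

	free(sysinfo);
	return 0;
}
```

Compared with function-local static variables, nothing stays resident after initialization, which is the concern raised about .bss in the quoted reply.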
Please let me know if you have any comments.
^ permalink raw reply [flat|nested] 144+ messages in thread
* Re: [PATCH v11 08/20] x86/virt/tdx: Get information about TDX module and TDX-capable memory
2023-06-09 10:02 ` kirill.shutemov
@ 2023-06-12 2:00 ` Huang, Kai
0 siblings, 0 replies; 144+ messages in thread
From: Huang, Kai @ 2023-06-12 2:00 UTC (permalink / raw)
To: kirill.shutemov
Cc: kvm, Hansen, Dave, david, bagasdotme, ak, Wysocki, Rafael J,
linux-kernel, Chatre, Reinette, Christopherson,,
Sean, pbonzini, tglx, linux-mm, Yamahata, Isaku, Luck, Tony,
peterz, Shahar, Sagi, imammedo, Gao, Chao, Brown, Len,
sathyanarayanan.kuppuswamy, Huang, Ying, Williams, Dan J
On Fri, 2023-06-09 at 13:02 +0300, kirill.shutemov@linux.intel.com wrote:
> > @@ -21,6 +23,76 @@
> > */
> > #define TDH_SYS_INIT 33
> > #define TDH_SYS_LP_INIT 35
> > +#define TDH_SYS_INFO 32
>
> Could you keep these defines ordered? Here and all following patches.
>
Sure will do. Thanks.
* Re: [PATCH v11 10/20] x86/virt/tdx: Add placeholder to construct TDMRs to cover all TDX memory regions
2023-06-08 22:52 ` kirill.shutemov
@ 2023-06-12 2:21 ` Huang, Kai
2023-06-12 3:01 ` Dave Hansen
0 siblings, 1 reply; 144+ messages in thread
From: Huang, Kai @ 2023-06-12 2:21 UTC (permalink / raw)
To: kirill.shutemov
Cc: kvm, Hansen, Dave, david, bagasdotme, ak, Wysocki, Rafael J,
linux-kernel, Chatre, Reinette, Christopherson,,
Sean, pbonzini, tglx, linux-mm, Yamahata, Isaku, Luck, Tony,
peterz, Shahar, Sagi, imammedo, Gao, Chao, Brown, Len,
sathyanarayanan.kuppuswamy, Huang, Ying, Williams, Dan J
On Fri, 2023-06-09 at 01:52 +0300, kirill.shutemov@linux.intel.com wrote:
> On Mon, Jun 05, 2023 at 02:27:23AM +1200, Kai Huang wrote:
> > @@ -50,6 +51,8 @@ static DEFINE_MUTEX(tdx_module_lock);
> > /* All TDX-usable memory regions. Protected by mem_hotplug_lock. */
> > static LIST_HEAD(tdx_memlist);
> >
> > +static struct tdmr_info_list tdx_tdmr_list;
> > +
> > /*
> > * Wrapper of __seamcall() to convert SEAMCALL leaf function error code
> > * to kernel error code. @seamcall_ret and @out contain the SEAMCALL
>
> The name is misleading. It is not list, it is an array.
I followed Dave's suggestion in v7:
https://lore.kernel.org/lkml/d84ad1d2-83f9-dab5-5639-8d71f382e3ff@intel.com/
Quote the relevant part here:
"
> +static struct tdmr_info *tdmr_array_entry(struct tdmr_info *tdmr_array,
> + int idx)
> +{
> + return (struct tdmr_info *)((unsigned long)tdmr_array +
> + cal_tdmr_size() * idx);
> +}
FWIW, I think it's probably a bad idea to have 'struct tdmr_info *'
types floating around since:
tdmr_info_array[0]
works, but:
tdmr_info_array[1]
will blow up in your face. It would almost make sense to have
struct tdmr_info_list {
struct tdmr_info *first_tdmr;
}
and then pass around pointers to the 'struct tdmr_info_list'. Maybe
that's overkill, but it is kinda silly to call something an array if []
doesn't work on it.
"
Personally, I think 'list' is also fine (e.g., the name can simply be read in
its plain-English sense).
Hi Dave,
Should I change the name to "tdmr_info_array"?
>
>
> ...
>
> > @@ -112,6 +135,15 @@ struct tdx_memblock {
> > unsigned long end_pfn;
> > };
> >
> > +struct tdmr_info_list {
> > + void *tdmrs; /* Flexible array to hold 'tdmr_info's */
> > + int nr_consumed_tdmrs; /* How many 'tdmr_info's are in use */
> > +
> > + /* Metadata for finding target 'tdmr_info' and freeing @tdmrs */
> > + int tdmr_sz; /* Size of one 'tdmr_info' */
> > + int max_tdmrs; /* How many 'tdmr_info's are allocated */
> > +};
> > +
> > struct tdx_module_output;
> > u64 __seamcall(u64 fn, u64 rcx, u64 rdx, u64 r8, u64 r9,
> > struct tdx_module_output *out);
>
> Otherwise, looks okay.
>
> Reviewed-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
>
* Re: [PATCH v11 11/20] x86/virt/tdx: Fill out TDMRs to cover all TDX memory regions
2023-06-08 23:02 ` kirill.shutemov
@ 2023-06-12 2:25 ` Huang, Kai
0 siblings, 0 replies; 144+ messages in thread
From: Huang, Kai @ 2023-06-12 2:25 UTC (permalink / raw)
To: kirill.shutemov
Cc: kvm, Hansen, Dave, david, bagasdotme, ak, Wysocki, Rafael J,
linux-kernel, Chatre, Reinette, Christopherson,,
Sean, pbonzini, tglx, linux-mm, Yamahata, Isaku, Luck, Tony,
peterz, Shahar, Sagi, imammedo, Gao, Chao, Brown, Len,
sathyanarayanan.kuppuswamy, Huang, Ying, Williams, Dan J
On Fri, 2023-06-09 at 02:02 +0300, kirill.shutemov@linux.intel.com wrote:
> On Mon, Jun 05, 2023 at 02:27:24AM +1200, Kai Huang wrote:
> > +#define TDMR_ALIGNMENT BIT_ULL(30)
>
> Nit: SZ_1G can be a little bit more readable here.
Will do.
>
> Anyway:
>
> Reviewed-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
>
>
Thanks.
* Re: [PATCH v11 11/20] x86/virt/tdx: Fill out TDMRs to cover all TDX memory regions
2023-06-09 4:01 ` Sathyanarayanan Kuppuswamy
@ 2023-06-12 2:28 ` Huang, Kai
0 siblings, 0 replies; 144+ messages in thread
From: Huang, Kai @ 2023-06-12 2:28 UTC (permalink / raw)
To: sathyanarayanan.kuppuswamy, kvm, linux-kernel
Cc: Hansen, Dave, david, bagasdotme, ak, Wysocki, Rafael J,
kirill.shutemov, Chatre, Reinette, Christopherson,,
Sean, pbonzini, linux-mm, Yamahata, Isaku, tglx, Luck, Tony,
peterz, Shahar, Sagi, imammedo, Gao, Chao, Brown, Len, Huang,
Ying, Williams, Dan J
On Thu, 2023-06-08 at 21:01 -0700, Sathyanarayanan Kuppuswamy wrote:
>
> On 6/4/23 7:27 AM, Kai Huang wrote:
> > Start to transit out the "multi-steps" to construct a list of "TD Memory
> > Regions" (TDMRs) to cover all TDX-usable memory regions.
> >
> > The kernel configures TDX-usable memory regions by passing a list of
> > TDMRs "TD Memory Regions" (TDMRs) to the TDX module. Each TDMR contains
> > the information of the base/size of a memory region, the base/size of the
> > associated Physical Address Metadata Table (PAMT) and a list of reserved
> > areas in the region.
> >
> > Do the first step to fill out a number of TDMRs to cover all TDX memory
> > regions. To keep it simple, always try to use one TDMR for each memory
> > region. As the first step only set up the base/size for each TDMR.
>
> As a first step?
Not sure there are two or more first steps? I think I'll keep it as is.
[...]
> > +#define TDMR_ALIGNMENT BIT_ULL(30)
> > +#define TDMR_PFN_ALIGNMENT (TDMR_ALIGNMENT >> PAGE_SHIFT)
>
> This macro is never used. Maybe you can drop it from this patch.
OK will do.
[...]
>
> Rest looks good to me.
>
> Reviewed-by: Kuppuswamy Sathyanarayanan <sathyanarayanan.kuppuswamy@linux.intel.com>
>
Thanks.
* Re: [PATCH v11 11/20] x86/virt/tdx: Fill out TDMRs to cover all TDX memory regions
2023-06-08 13:11 ` Dave Hansen
@ 2023-06-12 2:33 ` Huang, Kai
2023-06-12 14:33 ` kirill.shutemov
0 siblings, 1 reply; 144+ messages in thread
From: Huang, Kai @ 2023-06-12 2:33 UTC (permalink / raw)
To: kvm, Hansen, Dave, linux-kernel
Cc: Luck, Tony, david, bagasdotme, ak, Wysocki, Rafael J,
kirill.shutemov, Chatre, Reinette, Christopherson,,
Sean, pbonzini, tglx, Yamahata, Isaku, linux-mm, Shahar, Sagi,
peterz, imammedo, Gao, Chao, Brown, Len,
sathyanarayanan.kuppuswamy, Huang, Ying, Williams, Dan J
>
> Maybe not even a pr_warn(), but something that's a bit ominous and has a
> chance of getting users to act.
Sorry, I am not sure how to do that. Could you give some suggestions?
* Re: [PATCH v11 12/20] x86/virt/tdx: Allocate and set up PAMTs for TDMRs
2023-06-08 23:43 ` Dave Hansen
@ 2023-06-12 2:52 ` Huang, Kai
0 siblings, 0 replies; 144+ messages in thread
From: Huang, Kai @ 2023-06-12 2:52 UTC (permalink / raw)
To: kirill.shutemov, Hansen, Dave
Cc: kvm, Luck, Tony, david, bagasdotme, ak, Wysocki, Rafael J,
linux-kernel, Chatre, Reinette, Christopherson,,
Sean, pbonzini, linux-mm, Yamahata, Isaku, tglx, peterz, Shahar,
Sagi, imammedo, Gao, Chao, Brown, Len,
sathyanarayanan.kuppuswamy, Huang, Ying, Williams, Dan J
On Thu, 2023-06-08 at 16:43 -0700, Dave Hansen wrote:
> On 6/8/23 16:24, kirill.shutemov@linux.intel.com wrote:
> > > ret = -EINVAL;
> > > + if (ret)
> > > + tdmrs_free_pamt_all(&tdx_tdmr_list);
> > > + else
> > > + pr_info("%lu KBs allocated for PAMT.\n",
> > > + tdmrs_count_pamt_pages(&tdx_tdmr_list) * 4);
> > "* 4"? This is very cryptic. procfs uses "<< (PAGE_SHIFT - 10)" which
> > is slightly less magic to me. And just make the helper return kilobytes
> > to begin with, if it is the only caller.
>
> Let's look at where this data comes from:
>
> +static unsigned long tdmrs_count_pamt_pages(struct tdmr_info_list
> *tdmr_list)
> +{
> + unsigned long pamt_npages = 0;
> + int i;
> +
> + for (i = 0; i < tdmr_list->nr_consumed_tdmrs; i++) {
> + unsigned long pfn, npages;
> +
> + tdmr_get_pamt(tdmr_entry(tdmr_list, i), &pfn, &npages);
> + pamt_npages += npages;
> + }
>
> OK, so tdmr_get_pamt() is getting it in pages. How is it *stored*?
>
> +static void tdmr_get_pamt(struct tdmr_info *tdmr, unsigned long *pamt_pfn,
> + unsigned long *pamt_npages)
> +{
> ...
> + pamt_sz = tdmr->pamt_4k_size + tdmr->pamt_2m_size + tdmr->pamt_1g_size;
> +
> + *pamt_pfn = PHYS_PFN(pamt_base);
> + *pamt_npages = pamt_sz >> PAGE_SHIFT;
> +}
>
> Oh, it's actually stored in bytes. So to print it out you actually
> convert it from bytes->pages->kbytes. Not the best.
>
> If tdmr_get_pamt() just returned 'pamt_size_bytes', you could do one
> conversion at:
>
> free_contig_range(pamt_pfn, pamt_size_bytes >> PAGE_SHIFT);
I thought making tdmr_get_pamt() return pamt_pfn and pamt_npages would make it
clearer that PAMTs must be in 4K granularity, but I guess it doesn't matter
anyway.
If we return bytes for the PAMT size, I think we should also return a physical
address instead of a PFN for the PAMT start?
I'll change tdmr_get_pamt() to return physical address and bytes for PAMT
location and size respectively. Please let me know if you have any comments.
>
> and since tdmrs_count_pamt_pages() has only one caller you can just make
> it: tdmrs_count_pamt_kb(). The print becomes:
>
> pr_info("%lu KBs allocated for PAMT.\n",
> tdmrs_count_pamt_kb(&tdx_tdmr_list));
>
> and tdmrs_count_pamt_kb() does something super fancy like:
>
> return pamt_size_bytes / 1024;
>
> which makes total complete obvious sense and needs zero explanation.
Will do.
Thanks for the feedback.
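For illustration, a userspace sketch of the bytes-based approach discussed
above: keep the PAMT size in bytes and convert once at each point of use. The
struct and helper names, and the 4 KiB page assumption, are hypothetical and
are not the series' actual code:

```c
#include <assert.h>
#include <stdint.h>

/* Assumption for this sketch: 4 KiB pages, as on x86-64. */
#define SKETCH_PAGE_SHIFT 12

/* Hypothetical stand-in for the three PAMT size fields of a TDMR. */
struct pamt_sizes {
	uint64_t sz_4k;
	uint64_t sz_2m;
	uint64_t sz_1g;
};

/* Total PAMT size in bytes, which is how the sizes are stored. */
static uint64_t pamt_size_bytes(const struct pamt_sizes *p)
{
	return p->sz_4k + p->sz_2m + p->sz_1g;
}

/* One conversion for the log message: bytes -> KB. */
static uint64_t pamt_size_kb(const struct pamt_sizes *p)
{
	return pamt_size_bytes(p) / 1024;
}

/* One conversion for the free path: bytes -> pages. */
static uint64_t pamt_size_pages(const struct pamt_sizes *p)
{
	return pamt_size_bytes(p) >> SKETCH_PAGE_SHIFT;
}
```

With sizes of 8192, 4096 and 4096 bytes, pamt_size_kb() gives 16 and
pamt_size_pages() gives 4, with no intermediate pages->KB multiplication.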
* Re: [PATCH v11 10/20] x86/virt/tdx: Add placeholder to construct TDMRs to cover all TDX memory regions
2023-06-12 2:21 ` Huang, Kai
@ 2023-06-12 3:01 ` Dave Hansen
0 siblings, 0 replies; 144+ messages in thread
From: Dave Hansen @ 2023-06-12 3:01 UTC (permalink / raw)
To: Huang, Kai, kirill.shutemov
Cc: kvm, david, bagasdotme, ak, Wysocki, Rafael J, linux-kernel,
Chatre, Reinette, Christopherson,,
Sean, pbonzini, tglx, linux-mm, Yamahata, Isaku, Luck, Tony,
peterz, Shahar, Sagi, imammedo, Gao, Chao, Brown, Len,
sathyanarayanan.kuppuswamy, Huang, Ying, Williams, Dan J
On 6/11/23 19:21, Huang, Kai wrote:
> Should I change the name to "tdmr_info_array"?
If foo[bar] works on it, then it's an array. If not, then it's not an
array.
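A minimal userspace sketch of the point above, assuming entries whose real
size is only known at run time (as with cal_tdmr_size() in the quoted v7
code). The names entry, entry_list, entry_at() and entry_list_alloc() are
hypothetical and not taken from the series; the sketch only shows why plain
[] indexing uses the wrong stride:

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

/*
 * Hypothetical stand-in for 'struct tdmr_info'. The real entry is followed
 * by a run-time-sized tail, so sizeof() under-reports the stride.
 */
struct entry {
	unsigned long base;
	unsigned long size;
};

struct entry_list {
	void *entries;   /* backing buffer of nr * entry_sz bytes */
	size_t entry_sz; /* real stride, only known at run time */
	size_t nr;
};

/* Index by byte stride; '((struct entry *)entries)[idx]' would be wrong. */
static struct entry *entry_at(struct entry_list *list, size_t idx)
{
	return (struct entry *)((char *)list->entries + list->entry_sz * idx);
}

/*
 * Allocate 'nr' zeroed entries, each with 'tail_sz' extra bytes. The caller
 * is assumed to pick a tail_sz that keeps the stride suitably aligned.
 * Returns 0 on success, -1 on allocation failure.
 */
static int entry_list_alloc(struct entry_list *list, size_t nr, size_t tail_sz)
{
	list->entry_sz = sizeof(struct entry) + tail_sz;
	list->nr = nr;
	list->entries = calloc(nr, list->entry_sz);
	return list->entries ? 0 : -1;
}
```

Entry 0 is found correctly either way, but for any idx > 0 only entry_at()
lands on the start of an entry, which is why the container here is not
called an array.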
* Re: [PATCH v11 18/20] x86: Handle TDX erratum to reset TDX private memory during kexec() and reboot
2023-06-09 13:23 ` [PATCH v11 18/20] x86: Handle TDX erratum to reset TDX private memory during kexec() and reboot kirill.shutemov
@ 2023-06-12 3:06 ` Huang, Kai
2023-06-12 7:58 ` kirill.shutemov
2023-06-20 8:11 ` Peter Zijlstra
0 siblings, 2 replies; 144+ messages in thread
From: Huang, Kai @ 2023-06-12 3:06 UTC (permalink / raw)
To: kirill.shutemov
Cc: kvm, Hansen, Dave, david, bagasdotme, ak, Wysocki, Rafael J,
linux-kernel, Chatre, Reinette, Christopherson,,
Sean, pbonzini, tglx, linux-mm, Yamahata, Isaku, Luck, Tony,
peterz, Shahar, Sagi, imammedo, Gao, Chao, Brown, Len,
sathyanarayanan.kuppuswamy, Huang, Ying, Williams, Dan J
On Fri, 2023-06-09 at 16:23 +0300, kirill.shutemov@linux.intel.com wrote:
> On Mon, Jun 05, 2023 at 02:27:31AM +1200, Kai Huang wrote:
> > diff --git a/arch/x86/virt/vmx/tdx/tdx.c b/arch/x86/virt/vmx/tdx/tdx.c
> > index 8ff07256a515..0aa413b712e8 100644
> > --- a/arch/x86/virt/vmx/tdx/tdx.c
> > +++ b/arch/x86/virt/vmx/tdx/tdx.c
> > @@ -587,6 +587,14 @@ static int tdmr_set_up_pamt(struct tdmr_info *tdmr,
> > tdmr_pamt_base += pamt_size[pgsz];
> > }
> >
> > + /*
> > + * tdx_memory_shutdown() also reads TDMR's PAMT during
> > + * kexec() or reboot, which could happen at anytime, even
> > + * during this particular code. Make sure pamt_4k_base
> > + * is firstly set otherwise tdx_memory_shutdown() may
> > + * get an invalid PAMT base when it sees a valid number
> > + * of PAMT pages.
> > + */
>
> Hmm? What prevents compiler from messing this up. It can reorder as it
> wishes, no?
Hmm.. Right. Sorry I missed.
>
> Maybe add a proper locking? Anything that prevent preemption would do,
> right?
>
> > tdmr->pamt_4k_base = pamt_base[TDX_PS_4K];
> > tdmr->pamt_4k_size = pamt_size[TDX_PS_4K];
> > tdmr->pamt_2m_base = pamt_base[TDX_PS_2M];
>
I think a simple memory barrier will do. How does below look?
--- a/arch/x86/virt/vmx/tdx/tdx.c
+++ b/arch/x86/virt/vmx/tdx/tdx.c
@@ -591,11 +591,12 @@ static int tdmr_set_up_pamt(struct tdmr_info *tdmr,
* tdx_memory_shutdown() also reads TDMR's PAMT during
* kexec() or reboot, which could happen at anytime, even
* during this particular code. Make sure pamt_4k_base
- * is firstly set otherwise tdx_memory_shutdown() may
- * get an invalid PAMT base when it sees a valid number
- * of PAMT pages.
+ * is firstly set and place a __mb() after it otherwise
+ * tdx_memory_shutdown() may get an invalid PAMT base
+ * when it sees a valid number of PAMT pages.
*/
tdmr->pamt_4k_base = pamt_base[TDX_PS_4K];
+ __mb();
* Re: [PATCH v11 19/20] x86/mce: Improve error log of kernel space TDX #MC due to erratum
2023-06-09 13:17 ` [PATCH v11 19/20] x86/mce: Improve error log of kernel space TDX #MC due to erratum kirill.shutemov
@ 2023-06-12 3:08 ` Huang, Kai
2023-06-12 7:59 ` kirill.shutemov
0 siblings, 1 reply; 144+ messages in thread
From: Huang, Kai @ 2023-06-12 3:08 UTC (permalink / raw)
To: kirill.shutemov
Cc: kvm, Hansen, Dave, david, bagasdotme, ak, Wysocki, Rafael J,
linux-kernel, Chatre, Reinette, Christopherson,,
Sean, pbonzini, tglx, linux-mm, Yamahata, Isaku, Luck, Tony,
peterz, Shahar, Sagi, imammedo, Gao, Chao, Brown, Len,
sathyanarayanan.kuppuswamy, Huang, Ying, Williams, Dan J
On Fri, 2023-06-09 at 16:17 +0300, kirill.shutemov@linux.intel.com wrote:
> On Mon, Jun 05, 2023 at 02:27:32AM +1200, Kai Huang wrote:
> > The first few generations of TDX hardware have an erratum. Triggering
> > it in Linux requires some kind of kernel bug involving relatively exotic
> > memory writes to TDX private memory and will manifest via
> > spurious-looking machine checks when reading the affected memory.
> >
> > == Background ==
> >
> > Virtually all kernel memory accesses operations happen in full
> > cachelines. In practice, writing a "byte" of memory usually reads a 64
> > byte cacheline of memory, modifies it, then writes the whole line back.
> > Those operations do not trigger this problem.
> >
> > This problem is triggered by "partial" writes where a write transaction
> > of less than cacheline lands at the memory controller. The CPU does
> > these via non-temporal write instructions (like MOVNTI), or through
> > UC/WC memory mappings. The issue can also be triggered away from the
> > CPU by devices doing partial writes via DMA.
> >
> > == Problem ==
> >
> > A partial write to a TDX private memory cacheline will silently "poison"
> > the line. Subsequent reads will consume the poison and generate a
> > machine check. According to the TDX hardware spec, neither of these
> > things should have happened.
> >
> > To add insult to injury, the Linux machine check code will present these as a
> > literal "Hardware error" when they were, in fact, a software-triggered
> > issue.
> >
> > == Solution ==
> >
> > In the end, this issue is hard to trigger. Rather than do something
> > rash (and incomplete) like unmap TDX private memory from the direct map,
> > improve the machine check handler.
> >
> > Currently, the #MC handler doesn't distinguish whether the memory is
> > TDX private memory or not but just dump, for instance, below message:
> >
> > [...] mce: [Hardware Error]: CPU 147: Machine Check Exception: f Bank 1: bd80000000100134
> > [...] mce: [Hardware Error]: RIP 10:<ffffffffadb69870> {__tlb_remove_page_size+0x10/0xa0}
> > ...
> > [...] mce: [Hardware Error]: Run the above through 'mcelog --ascii'
> > [...] mce: [Hardware Error]: Machine check: Data load in unrecoverable area of kernel
> > [...] Kernel panic - not syncing: Fatal local machine check
> >
> > Which says "Hardware Error" and "Data load in unrecoverable area of
> > kernel".
> >
> > Ideally, it's better for the log to say "software bug around TDX private
> > memory" instead of "Hardware Error". But in reality the real hardware
> > memory error can happen, and sadly such software-triggered #MC cannot be
> > distinguished from the real hardware error. Also, the error message is
> > used by userspace tool 'mcelog' to parse, so changing the output may
> > break userspace.
> >
> > So keep the "Hardware Error". The "Data load in unrecoverable area of
> > kernel" is also helpful, so keep it too.
> >
> > Instead of modifying above error log, improve the error log by printing
> > additional TDX related message to make the log like:
> >
> > ...
> > [...] mce: [Hardware Error]: Machine check: Data load in unrecoverable area of kernel
> > [...] mce: [Hardware Error]: Machine Check: Memory error from TDX private memory. May be result of CPU erratum.
>
> The message mentions one part of the issue -- the CPU erratum -- but misses
> the other required part -- a kernel bug that makes the kernel access memory
> it is not supposed to.
>
How about below?
"Memory error from TDX private memory. May be result of CPU erratum caused by
kernel bug."
* Re: [PATCH v11 20/20] Documentation/x86: Add documentation for TDX host support
2023-06-08 23:56 ` [PATCH v11 20/20] Documentation/x86: Add documentation for TDX host support Dave Hansen
@ 2023-06-12 3:41 ` Huang, Kai
0 siblings, 0 replies; 144+ messages in thread
From: Huang, Kai @ 2023-06-12 3:41 UTC (permalink / raw)
To: kvm, Hansen, Dave, linux-kernel
Cc: Luck, Tony, david, bagasdotme, ak, Wysocki, Rafael J,
kirill.shutemov, Chatre, Reinette, Christopherson,,
Sean, pbonzini, tglx, Yamahata, Isaku, linux-mm, peterz, Shahar,
Sagi, imammedo, Gao, Chao, Brown, Len,
sathyanarayanan.kuppuswamy, Huang, Ying, Williams, Dan J
On Thu, 2023-06-08 at 16:56 -0700, Dave Hansen wrote:
> On 6/4/23 07:27, Kai Huang wrote:
> > +There is no CPUID or MSR to detect the TDX module. The kernel detects it
> > +by initializing it.
>
> Is this really what you want to say?
>
> If the module is there, SEAMCALL works. If not, it doesn't. The module
> is assumed to be there when SEAMCALL works. Right?
>
> Yeah, that first SEAMCALL is probably an initialization call, but feel
> like that's beside the point.
Thanks for reviewing the documentation.
I guess I don't need to mention the "detection" part at all?
-TDX module detection and initialization
+TDX module initialization
---------------------------------------
-There is no CPUID or MSR to detect the TDX module. The kernel detects it
-by initializing it.
-
The kernel talks to the TDX module via the new SEAMCALL instruction. The
TDX module implements SEAMCALL leaf functions to allow the kernel to
initialize it.
+If the TDX module isn't loaded, the SEAMCALL instruction fails with a
+special error. In this case the kernel fails the module initialization
+and reports the module isn't loaded::
+
+ [..] tdx: Module isn't loaded.
* Re: [PATCH v11 18/20] x86: Handle TDX erratum to reset TDX private memory during kexec() and reboot
2023-06-12 3:06 ` Huang, Kai
@ 2023-06-12 7:58 ` kirill.shutemov
2023-06-12 10:27 ` Huang, Kai
2023-06-20 8:11 ` Peter Zijlstra
1 sibling, 1 reply; 144+ messages in thread
From: kirill.shutemov @ 2023-06-12 7:58 UTC (permalink / raw)
To: Huang, Kai
Cc: kvm, Hansen, Dave, david, bagasdotme, ak, Wysocki, Rafael J,
linux-kernel, Chatre, Reinette, Christopherson,,
Sean, pbonzini, tglx, linux-mm, Yamahata, Isaku, Luck, Tony,
peterz, Shahar, Sagi, imammedo, Gao, Chao, Brown, Len,
sathyanarayanan.kuppuswamy, Huang, Ying, Williams, Dan J
On Mon, Jun 12, 2023 at 03:06:48AM +0000, Huang, Kai wrote:
> On Fri, 2023-06-09 at 16:23 +0300, kirill.shutemov@linux.intel.com wrote:
> > On Mon, Jun 05, 2023 at 02:27:31AM +1200, Kai Huang wrote:
> > > diff --git a/arch/x86/virt/vmx/tdx/tdx.c b/arch/x86/virt/vmx/tdx/tdx.c
> > > index 8ff07256a515..0aa413b712e8 100644
> > > --- a/arch/x86/virt/vmx/tdx/tdx.c
> > > +++ b/arch/x86/virt/vmx/tdx/tdx.c
> > > @@ -587,6 +587,14 @@ static int tdmr_set_up_pamt(struct tdmr_info *tdmr,
> > > tdmr_pamt_base += pamt_size[pgsz];
> > > }
> > >
> > > + /*
> > > + * tdx_memory_shutdown() also reads TDMR's PAMT during
> > > + * kexec() or reboot, which could happen at anytime, even
> > > + * during this particular code. Make sure pamt_4k_base
> > > + * is firstly set otherwise tdx_memory_shutdown() may
> > > + * get an invalid PAMT base when it sees a valid number
> > > + * of PAMT pages.
> > > + */
> >
> > Hmm? What prevents compiler from messing this up. It can reorder as it
> > wishes, no?
>
> Hmm.. Right. Sorry I missed.
>
> >
> > Maybe add a proper locking? Anything that prevent preemption would do,
> > right?
> >
> > > tdmr->pamt_4k_base = pamt_base[TDX_PS_4K];
> > > tdmr->pamt_4k_size = pamt_size[TDX_PS_4K];
> > > tdmr->pamt_2m_base = pamt_base[TDX_PS_2M];
> >
>
> I think a simple memory barrier will do. How does below look?
>
> --- a/arch/x86/virt/vmx/tdx/tdx.c
> +++ b/arch/x86/virt/vmx/tdx/tdx.c
> @@ -591,11 +591,12 @@ static int tdmr_set_up_pamt(struct tdmr_info *tdmr,
> * tdx_memory_shutdown() also reads TDMR's PAMT during
> * kexec() or reboot, which could happen at anytime, even
> * during this particular code. Make sure pamt_4k_base
> - * is firstly set otherwise tdx_memory_shutdown() may
> - * get an invalid PAMT base when it sees a valid number
> - * of PAMT pages.
> + * is firstly set and place a __mb() after it otherwise
> + * tdx_memory_shutdown() may get an invalid PAMT base
> + * when it sees a valid number of PAMT pages.
> */
> tdmr->pamt_4k_base = pamt_base[TDX_PS_4K];
> + __mb();
If you want to play with barriers, assign pamt_4k_base last with
smp_store_release() and read it first in tdmr_get_pamt() with
smp_load_acquire(). If it is non-zero, all pamt_* fields are valid.
Or just drop this nonsense and use a spin lock for serialization.
--
Kiryl Shutsemau / Kirill A. Shutemov
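As a rough userspace analogue of the release/acquire pattern suggested above,
here is a sketch that uses C11 atomics in place of the kernel's
smp_store_release()/smp_load_acquire(). The struct layout and the function
names are made up for the example and are not the series' code:

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdint.h>

/* Hypothetical stand-in for the PAMT fields of 'struct tdmr_info'. */
struct pamt_desc {
	uint64_t size_4k;
	uint64_t size_2m;
	uint64_t size_1g;
	_Atomic uint64_t base; /* published last; 0 means "not ready" */
};

/* Writer: fill in the sizes, then publish the base with release order. */
static void pamt_publish(struct pamt_desc *d, uint64_t base,
			 uint64_t s4k, uint64_t s2m, uint64_t s1g)
{
	d->size_4k = s4k;
	d->size_2m = s2m;
	d->size_1g = s1g;
	atomic_store_explicit(&d->base, base, memory_order_release);
}

/*
 * Reader: acquire the base first. A non-zero base implies the size stores
 * that preceded the release store are visible too.
 * Returns 1 and fills the outputs if published, 0 otherwise.
 */
static int pamt_read(struct pamt_desc *d, uint64_t *base, uint64_t *total)
{
	uint64_t b = atomic_load_explicit(&d->base, memory_order_acquire);

	if (!b)
		return 0;
	*base = b;
	*total = d->size_4k + d->size_2m + d->size_1g;
	return 1;
}
```

The design choice is that a single field acts as the "ready" flag: the
reader either sees nothing (base == 0) or a fully initialized descriptor,
never a base/size mix.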
* Re: [PATCH v11 19/20] x86/mce: Improve error log of kernel space TDX #MC due to erratum
2023-06-12 3:08 ` Huang, Kai
@ 2023-06-12 7:59 ` kirill.shutemov
2023-06-12 13:51 ` Dave Hansen
0 siblings, 1 reply; 144+ messages in thread
From: kirill.shutemov @ 2023-06-12 7:59 UTC (permalink / raw)
To: Huang, Kai
Cc: kvm, Hansen, Dave, david, bagasdotme, ak, Wysocki, Rafael J,
linux-kernel, Chatre, Reinette, Christopherson,,
Sean, pbonzini, tglx, linux-mm, Yamahata, Isaku, Luck, Tony,
peterz, Shahar, Sagi, imammedo, Gao, Chao, Brown, Len,
sathyanarayanan.kuppuswamy, Huang, Ying, Williams, Dan J
On Mon, Jun 12, 2023 at 03:08:40AM +0000, Huang, Kai wrote:
> On Fri, 2023-06-09 at 16:17 +0300, kirill.shutemov@linux.intel.com wrote:
> > On Mon, Jun 05, 2023 at 02:27:32AM +1200, Kai Huang wrote:
> > > The first few generations of TDX hardware have an erratum. Triggering
> > > it in Linux requires some kind of kernel bug involving relatively exotic
> > > memory writes to TDX private memory and will manifest via
> > > spurious-looking machine checks when reading the affected memory.
> > >
> > > == Background ==
> > >
> > > Virtually all kernel memory accesses operations happen in full
> > > cachelines. In practice, writing a "byte" of memory usually reads a 64
> > > byte cacheline of memory, modifies it, then writes the whole line back.
> > > Those operations do not trigger this problem.
> > >
> > > This problem is triggered by "partial" writes where a write transaction
> > > of less than cacheline lands at the memory controller. The CPU does
> > > these via non-temporal write instructions (like MOVNTI), or through
> > > UC/WC memory mappings. The issue can also be triggered away from the
> > > CPU by devices doing partial writes via DMA.
> > >
> > > == Problem ==
> > >
> > > A partial write to a TDX private memory cacheline will silently "poison"
> > > the line. Subsequent reads will consume the poison and generate a
> > > machine check. According to the TDX hardware spec, neither of these
> > > things should have happened.
> > >
> > > > To add insult to injury, the Linux machine check code will present these as a
> > > literal "Hardware error" when they were, in fact, a software-triggered
> > > issue.
> > >
> > > == Solution ==
> > >
> > > In the end, this issue is hard to trigger. Rather than do something
> > > rash (and incomplete) like unmap TDX private memory from the direct map,
> > > improve the machine check handler.
> > >
> > > Currently, the #MC handler doesn't distinguish whether the memory is
> > > TDX private memory or not but just dump, for instance, below message:
> > >
> > > [...] mce: [Hardware Error]: CPU 147: Machine Check Exception: f Bank 1: bd80000000100134
> > > [...] mce: [Hardware Error]: RIP 10:<ffffffffadb69870> {__tlb_remove_page_size+0x10/0xa0}
> > > ...
> > > [...] mce: [Hardware Error]: Run the above through 'mcelog --ascii'
> > > [...] mce: [Hardware Error]: Machine check: Data load in unrecoverable area of kernel
> > > [...] Kernel panic - not syncing: Fatal local machine check
> > >
> > > Which says "Hardware Error" and "Data load in unrecoverable area of
> > > kernel".
> > >
> > > Ideally, it's better for the log to say "software bug around TDX private
> > > memory" instead of "Hardware Error". But in reality the real hardware
> > > memory error can happen, and sadly such software-triggered #MC cannot be
> > > distinguished from the real hardware error. Also, the error message is
> > > used by userspace tool 'mcelog' to parse, so changing the output may
> > > break userspace.
> > >
> > > So keep the "Hardware Error". The "Data load in unrecoverable area of
> > > kernel" is also helpful, so keep it too.
> > >
> > > Instead of modifying above error log, improve the error log by printing
> > > additional TDX related message to make the log like:
> > >
> > > ...
> > > [...] mce: [Hardware Error]: Machine check: Data load in unrecoverable area of kernel
> > > [...] mce: [Hardware Error]: Machine Check: Memory error from TDX private memory. May be result of CPU erratum.
> >
> > > The message mentions one part of the issue -- the CPU erratum -- but misses
> > > the other required part -- a kernel bug that makes the kernel access memory
> > > it is not supposed to.
> >
>
> How about below?
>
> "Memory error from TDX private memory. May be result of CPU erratum caused by
> kernel bug."
Fine, I guess.
--
Kiryl Shutsemau / Kirill A. Shutemov
* Re: [PATCH v11 18/20] x86: Handle TDX erratum to reset TDX private memory during kexec() and reboot
2023-06-12 7:58 ` kirill.shutemov
@ 2023-06-12 10:27 ` Huang, Kai
2023-06-12 11:48 ` kirill.shutemov
2023-06-12 13:47 ` Dave Hansen
0 siblings, 2 replies; 144+ messages in thread
From: Huang, Kai @ 2023-06-12 10:27 UTC (permalink / raw)
To: kirill.shutemov
Cc: kvm, Hansen, Dave, david, bagasdotme, ak, Wysocki, Rafael J,
Luck, Tony, Chatre, Reinette, Christopherson,,
Sean, pbonzini, tglx, Yamahata, Isaku, linux-kernel, linux-mm,
Shahar, Sagi, peterz, imammedo, Gao, Chao, Brown, Len,
sathyanarayanan.kuppuswamy, Huang, Ying, Williams, Dan J
On Mon, 2023-06-12 at 10:58 +0300, kirill.shutemov@linux.intel.com wrote:
> On Mon, Jun 12, 2023 at 03:06:48AM +0000, Huang, Kai wrote:
> > On Fri, 2023-06-09 at 16:23 +0300, kirill.shutemov@linux.intel.com wrote:
> > > On Mon, Jun 05, 2023 at 02:27:31AM +1200, Kai Huang wrote:
> > > > diff --git a/arch/x86/virt/vmx/tdx/tdx.c b/arch/x86/virt/vmx/tdx/tdx.c
> > > > index 8ff07256a515..0aa413b712e8 100644
> > > > --- a/arch/x86/virt/vmx/tdx/tdx.c
> > > > +++ b/arch/x86/virt/vmx/tdx/tdx.c
> > > > @@ -587,6 +587,14 @@ static int tdmr_set_up_pamt(struct tdmr_info *tdmr,
> > > > tdmr_pamt_base += pamt_size[pgsz];
> > > > }
> > > >
> > > > + /*
> > > > + * tdx_memory_shutdown() also reads TDMR's PAMT during
> > > > + * kexec() or reboot, which could happen at anytime, even
> > > > + * during this particular code. Make sure pamt_4k_base
> > > > + * is firstly set otherwise tdx_memory_shutdown() may
> > > > + * get an invalid PAMT base when it sees a valid number
> > > > + * of PAMT pages.
> > > > + */
> > >
> > > Hmm? What prevents compiler from messing this up. It can reorder as it
> > > wishes, no?
> >
> > Hmm.. Right. Sorry I missed.
> >
> > >
> > > Maybe add a proper locking? Anything that prevent preemption would do,
> > > right?
> > >
> > > > tdmr->pamt_4k_base = pamt_base[TDX_PS_4K];
> > > > tdmr->pamt_4k_size = pamt_size[TDX_PS_4K];
> > > > tdmr->pamt_2m_base = pamt_base[TDX_PS_2M];
> > >
> >
> > I think a simple memory barrier will do. How does below look?
> >
> > --- a/arch/x86/virt/vmx/tdx/tdx.c
> > +++ b/arch/x86/virt/vmx/tdx/tdx.c
> > @@ -591,11 +591,12 @@ static int tdmr_set_up_pamt(struct tdmr_info *tdmr,
> > * tdx_memory_shutdown() also reads TDMR's PAMT during
> > * kexec() or reboot, which could happen at anytime, even
> > * during this particular code. Make sure pamt_4k_base
> > - * is firstly set otherwise tdx_memory_shutdown() may
> > - * get an invalid PAMT base when it sees a valid number
> > - * of PAMT pages.
> > + * is firstly set and place a __mb() after it otherwise
> > + * tdx_memory_shutdown() may get an invalid PAMT base
> > + * when it sees a valid number of PAMT pages.
> > */
> > tdmr->pamt_4k_base = pamt_base[TDX_PS_4K];
> > + __mb();
>
> If you want to play with barriers, assign pamt_4k_base last with
> smp_store_release() and read it first in tdmr_get_pamt() with
> smp_load_acquire(). If it is non-zero, all pamt_* fields are valid.
>
> Or just drop this nonsense and use a spin lock for serialization.
>
We don't need to guarantee that when pamt_4k_base is valid, all other pamt_*
fields are valid. Instead, we need to guarantee that when (at least) _one_ of
pamt_*_size is valid, pamt_4k_base is valid.
For example,
pamt_4k_base -> valid
pamt_4k_size -> invalid (0)
pamt_2m_size -> invalid
pamt_1g_size -> invalid
and
pamt_4k_base -> valid
pamt_4k_size -> valid
pamt_2m_size -> invalid
pamt_1g_size -> invalid
are both OK.
The reason is that the PAMTs are only written by the TDX module in init_tdmrs().
So if tdx_memory_shutdown() sees only part of a PAMT (the second case above),
those PAMT pages are not yet TDX private pages, so converting part of the PAMT
is fine.
The invalid case is when any pamt_*_size is valid but pamt_4k_base is invalid,
e.g.:
pamt_4k_base -> invalid
pamt_4k_size -> valid
pamt_2m_size -> invalid
pamt_1g_size -> invalid
as in this case tdx_memory_shutdown() will convert an incorrect (not merely
partial) PAMT area.
So I think a __mb() after setting tdmr->pamt_4k_base should be good enough, as
it guarantees that by the time any pamt_*_size is set, the valid pamt_4k_base
will be seen by other cpus.
Does it make sense?
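To make the cases above concrete, here is a small sketch of the invariant
being argued for: a snapshot of the fields is acceptable unless some size is
set while the base is not. The struct and function names are hypothetical,
and 0 is assumed to mean "not yet written"; this only encodes the rule, it
says nothing about which barrier is sufficient to enforce it:

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/*
 * Hypothetical snapshot of the PAMT fields as a concurrent reader might
 * observe them; 0 stands for "not yet written".
 */
struct pamt_snapshot {
	uint64_t base_4k;
	uint64_t size_4k;
	uint64_t size_2m;
	uint64_t size_1g;
};

/* Acceptable unless some size is visible while the base is still 0. */
static bool pamt_snapshot_ok(const struct pamt_snapshot *s)
{
	bool any_size = s->size_4k || s->size_2m || s->size_1g;

	return !any_size || s->base_4k != 0;
}
```

The two "OK" examples (base only, and base plus pamt_4k_size) pass this
check; the last example (pamt_4k_size without a base) fails it.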
* Re: [PATCH v11 00/20] TDX host kernel support
2023-06-08 21:03 ` [PATCH v11 00/20] TDX host kernel support Dan Williams
@ 2023-06-12 10:56 ` Huang, Kai
0 siblings, 0 replies; 144+ messages in thread
From: Huang, Kai @ 2023-06-12 10:56 UTC (permalink / raw)
To: Williams, Dan J, kvm, linux-kernel, chao.p.peng
Cc: Hansen, Dave, david, bagasdotme, ak, Wysocki, Rafael J,
kirill.shutemov, Chatre, Reinette, Christopherson,,
Sean, pbonzini, tglx, Yamahata, Isaku, linux-mm, Luck, Tony,
peterz, Shahar, Sagi, imammedo, Gao, Chao, Brown, Len,
sathyanarayanan.kuppuswamy, Huang, Ying
On Thu, 2023-06-08 at 14:03 -0700, Dan Williams wrote:
> Kai Huang wrote:
> > Intel Trusted Domain Extensions (TDX) protects guest VMs from malicious
> > host and certain physical attacks. TDX specs are available in [1].
> >
> > This series is the initial support to enable TDX with minimal code to
> > allow KVM to create and run TDX guests. KVM support for TDX is being
> > developed separately[2]. A new "userspace inaccessible memfd" approach
> > to support TDX private memory is also being developed[3]. The KVM will
> > only support the new "userspace inaccessible memfd" as TDX guest memory.
>
> This memfd approach is incompatible with one of the primary ways that
> new memory topologies like high-bandwidth-memory and CXL are accessed,
> via a device-special-file mapping. There is already precedent for mmap()
> to only be used for communicating address value and not CPU accessible
> memory. See "Userspace P2PDMA with O_DIRECT NVMe devices" [1].
>
> So before this memfd requirement becomes too baked in to the design I
> want to understand if "userspace inaccessible" is the only requirement
> so I can look to add that to the device-special-file interface for
> "device" / "Soft Reserved" memory like HBM and CXL.
>
> [1]: https://lore.kernel.org/all/20221021174116.7200-1-logang@deltatee.com/
+ Peng, Chao who is working on this with Sean.
There are some recent developments around the design of the "userspace
inaccessible memfd", e.g., IIUC Sean is proposing to replace the new syscall
with a new KVM ioctl().
Hi Sean, Chao,
Could you comment on Dan's concern?
^ permalink raw reply [flat|nested] 144+ messages in thread
* Re: [PATCH v11 06/20] x86/virt/tdx: Handle SEAMCALL running out of entropy error
2023-06-09 14:42 ` Nikolay Borisov
@ 2023-06-12 11:04 ` Huang, Kai
0 siblings, 0 replies; 144+ messages in thread
From: Huang, Kai @ 2023-06-12 11:04 UTC (permalink / raw)
To: kvm, linux-kernel, n.borisov.lkml
Cc: Hansen, Dave, david, bagasdotme, ak, Wysocki, Rafael J,
kirill.shutemov, Chatre, Reinette, Christopherson,,
Sean, pbonzini, linux-mm, Yamahata, Isaku, tglx, Luck, Tony,
peterz, Shahar, Sagi, imammedo, Gao, Chao, Brown, Len,
sathyanarayanan.kuppuswamy, Huang, Ying, Williams, Dan J
On Fri, 2023-06-09 at 17:42 +0300, Nikolay Borisov wrote:
>
> On 4.06.23 г. 17:27 ч., Kai Huang wrote:
> > Certain SEAMCALL leaf functions may return error due to running out of
> > entropy, in which case the SEAMCALL should be retried as suggested by
> > the TDX spec.
> >
> > Handle this case in SEAMCALL common function. Mimic the existing
> > rdrand_long() to retry RDRAND_RETRY_LOOPS times.
> >
> > Signed-off-by: Kai Huang <kai.huang@intel.com>
> > ---
> >
> > v10 -> v11:
> > - New patch
> >
> > ---
> > arch/x86/virt/vmx/tdx/tdx.c | 15 ++++++++++++++-
> > arch/x86/virt/vmx/tdx/tdx.h | 17 +++++++++++++++++
> > 2 files changed, 31 insertions(+), 1 deletion(-)
> >
>
> <snip>
>
> > diff --git a/arch/x86/virt/vmx/tdx/tdx.h b/arch/x86/virt/vmx/tdx/tdx.h
> > index 48ad1a1ba737..55dbb1b8c971 100644
> > --- a/arch/x86/virt/vmx/tdx/tdx.h
> > +++ b/arch/x86/virt/vmx/tdx/tdx.h
> > @@ -4,6 +4,23 @@
> >
> > #include <linux/types.h>
> >
> > +/*
> > + * This file contains both macros and data structures defined by the TDX
> > + * architecture and Linux defined software data structures and functions.
> > + * The two should not be mixed together for better readability. The
> > + * architectural definitions come first.
> > + */
> > +
> > +/*
> > + * TDX SEAMCALL error codes
> > + */
> > +#define TDX_RND_NO_ENTROPY 0x8000020300000000ULL
>
> Where is this return value documented, in TDX module 1.0 spec there are
> only: 8000020[123]00000000 specified and there's 80000800
> (TDX_KEY_GENERATION_FAILED) and its description mentions the possible
> failure due to lack of entropy?
>
It's documented in TDX module V1.5 ABI Specification:
https://cdrdv2.intel.com/v1/dl/getContent/733579
Later versions of the TDX module try to use TDX_RND_NO_ENTROPY to cover all
errors due to running out of entropy, but TDX module 1.0 for now doesn't.
This patch aims to handle this error code in the common code.
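As a rough sketch of the retry described in the changelog (retry while the
SEAMCALL reports running out of entropy, up to RDRAND_RETRY_LOOPS times): the
wrapper seamcall_retry(), the seamcall_fn type and the fake_seamcall() stub
below are hypothetical and only stand in for the real SEAMCALL path, and the
retry count of 10 is assumed to match what rdrand_long() uses.

```c
#include <assert.h>
#include <stdint.h>

#define TDX_RND_NO_ENTROPY	0x8000020300000000ULL
/* Assumption: same retry count as rdrand_long(). */
#define RDRAND_RETRY_LOOPS	10

/* Hypothetical callback type standing in for a SEAMCALL invocation. */
typedef uint64_t (*seamcall_fn)(void *ctx);

/* Call @fn, retrying only while it reports running out of entropy. */
static uint64_t seamcall_retry(seamcall_fn fn, void *ctx)
{
	int retry = RDRAND_RETRY_LOOPS;
	uint64_t ret;

	assert(fn);

	do {
		ret = fn(ctx);
	} while (ret == TDX_RND_NO_ENTROPY && --retry);

	return ret;
}

/* Hypothetical stub: fails with "no entropy" until the counter hits zero. */
static uint64_t fake_seamcall(void *ctx)
{
	int *remaining = ctx;

	if (*remaining > 0) {
		(*remaining)--;
		return TDX_RND_NO_ENTROPY;
	}
	return 0;
}
```

Any other error code is returned to the caller immediately; only the entropy
error is retried, and it is still returned if all attempts are used up.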
^ permalink raw reply [flat|nested] 144+ messages in thread
* Re: [PATCH v11 07/20] x86/virt/tdx: Add skeleton to enable TDX on demand
2023-06-08 13:43 ` Dave Hansen
@ 2023-06-12 11:21 ` Huang, Kai
0 siblings, 0 replies; 144+ messages in thread
From: Huang, Kai @ 2023-06-12 11:21 UTC (permalink / raw)
To: kvm, Hansen, Dave, linux-kernel
Cc: Luck, Tony, david, bagasdotme, ak, Wysocki, Rafael J,
kirill.shutemov, Chatre, Reinette, Christopherson,,
Sean, pbonzini, tglx, Yamahata, Isaku, linux-mm, Shahar, Sagi,
peterz, imammedo, Gao, Chao, Brown, Len,
sathyanarayanan.kuppuswamy, Huang, Ying, Williams, Dan J
On Thu, 2023-06-08 at 06:43 -0700, Dave Hansen wrote:
> On 6/7/23 19:10, Huang, Kai wrote:
> > On Wed, 2023-06-07 at 08:22 -0700, Dave Hansen wrote:
> > > On 6/4/23 07:27, Kai Huang wrote:
> > > ...
> > > > +static int try_init_module_global(void)
> > > > +{
> > > > + unsigned long flags;
> > > > + int ret;
> > > > +
> > > > + /*
> > > > + * The TDX module global initialization only needs to be done
> > > > + * once on any cpu.
> > > > + */
> > > > + raw_spin_lock_irqsave(&tdx_global_init_lock, flags);
> > >
> > > Why is this "raw_"?
> > >
> > > There's zero mention of it anywhere.
> >
> > Isaku pointed out the normal spinlock_t is converted to sleeping lock for
> > PREEMPT_RT kernel. KVM calls this with IRQ disabled, thus requires a non-
> > sleeping lock.
> >
> > How about adding below comment here?
> >
> > /*
> > * Normal spinlock_t is converted to sleeping lock in PREEMPT_RT
> > * kernel. Use raw_spinlock_t instead so this function can be called
> > * even when IRQ is disabled in any kernel configuration.
> > */
>
> Go look at *EVERY* *OTHER* raw_spinlock_t in the kernel. Do any of them
> say this?
>
> Comment the function, say that it's always called with interrupts and
> preempt disabled. Leaves it at that. *Maybe* add on that it needs raw
> spinlocks because of it. But don't (try to) explain the background of
> the lock type.
>
Thanks. Will do, with one minor change:
I'd like to replace "it's always called with interrupts and preempt disabled"
with "it can be called with interrupts disabled", because in the future non-KVM
code may call this when interrupts are enabled but preemption is disabled.
[...]
> > > > +EXPORT_SYMBOL_GPL(tdx_cpu_enable);
> > >
> > > You danced around it in the changelog, but the reason for the exports is
> > > not clear.
> >
> > I'll add one sentence to the changelog to explain:
> >
> > Export both tdx_cpu_enable() and tdx_enable() as KVM will be the kernel
> > component to use TDX.
>
> Intel doesn't pay me by the word. Do you get paid that way? If not,
> please just say:
>
> Export both tdx_cpu_enable() and tdx_enable() for KVM use.
Thanks!
^ permalink raw reply [flat|nested] 144+ messages in thread
* Re: [PATCH v11 18/20] x86: Handle TDX erratum to reset TDX private memory during kexec() and reboot
2023-06-12 10:27 ` Huang, Kai
@ 2023-06-12 11:48 ` kirill.shutemov
2023-06-12 13:18 ` David Laight
2023-06-12 13:47 ` Dave Hansen
1 sibling, 1 reply; 144+ messages in thread
From: kirill.shutemov @ 2023-06-12 11:48 UTC (permalink / raw)
To: Huang, Kai
Cc: kvm, Hansen, Dave, david, bagasdotme, ak, Wysocki, Rafael J,
Luck, Tony, Chatre, Reinette, Christopherson,,
Sean, pbonzini, tglx, Yamahata, Isaku, linux-kernel, linux-mm,
Shahar, Sagi, peterz, imammedo, Gao, Chao, Brown, Len,
sathyanarayanan.kuppuswamy, Huang, Ying, Williams, Dan J
On Mon, Jun 12, 2023 at 10:27:44AM +0000, Huang, Kai wrote:
> Does it make sense?
I understand your logic. AFAICS, it is correct (smp_mb() instead of __mb()
would be better), but it is not justified from a complexity PoV. This
lockless exercise took me a while to understand.
Lockless doesn't buy you anything here, only increases complexity.
Just take a lock.
Kernel is big. I'm sure you'll find a better opportunity to be clever
about serialization :P
--
Kiryl Shutsemau / Kirill A. Shutemov
^ permalink raw reply [flat|nested] 144+ messages in thread
* RE: [PATCH v11 18/20] x86: Handle TDX erratum to reset TDX private memory during kexec() and reboot
2023-06-12 11:48 ` kirill.shutemov
@ 2023-06-12 13:18 ` David Laight
0 siblings, 0 replies; 144+ messages in thread
From: David Laight @ 2023-06-12 13:18 UTC (permalink / raw)
To: 'kirill.shutemov@linux.intel.com', Huang, Kai
Cc: kvm, Hansen, Dave, david, bagasdotme, ak, Wysocki, Rafael J,
Luck, Tony, Chatre, Reinette, Christopherson,,
Sean, pbonzini, tglx, Yamahata, Isaku, linux-kernel, linux-mm,
Shahar, Sagi, peterz, imammedo, Gao, Chao, Brown, Len,
sathyanarayanan.kuppuswamy, Huang, Ying, Williams, Dan J
From: kirill.shutemov@linux.intel.com
> Sent: 12 June 2023 12:49
>
> On Mon, Jun 12, 2023 at 10:27:44AM +0000, Huang, Kai wrote:
> > Does it make sense?
>
> I understand your logic. AFAICS, it is correct (smp_mb() instead of __mb()
> would be better), but it is not justified from complexity PoV.
Given that x86 performs writes pretty much in code order,
do you need anything more than a compiler barrier?
> This lockless exercise gave me a pause to understand.
>
> Lockless doesn't buy you anything here, only increases complexity.
> Just take a lock.
Indeed...
David
-
Registered Address Lakeside, Bramley Road, Mount Farm, Milton Keynes, MK1 1PT, UK
Registration No: 1397386 (Wales)
^ permalink raw reply [flat|nested] 144+ messages in thread
* Re: [PATCH v11 18/20] x86: Handle TDX erratum to reset TDX private memory during kexec() and reboot
2023-06-12 10:27 ` Huang, Kai
2023-06-12 11:48 ` kirill.shutemov
@ 2023-06-12 13:47 ` Dave Hansen
2023-06-13 0:51 ` Huang, Kai
2023-06-19 11:43 ` Huang, Kai
1 sibling, 2 replies; 144+ messages in thread
From: Dave Hansen @ 2023-06-12 13:47 UTC (permalink / raw)
To: Huang, Kai, kirill.shutemov
Cc: kvm, david, bagasdotme, ak, Wysocki, Rafael J, Luck, Tony,
Chatre, Reinette, Christopherson,,
Sean, pbonzini, tglx, Yamahata, Isaku, linux-kernel, linux-mm,
Shahar, Sagi, peterz, imammedo, Gao, Chao, Brown, Len,
sathyanarayanan.kuppuswamy, Huang, Ying, Williams, Dan J
On 6/12/23 03:27, Huang, Kai wrote:
> So I think a __mb() after setting tdmr->pamt_4k_base should be good enough, as
> it guarantees when setting to any pamt_*_size happens, the valid pamt_4k_base
> will be seen by other cpus.
>
> Does it make sense?
Just use a normal old atomic_t or set_bit()/test_bit(). They have
built-in memory barriers and are less likely to get botched.
^ permalink raw reply [flat|nested] 144+ messages in thread
* Re: [PATCH v11 19/20] x86/mce: Improve error log of kernel space TDX #MC due to erratum
2023-06-12 7:59 ` kirill.shutemov
@ 2023-06-12 13:51 ` Dave Hansen
2023-06-12 23:31 ` Huang, Kai
0 siblings, 1 reply; 144+ messages in thread
From: Dave Hansen @ 2023-06-12 13:51 UTC (permalink / raw)
To: kirill.shutemov, Huang, Kai
Cc: kvm, david, bagasdotme, ak, Wysocki, Rafael J, linux-kernel,
Chatre, Reinette, Christopherson,,
Sean, pbonzini, tglx, linux-mm, Yamahata, Isaku, Luck, Tony,
peterz, Shahar, Sagi, imammedo, Gao, Chao, Brown, Len,
sathyanarayanan.kuppuswamy, Huang, Ying, Williams, Dan J
On 6/12/23 00:59, kirill.shutemov@linux.intel.com wrote:
>> "Memory error from TDX private memory. May be result of CPU erratum caused by
>> kernel bug."
> Fine, I guess.
Just be short and sweet:
mce: [Hardware Error]: TDX private memory error. Possible kernel bug.
^ permalink raw reply [flat|nested] 144+ messages in thread
* Re: [PATCH v11 11/20] x86/virt/tdx: Fill out TDMRs to cover all TDX memory regions
2023-06-12 2:33 ` Huang, Kai
@ 2023-06-12 14:33 ` kirill.shutemov
2023-06-12 22:10 ` Huang, Kai
0 siblings, 1 reply; 144+ messages in thread
From: kirill.shutemov @ 2023-06-12 14:33 UTC (permalink / raw)
To: Huang, Kai
Cc: kvm, Hansen, Dave, linux-kernel, Luck, Tony, david, bagasdotme,
ak, Wysocki, Rafael J, Chatre, Reinette, Christopherson,,
Sean, pbonzini, tglx, Yamahata, Isaku, linux-mm, Shahar, Sagi,
peterz, imammedo, Gao, Chao, Brown, Len,
sathyanarayanan.kuppuswamy, Huang, Ying, Williams, Dan J
On Mon, Jun 12, 2023 at 02:33:58AM +0000, Huang, Kai wrote:
>
> >
> > Maybe not even a pr_warn(), but something that's a bit ominous and has a
> > chance of getting users to act.
>
> Sorry I am not sure how to do. Could you give some suggestion?
Maybe something like this would do?
I'm struggling with the warning message. Any suggestion is welcome.
diff --git a/arch/x86/virt/vmx/tdx/tdx.c b/arch/x86/virt/vmx/tdx/tdx.c
index 9cd4f6b58d4a..cc141025b249 100644
--- a/arch/x86/virt/vmx/tdx/tdx.c
+++ b/arch/x86/virt/vmx/tdx/tdx.c
@@ -627,6 +627,15 @@ static int fill_out_tdmrs(struct list_head *tmb_list,
/* @tdmr_idx is always the index of last valid TDMR. */
tdmr_list->nr_consumed_tdmrs = tdmr_idx + 1;
+ /*
+ * Warn early that kernel is about to run out of TDMRs.
+ *
+ * This is indication that TDMR allocation has to be reworked to be
+ * smarter to not run into an issue.
+ */
+ if (tdmr_list->max_tdmrs - tdmr_list->nr_consumed_tdmrs < TDMR_NR_WARN)
+ pr_warn("Low number of spare TDMRs\n");
+
return 0;
}
diff --git a/arch/x86/virt/vmx/tdx/tdx.h b/arch/x86/virt/vmx/tdx/tdx.h
index 323ce744b853..17efe33847ae 100644
--- a/arch/x86/virt/vmx/tdx/tdx.h
+++ b/arch/x86/virt/vmx/tdx/tdx.h
@@ -98,6 +98,9 @@ struct tdx_memblock {
int nid;
};
+/* Warn if kernel has less than TDMR_NR_WARN TDMRs after allocation */
+#define TDMR_NR_WARN 4
+
struct tdmr_info_list {
void *tdmrs; /* Flexible array to hold 'tdmr_info's */
int nr_consumed_tdmrs; /* How many 'tdmr_info's are in use */
--
Kiryl Shutsemau / Kirill A. Shutemov
^ permalink raw reply [flat|nested] 144+ messages in thread
* Re: [PATCH v11 11/20] x86/virt/tdx: Fill out TDMRs to cover all TDX memory regions
2023-06-12 14:33 ` kirill.shutemov
@ 2023-06-12 22:10 ` Huang, Kai
2023-06-13 10:18 ` kirill.shutemov
0 siblings, 1 reply; 144+ messages in thread
From: Huang, Kai @ 2023-06-12 22:10 UTC (permalink / raw)
To: kirill.shutemov
Cc: kvm, Hansen, Dave, david, bagasdotme, ak, Wysocki, Rafael J,
linux-kernel, Chatre, Reinette, Christopherson,,
Sean, pbonzini, tglx, Yamahata, Isaku, linux-mm, Luck, Tony,
Shahar, Sagi, peterz, imammedo, Gao, Chao, Brown, Len,
sathyanarayanan.kuppuswamy, Huang, Ying, Williams, Dan J
On Mon, 2023-06-12 at 17:33 +0300, kirill.shutemov@linux.intel.com wrote:
> On Mon, Jun 12, 2023 at 02:33:58AM +0000, Huang, Kai wrote:
> >
> > >
> > > Maybe not even a pr_warn(), but something that's a bit ominous and has a
> > > chance of getting users to act.
> >
> > Sorry I am not sure how to do. Could you give some suggestion?
>
> Maybe something like this would do?
>
> I'm struggle with the warning message. Any suggestion is welcome.
I guess it would be helpful to print out the actual consumed TDMRs?
pr_warn("consumed TDMRs reaching limit: %d used (out of %d)\n",
tdmr_idx, tdmr_list->max_tdmrs);
>
> diff --git a/arch/x86/virt/vmx/tdx/tdx.c b/arch/x86/virt/vmx/tdx/tdx.c
> index 9cd4f6b58d4a..cc141025b249 100644
> --- a/arch/x86/virt/vmx/tdx/tdx.c
> +++ b/arch/x86/virt/vmx/tdx/tdx.c
> @@ -627,6 +627,15 @@ static int fill_out_tdmrs(struct list_head *tmb_list,
> /* @tdmr_idx is always the index of last valid TDMR. */
> tdmr_list->nr_consumed_tdmrs = tdmr_idx + 1;
>
> + /*
> + * Warn early that kernel is about to run out of TDMRs.
> + *
> + * This is indication that TDMR allocation has to be reworked to be
> + * smarter to not run into an issue.
> + */
> + if (tdmr_list->max_tdmrs - tdmr_list->nr_consumed_tdmrs < TDMR_NR_WARN)
> + pr_warn("Low number of spare TDMRs\n");
> +
> return 0;
> }
>
> diff --git a/arch/x86/virt/vmx/tdx/tdx.h b/arch/x86/virt/vmx/tdx/tdx.h
> index 323ce744b853..17efe33847ae 100644
> --- a/arch/x86/virt/vmx/tdx/tdx.h
> +++ b/arch/x86/virt/vmx/tdx/tdx.h
> @@ -98,6 +98,9 @@ struct tdx_memblock {
> int nid;
> };
>
> +/* Warn if kernel has less than TDMR_NR_WARN TDMRs after allocation */
> +#define TDMR_NR_WARN 4
> +
> struct tdmr_info_list {
> void *tdmrs; /* Flexible array to hold 'tdmr_info's */
> int nr_consumed_tdmrs; /* How many 'tdmr_info's are in use */
^ permalink raw reply [flat|nested] 144+ messages in thread
* Re: [PATCH v11 19/20] x86/mce: Improve error log of kernel space TDX #MC due to erratum
2023-06-12 13:51 ` Dave Hansen
@ 2023-06-12 23:31 ` Huang, Kai
0 siblings, 0 replies; 144+ messages in thread
From: Huang, Kai @ 2023-06-12 23:31 UTC (permalink / raw)
To: kirill.shutemov, Hansen, Dave
Cc: kvm, Luck, Tony, david, bagasdotme, ak, Wysocki, Rafael J,
linux-kernel, Chatre, Reinette, Christopherson,,
Sean, pbonzini, tglx, Yamahata, Isaku, linux-mm, peterz,
imammedo, Shahar, Sagi, Gao, Chao, Brown, Len,
sathyanarayanan.kuppuswamy, Huang, Ying, Williams, Dan J
On Mon, 2023-06-12 at 06:51 -0700, Dave Hansen wrote:
> On 6/12/23 00:59, kirill.shutemov@linux.intel.com wrote:
> > > "Memory error from TDX private memory. May be result of CPU erratum caused by
> > > kernel bug."
> > Fine, I guess.
>
> Just be short and sweet:
>
> mce: [Hardware Error]: TDX private memory error. Possible kernel bug.
Thanks!
^ permalink raw reply [flat|nested] 144+ messages in thread
* Re: [PATCH v11 18/20] x86: Handle TDX erratum to reset TDX private memory during kexec() and reboot
2023-06-12 13:47 ` Dave Hansen
@ 2023-06-13 0:51 ` Huang, Kai
2023-06-13 11:05 ` kirill.shutemov
2023-06-13 14:25 ` Dave Hansen
2023-06-19 11:43 ` Huang, Kai
1 sibling, 2 replies; 144+ messages in thread
From: Huang, Kai @ 2023-06-13 0:51 UTC (permalink / raw)
To: kirill.shutemov, Hansen, Dave
Cc: kvm, Luck, Tony, david, bagasdotme, ak, Wysocki, Rafael J,
linux-kernel, Chatre, Reinette, Christopherson,,
Sean, pbonzini, tglx, linux-mm, Yamahata, Isaku, peterz, Shahar,
Sagi, imammedo, Gao, Chao, Brown, Len,
sathyanarayanan.kuppuswamy, Huang, Ying, Williams, Dan J
On Mon, 2023-06-12 at 06:47 -0700, Dave Hansen wrote:
> On 6/12/23 03:27, Huang, Kai wrote:
> > So I think a __mb() after setting tdmr->pamt_4k_base should be good enough, as
> > it guarantees when setting to any pamt_*_size happens, the valid pamt_4k_base
> > will be seen by other cpus.
> >
> > Does it make sense?
>
> Just use a normal old atomic_t or set_bit()/test_bit(). They have
> built-in memory barriers are are less likely to get botched.
Thanks for the suggestion.
Hi Dave, Kirill,
I'd like to check with you whether we should introduce a mechanism to track
TDX private pages for both this patch and the next.
As you can see, this patch only deals with PAMT pages, for a couple of reasons
mentioned in the changelog. The next MCE patch handles all TDX private pages,
but it uses SEAMCALL in the #MC handler. Using SEAMCALL has two cons: 1) it is
slow (probably doesn't matter, though); 2) it brings additional risk of
triggering further #MC inside TDX module, although such risk should be a
theoretical thing.
If we introduce a helper to mark a page as TDX private page, then both above
patches can utilize it. We don't need to consult TDMRs to get PAMT anymore in
this patch (we will need a way to loop all TDX-usable memory pages, but this
needs to be done anyway with TDX guests). I believe eventually we can end up
with less code.
In terms of how to do it, for PAMT pages we can set page->private to a TDX magic
number because they come out of page allocator directly. Secure-EPT pages are
like PAMT pages too. For TDX guest private pages, Sean is moving to implement
KVM's own pseudo filesystem so they will have a unique mapping to identify.
https://github.com/sean-jc/linux/commit/40d338c8629287dda60a9f7c800ede8549295a7c
And my thinking is in this TDX host series, we can just handle PAMT pages. Both
secure-EPT and TDX guest private pages can be handled later in KVM TDX series.
I think eventually we can have a function like below to tell whether a page is
TDX private page:
bool page_is_tdx_private(struct page *page)
{
if (page->private == TDX_PRIVATE_MAGIC)
return true;
if (!page_mapping(page))
return false;
return page_mapping(page)->a_ops == &kvm_gmem_ops;
}
How does this sound? Or any other comments? Thanks!
^ permalink raw reply [flat|nested] 144+ messages in thread
* Re: [PATCH v11 11/20] x86/virt/tdx: Fill out TDMRs to cover all TDX memory regions
2023-06-12 22:10 ` Huang, Kai
@ 2023-06-13 10:18 ` kirill.shutemov
2023-06-13 23:19 ` Huang, Kai
0 siblings, 1 reply; 144+ messages in thread
From: kirill.shutemov @ 2023-06-13 10:18 UTC (permalink / raw)
To: Huang, Kai
Cc: kvm, Hansen, Dave, david, bagasdotme, ak, Wysocki, Rafael J,
linux-kernel, Chatre, Reinette, Christopherson,,
Sean, pbonzini, tglx, Yamahata, Isaku, linux-mm, Luck, Tony,
Shahar, Sagi, peterz, imammedo, Gao, Chao, Brown, Len,
sathyanarayanan.kuppuswamy, Huang, Ying, Williams, Dan J
On Mon, Jun 12, 2023 at 10:10:39PM +0000, Huang, Kai wrote:
> On Mon, 2023-06-12 at 17:33 +0300, kirill.shutemov@linux.intel.com wrote:
> > On Mon, Jun 12, 2023 at 02:33:58AM +0000, Huang, Kai wrote:
> > >
> > > >
> > > > Maybe not even a pr_warn(), but something that's a bit ominous and has a
> > > > chance of getting users to act.
> > >
> > > Sorry I am not sure how to do. Could you give some suggestion?
> >
> > Maybe something like this would do?
> >
> > I'm struggle with the warning message. Any suggestion is welcome.
>
> I guess it would be helpful to print out the actual consumed TDMRs?
>
> pr_warn("consumed TDMRs reaching limit: %d used (out of %d)\n",
> tdmr_idx, tdmr_list->max_tdmrs);
It is off-by-one. It is supposed to be tdmr_idx + 1.
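To make the off-by-one concrete, here is a small standalone sketch: @tdmr_idx is
the index of the last valid TDMR, so the number consumed is tdmr_idx + 1. The
helper names consumed_tdmrs() and tdmrs_running_low() are made up for
illustration; TDMR_NR_WARN is the value from the diff.

```c
#include <assert.h>
#include <stdbool.h>

/* Warn if fewer than TDMR_NR_WARN TDMRs are left after allocation. */
#define TDMR_NR_WARN	4

/* @tdmr_idx is the index of the last valid TDMR, hence the "+ 1". */
static int consumed_tdmrs(int tdmr_idx)
{
	return tdmr_idx + 1;
}

/* True when the number of spare TDMRs drops below the warning threshold. */
static bool tdmrs_running_low(int max_tdmrs, int tdmr_idx)
{
	int consumed = consumed_tdmrs(tdmr_idx);

	assert(consumed <= max_tdmrs);
	return max_tdmrs - consumed < TDMR_NR_WARN;
}
```

With 64 TDMRs in total, a last valid index of 59 leaves 4 spare and stays quiet,
while an index of 60 leaves 3 spare and would trigger the warning.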
--
Kiryl Shutsemau / Kirill A. Shutemov
^ permalink raw reply [flat|nested] 144+ messages in thread
* Re: [PATCH v11 18/20] x86: Handle TDX erratum to reset TDX private memory during kexec() and reboot
2023-06-13 0:51 ` Huang, Kai
@ 2023-06-13 11:05 ` kirill.shutemov
2023-06-14 0:15 ` Huang, Kai
2023-06-13 14:25 ` Dave Hansen
1 sibling, 1 reply; 144+ messages in thread
From: kirill.shutemov @ 2023-06-13 11:05 UTC (permalink / raw)
To: Huang, Kai
Cc: Hansen, Dave, kvm, Luck, Tony, david, bagasdotme, ak, Wysocki,
Rafael J, linux-kernel, Chatre, Reinette, Christopherson,,
Sean, pbonzini, tglx, linux-mm, Yamahata, Isaku, peterz, Shahar,
Sagi, imammedo, Gao, Chao, Brown, Len,
sathyanarayanan.kuppuswamy, Huang, Ying, Williams, Dan J
On Tue, Jun 13, 2023 at 12:51:23AM +0000, Huang, Kai wrote:
> On Mon, 2023-06-12 at 06:47 -0700, Dave Hansen wrote:
> > On 6/12/23 03:27, Huang, Kai wrote:
> > > So I think a __mb() after setting tdmr->pamt_4k_base should be good enough, as
> > > it guarantees when setting to any pamt_*_size happens, the valid pamt_4k_base
> > > will be seen by other cpus.
> > >
> > > Does it make sense?
> >
> > Just use a normal old atomic_t or set_bit()/test_bit(). They have
> > built-in memory barriers are are less likely to get botched.
>
> Thanks for the suggestion.
>
> Hi Dave, Kirill,
>
> I'd like to check with you that whether we should introduce a mechanism to track
> TDX private pages for both this patch and the next.
>
> As you can see this patch only deals PAMT pages due to couple of reasons that
> mnentioned in the changelog. The next MCE patch handles all TDX private pages,
> but it uses SEAMCALL in the #MC handler. Using SEAMCALL has two cons: 1) it is
> slow (probably doesn't matter, though); 2) it brings additional risk of
> triggering further #MC inside TDX module, although such risk should be a
> theoretical thing.
>
> If we introduce a helper to mark a page as TDX private page, then both above
> patches can utilize it. We don't need to consult TDMRs to get PAMT anymore in
> this patch (we will need a way to loop all TDX-usable memory pages, but this
> needs to be done anyway with TDX guests). I believe eventually we can end up
> with less code.
>
> In terms of how to do, for PAMT pages, we can set page->private to a TDX magic
> number because they come out of page allocator directly. Secure-EPT pages are
> like PAMT pages too. For TDX guest private pages, Sean is moving to implement
> KVM's own pseudo filesystem so they will have a unique mapping to identify.
>
> https://github.com/sean-jc/linux/commit/40d338c8629287dda60a9f7c800ede8549295a7c
>
> And my thinking is in this TDX host series, we can just handle PAMT pages. Both
> secure-EPT and TDX guest private pages can be handled later in KVM TDX series.
> I think eventually we can have a function like below to tell whether a page is
> TDX private page:
>
> bool page_is_tdx_private(struct page *page)
> {
> if (page->private == TDX_PRIVATE_MAGIC)
> return true;
>
> if (!page_mapping(page))
> return false;
>
> return page_mapping(page)->a_ops == &kvm_gmem_ops;
> }
>
> How does this sound? Or any other comments? Thanks!
If you are going to take this path it has to be supported natively by
kvm_gmem_: it has to provide an API for that. You should not assume that
page->private is free to use. It is owned by kvm_gmem.
--
Kiryl Shutsemau / Kirill A. Shutemov
^ permalink raw reply [flat|nested] 144+ messages in thread
* Re: [PATCH v11 18/20] x86: Handle TDX erratum to reset TDX private memory during kexec() and reboot
2023-06-13 0:51 ` Huang, Kai
2023-06-13 11:05 ` kirill.shutemov
@ 2023-06-13 14:25 ` Dave Hansen
2023-06-13 23:18 ` Huang, Kai
1 sibling, 1 reply; 144+ messages in thread
From: Dave Hansen @ 2023-06-13 14:25 UTC (permalink / raw)
To: Huang, Kai, kirill.shutemov
Cc: kvm, Luck, Tony, david, bagasdotme, ak, Wysocki, Rafael J,
linux-kernel, Chatre, Reinette, Christopherson,,
Sean, pbonzini, tglx, linux-mm, Yamahata, Isaku, peterz, Shahar,
Sagi, imammedo, Gao, Chao, Brown, Len,
sathyanarayanan.kuppuswamy, Huang, Ying, Williams, Dan J
On 6/12/23 17:51, Huang, Kai wrote:
> If we introduce a helper to mark a page as TDX private page,
Let me get this right: you have working, functional code for a
highly-unlikely scenario (kernel bugs or even more rare hardware
errors). But, you want to optimize this super-rare case? It's not fast
enough?
Is there any other motivation here that I'm missing?
^ permalink raw reply [flat|nested] 144+ messages in thread
* Re: [PATCH v11 18/20] x86: Handle TDX erratum to reset TDX private memory during kexec() and reboot
2023-06-13 14:25 ` Dave Hansen
@ 2023-06-13 23:18 ` Huang, Kai
2023-06-14 0:24 ` Dave Hansen
0 siblings, 1 reply; 144+ messages in thread
From: Huang, Kai @ 2023-06-13 23:18 UTC (permalink / raw)
To: kirill.shutemov, Hansen, Dave
Cc: kvm, Luck, Tony, david, bagasdotme, ak, Wysocki, Rafael J,
linux-kernel, Chatre, Reinette, Christopherson,,
Sean, pbonzini, tglx, linux-mm, Yamahata, Isaku, Shahar, Sagi,
peterz, imammedo, Gao, Chao, Brown, Len,
sathyanarayanan.kuppuswamy, Huang, Ying, Williams, Dan J
On Tue, 2023-06-13 at 07:25 -0700, Hansen, Dave wrote:
> On 6/12/23 17:51, Huang, Kai wrote:
> > If we introduce a helper to mark a page as TDX private page,
>
> Let me get this right: you have working, functional code for a
> highly-unlikely scenario (kernel bugs or even more rare hardware
> errors). But, you want to optimize this super-rare case? It's not fast
> enough?
>
> Is there any other motivation here that I'm missing?
>
No, it's not about speed. The motivation is to have common code that yields fewer
lines of code, though I don't have a clear number for how many LoC can be saved.
^ permalink raw reply [flat|nested] 144+ messages in thread
* Re: [PATCH v11 11/20] x86/virt/tdx: Fill out TDMRs to cover all TDX memory regions
2023-06-13 10:18 ` kirill.shutemov
@ 2023-06-13 23:19 ` Huang, Kai
0 siblings, 0 replies; 144+ messages in thread
From: Huang, Kai @ 2023-06-13 23:19 UTC (permalink / raw)
To: kirill.shutemov
Cc: kvm, Hansen, Dave, david, bagasdotme, ak, Wysocki, Rafael J,
Luck, Tony, Chatre, Reinette, Christopherson,,
Sean, pbonzini, tglx, Yamahata, Isaku, linux-kernel, linux-mm,
peterz, Shahar, Sagi, imammedo, Gao, Chao, Brown, Len,
sathyanarayanan.kuppuswamy, Huang, Ying, Williams, Dan J
On Tue, 2023-06-13 at 13:18 +0300, kirill.shutemov@linux.intel.com wrote:
> On Mon, Jun 12, 2023 at 10:10:39PM +0000, Huang, Kai wrote:
> > On Mon, 2023-06-12 at 17:33 +0300, kirill.shutemov@linux.intel.com wrote:
> > > On Mon, Jun 12, 2023 at 02:33:58AM +0000, Huang, Kai wrote:
> > > >
> > > > >
> > > > > Maybe not even a pr_warn(), but something that's a bit ominous and has a
> > > > > chance of getting users to act.
> > > >
> > > > Sorry I am not sure how to do. Could you give some suggestion?
> > >
> > > Maybe something like this would do?
> > >
> > > I'm struggle with the warning message. Any suggestion is welcome.
> >
> > I guess it would be helpful to print out the actual consumed TDMRs?
> >
> > pr_warn("consumed TDMRs reaching limit: %d used (out of %d)\n",
> > tdmr_idx, tdmr_list->max_tdmrs);
>
> It is off-by-one. It supposed to be tdmr_idx + 1.
>
In your code, yes. Thanks for pointing it out. I copied it from my code.
^ permalink raw reply [flat|nested] 144+ messages in thread
* Re: [PATCH v11 18/20] x86: Handle TDX erratum to reset TDX private memory during kexec() and reboot
2023-06-13 11:05 ` kirill.shutemov
@ 2023-06-14 0:15 ` Huang, Kai
0 siblings, 0 replies; 144+ messages in thread
From: Huang, Kai @ 2023-06-14 0:15 UTC (permalink / raw)
To: kirill.shutemov
Cc: kvm, Hansen, Dave, david, bagasdotme, ak, Wysocki, Rafael J,
linux-kernel, Chatre, Reinette, Christopherson,,
Sean, pbonzini, tglx, Yamahata, Isaku, linux-mm, Luck, Tony,
peterz, Shahar, Sagi, imammedo, Gao, Chao, Brown, Len,
sathyanarayanan.kuppuswamy, Huang, Ying, Williams, Dan J
On Tue, 2023-06-13 at 14:05 +0300, kirill.shutemov@linux.intel.com wrote:
> On Tue, Jun 13, 2023 at 12:51:23AM +0000, Huang, Kai wrote:
> > On Mon, 2023-06-12 at 06:47 -0700, Dave Hansen wrote:
> > > On 6/12/23 03:27, Huang, Kai wrote:
> > > > So I think a __mb() after setting tdmr->pamt_4k_base should be good enough, as
> > > > it guarantees when setting to any pamt_*_size happens, the valid pamt_4k_base
> > > > will be seen by other cpus.
> > > >
> > > > Does it make sense?
> > >
> > > Just use a normal old atomic_t or set_bit()/test_bit(). They have
> > > built-in memory barriers are are less likely to get botched.
> >
> > Thanks for the suggestion.
> >
> > Hi Dave, Kirill,
> >
> > I'd like to check with you that whether we should introduce a mechanism to track
> > TDX private pages for both this patch and the next.
> >
> > As you can see this patch only deals PAMT pages due to couple of reasons that
> > mnentioned in the changelog. The next MCE patch handles all TDX private pages,
> > but it uses SEAMCALL in the #MC handler. Using SEAMCALL has two cons: 1) it is
> > slow (probably doesn't matter, though); 2) it brings additional risk of
> > triggering further #MC inside TDX module, although such risk should be a
> > theoretical thing.
> >
> > If we introduce a helper to mark a page as TDX private page, then both above
> > patches can utilize it. We don't need to consult TDMRs to get PAMT anymore in
> > this patch (we will need a way to loop all TDX-usable memory pages, but this
> > needs to be done anyway with TDX guests). I believe eventually we can end up
> > with less code.
> >
> > In terms of how to do, for PAMT pages, we can set page->private to a TDX magic
> > number because they come out of page allocator directly. Secure-EPT pages are
> > like PAMT pages too. For TDX guest private pages, Sean is moving to implement
> > KVM's own pseudo filesystem so they will have a unique mapping to identify.
> >
> > https://github.com/sean-jc/linux/commit/40d338c8629287dda60a9f7c800ede8549295a7c
> >
> > And my thinking is in this TDX host series, we can just handle PAMT pages. Both
> > secure-EPT and TDX guest private pages can be handled later in KVM TDX series.
> > I think eventually we can have a function like below to tell whether a page is
> > TDX private page:
> >
> > bool page_is_tdx_private(struct page *page)
> > {
> > if (page->private == TDX_PRIVATE_MAGIC)
> > return true;
> >
> > if (!page_mapping(page))
> > return false;
> >
> > return page_mapping(page)->a_ops == &kvm_gmem_ops;
> > }
> >
> > How does this sound? Or any other comments? Thanks!
>
> If you are going to take this path, it has to be supported natively by
> kvm_gmem_: it has to provide an API for that.
>
Yes.
> You should not assume that
> page->private is free to use. It is owned by kvm_gmem.
>
page->private is only for PAMT and SEPT pages. kvm_gmem has its own mapping,
which can be used to identify the pages owned by it.
Hmm.. I think we should just leave them out for now, as they are theoretically
owned by KVM and thus can be handled by KVM, e.g., in its reboot, shutdown or
module unload code path.
If we are fine with using SEAMCALL in the #MC handler code path, I think perhaps
we can just keep using TDMRs to locate PAMTs.
* Re: [PATCH v11 18/20] x86: Handle TDX erratum to reset TDX private memory during kexec() and reboot
2023-06-13 23:18 ` Huang, Kai
@ 2023-06-14 0:24 ` Dave Hansen
2023-06-14 0:38 ` Huang, Kai
0 siblings, 1 reply; 144+ messages in thread
From: Dave Hansen @ 2023-06-14 0:24 UTC (permalink / raw)
To: Huang, Kai, kirill.shutemov
Cc: kvm, Luck, Tony, david, bagasdotme, ak, Wysocki, Rafael J,
linux-kernel, Chatre, Reinette, Christopherson,,
Sean, pbonzini, tglx, linux-mm, Yamahata, Isaku, Shahar, Sagi,
peterz, imammedo, Gao, Chao, Brown, Len,
sathyanarayanan.kuppuswamy, Huang, Ying, Williams, Dan J
On 6/13/23 16:18, Huang, Kai wrote:
> On Tue, 2023-06-13 at 07:25 -0700, Hansen, Dave wrote:
>> On 6/12/23 17:51, Huang, Kai wrote:
>>> If we introduce a helper to mark a page as TDX private page,
>> Let me get this right: you have working, functional code for a
>> highly-unlikely scenario (kernel bugs or even more rare hardware
>> errors). But, you want to optimize this super-rare case? It's not fast
>> enough?
>>
>> Is there any other motivation here that I'm missing?
>>
> No, it's not about speed. The motivation is to have common code that yields fewer
> lines of code, though I don't have a clear number for how many LoC can be saved.
OK, so ... ballpark. How many lines of code are we going to _save_ for
this super-rare case? 10? 100? 1000?
The upside is saving X lines of code ... somewhere. The downside is
adding Y lines of code ... somewhere else and maybe breaking things in
the process.
You've evidently done _some_ kind of calculus in your head to make this
tradeoff worthwhile. I'd love to hear what your calculus is, even if
it's just a gut feel.
Could you share your logic here, please?
* Re: [PATCH v11 18/20] x86: Handle TDX erratum to reset TDX private memory during kexec() and reboot
2023-06-14 0:24 ` Dave Hansen
@ 2023-06-14 0:38 ` Huang, Kai
2023-06-14 0:42 ` Huang, Kai
0 siblings, 1 reply; 144+ messages in thread
From: Huang, Kai @ 2023-06-14 0:38 UTC (permalink / raw)
To: kirill.shutemov, Hansen, Dave
Cc: kvm, Luck, Tony, david, bagasdotme, ak, Wysocki, Rafael J,
linux-kernel, Chatre, Reinette, Christopherson,,
Sean, pbonzini, tglx, linux-mm, Yamahata, Isaku, peterz, Shahar,
Sagi, imammedo, Gao, Chao, Brown, Len,
sathyanarayanan.kuppuswamy, Huang, Ying, Williams, Dan J
On Tue, 2023-06-13 at 17:24 -0700, Dave Hansen wrote:
> On 6/13/23 16:18, Huang, Kai wrote:
> > On Tue, 2023-06-13 at 07:25 -0700, Hansen, Dave wrote:
> > > On 6/12/23 17:51, Huang, Kai wrote:
> > > > If we introduce a helper to mark a page as TDX private page,
> > > Let me get this right: you have working, functional code for a
> > > highly-unlikely scenario (kernel bugs or even more rare hardware
> > > errors). But, you want to optimize this super-rare case? It's not fast
> > > enough?
> > >
> > > Is there any other motivation here that I'm missing?
> > >
> > No, it's not about speed. The motivation is to have common code that yields fewer
> > lines of code, though I don't have a clear number for how many LoC can be saved.
>
> OK, so ... ballpark. How many lines of code are we going to _save_ for
> this super-rare case? 10? 100? 1000?
~50 LoC I guess, certainly < 100.
>
> The upside is saving X lines of code ... somewhere. The downside is
> adding Y lines of code ... somewhere else and maybe breaking things in
> the process.
>
> You've evidently done _some_ kind of calculus in your head to make this
> tradeoff worthwhile. I'd love to hear what your calculus is, even if
> it's just a gut feel.
>
> Could you share your logic here, please?
The logic is that the whole tdx_is_private_mem() function in the next patch (the
#MC handling one) can be significantly reduced, from 100 to ~10 LoC, and we
roughly need some more code (<50 LoC) to mark PAMT as private.
* Re: [PATCH v11 18/20] x86: Handle TDX erratum to reset TDX private memory during kexec() and reboot
2023-06-14 0:38 ` Huang, Kai
@ 2023-06-14 0:42 ` Huang, Kai
0 siblings, 0 replies; 144+ messages in thread
From: Huang, Kai @ 2023-06-14 0:42 UTC (permalink / raw)
To: kirill.shutemov, Hansen, Dave
Cc: kvm, Luck, Tony, david, bagasdotme, ak, Wysocki, Rafael J,
linux-kernel, Chatre, Reinette, Christopherson,,
Sean, pbonzini, tglx, linux-mm, Yamahata, Isaku, Shahar, Sagi,
peterz, imammedo, Gao, Chao, Brown, Len,
sathyanarayanan.kuppuswamy, Huang, Ying, Williams, Dan J
On Wed, 2023-06-14 at 00:38 +0000, Huang, Kai wrote:
> On Tue, 2023-06-13 at 17:24 -0700, Dave Hansen wrote:
> > On 6/13/23 16:18, Huang, Kai wrote:
> > > On Tue, 2023-06-13 at 07:25 -0700, Hansen, Dave wrote:
> > > > On 6/12/23 17:51, Huang, Kai wrote:
> > > > > If we introduce a helper to mark a page as TDX private page,
> > > > Let me get this right: you have working, functional code for a
> > > > highly-unlikely scenario (kernel bugs or even more rare hardware
> > > > errors). But, you want to optimize this super-rare case? It's not fast
> > > > enough?
> > > >
> > > > Is there any other motivation here that I'm missing?
> > > >
> > > No, it's not about speed. The motivation is to have common code that yields fewer
> > > lines of code, though I don't have a clear number for how many LoC can be saved.
> >
> > OK, so ... ballpark. How many lines of code are we going to _save_ for
> > this super-rare case? 10? 100? 1000?
>
> ~50 LoC I guess, certainly < 100.
>
> >
> > The upside is saving X lines of code ... somewhere. The downside is
> > adding Y lines of code ... somewhere else and maybe breaking things in
> > the process.
> >
> > You've evidently done _some_ kind of calculus in your head to make this
> > tradeoff worthwhile. I'd love to hear what your calculus is, even if
> > it's just a gut feel.
> >
> > Could you share your logic here, please?
>
> The logic is the whole tdx_is_private_mem() function in the next patch (#MC
> handling one) can be significantly reduced from 100 -> ~10, and we roughly needs
> some more code (<50 LoC) to mark PAMT as private.
>
Apologies, that should be "we roughly need some more code (<50 LoC) to mark PAMT,
Secure-EPT and TDX guest private pages as TDX private pages". But for now we only
have PAMT.
* Re: [PATCH v11 18/20] x86: Handle TDX erratum to reset TDX private memory during kexec() and reboot
[not found] ` <5aa7506d4fedbf625e3fe8ceeb88af3be1ce97ea.1685887183.git.kai.huang@intel.com>
2023-06-09 13:23 ` [PATCH v11 18/20] x86: Handle TDX erratum to reset TDX private memory during kexec() and reboot kirill.shutemov
@ 2023-06-14 9:33 ` Huang, Kai
2023-06-14 10:02 ` kirill.shutemov
1 sibling, 1 reply; 144+ messages in thread
From: Huang, Kai @ 2023-06-14 9:33 UTC (permalink / raw)
To: kvm, linux-kernel
Cc: Hansen, Dave, david, bagasdotme, ak, Wysocki, Rafael J,
kirill.shutemov, Chatre, Reinette, Christopherson,,
Sean, pbonzini, linux-mm, Yamahata, Isaku, tglx, Luck, Tony,
peterz, Shahar, Sagi, imammedo, Gao, Chao, Brown, Len,
sathyanarayanan.kuppuswamy, Huang, Ying, Williams, Dan J
On Mon, 2023-06-05 at 02:27 +1200, Kai Huang wrote:
> --- a/arch/x86/kernel/reboot.c
> +++ b/arch/x86/kernel/reboot.c
> @@ -720,6 +720,7 @@ void native_machine_shutdown(void)
>
> #ifdef CONFIG_X86_64
> x86_platform.iommu_shutdown();
> + x86_platform.memory_shutdown();
> #endif
> }
Hi Kirill/Dave,
I missed that this solution doesn't reset TDX private memory for emergency
restart or when reboot_force is set, because machine_shutdown() isn't called in
those cases.
Is it acceptable? Or should we handle them too?
* Re: [PATCH v11 18/20] x86: Handle TDX erratum to reset TDX private memory during kexec() and reboot
2023-06-14 9:33 ` Huang, Kai
@ 2023-06-14 10:02 ` kirill.shutemov
2023-06-14 10:58 ` Huang, Kai
0 siblings, 1 reply; 144+ messages in thread
From: kirill.shutemov @ 2023-06-14 10:02 UTC (permalink / raw)
To: Huang, Kai
Cc: kvm, linux-kernel, Hansen, Dave, david, bagasdotme, ak, Wysocki,
Rafael J, Chatre, Reinette, Christopherson,,
Sean, pbonzini, linux-mm, Yamahata, Isaku, tglx, Luck, Tony,
peterz, Shahar, Sagi, imammedo, Gao, Chao, Brown, Len,
sathyanarayanan.kuppuswamy, Huang, Ying, Williams, Dan J
On Wed, Jun 14, 2023 at 09:33:45AM +0000, Huang, Kai wrote:
> On Mon, 2023-06-05 at 02:27 +1200, Kai Huang wrote:
> > --- a/arch/x86/kernel/reboot.c
> > +++ b/arch/x86/kernel/reboot.c
> > @@ -720,6 +720,7 @@ void native_machine_shutdown(void)
> >
> > #ifdef CONFIG_X86_64
> > x86_platform.iommu_shutdown();
> > + x86_platform.memory_shutdown();
> > #endif
> > }
>
> Hi Kirill/Dave,
>
> I missed that this solution doesn't reset TDX private memory for emergency
> restart or when reboot_force is set, because machine_shutdown() isn't called in
> those cases.
>
> Is it acceptable? Or should we handle them too?
Force reboot is not used in kexec path, right? And the platform has to
handle erratum in BIOS to reset memory status on reboot anyway.
I think we should be fine. But it is worth mentioning in the commit
message.
--
Kiryl Shutsemau / Kirill A. Shutemov
* Re: [PATCH v11 18/20] x86: Handle TDX erratum to reset TDX private memory during kexec() and reboot
2023-06-14 10:02 ` kirill.shutemov
@ 2023-06-14 10:58 ` Huang, Kai
2023-06-14 11:08 ` kirill.shutemov
0 siblings, 1 reply; 144+ messages in thread
From: Huang, Kai @ 2023-06-14 10:58 UTC (permalink / raw)
To: kirill.shutemov
Cc: kvm, Hansen, Dave, david, bagasdotme, ak, Wysocki, Rafael J,
Luck, Tony, Chatre, Reinette, Christopherson,,
Sean, pbonzini, tglx, Yamahata, Isaku, linux-kernel, linux-mm,
Shahar, Sagi, peterz, imammedo, Gao, Chao, Brown, Len,
sathyanarayanan.kuppuswamy, Huang, Ying, Williams, Dan J
On Wed, 2023-06-14 at 13:02 +0300, kirill.shutemov@linux.intel.com wrote:
> On Wed, Jun 14, 2023 at 09:33:45AM +0000, Huang, Kai wrote:
> > On Mon, 2023-06-05 at 02:27 +1200, Kai Huang wrote:
> > > --- a/arch/x86/kernel/reboot.c
> > > +++ b/arch/x86/kernel/reboot.c
> > > @@ -720,6 +720,7 @@ void native_machine_shutdown(void)
> > >
> > > #ifdef CONFIG_X86_64
> > > x86_platform.iommu_shutdown();
> > > + x86_platform.memory_shutdown();
> > > #endif
> > > }
> >
> > Hi Kirill/Dave,
> >
> > I missed that this solution doesn't reset TDX private memory for emergency
> > restart or when reboot_force is set, because machine_shutdown() isn't called in
> > those cases.
> >
> > Is it acceptable? Or should we handle them too?
>
> Force reboot is not used in kexec path, right?
>
Correct.
> And the platform has to
> handle erratum in BIOS to reset memory status on reboot anyway.
So by "handle erratum in BIOS" I think you mean that "warm reset" doesn't reset
TDX private pages, and the BIOS needs to disable "warm reset".
IIUC this means the kernel needs to depend on a specific BIOS setting to work
normally, and IIUC the kernel cannot even be aware of this setting?
Should the kernel just reset all TDX private pages during reboot when the
erratum is present, so the kernel doesn't depend on the BIOS?
>
> I think we should be fine. But it is worth mentioning in the commit
> message.
>
Agreed.
* Re: [PATCH v11 18/20] x86: Handle TDX erratum to reset TDX private memory during kexec() and reboot
2023-06-14 10:58 ` Huang, Kai
@ 2023-06-14 11:08 ` kirill.shutemov
2023-06-14 11:17 ` Huang, Kai
0 siblings, 1 reply; 144+ messages in thread
From: kirill.shutemov @ 2023-06-14 11:08 UTC (permalink / raw)
To: Huang, Kai
Cc: kvm, Hansen, Dave, david, bagasdotme, ak, Wysocki, Rafael J,
Luck, Tony, Chatre, Reinette, Christopherson,,
Sean, pbonzini, tglx, Yamahata, Isaku, linux-kernel, linux-mm,
Shahar, Sagi, peterz, imammedo, Gao, Chao, Brown, Len,
sathyanarayanan.kuppuswamy, Huang, Ying, Williams, Dan J
On Wed, Jun 14, 2023 at 10:58:13AM +0000, Huang, Kai wrote:
> > And the platform has to
> > handle erratum in BIOS to reset memory status on reboot anyway.
>
> So by "handle erratum in BIOS" I think you mean that "warm reset" doesn't reset
> TDX private pages, and the BIOS needs to disable "warm reset".
>
> IIUC this means the kernel needs to depend on a specific BIOS setting to work
> normally, and IIUC the kernel cannot even be aware of this setting?
>
> Should the kernel just reset all TDX private pages during reboot when the
> erratum is present, so the kernel doesn't depend on the BIOS?
Kernel cannot really function if we don't trust BIOS to do its job. Kernel
depends on BIOS services anyway. We cannot try to handle everything in
kernel just in case BIOS drops the ball.
--
Kiryl Shutsemau / Kirill A. Shutemov
* Re: [PATCH v11 18/20] x86: Handle TDX erratum to reset TDX private memory during kexec() and reboot
2023-06-14 11:08 ` kirill.shutemov
@ 2023-06-14 11:17 ` Huang, Kai
0 siblings, 0 replies; 144+ messages in thread
From: Huang, Kai @ 2023-06-14 11:17 UTC (permalink / raw)
To: kirill.shutemov
Cc: kvm, Hansen, Dave, david, bagasdotme, ak, Wysocki, Rafael J,
linux-kernel, Chatre, Reinette, Christopherson,,
Sean, pbonzini, tglx, Yamahata, Isaku, linux-mm, Luck, Tony,
Shahar, Sagi, peterz, imammedo, Gao, Chao, Brown, Len,
sathyanarayanan.kuppuswamy, Huang, Ying, Williams, Dan J
On Wed, 2023-06-14 at 14:08 +0300, kirill.shutemov@linux.intel.com wrote:
> On Wed, Jun 14, 2023 at 10:58:13AM +0000, Huang, Kai wrote:
> > > And the platform has to
> > > handle erratum in BIOS to reset memory status on reboot anyway.
> >
> > So by "handle erratum in BIOS" I think you mean that "warm reset" doesn't reset
> > TDX private pages, and the BIOS needs to disable "warm reset".
> >
> > IIUC this means the kernel needs to depend on a specific BIOS setting to work
> > normally, and IIUC the kernel cannot even be aware of this setting?
> >
> > Should the kernel just reset all TDX private pages during reboot when the
> > erratum is present, so the kernel doesn't depend on the BIOS?
>
> Kernel cannot really function if we don't trust BIOS to do its job. Kernel
> depends on BIOS services anyway. We cannot try to handle everything in
> kernel just in case BIOS drops the ball.
>
In other words, I assume we just need to take care of kexec().
The current patch tries to handle reboot too, so I'll change to only cover
kexec(), assuming the BIOS will always disable warm reset reboot for platforms
with this erratum.
Thanks.
* Re: [PATCH v11 11/20] x86/virt/tdx: Fill out TDMRs to cover all TDX memory regions
[not found] ` <927ec9871721d2a50f1aba7d1cf7c3be50e4f49b.1685887183.git.kai.huang@intel.com>
` (2 preceding siblings ...)
2023-06-09 4:01 ` Sathyanarayanan Kuppuswamy
@ 2023-06-14 12:31 ` Nikolay Borisov
2023-06-14 22:45 ` Huang, Kai
3 siblings, 1 reply; 144+ messages in thread
From: Nikolay Borisov @ 2023-06-14 12:31 UTC (permalink / raw)
To: Kai Huang, linux-kernel, kvm
Cc: linux-mm, dave.hansen, kirill.shutemov, tony.luck, peterz, tglx,
seanjc, pbonzini, david, dan.j.williams, rafael.j.wysocki,
ying.huang, reinette.chatre, len.brown, ak, isaku.yamahata,
chao.gao, sathyanarayanan.kuppuswamy, bagasdotme, sagis,
imammedo
On 4.06.23 г. 17:27 ч., Kai Huang wrote:
<snip>
> ---
> arch/x86/virt/vmx/tdx/tdx.c | 94 ++++++++++++++++++++++++++++++++++++-
> 1 file changed, 93 insertions(+), 1 deletion(-)
>
> diff --git a/arch/x86/virt/vmx/tdx/tdx.c b/arch/x86/virt/vmx/tdx/tdx.c
> index 7a20c72361e7..fa9fa8bc581a 100644
> --- a/arch/x86/virt/vmx/tdx/tdx.c
> +++ b/arch/x86/virt/vmx/tdx/tdx.c
> @@ -385,6 +385,93 @@ static void free_tdmr_list(struct tdmr_info_list *tdmr_list)
> tdmr_list->max_tdmrs * tdmr_list->tdmr_sz);
> }
>
> +/* Get the TDMR from the list at the given index. */
> +static struct tdmr_info *tdmr_entry(struct tdmr_info_list *tdmr_list,
> + int idx)
> +{
> + int tdmr_info_offset = tdmr_list->tdmr_sz * idx;
> +
> + return (void *)tdmr_list->tdmrs + tdmr_info_offset;
nit: I would just like to point out that arithmetic on 'void *' (with
sizeof(void) treated as 1) is a GNU C extension:
https://gcc.gnu.org/onlinedocs/gcc-4.4.2/gcc/Pointer-Arith.html#Pointer-Arith
I don't know if clang treats it the same way, just for the sake of
simplicity you might wanna change this (void *) to (char *).
<snip>
* Re: [PATCH v11 11/20] x86/virt/tdx: Fill out TDMRs to cover all TDX memory regions
2023-06-14 12:31 ` Nikolay Borisov
@ 2023-06-14 22:45 ` Huang, Kai
0 siblings, 0 replies; 144+ messages in thread
From: Huang, Kai @ 2023-06-14 22:45 UTC (permalink / raw)
To: kvm, nik.borisov, linux-kernel
Cc: Hansen, Dave, david, bagasdotme, ak, Wysocki, Rafael J,
kirill.shutemov, Chatre, Reinette, Christopherson,,
Sean, pbonzini, linux-mm, Yamahata, Isaku, tglx, Luck, Tony,
peterz, Shahar, Sagi, imammedo, Gao, Chao, Brown, Len,
sathyanarayanan.kuppuswamy, Huang, Ying, Williams, Dan J
On Wed, 2023-06-14 at 15:31 +0300, Nikolay Borisov wrote:
>
> On 4.06.23 г. 17:27 ч., Kai Huang wrote:
> <snip>
>
> > ---
> > arch/x86/virt/vmx/tdx/tdx.c | 94 ++++++++++++++++++++++++++++++++++++-
> > 1 file changed, 93 insertions(+), 1 deletion(-)
> >
> > diff --git a/arch/x86/virt/vmx/tdx/tdx.c b/arch/x86/virt/vmx/tdx/tdx.c
> > index 7a20c72361e7..fa9fa8bc581a 100644
> > --- a/arch/x86/virt/vmx/tdx/tdx.c
> > +++ b/arch/x86/virt/vmx/tdx/tdx.c
> > @@ -385,6 +385,93 @@ static void free_tdmr_list(struct tdmr_info_list *tdmr_list)
> > tdmr_list->max_tdmrs * tdmr_list->tdmr_sz);
> > }
> >
> > +/* Get the TDMR from the list at the given index. */
> > +static struct tdmr_info *tdmr_entry(struct tdmr_info_list *tdmr_list,
> > + int idx)
> > +{
> > + int tdmr_info_offset = tdmr_list->tdmr_sz * idx;
> > +
> > + return (void *)tdmr_list->tdmrs + tdmr_info_offset;
>
> nit: I would just like to point out that arithmetic on 'void *' (with
> sizeof(void) treated as 1) is a GNU C extension:
> https://gcc.gnu.org/onlinedocs/gcc-4.4.2/gcc/Pointer-Arith.html#Pointer-Arith
>
> I don't know if clang treats it the same way, just for the sake of
> simplicity you might wanna change this (void *) to (char *).
Then we will need an additional cast from 'char *' to 'struct tdmr_info *', I
suppose? Not sure whether it is worth the additional cast.
And I found that such 'void *' arithmetic is already used in other kernel
code too, e.g., the code below in the networking code:
./net/rds/tcp_send.c:105: (void *)&rm->m_inc.i_hdr + hdr_off,
and I believe there are other examples too (I didn't spend a lot of time
grepping).
And it seems Linus also thinks "using arithmetic on 'void *' is generally
superior":
https://lore.kernel.org/lkml/CAHk-=whFKYMrF6euVvziW+drw7-yi1pYdf=uccnzJ8k09DoTXA@mail.gmail.com/t/#m983827708903c8c5bddf193343d392c9ed5af1a0
So I wouldn't worry about the Clang thing.
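For reference, a minimal userspace sketch of the two variants discussed above.
This is not the kernel code: the struct layouts are cut-down stand-ins and the
helper names tdmr_entry_void()/tdmr_entry_char() are hypothetical. The 'void *'
variant relies on the GNU C extension (GCC documents it; to my knowledge Clang
accepts it in its default GNU mode as well), while the 'char *' variant is
standard C but needs the explicit cast back to the struct pointer.

```c
#include <assert.h>
#include <stddef.h>

/* Cut-down stand-ins for the kernel structures; fields are illustrative. */
struct tdmr_info {
	unsigned long base;
	unsigned long size;
};

struct tdmr_info_list {
	void *tdmrs;	/* flat buffer holding max_tdmrs entries */
	int tdmr_sz;	/* size of one entry; may exceed sizeof(struct tdmr_info) */
	int max_tdmrs;
};

/* Variant 1: GNU C extension, 'void *' arithmetic in units of one byte. */
static struct tdmr_info *tdmr_entry_void(struct tdmr_info_list *l, int idx)
{
	assert(idx >= 0 && idx < l->max_tdmrs);
	/* 'void *' converts implicitly to the return type, so no cast. */
	return l->tdmrs + (size_t)l->tdmr_sz * idx;
}

/* Variant 2: standard C, byte arithmetic on 'char *' plus an explicit cast. */
static struct tdmr_info *tdmr_entry_char(struct tdmr_info_list *l, int idx)
{
	assert(idx >= 0 && idx < l->max_tdmrs);
	return (struct tdmr_info *)((char *)l->tdmrs + (size_t)l->tdmr_sz * idx);
}
```

Both helpers compute the same address; the only difference is whether the
extension or the extra cast is used.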
* Re: [PATCH v11 12/20] x86/virt/tdx: Allocate and set up PAMTs for TDMRs
[not found] ` <4e108968c3294189ad150f62df1f146168036342.1685887183.git.kai.huang@intel.com>
2023-06-08 23:24 ` [PATCH v11 12/20] x86/virt/tdx: Allocate and set up PAMTs for TDMRs kirill.shutemov
@ 2023-06-15 7:48 ` Nikolay Borisov
1 sibling, 0 replies; 144+ messages in thread
From: Nikolay Borisov @ 2023-06-15 7:48 UTC (permalink / raw)
To: Kai Huang, linux-kernel, kvm
Cc: linux-mm, dave.hansen, kirill.shutemov, tony.luck, peterz, tglx,
seanjc, pbonzini, david, dan.j.williams, rafael.j.wysocki,
ying.huang, reinette.chatre, len.brown, ak, isaku.yamahata,
chao.gao, sathyanarayanan.kuppuswamy, bagasdotme, sagis,
imammedo
On 4.06.23 г. 17:27 ч., Kai Huang wrote:
<snip>
> /*
> * Construct a list of TDMRs on the preallocated space in @tdmr_list
> * to cover all TDX memory regions in @tmb_list based on the TDX module
> @@ -487,10 +684,13 @@ static int construct_tdmrs(struct list_head *tmb_list,
> if (ret)
> return ret;
>
> + ret = tdmrs_set_up_pamt_all(tdmr_list, tmb_list,
> + sysinfo->pamt_entry_size);
> + if (ret)
> + return ret;
> /*
> * TODO:
> *
> - * - Allocate and set up PAMTs for each TDMR.
> * - Designate reserved areas for each TDMR.
> *
> * Return -EINVAL until constructing TDMRs is done
> @@ -547,6 +747,11 @@ static int init_tdx_module(void)
> * Return error before all steps are done.
> */
> ret = -EINVAL;
> + if (ret)
> + tdmrs_free_pamt_all(&tdx_tdmr_list);
> + else
> + pr_info("%lu KBs allocated for PAMT.\n",
> + tdmrs_count_pamt_pages(&tdx_tdmr_list) * 4);
Why not put the pr_info right after the 'if (ret)' check following
tdmrs_set_up_pamt_all(), and make the tdmrs_free_pamt_all() call
unconditional?
It seems the main reason for having a bunch of conditionals in the exit
path is that you share the put_online_mems() call between the success and
failure cases. If you simply add:
	put_online_mems();
	return 0;
	/* failure labels follow */
then you can do without the if (ret) checks and have straight-line
code doing the error handling.
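To make the shape concrete, here is a small userspace sketch of that pattern.
It is only a model under stated assumptions: get_mems()/put_mems() stand in for
the shared lock, setup_step() stands in for the real setup calls, and the
counters only record which cleanups ran. None of these names exist in the patch.

```c
#include <assert.h>

/* Hypothetical stand-ins; the counters record which cleanups ran. */
static int mems_lock_depth;
static int tdmrs_freed, pamts_freed;

static void get_mems(void)
{
	mems_lock_depth++;
}

static void put_mems(void)
{
	assert(mems_lock_depth > 0);	/* catch an unbalanced release */
	mems_lock_depth--;
}

/* Each modelled setup step fails with -1 when its flag is set. */
static int setup_step(int should_fail)
{
	return should_fail ? -1 : 0;
}

static int init_sketch(int fail_tdmrs, int fail_pamts, int fail_keyid)
{
	int ret;

	get_mems();

	ret = setup_step(fail_tdmrs);	/* models allocating the TDMR list */
	if (ret)
		goto out;

	ret = setup_step(fail_pamts);	/* models setting up the PAMTs */
	if (ret)
		goto out_free_tdmrs;

	ret = setup_step(fail_keyid);	/* models a later step, e.g. key config */
	if (ret)
		goto out_free_pamts;

	/* Success: drop the lock and return before reaching the labels. */
	put_mems();
	return 0;

	/* Failure labels: unconditional cleanup, in reverse order of setup. */
out_free_pamts:
	pamts_freed++;
out_free_tdmrs:
	tdmrs_freed++;
out:
	put_mems();
	return ret;
}
```

The cost is one duplicated release call on the success path; in exchange every
label below it is reached only on failure, so no 'if (ret)' is needed there.
Note that every step's return value still has to be checked before the success
return.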
> out_free_tdmrs:
> if (ret)
> free_tdmr_list(&tdx_tdmr_list);
> diff --git a/arch/x86/virt/vmx/tdx/tdx.h b/arch/x86/virt/vmx/tdx/tdx.h
> index c20848e76469..e8110e1a9980 100644
> --- a/arch/x86/virt/vmx/tdx/tdx.h
> +++ b/arch/x86/virt/vmx/tdx/tdx.h
> @@ -133,6 +133,7 @@ struct tdx_memblock {
> struct list_head list;
> unsigned long start_pfn;
> unsigned long end_pfn;
> + int nid;
> };
>
> struct tdmr_info_list {
* Re: [PATCH v11 15/20] x86/virt/tdx: Configure global KeyID on all packages
[not found] ` <30358db4eff961c69783bbd4d9f3e50932a9a759.1685887183.git.kai.huang@intel.com>
2023-06-08 23:53 ` [PATCH v11 15/20] x86/virt/tdx: Configure global KeyID on all packages kirill.shutemov
@ 2023-06-15 8:12 ` Nikolay Borisov
2023-06-15 22:24 ` Huang, Kai
1 sibling, 1 reply; 144+ messages in thread
From: Nikolay Borisov @ 2023-06-15 8:12 UTC (permalink / raw)
To: Kai Huang, linux-kernel, kvm
Cc: linux-mm, dave.hansen, kirill.shutemov, tony.luck, peterz, tglx,
seanjc, pbonzini, david, dan.j.williams, rafael.j.wysocki,
ying.huang, reinette.chatre, len.brown, ak, isaku.yamahata,
chao.gao, sathyanarayanan.kuppuswamy, bagasdotme, sagis,
imammedo
On 4.06.23 г. 17:27 ч., Kai Huang wrote:
> After the list of TDMRs and the global KeyID are configured to the TDX
> module, the kernel needs to configure the key of the global KeyID on all
> packages using TDH.SYS.KEY.CONFIG.
>
> This SEAMCALL cannot run parallel on different cpus. Loop all online
> cpus and use smp_call_on_cpu() to call this SEAMCALL on the first cpu of
> each package.
>
> To keep things simple, this implementation takes no affirmative steps to
> online cpus to make sure there's at least one cpu for each package. The
> callers (aka. KVM) can ensure success by ensuring that.
The last sentence is a bit hard to read due to the multiple use of
ensure/ensuring. OTOH I find the comment in the code somewhat more
coherent:
> + * This code takes no affirmative steps to online CPUs. Callers (aka.
> + * KVM) can ensure success by ensuring sufficient CPUs are online for
> + * this to succeed.
> + */
I'd suggest you just use those words. Or just saying "Callers (such as
KVM) can ensure success by onlining at least 1 CPU per package."
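As an illustration of why at least one online CPU per package is needed, here
is a userspace model of the "first online cpu of each package" loop described
in the changelog. It is a sketch only: struct cpu_desc, fake_key_config() and
config_once_per_package() are hypothetical names, and the callback merely
records which CPU would have run the SEAMCALL.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define MAX_PACKAGES	8
#define MAX_CALLS	8

/* Hypothetical per-CPU description used only by this model. */
struct cpu_desc {
	int package_id;
	bool online;
};

static size_t called_cpus[MAX_CALLS];
static int nr_called;

/* Stand-in for running the per-package SEAMCALL on @cpu. */
static int fake_key_config(size_t cpu)
{
	assert(nr_called < MAX_CALLS);
	called_cpus[nr_called++] = cpu;
	return 0;
}

/*
 * Call @fn serially, once per package, on the first online CPU found
 * for that package. Returns -1 if some package has no online CPU.
 */
static int config_once_per_package(const struct cpu_desc *cpus, size_t nr_cpus,
				   int nr_packages, int (*fn)(size_t cpu))
{
	bool done[MAX_PACKAGES] = { false };
	size_t cpu;
	int pkg, ret;

	if (nr_packages < 0 || nr_packages > MAX_PACKAGES)
		return -1;

	for (cpu = 0; cpu < nr_cpus; cpu++) {
		pkg = cpus[cpu].package_id;
		if (!cpus[cpu].online || pkg < 0 || pkg >= nr_packages || done[pkg])
			continue;

		ret = fn(cpu);
		if (ret)
			return ret;
		done[pkg] = true;
	}

	/* No affirmative step is taken to online CPUs: just report failure. */
	for (pkg = 0; pkg < nr_packages; pkg++)
		if (!done[pkg])
			return -1;

	return 0;
}
```

With every CPU of one package offline, the model fails after configuring the
other packages, which is the case the caller is expected to avoid.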
<snip>
> static int init_tdx_module(void)
> {
> static DECLARE_PADDED_STRUCT(tdsysinfo_struct, tdsysinfo,
> @@ -980,15 +1073,47 @@ static int init_tdx_module(void)
> if (ret)
> goto out_free_pamts;
>
> + /*
> + * Hardware doesn't guarantee cache coherency across different
> + * KeyIDs. The kernel needs to flush PAMT's dirty cachelines
> + * (associated with KeyID 0) before the TDX module can use the
> + * global KeyID to access the PAMT. Given PAMTs are potentially
> + * large (~1/256th of system RAM), just use WBINVD on all cpus
> + * to flush the cache.
> + */
> + wbinvd_on_all_cpus();
> +
> + /* Config the key of global KeyID on all packages */
> + ret = config_global_keyid();
> + if (ret)
> + goto out_reset_pamts;
> +
> /*
> * TODO:
> *
> - * - Configure the global KeyID on all packages.
> * - Initialize all TDMRs.
> *
> * Return error before all steps are done.
> */
> ret = -EINVAL;
> +out_reset_pamts:
> + if (ret) {
Again with those conditionals in the error paths. Just copy the
put_online_mems(); return 0; pair, and this will decrease the indentation
level and make the code linear. Here's what the diff looks like:
diff --git a/arch/x86/virt/vmx/tdx/tdx.c b/arch/x86/virt/vmx/tdx/tdx.c
index 4aa41352edfc..49fda2a28f24 100644
--- a/arch/x86/virt/vmx/tdx/tdx.c
+++ b/arch/x86/virt/vmx/tdx/tdx.c
@@ -1131,6 +1131,8 @@ static int init_tdx_module(void)
if (ret)
goto out_free_pamts;
+ pr_info("%lu KBs allocated for PAMT.\n",
+ tdmrs_count_pamt_pages(&tdx_tdmr_list) * 4);
/*
* Hardware doesn't guarantee cache coherency across different
* KeyIDs. The kernel needs to flush PAMT's dirty cachelines
@@ -1148,36 +1150,32 @@ static int init_tdx_module(void)
/* Initialize TDMRs to complete the TDX module initialization */
ret = init_tdmrs(&tdx_tdmr_list);
+
+ put_online_mems();
+
+ return 0;
out_reset_pamts:
- if (ret) {
- /*
- * Part of PAMTs may already have been initialized by the
- * TDX module. Flush cache before returning PAMTs back
- * to the kernel.
- */
- wbinvd_on_all_cpus();
- /*
- * According to the TDX hardware spec, if the platform
- * doesn't have the "partial write machine check"
- * erratum, any kernel read/write will never cause #MC
- * in kernel space, thus it's OK to not convert PAMTs
- * back to normal. But do the conversion anyway here
- * as suggested by the TDX spec.
- */
- tdmrs_reset_pamt_all(&tdx_tdmr_list);
- }
+ /*
+ * Part of PAMTs may already have been initialized by the
+ * TDX module. Flush cache before returning PAMTs back
+ * to the kernel.
+ */
+ wbinvd_on_all_cpus();
+ /*
+ * According to the TDX hardware spec, if the platform
+ * doesn't have the "partial write machine check"
+ * erratum, any kernel read/write will never cause #MC
+ * in kernel space, thus it's OK to not convert PAMTs
+ * back to normal. But do the conversion anyway here
+ * as suggested by the TDX spec.
+ */
+ tdmrs_reset_pamt_all(&tdx_tdmr_list);
out_free_pamts:
- if (ret)
- tdmrs_free_pamt_all(&tdx_tdmr_list);
- else
- pr_info("%lu KBs allocated for PAMT.\n",
- tdmrs_count_pamt_pages(&tdx_tdmr_list) * 4);
+ tdmrs_free_pamt_all(&tdx_tdmr_list);
out_free_tdmrs:
- if (ret)
- free_tdmr_list(&tdx_tdmr_list);
+ free_tdmr_list(&tdx_tdmr_list);
out_free_tdx_mem:
- if (ret)
- free_tdx_memlist(&tdx_memlist);
+ free_tdx_memlist(&tdx_memlist);
out:
/*
* @tdx_memlist is written here and read at memory hotplug time.
<snip>
* Re: [PATCH v11 15/20] x86/virt/tdx: Configure global KeyID on all packages
2023-06-15 8:12 ` Nikolay Borisov
@ 2023-06-15 22:24 ` Huang, Kai
2023-06-19 14:56 ` kirill.shutemov
0 siblings, 1 reply; 144+ messages in thread
From: Huang, Kai @ 2023-06-15 22:24 UTC (permalink / raw)
To: kvm, nik.borisov, linux-kernel
Cc: Hansen, Dave, david, bagasdotme, ak, Wysocki, Rafael J,
kirill.shutemov, Chatre, Reinette, Christopherson,,
Sean, pbonzini, linux-mm, Yamahata, Isaku, tglx, Luck, Tony,
peterz, Shahar, Sagi, imammedo, Gao, Chao, Brown, Len,
sathyanarayanan.kuppuswamy, Huang, Ying, Williams, Dan J
On Thu, 2023-06-15 at 11:12 +0300, Nikolay Borisov wrote:
>
> On 4.06.23 г. 17:27 ч., Kai Huang wrote:
> > After the list of TDMRs and the global KeyID are configured to the TDX
> > module, the kernel needs to configure the key of the global KeyID on all
> > packages using TDH.SYS.KEY.CONFIG.
> >
> > This SEAMCALL cannot run parallel on different cpus. Loop all online
> > cpus and use smp_call_on_cpu() to call this SEAMCALL on the first cpu of
> > each package.
> >
> > To keep things simple, this implementation takes no affirmative steps to
> > online cpus to make sure there's at least one cpu for each package. The
> > callers (aka. KVM) can ensure success by ensuring that.
>
> The last sentence is a bit hard to read due to the multiple use of
> ensure/ensuring. OTOH I find the comment in the code somewhat more
> coherent:
>
> > + * This code takes no affirmative steps to online CPUs. Callers (aka.
> > + * KVM) can ensure success by ensuring sufficient CPUs are online for
> > + * this to succeed.
> > + */
>
> I'd suggest you just use those words. Or just saying "Callers (such as
> KVM) can ensure success by onlining at least 1 CPU per package."
>
OK will do. Thanks.
>
>
> > static int init_tdx_module(void)
> > {
> > static DECLARE_PADDED_STRUCT(tdsysinfo_struct, tdsysinfo,
> > @@ -980,15 +1073,47 @@ static int init_tdx_module(void)
> > if (ret)
> > goto out_free_pamts;
> >
> > + /*
> > + * Hardware doesn't guarantee cache coherency across different
> > + * KeyIDs. The kernel needs to flush PAMT's dirty cachelines
> > + * (associated with KeyID 0) before the TDX module can use the
> > + * global KeyID to access the PAMT. Given PAMTs are potentially
> > + * large (~1/256th of system RAM), just use WBINVD on all cpus
> > + * to flush the cache.
> > + */
> > + wbinvd_on_all_cpus();
> > +
> > + /* Config the key of global KeyID on all packages */
> > + ret = config_global_keyid();
> > + if (ret)
> > + goto out_reset_pamts;
> > +
> > /*
> > * TODO:
> > *
> > - * - Configure the global KeyID on all packages.
> > * - Initialize all TDMRs.
> > *
> > * Return error before all steps are done.
> > */
> > ret = -EINVAL;
> > +out_reset_pamts:
> > + if (ret) {
>
> Again with those conditionals in the error paths. Just copy the
> put_online_mems(); return 0; pair, and this will decrease the indentation
> level and make the code linear. Here's what the diff looks like:
(This applies to your other reply too.)
I noticed this too when preparing this series. In old versions the TDMRs were
always freed regardless of the module initialization result, so doing what you
suggested wasn't a good fit. Now we can do what you suggested (assuming we don't
change back to always freeing TDMRs); I just wasn't so sure when I was preparing
this series.
I'll take a look at the resulting patches with your suggestion. Thanks.
Hi Kirill/Dave,
Since I have received a couple of tags from you, may I know which way you
prefer?
>
>
> diff --git a/arch/x86/virt/vmx/tdx/tdx.c b/arch/x86/virt/vmx/tdx/tdx.c
> index 4aa41352edfc..49fda2a28f24 100644
> --- a/arch/x86/virt/vmx/tdx/tdx.c
> +++ b/arch/x86/virt/vmx/tdx/tdx.c
> @@ -1131,6 +1131,8 @@ static int init_tdx_module(void)
> if (ret)
> goto out_free_pamts;
>
> + pr_info("%lu KBs allocated for PAMT.\n",
> + tdmrs_count_pamt_pages(&tdx_tdmr_list) * 4);
> /*
> * Hardware doesn't guarantee cache coherency across different
> * KeyIDs. The kernel needs to flush PAMT's dirty cachelines
> @@ -1148,36 +1150,32 @@ static int init_tdx_module(void)
>
> /* Initialize TDMRs to complete the TDX module initialization */
> ret = init_tdmrs(&tdx_tdmr_list);
> +
> + put_online_mems();
> +
> + return 0;
> out_reset_pamts:
> - if (ret) {
> - /*
> - * Part of PAMTs may already have been initialized by the
> - * TDX module. Flush cache before returning PAMTs back
> - * to the kernel.
> - */
> - wbinvd_on_all_cpus();
> - /*
> - * According to the TDX hardware spec, if the platform
> - * doesn't have the "partial write machine check"
> - * erratum, any kernel read/write will never cause #MC
> - * in kernel space, thus it's OK to not convert PAMTs
> - * back to normal. But do the conversion anyway here
> - * as suggested by the TDX spec.
> - */
> - tdmrs_reset_pamt_all(&tdx_tdmr_list);
> - }
> + /*
> + * Part of PAMTs may already have been initialized by the
> + * TDX module. Flush cache before returning PAMTs back
> + * to the kernel.
> + */
> + wbinvd_on_all_cpus();
> + /*
> + * According to the TDX hardware spec, if the platform
> + * doesn't have the "partial write machine check"
> + * erratum, any kernel read/write will never cause #MC
> + * in kernel space, thus it's OK to not convert PAMTs
> + * back to normal. But do the conversion anyway here
> + * as suggested by the TDX spec.
> + */
> + tdmrs_reset_pamt_all(&tdx_tdmr_list);
> out_free_pamts:
> - if (ret)
> - tdmrs_free_pamt_all(&tdx_tdmr_list);
> - else
> - pr_info("%lu KBs allocated for PAMT.\n",
> - tdmrs_count_pamt_pages(&tdx_tdmr_list) * 4);
> + tdmrs_free_pamt_all(&tdx_tdmr_list);
> out_free_tdmrs:
> - if (ret)
> - free_tdmr_list(&tdx_tdmr_list);
> + free_tdmr_list(&tdx_tdmr_list);
> out_free_tdx_mem:
> - if (ret)
> - free_tdx_memlist(&tdx_memlist);
> + free_tdx_memlist(&tdx_memlist);
> out:
> /*
> * @tdx_memlist is written here and read at memory hotplug time.
>
>
> <snip>
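The control-flow change proposed in the diff above can be sketched outside the kernel. This is a minimal userspace illustration, not the TDX code: the resource names, `step()`, `undo()` and the `fail_at` knob are all hypothetical, chosen only to show a linear success path that returns early, followed by error-only labels that unwind unconditionally in reverse order.

```c
/* Hypothetical bookkeeping: a bitmask of resources currently held. */
enum { RES_MEMLIST = 1 << 0, RES_TDMRS = 1 << 1, RES_PAMTS = 1 << 2 };

static unsigned int held;

/* Each step fails when its index equals fail_at (hypothetical knob). */
static int step(int idx, int fail_at, unsigned int res)
{
	if (idx == fail_at)
		return -1;
	held |= res;
	return 0;
}

static void undo(unsigned int res)
{
	held &= ~res;
}

int init_sketch(int fail_at)
{
	int ret;

	held = 0;

	ret = step(0, fail_at, RES_MEMLIST);
	if (ret)
		goto out;
	ret = step(1, fail_at, RES_TDMRS);
	if (ret)
		goto out_free_memlist;
	ret = step(2, fail_at, RES_PAMTS);
	if (ret)
		goto out_free_tdmrs;
	/* A final step that acquires nothing but may still fail. */
	ret = step(3, fail_at, 0);
	if (ret)
		goto out_free_pamts;

	/* Success: return early so the labels below are error-only. */
	return 0;

out_free_pamts:
	undo(RES_PAMTS);
out_free_tdmrs:
	undo(RES_TDMRS);
out_free_memlist:
	undo(RES_MEMLIST);
out:
	return ret;
}
```

Because the success path never falls through the labels, none of the cleanup calls needs an `if (ret)` guard. Calling `init_sketch(3)` releases all three resources again, while `init_sketch(-1)` keeps them.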
^ permalink raw reply [flat|nested] 144+ messages in thread
* Re: [PATCH v11 20/20] Documentation/x86: Add documentation for TDX host support
[not found] ` <34853e0f8f38ec2fda66b0ba480d4df63b8aab43.1685887183.git.kai.huang@intel.com>
2023-06-08 23:56 ` [PATCH v11 20/20] Documentation/x86: Add documentation for TDX host support Dave Hansen
@ 2023-06-16 9:02 ` Nikolay Borisov
2023-06-16 16:26 ` Dave Hansen
1 sibling, 1 reply; 144+ messages in thread
From: Nikolay Borisov @ 2023-06-16 9:02 UTC (permalink / raw)
To: Kai Huang, linux-kernel, kvm
Cc: linux-mm, dave.hansen, kirill.shutemov, tony.luck, peterz, tglx,
seanjc, pbonzini, david, dan.j.williams, rafael.j.wysocki,
ying.huang, reinette.chatre, len.brown, ak, isaku.yamahata,
chao.gao, sathyanarayanan.kuppuswamy, bagasdotme, sagis,
imammedo
On 4.06.23 г. 17:27 ч., Kai Huang wrote:
<snip>
> +
> +To enable TDX, the user of TDX should: 1) hold read lock of CPU hotplug
> +lock; 2) do VMXON and tdx_enable_cpu() on all online cpus successfully;
> +3) call tdx_enable(). For example::
> +
> + cpus_read_lock();
> + on_each_cpu(vmxon_and_tdx_cpu_enable());
> + ret = tdx_enable();
> + cpus_read_unlock();
> + if (ret)
> + goto no_tdx;
> + // TDX is ready to use
> +
> +And the user of TDX must be guarantee tdx_cpu_enable() has beene
s/be// and s/beene/been/
> +successfully done on any cpu before it wants to run any other SEAMCALL.
> +A typical usage is do both VMXON and tdx_cpu_enable() in CPU hotplug
> +online callback, and refuse to online if tdx_cpu_enable() fails.
> +
> +User can consult dmesg to see the presence of the TDX module, and whether
> +it has been initialized.
> +
> +If the TDX module is not loaded, dmesg shows below::
> +
> + [..] tdx: TDX module is not loaded.
nit: There were some comments that given the tdx: prefix it's redundant
to also have TDX in the printed string. You might modify this in the
code but it should also be reflected in the docs for the sake of
completeness.
> +
> +If the TDX module is initialized successfully, dmesg shows something
> +like below::
> +
> + [..] tdx: TDX module: attributes 0x0, vendor_id 0x8086, major_version 1, minor_version 0, build_date 20211209, build_num 160
> + [..] tdx: 262668 KBs allocated for PAMT.
> + [..] tdx: TDX module initialized.
> +
> +If the TDX module failed to initialize, dmesg also shows it failed to
> +initialize::
> +
> + [..] tdx: TDX module initialization failed ...
> +
> +TDX Interaction to Other Kernel Components
> +------------------------------------------
> +
> +TDX Memory Policy
> +~~~~~~~~~~~~~~~~~
> +
> +TDX reports a list of "Convertible Memory Region" (CMR) to tell the
nit: It might be worth mentioning that those CMRs ultimately come from
the BIOS. That is never mentioned here, yet the "Physical Memory
Hotplug" section says directly that the BIOS shouldn't support
hot-removal of memory. So the BIOS is a central component in a sense.
> +kernel which memory is TDX compatible. The kernel needs to build a list
> +of memory regions (out of CMRs) as "TDX-usable" memory and pass those
> +regions to the TDX module. Once this is done, those "TDX-usable" memory
> +regions are fixed during module's lifetime.
> +
> +To keep things simple, currently the kernel simply guarantees all pages
> +in the page allocator are TDX memory. Specifically, the kernel uses all
> +system memory in the core-mm at the time of initializing the TDX module
> +as TDX memory, and in the meantime, refuses to online any non-TDX-memory
> +in the memory hotplug.
> +
<snip>
^ permalink raw reply [flat|nested] 144+ messages in thread
* Re: [PATCH v11 20/20] Documentation/x86: Add documentation for TDX host support
2023-06-16 9:02 ` Nikolay Borisov
@ 2023-06-16 16:26 ` Dave Hansen
0 siblings, 0 replies; 144+ messages in thread
From: Dave Hansen @ 2023-06-16 16:26 UTC (permalink / raw)
To: Nikolay Borisov, Kai Huang, linux-kernel, kvm
Cc: linux-mm, kirill.shutemov, tony.luck, peterz, tglx, seanjc,
pbonzini, david, dan.j.williams, rafael.j.wysocki, ying.huang,
reinette.chatre, len.brown, ak, isaku.yamahata, chao.gao,
sathyanarayanan.kuppuswamy, bagasdotme, sagis, imammedo
On 6/16/23 02:02, Nikolay Borisov wrote:
>>
>> +TDX reports a list of "Convertible Memory Region" (CMR) to tell the
>
> nit: It might be worth mentioning that those CMRs ultimately come from
> the BIOS. Because it's never mentioned here and in the "Physical Memory
> Hotplug" it's directly mentioned that bios shouldn't support hot-removal
> of memory. So the bios is a central component in a sense.
The BIOS is weird on TDX systems. It's central, sure, but it's also
untrusted. The TDX module generally has a kind of "trust but verify"
approach to the BIOS.
I guess the BIOS is the one poking at the memory controllers and getting
the DIMMs fired up. But I _do_ think it's OK to say that CMRs come from
the TDX module. The important thing is that they're trusted.
^ permalink raw reply [flat|nested] 144+ messages in thread
* Re: [PATCH v11 04/20] x86/cpu: Detect TDX partial write machine check erratum
2023-06-07 22:43 ` Huang, Kai
@ 2023-06-19 11:37 ` Huang, Kai
2023-06-20 15:44 ` Dave Hansen
0 siblings, 1 reply; 144+ messages in thread
From: Huang, Kai @ 2023-06-19 11:37 UTC (permalink / raw)
To: kvm, Hansen, Dave, linux-kernel
Cc: Luck, Tony, david, bagasdotme, ak, Wysocki, Rafael J,
kirill.shutemov, Chatre, Reinette, Christopherson,,
Sean, pbonzini, tglx, Yamahata, Isaku, linux-mm, Shahar, Sagi,
peterz, imammedo, Gao, Chao, Brown, Len,
sathyanarayanan.kuppuswamy, Huang, Ying, Williams, Dan J
On Wed, 2023-06-07 at 22:43 +0000, Huang, Kai wrote:
> On Wed, 2023-06-07 at 07:15 -0700, Hansen, Dave wrote:
> > On 6/4/23 07:27, Kai Huang wrote:
> > > TDX memory has integrity and confidentiality protections. Violations of
> > > this integrity protection are supposed to only affect TDX operations and
> > > are never supposed to affect the host kernel itself. In other words,
> > > the host kernel should never, itself, see machine checks induced by the
> > > TDX integrity hardware.
> >
> > At the risk of patting myself on the back by acking a changelog that I
> > wrote 95% of:
> >
> > Reviewed-by: Dave Hansen <dave.hansen@linux.intel.com>
> >
>
> Thanks!
Hi Dave,
Thanks for reviewing and providing the tag. However, I found a bug if we use
early_initcall() to detect the erratum here: in the later kexec() patch,
early_initcall(tdx_init) sets up the x86_platform.memory_shutdown() callback to
reset TDX private memory depending on the presence of the erratum, but there is
no guarantee that the erratum detection runs before tdx_init(), because both are
early_initcall()s.
Kirill also said early_initcall() isn't the right place, so I moved the
detection to an earlier phase, in bsp_init_intel(). We only need to match the
CPU once, on the BSP, assuming the CPU model is consistent across all CPUs
(which x86_match_cpu() assumes anyway).
Please let me know if you have any comments.
+/*
+ * These CPUs have an erratum. A partial write from non-TD
+ * software (e.g. via MOVNTI variants or UC/WC mapping) to TDX
+ * private memory poisons that memory, and a subsequent read of
+ * that memory triggers #MC.
+ */
+static const struct x86_cpu_id tdx_pw_mce_cpu_ids[] __initconst = {
+	X86_MATCH_INTEL_FAM6_MODEL(SAPPHIRERAPIDS_X, NULL),
+	X86_MATCH_INTEL_FAM6_MODEL(EMERALDRAPIDS_X, NULL),
+	{ }
+};
+
 static void bsp_init_intel(struct cpuinfo_x86 *c)
 {
 	resctrl_cpu_detect(c);
+
+	if (x86_match_cpu(tdx_pw_mce_cpu_ids))
+		setup_force_cpu_bug(X86_BUG_TDX_PW_MCE);
 }
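To make the "match once on the BSP, read the flag later" idea concrete, here is a small userspace model. It is only a sketch: `struct cpu_id_sketch`, `match_cpu_sketch()` and the flag are hypothetical stand-ins for the kernel's match table and bug bit, and the model numbers (0x8F and 0xCF) are my assumption for Sapphire Rapids X and Emerald Rapids X, not something stated in this thread.

```c
#include <stdbool.h>

/* Hypothetical stand-in for struct x86_cpu_id: family/model only. */
struct cpu_id_sketch {
	unsigned int family;
	unsigned int model;
};

/*
 * Assumed model numbers (not taken from this thread): 0x8F for
 * Sapphire Rapids X and 0xCF for Emerald Rapids X, both family 6.
 * The all-zero entry terminates the table.
 */
static const struct cpu_id_sketch pw_mce_ids[] = {
	{ 6, 0x8F },
	{ 6, 0xCF },
	{ 0, 0 },
};

/* Walk the table until the zero terminator; true on the first match. */
static bool match_cpu_sketch(const struct cpu_id_sketch *table,
			     unsigned int family, unsigned int model)
{
	const struct cpu_id_sketch *id;

	for (id = table; id->family || id->model; id++) {
		if (id->family == family && id->model == model)
			return true;
	}
	return false;
}

/* Hypothetical bug flag, set once while initializing the boot CPU. */
static bool bug_tdx_pw_mce;

void bsp_init_sketch(unsigned int family, unsigned int model)
{
	bug_tdx_pw_mce = match_cpu_sketch(pw_mce_ids, family, model);
}

/* A later consumer (for example an initcall) only reads the flag. */
bool has_pw_mce_bug(void)
{
	return bug_tdx_pw_mce;
}
```

Since the flag is written during boot-CPU setup, every later consumer sees a settled value regardless of how initcalls at the same level happen to be ordered.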
^ permalink raw reply [flat|nested] 144+ messages in thread
* Re: [PATCH v11 18/20] x86: Handle TDX erratum to reset TDX private memory during kexec() and reboot
2023-06-12 13:47 ` Dave Hansen
2023-06-13 0:51 ` Huang, Kai
@ 2023-06-19 11:43 ` Huang, Kai
2023-06-19 14:31 ` Dave Hansen
1 sibling, 1 reply; 144+ messages in thread
From: Huang, Kai @ 2023-06-19 11:43 UTC (permalink / raw)
To: kirill.shutemov, Hansen, Dave
Cc: kvm, Luck, Tony, david, bagasdotme, ak, Wysocki, Rafael J,
linux-kernel, Chatre, Reinette, Christopherson,,
Sean, pbonzini, tglx, linux-mm, Yamahata, Isaku, peterz, Shahar,
Sagi, imammedo, Gao, Chao, Brown, Len,
sathyanarayanan.kuppuswamy, Huang, Ying, Williams, Dan J
On Mon, 2023-06-12 at 06:47 -0700, Dave Hansen wrote:
> On 6/12/23 03:27, Huang, Kai wrote:
> > So I think a __mb() after setting tdmr->pamt_4k_base should be good enough, as
> > it guarantees when setting to any pamt_*_size happens, the valid pamt_4k_base
> > will be seen by other cpus.
> >
> > Does it make sense?
>
> Just use a normal old atomic_t or set_bit()/test_bit(). They have
> built-in memory barriers and are less likely to get botched.
Hi Dave,
Using atomic_set() requires changing tdmr->pamt_4k_base to atomic_t, which is a
little bit silly or overkill IMHO. Looking at the code, it seems
arch_atomic_set() simply uses __WRITE_ONCE():
static __always_inline void arch_atomic_set(atomic_t *v, int i)
{
	__WRITE_ONCE(v->counter, i);
}
Is it better to just use __WRITE_ONCE() or WRITE_ONCE() here?
- tdmr->pamt_4k_base = pamt_base[TDX_PS_4K];
+ WRITE_ONCE(tdmr->pamt_4k_base, pamt_base[TDX_PS_4K]);
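For reference, the ordering being discussed (the base must be visible before any size is) can be illustrated with C11 atomics in userspace. This is an analogy rather than kernel code: `struct tdmr_sketch` and both functions are hypothetical, and the release/acquire pair stands in for whatever barrier-backed primitive the kernel code ends up using. Note that a volatile-style store such as WRITE_ONCE() does not by itself imply a hardware memory barrier.

```c
#include <stdatomic.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical, simplified stand-in for the PAMT fields of a TDMR. */
struct tdmr_sketch {
	uint64_t pamt_base;		/* plain data, written first */
	_Atomic uint64_t pamt_size;	/* non-zero means "base is valid" */
};

/* Writer: fill in the base, then publish it by storing the size. */
void publish_pamt(struct tdmr_sketch *t, uint64_t base, uint64_t size)
{
	t->pamt_base = base;
	/* Release: the base store above cannot move after this store. */
	atomic_store_explicit(&t->pamt_size, size, memory_order_release);
}

/* Reader: returns true and fills base and size only once published. */
bool read_pamt(struct tdmr_sketch *t, uint64_t *base, uint64_t *size)
{
	/* Acquire: pairs with the release store in publish_pamt(). */
	uint64_t sz = atomic_load_explicit(&t->pamt_size,
					   memory_order_acquire);

	if (!sz)
		return false;
	*base = t->pamt_base;
	*size = sz;
	return true;
}
```

A reader that observes a non-zero size is then guaranteed to observe the matching base, which is the property the reset path needs.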
^ permalink raw reply [flat|nested] 144+ messages in thread
* Re: [PATCH v11 02/20] x86/virt/tdx: Detect TDX during kernel boot
[not found] ` <af4e428ab1245e9441031438e606c14472daf927.1685887183.git.kai.huang@intel.com>
[not found] ` <a2da8af2-41a9-a0cf-dbe9-7f0a14bf05fe@linux.intel.com>
2023-06-06 23:44 ` Isaku Yamahata
@ 2023-06-19 12:12 ` David Hildenbrand
2023-06-19 23:58 ` Huang, Kai
2 siblings, 1 reply; 144+ messages in thread
From: David Hildenbrand @ 2023-06-19 12:12 UTC (permalink / raw)
To: Kai Huang, linux-kernel, kvm
Cc: linux-mm, dave.hansen, kirill.shutemov, tony.luck, peterz, tglx,
seanjc, pbonzini, dan.j.williams, rafael.j.wysocki, ying.huang,
reinette.chatre, len.brown, ak, isaku.yamahata, chao.gao,
sathyanarayanan.kuppuswamy, bagasdotme, sagis, imammedo
On 04.06.23 16:27, Kai Huang wrote:
> Intel Trust Domain Extensions (TDX) protects guest VMs from malicious
> host and certain physical attacks. A CPU-attested software module
> called 'the TDX module' runs inside a new isolated memory range as a
> trusted hypervisor to manage and run protected VMs.
>
> Pre-TDX Intel hardware has support for a memory encryption architecture
> called MKTME. The memory encryption hardware underpinning MKTME is also
> used for Intel TDX. TDX ends up "stealing" some of the physical address
> space from the MKTME architecture for crypto-protection to VMs. The
> BIOS is responsible for partitioning the "KeyID" space between legacy
> MKTME and TDX. The KeyIDs reserved for TDX are called 'TDX private
> KeyIDs' or 'TDX KeyIDs' for short.
>
> TDX doesn't trust the BIOS. During machine boot, TDX verifies the TDX
> private KeyIDs are consistently and correctly programmed by the BIOS
> across all CPU packages before it enables TDX on any CPU core. A valid
> TDX private KeyID range on BSP indicates TDX has been enabled by the
> BIOS, otherwise the BIOS is buggy.
>
> The TDX module is expected to be loaded by the BIOS when it enables TDX,
> but the kernel needs to properly initialize it before it can be used to
> create and run any TDX guests. The TDX module will be initialized by
> the KVM subsystem when KVM wants to use TDX.
>
> Add a new early_initcall(tdx_init) to detect the TDX by detecting TDX
> private KeyIDs. Also add a function to report whether TDX is enabled by
> the BIOS. Similar to AMD SME, kexec() will use it to determine whether
> cache flush is needed.
>
> The TDX module itself requires one TDX KeyID as the 'TDX global KeyID'
> to protect its metadata. Each TDX guest also needs a TDX KeyID for its
> own protection. Just use the first TDX KeyID as the global KeyID and
> leave the rest for TDX guests. If no TDX KeyID is left for TDX guests,
> disable TDX as initializing the TDX module alone is useless.
>
> To start to support TDX, create a new arch/x86/virt/vmx/tdx/tdx.c for
> TDX host kernel support. Add a new Kconfig option CONFIG_INTEL_TDX_HOST
> to opt-in TDX host kernel support (to distinguish with TDX guest kernel
> support). So far only KVM uses TDX. Make the new config option depend
> on KVM_INTEL.
>
> Signed-off-by: Kai Huang <kai.huang@intel.com>
> Reviewed-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
> ---
>
> v10 -> v11 (David):
> - "host kernel" -> "the host kernel"
> - "protected VM" -> "confidential VM".
> - Moved setting tdx_global_keyid to the end of tdx_init().
>
> v9 -> v10:
> - No change.
>
> v8 -> v9:
> - Moved MSR macro from local tdx.h to <asm/msr-index.h> (Dave).
> - Moved reserving the TDX global KeyID from later patch to here.
> - Changed 'tdx_keyid_start' and 'nr_tdx_keyids' to
> 'tdx_guest_keyid_start' and 'tdx_nr_guest_keyids' to represent KeyIDs
> can be used by guest. (Dave)
> - Slight changelog update according to above changes.
>
> v7 -> v8: (address Dave's comments)
> - Improved changelog:
> - "KVM user" -> "The TDX module will be initialized by KVM when ..."
> - Changed "tdx_int" part to "Just say what this patch is doing"
> - Fixed the last sentence of "kexec()" paragraph
> - detect_tdx() -> record_keyid_partitioning()
> - Improved how to calculate tdx_keyid_start.
> - tdx_keyid_num -> nr_tdx_keyids.
> - Improved dmesg printing.
> - Add comment to clear_tdx().
>
> v6 -> v7:
> - No change.
>
> v5 -> v6:
> - Removed SEAMRR detection to make code simpler.
> - Removed the 'default N' in the KVM_TDX_HOST Kconfig (Kirill).
> - Changed to use 'obj-y' in arch/x86/virt/vmx/tdx/Makefile (Kirill).
>
>
> ---
> arch/x86/Kconfig | 12 +++++
> arch/x86/Makefile | 2 +
> arch/x86/include/asm/msr-index.h | 3 ++
> arch/x86/include/asm/tdx.h | 7 +++
> arch/x86/virt/Makefile | 2 +
> arch/x86/virt/vmx/Makefile | 2 +
> arch/x86/virt/vmx/tdx/Makefile | 2 +
> arch/x86/virt/vmx/tdx/tdx.c | 92 ++++++++++++++++++++++++++++++++
> 8 files changed, 122 insertions(+)
> create mode 100644 arch/x86/virt/Makefile
> create mode 100644 arch/x86/virt/vmx/Makefile
> create mode 100644 arch/x86/virt/vmx/tdx/Makefile
> create mode 100644 arch/x86/virt/vmx/tdx/tdx.c
>
> diff --git a/arch/x86/Kconfig b/arch/x86/Kconfig
> index 53bab123a8ee..191587f75810 100644
> --- a/arch/x86/Kconfig
> +++ b/arch/x86/Kconfig
> @@ -1952,6 +1952,18 @@ config X86_SGX
>
> If unsure, say N.
>
> +config INTEL_TDX_HOST
> + bool "Intel Trust Domain Extensions (TDX) host support"
> + depends on CPU_SUP_INTEL
> + depends on X86_64
> + depends on KVM_INTEL
> + help
> + Intel Trust Domain Extensions (TDX) protects guest VMs from malicious
> + host and certain physical attacks. This option enables necessary TDX
> + support in the host kernel to run confidential VMs.
> +
> + If unsure, say N.
> +
> config EFI
> bool "EFI runtime service support"
> depends on ACPI
> diff --git a/arch/x86/Makefile b/arch/x86/Makefile
> index b39975977c03..ec0e71d8fa30 100644
> --- a/arch/x86/Makefile
> +++ b/arch/x86/Makefile
> @@ -252,6 +252,8 @@ archheaders:
>
> libs-y += arch/x86/lib/
>
> +core-y += arch/x86/virt/
> +
> # drivers-y are linked after core-y
> drivers-$(CONFIG_MATH_EMULATION) += arch/x86/math-emu/
> drivers-$(CONFIG_PCI) += arch/x86/pci/
> diff --git a/arch/x86/include/asm/msr-index.h b/arch/x86/include/asm/msr-index.h
> index 3aedae61af4f..6d8f15b1552c 100644
> --- a/arch/x86/include/asm/msr-index.h
> +++ b/arch/x86/include/asm/msr-index.h
> @@ -523,6 +523,9 @@
> #define MSR_RELOAD_PMC0 0x000014c1
> #define MSR_RELOAD_FIXED_CTR0 0x00001309
>
> +/* KeyID partitioning between MKTME and TDX */
> +#define MSR_IA32_MKTME_KEYID_PARTITIONING 0x00000087
> +
> /*
> * AMD64 MSRs. Not complete. See the architecture manual for a more
> * complete list.
> diff --git a/arch/x86/include/asm/tdx.h b/arch/x86/include/asm/tdx.h
> index 25fd6070dc0b..4dfe2e794411 100644
> --- a/arch/x86/include/asm/tdx.h
> +++ b/arch/x86/include/asm/tdx.h
> @@ -94,5 +94,12 @@ static inline long tdx_kvm_hypercall(unsigned int nr, unsigned long p1,
> return -ENODEV;
> }
> #endif /* CONFIG_INTEL_TDX_GUEST && CONFIG_KVM_GUEST */
> +
> +#ifdef CONFIG_INTEL_TDX_HOST
> +bool platform_tdx_enabled(void);
> +#else /* !CONFIG_INTEL_TDX_HOST */
> +static inline bool platform_tdx_enabled(void) { return false; }
> +#endif /* CONFIG_INTEL_TDX_HOST */
> +
> #endif /* !__ASSEMBLY__ */
> #endif /* _ASM_X86_TDX_H */
> diff --git a/arch/x86/virt/Makefile b/arch/x86/virt/Makefile
> new file mode 100644
> index 000000000000..1e36502cd738
> --- /dev/null
> +++ b/arch/x86/virt/Makefile
> @@ -0,0 +1,2 @@
> +# SPDX-License-Identifier: GPL-2.0-only
> +obj-y += vmx/
> diff --git a/arch/x86/virt/vmx/Makefile b/arch/x86/virt/vmx/Makefile
> new file mode 100644
> index 000000000000..feebda21d793
> --- /dev/null
> +++ b/arch/x86/virt/vmx/Makefile
> @@ -0,0 +1,2 @@
> +# SPDX-License-Identifier: GPL-2.0-only
> +obj-$(CONFIG_INTEL_TDX_HOST) += tdx/
> diff --git a/arch/x86/virt/vmx/tdx/Makefile b/arch/x86/virt/vmx/tdx/Makefile
> new file mode 100644
> index 000000000000..93ca8b73e1f1
> --- /dev/null
> +++ b/arch/x86/virt/vmx/tdx/Makefile
> @@ -0,0 +1,2 @@
> +# SPDX-License-Identifier: GPL-2.0-only
> +obj-y += tdx.o
> diff --git a/arch/x86/virt/vmx/tdx/tdx.c b/arch/x86/virt/vmx/tdx/tdx.c
> new file mode 100644
> index 000000000000..2d91e7120c90
> --- /dev/null
> +++ b/arch/x86/virt/vmx/tdx/tdx.c
> @@ -0,0 +1,92 @@
> +// SPDX-License-Identifier: GPL-2.0
> +/*
> + * Copyright(c) 2023 Intel Corporation.
> + *
> + * Intel Trust Domain Extensions (TDX) support
> + */
> +
> +#define pr_fmt(fmt) "tdx: " fmt
> +
> +#include <linux/types.h>
> +#include <linux/cache.h>
> +#include <linux/init.h>
> +#include <linux/errno.h>
> +#include <linux/printk.h>
> +#include <asm/msr-index.h>
> +#include <asm/msr.h>
> +#include <asm/tdx.h>
> +
> +static u32 tdx_global_keyid __ro_after_init;
> +static u32 tdx_guest_keyid_start __ro_after_init;
> +static u32 tdx_nr_guest_keyids __ro_after_init;
> +
> +static int __init record_keyid_partitioning(u32 *tdx_keyid_start,
> + u32 *nr_tdx_keyids)
> +{
> + u32 _nr_mktme_keyids, _tdx_keyid_start, _nr_tdx_keyids;
> + int ret;
> +
> + /*
> + * IA32_MKTME_KEYID_PARTITIONING:
> + * Bit [31:0]: Number of MKTME KeyIDs.
> + * Bit [63:32]: Number of TDX private KeyIDs.
> + */
> + ret = rdmsr_safe(MSR_IA32_MKTME_KEYID_PARTITIONING, &_nr_mktme_keyids,
> + &_nr_tdx_keyids);
> + if (ret)
> + return -ENODEV;
> +
> + if (!_nr_tdx_keyids)
> + return -ENODEV;
> +
> + /* TDX KeyIDs start after the last MKTME KeyID. */
> + _tdx_keyid_start = _nr_mktme_keyids + 1;
> +
> + *tdx_keyid_start = _tdx_keyid_start;
> + *nr_tdx_keyids = _nr_tdx_keyids;
> +
> + return 0;
> +}
> +
> +static int __init tdx_init(void)
> +{
> + u32 tdx_keyid_start, nr_tdx_keyids;
> + int err;
> +
> + err = record_keyid_partitioning(&tdx_keyid_start, &nr_tdx_keyids);
> + if (err)
> + return err;
> +
> + pr_info("BIOS enabled: private KeyID range [%u, %u)\n",
> + tdx_keyid_start, tdx_keyid_start + nr_tdx_keyids);
> +
> + /*
> + * The TDX module itself requires one 'global KeyID' to protect
> + * its metadata. If there's only one TDX KeyID, there won't be
> + * any left for TDX guests thus there's no point to enable TDX
> + * at all.
> + */
> + if (nr_tdx_keyids < 2) {
> + pr_info("initialization failed: too few private KeyIDs available.\n");
> + goto no_tdx;
> + }
> +
> + /*
> + * Just use the first TDX KeyID as the 'global KeyID' and
> + * leave the rest for TDX guests.
> + */
> + tdx_global_keyid = tdx_keyid_start;
> + tdx_guest_keyid_start = ++tdx_keyid_start;
> + tdx_nr_guest_keyids = --nr_tdx_keyids;
tdx_guest_keyid_start = tdx_keyid_start + 1;
tdx_nr_guest_keyids = nr_tdx_keyids - 1;
Easier to follow, because the modified values are unused afterwards.
I'd probably avoid the "tdx" terminology in the local variables
("keyid_start", "nr_keyids") to give a better hint what the global
variables are (tdx_*), but that's just a personal preference.
Apart from that,
Reviewed-by: David Hildenbrand <david@redhat.com>
--
Cheers,
David / dhildenb
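The KeyID arithmetic from the patch can be written as a small pure function, which makes the "+ 1" and "- 1" form suggested above easy to check. This is a userspace sketch under stated assumptions: `struct keyid_layout` and `split_keyids()` are hypothetical names, and the raw MSR value is passed in as a plain 64-bit integer instead of being read with rdmsr_safe().

```c
#include <stdint.h>

/* Hypothetical result type; field names mirror the globals in the patch. */
struct keyid_layout {
	uint32_t global_keyid;
	uint32_t guest_keyid_start;
	uint32_t nr_guest_keyids;
};

/*
 * Split a raw KeyID-partitioning MSR value as the quoted comment
 * describes: bits [31:0] hold the number of MKTME KeyIDs and bits
 * [63:32] the number of TDX private KeyIDs. Returns 0 on success and
 * -1 when fewer than two TDX KeyIDs exist (one is needed for the
 * global KeyID and at least one for a guest).
 */
int split_keyids(uint64_t msr, struct keyid_layout *out)
{
	uint32_t nr_mktme = (uint32_t)msr;
	uint32_t nr_tdx = (uint32_t)(msr >> 32);
	/* TDX KeyIDs start after the last MKTME KeyID. */
	uint32_t keyid_start = nr_mktme + 1;

	if (nr_tdx < 2)
		return -1;

	out->global_keyid = keyid_start;
	out->guest_keyid_start = keyid_start + 1;
	out->nr_guest_keyids = nr_tdx - 1;
	return 0;
}
```

For example, with 63 MKTME KeyIDs and 64 TDX KeyIDs, the global KeyID is 64, guest KeyIDs start at 65, and 63 of them remain for guests.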
^ permalink raw reply [flat|nested] 144+ messages in thread
* Re: [PATCH v11 04/20] x86/cpu: Detect TDX partial write machine check erratum
[not found] ` <86f2a8814240f4bbe850f6a09fc9d0b934979d1b.1685887183.git.kai.huang@intel.com>
[not found] ` <20230606123821.exit7gyxs42dxotz@box.shutemov.name>
2023-06-07 14:15 ` Dave Hansen
@ 2023-06-19 12:21 ` David Hildenbrand
2023-06-20 10:31 ` Huang, Kai
2023-06-20 15:39 ` Dave Hansen
2 siblings, 2 replies; 144+ messages in thread
From: David Hildenbrand @ 2023-06-19 12:21 UTC (permalink / raw)
To: Kai Huang, linux-kernel, kvm
Cc: linux-mm, dave.hansen, kirill.shutemov, tony.luck, peterz, tglx,
seanjc, pbonzini, dan.j.williams, rafael.j.wysocki, ying.huang,
reinette.chatre, len.brown, ak, isaku.yamahata, chao.gao,
sathyanarayanan.kuppuswamy, bagasdotme, sagis, imammedo
On 04.06.23 16:27, Kai Huang wrote:
> TDX memory has integrity and confidentiality protections. Violations of
> this integrity protection are supposed to only affect TDX operations and
> are never supposed to affect the host kernel itself. In other words,
> the host kernel should never, itself, see machine checks induced by the
> TDX integrity hardware.
>
> Alas, the first few generations of TDX hardware have an erratum. A
> "partial" write to a TDX private memory cacheline will silently "poison"
> the line. Subsequent reads will consume the poison and generate a
> machine check. According to the TDX hardware spec, neither of these
> things should have happened.
>
> Virtually all kernel memory access operations happen in full
> cachelines. In practice, writing a "byte" of memory usually reads a 64
> byte cacheline of memory, modifies it, then writes the whole line back.
> Those operations do not trigger this problem.
So, ordinary writes to TD private memory are not a problem? I thought
one motivation for the unmapped-guest-memory discussion was to prevent
host (userspace) writes to such memory because it would trigger a MC and
eventually crash the host.
I recall that this would happen easily (not just in some weird "partial"
case), and that the spec would allow for it.
1) Does that, in general, not happen anymore (was the hardware fixed?)?
2) Will new hardware prevent/"fix" that completely (was the spec updated?)?
... or was my understanding wrong?
Thanks!
>
> This problem is triggered by "partial" writes where a write transaction
> of less than cacheline lands at the memory controller. The CPU does
> these via non-temporal write instructions (like MOVNTI), or through
> UC/WC memory mappings. The issue can also be triggered away from the
> CPU by devices doing partial writes via DMA.
>
> With this erratum, there are additional things that need to be done around the
> machine check handler and kexec(), etc. Similar to other CPU bugs, use
> a CPU bug bit to indicate this erratum, and detect this erratum during
> early boot. Note this bug reflects the hardware thus it is detected
> regardless of whether the kernel is built with TDX support or not.
>
> Signed-off-by: Kai Huang <kai.huang@intel.com>
> ---
>
> v10 -> v11:
> - New patch
>
> ---
> arch/x86/include/asm/cpufeatures.h | 1 +
> arch/x86/kernel/cpu/intel.c | 21 +++++++++++++++++++++
> 2 files changed, 22 insertions(+)
>
> diff --git a/arch/x86/include/asm/cpufeatures.h b/arch/x86/include/asm/cpufeatures.h
> index cb8ca46213be..dc8701f8d88b 100644
> --- a/arch/x86/include/asm/cpufeatures.h
> +++ b/arch/x86/include/asm/cpufeatures.h
> @@ -483,5 +483,6 @@
> #define X86_BUG_RETBLEED X86_BUG(27) /* CPU is affected by RETBleed */
> #define X86_BUG_EIBRS_PBRSB X86_BUG(28) /* EIBRS is vulnerable to Post Barrier RSB Predictions */
> #define X86_BUG_SMT_RSB X86_BUG(29) /* CPU is vulnerable to Cross-Thread Return Address Predictions */
> +#define X86_BUG_TDX_PW_MCE X86_BUG(30) /* CPU may incur #MC if non-TD software does partial write to TDX private memory */
>
> #endif /* _ASM_X86_CPUFEATURES_H */
> diff --git a/arch/x86/kernel/cpu/intel.c b/arch/x86/kernel/cpu/intel.c
> index 1c4639588ff9..251b333e53d2 100644
> --- a/arch/x86/kernel/cpu/intel.c
> +++ b/arch/x86/kernel/cpu/intel.c
> @@ -1552,3 +1552,24 @@ u8 get_this_hybrid_cpu_type(void)
>
> return cpuid_eax(0x0000001a) >> X86_HYBRID_CPU_TYPE_ID_SHIFT;
> }
> +
> +/*
> + * These CPUs have an erratum. A partial write from non-TD
> + * software (e.g. via MOVNTI variants or UC/WC mapping) to TDX
> + * private memory poisons that memory, and a subsequent read of
> + * that memory triggers #MC.
> + */
> +static const struct x86_cpu_id tdx_pw_mce_cpu_ids[] __initconst = {
> + X86_MATCH_INTEL_FAM6_MODEL(SAPPHIRERAPIDS_X, NULL),
> + X86_MATCH_INTEL_FAM6_MODEL(EMERALDRAPIDS_X, NULL),
> + { }
> +};
> +
> +static int __init tdx_erratum_detect(void)
> +{
> + if (x86_match_cpu(tdx_pw_mce_cpu_ids))
> + setup_force_cpu_bug(X86_BUG_TDX_PW_MCE);
> +
> + return 0;
> +}
> +early_initcall(tdx_erratum_detect);
--
Cheers,
David / dhildenb
^ permalink raw reply [flat|nested] 144+ messages in thread
* Re: [PATCH v11 05/20] x86/virt/tdx: Add SEAMCALL infrastructure
[not found] ` <ec640452a4385d61bec97f8b761ed1ff38898504.1685887183.git.kai.huang@intel.com>
2023-06-06 23:55 ` [PATCH v11 05/20] x86/virt/tdx: Add SEAMCALL infrastructure Isaku Yamahata
2023-06-07 14:24 ` Dave Hansen
@ 2023-06-19 12:52 ` David Hildenbrand
2023-06-20 10:37 ` Huang, Kai
2023-06-20 15:15 ` Dave Hansen
2 siblings, 2 replies; 144+ messages in thread
From: David Hildenbrand @ 2023-06-19 12:52 UTC (permalink / raw)
To: Kai Huang, linux-kernel, kvm
Cc: linux-mm, dave.hansen, kirill.shutemov, tony.luck, peterz, tglx,
seanjc, pbonzini, dan.j.williams, rafael.j.wysocki, ying.huang,
reinette.chatre, len.brown, ak, isaku.yamahata, chao.gao,
sathyanarayanan.kuppuswamy, bagasdotme, sagis, imammedo
On 04.06.23 16:27, Kai Huang wrote:
> TDX introduces a new CPU mode: Secure Arbitration Mode (SEAM). This
> mode runs only the TDX module itself or other code to load the TDX
> module.
>
> The host kernel communicates with SEAM software via a new SEAMCALL
> instruction. This is conceptually similar to a guest->host hypercall,
> except it is made from the host to SEAM software instead. The TDX
> module establishes a new SEAMCALL ABI which allows the host to
> initialize the module and to manage VMs.
>
> Add infrastructure to make SEAMCALLs. The SEAMCALL ABI is very similar
> to the TDCALL ABI and leverages much TDCALL infrastructure.
>
> SEAMCALL instruction causes #GP when TDX isn't BIOS enabled, and #UD
> when CPU is not in VMX operation. Currently, only KVM code mucks with
> VMX enabling, and KVM is the only user of TDX. This implementation
> chooses to make KVM itself responsible for enabling VMX before using
> TDX and let the rest of the kernel stay blissfully unaware of VMX.
>
> The current TDX_MODULE_CALL macro handles neither #GP nor #UD. The
> kernel would hit Oops if SEAMCALL were mistakenly made w/o enabling VMX
> first. Architecturally, there is no CPU flag to check whether the CPU
> is in VMX operation. Also, if a BIOS were buggy, it could still report
> valid TDX private KeyIDs when TDX actually couldn't be enabled.
>
> Extend the TDX_MODULE_CALL macro to handle #UD and #GP to return error
> codes. Introduce two new TDX error codes for them respectively so the
> caller can distinguish.
>
> Also add a wrapper function of SEAMCALL to convert SEAMCALL error code
> to the kernel error code, and print out SEAMCALL error code to help the
> user to understand what went wrong.
>
> Signed-off-by: Kai Huang <kai.huang@intel.com>
> ---
I agree with Dave that a buggy bios is not a good motivation for this
patch. The real strength of this infrastructure IMHO is central error
handling and expressive error messages. Maybe it makes some corner cases
(reboot -f) easier to handle. That would make a better justification
than buggy bios -- and should be spelled out in the patch description.
[...]
> +/*
> + * Wrapper of __seamcall() to convert SEAMCALL leaf function error code
> + * to kernel error code. @seamcall_ret and @out contain the SEAMCALL
> + * leaf function return code and the additional output respectively if
> + * not NULL.
> + */
> +static int __always_unused seamcall(u64 fn, u64 rcx, u64 rdx, u64 r8, u64 r9,
> + u64 *seamcall_ret,
> + struct tdx_module_output *out)
> +{
> + int cpu, ret = 0;
> + u64 sret;
> +
> + /* Need a stable CPU id for printing error message */
> + cpu = get_cpu();
> +
> + sret = __seamcall(fn, rcx, rdx, r8, r9, out);
> +
Why not
cpu = get_cpu();
sret = __seamcall(fn, rcx, rdx, r8, r9, out);
put_cpu();
> + /* Save SEAMCALL return code if the caller wants it */
> + if (seamcall_ret)
> + *seamcall_ret = sret;
> +
> + /* SEAMCALL was successful */
> + if (!sret)
> + goto out;
Why not move that into the switch statement below to avoid the goto?
If you do the put_cpu() early, you can avoid "ret" as well.
	switch (sret) {
	case 0:
		/* SEAMCALL was successful */
		return 0;
	case TDX_SEAMCALL_GP:
		pr_err_once("[firmware bug]: TDX is not enabled by BIOS.\n");
		return -ENODEV;
	...
	}
[...]
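Put together, the shape suggested above might look like the following userspace sketch. It is an illustration only: the status values are placeholders rather than the real TDX encodings, `get_cpu_sketch()`/`put_cpu_sketch()` merely count nesting, the raw call is injected as a function pointer, and the errno choices other than -ENODEV for the #GP case are assumptions.

```c
#include <errno.h>
#include <stdint.h>

/* Illustrative placeholder status values, not the real TDX encodings. */
#define SKETCH_SEAMCALL_GP	0xF00000000000000DULL
#define SKETCH_SEAMCALL_UD	0xF000000000000006ULL

/* Hypothetical stand-ins for get_cpu()/put_cpu(): track nesting only. */
static int preempt_depth;
static int get_cpu_sketch(void) { preempt_depth++; return 0; }
static void put_cpu_sketch(void) { preempt_depth--; }

/* Raw call is injected so the wrapper can be exercised in userspace. */
typedef uint64_t (*raw_call_t)(uint64_t fn);

/* Demo raw call: simply echoes its argument back as the status. */
static uint64_t fake_raw(uint64_t fn)
{
	return fn;
}

int seamcall_sketch(raw_call_t raw, uint64_t fn, uint64_t *seamcall_ret)
{
	uint64_t sret;
	int cpu;

	/* Pin only around the call; the CPU id is kept for messages. */
	cpu = get_cpu_sketch();
	sret = raw(fn);
	put_cpu_sketch();
	(void)cpu;

	/* Save the raw status if the caller wants it. */
	if (seamcall_ret)
		*seamcall_ret = sret;

	switch (sret) {
	case 0:
		return 0;		/* call succeeded */
	case SKETCH_SEAMCALL_GP:
		return -ENODEV;		/* not enabled by firmware */
	case SKETCH_SEAMCALL_UD:
		return -EINVAL;		/* assumed errno for this case */
	default:
		return -EIO;		/* assumed errno for other failures */
	}
}
```

With the unpin done right after the call, every case can return directly, so neither a `ret` variable nor an `out:` label is needed.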
> +
> static int __init record_keyid_partitioning(u32 *tdx_keyid_start,
> u32 *nr_tdx_keyids)
> {
> diff --git a/arch/x86/virt/vmx/tdx/tdx.h b/arch/x86/virt/vmx/tdx/tdx.h
> new file mode 100644
> index 000000000000..48ad1a1ba737
> --- /dev/null
> +++ b/arch/x86/virt/vmx/tdx/tdx.h
> @@ -0,0 +1,10 @@
> +/* SPDX-License-Identifier: GPL-2.0 */
> +#ifndef _X86_VIRT_TDX_H
> +#define _X86_VIRT_TDX_H
> +
> +#include <linux/types.h>
> +
> +struct tdx_module_output;
> +u64 __seamcall(u64 fn, u64 rcx, u64 rdx, u64 r8, u64 r9,
> + struct tdx_module_output *out);
> +#endif
> diff --git a/arch/x86/virt/vmx/tdx/tdxcall.S b/arch/x86/virt/vmx/tdx/tdxcall.S
> index 49a54356ae99..757b0c34be10 100644
> --- a/arch/x86/virt/vmx/tdx/tdxcall.S
> +++ b/arch/x86/virt/vmx/tdx/tdxcall.S
> @@ -1,6 +1,7 @@
> /* SPDX-License-Identifier: GPL-2.0 */
> #include <asm/asm-offsets.h>
> #include <asm/tdx.h>
> +#include <asm/asm.h>
>
> /*
> * TDCALL and SEAMCALL are supported in Binutils >= 2.36.
> @@ -45,6 +46,7 @@
> /* Leave input param 2 in RDX */
>
> .if \host
> +1:
> seamcall
> /*
> * SEAMCALL instruction is essentially a VMExit from VMX root
> @@ -57,10 +59,23 @@
> * This value will never be used as actual SEAMCALL error code as
> * it is from the Reserved status code class.
> */
> - jnc .Lno_vmfailinvalid
> + jnc .Lseamcall_out
> mov $TDX_SEAMCALL_VMFAILINVALID, %rax
> -.Lno_vmfailinvalid:
> + jmp .Lseamcall_out
> +2:
> + /*
> + * SEAMCALL caused #GP or #UD. By reaching here %eax contains
> + * the trap number. Convert the trap number to the TDX error
> + * code by setting TDX_SW_ERROR to the high 32-bits of %rax.
> + *
> + * Note cannot OR TDX_SW_ERROR directly to %rax as OR instruction
> + * only accepts 32-bit immediate at most.
Not sure if that comment is really helpful here. It's a common pattern
for large immediates, no?
> + */
> + mov $TDX_SW_ERROR, %r12
> + orq %r12, %rax
>
> + _ASM_EXTABLE_FAULT(1b, 2b)
> +.Lseamcall_out:
> .else
> tdcall
> .endif
--
Cheers,
David / dhildenb
^ permalink raw reply [flat|nested] 144+ messages in thread
* Re: [PATCH v11 06/20] x86/virt/tdx: Handle SEAMCALL running out of entropy error
[not found] ` <9b3582c9f3a81ae68b32d9997fcd20baecb63b9b.1685887183.git.kai.huang@intel.com>
` (3 preceding siblings ...)
2023-06-09 14:42 ` Nikolay Borisov
@ 2023-06-19 13:00 ` David Hildenbrand
2023-06-20 10:39 ` Huang, Kai
4 siblings, 1 reply; 144+ messages in thread
From: David Hildenbrand @ 2023-06-19 13:00 UTC (permalink / raw)
To: Kai Huang, linux-kernel, kvm
Cc: linux-mm, dave.hansen, kirill.shutemov, tony.luck, peterz, tglx,
seanjc, pbonzini, dan.j.williams, rafael.j.wysocki, ying.huang,
reinette.chatre, len.brown, ak, isaku.yamahata, chao.gao,
sathyanarayanan.kuppuswamy, bagasdotme, sagis, imammedo
On 04.06.23 16:27, Kai Huang wrote:
> Certain SEAMCALL leaf functions may return error due to running out of
> entropy, in which case the SEAMCALL should be retried as suggested by
> the TDX spec.
>
> Handle this case in SEAMCALL common function. Mimic the existing
> rdrand_long() to retry RDRAND_RETRY_LOOPS times.
>
> Signed-off-by: Kai Huang <kai.huang@intel.com>
> ---
>
> v10 -> v11:
> - New patch
>
> ---
> arch/x86/virt/vmx/tdx/tdx.c | 15 ++++++++++++++-
> arch/x86/virt/vmx/tdx/tdx.h | 17 +++++++++++++++++
> 2 files changed, 31 insertions(+), 1 deletion(-)
>
> diff --git a/arch/x86/virt/vmx/tdx/tdx.c b/arch/x86/virt/vmx/tdx/tdx.c
> index e82713dd5d54..e62e978eba1b 100644
> --- a/arch/x86/virt/vmx/tdx/tdx.c
> +++ b/arch/x86/virt/vmx/tdx/tdx.c
> @@ -15,6 +15,7 @@
> #include <linux/smp.h>
> #include <asm/msr-index.h>
> #include <asm/msr.h>
> +#include <asm/archrandom.h>
> #include <asm/tdx.h>
> #include "tdx.h"
>
> @@ -33,12 +34,24 @@ static int __always_unused seamcall(u64 fn, u64 rcx, u64 rdx, u64 r8, u64 r9,
> struct tdx_module_output *out)
> {
> int cpu, ret = 0;
> + int retry;
> u64 sret;
>
> /* Need a stable CPU id for printing error message */
> cpu = get_cpu();
>
> - sret = __seamcall(fn, rcx, rdx, r8, r9, out);
> + /*
> + * Certain SEAMCALL leaf functions may return error due to
> + * running out of entropy, in which case the SEAMCALL should
> + * be retried. Handle this in SEAMCALL common function.
> + *
> + * Mimic the existing rdrand_long() to retry
> + * RDRAND_RETRY_LOOPS times.
> + */
> + retry = RDRAND_RETRY_LOOPS;
Nit: I'd just do an "int retry = RDRAND_RETRY_LOOPS" at the declaration and
simplify this comment to "Mimic rdrand_long() retry behavior."
> + do {
> + sret = __seamcall(fn, rcx, rdx, r8, r9, out);
> + } while (sret == TDX_RND_NO_ENTROPY && --retry);
>
> /* Save SEAMCALL return code if the caller wants it */
> if (seamcall_ret)
> diff --git a/arch/x86/virt/vmx/tdx/tdx.h b/arch/x86/virt/vmx/tdx/tdx.h
> index 48ad1a1ba737..55dbb1b8c971 100644
> --- a/arch/x86/virt/vmx/tdx/tdx.h
> +++ b/arch/x86/virt/vmx/tdx/tdx.h
> @@ -4,6 +4,23 @@
>
> #include <linux/types.h>
>
> +/*
> + * This file contains both macros and data structures defined by the TDX
> + * architecture and Linux defined software data structures and functions.
> + * The two should not be mixed together for better readability. The
> + * architectural definitions come first.
> + */
> +
> +/*
> + * TDX SEAMCALL error codes
> + */
> +#define TDX_RND_NO_ENTROPY 0x8000020300000000ULL
> +
> +/*
> + * Do not put any hardware-defined TDX structure representations below
> + * this comment!
> + */
> +
> struct tdx_module_output;
> u64 __seamcall(u64 fn, u64 rcx, u64 rdx, u64 r8, u64 r9,
> struct tdx_module_output *out);
In general, LGTM
Reviewed-by: David Hildenbrand <david@redhat.com>
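For illustration only, here is a minimal user-space sketch of the retry loop
with the nit above applied (counter initialized at its declaration). The
mock_seamcall() helper, the fail_count/calls counters and seamcall_retry()
are hypothetical stand-ins, not part of the patch; RDRAND_RETRY_LOOPS is
assumed to be 10 here, and TDX_RND_NO_ENTROPY is copied from the hunk above.

```c
#include <stdint.h>

#define TDX_RND_NO_ENTROPY	0x8000020300000000ULL
#define RDRAND_RETRY_LOOPS	10	/* assumed value for this sketch */

/* Hypothetical mock: fails 'fail_count' more times, then succeeds. */
static int fail_count;
static int calls;

static uint64_t mock_seamcall(void)
{
	calls++;
	if (fail_count > 0) {
		fail_count--;
		return TDX_RND_NO_ENTROPY;
	}
	return 0;
}

/* Mimic rdrand_long() retry behavior. */
static uint64_t seamcall_retry(void)
{
	int retry = RDRAND_RETRY_LOOPS;
	uint64_t sret;

	do {
		sret = mock_seamcall();
	} while (sret == TDX_RND_NO_ENTROPY && --retry);

	return sret;
}
```

Note that the loop makes at most RDRAND_RETRY_LOOPS calls in total, and that
the last error is returned unchanged if entropy never becomes available.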
--
Cheers,
David / dhildenb
^ permalink raw reply [flat|nested] 144+ messages in thread
* Re: [PATCH v11 07/20] x86/virt/tdx: Add skeleton to enable TDX on demand
[not found] ` <21b3a45cb73b4e1917c1eba75b7769781a15aa14.1685887183.git.kai.huang@intel.com>
2023-06-07 15:22 ` [PATCH v11 07/20] x86/virt/tdx: Add skeleton to enable TDX on demand Dave Hansen
@ 2023-06-19 13:16 ` David Hildenbrand
2023-06-19 23:28 ` Huang, Kai
1 sibling, 1 reply; 144+ messages in thread
From: David Hildenbrand @ 2023-06-19 13:16 UTC (permalink / raw)
To: Kai Huang, linux-kernel, kvm
Cc: linux-mm, dave.hansen, kirill.shutemov, tony.luck, peterz, tglx,
seanjc, pbonzini, dan.j.williams, rafael.j.wysocki, ying.huang,
reinette.chatre, len.brown, ak, isaku.yamahata, chao.gao,
sathyanarayanan.kuppuswamy, bagasdotme, sagis, imammedo
On 04.06.23 16:27, Kai Huang wrote:
> To enable TDX the kernel needs to initialize TDX from two perspectives:
> 1) Do a set of SEAMCALLs to initialize the TDX module to make it ready
> to create and run TDX guests; 2) Do the per-cpu initialization SEAMCALL
> on one logical cpu before the kernel wants to make any other SEAMCALLs
> on that cpu (including those involved during module initialization and
> running TDX guests).
>
> The TDX module can be initialized only once in its lifetime. Instead
> of always initializing it at boot time, this implementation chooses an
> "on demand" approach that defers initializing TDX until there is a real
> need (e.g. when requested by KVM). This approach has the following pros:
>
> 1) It avoids consuming the memory that must be allocated by kernel and
> given to the TDX module as metadata (~1/256th of the TDX-usable memory),
> and also saves the CPU cycles of initializing the TDX module (and the
> metadata) when TDX is not used at all.
>
> 2) The TDX module design allows it to be updated while the system is
> running. The update procedure shares quite a few steps with this "on
> demand" initialization mechanism. The hope is that much of "on demand"
> mechanism can be shared with a future "update" mechanism. A boot-time
> TDX module implementation would not be able to share much code with the
> update mechanism.
>
> 3) Making SEAMCALL requires VMX to be enabled. Currently, only the KVM
> code mucks with VMX enabling. If the TDX module were to be initialized
> separately from KVM (like at boot), the boot code would need to be
> taught how to muck with VMX enabling and KVM would need to be taught how
> to cope with that. Making KVM itself responsible for TDX initialization
> lets the rest of the kernel stay blissfully unaware of VMX.
>
> Similar to module initialization, also make the per-cpu initialization
> "on demand" as it also depends on VMX being enabled.
>
> Add two functions, tdx_enable() and tdx_cpu_enable(), to enable the TDX
> module and enable TDX on local cpu respectively. For now tdx_enable()
> is a placeholder. The TODO list will be pared down as functionality is
> added.
>
> In tdx_enable() use a state machine protected by mutex to make sure the
> initialization will only be done once, as tdx_enable() can be called
> multiple times (i.e. KVM module can be reloaded) and may be called
> concurrently by other kernel components in the future.
>
> The per-cpu initialization on each cpu can only be done once during the
> module's life time. Use a per-cpu variable to track its status to make
> sure it is only done once in tdx_cpu_enable().
>
> Also, a SEAMCALL to do TDX module global initialization must be done
> once on any logical cpu before any per-cpu initialization SEAMCALL. Do
> it inside tdx_cpu_enable() too (if hasn't been done).
>
> tdx_enable() can potentially invoke SEAMCALLs on any online cpus. The
> per-cpu initialization must be done before those SEAMCALLs are invoked
> on some cpu. To keep things simple, in tdx_cpu_enable(), always do the
> per-cpu initialization regardless of whether the TDX module has been
> initialized or not. And in tdx_enable(), don't call tdx_cpu_enable()
> but assume the caller has disabled CPU hotplug, done VMXON and
> tdx_cpu_enable() on all online cpus before calling tdx_enable().
>
> Signed-off-by: Kai Huang <kai.huang@intel.com>
> Reviewed-by: Isaku Yamahata <isaku.yamahata@intel.com>
> ---
>
> v10 -> v11:
> - Return -ENODEV instead of -EINVAL when CONFIG_INTEL_TDX_HOST is off.
> - Return the actual error code for tdx_enable() instead of -EINVAL.
> - Added Isaku's Reviewed-by.
>
> v9 -> v10:
> - Merged the patch to handle per-cpu initialization to this patch to
> tell the story better.
> - Changed how to handle the per-cpu initialization to only provide a
> tdx_cpu_enable() function to let the user of TDX to do it when the
> user wants to run TDX code on a certain cpu.
> - Changed tdx_enable() to not call cpus_read_lock() explicitly, but
> call lockdep_assert_cpus_held() to assume the caller has done that.
> - Improved comments around tdx_enable() and tdx_cpu_enable().
> - Improved changelog to tell the story better accordingly.
>
> v8 -> v9:
> - Removed detailed TODO list in the changelog (Dave).
> - Added back steps to do module global initialization and per-cpu
> initialization in the TODO list comment.
> - Moved the 'enum tdx_module_status_t' from tdx.c to local tdx.h
>
> v7 -> v8:
> - Refined changelog (Dave).
> - Removed "all BIOS-enabled cpus" related code (Peter/Thomas/Dave).
> - Add a "TODO list" comment in init_tdx_module() to list all steps of
> initializing the TDX Module to tell the story (Dave).
> - Made tdx_enable() universally return -EINVAL, and removed nonsense
> comments (Dave).
> - Simplified __tdx_enable() to only handle success or failure.
> - TDX_MODULE_SHUTDOWN -> TDX_MODULE_ERROR
> - Removed TDX_MODULE_NONE (not loaded) as it is not necessary.
> - Improved comments (Dave).
> - Pointed out 'tdx_module_status' is software thing (Dave).
>
> v6 -> v7:
> - No change.
>
> v5 -> v6:
> - Added code to set status to TDX_MODULE_NONE if TDX module is not
> loaded (Chao)
> - Added Chao's Reviewed-by.
> - Improved comments around cpus_read_lock().
>
> - v3->v5 (no feedback on v4):
> - Removed the check that SEAMRR and TDX KeyID have been detected on
> all present cpus.
> - Removed tdx_detect().
> - Added num_online_cpus() to MADT-enabled CPUs check within the CPU
> hotplug lock and return early with error message.
> - Improved dmesg printing for TDX module detection and initialization.
>
>
> ---
> arch/x86/include/asm/tdx.h | 4 +
> arch/x86/virt/vmx/tdx/tdx.c | 179 ++++++++++++++++++++++++++++++++++++
> arch/x86/virt/vmx/tdx/tdx.h | 13 +++
> 3 files changed, 196 insertions(+)
>
> diff --git a/arch/x86/include/asm/tdx.h b/arch/x86/include/asm/tdx.h
> index b489b5b9de5d..03f74851608f 100644
> --- a/arch/x86/include/asm/tdx.h
> +++ b/arch/x86/include/asm/tdx.h
> @@ -102,8 +102,12 @@ static inline long tdx_kvm_hypercall(unsigned int nr, unsigned long p1,
>
> #ifdef CONFIG_INTEL_TDX_HOST
> bool platform_tdx_enabled(void);
> +int tdx_cpu_enable(void);
> +int tdx_enable(void);
> #else /* !CONFIG_INTEL_TDX_HOST */
> static inline bool platform_tdx_enabled(void) { return false; }
> +static inline int tdx_cpu_enable(void) { return -ENODEV; }
> +static inline int tdx_enable(void) { return -ENODEV; }
> #endif /* CONFIG_INTEL_TDX_HOST */
>
> #endif /* !__ASSEMBLY__ */
> diff --git a/arch/x86/virt/vmx/tdx/tdx.c b/arch/x86/virt/vmx/tdx/tdx.c
> index e62e978eba1b..bcf2b2d15a2e 100644
> --- a/arch/x86/virt/vmx/tdx/tdx.c
> +++ b/arch/x86/virt/vmx/tdx/tdx.c
> @@ -13,6 +13,10 @@
> #include <linux/errno.h>
> #include <linux/printk.h>
> #include <linux/smp.h>
> +#include <linux/cpu.h>
> +#include <linux/spinlock.h>
> +#include <linux/percpu-defs.h>
> +#include <linux/mutex.h>
> #include <asm/msr-index.h>
> #include <asm/msr.h>
> #include <asm/archrandom.h>
> @@ -23,6 +27,18 @@ static u32 tdx_global_keyid __ro_after_init;
> static u32 tdx_guest_keyid_start __ro_after_init;
> static u32 tdx_nr_guest_keyids __ro_after_init;
>
> +static unsigned int tdx_global_init_status;
> +static DEFINE_RAW_SPINLOCK(tdx_global_init_lock);
> +#define TDX_GLOBAL_INIT_DONE _BITUL(0)
> +#define TDX_GLOBAL_INIT_FAILED _BITUL(1)
> +
> +static DEFINE_PER_CPU(unsigned int, tdx_lp_init_status);
> +#define TDX_LP_INIT_DONE _BITUL(0)
> +#define TDX_LP_INIT_FAILED _BITUL(1)
I'm curious, why do we have to track three states: uninitialized
(!done), initialized (done + ! failed), permanent error (done + failed).
[besides: why can't you use an enum and share that between global and pcpu?]
Why can't you have a pcpu "bool tdx_lp_initialized" and "bool
tdx_global_initialized"?
I mean, if there was an error during previous initialization, it's not
initialized: you'd try initializing again -- and possibly fail again --
on the next attempt. I doubt that a "try to cache failed status to keep
failing fast" is really required.
Is there any other reason (e.g., second init attempt would set your
computer on fire) why it can't be simpler?
[...]
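To make the question concrete, a rough user-space sketch of the two-bool
variant could look like the following. Everything here is hypothetical: the
per-cpu variable is modelled as a plain array, the SEAMCALL results are
mocked through mock_global_init_result/mock_lp_init_result, and the locking
the real code would need is left out. A failed attempt simply leaves the
flag false, so the next caller retries.

```c
#include <stdbool.h>
#include <errno.h>

#define NR_CPUS_SKETCH	4	/* arbitrary size for this sketch */

static bool tdx_global_initialized;
static bool tdx_lp_initialized[NR_CPUS_SKETCH];	/* stand-in for a per-cpu bool */

/* Hypothetical mocks for the two init SEAMCALL results (0 means success). */
static int mock_global_init_result;
static int mock_lp_init_result;
static int global_init_calls;

static int try_global_init(void)
{
	if (tdx_global_initialized)
		return 0;

	global_init_calls++;
	if (mock_global_init_result)
		return mock_global_init_result;	/* not cached: retried next time */

	tdx_global_initialized = true;
	return 0;
}

static int tdx_cpu_enable_sketch(int cpu)
{
	int ret;

	if (tdx_lp_initialized[cpu])
		return 0;

	/* Global init must succeed once before any per-cpu init. */
	ret = try_global_init();
	if (ret)
		return ret;

	if (mock_lp_init_result)
		return mock_lp_init_result;

	tdx_lp_initialized[cpu] = true;
	return 0;
}
```

The trade-off is the one raised above: a permanent failure is retried on
every call instead of failing fast from a cached state.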
--
Cheers,
David / dhildenb
^ permalink raw reply [flat|nested] 144+ messages in thread
* Re: [PATCH v11 08/20] x86/virt/tdx: Get information about TDX module and TDX-capable memory
[not found] ` <50386eddbb8046b0b222d385e56e8115ed566526.1685887183.git.kai.huang@intel.com>
` (2 preceding siblings ...)
2023-06-09 10:02 ` kirill.shutemov
@ 2023-06-19 13:29 ` David Hildenbrand
2023-06-19 23:51 ` Huang, Kai
3 siblings, 1 reply; 144+ messages in thread
From: David Hildenbrand @ 2023-06-19 13:29 UTC (permalink / raw)
To: Kai Huang, linux-kernel, kvm
Cc: linux-mm, dave.hansen, kirill.shutemov, tony.luck, peterz, tglx,
seanjc, pbonzini, dan.j.williams, rafael.j.wysocki, ying.huang,
reinette.chatre, len.brown, ak, isaku.yamahata, chao.gao,
sathyanarayanan.kuppuswamy, bagasdotme, sagis, imammedo
On 04.06.23 16:27, Kai Huang wrote:
> Start to work through the multiple steps needed to initialize the TDX module.
>
> TDX provides increased levels of memory confidentiality and integrity.
> This requires special hardware support for features like memory
> encryption and storage of memory integrity checksums. Not all memory
> satisfies these requirements.
>
> As a result, TDX introduced the concept of a "Convertible Memory Region"
> (CMR). During boot, the firmware builds a list of all of the memory
> ranges which can provide the TDX security guarantees.
>
> CMRs tell the kernel which memory is TDX compatible. The kernel takes
> CMRs (plus a little more metadata) and constructs "TD Memory Regions"
> (TDMRs). TDMRs let the kernel grant TDX protections to some or all of
> the CMR areas.
>
> The TDX module also reports necessary information to let the kernel
> build TDMRs and run TDX guests in structure 'tdsysinfo_struct'. The
> list of CMRs, along with the TDX module information, is available to
> the kernel by querying the TDX module.
>
> As a preparation to construct TDMRs, get the TDX module information and
> the list of CMRs. Print out CMRs to help user to decode which memory
> regions are TDX convertible.
>
> The 'tdsysinfo_struct' is fairly large (1024 bytes) and contains a lot
> of info about the TDX module. Fully define the entire structure, but
> only use the fields necessary to build the TDMRs and pr_info() some
> basics about the module. The rest of the fields will get used by KVM.
>
> For now both 'tdsysinfo_struct' and CMRs are only used during the module
> initialization. But because they are both relatively big, declare them
> inside the module initialization function but as static variables.
>
> Signed-off-by: Kai Huang <kai.huang@intel.com>
> Reviewed-by: Isaku Yamahata <isaku.yamahata@intel.com>
> ---
[...]
> ---
> arch/x86/virt/vmx/tdx/tdx.c | 67 +++++++++++++++++++++++++++++++++-
> arch/x86/virt/vmx/tdx/tdx.h | 72 +++++++++++++++++++++++++++++++++++++
> 2 files changed, 138 insertions(+), 1 deletion(-)
>
> diff --git a/arch/x86/virt/vmx/tdx/tdx.c b/arch/x86/virt/vmx/tdx/tdx.c
> index bcf2b2d15a2e..9fde0f71dd8b 100644
> --- a/arch/x86/virt/vmx/tdx/tdx.c
> +++ b/arch/x86/virt/vmx/tdx/tdx.c
> @@ -20,6 +20,7 @@
> #include <asm/msr-index.h>
> #include <asm/msr.h>
> #include <asm/archrandom.h>
> +#include <asm/page.h>
> #include <asm/tdx.h>
> #include "tdx.h"
>
> @@ -191,12 +192,76 @@ int tdx_cpu_enable(void)
> }
> EXPORT_SYMBOL_GPL(tdx_cpu_enable);
>
> +static inline bool is_cmr_empty(struct cmr_info *cmr)
> +{
> + return !cmr->size;
> +}
> +
Nit: maybe it's just me, but this function seems unnecessary.
If "!cmr->size" is not expressive, then I don't know why "is_cmr_empty"
should be. Just inline that into the single user.
.. after all the single caller also uses/prints cmr->size ...
> +static void print_cmrs(struct cmr_info *cmr_array, int nr_cmrs)
> +{
> + int i;
> +
> + for (i = 0; i < nr_cmrs; i++) {
> + struct cmr_info *cmr = &cmr_array[i];
> +
> + /*
> + * The array of CMRs reported via TDH.SYS.INFO can
> + * contain tail empty CMRs. Don't print them.
> + */
> + if (is_cmr_empty(cmr))
> + break;
> +
> + pr_info("CMR: [0x%llx, 0x%llx)\n", cmr->base,
> + cmr->base + cmr->size);
> + }
> +}
> +
> +/*
> + * Get the TDX module information (TDSYSINFO_STRUCT) and the array of
> + * CMRs, and save them to @sysinfo and @cmr_array. @sysinfo must have
> + * been padded to have enough room to save the TDSYSINFO_STRUCT.
> + */
> +static int tdx_get_sysinfo(struct tdsysinfo_struct *sysinfo,
> + struct cmr_info *cmr_array)
> +{
> + struct tdx_module_output out;
> + u64 sysinfo_pa, cmr_array_pa;
> + int ret;
> +
> + sysinfo_pa = __pa(sysinfo);
> + cmr_array_pa = __pa(cmr_array);
> + ret = seamcall(TDH_SYS_INFO, sysinfo_pa, TDSYSINFO_STRUCT_SIZE,
> + cmr_array_pa, MAX_CMRS, NULL, &out);
> + if (ret)
> + return ret;
> +
> + pr_info("TDX module: atributes 0x%x, vendor_id 0x%x, major_version %u, minor_version %u, build_date %u, build_num %u",
"attributes" ?
> + sysinfo->attributes, sysinfo->vendor_id,
> + sysinfo->major_version, sysinfo->minor_version,
> + sysinfo->build_date, sysinfo->build_num);
> +
> + /* R9 contains the actual entries written to the CMR array. */
> + print_cmrs(cmr_array, out.r9);
> +
> + return 0;
> +}
> +
> static int init_tdx_module(void)
> {
> + static DECLARE_PADDED_STRUCT(tdsysinfo_struct, tdsysinfo,
> + TDSYSINFO_STRUCT_SIZE, TDSYSINFO_STRUCT_ALIGNMENT);
> + static struct cmr_info cmr_array[MAX_CMRS]
> + __aligned(CMR_INFO_ARRAY_ALIGNMENT);
> + struct tdsysinfo_struct *sysinfo = &PADDED_STRUCT(tdsysinfo);
> + int ret;
> +
> + ret = tdx_get_sysinfo(sysinfo, cmr_array);
> + if (ret)
> + return ret;
> +
> /*
> * TODO:
> *
> - * - Get TDX module information and TDX-capable memory regions.
> * - Build the list of TDX-usable memory regions.
> * - Construct a list of "TD Memory Regions" (TDMRs) to cover
> * all TDX-usable memory regions.
> diff --git a/arch/x86/virt/vmx/tdx/tdx.h b/arch/x86/virt/vmx/tdx/tdx.h
> index 9fb46033c852..97f4d7e7f1a4 100644
> --- a/arch/x86/virt/vmx/tdx/tdx.h
> +++ b/arch/x86/virt/vmx/tdx/tdx.h
> @@ -3,6 +3,8 @@
> #define _X86_VIRT_TDX_H
>
> #include <linux/types.h>
> +#include <linux/stddef.h>
> +#include <linux/compiler_attributes.h>
>
> /*
> * This file contains both macros and data structures defined by the TDX
> @@ -21,6 +23,76 @@
> */
> #define TDH_SYS_INIT 33
> #define TDH_SYS_LP_INIT 35
> +#define TDH_SYS_INFO 32
> +
> +struct cmr_info {
> + u64 base;
> + u64 size;
> +} __packed;
> +
> +#define MAX_CMRS 32
> +#define CMR_INFO_ARRAY_ALIGNMENT 512
> +
> +struct cpuid_config {
> + u32 leaf;
> + u32 sub_leaf;
> + u32 eax;
> + u32 ebx;
> + u32 ecx;
> + u32 edx;
> +} __packed;
> +
> +#define DECLARE_PADDED_STRUCT(type, name, size, alignment) \
> + struct type##_padded { \
> + union { \
> + struct type name; \
> + u8 padding[size]; \
> + }; \
> + } name##_padded __aligned(alignment)
> +
> +#define PADDED_STRUCT(name) (name##_padded.name)
> +
> +#define TDSYSINFO_STRUCT_SIZE 1024
So, it can never be larger than 1024 bytes? Not even with many cpuid
configs?
> +#define TDSYSINFO_STRUCT_ALIGNMENT 1024
> +
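Regarding the is_cmr_empty() nit further up, this is roughly what the
single caller could look like with the check inlined. It is a user-space
sketch only: printf() stands in for pr_info(), the struct is redeclared
locally with stdint types, and returning the number of printed entries is
an addition made here purely so the behaviour can be checked.

```c
#include <stdio.h>
#include <stdint.h>

struct cmr_info {
	uint64_t base;
	uint64_t size;
};

/* Print CMRs up to the first empty entry; returns how many were printed. */
static int print_cmrs(const struct cmr_info *cmr_array, int nr_cmrs)
{
	int i;

	for (i = 0; i < nr_cmrs; i++) {
		const struct cmr_info *cmr = &cmr_array[i];

		/*
		 * The array of CMRs can contain empty entries at the
		 * tail. Don't print them.
		 */
		if (!cmr->size)
			break;

		printf("CMR: [0x%llx, 0x%llx)\n",
		       (unsigned long long)cmr->base,
		       (unsigned long long)(cmr->base + cmr->size));
	}

	return i;
}
```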
--
Cheers,
David / dhildenb
^ permalink raw reply [flat|nested] 144+ messages in thread
* Re: [PATCH v11 18/20] x86: Handle TDX erratum to reset TDX private memory during kexec() and reboot
2023-06-19 11:43 ` Huang, Kai
@ 2023-06-19 14:31 ` Dave Hansen
2023-06-19 14:46 ` kirill.shutemov
0 siblings, 1 reply; 144+ messages in thread
From: Dave Hansen @ 2023-06-19 14:31 UTC (permalink / raw)
To: Huang, Kai, kirill.shutemov
Cc: kvm, Luck, Tony, david, bagasdotme, ak, Wysocki, Rafael J,
linux-kernel, Chatre, Reinette, Christopherson,,
Sean, pbonzini, tglx, linux-mm, Yamahata, Isaku, peterz, Shahar,
Sagi, imammedo, Gao, Chao, Brown, Len,
sathyanarayanan.kuppuswamy, Huang, Ying, Williams, Dan J
On 6/19/23 04:43, Huang, Kai wrote:
> On Mon, 2023-06-12 at 06:47 -0700, Dave Hansen wrote:
>> On 6/12/23 03:27, Huang, Kai wrote:
>>> So I think a __mb() after setting tdmr->pamt_4k_base should be good enough, as
>>> it guarantees when setting to any pamt_*_size happens, the valid pamt_4k_base
>>> will be seen by other cpus.
>>>
>>> Does it make sense?
>> Just use a normal old atomic_t or set_bit()/test_bit(). They have
>> built-in memory barriers and are less likely to get botched.
> Hi Dave,
>
> Using atomic_set() requires changing tdmr->pamt_4k_base to atomic_t, which is a
> little bit silly or overkill IMHO. Looking at the code, it seems
> arch_atomic_set() simply uses __WRITE_ONCE():
How about _adding_ a variable that protects tdmr->pamt_4k_base?
Wouldn't that be more straightforward than mucking around with existing
types?
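One possible reading of that, sketched in user-space C11 rather than kernel
code: leave the PAMT fields as plain integers and add a separate "ready"
flag that is written last with release semantics and read first with
acquire semantics. The struct, the pamt_ready field and both helpers are
hypothetical names for this sketch; in the kernel the closest equivalents
would presumably be smp_store_release()/smp_load_acquire() or the
set_bit()/test_bit() family mentioned earlier.

```c
#include <stdatomic.h>
#include <stdbool.h>
#include <stdint.h>

struct tdmr_sketch {
	uint64_t pamt_4k_base;
	uint64_t pamt_4k_size;
	atomic_bool pamt_ready;	/* hypothetical added flag */
};

/* Writer: fill in the fields first, then publish them via the flag. */
static void tdmr_publish_pamt(struct tdmr_sketch *t, uint64_t base,
			      uint64_t size)
{
	t->pamt_4k_base = base;
	t->pamt_4k_size = size;
	atomic_store_explicit(&t->pamt_ready, true, memory_order_release);
}

/* Reader: only look at the fields once the flag is observed as set. */
static bool tdmr_read_pamt(struct tdmr_sketch *t, uint64_t *base,
			   uint64_t *size)
{
	if (!atomic_load_explicit(&t->pamt_ready, memory_order_acquire))
		return false;

	*base = t->pamt_4k_base;
	*size = t->pamt_4k_size;
	return true;
}
```

A reader that sees the flag set is then guaranteed to also see the base and
size written before it, without changing the type of pamt_4k_base.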
^ permalink raw reply [flat|nested] 144+ messages in thread
* Re: [PATCH v11 18/20] x86: Handle TDX erratum to reset TDX private memory during kexec() and reboot
2023-06-19 14:31 ` Dave Hansen
@ 2023-06-19 14:46 ` kirill.shutemov
2023-06-19 23:35 ` Huang, Kai
2023-06-19 23:41 ` Dave Hansen
0 siblings, 2 replies; 144+ messages in thread
From: kirill.shutemov @ 2023-06-19 14:46 UTC (permalink / raw)
To: Dave Hansen
Cc: Huang, Kai, kvm, Luck, Tony, david, bagasdotme, ak, Wysocki,
Rafael J, linux-kernel, Chatre, Reinette, Christopherson,,
Sean, pbonzini, tglx, linux-mm, Yamahata, Isaku, peterz, Shahar,
Sagi, imammedo, Gao, Chao, Brown, Len,
sathyanarayanan.kuppuswamy, Huang, Ying, Williams, Dan J
On Mon, Jun 19, 2023 at 07:31:21AM -0700, Dave Hansen wrote:
> On 6/19/23 04:43, Huang, Kai wrote:
> > On Mon, 2023-06-12 at 06:47 -0700, Dave Hansen wrote:
> >> On 6/12/23 03:27, Huang, Kai wrote:
> >>> So I think a __mb() after setting tdmr->pamt_4k_base should be good enough, as
> >>> it guarantees when setting to any pamt_*_size happens, the valid pamt_4k_base
> >>> will be seen by other cpus.
> >>>
> >>> Does it make sense?
> >> Just use a normal old atomic_t or set_bit()/test_bit(). They have
> >> built-in memory barriers and are less likely to get botched.
> > Hi Dave,
> >
> > Using atomic_set() requires changing tdmr->pamt_4k_base to atomic_t, which is a
> > little bit silly or overkill IMHO. Looking at the code, it seems
> > arch_atomic_set() simply uses __WRITE_ONCE():
>
> How about _adding_ a variable that protects tdmr->pamt_4k_base?
> Wouldn't that be more straightforward than mucking around with existing
> types?
What's wrong with a simple global spinlock that protects all tdmr->pamt_*?
It is much easier to follow than a custom serialization scheme.
--
Kiryl Shutsemau / Kirill A. Shutemov
^ permalink raw reply [flat|nested] 144+ messages in thread
* Re: [PATCH v11 15/20] x86/virt/tdx: Configure global KeyID on all packages
2023-06-15 22:24 ` Huang, Kai
@ 2023-06-19 14:56 ` kirill.shutemov
2023-06-19 23:38 ` Huang, Kai
0 siblings, 1 reply; 144+ messages in thread
From: kirill.shutemov @ 2023-06-19 14:56 UTC (permalink / raw)
To: Huang, Kai
Cc: kvm, nik.borisov, linux-kernel, Hansen, Dave, david, bagasdotme,
ak, Wysocki, Rafael J, Chatre, Reinette, Christopherson,,
Sean, pbonzini, linux-mm, Yamahata, Isaku, tglx, Luck, Tony,
peterz, Shahar, Sagi, imammedo, Gao, Chao, Brown, Len,
sathyanarayanan.kuppuswamy, Huang, Ying, Williams, Dan J
On Thu, Jun 15, 2023 at 10:24:17PM +0000, Huang, Kai wrote:
> Hi Kirill/Dave,
>
> Since I have received a couple of tags from you, may I know which way you
> prefer?
I agree with Nikolay, removing these "if (ret)" helps readability.
--
Kiryl Shutsemau / Kirill A. Shutemov
^ permalink raw reply [flat|nested] 144+ messages in thread
* Re: [PATCH v11 07/20] x86/virt/tdx: Add skeleton to enable TDX on demand
2023-06-19 13:16 ` David Hildenbrand
@ 2023-06-19 23:28 ` Huang, Kai
0 siblings, 0 replies; 144+ messages in thread
From: Huang, Kai @ 2023-06-19 23:28 UTC (permalink / raw)
To: kvm, linux-kernel, david
Cc: Hansen, Dave, Luck, Tony, bagasdotme, ak, Wysocki, Rafael J,
kirill.shutemov, Christopherson,,
Sean, Chatre, Reinette, pbonzini, tglx, linux-mm, Yamahata,
Isaku, peterz, Shahar, Sagi, imammedo, Gao, Chao, Brown, Len,
sathyanarayanan.kuppuswamy, Huang, Ying, Williams, Dan J
On Mon, 2023-06-19 at 15:16 +0200, David Hildenbrand wrote:
> On 04.06.23 16:27, Kai Huang wrote:
> > To enable TDX the kernel needs to initialize TDX from two perspectives:
> > 1) Do a set of SEAMCALLs to initialize the TDX module to make it ready
> > to create and run TDX guests; 2) Do the per-cpu initialization SEAMCALL
> > on one logical cpu before the kernel wants to make any other SEAMCALLs
> > on that cpu (including those involved during module initialization and
> > running TDX guests).
> >
> > The TDX module can be initialized only once in its lifetime. Instead
> > of always initializing it at boot time, this implementation chooses an
> > "on demand" approach that defers initializing TDX until there is a real
> > need (e.g. when requested by KVM). This approach has the following pros:
> >
> > 1) It avoids consuming the memory that must be allocated by kernel and
> > given to the TDX module as metadata (~1/256th of the TDX-usable memory),
> > and also saves the CPU cycles of initializing the TDX module (and the
> > metadata) when TDX is not used at all.
> >
> > 2) The TDX module design allows it to be updated while the system is
> > running. The update procedure shares quite a few steps with this "on
> > demand" initialization mechanism. The hope is that much of "on demand"
> > mechanism can be shared with a future "update" mechanism. A boot-time
> > TDX module implementation would not be able to share much code with the
> > update mechanism.
> >
> > 3) Making SEAMCALL requires VMX to be enabled. Currently, only the KVM
> > code mucks with VMX enabling. If the TDX module were to be initialized
> > separately from KVM (like at boot), the boot code would need to be
> > taught how to muck with VMX enabling and KVM would need to be taught how
> > to cope with that. Making KVM itself responsible for TDX initialization
> > lets the rest of the kernel stay blissfully unaware of VMX.
> >
> > Similar to module initialization, also make the per-cpu initialization
> > "on demand" as it also depends on VMX being enabled.
> >
> > Add two functions, tdx_enable() and tdx_cpu_enable(), to enable the TDX
> > module and enable TDX on local cpu respectively. For now tdx_enable()
> > is a placeholder. The TODO list will be pared down as functionality is
> > added.
> >
> > In tdx_enable() use a state machine protected by mutex to make sure the
> > initialization will only be done once, as tdx_enable() can be called
> > multiple times (i.e. KVM module can be reloaded) and may be called
> > concurrently by other kernel components in the future.
> >
> > The per-cpu initialization on each cpu can only be done once during the
> > module's life time. Use a per-cpu variable to track its status to make
> > sure it is only done once in tdx_cpu_enable().
> >
> > Also, a SEAMCALL to do TDX module global initialization must be done
> > once on any logical cpu before any per-cpu initialization SEAMCALL. Do
> > it inside tdx_cpu_enable() too (if hasn't been done).
> >
> > tdx_enable() can potentially invoke SEAMCALLs on any online cpus. The
> > per-cpu initialization must be done before those SEAMCALLs are invoked
> > on some cpu. To keep things simple, in tdx_cpu_enable(), always do the
> > per-cpu initialization regardless of whether the TDX module has been
> > initialized or not. And in tdx_enable(), don't call tdx_cpu_enable()
> > but assume the caller has disabled CPU hotplug, done VMXON and
> > tdx_cpu_enable() on all online cpus before calling tdx_enable().
> >
> > Signed-off-by: Kai Huang <kai.huang@intel.com>
> > Reviewed-by: Isaku Yamahata <isaku.yamahata@intel.com>
> > ---
> >
> > v10 -> v11:
> > - Return -ENODEV instead of -EINVAL when CONFIG_INTEL_TDX_HOST is off.
> > - Return the actual error code for tdx_enable() instead of -EINVAL.
> > - Added Isaku's Reviewed-by.
> >
> > v9 -> v10:
> > - Merged the patch to handle per-cpu initialization to this patch to
> > tell the story better.
> > - Changed how to handle the per-cpu initialization to only provide a
> > tdx_cpu_enable() function to let the user of TDX to do it when the
> > user wants to run TDX code on a certain cpu.
> > - Changed tdx_enable() to not call cpus_read_lock() explicitly, but
> > call lockdep_assert_cpus_held() to assume the caller has done that.
> > - Improved comments around tdx_enable() and tdx_cpu_enable().
> > - Improved changelog to tell the story better accordingly.
> >
> > v8 -> v9:
> > - Removed detailed TODO list in the changelog (Dave).
> > - Added back steps to do module global initialization and per-cpu
> > initialization in the TODO list comment.
> > - Moved the 'enum tdx_module_status_t' from tdx.c to local tdx.h
> >
> > v7 -> v8:
> > - Refined changelog (Dave).
> > - Removed "all BIOS-enabled cpus" related code (Peter/Thomas/Dave).
> > - Add a "TODO list" comment in init_tdx_module() to list all steps of
> > initializing the TDX Module to tell the story (Dave).
> > - Made tdx_enable() universally return -EINVAL, and removed nonsense
> > comments (Dave).
> > - Simplified __tdx_enable() to only handle success or failure.
> > - TDX_MODULE_SHUTDOWN -> TDX_MODULE_ERROR
> > - Removed TDX_MODULE_NONE (not loaded) as it is not necessary.
> > - Improved comments (Dave).
> > - Pointed out 'tdx_module_status' is software thing (Dave).
> >
> > v6 -> v7:
> > - No change.
> >
> > v5 -> v6:
> > - Added code to set status to TDX_MODULE_NONE if TDX module is not
> > loaded (Chao)
> > - Added Chao's Reviewed-by.
> > - Improved comments around cpus_read_lock().
> >
> > - v3->v5 (no feedback on v4):
> > - Removed the check that SEAMRR and TDX KeyID have been detected on
> > all present cpus.
> > - Removed tdx_detect().
> > - Added num_online_cpus() to MADT-enabled CPUs check within the CPU
> > hotplug lock and return early with error message.
> > - Improved dmesg printing for TDX module detection and initialization.
> >
> >
> > ---
> > arch/x86/include/asm/tdx.h | 4 +
> > arch/x86/virt/vmx/tdx/tdx.c | 179 ++++++++++++++++++++++++++++++++++++
> > arch/x86/virt/vmx/tdx/tdx.h | 13 +++
> > 3 files changed, 196 insertions(+)
> >
> > diff --git a/arch/x86/include/asm/tdx.h b/arch/x86/include/asm/tdx.h
> > index b489b5b9de5d..03f74851608f 100644
> > --- a/arch/x86/include/asm/tdx.h
> > +++ b/arch/x86/include/asm/tdx.h
> > @@ -102,8 +102,12 @@ static inline long tdx_kvm_hypercall(unsigned int nr, unsigned long p1,
> >
> > #ifdef CONFIG_INTEL_TDX_HOST
> > bool platform_tdx_enabled(void);
> > +int tdx_cpu_enable(void);
> > +int tdx_enable(void);
> > #else /* !CONFIG_INTEL_TDX_HOST */
> > static inline bool platform_tdx_enabled(void) { return false; }
> > +static inline int tdx_cpu_enable(void) { return -ENODEV; }
> > +static inline int tdx_enable(void) { return -ENODEV; }
> > #endif /* CONFIG_INTEL_TDX_HOST */
> >
> > #endif /* !__ASSEMBLY__ */
> > diff --git a/arch/x86/virt/vmx/tdx/tdx.c b/arch/x86/virt/vmx/tdx/tdx.c
> > index e62e978eba1b..bcf2b2d15a2e 100644
> > --- a/arch/x86/virt/vmx/tdx/tdx.c
> > +++ b/arch/x86/virt/vmx/tdx/tdx.c
> > @@ -13,6 +13,10 @@
> > #include <linux/errno.h>
> > #include <linux/printk.h>
> > #include <linux/smp.h>
> > +#include <linux/cpu.h>
> > +#include <linux/spinlock.h>
> > +#include <linux/percpu-defs.h>
> > +#include <linux/mutex.h>
> > #include <asm/msr-index.h>
> > #include <asm/msr.h>
> > #include <asm/archrandom.h>
> > @@ -23,6 +27,18 @@ static u32 tdx_global_keyid __ro_after_init;
> > static u32 tdx_guest_keyid_start __ro_after_init;
> > static u32 tdx_nr_guest_keyids __ro_after_init;
> >
> > +static unsigned int tdx_global_init_status;
> > +static DEFINE_RAW_SPINLOCK(tdx_global_init_lock);
> > +#define TDX_GLOBAL_INIT_DONE _BITUL(0)
> > +#define TDX_GLOBAL_INIT_FAILED _BITUL(1)
> > +
> > +static DEFINE_PER_CPU(unsigned int, tdx_lp_init_status);
> > +#define TDX_LP_INIT_DONE _BITUL(0)
> > +#define TDX_LP_INIT_FAILED _BITUL(1)
>
> I'm curious, why do we have to track three states: uninitialized
> (!done), initialized (done + ! failed), permanent error (done + failed).
>
> [besides: why can't you use an enum and share that between global and pcpu?]
>
> Why can't you have a pcpu "bool tdx_lp_initialized" and "bool
> tdx_global_initialized"?
>
> I mean, if there was an error during previous initialization, it's not
> initialized: you'd try initializing again -- and possibly fail again --
> on the next attempt. I doubt that a "try to cache failed status to keep
> failing fast" is really required.
>
> Is there any other reason (e.g., second init attempt would set your
> computer on fire) why it can't be simpler?
No other reason, only the one you mentioned above: I didn't want to retry in
case of a permanent error.
Yes, I agree we can have a pcpu "bool tdx_lp_initialized" and a "bool
tdx_global_initialized" to simplify the logic.
Thanks!
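
A minimal userspace sketch of the simplified tracking agreed on above, assuming a single "initialized" flag and no caching of failures. do_global_init() and the fake_* variables are hypothetical stand-ins for the real one-time initialization, and the locking the real code would need is left out:

```c
#include <errno.h>
#include <stdbool.h>

/* Hypothetical stand-in for the real one-time global initialization. */
static int fake_global_init_result;	/* 0 on success, negative errno on failure */
static int fake_global_init_calls;	/* how many times init was attempted */

static int do_global_init(void)
{
	fake_global_init_calls++;
	return fake_global_init_result;
}

/* Only "done successfully" is remembered; failures are not cached. */
static bool tdx_global_initialized;

/* Returns 0 once initialized; a failed attempt is retried on the next call. */
static int try_global_init(void)
{
	int ret;

	if (tdx_global_initialized)
		return 0;

	ret = do_global_init();
	if (ret)
		return ret;	/* not recorded: the next caller tries again */

	tdx_global_initialized = true;
	return 0;
}
```

With this shape a permanent error costs one extra attempt per caller rather than a cached fast failure, which is the trade-off discussed above.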
^ permalink raw reply [flat|nested] 144+ messages in thread
* Re: [PATCH v11 18/20] x86: Handle TDX erratum to reset TDX private memory during kexec() and reboot
2023-06-19 14:46 ` kirill.shutemov
@ 2023-06-19 23:35 ` Huang, Kai
2023-06-19 23:41 ` Dave Hansen
1 sibling, 0 replies; 144+ messages in thread
From: Huang, Kai @ 2023-06-19 23:35 UTC (permalink / raw)
To: kirill.shutemov, Hansen, Dave
Cc: kvm, Luck, Tony, david, bagasdotme, ak, Wysocki, Rafael J,
linux-kernel, Chatre, Reinette, Christopherson,,
Sean, pbonzini, tglx, linux-mm, Yamahata, Isaku, Shahar, Sagi,
peterz, imammedo, Gao, Chao, Brown, Len,
sathyanarayanan.kuppuswamy, Huang, Ying, Williams, Dan J
On Mon, 2023-06-19 at 17:46 +0300, kirill.shutemov@linux.intel.com wrote:
> On Mon, Jun 19, 2023 at 07:31:21AM -0700, Dave Hansen wrote:
> > On 6/19/23 04:43, Huang, Kai wrote:
> > > On Mon, 2023-06-12 at 06:47 -0700, Dave Hansen wrote:
> > > > On 6/12/23 03:27, Huang, Kai wrote:
> > > > > So I think a __mb() after setting tdmr->pamt_4k_base should be good enough, as
> > > > > it guarantees when setting to any pamt_*_size happens, the valid pamt_4k_base
> > > > > will be seen by other cpus.
> > > > >
> > > > > Does it make sense?
> > > > Just use a normal old atomic_t or set_bit()/test_bit(). They have
> > > > built-in memory barriers and are less likely to get botched.
> > > Hi Dave,
> > >
> > > Using atomic_set() requires changing tdmr->pamt_4k_base to atomic_t, which is a
> > > little bit silly or overkill IMHO. Looking at the code, it seems
> > > arch_atomic_set() simply uses __WRITE_ONCE():
> >
> > How about _adding_ a variable that protects tdmr->pamt_4k_base?
> > Wouldn't that be more straightforward than mucking around with existing
> > types?
>
> What's wrong with simple global spinlock that protects all tdmr->pamt_*?
> It is much easier to follow than a custom serialization scheme.
>
For this patch I think it's overkill to use a spinlock, because by the time the
rebooting cpu reads this, all other cpus have already been stopped, so nothing
runs concurrently here.
However, I just recalled that the next #MC handler patch could take advantage of
this too, because the #MC handler can truly run concurrently with module
initialization. Currently that one reads tdx_module_status first, but again we
may have the same memory ordering issue. So having a spinlock makes sense from
the #MC handler patch's point of view.
I'll change to use a spinlock if Dave is fine with that?
Thanks for feedback!
^ permalink raw reply [flat|nested] 144+ messages in thread
* Re: [PATCH v11 15/20] x86/virt/tdx: Configure global KeyID on all packages
2023-06-19 14:56 ` kirill.shutemov
@ 2023-06-19 23:38 ` Huang, Kai
0 siblings, 0 replies; 144+ messages in thread
From: Huang, Kai @ 2023-06-19 23:38 UTC (permalink / raw)
To: kirill.shutemov
Cc: kvm, Hansen, Dave, david, bagasdotme, ak, Wysocki, Rafael J,
Luck, Tony, Chatre, Reinette, Christopherson,,
Sean, pbonzini, tglx, Yamahata, Isaku, nik.borisov, linux-mm,
linux-kernel, peterz, Shahar, Sagi, imammedo, Gao, Chao, Brown,
Len, sathyanarayanan.kuppuswamy, Huang, Ying, Williams, Dan J
On Mon, 2023-06-19 at 17:56 +0300, kirill.shutemov@linux.intel.com wrote:
> On Thu, Jun 15, 2023 at 10:24:17PM +0000, Huang, Kai wrote:
> > Hi Kirill/Dave,
> >
> > Since I have received couple of tags from you, may I know which way do you
> > prefer?
>
> I agree with Nikolay, removing these "if (ret)" helps readability.
>
OK I'll change. Thanks!
^ permalink raw reply [flat|nested] 144+ messages in thread
* Re: [PATCH v11 18/20] x86: Handle TDX erratum to reset TDX private memory during kexec() and reboot
2023-06-19 14:46 ` kirill.shutemov
2023-06-19 23:35 ` Huang, Kai
@ 2023-06-19 23:41 ` Dave Hansen
2023-06-20 0:56 ` Huang, Kai
2023-06-20 7:48 ` Peter Zijlstra
1 sibling, 2 replies; 144+ messages in thread
From: Dave Hansen @ 2023-06-19 23:41 UTC (permalink / raw)
To: kirill.shutemov
Cc: Huang, Kai, kvm, Luck, Tony, david, bagasdotme, ak, Wysocki,
Rafael J, linux-kernel, Chatre, Reinette, Christopherson,,
Sean, pbonzini, tglx, linux-mm, Yamahata, Isaku, peterz, Shahar,
Sagi, imammedo, Gao, Chao, Brown, Len,
sathyanarayanan.kuppuswamy, Huang, Ying, Williams, Dan J
On 6/19/23 07:46, kirill.shutemov@linux.intel.com wrote:
>>>
>>> Using atomic_set() requires changing tdmr->pamt_4k_base to atomic_t, which is a
>>> little bit silly or overkill IMHO. Looking at the code, it seems
>>> arch_atomic_set() simply uses __WRITE_ONCE():
>> How about _adding_ a variable that protects tdmr->pamt_4k_base?
>> Wouldn't that be more straightforward than mucking around with existing
>> types?
> What's wrong with simple global spinlock that protects all tdmr->pamt_*?
> It is much easier to follow than a custom serialization scheme.
Quick, what prevents a:
spin_lock() => #MC => spin_lock()
deadlock?
Plain old test/sets don't deadlock ever.
^ permalink raw reply [flat|nested] 144+ messages in thread
* Re: [PATCH v11 08/20] x86/virt/tdx: Get information about TDX module and TDX-capable memory
2023-06-19 13:29 ` David Hildenbrand
@ 2023-06-19 23:51 ` Huang, Kai
0 siblings, 0 replies; 144+ messages in thread
From: Huang, Kai @ 2023-06-19 23:51 UTC (permalink / raw)
To: kvm, linux-kernel, david
Cc: Hansen, Dave, Luck, Tony, bagasdotme, ak, Wysocki, Rafael J,
kirill.shutemov, Christopherson,,
Sean, Chatre, Reinette, pbonzini, tglx, linux-mm, Yamahata,
Isaku, peterz, Shahar, Sagi, imammedo, Gao, Chao, Brown, Len,
sathyanarayanan.kuppuswamy, Huang, Ying, Williams, Dan J
> > +static inline bool is_cmr_empty(struct cmr_info *cmr)
> > +{
> > + return !cmr->size;
> > +}
> > +
>
> Nit: maybe it's just me, but this function seems unnecessary.
>
> If "!cmr->size" is not expressive, then I don't know why "is_cmr_empty"
> should be. Just inline that into the single user.
>
> .. after all the single caller also uses/prints cmr->size ...
Agreed. I'll remove this function. Thanks!
>
> > +static void print_cmrs(struct cmr_info *cmr_array, int nr_cmrs)
> > +{
> > + int i;
> > +
> > + for (i = 0; i < nr_cmrs; i++) {
> > + struct cmr_info *cmr = &cmr_array[i];
> > +
> > + /*
> > + * The array of CMRs reported via TDH.SYS.INFO can
> > + * contain tail empty CMRs. Don't print them.
> > + */
> > + if (is_cmr_empty(cmr))
> > + break;
> > +
> > + pr_info("CMR: [0x%llx, 0x%llx)\n", cmr->base,
> > + cmr->base + cmr->size);
> > + }
> > +}
> > +
> > +/*
> > + * Get the TDX module information (TDSYSINFO_STRUCT) and the array of
> > + * CMRs, and save them to @sysinfo and @cmr_array. @sysinfo must have
> > + * been padded to have enough room to save the TDSYSINFO_STRUCT.
> > + */
> > +static int tdx_get_sysinfo(struct tdsysinfo_struct *sysinfo,
> > + struct cmr_info *cmr_array)
> > +{
> > + struct tdx_module_output out;
> > + u64 sysinfo_pa, cmr_array_pa;
> > + int ret;
> > +
> > + sysinfo_pa = __pa(sysinfo);
> > + cmr_array_pa = __pa(cmr_array);
> > + ret = seamcall(TDH_SYS_INFO, sysinfo_pa, TDSYSINFO_STRUCT_SIZE,
> > + cmr_array_pa, MAX_CMRS, NULL, &out);
> > + if (ret)
> > + return ret;
> > +
> > + pr_info("TDX module: atributes 0x%x, vendor_id 0x%x, major_version %u, minor_version %u, build_date %u, build_num %u",
>
>
> "attributes" ?
Appreciate! :)
[...]
> > +#define TDSYSINFO_STRUCT_SIZE 1024
>
> So, it can never be larger than 1024 bytes? Not even with many cpuid
> configs?
Correct. The TDX module spec(s) say:
TDSYSINFO_STRUCT’s size is 1024B.
I read that as an architectural statement.
We (Intel) have already published TDX IO, and TDSYSINFO_STRUCT is 1024B for all
TDX module versions.
>
^ permalink raw reply [flat|nested] 144+ messages in thread
* Re: [PATCH v11 02/20] x86/virt/tdx: Detect TDX during kernel boot
2023-06-19 12:12 ` David Hildenbrand
@ 2023-06-19 23:58 ` Huang, Kai
0 siblings, 0 replies; 144+ messages in thread
From: Huang, Kai @ 2023-06-19 23:58 UTC (permalink / raw)
To: kvm, linux-kernel, david
Cc: Hansen, Dave, Luck, Tony, bagasdotme, ak, Wysocki, Rafael J,
kirill.shutemov, Christopherson,,
Sean, Chatre, Reinette, pbonzini, tglx, linux-mm, Yamahata,
Isaku, peterz, Shahar, Sagi, imammedo, Gao, Chao, Brown, Len,
sathyanarayanan.kuppuswamy, Huang, Ying, Williams, Dan J
[...]
> > + /*
> > + * Just use the first TDX KeyID as the 'global KeyID' and
> > + * leave the rest for TDX guests.
> > + */
> > + tdx_global_keyid = tdx_keyid_start;
> > + tdx_guest_keyid_start = ++tdx_keyid_start;
> > + tdx_nr_guest_keyids = --nr_tdx_keyids;
>
> tdx_guest_keyid_start = tdx_keyid_start + 1;
> tdx_nr_guest_keyids = nr_tdx_keyids - 1;
>
> Easier to get, because the modified values are unused.
Will do.
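
A minimal sketch of the KeyID split in the form suggested above, computing the guest range without modifying the inputs. struct keyid_split and split_keyids() are hypothetical names used only for illustration, and the sketch assumes at least one KeyID is available:

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical container for the three values the patch stores globally. */
struct keyid_split {
	uint32_t global_keyid;
	uint32_t guest_keyid_start;
	uint32_t nr_guest_keyids;
};

/*
 * Use the first KeyID as the global KeyID and leave the rest for guests.
 * The inputs are only read, never modified.
 */
static struct keyid_split split_keyids(uint32_t keyid_start, uint32_t nr_keyids)
{
	struct keyid_split s;

	assert(nr_keyids >= 1);	/* sketch-only precondition */

	s.global_keyid = keyid_start;
	s.guest_keyid_start = keyid_start + 1;
	s.nr_guest_keyids = nr_keyids - 1;
	return s;
}
```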
>
> I'd probably avoid the "tdx" terminology in the local variables
> ("keyid_start", "nr_keyids") to give a better hint what the global
> variables are (tdx_*), but just a personal preference.
>
Yeah, in general I agree, but I chose "tdx_*" because it lets me easily
distinguish function-local variables from static variables, especially since
this file contains roughly 1500 LoC (it also makes naming the local variables
easier). So I'd like to keep the "tdx_*".
>
> Apart from that,
>
> Reviewed-by: David Hildenbrand <david@redhat.com>
>
Thanks!
^ permalink raw reply [flat|nested] 144+ messages in thread
* Re: [PATCH v11 18/20] x86: Handle TDX erratum to reset TDX private memory during kexec() and reboot
2023-06-19 23:41 ` Dave Hansen
@ 2023-06-20 0:56 ` Huang, Kai
2023-06-20 1:06 ` Dave Hansen
2023-06-20 7:48 ` Peter Zijlstra
1 sibling, 1 reply; 144+ messages in thread
From: Huang, Kai @ 2023-06-20 0:56 UTC (permalink / raw)
To: kirill.shutemov, Hansen, Dave
Cc: kvm, Luck, Tony, david, bagasdotme, ak, Wysocki, Rafael J,
linux-kernel, Chatre, Reinette, Christopherson,,
Sean, pbonzini, tglx, linux-mm, Yamahata, Isaku, Shahar, Sagi,
peterz, imammedo, Gao, Chao, Brown, Len,
sathyanarayanan.kuppuswamy, Huang, Ying, Williams, Dan J
On Mon, 2023-06-19 at 16:41 -0700, Dave Hansen wrote:
> On 6/19/23 07:46, kirill.shutemov@linux.intel.com wrote:
> > > >
> > > > Using atomic_set() requires changing tdmr->pamt_4k_base to atomic_t, which is a
> > > > little bit silly or overkill IMHO. Looking at the code, it seems
> > > > arch_atomic_set() simply uses __WRITE_ONCE():
> > > How about _adding_ a variable that protects tdmr->pamt_4k_base?
> > > Wouldn't that be more straightforward than mucking around with existing
> > > types?
> > What's wrong with simple global spinlock that protects all tdmr->pamt_*?
> > It is much easier to follow than a custom serialization scheme.
>
> Quick, what prevents a:
>
> spin_lock() => #MC => spin_lock()
>
> deadlock?
>
> Plain old test/sets don't deadlock ever.
Agreed. So I think having any locking in the #MC handler is kinda dangerous.
Adding "a" variable has another advantage: we can get a more precise answer to
whether we need to reset PAMT pages, even if those PAMTs are already allocated
and set to the TDMRs, because the TDX module doesn't start writing PAMTs using
the global KeyID until some SEAMCALL.
Any comments on the below?
+static bool tdx_private_mem_begin;
+
/*
* Wrapper of __seamcall() to convert SEAMCALL leaf function error code
* to kernel error code. @seamcall_ret and @out contain the SEAMCALL
@@ -1141,6 +1143,8 @@ static int init_tdx_module(void)
*/
wbinvd_on_all_cpus();
+ WRITE_ONCE(tdx_private_mem_begin, true);
+
/* Config the key of global KeyID on all packages */
ret = config_global_keyid();
if (ret)
@@ -1463,6 +1467,14 @@ static void tdx_memory_shutdown(void)
*/
WARN_ON_ONCE(num_online_cpus() != 1);
+ /*
+ * It's not possible to have any TDX private pages if the TDX
+ * module hasn't started to write any memory using the global
+ * KeyID.
+ */
+ if (!READ_ONCE(tdx_private_mem_begin))
+ return;
+
tdmrs_reset_pamt_all(&tdx_tdmr_list);
^ permalink raw reply [flat|nested] 144+ messages in thread
* Re: [PATCH v11 18/20] x86: Handle TDX erratum to reset TDX private memory during kexec() and reboot
2023-06-20 0:56 ` Huang, Kai
@ 2023-06-20 1:06 ` Dave Hansen
2023-06-20 7:58 ` Peter Zijlstra
2023-06-25 15:30 ` Huang, Kai
0 siblings, 2 replies; 144+ messages in thread
From: Dave Hansen @ 2023-06-20 1:06 UTC (permalink / raw)
To: Huang, Kai, kirill.shutemov
Cc: kvm, Luck, Tony, david, bagasdotme, ak, Wysocki, Rafael J,
linux-kernel, Chatre, Reinette, Christopherson,,
Sean, pbonzini, tglx, linux-mm, Yamahata, Isaku, Shahar, Sagi,
peterz, imammedo, Gao, Chao, Brown, Len,
sathyanarayanan.kuppuswamy, Huang, Ying, Williams, Dan J
On 6/19/23 17:56, Huang, Kai wrote:
> Any comments to below?
Nothing that I haven't already said in this thread:
> Just use a normal old atomic_t or set_bit()/test_bit(). They have
> built-in memory barriers and are less likely to get botched.
I kinda made a point of literally suggesting "atomic_t or
set_bit()/test_bit()". I even told you why: "built-in memory barriers".
Guess what READ/WRITE_ONCE() *don't* have. Memory barriers.
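
A minimal userspace sketch of the ordering concern, assuming C11 atomics as a stand-in for the kernel primitives (the kernel analogues would be something like smp_store_release()/smp_load_acquire() or the bitops mentioned above). pamt_base, pamt_published and both helpers are hypothetical names, not code from the patch:

```c
#include <stdatomic.h>
#include <stdbool.h>
#include <stdint.h>

static uint64_t pamt_base;		/* plain data, written before the flag */
static atomic_bool pamt_published;	/* flag that guards pamt_base */

/* Writer: fill in the data first, then publish it. */
static void publish_pamt(uint64_t base)
{
	pamt_base = base;
	/* release: the store to pamt_base cannot be reordered after this */
	atomic_store_explicit(&pamt_published, true, memory_order_release);
}

/* Reader: returns true and fills *base only if the writer has published. */
static bool read_pamt(uint64_t *base)
{
	/* acquire: pairs with the release store in publish_pamt() */
	if (!atomic_load_explicit(&pamt_published, memory_order_acquire))
		return false;

	*base = pamt_base;
	return true;
}
```

A reader that observes the flag set is then guaranteed to observe the data written before it, which a bare once-style store and load of the flag does not promise on its own in the general memory model.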
^ permalink raw reply [flat|nested] 144+ messages in thread
* Re: [PATCH v11 18/20] x86: Handle TDX erratum to reset TDX private memory during kexec() and reboot
2023-06-19 23:41 ` Dave Hansen
2023-06-20 0:56 ` Huang, Kai
@ 2023-06-20 7:48 ` Peter Zijlstra
1 sibling, 0 replies; 144+ messages in thread
From: Peter Zijlstra @ 2023-06-20 7:48 UTC (permalink / raw)
To: Dave Hansen
Cc: kirill.shutemov, Huang, Kai, kvm, Luck, Tony, david, bagasdotme,
ak, Wysocki, Rafael J, linux-kernel, Chatre, Reinette,
Christopherson,,
Sean, pbonzini, tglx, linux-mm, Yamahata, Isaku, Shahar, Sagi,
imammedo, Gao, Chao, Brown, Len, sathyanarayanan.kuppuswamy,
Huang, Ying, Williams, Dan J
On Mon, Jun 19, 2023 at 04:41:13PM -0700, Dave Hansen wrote:
> On 6/19/23 07:46, kirill.shutemov@linux.intel.com wrote:
> >>>
> >>> Using atomic_set() requires changing tdmr->pamt_4k_base to atomic_t, which is a
> >>> little bit silly or overkill IMHO. Looking at the code, it seems
> >>> arch_atomic_set() simply uses __WRITE_ONCE():
> >> How about _adding_ a variable that protects tdmr->pamt_4k_base?
> >> Wouldn't that be more straightforward than mucking around with existing
> >> types?
> > What's wrong with simple global spinlock that protects all tdmr->pamt_*?
> > It is much easier to follow than a custom serialization scheme.
>
> Quick, what prevents a:
>
> spin_lock() => #MC => spin_lock()
>
> deadlock?
>
> Plain old test/sets don't deadlock ever.
Depends on what you mean; anything that spin-waits will deadlock,
doesn't matter if it's a test-and-set or not.
The thing with these non-maskable exceptions/interrupts is that they
must be wait-free. If serialization is required it needs to be try based
and accept failure without waiting.
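
A minimal sketch of "try based" serialization as described above, assuming a C11 atomic_flag in userspace. pamt_busy, pamt_trylock(), pamt_unlock() and handler_inspect_pamt() are hypothetical names; the point is only that the handler path never spins:

```c
#include <stdatomic.h>
#include <stdbool.h>

static atomic_flag pamt_busy = ATOMIC_FLAG_INIT;

/* Never spins: returns false at once if the flag is already held. */
static bool pamt_trylock(void)
{
	return !atomic_flag_test_and_set_explicit(&pamt_busy,
						  memory_order_acquire);
}

static void pamt_unlock(void)
{
	atomic_flag_clear_explicit(&pamt_busy, memory_order_release);
}

/*
 * Hypothetical handler body. It may have interrupted the flag holder on
 * the same CPU, so waiting for the flag could never succeed; it gives up
 * instead and lets the caller take a conservative path.
 */
static bool handler_inspect_pamt(void)
{
	if (!pamt_trylock())
		return false;	/* accept failure without waiting */

	/* ... read the protected data here ... */

	pamt_unlock();
	return true;
}
```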
^ permalink raw reply [flat|nested] 144+ messages in thread
* Re: [PATCH v11 18/20] x86: Handle TDX erratum to reset TDX private memory during kexec() and reboot
2023-06-20 1:06 ` Dave Hansen
@ 2023-06-20 7:58 ` Peter Zijlstra
2023-06-25 15:30 ` Huang, Kai
1 sibling, 0 replies; 144+ messages in thread
From: Peter Zijlstra @ 2023-06-20 7:58 UTC (permalink / raw)
To: Dave Hansen
Cc: Huang, Kai, kirill.shutemov, kvm, Luck, Tony, david, bagasdotme,
ak, Wysocki, Rafael J, linux-kernel, Chatre, Reinette,
Christopherson,,
Sean, pbonzini, tglx, linux-mm, Yamahata, Isaku, Shahar, Sagi,
imammedo, Gao, Chao, Brown, Len, sathyanarayanan.kuppuswamy,
Huang, Ying, Williams, Dan J
On Mon, Jun 19, 2023 at 06:06:30PM -0700, Dave Hansen wrote:
> On 6/19/23 17:56, Huang, Kai wrote:
> > Any comments to below?
>
> Nothing that I haven't already said in this thread:
>
> > Just use a normal old atomic_t or set_bit()/test_bit(). They have
> > built-in memory barriers and are less likely to get botched.
>
> I kinda made a point of literally suggesting "atomic_t or
> set_bit()/test_bit()". I even told you why: "built-in memory barriers".
>
> Guess what READ/WRITE_ONCE() *don't* have. Memory barriers.
x86 has built-in memory barriers for being TSO :-) Specifically all
barriers provided by spinlock (acquire/release) are no-ops on x86.
(strictly speaking locks imply stronger order than they have to because
TSO atomic ops imply stronger ordering than required)
There is one (and only the one) re-ordering possible on TSO and that is
the store-buffer, later loads can fail to observe prior stores.
If that is a concern, you need explicit barriers.
This is #MC, much care and explicit open-coded crap is expected. Also,
this is #MC, much broken is also expected :-( As in, the current #MC
handler is a known pile of shit.
Basically the whole of #MC should be noinstr -- it isn't and that's a
significant problem.
Also we still very much suffer the NMI <- #MC problem and the #MC latch
is known broken garbage.
Whatever you do, do it very carefully, double check and be more careful.
^ permalink raw reply [flat|nested] 144+ messages in thread
* Re: [PATCH v11 18/20] x86: Handle TDX erratum to reset TDX private memory during kexec() and reboot
2023-06-12 3:06 ` Huang, Kai
2023-06-12 7:58 ` kirill.shutemov
@ 2023-06-20 8:11 ` Peter Zijlstra
2023-06-20 10:42 ` Huang, Kai
1 sibling, 1 reply; 144+ messages in thread
From: Peter Zijlstra @ 2023-06-20 8:11 UTC (permalink / raw)
To: Huang, Kai
Cc: kirill.shutemov, kvm, Hansen, Dave, david, bagasdotme, ak,
Wysocki, Rafael J, linux-kernel, Chatre, Reinette,
Christopherson,,
Sean, pbonzini, tglx, linux-mm, Yamahata, Isaku, Luck, Tony,
Shahar, Sagi, imammedo, Gao, Chao, Brown, Len,
sathyanarayanan.kuppuswamy, Huang, Ying, Williams, Dan J
On Mon, Jun 12, 2023 at 03:06:48AM +0000, Huang, Kai wrote:
> + __mb();
__mb() is not a valid interface to use.
^ permalink raw reply [flat|nested] 144+ messages in thread
* Re: [PATCH v11 04/20] x86/cpu: Detect TDX partial write machine check erratum
2023-06-19 12:21 ` David Hildenbrand
@ 2023-06-20 10:31 ` Huang, Kai
2023-06-20 15:39 ` Dave Hansen
1 sibling, 0 replies; 144+ messages in thread
From: Huang, Kai @ 2023-06-20 10:31 UTC (permalink / raw)
To: kvm, linux-kernel, david
Cc: Hansen, Dave, Luck, Tony, bagasdotme, ak, Wysocki, Rafael J,
kirill.shutemov, Christopherson,,
Sean, Chatre, Reinette, pbonzini, tglx, linux-mm, Yamahata,
Isaku, peterz, Shahar, Sagi, imammedo, Gao, Chao, Brown, Len,
sathyanarayanan.kuppuswamy, Huang, Ying, Williams, Dan J
On Mon, 2023-06-19 at 14:21 +0200, David Hildenbrand wrote:
> On 04.06.23 16:27, Kai Huang wrote:
> > TDX memory has integrity and confidentiality protections. Violations of
> > this integrity protection are supposed to only affect TDX operations and
> > are never supposed to affect the host kernel itself. In other words,
> > the host kernel should never, itself, see machine checks induced by the
> > TDX integrity hardware.
> >
> > Alas, the first few generations of TDX hardware have an erratum. A
> > "partial" write to a TDX private memory cacheline will silently "poison"
> > the line. Subsequent reads will consume the poison and generate a
> > machine check. According to the TDX hardware spec, neither of these
> > things should have happened.
> >
> > Virtually all kernel memory accesses operations happen in full
> > cachelines. In practice, writing a "byte" of memory usually reads a 64
> > byte cacheline of memory, modifies it, then writes the whole line back.
> > Those operations do not trigger this problem.
>
> So, ordinary writes to TD private memory are not a problem?
>
Not a problem for the kernel, as such a write won't poison the memory directly,
so if the kernel reads that memory there won't be a #MC.
However, if a TDX guest reads that memory (which was previously written by the
kernel or userspace), the memory is marked as poisoned on read and a #MC is
triggered.
> I thought
> one motivation for the unmapped-guest-memory discussion was to prevent
> host (userspace) writes to such memory because it would trigger a MC and
> eventually crash the host.
Yeah, the #MC will be triggered inside the TDX guest. I think in most cases such
a #MC won't crash the host kernel; only the victim TDX guest is killed. But
there might be some cases where we cannot handle the #MC gracefully, e.g., with
some particular BIOS settings. One example is with LMCE disabled: any #MC would
be broadcast to all LPs, causing all other TDX guests running on other LPs to be
killed.
Also quoted from Chao, Peng, who has been working on the unmapped-guest-memory
since early time:
"
The problem is we may not always be able to handle #MC gracefully, in
some configurations (BIOS settings) the #MC can cause the whole system
reset, not just kill the TD. At least this is the original motivation
for Intel to start this series. I think the case is still true unless I
missed something. From KVM community, they have motivation to unmap the
private memory from userspace even the #MC is not fatal, just to prevent
possible unintended accesses from userspace (that's why they ask AMD to
use this series even their machine doesn't cause system reset when the
same happens).
"
>
> I recall that this would happen easily (not just in some weird "partial"
> case and that the spec would allow for it)
No. As mentioned above, this partial-write #MC is different from the one
triggered in a TDX guest.
>
> 1) Does that, in general, not happen anymore (was the hardware fixed?)?
>
> 2) Will new hardware prevent/"fix" that completely (was the spec updated?)?
Yes this erratum will be fixed in later generations of TDX hardware. It only
appears on SPR and EMR (the first two generations of TDX hardware).
^ permalink raw reply [flat|nested] 144+ messages in thread
* Re: [PATCH v11 05/20] x86/virt/tdx: Add SEAMCALL infrastructure
2023-06-19 12:52 ` David Hildenbrand
@ 2023-06-20 10:37 ` Huang, Kai
2023-06-20 12:20 ` kirill.shutemov
2023-06-20 15:15 ` Dave Hansen
1 sibling, 1 reply; 144+ messages in thread
From: Huang, Kai @ 2023-06-20 10:37 UTC (permalink / raw)
To: kvm, linux-kernel, david
Cc: Hansen, Dave, Luck, Tony, bagasdotme, ak, Wysocki, Rafael J,
kirill.shutemov, Christopherson,,
Sean, Chatre, Reinette, pbonzini, tglx, linux-mm, Yamahata,
Isaku, peterz, Shahar, Sagi, imammedo, Gao, Chao, Brown, Len,
sathyanarayanan.kuppuswamy, Huang, Ying, Williams, Dan J
On Mon, 2023-06-19 at 14:52 +0200, David Hildenbrand wrote:
> On 04.06.23 16:27, Kai Huang wrote:
> > TDX introduces a new CPU mode: Secure Arbitration Mode (SEAM). This
> > mode runs only the TDX module itself or other code to load the TDX
> > module.
> >
> > The host kernel communicates with SEAM software via a new SEAMCALL
> > instruction. This is conceptually similar to a guest->host hypercall,
> > except it is made from the host to SEAM software instead. The TDX
> > module establishes a new SEAMCALL ABI which allows the host to
> > initialize the module and to manage VMs.
> >
> > Add infrastructure to make SEAMCALLs. The SEAMCALL ABI is very similar
> > to the TDCALL ABI and leverages much TDCALL infrastructure.
> >
> > SEAMCALL instruction causes #GP when TDX isn't BIOS enabled, and #UD
> > when CPU is not in VMX operation. Currently, only KVM code mucks with
> > VMX enabling, and KVM is the only user of TDX. This implementation
> > chooses to make KVM itself responsible for enabling VMX before using
> > TDX and let the rest of the kernel stay blissfully unaware of VMX.
> >
> > The current TDX_MODULE_CALL macro handles neither #GP nor #UD. The
> > kernel would hit Oops if SEAMCALL were mistakenly made w/o enabling VMX
> > first. Architecturally, there is no CPU flag to check whether the CPU
> > is in VMX operation. Also, if a BIOS were buggy, it could still report
> > valid TDX private KeyIDs when TDX actually couldn't be enabled.
> >
> > Extend the TDX_MODULE_CALL macro to handle #UD and #GP to return error
> > codes. Introduce two new TDX error codes for them respectively so the
> > caller can distinguish.
> >
> > Also add a wrapper function of SEAMCALL to convert SEAMCALL error code
> > to the kernel error code, and print out SEAMCALL error code to help the
> > user to understand what went wrong.
> >
> > Signed-off-by: Kai Huang <kai.huang@intel.com>
> > ---
>
> I agree with Dave that a buggy bios is not a good motivation for this
> patch. The real strength of this infrastructure IMHO is central error
> handling and expressive error messages. Maybe it makes some corner cases
> (reboot -f) easier to handle. That would make a better justification
> than buggy bios -- and should be spelled out in the patch description.
Agreed. Will do. Thanks!
>
> [...]
>
>
> > +/*
> > + * Wrapper of __seamcall() to convert SEAMCALL leaf function error code
> > + * to kernel error code. @seamcall_ret and @out contain the SEAMCALL
> > + * leaf function return code and the additional output respectively if
> > + * not NULL.
> > + */
> > +static int __always_unused seamcall(u64 fn, u64 rcx, u64 rdx, u64 r8, u64 r9,
> > + u64 *seamcall_ret,
> > + struct tdx_module_output *out)
> > +{
> > + int cpu, ret = 0;
> > + u64 sret;
> > +
> > + /* Need a stable CPU id for printing error message */
> > + cpu = get_cpu();
> > +
> > + sret = __seamcall(fn, rcx, rdx, r8, r9, out);
> > +
>
>
> Why not
>
> cpu = get_cpu();
> sret = __seamcall(fn, rcx, rdx, r8, r9, out);
> put_cpu();
Hmm.. I think this is also OK. The worst case is that the error message gets
printed on a remote cpu, but the message still has the correct "cpu id" printed.
I'll change to the above.
>
>
> > + /* Save SEAMCALL return code if the caller wants it */
> > + if (seamcall_ret)
> > + *seamcall_ret = sret;
> > +
> > + /* SEAMCALL was successful */
> > + if (!sret)
> > + goto out;
>
> Why not move that into the switch statement below to avoid the goto?
> If you do the put_cpu() early, you can avoid "ret" as well.
Yeah can do.
>
> switch (sret) {
> case 0:
> /* SEAMCALL was successful */
> return 0;
> case TDX_SEAMCALL_GP:
> pr_err_once("[firmware bug]: TDX is not enabled by BIOS.\n");
> return -ENODEV;
> ...
> }
>
[...]
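
A minimal sketch of the goto-free, switch-based conversion suggested above. The numeric values of TDX_SEAMCALL_GP and TDX_SEAMCALL_UD are placeholders for illustration only, and the errno choices for the #UD and default cases are assumptions; only the -ENODEV mapping for the #GP case comes from the quoted suggestion:

```c
#include <errno.h>
#include <stdint.h>

/* Placeholder values for illustration; the real codes are defined by the patch. */
#define TDX_SEAMCALL_GP	0x8000ff000000000dULL
#define TDX_SEAMCALL_UD	0x8000ff0000000006ULL

/* Convert a raw SEAMCALL return code into a kernel-style error code. */
static int seamcall_ret_to_errno(uint64_t sret)
{
	switch (sret) {
	case 0:
		/* SEAMCALL was successful */
		return 0;
	case TDX_SEAMCALL_GP:
		/* TDX is not enabled by BIOS */
		return -ENODEV;
	case TDX_SEAMCALL_UD:
		/* CPU is not in VMX operation (errno assumed) */
		return -EINVAL;
	default:
		/* any other SEAMCALL failure (errno assumed) */
		return -EIO;
	}
}
```

Because every case returns directly, neither the "out" label nor the local "ret" variable is needed once put_cpu() has been called right after the SEAMCALL.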
> > + /*
> > + * SEAMCALL caused #GP or #UD. By reaching here %eax contains
> > + * the trap number. Convert the trap number to the TDX error
> > + * code by setting TDX_SW_ERROR to the high 32-bits of %rax.
> > + *
> > + * Note cannot OR TDX_SW_ERROR directly to %rax as OR instruction
> > + * only accepts 32-bit immediate at most.
>
> Not sure if that comment is really helpful here. It's a common pattern
> for large immediates, no?
I am not sure. I guess I am no expert in x86 assembly, only a casual writer.
Hi Dave, Kirill,
Are you OK to remove it?
>
> > + */
> > + mov $TDX_SW_ERROR, %r12
> > + orq %r12, %rax
> >
> > + _ASM_EXTABLE_FAULT(1b, 2b)
> > +.Lseamcall_out:
> > .else
> > tdcall
> > .endif
>
^ permalink raw reply [flat|nested] 144+ messages in thread
* Re: [PATCH v11 06/20] x86/virt/tdx: Handle SEAMCALL running out of entropy error
2023-06-19 13:00 ` David Hildenbrand
@ 2023-06-20 10:39 ` Huang, Kai
2023-06-20 11:14 ` David Hildenbrand
0 siblings, 1 reply; 144+ messages in thread
From: Huang, Kai @ 2023-06-20 10:39 UTC (permalink / raw)
To: kvm, linux-kernel, david
Cc: Hansen, Dave, Luck, Tony, bagasdotme, ak, Wysocki, Rafael J,
kirill.shutemov, Christopherson,,
Sean, Chatre, Reinette, pbonzini, tglx, linux-mm, Yamahata,
Isaku, peterz, Shahar, Sagi, imammedo, Gao, Chao, Brown, Len,
sathyanarayanan.kuppuswamy, Huang, Ying, Williams, Dan J
> > @@ -33,12 +34,24 @@ static int __always_unused seamcall(u64 fn, u64 rcx, u64 rdx, u64 r8, u64 r9,
> > struct tdx_module_output *out)
> > {
> > int cpu, ret = 0;
> > + int retry;
> > u64 sret;
> >
> > /* Need a stable CPU id for printing error message */
> > cpu = get_cpu();
> >
> > - sret = __seamcall(fn, rcx, rdx, r8, r9, out);
> > + /*
> > + * Certain SEAMCALL leaf functions may return error due to
> > + * running out of entropy, in which case the SEAMCALL should
> > + * be retried. Handle this in SEAMCALL common function.
> > + *
> > + * Mimic the existing rdrand_long() to retry
> > + * RDRAND_RETRY_LOOPS times.
> > + */
> > + retry = RDRAND_RETRY_LOOPS;
>
> Nit: I'd just do a "int retry = RDRAND_RETRY_LOOPS" and simplify this
> comment to "Mimic rdrand_long() retry behavior."
OK will do.
But I think you are talking about replacing the second paragraph but not the
entire comment?
>
> > + do {
> > + sret = __seamcall(fn, rcx, rdx, r8, r9, out);
> > + } while (sret == TDX_RND_NO_ENTROPY && --retry);
> >
> > /* Save SEAMCALL return code if the caller wants it */
> > if (seamcall_ret)
> > diff --git a/arch/x86/virt/vmx/tdx/tdx.h b/arch/x86/virt/vmx/tdx/tdx.h
> > index 48ad1a1ba737..55dbb1b8c971 100644
> > --- a/arch/x86/virt/vmx/tdx/tdx.h
> > +++ b/arch/x86/virt/vmx/tdx/tdx.h
> > @@ -4,6 +4,23 @@
> >
> > #include <linux/types.h>
> >
> > +/*
> > + * This file contains both macros and data structures defined by the TDX
> > + * architecture and Linux defined software data structures and functions.
> > + * The two should not be mixed together for better readability. The
> > + * architectural definitions come first.
> > + */
> > +
> > +/*
> > + * TDX SEAMCALL error codes
> > + */
> > +#define TDX_RND_NO_ENTROPY 0x8000020300000000ULL
> > +
> > +/*
> > + * Do not put any hardware-defined TDX structure representations below
> > + * this comment!
> > + */
> > +
> > struct tdx_module_output;
> > u64 __seamcall(u64 fn, u64 rcx, u64 rdx, u64 r8, u64 r9,
> > struct tdx_module_output *out);
>
> In general, LGTM
>
> Reviewed-by: David Hildenbrand <david@redhat.com>
Thanks!
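
A minimal userspace sketch of the bounded retry loop discussed above. TDX_RND_NO_ENTROPY is taken from the quoted patch; the value 10 for RDRAND_RETRY_LOOPS is assumed to match the kernel's definition, and seamcall_retry() and fake_seamcall() are hypothetical helpers that stand in for the real SEAMCALL:

```c
#include <stdint.h>

#define TDX_RND_NO_ENTROPY	0x8000020300000000ULL	/* from the quoted patch */
#define RDRAND_RETRY_LOOPS	10			/* value assumed */

typedef uint64_t (*seamcall_fn)(void *ctx);

/* Mimic rdrand_long() retry behavior: retry only on "out of entropy". */
static uint64_t seamcall_retry(seamcall_fn fn, void *ctx)
{
	int retry = RDRAND_RETRY_LOOPS;
	uint64_t sret;

	do {
		sret = fn(ctx);
	} while (sret == TDX_RND_NO_ENTROPY && --retry);

	return sret;
}

/* Hypothetical fake: reports "no entropy" *ctx more times, then succeeds. */
static uint64_t fake_seamcall(void *ctx)
{
	int *remaining = ctx;

	if (*remaining > 0) {
		(*remaining)--;
		return TDX_RND_NO_ENTROPY;
	}
	return 0;
}
```

The loop makes at most RDRAND_RETRY_LOOPS calls in total, so a source that never recovers still returns TDX_RND_NO_ENTROPY to the caller instead of spinning forever.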
^ permalink raw reply [flat|nested] 144+ messages in thread
* Re: [PATCH v11 18/20] x86: Handle TDX erratum to reset TDX private memory during kexec() and reboot
2023-06-20 8:11 ` Peter Zijlstra
@ 2023-06-20 10:42 ` Huang, Kai
2023-06-20 10:56 ` Peter Zijlstra
0 siblings, 1 reply; 144+ messages in thread
From: Huang, Kai @ 2023-06-20 10:42 UTC (permalink / raw)
To: peterz
Cc: kvm, Hansen, Dave, david, bagasdotme, ak, Wysocki, Rafael J,
kirill.shutemov, Chatre, Reinette, Christopherson,,
Sean, pbonzini, tglx, Yamahata, Isaku, linux-kernel, linux-mm,
Luck, Tony, Shahar, Sagi, imammedo, Gao, Chao, Brown, Len,
sathyanarayanan.kuppuswamy, Huang, Ying, Williams, Dan J
On Tue, 2023-06-20 at 10:11 +0200, Peter Zijlstra wrote:
> On Mon, Jun 12, 2023 at 03:06:48AM +0000, Huang, Kai wrote:
>
> > + __mb();
>
> __mb() is not a valid interface to use.
Thanks for feedback!
May I ask why, for education purpose? :)
^ permalink raw reply [flat|nested] 144+ messages in thread
* Re: [PATCH v11 18/20] x86: Handle TDX erratum to reset TDX private memory during kexec() and reboot
2023-06-20 10:42 ` Huang, Kai
@ 2023-06-20 10:56 ` Peter Zijlstra
0 siblings, 0 replies; 144+ messages in thread
From: Peter Zijlstra @ 2023-06-20 10:56 UTC (permalink / raw)
To: Huang, Kai
Cc: kvm, Hansen, Dave, david, bagasdotme, ak, Wysocki, Rafael J,
kirill.shutemov, Chatre, Reinette, Christopherson,,
Sean, pbonzini, tglx, Yamahata, Isaku, linux-kernel, linux-mm,
Luck, Tony, Shahar, Sagi, imammedo, Gao, Chao, Brown, Len,
sathyanarayanan.kuppuswamy, Huang, Ying, Williams, Dan J
On Tue, Jun 20, 2023 at 10:42:32AM +0000, Huang, Kai wrote:
> On Tue, 2023-06-20 at 10:11 +0200, Peter Zijlstra wrote:
> > On Mon, Jun 12, 2023 at 03:06:48AM +0000, Huang, Kai wrote:
> >
> > > + __mb();
> >
> > __mb() is not a valid interface to use.
>
> Thanks for feedback!
>
> May I ask why, for education purpose? :)
It's the raw MFENCE wrapper, not one of the *many* documented barriers.
Also, typically you do *not* want MFENCE, MFENCE bad.
^ permalink raw reply [flat|nested] 144+ messages in thread
* Re: [PATCH v11 06/20] x86/virt/tdx: Handle SEAMCALL running out of entropy error
2023-06-20 10:39 ` Huang, Kai
@ 2023-06-20 11:14 ` David Hildenbrand
0 siblings, 0 replies; 144+ messages in thread
From: David Hildenbrand @ 2023-06-20 11:14 UTC (permalink / raw)
To: Huang, Kai, kvm, linux-kernel
Cc: Hansen, Dave, Luck, Tony, bagasdotme, ak, Wysocki, Rafael J,
kirill.shutemov, Christopherson,,
Sean, Chatre, Reinette, pbonzini, tglx, linux-mm, Yamahata,
Isaku, peterz, Shahar, Sagi, imammedo, Gao, Chao, Brown, Len,
sathyanarayanan.kuppuswamy, Huang, Ying, Williams, Dan J
On 20.06.23 12:39, Huang, Kai wrote:
>
>>> @@ -33,12 +34,24 @@ static int __always_unused seamcall(u64 fn, u64 rcx, u64 rdx, u64 r8, u64 r9,
>>> struct tdx_module_output *out)
>>> {
>>> int cpu, ret = 0;
>>> + int retry;
>>> u64 sret;
>>>
>>> /* Need a stable CPU id for printing error message */
>>> cpu = get_cpu();
>>>
>>> - sret = __seamcall(fn, rcx, rdx, r8, r9, out);
>>> + /*
>>> + * Certain SEAMCALL leaf functions may return error due to
>>> + * running out of entropy, in which case the SEAMCALL should
>>> + * be retried. Handle this in SEAMCALL common function.
>>> + *
>>> + * Mimic the existing rdrand_long() to retry
>>> + * RDRAND_RETRY_LOOPS times.
>>> + */
>>> + retry = RDRAND_RETRY_LOOPS;
>>
>> Nit: I'd just do a "int retry = RDRAND_RETRY_LOOPS" and simplify this
>> comment to "Mimic rdrand_long() retry behavior."
>
> OK will do.
>
> But I think you are talking about replacing the second paragraph but not the
> entire comment?
>
Yes.
--
Cheers,
David / dhildenb
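
The retry pattern discussed in this sub-thread can be sketched in plain C.
This is a minimal userspace sketch, not the patch itself: fake_seamcall() and
failures_left are hypothetical stand-ins for __seamcall(), and the two
constants are assumed values that should be checked against the kernel
headers before relying on them.

```c
#include <stdint.h>

/* Assumed values for illustration only; verify against the kernel headers. */
#define RDRAND_RETRY_LOOPS	10
#define TDX_RND_NO_ENTROPY	0x8000020300000000ULL

/* Hypothetical stand-in for __seamcall(): fails this many times, then succeeds. */
static int failures_left;

static uint64_t fake_seamcall(void)
{
	if (failures_left > 0) {
		failures_left--;
		return TDX_RND_NO_ENTROPY;
	}
	return 0;
}

/*
 * Retry only while the callee reports "out of entropy", and make at most
 * RDRAND_RETRY_LOOPS attempts in total. Any other status is returned
 * to the caller unchanged.
 */
static uint64_t seamcall_with_retry(void)
{
	int retry = RDRAND_RETRY_LOOPS;
	uint64_t sret;

	do {
		sret = fake_seamcall();
	} while (sret == TDX_RND_NO_ENTROPY && --retry);

	return sret;
}
```

Initializing retry at its declaration, as suggested in the review, keeps the
loop bound next to the variable and lets the comment shrink to one line.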
^ permalink raw reply [flat|nested] 144+ messages in thread
* Re: [PATCH v11 05/20] x86/virt/tdx: Add SEAMCALL infrastructure
2023-06-20 10:37 ` Huang, Kai
@ 2023-06-20 12:20 ` kirill.shutemov
2023-06-20 12:39 ` David Hildenbrand
0 siblings, 1 reply; 144+ messages in thread
From: kirill.shutemov @ 2023-06-20 12:20 UTC (permalink / raw)
To: Huang, Kai
Cc: kvm, linux-kernel, david, Hansen, Dave, Luck, Tony, bagasdotme,
ak, Wysocki, Rafael J, Christopherson,,
Sean, Chatre, Reinette, pbonzini, tglx, linux-mm, Yamahata,
Isaku, peterz, Shahar, Sagi, imammedo, Gao, Chao, Brown, Len,
sathyanarayanan.kuppuswamy, Huang, Ying, Williams, Dan J
On Tue, Jun 20, 2023 at 10:37:16AM +0000, Huang, Kai wrote:
> > > + /*
> > > + * SEAMCALL caused #GP or #UD. By reaching here %eax contains
> > > + * the trap number. Convert the trap number to the TDX error
> > > + * code by setting TDX_SW_ERROR to the high 32-bits of %rax.
> > > + *
> > > + * Note cannot OR TDX_SW_ERROR directly to %rax as OR instruction
> > > + * only accepts 32-bit immediate at most.
> >
> > Not sure if that comment is really helpful here. It's a common pattern
> > for large immediates, no?
>
> I am not sure. I guess I am not expert of x86 assembly but only casual writer.
>
> Hi Dave, Kirill,
>
> Are you OK to remove it?
I would rather keep it. I wanted to ask why separate MOV is needed here,
before I read the comment. Also size of $TDX_SW_ERROR is not visible here,
so it contributes to possible confusion without the comment.
--
Kiryl Shutsemau / Kirill A. Shutemov
^ permalink raw reply [flat|nested] 144+ messages in thread
* Re: [PATCH v11 05/20] x86/virt/tdx: Add SEAMCALL infrastructure
2023-06-20 12:20 ` kirill.shutemov
@ 2023-06-20 12:39 ` David Hildenbrand
0 siblings, 0 replies; 144+ messages in thread
From: David Hildenbrand @ 2023-06-20 12:39 UTC (permalink / raw)
To: kirill.shutemov, Huang, Kai
Cc: kvm, linux-kernel, Hansen, Dave, Luck, Tony, bagasdotme, ak,
Wysocki, Rafael J, Christopherson,,
Sean, Chatre, Reinette, pbonzini, tglx, linux-mm, Yamahata,
Isaku, peterz, Shahar, Sagi, imammedo, Gao, Chao, Brown, Len,
sathyanarayanan.kuppuswamy, Huang, Ying, Williams, Dan J
On 20.06.23 14:20, kirill.shutemov@linux.intel.com wrote:
> On Tue, Jun 20, 2023 at 10:37:16AM +0000, Huang, Kai wrote:
>>>> + /*
>>>> + * SEAMCALL caused #GP or #UD. By reaching here %eax contains
>>>> + * the trap number. Convert the trap number to the TDX error
>>>> + * code by setting TDX_SW_ERROR to the high 32-bits of %rax.
>>>> + *
>>>> + * Note cannot OR TDX_SW_ERROR directly to %rax as OR instruction
>>>> + * only accepts 32-bit immediate at most.
>>>
>>> Not sure if that comment is really helpful here. It's a common pattern
>>> for large immediates, no?
>>
>> I am not sure. I guess I am not expert of x86 assembly but only casual writer.
>>
>> Hi Dave, Kirill,
>>
>> Are you OK to remove it?
>
> I would rather keep it. I wanted to ask why separate MOV is needed here,
> before I read the comment. Also size of $TDX_SW_ERROR is not visible here,
> so it contributes to possible confusion without the comment.
>
Fine with me, but I'd assume that the assembler will simply complain in
case we'd try to use a large immediate.
--
Cheers,
David / dhildenb
^ permalink raw reply [flat|nested] 144+ messages in thread
* Re: [PATCH v11 05/20] x86/virt/tdx: Add SEAMCALL infrastructure
2023-06-19 12:52 ` David Hildenbrand
2023-06-20 10:37 ` Huang, Kai
@ 2023-06-20 15:15 ` Dave Hansen
1 sibling, 0 replies; 144+ messages in thread
From: Dave Hansen @ 2023-06-20 15:15 UTC (permalink / raw)
To: David Hildenbrand, Kai Huang, linux-kernel, kvm
Cc: linux-mm, kirill.shutemov, tony.luck, peterz, tglx, seanjc,
pbonzini, dan.j.williams, rafael.j.wysocki, ying.huang,
reinette.chatre, len.brown, ak, isaku.yamahata, chao.gao,
sathyanarayanan.kuppuswamy, bagasdotme, sagis, imammedo
On 6/19/23 05:52, David Hildenbrand wrote:
>> + /*
>> + * SEAMCALL caused #GP or #UD. By reaching here %eax contains
>> + * the trap number. Convert the trap number to the TDX error
>> + * code by setting TDX_SW_ERROR to the high 32-bits of %rax.
>> + *
>> + * Note cannot OR TDX_SW_ERROR directly to %rax as OR instruction
>> + * only accepts 32-bit immediate at most.
>
> Not sure if that comment is really helpful here. It's a common pattern
> for large immediates, no?
It's a question of whether you write the comments for folks that read
x86 assembly all the time or not.
I think the comment is helpful.
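
For readers who do not write x86 assembly often, the point of the disputed
comment can be sketched in C. On x86-64, OR with a 64-bit operand accepts at
most a 32-bit immediate that is sign-extended, so a constant living in the
high 32 bits has to be loaded into a register with a separate MOV first. The
constant values below are assumptions meant to mirror the kernel's
definitions, and both helper functions are hypothetical names introduced for
illustration.

```c
#include <stdint.h>

/* Assumed to mirror the kernel's definitions; treat the values as illustrative. */
#define TDX_ERROR	(1ULL << 63)
#define TDX_SW_ERROR	(TDX_ERROR | (0xFFULL << 40))	/* bit 63 plus bits 47:40 */
#define X86_TRAP_UD	6
#define X86_TRAP_GP	13

/* Combine the trap number (low bits) with the software error marker (high bits). */
static uint64_t trap_to_tdx_error(uint32_t trapnr)
{
	return TDX_SW_ERROR | trapnr;
}

/*
 * True if v can be encoded as a 32-bit immediate that is sign-extended to
 * 64 bits, which is the largest immediate a 64-bit OR accepts.
 */
static int fits_in_sign_extended_imm32(uint64_t v)
{
	return v <= 0x7FFFFFFFULL || v >= 0xFFFFFFFF80000000ULL;
}
```

With these assumed values, fits_in_sign_extended_imm32(TDX_SW_ERROR) is
false, which is why the assembly needs the extra MOV, while a bare trap
number such as X86_TRAP_GP fits easily.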
^ permalink raw reply [flat|nested] 144+ messages in thread
* Re: [PATCH v11 04/20] x86/cpu: Detect TDX partial write machine check erratum
2023-06-19 12:21 ` David Hildenbrand
2023-06-20 10:31 ` Huang, Kai
@ 2023-06-20 15:39 ` Dave Hansen
2023-06-20 16:03 ` David Hildenbrand
1 sibling, 1 reply; 144+ messages in thread
From: Dave Hansen @ 2023-06-20 15:39 UTC (permalink / raw)
To: David Hildenbrand, Kai Huang, linux-kernel, kvm
Cc: linux-mm, kirill.shutemov, tony.luck, peterz, tglx, seanjc,
pbonzini, dan.j.williams, rafael.j.wysocki, ying.huang,
reinette.chatre, len.brown, ak, isaku.yamahata, chao.gao,
sathyanarayanan.kuppuswamy, bagasdotme, sagis, imammedo
On 6/19/23 05:21, David Hildenbrand wrote:
> So, ordinary writes to TD private memory are not a problem? I thought
> one motivation for the unmapped-guest-memory discussion was to prevent
> host (userspace) writes to such memory because it would trigger a MC and
> eventually crash the host.
Those are two different problems.
Problem #1 (this patch): The host encounters poison when going about its
normal business accessing normal memory. This happens when something in
the host accidentally clobbers some TDX memory and *then* reads it.
Only occurs with partial writes.
Problem #2 (addressed with unmapping): Host *userspace* intentionally
and maliciously clobbers some TDX memory and then the TDX module or a
TDX guest can't run because the memory integrity checks (checksum or TD
bit) fail. This can also take the system down because #MC's are nasty.
Host userspace unmapping doesn't prevent problem #1 because it's the
kernel who screwed up with the _kernel_ mapping.
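
Problem #1 can be pictured with a toy model in C. This is only a sketch of
the behaviour described in the patch description, not of real hardware: the
struct, its fields, and both functions are hypothetical, and txn_len stands
for the size of the write transaction that reaches the memory controller,
not the size of the store instruction.

```c
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

#define CACHELINE_SIZE	64

/* Hypothetical model of one cacheline as seen by the memory controller. */
struct toy_line {
	unsigned char data[CACHELINE_SIZE];
	bool tdx_private;	/* line belongs to TDX private memory */
	bool poisoned;		/* set silently by the erratum */
};

/*
 * Model a write transaction of txn_len bytes landing at the memory
 * controller. Under the erratum, a transaction smaller than a full
 * cacheline to a TDX private line silently poisons it.
 */
static void toy_write(struct toy_line *line, size_t off,
		      const void *src, size_t txn_len)
{
	if (off > CACHELINE_SIZE || txn_len > CACHELINE_SIZE - off)
		return;		/* out of bounds: ignore in this toy model */

	memcpy(line->data + off, src, txn_len);
	if (line->tdx_private && txn_len < CACHELINE_SIZE)
		line->poisoned = true;
}

/* In this model, reading a poisoned line is what raises the machine check. */
static bool toy_read_raises_mc(const struct toy_line *line)
{
	return line->poisoned;
}
```

The model shows why the damage is only noticed later: the partial write
itself returns normally, and the machine check is raised by whichever read
touches the line next.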
^ permalink raw reply [flat|nested] 144+ messages in thread
* Re: [PATCH v11 04/20] x86/cpu: Detect TDX partial write machine check erratum
2023-06-19 11:37 ` Huang, Kai
@ 2023-06-20 15:44 ` Dave Hansen
2023-06-20 23:11 ` Huang, Kai
0 siblings, 1 reply; 144+ messages in thread
From: Dave Hansen @ 2023-06-20 15:44 UTC (permalink / raw)
To: Huang, Kai, kvm, linux-kernel
Cc: Luck, Tony, david, bagasdotme, ak, Wysocki, Rafael J,
kirill.shutemov, Chatre, Reinette, Christopherson,,
Sean, pbonzini, tglx, Yamahata, Isaku, linux-mm, Shahar, Sagi,
peterz, imammedo, Gao, Chao, Brown, Len,
sathyanarayanan.kuppuswamy, Huang, Ying, Williams, Dan J
On 6/19/23 04:37, Huang, Kai wrote:
> Please let me know for any comments?
Can you please go look at where most of the X86_BUG_* bits are set? Can
you set yours near one of the existing sites instead of plopping a new
one into the code?
^ permalink raw reply [flat|nested] 144+ messages in thread
* Re: [PATCH v11 04/20] x86/cpu: Detect TDX partial write machine check erratum
2023-06-20 15:39 ` Dave Hansen
@ 2023-06-20 16:03 ` David Hildenbrand
2023-06-20 16:21 ` Dave Hansen
0 siblings, 1 reply; 144+ messages in thread
From: David Hildenbrand @ 2023-06-20 16:03 UTC (permalink / raw)
To: Dave Hansen, Kai Huang, linux-kernel, kvm
Cc: linux-mm, kirill.shutemov, tony.luck, peterz, tglx, seanjc,
pbonzini, dan.j.williams, rafael.j.wysocki, ying.huang,
reinette.chatre, len.brown, ak, isaku.yamahata, chao.gao,
sathyanarayanan.kuppuswamy, bagasdotme, sagis, imammedo
On 20.06.23 17:39, Dave Hansen wrote:
> On 6/19/23 05:21, David Hildenbrand wrote:
>> So, ordinary writes to TD private memory are not a problem? I thought
>> one motivation for the unmapped-guest-memory discussion was to prevent
>> host (userspace) writes to such memory because it would trigger a MC and
>> eventually crash the host.
>
> Those are two different problems.
>
> Problem #1 (this patch): The host encounters poison when going about its
> normal business accessing normal memory. This happens when something in
> the host accidentally clobbers some TDX memory and *then* reads it.
> Only occurs with partial writes.
>
> Problem #2 (addressed with unmapping): Host *userspace* intentionally
> and maliciously clobbers some TDX memory and then the TDX module or a
> TDX guest can't run because the memory integrity checks (checksum or TD
> bit) fail. This can also take the system down because #MC's are nasty.
>
> Host userspace unmapping doesn't prevent problem #1 because it's the
> kernel who screwed up with the _kernel_ mapping.
Ahh, thanks for verifying. I was hoping that problem #2 would get fixed
in HW as well (and treated like a BUG).
Because problem #2 also sounds like something that directly violates the
first paragraph of this patch description "violations of
this integrity protection are supposed to only affect TDX operations and
are never supposed to affect the host kernel itself."
So I would expect the TDX guest to fail hard, but not other TDX guests
(or the host kernel).
--
Cheers,
David / dhildenb
^ permalink raw reply [flat|nested] 144+ messages in thread
* Re: [PATCH v11 04/20] x86/cpu: Detect TDX partial write machine check erratum
2023-06-20 16:03 ` David Hildenbrand
@ 2023-06-20 16:21 ` Dave Hansen
0 siblings, 0 replies; 144+ messages in thread
From: Dave Hansen @ 2023-06-20 16:21 UTC (permalink / raw)
To: David Hildenbrand, Kai Huang, linux-kernel, kvm
Cc: linux-mm, kirill.shutemov, tony.luck, peterz, tglx, seanjc,
pbonzini, dan.j.williams, rafael.j.wysocki, ying.huang,
reinette.chatre, len.brown, ak, isaku.yamahata, chao.gao,
sathyanarayanan.kuppuswamy, bagasdotme, sagis, imammedo, Raj,
Ashok
On 6/20/23 09:03, David Hildenbrand wrote:
> On 20.06.23 17:39, Dave Hansen wrote:
>> On 6/19/23 05:21, David Hildenbrand wrote:
>>> So, ordinary writes to TD private memory are not a problem? I thought
>>> one motivation for the unmapped-guest-memory discussion was to prevent
>>> host (userspace) writes to such memory because it would trigger a MC and
>>> eventually crash the host.
>>
>> Those are two different problems.
>>
>> Problem #1 (this patch): The host encounters poison when going about its
>> normal business accessing normal memory. This happens when something in
>> the host accidentally clobbers some TDX memory and *then* reads it.
>> Only occurs with partial writes.
>>
>> Problem #2 (addressed with unmapping): Host *userspace* intentionally
>> and maliciously clobbers some TDX memory and then the TDX module or a
>> TDX guest can't run because the memory integrity checks (checksum or TD
>> bit) fail. This can also take the system down because #MC's are nasty.
>>
>> Host userspace unmapping doesn't prevent problem #1 because it's the
>> kernel who screwed up with the _kernel_ mapping.
>
> Ahh, thanks for verifying. I was hoping that problem #2 would get fixed
> in HW as well (and treated like a BUG).
No, it's really working as designed.
#1 _can_ be fixed because the hardware can just choose to let the host
run merrily along corrupting TDX data and blissfully unaware of the
carnage until TDX stumbles on the mess. Blissful ignorance really is a
useful feature here. It means, for instance, that if the kernel screws
up, it can still blissfully kexec(), reboot, boot a new kernel, or dump
to the console without fear of #MC.
#2 is much harder because the TDX data is destroyed and yet the TDX side
still wants to run. The SEV folks chose page faults on write to stop
SEV from running and the TDX folks chose #MC on reads as the mechanism.
All of the nastiness on the TDX side is (IMNHO) really a consequence of
that decision to use machine checks.
(Aside: I'm not specifically crapping on the TDX CPU designers here. I
don't particularly like the SEV approach either. But this mess is a
result of the TDX design choices. There are other messes in other
patch series from SEV.)
> Because problem #2 also sounds like something that directly violates the
> first paragraph of this patch description "violations of
> this integrity protection are supposed to only affect TDX operations and
> are never supposed to affect the host kernel itself."
>
> So I would expect the TDX guest to fail hard, but not other TDX guests
> (or the host kernel).
This is more fallout from the #MC design choice.
Let's use page faults as an example since our SEV friends are using
them. *ANY* instruction that reads memory can page fault, have the
kernel fix up the fault, and continue merrily along its way.
#MC is fundamentally different. The exceptions can be declared to be
unrecoverable. The CPU says, "whoopsie, I managed to deliver this #MC,
but it would be too hard for me so I can't continue." These "too hard"
scenarios are shrinking over time, but they do exist. They're fatal.
^ permalink raw reply [flat|nested] 144+ messages in thread
* Re: [PATCH v11 04/20] x86/cpu: Detect TDX partial write machine check erratum
2023-06-20 15:44 ` Dave Hansen
@ 2023-06-20 23:11 ` Huang, Kai
0 siblings, 0 replies; 144+ messages in thread
From: Huang, Kai @ 2023-06-20 23:11 UTC (permalink / raw)
To: kvm, Hansen, Dave, linux-kernel
Cc: Luck, Tony, david, bagasdotme, ak, Wysocki, Rafael J,
kirill.shutemov, Chatre, Reinette, Christopherson,,
Sean, pbonzini, tglx, Yamahata, Isaku, linux-mm, peterz, Shahar,
Sagi, imammedo, Gao, Chao, Brown, Len,
sathyanarayanan.kuppuswamy, Huang, Ying, Williams, Dan J
On Tue, 2023-06-20 at 08:44 -0700, Dave Hansen wrote:
> On 6/19/23 04:37, Huang, Kai wrote:
> > Please let me know for any comments?
>
> Can you please go look at where most of the X86_BUG_* bits are set? Can
> you set yours near one of the existing sites instead of plopping a new
> one into the code?
>
Sure I'll try again and sync with Kirill first. Thanks!
^ permalink raw reply [flat|nested] 144+ messages in thread
* Re: [PATCH v11 18/20] x86: Handle TDX erratum to reset TDX private memory during kexec() and reboot
2023-06-20 1:06 ` Dave Hansen
2023-06-20 7:58 ` Peter Zijlstra
@ 2023-06-25 15:30 ` Huang, Kai
2023-06-25 23:26 ` Huang, Kai
1 sibling, 1 reply; 144+ messages in thread
From: Huang, Kai @ 2023-06-25 15:30 UTC (permalink / raw)
To: kirill.shutemov, Hansen, Dave
Cc: kvm, Luck, Tony, david, bagasdotme, ak, Wysocki, Rafael J,
linux-kernel, Chatre, Reinette, Christopherson,,
Sean, pbonzini, tglx, linux-mm, Yamahata, Isaku, peterz, Shahar,
Sagi, imammedo, Gao, Chao, Brown, Len,
sathyanarayanan.kuppuswamy, Huang, Ying, Williams, Dan J
On Mon, 2023-06-19 at 18:06 -0700, Dave Hansen wrote:
> On 6/19/23 17:56, Huang, Kai wrote:
> > Any comments to below?
>
> Nothing that I haven't already said in this thread:
>
> > Just use a normal old atomic_t or set_bit()/test_bit(). They have
> > built-in memory barriers and are less likely to get botched.
>
> I kinda made a point of literally suggesting "atomic_t or
> set_bit()/test_bit()". I even told you why: "built-in memory barriers".
>
> Guess what READ/WRITE_ONCE() *don't* have. Memory barriers.
>
Hi Dave,
Sorry to bring this up again. I thought more on this topic, and I think using
atomic_t is only necessary if we add it right after setting up tdmr->pamt_* in
tdmr_set_up_pamt(), because there we need both compiler barrier and CPU memory
barrier to make sure memory order (as Kirill commented in the first reply).
However, if we add a new variable like below ...
+static bool tdx_private_mem_begin;
+
/*
* Wrapper of __seamcall() to convert SEAMCALL leaf function error code
* to kernel error code. @seamcall_ret and @out contain the SEAMCALL
@@ -1123,6 +1125,8 @@ static int init_tdx_module(void)
*/
wbinvd_on_all_cpus();
+ tdx_private_mem_begin = true;
... then we don't need any more explicit barrier, because: 1) it's not possible
for compiler to optimize the order between setting tdmr->pamt_* and
tdx_private_mem_begin; 2) no CPU memory barrier is needed as WBINVD is a
serializing instruction so the wbinvd_on_all_cpus() above has already implied
memory barrier.
Does this make sense?
^ permalink raw reply [flat|nested] 144+ messages in thread
* Re: [PATCH v11 12/20] x86/virt/tdx: Allocate and set up PAMTs for TDMRs
2023-06-08 23:24 ` [PATCH v11 12/20] x86/virt/tdx: Allocate and set up PAMTs for TDMRs kirill.shutemov
2023-06-08 23:43 ` Dave Hansen
@ 2023-06-25 15:38 ` Huang, Kai
1 sibling, 0 replies; 144+ messages in thread
From: Huang, Kai @ 2023-06-25 15:38 UTC (permalink / raw)
To: kirill.shutemov
Cc: kvm, Hansen, Dave, david, bagasdotme, ak, Wysocki, Rafael J,
linux-kernel, Chatre, Reinette, Christopherson,,
Sean, pbonzini, tglx, linux-mm, Yamahata, Isaku, Luck, Tony,
peterz, Shahar, Sagi, imammedo, Gao, Chao, Brown, Len,
sathyanarayanan.kuppuswamy, Huang, Ying, Williams, Dan J
On Fri, 2023-06-09 at 02:24 +0300, kirill.shutemov@linux.intel.com wrote:
> On Mon, Jun 05, 2023 at 02:27:25AM +1200, Kai Huang wrote:
> > diff --git a/arch/x86/virt/vmx/tdx/tdx.c b/arch/x86/virt/vmx/tdx/tdx.c
> > index fa9fa8bc581a..5f0499ba5d67 100644
> > --- a/arch/x86/virt/vmx/tdx/tdx.c
> > +++ b/arch/x86/virt/vmx/tdx/tdx.c
> > @@ -265,7 +265,7 @@ static int tdx_get_sysinfo(struct tdsysinfo_struct *sysinfo,
> > * overlap.
> > */
> > static int add_tdx_memblock(struct list_head *tmb_list, unsigned long start_pfn,
> > - unsigned long end_pfn)
> > + unsigned long end_pfn, int nid)
> > {
> > struct tdx_memblock *tmb;
> >
> > @@ -276,6 +276,7 @@ static int add_tdx_memblock(struct list_head *tmb_list, unsigned long start_pfn,
> > INIT_LIST_HEAD(&tmb->list);
> > tmb->start_pfn = start_pfn;
> > tmb->end_pfn = end_pfn;
> > + tmb->nid = nid;
> >
> > /* @tmb_list is protected by mem_hotplug_lock */
> > list_add_tail(&tmb->list, tmb_list);
> > @@ -303,9 +304,9 @@ static void free_tdx_memlist(struct list_head *tmb_list)
> > static int build_tdx_memlist(struct list_head *tmb_list)
> > {
> > unsigned long start_pfn, end_pfn;
> > - int i, ret;
> > + int i, nid, ret;
> >
> > - for_each_mem_pfn_range(i, MAX_NUMNODES, &start_pfn, &end_pfn, NULL) {
> > + for_each_mem_pfn_range(i, MAX_NUMNODES, &start_pfn, &end_pfn, &nid) {
> > /*
> > * The first 1MB is not reported as TDX convertible memory.
> > * Although the first 1MB is always reserved and won't end up
> > @@ -321,7 +322,7 @@ static int build_tdx_memlist(struct list_head *tmb_list)
> > * memblock has already guaranteed they are in address
> > * ascending order and don't overlap.
> > */
> > - ret = add_tdx_memblock(tmb_list, start_pfn, end_pfn);
> > + ret = add_tdx_memblock(tmb_list, start_pfn, end_pfn, nid);
> > if (ret)
> > goto err;
> > }
>
>
> These three hunks and change to struct tdx_memblock looks unrelated.
> Why not fold this to 09/20?
Sorry, I missed replying to this.
The @nid is used to try to allocate the PAMT from the local node. It is only used
in this patch. Originally (in v7) I had it in patch 09, but Dave suggested moving
it to this patch (see the first comment in the link below):
https://lore.kernel.org/lkml/8e6803f5-bec6-843d-f3c4-75006ffd0d2f@intel.com/
> > +
> > +#define TDX_PS_NR (TDX_PS_1G + 1)
>
> This should be next to the rest TDX_PS_*.
>
Done.
^ permalink raw reply [flat|nested] 144+ messages in thread
* Re: [PATCH v11 18/20] x86: Handle TDX erratum to reset TDX private memory during kexec() and reboot
2023-06-25 15:30 ` Huang, Kai
@ 2023-06-25 23:26 ` Huang, Kai
0 siblings, 0 replies; 144+ messages in thread
From: Huang, Kai @ 2023-06-25 23:26 UTC (permalink / raw)
To: kirill.shutemov, Hansen, Dave
Cc: kvm, Luck, Tony, david, bagasdotme, ak, Wysocki, Rafael J,
linux-kernel, Chatre, Reinette, Christopherson,,
Sean, pbonzini, tglx, linux-mm, Yamahata, Isaku, Shahar, Sagi,
peterz, imammedo, Gao, Chao, Brown, Len,
sathyanarayanan.kuppuswamy, Huang, Ying, Williams, Dan J
On Sun, 2023-06-25 at 15:30 +0000, Huang, Kai wrote:
> On Mon, 2023-06-19 at 18:06 -0700, Dave Hansen wrote:
> > On 6/19/23 17:56, Huang, Kai wrote:
> > > Any comments to below?
> >
> > Nothing that I haven't already said in this thread:
> >
> > > Just use a normal old atomic_t or set_bit()/test_bit(). They have
> > > > built-in memory barriers and are less likely to get botched.
> >
> > I kinda made a point of literally suggesting "atomic_t or
> > set_bit()/test_bit()". I even told you why: "built-in memory barriers".
> >
> > Guess what READ/WRITE_ONCE() *don't* have. Memory barriers.
> >
>
> Hi Dave,
>
> Sorry to bring this up again. I thought more on this topic, and I think using
> atomic_t is only necessary if we add it right after setting up tdmr->pamt_* in
> tdmr_set_up_pamt(), because there we need both compiler barrier and CPU memory
> barrier to make sure memory order (as Kirill commented in the first reply).
>
> However, if we add a new variable like below ...
>
> +static bool tdx_private_mem_begin;
> +
> /*
> * Wrapper of __seamcall() to convert SEAMCALL leaf function error code
> * to kernel error code. @seamcall_ret and @out contain the SEAMCALL
> @@ -1123,6 +1125,8 @@ static int init_tdx_module(void)
> */
> wbinvd_on_all_cpus();
>
> + tdx_private_mem_begin = true;
>
>
> ... then we don't need any more explicit barrier, because: 1) it's not possible
> for compiler to optimize the order between setting tdmr->pamt_* and
> tdx_private_mem_begin; 2) no CPU memory barrier is needed as WBINVD is a
> serializing instruction so the wbinvd_on_all_cpus() above has already implied
> memory barrier.
>
> Does this make sense?
Sorry, please ignore this. I missed a corner case: kexec() can happen
when something goes wrong during module initialization, while PAMTs/TDMRs are
being freed. We still need an explicit memory barrier for this case. I will use
atomic_t as suggested. Thanks!
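
The ordering requirement behind this conclusion can be sketched as a
userspace analogue using C11 atomics. This is not kernel code and not the
patch: struct fake_tdmr, publish_pamt() and read_pamt() are hypothetical
names, and the release/acquire pair here only illustrates the general
"fill in the data, then publish a flag" pattern that the thread wants a
documented primitive for.

```c
#include <stdatomic.h>
#include <stdbool.h>

/* Hypothetical stand-in for the PAMT fields of a TDMR. */
struct fake_tdmr {
	unsigned long pamt_base;
	unsigned long pamt_size;
};

static struct fake_tdmr tdmr;
static atomic_bool tdx_private_mem_begin;

/*
 * Writer: fill in the data first, then publish the flag. The release store
 * keeps the two plain stores above it from being reordered after the flag.
 */
static void publish_pamt(unsigned long base, unsigned long size)
{
	tdmr.pamt_base = base;
	tdmr.pamt_size = size;
	atomic_store_explicit(&tdx_private_mem_begin, true,
			      memory_order_release);
}

/*
 * Reader (think of the kexec path): only trust the data once the flag has
 * been observed. The acquire load pairs with the release store above.
 */
static bool read_pamt(unsigned long *base, unsigned long *size)
{
	if (!atomic_load_explicit(&tdx_private_mem_begin,
				  memory_order_acquire))
		return false;

	*base = tdmr.pamt_base;
	*size = tdmr.pamt_size;
	return true;
}
```

A plain bool written after the data gives neither guarantee by itself, which
is the gap the corner case above falls into when the reader can run
concurrently with setup or teardown.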
^ permalink raw reply [flat|nested] 144+ messages in thread
end of thread, other threads:[~2023-06-25 23:26 UTC | newest]
Thread overview: 144+ messages
[not found] <cover.1685887183.git.kai.huang@intel.com>
[not found] ` <af4e428ab1245e9441031438e606c14472daf927.1685887183.git.kai.huang@intel.com>
[not found] ` <a2da8af2-41a9-a0cf-dbe9-7f0a14bf05fe@linux.intel.com>
2023-06-06 22:58 ` [PATCH v11 02/20] x86/virt/tdx: Detect TDX during kernel boot Huang, Kai
2023-06-06 23:44 ` Isaku Yamahata
2023-06-19 12:12 ` David Hildenbrand
2023-06-19 23:58 ` Huang, Kai
[not found] ` <ec640452a4385d61bec97f8b761ed1ff38898504.1685887183.git.kai.huang@intel.com>
2023-06-06 23:55 ` [PATCH v11 05/20] x86/virt/tdx: Add SEAMCALL infrastructure Isaku Yamahata
2023-06-07 14:24 ` Dave Hansen
2023-06-07 18:53 ` Isaku Yamahata
2023-06-07 19:27 ` Dave Hansen
2023-06-07 19:47 ` Isaku Yamahata
2023-06-07 20:08 ` Sean Christopherson
2023-06-07 20:22 ` Dave Hansen
2023-06-08 0:51 ` Huang, Kai
2023-06-08 13:50 ` Dave Hansen
2023-06-07 22:56 ` Huang, Kai
2023-06-08 14:05 ` Dave Hansen
2023-06-19 12:52 ` David Hildenbrand
2023-06-20 10:37 ` Huang, Kai
2023-06-20 12:20 ` kirill.shutemov
2023-06-20 12:39 ` David Hildenbrand
2023-06-20 15:15 ` Dave Hansen
[not found] ` <86f2a8814240f4bbe850f6a09fc9d0b934979d1b.1685887183.git.kai.huang@intel.com>
[not found] ` <20230606123821.exit7gyxs42dxotz@box.shutemov.name>
2023-06-06 22:58 ` [PATCH v11 04/20] x86/cpu: Detect TDX partial write machine check erratum Huang, Kai
2023-06-07 15:06 ` kirill.shutemov
2023-06-07 14:15 ` Dave Hansen
2023-06-07 22:43 ` Huang, Kai
2023-06-19 11:37 ` Huang, Kai
2023-06-20 15:44 ` Dave Hansen
2023-06-20 23:11 ` Huang, Kai
2023-06-19 12:21 ` David Hildenbrand
2023-06-20 10:31 ` Huang, Kai
2023-06-20 15:39 ` Dave Hansen
2023-06-20 16:03 ` David Hildenbrand
2023-06-20 16:21 ` Dave Hansen
[not found] ` <21b3a45cb73b4e1917c1eba75b7769781a15aa14.1685887183.git.kai.huang@intel.com>
2023-06-07 15:22 ` [PATCH v11 07/20] x86/virt/tdx: Add skeleton to enable TDX on demand Dave Hansen
2023-06-08 2:10 ` Huang, Kai
2023-06-08 13:43 ` Dave Hansen
2023-06-12 11:21 ` Huang, Kai
2023-06-19 13:16 ` David Hildenbrand
2023-06-19 23:28 ` Huang, Kai
[not found] ` <50386eddbb8046b0b222d385e56e8115ed566526.1685887183.git.kai.huang@intel.com>
2023-06-07 15:25 ` [PATCH v11 08/20] x86/virt/tdx: Get information about TDX module and TDX-capable memory Dave Hansen
2023-06-08 0:27 ` kirill.shutemov
2023-06-08 2:40 ` Huang, Kai
2023-06-08 11:41 ` kirill.shutemov
2023-06-08 13:13 ` Dave Hansen
2023-06-12 2:00 ` Huang, Kai
2023-06-08 23:29 ` Isaku Yamahata
2023-06-08 23:54 ` kirill.shutemov
2023-06-09 1:33 ` Isaku Yamahata
2023-06-09 10:02 ` kirill.shutemov
2023-06-12 2:00 ` Huang, Kai
2023-06-19 13:29 ` David Hildenbrand
2023-06-19 23:51 ` Huang, Kai
[not found] ` <927ec9871721d2a50f1aba7d1cf7c3be50e4f49b.1685887183.git.kai.huang@intel.com>
2023-06-07 16:05 ` [PATCH v11 11/20] x86/virt/tdx: Fill out TDMRs to cover all TDX memory regions Dave Hansen
2023-06-08 10:48 ` Huang, Kai
2023-06-08 13:11 ` Dave Hansen
2023-06-12 2:33 ` Huang, Kai
2023-06-12 14:33 ` kirill.shutemov
2023-06-12 22:10 ` Huang, Kai
2023-06-13 10:18 ` kirill.shutemov
2023-06-13 23:19 ` Huang, Kai
2023-06-08 23:02 ` kirill.shutemov
2023-06-12 2:25 ` Huang, Kai
2023-06-09 4:01 ` Sathyanarayanan Kuppuswamy
2023-06-12 2:28 ` Huang, Kai
2023-06-14 12:31 ` Nikolay Borisov
2023-06-14 22:45 ` Huang, Kai
[not found] ` <cee2f2664aac3c5314896c6d14cba50f2617c0e5.1685887183.git.kai.huang@intel.com>
2023-06-08 0:08 ` [PATCH v11 03/20] x86/virt/tdx: Make INTEL_TDX_HOST depend on X86_X2APIC kirill.shutemov
[not found] ` <9b3582c9f3a81ae68b32d9997fcd20baecb63b9b.1685887183.git.kai.huang@intel.com>
2023-06-07 8:19 ` [PATCH v11 06/20] x86/virt/tdx: Handle SEAMCALL running out of entropy error Isaku Yamahata
2023-06-07 15:08 ` Dave Hansen
2023-06-07 23:36 ` Huang, Kai
2023-06-08 0:29 ` Dave Hansen
2023-06-08 0:08 ` kirill.shutemov
2023-06-09 14:42 ` Nikolay Borisov
2023-06-12 11:04 ` Huang, Kai
2023-06-19 13:00 ` David Hildenbrand
2023-06-20 10:39 ` Huang, Kai
2023-06-20 11:14 ` David Hildenbrand
2023-06-08 21:03 ` [PATCH v11 00/20] TDX host kernel support Dan Williams
2023-06-12 10:56 ` Huang, Kai
[not found] ` <468533166590ff5ed11730350c4af8cdb0b99165.1685887183.git.kai.huang@intel.com>
2023-06-07 15:48 ` [PATCH v11 09/20] x86/virt/tdx: Use all system memory when initializing TDX module as TDX memory Dave Hansen
2023-06-07 23:22 ` Huang, Kai
2023-06-08 22:40 ` kirill.shutemov
[not found] ` <f9148e67e968d7aed4707b67ea9b1aa761401255.1685887183.git.kai.huang@intel.com>
2023-06-07 15:54 ` [PATCH v11 10/20] x86/virt/tdx: Add placeholder to construct TDMRs to cover all TDX memory regions Dave Hansen
2023-06-07 15:57 ` Dave Hansen
2023-06-08 10:18 ` Huang, Kai
2023-06-08 22:52 ` kirill.shutemov
2023-06-12 2:21 ` Huang, Kai
2023-06-12 3:01 ` Dave Hansen
[not found] ` <409448809f7c78191aa27d6d2970ba1384c2d464.1685887183.git.kai.huang@intel.com>
2023-06-08 23:53 ` [PATCH v11 13/20] x86/virt/tdx: Designate reserved areas for all TDMRs kirill.shutemov
[not found] ` <4e6cd933edd2501147366df7a17e1087560a4320.1685887183.git.kai.huang@intel.com>
2023-06-08 23:53 ` [PATCH v11 14/20] x86/virt/tdx: Configure TDX module with the TDMRs and global KeyID kirill.shutemov
[not found] ` <30358db4eff961c69783bbd4d9f3e50932a9a759.1685887183.git.kai.huang@intel.com>
2023-06-08 23:53 ` [PATCH v11 15/20] x86/virt/tdx: Configure global KeyID on all packages kirill.shutemov
2023-06-15 8:12 ` Nikolay Borisov
2023-06-15 22:24 ` Huang, Kai
2023-06-19 14:56 ` kirill.shutemov
2023-06-19 23:38 ` Huang, Kai
[not found] ` <7bd7d0c6196deb58b54d6e629603775844b1307d.1685887183.git.kai.huang@intel.com>
2023-06-09 10:03 ` [PATCH v11 16/20] x86/virt/tdx: Initialize all TDMRs kirill.shutemov
[not found] ` <17bcbe3e154415ee7a4c77489809a3db0c5ddf3f.1685887183.git.kai.huang@intel.com>
2023-06-09 10:14 ` [PATCH v11 17/20] x86/kexec: Flush cache of TDX private memory kirill.shutemov
[not found] ` <116cafb15625ac0bcda7b47143921d0c42061b69.1685887183.git.kai.huang@intel.com>
2023-06-09 13:17 ` [PATCH v11 19/20] x86/mce: Improve error log of kernel space TDX #MC due to erratum kirill.shutemov
2023-06-12 3:08 ` Huang, Kai
2023-06-12 7:59 ` kirill.shutemov
2023-06-12 13:51 ` Dave Hansen
2023-06-12 23:31 ` Huang, Kai
[not found] ` <5aa7506d4fedbf625e3fe8ceeb88af3be1ce97ea.1685887183.git.kai.huang@intel.com>
2023-06-09 13:23 ` [PATCH v11 18/20] x86: Handle TDX erratum to reset TDX private memory during kexec() and reboot kirill.shutemov
2023-06-12 3:06 ` Huang, Kai
2023-06-12 7:58 ` kirill.shutemov
2023-06-12 10:27 ` Huang, Kai
2023-06-12 11:48 ` kirill.shutemov
2023-06-12 13:18 ` David Laight
2023-06-12 13:47 ` Dave Hansen
2023-06-13 0:51 ` Huang, Kai
2023-06-13 11:05 ` kirill.shutemov
2023-06-14 0:15 ` Huang, Kai
2023-06-13 14:25 ` Dave Hansen
2023-06-13 23:18 ` Huang, Kai
2023-06-14 0:24 ` Dave Hansen
2023-06-14 0:38 ` Huang, Kai
2023-06-14 0:42 ` Huang, Kai
2023-06-19 11:43 ` Huang, Kai
2023-06-19 14:31 ` Dave Hansen
2023-06-19 14:46 ` kirill.shutemov
2023-06-19 23:35 ` Huang, Kai
2023-06-19 23:41 ` Dave Hansen
2023-06-20 0:56 ` Huang, Kai
2023-06-20 1:06 ` Dave Hansen
2023-06-20 7:58 ` Peter Zijlstra
2023-06-25 15:30 ` Huang, Kai
2023-06-25 23:26 ` Huang, Kai
2023-06-20 7:48 ` Peter Zijlstra
2023-06-20 8:11 ` Peter Zijlstra
2023-06-20 10:42 ` Huang, Kai
2023-06-20 10:56 ` Peter Zijlstra
2023-06-14 9:33 ` Huang, Kai
2023-06-14 10:02 ` kirill.shutemov
2023-06-14 10:58 ` Huang, Kai
2023-06-14 11:08 ` kirill.shutemov
2023-06-14 11:17 ` Huang, Kai
[not found] ` <4e108968c3294189ad150f62df1f146168036342.1685887183.git.kai.huang@intel.com>
2023-06-08 23:24 ` [PATCH v11 12/20] x86/virt/tdx: Allocate and set up PAMTs for TDMRs kirill.shutemov
2023-06-08 23:43 ` Dave Hansen
2023-06-12 2:52 ` Huang, Kai
2023-06-25 15:38 ` Huang, Kai
2023-06-15 7:48 ` Nikolay Borisov
[not found] ` <34853e0f8f38ec2fda66b0ba480d4df63b8aab43.1685887183.git.kai.huang@intel.com>
2023-06-08 23:56 ` [PATCH v11 20/20] Documentation/x86: Add documentation for TDX host support Dave Hansen
2023-06-12 3:41 ` Huang, Kai
2023-06-16 9:02 ` Nikolay Borisov
2023-06-16 16:26 ` Dave Hansen